[Fleet] Avoid setting upgraded_at to null during upgrades #139704

Closed
1 task
joshdover opened this issue Aug 30, 2022 · 21 comments · Fixed by #140078
Labels: Project:FleetScaling · Team:Fleet (Team label for Observability Data Collection Fleet team) · v8.5.0

Comments

@joshdover (Contributor) commented Aug 30, 2022

We've been encountering this Lucene bug (apache/lucene#11393) during upgrade tests at scale. It causes queries against the .fleet-agents index to fail with an error message like:


search_phase_execution_exception: [null_pointer_exception] Reason: Cannot invoke "org.apache.lucene.index.PointValues.getDocCount()" because the return value of "org.apache.lucene.index.LeafReader.getPointValues(String)" is null

After discussion with @javanna, it seems this bug happens when we're "unlucky" enough to set a field to null for all documents in the same Lucene segment. We likely trigger exactly this during bulk agent upgrades here in Kibana, when we set the upgraded_at field to null:

await bulkUpdateAgents(
  esClient,
  agentsToUpdate.map((agent) => ({
    agentId: agent.id,
    data: {
      upgraded_at: null,
      upgrade_started_at: now,
    },
  }))
);

While we wait for apache/lucene#11393 to be fixed and picked up by upstream Elasticsearch, we should consider a workaround on our end to avoid setting this field to null. Here's everywhere I can find where we depend on the value of this field:

We also potentially have the same issue when the upgrade completes and Fleet Server sets the upgrade_started_at field to null. This may not be as likely to trigger the same scenario, though, since these writes happen more asynchronously as agents finish upgrading, rather than all at once when the upgrade is scheduled. https://github.com/elastic/fleet-server/blob/4e6dcce693e141a0a10875dc8e76cbedcba7153a/internal/pkg/api/handleAck.go#L436

Some potential options:

  1. Don't set these fields to null at all, and replace the conditions above with a check that upgrade_started_at > upgraded_at
    • Pros: fewer writes
    • Cons: the agent status query will need to use a script instead of a simple range query (a range query can't reference another field's value). This may be slower.
  2. Similar to the above, but use a hardcoded date to represent null, such as the lowest possible date value (Jan 1 1970, etc.), and update the queries to look for this value instead of null (see the sketch after this list).
    • Pros: same number of writes as today, likely the same query performance as today
    • Cons: a little hacky, and could confuse support engineers or developers who don't know why it's done this way. We also need to be careful about the upgrade path from 8.4 -> 8.5, where some documents will still have null set.
  3. Replace this state tracking with a different schema
    • Cons: more work, just to avoid a bug that should ultimately be fixed in Lucene/Elasticsearch
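
To make option 2 concrete, here's a minimal sketch of what the write path could look like. The EPOCH_SENTINEL constant and its value are illustrative assumptions rather than existing Fleet code; esClient, agentsToUpdate, and now are reused from the snippet above.

// Hypothetical sentinel meaning "no completed upgrade yet", written instead of null.
const EPOCH_SENTINEL = '1970-01-01T00:00:00.000Z';

await bulkUpdateAgents(
  esClient,
  agentsToUpdate.map((agent) => ({
    agentId: agent.id,
    data: {
      upgraded_at: EPOCH_SENTINEL, // previously: null
      upgrade_started_at: now,
    },
  }))
);

The "currently upgrading" conditions can then stay simple range or term filters on the sentinel value instead of exists/null checks, which is the pattern that triggers the Lucene bug.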

Right now I lean towards option 2, but would like to hear from the group.

cc @juliaElastic @AndersonQ @pjbertels @ablnk

Workaround for production clusters

It's possible that a force merge on the .fleet-agents index could resolve this for customers hitting the problem in production today; however, we have not tested the impact of this (see the untested sketch below).
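
For reference, issuing the force merge with the Elasticsearch JS client would look roughly like the sketch below. This is untested against this bug and assumes an esClient instance with sufficient privileges; it is not a validated fix.

// Untested: force merge .fleet-agents down to one segment, which rewrites segments
// and may clear the per-segment state that triggers the null_pointer_exception.
await esClient.indices.forcemerge({
  index: '.fleet-agents',
  max_num_segments: 1,
});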

@ablnk is this something we could test on a cluster where you can reproduce this?

  • Add a fix for the edge case where clusters might run the new Kibana version with an old Fleet Server version. To avoid leaving the agent stuck in the upgrade_status: started state, Kibana could check whether upgraded_at > upgrade_started_at and update upgrade_status: completed as a fallback (see the sketch below). We could do this as a scheduled task or somewhere else (maybe the /agent_status API that is periodically called from the UI).
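
A rough sketch of that fallback check; the function and field names here are assumptions for illustration, not existing Fleet code:

// Illustrative fallback: if an agent is still marked as `started` but its last
// completed upgrade is newer than the recorded start, treat the upgrade as completed.
interface AgentUpgradeState {
  upgraded_at?: string | null;
  upgrade_started_at?: string | null;
  upgrade_status?: 'started' | 'completed';
}

function resolveStaleUpgradeStatus(agent: AgentUpgradeState): AgentUpgradeState {
  if (
    agent.upgrade_status === 'started' &&
    agent.upgraded_at &&
    agent.upgrade_started_at &&
    new Date(agent.upgraded_at) > new Date(agent.upgrade_started_at)
  ) {
    return { ...agent, upgrade_status: 'completed' };
  }
  return agent;
}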
joshdover added the Team:Fleet, Project:FleetScaling, and v8.5.0 labels on Aug 30, 2022
@elasticmachine (Contributor):

Pinging @elastic/fleet (Team:Fleet)

@juliaElastic (Contributor):

@joshdover good catch!
I agree that option 2 sounds like a reasonable workaround until the Lucene error is fixed.
For option 3, can't we do something simple, like introduce a new field to signal that an upgrade has started?
Something like:

data: {
  upgraded_at: <previous upgraded_at date>,
  upgrade_started_at: now,
  upgrade_status: started / completed
},
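
As a rough illustration of that shape, a hypothetical TypeScript interface for the proposed fields could look like this (the names mirror the suggestion above; none of this is shipped Fleet code):

// Hypothetical document shape for the upgrade-related agent fields.
interface AgentUpgradeFields {
  upgraded_at: string | null;        // timestamp of the last completed upgrade
  upgrade_started_at: string | null; // timestamp of the most recently started upgrade
  upgrade_status?: 'started' | 'completed';
}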

@joshdover (Contributor, Author):

In light of the work being planned for elastic/elastic-agent#778, adding an upgrade_status field would actually make sense. We've been discussing how we should signal that an upgrade is retrying, and I think having a dedicated field that is separate from last_checkin_status is probably better. @michel-laterman does that make sense to you?

@jlind23 (Contributor) commented Aug 30, 2022

@joshdover I agree that option 2 seems the least time-consuming. We should just make sure that this new default value is explicitly documented and explained somewhere.

Happy to hear what @michel-laterman has to say about this new field, which makes total sense.

@joshdover (Contributor, Author):

It'd be good to do a quick test to see whether avoiding setting upgraded_at to null actually works around the problem. I would hate to do this work if it doesn't help avoid this bug.

@juliaElastic (Contributor):

I raised the draft PRs; the change seems to work fine, and I can't reproduce the error.

@michel-laterman:

Maybe I'm missing the historical information for this; but why does the agent/fleet-server not have an explicit UPGRADING status?

@juliaElastic (Contributor) commented Sep 7, 2022

@michel-laterman I'm not sure why, but we don't have an explicit agent status field on agent documents; status is a calculated field in the Fleet UI based on several other fields (active, upgraded_at, etc.).

EDIT: I was also wondering if we could have a generic status field that would be easier to query and monitor, instead of having those complex queries in the Fleet UI (see the sketch below).
Though it would be tricky to update to an "offline" status, because currently offline means 5 minutes have passed since the last check-in, so we would have to poll agents and flip them to offline once that interval has passed.
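
For context, here is a simplified, illustrative sketch of how such a derived status could be computed from the underlying fields; the threshold and field names are assumptions based on this discussion, not the actual Fleet implementation:

// Simplified sketch: derive an agent status from document fields (illustrative only).
const OFFLINE_THRESHOLD_MS = 5 * 60 * 1000; // "offline" = no check-in for 5 minutes

interface AgentDoc {
  active: boolean;
  last_checkin?: string;
  upgrade_started_at?: string | null;
  upgraded_at?: string | null;
}

function deriveStatus(agent: AgentDoc, now: number = Date.now()): string {
  if (!agent.active) return 'unenrolled';
  if (agent.upgrade_started_at && !agent.upgraded_at) return 'updating';
  if (!agent.last_checkin) return 'inactive';
  if (now - Date.parse(agent.last_checkin) > OFFLINE_THRESHOLD_MS) return 'offline';
  return 'online';
}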

@joshdover (Contributor, Author):

Maybe I'm missing the historical information for this; but why does the agent/fleet-server not have an explicit UPGRADING status?

Not sure either.

EDIT: I was also wondering if we could have a generic status field that would be easier to query and monitor instead of having those complex queries in Fleet UI.

This could be an option as well. I do want to make sure we can still report the ingest/integration status separately from the agent, but I think we have enough in the new input status fields for this to be able to calculate an overall status for ingest. If it's simpler to just add a new enum to the existing last_checkin_status that may be fine.

@michel-laterman I do think we should make sure that the retrying state is something specific to upgrading, since there could be different types of retrying that we want to display differently in the UI. This could be as simple as UPGRADE_RETRYING.

@juliaElastic (Contributor) commented Sep 7, 2022

I think it is a bigger refactor if we want to use last_checkin_status for other statuses like upgrading, upgrade_completed, or upgrade_retrying. Currently I am not sure what the possible status transitions are; I see it is updated in Fleet Server during check-in. For example, I wouldn't want to set the status to upgrade_completed if it might be set to degraded by an agent at the same time.

Should we go ahead with the new upgrade_status field for this fix? So we can have a quicker solution and come back to unifying the status later if needed.

@jlind23 (Contributor) commented Sep 8, 2022

Should we go ahead with the new upgrade_status field for this fix? So we can have a quicker solution and come back to unifying the status later if needed.

I think that would be the best option.

@juliaElastic (Contributor):

Added an open task to the issue description to come back to the workaround discussed here after FF.

@juliaElastic (Contributor):

Unfortunately, I encountered the Lucene bug locally again.
I narrowed it down to the upgrade_started_at exists filter:

export function buildKueryForUpgradingAgents(path: string = '') {
  return `(${path}upgrade_started_at:*) and not (${path}upgrade_status:completed)`;
}
There is still logic in the cancel action that sets the upgrade_started_at field to null.

I think it would be more robust to have an enriched agent status field and query it directly in the agent_status API, instead of maintaining this logic to make sure we don't reintroduce an exists filter (see the sketch below).
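
A hedged sketch of what querying such an enriched field might look like; the status field name and the updating value are assumptions, since Fleet does not persist a field like this today:

// Hypothetical kuery builder against a persisted status field (illustrative only).
// A plain term filter avoids the `:*` exists filter that triggers the Lucene bug.
export function buildKueryForUpgradingAgentsByStatus(path: string = '') {
  return `${path}status:updating`;
}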

juliaElastic reopened this on Sep 14, 2022
@joshdover (Contributor, Author):

I think it would be more robust to have an enriched agent status field and query that directly in agent_status API, instead of maintaining this logic to make sure we don't introduce exists filter again.

This makes sense to me. I also think we probably need to take a step back and reconsider how agent status should behave to achieve the user outcomes we want to support. I think we need to strike a balance between giving the Fleet admin insight into what's happening and avoiding communicating transient and unactionable status levels.

I also think this needs to be paired with end-to-end testing with simulated failures and network conditions. There are many moving parts and nuances in how agent status is calculated on the agent, transmitted to Fleet Server, persisted in ES, and then finally read and determined in the UI.

WDYT? Would this be a useful exercise?

@juliaElastic (Contributor):

@joshdover I would even start with documenting the possible status transitions and the fields associated with them, if we don't have such documentation already.
Having end-to-end tests would definitely help; do you mean automated ones? I am not really sure what the best tool for this is, but maybe the e2e-testing repo could be used to simulate different kinds of errors.

@joshdover (Contributor, Author):

Having end-to-end tests would definitely help; do you mean automated ones? I am not really sure what the best tool for this is, but maybe the e2e-testing repo could be used to simulate different kinds of errors.

Yes, I think e2e-testing would probably be appropriate for this, in combination with some layer to emulate network issues.

@juliaElastic (Contributor) commented Sep 21, 2022

Workaround that worked for a customer:
First, query which fields are null:

GET .fleet-agents/_search?filter_path=hits.hits._source.upgrade*&size=10000

Update the null fields to a string value:

curl -sk -XPOST --user elastic:changeme -H 'content-type:application/json' \
    https://ES_HOST:PORT/_security/role/fleet_superuser -d '
        {
            "indices": [
                {
                    "names": [".fleet*"],
                    "privileges": ["all"],
                    "allow_restricted_indices": true
                }
            ]
        }'

curl -sk -XPOST --user elastic:changeme -H 'content-type:application/json' \
    https://ES_HOST:PORT/_security/user/fleet_superuser -d '
        {
            "password": "password",
            "roles": ["superuser", "fleet_superuser"]
        }'

curl -sk -XPOST --user fleet_superuser:password -H 'content-type:application/json' \
    -H 'x-elastic-product-origin:fleet' \
    https://ES_HOST:PORT/.fleet-agents/_update_by_query -d '
        {
            "script": {
                "source": "ctx._source.upgrade_started_at = ctx._source.upgraded_at",
                "lang": "painless"
            },
            "query": {
                "bool": {
                    "must_not": [
                        { "exists": { "field": "upgrade_started_at" } }
                    ]
                }
            }
        }'

@joshdover (Contributor, Author) commented Sep 21, 2022

I have an update to share on apache/lucene#11393: there are plans for a short-term fix, targeting the Lucene version that Elasticsearch 8.5 will ship with. The team is evaluating whether a backport to 8.4.x is possible as well.

@juliaElastic (Contributor) commented Sep 21, 2022

@joshdover I am wondering if we should revert the fix added for the upgraded_at field, because it still has an outstanding issue and doesn't fully solve the problem (the upgrade_started_at field is still set to null on abort).

@joshdover (Contributor, Author) commented Sep 27, 2022

@juliaElastic I'm +1 on reverting that (and the related ES change).

I'm also going to close this issue now that the Lucene issue has been fixed in apache/lucene#11794 and ES has been bumped to a Lucene snapshot that includes it: elastic/elasticsearch#90339

@juliaElastic (Contributor):

@joshdover I wouldn't revert the ES change, because the retry upgrade functionality uses the same field: #140225
