[Fleet] Avoid setting upgraded_at to null during upgrades #139704
Comments
Pinging @elastic/fleet (Team:Fleet)
@joshdover good catch!
In light of the work being planned for elastic/elastic-agent#778, adding a …
@joshdover I agree that option 2 seems the least time consuming. We should just make sure that this new default value is explicitly documented and explained somewhere. Happy to hear what @michel-laterman has to say about this new field, which makes total sense.
It'd be good to do a quick test and see if avoiding setting …
I raised the draft PRs; the change seems to work fine, and I can't reproduce the error.
Maybe I'm missing the historical information for this, but why does the agent/fleet-server not have an explicit …
@michel-laterman I'm not sure why, but we don't have an explicit agent status field on agent documents; status is a calculated field in the Fleet UI based on several other fields (`active`, `upgraded_at`, etc.).
EDIT: I was also wondering if we could have a generic status field that would be easier to query and monitor, instead of having those complex queries in the Fleet UI.
Not sure either.
This could be an option as well. I do want to make sure we can still report the ingest/integration status separately from the agent, but I think we have enough in the new input status fields to be able to calculate an overall status for ingest. If it's simpler to just add a new enum to the existing …
@michel-laterman I do think we should make sure that the retrying state is something specific to upgrading, since there could be different types of retrying that we want to display differently in the UI. This could be as simple as …
I think it is a bigger refactor if we want to use the … Should we go ahead with the new …?
I think that would be the best option.
Added an open task to the issue description to come back to the workaround discussed here after FF.
Unfortunately I encountered the Lucene bug locally again.
kibana/x-pack/plugins/fleet/common/services/agent_status.ts, Lines 80 to 81 in 59ea09f
There is still logic in the cancel action that sets the `upgrade_started_at` field to `null`:
I think it would be more robust to have an enriched agent status field and query that directly in …
This makes sense to me. I also think we probably need to take a step back and reconsider how agent status should behave to achieve the user outcomes we want to support. I think we need to strike a balance between giving the Fleet admin insight into what's happening and avoiding communicating transient and unactionable status levels. I also think this needs to be paired with end-to-end testing with simulated failures and network conditions. There are many moving parts and nuances in how agent status is calculated on the agent, transmitted to Fleet Server, persisted in ES, and then finally read and determined in the UI. WDYT? Would this be a useful exercise?
@joshdover I would even start with documenting the possible status transitions and the fields associated with them, if we don't have such documentation already.
Yes, I think e2e testing would probably be appropriate for this, in combination with some layer to emulate network issues.
Workaround that was working for a customer: update the `null` fields to a string value.
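As a sketch of that customer workaround, an `_update_by_query` like the following could backfill a placeholder timestamp into every document where `upgraded_at` is missing or explicitly `null` (the `exists` query matches neither case, so `must_not exists` catches both). The exact value written is an assumption here, and this is untested against a cluster hitting the bug — try it on staging first:

```
POST .fleet-agents/_update_by_query
{
  "query": {
    "bool": {
      "must_not": { "exists": { "field": "upgraded_at" } }
    }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.upgraded_at = params.value",
    "params": { "value": "2022-01-01T00:00:00Z" }
  }
}
```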
Have an update to share on apache/lucene#11393: there are plans for a short-term fix targeting the Lucene version that Elasticsearch 8.5 will ship with. The team is evaluating whether a backport to 8.4.x is possible as well.
@joshdover I am wondering if we should revert the fix added for …
@juliaElastic I'm +1 on reverting that (and the related ES change). I'm also going to close this issue now that the Lucene issue was fixed in apache/lucene#11794 and ES was bumped to a Lucene snapshot that includes it: elastic/elasticsearch#90339
@joshdover I wouldn't revert the ES change, because the retry upgrade functionality is using the same field: #140225
We've been encountering this Lucene bug (apache/lucene#11393) during upgrade tests at scale. This results in queries against the `.fleet-agents` index failing with an error message like: …

After discussion with @javanna it seems this bug happens when we're "unlucky" enough to set a field to `null` for all documents in the same Lucene segment. We likely trigger exactly this during bulk agent upgrades here in Kibana when we set the `upgraded_at` field to `null`:

kibana/x-pack/plugins/fleet/server/services/agents/upgrade.ts, Lines 205 to 214 in ee6318a
While we wait for apache/lucene#11393 to be fixed and picked up by upstream Elasticsearch, we should consider a workaround on our end to avoid setting this field to `null`. Here's everywhere I can find where we depend on the value of this field:

- kibana/x-pack/plugins/fleet/common/services/agent_status.ts, Line 44 in ee6318a
- kibana/x-pack/plugins/fleet/common/services/agent_status.ts, Line 81 in ee6318a
- kibana/x-pack/plugins/fleet/common/services/is_agent_upgradeable.ts, Line 28 in ee6318a
We also potentially have the same issue when the upgrade completes and Fleet Server sets the `upgrade_started_at` field to `null`. This may not be as likely to trigger the same scenario, though, since those writes happen more asynchronously as agents finish upgrading, rather than all at once when the upgrade is scheduled. https://github.com/elastic/fleet-server/blob/4e6dcce693e141a0a10875dc8e76cbedcba7153a/internal/pkg/api/handleAck.go#L436

Some potential options:

1. Avoid setting `null` at all and replace the conditions above to check where `upgrade_started_at > upgraded_at`
2. … `null` … set.

Right now I lean towards option 2, but would like to hear from the group.
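Option 1's "still upgrading" condition can be expressed without a `null` sentinel. As a sketch (query shape and the painless comparison are my assumptions, untested), a search for agents whose upgrade has started but not yet completed could look like:

```
GET .fleet-agents/_search
{
  "query": {
    "bool": {
      "filter": [
        { "exists": { "field": "upgrade_started_at" } },
        {
          "script": {
            "script": {
              "lang": "painless",
              "source": "doc['upgraded_at'].size() == 0 || doc['upgrade_started_at'].value.isAfter(doc['upgraded_at'].value)"
            }
          }
        }
      ]
    }
  }
}
```

The `size() == 0` guard treats a never-upgraded agent (no `upgraded_at` value) the same as one whose last upgrade predates the current one.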
cc @juliaElastic @AndersonQ @pjbertels @ablnk
Workaround for production clusters
It's possible that a force merge on the `.fleet-agents` index could resolve this problem for customers hitting it in production today; however, we have not tested the impact of this.
@ablnk is this something we could test on a cluster where you can reproduce this?
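For reference, the force-merge call in question would be along these lines (merging down to a single segment rewrites the affected segments, which can be I/O-heavy on large indices — part of the untested impact mentioned above):

```
POST .fleet-agents/_forcemerge?max_num_segments=1
```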
For agents in the `upgrade_status: started` state, Kibana could do a check that `upgraded_at > upgrade_started_at` and update `upgrade_status: completed` as a fallback. We could do this as a scheduled task or in some other place (maybe the `/agent_status` API that is periodically called from the UI).
API that is periodically called from UI)The text was updated successfully, but these errors were encountered: