[Fleet] Avoid setting upgraded_at to null during upgrades #139704

Closed
1 task
joshdover opened this issue Aug 30, 2022 · 21 comments · Fixed by #140078
Labels: Project:FleetScaling · Team:Fleet (Team label for Observability Data Collection Fleet team) · v8.5.0

Comments

@joshdover (Contributor) commented Aug 30, 2022

We've been encountering this Lucene bug (apache/lucene#11393) during upgrade tests at scale. It causes queries against the .fleet-agents index to fail with an error message like:


search_phase_execution_exception: [null_pointer_exception] Reason: Cannot invoke "org.apache.lucene.index.PointValues.getDocCount()" because the return value of "org.apache.lucene.index.LeafReader.getPointValues(String)" is null

After discussion with @javanna, it seems this bug happens when we're "unlucky" enough to set a field to null for all documents in the same Lucene segment. We likely trigger exactly this during bulk agent upgrades here in Kibana, when we set the upgraded_at field to null:

await bulkUpdateAgents(
  esClient,
  agentsToUpdate.map((agent) => ({
    agentId: agent.id,
    data: {
      upgraded_at: null,
      upgrade_started_at: now,
    },
  }))
);

While we wait for apache/lucene#11393 to be fixed and picked up by upstream Elasticsearch, we should consider a workaround on our end to avoid setting this field to null. Here's everywhere I can find where we depend on the value of this field:

We also potentially have the same issue when the upgrade completes and Fleet Server sets the upgrade_started_at field to null. This may not be as likely to trigger the same scenario, though, since these writes happen more asynchronously as agents finish upgrading, rather than all at once when the upgrade is scheduled. https://github.com/elastic/fleet-server/blob/4e6dcce693e141a0a10875dc8e76cbedcba7153a/internal/pkg/api/handleAck.go#L436

Some potential options:

  1. Don't set these fields to null at all, and replace the conditions above with a check that upgrade_started_at > upgraded_at
    • Pros: fewer writes
    • Cons: the agent status query will need to use a script instead of a simple range query (a range query can't reference another field's value). This may be slower.
  2. Similar to the above, but use a hardcoded date to represent null, such as the lowest possible date value (Jan 1 1970, etc.), and update the queries to look for this value instead of null (see the sketch after this list).
    • Pros: same number of writes as today, likely the same query performance as today
    • Cons: a little hacky, and could confuse support engineers or developers who don't know why it's done this way. We also need to be careful about the upgrade path from 8.4 -> 8.5, where some documents will still have null set.
  3. Replace this state tracking with a different schema
    • Cons: more work, just to avoid a bug that should ultimately be fixed in Lucene/Elasticsearch
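
To make option 2 concrete, here's a minimal sketch of what the write path could look like. The EPOCH_SENTINEL constant and its value are illustrative assumptions rather than existing Fleet code; esClient, agentsToUpdate, and now are reused from the snippet above.

// Hypothetical sentinel meaning "no completed upgrade yet", written instead of null.
const EPOCH_SENTINEL = '1970-01-01T00:00:00.000Z';

await bulkUpdateAgents(
  esClient,
  agentsToUpdate.map((agent) => ({
    agentId: agent.id,
    data: {
      upgraded_at: EPOCH_SENTINEL, // previously: null
      upgrade_started_at: now,
    },
  }))
);

The "currently upgrading" conditions can then stay simple range or term filters on the sentinel value instead of exists/null checks, which is the pattern that triggers the Lucene bug.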

Right now I lean towards option 2, but would like to hear from the group.

cc @juliaElastic @AndersonQ @pjbertels @ablnk

Workaround for production clusters

It's possible that a force merge on the .fleet-agents index could resolve this for customers hitting the problem in production today; however, we have not tested the impact of this (see the untested sketch below).
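
For reference, issuing the force merge with the Elasticsearch JS client would look roughly like the sketch below. This is untested against this bug and assumes an esClient instance with sufficient privileges; it is not a validated fix.

// Untested: force merge .fleet-agents down to one segment, which rewrites segments
// and may clear the per-segment state that triggers the null_pointer_exception.
await esClient.indices.forcemerge({
  index: '.fleet-agents',
  max_num_segments: 1,
});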

@ablnk is this something we could test on a cluster where you can reproduce this?

  • Add a fix for the edge case where clusters might run the new Kibana version with an old Fleet Server version. To avoid leaving the agent stuck in the upgrade_status: started state, Kibana could check whether upgraded_at > upgrade_started_at and update upgrade_status: completed as a fallback (see the sketch below). We could do this as a scheduled task or somewhere else (maybe the /agent_status API that is periodically called from the UI).
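
A rough sketch of that fallback check; the function and field names here are assumptions for illustration, not existing Fleet code:

// Illustrative fallback: if an agent is still marked as `started` but its last
// completed upgrade is newer than the recorded start, treat the upgrade as completed.
interface AgentUpgradeState {
  upgraded_at?: string | null;
  upgrade_started_at?: string | null;
  upgrade_status?: 'started' | 'completed';
}

function resolveStaleUpgradeStatus(agent: AgentUpgradeState): AgentUpgradeState {
  if (
    agent.upgrade_status === 'started' &&
    agent.upgraded_at &&
    agent.upgrade_started_at &&
    new Date(agent.upgraded_at) > new Date(agent.upgrade_started_at)
  ) {
    return { ...agent, upgrade_status: 'completed' };
  }
  return agent;
}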
joshdover added the Team:Fleet, Project:FleetScaling, and v8.5.0 labels on Aug 30, 2022
@elasticmachine (Contributor):

Pinging @elastic/fleet (Team:Fleet)

@juliaElastic (Contributor):

@joshdover good catch!
I agree that option 2 sounds like a reasonable workaround until the Lucene error is fixed.
For option 3, can't we do something simple, like introduce a new field to signal that an upgrade has started?
Something like:

data: {
  upgraded_at: <previous upgraded_at date>,
  upgrade_started_at: now,
  upgrade_status: started / completed
},
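
As a rough illustration of that shape, a hypothetical TypeScript interface for the proposed fields could look like this (the names mirror the suggestion above; none of this is shipped Fleet code):

// Hypothetical document shape for the upgrade-related agent fields.
interface AgentUpgradeFields {
  upgraded_at: string | null;        // timestamp of the last completed upgrade
  upgrade_started_at: string | null; // timestamp of the most recently started upgrade
  upgrade_status?: 'started' | 'completed';
}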

@joshdover (Contributor, Author):

In light of the work being planned for elastic/elastic-agent#778, adding an upgrade_status field would actually make sense. We've been discussing how we should signal that an upgrade is retrying, and I think having a dedicated field that is separate from last_checkin_status is probably better. @michel-laterman does that make sense to you?

@jlind23 (Contributor) commented Aug 30, 2022

@joshdover I agree that option 2 seems the least time-consuming. We should just make sure that this new default value is explicitly documented and explained somewhere.

Happy to hear what @michel-laterman has to say about this new field, which makes total sense.

@joshdover (Contributor, Author):

It'd be good to do a quick test to see whether avoiding setting upgraded_at to null actually works around the problem. I would hate to do this work if it doesn't help avoid this bug.

@juliaElastic (Contributor):

I raised the draft PRs; the change seems to work fine, and I can't reproduce the error.

@michel-laterman:

Maybe I'm missing the historical information for this; but why does the agent/fleet-server not have an explicit UPGRADING status?

@juliaElastic (Contributor) commented Sep 7, 2022

@michel-laterman I'm not sure why, but we don't have an explicit agent status field on agent documents; status is a calculated field in the Fleet UI based on several other fields (active, upgraded_at, etc.).

EDIT: I was also wondering if we could have a generic status field that would be easier to query and monitor, instead of having those complex queries in the Fleet UI (see the sketch below).
Though it would be tricky to update to an "offline" status, because currently offline means 5 minutes have passed since the last check-in, so we would have to poll agents and flip them to offline once that interval has passed.
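
For context, here is a simplified, illustrative sketch of how such a derived status could be computed from the underlying fields; the threshold and field names are assumptions based on this discussion, not the actual Fleet implementation:

// Simplified sketch: derive an agent status from document fields (illustrative only).
const OFFLINE_THRESHOLD_MS = 5 * 60 * 1000; // "offline" = no check-in for 5 minutes

interface AgentDoc {
  active: boolean;
  last_checkin?: string;
  upgrade_started_at?: string | null;
  upgraded_at?: string | null;
}

function deriveStatus(agent: AgentDoc, now: number = Date.now()): string {
  if (!agent.active) return 'unenrolled';
  if (agent.upgrade_started_at && !agent.upgraded_at) return 'updating';
  if (!agent.last_checkin) return 'inactive';
  if (now - Date.parse(agent.last_checkin) > OFFLINE_THRESHOLD_MS) return 'offline';
  return 'online';
}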

@joshdover (Contributor, Author):

Maybe I'm missing the historical information for this; but why does the agent/fleet-server not have an explicit UPGRADING status?

Not sure either.

EDIT: I was also wondering if we could have a generic status field that would be easier to query and monitor instead of having those complex queries in Fleet UI.

This could be an option as well. I do want to make sure we can still report the ingest/integration status separately from the agent, but I think we have enough in the new input status fields for this to be able to calculate an overall status for ingest. If it's simpler to just add a new enum to the existing last_checkin_status that may be fine.

@michel-laterman I do think we should make sure that the retrying state is something specific to upgrading, since there could be different types of retrying that we want to display differently in the UI. This could be as simple as UPGRADE_RETRYING.

@juliaElastic (Contributor) commented Sep 7, 2022

I think it is a bigger refactor if we want to use last_checkin_status for other statuses like upgrading, upgrade_completed, or upgrade_retrying. Currently I am not sure what the possible status transitions are; I see it is updated in Fleet Server during check-in. For example, I wouldn't want to set the status to upgrade_completed if it might be set to degraded by an agent at the same time.

Should we go ahead with the new upgrade_status field for this fix? So we can have a quicker solution and come back to unifying the status later if needed.

@jlind23 (Contributor) commented Sep 8, 2022

Should we go ahead with the new upgrade_status field for this fix? So we can have a quicker solution and come back to unifying the status later if needed.

I think that would be the best option.

@juliaElastic (Contributor):

Added an open task to the issue description to come back to the workaround discussed here after FF.

@juliaElastic (Contributor):

Unfortunately, I encountered the Lucene bug locally again.
I narrowed it down to the upgrade_started_at exists filter:

export function buildKueryForUpgradingAgents(path: string = '') {
  return `(${path}upgrade_started_at:*) and not (${path}upgrade_status:completed)`;
}
There is still logic in the cancel action that sets the upgrade_started_at field to null.

I think it would be more robust to have an enriched agent status field and query it directly in the agent_status API, instead of maintaining this logic to make sure we don't reintroduce an exists filter (see the sketch below).
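
A hedged sketch of what querying such an enriched field might look like; the status field name and the updating value are assumptions, since Fleet does not persist a field like this today:

// Hypothetical kuery builder against a persisted status field (illustrative only).
// A plain term filter avoids the `:*` exists filter that triggers the Lucene bug.
export function buildKueryForUpgradingAgentsByStatus(path: string = '') {
  return `${path}status:updating`;
}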

juliaElastic reopened this on Sep 14, 2022
@joshdover (Contributor, Author):

I think it would be more robust to have an enriched agent status field and query that directly in agent_status API, instead of maintaining this logic to make sure we don't introduce exists filter again.

This makes sense to me. I also think we probably need to take a step back and reconsider how agent status should behave to achieve the user outcomes we want to support. I think we need to strike a balance between giving the Fleet admin insight into what's happening and avoiding communicating transient and unactionable status levels.

I also think this needs to be paired with end-to-end testing with simulated failures and network conditions. There are many moving parts and nuances in how agent status is calculated on the agent, transmitted to Fleet Server, persisted in ES, and then finally read and determined in the UI.

WDYT? Would this be a useful exercise?

@juliaElastic (Contributor):

@joshdover I would even start with documenting the possible status transitions and the fields associated with them, if we don't have such documentation already.
Having end-to-end tests would definitely help; do you mean automated ones? I am not really sure what the best tool for this is, but maybe the e2e-testing repo could be used to simulate different kinds of errors.

@joshdover (Contributor, Author):

Having end-to-end tests would definitely help; do you mean automated ones? I am not really sure what the best tool for this is, but maybe the e2e-testing repo could be used to simulate different kinds of errors.

Yes, I think e2e-testing would probably be appropriate for this, in combination with some layer to emulate network issues.

@juliaElastic (Contributor) commented Sep 21, 2022

Workaround that worked for a customer:
First, query which fields are null:

GET .fleet-agents/_search?filter_path=hits.hits._source.upgrade*&size=10000

Update the null fields to a string value:

curl -sk -XPOST --user elastic:changeme -H 'content-type:application/json' \
    https://ES_HOST:PORT/_security/role/fleet_superuser -d '
        {
            "indices": [
                {
                    "names": [".fleet*"],
                    "privileges": ["all"],
                    "allow_restricted_indices": true
                }
            ]
        }'

curl -sk -XPOST --user elastic:changeme -H 'content-type:application/json' \
    https://ES_HOST:PORT/_security/user/fleet_superuser -d '
        {
            "password": "password",
            "roles": ["superuser", "fleet_superuser"]
        }'

curl -sk -XPOST --user fleet_superuser:password -H 'content-type:application/json' \
    -H 'x-elastic-product-origin:fleet' \
    https://ES_HOST:PORT/.fleet-agents/_update_by_query -d '
        {
            "script": {
                "source": "ctx._source.upgrade_started_at = ctx._source.upgraded_at",
                "lang": "painless"
            },
            "query": {
                "bool": {
                    "must_not": [
                        { "exists": { "field": "upgrade_started_at" } }
                    ]
                }
            }
        }'

@joshdover (Contributor, Author) commented Sep 21, 2022

I have an update to share on apache/lucene#11393: there are plans for a short-term fix, targeting the Lucene version that Elasticsearch 8.5 will ship with. The team is evaluating whether a backport to 8.4.x is possible as well.

@juliaElastic (Contributor) commented Sep 21, 2022

@joshdover I am wondering if we should revert the fix added for the upgraded_at field, because it still has an outstanding issue and doesn't fully solve the problem (the upgrade_started_at field is still set to null on abort).

@joshdover (Contributor, Author) commented Sep 27, 2022

@juliaElastic I'm +1 on reverting that (and the related ES change).

I'm also going to close this issue now that the Lucene issue has been fixed in apache/lucene#11794 and ES has been bumped to a Lucene snapshot that includes it: elastic/elasticsearch#90339

@juliaElastic (Contributor):

@joshdover I wouldn't revert the ES change, because the retry upgrade functionality uses the same field: #140225
