
Elastic Agent can get stuck in Updating state on Fleet upgrade #828

Closed
gbanasiak opened this issue Oct 28, 2021 · 6 comments
Labels: bug (Something isn't working), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team), v8.4.0

Comments

@gbanasiak

Kibana version: 7.15.1

Describe the bug:

Elastic Agents can operate in unstable environments with network/DNS issues. When an Elastic Agent upgrade is initiated through Fleet, the agent can fail to ack the upgrade success to Fleet Server, leaving it in a permanent Updating state as reported in Kibana. This is confusing for the end user because it suggests an upgrade is still in progress, which is not the case. It is also not obvious how to recover, because the upgrade action is only available when an agent does not report the latest version.

Steps to reproduce:

  • install a vanilla 7.15.1 deployment in Elastic Cloud (ES, Kibana, APM/Fleet Server)
  • enroll a 7.14.2 agent (I used the Linux version)
  • modify /etc/hosts on the agent host so the Fleet Server hostname resolves to an unreachable IP address:
198.51.100.1    <fleet-cluster-ID>.fleet.<region>.<cloud-provider>.elastic-cloud.com
  • initiate the upgrade and watch the state change from Healthy to Updating
  • wait for the upgrade to complete
  • comment out or remove the /etc/hosts entry to restore correct IP address resolution
  • the agent seems to run fine with all Beats upgraded, but the Updating state never changes back to Healthy
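The /etc/hosts edits in steps 3 and 6 can be sketched as below. This is a minimal illustration run against a temporary copy so the real /etc/hosts is never touched; the Fleet Server hostname is a placeholder following the pattern shown above.

```shell
# Work on a temporary file instead of the real /etc/hosts.
HOSTS_FILE=$(mktemp)
# Placeholder hostname; substitute your actual Fleet Server hostname.
FLEET_HOST="my-cluster-id.fleet.us-east-1.aws.elastic-cloud.com"

# Step 3: resolve the Fleet Server hostname to an unreachable (TEST-NET-1) IP.
echo "198.51.100.1    ${FLEET_HOST}" >> "${HOSTS_FILE}"
grep "${FLEET_HOST}" "${HOSTS_FILE}"

# Step 6: delete the entry again to restore normal resolution.
sed -i "/${FLEET_HOST}/d" "${HOSTS_FILE}"
```

198.51.100.1 sits in the TEST-NET-1 range reserved for documentation, so connections to it reliably fail rather than reaching a real host.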

The following error can be seen in the agent logs:

[elastic_agent][warn] failed to ack update acknowledge action 'aec922d1-6591-4f69-90a1-3a6b7fa890d9' for elastic-agent 'c1fa118a-9676-446f-a025-154d2c42069b' failed: fail to ack to fleet: Post "https://<fleet-cluster-ID>.fleet.<region>.<cloud-provider>.elastic-cloud.com:443/api/fleet/agents/c1fa118a-9676-446f-a025-154d2c42069b/acks?": dial tcp 198.51.100.1:443: connect: connection timed out

Expected behavior:

The agent should ultimately be reported as Healthy if the upgrade was successful, without any manual intervention (see workarounds).

Screenshots (if relevant):

[screenshot: Kibana Fleet UI, 2021-10-25 16:36]

Workarounds:

If the agent is restarted manually with elastic-agent restart on the host within 15 (?) minutes of the upgrade (and there are no further network/DNS issues), the successful upgrade will be acked and the state shown in Kibana will flip from Updating to Healthy.

Alternatively, an agent can be forced to upgrade again with the following API call:

curl --request POST \
  --url https://<KIBANA_HOST>/api/fleet/agents/<AGENT_ID>/upgrade \
  --user "<SUPERUSER_NAME>:<SUPERUSER_PASSWORD>" \
  --header 'Content-Type: application/json' \
  --header 'kbn-xsrf: as' \
  --data '{
    "version": "7.15.1",
    "force": true
  }'

Other comments:

Elastic Agent communicates with 3 entities during the upgrade: the Fleet Server endpoint, the Elasticsearch endpoint, and artifacts.elastic.co (to download binaries). The scenario above exercises a failure of the Fleet Server endpoint only; other stuck states may occur if a different combination of network/DNS failures happens.

@elasticmachine
Collaborator

Pinging @elastic/fleet (Team:Fleet)

@jen-huang jen-huang transferred this issue from elastic/kibana Nov 2, 2021
@jen-huang jen-huang added the Team:Elastic-Agent-Control-Plane label Nov 2, 2021
@jen-huang

Transferring to Fleet Server. Maybe upgraded_at in the agent info is not properly set after the network is restored?

@jlind23 jlind23 added 8.3-candidate bug Something isn't working labels Jan 18, 2022
@jlind23 jlind23 added v8.4.0 and removed v8.3.0 labels Jun 1, 2022
@michel-laterman
Contributor

I was unable to recreate this on Ubuntu 20.04 going from v8.2.0 to v8.2.3; after commenting out my alteration to /etc/hosts, the agent went healthy in the UI.

@jlind23
Contributor

jlind23 commented Jun 29, 2022

Closing for now following @michel-laterman's comment.

@jlind23 jlind23 closed this as completed Jun 29, 2022
@jdixon-86

@jlind23 I'm going from 8.3.1 to 8.4.1 and still experiencing this. I'm getting around it with the force command, but I have 600 agents stuck in Updating at the moment. Restarting the agent doesn't fix it; only the force upgrade does.
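With hundreds of stuck agents, the force-upgrade workaround can be scripted as a loop over agent IDs. This is a hypothetical sketch, not from the issue: the Kibana host, target version, and agents.txt ID list are placeholders, and the loop only echoes each curl command (a dry run) rather than executing it.

```shell
# Placeholders; substitute your real Kibana host and target version.
KIBANA_HOST="kibana.example.com"
TARGET_VERSION="8.4.1"
# Example file of stuck-agent IDs, one per line.
printf 'agent-id-1\nagent-id-2\n' > agents.txt

# Build one force-upgrade curl command per agent (dry run: echo, don't run).
cmds=$(while IFS= read -r agent_id; do
  echo "curl --request POST" \
    "--url https://${KIBANA_HOST}/api/fleet/agents/${agent_id}/upgrade" \
    "--user \$SUPERUSER_NAME:\$SUPERUSER_PASSWORD" \
    "--header 'Content-Type: application/json'" \
    "--header 'kbn-xsrf: as'" \
    "--data '{\"version\": \"${TARGET_VERSION}\", \"force\": true}'"
done < agents.txt)
printf '%s\n' "$cmds"
rm -f agents.txt
```

Once the printed commands look right, drop the echo (and set the credential variables) to actually issue the requests; a short sleep between calls keeps the load on Kibana gentle.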

@jlind23
Contributor

jlind23 commented Sep 19, 2022

@KnowMoreIT we do have this issue that will fix the problem

6 participants