
Elastic Agent can get stuck in Updating state on Fleet upgrade #828

Closed
gbanasiak opened this issue Oct 28, 2021 · 6 comments
Labels: bug (Something isn't working), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team), v8.4.0

Comments

@gbanasiak

Kibana version: 7.15.1

Describe the bug:

Elastic Agents can operate in unstable environments with network/DNS issues. When an Elastic Agent upgrade is initiated through Fleet, the agent can fail to ack the upgrade success to Fleet Server, leaving it in a permanent Updating state as reported in Kibana. This is confusing for the end user because it suggests an upgrade is still in progress, which is not the case. It is also not obvious how to recover, because the upgrade action is only available when an agent does not report the latest version.

Steps to reproduce:

  • install a vanilla 7.15.1 deployment in Elastic Cloud (ES, Kibana, APM/Fleet Server)
  • enroll a 7.14.2 agent (I used the Linux version)
  • modify /etc/hosts on the agent host so the Fleet Server hostname resolves to an unreachable IP address:
198.51.100.1    <fleet-cluster-ID>.fleet.<region>.<cloud-provider>.elastic-cloud.com
  • initiate the upgrade and watch the state change from Healthy to Updating
  • wait for the upgrade to complete
  • comment out or remove the /etc/hosts entry to restore correct IP address resolution
  • the agent seems to run fine with all Beats upgraded, but the Updating state never changes back to Healthy
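The /etc/hosts edits in steps 3 and 6 can be sketched as below. This is a minimal illustration run against a temporary copy so the real /etc/hosts is never touched; the Fleet Server hostname is a placeholder following the pattern shown above.

```shell
# Work on a temporary file instead of the real /etc/hosts.
HOSTS_FILE=$(mktemp)
# Placeholder hostname; substitute your actual Fleet Server hostname.
FLEET_HOST="my-cluster-id.fleet.us-east-1.aws.elastic-cloud.com"

# Step 3: resolve the Fleet Server hostname to an unreachable (TEST-NET-1) IP.
echo "198.51.100.1    ${FLEET_HOST}" >> "${HOSTS_FILE}"
grep "${FLEET_HOST}" "${HOSTS_FILE}"

# Step 6: delete the entry again to restore normal resolution.
sed -i "/${FLEET_HOST}/d" "${HOSTS_FILE}"
```

198.51.100.1 sits in the TEST-NET-1 range reserved for documentation, so connections to it reliably fail rather than reaching a real host.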

The following error can be seen in the agent logs:

[elastic_agent][warn] failed to ack update acknowledge action 'aec922d1-6591-4f69-90a1-3a6b7fa890d9' for elastic-agent 'c1fa118a-9676-446f-a025-154d2c42069b' failed: fail to ack to fleet: Post "https://<fleet-cluster-ID>.fleet.<region>.<cloud-provider>.elastic-cloud.com:443/api/fleet/agents/c1fa118a-9676-446f-a025-154d2c42069b/acks?": dial tcp 198.51.100.1:443: connect: connection timed out

Expected behavior:

The agent should ultimately be reported as Healthy if the upgrade was successful, without any manual intervention (see workarounds).

Screenshots (if relevant):

[screenshot: Kibana Fleet UI, 2021-10-25 16:36]

Workarounds:

If the agent is restarted manually with elastic-agent restart on the host within 15 (?) minutes of the upgrade (and there are no further network/DNS issues), the successful upgrade will be acked and the state shown in Kibana will flip from Updating to Healthy.

Alternatively, an agent can be forced to upgrade again with the following API call:

curl --request POST \
  --url https://<KIBANA_HOST>/api/fleet/agents/<AGENT_ID>/upgrade \
  --user "<SUPERUSER_NAME>:<SUPERUSER_PASSWORD>" \
  --header 'Content-Type: application/json' \
  --header 'kbn-xsrf: as' \
  --data '{
    "version": "7.15.1",
    "force": true
  }'

Other comments:

Elastic Agent communicates with 3 entities during the upgrade: the Fleet Server endpoint, the Elasticsearch endpoint, and artifacts.elastic.co (to download binaries). The scenario above exercises a failure of the Fleet Server endpoint only; other stuck states may occur if a different combination of network/DNS failures happens.

@elasticmachine
Collaborator

Pinging @elastic/fleet (Team:Fleet)

@jen-huang jen-huang transferred this issue from elastic/kibana Nov 2, 2021
@jen-huang jen-huang added the Team:Elastic-Agent-Control-Plane label Nov 2, 2021
@jen-huang

Transferring to Fleet Server. Maybe upgraded_at in the agent info is not properly set after the network is restored?

@jlind23 jlind23 added 8.3-candidate bug Something isn't working labels Jan 18, 2022
@jlind23 jlind23 added v8.4.0 and removed v8.3.0 labels Jun 1, 2022
@michel-laterman
Contributor

I was unable to recreate this on Ubuntu 20.04 going from v8.2.0 to v8.2.3; after commenting out my alteration to /etc/hosts, the agent went healthy in the UI.

@jlind23
Contributor

jlind23 commented Jun 29, 2022

Closing for now following @michel-laterman's comment.

@jlind23 jlind23 closed this as completed Jun 29, 2022
@jdixon-86

@jlind23 I'm going from 8.3.1 to 8.4.1 and still experiencing this. I'm getting around it with the force command, but I have 600 agents stuck in Updating at the moment. Restarting the agent doesn't fix it; only the force upgrade does.
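With hundreds of stuck agents, the force-upgrade workaround can be scripted as a loop over agent IDs. This is a hypothetical sketch, not from the issue: the Kibana host, target version, and agents.txt ID list are placeholders, and the loop only echoes each curl command (a dry run) rather than executing it.

```shell
# Placeholders; substitute your real Kibana host and target version.
KIBANA_HOST="kibana.example.com"
TARGET_VERSION="8.4.1"
# Example file of stuck-agent IDs, one per line.
printf 'agent-id-1\nagent-id-2\n' > agents.txt

# Build one force-upgrade curl command per agent (dry run: echo, don't run).
cmds=$(while IFS= read -r agent_id; do
  echo "curl --request POST" \
    "--url https://${KIBANA_HOST}/api/fleet/agents/${agent_id}/upgrade" \
    "--user \$SUPERUSER_NAME:\$SUPERUSER_PASSWORD" \
    "--header 'Content-Type: application/json'" \
    "--header 'kbn-xsrf: as'" \
    "--data '{\"version\": \"${TARGET_VERSION}\", \"force\": true}'"
done < agents.txt)
printf '%s\n' "$cmds"
rm -f agents.txt
```

Once the printed commands look right, drop the echo (and set the credential variables) to actually issue the requests; a short sleep between calls keeps the load on Kibana gentle.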

@jlind23
Contributor

jlind23 commented Sep 19, 2022

@KnowMoreIT we do have this issue that will fix the problem

6 participants