[Agent Upgrade]: Linux agent fails on upgrade from 7.16.3>8.1.2 Snapshot and goes Unhealthy. #275

Closed
ghost opened this issue Mar 31, 2022 · 39 comments
Assignees
Labels
bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@ghost

ghost commented Mar 31, 2022

Describe the bug:
Agent upgrade fails and goes Unhealthy from 7.16.3>8.1.2 Snapshot.

Build Details:
VERSION: 8.1.2-SNAPSHOT
COMMIT: 7dce1a1c7cf6aba8782d0af02fc2d95edb5be999
BUILD: 50688
ARTIFACT LINK OF 7.16.3: https://www.elastic.co/downloads/past-releases/elastic-agent-7-16-3

Preconditions:

  1. Elastic 8.1.2-SNAPSHOT environment should be deployed.

Steps to Reproduce:

  1. Log in to the Kibana environment.
  2. Navigate to the Fleet tab.
  3. Upgrade lower-version agents from the Agents tab.
  4. Observe that agent upgrade fails and goes Unhealthy while upgrading from 7.16.3>8.1.2 Snapshot.

Actual Result:
Agent upgrade fails and goes Unhealthy from 7.16.3>8.1.2 Snapshot.

Expected Result:
Agent should be upgraded to latest version from 7.16.3>8.1.2 Snapshot.

What's Working:

Agent is successfully upgraded from 7.16.3>8.1.2-Snapshot without Endpoint Security Integration.
Screenshot (287)

Screenshot:

Screenshot (285)
Screenshot (284)

@ghost ghost added bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. Team:Fleet Label for the Fleet team labels Mar 31, 2022
@elasticmachine
Collaborator

Pinging @elastic/fleet (Team:Fleet)

@manishgupta-qasource

Secondary review for this ticket is Done

@ghost ghost changed the title [Agent Upgrade]: Agent upgrade fails and goes Unhealthy from 7.16.3>8.1.2 Snapshot. [Agent Upgrade]: Linux agent fails on upgrade from 7.16.3>8.1.2 Snapshot and goes Unhealthy. Mar 31, 2022
@joshdover joshdover transferred this issue from elastic/kibana Mar 31, 2022
@joshdover joshdover added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team and removed Team:Fleet Label for the Fleet team labels Mar 31, 2022
@joshdover
Contributor

joshdover commented Mar 31, 2022

I've seen this issue happen as well on my Linux hosts. Getting this on about half of my Linux hosts after upgrading from 8.1.1 to 8.1.2 today.

@joshdover
Contributor

Also getting this on my Mac, with a similar error message:

[elastic_agent][error] failed to dispatch actions, error: failed verification of agent binary: 2 errors occurred:
	* fetching asc file from '/Library/Elastic/Agent/data/elastic-agent-7f30bb/downloads/elastic-agent-8.1.2-darwin-x86_64.tar.gz.asc': open /Library/Elastic/Agent/data/elastic-agent-7f30bb/downloads/elastic-agent-8.1.2-darwin-x86_64.tar.gz.asc: no such file or directory
	* open /Library/Elastic/Agent/data/elastic-agent-7f30bb/downloads/elastic-agent-8.1.2-darwin-x86_64.tar.gz.sha512: no such file or directory

@ph ph self-assigned this Mar 31, 2022
@ph
Contributor

ph commented Mar 31, 2022

@joshdover can you do a ls -l /Library/Elastic/Agent/data/elastic-agent-7f30bb/downloads/ ?

@ph
Contributor

ph commented Mar 31, 2022

@joshdover Can you send me a diagnostic from one of the problematic hosts?

@joshdover
Contributor

@ph below is the downloads directory and I sent you the diagnostic over DM.

sudo ls -l /Library/Elastic/Agent/data/elastic-agent-7f30bb/downloads/
total 412480
-rw-r--r--  1 root  wheel  32333273 Mar 23 14:49 apm-server-8.1.1-darwin-x86_64.tar.gz
-rw-r--r--  1 root  wheel       488 Mar 23 14:49 apm-server-8.1.1-darwin-x86_64.tar.gz.asc
-rw-r--r--  1 root  wheel       168 Mar 23 14:49 apm-server-8.1.1-darwin-x86_64.tar.gz.sha512
-rw-r--r--  1 root  wheel  41477489 Mar 23 14:49 endpoint-security-8.1.1-darwin-x86_64.tar.gz
-rw-r--r--  1 root  wheel       488 Mar 23 14:49 endpoint-security-8.1.1-darwin-x86_64.tar.gz.asc
-rw-r--r--  1 root  wheel       175 Mar 23 14:49 endpoint-security-8.1.1-darwin-x86_64.tar.gz.sha512
-rw-r--r--  1 root  wheel  30800476 Mar 23 14:49 filebeat-8.1.1-darwin-x86_64.tar.gz
-rw-r--r--  1 root  wheel       488 Mar 23 14:49 filebeat-8.1.1-darwin-x86_64.tar.gz.asc
-rw-r--r--  1 root  wheel       166 Mar 23 14:49 filebeat-8.1.1-darwin-x86_64.tar.gz.sha512
-rw-r--r--  1 root  wheel   8186973 Mar 23 14:49 fleet-server-8.1.1-darwin-x86_64.tar.gz
-rw-r--r--  1 root  wheel       488 Mar 23 14:49 fleet-server-8.1.1-darwin-x86_64.tar.gz.asc
-rw-r--r--  1 root  wheel       170 Mar 23 14:49 fleet-server-8.1.1-darwin-x86_64.tar.gz.sha512
-rw-r--r--  1 root  wheel  24269481 Mar 23 14:49 heartbeat-8.1.1-darwin-x86_64.tar.gz
-rw-r--r--  1 root  wheel       488 Mar 23 14:49 heartbeat-8.1.1-darwin-x86_64.tar.gz.asc
-rw-r--r--  1 root  wheel       167 Mar 23 14:49 heartbeat-8.1.1-darwin-x86_64.tar.gz.sha512
-rw-r--r--  1 root  wheel  36101125 Mar 23 14:49 metricbeat-8.1.1-darwin-x86_64.tar.gz
-rw-r--r--  1 root  wheel       488 Mar 23 14:49 metricbeat-8.1.1-darwin-x86_64.tar.gz.asc
-rw-r--r--  1 root  wheel       168 Mar 23 14:49 metricbeat-8.1.1-darwin-x86_64.tar.gz.sha512
-rw-r--r--  1 root  wheel  37950282 Mar 23 14:49 osquerybeat-8.1.1-darwin-x86_64.tar.gz
-rw-r--r--  1 root  wheel       488 Mar 23 14:49 osquerybeat-8.1.1-darwin-x86_64.tar.gz.asc
-rw-r--r--  1 root  wheel       169 Mar 23 14:49 osquerybeat-8.1.1-darwin-x86_64.tar.gz.sha512
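For anyone triaging the same error, here is a minimal sketch (not part of the agent) that reports which of the files named in the error message are absent from a downloads directory. The helper name `missing_upgrade_files` is hypothetical; the `.asc` and `.sha512` suffixes simply mirror the two paths the error reports as missing.

```python
from pathlib import Path


def missing_upgrade_files(downloads_dir, artifact_name):
    """Return the expected upgrade files that are absent from downloads_dir.

    artifact_name is the tarball name taken from the error message, e.g.
    'elastic-agent-8.1.2-darwin-x86_64.tar.gz'. The '.asc' and '.sha512'
    companions mirror the two paths the error reports as missing.
    """
    base = Path(downloads_dir)
    expected = [artifact_name, artifact_name + ".asc", artifact_name + ".sha512"]
    # Keep only the names that do not exist as regular files.
    return [name for name in expected if not (base / name).is_file()]
```

Run against the listing above, all three names would be reported missing, since the directory only holds the 8.1.1 component artifacts.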

@pjbertels
Contributor

My experience:
When trying to upgrade a recent Elastic Agent snapshot from Fleet, I am getting stuck with the following errors. This worked on this cluster (ess-hnvgd, launched 2022-03-31 15:39:21, running 8.1.2) with a previous version of the agent, but I installed my VMs with https://staging.elastic.co/8.1.1-6f681121/downloads/beats/elastic-agent/elastic-agent-8.1.1-linux-x86_64.tar.gz and now upgrades fail. The upgrade starts, goes wrong, and then rolls back.

failed to dispatch actions, error: failed verification of agent binary: 2 errors occurred:
	* fetching asc file from '/opt/Elastic/Agent/data/elastic-agent-7f30bb/downloads/elastic-agent-8.1.2-linux-x86_64.tar.gz.asc': open /opt/Elastic/Agent/data/elastic-agent-7f30bb/downloads/elastic-agent-8.1.2-linux-x86_64.tar.gz.asc: no such file or directory
	* open /opt/Elastic/Agent/data/elastic-agent-7f30bb/downloads/elastic-agent-8.1.2-linux-x86_64.tar.gz.sha512: no such file or directory

@jlind23
Contributor

jlind23 commented Apr 4, 2022

@pjbertels do you also have the endpoint integration enabled?

@michel-laterman
Contributor

I have not been able to recreate this so far, I've tried an 8.1.2-SNAPSHOT cluster, with a policy that just has the system integration and a policy that has system+endpoint.

@redundancydisorder

I am seeing this same issue
[elastic_agent][error] failed to dispatch actions, error: failed verification of agent binary: 2 errors occurred:
	* fetching asc file from '/opt/Elastic/Agent/data/elastic-agent-7f30bb/downloads/elastic-agent-8.1.2-linux-x86_64.tar.gz.asc': open /opt/Elastic/Agent/data/elastic-agent-7f30bb/downloads/elastic-agent-8.1.2-linux-x86_64.tar.gz.asc: no such file or directory
	* open /opt/Elastic/Agent/data/elastic-agent-7f30bb/downloads/elastic-agent-8.1.2-linux-x86_64.tar.gz.sha512: no such file or directory

The agent goes unhealthy, then back to healthy, but is not upgraded.

A manual download of the agent on the client
curl -L -O https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.1.2-linux-x86_64.tar.gz
works without issue so connectivity to internet and elastic.co is not an issue.
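A manual download like this only confirms connectivity; the error message shows the agent also expects a `.sha512` companion file. As a rough sketch of a manual integrity check, the function below compares a tarball's SHA-512 digest with the first token of a checksum file. The helper name `sha512_matches` is hypothetical, and the `<hex digest>  <filename>` layout is an assumption (it is the layout `sha512sum` produces, and it is consistent with the 166 to 175 byte `.sha512` files in the listing earlier in this thread).

```python
import hashlib


def sha512_matches(artifact_path, checksum_path):
    """Compare a file's SHA-512 digest with the first token of a checksum file.

    Assumes the checksum file uses the '<hex digest>  <filename>' layout
    produced by sha512sum; that layout is an assumption, not confirmed here.
    """
    digest = hashlib.sha512()
    with open(artifact_path, "rb") as f:
        # Read in 1 MiB chunks so large tarballs are not loaded into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    with open(checksum_path, "r", encoding="utf-8") as f:
        tokens = f.read().split()
    if not tokens:
        return False  # an empty checksum file cannot match anything
    return digest.hexdigest() == tokens[0].lower()
```

This does not cover the `.asc` signature, which needs a PGP verification step that is out of scope for this sketch.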

@michel-laterman
Contributor

@redundancydisorder, what integrations/OSes are you running on?

@redundancydisorder

redundancydisorder commented Apr 4, 2022

All agents that fail are running Ubuntu 20.04 (most) or 18.04 (a few). The example from above that failed was just running the system integration on 20.04. A couple agents are running Custom Logs.

Cluster and Kibana were installed at 8.1.1 and just updated to 8.1.2. Agents can be added with 8.1.2 with no issues.

@samoz83

samoz83 commented Apr 5, 2022

We're getting the same as @redundancydisorder with agents running on Rocky 8.4/8.5, with a range of different integrations, but all have at least the system and auditd integrations.

@ph ph assigned michel-laterman and unassigned ph Apr 5, 2022
@WiegerElastic

I am having the same issue with my deployment. I spun up a new deployment last night on Elastic Cloud, on version 8.1.1. I added the agents (2 VPSes, running Ubuntu Server 20.04 LTS), upgraded the stack to 8.1.2, and then wanted to proceed with upgrading the machines to Agent 8.1.2. Same error on both devices.

12:59:03.830
elastic_agent
[elastic_agent][info] 2022-04-05T12:59:03+02:00 - message: Application: [a4d493a5-db03-4cda-af1a-9d76ed258667]: State changed to UPDATING: Update to version '8.1.2' started - type: 'STATE' - sub_type: 'UPDATING'
12:59:18.151
elastic_agent
[elastic_agent][error] 2022-04-05T12:59:18+02:00 - message: Application: [a4d493a5-db03-4cda-af1a-9d76ed258667]: State changed to FAILED: failed verification of agent binary: 2 errors occurred:
	* fetching asc file from '/opt/Elastic/Agent/data/elastic-agent-7f30bb/downloads/elastic-agent-8.1.2-linux-x86_64.tar.gz.asc': open /opt/Elastic/Agent/data/elastic-agent-7f30bb/downloads/elastic-agent-8.1.2-linux-x86_64.tar.gz.asc: no such file or directory
	* open /opt/Elastic/Agent/data/elastic-agent-7f30bb/downloads/elastic-agent-8.1.2-linux-x86_64.tar.gz.sha512: no such file or directory
- type: 'ERROR' - sub_type: 'FAILED'
12:59:18.152
elastic_agent
[elastic_agent][error] failed to dispatch actions, error: failed verification of agent binary: 2 errors occurred:
	* fetching asc file from '/opt/Elastic/Agent/data/elastic-agent-7f30bb/downloads/elastic-agent-8.1.2-linux-x86_64.tar.gz.asc': open /opt/Elastic/Agent/data/elastic-agent-7f30bb/downloads/elastic-agent-8.1.2-linux-x86_64.tar.gz.asc: no such file or directory
	* open /opt/Elastic/Agent/data/elastic-agent-7f30bb/downloads/elastic-agent-8.1.2-linux-x86_64.tar.gz.sha512: no such file or directory

@joshdover
Contributor

@ph Should this be listed as an 8.2 blocker (assuming the problem is still present)?

@ph
Contributor

ph commented Apr 5, 2022

@joshdover @jlind23 I wouldn't consider this a blocker, the upgrade is still beta (experimental?)

@redundancydisorder

redundancydisorder commented Apr 5, 2022

With debug logging on, no information is recorded between:
State changed to UPDATING: Update to version '8.1.2' started - type: 'STATE' - sub_type: 'UPDATING'
and
State changed to FAILED: failed verification of agent binary: 2 errors occurred:

At the same time the referenced directory:
/opt/Elastic/Agent/data/elastic-agent-7f30bb/downloads/
had zero changes to the file system on disk. It looks like the download, or the write to disk, is failing before the verification fails.

I repeated the update attempts while monitoring file system changes and can't find any attempt to write a download to disk.
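One simple way to reproduce this kind of observation is to snapshot the downloads directory before and after an upgrade attempt and compare the two. The sketch below is an illustration only; the helper names `snapshot` and `changed_files` are hypothetical, and a before/after comparison will miss any file that is created and deleted between the two snapshots.

```python
import os


def snapshot(directory):
    """Map each regular file name in directory to its (size, mtime_ns) pair."""
    result = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                st = entry.stat()
                result[entry.name] = (st.st_size, st.st_mtime_ns)
    return result


def changed_files(before, after):
    """Return the sorted names that were added, removed, or modified."""
    names = set(before) | set(after)
    return sorted(n for n in names if before.get(n) != after.get(n))
```

Taking `before = snapshot(path)` ahead of the upgrade and `after = snapshot(path)` once the agent reports FAILED, an empty `changed_files(before, after)` would match the "zero changes" described above.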

@michel-laterman
Contributor

Looks like the issue here was introduced as a side effect of elastic/beats#30281

It affects v8.1.1+.

@michel-laterman
Contributor

The changes from #255 will fix it, and can be backported to 8.1.x if another release is planned. The download succeeded when attempting to upgrade from 8.2.0-SNAPSHOT to 8.3.0-SNAPSHOT (however, the upgrade did not install correctly as the binary was not linked).

@ph
Contributor

ph commented Apr 6, 2022

@michel-laterman Can you backport the fix to the appropriate branches? @jlind23

@jlind23
Contributor

jlind23 commented Apr 11, 2022

@samratbhadra-qasource can you test it again with the latest snapshot from 8.1.X?

@ghost
Author

ghost commented Apr 12, 2022

Hi @jlind23,
We have revalidated this issue on two different VMs and observed that the issue is still reproducible. Please find the testing details below:

  • Linux agent fails on upgrade from 8.0.1>8.1.3-Snapshot and goes Unhealthy.

Screenshot:
Screenshot (348)

Screenshot (345)

Build Details:
VERSION: 8.1.3-SNAPSHOT
BUILD: 50718
COMMIT: a05409860677938d5bbfc6c6065c85230f54848b
ARTIFACT LINK OF 8.0.1: https://www.elastic.co/downloads/past-releases/elastic-agent-8-0-1

Thanks!

@jlind23
Contributor

jlind23 commented Apr 12, 2022

@samratbhadra-qasource Where is this commit SHA coming from: COMMIT: a05409860677938d5bbfc6c6065c85230f54848b?

Is it a stack one? I do not see it in the Beats repository.

@ghost
Author

ghost commented Apr 13, 2022

Hi @jlind23,
We have deployed the 8.1.3-SNAPSHOT Kibana cloud build from https://console.qa.cld.elstc.co/home.
This COMMIT is from the above deployed build.
For your reference, below is the screenshot that we use for capturing Kibana build details.

NOTE: The screenshot is from the latest 8.1.3-Snapshot build available on cloud-qa.
Screenshot (353)

@jlind23
Contributor

jlind23 commented Apr 13, 2022

Could you please use today's snapshot, the one with the "c44c8c" SHA? The previous one was a version built last week.

@joshdover
Contributor

Since the problem was present on older versions of Agent, isn't this always going to be reproducible when trying to upgrade from an affected version, even if the version that is being upgraded to is fixed? @michel-laterman maybe you could confirm based on the nature of the fix.

If what I suspect is true, then it points to the very painful impact of having bugs in our upgrade code. We probably need to have better e2e test coverage in place for this feature before we can call it GA.

@jlind23
Contributor

jlind23 commented Apr 13, 2022

If you are trying to upgrade from 8.1.1 or 8.1.2 to something else, it will indeed fail.
Otherwise it should work.

Here @samratbhadra-qasource tried to upgrade from 7.16.3 to 8.1.3 if I am not misreading.

For the e2e test coverage it is an ongoing topic with manu in the e2e testing repository.
elastic/e2e-testing#1643

@jlind23
Contributor

jlind23 commented Apr 13, 2022

@samratbhadra-qasource looking again at the screenshot provided here: #275 (comment)

It seems to be a timeout issue and an HTTP 404; I do not think this is related to the upgrade issue we have seen.

@amolnater-qasource

Hi @jlind23
We reported an issue on 7.15 Snapshot for this Linux agent upgrade failure at #148 (in case it is helpful).
We also attempted to troubleshoot it in various ways, such as increasing agent.download.timeout: 300. However, the issue was not resolved.
Thanks!

@jlind23
Contributor

jlind23 commented Apr 13, 2022

@amolnater-qasource Blake is currently working on this in #104.

@eric-ooi

Upgrading from 8.1.1/8.1.2 to 8.1.3 on Linux and Windows continues to fail for me.

@ghost
Author

ghost commented Apr 21, 2022

Hi @jlind23,
We have revalidated this issue on the released 8.1.3 cloud environment.
We upgraded a released 8.0.1 agent to 8.1.3 and made the observations below.

Agent upgrade:
Windows: PASS
Linux: FAIL
MAC: PASS

Screenshot:
Screenshot (382)
Screenshot (383)

Build Details:
VERSION: 8.1.3
COMMIT: c44c8c44c82ed80d1ae3dd990291dcc85b7a27dc
BUILD: 50723
ARTIFACT LINK OF 8.0.1: https://www.elastic.co/downloads/past-releases/elastic-agent-8-0-1

Please let us know if anything else is required from our end.

Thanks!

@blakerouse
Contributor

This is definitely a timeout issue. You can see the upgrade started at 12:54:55.063 and the timeout error message was logged at 12:56:56.460. That is roughly 1 min 30 secs, which is the default download timeout in version 8.0.1. Version 8.3.0 changed this value to 10 minutes, which will help with these types of issues.

That Windows and Mac finished is probably just a matter of timing. The Linux download could be very close to finishing when the timeout is hit right at the end.
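As a quick arithmetic check on the two timestamps quoted above, the sketch below computes the gap between them. The helper name `elapsed_seconds` is hypothetical, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def elapsed_seconds(start, end, fmt="%H:%M:%S.%f"):
    """Seconds between two same-day wall-clock timestamps."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


gap = elapsed_seconds("12:54:55.063", "12:56:56.460")
print(f"{gap:.3f} s")  # 121.397 s
```

The gap comes out at about 121 s, somewhat more than the 90 s default, so the two log lines may bracket more than the download itself. That reading is an assumption; the logs shown in this thread do not say when the download actually began.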

@nimarezainia
Contributor

@samratbhadra-qasource would it be possible to repeat this test between two 8.3 snapshots to confirm it's fixed? We will work on getting the fix backported.

@ghost
Author

ghost commented Apr 25, 2022

Hi @nimarezainia,

We have validated this issue on the two latest 8.3.0-Snapshot cloud environments and made the observations below.

We have upgraded:

  • 8.0.1 released agents with endpoint Security on all OSs to 8.3.0-Snapshot build.
  • 8.1.3 released agents with endpoint Security on all OSs to 8.3.0-Snapshot build.

Agent upgrade:
Windows: PASS
Linux: still FAILING
MAC: PASS

Screenshot:
Screenshot (393)
Screenshot (394)
Screenshot (395)

Build Details:
VERSION: 8.2.0-SNAPSHOT
COMMIT: 64479235912eecd8364a7dfd4231931839c61cf9
BUILD: 52260
ARTIFACT LINK OF 8.0.1: https://www.elastic.co/downloads/past-releases/elastic-agent-8-0-1
ARTIFACT LINK OF 8.1.3: https://www.elastic.co/downloads/past-releases/elastic-agent-8-1-3

Note: We are closing this issue as we are tracking the same issue under #173.

Thanks!

@ghost ghost closed this as completed Apr 25, 2022
@mukeshelastic

@ph @jlind23 Assuming the underlying cause was a timeout, is there a timeout configuration parameter that users can tweak without requiring an agent binary upgrade?

@jlind23
Contributor

jlind23 commented May 2, 2022

@mukeshelastic the download timeout can indeed be configured. Moreover, Blake worked on a progress reporter: #308
It will give users much more insight into it.

@amolnater-qasource

Bug Conversion

  • Testcase already exists for this scenario under Fleet test suite at link:

Thanks

This issue was closed.