[Agent Upgrade]: Linux agent fails on upgrade from 7.16.3>8.1.2 Snapshot and goes Unhealthy. #275
Comments
Pinging @elastic/fleet (Team:Fleet) |
Secondary review for this ticket is Done |
I've seen this issue happen as well on my Linux hosts. Getting this on about half of my Linux hosts after upgrading from 8.1.1 to 8.1.2 today. |
Also getting this on my Mac, with a similar error message:
|
@joshdover can you do a ls -l /Library/Elastic/Agent/data/elastic-agent-7f30bb/downloads/ ? |
@joshdover Can you send me a diagnostic of one of the problematic hosts? |
@ph below is the downloads directory and I sent you the diagnostic over DM.
|
My experience:
|
@pjbertels do you also have the endpoint integration enabled? |
I have not been able to recreate this so far, I've tried an |
I am seeing this same issue. Agent goes unhealthy, then back to healthy, but is not upgraded. manual download of agent on client |
@redundancydisorder, what integrations/OSes are you running on? |
All agents that fail are running Ubuntu 20.04 (most) or 18.04 (a few). The example from above that failed was just running the system integration on 20.04. A couple of agents are running Custom Logs. Cluster and Kibana were installed at 8.1.1 and just updated to 8.1.2. Agents can be added with 8.1.2 with no issues. |
We're getting the same as @redundancydisorder with agents running on Rocky 8.4/8.5, with a range of different integrations, but all have at least the system and auditd integrations. |
I am having the same issue with my deployment. I spun up a new deployment last night on Elastic Cloud, on version 8.1.1. Added the agents (2 VPSes, running Ubuntu server 20.04 LTS). Upgraded the stack to 8.1.2. Then wanted to proceed in upgrading the machines to Agent 8.1.2. Same error on both devices. 12:59:03.830
|
@ph Should this be listed as an 8.2 blocker (assuming the problem is still present)? |
@joshdover @jlind23 I wouldn't consider this a blocker, the upgrade is still beta (experimental?) |
With debug logging on, no information is recorded between: At the same time the referenced directory: I repeated the update attempts while monitoring file system changes and can't find any attempt to write a download to disk. |
Looks like the issue here was introduced as a side effect of elastic/beats#30281 it affects |
Changes from #255 will fix it if we need to backport to |
@michel-laterman Can you backport the fix to the appropriate branches @jlind23 |
@samratbhadra-qasource can you test it again with the latest snapshot from 8.1.X? |
Hi @jlind23,
Build Details: Thanks! |
@samratbhadra-qasource Where is this commit sha coming from: COMMIT: a05409860677938d5bbfc6c6065c85230f54848b Is it a stack one? I do not see it in Beats repository. |
Hi @jlind23, NOTE: The screenshot is from the latest 8.1.3-Snapshot build available on cloud-qa. |
Could you please use today's snapshot, the one with the "c44c8c" SHA? The previous one was a version built last week. |
Since the problem was present on older versions of Agent, isn't this always going to be reproducible when trying to upgrade from an affected version, even if the version that is being upgraded to is fixed? @michel-laterman maybe you could confirm based on the nature of the fix. If what I suspect is true, then it points to the very painful impact of having bugs in our upgrade code. We probably need to have better e2e test coverage in place for this feature before we can call it GA. |
If you are trying to upgrade from 8.1.1 or 8.1.2 to something else, it will indeed fail. Here @samratbhadra-qasource tried to upgrade from 7.16.3 to 8.1.3, if I am not misreading. For the e2e test coverage, it is an ongoing topic with manu in the e2e testing repository. |
@samratbhadra-qasource looking again at the screenshot provided here: #275 (comment) It seems to be a timeout issue and an HTTP 404, I do not think this is related to the upgrade issue we have seen. |
@amolnater-qasource and blake is currently working on this there: #104 |
Upgrading from 8.1.1/8.1.2 to 8.1.3 on Linux and Windows continues to fail for me. |
Hi @jlind23, Agent upgrade: Build Details: Please let us know if anything else is required from our end. Thanks! |
This is definitely a timeout issue. You can see the upgrade started at 12:54:55.063 and the timeout error message is logged at 12:56:56.460. That is roughly 2 minutes, which lines up with the default timeout for download in version 8.0.1. Version 8.3.0 changed this value to 10 minutes, which will help with these types of issues. That Windows and Mac finished is probably just a matter of timing. The Linux one could be very close to finishing when the timeout is hit right at the end. |
@samratbhadra-qasource would it be possible to repeat this test between two 8.3 snapshots to confirm it's fixed? We will work on getting the fix backported. |
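As a quick check of the timing quoted in the comment above, the gap between the two logged timestamps can be computed directly. This is only a minimal sketch: the helper name `elapsed` is made up for illustration, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime, timedelta

# Time-of-day format used in the quoted log lines (millisecond precision).
FMT = "%H:%M:%S.%f"


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two same-day log timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Upgrade start and timeout error, as quoted in the comment.
delta = elapsed("12:54:55.063", "12:56:56.460")
print(delta.total_seconds())
```

The difference comes out at about 121 seconds, i.e. roughly two minutes between the start of the upgrade and the timeout error.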
Hi @nimarezainia, We have validated this issue on the two latest 8.3.0-Snapshot cloud environments and have the observations below. We have upgraded:
Agent upgrade: Build Details: Note: We are closing this issue as we are tracking the same issue under #173 Thanks! |
@mukeshelastic download timeout can be configured indeed. Moreover Blake worked on a progress reporter: #308 |
|
Describe the bug:
Agent upgrade fails and goes Unhealthy from 7.16.3>8.1.2 Snapshot.
Build Details:
VERSION: 8.1.2-SNAPSHOT
COMMIT: 7dce1a1c7cf6aba8782d0af02fc2d95edb5be999
BUILD: 50688
ARTIFACT LINK OF 7.16.3: https://www.elastic.co/downloads/past-releases/elastic-agent-7-16-3
Preconditions:
Steps to Reproduce:
Actual Result:
Agent upgrade fails and goes Unhealthy from 7.16.3>8.1.2 Snapshot.
Expected Result:
Agent should be upgraded to latest version from 7.16.3>8.1.2 Snapshot.
What's Working:
Agent is successfully upgraded from 7.16.3>8.1.2-Snapshot without Endpoint Security Integration.
Screenshot: