Use elastic-agent status as healthcheck #329
Conversation
@ruflin We managed to switch to the latest Docker image snapshots. Now it's time to enable the right healthcheck. What is the healthcheck recommendation for the fleet-server? EDIT: it seems that the blocker is still there: elastic/beats#24956
💚 Build Succeeded
@michalpristas Instead of using …
Depends on the use case: if we want to wait for the agent to be healthy and apps green, status is the right approach. Also, if the agent is not able to connect to Fleet, it will return an unhealthy status...
I might have found a workaround here. If you add the following to the agent environment variables, it seems to work:
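A minimal sketch of the variable in question, assuming it matches the STATE_PATH entry that appears in the compose excerpt further down in this thread:

```yaml
environment:
  - "STATE_PATH=/usr/share/elastic-agent"
```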
It overwrites the default state path, which is /usr/share/elastic-agent/state. @mtojek Could you check whether you get it running with the above env variable?
/test
@ruflin it seems to work, I'll retry the CI job to double-check. Do you think we should go with this workaround or is it better to wait for the exact fix? So far the default Docker image is broken in the context of this feature. Maybe hardcode the STATE_PATH value in the image?
I would like to get some feedback from @blakerouse on this before merging, to make sure it does not have some unexpected side effects.
@mtojek It seems you just "run" the command; shouldn't there also be some check of the output itself?
retries: 90
interval: 1s
hostname: docker-fleet-agent
environment:
  - "FLEET_ENROLL=1"
  - "FLEET_INSECURE=1"
  - "FLEET_URL=http://fleet-server:8220"
  - "STATE_PATH=/usr/share/elastic-agent"
Should we change it for both fleet-server and elastic-agent?
I can try and change it, but previously it didn't work well (maybe it was the STATE_PATH issue). I will re-check it :)
I updated the fleet-server section to use "elastic-agent status". Let's see.
Why are you setting the STATE_PATH? The /usr/share/elastic-agent/state is the default, why not use that? Is this a workaround for the status command not working?
@blakerouse Yes, otherwise the status command does not work, as it looks into the wrong path.
I expect the "exit code" to be the source of truth.
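A short illustration of "exit code as the source of truth"; this is a sketch assuming elastic-agent is on the container's PATH, not an excerpt from the PR:

```sh
# Docker only looks at the exit code of the healthcheck command,
# so there is no need to parse the textual output of `elastic-agent status`.
if elastic-agent status > /dev/null 2>&1; then
  echo "healthy (exit code 0)"
else
  echo "unhealthy (non-zero exit code)"
fi
```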
/test
I will re-test it to see if it works correctly. @blakerouse would you mind sharing your recommendation for merging this change?
/test
Ok, something is unstable here: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Felastic-package/detail/PR-329/9/pipeline/ I'm waiting for the timeout to grab other logs. Not sure if it's related to the latest master or these changes. EDIT: This one failed too: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Felastic-package/detail/PR-329/10/pipeline/ I think we have to switch to the workaround healthcheck for the fleet-server.
/test
The error is not related to this PR. Something is wrong with the latest snapshot.
/test
It seems that the STATE_PATH did the trick for fleet-server as well! @ruflin Should we merge this one or is it better to wait for the proper fix?
Nope, flaky again:
Should I switch to the previous healthcheck for the fleet-server?
This does not really look related to the health status. I wonder if we hit a timeout? Can you reproduce it locally?
@ruflin Probably yes, as it's flaky. Are you sure it's not related to the healthcheck? It didn't happen with the previous healthcheck (curl …). BTW, isn't it an action that the agent should retry?
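For reference, a sketch of what the previous curl-based healthcheck for fleet-server presumably looked like; the endpoint and port are assumptions inferred from the "api/status" and FLEET_URL references in this thread:

```yaml
healthcheck:
  # --fail makes curl return a non-zero exit code on HTTP errors.
  test: "curl -s --fail http://localhost:8220/api/status"
  retries: 90
  interval: 1s
```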
/test
/test
/test
I'm trying to reproduce it locally, but it always succeeds. Must be a really weird edge case. Is this a call for improving the logging around this place? For reference:
k.sh:
@ruflin @blakerouse any suggestions on how to solve this issue? It seems that the healthcheck is not ideal for the fleet-server. Should I open a separate issue for this?
@michalpristas What would you recommend for the healthcheck for the fleet-server: the status command or the API endpoint? @mtojek It seems all checks have passed now? Is it still flaky?
/test
Yes, I'm afraid so. With "api/status" it was the same problem, but I introduced a workaround - increased the interval between retries (higher chance of encountering the healthy state). I suspect it's related to the restart of the agent. EDIT: It seems to be passing now, but I wouldn't like to merge it unverified. Did we push something specific recently that might have improved the healthcheck?
/test
/test
/test
/test
/test
LGTM. I think we still need to work on "what is our recommended way" for Docker.
In #329 a non-default STATE_PATH was set in the container environments to make the elastic-agent status command function. That does not appear to be necessary, so let's remove it. Having this in our environment makes us operate our tests differently from how users deploy the software.
This PR modifies the healthchecks for elastic-agent and fleet-server, so the correct one is used.

Blockers:
- elastic/beats#24956