Use elastic-agent status as healthcheck #329

Merged: 11 commits, Apr 27, 2021

Conversation

@mtojek (Contributor) commented Apr 20, 2021

This PR modifies the healthcheck for elastic-agent and fleet-server, so the correct one is used.
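For context, a minimal sketch of what such a healthcheck could look like in the stack's docker-compose definition (retries and interval mirror the snippet reviewed later in this PR; the image tag and exact service layout are illustrative assumptions, not the exact change here):

```yaml
elastic-agent:
  image: "docker.elastic.co/beats/elastic-agent:8.0.0-SNAPSHOT"  # illustrative tag
  healthcheck:
    # Shell-form test: Docker only evaluates the exit code of this command,
    # so no parsing of the printed status output is needed.
    test: "elastic-agent status"
    retries: 90
    interval: 1s
```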

Blockers:

@mtojek mtojek self-assigned this Apr 20, 2021
@mtojek (Contributor, Author) commented Apr 20, 2021

@ruflin We managed to switch to the latest Docker image snapshots. Now it's time to enable the right healthcheck. What is the healthcheck recommendation for the fleet-server?

EDIT:

It seems that the blocker is still there: elastic/beats#24956

@elasticmachine (Collaborator) commented Apr 20, 2021

💚 Build Succeeded


Build stats

  • Build Cause: mtojek commented: /test

  • Start Time: 2021-04-27T14:01:25.217+0000

  • Duration: 23 min 57 sec

  • Commit: 11154a9

Test stats 🧪

  • Failed: 0

  • Passed: 316

  • Skipped: 1

  • Total: 317

Trends 🧪 (build time and test count trend charts)

@ruflin (Member) commented Apr 20, 2021

@michalpristas Instead of using elastic-agent status, could we also just start the agent monitoring, expose it and check there?

@michalpristas commented

It depends on the use case. If we want to wait for the agent to be healthy and its apps green, status is the right approach; also, if the agent is not able to connect to Fleet, it will return an unhealthy status.
If we only want to check the agent's capability to run the server, we can go that way, but this won't reflect any failures or restarts of the agent and its apps.

@ruflin (Member) commented Apr 20, 2021

I might have found a workaround here. If you add the following to the agent's env variables, it seems to work:

- "STATE_PATH=/usr/share/elastic-agent"

It overrides the default state path, which is /usr/share/elastic-agent/state. I'm not sure what else this might break. The end solution should be that elastic-agent status respects the configured STATE_PATH.

@mtojek Could you check whether you can get it running with the above env variable?
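If it helps, one way to watch the workaround take effect is to ask Docker for the health state it derives from the healthcheck (the container name is taken from the compose logs quoted later in this thread and may differ per compose project):

```bash
# Current health status: starting, healthy, or unhealthy.
docker inspect --format '{{.State.Health.Status}}' elastic-package-stack_elastic-agent_1

# Full health object, including the log of recent healthcheck probes and their exit codes.
docker inspect --format '{{json .State.Health}}' elastic-package-stack_elastic-agent_1
```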

@mtojek (Contributor, Author) commented Apr 20, 2021

/test

@mtojek (Contributor, Author) commented Apr 20, 2021

@ruflin it seems to work; I'll retry the CI job to double-check. Do you think we should go with this workaround, or is it better to wait for the proper fix? For now, the default Docker image is broken in the context of this feature. Maybe we should hardcode the STATE_PATH value in the image?

@mtojek mtojek requested a review from ruflin April 20, 2021 14:27
@ruflin (Member) left a comment

I would like to get some feedback from @blakerouse on this before merging, to make sure it does not have unexpected side effects.

@mtojek It seems you just "run" the command, shouldn't there also be some check for the output itself?

retries: 90
interval: 1s
hostname: docker-fleet-agent
environment:
- "FLEET_ENROLL=1"
- "FLEET_INSECURE=1"
- "FLEET_URL=http://fleet-server:8220"
- "STATE_PATH=/usr/share/elastic-agent"
Member commented:

Should we change it for both fleet-server and elastic-agent?

@mtojek (Contributor, Author) commented:

I can try and change it, but previously it didn't work well (maybe it was the STATE_PATH issue). I will re-check it :)

@mtojek (Contributor, Author) commented Apr 20, 2021

I updated the fleet-server section to use "elastic-agent status". Let's see.

Reviewer comment:

Why are you setting the STATE_PATH? /usr/share/elastic-agent/state is the default; why not use that? Is this a workaround for the status command not working?

Member commented:

@blakerouse Yes, otherwise the status command does not work, as it looks into the wrong path.

@mtojek (Contributor, Author) commented Apr 20, 2021

> @mtojek It seems you just "run" the command, shouldn't there also be some check for the output itself?

I expect the "exit code" to be the source of truth.
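That matches Docker's healthcheck contract, which only considers the exit status of the configured command, not its output. A manual spot check inside the agent container could be as simple as this sketch:

```bash
# Run the healthcheck command by hand and show its exit code.
elastic-agent status
echo "exit code: $?"   # 0 => Docker would mark the container healthy; non-zero => unhealthy
```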

@mtojek (Contributor, Author) commented Apr 21, 2021

/test

@mtojek (Contributor, Author) commented Apr 21, 2021

I will re-test it once more to see if it works correctly.

@blakerouse, would you mind sharing your recommendation for merging this change?

@mtojek (Contributor, Author) commented Apr 21, 2021

/test

@mtojek (Contributor, Author) commented Apr 21, 2021

OK, something is unstable here: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Felastic-package/detail/PR-329/9/pipeline/

I'm waiting for the timeout to grab more logs. Not sure if it's related to the latest master or to these changes.

EDIT:

This one failed too: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Felastic-package/detail/PR-329/10/pipeline/

I think we have to switch to the workaround healthcheck for the fleet-server.

@mtojek (Contributor, Author) commented Apr 21, 2021

/test

@mtojek (Contributor, Author) commented Apr 21, 2021

The error is not related to this PR. Something is wrong with the latest snapshot.

@mtojek (Contributor, Author) commented Apr 21, 2021

/test

@mtojek (Contributor, Author) commented Apr 21, 2021

It seems that STATE_PATH did the trick for the fleet-server as well! @ruflin, should we merge this one, or is it better to wait for the proper fix?

@mtojek mtojek requested a review from ruflin April 21, 2021 11:25
@mtojek (Contributor, Author) commented Apr 21, 2021

Nope, flaky again:

Attaching to elastic-package-stack_elastic-agent_1
elastic-agent_1              | The Elastic Agent is currently in BETA and should not be used in production
elastic-agent_1              |
elastic-agent_1              | Error: fail to enroll: fail to execute request to Kibana: could not decode the response, raw response:
elastic-agent_1              |
elastic-agent_1              | Error: enrollment failed: exit status 1

Should I switch to the previous healthcheck for the fleet-server?

@mtojek mtojek requested a review from blakerouse April 21, 2021 13:40
@ruflin (Member) commented Apr 21, 2021

This does not really look related to the health status. I wonder if we hit a timeout? Can you reproduce it locally?

@mtojek (Contributor, Author) commented Apr 21, 2021

@ruflin Probably yes, as it's flaky. Are you sure it's not related to the healthcheck? It didn't happen with the previous healthcheck (curl /api/status).

BTW, isn't this an action that the agent should retry?

@mtojek (Contributor, Author) commented Apr 21, 2021

/test

2 similar comments
@mtojek (Contributor, Author) commented Apr 21, 2021

/test

@mtojek (Contributor, Author) commented Apr 22, 2021

/test

@mtojek (Contributor, Author) commented Apr 22, 2021

I'm trying to reproduce it locally, but it always succeeds. It must be a really weird edge case. Perhaps a call for improving the logging in this area?

For reference:

untilfail ./k.sh

k.sh:

#!/bin/bash

set -ex

elastic-package stack up -v -d
elastic-package stack down
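(`untilfail` above is a local helper, not something shipped with elastic-package; a minimal sketch of such a helper, under that assumption, might look like this:)

```bash
#!/bin/bash
# untilfail: rerun the given command until it exits non-zero, then stop.
set -u
count=0
while "$@"; do
  count=$((count + 1))
  echo "run #${count} succeeded, retrying..."
done
echo "command failed after ${count} successful run(s)"
```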

@mtojek (Contributor, Author) commented Apr 26, 2021

@ruflin @blakerouse, any suggestions on how to solve this issue? It seems that the healthcheck is not ideal for the fleet-server. Should I open a separate issue for this?

@ruflin (Member) commented Apr 26, 2021

@michalpristas What would you recommend for the healthcheck for the fleet-server: the status command or the API endpoint?

@mtojek It seems all checks have passed now? Is it still flaky?

@mtojek (Contributor, Author) commented Apr 26, 2021

/test

@mtojek (Contributor, Author) commented Apr 26, 2021

> @mtojek It seems now all checks have passed? Is it still flaky?

Yes, I'm afraid so. With "api/status" it was the same problem, but I introduced a workaround: I increased the interval between retries (a higher chance to catch the healthy state). I suspect it's related to a restart of the agent.

EDIT:

It seems to be passing now, but I wouldn't like to merge it unverified. Did we push something specific recently that might have improved the healthcheck?

@mtojek (Contributor, Author) commented Apr 26, 2021

/test

3 similar comments
@mtojek (Contributor, Author) commented Apr 26, 2021

/test

@mtojek (Contributor, Author) commented Apr 27, 2021

/test

@mtojek (Contributor, Author) commented Apr 27, 2021

/test

@mtojek (Contributor, Author) commented Apr 27, 2021

/test

@ruflin (Member) left a comment

LGTM. I think we still need to work out "what is our recommended way" for Docker.

@mtojek mtojek merged commit 48bc70c into elastic:master Apr 27, 2021
andrewkroh added a commit to andrewkroh/elastic-package that referenced this pull request Dec 9, 2021
In elastic#329 a non-default STATE_PATH was set in the container environments
to make the elastic-agent status command function. That does not appear
to be necessary, so let's remove it. Having this in our environment makes
us operate our tests differently than how users deploy the software.
mtojek pushed a commit that referenced this pull request Dec 13, 2021
In #329 a non-default STATE_PATH was set in the container environments
to make the elastic-agent status command function. That does not appear
to be necessary, so let's remove it. Having this in our environment makes
us operate our tests differently than how users deploy the software.