Use elastic-agent status as healthcheck #329

Merged: 11 commits, Apr 27, 2021

Conversation

@mtojek (Contributor) commented Apr 20, 2021

This PR modifies the healthcheck for elastic-agent and fleet-server, so the correct one is used.
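For context, a minimal sketch of what such a healthcheck could look like in the stack's docker-compose definition (retries and interval mirror the snippet reviewed later in this PR; the image tag and exact service layout are illustrative assumptions, not the exact change here):

```yaml
elastic-agent:
  image: "docker.elastic.co/beats/elastic-agent:8.0.0-SNAPSHOT"  # illustrative tag
  healthcheck:
    # Shell-form test: Docker only evaluates the exit code of this command,
    # so no parsing of the printed status output is needed.
    test: "elastic-agent status"
    retries: 90
    interval: 1s
```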

Blockers:

@mtojek mtojek self-assigned this Apr 20, 2021
@mtojek (Contributor, Author) commented Apr 20, 2021

@ruflin We managed to switch to the latest Docker image snapshots. Now it's time to enable the right healthcheck. What is the healthcheck recommendation for the fleet-server?

EDIT:

It seems that the blocker is still there: elastic/beats#24956

@elasticmachine (Collaborator) commented Apr 20, 2021

💚 Build Succeeded


Build stats

  • Build Cause: mtojek commented: /test

  • Start Time: 2021-04-27T14:01:25.217+0000

  • Duration: 23 min 57 sec

  • Commit: 11154a9

Test stats 🧪

  • Failed: 0

  • Passed: 316

  • Skipped: 1

  • Total: 317

Trends 🧪 (build time and test count trend charts)

@ruflin (Member) commented Apr 20, 2021

@michalpristas Instead of using elastic-agent status, could we also just start the agent monitoring, expose it and check there?

@michalpristas commented

It depends on the use case. If we want to wait for the agent to be healthy and its apps green, status is the right approach; also, if the agent is not able to connect to Fleet, it will return an unhealthy status.
If we only want to check the agent's capability to run the server, we can go that way, but this won't reflect any failures or restarts of the agent and its apps.

@ruflin (Member) commented Apr 20, 2021

I might have found a workaround here. If you add the following to the agent's env variables, it seems to work:

- "STATE_PATH=/usr/share/elastic-agent"

It overrides the default state path, which is /usr/share/elastic-agent/state. I'm not sure what else this might break. The end solution should be that elastic-agent status respects the configured STATE_PATH.

@mtojek Could you check whether you can get it running with the above env variable?
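If it helps, one way to watch the workaround take effect is to ask Docker for the health state it derives from the healthcheck (the container name is taken from the compose logs quoted later in this thread and may differ per compose project):

```bash
# Current health status: starting, healthy, or unhealthy.
docker inspect --format '{{.State.Health.Status}}' elastic-package-stack_elastic-agent_1

# Full health object, including the log of recent healthcheck probes and their exit codes.
docker inspect --format '{{json .State.Health}}' elastic-package-stack_elastic-agent_1
```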

@mtojek (Contributor, Author) commented Apr 20, 2021

/test

@mtojek (Contributor, Author) commented Apr 20, 2021

@ruflin it seems to work; I'll retry the CI job to double-check. Do you think we should go with this workaround, or is it better to wait for the proper fix? For now, the default Docker image is broken in the context of this feature. Maybe we should hardcode the STATE_PATH value in the image?

@mtojek mtojek requested a review from ruflin April 20, 2021 14:27
@ruflin (Member) left a comment

I would like to get some feedback from @blakerouse on this before merging, to make sure it does not have unexpected side effects.

@mtojek It seems you just "run" the command, shouldn't there also be some check for the output itself?

retries: 90
interval: 1s
hostname: docker-fleet-agent
environment:
- "FLEET_ENROLL=1"
- "FLEET_INSECURE=1"
- "FLEET_URL=http://fleet-server:8220"
- "STATE_PATH=/usr/share/elastic-agent"
Member commented:

Should we change it for both fleet-server and elastic-agent?

@mtojek (Contributor, Author) commented:

I can try and change it, but previously it didn't work well (maybe it was the STATE_PATH issue). I will re-check it :)

@mtojek (Contributor, Author) commented Apr 20, 2021

I updated the fleet-server section to use "elastic-agent status". Let's see.

Reviewer comment:

Why are you setting the STATE_PATH? /usr/share/elastic-agent/state is the default; why not use that? Is this a workaround for the status command not working?

Member commented:

@blakerouse Yes, otherwise the status command does not work, as it looks into the wrong path.

@mtojek (Contributor, Author) commented Apr 20, 2021

> @mtojek It seems you just "run" the command, shouldn't there also be some check for the output itself?

I expect the "exit code" to be the source of truth.
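That matches Docker's healthcheck contract, which only considers the exit status of the configured command, not its output. A manual spot check inside the agent container could be as simple as this sketch:

```bash
# Run the healthcheck command by hand and show its exit code.
elastic-agent status
echo "exit code: $?"   # 0 => Docker would mark the container healthy; non-zero => unhealthy
```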

@mtojek (Contributor, Author) commented Apr 21, 2021

/test

@mtojek (Contributor, Author) commented Apr 21, 2021

I will re-test it once more to see if it works correctly.

@blakerouse, would you mind sharing your recommendation for merging this change?

@mtojek (Contributor, Author) commented Apr 21, 2021

/test

@mtojek (Contributor, Author) commented Apr 21, 2021

OK, something is unstable here: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Felastic-package/detail/PR-329/9/pipeline/

I'm waiting for the timeout to grab more logs. Not sure if it's related to the latest master or to these changes.

EDIT:

This one failed too: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Felastic-package/detail/PR-329/10/pipeline/

I think we have to switch to the workaround healthcheck for the fleet-server.

@mtojek (Contributor, Author) commented Apr 21, 2021

/test

@mtojek (Contributor, Author) commented Apr 21, 2021

The error is not related to this PR. Something is wrong with the latest snapshot.

@mtojek (Contributor, Author) commented Apr 21, 2021

/test

@mtojek (Contributor, Author) commented Apr 21, 2021

It seems that STATE_PATH did the trick for the fleet-server as well! @ruflin, should we merge this one, or is it better to wait for the proper fix?

@mtojek mtojek requested a review from ruflin April 21, 2021 11:25
@mtojek (Contributor, Author) commented Apr 21, 2021

Nope, flaky again:

Attaching to elastic-package-stack_elastic-agent_1
elastic-agent_1              | The Elastic Agent is currently in BETA and should not be used in production
elastic-agent_1              |
elastic-agent_1              | Error: fail to enroll: fail to execute request to Kibana: could not decode the response, raw response:
elastic-agent_1              |
elastic-agent_1              | Error: enrollment failed: exit status 1

Should I switch to the previous healthcheck for the fleet-server?

@mtojek mtojek requested a review from blakerouse April 21, 2021 13:40
@ruflin (Member) commented Apr 21, 2021

This does not really look related to the health status. I wonder if we hit a timeout? Can you reproduce it locally?

@mtojek (Contributor, Author) commented Apr 21, 2021

@ruflin Probably yes, as it's flaky. Are you sure it's not related to the healthcheck? It didn't happen with the previous healthcheck (curl /api/status).

BTW, isn't this an action that the agent should retry?

@mtojek (Contributor, Author) commented Apr 21, 2021

/test

2 similar comments
@mtojek (Contributor, Author) commented Apr 21, 2021

/test

@mtojek (Contributor, Author) commented Apr 22, 2021

/test

@mtojek (Contributor, Author) commented Apr 22, 2021

I'm trying to reproduce it locally, but it always succeeds. It must be a really weird edge case. Perhaps a call for improving the logging in this area?

For reference:

untilfail ./k.sh

k.sh:

#!/bin/bash

set -ex

elastic-package stack up -v -d
elastic-package stack down
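(`untilfail` above is a local helper, not something shipped with elastic-package; a minimal sketch of such a helper, under that assumption, might look like this:)

```bash
#!/bin/bash
# untilfail: rerun the given command until it exits non-zero, then stop.
set -u
count=0
while "$@"; do
  count=$((count + 1))
  echo "run #${count} succeeded, retrying..."
done
echo "command failed after ${count} successful run(s)"
```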

@mtojek (Contributor, Author) commented Apr 26, 2021

@ruflin @blakerouse, any suggestions on how to solve this issue? It seems that the healthcheck is not ideal for the fleet-server. Should I open a separate issue for this?

@ruflin (Member) commented Apr 26, 2021

@michalpristas What would you recommend for the healthcheck for the fleet-server: the status command or the API endpoint?

@mtojek It seems all checks have passed now? Is it still flaky?

@mtojek (Contributor, Author) commented Apr 26, 2021

/test

@mtojek (Contributor, Author) commented Apr 26, 2021

> @mtojek It seems now all checks have passed? Is it still flaky?

Yes, I'm afraid so. With "api/status" it was the same problem, but I introduced a workaround: I increased the interval between retries (a higher chance to catch the healthy state). I suspect it's related to a restart of the agent.

EDIT:

It seems to be passing now, but I wouldn't like to merge it unverified. Did we push something specific recently that might have improved the healthcheck?

@mtojek (Contributor, Author) commented Apr 26, 2021

/test

3 similar comments
@mtojek (Contributor, Author) commented Apr 26, 2021

/test

@mtojek (Contributor, Author) commented Apr 27, 2021

/test

@mtojek (Contributor, Author) commented Apr 27, 2021

/test

@mtojek (Contributor, Author) commented Apr 27, 2021

/test

@ruflin (Member) left a comment

LGTM. I think we still need to work out "what is our recommended way" for Docker.

@mtojek mtojek merged commit 48bc70c into elastic:master Apr 27, 2021
andrewkroh added a commit to andrewkroh/elastic-package that referenced this pull request Dec 9, 2021
In elastic#329 a non-default STATE_PATH was set in the container environments
to make the elastic-agent status command function. That does not appear
to be necessary, so let's remove it. Having this in our environment makes
us operate our tests differently than how users deploy the software.
mtojek pushed a commit that referenced this pull request Dec 13, 2021
In #329 a non-default STATE_PATH was set in the container environments
to make the elastic-agent status command function. That does not appear
to be necessary, so let's remove it. Having this in our environment makes
us operate our tests differently than how users deploy the software.