[Fleet]: Hosted fleet server gets unhealthy on 8.16 Snapshot. #3946

Open
harshitgupta-qasource opened this issue Sep 24, 2024 · 6 comments
Labels
bug Something isn't working impact:critical Immediate priority; high value or cost to the product. Team:Fleet Label for the Fleet team

Comments

@harshitgupta-qasource

Deployment Links:

Description:
The hosted Fleet Server becomes unhealthy on the 8.16 Snapshot, and the APM integration shows an error.

Build details:
VERSION: 8.16.0 SNAPSHOT
BUILD: 78494
COMMIT: 156a76cb03e60a89792f905642817405002099a1

Screenshot
Image
Image

@harshitgupta-qasource harshitgupta-qasource added bug Something isn't working impact:critical Immediate priority; high value or cost to the product. labels Sep 24, 2024
@harshitgupta-qasource
Author

@amolnater-qasource Kindly review

@amolnater-qasource
Collaborator

Secondary Review for this ticket is Done.

@amolnater-qasource amolnater-qasource added the Team:Fleet Label for the Fleet team label Sep 24, 2024
@ycombinator
Contributor

I'm able to reproduce this simply by creating an 8.16.0-SNAPSHOT deployment on ESS production in the CFT region. I downloaded the diagnostic and checked for errors in the logs. Here's what I see:

$ cat elastic-agent-20240924-1.ndjson | grep error | jq '.message'
...
"Component state changed apm-es-containerhost (STARTING->FAILED): Failed: pid '11978' exited with code '1'"
"Unit state changed apm-es-containerhost (STARTING->FAILED): Failed: pid '11978' exited with code '1'"
"Unit state changed apm-es-containerhost-elastic-cloud-apm (STARTING->FAILED): Failed: pid '11978' exited with code '1'"
"Error: error loading config file: stat apm-server.yml: no such file or directory"
"Usage:"
"apm-server [flags]"
"apm-server [command]"
"Available Commands:"
"apikey      Manage API Keys for communication between APM agents and server (deprecated)"
"export      Export current config"
"help        Help about any command"
"keystore    Manage secrets keystore"
"run         Run APM Server"
"test        Test config"
"version     Show current version info"
"Flags:"
"-E, --E setting=value      Configuration overwrite"
"-N, --N                    Disable actual publishing for testing"
"-c, --c string             Configuration file, relative to path.config (default \"apm-server.yml\")"
"--cpuprofile string    Write cpu profile to file"
"-d, --d stringArray        Enable certain debug selectors"
"-e, --e                    Log to stderr and disable syslog/file output"
"--environment string   Set the environment in which the process is running (default \"default\")"
"-h, --help                 help for apm-server"
"--httpprof string      Start pprof http server"
"--memprofile string    Write memory profile to this file"
"--path.config string   Configuration path"
"--path.data string     Data path"
"--path.home string     Home path"
...

So it seems like the apm-server.yml file is missing.
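
For anyone without jq at hand, the same filtering step can be approximated in Python. This is a minimal sketch, not part of the diagnostic tooling: the function name `error_messages` is hypothetical, and it assumes each log line is a JSON object with a string `message` field, as the jq output above suggests. Like `grep error`, it matches the lowercase substring anywhere in the raw line, not only in a log-level field.

```python
import json
from typing import Iterable, List


def error_messages(lines: Iterable[str]) -> List[str]:
    """Rough equivalent of: grep error | jq '.message' (hypothetical helper)."""
    messages = []
    for line in lines:
        # grep is case-sensitive and matches the substring anywhere in the line.
        if "error" not in line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # jq would stop with a parse error here; this sketch skips the line.
            continue
        # jq would print null for a missing message; this sketch drops those.
        if isinstance(record, dict) and isinstance(record.get("message"), str):
            messages.append(record["message"])
    return messages
```

It can be fed an open file handle directly, for example `error_messages(open("elastic-agent-20240924-1.ndjson"))`, since iterating a text file yields one line at a time.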

@cmacknz
Member

cmacknz commented Sep 24, 2024

Pinged APM server team in Slack for ideas.

@cmacknz
Member

cmacknz commented Sep 24, 2024

From the Slack discussion, this looks related to the group membership change in the Wolfi container, where the agent user is no longer in gid user. The permissions of the apm-server.yml file make it unreadable to the elastic-agent processes:

apm-server binary is 1000:1000 while apm-server.yml is 0:0
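
To illustrate why a group membership change can make a 0:0 file unreadable, here is a minimal sketch of the classic Unix read-permission check. The function name `can_read` is hypothetical, and the mode bits used in the usage note (0o640, 0o644) are assumed for illustration only; the actual mode of apm-server.yml is not stated in this thread. The sketch ignores ACLs and Linux capabilities.

```python
import stat
from typing import Set


def can_read(file_uid: int, file_gid: int, mode: int,
             proc_uid: int, proc_gids: Set[int]) -> bool:
    """Simplified Unix read check (hypothetical helper, no ACLs/capabilities)."""
    if proc_uid == 0:
        # root bypasses the read permission bits.
        return True
    if proc_uid == file_uid:
        # The owner is judged by the owner bits only.
        return bool(mode & stat.S_IRUSR)
    if file_gid in proc_gids:
        # A member of the file's group is judged by the group bits only.
        return bool(mode & stat.S_IRGRP)
    # Everyone else falls through to the "other" bits.
    return bool(mode & stat.S_IROTH)
```

Under the assumed mode 0o640, a process running as uid 1000 with groups {0, 1000} can read a file owned 0:0, since `can_read(0, 0, 0o640, 1000, {0, 1000})` is true, but the same process with groups {1000} cannot. With an assumed mode of 0o644 the file would be readable either way.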

@ycombinator
Contributor

Thanks @cmacknz. I'm guessing the fix here is on the APM Server end, where the apm-server.yml file (and any others) needs to be readable by the Elastic Agent user?
