This repository has been archived by the owner on Sep 17, 2024. It is now read-only.

Enrolling on Windows is flaky #1557

Closed
adam-stokes opened this issue Sep 8, 2021 · 7 comments
Assignees: adam-stokes
Labels: area:test, priority:medium, size:M (1-5 days), Team:Automation, Team:Elastic-Agent, triaged, windows

Comments

@adam-stokes (Contributor)

There seems to be a necessary wait period after Fleet is set up (according to Kibana) before we can enroll an agent. For now we will work around it with a utils.Sleep, but we should investigate <fleet-server>:8220/api/status to see what we can use there to determine when Fleet Server is actually ready.
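As a rough idea of what that could look like, here is a minimal Go sketch that polls the fleet-server status endpoint instead of sleeping for a fixed interval. The exact JSON shape of the /api/status response (the "status" field and the "HEALTHY" value) is an assumption and should be verified against the fleet-server version we run.

```go
// waitForFleetServer polls <fleet-server>:8220/api/status until the server
// reports itself ready, instead of relying on a fixed utils.Sleep.
// NOTE: illustrative sketch only; the readiness condition below (HTTP 200
// plus a "status" field equal to "HEALTHY") is an assumption.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type fleetStatus struct {
	Name   string `json:"name"`
	Status string `json:"status"`
}

func waitForFleetServer(baseURL string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(baseURL + "/api/status")
		if err == nil {
			var st fleetStatus
			decodeErr := json.NewDecoder(resp.Body).Decode(&st)
			resp.Body.Close()
			// Assumed readiness condition: HTTP 200 and status "HEALTHY".
			if decodeErr == nil && resp.StatusCode == http.StatusOK && st.Status == "HEALTHY" {
				return nil
			}
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("fleet-server at %s not ready after %s", baseURL, timeout)
}

func main() {
	if err := waitForFleetServer("http://10.224.0.24:8220", 3*time.Minute); err != nil {
		panic(err)
	}
	fmt.Println("fleet-server is ready, proceeding with enrollment")
}
```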

adam-stokes added the area:test, priority:medium, size:M, and triaged labels on Sep 8, 2021
adam-stokes self-assigned this on Sep 8, 2021
adam-stokes added the Team:Elastic-Agent, Team:Automation, and windows labels on Sep 8, 2021
@adam-stokes (Contributor, Author)

Linking elastic/beats#27836

@adam-stokes (Contributor, Author)

I think we need to make sure we're capturing /usr/share/elastic-agent/state/data/logs/ on the fleet server.
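For illustration only (not how the pipeline necessarily does it today), a teardown step along these lines could pull that directory off the fleet-server host and into the CI workspace so it gets archived with the build. The host, SSH user, and destination path below are hypothetical, and the sketch assumes scp access to the machine.

```go
// collectFleetServerLogs copies the Elastic Agent log directory from the
// fleet-server host into the local CI workspace so it can be archived as a
// build artifact. Hypothetical sketch: assumes SSH/scp access to the host.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func collectFleetServerLogs(host, user, destDir string) error {
	if err := os.MkdirAll(destDir, 0o755); err != nil {
		return err
	}
	// Recursively copy the agent's state logs from the remote fleet server.
	src := fmt.Sprintf("%s@%s:/usr/share/elastic-agent/state/data/logs/", user, host)
	cmd := exec.Command("scp", "-r", src, destDir)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Hypothetical host, user, and destination values for illustration.
	if err := collectFleetServerLogs("10.224.0.24", "ubuntu", "outputs/fleet-server-logs"); err != nil {
		fmt.Fprintln(os.Stderr, "failed to collect fleet-server logs:", err)
	}
}
```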

@adam-stokes (Contributor, Author)

The current flakiness is due to a 400 BadRequest error:

[2021-09-28T06:23:32.430Z] time="2021-09-28T06:23:31Z" level=error msg="Error executing command" args="[install -e -v --force --insecure --enrollment-token=WS1JUUszd0J0TzRsX0ppXzBMVXE6N3JGVjFGOUxSd2lHVktfdG84T3ZjUQ== --url http://10.224.0.24:8220]" baseDir=. command="C:\\elastic-agent\\elastic-agent.exe" error="exit status 1" stderr="2021-09-28T06:23:27.512Z\tINFO\t[composable.providers.docker]\tdocker/docker.go:43\tDocker provider skipped, unable to connect: protocol not available\n2021-09-28T06:23:27.612Z\tINFO\tcapabilities/capabilities.go:59\tcapabilities file not found in C:\\elastic-agent\\capabilities.yml\n2021-09-28T06:23:30.035Z\tWARN\t[tls]\ttlscommon/tls_config.go:98\tSSL/TLS verifications disabled.\n2021-09-28T06:23:30.178Z\tINFO\tcmd/enroll_cmd.go:432\tStarting enrollment to URL: http://10.224.0.24:8220/\nError: fail to enroll: fail to execute request to fleet-server: status code: 400, fleet-server returned an error: BadRequest\nFor help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.0/fleet-troubleshooting.html\n2021-09-28T06:23:31.126Z\tINFO\t[composable.providers.docker]\tdocker/docker.go:43\tDocker provider skipped, unable to connect: protocol not available\n2021-09-28T06:23:31.226Z\tINFO\tcapabilities/capabilities.go:59\tcapabilities file not found in C:\\elastic-agent\\capabilities.yml\n2021-09-28T06:23:31.475Z\tINFO\t[composable.providers.docker]\tdocker/docker.go:43\tDocker provider skipped, unable to connect: protocol not available\n2021-09-28T06:23:31.578Z\tINFO\tcapabilities/capabilities.go:59\tcapabilities file not found in C:\\elastic-agent\\capabilities.yml\nError: enroll command failed with exit code: 1\nFor help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.0/fleet-troubleshooting.html\n"

It is not easy to reproduce, and we are instrumenting some additional data collection in the pipeline to help determine what's going on.
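As a possible mitigation while we investigate (not something the suite does today), the harness could retry the install/enroll command a couple of times when fleet-server answers with a transient 400. A hedged Go sketch, with a placeholder token and a hypothetical helper:

```go
// Illustrative only: one way a test harness could tolerate a transient
// fleet-server 400 during "elastic-agent install ... --enrollment-token ...":
// retry the enroll command a few times with a delay between attempts.
// This is not taken from the e2e-testing code base; the command path and
// arguments are placeholders based on the error log above.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func enrollWithRetry(agentExe string, args []string, attempts int, delay time.Duration) error {
	var lastErr error
	for i := 1; i <= attempts; i++ {
		out, err := exec.Command(agentExe, args...).CombinedOutput()
		if err == nil {
			return nil
		}
		lastErr = fmt.Errorf("attempt %d failed: %w\n%s", i, err, out)
		fmt.Println(lastErr)
		time.Sleep(delay)
	}
	return lastErr
}

func main() {
	args := []string{
		"install", "-e", "-v", "--force", "--insecure",
		"--enrollment-token=<token>", // placeholder
		"--url", "http://10.224.0.24:8220",
	}
	if err := enrollWithRetry(`C:\elastic-agent\elastic-agent.exe`, args, 3, 30*time.Second); err != nil {
		panic(err)
	}
}
```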

@cachedout (Contributor)

@adam-stokes Hi! I'm searching for a bit of clarity on this issue. It looks like we moved it out of the backlog on September 27 but it hasn't seen much activity since then.

You wrote:

we are instrumenting some additional data collection in the pipeline to help determine what's going on.

Has that instrumentation been added? I'm just trying to get a sense of what specifically we're trying to accomplish with this issue. Until we can articulate that clearly, I'm not sure that this is ready to be added to the list of near-term work.

@adam-stokes (Contributor, Author)

This is a hard problem to track down; it doesn't happen consistently. Right now this is on the back burner until I can finish out #1740.

@cachedout (Contributor)

Thanks, @adam-stokes . I think we should probably move this into the backlog until we have a clear plan about what to do here. WDYT?

@mdelapenya (Contributor)

I think we can close this one, as the Windows machine on AWS is working on a daily basis.

3 participants