This repository has been archived by the owner on Jan 23, 2024. It is now read-only.

Error while running the ALL stage on CI builds #1188

Closed

mdelapenya opened this issue Jun 30, 2021 · 12 comments · Fixed by #1242
Labels
  • area:test (Anything related to the Test automation)
  • bug
  • impact:critical (Immediate priority; high value or cost to the product.)
  • priority:blocker (Work is on-hold for a product team, business is at risk until resolution of issue)
  • Team:Automation (Label for the Observability productivity team)
  • triaged (Triaged issues will end up in Backlog column in Robots GH Project)

Comments

@mdelapenya
Contributor

mdelapenya commented Jun 30, 2021

It seems to be the apm-server container, which is not healthy on compose startup:

[2021-06-30T03:26:23.880Z] ERROR: for apm-server  Container "d609427c2b3f" is unhealthy.
[2021-06-30T03:26:23.880Z] Encountered errors while bringing up the project.
[2021-06-30T03:26:23.880Z] Starting/Building stack services..
[2021-06-30T03:26:23.880Z] 
[2021-06-30T03:26:23.880Z] Traceback (most recent call last):
[2021-06-30T03:26:23.880Z]   File "scripts/compose.py", line 31, in <module>
[2021-06-30T03:26:23.880Z]     main()
[2021-06-30T03:26:23.880Z]   File "scripts/compose.py", line 17, in main
[2021-06-30T03:26:23.880Z]     setup()
[2021-06-30T03:26:23.880Z]   File "/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/scripts/modules/cli.py", line 204, in __call__
[2021-06-30T03:26:23.880Z]     self.args.func()
[2021-06-30T03:26:23.880Z]   File "/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/scripts/modules/cli.py", line 586, in start_handler
[2021-06-30T03:26:23.880Z]     self.build_start_handler("start")
[2021-06-30T03:26:23.880Z]   File "/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/scripts/modules/cli.py", line 732, in build_start_handler
[2021-06-30T03:26:23.880Z]     self.run_docker_compose_process(docker_compose_cmd + up_params)
[2021-06-30T03:26:23.880Z]   File "/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/scripts/modules/cli.py", line 472, in run_docker_compose_process
[2021-06-30T03:26:23.880Z]     subprocess.check_call(docker_compose_cmd)
[2021-06-30T03:26:23.880Z]   File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
[2021-06-30T03:26:23.880Z]     raise CalledProcessError(retcode, cmd)
[2021-06-30T03:26:23.880Z] subprocess.CalledProcessError: Command '['docker-compose', '-f', '/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/docker-compose.yml', '--no-ansi', '--log-level', 'ERROR', 'up', '-d', '--quiet-pull']' returned non-zero exit status 1.
[2021-06-30T03:26:23.880Z] Makefile:75: recipe for target 'start-env' failed
[2021-06-30T03:26:23.880Z] make[1]: *** [start-env] Error 1
[2021-06-30T03:26:23.880Z] make[1]: Leaving directory '/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing'
[2021-06-30T03:26:23.880Z] Makefile:92: recipe for target 'env-agent-all' failed
[2021-06-30T03:26:23.880Z] make: *** [env-agent-all] Error 2

CI Job: https://apm-ci.elastic.co/blue/organizations/jenkins/apm-integration-test-downstream/detail/master/9929/pipeline

@mdelapenya mdelapenya added the bug, Team:Automation, area:test, priority:blocker, and impact:critical labels on Jun 30, 2021
@mdelapenya
Contributor Author

Both the Python and PHP stages fail with the following message, which is the very same error/stack trace as in the all stage.

[2021-06-28T13:51:07.848Z] ERROR: for apm-server  Container "2646f30a194d" is unhealthy.
[2021-06-28T13:51:07.848Z] Encountered errors while bringing up the project.
[2021-06-28T13:51:07.848Z] Starting/Building stack services..
[2021-06-28T13:51:07.848Z] 
[2021-06-28T13:51:07.848Z] Traceback (most recent call last):
[2021-06-28T13:51:07.848Z]   File "scripts/compose.py", line 31, in <module>
[2021-06-28T13:51:07.848Z]     main()
[2021-06-28T13:51:07.848Z]   File "scripts/compose.py", line 17, in main
[2021-06-28T13:51:07.848Z]     setup()
[2021-06-28T13:51:07.848Z]   File "/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/scripts/modules/cli.py", line 204, in __call__
[2021-06-28T13:51:07.848Z]     self.args.func()
[2021-06-28T13:51:07.848Z]   File "/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/scripts/modules/cli.py", line 586, in start_handler
[2021-06-28T13:51:07.848Z]     self.build_start_handler("start")
[2021-06-28T13:51:07.848Z]   File "/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/scripts/modules/cli.py", line 732, in build_start_handler
[2021-06-28T13:51:07.848Z]     self.run_docker_compose_process(docker_compose_cmd + up_params)
[2021-06-28T13:51:07.848Z]   File "/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/scripts/modules/cli.py", line 472, in run_docker_compose_process
[2021-06-28T13:51:07.848Z]     subprocess.check_call(docker_compose_cmd)
[2021-06-28T13:51:07.848Z]   File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
[2021-06-28T13:51:07.848Z]     raise CalledProcessError(retcode, cmd)
[2021-06-28T13:51:07.848Z] subprocess.CalledProcessError: Command '['docker-compose', '-f', '/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/docker-compose.yml', '--no-ansi', '--log-level', 'ERROR', 'up', '-d', '--quiet-pull']' returned non-zero exit status 1.
[2021-06-28T13:51:07.848Z] Makefile:75: recipe for target 'start-env' failed
[2021-06-28T13:51:07.848Z] make[1]: *** [start-env] Error 1
[2021-06-28T13:51:07.848Z] make[1]: Leaving directory '/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing'
[2021-06-28T13:51:07.848Z] Makefile:92: recipe for target 'env-agent-python' failed
[2021-06-28T13:51:07.848Z] make: *** [env-agent-python] Error 2
script returned exit code 2

@cachedout
Contributor

cachedout commented Jul 30, 2021

This is related to elastic/kibana#106749

EDIT: This is now at elastic/kibana#107300

@cachedout cachedout self-assigned this Jul 30, 2021
@cachedout
Contributor

cachedout commented Aug 18, 2021

We think we have a fix for this now. We're just waiting for new images to be built, and hopefully the ITs will then come back. So we're blocked until those images are built and we can test them.

@cachedout
Contributor

It does not look like the fix to Kibana startup time had the desired effect.

The next area of exploration is to try upgrading Docker across the workers, in case this is somehow related to a bug in Docker disrupting connectivity between services.

@cachedout
Contributor

cachedout commented Aug 30, 2021

Learned from reading this issue that /api/status will actually return a 503 if Kibana detects that it is not healthy. It's still curious that we aren't seeing anything in the logs that would indicate an error, but it might be good to start logging the responses to the HTTP health checks, just so we can confirm this theory.
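For reference, a minimal sketch of what logging those health-check responses could look like. It is not part of the apm-integration-testing scripts; the URL, 10s interval, and 30-attempt budget are assumptions borrowed from the Kibana healthcheck quoted further down this thread.

    #!/usr/bin/env python3
    # Minimal sketch: poll Kibana's /api/status and log every response code.
    # Not part of apm-integration-testing; URL, interval, and retry budget
    # are assumptions mirroring the compose healthcheck discussed here.
    import time
    import urllib.error
    import urllib.request

    KIBANA_STATUS_URL = "http://kibana:5601/api/status"  # hypothetical host/port

    for attempt in range(30):
        try:
            with urllib.request.urlopen(KIBANA_STATUS_URL, timeout=5) as resp:
                print("attempt {}: HTTP {}".format(attempt, resp.status))
                if resp.status == 200:
                    break
        except urllib.error.HTTPError as exc:
            # A 503 here would confirm that Kibana itself reports "not healthy".
            print("attempt {}: HTTP {}".format(attempt, exc.code))
        except OSError as exc:
            print("attempt {}: no response ({})".format(attempt, exc))
        time.sleep(10)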

@trentm
Member

trentm commented Sep 1, 2021

Hi. Should we re-open this or start a separate issue?

The IT tests for some languages (at least Node.js and Java) are still failing occasionally with:

[2021-09-01T14:52:55.945Z] Creating localtesting_8.0.0_kibana        ... 
[2021-09-01T14:52:55.945Z] Creating localtesting_8.0.0_kibana        ... done
[2021-09-01T14:58:02.704Z] 
[2021-09-01T14:58:02.704Z] ERROR: for apm-server  Container "e8f98e425b33" is unhealthy.

In a Slack discussion, @graphaelli noticed that some of these errors are with the .ci/scripts/opbeans-app.sh ... step.

The README for opbeans-app.sh says (emphasis mine):

  • opbeans-app.sh: runs the apm-integration-testing app and validates the stack can be started. You can choose the versions to run; see the environment variables configuration.

Given that, it doesn't strike me as totally valid to just exclude starting Kibana from this test script. Thoughts?

@trentm
Member

trentm commented Sep 1, 2021

@cachedout Regarding #1245 setting --no-kibana: I asked above:

The README for opbeans-app.sh says (emphasis mine):

  • opbeans-app.sh: runs the apm-integration-testing app and validates the stack can be started. You can choose the versions to run; see the environment variables configuration.

Given that, it doesn't strike me as totally valid to just exclude starting Kibana from this test script. Thoughts?

However, I'm pretty ignorant of the intent of "opbeans-app.sh", so if others think excluding Kibana here is fine, then cool.

Has there been any understanding of or digging into why the Kibana container is not starting -- or not starting fast enough?

@cachedout
Contributor

Hi @trentm

There's been a pretty significant amount of debugging around the Kibana startup issue. Much of it is linked to this issue in comments above.

The largest problem that we have here is that it's very hard to replicate this problem locally, so it's hard to get an environment where we can give the Kibana folks usable information. (I've seen it occur once locally out of about 200 tries.)

To make matters worse, I can't even replicate it using the APM Integration Test suite on a manually provisioned worker. The only place we ever see this is in the CI pipeline for the APM Integration Test suite. We've done things like trying to turn up the logging, but haven't been able to capture any additional information.

Kibana is tracking a couple of issues right now which seem to present themselves in similar ways. Here is the most recent and probably the most relevant:

elastic/kibana#110583

It's certainly possible that Kibana will fix this upstream and that will resolve the issue. However, I also noticed on our end that we originally wrote test startup scripts for agents which included --no-kibana, so I added it to the others for the sake of consistency.

I will cc: @v1v here who originally wrote the README entry that you reference. I'm not sure whether he meant "the Elastic stack" (ES+KB..etc) or if by "stack" he meant the set of opbeans applications.

Regardless, it's certainly fine to shelve this PR and hope that Kibana fixes this upstream, but I don't think there's much more we can do from our end outside of working around it as this PR does.

@v1v
Member

v1v commented Sep 2, 2021

The original PR was #763

The idea of the validation was to ensure all the pieces work as expected:

  • APM Agent
  • Opbeans app that consumes the above APM agent version
  • Elastic Stack services that are used by the Opbeans app

The reason was to detect breaking changes earlier in case the APM agent version affected the Opbeans app. However, there are no tests that verify the Opbeans app in the APM-ITs; the APM agent team didn't see any reason to add any.

If the latest Kibana versions cause some issues, the proposed --no-kibana seems to be the fix.

@trentm
Member

trentm commented Sep 2, 2021

Thanks for the information. As long as none of the APM agent or Opbeans App integration tests are testing central config (that would require a Kibana, or later Fleet server, I guess), then it sounds good to me to add --no-kibana to ".ci/scripts/opbeans-app.sh".


Here is my understanding of what is/was failing. Apologies if this was already known and/or written down somewhere -- I didn't see it. Don't feel obliged to answer the questions I have here. They are only things I would investigate further if issues persist after merging #1245.

  • The ERROR: for apm-server Container "d609427c2b3f" is unhealthy. message appears because one of the depends_on conditions for the apm-server service failed:
      "depends_on": {
        "elasticsearch": {
          "condition": "service_healthy"
        },
        "kibana": {
          "condition": "service_healthy"
        }
      }
  • That d609427c2b3f container is almost certainly Kibana and not Elasticsearch. (Do we have docker ps and/or docker inspect ... output in the logs that confirms this?)

  • The healthcheck for Kibana is:

      "healthcheck": {
        "interval": "10s",
        "retries": 30,
        "start_period": "10s",
        "test": [
          "CMD",
          "curl",
          "--write-out",
          "'HTTP %{http_code}'",
          "-k",
          "--fail",
          "--silent",
          "--output",
          "/dev/null",
          "http://kibana:5601/api/status"
        ],
        "timeout": "5s"
      },
  • Per https://docs.docker.com/engine/reference/builder/#healthcheck it would take 30 consecutive failures of that curl ... command for the container to be marked "unhealthy". The --write-out 'HTTP %{http_code}' output of those curl commands should be available in the docker inspect $containerId output for that container. Could we get a failing test to dump docker inspect ... for every container (a rough sketch of such a dump follows this list)? Or perhaps we already have that?
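For what it's worth, here is a rough sketch of what that dump could look like. The script itself is hypothetical and not part of apm-integration-testing; it only relies on the standard docker ps / docker inspect CLI and the State.Health fields they expose.

    #!/usr/bin/env python3
    # Hypothetical debugging helper (not part of apm-integration-testing):
    # print the healthcheck status and captured probe output for every
    # running container, using only `docker ps` and `docker inspect`.
    import json
    import subprocess

    container_ids = subprocess.check_output(
        ["docker", "ps", "--quiet"], universal_newlines=True
    ).split()

    for cid in container_ids:
        raw = subprocess.check_output(
            ["docker", "inspect", cid], universal_newlines=True
        )
        info = json.loads(raw)[0]
        health = info.get("State", {}).get("Health")
        print("== {} ({})".format(info["Name"], cid))
        if health is None:
            print("   no healthcheck configured")
            continue
        print("   status: {}".format(health["Status"]))
        for entry in health.get("Log", []):
            # Each entry carries the probe's exit code and captured output,
            # e.g. the curl --write-out 'HTTP %{http_code}' string.
            print("   exit={} output={!r}".format(entry["ExitCode"], entry["Output"]))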

Aside: Interestingly it seems that docker-compose config version 3 has dropped support for having a condition under depends_on: https://docs.docker.com/compose/compose-file/compose-file-v3/#depends_on
Instead they require that one implements "waiting for upstream services to be ready" either in one's application or with some command that waits at the start of the CMD step in a Dockerfile: https://docs.docker.com/compose/startup-order/
We are currently using version 2. However, eventually we may have to deal with that. Not for this issue, though. :)
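For illustration only, here is a minimal sketch of the "wait in an entrypoint wrapper" approach those docs describe; the endpoint, timeout, and environment variable names are placeholders, not anything the project currently ships.

    #!/usr/bin/env python3
    # Illustrative entrypoint wrapper for the compose-v3 "wait for upstream
    # services" pattern; endpoint, timeout, and env var names are placeholders.
    # Assumes the real container command is passed as arguments to this script.
    import os
    import sys
    import time
    import urllib.error
    import urllib.request

    url = os.environ.get("WAIT_FOR_URL", "http://elasticsearch:9200/")
    deadline = time.time() + int(os.environ.get("WAIT_FOR_TIMEOUT", "300"))

    while time.time() < deadline:
        try:
            urllib.request.urlopen(url, timeout=5)
            break  # upstream answered, so stop waiting
        except OSError:
            time.sleep(5)
    else:
        sys.exit("timed out waiting for {}".format(url))

    # Hand over to the real container command, e.g. `apm-server ...`.
    os.execvp(sys.argv[1], sys.argv[1:])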

@cachedout
Contributor

We are now tracking what we believe to be the true cause of this issue upstream in the Kibana repo:

elastic/kibana#110583

@kseniia-kolpakova kseniia-kolpakova added the triaged label on Oct 5, 2021