This repository has been archived by the owner on Jan 23, 2024. It is now read-only.

Error while running the ALL stage on CI builds #1188

Closed

mdelapenya opened this issue Jun 30, 2021 · 12 comments · Fixed by #1242
Labels
  • area:test (Anything related to the Test automation)
  • bug
  • impact:critical (Immediate priority; high value or cost to the product.)
  • priority:blocker (Work is on-hold for a product team, business is at risk until resolution of issue)
  • Team:Automation (Label for the Observability productivity team)
  • triaged (Triaged issues will end up in Backlog column in Robots GH Project)

Comments

@mdelapenya
Contributor

mdelapenya commented Jun 30, 2021

It seems to be the apm-server container, which is not healthy on compose startup:

[2021-06-30T03:26:23.880Z] ERROR: for apm-server  Container "d609427c2b3f" is unhealthy.
[2021-06-30T03:26:23.880Z] Encountered errors while bringing up the project.
[2021-06-30T03:26:23.880Z] Starting/Building stack services..
[2021-06-30T03:26:23.880Z] 
[2021-06-30T03:26:23.880Z] Traceback (most recent call last):
[2021-06-30T03:26:23.880Z]   File "scripts/compose.py", line 31, in <module>
[2021-06-30T03:26:23.880Z]     main()
[2021-06-30T03:26:23.880Z]   File "scripts/compose.py", line 17, in main
[2021-06-30T03:26:23.880Z]     setup()
[2021-06-30T03:26:23.880Z]   File "/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/scripts/modules/cli.py", line 204, in __call__
[2021-06-30T03:26:23.880Z]     self.args.func()
[2021-06-30T03:26:23.880Z]   File "/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/scripts/modules/cli.py", line 586, in start_handler
[2021-06-30T03:26:23.880Z]     self.build_start_handler("start")
[2021-06-30T03:26:23.880Z]   File "/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/scripts/modules/cli.py", line 732, in build_start_handler
[2021-06-30T03:26:23.880Z]     self.run_docker_compose_process(docker_compose_cmd + up_params)
[2021-06-30T03:26:23.880Z]   File "/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/scripts/modules/cli.py", line 472, in run_docker_compose_process
[2021-06-30T03:26:23.880Z]     subprocess.check_call(docker_compose_cmd)
[2021-06-30T03:26:23.880Z]   File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
[2021-06-30T03:26:23.880Z]     raise CalledProcessError(retcode, cmd)
[2021-06-30T03:26:23.880Z] subprocess.CalledProcessError: Command '['docker-compose', '-f', '/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/docker-compose.yml', '--no-ansi', '--log-level', 'ERROR', 'up', '-d', '--quiet-pull']' returned non-zero exit status 1.
[2021-06-30T03:26:23.880Z] Makefile:75: recipe for target 'start-env' failed
[2021-06-30T03:26:23.880Z] make[1]: *** [start-env] Error 1
[2021-06-30T03:26:23.880Z] make[1]: Leaving directory '/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing'
[2021-06-30T03:26:23.880Z] Makefile:92: recipe for target 'env-agent-all' failed
[2021-06-30T03:26:23.880Z] make: *** [env-agent-all] Error 2

CI Job: https://apm-ci.elastic.co/blue/organizations/jenkins/apm-integration-test-downstream/detail/master/9929/pipeline

@mdelapenya mdelapenya added the bug, Team:Automation, area:test, priority:blocker, and impact:critical labels on Jun 30, 2021
@mdelapenya
Contributor Author

Both the Python and PHP stages fail with the following message, which is the very same error/stack trace as in the all stage.

[2021-06-28T13:51:07.848Z] ERROR: for apm-server  Container "2646f30a194d" is unhealthy.
[2021-06-28T13:51:07.848Z] Encountered errors while bringing up the project.
[2021-06-28T13:51:07.848Z] Starting/Building stack services..
[2021-06-28T13:51:07.848Z] 
[2021-06-28T13:51:07.848Z] Traceback (most recent call last):
[2021-06-28T13:51:07.848Z]   File "scripts/compose.py", line 31, in <module>
[2021-06-28T13:51:07.848Z]     main()
[2021-06-28T13:51:07.848Z]   File "scripts/compose.py", line 17, in main
[2021-06-28T13:51:07.848Z]     setup()
[2021-06-28T13:51:07.848Z]   File "/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/scripts/modules/cli.py", line 204, in __call__
[2021-06-28T13:51:07.848Z]     self.args.func()
[2021-06-28T13:51:07.848Z]   File "/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/scripts/modules/cli.py", line 586, in start_handler
[2021-06-28T13:51:07.848Z]     self.build_start_handler("start")
[2021-06-28T13:51:07.848Z]   File "/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/scripts/modules/cli.py", line 732, in build_start_handler
[2021-06-28T13:51:07.848Z]     self.run_docker_compose_process(docker_compose_cmd + up_params)
[2021-06-28T13:51:07.848Z]   File "/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/scripts/modules/cli.py", line 472, in run_docker_compose_process
[2021-06-28T13:51:07.848Z]     subprocess.check_call(docker_compose_cmd)
[2021-06-28T13:51:07.848Z]   File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
[2021-06-28T13:51:07.848Z]     raise CalledProcessError(retcode, cmd)
[2021-06-28T13:51:07.848Z] subprocess.CalledProcessError: Command '['docker-compose', '-f', '/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing/docker-compose.yml', '--no-ansi', '--log-level', 'ERROR', 'up', '-d', '--quiet-pull']' returned non-zero exit status 1.
[2021-06-28T13:51:07.848Z] Makefile:75: recipe for target 'start-env' failed
[2021-06-28T13:51:07.848Z] make[1]: *** [start-env] Error 1
[2021-06-28T13:51:07.848Z] make[1]: Leaving directory '/var/lib/jenkins/workspace/tegration-test-downstream_master/src/github.com/elastic/apm-integration-testing'
[2021-06-28T13:51:07.848Z] Makefile:92: recipe for target 'env-agent-python' failed
[2021-06-28T13:51:07.848Z] make: *** [env-agent-python] Error 2
script returned exit code 2

@cachedout
Contributor

cachedout commented Jul 30, 2021

This is related to elastic/kibana#106749

EDIT: This is now at elastic/kibana#107300

@cachedout cachedout self-assigned this Jul 30, 2021
@cachedout
Contributor

cachedout commented Aug 18, 2021

We think we have a fix for this now. We're just waiting for new images to be built, and hopefully the ITs will then come back. So we're blocked until those images are built and we can test them.

@cachedout
Contributor

It does not look like the fix to Kibana startup time had the desired effect.

The next area of exploration is to try upgrading Docker across the workers, in case this is somehow related to a bug in Docker disrupting connectivity between services.

@cachedout
Contributor

cachedout commented Aug 30, 2021

Learned from reading this issue that /api/status will actually return a 503 if Kibana detects that it is not healthy. It's still curious that we aren't seeing anything in the logs that would indicate an error, but it might be good to start logging the responses to the HTTP health checks, just so we can confirm this theory.
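For reference, a minimal sketch of what logging those health-check responses could look like. It is not part of the apm-integration-testing scripts; the URL, 10s interval, and 30-attempt budget are assumptions borrowed from the Kibana healthcheck quoted further down this thread.

    #!/usr/bin/env python3
    # Minimal sketch: poll Kibana's /api/status and log every response code.
    # Not part of apm-integration-testing; URL, interval, and retry budget
    # are assumptions mirroring the compose healthcheck discussed here.
    import time
    import urllib.error
    import urllib.request

    KIBANA_STATUS_URL = "http://kibana:5601/api/status"  # hypothetical host/port

    for attempt in range(30):
        try:
            with urllib.request.urlopen(KIBANA_STATUS_URL, timeout=5) as resp:
                print("attempt {}: HTTP {}".format(attempt, resp.status))
                if resp.status == 200:
                    break
        except urllib.error.HTTPError as exc:
            # A 503 here would confirm that Kibana itself reports "not healthy".
            print("attempt {}: HTTP {}".format(attempt, exc.code))
        except OSError as exc:
            print("attempt {}: no response ({})".format(attempt, exc))
        time.sleep(10)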

@trentm
Member

trentm commented Sep 1, 2021

Hi. Should we re-open this or start a separate issue?

The IT tests for some languages (at least Node.js and Java) are still failing occasionally with:

[2021-09-01T14:52:55.945Z] Creating localtesting_8.0.0_kibana        ... 
[2021-09-01T14:52:55.945Z] Creating localtesting_8.0.0_kibana        ... done
[2021-09-01T14:58:02.704Z] 
[2021-09-01T14:58:02.704Z] ERROR: for apm-server  Container "e8f98e425b33" is unhealthy.

In a Slack discussion, @graphaelli noticed that some of these errors are with the .ci/scripts/opbeans-app.sh ... step.

The README for opbeans-app.sh says (emphasis mine):

  • opbeans-app.sh: runs the apm-integration-testing app and validates the stack can be started. You can choose the versions to run; see the environment variables configuration.

Given that, it doesn't strike me as totally valid to just exclude starting Kibana from this test script. Thoughts?

@trentm
Member

trentm commented Sep 1, 2021

@cachedout Regarding #1245 setting --no-kibana: I asked above:

The README for opbeans-app.sh says (emphasis mine):

  • opbeans-app.sh: runs the apm-integration-testing app and validates the stack can be started. You can choose the versions to run; see the environment variables configuration.

Given that, it doesn't strike me as totally valid to just exclude starting Kibana from this test script. Thoughts?

However, I'm pretty ignorant of the intent of "opbeans-app.sh", so if others think excluding Kibana here is fine, then cool.

Has there been any understanding of or digging into why the Kibana container is not starting -- or not starting fast enough?

@cachedout
Contributor

Hi @trentm

There's been a pretty significant amount of debugging around the Kibana startup issue. Much of it is linked to this issue in comments above.

The largest problem that we have here is that it's very hard to replicate this problem locally, so it's hard to get an environment where we can give the Kibana folks usable information. (I've seen it occur once locally out of about 200 tries.)

To make matters worse, I can't even replicate it using the APM Integration Test suite on a manually provisioned worker. The only place we ever see this is in the CI pipeline for the APM Integration Test suite. We've done things like trying to turn up the logging, but haven't been able to capture any additional information.

Kibana is tracking a couple of issues right now which seem to present themselves in similar ways. Here is the most recent and probably the most relevant:

elastic/kibana#110583

It's certainly possible that Kibana will fix this upstream and that will resolve the issue. However, I also noticed on our end that we originally wrote test startup scripts for agents which included --no-kibana, so I added it to the others for the sake of consistency.

I will cc: @v1v here who originally wrote the README entry that you reference. I'm not sure whether he meant "the Elastic stack" (ES+KB..etc) or if by "stack" he meant the set of opbeans applications.

Regardless, it's certainly fine to shelve this PR and hope that Kibana fixes this upstream, but I don't think there's much more we can do from our end outside of working around it as this PR does.

@v1v
Member

v1v commented Sep 2, 2021

The original PR was #763

The idea of the validation was to ensure all the pieces work as expected:

  • APM Agent
  • Opbeans app that consumes the above APM agent version
  • Elastic Stack services that are used by the Opbeans app

The reason was to detect breaking changes earlier in case the APM agent version affected the Opbeans app. However, there are no tests that verify the Opbeans app in the APM-ITs; the APM agent team didn't see any reason to add any.

If the latest Kibana versions cause some issues, the proposed --no-kibana seems to be the fix.

@trentm
Member

trentm commented Sep 2, 2021

Thanks for the information. As long as none of the APM agent or Opbeans App integration tests are testing central config (that would require a Kibana, or later Fleet server, I guess), then it sounds good to me to add --no-kibana to ".ci/scripts/opbeans-app.sh".


Here is my understanding of what is/was failing. Apologies if this was already known and/or written down somewhere -- I didn't see it. Don't feel obliged to answer the questions I have here. They are only things I would investigate further if issues persist after merging #1245.

  • The ERROR: for apm-server Container "d609427c2b3f" is unhealthy. message appears because one of the depends_on conditions for the apm-server service failed:
      "depends_on": {
        "elasticsearch": {
          "condition": "service_healthy"
        },
        "kibana": {
          "condition": "service_healthy"
        }
      }
  • That d609427c2b3f container is almost certainly Kibana and not Elasticsearch. (Do we have docker ps and/or docker inspect ... output in the logs that confirms this?)

  • The healthcheck for Kibana is:

      "healthcheck": {
        "interval": "10s",
        "retries": 30,
        "start_period": "10s",
        "test": [
          "CMD",
          "curl",
          "--write-out",
          "'HTTP %{http_code}'",
          "-k",
          "--fail",
          "--silent",
          "--output",
          "/dev/null",
          "http://kibana:5601/api/status"
        ],
        "timeout": "5s"
      },
  • Per https://docs.docker.com/engine/reference/builder/#healthcheck it would take 30 consecutive failures of that curl ... command for the container to be marked "unhealthy". The --write-out 'HTTP %{http_code}' output of those curl commands should be available in the docker inspect $containerId output for that container. Could we get a failing test to dump docker inspect ... for every container (a rough sketch of such a dump follows this list)? Or perhaps we already have that?
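For what it's worth, here is a rough sketch of what that dump could look like. The script itself is hypothetical and not part of apm-integration-testing; it only relies on the standard docker ps / docker inspect CLI and the State.Health fields they expose.

    #!/usr/bin/env python3
    # Hypothetical debugging helper (not part of apm-integration-testing):
    # print the healthcheck status and captured probe output for every
    # running container, using only `docker ps` and `docker inspect`.
    import json
    import subprocess

    container_ids = subprocess.check_output(
        ["docker", "ps", "--quiet"], universal_newlines=True
    ).split()

    for cid in container_ids:
        raw = subprocess.check_output(
            ["docker", "inspect", cid], universal_newlines=True
        )
        info = json.loads(raw)[0]
        health = info.get("State", {}).get("Health")
        print("== {} ({})".format(info["Name"], cid))
        if health is None:
            print("   no healthcheck configured")
            continue
        print("   status: {}".format(health["Status"]))
        for entry in health.get("Log", []):
            # Each entry carries the probe's exit code and captured output,
            # e.g. the curl --write-out 'HTTP %{http_code}' string.
            print("   exit={} output={!r}".format(entry["ExitCode"], entry["Output"]))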

Aside: Interestingly it seems that docker-compose config version 3 has dropped support for having a condition under depends_on: https://docs.docker.com/compose/compose-file/compose-file-v3/#depends_on
Instead they require that one implements "waiting for upstream services to be ready" either in one's application or with some command that waits at the start of the CMD step in a Dockerfile: https://docs.docker.com/compose/startup-order/
We are currently using version 2. However, eventually we may have to deal with that. Not for this issue, though. :)
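For illustration only, here is a minimal sketch of the "wait in an entrypoint wrapper" approach those docs describe; the endpoint, timeout, and environment variable names are placeholders, not anything the project currently ships.

    #!/usr/bin/env python3
    # Illustrative entrypoint wrapper for the compose-v3 "wait for upstream
    # services" pattern; endpoint, timeout, and env var names are placeholders.
    # Assumes the real container command is passed as arguments to this script.
    import os
    import sys
    import time
    import urllib.error
    import urllib.request

    url = os.environ.get("WAIT_FOR_URL", "http://elasticsearch:9200/")
    deadline = time.time() + int(os.environ.get("WAIT_FOR_TIMEOUT", "300"))

    while time.time() < deadline:
        try:
            urllib.request.urlopen(url, timeout=5)
            break  # upstream answered, so stop waiting
        except OSError:
            time.sleep(5)
    else:
        sys.exit("timed out waiting for {}".format(url))

    # Hand over to the real container command, e.g. `apm-server ...`.
    os.execvp(sys.argv[1], sys.argv[1:])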

@cachedout
Contributor

We are now tracking what we believe to be the true cause of this issue upstream in the Kibana repo:

elastic/kibana#110583

@kseniia-kolpakova kseniia-kolpakova added the triaged label on Oct 5, 2021