
[CI] DocsClientYamlTestSuiteIT timed out after 30 minutes #49753

Closed
andreidan opened this issue Dec 2, 2019 · 32 comments
Labels
:Delivery/Build (Build or test infrastructure), Team:Delivery (Meta label for Delivery team), >test-failure (Triaged test failures from CI)

Comments

@andreidan
Contributor

andreidan commented Dec 2, 2019

Build failure https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=corretto11,nodes=general-purpose/362/console

Suite timeout exceeded (>= 1800000 msec)

Build scan available at https://gradle-enterprise.elastic.co/s/lp2vd4vqmj25o

@andreidan added the >test-failure (Triaged test failures from CI) and :Docs labels Dec 2, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-docs (:Docs)

@andreidan changed the title from "[CI] [7.x] DocsClientYamlTestSuiteIT timed out after 30 minutes" to "[CI] DocsClientYamlTestSuiteIT timed out after 30 minutes" Dec 2, 2019
@mark-vieira
Contributor

This has been happening increasingly often over the past couple of days because the scope of the suite keeps growing over time. I think we may need to bump this timeout in the interim until we sort out a better solution, such as splitting the suite up.

@mark-vieira
Contributor

mark-vieira commented Jan 14, 2020

Now I am indeed starting to believe we've introduced some kind of regression. This is failing consistently across many OSes and JREs. Here's the runtime graph for one such CI job over the last several builds; you can see the sudden spike.

[graph: DocsClientYamlTestSuiteIT runtime over the last several builds, showing a sudden spike]

@mark-vieira
Contributor

So something has clearly changed to make this test suite suddenly run so much slower. The reason I don't believe we are seeing this failure in intake and pull request checks is that those are split up into multiple smaller parallel builds, so there is much less contention on the worker than in the full periodic builds.

There's no single test that's running long. It simply looks as though all test cases are just a bit slower, adding up to a pretty substantial runtime hit.

Another interesting data point is that this is only happening in master and 7.x. We are not seeing the increased runtimes in 7.5. My closest guess at a potential culprit for a large change that might impact performance here is #50907.

@ywelsch @DaveCTurner Any thoughts? Any chance that change might have an adverse effect on performance on a system with a lot of other CPU and IO contention (lots of other concurrent tests)?

@danielmitterdorfer Have you seen any recent changes in performance benchmarks that might indicate some kind of regression has been introduced?

@hendrikmuhs
Contributor

hendrikmuhs commented Jan 17, 2020

For me this looks like a duplicate of #49579.

However, since 2020-01-13 the number of failures has increased, which supports the investigations above.

@DaveCTurner
Contributor

DaveCTurner commented Jan 17, 2020

I suspect #50907 and #50928. Although these changes save potentially hundreds of fsync calls from some cluster state updates in large clusters, they will add an fsync or two to cluster state updates in tiny clusters, and these tests use a tiny cluster which does a lot of cluster state updates.

Do these tests run using a ramdisk? If not, can we move them to use a ramdisk? That would make fsync calls free again.
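To make the cost concrete, here is a minimal, self-contained micro-benchmark (not from the Elasticsearch code base; the file name and iteration count are arbitrary) that times repeated small write-plus-fsync cycles. Pointing it at an SSD-backed directory versus a tmpfs/ramdisk directory shows why a couple of extra fsyncs per cluster state update add up across thousands of test-driven updates.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Rough sketch: time N small write+fsync cycles against the directory given as args[0].
// Run it once against a disk-backed path and once against a ramdisk path to compare.
public class FsyncCost {
    public static void main(String[] args) throws IOException {
        Path file = Path.of(args[0]).resolve("fsync-probe.bin");
        int iterations = 1_000;
        long start = System.nanoTime();
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer buffer = ByteBuffer.allocate(256);
            for (int i = 0; i < iterations; i++) {
                buffer.clear();
                channel.write(buffer, 0);
                channel.force(true); // flush data and metadata: cheap on a ramdisk, costly on real disks
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("%d fsyncs in %d ms (%.2f ms each)%n",
                iterations, elapsedMs, (double) elapsedMs / iterations);
    }
}
```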

@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Build)

@hendrikmuhs removed the :Docs label Jan 17, 2020
@hendrikmuhs
Contributor

hendrikmuhs commented Jan 17, 2020

While the investigation continues and will hopefully result in a proper fix, I will increase the timeout as a temporary measure, because too many builds are failing at the moment.

hendrikmuhs pushed a commit that referenced this issue Jan 17, 2020
increase timeout of DocsClientYamlTestSuiteIT to 35 minutes, temporary solution for issue #49753
hendrikmuhs pushed a commit that referenced this issue Jan 17, 2020
increase timeout of DocsClientYamlTestSuiteIT to 35 minutes, temporary solution for issue #49753
hendrikmuhs pushed a commit that referenced this issue Jan 17, 2020
increase timeout of DocsClientYamlTestSuiteIT to 35 minutes, temporary solution for issue #49753
@hendrikmuhs
Contributor

I increased the timeout to 35 minutes. This is a temporary solution; please revert it once we have a proper fix.
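For reference, a suite-level timeout bump of this kind is typically expressed with randomizedtesting's @TimeoutSuite annotation on the suite class. The sketch below is only an approximation of that change; the exact shape of DocsClientYamlTestSuiteIT in the repo may differ.

```java
import com.carrotsearch.randomizedtesting.annotations.Name;
import com.carrotsearch.randomizedtesting.annotations.ParametersFactory;
import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
import org.apache.lucene.util.TimeUnits;
import org.elasticsearch.test.rest.yaml.ClientYamlTestCandidate;
import org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase;

// Raise the whole-suite timeout to 35 minutes while the underlying slowdown is investigated.
@TimeoutSuite(millis = 35 * TimeUnits.MINUTE)
public class DocsClientYamlTestSuiteIT extends ESClientYamlSuiteTestCase {

    public DocsClientYamlTestSuiteIT(@Name("yaml") ClientYamlTestCandidate testCandidate) {
        super(testCandidate);
    }

    @ParametersFactory
    public static Iterable<Object[]> parameters() throws Exception {
        return ESClientYamlSuiteTestCase.createParameters();
    }
}
```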

@tvernum
Contributor

tvernum commented Jan 20, 2020

It doesn't look like 35 minutes is enough; the suite is still timing out.

I'll push the timeout up further.

tvernum added a commit to tvernum/elasticsearch that referenced this issue Jan 20, 2020
The docs test suite is still timing out on CI at 35 minutes, so
pushing it to 40 minutes while we determine the cause of the slowdown.

Relates: elastic#49753
tvernum added a commit that referenced this issue Jan 20, 2020
The docs test suite is still timing out on CI at 35 minutes, so
pushing it to 40 minutes while we determine the cause of the slowdown.

Relates: #49753
tvernum added a commit to tvernum/elasticsearch that referenced this issue Jan 20, 2020
The docs test suite is still timing out on CI at 35 minutes, so
pushing it to 40 minutes while we determine the cause of the slowdown.

Relates: elastic#49753
Backport of: elastic#51200
tvernum added a commit to tvernum/elasticsearch that referenced this issue Jan 20, 2020
The docs test suite is still timing out on CI at 35 minutes, so
pushing it to 40 minutes while we determine the cause of the slowdown.

Relates: elastic#49753
Backport of: elastic#51200
tvernum added a commit that referenced this issue Jan 20, 2020
The docs test suite is still timing out on CI at 35 minutes, so
pushing it to 40 minutes while we determine the cause of the slowdown.

Relates: #49753
Backport of: #51200
tvernum added a commit that referenced this issue Jan 20, 2020
The docs test suite is still timing out on CI at 35 minutes, so
pushing it to 40 minutes while we determine the cause of the slowdown.

Relates: #49753
Backport of: #51200
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
increase timeout of DocsClientYamlTestSuiteIT to 35 minutes, temporary solution for issue elastic#49753
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
The docs test suite is still timing out on CI at 35 minutes, so
pushing it to 40 minutes while we determine the cause of the slowdown.

Relates: elastic#49753
@polyfractal
Contributor

Master docs yaml tests timed out today. I couldn't find evidence of any particular test killing the cluster or of the test itself failing, so this looks to be another instance of the general slowdown.

https://gradle-enterprise.elastic.co/s/xrrdvuh3wza2s

@ywelsch
Contributor

ywelsch commented Jan 23, 2020

@mark-vieira can you establish a baseline for docs:check performance on our CI machines?

On my machine, for example, it takes 32 minutes running on an SSD, but on a ramdisk it takes 16 minutes, which brings it well below the timeout.

@polyfractal
Contributor

polyfractal commented Jan 23, 2020

Another maybe-useful data point: I just saw three timeouts come in back-to-back, and they were all from the same host. Perhaps there is something specific to certain hosts/machines/OS combinations that is exacerbating the issue? I know the OS/JRE combinations have already been analyzed, but I wasn't sure about hosts, so I thought I'd mention it.

@mark-vieira
Contributor

I just saw three timeouts come in back-to-back, and they were all from the same host

We use ephemeral workers, so we never run multiple builds on a single agent. Perhaps they were the same type of host. Our compatibility matrix tests show this timing out across all operating systems, so I don't think specific host types are the issue. Also, as I mentioned before, this regression is not in 7.5, which indicates to me that something was introduced in 7.x and later.

@mark-vieira
Contributor

On my machine, for example, it takes 32 minutes running on an SSD, but on a ramdisk it takes 16 minutes, which brings it well below the timeout.

Keep in mind this is running only a single test suite, so resource contention is very low. This is the same reason we generally don't see timeouts in pull-request or intake builds: those jobs are split up into smaller pieces with fewer concurrently executing tests.

The graph below shows the average execution time for this suite over the past 30 days for intake/PR builds (these use a ramdisk). You can see that, while we are still well below the timeout threshold, there is a clear bump in execution time around Jan 12-13. The additional load of our periodic tests (which run the entire test suite in a single build) is enough to push this over the edge.

[graph: average DocsClientYamlTestSuiteIT execution time in intake/PR builds over the past 30 days, with a bump around Jan 12-13]

@mark-vieira
Contributor

Here's the same graph but for XPackRestIT. The bump in execution time isn't as dramatic, but it's there, and the timelines align.

[graph: average XPackRestIT execution time in intake/PR builds over the past 30 days]

@mark-vieira
Contributor

mark-vieira commented Jan 23, 2020

Here are the graphs for DocsClientYamlTestSuiteIT per branch across all CI builds:

7.5
[graph: DocsClientYamlTestSuiteIT execution time, 7.5 branch]

7.x
[graph: DocsClientYamlTestSuiteIT execution time, 7.x branch]

master
[graph: DocsClientYamlTestSuiteIT execution time, master branch]

The clear disparity between branches makes me want to rule out anything infrastructure related, given that all of this runs on the same infra. That said, there might have been some change in 7.x and later that interacts awkwardly with our infra, or an infra change that only those branches are sensitive to.

@mark-vieira
Contributor

mark-vieira commented Jan 24, 2020

Here's an interesting thing I just found. Here are the syslogs for the CI agents running two identical build configurations:

7.5: https://infra-stats.elastic.co/app/infra#/logs?_g=()&flyoutOptions=(flyoutId:!n,flyoutVisibility:hidden,surroundingLogsId:!n)&logFilter=(expression:'host.name:elasticsearch-ci-immutable-ubuntu-1604-1579811827669019501',kind:kuery)&logPosition=(position:(tiebreaker:82329295,time:1579813230000),streamLive:!f)&logTextview=(textScale:small,wrap:!f)
7.x: https://infra-stats.elastic.co/app/infra#/logs?_g=()&flyoutOptions=(flyoutId:!n,flyoutVisibility:hidden,surroundingLogsId:!n)&logFilter=(expression:'host.name:elasticsearch-ci-immutable-fedora-29-1579709304051438758',kind:kuery)&logPosition=(position:(tiebreaker:53366760,time:1579713794000),streamLive:!f)&logTextview=(textScale:small,wrap:!t)

The 7.x builds are producing 50x more output in the syslogs due to what seems to be us logging every syscall. This is undoubtedly contributing to some overhead, not only from performing the logging itself but also from whatever instrumentation is enabled that does the logging. Does anyone have any idea why we might be getting all this extra logging all of a sudden?

@mark-vieira
Contributor

mark-vieira commented Jan 24, 2020

Ok, I take back the conclusion in my previous comment. The additional syscall logging seems to be specific to certain OSes, not branches, and we see extended test execution times even on systems where this logging isn't present, so it doesn't seem to be the cause of our problem.

Damn, back to the drawing board.

@ywelsch
Contributor

ywelsch commented Jan 24, 2020

I've found the source of the slowness here and will open a PR shortly. The gist is that with ILM/SLM we do a lot of unnecessary setup/teardown work on each test. Compounded with the slightly slower cluster state storage mechanism, this causes the tests to run much slower.

On the core/infra side, we could also look at speeding up plugin installs. It currently takes 30-40 seconds at the beginning of docs:integTest just to install the plugins:

[2020-01-24T15:22:56.594355Z] [BUILD] Installing 16 plugins
[2020-01-24T15:23:41.084037Z] [BUILD] Starting Elasticsearch process

Edit: PR #51430 added
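The sketch below is illustrative only, not the actual change in PR #51430: it shows the general shape of the fix, i.e. skipping per-test ILM/SLM cleanup (and the cluster state updates it triggers) when a test never touched any policies. The PolicyClient interface and all names here are hypothetical.

```java
import java.io.IOException;
import java.util.Map;

// Hypothetical sketch of conditional teardown: only pay for policy deletions
// (each of which triggers a cluster state update) when there is actually
// something to clean up.
final class ConditionalPolicyCleanup {

    interface PolicyClient {
        Map<String, Object> listPolicies() throws IOException;
        void deletePolicy(String name) throws IOException;
    }

    static void cleanUpPolicies(PolicyClient client, boolean testTouchedPolicies) throws IOException {
        if (testTouchedPolicies == false) {
            return; // nothing to tear down, skip the extra cluster state updates entirely
        }
        for (String name : client.listPolicies().keySet()) {
            client.deletePolicy(name);
        }
    }
}
```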

@rjernst
Member

rjernst commented Jan 24, 2020

On the core/infra side, we could also look at speeding up plugin installs.

While we could and should do that (now that we can install them all with one plugin command), the timeout here is in the test suite, whereas plugin installation happens externally, from Gradle.

@mark-vieira
Contributor

I don't believe the solution proposed in #51418 is going to help here. As I've stated before, we already run most of our CI builds in a ramdisk workspace, and those builds continue to time out at 40 minutes, whereas we were running under 20 minutes before.

ywelsch added a commit that referenced this issue Jan 28, 2020
The docs tests have recently been running much slower than before (see #49753).

The gist here is that with ILM/SLM we do a lot of unnecessary setup / teardown work on each
test. Compounded with the slightly slower cluster state storage mechanism, this causes the
tests to run much slower.

In particular, on RAMDisk, docs:check is taking

ES 7.4: 6:55 minutes
ES master: 16:09 minutes
ES with this commit: 6:52 minutes

on SSD, docs:check is taking

ES 7.4: ??? minutes
ES master: 32:20 minutes
ES with this commit: 11:21 minutes
ywelsch added a commit to ywelsch/elasticsearch that referenced this issue Jan 28, 2020
The docs tests have recently been running much slower than before (see elastic#49753).

The gist here is that with ILM/SLM we do a lot of unnecessary setup / teardown work on each
test. Compounded with the slightly slower cluster state storage mechanism, this causes the
tests to run much slower.

In particular, on RAMDisk, docs:check is taking

ES 7.4: 6:55 minutes
ES master: 16:09 minutes
ES with this commit: 6:52 minutes

on SSD, docs:check is taking

ES 7.4: ??? minutes
ES master: 32:20 minutes
ES with this commit: 11:21 minutes
ywelsch added a commit that referenced this issue Jan 28, 2020
The docs tests have recently been running much slower than before (see #49753).

The gist here is that with ILM/SLM we do a lot of unnecessary setup / teardown work on each
test. Compounded with the slightly slower cluster state storage mechanism, this causes the
tests to run much slower.

In particular, on RAMDisk, docs:check is taking

ES 7.4: 6:55 minutes
ES master: 16:09 minutes
ES with this commit: 6:52 minutes

on SSD, docs:check is taking

ES 7.4: ??? minutes
ES master: 32:20 minutes
ES with this commit: 11:21 minutes
ywelsch added a commit that referenced this issue Jan 28, 2020
The docs tests have recently been running much slower than before (see #49753).

The gist here is that with ILM/SLM we do a lot of unnecessary setup / teardown work on each
test. Compounded with the slightly slower cluster state storage mechanism, this causes the
tests to run much slower.

In particular, on RAMDisk, docs:check is taking

ES 7.4: 6:55 minutes
ES master: 16:09 minutes
ES with this commit: 6:52 minutes

on SSD, docs:check is taking

ES 7.4: ??? minutes
ES master: 32:20 minutes
ES with this commit: 11:21 minutes
@mark-vieira
Contributor

This looks to have done the trick. Thanks @ywelsch!

master
[graph: DocsClientYamlTestSuiteIT execution time after the fix, master branch]

7.x
[graph: DocsClientYamlTestSuiteIT execution time after the fix, 7.x branch]

@ywelsch
Contributor

ywelsch commented Feb 5, 2020

We got another timeout on elastic+elasticsearch+master+multijob-darwin-compatibility: https://gradle-enterprise.elastic.co/s/fczylflkgrebm

Looking at the output, the tests are running at an excruciatingly slow pace (3-4 seconds per test). Perhaps the darwin tests are not using a ramdisk, or are doing too many things in parallel?

@mark-vieira
Contributor

Thanks for reporting this @ywelsch. This looks to be an issue with that macOS Jenkins worker. We've seen a number of build failures specific to this machine, for which I have an open infra issue (https://github.com/elastic/infra/issues/17621).

You can see this test is passing just fine and running in a reasonable time on other mac workers.
[screenshot: recent DocsClientYamlTestSuiteIT runs on other mac workers, passing in a reasonable time]

Since this is an infra problem, and I've reached out regarding this, I'm going to close this issue again.
