
[CI] DocsClientYamlTestSuiteIT timed out after 30 minutes #49753

Closed
andreidan opened this issue Dec 2, 2019 · 32 comments
Labels
:Delivery/Build (Build or test infrastructure), Team:Delivery (Meta label for Delivery team), >test-failure (Triaged test failures from CI)

Comments

@andreidan
Contributor

andreidan commented Dec 2, 2019

Build failure https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=corretto11,nodes=general-purpose/362/console

Suite timeout exceeded (>= 1800000 msec)

Build scan available at https://gradle-enterprise.elastic.co/s/lp2vd4vqmj25o

@andreidan added the >test-failure (Triaged test failures from CI) and :Docs labels Dec 2, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-docs (:Docs)

@andreidan changed the title from "[CI] [7.x] DocsClientYamlTestSuiteIT timed out after 30 minutes" to "[CI] DocsClientYamlTestSuiteIT timed out after 30 minutes" Dec 2, 2019
@mark-vieira
Contributor

This has been happening increasingly often over the past couple of days because the scope of the suite keeps growing over time. I think we may need to bump this timeout in the interim until we sort out a better solution, such as splitting the suite up.

@mark-vieira
Contributor

mark-vieira commented Jan 14, 2020

Now I am indeed starting to believe we've introduced some kind of regression. This is failing consistently across many OSes and JREs. Here's the runtime graph for one such CI job over the last several builds; you can see the sudden spike.

[graph: DocsClientYamlTestSuiteIT runtime over the last several builds, showing a sudden spike]

@mark-vieira
Contributor

So something has clearly changed to make this test suite suddenly run so much slower. The reason I don't believe we are seeing this failure in intake and pull request checks is that those are split up into multiple smaller parallel builds, so there is much less contention on the worker than in the full periodic builds.

There's no single test that's running long. It simply looks as though all test cases are just a bit slower, adding up to a pretty substantial runtime hit.

Another interesting data point is that this is only happening in master and 7.x. We are not seeing the increased runtimes in 7.5. My closest guess at a potential culprit for a large change that might impact performance here is #50907.

@ywelsch @DaveCTurner Any thoughts? Any chance that change might have an adverse effect on performance on a system with a lot of other CPU and IO contention (lots of other concurrent tests)?

@danielmitterdorfer Have you seen any recent changes in performance benchmarks that might indicate some kind of regression has been introduced?

@hendrikmuhs
Contributor

hendrikmuhs commented Jan 17, 2020

For me this looks like a duplicate of #49579.

However, since 2020-01-13 the number of failures has increased, which supports the investigations above.

@DaveCTurner
Contributor

DaveCTurner commented Jan 17, 2020

I suspect #50907 and #50928. Although these changes save potentially hundreds of fsync calls from some cluster state updates in large clusters, they will add an fsync or two to cluster state updates in tiny clusters, and these tests use a tiny cluster which does a lot of cluster state updates.

Do these tests run using a ramdisk? If not, can we move them to use a ramdisk? That would make fsync calls free again.
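To make the cost concrete, here is a minimal, self-contained micro-benchmark (not from the Elasticsearch code base; the file name and iteration count are arbitrary) that times repeated small write-plus-fsync cycles. Pointing it at an SSD-backed directory versus a tmpfs/ramdisk directory shows why a couple of extra fsyncs per cluster state update add up across thousands of test-driven updates.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Rough sketch: time N small write+fsync cycles against the directory given as args[0].
// Run it once against a disk-backed path and once against a ramdisk path to compare.
public class FsyncCost {
    public static void main(String[] args) throws IOException {
        Path file = Path.of(args[0]).resolve("fsync-probe.bin");
        int iterations = 1_000;
        long start = System.nanoTime();
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer buffer = ByteBuffer.allocate(256);
            for (int i = 0; i < iterations; i++) {
                buffer.clear();
                channel.write(buffer, 0);
                channel.force(true); // flush data and metadata: cheap on a ramdisk, costly on real disks
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("%d fsyncs in %d ms (%.2f ms each)%n",
                iterations, elapsedMs, (double) elapsedMs / iterations);
    }
}
```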

@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Build)

@hendrikmuhs removed the :Docs label Jan 17, 2020
@hendrikmuhs
Contributor

hendrikmuhs commented Jan 17, 2020

While the investigation continues and will hopefully result in a proper fix, I will increase the timeout as a temporary measure, because too many builds are failing at the moment.

hendrikmuhs pushed a commit that referenced this issue Jan 17, 2020
increase timeout of DocsClientYamlTestSuiteIT to 35 minutes, temporary solution for issue #49753
hendrikmuhs pushed a commit that referenced this issue Jan 17, 2020
increase timeout of DocsClientYamlTestSuiteIT to 35 minutes, temporary solution for issue #49753
hendrikmuhs pushed a commit that referenced this issue Jan 17, 2020
increase timeout of DocsClientYamlTestSuiteIT to 35 minutes, temporary solution for issue #49753
@hendrikmuhs
Contributor

I increased the timeout to 35 minutes. This is a temporary solution; please revert it once we have a proper fix.
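For reference, a suite-level timeout bump of this kind is typically expressed with randomizedtesting's @TimeoutSuite annotation on the suite class. The sketch below is only an approximation of that change; the exact shape of DocsClientYamlTestSuiteIT in the repo may differ.

```java
import com.carrotsearch.randomizedtesting.annotations.Name;
import com.carrotsearch.randomizedtesting.annotations.ParametersFactory;
import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
import org.apache.lucene.util.TimeUnits;
import org.elasticsearch.test.rest.yaml.ClientYamlTestCandidate;
import org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase;

// Raise the whole-suite timeout to 35 minutes while the underlying slowdown is investigated.
@TimeoutSuite(millis = 35 * TimeUnits.MINUTE)
public class DocsClientYamlTestSuiteIT extends ESClientYamlSuiteTestCase {

    public DocsClientYamlTestSuiteIT(@Name("yaml") ClientYamlTestCandidate testCandidate) {
        super(testCandidate);
    }

    @ParametersFactory
    public static Iterable<Object[]> parameters() throws Exception {
        return ESClientYamlSuiteTestCase.createParameters();
    }
}
```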

@tvernum
Contributor

tvernum commented Jan 20, 2020

It doesn't look like 35 minutes is enough; the suite is still timing out.

I'll push the timeout up further.

tvernum added a commit to tvernum/elasticsearch that referenced this issue Jan 20, 2020
The docs test suite is still timing out on CI at 35 minutes, so
pushing it to 40 minutes while we determine the cause of the slowdown.

Relates: elastic#49753
tvernum added a commit that referenced this issue Jan 20, 2020
The docs test suite is still timing out on CI at 35 minutes, so
pushing it to 40 minutes while we determine the cause of the slowdown.

Relates: #49753
tvernum added a commit to tvernum/elasticsearch that referenced this issue Jan 20, 2020
The docs test suite is still timing out on CI at 35 minutes, so
pushing it to 40 minutes while we determine the cause of the slowdown.

Relates: elastic#49753
Backport of: elastic#51200
tvernum added a commit to tvernum/elasticsearch that referenced this issue Jan 20, 2020
The docs test suite is still timing out on CI at 35 minutes, so
pushing it to 40 minutes while we determine the cause of the slowdown.

Relates: elastic#49753
Backport of: elastic#51200
tvernum added a commit that referenced this issue Jan 20, 2020
The docs test suite is still timing out on CI at 35 minutes, so
pushing it to 40 minutes while we determine the cause of the slowdown.

Relates: #49753
Backport of: #51200
tvernum added a commit that referenced this issue Jan 20, 2020
The docs test suite is still timing out on CI at 35 minutes, so
pushing it to 40 minutes while we determine the cause of the slowdown.

Relates: #49753
Backport of: #51200
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
increase timeout of DocsClientYamlTestSuiteIT to 35 minutes, temporary solution for issue elastic#49753
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
The docs test suite is still timing out on CI at 35 minutes, so
pushing it to 40 minutes while we determine the cause of the slowdown.

Relates: elastic#49753
@polyfractal
Contributor

Master docs yaml tests timed out today. I couldn't find evidence of any particular test killing the cluster or of the test itself failing, so this looks to be another instance of the general slowdown.

https://gradle-enterprise.elastic.co/s/xrrdvuh3wza2s

@ywelsch
Contributor

ywelsch commented Jan 23, 2020

@mark-vieira can you establish a baseline for docs:check performance on our CI machines?

On my machine, for example, it takes 32 minutes running on an SSD, but on a ramdisk it takes 16 minutes, which brings it well below the timeout.

@polyfractal
Contributor

polyfractal commented Jan 23, 2020

Another maybe-useful data point: I just saw three timeouts come in back-to-back, and they were all from the same host. Perhaps there is something specific to certain hosts/machines/OS combinations that is exacerbating the issue? I know the OS/JRE combinations have already been analyzed, but I wasn't sure about hosts, so I thought I'd mention it.

@mark-vieira
Contributor

I just saw three timeouts come in back-to-back, and they were all from the same host

We use ephemeral workers, so we never run multiple builds on a single agent. Perhaps they were the same type of host. Our compatibility matrix tests show this timing out across all operating systems, so I don't think specific host types are the issue. Also, as I mentioned before, this regression is not in 7.5, which indicates to me that something was introduced in 7.x and later.

@mark-vieira
Contributor

On my machine, for example, it takes 32 minutes running on an SSD, but on a ramdisk it takes 16 minutes, which brings it well below the timeout.

Keep in mind this is running only a single test suite, so resource contention is very low. This is the same reason we generally don't see timeouts in pull-request or intake builds: those jobs are split up into smaller pieces with fewer concurrently executing tests.

The graph below shows the average execution time for this suite over the past 30 days for intake/PR builds (these use a ramdisk). You can see that, while we are still well below the timeout threshold, there is a clear bump in execution time around Jan 12-13. The additional load of our periodic tests (which run the entire test suite in a single build) is enough to push this over the edge.

[graph: average DocsClientYamlTestSuiteIT execution time in intake/PR builds over the past 30 days, with a bump around Jan 12-13]

@mark-vieira
Contributor

Here's the same graph but for XPackRestIT. The bump in execution time isn't as dramatic, but it's there, and the timelines align.

[graph: average XPackRestIT execution time in intake/PR builds over the past 30 days]

@mark-vieira
Contributor

mark-vieira commented Jan 23, 2020

Here are the graphs for DocsClientYamlTestSuiteIT per branch across all CI builds:

7.5
[graph: DocsClientYamlTestSuiteIT execution time, 7.5 branch]

7.x
[graph: DocsClientYamlTestSuiteIT execution time, 7.x branch]

master
[graph: DocsClientYamlTestSuiteIT execution time, master branch]

The clear disparity between branches makes me want to rule out anything infrastructure related, given that all of this runs on the same infra. That said, there might have been some change in 7.x and later that interacts awkwardly with our infra, or an infra change that only those branches are sensitive to.

@mark-vieira
Contributor

mark-vieira commented Jan 24, 2020

Here's an interesting thing I just found. Here are the syslogs for the CI agents running two identical build configurations:

7.5: https://infra-stats.elastic.co/app/infra#/logs?_g=()&flyoutOptions=(flyoutId:!n,flyoutVisibility:hidden,surroundingLogsId:!n)&logFilter=(expression:'host.name:elasticsearch-ci-immutable-ubuntu-1604-1579811827669019501',kind:kuery)&logPosition=(position:(tiebreaker:82329295,time:1579813230000),streamLive:!f)&logTextview=(textScale:small,wrap:!f)
7.x: https://infra-stats.elastic.co/app/infra#/logs?_g=()&flyoutOptions=(flyoutId:!n,flyoutVisibility:hidden,surroundingLogsId:!n)&logFilter=(expression:'host.name:elasticsearch-ci-immutable-fedora-29-1579709304051438758',kind:kuery)&logPosition=(position:(tiebreaker:53366760,time:1579713794000),streamLive:!f)&logTextview=(textScale:small,wrap:!t)

The 7.x builds are producing 50x more output in the syslogs due to what seems to be us logging every syscall. This is undoubtedly contributing to some overhead, not only from performing the logging itself but also from whatever instrumentation is enabled that does the logging. Does anyone have any idea why we might be getting all this extra logging all of a sudden?

@mark-vieira
Contributor

mark-vieira commented Jan 24, 2020

Ok, I take back the conclusion in my previous comment. The additional syscall logging seems to be specific to certain OSes, not branches, and we see extended test execution times even on systems where this logging isn't present, so it doesn't seem to be the cause of our problem.

Damn, back to the drawing board.

@ywelsch
Contributor

ywelsch commented Jan 24, 2020

I've found the source of the slowness here and will open a PR shortly. The gist is that with ILM/SLM we do a lot of unnecessary setup/teardown work on each test. Compounded with the slightly slower cluster state storage mechanism, this causes the tests to run much slower.

On the core/infra side, we could also look at speeding up plugin installs. It currently takes 30-40 seconds at the beginning of docs:integTest just to install the plugins:

[2020-01-24T15:22:56.594355Z] [BUILD] Installing 16 plugins
[2020-01-24T15:23:41.084037Z] [BUILD] Starting Elasticsearch process

Edit: PR #51430 added
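The sketch below is illustrative only, not the actual change in PR #51430: it shows the general shape of the fix, i.e. skipping per-test ILM/SLM cleanup (and the cluster state updates it triggers) when a test never touched any policies. The PolicyClient interface and all names here are hypothetical.

```java
import java.io.IOException;
import java.util.Map;

// Hypothetical sketch of conditional teardown: only pay for policy deletions
// (each of which triggers a cluster state update) when there is actually
// something to clean up.
final class ConditionalPolicyCleanup {

    interface PolicyClient {
        Map<String, Object> listPolicies() throws IOException;
        void deletePolicy(String name) throws IOException;
    }

    static void cleanUpPolicies(PolicyClient client, boolean testTouchedPolicies) throws IOException {
        if (testTouchedPolicies == false) {
            return; // nothing to tear down, skip the extra cluster state updates entirely
        }
        for (String name : client.listPolicies().keySet()) {
            client.deletePolicy(name);
        }
    }
}
```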

@rjernst
Member

rjernst commented Jan 24, 2020

On the core/infra side, we could also look at speeding up plugin installs.

While we could and should do that (now that we can install them all with one plugin command), the timeout here is in the test suite, whereas plugin installation happens externally, from Gradle.

@mark-vieira
Contributor

I don't believe the solution proposed in #51418 is going to help here. As I've stated before, we already run most of our CI builds in a ramdisk workspace, and those builds continue to time out at 40 minutes, whereas we were running under 20 minutes before.

ywelsch added a commit that referenced this issue Jan 28, 2020
The docs tests have recently been running much slower than before (see #49753).

The gist here is that with ILM/SLM we do a lot of unnecessary setup / teardown work on each
test. Compounded with the slightly slower cluster state storage mechanism, this causes the
tests to run much slower.

In particular, on RAMDisk, docs:check is taking

ES 7.4: 6:55 minutes
ES master: 16:09 minutes
ES with this commit: 6:52 minutes

on SSD, docs:check is taking

ES 7.4: ??? minutes
ES master: 32:20 minutes
ES with this commit: 11:21 minutes
ywelsch added a commit to ywelsch/elasticsearch that referenced this issue Jan 28, 2020
The docs tests have recently been running much slower than before (see elastic#49753).

The gist here is that with ILM/SLM we do a lot of unnecessary setup / teardown work on each
test. Compounded with the slightly slower cluster state storage mechanism, this causes the
tests to run much slower.

In particular, on RAMDisk, docs:check is taking

ES 7.4: 6:55 minutes
ES master: 16:09 minutes
ES with this commit: 6:52 minutes

on SSD, docs:check is taking

ES 7.4: ??? minutes
ES master: 32:20 minutes
ES with this commit: 11:21 minutes
ywelsch added a commit that referenced this issue Jan 28, 2020
The docs tests have recently been running much slower than before (see #49753).

The gist here is that with ILM/SLM we do a lot of unnecessary setup / teardown work on each
test. Compounded with the slightly slower cluster state storage mechanism, this causes the
tests to run much slower.

In particular, on RAMDisk, docs:check is taking

ES 7.4: 6:55 minutes
ES master: 16:09 minutes
ES with this commit: 6:52 minutes

on SSD, docs:check is taking

ES 7.4: ??? minutes
ES master: 32:20 minutes
ES with this commit: 11:21 minutes
ywelsch added a commit that referenced this issue Jan 28, 2020
The docs tests have recently been running much slower than before (see #49753).

The gist here is that with ILM/SLM we do a lot of unnecessary setup / teardown work on each
test. Compounded with the slightly slower cluster state storage mechanism, this causes the
tests to run much slower.

In particular, on RAMDisk, docs:check is taking

ES 7.4: 6:55 minutes
ES master: 16:09 minutes
ES with this commit: 6:52 minutes

on SSD, docs:check is taking

ES 7.4: ??? minutes
ES master: 32:20 minutes
ES with this commit: 11:21 minutes
@mark-vieira
Contributor

This looks to have done the trick. Thanks @ywelsch!

master
[graph: DocsClientYamlTestSuiteIT execution time after the fix, master branch]

7.x
[graph: DocsClientYamlTestSuiteIT execution time after the fix, 7.x branch]

@ywelsch
Contributor

ywelsch commented Feb 5, 2020

We got another timeout on elastic+elasticsearch+master+multijob-darwin-compatibility: https://gradle-enterprise.elastic.co/s/fczylflkgrebm

Looking at the output, the tests are running at an excruciatingly slow pace (3-4 seconds per test). Perhaps the darwin tests are not using a ramdisk, or are doing too many things in parallel?

@mark-vieira
Contributor

Thanks for reporting this @ywelsch. This looks to be an issue with that macOS Jenkins worker. We've seen a number of build failures specific to this machine, for which I have an open infra issue (https://github.com/elastic/infra/issues/17621).

You can see this test is passing just fine and running in a reasonable time on other mac workers.
[screenshot: recent DocsClientYamlTestSuiteIT runs on other mac workers, passing in a reasonable time]

Since this is an infra problem, and I've reached out regarding this, I'm going to close this issue again.
