
[CI] ClusterRerouteIT.testDelayWithALargeAmountOfShards timed out waiting for green state #52000

Closed
tlrx opened this issue Feb 6, 2020 · 9 comments
Assignees
Labels
:Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test-failure Triaged test failures from CI

Comments

@tlrx
Member

tlrx commented Feb 6, 2020

The test ClusterRerouteIT » testDelayWithALargeAmountOfShards failed today on CI for the 7.x branch with the error:

java.lang.AssertionError: timed out waiting for green state
	at __randomizedtesting.SeedInfo.seed([B25CE8847D0ACBC2:9E2DE2656C14E120]:0)
	at org.junit.Assert.fail(Assert.java:88)
	at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:929)
	at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:882)
	at org.elasticsearch.cluster.allocation.ClusterRerouteIT.testDelayWithALargeAmountOfShards(ClusterRerouteIT.java:224)

Build scan: https://gradle-enterprise.elastic.co/s/3l2rtho4qznu6

Looking at the build stats, it also fails on 7.6 (https://gradle-enterprise.elastic.co/s/bxlllkb4jubxs) and seems to be related to the new way of persisting metadata state (see #50907), in which case the tests might need to be adapted?
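
For context, the failing assertion comes from `ESIntegTestCase.ensureGreen`, which polls cluster health and fails with exactly this message when green is not reached in time. A minimal, hedged sketch of that pattern (this is not the actual ClusterRerouteIT test; the index name and settings below are assumptions):

```java
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.test.ESIntegTestCase;

import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked;

// Illustrative sketch only: the general shape of an ESIntegTestCase that waits for
// green health. NOT the actual ClusterRerouteIT code; index name and shard counts
// are assumptions made for the sketch.
public class GreenHealthSketchIT extends ESIntegTestCase {

    public void testWaitsForGreenHealth() {
        // Create an index with several shards and a replica so that recovery has real work to do.
        assertAcked(prepareCreate("test").setSettings(Settings.builder()
                .put("index.number_of_shards", 5)
                .put("index.number_of_replicas", 1)));

        // ensureGreen() waits for cluster health to reach green and fails the test
        // with "timed out waiting for green state" if the timeout elapses first.
        ensureGreen("test");
    }
}
```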

@tlrx tlrx added >test-failure Triaged test failures from CI :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Feb 6, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

@DaveCTurner DaveCTurner self-assigned this Feb 12, 2020
@pugnascotia
Contributor

Failed again on 7.6 - https://gradle-enterprise.elastic.co/s/wmy4bjcpxqnte

@henningandersen
Contributor

henningandersen commented Feb 18, 2020

Failed again on master: https://gradle-enterprise.elastic.co/s/getynn6hnzfm2

I looked into this a bit: it looks like the retention lease sync causes the shards to become stale one by one. I'm not sure whether this is part of why it is timing out, but it is at least a significant difference from when this test was written.

Did not reproduce locally.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 18, 2020
It looks like elastic#52000 is caused by a slowdown in cluster state application
(maybe due to elastic#50907) but I would like to understand the details to ensure that
there's nothing else going on here too before simply increasing the timeout.
This commit enables some relevant `DEBUG` loggers and also captures stack
traces from all threads rather than just the three hottest ones.
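
The commit message above mentions enabling extra `DEBUG` loggers via the test framework. As a hedged sketch (the logger names and class name are assumptions, not the actual change, and the annotation's exact elements vary by version), suite-level debug logging typically looks like this:

```java
import org.elasticsearch.test.ESIntegTestCase;
import org.elasticsearch.test.junit.annotations.TestLogging;

// Sketch only: suite-level DEBUG logging via the Elasticsearch test framework.
// The logger names and class name are illustrative assumptions, not the ones
// from the referenced commit.
@TestLogging(
    value = "org.elasticsearch.cluster.service:DEBUG,org.elasticsearch.gateway:DEBUG",
    reason = "investigating slow cluster state application (#52000)")
public class ClusterRerouteDebugLoggingSketchIT extends ESIntegTestCase {
    // test methods elided
}
```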
DaveCTurner added a commit that referenced this issue Feb 18, 2020
DaveCTurner added a commit that referenced this issue Feb 18, 2020
@DaveCTurner
Contributor

Hmm. All 7 failures of this test over the last 30 days were running under org.apache.lucene.mockfile.WindowsFS. The test typically completes in 30-60 seconds, but the failing runs took 6-8 minutes. WindowsFS adds extra validation that we behave correctly on filesystems that don't let you delete open files; I wonder if this is introducing extra slowness.
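
To illustrate the semantics `WindowsFS` emulates (a plain `java.nio` sketch, not the mock filesystem's own API): on a typical POSIX filesystem the delete below succeeds even though the channel is still open, whereas Windows-like semantics refuse it.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Plain java.nio sketch of the behaviour WindowsFS emulates: deleting a file
// that still has an open handle is refused under Windows-like semantics,
// whereas a typical POSIX filesystem allows it.
public class DeleteOpenFileSketch {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("open-file", ".tmp");
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            try {
                Files.delete(file); // WindowsFS (and real Windows) would throw here
                System.out.println("delete succeeded while the file was open (POSIX-like)");
            } catch (IOException e) {
                System.out.println("delete of an open file refused: " + e);
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```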

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 18, 2020
Issue elastic#52000 looks like a case of cluster state updates being slower than
expected, but it seems that these slowdowns are relatively rare: most
invocations of `testDelayWithALargeAmountOfShards` take well under a minute in
CI, but there are occasional failures that take 6+ minutes instead.  When it
fails like this, cluster state persistence seems generally slow: most are
slower than expected, with some small updates even taking over 2 seconds to
complete.

The failures all have in common that they use `WindowsFS` to emulate Windows'
behaviour of refusing to delete files that are still open, by tracking all
files (really, inodes) and validating that deleted files are really closed
first. There is a suggestion that this is a little slow in the Lucene test
framework [1]. To see if we can attribute the slowdown to that common factor,
this commit suppresses the use of `WindowsFS` for this test suite.

[1] https://github.com/apache/lucene-solr/blob/4a513fa99f638cb65e0cae59bfdf7af410c0327a/lucene/test-framework/src/java/org/apache/lucene/util/TestRuleTemporaryFilesCleanup.java#L166
@DaveCTurner
Contributor

I opened #52488 to suppress WindowsFS on this test suite.

I am not marking #52488 as closing #52000 since I am not fully certain that this is the underlying problem. I would prefer to see a couple of weeks without failures before declaring this issue closed.
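
A hedged sketch of what suppressing `WindowsFS` for a suite typically looks like with the Lucene test framework annotation (the class name is illustrative; #52488 contains the actual change):

```java
import org.apache.lucene.util.LuceneTestCase;
import org.elasticsearch.test.ESIntegTestCase;

// Sketch only: tells the Lucene test framework not to wrap this suite's
// filesystem in the named mock. The class name is illustrative; see #52488
// for the actual change.
@LuceneTestCase.SuppressFileSystems("WindowsFS")
public class ClusterRerouteSuppressedFsSketchIT extends ESIntegTestCase {
    // tests elided
}
```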

DaveCTurner added a commit that referenced this issue Feb 19, 2020
DaveCTurner added a commit that referenced this issue Feb 19, 2020
@DaveCTurner
Contributor

Echoing Henning's comment here: if, after a couple of weeks, we decide that WindowsFS was the source of our slowdown, we have a few options to discuss:

  • reduce the scope of the suppression by moving the one slow test into its own suite and suppressing WindowsFS there
  • reinstate WindowsFS and either
    • extend the timeout anyway, possibly only if WindowsFS is in use, or
    • dig further into why WindowsFS is slow for us here and fix it upstream
  • take no further action

@DaveCTurner
Contributor

We discussed this today and decided to take no further action unless we see another failure, so we can close this; please reopen it if you come across another one.

sbourke pushed a commit to sbourke/elasticsearch that referenced this issue Feb 19, 2020
sbourke pushed a commit to sbourke/elasticsearch that referenced this issue Feb 19, 2020
@probakowski
Contributor

Another failure from today: https://gradle-enterprise.elastic.co/s/gws7m6tms6coy

@probakowski probakowski reopened this Mar 4, 2020
DaveCTurner added a commit that referenced this issue Mar 5, 2020
@DaveCTurner
Contributor

WindowsFS strikes again. I backported #52488 to the 7.6 branch in 2878ec8.
