[CI] IndexRecoveryIT testRerouteRecovery failing #99941
Comments
Pinging @elastic/es-distributed (Team:Distributed) |
Muted in #100209 |
In the failure cases, I consistently see fewer data files found for recovery. Failed run:
Successful run:
In a successful run, mostly stats commands are running while waiting for throttling stats to increase. And the code finishes in short order.
In a failed run, however, stats commands are relatively infrequent and seem to stall for roughly 6 seconds. Node A shows increasing throttling stats, but Node B never does.
|
As an update here: Dave used git bisect to identify when the failures started occurring, and it was a JDK 21 update that triggered them. The conclusion is that the test likely isn't stable, and the JDK changes shifted timing enough to destabilize it. The next step is to split the test into two separate tests, one for throttling on the source node and one for throttling on the target node, since throttling on the source can reasonably mean that no throttling is necessary on the target. It is still curious that we do see throttling on the target in the test's happy path, so that will be investigated further. One lead is that file size affects throttling, since the failure case consistently has few data files, and that could turn out to be a bug. |
Confirmed that it makes sense that Node B makes no progress on throttling. RateLimiter.pause() returns the time waited, and from adding logging to this code, the test calls it once per second, because the source recovery is throttling progress. The SimpleRateLimiter.pause() function (part of the Lucene library) returns 0 if the desired wait time has already elapsed since the last pause. My theories about why the test is failing are a) it is simply random whether the timing lines up such that the target node throttles a little (enough for the test to pass) or not at all; or b) perhaps the persisted settings aren't picked up. I've put up a PR with a fix, in which the tests reliably pass, and I don't think this is worth further time investigating. |
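To illustrate the behavior described above, here is a minimal sketch of a rate limiter whose pause returns 0 once the desired wait has already elapsed. It is a simplified model, not Lucene's code: the class name `ToyRateLimiter`, the injected `nowNs` clock argument, and the 1 MB/s limit with 500 KB chunks are all assumptions chosen for illustration.

```java
// Toy model (hypothetical, NOT Lucene's implementation) of a rate limiter
// whose pause() reports 0 when enough time has passed since the last call.
public class ToyRateLimiter {
    private final double bytesPerSec;
    private long lastNs;

    public ToyRateLimiter(double bytesPerSec, long startNs) {
        this.bytesPerSec = bytesPerSec;
        this.lastNs = startNs;
    }

    /**
     * Returns the nanoseconds a caller would have to wait before sending
     * `bytes`, or 0 if the required gap since the last call has already
     * elapsed. The clock is passed in so the example is deterministic;
     * no real sleeping happens here.
     */
    public synchronized long pause(long bytes, long nowNs) {
        long targetNs = lastNs + (long) (1e9 * bytes / bytesPerSec);
        if (nowNs >= targetNs) {
            lastNs = nowNs;      // no throttling needed
            return 0;
        }
        lastNs = targetNs;       // model having slept until targetNs
        return targetNs - nowNs;
    }

    public static void main(String[] args) {
        // Target node: one 500 KB chunk per second at a 1 MB/s limit.
        // Each call arrives after the required 0.5 s gap, so throttle time stays 0.
        ToyRateLimiter target = new ToyRateLimiter(1_000_000.0, 0L);
        long targetThrottleNs = 0;
        for (int i = 1; i <= 5; i++) {
            targetThrottleNs += target.pause(500_000L, i * 1_000_000_000L);
        }
        System.out.println("target throttle ns: " + targetThrottleNs);

        // Source node: asks again after only 0.1 s, so it must wait.
        ToyRateLimiter source = new ToyRateLimiter(1_000_000.0, 0L);
        System.out.println("source throttle ns: " + source.pause(500_000L, 100_000_000L));
    }
}
```

Under these assumed numbers, a target that is already paced by a throttled source never accumulates throttle time of its own, which matches the observation that Node B's throttling stats do not increase.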
The test is currently flaky, and it was flaky upstream as well: elastic/elasticsearch#99941. In ES they split out the throttle checks into dedicated tests: elastic/elasticsearch#100788. We can't copy them due to the license, but we can at least remove the broken checks to make the test non-flaky.
Build scan:
https://gradle-enterprise.elastic.co/s/d3cfn2jnvzye4/tests/:server:internalClusterTest/org.elasticsearch.indices.recovery.IndexRecoveryIT/testRerouteRecovery
Reproduction line:
Applicable branches:
main
Reproduces locally?:
Didn't try
UPDATE:
Does not reproduce locally on my Mac, but does on a remote cloud host running
Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.indices.recovery.IndexRecoveryIT&tests.test=testRerouteRecovery
Failure excerpt: