Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] XPackRestIT timed out after 30 mins #52428

Closed
astefan opened this issue Feb 17, 2020 · 8 comments
Closed

[CI] XPackRestIT timed out after 30 mins #52428

astefan opened this issue Feb 17, 2020 · 8 comments
Labels
:Delivery/Build Build or test infrastructure Team:Delivery Meta label for Delivery team >test-failure Triaged test failures from CI

Comments

@astefan
Copy link
Contributor

astefan commented Feb 17, 2020

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_RUNTIME_JAVA=java11,nodes=general-purpose/516/console
https://gradle-enterprise.elastic.co/s/frzrlcgeo45j4

But this occurred in a scenario where one of tests it seems it left a task behind - p0=transform/transforms_stats/Test get continuous transform stats - and all the subsequent tests to spend 10 seconds in assertBusy and, thus, to add up and go over the 30 minutes timeout. Still, opening this issue for a second pair of eyes over this and given the history with yaml tests timing out lately.

@astefan astefan added :Delivery/Build Build or test infrastructure >test-failure Triaged test failures from CI labels Feb 17, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Build)

@droberts195
Copy link
Contributor

Many tests wait for pending tasks to complete during their cleanup phase so that tasks started by one test do not continue to run during the next test and cause hard-to-reproduce failures in that subsequent test.

As you say, this means that if one test leaves a rogue task behind that never completes then it fails every subsequent test that waits for pending tasks and also makes the test suite take an eternity to complete.

So maybe we should change the "wait for pending tasks" test cleanup functionality such that if the wait times out it instantly fails the entire test suite. It's just a question of how to do this in such a way that:

  • The test for which the "wait for pending tasks" timed out is reported as a failure
  • Subsequent tests are ideally not reported as failures, but as skipped

@hendrikmuhs
Copy link
Contributor

hendrikmuhs commented Feb 18, 2020

The problem seems to be the test that failed before, which is the problem described in #52429

The task leakage happens because after the failure, the test does not continue execution and therefore does not stop the task.

Independent of #52429 it raises the question why execution continues after the failure. Because after the root cause, task cleanup does not happen, therefore all subsequent tests have the leaked task. I see 2 solutions:

  • fail fast: do not continue further test execution
  • cleanup all transform tasks as part of @After, e.g. rollup does this

Conceptually I like "fail fast", however there is a reason why 2 has been implemented at least for rollup. I like to learn about the reason.

@astefan I am not understanding the links below the first failure. 7.6 should not be affected by this issue. If I follow the links the builds look fine to me.

@astefan
Copy link
Contributor Author

astefan commented Feb 18, 2020

@hendrikmuhs you are right, those links are not needed. Will remove them from the initial description.

@hendrikmuhs
Copy link
Contributor

created a PR for #52429, this should fix the failure.

I keep this issue open to clarify the desired behavior when a test fails and whether it should stop immediately or not, see #52428 (comment)

@rjernst
Copy link
Member

rjernst commented Feb 25, 2020

@hendrikmuhs I think we generally agree we should fail fast. IMO the problem is the cleanup is being done as normal test actions. Instead, could it be done as actual cleanup? Currently each yml file has a single setup/cleanup that is run before/after each test section. If the cleanup of these can't go into that cleanup section, maybe we need to consider a dedication teardown within each test section.

@hendrikmuhs
Copy link
Contributor

IMO the problem is the cleanup is being done as normal test actions. Instead, could it be done as actual cleanup? Currently each yml file has a single setup/cleanup that is run before/after each test section.

@rjernst Yes, that's a good point and should be possible, I have to move the test into a separate file but this also in the interest of cleaner test separation. I will implement your suggestion.

hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this issue Feb 28, 2020
hendrikmuhs pushed a commit that referenced this issue Mar 3, 2020
restructure transform yaml tests to run cleanup in teardown phase

relates #52428
hendrikmuhs pushed a commit that referenced this issue Mar 3, 2020
restructure transform yaml tests to run cleanup in teardown phase

relates #52428
@rjernst rjernst added the Team:Core/Infra Meta label for core/infra team label May 4, 2020
@rjernst
Copy link
Member

rjernst commented Jun 24, 2020

While XPackRestIT still suffers from timeouts on MacOS, this failure was broader and fixed a while ago.

@rjernst rjernst closed this as completed Jun 24, 2020
@mark-vieira mark-vieira added Team:Delivery Meta label for Delivery team and removed Team:Core/Infra Meta label for core/infra team labels Nov 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Delivery/Build Build or test infrastructure Team:Delivery Meta label for Delivery team >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

6 participants