[CI] Tests timing out occasionally #4209

Closed
IAlibay opened this issue Jul 25, 2023 · 33 comments · Fixed by #4707 · May be fixed by #4584

Comments

@IAlibay
Member

IAlibay commented Jul 25, 2023

For some reason, some of the tests are timing out every so often. It has mostly been the Ubuntu full-deps Python 3.10 GitHub Actions runs, but we've seen it on the macOS and Azure runners too.

It's rather hard to debug, and I'm currently cycling CI with pytest-timeout in #4197

@IAlibay
Member Author

IAlibay commented Jul 25, 2023

From a look at the top 50 slowest tests, there seems to be some kind of systematic increase in test run times on the failing runners.

Note: test_streamplot_2D is > 20s even on a good day!

@IAlibay
Member Author

IAlibay commented Jul 28, 2023

@MDAnalysis/coredevs this is actually a pretty big issue; if anyone has any ideas here, that'd be appreciated.

@IAlibay
Member Author

IAlibay commented Jul 28, 2023

My initial suspicion is that this might be happening: pytest-dev/pytest#11174

Need to gather a bit more evidence though (which annoyingly means re-running the tests until we can gather more random failures).

A potential short-term solution is to just add pytest-timeout, but that only shifts the problem from a job not finishing to a job failing (so it counts as a failure either way and CI remains red).

@IAlibay
Member Author

IAlibay commented Jul 30, 2023

Unfortunately I've been unable to narrow down where this is happening. From runs on #4197 my current conclusions are:

  1. The frequency at which these stalling runs occur is fairly low. It affects nearly every PR at some point, but we're looking at maybe a ~1 in 20 chance?
  2. There is no specific test which times out (once it was a multiprocessing one, another time nsgrid...).
  3. It is not isolated to GitHub Actions (we saw it happen on Azure Pipelines).

My current, completely random guess is that it has something to do with the latest pytest 7.4.0 release and maybe coverage? (I haven't seen it happen on runners that don't use pytest-cov functionality.)

I've run out of ideas for now and I don't think I have the time to debug any further, so I'm going to propose the following:

  1. We add pytest-timeout to our CI runs (sketched below). At least runners will then complete, even though they will fail. That will save a few trees and allow us to just ignore timeout issues if we don't think they are relevant.
  2. We keep this issue open and re-assess over time.
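A minimal sketch of what step 1 could look like (the 200 s value and the exact invocation are placeholders, not decisions made here):

# hypothetical CI step, assuming pytest-timeout is added to the test environment
python -m pip install pytest-timeout
# fail any single test after 200 s instead of letting the whole job hang until the runner limit
python -m pytest --timeout=200 --timeout-method=thread  # plus the usual test selection arguments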

@hmacdope
Member

Sounds good to me @IAlibay, thanks for putting in a great effort trying to figure it out.

@RMeli
Member

RMeli commented Jul 30, 2023

Sounds very sensible. Thanks for all the work!

tylerjereddy added a commit to tylerjereddy/mdanalysis that referenced this issue Aug 1, 2023
* speed up `test_streamplot_2D()` because it has been
reported to take ~20 seconds on a regular basis
in CI in MDAnalysis gh-4209

* we don't really need to plot the output, which
was taking most of the time, and instead we can
just check the data structures that MDAnalysis
returns (this may be a better test by some definitions
anyway...); I suppose we could also spot check
a few values in the arrays if we wanted as well

* locally, that single test seems to run in 0.39 s
on this branch vs. 4.7 s on `develop`
@drew-parsons

drew-parsons commented Aug 3, 2023

I'm seeing a consistent timeout (after 2.5 hours) in mdanalysis 2.5.0 on i386 after updating gsd to v3, see https://ci.debian.net/packages/m/mdanalysis/testing/i386/

The timeout occurs in testsuite/MDAnalysisTests/parallelism/test_multiprocessing.py, so I suspect it might be a race condition in multiprocessing and hence a separate issue from the pytest-timeout issue discussed here.

@IAlibay
Member Author

IAlibay commented Aug 3, 2023

Interesting, I completely forgot that the latest version of GSD had a very strange memory buffer thing going on.

We did see tests time out in other places though, so I'm not fully convinced it's the sole issue here, but it is definitely worth investigating.

@IAlibay
Member Author

IAlibay commented Aug 3, 2023

Although @drew-parsons, MDAnalysis 2.5.0 doesn't support GSD 3.0+. The GSD release came after the 2.5.0 release, so we didn't get to retroactively pin it in our setup.py, but it is reflected in our conda recipe.

We're pretty swamped with just the two package indexes (PyPI and conda-forge), so this kind of stuff tends to completely fall off our radar. Is there any way we can better sync up / report on this?

@drew-parsons

True, I patched MDAnalysis 2.5.0 with PR #4153 and PR #4174.

@IAlibay
Member Author

IAlibay commented Aug 4, 2023

@drew-parsons, I would very much appreciate it if you didn't have to do that. Ideally the source code which we provide for releases here should match what folks encounter in the wild.

If you need us to do a bugfix release then I am more than happy to do that if necessary.

Would you maybe be interested in having a quick call to discuss your needs here? I don't fully understand the Debian packaging ecosystem and I would like to make sure we are providing the right things where possible.

If this is of interest, you could maybe ping us an email at mdanalysis@numfocus.org, or we can discuss this more on our Discord (https://discord.gg/fXTSfDJyxE).

@drew-parsons

In a sense it's not a huge issue. Only i386 is failing here; other arches are happy. As far as the Debian distribution goes, the issue is in Debian unstable, which is "supposed" to break from time to time. We'll fix it. The reason Debian needed gsd 3 is that gsd 2.7 was failing to build with the latest Python tools (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1042136). It was simplest to upgrade gsd to v3, where this was fixed. gsd is being relegated to optional anyway from the perspective of mdanalysis, so temporary disruption on i386 is not a big problem (we can just skip the test if necessary). But a bugfix release would also be welcome. Good to not skip tests if we can run them successfully.

@drew-parsons

Worth noting, the i386 test passes now.
https://ci.debian.net/data/autopkgtest/testing/i386/m/mdanalysis/36427124/log.gz

IAlibay pushed a commit that referenced this issue Aug 10, 2023
* TST: test_streamplot_2D faster

* TST: PR 4221 revisions

* use `pytest.approx` for single value comparisons

* `u1` and `v1` are now checked more thoroughly for
their actual floating point values
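For illustration, a minimal, self-contained version of the assertion style described in the commit message above (the arrays here are stand-ins, not the actual streamline output):

import numpy as np
import pytest

def test_streamline_arrays_sketch():
    # stand-ins for the u1/v1 displacement arrays produced by the streamline code
    u1 = np.array([0.25, 0.50, 0.75])
    v1 = np.array([0.10, 0.20, 0.30])
    # check the returned data structures directly instead of rendering a plot
    assert u1.shape == v1.shape
    # element-wise comparison with a relative tolerance via pytest.approx
    assert u1 == pytest.approx(np.array([0.25, 0.50, 0.75]), rel=1e-6)
    # spot-check a single value
    assert v1[1] == pytest.approx(0.20)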
@IAlibay
Member Author

IAlibay commented Sep 2, 2023

This is still an issue; we are still getting a ton of failures with test_creating_multiple_universe_without_offset. Can one of the @MDAnalysis/coredevs please step up and actively try to fix this?

@IAlibay IAlibay removed the downstream label Sep 2, 2023
@orbeckst
Member

orbeckst commented Sep 2, 2023

test_creating_multiple_universe_without_offset

def test_creating_multiple_universe_without_offset(temp_xtc, ncopies=3):
is a bit of an odd test: it was supposed to check that we fixed a race condition, but the comments say that it doesn't actually exercise that situation:

    #  test if they can be created without generating
    #  the offset simultaneously.
    #  The tested XTC file is way too short to induce a race scenario but the
    #  test is included as documentation for the scenario that used to create
    #  a problem (see PR #3375 and issues #3230, #1988)

@yuxuanzhuang you wrote the test. Do you have some more insights what might be happening here? Are we using some sort of file-locking that might block everything?

I have seen this test come up (always?) when I restarted a timed-out CI runner, so it makes sense to figure this one out and then see if anything else is odd.

@yuxuanzhuang
Contributor

yuxuanzhuang commented Sep 7, 2023

I have been trying to battle with it over the past couple of days, but none of my attempts seem to be working. The branch I am currently testing on is the one with #4162 merged, so it has a much higher failure rate :) (yuxuanzhuang#6).

Here are the attempts that have failed:

  • Running pytest without multiple workers.
  • Removing the fasteners locking code.
  • Disabling test_creating_multiple_universe_without_offset (the failure can still occur elsewhere when a multiprocessing pool is started). As a side note, I have no idea why the tests passed last time we disabled it in @marinegor's PR; apparently it's a stochastic thing :)
  • Adding mp.set_start_method("spawn", force=True) as suggested in the comments above (see the sketch after this list).
  • Using --dist loadgroup to constrain all the parallel tests to the same worker (since running pytest with only one worker also failed, this obviously didn't work either).
  • Adding p.join() to make sure all child processes have terminated.
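For reference, the "spawn" attempt from the list above boils down to something like this (a standalone sketch, not the actual test-suite code):

import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    # force the "spawn" start method so workers begin from a fresh interpreter
    # instead of inheriting the parent's state via fork (the default on Linux)
    mp.set_start_method("spawn", force=True)
    with mp.Pool(processes=4) as pool:
        print(pool.map(square, range(8)))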

The only solution seems to be disabling all tests that involve multiprocessing (dask, on the other hand, seems fine).

I believe the test timeout is related to the multiprocessing pool not being able to terminate when a lot of jobs are present. Given that it's almost a 100% failure rate once the parallel analysis code is merged, compared to a ~1 in 20 chance (?) now, I would assume that the more multiprocessing pools there are, the higher the chance of failure.

The other problem is that it never failed on my local workstation or on the compute cluster nodes—so I need a 30-minute testing cycle to rule out any possible cases on GitHub, which is really frustrating. Moreover, sometimes the timeout appears as a test failure, and sometimes it simply doesn't show up except in the duration (e.g. here).

@IAlibay
Member Author

IAlibay commented Sep 7, 2023

@yuxuanzhuang that last test result you point to is a codecov failure, that's a different problem?

@yuxuanzhuang
Contributor

that last test result you point to is a codecov failure, that's a different problem?

I think the codecov failure is just because I hacked the repository to run the test on my own repo?

@IAlibay
Member Author

IAlibay commented Sep 7, 2023

@yuxuanzhuang
Contributor

yes

[2023-09-04T15:14:35.507Z] ['error'] There was an error running the uploader: Error uploading to https://codecov.io: Error: There was an error fetching the storage URL during POST: 400 - [ErrorDetail(string='This repository has been deactivated. To resume uploading to it, please activate the repository in the codecov UI: https://app.codecov.io/github/yuxuanzhuang/mdanalysis/settings', code='invalid')]

@IAlibay
Member Author

IAlibay commented Mar 10, 2024

@hmacdope I'm assigning this to you since you had done so in #4475

@orbeckst
Member

Primarily I now see test_creating_multiple_universe_without_offset() failing across multiple Python versions and architectures. This is getting annoying. I prefer to merge PRs when everything is green, but for the last three merges that I did I had to override failures due to timing out on test_creating_multiple_universe_without_offset().

@hmacdope
Member

hmacdope commented Apr 1, 2024

@orbeckst I will try and dig into this one ASAP.

@yuxuanzhuang
Contributor

I believe I've finally identified the culprit behind the timeout: the SIGTERM handler introduced by GSD: https://github.com/glotzerlab/gsd/blob/trunk-patch/gsd/__init__.py.

I'm not very familiar with Python's signal handling, so I welcome any insights or suggestions from others on whether we can address it downstream or need to contact the GSD people.

Below is a minimal example that reproduces the hanging issue on my laptop (Mac M3 Pro):

for i in {1..100}; do python test_sig.py; done
...
Exception ignored in: <function _get_module_lock.<locals>.cb at 0x1016ae160>
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 451, in cb
  File "/Users/scottzhuang/jupyter_ground/sig.py", line 5, in <lambda>
    signal.signal(signal.SIGTERM, lambda n, f: sys.exit(1))
                                               ^^^^^^^^^^^
SystemExit: 1

test_sig.py

#import logging
import sig

#logger = logging.getLogger("mp_test")

if __name__ == "__main__":
    from multiprocessing import Pool, log_to_stderr

    # Uncomment for logging
    # logger = log_to_stderr()
    # logger.setLevel(logging.DEBUG)

    with Pool(processes=12) as pool:
        results = pool.map(sig.get_number, range(12))
    print(results)

sig.py

import signal
import sys

# Introduced in GSD (https://github.com/glotzerlab/gsd/blob/trunk-patch/gsd/__init__.py)
try:
    signal.signal(signal.SIGTERM, lambda n, f: sys.exit(1))
except ValueError:
    pass

def get_number(x):
    return x + 1

@orbeckst
Member

orbeckst commented Sep 8, 2024

FYI: There was some concern about this code raised by @IAlibay in glotzerlab/gsd#255 .

@IAlibay
Member Author

IAlibay commented Sep 8, 2024

Ahhhh this makes a lot of sense. I had completely forgotten about that one - indeed the way GSD is doing this seems like it would be prone to issues.

Thanks so much for all your work identifying this @yuxuanzhuang !

My take is that we should just raise this upstream and remove GSD from the relevant multiprocessing tests; it's not something that can be fixed without a patch upstream.

@hmacdope
Member

hmacdope commented Sep 8, 2024

Amazing work @yuxuanzhuang! Thanks so much for diving deep on this, much appreciated.

@yuxuanzhuang
Contributor

A few additional thoughts:

  • Since gsd is already imported in our initial import, simply removing the GSD-file multiprocessing tests won't help, as the job will still hang when a worker is just running import MDAnalysis, even though it's not doing anything GSD-related.
  • One approach could be to lazily import gsd only when it is actually needed.
  • Another option is to reset the signal handler using signal.signal(signal.SIGTERM, signal.SIG_DFL) right before starting any multiprocessing tasks on our end.

Since the second approach could potentially cause unforeseen interference with other packages, I would prefer the first approach if it's feasible (both are sketched below).
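Rough sketches of the two options, with hypothetical helper names (this is not the actual MDAnalysis code):

import signal

# Option 1: defer "import gsd" until a GSD file is actually read, so that a plain
# "import MDAnalysis" never installs GSD's SIGTERM handler in the first place.
def _require_gsd():
    try:
        import gsd.hoomd  # only happens when the GSD reader is used
    except ImportError as err:
        raise ImportError("reading GSD files requires the gsd package") from err
    return gsd.hoomd

# Option 2: put the default SIGTERM behaviour back right before spawning workers,
# undoing the handler that gsd installs at import time.
def _reset_sigterm():
    signal.signal(signal.SIGTERM, signal.SIG_DFL)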

@IAlibay
Member Author

IAlibay commented Sep 8, 2024

One approach could be to lazily import gsd only when it is actually needed.

GSD is an optional dependency; this is something we should be doing as much as possible.

@IAlibay
Member Author

IAlibay commented Sep 8, 2024

My suggested approach:

  1. Report the issue upstream and see what they say
  2. If it doesn't look like it will be fixed any time soon, then we look into lazy importing where necessary
  3. If that looks like it would take more than a day of work, then we remove GSD from our optional test dependencies for the standard CI. We can keep GSD on our cron jobs (that's less disruptive and easier to just restart until it passes), and we can add an optional xfail if GSD is installed.
  4. If a fix doesn't look like it would come by our next release +1, we deprecate GSD.

@IAlibay
Member Author

IAlibay commented Sep 8, 2024

Actually, I would even recommend we move towards option 3 either way; we can always revert the CI inclusion later.

@orbeckst
Member

orbeckst commented Sep 9, 2024

Really anything to get the CI to a state where we can get just a green pass for normal PRs... Option 3 sounds sensible.

If we xfail GSD tests in standard CI, should we reduce the timeout duration from 200 s to, say, 20 s, so that these failures happen more quickly and don't hold up runners?
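If we go that route, a per-test marker sketch could look like the following (HAS_GSD and the test name are placeholders, and 20 s is just the value floated above):

import pytest

try:
    import gsd  # noqa: F401
    HAS_GSD = True
except ImportError:
    HAS_GSD = False

# pytest-timeout marker: give up after 20 s instead of the global 200 s
@pytest.mark.timeout(20)
# expected failure only when GSD (and hence its SIGTERM handler) is installed
@pytest.mark.xfail(HAS_GSD, reason="GSD's SIGTERM handler can hang multiprocessing workers")
def test_multiprocessing_with_gsd_sketch():
    assert True  # placeholder body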

@orbeckst orbeckst assigned yuxuanzhuang and unassigned hmacdope Sep 9, 2024
@orbeckst
Member

orbeckst commented Sep 9, 2024

I assigned @yuxuanzhuang , primarily to indicate who's been doing the expert code sleuthing here ;-) 🔎
