[DNM] Debug freeze on CentOS 7 CI #2939

Closed
wants to merge 5 commits into from

Conversation

kolyshkin
Contributor

@kolyshkin kolyshkin commented May 5, 2021

GHA CI sometimes fails on CentOS 7 (#2907):

=== RUN   TestFreeze
    utils_test.go:85: exec_test.go:539: unexpected error: unable to freeze
        
--- FAIL: TestFreeze (0.71s)

Trying to find out what to do about it.

This is complicated because the kind of macOS host GHA allocates for the test is a lottery. In most cases it's fine, but sometimes it's slow and buggy.

@kolyshkin
Contributor Author

--- PASS: TestAdditionalGroups (0.30s)
=== RUN   TestFreeze
    utils_test.go:85: exec_test.go:539: unexpected error: unable to freeze (1000 retries, 20 thaws, last state: FREEZING)
        
--- FAIL: TestFreeze (0.63s)

Not sure what to do here. Add more iterations? Increase the timeout?

@kolyshkin kolyshkin force-pushed the debug-freeze branch 2 times, most recently from 8683ae4 to a59d45a Compare May 5, 2021 22:22
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. These tests can't be run in parallel since they check
   a global variable (mbaScEnabled).

2. findIntelRdtMountpointDir() relies on mbaScEnabled being initially
   set to the default value (false), and thus the test fails if run
   more than once:

> go test -count 2
> ...
> intelrdt_test.go:243: expected mbaScEnabled=false, got true
>    --- FAIL: TestFindIntelRdtMountpointDir/Valid_mountinfo_with_MBA_Software_Controller_disabled (0.00s)

Fixes: 2c70d23
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
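
A minimal sketch of why the intelrdt test above fails on the second run, and one way to make it repeatable. Only mbaScEnabled and the error text come from the commit message; parseMountOptions is a hypothetical stand-in for findIntelRdtMountpointDir(), the mount option strings are illustrative, and resetting the flag before each case is an assumption about the fix, not necessarily what the commit does.

```go
package main

import (
	"fmt"
	"strings"
)

// mbaScEnabled mirrors the package-level flag named in the commit message.
var mbaScEnabled bool

// parseMountOptions is a hypothetical stand-in for findIntelRdtMountpointDir():
// it sets the global flag as a side effect and never clears it.
func parseMountOptions(opts string) {
	for _, o := range strings.Split(opts, ",") {
		if o == "mba_MBps" { // illustrative option string
			mbaScEnabled = true
		}
	}
}

// runAll runs the test cases `rounds` times, similar to `go test -count N`.
// With reset=true the global is restored to its default before each case.
func runAll(rounds int, reset bool) error {
	cases := []struct {
		opts string
		want bool
	}{
		{"rw,relatime", false},
		{"rw,relatime,mba_MBps", true},
	}
	for r := 1; r <= rounds; r++ {
		for _, c := range cases {
			if reset {
				mbaScEnabled = false
			}
			parseMountOptions(c.opts)
			if mbaScEnabled != c.want {
				return fmt.Errorf("round %d: expected mbaScEnabled=%v, got %v",
					r, c.want, mbaScEnabled)
			}
		}
	}
	return nil
}

func main() {
	mbaScEnabled = false
	fmt.Println("count=2, no reset:", runAll(2, false))

	mbaScEnabled = false
	fmt.Println("count=2, with reset:", runAll(2, true))
}
```

Without the reset, the value left behind by the "enabled" case leaks into the next round's "disabled" case. The same shared state is the reason such tests cannot safely be run in parallel.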
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin kolyshkin force-pushed the debug-freeze branch 5 times, most recently from ca1340e to 92c8cb1 Compare May 6, 2021 01:05
500x each test (with and without systemd).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
I hate to keep adding those kludges, but lately TestFreeze (and
TestSystemdFreeze) from libcontainer/integration fails a lot. The
failure comes and goes, and is probably caused by a slow host
allocated for the test, and a slow VM on top of it.

To remediate, add a small sleep on every 25th iteration in between
asking the kernel to freeze and checking its status.

In the worst case scenario (failure to freeze) this adds 0.4 ms
to the duration of the call (nothing compared to that sleep after
the temporary thaw).

It is hard to measure how this affects CI but (with added debug prints)
on a histogram of number of retries I saw peaks at and after numbers
25, 50, 75 etc. meaning this works.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
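
The retry loop described in the commit message above can be sketched roughly as follows. This is not the runc code: the freezer interface and fakeFreezer are hypothetical, introduced so the loop can run without a real cgroup. The 1000-retry limit and the sleep on every 25th iteration come from this thread; the thaw on every 50th iteration is inferred from the "1000 retries, 20 thaws" message, and both sleep durations are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// freezer is a hypothetical abstraction over the freezer state file.
type freezer interface {
	Write(state string) error
	Read() (string, error)
}

const (
	maxRetries = 1000 // from "1000 retries" in the failure message
	thawEvery  = 50   // assumption: 1000/50 matches the reported 20 thaws
	sleepEvery = 25   // from the commit message
)

// freeze writes FROZEN and polls the state, returning the iterations used.
func freeze(f freezer, pause func(time.Duration)) (int, error) {
	for i := 0; i < maxRetries; i++ {
		if i%thawEvery == thawEvery-1 {
			// Occasional temporary thaw followed by a longer sleep.
			if err := f.Write("THAWED"); err != nil {
				return i, err
			}
			pause(10 * time.Millisecond) // assumed duration
		}
		if err := f.Write("FROZEN"); err != nil {
			return i, err
		}
		if i%sleepEvery == sleepEvery-1 {
			// Occasional short sleep between the write and the read.
			pause(10 * time.Microsecond) // assumed duration
		}
		state, err := f.Read()
		if err != nil {
			return i, err
		}
		if state == "FROZEN" {
			return i + 1, nil
		}
	}
	return maxRetries, errors.New("unable to freeze")
}

// fakeFreezer models a slow system: the state only settles to FROZEN
// if a pause happens after the write. With stuck set it never settles.
type fakeFreezer struct {
	stuck, pending, settled bool
	thaws, shortSleeps      int
}

func (f *fakeFreezer) Write(state string) error {
	f.pending = state == "FROZEN"
	f.settled = false
	if state == "THAWED" {
		f.thaws++
	}
	return nil
}

func (f *fakeFreezer) Read() (string, error) {
	switch {
	case f.settled:
		return "FROZEN", nil
	case f.pending:
		return "FREEZING", nil
	}
	return "THAWED", nil
}

// pause stands in for time.Sleep so the sketch runs instantly.
func (f *fakeFreezer) pause(d time.Duration) {
	if d < time.Millisecond {
		f.shortSleeps++
	}
	f.settled = f.pending && !f.stuck
}

func main() {
	slow := &fakeFreezer{}
	n, err := freeze(slow, slow.pause)
	fmt.Println("slow system:", n, "iterations, err:", err)

	stuck := &fakeFreezer{stuck: true}
	_, err = freeze(stuck, stuck.pause)
	fmt.Println("stuck system:", stuck.shortSleeps, "short sleeps,",
		stuck.thaws, "thaws, err:", err)
}
```

In this model the slow system succeeds on the first iteration that includes a short sleep, which is consistent with retry counts clustering at multiples of 25. In the failure case the loop performs 40 short sleeps and 20 thaws over 1000 retries.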
@kolyshkin
Contributor Author

OK, the conclusion is that adding an occasional short delay between writing "frozen" and reading the status back helps in this case (a very slow system).

@kolyshkin kolyshkin closed this May 6, 2021
@kolyshkin
Contributor Author

The fix is #2941
