Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[QUIC] Move ByteMixingOrNativeAVE_MinimalFailingTest to OuterLoop #55595

Merged
merged 2 commits into from
Jul 13, 2021

Conversation

CarnaViire
Copy link
Member

@CarnaViire CarnaViire commented Jul 13, 2021

Disable ByteMixingOrNativeAVE_MinimalFailingTest test by ActiveIssue #55588
Move ByteMixingOrNativeAVE_MinimalFailingTest to outerloop. Fixes #55588

@ghost
Copy link

ghost commented Jul 13, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Disable ByteMixingOrNativeAVE_MinimalFailingTest test by ActiveIssue #55588

Author: CarnaViire
Assignees: -
Labels:

area-System.Net

Milestone: -

@ghost
Copy link

ghost commented Jul 13, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Disable ByteMixingOrNativeAVE_MinimalFailingTest test by ActiveIssue #55588

Author: CarnaViire
Assignees: -
Labels:

area-System.Net.Quic

Milestone: -

@wfurt
Copy link
Member

wfurt commented Jul 13, 2021

can we just move just to outer loop? That may be good enough as it won't block others.

@CarnaViire
Copy link
Member Author

CarnaViire commented Jul 13, 2021

can we just move just to outer loop? That may be good enough as it won't block others.

Yep, I was not sure what will be better, so did both 😂

@wfurt
Copy link
Member

wfurt commented Jul 13, 2021

Let's try it. Outerloop would still probably catch crashes.

@CarnaViire CarnaViire changed the title [QUIC] Disable ByteMixingOrNativeAVE_MinimalFailingTest [QUIC] Move ByteMixingOrNativeAVE_MinimalFailingTest to OuterLoop Jul 13, 2021
Copy link
Member

@wfurt wfurt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@geoffkizer
Copy link
Contributor

We shouldn't just move this to Outerloop, because it's just going to keep timing out there.

This tests takes a long time to run by design. We should either reduce the time it takes (e.g. reduce iterations from 20 to ??), or increase the timeout.

@wfurt
Copy link
Member

wfurt commented Jul 13, 2021

How much time you think we need @geoffkizer ?

@geoffkizer
Copy link
Contributor

Am I reading the code correctly, that the current timeout is 16 minutes? (1000 seconds?)

How is it possible that a test that runs in 10s on my dev box takes more than 16 minutes on a CI machine?

@wfurt
Copy link
Member

wfurt commented Jul 13, 2021

I think that is overall timeout for all the test. Each test may fail faster. So as each operation within the test can fail sooner.

@CarnaViire
Copy link
Member Author

CarnaViire commented Jul 13, 2021

RunClientServer actually has a default timeout of 10s

internal async Task RunClientServer(Func<QuicConnection, Task> clientFunction, Func<QuicConnection, Task> serverFunction, int iterations = 1, int millisecondsTimeout = 10_000)

so 1000s outside does not make sense in the end :(

10s is a timeout for 1 iteration as @wfurt said

@CarnaViire
Copy link
Member Author

CarnaViire commented Jul 13, 2021

Shall we pass 50s as an iteration timeout then? @geoffkizer

@CarnaViire
Copy link
Member Author

Or, merge it as it is to unblock CI and discuss timeouts or other ways to fix it later

@karelz
Copy link
Member

karelz commented Jul 13, 2021

This is blocking everyone, we need to disable it or move to Outerloop ASAP. @CarnaViire can you please merge your PR? We can discuss proper solution later. Thanks!

@CarnaViire CarnaViire merged commit a13f713 into dotnet:main Jul 13, 2021
@ManickaP
Copy link
Member

@geoffkizer

How is it possible that a test that runs in 10s on my dev box takes more than 16 minutes on a CI machine?

From my experience, if it gets stuck for 16 minutes on a CI machine and eventually killed by CI infra, it's a deadlock (in tests mostly, rarely in production code). The CI is slow enough, or has a different timing, so that different races happen and we end up with something like this.
After all, we're using semaphores, manualresetevents, taskcompletionsources etc. to synchronize work between client and server. And there's plenty of room to not to see all the possible execution paths and getting a hang/deadlock.

Some of these problem I've successfully debugged in the past by crashing the process on purpose after a certain amount of time and then getting a dump from which you can (in ~90% of times) figure out where it got stuck.

@CarnaViire
Copy link
Member Author

if it gets stuck for 16 minutes

It didn't stuck for 16 mins, it timeouted from one iteration timeout inside RunClientServer, which is 10s by default
https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-8899a62e6f614ea1b3/System.Net.Quic.Functional.Tests/console.d6482988.log?%3F%253Fsv%253D2019-07-07%2526se%253D2021-08-01T16%25253A44%25253A18Z%2526sr%253Dc%2526sp%253Drl%2526sig%253DMVc106OR9ELNjM4Y3Vq93U4lMQzyII0ikOA86WYj08Y%25253D
you can see it in the logs, and the full test suite took 31s

@CarnaViire
Copy link
Member Author

That being said, we still need to investigate why 10s was not enough, it is possible CI machine is just slow and this test is big, so we just need to increase the timeout for the iteration, but some deadlock is also possible

@karelz karelz added this to the 6.0.0 milestone Jul 15, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Aug 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

system.net.quic.tests.msquictests.bytemixingornativeave_minimalfailingtest times out on Fedora
6 participants