build: the end-2-end tests have become extremely flaky #4606

Closed
jiceatscion opened this issue Aug 30, 2024 · 4 comments · Fixed by #4620
Labels
bug Something isn't working

Comments

@jiceatscion
Contributor

They have a success rate of less than 20%.

@oncilla
Contributor

oncilla commented Sep 9, 2024

Papered over in #4605, but we still need to investigate.

@jiceatscion
Contributor Author

jiceatscion commented Sep 13, 2024

The patch helps only somewhat. I added a 10 s delay after await-connectivity. Even that is not quite enough, so I added a triple retry on pings. That is enough for the test to succeed most of the time, but the retry only applies to the end2end_integration test. There's another test that keeps failing: scion_integration. That one doesn't have a retry option.

In the end we just need to figure out why it takes so long for path segments to become available.

@jiceatscion
Contributor Author

jiceatscion commented Sep 13, 2024

So, it appears that the segments are available after all (give or take a small fix in await-connectivity). What makes the tests fail is Deadline exceeded errors when trying to fetch the segments. Following the breadcrumbs, I ended up seeing a CS RPCing to another, and both of them (if memory serves) disappearing for several seconds in the middle of processing the request. So, of course, the 10 s client timeout eventually blows up, and the whole chain of RPCs fails.

Increasing the timeout doesn't fix it, so it seems that the hang-ups can last indefinitely, until the timeout blows up.

@oncilla
Contributor

oncilla commented Sep 13, 2024

Reading the release notes carefully, this is the only thing that stands out:

https://tip.golang.org/doc/go1.23#timer-changes

And indeed, there is something in the Go issue tracker, golang/go#69312, and in the offending library, quic-go/quic-go#4659.

In the meantime, we can downgrade, or use GODEBUG="asynctimerchan=1"

Downgrading the Go version shows very high reliability: https://buildkite.com/scionproto/scion/builds/4751
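As a rough way to check which timer behavior a given binary ends up with, the sketch below relies on the Go 1.23 release notes, which say that timer channels are now unbuffered (capacity 0), whereas the old implementation used a buffered channel of capacity 1. `timerChanMode` is a hypothetical helper and not part of any library.

```go
package main

import (
	"fmt"
	"time"
)

// timerChanMode reports which timer channel implementation is active by
// inspecting the channel capacity: 0 for the Go 1.23 synchronous channels,
// non-zero for the older buffered ones.
func timerChanMode() string {
	t := time.NewTimer(time.Hour)
	defer t.Stop()
	if cap(t.C) == 0 {
		return "synchronous"
	}
	return "asynchronous"
}

func main() {
	fmt.Println("timer channels:", timerChanMode())
}
```

Under these assumptions, a binary built with Go 1.23 and a `go 1.23` line in go.mod should report `synchronous`, while running it with `GODEBUG=asynctimerchan=1`, or building with an older `go` line in go.mod, should report `asynchronous`.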

oncilla added a commit that referenced this issue Sep 17, 2024
Until golang/go#69312 is resolved, force the
old timer behavior by specifying an older go version in the go.mod file.

Fixes #4606
jiceatscion added a commit that referenced this issue Sep 19, 2024
Found this in the wake of #4606
I believe that await-connectivity could mistake core segments for up
segments (i.e. assuming that only up segments could be found). It still
makes the optimistic assumption that down segments are registered
immediately after up segments are obtained. We have to be content with
that because in hidden paths test cases the down segments cannot all be
found via a simple REST API query.