Flaky test TestRevisionPause #17548

Open

ahrtr opened this issue Mar 7, 2024 · 12 comments
@ahrtr
Member

ahrtr commented Mar 7, 2024

Which Github Action / Prow Jobs are flaking?

pull-etcd-unit-test

Which tests are flaking?

TestRevisionPause

Github Action / Prow Job link

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/etcd-io_etcd/17546/pull-etcd-unit-test/1765815515617431552

Reason for failure (if possible)

{Failed  === RUN   TestRevisionPause
    revision_test.go:102: len(actions) = 0, expected >= 1
--- FAIL: TestRevisionPause (10.04s)
}

Anything else we need to know?

No response

@ah8ad3
Contributor

ah8ad3 commented Mar 12, 2024

I might be able to fix this. If it's OK, I'll take a look at it.

@ahrtr
Member Author

ahrtr commented Mar 12, 2024

/assign @ah8ad3

Thanks

@ah8ad3
Contributor

ah8ad3 commented Mar 13, 2024

This issue can be reproduced locally by changing the timeout in this line to 0:

return NewRecorderStreamWithWaitTimout(5 * time.Second)

We can increase this timeout and hope that fixes it, or we can rewrite some parts of the code (I don't know how much work that would be; maybe rewriting the test, or even changing the revision code to make it less dependent on the timeout). What do you think? @ahrtr
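For readers following along, here is a minimal, self-contained Go sketch of why a short (or zero) wait timeout shows up as "len(actions) = 0, expected >= 1". This is illustrative only, not etcd's actual pkg/testutil code; all names (action, recorderStream, etc.) are made up for the example. The point is that wait gives up once the timeout elapses, even if the compactor goroutine simply hasn't recorded its action yet.

```go
// Toy recorder with a wait timeout; illustrative only, not etcd's testutil code.
package main

import (
	"fmt"
	"time"
)

type action struct{ name string }

type recorderStream struct {
	ch          chan action
	waitTimeout time.Duration
}

func newRecorderStream(waitTimeout time.Duration) *recorderStream {
	return &recorderStream{ch: make(chan action), waitTimeout: waitTimeout}
}

func (r *recorderStream) record(a action) { r.ch <- a }

// wait collects up to n actions but gives up once waitTimeout elapses,
// possibly returning fewer actions than the caller asked for.
func (r *recorderStream) wait(n int) []action {
	var acts []action
	timer := time.NewTimer(r.waitTimeout)
	defer timer.Stop()
	for len(acts) < n {
		select {
		case a := <-r.ch:
			acts = append(acts, a)
		case <-timer.C:
			return acts
		}
	}
	return acts
}

func main() {
	r := newRecorderStream(0) // timeout 0 makes the flake deterministic
	go func() {
		time.Sleep(10 * time.Millisecond) // the "compactor" records slightly late
		r.record(action{name: "Compact"})
	}()
	actions := r.wait(1)
	fmt.Printf("len(actions) = %d, expected >= 1\n", len(actions))
}
```

With a sufficiently large timeout, the same program collects the action and the assertion would pass; with 0 (or an unluckily slow 5 seconds on a loaded CI machine), wait returns empty.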

@qincwang

This might be a low-probability flakiness problem, as I ran the tests 10k+ times locally.


We can bump up the timeout in this test's recorder to 10 seconds, which would further lower the chance of flakiness. As for test performance concerns: since the test is rarely flaky and successful runs do not wait out the full timeout, overall test performance will not be affected.

Created this PR to increase the timeout: #17621. @ah8ad3 @ahrtr
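In terms of the toy recorder sketched in the earlier comment, the proposal amounts to giving wait more headroom. The sketch below is a continuation of that illustrative example (it assumes the toy types defined above plus the standard testing and time imports); the real change in #17621 is just the 5s to 10s constant in the constructor call quoted earlier.

```go
// Continuation of the toy sketch above (illustrative only). With a generous
// timeout a slightly delayed action is still collected, and successful runs
// are not slowed down because wait returns as soon as the action arrives.
func TestToyRecorderWithGenerousTimeout(t *testing.T) {
	r := newRecorderStream(10 * time.Second) // was 5 * time.Second
	go func() {
		time.Sleep(10 * time.Millisecond)
		r.record(action{name: "Compact"})
	}()
	if got := len(r.wait(1)); got < 1 {
		t.Fatalf("len(actions) = %d, expected >= 1", got)
	}
}
```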

@ivanvc
Member

ivanvc commented Mar 21, 2024

Hi @qincwang, could you run the tests with stress? The instructions on how to do it are in the etcd community meeting document. If you leave it running for a while, it will give you a failure percentage.

@qincwang

qincwang commented Mar 21, 2024

Hi @ivanvc, I ran the command for 1h40m (1 million+ runs) against the original test and got the first error. Referring to PR #17586, I will try to remove the timeout dependency here and keep the stress run going longer to validate. Could I be assigned this issue? Thanks.

@jmhbnz
Member

jmhbnz commented Mar 22, 2024

Hi team - Thank you for your progress on this issue.

Speaking as a sig-etcd co-chair, I want to ask that we all be mindful of issue assignments in the future. This issue was assigned to @ah8ad3.

In the interest of growing our collaborative community, I would have preferred the pull request resolving this issue to come from the initial assignee who volunteered to work on it first. Normally, if an issue is already assigned, we would expect other contributors to ask permission before raising a PR for the same issue.

Given that work has already happened in this instance, I will also assign @qincwang in the hope that you can both work together to finalise any testing or pull requests. Additionally, please consider co-authoring commits on GitHub to ensure all contributors are recognised.

Thank you for your kind consideration 🙏🏻

/assign @qincwang

@ah8ad3
Contributor

ah8ad3 commented Mar 22, 2024

Thanks @jmhbnz for the clarification. Since I'm new to this community, I was waiting for @ahrtr's response, but anyway, I'll try to stress-test it.

@qincwang

Hi team, sorry for any confusion or concerns. I am also new to the community and opened the PR in the hope of prompting some more concrete discussion around the code. I will make sure all work and contributors are recognized in the PR.

@ahrtr
Member Author

ahrtr commented Mar 24, 2024

Sorry for the late response. Will take a closer look at this issue sometime next week.

@ahrtr
Member Author

ahrtr commented Mar 25, 2024

It seems that we just need to follow the same approach as #17513? Let's get #17513 merged first.

@ah8ad3
Contributor

ah8ad3 commented Mar 26, 2024

Sorry, I was away for a few days. I ran the tests with stress and race but couldn't reproduce the error. The code here is pretty old (it was refactored in #8123 and may be more than 7 years old). I think we need more work along the lines of #17513 (thanks for pointing that out, @ahrtr) to refactor these parts and make them less dependent on timeouts.
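One generic way to reduce the timeout dependence is to block until the expected action arrives, bounded only by the overall test deadline rather than a per-recorder constant, so a slow CI machine makes the test slower instead of making it fail. The sketch below shows that pattern in isolation; it is a common technique, not necessarily what #17513 implements, and all names (waitForAction, action) are made up for the example.

```go
// Sketch of a timeout-independent wait: block until the action arrives or the
// test's own deadline passes. Illustrative only, not etcd code.
package main

import (
	"fmt"
	"time"
)

type action struct{ name string }

// waitForAction blocks until one action is received or the deadline passes.
// In a real test the deadline would come from t.Deadline() / go test -timeout,
// not from a hard-coded per-recorder constant.
func waitForAction(ch <-chan action, deadline time.Time) (action, bool) {
	timer := time.NewTimer(time.Until(deadline))
	defer timer.Stop()
	select {
	case a := <-ch:
		return a, true
	case <-timer.C:
		return action{}, false
	}
}

func main() {
	ch := make(chan action, 1)
	go func() {
		time.Sleep(50 * time.Millisecond) // simulate a slow compactor goroutine
		ch <- action{name: "Compact"}
	}()
	if a, ok := waitForAction(ch, time.Now().Add(30*time.Second)); ok {
		fmt.Println("got action:", a.name)
	} else {
		fmt.Println("test deadline exceeded")
	}
}
```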
