Fix retry handling in case IP pool is empty #4590
Conversation
The failure here is, as far as I understand, unrelated; it might be this issue: #4474
Manifests that allow simulating the bug:
/cc @maiqueb
Added the missing "return err" also in the pod delete case (not part of the bug, but I noticed it while making the handlers symmetric).
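For context, a minimal Go sketch of the symmetry being described, assuming hypothetical names (`deleteLogicalPort` and `releasePodIPs` are illustrative, not the real ovn-kubernetes API):

```go
package main

import (
	"errors"
	"fmt"
)

// releasePodIPs is a stub standing in for the real IP-release logic.
func releasePodIPs(podName string) error {
	return errors.New("failed to release IPs for " + podName)
}

// deleteLogicalPort mirrors the add/update handlers: the error is
// propagated instead of being swallowed, so the caller's retry
// framework can requeue the delete event as well.
func deleteLogicalPort(podName string) error {
	if err := releasePodIPs(podName); err != nil {
		return err // previously the error was dropped here
	}
	return nil
}

func main() {
	fmt.Println(deleteLogicalPort("demo-pod"))
}
```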
Thank you.
Is it possible to add some unit tests for this?
IMHO it isn't worth it (but it's your decision); it's a somewhat complex scenario for a unit test, and I'm not sure such a corner case deserves one in the first place. EDIT: if needed, maybe we can just unit test the function itself, which should return the error.
Not sure why the build failed this time; I just pushed an empty change to retrigger it.
I think the build failure is not related to this PR; at a glance it looks the same as https://github.com/ovn-org/ovn-kubernetes/actions/runs/10299300360/job/28506303356?pr=4590
No changes pushed - just retriggering the job one more time.
Now it failed due to this (the previous test did pass in this run, so it seems to be just a flaky one). Both failures seem related to 0x000005aadc80 by goroutine 65748.
However, there is a lock per entry; the namespace itself isn't locked, though perhaps that doesn't matter.
I will split this PR into 2 PRs, one fixing the case mentioned in the PR description.
Now the failures look like known flakes.
Rebased (might solve some flakes; see #4590 (comment)).
Seems the unit tests hung.
rebased |
I see this as a global problem though. If there is a reason we shouldn't return err there, it's probable that the reason is the same for all three cases (create/update/delete). Perhaps chasing the data race will tell us something we are not realizing right now. I don't see any relation between the data race and this PR; it seems more likely caused by this other PR.
Added the missing "return err" on pod delete, and rebased. Thanks.
Should we do the same for update?
Already did; originally I had to fix "update" and "add" so pods get a retry chance in both cases.
Wouldn't we potentially have a race of …
The failing test is a reported flake: #4432. EDIT: besides that, all pass.
When a pod is created and the IP pool is empty, the pod will be stuck forever in the creation state, even if IPs are released.

To simulate the bug:
1. Create a pool with just 1 available IP (others excluded; first and last reserved).
2. Create 2 pods:
   a. One pod will take the IP.
   b. The second pod will remain in the "creation" state.
3. Wait around 1 minute and then delete the Running pod.

Bug: the pending pod stays pending forever, even though an IP is available. Without the fix, the update/add pod retry mechanism isn't triggered at all, because it is triggered only if a non-nil error is returned.

Note: even with the fix, the retry mechanism is capped at 15 retries, which totals around 15 minutes. After 15 minutes the pod will be pending forever, even if IPs are released, because its annotations won't be updated anymore.

Signed-off-by: Or Shoval <oshoval@redhat.com>
Signed-off-by: Or Shoval <oshoval@redhat.com>
The retry will be printed at a higher level anyhow.

Signed-off-by: Or Shoval <oshoval@redhat.com>
Thanks all.
What this PR does and why is it needed
When a pod is created and the IP pool is empty, the pod will be stuck forever in the creation state, even if IPs are released.
To simulate the bug:
1. Create a pool with just 1 available IP (others excluded; first and last reserved).
2. Create 2 pods:
   a. One pod will take the IP.
   b. The second pod will remain in the "creation" state.
3. Wait around 1 minute and then delete the Running pod.
The bug: the pending pod stays pending forever, even though an IP is available. Without the fix, the update/add pod retry mechanism isn't triggered at all, because it is triggered only if a non-nil error is returned.
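A hedged sketch of the failure mode, with illustrative names (`addLogicalPort`, `errPoolEmpty`, and the `controller` type are assumptions for the example, not the real ovn-kubernetes code):

```go
package main

import (
	"errors"
	"fmt"
)

var errPoolEmpty = errors.New("IP pool is empty")

type controller struct{ availableIPs int }

// addLogicalPort sketches the add/update handler. Before the fix, the
// allocation failure was effectively swallowed (nil was returned), so
// the retry framework never requeued the pod; returning the error is
// what makes the retry mechanism kick in.
func (c *controller) addLogicalPort(podName string) error {
	if c.availableIPs == 0 {
		return errPoolEmpty // was: logged, then nil returned
	}
	c.availableIPs--
	return nil
}

func main() {
	c := &controller{availableIPs: 0}
	if err := c.addLogicalPort("pending-pod"); err != nil {
		fmt.Println("non-nil error returned, pod will be retried:", err)
	}
}
```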
Note:
Even with the fix, the retry mechanism is capped at 15 retries [1], which totals around 15 minutes. After 15 minutes the pod will be pending forever, even if IPs are released, because its annotations won't be updated anymore.
The user will need to manually recreate the pod (unless the pod has a controller, of course).
[1] ovn-kubernetes/go-controller/pkg/retry/obj_retry.go, line 258 at commit 87fdae7
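To illustrate the cap, a minimal sketch of a bounded retry loop, assuming a fixed interval; the constant 15 and the roughly 1-minute spacing are taken from the PR text above, and the real obj_retry.go logic differs:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const maxFailedAttempts = 15 // per the PR text; see obj_retry.go

// retryWithCap keeps retrying until success or the attempt cap is hit.
// Once the cap is reached the object is dropped from retries, which is
// why a pending pod stays pending even if IPs free up later.
func retryWithCap(do func() error, interval time.Duration) error {
	var err error
	for attempt := 1; attempt <= maxFailedAttempts; attempt++ {
		if err = do(); err == nil {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxFailedAttempts, err)
}

func main() {
	alwaysFail := func() error { return errors.New("IP pool is empty") }
	// Short interval for the demo; with ~1-minute spacing, 15 attempts
	// is roughly the 15 minutes mentioned above.
	fmt.Println(retryWithCap(alwaysFail, 10*time.Millisecond))
}
```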
Also added the missing "return err" for the pod delete event, simply because it was missing as well.
Removed the logs in the error case; the error is printed at a higher level anyhow.
Which issue(s) this PR fixes
Fixes #
Special notes for reviewers
Please find in the comments below manifests that allow simulating the bug easily upon needs.
How to verify it
Details to documentation updates
Description for the changelog
Does this PR introduce a user-facing change?