
Fix retry handling in case IP pool is empty #4590

Merged
merged 3 commits into ovn-org:master from fix_update
Aug 13, 2024

Conversation

@oshoval (Contributor) commented Aug 7, 2024

What this PR does and why is it needed

When a pod is created and the IP pool is empty,
the pod gets stuck forever in the creation state, even if IPs are later released.

To simulate the bug:

  1. Create a NAD with a pool that has just 1 available IP (a /30 holds 4 addresses; the first and last are reserved and .1 is excluded, leaving a single usable IP).
  2. Create 2 pods that use this NAD:
    a. One pod takes the IP.
    b. The second pod remains in the "creation" state.
  3. Wait around 1 minute and then delete the Running pod.

The bug: The pending pod stays pending forever, even though an IP is available.

Without the fix, the pod add/update retry mechanism isn't triggered at all, because it is triggered
only when a non-nil error is returned.
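For illustration, a minimal sketch of the fixed handler shape (hypothetical names, not the actual ovn-kubernetes code): the allocation error has to propagate out of the handler so the retry framework sees a non-nil error and re-queues the pod.

```go
package main

import (
	"errors"
	"fmt"
)

// errPoolExhausted stands in for the "no IPs left in the pool" condition.
var errPoolExhausted = errors.New("IP pool exhausted")

// allocatePodIPs is a hypothetical stand-in for the pod allocator call.
func allocatePodIPs(pod string) error { return errPoolExhausted }

// addOrUpdatePod sketches the handler after the fix: the error is returned
// instead of being swallowed, which is what tells the retry framework to
// re-queue the event.
func addOrUpdatePod(pod string) error {
	if err := allocatePodIPs(pod); err != nil {
		return fmt.Errorf("failed to allocate IPs for pod %s: %w", pod, err)
	}
	return nil // nil means success: no retry is scheduled
}

func main() {
	fmt.Println(addOrUpdatePod("default/pod2")) // non-nil error => retry
}
```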

Note:
Even with the fix, the retry mechanism is capped at 15 retries [1], which totals around 15 minutes.
After that, the pod will be pending forever, even if IPs are released, because its annotations
won't be updated anymore.
The user will need to recreate the pod manually (unless the pod has a controller, of course).

[1]

`if entry.failedAttempts >= MaxFailedAttempts {`
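As a toy model of that cap (assumed numbers taken from the note above, not the real retry package):

```go
package main

import "fmt"

// MaxFailedAttempts mirrors the quoted check; retries are roughly a minute
// apart, so the cap works out to about 15 minutes of retrying.
const MaxFailedAttempts = 15

type retryEntry struct{ failedAttempts int }

func main() {
	entry := retryEntry{}
	for entry.failedAttempts < MaxFailedAttempts {
		entry.failedAttempts++ // each failed reconcile bumps the counter
	}
	// Once the cap is hit the entry is abandoned: the pod stays pending
	// even if IPs free up later, until it is recreated.
	fmt.Printf("gave up after %d attempts (~%d minutes)\n",
		entry.failedAttempts, entry.failedAttempts)
}
```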

Also added the missing `return err` for the pod delete event, simply because it was missing as well.
Removed the error logs, since the error is printed at a higher level anyway.

Which issue(s) this PR fixes

Fixes #

Special notes for reviewers

Please find in the comments below manifests that allow simulating the bug easily, if needed.

How to verify it

Details to documentation updates

Description for the changelog

Does this PR introduce a user-facing change?

None

@oshoval commented Aug 7, 2024

CI failure here, but as far as I understand it is not related:
https://github.com/ovn-org/ovn-kubernetes/actions/runs/10282831972/job/28456405063?pr=4590

  [FAIL] External Gateway With Admin Policy Based External Route CRs e2e non-vxlan external gateway through a gateway pod Should validate ICMP connectivity to an external gateway's loopback address via a gateway pod [It] ipv4

Might be this issue: #4474

@oshoval commented Aug 7, 2024

Manifests that allow simulating the bug:

```yaml
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: mynet
  namespace: default
spec:
  config: |2
    {
        "cniVersion": "0.3.1",
        "name": "tenantblue",
        "netAttachDefName": "default/mynet",
        "topology": "layer2",
        "type": "ovn-k8s-cni-overlay",
        "subnets": "192.168.200.0/30",
        "excludeSubnets": "192.168.200.1/32"
    }
---
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: '[{"name":"mynet","namespace":"default","interface":"pod8f202bdbf25"}]'
  name: pod1
  namespace: default
spec:
  containers:
  - image: docker.io/fedora:latest 
    name: fedora
    imagePullPolicy: "IfNotPresent"
    command:
    - /bin/bash
    - -c
    - sleep infinity
---
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: '[{"name":"mynet","namespace":"default","interface":"pod8f202bdbf25"}]'
  name: pod2
  namespace: default
spec:
  containers:
  - image: docker.io/fedora:latest 
    name: fedora
    imagePullPolicy: "IfNotPresent"
    command:
    - /bin/bash
    - -c
    - sleep infinity
```

`awk '/32 host/ { print f } {f=$2}' /proc/net/fib_trie` can be used to see the allocated IPv4 address, 192.168.200.2.
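For reference, a small standalone sketch (not ovn-kubernetes code) that enumerates the pool above and shows why exactly one address is usable: a /30 holds 4 addresses, the first (network) and last (broadcast) are reserved, and 192.168.200.1 is excluded.

```go
package main

import (
	"fmt"
	"net/netip"
)

func main() {
	pool := netip.MustParsePrefix("192.168.200.0/30")
	excluded := netip.MustParseAddr("192.168.200.1")

	// Collect all addresses in the /30: .0, .1, .2, .3
	var addrs []netip.Addr
	for a := pool.Addr(); pool.Contains(a); a = a.Next() {
		addrs = append(addrs, a)
	}

	// Skip the network (first), broadcast (last) and excluded addresses.
	for i, a := range addrs {
		if i == 0 || i == len(addrs)-1 || a == excluded {
			continue
		}
		fmt.Println("usable:", a) // prints: usable: 192.168.200.2
	}
}
```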

@oshoval changed the title from "WIP" to "WIP Fix pod limbo state (will rephrase soon)" Aug 7, 2024
@oshoval changed the title from "WIP Fix pod limbo state (will rephrase soon)" to "WIP Fix retry handling in case IP pool is empty" Aug 8, 2024
@oshoval force-pushed the fix_update branch 2 times, most recently from 9470615 to 44e107c August 8, 2024 06:26
@oshoval changed the title from "WIP Fix retry handling in case IP pool is empty" to "Fix retry handling in case IP pool is empty" Aug 8, 2024
@oshoval commented Aug 8, 2024

/cc @maiqueb

@oshoval marked this pull request as ready for review August 8, 2024 08:31
@oshoval requested a review from a team as a code owner August 8, 2024 08:31
@oshoval requested a review from npinaeva August 8, 2024 08:31
@oshoval commented Aug 8, 2024

Added the missing `return err` also for the pod delete case (not part of the bug, but I noticed it while keeping things symmetric).

maiqueb previously approved these changes Aug 8, 2024

@maiqueb (Contributor) left a comment:

Thank you.

@maiqueb commented Aug 8, 2024

Is it possible to add some unit tests for this?

@oshoval commented Aug 8, 2024

> Is it possible to add some unit tests for this?

IMHO it isn't worth it (but it's your decision); it's a somewhat complex scenario for a unit test.
(I looked at TestPodAllocator_reconcileForNAD, which is the current home of the unit tests IIUC. It isn't enough, because it checks podAllocator, and we need at least one level above that, since the changed code is what calls podAllocator.)
CNV QE will test it e2e; this kind of test is more suitable for QE coverage.

I also think that if such a corner case didn't deserve a unit test in the first place,
the fact that a bug was later found doesn't always have to be a reason to add one.

EDIT - if needed, maybe we can just unit test the function itself, which should return an error
when the pool is full, rather than testing through the whole retry framework (might be what you meant as well),
but looking at go-controller/pkg/clustermanager/network_cluster_controller_test.go, it doesn't
have unit tests for pod types at all IIUC. A sketch of such a narrow test follows below.
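For what it's worth, a sketch of what that narrow unit test could look like (hypothetical helper and error, not the actual podAllocator API):

```go
package alloc

import (
	"errors"
	"fmt"
	"testing"
)

var errPoolExhausted = errors.New("IP pool exhausted")

// addPod stands in for the fixed handler: it propagates the allocator error
// instead of swallowing it.
func addPod(allocate func() error) error {
	if err := allocate(); err != nil {
		return fmt.Errorf("allocating pod IPs: %w", err)
	}
	return nil
}

func TestAddPodReturnsErrorWhenPoolExhausted(t *testing.T) {
	err := addPod(func() error { return errPoolExhausted })
	if !errors.Is(err, errPoolExhausted) {
		t.Fatalf("expected pool-exhausted error, got %v", err)
	}
}
```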

@oshoval commented Aug 8, 2024

Not sure why the build failed this time; I just pushed an empty change to retrigger it.
(Also updated the first commit message: removed the irrelevant text about pod delete, since that change now has its own commit.)

@oshoval commented Aug 8, 2024

I think the build failure is not related to this PR; at a glance it looks the same as in
#4594
but each PR has its own failures.

https://github.com/ovn-org/ovn-kubernetes/actions/runs/10299300360/job/28506303356?pr=4590
Here I have this one:
FAIL! -- 428 Passed | 1 Failed | 1 Pending | 0 Skipped

```
2024-08-08T09:23:44.2870090Z • Failure [2.550 seconds]
2024-08-08T09:23:44.2870673Z Hybrid SDN Master Operations
2024-08-08T09:23:44.2871979Z /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/hybrid_test.go:132
2024-08-08T09:23:44.2873461Z   handles a HO node is switched to a OVN node [It]
2024-08-08T09:23:44.2874962Z   /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/hybrid_test.go:1216
2024-08-08T09:23:44.2875852Z
2024-08-08T09:23:44.2876141Z   Timed out after 2.000s.
2024-08-08T09:23:44.2876639Z   Expected
2024-08-08T09:23:44.2877291Z       <[]*nbdb.LogicalRouterStaticRoute | len:2, cap:2>: [
2024-08-08T09:23:44.2877989Z           {
2024-08-08T09:23:44.2878817Z               UUID: "4093cc9b-78b9-4b84-8812-1f94d0f1d79a",
2024-08-08T09:23:44.2879534Z               BFD: nil,
2024-08-08T09:23:44.2880102Z               ExternalIDs: {
2024-08-08T09:23:44.2881138Z                   "name": "hybrid-subnet-node1:node-windows",
2024-08-08T09:23:44.2881824Z               },
2024-08-08T09:23:44.2882402Z               IPPrefix: "10.1.3.0/24",
2024-08-08T09:23:44.2883093Z               Nexthop: "10.1.1.3",
2024-08-08T09:23:44.2883714Z               Options: nil,
2024-08-08T09:23:44.2884307Z               OutputPort: nil,
2024-08-08T09:23:44.2884909Z               Policy: nil,
2024-08-08T09:23:44.2885493Z               RouteTable: "",
2024-08-08T09:23:44.2885975Z           },
2024-08-08T09:23:44.2886347Z           {
2024-08-08T09:23:44.2887160Z               UUID: "25132f5f-c50d-4d54-9f95-cb61855d918e",
2024-08-08T09:23:44.2887866Z               BFD: nil,
2024-08-08T09:23:44.2888418Z               ExternalIDs: {
2024-08-08T09:23:44.2889317Z                   "name": "hybrid-subnet-node1-gr",
2024-08-08T09:23:44.2889963Z               },
2024-08-08T09:23:44.2890527Z               IPPrefix: "10.1.3.0/24",
2024-08-08T09:23:44.2891214Z               Nexthop: "100.64.0.1",
2024-08-08T09:23:44.2891847Z               Options: nil,
2024-08-08T09:23:44.2892432Z               OutputPort: nil,
2024-08-08T09:23:44.2893013Z               Policy: nil,
2024-08-08T09:23:44.2893672Z               RouteTable: "",
2024-08-08T09:23:44.2894155Z           },
2024-08-08T09:23:44.2894520Z       ]
2024-08-08T09:23:44.2894952Z   to have length 0
2024-08-08T09:23:44.2895379Z
2024-08-08T09:23:44.2896194Z   /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/hybrid_test.go:1380
```

@oshoval commented Aug 8, 2024

No-changes push, just to retrigger the job one more time.

@oshoval commented Aug 8, 2024

Now it failed due to this (but the previous test did pass in this run; it seems to just be a flaky one):
https://github.com/ovn-org/ovn-kubernetes/actions/runs/10300314435/job/28509457479?pr=4590
2024-08-08T10:41:37.4612347Z testing.go:1398: race detected during execution of test
SUCCESS! -- 429 Passed | 0 Failed | 1 Pending | 0 Skipped

Both reports seem related to 0x000005aadc80 by goroutine 65748,
in the function updateEgressFirewallForNode.
It uses sync.Map, which is good for specific use cases but is not thread safe in all cases;
for example, it is not safe if goroutines read and write shared entries:
https://pkg.go.dev/sync#Map

> The Map type is optimized for two common use cases: (1) when the entry for a given key is only ever written once but read many times, as in caches that only grow, or (2) when multiple goroutines read, write, and overwrite entries for disjoint sets of keys.

That said, there is a lock per entry, but FWIW the namespace isn't locked, unless that doesn't matter
(for example, if keys are considered immutable).

```go
		ef := v.(*egressFirewall)
		namespace := k.(string)
		ef.Lock()
```
```
2024-08-08T10:40:43.0777832Z ==================
2024-08-08T10:40:43.0780020Z WARNING: DATA RACE
2024-08-08T10:40:43.0780961Z Read at 0x000005aadc80 by goroutine 65748:
2024-08-08T10:40:43.0783181Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdb/ops.(*modelClient).where()
2024-08-08T10:40:43.0785890Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/libovsdb/ops/model_client.go:468 +0xb3
2024-08-08T10:40:43.0792667Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdb/ops.(*modelClient).lookup()
2024-08-08T10:40:43.0794318Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/libovsdb/ops/model_client.go:443 +0xdc
2024-08-08T10:40:43.0797052Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdb/ops.(*modelClient).buildOps()
2024-08-08T10:40:43.0798506Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/libovsdb/ops/model_client.go:304 +0x1cd
2024-08-08T10:40:43.0800041Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdb/ops.(*modelClient).createOrUpdateOps()
2024-08-08T10:40:43.0801839Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/libovsdb/ops/model_client.go:251 +0x2f9
2024-08-08T10:40:43.0803264Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdb/ops.(*modelClient).CreateOrUpdateOps()
2024-08-08T10:40:43.0804969Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/libovsdb/ops/model_client.go:216 +0x6f9
2024-08-08T10:40:43.0805986Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdb/ops.CreateOrUpdateACLsOps()
2024-08-08T10:40:43.0807020Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/libovsdb/ops/acl.go:130 +0x7d
2024-08-08T10:40:43.0808097Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*DefaultNetworkController).createEgressFirewallACLOps()
2024-08-08T10:40:43.0809256Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/egressfirewall.go:479 +0x1c4
2024-08-08T10:40:43.0810311Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*DefaultNetworkController).addEgressFirewallRules()
2024-08-08T10:40:43.0811450Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/egressfirewall.go:452 +0x11f1
2024-08-08T10:40:43.0812568Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*DefaultNetworkController).updateEgressFirewallForNode.func1()
2024-08-08T10:40:43.0813749Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/egressfirewall.go:819 +0x80c
2024-08-08T10:40:25.0461534Z WARNING: DATA RACE
2024-08-08T10:40:25.0462041Z Read at 0x000005aadc80 by goroutine 65748:
2024-08-08T10:40:25.0463273Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdb/ops.TransactAndCheck()
2024-08-08T10:40:25.0465294Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/libovsdb/ops/transact.go:44 +0x1a5
2024-08-08T10:40:25.0467295Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*DefaultNetworkController).addEgressFirewallRules()
2024-08-08T10:40:25.0469468Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/egressfirewall.go:457 +0xbfc
2024-08-08T10:40:25.0471751Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*DefaultNetworkController).updateEgressFirewallForNode.func1()
2024-08-08T10:40:25.0474002Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/egressfirewall.go:819 +0x80c
2024-08-08T10:40:25.0475124Z   sync.(*Map).Range()
```
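As an aside, a minimal illustration of the sync.Map caveat discussed above (toy types, not the ovn-kubernetes code): Range itself is safe to call concurrently, but the values it yields are shared pointers, so mutating them still needs the value's own lock.

```go
package main

import "sync"

type entry struct {
	mu sync.Mutex
	n  int
}

func main() {
	var m sync.Map // concurrent Store/Load/Range on keys is fine...
	m.Store("ns1", &entry{})

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			m.Range(func(_, v any) bool {
				e := v.(*entry)
				e.mu.Lock() // ...but shared values need their own lock
				e.n++
				e.mu.Unlock()
				return true
			})
		}()
	}
	wg.Wait()
}
```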

@oshoval commented Aug 8, 2024

I will split this PR into two PRs: one fixing the case described in the PR description,
and a second adding the missing pod delete `return err`.
The various flakes / data races started when I added the `return err`; maybe it woke something up.
With the split we will be safer, able to focus, and also able to see whether it is related.
The delete change is also not related to this PR, so it's legitimate to leave it out.

@oshoval commented Aug 8, 2024

Now the failures seem like known flakes.
Before, it got stuck on flakes at the "build" stage / unit tests and didn't reach the final jobs.

@oshoval commented Aug 8, 2024

Rebased (might solve some of the flakes: #4590 (comment)),
because #4583 was merged.

@oshoval requested a review from maiqueb August 8, 2024 14:22
@oshoval commented Aug 9, 2024

Seems the unit tests hung:

```
2024-08-08T13:53:01.5120574Z I0808 13:53:01.511449   34830 route_manager.go:110] Route Manager: completed adding route: {Ifindex: 67 Dst: 0.0.0.0/0 Src: <nil> Gw: <nil> Flags: [] Table: 1067 Realm: 0}
2024-08-08T19:43:09.3217245Z ##[error]The operation was canceled.
2024-08-08T19:43:09.3317234Z Post job cleanup.
```

@oshoval commented Aug 9, 2024

Rebased.

maiqueb previously approved these changes Aug 12, 2024

@maiqueb (Contributor) left a comment:

Thanks.

Is there something I'm not seeing here? At a glance, we should return an error here to force reconciliation to happen later.
@jcaamano / @trozet / @tssurya

@jcaamano (Contributor) commented:

I see this as a global problem, though. If there is a reason we shouldn't return err there, it is probable that the reason is the same for all three cases (create/update/delete). Perhaps chasing the data race will tell us something we are not realizing right now; I can't see what that reason might be.
Also, if we are returning the error, we can avoid the log there. It will be logged at a higher level.

I don't see any relation of the data race to this PR; it seems more likely caused by this other PR:
#4505
cc @npinaeva

@oshoval commented Aug 12, 2024

> I see this as a global problem, though. If there is a reason we shouldn't return err there, it is probable that the reason is the same for all three cases (create/update/delete). Perhaps chasing the data race will tell us something we are not realizing right now; I can't see what that reason might be. Also, if we are returning the error, we can avoid the log there. It will be logged at a higher level.
>
> I don't see any relation of the data race to this PR; it seems more likely caused by this other PR: #4505 cc @npinaeva

Added the missing `return err` on pod delete.
Removed the log.

Rebased

Thanks

@jcaamano (Contributor) commented:

> I see this as a global problem, though. If there is a reason we shouldn't return err there, it is probable that the reason is the same for all three cases (create/update/delete). Perhaps chasing the data race will tell us something we are not realizing right now; I can't see what that reason might be. Also, if we are returning the error, we can avoid the log there. It will be logged at a higher level.
> I don't see any relation of the data race to this PR; it seems more likely caused by this other PR: #4505 cc @npinaeva
>
> Added the missing `return err` on pod delete. Removed the log.
>
> Rebased
>
> Thanks

Should we do the same for update?

@oshoval commented Aug 12, 2024

> I see this as a global problem, though. If there is a reason we shouldn't return err there, it is probable that the reason is the same for all three cases (create/update/delete). Perhaps chasing the data race will tell us something we are not realizing right now; I can't see what that reason might be. Also, if we are returning the error, we can avoid the log there. It will be logged at a higher level.
> I don't see any relation of the data race to this PR; it seems more likely caused by this other PR: #4505 cc @npinaeva
>
> > Added the missing `return err` on pod delete. Removed the log.
> > Rebased
> > Thanks
>
> Should we do the same for update?

Already did; originally I had to fix both "update" and "add" so pods get a retry chance in both cases.
Without the fix, they don't get a chance to retry, since a non-nil err is the way to signal the retry framework
that an error occurred and that it should retry.

@npinaeva (Member) commented Aug 12, 2024

@jcaamano I think #4505 didn't bring any test failures by itself; it just made existing problems more visible.
Here is an attempt to fix the egress firewall race: #4611

@jcaamano (Contributor) commented:

> @jcaamano I think #4505 didn't bring any test failures by itself; it just made existing problems more visible. Here is an attempt to fix the egress firewall race: #4611

Wouldn't we potentially have a race between PrepareTestConfig for one test and any other test that might be concurrently executing and reading config values?

@oshoval commented Aug 13, 2024

The failing test is a reported flake: #4432
https://github.com/ovn-org/ovn-kubernetes/actions/runs/10353880486/job/28694475459?pr=4590

EDIT: besides that, all pass.

Commits:

When a pod is created and the IP pool is empty,
the pod gets stuck forever in the creation state, even if IPs are released.

To simulate the bug:

1. Create a pool with just 1 available IP (others excluded; first and last reserved).
2. Create 2 pods:
   a. One pod takes the IP.
   b. The second pod remains in the "creation" state.
3. Wait around 1 minute and then delete the Running pod.

Bug: The pending pod stays pending forever, even though an IP is available.

Without the fix, the pod update/add retry mechanism isn't triggered at all, because it is triggered
only if a non-nil error is returned.

Note:
Even with the fix, the retry mechanism is capped at 15 retries, which totals around 15 minutes.
After that, the pod will be pending forever, even if IPs are released, because its annotations
won't be updated anymore.

Signed-off-by: Or Shoval <oshoval@redhat.com>
Signed-off-by: Or Shoval <oshoval@redhat.com>
The retry will be printed at a higher level anyhow.

Signed-off-by: Or Shoval <oshoval@redhat.com>
@oshoval commented Aug 13, 2024

Thanks all

@jcaamano merged commit 1186695 into ovn-org:master Aug 13, 2024
38 of 39 checks passed