Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Egress IP fixes #1567

Merged
merged 7 commits into from
Aug 26, 2020
Merged

Conversation

alexanderConstantinescu
Copy link
Collaborator

@alexanderConstantinescu alexanderConstantinescu commented Jul 31, 2020

- What this PR does and why is it needed

This PR fixes a couple of issues with the egress IP feature that merged yesterday with PR: #1484. There is no real rush to get this in today and it can wait until next week. I didn't have time to do this yesterday with all the things going on.

Problems solved:

  • When deleting the pod's egress setup (upon a pod delete) we were retrieving the IP from the OVN database and subsequently racing with deleteLogicalPort, in case we lost: we never cleaned up. I thus changed that to only look at the pod.Status instead, this resolves the race condition. Since a pod IP can't change we thus never really need to look in the OVN DB for that. This also simplified the code as I didn't need to pull in ovnNBClient into the data containers intended at managing egress IP assignments.

  • I never properly removed any watch handlers from the namespace / pod handlers used for egress IP assignment. This lead to very strange corner cases which have now been fixed. I've also added tests taking into account those cases.

@alexanderConstantinescu
Copy link
Collaborator Author

/retest

1 similar comment
@alexanderConstantinescu
Copy link
Collaborator Author

/retest

go-controller/pkg/ovn/egressip.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/egressip.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/egressip_local_test.go Show resolved Hide resolved
@alexanderConstantinescu
Copy link
Collaborator Author

@danwinship : I've addressed your comments. PTAL

@coveralls
Copy link

coveralls commented Aug 13, 2020

Coverage Status

Coverage increased (+0.4%) to 63.16% when pulling 585654f on alexanderConstantinescu:egressip_fixes into 3ad1998 on ovn-org:master.

@alexanderConstantinescu
Copy link
Collaborator Author

I have update the PR with the correct egress IP implementation for shared gateway mode. PR: #1616 will now bring in the necessary fix in OVN. So we should be good to go.

@alexanderConstantinescu
Copy link
Collaborator Author

/hold

I actually need another fix which I am currently implementing. Then this should be ready.

@alexanderConstantinescu
Copy link
Collaborator Author

/hold cancel

Added two commits fixing corner cases in the egress IP logic.

go-controller/pkg/ovn/egressip.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/egressip.go Show resolved Hide resolved
go-controller/pkg/ovn/egressip.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/egressip.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/egressip_shared.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/egressip.go Outdated Show resolved Hide resolved
@alexanderConstantinescu
Copy link
Collaborator Author

alexanderConstantinescu commented Aug 19, 2020

@danwinship , addressed all comments. PTAL, and let me know if you'd prefer I squash some commits or not. I've kept them separated for easier review / tracking of what has been said here.

@danwinship
Copy link
Contributor

changes look good other than the race-detector problem which I commented on above.

Please squash the fixups back into their original commits in the next round

@alexanderConstantinescu
Copy link
Collaborator Author

changes look good other than the race-detector problem which I commented on above.

Please squash the fixups back into their original commits in the next round

Updated and squashed. PTAL

Copy link
Contributor

@danwinship danwinship left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, just a few more comments. poke me when you've updated it.

go-controller/pkg/ovn/egressip.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/egressip.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/egressip_local.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/egressip_shared.go Outdated Show resolved Hide resolved
go-controller/pkg/ovn/egressip_local_test.go Show resolved Hide resolved
go-controller/pkg/ovn/egressip.go Outdated Show resolved Hide resolved
Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>

Keep `getPodIPs` true to its functional name

`getPodIPs` was doing additional operations not related
to "getting the pod IPs". Move that to `addPodEgressIP`

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
Rename nodeLogicalRouterIP with IP version for egress IP tests

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>

Fix variable spelling mistake

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
The initial implementation of egress IP in shared gateway mode
was dependent on having the bug: https://bugzilla.redhat.com/show_bug.cgi?id=1861294
resolved. This is now the case and the fix is in the latest Fedora 31 release.
This commit thus switches to the correct shared gateway mode implementation.

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
Tag egress IP related OVN transactions with external_id

This patch deals with avoiding inconsistent state when a pod matches multiple
egress IPs. Before this patch: the lr-policy-add / lr-nat-add commands failed once the second
egress IP setup happened. This wasn't a problem in itself, but if that egress IP was later
deleted: it then proceeded to delete the egress setup for the first egress IP. It goes like this:

1) EgressIP-1 is created matching pod X, setup happens correctly
2) EgressIP-2 is created matching pod X, setup fails because that exact same setup (nat and policy rules) already exists
3) EgressIP-2 is deleted, deleting EgressIP-1's setup.
4) OVN state is now screwed up and pod egress functionality will no longer work even though EgressIP-1 still exists.

To avoid this we have the possibility to tag the policy and nat objects with the external_id,
which in this case is the EgressIP.Name. As such we will only try deleting whatever setup we find
for that particular object. The OVN API for doing this is not as straightforward though, and we use
an extra OVN transaction for the delete, but we win in correctness.

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>

Expanding on the docstring for policyAlreadyExistsMsg

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>

Simplify logical router policy filter for egress IP

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>

Check if nat/lr-policy exists before creating it

As to avoid duplicate objects in the OVN database, check that
the NAT / logical router policy object which is about to be created
does not already exist

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
The policy command in local gateway mode (specifying pkt_mark) was incorrect.
It is not a column in the logical_router_policy table, it's an option.

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
@alexanderConstantinescu
Copy link
Collaborator Author

@danwinship , just pushed again with a version addressing your latest comments.

@alexanderConstantinescu
Copy link
Collaborator Author

Hmm, last build failure was caused by:

• Failure [5.003 seconds]
Hybrid SDN Master Operations
/home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/hybrid-overlay/pkg/controller/master_test.go:70
  cleans up a Linux node when the OVN hostsubnet annotation is removed [It]
  /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/hybrid-overlay/pkg/controller/master_test.go:311

  Timed out after 5.000s.
  Expected
      <map[string]string | len:1>: {
          "k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac": "00:00:00:52:19:d2",
      }
  not to have key
      <string>: k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac

  /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/hybrid-overlay/pkg/controller/master_test.go:373

Seems like we have flakes (issues?) with the hybrid overlay code.

The egress IP code was modifying the API server object directly. Leading
to data races when running with the race detector set to "on". Instead copy
the object, modify it and update the API server object with the copied one.

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>

Remove unecessary locking in egress tests.

Egress tests were locking eIPAllocatorMutex as to avoid the data races, which
have now been fixed by the top level commit.

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>

Encapsulate all assignEgressIPs operations in lock

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>

Rename utility functions in egress*_test.go to reflect its new behaviour
@danwinship danwinship merged commit 392a008 into ovn-org:master Aug 26, 2020
jcaamano pushed a commit to jcaamano/ovn-kubernetes that referenced this pull request Mar 27, 2023
OCPBUGS-8741: [release-4.13] Handle Completed pods deletion
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants