Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry on leader lease renewal failure #9563

Merged
merged 58 commits into from
Jun 18, 2024

Conversation

davidjumani
Copy link
Contributor

@davidjumani davidjumani commented May 31, 2024

Description

After a container has become a leader, any Kube API server unavailability results in the container crashing. This happens as the leader is unable to renew the lease. This is by design as outlined here

if we allowed gloo to continue to function as a leader during kube apiserver outage, we risk having two leaders in other failure modes. we should remove the panic and allow gloo to continue to serve last-known xds as a follower (effectively having two followers until kube apiserver recovers). this idea is similar to the role xds relay could play for gloo edge

However, crashing the gloo pods can also lead to an outage during scaling as the gateway-proxy pods that come up cannot fetch any configs and all routes result in 404s.

Now instead of crashing, the gloo pod falls back to a follower. This prevents an outage and any other pod can take over as leader if possible
This is fine as a leader only writes reports / statuses over here and here. On any failure, the pod becomes a follower and if elected back as a leader will continue to write reports.

Code changes

  • Introduce a new Reset method on the identity implementation that allows an identity to fall back to a follower

CI changes

  • Kind clusters that run in CI now do not come up with their own CNI as kindnet does not support custom network policies.
    Instead, cilium is installed as CNI as we need to test Kube API server unavailability

Context

Kube API unavailability results in a gloo container crash
When leader election fails, gloo crashes
Design Doc

Interesting decisions

Testing steps

Kube2e tests to verify the following :

  • It can recover from a Kube APIServer unavailability
  • It falls back to a follower
  • Another leader can get elected

Notes for reviewers

Be sure to verify intended behavior by ...

Please proofread comments on ...

This is a complex PR and may require a huddle to discuss ...

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

BOT NOTES:
resolves #8107

@davidjumani davidjumani added the work in progress signals bulldozer to keep pr open (don't auto-merge) label May 31, 2024
@github-actions github-actions bot added the keep pr updated signals bulldozer to keep pr up to date with base branch label May 31, 2024
Copy link

github-actions bot commented May 31, 2024

Visit the preview URL for this PR (updated for commit 7cf3c2b):

https://gloo-edge--pr9563-dont-crash-on-failed-6s4x9piv.web.app

(expires Tue, 25 Jun 2024 00:48:42 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Sign: 77c2b86e287749579b7ff9cadb81e099042ef677

@davidjumani davidjumani removed the work in progress signals bulldozer to keep pr open (don't auto-merge) label May 31, 2024
Copy link
Contributor

@ashishb-solo ashishb-solo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @davidjumani - i've read it over a few times and have a few questions. i think a little more documentation (especially around the go funcs) would be helpful. also, a major question regarding whether we still have the regular behavior if we're not configured to recover from failure.

happy to talk in person if that would be easier

ps: i haven't had a chance to look at the test yet

pkg/bootstrap/leaderelector/kube/factory.go Show resolved Hide resolved
pkg/bootstrap/leaderelector/kube/factory.go Outdated Show resolved Hide resolved
pkg/bootstrap/leaderelector/kube/factory.go Outdated Show resolved Hide resolved
projects/gloo/pkg/setup/setup.go Show resolved Hide resolved
Copy link
Contributor

@sam-heilbron sam-heilbron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good! I mainly focused on how we're testing for now, and once that is updated, I'm happy to take a closer look at the implementation itself

ci/kind/cluster.yaml Show resolved Hide resolved
ci/kind/setup-kind.sh Show resolved Hide resolved
ci/kind/setup-kind.sh Outdated Show resolved Hide resolved
pkg/bootstrap/leaderelector/identity.go Show resolved Hide resolved
pkg/bootstrap/leaderelector/kube/factory.go Outdated Show resolved Hide resolved
test/kube2e/gloo/bootstrap_clients_test.go Outdated Show resolved Hide resolved
test/kube2e/gloo/bootstrap_clients_test.go Outdated Show resolved Hide resolved
pkg/bootstrap/leaderelector/kube/factory.go Outdated Show resolved Hide resolved
test/kube2e/gloo/artifacts/block.yaml Outdated Show resolved Hide resolved
test/kube2e/gloo/bootstrap_clients_test.go Outdated Show resolved Hide resolved
nfuden
nfuden previously approved these changes Jun 14, 2024
Copy link
Contributor

@ashishb-solo ashishb-solo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me. one question on the comment from Sam regarding whether the goroutine is safe. if so, happy to ignore the nits to approve

pkg/bootstrap/leaderelector/kube/factory.go Outdated Show resolved Hide resolved
@davidjumani davidjumani mentioned this pull request Jun 18, 2024
@davidjumani
Copy link
Contributor Author

kick bulldozer

@soloio-bulldozer soloio-bulldozer bot merged commit e499b77 into main Jun 18, 2024
20 checks passed
@soloio-bulldozer soloio-bulldozer bot deleted the dont-crash-on-failed-lease-renewal branch June 18, 2024 17:58
davidjumani added a commit that referenced this pull request Jun 18, 2024
* retry on leader lease renewal failure

* Die if unable to recover

* use env var

* add basic tests

* udpate tests

* add comments

* Add commnets around ci changes

* update tests

* refactor

* cleanup

* udpate test

* rename env var

* add changelog

* address comments v1

* address comments v2

* fix test

* sue GinkgoHelper

* Adding changelog file to new location

* Deleting changelog file from old location

* specify a duration

* Adding changelog file to new location

* Deleting changelog file from old location

* remove default

* Adding changelog file to new location

* Deleting changelog file from old location

* fix tests

* move changelog

* move conter

---------

Co-authored-by: soloio-bulldozer[bot] <48420018+soloio-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: changelog-bot <changelog-bot>
Co-authored-by: Nathan Fudenberg <nathan.fudenberg@solo.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
keep pr updated signals bulldozer to keep pr up to date with base branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Kube API unavailability results in a gloo container crash
4 participants