Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1.17] Retry on leader lease renewal failure #9639

Merged
merged 2 commits into from
Jun 19, 2024

Conversation

davidjumani
Copy link
Contributor

@davidjumani davidjumani commented Jun 18, 2024

Description

Backport of #9563

After a container has become a leader, any Kube API server unavailability results in the container crashing. This happens as the leader is unable to renew the lease. This is by design as outlined here

if we allowed gloo to continue to function as a leader during kube apiserver outage, we risk having two leaders in other failure modes. we should remove the panic and allow gloo to continue to serve last-known xds as a follower (effectively having two followers until kube apiserver recovers). this idea is similar to the role xds relay could play for gloo edge

However, crashing the gloo pods can also lead to an outage during scaling as the gateway-proxy pods that come up cannot fetch any configs and all routes result in 404s.

Now instead of crashing, the gloo pod falls back to a follower. This prevents an outage and any other pod can take over as leader if possible
This is fine as a leader only writes reports / statuses over here and here. On any failure, the pod becomes a follower and if elected back as a leader will continue to write reports.

Code changes

  • Introduce a new Reset method on the identity implementation that allows an identity to fall back to a follower

CI changes

  • Kind clusters that run in CI now do not come up with their own CNI as kindnet does not support custom network policies.
    Instead, cilium is installed as CNI as we need to test Kube API server unavailability

Context

Kube API unavailability results in a gloo container crash
When leader election fails, gloo crashes
Design Doc

Interesting decisions

Testing steps

Kube2e tests to verify the following :

  • It can recover from a Kube APIServer unavailability
  • It falls back to a follower
  • Another leader can get elected

Notes for reviewers

Be sure to verify intended behavior by ...

Please proofread comments on ...

This is a complex PR and may require a huddle to discuss ...

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

BOT NOTES:
resolves #8107

* retry on leader lease renewal failure

* Die if unable to recover

* use env var

* add basic tests

* udpate tests

* add comments

* Add commnets around ci changes

* update tests

* refactor

* cleanup

* udpate test

* rename env var

* add changelog

* address comments v1

* address comments v2

* fix test

* sue GinkgoHelper

* Adding changelog file to new location

* Deleting changelog file from old location

* specify a duration

* Adding changelog file to new location

* Deleting changelog file from old location

* remove default

* Adding changelog file to new location

* Deleting changelog file from old location

* fix tests

* move changelog

* move conter

---------

Co-authored-by: soloio-bulldozer[bot] <48420018+soloio-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: changelog-bot <changelog-bot>
Co-authored-by: Nathan Fudenberg <nathan.fudenberg@solo.io>
@solo-changelog-bot
Copy link

Issues linked to changelog:
#8107

@github-actions github-actions bot added the keep pr updated signals bulldozer to keep pr up to date with base branch label Jun 18, 2024
Copy link
Contributor

@nfuden nfuden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im surprised that new_feature in changelog passed bot but looks good to me

@soloio-bulldozer soloio-bulldozer bot merged commit 52ea4ab into v1.17.x Jun 19, 2024
22 checks passed
@soloio-bulldozer soloio-bulldozer bot deleted the dont-crash-on-failed-lease-renewal-117 branch June 19, 2024 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
keep pr updated signals bulldozer to keep pr up to date with base branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants