Kube API unavailability results in a gloo container crash #8107

Closed
Ati59 opened this issue Apr 20, 2023 · 4 comments · Fixed by #9563
Labels
Area: Stability · Prioritized · Type: Bug

Comments

Ati59 (Contributor) commented Apr 20, 2023

Gloo Edge Version

1.13.x (latest stable)

Kubernetes Version

None

Describe the bug

A customer is experiencing regular kube-API outages (on all clouds: AWS, Azure, and GCP), and when that happens the gloo container in the gloo pod crashes because leader election is unable to elect a leader.
If the API server is unavailable during a scale-out event (an increase in load, for instance), the new gateway-proxy will not receive its configuration from gloo because of this election problem.

Steps to reproduce the bug

  • Install GE using these Helm values:
global:
  glooMtls:
    enabled: true
  istioSDS:
    enabled: false
  • Make the API server unavailable for some time (I used an iptables rule: iptables -A INPUT -p tcp --dport 6443 -j DROP)
  • Check the restart count and the last container logs:
2023-04-20T11:39:21.664432264Z stderr F E0420 11:39:21.663749       1 leaderelection.go:330] error retrieving resource lock gloo-system/gloo-ee: Get "https://10.6.0.1:443/api/v1/namespaces/gloo-system/configmaps/gloo-ee": context deadline exceeded
2023-04-20T11:39:21.664668472Z stderr F I0420 11:39:21.664461       1 leaderelection.go:283] failed to renew lease gloo-system/gloo-ee: timed out waiting for the condition
2023-04-20T11:39:21.666594222Z stderr F {"level":"error","ts":"2023-04-20T11:39:21.665Z","logger":"gloo-ee","caller":"kube/factory.go:61","msg":"Stopped Leading","version":"1.13.9","stacktrace":"github.com/solo-io/gloo/pkg/bootstrap/leaderelector/kube.(*kubeElectionFactory).StartElection.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.13.8/pkg/bootstrap/leaderelector/kube/factory.go:61\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.22.4/tools/leaderelection/leaderelection.go:203\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\t/go/pkg/mod/k8s.io/client-go@v0.22.4/tools/leaderelection/leaderelection.go:213"}
2023-04-20T11:39:21.668590097Z stderr F {"level":"fatal","ts":"2023-04-20T11:39:21.667Z","caller":"setup/setup.go:49","msg":"lost leadership, quitting app","stacktrace":"github.com/solo-io/solo-projects/projects/gloo/pkg/setup.Main.func3\n\t/workspace/solo-projects/projects/gloo/pkg/setup/setup.go:49\ngitpro.ttaallkk.top/solo-io/gloo/pkg/bootstrap/leaderelector/kube.(*kubeElectionFactory).StartElection.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.13.8/pkg/bootstrap/leaderelector/kube/factory.go:62\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.22.4/tools/leaderelection/leaderelection.go:203\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\t/go/pkg/mod/k8s.io/client-go@v0.22.4/tools/leaderelection/leaderelection.go:213"}
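For context, the stack trace above ends in a fatal log from the leader-election OnStoppedLeading callback. Below is a minimal, hypothetical sketch (not the actual Gloo source) of that wiring, assuming the standard k8s.io/client-go leader-election callbacks shown in the trace: when the lease on gloo-system/gloo-ee cannot be renewed, client-go invokes OnStoppedLeading, and the registered callback terminates the process.

// Hypothetical sketch of the crash path the stack trace points at.
package main

import (
	"log"

	"k8s.io/client-go/tools/leaderelection"
)

func main() {
	callbacks := leaderelection.LeaderCallbacks{
		OnStoppedLeading: func() {
			// Mirrors setup/setup.go:49 in the trace: "lost leadership, quitting app".
			log.Fatal("lost leadership, quitting app")
		},
	}

	// In the real code these callbacks are handed to client-go's LeaderElector
	// (factory.go:61-62 in the trace); losing the lease is simulated here by
	// invoking the callback directly, which exits the whole process.
	callbacks.OnStoppedLeading()
}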

Expected Behavior

Gloo should be resilient to an API-server outage, or at the very least not crash.

Additional Context

  • In the customer's case, an error message then appears in their logs when this happens: One or more envoy instances are not connected to the control plane for the last 1 minute
  • In the customer's case, federation is enabled


Ati59 added the Type: Bug label on Apr 20, 2023
kdorosh (Contributor) commented Apr 20, 2023

This is by design. If we allowed gloo to continue to function as a leader during a kube apiserver outage, we would risk having two leaders in other failure modes. We should remove the panic and allow gloo to continue to serve the last-known xDS as a follower (effectively having two followers until the kube apiserver recovers). This idea is similar to the role xDS relay could play for gloo edge.
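A minimal sketch of the behaviour described above, assuming client-go leader election as seen in the stack traces (the gloo-pod identity and the Lease lock below are illustrative, not Gloo's actual implementation): instead of exiting when the lease cannot be renewed, the process demotes itself to follower, keeps serving the last-known xDS snapshot, and re-enters the election so it can be promoted again once the API server recovers.

package main

import (
	"context"
	"log"
	"sync/atomic"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Read by the translation/status loops to decide whether to write as leader;
	// followers keep serving the last-known xDS snapshot either way.
	var isLeader atomic.Bool

	// Illustrative Lease lock; the logs above show Gloo using the
	// gloo-system/gloo-ee ConfigMap as its lock.
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"gloo-system", "gloo-ee",
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: "gloo-pod"},
	)
	if err != nil {
		log.Fatal(err)
	}

	for {
		// RunOrDie returns once leadership is lost; looping re-enters the
		// election instead of crashing the pod.
		leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
			Lock:          lock,
			LeaseDuration: 15 * time.Second,
			RenewDeadline: 10 * time.Second,
			RetryPeriod:   2 * time.Second,
			Callbacks: leaderelection.LeaderCallbacks{
				OnStartedLeading: func(ctx context.Context) {
					isLeader.Store(true)
				},
				OnStoppedLeading: func() {
					// Demote instead of log.Fatal: the process stays up as a
					// follower until the API server is reachable again.
					isLeader.Store(false)
					log.Print("lost leadership, continuing as follower")
				},
			},
		})
		time.Sleep(2 * time.Second)
	}
}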

sam-heilbron (Contributor) commented

When we resolve this, let's also close out:

DuncanDoyle added the Area: Stability label on May 15, 2024
nfuden added the Prioritized label on May 22, 2024
davidjumani (Contributor) commented

This will be fixed in 1.17.0
