Kube API unavailability results in a gloo container crash #8107

Closed
Ati59 opened this issue Apr 20, 2023 · 4 comments · Fixed by #9563
Labels
Area: Stability · Prioritized · Type: Bug

Comments

Ati59 (Contributor) commented Apr 20, 2023

Gloo Edge Version

1.13.x (latest stable)

Kubernetes Version

None

Describe the bug

A customer is experiencing regular kube-API outages (on all clouds: AWS, Azure, and GCP), and when that happens the gloo container in the gloo pod crashes because leader election is unable to elect a leader.
If the API server is unavailable during a scale-out event (an increase in load, for instance), the new gateway-proxy will not receive its configuration from gloo because of this election problem.

Steps to reproduce the bug

  • Install GE using these Helm values:
global:
  glooMtls:
    enabled: true
  istioSDS:
    enabled: false
  • Make the API server unavailable for some time (I used an iptables rule: iptables -A INPUT -p tcp --dport 6443 -j DROP)
  • Check the restart count and the last container logs:
2023-04-20T11:39:21.664432264Z stderr F E0420 11:39:21.663749       1 leaderelection.go:330] error retrieving resource lock gloo-system/gloo-ee: Get "https://10.6.0.1:443/api/v1/namespaces/gloo-system/configmaps/gloo-ee": context deadline exceeded
2023-04-20T11:39:21.664668472Z stderr F I0420 11:39:21.664461       1 leaderelection.go:283] failed to renew lease gloo-system/gloo-ee: timed out waiting for the condition
2023-04-20T11:39:21.666594222Z stderr F {"level":"error","ts":"2023-04-20T11:39:21.665Z","logger":"gloo-ee","caller":"kube/factory.go:61","msg":"Stopped Leading","version":"1.13.9","stacktrace":"github.com/solo-io/gloo/pkg/bootstrap/leaderelector/kube.(*kubeElectionFactory).StartElection.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.13.8/pkg/bootstrap/leaderelector/kube/factory.go:61\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.22.4/tools/leaderelection/leaderelection.go:203\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\t/go/pkg/mod/k8s.io/client-go@v0.22.4/tools/leaderelection/leaderelection.go:213"}
2023-04-20T11:39:21.668590097Z stderr F {"level":"fatal","ts":"2023-04-20T11:39:21.667Z","caller":"setup/setup.go:49","msg":"lost leadership, quitting app","stacktrace":"github.com/solo-io/solo-projects/projects/gloo/pkg/setup.Main.func3\n\t/workspace/solo-projects/projects/gloo/pkg/setup/setup.go:49\ngitpro.ttaallkk.top/solo-io/gloo/pkg/bootstrap/leaderelector/kube.(*kubeElectionFactory).StartElection.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.13.8/pkg/bootstrap/leaderelector/kube/factory.go:62\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.22.4/tools/leaderelection/leaderelection.go:203\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\t/go/pkg/mod/k8s.io/client-go@v0.22.4/tools/leaderelection/leaderelection.go:213"}
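For context, the stack trace above ends in a fatal log from the leader-election OnStoppedLeading callback. Below is a minimal, hypothetical sketch (not the actual Gloo source) of that wiring, assuming the standard k8s.io/client-go leader-election callbacks shown in the trace: when the lease on gloo-system/gloo-ee cannot be renewed, client-go invokes OnStoppedLeading, and the registered callback terminates the process.

// Hypothetical sketch of the crash path the stack trace points at.
package main

import (
	"log"

	"k8s.io/client-go/tools/leaderelection"
)

func main() {
	callbacks := leaderelection.LeaderCallbacks{
		OnStoppedLeading: func() {
			// Mirrors setup/setup.go:49 in the trace: "lost leadership, quitting app".
			log.Fatal("lost leadership, quitting app")
		},
	}

	// In the real code these callbacks are handed to client-go's LeaderElector
	// (factory.go:61-62 in the trace); losing the lease is simulated here by
	// invoking the callback directly, which exits the whole process.
	callbacks.OnStoppedLeading()
}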

Expected Behavior

Gloo should be resilient to an API-server outage, or at the very least not crash.

Additional Context

  • In the customer's case, an error message then appears in their logs when this happens: One or more envoy instances are not connected to the control plane for the last 1 minute
  • In the customer's case, federation is enabled


Ati59 added the Type: Bug label on Apr 20, 2023
kdorosh (Contributor) commented Apr 20, 2023

This is by design. If we allowed gloo to continue to function as a leader during a kube apiserver outage, we would risk having two leaders in other failure modes. We should remove the panic and allow gloo to continue to serve the last-known xDS as a follower (effectively having two followers until the kube apiserver recovers). This idea is similar to the role xDS relay could play for gloo edge.
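A minimal sketch of the behaviour described above, assuming client-go leader election as seen in the stack traces (the gloo-pod identity and the Lease lock below are illustrative, not Gloo's actual implementation): instead of exiting when the lease cannot be renewed, the process demotes itself to follower, keeps serving the last-known xDS snapshot, and re-enters the election so it can be promoted again once the API server recovers.

package main

import (
	"context"
	"log"
	"sync/atomic"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Read by the translation/status loops to decide whether to write as leader;
	// followers keep serving the last-known xDS snapshot either way.
	var isLeader atomic.Bool

	// Illustrative Lease lock; the logs above show Gloo using the
	// gloo-system/gloo-ee ConfigMap as its lock.
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"gloo-system", "gloo-ee",
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: "gloo-pod"},
	)
	if err != nil {
		log.Fatal(err)
	}

	for {
		// RunOrDie returns once leadership is lost; looping re-enters the
		// election instead of crashing the pod.
		leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
			Lock:          lock,
			LeaseDuration: 15 * time.Second,
			RenewDeadline: 10 * time.Second,
			RetryPeriod:   2 * time.Second,
			Callbacks: leaderelection.LeaderCallbacks{
				OnStartedLeading: func(ctx context.Context) {
					isLeader.Store(true)
				},
				OnStoppedLeading: func() {
					// Demote instead of log.Fatal: the process stays up as a
					// follower until the API server is reachable again.
					isLeader.Store(false)
					log.Print("lost leadership, continuing as follower")
				},
			},
		})
		time.Sleep(2 * time.Second)
	}
}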

sam-heilbron (Contributor) commented

When we resolve this, let's also close out:

DuncanDoyle added the Area: Stability label on May 15, 2024
nfuden added the Prioritized label on May 22, 2024
davidjumani (Contributor) commented

This will be fixed in 1.17.0
