/healthz never recovers from failed state #540

Closed
ghost opened this issue Jun 4, 2020 · 11 comments
Labels: kind/bug, lifecycle/rotten

Comments

ghost commented Jun 4, 2020

What happened:

We're running metrics-server v0.3.6 with a readinessProbe:

        readinessProbe:
          httpGet:
            path: /healthz
            port: main-port
            scheme: HTTPS
          periodSeconds: 5

We have Kubernetes clusters provisioned by kops, currently on versions 1.16.9 and 1.17.6, with 3 masters each.

During a rolling update of the masters instance group, the metrics-server often flips to red (readinessProbe failed) and stays red. As a result, the rolling update stalls, because the metrics-server (being in the kube-system namespace) is not ready until you manually kill the pod.

A livenessProbe would help, but it would only disguise the problem.
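
For illustration, a minimal sketch of the livenessProbe alluded to here, reusing the /healthz path and main-port name from the readinessProbe snippet above (the periodSeconds and failureThreshold values are illustrative assumptions, not taken from our manifests); restarting the container on failure would mask a stuck /healthz rather than fix it:

        livenessProbe:
          httpGet:
            path: /healthz            # same endpoint the readinessProbe checks
            port: main-port           # port name taken from the snippet above
            scheme: HTTPS
          periodSeconds: 10           # illustrative value
          failureThreshold: 3         # illustrative value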

What you expected to happen:

Metrics Server should NOT break during a rolling update of a master.

Furthermore, the provided Metrics Server deployment manifests should contain a readinessProbe, so other people actually notice when it breaks.

/kind bug

serathius commented Jun 9, 2020

In 0.3.6, '/healthz' was not correctly defined to serve as a readiness probe, which is why it was not used in the manifests.
We are working on adding a readiness probe in 0.4.0.

I'm not sure about

Metrics Server should NOT break during a rolling update of a master.

Metrics Server queries the apiserver when serving a request, which makes the apiserver a hard dependency. It's reasonable to expect the readiness probe to fail if the application cannot serve requests because a hard dependency is unavailable.

During an upgrade of a non-HA master, it is expected that the apiserver is unavailable and thus that Metrics Server cannot serve its API. A cluster upgrade stalling because a pod in kube-system is not ready is kops-specific behavior and is not expected of every Kubernetes cluster.

@serathius changed the title from "ReadinessProbe fails during rolling update" to "/healthz fails during rolling update" on Jun 9, 2020
@serathius added the kind/bug label on Jun 9, 2020
ghost commented Jun 10, 2020

Thanks for the clarification; we'll remove the readinessProbe until 0.4.0.

Concerning the apiserver requests: of course I was only talking about an HA master setup. My point was that the metrics-server should be self-healing, at least once all masters are back. I had assumed that, according to the /healthz endpoint, the metrics-server stayed broken even after a master came back, but that was a misunderstanding on my part.

@serathius

Removing the readinessProbe is up to you. For sure, "/healthz" needs improvement.

So the underlying problem is that "/healthz" never recovers after a problem with the apiserver (like during an upgrade). This definitely should not happen; it's really surprising to me, but I never depended on "/healthz".
/kind bug

Could you provide Metrics Server logs generated during the master upgrade?

@serathius changed the title from "/healthz fails during rolling update" to "/healthz never recovers from failed state" on Jun 10, 2020
@bytetwin

Hi @serathius, as a new contributor, is this bug something that I can pick up?
If yes, I would need some help on what is to be fixed and some pointers to the code. Thanks 😄

@serathius

Hey @hanumanthan,
I think we need to propose a design for both the Metrics Server health endpoints and its probes. There was already an issue related to that (#291), but it doesn't look like it's being worked on.
Currently I don't have enough time to work on the design; if you're interested, feel free to write a proposal. But I expect this will be pretty hard for a new contributor.

Have you looked into other issues labeled "help wanted"? You can also ping me on Slack to do some bug triaging and create some new-contributor-friendly tasks.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Sep 13, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Oct 13, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@davidxia

Anyone mind if I reopen this issue? I don't think it's been fixed.

@serathius

Probes were redesigned in #542 for the 0.4.0 release. PTAL.
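
For reference, the redesigned setup splits liveness and readiness onto separate endpoints; a minimal sketch of what that configuration looks like in a later metrics-server Deployment (the /livez and /readyz paths and the https port name are assumptions based on the 0.4.x manifests, not quoted from this thread):

        livenessProbe:
          httpGet:
            path: /livez             # liveness: restart only if the process itself is wedged
            port: https
            scheme: HTTPS
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /readyz            # readiness: stop routing traffic while dependencies are unavailable
            port: https
            scheme: HTTPS
          periodSeconds: 10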
