/healthz never recovers from failed state #540

Closed
ghost opened this issue Jun 4, 2020 · 11 comments
Labels: kind/bug, lifecycle/rotten

Comments

ghost commented Jun 4, 2020

What happened:

We're running metrics-server v0.3.6 with a readinessProbe:

        readinessProbe:
          httpGet:
            path: /healthz
            port: main-port
            scheme: HTTPS
          periodSeconds: 5

We have Kubernetes clusters provisioned by kops, currently on versions 1.16.9 and 1.17.6, with 3 masters each.

During a rolling update of the masters instance group, the metrics-server often flips to red (readinessProbe failed) and stays red. As a result, the rolling update stalls, because the metrics-server (being in the kube-system namespace) is not ready until you manually kill the pod.

A livenessProbe would help, but it would only disguise the problem.
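
For illustration, a minimal sketch of the livenessProbe alluded to here, reusing the /healthz path and main-port name from the readinessProbe snippet above (the periodSeconds and failureThreshold values are illustrative assumptions, not taken from our manifests); restarting the container on failure would mask a stuck /healthz rather than fix it:

        livenessProbe:
          httpGet:
            path: /healthz            # same endpoint the readinessProbe checks
            port: main-port           # port name taken from the snippet above
            scheme: HTTPS
          periodSeconds: 10           # illustrative value
          failureThreshold: 3         # illustrative value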

What you expected to happen:

Metrics Server should NOT break during a rolling update of a master.

Furthermore, the provided Metrics Server deployment manifests should contain a readinessProbe, so other people actually notice when it breaks.

/kind bug

serathius commented Jun 9, 2020

In 0.3.6, '/healthz' was not correctly defined to serve as a readiness probe, which is why it was not used in the manifests.
We are working on adding a readiness probe in 0.4.0.

I'm not sure about

Metrics Server should NOT break during a rolling update of a master.

Metrics Server queries the apiserver when serving a request, which makes the apiserver a hard dependency. It's reasonable to expect the readiness probe to fail if the application cannot serve requests because a hard dependency is unavailable.

During an upgrade of a non-HA master, it is expected that the apiserver is unavailable and thus that Metrics Server cannot serve its API. A cluster upgrade stalling because a pod in kube-system is not ready is kops-specific behavior and is not expected of every Kubernetes cluster.

@serathius changed the title from "ReadinessProbe fails during rolling update" to "/healthz fails during rolling update" on Jun 9, 2020
@serathius added the kind/bug label on Jun 9, 2020
ghost commented Jun 10, 2020

Thanks for the clarification; we'll remove the readinessProbe until 0.4.0.

Concerning the apiserver requests: of course I was only talking about an HA master setup. My point was that the metrics-server should be self-healing, at least once all masters are back. I had assumed that, according to the /healthz endpoint, the metrics-server stayed broken even after a master came back, but that was a misunderstanding on my part.

@serathius

Removing the readinessProbe is up to you. For sure, "/healthz" needs improvement.

So the underlying problem is that "/healthz" never recovers after a problem with the apiserver (like during an upgrade). This definitely should not happen; it's really surprising to me, but I never depended on "/healthz".
/kind bug

Could you provide Metrics Server logs generated during the master upgrade?

@serathius changed the title from "/healthz fails during rolling update" to "/healthz never recovers from failed state" on Jun 10, 2020
@bytetwin

Hi @serathius, as a new contributor, is this bug something that I can pick up?
If yes, I would need some help on what is to be fixed and some pointers to the code. Thanks 😄

@serathius

Hey @hanumanthan,
I think we need to propose a design for both the Metrics Server health endpoints and its probes. There was already an issue related to that (#291), but it doesn't look like it's being worked on.
Currently I don't have enough time to work on the design; if you're interested, feel free to write a proposal. But I expect this will be pretty hard for a new contributor.

Have you looked into other issues labeled "help wanted"? You can also ping me on Slack to do some bug triaging and create some new-contributor-friendly tasks.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Sep 13, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Oct 13, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@davidxia

Anyone mind if I reopen this issue? I don't think it's been fixed.

@serathius

Probes were redesigned in #542 for the 0.4.0 release. PTAL.
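
For reference, the redesigned setup splits liveness and readiness onto separate endpoints; a minimal sketch of what that configuration looks like in a later metrics-server Deployment (the /livez and /readyz paths and the https port name are assumptions based on the 0.4.x manifests, not quoted from this thread):

        livenessProbe:
          httpGet:
            path: /livez             # liveness: restart only if the process itself is wedged
            port: https
            scheme: HTTPS
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /readyz            # readiness: stop routing traffic while dependencies are unavailable
            port: https
            scheme: HTTPS
          periodSeconds: 10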
