/healthz never recovers from failed state #540
Comments
In 0.3.6, '/healthz' was not correctly defined as a readiness probe, which is why it was not used in the manifests. I'm not sure about …
Metrics Server queries the apiserver when serving a request, which makes the apiserver a hard dependency. It's reasonable to assume that the readiness probe should fail if the application cannot serve requests because a hard dependency is unavailable. During an upgrade of a non-HA master, the apiserver is expected to be unavailable, and thus Metrics Server cannot serve its API. A cluster upgrade stalling because a pod in kube-system is unavailable is kops behavior; it is not expected of every Kubernetes cluster.
Thanks for the clarification, we'll remove the readinessProbe until 0.4.0. Concerning the apiserver request: of course, I was only talking about an HA master setup. My point was that the metrics-server should be self-healing, at least after all masters are back. I assumed that the metrics-server stopped working even after a master was back, according to the /healthz endpoint, but that was a misunderstanding on my part.
Removing the readinessProbe is up to you. For sure, "/healthz" needs improvement. So the underlying problem is that "/healthz" never recovers after a problem with the apiserver (like during an upgrade). This certainly should not happen; it is really surprising to me, but I never depended on "/healthz". Could you provide Metrics Server logs generated during a master upgrade?
Hi @serathius, as a new contributor, is this bug something that I can pick up?
Hey @hanumanthan, have you looked into other issues labeled "help wanted"? You can also ping me on Slack to do some bug triaging and create some new-contributor-friendly tasks.
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Anyone mind if I reopen this issue? I don't think it's been fixed?
Probes were redesigned in #542 for the 0.4.0 release. PTAL
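For readers landing here later: #542 replaced the single /healthz check with separate /livez and /readyz endpoints. A minimal sketch of what that split looks like in a Deployment, with illustrative port name and thresholds (check the released 0.4.0 manifest for the exact values):

```yaml
# Sketch of the 0.4.0-style probe split (values illustrative):
# liveness checks /livez, readiness checks /readyz, so a failed
# readiness state no longer sticks the way /healthz did.
livenessProbe:
  httpGet:
    path: /livez
    port: https   # assumes a named container port "https"
    scheme: HTTPS
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: https
    scheme: HTTPS
  periodSeconds: 10
  failureThreshold: 3
```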
What happened:
We're running metrics-server v0.3.6 with a readinessProbe:
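The probe YAML was not preserved in this copy of the thread; a v0.3.6-era readinessProbe of roughly this shape (all values illustrative) matches the description:

```yaml
# Roughly the shape of the probe described above; the original
# YAML from the report was not preserved, so values are illustrative.
readinessProbe:
  httpGet:
    path: /healthz
    port: 443
    scheme: HTTPS
  initialDelaySeconds: 20
  periodSeconds: 10
  failureThreshold: 3
```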
We have Kubernetes clusters provisioned by kops, currently on versions 1.16.9 and 1.17.6, with 3 masters each.
During a rolling update of the masters instance group, the metrics-server often flaps to red (readinessProbe failed) and stays red. This means the rolling update stops, because the metrics-server (being in the kube-system namespace) is not ready, until you manually kill the pod.
A livenessProbe would help, but it would only disguise the problem.
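As a stopgap, a livenessProbe of roughly this shape (illustrative values, not taken from the upstream manifests) would have the kubelet restart the container automatically instead of requiring a manual kill, at the cost of hiding the underlying /healthz bug:

```yaml
# Stopgap only: restart the container when /healthz stays red.
# Values are illustrative, not from the upstream manifests.
livenessProbe:
  httpGet:
    path: /healthz
    port: 443
    scheme: HTTPS
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 5
```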
What you expected to happen:
Metrics Server should NOT break during a rolling update of a master.
Furthermore, the provided Metrics Server deployment resources should contain a readinessProbe, so other people actually notice when it breaks.
/kind bug