
Implement kubernetes probes #542

Closed
serathius opened this issue Jun 15, 2020 · 11 comments

serathius (Contributor) commented Jun 15, 2020

/kind feature
The goal of this issue is to create proper definitions of health and readiness for metrics server and to implement Kubernetes probes based on them.

Background

Code in metrics server runs the two following processes (a rough sketch of both follows below):

  • Periodically scrapes all the nodes in the cluster for metrics, which it uses to fill an internal cache
  • Serves the Metrics API by joining the list of nodes/pods fetched from the apiserver with the metrics stored in the cache
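For illustration only, a rough sketch of how these two processes interact; all names are hypothetical and synchronization of the shared cache is omitted:

```go
package sketch

import (
	"context"
	"time"
)

// nodeMetrics and cache are placeholders for the real internal types.
type nodeMetrics struct{ CPUNano, MemoryBytes int64 }
type cache map[string]nodeMetrics // keyed by node name

// scrapeLoop periodically collects metrics from every known node and stores
// them in the cache; failures of individual nodes are tolerated.
func scrapeLoop(ctx context.Context, period time.Duration, listNodes func() []string,
	scrapeNode func(string) (nodeMetrics, error), c cache) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, node := range listNodes() {
				if m, err := scrapeNode(node); err == nil {
					c[node] = m
				}
			}
		}
	}
}

// serveNodeMetrics answers a Metrics API request by joining the node list
// fetched from the apiserver with whatever the cache currently holds.
func serveNodeMetrics(nodesFromAPIServer []string, c cache) map[string]nodeMetrics {
	out := make(map[string]nodeMetrics, len(nodesFromAPIServer))
	for _, name := range nodesFromAPIServer {
		if m, ok := c[name]; ok {
			out[name] = m
		}
	}
	return out
}
```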

Some related problems:

  • To provide zero Metrics API downtime during a rolling update, a new instance of MS needs to have its cache filled.
  • On big clusters it's common for some requests to fail, meaning we cannot depend on all requests succeeding.
  • Before a change was introduced to force a sync on the node informer (a k8s client API compatible cache), the first scrape ran with an empty list of nodes, making MS ready before any scrape had been done.

Things to consider:

  • What probes to use? Readiness, Liveness, Startup
  • What endpoints to use? healthz, readyz, livez

Proposal

Definitions:

  • Metrics Server is alive if it is able to respond to HTTP requests and a scrape job finished (success not required) within the last period
  • Metrics Server is ready if it is able to respond to HTTP requests, was able to access the apiserver, and the last scrape job succeeded for at least one node

Reasoning:

  • alive should mean that all processes are working. For MS this is the HTTP server working + scraping executing once per period (protects against deadlock or panic)
  • ready should mean that MS is ready to serve traffic. This means that we have data to serve (the last job succeeded for at least one node) and that our dependencies are available (without the apiserver we cannot serve requests). We cannot require all scrapes to succeed, as in big clusters there is typically at least one failing node. (Both checks are sketched below.)
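A minimal sketch of the two definitions as code, assuming hypothetical state that the scrape loop records after every collection cycle:

```go
package sketch

import "time"

// scrapeState is hypothetical state the scrape loop would record after every
// collection cycle.
type scrapeState struct {
	lastCycleFinished  time.Time // when the last cycle completed, successful or not
	lastCycleAnyNodeOK bool      // whether the last cycle got metrics from at least one node
}

// alive: a scrape cycle finished within the last period (no success required);
// being able to answer the probe's HTTP request at all is implicit.
func alive(s scrapeState, period time.Duration, now time.Time) bool {
	return now.Sub(s.lastCycleFinished) <= period
}

// ready: the apiserver is reachable and the last cycle collected metrics for
// at least one node, so there is data to serve.
func ready(s scrapeState, apiserverReachable bool) bool {
	return apiserverReachable && s.lastCycleAnyNodeOK
}
```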

Things to consider:

  • We could also introduce some threshold for the number of succeeded scrapes (e.g. 50%), but this really depends on the types of workload in the cluster (we could consider adding a metric for it and allowing users to monitor and set this threshold themselves).
  • (not related to probes) The scraping job depends on the freshness of the list of nodes, so we could consider monitoring the freshness of node data. I read some PRs to kubernetes proposing to sync controllers before reconciling. There were some performance problems, but it would be interesting to explore.

TODO

We should update/add the implementation of livez and readyz based on the new definitions. Checking whether the apiserver is available should be as light as possible (e.g. check whether a TCP connection can be opened), as sketched below. In addition, we should add/update the probes; the default configuration (no delay, every 10s with a failure threshold of 3) looks ok for now.
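A sketch of such a lightweight check (names and values are hypothetical); it is written to match the Name()/Check(*http.Request) error method set of the generic apiserver's healthz.HealthChecker interface:

```go
package sketch

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// apiserverReachable only verifies that a TCP connection to the apiserver can
// be opened, keeping the readiness dependency check as light as possible.
type apiserverReachable struct {
	addr    string        // e.g. "kubernetes.default.svc:443" (illustrative)
	timeout time.Duration // e.g. 2 * time.Second (illustrative)
}

// Name and Check follow the healthz.HealthChecker method set from
// k8s.io/apiserver/pkg/server/healthz.
func (c apiserverReachable) Name() string { return "apiserver-reachable" }

func (c apiserverReachable) Check(_ *http.Request) error {
	conn, err := net.DialTimeout("tcp", c.addr, c.timeout)
	if err != nil {
		return fmt.Errorf("cannot open TCP connection to apiserver at %s: %v", c.addr, err)
	}
	return conn.Close()
}
```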

Alternatives

No need to implement a Startup probe, as we don't fall under https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#when-should-you-use-a-startup-probe

/cc @s-urbaniak

k8s-ci-robot added the kind/feature label (Categorizes issue or PR as related to a new feature.) Jun 15, 2020
serathius added this to the v0.4.0 milestone Jun 15, 2020
serathius changed the title from "Define metrics server health and ready condition" to "Implement kubernetes probes" Jun 21, 2020
serathius (Contributor, Author)

/help

k8s-ci-robot (Contributor)

@serathius:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the help wanted label (Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.) Jun 21, 2020
bytetwin (Contributor)

I am interested in picking this up.

@serathius Is my understanding correct?

We need to implement

  1. Liveness (/healthz) endpoint - a 200 response when there was a successful scrape in the last 30 sec
  2. Readiness (/readyz) endpoint - a 200 response when MS has started up, made a successful scrape, and has retrieved metrics for at least one node.

Questions

  1. For liveness, is a scrape in the last 30 sec considered a success? Is this time limit configurable?
  2. For readiness, is it a success even if only one node was scraped? Is this also configurable?
  3. API server availability - if the api-server becomes unavailable long after MS has started, the liveness check will fail, as there is no successful scrape in the last 30 sec. K8s restarting MS won't solve the problem. How should we handle this?

serathius (Contributor, Author) commented Jun 22, 2020

/cc @logicalhan
To get some feedback from someone who worked on apiserver health endpoints

logicalhan (Contributor)

/cc @logicalhan
To get some feedback from someone who worked on apiserver health endpoints

Oh thanks :)

I am interested in picking this up.

@serathius Is my understanding correct?

We need to implement

  1. Liveness (/healthz) endpoint - a 200 response when there was a successful scrape in the last 30 sec

"/healthz" is actually deprecated in favor of "/livez" now.

  1. For liveness, is a scrape in the last 30 sec considered a success? Is this time limit configurable?

Yes, there are standard kubernetes configuration options. You can set initialDelaySeconds to specify how long you expect the boot sequence to take (i.e. probing won't start until this time elapses). You can set PeriodSeconds to configure how often to poll and FailureThreshold for how many failures in a row you require before the kubelet will restart the container.
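As an illustration (not the actual manifest from this repo), a liveness probe with those knobs could look like the following, using the k8s.io/api/core/v1 Go types. The embedded handler field is named ProbeHandler in recent API versions, and the path and port values here are only placeholders:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// livenessProbe illustrates the knobs discussed above: no initial delay,
// polled every 10 seconds, and restarted by the kubelet after 3 consecutive
// failures.
var livenessProbe = corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{
			Path:   "/livez",             // placeholder endpoint
			Port:   intstr.FromInt(4443), // placeholder port
			Scheme: corev1.URISchemeHTTPS,
		},
	},
	InitialDelaySeconds: 0,  // start probing as soon as the container starts
	PeriodSeconds:       10, // poll every 10 seconds
	FailureThreshold:    3,  // restart after 3 consecutive failures
}
```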

  1. For readiness, is it a success even if only one node was scraped? Is this also configurable?

This is configurable as well, but generally you want readiness endpoints to fail fast, since it's usually used for load-balancing.

  1. API server availability - if the api-server becomes unavailable long after MS has started, the liveness check will fail, as there is no successful scrape in the last 30 sec. K8s restarting MS won't solve the problem. How should we handle this?

The apiserver doesn't scrape the metrics-server liveness/readiness probes; this is actually a function of the kubelet.

bytetwin (Contributor) commented Jun 23, 2020

The apiserver doesn't scrape the metrics-server liveness/readiness probes; this is actually a function of the kubelet.

The design mentions that /readyz should have a hard dependency on the api-server. If, after MS becomes ready (returns 200 for /readyz), the api server becomes unavailable at some point in the future, then livez will always return an error, as no scrape is successful. In this case the kubelet will restart MS, which won't become ready as the hard dependency is unavailable. This is my understanding; I want to clarify it.

Wondering if the above point is relevant at all

serathius (Contributor, Author)

The apiserver doesn't scrape the metrics-server liveness/readiness probes; this is actually a function of the kubelet.

The design mentions that /readyz should have a hard dependency on the api-server. If, after MS becomes ready (returns 200 for /readyz), the api server becomes unavailable at some point in the future, then livez will always return an error, as no scrape is successful. In this case the kubelet will restart MS, which won't become ready as the hard dependency is unavailable. This is my understanding; I want to clarify it.

Wondering if the above point is relevant at all

Liveness of a container should not depend on external services (assuming the container can recover from network problems). The apiserver being down should not cause an MS restart (it behaves like that currently, and this is not correct). The definition proposed here requires at least one successfully scraped node, meaning that liveness should not fail when the apiserver is down, as long as we have discovered at least one node.

Possible cases when MS doesn't have nodes to scrape, causing the liveness check to fail:

  • Long apiserver downtime leads to the node list cache being cleared in MS (need to check whether this is possible)
  • All nodes in the cluster becoming unavailable
  • An externally run Metrics Server configured for an empty cluster (not supported)

We could change this behavior, but I'm a little worried about it making MS ready too early during startup. I would prefer to check whether those problems happen in real clusters before trying to solve them.

  1. Liveness (/healthz) endpoint - a 200 response when there was a successful scrape in the last 30 sec

"/healthz" is actually deprecated in favor of "/livez" now.

Thanks for the info, we should change the design to use /livez.

  1. For liveness, is a scrape in the last 30 sec considered a success? Is this time limit configurable?

Yes, there are standard kubernetes configuration options. You can set initialDelaySeconds to specify how long you expect the boot sequence to take (i.e. probing won't start until this time elapses). You can set PeriodSeconds to configure how often to poll and FailureThreshold for how many failures in a row you require before the kubelet will restart the container.

I think that Hanu meant the period of scraping metrics set in Metrics Server (not the probe period). The answer is yes, the scrape period can be changed via a command line flag. The liveness definition proposed here checks whether this operation is executed on time (+/- 50%, to avoid restarting due to temporary overload), as sketched below. The value of liveness is independent of the period of the probes that check it.
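In code, this tolerance is just the following (hypothetical name, derived from whatever scrape period is configured):

```go
package sketch

import "time"

// livenessDeadline: the last scrape cycle must have finished within the
// configured scrape period plus 50% slack, so that one slow cycle under
// temporary overload does not get the container restarted.
func livenessDeadline(scrapePeriod time.Duration) time.Duration {
	return scrapePeriod + scrapePeriod/2 // e.g. a 60s period gives a 90s deadline
}
```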

  1. For readiness, is it a success even if only one node was scraped? Is this also configurable?

This is configurable as well, but generally you want readiness endpoints to fail fast, since it's usually used for load-balancing.

For readiness there is slightly different wording (the last scrape, not a scrape within 30 s). We should check for success only in the last scrape attempt and not care about the time.

serathius (Contributor, Author)

Hey @hanumanthan,
This issue is part of the v0.4.0 milestone that we will be preparing to release. I think your work on probes is really important for the next release.
Will you be able to work on your PR to finish it?

bytetwin (Contributor)

Hi @serathius, I would love to, but given my personal circumstances I won't be able to work on this for a while.

serathius (Contributor, Author)

np, thanks for confirmation.

serathius (Contributor, Author)

Both probes were implemented. The readyz implementation skipped checking the apiserver; if that's a problem, we can revisit it in the future.
