Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[concept] add livez/readyz for etcd #16008

Closed
wants to merge 5 commits into from

Conversation

logicalhan
Copy link

This is a prototype for adding livez/readyz support to etcd. Currently I've configured the NO_SPACE alarm to only count towards \readyz since it means etcd is degraded. Whether quorum should be included in liveness is an open question.

Change-Id: Ia440a82b2bf3d275b7cd7d88b5a6e86fe9fe1c28

Signed-off-by: Han Kang <hankang@google.com>
Change-Id: Ief9475a92429be58eb7b1f96246bbdb00e996e75
Change-Id: Ie699ae11d0ecc315b91365f85f0ac0b2d339c28d

Signed-off-by: Han Kang <hankang@google.com>
Change-Id: Iee6f469f63cb1fbcc22a4d633a621b7915a1a799
server/etcdserver/api/etcdhttp/health.go Outdated Show resolved Hide resolved
server/etcdserver/api/etcdhttp/health.go Outdated Show resolved Hide resolved
Signed-off-by: Han Kang <hankang@google.com>

Change-Id: I7e95be58cff6b7bc47fa3114249074a9f69a1620
Co-authored-by: Benjamin Wang <wachao@vmware.com>
Signed-off-by: Han Kang <hankang@google.com>
Change-Id: I45fda0a8ee7d80638af96fee4efb3bfdf2aebaf8
Change-Id: Ie5f02bba1a63f7592c6f3500db9070e6f1022df0
Signed-off-by: Han Kang <hankang@google.com>
Copy link
Member

@serathius serathius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't want to rush into adding livez/readyz probe. Main problem with existing health probe we just added it to have it without proper consideration.

I want livez to properly reflect fact that etcd needs restart, for example etcd is stuck on stalled storage https://docs.google.com/document/d/1U9hAcZQp3Y36q_JFiw2VBJXVAo2dK2a-8Rsbqv3GgDo/edit?usp=sharing.

Readyz should properly reflect fact that etcd is ready to serve traffic. Don't think alarms matter here. It's a degradation, however it doesn't mean we shouldn't serve reads.

TLDR; I would like to have a design written that will do a proper analysis etcd failure modes and propose matching probes to detect them. Example kubernetes-sigs/metrics-server#542

@@ -141,7 +169,7 @@ func getSerializableFlag(r *http.Request) bool {

// TODO: etcdserver.ErrNoLeader in health API

func checkAlarms(lg *zap.Logger, srv ServerHealth, excludedAlarms AlarmSet) Health {
func checkAlarms(lg *zap.Logger, srv ServerHealth, excludedAlarms AlarmSet, healthType string) Health {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The healthType string isn't used at all, can we remove it?

Suggested change
func checkAlarms(lg *zap.Logger, srv ServerHealth, excludedAlarms AlarmSet, healthType string) Health {
func checkAlarms(lg *zap.Logger, srv ServerHealth, excludedAlarms AlarmSet) Health {

@stale
Copy link

stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 17, 2023
@jmhbnz
Copy link
Member

jmhbnz commented Jan 4, 2024

Discussed during sig-etcd triage meeting. This original concept has now been superceeded by the work from @siyuanfoundation 🎉

@jmhbnz jmhbnz closed this Jan 4, 2024
@siyuanfoundation
Copy link
Contributor

the work is tracked in #16007

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

5 participants