/status endpoint returns 503 when any plugin is unavailable #110583

Closed
mshustov opened this issue Aug 31, 2021 · 8 comments · Fixed by #113729
Labels
discuss Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@mshustov
Contributor

mshustov commented Aug 31, 2021

Problem

#79012 added logic that makes the /api/status endpoint return 503 whenever any Kibana plugin has the unavailable status.
This logic affected the Beats module: elastic/beats#27036 (fixed in #109741 by lowering the status from unavailable to degraded).

@joshdover said this logic was added because it was the simplest way to tell an external observer whether Kibana is ready to serve traffic:

> the reason we made the change in that PR to start returning a 503 on the status endpoint is because we started serving this endpoint sooner (with the notReadyServer)
> so we didn’t want to start sending traffic to Kibana until Kibana finished starting up, so returning a 503 was the easiest way to do this.
>
> Maybe a better change would have been to only serve that 503 up until Kibana first becomes available
> and then after that always respond 200. Or only return 503 if any of the core services are unavailable or critical.

IMO the current logic seems to be too aggressive - a problem in a single plugin might block traffic for Kibana. Maybe it's more appropriate to return 503 only when Core functionality is unavailable?
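
To make the difference concrete, here is a minimal sketch that compares the behavior described above with the proposed one. The type and function names (`StatusLevel`, `StatusSnapshot`, `currentStatusCode`, `proposedStatusCode`) are hypothetical and are not Kibana's actual status service API. It also assumes that both `unavailable` and `critical` count as "down".

```typescript
// Hypothetical model of status levels; names are illustrative, not Kibana's API.
type StatusLevel = "available" | "degraded" | "unavailable" | "critical";

interface StatusSnapshot {
  core: Record<string, StatusLevel>; // e.g. elasticsearch, savedObjects
  plugins: Record<string, StatusLevel>; // one entry per plugin
}

// Assumption: "unavailable" and "critical" are both treated as down.
const isDown = (level: StatusLevel): boolean =>
  level === "unavailable" || level === "critical";

// Behavior described in this issue: any down plugin (or core service) yields 503.
function currentStatusCode(snapshot: StatusSnapshot): number {
  const levels = [
    ...Object.values(snapshot.core),
    ...Object.values(snapshot.plugins),
  ];
  return levels.some(isDown) ? 503 : 200;
}

// Proposed behavior: only core services can turn the response into a 503.
function proposedStatusCode(snapshot: StatusSnapshot): number {
  return Object.values(snapshot.core).some(isDown) ? 503 : 200;
}

// Usage: a single unavailable plugin while all core services are healthy.
const snapshot: StatusSnapshot = {
  core: { elasticsearch: "available", savedObjects: "available" },
  plugins: { reporting: "unavailable", security: "available" },
};
console.log(currentStatusCode(snapshot)); // 503
console.log(proposedStatusCode(snapshot)); // 200
```

With the proposed rule, a load balancer or test harness polling the endpoint would keep routing traffic when only a plugin is down.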

Sidenote

The original RFC contains contradictory statements:

> In both the critical and unavailable levels, all of a service's endpoints will return 503s.

but

> unavailable:
> All endpoints (with some exceptions in Core) in Kibana return a 503 Unavailable responses by default. This is automatic.

We should clarify the expected behavior in the status service documentation.

@mshustov mshustov added discuss Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc labels Aug 31, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@rudolf
Contributor

rudolf commented Aug 31, 2021

> Maybe it's more appropriate to return 503 only when Core functionality is unavailable?

+1

@pmuellr
Member

pmuellr commented Aug 31, 2021

> Maybe it's more appropriate to return 503 only when Core functionality is unavailable?

Seems like the best / simplest behavior change that would be helpful here.

@liza-mae
Contributor

Another case that was affected: https://github.com/elastic/elastic-stack-testing/issues/936

@cachedout
Contributor

This may be the root cause of elastic/apm-integration-testing#1188 which we originally filed with Kibana as #107300

If this is indeed the cause, it's been causing failures for test suites all across Observability.

@mshustov
Contributor Author

mshustov commented Sep 2, 2021

@cachedout we can prioritize the work on the fix. Do you experience the problem with Kibana master and the last releases?
Would you be able to test against master once the fix is landed?

@cachedout
Contributor

> we can prioritize the work on the fix.

That's great news! Thanks, @mshustov !

> Do you experience the problem with Kibana master and the last releases?

We have been testing against nightly snapshots in 7.x and master. We have seen the problem in master but I don't think we've seen it at all in 7.x.

> Would you be able to test against master once the fix is landed?

Yes. We consume nightly snapshots so we'd be able to test this for you as soon as a nightly snapshot is built. (We cannot replicate this locally at all so I think it is best to let this go through our regular CI pipeline before we declare it fixed.)

@pgayvallet
Contributor

> Maybe a better change would have been to only serve that 503 up until Kibana first becomes available
> and then after that always respond 200. Or only return 503 if any of the core services are unavailable or critical

+1 to only return 503 before Kibana becomes available and then only if any core service is unavailable/critical.
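
The combined rule can be sketched as a small stateful resolver. This is only an illustration under stated assumptions: `StatusCodeResolver` and its `resolve` method are hypothetical names, not part of Kibana, and "first becomes available" is interpreted here as the first time the overall status is neither `unavailable` nor `critical`.

```typescript
// Hypothetical model of status levels; names are illustrative, not Kibana's API.
type StatusLevel = "available" | "degraded" | "unavailable" | "critical";

// Assumption: "unavailable" and "critical" are both treated as down.
const isDown = (level: StatusLevel): boolean =>
  level === "unavailable" || level === "critical";

class StatusCodeResolver {
  // Flips to true the first time the overall status is not down.
  private startupComplete = false;

  resolve(overall: StatusLevel, coreLevels: StatusLevel[]): number {
    if (!this.startupComplete) {
      // During startup, any down status (core or plugin) keeps traffic away.
      if (isDown(overall)) {
        return 503;
      }
      this.startupComplete = true;
    }
    // After startup, only core services can produce a 503.
    return coreLevels.some(isDown) ? 503 : 200;
  }
}

// Usage: the sequence an external health check might observe.
const resolver = new StatusCodeResolver();
console.log(resolver.resolve("unavailable", ["available"])); // 503 (still starting)
console.log(resolver.resolve("available", ["available"])); // 200 (startup done)
console.log(resolver.resolve("unavailable", ["available"])); // 200 (plugin down later)
console.log(resolver.resolve("critical", ["critical"])); // 503 (core down)
```

Keeping the startup flag separate from the core check means a plugin that fails after startup no longer blocks traffic, while a slow start still does.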

7 participants