Fix cluster health when closing #61709

henningandersen
Contributor

When the master shuts down its cluster service, a waiting health request
would fail rather than fail over to a new master.

Fails around once every month in CI:
https://build-stats.elastic.co/goto/49d044995e1b636bfc3cb7e5e2371f3a
For instance:
https://gradle-enterprise.elastic.co/s/r2rocz2i6tah2
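
To illustrate the failure mode described above, here is a minimal toy model. It is not the actual change in this PR; every name in it (`HealthFailoverSketch`, `ToyMaster`, `Waiter`, `run`) is hypothetical and no Elasticsearch class is used. It only contrasts failing a parked request when the master closes with re-sending it to the next master.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: all names here are hypothetical, not Elasticsearch classes.
final class HealthFailoverSketch {

    /** Callback for a request parked on a master while it waits for green. */
    interface Waiter {
        void onGreen(String masterName);
        void onMasterClosed();
    }

    /** A master that parks waiters until the cluster turns green or the master closes. */
    static final class ToyMaster {
        final String name;
        private final List<Waiter> waiters = new ArrayList<>();

        ToyMaster(String name) { this.name = name; }

        void awaitGreen(Waiter waiter) { waiters.add(waiter); }

        void turnGreen() { for (Waiter w : drain()) w.onGreen(name); }

        void close() { for (Waiter w : drain()) w.onMasterClosed(); }

        // Copy and clear first, so a callback may safely register on another master.
        private List<Waiter> drain() {
            List<Waiter> copy = new ArrayList<>(waiters);
            waiters.clear();
            return copy;
        }
    }

    /**
     * Sends a wait-for-green request to the old master, closes that master,
     * then lets the new master turn green.
     * failOver == false: the close is reported as a failure.
     * failOver == true: the request is re-sent to the new master.
     */
    static String run(boolean failOver) {
        ToyMaster oldMaster = new ToyMaster("master-1");
        ToyMaster newMaster = new ToyMaster("master-2");
        String[] result = new String[1];

        Waiter waiter = new Waiter() {
            @Override public void onGreen(String masterName) {
                result[0] = "green from " + masterName;
            }
            @Override public void onMasterClosed() {
                if (failOver) {
                    newMaster.awaitGreen(this); // retry against the new master
                } else {
                    result[0] = "failed: node closed";
                }
            }
        };

        oldMaster.awaitGreen(waiter);
        oldMaster.close();     // master shuts down while the request is still waiting
        newMaster.turnGreen(); // the new master eventually reports green
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // failed: node closed
        System.out.println(run(true));  // green from master-2
    }
}
```

In this sketch the only difference between the two runs is what the close callback does with a request that is still waiting.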

@henningandersen added the >bug, :Distributed/Distributed, v8.0.0, and v7.10.0 labels on Aug 31, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Distributed)

@elasticmachine added the Team:Distributed label on Aug 31, 2020
@henningandersen
Contributor Author

@elasticmachine run elasticsearch-ci/1

@henningandersen
Contributor Author

Failure should be fixed by #62061
@elasticmachine update branch

Contributor

@ywelsch left a comment

Change looks good. One question on the test

boolean withIndex = randomBoolean();
if (withIndex) {
// create index with many shards to provoke the health request to wait (for green) while master is being shut down.
createIndex("test", Settings.builder().put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, randomIntBetween(0, 10)).build());
Contributor

I'm confused. How can the cluster ever be green with that many replicas? Should this be number_of_shards?

Contributor Author

It ensures that the cluster is yellow when the health request is made, so the health request waits on the observer, which triggers the call to onClusterServiceClose when the master is shut down.

The number of replicas is reset to 0 after all the async requests have been fired and the master restarts are done. That ensures that all the requests respond with green status.
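
As a rough illustration of why the replica count controls the health color here, the following toy model may help. It is a sketch under stated assumptions, not Elasticsearch's allocation code: it assumes all primaries are assigned and that a replica is never placed on a node that already holds a copy of the same shard, so at most `dataNodes - 1` replicas can be assigned. The names `ReplicaHealthSketch` and `health`, and the cluster size of 3, are hypothetical.

```java
// Toy model only: hypothetical names, not the Elasticsearch allocation code.
final class ReplicaHealthSketch {

    enum Health { GREEN, YELLOW }

    /**
     * Assumes primaries are assigned and that a replica never shares a node
     * with another copy of the same shard, so at most (dataNodes - 1)
     * replicas per shard can be assigned.
     */
    static Health health(int dataNodes, int replicas) {
        return replicas <= dataNodes - 1 ? Health.GREEN : Health.YELLOW;
    }

    public static void main(String[] args) {
        int dataNodes = 3; // assumed cluster size, for illustration only
        System.out.println(health(dataNodes, 10)); // YELLOW: wait-for-green requests keep waiting
        System.out.println(health(dataNodes, 0));  // GREEN: waiting requests can complete
    }
}
```

Under these assumptions, more replicas than the nodes can hold keeps the index yellow, and dropping the replica count to 0 lets every waiting request finish with green.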

Contributor

Ah OK, can you add a comment to that effect?

Contributor

@ywelsch left a comment

LGTM

@henningandersen
Contributor Author

@elasticmachine run elasticsearch-ci/packaging-sample-windows

@henningandersen merged commit db1a137 into elastic:master on Sep 19, 2020
henningandersen added a commit to henningandersen/elasticsearch that referenced this pull request Sep 19, 2020
Labels
>bug, :Distributed/Distributed, Team:Distributed, v7.10.0, v8.0.0-alpha1