
Delay allocation #39

Merged
merged 5 commits into cloudfoundry-community:develop on Feb 28, 2017

Conversation

jmcarp
Member

@jmcarp jmcarp commented Feb 17, 2017

As @cnelson described in cloud-gov/product#673 (comment), restarting the cluster can lead to outages for a few reasons:

  • Since the monit health check on elastic verifies that the elastic process is running but not that the node has joined the cluster, bosh can start deploying the next node before the previous node has successfully joined the cluster.
  • Since elastic defaults to a one-minute node timeout before reallocating shards, and restarting a node might take more than one minute, the cluster can frantically start reallocating shards for no reason.
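The node timeout the second bullet refers to is, in stock Elasticsearch, the `index.unassigned.node_left.delayed_timeout` setting. A drain script could raise it along these lines (a hedged sketch; the host and the ten-minute value are illustrative, not taken from the release):

```shell
# Sketch: raise the delayed-allocation timeout before restarting a node, so the
# cluster waits for the node to return instead of immediately reallocating its
# shards. Host and timeout values are illustrative.
curl -X PUT -s "http://localhost:9200/_all/_settings" \
  -d '{"settings": {"index.unassigned.node_left.delayed_timeout": "10m"}}'
```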

This patch addresses both issues. We added a post-start script that blocks until the node is listening on port 9200 (and, for data nodes, until the cluster is healthy). We also optionally increase the node timeout on drain and restore it on post-deploy, to avoid "shard shuffle". This also means we no longer need to rely on shard routing settings to keep the cluster healthy during restarts: elastic/elasticsearch#19739.
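The post-start wait described here could be sketched roughly as follows (an illustrative sketch, not the release's actual script; the host, port, timeout, and `IS_DATA_NODE` variable are assumptions):

```shell
#!/bin/bash
# Hypothetical post-start sketch: block until a probe command succeeds,
# or give up after a timeout. All values below are illustrative.

# Retry the given probe command until it succeeds, failing after $1 seconds.
wait_until() {
  local remaining="$1"; shift
  until "$@"; do
    if [ "$remaining" -le 0 ]; then
      echo "timed out waiting for: $*" >&2
      return 1
    fi
    sleep 1
    remaining=$((remaining - 1))
  done
}

# In a real post-start the probes would be along these lines:
#   wait_until 300 curl -sf "http://localhost:9200/" -o /dev/null
# and, on data nodes only, additionally wait for cluster health:
#   [ "$IS_DATA_NODE" = "true" ] && wait_until 300 curl -sf \
#     "http://localhost:9200/_cluster/health?wait_for_status=green" -o /dev/null
```

Polling with a bounded retry loop, rather than a single long request, keeps the script robust while the node is still starting up and not yet accepting connections.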

@axelaris
Member

Hi @jmcarp,
I believe this is a good idea.
However, I can see at least one potential issue here. When upgrading a cluster with this patch, each node will be drained using the old version of the drain script, which disables routing allocation. Since the new post-start script does not enable it again, the cluster upgrade will fail :-(

@jmcarp
Member Author

jmcarp commented Feb 20, 2017

Good point. How about leaving the routing allocation change in post-start for now, with a TODO to drop it in a future release?
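The re-enable call being proposed would look roughly like this (a sketch only; the JSON body is the standard Elasticsearch allocation setting, and the hard-coded host stands in for the `elasticsearch.master_hosts` property the release interpolates):

```shell
# Sketch: re-enable shard allocation that the old drain script left disabled,
# so upgrades from the previous release can complete. Host is illustrative.
curl -X PUT -s "http://10.0.0.1:9200/_cluster/settings" \
  -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'
```

Using a `transient` setting means the override does not survive a full cluster restart, which is the desired behavior for a temporary compatibility shim.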

@cnelson
Contributor

cnelson commented Feb 24, 2017

@axelaris @Infra-Red Any more feedback on this PR (and on #38)?

@@ -5,6 +5,11 @@ set -e
out=$(mktemp health-XXXXXX)
remaining=<%= p("elasticsearch.health.timeout") %>

# Ensure shard allocation is enabled for updates from previous release
# TODO: Deprecate on next release
curl -X PUT -s <%= p('elasticsearch.master_hosts').first %>:9200/_cluster/settings
Member

@axelaris axelaris Feb 26, 2017

It seems that you missed a `\` at the end of the line :-(

Member Author

Fixed!

@axelaris
Member

Hi @cnelson,
I just checked it and found a small issue in the code. Please see my comment.

@axelaris
Member

Now everything seems to be working OK.
Could you please squash your commits for better tracking?

@jmcarp
Member Author

jmcarp commented Feb 27, 2017

If you want this squashed into a single commit, the GitHub "Squash and merge" button should do it.

@axelaris axelaris merged commit e990bb3 into cloudfoundry-community:develop Feb 28, 2017
@axelaris
Member

OK, it seems that's working.
Thank you @jmcarp!

hannayurkevich pushed a commit to hannayurkevich/logsearch-boshrelease that referenced this pull request Mar 7, 2017
* Optionally delay allocation during restart.

* Wait for nodes to rejoin post-start.

* Move delay allocation restore to post-start.

h/t @cnelson

* Check local state if not data-only.

* Ensure shard allocation so that upgrades succeed.

h/t @axelaris
@geofffranks
Contributor

Can we get a new final release built with these improvements?

@axelaris
Member

Hi @geofffranks,
we're working on the v5 release, which will include this PR.

axelaris added a commit that referenced this pull request May 19, 2020