Outage-free logsearch updates #673

Closed · 1 of 3 tasks
mogul opened this issue Jan 21, 2017 · 10 comments


mogul commented Jan 21, 2017

In order to eliminate outages in our logs service, we want to support rolling restarts after logsearch updates without triggering recovery conditions.

Acceptance criteria

  • Replicate production volume characteristics to facilitate investigations for this and future improvements without disturbing production
  • The cluster stays yellow / green during a stemcell update, and smoke tests pass immediately afterwards
  • Alerts are not generated for the cluster being in a yellow state while a deployment is in progress

cnelson commented Jan 27, 2017

During the ES maintenance windows this week I've been monitoring the cluster while it deploys, and there are several overlapping issues that, in combination, produce the failure we've been seeing: all shards get marked unassigned and the cluster becomes unavailable. I believe addressing these issues will let us do rolling updates without downtime.

Nodes leave the cluster and are not flagged unhealthy by BOSH

This happens for several reasons, some of which we can address:

Nodes run out of file descriptors and stop responding

Nodes run out of memory and stop responding (a quick check for both conditions is sketched after this list)

Connectivity

  • This one is harder to diagnose: logs show nodes being unable to talk to each other (SendRequestTransportException, NodeNotConnectedException), but testing with netcat / curl shows the network to be fine.
  • We may want to have collectd monitor connectivity between nodes so we can gather data on whether there is an underlying networking issue. cc: Alert on any collectd metric via bosh manifest #689
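
For reference, both resource conditions above can be spotted from any node with the node stats API. A minimal sketch, assuming ES listens on the default port (the exact stats fields vary slightly between ES versions):

  # Show open file descriptors and JVM heap usage for every node in the cluster
  curl -s 'http://localhost:9200/_nodes/stats/process,jvm?pretty' \
    | grep -E 'open_file_descriptors|heap_used_percent'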

In all of these scenarios the ES process is still running, so monit takes no action. We may want to add an HTTP health check to monit on localhost:9200/_cluster/health so it restarts ES when it stops responding and leaves the cluster.
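
For context, that endpoint is the standard cluster health API (assuming the default port); a healthy node answers immediately, while a wedged node times out or refuses the connection, which is exactly the condition a monit check could act on:

  # Healthy: returns quickly with {"status":"green"|"yellow",...}; wedged: times out
  curl -s -m 15 'http://localhost:9200/_cluster/health?pretty'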

The rolling restart may be taking down multiple nodes at the same time

Monit looks for a pid file to determine whether ES has started, but that file is written almost immediately after the JVM starts, and it can be several minutes before ES has actually rejoined the cluster and is really "running".

  • We should add a post-start script that waits for ES to rejoin the cluster; see the sketch after this list. UAA does something similar since it takes so long to really start.
    • This script should be smart enough to:
      • Wait for the node to rejoin the cluster and halt the deploy if it does not within a given time
      • Re-enable allocation when run on data nodes
      • Possibly increase recovery settings to a large number '{"transient":{"cluster.routing.allocation.cluster_concurrent_rebalance":"128", "cluster.routing.allocation.node_concurrent_recoveries":"128", "indices.recovery.concurrent_streams": "256", "indices.store.throttle.type": "none"}}'
      • Monitor cluster health until state is green and shard allocation is complete
      • Return recovery settings to normal
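
A rough sketch of what such a post-start script could look like (an illustration only: the URL, retry counts, and timeouts are placeholder assumptions, not a final implementation):

  #!/bin/bash
  # Illustrative post-start sketch: wait for this node to rejoin, speed up recovery,
  # then block until the cluster is green again and restore the default settings.
  set -e

  ES_URL="http://localhost:9200"   # assumption: ES on the default port

  # 1. Wait (up to ~10 minutes) for the local node to answer the health API at all;
  #    if it never does, fail the script so BOSH halts the deploy on this node.
  for i in $(seq 1 120); do
    curl -sf -m 5 "${ES_URL}/_cluster/health" > /dev/null && break
    sleep 5
  done
  curl -sf -m 5 "${ES_URL}/_cluster/health" > /dev/null || exit 1

  # 2. Re-enable allocation (assuming drain disabled it) and loosen recovery
  #    throttling, using the transient settings from the list above.
  curl -sf -XPUT "${ES_URL}/_cluster/settings" -d '{
    "transient": {
      "cluster.routing.allocation.enable": "all",
      "cluster.routing.allocation.cluster_concurrent_rebalance": "128",
      "cluster.routing.allocation.node_concurrent_recoveries": "128",
      "indices.recovery.concurrent_streams": "256",
      "indices.store.throttle.type": "none"
    }
  }'

  # 3. Block until the cluster reports green and shard allocation has settled,
  #    or give up after 30 minutes and fail the deploy.
  curl -sf -m 1800 "${ES_URL}/_cluster/health?wait_for_status=green&timeout=30m" \
    | grep -q '"timed_out":false' || exit 1

  # 4. Return the recovery settings to normal by clearing the transient overrides.
  curl -sf -XPUT "${ES_URL}/_cluster/settings" -d '{
    "transient": {
      "cluster.routing.allocation.cluster_concurrent_rebalance": null,
      "cluster.routing.allocation.node_concurrent_recoveries": null,
      "indices.recovery.concurrent_streams": null,
      "indices.store.throttle.type": null
    }
  }'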

The cluster does not wait long enough for nodes to rejoin

index.unassigned.node_left.delayed_timeout defaults to 1m in ES, which is not enough time for a data node to update and rejoin the cluster. This causes all primary shards on a data node to fail over to their replicas (which then triggers new replicas to be created) whenever an update is performed.

  • When a deploy starts we should set index.unassigned.node_left.delayed_timeout on all indices to something large enough to cover a stemcell update, and then return it to a more sane value in post-deploy (see the sketch below)
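
For reference, that timeout can be changed on all indices with a single settings call; a sketch where the 30m pre-deploy value and the 5m post-deploy value are illustrative guesses rather than agreed numbers:

  # Pre-deploy: give restarting data nodes plenty of time before their shards are reassigned
  curl -XPUT 'http://localhost:9200/_all/_settings' -d '
  { "settings": { "index.unassigned.node_left.delayed_timeout": "30m" } }'

  # Post-deploy: bring the timeout back down
  curl -XPUT 'http://localhost:9200/_all/_settings' -d '
  { "settings": { "index.unassigned.node_left.delayed_timeout": "5m" } }'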

Discovery is incorrectly configured

When a master leaves the cluster (due to an update, or a crash), the remaining nodes may not agree on a replacement master, and a split-brain scenario can occur.

  • elasticsearch.discovery.minimum_master_nodes should be set to 2
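
(For context, the usual quorum rule is minimum_master_nodes = (master-eligible nodes / 2) + 1, so a value of 2 assumes three master-eligible nodes. A sketch of applying it to the live cluster while the manifest change rolls out, assuming the setting is dynamic in our ES version:)

  # Quorum of 2 for three master-eligible nodes; the manifest property makes it permanent
  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
  { "persistent": { "discovery.zen.minimum_master_nodes": 2 } }'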

@jmcarp @rogeruiz WDYT?


cnelson commented Feb 1, 2017

Upstream is unstable

Upstream frequently commits breaking changes without notice. This has happened twice in the last two weeks: cloud-gov/deploy-logsearch#51, cloud-gov/deploy-logsearch#46


jmcarp commented Feb 13, 2017

For a better health check, I was testing the following on staging:

check host elasticsearch with address localhost
  start program "/var/vcap/jobs/elasticsearch/bin/monit_debugger elasticsearch_ctl '/var/vcap/jobs/elasticsearch/bin/elasticsearch_ctl start'" with timeout 120 seconds
  stop program "/var/vcap/jobs/elasticsearch/bin/monit_debugger elasticsearch_ctl '/var/vcap/jobs/elasticsearch/bin/elasticsearch_ctl stop'"
  if failed url http://localhost:9200/_cluster/health
    timeout 15 seconds
    for 5 cycles
  then restart
  group vcap

This works as expected. WDYT about sending a patch upstream with that logic, substituting templated variables for the constants? We could also include a spec variable that preserves the current monit behavior in case folks don't want to switch to HTTP health checks.


cnelson commented Feb 14, 2017

LGTM, but will this cause a deployment to fail, given that it uses then restart? If not, WDYT about using then stop instead?

The goal is to make sure that if a node fails to come back up during a rolling update, we fail the deployment; an operator can then fix it while only one node is affected and kick off a new deployment once the issue is resolved.

jmcarp added a commit to cloud-gov/deploy-logsearch that referenced this issue Feb 14, 2017

jmcarp commented Feb 15, 2017

I've been running into issues trying to get the monit health check to work in real life:

  • If we configure monit to stop the task after e.g. 3 consecutive failures, monit thinks the task is in a healthy state until the third failure: initializing -> healthy -> unhealthy.
  • When we use check host instead of check process, running monit restart seems to trigger the stop and start calls so close together that the start doesn't take effect.

So I'm thinking we should go with @cnelson's original plan of waiting for each node to join the cluster in post-start. I'd also be happy to pair on this and see if the monit HTTP check is viable after all--it's very possible that I'm missing something obvious here.


jmcarp commented Feb 17, 2017

Let's agree on criteria for outage-free updates. How about: the cluster stays yellow / green during a stemcell update, and smoke tests pass immediately afterwards?


cnelson commented Feb 17, 2017

I also think that, as part of this effort, we need to tune the "Service elasticsearch health is in state warning on host..." riemann alerts to match our acceptable time for the cluster to be in the yellow state, so we don't get a bunch of open issues every time we do a rolling restart.


jmcarp commented Feb 21, 2017

We sent a handful of pull requests upstream to resolve issues that @cnelson identified:

[merged] cloudfoundry-community/logsearch-boshrelease#34
[merged] cloudfoundry-community/logsearch-boshrelease#38
[merged] cloudfoundry-community/logsearch-boshrelease#39

We also made a few changes to our own repos:
[merged] cloud-gov/cg-riemann-boshrelease#69
[blocked] cloud-gov/deploy-logsearch#53

Once these are in good shape and merged, we'll run another stemcell update and verify the ACs.


rogeruiz commented Mar 3, 2017

Upstream merged our PR; there's one config change left to do, and then we'll redeploy on staging.


cnelson commented Mar 7, 2017

Confirmed during the maintenance window that the deploy works as expected.

Rolled out monitoring updates here: https://ci.fr.cloud.gov/teams/main/pipelines/deploy-monitoring/jobs/deploy-monitoring-production/builds/162

We may need to tune the timeouts further as we gather more data, but that should be handled as white-noise support work, so I'm calling this done.

cnelson closed this as completed Mar 8, 2017