Outage-free logsearch updates #673

Closed · 1 of 3 tasks
mogul opened this issue Jan 21, 2017 · 10 comments


mogul commented Jan 21, 2017

In order to eliminate outages in our logs service, we want to support rolling restarts after logsearch updates without triggering recovery conditions.

Acceptance criteria

  • Replicate production volume characteristics to facilitate investigations for this and future improvements without disturbing production
  • The cluster stays yellow / green during a stemcell update, and smoke tests pass immediately afterwards
  • Alerts are not generated for the cluster being in a yellow state while a deployment is in progress

cnelson commented Jan 27, 2017

During the ES maintenance windows this week I've been monitoring the cluster while it deploys, and there are several overlapping issues that, in combination, produce the failure we've been seeing: all shards get marked unassigned and the cluster becomes unavailable. I believe addressing these issues will let us do rolling updates without downtime.

Nodes leave the cluster and are not flagged unhealthy by BOSH

This happens for several reasons, some of which we can address:

Nodes run out of file descriptors and stop responding

Nodes run out of memory and stop responding (a quick check for both conditions is sketched after this list)

Connectivity

  • This one is harder to diagnose: logs show nodes being unable to talk to each other (SendRequestTransportException, NodeNotConnectedException), but testing with netcat / curl shows the network to be fine.
  • We may want to have collectd monitor connectivity between nodes so we can gather data on whether there is an underlying networking issue. cc: Alert on any collectd metric via bosh manifest #689
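
For reference, both resource conditions above can be spotted from any node with the node stats API. A minimal sketch, assuming ES listens on the default port (the exact stats fields vary slightly between ES versions):

  # Show open file descriptors and JVM heap usage for every node in the cluster
  curl -s 'http://localhost:9200/_nodes/stats/process,jvm?pretty' \
    | grep -E 'open_file_descriptors|heap_used_percent'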

In all of these scenarios the ES process is still running, so monit takes no action. We may want to add an HTTP health check to monit on localhost:9200/_cluster/health so it restarts ES when it stops responding and leaves the cluster.
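
For context, that endpoint is the standard cluster health API (assuming the default port); a healthy node answers immediately, while a wedged node times out or refuses the connection, which is exactly the condition a monit check could act on:

  # Healthy: returns quickly with {"status":"green"|"yellow",...}; wedged: times out
  curl -s -m 15 'http://localhost:9200/_cluster/health?pretty'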

The rolling restart may be taking down multiple nodes at the same time

Monit looks for a pid file to determine whether ES has started, but that file is written almost immediately after the JVM starts, and it can be several minutes before ES has actually rejoined the cluster and is really "running".

  • We should add a post-start script that waits for ES to rejoin the cluster; see the sketch after this list. UAA does something similar since it takes so long to really start.
    • This script should be smart enough to:
      • Wait for the node to rejoin the cluster and halt the deploy if it does not within a given time
      • Re-enable allocation when run on data nodes
      • Possibly increase recovery settings to a large number '{"transient":{"cluster.routing.allocation.cluster_concurrent_rebalance":"128", "cluster.routing.allocation.node_concurrent_recoveries":"128", "indices.recovery.concurrent_streams": "256", "indices.store.throttle.type": "none"}}'
      • Monitor cluster health until state is green and shard allocation is complete
      • Return recovery settings to normal
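
A rough sketch of what such a post-start script could look like (an illustration only: the URL, retry counts, and timeouts are placeholder assumptions, not a final implementation):

  #!/bin/bash
  # Illustrative post-start sketch: wait for this node to rejoin, speed up recovery,
  # then block until the cluster is green again and restore the default settings.
  set -e

  ES_URL="http://localhost:9200"   # assumption: ES on the default port

  # 1. Wait (up to ~10 minutes) for the local node to answer the health API at all;
  #    if it never does, fail the script so BOSH halts the deploy on this node.
  for i in $(seq 1 120); do
    curl -sf -m 5 "${ES_URL}/_cluster/health" > /dev/null && break
    sleep 5
  done
  curl -sf -m 5 "${ES_URL}/_cluster/health" > /dev/null || exit 1

  # 2. Re-enable allocation (assuming drain disabled it) and loosen recovery
  #    throttling, using the transient settings from the list above.
  curl -sf -XPUT "${ES_URL}/_cluster/settings" -d '{
    "transient": {
      "cluster.routing.allocation.enable": "all",
      "cluster.routing.allocation.cluster_concurrent_rebalance": "128",
      "cluster.routing.allocation.node_concurrent_recoveries": "128",
      "indices.recovery.concurrent_streams": "256",
      "indices.store.throttle.type": "none"
    }
  }'

  # 3. Block until the cluster reports green and shard allocation has settled,
  #    or give up after 30 minutes and fail the deploy.
  curl -sf -m 1800 "${ES_URL}/_cluster/health?wait_for_status=green&timeout=30m" \
    | grep -q '"timed_out":false' || exit 1

  # 4. Return the recovery settings to normal by clearing the transient overrides.
  curl -sf -XPUT "${ES_URL}/_cluster/settings" -d '{
    "transient": {
      "cluster.routing.allocation.cluster_concurrent_rebalance": null,
      "cluster.routing.allocation.node_concurrent_recoveries": null,
      "indices.recovery.concurrent_streams": null,
      "indices.store.throttle.type": null
    }
  }'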

The cluster does not wait long enough for nodes to rejoin

index.unassigned.node_left.delayed_timeout defaults to 1m in ES, which is not enough time for a data node to update and rejoin the cluster. This causes all primary shards on a data node to fail over to their replicas (which then triggers new replicas to be created) whenever an update is performed.

  • When a deploy starts we should set index.unassigned.node_left.delayed_timeout on all indices to something large enough to cover a stemcell update, and then return it to a more sane value in post-deploy (see the sketch below)
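
For reference, that timeout can be changed on all indices with a single settings call; a sketch where the 30m pre-deploy value and the 5m post-deploy value are illustrative guesses rather than agreed numbers:

  # Pre-deploy: give restarting data nodes plenty of time before their shards are reassigned
  curl -XPUT 'http://localhost:9200/_all/_settings' -d '
  { "settings": { "index.unassigned.node_left.delayed_timeout": "30m" } }'

  # Post-deploy: bring the timeout back down
  curl -XPUT 'http://localhost:9200/_all/_settings' -d '
  { "settings": { "index.unassigned.node_left.delayed_timeout": "5m" } }'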

Discovery is incorrectly configured

When a master leaves the cluster (due to an update, or a crash), the remaining nodes may not agree on a replacement master, and a split-brain scenario can occur.

  • elasticsearch.discovery.minimum_master_nodes should be set to 2
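
(For context, the usual quorum rule is minimum_master_nodes = (master-eligible nodes / 2) + 1, so a value of 2 assumes three master-eligible nodes. A sketch of applying it to the live cluster while the manifest change rolls out, assuming the setting is dynamic in our ES version:)

  # Quorum of 2 for three master-eligible nodes; the manifest property makes it permanent
  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
  { "persistent": { "discovery.zen.minimum_master_nodes": 2 } }'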

@jmcarp @rogeruiz WDYT?


cnelson commented Feb 1, 2017

Upstream is unstable

Upstream frequently commits breaking changes without notice. This has happened twice in the last two weeks: cloud-gov/deploy-logsearch#51, cloud-gov/deploy-logsearch#46


jmcarp commented Feb 13, 2017

For a better health check, I was testing the following on staging:

check host elasticsearch with address localhost
  start program "/var/vcap/jobs/elasticsearch/bin/monit_debugger elasticsearch_ctl '/var/vcap/jobs/elasticsearch/bin/elasticsearch_ctl start'" with timeout 120 seconds
  stop program "/var/vcap/jobs/elasticsearch/bin/monit_debugger elasticsearch_ctl '/var/vcap/jobs/elasticsearch/bin/elasticsearch_ctl stop'"
  if failed url http://localhost:9200/_cluster/health
    timeout 15 seconds
    for 5 cycles
  then restart
  group vcap

This works as expected. WDYT about sending a patch upstream with that logic, substituting templated variables for the constants? We could also include a spec variable that preserves the current monit behavior in case folks don't want to switch to HTTP health checks.


cnelson commented Feb 14, 2017

LGTM, but will this cause a deployment to fail, given that it uses then restart? If not, WDYT about using then stop instead?

The goal is to make sure that if a node fails to come back up during a rolling update, we fail the deployment; an operator can then fix it while only one node is affected and kick off a new deployment once the issue is resolved.

jmcarp added a commit to cloud-gov/deploy-logsearch that referenced this issue Feb 14, 2017

jmcarp commented Feb 15, 2017

I've been running into issues trying to get the monit health check to work in real life:

  • If we configure monit to stop the task after e.g. 3 consecutive failures, monit thinks the task is in a healthy state until the third failure: initializing -> healthy -> unhealthy.
  • When we use check host instead of check process, running monit restart seems to trigger the stop and start calls so close together that the start doesn't take effect.

So I'm thinking we should go with @cnelson's original plan of waiting for each node to join the cluster in post-start. I'd also be happy to pair on this and see if the monit HTTP check is viable after all--it's very possible that I'm missing something obvious here.


jmcarp commented Feb 17, 2017

Let's agree on criteria for outage-free updates. How about: the cluster stays yellow / green during a stemcell update, and smoke tests pass immediately afterwards?


cnelson commented Feb 17, 2017

I also think that, as part of this effort, we need to tune the "Service elasticsearch health is in state warning on host..." riemann alerts to match our acceptable time for the cluster to be in the yellow state, so we don't get a bunch of open issues every time we do a rolling restart.


jmcarp commented Feb 21, 2017

We sent a handful of pull requests upstream to resolve issues that @cnelson identified:

[merged] cloudfoundry-community/logsearch-boshrelease#34
[merged] cloudfoundry-community/logsearch-boshrelease#38
[merged] cloudfoundry-community/logsearch-boshrelease#39

We also made a few changes to our own repos:
[merged] cloud-gov/cg-riemann-boshrelease#69
[blocked] cloud-gov/deploy-logsearch#53

Once these are in good shape and merged, we'll run another stemcell update and verify the ACs.


rogeruiz commented Mar 3, 2017

Upstream merged our PR; there's one config change left to do, and then we'll redeploy on staging.


cnelson commented Mar 7, 2017

Confirmed during the maintenance window that the deploy works as expected.

Rolled out monitoring updates here: https://ci.fr.cloud.gov/teams/main/pipelines/deploy-monitoring/jobs/deploy-monitoring-production/builds/162

We may need to tune the timeouts further as we gather more data, but that should be handled as white-noise support work, so I'm calling this done.

cnelson closed this as completed Mar 8, 2017