
Size govcloud production logsearch deployment #181

Open
1 of 2 tasks
jmcarp opened this issue Oct 27, 2016 · 7 comments

Comments


jmcarp commented Oct 27, 2016

Since we don't have many tenants on govcloud yet, our logsearch deployment is smaller than on east, with only two data nodes (east is currently using ten). If the east deployment needs ten data nodes, we'll probably want to add more nodes to govcloud before asking more tenants to migrate.

I'm guessing this might be interesting to @datn, and I'm guessing @LinuxBozo or @sharms was involved in setting up the original cluster.

Acceptance criteria:

  • Production logsearch in govcloud is running a sensible number and type of nodes, based on our experience on east
  • We have alerts that represent our heuristics for when and how to scale up in the future
@jmcarp jmcarp added the Atlas label Oct 27, 2016
@mogul mogul added SkyPorter and removed SkyPorter labels Dec 21, 2016

cnelson commented Jan 11, 2017

This likely needs some attention in the near term.

Today we had an outage when the Elasticsearch master began reporting out-of-memory errors. We bumped its instance size to quadruple the available RAM as a temporary fix. However, even with the upgraded instance, recovering the cluster from a bad state still took 1.5 hours to return to green, which IMO is far too long for a production system.
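
For context on recovery monitoring, a minimal sketch (not the tooling we actually used) of polling cluster health until green and checking heap pressure on the elected master; the endpoint URL is a placeholder:

```python
# Sketch only: poll _cluster/health until green and report master heap usage.
# ES_URL is a hypothetical endpoint, not our deployment's address.
import time
import requests

ES_URL = "http://localhost:9200"

def wait_for_green(poll_seconds=30):
    """Poll _cluster/health, printing progress, until the cluster is green."""
    while True:
        health = requests.get(f"{ES_URL}/_cluster/health").json()
        print(f"status={health['status']} "
              f"initializing={health['initializing_shards']} "
              f"unassigned={health['unassigned_shards']}")
        if health["status"] == "green":
            return
        time.sleep(poll_seconds)

def master_heap_percent():
    """Return JVM heap usage (%) on the elected master via _nodes/_master/stats."""
    stats = requests.get(f"{ES_URL}/_nodes/_master/stats/jvm").json()
    node = next(iter(stats["nodes"].values()))
    return node["jvm"]["mem"]["heap_used_percent"]

if __name__ == "__main__":
    print(f"master heap: {master_heap_percent()}%")
    wait_for_green()
```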

@LinuxBozo

@cnelson master, as in single? There is currently IIRC a 3-master cluster in e/w. So yeah, between that and data nodes, we should revisit promptly.
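
For anyone double-checking the topology, a quick sketch (the endpoint is a placeholder) that counts master-eligible nodes via the _cat/nodes API:

```python
# Sketch only: list master-eligible nodes. ES_URL is a hypothetical endpoint.
import requests

ES_URL = "http://localhost:9200"

def master_eligible_nodes():
    """Return names of nodes whose role string includes 'm' (master-eligible)."""
    rows = requests.get(
        f"{ES_URL}/_cat/nodes",
        params={"format": "json", "h": "name,node.role,master"},
    ).json()
    return [row["name"] for row in rows if "m" in row["node.role"]]

print(master_eligible_nodes())
```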


mogul commented Jan 21, 2017

This issue was about the sizing in particular. I think this is what @rogeruiz is working on this week, right? In which case, this should be In Progress... Moving it there now.

@mogul mogul changed the title Size govcloud production elasticsearch deployment Size govcloud production logsearch deployment Jan 23, 2017

cnelson commented Jan 30, 2017

I put together a bare-bones calculator which may be helpful when thinking about the various bits of data that go into making sizing decisions for Elasticsearch: some of them we can control (index strategy, sharding strategy, instance sizing) and some we cannot (retention requirements, volume of data per day).
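
The arithmetic behind the calculator is roughly the following; every number here is a placeholder assumption, not a measurement from our cluster:

```python
# Back-of-the-envelope sizing, in the spirit of the calculator linked above.
# All inputs below are placeholder assumptions, not measured values.

daily_volume_gb = 50      # log volume ingested per day
retention_days = 180      # how long we must keep data
indices_per_day = 1       # index-per-day strategy
shards_per_index = 5      # primary shards per index
replicas = 1              # replica copies per primary
target_shard_gb = 30      # desired upper bound on shard size

total_data_gb = daily_volume_gb * retention_days * (1 + replicas)
total_shards = indices_per_day * shards_per_index * (1 + replicas) * retention_days
avg_primary_shard_gb = daily_volume_gb / (indices_per_day * shards_per_index)

print(f"total data on disk: {total_data_gb} GB")
print(f"total shards in cluster: {total_shards}")
print(f"average primary shard size: {avg_primary_shard_gb:.1f} GB "
      f"(target <= {target_shard_gb} GB)")
```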

After playing with this for a bit, I think we should do the following to size our cluster appropriately and fix the stability issues we've been seeing:

  • Change our indexing strategy back to index-per-day (currently we are index-per-space-per-day)
  • Leave our sharding strategy as is
  • Reindex old data to index-per-day in order to bring our total number of shards down to a reasonable size (see the sketch after this list)
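
The sketch referenced above, using the _reindex API; the index name patterns are assumptions about our naming scheme, not confirmed values:

```python
# Sketch only: merge per-space daily indices into a single daily index.
# ES_URL and the index patterns are hypothetical.
import requests

ES_URL = "http://localhost:9200"

def reindex_day(day):
    """Reindex all logs-<space>-<day> indices into one logs-<day> index."""
    body = {
        "source": {"index": f"logs-*-{day}"},  # assumed per-space-per-day pattern
        "dest": {"index": f"logs-{day}"},      # assumed per-day target
    }
    resp = requests.post(
        f"{ES_URL}/_reindex",
        params={"wait_for_completion": "false"},  # run as a background task
        json=body,
    )
    return resp.json()["task"]  # poll this id via the _tasks API

print(reindex_day("2017.01.30"))
```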


cnelson commented Jan 30, 2017

We are moving forward with the reindexing steps described above:

  • @rogeruiz and @rememberlenny are working on an upstream PR to expose indexing strategy as an option.
  • @jmcarp and @cnelson are working on determining the fastest way to reindex and starting that process.


cnelson commented Feb 2, 2017

"We have alerts that represent our heuristics for when and how to scale up in the future"

I'm unsure what kind of alerts we need / want for this AC.

If we are concerned about cluster health, I think we have those already: we'll be alerted if CPU/disk/memory goes above thresholds on our data nodes.

If we are concerned about needing to adjust our sharding strategy as our log volume increases over time to maintain search performance, perhaps we should add some timeouts to the queries in check-logs.sh and alert if we don't receive a response in that window?

That would be an indicator that our log volume is growing, and that we need to up the number of shards per index going forward.
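
Something along these lines, as a hedged sketch rather than a change to check-logs.sh itself; the endpoint, index pattern, query, and threshold are all placeholder assumptions:

```python
# Sketch only: run a representative query and alert if it doesn't complete
# within a window. ES_URL, the index pattern, and the threshold are placeholders.
import sys
import requests

ES_URL = "http://localhost:9200"
TIMEOUT_SECONDS = 10  # assumed acceptable response window

query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-15m"}}},
}

try:
    resp = requests.post(
        f"{ES_URL}/logs-*/_search", json=query, timeout=TIMEOUT_SECONDS
    )
    resp.raise_for_status()
    print(f"ok: search took {resp.json()['took']} ms")
except requests.exceptions.Timeout:
    # Exit non-zero so the surrounding check/alerting picks this up.
    print(f"ALERT: search exceeded {TIMEOUT_SECONDS}s; "
          "we may need more shards per index", file=sys.stderr)
    sys.exit(1)
```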

Thoughts?

@cnelson cnelson closed this as completed Feb 2, 2017
@cnelson cnelson reopened this Feb 2, 2017

cnelson commented Feb 3, 2017

After discussing yesterday, we've come up with the following plan for better alerting on when the cluster needs to scale:

Accept this story as-is to make WIP room to start on cloud-gov/product#673, which we need to complete before we add any new functionality to logsearch, so that we can iterate on it safely without risk of downtime.

Once that's completed, we've decided that we will get the best data on real-world query performance by using New Relic to monitor traffic from actual users, and as a bonus we'd get logsearch response times on statuspage for free: cloud-gov/product#693
