
Size govcloud production logsearch deployment #181

Open
1 of 2 tasks
jmcarp opened this issue Oct 27, 2016 · 7 comments

Comments


jmcarp commented Oct 27, 2016

Since we don't have many tenants on govcloud yet, our logsearch deployment is smaller than on east, with only two data nodes (east is currently using ten). If the east deployment needs ten data nodes, we'll probably want to add more nodes to govcloud before asking more tenants to migrate.

I'm guessing this might be interesting to @datn, and I'm guessing @LinuxBozo or @sharms was involved in setting up the original cluster.

Acceptance criteria:

  • Production logsearch in govcloud is running a sensible number and type of nodes, based on our experience on east
  • We have alerts that represent our heuristics for when and how to scale up in the future
@jmcarp jmcarp added the Atlas label Oct 27, 2016
@mogul mogul added SkyPorter and removed SkyPorter labels Dec 21, 2016

cnelson commented Jan 11, 2017

This likely needs some attention in the near term.

Today we had an outage when the Elasticsearch master began reporting out-of-memory errors. We bumped its instance size to quadruple the available RAM as a temporary fix. However, even with the upgraded instance, recovering the cluster from a bad state still took 1.5 hours to return to green, which IMO is far too long for a production system.
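
For context on recovery monitoring, a minimal sketch (not the tooling we actually used) of polling cluster health until green and checking heap pressure on the elected master; the endpoint URL is a placeholder:

```python
# Sketch only: poll _cluster/health until green and report master heap usage.
# ES_URL is a hypothetical endpoint, not our deployment's address.
import time
import requests

ES_URL = "http://localhost:9200"

def wait_for_green(poll_seconds=30):
    """Poll _cluster/health, printing progress, until the cluster is green."""
    while True:
        health = requests.get(f"{ES_URL}/_cluster/health").json()
        print(f"status={health['status']} "
              f"initializing={health['initializing_shards']} "
              f"unassigned={health['unassigned_shards']}")
        if health["status"] == "green":
            return
        time.sleep(poll_seconds)

def master_heap_percent():
    """Return JVM heap usage (%) on the elected master via _nodes/_master/stats."""
    stats = requests.get(f"{ES_URL}/_nodes/_master/stats/jvm").json()
    node = next(iter(stats["nodes"].values()))
    return node["jvm"]["mem"]["heap_used_percent"]

if __name__ == "__main__":
    print(f"master heap: {master_heap_percent()}%")
    wait_for_green()
```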

@LinuxBozo

@cnelson master, as in single? There is currently IIRC a 3-master cluster in e/w. So yeah, between that and data nodes, we should revisit promptly.
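
For anyone double-checking the topology, a quick sketch (the endpoint is a placeholder) that counts master-eligible nodes via the _cat/nodes API:

```python
# Sketch only: list master-eligible nodes. ES_URL is a hypothetical endpoint.
import requests

ES_URL = "http://localhost:9200"

def master_eligible_nodes():
    """Return names of nodes whose role string includes 'm' (master-eligible)."""
    rows = requests.get(
        f"{ES_URL}/_cat/nodes",
        params={"format": "json", "h": "name,node.role,master"},
    ).json()
    return [row["name"] for row in rows if "m" in row["node.role"]]

print(master_eligible_nodes())
```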


mogul commented Jan 21, 2017

This issue was about the sizing in particular. I think this is what @rogeruiz is working on this week, right? In which case, this should be In Progress... Moving it there now.

@mogul mogul changed the title Size govcloud production elasticsearch deployment Size govcloud production logsearch deployment Jan 23, 2017

cnelson commented Jan 30, 2017

I put together a bare-bones calculator which may be helpful when thinking about the various bits of data that go into making sizing decisions for Elasticsearch: some of them we can control (index strategy, sharding strategy, instance sizing) and some we cannot (retention requirements, volume of data per day).
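
The arithmetic behind the calculator is roughly the following; every number here is a placeholder assumption, not a measurement from our cluster:

```python
# Back-of-the-envelope sizing, in the spirit of the calculator linked above.
# All inputs below are placeholder assumptions, not measured values.

daily_volume_gb = 50      # log volume ingested per day
retention_days = 180      # how long we must keep data
indices_per_day = 1       # index-per-day strategy
shards_per_index = 5      # primary shards per index
replicas = 1              # replica copies per primary
target_shard_gb = 30      # desired upper bound on shard size

total_data_gb = daily_volume_gb * retention_days * (1 + replicas)
total_shards = indices_per_day * shards_per_index * (1 + replicas) * retention_days
avg_primary_shard_gb = daily_volume_gb / (indices_per_day * shards_per_index)

print(f"total data on disk: {total_data_gb} GB")
print(f"total shards in cluster: {total_shards}")
print(f"average primary shard size: {avg_primary_shard_gb:.1f} GB "
      f"(target <= {target_shard_gb} GB)")
```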

After playing with this for a bit, I think we should do the following to size our cluster appropriately and fix the stability issues we've been seeing:

  • Change our indexing strategy back to index-per-day (currently we are index-per-space-per-day)
  • Leave our sharding strategy as is
  • Reindex old data to index-per-day in order to bring our total number of shards down to a reasonable size (see the sketch after this list)
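
The sketch referenced above, using the _reindex API; the index name patterns are assumptions about our naming scheme, not confirmed values:

```python
# Sketch only: merge per-space daily indices into a single daily index.
# ES_URL and the index patterns are hypothetical.
import requests

ES_URL = "http://localhost:9200"

def reindex_day(day):
    """Reindex all logs-<space>-<day> indices into one logs-<day> index."""
    body = {
        "source": {"index": f"logs-*-{day}"},  # assumed per-space-per-day pattern
        "dest": {"index": f"logs-{day}"},      # assumed per-day target
    }
    resp = requests.post(
        f"{ES_URL}/_reindex",
        params={"wait_for_completion": "false"},  # run as a background task
        json=body,
    )
    return resp.json()["task"]  # poll this id via the _tasks API

print(reindex_day("2017.01.30"))
```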


cnelson commented Jan 30, 2017

We are moving forward with the reindexing steps described above:

  • @rogeruiz and @rememberlenny are working on an upstream PR to expose indexing strategy as an option.
  • @jmcarp and @cnelson are working on determining the fastest way to reindex and starting that process.


cnelson commented Feb 2, 2017

"We have alerts that represent our heuristics for when and how to scale up in the future"

I'm unsure what kind of alerts we need / want for this AC.

If we are concerned about cluster health, I think we have those already: we'll be alerted if CPU/disk/memory goes above thresholds on our data nodes.

If we are concerned about needing to adjust our sharding strategy as our log volume increases over time to maintain search performance, perhaps we should add some timeouts to the queries in check-logs.sh and alert if we don't receive a response in that window?

That would be an indicator that our log volume is growing, and that we need to up the number of shards per index going forward.
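
Something along these lines, as a hedged sketch rather than a change to check-logs.sh itself; the endpoint, index pattern, query, and threshold are all placeholder assumptions:

```python
# Sketch only: run a representative query and alert if it doesn't complete
# within a window. ES_URL, the index pattern, and the threshold are placeholders.
import sys
import requests

ES_URL = "http://localhost:9200"
TIMEOUT_SECONDS = 10  # assumed acceptable response window

query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-15m"}}},
}

try:
    resp = requests.post(
        f"{ES_URL}/logs-*/_search", json=query, timeout=TIMEOUT_SECONDS
    )
    resp.raise_for_status()
    print(f"ok: search took {resp.json()['took']} ms")
except requests.exceptions.Timeout:
    # Exit non-zero so the surrounding check/alerting picks this up.
    print(f"ALERT: search exceeded {TIMEOUT_SECONDS}s; "
          "we may need more shards per index", file=sys.stderr)
    sys.exit(1)
```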

Thoughts?

@cnelson cnelson closed this as completed Feb 2, 2017
@cnelson cnelson reopened this Feb 2, 2017

cnelson commented Feb 3, 2017

After discussing yesterday, we've come up with the following plan for better alerting on when the cluster needs to scale:

Accept this story as-is to make WIP room to start on cloud-gov/product#673, which we need to complete before we add any new functionality to logsearch, so that we can iterate on it safely without risk of downtime.

Once that's completed, we've decided that we will get the best data on real-world query performance by using New Relic to monitor traffic from actual users, and as a bonus we'd get logsearch response times on statuspage for free: cloud-gov/product#693
