
logs.fr.cloud.gov has no application logs #573

Closed
sharms opened this issue Dec 19, 2016 · 16 comments

sharms commented Dec 19, 2016

Application logs are not being sent to ElasticSearch

mogul commented Dec 19, 2016

Per @jmcarp we're getting logs now. However, we need to investigate how much we missed, make an announcement about that, and figure out why our monitoring didn't work.

@mogul mogul self-assigned this Dec 21, 2016

mogul commented Dec 21, 2016

We've made an announcement on statuspage and filled the hole in monitoring. @jmcarp is working on reimporting the logs, and estimates it's no more than about a day of careful work.
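
As a rough illustration of the kind of gap we filled in monitoring (not our actual monitoring code), here is a minimal sketch: poll Elasticsearch for the count of application logs indexed in the last few minutes and alert if it's zero. The host and index pattern are placeholders.

```python
import requests

ES_URL = "http://elasticsearch.example.internal:9200"  # placeholder host
INDEX_PATTERN = "logs-app-*"                           # assumed app-log index pattern

def recent_app_log_count(minutes=15):
    """Return how many app log documents were indexed in the last `minutes` minutes."""
    query = {"query": {"range": {"@timestamp": {"gte": "now-{}m".format(minutes)}}}}
    resp = requests.get("{}/{}/_count".format(ES_URL, INDEX_PATTERN), json=query)
    resp.raise_for_status()
    return resp.json()["count"]

if __name__ == "__main__":
    count = recent_app_log_count()
    if count == 0:
        # In a real check this would page the on-call; printing stands in for the alert.
        print("ALERT: no application logs indexed in the last 15 minutes")
    else:
        print("OK: {} application logs in the last 15 minutes".format(count))
```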

mogul commented Dec 21, 2016

Things we identified that could prevent this from happening again in future:

  • make our monitoring more robust (have done that)
  • add "not enough perms" error reporting upstream on logsearch @jmcarp
  • document the scopes needed for the client credential @cnelson
  • ensure those scopes are generated automatically whenever we update the secrets file @cnelson

(Chris, you're tagged on those latter two since you're doing that kind of work already this week; there's a rough sketch of a scope check below.)
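
To make the scope question concrete, here is a minimal sketch of a check that the logsearch ingestor client actually holds the authorities we expect, using UAA's client admin API. The UAA URL, client id, and expected authorities below are assumptions for illustration, not the documented requirements.

```python
import requests

UAA_URL = "https://uaa.example.gov"      # hypothetical UAA endpoint
CLIENT_ID = "logsearch-ingestor"         # hypothetical client name
# Assumed authorities for illustration only; the real list should come from the docs above.
EXPECTED = {"doppler.firehose", "cloud_controller.admin_read_only"}

def check_client_authorities(admin_token):
    """Raise if the client is missing any of the expected authorities."""
    resp = requests.get(
        "{}/oauth/clients/{}".format(UAA_URL, CLIENT_ID),
        headers={"Authorization": "Bearer {}".format(admin_token)},
    )
    resp.raise_for_status()
    granted = set(resp.json().get("authorities", []))
    missing = EXPECTED - granted
    if missing:
        raise RuntimeError("client {} is missing authorities: {}".format(CLIENT_ID, missing))
    return granted
```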

@mogul mogul added the bug label Dec 21, 2016

mogul commented Dec 21, 2016

Other process aspects:

  • We should label issues like this with something obvious like "customer-facing" that's red so everyone understands the "expedite" aspect
  • When we identify that there's a communication obligation, we include that in the initial post just as if we'd groomed/planned to handle this
  • We talk about who's responsible for that at stand-up, though it defaults to the assignee

We're going to loft this info closer to our incident-handling procedures.

rogeruiz commented Jan 6, 2017

Waiting on @cnelson

mogul commented Jan 6, 2017

Sorry, I've lost track: Why are we waiting on @cnelson for this?

Note we still have the statuspage incident open...

jmcarp commented Jan 7, 2017

I wrote a migration script to move the misplaced logs back into the correct index, but there's no point in running the script now, since we had elastic configured to drop logs over a week old. If we want to bring back the logs in question, we'll need to restore logs from s3 back to elastic, then validate and run the migration. Last week, we were discussing whether this is worthwhile, and iirc @cnelson suggested that it might be, if only to practice restoring logs from s3. This is more a product question than a technical question, but IMO there are more useful things for us to do than restore logs--if we hadn't said we were going to do it on statuspage, I would definitely argue against putting time into this.
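
For context, a migration like the one described could look roughly like the sketch below, assuming the cluster supports the `_reindex` API and the misplaced documents can be selected by a query. The index names and selector are placeholders, not the actual ones used.

```python
import requests

ES_URL = "http://elasticsearch.example.internal:9200"  # placeholder host

def migrate(source_index, dest_index, query):
    """Copy documents matching `query` from source_index into dest_index via _reindex."""
    body = {
        "source": {"index": source_index, "query": query},
        "dest": {"index": dest_index},
    }
    resp = requests.post("{}/_reindex".format(ES_URL), json=body)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # Example: move app logs that landed in a catch-all index back into the
    # dated app-log index they should have gone to (all names illustrative).
    result = migrate(
        source_index="logs-wrong-2016.12.18",
        dest_index="logs-app-2016.12.18",
        query={"term": {"@type": "LogMessage"}},
    )
    print("documents moved:", result.get("created", 0))
```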

mogul commented Jan 7, 2017

Would it stop your arguing if I told you that I'd heard a prospective customer saw our recent spate of statuspage incidents, and was concerned about our lack of attention to detail (in root-causing) and follow-through (on resolution and post-mortem details), to the point that it made them reluctant to use our platform? :trollface:

cnelson commented Jan 17, 2017

I have logstash successfully ingesting logs from s3 and delivering them back into the redis -> parser pipeline, but the restored data is not showing up in Kibana. Will continue to investigate / restore data tomorrow.
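
The restore here used logstash's s3 input feeding the existing redis -> parser pipeline; as a rough illustration of the same idea in script form, here is a minimal sketch that reads archived log lines from s3 and pushes them onto the redis list the parsers consume. The bucket, prefix, redis host, and queue key are placeholders, and it assumes plain-text (not gzipped) archives.

```python
import boto3
import redis

BUCKET = "example-archived-logs"      # hypothetical archive bucket
PREFIX = "platform/2016/12/"          # hypothetical date prefix
QUEUE_KEY = "logstash"                # assumed redis list the parsers read from

s3 = boto3.client("s3")
queue = redis.StrictRedis(host="redis.example.internal", port=6379)

# Walk every archived object under the prefix and re-inject each log line.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        for line in body.decode("utf-8").splitlines():
            if line.strip():
                queue.rpush(QUEUE_KEY, line)
```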

cnelson commented Jan 18, 2017

Restore is now underway. I should be able to provide an estimated completion time after a few days of data have been imported.

cnelson commented Jan 19, 2017

Issues with the production logsearch cluster blocked me from making any progress on the data restore today. The cluster is back in a good state; I will resume restore activities tomorrow.

cnelson commented Jan 30, 2017

Marking this blocked until the re-indexing work described in cloud-gov/cg-atlas#181 is complete and the cluster is capable of keeping all historical data online.

brittag commented Jan 30, 2017

Yay for progress! Can we post an update to https://cloudgov.statuspage.io/incidents/ywrnjr7f52j8 with a month-later status note? Something like this, for example: "We're currently completing some backend work to make log indexing more efficient and resilient, which will make this log reimport process go smoothly (and help prevent similar problems in the future). We'll reimport these logs when that work is finished."

@mogul mogul removed the In Progress label Feb 2, 2017
@cnelson cnelson removed the blocked label Feb 3, 2017

cnelson commented Feb 3, 2017

Data from 10/23-10/31 and 11/20-current has been reindexed and is available for search.

Data for 11/1-11/20 is being indexed now and is expected to complete within the next 72 hours.

cnelson commented Feb 7, 2017

Reindexing is mostly complete. I've scheduled a maintenance window for tonight to do some housekeeping in the cluster, and then this should be complete.

cnelson commented Feb 8, 2017

Added a PR for documentation on how to do this manually, and updated cloud-gov/cg-atlas#179 with an implementation sketch. I'm calling this done 😅

@cnelson cnelson closed this as completed Feb 17, 2017