
Active alerts do not recover after re-enabling a rule #110080

Closed
mikecote opened this issue Aug 25, 2021 · 20 comments · Fixed by #111671
Labels
  • bug (Fixes for quality problems that affect the customer experience)
  • estimate:needs-research (Estimated as too large and requires research to break down into workable issues)
  • Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework)
  • Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@mikecote
Contributor

When a rule with active alerts is disabled, those alerts will not recover after the rule is re-enabled, even if they are no longer active.

Steps to reproduce:

  • Create a rule that will create active alerts
  • Disable the rule
  • Edit the rule so it doesn’t create active alerts anymore
  • Enable the rule

Notice the previously detected alert remains active and will not recover.

@mikecote mikecote added bug Fixes for quality problems that affect the customer experience Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework labels Aug 25, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote
Contributor Author

mikecote commented Aug 25, 2021

Some random thoughts:

  • Maybe storing the rule and alert state with the Task Manager task isn't a good idea, and it should be stored on the rule saved-object instead? This could also facilitate migrations on the state, if we don't want to do it at runtime. (A rough sketch of this idea follows the list.)
  • If we eventually write alert documents to an index, maybe the alert state can be persisted there, which would also solve the problem of tracking all alerts and their state in memory.
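A minimal TypeScript sketch of the first bullet, assuming hypothetical field names (TrackedAlertState, executionState) rather than the actual Kibana saved-object schema: the tracked alert state lives on the rule saved object, so disabling the rule only flips a flag and the state survives.

// Illustrative only: hypothetical shape for keeping alert state on the rule
// saved object instead of on the task manager task.
interface TrackedAlertState {
  state: Record<string, unknown>;
  meta: { lastScheduledActions?: { group: string; date: string } };
}

interface RuleAttributesWithState {
  name: string;
  enabled: boolean;
  // Persisted here, the state would survive disable/enable because the rule
  // saved object is never deleted when the rule is disabled.
  executionState: {
    alertInstances: Record<string, TrackedAlertState>;
    alertTypeState: Record<string, unknown>;
  };
}

// On disable we would only flip the flag; the tracked alert state stays put.
function disableRule(rule: RuleAttributesWithState): RuleAttributesWithState {
  return { ...rule, enabled: false };
}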

@pmuellr
Member

pmuellr commented Aug 25, 2021

In the following #106292 (comment), I noted:

Next thing that it would be nice to not have to re-create - task manager tasks! I'll have to poke around and see if it's possible to "disable" a task, as opposed to deleting and re-creating it. The situation I had with multiple tasks associated with the same rule was one that we don't really have any way at all to remediate, without explicitly deleting documents from the task manager index.

I don't think we have an issue open on this, but I think this would fix this problem, since the state would persist even if the alert is disabled. The main reason we want to not delete/re-create the tasks is that due to the race conditions mentioned in that other issue, we are sometimes creating > 1 task associated with a rule, and we don't really have any way to clean these up. I've not seen this happen in practice, only with some very targeted ad hoc testing, but it can happen.

I'll open a new issue JUST for that - keeping the task around even if the rule is disabled (since I don't see one open on that right now).

update: issue opened: #110096

@mikecote
Contributor Author

The thing I like about the disabled task approach is it then becomes a 1:1 relationship between rule and task. We create a task at the creation of the rule and then synchronize enable/disable accordingly.
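A rough sketch of that 1:1 lifecycle, with TaskStore and its methods as hypothetical stand-ins for the real Task Manager API: the task is created once alongside the rule, enable/disable only toggles a flag, and the task document (and its state) is deleted only when the rule itself is deleted.

// Hypothetical stand-ins, not the real Task Manager client.
interface TaskDoc {
  id: string;
  enabled: boolean;
  state: Record<string, unknown>; // survives enable/disable because the doc is never deleted
}

interface TaskStore {
  create(task: TaskDoc): Promise<TaskDoc>;
  update(id: string, patch: Partial<TaskDoc>): Promise<TaskDoc>;
  remove(id: string): Promise<void>;
}

// One task per rule, created once, alongside the rule saved object.
async function createRuleTask(store: TaskStore, ruleId: string): Promise<TaskDoc> {
  return store.create({ id: `task:${ruleId}`, enabled: true, state: {} });
}

// Enable/disable only toggles the task; deletion happens only when the rule is deleted.
async function setRuleEnabled(store: TaskStore, ruleId: string, enabled: boolean): Promise<TaskDoc> {
  return store.update(`task:${ruleId}`, { enabled });
}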

@mikecote
Contributor Author

mikecote commented Sep 1, 2021

May relate to #110096

@mikecote mikecote added the estimate:needs-research Estimated as too large and requires research to break down into workable issues label Sep 1, 2021
@YulNaumenko YulNaumenko self-assigned this Sep 3, 2021
@mikecote
Contributor Author

mikecote commented Sep 7, 2021

Relates to #105046 and may resolve it at the same time?

@YulNaumenko
Contributor

YulNaumenko commented Sep 9, 2021

Relates to #105046 and may resolve it at the same time?

Yes, this issue describes the same problem.
I did the research and found the following happens in this flow:

  1. Create a rule that will create active alerts: a task is created holding the state of the alert instances.
  2. Disable the rule: the task is deleted, along with the state of the alert instances.
  3. Edit the rule so it doesn't create active alerts anymore.
  4. Enable the rule: a new task instance is created with its own state.

But the Event Log (which gives us the rule summary with the alert instances) was not updated after the first task was removed.
As a possible solution, maybe we can update the Event Log before we delete the task with the state?
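A minimal sketch of this idea, with EventLoggerLike and the field names as simplified assumptions rather than the real event log client: before the disable path deletes the task, write a recovered-instance event for each alert instance still present in the task state.

interface EventLoggerLike {
  logEvent(event: { action: string; ruleId: string; instanceId: string }): void;
}

function writeRecoveredEventsBeforeTaskDelete(
  eventLogger: EventLoggerLike,
  ruleId: string,
  alertInstances: Record<string, unknown>
): void {
  for (const instanceId of Object.keys(alertInstances)) {
    // Mark every still-active instance as recovered in the event log so the
    // rule summary no longer reports it as active after re-enable.
    eventLogger.logEvent({ action: 'recovered-instance', ruleId, instanceId });
  }
  // ...the disable path would then delete the task document as it does today.
}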

@mikecote
Contributor Author

mikecote commented Sep 9, 2021

But the Event Log (which gives us the rule summary with the alert instances) was not updated after the first task was removed.
As a possible solution, maybe we can update the Event Log before we delete the task with the state?

The event log uses ILM to delete documents older than a certain amount of time; would the bug come back when the event log doc gets removed?

@pmuellr
Member

pmuellr commented Sep 10, 2021

But the Event Log (which gives us the rule summary with the alert instances) was not updated after the first task was removed.
As a possible solution, maybe we can update the Event Log before we delete the task with the state?

The event log uses ILM to delete documents older than a certain amount of time; would the bug come back when the event log doc gets removed?

My read of Yuliia's comment was to "finish up" the instances at that point, by writing a recovered-instance event, or maybe a new event indicating it's "done" because the rule is disabled. That would "fix" the problem of it not being recovered after re-enabling.

Just checked Yuliia's POC, that's exactly what she did :-) Super interesting it passed all FT, but failed some jest tests!!!

Of course, ideally it would be nice to show the state at the time it was disabled, with enough UI to make it obvious it's disabled and the info isn't current.

Beyond our own "lost state" issues with deleted task documents, rule types lose their state as well!

I think we really want to maintain this state over a disable, and we have two choices - stop deleting task documents on disable, or move the state into the rule. And we should really stop deleting task documents because of the race conditions with rapid enable/disable calls.

If we need to do something sooner, Yuliia's POC would probably be fine, but we might want to think about writing some enabled and disabled events out for the rule as well ...

I wonder if we're going to run into any problems with existing rule types that depend on the state being cleared on disable? Guessing no. But I'm guessing we may want a new method for alerts to "reset" them, since our old way of doing this disable/enable won't work any more; of course most of the time we do that to reset the API key, which it would also be nice to persist, but different problem :-)

@YulNaumenko
Contributor

Yes, as @pmuellr mentioned above, my suggestion is an alternative "easy" solution before we decide to do something more complicated like moving the state to the rule or no longer deleting tasks.

@pmuellr
Member

pmuellr commented Sep 10, 2021

ya, I think implementing the short-term fix sooner than the longer-term fix makes sense to me. I would only really be concerned with the short-term fix if we showed something for a disabled rule, but we don't today. The only effect would be that you won't see those already recovered alerts as still active, when re-enabled - or it seems like that should be the only effect.

Also forgot to mention, presumably this only happens if the active-instance docs are still found in the event log search. That search uses a time window of N x rule interval (N = 6 or 12 or something), so if you wait long enough, those "sticky" instances should disappear.

@mikecote
Contributor Author

Have we thought about rules that set up actions on recovery? What should happen to PagerDuty incidents when disabling a rule? What should happen to PagerDuty incidents when enabling a rule that detects some recoveries? I'm OK expanding the scope if we feel another approach is bigger in scope but is the ultimate solution.

@pmuellr
Member

pmuellr commented Sep 10, 2021

Ya, I think these recovered-because-rule-disabled alerts probably deserve something other than the usual recovered story, long-term. Does that mean a new "built-in" action group? Maybe it's a recovered state with a reason field associated. How do you even describe this situation to a human? Seems like it could get complicated, so we at least need to create an issue to research / track it.

@YulNaumenko
Contributor

The current research raised a few product questions, which should shape the next steps for resolving the current "bug" behavior:

  1. What is supposed to happen to the state of the active alert instances after the rule is disabled?
    • Does it make sense to mark them as "recovered", or would a better option be to define a new 'reset' state for this kind of resolution?
    • What should happen to these alert instances when the rule is modified and re-enabled? The threshold condition could be changed, and a possible side effect is that these alert instances will appear as active from the previous state. We need to design a clear UX where this is resolved or explained.
  2. How should the related actions behave for these "recovered" alert instances? For example, what should happen to incidents created in PagerDuty? We cannot 'resolve' these incidents because the threshold conditions are probably still met, and it would be confusing UX if we did that.

@arisonl and @alexfrancoeur, we need your help defining the requirements mentioned above.

@YulNaumenko YulNaumenko removed their assignment Sep 11, 2021
@arisonl
Contributor

arisonl commented Sep 14, 2021

Here is a visual capture of how this behaves right now.

[visual capture: bug3]

Some initial thoughts (that will need validation): Increasing the threshold and thereby causing the condition to stop being met is not a recovery. If you change the threshold, you have reset the rule. A recovery is defined against a certain threshold; if you change the threshold to a higher value, the alert has not recovered against the previous threshold, you just changed the rule. (That said, @gmmorris has noted that admins may disable/enable a rule in order to reset an API key, in which case the state should probably not be lost: #106876.)

We have two things coming together in this issue: resetting the threshold value and disabling the rule. We probably need to think about how each is reflected in the event log and the external system independently, as well as when they are combined (like in this bug).

@pmuellr
Member

pmuellr commented Sep 14, 2021

Increasing the threshold and this way causing the condition to stop being met is not a recovery. If you change the threshold, you have reset the rule.

Good insight! That means we should be resetting the rule state on update, which we are NOT doing today! I think we'd just be considering rule-type specific fields to count as an update where "the rule state will be reset" - setting mute/throttle/name, etc, shouldn't reset the rule state.

If we were to implement this today, we'd have to end up marking those alert instances as "recovered", but of course that doesn't feel quite right. Do we need a different "alert instance no longer firing" case? No longer firing because it has actually recovered, or no longer firing for other reasons (eg, rule was updated). So either in the event log, a new peer of recovered-instance docs, something like no-longer-tracking-instance kinda thing. Or we could have a reason field in the doc where we'd indicate WHY it recovered. The latter will probably be WAAAY easier.
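A sketch of the reason-field variant; the recovery_reason field and its values are assumptions, not part of the current event log schema. The idea is that the UI could then distinguish a genuine recovery from "we simply stopped tracking this instance".

type RecoveryReason = 'condition-no-longer-met' | 'rule-disabled' | 'rule-updated';

interface RecoveredInstanceEvent {
  event: { action: 'recovered-instance' };
  kibana: {
    alerting: {
      instance_id: string;
      // Assumed new field: records WHY the instance stopped being tracked.
      recovery_reason: RecoveryReason;
    };
  };
}

// Example of the doc we would write when a rule update resets tracking.
const example: RecoveredInstanceEvent = {
  event: { action: 'recovered-instance' },
  kibana: { alerting: { instance_id: 'host-1', recovery_reason: 'rule-updated' } },
};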

@gmmorris
Contributor

@arisonl will investigate the product expectations further with O11y and Security Solution

@pmuellr
Member

pmuellr commented Sep 29, 2021

Re-looking at this after ~3 weeks ... thought I'd summarize the short-term options we could look at:

  • when a rule is disabled, generate recovered-instance event log events for the current instances
  • stop deleting the task documents when a rule is disabled; they will only be deleted when the rule is deleted

I'm not sure which of these is actually easier, or what other side effects they might cause.

The second is probably harder, but presumably removes some existing code (deleting and re-creating the task document on disable/enable). It also will help with the race condition where if you enable/disable very quickly (via the HTTP API or presumably rule client), you can end up with multiple task documents for a single rule, which are impossible to delete.

From the view of the event log:

  • the effect of the first is that the alerts will appear recovered when the rule is disabled
  • the effect of the second is that the alerts will appear active while the rule is disabled, and if they recovered while the rule is disabled, won't actually emit a recovered-instance event until the rule is enabled again
  • the event log doesn't know if the rule is disabled or not - you could "intelligently" interpret what happened if the event log knew the rule was disabled, but is that an event (probably) or an attribute in the event log doc?

The other aspect of this is that the instances WILL "recover" in the sense that they won't show up in the rule details page, eventually, because the query to get the instances from the event log is time-based (60 * rule interval), so things become "eventually consistent" :-). Problem is if the interval is 24 hr, that's 60 days (link to code below).

For alerts that would eventually recover while the rule is disabled (if the rule had not been disabled), the alerts will end up with a "recovered" message at the "wrong time" in both cases - either when the rule is disabled (first option) or when the rule is re-enabled (second option). Neither is correct, but the first feels somewhat better.

It's also probably likely that we will need to add a new "reset" API if we go with the second option, since "disable then re-enable" is the general way to "reset" a rule, and aspects of that "reset" will stop working if we stop deleting the task record - we'd need a new HTTP API and likely UX gesture to explicitly "reset" the rule.
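Purely illustrative sketch of what such an explicit reset gesture could look like; the route path and the resetRuleState method are hypothetical, not an existing Kibana API.

interface RulesClientLike {
  // Hypothetical: clear persisted task state (and ideally rotate the API key)
  // without the disable/enable round trip.
  resetRuleState(ruleId: string): Promise<void>;
}

interface RouterLike {
  post(opts: { path: string }, handler: (ruleId: string) => Promise<void>): void;
}

function registerResetRoute(router: RouterLike, rulesClient: RulesClientLike): void {
  // Hypothetical route; an explicit "reset" replaces the old disable/enable trick.
  router.post({ path: '/api/alerting/rule/{id}/_reset' }, (ruleId) =>
    rulesClient.resetRuleState(ruleId)
  );
}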

I really want to do the second, because it fixes the race condition (which we've only seen in artificial situations), but it's feeling heavier than the first option.


Here's the code that uses the 60 * rule interval date range to read the event log:

// default duration of instance summary is 60 * alert interval
const dateNow = new Date();
const durationMillis = parseDuration(alert.schedule.interval) * 60;
const defaultDateStart = new Date(dateNow.valueOf() - durationMillis);
const parsedDateStart = parseDate(dateStart, 'dateStart', defaultDateStart);

const eventLogClient = await this.getEventLogClient();

this.logger.debug(`getAlertInstanceSummary(): search the event log for alert ${id}`);
let events: IEvent[];
try {
  const queryResults = await eventLogClient.findEventsBySavedObjectIds(
    'alert',
    [id],
    {
      page: 1,
      per_page: 10000,
      start: parsedDateStart.toISOString(),
      end: dateNow.toISOString(),
      sort_order: 'desc',
    },
    alert.legacyId !== null ? [alert.legacyId] : undefined
  );
  events = queryResults.data;
} catch (err) {
  this.logger.debug(
    `rulesClient.getAlertInstanceSummary(): error searching event log for alert ${id}: ${err.message}`
  );
  events = [];
}

@pmuellr
Member

pmuellr commented Sep 29, 2021

Should have pointed out - it's buried in other comments - that Yuliia has a POC for the first option of emitting recovered events on disable, in PR 111671.

@arisonl
Contributor

arisonl commented Sep 29, 2021

We decided to move forward with the recovery option for the short term, with a solution that will not trigger resolve-on-recovery for disabled rules that are set up to resolve incidents on recovery. We also discussed flagging that somehow in the event log, if possible (I understand from Patrick's comment that it is currently not, but it can be added either as an event or an attribute).
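A sketch of that agreed behavior under simplified, assumed names (RecoveredAlertLike, a trigger flag): recovery actions are skipped when the recovery was caused by disabling the rule, so incident-resolving connectors such as PagerDuty are not invoked.

interface RecoveredAlertLike {
  instanceId: string;
  scheduleActions(group: 'recovered'): void;
}

function handleRecoveredAlerts(
  recovered: RecoveredAlertLike[],
  trigger: 'execution' | 'disable'
): void {
  for (const alert of recovered) {
    if (trigger === 'disable') {
      // Recovered only because the rule was disabled: write the event log entry
      // elsewhere, but do not fire the recovery action group, so connectors do
      // not auto-resolve their incidents.
      continue;
    }
    alert.scheduleActions('recovered');
  }
}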

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022