
Alerting rules can end up in a state where they stop running indefinitely until a user intervenes to fix the problem #119650

Closed
mikecote opened this issue Nov 24, 2021 · 5 comments

@mikecote commented Nov 24, 2021

Alerting rules should operate continually. Once a user enables an alerting rule, it should not stop running or fail indefinitely until the user disables it.

We should audit our codebase and identify scenarios where rules stop running indefinitely. Then, based on the findings, we should propose fixes or mitigations for the scenarios that can be fixed or mitigated.

As a starting point, I am aware of the following scenarios where rules stop running indefinitely (see the sketch after this list for one way to surface some of them). Since I have only done brief research, we should still analyze our code in depth and prioritize afterwards, so we can start discussing solutions and priorities for each scenario.

  • An alerting rule's task gets deleted
  • An alerting rule's task gets marked as failed
  • An alerting rule's task cannot find its associated rule
  • An alerting rule's API key is deleted
  • An alerting rule is unable to decrypt the API key because the Kibana encrypted saved objects encryption key has changed
  • The alerting rule's type no longer exists
  • An alerting rule was created without an API key (security features disabled), and security features were later enabled, leaving the rule without an API key (relevant issue)
  • An alerting rule's task becomes unrecognizable
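Several of the task-related scenarios above are only visible in the task manager index. As a rough illustration (not an official API), a script like the following could surface alerting tasks stuck in a failed status. It assumes the internal `.kibana_task_manager` index layout of this era (`task.taskType`, `task.status`), which is an implementation detail and may differ between Kibana versions:

```ts
// Hedged sketch: find alerting tasks stuck in a "failed" status by querying
// the internal .kibana_task_manager index directly. Index name and field
// layout are internal implementation details, not a stable interface.
// Uses the v7.x @elastic/elasticsearch client signature ({ body } response).
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' }); // assumption: local cluster, auth omitted

async function findFailedAlertingTasks() {
  const { body } = await client.search({
    index: '.kibana_task_manager',
    body: {
      size: 100,
      query: {
        bool: {
          filter: [
            // Alerting tasks have task types of the form "alerting:<ruleTypeId>".
            { prefix: { 'task.taskType': 'alerting:' } },
            { term: { 'task.status': 'failed' } },
          ],
        },
      },
    },
  });
  return body.hits.hits;
}

findFailedAlertingTasks().then((hits) => {
  for (const hit of hits) {
    console.log(hit._id, (hit._source as any).task.taskType);
  }
});
```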
@elasticmachine

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote

There is still an open question with regard to rule actions:

Should we consider rule actions (ex: Email, PagerDuty, etc.) part of these requirements, or would the rule actions portion be its own dedicated non-functional requirement? (ex: "Users need to know when rule actions stop executing indefinitely")

@arisonl and I are trying to get an answer to this question, but it should be folded into this research issue once we have an answer.

@arisonl commented Dec 1, 2021

I believe that in practice, if an action is set up, receiving that action in the integrated system is an integral part of the rule from a use-case perspective. We should assume that if an action goes out, it is important and has a workflow associated with it; hence, a rule that runs but whose action fails should be considered just as bad as a rule that does not run at all, at least for a number of use cases.

@gmmorris commented Dec 7, 2021

Added a link to a related issue (it was already described in the issue, but wasn't linked to the source issue).

#118520

@ymao1 commented Jan 13, 2022

After some research, we've concluded that there are two types of problems that alerting rules can encounter:

  • Task manager related problems - These are problems with the task that backs an alerting rule. If these tasks get deleted, duplicated, or get into a state where they are no longer picked up by task manager, alerting rule execution is impacted. This is especially egregious because we can currently only identify these issues by inspecting the task manager index; the rule management UI doesn't provide any indication of issues occurring at the task level (see the sketch after this list).
  • Alerting task runner related problems - These are alerting rules that execute at the requested interval but throw an error on every execution. Often these execution errors cannot be automatically fixed and require manual intervention. These errors already show up in the rule management UI, and the error status is stored in the rule SO.
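One way to make the first class of problems visible today is to cross-check rules against their backing tasks. A minimal sketch, assuming the public rules find API (`GET /api/alerting/rules/_find`, whose response includes each rule's `scheduled_task_id`) and the internal `.kibana_task_manager` index, where task documents are stored with ids of the form `task:<scheduled_task_id>`:

```ts
// Hedged sketch: flag alerting rules whose backing task document is missing.
// Combines the public rules find API with a direct lookup in the internal
// .kibana_task_manager index (an implementation detail, not a stable API).
// Requires Node 18+ for global fetch; auth is omitted for brevity.
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' }); // assumption: local cluster
const KIBANA_URL = 'http://localhost:5601'; // assumption: local Kibana

async function findRulesWithMissingTasks() {
  // Fetch the first page of rules; a real audit would paginate through all of them.
  const res = await fetch(`${KIBANA_URL}/api/alerting/rules/_find?per_page=100`);
  const { data: rules } = await res.json();

  const scheduled = rules.filter((r: any) => r.scheduled_task_id);
  if (scheduled.length === 0) return [];

  // Task saved objects are stored with document ids of the form `task:<id>`.
  const { body } = await es.mget({
    index: '.kibana_task_manager',
    body: { ids: scheduled.map((r: any) => `task:${r.scheduled_task_id}`) },
  });

  // mget preserves request order, so a "not found" doc at index i means
  // the rule at index i has lost its backing task.
  return scheduled.filter((_r: any, i: number) => !body.docs[i].found);
}

findRulesWithMissingTasks().then((rules) => {
  for (const r of rules) {
    console.log(`Rule ${r.id} (${r.name}) has no backing task`);
  }
});
```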

As part of the research, we've identified the scenarios that can lead to the specified problems. More details are available in the research document. As a result of this research, the following issues have been created:

Closing this research issue in favor of the linked issues.

ymao1 closed this as completed on Jan 13, 2022