
[Alerting] Investigate recurring task to detect broken rules #122984

Open
ymao1 opened this issue Jan 13, 2022 · 2 comments
Labels: blocked, estimate:needs-research, Feature:Alerting/RulesFramework, Team:ResponseOps

Comments

@ymao1
Contributor

ymao1 commented Jan 13, 2022

As part of the research to identify when rules stop running continuously, we have proposed creating a recurring task to detect and possibly repair broken rules. This single task could check for a variety of scenarios (a rough sketch of the task-existence checks follows the list):

  • Ensure there is a task manager task for each enabled alerting rule; if a rule is missing its task, create one.
  • Ensure there is no more than one task manager task for each alerting rule; if a rule has multiple tasks, keep only the valid task document.
  • Ensure that rules and connectors are decryptable with the current encryption key
  • Ensure that nothing has broken AAD for rules & connectors (manual SO updates, for example)
  • Ensure that rule type parameter validation succeeds for each rule
  • Ensure there is a valid API key for each enabled rule; if one is missing, create it (is this possible as a background task without the user?)
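
A minimal TypeScript sketch of the first two checks, assuming hypothetical `fetchEnabledRules` / `fetchRuleTasks` accessors and simplified rule/task shapes (the real Kibana alerting and task manager clients and document types differ):

```ts
// Hypothetical shapes standing in for the real alerting rule and task manager
// documents; field names are illustrative only.
interface Rule {
  id: string;
  enabled: boolean;
}

interface TaskDoc {
  id: string;
  ruleId: string;
}

interface BrokenRuleReport {
  rulesMissingTask: string[];
  rulesWithDuplicateTasks: string[];
}

// Recurring-task body: cross-check enabled rules against their task manager
// tasks and report rules that have zero or more than one task.
async function detectBrokenRules(
  fetchEnabledRules: () => Promise<Rule[]>, // assumed data accessors,
  fetchRuleTasks: () => Promise<TaskDoc[]>  // not real Kibana APIs
): Promise<BrokenRuleReport> {
  const rules = await fetchEnabledRules();
  const tasks = await fetchRuleTasks();

  // Group task documents by the rule they claim to run.
  const tasksByRule = new Map<string, TaskDoc[]>();
  for (const task of tasks) {
    const existing = tasksByRule.get(task.ruleId) ?? [];
    existing.push(task);
    tasksByRule.set(task.ruleId, existing);
  }

  const rulesMissingTask: string[] = [];
  const rulesWithDuplicateTasks: string[] = [];

  for (const rule of rules) {
    const ruleTasks = tasksByRule.get(rule.id) ?? [];
    if (ruleTasks.length === 0) {
      rulesMissingTask.push(rule.id); // candidate for re-scheduling
    } else if (ruleTasks.length > 1) {
      rulesWithDuplicateTasks.push(rule.id); // candidate for de-duplication
    }
  }

  return { rulesMissingTask, rulesWithDuplicateTasks };
}
```

The same loop could be extended with the decryption, AAD, parameter-validation, and API key checks, each appending to its own bucket in the report.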

With the output of these checks, we could explore the following (a possible output shape is sketched after the list):

  • Storing the output somewhere and adding telemetry based on this data.
  • Merging with the existing alerting health framework status checks to provide an API for users to get this information.
  • Pushing the outputs to Stack Monitoring, leveraging the O11y of Alerting efforts to get rule information into Stack Monitoring.
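
A possible shape for the task's output, purely illustrative; the field and counter names are assumptions, not an agreed-on schema:

```ts
// Hypothetical summary document written by the recurring task.
interface RuleHealthCheckResult {
  timestamp: string;
  rulesMissingTask: number;
  rulesWithDuplicateTasks: number;
  rulesFailingDecryption: number;
  rulesFailingParamValidation: number;
  rulesMissingApiKey: number;
}

// Example of folding a result into a flat counter payload that a telemetry
// collector, health API, or Stack Monitoring exporter could consume.
function toTelemetryPayload(result: RuleHealthCheckResult): Record<string, number> {
  return {
    'alerting.broken_rules.missing_task': result.rulesMissingTask,
    'alerting.broken_rules.duplicate_tasks': result.rulesWithDuplicateTasks,
    'alerting.broken_rules.failed_decryption': result.rulesFailingDecryption,
    'alerting.broken_rules.failed_param_validation': result.rulesFailingParamValidation,
    'alerting.broken_rules.missing_api_key': result.rulesMissingApiKey,
  };
}
```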
ymao1 added the Team:ResponseOps, Feature:Alerting/RulesFramework, and estimate:needs-research labels on Jan 13, 2022
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@mikecote
Contributor

@chrisronline, @XavierM during @ymao1's research, we identified some scenarios where rules end up in a state where they stop running and where we should notify the user. Some of the other bullet points relate to "self-healing" issues.

We should discuss the proper way to detect these scenarios and notify our users before starting the implementation. For example, is a recurring task the right approach for detection?

Is something like this on the near-term roadmap for the O11y of Alerting project?

kobelb added the needs-team label on Jan 31, 2022
botelastic bot removed the needs-team label on Jan 31, 2022