
Alerting rules can end up in a state where they stop running indefinitely until a user intervenes to fix the problem #119650

Closed
mikecote opened this issue Nov 24, 2021 · 5 comments

@mikecote commented Nov 24, 2021

Alerting rules should operate continually. Once a user enables an alerting rule, it should not stop running or fail indefinitely until the user disables it.

We should audit our codebase and identify scenarios where rules stop running indefinitely. Then, based on the findings, we should propose fixes or mitigations for the scenarios that can be fixed or mitigated.

As a starting point, I am aware of the following scenarios where rules stop running indefinitely (see the sketch after this list for one way to surface some of them). Since I have only done brief research, we should still analyze our code in depth and prioritize afterwards, so we can start discussing solutions and priorities for each scenario.

  • An alerting rule's task gets deleted
  • An alerting rule's task gets marked as failed
  • An alerting rule's task cannot find its associated rule
  • An alerting rule's API key is deleted
  • An alerting rule is unable to decrypt the API key because the Kibana encrypted saved objects encryption key has changed
  • The alerting rule's type no longer exists
  • An alerting rule was created without an API key (security features disabled), and security features were later enabled, leaving the rule without an API key (relevant issue)
  • An alerting rule's task becomes unrecognizable
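Several of the task-related scenarios above are only visible in the task manager index. As a rough illustration (not an official API), a script like the following could surface alerting tasks stuck in a failed status. It assumes the internal `.kibana_task_manager` index layout of this era (`task.taskType`, `task.status`), which is an implementation detail and may differ between Kibana versions:

```ts
// Hedged sketch: find alerting tasks stuck in a "failed" status by querying
// the internal .kibana_task_manager index directly. Index name and field
// layout are internal implementation details, not a stable interface.
// Uses the v7.x @elastic/elasticsearch client signature ({ body } response).
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' }); // assumption: local cluster, auth omitted

async function findFailedAlertingTasks() {
  const { body } = await client.search({
    index: '.kibana_task_manager',
    body: {
      size: 100,
      query: {
        bool: {
          filter: [
            // Alerting tasks have task types of the form "alerting:<ruleTypeId>".
            { prefix: { 'task.taskType': 'alerting:' } },
            { term: { 'task.status': 'failed' } },
          ],
        },
      },
    },
  });
  return body.hits.hits;
}

findFailedAlertingTasks().then((hits) => {
  for (const hit of hits) {
    console.log(hit._id, (hit._source as any).task.taskType);
  }
});
```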
@elasticmachine

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote

There is still an open question with regard to rule actions:

Should we consider rule actions (ex: Email, PagerDuty, etc.) part of these requirements, or would the rule actions portion be its own dedicated non-functional requirement? (ex: "Users need to know when rule actions stop executing indefinitely")

@arisonl and I are trying to get an answer to this question, but it should be folded into this research issue once we have an answer.

@arisonl commented Dec 1, 2021

I believe that in practice, if an action is set up, receiving that action in the integrated system is an integral part of the rule from a use-case perspective. We should assume that if an action goes out, it is important and has a workflow associated with it; hence, a rule that runs but whose action fails should be considered just as bad as a rule that does not run at all, at least for a number of use cases.

@gmmorris commented Dec 7, 2021

Added a link to a related issue (it was already described in the issue, but wasn't linked to the source issue).

#118520

@ymao1 commented Jan 13, 2022

After some research, we've concluded that there are two types of problems that alerting rules can encounter:

  • Task manager related problems - These are problems with the task that backs an alerting rule. If these tasks get deleted, duplicated, or get into a state where they are no longer picked up by task manager, alerting rule execution is impacted. This is especially egregious because we can currently only identify these issues by inspecting the task manager index; the rule management UI doesn't provide any indication of issues occurring at the task level (see the sketch after this list).
  • Alerting task runner related problems - These are alerting rules that execute at the requested interval but throw an error on every execution. Often these execution errors cannot be automatically fixed and require manual intervention. These errors already show up in the rule management UI, and the error status is stored in the rule SO.
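One way to make the first class of problems visible today is to cross-check rules against their backing tasks. A minimal sketch, assuming the public rules find API (`GET /api/alerting/rules/_find`, whose response includes each rule's `scheduled_task_id`) and the internal `.kibana_task_manager` index, where task documents are stored with ids of the form `task:<scheduled_task_id>`:

```ts
// Hedged sketch: flag alerting rules whose backing task document is missing.
// Combines the public rules find API with a direct lookup in the internal
// .kibana_task_manager index (an implementation detail, not a stable API).
// Requires Node 18+ for global fetch; auth is omitted for brevity.
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' }); // assumption: local cluster
const KIBANA_URL = 'http://localhost:5601'; // assumption: local Kibana

async function findRulesWithMissingTasks() {
  // Fetch the first page of rules; a real audit would paginate through all of them.
  const res = await fetch(`${KIBANA_URL}/api/alerting/rules/_find?per_page=100`);
  const { data: rules } = await res.json();

  const scheduled = rules.filter((r: any) => r.scheduled_task_id);
  if (scheduled.length === 0) return [];

  // Task saved objects are stored with document ids of the form `task:<id>`.
  const { body } = await es.mget({
    index: '.kibana_task_manager',
    body: { ids: scheduled.map((r: any) => `task:${r.scheduled_task_id}`) },
  });

  // mget preserves request order, so a "not found" doc at index i means
  // the rule at index i has lost its backing task.
  return scheduled.filter((_r: any, i: number) => !body.docs[i].found);
}

findRulesWithMissingTasks().then((rules) => {
  for (const r of rules) {
    console.log(`Rule ${r.id} (${r.name}) has no backing task`);
  }
});
```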

As part of the research, we've identified the scenarios that can lead to the specified problems. More details are available in the research document. As a result of this research, the following issues have been created:

Closing this research issue in favor of the linked issues.

ymao1 closed this as completed on Jan 13, 2022