[Metrics Alerting] Implement a cap for group_by cardinality within metrics threshold rule type #119073

Closed
jasonrhodes opened this issue Nov 18, 2021 · 4 comments
Labels
Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge"

Comments

@jasonrhodes
Member

jasonrhodes commented Nov 18, 2021

AC:

  • For a given cap C, the metrics threshold rule should stop paging through composite aggregation pages once it reaches CEIL(C / composite.size) pages (e.g. for a cap of 5000 and a composite.size of 120, we should stop after processing 42 pages).
  • Clear messaging is added to the rule creation/edit dialog, near the "group by" field, indicating that alerts can and will be dropped at any time if the cardinality of the chosen group by field exceeds cap C.
  • When cap C is exceeded, the user is clearly notified.
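
To make the first criterion concrete, here is a minimal sketch of the page-limit arithmetic. The function name `maxCompositePages` is hypothetical and not taken from the rule executor; only the formula CEIL(C / composite.size) comes from the criterion above.

```typescript
// Hypothetical helper (not part of the existing executor): the number of
// composite aggregation pages to process before stopping, for a cap C and
// a given composite.size.
function maxCompositePages(cap: number, compositeSize: number): number {
  if (
    !Number.isInteger(cap) ||
    !Number.isInteger(compositeSize) ||
    cap <= 0 ||
    compositeSize <= 0
  ) {
    throw new RangeError("cap and compositeSize must be positive integers");
  }
  // CEIL(C / composite.size), as stated in the acceptance criteria.
  return Math.ceil(cap / compositeSize);
}
```

With a cap of 5000 and a composite.size of 120 this gives 42 pages, so up to 42 × 120 = 5040 groups could be processed before the stop, slightly more than the cap itself. With a composite.size of 2000 the same cap gives 3 pages.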

TBD:

  • What is cap C?
  • Can we increase the composite.size value at the same time? In the Metrics UI we found we could safely increase this to 2000, but it's very dependent on the query context. We should investigate. @neptunian may know more there.
  • What should the clear messaging be in the create/edit dialogue, exactly?
  • If we cap, should we continue to use the composite aggregation?
  • How do we notify the user when this cap is exceeded?
    • Do we notify them every time, or just the first?
    • We likely need all of the same rule logic for this notification, something like an administrative rule set up in the background that we can trigger from inside this executor somehow. Is that possible?
    • What actions would be scheduled for that rule?
    • How does a user configure where those notifications should be sent?
    • Do we want a warning sent at level W, then a hard stop at level C?
@jasonrhodes jasonrhodes added the Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" label Nov 18, 2021
@jasonrhodes
Member Author

jasonrhodes commented Nov 18, 2021

Related ticket from Log Threshold rule side: #98010

@matschaffer
Contributor

Total gut-feel here, but I suspect anything more than 5 pages is danger-zone.

If everything is happy and performant, 5 pages of 10ms queries is no big deal. But if we start getting 30s+ response times, we're looking at 150s evaluation times, which could exceed the rule interval.
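
The arithmetic behind that estimate can be sketched as follows, assuming pages are fetched strictly one after another. Both function names are hypothetical, and the rule interval used in the note below is an illustrative assumption, not a value from this thread.

```typescript
// Worst case when pages are fetched sequentially:
// total evaluation time = number of pages * latency per page.
function worstCaseEvaluationMs(pageCount: number, perPageLatencyMs: number): number {
  return pageCount * perPageLatencyMs;
}

// True when the sequential worst case would still be running at the time
// the next scheduled execution is due.
function exceedsRuleInterval(
  pageCount: number,
  perPageLatencyMs: number,
  ruleIntervalMs: number
): boolean {
  return worstCaseEvaluationMs(pageCount, perPageLatencyMs) > ruleIntervalMs;
}
```

Five pages at 10 ms each come to 50 ms, while five pages at 30 s each come to 150 s, which would overrun a rule that is assumed to run every minute.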

@weltenwort
Member

If execution time is our concern and we're looking at introducing a mechanism to limit it, wouldn't it make sense to use it as the primary criterion for aborting the evaluation instead? The cancellation signal and per-rule timeout that the alerting team is working on sounds more robust than trying to come up with a surrogate metric.
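
As a rough illustration of using elapsed time rather than page count as the stop criterion, the sketch below pages until a time budget runs out. It does not use the alerting framework's cancellation signal or per-rule timeout, whose API is not described in this thread; `makeDeadline`, `pageWithinBudget`, and the synchronous `fetchPage` callback are all hypothetical stand-ins (a real executor would page asynchronously).

```typescript
// Returns a check that reports whether the time budget has been used up.
// The clock is injectable so the behaviour can be exercised without waiting.
function makeDeadline(timeoutMs: number, now: () => number = Date.now): () => boolean {
  const expiresAt = now() + timeoutMs;
  return () => now() >= expiresAt;
}

// Pulls pages until fetchPage returns null (no more pages) or the deadline
// has passed. The deadline is checked before each fetch, so a page that is
// already in flight is always finished and kept.
function pageWithinBudget<T>(
  fetchPage: (pageIndex: number) => T[] | null,
  isExpired: () => boolean
): { items: T[]; timedOut: boolean } {
  const items: T[] = [];
  for (let pageIndex = 0; ; pageIndex++) {
    if (isExpired()) {
      // Partial result: the caller decides how to report the truncation.
      return { items, timedOut: true };
    }
    const page = fetchPage(pageIndex);
    if (page === null) {
      return { items, timedOut: false };
    }
    items.push(...page);
  }
}
```

One consequence of this design is that the number of groups evaluated varies with cluster performance, which is the trade-off against a fixed page cap.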

@jasonrhodes
Member Author

Closing this in favor of #123053


4 participants