[Metrics Alerting] Implement a cap for group_by cardinality within metrics threshold rule type #119073

jasonrhodes · 2021-11-18T18:28:11Z

AC:

- For a given cap C, the metrics threshold rule should stop paging through composite aggregation pages once it reaches CEIL(C / composite.size) pages (e.g. for a cap of 5000 and a composite.size of 120, we should stop after processing 42 pages).
- Clear messaging is added to the rule creation/edit dialogue, near the "group by" field, indicating that alerts can and will be dropped at any time if the chosen group by field exceeds a cardinality of cap C.
- When cap C is exceeded, the user is clearly notified.
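To make the first criterion concrete, here is a minimal sketch of the page-cap arithmetic and a capped pagination loop. It is not the rule executor's actual code: `max_pages`, `collect_groups`, and the `fetch_page` callback are hypothetical names, and `fetch_page(after_key)` is assumed to return `(buckets, next_after_key)` with `next_after_key` set to `None` once there are no more pages.

```python
import math


def max_pages(cap: int, composite_size: int) -> int:
    """Number of composite pages to process before stopping: CEIL(C / composite.size)."""
    if cap <= 0 or composite_size <= 0:
        raise ValueError("cap and composite_size must be positive")
    return math.ceil(cap / composite_size)


def collect_groups(fetch_page, cap: int, composite_size: int):
    """Page through results, stopping after max_pages(cap, composite_size) pages.

    fetch_page is a hypothetical callback: fetch_page(after_key) -> (buckets, next_after_key),
    where next_after_key is None when no further pages exist (an assumption of this sketch).
    Returns (groups, truncated); truncated is True when the cap cut pagination short.
    """
    limit = max_pages(cap, composite_size)
    groups = []
    after_key = None
    truncated = False
    for _ in range(limit):
        buckets, after_key = fetch_page(after_key)
        groups.extend(buckets)
        if after_key is None:
            break  # ran out of pages before hitting the cap
    else:
        # Every allowed page was consumed; more data remains only if a key came back.
        truncated = after_key is not None
    return groups, truncated
```

With a cap of 5000 and a composite.size of 120, `max_pages` returns 42, so up to 5040 groups may be processed before the loop stops. The `truncated` flag is the point where a "cap exceeded" notification could be raised.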

TBD:

- What is cap C?
- Can we increase the composite.size value at the same time? In the Metrics UI we found we could safely increase this to 2000, but it's very dependent on the query context. We should investigate. @neptunian may know more there.
- What should the clear messaging be in the create/edit dialogue, exactly?
- If we cap, should we continue to use the composite aggregation?
- How do we notify the user when this cap is exceeded?
  - Do we notify them every time, or just the first?
  - We likely need all of the same rule logic for this notification, something like an administrative rule set up in the background that we can trigger from inside this executor somehow. Is that possible?
  - What actions would be scheduled for that rule?
  - How does a user configure where those notifications should be sent?
  - Do we want a warning sent at level W, then a hard stop at level C?
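As one possible reading of the last question, a two-level check could look like the sketch below. The names `CapStatus` and `cap_status` are hypothetical, and the boundary choices (warn at or above W, hard stop only when the count goes above C) are assumptions rather than decisions made in this issue.

```python
from enum import Enum


class CapStatus(Enum):
    OK = "ok"
    WARN = "warn"          # at or above warning level W, still within cap C
    EXCEEDED = "exceeded"  # above cap C: evaluation stops and groups are dropped


def cap_status(group_count: int, warn_level: int, cap: int) -> CapStatus:
    """Classify an observed group_by cardinality against W and C (assumed semantics)."""
    if warn_level > cap:
        raise ValueError("warn_level must not be greater than cap")
    if group_count > cap:
        return CapStatus.EXCEEDED
    if group_count >= warn_level:
        return CapStatus.WARN
    return CapStatus.OK
```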


jasonrhodes · 2021-11-18T18:32:13Z

Related ticket from Log Threshold rule side: #98010

matschaffer · 2021-11-19T01:55:29Z

Total gut-feel here, but I suspect anything more than 5 pages is danger-zone.

If everything is happy and performant, 5 pages of 10ms queries is no big deal. But if we start getting 30s+ response times, we're looking at 150s evaluation times, which could exceed the rule interval.
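The arithmetic behind this estimate can be written out as a small sketch. Composite pages have to be fetched one after another, because each request needs the previous page's after key, so the worst-case evaluation time is roughly pages times per-page latency. The function names and the 60-second rule interval used in the example are assumptions for illustration.

```python
def worst_case_evaluation_seconds(pages: int, seconds_per_page: float) -> float:
    """Sequential pagination: total time is the sum of the per-page response times."""
    return pages * seconds_per_page


def exceeds_rule_interval(pages: int, seconds_per_page: float, rule_interval_seconds: float) -> bool:
    """True when one evaluation could still be running at the next scheduled run."""
    return worst_case_evaluation_seconds(pages, seconds_per_page) > rule_interval_seconds


# 5 pages at 10 ms each is about 0.05 s; 5 pages at 30 s each is 150 s.
fast = worst_case_evaluation_seconds(5, 0.010)
slow = worst_case_evaluation_seconds(5, 30)

# Against a hypothetical 60 s rule interval, only the slow case overruns.
fast_overruns = exceeds_rule_interval(5, 0.010, 60)
slow_overruns = exceeds_rule_interval(5, 30, 60)
```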

weltenwort · 2021-11-19T15:42:42Z

If execution time is our concern and we're looking at introducing a mechanism to limit it, wouldn't it make sense to use it as the primary criterion for aborting the evaluation instead? The cancellation signal and per-rule timeout that the alerting team is working on sounds more robust than trying to come up with a surrogate metric.
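A rough sketch of what a time-based abort could look like follows. This is not the alerting team's cancellation signal or per-rule timeout; it only checks a deadline between pages and cannot interrupt a query that is already in flight. `collect_until_deadline`, the `fetch_page` callback (assumed to return `(buckets, next_after_key)`, with `None` meaning no more pages), and the injectable `clock` are all hypothetical.

```python
import time


def collect_until_deadline(fetch_page, timeout_seconds: float, clock=time.monotonic):
    """Page through results until they run out or the time budget is spent.

    Returns (groups, timed_out). The deadline is only checked between pages,
    so a single slow page can still overrun the budget.
    """
    deadline = clock() + timeout_seconds
    groups = []
    after_key = None
    while True:
        if clock() >= deadline:
            return groups, True  # budget spent: abort with what we have so far
        buckets, after_key = fetch_page(after_key)
        groups.extend(buckets)
        if after_key is None:
            return groups, False  # all pages processed within the budget
```

Compared with a page-count cap, this bounds the quantity that actually matters (evaluation time) directly, at the cost of making the number of evaluated groups depend on cluster performance at the time of the run.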

jasonrhodes · 2022-01-14T14:46:28Z

Closing this in favor of #123053

jasonrhodes added the Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" label Nov 18, 2021

paulb-elastic assigned simianhacker Dec 6, 2021

jasonrhodes closed this as completed Jan 14, 2022
