[Logs UI] Optimize grouped rule execution in the log threshold rule type #124130

weltenwort · 2022-01-31T15:03:17Z

📓 Summary

We want to optimize the way the log threshold rule type executor queries and processes the data as to not block the nodejs event loop and decrease memory usage.

part of #98010

ℹ️ Background

The log threshold rule type excecutor currently handles four particular cases based on the rule params:

ungrouped with a single "count" criterion

execute simple query with size: 0
fire alert based on comparison between total hit count and configured threshold
complexity in ES: single search without scoring and size: 0
complexity in nodejs: constant time and memory

ungrouped with a "ratio" of two criteria

for the denominator and numerator each: execute a simple queries with size: 0
calculate ratio of the two total hit counts
fire alert based on comparison of the ratio and configured threshold
complexity in ES: two searches without scoring and size: 0
complexity in nodejs: constant time and memory

grouped with a single "count" criterion

execute a query with a composite agg over all grouping criteria as terms with page size 2000
collect all buckets into an array until the pages are exhausted
for each bucket: fire alert based on comparison between hit count and configured threshold
complexity in ES: a number of searches that increases linearly with the number of groups (divided by page size)
complexity in nodejs: linear time and memory regarding number of groups

grouped with a "ratio" of two criteria

for the denominator and numerator each: execute a query with a composite agg over all grouping criteria as terms with page size 2000
collect all buckets for each query into an array each until the pages are exhausted
for each bucket in the first array: find the matching bucket in the second array, calculate the ratio, and fire alert based on comparison between ratio and configured threshold
complexity in ES: a number of searches that increases linearly with the number of groups (divided by page size)
complexity in nodejs:
- quadratic time regarding number of groups (finding the matching bucket)
- linear memory regarding number of groups

💡 Optimizations

The grouped cases have particularly high optimization potential due to two factors:

The comparison with the thresholds is done on nodejs instead of Elasticsearch, which means the full result set needs to be fetched.
The results are gathered into memory completely before they're processed. This can lead to high memory usage and blocks the event loop while the IO-less comparisons are made.

Perform more computation in Elasticsearch

In order to ensure the alert doesn't miss any groups to approximate terms results, the grouping is performed using a composite aggregation. That aggregation currently has the limitation that it can't be post-processes as a whole using a sibling pipeline agg such as bucket_selector.

The individual pages can be processed that way, though. A bucket_selector in a composite agg can remove buckets from a page that don't match certain criteria (such as having a doc count above or below a threshold). In consequence pages would be partly empty leading to smaller response sizes and the threshold computation would be performed by Elasticsearch. This advantage would be largest for the most complex case of the grouped ratio rule, where a script could calculate the ratio before performing the filtering. That also means that the nominator/denominator filters probably needs to be moved (or duplicated) beneath the composite agg.

If the bucket_selector script is written to take the threshold as params its compilation result could even be cached between executions.

Process results incrementally

Assuming the ratio calculation and filtering is performed in a bucket_selector script as described above, the pages of groups could be processes immediately as they come in. This would interleave IO operations with the (much reduced) computation, which would allow for the event loop to preempt execution in favor of other workloads. Currently, the code has a high degree of code-reuse due to its carefully crafted decomposition (:clap:), which would be a bit harder in an interleaved execution model. But it's probably still possible to come up with an adequate structure, even if it looks a bit different.

✔️ Acceptance criteria

In the "grouped count" case: The filtering of the buckets is performed for each page in Elasticsearch instead of inline in the log threshold rule type executor.
In the "grouped ratio" case: The ratio calculation for each bucket and the filtering of the buckets for each page are performed in Elasticsearch instead of inline in the log threshold rule type executor.
In both "grouped" cases: The executor immediately calls the action factory after receiving a page such that...
- no more than one page is kept in memory at a time.
- control is yielded to the event loop at least once per page.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2022-01-31T15:03:19Z

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

weltenwort · 2022-01-31T17:38:46Z

@Kerry350 I'd like to hear your thoughts on this. Do you think it's feasible or did I miss anything?

Kerry350 · 2022-02-02T11:01:19Z

@weltenwort Thanks for breaking this down so thoroughly. This seems totally feasible to me using the methods you've stated.

Do you think we should make some before / after metrics (probably just execution time) part of the ACs? (Tends to be easier to grab those things as we develop).

which would be a bit harder in an interleaved execution model. But it's probably still possible to come up with an adequate structure, even if it looks a bit different.

I agree we'll end up with something that looks quite different, but we should be able to come up with something "easy to follow" again.

weltenwort · 2022-02-02T11:38:05Z

Do you think we should make some before / after metrics (probably just execution time) part of the ACs?

That would be neat, but would require a pretty big test dataset. I'll try to see how difficult it would be to appropriate the synthtrace cli for that. Any thoughts on what characteristics that dataset needs to have?

Kerry350 · 2022-02-04T14:35:08Z

but would require a pretty big test dataset. I'll try to see how difficult it would be to appropriate the synthtrace cli for that. Any thoughts on what characteristics that dataset needs to have?

Yeah, fair point. I know Chris did some good work to produce high cardinality datasets for metrics, but I appreciate that might not help here.

For characteristics I'd say we need one or more fields to represent the high cardinality nature, and then it would also be handy to have some fields with varied types (text, keyword etc) so that we can test against different comparators (which have the largest influence on the eventual query).

weltenwort · 2022-02-07T19:04:41Z

After looking at the synthtrace architecture and discussing it with @miltonhultgren I'd say it should be pretty feasible to add support for generating log entries and setting up the correct mappings. It's not a quick thing, though, so I'd rather not make it a dependency of this issue.

I'll try to come up with a simple bash script or so to use as a fallback in case this is prioritized higher than the synthtrace improvement.

jasonrhodes · 2022-04-18T19:07:07Z

Sorry for the delay in responding but this sounds like an excellent plan of action to me.

elasticmachine · 2023-11-09T12:02:40Z

Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs)

weltenwort added Feature:Logs UI Logs UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services labels Jan 31, 2022

weltenwort self-assigned this Jan 31, 2022

weltenwort mentioned this issue Jan 31, 2022

[Logs UI] Handle log threshold alert grouping fields with large cardinalities more robustly #98010

Closed

weltenwort removed their assignment Jan 4, 2023

gbamparop added the Feature:Alerting label Oct 13, 2023

gbamparop added the enhancement New value added to drive a business result label Oct 27, 2023

gbamparop added Team:obs-ux-logs Observability Logs User Experience Team and removed Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services labels Nov 9, 2023

botelastic bot added the needs-team Issues missing a team label label Nov 9, 2023

botelastic bot removed the needs-team Issues missing a team label label Nov 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Logs UI] Optimize grouped rule execution in the log threshold rule type #124130

[Logs UI] Optimize grouped rule execution in the log threshold rule type #124130

weltenwort commented Jan 31, 2022 •

edited

Loading

elasticmachine commented Jan 31, 2022

weltenwort commented Jan 31, 2022

Kerry350 commented Feb 2, 2022

weltenwort commented Feb 2, 2022

Kerry350 commented Feb 4, 2022

weltenwort commented Feb 7, 2022

jasonrhodes commented Apr 18, 2022

elasticmachine commented Nov 9, 2023

[Logs UI] Optimize grouped rule execution in the log threshold rule type #124130

[Logs UI] Optimize grouped rule execution in the log threshold rule type #124130

Comments

weltenwort commented Jan 31, 2022 • edited Loading

📓 Summary

ℹ️ Background

ungrouped with a single "count" criterion

ungrouped with a "ratio" of two criteria

grouped with a single "count" criterion

grouped with a "ratio" of two criteria

💡 Optimizations

Perform more computation in Elasticsearch

Process results incrementally

✔️ Acceptance criteria

elasticmachine commented Jan 31, 2022

weltenwort commented Jan 31, 2022

Kerry350 commented Feb 2, 2022

weltenwort commented Feb 2, 2022

Kerry350 commented Feb 4, 2022

weltenwort commented Feb 7, 2022

jasonrhodes commented Apr 18, 2022

elasticmachine commented Nov 9, 2023

weltenwort commented Jan 31, 2022 •

edited

Loading