Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Logs UI] Optimize grouped rule execution in the log threshold rule type #124130

Open
weltenwort opened this issue Jan 31, 2022 · 8 comments
Open
Labels
enhancement New value added to drive a business result Feature:Alerting Feature:Logs UI Logs UI feature Team:obs-ux-logs Observability Logs User Experience Team

Comments

@weltenwort
Copy link
Member

weltenwort commented Jan 31, 2022

📓 Summary

We want to optimize the way the log threshold rule type executor queries and processes the data as to not block the nodejs event loop and decrease memory usage.

part of #98010

ℹ️ Background

The log threshold rule type excecutor currently handles four particular cases based on the rule params:

ungrouped with a single "count" criterion

  • execute simple query with size: 0
  • fire alert based on comparison between total hit count and configured threshold
  • complexity in ES: single search without scoring and size: 0
  • complexity in nodejs: constant time and memory

ungrouped with a "ratio" of two criteria

  • for the denominator and numerator each: execute a simple queries with size: 0
  • calculate ratio of the two total hit counts
  • fire alert based on comparison of the ratio and configured threshold
  • complexity in ES: two searches without scoring and size: 0
  • complexity in nodejs: constant time and memory

grouped with a single "count" criterion

  • execute a query with a composite agg over all grouping criteria as terms with page size 2000
  • collect all buckets into an array until the pages are exhausted
  • for each bucket: fire alert based on comparison between hit count and configured threshold
  • complexity in ES: a number of searches that increases linearly with the number of groups (divided by page size)
  • complexity in nodejs: linear time and memory regarding number of groups

grouped with a "ratio" of two criteria

  • for the denominator and numerator each: execute a query with a composite agg over all grouping criteria as terms with page size 2000
  • collect all buckets for each query into an array each until the pages are exhausted
  • for each bucket in the first array: find the matching bucket in the second array, calculate the ratio, and fire alert based on comparison between ratio and configured threshold
  • complexity in ES: a number of searches that increases linearly with the number of groups (divided by page size)
  • complexity in nodejs:
    • quadratic time regarding number of groups (finding the matching bucket)
    • linear memory regarding number of groups

💡 Optimizations

The grouped cases have particularly high optimization potential due to two factors:

  • The comparison with the thresholds is done on nodejs instead of Elasticsearch, which means the full result set needs to be fetched.
  • The results are gathered into memory completely before they're processed. This can lead to high memory usage and blocks the event loop while the IO-less comparisons are made.

Perform more computation in Elasticsearch

In order to ensure the alert doesn't miss any groups to approximate terms results, the grouping is performed using a composite aggregation. That aggregation currently has the limitation that it can't be post-processes as a whole using a sibling pipeline agg such as bucket_selector.

The individual pages can be processed that way, though. A bucket_selector in a composite agg can remove buckets from a page that don't match certain criteria (such as having a doc count above or below a threshold). In consequence pages would be partly empty leading to smaller response sizes and the threshold computation would be performed by Elasticsearch. This advantage would be largest for the most complex case of the grouped ratio rule, where a script could calculate the ratio before performing the filtering. That also means that the nominator/denominator filters probably needs to be moved (or duplicated) beneath the composite agg.

If the bucket_selector script is written to take the threshold as params its compilation result could even be cached between executions.

Process results incrementally

Assuming the ratio calculation and filtering is performed in a bucket_selector script as described above, the pages of groups could be processes immediately as they come in. This would interleave IO operations with the (much reduced) computation, which would allow for the event loop to preempt execution in favor of other workloads. Currently, the code has a high degree of code-reuse due to its carefully crafted decomposition (:clap:), which would be a bit harder in an interleaved execution model. But it's probably still possible to come up with an adequate structure, even if it looks a bit different.

✔️ Acceptance criteria

  • In the "grouped count" case: The filtering of the buckets is performed for each page in Elasticsearch instead of inline in the log threshold rule type executor.
  • In the "grouped ratio" case: The ratio calculation for each bucket and the filtering of the buckets for each page are performed in Elasticsearch instead of inline in the log threshold rule type executor.
  • In both "grouped" cases: The executor immediately calls the action factory after receiving a page such that...
    • no more than one page is kept in memory at a time.
    • control is yielded to the event loop at least once per page.
@weltenwort weltenwort added Feature:Logs UI Logs UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services labels Jan 31, 2022
@weltenwort weltenwort self-assigned this Jan 31, 2022
@elasticmachine
Copy link
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@weltenwort
Copy link
Member Author

@Kerry350 I'd like to hear your thoughts on this. Do you think it's feasible or did I miss anything?

@Kerry350
Copy link
Contributor

Kerry350 commented Feb 2, 2022

@weltenwort Thanks for breaking this down so thoroughly. This seems totally feasible to me using the methods you've stated.

Do you think we should make some before / after metrics (probably just execution time) part of the ACs? (Tends to be easier to grab those things as we develop).

which would be a bit harder in an interleaved execution model. But it's probably still possible to come up with an adequate structure, even if it looks a bit different.

I agree we'll end up with something that looks quite different, but we should be able to come up with something "easy to follow" again.

@weltenwort
Copy link
Member Author

Do you think we should make some before / after metrics (probably just execution time) part of the ACs?

That would be neat, but would require a pretty big test dataset. I'll try to see how difficult it would be to appropriate the synthtrace cli for that. Any thoughts on what characteristics that dataset needs to have?

@Kerry350
Copy link
Contributor

Kerry350 commented Feb 4, 2022

but would require a pretty big test dataset. I'll try to see how difficult it would be to appropriate the synthtrace cli for that. Any thoughts on what characteristics that dataset needs to have?

Yeah, fair point. I know Chris did some good work to produce high cardinality datasets for metrics, but I appreciate that might not help here.

For characteristics I'd say we need one or more fields to represent the high cardinality nature, and then it would also be handy to have some fields with varied types (text, keyword etc) so that we can test against different comparators (which have the largest influence on the eventual query).

@weltenwort
Copy link
Member Author

After looking at the synthtrace architecture and discussing it with @miltonhultgren I'd say it should be pretty feasible to add support for generating log entries and setting up the correct mappings. It's not a quick thing, though, so I'd rather not make it a dependency of this issue.

I'll try to come up with a simple bash script or so to use as a fallback in case this is prioritized higher than the synthtrace improvement.

@jasonrhodes
Copy link
Member

Sorry for the delay in responding but this sounds like an excellent plan of action to me.

@weltenwort weltenwort removed their assignment Jan 4, 2023
@gbamparop gbamparop added the enhancement New value added to drive a business result label Oct 27, 2023
@gbamparop gbamparop added Team:obs-ux-logs Observability Logs User Experience Team and removed Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services labels Nov 9, 2023
@botelastic botelastic bot added the needs-team Issues missing a team label label Nov 9, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Nov 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New value added to drive a business result Feature:Alerting Feature:Logs UI Logs UI feature Team:obs-ux-logs Observability Logs User Experience Team
Projects
None yet
Development

No branches or pull requests

5 participants