
[Logs UI] Handle log threshold alert grouping fields with large cardinalities more robustly #98010

Closed
weltenwort opened this issue Apr 22, 2021 · 32 comments
Assignees
Labels: discuss, Feature:Logs UI (Logs UI feature), Team:Infra Monitoring UI - DEPRECATED (use Team:obs-ux-infra_services)

Comments

@weltenwort
Member

weltenwort commented Apr 22, 2021

Problem description

When the user creates a log threshold alert with a "group by" field of large cardinality, the alert executor will paginate through a large number of composite aggregation pages. This can consume and possibly exceed the resources available in Elasticsearch and Kibana and thereby negatively impact the availability of the service. Additionally, the alert execution might time out and miss alerts that should have been fired.

On top of that, the query used for checking the condition prioritizes correctness over performance by filtering out non-matching groups as late as possible. This allows for checking zero-count thresholds, but prevents Elasticsearch from optimizing the query more aggressively.

Possible solutions

  • Check and warn about high cardinality of the grouping field when creating the job.
  • Offer a setting on job creation to set an acceptable cardinality limit (as in "group by host.name up to 10000 groups").
  • Check the cardinality on execution and fail early and loudly when the configured limit is exceeded.
  • Special case costly grouped "alert when less than" conditions and use more efficient queries for all other cases. (By moving the filter out of the composite agg to the global bool filter; see the sketch after this list.)
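For illustration, a rough sketch of the query shape that last bullet describes, using made-up field names (`host.name` grouping, a hypothetical `log.level` condition) rather than the executor's actual request:

```ts
// Sketch only: hoisting the condition into the top-level bool filter for
// "more than" rules. Field names and values are illustrative.
const optimizedMoreThanQuery = {
  query: {
    bool: {
      filter: [
        // The date range stays at the top level in either strategy.
        { range: { '@timestamp': { gte: 'now-5m' } } },
        // For "more than" conditions the rule's filter can also live here,
        // letting Elasticsearch discard non-matching documents before
        // bucketing instead of carrying every group through the composite.
        { term: { 'log.level': 'error' } },
      ],
    },
  },
  aggregations: {
    groups: {
      composite: {
        size: 40,
        sources: [{ group: { terms: { field: 'host.name' } } }],
      },
    },
  },
};
```

The trade-off discussed below still applies: with the filter at the top level, groups whose documents all fail the condition never show up in the response, so this shape only suits "more than" comparators.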
@weltenwort added the discuss, Feature:Logs UI, and Team:Infra Monitoring UI - DEPRECATED labels on Apr 22, 2021
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@dgieselaar
Member

It looks like the filter is applied as a child of the composite aggregation, which means that if the group-by field has high cardinality, it will run into issues regardless of whether the user has specified a filter.

@weltenwort
Member Author

@dgieselaar That is true, but intentional. The reasoning is explained in #68250 (review):

TL;DR: filtering too early means we can't detect disappearing groups

The alert won't fire when no document matches the condition. This is a consequence of the way the grouping composite aggregation is combined with the condition in the filter query. The filter is evaluated first, so groups might fully disappear and alert instances might not be created. So unfortunately this means that an alert like "alert me when any of the hosts don't produce logs where 'status' is 'alive' for the last five minutes" won't work.

One way around this would be to not put the condition filters in the top-level query (only the date range filters), but instead nest them in a filter aggregation beneath the composite aggregation. That way the groups would exist as long as the grouping value exists at all in any document in that time range. Groups will then still disappear when no document exists, but I can't think of a clean way around that.
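Roughly, that group-preserving shape looks like this (a sketch with illustrative field names, not the actual executor query):

```ts
// Sketch only: the date range is the sole top-level filter, while the
// condition is evaluated per group in a child filter aggregation, so a
// group with zero matching documents still produces a bucket.
const groupPreservingQuery = {
  query: {
    bool: {
      filter: [{ range: { '@timestamp': { gte: 'now-5m' } } }],
    },
  },
  aggregations: {
    groups: {
      composite: {
        size: 40,
        sources: [{ group: { terms: { field: 'host.name' } } }],
      },
      aggregations: {
        matching_docs: {
          filter: { term: { 'log.level': 'error' } },
        },
      },
    },
  },
};
```

The executor can then compare each bucket's `matching_docs.doc_count` against the threshold, including when it is zero.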

@dgieselaar
Member

@weltenwort I don't see an option in the UI to alert on no data? Am I looking in the wrong place?

[screenshot omitted]

(Looks like the popovers keep getting stuck as well, btw)

@weltenwort
Member Author

weltenwort commented Apr 23, 2021

You can alert on a threshold of 0 documents, which means the group is known, but there are no documents that match the specific filter conditions. That can be used, for example, to detect when a service on a host has stopped sending logs.

@jasonrhodes
Member

You can alert on a threshold of 0 documents

I think this also arises when you query for "less than n documents", if I recall correctly (since the 0 scenario is always less than n).

@weltenwort
Member Author

Yes, that sounds correct. 👍 It would still be valuable to optimize the query for "more than" cases IMHO.

@dgieselaar
Member

One solution would be to offer the user two filters: one that is applied as part of the query, and one that is applied as a filter aggregation.

@gmmorris
Contributor

gmmorris commented May 5, 2021

If there's anything we can do at the Platform level to make this kind of thing easier, please don't hesitate to pull us in :)

@weltenwort
Member Author

One solution would be to offer the user two filters: one that is applied as part of the query, and one that is applied as a filter aggregation.

Yes, that would be one approach, but it might make it a lot harder for the user to understand. Maybe there's a way to structure this in the UI in such a way that the semantics of opting into an expensive execution strategy would become apparent. Those would all be breaking changes, though.

@jasonrhodes
Member

I notice we are using a composite.size of 40 for this query. Are we basing this number exactly on the number of buckets we expect? We bumped the metrics inventory query up to 2000 for this size (based on bucket calculations) and it had a massive effect on perf.
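For context, a hedged sketch of the composite pagination loop this refers to, assuming the v8 Elasticsearch JS client (which returns the response body directly); the index, field names, and typing shortcuts are illustrative, not the actual executor code:

```ts
import type { Client } from '@elastic/elasticsearch';

// Pages through a composite aggregation using after_key. A larger `size`
// means fewer round trips for the same total number of groups.
async function fetchAllGroups(esClient: Client, index: string, size: number) {
  const buckets: unknown[] = [];
  let afterKey: Record<string, unknown> | undefined;

  do {
    const response = await esClient.search({
      index,
      size: 0,
      aggregations: {
        groups: {
          composite: {
            size, // currently 40 in the rule executor
            ...(afterKey ? { after: afterKey } : {}),
            sources: [{ group: { terms: { field: 'host.name' } } }],
          },
        },
      },
    });

    const groupsAgg = (response.aggregations as any)?.groups;
    buckets.push(...(groupsAgg?.buckets ?? []));
    afterKey = groupsAgg?.after_key;
  } while (afterKey);

  return buckets;
}
```

Note that this still accumulates every bucket in memory, which is the Kibana OOM concern raised below.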

@dgieselaar
Member

I notice we are using a composite.size of 40 for this query. Are we basing this number exactly on the number of buckets we expect? We bumped the metrics inventory query up to 2000 for this size (based on bucket calculations) and it had a massive effect on perf.

FWIW, I'm not sure if this solves the issue where Kibana itself OOMs (as the responses are kept in memory).

@weltenwort
Member Author

FWIW, I'm not sure if this solves the issue where Kibana itself OOMs (as the responses are kept in memory).

Probably not, because it would only reduce the number of requests it performs to fetch the same amount of data.

@jasonrhodes
Member

I think it still likely makes the rule execution take longer than necessary unless we have a good reason to keep it at 40. But yes, we probably need to think broader than that.

@Kerry350
Contributor

Kerry350 commented Jun 15, 2021

Special case costly grouped "alert when less than" conditions and use more efficient queries for all other cases. (By moving the filter out of the composite agg to the global bool filter.)

I'm going to implement this quick win, which will at least dramatically improve the "more than" scenario. I'll also increase the composite size, and lastly add a note about high-cardinality fields to the docs (more of a stop-gap until we provide some UI options).

Edit: I'll also add an in-UI warning that using a "less than" comparator together with a group-by field can cause problems. We can add this without committing to options that will need to be persisted.

@matschaffer
Contributor

matschaffer commented Nov 17, 2021

Hi, I ended up here tracing from https://github.com/elastic/sdh-kibana/issues/2266 and wondering why this isn't a top N query.

For an alert like "count is more than 75 group by arn" I don't think I would expect an exhaustive list of all possible arns when there are thousands of them.

I would have expected a top N query such that we could say "more than 75 error messages found for these top 10 arns"

Perhaps with a note like "{{sum_other_doc_count}} documents were skipped after finding those top 10, which may contain additional arns"

@matschaffer
Contributor

I guess "less than" might be a bit of a gotcha given that you should avoid _count: asc and use https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-rare-terms-aggregation.html instead, but we could probably make the agg selection on the back end.
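For reference, a loose sketch of what a rare_terms-based selection could look like (the field name and max_doc_count are illustrative; rare_terms caps max_doc_count at 100):

```ts
// Sketch only: surfaces low-count groups without sorting a terms
// aggregation by _count ascending, which the ES docs advise against.
const rareGroupsRequest = {
  size: 0,
  query: {
    bool: {
      filter: [{ range: { '@timestamp': { gte: 'now-5m' } } }],
    },
  },
  aggregations: {
    rare_groups: {
      rare_terms: {
        field: 'arn',
        max_doc_count: 75, // at most this many matching docs per group
      },
    },
  },
};
```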

@weltenwort
Member Author

@matschaffer There might be a misunderstanding here. Grouping doesn't create a single alert with a list, but a separate alert for each group. That way a downstream notification tool can fan out the alerts to the proper channels.

@matschaffer
Contributor

Ahhh... so each one could potentially go to a different channel?

@weltenwort
Member Author

Yes, or be indexed using the index action.

@miltonhultgren
Contributor

Dropping an idea @weltenwort mentioned in a chat: it could be nice if we had some way to simulate a rule execution while the rule is being created. If we could couple that with some profiling in the Alerting Framework to give timing information back to the users, we might be able to guide them to create more performant alerts.

@miltonhultgren self-assigned this on Dec 8, 2021
@weltenwort
Member Author

IIRC @Kerry350 has been thinking about simulating the queries for a while.

@Kerry350
Contributor

Kerry350 commented Dec 8, 2021

👋 👋

Yeah, I've been wanting to add query simulation for a while (since release one 😂). There are so many cases in SDHs and so on where a simple copy-paste of the query ES is executing would have been enough. I planned to dig into this for my Space, Time.

Being able to couple that with some Alerting profiling would be pretty cool.

The Alerting team has a ticket for adding an "explain" functionality, which is exactly what we'd want to aid in these debugging scenarios (I was struggling to find the ticket, but here it is: #84417). I may well look at adding a proof of concept for that overall functionality. If that's too hard, then I'll just add a standalone "explain" POC for us (for the ON Week).

@miltonhultgren
Contributor

miltonhultgren commented Dec 15, 2021

After working on "Check and warn about high cardinality of the grouping field when creating the job" for a few days and speaking with @weltenwort, my conclusion is that it is hard to offer meaningful up-front warnings, since the final performance will depend on the deployment size and the load during execution (which also raises the question of when to perform such health checks).

Simply having a high-cardinality field that will create a lot of pages does not mean the performance will be degraded, because "high" needs to be put in relation to the cluster size and load. For a small deployment, 10 pages of 100 buckets each might kill it, while a large deployment chews right through it.

While we can make progress by changing how we process the buckets to avoid going OOM or blocking the Kibana JavaScript thread for too long, it feels like we also need some way to alert if the performance of a rule itself is starting to degrade (Rule A takes on average longer to execute than its interval setting) or the cluster health is degrading as a result of a rule executing (when Rule A is executing, memory consumption spikes 9 out of 10 times).
With that we might be able to disable poorly performing rules before the cluster goes down, guiding users to diagnostic tools (like simulation) and offering some advice on what might be causing the degradation (increased cardinality over time, increased load on the cluster).

@miltonhultgren removed their assignment on Dec 15, 2021
@jasonrhodes
Member

It's possible that the changes made by @simianhacker in this PR -> #121904 can be implemented here as well.

@miltonhultgren
Contributor

miltonhultgren commented Jan 10, 2022

We'd likely have to check whether 10k is a good number for the log threshold executor and do a similar performance investigation of that code to find the bottlenecks to remove. Should we update this issue for that work or create a new one? @jasonrhodes

@jasonrhodes
Member

From my perspective, I think we need to start assuming that the size value should be much, much bigger unless we have a specific reason to suspect that it will cause "too many buckets" errors. But yes, we should do some spot checking for sure.

I think a new ticket would be a good idea, although I think @simianhacker has already done some work here on implementing this across some of the other rules?

The main things I think we should focus on for now are:

  1. Increase composite "size" value as much as possible
  2. Do not block the event loop while processing results in memory -- either schedule alert actions in between each cursor-based call, or at least use some kind of recursive function + setImmediate() to process them in smaller batches (see the sketch at the end of this comment).
  3. Implement the new cancellation APIs from the alerting framework

(1) needs perf evaluation
(2) is something we should just do, no matter what
(3) we need to dig into more, but also we should just do it

If someone can make some tickets to work on this, that'd be great, but if not I will put it on my list to break this down and we can plan it for an upcoming cycle.
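To make point (2) concrete, a minimal sketch of the setImmediate-based batching idea; the batch size and bucket handling are placeholders, not the executor's actual shape:

```ts
// Processes buckets in chunks, yielding back to the event loop between
// chunks so the rule executor doesn't block Kibana for the whole run.
function processInBatches<T>(
  buckets: T[],
  handleBucket: (bucket: T) => void,
  batchSize = 1000
): Promise<void> {
  return new Promise((resolve) => {
    let index = 0;

    const runBatch = () => {
      const end = Math.min(index + batchSize, buckets.length);
      for (; index < end; index++) {
        handleBucket(buckets[index]);
      }
      if (index < buckets.length) {
        setImmediate(runBatch); // let other queued work run first
      } else {
        resolve();
      }
    };

    runBatch();
  });
}
```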

@weltenwort
Member Author

I can create those tickets

@weltenwort
Member Author

@jasonrhodes there already is an issue (#122550) for adding cancellation on the @elastic/actionable-observability board. We should clarify who tackles that and make sure it doesn't collide with the executor refactoring required for the composite loop optimization.

@weltenwort
Member Author

@jasonrhodes I created an issue for the optimized querying in #124130. Let me know what you think.

@smith
Contributor

smith commented Apr 18, 2022

Do we think we should close this and assume doing #122550 and #124130 will get us to a place where things are acceptable?

@jasonrhodes
Member

Yeah I think #124130 in particular just replaces this one.
