
[Logs UI] Handle log threshold alert grouping fields with large cardinalities more robustly #98010

Closed
weltenwort opened this issue Apr 22, 2021 · 32 comments
Assignees
Labels: discuss, Feature:Logs UI (Logs UI feature), Team:Infra Monitoring UI - DEPRECATED (use Team:obs-ux-infra_services)

Comments

@weltenwort
Member

weltenwort commented Apr 22, 2021

Problem description

When the user creates a log threshold alert with a "group by" field of large cardinality, the alert executor will paginate through a large number of composite aggregation pages. This can consume and possibly exceed the resources available in Elasticsearch and Kibana and thereby negatively impact the availability of the service. Additionally, the alert execution might time out and miss alerts that should have been fired.

On top of that, the query used for checking the condition prioritizes correctness over performance by filtering out non-matching groups as late as possible. This allows for checking zero-count thresholds, but prevents Elasticsearch from optimizing the query more aggressively.

Possible solutions

  • Check and warn about high cardinality of the grouping field when creating the job.
  • Offer a setting on job creation to set an acceptable cardinality limit (as in "group by host.name up to 10000 groups").
  • Check the cardinality on execution and fail early and loudly when the configured limit is exceeded.
  • Special case costly grouped "alert when less than" conditions and use more efficient queries for all other cases. (By moving the filter out of the composite agg to the global bool filter; see the sketch after this list.)
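For illustration, a rough sketch of the query shape that last bullet describes, using made-up field names (`host.name` grouping, a hypothetical `log.level` condition) rather than the executor's actual request:

```ts
// Sketch only: hoisting the condition into the top-level bool filter for
// "more than" rules. Field names and values are illustrative.
const optimizedMoreThanQuery = {
  query: {
    bool: {
      filter: [
        // The date range stays at the top level in either strategy.
        { range: { '@timestamp': { gte: 'now-5m' } } },
        // For "more than" conditions the rule's filter can also live here,
        // letting Elasticsearch discard non-matching documents before
        // bucketing instead of carrying every group through the composite.
        { term: { 'log.level': 'error' } },
      ],
    },
  },
  aggregations: {
    groups: {
      composite: {
        size: 40,
        sources: [{ group: { terms: { field: 'host.name' } } }],
      },
    },
  },
};
```

The trade-off discussed below still applies: with the filter at the top level, groups whose documents all fail the condition never show up in the response, so this shape only suits "more than" comparators.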
@weltenwort added the discuss, Feature:Logs UI, and Team:Infra Monitoring UI - DEPRECATED labels on Apr 22, 2021
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@dgieselaar
Member

It looks like the filter is applied as a child of the composite aggregation, which means that if the group-by field has high cardinality, it will run into issues regardless of whether the user has specified a filter.

@weltenwort
Member Author

@dgieselaar That is true, but intentional. The reasoning is explained in #68250 (review):

TL;DR: filtering too early means we can't detect disappearing groups

The alert won't fire when no document matches the condition. This is a consequence of the way the grouping composite aggregation is combined with the condition in the filter query. The filter is evaluated first, so groups might fully disappear and alert instances might not be created. So unfortunately this means that an alert like "alert me when any of the hosts don't produce logs where 'status' is 'alive' for the last five minutes" won't work.

One way around this would be to not put the condition filters in the top-level query (only the date range filters), but instead nest them in a filter aggregation beneath the composite aggregation. That way the groups would exist as long as the grouping value exists at all in any document in that time range. Groups will then still disappear when no document exists, but I can't think of a clean way around that.
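Roughly, that group-preserving shape looks like this (a sketch with illustrative field names, not the actual executor query):

```ts
// Sketch only: the date range is the sole top-level filter, while the
// condition is evaluated per group in a child filter aggregation, so a
// group with zero matching documents still produces a bucket.
const groupPreservingQuery = {
  query: {
    bool: {
      filter: [{ range: { '@timestamp': { gte: 'now-5m' } } }],
    },
  },
  aggregations: {
    groups: {
      composite: {
        size: 40,
        sources: [{ group: { terms: { field: 'host.name' } } }],
      },
      aggregations: {
        matching_docs: {
          filter: { term: { 'log.level': 'error' } },
        },
      },
    },
  },
};
```

The executor can then compare each bucket's `matching_docs.doc_count` against the threshold, including when it is zero.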

@dgieselaar
Member

@weltenwort I don't see an option in the UI to alert on no data? Am I looking in the wrong place?

[screenshot omitted]

(Looks like the popovers keep getting stuck as well, btw)

@weltenwort
Member Author

weltenwort commented Apr 23, 2021

You can alert on a threshold of 0 documents, which means the group is known, but there are no documents that match the specific filter conditions. That can be used, for example, to detect when a service on a host has stopped sending logs.

@jasonrhodes
Member

You can alert on a threshold of 0 documents

I think this also arises when you query for "less than n documents", if I recall correctly (since the 0 scenario is always less than n).

@weltenwort
Member Author

Yes, that sounds correct. 👍 It would still be valuable to optimize the query for "more than" cases IMHO.

@dgieselaar
Member

One solution would be to offer the user two filters: one that is applied as part of the query, and one that is applied as a filter aggregation.

@gmmorris
Contributor

gmmorris commented May 5, 2021

If there's anything we can do at the Platform level to make this kind of thing easier, please don't hesitate to pull us in :)

@weltenwort
Member Author

One solution would be to offer the user two filters: one that is applied as part of the query, and one that is applied as a filter aggregation.

Yes, that would be one approach, but it might make it a lot harder for the user to understand. Maybe there's a way to structure this in the UI in such a way that the semantics of opting into an expensive execution strategy would become apparent. Those would all be breaking changes, though.

@jasonrhodes
Member

I notice we are using a composite.size of 40 for this query. Are we basing this number exactly on the number of buckets we expect? We bumped the metrics inventory query up to 2000 for this size (based on bucket calculations) and it had a massive effect on perf.
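For context, a hedged sketch of the composite pagination loop this refers to, assuming the v8 Elasticsearch JS client (which returns the response body directly); the index, field names, and typing shortcuts are illustrative, not the actual executor code:

```ts
import type { Client } from '@elastic/elasticsearch';

// Pages through a composite aggregation using after_key. A larger `size`
// means fewer round trips for the same total number of groups.
async function fetchAllGroups(esClient: Client, index: string, size: number) {
  const buckets: unknown[] = [];
  let afterKey: Record<string, unknown> | undefined;

  do {
    const response = await esClient.search({
      index,
      size: 0,
      aggregations: {
        groups: {
          composite: {
            size, // currently 40 in the rule executor
            ...(afterKey ? { after: afterKey } : {}),
            sources: [{ group: { terms: { field: 'host.name' } } }],
          },
        },
      },
    });

    const groupsAgg = (response.aggregations as any)?.groups;
    buckets.push(...(groupsAgg?.buckets ?? []));
    afterKey = groupsAgg?.after_key;
  } while (afterKey);

  return buckets;
}
```

Note that this still accumulates every bucket in memory, which is the Kibana OOM concern raised below.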

@dgieselaar
Member

I notice we are using a composite.size of 40 for this query. Are we basing this number exactly on the number of buckets we expect? We bumped the metrics inventory query up to 2000 for this size (based on bucket calculations) and it had a massive effect on perf.

FWIW, I'm not sure if this solves the issue where Kibana itself OOMs (as the responses are kept in memory).

@weltenwort
Member Author

FWIW, I'm not sure if this solves the issue where Kibana itself OOMs (as the responses are kept in memory).

Probably not, because it would only reduce the number of requests it performs to fetch the same amount of data.

@jasonrhodes
Member

I think it still likely makes the rule execution take longer than necessary unless we have a good reason to keep it at 40. But yes, we probably need to think broader than that.

@Kerry350
Contributor

Kerry350 commented Jun 15, 2021

Special case costly grouped "alert when less than" conditions and use more efficient queries for all other cases. (By moving the filter out of the composite agg to the global bool filter.)

I'm going to implement this quick win, which will at least dramatically improve the "more than" scenario. I'll also increase the composite size, and lastly add a note about high-cardinality fields to the docs (more of a stop-gap until we provide some UI options).

Edit: I'll also add an in-UI warning that using a "less than" comparator together with a group-by field can cause problems. We can add this without committing to options that will need to be persisted.

@matschaffer
Contributor

matschaffer commented Nov 17, 2021

Hi, I ended up here tracing from https://github.com/elastic/sdh-kibana/issues/2266 and wondering why this isn't a top N query.

For an alert like "count is more than 75 group by arn" I don't think I would expect an exhaustive list of all possible arns when there are thousands of them.

I would have expected a top N query such that we could say "more than 75 error messages found for these top 10 arns"

Perhaps with a note like "{{sum_other_doc_count}} documents were skipped after finding those top 10, which may contain additional arns"

@matschaffer
Contributor

I guess "less than" might be a bit of a gotcha given that you should avoid _count: asc and use https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-rare-terms-aggregation.html instead, but we could probably make the agg selection on the back end.
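For reference, a loose sketch of what a rare_terms-based selection could look like (the field name and max_doc_count are illustrative; rare_terms caps max_doc_count at 100):

```ts
// Sketch only: surfaces low-count groups without sorting a terms
// aggregation by _count ascending, which the ES docs advise against.
const rareGroupsRequest = {
  size: 0,
  query: {
    bool: {
      filter: [{ range: { '@timestamp': { gte: 'now-5m' } } }],
    },
  },
  aggregations: {
    rare_groups: {
      rare_terms: {
        field: 'arn',
        max_doc_count: 75, // at most this many matching docs per group
      },
    },
  },
};
```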

@weltenwort
Member Author

@matschaffer There might be a misunderstanding here. Grouping doesn't create a single alert with a list, but a separate alert for each group. That way a downstream notification tool can fan out the alerts to the proper channels.

@matschaffer
Contributor

Ahhh... so each one could potentially go to a different channel?

@weltenwort
Member Author

Yes, or be indexed using the index action.

@miltonhultgren
Contributor

Dropping an idea @weltenwort mentioned in a chat: it could be nice if we had some way to simulate a rule execution while the rule is being created. If we could couple that with some profiling in the Alerting Framework to give timing information back to the users, we might be able to guide them to create more performant alerts.

@miltonhultgren self-assigned this on Dec 8, 2021
@weltenwort
Member Author

IIRC @Kerry350 has been thinking about simulating the queries for a while.

@Kerry350
Contributor

Kerry350 commented Dec 8, 2021

👋 👋

Yeah, I've been wanting to add query simulation for a while (since release one 😂). There are so many cases in SDHs and so on where a simple copy-paste of the query ES is executing would have been enough. I planned to dig into this for my Space, Time.

Being able to couple that with some Alerting profiling would be pretty cool.

The Alerting team has a ticket for adding an "explain" functionality, which is exactly what we'd want to aid in these debugging scenarios (I was struggling to find the ticket, but here it is: #84417). I may well look at adding a proof of concept for that overall functionality. If that's too hard, then I'll just add a standalone "explain" POC for us (for the ON Week).

@miltonhultgren
Contributor

miltonhultgren commented Dec 15, 2021

After working on "Check and warn about high cardinality of the grouping field when creating the job" for a few days and speaking with @weltenwort, my conclusion is that it is hard to offer meaningful up-front warnings, since the final performance will depend on the deployment size and the load during execution (which also raises the question of when to perform such health checks).

Simply having a high-cardinality field that will create a lot of pages does not mean the performance will be degraded, because "high" needs to be put in relation to the cluster size and load. For a small deployment, 10 pages of 100 buckets each might kill it, while a large deployment chews right through it.

While we can make progress by changing how we process the buckets to avoid going OOM or blocking the Kibana JavaScript thread for too long, it feels like we also need some way to alert if the performance of a rule itself is starting to degrade (Rule A takes on average longer to execute than its interval setting) or the cluster health is degrading as a result of a rule executing (when Rule A is executing, memory consumption spikes 9 out of 10 times).
With that we might be able to disable poorly performing rules before the cluster goes down, guiding users to diagnostic tools (like simulation) and offering some advice on what might be causing the degradation (increased cardinality over time, increased load on the cluster).

@miltonhultgren removed their assignment on Dec 15, 2021
@jasonrhodes
Member

It's possible that the changes made by @simianhacker in this PR -> #121904 can be implemented here as well.

@miltonhultgren
Contributor

miltonhultgren commented Jan 10, 2022

We'd likely have to check whether 10k is a good number for the log threshold executor and do a similar performance investigation of that code to find the bottlenecks to remove. Should we update this issue for that work or create a new one? @jasonrhodes

@jasonrhodes
Member

From my perspective, I think we need to start assuming that the size value should be much, much bigger unless we have a specific reason to suspect that it will cause "too many buckets" errors. But yes, we should do some spot checking for sure.

I think a new ticket would be a good idea, although I think @simianhacker has already done some work here on implementing this across some of the other rules?

The main things I think we should focus on for now are:

  1. Increase composite "size" value as much as possible
  2. Do not block the event loop while processing results in memory -- either schedule alert actions in between each cursor-based call, or at least use some kind of recursive function + setImmediate() to process them in smaller batches (see the sketch at the end of this comment).
  3. Implement the new cancellation APIs from the alerting framework

(1) needs perf evaluation
(2) is something we should just do, no matter what
(3) we need to dig into more, but also we should just do it

If someone can make some tickets to work on this, that'd be great, but if not I will put it on my list to break this down and we can plan it for an upcoming cycle.
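To make point (2) concrete, a minimal sketch of the setImmediate-based batching idea; the batch size and bucket handling are placeholders, not the executor's actual shape:

```ts
// Processes buckets in chunks, yielding back to the event loop between
// chunks so the rule executor doesn't block Kibana for the whole run.
function processInBatches<T>(
  buckets: T[],
  handleBucket: (bucket: T) => void,
  batchSize = 1000
): Promise<void> {
  return new Promise((resolve) => {
    let index = 0;

    const runBatch = () => {
      const end = Math.min(index + batchSize, buckets.length);
      for (; index < end; index++) {
        handleBucket(buckets[index]);
      }
      if (index < buckets.length) {
        setImmediate(runBatch); // let other queued work run first
      } else {
        resolve();
      }
    };

    runBatch();
  });
}
```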

@weltenwort
Member Author

I can create those tickets

@weltenwort
Member Author

@jasonrhodes there already is an issue (#122550) for adding cancellation on the @elastic/actionable-observability board. We should clarify who tackles that and make sure it doesn't collide with the executor refactoring required for the composite loop optimization.

@weltenwort
Member Author

@jasonrhodes I created an issue for the optimized querying in #124130. Let me know what you think.

@smith
Contributor

smith commented Apr 18, 2022

Do we think we should close this and assume doing #122550 and #124130 will get us to a place where things are acceptable?

@jasonrhodes
Member

Yeah I think #124130 in particular just replaces this one.
