[Stack Monitoring] Logstash Pipelines view can trigger OOM #37246
Comments
Pinging @elastic/stack-monitoring |
@ycombinator You worked on something similar. Anything come to mind here? |
Without a proper investigation, my guess is this has to do with the sparklines on the Logstash pipeline listing page. I suspect the aggs for the queries that generate those are the source of this issue. |
Same. |
This could be an interesting issue for @igoristic to pick up. @chrisronline what do you think? |
This is definitely something I can take a look at |
Thanks @igoristic. I'm happy to answer any questions you might have about the code or the history here. |
Thanks, @igoristic. Since this is negatively affecting some production systems, please consider this issue a higher priority. |
The current PR addresses the issue, but now the Pipelines will only refresh when the time range changes. The problem is that I initially went with a different approach, and I also had another idea. Let me know what you guys think; I would really appreciate any suggestions @ycombinator @chrisronline |
Indeed, and this is a general problem we have with all our listing pages. What we probably want is server-side pagination but we should implement that for all our listing pages, not just one. [EDIT] When I say "server-side pagination", I mean the following:
Interesting. Does this mean server-side pagination would not work in this case?
If we do this, wouldn't the data in each row of the pipeline listing stop updating every 10s (or whatever interval the user sets)? That would be at odds with the rest of the listing pages in the Stack Monitoring UI, leading to an inconsistent UX, no? Further up in this issue both @pickypg and I speculated that the memory usage has to do with the sparkline-related aggregations. Did you do any investigation along those lines? For instance, did you try to remove just the aggs needed for the sparklines' data and see if that helps? More generally (i.e. disregarding our specific speculation about sparklines for a moment), it would be good to know what the long pole in the tent is on this page, w.r.t. performance. Specifically, which element in the UI is having the most performance impact? Once we've narrowed that down, it might become clearer how to proceed, by perhaps handling that element differently than we do today. |
I don't think anyone ever watches the sparklines anticipating their next tick, or expects them to be 100% accurate. The general assumption with all sparklines (at least from my trading crypto experience) is that they are static. This was my main motivation behind the solution.
True, but I think we can start introducing values that don't get updated/aggregated, similar in spirit to #39308. This can be either performance- or capability-driven. But of course we will still express that somehow, based on the outcome of the mentioned ticket.
I was initially thinking the same, but even if we do go with the "only aggregate what you see" approach this only fixes the issue partially, since their row count can be set to 50.
Wouldn't this make it worse? Since now we'll be making X more requests, multiplied by the row count (every ten seconds). We would also need to do this each time they get a result from the search field. Maybe I'm misunderstanding something?
@ycombinator @pickypg I'm kind of confused as to which memory hiccup we are concerned about? Browser, JVM, or Both? The ticket has a screenshot of JVM Heap chart, so that's what I've been focusing on |
Even for large pages, this would still be far superior to the existing approach that we've stuck ourselves with in the past. On the one hand, we'd still suffer with pages of 50, but that would be dramatically better than unlimited pages.
Most likely not. The problem with the existing request is that it has to hold onto a massive amount of memory and pass that between nodes, until it finally is able to respond to the caller (Kibana). If you "walked" the list via the browser and just requested them on-demand, even in batches, it would be superior to a single massive request -- probably even if you managed to send them all in parallel because it can throw away the memory in pieces (and browsers limit the number of calls, so it couldn't fire 50 at once). The bigger problem is that we would have to be intelligent about the paging so that it was efficient and also fast. We don't want to fire 10 in parallel at a time because that could end up, under defaults, being at least 70 shards getting touched per request so you'd quickly hit search rejections.
It's the downside to having a more dynamic API, but paging also implies that the search field is actually using ES search rather than local search like it does today, which kind of further implies that we'd have to do that anyway.
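The batched, on-demand fetching described above can be sketched as follows; the batch size, the pipeline count, and all names here are illustrative, not from the actual Kibana code:

```python
# Sketch: walk the pipeline list in small batches instead of one massive
# request, so ES can release memory between responses and the browser never
# fires dozens of requests at once. Batch size of 20 is an assumption.
def batches(items, batch_size):
    """Yield successive fixed-size slices of `items`."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# 96 pipelines (the count from the screenshot in this issue) in batches of 20:
pipeline_ids = ["random_%d" % i for i in range(96)]
sizes = [len(b) for b in batches(pipeline_ids, 20)]
# sizes -> [20, 20, 20, 20, 16]
```

Each batch would then be one small search to ES, fired sequentially or with a low cap on parallelism to stay under shard/search-queue limits.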
100% JVM. Taking down the node(s) is problematic to the entire Stack. Taking down the browser would be pretty bad, but it's not the problem that I've been concerned about at all. My assumption is that @ycombinator meant "performance" in the timing sense rather than browser memory. If you removed the sparklines from the pipeline list, for debugging purposes, I can practically guarantee that the memory pressure would disappear with 96 pipelines, or even the 293 I saw recently. Massive aggregations caused by combinatorially large requests are my major concern here. |
Thanks @pickypg. I was traveling so couldn't get to this soon enough but you covered everything. :) |
I forgot to mention, but there is a direct correlation between the pipeline aggregation and JVM memory usage, based on the simple tests I did in the beginning by removing the aggregation.
I'm thinking the per-row aggregation might not be such a bad idea here, especially if we limit the max row count for pagination to 20 (and set the default to 5) |
Also, narrowing down the problem target like this could potentially open up other solutions, like lazily fetching just the sparkline data after the initial page load (we did something along these lines in another performance-related issue not too long ago). In general, I'd like us to be able to characterize the root cause of this problem as narrowly as possible before we start considering any solutions. |
Late to the party, but it's not clear to me which request is causing the performance issue. It'd probably be useful to outline all the requests made from the Kibana server -> ES (which start with an XHR request from the Kibana client -> Kibana server), then comment out each individual request separately and test which one has the biggest impact on performance. It feels like we're jumping the gun a bit on what to actually do here. EDIT: In the case that it might actually only be a single request from the Kibana server -> ES, we should break down the individual aggregations done in that single request and figure out which one is impacting performance the most |
Ran into a new variation of this issue today, where there were only 41 pipelines, but in 7.x the page simply would not load because the default max-buckets limit was exceeded. I couldn't even get the page to load without increasing that limit. |
Added 7.x to the initial issue description. For anyone hitting this issue (from the initial issue description):
It is highly recommended to set up a separate monitoring cluster for production environments, which will avoid the conflict of having production and monitoring workloads on the same cluster. |
Same as #36892? |
@inqueue No, that's for the ES Node listing. It's the same symptom (blocked request because the monitoring listings try to return everything in one pass), but it's for a functionally different reason. |
I was able to test this more granularly. I created about 40 fake pipelines using:

    - pipeline.id: "random_0"
      pipeline.workers: 1
      pipeline.batch.size: 10
      config.string: "input { generator {} } filter { sleep { time => 1 } } output { stdout { codec => dots } }"

Gave it some time to bake in (for Logstash to create in/out events, etc.), and was able to confirm that the aggregation does spike the JVM heap usage in Elasticsearch. I created a new API call with a simple non-aggregated query for polling, e.g.:

GET .monitoring-logstash-6-*,.monitoring-logstash-7-*/_search
{
"query": {
"bool": {
"filter": [{
"term": {
"cluster_uuid": "0Fl90z31QCmpxOY3SCbiyw"
}
}, {
"range": {
"logstash_stats.timestamp": {
"format": "epoch_millis",
"gte": 1562778618444,
"lte": 1562941838109
}
}
}]
}
},
"_source": ["logstash_stats.pipelines"]
}

And saw that the JVM usage went down significantly. My testing:
One thing I also discovered is that simply watching the Monitoring UI contributes to this load. I'm thinking maybe it'll be a lot cheaper to do some of this aggregation on the frontend? Thoughts @ycombinator @pickypg @chrisronline

EDIT: Forgot to mention that narrowing the aggregation to just ids, e.g.:

GET .monitoring-logstash-6-*,.monitoring-logstash-7-*/_search
{
"query": {
"bool": {
"filter": [{
"term": {
"cluster_uuid": "0Fl90z31QCmpxOY3SCbiyw"
}
}, {
"range": {
"logstash_stats.timestamp": {
"format": "epoch_millis",
"gte": 1562778618444,
"lte": 1562941838109
}
}
}]
}
},
"aggs": {
"check": {
"date_histogram": {
"field": "logstash_stats.timestamp",
"interval": "30s"
},
"aggs": {
"pipelines_nested": {
"nested": {
"path": "logstash_stats.pipelines"
},
"aggs": {
"by_pipeline_id": {
"terms": {
"field": "logstash_stats.pipelines.id",
"size": 1000
}
}
}
}
}
}
}
}

Also spiked the memory usage a little bit. I couldn't test it thoroughly, because I eventually ran into the bucket limit again. |
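To put numbers on this, the bucket count for the date_histogram + terms query above can be estimated directly from the timestamps in that query. The 40-pipeline figure comes from the fake-pipeline test earlier in the thread; this is a rough sketch, not a measurement:

```python
# Rough bucket-count estimate for the date_histogram + terms aggregation above.
gte, lte = 1562778618444, 1562941838109  # epoch millis, copied from the query
interval_ms = 30_000                     # "interval": "30s"

# Number of 30-second histogram buckets over the ~45-hour range:
time_buckets = (lte - gte) // interval_ms  # -> 5440

# Each time bucket fans out into one terms bucket per pipeline id, so the
# total grows multiplicatively with the pipeline count:
pipelines = 40  # roughly the number of fake pipelines in the test
total_buckets = time_buckets * pipelines  # -> 217600
```

The multiplication is the key point: doubling the time range or the pipeline count doubles the bucket total, which is why large deployments hit bucket limits and heap pressure.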
@igoristic Yeah, once we start doing arbitrarily sized aggregations, memory usage becomes hard to predict. Looking at your aggregation, your date range is about two days (the timestamps span roughly 45 hours) at a 30s interval. Looking at the edit, there are two things that I think we should consider moving forward for this solution, and three things overall based on your comment. The first is unrelated to the Logstash Pipelines issue here, but related to:
I think it's safe to discuss this problem generically, because whatever you all do to introduce paging here should extend to the other screens. However, beyond that, I think it would be ideal to discuss the nodes listing's own problems separately (specifically, that it aggregates the shards). Relative to the LS pipeline issues:
For a long time, I have thought that we need to completely change how our listings work and I think this is fundamentally what we must do. We should page by taking advantage of ES, only selecting the subset that we want to display, then performing a follow-on aggregation that only applies to those documents if we even have to aggregate. Be aware that your example polling query only fetched the top 10 hits, which isn't 100% comparable to what the existing UI functionally does: it effectively fetches every hit it can. Although this would fix the memory issues that we face (and bucket quantities), this does introduce some problems relative to the existing UX. First, anything aggregated would become unsortable because we wouldn't have all values in memory to know if we were sorting properly. Second, any search / filter boxes would have to dynamically work against the generated ES query rather than the in-memory data because, again, we don't have it all. That's not an easy thing to implement because it has implications with understanding the mappings of the data and what's therefore possible (the EUI search bar examples should help here).
Having just gone through my explanation of the costs associated with bucketing massive amounts, a very easy win for us would be to stop requesting so many buckets. Particularly with the sparkline, which has a lot of dots, this should help tremendously, and we should reduce the number of buckets to something much smaller. We could then extend this approach to the other charts. |
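A sketch of the "fewer, coarser buckets" idea for sparklines; the target bucket count and the interval ladder here are hypothetical, not the actual Kibana logic:

```python
def sparkline_interval(range_ms, target_buckets=25):
    """Pick a bucket interval so a sparkline gets roughly `target_buckets`
    buckets regardless of the selected time range. Illustrative only: the
    target count and the 'nice' interval ladder are assumptions."""
    raw = range_ms / target_buckets
    # Snap up to the next "nice" interval (30s, 1m, 5m, 15m, 1h, 3h, 1d).
    for nice in (30_000, 60_000, 300_000, 900_000, 3_600_000, 10_800_000, 86_400_000):
        if raw <= nice:
            return nice
    return 86_400_000

# A ~45-hour range (the timestamps from the queries above) snaps to 3h
# buckets, i.e. ~15 buckets per sparkline instead of ~5440 at 30s.
interval = sparkline_interval(163_219_665)
```

Because the interval is derived from the range, the bucket count per sparkline stays bounded no matter how wide the time picker is; only the pipeline count still scales the total.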
Upon thorough investigation, I narrowed it down to the following aggregation:

    "aggs": {
      "events_stats": {
        "stats": {
          "field": "logstash_stats.pipelines.events.out"
        }
      },
      "throughput": {
        "bucket_script": {
          "script": "params.max - params.min",
          "buckets_path": {
            "min": "events_stats.min",
            "max": "events_stats.max"
          }
        }
      }
    }

I was still able to derive a comparison just from observing the Stack Monitoring overview page, since it also uses this data. We are actually dealing with two different types of issues here:
I also tested and investigated different possible solutions for both issues:
EDIT: Forgot to mention that, in addition to all these solutions, we can do some of the aggregations on the frontend to take the load off ES.

One thing to note is that paging the hits has no impact on aggregation, so we always get all the items (pipelines). It would be great to also include our own type of paging solution in addition to the solutions described above (since there could be 100 pipelines, but only 5 visible at a time). Would love to hear your guys' opinion: @pickypg @ycombinator @chrisronline @cachedout |
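The "do some of the aggregations on the frontend" idea could look roughly like this. The hit shape is an assumption based on the `_source: ["logstash_stats.pipelines"]` filter used in the polling query earlier, and the helper name is illustrative:

```python
# Sketch: client-side aggregation over raw hits from the non-aggregated
# polling query, instead of asking ES to bucket everything.
from collections import defaultdict

def events_out_by_pipeline(hits):
    """Return the latest cumulative events.out counter per pipeline id,
    assuming each hit carries a logstash_stats.pipelines array."""
    totals = defaultdict(int)
    for hit in hits:
        for p in hit["_source"]["logstash_stats"]["pipelines"]:
            # events.out is a cumulative counter, so the max is the latest value.
            totals[p["id"]] = max(totals[p["id"]], p["events"]["out"])
    return dict(totals)

# Hypothetical hits, shaped like the polling-query response:
hits = [
    {"_source": {"logstash_stats": {"pipelines": [
        {"id": "random_0", "events": {"out": 100}},
        {"id": "random_1", "events": {"out": 50}},
    ]}}},
    {"_source": {"logstash_stats": {"pipelines": [
        {"id": "random_0", "events": {"out": 160}},
    ]}}},
]
# events_out_by_pipeline(hits) -> {"random_0": 160, "random_1": 50}
```

This trades ES heap for client CPU and network payload, so it only makes sense when the raw hits for the visible page are small.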
This is the ideal solution to me, assuming the ES team can fix the pipeline issue that we observed. We get direct control over the number of buckets, removing 100% of the unpredictable nature of it that I'm about to discuss in the second point.
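For reference, the stats + bucket_script aggregation behind that solution derives throughput as max minus min of the cumulative `events.out` counter; a plain-code equivalent (the sample values are made up):

```python
def throughput(events_out_samples):
    """Plain-code equivalent of the stats + bucket_script aggregation:
    throughput over a window = max(counter) - min(counter), since
    events.out is a cumulative counter that only grows."""
    return max(events_out_samples) - min(events_out_samples)

# Hypothetical cumulative events.out readings within one histogram bucket:
samples = [100, 130, 160, 220]
# throughput(samples) -> 120
```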
We actually already try to do this. I'm not sure that the sparklines take advantage of the same logic as the rest of the charts, but we try to set "intelligent" intervals based on the requested time range. The inherent variability here is half of our problem, though, because the number of buckets increases as the range does, until we effectively shift gears to the next higher interval.
I think we want to do this, to some degree (#37246 (comment)). Instead of one-by-one, I think we could still safely do it page-by-page (e.g., 20 to 50 at a time). Note, this means we'd have to do the search via hits, then do a second request that aggregates against those specific IDs.
This would end up generating roughly the same number of buckets, but over a set of requests instead of one. That would be helpful to the GC being able to kick in and save ES, but with
This sounds like just a mechanism for hoping to avoid the issue, but it wouldn't safely work. Two users exploring pipeline issues together could still trigger massive heap usage in parallel. |
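The "search via hits, then aggregate against those specific IDs" flow could be sketched as two request bodies. All names are illustrative, and this ignores the nested mapping of `logstash_stats.pipelines` for brevity:

```python
def page_ids_query(cluster_uuid, page, size=20):
    """Phase 1: a plain hits query that returns only one page of pipeline
    documents, so ES never holds the whole listing in memory."""
    return {
        "query": {"bool": {"filter": [{"term": {"cluster_uuid": cluster_uuid}}]}},
        "from": page * size,          # offset of this page
        "size": size,                 # hits per page
        "_source": ["logstash_stats.pipelines.id"],
    }

def aggs_for_ids(pipeline_ids):
    """Phase 2: a follow-on aggregation restricted to the ids from phase 1,
    so the terms agg can never fan out beyond one page."""
    return {
        "size": 0,
        "query": {"terms": {"logstash_stats.pipelines.id": pipeline_ids}},
        "aggs": {"by_pipeline_id": {"terms": {
            "field": "logstash_stats.pipelines.id",
            "size": len(pipeline_ids),
        }}},
    }

q1 = page_ids_query("0Fl90z31QCmpxOY3SCbiyw", page=2)  # page 2 starts at offset 40
q2 = aggs_for_ids(["random_0", "random_1"])
```

The important property is that the bucket fan-out in phase 2 is bounded by the page size, not by the total number of pipelines in the cluster.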
It's been a while since there's been activity on this issue, and IMO it's a pretty critical one as it keeps coming up. So I wanted to try and summarize the discussion so far and see if we could move forward, at least with a short-term fix.

Long-term fix: I think the consensus here seems to be smarter pagination. That is, delegate pagination to ES and only request enough data for one page at a time. Even in that data, we might want to request just enough to render useful data on a row initially, and then do follow-on aggregations per row to render more data asynchronously.

Short-term fix #1: Use the alternative aggregation approach discussed above; this one is currently blocked on the ES side.

Short-term fix #2: For sparklines used in a listing page (AFAIK the LS Pipeline Listing page is the only one doing this today), use a different mapping of time picker intervals => bucket intervals than the ones we use for other timeseries charts in the Stack Monitoring UI. The idea here is to come up with larger bucket intervals, therefore reducing the # of buckets. I think this could help alleviate the problem to a certain extent, but if the # of pipelines (i.e. # of sparklines) in the listing grows beyond a certain point we will hit the same problem again.

Short-term fix #3: Split requests by hours. This means fewer buckets being requested in each request. Again, similar to #2, this could help alleviate the problem to a certain extent, but if the # of pipelines (i.e. # of sparklines) in the listing grows beyond a certain point we will hit the same problem again.

Short-term fix #4: AFAICT this hasn't been proposed yet, but what about providing users with a checkbox on the LS Pipeline Listing page to show/hide the sparklines? There are some UX details to work out here, but the general idea is to give users an escape hatch to avoid the OOM by avoiding requesting data for the sparklines altogether.

I think we should vote on either short-term fix #2, #3, or #4 and make progress on that, just to move this issue forward. Meanwhile we should continue to follow up with the ES folks on the blocker for short-term fix #1 (if we still need it) and also work on the long-term fix. |
I like option 2. I don't think that granularity in the history is as critical as it might be in a larger time-series graph. |
It's unclear to me why we can't start moving in this direction (beyond scheduling of course), especially for the Pipeline page as the first pass. This would ultimately fix the issue across the board, but also serve as the basis for every other part of Stack Monitoring.
I like both of these options, but I am not sure that either of them works well enough for the larger users out there to warrant the effort versus the long-term fix, and we wouldn't know how much they helped users with 40+ LS pipelines without implementing them.
I am not sure that this would work since the page would sometimes never actually load to give them the opportunity.
100% this. All of Kibana should be moving in that direction. |
Yeah, I think it's scheduling more than anything else, given how we haven't been able to attend to this issue for ~2 months. If we can prioritize this fix over other commitments, I agree we should focus on this over any of the short term options mentioned here.
True, if the default is to show the sparklines. We could flip the default to hide the sparklines. Again, keeping in mind that this is meant to be more of a stopgap. |
I think this is a perfectly reasonable course of action for a short-term fix. |
Long-term fix for this particular page: #46587 |
I'm going to close this out, as #46587 should fix this for effectively everyone. Please reopen if you see the behavior persist after the fix is released |
Kibana version:
Observed in 6.5.x, 6.7.x, and 7.x
Elasticsearch version:
Matching.
Description of the problem including expected versus actual behavior:
Loading the Logstash Pipeline listing view in the Stack Monitoring UI with a relatively large number of pipelines (in the screenshot there are 96) can trigger a very large amount of memory utilization across Elasticsearch and result in an OOM if you are unlucky.
Steps to reproduce:
Each heap usage that spiked above 75% was from me allowing the listing to load (otherwise I had it paused to avoid heap usage).