Aggregations: Memory-bound terms #6697
Comments
Hey there, with such an algorithm the top buckets are determined once all values have been processed, right? Therefore, the sub (bucket-specific & nested) aggregations can only be computed once all documents have been collected, in order to know the final list of buckets, isn't it? Because if I'm right, sub aggregations would have to be collected for each collected document (method …). Instead, in case of count-based (or terms-based) sorting, any chance the sub aggregations could be collected in another pass?
I think sub aggregations could be collected on the fly. The algorithm works on a fixed set of counters. It might introduce accuracy issues for the sub aggregations, since counters can be relabeled during collection, but this is the general trade-off of this algorithm: trading accuracy for memory, so I think it's fine? Moreover, if the distribution of the data is skewed enough and if we oversize the number of counters, relabeling should be rare in practice.
Makes sense, we're on the same page! Thank you @jpountz
Just wanted to let you know that I've released a plugin embedding such an algorithm, targeting ES 1.2+. Our first tests are pretty conclusive 👍 It's still in alpha, but I would love any comment/review/pull-request from ES gurus reading this comment :) Plugin is here: elasticsearch-topk-plugin
It looks good in general. FYI, you would get better performance by specializing on numerics/strings like the terms aggregation does. On numerics this would only help a bit, by saving the conversion to BytesRef and allowing more efficient data structures, but on strings it could have a big impact.

The way field data works on strings is that there is a first level of indirection that gives ordinals given a doc ID, and a second level of indirection that gives values given an ordinal. Ordinals have the interesting property of being dense (so that you can use them as indices in arrays) and sorted (useful for comparisons; this would help if you want to sort buckets of this top-k agg by term). Additionally, they are fast to retrieve. On the other hand, retrieving values given an ordinal might be slow, especially if you are using doc values. That's why our aggregations try to use ordinals whenever possible, as this brings important speedups.
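To make the two-level indirection concrete, here is a minimal sketch of it in Java. This is an illustrative, array-backed model with hypothetical names (`TERMS_BY_ORDINAL`, `countByOrdinal`), not the actual Elasticsearch field-data API:

```java
// Sketch of the two-level indirection: doc ID -> ordinal (level 1),
// ordinal -> term (level 2). Hypothetical names, not the real ES API.
public class OrdinalsSketch {
    // Level 2: the sorted term dictionary; the array index is the ordinal.
    // Resolving ordinal -> term is the potentially slow lookup.
    static final String[] TERMS_BY_ORDINAL = {"apache", "elasticsearch", "lucene"};

    // Aggregate on ordinals only: ordinals are dense, so counts fit in a
    // plain array, with no hashing or BytesRef conversion per document.
    static int[] countByOrdinal(int[] docToOrdinal, int numOrdinals) {
        int[] counts = new int[numOrdinals];
        for (int ord : docToOrdinal) {
            counts[ord]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Level 1: per-document ordinals (doc ID -> ordinal), fast to retrieve.
        int[] docToOrdinal = {1, 2, 1, 0, 1};
        int[] counts = countByOrdinal(docToOrdinal, TERMS_BY_ORDINAL.length);
        // Only the final buckets need the slow ordinal -> term resolution.
        for (int ord = 0; ord < counts.length; ord++) {
            System.out.println(TERMS_BY_ORDINAL[ord] + ": " + counts[ord]);
        }
    }
}
```

Because ordinals are sorted consistently with the terms, comparing two ordinals compares the underlying terms, which is what would make sorting top-k buckets by term cheap.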
Thanks for the input @jpountz, I'm definitely planning to use ordinals, but I wanted to get a first working version out.
@jpountz is this still something you plan on pursuing?
Yes. I think it can be useful for pagination in particular, when sorting by term. I'll close for now and reopen when I have more concrete plans.
Our terms aggregation implementations use an amount of memory that is linear in the cardinality of the value source they run on. Thanks to global ordinals, we only require 4 bytes per unique term to track the count, so even with a field that has a cardinality of 10M, that would only be 40MB of memory.

However, things get worse when using sub aggregations, especially memory-intensive ones such as `percentiles`, `cardinality`, `top_hits` or bucket aggregations. Ideally, we would want memory usage to depend on `size` instead of the cardinality of the value source.

I have been looking recently at the Space-Saving algorithm described in section 3.1 of "Efficient Computation of Frequent and Top-k Elements in Data Streams". Although it is described for computing top buckets based on counts, I think it would be possible to use it when sorting by term or sub-aggregation. But you don't get optimized memory usage for free, so it would have two main drawbacks compared to the current implementation:

1. accuracy degrades when the distribution of values is not skewed,
2. since counters can be relabeled during collection, buckets may carry counts (and aggregated data) that actually belong to other terms.

I think 1 is fine, since top terms tend to make more sense on skewed distributions anyway, e.g. most frequent terms in natural-language text, or top IP addresses that hit a website. 2 is more inconvenient, especially if there are other aggregations under this terms aggregation. One piece of good news is that we could know which buckets might have aggregated data that is not theirs (they would be those whose term ordinal has been updated during collection), so we could expose that in the response if necessary. On the other hand, if there are no sub aggregations, it would be more acceptable to use this implementation since counts are inaccurate anyway. And it would also work nicely with #6696 since the maximum error can be estimated.