Aggregations: Memory-bound terms #6697

Closed
jpountz opened this issue Jul 2, 2014 · 8 comments

jpountz commented Jul 2, 2014

Our terms aggregation implementations use an amount of memory that is linear in the cardinality of the value source they run on. Thanks to global ordinals, we only require 4 bytes per unique term to track the count, so even a field with a cardinality of 10M would only need 40MB of memory.

However, things get worse when using sub aggregations, especially memory-intensive ones such as percentiles, cardinality, top_hits or bucket aggregations. Ideally we would want memory usage to depend on the requested size instead of on the cardinality of the value source.
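
For context, a minimal sketch (illustrative only, not the actual Elasticsearch code) of the count-per-global-ordinal approach described above, where memory is proportional to the field's cardinality rather than to the requested size:

```java
// Illustrative sketch, not Elasticsearch code: with global ordinals the terms
// aggregation can keep one 4-byte counter per unique term, so memory grows with
// the cardinality of the field rather than with the requested size.
class PerGlobalOrdinalCounts {
    private final int[] counts;

    PerGlobalOrdinalCounts(int cardinality) {
        // e.g. 10M unique terms -> 10M * 4 bytes = ~40MB
        this.counts = new int[cardinality];
    }

    void collect(int globalOrd) {
        counts[globalOrd]++;
    }
}
```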

I have been looking recently at the Space-Saving algorithm described in section 3.1 of Efficient Computation of Frequent and Top-k Elements in Data Streams. Although it is described for computing top buckets based on counts, I think it would be possible to use it when sorting by term or by sub-aggregation. But you don't get optimized memory usage for free, so it would have two main drawbacks compared to the current implementation:

  1. it would only work correctly on skewed distributions
  2. there is no clear separation between buckets, and some buckets might aggregate documents from other buckets

I think 1 is fine since top terms tend to make more sense on skewed distributions anyway, e.g. the most frequent terms in natural-language text, or the top IP addresses that hit a website.

2 is more inconvenient, especially if there are other aggregations under this terms aggregation. One piece of good news is that we could know which buckets might have aggregated data that is not theirs (they would be those whose term ordinal has been updated during collection), so we could report that as part of the response if necessary. On the other hand, if there are no sub aggregations, it would be more acceptable to use this implementation since counts are inaccurate anyway. And it would also work nicely with #6696 since the maximum error can be estimated.
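
For reference, here is a minimal sketch of the Space-Saving update step, assuming a fixed budget of m counters. The names are illustrative and this is not Elasticsearch code; the paper's Stream-Summary structure finds the minimum counter in constant time, whereas this sketch just scans for it:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal Space-Saving sketch: at most m counters are kept, and when the budget is
// exhausted the counter with the smallest count is recycled for the new term.
class SpaceSaving {
    static final class Counter {
        String term;
        long count;
        long error;   // upper bound on how much of `count` may belong to evicted terms
    }

    private final int m;
    private final Map<String, Counter> monitored = new HashMap<>();

    SpaceSaving(int m) {
        this.m = m;
    }

    void offer(String term) {
        Counter c = monitored.get(term);
        if (c != null) {
            c.count++;                 // already monitored: just count it
            return;
        }
        if (monitored.size() < m) {
            c = new Counter();         // free slot: start monitoring the term
            c.term = term;
            c.count = 1;
            monitored.put(term, c);
            return;
        }
        // Budget exhausted: recycle the counter with the smallest count.
        Counter min = null;
        for (Counter candidate : monitored.values()) {
            if (min == null || candidate.count < min.count) {
                min = candidate;
            }
        }
        monitored.remove(min.term);
        min.error = min.count;         // everything counted so far may belong to the evicted term
        min.count++;                   // the new term inherits the old count, plus one
        min.term = term;
        monitored.put(term, min);
    }
}
```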

@jpountz jpountz self-assigned this Jul 2, 2014

redox commented Jul 21, 2014

Hey there,

With such an algorithm, the top buckets are determined only once all values have been processed, right? Therefore the sub (bucket-specific & nested) aggregations can only be computed once all documents have been collected, in order to know the final list of buckets, can't they? Because if I'm right, for each collected document (the collect method) we need to call collectableSubAggregators.collect with the associated bucketOrd, which will be unknown at that step. Am I missing something?

Instead, in the case of count-based (or term-based) sorting, is there any chance the sub aggregations could be collected in another pass?

jpountz commented Jul 21, 2014

Therefore, the sub (bucket-specific & nested) aggregations can only be computed once all documents have been collected

I think sub aggregations could be collected on the fly. The algorithm works on a fixed set of m counters, so we could associate a bucket ordinal (between 0 and m-1) with each of these counters and use them to collect the sub aggregations.

It might introduce accuracy issues for the sub aggregations since counters can be relabeled during collection, but that is the general trade-off of this algorithm: trading accuracy for memory, so I think it's fine? Moreover, if the distribution of the data is skewed enough and if we oversize m a bit compared to the number of top terms that we are interested in, the likelihood of the top terms being affected by this issue should remain low.
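
A sketch of that wiring, with hypothetical names (not the actual aggregator framework API): each counter owns a stable bucket ordinal in [0, m), the sub aggregators are driven with that ordinal, and a relabel reuses it, which is exactly where the sub-aggregation inaccuracy comes from:

```java
// Illustrative sketch only: a Space-Saving counter that carries a stable bucket ordinal.
final class BucketedCounter {
    String term;            // relabeled when the counter is recycled for a new term
    long count;
    long error;
    final int bucketOrd;    // fixed in [0, m) for the life of the counter

    BucketedCounter(int bucketOrd) {
        this.bucketOrd = bucketOrd;
    }

    // Space-Saving relabel: the counter is recycled for a new term but keeps its bucket
    // ordinal; the caller then increments `count` for the new term.
    void relabel(String newTerm) {
        this.error = this.count;   // everything counted so far may belong to the evicted term
        this.term = newTerm;
    }
}

// Hypothetical stand-in for the sub-aggregator collection hook.
interface SubAggregatorCollector {
    void collect(int doc, int bucketOrd);
}

final class OnTheFlySubCollection {
    // `counter` is whatever the Space-Saving update resolved for this document's term:
    // either the term's existing counter or the relabeled minimum counter.
    static void collectSubAggs(int doc, BucketedCounter counter, SubAggregatorCollector sub) {
        counter.count++;
        // Sub-aggregation state is keyed by the bucket ordinal; after a relabel it may
        // still contain documents collected for the evicted term.
        sub.collect(doc, counter.bucketOrd);
    }
}
```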

redox commented Jul 21, 2014

Makes sense, we're on the same page! Thank you @jpountz

redox commented Jul 23, 2014

Just wanted to let you know that I've released a plugin embedding this algorithm, targeting ES 1.2+. Our first tests are pretty conclusive 👍

It's still an alpha version, but I would love any comments/reviews/pull requests from the ES gurus reading this comment :)

Plugin is here: elasticsearch-topk-plugin

jpountz commented Jul 23, 2014

It looks good in general. FYI, you would get better performance by specializing on numerics/strings like the terms aggregation does. On numerics this would only help a bit, by saving the conversion to BytesRef and allowing the use of more efficient data structures, but on strings this could have a big impact.

The way field data works on strings is that there is a first level of indirection that gives ordinals given a doc ID, and a second level of indirection that gives the values given an ordinal. Ordinals have the interesting property of being dense (so that you can use them as indices in arrays) and sorted (useful for comparisons, which would help if you want to sort buckets of this top-k agg by term). Additionally, they are fast to retrieve. On the other hand, retrieving values given an ordinal might be slow, especially if you are using doc values. That's why our aggregations try to use ordinals whenever possible, as this brings important speedups.
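
To illustrate the two levels of indirection with a simplified stand-in interface (not the real Lucene or field-data API): ordinals are dense and sorted, so counters can be indexed directly by ordinal, and the potentially slow ordinal-to-value lookup only happens for the few buckets that end up in the response:

```java
import org.apache.lucene.util.BytesRef;

// Simplified stand-in for the two-level indirection described above; not the real API.
interface StringFieldData {
    long maxOrd();                  // number of unique terms
    long ordForDoc(int docId);      // level 1: doc ID -> dense, sorted ordinal (cheap to retrieve)
    BytesRef valueForOrd(long ord); // level 2: ordinal -> term bytes (potentially slow, e.g. doc values)
}

class OrdinalCounting {
    // Counting on ordinals avoids materializing a BytesRef per document (assumes a
    // single-valued field); valueForOrd is only needed for the buckets that are returned.
    static long[] countByOrdinal(StringFieldData fieldData, int maxDoc) {
        long[] counts = new long[(int) fieldData.maxOrd()];
        for (int doc = 0; doc < maxDoc; doc++) {
            counts[(int) fieldData.ordForDoc(doc)]++;
        }
        return counts;
    }
}
```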

redox commented Jul 23, 2014

Thanks for the input @jpountz, I'm definitely planning to use the ordinals, but I wanted to get a working version out first.

@clintongormley

@jpountz is this still something you plan on pursuing?

jpountz commented Nov 23, 2015

Yes. I think it can be useful for pagination in particular, when sorting by term. I'll close for now and reopen when I have more concrete plans.

@jpountz jpountz closed this as completed Nov 23, 2015