Aggregations: Memory-bound terms #6697
Comments
Hey there, with such an algorithm the top buckets are determined once all values have been processed, right? Therefore, the sub (bucket-specific & nested) aggregations can only be computed once all documents have been collected, in order to know the final list of buckets, isn't it? Because if I'm right, sub aggregations would have to be collected for each collected document (method …). Instead, in case of count-based (or terms-based) sorting, any chance the sub aggregations could be collected in another pass?
I think sub aggregations could be collected on the fly. The algorithm works on a fixed set of counters. It might introduce accuracy issues for the sub aggregations, since counters can be relabeled during collection, but this is the general trade-off of this algorithm: trading accuracy for memory, so I think it's fine? Moreover, if the distribution of the data is skewed enough and if we oversize the number of counters, relabeling should be rare in practice.
Makes sense, we're on the same page! Thank you @jpountz
Just wanted to let you know that I've released a plugin embedding such an algorithm, targeting ES 1.2+. Our first tests are pretty conclusive 👍 It's still in alpha, but I would love any comment/review/pull-request from ES gurus reading this comment :) Plugin is here: elasticsearch-topk-plugin
It looks good in general. FYI, you would get better performance by specializing on numerics/strings like the terms aggregation does. On numerics this would only help a bit, by saving the conversion to BytesRef and allowing more efficient data structures, but on strings it could have a big impact.

The way field data works on strings is that there is a first level of indirection that gives ordinals given a doc ID, and a second level of indirection that gives values given an ordinal. Ordinals have the interesting property of being dense (so that you can use them as indices in arrays) and sorted (useful for comparisons; this would help if you want to sort buckets of this top-k agg by term). Additionally, they are fast to retrieve. On the other hand, retrieving values given an ordinal might be slow, especially if you are using doc values. That's why our aggregations try to use ordinals whenever possible, as this brings important speedups.
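To make the two-level indirection concrete, here is a minimal sketch of it in Java. This is an illustrative, array-backed model with hypothetical names (`TERMS_BY_ORDINAL`, `countByOrdinal`), not the actual Elasticsearch field-data API:

```java
// Sketch of the two-level indirection: doc ID -> ordinal (level 1),
// ordinal -> term (level 2). Hypothetical names, not the real ES API.
public class OrdinalsSketch {
    // Level 2: the sorted term dictionary; the array index is the ordinal.
    // Resolving ordinal -> term is the potentially slow lookup.
    static final String[] TERMS_BY_ORDINAL = {"apache", "elasticsearch", "lucene"};

    // Aggregate on ordinals only: ordinals are dense, so counts fit in a
    // plain array, with no hashing or BytesRef conversion per document.
    static int[] countByOrdinal(int[] docToOrdinal, int numOrdinals) {
        int[] counts = new int[numOrdinals];
        for (int ord : docToOrdinal) {
            counts[ord]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Level 1: per-document ordinals (doc ID -> ordinal), fast to retrieve.
        int[] docToOrdinal = {1, 2, 1, 0, 1};
        int[] counts = countByOrdinal(docToOrdinal, TERMS_BY_ORDINAL.length);
        // Only the final buckets need the slow ordinal -> term resolution.
        for (int ord = 0; ord < counts.length; ord++) {
            System.out.println(TERMS_BY_ORDINAL[ord] + ": " + counts[ord]);
        }
    }
}
```

Because ordinals are sorted consistently with the terms, comparing two ordinals compares the underlying terms, which is what would make sorting top-k buckets by term cheap.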
Thanks for the input @jpountz, I'm definitely planning to use ordinals, but I wanted to get a first working version out.
@jpountz is this still something you plan on pursuing?
Yes. I think it can be useful for pagination in particular, when sorting by term. I'll close for now and reopen when I have more concrete plans.
Our terms aggregation implementations use an amount of memory that is linear in the cardinality of the value source they run on. Thanks to global ordinals, we only require 4 bytes per unique term to track the count, so even with a field that has a cardinality of 10M, that would only be 40MB of memory.

However, things get worse when using sub aggregations, especially memory-intensive ones such as `percentiles`, `cardinality`, `top_hits` or bucket aggregations. Ideally, we would want memory usage to depend on `size` instead of the cardinality of the value source.

I have been looking recently at the Space-Saving algorithm described in section 3.1 of "Efficient Computation of Frequent and Top-k Elements in Data Streams". Although it is described for computing top buckets based on counts, I think it would be possible to use it when sorting by term or sub-aggregation. But you don't get optimized memory usage for free, so it would have two main drawbacks compared to the current implementation:

1. accuracy degrades when the distribution of values is not skewed,
2. since counters can be relabeled during collection, buckets may carry counts (and aggregated data) that actually belong to other terms.

I think 1 is fine, since top terms tend to make more sense on skewed distributions anyway, e.g. most frequent terms in natural-language text, or top IP addresses that hit a website. 2 is more inconvenient, especially if there are other aggregations under this terms aggregation. One piece of good news is that we could know which buckets might have aggregated data that is not theirs (they would be those whose term ordinal has been updated during collection), so we could expose that in the response if necessary. On the other hand, if there are no sub aggregations, it would be more acceptable to use this implementation since counts are inaccurate anyway. And it would also work nicely with #6696 since the maximum error can be estimated.