
Adds a new auto-interval date histogram #28993

Merged: 33 commits, Jul 13, 2018

Conversation

@colings86 (Contributor) commented Mar 12, 2018

This change adds a new type of histogram aggregation called `auto_date_histogram`, where you specify the target number of buckets you require and it finds an appropriate interval for the returned buckets. The aggregation works by first collecting documents into buckets at a one-second interval; once it has created more than the target number of buckets, it merges them into minute-interval buckets and continues collecting until it reaches the target number of buckets again. It keeps merging buckets whenever it exceeds the target until either collection is finished or the highest interval (currently years) is reached. A similar process happens at reduce time.

This aggregation intentionally does not support `min_doc_count`, `offset` and `extended_bounds`, to keep the already complex logic from becoming more complex. The aggregation accepts sub-aggregations but always operates in `breadth_first` mode, deferring the computation of sub-aggregations until the final buckets from the shard are known. `min_doc_count` is effectively hard-coded to zero, meaning that we insert empty buckets where necessary.

This is also the first aggregation to merge buckets at collection time, and having it opens the door to other aggregations that merge buckets at collection time.
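For concreteness, a request using the new aggregation might look like the sketch below. This is a hedged illustration, not taken from the PR itself: it assumes the target bucket count parameter is named `buckets`, uses the Python `elasticsearch` client, and the index and field names are hypothetical.

```python
# Hedged sketch of requesting the new aggregation via elasticsearch-py.
# The "buckets" parameter carries the target bucket count described above;
# the index and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(index="sales", body={
    "size": 0,
    "aggs": {
        "sales_over_time": {
            "auto_date_histogram": {
                "field": "date",   # the date field to bucket on
                "buckets": 10      # target number of buckets, not an exact count
            }
        }
    }
})

# The aggregation chooses the interval itself; each bucket reports its key.
for bucket in response["aggregations"]["sales_over_time"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```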

Things still to do:

  • Write documentation on the new aggregation and its parameters
  • Write more tests, specifically around having sub-aggregations and checking they are merged correctly in the reduce step
  • Investigate whether we could collect 10 * target buckets on the shards and then use the extra buckets to create buckets representing multiples of the interval, for the cases where one interval gives too many buckets but the next interval up gives only one (or a few). For example, if the target is 10 buckets and we end up with 30 minute-level buckets, increasing the rounding gives 1 hour-level bucket; instead we could merge every 3 minute-level buckets to get 10 three-minute buckets (see the sketch after this list). This will probably be moved out into a separate issue to be tackled once this first pass is merged
  • Add an equivalent auto_histogram aggregation to work on numeric (rather than date) fields - this should be moved into a separate issue
  • Fail the request at parsing time if the number of buckets requested is higher than soft_limit_value / m, where m is the largest innerInterval across all rounding intervals (the largest number of buckets we would merge into one bucket in any rounding, currently 30)
  • Add a test to ensure that if a user requests more than 10k buckets, we return an error.
  • Fix doc tests
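To make the inner-interval idea in the third bullet concrete, here is a minimal sketch (plain Python; the function name and structure are illustrative, not the PR's code) of merging runs of adjacent buckets instead of jumping straight to the next coarser rounding:

```python
import math

def merge_adjacent(doc_counts: list[int], target: int) -> list[int]:
    """Merge every k adjacent doc counts so that len(result) <= target."""
    k = math.ceil(len(doc_counts) / target)
    return [sum(doc_counts[i:i + k]) for i in range(0, len(doc_counts), k)]

# 30 one-minute buckets with a target of 10: merging every 3 adjacent buckets
# yields 10 three-minute buckets instead of collapsing to a single 1-hour bucket.
minute_buckets = [1] * 30
print(len(merge_adjacent(minute_buckets, 10)))  # -> 10
```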

Closes #9572

@elasticmachine (Collaborator)

Pinging @elastic/es-search-aggs

colings86 requested a review from jpountz March 15, 2018 13:20
@colings86 (Contributor, Author)

@jpountz there are tests that still need to be added to this PR, and @pcsanwald is looking into benchmarking the aggregation at the moment to make sure the time performance is OK, but I think it's worth you taking a look now so that if there are any major things to change we can fix them now.

@jpountz (Contributor) left a comment

I had a quick look and think the general approach is good. It would help to better document how the target number of buckets is interpreted and how far from it the actual number of buckets may end up.

jdconrad pushed a commit to jdconrad/elasticsearch that referenced this pull request Jul 16, 2018
pcsanwald pushed a commit to pcsanwald/elasticsearch that referenced this pull request Jul 16, 2018
* Adds a new auto-interval date histogram


* Adds documentation

* Added sub aggregator test

* Fixes failing docs test

* Brings branch up to date with master changes

* trying to get tests to pass again

* Fixes multiBucketConsumer accounting

* Collects more buckets than needed on shards

This gives us more options at reduce time in terms of how we do the
final merge of the buckets to produce the final result

* Revert "Collects more buckets than needed on shards"

This reverts commit 993c782.

* Adds ability to merge within a rounding

* Fixes non-timezone doc test failure

* Fix time zone tests

* iterates on tests

* Adds test case and documentation changes

Added some notes in the documentation about the intervals that can be
returned.

Also added a test case that utilises the merging of consecutive buckets

* Fixes performance bug

The bug meant that getAppropriateRounding took a huge amount of time
if the range of the data was large but also sparsely populated. In
these situations the rounding would be very low, so iterating through
the rounding values from the min key to the max key took a long time
(~120 seconds in one test).

The solution is to add a rough estimate first which chooses the
rounding based just on the long values of the min and max keys alone,
but selects the rounding one lower than the one it thinks is
appropriate, so the accurate method can choose the final rounding taking
into account the fact that intervals are not always fixed length.

The commit also adds more tests.

(A rough sketch of this estimate-then-refine idea appears after this
commit list.)

* Changes to only do complex reduction on final reduce

* merge latest with master

* correct tests and add a new test case for 10k buckets

* refactor to perform bucket number check in innerBuild

* correctly derive bucket setting, update tests to increase bucket threshold

* fix checkstyle

* address code review comments

* add documentation for default buckets

* fix typo
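The "Fixes performance bug" commit above describes an estimate-then-refine approach for picking the rounding. The sketch below is a rough illustration of that idea only, assuming millisecond keys and an illustrative interval table; the constant and function names are not the PR's actual Rounding machinery.

```python
# Rough sketch only: estimate a rounding from the raw min/max keys, then back
# off one step so an accurate pass (which knows calendar intervals vary in
# length) can make the final choice. Interval lengths are approximations.
ROUGH_INTERVAL_MILLIS = [
    1_000,           # second
    60_000,          # minute
    3_600_000,       # hour
    86_400_000,      # day
    2_592_000_000,   # ~month
    31_536_000_000,  # ~year
]

def rough_rounding_index(min_key: int, max_key: int, target_buckets: int) -> int:
    span = max_key - min_key
    idx = 0
    # Coarsen until the span fits into roughly the target number of buckets.
    while (idx < len(ROUGH_INTERVAL_MILLIS) - 1
           and span / ROUGH_INTERVAL_MILLIS[idx] > target_buckets):
        idx += 1
    # Return one step finer than the estimate; the accurate walk from min key
    # to max key then only has to inspect a couple of candidate roundings
    # instead of iterating bucket by bucket from the finest rounding.
    return max(idx - 1, 0)
```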
martijnvg added a commit that referenced this pull request Jul 16, 2018
* es/6.x:
  Use correct formatting for links (#29460)
  Revert "Adds a new auto-interval date histogram (#28993)"
  Revert "fix typo"
  fix typo
  Adds a new auto-interval date histogram (#28993)
  [Rollup] Replace RollupIT with a ESRestTestCase version (#31977)
  [Rollup] Fix duplicate field names in test (#32075)
  [Tests] Fix failure due to changes exception message (#32036)
  [Test] Mute MlJobIT#testDeleteJobAfterMissingAliases
  Replace Ingest ScriptContext with Custom Interface (#32003) (#32060)
  Cleanup Duplication in `PainlessScriptEngine` (#31991) (#32061)
  HLRC: Add xpack usage api (#31975)
  Clean Up Snapshot Create Rest API (#31779)
pcsanwald added v6.5.0 and removed v6.4.0 labels Aug 9, 2018
@mrec commented Aug 16, 2018

Apologies for the necropost, but I'm increasingly unsure that I've understood the implications of "Fail the request at parsing time if the number of buckets requested is higher than soft_limit_value / m", mostly because I can't see where soft_limit_value comes from. Is this a global ES config property?

Also, the agg docs don't make it clear what happens if the requested bucket count can't be met even with 100-year intervals - I assume an error again, although that feels a bit harsh since it's only partially under the caller's control - or whether the 7-day and 3-month roundings will align to natural (calendar) week/quarter boundaries.

@colings86 (Contributor, Author)
@mrec The soft_limit_value refers to the existing soft limit that applies across all aggregations and limits the number of buckets that can be created by an aggregation request. You can see the setting defined in the code here.

The idea of the soft limits is that we have controls that aim to stop a single user performing a request that would destabilise or be otherwise detrimental to the cluster, or would adversely affect other users trying to perform requests on the cluster. Most of these soft limits end up effectively limiting the amount of memory a user's request can use in various ways, but some do other things.

The idea with "Fail the request at parsing time if the number of buckets requested is higher than soft_limit_value / m" is that if, at the point we parse the request, we can already determine that we are going to need to produce more buckets than this particular soft limit allows, there is no point in executing the request, since during execution it will definitely trip the soft limit and fail. So instead we fail the request at parsing time.

That's not to say that the soft limit won't be tripped if the request passes the soft_limit_value / m test. Other factors, like the number of buckets any sub-aggregations create, could still make us hit the soft limit, but the idea is to catch the case where we already know at parse time that the request won't work.

If the request cannot be satisfied even with 100-year intervals then the request will fail on the soft limit, as would any other aggregation request. I'd like to note two things here though:

  1. As is explained in the error response upon hitting the soft limit, an admin can modify the value of the soft limit if they wish. This is an admin choice (someone who can control cluster settings) because it's something that needs to be considered with the health of the whole cluster and all its users in mind.
  2. The default soft limit for the number of buckets is 10,000, so in order to trip the soft_limit_value / m test you would need to request more than 10,000 / 30 ≈ 333 buckets in this aggregation (m is 30 because we take the maximum innerInterval of all the intervals except the highest one). So, assuming the data returned by the query would require 100-year buckets, you are still able to aggregate a range of roughly 33,000 years, which to me seems very reasonable for a default. (A sketch of this check follows below.)
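As a worked sketch of that parse-time check: the 10,000 default and the innerInterval of 30 are taken from the discussion above, while the function name and error wording are illustrative assumptions rather than the PR's actual innerBuild code.

```python
MAX_INNER_INTERVAL = 30  # largest number of buckets merged into one at any rounding

def validate_target_buckets(requested_buckets: int, soft_limit: int = 10_000) -> None:
    """Reject at parse time any request that is guaranteed to trip the soft limit."""
    max_allowed = soft_limit // MAX_INNER_INTERVAL  # 10_000 // 30 == 333
    if requested_buckets > max_allowed:
        raise ValueError(
            f"auto_date_histogram requested {requested_buckets} buckets, but with a "
            f"bucket soft limit of {soft_limit} at most {max_allowed} may be requested"
        )

validate_target_buckets(333)    # passes
# validate_target_buckets(334)  # would fail here, before any collection happens
```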

@mrec commented Aug 17, 2018

@colings86 I think you're confirming my worries - a maximum range of 33,000 years is more than generous, but I think a (default) maximum buckets of 333 is too low. Particularly since it's not documented, it's not the case for other bucketed aggs and the following "if a user requests more than 10k buckets, we return an error" kind of implies that that's the limit.

@colings86 (Contributor, Author)
> I think a (default) maximum buckets of 333 is too low

I don't agree here. Given that the final interval is out of the user's hands, I don't think this aggregation is useful for analytics jobs which consume a lot of buckets (since they would almost certainly want to determine the interval of the buckets themselves), but rather for visualising data to the user (e.g. on a bar chart). Given this, I find it hard to believe that a visualisation could present the information in a way that a user could consume if it's rendering hundreds of buckets. Is your use case intending to present the output of auto_date_histogram to the user through a visualisation, or to an analytics job?

> Particularly since it's not documented, it's not the case for other bucketed aggs and the following "if a user requests more than 10k buckets, we return an error" kind of implies that that's the limit.

We do document the soft limit here. Note that the documentation says it's disabled by default at the moment, but it will be set to a default of 10,000 in 7.0. The optimisation we added in this aggregation doesn't change the behaviour or lower the limit; it just catches the violation of the soft limit earlier, before we start executing the request. As I mentioned before, we feel that the soft limits are important tools for cluster admins, adding confidence that a single request is not going to adversely affect the cluster or other users' requests, and the limits can always be adjusted if the admin determines it's appropriate to increase them, in which case the decision can be made with the knowledge that the cluster is safe to handle these more expensive requests.

@mrec commented Aug 17, 2018

Sorry, I think that came across as more argumentative than intended. Our main expected use cases are around visualizations for screening, yes, and examples I've seen of "the sort of thing we want" include pixel-per-bar charts around 400px wide. Whether that resolution is actually needed is open to debate; I'll talk to our PM.

I do think it'd be worth calling out the 333 limit explicitly in the agg docs - the connection between that and the documented 10k limit is not obvious, and most users won't be reading this PR thread.

@colings86 (Contributor, Author)
No offence taken at all 😄

> I do think it'd be worth calling out the 333 limit explicitly in the agg docs - the connection between that and the documented 10k limit is not obvious, and most users won't be reading this PR thread.

You have convinced me on this. I think we should point this out in the documentation. Thanks for persevering with me.

@colings86 (Contributor, Author)
@mrec I raised #32950 for the documentation issue

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Dec 18, 2018
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Jan 2, 2019
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Jan 24, 2019
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Jan 25, 2019
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Jan 25, 2019
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Jan 28, 2019
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Mar 29, 2019