
Adds a new auto-interval date histogram #28993

Merged: 33 commits, Jul 13, 2018

Conversation

@colings86 (Contributor) commented Mar 12, 2018

This change adds a new type of histogram aggregation called `auto_date_histogram`, where you specify the target number of buckets you require and it finds an appropriate interval for the returned buckets. The aggregation works by first collecting documents into buckets at a one-second interval; once it has created more than the target number of buckets, it merges them into minute-interval buckets and continues collecting until it reaches the target number of buckets again. It keeps merging buckets whenever it exceeds the target until either collection is finished or the highest interval (currently years) is reached. A similar process happens at reduce time.

This aggregation intentionally does not support `min_doc_count`, `offset` and `extended_bounds`, to keep the already complex logic from becoming more complex. The aggregation accepts sub-aggregations but always operates in `breadth_first` mode, deferring the computation of sub-aggregations until the final buckets from the shard are known. `min_doc_count` is effectively hard-coded to zero, meaning that we insert empty buckets where necessary.

This is also the first aggregation to merge buckets at collection time, and having it opens the door to other aggregations that merge buckets at collection time.
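For concreteness, a request using the new aggregation might look like the sketch below. This is a hedged illustration, not taken from the PR itself: it assumes the target bucket count parameter is named `buckets`, uses the Python `elasticsearch` client, and the index and field names are hypothetical.

```python
# Hedged sketch of requesting the new aggregation via elasticsearch-py.
# The "buckets" parameter carries the target bucket count described above;
# the index and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(index="sales", body={
    "size": 0,
    "aggs": {
        "sales_over_time": {
            "auto_date_histogram": {
                "field": "date",   # the date field to bucket on
                "buckets": 10      # target number of buckets, not an exact count
            }
        }
    }
})

# The aggregation chooses the interval itself; each bucket reports its key.
for bucket in response["aggregations"]["sales_over_time"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```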

Things still to do:

  • Write documentation on the new aggregation and its parameters
  • Write more tests, specifically around having sub-aggregations and checking they are merged correctly in the reduce step
  • Investigate whether we could collect 10 * target buckets on the shards and then use the extra buckets to create buckets representing multiples of the interval, for the cases where one interval gives too many buckets but the next interval up gives only one (or a few). For example, if the target is 10 buckets and we end up with 30 minute-level buckets, increasing the rounding gives 1 hour-level bucket; instead we could merge every 3 minute-level buckets to get 10 three-minute buckets (see the sketch after this list). This will probably be moved out into a separate issue to be tackled once this first pass is merged
  • Add an equivalent auto_histogram aggregation to work on numeric (rather than date) fields - this should be moved into a separate issue
  • Fail the request at parsing time if the number of buckets requested is higher than soft_limit_value / m, where m is the largest innerInterval across all rounding intervals (the largest number of buckets we would merge into one bucket in any rounding, currently 30)
  • Add a test to ensure that if a user requests more than 10k buckets, we return an error.
  • Fix doc tests
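To make the inner-interval idea in the third bullet concrete, here is a minimal sketch (plain Python; the function name and structure are illustrative, not the PR's code) of merging runs of adjacent buckets instead of jumping straight to the next coarser rounding:

```python
import math

def merge_adjacent(doc_counts: list[int], target: int) -> list[int]:
    """Merge every k adjacent doc counts so that len(result) <= target."""
    k = math.ceil(len(doc_counts) / target)
    return [sum(doc_counts[i:i + k]) for i in range(0, len(doc_counts), k)]

# 30 one-minute buckets with a target of 10: merging every 3 adjacent buckets
# yields 10 three-minute buckets instead of collapsing to a single 1-hour bucket.
minute_buckets = [1] * 30
print(len(merge_adjacent(minute_buckets, 10)))  # -> 10
```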

Closes #9572

@elasticmachine (Collaborator)

Pinging @elastic/es-search-aggs

colings86 requested a review from jpountz March 15, 2018 13:20
@colings86 (Contributor, Author)

@jpountz there are tests that still need to be added to this PR, and @pcsanwald is looking into benchmarking the aggregation at the moment to make sure the time performance is OK, but I think it's worth you taking a look now so that if there are any major things to change we can fix them now.

@jpountz (Contributor) left a comment

I had a quick look and think the general approach is good. It would help to better document how the target number of buckets is interpreted and how far from it the actual number of buckets may end up.

jdconrad pushed a commit to jdconrad/elasticsearch that referenced this pull request Jul 16, 2018
pcsanwald pushed a commit to pcsanwald/elasticsearch that referenced this pull request Jul 16, 2018
* Adds a new auto-interval date histogram


* Adds documentation

* Added sub aggregator test

* Fixes failing docs test

* Brings branch up to date with master changes

* trying to get tests to pass again

* Fixes multiBucketConsumer accounting

* Collects more buckets than needed on shards

This gives us more options at reduce time in terms of how we do the
final merge of the buckets to produce the final result

* Revert "Collects more buckets than needed on shards"

This reverts commit 993c782.

* Adds ability to merge within a rounding

* Fixes non-timezone doc test failure

* Fix time zone tests

* iterates on tests

* Adds test case and documentation changes

Added some notes in the documentation about the intervals that can be
returned.

Also added a test case that utilises the merging of consecutive buckets

* Fixes performance bug

The bug meant that getAppropriateRounding took a huge amount of time
if the range of the data was large but also sparsely populated. In
these situations the rounding would be very low, so iterating through
the rounding values from the min key to the max key took a long time
(~120 seconds in one test).

The solution is to add a rough estimate first which chooses the
rounding based just on the long values of the min and max keys alone,
but selects the rounding one lower than the one it thinks is
appropriate, so the accurate method can choose the final rounding taking
into account the fact that intervals are not always fixed length.

The commit also adds more tests.

(A rough sketch of this estimate-then-refine idea appears after this
commit list.)

* Changes to only do complex reduction on final reduce

* merge latest with master

* correct tests and add a new test case for 10k buckets

* refactor to perform bucket number check in innerBuild

* correctly derive bucket setting, update tests to increase bucket threshold

* fix checkstyle

* address code review comments

* add documentation for default buckets

* fix typo
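The "Fixes performance bug" commit above describes an estimate-then-refine approach for picking the rounding. The sketch below is a rough illustration of that idea only, assuming millisecond keys and an illustrative interval table; the constant and function names are not the PR's actual Rounding machinery.

```python
# Rough sketch only: estimate a rounding from the raw min/max keys, then back
# off one step so an accurate pass (which knows calendar intervals vary in
# length) can make the final choice. Interval lengths are approximations.
ROUGH_INTERVAL_MILLIS = [
    1_000,           # second
    60_000,          # minute
    3_600_000,       # hour
    86_400_000,      # day
    2_592_000_000,   # ~month
    31_536_000_000,  # ~year
]

def rough_rounding_index(min_key: int, max_key: int, target_buckets: int) -> int:
    span = max_key - min_key
    idx = 0
    # Coarsen until the span fits into roughly the target number of buckets.
    while (idx < len(ROUGH_INTERVAL_MILLIS) - 1
           and span / ROUGH_INTERVAL_MILLIS[idx] > target_buckets):
        idx += 1
    # Return one step finer than the estimate; the accurate walk from min key
    # to max key then only has to inspect a couple of candidate roundings
    # instead of iterating bucket by bucket from the finest rounding.
    return max(idx - 1, 0)
```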
martijnvg added a commit that referenced this pull request Jul 16, 2018
* es/6.x:
  Use correct formatting for links (#29460)
  Revert "Adds a new auto-interval date histogram (#28993)"
  Revert "fix typo"
  fix typo
  Adds a new auto-interval date histogram (#28993)
  [Rollup] Replace RollupIT with a ESRestTestCase version (#31977)
  [Rollup] Fix duplicate field names in test (#32075)
  [Tests] Fix failure due to changes exception message (#32036)
  [Test] Mute MlJobIT#testDeleteJobAfterMissingAliases
  Replace Ingest ScriptContext with Custom Interface (#32003) (#32060)
  Cleanup Duplication in `PainlessScriptEngine` (#31991) (#32061)
  HLRC: Add xpack usage api (#31975)
  Clean Up Snapshot Create Rest API (#31779)
pcsanwald added v6.5.0 and removed v6.4.0 labels Aug 9, 2018
@mrec commented Aug 16, 2018

Apologies for the necropost, but I'm increasingly unsure that I've understood the implications of "Fail the request at parsing time if the number of buckets requested is higher than soft_limit_value / m", mostly because I can't see where soft_limit_value comes from. Is this a global ES config property?

Also, the agg docs don't make it clear what happens if the requested bucket count can't be met even with 100-year intervals - I assume an error again, although that feels a bit harsh since it's only partially under the caller's control - or whether the 7-day and 3-month roundings will align to natural (calendar) week/quarter boundaries.

@colings86 (Contributor, Author)
@mrec The soft_limit_value refers to the existing soft limit that applies across all aggregations and limits the number of buckets that can be created by an aggregation request. You can see the setting defined in the code here.

The idea of the soft limits is that we have controls that aim to stop a single user performing a request that would destabilise or be otherwise detrimental to the cluster, or would adversely affect other users trying to perform requests on the cluster. Most of these soft limits end up effectively limiting the amount of memory a user's request can use in various ways, but some do other things.

The idea with "Fail the request at parsing time if the number of buckets requested is higher than soft_limit_value / m" is that if, at the point we parse the request, we can already determine that we are going to need to produce more buckets than this particular soft limit allows, there is no point in executing the request, since during execution it will definitely trip the soft limit and fail. So instead we fail the request at parsing time.

That's not to say that the soft limit won't be tripped if the request passes the soft_limit_value / m test. Other factors, like the number of buckets any sub-aggregations create, could still make us hit the soft limit, but the idea is to catch the case where we already know at parse time that the request won't work.

If the request cannot be satisfied even with 100-year intervals then the request will fail on the soft limit, as would any other aggregation request. I'd like to note two things here though:

  1. As is explained in the error response upon hitting the soft limit, an admin can modify the value of the soft limit if they wish. This is an admin choice (someone who can control cluster settings) because it's something that needs to be considered with the health of the whole cluster and all its users in mind.
  2. The default soft limit for the number of buckets is 10,000, so in order to trip the soft_limit_value / m test you would need to request more than 10,000 / 30 ≈ 333 buckets in this aggregation (m is 30 because we take the maximum innerInterval of all the intervals except the highest one). So, assuming the data returned by the query would require 100-year buckets, you are still able to aggregate a range of roughly 33,000 years, which to me seems very reasonable for a default. (A sketch of this check follows below.)
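As a worked sketch of that parse-time check: the 10,000 default and the innerInterval of 30 are taken from the discussion above, while the function name and error wording are illustrative assumptions rather than the PR's actual innerBuild code.

```python
MAX_INNER_INTERVAL = 30  # largest number of buckets merged into one at any rounding

def validate_target_buckets(requested_buckets: int, soft_limit: int = 10_000) -> None:
    """Reject at parse time any request that is guaranteed to trip the soft limit."""
    max_allowed = soft_limit // MAX_INNER_INTERVAL  # 10_000 // 30 == 333
    if requested_buckets > max_allowed:
        raise ValueError(
            f"auto_date_histogram requested {requested_buckets} buckets, but with a "
            f"bucket soft limit of {soft_limit} at most {max_allowed} may be requested"
        )

validate_target_buckets(333)    # passes
# validate_target_buckets(334)  # would fail here, before any collection happens
```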

@mrec commented Aug 17, 2018

@colings86 I think you're confirming my worries - a maximum range of 33,000 years is more than generous, but I think a (default) maximum buckets of 333 is too low. Particularly since it's not documented, it's not the case for other bucketed aggs and the following "if a user requests more than 10k buckets, we return an error" kind of implies that that's the limit.

@colings86 (Contributor, Author)
> I think a (default) maximum buckets of 333 is too low

I don't agree here. Given that the final interval is out of the user's hands, I don't think this aggregation is useful for analytics jobs which consume a lot of buckets (since they would almost certainly want to determine the interval of the buckets themselves), but rather for visualising data to the user (e.g. on a bar chart). Given this, I find it hard to believe that a visualisation could present the information in a way that a user could consume if it's rendering hundreds of buckets. Is your use case intending to present the output of auto_date_histogram to the user through a visualisation, or to an analytics job?

> Particularly since it's not documented, it's not the case for other bucketed aggs and the following "if a user requests more than 10k buckets, we return an error" kind of implies that that's the limit.

We do document the soft limit here. Note that the documentation says it's disabled by default at the moment, but it will be set to a default of 10,000 in 7.0. The optimisation we added in this aggregation doesn't change the behaviour or lower the limit; it just catches the violation of the soft limit earlier, before we start executing the request. As I mentioned before, we feel that the soft limits are important tools for cluster admins, adding confidence that a single request is not going to adversely affect the cluster or other users' requests, and the limits can always be adjusted if the admin determines it's appropriate to increase them, in which case the decision can be made with the knowledge that the cluster is safe to handle these more expensive requests.

@mrec commented Aug 17, 2018

Sorry, I think that came across as more argumentative than intended. Our main expected use cases are around visualizations for screening, yes, and examples I've seen of "the sort of thing we want" include pixel-per-bar charts around 400px wide. Whether that resolution is actually needed is open to debate; I'll talk to our PM.

I do think it'd be worth calling out the 333 limit explicitly in the agg docs - the connection between that and the documented 10k limit is not obvious, and most users won't be reading this PR thread.

@colings86 (Contributor, Author)
No offence taken at all 😄

> I do think it'd be worth calling out the 333 limit explicitly in the agg docs - the connection between that and the documented 10k limit is not obvious, and most users won't be reading this PR thread.

You have convinced me on this. I think we should point this out in the documentation. Thanks for persevering with me.

@colings86 (Contributor, Author)
@mrec I raised #32950 for the documentation issue

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Dec 18, 2018
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Jan 2, 2019
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Jan 24, 2019
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Jan 25, 2019
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Jan 25, 2019
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Jan 28, 2019
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Mar 29, 2019