Cumulative Sum with correct initial value #60672

Closed
Tracked by #60126
timroes opened this issue Mar 19, 2020 · 7 comments
Labels
enhancement New value added to drive a business result Feature:Aggregations Aggregation infrastructure (AggConfig, esaggs, ...) impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:small Small Level of Effort Team:Visualizations Visualization editors, elastic-charts and infrastructure

Comments

@timroes
Contributor

timroes commented Mar 19, 2020

It would be nice if users could use a cumulative sum whose first bucket starts not at 0 but at the value the aggregation had for all documents "before" the date_histogram started. Currently there is no real way of using that value as a starting point.

Users want to achieve charts like the following (see this discuss post):

[Image: example chart from the Discuss post]

I think this feature would mainly make sense for cumulative_sums that run over date_histogram buckets, though technically the same idea could apply to histograms, where the initial value would be the "sum" (or whatever metric) of all documents smaller than the leftmost bucket in the histogram.
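To make the request concrete, here is a minimal sketch of the arithmetic with made-up numbers: three visible buckets and a hypothetical total of 100 for everything before the histogram starts. The function name and values are illustrative only.

```python
from itertools import accumulate


def cumulative_sum(bucket_values, initial=0):
    """Running total over bucket values, offset by an initial value."""
    # accumulate(..., initial=x) yields x itself first, so drop that
    # element to keep exactly one output per bucket.
    return list(accumulate(bucket_values, initial=initial))[1:]


bucket_values = [5, 3, 2]   # per-bucket sums inside the visible time range
prior_total = 100           # sum over all documents before the first bucket

print(cumulative_sum(bucket_values))               # [5, 8, 10]
print(cumulative_sum(bucket_values, prior_total))  # [105, 108, 110]
```

The first line is what cumulative_sum gives today; the second is what this issue asks for.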

@timroes timroes added Feature:Aggregations Aggregation infrastructure (AggConfig, esaggs, ...) enhancement New value added to drive a business result Team:Visualizations Visualization editors, elastic-charts and infrastructure Team:AppArch labels Mar 19, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-app (Team:KibanaApp)

@elasticmachine
Contributor

Pinging @elastic/kibana-app-arch (Team:AppArch)

@timroes
Contributor Author

timroes commented Mar 19, 2020

@elastic/es-analytics-geo I wonder if there would be a way to achieve such a feature inside Elasticsearch? Does the cumulative sum have enough knowledge to calculate the sum (or whatever metric it runs over) for all documents "before the date histogram starts" (which would usually be determined by the overall date range filter)?

I am still on the fence about whether this, even though it is a common use case in Kibana, requires too much specific knowledge to be handled in Elasticsearch directly. If it doesn't make sense, it would at least be good if we could manually specify an initial value for the cumulative_sum that is added before the first bucket, so we could use a two-query approach.

@polyfractal
Contributor

Not easily, no. :( The main issue is that aggs get their values from whatever matches the query, so if the query is filtering out a portion of time, none of the aggs will ever get a chance to see it. And aggs don't have any influence over the query, so it's entirely up to the user (or Kibana) to configure it so that the right data is aggregated. That makes supporting it tricky because we don't want aggs to be dependent on the user setting up the agg tree correctly (e.g. this functionality only works if you have ABC in XYZ places).

Supporting something like an initial_value would be trivial to do, however.

Theoretically the data could be collected with a few different combinations, but none of them are super clean:

Option 1

  • match_all query
    • filter agg with "pre-date-histo" range
      • sum agg to get total before the histo
    • filter agg with appropriate range
      • date_histo + whatever else + cumulative_sum
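A rough sketch of what an Option 1 request body could look like, built as a Python dict. The aggregation names (`before_histo`, `in_range`, `per_bucket`, `bucket_sum`, `running`), the field names, and the dates are all hypothetical, and a plain `sum` stands in for "whatever else".

```python
import json


def build_option1_body(field, metric_field, start, end, interval):
    """match_all query with two sibling filter aggs (Option 1)."""
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {
            # Total of everything before the histogram starts.
            "before_histo": {
                "filter": {"range": {field: {"lt": start}}},
                "aggs": {"total": {"sum": {"field": metric_field}}},
            },
            # The visible range with the histogram and cumulative sum.
            "in_range": {
                "filter": {"range": {field: {"gte": start, "lt": end}}},
                "aggs": {
                    "per_bucket": {
                        "date_histogram": {"field": field, "fixed_interval": interval},
                        "aggs": {
                            "bucket_sum": {"sum": {"field": metric_field}},
                            "running": {"cumulative_sum": {"buckets_path": "bucket_sum"}},
                        },
                    }
                },
            },
        },
    }


body = build_option1_body("@timestamp", "bytes", "2020-03-01", "2020-03-08", "1d")
print(json.dumps(body, indent=2))
```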

Option 2

  • range filter
    • date_histo + whatever else + cumulative_sum
    • global agg to escape the range filter context
      • filter agg with "pre-date-histo" range
        • sum agg to get total before the histo
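Option 2 could be sketched along these lines, again as a Python dict. The aggregation names (`per_bucket`, `all_docs`, `before_histo`, `total`), field names, and dates are hypothetical. Note that a `global` aggregation has to sit at the top level of the agg tree.

```python
import json


def build_option2_body(field, metric_field, start, end, interval):
    """Range query plus a global agg that escapes it (Option 2)."""
    return {
        "size": 0,
        # The query restricts the histogram to the visible range.
        "query": {"range": {field: {"gte": start, "lt": end}}},
        "aggs": {
            "per_bucket": {
                "date_histogram": {"field": field, "fixed_interval": interval},
                "aggs": {
                    "bucket_sum": {"sum": {"field": metric_field}},
                    "running": {"cumulative_sum": {"buckets_path": "bucket_sum"}},
                },
            },
            # global ignores the query, so the filter below sees all documents.
            "all_docs": {
                "global": {},
                "aggs": {
                    "before_histo": {
                        "filter": {"range": {field: {"lt": start}}},
                        "aggs": {"total": {"sum": {"field": metric_field}}},
                    }
                },
            },
        },
    }


body = build_option2_body("@timestamp", "bytes", "2020-03-01", "2020-03-08", "1d")
print(json.dumps(body, indent=2))
```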

Option 3

  • msearch
    • range filter
      • date_histo + whatever else + cumulative_sum
    • range filter with "pre-date-histo" range
      • sum agg to get total before the histo
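For Option 3, the `_msearch` body is newline-delimited JSON: a header line followed by a search body, repeated per search, with a trailing newline. A minimal sketch of building that payload follows; the index name, field names, dates, and aggregation names are hypothetical.

```python
import json


def build_msearch_payload(index, field, metric_field, start, end, interval):
    """Two searches in one _msearch payload (Option 3)."""
    header = {"index": index}
    histo_search = {
        "size": 0,
        "query": {"range": {field: {"gte": start, "lt": end}}},
        "aggs": {
            "per_bucket": {
                "date_histogram": {"field": field, "fixed_interval": interval},
                "aggs": {
                    "bucket_sum": {"sum": {"field": metric_field}},
                    "running": {"cumulative_sum": {"buckets_path": "bucket_sum"}},
                },
            }
        },
    }
    before_search = {
        "size": 0,
        "query": {"range": {field: {"lt": start}}},
        "aggs": {"total": {"sum": {"field": metric_field}}},
    }
    lines = [header, histo_search, header, before_search]
    # Each JSON object goes on its own line; the body must end with a newline.
    return "\n".join(json.dumps(line) for line in lines) + "\n"


payload = build_msearch_payload(
    "my-index", "@timestamp", "bytes", "2020-03-01", "2020-03-08", "1d"
)
print(payload)
```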

Option 4

  • match_all query
    • date_histo + whatever else + cumulative_sum
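With Option 4 the merging happens on the client. One possible reading, sketched below with made-up buckets: fold every bucket before the visible start into an initial value, then run the cumulative sum over the rest. The bucket shape (`key` in epoch milliseconds, `value` as the per-bucket sum) and the function name are assumptions for illustration.

```python
from itertools import accumulate


def fold_pre_range_buckets(buckets, start_ms):
    """Merge buckets before start_ms into an initial value for the running total."""
    initial = sum(b["value"] for b in buckets if b["key"] < start_ms)
    visible = [b for b in buckets if b["key"] >= start_ms]
    # Drop the leading element, which is the initial value itself.
    running = list(accumulate((b["value"] for b in visible), initial=initial))[1:]
    return [{"key": b["key"], "cumulative": r} for b, r in zip(visible, running)]


buckets = [
    {"key": 0, "value": 40},
    {"key": 1000, "value": 60},
    {"key": 2000, "value": 5},
    {"key": 3000, "value": 3},
]
result = fold_pre_range_buckets(buckets, start_ms=2000)
print(result)
# [{'key': 2000, 'cumulative': 105}, {'key': 3000, 'cumulative': 108}]
```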

Unclear which would be best. The first option is one search collector execution, but the filter aggs are relatively slow.

The second essentially executes two searches (global works by doing an extra match_all query), so one query would be fast because it has an exclusive filter, and the other is a slow filter agg.

The third is probably the fastest since it's two exclusive queries hitting the minimal amount of data, but does require an msearch and two entire search executions including all the extra overhead.

The fourth is probably faster than 1 or 2, but requires Kibana to take all the "pre-date-histo" buckets and merge them together before presenting to the user.

And in all cases, it requires Kibana to intervene and munge the results into something usable for the user.
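As a sketch of that munging step, assume a response where a top-level histogram named `per_bucket` carries a `running` cumulative sum per bucket, and a `global` aggregation named `all_docs` holds the pre-histogram total under `before_histo.total`. All of these names and the sample response are hypothetical and hand-written for illustration.

```python
def apply_initial_value(response):
    """Shift every cumulative_sum value by the pre-histogram total."""
    aggs = response["aggregations"]
    # Defensive fallback in case the value comes back as null.
    initial = aggs["all_docs"]["before_histo"]["total"]["value"] or 0
    return [
        {"key": b["key"], "cumulative": b["running"]["value"] + initial}
        for b in aggs["per_bucket"]["buckets"]
    ]


# Hand-written sample in the shape of a search response.
sample_response = {
    "aggregations": {
        "all_docs": {
            "doc_count": 120,
            "before_histo": {"doc_count": 100, "total": {"value": 100.0}},
        },
        "per_bucket": {
            "buckets": [
                {"key": 1583020800000, "doc_count": 12,
                 "bucket_sum": {"value": 5.0}, "running": {"value": 5.0}},
                {"key": 1583107200000, "doc_count": 8,
                 "bucket_sum": {"value": 3.0}, "running": {"value": 8.0}},
            ]
        },
    }
}

shifted = apply_initial_value(sample_response)
print(shifted)
# [{'key': 1583020800000, 'cumulative': 105.0}, {'key': 1583107200000, 'cumulative': 108.0}]
```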

@timroes
Contributor Author

timroes commented Mar 20, 2020

Thanks for the detailed explanation of the different possibilities. In general it's not a problem to munge results later in Kibana; it would just be nice if we could ideally solve it in one query, since that simplifies infrastructure for us quite a lot. Thanks also for the performance hints here.

My feeling around those:

  • Option 4: Not a good solution, since the user could have a rather small date_histogram interval because they are only looking at a small time frame of the overall data. If we now use that same interval for the overall time frame, we run the risk of easily exceeding the max buckets limit.
  • Option 3: Despite being the fastest, most likely the trickiest to get into our infrastructure, since a visualization would suddenly need to inject a separate request into the same _msearch request (and we're actually switching over to _search right now).
  • Option 2 & 1: Option 2 would actually be the solution that fits most easily into our infrastructure, and given that it might be slightly faster than Option 1, I think we should currently focus on this one.

Since we also have a chance of solving that in one request, I think we would rather solve it in one request and then sum up on the Kibana side than make two requests and need an "initial value" option for the cum sum aggregation.

@polyfractal
Contributor

👍 that seems reasonable to me, especially as this isn't likely to be a feature that shows up on all dashboards across all parts of Kibana. So a little slower relative to other operations is probably acceptable :)

Between 1 and 2, hard to really pin down which would be better, might be worthwhile testing both against a decent-sized dataset.

Filter aggs work by fetching the bitset of matches against the filter, and then checking that bitset for each document that they see. This is in contrast to normal query filters, which "advance" to the first matching document.

  • So Option 1 means all documents will be presented to both filters, which independently check to see if they match. Single pass through the data and two bitset checks per doc.

  • Option 2 means the range filter will exclude the majority of documents and so the date_histo/etc will only see the matching docs. But the global will execute a secondary search which presents them all to the bitset for checks. So one full pass + one bitset check per-doc, and a second partial pass for the date_histo and friends.
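A back-of-the-envelope tally of the two bullets above, as a toy counting model only. It says nothing about the real relative cost of a document visit versus a bitset check, and the document counts are made up.

```python
def option1_cost(total_docs):
    """Option 1: one pass over all docs, two bitset checks per doc."""
    return {"docs_visited": total_docs, "bitset_checks": 2 * total_docs}


def option2_cost(total_docs, docs_in_range):
    """Option 2: a full pass with one bitset check per doc (global),
    plus a partial pass over the docs matching the range query."""
    return {
        "docs_visited": total_docs + docs_in_range,
        "bitset_checks": total_docs,
    }


print(option1_cost(1_000_000))
# {'docs_visited': 1000000, 'bitset_checks': 2000000}
print(option2_cost(1_000_000, 50_000))
# {'docs_visited': 1050000, 'bitset_checks': 1000000}
```

Which side wins depends on how expensive each kind of operation is in practice, which is why testing against a real dataset is the only reliable answer.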

Hard to say. :)

@exalate-issue-sync exalate-issue-sync bot added impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:small Small Level of Effort labels Jun 21, 2021
@ppisljar
Member

ppisljar commented Aug 8, 2022

Thank you for contributing to this issue, however, we are closing this issue due to inactivity as part of a backlog grooming effort. If you believe this feature/bug should still be considered, please reopen with a comment.

@ppisljar ppisljar closed this as completed Aug 8, 2022