Add additional box plot agg options #60466

Closed
imotov opened this issue Jul 30, 2020 · 9 comments · Fixed by #63617
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@imotov
Contributor

imotov commented Jul 30, 2020

In #51948 we added basic support for box plots with a minimal set of supported values (min, max, q1, q2, and q3). As we discussed before, the data for some alternative methods of displaying whiskers can be derived from the 5 values we provide at the moment. Recently, we got a request from a user to add these calculations to the aggregation. I would like to discuss this, as well as adding support for some other styles of box plot, such as whiskers placed at:

  • one standard deviation above and below the mean of the data
  • the 9th percentile and the 91st percentile
  • the 2nd percentile and the 98th percentile.
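
To make the three styles above concrete, here is a minimal client-side sketch in Python. It is not part of the aggregation; the function name `whisker_bounds` and the style labels are made up for illustration, and the choice of population standard deviation and of the "inclusive" quantile method are assumptions.

```python
import statistics


def whisker_bounds(values, style):
    """Return (low, high) whisker positions for one of the styles listed above.

    Hypothetical helper for illustration only; the style names are assumptions.
    """
    data = sorted(values)
    if style == "stddev":
        mean = statistics.fmean(data)
        # Population standard deviation; the sample version is an equally valid choice.
        sd = statistics.pstdev(data)
        return mean - sd, mean + sd
    # quantiles(n=100) returns 99 cut points; index k - 1 holds the k-th percentile.
    pct = statistics.quantiles(data, n=100, method="inclusive")
    if style == "9/91":
        return pct[8], pct[90]
    if style == "2/98":
        return pct[1], pct[97]
    raise ValueError(f"unknown whisker style: {style}")
```

Note that only the standard deviation style needs more than quantiles, so the percentile styles map naturally onto what a percentile-based aggregation already computes.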
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine added the Team:Analytics label Jul 30, 2020
@jtibshirani
Contributor

Summarizing from a previous comment -- a common style is for the whiskers to extend to the furthest points within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]. Points that are outside this interval are marked as outliers and displayed on the plot. I suspect this is what many users would think of as a 'standard' box plot.

A couple of questions/observations:

  • In order to accommodate this new style, we could rework the names in the response. Perhaps we'd use terms like low and high instead of min and max, since the whiskers will no longer always be the min and max values.
  • Do we want to return some representation of the outliers (the points outside the whiskers)? I wonder what information would be both helpful and feasible to return.
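
For reference, the 1.5 * IQR style described above can be sketched over raw values as follows. This is a plain Python illustration on exact data, not how the aggregation works internally; the function name `tukey_box_plot` is hypothetical, and the `low`/`high` keys simply borrow the naming floated in the first bullet.

```python
import statistics


def tukey_box_plot(values):
    """Compute box plot values with whiskers at the furthest points inside the fences."""
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at actual data points, not at the fences themselves.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "low": inside[0],
        "q1": q1,
        "q2": q2,
        "q3": q3,
        "high": inside[-1],
        "outliers": outliers,
    }
```

For the input `[1, 2, 3, 4, 5, 6, 7, 8, 9, 100]`, the upper fence is 14.5, so the upper whisker stops at 9 and 100 is reported as an outlier.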

@not-napoleon not-napoleon self-assigned this Aug 19, 2020
@not-napoleon
Member

@jtibshirani - Couple of questions on this:

  • is it only whiskers that we want to add options for? I.e. is the main box always Q1 to Q3 with Q2 marked?
  • Should we allow just setting the whiskers to an arbitrary distance from the max, or do we only want to support a specific set of pre-canned cases (9/91, 2/98, and 1.5IQR, for example)?
    • if we want to allow arbitrary setting, do they need to be the same "distance" or could you set them to, e.g., 2nd percentile and 90th percentile? My instinct is to just let the user set them to whatever they want rather than enforce that they be "symmetric"
    • Again, assuming we allow for arbitrary settings, it's pretty clear that the whiskers should be outside the main box. Other than that, are there any bounds we need to enforce on them?  I don't think there would be.
  • How important are the outliers to this? At a glance, it's much easier to support varying the length of the whiskers than it is to include outliers, but before I build that I just want to double check that it's worth having one without the other.

Thanks!

@not-napoleon
Member

Julie and I discussed this on Zoom, and have a few thoughts. It sounds like the biggest use case here is the 1.5IQR points. If we just want to add that, we can simply include it in the output we have now, going from returning 5 numbers to returning 7 numbers. This would also allow users to know if there were outliers, by checking if the upper 1.5IQR value was less than the max (or if the lower one was more than the min on the other side). It would then be possible to query the outliers with a range query if desired.

In this proposal, we would not add a new parameter to the agg, and would not support the 9/91 or 2/98 quantile cases. We'd just be enhancing the current agg with two new output values, and let the user choose which to use for displaying the whisker end points.

@benwtrent @mattfield @pmoust @tveasey - Tagging you folks as interested parties on this for feedback on this proposed solution. Would returning a set of 7 numbers - max, Q3 + 1.5 * IQR, Q3, Q2, Q1, Q1 - 1.5*IQR, min - be enough to make this aggregation useful to you?
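
As a rough sketch of the outlier check and follow-up range query described above: the Python below derives the two 1.5 * IQR values from the existing `min`, `max`, `q1`, and `q3` response values and builds a query body as a plain dict. The helper name `outlier_query` and the field name `load_time` are assumptions for illustration, and the quartiles coming back from the aggregation may themselves be approximate.

```python
def outlier_query(field, boxplot):
    """Build a search body matching values beyond the 1.5 * IQR points.

    `boxplot` is a dict holding the min, max, q1 and q3 values from a boxplot
    aggregation response. Returns None when there are no outliers to fetch.
    """
    iqr = boxplot["q3"] - boxplot["q1"]
    low_fence = boxplot["q1"] - 1.5 * iqr
    high_fence = boxplot["q3"] + 1.5 * iqr

    clauses = []
    # Outliers exist on a side only if min/max falls beyond that side's fence.
    if boxplot["min"] < low_fence:
        clauses.append({"range": {field: {"lt": low_fence}}})
    if boxplot["max"] > high_fence:
        clauses.append({"range": {field: {"gt": high_fence}}})
    if not clauses:
        return None
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


# Usage with made-up numbers: only the upper side has outliers here.
body = outlier_query(
    "load_time", {"min": 1, "max": 100, "q1": 3.25, "q2": 5.5, "q3": 7.75}
)
```

The resulting dict can be sent as the body of a regular search request; for a very large data set the hits would likely need to be limited or sampled.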

@pmoust
Member

pmoust commented Aug 20, 2020

Yes.

@benwtrent
Member

@blaklaybul @Winterflower @joshdevins what do y'all think of this proposal?

@tveasey
Contributor

tveasey commented Aug 24, 2020

This sounds like a good proposal to me. I agree that actual outliers should be retrieved separately and might conceivably need to be downsampled anyway for a very large data set. I think it would be worth mentioning this thinking:

"This would also allow users to know if there were outliers, by checking if the 1.5IQR value was less than the max (or more than the min on the other side). It would then be possible to query the outliers with a range query if desired."

in the docs for this agg as well.

@leewadhams

Hi, from a completely self-centred point of view, the ability to sort by the whisker values is important. If I'm understanding correctly, the proposal is to return the values of the 1.5IQR upper and lower bounds but not actually work out the highest and lowest values contained within them to produce the actual whisker values, leaving it to the user to execute follow-up queries to determine the outliers and the actual whisker values. Is this understanding correct?

Apologies if I've misunderstood.

@not-napoleon
Copy link
Member

@leewadhams (and others) - I don't want to just return the 1.5 IQR values, since that's pretty trivial to compute client side and doesn't add much utility to the aggregation. Ideally, I'd like to return the closest contained value to the 1.5 IQR point, which I believe should be the whisker value, but in practice it's not that simple. Boxplot is built on a bounded error sketch of the data (it uses a t-digest internally), so the best I can do in the general case is to get close to the whisker value. I'm still playing around with methodologies, but I hope to be able to quantify "close" a bit more before I release this.

By the same token, we can't return outlier values from this aggregation, because the sketch doesn't store exact values.


9 participants