-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Boxplot Aggregation #51948
Add Boxplot Aggregation #51948
Conversation
Adds a `boxplot` aggregation that calculates min, max, medium and the first and the third quartiles of the given data set. Closes elastic#33112
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments! Mainly around the docs, and a few bits and pieces elsewhere. This looks good!
[source,js] | ||
-------------------------------------------------- | ||
{ | ||
"boxplot": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder, should we snake_case this to box_plot
? I personally don't really like snake casing but it would be more consistent (date_histogram
, significant_terms
, etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wikipedia states both spellings and boxplot
was shorter, but I don't have strong feelings about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, boxplot it is. I like shorter too
-------------------------------------------------- | ||
{ | ||
"boxplot": { | ||
"buckets_path": "my_cardinality_agg" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo re: "buckets_path" instead of "field"?
-------------------------------------------------- | ||
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/] | ||
|
||
As you can see, the aggregation will return a calculated value for each percentile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like copy pasta leftovers :)
==== Script | ||
|
||
The boxplot metric supports scripting. For example, if our load times | ||
are in milliseconds but we want percentiles calculated in seconds, we could use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"percentiles" typo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I am not sure how to call them here. Technically they are percentiles :) Values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, yeah this is probably fine. Was thinking it was a copy/paste leftover.
|
||
<1> Compression controls memory usage and approximation error | ||
|
||
The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming this is the same as the Percentiles page? Maybe we should link to it instead? Or maybe there's a fancy way to embed a snippet from a different page? @nik9000 do you happen to know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -378,7 +378,7 @@ public void testGetProperty() throws IOException { | |||
iw.addDocument(singleton(new NumericDocValuesField("number", 7))); | |||
iw.addDocument(singleton(new NumericDocValuesField("number", 1))); | |||
}, (Consumer<InternalGlobal>) global -> { | |||
assertEquals(1.0, global.getDocCount(), 2); | |||
assertEquals(2, global.getDocCount()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😅
import java.util.List; | ||
import java.util.Map; | ||
|
||
public class BoxplotAggregator extends NumericMetricsAggregator.MultiValue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, did you look into extending AbstractTDigestPercentilesAggregator
? It might avoid some code duplication, but otoh it probably does more than you want/need. No strong feelings here.... the percentile/ranks code is already a complicated hierarchy so might be easier to just keep this separate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it'd be cleaner to make a class they could both delegate their common behavior to? It'd be cool to do that in a follow up change if it is tricky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, noodled on this a bit and it's probably too much for this PR (especially with #51887 potentially in the works)
The main difficulty is that percentiles/ranks share TDigest and HDRHisto agg base classes, and this would just share with TDigest, so it'd be a fork in the tree in the middle somewhere.
AbstractTDigestBase --> AbstractTDigestPercentiles --> Percentiles
\--> Ranks
\--> Boxplot
Unsure if the extra complexity is worth it or not. 🤷♂️ But probably too much complexity for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is up to @imotov but I'm fine waiting on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at it when it was still IQR and it was immediate dead-end from the inheritance perspective because of IQR was a single value/multivalue agg. It might be a bit easier now but I agree with @nik9000 and still think we should do some refactoring by composition (instead of inheritance) here. However, I would prefer to do it in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
if (super.equals(obj) == false) return false; | ||
|
||
InternalBoxplot that = (InternalBoxplot) obj; | ||
return Objects.equals(state, that.state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we include format
in this equality check?
Philosophically, I wonder if we should check equality of the boxplot components instead of the state? Like, are two boxplots identical if min/max/q1/q2/q3 are the same but the rest of the distribution is different?
I really dunno heh :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is already included in super.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦♂️
expected.add(input.state()); | ||
} | ||
assertNotNull(expected); | ||
// The final calculated result may very depending on the order, which requires higher delta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment leftover? E.g. the delta looks like it's zero?
|
||
<1> Compression controls memory usage and approximation error | ||
|
||
The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
import org.elasticsearch.search.aggregations.metrics.NumericMetricsAggregation; | ||
|
||
public interface Boxplot extends NumericMetricsAggregation.MultiValue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need these any more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm not sure tbh. We might want to keep them around as a bridge to the HLRC? E.g. InternalMin
and ParsedMin
both implement Min
, and it might help us make sure they don't diverge?
Or maybe useless bloat now that the TC is gone and we can get rid of them... but probably better to investigate that in a separate PR and apply it to all aggs if we decide we can remove them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HLRC won't have Boxplot
on the classpath because it doesn't have the analytics
module. With top_metrics
I just pointed folks to ParsedTopMetrics
. I'm fine either way. It is just that when I made my new agg I figured I could save myself a little work an not make it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argh, you're right. Keep forgetting this is in xpack. 👍 probably not necessary given the current situation
BoxplotAggregationBuilder> { | ||
public static final String NAME = "boxplot"; | ||
|
||
public static final ParseField COMPRESSION_FIELD = new ParseField("compression"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it'd be better to use the field from the PercentilesAggregationBuilder
. That at least is a hint that it means the same thing.
static { | ||
PARSER = new ObjectParser<>(BoxplotAggregationBuilder.NAME); | ||
ValuesSourceParserHelper.declareNumericFields(PARSER, true, true, false); | ||
PARSER.declareDouble((builder, compression) -> builder.compression = compression, COMPRESSION_FIELD); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you've declared the setter you may as well use it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So BoxplotAggregationBuilder::compression
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also that'd gt your validation too.
import java.util.List; | ||
import java.util.Map; | ||
|
||
public class BoxplotAggregator extends NumericMetricsAggregator.MultiValue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it'd be cleaner to make a class they could both delegate their common behavior to? It'd be cool to do that in a follow up change if it is tricky.
|
||
enum Metrics { | ||
|
||
min, max, q1, q2, q3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you name these in SHOUTING_SNAKE_CASE like normal Java enums? Is there a super compelling reason not to?
@Override | ||
public double value(String name) { | ||
Metrics metrics = Metrics.valueOf(name); | ||
switch (metrics) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you be making a private method on Metrics
that takes this
as a parameter and calls the right method instead?
return new InternalBoxplot(name, merged, format, pipelineAggregators(), metaData); | ||
} | ||
|
||
static class Fields { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that these Fields
objects had gone out of favor. But I'm not really in the loop.
@elasticmachine test this please |
I was wondering if |
@jtibshirani excellent idea! It doesn't support it at the moment. I will open a follow up PR to add this support. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with it, lgtm!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm 🚢
Refactors BoxplotAggregator to reuse parts that are already available in AbstractTDigestPercentilesAggregator Follow up for elastic#51948
Relates: elastic/elasticsearch#51948 This commit implements the boxplot aggregation. Integration tests run against XPackCluster because it requires a license to use.
Adds a
boxplot
aggregation that calculates min, max, medium and the firstand the third quartiles of the given data set.
Closes #33112