Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Boxplot Aggregation #51948

Merged
merged 3 commits into from
Feb 7, 2020
Merged

Conversation

imotov
Copy link
Contributor

@imotov imotov commented Feb 5, 2020

Adds a boxplot aggregation that calculates min, max, medium and the first
and the third quartiles of the given data set.

Closes #33112

Adds a `boxplot` aggregation that calculates min, max, medium and the first
and the third quartiles of the given data set.

Closes elastic#33112
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments! Mainly around the docs, and a few bits and pieces elsewhere. This looks good!

[source,js]
--------------------------------------------------
{
"boxplot": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, should we snake_case this to box_plot? I personally don't really like snake casing but it would be more consistent (date_histogram, significant_terms, etc).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wikipedia states both spellings and boxplot was shorter, but I don't have strong feelings about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, boxplot it is. I like shorter too

--------------------------------------------------
{
"boxplot": {
"buckets_path": "my_cardinality_agg"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo re: "buckets_path" instead of "field"?

--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

As you can see, the aggregation will return a calculated value for each percentile
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like copy pasta leftovers :)

==== Script

The boxplot metric supports scripting. For example, if our load times
are in milliseconds but we want percentiles calculated in seconds, we could use
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"percentiles" typo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I am not sure how to call them here. Technically they are percentiles :) Values?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, yeah this is probably fine. Was thinking it was a copy/paste leftover.


<1> Compression controls memory usage and approximation error

The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming this is the same as the Percentiles page? Maybe we should link to it instead? Or maybe there's a fancy way to embed a snippet from a different page? @nik9000 do you happen to know?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -378,7 +378,7 @@ public void testGetProperty() throws IOException {
iw.addDocument(singleton(new NumericDocValuesField("number", 7)));
iw.addDocument(singleton(new NumericDocValuesField("number", 1)));
}, (Consumer<InternalGlobal>) global -> {
assertEquals(1.0, global.getDocCount(), 2);
assertEquals(2, global.getDocCount());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😅

import java.util.List;
import java.util.Map;

public class BoxplotAggregator extends NumericMetricsAggregator.MultiValue {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, did you look into extending AbstractTDigestPercentilesAggregator? It might avoid some code duplication, but otoh it probably does more than you want/need. No strong feelings here.... the percentile/ranks code is already a complicated hierarchy so might be easier to just keep this separate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it'd be cleaner to make a class they could both delegate their common behavior to? It'd be cool to do that in a follow up change if it is tricky.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, noodled on this a bit and it's probably too much for this PR (especially with #51887 potentially in the works)

The main difficulty is that percentiles/ranks share TDigest and HDRHisto agg base classes, and this would just share with TDigest, so it'd be a fork in the tree in the middle somewhere.

AbstractTDigestBase --> AbstractTDigestPercentiles --> Percentiles
                                                  \--> Ranks
                   \--> Boxplot

Unsure if the extra complexity is worth it or not. 🤷‍♂️ But probably too much complexity for this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is up to @imotov but I'm fine waiting on it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at it when it was still IQR and it was immediate dead-end from the inheritance perspective because of IQR was a single value/multivalue agg. It might be a bit easier now but I agree with @nik9000 and still think we should do some refactoring by composition (instead of inheritance) here. However, I would prefer to do it in another PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

if (super.equals(obj) == false) return false;

InternalBoxplot that = (InternalBoxplot) obj;
return Objects.equals(state, that.state);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we include format in this equality check?

Philosophically, I wonder if we should check equality of the boxplot components instead of the state? Like, are two boxplots identical if min/max/q1/q2/q3 are the same but the rest of the distribution is different?

I really dunno heh :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is already included in super.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦‍♂️

expected.add(input.state());
}
assertNotNull(expected);
// The final calculated result may very depending on the order, which requires higher delta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment leftover? E.g. the delta looks like it's zero?


<1> Compression controls memory usage and approximation error

The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


import org.elasticsearch.search.aggregations.metrics.NumericMetricsAggregation;

public interface Boxplot extends NumericMetricsAggregation.MultiValue {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need these any more?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not sure tbh. We might want to keep them around as a bridge to the HLRC? E.g. InternalMin and ParsedMin both implement Min, and it might help us make sure they don't diverge?

Or maybe useless bloat now that the TC is gone and we can get rid of them... but probably better to investigate that in a separate PR and apply it to all aggs if we decide we can remove them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HLRC won't have Boxplot on the classpath because it doesn't have the analytics module. With top_metrics I just pointed folks to ParsedTopMetrics. I'm fine either way. It is just that when I made my new agg I figured I could save myself a little work an not make it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh, you're right. Keep forgetting this is in xpack. 👍 probably not necessary given the current situation

BoxplotAggregationBuilder> {
public static final String NAME = "boxplot";

public static final ParseField COMPRESSION_FIELD = new ParseField("compression");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it'd be better to use the field from the PercentilesAggregationBuilder. That at least is a hint that it means the same thing.

static {
PARSER = new ObjectParser<>(BoxplotAggregationBuilder.NAME);
ValuesSourceParserHelper.declareNumericFields(PARSER, true, true, false);
PARSER.declareDouble((builder, compression) -> builder.compression = compression, COMPRESSION_FIELD);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you've declared the setter you may as well use it here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So BoxplotAggregationBuilder::compression

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also that'd gt your validation too.

import java.util.List;
import java.util.Map;

public class BoxplotAggregator extends NumericMetricsAggregator.MultiValue {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it'd be cleaner to make a class they could both delegate their common behavior to? It'd be cool to do that in a follow up change if it is tricky.


enum Metrics {

min, max, q1, q2, q3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you name these in SHOUTING_SNAKE_CASE like normal Java enums? Is there a super compelling reason not to?

@Override
public double value(String name) {
Metrics metrics = Metrics.valueOf(name);
switch (metrics) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be making a private method on Metrics that takes this as a parameter and calls the right method instead?

return new InternalBoxplot(name, merged, format, pipelineAggregators(), metaData);
}

static class Fields {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that these Fields objects had gone out of favor. But I'm not really in the loop.

@imotov
Copy link
Contributor Author

imotov commented Feb 7, 2020

@elasticmachine test this please

@jtibshirani
Copy link
Contributor

I was wondering if boxplot could be run on the histogram field type? If so we could mention it in the docs, otherwise it could be a nice future extension.

@imotov
Copy link
Contributor Author

imotov commented Feb 7, 2020

@jtibshirani excellent idea! It doesn't support it at the moment. I will open a follow up PR to add this support.

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with it, lgtm!

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm 🚢

@imotov imotov merged commit c50cfa0 into elastic:master Feb 7, 2020
imotov added a commit to imotov/elasticsearch that referenced this pull request Mar 10, 2020
Refactors BoxplotAggregator to reuse parts that are already available in
AbstractTDigestPercentilesAggregator

Follow up for elastic#51948
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Apr 15, 2020
Relates: elastic/elasticsearch#51948

This commit implements the boxplot aggregation.
Integration tests run against XPackCluster because
it requires a license to use.
@imotov imotov deleted the issue-33112-add-boxplot-aggs branch May 1, 2020 22:18
@jakelandis jakelandis removed the v8.0.0 label Jul 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Boxplot Aggregation
7 participants