
Add composite aggregator #26800

Merged 3 commits into elastic:master on Nov 16, 2017
Conversation

@jimczi (Contributor) commented Sep 27, 2017

This change adds a module called aggs-composite that defines a new aggregation named composite.
The composite aggregation is a multi-bucket aggregation that creates composite buckets made of multiple sources.
The sources for each bucket can be defined as:

  • A terms source: values are extracted from a field or a script.
  • A date_histogram source: values are extracted from a date field and rounded to the provided interval.

This aggregation can be used to retrieve all buckets of a deeply nested aggregation by flattening the nested aggregation into composite buckets.
A composite bucket is composed of one value per source and is built for each document from the combination of values in the provided sources.
For instance the following aggregation:
"test_agg": {
  "terms": {
    "field": "field1"
  },
  "aggs": {
    "nested_test_agg":
      "terms": {
        "field": "field2"
      }
  }
}

... which retrieves the top N terms for field1 and, for each top term in field1, the top N terms for field2, can be replaced by a composite aggregation in order to retrieve all the combinations of field1 and field2 in the matching documents:

"composite_agg": {
  "composite": {
    "sources": [
      {
        "terms": {
          "field": "field1"
          }
      },
      {
        "terms": {
          "field": "field2"
        }
      }
     ]
    }
  }

The response of the aggregation looks like this:

"aggregations": {
  "composite_agg": {
    "buckets": [
      {
        "key": [
          "alabama",
          "almanach"
        ],
        "doc_count": 100
      },
      {
        "key": [
          "alabama",
          "calendar"
        ],
        "doc_count": 1
      },
      {
        "key": [
          "arizona",
          "calendar"
        ],
        "doc_count": 1
      }
    ]
  }
}
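
A terms source can also take its values from a script instead of a field, per the source list above. A minimal sketch of what that might look like (the painless script is illustrative, not taken from this PR):

"composite_agg": {
  "composite": {
    "sources": [
      {
        "terms": {
          "script": {
            "source": "doc['field1'].value",
            "lang": "painless"
          }
        }
      }
    ]
  }
}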

By default this aggregation returns the first 10 buckets sorted in ascending order of the composite key.
Pagination can be achieved by providing after values: the values of the composite key to aggregate after.
For instance the following aggregation will aggregate all composite keys that sort after alabama, calendar:

"composite_agg": {
  "composite": {
    "after": ["alabama", "calendar"],
    "size": 100,
    "sources": [
      {
        "terms": {
          "field": "field1"
          }
      },
      {
        "terms": {
          "field": "field2"
        }
      }
     ]
    }
  }

This aggregation is optimized for indices that set an index sort that matches the composite source definition.
For instance the aggregation above could run faster on indices that define an index sort like this:

"settings": {
  "index.sort.field": ["field1", "field2"]
}

In this case the composite aggregation can early terminate on each segment.
This aggregation also accepts multi-valued fields but disables early termination for these fields even if the index sort matches the sources definition.
This is mandatory because index sorting picks only one value per document to perform the sort.
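
Per the docs warning quoted later in this review, the optimization also requires the index sort order to follow the order of the sources. A sketch of matching settings for two ascending sources; the index.sort.order line is an assumption shown only for completeness, since asc is the default:

"settings": {
  "index.sort.field": ["field1", "field2"],
  "index.sort.order": ["asc", "asc"]
}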

For sorted indices, we could jump directly to the documents that sort after the provided after values in each segment in order to speed up collection, but that can be done in a follow-up.

This aggregation also accepts any sub-aggregations and returns the result inside each composite bucket like any other multi-bucket agg:

"composite_agg": {
  "composite": {
    "size": 100,
    "sources": [
      {
        "terms": {
          "field": "field1"
          }
      },
      {
        "terms": {
          "field": "field2"
        }
      }
     ]
    },
    "max_value": {
      "max": {
        "field": "number"
     }
  }
}

Documentation is missing, which is why this is still a WIP.

@martijnvg (Member) left a comment:

In general this change looks good to me. I left some questions and remarks. I'll do another review round later.

This PR is quite large and it would be great if someone else can also take a look at it. Maybe @colings86?

long ord;
while ((ord = dvs.nextOrd()) != NO_MORE_ORDS) {
    values[0] = ord;
    next.collect(doc, 0L);
Member:

Change 0L into bucket?

throw new CollectionTerminatedException();
}
// just skip this key for now
return ;
Member:

nit: remove whitespace?

}
}

private LeafBucketCollector getFirstPassCollector() {
Member:

I think this method and the next method can just return LeafCollector?

The bucket ord isn't used in these implementations. Also these collectors are not directly used by the aggs framework, but are wrapped by a LeafBucketCollector instance in CompositeValuesSource.java.

final LeafBucketCollector collector = array.getLeafCollector(context.ctx, getSecondPassCollector(context.subCollector));
int docID;
while ((docID = docIdSetIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
    collector.collect(docID, 0L);
Member:

maybe use collector.collect(docID); instead?

/**
* A wrapper for {@link ValuesSource} that can record and compare values produced during a collection.
*/
abstract class CompositeValuesSource<VS extends ValuesSource, T extends Comparable<T>> {
Member:

There is some comparing and ordering done here. I wonder if we can incorporate or extend Lucene's PriorityQueue class here?

Contributor Author:

I am using this class to create slots that can be referenced in a priority queue. It's mainly a comparator + a composite array that can update the values. It is equivalent to a FieldComparator but for buckets.

import java.util.Map;
import java.util.TreeMap;

class CompositeAggregator extends BucketsAggregator {
Member:

make class final?

public class CompositeKey {
final Comparable<?>[] values;

public CompositeKey(Comparable<?>... values) {
Member:

Can the constructor be package protected?

Contributor Author:

It can be used to build a composite aggregation so I think it's useful to keep it public for now. It adds the ability to start a composite aggregation from anywhere in the Java client.

return new LeafBucketCollector() {
    @Override
    public void collect(int doc, long bucket) throws IOException {
        if (dvs.advanceExact(doc)) {
Member:

The bucket variable is always zero here, right? If that is the case then maybe we should have assertions for this in these LeafBucketCollector anonymous classes?

private final List<InternalBucket> buckets;
private final int[] reverseMuls;

public InternalComposite(String name, int size, List<InternalBucket> buckets, int[] reverseMuls,
Member:

these constructors can be package protected?

import static org.elasticsearch.test.EqualsHashCodeTestUtils.checkEqualsAndHashCode;
import static org.hamcrest.Matchers.hasSize;

public class CompositeAggregationBuilderTests extends ESTestCase {
Member:

Maybe add a dedicated test case for each builder (extending from AbstractStreamableTestCase)?

Contributor Author:

This is a test for the CompositeAggregationBuilder so it randomizes the different value sources that can be used inside. Splitting the test for each source would defeat the purpose, since the idea is to compose an aggregation from different value sources.

@colings86 (Contributor) left a comment:

I took a first look, will need to review again later.

One thing that I am a little wary of is that we seem to be recreating a lot of the ValuesSource classes here. I am wondering if we can reuse more of the existing ValuesSource builder and parser code to avoid having it in two places? We will want the sources to feel very much like an array of ValuesSource configs, so as far as possible it would be good to use the same code to parse them as we have for the regular aggs.

*/
CompositeKey afterKey();

static XContentBuilder bucketToXContentFragment(CompositeAggregation.Bucket bucket,
Contributor:

nit, can this build the object rather than just the fragment? It looks like it just gets wrapped in an object below?

inner.collect(doc);
}
};
}
Contributor:

I think it's worth adding some Javadocs here to explain how the collection works and what happens in the first pass and the second pass.

Contributor Author:

++

/**
* A {@link ValuesSource} builder for {@link CompositeAggregationBuilder}
*/
public abstract class CompositeValuesSourceBuilder<AB extends CompositeValuesSourceBuilder<AB>> implements Writeable, ToXContentFragment {
Contributor:

This looks very like the ValuesSourceBuilder that already exists. Are we not able to use that here?

Contributor Author:

I tried, but this values source builder is not an AbstractAggregationBuilder. It is used solely to access the values, not to build a complete aggregator. I also need to access the values in a specific way in order to be able to build the combinations of multiple value sources, so my usage is really an edge case.

@jimczi (Contributor Author) commented Nov 7, 2017

@martijnvg @colings86 thanks for reviewing.
I pushed more commits to address your comments.
I also pushed the documentation for this aggregation. @clintongormley can you take a look and validate the API and response?

@clintongormley:

This is looking good. I wonder if sources should be named, eg:

    "sources": [
        { "date": {"date_histogram": { "field": "timestamp", "interval": "1d" }}},
        { "product": {"terms": {"field": "product"}}}
    ]

Then the result would look like this:

{
  ...
  "aggregations": {
    "my_buckets": {
      "buckets": [
        {
          "values": {
            "date": 1494201600000,
            "product": "rocky"
          },
          "doc_count": 1
        },
        {
          "values": {
            "date": 1494288000000,
            "product": "mad max"
          },
          "doc_count": 2
        }
      ]
    }
  }
}

This seems more human-readable, but maybe using a hashmap instead of an array to represent the values is less useful for consumers of this API like Kibana. I'm easy.

The only other thing I'd suggest is to add an introductory paragraph at the beginning of the docs explaining why you would use the composite agg, i.e. extolling its benefits. They're all mentioned on the page, but the user has to read the whole page before they realise how powerful this agg is.

experimental[]

A multi-bucket aggregation that creates composite buckets from different sources.
The buckets are build from the combinations of the values extracted/created for each document and each


build -> built

// CONSOLE

WARNING: The optimization takes effect only if the fields used for sorting are single-valued and follow
the same order than the aggregation (`desc` or `asc`).


than -> as

@colings86 (Contributor) left a comment:

@jimczi I left a few more comments

/**
* Returns a new {@link TermsValuesSourceBuilder}.
*/
public TermsValuesSourceBuilder termsSource() {
Contributor:

These methods could be static? In fact do we need these methods since they just create the source builders directly anyway?

private CompositeKey afterKey;
private int size = 10;

public CompositeAggregationBuilder(String name) {
Contributor:

Should we add a parameter for the sources here since they are required? That way you can't create an invalid instance of this builder? We could also then validate the length of the after key in the setter rather than when doBuild() is called.

return vs;
}

int reverseMul() {
Contributor:

Could you add a Javadoc for this? It took me a while to work out what this actually is, as it wasn't obvious from where it's used.

tzRoundingBuilder = Rounding.builder(TimeValue.timeValueMillis(interval));
}
Rounding rounding = tzRoundingBuilder.build();
return rounding;
Contributor:

Should we have timezone support here?

@polyfractal (Contributor) commented Nov 8, 2017

++ to some kind of ability to name sources. It'll be a lot easier for applications to use if they can refer to names rather than positions. Ditto for the after parameter, maybe a map instead of an ordered array?

"after": {
  "date": 1494201600000, 
  "product": "rocky"
}

Two questions! :)

If a composite key doesn't have any documents, is it returned in results with doc_count: 0, or does that composite key not show up at all?

My understanding is that if index sorting is not enabled, paging through the results with after means it has to re-evaluate all the docs again (since it can't "jump" to the right position). Do you happen to know how expensive that actually is? I'm thinking about the case where you have a very large index and need to deeply page into it. Maybe the cost isn't so bad because doc value evaluation is fast?

@jimczi commented Nov 8, 2017

If a bucket doesn't have any documents, is it returned in results with doc_count: 0, or does that composite key not show up at all?

The composite buckets are created from existing values only.
This aggregation implicitly sets min_doc_count to 1 and this value cannot be changed.
I guess this is more of an issue for histogram and date_histogram sources. Do you think it is worth trying to add support for creating empty buckets for these sources?

My understanding is that if index sorting is not enabled, paging through the results with after means it has to re-evaluate all the docs again (since it can't "jump" to the right position). Do you happen to know how expensive that actually is? I'm thinking about the case where you have a very large index and need to deeply page into it. Maybe the cost isn't so bad because doc value evaluation is fast?

Your understanding is correct. We need to reevaluate all documents on every request, but the memory consumption (dictated by size) remains the same no matter how far you are into the buckets. I think that's ok because this aggregation should be used in a batch context; it's not a realtime thing, so I consider this a safe way to dump all buckets from a complex aggregation tree without killing your cluster. We have some ideas to make it faster when the index is not sorted (related to #23022), but for the moment the gap between a sorted index and a non-sorted index can be orders of magnitude.
It also depends on the number of buckets that you can handle per query; in a sense this is the same as scrolling an index with a specific sort that does not depend on doc_id, as sketched below.
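
For comparison, a sketch of that scrolling analogy using search_after with an explicit sort; the index and field names reuse the earlier example and are illustrative:

GET /index/_search
{
  "size": 100,
  "sort": [
    { "field1": "asc" },
    { "field2": "asc" }
  ],
  "search_after": ["alabama", "calendar"]
}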

@polyfractal:

Thanks @jimczi!

Do you think it is worth trying to add the support to create empty buckets for these sources ?

I'm not sure... probably not. I was mostly just curious, not sure it's really needed. The main reason we went to min_doc_count: 0 was for pipeline aggs which isn't really a consideration here.

Your understanding is correct. We need to reevaluate all documents on every requests but the memory consumption (dictated by size) remains the same no matter how far you are in the buckets.

Good to know, thanks for the explanation. I'll run some tests here locally to get a feel for the impact. :)

@clintongormley:

The main reason we went to min_doc_count: 0 was for pipeline aggs which isn't really a consideration here.

Does this mean that pipeline aggs are not supported under the composite agg?

@colings86:

@clintongormley it'll mean that pipeline aggregations like derivative, moving_average and cumulative_sum, which only work with histogram or date_histogram aggregations, will not work, but we wouldn't be able to make them work across pages anyway since they would need to maintain state across requests. Pipeline aggregations like bucket_selector and bucket_script I think will still work; a sketch follows.
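
A sketch of what that could look like, with a bucket_selector pruning composite buckets by a metric sub-aggregation; the product/price fields and the threshold are made up for illustration:

"my_buckets": {
  "composite": {
    "sources": [
      { "product": { "terms": { "field": "product" } } }
    ]
  },
  "aggs": {
    "avg_price": { "avg": { "field": "price" } },
    "price_filter": {
      "bucket_selector": {
        "buckets_path": { "avgPrice": "avg_price" },
        "script": "params.avgPrice > 100"
      }
    }
  }
}

Note that the selector would only prune buckets within the current page of composite results.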

@@ -546,7 +547,8 @@ private void parse(ParseContext parseContext, Token token, XContentParser parser
         }
     }
 } else {
-    throw new ElasticsearchParseException("failed to parse expected text or object got" + token.name());
+    throw new ParsingException(parser.getTokenLocation(), "failed to parse expected text or object got " + token.name());
Member:

This change looks unrelated? and I think was addressed in a different PR.

…aggregation named `composite`.

The `composite` aggregation is a multi-bucket aggregation that creates composite buckets made of multiple sources.
The sources for each bucket can be defined as:
  * A `terms` source: values are extracted from a field or a script.
  * A `date_histogram` source: values are extracted from a date field and rounded to the provided interval.
This aggregation can be used to retrieve all buckets of a deeply nested aggregation by flattening the nested aggregation into composite buckets.
A composite bucket is composed of one value per source and is built for each document from the combination of values in the provided sources.
For instance the following aggregation:

````
"test_agg": {
  "terms": {
    "field": "field1"
  },
  "aggs": {
    "nested_test_agg": {
      "terms": {
        "field": "field2"
      }
    }
  }
}
````
... which retrieves the top N terms for `field1` and, for each top term in `field1`, the top N terms for `field2`, can be replaced by a `composite` aggregation in order to retrieve **all** the combinations of `field1` and `field2` in the matching documents:

````
"composite_agg": {
  "composite": {
    "sources": [
      {
        "field1": {
          "terms": {
            "field": "field1"
          }
        }
      },
      {
        "field2": {
          "terms": {
            "field": "field2"
          }
        }
      }
    ]
  }
}
````

The response of the aggregation looks like this:

````
"aggregations": {
  "composite_agg": {
    "buckets": [
      {
        "key": {
          "field1": "alabama",
          "field2": "almanach"
        },
        "doc_count": 100
      },
      {
        "key": {
          "field1": "alabama",
          "field2": "calendar"
        },
        "doc_count": 1
      },
      {
        "key": {
          "field1": "arizona",
          "field2": "calendar"
        },
        "doc_count": 1
      }
    ]
  }
}
````

By default this aggregation returns the first 10 buckets sorted in ascending order of the composite key.
Pagination can be achieved by providing `after` values: the values of the composite key to aggregate after.
For instance the following aggregation will aggregate all composite keys that sort after `alabama, calendar`:

````
"composite_agg": {
  "composite": {
    "after": {"field1": "alabama", "field2": "calendar"},
    "size": 100,
    "sources": [
      {
        "field1": {
          "terms": {
            "field": "field1"
          }
        }
      },
      {
        "field2": {
          "terms": {
            "field": "field2"
          }
        }
      }
    ]
  }
}
````

This aggregation is optimized for indices that set an index sort that matches the composite source definition.
For instance the aggregation above could run faster on indices that define an index sort like this:

````
"settings": {
  "index.sort.field": ["field1", "field2"]
}
````

In this case the `composite` aggregation can early terminate on each segment.
This aggregation also accepts multi-valued fields but disables early termination for these fields even if the index sort matches the sources definition.
This is mandatory because index sorting picks only one value per document to perform the sort.

another iter

docs

Add tests for time zone support

docs

docs
@jimczi commented Nov 13, 2017

I pushed another iteration to address reviews.
The main changes are:

  • The composite key is now rendered as a map (thanks @clintongormley!)
  • The date_histogram source handles time_zone (see the sketch below)
  • Updated documentation
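
A minimal sketch of what a date_histogram source with a time_zone could look like; the field name and zone are illustrative:

"sources": [
  {
    "date": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1d",
        "time_zone": "America/New_York"
      }
    }
  }
]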

@martijnvg can you take another look?

@martijnvg (Member) left a comment:

Left two small comments. LGTM otherwise!

return Arrays.hashCode(values);
}

static String formatValue(Object value, DocValueFormat formatter) {
Member:

removed? It does not seem to be used.

@@ -182,7 +182,7 @@ public TypeParser() {
     protected FormatDateTimeFormatter dateTimeFormatter;
     protected DateMathParser dateMathParser;

-    DateFieldType() {
+    public DateFieldType() {
Member:

I prefer to keep this package protected. I think in CompositeAggregatorTests#setUp() we should do this instead:

DateFieldMapper.Builder builder = new DateFieldMapper.Builder("date");
builder.docValues(true);
DateFieldMapper fieldMapper = builder.build(new Mapper.BuilderContext(createIndexSettings().getSettings(), new ContentPath(0)));
FIELD_TYPES[3] = fieldMapper.fieldType();

@jimczi jimczi merged commit 623367d into elastic:master Nov 16, 2017
@jimczi jimczi deleted the aggs-composite branch November 16, 2017 14:13
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Nov 16, 2017
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Nov 16, 2017
jimczi added a commit that referenced this pull request Nov 16, 2017
jimczi added a commit that referenced this pull request Nov 16, 2017
jasontedor added a commit that referenced this pull request Nov 16, 2017
* master:
  Stop skipping REST test after backport of #27056
  Fix default value of ignore_unavailable for snapshot REST API (#27056)
  Add composite aggregator (#26800)
  Fix `ShardSplittingQuery` to respect nested documents. (#27398)
  [Docs] Restore section about multi-level parent/child relation in parent-join (#27392)
  Add TcpChannel to unify Transport implementations (#27132)
  Add note on plugin distributions in plugins folder
  Remove implementations of `TransportChannel` (#27388)
  Update Google SDK to version 1.23 (#27381)
  Fix Gradle 4.3.1 compatibility for logging (#27382)
  [Test] Change Elasticsearch startup timeout to 120s in packaging tests
  Docs/windows installer (#27369)
jimczi added a commit that referenced this pull request Nov 20, 2017
jimczi added a commit that referenced this pull request Nov 20, 2017
jasontedor added a commit that referenced this pull request Nov 20, 2017
* master: (31 commits)
  [TEST] Fix `GeoShapeQueryTests#testPointsOnly` failure
  Transition transport apis to use void listeners (#27440)
  AwaitsFix GeoShapeQueryTests#testPointsOnly #27454
  Bump test version after backport
  Ensure nested documents have consistent version and seq_ids (#27455)
  Tests: Add Fedora-27 to packaging tests
  Delete some seemingly unused exceptions (#27439)
  #26800: Fix docs rendering
  Remove config prompting for secrets and text (#27216)
  Move the CLI into its own subproject (#27114)
  Correct usage of "an" to "a" in getting started docs
  Avoid NPE when getting build information
  Removes BWC snapshot status handler used in 6.x (#27443)
  Remove manual tracking of registered channels (#27445)
  Remove parameters on HandshakeResponseHandler (#27444)
  [GEO] fix pointsOnly bug for MULTIPOINT
  Standardize underscore requirements in parameters (#27414)
  peanut butter hamburgers
  Log primary-replica resync failures
  Uses TransportMasterNodeAction to update shard snapshot status (#27165)
  ...
jasontedor added a commit that referenced this pull request Nov 20, 2017
* 6.x: (41 commits)
  [TEST] Fix `GeoShapeQueryTests#testPointsOnly` failure
  Transition transport apis to use void listeners (#27440)
  AwaitsFix GeoShapeQueryTests#testPointsOnly #27454
  Ensure nested documents have consistent version and seq_ids (#27455)
  Tests: Add Fedora-27 to packaging tests
  #26800: Fix docs rendering
  Move the CLI into its own subproject (#27114)
  Correct usage of "an" to "a" in getting started docs
  Avoid NPE when getting build information
  Remove manual tracking of registered channels (#27445)
  Standardize underscore requirements in parameters (#27414)
  Remove parameters on HandshakeResponseHandler (#27444)
  [GEO] fix pointsOnly bug for MULTIPOINT
  peanut butter hamburgers
  Uses TransportMasterNodeAction to update shard snapshot status (#27165)
  Log primary-replica resync failures
  Add limits for ngram and shingle settings (#27411)
  Enforce a minimum task execution and service time of 1 nanosecond
  Fix place-holder in allocation decider messages (#27436)
  Remove newline from log message (#27425)
  ...
@jimczi jimczi removed the review label Nov 21, 2017
jimczi added a commit that referenced this pull request Nov 21, 2017
jimczi added a commit that referenced this pull request Nov 21, 2017
Exclude "key" field from random modifications in tests, the composite agg uses
an array of object for bucket key and values are checked.
Relates #26800
jimczi added a commit that referenced this pull request Nov 21, 2017
Exclude "key" field from random modifications in tests, the composite agg uses
an array of object for bucket key and values are checked.
Relates #26800
5 participants