Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: Implement FIRST/LAST aggregate functions #37936

Merged
merged 11 commits into from
Jan 31, 2019
126 changes: 126 additions & 0 deletions docs/reference/sql/functions/aggs.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -113,6 +113,129 @@ Returns the total number of _distinct non-null_ values in input values.
include-tagged::{sql-specs}/docs.csv-spec[aggCountDistinct]
--------------------------------------------------

[[sql-functions-aggs-first]]
===== `FIRST/FIRST_VALUE`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally please add an example for FIRST_VALUE.


.Synopsis:
[source, sql]
--------------------------------------------------
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for optional arguments, it would be nice to add an example with FIRST just with the first argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, for synopsis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@costin our documentation, until now, lists one version of a function in synopsis where the optional arguments are between square brackets. See https://github.com/elastic/elasticsearch/blob/master/docs/reference/sql/functions/string.asciidoc#locate and https://github.com/elastic/elasticsearch/blob/master/docs/reference/sql/functions/math.asciidoc#round (and others). Listing all versions of a function (with/without optional params) imo would look a bit crowded and not-so-clean and sleek. In this particular case of FIRST/FIRST_VALUE/LAST/LAST_VALUE means we need to list four versions of the same function in the synopsis and I personally believe it doesn't look that great.

Do you believe that, maybe, it is not clear that parameter is optional? (the square brackets and the description saying "optional")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the optional parameter but the method name that I'd like to be exemplified through a query to make the function name more obvious.
It might just be easier to create a separate section for the alias, mentioning that it's an alias for ...
This can be done in a separate PR though since it's just documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIRST(field_name<1>, sort_by_field_name<2>)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sort_by_field_name -> ordering field

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If sort_by_field_name is optional, I think you need to put it between square brackets to remain consistent with the rest of the documentation.

--------------------------------------------------

*Input*:

<1> a field name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<1> target field for the aggregation
<2> optional field used for ordering

<2> a field name; optional

*Output*: same type as the input

.Description:

When only one argument is provided it returns the first **non-NULL** value across input values in the field
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"When only the first argument is provided".... (only the second one is optional).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first non-null value is returned regardless of the presence of the optional argument.
I would rephrase this to "Returns the first non-NULL value (if it exists) of the input column sorted by the sort_column. If sort_column is missing, the field_name column is used for sorting`. or something along those lines.

`field_name`. It will return **NULL** only if all values in `field_name` are null. When a second argument
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe it's clear what "first non-NULL value across input values" means... It is clear for the second part of the description (when the optional argument is used), but not clear for the first part: the values are sorted by sort_by_field_name ascending and then the first value is returned, but there is nothing mentioned about ordering in case the second optional argument is missing.

is provided then it returns the first **non-NULL** value across input values in the field `field_name` ordered
ascending by the **non-NULL** values of `sort_by_field_name`. E.g.:

[cols="<,<"]
|===
s|a|b
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use indentation to have the table columns aligned.

| 100 | 1
| 200 | 1
| 1 | 2
| 2 | 2
| 10 | null
| 20 | null
|
|===

[source, sql]
-------------------------
SELECT FIRST(a, b) FROM t
-------------------------

will result in:
[cols="<"]
|===
s|FIRST(a, b)
| 100
|===


["source","sql",subs="attributes,macros"]
-----------------------------------------------------------
include-tagged::{sql-specs}/docs.csv-spec[firstWithOneArg]
-----------------------------------------------------------

["source","sql",subs="attributes,macros"]
-----------------------------------------------------------
include-tagged::{sql-specs}/docs.csv-spec[firstWithTwoArgs]
-----------------------------------------------------------

[NOTE]
`FIRST` cannot be used in to create a filter in a `HAVING` clause of a `GROUP BY` query.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "in" is not needed here?
Also, worth putting this as a limitation in the docs? (meaning, in the dedicated Limitations page)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIRST cannot be used in a HAVING clause (no need to specify GROUP BY as that's understood and also might be implicit).


[[sql-functions-aggs-last]]
===== `LAST/LAST_VALUE`

.Synopsis:
[source, sql]
--------------------------------------------------
LAST(field_name<1>, sort_by_field_name<2>)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above for optional fields and square brackets.

--------------------------------------------------

*Input*:

<1> a field name
<2> a field name; optional

*Output*: same type as the input

.Description:

It's the inverse of <<sql-functions-aggs-first>>. When only one argument is provided it returns the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"first argument" instead of "one argument".

last **non-NULL** value across input values in the field `field_name`. It will return **NULL** only if
all values in `field_name` are null. When a second argument is provided then it returns the last
**non-NULL** value across input values in the field `field_name` ordered descending by the **non-NULL**
values of `sort_by_field_name`. E.g.:

[cols="<,<"]
|===
s|a|b
| 10 | 1
| 20 | 1
| 1 | 2
| 2 | 2
| 100 | null
| 200 | null
|===

[source, sql]
------------------------
SELECT LAST(a, b) FROM t
------------------------

will result in:
[cols="<"]
|===
s|LAST(a, b)
| 2
|===


["source","sql",subs="attributes,macros"]
-----------------------------------------------------------
include-tagged::{sql-specs}/docs.csv-spec[lastWithOneArg]
-----------------------------------------------------------

["source","sql",subs="attributes,macros"]
-----------------------------------------------------------
include-tagged::{sql-specs}/docs.csv-spec[lastWithTwoArgs]
-----------------------------------------------------------

[NOTE]
`LAST` cannot be used in to create a filter in a `HAVING` clause of a `GROUP BY` query.


[[sql-functions-aggs-max]]
===== `MAX`

Expand Down Expand Up @@ -161,6 +284,9 @@ Returns the minimum value across input values in the field `field_name`.
include-tagged::{sql-specs}/docs.csv-spec[aggMin]
--------------------------------------------------

[NOTE]
`MIN` on a field of type <<text, `text`>> or <<keyword, `keyword`>> is translated into <<sql-functions-aggs-first>>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I'm assuming MIN/MAX on keywords cannot be used in HAVINGs...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's why I made the link here. Do you think I should add the info about HAVING here too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please.


[[sql-functions-aggs-sum]]
===== `SUM`

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,10 @@ public void testShowFunctions() throws IOException {
assertThat(readLine(), containsString(HEADER_SEPARATOR));
assertThat(readLine(), RegexMatcher.matches("\\s*AVG\\s*\\|\\s*AGGREGATE\\s*"));
assertThat(readLine(), RegexMatcher.matches("\\s*COUNT\\s*\\|\\s*AGGREGATE\\s*"));
assertThat(readLine(), RegexMatcher.matches("\\s*FIRST\\s*\\|\\s*AGGREGATE\\s*"));
assertThat(readLine(), RegexMatcher.matches("\\s*FIRST_VALUE\\s*\\|\\s*AGGREGATE\\s*"));
assertThat(readLine(), RegexMatcher.matches("\\s*LAST\\s*\\|\\s*AGGREGATE\\s*"));
assertThat(readLine(), RegexMatcher.matches("\\s*LAST_VALUE\\s*\\|\\s*AGGREGATE\\s*"));
assertThat(readLine(), RegexMatcher.matches("\\s*MAX\\s*\\|\\s*AGGREGATE\\s*"));
assertThat(readLine(), RegexMatcher.matches("\\s*MIN\\s*\\|\\s*AGGREGATE\\s*"));
String line = readLine();
Expand Down Expand Up @@ -58,6 +62,8 @@ public void testShowFunctions() throws IOException {
public void testShowFunctionsLikePrefix() throws IOException {
assertThat(command("SHOW FUNCTIONS LIKE 'L%'"), RegexMatcher.matches("\\s*name\\s*\\|\\s*type\\s*"));
assertThat(readLine(), containsString(HEADER_SEPARATOR));
assertThat(readLine(), RegexMatcher.matches("\\s*LAST\\s*\\|\\s*AGGREGATE\\s*"));
assertThat(readLine(), RegexMatcher.matches("\\s*LAST_VALUE\\s*\\|\\s*AGGREGATE\\s*"));
assertThat(readLine(), RegexMatcher.matches("\\s*LEAST\\s*\\|\\s*CONDITIONAL\\s*"));
assertThat(readLine(), RegexMatcher.matches("\\s*LOG\\s*\\|\\s*SCALAR\\s*"));
assertThat(readLine(), RegexMatcher.matches("\\s*LOG10\\s*\\|\\s*SCALAR\\s*"));
Expand Down
33 changes: 33 additions & 0 deletions x-pack/plugin/sql/qa/src/main/resources/agg.csv-spec
Original file line number Diff line number Diff line change
Expand Up @@ -373,3 +373,36 @@ SELECT COUNT(ALL last_name)=COUNT(ALL first_name) AS areEqual, COUNT(ALL first_n
---------------+---------------+---------------
false |90 |100
;

topHitsWithOneArgAndGroupBy
schema::gender:s|first:s|last:s
SELECT gender, FIRST(first_name) as first, LAST(first_name) as last FROM test_emp GROUP BY gender ORDER BY gender;

gender | first | last
---------------+---------------+------------
null | Berni | Patricio
F | Alejandro | Xinglin
M | Amabile | Zvonko
;

topHitsWithTwoArgsAndGroupBy
schema::gender:s|first:s|last:s
SELECT gender, FIRST(first_name, birth_date) as first, LAST(first_name, birth_date) as last FROM test_emp GROUP BY gender ORDER BY gender;

gender | first | last
---------------+---------------+-----------------
null | Lillian | Eberhardt
F | Sumant | Valdiodio
M | Remzi | Hilari
;

topHitsOnDatetime
schema::gender:s|first:i|last:i
SELECT gender, month(first(birth_date, languages)) first, month(last(birth_date, languages)) last FROM test_emp GROUP BY gender ORDER BY gender;

gender | first | last
---------------+---------------+---------------
null | 1 | 10
F | 4 | 6
M | 1 | 4
;
8 changes: 6 additions & 2 deletions x-pack/plugin/sql/qa/src/main/resources/command.csv-spec
Original file line number Diff line number Diff line change
Expand Up @@ -8,8 +8,12 @@ SHOW FUNCTIONS;

name:s | type:s
AVG |AGGREGATE
COUNT |AGGREGATE
MAX |AGGREGATE
COUNT |AGGREGATE
FIRST |AGGREGATE
FIRST_VALUE |AGGREGATE
LAST |AGGREGATE
LAST_VALUE |AGGREGATE
MAX |AGGREGATE
MIN |AGGREGATE
SUM |AGGREGATE
KURTOSIS |AGGREGATE
Expand Down
57 changes: 56 additions & 1 deletion x-pack/plugin/sql/qa/src/main/resources/docs.csv-spec
Original file line number Diff line number Diff line change
Expand Up @@ -185,7 +185,11 @@ SHOW FUNCTIONS;
name | type
-----------------+---------------
AVG |AGGREGATE
COUNT |AGGREGATE
COUNT |AGGREGATE
FIRST |AGGREGATE
FIRST_VALUE |AGGREGATE
LAST |AGGREGATE
LAST_VALUE |AGGREGATE
MAX |AGGREGATE
MIN |AGGREGATE
SUM |AGGREGATE
Expand Down Expand Up @@ -699,6 +703,8 @@ SELECT MIN(salary) AS min, MAX(salary) AS max FROM emp HAVING min > 25000;
// end::groupByHavingImplicitNoMatch
//;



///////////////////////////////
//
// Grouping
Expand Down Expand Up @@ -998,6 +1004,55 @@ SELECT COUNT(DISTINCT hire_date) unique_hires, COUNT(hire_date) AS hires FROM em
// end::aggCountDistinct
;

firstWithOneArg
schema::FIRST(first_name):s
// tag::firstWithOneArg
SELECT FIRST(first_name) FROM emp;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the examples would be more meaningful along with a GROUP BY.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally we don't have group by for min/max/avg/kurtosis etc. examples

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have at least one query with GROUP BY for each agg - the idea is to underline that this is in fact an agg (which might not be obvious when using implicit grouping).


FIRST(first_name)
-----------------
Alejandro

// end::firstWithOneArg
;

firstWithTwoArgs
schema::FIRST(first_name, birth_date):s
// tag::firstWithTwoArgs
SELECT FIRST(first_name, birth_date) FROM emp;

FIRST(first_name, birth_date)
-----------------------------
Remzi

// end::firstWithTwoArgs
;

lastWithOneArg
schema::LAST(first_name):s
// tag::lastWithOneArg
SELECT LAST(first_name) FROM emp;

LAST(first_name)
---------------
Zvonko

// end::lastWithOneArg
;


lastWithTwoArgs
schema::LAST(first_name, birth_date):s
// tag::lastWithTwoArgs
SELECT LAST(first_name, birth_date) FROM emp;

LAST(first_name, birth_date)
---------------------------
Hilari

// end::lastWithTwoArgs
;

aggMax
// tag::aggMax
SELECT MAX(salary) AS max FROM emp;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
import org.elasticsearch.xpack.sql.expression.function.Functions;
import org.elasticsearch.xpack.sql.expression.function.Score;
import org.elasticsearch.xpack.sql.expression.function.aggregate.AggregateFunctionAttribute;
import org.elasticsearch.xpack.sql.expression.function.aggregate.TopHits;
import org.elasticsearch.xpack.sql.expression.function.grouping.GroupingFunctionAttribute;
import org.elasticsearch.xpack.sql.expression.function.scalar.ScalarFunction;
import org.elasticsearch.xpack.sql.expression.predicate.conditional.ConditionalFunction;
Expand Down Expand Up @@ -366,16 +367,26 @@ private static boolean checkGroupByHaving(LogicalPlan p, Set<Failure> localFailu
if (f.child() instanceof Aggregate) {
Aggregate a = (Aggregate) f.child();

Map<Expression, Node<?>> missing = new LinkedHashMap<>();
Set<Expression> missing = new LinkedHashSet<>();
Set<Expression> unsupported = new LinkedHashSet<>();
Expression condition = f.condition();
// variation of checkGroupMatch customized for HAVING, which requires just aggregations
condition.collectFirstChildren(c -> checkGroupByHavingHasOnlyAggs(c, condition, missing, functions));
condition.collectFirstChildren(c -> checkGroupByHavingHasOnlyAggs(c, missing, unsupported, functions));

if (!missing.isEmpty()) {
String plural = missing.size() > 1 ? "s" : StringUtils.EMPTY;
localFailures.add(
fail(condition, "Cannot use HAVING filter on non-aggregate" + plural + " %s; use WHERE instead",
Expressions.names(missing.keySet())));
Expressions.names(missing)));
groupingFailures.add(a);
return false;
}

if (!unsupported.isEmpty()) {
String plural = unsupported.size() > 1 ? "s" : StringUtils.EMPTY;
localFailures.add(
fail(condition, "HAVING filter is unsupported for function" + plural + " %s",
Expressions.names(unsupported)));
groupingFailures.add(a);
return false;
}
Expand All @@ -385,8 +396,8 @@ private static boolean checkGroupByHaving(LogicalPlan p, Set<Failure> localFailu
}


private static boolean checkGroupByHavingHasOnlyAggs(Expression e, Node<?> source,
Map<Expression, Node<?>> missing, Map<String, Function> functions) {
private static boolean checkGroupByHavingHasOnlyAggs(Expression e, Set<Expression> missing,
Set<Expression> unsupported, Map<String, Function> functions) {

// resolve FunctionAttribute to backing functions
if (e instanceof FunctionAttribute) {
Expand All @@ -407,13 +418,17 @@ private static boolean checkGroupByHavingHasOnlyAggs(Expression e, Node<?> sourc

// unwrap function to find the base
for (Expression arg : sf.arguments()) {
arg.collectFirstChildren(c -> checkGroupByHavingHasOnlyAggs(c, source, missing, functions));
arg.collectFirstChildren(c -> checkGroupByHavingHasOnlyAggs(c, missing, unsupported, functions));
}
return true;

} else if (e instanceof Score) {
// Score can't be used for having
missing.put(e, source);
// Score can't be used in having
unsupported.add(e);
return true;
} else if (e instanceof TopHits) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this will not catch MAX(keyword) pass by.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right as the Verifier is checking before the optimisations.

// First and last cannot be used in having
unsupported.add(e);
return true;
}

Expand All @@ -428,7 +443,7 @@ private static boolean checkGroupByHavingHasOnlyAggs(Expression e, Node<?> sourc

// left without leaves which have to match; that's a failure since everything should be based on an agg
if (e instanceof Attribute) {
missing.put(e, source);
missing.add(e);
return true;
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@
import org.elasticsearch.xpack.sql.execution.search.extractor.FieldHitExtractor;
import org.elasticsearch.xpack.sql.execution.search.extractor.HitExtractor;
import org.elasticsearch.xpack.sql.execution.search.extractor.MetricAggExtractor;
import org.elasticsearch.xpack.sql.execution.search.extractor.TopHitsAggExtractor;
import org.elasticsearch.xpack.sql.expression.gen.pipeline.AggExtractorInput;
import org.elasticsearch.xpack.sql.expression.gen.pipeline.AggPathInput;
import org.elasticsearch.xpack.sql.expression.gen.pipeline.HitExtractorInput;
Expand All @@ -45,6 +46,7 @@
import org.elasticsearch.xpack.sql.querydsl.container.QueryContainer;
import org.elasticsearch.xpack.sql.querydsl.container.ScriptFieldRef;
import org.elasticsearch.xpack.sql.querydsl.container.SearchHitFieldRef;
import org.elasticsearch.xpack.sql.querydsl.container.TopHitsAggRef;
import org.elasticsearch.xpack.sql.session.Configuration;
import org.elasticsearch.xpack.sql.session.Rows;
import org.elasticsearch.xpack.sql.session.SchemaRowSet;
Expand Down Expand Up @@ -276,6 +278,11 @@ private BucketExtractor createExtractor(FieldExtraction ref, BucketExtractor tot
return new MetricAggExtractor(r.name(), r.property(), r.innerKey());
}

if (ref instanceof TopHitsAggRef) {
TopHitsAggRef r = (TopHitsAggRef) ref;
return new TopHitsAggExtractor(r.name(), r.fieldDataType());
}

if (ref == GlobalCountRef.INSTANCE) {
return totalCount;
}
Expand Down
Loading