SQL: Implement FIRST/LAST aggregate functions (#37936)

FIRST and LAST can be used with one argument, in which case they work similarly to MIN and MAX. Because they are implemented with a Top Hits aggregation, they can also operate on keyword fields.

When a second argument is provided, they return the first/last value of the first argument when its values are ordered ascending/descending (respectively) by the values of the second argument.

Currently, because of the Top Hits aggregation, FIRST and LAST cannot be used in the HAVING clause of a GROUP BY query to filter on the results of the aggregation.

Closes: #35639
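The semantics described above can be illustrated outside Elasticsearch with a small Java sketch. It is not the Top Hits based implementation from this commit; it only replays the two example tables from the aggs.asciidoc changes in the diff. The class `FirstLastSketch`, the `Row` type, and the null and tie handling (null target values are skipped, nulls in the ordering column sort last, ties fall back to the target column) are assumptions inferred from those documented examples.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only: models the documented FIRST/LAST behaviour in memory.
final class FirstLastSketch {

    /** One table row: target column a and ordering column b; either may be null. */
    static final class Row {
        private final Integer a;
        private final Integer b;

        Row(Integer a, Integer b) {
            this.a = a;
            this.b = b;
        }

        Integer a() { return a; }
        Integer b() { return b; }
    }

    /** FIRST(a) when orderByB is false, FIRST(a, b) when it is true. */
    static Integer first(List<Row> rows, boolean orderByB) {
        return pick(rows, orderByB, true);
    }

    /** LAST(a) when orderByB is false, LAST(a, b) when it is true. */
    static Integer last(List<Row> rows, boolean orderByB) {
        return pick(rows, orderByB, false);
    }

    private static Integer pick(List<Row> rows, boolean orderByB, boolean ascending) {
        Comparator<Integer> dir = ascending
            ? Comparator.<Integer>naturalOrder()
            : Comparator.<Integer>reverseOrder();
        // Assumption: ties on the ordering column are broken by the target column.
        Comparator<Row> byA = Comparator.comparing(Row::a, dir);
        // Assumption: rows with a null ordering value sort after all others.
        Comparator<Row> byB = Comparator.comparing(Row::b, Comparator.nullsLast(dir));
        Comparator<Row> order = orderByB ? byB.thenComparing(byA) : byA;

        return rows.stream()
            .filter(r -> r.a() != null)   // only non-null target values qualify
            .min(order)                   // the head of the sorted rows
            .map(Row::a)
            .orElse(null);                // no non-null value exists
    }

    public static void main(String[] args) {
        // Example table from the FIRST documentation in this commit.
        List<Row> t = Arrays.asList(
            new Row(100, 1), new Row(200, 1), new Row(1, 2), new Row(2, 2),
            new Row(10, null), new Row(20, null), new Row(null, null));
        System.out.println("FIRST(a)    = " + first(t, false));
        System.out.println("FIRST(a, b) = " + first(t, true));
    }
}
```

Some csv-spec cases in this commit (for example `topHitsWithTwoArgsAndGroupByWithNullsOnTargetField`) show `null` results, so the null filtering in this sketch follows the documentation prose and should not be read as a statement about the real implementation.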
elastic · Jan 31, 2019 · 4710a74
1 parent 7487be3
Showing 34 changed files with 1,201 additions and 99 deletions.
diff --git a/docs/reference/sql/functions/aggs.asciidoc b/docs/reference/sql/functions/aggs.asciidoc
@@ -113,6 +113,196 @@ Returns the total number of _distinct non-null_ values in input values.
 include-tagged::{sql-specs}/docs.csv-spec[aggCountDistinct]
 --------------------------------------------------
 
+[[sql-functions-aggs-first]]
+===== `FIRST/FIRST_VALUE`
+
+.Synopsis:
+[source, sql]
+----------------------------------------------
+FIRST(field_name<1>[, ordering_field_name]<2>)
+----------------------------------------------
+
+*Input*:
+
+<1> target field for the aggregation
+<2> optional field used for ordering
+
+*Output*: same type as the input
+
+.Description:
+
+Returns the first **non-NULL** value (if such exists) of the `field_name` input column sorted by
+the `ordering_field_name` column. If `ordering_field_name` is not provided, only the `field_name`
+column is used for the sorting. E.g.:
+
+[cols="<,<"]
+|===
+s| a    | b
+
+ | 100  | 1
+ | 200  | 1
+ | 1    | 2
+ | 2    | 2
+ | 10   | null
+ | 20   | null
+ | null | null
+|===
+
+[source, sql]
+----------------------
+SELECT FIRST(a) FROM t
+----------------------
+
+will result in:
+[cols="<"]
+|===
+s| FIRST(a)
+ | 1
+|===
+
+and
+
+[source, sql]
+-------------------------
+SELECT FIRST(a, b) FROM t
+-------------------------
+
+will result in:
+[cols="<"]
+|===
+s| FIRST(a, b)
+ | 100
+|===
+
+
+["source","sql",subs="attributes,macros"]
+-----------------------------------------------------------
+include-tagged::{sql-specs}/docs.csv-spec[firstWithOneArg]
+-----------------------------------------------------------
+
+["source","sql",subs="attributes,macros"]
+--------------------------------------------------------------------
+include-tagged::{sql-specs}/docs.csv-spec[firstWithOneArgAndGroupBy]
+--------------------------------------------------------------------
+
+["source","sql",subs="attributes,macros"]
+-----------------------------------------------------------
+include-tagged::{sql-specs}/docs.csv-spec[firstWithTwoArgs]
+-----------------------------------------------------------
+
+["source","sql",subs="attributes,macros"]
+---------------------------------------------------------------------
+include-tagged::{sql-specs}/docs.csv-spec[firstWithTwoArgsAndGroupBy]
+---------------------------------------------------------------------
+
+`FIRST_VALUE` is a name alias and can be used instead of `FIRST`, e.g.:
+
+["source","sql",subs="attributes,macros"]
+--------------------------------------------------------------------------
+include-tagged::{sql-specs}/docs.csv-spec[firstValueWithTwoArgsAndGroupBy]
+--------------------------------------------------------------------------
+
+[NOTE]
+`FIRST` cannot be used in a HAVING clause.
+[NOTE]
+`FIRST` cannot be used with columns of type <<text, `text`>> unless
+the field is also <<before-enabling-fielddata,saved as a keyword>>.
+
+[[sql-functions-aggs-last]]
+===== `LAST/LAST_VALUE`
+
+.Synopsis:
+[source, sql]
+--------------------------------------------------
+LAST(field_name<1>[, ordering_field_name]<2>)
+--------------------------------------------------
+
+*Input*:
+
+<1> target field for the aggregation
+<2> optional field used for ordering
+
+*Output*: same type as the input
+
+.Description:
+
+It's the inverse of <<sql-functions-aggs-first>>. Returns the last **non-NULL** value (if such exists) of the
+`field_name` input column sorted descending by the `ordering_field_name` column. If `ordering_field_name` is not
+provided, only the `field_name` column is used for the sorting. E.g.:
+
+[cols="<,<"]
+|===
+s| a    | b
+
+ | 10   | 1
+ | 20   | 1
+ | 1    | 2
+ | 2    | 2
+ | 100  | null
+ | 200  | null
+ | null | null
+|===
+
+[source, sql]
+------------------------
+SELECT LAST(a) FROM t
+------------------------
+
+will result in:
+[cols="<"]
+|===
+s| LAST(a)
+ | 200
+|===
+
+and
+
+[source, sql]
+------------------------
+SELECT LAST(a, b) FROM t
+------------------------
+
+will result in:
+[cols="<"]
+|===
+s| LAST(a, b)
+ | 2
+|===
+
+
+["source","sql",subs="attributes,macros"]
+-----------------------------------------------------------
+include-tagged::{sql-specs}/docs.csv-spec[lastWithOneArg]
+-----------------------------------------------------------
+
+["source","sql",subs="attributes,macros"]
+-------------------------------------------------------------------
+include-tagged::{sql-specs}/docs.csv-spec[lastWithOneArgAndGroupBy]
+-------------------------------------------------------------------
+
+["source","sql",subs="attributes,macros"]
+-----------------------------------------------------------
+include-tagged::{sql-specs}/docs.csv-spec[lastWithTwoArgs]
+-----------------------------------------------------------
+
+["source","sql",subs="attributes,macros"]
+--------------------------------------------------------------------
+include-tagged::{sql-specs}/docs.csv-spec[lastWithTwoArgsAndGroupBy]
+--------------------------------------------------------------------
+
+`LAST_VALUE` is a name alias and can be used instead of `LAST`, e.g.:
+
+["source","sql",subs="attributes,macros"]
+-------------------------------------------------------------------------
+include-tagged::{sql-specs}/docs.csv-spec[lastValueWithTwoArgsAndGroupBy]
+-------------------------------------------------------------------------
+
+[NOTE]
+`LAST` cannot be used in a `HAVING` clause.
+[NOTE]
+`LAST` cannot be used with columns of type <<text, `text`>> unless
+the field is also <<before-enabling-fielddata,`saved as a keyword`>>.
+
 [[sql-functions-aggs-max]]
 ===== `MAX`
 
@@ -137,6 +327,10 @@ Returns the maximum value across input values in the field `field_name`.
 include-tagged::{sql-specs}/docs.csv-spec[aggMax]
 --------------------------------------------------
 
+[NOTE]
+`MAX` on a field of type <<text, `text`>> or <<keyword, `keyword`>> is translated into
+<<sql-functions-aggs-last>> and therefore, it cannot be used in a `HAVING` clause.
+
 [[sql-functions-aggs-min]]
 ===== `MIN`
 
@@ -161,6 +355,10 @@ Returns the minimum value across input values in the field `field_name`.
 include-tagged::{sql-specs}/docs.csv-spec[aggMin]
 --------------------------------------------------
 
+[NOTE]
+`MIN` on a field of type <<text, `text`>> or <<keyword, `keyword`>> is translated into
+<<sql-functions-aggs-first>> and therefore, it cannot be used in a `HAVING` clause.
+
 [[sql-functions-aggs-sum]]
 ===== `SUM`
 

diff --git a/docs/reference/sql/limitations.asciidoc b/docs/reference/sql/limitations.asciidoc
@@ -90,3 +90,10 @@ include-tagged::{sql-specs}/docs.csv-spec[limitationSubSelectRewritten]
 
 But, if the sub-select would include a `GROUP BY` or `HAVING` or the enclosing `SELECT` would be more complex than `SELECT X
 FROM (SELECT ...) WHERE [simple_condition]`, this is currently **un-supported**.
+
+[float]
+=== Use <<sql-functions-aggs-first, `FIRST`>>/<<sql-functions-aggs-last,`LAST`>> aggregation functions in `HAVING` clause
+
+Using `FIRST` and `LAST` in the `HAVING` clause is not supported. The same applies to
+<<sql-functions-aggs-min,`MIN`>> and <<sql-functions-aggs-max,`MAX`>> when their target column
+is of type <<keyword, `keyword`>> as they are internally translated to `FIRST` and `LAST`.
diff --git a/x-pack/plugin/sql/qa/src/main/java/org/elasticsearch/xpack/sql/qa/cli/ShowTestCase.java b/x-pack/plugin/sql/qa/src/main/java/org/elasticsearch/xpack/sql/qa/cli/ShowTestCase.java
@@ -31,6 +31,10 @@ public void testShowFunctions() throws IOException {
         assertThat(readLine(), containsString(HEADER_SEPARATOR));
         assertThat(readLine(), RegexMatcher.matches("\\s*AVG\\s*\\|\\s*AGGREGATE\\s*"));
         assertThat(readLine(), RegexMatcher.matches("\\s*COUNT\\s*\\|\\s*AGGREGATE\\s*"));
+        assertThat(readLine(), RegexMatcher.matches("\\s*FIRST\\s*\\|\\s*AGGREGATE\\s*"));
+        assertThat(readLine(), RegexMatcher.matches("\\s*FIRST_VALUE\\s*\\|\\s*AGGREGATE\\s*"));
+        assertThat(readLine(), RegexMatcher.matches("\\s*LAST\\s*\\|\\s*AGGREGATE\\s*"));
+        assertThat(readLine(), RegexMatcher.matches("\\s*LAST_VALUE\\s*\\|\\s*AGGREGATE\\s*"));
         assertThat(readLine(), RegexMatcher.matches("\\s*MAX\\s*\\|\\s*AGGREGATE\\s*"));
         assertThat(readLine(), RegexMatcher.matches("\\s*MIN\\s*\\|\\s*AGGREGATE\\s*"));
         String line = readLine();
@@ -58,6 +62,8 @@ public void testShowFunctions() throws IOException {
     public void testShowFunctionsLikePrefix() throws IOException {
         assertThat(command("SHOW FUNCTIONS LIKE 'L%'"), RegexMatcher.matches("\\s*name\\s*\\|\\s*type\\s*"));
         assertThat(readLine(), containsString(HEADER_SEPARATOR));
+        assertThat(readLine(), RegexMatcher.matches("\\s*LAST\\s*\\|\\s*AGGREGATE\\s*"));
+        assertThat(readLine(), RegexMatcher.matches("\\s*LAST_VALUE\\s*\\|\\s*AGGREGATE\\s*"));
         assertThat(readLine(), RegexMatcher.matches("\\s*LEAST\\s*\\|\\s*CONDITIONAL\\s*"));
         assertThat(readLine(), RegexMatcher.matches("\\s*LOG\\s*\\|\\s*SCALAR\\s*"));
         assertThat(readLine(), RegexMatcher.matches("\\s*LOG10\\s*\\|\\s*SCALAR\\s*"));

diff --git a/x-pack/plugin/sql/qa/src/main/resources/agg.csv-spec b/x-pack/plugin/sql/qa/src/main/resources/agg.csv-spec
@@ -373,3 +373,76 @@ SELECT COUNT(ALL last_name)=COUNT(ALL first_name) AS areEqual, COUNT(ALL first_n
 ---------------+---------------+---------------
 false          |90             |100
 ;
+
+topHitsWithOneArgAndGroupBy
+schema::gender:s|first:s|last:s
+SELECT gender, FIRST(first_name) as first, LAST(first_name) as last FROM test_emp GROUP BY gender ORDER BY gender;
+
+    gender     |   first       |   last
+---------------+---------------+---------------
+null           |   Berni       |   Patricio
+F              |   Alejandro   |   Xinglin
+M              |   Amabile     |   Zvonko
+;
+
+topHitsWithTwoArgsAndGroupBy
+schema::gender:s|first:s|last:s
+SELECT gender, FIRST(first_name, birth_date) as first, LAST(first_name, birth_date) as last FROM test_emp GROUP BY gender ORDER BY gender;
+
+    gender     |   first       |   last
+---------------+---------------+---------------
+null           |   Lillian     |   Eberhardt
+F              |   Sumant      |   Valdiodio
+M              |   Remzi       |   Hilari
+;
+
+topHitsWithTwoArgsAndGroupByWithNullsOnTargetField
+schema::gender:s|first:s|last:s
+SELECT gender, FIRST(first_name, birth_date) AS first, LAST(first_name, birth_date) AS last FROM test_emp WHERE emp_no BETWEEN 10025 AND 10035 GROUP BY gender ORDER BY gender;
+
+    gender     |   first       |   last
+---------------+---------------+---------------
+F              |   null        |   Divier
+M              |   null        |   Domenick
+;
+
+topHitsWithTwoArgsAndGroupByWithNullsOnSortingField
+schema::gender:s|first:s|last:s
+SELECT gender, FIRST(first_name, birth_date) AS first, LAST(first_name, birth_date) AS last FROM test_emp WHERE emp_no BETWEEN 10047 AND 10052 GROUP BY gender ORDER BY gender;
+
+    gender     |   first       |   last
+---------------+---------------+---------------
+F              |   Basil       |   Basil
+M              |   Hidefumi    |   Heping
+;
+
+topHitsWithTwoArgsAndGroupByWithNullsOnTargetAndSortingField
+schema::gender:s|first:s|last:s
+SELECT gender, FIRST(first_name, birth_date) AS first, LAST(first_name, birth_date) AS last FROM test_emp WHERE emp_no BETWEEN 10037 AND 10052 GROUP BY gender ORDER BY gender;
+
+    gender     |   first     |  last
+---------------+-------------+-----------------
+F              |   Basil     |  Weiyi
+M              |   Hidefumi  |  null
+;
+
+topHitsWithTwoArgsAndGroupByWithAllNullsOnTargetField
+schema::gender:s|first:s|last:s
+SELECT gender, FIRST(first_name, birth_date) AS first, LAST(first_name, birth_date) AS last FROM test_emp WHERE emp_no BETWEEN 10030 AND 10037 GROUP BY gender ORDER BY gender;
+
+    gender     |   first       |   last
+---------------+---------------+---------------
+F              |   null        |   null
+M              |   null        |   null
+;
+
+topHitsOnDatetime
+schema::gender:s|first:i|last:i
+SELECT gender, month(first(birth_date, languages)) first, month(last(birth_date, languages)) last FROM test_emp GROUP BY gender ORDER BY gender;
+
+    gender     |   first       |   last
+---------------+---------------+---------------
+null           |   1           |   10
+F              |   4           |   6
+M              |   1           |   4
+;
diff --git a/x-pack/plugin/sql/qa/src/main/resources/command.csv-spec b/x-pack/plugin/sql/qa/src/main/resources/command.csv-spec
@@ -8,8 +8,12 @@ SHOW FUNCTIONS;
 
     name:s       |    type:s
 AVG              |AGGREGATE      
-COUNT            |AGGREGATE      
-MAX              |AGGREGATE      
+COUNT            |AGGREGATE
+FIRST            |AGGREGATE
+FIRST_VALUE      |AGGREGATE
+LAST             |AGGREGATE
+LAST_VALUE       |AGGREGATE
+MAX              |AGGREGATE
 MIN              |AGGREGATE      
 SUM              |AGGREGATE      
 KURTOSIS         |AGGREGATE