ESQL: Compute engine support for stateful grouping functions #112757

nik9000 · 2024-09-11T17:27:35Z

This adds compute engine support for "stateful grouping functions". Think of these as ExpressionEvaluators that can also:

- Encode extra state to be passed to the coordinating node as part of the agg
- Use that extra state to transform the group keys on the coordinating node
- Apply a transformation using the output after the aggregation is complete
- Use an intermediate representation that differs from the final representation (think, "I group on an integer, but when finished I transform into a string")
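
A minimal sketch of the idea in the last point, using plain Java collections rather than the compute engine's Block and Page types. All names here (StatefulGroupingSketch, DataNodeGrouper, Intermediate, merge) are hypothetical and do not come from this PR. Each data node groups on a local integer id and ships its id-to-label dictionary as extra state; the coordinator uses that state to translate the keys and emits strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class StatefulGroupingSketch {
    /** Hypothetical per-node grouper: groups on a local int id and keeps the label dictionary as state. */
    static final class DataNodeGrouper {
        private final Map<String, Integer> idsByLabel = new LinkedHashMap<>();
        private final List<Long> counts = new ArrayList<>();

        void add(String label) {
            Integer id = idsByLabel.get(label);
            if (id == null) {
                // Ids are handed out in insertion order, so id == position in the map.
                id = idsByLabel.size();
                idsByLabel.put(label, id);
                counts.add(0L);
            }
            counts.set(id, counts.get(id) + 1);
        }

        /** Intermediate output: the extra state (labels indexed by local id) plus counts per local id. */
        Intermediate intermediate() {
            return new Intermediate(new ArrayList<>(idsByLabel.keySet()), new ArrayList<>(counts));
        }
    }

    record Intermediate(List<String> labelsById, List<Long> countsById) {}

    /** Coordinator side: use the shipped state to translate each node's local ids into string keys. */
    static Map<String, Long> merge(List<Intermediate> fromNodes) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (Intermediate node : fromNodes) {
            for (int localId = 0; localId < node.labelsById().size(); localId++) {
                result.merge(node.labelsById().get(localId), node.countsById().get(localId), Long::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        DataNodeGrouper a = new DataNodeGrouper();
        a.add("error");
        a.add("warn");
        a.add("error");
        DataNodeGrouper b = new DataNodeGrouper();
        b.add("warn");
        b.add("info");
        System.out.println(merge(List.of(a.intermediate(), b.intermediate())));
    }
}
```

The local ids on the two nodes disagree ("warn" is 1 on one node and 0 on the other), which is why the dictionary has to travel with the intermediate result.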

nik9000 · 2024-09-11T17:28:41Z

x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/GroupingKey.java

+import java.util.ArrayList;
+import java.util.List;
+
+public record GroupingKey(AggregatorMode mode, Thing thing) implements EvalOperator.ExpressionEvaluator {


I've somewhat mirrored the way we do grouping aggs with this and it seems to have worked out fairly well. It's not perfect, but it's a lot less confusing than I thought it would be.

nik9000 · 2024-09-11T17:29:32Z

...n/esql/compute/src/main/java/org/elasticsearch/compute/operator/HashAggregationOperator.java


                for (int i = 0; i < prepared.length; i++) {
                    prepared[i] = aggregators.get(i).prepareProcessPage(blockHash, page);
                }

-                blockHash.add(wrapPage(page), add);
+                blockHash.add(new Page(keys), add);


I'd like to modify BlockHash to take a Block[] with the blocks in the right position. But that seems like something for another time.

nik9000 · 2024-09-11T17:29:55Z

...n/esql/compute/src/main/java/org/elasticsearch/compute/operator/HashAggregationOperator.java


                for (int i = 0; i < prepared.length; i++) {
                    prepared[i] = aggregators.get(i).prepareProcessPage(blockHash, page);
                }

-                blockHash.add(wrapPage(page), add);
+                blockHash.add(new Page(keys), add);
                hashNanos += System.nanoTime() - add.hashStart;


It's probably worth timing the evaluation here.
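
One way this could look, sketched with stand-in Runnables instead of the real key evaluation and BlockHash calls. The class and the evalNanos counter are hypothetical; only hashNanos and System.nanoTime() appear in the diff above.

```java
public final class TimingSketch {
    long evalNanos; // hypothetical counter for time spent evaluating grouping keys
    long hashNanos; // mirrors the existing hashNanos counter in the diff

    /** Runs the two phases back to back and attributes the elapsed time to each one separately. */
    void addInput(Runnable evaluateKeys, Runnable addToHash) {
        long start = System.nanoTime();
        evaluateKeys.run();
        long afterEval = System.nanoTime();
        addToHash.run();
        long afterHash = System.nanoTime();
        evalNanos += afterEval - start;
        hashNanos += afterHash - afterEval;
    }

    public static void main(String[] args) {
        TimingSketch timing = new TimingSketch();
        timing.addInput(() -> {}, () -> {});
        System.out.println("eval=" + timing.evalNanos + "ns hash=" + timing.hashNanos + "ns");
    }
}
```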

nik9000 · 2024-09-11T17:31:01Z

...n/esql/compute/src/main/java/org/elasticsearch/compute/operator/HashAggregationOperator.java

+            int[] aggBlockCounts = new int[aggregators.size()];
+            for (int a = 0; a < aggregators.size(); a++) {
+                aggBlockCounts[a] = aggregators.get(a).evaluateBlockCount();
+                blockCount += aggBlockCounts[a];


I found it a lot easier to read if I encoded the resultOffsets into the GroupKeys. It'd be even easier to read if the offsets were encoded into the aggregators too. Or if we returned Block[].
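
To illustrate the bookkeeping, here is a small sketch that turns per-component block counts into starting offsets in a shared output array. The Component record and assignOffsets helper are hypothetical stand-ins for group keys and aggregators; the real classes in this PR may lay things out differently.

```java
import java.util.List;

public final class ResultOffsetsSketch {
    /** Hypothetical stand-in for a group key or aggregator that emits blockCount output blocks. */
    record Component(String name, int blockCount) {}

    /** Returns the starting offset of each component; the extra last entry is the total block count. */
    static int[] assignOffsets(List<Component> components) {
        int[] offsets = new int[components.size() + 1];
        for (int i = 0; i < components.size(); i++) {
            offsets[i + 1] = offsets[i] + components.get(i).blockCount();
        }
        return offsets;
    }

    public static void main(String[] args) {
        List<Component> layout = List.of(
            new Component("key", 1),
            new Component("agg-a", 2),
            new Component("agg-b", 1)
        );
        int[] offsets = assignOffsets(layout);
        String[] blocks = new String[offsets[layout.size()]];
        for (int i = 0; i < layout.size(); i++) {
            // Each component writes only into its own slice [offsets[i], offsets[i + 1]).
            for (int b = offsets[i]; b < offsets[i + 1]; b++) {
                blocks[b] = layout.get(i).name();
            }
        }
        System.out.println(String.join(",", blocks)); // key,agg-a,agg-a,agg-b
    }
}
```

If each component carries its own offset, nobody downstream has to re-derive positions from a running counter.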

nik9000 · 2024-09-17T21:53:54Z

@jan-elastic have a look at this one. It's closer, I think.

Once we can figure out how this is supposed to work in the unit test I think we can iterate some more on the language side to figure out how to make it build that.

...in/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/TokenListCategorizer.java

Categorize

nik9000 · 2024-09-19T14:32:37Z

...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/grouping/Categorize.java

+
+        @Override
+        public void replaceIntermediateKeys(BlockFactory blockFactory, Block[] blocks) {
+            // NOCOMMIT this offset can't be the same in the result array and intermediate array


I bumped into this a few days ago and I think I need to dig some more. In this brave new world there are two "shapes" of data coming out of these grouping functions: the intermediate shape and the final shape. This is pretty similar to how aggs work, which is something I never fully understood, to be honest. Anyway, I'm using resultOffset here, but that's the result offset of the intermediate data, not the final offset. So it can't be right.
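
The mismatch can be made concrete with a toy layout. In this sketch each grouping key reports two block counts, one per shape, so the offset of a later key differs between the intermediate array and the final array. KeyShape, intermediateBlocks, finalBlocks, and offsetOf are hypothetical names, and the counts are invented for illustration.

```java
import java.util.List;
import java.util.function.ToIntFunction;

public final class ShapeOffsetsSketch {
    /** Hypothetical key description: how many blocks it emits in each shape. */
    record KeyShape(String name, int intermediateBlocks, int finalBlocks) {}

    /** Offset of the key at index target when every earlier key contributes width.applyAsInt(key) blocks. */
    static int offsetOf(List<KeyShape> keys, int target, ToIntFunction<KeyShape> width) {
        int offset = 0;
        for (int i = 0; i < target; i++) {
            offset += width.applyAsInt(keys.get(i));
        }
        return offset;
    }

    public static void main(String[] args) {
        List<KeyShape> keys = List.of(
            new KeyShape("categorize", 2, 1), // invented counts: two intermediate blocks, one final block
            new KeyShape("host", 1, 1)
        );
        int intermediate = offsetOf(keys, 1, KeyShape::intermediateBlocks);
        int fin = offsetOf(keys, 1, KeyShape::finalBlocks);
        System.out.println("host: intermediate offset " + intermediate + ", final offset " + fin);
    }
}
```

Under these assumed counts the second key sits at offset 2 in the intermediate array and offset 1 in the final one, so a single resultOffset cannot serve both.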

dnhatn

Thanks @nik9000. I might be missing some context, but it seems we're trying to include SerializableTokenListCategory alongside the aggregated results for each driver. Could we resolve this by adding infrastructure to support SerializableTokenListCategory (or a variant) as the new block hash key?

dnhatn · 2024-09-22T05:24:59Z

...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/grouping/Categorize.java

+                    CategorizationPartOfSpeechDictionary.getInstance(),
+                    0.70f
+                );
+                evaluator = new CategorizeEvaluator(


It seems the CategorizeEvaluator will be executed twice: here and in toEvaluator?

jan-elastic · 2024-09-23

I was wondering the same.

However, when running the CategorizeOperatorTests it seems that Categorize::toEvaluator is never executed.

ivancea

Partial review with some questions

ivancea · 2024-10-01T13:41:14Z

...n/esql/compute/src/main/java/org/elasticsearch/compute/operator/HashAggregationOperator.java

+            int offset = 0;
+            for (int g = 0; g < groups.size(); g++) {
+                blocks[offset] = keys[g];
+                groups.get(g).finish(blocks, selected, driverContext);


No offset passed to the finish() here? How does it know where to place the blocks?

ivancea · 2024-10-01T13:42:47Z

x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/GroupingKey.java

+        return mode.isInputPartial() ? thing.evalIntermediateInput(blockFactory, page) : thing.evalRawInput(page);
+    }
+
+    public int finishBlockCount() {


Nit: We're calling this "finish", while in the aggregator it's "evaluate". Is there a reason to keep the names different? From what I understand, the operation is nearly the same (?)

ivancea · 2024-10-01T13:44:54Z

.../esql/compute/src/main/java/org/elasticsearch/compute/operator/OrdinalsGroupingOperator.java

-                    maxPageSize,
-                    false
+                List.of(
+                    // NOCOMMIT double check the mode


Commenting just in case this was forgotten

ivancea · 2024-10-01T14:38:49Z

x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/GroupingKey.java

+import java.util.List;
+
+public record GroupingKey(AggregatorMode mode, Thing thing, BlockFactory blockFactory) implements EvalOperator.ExpressionEvaluator {
+    public interface Thing extends Releasable {


Is "Thing" the final name here?

nik9000 added 3 commits September 11, 2024 11:26

Start

bc389e4

Next

712b672

Like so?

37415d2

nik9000 added >non-issue :Analytics/ES|QL AKA ESQL v8.16.0 labels Sep 11, 2024

nik9000 requested a review from dnhatn September 11, 2024 17:27

elasticsearchmachine added the v9.0.0 label Sep 11, 2024

nik9000 commented Sep 11, 2024

Fix one test

1fb09b1

jan-elastic mentioned this pull request Sep 17, 2024

ES|QL categorize v1 #112860

Merged

nik9000 added 2 commits September 17, 2024 14:58

Merge branch 'main' into stateful_grouping

5a088b6

foooooooooo

522c0bc

jan-elastic reviewed Sep 18, 2024

...in/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/TokenListCategorizer.java (outdated, resolved)

jan-elastic and others added 6 commits September 18, 2024 16:41

Output correct regexes

96e6505

Remap intermediate category IDs

f6ef350

Fix mem leak

a385441

spotless

5a46823

Test categorize operator on multiple nodes

7685a0c

Merge pull request #10 from jan-elastic/categorize

f35fccb

Categorize

nik9000 commented Sep 19, 2024

dnhatn reviewed Sep 22, 2024

ivancea self-requested a review September 30, 2024 15:27

alex-spies self-requested a review September 30, 2024 15:28

costin self-requested a review September 30, 2024 15:28

iverase self-requested a review September 30, 2024 15:29

craigtaverner self-requested a review September 30, 2024 15:52

ivancea reviewed Oct 2, 2024

jan-elastic Sep 23, 2024
