ES|QL categorize v1 #112860

jan-elastic · 2024-09-13T09:40:32Z

Incomplete implementation of the machine learning categorize text aggregation in ES|QL.

.../java/org/elasticsearch/xpack/esql/expression/function/scalar/string/CategorizeInternal.java

elasticsearchmachine · 2024-09-13T15:00:13Z

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine · 2024-09-13T15:00:13Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

ivancea

Just partially reviewed; I'm missing context about this categorization. Would be nice to have some info in the description of the PR about it 👀

ivancea · 2024-09-13T17:16:00Z

...src/test/java/org/elasticsearch/xpack/esql/expression/function/grouping/CategorizeTests.java

+                        List.of(new TestCaseSupplier.TypedData(new BytesRef("blah blah blah"), dataType, "f")),
+                        "CategorizeInternalEvaluator[v=Attribute[channel=0]]",
+                        DataType.INTEGER,
+                        equalTo(0)


Can we test other outputs? Potentially with randomized inputs, like in other function tests

Is it possible to create test cases where the function CATEGORIZE is evaluated sequentially on multiple inputs while keeping its state, comparable to the CSV test in this PR?

Evaluating it on a single input doesn't give meaningful tests: the result is always 0 (meaning this single line of text is a category by itself, with ID 0).

AbstractScalarFunctionTestCase isn't designed for that, sadly. I wonder if we should not use it and make something that extends AbstractFunctionTestCase. Or, well, something else? I'm not sure. This isn't a "scalar" function because of its state.

Indeed, that's something useful to consider.

However, at the moment AbstractScalarFunctionTestCase does provide a ton of useful stuff. The ...WithDefaultChecks checks nulls, multivalues, wrong types, memory leaks, ... These tests helped me a lot fixing my bugs.
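
To make the point about state concrete, here is a minimal sketch in plain Java. `ToyStatefulCategorizer` is a hypothetical stand-in, not the ML `TokenListCategorizer`: it assumes a category is identified by the exact set of lowercased tokens, whereas the real categorizer matches on similarity. It only illustrates why a fresh instance always answers 0 for its first input, and why meaningful assertions need several inputs fed through the same instance.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical stand-in for a stateful categorizer; NOT the ML TokenListCategorizer.
final class ToyStatefulCategorizer {
    // Maps a normalized token set to the category ID assigned when it was first seen.
    private final Map<String, Integer> idsByKey = new HashMap<>();

    // Returns the ID of the category the text falls into, creating a new category if needed.
    int categorize(String text) {
        // Assumption: a category is keyed by its sorted, de-duplicated, lowercased tokens.
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        String key = String.join(" ", new TreeSet<>(List.of(tokens)));
        Integer existing = idsByKey.get(key);
        if (existing != null) {
            return existing;
        }
        int id = idsByKey.size(); // IDs are handed out sequentially, starting at 0
        idsByKey.put(key, id);
        return id;
    }
}
```

With one instance shared across rows, the second distinct message gets ID 1 and a repeat of the first gets ID 0 again, which is the kind of behavior a single-input test case cannot observe.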

ivancea · 2024-09-13T17:17:17Z

.../java/org/elasticsearch/xpack/esql/expression/function/scalar/string/CategorizeInternal.java

+            source(),
+            toEvaluator.apply(str),
+            context -> new CategorizationAnalyzer(
+                // TODO(jan): get the correct analyzer in here, see CategorizationAnalyzerConfig::buildStandardCategorizationAnalyzer


Just so I understand: is this a "must" before this function works, or is it a nice-to-have? Will it be done here?

This should be resolved before going live with this. I might fix it here. Right now, this implementation is an incomplete snapshot function anyway, so it's not really important.

Do you have any advice on how to best get analyzers, char filters and token filters in here?

Oh boy! I don't know to be honest. Analyzers are usually a concept bound to an index, but ESQL queries are cross-index. But this analyzer is generic, right?

Yes, it's a generic one.

Something like:

tokenizer: ml_standard

char filter: first_line_with_letters

token filter: stopword (with custom stopword list)

token filter: max token count

Note for the future: the Categorize text aggregation allows you to specify a categorization_analyzer in the request body. I think we want that too in ES|QL eventually, but not needed for the first version.
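
As a rough illustration of the analysis chain described in this thread, the following plain-Java sketch walks the same stages. Everything in it is an assumption made for illustration: `ToyCategorizationAnalysis` is a hypothetical class, whitespace splitting stands in for the `ml_standard` tokenizer, the stopword set is supplied by the caller, and the token cap is applied after stopword removal. It is not how the ML module builds its analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy approximation of the chain listed in the thread; NOT the real ml_standard tokenizer or ES filters.
final class ToyCategorizationAnalysis {
    // Char filter stand-in: keep only the first line that contains at least one letter.
    static String firstLineWithLetters(String text) {
        for (String line : text.split("\\R")) {
            if (line.chars().anyMatch(Character::isLetter)) {
                return line;
            }
        }
        return "";
    }

    // Tokenizer plus token filters: split, drop stopwords, cap the number of tokens kept.
    static List<String> analyze(String text, Set<String> stopwords, int maxTokens) {
        List<String> tokens = new ArrayList<>();
        // Assumption: whitespace splitting; the real tokenizer has richer rules.
        for (String token : firstLineWithLetters(text).trim().split("\\s+")) {
            if (tokens.size() >= maxTokens) {
                break; // max token count reached
            }
            if (token.isEmpty() || stopwords.contains(token)) {
                continue; // skip empty fragments and stopwords
            }
            tokens.add(token);
        }
        return tokens;
    }
}
```

The char filter runs before tokenization, so lines without letters (for example a bare numeric prefix) never reach the tokenizer.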

jan-elastic · 2024-09-16T07:07:18Z

Just partially reviewed; I'm missing context about this categorization. Would be nice to have some info in the description of the PR about it 👀

Thanks for the review and sorry for the missing information. I've added it now. (This was developed in close collaboration with @nik9000 and I assumed he'd review and didn't need any context.)

costin

LGTM! Thanks for the PR. Small nit: it's worth creating a separate meta issue with the leftovers/TODOs. That makes it easy not just to track the rest of the work, but also to serve as a guide for similar integrations in the future.

costin · 2024-09-16T15:13:38Z

...in/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/TokenListCategorizer.java

+    /**
+     * TokenListCategorizer that takes ownership of the CategorizationBytesRefHash and releases it when closed.
+     */
+    public static class CloseableTokenListCategorizer extends TokenListCategorizer implements Releasable {


FYI: this was needed because the auto-generated code automatically closes releasables. However, the original TokenListCategorizer doesn't take ownership of the CategorizationBytesRefHash, leading to leaked memory (and failing tests; kudos for creating those tests).
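
The ownership change described here can be sketched in plain Java. All names below are hypothetical, and `AutoCloseable` stands in for Elasticsearch's `Releasable`; the sketch only shows the pattern of a subclass that closes a resource its parent merely borrows.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a resource that holds tracked memory.
final class ToyHash implements AutoCloseable {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    boolean isClosed() {
        return closed.get();
    }

    @Override
    public void close() {
        closed.set(true);
    }
}

// The base class only borrows the hash: whoever created the hash must close it.
class BorrowingToyCategorizer {
    protected final ToyHash hash;

    BorrowingToyCategorizer(ToyHash hash) {
        this.hash = hash;
    }
}

// The subclass takes ownership: closing the categorizer also releases the hash.
final class OwningToyCategorizer extends BorrowingToyCategorizer implements AutoCloseable {
    OwningToyCategorizer(ToyHash hash) {
        super(hash);
    }

    @Override
    public void close() {
        hash.close();
    }
}
```

With this shape, code that closes every closeable it was handed (as the generated evaluators are described as doing) releases the hash as a side effect, so nothing is left for a leak check to flag.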

astefan · 2024-09-16T15:36:19Z

...src/javaRestTest/java/org/elasticsearch/xpack/remotecluster/RemoteClusterSecurityEsqlIT.java

            .module("ingest-common")
            .apply(commonClusterConfig)
+            .setting("xpack.ml.enabled", "false")


I remember doing that because of issues with tests running locally. Did you do this for the same reason, or for some other one?

To implement this ES|QL categorize text feature, the esql plugin has to depend on the ml plugin, which contains the categorize text code. If ml is fully enabled, it needs a writeable tmp dir, a bunch of named pipes etc. Part of that fails in this IT setup.

nik9000 · 2024-09-16T15:07:39Z

...src/test/java/org/elasticsearch/xpack/esql/expression/function/grouping/CategorizeTests.java

+                        List.of(new TestCaseSupplier.TypedData(new BytesRef("blah blah blah"), dataType, "f")),
+                        "CategorizeInternalEvaluator[v=Attribute[channel=0]]",
+                        DataType.INTEGER,
+                        equalTo(0)


AbstractScalarFunctionTestCase isn't designed for that, sadly. I wonder if we should not use it and make something that extends AbstractFunctionTestCase. Or, well, something else? I'm not sure. This isn't a "scalar" function because of its state.

nik9000 · 2024-09-16T15:52:00Z

...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/grouping/Categorize.java

+        @Fixed(includeInToString = false, build = true) TokenListCategorizer.CloseableTokenListCategorizer categorizer
+    ) {
+        String s = v.utf8ToString();
+        try (TokenStream ts = analyzer.tokenStream("text", s)) {


It's super expensive to do all this. But such is life.

I spent a little time looking and am pretty sure there's a nice way to make a Reader that works on the BytesRef directly, so you don't need to make a String here. I couldn't find anything easy to just plug in, so I think it can wait.

Thanks for pointing that out. I'll add it to the to do list.
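
One possible shape for that TODO, using only the JDK, is sketched below. `Utf8SliceReaders` and its methods are hypothetical names; the sketch assumes the input is the `bytes`/`offset`/`length` triple a `BytesRef` carries and that the bytes are valid UTF-8. It decodes incrementally instead of building a String first. Lucene's `Analyzer` also has a `tokenStream(String, Reader)` overload, so a reader like this could in principle be passed to it, but that wiring is not part of this PR.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class Utf8SliceReaders {
    // Wraps a UTF-8 byte slice in a Reader without materializing a String up front.
    static Reader readerOver(byte[] bytes, int offset, int length) {
        return new InputStreamReader(new ByteArrayInputStream(bytes, offset, length), StandardCharsets.UTF_8);
    }

    // Demonstration helper only: drains the reader to show what it decodes.
    static String drain(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[64];
        try {
            for (int n; (n = reader.read(buf)) != -1; ) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```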

nik9000 · 2024-09-16T15:53:16Z

...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/grouping/Categorize.java

+            return categorizer.computeCategory(ts, s.length(), 1).getId();
+        } catch (IOException e) {
+            throw new RuntimeException(e);
+        }


We'd talked about allowing IOException from a lot of things in ESQL; we do read from files and such. But we never got around to it, and I'm not sure we'd do it for process anyway. And this one isn't real anyway, at least sort of not: it's thrown if the token stream fails to do token stream things, but that's not real IO. Bleh. This is totally fine.

jan-elastic · 2024-09-17T08:16:35Z

I've added a TODO list to the existing GitHub issue.

Merging this now, so that I can try to combine this with #112757

nik9000 · 2024-09-17T13:15:14Z

👍

Nice. I'm going to try and revive #112757, hopefully targeting this and see where that puts us.

* Prepare TokenListCategorizer for usage in ES|QL
* Expose text categorization from ML module
* Let esql plugin depend on ml plugin
* Fix/suppress this-escape
* Incomplete v1 of ES|QL Categorize
* Refactor / remove CategorizeInternal

* ES|QL categorize v1 (#112860)
* Prepare TokenListCategorizer for usage in ES|QL
* Expose text categorization from ML module
* Let esql plugin depend on ml plugin
* Fix/suppress this-escape
* Incomplete v1 of ES|QL Categorize
* Refactor / remove CategorizeInternal
* Fix categorize csv test (#113089)
* Move CATEGORIZE from EsqlFeatures to EsqlCapabilities
* Make CATEGORIZE csv test deterministic
* Unmute categorize test
* spotless
* Deterministic categorize csv test with category IDs

---------

Co-authored-by: Jan Kuipers <148754765+jan-elastic@users.noreply.github.com>

jan-elastic added 4 commits September 13, 2024 11:26

Prepare TokenListCategorizer for usage in ES|QL

b49703b

Expose text categorization from ML module

82d75ac

Let esql plugin depend on ml plugin

2f01aa0

Fix/suppress this-escape

3e18f56

jan-elastic added the >non-issue, :ml, Team:Analytics, Team:ML, :Analytics/ES|QL, v8.16.0, and v9.0.0 labels Sep 13, 2024

jan-elastic requested review from a team as code owners September 13, 2024 09:40

jan-elastic marked this pull request as draft September 13, 2024 09:40

jan-elastic force-pushed the esql-categorize-v1 branch 3 times, most recently from 25cd90d to 91fd0fa Compare September 13, 2024 14:46

Incomplete v1 of ES|QL Categorize

22ff782

jan-elastic force-pushed the esql-categorize-v1 branch from 91fd0fa to 22ff782 Compare September 13, 2024 14:58

jan-elastic assigned nik9000 Sep 13, 2024

jan-elastic marked this pull request as ready for review September 13, 2024 14:59

mark-vieira approved these changes Sep 13, 2024

ivancea reviewed Sep 13, 2024

Refactor / remove CategorizeInternal

fd2b13c

costin approved these changes Sep 16, 2024

astefan reviewed Sep 16, 2024

nik9000 reviewed Sep 16, 2024

jan-elastic merged commit 71b30ce into main Sep 17, 2024
16 checks passed

jan-elastic deleted the esql-categorize-v1 branch September 17, 2024 08:16

ioanatia mentioned this pull request Sep 18, 2024

ESQL - generate docs for snapshot functions #113080

Merged

nik9000 mentioned this pull request Sep 19, 2024

Incomplete version of CATEGORIZE for ESQL #113207

Merged
