Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ES|QL categorize v1 #112860

Merged
merged 6 commits into from
Sep 17, 2024
Merged

ES|QL categorize v1 #112860

merged 6 commits into from
Sep 17, 2024

Conversation

jan-elastic
Copy link
Contributor

@jan-elastic jan-elastic commented Sep 13, 2024

Incomplete implementation of the machine learning categorize text aggregation in ES|QL.

@jan-elastic jan-elastic added >non-issue :ml Machine learning Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:ML Meta label for the ML team :Analytics/ES|QL AKA ESQL v8.16.0 v9.0.0 labels Sep 13, 2024
@jan-elastic jan-elastic requested review from a team as code owners September 13, 2024 09:40
@jan-elastic jan-elastic marked this pull request as draft September 13, 2024 09:40
@jan-elastic jan-elastic force-pushed the esql-categorize-v1 branch 3 times, most recently from 25cd90d to 91fd0fa Compare September 13, 2024 14:46
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Copy link
Contributor

@ivancea ivancea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just partially reviewed; I'm missing context about this categorization. Would be nice to have some info in the description of the PR about it 👀

List.of(new TestCaseSupplier.TypedData(new BytesRef("blah blah blah"), dataType, "f")),
"CategorizeInternalEvaluator[v=Attribute[channel=0]]",
DataType.INTEGER,
equalTo(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we test other outputs? Potentially with randomized inputs, like in other function tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to create testcases, where the function CATEGORIZE is evaluated sequentially on multiple inputs while keeping its state? Comparable to the CSV test in this PR?

Evaluating it on a single input doesn't give meaningful tests: the result always is 0 (meaning this single line of text is a category by itself, with ID 0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AbstractScalarFunctionTestCase isn't designed for that, sadly. I wonder if we should not use it and make something that extends AbstractFunctionTestCase. Or, well, something else? I'm not sure. This isn't a "scalar" function because of it's state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, that's something useful to consider.

However, at the moment AbstractScalarFunctionTestCase does provide a ton of useful stuff. The ...WithDefaultChecks checks nulls, multivalues, wrong types, memory leaks, ... These tests helped me a lot fixing my bugs.

source(),
toEvaluator.apply(str),
context -> new CategorizationAnalyzer(
// TODO(jan): get the correct analyzer in here, see CategorizationAnalyzerConfig::buildStandardCategorizationAnalyzer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I understand, is this a "must" before this function works, or is it a good to have? Will be done here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be resolved before going live with this. I might fix it here. Right now, this implementation is an incomplete snapshot function anyway, so it's not really important.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any advise on how to best get analyzers, char filters and token filters in here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh boy! I don't know to be honest. Analyzers are usually a concept bound to an index, but ESQL queries are cross-index. But this analyzer is generic, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's a generic one.

Something like:

  • tokenizer: ml_standard
  • char filter: first_line_with_letters
  • token filter: stopword (with custom stopword list)
  • token filter: max token count

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for the future: the Categorize text aggregation allows you to specify a categorization_analyzer in the request body. I think we want that too in ES|QL eventually, but not needed for the first version.

@jan-elastic
Copy link
Contributor Author

jan-elastic commented Sep 16, 2024

Just partially reviewed; I'm missing context about this categorization. Would be nice to have some info in the description of the PR about it 👀

Thanks for the review and sorry for the missing information. I've added it now. (This was developed in close collaboration with @nik9000 and I assumed he'd review and didn't need any context.)

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the PR. Small nit, worth creating a separate meta issue with the leftovers/todo - makes it easy not just to track the rest of the work but also to serve as a guide for similar integrations in the future.

/**
* TokenListCategorizer that takes ownership of the CategorizationBytesRefHash and releases it when closed.
*/
public static class CloseableTokenListCategorizer extends TokenListCategorizer implements Releasable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: this was needed, because the autogen code automatically closes releasables. However, the original TokenListCategorizer doesn't take ownership of the ByteRef, leading to leaked memory (and failing tests; kudos for creating those tests)

.module("ingest-common")
.apply(commonClusterConfig)
.setting("xpack.ml.enabled", "false")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember I did that because of some locally running tests issues. You did this because of this or some other reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To implement this ES|QL categorize text feature, the esql plugin has to depend on the ml plugin, which contains the categorize text code. If ml is fully enabled, it needs a writeable tmp dir, a bunch of named pipes etc. Part of that fails in this IT setup.

List.of(new TestCaseSupplier.TypedData(new BytesRef("blah blah blah"), dataType, "f")),
"CategorizeInternalEvaluator[v=Attribute[channel=0]]",
DataType.INTEGER,
equalTo(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AbstractScalarFunctionTestCase isn't designed for that, sadly. I wonder if we should not use it and make something that extends AbstractFunctionTestCase. Or, well, something else? I'm not sure. This isn't a "scalar" function because of it's state.

@Fixed(includeInToString = false, build = true) TokenListCategorizer.CloseableTokenListCategorizer categorizer
) {
String s = v.utf8ToString();
try (TokenStream ts = analyzer.tokenStream("text", s)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's super expensive to do all this. But such is life.

I spent a little looking and am pretty sure there's a nice way to make a Reader that works on the BytesRef directly and you don't need to make a String here. I couldn't find anything easy to just plug in, so I think it can wait.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out. I'll add it to the to do list.

return categorizer.computeCategory(ts, s.length(), 1).getId();
} catch (IOException e) {
throw new RuntimeException(e);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd talked about allows IOException from a lot of things in ESQL - we do read from files and such. But we never go around to it and I'm not sure we'd do it for process anyway. And this one isn't real anyway - at least, sort of not. It's throw if the token stream fails to do token stream things, but that's not real IO. Bleh. This is totally fine.

@jan-elastic
Copy link
Contributor Author

I've added a TODO list to the existing github issue.

Merging this now, so that I can try to combine this with #112757

@jan-elastic jan-elastic merged commit 71b30ce into main Sep 17, 2024
16 checks passed
@jan-elastic jan-elastic deleted the esql-categorize-v1 branch September 17, 2024 08:16
@nik9000
Copy link
Member

nik9000 commented Sep 17, 2024

👍

Nice. I'm going to try and revive #112757, hopefully targeting this and see where that puts us.

javanna pushed a commit to javanna/elasticsearch that referenced this pull request Sep 18, 2024
* Prepare TokenListCategorizer for usage in ES|QL

* Expose text categorization from ML module

* Let esql plugin depend on ml plugin

* Fix/suppress this-escape

* Incomplete v1 of ES|QL Categorize

* Refactor / remove CategorizeInternal
nik9000 pushed a commit to nik9000/elasticsearch that referenced this pull request Sep 19, 2024
* Prepare TokenListCategorizer for usage in ES|QL

* Expose text categorization from ML module

* Let esql plugin depend on ml plugin

* Fix/suppress this-escape

* Incomplete v1 of ES|QL Categorize

* Refactor / remove CategorizeInternal
nik9000 pushed a commit to nik9000/elasticsearch that referenced this pull request Sep 19, 2024
* Prepare TokenListCategorizer for usage in ES|QL

* Expose text categorization from ML module

* Let esql plugin depend on ml plugin

* Fix/suppress this-escape

* Incomplete v1 of ES|QL Categorize

* Refactor / remove CategorizeInternal
elasticsearchmachine pushed a commit that referenced this pull request Sep 19, 2024
* ES|QL categorize v1 (#112860)

* Prepare TokenListCategorizer for usage in ES|QL

* Expose text categorization from ML module

* Let esql plugin depend on ml plugin

* Fix/suppress this-escape

* Incomplete v1 of ES|QL Categorize

* Refactor / remove CategorizeInternal

* Fix categorize csv test (#113089)

* Move CATEGORIZE from EsqlFeatures to EsqlCapabilities

* Make CATEGORIZE csv test deterministic

* Unmute categorize test

* spotless

* Deterministic categorize csv test with category IDs

---------

Co-authored-by: Jan Kuipers <148754765+jan-elastic@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL :ml Machine learning >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:ML Meta label for the ML team v8.16.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants