-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ES|QL categorize v1 #112860
ES|QL categorize v1 #112860
Conversation
25cd90d
to
91fd0fa
Compare
.../java/org/elasticsearch/xpack/esql/expression/function/scalar/string/CategorizeInternal.java
Outdated
Show resolved
Hide resolved
91fd0fa
to
22ff782
Compare
Pinging @elastic/ml-core (Team:ML) |
Pinging @elastic/es-analytical-engine (Team:Analytics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just partially reviewed; I'm missing context about this categorization. Would be nice to have some info in the description of the PR about it 👀
List.of(new TestCaseSupplier.TypedData(new BytesRef("blah blah blah"), dataType, "f")), | ||
"CategorizeInternalEvaluator[v=Attribute[channel=0]]", | ||
DataType.INTEGER, | ||
equalTo(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we test other outputs? Potentially with randomized inputs, like in other function tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to create testcases, where the function CATEGORIZE
is evaluated sequentially on multiple inputs while keeping its state? Comparable to the CSV test in this PR?
Evaluating it on a single input doesn't give meaningful tests: the result always is 0
(meaning this single line of text is a category by itself, with ID 0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AbstractScalarFunctionTestCase
isn't designed for that, sadly. I wonder if we should not use it and make something that extends AbstractFunctionTestCase
. Or, well, something else? I'm not sure. This isn't a "scalar" function because of it's state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, that's something useful to consider.
However, at the moment AbstractScalarFunctionTestCase
does provide a ton of useful stuff. The ...WithDefaultChecks
checks nulls, multivalues, wrong types, memory leaks, ... These tests helped me a lot fixing my bugs.
source(), | ||
toEvaluator.apply(str), | ||
context -> new CategorizationAnalyzer( | ||
// TODO(jan): get the correct analyzer in here, see CategorizationAnalyzerConfig::buildStandardCategorizationAnalyzer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I understand, is this a "must" before this function works, or is it a good to have? Will be done here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be resolved before going live with this. I might fix it here. Right now, this implementation is an incomplete snapshot function anyway, so it's not really important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any advise on how to best get analyzers, char filters and token filters in here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh boy! I don't know to be honest. Analyzers are usually a concept bound to an index, but ESQL queries are cross-index. But this analyzer is generic, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's a generic one.
Something like:
- tokenizer: ml_standard
- char filter: first_line_with_letters
- token filter: stopword (with custom stopword list)
- token filter: max token count
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note for the future: the Categorize text aggregation allows you to specify a categorization_analyzer
in the request body. I think we want that too in ES|QL eventually, but not needed for the first version.
Thanks for the review and sorry for the missing information. I've added it now. (This was developed in close collaboration with @nik9000 and I assumed he'd review and didn't need any context.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the PR. Small nit, worth creating a separate meta issue with the leftovers/todo - makes it easy not just to track the rest of the work but also to serve as a guide for similar integrations in the future.
/** | ||
* TokenListCategorizer that takes ownership of the CategorizationBytesRefHash and releases it when closed. | ||
*/ | ||
public static class CloseableTokenListCategorizer extends TokenListCategorizer implements Releasable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: this was needed, because the autogen code automatically closes releasables. However, the original TokenListCategorizer
doesn't take ownership of the ByteRef, leading to leaked memory (and failing tests; kudos for creating those tests)
.module("ingest-common") | ||
.apply(commonClusterConfig) | ||
.setting("xpack.ml.enabled", "false") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember I did that because of some locally running tests issues. You did this because of this or some other reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To implement this ES|QL categorize text feature, the esql
plugin has to depend on the ml
plugin, which contains the categorize text code. If ml
is fully enabled, it needs a writeable tmp dir, a bunch of named pipes etc. Part of that fails in this IT setup.
List.of(new TestCaseSupplier.TypedData(new BytesRef("blah blah blah"), dataType, "f")), | ||
"CategorizeInternalEvaluator[v=Attribute[channel=0]]", | ||
DataType.INTEGER, | ||
equalTo(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AbstractScalarFunctionTestCase
isn't designed for that, sadly. I wonder if we should not use it and make something that extends AbstractFunctionTestCase
. Or, well, something else? I'm not sure. This isn't a "scalar" function because of it's state.
@Fixed(includeInToString = false, build = true) TokenListCategorizer.CloseableTokenListCategorizer categorizer | ||
) { | ||
String s = v.utf8ToString(); | ||
try (TokenStream ts = analyzer.tokenStream("text", s)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's super expensive to do all this. But such is life.
I spent a little looking and am pretty sure there's a nice way to make a Reader
that works on the BytesRef
directly and you don't need to make a String here. I couldn't find anything easy to just plug in, so I think it can wait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing that out. I'll add it to the to do list.
return categorizer.computeCategory(ts, s.length(), 1).getId(); | ||
} catch (IOException e) { | ||
throw new RuntimeException(e); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd talked about allows IOException from a lot of things in ESQL - we do read from files and such. But we never go around to it and I'm not sure we'd do it for process
anyway. And this one isn't real anyway - at least, sort of not. It's throw if the token stream fails to do token stream things, but that's not real IO. Bleh. This is totally fine.
I've added a TODO list to the existing github issue. Merging this now, so that I can try to combine this with #112757 |
👍 Nice. I'm going to try and revive #112757, hopefully targeting this and see where that puts us. |
* Prepare TokenListCategorizer for usage in ES|QL * Expose text categorization from ML module * Let esql plugin depend on ml plugin * Fix/suppress this-escape * Incomplete v1 of ES|QL Categorize * Refactor / remove CategorizeInternal
* Prepare TokenListCategorizer for usage in ES|QL * Expose text categorization from ML module * Let esql plugin depend on ml plugin * Fix/suppress this-escape * Incomplete v1 of ES|QL Categorize * Refactor / remove CategorizeInternal
* Prepare TokenListCategorizer for usage in ES|QL * Expose text categorization from ML module * Let esql plugin depend on ml plugin * Fix/suppress this-escape * Incomplete v1 of ES|QL Categorize * Refactor / remove CategorizeInternal
* ES|QL categorize v1 (#112860) * Prepare TokenListCategorizer for usage in ES|QL * Expose text categorization from ML module * Let esql plugin depend on ml plugin * Fix/suppress this-escape * Incomplete v1 of ES|QL Categorize * Refactor / remove CategorizeInternal * Fix categorize csv test (#113089) * Move CATEGORIZE from EsqlFeatures to EsqlCapabilities * Make CATEGORIZE csv test deterministic * Unmute categorize test * spotless * Deterministic categorize csv test with category IDs --------- Co-authored-by: Jan Kuipers <148754765+jan-elastic@users.noreply.github.com>
Incomplete implementation of the machine learning categorize text aggregation in ES|QL.