-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix categorize csv test #113089
Fix categorize csv test #113089
Conversation
Pinging @elastic/es-analytical-engine (Team:Analytics) |
0c93423
to
bf730ee
Compare
90e855c
to
7f515dd
Compare
/** | ||
* Supported the text categorization function "CATEGORIZE". | ||
*/ | ||
CATEGORIZE(true); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: does this piece of code work as intended?
If you add a non-existing required capability to a CSV test, it always passes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a capability presence check here:
elasticsearch/x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/CsvTests.java
Lines 255 to 261 in bb872e6
if (Build.current().isSnapshot()) { | |
assertThat( | |
"Capability is not included in the enabled list capabilities on a snapshot build. Spelling mistake?", | |
testCase.requiredCapabilities, | |
everyItem(in(EsqlCapabilities.CAPABILITIES)) | |
); | |
} else { |
It was moved to snapshot-only for reasons I don't remember. It probably shouldn't.
I'm not sure if the CI runs a snapshot or not actually (I guess not); it feels weird that it didn't fail in any part of the process
Edit: Thinking about this again, the features are also automatically included as capabilities, so in snapshot mode, it maybe worked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the CI runs snapshot, and therefore it succeeded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a tag you can add to make the tests run in release mode if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've run the test locally with and without snapshot and both work, so I think we're good.
@@ -1,13 +1,15 @@ | |||
categorize | |||
required_capability: categorize | |||
|
|||
# Drop the category ID, because it's non-deterministic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not deterministic because of the order of the rows? What about sorting the data before it enters the STATS?
Dumping the tested function output is weird 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've re-added the category
column and a SORT message
inbetween. Thanks for the suggestion.
For my understanding: does the additional SORT
step mean that every CATEGORIZE
is happening on the coordinating node? I think it must be, otherwise this isn't working correctly, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my understanding: does the additional SORT step mean that every CATEGORIZE is happening on the coordinating node? I think it must be, otherwise this isn't working correctly, right?
Yes. Your instincts for where we split are almost certainly right. We'll just run the agg after the sort. It's not ideal testing, but it's fine for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm, the CSV tests are executed in different ways. The CsvTests
class, I think it executes like an unit test, like a single node. There are also tests like https://github.com/elastic/elasticsearch/blob/c1daf18bf5f5178d43dc17d3a3d1f5db9773098e/x-pack/plugin/esql/qa/server/multi-clusters/src/javaRestTest/java/org/elasticsearch/xpack/esql/ccq/MultiClusterSpecIT.java, which use a full cluster.
But now thinking about the categorize function and how it works, I don't really know how is the sort executed. And specially as this is the first grouping function depending on order, maybe it makes sense in fact to not check the category column...
Sorry, I think the original solution was correct 🙏
Let's see if @nik9000 can confirm this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This failing test suggests that data does not arrive at the categorizer in sorted order.
@nik9000 do you understand why?
I'll leave the test muted for now. Once we replace the (arbitrary) IDs by regexes, the output will be deterministic.
/** | ||
* Supported the text categorization function "CATEGORIZE". | ||
*/ | ||
CATEGORIZE(true); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a capability presence check here:
elasticsearch/x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/CsvTests.java
Lines 255 to 261 in bb872e6
if (Build.current().isSnapshot()) { | |
assertThat( | |
"Capability is not included in the enabled list capabilities on a snapshot build. Spelling mistake?", | |
testCase.requiredCapabilities, | |
everyItem(in(EsqlCapabilities.CAPABILITIES)) | |
); | |
} else { |
It was moved to snapshot-only for reasons I don't remember. It probably shouldn't.
I'm not sure if the CI runs a snapshot or not actually (I guess not); it feels weird that it didn't fail in any part of the process
Edit: Thinking about this again, the features are also automatically included as capabilities, so in snapshot mode, it maybe worked?
/** | ||
* Supported the text categorization function "CATEGORIZE". | ||
*/ | ||
CATEGORIZE(true); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a tag you can add to make the tests run in release mode if you want.
/** | ||
* Support categorize | ||
*/ | ||
public static final NodeFeature CATEGORIZE = new NodeFeature("esql.categorize"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yikes. I hadn't noticed the feature vs capability. You can't really remove these once they've left snapshot-land. But this one stayed in snapshot-land forever so I think it's safe to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I missed the pretty explicit:
Don't make more of those - add them to {@link EsqlCapabilities} instead.
* Move CATEGORIZE from EsqlFeatures to EsqlCapabilities * Make CATEGORIZE csv test deterministic * Unmute categorize test * spotless * Deterministic categorize csv test with category IDs
* Move CATEGORIZE from EsqlFeatures to EsqlCapabilities * Make CATEGORIZE csv test deterministic * Unmute categorize test * spotless * Deterministic categorize csv test with category IDs
* ES|QL categorize v1 (#112860) * Prepare TokenListCategorizer for usage in ES|QL * Expose text categorization from ML module * Let esql plugin depend on ml plugin * Fix/suppress this-escape * Incomplete v1 of ES|QL Categorize * Refactor / remove CategorizeInternal * Fix categorize csv test (#113089) * Move CATEGORIZE from EsqlFeatures to EsqlCapabilities * Make CATEGORIZE csv test deterministic * Unmute categorize test * spotless * Deterministic categorize csv test with category IDs --------- Co-authored-by: Jan Kuipers <148754765+jan-elastic@users.noreply.github.com>
No description provided.