[Analysis] Support normalizer in request param #24767

Merged: 4 commits, Jul 4, 2017

johtani (Contributor) commented May 18, 2017

Support the normalizer param, and custom normalizers built from the char_filter/filter params.

In this PR, I didn't change the response format.
If a user sends a request with a keyword field name or a normalizer name, the analyze API returns a response whose tokenizer is KeywordTokenizer.
Should we change the response format for normalizers?

Closes #23347

cbuescher (Member) left a comment

@johtani I like this PR and had fun reviewing it and learning more about this analysis feature. I left some comments, but I have to apologize in advance that I'm not an expert in this area yet. I hope the comments are useful nonetheless.

@@ -222,6 +237,9 @@ public void readFrom(StreamInput in) throws IOException {
        field = in.readOptionalString();
        explain = in.readBoolean();
        attributes = in.readStringArray();
        if (in.getVersion().onOrAfter(Version.V_6_0_0_alpha1_UNRELEASED)) {
            normalizer = in.readOptionalString();
        }
Member

Maybe we can start having a unit test for the AnalyzeRequest in which e.g. the validate method and the serialization can be checked.
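
The version gate in the hunk above, and the kind of round-trip check suggested here, can be sketched without any Elasticsearch classes. This is only an illustration under stated assumptions: `WireSketch`, the integer version constants, and the `java.io` data streams are hypothetical stand-ins for the request class, `Version`, and `StreamInput`/`StreamOutput`; it is not the code from this PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a request that gained one field in a newer wire version.
final class WireSketch {
    // Assumed version numbers; the real code compares Version objects.
    static final int V5 = 5;
    static final int V6 = 6;

    String field;
    String normalizer; // only present on the wire for V6 and later

    // Writes an optional string as a presence flag followed by the value.
    private static void writeOptional(DataOutputStream out, String value) throws IOException {
        out.writeBoolean(value != null);
        if (value != null) {
            out.writeUTF(value);
        }
    }

    private static String readOptional(DataInputStream in) throws IOException {
        return in.readBoolean() ? in.readUTF() : null;
    }

    byte[] write(int wireVersion) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            writeOptional(out, field);
            if (wireVersion >= V6) { // must mirror the gate in read()
                writeOptional(out, normalizer);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static WireSketch read(byte[] data, int wireVersion) {
        WireSketch request = new WireSketch();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            request.field = readOptional(in);
            if (wireVersion >= V6) { // older senders never wrote this field
                request.normalizer = readOptional(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return request;
    }
}
```

A unit test along these lines would write a request at each version and check that the normalizer survives on the newer version and is silently dropped on the older one.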

            analyzeRequest.normalizer(parser.text());
        } else {
            throw new IllegalArgumentException(currentFieldName + " should be normalizer's name");
        }
Member

Can you add a test for this parsing part to RestAnalyzeActionTests?

        ((request.tokenFilters() != null && request.tokenFilters().size() > 0)
        || (request.charFilters() != null && request.charFilters().size() > 0))) {
        // normalizer + (tokenizer/analyzer) = no error, just ignore normalizer param
        final IndexSettings indexSettings = indexAnalyzers == null ? null : indexAnalyzers.getIndexSettings();
Member

This can maybe go inside the following else branch.

    final IndexSettings indexSettings = indexAnalyzers == null ? null : indexAnalyzers.getIndexSettings();
    if (request.normalizer() != null) {
        // Get normalizer from indexanalyzers
        analyzer = indexAnalyzers.getNormalizer(request.normalizer());
Member

A question out of curiosity: the analyzer we get here doesn't have to be closed (via closeAnalyzer) because it's not a new instance? I don't know enough about the lifecycle of these objects yet, I'm afraid.

Contributor Author (johtani), Jun 11, 2017

Yes, it is an existing instance that was created by IndexService or something similar. The analyzer is only closed if TransportAnalyzeAction creates a CustomAnalyzer itself.

    } else if (request.normalizer() != null ||
        ((request.tokenFilters() != null && request.tokenFilters().size() > 0)
        || (request.charFilters() != null && request.charFilters().size() > 0))) {
        // normalizer + (tokenizer/analyzer) = no error, just ignore normalizer param
Member

Wouldn't it be better to throw an error here? As far as I can see, specifying a normalizer together with an analyzer or tokenizer doesn't make sense. I think this combination can already be detected earlier, on the request (is validate() always called?).

Contributor Author

I will add the check to the request.validate() method.
Unfortunately, it is not always called: if you call shardOperation directly yourself, validate() is not invoked.

Member

Thanks, I think it's better than nothing.
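
The check agreed on in this thread can be sketched roughly as follows. `AnalyzeParamsSketch` and its fields are hypothetical and model only the three parameters under discussion; the message text is borrowed from the test assertion quoted later in this review. The real validate() method may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical request holder; only the parameters relevant to the check are modelled.
final class AnalyzeParamsSketch {
    String analyzer;
    String tokenizer;
    String normalizer;

    // Returns the validation errors; an empty list means the combination is acceptable.
    List<String> validate() {
        List<String> errors = new ArrayList<>();
        if (normalizer != null && (tokenizer != null || analyzer != null)) {
            // A normalizer cannot be combined with an analyzer or a tokenizer.
            errors.add("tokenizer/analyze should be null if normalizer is specified");
        }
        return errors;
    }
}
```

Because validate() is not guaranteed to run on every path, a check like this guards the common entry points rather than all of them, which matches the "better than nothing" conclusion above.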

@@ -189,6 +190,44 @@ public static AnalyzeResponse analyze(AnalyzeRequest request, String field, Anal

        analyzer = new CustomAnalyzer(tokenizerFactory, charFilterFactories, tokenFilterFactories);
        closeAnalyzer = true;
    } else if (request.normalizer() != null ||
        ((request.tokenFilters() != null && request.tokenFilters().size() > 0)
        || (request.charFilters() != null && request.charFilters().size() > 0))) {
Member

Can this be split into the two cases request.normalizer() != null and (request.tokenFilters() != null && request.tokenFilters().size() > 0) || (request.charFilters() != null && request.charFilters().size() > 0) in two separate else if blocks instead of separating these cases later? I'm not entirely sure if this works, but I think it would make this part easier to read.
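
One way to picture the suggested split is a small dispatch with one branch per case. This is a sketch only: `AnalyzePathSketch`, the `Path` enum, and the `notEmpty` helper are hypothetical names, and the sketch assumes the analyzer/tokenizer cases were already handled by earlier branches.

```java
import java.util.List;

// Hypothetical dispatch showing the two cases as separate else-if branches.
final class AnalyzePathSketch {
    enum Path { NAMED_NORMALIZER, CUSTOM_NORMALIZER, OTHER }

    private static boolean notEmpty(List<?> list) {
        return list != null && !list.isEmpty();
    }

    static Path choose(String normalizer, List<String> tokenFilters, List<String> charFilters) {
        if (normalizer != null) {
            // Case 1: look up an existing normalizer by name.
            return Path.NAMED_NORMALIZER;
        } else if (notEmpty(tokenFilters) || notEmpty(charFilters)) {
            // Case 2: build a custom normalizer from filter/char_filter only.
            return Path.CUSTOM_NORMALIZER;
        } else {
            return Path.OTHER;
        }
    }
}
```

Naming the null-and-empty test once also removes the repeated `!= null && size() > 0` expressions from the condition.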


assertEquals(1, tokens.size());
assertEquals("abc", tokens.get(0).getTerm());
Member

Would it be possible to add a test for the second code path added in this PR (the case where normalizer == null but filter or char_filter is not null and tokenizer/analyzer is null)? I don't know if it is possible with this test setup but it might be useful

Contributor Author

Ah, I added that test case to the REST API tests. We are now moving filter/char_filter to the analysis-common module, so I think that is a better place for it than this test class.

Member

makes sense


Will cause the analysis to happen based on the analyzer configured in the
mapping for `obj1.field1` (and if not, the default index analyzer).

A `normalizer` can be provided for keyword field with normalizer associated with the `twitter` index.
Member

replace twitter with the new index name

Contributor Author

good catch :)

==== Support custom normalizer in Analyze API

Analyze API can analyze normalizer and custom normalizer.
In previous versions of Elasticsearch, Analyze API is required `tokenizer` or `analyzer` parameter.
Member

nit: "is requiring a"


Analyze API can analyze normalizer and custom normalizer.
In previous versions of Elasticsearch, Analyze API is required `tokenizer` or `analyzer` parameter.
In Elaticsearch 6.0.0, Analyze API analyze a text as a keyword field with custom normalizer if `char_filter`/`filter` without `tokenizer`/`analyzer`.
Member

nit: "can analyze",
nit: "... or if char_filter/filter is set and tokenizer/analyzer is not set"

johtani (Contributor Author) commented Jun 11, 2017

@elasticmachine test this please

@johtani johtani force-pushed the support_normalizer_in_analyze_api branch 4 times, most recently from a2dbf1d to 39c3eec Compare June 12, 2017 05:47
johtani (Contributor Author) commented Jun 12, 2017

@cbuescher Passed CI, please review again after the conference :)

@cbuescher cbuescher self-assigned this Jun 20, 2017
cbuescher (Member) left a comment

Thanks @johtani, LGTM.
I left a few minor comments, feel free to adapt or simply ignore them. The question I left is only for my own understanding.

        requestAnalyzer.analyzer("analyzer");
        e = requestAnalyzer.validate();
        assertTrue(e.getMessage().contains("tokenizer/analyze should be null if normalizer is specified"));
    }
Member

++ thanks for adding these checks

@@ -122,6 +124,17 @@ public void testParseXContentForAnalyzeRequestWithInvalidStringExplainParamThrow
        assertThat(e.getMessage(), startsWith("explain must be either 'true' or 'false'"));
    }

    public void testParseXContentForAnalyzeRequestWithInvalidNromalizerThrowsException() throws Exception {
Member

nit: s/Nromalizer/Normalizer

Contributor Author

Good catch :)

    }

    public void testSerializationBwc() throws IOException {
        final byte[] data = Base64.getDecoder().decode("AAABA2ZvbwEEdGV4dAAAAAAAAAABCm5vcm1hbGl6ZXI=");
Member

More of a question: I see how we use this in other bwc tests as well, I guess it represents the request. How did you get that String, do we have tools for that?

Contributor Author

I'm not sure... I made the string using Base64.getEncoder() and sysout...

Contributor

Can you add a comment saying which request it represents and which version it was generated with?
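
The reply above describes producing the string with Base64.getEncoder() and printing it. Going the other way, decoding the payload, gives a rough hint about its contents, which may help when writing the requested comment. A minimal sketch: the class and method names are made up for illustration, and the code only surfaces printable ASCII runs. It does not interpret the wire format and cannot tell which version produced the bytes.

```java
import java.util.Base64;

// Hypothetical helper for eyeballing the BWC payload used in the test above.
final class BwcPayloadSketch {
    static final String ENCODED = "AAABA2ZvbwEEdGV4dAAAAAAAAAABCm5vcm1hbGl6ZXI=";

    static byte[] decode() {
        return Base64.getDecoder().decode(ENCODED);
    }

    // The encoding direction, as described in the reply above.
    static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Replaces non-printable bytes with '.' so embedded ASCII strings stand out.
    static String printable(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length);
        for (byte b : data) {
            sb.append(b >= 0x20 && b < 0x7f ? (char) b : '.');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(printable(decode()));
    }
}
```

Running `main` prints a line in which the strings `foo`, `text` and `normalizer` are visible between placeholder dots.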

        final byte[] data = Base64.getDecoder().decode("AAABA2ZvbwEEdGV4dAAAAAAAAAABCm5vcm1hbGl6ZXI=");
        final Version version = randomFrom(Version.V_5_0_0, Version.V_5_0_1, Version.V_5_0_2,
            Version.V_5_1_1, Version.V_5_1_2, Version.V_5_3_0, Version.V_5_3_1, Version.V_5_3_2,
            Version.V_5_4_0);
Member

nit: maybe use VersionUtils#randomVersionBetween()

Contributor Author

Oh, good to know. I didn't know about it :)

jpountz (Contributor) left a comment

Please call getMultiTermComponent on factories, but otherwise it looks good to me!

                throw new IllegalArgumentException("Custom normalizer may not use filter ["
                    + tokenFilter.name() + "]");
            }
        }
Contributor

looks like you are missing the call to MultiTermAwareComponent.getMultiTermComponent?
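
The pattern being asked for can be sketched as: reject any filter that is not multi-term aware, and otherwise use the component it returns. The interfaces and the two example factories below are local stand-ins invented for this sketch; the real Elasticsearch types differ in detail, and the example is not a statement about which real filters qualify.

```java
import java.util.ArrayList;
import java.util.List;

final class NormalizerFilterSketch {
    // Local stand-ins for the factory and multi-term interfaces (assumed shapes).
    interface FilterFactory { String name(); }
    interface MultiTermAware { Object getMultiTermComponent(); }

    // Example factories used only for illustration.
    static final class Lowercase implements FilterFactory, MultiTermAware {
        public String name() { return "lowercase"; }
        public Object getMultiTermComponent() { return this; }
    }

    static final class Shingle implements FilterFactory {
        public String name() { return "shingle"; }
    }

    // Returns the components a custom normalizer would use, or throws on an unsupported filter.
    static List<FilterFactory> forNormalizer(List<FilterFactory> factories) {
        List<FilterFactory> result = new ArrayList<>();
        for (FilterFactory factory : factories) {
            if (!(factory instanceof MultiTermAware)) {
                throw new IllegalArgumentException("Custom normalizer may not use filter ["
                    + factory.name() + "]");
            }
            // The step flagged in the review: use the multi-term component, not the factory itself.
            result.add((FilterFactory) ((MultiTermAware) factory).getMultiTermComponent());
        }
        return result;
    }
}
```

Checking `instanceof` alone is not enough: the component returned by getMultiTermComponent may differ from the original factory, which is why the call matters.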


Support normalizer param
Support custom normalizer with char_filter/filter param

Closes elastic#23347
Add AnalyzeRequestTest
Fix some comments
Fix some comments
Remove non-use imports

elastic#23347
@johtani johtani force-pushed the support_normalizer_in_analyze_api branch from 6b06274 to 8d72356 Compare June 27, 2017 22:34
@johtani johtani force-pushed the support_normalizer_in_analyze_api branch from 8d72356 to c6dd360 Compare June 28, 2017 06:49
johtani (Contributor Author) commented Jun 28, 2017

@jpountz Rebased on master and moved the check and call logic into parseTokenFilterFactories.
Could you review this again?

jpountz (Contributor) left a comment

LGTM

@johtani johtani merged commit 6894ef6 into elastic:master Jul 4, 2017
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jul 4, 2017
* master:
  [Analysis] Support normalizer in request param (elastic#24767)
  Remove deprecated IdsQueryBuilder constructor (elastic#25529)
  Adds check for negative search request size (elastic#25397)
  test: also inspect the upgrade api response to check whether the upgrade really ran
  [DOCS] restructure java clients docs pages (elastic#25517)

4 participants