Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add match_only_text, a space-efficient variant of text. #66172

Merged
merged 28 commits into from
Apr 22, 2021

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Dec 10, 2020

This adds a new match_only_text field, which indexes the same data as a text
field that has index_options: docs and norms: false and uses the _source
for positional queries like match_phrase.

Unlike text, this field doesn't support scoring and span queries.

This new field is part of the text family, so it is returned as a text field in the
_field_caps output.

Closes #64467

This adds a new `match_only_text` field, which indexes the same data as a `text`
field that has `index_options: docs` and `norms: false` and uses the `_source`
for positional queries like `match_phrase`. Unlike `text`, this field doesn't
support scoring.

An alternative to this new field could have been to make the `text` field still
able to run positional queries when positions are not indexed, but I like this
new field better because it avoids questions around how scoring should perform.
@jpountz jpountz added >feature release highlight :Search Foundations/Mapping Index mappings, including merging and defining field types labels Dec 10, 2020
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Dec 10, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very cool! I have some suggestions around making the confirmation step a bit more efficient, although I guess there will be tradeoffs between the cost of building the index from source and how much work has to be done during analysis.

}

@Override
protected void doXContentBody(XContentBuilder builder, boolean includeDefaults, Params params) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to copy this bit, because there will be no BWC issue with a new mapper

}
}

public void testSpanNear() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work, given the TODO in SourceConfirmedTextQuery implementation above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the query just uses a MatchAllDocsQuery as an approximation, but the query "works".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully we'll create better approximations for spans in a follow-up.

@@ -670,7 +670,7 @@ private void merge(FieldMapper toMerge, Conflicts conflicts) {
}
}

protected void toXContent(XContentBuilder builder, boolean includeDefaults) throws IOException {
public void toXContent(XContentBuilder builder, boolean includeDefaults) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can avoid changing this, the concerns about BWC only apply to existing mappings and won't be a problem for entirely new field mappers.

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @jpountz

@@ -69,6 +69,7 @@ values.
==== Text search types

<<text,`text`>>:: Analyzed, unstructured text.
<<match-only-text,`match_only_text`>>:: A more space-efficient variant of `text`.
Copy link
Contributor

@jtibshirani jtibshirani Dec 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided to make a single page for the keyword type family, so that users could easily compare the trade-offs between each type. It'd be nice to do the same for the new 'text' family'.

@jpountz
Copy link
Contributor Author

jpountz commented Dec 16, 2020

Thanks for the helpful reviews @romseygeek and @jtibshirani.

Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple last comments. One thing I wanted to double-check -- I guess some users could want scoring for most queries (as it can help surface relevant content) but confirm positional queries using _source? Maybe we're keeping things simple at first and assume this use case very often sorts on timestamp.

return SourceValueFetcher.toString(name(), context, format);
}

private Query toQuery(Query query, QueryShardContext queryShardContext) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a user disables _source on the field, we still accept positional queries but will just return no results. Maybe we should throw an error instead, as we do when positions are disabled. We could at least just check QueryShardContext#isSourceEnabled to catch cases where source is turned off entirely.

} else if (query instanceof MultiPhraseQuery) {
return approximate((MultiPhraseQuery) query);
} else {
// TODO: spans and intervals
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we file an issue to track this? We just try not to have field types/ queries that scan all documents (unless they're runtime fields of course :))

@timroes timroes added the >new-field-mapper Added when a new field type / mapper is being introduced label Dec 17, 2020
@jpountz
Copy link
Contributor Author

jpountz commented Dec 17, 2020

I guess some users could want scoring for most queries (as it can help surface relevant content) but confirm positional queries using _source?

I have long hesitated about this sort of things. I have considered the following options besides this PR:

  1. Keep norms and term frequencies indexed. If we did this we could have 100% scoring compatibility with the text field type by computing phrase frequencies from the _source instead of positions. I didn't like this option because I felt like most users wouldn't care enough about scoring to spend disk space on term frequencies and norms.
  2. Only index docs like the PR does but try to maintain some scoring. We can't have 100% scoring compatibility because most similarities need norms, and some of them also need total term frequencies, which are only computed when term frequencies are indexed. We could still have scoring enabled, e.g. by always using 1 for term frequencies and field lengths, but I worried that this might be more confusing than disabling scoring entirely since it would be different from the scoring that you get on text fields.

This field still provides some scoring in the sense that the score of a document will be equal with the number of matching terms from the query with this field. So if your query is a match query for 404 robots.txt, documents that contain both terms will be returned before documents that contain only one of these terms when sorting by score. My assumption is that it would be good enough for the immense majority of users who have a Logging use-case.

@jpountz
Copy link
Contributor Author

jpountz commented Dec 17, 2020

I had initially thought I could make this field support span and interval queries, but this is more challenging than I thought because these query builders to not consult the field mappers to be constructed and just assume the field has been indexed with positions. We could fix this by adding new query builders in MappedFieldType but this would be a lot of work, and I believe we could go without span and intervals for the first version.

@jpountz
Copy link
Contributor Author

jpountz commented Apr 1, 2021

I finally took some time to come back to this PR. I moved the code from a new plugin in x-pack to modules/mapper-extras and changed license headers. I also added support for the intervals queries we have builders for in MappedFieldType, that is match and prefix queries, in order to further reduce the number of differences between text and match_only_text. @romseygeek Could you maybe review this part before I merge, specifically the SourceIntervalsSource class?


[horizontal]

<<analyzer,`analyzer`>>::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thought I had since reviewing: if this is just targeted at log lines, would it make sense to cut down on analysis config options? For example, not allowing for a different search analyzer or search quote analyzers. Or even removing the option configuring the analyzer, just using a default that targets log lines. This could make it simpler to maintain long BWC for the field type. (This is a rough idea, and I am not sure it makes sense... maybe many users in fact tweak analyzers for log lines.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think it's a good call, e.g. as far as I know, ECS doesn't configure analyzers. It would be much easier to add it in the future if it proves needed than to remove it when we want to ease backward compatibility.

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IntervalsSource parts look good to me.

return doc;
}

private boolean setIterator(int doc) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can throw IOException?

public int hashCode() {
// Not using matchesProvider and valueFetcherProvider, which don't identify this source but are only used to avoid scanning linearly
// through all documents
return Objects.hash(in, indexAnalyzer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The index analyzer should be immutable for an open index, I think? So I'm not sure that it needs to be included here or in equals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I'm including it because I want to avoid making too many assumptions about how this class is used by Elasticsearch. Is it fine with you?

@jpountz jpountz added v7.14.0 and removed v7.13.0 labels Apr 21, 2021
@jpountz
Copy link
Contributor Author

jpountz commented Apr 21, 2021

This field now fully supports intervals thanks to #71429.

@jpountz jpountz merged commit 83113ec into elastic:master Apr 22, 2021
@jpountz jpountz deleted the feature/source_phrase_queries branch April 22, 2021 06:41
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Apr 22, 2021
…66172)

This adds a new `match_only_text` field, which indexes the same data as a `text`
field that has `index_options: docs` and `norms: false` and uses the `_source`
for positional queries like `match_phrase`. Unlike `text`, this field doesn't
support scoring.
@jpountz jpountz mentioned this pull request Jun 17, 2021
ywelsch added a commit that referenced this pull request Aug 2, 2021
Adds release highlights for match_only_text (#66172) and more memory-efficient composite aggregations (#74559).
probakowski pushed a commit to probakowski/elasticsearch that referenced this pull request Aug 2, 2021
Adds release highlights for match_only_text (elastic#66172) and more memory-efficient composite aggregations (elastic#74559).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature >new-field-mapper Added when a new field type / mapper is being introduced release highlight :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team v7.14.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can phrase queries on text fields use _source for verification if positions are not indexed?
7 participants