Add `match_only_text`, a space-efficient variant of `text`. #66172

jpountz · 2020-12-10T14:19:48Z

This adds a new match_only_text field, which indexes the same data as a text
field that has index_options: docs and norms: false and uses the _source
for positional queries like match_phrase.

Unlike text, this field doesn't support scoring and span queries.

This new field is part of the text family, so it is returned as a text field in the
_field_caps output.

Closes #64467

This adds a new `match_only_text` field, which indexes the same data as a `text` field that has `index_options: docs` and `norms: false` and uses the `_source` for positional queries like `match_phrase`. Unlike `text`, this field doesn't support scoring. An alternative to this new field could have been to make the `text` field still able to run positional queries when positions are not indexed, but I like this new field better because it avoids questions around how scoring should perform.
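The "approximate with a docs-only index, then confirm against `_source`" idea described above can be sketched without Lucene. This is only an illustration under stated assumptions: the class name `SourceConfirmedPhraseSketch`, the whitespace analyzer, and the in-memory maps are all hypothetical and are not the code in this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical, Lucene-free illustration; not the PR's SourceConfirmedTextQuery.
class SourceConfirmedPhraseSketch {
    // Docs-only postings: term -> doc IDs (no frequencies, no positions, no norms).
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // Stand-in for _source: the original text of each document.
    private final List<String> source = new ArrayList<>();

    // Toy analyzer (assumption): lowercase and split on whitespace.
    static List<String> analyze(String text) {
        String trimmed = text.trim().toLowerCase();
        return trimmed.isEmpty() ? List.of() : Arrays.asList(trimmed.split("\\s+"));
    }

    int index(String text) {
        int docId = source.size();
        source.add(text);
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Approximation: documents that contain every phrase term, in any order.
    List<Integer> approximate(List<String> phraseTerms) {
        TreeSet<Integer> candidates = null;
        for (String term : phraseTerms) {
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (candidates == null) {
                candidates = new TreeSet<>(docs);
            } else {
                candidates.retainAll(docs);
            }
        }
        return candidates == null ? List.of() : new ArrayList<>(candidates);
    }

    // Confirmation: re-analyze the stored source of each candidate and
    // keep it only if the phrase terms occur at consecutive positions.
    List<Integer> matchPhrase(String phrase) {
        List<String> phraseTerms = analyze(phrase);
        List<Integer> hits = new ArrayList<>();
        for (int docId : approximate(phraseTerms)) {
            List<String> tokens = analyze(source.get(docId));
            if (Collections.indexOfSubList(tokens, phraseTerms) >= 0) {
                hits.add(docId);
            }
        }
        return hits;
    }
}
```

The trade-off the sketch makes visible: the index stays small because it only records which documents contain a term, while each positional query pays for re-analyzing the source of every candidate document.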

elasticmachine · 2020-12-10T14:19:51Z

Pinging @elastic/es-search (Team:Search)

romseygeek

This looks very cool! I have some suggestions around making the confirmation step a bit more efficient, although I guess there will be tradeoffs between the cost of building the index from source and how much work has to be done during analysis.

romseygeek · 2020-12-10T14:28:23Z

...ext/src/main/java/org/elasticsearch/xpack/matchonlytext/mapper/MatchOnlyTextFieldMapper.java

+    }
+
+    @Override
+    protected void doXContentBody(XContentBuilder builder, boolean includeDefaults, Params params) throws IOException {


I don't think you need to copy this bit, because there will be no BWC issue with a new mapper

...text/src/main/java/org/elasticsearch/xpack/matchonlytext/query/SourceConfirmedTextQuery.java

romseygeek · 2020-12-10T14:42:56Z

...src/test/java/org/elasticsearch/xpack/matchonlytext/query/SourceConfirmedTextQueryTests.java

+        }
+    }
+
+    public void testSpanNear() throws Exception {


Does this work, given the TODO in SourceConfirmedTextQuery implementation above?

Yes, the query just uses a MatchAllDocsQuery as an approximation, but the query "works".

Hopefully we'll create better approximations for spans in a follow-up.

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

romseygeek · 2020-12-10T14:46:35Z

server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

@@ -670,7 +670,7 @@ private void merge(FieldMapper toMerge, Conflicts conflicts) {
            }
        }

-        protected void toXContent(XContentBuilder builder, boolean includeDefaults) throws IOException {
+        public void toXContent(XContentBuilder builder, boolean includeDefaults) throws IOException {


I think you can avoid changing this, the concerns about BWC only apply to existing mappings and won't be a problem for entirely new field mappers.

romseygeek

LGTM, thanks @jpountz

...ext/src/main/java/org/elasticsearch/xpack/matchonlytext/mapper/MatchOnlyTextFieldMapper.java

jtibshirani · 2020-12-16T02:13:08Z

docs/reference/mapping/types.asciidoc

@@ -69,6 +69,7 @@ values.
 ==== Text search types

 <<text,`text`>>:: Analyzed, unstructured text.
+<<match-only-text,`match_only_text`>>:: A more space-efficient variant of `text`.


We decided to make a single page for the keyword type family, so that users could easily compare the trade-offs between each type. It'd be nice to do the same for the new 'text' family.

jpountz · 2020-12-16T11:57:10Z

Thanks for the helpful reviews @romseygeek and @jtibshirani.

jtibshirani

I left a couple last comments. One thing I wanted to double-check -- I guess some users could want scoring for most queries (as it can help surface relevant content) but confirm positional queries using _source? Maybe we're keeping things simple at first and assume this use case very often sorts on timestamp.

jtibshirani · 2020-12-16T20:50:05Z

...ext/src/main/java/org/elasticsearch/xpack/matchonlytext/mapper/MatchOnlyTextFieldMapper.java

+            return SourceValueFetcher.toString(name(), context, format);
+        }
+
+        private Query toQuery(Query query, QueryShardContext queryShardContext) {


If a user disables _source on the field, we still accept positional queries but will just return no results. Maybe we should throw an error instead, as we do when positions are disabled. We could at least just check QueryShardContext#isSourceEnabled to catch cases where source is turned off entirely.
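A minimal sketch of the fail-fast behavior suggested here, assuming the caller already knows whether `_source` is enabled (for example from `QueryShardContext#isSourceEnabled`). The class name, method name, and error message below are hypothetical and are not taken from the PR.

```java
// Hypothetical guard; names and message are illustrative, not the Elasticsearch API.
class SourceRequiredGuard {
    // Reject positional queries up front instead of silently returning no hits.
    static void checkSourceEnabled(String fieldName, boolean sourceEnabled) {
        if (sourceEnabled == false) {
            throw new IllegalArgumentException(
                "Field [" + fieldName + "] of type [match_only_text] cannot run positional queries since [_source] is disabled.");
        }
    }
}
```

Failing at query construction time gives the user an actionable error, whereas an empty result set is indistinguishable from a query that genuinely matches nothing.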

jtibshirani · 2020-12-16T20:55:23Z

...text/src/main/java/org/elasticsearch/xpack/matchonlytext/query/SourceConfirmedTextQuery.java

+        } else if (query instanceof MultiPhraseQuery) {
+            return approximate((MultiPhraseQuery) query);
+        } else {
+            // TODO: spans and intervals


Should we file an issue to track this? We try not to have field types or queries that scan all documents (unless they're runtime fields of course :))

jpountz · 2020-12-17T15:38:24Z

> I guess some users could want scoring for most queries (as it can help surface relevant content) but confirm positional queries using _source?

I went back and forth on this for a long time. Besides this PR, I considered the following options:

- Keep norms and term frequencies indexed. If we did this, we could have 100% scoring compatibility with the text field type by computing phrase frequencies from the _source instead of positions. I didn't like this option because I felt most users wouldn't care enough about scoring to spend disk space on term frequencies and norms.
- Only index docs, as the PR does, but try to maintain some scoring. We can't have 100% scoring compatibility because most similarities need norms, and some of them also need total term frequencies, which are only computed when term frequencies are indexed. We could still have scoring enabled, e.g. by always using 1 for term frequencies and field lengths, but I worried that this might be more confusing than disabling scoring entirely, since it would differ from the scoring that you get on text fields.

This field still provides some scoring, in the sense that the score of a document is equal to the number of query terms that match on this field. So if your query is a match query for 404 robots.txt, documents that contain both terms will be returned before documents that contain only one of these terms when sorting by score. My assumption is that this would be good enough for the vast majority of users who have a logging use case.
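To make the "score equals the number of matching terms" behavior concrete, here is a small sketch. It assumes a whitespace tokenizer and a constant contribution of 1 per matching query term; the class and method names are hypothetical and this is not how Elasticsearch computes the score internally.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of constant per-term scoring; not Elasticsearch code.
class MatchCountScoreSketch {
    // Each query term found in the document contributes exactly 1.
    // Term frequency and field length are ignored, as they would be without
    // indexed frequencies and norms.
    static int score(String query, String document) {
        Set<String> docTerms = new HashSet<>(Arrays.asList(document.toLowerCase().split("\\s+")));
        int score = 0;
        for (String term : query.toLowerCase().split("\\s+")) {
            if (!term.isEmpty() && docTerms.contains(term)) {
                score++;
            }
        }
        return score;
    }
}
```

With the query `404 robots.txt`, a document containing both terms gets 2 and a document that repeats `404` many times but lacks `robots.txt` still gets 1, which is the ordering described above.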

jpountz · 2020-12-17T16:42:07Z

I had initially thought I could make this field support span and interval queries, but this is more challenging than I thought because these query builders do not consult the field mappers when they are constructed and just assume the field has been indexed with positions. We could fix this by adding new query builders in MappedFieldType, but this would be a lot of work, and I believe we could go without spans and intervals for the first version.

jpountz · 2021-04-01T09:27:47Z

I finally took some time to come back to this PR. I moved the code from a new plugin in x-pack to modules/mapper-extras and changed license headers. I also added support for the intervals queries we have builders for in MappedFieldType, that is match and prefix queries, in order to further reduce the number of differences between text and match_only_text. @romseygeek Could you maybe review this part before I merge, specifically the SourceIntervalsSource class?

jtibshirani · 2021-04-01T18:10:04Z

docs/reference/mapping/types/match-only-text.asciidoc

+
+[horizontal]
+
+<<analyzer,`analyzer`>>::


One thought I had since reviewing: if this is just targeted at log lines, would it make sense to cut down on analysis config options? For example, not allowing a different search analyzer or search quote analyzer. Or even removing the option to configure the analyzer, and just using a default that targets log lines. This could make it simpler to maintain long BWC for the field type. (This is a rough idea, and I am not sure it makes sense... maybe many users in fact tweak analyzers for log lines.)

Actually I think it's a good call, e.g. as far as I know, ECS doesn't configure analyzers. It would be much easier to add it in the future if it proves needed than to remove it when we want to ease backward compatibility.

romseygeek

The IntervalsSource parts look good to me.

romseygeek · 2021-04-06T15:00:36Z

modules/mapper-extras/src/main/java/org/elasticsearch/index/query/SourceIntervalsSource.java

+                return doc;
+            }
+
+            private boolean setIterator(int doc) {


I think this can throw IOException?

romseygeek · 2021-04-06T15:02:13Z

modules/mapper-extras/src/main/java/org/elasticsearch/index/query/SourceIntervalsSource.java

+    public int hashCode() {
+        // Not using matchesProvider and valueFetcherProvider, which don't identify this source but are only used to avoid scanning linearly
+        // through all documents
+        return Objects.hash(in, indexAnalyzer);


The index analyzer should be immutable for an open index, I think? So I'm not sure that it needs to be included here or in equals.

Agreed. I'm including it because I want to avoid making too many assumptions about how this class is used by Elasticsearch. Is it fine with you?
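The equality contract being discussed can be sketched as follows. This is a simplified stand-in, not the reviewed class: the name `SourceBackedSourceSketch` is hypothetical, and the wrapped source and analyzer are modeled as plain strings so the example stays self-contained.

```java
import java.util.Objects;
import java.util.function.Supplier;

// Hypothetical stand-in for the class under review; field types are simplified.
final class SourceBackedSourceSketch {
    private final String in;                              // stands in for the wrapped intervals source
    private final String indexAnalyzer;                   // stands in for the index analyzer
    private final Supplier<String> valueFetcherProvider;  // performance helper, not part of identity

    SourceBackedSourceSketch(String in, String indexAnalyzer, Supplier<String> valueFetcherProvider) {
        this.in = in;
        this.indexAnalyzer = indexAnalyzer;
        this.valueFetcherProvider = valueFetcherProvider;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (other == null || getClass() != other.getClass()) {
            return false;
        }
        SourceBackedSourceSketch that = (SourceBackedSourceSketch) other;
        // Compare exactly the fields used in hashCode() so the two stay consistent.
        return Objects.equals(in, that.in) && Objects.equals(indexAnalyzer, that.indexAnalyzer);
    }

    @Override
    public int hashCode() {
        // The provider is deliberately excluded: it does not identify the source.
        return Objects.hash(in, indexAnalyzer);
    }
}
```

Including the analyzer costs little and keeps the class correct even if it is later used in a context where analyzers can differ; the only hard requirement is that `equals` and `hashCode` use the same set of fields.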

jpountz · 2021-04-21T11:54:16Z

This field now fully supports intervals thanks to #71429.

jpountz added the >feature, release highlight, and :Search Foundations/Mapping labels Dec 10, 2020

elasticmachine added the Team:Search label Dec 10, 2020

jpountz requested a review from romseygeek December 10, 2020 14:19

romseygeek reviewed Dec 10, 2020

iter

6b0cb21

romseygeek approved these changes Dec 15, 2020

jtibshirani reviewed Dec 15, 2020

jtibshirani reviewed Dec 16, 2020

jpountz added 4 commits December 16, 2020 10:34

Merge branch 'master' into feature/source_phrase_queries

7525e4f

Use source lookup from the shard context.

e57699e

Update release version.

9ec31c6

Consolidate docs with text.

7a03a0f

jpountz added v7.12.0 v8.0.0 labels Dec 16, 2020

rayafratkina mentioned this pull request Dec 16, 2020

Add support for 'match_only_text' field type elastic/kibana#86107

Closed

alisonelizabeth mentioned this pull request Dec 16, 2020

[Mappings editor] Add UI form for match_only_text field type elastic/kibana#86113

Open

jtibshirani reviewed Dec 16, 2020

timroes added the >new-field-mapper label Dec 17, 2020

jpountz added 5 commits December 17, 2020 13:48

Fail phrase queries when _source is disabled.

5774bc9

Remove support for store.

c0be502

Add tests for span and intervals queries.

feaf2f8

Test for fuzzy query.

d51db6c

More tests.

71adb75

jpountz added 3 commits April 1, 2021 10:36

Merge branch 'master' into feature/source_phrase_queries

96f668b

iter

3a85af4

iter

448eb28

Fix compilation.

f3e77f8

jtibshirani reviewed Apr 1, 2021

Analysis is no longer configurable.

c5f4f04

romseygeek reviewed Apr 6, 2021

jpountz mentioned this pull request Apr 7, 2021

Better out-of-the-box mappings for logs, metrics and synthetics #64978

Merged

jpountz added 4 commits April 7, 2021 15:21

iter

4818edc

Merge branch 'master' into feature/source_phrase_queries

339c8dc

Intervals unit tests.

e652aa4

Fix docs now that match_only_text supports all interval queries.

31a5bba

jpountz added v7.14.0 and removed v7.13.0 labels Apr 21, 2021

Undo testing hack.

3783f18

Merge branch 'master' into feature/source_phrase_queries

edaa5b0

jpountz merged commit 83113ec into elastic:master Apr 22, 2021

jpountz deleted the feature/source_phrase_queries branch April 22, 2021 06:41

jpountz mentioned this pull request Apr 22, 2021

Add match_only_text, a space-efficient variant of text. #72064

Merged

jpountz mentioned this pull request Jun 17, 2021

_id-less indices #48699

Open

stevejgordon mentioned this pull request Jul 1, 2021

7.14.0 Meta Ticket elastic/elasticsearch-net#5776

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

ywelsch mentioned this pull request Aug 2, 2021

Add more 7.14 release highlights #75939

Merged

ywelsch added a commit that referenced this pull request Aug 2, 2021

Add more 7.14 release highlights (#75939)

d6e1de5

Adds release highlights for match_only_text (#66172) and more memory-efficient composite aggregations (#74559).

probakowski pushed a commit to probakowski/elasticsearch that referenced this pull request Aug 2, 2021

Add more 7.14 release highlights (elastic#75939)

5277305

Adds release highlights for match_only_text (elastic#66172) and more memory-efficient composite aggregations (elastic#74559).

swallez mentioned this pull request Mar 18, 2022

Missing match_only_text field mapping. elastic/elasticsearch-specification#1548

Closed
