Mappings: Ensure that reindexing is always possible #8142

Closed
jpountz opened this issue Oct 17, 2014 · 14 comments · Fixed by #10915

@jpountz
Contributor

jpountz commented Oct 17, 2014

Our mappings are very permissive about what can be enabled or disabled; in particular, it is possible to disable _source. This is a problem if you later need to make a change that requires reindexing (e.g. an analyzer change), because the data then has to be pulled again from another data source.

Instead, we could disable some mapping options to make sure that we always have all the information needed to reindex (e.g. with client helpers). Here are the settings that we would need to enforce:

  • _source is enabled and stored
  • _timestamp, _ttl, _routing, _parent are stored
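As a rough illustration of the two rules above, here is a minimal Python sketch that inspects a type mapping represented as a plain dict. The function name `reindex_blockers` and the exact `enabled` / `store` keys are assumptions made for this example, not Elasticsearch code; in particular, the sketch treats a configured meta field as stored only when the mapping says `"store": True` explicitly, which may not match the real defaults of any given version.

```python
# Meta fields that the proposal wants stored so that reindexing can restore them.
META_FIELDS = ("_timestamp", "_ttl", "_routing", "_parent")


def reindex_blockers(mapping):
    """Return reasons why this (hypothetical, dict-shaped) mapping may block reindexing."""
    problems = []

    # _source is enabled by default, so only an explicit non-True value disables it.
    if mapping.get("_source", {}).get("enabled", True) is not True:
        problems.append("_source is disabled")

    for name in META_FIELDS:
        field = mapping.get(name)
        if field is None:
            continue  # the field is not configured in this mapping, nothing to check
        # Assumption of this sketch: stored means an explicit "store": True.
        if field.get("store") is not True:
            problems.append(name + " is not stored")

    return problems


ok = {"_source": {"enabled": True}, "_timestamp": {"enabled": True, "store": True}}
bad = {"_source": {"enabled": False}, "_ttl": {"enabled": True}}
print(reindex_blockers(ok))
print(reindex_blockers(bad))
```

A check like this could run when a mapping is created or updated, rejecting the request if the returned list is non-empty.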

Relates to #8870

@clintongormley

+1

@rmuir
Contributor

rmuir commented Oct 17, 2014

+1

@dadoonet
Member

I'm aware of some corner cases where people would definitely not store the source and only want to get back the _id.
Because they send a massive amount of data, they want to store neither _source nor individual fields.
The same goes for users who only want to compute.

Not saying that we should not do it but that we need to be aware of the consequences for some users.
As I said, extreme corner cases though.

@rjernst
Member

rjernst commented Oct 18, 2014

+1

@Vineeth-Mohan

+1

@clintongormley

I can imagine there are certain expert cases where users actually do want to disable storing _source, so perhaps we should still maintain the ability to do so, but make it much harder. eg some non-dynamic cluster setting, the documentation for which also lists all the things you will lose if you disable _source.

@bleskes
Contributor

bleskes commented Feb 12, 2015

+1 to that latest suggestion.

On Thu, Feb 12, 2015 at 7:04 PM, Clinton Gormley notifications@github.com wrote:

I can imagine there are certain expert cases where users actually do want to disable storing _source, so perhaps we should still maintain the ability to do so, but make it much harder. eg some non-dynamic cluster setting, the documentation for which also lists all the things you will lose if you disable _source.


@jpountz
Contributor Author

jpountz commented Feb 12, 2015

Can we get more information about these expert use-cases? Having the _source stored is so valuable that I would like to make sure that we don't allow users to disable this without good reasons. And if we still decide to allow disabling the source, can we consider making backward compatibility best-effort only on such indices?

@clintongormley

eg from an email I received:

A binary search engine? How are you indexing the binary fields to make them searchable?

The client uses Lucene to search over his binary data. In summary, they insert and query data by sending chunks in a binary format. Internally this binary format is broken into terms (small base64 strings) and an offset for each term. During the search they verify whether the terms and the offsets match, returning the documents with the most matching terms that respect the offsets. They have created new queries, scorers, collectors, refiners, etc.

This custom plugin wasn't able to keep up with indexing and query speed, but disabling the source allowed them to do that.

I realise that this is not typical, hence "expert case", but not storing the _source could be a tradeoff that some people would be willing to make. I just want to make it hard enough that the ordinary user won't do it without realising the consequences.

And if we still decide to allow disabling the source, can we consider making backward compatibility best-effort only on such indices?

Yes absolutely. That is one of the tradeoffs.

@colings86
Contributor

Maybe we should have a cluster-level setting to allow indices to disable the _source field? This way we can put a big notice in the documentation warning that this is an expert feature and explaining what effect it has on backwards compatibility. Also, with a cluster-wide setting, we can detect the setting on startup and put a warning in the logs stating that backwards compatibility is not guaranteed with this setting on. This way we will have made it very clear to the user that there are serious implications to turning on this setting.
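To make the suggestion concrete, here is a minimal Python sketch of such a gate. Everything in it is hypothetical: the setting name `cluster.mapping.allow_source_disabled`, the function names, and the dict-shaped settings and mappings are invented for illustration and do not correspond to any real Elasticsearch setting or API.

```python
import logging

logger = logging.getLogger("node.startup")

# Hypothetical setting name, invented for this sketch only.
ALLOW_DISABLE_SOURCE = "cluster.mapping.allow_source_disabled"


def check_startup(settings):
    """Log a warning at startup if the expert setting is on; return whether it is on."""
    allowed = settings.get(ALLOW_DISABLE_SOURCE, False) is True
    if allowed:
        logger.warning(
            "%s is enabled: indices may disable _source, and backwards "
            "compatibility is not guaranteed for such indices",
            ALLOW_DISABLE_SOURCE,
        )
    return allowed


def validate_mapping(mapping, settings):
    """Reject a mapping that disables _source unless the cluster setting allows it."""
    source_enabled = mapping.get("_source", {}).get("enabled", True)
    if not source_enabled and settings.get(ALLOW_DISABLE_SOURCE, False) is not True:
        raise ValueError("disabling _source requires " + ALLOW_DISABLE_SOURCE)


# With the setting on, the warning is logged and the mapping is accepted.
expert_settings = {ALLOW_DISABLE_SOURCE: True}
check_startup(expert_settings)
validate_mapping({"_source": {"enabled": False}}, expert_settings)
```

Keeping the switch at the cluster level, rather than per index, is what lets the node warn once at startup instead of relying on users reading each index's mapping.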

@brwe
Contributor

brwe commented Feb 24, 2015

Disabling _source does reduce the size on disk:
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/elasticsearch/cZ7tkx8HZ70/AOFAt42mNt0J
I think that when people store the raw data outside ES in some secondary store, we should not force them to store the source in ES again, or even make it hard to disable it.

@rmuir
Contributor

rmuir commented Feb 24, 2015

This custom plugin wasn't able to keep up with indexing and query speed, but disabling the source allowed them to do that.

They are indexing base64 binary? Come on man, you can't get much more esoteric than that.

These excuses are pathetic.

@rmuir
Contributor

rmuir commented Feb 24, 2015

Also, we optimized stored fields in Lucene a lot for 5.0; CPU costs around merging and so on are just not comparable to previous versions in most cases.

So any old complaints, benchmarks, etc. around this ---> obsolete.

@s1monw
Contributor

s1monw commented Feb 24, 2015

They are indexing base64 binary? Come on man, you can't get much more esoteric than that.

I really agree with Robert here. It's requirements like these that prevent progress and make 1 user happy at the cost of 99% having a harder life!

rjernst added a commit to rjernst/elasticsearch that referenced this issue Apr 27, 2015
Regardless of the outcome of elastic#8142, we should at least enforce that
when _source is enabled, it is sufficient to reindex. This change
removes the excludes and includes settings, since these modify
the source, causing us to lose the ability to reindex some fields.
rjernst added a commit to rjernst/elasticsearch that referenced this issue Apr 28, 2015
Regardless of the outcome of elastic#8142, we should at least enforce that
when _source is enabled, it is sufficient to reindex. This change
removes the excludes and includes settings, since these modify
the source, causing us to lose the ability to reindex some fields.

closes elastic#10814
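
The reasoning in the commit message above can be sketched in a few lines of Python. This is a simplified model, not the Elasticsearch implementation: the helper `stored_source` is invented for the example and only drops exact top-level field names, whereas the real includes/excludes settings operate on field paths and patterns.

```python
def stored_source(doc, excludes):
    """Return the copy of the document that would be kept as _source.

    Simplified model: top-level fields named in `excludes` are dropped.
    """
    return {key: value for key, value in doc.items() if key not in excludes}


original = {"title": "Reindexing", "body": "long text that was indexed"}

# The "body" field is indexed at write time but excluded from the stored _source.
kept = stored_source(original, excludes=["body"])

# A reindex can only read what was kept, so "body" cannot be rebuilt from it.
missing = sorted(set(original) - set(kept))
print(kept)
print(missing)
```

Once a field has been filtered out of _source, a reindex that reads documents back from the index has no way to recover it, which is why the settings were removed.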
rjernst added a commit to rjernst/elasticsearch that referenced this issue May 1, 2015
Current features (eg. update API) and future features (eg. reindex API)
depend on _source. This change locks down the field so that
it can no longer be disabled. It also removes legacy settings
compress/compress_threshold.

closes elastic#8142
rjernst added a commit to rjernst/elasticsearch that referenced this issue May 6, 2015
Current features (eg. update API) and future features (eg. reindex API)
depend on _source. This change locks down the field so that
it can no longer be disabled. It also removes legacy settings
compress/compress_threshold.

closes elastic#8142
closes elastic#10915
@rjernst rjernst removed the discuss label May 6, 2015
10 participants