Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove includes and excludes from _source #10814

Merged
merged 1 commit into from
Apr 28, 2015

Conversation

rjernst
Copy link
Member

@rjernst rjernst commented Apr 27, 2015

Regardless of the outcome of #8142, we should at least enforce that
when _source is enabled, it is sufficient to reindex. This change
removes the excludes and includes settings, since these modify
the source, causing us to lose the ability to reindex some fields.

Note: I also renamed one of the UpdateMappingTests so the class names are unique.

@rjernst rjernst added v2.0.0-beta1 :Search Foundations/Mapping Index mappings, including merging and defining field types labels Apr 27, 2015
@jpountz
Copy link
Contributor

jpountz commented Apr 27, 2015

LGTM

rjernst added a commit to rjernst/elasticsearch that referenced this pull request Apr 28, 2015
Regardless of the outcome of elastic#8142, we should at least enforce that
when _source is enabled, it is sufficient to reindex. This change
removes the excludes and includes settings, since these modify
the source, causing us to lose the ability to reindex some fields.

closes elastic#10814
Regardless of the outcome of elastic#8142, we should at least enforce that
when _source is enabled, it is sufficient to reindex. This change
removes the excludes and includes settings, since these modify
the source, causing us to lose the ability to reindex some fields.

closes elastic#10814
@rjernst rjernst merged commit bf09e58 into elastic:master Apr 28, 2015
@rjernst
Copy link
Member Author

rjernst commented Apr 28, 2015

I also added migration docs and removed docs in master. Will add deprecation in 1.x as well and link here.

rjernst added a commit that referenced this pull request Apr 28, 2015
@uschindler
Copy link
Contributor

This is a big no-go for me and would cause big headaches to me. I have excluded a lot of data from the original source field (which is still duplicate information, but cannot be easily modelled with copy_to). Reindexing is not an issue to me, I will always be able to regenerate the data from the external source or even _source.

So big -1, sorry. This is a desaster. This now slows down searching, because it transfers a large amount of useless data.

@rjernst rjernst deleted the fix/source-lock-down branch May 8, 2015 06:47
@clintongormley
Copy link

@uschindler we have plans to support disabling _source again, but in a way that makes it less attractive to the uninitiated. But I'm surprised by the include/exclude thing. Could you elaborate more on your use case? Wondering if #9034 could help.

@uschindler
Copy link
Contributor

Hi @clintongormley !
#9034 looks like a good improvement, but there is no activity at the moment.

In general I use the _source field and I am happy with it, so I would never disable it completely, so I am just complaining about removing the functionality to filter parts during indexing.

In general there are 2 user types:

  • Users that want Elasticsearch as their primary data storage. Of course those users need to be a ble to reindex and change mappings. Those users may also need to update some of the fields (like you can do "update table set column=..." in RDBMS)
  • Users that just use Elasticsearch as a search engine for semi-structured fulltext data. Reindexing is not a problem for them, because they can generate the data from database, filesystem,...

In our case its the second type of user. So theoretically I could disable the _source field completely, but it is still good for easily fetching the information to show search results. Because of this, it is perfectly fine to only store the JSON information in the index thats needed for displaying search results and some statistics,...

Basically the Idea behind Lucene is to have an index where you index documents and only store the information in the index to show the document results. With this change (unable to disable _source or unable to filter _source) this is no longer possible.

The performance improvements are fine for stored fields, but if you have a lot of textual data solely there for searching (which never get retrieved) - partly up to several megabytes of text from PDF files, xml documents,... the additional overhead to extract the required fields is immense.

By changing to CBOR encoding of the _source field, this got better, but loading the whole stuff is still an overhead, if you just show like 2 kilobytes of thumbnail information. Instead you have to load 3 megabytes of CBOR data from the source field and let Elasticsearch filter it during returning search results.

One additional thing that happens on our side (as described before): We sometimes have to redundantly add the same data to the _source document several times, because a simple copy_to is not always easily possible. If copy_to would allow groovy scripts to dynamically copy stuff to other fields, I could live with that. E.g. one thing is to build suggestion fields or combining several fields with a special formatting, e.g. for Facets. In fact the data is redundant and could completely be regenerated also from the "filtered" source.

I agree with the problems that happen with disabled _source or applied filtering on _source when people want to do some things like updating or reindexing, but there should be done 2 things:

  • add warnings to the user
  • if a mapping contains filters on _source or has _source completely disabled, just report "unsupported operation" to the user if one tries to reindex or update existing fields. A clear error message is better than silently missing fields. By that a developer knows ASAP that disabling source or filtering source is a bad idea for primary data storage where you want reindexing and updates.

@clintongormley
Copy link

@uschindler thanks for the reply. I'm going to open a separate issue to discuss alternative solutions. Just one comment in the meantime:

One additional thing that happens on our side (as described before): We sometimes have to redundantly add the same data to the _source document several times, because a simple copy_to is not always easily possible. If copy_to would allow groovy scripts to dynamically copy stuff to other fields, I could live with that. E.g. one thing is to build suggestion fields or combining several fields with a special formatting, e.g. for Facets. In fact the data is redundant and could completely be regenerated also from the "filtered" source.

Have you seen the transform script? While the original _source field is stored, the transform script allows you to change the _source field that is sent for indexing, which sounds like what you need.

See http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-transform.html

@clintongormley
Copy link

@uschindler ticket opened here: #11116

@uschindler
Copy link
Contributor

Hi Clinton,

thanks for opening this issue!

Have you seen the transform script? While the original _source field is stored, the transform script allows you to change the _source field that is sent for indexing, which sounds like what you need.

I know this functionality and I already thought about using it. The common use-case is to prepare the "suggest" field, which needs a different processing than all the other fields. This is for example here currently done in the preprocessing of the XML data using XSL... I could rewrite that using groovy!

My general problem is, as said on the new issue: I just want to keep _source, but decide on myself what I really want to store in the index, because I index tons of data, but only want to display very small snippets.

@clintongormley clintongormley changed the title Mappings: Remove includes and excludes from _source Remove includes and excludes from _source Aug 13, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>non-issue :Search Foundations/Mapping Index mappings, including merging and defining field types
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants