Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add reindex extension using scroll api. #270

Closed
wants to merge 1 commit into from

Conversation

cavvia
Copy link
Contributor

@cavvia cavvia commented Jan 19, 2016

Add a reindex extension similar to a helper found in the official python library (here).

This moves data (and not mappings) from a source index to a target index. It's useful when updating mapping definitions or moving from one hosted cluster to another. Some of the functionality is missing - such as chunk_size propagating to the bulk call - due to missing bulk helpers.

This has been tested functionally with Elasticsearch 2.1.x. Indices must have _source enabled.

@karmi
Copy link
Contributor

karmi commented Mar 4, 2016

Hi, sorry for the late reply, got tied with ElasticON and other things.

The reindex extension is something I always wanted to do. However, couple of days ago, a dedicated "Reindex API" has been added to Elasticsearch itself.

Your extension is of course something which would work with older versions of Elasticsearch, but I'm wondering whether this is something we want to include in the gem and support going forward... What do you think?

On a technical note, it would be nice to be able to pass a block to the class initializer which would transform the documents -- that way, the extension would be even more useful, allowing to "fix" the data in addition to fixing index mappings/settings.

@karmi
Copy link
Contributor

karmi commented Mar 14, 2016

Hi @cavvia, any thoughts on how we should proceed with the feature in light of the "Reindex" API in Elasticsearch itself?

@cavvia
Copy link
Contributor Author

cavvia commented Apr 8, 2016

Hi there. The new reindex API looks pretty comprehensive to me, but it does not support reindexing to a remote cluster - it's discussed in the PR as the "most ambitious" use case but it is explicitly marked as 'not done'.

Reindexing to any cluster (remote or not) using the scroll API is precisely the use case I handle in this extension.

Moreover, for anyone with hosted elasticsearch this extension is going to be useful for quite some time to come, as many teams are not even on version 2.1 yet let alone 2.3.

It's your call on whether you want to include this, but it does still meet a use case which the native API does not seem to.

@karmi karmi mentioned this pull request May 15, 2016
@karmi
Copy link
Contributor

karmi commented May 18, 2016

Hi @cavvia, I agree with your reasoning -- I have rebased and integrated your patch, and I am working on some tweaks to the API and implementation. I'll leave a note here when I'm done and it's all merged.

karmi pushed a commit that referenced this pull request May 19, 2016
This patch adds a re-index extension to the client, which allows
to copy document from one index or cluster to another one.

Related: #270, #68
@karmi karmi closed this in 8dd5a04 May 19, 2016
@karmi
Copy link
Contributor

karmi commented May 19, 2016

So, I've rather significantly refactored the extension, have a look at the attached commit. Among the most important changes is that I've aligned the arguments to the method with the core "Reindex" API, allowing people easier crossing between these two. Also, a reindex method is added directly to the client when the extension is loaded, for a more usable interface. An example:

require 'elasticsearch'
require 'elasticsearch/extensions/reindex'

client = Elasticsearch::Client.new
target_client = Elasticsearch::Client.new url: 'http://localhost:9250', log: true

client.index index: 'test', type: 'd', body: { title: 'foo' }

client.reindex source: { index: 'test' },
               target: { index: 'test', client: target_client, transform: lambda { |doc| doc['_source']['title'].upcase! } },
               refresh: true

# => { errors: 0 }

target_client.search index: 'test'

# => ... hits ... "title"=>"FOO"

@cavvia
Copy link
Contributor Author

cavvia commented May 19, 2016

Thanks for taking a look at this and improving the design! Looks great to me and the example is easy to follow.

karmi added a commit that referenced this pull request May 25, 2016
Moved the `:transform` option from `:target` into the top level
of Hash arguments -- this seems like a more natural way.

Related: #270
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants