
Reindex API #15201

Closed
5 tasks done
nik9000 opened this issue Dec 2, 2015 · 12 comments
Assignees
Labels
:Distributed/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. >feature Meta

Comments

@nik9000
Member

nik9000 commented Dec 2, 2015

The time has come for Elasticsearch to implement a native API for reindexing! The first request I've found for this is (#492), filed back in 2010. The Task Management API (#15117) will make this easier to manage. This meta ticket will cover the following use cases:

  • [x] Resharding
  • [x] Incompatible mapping updates
  • [x] Touching documents to pick up mapping updates made on the fly
  • [x] Limiting the reindexed documents using a query
  • [x] Update-by-query style changes to portions of the index at a time
  • [ ] Copying an index from a remote cluster into this one

NOTICE: This meta issue gets fairly rambly from here on out. It will change. The list above will change. Everything is up for negotiation and everything needs to be prototyped before we're sure of anything. Things higher on the list are more likely to be in the final product.

Resharding

It'd work like:

# Stop writes to index

curl -XPUT localhost:9200/index_v2 -d'{
  "settings": {
    "number_of_shards": 10
  }
}'

curl -XPOST localhost:9200/a_single_command_to_start_copying_all_documents_from_index_v1_to_index_v2
# Save the returned task id

# awaitComplete is a hypothetical parameter from the task management
# proposal (#15117): the call blocks until the task finishes. Retry on
# transient failures (e.g. request timeouts) until it succeeds.
until curl -sf "localhost:9200/_task/$TASK_ID?pretty&awaitComplete"; do
  echo "not done"
  sleep 5
done

# Do any manual checks that index_v2 is ok. Maybe warm it. Maybe raise its number of replicas if you built it with 0 replicas.

curl -XPOST localhost:9200/_aliases -d '{
    "actions": [
        { "remove": { "alias": "index", "index": "index_v1" }},
        { "add":    { "alias": "index", "index": "index_v2" }}
    ]
}'

# Delete the old index now that the alias points at the new one
curl -XDELETE localhost:9200/index_v1

# Resume writes to index

You see from the example that it's not automatic or atomic. It's still an event, and it's very similar to an old blog post about changing mappings with no downtime. The advantages of this over the scroll implementation proposed in the blog post are:

  1. Elasticsearch can handle the messy details of the scroll API, like sort: "_doc", clearing the context when the copy is done, and retrying when things fail.
  2. Elasticsearch can optimize the process to the point where it can do filesystem-level things rather than scroll. The first implementation of reindex won't support such optimizations, but they are totally possible and could cut the runtime down significantly.

The two curl commands in the middle are the new bits. This should start a background task to perform the copy:

curl -XPOST localhost:9200/a_single_command_to_start_copying_all_documents_from_index_v1_to_index_v2

and this should block for a while waiting for the task to complete:

curl -s "localhost:9200/_task/$TASK_ID?pretty&awaitComplete"
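For reference, the placeholder "start copying" command above eventually took the shape of a `_reindex` endpoint when the feature shipped; with `wait_for_completion=false` it starts a background task and returns a task id to follow (index names match the example above):

```shell
# What the placeholder command eventually became: POST _reindex.
# wait_for_completion=false starts the copy as a background task and
# returns its task id instead of blocking the request.
curl -XPOST "localhost:9200/_reindex?wait_for_completion=false" -d'{
  "source": { "index": "index_v1" },
  "dest":   { "index": "index_v2" }
}'
```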

This all piggybacks on the Task Management API (#15117), which isn't done yet, so it'll likely change. The reason this reindex command is a task is that it can take a long time. I, @nik9000, have personally seen these scroll-type reindexes take hours for pretty big indexes. So if it's going to take hours, you'll need a way to cancel it or throttle it. The task management API should have those ways, though I have no idea what they'll look like.

You may ask "Why don't you combine the index creation, alias swap, and index delete into one task?" And that'd be a good question. It won't be part of the first implementation of this but might be part of later ones. Right now I don't like the idea very much. Keep reading. Maybe you'll agree with me. Maybe not. Leave a comment?

Incompatible mapping updates

These'll work almost exactly like resharding. So much so that I won't give a curl example; I trust you, dear reader, can figure it out. The manual check of the index becomes much more important in this case. It's fairly believable that you'd want to keep both indexes alive for a period of time to test both. A/B testing or something.

The other way that mapping updates differ from resharding is that filesystem-level optimizations are much, much less likely.

Touching documents to pick up mapping updates made on the fly

Some mapping updates can be made to an index on the fly but aren't picked up:

  • Adding a new field to a property
  • Adding a new property to a type when "dynamic": false

This offers a fairly complete example of how adding a field with the PUT mapping API works and how you could use the reindex API to touch the documents.
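A minimal sketch of the first case, using the 2.x-era PUT mapping API (the index, type, and field names here are made up):

```shell
# Add a new field to an existing type on the fly. Already-indexed
# documents won't be reanalyzed to populate it; a reindex/touch pass
# is needed for the change to take effect on existing data.
curl -XPUT "localhost:9200/index/_mapping/type" -d'{
  "properties": {
    "new_field": { "type": "string" }
  }
}'
```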

This use case differs from the resharding and incompatible mapping update use cases in that the document isn't being added to an empty index; it's being updated in an existing index. So if the document has changed between the time the scroll took its snapshot of the index and the time the reindex process goes to touch it, the document shouldn't be changed. Luckily, Elasticsearch has built-in support for optimistic concurrency control.
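A sketch of what that optimistic concurrency control looks like with internal versioning (the document id, version, and field are illustrative):

```shell
# Say the scroll snapshot returned this document at "_version": 3.
# Writing it back with ?version=3 succeeds only if nothing has updated
# the document since the snapshot; otherwise Elasticsearch answers with
# a version conflict (HTTP 409) and the reindex process can skip it.
curl -XPUT "localhost:9200/index/type/1?version=3" -d'{
  "field": "touched value"
}'
```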

Limiting the reindexed documents using a query

This seems like the logical extension to the other use cases more than a use case on its own. It's just a useful optimization on top of the other use cases. For example, you could use a query to only touch documents modified after a certain time.
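As a sketch, using the request shape the feature eventually shipped with (the `last_modified` field is a made-up example):

```shell
# Only copy documents modified in the last day.
curl -XPOST "localhost:9200/_reindex" -d'{
  "source": {
    "index": "index_v1",
    "query": {
      "range": { "last_modified": { "gte": "now-1d" } }
    }
  },
  "dest": { "index": "index_v2" }
}'
```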

Update-by-query style changes to portions of the index at a time

"Increment counter on all documents matching this query" is a fairly normal operation on a relational database, and Elasticsearch could have it too. It's fairly different internally from the other proposals but could be quite compelling, though I admit to not having a good use case for it in mind. The trouble with this use case is that it tempts "increment counter on all documents" operations, which are fairly inefficient in Elasticsearch. They're fairly inefficient in any system with concurrency control, and most of them implement the operation anyway, but Elasticsearch makes an effort to make it difficult to do very inefficient things. It's inefficient because in Elasticsearch an update is an atomic delete-and-index operation, and both of those operations are more expensive than their relational counterparts. The delete itself is just as cheap, but deleted documents have to be reclaimed a segment at a time rather than with the aggressive measures relational databases use. The index is much more expensive because the whole document has to be reanalyzed.

In many cases it'd be faster to copy the documents to a new index and then do the alias-swap dance than it would be to update every document in the index in place.

Even with all that it may be a fairly useful API to implement.
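For reference, the update-by-query endpoint as it eventually shipped looks roughly like this (the `counter` field and query are illustrative, and the script syntax is the 2.3/5.x-era `inline` form):

```shell
# Increment a counter on every document matching a query. Internally
# each match is an atomic delete + index, which is why this is more
# expensive than it looks.
curl -XPOST "localhost:9200/index/_update_by_query" -d'{
  "script": { "inline": "ctx._source.counter += 1" },
  "query":  { "term": { "flagged": true } }
}'
```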

Copying an index from a remote cluster into this one

Maybe the most ambitious use case on the list: the idea here is to scroll on a remote cluster and index into the cluster handling the request. This seems like a sensible way to implement basic disaster recovery. It'd be better if the query could subscribe to updates and get them streamed back, but even as is it'd be fairly nice to run daily/hourly updates, especially if the documents had a last_modified_time style field.
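For reference, reindex-from-remote as it eventually shipped (in 5.0) looks roughly like this; the host, index, and field names are examples:

```shell
# Scroll on a remote cluster and index into the local one, optionally
# limited to recently-modified documents for daily/hourly catch-up runs.
curl -XPOST "localhost:9200/_reindex" -d'{
  "source": {
    "remote": { "host": "http://otherhost:9200" },
    "index": "source_index",
    "query": { "range": { "last_modified_time": { "gte": "now-1d" } } }
  },
  "dest": { "index": "dest_index" }
}'
```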

@naivefun

naivefun commented Dec 3, 2015

Exactly +1

@lukas-vlcek
Contributor

@nik9000 nice! Is there any idea which ES version is targeted?

@niemyjski
Contributor

Please make this available in 3.0! Currently we use Foundatio to help us with this, and it would be really, really nice to have this sooner rather than later: https://github.com/exceptionless/Foundatio/blob/master/src/Elasticsearch/Jobs/ReindexWorkItemHandler.cs

@nik9000
Member Author

nik9000 commented Dec 3, 2015

@nik9000 nice! Is there any idea which ES version is targeted?

3.0.0 initially but I really want to backport it to 2.3 as well.

Another thing I should mention: I talked with @imotov who is doing the task management. All tasks will have an option for wait_until_completion to wait for the copy. It'll make using the API simpler for smaller copies but isn't something you'd want to use for large copies.
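As shipped, that option is spelled `wait_for_completion`; a sketch of a small blocking copy (index names are examples):

```shell
# Blocks until the copy finishes; convenient for small indexes, but not
# something you'd want to hold a connection open for on a large copy.
curl -XPOST "localhost:9200/_reindex?wait_for_completion=true" -d'{
  "source": { "index": "small_index" },
  "dest":   { "index": "small_index_v2" }
}'
```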

I've started the implementation for the first 4 use cases in #15125. Right now it's a plugin; I believe it'll be a prebundled plugin, which will be a new thing in 2.3.

@xiaoshi2013

Very nice looking

@bdharrington7

@nik9000 I'm curious, in your example for resharding you have comments indicating that we would have to stop indexing. What kind of steps would have to be taken if this wasn't possible?

@nik9000
Member Author

nik9000 commented Jan 6, 2016

@nik9000 I'm curious, in your example for resharding you have comments indicating that we would have to stop indexing. What kind of steps would have to be taken if this wasn't possible?

You'll want to have some way of replaying the same updates on your application side against both indexes. At some point I'd like to be able to install a redirect mechanism in Elasticsearch for the duration of the reindex operation. It seems like an obvious thing. It's not part of what I'm working on now, and it introduces yet more complexity around versioning, but it's important.

@nik9000
Member Author

nik9000 commented Feb 4, 2016

I've updated the list of things that reindex will do. Right now we don't have external cluster support and I don't know when that'll become a priority.

Right now this is what is left to do in the first phase of the project before we can merge the feature/reindex branch down to master:

  • Throttling
  • Backporting to 2.x

@honzakral
Contributor

This is indeed a super useful API, cannot wait!

Would it be possible to also, in future versions, provide additional functionality to allow updates on the target index instead of only index operations? My use case for this is entity-centric indexing: imagine you have an index containing events and wish to group them by session. With the reindex API it should be possible to read the source events, apply a script (or just extract a field) to get the ID of a target document, and pass it as a parameter to a specified update script.

Another use case we see a lot with users is that they want to move some data out of one index into another. Would it be possible to essentially combine reindex with delete-by-query? After a document is indexed in the target index, a delete operation would be issued on the source index. Of course this couldn't be done atomically, but even on a best-effort basis this would be super useful for a lot of people: essentially executing reindex and delete-by-query at the same time (on the same point-in-time snapshot of the index), with no additional guarantees beyond those the two operations have individually.

I am happy to create individual issues for these use cases if they make sense to people.

@nik9000
Member Author

nik9000 commented Apr 26, 2016

I'm going to close this because reindex is done and live in 2.3.0 and 5.0.0-alpha1. I think @honzakral's point is really another feature request. @honzakral, can you make a new issue for it? Sorry!

@nik9000 nik9000 closed this as completed Apr 26, 2016
@honzakral
Contributor

Done as #17998 and #17997

@nik9000
Member Author

nik9000 commented Apr 26, 2016

Thanks!

karmi added a commit to elastic/elasticsearch-ruby that referenced this issue May 15, 2016
@lcawl lcawl added :Distributed/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. and removed :Reindex API labels Feb 13, 2018

9 participants