
Reimplement delete-by-query as a bulk request #7052

Closed
clintongormley opened this issue Jul 28, 2014 · 19 comments
@clintongormley

Delete-by-query is problematic: for example, when deleting a document should trigger some action, such as removing a percolator or removing a parent-child link. It is also executed independently on both the primary and the replica, and can end up deleting different documents on each.

Delete-by-query should be replaced with a document-by-document delete using the bulk API.

Fixes: #6025
Fixes: #1712
Fixes: #5797
Fixes: #3593

Depends on #6914

@mikemccand
Contributor

We could use IndexWriter's tryDeleteDocument to delete by docID for each doc matching the query. Typically this would be successful, and fast, if the reader used for searching is "recent"; when it fails to do the delete, you'd have to fall back to delete-by-Term.
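
A minimal sketch of this first approach, assuming the Lucene 4.x-era API (where tryDeleteDocument returns a boolean) and an illustrative "uid" field for the fallback:

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

// Delete every document matching `query`, trying the cheap by-docID
// delete first and falling back to delete-by-Term when the docID is
// stale (e.g. its segment was merged away after the reader was opened).
void deleteMatches(IndexWriter writer, IndexSearcher searcher, Query query) throws IOException {
    IndexReader reader = searcher.getIndexReader();
    int maxHits = Math.max(1, reader.maxDoc());
    for (ScoreDoc hit : searcher.search(query, maxHits).scoreDocs) {
        if (!writer.tryDeleteDocument(reader, hit.doc)) {
            // Fast path failed: resolve the unique id and delete by Term.
            Document doc = searcher.doc(hit.doc);
            writer.deleteDocuments(new Term("uid", doc.get("uid")));
        }
    }
}
```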

Alternatively, it's also possible to wrap any Query and "spy on" the communication between the consumer (IndexWriter in this case) and that query, to see all docIDs that were visited; this way we could continue to use IW's more efficient delete-by-Query, but gather up all docIDs that were in fact deleted. But this is a more hairy/evil/complex solution... though I think Solr already has something doing this.

@uschindler
Contributor

Very nice: this would also allow returning the number of deleted documents to the consumer, so we could change the reply to include the document count. I sometimes need the count of deleted documents, which is currently awkward to get: you can execute the query as a count first and the delete afterwards, but under high load this fails because the two operations are not atomic.

mikemccand added a commit that referenced this issue Mar 4, 2015
Delete-by-query is incredibly costly because it forces a refresh each
time, so if you are also indexing this can cause massive segment
explosion.

This change throttles delete-by-query when merges can't keep up.  It's
likely not enough (#7052 is the long-term solution) but can only
help.

Closes #9986
@mikemccand
Contributor

On #10067 @kimchy pointed out that we need not wait for the task management API for this issue, i.e. we can just do the scan/scroll + bulk deletes synchronously (in the user's current request)...

@uschindler
Contributor

+1

This also means the (incomplete) deprecation on delete-by-query APIs could be removed: #10082

@TwP

TwP commented Mar 29, 2015

As a user who is quite a big fan of delete-by-query, I'm very much hoping it can be implemented in Elasticsearch itself as scan/scroll + bulk deletes. Moving the scan/scroll + bulk-delete logic into the client would create a bit of wasted network traffic. But if that is the solution moving forward, then so be it. My vote is to keep the current API in place and change the underlying implementation.

Out of perverse curiosity... with a client-side solution, can the bulk deletes happen in parallel with the scan/scroll operations? As each set of documents is returned by the call to /_search/scroll, can those documents immediately be deleted via bulk delete operations? Or would I need to complete the entire scan and then delete the documents?

Will Elasticsearch be unhappy if we start deleting the data that it is currently iterating over?

@uschindler
Contributor

Hi @TwP,
you can do the deletes in parallel with the scan and scroll. Scan/scroll uses the IndexReader opened at the time of the query and keeps it open. Deletes that happen while scrolling will not be visible to this IndexReader until the scroll is done. All other search queries (or other scrolls) running in parallel will see the deletes, of course. While scrolling, you see a consistent view of the index.

I agree, we should keep the current API and implement the scan/scroll + delete behind it (please also keep the Java API, not only the REST API: RequestBuilders, etc.). I am also a big fan of this API. One good thing: delete-by-query could now return the number of deleted documents, which would be great, so I would be happy if DeleteByQueryResponse contained a "long getCount()"!
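
For reference, a minimal sketch of the scan/scroll + bulk-delete loop being discussed, against the 1.x Java client (index name, page size and timeout are illustrative); each page of hits is deleted as soon as it is pulled, interleaved with the ongoing scroll:

```java
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.search.SearchHit;

// Scan/scroll over everything matching `query`, deleting each page via a
// bulk request before pulling the next page. The scroll keeps its
// point-in-time reader, so the interleaved deletes do not disturb it.
long deleteByQuery(Client client, String index, QueryBuilder query) {
    TimeValue keepAlive = new TimeValue(60000);
    SearchResponse resp = client.prepareSearch(index)
            .setSearchType(SearchType.SCAN) // scan returns no hits on the initial page
            .setScroll(keepAlive)
            .setQuery(query)
            .setSize(500)                   // hits per shard per page
            .execute().actionGet();
    long deleted = 0;
    while (true) {
        resp = client.prepareSearchScroll(resp.getScrollId())
                .setScroll(keepAlive)
                .execute().actionGet();
        SearchHit[] hits = resp.getHits().getHits();
        if (hits.length == 0) {
            break;                          // scroll exhausted
        }
        BulkRequestBuilder bulk = client.prepareBulk();
        for (SearchHit hit : hits) {
            bulk.add(client.prepareDelete(hit.getIndex(), hit.getType(), hit.getId()));
        }
        bulk.execute().actionGet();         // delete this page, then keep scrolling
        deleted += hits.length;
    }
    return deleted;
}
```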

@mikemccand
Contributor

> you can do the deletes in parallel with the scan and scroll.

That's right: scan/scroll gives you a point-in-time view of the index. It won't see any changes that happened after that time, as long as you keep the scroll "alive".

> I'm very much hoping it can be implemented in Elasticsearch itself as scan/scroll + bulk deletes.

This is the plan... AbstractClient's deleteByQuery methods will be changed to final methods that do the scan/scroll + bulk delete.

> please also keep the Java API, not only the REST API: RequestBuilders, etc.

Yeah this is also now the plan...

> delete-by-query could now return the number of deleted documents, which would be great, so I would be happy if DeleteByQueryResponse contained a "long getCount()"!

OK I'll add that, and also "int getCount()" to each IndexDeleteByQueryResponse, because you can pass multiple index names to DBQ.
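
Hypothetical usage of those proposed accessors; getCount() is the addition floated in this thread, not a shipped API, while the rest is the existing 1.x client:

```java
// getCount() on the top-level and per-index responses is the proposed
// addition; prepareDeleteByQuery and the iteration exist in 1.x.
DeleteByQueryResponse response = client.prepareDeleteByQuery("index-a", "index-b")
        .setQuery(QueryBuilders.termQuery("user", "kimchy"))
        .execute().actionGet();
long total = response.getCount();           // proposed: total documents deleted
for (IndexDeleteByQueryResponse perIndex : response) {
    System.out.println(perIndex.getIndex() + " -> " + perIndex.getCount());
}
```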

@uschindler
Contributor

This just came to mind: I already implemented the bulk-delete approach locally (in the Java client; see https://sourceforge.net/p/panfmp/code/647/tree//main/trunk/src/de/pangaea/metadataportal/processor/DocumentProcessor.java?diff=516c2e8d5fcbc9791083b0a3:646 for an example). For that code, version numbers are not really a problem, but for the deleteByQuery API to be consistent, it should execute the scan/scroll and collect all doc IDs AND version (!!!) numbers. When doing the bulk delete, each delete should use the version number returned by the scan/scroll; otherwise, updates/inserts from concurrent indexing could be destroyed.
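
A sketch of that versioned variant, extending the scroll loop above (1.x Java client; setVersion(true) asks the scan to return each hit's version):

```java
// During the scan, request versions; during the bulk delete, pass each
// hit's version along, so a document that was concurrently updated fails
// with a version conflict instead of being silently deleted.
SearchResponse resp = client.prepareSearch(index)
        .setSearchType(SearchType.SCAN)
        .setScroll(keepAlive)
        .setQuery(query)
        .setVersion(true)                   // include _version in each hit
        .setSize(500)
        .execute().actionGet();

// ...and inside the scroll loop:
for (SearchHit hit : hits) {
    bulk.add(client.prepareDelete(hit.getIndex(), hit.getType(), hit.getId())
            .setVersion(hit.getVersion())); // conflict if the doc changed since the scan
}
```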

@kimchy
Member

kimchy commented Mar 29, 2015

@uschindler aye, it should be implemented similarly to what we do with a single-document update, for example making sure to take parent and routing into account.
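
Continuing the sketch along those lines: routing (and, for child documents, the parent) would have to be collected during the scan and replayed on each delete. A hedged fragment, assuming the _routing and _parent metadata fields can be requested as fields on the scan:

```java
// Ask the scan to also return routing/parent metadata for every hit...
SearchRequestBuilder search = client.prepareSearch(index)
        .addFields("_routing", "_parent");

// ...then, per hit, apply it when building the delete so the delete is
// routed to the same shard the document lives on.
DeleteRequestBuilder delete =
        client.prepareDelete(hit.getIndex(), hit.getType(), hit.getId());
SearchHitField routing = hit.field("_routing");
if (routing != null) {
    delete.setRouting((String) routing.getValue());
}
SearchHitField parent = hit.field("_parent");
if (parent != null) {
    delete.setParent((String) parent.getValue());
}
bulk.add(delete);
```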

@s1monw
Contributor

s1monw commented Mar 30, 2015

I think we should split the problems into engine-level and shard-level. The shard-level problems (consistency between replica and primary) can be solved in a different PR from the engine-level problems.

I think we should take baby steps here and implement the deletion as a simple Lucene search inside the engine, and keep using the query inside the transaction log etc. Just keep everything as it is, except for iterating over the hits inside the engine and deleting them one by one. Also, replication stays the same for now. This solves the refresh problem as well as the OOM, etc. We can also utilize IW#tryDeleteDocument to speed things up.

The shard-level problem is a bigger one and should maybe be solved by using sequence IDs (delete after) or a shard-level lock that prevents any other changes from happening concurrently on both replicas and primaries.

Solving the issue at the engine level lets us make progress on the right level, IMO, and we can fix the distributed-system problems on the layer where they are actually happening.

@mikemccand
Contributor

> I think we should split the problems into engine-level and shard-level.

+1: if we can really decouple the two (engine impl vs shard consistency), that's wonderful. I can tackle the engine level, to fix the OOME during concurrent indexing (#6025).

@s1monw
Contributor

s1monw commented Mar 31, 2015

@mikemccand to be honest, I think we should just stop indexing for the duration of the delete-by-query. A simple solution, though...

@nariman-haghighi

To echo @TwP's sentiment, this is a massively common API that many teams use regularly to delete thousands of documents. Moving this to the client or deprecating it outright would be an utter disaster. A new server-side implementation that preserves the API is the only sensible path, please provide some final guidance on this so we can plan appropriately.

@tlrx removed the "help wanted" and "adoptme" labels Jun 11, 2015
@tlrx
Member

tlrx commented Jun 11, 2015

@nariman-haghighi you may be interested in #11516 and #11584.

@tlrx closed this as completed in ba35406 Jun 17, 2015
@traviscollins

I agree with @TwP and @nariman-haghighi - removing the delete-by-query API is really not serving users well. This is basic functionality that should be implemented on the server side in the core API with a simple client method.

@MattFriedman

I see this issue is now closed. Has it been resolved? Will the native Elastic interface be maintained or removed? I read all the comments and I'm still not sure.

@clintongormley
Author

@MattFriedman in 2.0 delete-by-query has been implemented using bulk, and moved into a plugin: https://github.com/elastic/elasticsearch/tree/master/plugins/delete-by-query

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
@pulkitsinghal

pulkitsinghal commented Aug 6, 2016

@clintongormley - I went looking but did not find it... should I instead rely on the summary I found in the commits, stating:

> The Delete-By-Query plugin has been removed in favor of a new Delete By Query API implementation in core. It now supports throttling, retries and cancellation, but no longer supports timeouts. Instead, use the cancel API to cancel deletes that run too long.

@nik9000
Member

nik9000 commented Aug 6, 2016

It should be mentioned somewhere in the migration docs, but I'm on mobile. Yes, in 5.0 delete-by-query is included in the distribution. It is technically shipped in the module called reindex, meaning it is in a pre-bundled plugin.

