Force merge behaviour unreliable #102594
Pinging @elastic/es-distributed (Team:Distributed)
There is nothing in Elasticsearch that reverts a completed force merge, but there are various operations that may produce new segments or retain existing segments while the force merge is running. Among the ones I can think of:
Your use case probably needs to be adapted to take all these operations into consideration (and there may be more; sorry if I forgot some). A good start would be to create the index with a refresh interval of -1 and index through an alias, so that the alias can be dropped once bulk indexing is done (or, even better, use Security and specific permissions to ensure your process is the only possible indexing client). Then flush the index and add a write block to it (see the Add Block API); this guarantees no further indexing. The index can then be relocated to a node with less active indexing, so that indexing throttling won't be activated and shard relocation due to disk space is less likely to happen. Then force merge to 1 segment and use the Task API to see the results.

I'm keeping this issue open since I think we can at least add a note to the documentation about the various reasons that could cause a shard to have more segments than expected after a force merge.
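The sequence suggested above can be sketched as an ordered list of REST calls. This is a hedged illustration, not an official recipe: the index name, alias, and node name are placeholders, and the endpoints shown (`_aliases`, `_flush`, `_block/write`, `_forcemerge` with `wait_for_completion=false`) are the standard Elasticsearch APIs referenced in the comment.

```python
def force_merge_plan(index: str, alias: str, quiet_node: str):
    """Return the ordered (method, path, body) REST calls for the
    hardening sequence described above. Names are illustrative."""
    return [
        # 1. Create the index with refreshes disabled; index through an alias.
        ("PUT", f"/{index}", {
            "settings": {"index.refresh_interval": "-1"},
            "aliases": {alias: {}},
        }),
        # 2. After bulk indexing: drop the alias so no client keeps writing.
        ("POST", "/_aliases", {
            "actions": [{"remove": {"index": index, "alias": alias}}],
        }),
        # 3. Flush, then add a write block to guarantee no further indexing.
        ("POST", f"/{index}/_flush", None),
        ("PUT", f"/{index}/_block/write", None),
        # 4. Relocate to a node with little active indexing, so merge
        #    throttling and disk-pressure relocations are less likely.
        ("PUT", f"/{index}/_settings", {
            "index.routing.allocation.require._name": quiet_node,
        }),
        # 5. Force merge to one segment asynchronously; inspect the
        #    returned task id with the Task API.
        ("POST",
         f"/{index}/_forcemerge?max_num_segments=1&wait_for_completion=false",
         None),
    ]
```

The plan can then be executed by any HTTP client against the cluster; separating "plan" from "execute" also makes the sequence easy to review before running it.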
Pinging @elastic/es-docs (Team:Docs)
@tlrx Thanks for your response. It took me some time to find the time to respond. I understand there is no such thing as a reversal; then there must be some misalignment somewhere, and I'm not sure what's causing it.

In the indexing task, all documents are inserted into Elasticsearch, the force merge API is called, and then the segment count is checked periodically (every minute or so). The behaviour I noticed was that at some point the number of segments jumped back up even though the force merge task had completed.

Preferably I'd see a force merge task that allows me to state that, no matter what happens and no matter how long it takes, I want the task to continue until there really is just one segment left in every shard. If this is unfeasible, it would at least be nice to have some reliable way to find out afterwards whether it succeeded.

The indexing process already has refresh disabled, and we're only calling the force merge API after indexing has completed. Right now the solution we're going with is:
Because of that, I don't think it would be solved by just updating the documentation; to be honest, there's something more going on.
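The "check periodically and retry" approach described in this comment can be made more defensive by polling per-shard segment counts and only declaring success once every shard reports exactly one segment. A minimal sketch, assuming a caller-supplied fetch function (e.g. one that parses `GET /<index>/_stats?level=shards`); the function names and parameters here are hypothetical:

```python
import time

def wait_for_single_segment(get_shard_segments, attempts=30, interval=60.0,
                            sleep=time.sleep):
    """Poll until every shard reports exactly one segment.

    get_shard_segments: callable returning a list of per-shard segment
    counts (one entry per primary shard), however the caller obtains them.
    Returns True once all shards are at one segment, False if the merge
    never settles within `attempts` polls.
    """
    for _ in range(attempts):
        counts = get_shard_segments()
        if counts and all(c == 1 for c in counts):
            return True
        sleep(interval)
    return False
```

Injecting the fetch and sleep functions keeps the retry logic testable without a live cluster; in production the caller would re-issue `_forcemerge` when this returns False.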
It seems like sending a settings change from 0 replicas to 1 (or more) right after calling `_forcemerge` (within 1-2 seconds at most) also makes the operation fail silently. What is worse, the task log for the `_forcemerge` says `"completed": "true"`, but the number of segments stays the same.
Elasticsearch Version
8.10.3
Installed Plugins
No response
Java Version
bundled
OS Version
Linux 5.15.107+ #1 SMP x86_64 x86_64 x86_64 GNU/Linux
Problem Description
I monitor an index that relies on force merging to 1 segment for performance reasons. The process is:
Multiple concurrent processes like this can run on the same cluster. We've noticed on multiple occasions that steps 3 & 4 are not enough: the task completes successfully, but more than one segment still exists for one or more shards. So we added step 6 to retry if that happens.
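The per-shard verification behind step 6 can be sketched by counting segments per primary shard. One source for this is `GET /_cat/segments?format=json`, whose rows carry `index`, `shard`, and `prirep` (`p` for primary) fields; the helper below is a hypothetical illustration of that parsing, not part of the monitoring process described here:

```python
from collections import Counter

def primary_segments_per_shard(cat_segments_rows):
    """Count segments per primary shard from parsed _cat/segments rows.

    Each row is a dict with at least 'index', 'shard' and 'prirep'
    ('p' for primary, 'r' for replica). Returns a mapping of
    (index, shard) -> segment count for primaries only."""
    counts = Counter()
    for row in cat_segments_rows:
        if row["prirep"] == "p":
            counts[(row["index"], row["shard"])] += 1
    return dict(counts)
```

A retry is then warranted whenever any value in the returned mapping is greater than 1.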
A new issue surfaced recently: even step 6 was not good enough. The stats API reported a number of primary segments equal to the number of shards for several minutes, and replication was initiated. During the allocation of additional replicas, the number of segments jumped back up to 44. Essentially, the force merge on one shard was reversed.
This mostly seems to happen when multiple indices are force merged at the same time. We noticed log messages like:
This message appears to be a red flag: it looks like not only is indexing throttled, but force merging is also partially cancelled.
This all feels like a bug somewhere in the force merge administration: the task is not monitored properly and not guaranteed to complete, and even if it does appear to complete correctly, there seems to be a chance that the force merge is reversed.
Steps to Reproduce
Rinse & repeat a couple of times if necessary. Observe that the number of primary segments is not guaranteed to be 12 (6 on each index, 1 on each shard) after this.
The only lead we currently have is this: https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-forcemerge.html
that states:
Theoretically, the replication of one index could cause a rebalancing for primary shards on the other, effectively nullifying the effect of the force merge that is still ongoing.
It would be great to have some reliable grip on the process and to be sure that it completed correctly.
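One workaround for the rebalancing interaction described above is to pause shard rebalancing for the duration of the force merge window via the `cluster.routing.rebalance.enable` cluster setting, and restore it afterwards. A minimal sketch of the two settings bodies involved (whether this fully closes the race is an assumption, not something confirmed in this issue):

```python
def rebalance_toggle(enabled: bool):
    """Cluster-settings request body that pauses ('none') or resumes
    ('all') shard rebalancing around a force merge window."""
    return {
        "persistent": {
            "cluster.routing.rebalance.enable": "all" if enabled else "none",
        }
    }
```

The body from `rebalance_toggle(False)` would be sent to `PUT /_cluster/settings` before force merging, and `rebalance_toggle(True)` once replication has finished.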
Logs (if relevant)
now throttling indexing: numMergesInFlight=10, maxNumMerges=9
stop throttling indexing: numMergesInFlight=8, maxNumMerges=9