[BUG] [segment replication] Search result is incorrect after replica promotion in a short time. #8985
@maosuhan Thanks for raising this issue. If the replica has not yet received a set of segments from the primary and the primary drops, the replica will be promoted as the new primary and will need to replay ops from its translog in order to catch up. We have a similar test here, though one difference is that the test triggers a refresh explicitly after promotion. That refresh should not be required.
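The promotion path described above can be illustrated with a toy model (this is not OpenSearch code; class and method names are made up for illustration): every op is durably logged to the translog, and on promotion the new primary replays any logged ops whose segments it never received.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the promotion path: a promoted replica replays operations
// from its translog to catch up on documents whose segments it never
// received from the old primary.
class PromotedReplica {
    final List<String> segments = new ArrayList<>();   // docs visible to search
    final List<String> translog = new ArrayList<>();   // durable ops log

    void receiveOp(String doc) { translog.add(doc); }  // every op is logged

    void receiveSegmentCopy(List<String> docs) {       // segrep copy event
        segments.clear();
        segments.addAll(docs);
    }

    // On promotion, replay any logged ops not yet visible in segments.
    void promoteAndReplay() {
        for (String doc : translog) {
            if (!segments.contains(doc)) segments.add(doc);
        }
    }

    int searchableDocs() { return segments.size(); }
}
```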
@mch2 Thanks for looking into this.
@maosuhan - With segment replication, documents are indexed on the primary first and then the segments are copied to the replicas, to avoid duplicating the indexing work on each copy. In the case you described, if the primary fails and one of the replicas is promoted to primary, there is a possibility that the replica has not yet received all the segments. So there is an edge case where, for a very short interval, the replica that has just been promoted to primary is lagging.

However, do note that with segment replication the replicas will always lag the primary, even if only by a fraction of a millisecond. If in that fraction a query is sent to both the primary and a replica, the results will differ between shard copies. This may be acceptable for a large number of use cases where recency of data is not critical. Having said that, we can explore options to reduce this window.
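The replica lag described above can be sketched in a minimal model (not OpenSearch code; names are illustrative): the replica only sees documents after an explicit segment copy, so a query that fans out to both copies can briefly return different counts.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of segment-replication lag: the primary indexes into its own
// segment list, and the replica only sees docs once segment files are
// copied over, so the two copies can briefly disagree.
class SegRepModel {
    final List<String> primarySegments = new ArrayList<>();
    final List<String> replicaSegments = new ArrayList<>();

    void indexOnPrimary(String doc) { primarySegments.add(doc); }

    // Segment replication: copy the primary's segment files to the replica.
    void copySegmentsToReplica() {
        replicaSegments.clear();
        replicaSegments.addAll(primarySegments);
    }

    int countOnPrimary() { return primarySegments.size(); }
    int countOnReplica() { return replicaSegments.size(); }
}
```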
Looking at fixing this by blocking the search requests until promotion has completed.
@rohin Thanks for looking into this. In my case, the replica does not lag the primary before being promoted to primary.
Thanks @maosuhan for these steps. There are two ways the search would return 0 results; looking at fixing both asap.

In the steps you call out, the replica did receive segments. The stale result occurs because during promotion of the replica we temporarily flip to a ReadOnly engine before the NRT engine is flushed/closed. The RO engine starts by reading the latest on-disk commit, so we need to make an update to ensure the NRT engine commits before we open the RO engine, so that the latest segments are read.

The second way is if the replica has not yet received segments at all, and during engine reset the new primary must replay from the xlog. If a search hits this shard before the new primary completes the reset, it will show stale results. This can be fixed easily by updating
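The first failure mode above can be sketched with a toy model (not OpenSearch internals; names are illustrative): a read-only engine opened from the latest on-disk commit misses documents the NRT engine held only in memory, unless the NRT engine commits first.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the first failure mode: the RO engine reads the latest
// on-disk commit, so docs that are refreshed but uncommitted in the NRT
// engine are invisible until a commit happens before the RO engine opens.
class NrtEngineModel {
    final List<String> inMemory = new ArrayList<>();      // refreshed, uncommitted
    final List<String> onDiskCommit = new ArrayList<>();  // last commit point

    void index(String doc) { inMemory.add(doc); }

    // Commit: persist the current segments to disk.
    void commit() {
        onDiskCommit.clear();
        onDiskCommit.addAll(inMemory);
    }

    // The RO engine starts by reading the latest on-disk commit.
    List<String> openReadOnlyView() { return new ArrayList<>(onDiskCommit); }
}
```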
Have raised a PR #9495 to fix the first part of this. The second piece is more complex than I had thought. We would need to block reads until engine reset fully completes to guarantee freshness. If we are serving a search that expects eventual consistency, we don't need to do anything here. The issue arises when strong reads are expected - e.g. when specifying _primary preference on a search, or issuing a get/mget with realtime=true. To serve these requests we need to outright block the read until reset completes.

I think we can still meet this with refresh listeners by setting a pending refresh location, similar to what is used for search idle. The difference here is that refresh listeners clear by forcing a blocking refresh when the count of listeners exceeds a maximum - the default is ~2k requests. In our case the long poll is not the refresh but the engine reset, which includes replay from the xlog and/or fetching segments from remote, and can take much longer. Further, the listeners clear based on translog location, which would clear with any forced refresh even if the reader does not update with the expected docs. Alternatively, we reject requests outright until reset completes, taking an availability hit instead of serving stale data.
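The "block strong reads until reset completes" option above can be sketched with a simple gate (a hedged sketch using a plain latch, not OpenSearch's refresh-listener machinery; class and method names are made up):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative gate: strong reads (_primary preference, realtime get) wait
// until the engine reset signals completion; past the timeout the caller
// can choose to reject the request and take the availability hit.
class PromotionGate {
    private final CountDownLatch resetDone = new CountDownLatch(1);

    // Called once the engine reset (xlog replay / segment fetch) finishes.
    void onEngineResetComplete() { resetDone.countDown(); }

    // Returns true if the reset completed within the timeout.
    boolean awaitStrongRead(long timeoutMillis) {
        try {
            return resetDone.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```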
Describe the bug
I wrote an integration test for the shard promotion cases of segment replication.
I tried to mock the situation where a replica is promoted to primary when the primary's data node is shut down.
After promotion, the index is yellow and I tried to search the data, but the result is incorrect; in my case no docs can be found.
But if I sleep for 1 second after the index becomes yellow, the result is correct.
Expected behavior
I expect the result to be consistent, because rolling restarts and node shutdowns are very common operations; it is not acceptable if the data is incorrect during that time.
An index using document replication works normally in this case.
How to reproduce
Try running my code https://github.com/maosuhan/OpenSearch/tree/fix_sr several times; generally fewer than 10 runs are needed to reproduce.
org.opensearch.indices.replication.SegmentReplicationPrimaryPromotionIT