
[BUG] [segment replication] Search result is incorrect after replica promotion in a short time. #8985

Closed
maosuhan opened this issue Jul 29, 2023 · 8 comments

maosuhan commented Jul 29, 2023

Describe the bug
I wrote an integration test to cover shard promotion with segment replication.
It mocks the situation where a replica is promoted to primary after the data node hosting the primary is shut down.
After promotion the index is yellow, but when I search the data the result is incorrect; in my case no docs can be found at all.
If I sleep for 1 second after the index becomes yellow, the result is correct.

Expected behavior
I expect the result to be consistent: rolling restarts and node shutdowns are very common operations, and it is not acceptable for data to be incorrect during that window.
An index using document replication works correctly in this case.

How to reproduce
Run my test at https://github.com/maosuhan/OpenSearch/tree/fix_sr several times (generally fewer than 10 runs are needed to reproduce):
org.opensearch.indices.replication.SegmentReplicationPrimaryPromotionIT

maosuhan added the bug and untriaged labels Jul 29, 2023
Poojita-Raj added the distributed framework and Indexing:Replication labels and removed the untriaged label Jul 31, 2023
maosuhan changed the title [BUG] [segment replcation] Search result is incorrect after replica promotion in a short time. → [BUG] [segment replication] Search result is incorrect after replica promotion in a short time. Aug 1, 2023

mch2 commented Aug 4, 2023

@maosuhan Thanks for raising this issue.

If the replica has not yet received a set of segments from the primary and the primary drops, the replica will be promoted as the new primary and will need to replay ops from its translog in order to catch up. We have a similar test here, though one difference is that that test triggers a refresh explicitly after promotion. That refresh should not be required.

Edit -
The refresh in the test I linked only delays the subsequent doc-count query until the shard has completed promotion and doc replay; replacing it with assertBusy(() -> assertTrue(shard.isPrimaryMode())); has the same effect. What's happening is that the shard stays open for reads during the engine reset, backed by a read-only (RO) engine, so until the reset completes searches can see stale data.
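
For illustration, a minimal sketch of that wait in an integration test, assuming the usual test helpers (getIndexShard and expectedDocCount here are placeholders, not the actual test code):

```java
// Sketch only: wait for promotion to complete instead of forcing a refresh.
final IndexShard shard = getIndexShard(replicaNode, "test"); // hypothetical helper
assertBusy(() -> assertTrue(shard.isPrimaryMode()));         // promotion + doc replay done
// Only now is the doc-count query guaranteed to see the replayed ops.
assertHitCount(client().prepareSearch("test").setSize(0).get(), expectedDocCount);
```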


maosuhan commented Aug 7, 2023

@mch2 Thanks for looking into this.
I tried the linked test, and it fails if I remove the refresh statement after the first node is stopped. The doc count of the search result is 0, which does not make sense because waitForSearchableDocs had already succeeded for both the primary and the replica.
It is not acceptable in our production environment for a rolling restart to make search results incorrect, especially for unchanged indices.
In my test the replica had successfully received all the segments from the primary, given that waitForSearchableDocs succeeded for both copies. Yet immediately after the replica was promoted, the search result became wrong. I'm very curious why the search doc count is zero when the new primary had already received the complete segments.


rohin commented Aug 10, 2023

@maosuhan - With segment replication, documents are indexed on the primary first and then the segments are copied to the replicas, to avoid duplicating indexing effort on each copy. In the case you describe, if the primary fails and one of the replicas is promoted to primary, it is possible that the replica has not yet received all the segments. So there is an edge case where, for a very short interval, the replica that has just been promoted to primary is lagging.

However, note that with segment replication the replicas always lag the primary, even if only by a fraction of a millisecond. If a query is sent to both the primary and a replica within that fraction, the results will differ between the copies. This may be acceptable for many use cases where recency of data is not critical. Having said that, we can explore options to reduce this.


mch2 commented Aug 10, 2023

Looking at fixing this by blocking search requests until promotion has completed, using IndexShard#awaitShardSearchActive.


maosuhan commented Aug 15, 2023

@rohin Thanks for looking into this. In my case the replica does not lag the primary before being promoted.
The process is as below (a sketch of the equivalent integration test follows the list):

  1. Start 2 nodes and write 3 docs to the index.
  2. Refresh the index.
  3. Both the primary and the replica have 3 docs and can be searched separately, so the replica has already received all the segments and there is no lag.
  4. Stop the primary.
  5. ensureYellowAndNoInitializingShards.
  6. Immediately search the index; it returns 0 docs.
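
A minimal sketch of those steps as an integration test, assuming the standard OpenSearchIntegTestCase helpers (the index name, field, and setting values mirror the steps above; this is not the original test code):

```java
public void testSearchImmediatelyAfterPromotion() throws Exception {
    final String primary = internalCluster().startNode();
    createIndex("test", Settings.builder()
        .put("index.number_of_shards", 1)
        .put("index.number_of_replicas", 1)
        .put("index.replication.type", "SEGMENT") // segment replication
        .build());
    internalCluster().startNode(); // replica node
    ensureGreen("test");

    for (int i = 0; i < 3; i++) { // step 1: write 3 docs
        client().prepareIndex("test").setId(Integer.toString(i)).setSource("field", i).get();
    }
    refresh("test"); // step 2
    // step 3: both copies are searchable with 3 docs here

    internalCluster().stopRandomNode(InternalTestCluster.nameFilter(primary)); // step 4
    ensureYellowAndNoInitializingShards("test");                               // step 5

    // step 6: this immediate search returned 0 hits before the fix
    assertHitCount(client().prepareSearch("test").setSize(0).get(), 3);
}
```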


mch2 commented Aug 15, 2023

Thanks @maosuhan for these steps. There are two ways the search can return 0 results; looking at fixing both asap.

In the steps you call out, the replica did receive the segments. The zero count happens because during promotion of the replica we flip to a ReadOnly engine temporarily, before the NRT engine is flushed/closed. The RO engine starts by reading the latest on-disk commit, so we need to make an update to ensure the NRT engine commits before we open the RO engine, so that the latest segments are read.
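
Roughly, the intended ordering looks like the sketch below. This is simplified pseudocode, not the actual reset logic: engineMutex, getCurrentEngine, currentEngine, and newReadOnlyEngine are illustrative names, and the real ReadOnlyEngine construction takes more arguments.

```java
synchronized (engineMutex) {
    final Engine nrtEngine = getCurrentEngine();
    nrtEngine.flush(true, true); // the fix: commit the latest replicated segments first
    IOUtils.close(nrtEngine);
    // The RO engine reads the latest on-disk commit, which now includes
    // everything this shard had already received from the old primary.
    currentEngine = newReadOnlyEngine(engineConfig); // simplified, hypothetical factory
    // ... then replay translog ops and open the new read-write engine as primary ...
}
```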

The second way is when the replica has not yet received segments at all, and during the engine reset the new primary must replay from the translog. If a search hits this shard before the new primary completes the reset, it will see stale data. This can be fixed easily by updating awaitShardSearchActive to block the search until the primary has refreshed at the latest translog location.
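
The blocking idea, roughly (awaitShardSearchActive is the existing IndexShard hook used by search idle; the surrounding wiring below, executeSearch and listener, is illustrative):

```java
// Sketch: park the search until a refresh has made the pending
// translog location visible, then resume the search phase.
indexShard.awaitShardSearchActive(refreshed -> {
    // Invoked once the shard has refreshed past its pending location.
    executeSearch(searchRequest, listener); // hypothetical continuation
});
```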


mch2 commented Aug 23, 2023

Have raised a PR #9495 to fix the first part of this.

The second piece is more complex than I had thought. We would need to block reads until the engine reset fully completes to guarantee freshness. If we are serving a search that expects eventual consistency, we don't need to do anything here. The issue arises when strong reads are expected - when specifying _primary preference on a search, or when issuing a get/mget with realtime=true. To serve these requests we need to block the read outright until the reset completes.

I think we can still meet this with refresh listeners by setting a pending refresh location, similar to the one used by search idle. The difference here is that refresh listeners clear themselves by forcing a blocking refresh once the count of listeners exceeds a maximum - by default ~2k requests. In our case the long poll is not the refresh but the engine reset, which includes replay from the translog and/or fetching segments from a remote store and can take much longer. Further, the listeners clear based on translog location, so they would clear on any forced refresh even if the reader has not yet been updated with the expected docs. Alternatively, we could reject requests outright until the reset completes, taking an availability hit over serving stale data.
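
For reference, the refresh-listener mechanism under discussion is used roughly like this (addRefreshListener is the existing IndexShard API; pendingLocation and respondToRead are illustrative):

```java
// Register a listener for a specific translog location. The boolean tells
// the caller whether the refresh was forced: when too many listeners pile
// up (the ~2k maximum mentioned above), a blocking refresh is forced and
// listeners fire even though the reader may still be stale during an
// engine reset - exactly the caveat described in the comment above.
Translog.Location pendingLocation = lastWriteLocation; // illustrative
indexShard.addRefreshListener(pendingLocation, forcedRefresh -> {
    respondToRead(forcedRefresh); // hypothetical continuation
});
```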


mch2 commented Aug 31, 2023

Closing as completed with #9495. Thanks for reporting this @maosuhan.

mch2 closed this as completed Aug 31, 2023