
[BUG] [segment replication] Search result is incorrect after replica promotion in a short time. #8985

Closed
maosuhan opened this issue Jul 29, 2023 · 8 comments

maosuhan commented Jul 29, 2023

Describe the bug
I wrote an integration test to cover shard promotion with segment replication.
It mocks the situation where a replica is promoted to primary after the data node hosting the primary is shut down.
After promotion the index is yellow, but when I search the data the result is incorrect; in my case no docs can be found at all.
If I sleep for 1 second after the index becomes yellow, the result is correct.

Expected behavior
I expect the result to be consistent: rolling restarts and node shutdowns are very common operations, and it is not acceptable for data to be incorrect during that window.
An index using document replication works correctly in this case.

How to reproduce
Run my test at https://github.com/maosuhan/OpenSearch/tree/fix_sr several times (generally fewer than 10 runs are needed to reproduce):
org.opensearch.indices.replication.SegmentReplicationPrimaryPromotionIT

maosuhan added the bug and untriaged labels Jul 29, 2023
Poojita-Raj added the distributed framework and Indexing:Replication labels and removed the untriaged label Jul 31, 2023
maosuhan changed the title [BUG] [segment replcation] Search result is incorrect after replica promotion in a short time. → [BUG] [segment replication] Search result is incorrect after replica promotion in a short time. Aug 1, 2023

mch2 commented Aug 4, 2023

@maosuhan Thanks for raising this issue.

If the replica has not yet received a set of segments from the primary and the primary drops, the replica will be promoted as the new primary and will need to replay ops from its translog in order to catch up. We have a similar test here, though one difference is that that test triggers a refresh explicitly after promotion. That refresh should not be required.

Edit -
The refresh in the test I linked only delays the subsequent doc-count query until the shard has completed promotion and doc replay; replacing it with assertBusy(() -> assertTrue(shard.isPrimaryMode())); has the same effect. What's happening is that the shard stays open for reads during the engine reset, backed by a read-only (RO) engine, so until the reset completes searches can see stale data.
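
For illustration, a minimal sketch of that wait in an integration test, assuming the usual test helpers (getIndexShard and expectedDocCount here are placeholders, not the actual test code):

```java
// Sketch only: wait for promotion to complete instead of forcing a refresh.
final IndexShard shard = getIndexShard(replicaNode, "test"); // hypothetical helper
assertBusy(() -> assertTrue(shard.isPrimaryMode()));         // promotion + doc replay done
// Only now is the doc-count query guaranteed to see the replayed ops.
assertHitCount(client().prepareSearch("test").setSize(0).get(), expectedDocCount);
```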


maosuhan commented Aug 7, 2023

@mch2 Thanks for looking into this.
I tried the linked test, and it fails if I remove the refresh statement after the first node is stopped. The doc count of the search result is 0, which does not make sense because waitForSearchableDocs had already succeeded for both the primary and the replica.
It is not acceptable in our production environment for a rolling restart to make search results incorrect, especially for unchanged indices.
In my test the replica had successfully received all the segments from the primary, given that waitForSearchableDocs succeeded for both copies. Yet immediately after the replica was promoted, the search result became wrong. I'm very curious why the search doc count is zero when the new primary had already received the complete segments.


rohin commented Aug 10, 2023

@maosuhan - With segment replication, documents are indexed on the primary first and then the segments are copied to the replicas, to avoid duplicating indexing effort on each copy. In the case you describe, if the primary fails and one of the replicas is promoted to primary, it is possible that the replica has not yet received all the segments. So there is an edge case where, for a very short interval, the replica that has just been promoted to primary is lagging.

However, note that with segment replication the replicas always lag the primary, even if only by a fraction of a millisecond. If a query is sent to both the primary and a replica within that fraction, the results will differ between the copies. This may be acceptable for many use cases where recency of data is not critical. Having said that, we can explore options to reduce this.


mch2 commented Aug 10, 2023

Looking at fixing this by blocking search requests until promotion has completed, using IndexShard#awaitShardSearchActive.


maosuhan commented Aug 15, 2023

@rohin Thanks for looking into this. In my case the replica does not lag the primary before being promoted.
The process is as below (a sketch of the equivalent integration test follows the list):

  1. Start 2 nodes and write 3 docs to the index.
  2. Refresh the index.
  3. Both the primary and the replica have 3 docs and can be searched separately, so the replica has already received all the segments and there is no lag.
  4. Stop the primary.
  5. ensureYellowAndNoInitializingShards.
  6. Immediately search the index; it returns 0 docs.
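
A minimal sketch of those steps as an integration test, assuming the standard OpenSearchIntegTestCase helpers (the index name, field, and setting values mirror the steps above; this is not the original test code):

```java
public void testSearchImmediatelyAfterPromotion() throws Exception {
    final String primary = internalCluster().startNode();
    createIndex("test", Settings.builder()
        .put("index.number_of_shards", 1)
        .put("index.number_of_replicas", 1)
        .put("index.replication.type", "SEGMENT") // segment replication
        .build());
    internalCluster().startNode(); // replica node
    ensureGreen("test");

    for (int i = 0; i < 3; i++) { // step 1: write 3 docs
        client().prepareIndex("test").setId(Integer.toString(i)).setSource("field", i).get();
    }
    refresh("test"); // step 2
    // step 3: both copies are searchable with 3 docs here

    internalCluster().stopRandomNode(InternalTestCluster.nameFilter(primary)); // step 4
    ensureYellowAndNoInitializingShards("test");                               // step 5

    // step 6: this immediate search returned 0 hits before the fix
    assertHitCount(client().prepareSearch("test").setSize(0).get(), 3);
}
```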


mch2 commented Aug 15, 2023

Thanks @maosuhan for these steps. There are two ways the search can return 0 results; looking at fixing both asap.

In the steps you call out, the replica did receive the segments. The zero count happens because during promotion of the replica we flip to a ReadOnly engine temporarily, before the NRT engine is flushed/closed. The RO engine starts by reading the latest on-disk commit, so we need to make an update to ensure the NRT engine commits before we open the RO engine, so that the latest segments are read.
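
Roughly, the intended ordering looks like the sketch below. This is simplified pseudocode, not the actual reset logic: engineMutex, getCurrentEngine, currentEngine, and newReadOnlyEngine are illustrative names, and the real ReadOnlyEngine construction takes more arguments.

```java
synchronized (engineMutex) {
    final Engine nrtEngine = getCurrentEngine();
    nrtEngine.flush(true, true); // the fix: commit the latest replicated segments first
    IOUtils.close(nrtEngine);
    // The RO engine reads the latest on-disk commit, which now includes
    // everything this shard had already received from the old primary.
    currentEngine = newReadOnlyEngine(engineConfig); // simplified, hypothetical factory
    // ... then replay translog ops and open the new read-write engine as primary ...
}
```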

The second way is when the replica has not yet received segments at all, and during the engine reset the new primary must replay from the translog. If a search hits this shard before the new primary completes the reset, it will see stale data. This can be fixed easily by updating awaitShardSearchActive to block the search until the primary has refreshed at the latest translog location.
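
The blocking idea, roughly (awaitShardSearchActive is the existing IndexShard hook used by search idle; the surrounding wiring below, executeSearch and listener, is illustrative):

```java
// Sketch: park the search until a refresh has made the pending
// translog location visible, then resume the search phase.
indexShard.awaitShardSearchActive(refreshed -> {
    // Invoked once the shard has refreshed past its pending location.
    executeSearch(searchRequest, listener); // hypothetical continuation
});
```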


mch2 commented Aug 23, 2023

Have raised a PR #9495 to fix the first part of this.

The second piece is more complex than I had thought. We would need to block reads until the engine reset fully completes to guarantee freshness. If we are serving a search that expects eventual consistency, we don't need to do anything here. The issue arises when strong reads are expected - when specifying _primary preference on a search, or when issuing a get/mget with realtime=true. To serve these requests we need to block the read outright until the reset completes.

I think we can still meet this with refresh listeners by setting a pending refresh location, similar to the one used by search idle. The difference here is that refresh listeners clear themselves by forcing a blocking refresh once the count of listeners exceeds a maximum - by default ~2k requests. In our case the long poll is not the refresh but the engine reset, which includes replay from the translog and/or fetching segments from a remote store and can take much longer. Further, the listeners clear based on translog location, so they would clear on any forced refresh even if the reader has not yet been updated with the expected docs. Alternatively, we could reject requests outright until the reset completes, taking an availability hit over serving stale data.
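
For reference, the refresh-listener mechanism under discussion is used roughly like this (addRefreshListener is the existing IndexShard API; pendingLocation and respondToRead are illustrative):

```java
// Register a listener for a specific translog location. The boolean tells
// the caller whether the refresh was forced: when too many listeners pile
// up (the ~2k maximum mentioned above), a blocking refresh is forced and
// listeners fire even though the reader may still be stale during an
// engine reset - exactly the caveat described in the comment above.
Translog.Location pendingLocation = lastWriteLocation; // illustrative
indexShard.addRefreshListener(pendingLocation, forcedRefresh -> {
    respondToRead(forcedRefresh); // hypothetical continuation
});
```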


mch2 commented Aug 31, 2023

Closing as completed with #9495. Thanks for reporting this @maosuhan.

mch2 closed this as completed Aug 31, 2023