Double-check local checkpoint for staleness #29276

Conversation

DaveCTurner
Contributor

Today, when determining if an operation is stale, we compare the seqno against
the local checkpoint before looking in the version map. However, in between
these two checks the local checkpoint could advance, causing some tombstones to
become stale, and then the stale tombstones could be collected. In this
situation we might incorrectly decide that the operation is fresh and apply it.

To avoid this situation, check the local checkpoint again after calling
getVersionFromMap(). Since it only ever increases, this gives the right result
despite other concurrent activity.
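A minimal, self-contained sketch of the race and the double-check described above (the names and types here are illustrative assumptions, not the actual InternalEngine code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class StalenessCheckSketch {
    private final AtomicLong localCheckpoint = new AtomicLong(-1);          // only ever increases
    private final Map<String, Long> versionMap = new ConcurrentHashMap<>(); // uid -> max seq# seen; tombstones may be pruned

    /** Returns true if the operation identified by uid/seqNo is stale and must not be applied. */
    boolean isStale(String uid, long seqNo) {
        // First check: ops at or below the local checkpoint have already been fully processed.
        if (seqNo <= localCheckpoint.get()) {
            return true;
        }
        // Look the doc up in the version map. Between the check above and this lookup the
        // checkpoint may advance and stale tombstones may be collected, so a missing entry
        // does not by itself prove the op is fresh.
        final Long seenSeqNo = versionMap.get(uid);
        final boolean staleAccordingToMap = seenSeqNo != null && seqNo <= seenSeqNo;
        // Second check: re-read the checkpoint. Since it only ever increases, a stale op
        // cannot slip past both checkpoint checks and the map check.
        return staleAccordingToMap || seqNo <= localCheckpoint.get();
    }
}
```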

@DaveCTurner DaveCTurner added >bug v7.0.0 v6.3.0 :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. labels Mar 28, 2018
@DaveCTurner
Contributor Author

Actually turning this into something that can reliably fail a test case seems tricky. Your thoughts welcome.

@bleskes
Contributor

bleskes commented Mar 28, 2018

Thanks @DaveCTurner. The change is semantically correct, but I prefer we bundle things more. Right now we first compare the seq# to the local checkpoint and then call compareOpToLuceneDocBasedOnSeqNo. Instead I would prefer to move all of it into compareOpToLuceneDocBasedOnSeqNo so it's all in one place (or alternatively move both checks out of the method). This will allow us to make the flow "cleaner":

  1. Preflight against the local checkpoint.
  2. Check the version map and Lucene.
  3. Check again against the local checkpoint (in case it advanced, making step 2 unreliable).

This will bundle the version map with Lucene as a single unit with a requirement to remember everything above the local checkpoint. I think that's cleaner and aligns better with the future direction. WDYT?

Actually turning this into something that can reliably fail a test case seems tricky. Your thoughts welcome.

I was thinking about this this morning. I was hoping that if we add refreshes to testConcurrentOutOfDocsOnReplica and sometimes run it with gc deletes equal to 0, it will reproduce.

DaveCTurner added a commit to elastic/elasticsearch-formal-models that referenced this pull request Mar 28, 2018
This models how indexing and deletion operations are handled on the replica,
including the optimisations for append-only operations and the interaction with
Lucene commits and the version map.

It incorporates

- elastic/elasticsearch#28787
- elastic/elasticsearch#28790
- elastic/elasticsearch#29276
- a proposal to always prune tombstones
@DaveCTurner
Contributor Author

Instead I would prefer to move all of it into compareOpToLuceneDocBasedOnSeqNo

Sure, I like that idea too.

I was thinking about this this morning. I was hoping that if we add refreshes to testConcurrentOutOfDocsOnReplica and sometimes run it with gc deletes equal to 0, it will reproduce.

🤞

I was wondering about adding some kind of

assert randomDelay();

at strategic points in order to expose rare races more frequently.
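For illustration, a minimal sketch of what such a helper could look like (the name randomDelay comes from the comment above; the delay probability and bounds are assumptions):

```java
import java.util.concurrent.ThreadLocalRandom;

final class Races {
    /**
     * Intended to be called as "assert randomDelay();" so the call is elided entirely
     * when assertions are disabled. With -ea, it occasionally pauses the thread to
     * widen rare race windows.
     */
    static boolean randomDelay() {
        if (ThreadLocalRandom.current().nextInt(10) == 0) { // pause roughly 1 in 10 calls
            try {
                Thread.sleep(ThreadLocalRandom.current().nextLong(1, 20)); // 1-19 ms
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return true; // always true, so the assertion itself never fails
    }
}
```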

@hub-cap hub-cap added :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. labels Mar 28, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor Author

@bleskes I looked at testConcurrentOutOfDocsOnReplica and it already performs refreshes (every 4 operations) but does not simulate GC deletes as per assertOpsOnPrimary. I thought it'd be useful to wrap up each operation along with the things to do after applying it (refresh/flush/GC) as per 5d46eee, and then make the tests more consistent about applying these things, but didn't want to go too far down that path without your thoughts. WDYT?
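As a rough illustration of that wrapping idea (all names here are hypothetical and not taken from 5d46eee):

```java
import java.util.List;

/** One test operation bundled with the follow-up actions to run after applying it. */
record OperationWithFollowUps(CheckedAction apply, List<CheckedAction> followUps) {
    void execute() throws Exception {
        apply.run();
        for (CheckedAction followUp : followUps) { // e.g. refresh, flush, prune GC deletes
            followUp.run();
        }
    }
}

interface CheckedAction {
    void run() throws Exception;
}
```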

@bleskes
Contributor

bleskes commented Apr 30, 2018

@DaveCTurner shall we close this in favour of #30121 ?

@DaveCTurner
Contributor Author

Yes, let's.

@DaveCTurner DaveCTurner deleted the 2018-03-28-double-check-local-checkpoint branch July 23, 2022 10:44
Labels
>bug :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >non-issue v6.4.0 v7.0.0-beta1
6 participants