Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add global checkpoint to translog checkpoints #21254

Merged

Conversation

jasontedor
Copy link
Member

@jasontedor jasontedor commented Nov 2, 2016

This commit adds the sequence number global checkpoint to translog
checkpoints, and removes them from Lucene commits.

This commit adds the sequence number global checkpoint to translog
checkpoints, and removes them from Lucene commits.
Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this. Left some comments where I think we need some more thinking (as we discussed). Thanks!

throw new IndexFormatTooOldException("trasnlog", "translog has no generation nor a UUID - this might be an index from a previous version consider upgrading to N-1 first");
}
}
final Translog translog = new Translog(translogConfig, generation);
final Translog translog = new Translog(translogConfig, generation, seqNoService::getGlobalCheckpoint);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - shall we make this a parameter to the method, like the other things?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 6cbed98.

* {@code globalCheckpoint} should be set to the last known global
* checkpoint for this shard, or
* {@link SequenceNumbersService#UNASSIGNED_SEQ_NO}.
* Initialize the sequence number service. The {@code maxSeqNo} should be set to the last sequence number assigned by this shard, or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OMG what happened here :)

@@ -155,6 +149,10 @@ public void updateAllocationIdsFromMaster(Set<String> activeAllocationIds, Set<S
* of one of the active allocations is not known.
*/
public boolean updateGlobalCheckpointOnPrimary() {
return globalCheckpointService.updateCheckpointOnPrimary();
final boolean maybeUpdateGlobalCheckpoint = globalCheckpointService.updateCheckpointOnPrimary();
if (maybeUpdateGlobalCheckpoint) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is what we want? maybeUpdateGlobalCheckpoint may be true because the global checkpoint isn't updated but we miss some local checkpoint. I think it's still ok to call this runnable, but we should document it if we keep it like this. Alternatively maybe we should split this method in two - updateGlobalCheckpointOnPrimary and needsLocalCheckpointSync and call those from IndexShard's updateGlobalCheckpointOnPrimary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reverted the changes to this method, they are not needed after 86ec4d0.

+ Long.BYTES // generation
+ CodecUtil.footerLength();

// TODO: remove legacy support, not needed in 6.0.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this a nocommit so we'll clean up later on?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 37a50bb.

void onGlobalCheckpointUpdate() {
try (ReleasableLock ignored = readLock.acquire()) {
ensureOpen();
translog.sync();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this all runs on an indexing thread, i'm concerned it will make us fsync twice. How about not fsync here but rather trigger a sync from the global checkpoint sync action?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 86ec4d0.

* @return the sequence number stats
* @throws IOException if an I/O exception occurred reading the Lucene commit point or the translog checkpoint
*/
private static SeqNoStats loadSeqNoStats(final EngineConfig engineConfig, final IndexWriter writer) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a little funky as we read some stuff from given parameters and some stuff from disk but then only if we're not supposed to ignore the translog. How about never reading the global checkpoint from the translog on opening and make it part of the recovery from translog? also it means peer recovery will need to also pass the global checkpoint as part the engine opening (but we can do this a follow up - add a no commit please if so)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 19d4db0.

This commit replaces a TODO on removing legacy support for
non-checksummed translog checkpoints in Checkpoint.java with a nocommit.
This commit adds a nocommit to convert the way the global checkpoint is
acquired when starting the engine to make it part of the recovery from
the translog.
This commit causes the translog to be synced after every global
checkpoint sync and removes syncing of the global checkpoint from the
indexing path.
This commit modifies the signature of InternalEngine#openTranslog to
feed through the global checkpoint supplier down to the opening of the
translog.
@jasontedor
Copy link
Member Author

Thanks @bleskes. I've responded to your feedback.

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending one testing question.

return new ReplicaResult();
}

private void syncTranslog(final IndexShard indexShard) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need the conversion to an unchecked exceptions? shardOperationOnX Allows throwing them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bleskes This is correct for shardOperationOnPrimary (declares checked Exception) but not shardOperationOnReplica (does not declare any checked exceptions). I did push d742a68 since this was missing from the signature for GlobalCheckpointSyncAction#shardOperationOnPrimary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In master both shardOperationOnPrimary and shardOperationOnReplica declare the top-level checked exception. We can pick this up and remove the wrapping after integrating master into feature/seq_no. I pushed 6c1338c.

@@ -136,7 +137,7 @@ public void tearDown() throws Exception {
}

private Translog create(Path path) throws IOException {
return new Translog(getTranslogConfig(path), null);
return new Translog(getTranslogConfig(path), null, () -> SequenceNumbersService.UNASSIGNED_SEQ_NO);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have a test that checks that the global checkpoint given by this supplier is properly persisted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 37af724.

This commit adds a missing checked exception declared on
TransportReplicationAction#shardOperationOnPrimary but not declared on
the overridden method
GlobalCheckpointSyncAction#shardOperationOnPrimary.
This commit adds a nocommit to GlobalCheckpointSyncAction to remove some
exception wrapping after integrating master into feature/seq_no. This
wrapping will not be needed after integration as
TransportReplicationAction#shardOperationOnPrimary and
TransportReplicationAction#shardOperationOnReplica will both declare a
top-level checked exception.
This commit adds a low-level test that the supplied global checkpoint is
persisted in translog checkpoints when the translog is synced.
@jasontedor
Copy link
Member Author

@bleskes I pushed commits responding to your last round of feedback. Do you want to take another look?

This commit simplifies the advancement of the global checkpoint in
TranslogTests#testBasicCheckpoint by moving the advancement into the
test code.
@bleskes
Copy link
Contributor

bleskes commented Nov 8, 2016

still LGTM. thx.

@jasontedor jasontedor merged commit 9ceb0f2 into elastic:feature/seq_no Nov 8, 2016
@jasontedor jasontedor deleted the global-checkpoint-in-translog branch November 8, 2016 15:09
@jasontedor
Copy link
Member Author

Thanks @bleskes.

@clintongormley clintongormley added :Engine :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Sequence IDs labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement v6.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants