Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Never delete translog-N.tlog file when creation fails #15788

Merged
merged 2 commits into from
Jan 6, 2016

Conversation

s1monw
Copy link
Contributor

@s1monw s1monw commented Jan 6, 2016

We today delete the translog-N.tlog file if any subsequent operation fails
but we might actually be in a good state if for instance the creation of the writer
failes after we sucessfully baked the new translog generation into the checkpoint. In this situation
we used to delete the translog-N.tlog file and failed on the next recovery of the translog with a
NoSuchFileException | FileNotFoundException just like in https://discuss.elastic.co/t/cannot-recover-index-because-of-missing-tanslog-files/38336

This commit changes the behavior and cleans up that limbo state on recovery if we already have a generation+1 file written but not baked into
the checkpoint we remove that file but only if the previous ckp file has already been renamed otherwise we know we can't be in this state.

We today delete the translog-N.tlog file if any subsequent operation fails
but we might actually be in a good state if for instance the creation of the writer
failes after we sucessfully baked the new translog generation into the checkpoint. In this situation
we used to delete the translog-N.tlog file and failed on the next recovery of the translog with a
NoSuchFileException | FileNotFoundException just like in https://discuss.elastic.co/t/cannot-recover-index-because-of-missing-tanslog-files/38336

This commit changes the behavior and cleans up that limbo state on recovery if we already have a generation+1 file written but not baked into
the checkpoint we remove that file but only if the previous ckp file has already been renamed otherwise we know we can't be in this state.
assert Files.exists(nextTranslogFile) == false || Files.size(nextTranslogFile) <= TranslogWriter.getHeaderLength(translogUUID);
if (Files.exists(currentCheckpointFile) // current checkpoint is already copied
&& Files.deleteIfExists(nextTranslogFile)) { // delete it and log a warning
logger.warn("Deleted invalid next generation before opening writer {} this is like caused by some previously tragic exception ", nextTranslogFile.getFileName());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we soften this message? maybe "deleted previously created, but not yet committed, next generation [{}]. This can happen due to a tragic exception when creating a new generation"

@bleskes
Copy link
Contributor

bleskes commented Jan 6, 2016

LGTM. Left some minor comments. Great catch.

s1monw added a commit that referenced this pull request Jan 6, 2016
Never delete translog-N.tlog file when creation fails
@s1monw s1monw merged commit 8a90c80 into elastic:master Jan 6, 2016
@s1monw s1monw deleted the dont_delete_tlog_file branch January 6, 2016 13:31
s1monw added a commit to s1monw/elasticsearch that referenced this pull request Jan 6, 2016
Never delete translog-N.tlog file when creation fails
s1monw added a commit that referenced this pull request Jan 6, 2016
Never delete translog-N.tlog file when creation fails
s1monw added a commit that referenced this pull request Jan 6, 2016
Never delete translog-N.tlog file when creation fails
s1monw added a commit that referenced this pull request Jan 6, 2016
Never delete translog-N.tlog file when creation fails
s1monw added a commit to s1monw/elasticsearch that referenced this pull request Jan 7, 2016
This commit adds a new test that can throw an IOException at any point in time
and ensures that all previously synced documents can be successfully recovered after hitting
an excepiton.

Relates to elastic#15788
s1monw added a commit that referenced this pull request Jan 7, 2016
This commit adds a new test that can throw an IOException at any point in time
and ensures that all previously synced documents can be successfully recovered after hitting
an excepiton.

Relates to #15788
s1monw added a commit that referenced this pull request Jan 7, 2016
This commit adds a new test that can throw an IOException at any point in time
and ensures that all previously synced documents can be successfully recovered after hitting
an excepiton.

Relates to #15788
s1monw added a commit that referenced this pull request Jan 7, 2016
This commit adds a new test that can throw an IOException at any point in time
and ensures that all previously synced documents can be successfully recovered after hitting
an excepiton.

Relates to #15788
s1monw added a commit that referenced this pull request Jan 7, 2016
This commit adds a new test that can throw an IOException at any point in time
and ensures that all previously synced documents can be successfully recovered after hitting
an excepiton.

Relates to #15788
@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Translog :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocker >bug :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. v2.0.3 v2.1.2 v2.2.0 v2.3.0 v5.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants