Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Zen2] Change MetaDataStateFormat write semantics #34709

Merged
merged 6 commits into from
Oct 24, 2018

Conversation

andrershov
Copy link
Contributor

@andrershov andrershov commented Oct 22, 2018

Currently if MetaDataStateFormat.write throws an IOExceptions if there was some problem with persisting state to disk. If exception is thrown, loadLatestState may read either old state or new state. This is not enough for Zen2 algorithm. In case of failure, we need to distinguish between 2 cases: storage is left in clean state or storage is left in dirty state.
If storage is left in clean state, loadLatestState may read only old state. If storage is left in dirty state, loadLatestState may read either old or new state.
If an exception occurs when writing manifest file to disk this distinction is important for Zen2. If storage is clean, node can continue to be a part of the cluster and may try to accept further cluster state updates (if it fails to accept cluster state updates it will be kicked off from the cluster using different mechanism). But if storage is dirty, node should be restarted and it will be able to startup successfully only once it successfully re-writes manifest file to disl.
This PR changes MetaDataStateFormat.write signature, replacing IOException with WriteStateException, which “isDirty” method could be used to distinguish between 2 failure cases.
We need to minimise number of failures, that leave storage in dirty state. That’s why this PR changes algorithm that is used to store state to disk. It has the following layout:

  1. For the first state location, create and fsync tmp file with state content.
  2. For each extra location, copy and fsync tmp file with state content.
  3. Atomically rename tmp file in the first location.
  4. For each extra location, atomically rename tmp file.
  5. For each location, fsync state directory.
  6. Perform cleanup of old files, ignoring exceptions.

If an exception occurs in steps 1-3, storage is clearly in the clean state. If an exception occurs in step 5, storage is clearly in dirty state. Exception in step 4 is questionable, there are 2 options:

  1. Consider it as failure. If the first disk fails, state disappears. So this is a failure and storage is in dirty state.
  2. Do not consider it as failure at all, ignore disk failures.

This PR prefers 1st approach and MetaDataTestFormatTests.testFailRandomlyAndReadAnyState tests for disk failures. But we could easily switch to option 2 if requested by reviewers.

Andrey Ershov added 2 commits October 22, 2018 15:38
@andrershov andrershov added >enhancement :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Oct 22, 2018
@andrershov andrershov mentioned this pull request Oct 22, 2018
6 tasks
@ywelsch
Copy link
Contributor

ywelsch commented Oct 22, 2018

I'm not convinced that we need another exception type and do all this wrapping. If the goal is to treat exceptions in step 4 and 5 specially, maybe pass in a call-back to the write method that is to be called when step 4 or 5 fail. In most cases (i.e, for all files except the manifest file), we want to those failures to be treated just as any regular IOException, so the default will probably be to pass in an empty closure. In case of the manifest file, we can then kill the node in the closure.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@andrershov
Copy link
Contributor Author

@ywelsch If you don't like converting WriteStateException back to IOException in classes that do not care, I can propose to make WriteStateException extend IOException. In this case, all this wrapping will disappear. For methods that really care, they can catch WriteStateException directly and check dirty flag. I'm not a fan of inflating write method signature to accept a closure, that will have the side-effect of shutting down the node.

@andrershov
Copy link
Contributor Author

@ywelsch I've made WriteStateException extend IOException in b0ac9aa. Please let me know if it works for you.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only substantive comment is about the extra calls to performDirectoryCleanup that I think aren't necessary. I left a handful of nits too.

I'm in two minds about the WriteStateException <: IOException thing vs the callback. On the one hand I find this quite easy to read; on the other hand we have to remember to catch WriteStateException specifically where we care about needing to shut the node down. On balance I think this approach is good (we don't care about WriteStateException in very many places).

/**
* This exception is thrown when there is a problem of writing state to disk. <br>
* If {@link #isDirty()} returns false, state is guaranteed to be not written to disk.
* If {@link #isDirty()} returns true, we don't know if state is written to disk.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better for these docs to be on the isDirty() method.

@@ -103,6 +103,7 @@ public void testReadWriteState() throws IOException {
final long id = addDummyFiles("foo-", dirs);
Format format = new Format("foo-");
DummyState state = new DummyState(randomRealisticUnicodeOfCodepointLengthBetween(1, 1000), randomInt(), randomLong(), randomDouble(), randomBoolean());
int version = between(0, Integer.MAX_VALUE/2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems unused.

@@ -116,6 +117,7 @@ public void testReadWriteState() throws IOException {
DummyState read = format.read(NamedXContentRegistry.EMPTY, list[0]);
assertThat(read, equalTo(state));
}
final int version2 = between(version, Integer.MAX_VALUE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems unused.

@@ -143,6 +145,7 @@ public void testVersionMismatch() throws IOException {

Format format = new Format("foo-");
DummyState state = new DummyState(randomRealisticUnicodeOfCodepointLengthBetween(1, 1000), randomInt(), randomLong(), randomDouble(), randomBoolean());
int version = between(0, Integer.MAX_VALUE/2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems unused.

@@ -166,6 +169,7 @@ public void testCorruption() throws IOException {
final long id = addDummyFiles("foo-", dirs);
Format format = new Format("foo-");
DummyState state = new DummyState(randomRealisticUnicodeOfCodepointLengthBetween(1, 1000), randomInt(), randomLong(), randomDouble(), randomBoolean());
int version = between(0, Integer.MAX_VALUE/2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems unused.

try {
firstStateDirectory.rename(tmpFileName, fileName);
} catch (IOException e) {
throw new WriteStateException(false, "failed to rename tmp file to final name in the first state location", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be useful to see the filenames in the exception message.

try {
extraStateDirectory.rename(tmpFileName, fileName);
} catch (IOException e) {
throw new WriteStateException(true, "failed to rename tmp file to final name in extra state location",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be useful to see the filenames in the exception message.

try {
stateDirectories.get(i).v2().syncMetaData();
} catch (IOException e) {
throw new WriteStateException(true, "meta data directory fsync has failed", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be useful to see the path in the exception message.

}
return extraStateDir;
} catch (Exception e) {
throw new WriteStateException(false, "failed to copy tmp state file to extra location", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be useful to see the filenames in the exception message.

}
return stateDir;
} catch (Exception e) {
throw new WriteStateException(false, "failed to write state to the first location tmp file", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be useful to see the filenames in the exception message.

@andrershov
Copy link
Contributor Author

@DaveCTurner I'm done with the changes, could you please make another round?

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I left one optional nit.

}
}

private static void performDirectoryCleanup(Path stateLocation, Directory stateDir, String tmpFileName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is short and only used in one place, so I think I'd inline it.

@andrershov
Copy link
Contributor Author

run gradle build tests please. Transient download failure

@DaveCTurner
Copy link
Contributor

@elasticmachine run the gradle build tests please. (either another transient download failure or else it didn't hear the first time)

@andrershov andrershov merged commit 7a3cd10 into elastic:zen2 Oct 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants