[Zen2] Change MetaDataStateFormat write semantics #34709

andrershov · 2018-10-22T13:23:01Z

Currently, MetaDataStateFormat.write throws an IOException if there is a problem persisting state to disk. If an exception is thrown, loadLatestState may read either the old state or the new state. This is not enough for the Zen2 algorithm. In case of failure, we need to distinguish between 2 cases: storage is left in a clean state, or storage is left in a dirty state.
If storage is left in a clean state, loadLatestState may read only the old state. If storage is left in a dirty state, loadLatestState may read either the old or the new state.
This distinction is important for Zen2 when an exception occurs while writing the manifest file to disk. If storage is clean, the node can continue to be a part of the cluster and may try to accept further cluster state updates (if it fails to accept cluster state updates, it will be kicked out of the cluster by a different mechanism). But if storage is dirty, the node should be restarted, and it will be able to start up successfully only once it has successfully re-written the manifest file to disk.
This PR changes the MetaDataStateFormat.write signature, replacing IOException with WriteStateException, whose isDirty method can be used to distinguish between the 2 failure cases.
We need to minimise the number of failures that leave storage in a dirty state. That's why this PR changes the algorithm that is used to store state to disk. It has the following steps:

1. For the first state location, create and fsync tmp file with state content.
2. For each extra location, copy and fsync tmp file with state content.
3. Atomically rename tmp file in the first location.
4. For each extra location, atomically rename tmp file.
5. For each location, fsync state directory.
6. Perform cleanup of old files, ignoring exceptions.

If an exception occurs in steps 1-3, storage is clearly in a clean state. If an exception occurs in step 5, storage is clearly in a dirty state. An exception in step 4 is questionable; there are 2 options:

1. Consider it a failure. If the first disk fails, the state disappears. So this is a failure and storage is in a dirty state.
2. Do not consider it a failure at all; ignore disk failures.

This PR prefers the 1st approach, and MetaDataStateFormatTests.testFailRandomlyAndReadAnyState tests for disk failures. But we could easily switch to option 2 if requested by reviewers.
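
The write steps and their clean/dirty classification (with option 1 for step 4) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PR's implementation: it uses java.nio.file instead of the Directory abstraction that appears in the diff, and DirtyAwareWriter, WriteFailure and the ".tmp" suffix are hypothetical names. It also assumes that each write uses a fresh file name and that a directory can be fsynced through a FileChannel, which holds on POSIX systems but not on Windows.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical sketch; class and method names are illustrative and not taken from the PR.
final class DirtyAwareWriter {

    /** Stand-in for WriteStateException: dirty == true means a later read may see old or new state. */
    static final class WriteFailure extends IOException {
        final boolean dirty;

        WriteFailure(boolean dirty, String message, IOException cause) {
            super(message, cause);
            this.dirty = dirty;
        }
    }

    /** Forces a file or directory to disk. Opening a directory for READ is a POSIX assumption. */
    static void fsync(Path path) throws IOException {
        StandardOpenOption mode = Files.isDirectory(path) ? StandardOpenOption.READ : StandardOpenOption.WRITE;
        try (FileChannel channel = FileChannel.open(path, mode)) {
            channel.force(true);
        }
    }

    static void write(List<Path> locations, String fileName, byte[] content) throws WriteFailure {
        String tmpName = fileName + ".tmp";
        Path first = locations.get(0);
        List<Path> extras = locations.subList(1, locations.size());
        try {
            // Step 1: create and fsync the tmp file in the first location.
            Path firstTmp = first.resolve(tmpName);
            Files.write(firstTmp, content);
            fsync(firstTmp);
            // Step 2: copy the tmp file to every extra location and fsync each copy.
            for (Path extra : extras) {
                Path extraTmp = extra.resolve(tmpName);
                Files.copy(firstTmp, extraTmp, StandardCopyOption.REPLACE_EXISTING);
                fsync(extraTmp);
            }
            // Step 3: atomically rename the tmp file in the first location.
            Files.move(firstTmp, first.resolve(fileName), StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            deleteTmpFiles(locations, tmpName);
            // Failures in steps 1-3 are reported as clean.
            throw new WriteFailure(false, "failed to write " + fileName + " (steps 1-3)", e);
        }
        try {
            // Step 4: atomically rename the tmp file in every extra location.
            for (Path extra : extras) {
                Files.move(extra.resolve(tmpName), extra.resolve(fileName), StandardCopyOption.ATOMIC_MOVE);
            }
            // Step 5: fsync every state directory so the renames are durable.
            for (Path location : locations) {
                fsync(location);
            }
        } catch (IOException e) {
            deleteTmpFiles(locations, tmpName);
            // Failures in steps 4-5 are reported as dirty (option 1).
            throw new WriteFailure(true, "failed to write " + fileName + " (steps 4-5)", e);
        }
        // Step 6: cleanup of old files would go here, with its exceptions ignored.
    }

    /** Best-effort removal of leftover tmp files; exceptions are deliberately ignored. */
    static void deleteTmpFiles(List<Path> locations, String tmpName) {
        for (Path location : locations) {
            try {
                Files.deleteIfExists(location.resolve(tmpName));
            } catch (IOException ignored) {
                // nothing useful can be done here
            }
        }
    }
}
```

Grouping the steps into two try blocks keeps the classification in one place: everything before the first visible rename completes is reported as clean, everything after it as dirty.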

Write should clearly report if storage is left in dirty state.

ywelsch · 2018-10-22T15:17:42Z

I'm not convinced that we need another exception type and all this wrapping. If the goal is to treat exceptions in steps 4 and 5 specially, maybe pass in a callback to the write method that is to be called when step 4 or 5 fails. In most cases (i.e., for all files except the manifest file), we want those failures to be treated just as any regular IOException, so the default will probably be to pass in an empty closure. In the case of the manifest file, we can then kill the node in the closure.
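
For comparison, the callback idea might look roughly like the sketch below. It is purely illustrative: DirtyFailureHandler, IGNORE and the failAtStep parameter are hypothetical names used to simulate a failure at a given step, and none of them come from the PR.

```java
import java.io.IOException;

// Hypothetical sketch of the callback alternative; nothing here is taken from the PR.
final class CallbackWriteSketch {

    /** Hook invoked when a failure may have left storage dirty (steps 4 and 5). */
    interface DirtyFailureHandler {
        void onDirtyFailure(IOException cause);
    }

    /** Default for callers that treat every failure as a regular IOException. */
    static final DirtyFailureHandler IGNORE = cause -> { };

    /**
     * Simulated write: failAtStep selects the step (1-5) that fails, 0 means success.
     * The handler is called only for steps 4 and 5; the IOException propagates either way.
     */
    static void write(int failAtStep, DirtyFailureHandler handler) throws IOException {
        for (int step = 1; step <= 5; step++) {
            if (step == failAtStep) {
                IOException failure = new IOException("simulated failure in step " + step);
                if (step >= 4) {
                    handler.onDirtyFailure(failure);
                }
                throw failure;
            }
        }
    }
}
```

A manifest writer would pass a handler that shuts the node down, while every other caller would pass IGNORE and handle the IOException as before.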

elasticmachine · 2018-10-22T18:51:11Z

Pinging @elastic/es-distributed

andrershov · 2018-10-23T06:36:08Z

@ywelsch If you don't like converting WriteStateException back to IOException in classes that do not care, I propose making WriteStateException extend IOException. In this case, all this wrapping will disappear. Methods that really care can catch WriteStateException directly and check the dirty flag. I'm not a fan of inflating the write method signature to accept a closure that will have the side effect of shutting down the node.
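
A minimal sketch of what this proposal means for callers, assuming a constructor shaped like the `new WriteStateException(boolean, String, Exception)` calls in the diff. The demo class, its methods and the returned strings are hypothetical and only illustrate the two kinds of caller.

```java
import java.io.IOException;

// Stand-in for the PR's exception; only isDirty() and the constructor shape are taken from the thread.
class WriteStateException extends IOException {
    private final boolean dirty;

    WriteStateException(boolean dirty, String message, Exception cause) {
        super(message, cause);
        this.dirty = dirty;
    }

    boolean isDirty() {
        return dirty;
    }
}

// Hypothetical callers, for illustration only.
final class WriteStateExceptionDemo {

    /** Simulated write that always fails, leaving storage dirty or clean as requested. */
    static void write(boolean leaveDirty) throws WriteStateException {
        throw new WriteStateException(leaveDirty, "simulated write failure", new IOException("disk error"));
    }

    /** A caller that does not care about dirtiness keeps catching plain IOException, with no wrapping. */
    static String indifferentCaller(boolean leaveDirty) {
        try {
            write(leaveDirty);
            return "ok";
        } catch (IOException e) {
            return "io-failure";
        }
    }

    /** A caller that cares (e.g. one writing the manifest) catches the subtype and checks the flag. */
    static String manifestCaller(boolean leaveDirty) {
        try {
            write(leaveDirty);
            return "ok";
        } catch (WriteStateException e) {
            return e.isDirty() ? "restart-node" : "continue";
        }
    }

    public static void main(String[] args) {
        System.out.println(indifferentCaller(true));
        System.out.println(manifestCaller(true));
        System.out.println(manifestCaller(false));
    }
}
```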

andrershov · 2018-10-23T09:55:46Z

@ywelsch I've made WriteStateException extend IOException in b0ac9aa. Please let me know if it works for you.

DaveCTurner

My only substantive comment is about the extra calls to performDirectoryCleanup that I think aren't necessary. I left a handful of nits too.

I'm in two minds about the WriteStateException <: IOException thing vs the callback. On the one hand I find this quite easy to read; on the other hand we have to remember to catch WriteStateException specifically where we care about needing to shut the node down. On balance I think this approach is good (we don't care about WriteStateException in very many places).

DaveCTurner · 2018-10-23T10:09:03Z

server/src/main/java/org/elasticsearch/gateway/WriteStateException.java

+/**
+ * This exception is thrown when there is a problem of writing state to disk. <br>
+ * If {@link #isDirty()} returns false, state is guaranteed to be not written to disk.
+ * If {@link #isDirty()} returns true, we don't know if state is written to disk.


Probably better for these docs to be on the isDirty() method.

DaveCTurner · 2018-10-23T10:10:59Z

server/src/test/java/org/elasticsearch/gateway/MetaDataStateFormatTests.java

@@ -103,6 +103,7 @@ public void testReadWriteState() throws IOException {
        final long id = addDummyFiles("foo-", dirs);
        Format format = new Format("foo-");
        DummyState state = new DummyState(randomRealisticUnicodeOfCodepointLengthBetween(1, 1000), randomInt(), randomLong(), randomDouble(), randomBoolean());
+        int version = between(0, Integer.MAX_VALUE/2);


Seems unused.

DaveCTurner · 2018-10-23T10:11:04Z

server/src/test/java/org/elasticsearch/gateway/MetaDataStateFormatTests.java

@@ -116,6 +117,7 @@ public void testReadWriteState() throws IOException {
            DummyState read = format.read(NamedXContentRegistry.EMPTY, list[0]);
            assertThat(read, equalTo(state));
        }
+        final int version2 = between(version, Integer.MAX_VALUE);


Seems unused.

DaveCTurner · 2018-10-23T10:11:08Z

server/src/test/java/org/elasticsearch/gateway/MetaDataStateFormatTests.java

@@ -143,6 +145,7 @@ public void testVersionMismatch() throws IOException {

        Format format = new Format("foo-");
        DummyState state = new DummyState(randomRealisticUnicodeOfCodepointLengthBetween(1, 1000), randomInt(), randomLong(), randomDouble(), randomBoolean());
+        int version = between(0, Integer.MAX_VALUE/2);


Seems unused.

DaveCTurner · 2018-10-23T10:11:16Z

server/src/test/java/org/elasticsearch/gateway/MetaDataStateFormatTests.java

@@ -166,6 +169,7 @@ public void testCorruption() throws IOException {
        final long id = addDummyFiles("foo-", dirs);
        Format format = new Format("foo-");
        DummyState state = new DummyState(randomRealisticUnicodeOfCodepointLengthBetween(1, 1000), randomInt(), randomLong(), randomDouble(), randomBoolean());
+        int version = between(0, Integer.MAX_VALUE/2);


Seems unused.

DaveCTurner · 2018-10-23T10:42:06Z

server/src/main/java/org/elasticsearch/gateway/MetaDataStateFormat.java

+        try {
+            firstStateDirectory.rename(tmpFileName, fileName);
+        } catch (IOException e) {
+            throw new WriteStateException(false, "failed to rename tmp file to final name in the first state location", e);


I think it'd be useful to see the filenames in the exception message.

DaveCTurner · 2018-10-23T10:42:10Z

server/src/main/java/org/elasticsearch/gateway/MetaDataStateFormat.java

+            try {
+                extraStateDirectory.rename(tmpFileName, fileName);
+            } catch (IOException e) {
+                throw new WriteStateException(true, "failed to rename tmp file to final name in extra state location",


I think it'd be useful to see the filenames in the exception message.

DaveCTurner · 2018-10-23T10:42:30Z

server/src/main/java/org/elasticsearch/gateway/MetaDataStateFormat.java

+            try {
+                stateDirectories.get(i).v2().syncMetaData();
+            } catch (IOException e) {
+                throw new WriteStateException(true, "meta data directory fsync has failed", e);


I think it'd be useful to see the path in the exception message.

DaveCTurner · 2018-10-23T10:43:07Z

server/src/main/java/org/elasticsearch/gateway/MetaDataStateFormat.java

            }
+            return extraStateDir;
+        } catch (Exception e) {
+            throw new WriteStateException(false, "failed to copy tmp state file to extra location", e);


I think it'd be useful to see the filenames in the exception message.

DaveCTurner · 2018-10-23T10:43:18Z

server/src/main/java/org/elasticsearch/gateway/MetaDataStateFormat.java

+            }
+            return stateDir;
+        } catch (Exception e) {
+            throw new WriteStateException(false, "failed to write state to the first location tmp file", e);


I think it'd be useful to see the filenames in the exception message.

andrershov · 2018-10-23T12:54:43Z

@DaveCTurner I'm done with the changes, could you please make another round?

DaveCTurner

LGTM. I left one optional nit.

DaveCTurner · 2018-10-23T13:33:27Z

server/src/main/java/org/elasticsearch/gateway/MetaDataStateFormat.java

+        }
+    }
+
+    private static void performDirectoryCleanup(Path stateLocation, Directory stateDir, String tmpFileName) {


This is short and only used in one place, so I think I'd inline it.

andrershov · 2018-10-23T15:48:56Z

run gradle build tests please. Transient download failure

DaveCTurner · 2018-10-24T07:54:27Z

@elasticmachine run the gradle build tests please. (either another transient download failure or else it didn't hear the first time)

Andrey Ershov added 2 commits October 22, 2018 15:38

Fix test failure. Convert IllegalStateException to IOException

1ec0c73

Change meta data write failure semantics

0cfffd5

Write should clearly report if storage is left in dirty state.

andrershov added the >enhancement and :Distributed/Cluster Coordination labels Oct 22, 2018

andrershov requested review from ywelsch and DaveCTurner October 22, 2018 13:23

andrershov mentioned this pull request Oct 22, 2018

Zen2 ClusterState storage #33958

Closed

6 tasks

WriteStateException extends IOException

b0ac9aa

andrershov force-pushed the zen2_write_semantics branch from 8ce0c82 to b0ac9aa Compare October 23, 2018 09:52

DaveCTurner reviewed Oct 23, 2018

Andrey Ershov added 2 commits October 23, 2018 15:14

Fix David's code review comments

5f70ad8

Open all directories as the first algorithm step

aa443d6

DaveCTurner approved these changes Oct 23, 2018

Inline performDirectoryCleanup

d8fdf27

DaveCTurner mentioned this pull request Oct 24, 2018

[Zen2] Fix test failure. Convert IllegalStateException to IOException #34711

Closed

andrershov merged commit 7a3cd10 into elastic:zen2 Oct 24, 2018
