
Only allow x-pack metadata if all nodes are ready #30743

Merged

Conversation

@ywelsch ywelsch (Contributor) commented May 19, 2018

This PR enables a rolling restart from the OSS distribution to the x-pack based distribution by preventing x-pack code from installing custom metadata into the cluster state until all nodes are capable of deserializing this metadata. The changes in this PR are all local to the x-pack code. It's still missing some tests, but it can and should already be reviewed. I've backported this PR to 6.3, and it passed the rolling-upgrade tests introduced in #30728 (more of these tests will be required) from the 6.2 OSS distribution to the 6.3 default distribution.

The changes to the ML, Watcher and License components were straightforward; the TokenService changes are a bit more involved, and I would like to get a pair of eyes from the @elastic/es-security team on those.

@droberts195 this PR includes the custom x-pack node attribute. I'm not sure it makes sense to have a separate PR for that as this is the PR that's actually making use of it.

Relates to #30731

@ywelsch ywelsch added the >enhancement, blocker, v7.0.0, v6.3.0, v6.4.0, :Distributed/Cluster Coordination, :ml, :Security/Security, and :Data Management/Watcher labels on May 19, 2018
@elasticmachine (Collaborator): Pinging @elastic/es-distributed

@elasticmachine (Collaborator): Pinging @elastic/es-core-infra

@elasticmachine (Collaborator): Pinging @elastic/ml-core

});
}
} else {
installTokenMetadataCheck.set(false);
Contributor:

I don't follow why this gets set to false if the custom metadata already exists.
It seems like it could introduce an extraordinarily unlikely race condition (probably impossible in practice):

  1. Thread 1: custom metadata is null,
  2. Thread 1: set installTokenMetadataCheck to true
  3. Thread 2: custom metadata is null
  4. Thread 1: install metadata
  5. Thread 3: custom metadata is non-null
  6. Thread 3: set installTokenMetadataCheck to false
  7. Thread 2: set installTokenMetadataCheck to true
  8. Thread 2: try to install duplicate metadata.

Contributor:

I think the pattern has been copied from other sections of code that want to be defensive against the possibility of their cluster state changes being deleted somehow, and wanting to reinstall them. It's not really a problem if the unlikely race condition you describe occurs, as it doesn't hurt correctness to get into the execute() method twice, only performance.

The history is that there are similar bits of code in ML and Watcher that didn't have the atomic guard variable when first written. So originally they might update the cluster state 100 times with the same change as the cluster started up. After observing this happening we added the atomic guard and it stopped enough of the duplicate metadata additions that it's no longer a noticeable problem.

Member:

Also, this happens inside a cluster state listener, and IIRC these are all executed on the cluster state update thread, which prevents multi-threading issues here.

Contributor Author:

As @droberts195 and @jaymode explained, installTokenMetadata is called by a single thread, possibly repeatedly while the cluster state update task has not yet taken effect. Initially I copied the code from a similar task in ML, but I've made a small simplification in 87ad80f that should make it clearer why and when the flag is reset.
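
For readers following this thread outside the inline view, here is a minimal sketch of the guard pattern under discussion. It assumes the TokenService fields visible in the diff hunk above (installTokenMetadataCheck, a clusterService and a logger); the task body is illustrative rather than the exact merged code.

private final AtomicBoolean installTokenMetadataCheck = new AtomicBoolean(false);

private void installTokenMetadata() {
    // Invoked from the cluster state applier thread, possibly many times before the
    // submitted update task takes effect. The AtomicBoolean keeps at most one update
    // task in flight at a time.
    if (installTokenMetadataCheck.compareAndSet(false, true)) {
        clusterService.submitStateUpdateTask("install-token-metadata", new ClusterStateUpdateTask(Priority.URGENT) {
            @Override
            public ClusterState execute(ClusterState currentState) {
                XPackPlugin.checkReadyForXPackCustomMetadata(currentState);
                if (currentState.custom(TokenMetaData.TYPE) == null) {
                    return ClusterState.builder(currentState)
                            .putCustom(TokenMetaData.TYPE, getTokenMetaData())
                            .build();
                }
                return currentState; // metadata already installed; nothing to do
            }

            @Override
            public void onFailure(String source, Exception e) {
                installTokenMetadataCheck.set(false);
                logger.error("unable to install token metadata", e);
            }

            @Override
            public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
                // reset only once the task has been processed, so repeated listener
                // invocations in the meantime stay cheap no-ops
                installTokenMetadataCheck.set(false);
            }
        });
    }
}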

@jaymode jaymode (Member) left a comment:

The security changes LGTM. Left a minor comment about cleaning up a method override.

} else {
return Collections.emptyMap();
}
// TODO: Remove this whole concept of InitialClusterStateCustomSupplier
Member:

just remove the method override here?

Contributor Author:

Sure, I've pushed e1437e8.

@bleskes bleskes (Contributor) left a comment:

This looks good. I left some questions and a nit.


return super.additionalSettings();
} else {
if (settings.get(xpackInstalledNodeAttrSetting) != null &&
Contributor:

why do we allow this setting to be already set in this case? shouldn't we lock it down?

Contributor:

I think it's because the internal cluster integration test framework restarts nodes with settings copied from the node immediately before it was stopped. There's a comment here for the same problem that could be copied over to prevent confusion.

Contributor Author:

exactly. I wanted to fix that behavior, but did not want to make it part of this PR.

Contributor:

I see. Makes sense. Thanks.

Contributor Author:

I've added a comment in ba94bdd

Contributor Author:

I've opened #30780 which would allow me to make this check more strict.

Contributor Author:

I've merged that PR and made the check more strict in 1b6ae1d
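
For context on this thread, a hedged sketch of what the stricter additionalSettings() override looks like after #30780. The constant xpackInstalledNodeAttrSetting follows the diff hunk above; transportClientMode and XPACK_INSTALLED_NODE_ATTR are assumed surrounding fields, not verbatim code.

@Override
public Settings additionalSettings() {
    final String xpackInstalledNodeAttrSetting = "node.attr." + XPACK_INSTALLED_NODE_ATTR;
    if (transportClientMode) {
        return super.additionalSettings();
    } else {
        // With #30780 nodes restart with their original settings, so the attribute must
        // never arrive from the outside; fail hard instead of silently accepting it.
        if (settings.get(xpackInstalledNodeAttrSetting) != null) {
            throw new IllegalArgumentException("Directly setting [" + xpackInstalledNodeAttrSetting + "] is not permitted");
        }
        return Settings.builder()
                .put(super.additionalSettings())
                .put(xpackInstalledNodeAttrSetting, "true")
                .build();
    }
}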

@@ -138,6 +152,78 @@ protected Clock getClock() {
public static LicenseService getSharedLicenseService() { return licenseService.get(); }
public static XPackLicenseState getSharedLicenseState() { return licenseState.get(); }

/**
* Checks if the cluster state allows this node to add x-pack metadata to the cluster state,
Contributor:

Can you document that if the cluster state already contains x-pack metadata it is always considered "ready"?

Contributor Author:

I've pushed 5f25ea3
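
To make the documented contract concrete, here is a sketch of the readiness check. The XPACK_INSTALLED_NODE_ATTR constant and the alreadyContainsXPackCustomMetadata(...) helper are assumptions for illustration, not the literal merged code.

public static void checkReadyForXPackCustomMetadata(ClusterState clusterState) {
    if (alreadyContainsXPackCustomMetadata(clusterState)) {
        return; // existing x-pack metadata means the cluster is always considered "ready"
    }
    List<DiscoveryNode> notReady = nodesNotReadyForXPackCustomMetadata(clusterState);
    if (notReady.isEmpty() == false) {
        throw new IllegalStateException("The following nodes are not ready yet for enabling x-pack custom metadata: " + notReady);
    }
}

public static List<DiscoveryNode> nodesNotReadyForXPackCustomMetadata(ClusterState clusterState) {
    // Nodes advertise the attribute via additionalSettings(); a node without it (e.g. an
    // OSS node still in the cluster) may be unable to deserialize x-pack custom metadata.
    return StreamSupport.stream(clusterState.nodes().spliterator(), false)
            .filter(node -> node.getAttributes().containsKey(XPACK_INSTALLED_NODE_ATTR) == false)
            .collect(Collectors.toList());
}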

@@ -57,6 +58,7 @@ protected void masterOperation(FinalizeJobExecutionAction.Request request, Clust
clusterService.submitStateUpdateTask(source, new ClusterStateUpdateTask() {
@Override
public ClusterState execute(ClusterState currentState) throws Exception {
XPackPlugin.checkReadyForXPackCustomMetadata(currentState);
Contributor:

@droberts195 this feels strange given the name of the method. If we have got far enough for TransportFinalizeJobExecutionAction to be called, shouldn't the cluster state already have the ML type? Can you please double-check?

Contributor:

Yes, I think the only endpoints that need protecting are the ones that put jobs and datafeeds. Until the user has created an ML entity the endpoints that operate on them should be no-ops.

(There is actually a bug in this action: if you pass it a non-existent job ID it currently fails with an NPE. That's not a disaster, as it's an undocumented action intended for internal use, but I will make it more defensive in another PR.)

logger.debug("cannot add ML metadata to cluster as the following nodes might not understand the ML metadata: {}",
() -> XPackPlugin.nodesNotReadyForXPackCustomMetadata(event.state()));
return;
}
Contributor:

Following the changes I made in #30751 there's no need to change this file in this PR. We no longer eagerly install the ML metadata on startup.

@droberts195 droberts195 (Contributor) left a comment:

Thanks for adding the xpack.installed attribute. I'll update the meta issue to say it's been done in this PR.

@@ -188,6 +189,9 @@ public void putJob(PutJobAction.Request request, AnalysisRegistry analysisRegist
DEPRECATION_LOGGER.deprecated("Creating jobs with delimited data format is deprecated. Please use xcontent instead.");
}

// pre-flight check, not necessarily required, but avoids figuring this out while on the CS update thread
Contributor:

I think it would be best to remove the "not necessarily required" bit and make this the primary check for protecting the cluster state against ML jobs. Then the check can be removed from buildNewClusterState() at the bottom of the file.

@@ -565,6 +569,7 @@ public ClusterState execute(ClusterState currentState) {
}

private static ClusterState buildNewClusterState(ClusterState currentState, MlMetadata.Builder builder) {
XPackPlugin.checkReadyForXPackCustomMetadata(currentState);
Contributor:

I think if creation of ML jobs and datafeeds is prevented elsewhere then this is not necessary. Any other updates to the ML custom cluster state imply that it already exists.

Contributor Author:

@droberts195 and I discussed this. We will keep the extra checks for now, but will investigate which ones can be dropped in a follow-up, adding more assertions to ensure we have all places covered.
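
As a pointer for anyone revisiting this later, a brief sketch of the belt-and-braces arrangement that was kept: the readiness check runs both as a pre-flight step and again when the new cluster state is built. The signatures are simplified and the "ml" metadata key is an assumption in this sketch.

public void putJob(PutJobAction.Request request, ClusterState state, ActionListener<PutJobAction.Response> listener) {
    // pre-flight check: fail fast on the calling thread rather than on the
    // cluster state update thread
    XPackPlugin.checkReadyForXPackCustomMetadata(state);
    // ... validate the job, then submit the cluster state update task ...
}

private static ClusterState buildNewClusterState(ClusterState currentState, MlMetadata.Builder builder) {
    // defensive re-check, kept for now; a follow-up will decide which checks can be
    // dropped once assertions cover all entry points
    XPackPlugin.checkReadyForXPackCustomMetadata(currentState);
    return ClusterState.builder(currentState)
            .metaData(MetaData.builder(currentState.metaData()).putCustom("ml", builder.build()).build())
            .build();
}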

@bleskes bleskes (Contributor) left a comment:

My questions were answered. LGTM.

@ywelsch ywelsch merged commit 8145a82 into elastic:master May 23, 2018
ywelsch added a commit that referenced this pull request May 23, 2018
Enables a rolling restart from the OSS distribution to the x-pack based distribution by preventing
x-pack code from installing custom metadata into the cluster state until all nodes are capable of
deserializing this metadata.
ywelsch added a commit that referenced this pull request May 23, 2018
Enables a rolling restart from the OSS distribution to the x-pack based distribution by preventing
x-pack code from installing custom metadata into the cluster state until all nodes are capable of
deserializing this metadata.
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 23, 2018
* master: (25 commits)
  [DOCS] Splits auditing.asciidoc into smaller files
  Reintroduce mandatory http pipelining support (elastic#30820)
  Painless: Types Section Clean Up (elastic#30283)
  Add support for indexed shape routing in geo_shape query (elastic#30760)
  [test] java tests for archive packaging (elastic#30734)
  Revert "Make http pipelining support mandatory (elastic#30695)" (elastic#30813)
  [DOCS] Fix more edit URLs in Stack Overview (elastic#30704)
  Use correct cluster state version for node fault detection (elastic#30810)
  Change serialization version of doc-value fields.
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (elastic#30797)
  [TEST] Don't expect acks when isolating nodes
  Add a `format` option to `docvalue_fields`. (elastic#29639)
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Mustes {p0=snapshot.get_repository/10_basic/*} YAML test
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (elastic#30743)
  Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled
  Use original settings on full-cluster restart (elastic#30780)
  Only ack cluster state updates successfully applied on all nodes (elastic#30672)
  ...
dnhatn added a commit that referenced this pull request May 24, 2018
* 6.x:
  [DOCS] Fixes typos in security settings
  Add support for indexed shape routing in geo_shape query (#30760)
  [DOCS] Splits auditing.asciidoc into smaller files
  Painless: Types Section Clean Up (#30283)
  [test] java tests for archive packaging (#30734)
  Deprecate http.pipelining setting (#30786)
  [DOCS] Fix more edit URLs in Stack Overview (#30704)
  Use correct cluster state version for node fault detection (#30810)
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (#30797)
  Change serialization version of doc-value fields.
  Add a `format` option to `docvalue_fields`. (#29639)
  [TEST] Don't expect acks when isolating nodes
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Revert "Add more yaml tests for get alias API (#29513)"
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (#30743)
  Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled
  Use original settings on full-cluster restart (#30780)
  Only ack cluster state updates successfully applied on all nodes (#30672)
  Replace Request#setHeaders with addHeader (#30588)
  [TEST] remove endless wait in RestClientTests (#30776)
  QA: Add xpack tests to rolling upgrade (#30795)
  Add support for search templates to the high-level REST client. (#30473)
  Reduce CLI scripts to one-liners on Windows (#30772)
  Fold RestGetAllSettingsAction in RestGetSettingsAction (#30561)
  Add more yaml tests for get alias API (#29513)
  [Docs] Fix script-fields snippet execution (#30693)
  Convert FieldCapabilitiesResponse to a ToXContentObject. (#30182)
  Remove assert statements from field caps documentation. (#30601)
  Fix a bug in FieldCapabilitiesRequest#equals and hashCode. (#30181)
  Add support for field capabilities to the high-level REST client. (#29664)
  [DOCS] Add SAML configuration information (#30548)
  [DOCS] Remove X-Pack references from SQL CLI (#30694)
  [Docs] Fix typo in circuit breaker docs (#29659)
  [Feature] Adding a char_group tokenizer (#24186)
  Increase the maximum number of filters that may be in the cache. (#30655)
  [Docs] Fix broken cross link in documentation
  Test: wait for netty threads in a JUnit ClassRule (#30763)
  [Security] Include an empty json object in an json array when FLS filters out all fields (#30709)
  [DOCS] fixed incorrect default
  [TEST] Wait for CS to be fully applied in testDeleteCreateInOneBulk
  Enable installing plugins from snapshots.elastic.co (#30765)
  Ignore empty completion input (#30713)
  Fix docs failure on language analyzers (#30722)
  [Docs] Fix inconsistencies in snapshot/restore doc (#30480)
  Add Delete Repository High Level REST API (#30666)
  Reduce CLI scripts to one-liners (#30759)
dnhatn added a commit that referenced this pull request May 24, 2018
* master:
  [DOCS] Fixes typos in security settings
  Fix GeoShapeQueryBuilder serialization after backport
  [DOCS] Splits auditing.asciidoc into smaller files
  Reintroduce mandatory http pipelining support (#30820)
  Painless: Types Section Clean Up (#30283)
  Add support for indexed shape routing in geo_shape query (#30760)
  [test] java tests for archive packaging (#30734)
  Revert "Make http pipelining support mandatory (#30695)" (#30813)
  [DOCS] Fix more edit URLs in Stack Overview (#30704)
  Use correct cluster state version for node fault detection (#30810)
  Change serialization version of doc-value fields.
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (#30797)
  [TEST] Don't expect acks when isolating nodes
  Add a `format` option to `docvalue_fields`. (#29639)
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Mustes {p0=snapshot.get_repository/10_basic/*} YAML test
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (#30743)
  Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled
  Use original settings on full-cluster restart (#30780)
  Only ack cluster state updates successfully applied on all nodes (#30672)
  Expose Lucene's FeatureField. (#30618)
  Fix a grammatical error in the 'search types' documentation.
  Remove http pipelining from integration test case (#30788)
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request May 25, 2018
* es/ccr: (55 commits)
  [DOCS] Fixes typos in security settings
  Fix GeoShapeQueryBuilder serialization after backport
  [DOCS] Splits auditing.asciidoc into smaller files
  Reintroduce mandatory http pipelining support (elastic#30820)
  Painless: Types Section Clean Up (elastic#30283)
  Add support for indexed shape routing in geo_shape query (elastic#30760)
  [test] java tests for archive packaging (elastic#30734)
  Revert "Make http pipelining support mandatory (elastic#30695)" (elastic#30813)
  [DOCS] Fix more edit URLs in Stack Overview (elastic#30704)
  Use correct cluster state version for node fault detection (elastic#30810)
  Change serialization version of doc-value fields.
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (elastic#30797)
  [TEST] Don't expect acks when isolating nodes
  Mute CorruptedFileIT in CCR
  Add a `format` option to `docvalue_fields`. (elastic#29639)
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Mustes {p0=snapshot.get_repository/10_basic/*} YAML test
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (elastic#30743)
  ...
ywelsch added a commit that referenced this pull request Jun 1, 2018
Otherwise we could end up with persistent tasks metadata in the cluster that some of the nodes
might not understand in case where the cluster is during rolling upgrade from the default 6.2 to the
default 6.3 distribution.

Follow-up to #30743
ywelsch added a commit that referenced this pull request Jun 1, 2018
Otherwise we could end up with persistent tasks metadata in the cluster that some of the nodes
might not understand in case where the cluster is during rolling upgrade from the default 6.2 to the
default 6.3 distribution.

Follow-up to #30743
ywelsch added a commit that referenced this pull request Jun 1, 2018
Otherwise we could end up with persistent tasks metadata in the cluster that some of the nodes
might not understand in case where the cluster is during rolling upgrade from the default 6.2 to the
default 6.3 distribution.

Follow-up to #30743
jaymode added a commit to jaymode/elasticsearch that referenced this pull request Jun 11, 2018
This commit fixes a backwards compatibility bug in the token service
that causes token decoding to fail when there is a pre 6.0.0-beta2 node
in the cluster. The token encoding is actually the culprit as a version
check is missing around the serialization of the key hash bytes. This
value was added in 6.0.0-beta2 and cannot be sent to nodes that do not
know about this value. The version check has been added and the token
service unit tests have been enhanced to randomly run with some 5.6.x
nodes in the cluster service.

Additionally, a small change was made to the way we check to see if the
token metadata needs to be installed. Previously we would pass the
metadata to the install method and check that the token metadata is
null. This null check is now done prior to checking if the metadata can
be installed.

Relates elastic#30743
Closes elastic#31195
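
For reference, a sketch of the version-check pattern this commit message describes; the field names (tokenId, keyHash) are illustrative, and the real change sits in the token encoding path of TokenService.

public void writeTo(StreamOutput out) throws IOException {
    out.writeString(tokenId);
    // the key hash was introduced in 6.0.0-beta2; older nodes do not know about it,
    // so it must not be written to streams destined for them
    if (out.getVersion().onOrAfter(Version.V_6_0_0_beta2)) {
        out.writeByteArray(keyHash);
    }
}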
jaymode added a commit that referenced this pull request Jun 13, 2018
This commit fixes a backwards compatibility bug in the token service
that causes token decoding to fail when there is a pre 6.0.0-beta2 node
in the cluster. The token encoding is actually the culprit as a version
check is missing around the serialization of the key hash bytes. This
value was added in 6.0.0-beta2 and cannot be sent to nodes that do not
know about this value. The version check has been added and the token
service unit tests have been enhanced to randomly run with some 5.6.x
nodes in the cluster service.

Additionally, a small change was made to the way we check to see if the
token metadata needs to be installed. Previously we would pass the
metadata to the install method and check that the token metadata is
null. This null check is now done prior to checking if the metadata can
be installed.

Relates #30743
Closes #31195
jaymode added a commit that referenced this pull request Jun 13, 2018
This commit fixes a backwards compatibility bug in the token service
that causes token decoding to fail when there is a pre 6.0.0-beta2 node
in the cluster. The token encoding is actually the culprit as a version
check is missing around the serialization of the key hash bytes. This
value was added in 6.0.0-beta2 and cannot be sent to nodes that do not
know about this value. The version check has been added and the token
service unit tests have been enhanced to randomly run with some 5.6.x
nodes in the cluster service.

Additionally, a small change was made to the way we check to see if the
token metadata needs to be installed. Previously we would pass the
metadata to the install method and check that the token metadata is
null. This null check is now done prior to checking if the metadata can
be installed.

Relates #30743
Closes #31195
ywelsch added a commit that referenced this pull request Aug 2, 2018
This infrastructure was introduced in #26144 and made obsolete in #30743
@jpountz jpountz removed the :Data Management/Watcher, :Security/Security, and :ml labels on Jan 29, 2019
Labels
blocker, :Distributed/Cluster Coordination, >enhancement, v6.3.0, v6.4.0, v7.0.0-beta1
8 participants