Log warnings when cluster state publication failed to some nodes #31233

Merged
bleskes merged 3 commits into elastic:master from log_warn_failed_deserialization
Jun 13, 2018

Contributor

@bleskes bleskes commented Jun 10, 2018

If publishing a cluster state to a node fails, we currently log it only at debug level, and only on the master. This makes it hard to see the cause of (test) failures when logging is set to default levels. This PR adds a warn-level log on the node receiving the cluster state when it fails to deserialise that state, and a warn-level log on the master listing the nodes for which publication failed.
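
The master-side half of this change can be pictured with a minimal sketch. It is not the Elasticsearch implementation: `PublishTracker`, its methods, the `String` node names, and the log messages are hypothetical stand-ins, and `java.util.logging` is used only to keep the example self-contained. The idea is to keep per-node failure detail at debug (`FINE`) level and emit one warn-level summary once every node has either responded or failed.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical, simplified stand-in for master-side publish bookkeeping.
final class PublishTracker {
    private static final Logger logger = Logger.getLogger(PublishTracker.class.getName());

    private final Set<String> pendingNodes = new LinkedHashSet<>();
    private final Set<String> failedNodes = new LinkedHashSet<>();

    PublishTracker(Collection<String> targets) {
        pendingNodes.addAll(targets);
    }

    synchronized void onResponse(String node) {
        boolean found = pendingNodes.remove(node);
        assert found : "node [" + node + "] already responded or failed";
        maybeLogSummary();
    }

    synchronized void onFailure(String node, Exception e) {
        boolean found = pendingNodes.remove(node);
        assert found : "node [" + node + "] already responded or failed";
        boolean added = failedNodes.add(node);
        assert added : "duplicate failures for " + node;
        // Per-node detail stays at debug level.
        logger.log(Level.FINE, "failed to publish cluster state to " + node, e);
        maybeLogSummary();
    }

    synchronized boolean isDone() {
        return pendingNodes.isEmpty();
    }

    synchronized Set<String> failedNodes() {
        return new LinkedHashSet<>(failedNodes);
    }

    // One warn-level line, visible at default log levels, once all nodes are accounted for.
    private void maybeLogSummary() {
        if (pendingNodes.isEmpty() && !failedNodes.isEmpty()) {
            logger.warning("publishing cluster state failed to nodes " + failedNodes);
        }
    }

    public static void main(String[] args) {
        PublishTracker tracker = new PublishTracker(Arrays.asList("node-1", "node-2", "node-3"));
        tracker.onResponse("node-1");
        tracker.onFailure("node-2", new IllegalStateException("failed to deserialise"));
        tracker.onFailure("node-3", new IllegalStateException("connection reset"));
    }
}
```

Logging the summary only after the last node reports keeps the warning to a single line per publication rather than one line per failing node.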

@bleskes bleskes added >non-issue v7.0.0 v6.3.0 :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v6.4.0 labels Jun 10, 2018
@bleskes bleskes requested a review from ywelsch June 10, 2018 08:08
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Member

@jasontedor jasontedor left a comment

Looks good, one nit.

@@ -64,6 +67,8 @@ public void onResponse(DiscoveryNode node) {
    public void onFailure(DiscoveryNode node, Exception e) {
        boolean found = pendingNodes.remove(node);
        assert found : "node [" + node + "] already responded or failed";
        boolean added = failedNodes.add(node);
        assert added : "double failures for " + node;
Member

double -> duplicate

Contributor

@ywelsch ywelsch left a comment

One ask; looks good otherwise.

@@ -397,6 +403,9 @@ protected void handleIncomingClusterStateRequest(BytesTransportRequest request,
        } catch (IncompatibleClusterStateVersionException e) {
            incompatibleClusterStateDiffReceivedCount.incrementAndGet();
            throw e;
        } catch (Exception e) {
Contributor

This will also catch exceptions thrown when calling incomingClusterStateListener.onIncomingClusterState, which in turn calls ZenDiscovery.validateIncomingState. I think it's confusing to catch these here as well and erroneously report them as deserialization errors.
Maybe it would be better to reduce the scope of the catch clause...
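
A minimal sketch of the narrower scope suggested here, under stated assumptions: `IncomingStateHandler`, `deserialize`, the `version:<n>` payload format, and the failure counter are all hypothetical and exist only to make the example self-contained; the real handler works on `BytesTransportRequest` and a cluster state object. Only the decoding step sits inside the `try`, so an exception thrown by the listener (validation, for instance) is never reported as a deserialization error.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical, simplified stand-in for the receiving node's handler.
final class IncomingStateHandler {
    private static final Logger logger = Logger.getLogger(IncomingStateHandler.class.getName());

    // Counts warn-level deserialization reports so the behaviour is observable.
    static final AtomicInteger deserializationFailures = new AtomicInteger();

    // Hypothetical decoder: accepts payloads of the form "version:<n>".
    static long deserialize(String payload) {
        if (!payload.startsWith("version:")) {
            throw new IllegalArgumentException("malformed payload");
        }
        return Long.parseLong(payload.substring("version:".length()));
    }

    static void handle(String payload, Consumer<Long> listener) {
        final long state;
        try {
            // Only deserialization is covered by this catch clause.
            state = deserialize(payload);
        } catch (RuntimeException e) {
            deserializationFailures.incrementAndGet();
            logger.log(Level.WARNING, "unexpected error while deserializing an incoming cluster state", e);
            throw e;
        }
        // Listener (and any validation it performs) runs outside the try block,
        // so its exceptions propagate without the deserialization warning.
        listener.accept(state);
    }

    public static void main(String[] args) {
        handle("version:7", v -> System.out.println("applied state version " + v));
    }
}
```

With this shape, a malformed payload produces the warning and rethrows, while a listener that rejects a well-formed state simply throws its own exception.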

Contributor Author

That's a great comment. I missed that call. Let me see how I can shuffle things.

Member

Good catch @ywelsch.

Contributor Author

bleskes commented Jun 11, 2018

@ywelsch I updated the PR - can you take another look?

Contributor

@ywelsch ywelsch left a comment

LGTM

@bleskes bleskes added v6.3.1 and removed v6.3.0 labels Jun 13, 2018
@bleskes bleskes merged commit 8c9360b into elastic:master Jun 13, 2018
@bleskes bleskes deleted the log_warn_failed_deserialization branch June 13, 2018 11:22
bleskes added a commit that referenced this pull request Jun 13, 2018
)

If the publishing of a cluster state to a node fails, we currently only log it as debug information and 
only on the master. This makes it hard to see the cause of (test) failures when logging is set to 
default levels. This PR adds a warn level log on the node receiving the cluster state when it fails to 
deserialise the cluster state and a warn level log on the master with a list of nodes for which 
publication failed.
dnhatn added a commit that referenced this pull request Jun 14, 2018
* master:
  Remove RestGetAllAliasesAction (#31308)
  Temporary fix for broken build
  Reenable Checkstyle's unused import rule (#31270)
  Remove remaining unused imports before merging #31270
  Fix non-REST doc snippet
  [DOC] Extend SQL docs
  Immediately flush channel after writing to buffer (#31301)
  [DOCS] Shortens ML API intros
  Use quotes in the call invocation (#31249)
  move security ingest processors to a sub ingest directory (#31306)
  Add 5.6.11 version constant.
  Fix version detection.
  SQL: Whitelist SQL utility class for better scripting (#30681)
  [Docs] All Rollup docs experimental, agg limitations, clarify DeleteJob (#31299)
  CCS: don't proxy requests for already connected node (#31273)
  Mute ScriptedMetricAggregatorTests testSelfReferencingAggStateAfterMap
  [test] opensuse packaging turn up debug logging
  Add unreleased version 6.3.1
  Removes experimental tag from scripted_metric aggregation (#31298)
  [Rollup] Metric config parser must use builder so validation runs (#31159)
  [ML] Check licence when datafeeds use cross cluster search  (#31247)
  Add notion of internal index settings (#31286)
  Test: Remove broken yml test feature (#31255)
  REST hl client: cluster health to default to cluster level (#31268)
  [ML] Update test thresholds to account for changes to memory control (#31289)
  Log warnings when cluster state publication failed to some nodes (#31233)
  Fix AntFixture waiting condition (#31272)
  Ignore numeric shard count if waiting for ALL (#31265)
  [ML] Implement new rules design (#31110)
  index_prefixes back-compat should test 6.3 (#30951)
  Core: Remove plain execute method on TransportAction (#30998)
  Update checkstyle to 8.10.1 (#31269)
  Set analyzer version in PreBuiltAnalyzerProviderFactory (#31202)
  Modify pipelining handlers to require full requests (#31280)
  Revert upgrade to Netty 4.1.25.Final (#31282)
  Use armored input stream for reading public key (#31229)
  Fix Netty 4 Server Transport tests. Again.
  REST hl client: adjust wait_for_active_shards param in cluster health (#31266)
  REST high-level Client: remove deprecated API methods (#31200)
  [DOCS] Mark SQL feature as experimental
  [DOCS] Updates machine learning custom URL screenshots (#31222)
  Fix naming conventions check for XPackTestCase
  Fix security Netty 4 transport tests
  Fix race in clear scroll (#31259)
  [DOCS] Clarify audit index settings when remote indexing (#30923)
  Delete typos in SAML docs (#31199)
  REST high-level client: add Cluster Health API (#29331)
  [ML][TEST] Mute tests using rules (#31204)
  Support RequestedAuthnContext (#31238)
  SyncedFlushResponse to implement ToXContentObject (#31155)
  Add Get Aliases API to the high-level REST client (#28799)
  Remove some line length supressions (#31209)
  Validate xContentType in PutWatchRequest. (#31088)
  [INGEST] Interrupt the current thread if evaluation grok expressions take too long (#31024)
  Suppress extras FS on caching directory tests
  Revert "[DOCS] Added 6.3 info & updated the upgrade table. (#30940)"
  Revert "Fix snippets in upgrade docs"
  Fix snippets in upgrade docs
  [DOCS] Added 6.3 info & updated the upgrade table. (#30940)
  LLClient: Support host selection (#30523)
  Upgrade to Netty 4.1.25.Final (#31232)
  Enable custom credentials for core REST tests (#31235)
  Move ESIndexLevelReplicationTestCase to test framework (#31243)
  Encapsulate Translog in Engine (#31220)
  HLRest: Add get index templates API (#31161)
  Remove all unused imports and fix CRLF (#31207)
  [Tests] Fix self-referencing tests
  [TEST] Fix testRecoveryAfterPrimaryPromotion
  [Docs] Remove mention pattern files in Grok processor (#31170)
  Use stronger write-once semantics for Azure repository (#30437)
  Don't swallow exceptions on replication (#31179)
  Limit the number of concurrent requests per node (#31206)
  Call ensureNoSelfReferences() on _agg state variable after scripted metric agg script executions (#31044)
  Move java version checker back to its own jar (#30708)
  [test] add fix for rare virtualbox error (#31212)
dnhatn added a commit that referenced this pull request Jun 14, 2018
* 6.x:
  SQL: Fix build on Java 10
  [Tests] Mutualize fixtures code in BaseHttpFixture (#31210)
  [TEST] Fix RemoteClusterClientTests#testEnsureWeReconnect
  [ML] Update test thresholds to account for changes to memory control (#31289)
  Reenable Checkstyle's unused import rule (#31270)
  [ML] Check licence when datafeeds use cross cluster search  (#31247)
  Fix non-REST doc snippet
  [DOC] Extend SQL docs
  [DOCS] Shortens ML API intros
  Use quotes in the call invocation (#31249)
  move security ingest processors to a sub ingest directory (#31306)
  SQL: Whitelist SQL utility class for better scripting (#30681)
  Add 5.6.11 version constant.
  Fix version detection.
  [Docs] All Rollup docs experimental, agg limitations, clarify DeleteJob (#31299)
  Add missing release notes.
  Security: fix token bwc with pre 6.0.0-beta2 (#31254)
  Fix compilation error in UpdateSettingsIT (#31304)
  Test: Remove broken yml test feature (#31255)
  Add unreleased version 6.3.1
  [Rollup] Metric config parser must use builder so validation runs (#31159)
  Removes experimental tag from scripted_metric aggregation (#31298)
  [DOCS] Removes coming tag from 6.3.0 release notes
  6.3 release notes.
  Add notion of internal index settings (#31286)
  REST high-level client: add Cluster Health API (#29331)
  Remove leftover usage of deprecated client API
  SyncedFlushResponse to implement ToXContentObject (#31155)
  Add Get Aliases API to the high-level REST client (#28799)
  HLRest: Add get index templates API (#31161)
  Log warnings when cluster state publication failed to some nodes (#31233)
  Fix AntFixture waiting condition (#31272)
  [TEST] Mute RecoveryIT.testHistoryUUIDIsGenerated
  Ignore numeric shard count if waiting for ALL (#31265)
  Update checkstyle to 8.10.1 (#31269)
  Set analyzer version in PreBuiltAnalyzerProviderFactory (#31202)
  Revert upgrade to Netty 4.1.25.Final (#31282)
  Use armored input stream for reading public key (#31229)
  [DOCS] Added 'fail_on_unsupported_field' param to MLT. Closes #28008 (#31160)
  Fix Netty 4 Server Transport tests. Again.
  [DOCS] Fixed typo.
  [DOCS] Added release highlights for 6.3 (#31256)
  [DOCS] Mark SQL feature as experimental
  [DOCS] Updates machine learning custom URL screenshots (#31222)
  Fix naming conventions check for XPackTestCase
  Fix security Netty 4 transport tests
  Fix race in clear scroll (#31259)
  [DOCS] Clarify audit index settings when remote indexing (#30923)
  [ML][TEST] Mute tests using rules (#31204)
  Support RequestedAuthnContext (#31238)
  Validate xContentType in PutWatchRequest. (#31088)
  [INGEST] Interrupt the current thread if evaluation grok expressions take too long (#31024)
  Upgrade to Netty 4.1.25.Final (#31232)
  Suppress extras FS on caching directory tests
  Revert "[DOCS] Added 6.3 info & updated the upgrade table. (#30940)"
  Revert "Fix snippets in upgrade docs"
  Fix snippets in upgrade docs
  [DOCS] Added 6.3 info & updated the upgrade table. (#30940)
  Enable custom credentials for core REST tests (#31235)
  Move ESIndexLevelReplicationTestCase to test framework (#31243)
  Encapsulate Translog in Engine (#31220)
  [DOCS] Adds machine learning 6.3.0 release notes (#31217)
  Remove all unused imports and fix CRLF (#31207)
  [TEST] Fix testRecoveryAfterPrimaryPromotion
  [Docs] Remove mention pattern files in Grok processor (#31170)
  Use stronger write-once semantics for Azure repository (#30437)
  Don't swallow exceptions on replication (#31179)
  Compliant SAML Response destination check (#31175)
  Move java version checker back to its own jar (#30708)
  TEST:  Retry synced-flush if ongoing ops on primary (#30978)
  [test] add fix for rare virtualbox error (#31212)
Labels
:Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >non-issue v6.3.1 v6.4.0 v7.0.0-beta1

5 participants