
[CI] StackOverflowError when executing SnapshotDisruptionIT.testDisruptionOnSnapshotInitialization #28169

Closed
tlrx opened this issue Jan 10, 2018 · 3 comments
Assignees
Labels
:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI v7.0.0-beta1

Comments

@tlrx
Member

tlrx commented Jan 10, 2018

The test SnapshotDisruptionIT.testDisruptionOnSnapshotInitialization() failed on CI today on Windows Server 2012 R2 6.3 amd64/Oracle Corporation 1.8.0_92 (64-bit).

I first thought that it was a snapshot/restore regression due to my recent changes in #28078 or #27931 but after looking at the test execution log I'm not so sure.

I wonder if in this test the cluster ends up in a situation where a listener.onFailure() call caused a stack overflow error on a network thread (uncaught by Elasticsearch's usual UncaughtExceptionHandler), which in turn caused the NIO Selector to be closed and to stop listening to incoming requests.

The test starts 3 master-only nodes and 1 data-only node. Once the cluster is stable, it sets up a snapshot repository and creates a first snapshot to check that everything is working correctly. It then sets up a disruption scheme that is designed to isolate the master node as soon as a snapshot-in-progress entry in INIT state appears in the cluster state. The next step in the test is to create a second snapshot that triggers the disruption scheme, then wait for the cluster to elect a new master that terminates this second snapshot.

In this execution's logs, node_tm0, node_tm1 and node_tm2 are started as master-only nodes. node_tm1 is elected master and adds the data-only node node_td3 to the cluster.

The test runs correctly to the point where the second snapshot is created:

[2018-01-10T05:23:10,388][INFO ][o.e.d.SnapshotDisruptionIT] --> starting disruption
...
[2018-01-10T05:23:10,395][INFO ][o.e.s.SnapshotsService ] [node_tm1] snapshot [test-repo:test-snap-2/vyIDc0GFSIKVtGd--HP_hQ] started

Once the disruption starts, the master node node_tm1 is isolated and the other nodes think it left:

[2018-01-10T05:23:12,061][INFO ][o.e.t.d.TestZenDiscovery ] [node_tm0] master_left [{node_tm1}{05tuxEPVT5ypY7Mz6o_VRA}{p_hm4MVNRoOvl-n281Co2A}{127.0.0.1}{127.0.0.1:30101}], reason [failed to ping, tried [1] times, each with maximum [1s] timeout]
...
[2018-01-10T05:23:12,061][INFO ][o.e.t.d.TestZenDiscovery ] [node_tm2] master_left [{node_tm1}{05tuxEPVT5ypY7Mz6o_VRA}{p_hm4MVNRoOvl-n281Co2A}{127.0.0.1}{127.0.0.1:30101}], reason [failed to ping, tried [1] times, each with maximum [1s] timeout]
...
[2018-01-10T05:23:12,125][INFO ][o.e.t.d.TestZenDiscovery ] [node_td3] master_left [{node_tm1}{05tuxEPVT5ypY7Mz6o_VRA}{p_hm4MVNRoOvl-n281Co2A}{127.0.0.1}{127.0.0.1:30101}], reason [failed to ping, tried [1] times, each with maximum [1s] timeout]

So the remaining nodes elect node_tm0 as the new master node, which uses the last committed cluster state, version 25:

[2018-01-10T05:23:16,069][INFO ][o.e.c.s.ClusterApplierService] [node_tm0] new_master {node_tm0}{JvhDBfzERq6UD13DUGBUJg}{Grp3AjfZR4m4Wb4Re-p5MA}{127.0.0.1}{127.0.0.1:30100}, reason: apply cluster state (from master [master {node_tm0}{JvhDBfzERq6UD13DUGBUJg}{Grp3AjfZR4m4Wb4Re-p5MA}{127.0.0.1}{127.0.0.1:30100} committed version [25] source [zen-disco-elected-as-master ([1] nodes joined)[{node_tm2}{nPk7Z6bvRI-MufskXAhGTw}{-NZTnhd1SrSolB7FjkDwcg}{127.0.0.1}{127.0.0.1:30102}]]])

The new master node updates the cluster state to announce that it is now the master, but the publication of cluster state version 26 is not processed by the old master node, which is still isolated:

[2018-01-10T05:23:17,075][WARN ][o.e.d.z.PublishClusterStateAction] [node_tm0] timed out waiting for all nodes to process published state [26] (timeout [1s], pending nodes: [{node_tm1} ....)

So the new master node gives up, removes the old master node from the cluster state and publishes a new cluster state version 27 in which node_tm1 is removed:

[2018-01-10T05:23:17,089][INFO ][o.e.c.s.ClusterApplierService] [node_td3] removed {{node_tm1}[redacted], reason: apply cluster state (from master [master {node_tm0}[redacted] committed version [27]])

[2018-01-10T05:23:17,089][INFO ][o.e.c.s.ClusterApplierService] [node_tm2] removed {{node_tm1}[redacted], reason: apply cluster state (from master [master {node_tm0}[redacted] committed version [27]])

[2018-01-10T05:23:17,092][INFO ][o.e.c.s.ClusterApplierService] [node_tm0] removed {{node_tm1}[redacted], reason: apply cluster state (from master [master {node_tm0}[redacted] committed version [27] source [zen-disco-node-failed({node_tm1}

And the new master node cleans up the second snapshot as expected:

[2018-01-10T05:23:17,378][INFO ][o.e.s.SnapshotsService ] [node_tm0] snapshot [test-repo:test-snap-2/vyIDc0GFSIKVtGd--HP_hQ] completed with state [SUCCESS]
[2018-01-10T05:23:17,523][INFO ][o.e.s.SnapshotsService ] [node_tm0] snapshot [test-repo:test-snap-2/vyIDc0GFSIKVtGd--HP_hQ] deleted

So far the test behaves as intended, since its primary purpose is to verify that the snapshot is correctly terminated by the new master. Before the test ends, it stops the disruption and waits for the cluster to be stable again:

[2018-01-10T05:23:18,605][INFO ][o.e.d.SnapshotDisruptionIT] --> stopping disrupting

The old master node then detects the timeout from its attempt to publish cluster state version 25 (in which the second snapshot is STARTED in the cluster state):

[2018-01-10T05:23:40,406][WARN ][o.e.t.d.TestZenDiscovery ] [node_tm1] zen-disco-failed-to-publish,

[2018-01-10T05:23:40,407][WARN ][o.e.c.s.MasterService ] [node_tm1] failing [update_snapshot [test-repo:test-snap-2/vyIDc0GFSIKVtGd--HP_hQ]]: failed to commit cluster state version [25]

And the old master fails the snapshot locally, which is also expected.

[2018-01-10T05:23:40,408][WARN ][o.e.s.SnapshotsService ] [node_tm1] [test-snap-2/vyIDc0GFSIKVtGd--HP_hQ] failed to create snapshot

And the old master rejoins the cluster...

[2018-01-10T05:23:45,588][INFO ][o.e.c.s.MasterService ] [node_tm0] zen-disco-node-join[{node_tm1}

... and this is where things get blurry for me.

It seems that node_tm1 cannot ping the other nodes, and the ESSelector loop is closed:

[2018-01-10T05:23:45,792][WARN ][o.e.t.n.MockNioTransport ] [node_tm1] send message failed [channel: NioSocketChannel{localAddress=/127.0.0.1:30101, remoteAddress=/127.0.0.1:49349}]
1> java.nio.channels.ClosedChannelException: null
1> at org.elasticsearch.nio.NioSocketChannel.closeFromSelector(NioSocketChannel.java:53) ~[elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:72) ~[elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.nio.ESSelector.closePendingChannels(ESSelector.java:239) ~[elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.nio.ESSelector.cleanupAndCloseChannels(ESSelector.java:135) ~[elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.nio.ESSelector.runLoop(ESSelector.java:81) ~[elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]

Then all connection attempts from node_tm1 fail, so the cluster cannot recover to 4 nodes and the test suite times out. I think that all the errors after that are caused by the test framework trying to stop the nodes.

But what worries me is the StackOverflowError in the logs:

jan 10, 2018 5:23:45 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[elasticsearch[node_tm1][es_nio_transport_worker][T#2],5,TGRP-SnapshotDisruptionIT]
java.lang.StackOverflowError
at __randomizedtesting.SeedInfo.seed([F1A2A7B82C4D4B21]:0)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:167)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:164)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:167)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:164)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:167)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:164)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:167)

As well as the NPEs at the beginning of the test:

[2018-01-10T05:23:03,033][WARN ][o.e.d.z.UnicastZenPing ] [node_tm0] unexpected error while pinging
java.lang.NullPointerException: null
at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1320) ~[?:1.8.0_92]
at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169) ~[?:1.8.0_92]
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[?:1.8.0_92]
at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) ~[?:1.8.0_92]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_92]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_92]
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[?:1.8.0_92]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_92]
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) ~[?:1.8.0_92]
at org.elasticsearch.discovery.zen.UnicastZenPing.sendPings(UnicastZenPing.java:475) ~[main/:?]
at org.elasticsearch.discovery.zen.UnicastZenPing$1.doRun(UnicastZenPing.java:332) [main/:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:635) [main/:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_92]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_92]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]

@tbrooks8 @bleskes I'm having some trouble digging further into this failure, so I'd be happy to have your opinion / gut feeling on this. Just a quick look would be helpful at this stage.

@tlrx tlrx added :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI v7.0.0 labels Jan 10, 2018
@tlrx tlrx self-assigned this Jan 10, 2018
@ywelsch
Contributor

ywelsch commented Jan 12, 2018

@tlrx I think I know what happened. Even though the logs don't provide evidence for the whole scenario, what's in them makes the following quite plausible. The test deliberately causes a FailedToCommitClusterStateException in ZenDiscovery.publish(...):

  1> [2018-01-10T05:23:40,406][WARN ][o.e.t.d.TestZenDiscovery ] [node_tm1] zen-disco-failed-to-publish, current nodes: nodes: 
  1>    {node_tm1}{05tuxEPVT5ypY7Mz6o_VRA}{p_hm4MVNRoOvl-n281Co2A}{127.0.0.1}{127.0.0.1:30101}, local, master
  1>    {node_td3}{Pvyfq38HSLqL04-F3jFIMw}{hnOj7wrPQNyK8PobpXXwIg}{127.0.0.1}{127.0.0.1:30103}
  1>    {node_tm2}{nPk7Z6bvRI-MufskXAhGTw}{-NZTnhd1SrSolB7FjkDwcg}{127.0.0.1}{127.0.0.1:30102}
  1>    {node_tm0}{JvhDBfzERq6UD13DUGBUJg}{Grp3AjfZR4m4Wb4Re-p5MA}{127.0.0.1}{127.0.0.1:30100}

  1> [2018-01-10T05:23:40,407][WARN ][o.e.c.s.MasterService    ] [node_tm1] failing [update_snapshot [test-repo:test-snap-2/vyIDc0GFSIKVtGd--HP_hQ]]: failed to commit cluster state version [25]
  1> org.elasticsearch.discovery.Discovery$FailedToCommitClusterStateException: timed out while waiting for enough masters to ack sent cluster state. [1] left
  1> 	at org.elasticsearch.discovery.zen.PublishClusterStateAction$SendingController.waitForCommit(PublishClusterStateAction.java:532) ~[main/:?]
  1> 	at org.elasticsearch.discovery.zen.PublishClusterStateAction.innerPublish(PublishClusterStateAction.java:196) ~[main/:?]
  1> 	at org.elasticsearch.discovery.zen.PublishClusterStateAction.publish(PublishClusterStateAction.java:161) ~[main/:?]
  1> 	at org.elasticsearch.discovery.zen.ZenDiscovery.publish(ZenDiscovery.java:338) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:226) [main/:?]
  1> 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) [main/:?]

What happens in that case in ZenDiscovery.publish(...) is that the rejoin method is called. This in turn tells ClusterApplierService to submit an update task to apply the last cluster state known to ZenDiscovery (clusterApplier.onNewClusterState(...)), a cluster state in which the master is set to null. ZenDiscovery / MasterService does not wait for the result of that, but returns to the caller.

The test has meanwhile executed the command ensureStableCluster(4, masterNode1);, which sends a cluster health request to the node that just failed to commit. This is handled by TransportClusterHealthAction, a subclass of TransportMasterNodeAction, for which we see the stack overflow error. The request that was sent has request.waitForEvents() != null and request.local() == false, hence it is queued on the MasterService with the call clusterService.submitStateUpdateTask(...). The node is no longer master according to ZenDiscovery, hence MasterService will call onNoLongerMaster, which in turn calls doExecute on the same ClusterHealthAction.

Let's see what happens there. doExecute is implemented by TransportMasterNodeAction. The node gets the latest state from the ClusterApplierService to determine whether the current node is a master node or not. If the ClusterApplierService has not yet applied the cluster state that was submitted above by the rejoin method in ZenDiscovery, then the node still looks like a master. It will therefore wrap the current listener in a delegate and then call masterOperation with that delegate (on the same thread, as TransportClusterHealthAction.executor() == ThreadPool.Names.SAME). This in turn submits an update task on the MasterService, which calls onNoLongerMaster again on its master state update thread. The same cycle continues and continues. There is no stack overflow yet, as we're switching threads each time we submit an update task to the MasterService, even though each time we wrap the response listener once more.

At some point, ClusterApplierService successfully applies the cluster state carrying the information that the node is no longer master. TransportMasterNodeAction would then wait up to 30 seconds for a master to appear. Note that node_tm1 has seen and joined the new master after 5 seconds (at the time the stack overflow happens):

[2018-01-10T05:23:45,696][INFO ][o.e.c.s.ClusterApplierService] [node_tm1] detected_master {node_tm0}{JvhDBfzERq6UD13DUGBUJg}{Grp3AjfZR4m4Wb4Re-p5MA}{127.0.0.1}{127.0.0.1:30100}, reason: apply cluster state (from master [master {node_tm0}{JvhDBfzERq6UD13DUGBUJg}{Grp3AjfZR4m4Wb4Re-p5MA}{127.0.0.1}{127.0.0.1:30100} committed version [33]])

TransportMasterNodeAction reroutes the health request to the new master, where it gets a successful response (all nodes have rejoined the cluster). It then calls listener.onResponse on node_tm1 on a listener that has wrapped the original listener many times over, each wrapper calling listener.onResponse on the inner listener it wraps. This leads to the stack overflow.
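
To make the wrapping mechanism concrete, here is a minimal, self-contained Java sketch (not Elasticsearch code; the Listener interface and the retry method are simplified stand-ins for the delegate wrapping in TransportMasterNodeAction): every retry hops to another thread, so the stack stays shallow on the way in, but each hop adds one more wrapper around the listener, and the final successful response unwinds the whole chain on a single thread.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ListenerWrappingOverflow {

    interface Listener<T> {
        void onResponse(T response);
    }

    private static final ExecutorService EXECUTOR = Executors.newSingleThreadExecutor();

    // Stand-in for the retry cycle: the node still looks like master, so the action is
    // resubmitted on another thread with the listener wrapped one more time.
    static void retry(int remainingRetries, Listener<String> listener) {
        if (remainingRetries == 0) {
            // The request is finally rerouted to the real master and succeeds; completing
            // the deeply nested listener chain unwinds every wrapper on this one thread.
            listener.onResponse("cluster is green");
            return;
        }
        Listener<String> wrapped = response -> listener.onResponse(response); // one more wrapper
        EXECUTOR.execute(() -> retry(remainingRetries - 1, wrapped));         // thread hop: no stack growth going in
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            System.out.println(thread.getName() + " died with: " + error); // expect java.lang.StackOverflowError
            done.countDown();
        });
        // 100_000 wrappers is far beyond a typical default thread stack depth.
        retry(100_000, response -> {
            System.out.println("original listener completed: " + response); // never reached
            done.countDown();
        });
        done.await(30, TimeUnit.SECONDS);
        EXECUTOR.shutdownNow();
    }
}

The unwinding here has the same shape as the repeated TransportMasterNodeAction$AsyncSingleAction$1.onResponse frames in the trace above.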

Now to the question of what could have caused this, and why we're not seeing it more often. What's odd is that it took such a long time for ClusterApplierService to apply the cluster state that merely records that the node is no longer master. Unfortunately, the logs don't show when that cluster state was applied. The only indications are some GC logs showing all nodes in distress, and the log line above, which could indicate that the previous cluster state had only just finished being applied. Also note that when we apply a cluster state, we validate connectivity to the nodes in the cluster state, independently of whether that cluster state has a master or not (see nodeConnectionsService.connectToNodes(...) in ClusterApplierService). We do this blockingly in NodeConnectionsService, and this might have caused the cluster state to take a while to be applied.

What fixes should we make? The first thing we should change is for TransportClusterHealthAction not to call doExecute directly, but to leave the retry logic to its superclass TransportMasterNodeAction by calling listener.onFailure instead. We could also possibly skip validating connections when applying a cluster state that has no master (this would ensure that the cluster state is applied in a timely manner). Another option (I'm less excited by this one) would be to wait in ZenDiscovery.rejoin for the cluster state to be fully applied before returning control to the MasterService. I'll look further into those options and open PRs.
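
To illustrate the first option, here is a rough, self-contained Java sketch (simplified stand-ins only, not the actual Elasticsearch classes nor the code of any eventual PR): when the action finds it is no longer master, it fails its listener instead of re-entering its own execute path, and the outer retry logic re-dispatches the original request with the original listener, so the wrapper depth stays bounded.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RetryWithoutWrapping {

    interface Listener<T> {
        void onResponse(T response);
        void onFailure(Exception e);
    }

    // Simplified stand-in for a "no longer master" failure.
    static class NotMasterException extends Exception {
        NotMasterException(String message) { super(message); }
    }

    private static final ExecutorService EXECUTOR = Executors.newSingleThreadExecutor();

    // Pretend the node keeps looking master-less for a while before a master appears.
    private static volatile int attemptsBeforeMasterKnown = 100_000;

    // Stand-in for the health action: it reports "no longer master" as a failure and
    // lets the caller drive the retry instead of calling back into its own execute path.
    static void executeOnce(String request, Listener<String> listener) {
        if (attemptsBeforeMasterKnown-- > 0) {
            listener.onFailure(new NotMasterException("no longer master"));
        } else {
            listener.onResponse("cluster is green");
        }
    }

    // Stand-in for the superclass retry logic: on a master-related failure, schedule a new
    // attempt with the ORIGINAL listener, so the wrapper depth never exceeds one level.
    static void executeWithRetry(String request, Listener<String> listener) {
        executeOnce(request, new Listener<String>() {
            @Override
            public void onResponse(String response) {
                listener.onResponse(response);
            }

            @Override
            public void onFailure(Exception e) {
                if (e instanceof NotMasterException) {
                    EXECUTOR.execute(() -> executeWithRetry(request, listener)); // retry with the original listener
                } else {
                    listener.onFailure(e);
                }
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        executeWithRetry("cluster health", new Listener<String>() {
            @Override
            public void onResponse(String response) {
                System.out.println("completed without overflowing: " + response);
                done.countDown();
            }

            @Override
            public void onFailure(Exception e) {
                e.printStackTrace();
                done.countDown();
            }
        });
        done.await(30, TimeUnit.SECONDS);
        EXECUTOR.shutdownNow();
    }
}

The key difference from the failure mode above is that each retry attempt is driven from the outside with the original listener, so completing the request never has to unwind an unbounded chain of wrappers.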

@tlrx
Member Author

tlrx commented Jan 12, 2018

@ywelsch Thanks for the detailed comment. I understand the execution flow and I agree that it can explain the StackOverflowError.

I'll look further into those options and open PRs.

Please let me know if/how I can help. Also, feel free to reassign this issue to you if you want.

ywelsch added a commit that referenced this issue Jan 16, 2018

ClusterHealthAction does not use the regular retry logic, possibly causing StackOverflowErrors.

Relates #28169
ywelsch added a commit that referenced this issue Jan 16, 2018

ClusterHealthAction does not use the regular retry logic, possibly causing StackOverflowErrors.

Relates #28169
@ywelsch
Contributor

ywelsch commented Jan 26, 2018

I think we can close this. The StackOverflowError should be addressed by #28195.
