NetworkDisruptionIT testJobRelocation failing #35052

Closed
mayya-sharipova opened this issue Oct 29, 2018 · 11 comments
Labels
:Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. team-discuss >test-failure Triaged test failures from CI

@mayya-sharipova (Contributor) commented Oct 29, 2018

This test fails from time to time:

REPRODUCE WITH: ./gradlew :x-pack:plugin:ml:internalClusterTest \
  -Dtests.seed=8D4D9C772ADB301E \
  -Dtests.class=org.elasticsearch.xpack.ml.integration.NetworkDisruptionIT \
  -Dtests.method="testJobRelocation" \
  -Dtests.security.manager=true \
  -Dtests.locale=pt-PT \
  -Dtests.timezone=US/Michigan \
  -Dcompiler.java=10 \
  -Druntime.java=8

Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.5+multijob-unix-compatibility/os=debian/8/console
The failure does not reproduce for me locally.

It looks like there is a problem forming the cluster: a split brain occurred (two master nodes were elected, node_t3 and node_t0) and there were not enough master-eligible nodes for a quorum:

1> [2018-10-29T13:32:47,544][INFO ][o.e.c.s.MasterService    ] [node_t3] zen-disco-elected-as-master ([2] nodes joined)[{node_t4}{vdyH-Xp0S_eoZNx9KcMaAw}{XaKVGoOGSCympiARN1x8mw}{127.0.0.1}{127.0.0.1:42010}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {node_t2}{dh3pD3ONQ0yjTKGqY2lj1g}{HCp8GrhtQVmHoybwpNwO9w}{127.0.0.1}{127.0.0.1:39450}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason: new_master {node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, added {{node_t4}{vdyH-Xp0S_eoZNx9KcMaAw}{XaKVGoOGSCympiARN1x8mw}{127.0.0.1}{127.0.0.1:42010}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{node_t2}{dh3pD3ONQ0yjTKGqY2lj1g}{HCp8GrhtQVmHoybwpNwO9w}{127.0.0.1}{127.0.0.1:39450}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}
  1> [2018-10-29T13:32:47,547][INFO ][o.e.c.s.MasterService    ] [node_t0] zen-disco-node-join[{node_t1}{mRSV0xF-SC2Qy2NZz0J_5g}{1KEoP2T5Tze2Vtl3wU7WKg}{127.0.0.1}{127.0.0.1:59887}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason: added {{node_t1}{mRSV0xF-SC2Qy2NZz0J_5g}{1KEoP2T5Tze2Vtl3wU7WKg}{127.0.0.1}{127.0.0.1:59887}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}
  1> [2018-10-29T13:32:47,556][INFO ][o.e.c.s.ClusterApplierService] [node_t2] detected_master {node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, added {{node_t4}{vdyH-Xp0S_eoZNx9KcMaAw}{XaKVGoOGSCympiARN1x8mw}{127.0.0.1}{127.0.0.1:42010}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [1]])
  1> [2018-10-29T13:32:47,556][INFO ][o.e.c.s.ClusterApplierService] [node_t4] detected_master {node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, added {{node_t2}{dh3pD3ONQ0yjTKGqY2lj1g}{HCp8GrhtQVmHoybwpNwO9w}{127.0.0.1}{127.0.0.1:39450}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [1]])
  1> [2018-10-29T13:32:47,557][INFO ][o.e.c.s.ClusterApplierService] [node_t1] detected_master {node_t0}{RGD13uxVTcWUm62e6oUCmQ}{TG2rTEBARZyN4IEJq5KFuA}{127.0.0.1}{127.0.0.1:34213}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, added {{node_t0}{RGD13uxVTcWUm62e6oUCmQ}{TG2rTEBARZyN4IEJq5KFuA}{127.0.0.1}{127.0.0.1:34213}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {node_t0}{RGD13uxVTcWUm62e6oUCmQ}{TG2rTEBARZyN4IEJq5KFuA}{127.0.0.1}{127.0.0.1:34213}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [4]])
  1> [2018-10-29T13:32:47,566][INFO ][o.e.n.Node               ] [[test_TEST-CHILD_VM=[2]-CLUSTER_SEED=[4793419988162627123]-HASH=[6DBB78660FF]-cluster[T#4]]] started
  1> [2018-10-29T13:32:47,566][INFO ][o.e.n.Node               ] [[test_TEST-CHILD_VM=[2]-CLUSTER_SEED=[4793419988162627123]-HASH=[6DBB78660FF]-cluster[T#2]]] started
  1> [2018-10-29T13:32:47,568][INFO ][o.e.c.s.ClusterApplierService] [node_t3] new_master {node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, added {{node_t4}{vdyH-Xp0S_eoZNx9KcMaAw}{XaKVGoOGSCympiARN1x8mw}{127.0.0.1}{127.0.0.1:42010}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{node_t2}{dh3pD3ONQ0yjTKGqY2lj1g}{HCp8GrhtQVmHoybwpNwO9w}{127.0.0.1}{127.0.0.1:39450}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [1] source [zen-disco-elected-as-master ([2] nodes joined)[{node_t4}{vdyH-Xp0S_eoZNx9KcMaAw}{XaKVGoOGSCympiARN1x8mw}{127.0.0.1}{127.0.0.1:42010}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {node_t2}{dh3pD3ONQ0yjTKGqY2lj1g}{HCp8GrhtQVmHoybwpNwO9w}{127.0.0.1}{127.0.0.1:39450}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]]])
  1> [2018-10-29T13:32:47,570][INFO ][o.e.n.Node               ] [[test_TEST-CHILD_VM=[2]-CLUSTER_SEED=[4793419988162627123]-HASH=[6DBB78660FF]-cluster[T#3]]] started
  1> [2018-10-29T13:32:47,602][INFO ][o.e.l.LicenseService     ] [node_t1] license [a5fb9648-fdb2-46ae-8185-9ee424906bde] mode [trial] - valid
  1> [2018-10-29T13:32:47,603][INFO ][o.e.n.Node               ] [[test_TEST-CHILD_VM=[2]-CLUSTER_SEED=[4793419988162627123]-HASH=[6DBB78660FF]-cluster[T#1]]] started
  1> [2018-10-29T13:32:47,603][INFO ][o.e.c.s.ClusterApplierService] [node_t0] added {{node_t1}{mRSV0xF-SC2Qy2NZz0J_5g}{1KEoP2T5Tze2Vtl3wU7WKg}{127.0.0.1}{127.0.0.1:59887}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {node_t0}{RGD13uxVTcWUm62e6oUCmQ}{TG2rTEBARZyN4IEJq5KFuA}{127.0.0.1}{127.0.0.1:34213}{ml.machine_memory=63464030208, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [4] source [zen-disco-node-join[{node_t1}{mRSV0xF-SC2Qy2NZz0J_5g}{1KEoP2T5Tze2Vtl3wU7WKg}{127.0.0.1}{127.0.0.1:59887}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]]])
  1> [2018-10-29T13:32:47,605][WARN ][o.e.d.z.ElectMasterService] [node_t0] value for setting "discovery.zen.minimum_master_nodes" is too low. This can result in data loss! Please set it to at least a quorum of master-eligible nodes (current value: [1], total number of master-eligible nodes used for publishing in this round: [2])
mayya-sharipova added the >test-failure (Triaged test failures from CI) and :ml (Machine learning) labels on Oct 29, 2018
@elasticmachine (Collaborator)

Pinging @elastic/ml-core

@droberts195 (Contributor)

It looks like the bug in the test is that it needs to set discovery.zen.minimum_master_nodes to 3 because it creates a 5 node cluster.
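
For anyone following along, this is roughly what that suggestion could look like in code. It is only an illustrative sketch: the class name is hypothetical, the override follows the usual ESIntegTestCase pattern rather than the actual NetworkDisruptionIT source, and (as the next comment explains) pinning the setting like this is not sufficient on its own given how the cluster is grown.

```java
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.test.ESIntegTestCase;

// Sketch only: hypothetical test class, not the actual NetworkDisruptionIT code.
public class NetworkDisruptionSketchIT extends ESIntegTestCase {
    @Override
    protected Settings nodeSettings(int nodeOrdinal) {
        return Settings.builder()
            .put(super.nodeSettings(nodeOrdinal))
            // quorum for a 5-node cluster: floor(5 / 2) + 1 = 3
            .put("discovery.zen.minimum_master_nodes", 3)
            .build();
    }
}
```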

DaveCTurner added the :Distributed/Cluster Coordination label and removed the :ml (Machine learning) label on Oct 29, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

@DaveCTurner (Contributor)

It's a little subtle. Here we started the first node with discovery.zen.minimum_master_nodes set to 1 and then added four more nodes. Since autoManageMinMasterNodes is true, the four new nodes do have discovery.zen.minimum_master_nodes set to 3, but unfortunately this isn't sufficient: three of the four new nodes band together and form their own cluster, independent of the first node, which is itself already an elected master.

What puzzles me is that we allow 3 seconds for all the nodes to find each other, which should be plenty of time to discover the existing master:

  1> [2018-10-29T13:32:44,510][INFO ][o.e.t.t.MockTransportService] [[test_TEST-CHILD_VM=[2]-CLUSTER_SEED=[4793419988162627123]-HASH=[6DBB78660FF]-cluster[T#2]]] publish_address {127.0.0.1:39450}, bound_addresses {[::1]:59095}, {127
  1> [2018-10-29T13:32:47,544][INFO ][o.e.c.s.MasterService    ] [node_t3] zen-disco-elected-as-master ([2] nodes joined)[{node_t4}{vdyH-Xp0S_eoZNx9KcMaAw}{XaKVGoOGSCympiARN1x8mw}{127.0.0.1}{127.0.0.1:42010}{ml.machine_memory=634640

I will try to reproduce this with more logging. It's not ML-specific, although the form of this test does make it more likely to occur here: if we said we wanted 5 nodes up-front using the @ESIntegTestCase.ClusterScope() annotation then they'd all start up correctly configured.
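
For reference, a sketch of what that up-front configuration could look like. The class name is hypothetical and the attribute names are the ClusterScope ones as I remember them, so treat this as an illustration rather than a drop-in change to the test:

```java
import org.elasticsearch.test.ESIntegTestCase;
import org.elasticsearch.test.ESIntegTestCase.ClusterScope;
import org.elasticsearch.test.ESIntegTestCase.Scope;

// Sketch: ask the test framework for all 5 nodes up front, so every node starts
// with a consistently auto-managed discovery.zen.minimum_master_nodes.
@ClusterScope(scope = Scope.TEST, numDataNodes = 5)
public class NetworkDisruptionSketchIT extends ESIntegTestCase {
    // test methods unchanged
}
```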

@DaveCTurner (Contributor) commented Oct 30, 2018

I managed to reproduce a similar failure (after ≥500 iterations) without extra logging. With extra logging enabled it has been running overnight and consistently passing ☹️. I was a little concerned that perhaps we weren't discovering the correct set of nodes, but I see no indication of this in the logs (of the passing runs) and the code looks to be doing the right sort of thing.

I have pushed f7760dd to enable more logging in CI, which will tell us what really happens in the next failure (or fix the problem 😉).

It could just be that things were going a bit slowly on the CI worker. This is a failure mode that can occur when growing a cluster too quickly if the timing is sufficiently bad. We could fix this by growing the cluster more slowly (e.g. one node at a time, as sketched below), or else by setting minimum_master_nodes to 5 on the new nodes so that they can't form a cluster without the existing master. In the long run this will be fixed by #32006.
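
To make the "grow more slowly" option concrete, here is a rough sketch of what a test method could do, assuming the usual ESIntegTestCase helpers (internalCluster().startNode() and ensureStableCluster()). The method name and comments are illustrative, not a patch:

```java
public void testJobRelocationSketch() throws Exception {
    // Grow the cluster one node at a time, waiting for each new node to join
    // the existing cluster before starting the next one, so the joiners can
    // never reach a quorum among themselves without the elected master.
    for (int expectedSize = 2; expectedSize <= 5; expectedSize++) {
        internalCluster().startNode();
        ensureStableCluster(expectedSize); // all nodes must agree on one master
    }
    // ... rest of the test ...
}
```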

DaveCTurner added a commit that referenced this issue Oct 30, 2018
Sometimes the cluster forming here will split-brain when it grows up to 5
nodes. This could be a timing issue or could be something going wrong in
discovery, so this asks for more logs. Relates #35052
@droberts195 (Contributor)

Thanks for investigating this @DaveCTurner.

if we said we wanted 5 nodes up-front using the @ESIntegTestCase.ClusterScope() annotation then they'd all start up correctly configured

There's no particular reason why this test creates a one-node cluster and then adds 4 more nodes to it. It could still test what it's supposed to test by immediately creating a 5-node cluster - the way it creates the cluster now is just down to ignorance of the subtleties of the different ways of creating it. I guess this is extremely unlikely to happen for a real customer, because they wouldn't be running 5 nodes on the same VM and the timeout for the nodes to find each other is 30 seconds. So if you get fed up with investigating, I'm very happy to just switch to @ESIntegTestCase.ClusterScope() to create the cluster.

@DaveCTurner (Contributor)

I think it'd be good to give this a few weeks to fail again with more logging, just so we can be sure that we're not doing something incorrect in how discovery interacts with ESIntegTestCase when clusters grow, as there have been some changes in this area recently. I'm quite happy to track that in a separate test case if you'd prefer.

By the way, I think what the test is doing here is legitimate: this looks like a bug in the framework, and the suggestion to use the annotation is merely a workaround.

@droberts195 (Contributor)

Thanks @DaveCTurner. There's no need to create a separate test case. I'm happy to use this ML one for this purpose.

DaveCTurner self-assigned this on Oct 30, 2018
@DaveCTurner (Contributor)

I forgot that I left my CI running this job yesterday, but was just notified that iteration 10337 failed again with trace logging enabled, and the failure now makes sense despite the 3-second pinging delay. It is a consequence of how today's discovery implementation does not gossip symmetrically: it only shares pings that it has received, not their responses. In this case, only one node, node_t1, sees any seed addresses, presumably because the others start pinging before the discovery file has been written:

  1> [2018-10-31T15:55:34,510][DEBUG][o.e.d.z.FileBasedUnicastHostsProvider] [node_t0] seed addresses: []
  1> [2018-10-31T15:55:37,689][DEBUG][o.e.d.z.FileBasedUnicastHostsProvider] [node_t3] seed addresses: []
  1> [2018-10-31T15:55:37,691][DEBUG][o.e.d.z.FileBasedUnicastHostsProvider] [node_t4] seed addresses: []
  1> [2018-10-31T15:55:37,691][DEBUG][o.e.d.z.FileBasedUnicastHostsProvider] [node_t2] seed addresses: []
  1> [2018-10-31T15:55:37,697][DEBUG][o.e.d.z.FileBasedUnicastHostsProvider] [node_t1] seed addresses: [127.0.0.1:36535, 127.0.0.1:44463, 127.0.0.1:40167, 127.0.0.1:46343]

Once a pinging round has started we do not check the unicast hosts provider again. node_t1 pings all the nodes, and initially none of the other nodes can do anything. node_t0 is already a master, so it responds but does not ping node_t1 back, whereas all the other nodes do ping node_t1 back. This means that node_t1 accumulates pings from every node except node_t0 and broadcasts these to all the other nodes in its subsequent pings, so that node_t{1,2,3,4} all come to know about each other but only node_t1 knows about node_t0.

After the pinging round finishes, node_t1 joins node_t0 (which is already a master), but node_t{2,3,4} think there is no master and decide that node_t3 is the best choice, splitting the cluster in two.
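
To make the asymmetry easier to follow, here is a deliberately simplified toy model of the knowledge propagation described above. It is not the real discovery code, only an illustration of why node_t0's existence never reaches node_t{2,3,4}:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model: a node gossips only the pings it has *received*, never the
// responses to the pings it sent. This is not the real discovery code.
public class PingGossipToy {
    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node_t0", "node_t1", "node_t2", "node_t3", "node_t4");
        String existingMaster = "node_t0"; // already elected: responds, but never pings back
        String nodeWithSeeds = "node_t1";  // the only node that saw any seed addresses

        // Pings received by each node -- the only knowledge a node passes on.
        Map<String, Set<String>> receivedPings = new HashMap<>();
        nodes.forEach(n -> receivedPings.put(n, new HashSet<>()));

        // Round 1: node_t1 pings every address it knows about. The elected
        // master merely responds; the other nodes ping node_t1 back.
        for (String target : nodes) {
            if (target.equals(nodeWithSeeds)) continue;
            receivedPings.get(target).add(nodeWithSeeds);
            if (target.equals(existingMaster) == false) {
                receivedPings.get(nodeWithSeeds).add(target);
            }
        }

        // Round 2: node_t1 gossips the pings it has received (t2, t3 and t4, but
        // not t0, which never pinged it) to every node it knows about.
        for (String target : nodes) {
            if (target.equals(nodeWithSeeds)) continue;
            Set<String> gossip = new HashSet<>(receivedPings.get(nodeWithSeeds));
            gossip.remove(target); // a node is not interested in a ping from itself
            receivedPings.get(target).addAll(gossip);
        }

        // node_t{2,3,4} now know about each other and node_t1, but not node_t0,
        // so with minimum_master_nodes=3 they can elect a master of their own.
        nodes.forEach(n -> System.out.println(n + " saw pings from " + new TreeSet<>(receivedPings.get(n))));
    }
}
```

Running this prints the pings each node has seen: node_t0 appears in nobody's set, while node_t1 through node_t4 all see each other, which is exactly the split described above.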

DaveCTurner removed their assignment on Oct 31, 2018
DaveCTurner added a commit that referenced this issue Oct 31, 2018
Today when ESIntegTestCase starts some nodes it writes out the unicast hosts
files each time a node starts its transport service. This does mean that a
number of nodes can start and perform their first pinging round without any
unicast hosts which, if the timing is unlucky and a lot of nodes are all
started at the same time, can lead to a split brain as in #35052.

Prior to #33554 this was unlikely to happen since the MockUncasedHostsProvider
would always have yielded the existing hosts, so the timing would have to have
been implausibly unlucky. Since #33554, however, it's more likely because the
race occurs between the start of the first round of pinging and the writing of
the unicast hosts file. It is realistic that new nodes will be configured with
the existing nodes from startup, so this change reinstates that behaviour.

Closes #35052.
@jdconrad (Contributor) commented Dec 6, 2018

@DaveCTurner This looks like the same issue occurring back in 5.6 (https://internal-ci.elastic.co/job/elastic+x-pack-elasticsearch+5.6+multijob-windows-compatibility/508/console)? I assume the fix won't be backported that far, but I wanted to confirm.

@DaveCTurner (Contributor)

@jdconrad thanks for the ping. That failure is not the same issue, although 5.6 does look susceptible to this specific issue, and you are right that the fix is not backported that far (indeed, #35136 wouldn't work in 5.6). I opened #36732 for the failure that you linked.
