NetworkDisruptionIT testJobRelocation failing #35052

Closed
mayya-sharipova opened this issue Oct 29, 2018 · 11 comments
Labels
:Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. team-discuss >test-failure Triaged test failures from CI

@mayya-sharipova (Contributor) commented Oct 29, 2018

This test fails from time to time:

REPRODUCE WITH: ./gradlew :x-pack:plugin:ml:internalClusterTest \
  -Dtests.seed=8D4D9C772ADB301E \
  -Dtests.class=org.elasticsearch.xpack.ml.integration.NetworkDisruptionIT \
  -Dtests.method="testJobRelocation" \
  -Dtests.security.manager=true \
  -Dtests.locale=pt-PT \
  -Dtests.timezone=US/Michigan \
  -Dcompiler.java=10 \
  -Druntime.java=8

Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.5+multijob-unix-compatibility/os=debian/8/console
The failure does not reproduce for me locally.

It looks like there is a problem forming the cluster: a split brain occurred (two master nodes were elected, node_t3 and node_t0) and there were not enough master-eligible nodes for a quorum:

1> [2018-10-29T13:32:47,544][INFO ][o.e.c.s.MasterService    ] [node_t3] zen-disco-elected-as-master ([2] nodes joined)[{node_t4}{vdyH-Xp0S_eoZNx9KcMaAw}{XaKVGoOGSCympiARN1x8mw}{127.0.0.1}{127.0.0.1:42010}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {node_t2}{dh3pD3ONQ0yjTKGqY2lj1g}{HCp8GrhtQVmHoybwpNwO9w}{127.0.0.1}{127.0.0.1:39450}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason: new_master {node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, added {{node_t4}{vdyH-Xp0S_eoZNx9KcMaAw}{XaKVGoOGSCympiARN1x8mw}{127.0.0.1}{127.0.0.1:42010}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{node_t2}{dh3pD3ONQ0yjTKGqY2lj1g}{HCp8GrhtQVmHoybwpNwO9w}{127.0.0.1}{127.0.0.1:39450}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}
  1> [2018-10-29T13:32:47,547][INFO ][o.e.c.s.MasterService    ] [node_t0] zen-disco-node-join[{node_t1}{mRSV0xF-SC2Qy2NZz0J_5g}{1KEoP2T5Tze2Vtl3wU7WKg}{127.0.0.1}{127.0.0.1:59887}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason: added {{node_t1}{mRSV0xF-SC2Qy2NZz0J_5g}{1KEoP2T5Tze2Vtl3wU7WKg}{127.0.0.1}{127.0.0.1:59887}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}
  1> [2018-10-29T13:32:47,556][INFO ][o.e.c.s.ClusterApplierService] [node_t2] detected_master {node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, added {{node_t4}{vdyH-Xp0S_eoZNx9KcMaAw}{XaKVGoOGSCympiARN1x8mw}{127.0.0.1}{127.0.0.1:42010}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [1]])
  1> [2018-10-29T13:32:47,556][INFO ][o.e.c.s.ClusterApplierService] [node_t4] detected_master {node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, added {{node_t2}{dh3pD3ONQ0yjTKGqY2lj1g}{HCp8GrhtQVmHoybwpNwO9w}{127.0.0.1}{127.0.0.1:39450}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [1]])
  1> [2018-10-29T13:32:47,557][INFO ][o.e.c.s.ClusterApplierService] [node_t1] detected_master {node_t0}{RGD13uxVTcWUm62e6oUCmQ}{TG2rTEBARZyN4IEJq5KFuA}{127.0.0.1}{127.0.0.1:34213}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, added {{node_t0}{RGD13uxVTcWUm62e6oUCmQ}{TG2rTEBARZyN4IEJq5KFuA}{127.0.0.1}{127.0.0.1:34213}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {node_t0}{RGD13uxVTcWUm62e6oUCmQ}{TG2rTEBARZyN4IEJq5KFuA}{127.0.0.1}{127.0.0.1:34213}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [4]])
  1> [2018-10-29T13:32:47,566][INFO ][o.e.n.Node               ] [[test_TEST-CHILD_VM=[2]-CLUSTER_SEED=[4793419988162627123]-HASH=[6DBB78660FF]-cluster[T#4]]] started
  1> [2018-10-29T13:32:47,566][INFO ][o.e.n.Node               ] [[test_TEST-CHILD_VM=[2]-CLUSTER_SEED=[4793419988162627123]-HASH=[6DBB78660FF]-cluster[T#2]]] started
  1> [2018-10-29T13:32:47,568][INFO ][o.e.c.s.ClusterApplierService] [node_t3] new_master {node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, added {{node_t4}{vdyH-Xp0S_eoZNx9KcMaAw}{XaKVGoOGSCympiARN1x8mw}{127.0.0.1}{127.0.0.1:42010}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{node_t2}{dh3pD3ONQ0yjTKGqY2lj1g}{HCp8GrhtQVmHoybwpNwO9w}{127.0.0.1}{127.0.0.1:39450}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {node_t3}{UoshLV9fRWezwvy7lDcAPw}{GX8SBaUzQESrj_ewmZflRg}{127.0.0.1}{127.0.0.1:52930}{ml.machine_memory=63464030208, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [1] source [zen-disco-elected-as-master ([2] nodes joined)[{node_t4}{vdyH-Xp0S_eoZNx9KcMaAw}{XaKVGoOGSCympiARN1x8mw}{127.0.0.1}{127.0.0.1:42010}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, {node_t2}{dh3pD3ONQ0yjTKGqY2lj1g}{HCp8GrhtQVmHoybwpNwO9w}{127.0.0.1}{127.0.0.1:39450}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]]])
  1> [2018-10-29T13:32:47,570][INFO ][o.e.n.Node               ] [[test_TEST-CHILD_VM=[2]-CLUSTER_SEED=[4793419988162627123]-HASH=[6DBB78660FF]-cluster[T#3]]] started
  1> [2018-10-29T13:32:47,602][INFO ][o.e.l.LicenseService     ] [node_t1] license [a5fb9648-fdb2-46ae-8185-9ee424906bde] mode [trial] - valid
  1> [2018-10-29T13:32:47,603][INFO ][o.e.n.Node               ] [[test_TEST-CHILD_VM=[2]-CLUSTER_SEED=[4793419988162627123]-HASH=[6DBB78660FF]-cluster[T#1]]] started
  1> [2018-10-29T13:32:47,603][INFO ][o.e.c.s.ClusterApplierService] [node_t0] added {{node_t1}{mRSV0xF-SC2Qy2NZz0J_5g}{1KEoP2T5Tze2Vtl3wU7WKg}{127.0.0.1}{127.0.0.1:59887}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {node_t0}{RGD13uxVTcWUm62e6oUCmQ}{TG2rTEBARZyN4IEJq5KFuA}{127.0.0.1}{127.0.0.1:34213}{ml.machine_memory=63464030208, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [4] source [zen-disco-node-join[{node_t1}{mRSV0xF-SC2Qy2NZz0J_5g}{1KEoP2T5Tze2Vtl3wU7WKg}{127.0.0.1}{127.0.0.1:59887}{ml.machine_memory=63464030208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]]])
  1> [2018-10-29T13:32:47,605][WARN ][o.e.d.z.ElectMasterService] [node_t0] value for setting "discovery.zen.minimum_master_nodes" is too low. This can result in data loss! Please set it to at least a quorum of master-eligible nodes (current value: [1], total number of master-eligible nodes used for publishing in this round: [2])
mayya-sharipova added the >test-failure (Triaged test failures from CI) and :ml (Machine learning) labels on Oct 29, 2018
@elasticmachine (Collaborator)

Pinging @elastic/ml-core

@droberts195 (Contributor)

It looks like the bug in the test is that it needs to set discovery.zen.minimum_master_nodes to 3 because it creates a 5 node cluster.
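
For anyone following along, this is roughly what that suggestion could look like in code. It is only an illustrative sketch: the class name is hypothetical, the override follows the usual ESIntegTestCase pattern rather than the actual NetworkDisruptionIT source, and (as the next comment explains) pinning the setting like this is not sufficient on its own given how the cluster is grown.

```java
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.test.ESIntegTestCase;

// Sketch only: hypothetical test class, not the actual NetworkDisruptionIT code.
public class NetworkDisruptionSketchIT extends ESIntegTestCase {
    @Override
    protected Settings nodeSettings(int nodeOrdinal) {
        return Settings.builder()
            .put(super.nodeSettings(nodeOrdinal))
            // quorum for a 5-node cluster: floor(5 / 2) + 1 = 3
            .put("discovery.zen.minimum_master_nodes", 3)
            .build();
    }
}
```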

DaveCTurner added the :Distributed/Cluster Coordination label and removed the :ml (Machine learning) label on Oct 29, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

@DaveCTurner (Contributor)

It's a little subtle. Here we started the first node with discovery.zen.minimum_master_nodes set to 1 and then added four more nodes. Since autoManageMinMasterNodes is true, the four new nodes do have discovery.zen.minimum_master_nodes set to 3, but unfortunately this isn't sufficient: three of the four new nodes band together and form their own cluster, independent of the first node, which is itself already an elected master.

What puzzles me is that we allow 3 seconds for all the nodes to find each other, which should be plenty of time to discover the existing master:

  1> [2018-10-29T13:32:44,510][INFO ][o.e.t.t.MockTransportService] [[test_TEST-CHILD_VM=[2]-CLUSTER_SEED=[4793419988162627123]-HASH=[6DBB78660FF]-cluster[T#2]]] publish_address {127.0.0.1:39450}, bound_addresses {[::1]:59095}, {127
  1> [2018-10-29T13:32:47,544][INFO ][o.e.c.s.MasterService    ] [node_t3] zen-disco-elected-as-master ([2] nodes joined)[{node_t4}{vdyH-Xp0S_eoZNx9KcMaAw}{XaKVGoOGSCympiARN1x8mw}{127.0.0.1}{127.0.0.1:42010}{ml.machine_memory=634640

I will try to reproduce this with more logging. It's not ML-specific, although the form of this test does make it more likely to occur here: if we said we wanted 5 nodes up-front using the @ESIntegTestCase.ClusterScope() annotation then they'd all start up correctly configured.
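
For reference, a sketch of what that up-front configuration could look like. The class name is hypothetical and the attribute names are the ClusterScope ones as I remember them, so treat this as an illustration rather than a drop-in change to the test:

```java
import org.elasticsearch.test.ESIntegTestCase;
import org.elasticsearch.test.ESIntegTestCase.ClusterScope;
import org.elasticsearch.test.ESIntegTestCase.Scope;

// Sketch: ask the test framework for all 5 nodes up front, so every node starts
// with a consistently auto-managed discovery.zen.minimum_master_nodes.
@ClusterScope(scope = Scope.TEST, numDataNodes = 5)
public class NetworkDisruptionSketchIT extends ESIntegTestCase {
    // test methods unchanged
}
```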

@DaveCTurner (Contributor) commented Oct 30, 2018

I managed to reproduce a similar failure (after ≥500 iterations) without extra logging. With extra logging enabled it has been running overnight and consistently passing ☹️. I was a little concerned that perhaps we weren't discovering the correct set of nodes, but I see no indication of this in the logs (of the passing runs) and the code looks to be doing the right sort of thing.

I have pushed f7760dd to enable more logging in CI, which will tell us what really happens in the next failure (or fix the problem 😉).

It could just be that things were going a bit slowly on the CI worker. This is a failure mode that can occur when growing a cluster too quickly if the timing is sufficiently bad. We could fix this by growing the cluster more slowly (e.g. one node at a time, as sketched below), or else by setting minimum_master_nodes to 5 on the new nodes so that they can't form a cluster without the existing master. In the long run this will be fixed by #32006.
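
To make the "grow more slowly" option concrete, here is a rough sketch of what a test method could do, assuming the usual ESIntegTestCase helpers (internalCluster().startNode() and ensureStableCluster()). The method name and comments are illustrative, not a patch:

```java
public void testJobRelocationSketch() throws Exception {
    // Grow the cluster one node at a time, waiting for each new node to join
    // the existing cluster before starting the next one, so the joiners can
    // never reach a quorum among themselves without the elected master.
    for (int expectedSize = 2; expectedSize <= 5; expectedSize++) {
        internalCluster().startNode();
        ensureStableCluster(expectedSize); // all nodes must agree on one master
    }
    // ... rest of the test ...
}
```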

DaveCTurner added a commit that referenced this issue Oct 30, 2018
Sometimes the cluster forming here will split-brain when it grows up to 5
nodes. This could be a timing issue or could be something going wrong in
discovery, so this asks for more logs. Relates #35052
@droberts195 (Contributor)

Thanks for investigating this @DaveCTurner.

if we said we wanted 5 nodes up-front using the @ESIntegTestCase.ClusterScope() annotation then they'd all start up correctly configured

There's no particular reason why this test creates a one-node cluster and then adds 4 more nodes to it. It could still test what it's supposed to test by immediately creating a 5-node cluster - the way it creates the cluster now is just down to ignorance of the subtleties of the different ways of creating it. I guess this is extremely unlikely to happen for a real customer, because they wouldn't be running 5 nodes on the same VM and the timeout for the nodes to find each other is 30 seconds. So if you get fed up with investigating, I'm very happy to just switch to @ESIntegTestCase.ClusterScope() to create the cluster.

@DaveCTurner (Contributor)

I think it'd be good to give this a few weeks to fail again with more logging, just so we can be sure that we're not doing something incorrect in how discovery interacts with ESIntegTestCase when clusters grow, as there have been some changes in this area recently. I'm quite happy to track that in a separate test case if you'd prefer.

By the way, I think what the test is doing here is legitimate: this looks like a bug in the framework, and the suggestion to use the annotation is merely a workaround.

@droberts195 (Contributor)

Thanks @DaveCTurner. There's no need to create a separate test case. I'm happy to use this ML one for this purpose.

DaveCTurner self-assigned this on Oct 30, 2018
@DaveCTurner (Contributor)

I forgot that I left my CI running this job yesterday, but was just notified that iteration 10337 failed again with trace logging enabled, and the failure now makes sense despite the 3-second pinging delay. It is a consequence of how today's discovery implementation does not gossip symmetrically: it only shares pings that it has received, not their responses. In this case, only one node, node_t1, sees any seed addresses, presumably because the others start pinging before the discovery file has been written:

  1> [2018-10-31T15:55:34,510][DEBUG][o.e.d.z.FileBasedUnicastHostsProvider] [node_t0] seed addresses: []
  1> [2018-10-31T15:55:37,689][DEBUG][o.e.d.z.FileBasedUnicastHostsProvider] [node_t3] seed addresses: []
  1> [2018-10-31T15:55:37,691][DEBUG][o.e.d.z.FileBasedUnicastHostsProvider] [node_t4] seed addresses: []
  1> [2018-10-31T15:55:37,691][DEBUG][o.e.d.z.FileBasedUnicastHostsProvider] [node_t2] seed addresses: []
  1> [2018-10-31T15:55:37,697][DEBUG][o.e.d.z.FileBasedUnicastHostsProvider] [node_t1] seed addresses: [127.0.0.1:36535, 127.0.0.1:44463, 127.0.0.1:40167, 127.0.0.1:46343]

Once a pinging round has started we do not check the unicast hosts provider again. node_t1 pings all the nodes, and initially none of the other nodes can do anything. node_t0 is already a master, so it responds but does not ping node_t1 back, whereas all the other nodes do ping node_t1 back. This means that node_t1 accumulates pings from every node except node_t0 and broadcasts these to all the other nodes in its subsequent pings, so that node_t{1,2,3,4} all come to know about each other but only node_t1 knows about node_t0.

After the pinging round finishes, node_t1 joins node_t0 (which is already a master), but node_t{2,3,4} think there is no master and decide that node_t3 is the best choice, splitting the cluster in two.
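
To make the asymmetry easier to follow, here is a deliberately simplified toy model of the knowledge propagation described above. It is not the real discovery code, only an illustration of why node_t0's existence never reaches node_t{2,3,4}:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model: a node gossips only the pings it has *received*, never the
// responses to the pings it sent. This is not the real discovery code.
public class PingGossipToy {
    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node_t0", "node_t1", "node_t2", "node_t3", "node_t4");
        String existingMaster = "node_t0"; // already elected: responds, but never pings back
        String nodeWithSeeds = "node_t1";  // the only node that saw any seed addresses

        // Pings received by each node -- the only knowledge a node passes on.
        Map<String, Set<String>> receivedPings = new HashMap<>();
        nodes.forEach(n -> receivedPings.put(n, new HashSet<>()));

        // Round 1: node_t1 pings every address it knows about. The elected
        // master merely responds; the other nodes ping node_t1 back.
        for (String target : nodes) {
            if (target.equals(nodeWithSeeds)) continue;
            receivedPings.get(target).add(nodeWithSeeds);
            if (target.equals(existingMaster) == false) {
                receivedPings.get(nodeWithSeeds).add(target);
            }
        }

        // Round 2: node_t1 gossips the pings it has received (t2, t3 and t4, but
        // not t0, which never pinged it) to every node it knows about.
        for (String target : nodes) {
            if (target.equals(nodeWithSeeds)) continue;
            Set<String> gossip = new HashSet<>(receivedPings.get(nodeWithSeeds));
            gossip.remove(target); // a node is not interested in a ping from itself
            receivedPings.get(target).addAll(gossip);
        }

        // node_t{2,3,4} now know about each other and node_t1, but not node_t0,
        // so with minimum_master_nodes=3 they can elect a master of their own.
        nodes.forEach(n -> System.out.println(n + " saw pings from " + new TreeSet<>(receivedPings.get(n))));
    }
}
```

Running this prints the pings each node has seen: node_t0 appears in nobody's set, while node_t1 through node_t4 all see each other, which is exactly the split described above.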

DaveCTurner removed their assignment on Oct 31, 2018
DaveCTurner added a commit that referenced this issue Oct 31, 2018
Today when ESIntegTestCase starts some nodes it writes out the unicast hosts
files each time a node starts its transport service. This does mean that a
number of nodes can start and perform their first pinging round without any
unicast hosts which, if the timing is unlucky and a lot of nodes are all
started at the same time, can lead to a split brain as in #35052.

Prior to #33554 this was unlikely to happen since the MockUncasedHostsProvider
would always have yielded the existing hosts, so the timing would have to have
been implausibly unlucky. Since #33554, however, it's more likely because the
race occurs between the start of the first round of pinging and the writing of
the unicast hosts file. It is realistic that new nodes will be configured with
the existing nodes from startup, so this change reinstates that behaviour.

Closes #35052.
@jdconrad (Contributor) commented Dec 6, 2018

@DaveCTurner This looks like the same issue occurring back in 5.6 (https://internal-ci.elastic.co/job/elastic+x-pack-elasticsearch+5.6+multijob-windows-compatibility/508/console)? I assume the fix won't be backported that far, but I wanted to confirm.

@DaveCTurner (Contributor)

@jdconrad thanks for the ping. That failure is not the same issue, although 5.6 does look susceptible to this specific issue, and you are right that the fix is not backported that far (indeed, #35136 wouldn't work in 5.6). I opened #36732 for the failure that you linked.
