Deadlock on simultaneous nodeup #91

Open · kzemek opened this issue Jul 12, 2018 · 6 comments

kzemek commented Jul 12, 2018

I'm having an issue similar to #60, reproducible very often when I bring up containers with the app at roughly the same time. It looks like each node is waiting for another one, and they're perpetually stuck in the :syncing state. Here are the :sys.get_status(Swarm.Tracker) results from my 5 nodes: https://pastebin.com/EYLg6YNE. No custom options are set, all defaults; clustering is done with the libcluster gossip strategy.
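
A dump like that can be collected with something along these lines (a rough sketch, assuming the tracker process is registered locally under Swarm.Tracker):

```elixir
# Rough sketch: collect the tracker status from the local node and every
# connected node (assumes the tracker is registered locally as Swarm.Tracker).
for node <- [Node.self() | Node.list()] do
  node
  |> :rpc.call(:sys, :get_status, [Swarm.Tracker])
  |> IO.inspect(label: to_string(node))
end
```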

kzemek added a commit to kzemek/swarm-deadlock-repro that referenced this issue Jul 12, 2018

kzemek commented Jul 12, 2018

Please see https://github.com/kzemek/swarm-deadlock-repro for a reliable reproduction of the issue.

kzemek commented Jul 12, 2018

These are the logs produced with debug: true: https://gist.github.com/kzemek/f5067111dcc5b1c40803b12966cd1a1f#file-gistfile1-txt
There are no more debug logs after that point.

kzemek commented Jul 12, 2018

I've also tried manipulating the choice of sync node in hopes that it would resolve the deadlock: kzemek@28516d9
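
The change there amounted to roughly the following (a simplified sketch of the idea, not the exact diff):

```elixir
# Simplified sketch of the "always sync to the smallest other node" idea
# (illustrative only; the actual change lives in the commit linked above).
choose_sync_node = fn nodelist ->
  case Enum.reject(nodelist, &(&1 == Node.self())) do
    [] -> nil                   # no other nodes yet, nothing to sync with
    others -> Enum.min(others)  # deterministic: lexicographically smallest node
  end
end

choose_sync_node.([:"repro_1@host", :"repro_2@host", :"repro_3@host"])
#=> :"repro_1@host" (unless we *are* repro_1, in which case repro_2)
```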

But instead, the states of the Swarm.Tracker processes got stranger: https://gist.github.com/kzemek/f5067111dcc5b1c40803b12966cd1a1f#file-nodes_sync_to_smallest-txt

All nodes tried to sync to repro_2 (the "smallest" node), except repro_2 itself, which synced to repro_3. repro_3 synced successfully and was put into the :tracking state, while at the same time repro_2 was put into :awaiting_sync_ack and sent the cast {sync_recv,<16250.182.0>,{{0,1},0},[]} to repro_3. But the sync_recv cast is not handled in the :tracking state, so repro_2 got stuck, and so did the other nodes that tried to sync to it.

kzemek commented Jul 12, 2018

This particular issue is not present when reverting to commit c305633 (pre-412bad9). The nodes all go into the :tracking state almost instantly.

@joxford531

I'm seeing this issue as well. When I revert to version 3.1, I don't see any problems with deadlocking on startup.

@malmovich

We've been having this issue as well, and I'm pretty sure we also had it in 3.3.1.

In our case we observed the following scenario. Let's say we have nodes A, B, and C, and the following happens:
A - :sync -> B
B - :sync -> C
C - :sync -> A

All nodes are now in the :syncing state, waiting for a :sync_recv message.

So far we have resolved this with a state timeout in :syncing, which stops the syncing and tries another node. It seems to work fine; however, this approach introduced a few complications and made things a bit more complex. A simpler approach could be to drop the pending_sync_request strategy and just decline the sync request while syncing.
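
For what it's worth, the state-timeout idea can be illustrated with a toy GenStateMachine (which Swarm already builds on); all module, state, and message names below are made up for illustration and are not Swarm's actual internals:

```elixir
# Toy sketch of a state timeout in a :syncing state (hypothetical names,
# not Swarm's real tracker). If the chosen peer never answers with
# :sync_recv, the timeout fires and we retry with the next peer instead
# of waiting forever.
defmodule SyncWithTimeout do
  use GenStateMachine, callback_mode: :state_functions

  @sync_timeout 5_000

  def start_link(peers), do: GenStateMachine.start_link(__MODULE__, peers)

  def init(peers) do
    {:ok, :idle, %{peers: peers}, [{:next_event, :internal, :begin_sync}]}
  end

  # No peers to sync with: act as the sole node and go straight to tracking.
  def idle(:internal, :begin_sync, %{peers: []} = data) do
    {:next_state, :tracking, data}
  end

  # Pick a peer, ask it to sync, and arm a state timeout so we can bail out.
  def idle(:internal, :begin_sync, %{peers: [peer | rest]} = data) do
    send(peer, {:sync, self()})
    {:next_state, :syncing, %{data | peers: rest ++ [peer]},
     [{:state_timeout, @sync_timeout, :sync_timed_out}]}
  end

  # Happy path: the peer replied; the state timeout is cancelled automatically
  # on the state change.
  def syncing(:info, {:sync_recv, _from, _clock, _registry}, data) do
    {:next_state, :tracking, data}
  end

  # The peer is itself stuck syncing and never replied: give up on it and
  # try the next candidate rather than deadlocking.
  def syncing(:state_timeout, :sync_timed_out, data) do
    {:next_state, :idle, data, [{:next_event, :internal, :begin_sync}]}
  end

  def tracking(_event_type, _event, data), do: {:keep_state, data}
end
```

A nice property of gen_statem state timeouts is that they are cancelled automatically whenever the state changes, so the happy path needs no extra bookkeeping.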
