Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: in the swarm move Connectedness emit after releasing conns #2373

Merged

Conversation

Jorropo
Copy link
Contributor

@Jorropo Jorropo commented Jun 16, 2023

go-libp2p-kad-dht now listen to both EvtPeerIdentificationCompleted and EvtPeerConnectednessChanged and EvtPeerIdentificationCompleted calls .ConnsToPeer inorder to do some filtering.

However it happens that it deadlocks because if the swarm is trying to emit a EvtPeerConnectednessChanged while the subscriber is trying to process an EvtPeerIdentificationCompleted, the subscriber is stuck on s.conns.RLock() while the swarm wont release it before having sent EvtPeerConnectednessChanged. Deadlock !

I havn't confirmed this fixes my bug given this takes time to reproduce, I'll startup a new experiment later.

go-libp2p-kad-dht now listen to both EvtPeerIdentificationCompleted and EvtPeerConnectednessChanged
and EvtPeerIdentificationCompleted calls .ConnsToPeer inorder to do some filtering.

However it happens that it deadlocks because if the swarm is trying to emit a EvtPeerConnectednessChanged
while the subscriber is trying to process an EvtPeerIdentificationCompleted, the subscriber is stuck on
s.conns.RLock() while the swarm wont release it before having sent EvtPeerConnectednessChanged.
Deadlock !

I havn't confirmed this fixes my bug given this takes time to reproduce, I'll startup a new experiment soon.
@Jorropo Jorropo force-pushed the do-not-keep-swarm-lock-while-notifying branch 2 times, most recently from c66d5a4 to cbc2e56 Compare June 16, 2023 08:22
@sukunrt
Copy link
Member

sukunrt commented Jun 16, 2023

See: #2373 (comment)

This is only a deadlock if your EvtPeerConnectednessChanged subscription channel is full, right?

Updated:
Okay no, I got it.
You cannot emit peer connectedness changed because the goroutine emitting event peer identification completed has the eventbus lock.
and the go routine emitting identification completed depends on the lock held by the other goroutine.

@Jorropo
Copy link
Contributor Author

Jorropo commented Jun 16, 2023

Yes, this has happened to me in the wild.

@sukunrt
Copy link
Member

sukunrt commented Jun 16, 2023

I agree this should be fixed, just want to get the scenario right.

  • You have a subscription for both EvtPeerIdentificationCompleted and EvtPeerConnectednessChanged.
  • You receive the identification completed event and try to obtain a swarm.conns.RLock
  • The goroutine cannot acquire the lock because another goroutine has the lock which is going to push connectedness changed event. This goroutine cannot push the event because the queue of the same subscriber is full.

Does this sound correct?

@Jorropo
Copy link
Contributor Author

Jorropo commented Jun 16, 2023

@sukunrt what you described is exactly what happens right now because the emit is done while holding conns.
With my patch the goroutine will first release conns before emiting, so the RLock is free and taken by the consumer, the consumer then can continue to process events emptying the channel.
The only way my patch fail is if there are 2 goroutines trying to emit an event on the same stream, the first goroutine is stuck on emiting on a full channel the second one is blocked on notifylk and the consumer of the channel is blocked on conns (which hasn't been released yet) however I don't think that happens given previous releases works even tho the dht use notify (which is emited synchronously bellow already) and because notifylk is per stream not shared accross the swarm.

@MarcoPolo
Copy link
Collaborator

Great work @Jorropo! I didn't consider how a consumer could try to grab a lock that the publisher has and thus create a deadlock.

Publishers should only grab locks to guarantee the ordering of connect/disconnect events. I don't think it needs the swarm.conns lock, though I'll double check the code path.

I can help here and add some tests.

@Jorropo
Copy link
Contributor Author

Jorropo commented Jun 16, 2023

@MarcoPolo I don't have time to finish any of this, feel free to fix ci and add tests thx

@MarcoPolo
Copy link
Collaborator

Test added and fixed a related deadlock (on the disconnect side of things).

@MarcoPolo
Copy link
Collaborator

Once CI passes I'll merge a patch release

Copy link
Contributor Author

@Jorropo Jorropo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✔️ (can't approve) nice for the disconnect, I don't know how I overlooked this obvious codepath.

@MarcoPolo MarcoPolo merged commit 5e3fc8a into libp2p:master Jun 16, 2023
@Jorropo Jorropo deleted the do-not-keep-swarm-lock-while-notifying branch June 16, 2023 18:13
MarcoPolo added a commit that referenced this pull request Jun 16, 2023
* fix: in the swarm move Connectedness emit after releasing conns

go-libp2p-kad-dht now listen to both EvtPeerIdentificationCompleted and EvtPeerConnectednessChanged
and EvtPeerIdentificationCompleted calls .ConnsToPeer inorder to do some filtering.

However it happens that it deadlocks because if the swarm is trying to emit a EvtPeerConnectednessChanged
while the subscriber is trying to process an EvtPeerIdentificationCompleted, the subscriber is stuck on
s.conns.RLock() while the swarm wont release it before having sent EvtPeerConnectednessChanged.
Deadlock !

I havn't confirmed this fixes my bug given this takes time to reproduce, I'll startup a new experiment soon.

* Fix other deadlock and add a test

* Make test a little faster

* Bind on localhost

---------

Co-authored-by: Marco Munizaga <git@marcopolo.io>
MarcoPolo added a commit that referenced this pull request Jun 16, 2023
* fix: in the swarm move Connectedness emit after releasing conns

go-libp2p-kad-dht now listen to both EvtPeerIdentificationCompleted and EvtPeerConnectednessChanged
and EvtPeerIdentificationCompleted calls .ConnsToPeer inorder to do some filtering.

However it happens that it deadlocks because if the swarm is trying to emit a EvtPeerConnectednessChanged
while the subscriber is trying to process an EvtPeerIdentificationCompleted, the subscriber is stuck on
s.conns.RLock() while the swarm wont release it before having sent EvtPeerConnectednessChanged.
Deadlock !

I havn't confirmed this fixes my bug given this takes time to reproduce, I'll startup a new experiment soon.

* Fix other deadlock and add a test

* Make test a little faster

* Bind on localhost

---------

Co-authored-by: Marco Munizaga <git@marcopolo.io>
MarcoPolo added a commit that referenced this pull request Jun 19, 2023
* fix: in the swarm move Connectedness emit after releasing conns (#2373)

* fix: in the swarm move Connectedness emit after releasing conns

go-libp2p-kad-dht now listen to both EvtPeerIdentificationCompleted and EvtPeerConnectednessChanged
and EvtPeerIdentificationCompleted calls .ConnsToPeer inorder to do some filtering.

However it happens that it deadlocks because if the swarm is trying to emit a EvtPeerConnectednessChanged
while the subscriber is trying to process an EvtPeerIdentificationCompleted, the subscriber is stuck on
s.conns.RLock() while the swarm wont release it before having sent EvtPeerConnectednessChanged.
Deadlock !

I havn't confirmed this fixes my bug given this takes time to reproduce, I'll startup a new experiment soon.

* Fix other deadlock and add a test

* Make test a little faster

* Bind on localhost

---------

Co-authored-by: Marco Munizaga <git@marcopolo.io>

* Release version v0.27.7

* identify: set stream deadlines for Identify and Identify Push streams (#2382)

---------

Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
Co-authored-by: Marten Seemann <martenseemann@gmail.com>
marten-seemann pushed a commit that referenced this pull request Jun 20, 2023
* fix: in the swarm move Connectedness emit after releasing conns

go-libp2p-kad-dht now listen to both EvtPeerIdentificationCompleted and EvtPeerConnectednessChanged
and EvtPeerIdentificationCompleted calls .ConnsToPeer inorder to do some filtering.

However it happens that it deadlocks because if the swarm is trying to emit a EvtPeerConnectednessChanged
while the subscriber is trying to process an EvtPeerIdentificationCompleted, the subscriber is stuck on
s.conns.RLock() while the swarm wont release it before having sent EvtPeerConnectednessChanged.
Deadlock !

I havn't confirmed this fixes my bug given this takes time to reproduce, I'll startup a new experiment soon.

* Fix other deadlock and add a test

* Make test a little faster

* Bind on localhost

---------

Co-authored-by: Marco Munizaga <git@marcopolo.io>
@MarcoPolo MarcoPolo mentioned this pull request Jul 14, 2023
29 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants