Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: force-closing local discovery TCP server could always time out #498

Merged
merged 5 commits into from
Apr 30, 2024

Conversation

EvanHahn
Copy link
Contributor

@EvanHahn EvanHahn commented Feb 22, 2024

We have a way to force-close the local discovery TCP server. This was susceptible to the following race condition bug:

  1. A connection opens and is established.

  2. A duplicate connection opens. We decide to keep the newer one. We start to close the old connection and enqueue a "promotion" of the new one.

    https://github.com/digidem/mapeo-core-next/blob/5e05e35200297435f4ebd33df2c5b1444ed4ef1b/src/discovery/local-discovery.js#L229-L232

  3. We close the TCP server. This does not immediately halt the server; it "stops the server from accepting new connections and keeps existing connections". In other words, it closes the doors but doesn't kick people out.

    We iterate through all existing connections and destroy them...

    https://github.com/digidem/mapeo-core-next/blob/5e05e35200297435f4ebd33df2c5b1444ed4ef1b/src/discovery/local-discovery.js#L278-L280

    ...but the duplicate connection is not yet in this list so it is never destroyed and the server will never close.

  4. The old connection is closed and "promotion" of the new connection completes. It is never destroyed.

This caused intermittent CI failures and should be fixed by this change.

We have a way to force-close the local discovery TCP server. This was
susceptible to the following race condition bug:

1. A connection opens and is established.

2. A duplicate connection opens. [We decide to keep the new one. We
   start to close the old connection and enqueue a "promotion" of the
   new one.][2]

3. We close the TCP server. This does not immediately halt the server;
   it ["stops the server from accepting new connections and keeps
   existing connections"][docs].

   [We iterate through all existing connections and destroy them][3],
   but the duplicate connection is not yet in this list so it is never
   destroyed and the server will never close.

4. The old connection is closed and ["promotion" of the new connection
   completes][4]. It is never destroyed.

This caused [intermittent CI failures][ci] and should be fixed by this
change.

[docs]: https://nodejs.org/docs/latest-v18.x/api/net.html#serverclosecallback
[ci]: https://github.com/digidem/mapeo-core-next/actions/runs/7996074898/job/21837815901
[2]: https://github.com/digidem/mapeo-core-next/blob/5e05e35200297435f4ebd33df2c5b1444ed4ef1b/src/discovery/local-discovery.js#L229-L232
[3]: https://github.com/digidem/mapeo-core-next/blob/5e05e35200297435f4ebd33df2c5b1444ed4ef1b/src/discovery/local-discovery.js#L278-L280
[4]: https://github.com/digidem/mapeo-core-next/blob/5e05e35200297435f4ebd33df2c5b1444ed4ef1b/src/discovery/local-discovery.js#L173
Comment on lines +237 to +244
// Bail if the server has already stopped by this point. This should be
// rare but can happen during connection swaps if the new connection is
// "promoted" after the server's doors have already been closed.
if (!this.#server.listening) {
this.#log('server stopped, destroying connection %h', remotePublicKey)
conn.destroy()
return
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confident1 in my understanding of the problem...less so in my solution. Do we want to do this if the server is not force-closed, for example?

Footnotes

  1. this might be hubris

@EvanHahn EvanHahn marked this pull request as ready for review February 22, 2024 18:56
@EvanHahn EvanHahn added the mvp Requirement for MVP label Mar 11, 2024
@EvanHahn EvanHahn requested a review from achou11 March 11, 2024 21:46
Copy link
Member

@gmaclennan gmaclennan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been unable to get my head deep enough into this code to fully understand the problem/solution. I think we are still missing a piece around gracefully closing connections to peers. Currently I think peers stay connected until the connection errors or they are force-terminated. Ideally what would happen is:

  1. Peer starts intention to close connections (e.g. when going into the background)
  2. Peer immediately stops accepting new connections
  3. For each connection to a peer, if all data is "in sync", close the connection
  4. Else, wait peers to be "in sync", then close the connection
  5. If after a timeout, peers are still not "in sync", then force close the connection

This comment probably belongs somewhere else, just thought I'd initially drop it here since I think it's relevant to the logic here.

@EvanHahn EvanHahn self-assigned this Mar 27, 2024
@EvanHahn EvanHahn marked this pull request as draft April 1, 2024 16:52
@EvanHahn
Copy link
Contributor Author

EvanHahn commented Apr 1, 2024

Moving to draft because we might delete a lot of this code.

@EvanHahn EvanHahn changed the title fix: force-closing mDNS TCP server could always time out fix: force-closing local discovery TCP server could always time out Apr 22, 2024
@EvanHahn EvanHahn removed the request for review from achou11 April 22, 2024 16:01
@EvanHahn EvanHahn removed the mvp Requirement for MVP label Apr 22, 2024
@EvanHahn EvanHahn marked this pull request as ready for review April 22, 2024 16:10
@EvanHahn EvanHahn removed their assignment Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants