bridge: call connection_lost "soon" #20881

allisonkarlitskaya · 2024-08-13T10:40:38Z

The standard Python asyncio Transport classes protect their caller from having to deal with exceptions when calling .write(). In case an exception does occur, they capture it and close the transport, passing the exception on to the connection_lost() handler. This makes it a lot easier to interact with the transport: because OS-level errors are handled out of band, protocols don't have to consider that any write might fail.

Our Transport classes do the same thing, but there's an important difference: the standard library defers the call to .connection_lost(), but our transports invoke it immediately.

This creates a weird scenario whereby your connection_lost() handler can get called in the middle of your connection_made() handler, if you should happen to do a .write() there which fails. Since connection_lost() reasonably assumes that connection_made() has run, you can get into all kinds of trouble. Worst of all, once .connection_lost() is done running, control flow returns back to your .connection_made() handler, which continues running as if nothing has happened in the meantime.

Let's adopt the standard Python behaviour here and defer the .connection_lost() call. We need to adjust some unit tests which assumed the old behaviour: adjust them to ensure that we specifically don't have that behaviour anymore.

allisonkarlitskaya · 2024-08-13T11:00:37Z

Note: in #20880 I speculated that this might be behind the issue we had with fswatch1 being "closed twice". ie: my idea was that it wasn't closed twice, but had actually never finished being opened.

I don't think that this is true, though, and here's why:

Imagine a scenario where someone creates a fswatch1 channel and immediately slams the transport closed. Then when the ready message for the fswatch channel gets sent, we're going to catch an OSError and enter our protocol shutdown code immediately (which will try to close the channel). But look:

    def do_open(self, options: JsonObject) -> None:
        self._path = get_str(options, 'path')
        self._tag = None

        self._active = False
        self._watch = PathWatch(self._path, self)
        self._active = True

        self.ready()

ie: by the time we call .ready() (which is where we suppose the exception occurs) we've already initialized self._watch. I'm not saying that this is impossible (via some other path), but it certainly seems less likely now, having seen the actual code....

For that reason, I didn't revert #20634 here.

martinpitt · 2024-08-13T11:21:52Z

I smoke-tested this against c-files (see cockpit-project/bots#6730 (comment))

After a few iterations I still get the crash which the first commit in #20880 fixed:

Traceback (most recent call last):
File "/usr/lib/python3.12/asyncio/events.py", line 88, in _run
self._context.run(self._callback, *self._args)
File "/usr/lib/python3/dist-packages/cockpit/protocol.py", line 160, in connection_lost
self.close(exc)
File "/usr/lib/python3/dist-packages/cockpit/protocol.py", line 170, in close
self.do_closed(exc)
File "/usr/lib/python3/dist-packages/cockpit/router.py", line 245, in do_closed
self.eof_received()
File "/usr/lib/python3/dist-packages/cockpit/router.py", line 234, in eof_received
endpoint.do_close()
File "/usr/lib/python3/dist-packages/cockpit/channel.py", line 425, in do_close
if self._transport is not None:
^^^^^^^^^^^^^^^
AttributeError: 'SubprocessStreamChannel' object has no attribute '_transport'

For devensiveness we really should initialize this variable. do_close() is already written defensively, so I don't see how it can hurt to initialize the instance member. When I apply that locally, it survived 30 iterations of TestFiles.testSorting, so I'm reasonably sure the commit here replaces the second commit of #20880 . So this plus initing _transport and we have a winner?

allisonkarlitskaya · 2024-08-13T11:24:02Z

For devensiveness we really should initialize this variable. do_close() is already written defensively, so I don't see how it can hurt to initialize the instance member. When I apply that locally, it survived 30 iterations of TestFiles.testSorting, so I'm reasonably sure the commit here replaces the second commit of #20880 . So this plus initing _transport and we have a winner?

Yes. This was meant as an alternative for the second commit, indeed. The uninitialized variable is still wrong.

The standard Python asyncio Transport classes protect their caller from having to deal with exceptions when calling .write(). In case an exception does occur, they capture it and close the transport, passing the exception on to the connection_lost() handler. This makes it a lot easier to interact with the transport: because OS-level errors are handled out of band, protocols don't have to consider that any write might fail. Our Transport classes do the same thing, but there's an important difference: the standard library defers the call to .connection_lost(), but our transports invoke it immediately. This creates a weird scenario whereby your connection_lost() handler can get called in the middle of your connection_made() handler, if you should happen to do a .write() there which fails. Since connection_lost() reasonably assumes that connection_made() has run, you can get into all kinds of trouble. Worst of all, once .connection_lost() is done running, control flow returns back to your .connection_made() handler, which continues running as if nothing has happened in the meantime. Let's adopt the standard Python behaviour here and defer the .connection_lost() call. We need to adjust some unit tests which assumed the old behaviour: adjust them to ensure that we specifically *don't* have that behaviour anymore. This is something that I always knew we did differently than the Python standard library, but I didn't understand the purpose of the additional complexity until now. Thanks goes to Martin for the initial research to uncover this issue.

This is the only thing in the class which isn't initialized and it's causing issues. Fix that. Thanks for tracking this down go to Martin as well :)

martinpitt

Cheers! Stop bridge from crashing beats pitti being grumpy about class variables.

allisonkarlitskaya requested a review from martinpitt August 13, 2024 10:40

allisonkarlitskaya mentioned this pull request Aug 13, 2024

Fix bridge crashes on early channel opening failure #20880

Closed

allisonkarlitskaya marked this pull request as ready for review August 13, 2024 11:11

allisonkarlitskaya added 2 commits August 13, 2024 13:26

bridge: add missing ProtocolChannel._transport initializer

562ae22

This is the only thing in the class which isn't initialized and it's causing issues. Fix that. Thanks for tracking this down go to Martin as well :)

allisonkarlitskaya force-pushed the defer-connection_lost branch from fd489d8 to 562ae22 Compare August 13, 2024 11:27

martinpitt approved these changes Aug 13, 2024

View reviewed changes

allisonkarlitskaya merged commit 686a5d3 into cockpit-project:main Aug 13, 2024
78 of 79 checks passed

allisonkarlitskaya deleted the defer-connection_lost branch August 13, 2024 12:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bridge: call connection_lost "soon" #20881

bridge: call connection_lost "soon" #20881

allisonkarlitskaya commented Aug 13, 2024

allisonkarlitskaya commented Aug 13, 2024

martinpitt commented Aug 13, 2024

allisonkarlitskaya commented Aug 13, 2024

martinpitt left a comment

bridge: call connection_lost "soon" #20881

bridge: call connection_lost "soon" #20881

Conversation

allisonkarlitskaya commented Aug 13, 2024

allisonkarlitskaya commented Aug 13, 2024

martinpitt commented Aug 13, 2024

allisonkarlitskaya commented Aug 13, 2024

martinpitt left a comment

Choose a reason for hiding this comment