Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bridge: call connection_lost "soon" #20881

Conversation

allisonkarlitskaya
Copy link
Member

The standard Python asyncio Transport classes protect their caller from having to deal with exceptions when calling .write(). In case an exception does occur, they capture it and close the transport, passing the exception on to the connection_lost() handler. This makes it a lot easier to interact with the transport: because OS-level errors are handled out of band, protocols don't have to consider that any write might fail.

Our Transport classes do the same thing, but there's an important difference: the standard library defers the call to .connection_lost(), but our transports invoke it immediately.

This creates a weird scenario whereby your connection_lost() handler can get called in the middle of your connection_made() handler, if you should happen to do a .write() there which fails. Since connection_lost() reasonably assumes that connection_made() has run, you can get into all kinds of trouble. Worst of all, once .connection_lost() is done running, control flow returns back to your .connection_made() handler, which continues running as if nothing has happened in the meantime.

Let's adopt the standard Python behaviour here and defer the .connection_lost() call. We need to adjust some unit tests which assumed the old behaviour: adjust them to ensure that we specifically don't have that behaviour anymore.

@allisonkarlitskaya
Copy link
Member Author

Note: in #20880 I speculated that this might be behind the issue we had with fswatch1 being "closed twice". ie: my idea was that it wasn't closed twice, but had actually never finished being opened.

I don't think that this is true, though, and here's why:

Imagine a scenario where someone creates a fswatch1 channel and immediately slams the transport closed. Then when the ready message for the fswatch channel gets sent, we're going to catch an OSError and enter our protocol shutdown code immediately (which will try to close the channel). But look:

    def do_open(self, options: JsonObject) -> None:
        self._path = get_str(options, 'path')
        self._tag = None

        self._active = False
        self._watch = PathWatch(self._path, self)
        self._active = True

        self.ready()

ie: by the time we call .ready() (which is where we suppose the exception occurs) we've already initialized self._watch. I'm not saying that this is impossible (via some other path), but it certainly seems less likely now, having seen the actual code....

For that reason, I didn't revert #20634 here.

@allisonkarlitskaya allisonkarlitskaya marked this pull request as ready for review August 13, 2024 11:11
@martinpitt
Copy link
Member

I smoke-tested this against c-files (see cockpit-project/bots#6730 (comment))

After a few iterations I still get the crash which the first commit in #20880 fixed:

Traceback (most recent call last):
File "/usr/lib/python3.12/asyncio/events.py", line 88, in _run
self._context.run(self._callback, *self._args)
File "/usr/lib/python3/dist-packages/cockpit/protocol.py", line 160, in connection_lost
self.close(exc)
File "/usr/lib/python3/dist-packages/cockpit/protocol.py", line 170, in close
self.do_closed(exc)
File "/usr/lib/python3/dist-packages/cockpit/router.py", line 245, in do_closed
self.eof_received()
File "/usr/lib/python3/dist-packages/cockpit/router.py", line 234, in eof_received
endpoint.do_close()
File "/usr/lib/python3/dist-packages/cockpit/channel.py", line 425, in do_close
if self._transport is not None:
^^^^^^^^^^^^^^^
AttributeError: 'SubprocessStreamChannel' object has no attribute '_transport'

For devensiveness we really should initialize this variable. do_close() is already written defensively, so I don't see how it can hurt to initialize the instance member. When I apply that locally, it survived 30 iterations of TestFiles.testSorting, so I'm reasonably sure the commit here replaces the second commit of #20880 . So this plus initing _transport and we have a winner?

@allisonkarlitskaya
Copy link
Member Author

For devensiveness we really should initialize this variable. do_close() is already written defensively, so I don't see how it can hurt to initialize the instance member. When I apply that locally, it survived 30 iterations of TestFiles.testSorting, so I'm reasonably sure the commit here replaces the second commit of #20880 . So this plus initing _transport and we have a winner?

Yes. This was meant as an alternative for the second commit, indeed. The uninitialized variable is still wrong.

The standard Python asyncio Transport classes protect their caller from
having to deal with exceptions when calling .write().  In case an
exception does occur, they capture it and close the transport, passing
the exception on to the connection_lost() handler.  This makes it a lot
easier to interact with the transport: because OS-level errors are
handled out of band, protocols don't have to consider that any write
might fail.

Our Transport classes do the same thing, but there's an important
difference: the standard library defers the call to .connection_lost(),
but our transports invoke it immediately.

This creates a weird scenario whereby your connection_lost() handler can
get called in the middle of your connection_made() handler, if you
should happen to do a .write() there which fails.  Since
connection_lost() reasonably assumes that connection_made() has run, you
can get into all kinds of trouble.  Worst of all, once
.connection_lost() is done running, control flow returns back to your
.connection_made() handler, which continues running as if nothing has
happened in the meantime.

Let's adopt the standard Python behaviour here and defer the
.connection_lost() call.  We need to adjust some unit tests which
assumed the old behaviour: adjust them to ensure that we specifically
*don't* have that behaviour anymore.

This is something that I always knew we did differently than the Python
standard library, but I didn't understand the purpose of the additional
complexity until now.  Thanks goes to Martin for the initial research to
uncover this issue.
This is the only thing in the class which isn't initialized and it's
causing issues.  Fix that.

Thanks for tracking this down go to Martin as well :)
Copy link
Member

@martinpitt martinpitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cheers! Stop bridge from crashing beats pitti being grumpy about class variables.

@allisonkarlitskaya allisonkarlitskaya merged commit 686a5d3 into cockpit-project:main Aug 13, 2024
78 of 79 checks passed
@allisonkarlitskaya allisonkarlitskaya deleted the defer-connection_lost branch August 13, 2024 12:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants