
Deadlocks caused by openssl's handling of TLS 1.3 session tickets #819

Closed
njsmith opened this issue Dec 23, 2018 · 8 comments · Fixed by #1171
Labels
TLS Relevant to our TLS/SSL implementation

Comments


njsmith commented Dec 23, 2018

We have a test that tries running SSLStream over an unbuffered transport, in part to confirm we aren't making assumptions about buffering. It turns out that openssl 1.1.1/TLS 1.3 does make assumptions about buffering, and the test hangs. Now what?

Specifically: we create an unbuffered transport, then call do_handshake on both the client and server sides. The client.do_handshake() completes successfully. The server.do_handshake() hangs.

I believe what's happening is that when openssl 1.1.1 is doing TLS 1.3, the server automatically sends some session tickets after the handshake completes. Technically these aren't part of the handshake, so the client.do_handshake() call doesn't wait for them. But on the server side we don't know this; all we can see is that we called SSL_do_handshake and it spit out some data to send, so we assume that the handshake won't complete until we send that data.

I guess there are two scenarios that the designers imagined: (1) the client is trying to read application-level data, so it's reading from the transport, and it gets the session tickets as a bonus. (2) the client isn't trying to read application-level data, so it's not reading from the transport, but the transport has enough intrinsic buffering that the session tickets can just sit there forever without causing any problems. But in our test neither is true, so it breaks.

As far as I know, this shouldn't have too much real-world impact, since real transports almost always have some buffering. (Empirically the session tickets appear to be 510 bytes, which is smaller than pretty much any real transport buffer.) And, there is a workaround available: modify clogged_stream_maker so the server does await server.do_handshake(); await server.send_all(b"x"), and the client does await client.do_handshake(); await client.receive_some(1). Sending a bit of application level data from server→client like this should flush out the session tickets and get back to the state we want for the rest of the tests.
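The workaround's flushing behavior can be modeled with toy stand-ins (everything below is hypothetical scaffolding, not Trio's API): an unbuffered transport blocks the server's queued ticket bytes until the client does a read, which is exactly what the extra send/receive pair arranges.

```python
import asyncio

# Toy model of the workaround: the transport is completely unbuffered, so
# the server's queued session tickets can't go out until the client reads.
# All names here are hypothetical stand-ins, not Trio's actual objects.

class Rendezvous:
    """An unbuffered (zero-capacity) one-way channel: send() blocks until
    the peer has actually received the bytes."""
    def __init__(self):
        self._item = None
        self._have = asyncio.Event()
        self._done = asyncio.Event()

    async def send(self, data):
        self._item = data
        self._have.set()
        await self._done.wait()   # block until the receiver takes it
        self._done.clear()

    async def receive(self):
        await self._have.wait()
        self._have.clear()
        data, self._item = self._item, None
        self._done.set()
        return data

async def server(transport):
    # After the real handshake, openssl has already queued the session
    # tickets; on an unbuffered transport this send would block forever
    # unless the client reads something.
    await transport.send(b"<session tickets>")
    await transport.send(b"x")        # the workaround's 1 byte of app data

async def client(transport):
    # client.do_handshake() returned without consuming the tickets; the
    # workaround's single receive drains them, then gets the app byte.
    tickets = await transport.receive()
    byte = await transport.receive()
    return tickets, byte

async def main():
    transport = Rendezvous()
    _, received = await asyncio.gather(server(transport), client(transport))
    return received

tickets, byte = asyncio.run(main())
print(tickets, byte)
```

Without the client's reads, `asyncio.gather` here would simply hang on the server's first `send` — the same deadlock the test hits.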

Nonetheless, it's not ideal that openssl is making subtle assumptions about buffering like this. Should we do something?

Unfortunately, TLS 1.3's decision to move session tickets out of the handshake doesn't give us a lot of options here. We could disable session tickets, but that seems sub-optimal (and currently the ssl module doesn't expose a knob for this anyway).

We could potentially delay sending the session tickets until the first time we send application-level data. That way server.do_handshake() wouldn't need to make any assumptions about buffering, and it's totally reasonable that server.send_all(...) has to wait for the peer to call client.receive_some(...). But, there are two challenges:

  • I'm not sure how to implement this. Specifically, it requires figuring out that the data openssl has dumped into its send buffer is the session tickets, as opposed to part of the handshake. Is there any reliable way to do this?

  • There are some cases where session tickets are successfully delivered now, but wouldn't be if we implemented this. Specifically, if you have a connection where the server never sends any application-level data, but the client does read from network. This sounds a little weird, but it could happen: e.g. any twisted or asyncio client will act like this; or, if you're using trio, you could intentionally implement a client like this exactly because you want to get session tickets.

So.... I'm not having any brilliant ideas here. I think for right now we'll just implement the workaround to get the test suite working, and then wait to see if this becomes an issue or not.

CC: @tiran – I don't think there's anything for you to do here in particular, but it seems like the kind of issue that you like to know about.


njsmith commented Dec 23, 2018

Oh, this also causes a mess with test_closing_nice_case. (In TLS, there are no nice cases.)

The test that fails is:

trio/trio/tests/test_ssl.py

Lines 882 to 894 in 5a03bee

# Check that a graceful close *before* handshaking gives a clean EOF on
# the other side
client_ssl, server_ssl = ssl_memory_stream_pair()

async def expect_eof_server():
    with assert_checkpoints():
        assert await server_ssl.receive_some(10) == b""
    with assert_checkpoints():
        await server_ssl.aclose()

async with _core.open_nursery() as nursery:
    nursery.start_soon(client_ssl.aclose)
    nursery.start_soon(expect_eof_server)

So this involves a somewhat delicate thing: if you open a TLS connection and then immediately close it again, how should that be handled? Specifically here we have the client calling aclose() as the first thing it does, while the server calls receive_some(), and the expectation is that receive_some() should return b"" to signal EOF. It looks pretty straightforward.

And in HTTPS-compatible mode it is straightforward, because the client just closes the transport, and the server interprets that as EOF. But in standards-compliant mode, we're only supposed to report EOF if we got a proper cryptographic notification of the client's intention to close the connection. So in this case, the client's aclose() first triggers an implicit handshake, and then after the handshake is finished, the client immediately sends a shutdown notification, and then it closes the transport. And similarly, the server's receive_some() triggers an implicit handshake, and then it tries to read from the connection and it gets the clean shutdown notification. Great.

Except... with TLS 1.3, after the handshake, the server tries to send the session tickets. And the standard semantics for transport streams (as determined by e.g. TCP/BSD sockets) are that if your peer has closed the connection, then you can keep reading data until you get everything they sent before they closed it, but if you send data then you get an immediate error and lose all the data that you might have received. Which is what happens here: the client has closed the transport, so the server's attempt to send the session tickets raises an exception, and the client's clean shutdown notification gets lost, so the server never has a chance to see it.
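The transport semantics described here can be seen with a plain socketpair. One caveat on this sketch: it uses an AF_UNIX pair, where (unlike real TCP) the peer's queued bytes happen to survive the failed send; on TCP, the reset provoked by that send typically discards them.

```python
import socket

# Model of the close sequence above: the "client" sends its close_notify
# and closes the transport; the "server" then tries to send its session
# tickets. (AF_UNIX socketpair; a behavior sketch, not real TLS.)
client, server = socket.socketpair()

client.sendall(b"<close_notify>")   # clean shutdown notification...
client.close()                      # ...then close the transport

try:
    server.sendall(b"<session tickets>")
    send_failed = False
except (BrokenPipeError, ConnectionResetError):
    send_failed = True              # BrokenResourceError territory in Trio

# On this AF_UNIX pair the peer's final bytes are still readable; on real
# TCP, the reset triggered by the failed send would usually destroy them --
# which is exactly how the server loses the client's close_notify.
leftover = server.recv(1024)
print(send_failed, leftover)
server.close()
```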

I... don't think there's anything we can actually do here? I think the server is simply stuck getting BrokenResourceError (as if the network had dropped) instead of a clean EOF.

I guess we could swap the roles in the test, so the server's the one who closes it immediately – that would at least let us continue to exercise the aclose-triggering-handshake logic.


njsmith commented Dec 23, 2018

Ugh, it's not just that one sequence I analyzed above that triggers this, it's a whole bunch of them that trigger it non-deterministically, probably depending on whether or not the scheduler lets the client close the connection before the server manages to enqueue the session tickets.


njsmith commented Dec 23, 2018

Boiling this down further: the major issue with the auto-sending session tickets is that they cannot work reliably in any usage where the server never sends data to the client. As a special case, that includes our toy testcases that just do things like establish a connection and then close it immediately, but it's much more general than that. Also, it doesn't matter whether you use standards-compliant mode or HTTPS-compatible mode.

Here are two openssl issues where people ran into these problems before me:

As noted in the second issue, we could potentially avoid the close-related problems if the client always did a full round-trip shutdown, because waiting for the server's close_notify would automatically consume the session tickets. Currently Trio never does a full round-trip shutdown: if we're in standards-compliant mode then we send a close_notify and then immediately close the connection, and if we're in HTTPS-compatible mode then we skip the close_notify entirely. I suppose we could detect when we're using TLS 1.3 and switch to attempting the full round-trip close_notify dance, and ignore errors as appropriate. This seems suboptimal though; upgrading to TLS 1.3 isn't supposed to add round-trips. And anyway, when we're a server we can't control what clients connecting to us do! And clients are specifically allowed by the standard to skip doing a full round-trip shutdown.
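For illustration, a full round-trip shutdown over the stdlib ssl module's memory BIOs might look like the sketch below. The `recv_bytes`/`send_bytes` transport callables are assumptions, and this is not what Trio currently does.

```python
import ssl

def round_trip_close(ssl_obj, incoming, outgoing, recv_bytes, send_bytes):
    """Send our close_notify, then keep reading until the peer's
    close_notify arrives; any session tickets queued ahead of it get
    consumed along the way. ssl_obj is (duck-typed as) an ssl.SSLObject
    over the given MemoryBIOs; recv_bytes/send_bytes do the raw I/O."""
    while True:
        try:
            ssl_obj.unwrap()          # done once the peer's close_notify is in
        except ssl.SSLWantReadError:
            data = outgoing.read()
            if data:
                send_bytes(data)      # flush our own close_notify first
            incoming.write(recv_bytes())  # ...then wait for the peer's
        else:
            data = outgoing.read()
            if data:
                send_bytes(data)
            return
```

The cost is visible in the structure: the `except` branch is an extra round-trip that TLS 1.3 was supposed to make unnecessary.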

The PR attempts to implement a different solution: it ignores EPIPE/ECONNRESET if those occur while attempting to send session tickets. But this has two problems AFAICT. First, we use memory BIOs and then do the I/O ourselves, so any EPIPE/ECONNRESET occurs later, after the call to SSL_do_handshake. The openssl devs suggest that in this case it's our job to figure out how to deal with it, which would be fair enough, except that we can't implement the same behavior: we don't know whether the data we're attempting to send is the session tickets, because openssl doesn't tell us.

But that doesn't really matter, because there's another problem that's much more serious. Even if you ignore EPIPE/ECONNRESET while sending the session tickets, that doesn't guarantee you can go back to calling recv afterwards, because of how TCP works!

...Ugh, I guess I'm filing a bug on openssl.


njsmith commented Dec 23, 2018

openssl/openssl#7948

njsmith added a commit to njsmith/trio that referenced this issue Dec 23, 2018
See python-trio#817

We don't have CI for this yet, but at least on my laptop it helps.
However, it doesn't fix everything.

- Add OP_ENABLE_MIDDLEBOX_COMPAT to the list of static ssl constants
- Handle SSLZeroReturnError in more places
- Fix some -- but not all -- of the places where session tickets cause
  failures. See python-trio#819
njsmith added a commit to njsmith/trio that referenced this issue Jan 7, 2019
It's currently a mess because of how openssl v1.1.1 handles session
tickets. There isn't consensus yet about whether this is an openssl
bug or what:

python-trio#819
openssl/openssl#7948
openssl/openssl#7967
@oremanj oremanj added the TLS Relevant to our TLS/SSL implementation label May 4, 2019
@njsmith njsmith changed the title How should we handle TLS 1.3 session tickets? Deadlocks caused by openssl's handling of TLS 1.3 session tickets Aug 2, 2019

njsmith commented Aug 2, 2019

There's a lot more information about the issue in openssl/openssl#7948 and openssl/openssl#7967. At this point I'm pretty sure that the right answer is that the server should wait, and send the session tickets the first time it has some other reason to send data (e.g. because of a call to SSL_write). That's also what BoringSSL has switched to doing unconditionally. However, the openssl devs are not yet convinced of this, and are especially unconvinced that they should change this in a stable release, and RHEL 8 has now shipped with an openssl 1.1.1 that contains this bug. So we should keep trying to convince them to fix their stuff, but we should also think about workarounds.

I guess the most elegant workaround would be to detect when openssl tries to send session tickets at the wrong time (= immediately after the handshake), and then instead of sending them we could stash them away to send later, when the right time arrives (= the next time some other operation requires sending data). The stashing part is pretty straightforward. The tricky part is reliably recognizing the session tickets, because openssl doesn't distinguish them from the rest of the handshake – it just gives you a bunch of opaque bytes to send back and forth, and you're not supposed to think too hard about which bytes represent which messages. But here we need to break that abstraction.

My first thought was: maybe we could peek at the raw bytes to find some kind of message type tag. Unfortunately, it turns out that TLS 1.3 starts encrypting everything mid-way through the handshake. In particular, the session tickets, and the messages that come before them, are both encrypted. Unless we somehow extract the secrets from openssl, all we have is a length field. (And we don't have any way to extract the secrets.)

However, I think there actually is a solution. Looking at RFC 8446, we see in section 4.6.1:

At any time after the server has received the client Finished message, it MAY send a NewSessionTicket message

So openssl can't send tickets until after it sees the client's Finished message.

And, the client Finished message always comes after the server Finished message – you can see this by examining Figures 1 through 4 that show all the possible handshake patterns, and also if I'm reading sections 4.4.1 and 4.4.4 correctly, the client Finished contains a hash of the server Finished.

So when we loop calling SSL_do_handshake on the server, for TLS 1.3 it always goes through the following states sequentially:

  1. until ClientHello arrives: return WantRead with no data in the outgoing buffer
  2. when ClientHello arrives: return WantRead, with the ServerHello ... Finished messages in the outgoing buffer, which we then send. At this point the server's part of the handshake is actually complete.
  3. until client Finished arrives: return WantRead with no data in the outgoing buffer
  4. after client Finished arrives: return successfully, with session tickets in the outgoing buffer, which we then send
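That state sequence corresponds to a server-side handshake loop over memory BIOs along these lines (a sketch; the transport callables are assumptions, not Trio's actual code):

```python
import ssl

def server_handshake(ssl_obj, incoming, outgoing, recv_bytes, send_bytes):
    """Drive do_handshake() through states 1-4 above. Returns whatever
    ends up in the outgoing buffer after the handshake completes, which
    for a TLS 1.3 server is the session tickets (possibly empty)."""
    while True:
        try:
            ssl_obj.do_handshake()
        except ssl.SSLWantReadError:
            data = outgoing.read()
            if data:                       # state 2: ServerHello..Finished
                send_bytes(data)
            incoming.write(recv_bytes())   # states 1 and 3: need more input
        else:
            return outgoing.read()         # state 4: session tickets, if any
```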

But! This means that we can actually reliably identify session tickets! If:

  • We're on the server side, and
  • SSL_do_handshake just completed with no error, and
  • we negotiated TLS 1.3 (in Python: ssl_obj.version() == "TLSv1.3")

...then the only thing that can possibly be in the outgoing buffer at this point is session tickets. The server's handshake finished in step 2, so what else can SSL_do_handshake be sending in step 4?

AFAICT, this heuristic is currently 100% reliable. Should we worry about things changing in the future in a way that breaks this heuristic?

In principle, TLS 1.3 allows the server to start sending application data before receiving the client Finished message (you can see this in the handshake diagrams), so you might think SSL_do_handshake could return before seeing the client Finished. But it turns out that most TLS 1.3 implementations don't make use of this. There are some caveats:

Servers MAY send data after sending their first flight, but because the handshake is not yet complete, they have no assurance of either the peer's identity or its liveness (i.e., the ClientHello might have been replayed). [section 4.4.4]

And David Benjamin says "I don't think anyone bothers".

OpenSSL does have an API for this – in TLS 1.3, "early data" is a concept that only makes sense for clients, so they decided that if a server calls their "early data" functions, then that's how you write application data before seeing the client Finished. But this involves a completely different API that replaces SSL_do_handshake. So I don't think we need to worry that SSL_do_handshake will at some point start returning early before seeing the client Finished message.

Of course at some point we'll want to add support for early data (#223), so we'll need to revisit this then. But for now neither the stdlib ssl module nor pyopenssl expose any of the necessary API, so we can kick that can down the road a bit.

Is there anything else that SSL_do_handshake might send at the end of the exchange? There are several other "post handshake messages" (see section 4.6), but these don't seem relevant:

  • Post-handshake authentication: this allows the server to request a cert from the client. But if the server wanted a cert from the client during the handshake, it can just do that; this is only used if the server decided not to request a cert during the handshake, and then later changes its mind. But if we haven't even left SSL_do_handshake yet, it hasn't had a chance to change its mind! Also I think OpenSSL only sends this message if you explicitly call SSL_verify_client_post_handshake.

  • KeyUpdate: this just tells the other side that hey fyi, all the data I send after this will be using a new set of keys (and it can also suggest to the other side that they do a KeyUpdate as well at their convenience). There's never any reason to send this from SSL_do_handshake, and even if openssl did for some reason, it would be fine for us to delay the delivery until we have more data to send.

So...... unless I'm missing something, this is both reliable and safe, so we should probably do it!
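A minimal sketch of that stash-and-delay idea (a hypothetical wrapper, not Trio's actual implementation):

```python
class TicketStash:
    """After a server-side do_handshake() succeeds under TLS 1.3, whatever
    is left in the outgoing MemoryBIO can only be session tickets (per the
    argument above), so hold them back until we have other data to send."""

    def __init__(self, ssl_obj, outgoing):
        self._ssl = ssl_obj          # duck-typed ssl.SSLObject
        self._outgoing = outgoing    # duck-typed ssl.MemoryBIO
        self._stashed = b""

    def on_handshake_done(self):
        # Only stash in the exact situation the heuristic covers.
        if self._ssl.server_side and self._ssl.version() == "TLSv1.3":
            self._stashed = self._outgoing.read()   # b"" if openssl is fixed

    def outgoing_bytes(self):
        # Called whenever we're about to send; the tickets ride along with
        # the first real payload instead of going out on their own.
        pending, self._stashed = self._stashed, b""
        return pending + self._outgoing.read()
```

Note that the `b"" if openssl is fixed` case is what makes this safe against a future openssl that behaves like BoringSSL: stashing zero bytes is a no-op.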


njsmith commented Aug 2, 2019

Oh, one more thing that's kind of obvious but probably worth saying explicitly: the heuristic above is also robust enough to work correctly if we're using a version of openssl where this bug is already fixed (like boringssl). In this case, at step 4 there simply won't be any session tickets in the outgoing buffer, so our stashing logic will stash 0 bytes, and everything works fine.


njsmith commented Aug 2, 2019

David Benjamin was kind enough to read over my post above, and says: "Your state listing is missing HelloRetryRequest, but I think the heuristic works."


njsmith commented Nov 17, 2020

As a small coda to this whole saga: it turns out that just like I predicted, this openssl issue has actually been causing data loss in real world situations, with protocols that use unidirectional connections like FTPS and syslog, e.g.:

Which sucks, but at least it's nice to know that the effort we put into this was dealing with a real problem.

(Also if openssl keeps dragging their feet we might eventually be forced to add some kind of workaround on the client side to deal with broken servers, but I guess we can wait until someone hits this in real life.)
