Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PoC TLS resume on Linux client #64369

Merged
merged 17 commits into from
Mar 25, 2022
Merged

PoC TLS resume on Linux client #64369

merged 17 commits into from
Mar 25, 2022

Conversation

wfurt
Copy link
Member

@wfurt wfurt commented Jan 27, 2022

This complements #57079 and contributes to #22977

Unlike the server part, this is somewhat more complicated. It may still need some work but I would like to get early feedback for the overall strategy.

Lets start with some numbers:

|                           Method | Toolchain | protocol |        Mean |      Error |     StdDev |      Median |         Min |         Max | Ratio | RatioSD |  Gen 0 | Allocated | Alloc Ratio |
|--------------------------------- |---------- |--------- |------------:|-----------:|-----------:|------------:|------------:|------------:|------:|--------:|-------:|----------:|------------:|
| DefaultHandshakeContextIPv4Async |        pr |        ? |   965.19 us |  15.799 us |  14.779 us |   964.67 us |   934.66 us |   989.94 us |  1.00 |    0.00 |      - |   17897 B |        1.00 |
| DefaultHandshakeContextIPv4Async |      main |        ? | 5,377.33 us |  40.143 us |  33.521 us | 5,370.95 us | 5,330.34 us | 5,431.63 us |  5.57 |    0.11 |      - |   19848 B |        1.11 |
| DefaultHandshakeContextIPv6Async |        pr |        ? |   949.92 us |  20.529 us |  23.641 us |   949.12 us |   918.10 us |   995.11 us |  1.00 |    0.00 |      - |   17896 B |        1.00 |
| DefaultHandshakeContextIPv6Async |      main |        ? | 5,171.75 us | 114.255 us | 126.995 us | 5,114.88 us | 5,051.99 us | 5,456.90 us |  5.44 |    0.16 |      - |   19921 B |        1.11 |
|        DefaultHandshakeIPv4Async |        pr |        ? | 5,070.35 us | 171.510 us | 197.512 us | 5,034.52 us | 4,759.74 us | 5,492.81 us |  1.00 |    0.00 |      - |   23679 B |        1.00 |
|        DefaultHandshakeIPv4Async |      main |        ? | 5,572.56 us | 101.193 us |  89.705 us | 5,552.17 us | 5,464.39 us | 5,792.33 us |  1.10 |    0.05 |      - |   22730 B |        0.96 |
|        DefaultHandshakeIPv6Async |        pr |        ? | 5,562.64 us | 148.510 us | 165.069 us | 5,527.18 us | 5,260.93 us | 5,832.12 us |  1.00 |    0.00 |      - |   23721 B |        1.00 |
|        DefaultHandshakeIPv6Async |      main |        ? | 5,550.79 us | 108.666 us |  96.330 us | 5,506.31 us | 5,454.39 us | 5,768.76 us |  1.00 |    0.03 |      - |   23401 B |        0.99 |

When SslStreamCertificateContext is used on server side, TLS resume is possible and we get ~ 5.5x boost.
This should get us on par with Windows but I don't have to same machines to verify this.
On server side #57079 linked the cache to the SslStreamCertificateContext. So when server will reuse it it will also provide ability to resume previous session. When the context is Disposed, it will also dispose the SSL_CTX and the cache.

This creates first complication for client. It also needs to use same SSL_CTX but since there is typically no certificate, there is nothing to hook to.
To solve this I added separate dictionary to hold contexts for client. That allows to reuse previous context and save native allocations. But the native contexts will be preserved during duration of the process. Complete test runs for SslStream creates 13 instance, in ideal case there will be one e.g. SslProtocols.None. e.g. it depends on unique combinations of SslProtocols used by the client.

To make it more complicated, OpenSSL pushes responsibility for setting the sessions to caller e.g. SslStream.
So we use the TargetHost e.g. SNI host to lookup previous session. Since there may be multiple series running on same name this is not perfect. But since this is just offer to the server, it is up to the server to decide if the session is valid or not. Based on my testing, this is how SCHANNEL operates. Some implementations use IP/Port info for the lookup but SslStream does not have access to it in general case. Even with that, this can be just load balancer and the SSL may be terminated or different servers. Once again, if the session is offered to "wrong" server or if the server was restored or cleaned up the session, it should proceed with full handshake.

The rest is mostly technicality. In order to process the sessions, we need to register callbacks and OpenSSL will inform us when it wants to add or remove session. There is some plumbing to map that to .NET and original TargetHost.
There can be multiple sessions for given host but the current strategy is to keep the last one.

I added some methods to SafeSslContextHandle to allocate cache and do operations. The GCHandle can possibly be stored in native SSL_CTX but I felt it would be easier for debug to keep it on managed side. Since the sessions are not directly used anywhere in managed code, I use IntPtr instead of SafeHandle.

@wfurt wfurt added area-System.Net.Security os-linux Linux OS (any supported distro) labels Jan 27, 2022
@wfurt wfurt added this to the 7.0.0 milestone Jan 27, 2022
@wfurt wfurt requested review from stephentoub, bartonjs and a team January 27, 2022 04:47
@wfurt wfurt self-assigned this Jan 27, 2022
@ghost
Copy link

ghost commented Jan 27, 2022

Tagging subscribers to this area: @dotnet/ncl, @vcsjones
See info in area-owners.md if you want to be subscribed.

Issue Details

This complements #57079 and contributes to #22977

Unlike the server part, this is somewhat more complicated. It may still need some work but I would like to get early feedback for the overall strategy.

Lets start with some numbers:

|                           Method | Toolchain | protocol |        Mean |      Error |     StdDev |      Median |         Min |         Max | Ratio | RatioSD |  Gen 0 | Allocated | Alloc Ratio |
|--------------------------------- |---------- |--------- |------------:|-----------:|-----------:|------------:|------------:|------------:|------:|--------:|-------:|----------:|------------:|
| DefaultHandshakeContextIPv4Async |        pr |        ? |   965.19 us |  15.799 us |  14.779 us |   964.67 us |   934.66 us |   989.94 us |  1.00 |    0.00 |      - |   17897 B |        1.00 |
| DefaultHandshakeContextIPv4Async |      main |        ? | 5,377.33 us |  40.143 us |  33.521 us | 5,370.95 us | 5,330.34 us | 5,431.63 us |  5.57 |    0.11 |      - |   19848 B |        1.11 |
| DefaultHandshakeContextIPv6Async |        pr |        ? |   949.92 us |  20.529 us |  23.641 us |   949.12 us |   918.10 us |   995.11 us |  1.00 |    0.00 |      - |   17896 B |        1.00 |
| DefaultHandshakeContextIPv6Async |      main |        ? | 5,171.75 us | 114.255 us | 126.995 us | 5,114.88 us | 5,051.99 us | 5,456.90 us |  5.44 |    0.16 |      - |   19921 B |        1.11 |
|        DefaultHandshakeIPv4Async |        pr |        ? | 5,070.35 us | 171.510 us | 197.512 us | 5,034.52 us | 4,759.74 us | 5,492.81 us |  1.00 |    0.00 |      - |   23679 B |        1.00 |
|        DefaultHandshakeIPv4Async |      main |        ? | 5,572.56 us | 101.193 us |  89.705 us | 5,552.17 us | 5,464.39 us | 5,792.33 us |  1.10 |    0.05 |      - |   22730 B |        0.96 |
|        DefaultHandshakeIPv6Async |        pr |        ? | 5,562.64 us | 148.510 us | 165.069 us | 5,527.18 us | 5,260.93 us | 5,832.12 us |  1.00 |    0.00 |      - |   23721 B |        1.00 |
|        DefaultHandshakeIPv6Async |      main |        ? | 5,550.79 us | 108.666 us |  96.330 us | 5,506.31 us | 5,454.39 us | 5,768.76 us |  1.00 |    0.03 |      - |   23401 B |        0.99 |

When SslStreamCertificateContext is used on server side, TLS resume is possible and we get ~ 5.5x boost.
This should get us on par with Windows but I don't have to same machines to verify this.
On server side #57079 linked the cache to the SslStreamCertificateContext. So when server will reuse it it will also provide ability to resume previous session. When the context is Disposed, it will also dispose the SSL_CTX and the cache.

This creates first complication for client. It also needs to use same SSL_CTX but since there is typically no certificate, there is nothing to hook to.
To solve this I added separate dictionary to hold contexts for client. That allows to reuse previous context and save native allocations. But the native contexts will be preserved during duration of the process. Complete test runs for SslStream creates 13 instance, in ideal case there will be one e.g. SslProtocols.None. e.g. it depends on unique combinations of SslProtocols used by the client.

To make it more complicated, OpenSSL pushes responsibility for setting the sessions to caller e.g. SslStream.
So we use the TargetHost e.g. SNI host to lookup previous session. Since there may be multiple series running on same name this is not perfect. But since this is just offer to the server, it is up to the server to decide if the session is valid or not. Based on my testing, this is how SCHANNEL operates. Some implementations use IP/Port info for the lookup but SslStream does not have access to it in general case. Even with that, this can be just load balancer and the SSL may be terminated or different servers. Once again, if the session is offered to "wrong" server or if the server was restored or cleaned up the session, it should proceed with full handshake.

The rest is mostly technicality. In order to process the sessions, we need to register callbacks and OpenSSL will inform us when it wants to add or remove session. There is some plumbing to map that to .NET and original TargetHost.
There can be multiple sessions for given host but the current strategy is to keep the last one.

I added some methods to SafeSslContextHandle to allocate cache and do operations. The GCHandle can possibly be stored in native SSL_CTX but I felt it would be easier for debug to keep it on managed side. Since the sessions are not directly used anywhere in managed code, I use IntPtr instead of SafeHandle.

Author: wfurt
Assignees: wfurt
Labels:

area-System.Net.Security, os-linux

Milestone: 7.0.0

Copy link
Member

@rzikm rzikm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, I am not confident in how TLS resumption works so it would be best if somebody else added their review as well

string? name = Marshal.PtrToStringAnsi(serverName);
if (!string.IsNullOrEmpty(name))
{
Interop.Ssl.SessionSetHostname(session, serverName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels really weird to me that SetHostname would be inside TryAddSession.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need something to find the session in removal. Associating name with it allows me to get the string and then do lookup. It would be great if we can come up with something that allows to lookup by both IntPtr and Name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure... it just feels like calling SetHostname'd be the responsibility of the caller. Doesn't it have to be done in the case where TryAddSession returns false?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ony only for the lookup - there is no functional difference. And I tried to hide all this inside the handle.
If we can lookup/remove the entry just from the IntPtr session, we would not need to do that. But I'm not sure if there is good way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put in comment and made changes to make it clear. I also moved complementing removal so both parts are done inside the SafeHandle.

{
Interop.Ssl.SessionSetHostname(session, serverName);

lock (_sslSessions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The locking around _sslSessions makes sense, since you're manipulating state depending on how the dictionary performed.

But, since you're already locking it, it feels like you want a non-Concurrent dictionary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. I was also thinking about grabbing extra reference on the session. That would allow me to use ConcurrentDictionary without locking as the session would never be released in the middle.
Do you have preference/recommendation @bartonjs ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grabbing extra reference on the session

Something like

Interop.Ssl.SslSessionUpRef(session);

if (!_sslSessions.TryAdd(...))
{
    // Undo the upref since it's not in the dictionary
    Interop.Ssl.SslSessionFree(session);
}

? (Upref inside has a race condition with the cleanup in ReleaseHandle)

That would get a little weird since in the cleanup you'd need to call free twice, I think?

The fact that we wrote ConcurrentDictionary suggests that it gives better perf (on average) than manual locking... but if the code to interact with it is doing memory/lifetime management and it becomes unreadable with the gymnastics... then locking is better for maintainability. (If it's clean code and more performant, than by all means use upref+ConcurrentDictionary)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. and then call free twice. I'm inclined to good with the lock and better maintainability as the perf does not depend on this. This happens one in while - not even for each SSL session. I started with ConcurrentDictionary but you are right - we don't need it at this point.

@wfurt
Copy link
Member Author

wfurt commented Feb 2, 2022

Test failures are related. It seems like bug in old OpenSSL. The ServerAsyncAuthenticate_SniSetVersion_Success will first negotiate Tls1.1. And when we offer ticket in next round (the client side did not change e.g. should advertise 1.1 and 1.2) but the negotiation fails because the client is trying to restore only 1.1. But the server is now configured to only use 1.2 and that fails with TLS Alert.

       public async Task ServerAsyncAuthenticate_SniSetVersion_Success(SslProtocols version)
        {
            var serverOptions = new SslServerAuthenticationOptions() { ServerCertificate = _serverCertificate, EnabledSslProtocols = version };
            var clientOptions = new SslClientAuthenticationOptions() { TargetHost = _serverCertificate.GetNameInfo(X509NameType.SimpleName, forIssuer: false), EnabledSslProtocols = SslProtocols.Tls11 | SslProtocols.Tls12 };

the PAL shim also turned out more complicated for old OpenSSL than I really wanted.
I'm inclined to enable client resume only for 1.1.1. That should be OK for all modern distributions (including current RH7), leaving behind Ubuntu 16 that is out of support anyway according to https://wiki.ubuntu.com/Releases

Any thoughts on that @bartonjs? (and support for 1.0.1 in general?)

@bartonjs
Copy link
Member

bartonjs commented Feb 2, 2022

Any thoughts on that @bartonjs? (and support for 1.0.1 in general?)

The OpenSSL project no longer considers 1.0.x to be in support. Ubuntu 16.04 ESM, and other distros older than 2018 that are in a mode other than full-EoL, presumably is custom-patching their source base to try to maintain parity with 1.1.1's security patches.

Ubuntu 18.04 for x86 and x64 upgraded from 1.1.0 to 1.1.1 in Dec 2018. https://packages.ubuntu.com/bionic/libssl1.1 says that other architectures of Ubuntu 18.04 still use 1.1.0... but I don't know what our support story is for .NET on any of those architectures on 18.04.

That said, I think it's fine to say that a feature only works on an OS with a new-enough version of OpenSSL.

@wfurt
Copy link
Member Author

wfurt commented Feb 4, 2022

This should be ready for another review pass with following changes

  • I sprinkled more Asserts and comments through the code
  • Removed ConcurrentDictionary for sessions
  • no resume on old OpenSSL versions (may be revisited in future but the impact seems minimal)

Copy link
Member

@rzikm rzikm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only minor suggestions.

@wfurt
Copy link
Member Author

wfurt commented Feb 18, 2022

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@karelz
Copy link
Member

karelz commented Feb 22, 2022

@bartonjs can you please finalize your Code Review? Is there something you consider blocking?

Comment on lines 603 to 604
if (newSessionCb != NULL || removeSessionCb != NULL)
if (newSessionCb != NULL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logically nested (but not indented) if is really just the latter condition. One of these two lines should be deleted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed 603. I probably started something I end up not needing.


const char* CryptoNative_SslSessionGetHostname(SSL_SESSION *session)
{
#ifdef SSL_SESSION_get0_hostname
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe the ifdef will evaluate to true on a non-portable build. (Since it's a function identifier then)

This should probably instead be based on the NEED_OPENSSL_x_y values.

Try building with ./build-native.sh -portableBuild=false and see if it works as expected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did build on Ubuntu 16.04 with OpenSSL 1.0.1e with portableBuild true and false. And it did not fail. Same on Ubuntu 20.04 with OpenSSL 1.1.1.
Would NEED_OPENSSL_1_1 cover also 3.0? I can still change that if you prefer it.

Copy link
Member

@bartonjs bartonjs Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did build on Ubuntu 16.04 with OpenSSL 1.0.1e with portableBuild true and false. And it did not fail.

That's... really surprising. Looking at the headers over time, it looks like it was added in 1.1.0, and has always been a function (vs a macro acting like a function), so what I'd expect is

1.0.1e in non-portable:

The compile doesn't see any definition of SSL_SESSION_get0_hostname at all, turns this into #if false and removes the code.

1.1.1 in non-portable:

SSL_SESSION_get0_hostname is a function, so the preprocessor doesn't see it, so this still turns into #if false and goes away.

In both cases the compile would succeed, but I'd expect 1.1.1-direct to fail to run (remember that ./build-native.sh doesn't update the testhost, so you need to manually copy over it or make it a symlink -- or just load up the compile output in a debugger and see if there's a function body or not)

Would NEED_OPENSSL_1_1 cover also 3.0?

No, you'd need #if defined NEED_OPENSSL_1_1 || defined NEED_OPENSSL_3_0

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right. updated. It is interesting that we don't really have any test coverage for non-portable builds, right? (aside from compilation)

@wfurt
Copy link
Member Author

wfurt commented Mar 24, 2022

The build stuff should be resolved @bartonjs. can you please take another look?

Comment on lines +725 to +729
return 1;
}

// OpenSSL will destroy session.
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these interesting from a logging perspective (if so, that can easily be in a future change)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about logging but there are conditions that make it normal. If we hit this, there should be no functional change as we simply won't do the caching & resume.

@wfurt wfurt merged commit f8d5706 into dotnet:main Mar 25, 2022
@wfurt wfurt deleted the tlsClientResume branch March 25, 2022 22:08
radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request Mar 30, 2022
* TLS resume on client

* shim functions introduced in 1.1.1

* add missing struct

* disable resume on old OpenSSL

* feedback from review

* fix source build

* update comment

* feedback from review

* feedback from review

* avoild client resume on old OpenSSL
@ghost ghost locked as resolved and limited conversation to collaborators Apr 25, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Net.Security os-linux Linux OS (any supported distro)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants