Change SocketsHttpHandler's default ConnectTimeout to a reasonable finite value #66297

CarnaViire · 2022-03-07T15:31:48Z

Windows will timeout a connection attempt after ~21s. Linux apparently waits much longer – around 3 minutes or more.

The connection pooling changes in 6.0 made it so that a pending connection is not associated with a specific request. As such, request cancellation no longer causes pending connections to be canceled. We do not impose any connect timeout by default (DefaultConnectTimeout = Timeout.InfiniteTimeSpan), so we are now exposed to Linux’s long connection timeout. In case of small request timeouts, this may lead to a potentially big number of requests timing out as they try to wait/reuse the same "pending" connection.

We should set a default ConnectTimeout to something reasonably small. We should also consider backporting that to 6.0.

I propose to unify Windows and Linux timeouts, i.e. set DefaultConnectTimeout to 21s.

The text was updated successfully, but these errors were encountered:

ghost · 2022-03-07T15:31:53Z

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Windows will timeout a connection attempt after ~21s. Linux apparently waits much longer – around 3 minutes or more.

The connection pooling changes in 6.0 made it so that a pending connection is not associated with a specific request. As such, request cancellation no longer causes pending connections to be canceled. We do not impose any connect timeout by default (DefaultConnectTimeout = Timeout.InfiniteTimeSpan), so we are now exposed to Linux’s long connection timeout. In case of small request timeouts, this may lead to a potentially big number of requests timing out as they try to wait/reuse the same "pending" connection.

We should set a default ConnectTimeout to something reasonably small. We should also consider backporting that to 6.0.

I propose to unify Windows and Linux timeouts, i.e. set DefaultConnectTimeout to 21s.

Author:	CarnaViire
Assignees:	-
Labels:	`area-System.Net.Http`, `untriaged`
Milestone:	-

stephentoub · 2022-03-08T01:28:06Z

This seems reasonable. Technically it's a breaking change, but if a connection is taking that long to be established, it's likely going to fail, anyway. There's also value in consistency across platforms.

Is the exception that occurs in the case of a timeout due to ConnectTimeout the same that occurs in response to the OS timing out a connection?

CarnaViire · 2022-03-08T12:36:32Z

Is the exception that occurs in the case of a timeout due to ConnectTimeout the same that occurs in response to the OS timing out a connection?

Unfortunately, no.

If OS did the timeout, the error would bubble up from sockets, so we would see HttpRequestException with inner SocketException ((110): Connection timed out on Linux and (10060): A connection attempt failed because the connected party did not properly respond after a period of time.... on Windows)

If it was our timeout, we would see TaskCanceledException with inner TimeoutException: A connection could not be established within the configured ConnectTimeout.

geoffkizer · 2022-03-08T17:11:34Z

Perhaps we should use 15s. That means that typically we'd do the timeout ourselves on both platforms. It's also a reasonably round number as compared to 21s.

MihaZupan · 2022-03-08T18:12:34Z

Since we try connecting to all IPs returned by DNS (currently in serial), would changing the ConnectTimeout to something lower than the OS's timeout change the behavior so that we may no longer end up trying different IPs?

CarnaViire · 2022-03-09T11:34:13Z

Since we try connecting to all IPs returned by DNS (currently in serial), would changing the ConnectTimeout to something lower than the OS's timeout change the behavior so that we may no longer end up trying different IPs?

You are right 😢 and we currently don't have any per-IP timeout for multi-connect in Sockets (only a cancellation token for the whole operation) -- should we consider some new API like that?

ManickaP · 2022-03-09T14:30:09Z

would changing the ConnectTimeout to something lower than the OS's timeout change the behavior so that we may no longer end up trying different IPs?

Only in case of Windows. In case of Linux, we're not losing anything, since the OS timeout ends up being ~2 minutes. The user can always change it back to infinite if this proves problematic. They can also implement their own ConnectCallback.

Eventually we could do something like happy eyeballs and try to connect to multiple IPs in parallel (related #26177).

wfurt · 2022-03-09T18:35:06Z

Even with eyeballs, you can have multiple addresses for the same family. And only some of them may be unreachable/unresponsive. But I would feel in most cases callers would care about overall responsiveness instead of details for each individual address.

ManickaP · 2022-03-09T20:03:07Z

By "something like happy eyeballs" I meant trying the parallel connect, even to multiple addresses of the same family at the same time. This is just an idea so please do not crucify me on all the details and pitfalls. I'm aware this wouldn't be as easy as running n connects in n tasks and seeing which finishes first. It would need to be carefully designed to not to drag down perf, hog ephemeral ports etc.

CarnaViire · 2022-03-14T13:37:11Z

Linking the existing issue containing API for socket's happy-eyeball-like parallel connect #33418
We might want to revisit it and add more flexible options, e.g. not connecting to all IPs at once, but rather try to connect to the first IP, wait for 5s, and if it doesn't connect, try to connect to the second one without cancelling the first one, etc.

CarnaViire · 2022-03-14T18:34:37Z

Triage: change DefaultConnectTimeout to 15s in 7.0 (breaking change).
Multi-connect issues should be relatively rare. If it occurs, ConnectTimeout could be to set back to a big value. If we would see demand we would prioritize socket connect changes linked above.
For 6.0, consider adding telemetry to help troubleshoot hanged "pending" connection issue.

The additional trace should help troubleshoot hanged "pending" connection issues (i.e. the reasons why new connections were not created) in HttpConnectionPool. The issue manifests as growing number of timeouted requests. It is not easy to link a timeouted request (esp. with a small timeout) to a connection that was created 3 minutes ago but still is not connected. There are no traces of the state of HttpConnectionPool, e.g. number of pending connections, so it is unclear from existing traces why the pool decided not to create a connection. Related to #66297

The additional trace should help troubleshoot hanged "pending" connection issues (i.e. the reasons why new connections were not created) in HttpConnectionPool. The issue manifests as growing number of timeouted requests. It is not easy to link a timeouted request (esp. with a small timeout) to a connection that was created 3 minutes ago but still is not connected. There are no traces of the state of HttpConnectionPool, e.g. number of pending connections, so it is unclear from existing traces why the pool decided not to create a connection. Related to dotnet#66297

CarnaViire · 2022-04-28T11:43:10Z

I've put up a PR to revert this change #68649

The change proved to be problematic as-is, because

There is a high chance that the user code has a retry logic that does retry when SocketException is caught but does not retry in case of OperationCanceledException. This can lead to a spike of (unretried) exceptions which weren't there before. Example:
https://github.com/NuGet/NuGet.Client/blob/93bb3921490e9d499be1fb48d6cb9dcaff734db1/src/NuGet.Core/NuGet.Protocol/HttpSource/HttpRetryHandler.cs#L212-L263
The workaround with setting the ConnectTimeout back to infinity may not be available -- if HttpClientHandler is used instead of SocketsHttpHandler (e.g. due to reusing code between .NET Framework and .NET Core/.NET 5+), or if the code that executes the requests is hidden inside a library.

I'm reopening the issue for further discussion.

CarnaViire · 2022-04-28T15:54:46Z

Triage: we should address the issue with long OS timeouts in 7.0 in one way or another (retaining the old exception type, tuning OS timeouts, reworking pooling logic in some way).

pentp · 2022-05-03T18:36:32Z

As a SocketsHttpHandler user I would find it very surprising to get a TaskCanceledException for something I haven't canceled. I would expect either SocketException or TimeoutException.

karelz · 2022-05-10T16:45:45Z

Triage: Let's separate 2 parts here:

nuget (and potentially other customers) was broken. We may break them again, but we need to document it properly.
- The remaining risk is that someone doesn't like 15s timeout and cannot change it easily (e.g. because they target .NET Standard) -- ugly workaround for that is Reflection.
Which exception to throw to be consistent with our stack -- let's look at recent changes we did in the space and align with them. We may decide to follow that pattern (TaskCanceledException with inner TimeoutException), or fake SocketsException with inner TimeoutException, or start throwing directly TimetoutException.

CarnaViire · 2022-05-24T16:11:04Z

Which exception to throw to be consistent with our stack -- let's look at recent changes we did in the space and align with them. We may decide to follow that pattern (TaskCanceledException with inner TimeoutException), or fake SocketsException with inner TimeoutException, or start throwing directly TimetoutException.

We already follow the pattern "TaskCanceledException with inner TimeoutException" in both places:
For full-request-timeout

runtime/src/libraries/System.Net.Http/src/System/Net/Http/HttpClient.cs

Lines 607 to 612 in 215b39a

    
           else if (!pendingRequestsCts.IsCancellationRequested) 
        
           { 
        
               // If this exception is for cancellation, but cancellation wasn't requested, either by the caller's token or by the pending requests source, 
        
               // the only other cause could be a timeout.  Treat it as such. 
        
               e = toThrow = new TaskCanceledException(SR.Format(SR.net_http_request_timedout, _timeout.TotalSeconds), new TimeoutException(e.Message, e), oce.CancellationToken); 
        
           }

and for connect timeout

runtime/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionPool.cs

Lines 435 to 443 in 975aa54

    
           private static Exception CreateConnectTimeoutException(OperationCanceledException oce) 
        
           { 
        
               // The pattern for request timeouts (on HttpClient) is to throw an OCE with an inner exception of TimeoutException. 
        
               // Do the same for ConnectTimeout-based timeouts. 
        
               TimeoutException te = new TimeoutException(SR.net_http_connect_timedout, oce.InnerException); 
        
               Exception newException = CancellationHelper.CreateOperationCanceledException(te, oce.CancellationToken); 
        
               ExceptionDispatchInfo.SetCurrentStackTrace(newException); 
        
               return newException; 
        
           }

There was a huge discussion in #21965 on what is better to do on timeouts, which ended up introducing this specific pattern. In general, we didn't want to change the exception type as historically TCE was thrown, and we just added a way to programmatically check whether it was a timeout (by putting TimeoutException inside)

Re mimicking SocketException, it is a bit funny given that we make sure we throw OCE and not SocketException on cancellation in Sockets 😄

runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.Tasks.cs

Lines 120 to 124 in 975aa54

    
           catch (SocketException se) when (se.SocketErrorCode == SocketError.OperationAborted) 
        
           { 
        
               cancellationToken.ThrowIfCancellationRequested(); 
        
               throw; 
        
           }

But there's no way on Socket level to say that the token signifies the timeout though.
I'm also not sure it would be ok to break existing behavior of ConnectTimeout, e.g. someone might depend that TCE was thrown on it?

CarnaViire · 2022-05-31T12:39:17Z

We discussed it within the team, and we don't see a solution that would not be breaking in one way or another.

The least breaking seem to be to simulate SocketException. This has more probability to be already retried by the user code, and that was the "default" behavior on Windows anyway, just after a bit longer interval (21s). This will bring parity between OSes in a more expected way. But we will effectively change existing ConnectTimeout property behavior.
The other option is to bring back the initial PR as is (with just changing the default ConnectTimeout value) and document this as a breaking change properly, because we are already "following the pattern". This might be troublesome for all the reasoning already brought up when we reverted the PR (users not handling TCE properly).

Re following pattern of TCE with inner TimeoutException -- first, it was actually added to ConnectTimeout only in 6.0, before it was just plain TCE. And the exception nesting is a bit of a mess now (TCE->TimeoutException->TCE->TimeoutException, and it also has a misleading message, but there's an issue to fix that #67505) — the bottom line is I don't feel like a lot of people are depending on that.... and other point that was brought on team discussion is that we need to have a way to tell the difference between timeout on Connect and overall request timeout, and we can't do that in a good way now.

halter73 · 2022-06-01T03:40:42Z

I vote for simulating a SocketException with an inner TimeoutException if it's the socket that failed to connect within the timeout in order to more easily distinguish from a reqular request timeout. I usually don't really care if it's the OS or the framework that timed out the socket connection, but the message should make this clear.

it is a bit funny given that we make sure we throw OCE and not SocketException on cancellation in Sockets 😄

If the cancellationToken fires before the timeout it should still be an OCE particularly if this is what could have caused the "timeout". I realize this is racy if in the background.

antonfirsov · 2022-06-01T13:31:48Z

SocketException as a subtype of Win32Exception, which is a type that indicates an occurrence of a low level error reported by the OS. Throwing such an exception when there is no low level error directly linked to the request is not what I would expect from a framework. This would look even weirder when ConnectCallback is overridden in a way that there are no sockets involved at all.

stephentoub · 2022-06-01T14:37:29Z

As I mentioned offline, I continue to think that this is working around a symptom rather than addressing the root of the problem.

The core problem stems from a change made in .NET 6. When a request arrives and there’s no connection immediately available in the pool and there’s no MaxConnectionsPerServer limit set that’s been reached, a new connection will be created. In .NET 5 and earlier, that connection was dedicated to this specific request and vice versa: this request would be made on this connection and only this connection once it was ready. The downside to this is, if it takes a while to establish that connection and another connection becomes available in the meantime, that request would continue waiting for the connection it spawned. The upside to this, however, is if the connection took a long time to be established (or took so long to be established that it timed out), the only request it would harm (again assuming no MaxConnectionsPerServer limit reached) is the one request with which it was associated. In .NET 6, this behavior was changed, improving the downside but harming the upside. When a request arrives and there’s no connection available (including one that's pending), a new one is still immediately spawned, but it’s no longer associated with only that request and that request is no longer associated only with it. If another connection becomes available in the meantime, that newly-available connection can service the request. That improves the aforementioned downside. But now we have this pending connection that’s been spawned, and so when the next request arrives, it’ll see there’s already a connection pending and will try to use that one rather than spawning another. If this connection will never become usable for whatever reason, any requests that associate with it will effectively stall as well, such that this one request can now take out multiple requests rather than just one. The discussed ConnectTimeout change is an attempt to minimize the impact of this by lessening the default time that connection might be pending and thus lessening the number of requests that associate with it, but there are other ways to address this. Note as well that a relatively recent .NET 7 change made it so that when the connection eventually fails, it only fails the request that originally spawned it; if that request has already moved on or completed or been canceled, the connection’s failure ends up being silent (other than logging).

For example, we could change the pooling to avoid considering a pending connection as one that can be used for an incoming request; if a new request arrives and there’s no immediately usable connection to service the request, create a new one. If we’re worried about a backlog of pending connections to a bad server building up, we can have some internal limit on how many such pending connections are allowed. (If there is a MaxConnectionsPerServer and it’s been reached, then we wait just as we always did). Instead or in addition, we could also lower the connection timeout once the connection is no longer associated with the original request, since it’s not going to fail anyone else anyway.

CarnaViire · 2022-06-01T15:31:33Z

@halter73 In the end, on team discussion we decided agains SocketException. The main point is what @antonfirsov said - that ConnectCallback can change the transport entirely, so SocketException would be unexpected.
I agree with what @stephentoub says. We should try to overcome the problem we brought in our 6.0 connection pooling change. This won't be an easy change though. We might still end up going with the breaking change of 15s ConnectTimeout instead or in addition to it.

The ideas were:

Always spawn a new connection if none were available - within connection limits (in line with .NET 5 behavior) -- what Stephen said above. But this will not help folks that explicitly use MaxConnectionPerServer=1
"Deprioritize" pending connection after some time (stop counting it against request queue size). As with the one above, this will not help if MaxConnectionPerServer=1
With a shorter connect timeout, retry connection creation, e.g. 3 times, while it is still within the originating request's timeout.
Somehow apply originating request's timeout to connection creation
If the connection was not connected after originating request is completed, tear it down right away or after some short timeout
If we end up with timed out requests while the connection was pending, tear down such connection

antonfirsov · 2022-06-23T19:35:45Z

To me it looks like a combination of the following two mitigations would bring back the upsides of the old 5.0 connection and request cancellation behavior without hurting the benefits introduced in 6.0, and without the need of introducing an arbitrary ConnectTimeout value or other intrusive/breaking changes:

Spawn a new connection if none is available - within limits, as recommended in Stephen's comment.

This will not help the MaxConnectionPerServer=1 case, therefore also:

Link the connection's cancellation token with the cancellation token of the request that initiated it.

In case of MaxConnectionPerServer == 1 this would practically result in a behavior that is identical to 5.0: there could be only one connection attempt, which is always associated with the first request in the queue. If the request is cancelled, the connection attempt is also cancelled, they are failing together.

In case of MaxConnectionPerServer > 1 this would help against the accumulation of too many pending connection attempts to badly behaving servers. Thanks to #62935, a connection failure/cancellation should never fail an unrelated request. In fact, a connection failure helps serving follow-up requests, since in most cases it will spawn a new connection attempt. This would be also true if the cancellation happens because of the cancellation of the original unrelated request, therefore I see only positive effects from this change.

Note that point 2. alone does not solve the case when an earlier request with a long timeout creates a pending faulty connection that is stalling a later requests with a short timeout. From this perspective, this is why I'm recommending to combine it with more aggressive connection spawning.

@stephentoub @dotnet/ncl I wonder if I'm missing something or if you have some concerns regarding this idea?

stephentoub · 2022-06-24T15:20:54Z

therefore I see only positive effects from this change.

The downsides I see to that are:

If a request is made and then quickly canceled, we'll end up tearing down a connection that could be perfectly usable almost immediately. We had the same issue in .NET 5, but in theory the pooling changes for .NET 6 fixed that, whether it was actually an issue in real workloads or not.
If another connection handles a request that spawned a new connection, and that connection takes a long time to be established, it'll be holding on to the resources associated with the initial request's CancellationToken. That might not be a big deal, but it's something to consider.
If the initial request completes successfully due to another connection picking it up, and then after it completes successfully its token is canceled, that could again end up tearing down a connection that would have otherwise been usable soon thereafter.

It might be the right tradeoffs, of course. Just pointing out it's not without possible issue.

antonfirsov · 2022-06-24T17:51:30Z

We can make the linking of the request and connection cancellation conditional, and only do it if we cannot spawn new "extra connections" anymore, eg. because we are approaching the pool limit. (Meaning it would be the default behavior with MaxConnectionPerServer == 1).

This would mean that we would be able to handle problematic cases and constrained pools without hurting the performance of the general, optimistic path.

…tes (#71785) Resolves #66297

dotnet-issue-labeler bot added area-System.Net.Http untriaged New issue has not been triaged by the area owner labels Mar 7, 2022

wfurt removed the untriaged New issue has not been triaged by the area owner label Mar 8, 2022

wfurt added this to the 7.0.0 milestone Mar 8, 2022

This was referenced Mar 14, 2022

Add additional trace to HttpConnectionPool #66605

Merged

Set SocketsHttpHandler's default connect timeout to 15s #66607

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 14, 2022

stephentoub closed this as completed in #66607 Mar 15, 2022

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Mar 15, 2022

MihaZupan mentioned this issue Mar 21, 2022

HttpConnectionPool.RequestQueue leaking items #66673

Open

ghost locked as resolved and limited conversation to collaborators Apr 14, 2022

CarnaViire mentioned this issue Apr 28, 2022

Revert "Set SocketsHttpHandler's default connect timeout to 15s (#66607)" #68649

Merged

CarnaViire reopened this Apr 28, 2022

CarnaViire added the untriaged New issue has not been triaged by the area owner label Apr 28, 2022

dotnet unlocked this conversation Apr 28, 2022

CarnaViire removed the untriaged New issue has not been triaged by the area owner label Apr 28, 2022

MihaZupan mentioned this issue Apr 28, 2022

Consider setting a finite ConnectTimeout by default microsoft/reverse-proxy#1678

Closed

MihaZupan mentioned this issue May 19, 2022

Set infinite ConnectTimeout for HttpClient benchmark aspnet/Benchmarks#1738

Merged

karelz added the enhancement Product code improvement that does NOT require public API changes/additions label Jun 14, 2022

antonfirsov self-assigned this Jun 20, 2022

antonfirsov mentioned this issue Jul 7, 2022

Tear down pending HTTP connection when the originating request completes #71785

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 7, 2022

antonfirsov closed this as completed in #71785 Jul 26, 2022

antonfirsov added a commit that referenced this issue Jul 26, 2022

Tear down pending HTTP connection when the originating request comple…

1729ca6

…tes (#71785) Resolves #66297

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 26, 2022

ghost locked as resolved and limited conversation to collaborators Aug 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change SocketsHttpHandler's default ConnectTimeout to a reasonable finite value #66297

Change SocketsHttpHandler's default ConnectTimeout to a reasonable finite value #66297

CarnaViire commented Mar 7, 2022

ghost commented Mar 7, 2022

stephentoub commented Mar 8, 2022

CarnaViire commented Mar 8, 2022

geoffkizer commented Mar 8, 2022

MihaZupan commented Mar 8, 2022

CarnaViire commented Mar 9, 2022

ManickaP commented Mar 9, 2022

wfurt commented Mar 9, 2022

ManickaP commented Mar 9, 2022

CarnaViire commented Mar 14, 2022

CarnaViire commented Mar 14, 2022

CarnaViire commented Apr 28, 2022

CarnaViire commented Apr 28, 2022 •

edited

Loading

pentp commented May 3, 2022

karelz commented May 10, 2022

CarnaViire commented May 24, 2022

CarnaViire commented May 31, 2022

halter73 commented Jun 1, 2022

antonfirsov commented Jun 1, 2022 •

edited

Loading

stephentoub commented Jun 1, 2022 •

edited

Loading

CarnaViire commented Jun 1, 2022

antonfirsov commented Jun 23, 2022

stephentoub commented Jun 24, 2022

antonfirsov commented Jun 24, 2022

Change SocketsHttpHandler's default ConnectTimeout to a reasonable finite value #66297

Change SocketsHttpHandler's default ConnectTimeout to a reasonable finite value #66297

Comments

CarnaViire commented Mar 7, 2022

ghost commented Mar 7, 2022

stephentoub commented Mar 8, 2022

CarnaViire commented Mar 8, 2022

geoffkizer commented Mar 8, 2022

MihaZupan commented Mar 8, 2022

CarnaViire commented Mar 9, 2022

ManickaP commented Mar 9, 2022

wfurt commented Mar 9, 2022

ManickaP commented Mar 9, 2022

CarnaViire commented Mar 14, 2022

CarnaViire commented Mar 14, 2022

CarnaViire commented Apr 28, 2022

CarnaViire commented Apr 28, 2022 • edited Loading

pentp commented May 3, 2022

karelz commented May 10, 2022

CarnaViire commented May 24, 2022

CarnaViire commented May 31, 2022

halter73 commented Jun 1, 2022

antonfirsov commented Jun 1, 2022 • edited Loading

stephentoub commented Jun 1, 2022 • edited Loading

CarnaViire commented Jun 1, 2022

antonfirsov commented Jun 23, 2022

stephentoub commented Jun 24, 2022

antonfirsov commented Jun 24, 2022

CarnaViire commented Apr 28, 2022 •

edited

Loading

antonfirsov commented Jun 1, 2022 •

edited

Loading

stephentoub commented Jun 1, 2022 •

edited

Loading