
HttpConnectionPool.RequestQueue leaking items #66673

Open
davidni opened this issue Mar 15, 2022 · 14 comments
Labels
area-System.Net.Http · enhancement · tenet-performance

Comments

@davidni
Contributor

davidni commented Mar 15, 2022

Description

While investigating a large increase in memory usage in one of our services running on .NET 6.0.2, I examined a memory dump and found an HttpConnectionPool containing a RequestQueue object with a very large number of items (>10k). This was problematic because those items were holding on to HttpRequestMessage objects, which in turn had large header payloads, ultimately leading to excessive memory consumption for the process.

Interestingly, looking at telemetry, I see no evidence of ever having more than ~500 concurrent outstanding requests in total in the entire process, across all of our connection pools. So it seems odd, to say the least, that a single connection pool would be keeping track of this many queued requests.

I haven't figured out why these requests are in the queue and/or why they aren't draining. We can share the memory dump internally if that would help.

0:000> !do 00000166497AA310
Name:        System.Collections.Generic.Queue`1[[System.Net.Http.HttpConnectionPool+RequestQueue`1+QueueItem[[System.Net.Http.Http2Connection, System.Net.Http]], System.Net.Http]]
MethodTable: 00007ff8a9336fe0
EEClass:     00007ff8a8c8b4e0
Tracked Type: false
Size:        40(0x28) bytes
File:        D:\SvcFab\_App\CoreServices.IslandGatewayAppType_App42\IslandGatewayPkg.Code.1.53.01880.147-release-217a0a55-net6\System.Private.CoreLib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff8a8ca8238  400209c        8 ...ivate.CoreLib]][]  0 instance 00000166579f3410 _array
00007ff8a7b69480  400209d       10         System.Int32  1 instance                0 _head
00007ff8a7b69480  400209e       14         System.Int32  1 instance            11233 _tail
00007ff8a7b69480  400209f       18         System.Int32  1 instance            11233 _size
00007ff8a7b69480  40020a0       1c         System.Int32  1 instance            11842 _version

Reproduction Steps

Unclear

Expected behavior

HttpConnectionPool shouldn't be holding on to queued requests that aren't in flight.

Actual behavior

HttpConnectionPool seems to hold on to requests that aren't in flight, leading to excessive memory usage.

Regression?

We did not notice this specifically on .NET Core 3.1, but it is possible the issue was present there too.

Known Workarounds

N/A

Configuration

  • .NET 6.0.2
  • Microsoft Windows 10.0.17763
  • RuntimeIdentifier: win10-x64

Other information

No response

@dotnet-issue-labeler bot added the area-System.Net.Http and untriaged labels Mar 15, 2022
@ghost

ghost commented Mar 15, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.


@karelz
Member

karelz commented Mar 15, 2022

@CarnaViire could it be related to your recent investigations in the space?

@CarnaViire
Member

CarnaViire commented Mar 15, 2022

could it be related to your recent investigations in the space?

I don't think so. This issue is on Windows and seems to involve HTTP/2, while the cases I was investigating were timeouts on Linux for HTTP/1.1.

@davidni what is the typical behavior of your app (in terms of HTTP requests)? Do you only use HTTP/2, or HTTP/1.1 as well? What timeouts do you have set, if any? Are any exceptions happening, or all requests succeed, but the queue keeps growing?

@davidni
Contributor Author

davidni commented Mar 15, 2022

what is the typical behavior of your app (in terms of HTTP requests)?

We use YARP to proxy requests to multiple (1k+) destinations.


Do you only use HTTP/2, or HTTP/1.1 as well?

In this case, HTTP/2 only. We use both in general, but looking at the dump, this particular connection pool was only handling HTTP/2 requests, and the queue that grew was the _http2RequestQueue.


What timeouts do you have set, if any? Are any exceptions happening, or all requests succeed, but the queue keeps growing?

  • A 30-second cancellation token when calling SocketsHttpHandler.SendAsync (sketched after the exception excerpts below)
  • We were indeed seeing lots of exceptions while the queue was growing (timeouts as well as transport errors). Seeing exceptions like this isn't unexpected, though, since we proxy requests to a lot of destinations and some of them can be unavailable at any point. Some of the top-hitter exceptions are shown below:
IslandGateway.Core.Service.ErrorHandling.ProxyException: YARP error YarpRequest
System.Net.Http.HttpRequestException: No connection could be made because the target machine actively refused it. ([REDACTED])
System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it.

IslandGateway.Core.Service.ErrorHandling.ProxyException: YARP error YarpRequest
System.Net.Http.HttpRequestException: The SSL connection could not be established, see inner exception.
System.IO.IOException:  Received an unexpected EOF or 0 bytes from the transport stream.
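
For context, here is a minimal sketch of how a per-request 30-second cancellation like the one described above might be wired up over SocketsHttpHandler. This is illustrative only (not the production code from this report); the destination URL is a placeholder.

using System;
using System.Net;
using System.Net.Http;
using System.Threading;

// Illustrative sketch: a 30s per-request cancellation budget when sending
// through HttpMessageInvoker over SocketsHttpHandler.
var handler = new SocketsHttpHandler();
var invoker = new HttpMessageInvoker(handler);

var request = new HttpRequestMessage(HttpMethod.Get, "https://backend.example/") // placeholder destination
{
    Version = HttpVersion.Version20,
    VersionPolicy = HttpVersionPolicy.RequestVersionExact
};

using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
try
{
    using HttpResponseMessage response = await invoker.SendAsync(request, cts.Token);
    // ... consume the response ...
}
catch (OperationCanceledException)
{
    // The request exceeded its 30s budget. As discussed below, such canceled
    // entries could remain in the pool's request queue until a connection
    // attempt completed.
}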

@karelz karelz added this to the 7.0.0 milestone Mar 17, 2022
@karelz karelz added the bug and tenet-performance labels and removed the untriaged label Mar 17, 2022
@karelz
Member

karelz commented Mar 17, 2022

Triage: We need to understand the root cause. Fix at minimum in 7.0, most likely also service 6.0.x.

@MihaZupan
Member

MihaZupan commented Mar 21, 2022

Before 6.0, cancelling a request would result in it being removed from the request queue immediately.
With the 6.0 connection pooling changes, the queue entry is marked as canceled, but it is only removed once an established connection starts processing the queue, or once a failed connection drops its initiating request.

From looking at the dump, the requests sitting in the queue are canceled and the connection attempt never completes. Since there is no connection, nothing cleans up canceled queue entries, resulting in the leak.

The main mitigation for this is to set SocketsHttpHandler.ConnectTimeout to a reasonable value. We're changing the default from infinite to 15s in 7.0 (#66297); you may choose to go even lower.
Doing so gives the pool a chance to retry the connection, potentially processing all the requests, or at least making some forward progress in draining the queue.
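
A minimal sketch of that mitigation (the 15s value mirrors the upcoming 7.0 default mentioned above; pick a value appropriate for your environment):

using System;
using System.Net.Http;

// Sketch: bound how long a connection attempt may take so the pool can retry
// and make forward progress draining the queue.
var handler = new SocketsHttpHandler
{
    ConnectTimeout = TimeSpan.FromSeconds(15) // default is infinite before .NET 7 (#66297)
};
var client = new HttpClient(handler);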


There is also a bug here in the connection pool that makes such leaks more common: when injecting a connection into the pool, we only check whether a request exists in the queue, not whether it has already been canceled. This means that we will issue a connection attempt even for already-canceled requests.
What's worse, a failed connection attempt will only remove the one initiating request from the queue before attempting to inject more connections.

As a result, you can end up in a situation like this (I hope I'm misreading something in the code):

  • Start 1000 HTTP/2 requests to a dead endpoint
  • The connection pool starts 1 connection attempt
  • The connection takes a long time or fails to establish
  • All 1000 requests time out and are canceled, but they remain in the queue
  • The connection pool observes a connection failure, removes the first request from the queue and initiates a new connection
  • The new connection similarly eventually fails, removes the first request from the queue and initiates a new connection
  • Repeat the above for every request, attempting 1000 connections for requests that are all already canceled

Not only are we keeping the requests and everything associated with them rooted for a very long time, but we are also wasting connection attempts on dead requests.
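
A rough repro sketch of that scenario, under the assumption of an unreachable endpoint (the address, request count, and timeout here are illustrative):

using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Sketch: many HTTP/2 requests to an unreachable destination, each canceled after a
// short timeout, with ConnectTimeout left at its (pre-7.0) infinite default.
var handler = new SocketsHttpHandler();
var invoker = new HttpMessageInvoker(handler);

var tasks = Enumerable.Range(0, 1000).Select(async _ =>
{
    var request = new HttpRequestMessage(HttpMethod.Get, "https://10.255.255.1/") // assumed non-routable address
    {
        Version = HttpVersion.Version20,
        VersionPolicy = HttpVersionPolicy.RequestVersionExact
    };
    using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
    try
    {
        using var response = await invoker.SendAsync(request, cts.Token);
    }
    catch
    {
        // Expected: cancellations/connection failures while the single pending
        // HTTP/2 connection attempt never completes, leaving entries in _http2RequestQueue.
    }
});

await Task.WhenAll(tasks);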

@CarnaViire
Member

Given that the default socket connect timeout on Windows is 21s, I'm afraid just setting ConnectTimeout to 15s wouldn't help that much. My hunch is that the exact case you've described in the second half is manifesting 😢 But in that case, setting EnableMultipleHttp2Connections to true should help, so that the requests wouldn't all wait for a single HTTP/2 connection to be established.
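
A sketch of that suggestion, extending the ConnectTimeout example above (values are assumptions, not recommendations from this thread):

using System;
using System.Net.Http;

// Sketch: allow additional HTTP/2 connections so requests are not all gated on a
// single handshake, combined with a bounded connect timeout as discussed above.
var handler = new SocketsHttpHandler
{
    EnableMultipleHttp2Connections = true,
    ConnectTimeout = TimeSpan.FromSeconds(15)
};
var client = new HttpClient(handler);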

On our side, we need to clean up the queue not only on successful connection attempts but on failed ones too, and maybe even more aggressively.

@MihaZupan
Member

MihaZupan commented Mar 21, 2022

My suspicion is that, for whatever reason, the connection took longer than 21 seconds to fail, since I saw that the queue had never moved at all in the dump. That is, the first connection attempt was still pending while 7k requests were added to and canceled on its queue.

EnableMultipleHttp2Connections will help once a connection is successfully established, but likely wouldn't mitigate the above problem since we allow at most 1 pending connection attempt for HTTP/2 at any given time (though we may change that in the future).

@ManickaP
Member

A mitigation might be to remove and throw away any cancelled requests in the TryPeek method when we check for connection injection, but this would only help up to the first "active" request in the queue.

Thinking more generally, do we need to keep request references directly in the queue? Do we even need them there - what are they used for after getting picked from the queue?

@MihaZupan
Member

MihaZupan commented Mar 21, 2022

Opened #66944 to track aggressively opening more connections for HTTP/2.

Mitigation might be to remove and throw away any cancelled requests in TryPeek method when we check for connection injection, but this will only help up to the first "active" request in the queue.

Yes, this wouldn't entirely solve the problem of keeping the requests rooted, but it would be a solution to the issue of issuing connection attempts for pre-canceled requests.

Thinking more generally, do we need to keep request references directly in the queue? Do we even need them there - what are they used for after getting picked from the queue?

They are currently used to determine whether to fail the first request in the queue on connection failures.
A connection attempt stores a reference to the first request in the queue. On failure, it checks whether the same request is still at the start, in which case it fails it to force forward progress.
We could probably change the mechanism here to store something other than the request (e.g. a timestamp/counter).
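
To illustrate that idea, here is a sketch only (not the actual HttpConnectionPool implementation, and the names are invented): a connection attempt captures a counter value instead of a reference to the head-of-queue request.

using System.Threading;

// Illustrative sketch; names do not correspond to the real HttpConnectionPool internals.
// A connection attempt captures a monotonic counter instead of the head-of-queue request,
// so canceled requests need not stay rooted just to detect forward progress.
internal sealed class RequestQueueProgressSketch
{
    private long _dequeueCount; // incremented each time the head of the queue moves

    // Captured when a connection attempt starts.
    public long CurrentStamp => Interlocked.Read(ref _dequeueCount);

    // Called whenever an item is dequeued (i.e. the queue made forward progress).
    public void OnDequeued() => Interlocked.Increment(ref _dequeueCount);

    // On connection failure: if no progress happened since the attempt began,
    // fail the request currently at the head of the queue to force forward progress.
    public bool MadeProgressSince(long stamp) => CurrentStamp != stamp;
}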

@MihaZupan
Member

MihaZupan commented Apr 13, 2022

The initial fixes that help with this issue have been merged (#66990, #66607).
A possible HTTP/2 mitigation by opening multiple connections at a time is tracked by #66944.

We could address this completely by removing queue entries as soon as the request is canceled (.NET 5 behavior).
Moving back to untriaged.

@karelz karelz closed this as completed Apr 19, 2022
@karelz karelz added this to the 6.0.x milestone Apr 19, 2022
@karelz karelz removed the untriaged label Apr 19, 2022
@MihaZupan
Member

MihaZupan commented Apr 19, 2022

It has been mitigated, but the underlying issue still exists: we're not removing entries immediately when they're canceled.

@MihaZupan MihaZupan removed this from the 6.0.x milestone Apr 19, 2022
@MihaZupan MihaZupan added the untriaged label Apr 19, 2022
@MihaZupan MihaZupan reopened this Apr 19, 2022
@karelz
Member

karelz commented Apr 21, 2022

Triage: Mitigated 2 ways:

The remaining work could still be done to reduce memory usage a bit further, but it no longer results in a memory leak.
Moving to Future.

@karelz karelz added this to the Future milestone Apr 21, 2022
@karelz karelz added the enhancement label and removed the bug and untriaged labels Apr 21, 2022
@davidni
Contributor Author

davidni commented Apr 21, 2022

Looks like #66990 is on track to ship in .NET 6.0.5; do we know when that will be released? Thanks all for your work on this and for providing a viable short-term mitigation!
