
HttpConnectionPool.RequestQueue leaking items #66673

Open
davidni opened this issue Mar 15, 2022 · 14 comments
Labels
area-System.Net.Http · enhancement · tenet-performance

Comments

@davidni
Contributor

davidni commented Mar 15, 2022

Description

While investigating a large increase in memory usage in one of our services running on .NET 6.0.2, I examined a memory dump and found an HttpConnectionPool containing a RequestQueue object with a very large number of items (>10k). This was problematic because those items were holding on to HttpRequestMessage objects, which in turn had large header payloads, ultimately leading to excessive memory consumption for the process.

Interestingly, looking at telemetry, I see no evidence of ever having more than ~500 concurrent outstanding requests in total in the entire process, across all of our connection pools. So it seems odd, to say the least, that a single connection pool would be keeping track of this many queued requests.

I haven't figured out why these requests are in the queue and/or why they aren't draining. We can share the memory dump internally if that would help.

0:000> !do 00000166497AA310
Name:        System.Collections.Generic.Queue`1[[System.Net.Http.HttpConnectionPool+RequestQueue`1+QueueItem[[System.Net.Http.Http2Connection, System.Net.Http]], System.Net.Http]]
MethodTable: 00007ff8a9336fe0
EEClass:     00007ff8a8c8b4e0
Tracked Type: false
Size:        40(0x28) bytes
File:        D:\SvcFab\_App\CoreServices.IslandGatewayAppType_App42\IslandGatewayPkg.Code.1.53.01880.147-release-217a0a55-net6\System.Private.CoreLib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff8a8ca8238  400209c        8 ...ivate.CoreLib]][]  0 instance 00000166579f3410 _array
00007ff8a7b69480  400209d       10         System.Int32  1 instance                0 _head
00007ff8a7b69480  400209e       14         System.Int32  1 instance            11233 _tail
00007ff8a7b69480  400209f       18         System.Int32  1 instance            11233 _size
00007ff8a7b69480  40020a0       1c         System.Int32  1 instance            11842 _version

Reproduction Steps

Unclear

Expected behavior

HttpConnectionPool shouldn't be holding on to queued requests that aren't in flight.

Actual behavior

HttpConnectionPool seems to hold on to requests that aren't in flight, leading to excessive memory usage.

Regression?

We did not notice this specifically on .NET Core 3.1, but it is possible the issue was present there too.

Known Workarounds

N/A

Configuration

  • .NET 6.0.2
  • Microsoft Windows 10.0.17763
  • RuntimeIdentifier: win10-x64

Other information

No response

@dotnet-issue-labeler bot added the area-System.Net.Http and untriaged labels Mar 15, 2022
@ghost

ghost commented Mar 15, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.


@karelz
Member

karelz commented Mar 15, 2022

@CarnaViire could it be related to your recent investigations in the space?

@CarnaViire
Member

CarnaViire commented Mar 15, 2022

could it be related to your recent investigations in the space?

I don't think so. This issue is on Windows and seems to involve HTTP/2, while the cases I was investigating were timeouts on Linux for HTTP/1.1.

@davidni what is the typical behavior of your app (in terms of HTTP requests)? Do you only use HTTP/2, or HTTP/1.1 as well? What timeouts do you have set, if any? Are any exceptions happening, or all requests succeed, but the queue keeps growing?

@davidni
Contributor Author

davidni commented Mar 15, 2022

what is the typical behavior of your app (in terms of HTTP requests)?

We use YARP to proxy requests to multiple (1k+) destinations.


Do you only use HTTP/2, or HTTP/1.1 as well?

In this case, HTTP/2 only. We use both in general, but looking at the dump, this particular connection pool was only handling HTTP/2 requests, and the queue that grew was the _http2RequestQueue.


What timeouts do you have set, if any? Are any exceptions happening, or all requests succeed, but the queue keeps growing?

  • A 30-second cancellation token when calling SocketsHttpHandler.SendAsync (sketched after the exception excerpts below)
  • We were indeed seeing lots of exceptions while the queue was growing (timeouts as well as transport errors). Seeing exceptions like this isn't unexpected, though, since we proxy requests to a lot of destinations and some of them can be unavailable at any point. Some of the top-hitter exceptions are shown below:
IslandGateway.Core.Service.ErrorHandling.ProxyException: YARP error YarpRequest
System.Net.Http.HttpRequestException: No connection could be made because the target machine actively refused it. ([REDACTED])
System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it.

IslandGateway.Core.Service.ErrorHandling.ProxyException: YARP error YarpRequest
System.Net.Http.HttpRequestException: The SSL connection could not be established, see inner exception.
System.IO.IOException:  Received an unexpected EOF or 0 bytes from the transport stream.
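
For context, here is a minimal sketch of how a per-request 30-second cancellation like the one described above might be wired up over SocketsHttpHandler. This is illustrative only (not the production code from this report); the destination URL is a placeholder.

using System;
using System.Net;
using System.Net.Http;
using System.Threading;

// Illustrative sketch: a 30s per-request cancellation budget when sending
// through HttpMessageInvoker over SocketsHttpHandler.
var handler = new SocketsHttpHandler();
var invoker = new HttpMessageInvoker(handler);

var request = new HttpRequestMessage(HttpMethod.Get, "https://backend.example/") // placeholder destination
{
    Version = HttpVersion.Version20,
    VersionPolicy = HttpVersionPolicy.RequestVersionExact
};

using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
try
{
    using HttpResponseMessage response = await invoker.SendAsync(request, cts.Token);
    // ... consume the response ...
}
catch (OperationCanceledException)
{
    // The request exceeded its 30s budget. As discussed below, such canceled
    // entries could remain in the pool's request queue until a connection
    // attempt completed.
}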

@karelz karelz added this to the 7.0.0 milestone Mar 17, 2022
@karelz karelz added the bug and tenet-performance labels and removed the untriaged label Mar 17, 2022
@karelz
Member

karelz commented Mar 17, 2022

Triage: We need to understand the root cause. Fix at minimum in 7.0, most likely also service 6.0.x.

@MihaZupan
Member

MihaZupan commented Mar 21, 2022

Before 6.0, cancelling a request would result in it being removed from the request queue immediately.
With the 6.0 connection pooling changes, the queue entry is marked as canceled, but it is only removed once an established connection starts processing the queue, or once a failed connection drops its initiating request.

From looking at the dump, the requests sitting in the queue are canceled and the connection attempt never completes. Since there is no connection, nothing cleans up canceled queue entries, resulting in the leak.

The main mitigation for this is to set SocketsHttpHandler.ConnectTimeout to a reasonable value. We're changing the default from infinite to 15s in 7.0 (#66297); you may choose to go even lower.
Doing so gives the pool a chance to retry the connection, potentially processing all the requests, or at least making some forward progress in draining the queue.
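
A minimal sketch of that mitigation (the 15s value mirrors the upcoming 7.0 default mentioned above; pick a value appropriate for your environment):

using System;
using System.Net.Http;

// Sketch: bound how long a connection attempt may take so the pool can retry
// and make forward progress draining the queue.
var handler = new SocketsHttpHandler
{
    ConnectTimeout = TimeSpan.FromSeconds(15) // default is infinite before .NET 7 (#66297)
};
var client = new HttpClient(handler);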


There is also a bug here in the connection pool that makes such leaks more common: when injecting a connection into the pool, we only check whether a request exists in the queue, not whether it has already been canceled. This means that we will issue a connection attempt even for already-canceled requests.
What's worse, a failed connection attempt will only remove the one initiating request from the queue before attempting to inject more connections.

As a result, you can end up in a situation like this (I hope I'm misreading something in the code):

  • Start 1000 HTTP/2 requests to a dead endpoint
  • The connection pool starts 1 connection attempt
  • The connection takes a long time or fails to establish
  • All 1000 requests time out and are canceled, but they remain in the queue
  • The connection pool observes a connection failure, removes the first request from the queue and initiates a new connection
  • The new connection similarly eventually fails, removes the first request from the queue and initiates a new connection
  • Repeat the above for every request, attempting 1000 connections for requests that are all already canceled

Not only are we keeping the requests and everything associated with them rooted for a very long time, but we are also wasting connection attempts on dead requests.
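
A rough repro sketch of that scenario, under the assumption of an unreachable endpoint (the address, request count, and timeout here are illustrative):

using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Sketch: many HTTP/2 requests to an unreachable destination, each canceled after a
// short timeout, with ConnectTimeout left at its (pre-7.0) infinite default.
var handler = new SocketsHttpHandler();
var invoker = new HttpMessageInvoker(handler);

var tasks = Enumerable.Range(0, 1000).Select(async _ =>
{
    var request = new HttpRequestMessage(HttpMethod.Get, "https://10.255.255.1/") // assumed non-routable address
    {
        Version = HttpVersion.Version20,
        VersionPolicy = HttpVersionPolicy.RequestVersionExact
    };
    using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
    try
    {
        using var response = await invoker.SendAsync(request, cts.Token);
    }
    catch
    {
        // Expected: cancellations/connection failures while the single pending
        // HTTP/2 connection attempt never completes, leaving entries in _http2RequestQueue.
    }
});

await Task.WhenAll(tasks);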

@CarnaViire
Member

Given that the default socket connect timeout on Windows is 21s, I'm afraid just setting ConnectTimeout to 15s wouldn't help that much. My hunch is that the exact case you've described in the second half is manifesting 😢 But in that case, setting EnableMultipleHttp2Connections to true should help, so that the requests wouldn't all wait for a single HTTP/2 connection to be established.
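
A sketch of that suggestion, extending the ConnectTimeout example above (values are assumptions, not recommendations from this thread):

using System;
using System.Net.Http;

// Sketch: allow additional HTTP/2 connections so requests are not all gated on a
// single handshake, combined with a bounded connect timeout as discussed above.
var handler = new SocketsHttpHandler
{
    EnableMultipleHttp2Connections = true,
    ConnectTimeout = TimeSpan.FromSeconds(15)
};
var client = new HttpClient(handler);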

On our side, we need to clean up the queue not only on successful connection attempts but on failed ones too, and maybe even more aggressively.

@MihaZupan
Member

MihaZupan commented Mar 21, 2022

My suspicion is that, for whatever reason, the connection took longer than 21 seconds to fail, since I saw that the queue had never moved at all in the dump. That is, the first connection attempt was still pending while 7k requests were added to and canceled on its queue.

EnableMultipleHttp2Connections will help once a connection is successfully established, but likely wouldn't mitigate the above problem since we allow at most 1 pending connection attempt for HTTP/2 at any given time (though we may change that in the future).

@ManickaP
Member

A mitigation might be to remove and throw away any cancelled requests in the TryPeek method when we check for connection injection, but this would only help up to the first "active" request in the queue.

Thinking more generally, do we need to keep request references directly in the queue? Do we even need them there - what are they used for after getting picked from the queue?

@MihaZupan
Member

MihaZupan commented Mar 21, 2022

Opened #66944 to track aggressively opening more connections for HTTP/2.

Mitigation might be to remove and throw away any cancelled requests in TryPeek method when we check for connection injection, but this will only help up to the first "active" request in the queue.

Yes, this wouldn't entirely solve the problem of keeping the requests rooted, but it would be a solution to the issue of issuing connection attempts for pre-canceled requests.

Thinking more generally, do we need to keep request references directly in the queue? Do we even need them there - what are they used for after getting picked from the queue?

They are currently used to determine whether to fail the first request in the queue on connection failures.
A connection attempt stores a reference to the first request in the queue. On failure, it checks whether the same request is still at the start, in which case it fails it to force forward progress.
We could probably change the mechanism here to store something other than the request (e.g. a timestamp/counter).
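
To illustrate that idea, here is a sketch only (not the actual HttpConnectionPool implementation, and the names are invented): a connection attempt captures a counter value instead of a reference to the head-of-queue request.

using System.Threading;

// Illustrative sketch; names do not correspond to the real HttpConnectionPool internals.
// A connection attempt captures a monotonic counter instead of the head-of-queue request,
// so canceled requests need not stay rooted just to detect forward progress.
internal sealed class RequestQueueProgressSketch
{
    private long _dequeueCount; // incremented each time the head of the queue moves

    // Captured when a connection attempt starts.
    public long CurrentStamp => Interlocked.Read(ref _dequeueCount);

    // Called whenever an item is dequeued (i.e. the queue made forward progress).
    public void OnDequeued() => Interlocked.Increment(ref _dequeueCount);

    // On connection failure: if no progress happened since the attempt began,
    // fail the request currently at the head of the queue to force forward progress.
    public bool MadeProgressSince(long stamp) => CurrentStamp != stamp;
}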

@MihaZupan
Member

MihaZupan commented Apr 13, 2022

The initial fixes that help with this issue have been merged (#66990, #66607).
A possible HTTP/2 mitigation by opening multiple connections at a time is tracked by #66944.

We could address this completely by removing queue entries as soon as the request is canceled (.NET 5 behavior).
Moving back to untriaged.

@karelz karelz closed this as completed Apr 19, 2022
@karelz karelz added this to the 6.0.x milestone Apr 19, 2022
@karelz karelz removed the untriaged label Apr 19, 2022
@MihaZupan
Member

MihaZupan commented Apr 19, 2022

It has been mitigated, but the underlying issue still exists: we're not removing entries immediately when they're canceled.

@MihaZupan MihaZupan removed this from the 6.0.x milestone Apr 19, 2022
@MihaZupan MihaZupan added the untriaged label Apr 19, 2022
@MihaZupan MihaZupan reopened this Apr 19, 2022
@karelz
Member

karelz commented Apr 21, 2022

Triage: Mitigated 2 ways:

The remaining work could still be done to reduce memory usage a bit further, but it no longer results in a memory leak.
Moving to Future.

@karelz karelz added this to the Future milestone Apr 21, 2022
@karelz karelz added the enhancement label and removed the bug and untriaged labels Apr 21, 2022
@davidni
Contributor Author

davidni commented Apr 21, 2022

Looks like #66990 is on track to ship in .NET 6.0.5; do we know when that will be released? Thanks all for your work on this and for providing a viable short-term mitigation!
