Occasional timeouts in HttpConnectionPool.ConnectAsync #63706
Description

We run a pretty busy service, let's call it the "gateway", which accepts inbound https calls and, based on some rules, issues outbound https requests to other services (let's call those "internal" services). The gateway is implemented in Asp.Net 5 and runs on Linux, deployed on Azure Kubernetes Service (AKS). The internal services run on various technologies, most but not all on .Net/Linux/AKS.

The gateway uses HttpClient for outbound requests. The HttpClientHandler's lifetime is managed with the IHttpClientFactory pattern. There is one shared client handler for all outbound requests to all internal services. The handler uses a client certificate for authentication. This client handler is recycled by the factory every 10 minutes.

The puzzling behavior is that occasionally we see failures to connect when the service makes outbound calls, which end in a timeout. I call them puzzling because:

- Upon failure, either a TaskCanceledException is thrown or, most often, an OperationCanceledException.
- The errors can be recognized by a sequence of methods in the call stack ("ConnectAsync" and related) which seems to indicate they happen during the connection phase of an outbound request.
- The gateway also receives other timeouts from the internal services, such as requests taking too long to process, but those produce a distinctively different call stack in the caller, and we can correlate them using correlation Ids in cross-service logs. Those actual request timeouts always have "SendAsync(...)" in the call stack instead of "ConnectAsync". The requests that fail with connection errors leave no trace on the internal service; no data ever makes it there.

I looked at the code of various internal services to see if there's an issue there, maybe some custom connection handler deadlock, but it's mostly just standard Kestrel stuff. I also took process dumps on one of the internal services to see if any threads are hanging in the connection stage; I found nothing so far.

How do we investigate this next? What can be the root cause? We could add retries, but this is problematic for several reasons. While this is not a huge issue, it's a constant behavior and it causes SLA issues for customers, so I'd very much like a solution. Where do I go from here?

Reproduction Steps

No specific repro steps - issue reproduces occasionally.

Expected behavior

No connection timeouts

Actual behavior

Occasional connection timeouts

Regression?

Does not seem to be a regression. The service was recently converted to .Net 5 from .Net Core 3.1, but the errors were seen before the conversion.

Known Workarounds

No response

Configuration

Other information

No response
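As an illustration only, the handler setup described in the Description might look roughly like this; the client name, certificate source, and registration site (`services` as an IServiceCollection) are assumptions, not the issue author's actual code:

```csharp
using System;
using System.Net.Http;
using System.Security.Cryptography.X509Certificates;
using Microsoft.Extensions.DependencyInjection;

// One named client, a shared handler with a client certificate,
// recycled by the factory every 10 minutes.
services.AddHttpClient("gateway-outbound")
    .SetHandlerLifetime(TimeSpan.FromMinutes(10))
    .ConfigurePrimaryHttpMessageHandler(() =>
    {
        var handler = new HttpClientHandler();
        // Hypothetical certificate source - substitute your own.
        handler.ClientCertificates.Add(new X509Certificate2("client.pfx", "password"));
        return handler;
    });
```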
Comments

Tagging subscribers to this area: @dotnet/ncl
TLS handshake is the most CPU-intensive part. You can also check the state of the thread pool to see if things are moving.
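For example, a quick way to snapshot thread pool state (these ThreadPool properties exist since .NET Core 3.0; the logging destination is illustrative):

```csharp
using System;
using System.Threading;

// Sustained growth in pending work items suggests thread pool starvation,
// which can delay async connection establishment and TLS handshakes.
Console.WriteLine($"Threads: {ThreadPool.ThreadCount}, " +
                  $"Pending: {ThreadPool.PendingWorkItemCount}, " +
                  $"Completed: {ThreadPool.CompletedWorkItemCount}");
```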
Triage: Packet capture will tell if it is external or caused by the (perhaps overloaded) system. Things that might help:
To clarify my original post - I realize the root cause may be hard to track down; the problem may be at the caller, at the callee, or somewhere in between - given that the network calls in Azure go through several layers of software load balancing, each of those could cause an issue. I don't think this is specifically a bug in the .Net HttpClient. What I would like is to mitigate these errors effectively, and that, I think, requires two things.
I considered port exhaustion on the caller side, but that seems unlikely - those failures are rare, and they never come in "clusters"; they are isolated while traffic flows just fine around them. For example, in the last 24 hours the "gateway" deployed in one busy Azure region has correctly handled 4.9 million requests (evenly split across 3 nodes) while failing to connect only 650 times (also split across nodes), to various destinations. @karelz - This also makes it next to impossible to run any packet capture software to identify the failing connections, but I'm running against the limits of my knowledge here; there surely are tools I'm not aware of. @wfurt - are you suggesting the caller switch from using HttpClientHandler to using SocketsHttpHandler? HttpClientHandler delegates work internally to SocketsHttpHandler, but it looks like some of the properties of SocketsHttpHandler are not exposed or used. The latter has a ConnectTimeout property which seems to be meant for controlling how long the handler will wait for a connection to be established. Any chance this could be exposed in future versions of HttpClientHandler? @wfurt - How would ConnectCallback help? I couldn't find any meaningful examples of how to use it to control how connections are established. Would a connection callback receive any meaningful information for such timeouts? Thanks.
This exists: SocketsHttpHandler.ConnectTimeout
My thought was to use the connect callback to log IP info so it is easier to correlate with packet captures. If packet captures are not possible, there may not be great benefit. And yes, I realized that while HttpClientHandler really is SocketsHttpHandler, not all properties are exposed. We did talk about it in the past, but HttpClientHandler remains a generic API across all platforms while SocketsHttpHandler has some more knobs to play with. In order to use them, you would have to use SocketsHttpHandler directly.
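A sketch combining both suggestions - SocketsHttpHandler used directly with a ConnectTimeout, plus a ConnectCallback that mirrors the default connect logic while logging endpoint info; the timeout value and the Console logging are assumptions:

```csharp
using System;
using System.Net.Http;
using System.Net.Sockets;

var handler = new SocketsHttpHandler
{
    ConnectTimeout = TimeSpan.FromSeconds(10),
    ConnectCallback = async (context, cancellationToken) =>
    {
        // Mirrors the default behavior: TCP socket with Nagle disabled.
        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
        try
        {
            await socket.ConnectAsync(context.DnsEndPoint, cancellationToken);
            // Log the resolved remote endpoint to correlate with packet captures.
            Console.WriteLine($"Connected {context.DnsEndPoint.Host}:{context.DnsEndPoint.Port} " +
                              $"-> {socket.RemoteEndPoint}");
            return new NetworkStream(socket, ownsSocket: true);
        }
        catch
        {
            socket.Dispose();
            throw;
        }
    }
};
var client = new HttpClient(handler);
```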
Thanks for your replies. Considering the rather limited impact of this issue, probably the best way to handle it for now is to switch to using SocketsHttpHandler and configure the connection timeout as @stephentoub suggested. Looks like the code that consumes SocketsHttpHandler.ConnectTimeout is in the HttpConnectionPool in the same repo. Although I'd prefer better public documentation on the behavior of those classes, that's the next best thing. Other than mitigating with retries, how does one effectively determine the cause of such sporadic connection failures? We had a related issue yesterday where one of our internal services (.Net/Kestrel running on a Service Fabric cluster in Azure) stopped receiving calls altogether for 10 minutes in one region. The "gateway" service described here was getting the same connection timeouts, but so did other services calling that service; this points to some issue at the target, but investigations (after connectivity recovered without intervention) on that service's nodes found nothing to work with. Can the .Net http client handler possibly be enhanced to include diagnostics for failed connections in those situations, or is it operating at too high a level to be useful for debugging?
Follow-up - I tested a bit with SocketsHttpHandler, triggering a connection timeout in the caller by delaying the connection completion on the Kestrel server using a sleep in the ServerCertificateSelector callback. The good news is that the ConnectTimeout value is respected. Unfortunately, however, the exception is a run-of-the-mill OperationCanceledException with no inner exception and the same call stack I posted in my original message. It looks like the only way for the client to figure out if the exception is a connection exception is to search the exception's call stack for strings like "EstablishSslConnectionAsyncCore", "ConnectAsync", "CreateHttp11ConnectionAsync", "GetHttpConnectionAsync", but that seems extremely hacky and not very reliable. Any other ideas? Supplying a custom ConnectCallback to replace the default does not help, as the exception is not thrown during Socket.ConnectAsync but is most often seen in ConnectHelper.EstablishSslConnectionAsyncCore, which is called after the socket connection is established.
Is it at least possible to make a change to indicate there was a timeout during the SSL negotiation? Adding a detailed inner exception to the OperationCanceledException would help. The fact that the SSL negotiation usually causes this failure narrows down the field. It is possible that either the client or the service has trouble finding a certificate or retrieving the certificate keys to negotiate the SSL handshake. I am able to generate this very error in the client by adding a delay in the server's ServerCertificateSelector callback (Kestrel-specific). While some of the internal services called by the "gateway" which run into this error use a ServerCertificateSelector to pick the server's certificate, not all of them do. I can't find any connection-related code that might cause a deadlock on the server side. Several of the internal services the gateway calls are consistently the biggest "offenders" causing this error. They all run .Net Core+Kestrel on Linux/AKS, whereas the service with the highest traffic (orders of magnitude higher than the others) sees the fewest SSL negotiation errors and runs Java on AKS/Linux. The gateway service calls all services using the same code, so this seems to indicate that the issue lies with the services, not the caller. This is just speculation for now.
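For reference, a repro sketch of the server-side delay described above, using Kestrel's HTTPS defaults; this assumes .NET 6 minimal hosting (`builder` is a WebApplicationBuilder), and the certificate path and delay are placeholders:

```csharp
using System;
using System.Security.Cryptography.X509Certificates;
using System.Threading;

builder.WebHost.ConfigureKestrel(kestrel =>
{
    kestrel.ConfigureHttpsDefaults(https =>
    {
        var cert = new X509Certificate2("server.pfx", "password");
        https.ServerCertificateSelector = (connectionContext, serverName) =>
        {
            // Simulate a slow certificate lookup; make it longer than the
            // client's ConnectTimeout to trigger the timeout reliably.
            Thread.Sleep(TimeSpan.FromSeconds(15));
            return cert;
        };
    });
});
```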
Another update - I deployed a fix along the lines of what was described earlier in this thread, using a SocketsHttpHandler connection timeout of 10 seconds in our pre-production environment. I'm happy to see that since Friday morning, 100% of the SSL connection failures caused by timeouts have succeeded on retry. We'll be deploying this in production sometime later this week. What's the optimal maximum time to wait for an SSL connection to complete anyway? 10 seconds was just an arbitrary warm and fuzzy number I picked, but it seems too high. However, two things still remain; one of them is that detecting these failures on the caller side is still hacky. This is the detection helper I'm using for now:
```csharp
public static bool IsConnectionTimeout(this Exception ex)
{
    // Hack: fingerprint the call stack to recognize a connect timeout.
    // StackTrace can be null, hence the pattern match.
    return ((ex is TaskCanceledException || ex is OperationCanceledException) &&
            ex.StackTrace is string stack &&
            stack.Contains("HttpConnectionPool.GetHttpConnectionAsync", StringComparison.Ordinal) &&
            stack.Contains("HttpConnectionPool.ConnectAsync", StringComparison.Ordinal) &&
            stack.Contains("ConnectHelper.EstablishSslConnectionAsyncCore", StringComparison.Ordinal) &&
            stack.Contains("SslStream.ForceAuthenticationAsync", StringComparison.Ordinal)) ||
           (ex.InnerException != null && ex.InnerException.IsConnectionTimeout());
}
```

A better solution can be achieved by either one of those:
@wfurt - does the issue you linked above cover this latter proposal?
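For what it's worth, a retry wrapper built on the IsConnectionTimeout() hack above might look like this (a sketch, not the author's code; note that TaskCanceledException derives from OperationCanceledException, so one catch clause covers both):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

static async Task<HttpResponseMessage> SendWithRetryAsync(
    HttpClient client, Func<HttpRequestMessage> requestFactory, int maxAttempts = 3)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            // HttpRequestMessage cannot be reused, so build a fresh one per attempt.
            return await client.SendAsync(requestFactory());
        }
        catch (OperationCanceledException ex) when (attempt < maxAttempts && ex.IsConnectionTimeout())
        {
            // Per the observations above, connect timeouts almost always succeed on retry.
        }
    }
}
```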
Triage: We believe there is a similar issue already existing (to provide an enum describing the error that caused the request failure) -- we should mark this as a duplicate of it.
I found the issue I had in mind: #47484
@fstugren I believe that what was done in the aforementioned PR is exactly what you're asking for. Could you please confirm that? Also, I believe that @wfurt's PR #63851 is related here as well. So, is there anything else you'd like to see?
@ManickaP, thanks for following up. I am not yet able to upgrade our services from .Net 5 to .Net 6 to validate this; I may be able to do that in the next couple of months. Also, we notice both TaskCanceledException and OperationCanceledException in these situations; the call stacks look slightly different. Both of those errors have a nearly 100% rate of success on retry. Otherwise, going beyond the client-side handling of those errors, I'd be very interested to find the root cause of those hangs. Based on observations, there seems to be a correlation between this client-side error and having Kestrel at the other end, but that's still just speculation at the moment.
@fstugren you could turn on networking telemetry (https://devblogs.microsoft.com/dotnet/net-5-new-networking-improvements/#telemetry); it should show you how long each part of establishing a connection takes. And obviously turn on server-side logging if you can; that might shed some light as well. Let me know if you need more help. Otherwise, I don't think there's anything actionable for us at the moment.
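A sketch of consuming those networking counters in-process; out of process, the dotnet-counters tool pointed at the System.Net event sources works too. The 10-second interval and the raw Console dump are arbitrary choices:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;

sealed class NetworkingCounterListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        // The .NET 5+ networking layers publish EventCounters under these names.
        if (source.Name is "System.Net.Http" or "System.Net.Security" or
            "System.Net.Sockets" or "System.Net.NameResolution")
        {
            EnableEvents(source, EventLevel.Informational, EventKeywords.All,
                new Dictionary<string, string?> { ["EventCounterIntervalSec"] = "10" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // Counter payloads arrive as "EventCounters" events; dump them here and
        // correlate timestamps with the failing requests.
        Console.WriteLine($"{eventData.EventSource.Name}/{eventData.EventName}");
    }
}
```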
Triage: closing now; this should all be addressed, partially in 6.0 and partially in main (7.0). Please feel free to re-open this issue or open a new one if you find any more problems in your investigations.
@ManickaP - can you clarify how the change you outlined above solves the problem described here? I looked at the GitHub PR you sent, but I am not familiar enough with the asp.net internals to understand how I would apply it. Ideally, the caller would be able to tell that a timeout was caused by a failure to establish a connection rather than by the server being too slow to process the request. This is currently not possible, except through hacks like the one I outlined above. Do you have an example of how to handle connection errors?
This method https://source.dot.net/#System.Net.Http/System/Net/Http/SocketsHttpHandler/HttpConnectionPool.cs,2db3901400a1bccd,references was added, which attaches an inner exception describing what timed out while establishing the connection. So this should give you enough info to discern cancellation from timeout and the type of the timeout, without checking the exception call stack. Does this answer your question?
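If that's right, the caller-side pattern would be along these lines (a sketch; `client`, `request`, and `cts` are assumed to exist in scope, and the exact nesting of the connect-timeout detail is discussed below):

```csharp
try
{
    using var response = await client.SendAsync(request, cts.Token);
}
catch (TaskCanceledException ex) when (ex.InnerException is TimeoutException)
{
    // HttpClient.Timeout elapsed - the documented .NET 5+ shape for the overall timeout.
}
catch (OperationCanceledException ex) when (!cts.IsCancellationRequested)
{
    // Not caller-initiated cancellation: inspect ex.InnerException for the
    // connect-timeout detail that the change above attaches.
}
```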
@ManickaP, @karelz, @stephentoub - I've given the latest .Net 6.0 a try and here's what I saw when the caller catches a connect timeout exception triggered by a call to HttpClient.SendAsync:
This message is inaccurate - yes, the client's default timeout was 100 seconds, but the connection timeout set on the SocketsHttpHandler was 10 seconds. The timeout did not happen because 100 seconds elapsed. It looks like the code that throws makes an assumption without being aware of the connect timeout.
The exception that indicates a connect timeout is nested three levels deep, at least in the case I tested. Is this the intended behavior? It's a bit awkward: the caller must recursively dig into inner exceptions until they find one with the message "A connection could not be established...". A dedicated ConnectTimeoutException would have offered the caller an unambiguous indication of what happened. Anyway, it's still better than searching through the stack trace. Thanks
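Until that shape is cleaned up, a less fragile workaround than stack-trace matching is to walk the inner-exception chain (a sketch; the message fragment is taken from the observation above, and the nesting depth may vary):

```csharp
using System;

static bool HasConnectTimeout(Exception? ex)
{
    // Walk the InnerException chain looking for a timeout marker at any depth.
    for (; ex is not null; ex = ex.InnerException)
    {
        if (ex is TimeoutException ||
            ex.Message.Contains("connection could not be established",
                                StringComparison.OrdinalIgnoreCase))
        {
            return true;
        }
    }
    return false;
}
```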
No, the way a ConnectTimeout bubbles up is not intentional. |