Kestrel stops listening/responding to requests after some time (only after updating to .NET 7.0) #45215
Comments
I'm not sure this is enough to reproduce the problem. A couple more questions:
|
I will continue my attempts to narrow the problem down. However, it is very frustrating because I have to wait so long each time before it gets into a bad state. I am currently running a test on another machine, as perhaps it has something to do with my environment. But even if that IS the case, why should it behave any differently than it does with .NET 6.0? (That's the only variable.)
No. It is hosted only with Kestrel and I launch the application from the command line using
I shall do this the next time that happens. I assume I should use this tool? https://learn.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-dump. Shall I upload the process dump somewhere after I have captured it? |
Once you have the dump, open it in VS and look at the threads window for blocked threads, and the parallel tasks window for blocked tasks. That should give you an idea of where things are stalling. |
I'm not quite sure how to diagnose the issue using the dump, but when I view the Parallel Stacks, I can see that there are some blocked Async Logical Stacks: Here is another view: The source code for the Serilog.Sinks.Async.BackgroundWorkerSink looks like this: The source code for Crypto.Websocket.Extensions.Core.OrderBooks.Sources.OrderBookSourceBase looks like this: I noticed that both of them use:
Could this be the culprit? Otherwise, could somebody point me in the right direction of what I should look for next? |
StartNew is fine. _bufferPauseEvent.WaitOne() isn't great. What unblocks that event? There are async structures you can use to avoid blocking the thread, like SemaphoreSlim, Channels, ConcurrentQueue, etc. |
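To illustrate the suggestion above, here is a minimal sketch of a consumer loop that awaits a `System.Threading.Channels` channel instead of blocking a thread on `WaitOne()`. The `BufferedWorker` class and its members are hypothetical names made up for this example; they are not taken from Serilog or Crypto.Websocket.Extensions, whose actual code may be structured quite differently.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical stand-in for a worker that would otherwise block on an event handle.
public sealed class BufferedWorker
{
    // Unbounded channel: writers never wait, the reader awaits asynchronously.
    private readonly Channel<string> _buffer = Channel.CreateUnbounded<string>();
    private readonly List<string> _processed = new();

    public IReadOnlyList<string> Processed => _processed;

    // Producers enqueue without blocking; returns false once the channel is completed.
    public bool Enqueue(string item) => _buffer.Writer.TryWrite(item);

    // Signals that no more items will arrive, which lets PumpAsync finish.
    public void Complete() => _buffer.Writer.Complete();

    // The consumer yields its thread back to the pool while the channel is empty,
    // rather than parking a thread the way WaitOne() does.
    public async Task PumpAsync(CancellationToken cancellationToken = default)
    {
        await foreach (var item in _buffer.Reader.ReadAllAsync(cancellationToken))
        {
            _processed.Add(item);
        }
    }
}
```

A caller would start `PumpAsync()` once, call `Enqueue` from any thread, and call `Complete()` on shutdown. Whether this maps cleanly onto the order book source depends on what currently signals `_bufferPauseEvent`, which the thread does not show.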
Does BackgroundWorkerSink.Pump do something similar? |
I already tried excluding this code path entirely, and it didn't help
If you follow the link to the source code, you can see what it does. But that's also irrelevant because I tried excluding this code as well and it didn't help either. |
In other news, ever since a few weeks ago my Visual Studio credentials keep expiring on me (like every few minutes) and then my app fails to launch until I refresh the credentials. If this happens overnight after my machine reboots to install Windows updates, then my app won't relaunch until I wake up and realize the problem. Prior to a few weeks ago, my credentials remained cached for months at a time. And now, whenever I have to refresh the credentials, it hardly even works at all: I just get a loading spinner: I think it all started happening after installing a recent update for Visual Studio. Could this have anything to do with it? I have now tried running my app on a different machine, and the problem still occurs there too. I am at my wits' end. I've probably spent over 40 hours trying to resolve this problem and I am no further than when I started. 😭 |
Can you share the parallel stacks (not tasks) window after disabling the serilog logging? Also, is it possible to share the .dmp file? (this can contain PII, so if the app contains customer info don't share it) |
The VS creds issue should be unrelated - please report it through the VS feedback button so they can help with that. |
I have now managed to create a minimalistic project which reproduces this issue: https://github.com/shaynevanasperen/WebApplication I suspect that the problem has something to do with the |
Actually, I just re-read my initial comment where I had mentioned that I tried it with self-signed certificates instead of using |
Nothing there looks inherently bad. When you get into this state, do new requests get stuck? Do requests to specific URLs get stuck?
Can you pinpoint the requests that are failing? If it's only the HTTPS requests, then that might be a clue that it's LettuceEncrypt. |
No requests seem to be able to make it into my app from "outside", meaning by using the public DNS name. But requests from "inside" seem to work fine. To be clear: "outside" means Browser requests to ANY application URLs starting with My Grafana Agent (Windows Service) scrapes my application using the URL It seems like Kestrel just stops listening on the port or something. But that can't be right because the Grafana Agent's requests still make it into my application. Could this be by virtue of the HTTP connection being kept alive because the Grafana Agent is hitting it so frequently (every 15 seconds), whereas the user-interface URLs are only being hit when I decide to press F5 in my browser, which can be after several hours of no requests from my browser?
No. ALL requests fail (except those coming from Grafana Agent). Even HTTP (not HTTPS) requests fail: I've opened the Network tab in my browser and I can confirm that the HTTP request did not make it into my application, because otherwise I would see the redirect to HTTPS. The thing is that I have managed to reproduce this issue while using self-signed certificates (with my All clues currently seem to be pointing towards some sort of "KeepAlive" on the HTTP connection. Because the Grafana Agent is continually polling the To test this theory out, I will now try running the site as before but keep one of the UI pages open where I happen to have a JavaScript infinite loop running to fetch a partial view and update part of the page every 1 second. If my theory is correct, then this will prevent the timeout problem from occurring. I should know within the next 24 hours or so, and then I will report back with my findings. |
Disable https completely and see if you can reproduce the issue |
This test ran for about 2 hours before it started exhibiting the timeout issue. At least that's how much time passed before I checked it and noticed. It may have lasted far less than 2 hours. This further proves that my theory regarding the HTTP KeepAlive is probably correct. I've now re-launched it in this cut-down mode with HTTPS and Authentication disabled completely, but now also with all my background services disabled. Let's see how long it lasts this time. |
Yep, less than 2 hours passed before the timeout issue occurred again. This is with no HTTPS, no Authentication, and just the following code for building the
var builder = WebApplication.CreateBuilder(args);
builder.WebHost.UseUrls($"http://{DomainName}:8020");
var options = new DefaultAzureCredentialOptions
{
VisualStudioTenantId = tenantId,
InteractiveBrowserTenantId = tenantId,
ExcludeManagedIdentityCredential = true
};
var credentials = new DefaultAzureCredential(options);
builder.Configuration.AddAzureAppConfiguration(options => options
.ConfigureClientOptions(x => x.AddPolicy(new LoggingPolicy(logger), HttpPipelinePosition.PerRetry))
.Connect(new(endpoint), credentials)
.ConfigureRefresh(x => x.Register("Sentinel", true))
.ConfigureKeyVault(x => x.SetCredential(credentials)));
builder.Logging.ClearProviders();
builder.Host.UseSerilog(configure);
builder.Services.AddSingleton<IPolicyEvaluator, AuthorizationPolicyEvaluator>();
builder.Services.AddPrometheusCounters();
builder.Services.AddPrometheusAspNetCoreMetrics();
builder.Services.AddPrometheusHttpClientMetrics();
builder.Services.AddPrometheusSqlClientMetrics();
builder.Services
.AddRazorPages()
.AddMicrosoftIdentityUI();
var app = builder.Build();
This HTTP connection timeout problem does NOT occur under .NET 6.0 and it does NOT occur under .NET 7.0 on condition that I keep refreshing the web page frequently.
Surely this is enough for you guys to pinpoint the problem now? |
Not really but appreciate the narrowing down of the issue thus far.
What is DomainName? This repro looks pretty minimal. So the only external dependencies left are:
Can you tell me about the infrastructure around this application? What does an external request look like? I'd like to understand what it means when the request is stuck. If requests aren't making it to the machine, where could they be getting stuck? Have you tried turning on verbose logging for Kestrel? This might show you if connections are being established but requests aren't being made over those connections. Since you have Prometheus metrics, what do they look like when this situation happens? From those dumps, I don't see anything that stands out with respect to something hanging or getting stuck in the request pipeline. |
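As a rough sketch of what turning on verbose Kestrel logging could look like in a stripped-down app: the filter below lowers the minimum level for the `Microsoft.AspNetCore.Server.Kestrel` log categories to `Debug`, so connection-level events show up in the console. The console provider and the `/` endpoint are assumptions added for illustration. The repro in this thread calls `ClearProviders()` and `UseSerilog(...)`, in which case the equivalent minimum level would likely need to be set in the Serilog configuration instead.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.Logging;

var builder = WebApplication.CreateBuilder(args);

// Assumption: the built-in console provider is used rather than Serilog.
builder.Logging.AddConsole();

// Emit Debug-level events for all Kestrel categories (connections, bad requests, etc.).
builder.Logging.AddFilter("Microsoft.AspNetCore.Server.Kestrel", LogLevel.Debug);

var app = builder.Build();

// Placeholder endpoint so there is something to request while watching the log.
app.MapGet("/", () => "ok");

app.Run();
```

If a request from the public DNS name produces no connection log entry at all while the app is in the bad state, that would suggest the connection is being dropped before it reaches Kestrel.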
I had already linked to a minimalistic project which reproduces this issue. I have now further simplified this minimalistic project all the way down to the basic project template that you get from creating a new ASP.NET Core Web Application from the Visual Studio "new project" menu. So anyone should be able to reproduce this now by simply creating the barebones project from Visual Studio's
Have a look here at my minimalistic sample application. You will see that it is
I have now removed these dependencies as well, and I am still able to reproduce the problem.
Please have a look at the README file in my minimalistic sample application. It goes into great detail explaining the infrastructure and how the requests are being made etc.
I have enabled verbose logging, but there is nothing showing in the logs.
There is nothing unusual to be seen.
Something else
As an aside which is unrelated to this issue (hopefully), a friend of mine recently alerted me to some alarming code in EF Core 7.0. Please have a look at the code here, and here. You can see that in both of those places, there is a completely unnecessary loop that is repeated 10000 times! And yet the pull request was merged by the maintainers of that project anyway! If the code in this repository is maintained with the same level of care and attention to detail, then frankly I am not surprised that problems like the one I am facing have surfaced. As a result, this friend and I have now determined that |
We'll take a deeper look at your application and see if we can reproduce given the instructions. From what I read in this repository though, it's not clear if the reproduction of the issue means hitting F5 in the browser continuously (sorry if I missed that). After building the application, deploying it somewhere, and mapping it via a Google domain, do I need to send requests for a day to see this issue happen?
Ouch dotnet/efcore#29642
There are people running on .NET 7 who aren't having this problem. This isn't to say the problem doesn't exist, but there's something about it that we don't quite understand yet. |
The way I have been testing it is to simply deploy it and then "check in" on it every few hours or so by hitting F5 in the browser. In the several hours of my absence (between each time that I "check in" on it) my Grafana Agent keeps hitting the What I have noticed however, is that if I set up a script to essentially hit F5 in my browser every 1 second, then when I come back to it after many hours, or even after a day or so, the manual requests from my browser continue to work (the problem never happens). It is ONLY when I go away for a long time and then come back after a long period of NOT making ANY browser requests, that the problem occurs.
No. You need to do the OPPOSITE of that. Leave the website IDLE with no requests for a day or more, and THEN try to make a request from your browser. This is with the caveat that requests are continuously being made from Grafana Agent in the background (not from the browser). I wonder if I should try to see if this problem still occurs when I disable my Grafana Agent from making those background requests? |
OK so I don't plan to set up a Grafana Agent. The repro steps you suggest are to deploy, hit it a few times and then leave it for a while (a couple of hours), then come back and try to hit it again, both from an external URL and the internal URL, until one of them doesn't respond. Is that right? |
Yes. It might take 2 or 3 days though, not just a few hours. |
You could write a shell script to hit |
@shaynevanasperen It would be great if you could narrow down a repro that's a bit easier to run. As you can imagine, it's challenging for us to reproduce issues that might take 2 or 3 days. |
Alternatively, if you could capture a dump and/or log info when it's in the bad state, that could help us diagnose the issue. |
I already have narrowed down the repro. It's as basic as possible already. Unfortunately the nature of this issue is that it can take that long to manifest.
I have already done this before and there is nothing that stands out (see above). |
I see, thanks @shaynevanasperen. Unfortunately, we don't have enough information to productively debug this at this point. If you end up getting more information about it (or if anyone else hits this issue), please let us know (here or in a new issue) and we can help investigate. |
Closing this issue for now. |
Is there an existing issue for this?
Describe the bug
I have a Web Application which runs some background services and hosts some UI pages using Razor Pages. It was working perfectly fine on .NET 6.0, but after simply updating the target framework to net7.0 and redeploying to a folder on my machine, I started to encounter this problem.
Immediately after deploying the application, it works fine and continues to respond to requests from my browsers (Firefox, Chrome, Edge). But anywhere from 10 minutes to 6 hours later, the site no longer responds to HTTP requests (hitting refresh in my browser results in the connection timing out), and then it stays this way until I stop and restart it from the command line.
I have tried several things to attempt to narrow down the problem, but everything I have tried has failed. It's extremely difficult to do this because I have to wait so long between each try. As soon as I revert the target framework back to .NET 6.0, the problem goes away again and the site remains responsive permanently. So unfortunately I am unable to produce a minimalistic project which reproduces the issue.
I have a Prometheus /metrics endpoint exposed through the application and I have Grafana Agent configured on my machine to scrape the metrics every 15 seconds (from http://localhost/metrics). I have a branch in my request pipeline to exclude that endpoint from Authorization. The rest of the application is secured with MicrosoftIdentity authentication pointing to my Azure Active Directory tenant using the OpenIdConnect authentication scheme.
The really strange thing is that when the browser requests start to show connection timeouts, my Grafana Agent scraper is still able to scrape the /metrics endpoint every 15 seconds, which I can confirm from the logs. Here is a snippet from my Program.cs:
I've tried disabling all my background services, and I even tried disabling authentication/authorization completely, and using self-signed certificates instead of LetsEncrypt, but none of these have had any effect.
Expected Behavior
The web application should continue to respond to HTTP requests indefinitely.
Steps To Reproduce
Host a .NET 7.0 Razor Pages web application on Windows 11. Wait for the site to stop responding to HTTP requests.
Exceptions (if any)
Nothing. Logs show nothing either.
.NET Version
7.0.100
Anything else?
No response