Net 7 Kestrel windows service hangs after a period of time #82207

Closed
1 task done
sccrgoalie1 opened this issue Feb 2, 2023 · 31 comments · Fixed by #82245
Comments

@sccrgoalie1

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

We just updated our projects from .NET 5 to .NET 7 and are experiencing hangs where Kestrel quits responding. It works for a day or so and then the Windows service needs to be restarted. This was not an issue on .NET 5. I gathered a memory dump using ProcDump to see if I can figure out what is causing the hang. Here are some screenshots.

This API is installed at many unique locations and several have reported the same issue that they need to restart the service daily. For some locations, we use Azure relay. So far, the locations that go through the relay have not had any issues.

image

image

Expected Behavior

The API should respond while running

Steps To Reproduce

Let our project run for a day

Exceptions (if any)

No response

.NET Version

7.0.100

Anything else?

No response

@danmoseley
Member

Just curious does it repro on 6.0?

@sccrgoalie1
Author

> Just curious does it repro on 6.0?

Not sure, we skipped over 6.0 and went straight to 7.0. We might be able to backtrack if that will help in diagnosing.

@davidfowl
Member

davidfowl commented Feb 3, 2023

There was another issue like this with similar symptoms (hanging after a day or something). Can you look at parallel stacks?

@sccrgoalie1
Author

> There was another issue like this with similar symptoms (hanging after a day or something).

Yeah, it was this one dotnet/aspnetcore#45215. I reviewed that one and I don't think a solution was found, just not enough info to diagnose.

@davidfowl
Member

@sccrgoalie1 Can you run through all of the same steps?

@sccrgoalie1
Author

@davidfowl Absolutely, we ran through that issue extensively before we submitted our own. Was there something specific that would be helpful? I have the memory dump. All the requests are HTTP, not HTTPS. So far, it seems that the issue occurs after the API sits idle for a period of time. Our users work banker's hours, so it works all day long, but when they come back the next morning Kestrel no longer responds to any requests until we restart the Windows service it's running in.

@adityamandaleeka
Member

adityamandaleeka commented Feb 6, 2023

@sccrgoalie1 Is it possible for you to share the dump with us (we'd understand if you cannot)?

@adityamandaleeka
Member

Can you share more info about what exactly happens on the client and server when this happens?

For instance, does the client get a response back? Do the server logs show anything interesting? Perhaps you may get some interesting output if you enable more detailed logging on the server. See here for more info on configuring logging: https://learn.microsoft.com/en-us/aspnet/core/fundamentals/logging/?view=aspnetcore-7.0#configure-logging

@sccrgoalie1
Author

> @sccrgoalie1 Is it possible for you to share the dump with us (we'd understand if you cannot)?

@adityamandaleeka Sure, here is the memory dump.

Memory Dump

@sccrgoalie1
Author

> Can you share more info about what exactly happens on the client and server when this happens?
>
> For instance, does the client get a response back? Do the server logs show anything interesting? Perhaps you may get some interesting output if you enable more detailed logging on the server. See here for more info on configuring logging: https://learn.microsoft.com/en-us/aspnet/core/fundamentals/logging/?view=aspnetcore-7.0#configure-logging

The client does not get a response back; the requests just hang. Unfortunately, we haven't found anything in the logs or Event Viewer yet. We'll try enabling more detailed logging to see if we can get anything better. The Windows service on the server is still running but just not responding.

@adityamandaleeka
Member

@sccrgoalie1 Thanks for sharing the dump. I looked through it and didn't spot any immediate issues that jumped out. Just curious, did taking the dump cause the process to "unhang" or was it still in the stuck/unresponsive state?

@adityamandaleeka
Member

As a side note (maybe not related to the problem you're hitting), I see a bunch of threads doing things related to performance counters, which are running code from aspnet_perf.dll. Do you know what's causing that?

AFAICT aspnet_perf is an old ASP.NET framework component so I'm wondering what it's doing in the context of an ASP.NET Core app.

@sccrgoalie1
Author

> @sccrgoalie1 Thanks for sharing the dump. I looked through it and didn't spot any immediate issues that jumped out. Just curious, did taking the dump cause the process to "unhang" or was it still in the stuck/unresponsive state?

No, taking the dump did not unstick it. At one of the sites having to restart daily we converted them to use Microsoft.Azure.Relay.AspNetCore and since then they have been running for days without any issues. I'm not sure at what level that package intertwines with Kestrel, but it seems to prevent the hang we are experiencing.

@sccrgoalie1
Author

> As a side note (maybe not related to the problem you're hitting), I see a bunch of threads doing things related to performance counters, which are running code from aspnet_perf.dll. Do you know what's causing that?
>
> AFAICT aspnet_perf is an old ASP.NET framework component so I'm wondering what it's doing in the context of an ASP.NET Core app.

The only thing I can think of that might be doing that is Application Insights, but we are using the NetCore version (Microsoft.ApplicationInsights.AspNetCore).

@heikkilamarko

heikkilamarko commented Feb 12, 2023

We have several .NET 7.0.2 applications running on Windows Server 2019. We run the apps as HashiCorp Nomad Jobs.

We noticed that HTTP requests to an application start to fail if we open the Windows "Resource Monitor" application and select the dotnet process running our app from the list (toggle checkbox on).
After selecting the application, it takes some time before HTTP requests, such as health checks, start to fail (hang). A restart is required to make the app work again.

You can also select more than one app, and all of the selected apps start to fail. Apps that are not selected work without problems.

The problem reproduces consistently.

@adityamandaleeka
Member

@heikkilamarko Thank you for reporting that. I was able to see the behavior you described even on a Windows 11 machine with an empty ASP.NET Core app.

Nothing obvious popped out under the debugger, but cc @noahfalk @davmason in case this rings any bells. Presumably checking the box in Resource Monitor kicks off perfmon or something under the hood, right?

@adityamandaleeka
Member

Hmm, AFAICT the app is still running and not hung, it's just not getting the incoming requests. @BrennanConroy pointed out that the browser is also not timing out so perhaps something is accepting and intercepting the communication...

@davmason
Member

It doesn't ring any bells for me. I am not sure exactly what Resource Monitor does under the hood.

@heikkilamarko

Glad that you were able to reproduce the problem. A small clarification. Some of our services are Node.js applications. They don't have this problem. The problem occurs only for dotnet apps.

@noahfalk
Member

@adityamandaleeka - No bells for me either, I'm afraid. I suspect the right mental model for Resource Monitor and perfmon is that they are two independent apps that both make use of perf counter APIs.

@sccrgoalie1
Author

It seems possible that this is also what is causing the issues for our customers. We noticed they are all running software called Acronis. It must be doing something similar to what Windows Resource Monitor does to our ASP.NET Core app.

@BrennanConroy
Member

BrennanConroy commented Feb 15, 2023

We believe we narrowed down the issue to this change #64834

This was verified by setting the environment variable ComPlus_ThreadPool_UsePortableThreadPoolForIO to 0 before running the app and being unable to repro the issue anymore.
@heikkilamarko and/or @sccrgoalie1 if you could also give that a try and let us know the results that would be great!

@kouvel do you have any ideas how the Windows Threadpool change could cause this issue?

Issue summary:
Running a Kestrel app and attaching Resource Monitor to the process results in requests to the server hanging. It doesn't happen immediately; sometimes it takes 5 minutes, sometimes 1 minute.
Once the issue occurs, making a request while viewing TCP connections in Wireshark shows that a connection is created to the port the server is listening on and keep-alives are sent, but Kestrel never sees the socket connection. One browser request finally closed after 17.5 hours while I ran it overnight.
When setting ComPlus_ThreadPool_UsePortableThreadPoolForIO=0 I have not seen the issue reproduce anymore.
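
The symptom in the summary above (the TCP connection is established, but the server never answers) can be told apart from a refused connection with a small client-side probe. The following is only a sketch in Python, not something used in this investigation; the function name `probe_http`, the result labels, and the timeout are all illustrative assumptions.

```python
import socket


def probe_http(host: str, port: int, timeout_s: float = 5.0) -> str:
    """Classify an HTTP endpoint by what happens to a single plain-HTTP request.

    Returns one of (labels are made up for this sketch):
      "unreachable"              - the TCP connection could not be established
      "connected_no_response"    - connected, but no byte arrived before the timeout
      "closed_without_response"  - the peer closed the connection without answering
      "responding"               - at least one response byte was received
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout_s)
    except OSError:
        return "unreachable"

    with sock:
        sock.settimeout(timeout_s)
        request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        try:
            sock.sendall(request.encode("ascii"))
            data = sock.recv(1)  # one byte is enough to prove the server answered
        except socket.timeout:
            # The OS-level listener took the connection, but nothing replied.
            return "connected_no_response"
        except OSError:
            return "unreachable"

    return "responding" if data else "closed_without_response"
```

Calling something like `probe_http("127.0.0.1", 5000)` (host and port are placeholders) against an app in the hung state would be expected to report `connected_no_response`, which matches what the Wireshark trace shows.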

@heikkilamarko

heikkilamarko commented Feb 15, 2023

We gave it a try. Setting ComPlus_ThreadPool_UsePortableThreadPoolForIO=0 seems to solve the problem. After setting the env variable, the problem is not reproducing anymore.
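
For anyone who starts the app from a launcher script, here is a minimal sketch of applying the workaround to a child process only. It is written in Python purely for illustration; the helper names `env_with_workaround` and `launch` are made up, and the command passed to `launch` would be whatever starts your app. Only the variable name and the value `0` come from this thread.

```python
import os
import subprocess
import sys

# Variable name and value as given earlier in this thread.
WORKAROUND_VAR = "ComPlus_ThreadPool_UsePortableThreadPoolForIO"


def env_with_workaround(base_env=None):
    """Return a copy of the environment with the workaround variable set to "0"."""
    env = dict(os.environ if base_env is None else base_env)
    env[WORKAROUND_VAR] = "0"
    return env


def launch(command):
    """Start `command` (a list of arguments) with the workaround applied to that process only."""
    return subprocess.Popen(command, env=env_with_workaround())


if __name__ == "__main__":
    # Stand-in child process that just echoes the variable it inherited.
    child = subprocess.run(
        [sys.executable, "-c", f"import os; print(os.environ['{WORKAROUND_VAR}'])"],
        env=env_with_workaround(),
        capture_output=True,
        text=True,
        check=True,
    )
    print(child.stdout.strip())  # prints 0
```

The variable has to be present before the process starts. As far as I know, a Windows service does not pick up variables set in an interactive shell, so for a service it would need to be set somewhere the service process can see it (for example as a machine-level variable), followed by a service restart.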

@adityamandaleeka
Member

@heikkilamarko Glad you were able to confirm that. It should be a reasonable workaround for now while we investigate and fix the issue. Thank you again for the simple repro steps!

@kouvel
Member

kouvel commented Feb 15, 2023

It looks like, for some reason, when Resource Monitor is attached, the IO for listening for connections is being associated with the thread it is issued on, even though the handle is attached to an IOCP. AFAIK that's not supposed to happen when the handle is attached to an IOCP. The IO happens to be issued on a thread pool worker thread, and when the thread times out and exits, the IO is canceled with ERROR_OPERATION_ABORTED, and no more IO completions are received for that handle. It looks like worker threads in the portable thread pool implementation are missing a check for pending IO; that check prevents the thread from exiting and causing the IO to be canceled. It should be added, though I'm still not sure why that happens only when Resource Monitor is attached. I'll dig a bit deeper to see if I can find out what's happening.
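
To make the mechanism described above easier to follow, here is a toy model in Python. It is not the runtime's code: the classes `AsyncIo` and `Worker` and the `simulate` function are invented for illustration, and the model simply assumes that a pending IO is tied to the thread that issued it, as observed when Resource Monitor is attached.

```python
from dataclasses import dataclass, field


@dataclass
class AsyncIo:
    name: str
    state: str = "pending"  # "pending", "completed", or "aborted"


@dataclass
class Worker:
    name: str
    alive: bool = True
    issued_io: list = field(default_factory=list)

    def issue(self, io_name: str) -> AsyncIo:
        """Issue an async IO that (in this model) is bound to this worker thread."""
        io = AsyncIo(io_name)
        self.issued_io.append(io)
        return io

    def has_pending_io(self) -> bool:
        return any(io.state == "pending" for io in self.issued_io)

    def on_idle_timeout(self, check_pending_io: bool) -> bool:
        """Handle the idle timeout; return True if the worker exited."""
        if check_pending_io and self.has_pending_io():
            return False  # stay alive so the thread-bound IO is not aborted
        self.alive = False
        for io in self.issued_io:
            if io.state == "pending":
                # Models the IO completing with ERROR_OPERATION_ABORTED.
                io.state = "aborted"
        return True


def simulate(check_pending_io: bool):
    worker = Worker("worker-1")
    accept = worker.issue("accept")  # stands in for the listening socket's accept
    worker.on_idle_timeout(check_pending_io)
    return worker.alive, accept.state


if __name__ == "__main__":
    print(simulate(check_pending_io=False))  # (False, 'aborted')
    print(simulate(check_pending_io=True))   # (True, 'pending')
```

Without the check, the idle worker exits and the accept is aborted, so in this model no further connections would ever be delivered. With the check, the worker stays alive and the accept remains pending.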

@adityamandaleeka
Member

Thanks for the quick investigation @kouvel!

@adityamandaleeka
Member

Can you open an issue on runtime for this and link it from here?

@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost added the untriaged label Feb 15, 2023
@kouvel self-assigned this Feb 15, 2023
@kouvel transferred this issue from dotnet/aspnetcore Feb 15, 2023
@kouvel added the area-System.Threading label and removed the untriaged label Feb 15, 2023
@ghost

ghost commented Feb 15, 2023

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: sccrgoalie1
Assignees: kouvel
Labels: area-System.Threading, untriaged

Milestone: -

@kouvel added this to the 8.0.0 milestone Feb 15, 2023
@kouvel
Member

kouvel commented Feb 15, 2023

I've transferred the issue over to the runtime repo now

@kouvel
Member

kouvel commented Feb 15, 2023

Another thing to look into perhaps is why the IO completion with ERROR_OPERATION_ABORTED is not causing some kind of error to occur. Edit: It's being treated similarly to the IO being canceled normally.

kouvel added a commit to kouvel/runtime that referenced this issue Feb 16, 2023
- When Resource Monitor is attached, some async IO operations are bound to the thread that issued it even though the IO handle is bound to an IOCP. If the thread exits, the async IO operation is aborted. This can lead to hangs or unexpected exceptions.
- Added a check that was missing in the portable thread pool implementation to prevent exiting a worker thread when it has pending IO

Fixes dotnet#82207
@ghost added the in-pr label Feb 16, 2023
kouvel added a commit that referenced this issue Mar 1, 2023
…2245)

* Check for pending IO in the portable thread pool's worker threads

- When Resource Monitor is attached, some async IO operations are bound to the thread that issued it even though the IO handle is bound to an IOCP. If the thread exits, the async IO operation is aborted. This can lead to hangs or unexpected exceptions.
- Added a check that was missing in the portable thread pool implementation to prevent exiting a worker thread when it has pending IO

Fixes #82207
@ghost removed the in-pr label Mar 1, 2023
kouvel added a commit to kouvel/runtime that referenced this issue Mar 8, 2023
- Port of dotnet#82245
- When Resource Monitor is attached, some async IO operations are bound to the thread that issued it even though the IO handle is bound to an IOCP. If the thread exits, the async IO operation is aborted. This can lead to hangs or unexpected exceptions.
- Added a check that was missing in the portable thread pool implementation to prevent exiting a worker thread when it has pending IO

Port of fix for dotnet#82207
kouvel added a commit to kouvel/runtime that referenced this issue Mar 8, 2023
- Port of dotnet#82245
- When Resource Monitor is attached, some async IO operations are bound to the thread that issued it even though the IO handle is bound to an IOCP. If the thread exits, the async IO operation is aborted. This can lead to hangs or unexpected exceptions.
- Added a check that was missing in the portable thread pool implementation to prevent exiting a worker thread when it has pending IO

Port of fix for dotnet#82207
carlossanlop pushed a commit that referenced this issue Mar 9, 2023
…ds (#82248)

* [6.0] Check for pending IO in the portable thread pool's worker threads

- Port of #82245
- When Resource Monitor is attached, some async IO operations are bound to the thread that issued it even though the IO handle is bound to an IOCP. If the thread exits, the async IO operation is aborted. This can lead to hangs or unexpected exceptions.
- Added a check that was missing in the portable thread pool implementation to prevent exiting a worker thread when it has pending IO

Port of fix for #82207

* Refactor Windows-specific code
carlossanlop pushed a commit that referenced this issue Mar 9, 2023
…ds (#82246)

* [7.0] Check for pending IO in the portable thread pool's worker threads

- Port of #82245
- When Resource Monitor is attached, some async IO operations are bound to the thread that issued it even though the IO handle is bound to an IOCP. If the thread exits, the async IO operation is aborted. This can lead to hangs or unexpected exceptions.
- Added a check that was missing in the portable thread pool implementation to prevent exiting a worker thread when it has pending IO

Port of fix for #82207

* Refactor Windows-specific code
@ghost locked as resolved and limited conversation to collaborators Mar 31, 2023