Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release/7.0-staging] Make WindowsServiceLifetime gracefully stop #85656

Merged

Conversation

ericstj
Copy link
Member

@ericstj ericstj commented May 2, 2023

Backport of #83892
Fixes #62579 on 7.0

Description

Windows services implemented with Microsoft.Extensions.Hosting.WindowsServices would not gracefully stop - due to exit without reporting STOPPED status - resulting in an error in Service Control Manager. If the service had configured error handling this would also cause the service to restart rather than stay stopped. In some cases the service would crash with on stop as well - generating WER dumps.

The clean exit without reporting STOPPED was caused because nothing held up the main routine from exiting before SCM was notified that the service was stopped. This race condition was always present, but has gotten worse over the past two releases as we made the runtime faster - effectively giving the background thread less time to complete and report it's stop.

The AV was caused by a double call to stop the service -- once happening when handing the SCM Stop request, and again from hosting when shutting down the application lifetime. This double call was violating the constraint of SetServiceStatus:

The first call to the function in the SERVICE_STOPPED state closes the RPC context handle and any subsequent calls can cause the process to crash.

We've avoided calling stop twice and also guarded against user's calling ServiceBase.Stop when the service is already stopping.

Customer Impact

Not all shutdown logic is allowed to run - which could result in data loss. Customers receive error reports about their services. Services may restart when intended to stop.

Testing

Fix has been made in main for a month and validated on nightly builds by customers. We added numerous automated test cases to cover this scenario and cover the overall lifetime sequence. We also tested the service stop logic manually and through system shutdown (note that this bug still exists which causes problems on shutdown

Risk

Risk is low, but not zero due to the size of the change and the addition of cross-thread synchronization and locking. We've tried to mitigate this through lots of automated and manual testing. We also kept the locking to a minimum and run no user code under the added lock. Probably the biggest risk here is that we didn't completely fix everything folks care about in this path. For example: #83093

Note the changes to ServiceBase are not strictly required - which is good because we cannot change that type on .NETFramework where this same package is supported.

ericstj and others added 6 commits May 2, 2023 09:03
* Make WindowsServiceLifetime gracefully stop

WindowsServiceLifetime was not waiting for ServiceBase to stop the service.  As a result
we would sometimes end the process before notifying service control manager that the service
had stopped -- resulting in an error in the eventlog and sometimes a service restart.

We also were permitting multiple calls to Stop to occur - through SCM callbacks, and through
public API.  We must not call SetServiceStatus again once the service is marked as stopped.

* Alternate approach to ensuring we only ever set STATE_STOPPED once.

* Avoid calling ServiceBase.Stop on stopped service

I fixed double-calling STATE_STOPPED in ServiceBase, but this fix will
not be present on .NETFramework.  Workaround that by avoiding calling
ServiceBase.Stop when the service has already been stopped by SCM.

* Add tests for WindowsServiceLifetime

These tests leverage RemoteExecutor to avoid creating a separate service
assembly.

* Respond to feedback and add more tests.

This better integrates with the RemoteExecutor component as well,
by hooking up the service process and fetching its handle.

This gives us the correct logging and exitcode handling from
RemoteExecutor.

* Honor Cancellation in StopAsync

* Fix bindingRedirects in RemoteExecutor

* Use Async lambdas for service testing

* Fix issue on Win7 where duplicate service descriptions are disallowed

* Respond to feedback

* Fix comment and add timeout
@ericstj ericstj requested review from carlossanlop, ViktorHofer, buyaa-n and a team May 2, 2023 16:05
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 2, 2023
@ghost ghost assigned ericstj May 2, 2023
@ericstj ericstj added Servicing-consider Issue for next servicing release review area-Extensions-Hosting and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels May 2, 2023
@ghost
Copy link

ghost commented May 2, 2023

Tagging subscribers to this area: @dotnet/area-extensions-hosting
See info in area-owners.md if you want to be subscribed.

Issue Details

Backport of #83892
Fixes #62579 on 7.0

Description

Windows services implemented with Microsoft.Extensions.Hosting.WindowsServices would not gracefully stop - due to exit without reporting STOPPED status - resulting in an error in Service Control Manager. If the service had configured error handling this would also cause the service to restart rather than stay stopped. In some cases the service would crash with on stop as well - generating WER dumps.

The clean exit without reporting STOPPED was caused because nothing held up the main routine from exiting before SCM was notified that the service was stopped. This race condition was always present, but has gotten worse over the past two releases as we made the runtime faster - effectively giving the background thread less time to complete and report it's stop.

The AV was caused by a double call to stop the service -- once happening when handing the SCM Stop request, and again from hosting when shutting down the application lifetime. This double call was violating the constraint of SetServiceStatus:

The first call to the function in the SERVICE_STOPPED state closes the RPC context handle and any subsequent calls can cause the process to crash.

We've avoided calling stop twice and also guarded against user's calling ServiceBase.Stop when the service is already stopping.

Customer Impact

Not all shutdown logic is allowed to run - which could result in data loss. Customers receive error reports about their services. Services may restart when intended to stop.

Testing

Fix has been made in main for a month and validated on nightly builds by customers. We added numerous automated test cases to cover this scenario and cover the overall lifetime sequence. We also tested the service stop logic manually and through system shutdown (note that this bug still exists which causes problems on shutdown

Risk

Risk is low, but not zero due to the size of the change and the addition of cross-thread synchronization and locking. We've tried to mitigate this through lots of automated and manual testing. We also kept the locking to a minimum and run no user code under the added lock. Probably the biggest risk here is that we didn't completely fix everything folks care about in this path. For example: #83093

Note the changes to ServiceBase are not strictly required - which is good because we cannot change that type on .NETFramework where this same package is supported.

Author: ericstj
Assignees: ericstj
Labels:

Servicing-consider, area-Extensions-Hosting

Milestone: -

@rbhanda rbhanda added Servicing-approved Approved for servicing release and removed Servicing-consider Issue for next servicing release review labels May 2, 2023
@rbhanda rbhanda added this to the 7.0.7 milestone May 2, 2023
@ericstj
Copy link
Member Author

ericstj commented May 2, 2023

artifacts\bin\testPackages\packageTest.targets(50,5): error : (NETCORE_ENGINEERING_TELEMETRY=Build) Assembly 'Microsoft.Extensions.Hosting.WindowsServices' has insufficient version for dependency 'Microsoft.Extensions.Logging.Abstractions' : 7.0.0.0 < 7.0.0.1.

This is odd to me. I see that we've versioned Logging.Abstractions but I also see that we're building the package:

<GeneratePackageOnBuild>true</GeneratePackageOnBuild>
<ServicingVersion>1</ServicingVersion>

I'm not sure why we'd hit this error. Building the package locally to see what's happening.

@ericstj
Copy link
Member Author

ericstj commented May 2, 2023

I'm not sure why we'd hit this error. Building the package locally to see what's happening.

I understand why it's happening. It's because we pull in the serviced project transitively through a reference to Microsoft.Extensions.Logging.EventLog but that is not serviced and remains at 7.0.0. Since we don't have a direct reference to Microsoft.Extensions.Logging.Abstractions we can't represent the serviced dependency. I will fix this by adding a direct reference to Microsoft.Extensions.Logging.Abstractions.

This package has been serviced and we compile against the serviced
version of its assemblies.

None of the directly referenced projects have been serviced so our
package doesn't restore the serviced versions.

Lift up the dependency on Logging.Abstractions to ensure we reference
the serviced package.
Copy link
Member

@carlossanlop carlossanlop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
OOB package authoring changes look good.
Just left some non-blocking questions.

<AutoGenerateBindingRedirects Condition="'$(AutoGenerateBindingRedirects)' == ''">true</AutoGenerateBindingRedirects>
<GenerateBindingRedirectsOutputType Condition="'$(GenerateBindingRedirectsOutputType)' == ''">true</GenerateBindingRedirectsOutputType>
</PropertyGroup>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the main PR I found no description for this change except "Fixing binding redirects for remote excutor". For my own education, would you mind explaining this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We weren't generating any bindingRedirects for tests, but they need them.

When running from XUnit we were working because XUnit has an AssemblyResolve hook that does a similar thing (though it can't do the exact same thing).

When I added out-of-proc services with RemoteExecutor -- really any use of remote executor -- loads failed since they didn't have this resolve handler. Ensuring the app.config is generated makes sure that remote executor gets it. It will also avoid us needing to rely on the AssemblyResolve hook in Xunit (and it's fragility).

Here's the discussion: #83892 (comment)

<GeneratePackageOnBuild>false</GeneratePackageOnBuild>
<ServicingVersion>0</ServicingVersion>
<GeneratePackageOnBuild>true</GeneratePackageOnBuild>
<ServicingVersion>1</ServicingVersion>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for also adding the servicing bump to M.W.Compat.

@@ -27,6 +29,7 @@

<ItemGroup>
<ProjectReference Include="$(LibrariesProjectRoot)Microsoft.Extensions.Hosting\src\Microsoft.Extensions.Hosting.csproj" />
<ProjectReference Include="$(LibrariesProjectRoot)Microsoft.Extensions.Logging.Abstractions\src\Microsoft.Extensions.Logging.Abstractions.csproj" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What code in WindowsServiceLifetime.cs caused the need to add this new project reference?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is the transitive dependency problem you described here, right? #85656 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. It's a side-effect of incremental servicing and our package testing caught it. Good reason to pay attention to those failures 👍

@@ -31,6 +31,7 @@ public class ServiceBase : Component
private bool _commandPropsFrozen; // set to true once we've use the Can... properties.
private bool _disposed;
private bool _initialized;
private object _stopLock = new object();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock was a great addition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix didn't specifically require it but it helps users of ServiceBase from tripping over the same problem. Might be we should lock on all state changes, but for now I focused just on stop since that's documented to be bad to call more than once.

@ericstj ericstj merged commit c8ab0d3 into dotnet:release/7.0-staging May 4, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jun 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Extensions-Hosting Servicing-approved Approved for servicing release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants