WCF Client Hang in Linux Containers #5600

Open
jvnvenu opened this issue Jul 8, 2024 · 14 comments
@jvnvenu

jvnvenu commented Jul 8, 2024

Describe the bug
I have an application that pushes data to a WCF service over the NetTcp protocol.
The application runs in a Linux container on K8s.
We use the mcr.microsoft.com/dotnet/aspnet:8.0.6 base image to run our application.
The application pushes data to the WCF service continuously and runs fine for a couple of days.
After that it hangs while trying to send data, and no WCF calls work; every call blocks, even calls to a different WCF service.

From the K8s pod console I ran a test program in the same pod where the application is running. The test program can reach the endpoint and get a result, which means there is no network issue from the pod.

Expected behavior
If there is a network issue or any other issue, the call should fail with an error after the open timeout. Instead it waits indefinitely.

Additional context
Callstack from dotnet dump
Child SP IP Call Site
00007EA044AB56A0 00007f0fc35d4e96 [HelperMethodFrame_1OBJ: 00007ea044ab56a0] System.Threading.Monitor.ObjWait(Int32, System.Object)
00007EA044AB57D0 00007F0F4BA955CF System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)
00007EA044AB5880 00007F0F4C0BF817 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken)
00007EA044AB58E0 00007F0F4C0BF5A7 System.Threading.Tasks.Task.InternalWaitCore(Int32, System.Threading.CancellationToken)
00007EA044AB5930 00007F0F4C0BF4A8 System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task, System.Threading.Tasks.ConfigureAwaitOptions)
00007EA044AB5950 00007F0F4C7296C2 System.ServiceModel.Channels.ServiceChannel+CallOpenOnce.System.ServiceModel.Channels.ServiceChannel.ICallOnce.Call(System.ServiceModel.Channels.ServiceChannel, System.TimeSpan)
00007EA044AB5960 00007F0F4C11C6CD System.ServiceModel.Channels.ServiceChannel+CallOnceManager.CallOnce(System.TimeSpan, CallOnceManager)
00007EA044AB59C0 00007F0F4C725899 System.ServiceModel.Channels.ServiceChannel.Call(System.String, Boolean, System.ServiceModel.Dispatcher.ProxyOperationRuntime, System.Object[], System.Object[], System.TimeSpan)
00007EA044AB5B60 00007F0F4C723BA1 System.ServiceModel.Channels.ServiceChannelProxy.Invoke(System.Reflection.MethodInfo, System.Object[])

@HongGit HongGit added the triaged label Jul 8, 2024
@mconnew
Member

mconnew commented Jul 8, 2024

Explicitly open your channel before your first call and that should fix this issue. Also you should be using the WCF client asynchronously if possible. You don't need the service and client side to match in their async/sync operation definition, they just need to be equivalent. This means you don't need to modify your service side code to do this. E.g. string DoSomething() is equivalent to Task<string> DoSomethingAsync(). The only time you can't do this simple conversion is if you have ref or out parameters. If that's the case and you want to convert to async, let me know and I can help you convert. In that scenario you basically use message contracts and an out is in the response wrapping message, and a ref is in both the request wrapping message and the response wrapping message.
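
A minimal sketch of the sync/async equivalence described above. The names `IService`, `IServiceAsync`, and `DoSomething` are hypothetical, and the sketch assumes both interfaces use the default namespace. The `Name` property is set so that the client-side contract keeps the same contract name as the service-side one; to the best of my understanding, WCF derives the operation name `DoSomething` from a Task-returning `DoSomethingAsync`, so the two shapes describe the same operation.

```csharp
using System.ServiceModel;
using System.Threading.Tasks;

// Service-side shape (synchronous). Hypothetical contract for illustration.
[ServiceContract]
public interface IService
{
    [OperationContract]
    string DoSomething();
}

// Client-side shape (asynchronous). Name = "IService" keeps the contract name
// identical to the synchronous contract, so the service code is untouched.
[ServiceContract(Name = "IService")]
public interface IServiceAsync
{
    [OperationContract]
    Task<string> DoSomethingAsync();
}
```

The client would then create its `ChannelFactory` over `IServiceAsync` and await the call, while the service keeps implementing `IService`.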

@jvnvenu
Author

jvnvenu commented Jul 9, 2024

@mconnew I don't understand how opening the channel explicitly solves the issue. As far as I know, the blocking happens in the channel open code. According to the code referenced below, the Open method opens the channel asynchronously and waits indefinitely for the result of the open task. So opening the channel explicitly or implicitly shouldn't make any difference.

https://source.dot.net/#System.ServiceModel.Primitives/System/ServiceModel/Channels/CommunicationObject.cs,f4a3b3f9f38225b6,references
https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/TaskAwaiter.cs,46fcfce730d659a3

The second part: why do we need to convert the client to asynchronous? It is stuck in the channel open and never reaches the method execution.

@mconnew
Member

mconnew commented Jul 9, 2024

Did you try explicitly opening the channel like I said? You need to cast to IChannel first, then open it. If you wanted an explanation of the reason why, it's several paragraphs of explanation and would have meant my response would have taken longer as I would need to set aside the time to write it all out. I'll simply say that your understanding of what's going on is incorrect.
Converting to async was general advice for performance reasons; it won't fix this issue. I could tell from the call stack that you are using the WCF client synchronously, so I was suggesting a general improvement. But explicitly opening the channel should fix it, as long as you do it before any thread makes a service call with the channel.

@jvnvenu
Author

jvnvenu commented Jul 9, 2024

Ok let me try it. Is the below sample correct?

NetTcpBinding netTcpBinding = new NetTcpBinding(SecurityMode.None);
// ChannelFactory needs the contract type as its generic argument.
ChannelFactory<IService> channelFactory = new ChannelFactory<IService>(netTcpBinding);
IService service = channelFactory.CreateChannel(new EndpointAddress());
// Open explicitly before the first service call.
((IClientChannel)service).Open();
var a = service.Dosomething();
((IClientChannel)service).Close();

@jvnvenu
Author

jvnvenu commented Jul 10, 2024

@mconnew I tried opening the channel explicitly, and the issue is still there. Could you please give more details on the root cause of the blocking?

@mconnew
Member

mconnew commented Jul 10, 2024

What's the call stack now? It won't be the same

@jvnvenu
Author

jvnvenu commented Jul 10, 2024

This is the call stack

00007F4970FF7470 00007fc9597e1e96 [HelperMethodFrame_1OBJ: 00007f4970ff7470] System.Threading.Monitor.ObjWait(Int32, System.Object)
00007F4970FF75A0 00007FC8E1CBD72E System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)
00007F4970FF7650 00007FC8E221B63E System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken)
00007F4970FF76C0 00007FC8E221B3C7 System.Threading.Tasks.Task.InternalWaitCore(Int32, System.Threading.CancellationToken)
00007F4970FF7710 00007FC8E221B2C8 System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task, System.Threading.Tasks.ConfigureAwaitOptions)
00007F4970FF7730 00007FC8E28F419A System.ServiceModel.Channels.ServiceChannelProxy.System.ServiceModel.ICommunicationObject.Open()

@jvnvenu
Author

jvnvenu commented Jul 10, 2024

@mconnew This is the call stack

00007F4970FF7470 00007fc9597e1e96 [HelperMethodFrame_1OBJ: 00007f4970ff7470] System.Threading.Monitor.ObjWait(Int32, System.Object)
00007F4970FF75A0 00007FC8E1CBD72E System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)
00007F4970FF7650 00007FC8E221B63E System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken)
00007F4970FF76C0 00007FC8E221B3C7 System.Threading.Tasks.Task.InternalWaitCore(Int32, System.Threading.CancellationToken)
00007F4970FF7710 00007FC8E221B2C8 System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task, System.Threading.Tasks.ConfigureAwaitOptions)
00007F4970FF7730 00007FC8E28F419A System.ServiceModel.Channels.ServiceChannelProxy.System.ServiceModel.ICommunicationObject.Open()

@jvnvenu
Author

jvnvenu commented Jul 11, 2024

@mconnew if possible, could you please share the root cause in short form, just so we can understand what is going on? That would be very helpful.

@mconnew
Member

mconnew commented Jul 11, 2024

When you don't explicitly open a channel, the first call needs to open it. This implicit open puts the channel in a special mode which has performance consequences. It places the actual request on a queue and then does the open. Once the channel is open, it pulls a request off the head of the queue, sends the request, waits for the reply which gets returned to the caller, then gets the next request on the queue. Only once the queue is empty does it switch out of this mode. With the right usage pattern you can get stuck in this mode. Basically it causes requests to be sent to the server serially. "Normal" behavior is only one message can be sent at a time, and only one can be received at a time, but the reply from an outstanding request doesn't need to be received before the next request can be sent, and responses can arrive out of order. If your usage is such that you are reusing the channel from multiple threads, and your rate of requests per second * average call completion time is greater than 1, you can be getting further and further behind and get to the point where requests aren't completing.
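
The "rate of requests per second * average call completion time" condition can be illustrated with a small back-of-the-envelope model. This is only a simplified sketch, not WCF code: it assumes requests arrive at a steady rate and are served strictly one at a time, and the class and method names are hypothetical.

```csharp
using System;

public static class SerialQueueModel
{
    // Utilization of a strictly serial channel: above 1 means requests
    // arrive faster than they can complete.
    public static double Utilization(double requestsPerSecond, double avgCallSeconds)
        => requestsPerSecond * avgCallSeconds;

    // Approximate number of requests still waiting after elapsedSeconds,
    // assuming steady arrivals and one call in flight at a time.
    public static double Backlog(double requestsPerSecond, double avgCallSeconds, double elapsedSeconds)
    {
        double servedPerSecond = 1.0 / avgCallSeconds;
        return Math.Max(0.0, (requestsPerSecond - servedPerSecond) * elapsedSeconds);
    }
}
```

For example, 5 requests per second with an average call time of 0.25 seconds gives a utilization of 1.25; under this model the queue grows by about one request per second, so roughly 60 requests are waiting after a minute.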

Your original call stack showed that you were implicitly opening, and the call was waiting for its request to be picked up from the queue and the reply received. I've never seen an open just hang like that; it usually just fails. So now we need to work out why the open is failing. We have end-to-end scenario tests which run with every change on multiple Linux distros, so it's not a basic functional bug where it just doesn't work on Linux.

There are 2 separate diagnostic steps to take. The easiest is to set the OpenTimeout on your binding to something small like 10 seconds, then see what the call stack is for the exception. It's likely to have multiple layers of inner exceptions so make sure you provide details of all the exceptions. The second diagnostic step is to try to work out which component isn't working. You'll need to modify the binding on the server endpoint as well as the client. Try setting SecurityMode to None (parameter to the constructor of NetTcpBinding) and see if the Open is successful. If it is, the issue is with the security you're using. If that's the case, I need to know what client credential type you are using. If it's the default, that means NegotiateStream (which is what I suspect is the case). You have to do some setup in Linux to get that working (have the correct package installed if using NTLM), and use explicit credentials. If using Kerberos, you need to jump through some hoops to join the Linux machine to the Windows domain, then authenticate. I would have expected an exception if NegotiateStream isn't working rather than it just locking up. The easiest NetTcp security to get working is using certificate authentication as you don't have any outside factors to worry about (correct packages, OS config etc).
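
The first diagnostic step could look roughly like the sketch below. The contract `IService`, the helper `OpenDiagnostics.TryOpen`, and the address passed to it are hypothetical placeholders; only the short `OpenTimeout`, the explicit open, and the walk over the inner exceptions come from the suggestion above.

```csharp
using System;
using System.ServiceModel;

// Hypothetical contract, for illustration only.
[ServiceContract]
public interface IService
{
    [OperationContract]
    string DoSomething();
}

public static class OpenDiagnostics
{
    public static void TryOpen(string address)
    {
        // Small open timeout so a failing open surfaces quickly.
        var binding = new NetTcpBinding(SecurityMode.None)
        {
            OpenTimeout = TimeSpan.FromSeconds(10)
        };
        var factory = new ChannelFactory<IService>(binding, new EndpointAddress(address));
        var channel = (IClientChannel)factory.CreateChannel();
        try
        {
            channel.Open();
            Console.WriteLine("Open succeeded.");
            channel.Close();
        }
        catch (Exception ex)
        {
            // Print every layer of the exception chain, outermost first.
            for (Exception e = ex; e != null; e = e.InnerException)
            {
                Console.WriteLine(e.GetType().FullName + ": " + e.Message);
                Console.WriteLine(e.StackTrace);
            }
            channel.Abort();
        }
    }
}
```

Calling `OpenDiagnostics.TryOpen` with the real `net.tcp://` address of the service should either print "Open succeeded." or the full chain of exceptions to attach to the issue.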

@jvnvenu
Author

jvnvenu commented Jul 12, 2024

@mconnew we are using SecurityMode.None; we don't have any security enabled. I believe the default OpenTimeout is 1 minute.
But it never times out; it stays stuck indefinitely.

@mconnew
Member

mconnew commented Jul 12, 2024

Are you using Buffered or Streamed transfer mode? There's been a bug identified with streamed transfer mode with no security, but that opens successfully, the issue is it times out immediately so likely not the same issue, but it's possible it's related.
Are you able to test with enabling security? Just use any certificate and configure the certificate validation to be none. If that works, then it's likely a related issue to the other known bug and I can probably give you a workaround.
Is the linux container image you're using a publicly available one?
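
One possible reading of the "any certificate, validation none" test on the client side is sketched below. It assumes the server endpoint is reconfigured with transport security and a server certificate, and that no client certificate is required; the helper name `SecureDiagnostics.CreateFactory` is hypothetical, and the exact security settings intended here were not spelled out in the thread.

```csharp
using System.Security.Cryptography.X509Certificates;
using System.ServiceModel;
using System.ServiceModel.Security;

public static class SecureDiagnostics
{
    public static ChannelFactory<TContract> CreateFactory<TContract>(string address)
    {
        // Transport security over NetTcp; assumption: no client credential.
        var binding = new NetTcpBinding(SecurityMode.Transport);
        binding.Security.Transport.ClientCredentialType = TcpClientCredentialType.None;

        var factory = new ChannelFactory<TContract>(binding, new EndpointAddress(address));

        // Diagnostic only: accept whatever certificate the server presents.
        factory.Credentials.ServiceCertificate.SslCertificateAuthentication =
            new X509ServiceCertificateAuthentication
            {
                CertificateValidationMode = X509CertificateValidationMode.None,
                RevocationMode = X509RevocationMode.NoCheck
            };
        return factory;
    }
}
```

Disabling certificate validation like this is only suitable as a temporary diagnostic; the server binding has to be changed to match before the client can connect.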

@jvnvenu
Author

jvnvenu commented Jul 13, 2024

@mconnew we are using Buffered mode. The container image is mcr.microsoft.com/dotnet/aspnet:8.0.6. I can try with security enabled, but it will take time. However, we cannot change our services to run with security enabled.

@mconnew
Member

mconnew commented Jul 13, 2024

Trying with security is a diagnostic step, not a suggested solution.

3 participants