Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement thread suspension on Linux and Mac platform #3947

Closed
sergiy-k opened this issue Feb 11, 2015 · 24 comments
Closed

Implement thread suspension on Linux and Mac platform #3947

sergiy-k opened this issue Feb 11, 2015 · 24 comments
Assignees

Comments

@sergiy-k
Copy link
Contributor

CoreCLR needs to be able to suspend threads in a managed application in order to run GC. Currently CoreCLR PAL implements threads suspension using signals on Linux. We need to enable and test implementation of thread suspension for Mac. In addition, we need to implement functionality for getting and setting context of a given thread in the current process.

@kangaroo
Copy link
Contributor

@sergiy-k There is already code for suspending and resuming Mac threads in the pal, see:

CThreadSuspensionInfo::THREADHandleSuspendNative(CPalThread *pthrTarget)

Is there some reason this isn't sufficient?

@sergiy-k sergiy-k self-assigned this Feb 11, 2015
@sergiy-k
Copy link
Contributor Author

@kangaroo You are right – the code is there. But, honestly, I don’t know if it still works because this code has some age (I’m sure you have noticed it yourself that there is a lot stuff in PAL that needs to cleaned up) so I want to test it first. Luckily I got a Mac machine yesterday that I could you use for testing. :) That’s why I opened this work item. Besides, there are a few PAL tests for thread suspension that fail even on Linux today.

@jkotas
Copy link
Member

jkotas commented Feb 11, 2015

The existing PAL thread suspension code never worked reliably. It was not used for Silverlight - Silverlight used slower thread suspension technique via explicit GC probes. This issue is about designing and implementing a scheme that works reliably on Linux and Mac - without the overhead of explicit GC probes.

The current PAL code is a pretty complex emulation of SuspendThread/ResumeThread/GetThreadContext/SetThreadContext Win32 APIs. src\vm\threaduspend.cpp wraps these APIs with another complex layer to actually perform the managed thread suspension. One potential avenue to investigate is to design a custom PAL APIs to eliminate some of this complexity - that just does what src\vm\threaduspend.cpp needs instead of general purpose emulation of Win32.

@kangaroo
Copy link
Contributor

@jkotas I'd be curious what unreliability you guys had for the MAC PAL. I've done an (albeit quick) once over of the code, and it looks functionally correct for a non-cooperative suspend/resume.

@jkotas
Copy link
Member

jkotas commented Feb 11, 2015

The current implementation of PAL thread suspension requires a lot of code to be aware of it. It is impossible to ensure and review that all of that code is right. IIRC, the thread suspension was a stress bug factory in Silverlight days and that is why we switched to explicit GC probes as backstop.

I believe that the right design of the thread suspension has to have minimal dependencies, e.g. any bugs in thread suspension have to be in the thread suspension code itself - that should be relatively small and easy to review. The design cannot not depend on code all over the place to cooperate for correctness.

Look for comments like:

#if !DEADLOCK_WHEN_THREAD_IS_SUSPENDED_WHILE_BLOCKED_ON_MUTEX

Wrapper for mkstemp. mkstemp may invoke malloc, in which case a thread
suspended inside it could be holding an internal lock. This function prevents
such a suspension from occuring by ensuring that a thread cannot be suspended
while inside mkstemp.

// Prime the location of CoreCLR so that we don't deadlock under
// dladdr with a pthread_mutex_lock.
IsWithinCoreCLR(NULL);

@kangaroo
Copy link
Contributor

@jkotas That presents an interesting design challenge on mac and linux. Any thread suspended in user land could be holding any lock, so I see a few obvious options:

1> Make all code that relies on suspending someone else lock and allocator free. Basically treat it as if it were operating in a signal handler.
2> Annotate all unsafe regions in some way, and restart threads that are not at a safe point

I think 2 is much trickier, since we could be talking about the malloc lock, printf lock, etc.

Thoughts?

@jkotas
Copy link
Member

jkotas commented Feb 12, 2015

Agree. 1 is the right approach.

@sergiy-k
Copy link
Contributor Author

Agree. As a matter of fact, some of the runtime is already trying to that. For example, in threadsuspend.cpp in VM:

// We can not allocate memory after we suspend a thread.
// Otherwise, we may deadlock the process, because the thread we just suspended
// might hold locks we would need to acquire while allocating.
ThreadStore::AllocateOSContext();

In addition, existing PAL code tries to detect and mitigate this problem by using EnterUnsafeRegion/LeaveUnsafeRegion, which I want to get rid of as well.

My plan is to first ensure that the existing code for suspending/resuming threads and getting/setting thread context works (on both Linux and Mac) and fix currently failing tests or at least understand why they are failing (this was the real intention of this issue). After that I will look at how we can make thread suspension safe and open a separate issue for that.

@sergiy-k
Copy link
Contributor Author

I have been thinking about what we could do to guarantee that we don’t end up in a deadlock situation when using the Suspend/ResumeThread APIs. It seems that the best way to guarantee that this will not happen is to get rid of these dangerous APIs all together. If we can do this then, in addition to making the runtime code safer, it will also allow us to remove A LOT of code and complexity from PAL.

As far as I remember, the main usage of the Suspend/Resume thread APIs in SuspendRuntime is to hijack threads and to drive the threads to a safe point. As @jkotas pointed out, this functionality is currently disabled on non-Windows platforms (by defining DISABLE_THREADSUSPEND) and it relies on the GCPOLL mechanism instead. Even though GCPOLL works OK in many cases, it cannot deal with the code that has tight long running (or never ending) loops that do not have GC synchronization points. So we do want to enable the hijacking functionality on non-Windows platforms too and I believe we can do it without the Suspend/ResumeThread APIs.

Basically, the idea is that instead of suspending a thread and then doing a bunch of work on the suspended context in the SuspendRuntime function, we can tell a thread to do hijacking its own. For example, on Linux SuspendRuntime can send a signal to a thread. When the thread receive this signal, it inspects and modifies its own context and reports the status back to SuspendRuntime/GC once it’s done.
Of course, evil is always in the details. So some prototyping and more investigation needs to be done before we can say for sure that this will work in general and is implementable on all platforms.

Thoughts? Your feedback is more than welcomed. :)

@noahfalk
Copy link
Member

Just wanted to confirm my understanding. The broad proposed approach is that threads could be suspended/hijacked holding any lock, but only during the period of time where the runtime has not yet finished suspending as a whole. At the point that the GC code would run (or the debugger helper), we would have resumed those threads and blocked them somewhere more controlled to ensure they are not holding various low-level locks that might be necessary for OS and C++ runtime APIs to run normally.

I'm not a suspension expert but this seems like a fine direction to me. I think one of the detail areas should be debugger interaction given the various synchronous and asynchronous CONTEXT modification the debugger can do. I'm happy to take a closer look with you at some point.

@jkotas jkotas assigned adityamandaleeka and unassigned sergiy-k Jun 5, 2015
@adityamandaleeka
Copy link
Member

I've done some investigation and prototyping for this work on Linux, and here is what I have in mind so far (at a high level):

In SuspendRuntime, we raise a signal to each thread that needs to leave cooperative mode in order for GC to proceed. In the handler for that signal, we will capture the context and call a function that will determine the type of code that was interrupted and react accordingly.

  • If we interrupted native code, or if preemptive GC is not disabled for the thread (for instance, if the thread switched back to preemptive mode before the signal handler was invoked), we will exit the signal handler and let the thread continue.
  • If we interrupted managed code, we will use the ICodeManager to determine whether the offset we are at is GC safe.
    • If we are at a safe point, we push a RedirectedThreadFrame on the stack using the captured context, pulse GC mode (enable and disable preemptive GC) for the thread, pop the frame, update the context to reflect modifications by the GC, and exit the signal handler.
    • If we are not at a safe point, we will determine whether the method we are in returns a scalar, an object ref, or an interior pointer, and hijack the return address to jump to an appropriate routine which will do the following (which closely resembles what's done in the Windows code in this case):
      • Save all the callee-saved registers and the return value on the stack.
      • Call a worker function which will push a hijack frame (which will let us crawl through it back to where the return should have gone), pulse GC mode, and pop the frame.
      • Restore all the captured registers and return to where the function was supposed to return to before the hijack.

Thoughts? I think this plan should be pretty sound but it could change if there are any unforeseen issues discovered.

@jkotas
Copy link
Member

jkotas commented Jun 16, 2015

You should carefully review all code executed in the signal handler to make sure it only calls methods that are signal handler safe.

@adityamandaleeka
Copy link
Member

@jkotas Agreed. This will be reviewed and tested thoroughly before moving forward.

@janvorli
Copy link
Member

I believe it is actually ok to call even non-safe functions provided we call them only when we know we have interrupted the code in managed code. In such case, it is ok to call any platform functions, since we are sure that we are not in the middle of another platform function that could collide with the one we are executing (taking the same lock, using the same I/O buffer, ...)

@jkotas
Copy link
Member

jkotas commented Jun 17, 2015

Agree. Only the code that determines whether we are in managed code needs to be scrutinized.

@adityamandaleeka
Copy link
Member

dotnet/coreclr#1371 addresses this issue for Linux.

@janvorli janvorli assigned janvorli and unassigned adityamandaleeka Sep 9, 2015
@janvorli
Copy link
Member

janvorli commented Sep 9, 2015

I have assigned this issue to myself, since I am starting to work on the thread suspension for OSX.
After looking into possible options, it seems that handling it in a similar way as on Windows rather than using injection like on Linux will get a cleaner and better performing solution.
Just to make sure there is a clear understanding of the goal of the suspension, I would like to point out that we want to suspend a thread in a way that makes sure it gets suspended at a safe place in managed code only.
On Windows, we suspend the thread and then check whether it is executing a native or managed code.

  • If it is native code, we resume the thread, since we already have g_TrapReturningThreads set and so when the thread will return to managed code, it will get wait on the boundary at a safe point.
  • If it is in managed code, then we can either be at an interruptible point or not.
    • If we are at an interruptible point, we modify the context of the thread to redirect it to a helper function that makes it wait at a safe point and resume the thread.
    • If we are not at an interruptible point, we hijack the current function by modifying its return address on stack to point to a helper function that makes it wait at a safe point and resume the thread.

The interesting point is why we redirect the thread when we are at an interruptible point instead of just keeping the thread suspended. The answer is in one of the comments in the threadsuspend.cpp:

If the thread is already at a safe point, you might think we could simply leave it
suspended and proceed with the GC. In principle, this should be what we do.
However, various historical OS bugs prevent this from working. The problem is that
we are not guaranteed to capture an accurate CONTEXT (register state) for a suspended
thread. So instead, we "redirect" the thread, by overwriting its instruction pointer.
We then resume the thread, and it immediately starts executing our "redirect" routine,
which leaves cooperative mode and waits for the GC to complete.

Now the good thing about OSX thread_suspend is that the context of the suspended thread is accurate and so we don't need the redirection. And not needing the redirection means that we don't need to modify the context of the suspended thread. All we need to do is to suspend and possibly modify the return address on stack to hijack.
That means that we can reuse the Windows code and exclude parts that are responsible for the redirection.

@janvorli
Copy link
Member

janvorli commented Sep 9, 2015

@kangaroo Could you please take a look at the OSX plan above and let me know if you have any concerns?

@kangaroo
Copy link
Contributor

kangaroo commented Sep 9, 2015

@janvorli In general it sounds reasonable to me. One question. What makes a point interruptible or not in regards to managed code? Is it possible for a non-interruptible point to be in a tight loop, or waiting on a conditional which would cause the return to not execute?

@janvorli
Copy link
Member

janvorli commented Sep 9, 2015

@kangaroo, here is the description:
Fully Interruptible Method
A method in which the information for garbage collection is available at all points inside the main body of the method. (excluding the prolog and epilog portions of the method) The Method Header Information has interruptible=1. Because of the size of the Method GC Information required to supply this information, the JITter avoids creating fully interruptible methods where possible. However if a method has a compute bound loop with no method calls the method is required to be fully interruptible.
Non Fully Interruptible Method
A method in which the information for garbage collection is only provided at method call sites. The Method Header Information has interruptible=0. The runtime system will hijack a method exit (i.e. replace the actual return address on the stack with an address within the runtime system) in order to accomplish a garbage collection. The JITter will prefer to make a method non fully interruptible in order to keep the size of the Method GC Information to a minimum.

@kangaroo
Copy link
Contributor

Thanks @janvorli -- it sounds like from the explanation and my looking at the code, the concern case I raised isn't possible.

@janvorli
Copy link
Member

@kangaroo I have implemented the runtime suspension the planned way and it works. But I have realized that using the way we use on Linux (injecting activation) on OSX would enable us to get rid of the safe / unsafe regions tracking in PAL. That would be a huge benefit. So I am changing the plan and I am going to implement the PAL_InjectActivation for OSX instead of mimicking the windows way,

@kangaroo
Copy link
Contributor

@janvorli SGTM

@janvorli
Copy link
Member

OSX support implemented by dotnet/coreclr#1610

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 7, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants