CRIU: Add support for dynamic debug interpreter transition on restore #17642

Open
tajila opened this issue Jun 22, 2023 · 51 comments
Labels
comp:jit comp:vm criu Used to track CRIU snapshot related work

Comments

@tajila
Contributor

tajila commented Jun 22, 2023

Background

We currently have three interpreters: normal, CRIU, and debug. Ideally, we would like to get to a position where we have only two interpreters: normal and debug. The CRIU interpreter was added because the normal interpreter was missing capabilities (method enter/exit checks) needed to support serviceability features, such as Java method tracing, dynamically upon restore.

Goal

Detect a request to run with the debug interpreter, then exit the normal interpreter and continue in the debug interpreter. If we can achieve this, we gain:

  • Remove the CRIU interpreter
  • Cheaper support for serviceability features on restore
  • Support -Xint on restore
  • Support java debugging on restore

Challenges

Places to detect change:

  • interpreter entry from native (callin) - easy
  • interpreter return from native (any native, not just JNI) - lots of places to check
  • interpreter code (bytecode interpreter loop) - piggy back on async checks
  • jit code - decompile while still in safe point exclusive
  • jit helpers - ??
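The "interpreter entry from native (callin)" case above is the easy one because the interpreter loop is chosen fresh on every entry, so a flag check suffices. A minimal sketch of that idea follows; the flag, interpreter loops, and method names here are hypothetical stand-ins, not actual OpenJ9 identifiers:

```java
// Hypothetical sketch: detecting a pending debug-interpreter request at callin.
public class InterpreterDispatch {
    // Illustrative flag set on restore when debug capabilities are requested.
    static volatile boolean debugInterpreterRequested = false;

    interface Interpreter { String run(String method); }

    static final Interpreter NORMAL = m -> "normal:" + m;
    static final Interpreter DEBUG  = m -> "debug:" + m; // performs method enter/exit checks

    // Interpreter entry from native (callin): the easy detection point, since
    // the interpreter loop is selected on every entry from native code.
    static String enterFromNative(String method) {
        Interpreter loop = debugInterpreterRequested ? DEBUG : NORMAL;
        return loop.run(method);
    }
}
```

The harder cases in the list (returns from native, JIT'd code, JIT helpers) cannot be handled this way because execution is already inside a loop or compiled body when the request arrives.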
@tajila tajila added comp:vm comp:jit criu Used to track CRIU snapshot related work labels Jun 22, 2023
@dsouzai
Contributor

dsouzai commented Jun 22, 2023

Based on a discussion with @vijaysun-omr, we came up with a few possible ways forward.

1. Disable all compiled code

This is relatively straightforward to do; in fact, this is what we currently do for -Xtrace/-Xrs. However, the problem is that this does not guarantee that no JIT'd code will execute; any JIT'd code on the stack will continue to execute until a new invocation, at which point execution will transfer to the interpreter.

2. Fail the checkpoint and start a new JVM in default mode

This is probably less of an option for the JVM and more for applications; an application can be configured to handle the failure and instead start a new JVM in default mode. This would not maintain Dev/Prod parity, but it is a fallback option that would at the very least guarantee functionality from a Java application user's point of view.

3. Generate code pre-checkpoint as if the JVM is running in FSD mode

Generating code as if the JVM is in FSD mode means running in Involuntary OSR Mode. This means any yield point can be a place where the VM triggers the transition of a thread from JIT'd code to the interpreter. The downside of this approach is that FSD compliant JIT code is around 30% slower. However, this may not matter too much for first response; for steady state throughput, these FSD bodies can be generated with GCR trees to force recompilation post restore.

An important subtlety here is that if debug is not enabled post-restore but redefinition is still possible, the code cache will have some method bodies that support involuntary OSR (i.e. those that were generated pre-checkpoint) and the rest that support voluntary OSR. As such, the VM will need to check a (yet to exist) flag in the body's metadata to determine what type of OSR was used. When redefinition needs to occur, the VM will need to check, at a yield point, if the body was compiled to support involuntary OSR, and if so, decompile it regardless of the type of yield point; otherwise, normal Voluntary OSR mechanics apply.
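The per-body check described above could look roughly like the following sketch; `OsrKind`, `BodyMetadata`, and `mustDecompileNow` are invented names for illustration, not real JIT metadata fields:

```java
// Hypothetical sketch of the "yet to exist" per-body OSR-kind flag.
public class OsrPolicy {
    enum OsrKind { INVOLUNTARY, VOLUNTARY } // pre-checkpoint FSD vs. post-restore bodies

    static final class BodyMetadata {
        final OsrKind osrKind;
        BodyMetadata(OsrKind kind) { osrKind = kind; }
    }

    // At a yield point during a redefinition event: an involuntary-OSR body must
    // be decompiled regardless of the kind of yield point; a voluntary-OSR body
    // is decompiled only at a designated OSR transition point (normal HCR mechanics).
    static boolean mustDecompileNow(BodyMetadata body, boolean yieldIsOsrTransitionPoint) {
        if (body.osrKind == OsrPolicy.OsrKind.INVOLUNTARY) {
            return true;
        }
        return yieldIsOsrTransitionPoint;
    }
}
```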

4. Choose Async Checkpoints that, in Voluntary OSR Mode allow redefinition to occur, and add OSR transitions there

If option 3 is too expensive, another approach is to run in a suboptimal Voluntary OSR Mode. Rather than run the Fear Analysis to minimize the OSR transition points, we force the transition points to be the exact set of yield points that are used to ensure that redefinition occurs; while this set is larger than what would result from an optimal OSR analysis, it is still likely smaller than the set of points in option 3.

However, an important caveat here is that any yield point that is not used to ensure that redefinition occurs must be ignored by the VM for the purpose of checkpointing; the thread should be allowed to continue execution until it hits one of these yield points that is also a transition point (it is guaranteed that the thread will not execute indefinitely before reaching such a point).

Another caveat is that we will need to add Voluntary OSR support for AOT (#4849).
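The first caveat above can be simulated in miniature: a thread keeps running past ordinary yield points and may only be halted for checkpoint at a yield point that is also a transition point. All names below are hypothetical:

```java
// Toy simulation of option 4's checkpoint rule: only yield points that are also
// redefinition/OSR transition points may halt a thread for checkpoint.
public class CheckpointYield {
    static boolean mayHaltForCheckpoint(boolean isTransitionPoint) {
        return isTransitionPoint;
    }

    // Step a thread through successive yield points; return how many yields
    // occur before the thread can be halted for the checkpoint.
    static int yieldsUntilHalt(boolean[] isTransitionPoint) {
        for (int i = 0; i < isTransitionPoint.length; i++) {
            if (mayHaltForCheckpoint(isTransitionPoint[i])) {
                return i + 1;
            }
        }
        return -1; // per the stated guarantee, a transition point is always reached
    }
}
```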

@dsouzai
Contributor

dsouzai commented Jun 22, 2023

I am going to start investigating the perf impact of option 3 first. Specifically, I will generate two builds where,

  1. The JIT generates FSD code all the time
  2. The JIT generates FSD code only pre-checkpoint

@tajila
Contributor Author

tajila commented Jun 22, 2023

FYI @gacholio

@tajila
Contributor Author

tajila commented Jun 22, 2023

any yield point that is not used to ensure that redefinition occurs must be ignored by the VM for the purpose of checkpointing; the thread should be allowed to continue execution until it hits one of these yield points that is also a transition point

I'm not too familiar with this detail; how do we differentiate this in the VM? There are two main mechanisms we use: exclusive and safepoint exclusive. @gacholio thoughts?

@gacholio
Contributor

My impression from discussion with Tobi is that we would just discard all the compiled code if debug was enabled on restore. This avoids any number of difficult issues. The checkpoint code uses safepoint exclusive, so all threads will certainly be at an OSR point.

@tajila
Contributor Author

tajila commented Jun 22, 2023

@gacholio that is captured in Irwin's cases 3 and 4. From my understanding, what Irwin is saying is that the JIT either needs to be in FSD mode (non-default) or Voluntary OSR (default) mode for us to decompile the JIT frames on the stack.

The checkpoint code uses safepoint exclusive, so all threads will certainly be at an OSR point.

To me this sounds like we could then use case 4, which is the cheaper option.

@gacholio
Contributor

To me this sounds like we could then use case 4, which is the cheaper option.

The OSR I'm talking about is I believe involuntary, in that we force it on all threads (it's not induced by a failed check in the compiled code). Does involuntary require FSD? I didn't think so.

@dsouzai
Contributor

dsouzai commented Jun 22, 2023

Does involuntary require FSD?

FSD involves involuntary OSR; normal HCR enabled mode uses voluntary OSR.

@gacholio
Contributor

So either we need to start in involuntary mode always (or at least if we want to support the possibility of debug) or add guards at every OSR point to check for the switch (maybe this can be done via the assumptions mechanism?).

@dsouzai
Contributor

dsouzai commented Jun 22, 2023

add guards at every OSR point to check for the switch (maybe this can be done via the assumptions mechanism?).

Well, once the guards are patched it will always transition to the VM. As such, once we enter into debug mode, the entire code cache might as well be discarded (same with the AVL trees). However, if we don't enter debug mode, the code quality should be better than with involuntary osr mode.

Also, with this approach, at the time when the VM wants to stop threads to prepare for checkpoint, if the thread hits some other yield point that isn't an OSR transition point, it needs to be allowed to return back to running JIT'd code; it's only in involuntary osr mode that all yield points are OSR transition points. That's why if we can get away with involuntary osr mode pre-checkpoint, that would be the simplest approach to take.
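The "once the guards are patched it will always transition to the VM" behaviour is effectively a one-way switch, which is why the code cache might as well be discarded at that point. A toy sketch (names are illustrative, not real JIT code):

```java
// Toy model of a patchable OSR guard: a nop (fast path) until the runtime
// patches it, after which every execution takes the slow path to the VM.
public class OsrGuard {
    static volatile boolean patched = false;

    static String execute() {
        if (patched) {
            // Slow path: transition to the interpreter; the compiled body is
            // effectively dead once this is permanent.
            return "transition-to-interpreter";
        }
        return "fast-jitted-path";
    }
}
```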

@tajila
Contributor Author

tajila commented Jun 22, 2023

if the thread hits some other yield point that isn't an OSR transition point, it needs to be allowed to return back to running JIT'd code;

This is the part that is challenging. I'm not sure how we detect this.

@gacholio
Contributor

gacholio commented Jun 22, 2023

As such, once we enter into debug mode, the entire code cache might as well be discarded

I believe we will be reinitializing the send targets for all methods when we restore, which has the effect of abandoning all of the compiled code (by which I mean the interpreter will never invoke it again), so normal CCR should be able to discard the old method bodies once every running invocation has OSRed back to the interpreter.

@gacholio
Contributor

This is the part that is challenging. I'm not sure how we detect this.

Let's not do this - it's essentially another layer of exclusive on top of safepoint, which would be completely unmanageable (I'd already like to see some proof that safepoint is valuable given how many problems it has had).

@dsouzai
Contributor

dsouzai commented Jun 22, 2023

This is the part that is challenging. I'm not sure how we detect this.

@vijaysun-omr could elaborate more on this perhaps, but he did mention that there are only very specific bytecodes that matter for the purpose of (in a normal run) ensuring that we yield to allow a redefinition event (for example, if we're in a loop with no monents/invocations, we need to ensure that we don't loop indefinitely).

If there's some way to identify at the yield point / transition point what the bytecode is supposed to do, we would be able to distinguish between normal yield points and OSR transition points. Of course, the critical point here is that the set of OSR transition points must be the set of yield points that are necessary to ensure a redefinition event. It may also be that when we transition via OSR, we end up in a different place than when we yield via a yield point, so that too could be a distinguishing factor.

That said, I don't know if what I just described is absolutely accurate, so I'll let Vijay clarify.

@vijaysun-omr
Contributor

I am under the impression that under our present default HCR implementation, the VM only allows actual class redefinition to occur at certain yield points, and my understanding is that those yield points are 1) async checks 2) method calls (probably via stack overflow check) and 3) monitor enter.

If this is not how the VM is doing class redefinition, then please clarify. If this is how the VM is doing class redefinition, then I don't understand what more is needed in order to support option 4 in Irwin's post.

@gacholio
Contributor

Redefinition can occur at any place that releases VM access. These would include:

  • stack overflow
  • method entry event (FSD only)
  • async check
  • method call
  • monitor enter
  • field events (FSD only)
  • many JIT helpers (resolve helpers in particular)
  • OOL object allocation (not if safepoint is enabled)
  • method exit event (FSD only)

With some exceptions, if you call out from compiled code, that's a redef point (some JIT helpers will never release VM access, so we'll need to be very careful in future if we change a helper and the JIT has assumed it will not release VM access).

The only practical solution for compiled code is to discard it entirely on restore (i.e. post decompiles for every compiled frame in every thread). This will naturally result in the debug interpreter being invoked after the decompiles.

Safepoint HCR means that object allocation is not an OSR/decompile point (the checkpoint code gets that kind of access if necessary).

The requirement is that we have an OSR block at all of the possible locations that a method could be paused (by safepoint exclusive). I'd rather not rely on guards to accomplish this since it would be very hard to distinguish which points will rely on the guard fail and which need to be forced into OSR.

When we restore, we will mark all frames in all stacks for decompile, and reset all method send targets back to their default (count and compile in the JIT case). Eventually, the obsolete compiled code will be unreferenced and able to be reclaimed.
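The restore sequence just described can be sketched as follows; `Frame`, `Method`, and the string send targets are hypothetical stand-ins for the real VM structures, not OpenJ9 code:

```java
import java.util.List;

// Hypothetical sketch of the restore plan: mark every compiled frame on every
// stack for decompile, and reset all method send targets to count-and-compile,
// leaving the obsolete bodies unreferenced once the decompiles complete.
public class RestoreTransition {
    enum FrameKind { INTERPRETED, JITTED }

    static final class Frame {
        final FrameKind kind;
        boolean markedForDecompile;
        Frame(FrameKind kind) { this.kind = kind; }
    }

    static final class Method {
        String sendTarget = "jitted-body"; // illustrative; real targets are code pointers
    }

    static void restoreAll(List<List<Frame>> stacks, List<Method> methods) {
        for (List<Frame> stack : stacks) {
            for (Frame f : stack) {
                if (f.kind == FrameKind.JITTED) {
                    f.markedForDecompile = true; // posted decompile for this frame
                }
            }
        }
        for (Method m : methods) {
            m.sendTarget = "count-and-compile"; // default JIT send target after reset
        }
    }
}
```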

@vijaysun-omr
Contributor

vijaysun-omr commented Jun 23, 2023

That list of program points in compiled code from @gacholio where class redefinition may occur (ignoring FSD for the moment) is what we used to have, until (as I understand it) some more OSR changes were made to the design a few years ago. The basis of this understanding is this code:

The code under the if-condition I pasted only checks for calls, async checks, and monitor enters as spots where it needs to arrange for OSR transitions ("post execution OSR" there means it will set up the OSR transition after those operations are done and we return back to the JITed code): https://github.com/eclipse/omr/blob/2d5ac63fbe881f0af035ef2732b22f85eb3893dd/compiler/compile/OMRCompilation.cpp#L637

There is also this comment that alludes to what that code does:

// HCR in the new world is only allowed to happen at three kinds of async points:

There must have been some VM code added to ensure we only redefine at those 3 points since the JIT is not in charge of where class redefinition occurs. The point of debate being this category which the above JIT code does not seem to consider anymore as a place where redefinition is possible:

  • many JIT helpers (resolve helpers in particular)
  • OOL object allocation (not if safepoint is enabled)

@gacholio
Contributor

changes were done to the design a few years ago

You are likely referring to safepoint OSR, which only eliminates object allocation from the list of HCR points:

// Check NextGenHCR is supported by the VM
if (!(javaVM->extendedRuntimeFlags & J9_EXTENDED_RUNTIME_OSR_SAFE_POINT) ||
    (*vmHooks)->J9HookDisable(vmHooks, J9HOOK_VM_OBJECT_ALLOCATE_INSTRUMENTABLE) || disableHCR)
   {
   self()->setOption(TR_DisableNextGenHCR);
   }

Looking at the code, in HCR (not FSD) mode, the VM does not force decompile anywhere - it calls jitClassesRedefined with a list of modified classes/methods so the JIT can patch what it needs to.

So, I suppose it's up to the JIT to determine where HCR checks need to be inserted to ensure correctness.

One thing I think we've all forgotten (and I've just remembered) is that HCR does not affect existing frames on the stack. The requirement is that all new method invocations target the most current version of the method.

This may mean that existing HCR/OSR is not sufficient to accomplish what's needed here as we will be unable to simply discard the code cache like we do for FSD (extended) HCR.

@dsouzai
Contributor

dsouzai commented Jun 26, 2023

There are two different concepts at play here:

  • Points at which redefinition can occur
  • Points at which decompilation can occur

In the case of FSD, i.e. when we use Involuntary OSR, the sets of these points end up being the same from the point of view of the JIT because all those yield points mentioned by Gac are decompilation points.

In the case of default HCR, i.e. when we use Voluntary OSR, from the point of view of the JIT, redefinition and decompilation points are not necessarily the same. In general, a thread yields to the VM to allow a STW redefinition event to occur, and then the thread continues executing until it reaches a decompilation point. The only yield points that could be redefinition points are, as Vijay mentioned, asynccheck, calls, and monents. This can be seen here:

https://github.com/eclipse/omr/blob/0c448df41bbd5978cb22dac0fe117febf78010d7/compiler/compile/OMRCompilation.cpp#L637-L665

the selected if branch above is what runs by default.

What Option 4 in #17642 (comment) proposes is to essentially make the set of redefinition points (from the JIT's pov in Voluntary OSR mode) also the set of decompilation points. This can be implemented in two ways:

  1. The Involuntary OSR Mechanism: At these yield points, the VM triggers decompilation
  2. The Voluntary OSR Mechanism: As with HCR, the VM allows the thread to return to executing the compiled body; unlike HCR though, the code generated to trigger a decompilation is emitted right after the yield point.

1 is obviously the cleaner approach, but 2 may be more practical in terms of being able to reuse non-FSD infrastructure.

At any rate, the question of what are redefinition points and what are decompilation points is an orthogonal concern to Option 4 above, which banks on the fact that we must already be able to distinguish between the two for HCR.

All that said, if the assumption that redefinition cannot occur outside of asynccheck, calls, and monents is wrong, then HCR has a longstanding bug independent of the CRIU feature. As far as the JIT is concerned right now, it generates code assuming that redefinition can only occur at these three types of yield points. The code comment linked in #17642 (comment) was first added around May 2016. @gacholio do you know what VM changes were added around that time frame that might explain why that comment exists?
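The two implementation mechanisms described above (1. the VM decompiles at the yield point; 2. the thread returns and code emitted right after the yield point triggers the decompile) can be contrasted in a toy sketch, where a flag stands in for the restore-time decision and all names are hypothetical:

```java
// Toy contrast of the two proposed decompilation mechanisms.
public class DecompileMechanisms {
    static volatile boolean restoreRequestedDebug = false;

    // 1. Involuntary OSR mechanism: the VM itself triggers the decompilation
    //    while the thread is stopped at the yield point.
    static String yieldInvoluntary() {
        return restoreRequestedDebug ? "decompiled-by-vm" : "resume-jitted";
    }

    // 2. Voluntary OSR mechanism: the VM lets the thread return to the compiled
    //    body, but code emitted immediately after the yield point checks the
    //    flag and triggers the decompilation itself.
    static String yieldVoluntary() {
        // ...VM yield completes, control returns to the compiled body...
        if (restoreRequestedDebug) {
            return "self-decompile"; // emitted check right after the yield point
        }
        return "resume-jitted";
    }
}
```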

@gacholio
Contributor

All that said, if the assumption that redefinition cannot occur outside of asynccheck, calls, and monents is wrong, then HCR has a longstanding bug independent of the CRIU feature.

Classically, HCR could occur any time VM access can be released. That includes all of the places (and possibly more) that I detailed above.

The only HCR change I can think of is the safepoint OSR (which I think you refer to as nextGenHCR). This disallows HCR at object allocation points.

When the HCR occurs, the VM does not add any decompilations - it reports the modified classes/methods so the JIT can do the appropriate patching (presumably invalidating calls to any potentially-replaced methods). As stated above, there's no need to decompile when the thread resumes - it's fine to wait until a new method invocation is going to take place (even then, if you know that the invoked method has not been replaced, you can just go ahead and invoke it).

@gacholio
Contributor

It's tempting to use voluntary OSR to let the decompiles trickle in as the compiled code detects the restore, but this won't work properly in the debugger (an obvious example is that the debugger would not be able to query locals in frames that remain compiled without FSD).

I think the only way this will work is to make every escape point (except allocation points in next gen) from the compiled code into an OSR point, and do the force decompile (involuntary) on restore.

@DanHeidinga
Member

It's tempting to use voluntary OSR to let the decompiles trickle in as the compiled code detects the restore, but this won't work properly in the debugger (an obvious example is that the debugger would not be able to query locals in frames that remain compiled without FSD).

@gacholio that sounds a lot like Graeme's @SelectiveDebug technology from many many years ago. Is that a reasonable approach to build off where existing frames are marked in some way to indicate they can't be debugged (and use the correct stackmapper) and new invocations are debuggable?

How valid this is depends on the user requirements but it seems like a reasonable position to me.

@gacholio
Contributor

@SelectiveDebug technology

I don't see the correlation, and I would have to say no to building on top of 20-year-old abandoned tech (I doubt there's even a mention of it left in the codebase). It also does not address my above concern about locals.

@DanHeidinga
Member

We don't want to reuse the @SelectiveDebug tech, but the idea of allowing a mix of debuggable and non-debuggable frames is worth considering. The locals in non-debuggable frames would simply be unavailable - I believe there's an existing JVMTI error (JVMTI_ERROR_OPAQUE_FRAME) to return from the locals-related queries that covers this behaviour.

@dsouzai
Contributor

dsouzai commented Jun 28, 2023

The only HCR change I can think of is the safepoint OSR (which I think you refer to as nextGenHCR). This disallows HCR at object allocation points.

After talking to @jdmpapin and Vijay, I believe that the three types of yield points I mentioned above do cover most of what is handled by safepoints. However, it may be that the resolve helpers are not handled; we'll have to take a look and see if we do handle it in some other way; either way we would have to make them an explicit OSR point.

I think the only way this will work is to make every escape point (except allocation points in next gen) from the compiled code into an OSR point, and do the force decompile (involuntary) on restore.

Yeah that sounds right. Actually, additionally we need to make these points also the only points that a thread can yield to allow a checkpoint. Essentially, in Option 4, we need to have the set of Redefinition Points (Escape Points/HCR Points), the set of OSR Points (Involuntary OSR Transition Points), and the set of Checkpoint Points be the same set of points.

Overall though, I do agree that if FSD compliant code pre-checkpoint is sufficient then we should just stick to that.

@dsouzai
Contributor

dsouzai commented Jul 5, 2023

I launched some perf runs to measure the impact of generating FSD compliant code. I ran the pingperf and restcrud apps; as the names suggest, pingperf is a simple OpenLiberty app that responds to a request with a response, whereas restcrud queries a postgres db and returns the results.

I had 3 builds:

  1. Baseline
  2. FSD Always - the JIT always generates FSD compliant code, even post-restore
  3. FSD Pre-checkpoint - the JIT only generates FSD compliant code pre-checkpoint

pingperf

| Build | Startup Slowdown | First Response Slowdown |
| --- | --- | --- |
| FSD Always | 5% | 4% |
| FSD Pre-checkpoint | 4% | 3% |

restcrud

| Build | Startup Slowdown | First Response Slowdown |
| --- | --- | --- |
| FSD Always | 2.5% | 15% |
| FSD Pre-checkpoint | 2.5% | 2% |

From the looks of things, the FSD approach (i.e. Option 3) appears to be sufficient to enable debug post-restore.

That said, there are some things that we need to address.

  1. If debug is not enabled post-restore, then in order to support "normal" HCR redefinition, at any yield point (not just the safepoints mentioned above), the VM will need to check a (yet to be defined) flag in the method's metadata to see if it is a FSD body; if it is, the VM will have to trigger an OSR transition. This is because FSD bodies do not have any OSR guards. Essentially, we have never been in a situation where we have Involuntary and Voluntary OSR method bodies at the same time.
  2. I have to ensure that throughput is not affected. I did some initial throughput runs, and it turns out that the FSD Pre-checkpoint build is not much better than the FSD Always build, which is ~40% worse than baseline. Part of this comes from the fact that the FSD compliant bodies do not get recompiled (because up until this point, we never had the need for FSD bodies to exist except if we knew for certain debug was enabled). However, this doesn't explain the entire gap, and so I need to investigate further; it may be that there are other optimizations that get disabled under FSD that I missed in the post-restore options processing where I reset the FSD flag.

gacholio added a commit to gacholio/openj9 that referenced this issue Feb 6, 2024
Add a new private flag which instructs the interpreter to exit and
re-invoke itself. This will be used by CRIU when a restored image
requests debug capabilities (by changing the interpreter entry point to
the debug interpreter).

Related: eclipse-openj9#17642

Signed-off-by: Graham Chapman <graham_chapman@ca.ibm.com>
singh264 added a commit to singh264/openj9 that referenced this issue Mar 5, 2024
Support transition to debug interpreter on restore when
the transition is requested with an env var file or an
options file.

Issue: eclipse-openj9#17642
Co-authored-by: Tobi Ajila <atobia@ca.ibm.com>
Signed-off-by: Amarpreet Singh <Amarpreet.A.Singh@ibm.com>
@singh264
Contributor

singh264 commented Jun 5, 2024

What is the current status of this issue?

@dsouzai
Contributor

dsouzai commented Jun 6, 2024

The compiler work is being tracked here #18866 (it does include some VM pre-requisites). For the most part, the compiler functional work is done, but we still need to reduce the footprint gap caused by generating FSD pre-checkpoint.

@singh264
Contributor

singh264 commented Jun 6, 2024

Is now a good time to address the VM pre-requisites, given that we still need to reduce the footprint gap caused by generating FSD pre-checkpoint?

@dsouzai
Contributor

dsouzai commented Jun 6, 2024

The footprint gap and the VM pre-requisites are independent; the work to reduce the footprint gap is not going to be impacted by the necessary VM changes.

That said you should probably coordinate with @JasonFengJ9 since I believe he's working on the VM side debug on restore work.

@singh264
Contributor

singh264 commented Jun 6, 2024

@JasonFengJ9 how can I potentially assist with the debug on restore work?

@JasonFengJ9
Member

The first openj9 portion of the debug on restore work was

The corresponding extension repo PR (initially opened by Mike Z., now with my changes) is awaiting review

I have a draft PR for the second openj9 PR which is being tuned according to Irwin's perf results, the ETA is next week or so.

There are quite a few other CRIU open issues, please talk to @tajila for a suitable task.

@singh264
Contributor

singh264 commented Jun 6, 2024

A suitable task was discussed after I talked to @tajila.

@singh264
Contributor

singh264 commented Jul 9, 2024

How can I contribute to the task?

@tajila
Contributor Author

tajila commented Jul 9, 2024

@singh264 I've assigned #19835 to you.
