SPMI asmdiff errors on Arm64 Linux #91257

Closed
a74nh opened this issue Aug 29, 2023 · 10 comments · Fixed by #91783
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)


@a74nh
Contributor

a74nh commented Aug 29, 2023

Description

On Arm64 Linux, running SPMI asmdiffs causes errors:

[15:39:34] ERROR: Couldn't load base metrics summary created by child process

And the script fails with no diffs.

Reproduction Steps

./build.sh -rc checked -lc release -s clr+libs
./src/tests/build.sh generatelayoutonly Checked
python3 ./src/coreclr/scripts/superpmi.py asmdiffs -build_type Checked -log_level debug

Expected behavior

SPMI will run to completion without any errors

Actual behavior

Full log file: superpmi.4.log

Regression?

Used to work.

Known Workarounds

None.

Configuration

runtime HEAD:

3bda6e0013d 2023-08-21.. Vladimir Sadov  [NativeAOT] Missing memory fence before bulk move of objects (#90890) HEAD -> main, origin/main, origin/HEAD

Linux Ubuntu 22.04, Arm64 Altra.

Other information

Works on: Linux Ubuntu 22.04, X64
Works on: Windows, Arm64

SPMI replay works on Linux, Arm64

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Aug 29, 2023
@ghost

ghost commented Aug 29, 2023

Tagging subscribers to this area: @hoyosjs
See info in area-owners.md if you want to be subscribed.


@jakobbotsch
Member

What happens when you run the command it is failing on manually without parallelism? I.e.

/home/alahay01/dotnet/runtime_base/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/superpmi -a -jitoption force JitAlignLoops=0 -jitoption force JitEnableNoWayAssert=1 -jitoption force JitNoForceFallback=1 -jit2option force JitAlignLoops=0 -jit2option force JitEnableNoWayAssert=1 -jit2option force JitNoForceFallback=1 /home/alahay01/dotnet/runtime_base/artifacts/spmi/basejit/31234a863efe1a4dc1c6f4f1520f8515d5a90640.linux.arm64.Checked/libclrjit.so /home/alahay01/dotnet/runtime_base/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/libclrjit.so /home/alahay01/dotnet/runtime_base/artifacts/spmi/mch/4bceb905-d550-4a5d-b1eb-276fff68d183.linux.arm64/libraries_tests.pmi.linux.arm64.checked.mch

@jakobbotsch jakobbotsch added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-Infrastructure-coreclr untriaged New issue has not been triaged by the area owner labels Aug 29, 2023
@ghost

ghost commented Aug 29, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@a74nh
Contributor Author

a74nh commented Aug 29, 2023

What happens when you run the command it is failing on manually without parallelism? I.e.

/home/alahay01/dotnet/runtime_base/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/superpmi -a -jitoption force JitAlignLoops=0 -jitoption force JitEnableNoWayAssert=1 -jitoption force JitNoForceFallback=1 -jit2option force JitAlignLoops=0 -jit2option force JitEnableNoWayAssert=1 -jit2option force JitNoForceFallback=1 /home/alahay01/dotnet/runtime_base/artifacts/spmi/basejit/31234a863efe1a4dc1c6f4f1520f8515d5a90640.linux.arm64.Checked/libclrjit.so /home/alahay01/dotnet/runtime_base/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/libclrjit.so /home/alahay01/dotnet/runtime_base/artifacts/spmi/mch/4bceb905-d550-4a5d-b1eb-276fff68d183.linux.arm64/libraries_tests.pmi.linux.arm64.checked.mch

I eventually get a segfault....

 40.9% - Loaded 120500  Jitted 120500  Diffs 0  FailedCompile 0 at 167 per second
 41.1% - Loaded 121000  Jitted 121000  Diffs 0  FailedCompile 0 at 101 per second
 41.4% - Loaded 121500  Jitted 121500  Diffs 0  FailedCompile 0 at 111 per second
 41.6% - Loaded 122000  Jitted 122000  Diffs 0  FailedCompile 0 at 119 per second
 41.8% - Loaded 122500  Jitted 122500  Diffs 0  FailedCompile 0 at 122 per second
[1]    438200 segmentation fault (core dumped)   -a -jitoption force JitAlignLoops=0 -jitoption force JitEnableNoWayAssert=1

@jakobbotsch
Member

Do you have a backtrace? Generally SPMI should handle exceptions occurring within the JIT, so this might be a segfault in SPMI itself.

@a74nh
Contributor Author

a74nh commented Aug 30, 2023

Not sure why it wasn't saving a coredump (I had ulimit -c unlimited set)

Running in gdb:

Thread 1 "superpmi" received signal SIGSEGV, Segmentation fault.
LightWeightMap<unsigned int, Agnostic_CompileMethodResults>::Get (this=0x0, key=0) at /home/alahay01/dotnet/runtime_base/src/coreclr/tools/superpmi/superpmi/../superpmi-shared/lightweightmap.h:442
442	        int index = GetIndex(key);
(gdb) bt
#0  LightWeightMap<unsigned int, Agnostic_CompileMethodResults>::Get (this=0x0, key=0)
    at /home/alahay01/dotnet/runtime_base/src/coreclr/tools/superpmi/superpmi/../superpmi-shared/lightweightmap.h:442
#1  0x0000aaaaaab43644 in CompileResult::repCompileMethod (this=<optimized out>, nativeEntry=0xffffffffe480,
    nativeSizeOfCode=0xffffffffe3ec, result=0xffffffffe3e8)
    at /home/alahay01/dotnet/runtime_base/src/coreclr/tools/superpmi/superpmi-shared/compileresult.cpp:412
#2  0x0000aaaaaab389c0 in NearDiffer::compare (this=0xffffffffe790, mc=0xaaaac3285db0, cr1=0xaaaac32ed210, cr2=0xaaaac33b7e50)
    at /home/alahay01/dotnet/runtime_base/src/coreclr/tools/superpmi/superpmi/neardiffer.cpp:1258
#3  0x0000aaaaaab3c938 in InvokeNearDiffer(NearDiffer*, MethodContext**, CompileResult**, MethodContextReader**)::$_1::operator()(InvokeNearDiffer(NearDiffer*, MethodContext**, CompileResult**, MethodContextReader**)::Param*) const (pParam=0xffffffffe5c0,
    this=<optimized out>) at /home/alahay01/dotnet/runtime_base/src/coreclr/tools/superpmi/superpmi/superpmi.cpp:100
#4  InvokeNearDiffer (nearDiffer=nearDiffer@entry=0xffffffffe790, mc=mc@entry=0xffffffffe878, crl=crl@entry=0xffffffffe6f8,
    reader=reader@entry=0xffffffffe7a8) at /home/alahay01/dotnet/runtime_base/src/coreclr/tools/superpmi/superpmi/superpmi.cpp:111
#5  0x0000aaaaaab3bdb4 in main (argc=<optimized out>, argv=<optimized out>)
    at /home/alahay01/dotnet/runtime_base/src/coreclr/tools/superpmi/superpmi/superpmi.cpp:605

CompileMethod is null inside CompileResult::repCompileMethod() (note this=0x0 in frame #0).

Curiously, the caller, NearDiffer::compare, is inside an Arm64-only block at the time:

// On Arm64 the constant pool is appended at the end of the method code section, hence hotCodeSize_{1,2}
// is a sum of their sizes. The following is to adjust their sizes and the roDataBlock_{1,2} pointers.

@a74nh
Contributor Author

a74nh commented Aug 30, 2023

I'm not sure exactly what the code is doing (does repCompileMethod repeat a compile?), but I tried copying from recCompileMethod():

diff --git a/src/coreclr/tools/superpmi/superpmi-shared/compileresult.cpp b/src/coreclr/tools/superpmi/superpmi-shared/compileresult.cpp
index f9ecfa487c4..5350bcf28a4 100644
--- a/src/coreclr/tools/superpmi/superpmi-shared/compileresult.cpp
+++ b/src/coreclr/tools/superpmi/superpmi-shared/compileresult.cpp
@@ -408,6 +408,9 @@ void CompileResult::dmpCompileMethod(DWORD key, const Agnostic_CompileMethodResu
 }
 void CompileResult::repCompileMethod(BYTE** nativeEntry, ULONG* nativeSizeOfCode, CorJitResult* result)
 {
+        if (CompileMethod == nullptr)
+        CompileMethod = new LightWeightMap<DWORD, Agnostic_CompileMethodResults>();
+
     Agnostic_CompileMethodResults value;
     value             = CompileMethod->Get(0);
     *nativeEntry      = (BYTE*)value.nativeEntry;

But that (as probably expected) gives the following error from the get:

ERROR: main method 122801 of size 162 failed to load and compile correctly.
ERROR: Exception thrown: SuperPMI assertion 'index != -1' failed (Didn't find Item (in Get))
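Why allocating the map on demand only moves the failure can be shown with a small model. This is a minimal sketch, not the real LightWeightMap: `RecordedResults` and `ReplayWithLazyAlloc` are hypothetical names invented for illustration, and an exception stands in for the SuperPMI assertion. The assumption is that nothing was ever recorded under key 0, so a freshly created map has nothing to return.

```cpp
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <unordered_map>

// Hypothetical stand-in for the recorded-results map; NOT the real LightWeightMap.
struct RecordedResults {
    std::unordered_map<std::uint32_t, std::uint32_t> items;

    // Models the reported behaviour: a missing key is a hard error, not a default value.
    std::uint32_t Get(std::uint32_t key) const {
        auto it = items.find(key);
        if (it == items.end())
            throw std::runtime_error("Didn't find Item (in Get)");
        return it->second;
    }
};

// Hypothetical model of the attempted workaround: allocate the map if it is null, then look up key 0.
std::uint32_t ReplayWithLazyAlloc(std::unique_ptr<RecordedResults>& recorded) {
    if (!recorded)
        recorded = std::make_unique<RecordedResults>(); // avoids the null dereference...
    return recorded->Get(0); // ...but the new map is empty, so the lookup still fails
}
```

In this model the null dereference is gone, but the lookup fails instead, which matches the shape of the 'index != -1' error above.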

Then it errors a bit further along:

 42.9% - Loaded 129000  Jitted 129000  Diffs 0  FailedCompile 0 at 465 per second
 43.0% - Loaded 129500  Jitted 129500  Diffs 0  FailedCompile 0 at 219 per second

Thread 1 "superpmi" received signal SIGTRAP, Trace/breakpoint trap.
0x0000fffff5b6a6d4 in ?? () from /home/alahay01/dotnet/runtime_base/artifacts/spmi/basejit/31234a863efe1a4dc1c6f4f1520f8515d5a90640.linux.arm64.Checked/libclrjit.so
(gdb) bt
#0  0x0000fffff5b6a6d4 in ?? ()
   from /home/alahay01/dotnet/runtime_base/artifacts/spmi/basejit/31234a863efe1a4dc1c6f4f1520f8515d5a90640.linux.arm64.Checked/libclrjit.so
#1  0x0000fffff5b0905c in ?? ()
   from /home/alahay01/dotnet/runtime_base/artifacts/spmi/basejit/31234a863efe1a4dc1c6f4f1520f8515d5a90640.linux.arm64.Checked/libclrjit.so
#2  0x0000fffff5921328 in ?? ()
   from /home/alahay01/dotnet/runtime_base/artifacts/spmi/basejit/31234a863efe1a4dc1c6f4f1520f8515d5a90640.linux.arm64.Checked/libclrjit.so
#3  0x0000fffff591c4c8 in ?? ()
   from /home/alahay01/dotnet/runtime_base/artifacts/spmi/basejit/31234a863efe1a4dc1c6f4f1520f8515d5a90640.linux.arm64.Checked/libclrjit.so
#4  0x0000fffff5946068 in ?? ()
   from /home/alahay01/dotnet/runtime_base/artifacts/spmi/basejit/31234a863efe1a4dc1c6f4f1520f8515d5a90640.linux.arm64.Checked/libclrjit.so
#5  0x0000fffff5949adc in ?? ()
   from /home/alahay01/dotnet/runtime_base/artifacts/spmi/basejit/31234a863efe1a4dc1c6f4f1520f8515d5a90640.linux.arm64.Checked/libclrjit.so
#6  0x0000fffff594c558 in ?? ()
   from /home/alahay01/dotnet/runtime_base/artifacts/spmi/basejit/31234a863efe1a4dc1c6f4f1520f8515d5a90640.linux.arm64.Checked/libclrjit.so
#7  0x0000fffff58c01b8 in ?? ()
   from /home/alahay01/dotnet/runtime_base/artifacts/spmi/basejit/31234a863efe1a4dc1c6f4f1520f8515d5a90640.linux.arm64.Checked/libclrjit.so
#8  0x0000fffff5a21fc0 in ?? ()
   from /home/alahay01/dotnet/runtime_base/artifacts/spmi/basejit/31234a863efe1a4dc1c6f4f1520f8515d5a90640.linux.arm64.Checked/libclrjit.so
#9  0x0000fffff585ca9c in ?? ()
   from /home/alahay01/dotnet/runtime_base/artifacts/spmi/basejit/31234a863efe1a4dc1c6f4f1520f8515d5a90640.linux.arm64.Checked/libclrjit.so
#10 0x0000fffff5861d94 in ?? ()
   from /home/alahay01/dotnet/runtime_base/artifacts/spmi/basejit/31234a863efe1a4dc1c6f4f1520f8515d5a90640.linux.arm64.Checked/libclrjit.so

@jakobbotsch
Member

Does the following patch fix the problem:

diff --git a/src/coreclr/tools/superpmi/superpmi/neardiffer.cpp b/src/coreclr/tools/superpmi/superpmi/neardiffer.cpp
index 4fa300725cb..c80970f538c 100644
--- a/src/coreclr/tools/superpmi/superpmi/neardiffer.cpp
+++ b/src/coreclr/tools/superpmi/superpmi/neardiffer.cpp
@@ -1247,28 +1247,29 @@ bool NearDiffer::compare(MethodContext* mc, CompileResult* cr1, CompileResult* c
     // is a sum of their sizes. The following is to adjust their sizes and the roDataBlock_{1,2} pointers.
     if (GetSpmiTargetArchitecture() == SPMI_TARGET_ARCHITECTURE_ARM64)
     {
-        BYTE*        nativeEntry_1;
-        ULONG        nativeSizeOfCode_1;
-        CorJitResult jitResult_1;
+        if (hotCodeSize_1 > 0)
+        {
+            BYTE* nativeEntry_1;
+            ULONG        nativeSizeOfCode_1;
+            CorJitResult jitResult_1;
+            cr1->repCompileMethod(&nativeEntry_1, &nativeSizeOfCode_1, &jitResult_1);
+            roDataSize_1 = hotCodeSize_1 - nativeSizeOfCode_1;
+            roDataBlock_1 = hotCodeBlock_1 + nativeSizeOfCode_1;
+            orig_roDataBlock_1 = (void*)((size_t)orig_hotCodeBlock_1 + nativeSizeOfCode_1);
+            hotCodeSize_1 = nativeSizeOfCode_1;
+        }
 
-        BYTE*        nativeEntry_2;
-        ULONG        nativeSizeOfCode_2;
-        CorJitResult jitResult_2;
-
-        cr1->repCompileMethod(&nativeEntry_1, &nativeSizeOfCode_1, &jitResult_1);
-        cr2->repCompileMethod(&nativeEntry_2, &nativeSizeOfCode_2, &jitResult_2);
-
-        roDataSize_1 = hotCodeSize_1 - nativeSizeOfCode_1;
-        roDataSize_2 = hotCodeSize_2 - nativeSizeOfCode_2;
-
-        roDataBlock_1 = hotCodeBlock_1 + nativeSizeOfCode_1;
-        roDataBlock_2 = hotCodeBlock_2 + nativeSizeOfCode_2;
-
-        orig_roDataBlock_1 = (void*)((size_t)orig_hotCodeBlock_1 + nativeSizeOfCode_1);
-        orig_roDataBlock_2 = (void*)((size_t)orig_hotCodeBlock_2 + nativeSizeOfCode_2);
-
-        hotCodeSize_1 = nativeSizeOfCode_1;
-        hotCodeSize_2 = nativeSizeOfCode_2;
+        if (hotCodeSize_2 > 0)
+        {
+            BYTE* nativeEntry_2;
+            ULONG        nativeSizeOfCode_2;
+            CorJitResult jitResult_2;
+            cr2->repCompileMethod(&nativeEntry_2, &nativeSizeOfCode_2, &jitResult_2);
+            roDataSize_2 = hotCodeSize_2 - nativeSizeOfCode_2;
+            roDataBlock_2 = hotCodeBlock_2 + nativeSizeOfCode_2;
+            orig_roDataBlock_2 = (void*)((size_t)orig_hotCodeBlock_2 + nativeSizeOfCode_2);
+            hotCodeSize_2 = nativeSizeOfCode_2;
+        }
     }
 
     LogDebug("HCS1 %d CCS1 %d RDS1 %d xcpnt1 %d flag1 %08X, HCB %p CCB %p RDB %p ohcb %p occb %p odb %p", hotCodeSize_1,

It is likely a regression introduced by #89654, though it is strange to me that it works on win-arm64 but not linux-arm64 if so. I will try to take a closer look once I have some free cycles.

@a74nh
Contributor Author

a74nh commented Aug 30, 2023

Does the following patch fix the problem:

That seems to fix it!

Thread 1 "superpmi" received signal SIGTRAP, Trace/breakpoint trap.

Running inside gdb with the patch, I still get the sigtraps, but I'm assuming that is expected behaviour.

it is strange to me that it works on win-arm64 but not linux-arm64

On Windows we were running the Windows .mch files. It quite possibly falls over on Windows too if you give it the Linux .mch files.

I will try to take a closer look once I have some free cycles.

I can use the above fix for now, and will await the real patch. Thanks.

@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Aug 31, 2023
@jakobbotsch jakobbotsch self-assigned this Sep 8, 2023
jakobbotsch added a commit to jakobbotsch/runtime that referenced this issue Sep 8, 2023
After dotnet#89654 SPMI replay will succeed instead of resulting in replay errors
in expected error cases (such as BADCODE or EE exception). To support
diffing such contexts, we record zero-sized assembly that the near
differ uses. However, on arm64 there is some additional code that calls
repCompileMethod to make some additional adjustments to the code blob,
and in the "EE exception" cases we cannot replay this function,
resulting in a crash during asmdiff. This fixes the problem by only making
the adjustments when we know there is any code.

An alternative solution could be to avoid invoking the neardiffer at all
in the succeeding error cases, but this seemed like an ok pragmatic
solution.

Fix dotnet#91257
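
The guard described in this commit message can be sketched in isolation. This is a simplified model under stated assumptions, not the code in neardiffer.cpp: `HotSection` and `SplitArm64HotSection` are hypothetical names, and the callback stands in for whatever reports the native code size (repCompileMethod in the real code). The point is that the size query is skipped entirely when the recorded hot section is zero-sized.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical view of one compile result's hot section; field names are illustrative.
struct HotSection {
    const std::uint8_t* hotCodeBlock;
    std::size_t         hotCodeSize; // on Arm64: code bytes plus appended constant pool bytes
    const std::uint8_t* roDataBlock;
    std::size_t         roDataSize;
};

// Split the combined section into code and read-only data.
// getNativeSizeOfCode is only invoked when there is code to split.
template <typename GetNativeSize>
void SplitArm64HotSection(HotSection& s, GetNativeSize getNativeSizeOfCode) {
    if (s.hotCodeSize == 0)
        return; // zero-sized assembly: nothing was recorded, so do not query the size

    const std::size_t nativeSizeOfCode = getNativeSizeOfCode();
    s.roDataSize  = s.hotCodeSize - nativeSizeOfCode; // the remainder is the constant pool
    s.roDataBlock = s.hotCodeBlock + nativeSizeOfCode;
    s.hotCodeSize = nativeSizeOfCode;
}
```

For example, a 16-byte section whose native code size is 12 is split into 12 bytes of code and 4 bytes of read-only data, while a zero-sized section is left untouched and the callback is never called.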
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 8, 2023
jakobbotsch added a commit that referenced this issue Sep 12, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 12, 2023
@a74nh
Contributor Author

a74nh commented Sep 15, 2023

Confirmed this works for me now. Thanks!

@ghost ghost locked as resolved and limited conversation to collaborators Oct 15, 2023
a74nh pushed a commit to a74nh/runtime that referenced this issue Dec 20, 2023
…#91783)
