Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failure: JIT/opt/Vectorization/UnrollEqualsStartsWIth/UnrollEqualsStartsWIth.dll #86265

Closed
BruceForstall opened this issue May 15, 2023 · 25 comments · Fixed by #90937
Closed
Assignees
Labels
area-VM-coreclr blocking-clean-ci-optional Blocking optional rolling runs
Milestone

Comments

@BruceForstall
Copy link
Member

https://dev.azure.com/dnceng-public/public/_build/results?buildId=272907&view=ms.vss-test-web.build-test-results-tab

coreclr windows x86 Checked jitstress2_jitstressregs0x1000 @ Windows.10.Amd64.Open

set DOTNET_TieredCompilation=0
set DOTNET_JitStress=2
set DOTNET_JitStressRegs=0x1000

Looks like failure of this test brought down the whole run.

18:47:18.821 Running test: JIT/opt/Vectorization/UnrollEqualsStartsWIth/UnrollEqualsStartsWIth.dll
App Exit Code: 3
Expected: 100
Actual: 3
END EXECUTION - FAILED
FAILED
[XUnitLogChecker]: Item 'JIT.opt' did not finish running. Checking and fixing the log...
[XUnitLogChecker]: XUnit log file has been fixed!

211/292 tests run.
* 210 tests passed.
* 0 tests failed.
* 1 tests skipped.

[XUnitLogChecker]: Checking for dumps...
[XUnitLogChecker]: No crash dumps found. Continuing...
[XUnitLogChecker]: Finished!

A different run

coreclr windows x86 Checked jitstress2_jitstressregs1 @ Windows.10.Amd64.Open

with:

set DOTNET_TieredCompilation=0
set DOTNET_JitStress=2
set DOTNET_JitStressRegs=1

crashed:

18:57:33.193 Running test: JIT/opt/Vectorization/UnrollEqualsStartsWIth/UnrollEqualsStartsWIth.dll

Assert failure(PID 3456 [0x00000d80], Thread: 4432 [0x1150]): !PreemptiveGCDisabled()

CORECLR! Thread::DetachThread + 0xCF (0x727a9aa5)
CORECLR! TlsDestructionMonitor::~TlsDestructionMonitor + 0x8F (0x72bf72f5)
CORECLR! _dyn_tls_dtor + 0x86 (0x72c099b6)
NTDLL! RtlDecompressBuffer + 0xDE (0x77aaea4e)
NTDLL! LdrShutdownThread + 0x386 (0x77a7eeb6)
NTDLL! LdrSetAppCompatDllRedirectionCallback + 0x1052A (0x77ad3c6a)
NTDLL! LdrShutdownProcess + 0x15F (0x77a92d7f)
NTDLL! RtlExitUserProcess + 0x96 (0x77a93886)
KERNEL32! ExitProcess + 0x13 (0x7795b3d3)
CORECLR! exit_or_terminate_process + 0x38 (0x72c43f08)
    File: D:\a\_work\1\s\src\coreclr\vm\threads.cpp Line: 979
    Image: C:\h\w\AA420919\p\corerun.exe

App Exit Code: -1073740286
Expected: 100
Actual: -1073740286
END EXECUTION - FAILED
FAILED
[XUnitLogChecker]: Item 'JIT.opt' did not finish running. Checking and fixing the log...
[XUnitLogChecker]: XUnit log file has been fixed!

211/292 tests run.
* 210 tests passed.
* 0 tests failed.
* 1 tests skipped.

[XUnitLogChecker]: Checking for dumps...
[XUnitLogChecker]: No crash dumps found. Continuing...
[XUnitLogChecker]: Finished!

Another failure:

https://dev.azure.com/dnceng-public/public/_build/results?buildId=273272&view=ms.vss-test-web.build-test-results-tab

coreclr windows x86 Checked jitstress2_jitstressregs8 @ Windows.10.Amd64.Open

set DOTNET_TieredCompilation=0
set DOTNET_JitStress=2
set DOTNET_JitStressRegs=8
18:45:54.299 Running test: JIT/opt/Vectorization/UnrollEqualsStartsWIth/UnrollEqualsStartsWIth.dll
App Exit Code: 3
Expected: 100
Actual: 3
END EXECUTION - FAILED
FAILED
[XUnitLogChecker]: Item 'JIT.opt' did not finish running. Checking and fixing the log...
[XUnitLogChecker]: XUnit log file has been fixed!

211/292 tests run.
* 210 tests passed.
* 0 tests failed.
* 1 tests skipped.

[XUnitLogChecker]: Checking for dumps...
[XUnitLogChecker]: No crash dumps found. Continuing...
[XUnitLogChecker]: Finished!

@markples Another issue with merging?

cc @EgorBo

@BruceForstall BruceForstall added JitStress CLR JIT issues involving JIT internal stress modes area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI blocking-clean-ci-optional Blocking optional rolling runs labels May 15, 2023
@BruceForstall BruceForstall added this to the 8.0.0 milestone May 15, 2023
@ghost
Copy link

ghost commented May 15, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

https://dev.azure.com/dnceng-public/public/_build/results?buildId=272907&view=ms.vss-test-web.build-test-results-tab

coreclr windows x86 Checked jitstress2_jitstressregs0x1000 @ Windows.10.Amd64.Open

set DOTNET_TieredCompilation=0
set DOTNET_JitStress=2
set DOTNET_JitStressRegs=0x1000

Looks like failure of this test brought down the whole run.

18:47:18.821 Running test: JIT/opt/Vectorization/UnrollEqualsStartsWIth/UnrollEqualsStartsWIth.dll
App Exit Code: 3
Expected: 100
Actual: 3
END EXECUTION - FAILED
FAILED
[XUnitLogChecker]: Item 'JIT.opt' did not finish running. Checking and fixing the log...
[XUnitLogChecker]: XUnit log file has been fixed!

211/292 tests run.
* 210 tests passed.
* 0 tests failed.
* 1 tests skipped.

[XUnitLogChecker]: Checking for dumps...
[XUnitLogChecker]: No crash dumps found. Continuing...
[XUnitLogChecker]: Finished!

A different run

coreclr windows x86 Checked jitstress2_jitstressregs1 @ Windows.10.Amd64.Open

with:

set DOTNET_TieredCompilation=0
set DOTNET_JitStress=2
set DOTNET_JitStressRegs=1

crashed:

18:57:33.193 Running test: JIT/opt/Vectorization/UnrollEqualsStartsWIth/UnrollEqualsStartsWIth.dll

Assert failure(PID 3456 [0x00000d80], Thread: 4432 [0x1150]): !PreemptiveGCDisabled()

CORECLR! Thread::DetachThread + 0xCF (0x727a9aa5)
CORECLR! TlsDestructionMonitor::~TlsDestructionMonitor + 0x8F (0x72bf72f5)
CORECLR! _dyn_tls_dtor + 0x86 (0x72c099b6)
NTDLL! RtlDecompressBuffer + 0xDE (0x77aaea4e)
NTDLL! LdrShutdownThread + 0x386 (0x77a7eeb6)
NTDLL! LdrSetAppCompatDllRedirectionCallback + 0x1052A (0x77ad3c6a)
NTDLL! LdrShutdownProcess + 0x15F (0x77a92d7f)
NTDLL! RtlExitUserProcess + 0x96 (0x77a93886)
KERNEL32! ExitProcess + 0x13 (0x7795b3d3)
CORECLR! exit_or_terminate_process + 0x38 (0x72c43f08)
    File: D:\a\_work\1\s\src\coreclr\vm\threads.cpp Line: 979
    Image: C:\h\w\AA420919\p\corerun.exe

App Exit Code: -1073740286
Expected: 100
Actual: -1073740286
END EXECUTION - FAILED
FAILED
[XUnitLogChecker]: Item 'JIT.opt' did not finish running. Checking and fixing the log...
[XUnitLogChecker]: XUnit log file has been fixed!

211/292 tests run.
* 210 tests passed.
* 0 tests failed.
* 1 tests skipped.

[XUnitLogChecker]: Checking for dumps...
[XUnitLogChecker]: No crash dumps found. Continuing...
[XUnitLogChecker]: Finished!

Another failure:

https://dev.azure.com/dnceng-public/public/_build/results?buildId=273272&view=ms.vss-test-web.build-test-results-tab

coreclr windows x86 Checked jitstress2_jitstressregs8 @ Windows.10.Amd64.Open

set DOTNET_TieredCompilation=0
set DOTNET_JitStress=2
set DOTNET_JitStressRegs=8
18:45:54.299 Running test: JIT/opt/Vectorization/UnrollEqualsStartsWIth/UnrollEqualsStartsWIth.dll
App Exit Code: 3
Expected: 100
Actual: 3
END EXECUTION - FAILED
FAILED
[XUnitLogChecker]: Item 'JIT.opt' did not finish running. Checking and fixing the log...
[XUnitLogChecker]: XUnit log file has been fixed!

211/292 tests run.
* 210 tests passed.
* 0 tests failed.
* 1 tests skipped.

[XUnitLogChecker]: Checking for dumps...
[XUnitLogChecker]: No crash dumps found. Continuing...
[XUnitLogChecker]: Finished!

@markples Another issue with merging?

cc @EgorBo

Author: BruceForstall
Assignees: -
Labels:

JitStress, area-CodeGen-coreclr, blocking-clean-ci-optional

Milestone: 8.0.0

@EgorBo
Copy link
Member

EgorBo commented May 15, 2023

Note: this test may be quite slow in stress modes, although, it used to work fine

@markples
Copy link
Member

Do you think there is a JitStress/merged test group interaction here?

This test is interesting. It internally has a merged test group implementation and apparently has 127512 tests. They are executed via reflection, and a normal failure would cause one of the internal (sub-)batches to be terminated. A crash will take out the entire group. This infrastructure seems much heavier than the new wrapper code.

I think converting this to xunit-style tests would reduce the code, though a 100k+ test group would be a new stress to the system.

To be clear, I'm not suggesting any of that as a fix for the test failure - just something that I noticed while looking at the test.

@markples
Copy link
Member

markples commented May 16, 2023

This has been inconsistent to repro across configurations. I have gotten it with -c Checked -lc Release and only in the merged group (and not in a "merged" group with just the one test. Visual Studio shuts down when this test is executed (and the report bug mechanism isn't properly searching for matching bugs for me). It isn't reproing under windbg.

There are a number of exception that are thrown and caught throughout the merged group - divide by zeros and access violations (for null). [edit - was incorrect - This test itself is as well (though I need to double check because I don't see null in the test list] Is there possibly an interaction between accessing null literals and jitstress? (or somehow a history of exceptions given the merged group interaction, but perhaps that is misleading somehow).

It seems like there is an actual problem even if merged groups are exposing it, though since it is jitstress it might be a jitstress issue and not a product one.

@markples
Copy link
Member

The actual failure that the logs don't seem to capture (or maybe I've ended up in a different configuration?) is an assertion failure:

src/coreclr/vm/siginfo.cpp line 4928

void PromoteCarefully(promote_func   fn,
                      PTR_PTR_Object ppObj,
                      ScanContext*   sc,
                      uint32_t       flags /* = GC_CALL_INTERIOR*/ )
{
    LIMITED_METHOD_CONTRACT;

    //
    // Sanity check that the flags contain only these three values
    //
    assert((flags & ~(GC_CALL_INTERIOR|GC_CALL_PINNED)) == 0);

    //
    // Sanity check that GC_CALL_INTERIOR FLAG is set
    //
    assert(flags & GC_CALL_INTERIOR);

#if !defined(DACCESS_COMPILE)

    //
    // Sanity check the stack scan limit
    //
>>  assert(sc->stack_limit != 0);  <<

This suggests that the difficulty reproing it might have to do with GC timing. Does jitstress do anything odd with GC (triggering, reporting, etc.)? I.e., is it likely a jitstress or product issue?

I had previous seen a bit of a pattern where I can get the failure to happen the first time after building in src/tests but not again after that. It wasn't 100%, and it seems like confusing information since the test runs for a few seconds before failing (which makes some sort of cold start seem less likely).

It's certainly an interesting assertion to hit, though.

@BruceForstall
Copy link
Member Author

Does jitstress do anything odd with GC (triggering, reporting, etc.)? I.e., is it likely a jitstress or product issue?

It appears to hit with multiple different JitStressRegs values. Maybe it does or would hit without JitStressRegs set if run enough?

It also appears to hit only with JitStress=2?

JitStress doesn't do anything specific related to GC: it doesn't trigger GC more, though might change between fully and partially interruptible. GC reporting changes because the generated code changes.

@markples
Copy link
Member

markples commented May 18, 2023

Whether the "first time after building" observation is accurate or not, I was able to use it to get a case in windbg. Below is the stack trace. Note that, at first glance anyway, this doesn't look like a typical GC bug (misreporting, heap corruption, etc.) as the assertion is about the stack limit.

...
[0xf]    coreclr!PromoteCarefully+0x58   0xceff140   0x6124d51b   
[0x10]   coreclr!GcEnumObject+0x4b   0xceff15c   0x61067cae   
[0x11]   coreclr!EECodeManager::EnumGcRefs+0xe3e   0xceff178   0x6124d782   
[0x12]   coreclr!GcStackCrawlCallBack+0x152   0xceff5dc   0x611752fa   
[0x13]   coreclr!Thread::MakeStackwalkerCallback+0x48   0xceff634   0x611769d4   
[0x14]   coreclr!Thread::StackWalkFramesEx+0x186   0xceff650   0x611767ce   
[0x15]   coreclr!Thread::StackWalkFrames+0x159   0xceff928   0x6124cc04   
[0x16]   coreclr!ScanStackRoots+0x196   0xceffc60   0x6124addf   
[0x17]   coreclr!GCToEEInterface::GcScanRoots+0xf1   0xceffd74   0x6148c192   
[0x18]   coreclr!WKS::gc_heap::background_mark_phase+0x406   0xceffd98   0x61495ffe   
[0x19]   coreclr!WKS::gc_heap::gc1+0x13b   0xceffe00   0x6148e51d   
[0x1a]   coreclr!WKS::gc_heap::bgc_thread_function+0xcc   0xceffe24   0x6148e6c7   
[0x1b]   coreclr!WKS::gc_heap::bgc_thread_stub+0x27   0xceffe44   0x6124976e   
[0x1c]   coreclr!<lambda_6e1c80cead0a95b6f179a4ba7ad5e186>::operator()+0x84   0xceffe48   0x75a47d59   
[0x1d]   KERNEL32!BaseThreadInitThunk+0x19   0xceffe68   0x77d0b74b   
[0x1e]   ntdll!__RtlUserThreadStart+0x2b   0xceffe78   0x77d0b6cf   
[0x1f]   ntdll!_RtlUserThreadStart+0x1b   0xceffed0   0x0   

@BruceForstall
Copy link
Member Author

Maybe @dotnet/gc or @janvorli have insight?

@mangod9
Copy link
Member

mangod9 commented May 18, 2023

@markples do you have the dumps available? @cshung is this related to any of your recent changes?

@janvorli
Copy link
Member

@markples if you can share the dump and the contents of the artifacts\tests\coreclr\Windows.x64.XXXXX\Tests\Core_Root, I can take a look.

@markples
Copy link
Member

markples commented May 18, 2023

Thanks @janvorli. I've copied a .dump /m, .dump /mf, and the contents of Windows.x86.checked\Tests\Core_Root to https://microsoft-my.sharepoint.com/:f:/p/markples/EmAjXtmylUxMoQPjBpFC9ucBd5mGrD0IKPPopBCkJr0H4g?e=7z4efq. This should have MSFT-organization access.

A few key points from the discussion above:

  • Repro is so far x86 only
  • Repro is running under a jit stress mode
  • Repro is very unreliable, though I have seen a strong correlation between hitting one and the first execution after building in src\tests

@markples
Copy link
Member

b539509 has the same symptoms but seems to be equally difficult to repro.

@janvorli
Copy link
Member

I can see that the issue is caused by the topmost explicit frame of the thread that GC is scanning being InlinedCallFrame and having m_pCallSiteSP set to 0. That value is then stored in the stack_limit and that later on triggers the assert sc->stack_limit != 0 that I can see in the dump.

0:009> dt (InlinedCallFrame)0x02f7e920 
coreclr!InlinedCallFrame
   +0x000 __VFN_table : 0x61715b3c 
   +0x004 m_Next           : 0xffffffff Frame
   +0x008 m_Datum          : 0x80000000 NDirectMethodDesc
   +0x00c m_pCallSiteSP    : (null) 
   +0x010 m_pCallerReturnAddress : 0
   +0x014 m_pCalleeSavedFP : 0x2f7ea9c
   +0x018 m_pThread        : (null) 

Notice that the m_pCallerReturnAddress is set to 0 too. That indicates that the frame doesn't have active call. InlinedCallFrames are used for pinvokes and they stay on the per-thread list of explicit frames even when there is no pinvoke in progress. Only the m_pCallerReturnAddress is set to 0 in that case and the m_pCallSiteSP is not set.

So it seems the code getting the stack_limit is wrong and it should also check InlinedCallFrame::FrameHasActiveCall(pTopFrame).
This is the current code:

Frame* pTopFrame = pThread->GetFrame();
Object ** topStack = (Object **)pTopFrame;
if ((pTopFrame != ((Frame*)-1))
&& (pTopFrame->GetVTablePtr() == InlinedCallFrame::GetMethodFrameVPtr())) {
// It is an InlinedCallFrame. Get SP from it.
InlinedCallFrame* pInlinedFrame = (InlinedCallFrame*)pTopFrame;
topStack = (Object **)pInlinedFrame->GetCallSiteSP();
}
sc->stack_limit = (uintptr_t)topStack;

It seems it should look like this:

    Frame* pTopFrame = pThread->GetFrame();
    Object ** topStack = (Object **)pTopFrame;
    if (InlinedCallFrame::FrameHasActiveCall(pTopFrame))
    {
        // It is an InlinedCallFrame with active call. Get SP from it.
        InlinedCallFrame* pInlinedFrame = (InlinedCallFrame*)pTopFrame;
        topStack = (Object **)pInlinedFrame->GetCallSiteSP();
    }

    sc->stack_limit = (uintptr_t)topStack;

@mangod9
Copy link
Member

mangod9 commented May 19, 2023

why is this failing now? Did codegen for inlining change recently which is now triggering this issue?

@janvorli
Copy link
Member

Maybe that something (maybe related to the jitstress) has made GC happening at places around pinvokes that were not reported as GC safe before and now they are.

@markples
Copy link
Member

I don't immediately see anything scanning the log for pinvoke and inliner. #85975 @jkotas touches the inliner but doesn't look like it was supposed to have any changes in behavior. @dotnet/jit-contrib is anyone aware of changes around pinvokes?

Chronologically, it looks like the switch to merged test groups is a likely factor. Also, I was unable to repro the problem with merged test groups turned off, but the repro is flaky so that's not conclusive. Merged test groups have two main effects: (1) there are a few more methods on the stack while tests are executing, (2) other tests will have executed in the same process prior to a particular test being run. So one theory would be that the issue was already there but a GC was never triggered at that point due to the individual test not having done enough allocation yet. I constructed a merged test group with just the single test, and it didn't repro, but again the flaky nature only makes this a partial case for (2) over (1).

The failure showed up in both this test and b539509 (two copies of it in different merged groups). Both have some arrays and exceptions, though I don't think the active part of b539509 is throwing immediately near the failure point (not sure about UnrollEqualsStartsWith).

@janvorli
Copy link
Member

@markples since you are able to repro the issue, would you be willing to test the change I've shared above to verify that
it really fixes the problem?

@markples
Copy link
Member

The issue hasn't reproed after several attempts. Thank you for the investigation and fix!

For reference, here are my steps:

setup: build -rc Checked -lc Release -a x86 -subset clr+libs

each iteration:

  • in artifacts\tests\coreclr: rmdir /s /q obj windows.x86.Checked
  • in src\tests: build.cmd Checked x86 generatelayoutonly /p:LibrariesConfiguration=Release & build -priority 1 x86 checked test JIT\opt\JIT.opt.csproj /p:LibrariesConfiguration=Release
  • in artifacts\tests\coreclr\windows.x86.Checked\JIT\opt\JIT.opt: JIT.opt.cmd -coreroot <root>\artifacts\tests\coreclr\windows.x86.Checked\Tests\Core_Root (I tried a few times per build, but the repro was only occurring on the first)

@jkotas
Copy link
Member

jkotas commented May 20, 2023

    Object ** topStack = (Object **)pTopFrame;
    if (InlinedCallFrame::FrameHasActiveCall(pTopFrame))
    {
        // It is an InlinedCallFrame with active call. Get SP from it.
        InlinedCallFrame* pInlinedFrame = (InlinedCallFrame*)pTopFrame;
        topStack = (Object **)pInlinedFrame->GetCallSiteSP();
    }

    sc->stack_limit = (uintptr_t)topStack;

@janvorli I do not think that this fix is quite right. If the top InlinedCallFrame is inactive, it can be somewhere in the middle of the stack and there can be a lot of managed code frames between the actual top of the stack and the stack_limit. Similarly, if the thread has no Frames, we will set the stack_limit to (Frame*)-1.

We should either compute stack_limit more precisely, or we should rename it to stack_limit_estimate and document that it may not cover the entire active stack. In the latter case, the whole if (InlinedCallFrame::FrameHasActiveCall(pTopFrame)) special case can be deleted.

@jkotas
Copy link
Member

jkotas commented May 20, 2023

Here is a repro that hits the assert reliably within a few seconds (compile with /o+, run with TieredCompilation=0 on x86 debug or checked build):

using System;
using System.Threading;
using System.Runtime.InteropServices;
using System.Runtime.CompilerServices;

object myObject = new Object();

WaitAndPInvoke(default, 3, ref myObject);
ClearStack();
WaitAndPInvoke(default, Int32.MaxValue, ref myObject);

[MethodImpl(MethodImplOptions.NoInlining)]
static void ClearStack()
{
    Span<int> s = stackalloc int[1000];
    s.Clear();
}

[MethodImpl(MethodImplOptions.NoInlining)]
static void WaitAndPInvoke(BigStruct s, int c, ref object o)
{
    if (c == Int32.MaxValue) new Thread(TriggerGC).Start();
    for (int i = 0; i < c; i++) ThrowAndCatch(null);
    GetTickCount();
    o.ToString();
}

[MethodImpl(MethodImplOptions.NoInlining)]
static void ThrowAndCatch(object o)
{
   try
   {
      o.ToString();
   }
   catch (NullReferenceException)
   {
   }
}


[DllImport("kernel32")]
extern static int GetTickCount();

static void TriggerGC()
{
   for (;;)
   {
      Thread.Sleep(0);
      GC.Collect();
   }
}

[StructLayout(LayoutKind.Explicit, Size=200)]
struct BigStruct {
}

@jkotas
Copy link
Member

jkotas commented May 20, 2023

why is this failing now? Did codegen for inlining change recently which is now triggering this issue?

We are reading uninitialized value on the stack. The assert is triggered when the uninitialized value is zero.

@markples markples added runtime-coreclr specific to the CoreCLR runtime and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels May 24, 2023
@janvorli
Copy link
Member

@jkotas you are right that the case when InlinedCallFrame is not active is problematic. However, that's the case on Windows only. On Unix, we always have an explicit frame that's at the lowest address we want to scan (when the thread is interrupted in managed code, we have the RedirectedThreadFrame there). We should never see inactive InlinedCallFrame as the first frame when doing GC scan on Unix. Since this stack_limit was introduced to fix a Unix specific issue, we can fix it just by making the stack_limit stuff Unix only.

@jkotas jkotas added area-VM-coreclr and removed runtime-coreclr specific to the CoreCLR runtime JitStress CLR JIT issues involving JIT internal stress modes labels May 25, 2023
@janvorli
Copy link
Member

@jkotas I am unable to reproduce the problem using the repro that you've shared above. The only thing I am not sure about is what you've meant by "compile with /o+". I am building it using dotnet build -c Checked -a x86 and have <Optimize>True</Optimize> in the csproj.

@janvorli
Copy link
Member

Btw, after looking at the original dump with the issue again, the real culprit is actually elsewhere. During the GC scan, we should have some other frame on top of the stack than an inactive inlined call frame. I could see that besides the crash, the GC stack walk started in the middle of the managed stack.

@jkotas
Copy link
Member

jkotas commented Aug 21, 2023

@jkotas I am unable to reproduce the problem using the repro that you've shared above

It still repros for me. Here are the exact repro steps that I have just tried:

build -s clr -a x86 -c checked
cd artifacts\bin\coreclr\windows.x86.Checked
notepad test.cs (copy, paste and save my repro: https://github.com/dotnet/runtime/issues/86265#issuecomment-1555398337)
csc /r:System.Private.CoreLib.dll /o+ test.cs (assumes csc alias that invokes recent C# compiler) 
set DOTNET_TieredCompilation=0

Result:

---------------------------
Microsoft Visual C++ Runtime Library
---------------------------
Assertion failed!

Program: ...ifacts\bin\coreclr\windows.x86.Checked\coreclr.dll
File: C:\runtime\src\coreclr\vm\siginfo.cpp
Line: 4968

Expression: sc->stack_limit != 0

For information on how your program can cause an assertion
failure, see the Visual C++ documentation on asserts

(Press Retry to debug the application - JIT must be enabled)
---------------------------
Abort   Retry   Ignore   
---------------------------

janvorli added a commit to janvorli/runtime that referenced this issue Aug 22, 2023
There is a problem with computing stack_limit in the presence of
inactive InlinedCallFrame. We were trying to call GetCallSiteSP on that
frame to get the call site SP. But when the inlined call frame is not
active (there is no call to native code in progress), that method
returns random junk.

But even if we checked for an active call before reading the value, it
would not get correct stack limit on x86 windows when there was no
active call with the inlined call frame being the topmost one. That
can happen with hardware exception handling on Windows x86. On other
targets, we always have other explicit frame on top of the explicit
frames stack, but on windows x86, we don't use the FaultingExceptionFrame for
hardware exceptions, so the inactive inlined call frame could be the
topmost one when GC starts to scan stack roots.

Since the stack_limit was introduced to fix a Unix specific problem, I
have fixed that by disabling the stack_limit usage for x86 windows.

Close dotnet#86265
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Aug 22, 2023
janvorli added a commit that referenced this issue Aug 24, 2023
* Fix stack_limit handling

There is a problem with computing stack_limit in the presence of
inactive InlinedCallFrame. We were trying to call GetCallSiteSP on that
frame to get the call site SP. But when the inlined call frame is not
active (there is no call to native code in progress), that method
returns random junk.

But even if we checked for an active call before reading the value, it
would not get correct stack limit on x86 windows when there was no
active call with the inlined call frame being the topmost one. That
can happen with hardware exception handling on Windows x86. On other
targets, we always have other explicit frame on top of the explicit
frames stack, but on windows x86, we don't use the FaultingExceptionFrame for
hardware exceptions, so the inactive inlined call frame could be the
topmost one when GC starts to scan stack roots.

Since the stack_limit was introduced to fix a Unix specific problem, I
have fixed that by disabling the stack_limit usage for x86 windows.

Close #86265

* Add back the stack_limit field to keep contract unchanged

* Reflect PR feedback - Minimalize the number of ifdefs

* Reflect PR feedback
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Aug 24, 2023
github-actions bot pushed a commit that referenced this issue Aug 24, 2023
There is a problem with computing stack_limit in the presence of
inactive InlinedCallFrame. We were trying to call GetCallSiteSP on that
frame to get the call site SP. But when the inlined call frame is not
active (there is no call to native code in progress), that method
returns random junk.

But even if we checked for an active call before reading the value, it
would not get correct stack limit on x86 windows when there was no
active call with the inlined call frame being the topmost one. That
can happen with hardware exception handling on Windows x86. On other
targets, we always have other explicit frame on top of the explicit
frames stack, but on windows x86, we don't use the FaultingExceptionFrame for
hardware exceptions, so the inactive inlined call frame could be the
topmost one when GC starts to scan stack roots.

Since the stack_limit was introduced to fix a Unix specific problem, I
have fixed that by disabling the stack_limit usage for x86 windows.

Close #86265
carlossanlop pushed a commit that referenced this issue Aug 26, 2023
* Fix stack_limit handling

There is a problem with computing stack_limit in the presence of
inactive InlinedCallFrame. We were trying to call GetCallSiteSP on that
frame to get the call site SP. But when the inlined call frame is not
active (there is no call to native code in progress), that method
returns random junk.

But even if we checked for an active call before reading the value, it
would not get correct stack limit on x86 windows when there was no
active call with the inlined call frame being the topmost one. That
can happen with hardware exception handling on Windows x86. On other
targets, we always have other explicit frame on top of the explicit
frames stack, but on windows x86, we don't use the FaultingExceptionFrame for
hardware exceptions, so the inactive inlined call frame could be the
topmost one when GC starts to scan stack roots.

Since the stack_limit was introduced to fix a Unix specific problem, I
have fixed that by disabling the stack_limit usage for x86 windows.

Close #86265

* Add back the stack_limit field to keep contract unchanged

* Reflect PR feedback - Minimalize the number of ifdefs

* Reflect PR feedback

---------

Co-authored-by: Jan Vorlicek <janvorli@microsoft.com>
@ghost ghost locked as resolved and limited conversation to collaborators Sep 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-VM-coreclr blocking-clean-ci-optional Blocking optional rolling runs
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants