Test failure: JIT/opt/Vectorization/UnrollEqualsStartsWIth/UnrollEqualsStartsWIth.dll #86265
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue details:

coreclr windows x86 Checked jitstress2_jitstressregs0x1000 @ Windows.10.Amd64.Open
https://dev.azure.com/dnceng-public/public/_build/results?buildId=272907&view=ms.vss-test-web.build-test-results-tab

Looks like the failure of this test brought down the whole run.

A different run, coreclr windows x86 Checked jitstress2_jitstressregs1 @ Windows.10.Amd64.Open, crashed.

Another failure: coreclr windows x86 Checked jitstress2_jitstressregs8 @ Windows.10.Amd64.Open
https://dev.azure.com/dnceng-public/public/_build/results?buildId=273272&view=ms.vss-test-web.build-test-results-tab

@markples Another issue with merging? cc @EgorBo

---
Note: this test may be quite slow in stress modes, although it used to work fine.

---
Do you think there is a JitStress/merged-test-group interaction here? This test is interesting: it internally has its own merged-test-group implementation and apparently contains 127512 tests. They are executed via reflection, so a normal failure would only terminate one of the internal (sub-)batches, but a crash takes out the entire group. This infrastructure seems much heavier than the new wrapper code. I think converting this to xunit-style tests would reduce the code, though a 100k+ test group would be a new stress on the system. To be clear, I'm not suggesting any of that as a fix for the test failure - just something I noticed while looking at the test.

---
This has been inconsistent to repro across configurations. I have gotten it with -c Checked -lc Release and only in the merged group (and not in a "merged" group with just the one test). Visual Studio shuts down when this test is executed (and the report-bug mechanism isn't properly searching for matching bugs for me). It isn't reproing under windbg. There are a number of exceptions that are thrown and caught throughout the merged group - divide-by-zeros and access violations (for null). [edit - part of the above was incorrect.] It seems like there is an actual problem even if merged groups are exposing it, though since it is jitstress it might be a jitstress issue and not a product one.

---
The actual failure, which the logs don't seem to capture (or maybe I've ended up in a different configuration?), is an assertion failure at src/coreclr/vm/siginfo.cpp line 4928:

```cpp
void PromoteCarefully(promote_func   fn,
                      PTR_PTR_Object ppObj,
                      ScanContext*   sc,
                      uint32_t       flags /* = GC_CALL_INTERIOR*/ )
{
    LIMITED_METHOD_CONTRACT;

    //
    // Sanity check that the flags contain only these three values
    //
    assert((flags & ~(GC_CALL_INTERIOR|GC_CALL_PINNED)) == 0);

    //
    // Sanity check that GC_CALL_INTERIOR FLAG is set
    //
    assert(flags & GC_CALL_INTERIOR);

#if !defined(DACCESS_COMPILE)
    //
    // Sanity check the stack scan limit
    //
>>  assert(sc->stack_limit != 0);  <<
```

This suggests that the difficulty reproing it might have to do with GC timing. Does jitstress do anything odd with GC (triggering, reporting, etc.)? I.e., is it likely a jitstress or product issue? I had previously seen a bit of a pattern where I can get the failure to happen the first time after building in src/tests but not again after that. It wasn't 100%, and it seems like confusing information since the test runs for a few seconds before failing (which makes some sort of cold start seem less likely). It's certainly an interesting assertion to hit, though.

---
It appears to hit with multiple different JitStressRegs values. Maybe it does, or would, hit without JitStressRegs set if run enough? It also appears to hit only with JitStress=2. JitStress doesn't do anything specific related to GC: it doesn't trigger GC more often, though it might change methods between fully and partially interruptible. GC reporting changes because the generated code changes.

---
Whether the "first time after building" observation is accurate or not, I was able to use it to get a case in windbg. Below is the stack trace. Note that, at first glance anyway, this doesn't look like a typical GC bug (misreporting, heap corruption, etc.), as the assertion is about the stack limit.

---
Maybe @dotnet/gc or @janvorli have insight?

---
@markples if you can share the dump and the contents of artifacts\tests\coreclr\Windows.x64.XXXXX\Tests\Core_Root, I can take a look.

---
Thanks @janvorli. I've copied a few key points from the discussion above:

---
b539509 has the same symptoms but seems to be equally difficult to repro.

---
I can see that the issue is caused by the topmost explicit frame of the thread that GC is scanning being an inactive InlinedCallFrame. So it seems the code getting the stack_limit is wrong (runtime/src/coreclr/vm/gcenv.ee.cpp, lines 117 to 126 in 29d485c).
It seems it should look like this:

```cpp
Frame* pTopFrame = pThread->GetFrame();
Object ** topStack = (Object **)pTopFrame;
if (InlinedCallFrame::FrameHasActiveCall(pTopFrame))
{
    // It is an InlinedCallFrame with active call. Get SP from it.
    InlinedCallFrame* pInlinedFrame = (InlinedCallFrame*)pTopFrame;
    topStack = (Object **)pInlinedFrame->GetCallSiteSP();
}
sc->stack_limit = (uintptr_t)topStack;
```

---
Why is this failing now? Did codegen for inlining change recently in a way that is now triggering this issue?

---
Maybe something (possibly related to the jitstress) has made GCs happen at places around pinvokes that were not reported as GC-safe before and now are.

---
I don't immediately see anything from scanning the log for pinvoke and inliner. #85975 by @jkotas touches the inliner but doesn't look like it was supposed to have any changes in behavior. @dotnet/jit-contrib, is anyone aware of changes around pinvokes?

Chronologically, it looks like the switch to merged test groups is a likely factor. Also, I was unable to repro the problem with merged test groups turned off, but the repro is flaky, so that's not conclusive. Merged test groups have two main effects: (1) there are a few more methods on the stack while tests are executing, and (2) other tests will have executed in the same process before a particular test is run. So one theory would be that the issue was already there but a GC was never triggered at that point because the individual test hadn't done enough allocation yet. I constructed a merged test group with just the single test, and it didn't repro, but again the flaky nature only makes this a partial case for (2) over (1).

The failure showed up in both this test and b539509 (two copies of it in different merged groups). Both have some arrays and exceptions, though I don't think the active part of b539509 is throwing immediately near the failure point (not sure about UnrollEqualsStartsWith).

---
@markples since you are able to repro the issue, would you be willing to test the change I've shared above to verify that it fixes the problem?

---
The issue hasn't reproed after several attempts. Thank you for the investigation and fix! For reference, here are my steps:

setup:

each iteration:

---
@janvorli I do not think that this fix is quite right. If the top InlinedCallFrame is inactive, it can be somewhere in the middle of the stack, and there can be a lot of managed code frames between the actual top of the stack and the stack_limit. Similarly, if the thread has no Frames, we will set the stack_limit to (Frame*)-1. We should either compute

---
Here is a repro that hits the assert reliably within a few seconds (compile with /o+, run with TieredCompilation=0 on an x86 debug or checked build):

---
We are reading an uninitialized value on the stack. The assert is triggered when the uninitialized value happens to be zero.
@jkotas you are right that the case when |
@jkotas I am unable to reproduce the problem using the repro that you've shared above. The only thing I am not sure about is what you meant by "compile with /o+". I am building it using
Btw, after looking at the original dump with the issue again, the real culprit is actually elsewhere. During the GC scan, we should have some frame other than an inactive inlined call frame on top of the stack. I could see that, besides the crash, the GC stack walk started in the middle of the managed stack.
It still repros for me. Here are the exact repro steps that I have just tried:

Result:

---
There is a problem with computing stack_limit in the presence of an inactive InlinedCallFrame. We were trying to call GetCallSiteSP on that frame to get the call-site SP, but when the inlined call frame is not active (there is no call to native code in progress), that method returns random junk.

But even if we checked for an active call before reading the value, it would not get a correct stack limit on x86 Windows when there was no active call and the inlined call frame was the topmost one. That can happen with hardware exception handling on Windows x86: on other targets we always have another explicit frame on top of the explicit frames stack, but on Windows x86 we don't use the FaultingExceptionFrame for hardware exceptions, so the inactive inlined call frame could be the topmost one when the GC starts to scan stack roots.

Since stack_limit was introduced to fix a Unix-specific problem, I have fixed that by disabling the stack_limit usage for x86 Windows.

Close dotnet#86265
Commits in the PR:

* Fix stack_limit handling (commit message as above)
* Add back the stack_limit field to keep the contract unchanged
* Reflect PR feedback - minimize the number of ifdefs
* Reflect PR feedback

Co-authored-by: Jan Vorlicek <janvorli@microsoft.com>