Test failure tracing/eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.sh #54801

Closed
VincentBu opened this issue Jun 28, 2021 · 25 comments · Fixed by #72517
Labels: arch-x64, area-Tracing-coreclr, blocking-clean-ci, GCStress, os-linux, tracking-external-issue

Comments

@VincentBu
Contributor

VincentBu commented Jun 28, 2021

Run: runtime-coreclr gcstress0x3-gcstress0xc 20210627.1

Failed test:

CoreCLR Linux x64 Checked gcstress0x3 @ Ubuntu.1804.Amd64.Open

- tracing/eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.sh

Error message:

Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at Microsoft.Diagnostics.Tracing.TraceEvent.CopyBlob(IntPtr, IntPtr, Int32)
at Microsoft.Diagnostics.Tracing.TraceEvent.Clone()
at Microsoft.Diagnostics.Tracing.Parsers.ExternalTraceEventParser.<.ctor>b__0_0(Microsoft.Diagnostics.Tracing.TraceEvent)
at Microsoft.Diagnostics.Tracing.TraceEventDispatcher.Lookup(EVENT_RECORD*)
at Microsoft.Diagnostics.Tracing.EventPipeEventSource.DispatchEventRecord(EVENT_RECORD*)
at Microsoft.Diagnostics.Tracing.EventPipeEventSource.EventCache_OnEvent(Microsoft.Diagnostics.Tracing.EventPipeEventHeader ByRef)
at Microsoft.Diagnostics.Tracing.EventPipe.EventCache.SortAndDispatch(Int64)
at Microsoft.Diagnostics.Tracing.EventPipe.EventCache.ProcessSequencePointBlock(Byte[])
at Microsoft.Diagnostics.Tracing.EventPipeSequencePointBlock.ReadBlockContents(FastSerialization.PinnedStreamReader)
at Microsoft.Diagnostics.Tracing.EventPipeBlock.FromStream(FastSerialization.Deserializer)
at FastSerialization.Deserializer.ReadObjectDefinition(FastSerialization.Tags, FastSerialization.StreamLabel)
at FastSerialization.Deserializer.ReadObject()
at Microsoft.Diagnostics.Tracing.EventPipeEventSource.Process()
at Tracing.Tests.Common.IpcTraceTest+<>c__DisplayClass8_0.<Validate>b__1()
at System.Threading.Tasks.Task.InnerInvoke()
at System.Threading.Tasks.Task+<>c.<.cctor>b__284_0(System.Object)
at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread)
at System.Threading.Tasks.Task.ExecuteEntryUnsafe(System.Threading.Thread)
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()
at System.Threading.Thread.StartCallback()
/datadisks/disk1/work/AA440922/w/B06F0997/e/tracing/eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.sh: line 394: 26241 Aborted                 (core dumped) $LAUNCHER $ExePath "${CLRTestExecutionArguments[@]}"

Return code:      1
Raw output file:      /datadisks/disk1/work/AA440922/w/B06F0997/uploads/Reports/tracing.eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.output.txt
Raw output:
BEGIN EXECUTION
/datadisks/disk1/work/AA440922/p/corerun rundownvalidation.dll ''
0.0s: ==TEST STARTING==
33.6s: Started sending sentinel events...
33.8s: Connecting to EventPipe...
65.9s: Connected to EventPipe with sessionID '0x7fd1800093a0'
65.9s: Creating EventPipeEventSource...
76.6s: EventPipeEventSource created
80.0s: Dynamic.All callback registered
80.0s: Running optional trace validator
80.9s: Finished running optional trace validator
80.9s: Starting stream processing...
98.9s: Saw new provider 'Microsoft-DotNETCore-SampleProfiler'
131.0s: Saw sentinel event
131.0s: Stopped sending sentinel events
131.3s: Starting event generating action...
131.3s: Stopping event generating action
131.4s: Sending StopTracing command...
169.4s: Saw new provider 'Microsoft-DotNETCore-EventPipe'
169.6s: Saw new provider 'Microsoft-Windows-DotNETRuntimeRundown'
Gathering state for process 26241 corerun
Writing minidump with heap to file /home/helixbot/dotnetbuild/dumps/coredump.26241.dmp
Written 132464640 bytes (32340 pages) to core file
Dump successfully written
Expected: 100
Actual: 134
END EXECUTION - FAILED
Test Harness Exitcode is : 1
To run the test:

set CORE_ROOT=/datadisks/disk1/work/AA440922/p
/datadisks/disk1/work/AA440922/w/B06F0997/e/tracing/eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.sh
Expected: True
Actual:   False


Stack trace
   at tracing_eventpipe._rundownvalidation_rundownvalidation_rundownvalidation_._rundownvalidation_rundownvalidation_rundownvalidation_sh() in tracing.eventpipe.XUnitWrapper.dll:token 0x6000024+0x295

Runfo Tracking Issue: tracing/eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.sh

Build Definition Kind Run Name Console Core Dump Test Results Run Client
1626538 runtime PR 65398 Mono Android x64 Release @ Ubuntu.1804.Amd64.Android.29.Open console.log runclient.py
1626376 runtime PR 65489 CoreCLR Linux arm Checked no_tiered_compilation @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump runclient.py
1626136 runtime PR 64358 CoreCLR Linux arm Checked no_tiered_compilation @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump runclient.py
1625363 runtime PR 65489 CoreCLR Linux arm Checked no_tiered_compilation @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump runclient.py
1625267 runtime PR 64358 CoreCLR Linux arm Checked no_tiered_compilation @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump runclient.py
1623185 runtime PR 64358 CoreCLR Linux arm Checked no_tiered_compilation @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump runclient.py
1622882 runtime PR 64358 CoreCLR Linux arm Checked no_tiered_compilation @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump runclient.py
1614899 runtime PR 64358 CoreCLR Linux arm Checked no_tiered_compilation @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump runclient.py
1612319 runtime PR 64358 CoreCLR Linux arm Checked no_tiered_compilation @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump runclient.py

Build Result Summary

Day Hit Count: 0, Week Hit Count: 0, Month Hit Count: 9
VincentBu added the os-linux, GCStress, and arch-x64 labels on Jun 28, 2021
The dotnet-issue-labeler bot added the area-Tracing-coreclr and untriaged labels on Jun 28, 2021
@tommcdon
Member

@davidwrighton @josalem related to #51477?

tommcdon removed the untriaged label on Jun 28, 2021
tommcdon added this to the 6.0.0 milestone on Jun 28, 2021
@VincentBu
Contributor Author

VincentBu commented Jul 12, 2021

Run runtime-coreclr gcstress0x3-gcstress0xc 20210711.1
Failed test:

CoreCLR Linux x64 Checked gcstress0x3 @ Ubuntu.1804.Amd64.Open
- tracing/eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.sh

CoreCLR windows x86 Checked gcstress0xc @ Windows.10.Amd64.Open
- Loader\ContextualReflection\ContextualReflection\ContextualReflection.cmd


Error message

Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at Microsoft.Diagnostics.Tracing.TraceEvent.CopyBlob(IntPtr, IntPtr, Int32)
at Microsoft.Diagnostics.Tracing.TraceEvent.Clone()
at Microsoft.Diagnostics.Tracing.Parsers.ExternalTraceEventParser.<.ctor>b__0_0(Microsoft.Diagnostics.Tracing.TraceEvent)
at Microsoft.Diagnostics.Tracing.TraceEventDispatcher.Lookup(EVENT_RECORD*)
at Microsoft.Diagnostics.Tracing.EventPipeEventSource.EventCache_OnEvent(Microsoft.Diagnostics.Tracing.EventPipeEventHeader ByRef)
at Microsoft.Diagnostics.Tracing.EventPipe.EventCache.SortAndDispatch(Int64)
at Microsoft.Diagnostics.Tracing.EventPipe.EventCache.ProcessSequencePointBlock(Byte[])
at Microsoft.Diagnostics.Tracing.EventPipeSequencePointBlock.ReadBlockContents(FastSerialization.PinnedStreamReader)
at Microsoft.Diagnostics.Tracing.EventPipeBlock.FromStream(FastSerialization.Deserializer)
at FastSerialization.Deserializer.ReadObjectDefinition(FastSerialization.Tags, FastSerialization.StreamLabel)
at FastSerialization.Deserializer.ReadObject()
at Microsoft.Diagnostics.Tracing.EventPipeEventSource.Process()
at Tracing.Tests.Common.IpcTraceTest+<>c__DisplayClass8_0.<Validate>b__1()
at System.Threading.Tasks.Task.InnerInvoke()
at System.Threading.Tasks.Task+<>c.<.cctor>b__271_0(System.Object)
at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread)
at System.Threading.Tasks.Task.ExecuteEntryUnsafe(System.Threading.Thread)
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()
at System.Threading.Thread.StartCallback()
/datadisks/disk1/work/A4DE0968/w/B9D109D4/e/tracing/eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.sh: line 394: 23695 Aborted                 (core dumped) $LAUNCHER $ExePath "${CLRTestExecutionArguments[@]}"

Return code:      1
Raw output file:      /datadisks/disk1/work/A4DE0968/w/B9D109D4/uploads/Reports/tracing.eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.output.txt
Raw output:
BEGIN EXECUTION
/datadisks/disk1/work/A4DE0968/p/corerun rundownvalidation.dll ''
0.0s: ==TEST STARTING==
22.5s: Started sending sentinel events...
23.1s: Connecting to EventPipe...
47.8s: Connected to EventPipe with sessionID '0x7f91ac0093a0'
47.8s: Creating EventPipeEventSource...
54.3s: EventPipeEventSource created
58.2s: Dynamic.All callback registered
58.2s: Running optional trace validator
58.8s: Finished running optional trace validator
58.8s: Starting stream processing...
87.4s: Saw new provider 'Microsoft-DotNETCore-SampleProfiler'
123.5s: Saw sentinel event
123.6s: Stopped sending sentinel events
124.4s: Starting event generating action...
124.5s: Stopping event generating action
124.9s: Sending StopTracing command...
305.4s: Saw new provider 'Microsoft-DotNETCore-EventPipe'
305.6s: Saw new provider 'Microsoft-Windows-DotNETRuntimeRundown'
Gathering state for process 23695 corerun
Crashing thread 00005cb2 signal 00000006
Writing minidump with heap to file /home/helixbot/dotnetbuild/dumps/coredump.23695.dmp
Written 133865472 bytes (32682 pages) to core file
Dump successfully written
Expected: 100
Actual: 134
END EXECUTION - FAILED
Test Harness Exitcode is : 1
To run the test:

set CORE_ROOT=/datadisks/disk1/work/A4DE0968/p
/datadisks/disk1/work/A4DE0968/w/B9D109D4/e/tracing/eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.sh
Expected: True
Actual:   False


Stack trace
   at tracing_eventpipe._rundownvalidation_rundownvalidation_rundownvalidation_._rundownvalidation_rundownvalidation_rundownvalidation_sh() in tracing.eventpipe.XUnitWrapper.dll:token 0x6000024+0x295

@josalem
Contributor

josalem commented Jul 12, 2021

related to #51477?

It's possible. This isn't runtime code; it's Microsoft/PerfView code on the stack. I'd be surprised if there were an AV in TraceEvent that we haven't hit before now, given the number of tests that use it, especially since we haven't changed the version in recent memory.

This failure has a dump associated with it, so I should be able to dig into it a little more. I've got a few things on my plate, so I'll put this on the stack to investigate.

(CC @brianrob)

To elaborate on the issue text, the two mentioned tests have different failures. The error message in the issue text is for the rundown validation test.

This is the error message for the ContextualReflection test:

Assert failure(PID 7972 [0x00001f24], Thread: 6620 [0x19dc]): ptrManagedAssemblyLoadContext != NULL

CORECLR! AssemblyNative::GetLoadContextForAssembly + 0xF3 (0x70de4bd3)
SYSTEM.PRIVATE.CORELIB! <no symbol> + 0x0 (0x6fd61d92)
SYSTEM.PRIVATE.CORELIB! <no symbol> + 0x0 (0x6fd639cb)
<no module>! <no symbol> + 0x0 (0x08929ab9)
<no module>! <no symbol> + 0x0 (0x089289b3)
<no module>! <no symbol> + 0x0 (0x0520eef8)
<no module>! <no symbol> + 0x0 (0x0520ba88)
CORECLR! CallDescrWorkerInternal + 0x34 (0x710ef801)
CORECLR! CallDescrWorker + 0xAA (0x70df5789)
CORECLR! CallDescrWorkerWithHandler + 0xDE (0x70df58c7)
    File: D:\workspace\_work\1\s\src\coreclr\vm\assemblynative.cpp Line: 1317
    Image: C:\h\w\A53508CA\p\corerun.exe

I'm not sure if these are related.

@josalem
Contributor

josalem commented Jul 12, 2021

Took a look at the history for the rundown validation test. It's been intermittently failing since last Friday with varying errors. There's the above AV, but there is also the following error:

Process terminated. Assertion failed.
at System.Collections.Hashtable.InitHash(Object key, Int32 hashsize, UInt32& seed, UInt32& incr) in System.Private.CoreLib.dll:token 0x60068c4+0x11
at System.Collections.Hashtable.get_Item(Object key) in System.Private.CoreLib.dll:token 0x60068d0+0xb
at System.Collections.Hashtable.SyncHashtable.get_Item(Object key) in System.Private.CoreLib.dll:token 0x60068fa+0x0
at System.Text.EncodingTable.GetCodePageFromName(String name) in System.Private.CoreLib.dll:token 0x60035ec+0xe
at System.Text.Encoding.GetEncoding(String name) in System.Private.CoreLib.dll:token 0x6003567+0x0
at System.Text.EncodingHelper.GetEncodingFromCharset() in /_/src/libraries/Common/src/System/Text/EncodingHelper.Unix.cs:line 15
at System.ConsolePal.GetConsoleEncoding() in /_/src/libraries/System.Console/src/System/ConsolePal.Unix.cs:line 728
at System.ConsolePal.get_OutputEncoding() in /_/src/libraries/System.Console/src/System/ConsolePal.Unix.cs:line 68
at System.Console.get_OutputEncoding() in /_/src/libraries/System.Console/src/System/Console.cs:line 119
at System.Console.CreateOutputWriter(Stream outputStream) in /_/src/libraries/System.Console/src/System/Console.cs:line 242
at System.Console.<get_Out>g__EnsureInitialized|26_0() in /_/src/libraries/System.Console/src/System/Console.cs:line 212
at System.Console.get_Out() in /_/src/libraries/System.Console/src/System/Console.cs:line 204
at Tracing.Tests.Common.Logger..ctor(TextWriter log) in common.dll:token 0x6000007+0x6
at Tracing.Tests.Common.Logger..cctor() in common.dll:token 0x6000009+0x0
at Tracing.Tests.Common.IpcTraceTest.RunAndValidateEventCounts(Dictionary`2 expectedEventCounts, Action eventGeneratingAction, Nullable`1 sessionConfiguration, Func`2 optionalTraceValidator) in common.dll:token 0x600001b+0x0
at Tracing.Tests.RundownValidation.RundownValidation.Main(String[] args) in rundownvalidation.dll:token 0x6000001+0x38

It's not immediately obvious to me what assert failed, though:

private uint InitHash(object key, int hashsize, out uint seed, out uint incr)
{
    // Hashcode must be positive. Also, we must not use the sign bit, since
    // that is used for the collision bit.
    uint hashcode = (uint)GetHash(key) & 0x7FFFFFFF;
    seed = (uint)hashcode;
    // Restriction: incr MUST be between 1 and hashsize - 1, inclusive for
    // the modular arithmetic to work correctly. This guarantees you'll
    // visit every bucket in the table exactly once within hashsize
    // iterations. Violate this and it'll cause obscure bugs forever.
    // If you change this calculation for h2(key), update putEntry too!
    incr = (uint)(1 + ((seed * HashHelpers.HashPrime) % ((uint)hashsize - 1)));
    return hashcode;
}

I'm guessing something happened during the FCall for GetHashCode on the string being used as a key:

FCIMPL1(INT32, ValueTypeHelper::GetHashCode, Object* objUNSAFE)

@josalem
Contributor

josalem commented Jul 14, 2021

I investigated the CopyBlob() AV stack trace (different from my previous comment), and it appears to be something wonky in how pointers into pinned GC handles are followed.

EventPipeEventSource reads data into byte[] buffers, then pins them using GCHandle.Alloc(data, GCHandleType.Pinned). It then walks the buffer and creates the EVENT_RECORD structs that the TraceEvent library uses. When I look at the stack, I see a SIGSEGV on an address that is supposed to point to the payload of one of these EVENT_RECORD objects. EventPipeEventSource, and TraceEvent in general, tend to pin objects and then access them directly with unsafe pointers.
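As a minimal illustration of that pattern (hypothetical names, not the actual TraceEvent code; compile with unsafe blocks enabled), pinning a byte[] and reading through a raw pointer looks roughly like this:

    using System;
    using System.Runtime.InteropServices;

    class PinnedReadSketch
    {
        static unsafe void Main()
        {
            // Stand-in for a buffer of serialized event data.
            byte[] data = { 1, 0, 0, 0, 42, 0, 0, 0 };

            // Pin the array so the GC cannot move it while raw pointers are in use.
            GCHandle handle = GCHandle.Alloc(data, GCHandleType.Pinned);
            try
            {
                byte* p = (byte*)handle.AddrOfPinnedObject();
                int firstField = *(int*)p;     // valid only while the handle is allocated
                Console.WriteLine(firstField); // prints 1 (little-endian)
            }
            finally
            {
                handle.Free(); // no pointer into the buffer may be used after this
            }
        }
    }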

I'm not convinced the pointer arithmetic/bookkeeping is wrong; it checks bounds and whatnot. When the logic goes to dereference the pointer into the pinned buffer, however, it seg faults. The faulting address is not included in the dump created by createdump, but it is in the system-created dump. That said, the system-created dump has the values at that address all zeroed out, which may indicate it is junk data and explain why it isn't in the createdump dump.

The long and short of it is: this call stack is segfaulting on what should be an address into a pinned GC handle for a byte[]. I'll continue to investigate. I don't have any evidence pointing to TraceEvent being wrong here.

@josalem
Contributor

josalem commented Jul 14, 2021

I investigated a little further, and this looks like it might be caused by the GC collecting data while it is still in use. The code is in the middle of SortAndDispatch, which walks the dictionary of events for all the threads and (as the name implies) sorts and then dispatches the events. The seg fault happens when the logic goes to clone some memory that is supposed to be in a pinned GC handle. When I look at the array of these pinned buffers, however, all the entries in the cache are null.

(lldb) dumparray -length 15 -details  00007f918406bba0
Name:        Microsoft.Diagnostics.Tracing.EventPipe.EventMarker[]
MethodTable: 00007f91bd12b3b8
EEClass:     00007f91bc673588
Size:        16408(0x4018) bytes
Array:       Rank 1, Number of elements 2048, Type CLASS
Element Methodtable: 00007f91bd12a300
[0] null
[1] null
[2] null
[3] null
[4] null
[5] null
[6] null
[7] null
[8] null
[9] null
[10] null
[11] null
[12] null
[13] null
[14] null

One of those entries is supposed to be where the pinned buffer pointed to by the faulting address lived.

None of the above logic should be pointing to arbitrary data not allocated by itself, nor is it code that has changed recently. I don't believe that this is being caused by TraceEvent itself.

@Maoni0 or @cshung is this an instance where something marked as pinned got collected, or is it possible that the pinning isn't working because the pinning APIs are misused? I've got this dump open if there is any info you would need to determine yes or no.

@Maoni0
Member

Maoni0 commented Jul 14, 2021

if you have a pinned handle, it's a strong handle so GC will treat it as live till you call Free on it.

it would be helpful to show the code where you are pinning the object of interest and how it's related to this EventMarker array.

@josalem
Contributor

josalem commented Jul 14, 2021

Here is the relevant code I found digging through TraceEvent:
https://github.com/microsoft/perfview/blob/6b4771eecba5ed00555a34d70e19a8742af4ee0a/src/TraceEvent/EventPipe/EventCache.cs#L248-L280

    internal class EventCacheThread
    {
        public Queue<EventMarker> Events = new Queue<EventMarker>();
        public int SequenceNumber;
        public long LastCachedEventTimestamp;
    }

    internal class EventMarker
    {
        public EventMarker(PinnedBuffer buffer)
        {
            Buffer = buffer;
        }
        public EventPipeEventHeader Header;
        public PinnedBuffer Buffer;
    }

    internal class PinnedBuffer
    {
        public PinnedBuffer(byte[] data)
        {
            Data = data;
            PinningHandle = GCHandle.Alloc(data, GCHandleType.Pinned);
        }

        ~PinnedBuffer()
        {
            PinningHandle.Free();
        }

        public byte[] Data { get; private set; }
        public GCHandle PinningHandle { get; private set; }
    }

Those are stored in this dictionary on EventCache:

Dictionary<long, EventCacheThread> _threads = new Dictionary<long, EventCacheThread>();

@davidwrighton
Member

@josalem From what you're reporting here, this doesn't look like a bug in the GC proper. This looks like a bug in the PerfView repo.

In particular, the EventMarker object isn't kept alive through the OnEvent call in SortAndDispatch. The implementation of the OnEvent method (EventPipeEventSource.EventCache_OnEvent) takes a ref parameter, which would generally keep the object alive, but it only uses that ref to produce a raw pointer to pass to EventPipeEventSource.DispatchEventRecord and does not keep the ref alive any further. The end result is that the EventMarker object can be freed by the GC, and in the presence of GC stress this sort of fault is vastly more likely to occur than in normal execution.

I expect that the fix is to add a call to GC.KeepAlive(eventMarker) after the call to OnEvent?.Invoke(ref eventMarker.Header) in EventCache.SortAndDispatch, but this code is rather convoluted, so I'm not convinced that will fix all possible problems.

The big problem with using finalizers is that the runtime is free to execute them much, MUCH earlier than a developer might expect. I've even seen cases where a finalizer executes before the object's constructor has finished.
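To make that failure mode concrete, here is a minimal sketch (hypothetical names, not the PerfView code) of a raw pointer outliving the finalizable object it was derived from, and of the GC.KeepAlive pattern that prevents it:

    using System;
    using System.Runtime.InteropServices;

    class Marker
    {
        public byte[] Data = new byte[64];
        public GCHandle Pin;

        public Marker() { Pin = GCHandle.Alloc(Data, GCHandleType.Pinned); }

        // May run as soon as the object becomes unreachable, possibly while
        // a pointer derived from Pin is still being used.
        ~Marker() { Pin.Free(); }
    }

    static class KeepAliveSketch
    {
        static unsafe void Dispatch(Marker marker)
        {
            byte* p = (byte*)marker.Pin.AddrOfPinnedObject();
            // Without the GC.KeepAlive below, `marker` has no further uses past
            // this point, so a GC during Consume could collect it, run ~Marker,
            // and free the pin while `p` is still being read through.
            Consume(p);
            GC.KeepAlive(marker); // keeps marker (and its pin) alive until here
        }

        static unsafe void Consume(byte* p)
        {
            Console.WriteLine(*p); // reads through the raw pointer
        }
    }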

@josalem
Contributor

josalem commented Jul 14, 2021

Thanks for the explanation @davidwrighton! So it sounds like this is all because ConvertEventHeaderToRecord converts the EventMarker to an EVENT_RECORD*, which effectively undoes the protection of the pinning handle: it creates a window of time in which the finalizer can run and free the handle. I'll need to audit the TraceEvent code to make sure this isn't happening elsewhere.

CC @noahfalk

@davidwrighton
Member

@josalem, that's what it looks like to me. In general, it's very tricky to use types with finalizers correctly if the type exposes details that are supposed to be protected by the finalizer (such as the PinningHandle property). For instance, this is why we moved to using SafeHandles internally for FileStream, and why use of the Handle property on a FileStream is so strongly discouraged. Audits for correctness tend to miss issues, failures tend to be extremely hard to isolate, and the behavior of the code will change unpredictably from one version of .NET to the next.

@josalem
Contributor

josalem commented Jul 15, 2021

Thanks @davidwrighton! I now feel confident that the CopyBlob() segfault isn't a product failure in dotnet/runtime, but rather a pitfall in how EventPipeEventSource in Microsoft/PerfView behaves when run under gcstress.

To that end, I'll open an issue to track changing this logic in Microsoft/PerfView, and disable this test for gcstress runs. If we notice instability in other tracing tests under gcstress due to the same issue, we can disable them as well until an updated EventPipeEventSource is used.

@noahfalk
Member

Thanks @davidwrighton! By the way, if you've got questions about that code feel free to ask; I am the evil dev who authored it ; ) I dislike the degree to which TraceEvent resorts to raw pointer manipulation, but for better or worse it is baked into the interfaces between components somewhat pervasively :/ If we continue to find bugs similar to this one, it will increase the justification for addressing the problem more broadly.

@davidwrighton
Member

I don't really have any significant questions. It was more of an educational note for John: when he sees a finalizer in code behaving with this kind of problem, he should zero in on it as the most likely cause. The code didn't even seem really terrible; it's just that this sort of code is extremely easy to mess up, and a better approach would use a fixed statement instead of a pinned GCHandle if a pointer is really the right solution, mostly because a fixed statement makes the pin's lifetime explicit. Obviously doing so would require significant refactoring of the TraceEvent logic, and I really don't blame you for not wanting to take that on.
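For contrast, a rough sketch of that fixed-statement alternative (hypothetical, not a proposed patch): the pin's lifetime is lexically scoped, so there is no finalizer to race against:

    using System;

    static class FixedSketch
    {
        static unsafe void Process(byte[] data)
        {
            // Assumes data.Length >= 4. The array is pinned for exactly the
            // duration of the fixed block.
            fixed (byte* p = data)
            {
                int first = *(int*)p;
                Console.WriteLine(first);
            } // pin released here; p cannot outlive it by construction
        }
    }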

@VincentBu
Contributor Author

Failed again in runtime-coreclr gcstress0x3-gcstress0xc 20210727.1

Failed test:

CoreCLR windows x86 Checked gcstress0xc @ Windows.10.Amd64.Open

- Loader\ContextualReflection\ContextualReflection\ContextualReflection.cmd

Error message:

Assert failure(PID 5432 [0x00001538], Thread: 3704 [0x0e78]): ptrManagedAssemblyLoadContext != NULL

CORECLR! AssemblyNative::GetLoadContextForAssembly + 0xF3 (0x70f85e73)
SYSTEM.PRIVATE.CORELIB! <no symbol> + 0x0 (0x6fee0332)
SYSTEM.PRIVATE.CORELIB! <no symbol> + 0x0 (0x6fee214b)
<no module>! <no symbol> + 0x0 (0x08ba96b9)
<no module>! <no symbol> + 0x0 (0x08ba85a3)
<no module>! <no symbol> + 0x0 (0x087beeb8)
<no module>! <no symbol> + 0x0 (0x087bba48)
CORECLR! CallDescrWorkerInternal + 0x34 (0x71294ba1)
CORECLR! CallDescrWorker + 0xAA (0x70f969f9)
CORECLR! CallDescrWorkerWithHandler + 0xDE (0x70f96b37)
File: D:\workspace\_work\1\s\src\coreclr\vm\assemblynative.cpp Line: 1317
Image: C:\h\w\A59D098B\p\corerun.exe

Return code:      1
Raw output file:      C:\h\w\A59D098B\w\A7D90961\uploads\Reports\Loader.ContextualReflection\ContextualReflection\ContextualReflection.output.txt
Raw output:
BEGIN EXECUTION
"C:\h\w\A59D098B\p\corerun.exe"  ContextualReflection.dll
Expected: 100
Actual: -1073740286
END EXECUTION - FAILED
FAILED
Test Harness Exitcode is : 1
To run the test:

set CORE_ROOT=C:\h\w\A59D098B\p
C:\h\w\A59D098B\w\A7D90961\e\Loader\ContextualReflection\ContextualReflection\ContextualReflection.cmd
Expected: True
Actual:   False


Stack trace
   at Loader_ContextualReflection._ContextualReflection_ContextualReflection_._ContextualReflection_ContextualReflection_cmd() in Loader.ContextualReflection.XUnitWrapper.dll:token 0x6000004+0x284

@josalem
Contributor

josalem commented Jul 29, 2021

@VincentBu I don't think the ContextualReflection and rundown validation test failures are related. I think it would be worth opening a separate issue for the reflection failure.

@josalem
Contributor

josalem commented Aug 9, 2021

I'll be moving this issue to the 7.0.0 milestone in order to track completion of microsoft/perfview#1474 and the update of dotnet/runtime to the subsequent TraceEvent version.

josalem added the tracking-external-issue label on Aug 9, 2021
@VincentBu
Contributor Author

Failed again in: runtime 20210928.21

Failed test:

Mono OSX x64 Release no_tiered_compilation @ OSX.1014.Amd64.Open

- tracing/eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.sh

Error message:

cmdLine:/private/tmp/helix/working/A57108DA/w/AA700948/e/tracing/eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.sh Timed Out (timeout in milliseconds: 600000 from variable __TestTimeout, start: 9/28/2021 9:15:19 AM, end: 9/28/2021 9:25:20 AM)

Return code:      -100
Raw output file:      /tmp/helix/working/A57108DA/w/AA700948/uploads/Reports/tracing.eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.output.txt
Raw output:
BEGIN EXECUTION
/tmp/helix/working/A57108DA/p/corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false rundownvalidation.dll ''
0.0s: ==TEST STARTING==
0.3s: Connecting to EventPipe...
0.3s: Started sending sentinel events...
0.4s: Connected to EventPipe with sessionID '0x7fe6ce703e90'
0.4s: Creating EventPipeEventSource...
0.5s: EventPipeEventSource created
0.5s: Dynamic.All callback registered
0.5s: Running optional trace validator
0.5s: Finished running optional trace validator
0.5s: Starting stream processing...
0.7s: Saw new provider 'Microsoft-DotNETCore-SampleProfiler'
0.8s: Saw sentinel event
0.8s: Stopped sending sentinel events
0.8s: Starting event generating action...
0.8s: Stopping event generating action
0.8s: Sending StopTracing command...
0.9s: Saw new provider 'Microsoft-DotNETCore-EventPipe'
1.0s: Saw new provider 'Microsoft-Windows-DotNETRuntimeRundown'

cmdLine:/private/tmp/helix/working/A57108DA/w/AA700948/e/tracing/eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.sh Timed Out (timeout in milliseconds: 600000 from variable __TestTimeout, start: 9/28/2021 9:15:19 AM, end: 9/28/2021 9:25:20 AM)
Test Harness Exitcode is : -100
To run the test:

set CORE_ROOT=/tmp/helix/working/A57108DA/p
/private/tmp/helix/working/A57108DA/w/AA700948/e/tracing/eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.sh
Expected: True
Actual:   False


Stack trace
   at tracing_eventpipe._rundownvalidation_rundownvalidation_rundownvalidation_._rundownvalidation_rundownvalidation_rundownvalidation_sh()

@ViktorHofer
Member

According to runfo this is failing multiple times per week (not the same stack as in the initial post but the same as the post above): https://runfo.azurewebsites.net/search/tests/?q=started%3A%7E7+definition%3Aruntime+name%3Arundownvalidation

ViktorHofer added the blocking-clean-ci label on Sep 30, 2021
@josalem
Contributor

josalem commented Sep 30, 2021

Looking at the linked failures on the runfo page, I think there are two different failures, and neither is the one tracked in this issue. If I recall correctly, there's a different issue tracking the mono timeout failure that makes up the majority of them, but the two CoreCLR Windows x86 failures look new. When I get back to my desk I'll see about rearranging things so the right issue is tracked.

@josalem
Contributor

josalem commented Sep 30, 2021

@ViktorHofer I think the problem that runfo is tracking in this issue is actually #59296.

CC @lateralusX since this appears to be affecting Mono

@krwq
Member

krwq commented Oct 6, 2021

Can this test be disabled while the investigation is pending, similarly to #59994?

@lateralusX
Member

Looks like this issue has diverged from its original report (which was a CoreCLR crash) and is now most likely the same underlying issue as #59296.

@lateralusX
Member

I added it to the exclude list as well.

@hoyosjs
Member

hoyosjs commented Mar 9, 2022

The underlying issue has been fixed. The hits this month were for a PR.

hoyosjs closed this as completed on Mar 9, 2022
ghost locked this as resolved and limited conversation to collaborators on Apr 8, 2022