Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release/7.0] Fix GUID interop in distributed transactions and block ARM #74570

Merged
merged 1 commit into from
Sep 13, 2022

Conversation

github-actions[bot]
Copy link
Contributor

@github-actions github-actions bot commented Aug 25, 2022

Backport of #74226 to release/7.0

Fixes #74170

/cc @roji

Customer Impact

Hard crash in GUID interop when using new distributed transactions feature in .NET 7.
Also blocks ARM64 since it's not working.

Testing

Significant testing done on main PR.

Risk

Risk is small; fix clear bug in new feature.

IMPORTANT: Is this backport for a servicing release? If so and this change touches code that ships in a NuGet package, please make certain that you have added any necessary package authoring and gotten it explicitly reviewed.

@roji
Copy link
Member

roji commented Aug 25, 2022

@carlossanlop
Copy link
Member

@jeffschwMSFT do you approve of this backport?
@AaronRobinsonMSFT can we get a sign-off?

@roji
Copy link
Member

roji commented Aug 25, 2022

@AaronRobinsonMSFT unless you object, we can hold off on this PR, as the arm64 issue still occurs (#74226 (comment)).

@AaronRobinsonMSFT
Copy link
Member

Let me get a sign off going.

@AaronRobinsonMSFT
Copy link
Member

@AaronRobinsonMSFT unless you object, we can hold off on this PR, as the arm64 issue still occurs (#74226 (comment)).

Right. That is going to get some pushback. We should really root cause this and create a single PR for the full fix since there seems to be something else also going on here.

@roji
Copy link
Member

roji commented Aug 25, 2022

Makes sense,

@AaronRobinsonMSFT AaronRobinsonMSFT added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Aug 25, 2022
@AaronRobinsonMSFT AaronRobinsonMSFT marked this pull request as draft August 25, 2022 17:10
@karelz karelz added this to the 7.0.0 milestone Aug 25, 2022
@danmoseley
Copy link
Member

Just checking in, where are we likely to land with this one and who is holding the baton? As you know, the bar goes up to servicing level next week.

@roji
Copy link
Member

roji commented Sep 7, 2022

@danmoseley this PR, as-is, apparently doesn't fix the original issue (intermittent AccessViolationException on arm64 only), see #74170 (comment) for some context. We're basically blocking on access to an arm64 machine which would allow us to further investigate and hopefully fix this; note that this is potentially a wider arm64 interop (or possibly MSDTC on arm64) issue.

Unless @kunalspathak or someone else can help us out with this, I plan to submit a separate PR to disable distributed transactions entirely on arm64 (throw PlatformNotSupportedException). I'll probably be doing that early next week.

@jkoritzinsky
Copy link
Member

@roji have you been able to reproduce the failure with this PR (and with the tests re-enabled)? I'm running the OleTxTests tests in a loop until I get a failure, but if you can get a dump, that would be preferable.

@roji
Copy link
Member

roji commented Sep 8, 2022

@jkoritzinsky as far as I know, the AccessViolationException this tracks has only ever occured on amd64, which I don't have access to; are you saying you're seeing it (on non-arm64? on arm64?). If you're seeing this on arm64 and can help out, then see comment #74170 (comment); @AaronRobinsonMSFT proposed we experiment with manual COM marshalling, which I've pushed in #74226, and we indeed need a dump of the interop stub (but are blocked because of arm64 access).

@karelz
Copy link
Member

karelz commented Sep 8, 2022

@roji there should be some arm64 machines available for investigations -- @MattGal @jeffschwMSFT @danmoseley could perhaps know who has access ... it is taken care of by @kunalspathak (see next reply).

@kunalspathak
Copy link
Member

Unless @kunalspathak or someone else can help us out with this,

I have sent email internally to provide you with an access to the Arm64 machine.

@jkoritzinsky
Copy link
Member

I was running it in an ARM64 machine. I let it run overnight, so I’ll see if I got a failure when I get to my machine today.

@jkoritzinsky
Copy link
Member

My terminal got killed, so I'll try it again.

@jkoritzinsky
Copy link
Member

jkoritzinsky commented Sep 8, 2022

@roji I've been running the OleTxTests test suite through the test runner process in a loop on a Windows 10 ARM64 box. It's currently executed more than 440 times without failure with the changes in this PR with 0969048 reverted (so the distributed transaction tests actually run).

Update: I'm at over 1000 executions without failure. Are we sure this still fails in a non-stress mode?

@jkoritzinsky
Copy link
Member

At execution 1036, I recieved the following error (not an AV):

System.AggregateException : One or more errors occurred. (The Transaction Manager returned a log full error. (0x8004D01A)) (The following constructor parameters did not have matching fixture data: OleTxFixture fixture)
      ---- System.Transactions.TransactionException : The Transaction Manager returned a log full error. (0x8004D01A)
      -------- System.Runtime.InteropServices.COMException : The Transaction Manager returned a log full error. (0x8004D01A)
      ---- The following constructor parameters did not have matching fixture data: OleTxFixture fixture
      Stack Trace:

        ----- Inner Stack Trace #1 (System.Transactions.TransactionException) -----
        E:\source\runtime2\src\libraries\System.Transactions.Local\src\System\Transactions\Oletx\OletxTransactionManager.cs(628,0): at System.Transactions.Oletx.OletxTransactionManager.ProxyException(COMException comException)
        E:\source\runtime2\src\libraries\System.Transactions.Local\src\System\Transactions\Oletx\OletxResourceManager.cs(325,0): at System.Transactions.Oletx.OletxResourceManager.EnlistDurable(OletxTransaction oletxTransaction, Boolean canDoSinglePhase, IEnlistmentNotificationInternal enlistmentNotification, EnlistmentOptions enlistmentOptions)
        E:\source\runtime2\src\libraries\System.Transactions.Local\src\System\Transactions\Oletx\OletxTransaction.cs(339,0): at System.Transactions.Oletx.OletxTransaction.EnlistDurable(Guid resourceManagerIdentifier, ISinglePhaseNotificationInternal singlePhaseNotification, Boolean canDoSinglePhase, EnlistmentOptions enlistmentOptions)
        E:\source\runtime2\src\libraries\System.Transactions.Local\src\System\Transactions\TransactionState.cs(1744,0): at System.Transactions.TransactionStatePromotedBase.EnlistDurable(InternalTransaction tx, Guid resourceManagerIdentifier, IEnlistmentNotification enlistmentNotification, EnlistmentOptions enlistmentOptions, Transaction atomicTransaction)
        E:\source\runtime2\src\libraries\System.Transactions.Local\src\System\Transactions\TransactionState.cs(552,0): at System.Transactions.EnlistableStates.EnlistDurable(InternalTransaction tx, Guid resourceManagerIdentifier, IEnlistmentNotification enlistmentNotification, EnlistmentOptions enlistmentOptions, Transaction atomicTransaction)
        E:\source\runtime2\src\libraries\System.Transactions.Local\src\System\Transactions\Transaction.cs(503,0): at System.Transactions.Transaction.EnlistDurable(Guid resourceManagerIdentifier, IEnlistmentNotification enlistmentNotification, EnlistmentOptions enlistmentOptions)
        E:\source\runtime2\src\libraries\System.Transactions.Local\tests\OleTxTests.cs(505,0): at System.Transactions.Tests.OleTxTests.OleTxFixture..ctor()
           at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
        E:\source\runtime2\src\libraries\System.Private.CoreLib\src\System\Reflection\ConstructorInvoker.cs(86,0): at System.Reflection.ConstructorInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
        ----- Inner Stack Trace -----
           at System.Transactions.DtcProxyShim.DtcInterfaces.IResourceManager.Enlist(ITransaction pTransaction, ITransactionResourceAsync pRes, Guid& pUOW, OletxTransactionIsolationLevel& pisoLevel, ITransactionEnlistmentAsync& ppEnlist)
        E:\source\runtime2\src\libraries\System.Transactions.Local\src\System\Transactions\DtcProxyShim\ResourceManagerShim.cs(29,0): at System.Transactions.DtcProxyShim.ResourceManagerShim.Enlist(TransactionShim transactionShim, OletxEnlistment managedIdentifier, EnlistmentShim& enlistmentShim)
        E:\source\runtime2\src\libraries\System.Transactions.Local\src\System\Transactions\Oletx\OletxResourceManager.cs(310,0): at System.Transactions.Oletx.OletxResourceManager.EnlistDurable(OletxTransaction oletxTransaction, Boolean canDoSinglePhase, IEnlistmentNotificationInternal enlistmentNotification, EnlistmentOptions enlistmentOptions)
        ----- Inner Stack Trace #2 (Xunit.Sdk.TestClassException) -----

@carlossanlop
Copy link
Member

FYI there are merge conflicts with the base branch now.

@danmoseley
Copy link
Member

It doesn't help really, but I did confirm from looking at the xunit code, that "The following constructor parameters did not have matching fixture data: OleTxFixture fixture" is just an outcome of failing to construct OleTxFixture due to the exception above.

@roji either some fragility in the product, in the DTC or whatever below that, or in OleTxFixture,.. ?

And fix GUID interop in distributed transactions

See #74170

(cherry picked from commit fdfef13)
@roji
Copy link
Member

roji commented Sep 9, 2022

First, thanks everybody for helping out with this...

Since we're about to branch for rc2 and we still have an unknown AccessViolationException on arm64, and since @AaronRobinson is OOF, I'm proposing to add a check to throw PlatformNotSupportedException if distributed transactions are attempted on arm64 (and also merge the already-approved GUID interop changes). We can continue investigating this post-rc2 (to at least exclude a deeper bug) or for 8.0, let's continue conversation on this in #74170.

  • About this latest failure... The OleTxTests fixture is hardened against some MSDTC flakiness (retry mechanism upon catching TransactionManagerCommunicationException). However, @jkoritzinsky's error is TransactionException : The Transaction Manager returned a log full error (XACT_E_LOGFULL), which indicates that the MSDTC's log filled up (e.g. see this post). This is the first time I've seen this error (it hasn't ever occurred on x64 CI runs), and I'm guessing it's caused by @jkoritzinsky's continuous runs somehow filling up the machine's log or similar (the post above describes how to increase the log size). I'm not sure there's anything we can/should do about this, at least until we see it happening in a less artificial scenario..
  • @jkoritzinsky, @kunalspathak indicated that they're able to reproduce the AccessViolationException without jitstress flags. @kunalspathak, maybe you can share how you manage to reliably repro the AccessViolationException here?
  • As above, I'm updating PR #74226 (and its 7.0 backport, #74570) to fix the interop GUID definition mismatch (includes #75219, /cc @jkoritzinsky) and throw PlatformNotSupportedException for arm64.

@roji roji changed the title [release/7.0] Fix GUID interop in distributed transactions [release/7.0] Fix GUID interop in distributed transactions and block ARM Sep 9, 2022
@roji roji marked this pull request as ready for review September 9, 2022 14:30
@danmoseley
Copy link
Member

Re log full - it might possibly be worth temporarily adding a throw in the finalizer of internaltransaction (can't see the type name on my phone right now) to verify it's not running. It's a long shot but if it was doing cleanup instead of dispose, and the tests exited before it ran, perhaps that could leave state in this "log".

@ajcvickers
Copy link
Member

@danmoseley

Re log full - it might possibly be worth temporarily adding a throw in the finalizer of internaltransaction (can't see the type name on my phone right now) to verify it's not running. It's a long shot but if it was doing cleanup instead of dispose, and the tests exited before it ran, perhaps that could leave state in this "log".

Is this something @roji should do before getting approval/merging for 7.0? Or something to follow up on in main?

@danmoseley
Copy link
Member

danmoseley commented Sep 9, 2022

Is this something @roji should do before getting approval/merging for 7.0? Or something to follow up on in main?

No, I don't think so. It's only showing on stress on Jeremy's machine so far, it's not clear whether it's product or not, and there's also no suggestion (I think) that this introduces it.

Leaving that aside -- if I understand correctly, we believe that this feature is shippable for x64, with none of these issues seen there? Are you comfortable shipping the feature disabled for 32 bit and for Arm? Is x64 (Windows of course) where customers are asking for it? Is there a possibility of surprise if eg., they develop on x64 and attempt to deploy on Arm64 (at least it will fail immediately)? I don't have an opinion, what do you think?

If we fix the AV and verify on Arm64, it does seem like we can enable it there in servicing.

@ajcvickers
Copy link
Member

ajcvickers commented Sep 9, 2022

@danmoseley I think it reasonable to bring it to x64 only in .NET 7. There will only be a small subsection of customers using this anyway, any the primary target audience is existing server apps running on Windows .NET Framework, which will generally be x64. I imagine we will want to get ARM64 working for the future, but I don't think we can realistically do that in .NET 7, and, given the target audience, I don't think backing out the entire feature is a better approach. For x86, we can probably wait for feedback.

I doubt we can do ARM64 in servicing, but we could consider that with a very strong case if the investigation shows conclusively that the broken part is simple and is safe to change.

@danmoseley
Copy link
Member

x64 only sounds good to me then. I do agree though we should continue investigation on Arm (eg it could be an issue common to both that only shows up there)

@roji
Copy link
Member

roji commented Sep 11, 2022

@danmoseley thanks for helping out here - I think @ajcvickers summed it up above; we expect the vast majority of the users of this feature to only need x64. I do hope we'll get to the bottom of the ARM64 issue soon; we can then evaluate whether it makes sense to to unblock that in servicing.

I think this PR still requires your approval (or tactics)?

@jkotas
Copy link
Member

jkotas commented Sep 11, 2022

The goal for Windows ARM64 is to have a full parity with Windows x64 so that there are no blockers for customers who want to migrate to Windows ARM64. This is going against that goal.

case Architecture.X86:
throw new PlatformNotSupportedException(SR.DistributedNotSupportedOn32Bits);

case Architecture.Armv6: // #74170
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case Architecture.Armv6 is unnecessary. Armv6 is Unix-only platform.

@jkotas
Copy link
Member

jkotas commented Sep 11, 2022

Intermittent AccessViolationException

Everybody (.NET, Windows, ...) is still working on long-tail of Windows Arm64 bugs, like this one. The fact that these bugs exist is not a good reason to completely disabling features.

@jkoritzinsky
Copy link
Member

Has anyone here been able to reproduce the AV with the changes in the PR and with the tests re-enabled?

@roji
Copy link
Member

roji commented Sep 12, 2022

Everybody (.NET, Windows, ...) is still working on long-tail of Windows Arm64 bugs, like this one. The fact that these bugs exist is not a good reason to completely disabling features

To be sure we're on the same page, @jkotas are you saying we should not block distributed transactions on arm64 although we know it produces intermittent AccessViolationException? I thought this wouldn't be ideal from a reliability standpoint (and this is something we do intend to get to the bottom of).

If this is what we want to do, I'll remove the exception from this PR and leave only the GUID interop fixes.

@jkotas
Copy link
Member

jkotas commented Sep 12, 2022

Right, I do not think we should be blocking the feature on arm64 due to intermittent AccessViolationException.

We should keep working on root-causing the intermittent AccessViolationException. (It is ok if the fix does not make it for .NET 7 GA and needs to wait for servicing. I guess there is also a possibility that the crash needs to be fixed in Windows.)

@danmoseley
Copy link
Member

Just curious, what's the blocker on x86?

@ajcvickers
Copy link
Member

@danmoseley Can we get a decision here, or does this need to go to Tactics? Specifically, we would like to get this merged for rc2 because it contains other fixes, and we can do that either with or without blocking arm64.

@roji
Copy link
Member

roji commented Sep 13, 2022

@danmoseley on x86 I get an ArgumentException when invoking DtcGetTransactionManagerExW (docs) (full exception details below). This is 100% consistent (no flakiness) and easy to reproduce.

Full exception details
  Error Message:
   System.AggregateException : One or more errors occurred. (Value does not fall within the expected range.) (The following constructor parameters did not have matching fixture data: OleTxFixture fixture)
---- System.ArgumentException : Value does not fall within the expected range.
---- The following constructor parameters did not have matching fixture data: OleTxFixture fixture
  Stack Trace:

----- Inner Stack Trace #1 (System.ArgumentException) -----
   at System.Transactions.DtcProxyShim.DtcProxyShimFactory.DtcGetTransactionManagerExW(String pszHost, String pszTmName, Guid& riid, Int32 grfOptions, Object pvConfigPararms, ITransactionDispenser& ppvObject)
   at System.Transactions.DtcProxyShim.DtcProxyShimFactory.DtcGetTransactionManager(String nodeName, ITransactionDispenser& localDispenser) in C:\projects\runtime\src\libraries\System.Transactions.Local\src\System\Transactions\DtcProxyShim\DtcProxyShimFactory.cs:line 52
   at System.Transactions.DtcProxyShim.DtcProxyShimFactory.ConnectToProxyCore(String nodeName, Guid resourceManagerIdentifier, Object managedIdentifier, Boolean& nodeNameMatches, Byte[]& whereabouts, ResourceManagerShim& resourceManagerShim) in C:\projects\runtime\src\libraries\System.Transactions.Local\src\System\Transactions\DtcProxyShim\DtcProxyShimFactory.cs:line 76
   at System.Transactions.DtcProxyShim.DtcProxyShimFactory.ConnectToProxy(String nodeName, Guid resourceManagerIdentifier, Object managedIdentifier, Boolean& nodeNameMatches, Byte[]& whereabouts, ResourceManagerShim& resourceManagerShim) in C:\projects\runtime\src\libraries\System.Transactions.Local\src\System\Transactions\DtcProxyShim\DtcProxyShimFactory.cs:line 62
   at System.Transactions.Oletx.DtcTransactionManager.Initialize() in C:\projects\runtime\src\libraries\System.Transactions.Local\src\System\Transactions\Oletx\DtcTransactionManager.cs:line 38
   at System.Transactions.Oletx.DtcTransactionManager.get_ProxyShimFactory() in C:\projects\runtime\src\libraries\System.Transactions.Local\src\System\Transactions\Oletx\DtcTransactionManager.cs:line 79
   at System.Transactions.Oletx.OletxTransactionManager.CreateTransaction(TransactionOptions properties) in C:\projects\runtime\src\libraries\System.Transactions.Local\src\System\Transactions\Oletx\OletxTransactionManager.cs:line 448
   at System.Transactions.TransactionStatePromoted.EnterState(InternalTransaction tx) in C:\projects\runtime\src\libraries\System.Transactions.Local\src\System\Transactions\TransactionState.cs:line 2172
   at System.Transactions.EnlistableStates.EnlistDurable(InternalTransaction tx, Guid resourceManagerIdentifier, IEnlistmentNotification enlistmentNotification, EnlistmentOptions enlistmentOptions, Transaction atomicTransaction) in C:\projects\runtime\src\libraries\System.Transactions.Local\src\System\Transactions\TransactionState.cs:line 549
   at System.Transactions.Transaction.EnlistDurable(Guid resourceManagerIdentifier, IEnlistmentNotification enlistmentNotification, EnlistmentOptions enlistmentOptions) in C:\projects\runtime\src\libraries\System.Transactions.Local\src\System\Transactions\Transaction.cs:line 503
   at System.Transactions.Tests.OleTxTests.OleTxFixture.<>c.<.ctor>b__0_0() in C:\projects\runtime\src\libraries\System.Transactions.Local\tests\OleTxTests.cs:line 534
   at System.Transactions.Tests.OleTxTests.Test(Action action) in C:\projects\runtime\src\libraries\System.Transactions.Local\tests\OleTxTests.cs:line 481
   at System.Transactions.Tests.OleTxTests.OleTxFixture..ctor() in C:\projects\runtime\src\libraries\System.Transactions.Local\tests\OleTxTests.cs:line 527
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.ConstructorInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr) in C:\projects\runtime\src\libraries\System.Private.CoreLib\src\System\Reflection\ConstructorInvoker.cs:line 86
----- Inner Stack Trace #2 (Xunit.Sdk.TestClassException) -----

@ajcvickers ajcvickers added the Servicing-consider Issue for next servicing release review label Sep 13, 2022
@carlossanlop carlossanlop added Servicing-consider Issue for next servicing release review and removed Servicing-consider Issue for next servicing release review labels Sep 13, 2022
@carlossanlop
Copy link
Member

FYI 7.0 backport PRs do not need to go through Tactics yet until after the snap. But if it's critical to discuss in Tactics, I guess it's ok.

@rbhanda rbhanda added Servicing-approved Approved for servicing release and removed Servicing-consider Issue for next servicing release review labels Sep 13, 2022
@roji
Copy link
Member

roji commented Sep 13, 2022

@carlossanlop @jkotas my understanding from Tactics is that we want to merge this as-is for now (including the arm64 block), and possibly review later for removing the arm64 block before GA. So @carlossanlop you can go ahead and merge.

@carlossanlop carlossanlop removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Sep 13, 2022
@carlossanlop
Copy link
Member

Sure @roji. Removing the "no merge" label.
Tactics approved. Signed off. CI failures are unrelated (Runtime.Numerics and Net.Http.Functional).
Ready to merge. :shipit:

@carlossanlop carlossanlop merged commit a2cf637 into release/7.0 Sep 13, 2022
@carlossanlop carlossanlop deleted the backport/pr-74226-to-release/7.0 branch September 13, 2022 18:17
@ghost ghost locked as resolved and limited conversation to collaborators Oct 13, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants