Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Handle half accesses for SIMDs in local morph #89520

Merged
merged 3 commits into from
Jul 27, 2023

Conversation

jakobbotsch
Copy link
Member

@jakobbotsch jakobbotsch commented Jul 26, 2023

While it's not generally expected that halves can be accessed directly (ending up with LCL_FLD), it sometimes happens in some of the SW implementations of Vector256/Vector512 methods. In rare cases the JIT still falls back to these even with there is HW acceleration.
In those cases we want to avoid DNER'ing the involved locals, so expand the existing recognition with these patterns.

Also add a check to the existing SIMD16 -> SIMD12 to verify the source is a SIMD16.

Fix #85359
Fix #89456

Some size wise regressions are expected, which come down to

  • A large number of similar looking tests end up now enregistering some locals that cause new upper half saves/restores to be required. This accounts for most of the size-wise regressions.
  • The expansions often do not result in smaller code because loading/storing the halves directly from/to stack is smaller code than the vector equivalent with extraction/insertion.

Many of the regressions are in SW implementations of Vector256/Vector512 methods that we usually do not expect to see called with HW acceleration supported.

Also add a check to the existing SIMD16 -> SIMD12 to verify the source
is a SIMD16.

Fix dotnet#85359
Fix dotnet#89456
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 26, 2023
@ghost ghost assigned jakobbotsch Jul 26, 2023
@ghost
Copy link

ghost commented Jul 26, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

While it's not generally expected that quarters/halves can be accessed directly (ending up with LCL_FLD), it sometimes happens in some of the SW implementations of Vector256/Vector512 methods. In rare cases the JIT still falls back to these even with there is HW acceleration.
In those cases we want to avoid DNER'ing the involved locals, so expand the existing recognition with these patterns.

Also add a check to the existing SIMD16 -> SIMD12 to verify the source is a SIMD16.

Fix #85359
Fix #89456

Some size wise regressions are expected, which come down to

  • A large number of similar looking tests end up now enregistering some locals that cause new upper half saves/restores to be required. This accounts for most of the size-wise regressions.
  • The expansions often do not result in smaller code because loading/storing the halves directly from/to stack is smaller code than the vector equivalent with extraction/insertion.

Many of the regressions are in SW implementations of Vector256/Vector512 methods that we usually do not expect to see called with HW acceleration supported.

Author: jakobbotsch
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@@ -1125,11 +1192,17 @@ class LocalAddressVisitor final : public GenTreeVisitor<LocalAddressVisitor>
}
else if (indir->TypeIs(TYP_SIMD12))
{
if ((offset == 0) && m_compiler->IsBaselineSimdIsaSupported())
if ((offset == 0) && (varDsc->TypeGet() == TYP_SIMD16) && m_compiler->IsBaselineSimdIsaSupported())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could adjust this to handle TYP_SIMD32/TYP_SIMD64 if desired, correct? Just not worth it as part of this PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's leave that for a separate PR if desired. This was just a drive-by fix for #89456.

@jakobbotsch jakobbotsch changed the title JIT: Handle half/quarter accesses for SIMDs in local morph JIT: Handle halfaccesses for SIMDs in local morph Jul 27, 2023
@jakobbotsch jakobbotsch changed the title JIT: Handle halfaccesses for SIMDs in local morph JIT: Handle half accesses for SIMDs in local morph Jul 27, 2023
@jakobbotsch jakobbotsch marked this pull request as ready for review July 27, 2023 13:06
@jakobbotsch
Copy link
Member Author

cc @dotnet/jit-contrib PTAL @tannergooding

Diffs. Size wise regressions as discussed above.

break;
}
default:
unreached();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we confident that unreached() is "correct" here or can/should we simply bail out of the transformation?

The main consideration would probably be that Vector<T> and Vector64<T> are backed by ulong fields. That shouldn't normally be an issue, but with all the various Unsafe.As/BitCast and other tricks, I'm not positive its not possible to see something here still.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine -- SelectLocalIndirTransform only returns WithElement/GetElement for the cases that we can handle here.

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Left a couple questions around certainty the unreached/assertions are "safe" given the various bitcasts and such we could encounter.

Size regression is primarily in tests and won't really impact the real world code most users see. The overall codegen is better (removing two stores) and while there is some more we could do, its a step in the right direction.

-- We do probably want to check the perf scores at some point since its showing the 2x insert (3 cycles, 1tp, 1uop, 1port each) is 1 cycle slower than the 2x stores (<11 cycles, 0.5-1tp, 1-2 uops, 2 ports each)

@jakobbotsch
Copy link
Member Author

Failures are known according to build analysis.

@jakobbotsch jakobbotsch merged commit 87526fb into dotnet:main Jul 27, 2023
132 of 135 checks passed
@jakobbotsch jakobbotsch deleted the fix-85359 branch July 27, 2023 17:07
@ghost ghost locked as resolved and limited conversation to collaborators Aug 26, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
2 participants