[mono][jit] Adding more arm64 SIMD operations, SIMD codegen with instruction table. #83094

jandupej · 2023-03-07T15:01:26Z

Transformation of Mono IR to machine code is now automated with a table in simd-arm64.h. The table format should be general enough to cover x86(-64) as well. Opcodes not listed in the table go through the preexisting process. Only arm64 instructions are considered right now.
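As a rough illustration of what a table that expands into a switch can look like, here is a minimal X-macro sketch. It is not the contents of simd-arm64.h: the `demo_*`/`DEMO_*` names, the opcode enum, and the row layout are all hypothetical. The two base encodings are the Advanced SIMD three-register forms of ADD and SUB for the .4S arrangement, with the register fields left at zero.

```c
#include <stdint.h>

/* Hypothetical IR opcodes for illustration; not Mono's real opcode enum. */
enum demo_op { DEMO_OP_VADD, DEMO_OP_VSUB, DEMO_OP_VNEG };

/* One row per table-driven opcode: (IR opcode, base encoding with all
 * register fields zero). Row format is an assumption of this sketch. */
#define DEMO_SIMD_TABLE(X)                                     \
    X(DEMO_OP_VADD, 0x4EA08400u) /* ADD Vd.4S, Vn.4S, Vm.4S */ \
    X(DEMO_OP_VSUB, 0x6EA08400u) /* SUB Vd.4S, Vn.4S, Vm.4S */

/* Returns 1 and stores the instruction word if op is in the table;
 * returns 0 so the caller can fall back to hand-written codegen. */
static int demo_emit(enum demo_op op, unsigned rd, unsigned rn, unsigned rm,
                     uint32_t *out)
{
    switch (op) {
/* Each table row becomes one case: OR the register numbers into
 * Rm (bits 20-16), Rn (bits 9-5) and Rd (bits 4-0). */
#define DEMO_CASE(ir, base)                                                  \
    case ir:                                                                 \
        *out = (base) | ((rm & 31u) << 16) | ((rn & 31u) << 5) | (rd & 31u); \
        return 1;
        DEMO_SIMD_TABLE(DEMO_CASE)
#undef DEMO_CASE
    default:
        return 0; /* e.g. DEMO_OP_VNEG: not listed, handled elsewhere */
    }
}
```

Adding an operation in this style means adding one row; anything not listed simply falls through to the `default` branch, which mirrors the "opcodes not listed go through the preexisting process" behavior described above.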

Also adds comparison operations, negation, and ones' complement.

Contributes to #80566 and #43051.

Note: before merging the final PR, disable SIMD until all intrinsics are implemented (as in 31a9c6c)

fanyang-mono · 2023-03-07T19:03:35Z

CI failures on runtime (Build Libraries Test Run release mono linux arm64 Debug) and runtime (Build browser-wasm linux Release AllSubsets_Mono_RuntimeTests monointerpreter) are not related to this PR.

vargaz · 2023-03-08T12:02:33Z

This makes the code a little bit hard to read. It might be better to do it for only a subset, like the OP_XBINOP ones.

jandupej · 2023-03-08T12:46:25Z

> This makes the code a little bit hard to read. It might be better to do it for only a subset, like the OP_XBINOP ones.

@vargaz Which part are you referring to? (the comment has no line number attached)

vargaz · 2023-03-08T13:09:15Z

The SIMD opcodes are not regular enough for them to be generated from a table, imho. The OP_XBINOP opcodes are easier to handle since they each map to an LLVM intrinsic, and the LLVM intrinsics map to SIMD instructions.

tannergooding · 2023-03-08T13:16:54Z

> The simd opcodes are not regular enough for them to be generated from a table imho

This might just be due to the way Mono is currently set up. Overall the data is incredibly regular and we need very little special handling in RyuJIT. This is true for both x86/x64 and Arm64. Arm64 is notably more regular than x64, simply because all the intrinsics were introduced in the same ISA, so the tables that represent them are denser.

jandupej · 2023-03-08T14:51:20Z

The table effectively unpacks into two tables (one where the opcode fully defines the operations, and one where inst_c0 is needed in addition to opcode). I agree this is confusing. The idea is to make adding a new mapping of mono IR to CPU instruction simpler; I think this solution goes in that general direction. So what can be done to make this more intuitive? I can think of these:

1. Implement the table only for XBINOP; the comparisons, unary operations, and others would then have to be handled manually. This, I think, would defeat the purpose of the tabular approach.
2. Consolidate the IR operations in a way that would enable a nice regular table. This would be a big job, I imagine, but might be worth it.
3. Document or streamline the proposed implementation better.
4. Rework the table usage so that it does not unpack into a switch, but fills an array at startup that would allow us to map opcode × inst_c0 to a basic arm64 opcode plus a function for adding register numbers and immediate values into it. The resulting code might be more readable.

Do you have any other ideas? @vargaz @tannergooding @fanyang-mono
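The last option above (an array filled at startup, keyed by opcode and inst_c0) could look roughly like the following sketch. Everything named `demo_*`/`DEMO_*` is hypothetical, including the opcode and sub-operation enums; only the two base encodings (ADD and SUB on .4S vectors) are taken from the Arm64 instruction set.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the umbrella opcode and its inst_c0 value. */
enum { DEMO_OP_XBINOP, DEMO_OP_XCOMPARE, DEMO_OP_COUNT };
enum { DEMO_IADD, DEMO_ISUB, DEMO_SUBOP_COUNT };

/* A "fill" function merges operands into a base instruction word. */
typedef uint32_t (*demo_fill_fn)(uint32_t base, unsigned rd, unsigned rn,
                                 unsigned rm);

typedef struct {
    uint32_t base;     /* encoding with register fields zeroed */
    demo_fill_fn fill; /* NULL means "no table entry, use the manual path" */
} demo_entry;

/* Static storage is zero-initialized, so every slot starts out empty. */
static demo_entry demo_table[DEMO_OP_COUNT][DEMO_SUBOP_COUNT];

/* Three-register form: Rm in bits 20-16, Rn in bits 9-5, Rd in bits 4-0. */
static uint32_t demo_fill_rrr(uint32_t base, unsigned rd, unsigned rn,
                              unsigned rm)
{
    return base | ((rm & 31u) << 16) | ((rn & 31u) << 5) | (rd & 31u);
}

/* Called once at startup to populate the (opcode, inst_c0) grid. */
static void demo_table_init(void)
{
    demo_table[DEMO_OP_XBINOP][DEMO_IADD] =
        (demo_entry){0x4EA08400u, demo_fill_rrr}; /* ADD Vd.4S, Vn.4S, Vm.4S */
    demo_table[DEMO_OP_XBINOP][DEMO_ISUB] =
        (demo_entry){0x6EA08400u, demo_fill_rrr}; /* SUB Vd.4S, Vn.4S, Vm.4S */
}

/* Returns 1 and stores the instruction word on a table hit, 0 otherwise. */
static int demo_encode(unsigned op, unsigned subop, unsigned rd, unsigned rn,
                       unsigned rm, uint32_t *out)
{
    if (op >= DEMO_OP_COUNT || subop >= DEMO_SUBOP_COUNT)
        return 0;
    const demo_entry *e = &demo_table[op][subop];
    if (e->fill == NULL)
        return 0;
    *out = e->fill(e->base, rd, rn, rm);
    return 1;
}
```

Compared with a macro that unpacks into a switch, this keeps the mapping as ordinary data that can be inspected in a debugger, at the cost of an indirect call per emitted instruction and a one-time initialization step.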

tannergooding · 2023-03-08T15:21:53Z

For RyuJIT, we effectively have one node type to represent all hardware intrinsics, GT_HWINTRINSIC, and we don't have unique node types for unary vs. binary vs. other operations. This is unlike regular operations, where we may have more explicit types such as GenTreeUnOp, GenTreeOp, etc.

This was done specifically to help with the general table-driven approach and, more broadly, with the specialized handling that SIMD nodes may require compared to other nodes.

The GenTreeHWIntrinsic node tracks some core information such as the:

- Intrinsic ID
- Operands (specially tracking 2, for the common case of 0/1/2 operands)
- SIMD Size Info (used to track input size, so you can differentiate between overloads)
- SIMD Base Type (used to track input type, again so you can differentiate between overloads)

This is in addition to "standard" things that all nodes track, like return type. There are also a couple special flags/fields that have multiple uses based on the intrinsic ID for the very few cases that do need specialized handling.

We then have one nice big table for each architecture:

This table tracks the ISA and function name, which combined form the Intrinsic ID (such as NI_AdvSimd_Add). It tracks the SIMD size (where -1 means the size is variable and 0 means the operation is "scalar") and the number of arguments the API takes (again, -1 means variable). It then holds a per-base-type mapping of which instruction is to be emitted (e.g. NI_AdvSimd_AbsoluteDifference is INS_sabd for sbyte and INS_uabd for byte).

We then track a category, which is used to help decide which of the table driven paths an API goes down. Most of these are HW_Category_SIMD. A category HW_Category_Helper generally indicates it needs some form of specialized handling or shouldn't be regularly encountered.

We also track some flags that help drive the compilation handling in general. HW_Flag_SpecialImport and HW_Flag_SpecialCodegen mean that some aspect of import or codegen, respectively, can't be table driven today. Most often this is just because of a unique behavior or handling that isn't worth adding a flag to handle.
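To make the row layout described above concrete, here is a small C sketch of one table entry and a per-base-type lookup. It is not the RyuJIT source (which is C++): the `demo_*`, `BT_*` and `DEMO_FLAG_NONE` names, the field order, and the ordering of the ten base types are assumptions. The identifiers NI_AdvSimd_AbsoluteDifference, INS_sabd, INS_uabd, INS_invalid, HW_Category_SIMD, HW_Category_Helper, HW_Flag_SpecialImport and HW_Flag_SpecialCodegen are the ones mentioned in this thread. Only the sbyte and byte slots are filled in, because those are the two mappings stated above; the real row may populate more, and the SIMD size used here is illustrative.

```c
#include <stddef.h>

/* INS_invalid is 0 so that unlisted slots default to "no instruction". */
typedef enum { INS_invalid = 0, INS_sabd, INS_uabd } demo_ins;

/* Ten base types per intrinsic; this ordering is an assumption. */
typedef enum {
    BT_SBYTE, BT_BYTE, BT_SHORT, BT_USHORT, BT_INT,
    BT_UINT, BT_LONG, BT_ULONG, BT_FLOAT, BT_DOUBLE,
    BT_COUNT
} demo_base_type;

typedef enum { HW_Category_SIMD, HW_Category_Helper } demo_category;

enum {
    DEMO_FLAG_NONE = 0,          /* hypothetical name for "no flags" */
    HW_Flag_SpecialImport = 1,
    HW_Flag_SpecialCodegen = 2
};

typedef enum { NI_AdvSimd_AbsoluteDifference, NI_COUNT } demo_id;

typedef struct {
    const char *name;      /* ISA + function name */
    int simd_size;         /* -1 = variable, 0 = scalar */
    int num_args;          /* -1 = variable */
    demo_ins ins[BT_COUNT]; /* instruction to emit, per base type */
    demo_category category;
    unsigned flags;
} demo_info;

static const demo_info demo_infos[NI_COUNT] = {
    [NI_AdvSimd_AbsoluteDifference] = {
        "AdvSimd.AbsoluteDifference", -1, 2,
        { [BT_SBYTE] = INS_sabd, [BT_BYTE] = INS_uabd }, /* rest: INS_invalid */
        HW_Category_SIMD, DEMO_FLAG_NONE
    },
};

/* Picks the instruction for an (intrinsic, base type) pair. */
static demo_ins demo_lookup_ins(demo_id id, demo_base_type bt)
{
    if ((unsigned)id >= NI_COUNT || (unsigned)bt >= BT_COUNT)
        return INS_invalid;
    return demo_infos[id].ins[bt];
}
```

With this shape, overloads are resolved purely by data: `demo_lookup_ins(NI_AdvSimd_AbsoluteDifference, BT_BYTE)` yields INS_uabd, while a base type the API does not support yields INS_invalid.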

For Arm64, we have roughly 461 intrinsics, of those:

- 55 have some form of specialized codegen requirement today
- 10 have some form of specialized import requirement today

Since each of these intrinsics has 10 types it needs to consider support for, there are a total of 4610 instruction entries tracked. The majority of the table is sparse: 2997 of those entries are empty. The reason for this level of sparseness is primarily that many APIs are only valid for float/double, so they account for 8 INS_invalid entries each. We could refactor this a bit to save on space, but haven't had much of a need to do so yet, particularly given the overall savings we're seeing from having the table-driven approach in the first place.

Notably there are also ~195 xplat APIs for NI_Vector64_* and NI_Vector128_* which are always specially imported today. We could make this table driven, but it'd be a slightly different table/setup since they provide fallback paths and more general customized support.

From all of this, we are able to generally table drive each phase of the JIT. We are also able to broadly share a large amount of logic between the x86, x64, and Arm64 intrinsic support.

This is applicable to:

- import
- morph
- value numbering
- lowering
- codegen

Outside of defining the tables and the HWIntrinsicInfo struct they map to, we aren't using macros or other things to simplify code.

We just map the ID to the HWIntrinsicInfo, call the relevant member to get the flags or other info required to determine which path to take, and then execute some small amount of shared handling.

Asserts exist to help ensure that anything that needs to be changed when new functionality is added gets handled.
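The flow described in the last few paragraphs (map the ID to its info record, read the category and flags, pick a path, and rely on asserts to catch unhandled additions) can be sketched in C as follows. All names here are hypothetical and the three sample intrinsics are invented; RyuJIT's actual code is C++ and differs in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical intrinsic IDs for illustration. */
typedef enum {
    DEMO_ID_ADD,
    DEMO_ID_SHUFFLE,
    DEMO_ID_HELPER,
    DEMO_ID_COUNT
} demo_id;

typedef enum { DEMO_CATEGORY_SIMD, DEMO_CATEGORY_HELPER } demo_category;
enum { DEMO_FLAG_NONE = 0, DEMO_FLAG_SPECIAL_CODEGEN = 1 };
typedef enum { DEMO_PATH_TABLE, DEMO_PATH_SPECIAL } demo_path;

typedef struct {
    demo_category category;
    unsigned flags;
} demo_info;

/* One record per ID, in enum order. */
static const demo_info demo_infos[] = {
    { DEMO_CATEGORY_SIMD,   DEMO_FLAG_NONE },            /* DEMO_ID_ADD */
    { DEMO_CATEGORY_SIMD,   DEMO_FLAG_SPECIAL_CODEGEN }, /* DEMO_ID_SHUFFLE */
    { DEMO_CATEGORY_HELPER, DEMO_FLAG_NONE },            /* DEMO_ID_HELPER */
};

/* Compile-time check: adding an ID without a table row fails the build. */
_Static_assert(sizeof demo_infos / sizeof demo_infos[0] == DEMO_ID_COUNT,
               "every intrinsic ID needs a row in demo_infos");

/* Decides whether codegen can use the shared table-driven path. */
static demo_path demo_codegen_path(demo_id id)
{
    assert((unsigned)id < DEMO_ID_COUNT);
    const demo_info *info = &demo_infos[id];

    if (info->category == DEMO_CATEGORY_HELPER)
        return DEMO_PATH_SPECIAL;
    if (info->flags & DEMO_FLAG_SPECIAL_CODEGEN)
        return DEMO_PATH_SPECIAL;
    return DEMO_PATH_TABLE;
}
```

The same pattern would presumably repeat per phase (import, lowering, and so on), each consulting its own flag, so that the shared handling stays small and the exceptions are visible in the table rather than scattered through the code.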

jandupej · 2023-03-08T15:47:46Z

Thanks, @tannergooding , that will be useful.

jandupej · 2023-03-10T10:21:36Z

After discussion with @vargaz, the table was modified to only encode the operations which are hidden under umbrella OP_s. Operations like OP_NEGATION are excluded for now. The longer-term plan is to consolidate these "stray" operations into a more predictable structure and include them then. That part is tracked by #83252.

jandupej added 3 commits February 22, 2023 16:36

[mono][jit] Handling vector operations on arm64, comparisions and a f…

6b75901

…ew others, tabular approach to instruciton specs.

Merge branch 'main' into arm64-intrin-cmp

b66d330

[mono][jit] Adding compare vector operations, not, neg. Mono IR to ma…

4dc6a40

…chine code transformation is now table-generated for select operations. Fixing issues.

dotnet-issue-labeler bot added the area-Codegen-JIT-mono label Mar 7, 2023

ghost assigned jandupej Mar 7, 2023

jandupej requested review from tannergooding and fanyang-mono March 7, 2023 15:02

YAGNI on a macro.

8f65963

fanyang-mono reviewed Mar 7, 2023

This was referenced Mar 7, 2023

Roslyn source generator crash on mono/linux/arm64 #81123

Closed

IOException running NuGet-Migrations during tests in dotnet CLI first run #80619

Closed

fanyang-mono reviewed Mar 7, 2023

src/mono/mono/mini/mini-arm64.c

src/mono/mono/mini/simd-arm64.h

vargaz reviewed Mar 8, 2023

src/mono/mono/mini/mini-arm64.c

[mono][jit] Table-driven code SIMD generation on arm64 is now restric…

5f12c7d

…ted to exclude certain operations that are easily implemented manually.

jandupej marked this pull request as ready for review March 9, 2023 15:53

jandupej requested review from lambdageek and SamMonoRT as code owners March 9, 2023 15:53

jandupej mentioned this pull request Mar 10, 2023

[mono][jit] Consolidate Mono IR operations that deal with SIMD #83252

Open

vargaz reviewed Mar 10, 2023

src/mono/mono/mini/mini-arm64.c (outdated)

vargaz reviewed Mar 10, 2023

src/mono/mono/mini/simd-intrinsics.c (outdated)

vargaz reviewed Mar 10, 2023

src/mono/mono/mini/mini-arm64.c

[mono][jit] Code cleanup.

80f5528

vargaz reviewed Mar 10, 2023

src/mono/mono/mini/mini-arm64.c

vargaz approved these changes Mar 10, 2023

build-analysis bot mentioned this pull request Mar 10, 2023

[release/6.0] Doublelinklist GC failures on Mono #83245

Closed

Temporarily disable SIMD on arm64. Fix indentation. Comments.

3757161

jandupej merged commit 02a7695 into dotnet:main Mar 13, 2023

build-analysis bot mentioned this pull request Mar 14, 2023

[jitstress] HardwareIntrinsics_ro fails with "process cannot access the file" error #83298

Closed

kotlarmilos mentioned this pull request Mar 14, 2023

[Perf] Linux/x64: 15 Improvements on 3/13/2023 2:14:21 PM dotnet/perf-autofiling-issues#13990

Closed

kotlarmilos mentioned this pull request Mar 23, 2023

[Perf] Linux/x64: 6 Improvements on 3/13/2023 2:14:21 PM dotnet/perf-autofiling-issues#14224

Closed

kotlarmilos mentioned this pull request Apr 4, 2023

.NET 8 Per-Preview Performance report on WASM, Mono AOT, and Interpreter #84302

Closed

ghost locked as resolved and limited conversation to collaborators Apr 12, 2023
