[mono][jit] Adding more arm64 SIMD operations, SIMD codegen with instruction table. #83094

Merged
merged 7 commits into from
Mar 13, 2023

Conversation

jandupej
Member

@jandupej jandupej commented Mar 7, 2023

Transformation of Mono IR to machine code is now automated with a table in simd-arm64.h. The table format should be general enough for x86(-64) as well. Opcodes not listed in the table go through the preexisting process. Only arm64 instructions are considered right now.
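
To make the approach concrete, here is a minimal sketch (not the code from this PR) of a table that expands into a switch. The names `IrOpcode`, `SIMD_TABLE`, `encode_3reg` and `emit_from_table` are hypothetical, and the two base encodings are assumed to be ADD and SUB on Vd.4S with zeroed register fields; check them against the Arm reference before relying on them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical IR opcodes; stand-ins for the real Mono opcodes. */
typedef enum { IR_XADD, IR_XSUB, IR_XSPECIAL } IrOpcode;

/* One row per table-driven opcode: IR opcode, base machine encoding.
 * Assumed encodings: ADD/SUB Vd.4S, Vn.4S, Vm.4S with register fields zeroed. */
#define SIMD_TABLE(ROW)       \
    ROW(IR_XADD, 0x4EA08400u) \
    ROW(IR_XSUB, 0x6EA08400u)

/* Place three 5-bit register numbers into a three-register encoding. */
static uint32_t encode_3reg(uint32_t base, unsigned rd, unsigned rn, unsigned rm)
{
    assert(rd < 32 && rn < 32 && rm < 32);
    return base | (rm << 16) | (rn << 5) | rd;
}

/* Returns 1 and writes the instruction word if the opcode is in the table.
 * Returns 0 otherwise so the caller can use the hand-written path. */
static int emit_from_table(IrOpcode op, unsigned rd, unsigned rn, unsigned rm,
                           uint32_t *out)
{
    switch (op) {
#define ROW(ir, base) case ir: *out = encode_3reg((base), rd, rn, rm); return 1;
    SIMD_TABLE(ROW)
#undef ROW
    default:
        return 0;
    }
}
```

Adding a mapping in this scheme means adding one `ROW(...)` line; anything that does not fit the row shape stays out of the table and keeps its existing handling.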

Adding comparison operations, negation, ones complement.

Contributes to #80566 and #43051.

Note: before merging the final PR, disable SIMD until all intrinsics are implemented (as in 31a9c6c)

…ew others, tabular approach to instruction specs.
…chine code transformation is now table-generated for select operations. Fixing issues.
Five resolved review threads on src/mono/mono/mini/mini-arm64.c (outdated).
@fanyang-mono
Member

CI failures on runtime (Build Libraries Test Run release mono linux arm64 Debug) and runtime (Build browser-wasm linux Release AllSubsets_Mono_RuntimeTests monointerpreter) are not related to this PR.

@vargaz
Contributor

vargaz commented Mar 8, 2023

This makes the code a little bit hard to read. It might be better to do it only for a subset like the OP_XBINOP ones.

@jandupej
Member Author

jandupej commented Mar 8, 2023

> This makes the code a little bit hard to read. It might be better to do it only for a subset like the OP_XBINOP ones.

@vargaz Which part are you referring to? (the comment has no line number attached)

@vargaz
Contributor

vargaz commented Mar 8, 2023

The simd opcodes are not regular enough for them to be generated from a table imho. The OP_XBINOP opcodes are easier to handle since each maps to an LLVM intrinsic, and the LLVM intrinsics map to SIMD instructions.

@tannergooding
Member

> The simd opcodes are not regular enough for them to be generated from a table imho

This might just be due to the way Mono is currently set up. Overall the data is incredibly regular and we need very little special handling in RyuJIT. This is true for both x86/x64 and Arm64. Arm64 is notably more regular than x64, simply because all the intrinsics were introduced in the same ISA, so the tables are denser.

@jandupej
Member Author

jandupej commented Mar 8, 2023

The table effectively unpacks into two tables (one where the opcode fully defines the operation, and one where inst_c0 is needed in addition to the opcode). I agree this is confusing. The idea is to make adding a new mapping of Mono IR to CPU instruction simpler; I think this solution goes in that general direction. So what can be done to make this more intuitive? I can think of these:

  • Implement the table only for XBINOP, but the comparisons and unary operations and others would have to be handled manually. This, I think, would defeat the purpose of the tabular approach.
  • Consolidate the IR operations in a way that would enable a nice regular table. This would be a big job, I imagine, but might be worth it.
  • Document or streamline the proposed implementation better.
  • Rework the table usage so that it does not unpack into a switch, but instead fills an array at startup that would allow us to map opcode × inst_c0 to a basic arm64 opcode plus a function that adds register numbers and immediate values to it. The resulting code might be more readable.
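
The last option could look roughly like the sketch below. It is an illustration only: `simd_map`, `simd_map_init`, `simd_emit` and the enum names are hypothetical stand-ins for Mono's umbrella opcodes and their inst_c0 values, and the base encodings are assumed to be ADD and SUB on Vd.4S.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for an umbrella opcode and its inst_c0 sub-operation. */
enum { OPC_XBINOP, OPC_XUNOP, OPC_COUNT };
enum { SUB_IADD, SUB_ISUB, SUB_COUNT };

typedef uint32_t (*EncodeFn)(uint32_t base, unsigned rd, unsigned rn, unsigned rm);

typedef struct {
    uint32_t base;   /* machine opcode with register fields zeroed */
    EncodeFn encode; /* NULL means "not table-driven" */
} SimdEntry;

/* Zero-initialized, so every slot starts out as "not table-driven". */
static SimdEntry simd_map[OPC_COUNT][SUB_COUNT];

static uint32_t encode_3reg(uint32_t base, unsigned rd, unsigned rn, unsigned rm)
{
    return base | (rm << 16) | (rn << 5) | rd;
}

/* Fill the map once at startup. */
static void simd_map_init(void)
{
    simd_map[OPC_XBINOP][SUB_IADD] = (SimdEntry){ 0x4EA08400u, encode_3reg };
    simd_map[OPC_XBINOP][SUB_ISUB] = (SimdEntry){ 0x6EA08400u, encode_3reg };
}

/* Returns 1 on a table hit, 0 when the caller must handle the opcode manually. */
static int simd_emit(int opcode, int c0, unsigned rd, unsigned rn, unsigned rm,
                     uint32_t *out)
{
    assert(opcode >= 0 && opcode < OPC_COUNT && c0 >= 0 && c0 < SUB_COUNT);
    const SimdEntry *e = &simd_map[opcode][c0];
    if (e->encode == NULL)
        return 0;
    *out = e->encode(e->base, rd, rn, rm);
    return 1;
}
```

Compared with a macro-generated switch, the lookup is ordinary code that can be stepped through in a debugger, at the cost of a small amount of startup work and memory for the empty slots.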

Do you have any other ideas? @vargaz @tannergooding @fanyang-mono

@tannergooding
Member

For RyuJIT, we effectively have one node type to represent all hardware intrinsics: GT_HWINTRINSIC, and we don't have unique node types for unary vs binary vs etc. This is unlike regular operations, where we may have more explicit types such as GenTreeUnOp, GenTreeOp, etc.

This was done specifically to help with the general table-driven approach and with the specialized handling that SIMD nodes may require compared to other nodes.

The GenTreeHWIntrinsic node tracks some core information such as the:

  • Intrinsic ID
  • Operands (specially tracking 2, for the common case of 0/1/2 operands)
  • SIMD Size Info (used to track input size, so you can differentiate between overloads)
  • SIMD Base Type (used to track input type, again so you can differentiate between overloads)

This is in addition to "standard" things that all nodes track, like return type. There are also a couple special flags/fields that have multiple uses based on the intrinsic ID for the very few cases that do need specialized handling.

We then have one nice big table for each architecture:

This table tracks the ISA and function name, which combined form the Intrinsic ID (such as NI_AdvSimd_Add). It also tracks the SIMD size (where -1 means the size is variable and 0 means the operation is "scalar"), the number of arguments the API takes (again, -1 means variable), and a "per-base type" mapping of which instruction is to be emitted (e.g. NI_AdvSimd_AbsoluteDifference is INS_sabd for sbyte and INS_uabd for byte).

We then track a category, which is used to help decide which of the table driven paths an API goes down. Most of these are HW_Category_SIMD. A category HW_Category_Helper generally indicates it needs some form of specialized handling or shouldn't be regularly encountered.

We also track some flags that help drive the compilation handling in general. HW_Flag_SpecialImport and HW_Flag_SpecialCodegen respectively mean that there is some aspect that can't be table driven today. Most often this is just because of a unique behavior or handling that isn't worth adding a flag to handle.
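
As a rough picture of one table row, here is a sketch in C (RyuJIT itself is C++, and its real struct layout is not shown in this thread). The struct fields, `lookup_ins`, the base-type enum order and the chosen size value are assumptions; only the sbyte/byte instruction mappings come from the description above, and every other slot is simply left unfilled.

```c
#include <assert.h>
#include <stddef.h>

/* Ten base types per intrinsic; names follow C# and the order is an assumption. */
typedef enum { BT_SBYTE, BT_BYTE, BT_SHORT, BT_USHORT, BT_INT, BT_UINT,
               BT_LONG, BT_ULONG, BT_FLOAT, BT_DOUBLE, BT_COUNT } BaseType;

typedef enum { INS_invalid = 0, INS_sabd, INS_uabd } Instruction;
typedef enum { NI_AdvSimd_AbsoluteDifference, NI_COUNT } IntrinsicId;
typedef enum { HW_Category_SIMD, HW_Category_Helper } Category;
enum { HW_Flag_None = 0, HW_Flag_SpecialImport = 1, HW_Flag_SpecialCodegen = 2 };

typedef struct {
    const char *name;
    int         simd_size;     /* -1: variable, 0: scalar */
    int         num_args;      /* -1: variable */
    Instruction ins[BT_COUNT]; /* per-base-type instruction, INS_invalid if none */
    Category    category;
    unsigned    flags;
} IntrinsicInfo;

static const IntrinsicInfo intrinsic_table[NI_COUNT] = {
    [NI_AdvSimd_AbsoluteDifference] = {
        .name      = "AbsoluteDifference",
        .simd_size = -1, /* assumed value for this sketch */
        .num_args  = 2,
        /* Only the two mappings named above; other slots default to INS_invalid. */
        .ins       = { [BT_SBYTE] = INS_sabd, [BT_BYTE] = INS_uabd },
        .category  = HW_Category_SIMD,
        .flags     = HW_Flag_None,
    },
};

/* Pick the instruction for an intrinsic ID and the SIMD base type of the node. */
static Instruction lookup_ins(IntrinsicId id, BaseType type)
{
    assert(id < NI_COUNT && type < BT_COUNT);
    return intrinsic_table[id].ins[type];
}
```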

For Arm64, we have roughly 461 intrinsics, of those:

  • 55 have some form of specialized codegen requirement today
  • 10 have some form of specialized import requirement today

Since each of these intrinsics has 10 types it needs to consider support for, there are a total of 4610 instruction entries tracked. The majority of this is "sparse" at 2997 entries. The reason for this level of sparseness is primarily because there are many APIs which are only valid for float/double and so they account for 8 INS_invalid entries each. We could refactor this a bit to save on the space, but haven't had much of a need to do so yet, particularly given the overall savings we're seeing from just having the table driven approach existing in the first place.
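
The arithmetic behind those numbers: 461 × 10 = 4610 slots, of which 2997 (about 65%) are INS_invalid, leaving 1613 populated. The small sketch below shows how a float/double-only row contributes 8 empty slots; `count_invalid`, `total_slots` and `INS_placeholder` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { BASE_TYPE_COUNT = 10 };
typedef enum { INS_invalid = 0, INS_placeholder } Instruction;

/* Total instruction slots for a given number of intrinsics. */
static size_t total_slots(size_t intrinsic_count)
{
    return intrinsic_count * BASE_TYPE_COUNT;
}

/* Count the empty (INS_invalid) slots in one table row. */
static size_t count_invalid(const Instruction *row)
{
    assert(row != NULL);
    size_t n = 0;
    for (size_t i = 0; i < BASE_TYPE_COUNT; i++) {
        if (row[i] == INS_invalid)
            n++;
    }
    return n;
}
```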

Notably there are also ~195 xplat APIs for NI_Vector64_* and NI_Vector128_* which are always specially imported today. We could make this table driven, but it'd be a slightly different table/setup since they provide fallback paths and more general customized support.


From all of this, we are able to generally table drive each phase of the JIT. We are also able to broadly share a large amount of logic between the x86, x64, and Arm64 intrinsic support.

This is applicable to:

  • import
  • morph
  • value numbering
  • lowering
  • codegen

Outside of defining the tables and the HWIntrinsicInfo struct they map to, we aren't using macros or other things to simplify code.

We just map the ID to the HWIntrinsicInfo, call the relevant member to get the flags or other info required to determine which path to take, and then execute some small amount of shared handling.

Asserts exist to help ensure that anything that needs to be changed when new functionality is added gets handled.
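
A minimal sketch of that lookup-then-branch pattern, again in C rather than RyuJIT's C++. The intrinsic IDs, `info_table`, `has_special_codegen` and `choose_codegen_path` are hypothetical; the static assert stands in for the kind of check that fails the build when an ID is added without a table row.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NI_Example_Regular, NI_Example_Special, NI_COUNT } IntrinsicId;
enum { HW_Flag_None = 0, HW_Flag_SpecialCodegen = 1 };
typedef enum { PATH_TABLE, PATH_SPECIAL } CodegenPath;

typedef struct {
    unsigned flags;
} IntrinsicInfo;

/* Unsized on purpose: the array length follows the highest initialized index. */
static const IntrinsicInfo info_table[] = {
    [NI_Example_Regular] = { HW_Flag_None },
    [NI_Example_Special] = { HW_Flag_SpecialCodegen },
};

/* Fails to compile if a new ID is appended without a matching row. */
_Static_assert(sizeof(info_table) / sizeof(info_table[0]) == NI_COUNT,
               "info_table must have one row per intrinsic ID");

static int has_special_codegen(const IntrinsicInfo *info)
{
    return (info->flags & HW_Flag_SpecialCodegen) != 0;
}

/* Map the ID to its info, query the flags, and pick the codegen path. */
static CodegenPath choose_codegen_path(IntrinsicId id)
{
    assert(id < NI_COUNT);
    const IntrinsicInfo *info = &info_table[id];
    return has_special_codegen(info) ? PATH_SPECIAL : PATH_TABLE;
}
```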

@jandupej
Member Author

jandupej commented Mar 8, 2023

Thanks, @tannergooding, that will be useful.

…ted to exclude certain operations that are easily implemented manually.
@jandupej
Member Author

After discussion with @vargaz, the table was modified to only encode the operations which are hidden under umbrella OP_s. Operations like OP_NEGATION are excluded for now. The longer-term plan is to consolidate these "stray" operations into a more predictable structure and include them then. That part is tracked by #83252.
