Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiment - Remove space restriction from XarchIntrinsicConstants #102913

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions src/coreclr/nativeaot/Runtime/startup.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -50,11 +50,11 @@ extern RhConfig * g_pRhConfig;

#if defined(HOST_X86) || defined(HOST_AMD64) || defined(HOST_ARM64)
// This field is inspected from the generated code to determine what intrinsics are available.
EXTERN_C int g_cpuFeatures;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are bit-tested at runtime using compiler generated code. This is going to be quite complicated to do without regressing perf on 32bit x86.

I already wrote this elsewhere - I would like to get an understanding why we're wasting bits in this enum - a lot of the bits are redundant, e.g. XArchIntrinsicConstants_Avx2 and XArchIntrinsicConstants_VectorT256. Why do we need this duplication? Is this putting RyuJIT implementation details ("VectorT256" that is not an ISA extension as far as I know) into PAL?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a lot of the bits are redundant, e.g. XArchIntrinsicConstants_Avx2 and XArchIntrinsicConstants_VectorT256

That's 3 bits (VectorT128, VectorT256, VectorT512). I would not call it a lot. Is there anything else?

Why do we need this duplication? Is this putting RyuJIT implementation details ("VectorT256" that is not an ISA extension as far as I know) into PAL?

Right, we do not need the VectorT* bits in the minipal. They got there by sequence of copy&paste operations of code that was in CoreCLR JIT/EE interface originally.

We do need the VectorT* bits on JIT/EE interface (that is different enum). The vector size is modelled as virtual ISA there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's 3 bits (VectorT128, VectorT256, VectorT512). I would not call it a lot. Is there anything else?

All of the VL suffixes that seem to just be shortcuts for XArchIntrinsicConstants_Avx512f_vl plus something else that already has a bit.

The question is basically whether we can get away with 32 bits until we're ready to call Sse41 and friends "baseline". The underlying oses are starting to do that on both Windows and Linux side. Or we'll need to do the more complicated thing and extend this to arbitrary lengths like this PR is doing.

Copy link
Member

@tannergooding tannergooding May 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a lot of the bits are redundant, e.g. XArchIntrinsicConstants_Avx2 and XArchIntrinsicConstants_VectorT256.

These are not duplicate. The former describes support for Avx2 the latter describes the size of Vector<T>, these do not have to line up and it may be important to understand the encoded size of Vector<T> for ABI purposes.

There are some opportunities to fold together bits. For example, AVX512*_VL could just be Avx512vl. Given that we require AVX512F+BW+CD+DQ+VL to be provided simultaneously, we could combine all of these into one as well.

We could notably also not allow as fine-grained control as this currently is allowing. Rather than allowing individual opt-in to SSE3, SSSE3, SSE4.1, etc, we could force these to be present at the respective target baselines (i.e. x86-64-v2, x86-64-v3, x86-64-v4) and only include the standalone ISAs that are independent of the primary baselines

However, all of that is less control than most other compilers toolchains give and we're likely to run out of bits sooner than we're able to meaningfully compress them and drop support for downlevel ISAs (although I am definitively still supportive of us moving towards an x86-64-v2 baseline). So it's probably goodness that we take a look at this since we're already pushing up against the boundaries already.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not duplicate.

They are duplicate/undesirable at PAL level. PAL should be policy free. XArchIntrinsicConstants_... is PAL level enum.

They are not duplicate at the JIT/EE interface level where the Vector policy size policy lives.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the VL suffixes that seem to just be shortcuts for XArchIntrinsicConstants_Avx512f_vl plus something else that already has a bit.

I agree that we can use just a single bit for VL in this enum if we needed to free up more bits.

int g_cpuFeatures = 0;
EXTERN_C HardwareIntrinsicConstants g_cpuFeatures;
HardwareIntrinsicConstants g_cpuFeatures = {0};

// This field is defined in the generated code and sets the ISA expectations.
EXTERN_C int g_requiredCpuFeatures;
EXTERN_C HardwareIntrinsicConstants g_requiredCpuFeatures;
#endif

#ifdef TARGET_UNIX
Expand Down Expand Up @@ -180,7 +180,7 @@ bool DetectCPUFeatures()
#if defined(HOST_X86) || defined(HOST_AMD64) || defined(HOST_ARM64)
g_cpuFeatures = minipal_getcpufeatures();

if ((g_cpuFeatures & g_requiredCpuFeatures) != g_requiredCpuFeatures)
if (!areEqualHardwareIntrinsicConstants(&g_cpuFeatures, &g_requiredCpuFeatures))
{
PalPrintFatalError("\nThe required instruction sets are not supported by the current CPU.\n");
RhFailFast();
Expand Down
372 changes: 277 additions & 95 deletions src/coreclr/tools/Common/Compiler/HardwareIntrinsicHelpers.cs

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions src/coreclr/tools/Common/InstructionSetHelpers.cs
Original file line number Diff line number Diff line change
Expand Up @@ -57,10 +57,10 @@ public static InstructionSetSupport ConfigureInstructionSetSupport(string instru

string jitInterfaceLibrary = "jitinterface_" + RuntimeInformation.ProcessArchitecture.ToString().ToLowerInvariant();
nint libHandle = NativeLibrary.Load(jitInterfaceLibrary, System.Reflection.Assembly.GetExecutingAssembly(), DllImportSearchPath.ApplicationDirectory);
int cpuFeatures;
HardwareIntrinsicHelpers.HardwareIntrinsicConstants cpuFeatures = new HardwareIntrinsicHelpers.HardwareIntrinsicConstants();
unsafe
{
var getCpuFeatures = (delegate* unmanaged<int>)NativeLibrary.GetExport(libHandle, "JitGetProcessorFeatures");
var getCpuFeatures = (delegate* unmanaged<HardwareIntrinsicHelpers.HardwareIntrinsicConstants>)NativeLibrary.GetExport(libHandle, "JitGetProcessorFeatures");
cpuFeatures = getCpuFeatures();
}
HardwareIntrinsicHelpers.AddRuntimeRequiredIsaFlagsToBuilder(instructionSetSupportBuilder, cpuFeatures);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -34,11 +34,11 @@ public static MethodIL EmitIsSupportedIL(MethodDesc method, FieldDesc isSupporte
{
case TargetArchitecture.X86:
case TargetArchitecture.X64:
flag = XArchIntrinsicConstants.FromInstructionSet(instructionSet);
flag = XArchIntrinsicFeatures.FromInstructionSet(instructionSet);
break;

case TargetArchitecture.ARM64:
flag = Arm64IntrinsicConstants.FromInstructionSet(instructionSet);
flag = Arm64IntrinsicFeatures.FromInstructionSet(instructionSet);
break;

default:
Expand Down Expand Up @@ -67,12 +67,12 @@ public static int GetRuntimeRequiredIsaFlags(InstructionSetSupport instructionSe
case TargetArchitecture.X86:
case TargetArchitecture.X64:
foreach (InstructionSet instructionSet in instructionSetSupport.SupportedFlags)
result |= XArchIntrinsicConstants.FromInstructionSet(instructionSet);
result |= XArchIntrinsicFeatures.FromInstructionSet(instructionSet);
break;

case TargetArchitecture.ARM64:
foreach (InstructionSet instructionSet in instructionSetSupport.SupportedFlags)
result |= Arm64IntrinsicConstants.FromInstructionSet(instructionSet);
result |= Arm64IntrinsicFeatures.FromInstructionSet(instructionSet);
break;

default:
Expand Down
2 changes: 1 addition & 1 deletion src/coreclr/tools/aot/jitinterface/jitwrapper.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,7 @@ DLL_EXPORT void JitProcessShutdownWork(ICorJitCompiler * pJit)
return pJit->ProcessShutdownWork(nullptr);
}

DLL_EXPORT int JitGetProcessorFeatures()
DLL_EXPORT HardwareIntrinsicConstants JitGetProcessorFeatures()
{
return minipal_getcpufeatures();
}
96 changes: 48 additions & 48 deletions src/coreclr/vm/codeman.cpp

Large diffs are not rendered by default.

190 changes: 126 additions & 64 deletions src/native/minipal/cpufeatures.c

Large diffs are not rendered by default.

123 changes: 74 additions & 49 deletions src/native/minipal/cpufeatures.h
Original file line number Diff line number Diff line change
Expand Up @@ -3,71 +3,96 @@

#ifndef HAVE_MINIPAL_CPUFEATURES_H
#define HAVE_MINIPAL_CPUFEATURES_H

#define NUM_PARTS 3
//
// Should match the constants defined in the compiler in HardwareIntrinsicHelpers.cs
//

typedef struct {
uint32_t parts[NUM_PARTS];
} HardwareIntrinsicConstants;

inline bool areEqualHardwareIntrinsicConstants(HardwareIntrinsicConstants *constants1, HardwareIntrinsicConstants *constants2) {
for (int i = 0; i < NUM_PARTS; i++) {
if (constants1->parts[i] != constants2->parts[i]) {
return false;
}
}
return true;
}

inline void setFlag(HardwareIntrinsicConstants *constants, int flag) {
constants->parts[flag / 32] |= (1 << (flag % 32));
}

inline void clearFlag(HardwareIntrinsicConstants *constants, int flag) {
constants->parts[flag / 32] &= ~(1 << (flag % 32));
}

inline bool isFlagSet(const HardwareIntrinsicConstants *constants, int flag) {
return (constants->parts[flag / 32] & (1 << (flag % 32))) != 0;
}

#if defined(HOST_X86) || defined(HOST_AMD64)
enum XArchIntrinsicConstants
enum XArchIntrinsicFeatures
{
XArchIntrinsicConstants_Aes = 0x0001,
XArchIntrinsicConstants_Pclmulqdq = 0x0002,
XArchIntrinsicConstants_Sse3 = 0x0004,
XArchIntrinsicConstants_Ssse3 = 0x0008,
XArchIntrinsicConstants_Sse41 = 0x0010,
XArchIntrinsicConstants_Sse42 = 0x0020,
XArchIntrinsicConstants_Popcnt = 0x0040,
XArchIntrinsicConstants_Avx = 0x0080,
XArchIntrinsicConstants_Fma = 0x0100,
XArchIntrinsicConstants_Avx2 = 0x0200,
XArchIntrinsicConstants_Bmi1 = 0x0400,
XArchIntrinsicConstants_Bmi2 = 0x0800,
XArchIntrinsicConstants_Lzcnt = 0x1000,
XArchIntrinsicConstants_AvxVnni = 0x2000,
XArchIntrinsicConstants_Movbe = 0x4000,
XArchIntrinsicConstants_Avx512f = 0x8000,
XArchIntrinsicConstants_Avx512f_vl = 0x10000,
XArchIntrinsicConstants_Avx512bw = 0x20000,
XArchIntrinsicConstants_Avx512bw_vl = 0x40000,
XArchIntrinsicConstants_Avx512cd = 0x80000,
XArchIntrinsicConstants_Avx512cd_vl = 0x100000,
XArchIntrinsicConstants_Avx512dq = 0x200000,
XArchIntrinsicConstants_Avx512dq_vl = 0x400000,
XArchIntrinsicConstants_Avx512Vbmi = 0x800000,
XArchIntrinsicConstants_Avx512Vbmi_vl = 0x1000000,
XArchIntrinsicConstants_Serialize = 0x2000000,
XArchIntrinsicConstants_VectorT128 = 0x4000000,
XArchIntrinsicConstants_VectorT256 = 0x8000000,
XArchIntrinsicConstants_VectorT512 = 0x10000000,
XArchIntrinsicConstants_Avx10v1 = 0x20000000,
XArchIntrinsicConstants_Avx10v1_V256 = 0x40000000,
XArchIntrinsicConstants_Avx10v1_V512 = 0x80000000,
XArchIntrinsicConstants_Aes = 0,
XArchIntrinsicConstants_Pclmulqdq = 1,
XArchIntrinsicConstants_Sse3 = 2,
XArchIntrinsicConstants_Ssse3 = 3,
XArchIntrinsicConstants_Sse41 = 4,
XArchIntrinsicConstants_Sse42 = 5,
XArchIntrinsicConstants_Popcnt = 6,
XArchIntrinsicConstants_Avx = 7,
XArchIntrinsicConstants_Fma = 8,
XArchIntrinsicConstants_Avx2 = 9,
XArchIntrinsicConstants_Bmi1 = 10,
XArchIntrinsicConstants_Bmi2 = 11,
XArchIntrinsicConstants_Lzcnt = 12,
XArchIntrinsicConstants_AvxVnni = 13,
XArchIntrinsicConstants_Movbe = 14,
XArchIntrinsicConstants_Avx512f = 15,
XArchIntrinsicConstants_Avx512f_vl = 16,
XArchIntrinsicConstants_Avx512bw = 17,
XArchIntrinsicConstants_Avx512bw_vl = 18,
XArchIntrinsicConstants_Avx512cd = 19,
XArchIntrinsicConstants_Avx512cd_vl = 20,
XArchIntrinsicConstants_Avx512dq = 21,
XArchIntrinsicConstants_Avx512dq_vl = 22,
XArchIntrinsicConstants_Avx512Vbmi = 23,
XArchIntrinsicConstants_Avx512Vbmi_vl = 24,
XArchIntrinsicConstants_Serialize = 25,
XArchIntrinsicConstants_VectorT128 = 26,
XArchIntrinsicConstants_VectorT256 = 27,
XArchIntrinsicConstants_VectorT512 = 28,
XArchIntrinsicConstants_Avx10v1 = 29,
XArchIntrinsicConstants_Avx10v1_V256 = 30,
XArchIntrinsicConstants_Avx10v1_V512 = 31,
};
#endif // HOST_X86 || HOST_AMD64

#if defined(HOST_ARM64)
enum ARM64IntrinsicConstants
{
ARM64IntrinsicConstants_AdvSimd = 0x0001,
ARM64IntrinsicConstants_Aes = 0x0002,
ARM64IntrinsicConstants_Crc32 = 0x0004,
ARM64IntrinsicConstants_Dp = 0x0008,
ARM64IntrinsicConstants_Rdm = 0x0010,
ARM64IntrinsicConstants_Sha1 = 0x0020,
ARM64IntrinsicConstants_Sha256 = 0x0040,
ARM64IntrinsicConstants_Atomics = 0x0080,
ARM64IntrinsicConstants_Rcpc = 0x0100,
ARM64IntrinsicConstants_VectorT128 = 0x0200,
ARM64IntrinsicConstants_Rcpc2 = 0x0400,
ARM64IntrinsicConstants_Sve = 0x0800,
ARM64IntrinsicConstants_AdvSimd = 0,
ARM64IntrinsicConstants_Aes = 1,
ARM64IntrinsicConstants_Crc32 = 2,
ARM64IntrinsicConstants_Dp = 3,
ARM64IntrinsicConstants_Rdm = 4,
ARM64IntrinsicConstants_Sha1 = 5,
ARM64IntrinsicConstants_Sha256 = 6,
ARM64IntrinsicConstants_Atomics = 7,
ARM64IntrinsicConstants_Rcpc = 8,
ARM64IntrinsicConstants_VectorT128 = 9,
ARM64IntrinsicConstants_Rcpc2 = 10,
ARM64IntrinsicConstants_Sve = 11,
};

#include <assert.h>

// Bit position for the ARM64IntrinsicConstants_Atomics flags, to be used with tbz / tbnz instructions
#define ARM64_ATOMICS_FEATURE_FLAG_BIT 7
static_assert((1 << ARM64_ATOMICS_FEATURE_FLAG_BIT) == ARM64IntrinsicConstants_Atomics, "ARM64_ATOMICS_FEATURE_FLAG_BIT must match with ARM64IntrinsicConstants_Atomics");
#define ARM64_ATOMICS_FEATURE_FLAG_VALUE 7
static_assert(ARM64_ATOMICS_FEATURE_FLAG_VALUE == ARM64IntrinsicConstants_Atomics, "ARM64_ATOMICS_FEATURE_FLAG_BIT must match with ARM64IntrinsicConstants_Atomics");

#endif // HOST_ARM64

Expand All @@ -76,7 +101,7 @@ extern "C"
{
#endif // __cplusplus

int minipal_getcpufeatures(void);
HardwareIntrinsicConstants minipal_getcpufeatures(void);
bool minipal_detect_rosetta(void);

#ifdef __cplusplus
Expand Down
Loading