Add AVX512-IFMA intrinsics. #676

Merged: 6 commits into rust-lang:master from hdevalence:ifma on Feb 11, 2019

Conversation

@hdevalence (Contributor)

Progress on #310, adding all the (unmasked) versions of the IFMA intrinsics.

Because AVX512VL extends the AVX512 intrinsics to ymm and xmm operands, there's a question about where the 256-bit and 128-bit variants of the instructions should live. My feeling (mentioned in discussion in #310) is that the cleanest thing to do is to keep the length-extended intrinsics with their "parents", which is what I did here.

I don't know what the CI situation is, but I can confirm that these tests do pass on my Cannonlake machine (and don't pass when the numbers are changed).
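
For reference, here is a minimal sketch of what one of these unmasked intrinsics could look like. The LLVM link name and the exact attribute set are assumptions modeled on how other AVX-512 intrinsics in stdsimd are declared, not a verbatim excerpt from this PR:

    #[allow(improper_ctypes)]
    extern "C" {
        #[link_name = "llvm.x86.avx512.vpmadd52l.uq.512"]
        fn vpmadd52luq_512(z: __m512i, x: __m512i, y: __m512i) -> __m512i;
    }

    // Multiply the low 52 bits of each 64-bit lane of b and c, then add the
    // low 52 bits of the 104-bit product to the accumulator a.
    #[inline]
    #[target_feature(enable = "avx512ifma")]
    #[cfg_attr(test, assert_instr(vpmadd52luq))]
    pub unsafe fn _mm512_madd52lo_epu64(a: __m512i, b: __m512i, c: __m512i) -> __m512i {
        vpmadd52luq_512(a, b, c)
    }

The 256- and 128-bit AVX512VL variants would sit alongside this one, gated on #[target_feature(enable = "avx512ifma,avx512vl")].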

@hdevalence (Contributor, Author)

Not sure if the AppVeyor failures are related to these changes; the avx512ifma tests seem to have passed there...

@alexcrichton (Member)

Yeah, I think AppVeyor can be ignored, but it looks like some instruction assertions are failing on Travis?

@gnzlbg (Contributor) commented Feb 5, 2019

It appears that a vpmadd52luq is being generated for some intrinsics instead of the vpmadd52huq that the assert_instr tests require.

It appears to be a bug in the assert_instr() tests of the new intrinsics. E.g. the test for _mm512_madd52lo_epu64 requires vpmadd52huq, but the Intel Intrinsics Guide says that the intrinsic generates a vpmadd52luq (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_madd52lo_epu64&expand=3497), so the failures appear to be correct.

@hdevalence note that for the assert_instr test to run, you need to run the tests in --release mode. These tests do not run in debug mode.

@hdevalence (Contributor, Author)

> @hdevalence note that for the assert_instr test to run, you need to run the tests in --release mode. These tests do not run in debug mode.

Oops, thanks for pointing this out, I'll fix it!

@hdevalence (Contributor, Author)

Fixed; tested locally with --release. Thanks for pointing that out!

@gnzlbg (Contributor) commented Feb 5, 2019

Progress! It seems that the build is mostly green now :)

The verification of the intrinsics is failing:

---- verify_all_signatures stdout ----
failed to verify `_mm512_madd52hi_epu64`
  * intel cpuid `avx512ifma52` not in `avx512ifma` for _mm512_madd52hi_epu64
failed to verify `_mm512_madd52lo_epu64`
  * intel cpuid `avx512ifma52` not in `avx512ifma` for _mm512_madd52lo_epu64
failed to verify `_mm256_madd52hi_epu64`
  * intel cpuid `avx512ifma52` not in `avx512ifma,avx512vl` for _mm256_madd52hi_epu64
failed to verify `_mm256_madd52lo_epu64`
  * intel cpuid `avx512ifma52` not in `avx512ifma,avx512vl` for _mm256_madd52lo_epu64
failed to verify `_mm_madd52hi_epu64`
  * intel cpuid `avx512ifma52` not in `avx512ifma,avx512vl` for _mm_madd52hi_epu64
failed to verify `_mm_madd52lo_epu64`
  * intel cpuid `avx512ifma52` not in `avx512ifma,avx512vl` for _mm_madd52lo_epu64
failed to verify `_mm512_set1_epi64`
  * intrinsic `_mm512_set1_epi64` uses a 64-bit bare type but may be available on 32-bit platforms

It seems that the intel-x86.xml file states that these intrinsics require the avx512ifma52 CPUID feature flag, but we name this feature avx512ifma, so this check returns false, hence the error. Wanna give fixing this a try? The best approach would be to add a function that maps the Intel CPUID name to the Rust target-feature name: it should just return the CPUID name unchanged, unless the name is avx512ifma52, in which case it should return avx512ifma. Once you have this function, use it inside the contains check in the code above to translate the feature names.
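
A minimal sketch of that mapping function (the name and call site are illustrative, not the actual patch):

    // Translate Intel's CPUID name from intel-x86.xml into the Rust
    // target-feature name used by stdsimd.
    fn fixup_cpuid(cpuid: &str) -> &str {
        match cpuid {
            "avx512ifma52" => "avx512ifma",
            other => other,
        }
    }

At the verification site, the Intel name would then be translated before the containment check, e.g. rust_features.contains(fixup_cpuid(intel_cpuid)).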

@hdevalence (Contributor, Author)

I added a fixup pass for the CPUID names; hopefully it's OK. I'm not sure what to do about the remaining failure re: _mm512_set1_epi64... I don't understand why AVX512 would be available on 32-bit platforms anyway.

@gnzlbg (Contributor) commented Feb 7, 2019

@hdevalence wow, nice work, only one failure to go!

> I don't understand why AVX512 would be available on 32-bit platforms anyway.

You can compile for an x86 target with 32-bit-wide pointers and still use SIMD intrinsics. Basically, "32-bit platform" refers to the target_pointer_width. If you know that your application doesn't need to address more than 4 GiB of memory, you can save some space by making your pointers 32 bits wide instead of 64, even if your binary is running on a 64-bit system.

I'm not 100% sure, but I think that error message is trying to tell us that _mm512_set1_epi64 belongs in the x86_64 module instead of the x86 module. Would you mind checking whether moving the function makes the test pass?

@hdevalence (Contributor, Author)

Hmm, I don't think I really understand what's going on. stdsimd has a distinction between the x86 and x86_64 modules; previously I had no idea what the distinction was, so I stuck avx512ifma.rs in the same place as avx512f.rs.

From your comment I gather that the reason AVX512, AVX2, etc. are in x86 rather than x86_64 is that they can be used in 32-bit mode (like the x32 ABI?). But my understanding was that x32 is a Linux-specific ABI that just changes the pointer width to 32 bits, and all of the instructions are otherwise identical to normal x86-64 code. So I don't think I even understand what the criteria would be for it to be "correct" to put it in one place or another.

I noticed that there's a hardcoded check here: https://github.com/rust-lang-nursery/stdsimd/blob/master/crates/stdsimd-verify/tests/x86-intel.rs#L362 which includes _mm256_set1_epi64x, which is the ymm analogue of the desired zmm instruction, but I also don't really understand what that whitelist is doing. Would it be correct to add _mm512_set1_epi64 to that list instead of moving the intrinsics?

Sorry for all of the confusion on this; I'm happy to do whatever works or is correct.

@gnzlbg (Contributor) commented Feb 7, 2019

> So I don't think I even understand what the criteria would be for it to be "correct" to put it in one place or another.

The main criterion here has always been empirical: when compiling for 32-bit x86, LLVM fails to generate working machine code for some SIMD intrinsics, so we just put those in the x86_64 module and called it a day.

> I noticed that there's a hardcoded check here: https://github.com/rust-lang-nursery/stdsimd/blob/master/crates/stdsimd-verify/tests/x86-intel.rs#L362 which includes _mm256_set1_epi64x, which is the ymm analogue of the desired zmm instruction, but I also don't really understand what that whitelist is doing. Would it be correct to add _mm512_set1_epi64 to that list instead of moving the intrinsics?

Ah yes, since the tests pass on x86 targets (and only the verification fails), that would be correct (LLVM is able to generate code that works for these targets).

@alexcrichton (Member)

For x86/x86_64, the intention was that you'd be able to compile code for i686-unknown-linux-gnu and run it with intrinsics and whatnot. It looks like avx/avx2/etc. all work on i686-unknown-linux-gnu so long as the CPU supports them.

For intrinsics that take 64-bit types, though, they can't map to an instruction on i686-unknown-linux-gnu because there are no 64-bit registers, so they're moved into the x86_64 mod. It should be fine to define most AVX-512 things in x86, but anything taking a u64/i64 argument should go into x86_64.
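
As a concrete illustration of that split, here is an existing pair of intrinsics (the module paths are the real core::arch ones; the pairing is just an example):

    // A 32-bit-safe intrinsic lives in the x86 module and works on i686:
    #[cfg(target_arch = "x86")]
    use std::arch::x86::_mm_cvtsi32_si128; // takes an i32

    // Its 64-bit sibling takes an i64, so it is only defined for x86_64:
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::_mm_cvtsi64_si128; // takes an i64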

@gnzlbg (Contributor) commented Feb 7, 2019

> For intrinsics that take 64-bit types, though, they can't map to an instruction on i686-unknown-linux-gnu because there are no 64-bit registers, so they're moved into the x86_64 mod.

@alexcrichton the weird thing here is that the i686-unknown-linux-gnu build jobs for this intrinsic are passing, so LLVM appears to be able to generate working code for these (there is no assert_instr though, so the machine code might be horrible...).

@alexcrichton (Member)

Oh sure, yeah, LLVM "does the right thing" in that it gets the code to work on i686-unknown-linux-gnu, but it's sort of a lie, and it exposes how LLVM has tons of polyfills for all sorts of SIMD operations. There's no native implementation of taking a 64-bit integer and broadcasting it to 512 bits on i686-unknown-linux-gnu, but LLVM can still codegen one.

(we only want to stabilize official things, though, not things that LLVM happens to be able to do)
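
To make the polyfill point concrete, here is a rough scalar sketch of what broadcasting an i64 to 512 bits has to do on a 32-bit target, where the input arrives as two 32-bit halves (illustrative only; LLVM's actual lowering differs):

    // A 512-bit vector viewed as 16 u32 lanes.
    fn set1_epi64_polyfill(x: i64) -> [u32; 16] {
        let lo = x as u32;         // low 32-bit word
        let hi = (x >> 32) as u32; // high 32-bit word
        let mut out = [0u32; 16];
        for lane in out.chunks_exact_mut(2) {
            lane[0] = lo; // little-endian lane layout
            lane[1] = hi;
        }
        out
    }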

@gnzlbg (Contributor) commented Feb 7, 2019

So @hdevalence the safest thing is to just move that intrinsic to the x86_64 module.

@hdevalence (Contributor, Author)

Hmm, wouldn't the safest thing be to move all the AVX-512 intrinsics into the x86_64 module?

@hdevalence (Contributor, Author)

> There's no native implementation of taking a 64-bit integer and broadcasting it to 512 bits on i686-unknown-linux-gnu,

I don't really understand what this means; the "native" implementation according to https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_set1_epi64&expand=4927 is vpbroadcastq zmm, r64, and that is possible on any CPU that has AVX-512 registers: there are no processors that can parse EVEX but not REX.

@gnzlbg (Contributor) commented Feb 11, 2019

@hdevalence

With stdsimd you can write a binary that targets very old 32-bit targets without SIMD support, and that same binary can use run-time feature detection to use the SIMD registers when it runs on a modern 64-bit CPU.

We've had problems in the past with the code generated for functions using 64-bit integers in those targets, even though functions using SIMD registers are lowered just fine.

My recommendation is to stick to the policy of moving functions using 64-bit integers to the x86_64 module. If you care about using these functions when generating 32-bit binaries, please open an issue about it. We can always move the function to the x86 module later in a backwards compatible way.
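
A small sketch of the run-time dispatch pattern this enables (the function and scalar fallback are hypothetical; is_x86_feature_detected! is the real std macro):

    fn madd52lo(a: u64, b: u64, c: u64) -> u64 {
        if is_x86_feature_detected!("avx512ifma") {
            // A vectorized path calling the new intrinsics through an
            // #[target_feature(enable = "avx512ifma")] function would go
            // here; this sketch falls through to stay self-contained.
        }
        // Portable fallback: add the low 52 bits of the 104-bit product of
        // the low 52 bits of b and c to the 64-bit accumulator a.
        const M52: u64 = (1u64 << 52) - 1;
        let prod = (b & M52) as u128 * (c & M52) as u128;
        a.wrapping_add(prod as u64 & M52)
    }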

Per rust-lang#676 (comment), LLVM is able to generate code for this intrinsic on `x86` targets.
@hdevalence (Contributor, Author)

Updated so that _mm512_set1_epi64 matches its AVX2 analogue _mm256_set1_epi64x.

gnzlbg merged commit b035a7e into rust-lang:master on Feb 11, 2019
@gnzlbg (Contributor) commented Feb 11, 2019

Thank you so much!

hdevalence deleted the ifma branch on February 11, 2019 at 21:10
@hdevalence (Contributor, Author)

Awesome! I can try adding the mask variants later. Do you know what the average latency is from "merged into stdsimd" to "appears in nightly"?

@gnzlbg (Contributor) commented Feb 12, 2019

Once this PR lands (rust-lang/rust#58373), the nightly afterwards should contain these changes.

@hdevalence (Contributor, Author)

Cool, thanks for the information and the help with this PR!

@gnzlbg (Contributor) commented Mar 28, 2019

@hdevalence is any library using these already? We can probably start thinking about stabilizing these if that would help.

@hdevalence (Contributor, Author)

I'm using them in curve25519-dalek, but I'm not sure there's a rush to stabilize them, because they only run on one CPU, which is available only in a single laptop sold in mainland China and in a single NUC.

@gnzlbg (Contributor) commented Mar 29, 2019

I see; well, let us know if that changes!
