Add AVX512-IFMA intrinsics. #676

Merged: 6 commits into rust-lang:master from hdevalence:ifma on Feb 11, 2019

Conversation

@hdevalence (Contributor)

Progress on #310, adding all the (unmasked) versions of the IFMA intrinsics.

Because AVX512VL extends the AVX512 intrinsics to ymm and xmm operands, there's a question about where the 256-bit and 128-bit variants of the instructions should live. My feeling (mentioned in discussion in #310) is that the cleanest thing to do is to keep the length-extended intrinsics with their "parents", which is what I did here.

I don't know what the CI situation is, but I can confirm that these tests do pass on my Cannonlake machine (and don't pass when the numbers are changed).
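
For reference, here is a minimal sketch of what one of these unmasked intrinsics could look like. The LLVM link name and the exact attribute set are assumptions modeled on how other AVX-512 intrinsics in stdsimd are declared, not a verbatim excerpt from this PR:

    #[allow(improper_ctypes)]
    extern "C" {
        #[link_name = "llvm.x86.avx512.vpmadd52l.uq.512"]
        fn vpmadd52luq_512(z: __m512i, x: __m512i, y: __m512i) -> __m512i;
    }

    // Multiply the low 52 bits of each 64-bit lane of b and c, then add the
    // low 52 bits of the 104-bit product to the accumulator a.
    #[inline]
    #[target_feature(enable = "avx512ifma")]
    #[cfg_attr(test, assert_instr(vpmadd52luq))]
    pub unsafe fn _mm512_madd52lo_epu64(a: __m512i, b: __m512i, c: __m512i) -> __m512i {
        vpmadd52luq_512(a, b, c)
    }

The 256- and 128-bit AVX512VL variants would sit alongside this one, gated on #[target_feature(enable = "avx512ifma,avx512vl")].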

@hdevalence (Contributor, Author)

Not sure if the AppVeyor failures are related to these changes; the avx512ifma tests seem to have passed there...

@alexcrichton (Member)

Yeah, I think AppVeyor can be ignored, but it looks like some instruction assertions are failing on Travis?

@gnzlbg (Contributor) commented Feb 5, 2019

It appears that a vpmadd52luq is being generated for some intrinsics instead of the vpmadd52huq that the assert_instr tests require.

It appears to be a bug in the assert_instr() tests of the new intrinsics. E.g. the test for _mm512_madd52lo_epu64 requires vpmadd52huq, but the Intel Intrinsics Guide says that the intrinsic generates a vpmadd52luq (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_madd52lo_epu64&expand=3497), so the failures appear to be correct.

@hdevalence note that for the assert_instr test to run, you need to run the tests in --release mode. These tests do not run in debug mode.

@hdevalence (Contributor, Author)

> @hdevalence note that for the assert_instr test to run, you need to run the tests in --release mode. These tests do not run in debug mode.

Oops, thanks for pointing this out, I'll fix it!

@hdevalence (Contributor, Author)

Fixed; tested locally with --release. Thanks for pointing that out!

@gnzlbg (Contributor) commented Feb 5, 2019

Progress! It seems that the build is mostly green now :)

The verification of the intrinsics is failing:

---- verify_all_signatures stdout ----
failed to verify `_mm512_madd52hi_epu64`
  * intel cpuid `avx512ifma52` not in `avx512ifma` for _mm512_madd52hi_epu64
failed to verify `_mm512_madd52lo_epu64`
  * intel cpuid `avx512ifma52` not in `avx512ifma` for _mm512_madd52lo_epu64
failed to verify `_mm256_madd52hi_epu64`
  * intel cpuid `avx512ifma52` not in `avx512ifma,avx512vl` for _mm256_madd52hi_epu64
failed to verify `_mm256_madd52lo_epu64`
  * intel cpuid `avx512ifma52` not in `avx512ifma,avx512vl` for _mm256_madd52lo_epu64
failed to verify `_mm_madd52hi_epu64`
  * intel cpuid `avx512ifma52` not in `avx512ifma,avx512vl` for _mm_madd52hi_epu64
failed to verify `_mm_madd52lo_epu64`
  * intel cpuid `avx512ifma52` not in `avx512ifma,avx512vl` for _mm_madd52lo_epu64
failed to verify `_mm512_set1_epi64`
  * intrinsic `_mm512_set1_epi64` uses a 64-bit bare type but may be available on 32-bit platforms

It seems that the intel-x86.xml file states that these intrinsics require the avx512ifma52 CPUID feature flag, but we name this feature avx512ifma, so this check returns false, hence the error. Wanna give fixing this a try? The best approach would be to add a function that maps the Intel CPUID name to the Rust target-feature name: it should just return the CPUID name unchanged, unless the name is avx512ifma52, in which case it should return avx512ifma. Once you have this function, use it inside the contains check in the code above to translate the feature names.
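
A minimal sketch of that mapping function (the name and call site are illustrative, not the actual patch):

    // Translate Intel's CPUID name from intel-x86.xml into the Rust
    // target-feature name used by stdsimd.
    fn fixup_cpuid(cpuid: &str) -> &str {
        match cpuid {
            "avx512ifma52" => "avx512ifma",
            other => other,
        }
    }

At the verification site, the Intel name would then be translated before the containment check, e.g. rust_features.contains(fixup_cpuid(intel_cpuid)).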

@hdevalence (Contributor, Author)

I added a fixup pass for the CPUID names; hopefully it's OK. I'm not sure what to do about the remaining failure re: _mm512_set1_epi64... I don't understand why AVX512 would be available on 32-bit platforms anyway.

@gnzlbg (Contributor) commented Feb 7, 2019

@hdevalence wow, nice work, only one failure to go!

> I don't understand why AVX512 would be available on 32-bit platforms anyway.

You can compile for an x86 target with 32-bit-wide pointers and still use SIMD intrinsics. Basically, "32-bit platform" refers to the target_pointer_width. If you know that your application doesn't need to address more than 4 GiB of memory, you can save some space by making your pointers 32 bits wide instead of 64, even if your binary is running on a 64-bit system.

I'm not 100% sure, but I think that error message is trying to tell us that _mm512_set1_epi64 belongs in the x86_64 module instead of the x86 module. Would you mind checking whether moving the function makes the test pass?

@hdevalence (Contributor, Author)

Hmm, I don't think I really understand what's going on. stdsimd has a distinction between the x86 and x86_64 modules; previously I had no idea what the distinction was, so I stuck avx512ifma.rs in the same place as avx512f.rs.

From your comment I gather that the reason AVX512, AVX2, etc. are in x86 rather than x86_64 is that they can be used in 32-bit mode (like the x32 ABI?). But my understanding was that x32 is a Linux-specific ABI that just changes the pointer width to 32 bits, and all of the instructions are otherwise identical to normal x86-64 code. So I don't think I even understand what the criteria would be for it to be "correct" to put it in one place or another.

I noticed that there's a hardcoded check here: https://github.com/rust-lang-nursery/stdsimd/blob/master/crates/stdsimd-verify/tests/x86-intel.rs#L362 which includes _mm256_set1_epi64x, which is the ymm analogue of the desired zmm instruction, but I also don't really understand what that whitelist is doing. Would it be correct to add _mm512_set1_epi64 to that list instead of moving the intrinsics?

Sorry for all of the confusion on this; I'm happy to do whatever works or is correct.

@gnzlbg (Contributor) commented Feb 7, 2019

> So I don't think I even understand what the criteria would be for it to be "correct" to put it in one place or another.

The main criterion here has always been empirical: when compiling for 32-bit x86, LLVM fails to generate working machine code for some SIMD intrinsics, so we just put those in the x86_64 module and called it a day.

> I noticed that there's a hardcoded check here: https://github.com/rust-lang-nursery/stdsimd/blob/master/crates/stdsimd-verify/tests/x86-intel.rs#L362 which includes _mm256_set1_epi64x, which is the ymm analogue of the desired zmm instruction, but I also don't really understand what that whitelist is doing. Would it be correct to add _mm512_set1_epi64 to that list instead of moving the intrinsics?

Ah yes, since the tests pass on x86 targets (and only the verification fails), that would be correct (LLVM is able to generate code that works for these targets).

@alexcrichton (Member)

For x86/x86_64, the intention was that you'd be able to compile code for i686-unknown-linux-gnu and run it with intrinsics and whatnot. It looks like avx/avx2/etc. all work on i686-unknown-linux-gnu so long as the CPU supports them.

For intrinsics that take 64-bit types, though, they can't map to an instruction on i686-unknown-linux-gnu because there are no 64-bit registers, so they're moved into the x86_64 mod. It should be fine to define most AVX-512 things in x86, but anything taking a u64/i64 argument should go into x86_64.
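
As a concrete illustration of that split, here is an existing pair of intrinsics (the module paths are the real core::arch ones; the pairing is just an example):

    // A 32-bit-safe intrinsic lives in the x86 module and works on i686:
    #[cfg(target_arch = "x86")]
    use std::arch::x86::_mm_cvtsi32_si128; // takes an i32

    // Its 64-bit sibling takes an i64, so it is only defined for x86_64:
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::_mm_cvtsi64_si128; // takes an i64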

@gnzlbg (Contributor) commented Feb 7, 2019

> For intrinsics that take 64-bit types, though, they can't map to an instruction on i686-unknown-linux-gnu because there are no 64-bit registers, so they're moved into the x86_64 mod.

@alexcrichton the weird thing here is that the i686-unknown-linux-gnu build jobs for this intrinsic are passing, so LLVM appears to be able to generate working code for these (there is no assert_instr though, so the machine code might be horrible...).

@alexcrichton (Member)

Oh sure, yeah, LLVM "does the right thing" in that it gets the code to work on i686-unknown-linux-gnu, but it's sort of a lie, and it exposes how LLVM has tons of polyfills for all sorts of SIMD operations. There's no native implementation of taking a 64-bit integer and broadcasting it to 512 bits on i686-unknown-linux-gnu, but LLVM can still codegen one.

(we only want to stabilize official things, though, not things that LLVM happens to be able to do)
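
To make the polyfill point concrete, here is a rough scalar sketch of what broadcasting an i64 to 512 bits has to do on a 32-bit target, where the input arrives as two 32-bit halves (illustrative only; LLVM's actual lowering differs):

    // A 512-bit vector viewed as 16 u32 lanes.
    fn set1_epi64_polyfill(x: i64) -> [u32; 16] {
        let lo = x as u32;         // low 32-bit word
        let hi = (x >> 32) as u32; // high 32-bit word
        let mut out = [0u32; 16];
        for lane in out.chunks_exact_mut(2) {
            lane[0] = lo; // little-endian lane layout
            lane[1] = hi;
        }
        out
    }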

@gnzlbg (Contributor) commented Feb 7, 2019

So @hdevalence the safest thing is to just move that intrinsic to the x86_64 module.

@hdevalence (Contributor, Author)

Hmm, wouldn't the safest thing be to move all the AVX-512 intrinsics into the x86_64 module?

@hdevalence (Contributor, Author)

> There's no native implementation of taking a 64-bit integer and broadcasting it to 512 bits on i686-unknown-linux-gnu,

I don't really understand what this means; the "native" implementation according to https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_set1_epi64&expand=4927 is vpbroadcastq zmm, r64, and that is possible on any CPU that has AVX-512 registers: there are no processors that can parse EVEX but not REX.

@gnzlbg (Contributor) commented Feb 11, 2019

@hdevalence

With stdsimd you can write a binary that targets very old 32-bit targets without SIMD support, and that same binary can use run-time feature detection to use the SIMD registers when it runs on a modern 64-bit CPU.

We've had problems in the past with the code generated for functions using 64-bit integers in those targets, even though functions using SIMD registers are lowered just fine.

My recommendation is to stick to the policy of moving functions using 64-bit integers to the x86_64 module. If you care about using these functions when generating 32-bit binaries, please open an issue about it. We can always move the function to the x86 module later in a backwards compatible way.
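
A small sketch of the run-time dispatch pattern this enables (the function and scalar fallback are hypothetical; is_x86_feature_detected! is the real std macro):

    fn madd52lo(a: u64, b: u64, c: u64) -> u64 {
        if is_x86_feature_detected!("avx512ifma") {
            // A vectorized path calling the new intrinsics through an
            // #[target_feature(enable = "avx512ifma")] function would go
            // here; this sketch falls through to stay self-contained.
        }
        // Portable fallback: add the low 52 bits of the 104-bit product of
        // the low 52 bits of b and c to the 64-bit accumulator a.
        const M52: u64 = (1u64 << 52) - 1;
        let prod = (b & M52) as u128 * (c & M52) as u128;
        a.wrapping_add(prod as u64 & M52)
    }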

Per rust-lang#676 (comment), LLVM is able to generate code for this intrinsic on `x86` targets.
@hdevalence (Contributor, Author)

Updated so that _mm512_set1_epi64 matches its AVX2 analogue _mm256_set1_epi64x.

gnzlbg merged commit b035a7e into rust-lang:master on Feb 11, 2019
@gnzlbg (Contributor) commented Feb 11, 2019

Thank you so much!

hdevalence deleted the ifma branch on February 11, 2019 at 21:10
@hdevalence (Contributor, Author)

Awesome! I can try adding the mask variants later. Do you know what the average latency is from "merged into stdsimd" to "appears in nightly"?

@gnzlbg (Contributor) commented Feb 12, 2019

Once this PR lands (rust-lang/rust#58373), the nightly afterwards should contain these changes.

@hdevalence (Contributor, Author)

Cool, thanks for the information and the help with this PR!

@gnzlbg (Contributor) commented Mar 28, 2019

@hdevalence is any library using these already? We can probably start thinking about stabilizing these if that would help.

@hdevalence (Contributor, Author)

I'm using them in curve25519-dalek, but I'm not sure there's a rush to stabilize them, because they only run on one CPU, which is available only in a single laptop sold in mainland China and in a single NUC.

@gnzlbg (Contributor) commented Mar 29, 2019

I see; well, let us know if that changes!
