Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simple, but inelegant support for non-x86 using SIMDe #1163

Merged
merged 1 commit into from
Apr 27, 2021

Conversation

mr-c
Copy link
Contributor

@mr-c mr-c commented Mar 1, 2021

Could address #1160

Do you prefer a git submodule instead of the amalgamated header? Happy to switch that out.

The $(SFX) is useful for compiling different executables with -mavx for STAR-avx and -mav2 for STAR-avx2, etc..

Or you could add in a dispatcher around opal and compile that object multiple times. That would make the resulting binary much more portable and would be appreciated by docker container users, bioconda, etc..

@alexdobin
Copy link
Owner

Hi Michael,

thanks for the PR!
I am a bit lost on how this works.
I tried to build it with make SFX=-mavx2
but it does not pass the SFX flag to opal compilation.

Should we pass this flag explicitly, i.e.

opal/opal.o : opal/opal.cpp opal/opal.h
        cd opal && \
        $(CXX) -c -O3 $(CPPFLAGS) $(CXXFLAGS) -I. $(SFX) -std=c++11 opal.cpp

Cheers
Alex

@mr-c
Copy link
Contributor Author

mr-c commented Mar 8, 2021

Hey Alex. For Debian we set SFX to change the name of the executables and we add the appropriate compiler flag to CXXFLAGS / CFLAGS

The reason those are separate is that we also do a compilation with an additional compiler flag but with -SFX set to -plain.

We then ship all the versions and install a wrapper script (that picks the best version based upon the user's CPU) in place of the original executable

@mr-c
Copy link
Contributor Author

mr-c commented Mar 8, 2021

Sorry for not explaining this very well the first time!

@alexdobin
Copy link
Owner

Hi Michael,

thanks a lot for the explanation!
I need to think about how to merge it with the simple compilation for general users, where I need to supply the proper architecture. I will probably add auto-detection of AVX2/AVX/SSE4.1 (these are the only options opal can use).

I will make sure that if SFX is defined, then it will just use SFX logic.

Cheers
Alex

@mr-c
Copy link
Contributor Author

mr-c commented Mar 8, 2021

I will probably add auto-detection of AVX2/AVX/SSE4.1 (these are the only options opal can use).

With https://github.com/alexdobin/STAR/pull/1163/files#diff-e3c6cef1e1cddf53d9c53744c9c5baaf1159edc926f45db748ea4eb02f2b1fc6R10 opal can use sub-SSE 4.1, like SSE2; or without SSE all together.

thanks a lot for the explanation!

You are very welcome!

@mr-c
Copy link
Contributor Author

mr-c commented Mar 8, 2021

@alexdobin
Copy link
Owner

Hi Michael,

I am not sure I understand how to make SIMDe work properly.
I thought that if I use -mavx2, the compilation will work on a machine with any SIMD level, using (if necessary) SIMDe emulation for lower SIMD levels.

I have tried this on a machine with max sse4.2, and it threw compile errors with -mavx2 , -msse4.2 , -msse4.1
It compiled with -mavx , with a bunch of "redefined" warnings from simde_avx2.h

Many thanks again!
Alex

@mr-c
Copy link
Contributor Author

mr-c commented Mar 9, 2021

I am not sure I understand how to make SIMDe work properly.
I thought that if I use -mavx2, the compilation will work on a machine with any SIMD level, using (if necessary) SIMDe emulation for lower SIMD levels.

Hey Alex,

SIMDe enables your code to work the same way for the capabilities it was compiled for.

For example, in Debian we compile STAR multiple times, once each for AVX2, AVX, SSE4.2, etc..
So we get separate executables. We decide which executable to run via the script at https://salsa.debian.org/med-team/rna-star/-/blob/master/debian/bin/simd-dispatch which checks the current processor's capabilities and picks accordingly.

If you want one binary that will run using the highest SIMD level available, then you use https://gcc.gnu.org/onlinedocs/gcc/Function-Multiversioning.html or compile the relevant compilation unit (library) multiple times and add your own dispatching code. This is a bit more advanced :-)

@alexdobin
Copy link
Owner

Hi Michael,

sorry, I am still missing something.

Opal supports sse4.1, sse4.2, avx and avx2.
Imagine we have a machine that does not have sse4.1 and higher support.
I thought that SIMDe would allow compiling Opal (say with -msse4.1) on this machine, and SIMDe will "translate" the sse4.1 instructions so that the code can still be run on this machine.
Of course, this is especially important for ARMs, I doubt there any x86 systems out there without sse4.1 support.

Thanks!
Alex

@mr-c
Copy link
Contributor Author

mr-c commented Mar 9, 2021

SIMDe lets Opal use SSE4.1 intrinsics in its source and the result can be compiled for SSE4.1 (-msse4.1) and any "lower" SIMD level, including no intel SIMD at all.

non-SSE4.1 on X86 systems is required for Debian, as per the "amd64" Debian Architecture baseline

For ARM & ARM64, SIMDe includes optimized implementation of the Intel SSE* intrinsics using the ARM NEON instructions

SIMDe is insufficient on its own to achieve a single binary that can run on any X86 processor at the highest level of SIMD support; for that one needs a CPU dispatcher in addition.

@alexdobin
Copy link
Owner

Thanks for the clarification!
I am almost on the same page, I hope.
I understand that the binaries need to be generated for each architecture, and dispatching is a great idea, but I am not there yet.
At the moment I am trying to check user experience for compiling the code on a machine that does not support AVX2.
I tried to compile Opal a machine that supports SSE4.2 but not AVX. It's RedHat 6.9, GCC 5.3.0

The original Opal version compiles fine with -mavx and -msse4.2

The SIMDe Opal version compiles with -mavx, throwing a few warning like those:

./simde_avx2.h:34897:0: warning: "_mm256_i64gather_pd" redefined
   #define _mm256_i64gather_pd(base_addr, vindex, scale) simde_mm256_i64gather_pd(HEDLEY_REINTERPRET_CAST(simde_float64 const*, base_addr), vindex, scale)

However,it seems the object still contains AVX2 commands:

  30:   c5 f1 fe c0             vpaddd %xmm0,%xmm1,%xmm0
  30:   c5 f1 fe c0             vpaddd %xmm0,%xmm1,%xmm0

I also see these commands when I compile the original Opal with -mavx

Another problem is that the compilation with -msse4.2 fails:

In file included from opal.cpp:10:0:
./simde_avx2.h:26316:26: error: conflicting declaration ‘typedef __vector(4) long int __m256i’
     typedef simde__m256i __m256i;
                          ^
In file included from /sonas-hs/gingeras/nlsas_norepl/user/dobin/Software/GCC/5.3.0/gcc-5.3.0-install/lib/gcc/x86_64-unknown-linux-gnu/5.3.0/include/immintrin.h:41:0,
                 from /sonas-hs/gingeras/nlsas_norepl/user/dobin/Software/GCC/5.3.0/gcc-5.3.0-install/lib/gcc/x86_64-unknown-linux-gnu/5.3.0/include/x86intrin.h:46,
                 from /sonas-hs/gingeras/nlsas_norepl/user/dobin/Software/GCC/5.3.0/gcc-5.3.0-install/include/c++/5.3.0/x86_64-unknown-linux-gnu/bits/opt_random.h:33,
                 from /sonas-hs/gingeras/nlsas_norepl/user/dobin/Software/GCC/5.3.0/gcc-5.3.0-install/include/c++/5.3.0/random:50,
                 from /sonas-hs/gingeras/nlsas_norepl/user/dobin/Software/GCC/5.3.0/gcc-5.3.0-install/include/c++/5.3.0/bits/stl_algo.h:66,
                 from /sonas-hs/gingeras/nlsas_norepl/user/dobin/Software/GCC/5.3.0/gcc-5.3.0-install/include/c++/5.3.0/algorithm:62,
                 from opal.cpp:1:
/sonas-hs/gingeras/nlsas_norepl/user/dobin/Software/GCC/5.3.0/gcc-5.3.0-install/lib/gcc/x86_64-unknown-linux-gnu/5.3.0/include/avxintrin.h:56:19: note: previous declaration as ‘typedef __vector(4) long long int __m256i’
 typedef long long __m256i __attribute__ ((__vector_size__ (32),

I wonder if this may indicate some sort of incompatibility between Opal and SIMDe.

Cheers
Alex

@nemequ
Copy link

nemequ commented Mar 10, 2021

Using AVX for example (just so I don't have to type both a bunch of times):

The -mavx flags tell the compiler that it is allowed to generate AVX code. This doesn't just mean that you can explicitly call AVX functions; the compiler is actually allowed to use those features for all code. Since AVX code is allowed, the compiler also defines the __AVX__ macro to indicate that code is allowed to use native AVX functions.

The __AVX__ macro is what determines whether SIMDe will do. For example, if __AVX__ is defined then SIMDe will implement simde_mm256_add_ps using _mm256_add_ps, otherwise it will fall back on something else (a pair of _mm_add_ps on SSE+, vaddq_f32 on NEON, vec_add on POWER, etc., or a pure-C portable implementation in the worst case).

On most compilers you're actually not allowed to call native AVX functions if you didn't pass the flag to enable AVX; you'll usually get an error about a "target mismatch". It is possible to get around this for a specific function (using the target attribute), but honestly that's not used all that often since not all compilers support it.

Defining SIMDE_ENABLE_NATIVE_ALIASES prior to including SIMDe will also cause SIMDe to #define _mm256_add_ps(a, b) simde_mm256_add_ps((a), (b)) so you don't have to change your code.

So basically what SIMDe allows you to do is call functions like simde_mm256_add_ps (or, if you have enabled native aliases, _mm256_add_ps) regardless of whether you have enabled AVX support in your compiler. It will then choose the best implementation it knows about based on the preprocessor macros which are defined by the compiler. If you pass -mavx it (and the compiler in general) will use AVX, but if you only pass -msse4.2 it will be limited to SSE 4.2.

As for the redefinition warnings, usually that happens because you're including both the native header (e.g., immintrin.h) and SIMDe, but a bug in SIMDe is definitely feasible. If you can post some info on how to reproduce the warnings (commands used and whether you have made any modifications to Michael's patch) I'd be happy to take a look and see if I can figure out what is going on. FWIW, SIMDe should work without warnings with -mavx (in fact, we have a CI job to try to make sure it does).

@mr-c
Copy link
Contributor Author

mr-c commented Mar 11, 2021

As for the redefinition warnings, usually that happens because you're including both the native header (e.g., immintrin.h) and SIMDe, but a bug in SIMDe is definitely feasible. If you can post some info on how to reproduce the warnings (commands used and whether you have made any modifications to Michael's patch) I'd be happy to take a look and see if I can figure out what is going on. FWIW, SIMDe should work without warnings with -mavx (in fact, we have a CI job to try to make sure it does).

This may be due to me having made an amalgamated header for simde/x86/avx2.h for this pull request. I wonder if it persists if we switch to a local copy (or git submodule / git sub tree)

@nemequ
Copy link

nemequ commented Mar 12, 2021

This may be due to me having made an amalgamated header for simde/x86/avx2.h for this pull request. I wonder if it persists if we switch to a local copy (or git submodule / git sub tree)

I doubt it; the amalgamated header shouldn't really be any different… the scrcipt just scans the files and replaces #include "..." with the file, which is basically the same thing the compiler does. The only real difference is that the script treats every include as an include_once.

I managed to reproduce the error @alexdobin is seeing, it looks like it only happens on very old versions of GCC. It looks like libc's opt_random.h (included indirectly from opal.cpp via algorithm.h) includes x86intrin.h, which includes immintrin.h (which is where the AVX types are defined). I'm not sure when exactly GCC fixed that issue, but we can work around it with a pretty straightforward patch in SIMDe: e8b7a2ec175ceb3725ce0827ef9a6725b6309cc9.

@alexdobin
Copy link
Owner

Hi Michael and Evan,

thanks a lot for the explanations, and for looking into the compilation issues.
I finally got, I think. My misconception was that I was thinking that SIMDe is trying to match the "native" architecture of the machine where it compiles. It makes more sense, of course, that it actually compiles for the architecture specified with -m...

The warning might be indeed related to the old GCC. I do not get any warnings with 9.2.0 . It would be great if it could work with older versions as many people still have them. As a stopgap, I could make SIMDe optional for such cases.

However, there is a more serious issue.
Opal deals with the architecture itself: inside the opal.cpp, it checks the AVX2 and SSE4_1 macros, and chooses the proper SIMD commands. If none of these macros are defined, it still compiles (w/o warnings), but at the run-time simply returns an error when called.
So basically it will not work with any flags other than -msse4.x or -mavx2.
I should be able to fix it by removing the SSE4_1 checks from opal.cpp.

Not sure if it's feasible, but I wonder if SIMDe can (optionally) use its own architecture target, independent of the compiler's -m... option. This would simplify retrofitting old codes for SIMDe.

Cheers
Alex

@nemequ
Copy link

nemequ commented Mar 13, 2021

The warning might be indeed related to the old GCC. I do not get any warnings with 9.2.0 . It would be great if it could work with older versions as many people still have them. As a stopgap, I could make SIMDe optional for such cases.

Sorry, I wasn't really clear in my previous comment: the patch I linked to is already merged into SIMDe. As soon as the copy of SIMDe in the patch is updated to the newer version it should work on older compilers. I don't have a copy of RHEL 6.9 sitting around, but it works on Ubuntu 16.04 with GCC 5 (where I was able to reproduce the problem before applying that patch to SIMDe).

I don't think you ever answered Michael's question about the amalgamated header vs. a submodule. Basically, there is the main SIMDe repository, which is pretty massive because of the test suite. There is also a simde-no-tests repo which is just the part of SIMDe designed to be used from other projects and works well as a submodule or subtree. Finally, there is a script we can use to generate a single large amalgamated header (basically in just concatenates a bunch of the SIMDe headers into a single file), which is what this patch currently uses.

If you're comortable using submodules (or subtrees) then the simde-no-tests repo is a great way to go since it makes updating your copy of SIMDe really straightforward. If you don't like submodules an amalgamated header works well, but updating it is slightly more involved.

Opal deals with the architecture itself: inside the opal.cpp, it checks the AVX2 and SSE4_1 macros, and chooses the proper SIMD commands. If none of these macros are defined, it still compiles (w/o warnings), but at the run-time simply returns an error when called.

There are a couple of options here. Do you have any code you could use for benchmarking a few different options? If so, you could first try always using the AVX2 code path. SIMDe will automatically use whatever ISA extension you target really supports, even if it's only SSE 4.1 or SSE2 (or NEON for that matter). There is often a bit of overhead when converting between 128- and 256-bit vectors, but depending on the code it might not be noticable.

You could just change your #ifdef __AVX2__ (or whatever it is) to #if 1 and compile with -msse4.1 and benchmark. Then, switch the preprocessor over so you use the SSE4.1 path instead and compile with -msse4.1 and benchmark. If there isn't a significant difference it might be best to just drop the SSE4.1 path and reduce your maintenance burden.

If there is a significant difference my suggestion would be to change those macros so that instead of switching based on __AVX2__ and __SSE4_1__ you use the SIMDE_NATURAL_VECTOR_SIZE macro so that if the target natively supports 256-bit vectors you use the AVX2 path, otherwise you use the SSE4.1 path.

Basically, instead of

#if defined(__AVX2__)
  use_avx2_implementation();
#elif defined(__SSE4_1__)
  use_sse4_1_implementation();
#else
  error();
#endif

You could just do:

#if SIMDE_NATURAL_VECTOR_SIZE >= 256
  use_avx2_implementation();
#else
  use_sse4_1_implementation();
#endif

Not sure if it's feasible, but I wonder if SIMDe can (optionally) use its own architecture target, independent of the compiler's -m... option. This would simplify retrofitting old codes for SIMDe.

It should be possible to do that, but I don't think it would really simplify the process. Rather the opposite, most likely :(

The logic for choosing which features to enable is in simde/simde-features.h. For example, if you're compiling with -mavx2 but don't want to enable AVX2 support, you could define SIMDE_X86_AVX2_NO_NATIVE to prevent SIMDe from using AVX2. However, you won't actually prevent the compiler from using AVX2, so if you're trying to make code compiled with -mavx2 runnable on targets without AVX2 support there is a good chance it won't work.

If you're talking about going the other way (having SIMDe call native AVX2 implementations when compiled without -mavx2), you could achieve this by explicitly define SIMDE_X86_AVX2_NATIVE and using function-specific option pragmas. Something like:

#define SIMDE_X86_AVX2_NATIVE
#pragma GCC push_options
#pragma GCC target("avx2")
#include <simde/x86/avx2.h>
#pragma GCC pop_options

But I don't think that would have the effect you want anyways. You still don't have dynamic dispatch, and the generated code depends on AVX2, so why not just use -mavx2?

I'd really suggest one of the options I mentioned above; most likely the SIMDE_NATURAL_VECTOR_SIZE thing, but just always using the AVX2 path is definitely worth a try if you have a good way to test it.

Is opal the only place you're using SIMD code? If so, it sounds like you have a pretty straightforward way to leverage the upcoming glibc-hwcaps support. Just compile a few different variations of opal as a library and let glibc choose the best implementation at runtime.

@mr-c
Copy link
Contributor Author

mr-c commented Mar 15, 2021

@alexdobin I updated the bundled SIMDe version to include @nemequ 's patch

@alexdobin
Copy link
Owner

Hi Evan, Michael,

thanks for the suggestions and update!

I am leaning towards taking the easiest path (for the users and myself), as you suggested: forcing opal to always choose AVX and SIMDe to do the translation for all the lower architectures. Even if it might be somewhat less efficient for SSE4.1, it's not going to affect the overall performance too much, since this operation is only ~5% of the overall run time.

For the same reason, I would stay with the amalgamated header for now. Submodules are nice but require an extra step of pulling submodules, which may be cumbersome for some users. I am thinking of making the submodule optional in the future, for the advanced users who want the most up-to-date SIMDe.

It will take me a couple of days to finalize the changes and test them, and then I will get back to you.
Thanks again for all your help!

Cheers
Alex

@alexdobin alexdobin merged commit 3950f51 into alexdobin:master Apr 27, 2021
@alexdobin
Copy link
Owner

Hi Evan, Michael,

I have finally tested the changes described above, modifying opal to always use avx2.
The patch is on GitHub master, it works fine on my servers.
I introduced CXXFLAGS_SIMD user-definable flag, which defaults to -mavx2 , but can be changed to match the target processor.

Thanks a lot for your help!
Cheers
Alex

@mr-c mr-c deleted the SIMDe branch April 28, 2021 09:15
@mr-c
Copy link
Contributor Author

mr-c commented Apr 28, 2021

You are very welcome @alexdobin !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants