Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support AVX2 in windows better #381

Merged
merged 1 commit into from
Jan 6, 2023

Conversation

Mike-Bell
Copy link
Contributor

@Mike-Bell Mike-Bell commented Jan 6, 2023

I spent quite a while trying to figure out why AVX2 wasn't enabled when I built this on Windows in my build system. Adding /arch:AVX2 to the C-compiler seems to fix it. I'm not building through Visual Studio but via a more customized build system (leveraging MSBuild), so that's likely why this isn't required for other people building on Windows, but since ggml.c is a C file (not C++), it seems like this should be correct and necessary in some cicrumstances.

@RndyP
Copy link

RndyP commented Jan 6, 2023

Thanks Mike! I had tried AVX2 and it worked, but was extremely slow. The JFK.wav sample took 17 seconds, where with AVX it takes 4.5 seconds. Using Visual Studio with AVX2 set under C/C++ Code Generation -> Enable Enhanced Instruction Set -> Advanced Vector Extensions 2 (/arch:AVX2) time went from 4.5 to about 3.6 seconds. Nice improvement.

@ggerganov ggerganov merged commit 41e05c6 into ggerganov:master Jan 6, 2023
@Mike-Bell Mike-Bell deleted the SupportWindowsAvx2Better branch January 6, 2023 17:46
@RndyP
Copy link

RndyP commented Jan 6, 2023

I did some research into what's going on here. One of the most critical sections of code is in ggml_vec_dot_f16(). AVX2 has fused multiply add instructions and AVX does not. So with AVX2 we have:
image
Without (just AVX) we have:
image
Note in the AVX2 case the disassembly shows the use of the fused intrinsics vfmadd231ps

@ggerganov
Copy link
Owner

ggerganov commented Jan 6, 2023

@RndyP
So recently, I was revisiting the SIMD code in ggml.c and I came to the realisation that we don't use AVX2 at all because all x86 intrinsics that we currently use are either AVX, FMA or F16C. This is based on lookup in the following site:

https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html

For example, the case that you refer to I think actually depends on wether FMA is supported or not:

whisper.cpp/ggml.c

Lines 483 to 487 in 87dd4a3

#if defined(__FMA__)
#define GGML_F32x8_FMA(a, b, c) _mm256_fmadd_ps(b, c, a)
#else
#define GGML_F32x8_FMA(a, b, c) _mm256_add_ps(_mm256_mul_ps(b, c), a)
#endif

So it is strange that AVX2 flags make any difference at all on Windows. My understanding is that AVX2 should not make a difference, but I might be missing something.

@RndyP
Copy link

RndyP commented Jan 6, 2023

I think what's happening is the Microsoft compiler is smart enough to convert the _mm256_add_ps(_mm256_mul_ps(b, c), a) to the fused assembly version, when the compiler option is set. The code above goes from 16 down to 12 vector instructions. 4.5 to 3.6 seconds is a nice improvement.

rock3125 pushed a commit to rock3125/whisper.cpp that referenced this pull request Feb 21, 2023
anandijain pushed a commit to anandijain/whisper.cpp that referenced this pull request Apr 28, 2023
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants