Remove excessive floating-point divides #4312
base: main
Conversation
Loft the loop-invariant divide outside the hot loop, and/or invert the variable to turn FDIV into FMUL.
Do you have timing values from your tests?
Co-authored-by: Stefan Weil <sw@weilnetz.de>
It will be CPU specific, but I see +10% on my Ampere Altra.
That's a very significant improvement! I wonder how this ARM64 CPU compares to Intel and AMD CPUs for Tesseract recognition and training.
If there are standard tests that you run, please do share the results. I was using …
Does Ampere Altra offer additional opcodes which could be used to make Tesseract's neural network code faster? We currently use Neon code for ARM64 (see src/arch/*neon.cpp). |
You can run … Here are my results on a Mac mini M2 for running …
Shaves off 25% runtime on Ampere Altra running OCR using the tessdata_orig Russian language model with --oem 2.
After some wrangling, I was able to get the unit tests running on my machine. Here is a rollup of the tests which run longer than 1 ms total. I basically culled this out using …
Conform to style.
With the latest changes, I get +25% on this cmdline. I have attached the input image here (you need to uncompress it).
What does … When I run your test on …

On another host (a virtual machine with Ampere Altra) I also see no clear winner when running 2 × 3 tests: without the PR, 221–229 s; with the PR, 214–234 s.
```diff
+     T inv_prob_total = 1 / prob_total;
      for (int i = 0; i < n; i++) {
-       inout[i] /= prob_total;
+       inout[i] *= inv_prob_total;
```
Isn't this kind of optimization something which a good compiler should do automatically?
Although the proposed changes replace FP divides with FP multiplications, I could not reproduce the reported positive effect. Maybe others are luckier and can confirm the results, or I can reproduce them when I have more information.
@heshpdx: Is it with or without SIMD usage?
Good question. This was using the generic path, without intrinsics.
If …
I made a test on RPi4 (armv7l) with 32-bit Debian, gcc (Debian 12.2.0-14) 12.2.0.

```
$ time ./tesseract.main -l rus --tessdata-dir ./tessdata_orig --oem 2 math-ru.bmp math_out
real    6m26.522s
user    14m21.678s
sys     0m7.456s

$ time ./tesseract.4312 -l rus --tessdata-dir ./tessdata_orig --oem 2 math-ru.bmp math_out
real    6m26.177s
user    14m21.324s
sys     0m7.456s
```
Loft the loop-invariant divide outside the hot loops, and/or invert the variable to turn FDIV into FMUL.
Most CPUs are slower at FP division compared to FP multiplication. This should provide some uplift in performance. I was testing with the integer models.