Add sse2 version of select #1804
Codecov Report
@@ Coverage Diff @@
## master #1804 +/- ##
==========================================
- Coverage 87.10% 87.09% -0.01%
==========================================
Files 936 936
Lines 47832 47855 +23
Branches 6009 6011 +2
==========================================
+ Hits 41662 41681 +19
- Misses 5178 5180 +2
- Partials 992 994 +2
Vector128<byte> cb0 = Sse2.SubtractSaturate(c0, b0);
Vector128<byte> ac = Sse2.Or(ac0, ca0);
Vector128<byte> bc = Sse2.Or(bc0, cb0);
Vector128<byte> pa = Sse2.UnpackLow(ac, Zero); // |a - c|
While vector creation on pre-.NET 6 runtimes compiles to a really bad sequence of scalar sets, an explicit Vector128<T>.Zero compiles to a single xor instruction, which would be a little clearer imo. It might also be a tiny bit faster than reading a static variable.
thanks @br3aker
FYI: default for the vector creates the same code as Vector128<T>.Zero (I prefer Vector128.Zero).
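To illustrate the suggestion in this thread, here is a minimal sketch that passes Vector128<byte>.Zero directly to Sse2.UnpackLow instead of reading a static Zero field. The class and method names (ZeroVectorSketch, WidenLow) are hypothetical and not part of the PR; the codegen claims are the reviewers' and are not verified here.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class ZeroVectorSketch
{
    // Hypothetical helper: widens the low 8 bytes of `value` to 16-bit lanes
    // by interleaving each byte with a zero byte.
    public static Vector128<ushort> WidenLow(Vector128<byte> value)
    {
        if (!Sse2.IsSupported)
        {
            throw new PlatformNotSupportedException("SSE2 is required for this sketch.");
        }

        // Vector128<byte>.Zero is used inline here rather than a static field.
        return Sse2.UnpackLow(value, Vector128<byte>.Zero).AsUInt16();
    }
}
```

Calling ZeroVectorSketch.WidenLow(Vector128.Create((byte)200)) on an SSE2-capable machine yields a vector whose 16-bit lanes each hold 200.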
LGTM 👍
Sse2.Store((ushort*)p, diff);
}

int paMinusPb = output[0] + output[1] + output[2] + output[3];
You can put this into the fixed block and access it via the pointer to avoid bounds checks*.
If output were too small, there would be a bug somewhere else 😉 (fortunately there's none).
* or reverse the order to read output[3] first, then [2], ... then there's only one bounds check.
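A rough sketch of the two options mentioned in this comment, assuming output is a Span<short> with at least four elements (the element type and the names BoundsCheckSketch, SumReversed and SumViaPointer are assumptions for illustration, not the PR's code):

```csharp
using System;

public static class BoundsCheckSketch
{
    // Option 1: read the highest index first. Once output[3] has been
    // range-checked, the JIT can typically prove that the lower indices
    // are in range as well and skip their checks.
    public static int SumReversed(Span<short> output)
    {
        return output[3] + output[2] + output[1] + output[0];
    }

    // Option 2: go through a pinned pointer, which performs no range checks
    // at all. The caller must guarantee output.Length >= 4.
    // Compiling this method requires AllowUnsafeBlocks.
    public static unsafe int SumViaPointer(Span<short> output)
    {
        fixed (short* p = output)
        {
            return p[0] + p[1] + p[2] + p[3];
        }
    }
}
```

Both methods return the same sum; the pointer version trades the safety net for no checks, which is why it only makes sense where the length is already guaranteed by the surrounding code.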
Ahhh, too late....
> or reverse the order to read output[3] first, then [2], ... then there's only one bounds check.
ah yeah, always forget about that trick, thx. Will do with a follow-up PR.
I'll put it (incl. the return) within the fixed block here.
Prerequisites
Description
This adds an SSE2 version of the method Select(), which is used during lossless encoding. It's a bit faster, but the gain is very specific to the image being encoded, since Select() only gets used in Predictor11.
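For context, a scalar sketch of what a Select() predictor computes, modeled on the select predictor of the WebP lossless format rather than copied from this PR. It sums |b - c| - |a - c| over the four 8-bit channels of packed 32-bit pixels, matching the paMinusPb sum in the diff; the return rule and the names SelectSketch and Sub3 are assumptions.

```csharp
using System;

public static class SelectSketch
{
    // |b - c| - |a - c| for a single 8-bit channel.
    private static int Sub3(int a, int b, int c) => Math.Abs(b - c) - Math.Abs(a - c);

    // a, b and c are packed 32-bit pixels (four 8-bit channels each).
    public static uint Select(uint a, uint b, uint c)
    {
        int paMinusPb =
            Sub3((int)(a >> 24), (int)(b >> 24), (int)(c >> 24)) +
            Sub3((int)((a >> 16) & 0xff), (int)((b >> 16) & 0xff), (int)((c >> 16) & 0xff)) +
            Sub3((int)((a >> 8) & 0xff), (int)((b >> 8) & 0xff), (int)((c >> 8) & 0xff)) +
            Sub3((int)(a & 0xff), (int)(b & 0xff), (int)(c & 0xff));

        // Assumed decision rule: keep a unless the summed difference is positive.
        return paMinusPb <= 0 ? a : b;
    }
}
```

The SSE2 version in this PR computes the same per-channel absolute differences with saturating subtractions and an OR, then sums the four lanes.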
TODO:
Before:
After with sse2: