-
-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minor improvement on the dataLength % 3
.
#114
Merged
Merged
Changes from 1 commit
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
ffb75a0
Minor improvement on the `dataLength % 3`.
ycrumeyrolle dd8ea7c
Update Base64UrlEncoder.Encode.cs
ycrumeyrolle 8b29e0a
Update source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncode…
ycrumeyrolle 2e74e9b
Apply suggestions from code review
ycrumeyrolle e376b94
FIx test assertion failure
ycrumeyrolle e566b68
Inverse method call for throwing the ArgumentOutOfRangeException as s…
ycrumeyrolle 27f487c
Update source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncode…
gfoidl ae21697
Update source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncode…
gfoidl File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if the JIT can emit
cmov
here, so it gets branchless. Unfortunately there is no "trick" to force the JIT to do so, except some very obscure hacks (which we shouldn't use here).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious to know the obscure hacks...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactely a
cmov
(which would be desired), but still a branchless version with some bit-twiddling and the knowledge thattrue / false
is represented as1 / 0
-> snippet in sharplabIt the stack isn't used, it would be even better.
The main advantage will be when the branch-predictor can't work reliable due to "random" inputs, i.e. unknown patterns for
mod3 == 0
. Then this approach may gain some perf.Contrast this with clang, so it could be pretty concise branchless code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting: if the
bool
is "casted" tobyte
instead ofint
the stack-usage goes away and code looks pretty good.Can you give this a try with the above benchmark? If it has potential to be faster, we should go with such a approach (although state otherwise above 😉).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried with the
BranchlessSelect
casted as byte. The results are interesting:The 'branchless' is not as efficient as expected .
When reverting the branch
BranchlessSelect(mod3 != 0, 3 - mod3, 0)
instead ofBranchlessSelect(mod3 == 0, 0, 3 - mod3)
, the result are better but still under the branched one.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the numbers. 👍
Some thoughts -- but I don't know what matches here...
The effect of
cmov
et.al. depends on the distribution of taken / not taken branches, i.e. on how effecient the branch predictor can work. If the branch predictor does a good job, then the "branchy code" may be more efficient, as the pipeline can speculatively process the following steps, whereas withcmov
the rest of the pipeline needs to wait for the result ofcmov
.The code for the branchless version is quite a bit, so it may be more efficient just to use simple and compact code with a branch -- which can be predicted, as in the benchmarks where for constant input length the result will always be the same.
I'm pretty sure that there will be a benchmark constellation where the branchless version will be quite a bit faster.
In effect I think we should go with the code as is. It's easy and not hacky...and as the numbers show quite good. Furthermore I doubt a slight improvement here will show any effect on overall perf.
Once the runtime (JIT) will detect a pattern like
return cond ? a : b
and emit better code (maybe with tiering) we'll gain the perf-win for free.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More or less just for fun I tried a vectorized approach to make it branchless. I show it here because you are curios 😉
Benchmark is laid out so that (hopefully) the branch-predictor won't do a good job.
Benchmark-code
Codegen