
Improve __clzsi2 performance #366

Merged (1 commit), Jul 28, 2020

Conversation

AaronKutch (Contributor)

This was something I originally made for PR 332, but I realized I should PR it by itself so I can remove it from my specialized-div-rem crate. As an example, the assembly generated for RISC-V used to be:

 addi    a1, zero, 64
 beqz    a2, .LBB202_2
 addi    a1, zero, 32
 add     a0, zero, a2
.LBB202_2:
 srli    a2, a0, 16
 beqz    a2, .LBB202_7
 addi    a1, a1, -16
 srli    a0, a2, 8
 bnez    a0, .LBB202_8
.LBB202_4:
 add     a0, zero, a2
 srli    a2, a0, 4
 bnez    a2, .LBB202_9
.LBB202_5:
 add     a2, zero, a0
 srli    a0, a2, 2
 bnez    a0, .LBB202_10
.LBB202_6:
 add     a0, zero, a2
 addi    a3, zero, 1
 addi    a2, zero, -2
 bgeu    a3, a0, .LBB202_11
 j       .LBB202_12
.LBB202_7:
 add     a2, zero, a0
 srli    a0, a2, 8
 beqz    a0, .LBB202_4
.LBB202_8:
 addi    a1, a1, -8
 srli    a2, a0, 4
 beqz    a2, .LBB202_5
.LBB202_9:
 addi    a1, a1, -4
 srli    a0, a2, 2
 beqz    a0, .LBB202_6
.LBB202_10:
 addi    a1, a1, -2
 addi    a3, zero, 1
 addi    a2, zero, -2
 bltu    a3, a0, .LBB202_12
.LBB202_11:
 neg     a2, a0
.LBB202_12:
 add     a0, a2, a1
 ret

but is now:

 srli    a1, a0, 32
 snez    a1, a1
 slli    a1, a1, 5
 srl     a0, a0, a1
 srli    a2, a0, 16
 snez    a2, a2
 slli    a2, a2, 4
 srl     a0, a0, a2
 or      a1, a1, a2
 addi    a2, zero, 255
 sltu    a2, a2, a0
 slli    a2, a2, 3
 srl     a0, a0, a2
 or      a1, a1, a2
 addi    a2, zero, 15
 sltu    a2, a2, a0
 slli    a2, a2, 2
 srl     a0, a0, a2
 or      a1, a1, a2
 addi    a2, zero, 3
 sltu    a2, a2, a0
 slli    a2, a2, 1
 srl     a0, a0, a2
 or      a1, a1, a2
 addi    a2, zero, 1
 sltu    a2, a2, a0
 srl     a0, a0, a2
 or      a1, a1, a2
 add     a0, a0, a1
 addi    a1, zero, 64
 sub     a0, a1, a0
 ret

This is completely branchless. In fact, I could make it const. A previous comment mentioned making __clzsi2 const, but intrinsics! doesn't seem to support const.
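For illustration, here is a rough Rust sketch of the branchless binary-search strategy that the new assembly corresponds to (this is not the exact code in the PR; the helper name clz64_branchless and the fixed 64-bit width are just for the example):

// Sketch of the branchless approach for a 64-bit input; the real intrinsic
// operates on usize behind __clzsi2.
pub fn clz64_branchless(mut x: u64) -> u32 {
    // Accumulates floor(log2(x)) as `x` is shifted down (stays 0 if x == 0).
    let mut log = 0u32;
    let mut step = 32u32;
    while step != 0 {
        // `(x >> step != 0) as u32` is 0 or 1, so `shift` is either `step` or 0;
        // the comparison lowers to sltu/snez-style code with no branch.
        let shift = ((x >> step != 0) as u32) * step;
        x >>= shift;
        log |= shift;
        step >>= 1;
    }
    // After the loop `x` is 0 or 1; 64 - (log + x) is the leading-zero count,
    // matching the final add/sub pair in the assembly above.
    64 - (log + x as u32)
}

The assembly above corresponds to this fully unrolled, with the multiply replaced by constant shifts.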

@AaronKutch (Contributor Author) commented Jul 10, 2020

What is up with i686-pc-windows-gnu builds spuriously failing on me? I have encountered this multiple times before.

@alexcrichton (Member)

Thanks! Would you be up for adding a microbenchmark and also expanding the test suite for this intrinsic at the same time?

@AaronKutch (Contributor Author)

I don't see any benchmarks in compiler-builtins. Where would I put a microbenchmark?

@bjorn3 (Member) commented Jul 14, 2020

ones = !0 >> r0;
let r1: u32 = bit_indexing_mask & random::<u32>();
let mask = ones.rotate_left(r1);
match (random(), random()) {

Tests should be deterministic. Otherwise they could fail one time and succeed another time.

AaronKutch (Contributor Author):

Could I use a seedable RNG like rand_xoshiro::Xoroshiro128StarStar?

AaronKutch (Contributor Author):

I am also using this fuzzing technique in PR #332.

Member:

I guess you could.

Member:

Randomized testing is fine in my opinion but this should also print out the input on failure so if it happens in CI it can be reproduced.
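A minimal sketch of that setup, assuming the testcrate can depend on rand_xoshiro and that __clzsi2 is importable from compiler_builtins::int (both assumptions for the example):

use rand_xoshiro::rand_core::{RngCore, SeedableRng};
use rand_xoshiro::Xoroshiro128StarStar;

#[test]
fn clzsi2_fuzz() {
    // assumed import path for the intrinsic under test
    use compiler_builtins::int::__clzsi2;

    // Fixed seed: the run is deterministic, yet still covers many values.
    let mut rng = Xoroshiro128StarStar::seed_from_u64(0);
    for _ in 0..10_000 {
        let x = rng.next_u64() as usize;
        // The message prints the failing input so a CI failure can be
        // reproduced locally.
        assert_eq!(
            __clzsi2(x) as u32,
            x.leading_zeros(),
            "__clzsi2 mismatch for input {:#x}",
            x
        );
    }
}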

Member:

I wonder if something like https://docs.rs/quickcheck/0.9.2/quickcheck/ is something we could consider for tests?

for _ in 0..63 {
assert_eq!(__clzsi2(i) as u32, i.leading_zeros());
i <<= 2;
i += 1;
Member:

Can these tests be left in because there's presumably no harm in deterministically trying to catch edge cases?

@alexcrichton (Member)

There are also examples of benchmarks on #365, but the general idea is that if this is done for performance reasons it would be good to confirm that the performance did indeed improve.
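For context, a microbenchmark in the testcrate might look roughly like the sketch below (nightly-only #[bench] via the test crate; the file location, input set, and import path are assumptions):

#![feature(test)]
extern crate test;

use compiler_builtins::int::__clzsi2; // assumed path
use test::{black_box, Bencher};

#[bench]
fn bench_clzsi2(b: &mut Bencher) {
    // A spread of small, medium, and large inputs; black_box keeps the
    // compiler from constant-folding the calls away.
    let inputs: [usize; 4] = [0, 1, 0xF0F0_F0F0, usize::MAX];
    b.iter(|| {
        for &x in inputs.iter() {
            black_box(__clzsi2(black_box(x)));
        }
    });
}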

@AaronKutch (Contributor Author) commented Jul 14, 2020

I'm confused. When I look at this commit, I can see benches being added under a testcrate/benches folder. This folder doesn't appear to exist in the GitHub browser or in my local clone. When I search for #[bench], I don't see benchmarks anywhere in compiler-builtins. Is there some kind of git submodule magic happening here? edit: Are my PR and #365 going to be the first PRs to ever add benchmarks to compiler-builtins?

@nagisa (Member) left a comment

The implementation looks good to me.


@nagisa (Member) commented Jul 15, 2020

> There are also examples of benchmarks on #365, but the general idea is that if this is done for performance reasons it would be good to confirm that the performance did indeed improve.

IMO, looking at something like llvm-mca output for cases like this gives significantly more information and is a more rigorous way to evaluate the performance of functions in compiler-builtins. So, while I think benchmarks would be cool, I'd also be fine with accepting this with just an evaluation from llvm-mca or a similar function latency/throughput analysis for targets that might use this function.

@nagisa (Member) commented Jul 15, 2020

I did some llvm-mca spelunking:

                     uOps Per Cycle   IPC    Block RThroughput
OLD x86_64 znver1    3.05             3.05   9.3
NEW x86_64 znver1    1.48             1.48   10.0
OLD Apple-A13        1.58             1.58   7.8
NEW Apple-A13        1.10             1.10   11.5
OLD Cortex-A73       1.66             1.66   15.0
NEW Cortex-A73       1.22             1.22   14.5

(lower Block RThroughput is better and is probably the best indication of real-world performance you can expect)

With the caveat that llvm-mca apparently only supports superscalar CPUs, on which this implementation is not really relevant, the old version is better in 2 out of 3 cases.

A cursory look suggests that both the new and the old implementation are pretty good about their register dependencies and don't spend much time overall waiting on anything; LLVM just ends up generating fewer instructions on targets that happen to have a "conditional set/move" kind of instruction. For the tested targets the assembly also ends up non-branchy with either the new or the old implementation, which makes sense since the generated code uses conditional moves.

For example, here's the assembly for MIPS:

Old
	srl	$1, $4, 16
	addiu	$2, $zero, 16
	addiu	$3, $zero, 32
	movz	$2, $3, $1
	addiu	$3, $2, -8
	movz	$1, $4, $1
	srl	$4, $1, 8
	movz	$3, $2, $4
	addiu	$2, $3, -4
	movz	$4, $1, $4
	srl	$1, $4, 4
	movz	$2, $3, $1
	addiu	$3, $zero, -2
	addiu	$5, $2, -2
	movz	$1, $4, $1
	srl	$4, $1, 2
	movz	$5, $2, $4
	movz	$4, $1, $4
	negu	$1, $4
	sltiu	$2, $4, 2
	movz	$1, $3, $2
	jr	$ra
	addu	$2, $1, $5
New
	ori	$1, $zero, 65535
	sltu	$1, $1, $4
	sll	$1, $1, 4
	srlv	$2, $4, $1
	addiu	$3, $zero, 255
	sltu	$3, $3, $2
	sll	$3, $3, 3
	srlv	$2, $2, $3
	or	$1, $1, $3
	addiu	$3, $zero, 32
	addiu	$4, $zero, 1
	addiu	$5, $zero, 3
	addiu	$6, $zero, 15
	sltu	$6, $6, $2
	sll	$6, $6, 2
	srlv	$2, $2, $6
	sltu	$5, $5, $2
	sll	$5, $5, 1
	srlv	$2, $2, $5
	sltu	$4, $4, $2
	srlv	$2, $2, $4
	or	$1, $1, $6
	or	$1, $1, $5
	or	$1, $1, $4
	addu	$1, $1, $2
	jr	$ra
	subu	$2, $3, $1

So the overall conclusion, at least from me, is that if there's any improvement with the new implementation, it's probably going to be limited to RISC-V...

(though also remember that targets like x86_64 and aarch64 have no use for this builtin as they have a native instruction)

@AaronKutch (Contributor Author)

Thank you very much for the spelunking. I looked at RISC-V specifically because it was giving me a hard time when optimizing the division functions. It gave such an assembly improvement on RISC-V, and only slightly degraded local benchmarks on an out-of-order machine, that I assumed it would be the most performant for in-order CPUs in general and especially RISC-V. I thought I remembered testing the assembly for ARM as well, but looking at it again, the old version is significantly better there too. I think we should have both implementations, but enable the new one only on RISC-V. What do you think?

@nagisa (Member) commented Jul 15, 2020

Yeah, I think it's fine to conditionally switch between implementations in compiler-builtins. It's a library that's used everywhere, after all.

@alexcrichton (Member)

If we do have multiple implementations, however, I'd like to ensure that they're implemented in such a way that they're both tested thoroughly, because otherwise it's very easy to end up with a RISC-V-only bug.

@AaronKutch (Contributor Author) commented Jul 17, 2020

I think the code is now large enough to move the __clzsi2 function into a leading_zeros.rs file instead of leaving it as an exceptional case in mod.rs, but it seems the test crate cannot see the new file, despite the docs clearly showing one. I presume compiler-builtins is doing something hacky with its exports. I guess we don't want a breaking change to solve such a small inelegance anyway.

My current plan for testing both variations of counting leading zeros thoroughly is to have two functions, usize_leading_zeros_default and usize_leading_zeros_riscv, make __clzsi2 conditionally use one or the other, and add a feature flag the testcrate can set in order to see both underlying functions.
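As a rough sketch of what the testcrate side of that plan could look like (the module path and the input values here are assumptions):

// Only compiled when the testcrate enables the expose-alt-impls feature, so
// both underlying functions can be checked against each other and against
// core's leading_zeros.
#[cfg(feature = "expose-alt-impls")]
#[test]
fn leading_zeros_impls_agree() {
    use compiler_builtins::int::leading_zeros::{
        usize_leading_zeros_default, usize_leading_zeros_riscv,
    };
    for &x in &[0usize, 1, 2, 3, 0xFF, usize::MAX >> 7, usize::MAX] {
        assert_eq!(usize_leading_zeros_default(x), usize_leading_zeros_riscv(x));
        assert_eq!(usize_leading_zeros_default(x) as u32, x.leading_zeros());
    }
}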

Would a better solution be having a CI test for RISC-V? I read this post that says "the full Rust test suite passes for a RISC-V cross-compilation toolchain". I see a wide variety of CI tests being performed currently; I wonder why RISC-V isn't included.

edit: RISC-V GNU/Linux is being added as a host platform soon.

@alexcrichton (Member)

I'm not sure what's up with the modules; nothing's really that fancy in this crate, and it should in general act like the rest of Rust's module system. Are you sure you re-exported the intrinsics in mod.rs?

For testing, adding RISC-V should be fine.

> I wonder why RISC-V isn't included.

This crate is old. RISC-V targets didn't exist when this crate was created. No one's added it since then.

@AaronKutch force-pushed the better-leading-zeros branch 2 times, most recently from 3a22529 to 7f00717 on July 22, 2020
@AaronKutch (Contributor Author)

What do you think about my testing setup? I don't know if there is a better solution. Does expose-alt-impls look like a flag that could be used similarly by other specialization tests?

@alexcrichton (Member) left a comment

Looks reasonable to me, thanks! I've got a few minor comments but otherwise this seems good to go.

Can you also add comments explaining why the RISC-V version isn't used on other platforms?

}

#[cfg(any(target_arch = "riscv32", target_arch = "riscv64"))]
intrinsics! {
Member:

To avoid the duplication here could this perhaps use if cfg!(...) internally?

AaronKutch (Contributor Author):

Yes
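For instance, a single definition along these lines would avoid duplicating the intrinsics! block (a sketch following the review suggestion; the exact attributes and doc comment in the crate may differ):

intrinsics! {
    /// Returns the number of leading zeros in `x`.
    pub extern "C" fn __clzsi2(x: usize) -> usize {
        // cfg! evaluates to a plain bool, so the unused branch is optimized
        // out per target without a second copy of the macro invocation.
        if cfg!(any(target_arch = "riscv32", target_arch = "riscv64")) {
            usize_leading_zeros_riscv(x)
        } else {
            usize_leading_zeros_default(x)
        }
    }
}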

src/int/mod.rs Outdated
mod leading_zeros;

#[cfg(feature = "expose-alt-impls")]
pub mod leading_zeros;
Member:

I think it'd be fine to just always export this module; there shouldn't be a need to make it #[cfg] pub.

AaronKutch (Contributor Author):

Making it always pub will expose the usize_leading_zeros... functions by default, but the only reason they should ever be public is for testing purposes.

Member:

I think exporting that is fine; there are many symbols exported from this library other than the intrinsics, due to traits and such, so we don't need to be super strict about what exactly is exported.

AaronKutch (Contributor Author):

I have made it always public and removed the unneeded configuration.
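So the declaration in src/int/mod.rs presumably ends up as just:

// Always public; the alternative implementations are mainly interesting for
// testing, but exporting them unconditionally keeps the cfg surface small.
pub mod leading_zeros;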

@AaronKutch (Contributor Author)

I found out that u128::leading_zeros has a huge code size. This PR will help, but I think there is a problem upstream, wherever it is defined. The assembly of

pub fn lz(x: u128) -> usize {
    let mut x = x;
    let mut z = 0;
    if ((x >> 64) as u64) == 0 {
        // upper 64 bits are all zero, so they contribute 64 leading zeros
        z += 64;
    } else {
        // bring the upper 64 bits down so the final count looks at them
        x >>= 64;
    }
    z + x.leading_zeros() as usize
}

is huge (much larger than it should be even before this PR) but changing the last line to be

   z + (x as u64).leading_zeros() as usize

fixes it. I think there is a similar problem somewhere.
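For comparison, a sketch of the same function with the cast applied and the high/low halves handled explicitly (not necessarily what core does internally):

pub fn lz_fixed(x: u128) -> usize {
    let hi = (x >> 64) as u64;
    if hi != 0 {
        // Leading zeros come entirely from the upper half.
        hi.leading_zeros() as usize
    } else {
        // Upper half is zero: 64 plus the leading zeros of the lower half.
        64 + (x as u64).leading_zeros() as usize
    }
}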

@AaronKutch force-pushed the better-leading-zeros branch 2 times, most recently from f2dcdcf to 4a976e0 on July 28, 2020