
Tweak EntityHasher for more speed #10605

Closed · wants to merge 3 commits

Conversation

@scottmcm (Contributor) commented Nov 17, 2023

Objective

#9903 did a great job on EntityHasher, but seeing it in the 0.12 release notes nerd-sniped me and I found some μoptimizations to it.

(This was also the source of #10519.)

See also some Discord conversation in https://discord.com/channels/691052431525675048/1172033156845674507/1174887022914187264.

Solution

EntityHasher does two things today that are great, and I keep them both here:

  • being pass-through for H1 (assuming fewer than ≈3,758,096,384 entities in the table)
  • using multiplication to populate H2

(For H1 & H2, see https://youtu.be/ncHmEUmJZf4?t=1540)
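For readers who haven't seen that talk: here's a minimal sketch of the H1/H2 split in a swiss-table-style map (names are mine for illustration, not hashbrown's actual internals):

```rust
// Sketch only: hashbrown's real code differs, but the split is the same idea.
fn h1(hash: u64, bucket_mask: usize) -> usize {
    // H1: the low bits select which bucket/group to probe.
    (hash as usize) & bucket_mask
}

fn h2(hash: u64) -> u8 {
    // H2: the top 7 bits become the 1-byte control tag compared via SIMD.
    (hash >> 57) as u8
}
```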

But there are two things we can tweak:

Skip generation entirely

Today, the hasher bit-ors the generation into the high 32 bits of the hash. Unfortunately, H2 only cares about the high 7 bits of the result, so unless you get to generation 33,554,432 that doesn't actually do anything.
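A standalone way to see this (a sketch, not Bevy code): the generation sits in bits 32..63 of the hash, but only bits 57..63 reach H2, so a generation's bits are invisible to H2 until bit 25 of the generation is set.

```rust
// The generation's contribution to H2 stays zero until generation >= 1 << 25.
fn main() {
    for generation in [1u64, (1 << 25) - 1, 1 << 25] {
        let h2_contribution = (generation << 32) >> 57;
        println!("generation {generation:>10}: contributes {h2_contribution:#04x} to H2");
    }
}
```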

As a good demonstration of that, note that the benchmark that just looks up extant entities with the (probably-)wrong generation is actually faster with this change, despite being the one where including the generation in the hash should matter the most:
[screenshot: microbenchmark results]
(Note how the miss on generation is ≈25% slower than the miss on index, since the hash has a high probability of noticing the miss from the index before getting to == on the entities.)

And render-world clears things every frame, so if I'm understanding correctly it'll almost never have generation conflicts anyway.

By skipping the generation, the optimized codegen for the hash doesn't even bother to load it from the &Entity.

Phrase the hash with less shifting

Today, the hash does a 64-bit multiplication of the input with a 64-bit constant, but then shifts the result left by 32 bits, which means it actually throws away most of the multiplication. That's not actually a problem, since LLVM is clever and figures out what's going on, but it had me puzzled for a while.

That optimization inspired the approach here: rather than shifting and masking, just put that in the multiplication constant directly. And thanks to skipping the generation, it's a 32-bit input. So on both 64-bit and 32-bit, it only needs a single multiplication. (LLVM sees the particular form of the constant and lowers it in a smart way using a slightly different constant in the multiplication: https://rust.godbolt.org/z/nGWzhfcfe)

This leaves the hash, once some intermediate parts optimize away, as just

```rust
u64::from(e.index).wrapping_mul(0x9e37_79b9_0000_0001)
```
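To see why that one constant suffices, note that it's `(0x9e37_79b9 << 32) | 1`, so the product of a 32-bit index keeps the index itself in the low half and the mixed bits in the high half. A quick check:

```rust
// (0x9e37_79b9 << 32) | 1: the trailing 1 passes the index through to the low
// 32 bits (for H1); the upper half is index * 0x9e37_79b9 mod 2^32 (for H2).
fn hash_index(index: u32) -> u64 {
    u64::from(index).wrapping_mul(0x9e37_79b9_0000_0001)
}

fn main() {
    let h = hash_index(42);
    assert_eq!(h as u32, 42); // low half: the index, untouched
    assert_eq!((h >> 32) as u32, 42u32.wrapping_mul(0x9e37_79b9)); // high half: mixed
}
```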

That's It

And that's about the simplest possible thing that we could do while still trying to avoid H2 collisions.

This produces results identical to multiply-shift-or (proof: https://alive2.llvm.org/ce/z/orQoq9); LLVM just seemingly doesn't want to produce this form by itself (https://rust.godbolt.org/z/58avPcG1o).

Macrobenchmarks

TBH there's not much to see here. Some slightly better, some slightly worse, but I don't think any are particularly interesting -- I don't think I have consistent enough frame timings to see differences of a couple of instructions in hashing, especially when those instructions are fast bitops.

Yellow (this trace) is the new code and red (external trace) is the baseline.

cargo run --example bevymark --release --features bevy/trace_tracy -- --benchmark --waves 160 --per-wave 1000 --mode mesh2d --ordered-z
[screenshot: Tracy frame-time comparison]

cargo run --example bevymark --release --features bevy/trace_tracy -- --benchmark --waves 160 --per-wave 1000 --mode mesh2d
[screenshot: Tracy frame-time comparison]

cargo run --example many_cubes --release --features bevy/trace_tracy -- --benchmark
[screenshot: Tracy frame-time comparison]

cargo run --example bevymark --release --features bevy/trace_tracy -- --benchmark --waves 160 --per-wave 1000 --mode sprite --ordered-z
[screenshot: Tracy frame-time comparison]

cargo run --example bevymark --release --features bevy/trace_tracy -- --benchmark --waves 160 --per-wave 1000 --mode sprite
[screenshot: Tracy frame-time comparison]

cargo run --example many_buttons --release --features bevy/trace_tracy
[screenshot: Tracy frame-time comparison]


Changelog

Tweaked EntityHasher for a bit more speed in usual cases.

Migration Guide

(No breaking changes.)

@ItsDoot added the C-Performance and A-Utils labels on Nov 17, 2023
@@ -285,6 +285,20 @@ impl Hasher for EntityHasher {

#[inline]
fn write_u64(&mut self, i: u64) {
scottmcm (Contributor, Author) commented:

Demonstration of the overall assembly diff with this change: https://rust.godbolt.org/z/9zxfMdEfv

// of a conflict in the index despite a good hash function.
//
// This masking actually ends up with negative cost after optimization,
// since it saves needing to do the shift-and-or between the fields.
scottmcm (Contributor, Author) commented:

Demonstration of this by comparing the codegen without the masking to having the masking: https://rust.godbolt.org/z/3vvsT38ar

@scottmcm (Contributor, Author) commented:

There was a good conversation on Discord about whether it's ok to drop the generation. I've put some text in the comment about that, but to summarize the part that's specific to the previous implementation:

If you work through some simplifications on

```rust
fn write_u64(&mut self, i: u64) {
    // Apparently hashbrown's hashmap uses the upper 7 bits for some SIMD
    // optimisation that uses those bits for binning. This hash function
    // was faster than i | (i << (64 - 7)) in the worst cases, and was
    // faster than PassHasher for all cases tested.
    self.hash = i | (i.wrapping_mul(FRAC_U64MAX_PI) << 32);
}
```

you'll find out that it's the same as `u64::from(index.wrapping_mul(MAGIC) | generation) << 32 | u64::from(index)`. Or, further removing unnecessary detail, it's `u64::from(generation) << 32 | some_function_of(index)`.
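A quick property test of that equivalence (standalone; the identity holds for any constant, so the value of `c` below is arbitrary):

```rust
// Checks: i | (i.wrapping_mul(c) << 32)
//      == u64::from(index.wrapping_mul(c as u32) | generation) << 32 | u64::from(index)
// where i = (generation << 32) | index.
fn main() {
    let c: u64 = 0x517c_c1b7_2722_0a95; // any constant works for the identity
    for (index, generation) in [(12345u32, 7u32), (0, u32::MAX), (u32::MAX, 1)] {
        let i = u64::from(generation) << 32 | u64::from(index);
        let old = i | (i.wrapping_mul(c) << 32);
        let mixed = index.wrapping_mul(c as u32); // low 32 bits of index * c
        let simplified = u64::from(mixed | generation) << 32 | u64::from(index);
        assert_eq!(old, simplified);
    }
}
```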

So the generation is technically used, but since it's only in the high bits, what does it take for it to matter?

  • H1 looks at the low bits, so the generation wouldn't matter until you get to 2³³ buckets. hashbrown aims for a load factor of at most 87.5%, so it only grows past 2³² buckets once they're 87.5% full; you'd therefore need over 2³² × 87.5% = 3,758,096,384 items in the hash table for the generation to affect H1, which seems mostly unlikely.
  • H2 looks at the high 7 bits, so the generation doesn't affect it until you get to generation `1 << (32 - 7)` = 33,554,432.

To put those in perspective, that's either

  • a table spending 31.5 GiB of RAM on just the Entity keys (including the 1 byte of stored H2 metadata for each), or
  • a program that, assuming it bumps the generation on a particular index once per frame at 144 FPS, has been running for 2.7 days
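A quick sanity check of both numbers:

```rust
// Back-of-envelope arithmetic for the two thresholds above.
fn main() {
    let items: u64 = (1u64 << 32) * 7 / 8; // 87.5% of 2^32 buckets filled
    let bytes = items * (8 + 1); // 8-byte Entity key + 1 byte of H2 metadata
    println!("{items} items, {:.1} GiB", bytes as f64 / (1u64 << 30) as f64);

    let frames: u64 = 1 << 25; // generation bumps needed to touch H2
    println!("{:.1} days at 144 FPS", frames as f64 / 144.0 / 86_400.0);
}
```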

Thus it's fine for this change to ignore generation, since the previous one effectively was too.

@scottmcm scottmcm marked this pull request as ready for review November 18, 2023 09:01
@scottmcm scottmcm marked this pull request as draft November 18, 2023 20:27
@scottmcm (Contributor, Author) commented Nov 18, 2023

I need to re-run numbers for this after #10558

Looks like this'll need a new OP, since the new repr changes things. Closing this one; will open a new one with all new perf stuff and description.

@scottmcm scottmcm closed this Nov 18, 2023
@scottmcm scottmcm deleted the faster-hasher branch November 18, 2023 22:51
github-merge-queue bot pushed a commit that referenced this pull request Nov 28, 2023
# Objective

Keep essentially the same structure of `EntityHasher` from #9903, but
rephrase the multiplication slightly to save an instruction.

cc @superdump 
Discord thread:
https://discord.com/channels/691052431525675048/1172033156845674507/1174969772522356756

## Solution

Today, the hash is
```rust
        self.hash = i | (i.wrapping_mul(FRAC_U64MAX_PI) << 32);
```
with `i` being `(generation << 32) | index`.

Expanding things out, we get
```rust
i | ( (i * CONST) << 32 )
= (generation << 32) | index | ((((generation << 32) | index) * CONST) << 32)
= (generation << 32) | index | ((index * CONST) << 32)  // because the generation overflowed
= (index * CONST | generation) << 32 | index
```

What if we do the same thing, but with `+` instead of `|`? That's almost
the same thing, except that it has carries, which are often actually
better in a hash function anyway, since they don't saturate. (`|` can be
dangerous: once a value becomes all-ones (`-1`), it stays that way, and
there's no mixing available.)
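A tiny standalone illustration of that saturation point, before the derivation below:

```rust
fn main() {
    let (mut or_acc, mut add_acc) = (0u8, 0u8);
    for x in [0b1010_1010u8, 0b0101_0101, 0b1111_0000] {
        or_acc |= x; // bits can only turn on; once all-ones, it stays there
        add_acc = add_acc.wrapping_add(x); // carries keep shuffling bits
    }
    assert_eq!(or_acc, 0xFF); // saturated after the first two values already
    assert_eq!(add_acc, 0xEF); // 0xAA + 0x55 + 0xF0 mod 256
}
```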

```rust
(index * CONST + generation) << 32 + index
= (CONST << 32 + 1) * index + generation << 32
= (CONST << 32 + 1) * index + (WHATEVER << 32 + generation) << 32 // because the extra overflows and thus can be anything
= (CONST << 32 + 1) * index + ((CONST * generation) << 32 + generation) << 32 // pick "whatever" to be something convenient
= (CONST << 32 + 1) * index + ((CONST << 32 + 1) * generation) << 32
= (CONST << 32 + 1) * index + (CONST << 32 + 1) * (generation << 32)
= (CONST << 32 + 1) * (index + generation << 32)
= (CONST << 32 + 1) * (generation << 32 | index)
= (CONST << 32 + 1) * i
```

So we can do essentially the same thing using a single multiplication
instead of doing multiply-shift-or.
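A sketch of the resulting hasher (names and surrounding scaffolding are mine, not necessarily Bevy's exact source):

```rust
use std::hash::Hasher;

// Illustrative constant: (0x9e37_79b9 << 32) | 1. The trailing 1 keeps the
// low 32 bits pass-through; the upper half does the mixing.
const HASH_CONST: u64 = 0x9e37_79b9_0000_0001;

#[derive(Default)]
struct EntityHasherSketch {
    hash: u64,
}

impl Hasher for EntityHasherSketch {
    #[inline]
    fn write_u64(&mut self, i: u64) {
        // The whole hash: one wrapping multiply instead of multiply-shift-or.
        self.hash = i.wrapping_mul(HASH_CONST);
    }

    fn write(&mut self, _bytes: &[u8]) {
        unimplemented!("this hasher only expects an Entity's u64 via write_u64");
    }

    #[inline]
    fn finish(&self) -> u64 {
        self.hash
    }
}
```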

LLVM was already smart enough to merge the shifting into a
multiplication, but this saves the extra `or`:

![image](https://github.com/bevyengine/bevy/assets/18526288/d9396614-2326-4730-abbe-4908c01b5ace)
<https://rust.godbolt.org/z/MEvbz4eo4>

It's a very small change, and often will disappear in load latency
anyway, but it's a couple percent faster in lookups:

![image](https://github.com/bevyengine/bevy/assets/18526288/c365ec85-6adc-4f6d-8fa6-a65146f55a75)

(There was more of an improvement here before #10558, but with `to_bits`
being a single `qword` load now, keeping things mostly as it is turned
out to be better than the bigger changes I'd tried in #10605.)

---

## Changelog

(Probably skip it)

## Migration Guide

(none needed)