Fallible APIs for StringInterner for #70 #71

Open
wants to merge 7 commits into base: master

Conversation

royaltm commented May 27, 2024

Hi @Robbepop!

This is my take on issue #70.

This pull request is NOT READY to merge.
I haven't added any tests nor did I check whether new code breaks anything in the existing test suites.

Let's iterate this through here instead.


royaltm commented May 27, 2024

First, I haven't found any issues adding a fallible API to the Backend implementations (except for the capacity issue, below).

The most problematic change involves the dedup HashMap.

I'm also a bit concerned about the sanity of the with_capacity and the new try_reserve APIs, since the implementation relies on guesswork (estimating the average word size via Google); as with everything "on average", it will mostly be of no use in most particular cases.

What if instead we gave the user a way to specify the "average" string size? Or perhaps even a separate, additional capacity hint for the string data.
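Purely as an illustration, a rough sketch of what such a user-supplied estimate could feed into; the function name and the exact formula here are hypothetical, not part of this PR:

fn estimated_buffer_bytes(strings: usize, avg_str_len: usize) -> usize {
    // One var7 length marker (at most 10 bytes for a 64-bit usize) plus the
    // string bytes, per interned string; a deliberately rough upper bound.
    strings.saturating_mul(avg_str_len.saturating_add(10))
}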


Robbepop commented May 29, 2024

@royaltm thank you for the PR!

I think this PR generally goes into the right direction.

The RawMap API should have been implemented and used by the StringInterner long ago, but I didn't get to it. Maybe it would be nice to separate this change into another PR and base this PR on it. The idea is to no longer use hashbrown's RawEntry API and instead use its RawMap API for everything.

CI is currently unhappy with docs and formatting. Also a memory consumption test is broken because the number of expected allocations decreased. Can you explain why?

I am a bit sceptical about the usefulness of the reserve API for StringInterner in general after seeing the mess that it introduces. The reserve API was once useful when we still had the SimpleBackend. However, the SimpleBackend was removed because it was too inefficient and thus served no real purpose. Without it, the reserve API depends too much on heuristics and should probably be removed, or redesigned to be based not on the expected number of strings to intern but on the expected total bytes of all strings to be interned; even then it wouldn't be ideal imo.


royaltm commented May 29, 2024

The different number of allocations probably has something to do with my changes to the buffer backend, where I introduced the calculate_var7_size function so that I can reserve space for the next entry, including its var7 size marker, before pushing to the backend buffer.
Now the buffer backend is resized only once, before encode_var_usize is called and the string itself is pushed.
Previously the var7 size marker was pushed separately from the string itself, and each operation could possibly reallocate the backend buffer. But that's just my guess.
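To illustrate the idea, a minimal sketch (not the PR's exact code) of reserving for the var7 marker and the string in a single step, so that at most one reallocation can happen per pushed entry:

fn push_entry(buffer: &mut Vec<u8>, string: &str) {
    let len = string.len();
    // precise var7-encoded size of `len`, same formula as calculate_var7_size
    let bits = usize::BITS - (len | 1).leading_zeros();
    let marker_len = ((bits + 6) / 7) as usize;
    // the single point where a reallocation may occur
    buffer.reserve(marker_len + len);
    // encode the length, 7 bits per byte, least significant bits first
    let mut value = len;
    while value >= 0x80 {
        buffer.push((value as u8 & 0x7F) | 0x80);
        value >>= 7;
    }
    buffer.push(value as u8);
    buffer.extend_from_slice(string.as_bytes());
}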

As for the RawTable API, I'm not quite sure what advantage it would have over the current implementation. Perhaps just fewer intermediate function calls?

@Robbepop
Owner

> As for the RawTable API, I'm not quite sure what advantage it would have over the current implementation. Perhaps just fewer intermediate function calls?

I have read somewhere that the RawEntry API of the hashbrown crate was to be phased out in favor of the newer and more versatile RawMap type. Since you are now making use of the raw crate feature, I thought it would make sense to go all in with the change. It's not necessary, though; we can go ahead without it.


royaltm commented May 29, 2024

I suppose moving to the RawMap interface should be only an internal change. I propose we tackle it after the issue at hand.

Regarding the try_reserve and with_capacity interface, I agree that its usefulness is pretty questionable. For one thing, it's really hard to make predictions ahead of time, especially with the BucketBackend; what makes the problem even worse is that we can't shrink_to_fit the head String due to the known problems with reallocation.

Perhaps I'll remove the try_reserve calls from the API, or make them internal where they are still required, and we'll see where that gets us.

@Robbepop
Owner

> I suppose moving to the RawMap interface should be only an internal change. I propose we tackle it after the issue at hand.
>
> Regarding the try_reserve and with_capacity interface, I agree that its usefulness is pretty questionable. For one thing, it's really hard to make predictions ahead of time, especially with the BucketBackend; what makes the problem even worse is that we can't shrink_to_fit the head String due to the known problems with reallocation.
>
> Perhaps I'll remove the try_reserve calls from the API, or make them internal where they are still required, and we'll see where that gets us.

A try_reserve probably does not make sense for the Backend trait to implement since it is too backend specific; at the moment no backend has a proper, well-defined way to do it. For the StringInterner itself, try_reserve could still be useful for the hash map it uses to deduplicate strings; however, it is probably not much help in practice.


royaltm commented Jun 11, 2024

Hi @Robbepop!

Sorry for the delay; I've had a whirlwind of obligations to fulfill in the past weeks, and I'd rather not contribute at all than contribute haphazardly during my lunch break, as good and quick seldom meet.

  • I removed Backend::try_reserve from the trait and the backend implementations,
  • I left StringInterner::try_reserve in place, with a doc note about what exactly it does,
  • fixed the docs issues,
  • added a mention of try_get_or_intern to the root doc of StringInterner,
  • added tests for calculate_var7_size alongside each encode_var_usize test,
  • updated the number of (de)allocations in buffer_backend::test_memory_consumption.

If this meets your criteria, I don't think there's anything left to do within the scope of this PR.


royaltm commented Jun 20, 2024

Hi @Robbepop,

I've fixed the formatting errors; I'm not sure, however, what I can do to fix the tarpaulin issue.
Is there anything else on my end that needs to be done?

@Robbepop
Owner

I just ran benchmarks for this branch locally and compared them to the current main branch.

Unfortunately, I saw some quite significant performance regressions with this PR:

  • get_or_intern/fill-empty/new/: ~10% regression
  • get_or_intern/fill-empty/with_capacity/: ~20-30% regression
  • get_or_intern/already-filled/: ~5% regression
  • get_or_intern_static: BucketBackend improved by 5%, other backends regressed by 10%

- head: FixedString::with_capacity(cap),
+ head: FixedString::try_with_capacity(cap).unwrap(),
Owner

Not a big fan of using raw unwrap calls, but coming up with proper expect messages is not easy here, I see.
I usually try to use unwrap_or_else(|| panic!("...")) because it allows using scoped information in the message, and otherwise I use expect("...") with a proper message stating why this must not happen. However, here it is due to the guarantees of the API. 🤔
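For illustration, the pattern could look roughly like this here (the panic message is just an example, not something from the PR):

let head = FixedString::try_with_capacity(cap)
    .unwrap_or_else(|_| panic!("failed to allocate a FixedString with capacity: {cap}"));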

Author

Here I was actually assuming the error message from try_with_capacity would do the job.

Comment on lines +155 to +157
// self.spans.try_reserve(1)?;
// // this cannot fail, the previous line either returned or added at least 1 free slot
// let _ = self.spans.push_within_capacity(value);
Owner

We could also recursively call try_push_span again instead.

Author

Wouldn't that end up calling try_next_symbol twice?

With the new code, we try to reserve capacity eagerly:

        self.spans.try_reserve(1)?;
        self.spans.push(interned);

which unfortunately checks the capacity redundantly (the second check being inside push itself), which might impact performance. The whole point of using push_within_capacity is to try the push first, so there is only one capacity check in the optimistic scenario, and only when that fails does the code branch out to allocate.

Comment on lines +174 to +180
// FIXME: vec_push_within_capacity #100486, replace the following with:
//
// if let Err(value) = self.full.push_within_capacity(old_string) {
// self.full.try_reserve(1)?;
// // this cannot fail, the previous line either returned or added at least 1 free slot
// let _ = self.full.push_within_capacity(value);
// }
Owner

This seems to be the same code as above, so maybe we should introduce a helper method instead. Or?

Author

Sure, but that would need to be a generic function that takes the vector as an argument, because spans and full are vectors of different types. Something along the lines of the sketch below.
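A possible shape for such a helper (a sketch only; push_within_capacity is still unstable, so this assumes nightly with the vec_push_within_capacity feature):

fn try_push<T>(vec: &mut Vec<T>, value: T) -> Result<(), std::collections::TryReserveError> {
    if let Err(value) = vec.push_within_capacity(value) {
        vec.try_reserve(1)?;
        // cannot fail: try_reserve either returned an error above or made room for one more element
        let _ = vec.push_within_capacity(value);
    }
    Ok(())
}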

fn encode_var_usize(&mut self, value: usize) -> usize {
    encode_var_usize(&mut self.buffer, value)
fn try_encode_var_length(&mut self, length: usize) -> Result<()> {
    let add_len = length + calculate_var7_size(length);
Owner

I would rename it to something like len_encode_var_usize, which is very similar to encode_var_usize, to draw the parallel for the reader of the code.

Owner

Though I have to admit that it is a bit unfortunate that this encoding step is now two-phased.

Author

encode_var_usize pushes each byte separately; calling try_reserve every time a byte is encoded would drag the performance down even more. However, with push_within_capacity stabilized, that could be implemented differently.

An alternative solution is to encode the length into an array on the stack and then call extend_from_slice on self.buffer. This, however, can't be done efficiently without a bit of unsafe code (MaybeUninit) or would require using ArrayVec or a similar crate. A rough sketch of the idea follows.
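A minimal sketch of that alternative, using a plain zero-initialized array, which avoids MaybeUninit at the cost of a negligible upfront write (not the PR's code):

fn encode_var_usize_to_stack(buffer: &mut Vec<u8>, mut value: usize) -> usize {
    // 10 bytes are enough to var7-encode a 64-bit usize
    let mut chunk = [0u8; 10];
    let mut len = 0;
    while value >= 0x80 {
        chunk[len] = (value as u8 & 0x7F) | 0x80;
        value >>= 7;
        len += 1;
    }
    chunk[len] = value as u8;
    len += 1;
    // a single call that can reserve and copy in one go
    buffer.extend_from_slice(&chunk[..len]);
    len
}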

Comment on lines +156 to +157
/// According to google the approx. word length is 5.
const DEFAULT_STR_LEN: usize = 5;
Owner

We should really get rid of this thing entirely. My bad for having introduced them in the first place. 🙈

Author

By that, do you mean to remove with_capacity from the Backend altogether?

Comment on lines -161 to +172
/// We encode the `usize` string length into the buffer as well.
const LEN_USIZE: usize = mem::size_of::<usize>();
/// According to google the approx. word length is 5.
const DEFAULT_STR_LEN: usize = 5;
let bytes_per_string = DEFAULT_STR_LEN + LEN_USIZE;
// We encode the `usize` string length into the buffer as well.
let var7_len: usize = calculate_var7_size(capacity);
let bytes_per_string = DEFAULT_STR_LEN + var7_len;
Owner

As said in a prior comment, we should probably get rid of all the heuristics-based with_capacity methods and get this down to the basics. A Rust BTreeMap, for example, also does not provide a with_capacity method because it simply does not make sense to do so.

Author

I agree, it's rather pointless.

Comment on lines +210 to +218
/// Calculate var7 encoded size from a given value.
#[inline]
fn calculate_var7_size(value: usize) -> usize {
    // number of bits to encode
    // value = 0 would give 0 bits, hence: |1, could be anything up to |0x7F as well
    let bits = usize::BITS - (value | 1).leading_zeros();
    // (bits to encode / 7).ceil()
    ((bits + 6) / 7) as usize
}
Owner

Can you please explain this formula a bit better so that it does not look like magic to the reader?

Owner

Also: is this precise or an over approximation?

Author

This formula is precise and this is also reflected in tests.

Author

The formula goes like this:

  1. Take the binary size (bit length) of the original length
let bits = usize::BITS - (value | 1).leading_zeros();

For example, if your length is 42, which is 101010 in binary, the bit length is 6.
Assuming usize::BITS is 64, leading_zeros of 42usize is 58, and 64 - 58 = 6.

  2. If the binary size is 0, make it 1 (0 is the only case that breaks the formula, hence (value | 1)).

  3. The encoded size in bytes is the total number of encoded bits divided by 7 (each byte encodes only 7 bits of the original value), rounded up because the last byte's 7 bits might not be filled entirely: hence (bits + 6).

// (bits to encode / 7).ceil()
((bits + 6) / 7) as usize

For example, the bit length of 0x7F is 7, and (7 + 6) / 7 == 1.

The length is encoded as: 0b0111_1111

The bit length of 0x80 is 8, and (8 + 6) / 7 == 2.

The length is encoded as: 0b1000_0000 0b0000_0001
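For concreteness, a few values the formula yields (a quick sanity check against calculate_var7_size as defined above):

assert_eq!(calculate_var7_size(0), 1);
assert_eq!(calculate_var7_size(0x7F), 1);   // 7 bits  -> 1 byte
assert_eq!(calculate_var7_size(0x80), 2);   // 8 bits  -> 2 bytes
assert_eq!(calculate_var7_size(0x3FFF), 2); // 14 bits -> 2 bytes
assert_eq!(calculate_var7_size(0x4000), 3); // 15 bits -> 3 bytes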

Author

royaltm commented Jun 21, 2024

Let's take the original bit pattern of the length being encoded (each bit position is represented by a character):

ZYXWVUTSRQPONMLKJIHGFEDCBA9876543210 (bit-length: 36)

The encoded chunks end up being:

!6543210 !DCBA987 !KJIHGFE !RQPONML !YXWVUTS _______Z

Where ! and _ are 1 and 0 bit markers respectively.

As can be clearly seen, each group of 7 bits ends up in its own chunk, with the last chunk holding the remaining 1-7 bits:

(36 + 6) / 7 == 6

Now if we take:

YXWVUTSRQPONMLKJIHGFEDCBA9876543210 (bit-length: 35)

The encoded chunks end up being:

!6543210 !DCBA987 !KJIHGFE !RQPONML _YXWVUTS

(35 + 6) / 7 == 5

This process repeats every 7 bits.

The only exception is 0 itself, which, if we followed the formula precisely, should be encoded in exactly 0 bytes, because there are no bits to encode!

This leads us to the interesting observation that we could in fact extend the encoded var7 value by appending any number of redundant 0 bytes at the end, and the decoding formula (provided we had infinite integer capacity) would still work:

!6543210 !DCBA987 !KJIHGFE !RQPONML !YXWVUTS ________
!6543210 !DCBA987 !KJIHGFE !RQPONML !YXWVUTS !_______ ________
!6543210 !DCBA987 !KJIHGFE !RQPONML !YXWVUTS !_______ !_______ ________

and so on...

I hope that explains why 0 breaks the formula without actually really breaking it. What we really need is just 1 bit of information signalling the end of the data stream. A small decoder sketch below illustrates this.
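Here is such a sketch (not the crate's actual decoder; it assumes the value does not overflow a usize):

fn decode_var_usize(bytes: &[u8]) -> Option<(usize, usize)> {
    let mut value = 0usize;
    let mut shift = 0;
    for (i, &byte) in bytes.iter().enumerate() {
        value |= ((byte & 0x7F) as usize) << shift;
        if byte & 0x80 == 0 {
            // the decoded value plus the number of bytes consumed
            return Some((value, i + 1));
        }
        shift += 7;
    }
    None
}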

Comment on lines +160 to +161
// According to google the approx. word length is 5.
const DEFAULT_WORD_LEN: usize = 5;
Owner

Again: we should remove these heuristics entirely.

@@ -204,6 +224,13 @@ where
backend,
} = self;
let hash = make_hash(hasher, string.as_ref());
// this call requires hashbrown `raw` feature enabled
Owner

Please remove the superfluous comment.

Author

Sure, this was more of a message to you than a real code comment.

@Robbepop
Owner

All in all this looks good. The benchmark regressions need to be fixed before we can merge, though.


royaltm commented Jun 21, 2024

I'm beginning to think that without the stabilization of push_within_capacity we can't do much to improve the backends' fill performance, as we've ended up with an algorithm that checks the vector's capacity multiple times for a single push operation. Please see the comment above, #71 (comment), for more on this.

If I'm not mistaken, another area of improvement might be the raw hash-map department. With the latest changes I have also introduced redundant capacity checks in the try_get_or_intern_using function.


royaltm commented Jun 21, 2024

Also, while pondering the var-int buffer encoding issue: if we are going to address the "2-phased encoding step" and decide to try the single-step array encoding approach, I'd also suggest changing the var-int algorithm to a more efficient one.

I've created a gist with more info on the topic and playground examples:
https://gist.github.com/royaltm/f31ce5b1c8878e7c10ff7d335f3826e2

Please feel free to check it out at your leisure.


royaltm commented Jun 21, 2024

Perhaps a better approach is to use https://doc.rust-lang.org/alloc/vec/struct.Vec.html#method.spare_capacity_mut.

Something along the lines of:
https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=bd41840d6595d2787c2f60b86a67e84b
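For illustration, a rough sketch of what a spare_capacity_mut based encoder might look like (an assumption on my part; the actual playground code behind the link may differ):

fn encode_var_usize(buffer: &mut Vec<u8>, mut value: usize) -> usize {
    // a 64-bit usize needs at most 10 var7 bytes; make sure they fit up front
    buffer.reserve(10);
    let spare = buffer.spare_capacity_mut();
    let mut len = 0;
    while value >= 0x80 {
        spare[len].write((value as u8 & 0x7F) | 0x80);
        value >>= 7;
        len += 1;
    }
    spare[len].write(value as u8);
    len += 1;
    // SAFETY: exactly `len` bytes of the spare capacity were initialized above
    unsafe { buffer.set_len(buffer.len() + len) };
    len
}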

Although a lot of additional unsafety is introduced, the generated assembly looks neat:

There's a very tight loop at .LBB10_5 and a single capacity pre-check that branches to .LBB10_2 to make more room.

playground::encode_var_usize:
	push	r14
	push	rbx
	push	rax
	mov	rbx, rsi
	lea	rdx, [rsi + 64]
	mov	rax, qword ptr [rdi]
	mov	rsi, qword ptr [rdi + 16]
	sub	rax, rsi
	cmp	rax, rdx
	jb	.LBB10_2
	mov	rcx, qword ptr [rdi + 8]
	add	rcx, rsi
	xor	eax, eax
	cmp	rbx, 128
	jb	.LBB10_6

.LBB10_4:
	mov	rdx, rbx

.LBB10_5:
	mov	r8d, edx
	or	r8b, -128
	mov	byte ptr [rcx + rax], r8b
	inc	rax
	add	rdx, -128
	mov	rbx, rdx
	shr	rbx, 7
	cmp	rdx, 16383
	mov	rdx, rbx
	ja	.LBB10_5

.LBB10_6:
	mov	byte ptr [rcx + rax], bl
	lea	rcx, [rsi + rax]
	inc	rcx
	inc	rax
	mov	qword ptr [rdi + 16], rcx
	add	rsp, 8
	pop	rbx
	pop	r14
	ret

.LBB10_2:
	mov	r14, rdi
	call	alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
	mov	rdi, r14
	mov	rsi, qword ptr [r14 + 16]
	mov	rcx, qword ptr [r14 + 8]
	add	rcx, rsi
	xor	eax, eax
	cmp	rbx, 128
	jae	.LBB10_4
	jmp	.LBB10_6
