Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(filecoin-proofs): speed up Fr32Reader #1341

Merged
merged 4 commits into from
Nov 9, 2020
Merged

Conversation

dignifiedquire
Copy link
Contributor

Switch to processing bit padding in blocks of 127bytes using u128s.

This also improves correctness, as now the reader will emit a full Fr32 at the end, not just a single byte with the additional bits, which is why some of the test code was updated.

Speed differences can be seen here, with the diffs being the same benchmarks run against master. https://gist.github.com/dignifiedquire/1c21680f237a8f3aa5ad43f0008ffa41

cryptonemo
cryptonemo previously approved these changes Nov 5, 2020
Copy link
Collaborator

@cryptonemo cryptonemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this earlier and it looks good, thanks for rebasing.

Switch to processing bit padding in blocks of 127bytes using u128s
Copy link
Collaborator

@porcuquine porcuquine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks good, and I can't see any definite problems. However, the naming throughout is confusing enough that I find it hard to follow the logic. In some cases, the naming actively confuses me because words seem to be used ambiguously. If you can clarify such cases by using less ambiguous naming and especially avoid cases where the same word is being used to mean entirely different things (if any such actually exist), that would be valuable. Although it seems like a small thing, once such naming gets locked in, it can be very expensive for future readers of the code to swap in the context required to understand it.

/// Are we done reading?
done: bool,
}

const NUM_U128S_PER_BLOCK: usize = 8;
const MASK_SKIP_HIGH_2: u128 = 0b0011_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather see this written in a way that can be seen correct by inspection, like:

u128::MAX - (0b11 << 126);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find that much harder to inspect to be honest

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really? It's easy for you to see at a glance that this is wrong?

0b0011_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neither of them is "easy" for me, but in thase case I simply count the 1s and know if it is good, but your notation requires me to think through bitshfiting rules, and how the subtraction operator is used on a bit level.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main point was to avoid counting the 1s — not the specific construction used to avoid it. Especially with an appropriate comment, I'm not averse to asking readers to think through bitshifting rules.

A simpler version which requires less thinking and still no counting would be:

const MASK_SKIP_HIGH_2: u128 =  u128::MAX >> 2;

Maybe that is the best of both worlds?

impl<R: io::Read> Fr32Reader<R> {
pub fn new(source: R) -> Self {
Fr32Reader {
source,
target_offset: 0,
buffer: Default::default(),
in_buffer: [0; NUM_U128S_PER_BLOCK * 16],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move the 16 into a named constant?

// 0..254
{
out_buffer[0] = in_buffer[0];
out_buffer[1] = in_buffer[1] & MASK_SKIP_HIGH_2;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhere, there should be a comment noting that u128 is 16 bytes, so Fr32 output is arranged in groups of 2 u128s — and here we we handle groups of 4 Fr32s because that is the period at which alignment repeats.

out_buffer[6] |= in_buffer[6] << 6; // low 122 bits
out_buffer[7] = in_buffer[6] >> 122; // top 6 bits
out_buffer[7] |= in_buffer[7] << 6; // low 120 bits
out_buffer[7] &= MASK_SKIP_HIGH_2; // zero high 2 bits
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could implement this with a macro which would make the logic entirely clear while still being efficient. The above is fine, just requires more effort to verify that what is written our here really does correspond to that notional 'macroexpansion'.

self.target_offset += 8;
return Ok(1);
Ok(n) => {
let tmp = buf;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is tmp needed? Might be clearer to just assign buf = &mut buf[n..] if that works.

pub fn read_u32(&mut self) -> u32 {
debug_assert!(self.available() >= 32);
self.process_block();
self.to_process += div_ceil(bytes_read * 8, 254);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This to_process quantity is very confusing. Maybe it's just the name. We just finished processing (process_block()), so I would not expect the amount 'to process' to increase. Can you rework the naming so what is actually being tracked is clearer?

Is the word 'process' actually referring to the same thing in these two names? If not, one should really be changed.

target[2] = value[2];
target[3] = value[3];
}
let start = read;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amplifying my comment when read was introduced, I think this is a confusing name. I think I would be less confused if it was something like target_read. As a reader, it's very hard to keep track of the context and semantics of these variables. If there's not a way to make it clearer by the structure, at least maybe more explicit names will help.

I think part of the confusion, too, is that in this context, you're writing to target but referring to the completed result as having read. I know that's what this method purports to do, just explaining how it creates confusion, giving that you also read from self.source into a buffer, then read from that buffer before 'reading' into target.

Maybe target_offset would be clearest?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it is very unfortunate that the past tense and the verb happen to be the same word here, as this is explicitly the past tense, maybe "bytes_read" would be better?

while read < num_bytes {
// load new block
if self.to_process == 0 {
let bytes_read = self.fill_buffer()?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fills in_buffer.

/// Currently read byte.
buffer: Buffer,
/// Currently read block.
in_buffer: [u8; NUM_U128S_PER_BLOCK * 16],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of 16, you could explicitly calculate with size_of to show that the in and out buffers hold the same number of bytes.

.copy_from_slice(&self.out_buffer.as_byte_slice()[out_start..out_end]);
read += len;
self.out_offset += len;
self.to_process -= div_ceil(len * 8, 256);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

div_ceil(len, 32) might be clearer. But this raises a question, if to_process is the number of blocks, and one block is 4 Fr, shouldn't this be div_ceil(len, 128)?

I can believe the actual algorithm is doing the right thing here, but I'm confused trying to line up all the names and descriptions throughout this whole changeset.

I think maybe 'block' in this context is being used to mean 'one Frs worth of bytes' — but it seems like in the definition of NUM_U128S_PER_BLOCK (where the value is 8), 'block' is being used to refer to 128 bytes = 4 Fr32s.

Am I missing something, or can some rectification of names help?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to process is in number of Frs sorry, which is 254 bits in, 256bits out

@dignifiedquire dignifiedquire merged commit 4e0594c into master Nov 9, 2020
@dignifiedquire dignifiedquire deleted the perf-fr32 branch November 9, 2020 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants