
Large number of page faults occurring in fn rav1d_submit_frame #1358

Open
ivanloz opened this issue Sep 13, 2024 · 6 comments

ivanloz commented Sep 13, 2024

(Copied from #1294 (comment) -- @kkysen for visibility, thanks!)

I've been looking into what the remaining sources of overhead might be and wanted to chime in with some of my findings.

One major discrepancy I noticed when running benchmarks between dav1d and rav1d was an order-of-magnitude difference in the number of madvise calls and page faults. I also saw more context switches in rav1d than in dav1d (likely related to the page faults?). This may explain at least some of the performance difference seen.

Digging into this, I found that the majority (~82%) of the page faults in rav1d come from rav1d::src::decode::rav1d_submit_frame. Specifically, the stack trace points to:

<alloc::boxed::Box<[rav1d::src::refmvs::RefMvsTemporalBlock]> as core::iter::traits::collect::FromIterator<rav1d::src::refmvs::RefMvsTemporalBlock>>::from_iter::<core::iter::adapters::map::Map<core::ops::range::Range<usize>, rav1d::src::decode::rav1d_submit_frame::{closure#1}>>

I believe that corresponds to this closure:

rav1d/src/decode.rs

Lines 5223 to 5227 in 7d72409

f.mvs = Some(
    (0..f.sb128h as usize * 16 * (f.b4_stride >> 1) as usize)
        .map(|_| Default::default())
        .collect(),
);

This is the equivalent operation in dav1d:

rav1d/src/decode.c

Lines 3623 to 3624 in 7d72409

f->mvs_ref = dav1d_ref_create_using_pool(c->refmvs_pool,
    sizeof(*f->mvs) * f->sb128h * 16 * (f->b4_stride >> 1));

Here dav1d_submit_frame allocates using dav1d_ref_create_using_pool, which calls into dav1d_mem_pool_pop and allocates from pooled memory (initialized in dav1d_mem_pool_init). This likely reduces the number of allocator calls.
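
For illustration, the pooling idea looks roughly like the minimal Rust sketch below (not dav1d's actual dav1d_mem_pool implementation): buffers go back to the pool when released and are handed out again on the next pop, so steady-state decoding stops hitting the system allocator, and faulting in fresh pages, for every frame.

use std::sync::Mutex;

// Minimal sketch of a memory pool; dav1d's real pool is structured differently.
struct MemPool {
    free: Mutex<Vec<Box<[u8]>>>,
}

impl MemPool {
    // Reuse a previously released buffer if one is big enough, otherwise fall
    // back to a fresh allocation. Note that a reused buffer is handed back
    // as-is, with its old contents, not re-zeroed.
    fn pop(&self, size: usize) -> Box<[u8]> {
        let mut free = self.free.lock().unwrap();
        match free.pop() {
            Some(buf) if buf.len() >= size => buf,
            _ => vec![0u8; size].into_boxed_slice(),
        }
    }

    // Return a buffer to the pool instead of freeing it.
    fn push(&self, buf: Box<[u8]>) {
        self.free.lock().unwrap().push(buf);
    }
}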

The switch away from pooled memory in rav1d looks to have been introduced as part of 6420e5a, PR #984.

@kkysen kkysen changed the title Large number of page faults occurring in rav1d_submit_frame Large number of page faults occurring in fn rav1d_submit_frame Sep 17, 2024

CrazyboyQCD commented Sep 19, 2024

We could use other allocators like jemalloc and mimalloc for comparison.
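
For example, a low-effort way to try that (a sketch, assuming the mimalloc crate, or tikv-jemallocator for jemalloc, is added as a dependency of the benchmark build) is to override the global allocator and rerun the benchmarks:

use mimalloc::MiMalloc;

// Route every heap allocation through mimalloc instead of the system
// allocator; tikv_jemallocator::Jemalloc can be swapped in the same way.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // Run the usual decode benchmark here; only the allocator changes.
}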


kkysen commented Sep 19, 2024

It's possible that if we just changed this to use an explicitly zeroed allocation, it will be optimized enough. If the performance of that is good, that's much easier than pooling it again and managing the lifetime for that.


kkysen commented Sep 19, 2024

The optimization Rust does for zeroed allocations is only done for types that implement IsZero, which is only implemented for the primitive types and for arrays of them. Thus, doing it for custom types won't work out of the box. We could, however, allocate zeroed u8s (or maybe a larger alignment primitive for more aligned types), align it, and then use FromZeroes to transmute it. There is also the issue of how we can initialize an Arc<[T]> this way, but I think we can figure that out.

@ivanloz, if you want to see if doing this for f.mvs fixed the performance issue you measured on your machine, that'd be great. We can just try a quick and dirty solution to see if it's worth pursuing first before trying to figure out a good API for it and doing it for all of our Vecs, AlignedVecs, Box<[_]>s, and Arc<[_]>s that are zero-initialized.
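
A rough sketch of that manual route (illustrative only, not existing rav1d code, and the helper name is made up): ask the allocator for already-zeroed, correctly aligned memory and treat it as a slice of T, with the FromZeroes bound justifying that the all-zeroes bit pattern is a valid T.

use std::alloc::{alloc_zeroed, handle_alloc_error, Layout};
use zerocopy::FromZeroes;

fn new_zeroed_box_slice<T: FromZeroes>(len: usize) -> Box<[T]> {
    assert!(std::mem::size_of::<T>() != 0, "sketch does not handle ZSTs");
    if len == 0 {
        return Vec::new().into_boxed_slice();
    }
    let layout = Layout::array::<T>(len).unwrap();
    unsafe {
        let ptr = alloc_zeroed(layout).cast::<T>();
        if ptr.is_null() {
            handle_alloc_error(layout);
        }
        // alloc_zeroed hands back zeroed memory (for large requests the kernel
        // can supply lazily-zeroed pages), so there is no per-element
        // initialization loop like the map(Default::default).collect() above.
        Box::from_raw(std::ptr::slice_from_raw_parts_mut(ptr, len))
    }
}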


kkysen commented Sep 19, 2024

zerocopy (if we enable the "alloc" feature) actually already has FromZeroes::new_{box_slice,vec}_zeroed that do this optimization, though more directly. It doesn't have an Arc one, though, as that's more complex, but it should be enough for testing I think.
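
For the quick test, the allocation behind f.mvs could then look roughly like this (assuming zerocopy 0.7 with the "alloc" and "derive" features, a #[derive(FromZeroes)] on RefMvsTemporalBlock, and a made-up helper name; the real f.mvs wraps the slice in further types, so this is only the shape of the change):

use crate::src::refmvs::RefMvsTemporalBlock; // would need #[derive(FromZeroes)]
use zerocopy::FromZeroes;

fn alloc_mvs_zeroed(sb128h: usize, b4_stride: usize) -> Box<[RefMvsTemporalBlock]> {
    // Goes through alloc_zeroed instead of the per-element Default/collect
    // loop that the page-fault profile pointed at.
    RefMvsTemporalBlock::new_box_slice_zeroed(sb128h * 16 * (b4_stride >> 1))
}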


ivanloz commented Sep 19, 2024

While I've got experience tracking down performance issues, I'm a bit of a novice when it comes to actually writing Rust. This sounds like it should be a relatively straightforward change to put together, but I haven't had much luck implementing it.

If you (or someone else) can provide the patch I'd be happy to test it out and report the results.


kkysen commented Sep 19, 2024

> While I've got experience tracking down performance issues, I'm a bit of a novice when it comes to actually writing Rust. This sounds like it should be a relatively straightforward change to put together, but I haven't had much luck implementing it.
>
> If you (or someone else) can provide the patch I'd be happy to test it out and report the results.

Sure, I'll work on it when I get some more time. If there's not much of a rush, we'll also probably want to wait for Arc::new_zeroed_slice (rust-lang/rust#129396) to stabilize. The similar new_uninit/new_uninit_slice functions were just recently stabilized (rust-lang/rust#63291), so hopefully there is still momentum to stabilize the zeroed versions sooner. Otherwise, we'll have to use Arc<Box<[T]>>s instead of Arc<[T]>s (in release mode) to get this zeroed-initialization optimization, as there's no other way to (safely/stably) create a zeroed Arc.
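
As a sketch of that interim workaround (assuming zerocopy's FromZeroes and a T that implements it; the helper is hypothetical): wrap the zeroed Box in an Arc rather than converting it into an Arc<[T]>, since Arc::from(Box<[T]>) copies the elements into the Arc's own allocation and would touch every page again.

use std::sync::Arc;
use zerocopy::FromZeroes;

fn shared_zeroed<T: FromZeroes>(len: usize) -> Arc<Box<[T]>> {
    // Arc::new only moves the Box (a pointer), so the zeroed slice itself is
    // never copied or written here. Once Arc::new_zeroed_slice
    // (rust-lang/rust#129396) stabilizes, this could return an Arc<[T]> directly.
    Arc::new(T::new_box_slice_zeroed(len))
}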

@kkysen kkysen self-assigned this Sep 19, 2024