
Benchmark is not an apples to apples comparison (sse4.2) #85

Open
errantmind opened this issue Feb 26, 2021 · 8 comments

@errantmind

Hello, first, thanks for making this tool.

I wanted to point out that your benchmark is a bit unfair: it compares httparse with SSE4 against picohttpparser without it. The reason picohttpparser doesn't have SSE4 is that your dependency 'pico-sys' does not compile picohttpparser with SSE4 enabled.

Your benchmark showed a ~60% performance improvement for 'bench_pico' once SSE4 was enabled in the underlying crate.

I forked the underlying crate 'pico-sys' and made a few modifications if you want to verify my results:

https://github.com/errantmind/rust-pico-sys
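Before comparing builds, it may be worth confirming that the benchmark host actually advertises SSE4.2. A quick Linux-only sketch (it reads /proc/cpuinfo, so it won't work on other platforms):

```shell
# Print whether the CPU advertises the sse4_2 flag (Linux only)
if grep -qm1 'sse4_2' /proc/cpuinfo; then
    echo "sse4_2: supported"
else
    echo "sse4_2: not supported"
fi
```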

@errantmind errantmind changed the title benchmark.. apples to apples Benchmark is not an apples to apples comparison (sse4.2) Feb 26, 2021
@errantmind (Author)

I've also been experimenting with getting inlining to work across the FFI boundary, and succeeded using Rust's 'linker-plugin-lto', clang-12, and lld-12. This improved the pico benchmarks a little more and put both of them in the lead, with the full pico benchmark hitting ~2900 MB/s vs httparse at 1751 MB/s on my ancient laptop.
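For reference, cross-language LTO like this is driven entirely by build configuration. A minimal sketch of the flags involved (my assumption of the setup, not a config taken verbatim from this thread; it assumes clang-12 and lld are installed, and that the C side is compiled with -flto=thin by the same clang major version):

```toml
# ~/.cargo/config.toml -- sketch; linker names and versions are assumptions
[target.x86_64-unknown-linux-gnu]
rustflags = [
    "-Clinker-plugin-lto",    # emit LLVM bitcode so LTO can inline across the FFI boundary
    "-Clinker=clang-12",
    "-Clink-arg=-fuse-ld=lld",
]
```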

@seanmonstar (Owner)

Ah yea, good point. Originally httparse didn't have SIMD support either, so the comparison was more even.

@errantmind (Author)

I haven't looked all that far into it, but I'm interested in your thoughts on why pico is faster. Is it doing some memory-management tricks or something? I'm working on a pet project and am trying to figure out if I should just write it in C, or if there is a way to get comparable results with unsafe Rust.

@seanmonstar (Owner)

How do you run the Rust benchmarks? Do you set the target CPU so it doesn't have to do runtime checks? https://rust-lang.github.io/packed_simd/perf-guide/target-feature/rustflags.html
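As background, here's a minimal sketch (illustrative only, not httparse's actual dispatch code) of the kind of runtime feature check that `-Ctarget-cpu`/`-Ctarget-feature` lets the compiler resolve statically:

```rust
// Runtime CPU-feature dispatch on x86_64. Without -Ctarget-feature=+sse4.2
// the branch below must be evaluated at runtime; with the flag, the compiler
// already knows the feature is present and can pick the fast path statically.
fn chosen_path() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.2") {
            return "sse4.2";
        }
    }
    "scalar fallback"
}

fn main() {
    println!("parser would take the {} path", chosen_path());
}
```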

@errantmind (Author)

errantmind commented Feb 28, 2021

I run these flags globally in my config.toml:

rustflags=["-Ctarget-cpu=native","-Ctarget-feature=+sse4.2"]

@errantmind (Author)

errantmind commented Mar 1, 2021

I'm going to dump some info here for reproducibility purposes.

The speed improvements came primarily from two areas, both of which involved modifying the underlying pico bindings crate:

  1. Modify the underlying crate to compile with SSE4.2 and LTO (add -msse4 and -flto=thin to the cc compile command).
  2. Check the LLVM version and current host triple with rustc --version --verbose; the host triple is needed later.
  3. Install the LLVM version used by rustc (the current nightly uses LLVM 11). On Ubuntu 20.04 you can install the needed binaries with sudo apt-get install clang-11 lld-11.
  4. Make cc use clang by setting export CC=/usr/bin/clang-11 (adjust the path as needed for your distribution).
  5. Set the appropriate rustflags in ~/.cargo/config.toml, using the host triple from above. For me this is:

[target.x86_64-unknown-linux-gnu]
rustflags = [
   "-Ctarget-cpu=native",
   "-Clink-arg=-fuse-ld=lld",
   "-Clinker=clang-11",
]

  6. Clean up the benchmark project if needed with cargo clean && rm Cargo.lock.
  7. Run cargo bench in the benchmark crate.
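Condensed into commands, the sequence above looks roughly like this (an illustrative transcript; Ubuntu 20.04 package names and paths are assumed, so adjust the clang version and location for your distribution):

```shell
# Check which LLVM version rustc was built with, and note the host triple
rustc --version --verbose      # look at the "LLVM version:" and "host:" lines

# Install a matching clang/lld toolchain
sudo apt-get install clang-11 lld-11

# Make the cc crate compile the C side with clang
export CC=/usr/bin/clang-11

# (rustflags go in ~/.cargo/config.toml as shown in step 5 above)

# Rebuild from scratch and benchmark
cargo clean && rm Cargo.lock
cargo bench
```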

Full cc command from Pico bindings crate:

cc::Build::new()
        .file("extern/picohttpparser/picohttpparser.c")
        .opt_level_str("fast")
        .flag("-funroll-loops")
        .flag("-msse4")
        .flag("-flto=thin")
        .flag("-march=native")
        .compile("libpicohttpparser.a");

@errantmind (Author)

I've updated the comment above, as the steps it originally described were incorrect; the steps as now written work as expected. Here are the results of my latest test:

[benchmark results screenshot omitted]

@errantmind (Author)

errantmind commented Mar 3, 2021

Alright, the adventure is coming to an end with this final update:

  • Cargo automatically adds the linker-plugin-lto flag when building certain kinds of crates, like my sys crate in this example. This can be verified by passing --verbose (i.e. cargo build --release --verbose)
    • It appears unnecessary to build the whole dependency chain with the linker-plugin-lto flag; only the sys crate needs it (which is automatic). If all dependencies are built with linker-plugin-lto, there is actually a loss of about 5% performance
  • A cargo-wide config (e.g. ~/.cargo/config.toml) is overridden by setting RUSTFLAGS
  • A cargo-wide config overrides a cargo config local to a project (e.g. <project>/.cargo/config.toml)
  • clang-12 is significantly (~5%) faster than clang-11 for the pico tests, for some unknown reason. clang-13 (a dev build) is, so far, not significantly faster than clang-12

Final results: [screenshot omitted]
