Tracking issue for speeding up rustc via its build configuration #103595

nnethercote · 2022-10-26T19:39:08Z

There are several ways to speed up rustc by changing its build configuration, without changing its code: use a single codegen unit (CGU), profile-guided optimization (PGO), link-time optimization (LTO), post-link optimization (via BOLT), and using a better allocator (e.g. jemalloc or mimalloc).

This is a tracking issue for doing these for the most popular Tier 1 platforms: Linux64 (x86_64-unknown-linux-gnu), Win64 (x86_64-pc-windows-msvc), and Mac (x86_64-apple-darwin).

Items marked with [2022] are on the Compiler performance roadmap for 2022.

Single CGU

Benefits: rustc is faster, uses less memory, has a smaller binary.
Costs: rustc takes longer to build.

Linux64: Build rustc with a single CGU on x64 Linux #115554, merged 2023-10-01.
Win64: Build rustc with 1CGU on x86_64-pc-windows-msvc #112267, merged 2024-03-12.
Mac: Build rustc with 1CGU on x86_64-apple-darwin #112268, merged 2024-03-12.

PGO

Benefits: rustc is faster.
Costs: rustc takes longer to build.

Linux64: Utilize PGO for rustc linux dist builds #80262, merged 2020-12-23.
Win64 [2022]: Utilize PGO for windows x64 rustc dist builds #96978, merged 2022-07-12.
Mac [2022]:
- Problems with symbols not being matched correctly in PGO profiles.

Other PGO attempts:

Call-site aware PGO for LLVM: Use CS PGO for LLVM #111806, no speed-up measured, seems like its benefits are superseded by BOLT.
PGO for libstd: Apply PGO to libstd on CI #97038, no speed-up measured.

LTO

Benefits: rustc is faster.
Costs: rustc takes longer to build.

Linux64:
- rustc front-end: Enable LTO for rustc_driver.so #101403, merged 2022-10-24.
- LLVM: done some time ago.
Win64:
- rustc front-end: Enable ThinLTO for rustc on x64 msvc #103591, merged 2022-12-11. Caused a miscompilation and was reverted on 2023-03-14.
- LLVM [2022]: currently statically linked, which prevents LTO, but this could be changed
Mac:
- rustc front-end: Enable ThinLTO for rustc on x86_64-apple-darwin #103647 and Re-enable ThinLTO for rustc on x86_64-apple-darwin #105845, merged 2022-12-19.
- LLVM [2022]

This is all thin LTO, which gets most of the benefits of fat LTO with a much lower link-time cost.

Other LTO attempts:

LTO for rustdoc: [perftest] Use LTO for compiling rustdoc #102885, no speed-up measured.
Fat LTO: Use fat LTO for compiling rustc #103453, no speed-up measured, large CI build cost.

BOLT

Benefits: rustc is faster.
Costs: rustc takes longer to build.

Linux64:
- rustc front-end: Optimize librustc_driver.so with BOLT #116352, merged 2023-10-14.
- LLVM: Use BOLT in CI to optimize LLVM #94381, merged 2022-10-10.
Win64: N/A
Mac: N/A

Bolt only works on ELF binaries, and thus is Linux-only.

Instruction set

Benefits: rustc is faster?
Costs: rustc won't run on old CPUs.

x86_64: Update to v2/v3/APX sometime in the future. So far, the perf. wins haven't been convincing enough to upgrade, because it will reduce compatibility for older CPUs. Some perf. results can be found here.

Linker

Benefits: rustc (linking) is faster.
Costs: hard to get working.

lld: Using lld by default would make linking times of crates much faster. Some perf. results can be found here.

Better allocator

Benefits: rustc (linking) is faster.
Costs: rustc uses more memory?

Linux64: jemalloc, done some time ago.
Win64 [2022]
Mac: jemalloc, done some time ago.

Note: #92249 and #92317 tried using two different versions of mimalloc (one 1.7-based, one 2.0-based) instead of jemalloc, but the speed/memory tradeoffs in both cases were deemed inferior (the max-rss regressions expected to be fixed in the 2.x series still exist as of 2.0.6, see #103944).

Note: we use a better allocator by simply overriding malloc/free, rather than using #[global_allocator]. See this Zulip thread for some discussion about the sub-optimality of this.

About tracking issues

Tracking issues are used to record the overall progress of implementation.
They are also used as hubs connecting to other relevant issues, e.g., bugs or open design questions.
A tracking issue is however not meant for large scale discussion, questions, or bug reports about a feature.
Instead, open a dedicated issue for the specific matter and add the relevant feature gate label.

The text was updated successfully, but these errors were encountered:

the8472 · 2022-10-26T21:18:36Z

Another thing to try that was brought up on zulip is aligning the text segment to 2MB so it can be loaded in a way that transparent huge pages kick in, but I'm not 100% sure if this even works yet, last time I checked large page support in the page cache was still a WIP and I haven't seen it in release notes.

lqd · 2022-10-26T21:31:32Z

An update for windows:

I'm trying ThinLTO for rustc_driver.dll in Enable ThinLTO for rustc on x64 msvc #103591 as local results looked interesting enough (lower than what we saw on x86_64-unknown-linux-gnu but still quite good).
I don't think we actually can ThinLTO LLVM separately on MSVC easily today: that is done by linking to LLVM as a shared library, which is not supported on x86_64-pc-windows-msvc. It's likely possible to do in the future, but is more involved than a build config tweak.

nnethercote · 2022-10-27T03:29:05Z

Another thing to try that was brought up on zulip is aligning the text segment to 2MB so it can be loaded in a way that transparent huge pages kick in

Given that it's not even clear that this works, let's leave it off this issue for now.

the8472 · 2022-10-27T11:32:44Z

It looks like file-backed huge pages are supported now. https://www.kernel.org/doc/html/latest/filesystems/proc.html#meminfo

FileHugePages
Memory used for filesystem data (page cache) allocated with huge pages
FilePmdMapped
Page cache mapped into userspace with huge pages

I think it was introduced with this one torvalds/linux@793917d so it needs at least 5.18

It also depends on filesystem support.

tschuett · 2022-10-31T18:54:32Z

I read and actually checked that rustc links dynamically against rustc_driver. The same article said we dlopen codegen backends. Isn't there a way for the common case, a static binary: rustc + rustc_driver + LLVM?

Kobzol · 2022-10-31T19:01:33Z

We could do that in theory, but I'm not sure if it would be that useful. Static linking has some benefits, but mostly only for tiny crates, the diminishing returns start to kick in early. rustc is basically a "one-liner" that calls into the entrypoint of rustc_driver, and that is now LTO-optimized.

tschuett · 2022-10-31T19:03:36Z

I would enable LTO over the full binary. Startup time may also be better.

Kobzol · 2022-10-31T19:08:25Z

Well, the "full binary" is one function call into rustc_driver :) But it's true that I haven't tried benchmarking static linking on top of the current LTO optimized librustc_driver. I'll try it, to see if static linking + LTO could provide substantial benefits.

Shipping a statically linked rustc would probably only be possible on some OSes (Linux), and it would increase the size of the distributed artifacts by a nontrivial amount.

tschuett · 2022-10-31T19:11:32Z

My LLVM folder is full of large statically linked LTO'd binaries (OSX).

Could you also statically link in the LLVM backend?

Kobzol · 2022-10-31T19:18:45Z

In theory, yes. In practice, I'm not sure if our current build system supports it (will check).

tschuett · 2022-10-31T19:23:37Z

No worries.

-rwxr-xr-x 1 xxx staff 129M Jul 28 18:44 clang-15

It only links against system libraries and no LLVM libraries.

Elabajaba · 2022-12-22T02:07:34Z

I'm not sure if it helps, but on Windows LLVM can be built with a different allocator using the -DLLVM_INTEGRATED_CRT_ALLOC=path/to/allocator flag (supports rpmalloc|mimalloc|snmalloc). However I tried doing that a few months ago, and while the llvm builds seemed to work fine, rustc_llvm failed to build using a version of LLVM built with mimalloc or snmalloc (I didn't end up testing rpmalloc at the time). The errors at the time were: for snmalloc error: renaming of the library 'INCLUDE' was specified, however this crate contains no '#[link(...)]' attributes referencing this library for mimalloc the error was the same, except it was F instead of INCLUDE

The LLVM pr (https://reviews.llvm.org/D71786) also showed some pretty major performance gains when it was implemented.

jyn514 · 2023-02-03T19:46:19Z

@michaelwoerister also suggested in #49180 that we could set codegen-units=1 for the compiler (we already do that for std).

the8472 · 2023-12-13T17:48:29Z

x86_64: Update to v2/v3/APX sometime in the future. So far, the perf. wins haven't been convincing enough to upgrade, because it will reduce compatibility for older CPUs.

Note that there's a bit of a catch-22. We could start adding specialized SIMD impls for some important core routines if std were built with a higher baseline, which would increase the performance delta. But as long as such builds don't exist it's hardly worth it because it'll only benefit users of -Zbuild-std.

Do any of the ARM targets offer a baseline that's high enough to include fancy simd features? Maybe some generic simdification work on those could show potential benefits that would be enabled on top of what's gained by compiling with a higher baseline.

Or maybe an AVX2 codepath could be added to hashbrown since that can benefit some users even without build-std. @Amanieu have there been any experiments in that direction?

Mark-Simulacrum · 2023-12-13T17:59:52Z

Apple aarch64 should have a very modern baseline.

Kobzol · 2023-12-13T21:25:00Z

Note that there's a bit of a catch-22. We could start adding specialized SIMD impls for some important core routines if std were built with a higher baseline, which would increase the performance delta

That's a very good point, I agree with that!

I think that we could start with the compiler (and its stdlib), to potentially make it faster, but still keep the actual Linux x64 target without v2/v3/v4 CPU features by default.

Do you have any ideas where we could start using x86 v2 CPU features?

the8472 · 2023-12-13T22:23:03Z

I think that we could start with the compiler (and its stdlib),

Doesn't the compiler link the same stdlib it uses to build programs?

Do you have any ideas where we could start using x86 v2 CPU features?

Adopting the utf8 validation impl from simdutf8. Other than that it'll need some exploration. Maybe the stdsimd folks have some ideas on tap.
There's always a difference between "if I had all the most recent CPU features I could..." and more modest optimizations 😅

Maybe rustc hash could be replaced with a different mixing function? ... odd, I can't find the feature level of the PCLMULQDQ instruction.

Kobzol · 2023-12-13T22:40:52Z

Doesn't the compiler link the same stdlib it uses to build programs?

IIRC it doesn't, it has its own copy. Conceptually, it needs to be able to use to add any (target) stdlib to the resulting program to cross-compile. But I might be wrong.

tschuett · 2023-12-13T23:42:32Z

The targets on the Platform Support page look like legit target triples. Could you encode v4 in triple and ship two Linux versions for x86?

Kobzol · 2023-12-14T07:10:35Z

In theory yes, but I'm not sure if it's the best solution. Maybe we could bump the defaut target to v2/v3 and keep an unoptimized v1 target for people with old CPUs. In any case, maintaining a target is not for free, so maybe there are better solutions.

To clarify, it's a very different thing to ship a v2 compiler and to make the x86 Linux target v2 by default. We're really only considering the first thing for now.

tschuett · 2023-12-14T08:38:57Z

Shipping a highly tuned v4 compiler with -mtune=icelake should give some speedup and you can use current cpu features: AVX512, PCLMULQDQ, ..

Haswell was launched June 4, 2013.

x86-64: CMOV, CMPXCHG8B, FPU, FXSR, MMX, FXSR, SCE, SSE, SSE2
x86-64-v2: (close to Nehalem) CMPXCHG16B, LAHF-SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
x86-64-v3: (close to Haswell) AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, XSAVE
x86-64-v4: AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL

Kobzol · 2023-12-14T09:21:02Z

Yes, our measurements show that v3 produces ~1-3% speedup for the compiler. But on it's own, that hasn't been worth it to use it so far, because there is non-trivial maintenance costs for doing that, plus we would drop some existing users. We'll need to tread carefully.

the8472 · 2023-12-14T11:07:59Z

AVX-512 is not really viable for broad use because AMD only started shipping it recently and intel has only shipped it to some of their market segments (workstation/server chips) and even disabled it on some recent chips due to inconsistencies between P and E cores.
AVX2 has a much larger user base.

See https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam?platform=linux
99% for SSE42 (~v2)
93% for AVX2 (~v3)
6.5% for AVX512 (~v4)

Build `rustc` with 1CGU on `x86_64-pc-windows-msvc` Distribute `x86_64-pc-windows-msvc` artifacts built with `rust.codegen-units=1`, like we already do on Linux. 1) effect on code size on `x86_64-pc-windows-msvc`: it's a 3.67% reduction on `rustc_driver.dll` - before, [`41d97c8a5dea2731b0e56fe97cd7cb79e21cff79`](https://ci-artifacts.rust-lang.org/rustc-builds/41d97c8a5dea2731b0e56fe97cd7cb79e21cff79/rustc-nightly-x86_64-pc-windows-msvc.tar.xz): 137605632 - after, [`704aaa875e4acccc973cbe4579e66afbac425691`](https://ci-artifacts.rust-lang.org/rustc-builds/704aaa875e4acccc973cbe4579e66afbac425691/rustc-nightly-x86_64-pc-windows-msvc.tar.xz): 132551680 2) time it took on CI - the [first `try` build](https://github.com/rust-lang-ci/rust/actions/runs/8155647651/job/22291592507) took: 1h 31m - the [second `try` build](https://github.com/rust-lang-ci/rust/actions/runs/8157043594/job/22295790552) took: 1h 32m 3) most recent perf results: - on a slightly noisy desktop [here](rust-lang#112267 (comment)) - ChrisDenton's results [here](rust-lang#112267 (comment)) Related tracking issue for build configuration: rust-lang#103595

Build `rustc` with 1CGU on `x86_64-apple-darwin` Distribute `x86_64-apple-darwin` artifacts built with `rust.codegen-units=1`, like we already do on Linux. 1) effect on code size on `x86_64-apple-darwin`: it's a 11.14% reduction on `librustc_driver.dylib` - before, [`41d97c8a5dea2731b0e56fe97cd7cb79e21cff79`](https://ci-artifacts.rust-lang.org/rustc-builds/41d97c8a5dea2731b0e56fe97cd7cb79e21cff79/rustc-nightly-x86_64-apple-darwin.tar.xz): 161232048 - after, [`7549dbdc09f0c4f6cc84002ac03081828054784b`](https://ci-artifacts.rust-lang.org/rustc-builds/7549dbdc09f0c4f6cc84002ac03081828054784b/rustc-nightly-x86_64-apple-darwin.tar.xz): 143256928 2) time it took on CI: - the [first `try` build](https://github.com/rust-lang-ci/rust/actions/runs/8155512915/job/22291187124) took: 1h 33m - the [second `try` build](https://github.com/rust-lang-ci/rust/actions/runs/8157057880/job/22295839911) took: 1h 45m 3) most recent perf results on (a noisy) x64 mac are [here](rust-lang#112268 (comment)). Related tracking issue for build configuration: rust-lang#103595

nnethercote added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. WG-compiler-performance Working group: Compiler Performance labels Oct 26, 2022

lqd mentioned this issue Feb 3, 2023

Build released compiler artifacts as optimized as possible #49180

Closed

nnethercote mentioned this issue Mar 2, 2023

Compiler Performance Tracking Issue #48547

Open

workingjubilee mentioned this issue Nov 1, 2023

Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT tensorchord/pgvecto.rs#43

Open

lqd mentioned this issue Mar 5, 2024

Build rustc with 1CGU on x86_64-pc-windows-msvc #112267

Merged

lqd mentioned this issue Mar 5, 2024

Build rustc with 1CGU on x86_64-apple-darwin #112268

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tracking issue for speeding up rustc via its build configuration #103595

Tracking issue for speeding up rustc via its build configuration #103595

nnethercote commented Oct 26, 2022 •

edited by Kobzol

Loading

the8472 commented Oct 26, 2022

lqd commented Oct 26, 2022 •

edited

Loading

nnethercote commented Oct 27, 2022

the8472 commented Oct 27, 2022 •

edited

Loading

tschuett commented Oct 31, 2022

Kobzol commented Oct 31, 2022

tschuett commented Oct 31, 2022

Kobzol commented Oct 31, 2022

tschuett commented Oct 31, 2022

Kobzol commented Oct 31, 2022

tschuett commented Oct 31, 2022

Elabajaba commented Dec 22, 2022

jyn514 commented Feb 3, 2023

the8472 commented Dec 13, 2023

Mark-Simulacrum commented Dec 13, 2023

Kobzol commented Dec 13, 2023

the8472 commented Dec 13, 2023

Kobzol commented Dec 13, 2023

tschuett commented Dec 13, 2023

Kobzol commented Dec 14, 2023 •

edited

Loading

tschuett commented Dec 14, 2023

Kobzol commented Dec 14, 2023

the8472 commented Dec 14, 2023 •

edited

Loading

Tracking issue for speeding up rustc via its build configuration #103595

Tracking issue for speeding up rustc via its build configuration #103595

Comments

nnethercote commented Oct 26, 2022 • edited by Kobzol Loading

Single CGU

PGO

LTO

BOLT

Instruction set

Linker

Better allocator

About tracking issues

the8472 commented Oct 26, 2022

lqd commented Oct 26, 2022 • edited Loading

nnethercote commented Oct 27, 2022

the8472 commented Oct 27, 2022 • edited Loading

tschuett commented Oct 31, 2022

Kobzol commented Oct 31, 2022

tschuett commented Oct 31, 2022

Kobzol commented Oct 31, 2022

tschuett commented Oct 31, 2022

Kobzol commented Oct 31, 2022

tschuett commented Oct 31, 2022

Elabajaba commented Dec 22, 2022

jyn514 commented Feb 3, 2023

the8472 commented Dec 13, 2023

Mark-Simulacrum commented Dec 13, 2023

Kobzol commented Dec 13, 2023

the8472 commented Dec 13, 2023

Kobzol commented Dec 13, 2023

tschuett commented Dec 13, 2023

Kobzol commented Dec 14, 2023 • edited Loading

tschuett commented Dec 14, 2023

Kobzol commented Dec 14, 2023

the8472 commented Dec 14, 2023 • edited Loading

nnethercote commented Oct 26, 2022 •

edited by Kobzol

Loading

lqd commented Oct 26, 2022 •

edited

Loading

the8472 commented Oct 27, 2022 •

edited

Loading

Kobzol commented Dec 14, 2023 •

edited

Loading

the8472 commented Dec 14, 2023 •

edited

Loading