Use REP MOVSQ/STOSQ on x86_64 #365

Merged
merged 10 commits on Oct 24, 2020

Conversation

@josephlr (Contributor) commented Jul 8, 2020

Addresses part of #339 (see also rust-osdev/cargo-xbuild#77)

The implementations of these functions are quite simple and, on many recent processors, are the fastest way to implement memcpy/memmove/memset. The implementations in MUSL and in the Linux kernel were used for inspiration.
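
For reference, a minimal sketch of the quad-word idea (my illustration, written against today's stable core::arch::asm! syntax rather than the code in this PR): copy the bulk eight bytes at a time with rep movsq, then the remaining 0-7 bytes with rep movsb.

```rust
// Sketch only: assumes x86_64, non-overlapping buffers, and that the
// direction flag is clear (which the asm! contract guarantees).
use core::arch::asm;

#[cfg(target_arch = "x86_64")]
pub unsafe fn memcpy_sketch(dest: *mut u8, src: *const u8, count: usize) {
    let qwords = count >> 3;   // full 8-byte words
    let bytes = count & 0b111; // leftover 0..=7 bytes
    asm!(
        "rep movsq",          // copy qwords, advancing rdi/rsi by 8 each step
        "mov rcx, {bytes}",
        "rep movsb",          // copy the tail a byte at a time
        bytes = in(reg) bytes,
        inout("rdi") dest => _,
        inout("rsi") src => _,
        inout("rcx") qwords => _,
        options(nostack, preserves_flags),
    );
}
```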

Benchmarks for the memory functions were also added in this PR and can be invoked by running `cargo bench --package=testcrate`. The results of running these benchmarks on different hardware show that the qword-based variants are almost always faster than the byte-based variants.
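
Each benchmark pair looks roughly like the following sketch (illustrative names; the real benches live in the testcrate and need nightly's test crate, and the exact path/feature needed to reach this crate's memcpy may differ):

```rust
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

const N: usize = 4096; // 4 KiB blocks

// Baseline: the system memcpy, reached through copy_from_slice.
#[bench]
fn memcpy_builtin_4k(b: &mut Bencher) {
    let src = vec![0u8; N];
    let mut dst = vec![0u8; N];
    b.bytes = N as u64; // lets `cargo bench` report throughput
    b.iter(|| dst.copy_from_slice(black_box(&src)));
}

// This crate's implementation, called directly.
#[bench]
fn memcpy_rust_4k(b: &mut Bencher) {
    let src = vec![0u8; N];
    let mut dst = vec![0u8; N];
    b.bytes = N as u64;
    b.iter(|| unsafe {
        compiler_builtins::mem::memcpy(dst.as_mut_ptr(), src.as_ptr(), N)
    });
}
```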

While the implementations of memcmp/bcmp could be made faster through use of SIMD intrinsics, using rep cmpsb/rep cmpsq makes them slower, so they are left as-is in this PR.

Note that #164 added some optimized versions for memcmp on ARM.

@alexcrichton (Member)

Thanks for this!

I forget, but do we already have tests for memset/memcmp/etc? If not, could you add some as part of this PR?

Additionally, do you have some benchmark numbers for how these perform?

@josephlr josephlr force-pushed the ermsb branch 2 times, most recently from 97ad0fa to 012085a on July 8, 2020 at 23:29
@josephlr (Contributor, Author) commented Jul 9, 2020

Additionally, do you have some benchmark numbers for how these perform?

I updated this PR to add memcpy/memset/memcmp benchmarks to the testcrate crate. It allows comparing the libc functions (via copy_from_slice/slice.cmp/etc...) to the Rust functions provided by this crate.

I ran a bunch of trials, results are below, the main takeaways are:

  • Using rep movsb/stosb makes the Rust memcpy/memset implementations as fast as musl/glibc's
  • Using repe cmpsb for memcmp actually makes things worse, so I'll remove it.
    • This is consistent with Intel's optimization guide.
    • The memcmp implementation is very slow, there's room to improve here.

memcpy

| Implementation | 4 KiB blocks (GiB/sec) | 1 MiB blocks (GiB/sec) |
| --- | --- | --- |
| Current, simple Rust loop | 57.7 | 30.1 |
| This PR (rep movsb) | 98.7 (+71%) | 37.3 (+24%) |
| x86_64 Linux musl libc | 94.9 (+64%) | 38.1 (+27%) |
| x86_64 Linux GNU libc | 126.0 (+118%) | 35.7 (+19%) |

memset

| Implementation | 4 KiB blocks (GiB/sec) | 1 MiB blocks (GiB/sec) |
| --- | --- | --- |
| Current, simple Rust loop | 68.6 | 45.3 |
| This PR (rep stosb) | 121.7 (+77%) | 63.6 (+40%) |
| x86_64 Linux musl libc | 112.2 (+63%) | 63.8 (+41%) |
| x86_64 Linux GNU libc | 112.7 (+64%) | 63.7 (+41%) |

memcmp

| Implementation | 4 KiB blocks (GiB/sec) | 1 MiB blocks (GiB/sec) |
| --- | --- | --- |
| Current, simple Rust loop | 3.6 | 3.6 |
| This PR (repe cmpsb) | 2.2 (-38%) | 2.2 (-37%) |
| x86_64 Linux musl libc | 3.5 (-1%) | 3.6 (-1%) |
| x86_64 Linux GNU libc | 78.8 (+2110%) | 81.9 (+2182%) |

@alexcrichton (Member)

Nice! Those are some pretty slick wins, and it's also a nice find that memcmp doesn't speed up all that much. Also, it's pretty crazy how much faster glibc is for memcmp than a simple loop!

@CryZe (Contributor) commented Jul 9, 2020

This claims to close my issue, but isn't this only about x86, while the performance problems seem to be happening across the board? (Especially on WASM it seems rather bad atm)

@alexcrichton (Member)

We can leave it open for other platforms, but FWIW there's not really much else we can do for wasm. The bulk memory proposal fixes this issue, however, because the memory.copy instruction is basically the exact same as a call to memcpy, and it's implemented by the engine, so it's much faster.

@CryZe (Contributor) commented Jul 9, 2020

If you use the WASI target, it does a loop copying 32-bit values rather than individual bytes, because it then uses the musl implementation from wasi-libc. I haven't done any real benchmarking, but I'd expect that to be faster. But yeah, the bulk memory proposal also fixes that issue if you use that target feature.

@josephlr josephlr force-pushed the ermsb branch 2 times, most recently from 7a52621 to 7321f5c on July 10, 2020 at 09:45
src/mem/x86_64.rs (outdated review thread, resolved)
@alexcrichton (Member)

Also, to confirm, do we have tests for this in-repo? If not could some be added?

@josephlr (Contributor, Author)

Also, to confirm, do we have tests for this in-repo? If not could some be added?

I would want better test coverage than what we currently have. I'm planning to add a bunch before this CL is ready for review again.

@@ -0,0 +1,69 @@
use super::c_int;

// On recent Intel processors, "rep movsb" and "rep stosb" have been enhanced to
Member

How does this implementation fare on non-Intel implementations of x86_64?

Contributor Author

I've been investigating performance on AMD hardware (the only other x86 platform where anyone cares about performance). This has led me to modify the implementation. When I have some time, I'll post the results and update this comment to clarify the impact on AMD as well.

Contributor

On Intel, fast-string support is detectable via two flags in CPUID and, if necessary, one enable bit in IA32_MISC_ENABLE. AMD may provide the same detection tools.
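
For illustration only (the PR as merged deliberately avoids CPUID dispatch), a minimal detection sketch using core::arch::x86_64::__cpuid_count; the ERMSB and FSRM bit positions below are taken from Intel's SDM, and the IA32_MISC_ENABLE check needs ring 0 (rdmsr), so it is omitted:

```rust
#[cfg(target_arch = "x86_64")]
fn fast_string_features() -> (bool, bool) {
    use core::arch::x86_64::__cpuid_count;
    // CPUID leaf 7, sub-leaf 0: structured extended feature flags.
    let leaf7 = unsafe { __cpuid_count(7, 0) };
    let ermsb = leaf7.ebx & (1 << 9) != 0; // Enhanced REP MOVSB/STOSB
    let fsrm = leaf7.edx & (1 << 4) != 0;  // Fast Short REP MOV
    (ermsb, fsrm)
}
```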

@josephlr (Contributor, Author) commented Oct 15, 2020

So it looks like virtually all newish AMD and Intel processors support some sort of "REP MOVS enhancement" (i.e. rep movs is somehow better than a normal loop). However, if the ermsb feature flag isn't present (like on all AMD processors) then rep movsq seems better than rep movsb. With ermsb the two variants are about the same speed.

Given this, I just implemented the rep movsq version unconditionally without any CPUID checking. Variants that use rep movsb when Intel's ERMSB/FSRM feature is enabled could be added later, but there doesn't seem to be much of a gain (at least with the benchmarks I'm running here).

@Soveu commented Oct 13, 2020

These implementations assume that the direction flag is not set, which might not always be the case.

src/mem/x86_64.rs (outdated review thread, resolved)
Signed-off-by: Joe Richey <joerichey@google.com>
Signed-off-by: Joe Richey <joerichey@google.com>
This allows comparing the "normal" implementations to the
implementations provided by this crate.

Signed-off-by: Joe Richey <joerichey@google.com>
@josephlr (Contributor, Author)

These implementations assume that the direction flag is not set, which might not always be the case.

Per the asm! docs, "On x86, the direction flag (DF in EFLAGS) is clear on entry to an asm block and must be clear on exit", so I think these impls are fine.

The assembly generated seems correct:
    https://rust.godbolt.org/z/GGnec8

Signed-off-by: Joe Richey <joerichey@google.com>
@josephlr (Contributor, Author)

For AMD performance, I'm getting some conflicting results about which is better: rep movsb or rep movsq. I think it might be because I'm using a VM. I don't have any real AMD hardware, so could anyone in this thread run the benchmarks against dd7b7dc and 2f9f61f and tell me what they find?

@andyhhp commented Oct 17, 2020

DF

Even C doesn't tolerate DF being set generally. There are two legitimate uses of it which I have encountered. One is memmove(), and one is code dumps in backtraces, where you need to be wary of hitting page/permission boundaries (see https://github.com/xen-project/xen/blob/master/xen/arch/x86/traps.c#L175-L204).

For performance, things are very tricky, and one size does not fit all. Presumably here we're talking about mem*() calls which have survived LLVM's optimisation passes, and are the variations which don't decompose nicely?

If alignment information is available at compile time, then rep stos{l,q} is faster than rep stosb on earlier hardware. Intel have some forthcoming features literally named Fast Zero-Length MOVSB, Fast Short STOSB, Fast short CMPSB/SCASB (https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf Page 120. Not sure if this was intended to be public right now, but it is.) which should give anyone a hint that the current variations aren't great for small %ecx inputs.

Frankly, study a popular libc and follow their lead. A lot of time and effort has gone into optimising them generally across multiple generations of processor. Alternatively, if you do feel like doing feature-based dispatch, that will get better results if you can pick the optimum algorithm for the CPU you're on.

@Soveu commented Oct 17, 2020

Linux encourages using rep movsb/stosb for memcpy/memset:
memcpy
memset

@Soveu commented Oct 18, 2020

I wonder why memmove got faster than memcpy lol
edit: literally changing the commits changes the results of some *_rust functions, probably that weird function alignment problem on AMD

Signed-off-by: Joe Richey <joerichey@google.com>
@josephlr (Contributor, Author)

@alexcrichton the tests have been added, so this is now ready for final review and merging.

This implementation sticks with the rep movsq/rep stosq implementation used by MUSL and Linux (see the links in the PR description). The final assembly looks optimal (memcpy/memset are identical to Linux's implementation).

For performance numbers, see my link in #365 (comment)
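
For completeness, the same quad-word idea sketched for memset (again my illustration, not the actual code in src/mem/x86_64.rs): rep stosq for the bulk, rep stosb for the 0-7 tail bytes.

```rust
use core::arch::asm;

#[cfg(target_arch = "x86_64")]
pub unsafe fn memset_sketch(dest: *mut u8, c: u8, count: usize) {
    let qwords = count >> 3;
    let bytes = count & 0b111;
    // Broadcast the fill byte into all eight bytes of rax for stosq
    // (the low byte stays equal to c, which is what stosb needs for the tail).
    let qword = (c as u64) * 0x0101_0101_0101_0101;
    asm!(
        "rep stosq",          // store 8 bytes at a time
        "mov rcx, {bytes}",
        "rep stosb",          // store the remaining 0..=7 bytes
        bytes = in(reg) bytes,
        inout("rdi") dest => _,
        inout("rcx") qwords => _,
        in("rax") qword,
        options(nostack, preserves_flags),
    );
}
```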

Signed-off-by: Joe Richey <joerichey@google.com>
@alexcrichton (Member)

Thanks again for this! This all looks great to me. As one final thing, though, I'm not sure if the asm feature is actually ever enabled on CI, so could you add a line here to test the feature?

Signed-off-by: Joe Richey <joerichey@google.com>
Signed-off-by: Joe Richey <joerichey@google.com>
@josephlr (Contributor, Author)

Thanks again for this! This all looks great to me. As one final thing, though, I'm not sure if the asm feature is actually ever enabled on CI, so could you add a line here to test the feature?

Done (for tests and builds). Also, the testcrate enables the "asm" feature by default; should that be changed?

@alexcrichton (Member)

Hm, yeah, ideally that would change, but that's probably best left to another PR. Thanks again!

@alexcrichton alexcrichton merged commit 33ad366 into rust-lang:master Oct 24, 2020
@josephlr josephlr deleted the ermsb branch October 26, 2020 10:05
josephlr added a commit to josephlr/rust that referenced this pull request Oct 26, 2020
This change is needed for compiler-builtins to check for this feature
when implementing memcpy/memset. See:
  rust-lang/compiler-builtins#365

The change just does compile-time detection. I think that runtime
detection will have to come in a follow-up CL to std-detect.

Like all the CPU feature flags, this just references rust-lang#44839

Signed-off-by: Joe Richey <joerichey@google.com>
JohnTitor added a commit to JohnTitor/rust that referenced this pull request Oct 26, 2020
Add compiler support for LLVM's x86_64 ERMSB feature

This change is needed for compiler-builtins to check for this feature
when implementing memcpy/memset. See:
  rust-lang/compiler-builtins#365

Without this change, the following code compiles, but does nothing:
```rust
#[cfg(target_feature = "ermsb")]
pub unsafe fn ermsb_memcpy() { ... }
```

The change just does compile-time detection. I think that runtime
detection will have to come in a follow-up CL to std-detect.

Like all the CPU feature flags, this just references rust-lang#44839

Signed-off-by: Joe Richey <joerichey@google.com>
stlankes added a commit to stlankes/hermit-rs that referenced this pull request Nov 21, 2020
bors bot added a commit to hermit-os/hermit-rs that referenced this pull request Nov 21, 2020
78: using of the asm feature to improve the performance of basic functions r=jbreitbart a=stlankes

-  PR uses rust-lang/compiler-builtins#365 to improve the performance
- fix broken CI and build the bootloader on windows correctly


Co-authored-by: Stefan Lankes <slankes@eonerc.rwth-aachen.de>
AaronKutch pushed a commit to AaronKutch/compiler-builtins that referenced this pull request Nov 28, 2020
* mem: Move mem* functions to separate directory

Signed-off-by: Joe Richey <joerichey@google.com>

* memcpy: Create separate memcpy.rs file

Signed-off-by: Joe Richey <joerichey@google.com>

* benches: Add benchmarks for mem* functions

This allows comparing the "normal" implementations to the
implementations provided by this crate.

Signed-off-by: Joe Richey <joerichey@google.com>

* mem: Add REP MOVSB/STOSB implementations

The assembly generated seems correct:
    https://rust.godbolt.org/z/GGnec8

Signed-off-by: Joe Richey <joerichey@google.com>

* mem: Add documentation for REP string instructions

Signed-off-by: Joe Richey <joerichey@google.com>

* Use quad-word rep string instructions

Signed-off-by: Joe Richey <joerichey@google.com>

* Prevent panic when compiled in debug mode

Signed-off-by: Joe Richey <joerichey@google.com>

* Add tests for mem* functions

Signed-off-by: Joe Richey <joerichey@google.com>

* Add build/test with the "asm" feature

Signed-off-by: Joe Richey <joerichey@google.com>

* Add byte length to Bencher

Signed-off-by: Joe Richey <joerichey@google.com>
dspencer12 added a commit to dspencer12/blog_os that referenced this pull request Feb 23, 2021
The referenced issue in compiler-builtins (rust-lang/compiler-builtins#365) has been merged.
phil-opp pushed a commit to phil-opp/blog_os that referenced this pull request Feb 23, 2021
The referenced issue in compiler-builtins (rust-lang/compiler-builtins#365) has been merged.