Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document movnt needs sfence #1457

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 33 additions & 0 deletions crates/core_arch/src/x86/avx.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1683,6 +1683,17 @@ pub unsafe fn _mm256_lddqu_si256(mem_addr: *const __m256i) -> __m256i {
/// aligned memory location. To minimize caching, the data is flagged as
/// non-temporal (unlikely to be used again soon)
///
/// # Safety
///
/// After using this intrinsic, but before any atomic operations occur, a call
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe
/// usage of this intrinsic must always end in `_mm_sfence()`.
Comment on lines +1688 to +1690
Copy link
Member

@the8472 the8472 Aug 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bit too strict. There are niche scenarios where one would write out data without ever reading it again or at least without reading it again on another thread.
In those cases some later release write would be incidental and only meant to order other, regular writes.

Maybe the entire requirement could be conditional on "if the written memory is intended to be made accessible on another thread through a release operation", with the recommendation "if in doubt, add a fence".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, I don't think it's coherent to leave an unsatisfied obligation hanging unless it's still in unsafe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That depends on the perspective. Yes, it's an obligation to... restore consistency with the current rust memory model which assumes that all writes must be ordered with a release operation.
But under some unspecified extended model it may be valid to leave some memory locations unordered.
AIUI the purpose of ordering is to avoid data races which are UB. If the another thread never accesses the memory then this is unspecified behavior but not necessarily UB.

E.g. if you're writing bytes to a framebuffer that's concurrently being scanned out by the GPU then the fence doesn't add anything. You're not synchronizing with anything. You're just racing against time. Either the write makes out to the pixels or it doesn't.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the conservative definition is fine for now. But it could be phrased in a way that makes it clear that it may be replaced with a more refined definition at some point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to tell people to always discharge the obligation (even in the case you mention, the store-store fence would serialize any remaining deferred write-buffers, which is desired if you might run out of time otherwise!) and relax things when we mechanize a better spec.

Copy link
Member

@RalfJung RalfJung Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are currently moving into a different direction in rust-lang/rust#114582: if we follow @talchas' approach, the rule will be: after using this intrinsic, but (happens-)before any other read or write of this location in any thread, a fence must occur (the fence must be in the same thread that called the intrinsic).

Basically, the write actually happens in another thread via a non-atomic store, so accessing this location may cause a data race. Doing a fence waits for that other thread to complete its store, avoiding the data race.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I tried to keep the rationale for the requirement slightly vague as we discuss it, and approach it as more of a "do this or your program explodes into flames" type of warning, which is why it goes on to explain further reads or writes to the location are discouraged. So with that in mind, what is this missing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if we'd call that UB instead of merely "discouraged".

Ideally even further nontemporal writes (before the next sfence) would be UB... I'd find it strange to have a situation where a nontemporal store would be allowed but a regular store would not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #1534 with my alternative wording.

///
/// Reading and writing to the memory stored-to by any other means, after any
/// nontemporal store has been used to write to that memory, but before the
/// use of `_mm_sfence()`, is discouraged. Such reads can lead to pipeline
/// stalls and yet-unspecified program behavior.
Comment on lines +1692 to +1695
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Well, both x86's movntdq and Arm's stnp retain local-thread serial ordering if you use them twice on the same location, i.e. that using _mm_stream_ps(ptr, a); _mm_stream_ps(ptr, b); still will yield b when you later read it, and coalescing such writes is traditionally allowed by compilers, so I would prefer to retain the more ambiguous verbiage I used here, @RalfJung. Mostly, I said "any other means" to cast a wide net so that people avoid it rather than inviting people to try to reason about in a way that might end in them deciding it's suddenly okay to perform an atomic store or atomic load on the memory from another thread.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree that we should canonically allow _mm_stream_ps(ptr, a); _mm_store_ps(ptr, b); _mm_sfence(); eventually but I would prefer it live in an in-between state until we are clear on our mechanics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that the proposed doc just doesn't align with the proposed spec in rust-lang/rust#114582. If you go with @talchas' original proposed spec, the doc should be something like

"After the _mm_stream_ps until the next time this thread calls _mm_sfence, any attempt to access (read or write) this memory from this thread by means other than _mm*_stream* operations is UB. Furthermore, any access from another threads must synchronize-with that _mm_sfence or else any access (read or write) from the other thread is UB, including _mm*_stream* operations.

Basically, the nontemporal writes performed by streaming operations should be considered not ordered even with other operations in the thread they appear in, except with other nontemporal writes, until the next sfence which establishes synchronization with all nontemporal writes of the current thread."

That would allow your example. Crucially it disallows _mm_stream_ps(ptr, a); *ptr = a; _mm_sfence(); and _mm_stream_ps(ptr, a); println!("{:?}", *ptr); _mm_sfence(); which is disallowed by the proposed spec. We should probably have examples in the docs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, _mm_store_ps is morally equivalent to assignment.

Copy link
Member

@RalfJung RalfJung Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, I misread your example and thought it said stream twice.

If we make _mm_store_ps also inline assembly we might be able to come up with a mechanism that allows it. But for regular assignment I'm quite worried that LLVM optimizations will make deductions from seeing an assignment that are incompatible with the nontemporal state of that location.

Copy link
Member

@RalfJung RalfJung Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a myth that "you cannot read uninitialized memory" and I think it is harmful. I've encountered tons of confusion caused by people axiomatically thinking "I cannot read uninit memory". Instead we should teach people what actually happens in our spec: when reading memory at a certain type, that memory must be sufficiently "valid" for this type. If you read at bool, it must be 0 or 1. If you read at i32, it must be initialized. If you read at MaybeUninit<_>, it can be anything. I think telling people they can't read from uninit memory is a complete distraction and leads to an unnecessarily complicated mental model. There's a reason that "reading from uninit memory" does not show up in this list.

But anyway, it seems unlikely we will get to an agreement on terminology or teaching philosophy here and that's all rather off-topic anyway. The on-topic question is whether we should allow people to perform regular writes in between the nontemporal write and the fence. I don't see a good motivation for doing that, and it opens some tricky questions that we'd need to carefully figure out before allowing it. For instance, if I do a nontemporal write and then a regular write, do I even still need the fence or is it now guaranteed that the next "release" operation synchronizes everything properly?

Copy link
Member Author

@workingjubilee workingjubilee Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ehnn I mean yes, just...

...Anyways I think, dreadfully, to actually answer your question about the write series (i.e.

movnti [somewhere], something
mov [somewhere], something
mov [flag], 1

), that on x86 you kiiiinda still need the fence, from the ISA's point of view?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. The local results should be consistent absolutely, but I am concerned about code that may look more like moving overlapping sizes, so

vmovntdq [somewhere], ymm09
mov [somewhere], r09
mov [flag], 1

The first 8 bytes are guaranteed once the flag is set due to the two regular movs participating in TSO, but the vmovntdq means 24 bytes are in an ambiguous state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do I even still need the fence or is it now guaranteed that the next "release" operation synchronizes everything properly?

If the answer to that is yes then I don't think we can allow regular writes. A regular write in Rust is guaranteed to be properly released by a release operation, after all.

We can have intrinsics that do "regular write to something that might be in the nontemporal state", but those need the same inline-assembly-and-Rust-replacement-code-spec treatment as streaming writes.

I'd really prefer if we didn't have to do this... the usecases we are aware of don't need this, do they?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, alright then, I didn't realize that was the specific way the sequencing invariants were formed. It's possible the reality may be amenable to regular writes (assuming a sort of "wholesale adoption of the x86 mechanics into Rust" approach for this case) but I'd have to examine the rules... very closely.

I don't think the majority of use cases need this in practice, correct, so now that we have hashed out that issue as existing, then I am happy to let this one go. I will update the documentation changes here accordingly.

///
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_stream_si256)
#[inline]
#[target_feature(enable = "avx")]
Expand All @@ -1696,6 +1707,17 @@ pub unsafe fn _mm256_stream_si256(mem_addr: *mut __m256i, a: __m256i) {
/// to a 32-byte aligned memory location. To minimize caching, the data is
/// flagged as non-temporal (unlikely to be used again soon).
///
/// # Safety
///
/// After using this intrinsic, but before any atomic operations occur, a call
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe
/// usage of this intrinsic must always end in `_mm_sfence()`.
///
/// Reading and writing to the memory stored-to by any other means, after any
/// nontemporal store has been used to write to that memory, but before the
/// use of `_mm_sfence()`, is discouraged. Such reads can lead to pipeline
/// stalls and yet-unspecified program behavior.
///
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_stream_pd)
#[inline]
#[target_feature(enable = "avx")]
Expand All @@ -1711,6 +1733,17 @@ pub unsafe fn _mm256_stream_pd(mem_addr: *mut f64, a: __m256d) {
/// caching, the data is flagged as non-temporal (unlikely to be used again
/// soon).
///
/// # Safety
///
/// After using this intrinsic, but before any atomic operations occur, a call
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe
/// usage of this intrinsic must always end in `_mm_sfence()`.
///
/// Reading and writing to the memory stored-to by any other means, after any
/// nontemporal store has been used to write to that memory, but before the
/// use of `_mm_sfence()`, is discouraged. Such reads can lead to pipeline
/// stalls and yet-unspecified program behavior.
///
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_stream_ps)
#[inline]
#[target_feature(enable = "avx")]
Expand Down
33 changes: 33 additions & 0 deletions crates/core_arch/src/x86/avx512f.rs
Original file line number Diff line number Diff line change
Expand Up @@ -26144,6 +26144,17 @@ pub unsafe fn _mm_mask_testn_epi64_mask(k: __mmask8, a: __m128i, b: __m128i) ->

/// Store 512-bits (composed of 16 packed single-precision (32-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 64-byte boundary or a general-protection exception may be generated.
///
/// # Safety
///
/// After using this intrinsic, but before any atomic operations occur, a call
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe
/// usage of this intrinsic must always end in `_mm_sfence()`.
///
/// Reading and writing to the memory stored-to by any other means, after any
/// nontemporal store has been used to write to that memory, but before the
/// use of `_mm_sfence()`, is discouraged. Such reads can lead to pipeline
/// stalls and yet-unspecified program behavior.
///
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_stream_ps&expand=5671)
#[inline]
#[target_feature(enable = "avx512f")]
Expand All @@ -26155,6 +26166,17 @@ pub unsafe fn _mm512_stream_ps(mem_addr: *mut f32, a: __m512) {

/// Store 512-bits (composed of 8 packed double-precision (64-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 64-byte boundary or a general-protection exception may be generated.
///
/// # Safety
///
/// After using this intrinsic, but before any atomic operations occur, a call
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe
/// usage of this intrinsic must always end in `_mm_sfence()`.
///
/// Reading and writing to the memory stored-to by any other means, after any
/// nontemporal store has been used to write to that memory, but before the
/// use of `_mm_sfence()`, is discouraged. Such reads can lead to pipeline
/// stalls and yet-unspecified program behavior.
///
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_stream_pd&expand=5667)
#[inline]
#[target_feature(enable = "avx512f")]
Expand All @@ -26166,6 +26188,17 @@ pub unsafe fn _mm512_stream_pd(mem_addr: *mut f64, a: __m512d) {

/// Store 512-bits of integer data from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 64-byte boundary or a general-protection exception may be generated.
///
/// # Safety
///
/// After using this intrinsic, but before any atomic operations occur, a call
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe
/// usage of this intrinsic must always end in `_mm_sfence()`.
///
/// Reading and writing to the memory stored-to by any other means, after any
/// nontemporal store has been used to write to that memory, but before the
/// use of `_mm_sfence()`, is discouraged. Such reads can lead to pipeline
/// stalls and yet-unspecified program behavior.
///
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_stream_si512&expand=5675)
#[inline]
#[target_feature(enable = "avx512f")]
Expand Down
33 changes: 33 additions & 0 deletions crates/core_arch/src/x86/sse2.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1277,6 +1277,17 @@ pub unsafe fn _mm_storel_epi64(mem_addr: *mut __m128i, a: __m128i) {
/// To minimize caching, the data is flagged as non-temporal (unlikely to be
/// used again soon).
///
/// # Safety
///
/// After using this intrinsic, but before any atomic operations occur, a call
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe
/// usage of this intrinsic must always end in `_mm_sfence()`.
///
/// Reading and writing to the memory stored-to by any other means, after any
/// nontemporal store has been used to write to that memory, but before the
/// use of `_mm_sfence()`, is discouraged. Such reads can lead to pipeline
/// stalls and yet-unspecified program behavior.
///
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_si128)
#[inline]
#[target_feature(enable = "sse2")]
Expand All @@ -1290,6 +1301,17 @@ pub unsafe fn _mm_stream_si128(mem_addr: *mut __m128i, a: __m128i) {
/// To minimize caching, the data is flagged as non-temporal (unlikely to be
/// used again soon).
///
/// # Safety
///
/// After using this intrinsic, but before any atomic operations occur, a call
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe
/// usage of this intrinsic must always end in `_mm_sfence()`.
///
/// Reading and writing to the memory stored-to by any other means, after any
/// nontemporal store has been used to write to that memory, but before the
/// use of `_mm_sfence()`, is discouraged. Such reads can lead to pipeline
/// stalls and yet-unspecified program behavior.
///
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_si32)
#[inline]
#[target_feature(enable = "sse2")]
Expand Down Expand Up @@ -2469,6 +2491,17 @@ pub unsafe fn _mm_loadl_pd(a: __m128d, mem_addr: *const f64) -> __m128d {
/// To minimize caching, the data is flagged as non-temporal (unlikely to be
/// used again soon).
///
/// # Safety
///
/// After using this intrinsic, but before any atomic operations occur, a call
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe
/// usage of this intrinsic must always end in `_mm_sfence()`.
///
/// Reading and writing to the memory stored-to by any other means, after any
/// nontemporal store has been used to write to that memory, but before the
/// use of `_mm_sfence()`, is discouraged. Such reads can lead to pipeline
/// stalls and yet-unspecified program behavior.
///
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_pd)
#[inline]
#[target_feature(enable = "sse2")]
Expand Down
24 changes: 24 additions & 0 deletions crates/core_arch/src/x86/sse4a.rs
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,18 @@ pub unsafe fn _mm_insert_si64(x: __m128i, y: __m128i) -> __m128i {
/// Non-temporal store of `a.0` into `p`.
///
/// Writes 64-bit data to a memory location without polluting the caches.
///
/// # Safety
///
/// After using this intrinsic, but before any atomic operations occur, a call
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe
/// usage of this intrinsic must always end in `_mm_sfence()`.
///
/// Reading and writing to the memory stored-to by any other means, after any
/// nontemporal store has been used to write to that memory, but before the
/// use of `_mm_sfence()`, is discouraged. Such reads can lead to pipeline
/// stalls and yet-unspecified program behavior.
///
#[inline]
#[target_feature(enable = "sse4a")]
#[cfg_attr(test, assert_instr(movntsd))]
Expand All @@ -73,6 +85,18 @@ pub unsafe fn _mm_stream_sd(p: *mut f64, a: __m128d) {
/// Non-temporal store of `a.0` into `p`.
///
/// Writes 32-bit data to a memory location without polluting the caches.
///
/// # Safety
///
/// After using this intrinsic, but before any atomic operations occur, a call
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe
/// usage of this intrinsic must always end in `_mm_sfence()`.
///
/// Reading and writing to the memory stored-to by any other means, after any
/// nontemporal store has been used to write to that memory, but before the
/// use of `_mm_sfence()`, is discouraged. Such reads can lead to pipeline
/// stalls and yet-unspecified program behavior.
///
#[inline]
#[target_feature(enable = "sse4a")]
#[cfg_attr(test, assert_instr(movntss))]
Expand Down
10 changes: 10 additions & 0 deletions crates/core_arch/src/x86_64/sse2.rs
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,16 @@ pub unsafe fn _mm_cvttsd_si64x(a: __m128d) -> i64 {
/// To minimize caching, the data is flagged as non-temporal (unlikely to be
/// used again soon).
///
/// # Safety
///
/// After using this intrinsic, but before any atomic operations occur, a call
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe
/// usage of this intrinsic must always end in `_mm_sfence()`.
///
/// Reading and writing to the memory stored-to by any other means, after any
/// nontemporal store has been used to write to that memory, is discouraged.
/// Doing so can lead to pipeline stalls and yet-unspecified program behavior.
///
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_si64)
#[inline]
#[target_feature(enable = "sse2")]
Expand Down
Loading