Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document movnt needs sfence #1457

Closed

Conversation

workingjubilee
Copy link
Member

For every intrinsic that may generate any of the MOVNT family of instructions, specify it must be followed by _mm_sfence.

Also, ask people to not think too hard about what actually happens with write-combining memory buffers. They probably don't want to know, and in terms of the Rust abstract machine, we aren't actually entirely sure yet.

@rustbot
Copy link
Collaborator

rustbot commented Aug 9, 2023

r? @Amanieu

(rustbot has picked a reviewer for you, use r? to override)

For every intrinsic that may generate any of the MOVNT family
of instructions, specify it must be followed by `_mm_sfence`.

Also, ask people to not think too hard about what actually
happens with write-combining memory buffers. They probably
don't want to know, and in terms of the Rust abstract machine,
we aren't actually entirely sure yet.
Comment on lines +1688 to +1690
/// After using this intrinsic, but before any atomic operations occur, a call
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe
/// usage of this intrinsic must always end in `_mm_sfence()`.
Copy link
Member

@the8472 the8472 Aug 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bit too strict. There are niche scenarios where one would write out data without ever reading it again or at least without reading it again on another thread.
In those cases some later release write would be incidental and only meant to order other, regular writes.

Maybe the entire requirement could be conditional on "if the written memory is intended to be made accessible on another thread through a release operation", with the recommendation "if in doubt, add a fence".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, I don't think it's coherent to leave an unsatisfied obligation hanging unless it's still in unsafe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That depends on the perspective. Yes, it's an obligation to... restore consistency with the current rust memory model which assumes that all writes must be ordered with a release operation.
But under some unspecified extended model it may be valid to leave some memory locations unordered.
AIUI the purpose of ordering is to avoid data races which are UB. If the another thread never accesses the memory then this is unspecified behavior but not necessarily UB.

E.g. if you're writing bytes to a framebuffer that's concurrently being scanned out by the GPU then the fence doesn't add anything. You're not synchronizing with anything. You're just racing against time. Either the write makes out to the pixels or it doesn't.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the conservative definition is fine for now. But it could be phrased in a way that makes it clear that it may be replaced with a more refined definition at some point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to tell people to always discharge the obligation (even in the case you mention, the store-store fence would serialize any remaining deferred write-buffers, which is desired if you might run out of time otherwise!) and relax things when we mechanize a better spec.

///
/// After using this intrinsic, but before any atomic operations occur, a call
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe
/// usage of this intrinsic must always end in `_mm_sfence()`.
Copy link
Member

@RalfJung RalfJung Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are currently moving into a different direction in rust-lang/rust#114582: if we follow @talchas' approach, the rule will be: after using this intrinsic, but (happens-)before any other read or write of this location in any thread, a fence must occur (the fence must be in the same thread that called the intrinsic).

Basically, the write actually happens in another thread via a non-atomic store, so accessing this location may cause a data race. Doing a fence waits for that other thread to complete its store, avoiding the data race.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I tried to keep the rationale for the requirement slightly vague as we discuss it, and approach it as more of a "do this or your program explodes into flames" type of warning, which is why it goes on to explain further reads or writes to the location are discouraged. So with that in mind, what is this missing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if we'd call that UB instead of merely "discouraged".

Ideally even further nontemporal writes (before the next sfence) would be UB... I'd find it strange to have a situation where a nontemporal store would be allowed but a regular store would not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #1534 with my alternative wording.

Comment on lines +1692 to +1695
/// Reading and writing to the memory stored-to by any other means, after any
/// nontemporal store has been used to write to that memory, but before the
/// use of `_mm_sfence()`, is discouraged. Such reads can lead to pipeline
/// stalls and yet-unspecified program behavior.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Well, both x86's movntdq and Arm's stnp retain local-thread serial ordering if you use them twice on the same location, i.e. that using _mm_stream_ps(ptr, a); _mm_stream_ps(ptr, b); still will yield b when you later read it, and coalescing such writes is traditionally allowed by compilers, so I would prefer to retain the more ambiguous verbiage I used here, @RalfJung. Mostly, I said "any other means" to cast a wide net so that people avoid it rather than inviting people to try to reason about in a way that might end in them deciding it's suddenly okay to perform an atomic store or atomic load on the memory from another thread.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree that we should canonically allow _mm_stream_ps(ptr, a); _mm_store_ps(ptr, b); _mm_sfence(); eventually but I would prefer it live in an in-between state until we are clear on our mechanics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that the proposed doc just doesn't align with the proposed spec in rust-lang/rust#114582. If you go with @talchas' original proposed spec, the doc should be something like

"After the _mm_stream_ps until the next time this thread calls _mm_sfence, any attempt to access (read or write) this memory from this thread by means other than _mm*_stream* operations is UB. Furthermore, any access from another threads must synchronize-with that _mm_sfence or else any access (read or write) from the other thread is UB, including _mm*_stream* operations.

Basically, the nontemporal writes performed by streaming operations should be considered not ordered even with other operations in the thread they appear in, except with other nontemporal writes, until the next sfence which establishes synchronization with all nontemporal writes of the current thread."

That would allow your example. Crucially it disallows _mm_stream_ps(ptr, a); *ptr = a; _mm_sfence(); and _mm_stream_ps(ptr, a); println!("{:?}", *ptr); _mm_sfence(); which is disallowed by the proposed spec. We should probably have examples in the docs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, _mm_store_ps is morally equivalent to assignment.

Copy link
Member

@RalfJung RalfJung Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, I misread your example and thought it said stream twice.

If we make _mm_store_ps also inline assembly we might be able to come up with a mechanism that allows it. But for regular assignment I'm quite worried that LLVM optimizations will make deductions from seeing an assignment that are incompatible with the nontemporal state of that location.

Copy link
Member

@RalfJung RalfJung Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a myth that "you cannot read uninitialized memory" and I think it is harmful. I've encountered tons of confusion caused by people axiomatically thinking "I cannot read uninit memory". Instead we should teach people what actually happens in our spec: when reading memory at a certain type, that memory must be sufficiently "valid" for this type. If you read at bool, it must be 0 or 1. If you read at i32, it must be initialized. If you read at MaybeUninit<_>, it can be anything. I think telling people they can't read from uninit memory is a complete distraction and leads to an unnecessarily complicated mental model. There's a reason that "reading from uninit memory" does not show up in this list.

But anyway, it seems unlikely we will get to an agreement on terminology or teaching philosophy here and that's all rather off-topic anyway. The on-topic question is whether we should allow people to perform regular writes in between the nontemporal write and the fence. I don't see a good motivation for doing that, and it opens some tricky questions that we'd need to carefully figure out before allowing it. For instance, if I do a nontemporal write and then a regular write, do I even still need the fence or is it now guaranteed that the next "release" operation synchronizes everything properly?

Copy link
Member Author

@workingjubilee workingjubilee Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ehnn I mean yes, just...

...Anyways I think, dreadfully, to actually answer your question about the write series (i.e.

movnti [somewhere], something
mov [somewhere], something
mov [flag], 1

), that on x86 you kiiiinda still need the fence, from the ISA's point of view?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. The local results should be consistent absolutely, but I am concerned about code that may look more like moving overlapping sizes, so

vmovntdq [somewhere], ymm09
mov [somewhere], r09
mov [flag], 1

The first 8 bytes are guaranteed once the flag is set due to the two regular movs participating in TSO, but the vmovntdq means 24 bytes are in an ambiguous state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do I even still need the fence or is it now guaranteed that the next "release" operation synchronizes everything properly?

If the answer to that is yes then I don't think we can allow regular writes. A regular write in Rust is guaranteed to be properly released by a release operation, after all.

We can have intrinsics that do "regular write to something that might be in the nontemporal state", but those need the same inline-assembly-and-Rust-replacement-code-spec treatment as streaming writes.

I'd really prefer if we didn't have to do this... the usecases we are aware of don't need this, do they?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, alright then, I didn't realize that was the specific way the sequencing invariants were formed. It's possible the reality may be amenable to regular writes (assuming a sort of "wholesale adoption of the x86 mechanics into Rust" approach for this case) but I'd have to examine the rules... very closely.

I don't think the majority of use cases need this in practice, correct, so now that we have hashed out that issue as existing, then I am happy to let this one go. I will update the documentation changes here accordingly.

@workingjubilee workingjubilee marked this pull request as draft August 10, 2023 21:33
@workingjubilee
Copy link
Member Author

Will address the comments after the conversation settles a bit.

@bors
Copy link
Contributor

bors commented Feb 21, 2024

☔ The latest upstream changes (presumably c8ae80e) made this pull request unmergeable. Please resolve the merge conflicts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants