Improve thread scheduler memory ordering #43418

vtjnash · 2021-12-13T20:31:39Z

Some of these should probably land separately, but collecting these as I go along with trying to improve thread-safety, and similar support.

src/threading.c

vchuravy · 2022-01-06T18:00:06Z

bump! some of this looks very good to have.

giordano · 2022-01-20T17:36:24Z

Confirming also here that 6bba175 appears to fix #41820 for me.

Pickup TSAN/ASAN and LLVM assertion build fixes from #43418

tkf · 2022-01-21T04:11:06Z

src/partr.c

-            if (!multiq_check_empty()) {
+            // acquire sleep-check lock
+            jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping);
+            jl_fence(); // add a sequenced-before edge between the sleep_check_state modify and get_next_task checks


Why not

Suggested change

jl_fence(); // add a sequenced-before edge between the sleep_check_state modify and get_next_task checks

jl_fence(); // ensures sleep_check_state is sequentially-consistent with enqueuing the task

as you wrote in jl_wakeup_thread? Since sequenced-before is a single-thread property (and we get it "for free"), we don't need a fence and I don't think this is very informative here. Rather, we need this for the global acyclicity of SC. (I usually informally reason that this is for "avoiding store-load reordering" because it's a Dekker-like situation.)

Yeah, this is annoying to describe properly. I probably want seq-cst happens-before here. But it is not with enqueuing of the task, since that might only be atomic release. And anyways, seq-cst stores only create a global ordering with other seq-cst operations and fences, but IIUC, does not necessarily enforce an ordering on other nearby operations.

IANAL but I think a somewhat concrete argument would be something like the following. Using the terminology by Lahav et al. (2017) (as standardese is hard...) we see that we can have two reads-before edges:

rb1: multiq_check_empty -> multiq_insert

rb2: jl_atomic_load_relaxed(&ptls->sleep_check_state) in jl_wakeup_thread -> jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping)

(Since they are inter-thread edges that start from a read and end with a write, they do not contribute to the usual happens-before edges.)

(Since rb1 touches multiple memory locations, I guess it's more like a "bundle of edges".)

Let us also name the two fences sandwiching these edges:

F1: the first jl_fence in jl_wakeup_thread

F2: the jl_fence in jl_task_get_next that I quoted above

These fences do double duty of promoting the two reads-before edges to the partial SC relation edges:

sc1: F2 -> rb1 -> F1

sc2: F1 -> rb2 -> F2

Since both sc1 and sc2 are part of the partial SC relation that is acyclic under C+20 memory model, they can't form a cycle. We can use this to show that the following unwated execution is impossible:

Initial state:

queue is empty

sleep_check_state is not_sleeping

Thread 1 (dequeuer) does:

1a: jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping)

1b: multiq_check_empty returns true

Thread 2 (enqueuer) does:

2a: multiq_insert

2b: jl_atomic_load_relaxed(&ptls->sleep_check_state) in jl_wakeup_thread returns not_sleeping

i.e., the dequeuer misses the enqueue and enqueuer misses the sleep state transition.

This is impossible because it forms a cycle

2b -> 1a (due to the value read in 2b)

1a -> 1b (sequenced-before)

1b -> 2a (due to the value read in 1b)

2a -> 2b (sequenced-before)

which is forbidden because both 1b -> 2a (rb1) and 2b -> 1a (rb2) are part of the partial SC relation, thanks to the SC fences.

But this argument is essentially a copy of the argument about the standard "store buffering litmus test" so I don't think it's helpful to expand the explanation this much.

But I think it's very helpful to document which fences "pair up." Maybe we can use markdown footnote-like comment like this

jl_fence(); // [^store_buffering_1]

in the first fence in jl_wakeup_thread and the fence of jl_task_get_next that I quoted above? We can then document which stores and loads are taken care of by this fence outside the function body (so that it's OK to write a somewhat lengthy comment).

... } // [^store_buffering_1]: These fences are used to avoid the cycle 2b -> 1a -> 1b -> 2a -> 2b where // * Dequeuer: // * 1a: `jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping)` // * 1b: `multiq_check_empty` returns true // * Enqueuer: // * 2a: `multiq_insert` // * 2b: `jl_atomic_load_relaxed(&ptls->sleep_check_state)` in `jl_wakeup_thread` returns `not_sleeping` // i.e., the dequeuer misses the enqueue and enqueuer misses the sleep state transition.

src/partr.c

tkf · 2022-01-21T05:32:47Z

src/partr.c

-        uv_mutex_unlock(&sleep_locks[tid]);
+
+    if (jl_atomic_load_relaxed(&other->sleep_check_state) == sleeping) {
+        if (jl_atomic_cmpswap_relaxed(&other->sleep_check_state, &state, not_sleeping)) {


I wonder if you technically need acq_rel ordering or a manual fence for the success path for a happens-before edge to jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping) in jl_task_get_next. But it would be very counter-intuitive if you need one...

(I also wondered if it can disrupt the Dekker-like pattern in jl_task_get_next and jl_wakeup_thread. But IIUC this is fine since the RMW edge extends the release sequence edge.)

We do not release this elsewhere, so I think that ordering would be invalid here. In fact, I think the only valid ordering for this operation is relaxed, because we use relaxed stores elsewhere.

The relaxed swap is not permitted to fail spuriously, so if we get here, we know at least one thread will complete the transition from sleeping->not_sleeping, and that thread will also ensure that to wake the condition signal. (we might get more than one thread completing this transition, if we managed to be very slow here, and the thread was already awake for other reasons, finished processing the work, and then went back to sleep already. But we are not worried about spurious wake-ups.)

Right..., I just realized that fence for jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping) is on the opposite side so it can't start a synchronizes-with edge (unless we put a fence before jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping) as well or strengthen its ordering).

I think the only valid ordering for this operation is relaxed, because we use relaxed stores elsewhere.

I believe it's OK. We can get happens-before edges even with mixed memory orderings as long as they are at least relaxed (not non-atomic) and we put fences before the write and after the read. See fence-atomic and atomic-fence synchronizations in https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence

The relaxed swap is not permitted to fail spuriously,

I think the difference between strong and weak CASes is orthogonal to the memory ordering, at least for the memory model of the abstract machine. compare_exchange_weak and compare_exchange_strong both have memory ordering arguments. Also, in the formal declarative memory model (by Lahav et al. 2017, which is the basis of C++20 memory model), the distinction of strong and weak CAS do not exist; IIUC, failed weak CASes just appear in the execution graph as repeated read immediately followed by a write (i.e., failed writes are ignored).

But we are not worried about spurious wake-ups.

I was more worried about the case that somehow the wake-up is missed. Intuitively, I thought you need to send the signal "after" you know that jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping) has happened. But since relaxed memory read/write doesn't enforce any ordering, we technically can't assume what the other task (that is supposedly about to sleep) is actually doing. I cannot help wondering if I'm either taking C/C+ committee's FUD way too seriously and/or I'm still misunderstanding how C++20 memory model works. It's really bizarre that we can't rely on the "causality" through relaxed memory accesses alone. IIRC, Hans Boehm has a very intricate example involving GVN and speculative execution but I don't know if that applies here.

That page on fences doesn't appear to mention seq_cst, and that is what we require here. Merely knowing acquire and release is insufficient to provide any interesting constraint here on the behavior of the program. What we need to prove is that the release on thread 1 (the store to sleep_check_state) has a global consistent ordering with the release on thread 2 (the store to the workqueue) relative to the subsequent checks on those threads for the contents of the workqueue or the sleep_check_state, respectively.

For the seq_cst properties, we have that the description of the fence on https://en.cppreference.com/w/cpp/atomic/memory_order:

2) the sequentially-consistent fences are only establishing total ordering for the fences themselves, not for the atomic operations in the general case (sequenced-before is not a cross-thread relationship, unlike happens-before)

I was more worried about the case that somehow the wake-up is missed

Once we reach this point, we are into the land of locks and condition objects, which are hopefully much simpler to analyze and declare safe in this case.

https://en.cppreference.com/w/cpp/atomic/memory_order:

2) the sequentially-consistent fences are only establishing total ordering for the fences themselves, not for the atomic operations in the general case (sequenced-before is not a cross-thread relationship, unlike happens-before)

I think that's a bit stale info on cppreference side. That's actually one of the major improvements in the C++20 memory model. P0668R4: Revising the C++ memory model includes a patch to remove similar sentences from the standard. In particular, the first paragraphs in the section Strengthen SC fences provides nice background information.

[Notes for people who want to dig into the original resource; e.g., future me] The relevant definition in Lahav et al. is:

The [E^sc] prefix and suffix in psc_base indicate that arbitrary SC evens, including reads and writes, can start/end the partial SC relation, provided that read/write starts or ends the SC-before relation scb. On the other hand, fences ([F^sc]) can work in more general setting where scb is prefixed/suffixed by the happens-before relation hb (i.e., which is indicated by [F^sc]; hb^? and hb^?; [F^sc]). For example, this indicates that reads-before rb relation starting with a SC read and ends with a SC fence ([E^sc]; rb; [F^sc]) is a part of psc_base and hence psc.

Once we reach this point, we are into the land of locks and condition objects, which are hopefully much simpler to analyze and declare safe in this case.

Yes, I agree. I was just curious to know how do I reason about this if I want to do it rigorously.

src/jl_uv.c

The svec set method needs to do a little bit more work (for asserts and write barriers), which is not necessary if we know it was just allocated.

These were previously implied (on TSO platforms, such as x86) by the atomic_store to sleeping and the sleep_locks acquire before the wake_check loop, but this makes it more explicit. We might want to consider in the future if it would be better (faster) to acquire each possible lock on the sleeping path instead, so that we do each operation with seq_cst, instead of using a fence to only order the operations we care about directly.

Ensures better that we support nesting of threaded regions and access to this variable from any thread. Previously, it appears that we might toggle this variable on another thread (via thread migration), but would then not successfully wake up the thread 0 and cause it to resume sleeping on the event loop.

vtjnash requested review from tkf, JeffBezanson and Keno December 13, 2021 20:31

simeonschaub marked this pull request as ready for review December 13, 2021 20:39

simeonschaub marked this pull request as draft December 13, 2021 20:40

tkf mentioned this pull request Dec 15, 2021

Make task sticky before entering threaded region #43420

Closed

tkf reviewed Dec 15, 2021

View reviewed changes

src/threading.c Outdated Show resolved Hide resolved

vchuravy mentioned this pull request Jan 6, 2022

Pickup ASAN and LLVM assertion build fixes from #43418 #43685

Merged

vtjnash added a commit that referenced this pull request Jan 20, 2022

Merge pull request #43685 from JuliaLang/vc/llvm_asan

2e854a4

Pickup TSAN/ASAN and LLVM assertion build fixes from #43418

vtjnash force-pushed the improve-threads branch from 2a6a8ab to c3f7396 Compare January 20, 2022 20:37

vtjnash marked this pull request as ready for review January 20, 2022 20:38

vtjnash changed the title ~~Improve threads~~ Improve thread scheduler memory ordering Jan 20, 2022

vchuravy added this to the 1.8 milestone Jan 20, 2022

vchuravy added domain:io Involving the I/O subsystem: libuv, read, write, etc. domain:multithreading Base.Threads and related functionality labels Jan 20, 2022

tkf reviewed Jan 21, 2022

View reviewed changes

vtjnash added 4 commits January 26, 2022 13:01

svec: optimize allocations in debug

7b636ff

The svec set method needs to do a little bit more work (for asserts and write barriers), which is not necessary if we know it was just allocated.

handle exit signal failure

9c7cfa9

vtjnash force-pushed the improve-threads branch from c3f7396 to 0eaf35a Compare January 26, 2022 18:02

vtjnash added the status:merge me PR is reviewed. Merge when all tests are passing label Jan 26, 2022

tkf merged commit 05e0cb9 into master Jan 27, 2022

tkf deleted the improve-threads branch January 27, 2022 01:42

tkf removed the status:merge me PR is reviewed. Merge when all tests are passing label Jan 27, 2022

giordano mentioned this pull request Jun 10, 2022

Darwin/ARM64: Julia freezes on @threads loops #45626

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve thread scheduler memory ordering #43418

Improve thread scheduler memory ordering #43418

vtjnash commented Dec 13, 2021

vchuravy commented Jan 6, 2022

giordano commented Jan 20, 2022 •

edited

Loading

tkf Jan 21, 2022

vtjnash Jan 21, 2022

tkf Jan 23, 2022 •

edited

Loading

tkf Jan 23, 2022 •

edited

Loading

tkf Jan 21, 2022

vtjnash Jan 21, 2022

tkf Jan 22, 2022 •

edited

Loading

vtjnash Jan 23, 2022

tkf Jan 23, 2022

	jl_fence(); // add a sequenced-before edge between the sleep_check_state modify and get_next_task checks
	jl_fence(); // ensures sleep_check_state is sequentially-consistent with enqueuing the task

Improve thread scheduler memory ordering #43418

Improve thread scheduler memory ordering #43418

Conversation

vtjnash commented Dec 13, 2021

vchuravy commented Jan 6, 2022

giordano commented Jan 20, 2022 • edited Loading

tkf Jan 21, 2022

Choose a reason for hiding this comment

vtjnash Jan 21, 2022

Choose a reason for hiding this comment

tkf Jan 23, 2022 • edited Loading

Choose a reason for hiding this comment

tkf Jan 23, 2022 • edited Loading

Choose a reason for hiding this comment

tkf Jan 21, 2022

Choose a reason for hiding this comment

vtjnash Jan 21, 2022

Choose a reason for hiding this comment

tkf Jan 22, 2022 • edited Loading

Choose a reason for hiding this comment

vtjnash Jan 23, 2022

Choose a reason for hiding this comment

tkf Jan 23, 2022

Choose a reason for hiding this comment

giordano commented Jan 20, 2022 •

edited

Loading

tkf Jan 23, 2022 •

edited

Loading

tkf Jan 23, 2022 •

edited

Loading

tkf Jan 22, 2022 •

edited

Loading