blog: reducing tail latencies with auto yielding #422

Merged 18 commits on Apr 1, 2020
184 changes: 184 additions & 0 deletions content/blog/2020-04-preemption.md
@@ -0,0 +1,184 @@
+++
date = "2020-03-31"
title = "Reducing tail latencies with automatic cooperative task yielding"
description = "April 1, 2020"
menu = "blog"
weight = 982
+++

Tokio is a runtime for asynchronous Rust applications. It allows writing code
using `async` & `await` syntax. For example:

```rust
let mut listener = TcpListener::bind(&addr).await?;

loop {
    let (mut socket, _) = listener.accept().await?;

    tokio::spawn(async move {
        // handle socket
    });
}
```

The Rust compiler transforms this code into a state machine. The Tokio runtime
executes these state machines, multiplexing many tasks on a handful of threads.
Tokio's scheduler requires that the generated task's state machine yield control
back to the scheduler in order to multiplex tasks. Each `.await` call is an
opportunity to yield back to the scheduler. In the above example,
`listener.accept().await` will return a socket if one is pending. If there are
no pending sockets, control is yielded back to the scheduler.
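
To make this concrete, here is a minimal hand-written future (a sketch for
illustration, not code from the post): returning `Poll::Pending` is how a task
hands the thread back to the scheduler, and waking the task tells the scheduler
to poll it again later.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Illustrative only: a future that yields to the scheduler exactly once.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // Ask to be polled again, then give the thread back to the scheduler.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}
```

`listener.accept().await` behaves the same way: when no connection is pending,
its `poll` returns `Pending`, and the task is woken once a connection arrives.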

This system works well in most cases. However, when a system comes under load,
it is possible for an asynchronous resource to always be ready. For
example, consider an echo server:

```rust
tokio::spawn(async move {
    let mut buf = [0; 1024];

    loop {
        let n = socket.read(&mut buf).await?;

        if n == 0 {
            break;
        }

        // Write the data back
        socket.write_all(&buf[..n]).await?;
    }
});
```


If data is received faster than it can be processed, it is possible that more
data will have already arrived by the time the processing of a data chunk
completes. In this case, `.await` will never yield control back to the
scheduler, other tasks will not be scheduled, and the result is starvation and
large latency variance.

Currently, the answer to this problem is that the user of Tokio is responsible
for adding yield points every so often. In practice, very few actually do this
and end up being vulnerable to this sort of problem.
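
As a concrete (hypothetical) illustration, the echo loop above could add a
manual yield point by calling `tokio::task::yield_now()` once per iteration:

```rust
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;

// Sketch only: the echo loop from above with a hand-inserted yield point.
async fn echo(mut socket: TcpStream) -> std::io::Result<()> {
    let mut buf = [0; 1024];

    loop {
        let n = socket.read(&mut buf).await?;

        if n == 0 {
            return Ok(());
        }

        socket.write_all(&buf[..n]).await?;

        // Manual yield point: suspend once so other tasks get a chance to
        // run, even if this socket always has data ready.
        tokio::task::yield_now().await;
    }
}
```

This works, but it is easy to forget, and deciding where to place the yield
points is left entirely to the application author.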

A common solution to this problem is preemption. With normal OS threads, the
kernel will interrupt execution every so often in order to ensure fair
scheduling of all
threads. Runtimes that have full control over execution (Go, Erlang, etc.)
will also use preemption to ensure fair scheduling of tasks. This is
accomplished by injecting yield points — code which checks if the task has been
executing for long enough and yields back to the scheduler if so — at compile-time.
Unfortunately, Tokio is not able to use this technique as Rust's `async` generators
do not provide any mechanism for executors (like Tokio) to inject such yield points.

## Per-task operation budget

Even though Tokio is not able to **preempt**, there is still an opportunity to
nudge a task to yield back to the scheduler. As of [0.2.14], each Tokio task has
an operation budget. This budget is reset when the scheduler switches to the
task. Each Tokio resource (socket, timer, channel, ...) is aware of this
budget. As long as the task has budget remaining, the resource operates as it did
previously. Each asynchronous operation (actions that users must `.await` on)
decrements the task's budget. Once the task is out of budget, all resources will
perpetually return "not ready" until the task yields back to the scheduler. At that point,
the budget is reset, and future `.await`s on Tokio resources will again function normally.
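
Conceptually, the budget check each resource performs looks something like the
following sketch. This is a simplification for illustration, not Tokio's actual
internals; in particular, the real implementation also wakes the task when it
forces a yield, so the scheduler knows to poll it again.

```rust
use std::cell::Cell;
use std::task::Poll;

thread_local! {
    // One budget per task; the scheduler resets it before polling a task.
    static BUDGET: Cell<u32> = Cell::new(128);
}

// Called by the scheduler each time it starts running a task.
fn reset_budget() {
    BUDGET.with(|budget| budget.set(128));
}

// Called by each resource (socket, timer, channel, ...) before doing work.
// Once the budget is exhausted, the resource reports "not ready", which
// forces the task to yield back to the scheduler.
fn poll_proceed() -> Poll<()> {
    BUDGET.with(|budget| {
        let remaining = budget.get();
        if remaining == 0 {
            Poll::Pending
        } else {
            budget.set(remaining - 1);
            Poll::Ready(())
        }
    })
}
```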

Let's go back to the echo server example from above. When the task is scheduled, it
is assigned a budget of 128 operations. When `socket.read(..)` and
`socket.write(..)` are called, the budget is decremented. If the budget is zero,
the task yields back to the scheduler. If either `read` or `write` cannot
proceed due to the underlying socket not being ready (no pending data or a full
send buffer), then the task also yields back to the scheduler.

The idea originated from a conversation I had with [Ryan Dahl][ry]. He is
using Tokio as the underlying runtime for [Deno][deno]. When doing some HTTP
experimentation with [Hyper] a while back, he was seeing some high tail
latencies in some benchmarks. The problem was due to a loop not yielding back to
the scheduler under load. Hyper ended up fixing the problem by hand in this one
case, but Ryan mentioned that, when he worked on [node.js][node], they handled
the problem by adding **per resource** limits. So, if a TCP socket was always
ready, it would force a yield every so often. I mentioned this conversation to
[Jon Gjengset][jonhoo], and he came up with the idea of placing the limit on
the task itself instead of on each resource.

The end result is that Tokio should be able to provide more consistent runtime
behavior under load. While the exact heuristics will most likely be tweaked over
time, initial measurements show that, in some cases, tail latencies are reduced
by almost 3x.

[![benchmark](https://user-images.githubusercontent.com/176295/73222456-4a103300-4131-11ea-9131-4e437ecb9a04.png)](https://user-images.githubusercontent.com/176295/73222456-4a103300-4131-11ea-9131-4e437ecb9a04.png)

"master" is before the automatic yielding and "preempt" is after. Click for a
bigger version, see also the original [PR comment][pr] for more details.

## A note on blocking

Although automatic cooperative task yielding improves performance in many cases,
it cannot preempt tasks. Users of Tokio must still take care to avoid both
CPU-intensive work and blocking APIs. The [`spawn_blocking`][spawn_blocking] function
can be used to "asyncify" these sorts of tasks by running them on a thread pool
where blocking is allowed.
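
For example, a blocking filesystem call might be wrapped like this (a sketch;
`load_config` and the file path are hypothetical, not APIs from the post):

```rust
use tokio::task;

// Sketch: "asyncify" a blocking operation by running it on the dedicated
// blocking thread pool instead of a scheduler worker thread.
async fn load_config() -> std::io::Result<String> {
    task::spawn_blocking(|| {
        // Synchronous, blocking file I/O is fine here.
        std::fs::read_to_string("config.toml")
    })
    .await
    .expect("the blocking task panicked")
}
```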

Tokio does not, and will not, attempt to detect blocking tasks and automatically
compensate by adding threads to the scheduler. This question has come up a
number of times in the past, so allow me to elaborate.

For context, the idea is for the scheduler to include a monitoring thread. This
thread would poll scheduler threads every so often and check that workers are
making progress. If a worker is not making progress, it is assumed that the
worker is executing a blocking task, and a new thread should be spawned to
compensate.

This idea is not new. The first occurrence of this strategy that I am aware of
is in the .NET thread pool, where it was introduced more than ten years ago.
Unfortunately, the strategy has a number of problems, and because of this, it
has not been adopted by other thread pools / schedulers (Go, Java, Erlang, etc.).

The first problem is that it is very hard to define "progress". A naive
definition of progress is whether or not a worker has been stuck on the same
task for more than some unit of time. For example, if a worker has been polling
the same task for more than 100ms, then that worker is flagged as blocked and a
new thread is spawned. With this definition, how does one detect scenarios
where spawning a new
thread **reduces** throughput? This can happen when the scheduler is generally
under load and adding threads would make the situation much worse. To combat
this, the .NET thread pool uses [hill climbing][hill].

The second problem is that any automatic detection strategy will be vulnerable
to bursty or otherwise uneven workloads. This specific problem has been the bane
of the .NET thread pool and is known as the "stuttering" problem. The hill
climbing strategy requires some period of time (hundreds of milliseconds) to
adapt to load changes. This time period is needed, in part, to be able to
determine that adding threads is improving the situation and not making it
worse.

The stuttering problem can be managed with the .NET thread pool, in part,
because the pool is designed to schedule **coarse** tasks, i.e. tasks that
execute on the order of hundreds of milliseconds to multiple seconds. However,
in Rust, asynchronous task schedulers are designed to schedule tasks that should
run on the order of microseconds to tens of milliseconds at most. In this case,
any stuttering problem from a heuristic-based scheduler will result in far
greater latency variations.

The most common follow-up question I've received after this is "doesn't the Go
scheduler automatically detect blocked tasks?". The short answer is: no. Doing
so would result in the same stuttering problems as mentioned above. Also, Go has
no need to have generalized blocked task detection because Go is able to
preempt. What the Go scheduler **does** do is annotate potentially blocking
system calls. This is roughly equivalent to the Tokio APIs
[`spawn_blocking`][spawn_blocking] and [`block_in_place`][block_in_place].
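
As a rough sketch of the difference (the function below is hypothetical):
`block_in_place` runs the blocking closure inline on the current worker thread,
after letting the runtime move other queued work away from that worker, whereas
`spawn_blocking` ships the closure off to the dedicated blocking thread pool.

```rust
use tokio::task;

// Sketch: run a blocking call inline without starving other tasks.
// Note: block_in_place requires the multi-threaded scheduler.
async fn read_blocking(path: &str) -> std::io::Result<Vec<u8>> {
    task::block_in_place(|| std::fs::read(path))
}
```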

In short, as of now, the automatic cooperative task yielding strategy that has
just been introduced is the best we have found for reducing tail latencies.

<div style="text-align:right">&mdash;Carl Lerche</div>


[0.2.14]: #
[ry]: https://github.com/ry
[deno]: https://github.com/denoland/deno
[Hyper]: https://github.com/hyperium/hyper
[node]: https://nodejs.org
[jonhoo]: https://github.com/jonhoo/
[pr]: https://github.com/tokio-rs/tokio/pull/2160#issuecomment-579004856
[spawn_blocking]: https://docs.rs/tokio/0.2/tokio/task/fn.spawn_blocking.html
[block_in_place]: https://docs.rs/tokio/0.2/tokio/task/fn.block_in_place.html
[hill]: https://en.wikipedia.org/wiki/Hill_climbing
2 changes: 1 addition & 1 deletion layouts/partials/header.html
@@ -51,7 +51,7 @@
   </li>
   <li class="nav-item">
     {{ $isCommunity := hasPrefix .URL "/community" }}
-    <a class="nav-link {{ if $isCommunity }} active {{ end }}" href="{{ ref . "/community.md" }}">Community</a>
+    <a class="nav-link {{ if $isCommunity }} active {{ end }}" href="{{ ref . "/community" }}">Community</a>
   </li>
   <li class="nav-item">
     {{ $blog := index (.Site.Menus.blog) 0 }}