blog: reducing tail latencies with auto yielding #422

Merged 18 commits on Apr 1, 2020
184 changes: 184 additions & 0 deletions content/blog/2020-04-preemption.md
@@ -0,0 +1,184 @@
+++
date = "2020-03-31"
title = "Reducing tail latencies with automatic cooperative task yielding"
description = "April 1, 2020"
menu = "blog"
weight = 982
+++

Tokio is a runtime for asynchronous Rust applications. It allows writing code
using `async` & `await` syntax. For example:

```rust
let mut listener = TcpListener::bind(&addr).await?;

loop {
    let (mut socket, _) = listener.accept().await?;

    tokio::spawn(async move {
        // handle socket
    });
}
```

The Rust compiler transforms this code into a state machine. The Tokio runtime
executes these state machines, multiplexing many tasks on a handful of threads.
Tokio's scheduler requires that the generated task's state machine yield control
back to the scheduler in order to multiplex tasks. Each `.await` call is an
opportunity to yield back to the scheduler. In the above example,
`listener.accept().await` will return a socket if one is pending. If there are
no pending sockets, control is yielded back to the scheduler.
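
To make this concrete, here is a minimal hand-written future (a sketch for
illustration, not code from the post): returning `Poll::Pending` is how a task
hands the thread back to the scheduler, and waking the task tells the scheduler
to poll it again later.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Illustrative only: a future that yields to the scheduler exactly once.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // Ask to be polled again, then give the thread back to the scheduler.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}
```

`listener.accept().await` behaves the same way: when no connection is pending,
its `poll` returns `Pending`, and the task is woken once a connection arrives.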

This system works well in most cases. However, when a system comes under load,
it is possible for an asynchronous resource to always be ready. For
example, consider an echo server:

```rust
tokio::spawn(async move {
    let mut buf = [0; 1024];

    loop {
        let n = socket.read(&mut buf).await?;

        if n == 0 {
            break;
        }

        // Write the data back
        socket.write_all(&buf[..n]).await?;
    }
});
```


If data is received faster than it can be processed, it is possible that more
data will have already arrived by the time the processing of a data chunk
completes. In this case, `.await` will never yield control back to the
scheduler, other tasks will not be scheduled, and the result is starvation and
large latency variance.

Currently, the answer to this problem is that the user of Tokio is responsible
for adding yield points every so often. In practice, very few actually do this
and end up being vulnerable to this sort of problem.
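
As a concrete (hypothetical) illustration, the echo loop above could add a
manual yield point by calling `tokio::task::yield_now()` once per iteration:

```rust
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;

// Sketch only: the echo loop from above with a hand-inserted yield point.
async fn echo(mut socket: TcpStream) -> std::io::Result<()> {
    let mut buf = [0; 1024];

    loop {
        let n = socket.read(&mut buf).await?;

        if n == 0 {
            return Ok(());
        }

        socket.write_all(&buf[..n]).await?;

        // Manual yield point: suspend once so other tasks get a chance to
        // run, even if this socket always has data ready.
        tokio::task::yield_now().await;
    }
}
```

This works, but it is easy to forget, and deciding where to place the yield
points is left entirely to the application author.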

A common solution to this problem is preemption. With normal OS threads, the
kernel will interrupt execution every so often in order to ensure fair
scheduling of all
threads. Runtimes that have full control over execution (Go, Erlang, etc.)
will also use preemption to ensure fair scheduling of tasks. This is
accomplished by injecting yield points — code which checks if the task has been
executing for long enough and yields back to the scheduler if so — at compile-time.
Unfortunately, Tokio is not able to use this technique as Rust's `async` generators
do not provide any mechanism for executors (like Tokio) to inject such yield points.

## Per-task operation budget

Even though Tokio is not able to **preempt**, there is still an opportunity to
nudge a task to yield back to the scheduler. As of [0.2.14], each Tokio task has
an operation budget. This budget is reset when the scheduler switches to the
task. Each Tokio resource (socket, timer, channel, ...) is aware of this
budget. As long as the task has budget remaining, the resource operates as it did
previously. Each asynchronous operation (actions that users must `.await` on)
decrements the task's budget. Once the task is out of budget, all resources will
perpetually return "not ready" until the task yields back to the scheduler. At that point,
the budget is reset, and future `.await`s on Tokio resources will again function normally.
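
Conceptually, the budget check each resource performs looks something like the
following sketch. This is a simplification for illustration, not Tokio's actual
internals; in particular, the real implementation also wakes the task when it
forces a yield, so the scheduler knows to poll it again.

```rust
use std::cell::Cell;
use std::task::Poll;

thread_local! {
    // One budget per task; the scheduler resets it before polling a task.
    static BUDGET: Cell<u32> = Cell::new(128);
}

// Called by the scheduler each time it starts running a task.
fn reset_budget() {
    BUDGET.with(|budget| budget.set(128));
}

// Called by each resource (socket, timer, channel, ...) before doing work.
// Once the budget is exhausted, the resource reports "not ready", which
// forces the task to yield back to the scheduler.
fn poll_proceed() -> Poll<()> {
    BUDGET.with(|budget| {
        let remaining = budget.get();
        if remaining == 0 {
            Poll::Pending
        } else {
            budget.set(remaining - 1);
            Poll::Ready(())
        }
    })
}
```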

Let's go back to the echo server example from above. When the task is scheduled, it
is assigned a budget of 128 operations. When `socket.read(..)` and
`socket.write(..)` are called, the budget is decremented. If the budget is zero,
the task yields back to the scheduler. If either `read` or `write` cannot
proceed due to the underlying socket not being ready (no pending data or a full
send buffer), then the task also yields back to the scheduler.

The idea originated from a conversation I had with [Ryan Dahl][ry]. He is
using Tokio as the underlying runtime for [Deno][deno]. When doing some HTTP
experimentation with [Hyper] a while back, he was seeing some high tail
latencies in some benchmarks. The problem was due to a loop not yielding back to
the scheduler under load. Hyper ended up fixing the problem by hand in this one
case, but Ryan mentioned that, when he worked on [node.js][node], they handled
the problem by adding **per resource** limits. So, if a TCP socket was always
ready, it would force a yield every so often. I mentioned this conversation to
[Jon Gjengset][jonhoo], and he came up with the idea of placing the limit on
the task itself instead of on each resource.

The end result is that Tokio should be able to provide more consistent runtime
behavior under load. While the exact heuristics will most likely be tweaked over
time, initial measurements show that, in some cases, tail latencies are reduced
by almost 3x.

[![benchmark](https://user-images.githubusercontent.com/176295/73222456-4a103300-4131-11ea-9131-4e437ecb9a04.png)](https://user-images.githubusercontent.com/176295/73222456-4a103300-4131-11ea-9131-4e437ecb9a04.png)

"master" is before the automatic yielding and "preempt" is after. Click for a
bigger version, see also the original [PR comment][pr] for more details.

## A note on blocking

Although automatic cooperative task yielding improves performance in many cases,
it cannot preempt tasks. Users of Tokio must still take care to avoid both
CPU-intensive work and blocking APIs. The [`spawn_blocking`][spawn_blocking] function
can be used to "asyncify" these sorts of tasks by running them on a thread pool
where blocking is allowed.
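
For example, a blocking filesystem call might be wrapped like this (a sketch;
`load_config` and the file path are hypothetical, not APIs from the post):

```rust
use tokio::task;

// Sketch: "asyncify" a blocking operation by running it on the dedicated
// blocking thread pool instead of a scheduler worker thread.
async fn load_config() -> std::io::Result<String> {
    task::spawn_blocking(|| {
        // Synchronous, blocking file I/O is fine here.
        std::fs::read_to_string("config.toml")
    })
    .await
    .expect("the blocking task panicked")
}
```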

Tokio does not, and will not, attempt to detect blocking tasks and automatically
compensate by adding threads to the scheduler. This question has come up a
number of times in the past, so allow me to elaborate.

For context, the idea is for the scheduler to include a monitoring thread. This
thread would poll scheduler threads every so often and check that workers are
making progress. If a worker is not making progress, it is assumed that the
worker is executing a blocking task, and a new thread should be spawned to
compensate.

This idea is not new. The first occurrence of this strategy that I am aware of
is in the .NET thread pool, where it was introduced more than ten years ago.
Unfortunately, the strategy has a number of problems, and because of this, it
has not been adopted by other thread pools / schedulers (Go, Java, Erlang, etc.).

The first problem is that it is very hard to define "progress". A naive
definition of progress is whether or not a worker has been stuck on the same
task for more than some unit of time. For example, if a worker has been polling
the same task for more than 100ms, then that worker is flagged as blocked and a
new thread is spawned. With this definition, how does one detect scenarios
where spawning a new
thread **reduces** throughput? This can happen when the scheduler is generally
under load and adding threads would make the situation much worse. To combat
this, the .NET thread pool uses [hill climbing][hill].

The second problem is that any automatic detection strategy will be vulnerable
to bursty or otherwise uneven workloads. This specific problem has been the bane
of the .NET thread pool and is known as the "stuttering" problem. The hill
climbing strategy requires some period of time (hundreds of milliseconds) to
adapt to load changes. This time period is needed, in part, to be able to
determine that adding threads is improving the situation and not making it
worse.

The stuttering problem can be managed with the .NET thread pool, in part,
because the pool is designed to schedule **coarse** tasks, i.e. tasks that
execute on the order of hundreds of milliseconds to multiple seconds. However,
in Rust, asynchronous task schedulers are designed to schedule tasks that should
run on the order of microseconds to tens of milliseconds at most. In this case,
any stuttering problem from a heuristic-based scheduler will result in far
greater latency variations.

The most common follow-up question I've received after this is "doesn't the Go
scheduler automatically detect blocked tasks?". The short answer is: no. Doing
so would result in the same stuttering problems as mentioned above. Also, Go has
no need to have generalized blocked task detection because Go is able to
preempt. What the Go scheduler **does** do is annotate potentially blocking
system calls. This is roughly equivalent to the Tokio APIs
[`spawn_blocking`][spawn_blocking] and [`block_in_place`][block_in_place].
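
As a rough sketch of the difference (the function below is hypothetical):
`block_in_place` runs the blocking closure inline on the current worker thread,
after letting the runtime move other queued work away from that worker, whereas
`spawn_blocking` ships the closure off to the dedicated blocking thread pool.

```rust
use tokio::task;

// Sketch: run a blocking call inline without starving other tasks.
// Note: block_in_place requires the multi-threaded scheduler.
async fn read_blocking(path: &str) -> std::io::Result<Vec<u8>> {
    task::block_in_place(|| std::fs::read(path))
}
```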

In short, as of now, the automatic cooperative task yielding strategy that has
just been introduced is the best we have found for reducing tail latencies.

<div style="text-align:right">&mdash;Carl Lerche</div>


[0.2.14]: #
[ry]: https://github.com/ry
[deno]: https://github.com/denoland/deno
[Hyper]: https://github.com/hyperium/hyper
[node]: https://nodejs.org
[jonhoo]: https://github.com/jonhoo/
[pr]: https://github.com/tokio-rs/tokio/pull/2160#issuecomment-579004856
[spawn_blocking]: https://docs.rs/tokio/0.2/tokio/task/fn.spawn_blocking.html
[block_in_place]: https://docs.rs/tokio/0.2/tokio/task/fn.block_in_place.html
[hill]: https://en.wikipedia.org/wiki/Hill_climbing
2 changes: 1 addition & 1 deletion layouts/partials/header.html
@@ -51,7 +51,7 @@
   </li>
   <li class="nav-item">
     {{ $isCommunity := hasPrefix .URL "/community" }}
-    <a class="nav-link {{ if $isCommunity }} active {{ end }}" href="{{ ref . "/community.md" }}">Community</a>
+    <a class="nav-link {{ if $isCommunity }} active {{ end }}" href="{{ ref . "/community" }}">Community</a>
   </li>
   <li class="nav-item">
     {{ $blog := index (.Site.Menus.blog) 0 }}