LocalSet, JoinHandle and budget coming together can lead to hangs with 100% CPU usage. #2460

najamelan · 2020-04-29T11:24:50Z

Affected Versions

0.2.14 - 0.2.19

Platform

arch linux, but no IO involved.

Description

Please see the issue reported first at futures-rs because I suspected JoinAll was maybe at fault.

JoinHandle often reports Pending even though it's task has already returned Ready. Over a collection of futures like in LocalSet, this can require several runs before all contained JoinHandles return Ready. When there are more tasks in the set than budget, this can lead to the JoinHandles never returning Ready and tasks awaiting them to hang.

It is possible that an inefficiency in the synchronization of JoinHandle also has an impact when using a threadpool, but it might not be so noticable since this just throws more resources at the problem.

The text was updated successfully, but these errors were encountered:

Signed-off-by: Eliza Weisman <eliza@buoyant.io>

hawkw · 2020-04-29T19:03:14Z

I think I know what's up here.

LocalSet wraps each time a local task is run in budget:

tokio/tokio/src/task/local.rs

Line 406 in 947045b

Some(task) => crate::coop::budget(|| task.run()),

This is identical to what tokio's other schedulers do when running tasks, and in theory should give each task its own budget every time it's polled.

However, LocalSet is different from other schedulers. Unlike the runtime schedulers, a LocalSet is itself a future that's run on another scheduler, in block_on. block_on also sets a budget:

tokio/tokio/src/runtime/basic_scheduler.rs

Line 131 in 947045b

if let Ready(v) = crate::coop::budget(|| future.as_mut().poll(&mut cx)) {

The docs for budget state that:

tokio/tokio/src/coop.rs

Line 73 in 947045b

/// Enabling budgeting when it is already enabled is a no-op.

This means that inside of a LocalSet, the calls to budget are no-ops. Instead, each future polled by the LocalSet is subtracting from a single global budget.

LocalSet's RunUntil future polls the provided future before polling any other tasks spawned on the local set:

tokio/tokio/src/task/local.rs

Lines 525 to 535 in 947045b

    
           if let Poll::Ready(output) = me.future.poll(cx) { 
        
               return Poll::Ready(output); 
        
           } 
        
           if me.local_set.tick() { 
        
               // If `tick` returns `true`, we need to notify the local future again: 
        
               // there are still tasks remaining in the run queue. 
        
               cx.waker().wake_by_ref(); 
        
           } 
        
           Poll::Pending

In this case, the provided future is JoinAll. Unfortunately, every time a JoinAll is polled, it polls every joined future that has not yet completed. When the number of futures in the JoinAll is >= 128, this means that the JoinAll immediately exhausts the task budget. This would, in theory, be a good thing --- if the JoinAll had a huge number of JoinHandles in it and none of them are ready, it would limit the time we spend polling those join handles.

However, because the LocalSet actually has a single shared task budget, this means polling the JoinAll always exhausts the entire budget. There is now no budget remaining to poll any other tasks spawned on the LocalSet, and they are never able to complete.

There are two possible solutions: LocalSet could use coop::limit to poll each future with a smaller budget, ensuring no one future can exhaust the entire budget. However, this would mean that tasks on a LocalSet would be given a smaller budget than comparable tasks on a regular worker or basic_scheduler. Alternatively, LocalSet could call coop::stop to reset the budget when it is polled. That way, each task spawned on the LocalSet would get its own separate budget rather than starving other tasks. Since LocalSets can only be polled in block_on, and are not competing with other tasks, I think it's correct for it to exempt itself from budgeting.

najamelan · 2020-04-29T21:20:53Z

That's some awesome debugging! Thanks for looking into that. There is one thing I don't understand though:

However, because the LocalSet actually has a single shared task budget, this means polling the JoinAll always exhausts the entire budget. There is now no budget remaining to poll any other tasks spawned on the LocalSet, and they are never able to complete.

From what I gather from the logs I posted, the inner tasks have in fact always run and completed before the outer ones. After they print, they immediately return Poll::Ready. However the joinhandles still said they hadn't. The outer tasks which were being polled would have completed had the JoinHandles not returned Pending.

It is very late here, so I have to sleep, but tomorrow I will run the test with your PR. I imagine it will solve the hang, but I suspect there is still an underlying issue that could have solved this from another angle and that may be desirable to solve as well. I'll have a look at the code of JoinHandle tomorrow as well.

hawkw · 2020-04-29T22:42:25Z

When the JoinHandles are polled, the budget is 0. This causes them to return Pending immediately, without checking if the task has actually completed.

najamelan · 2020-04-30T05:03:21Z

Yes, thanks. It dawned to me overnight that the budget was returning Pending, not the JoinHandle...

## Motivation Currently, an issue exists where a `LocalSet` has a single cooperative task budget that's shared across all futures spawned on the `LocalSet` _and_ by any future passed to `LocalSet::run_until` or `LocalSet::block_on`. Because these methods will poll the `run_until` future before polling spawned tasks, it is possible for that task to _always_ deterministically starve the entire `LocalSet` so that no local tasks can proceed. When the completion of that future _itself_ depends on other tasks on the `LocalSet`, this will then result in a deadlock, as in issue #2460. A detailed description of why this is the case, taken from [this comment][1]: `LocalSet` wraps each time a local task is run in `budget`: https://github.com/tokio-rs/tokio/blob/947045b9445f15fb9314ba0892efa2251076ae73/tokio/src/task/local.rs#L406 This is identical to what tokio's other schedulers do when running tasks, and in theory should give each task its own budget every time it's polled. _However_, `LocalSet` is different from other schedulers. Unlike the runtime schedulers, a `LocalSet` is itself a future that's run on another scheduler, in `block_on`. `block_on` _also_ sets a budget: https://github.com/tokio-rs/tokio/blob/947045b9445f15fb9314ba0892efa2251076ae73/tokio/src/runtime/basic_scheduler.rs#L131 The docs for `budget` state that: https://github.com/tokio-rs/tokio/blob/947045b9445f15fb9314ba0892efa2251076ae73/tokio/src/coop.rs#L73 This means that inside of a `LocalSet`, the calls to `budget` are no-ops. Instead, each future polled by the `LocalSet` is subtracting from a single global budget. `LocalSet`'s `RunUntil` future polls the provided future before polling any other tasks spawned on the local set: https://github.com/tokio-rs/tokio/blob/947045b9445f15fb9314ba0892efa2251076ae73/tokio/src/task/local.rs#L525-L535 In this case, the provided future is `JoinAll`. Unfortunately, every time a `JoinAll` is polled, it polls _every_ joined future that has not yet completed. When the number of futures in the `JoinAll` is >= 128, this means that the `JoinAll` immediately exhausts the task budget. This would, in theory, be a _good_ thing --- if the `JoinAll` had a huge number of `JoinHandle`s in it and none of them are ready, it would limit the time we spend polling those join handles. However, because the `LocalSet` _actually_ has a single shared task budget, this means polling the `JoinAll` _always_ exhausts the entire budget. There is now no budget remaining to poll any other tasks spawned on the `LocalSet`, and they are never able to complete. [1]: #2460 (comment) ## Solution This branch solves this issue by resetting the task budget when polling a `LocalSet`. I've added a new function to `coop` for resetting the task budget to `UNCONSTRAINED` for the duration of a closure, and thus allowing the `budget` calls in `LocalSet` to _actually_ create a new budget for each spawned local task. Additionally, I've changed `LocalSet` to _also_ ensure that a separate task budget is applied to any future passed to `block_on`/`run_until`. Additionally, I've added a test reproducing the issue described in #2460. This test fails prior to this change, and passes after it. Fixes #2460 Signed-off-by: Eliza Weisman <eliza@buoyant.io>

hawkw added a commit that referenced this issue Apr 29, 2020

add test reproducing #2460

7f71fb7

Signed-off-by: Eliza Weisman <eliza@buoyant.io>

hawkw mentioned this issue Apr 29, 2020

task: fix LocalSet having a single shared task budget #2462

Merged

hawkw closed this as completed in #2462 Apr 30, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LocalSet, JoinHandle and budget coming together can lead to hangs with 100% CPU usage. #2460

LocalSet, JoinHandle and budget coming together can lead to hangs with 100% CPU usage. #2460

najamelan commented Apr 29, 2020

hawkw commented Apr 29, 2020

najamelan commented Apr 29, 2020 •

edited

Loading

hawkw commented Apr 29, 2020

najamelan commented Apr 30, 2020

LocalSet, JoinHandle and budget coming together can lead to hangs with 100% CPU usage. #2460

LocalSet, JoinHandle and budget coming together can lead to hangs with 100% CPU usage. #2460

Comments

najamelan commented Apr 29, 2020

Affected Versions

Platform

Description

hawkw commented Apr 29, 2020

najamelan commented Apr 29, 2020 • edited Loading

hawkw commented Apr 29, 2020

najamelan commented Apr 30, 2020

najamelan commented Apr 29, 2020 •

edited

Loading