Possible regression in 0.2.14 and further (hangs, stack overflow) #2422

Closed
mexus opened this issue Apr 20, 2020 · 7 comments
Labels
A-tokio Area: The main tokio crate C-question User questions that are neither feature requests nor bug reports I-hang Program never terminates, resulting from infinite loops, deadlock, livelock, etc. M-runtime Module: tokio/runtime S-waiting-on-author Status: awaiting some action (such as code changes) from the PR or issue author.

Comments

@mexus
Contributor

mexus commented Apr 20, 2020

Version

0.2.14 and higher (up to 0.2.18 so far)

Platform

  • Linux ... 5.6.3-arch1-1 #1 SMP PREEMPT Wed, 08 Apr 2020 07:47:16 +0000 x86_64 GNU/Linux
  • Linux ... 5.6.5-1.el7.elrepo.x86_64 #1 SMP Thu Apr 16 14:02:22 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

Description

Disclaimer

Unfortunately, I am currently unable to provide an MRE or even a code sample that reproduces the issue (NDA), but I'm filing this anyway in case somebody else has run into the same thing.

Problem

I'm developing a reverse-proxy-like application, and after updating tokio from 0.2.13 to 0.2.18 I've found that my app hangs while consuming 100% of one CPU core (out of many). As I mentioned before, I cannot disclose all the details, but in general the app does the following things:

  • Establishing lots of "internal" connections (tcp, uds, about 500) simultaneously
  • Establishing more "internal" connections on demand (about 1.5k) almost simultaneously
  • Re-establishing broken "internal" connections (e.g. when the endpoint terminates the connection unexpectedly)
  • Accepting incoming connections (tcp, uds) from "external" clients
  • Proxying data (ratio ~ 8:1, i.e. 1 "internal" connection mentioned above is shared among ~8 "external" clients)

Under the hood, I use FuturesUnordered and Selects from futures-util and poll many futures manually, in the order I found to be most suitable. I don't spawn anything and use the default tokio runtime (via the #[tokio::main] macro).

After upgrading to tokio-0.2.18 I've found that my app establishes about 100 connections to the internal servers and then hangs completely, consuming 100% of a CPU core. All attempts to establish a connection to the port it listens on fail with a timeout.

I thought then "okay, there's been a major upgrade in tokio's scheduler in 0.2.14, so probably I must not manage all the futures myself and should just spawn the tasks!", so I've replaced FuturesUnordered and Selects with spawns and yay! It seemed to have solved the whole issue.

... until a lot of "internal" servers went offline and the connections were scheduled to be re-established, at which point I got a stack-overflow error.

So I had to downgrade to tokio-0.2.13, where everything just works (tm).

My question is, how do I investigate the root cause of my issue? Where should I look first?

Eventually, I would like to provide an MRE, but so far it's just a cry for help :)

Thanks!

@Darksonn
Contributor

Are you by any chance using the futures v0.1 FuturesUnordered or one of the early v0.3 versions from before this PR?

@Darksonn Darksonn added A-tokio Area: The main tokio crate C-question User questions that are neither feature requests nor bug reports I-hang Program never terminates, resulting from infinite loops, deadlock, livelock, etc. M-runtime Module: tokio/runtime labels Apr 20, 2020
@mexus
Contributor Author

mexus commented Apr 20, 2020

Nope, only 0.3

@Darksonn
Contributor

Darksonn commented Apr 20, 2020

Well there have been a few issues with hangs after this got introduced, which exposed a collection of buggy sub-schedulers such as FuturesUnordered (#2047) or Shared (#2130). The former has been fixed in futures version 0.3.2, and the latter has not yet been fixed.
Using futures::executor::block_on from within an async function falls in this category.

If your application hangs, it's likely due to such a buggy sub-scheduler somewhere. As for the stack overflow, that sometimes happens when people try to make big stack arrays, e.g.:

let mut buf = [0; 4096];
stream.read(&mut buf).await?;

and the like. Big stack arrays should be avoided in futures, because they make the future object itself massive, which can cause the call to tokio::spawn to overflow the stack while that very big object is moved a few stack frames down. You should use a vector instead.
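To make the size effect concrete, here is a minimal sketch using only the standard library. It is not taken from the reporter's code: `fake_read` is a hypothetical stand-in for `stream.read`, and the task function names are made up for illustration. It compares the size of a future that keeps a 4096-byte array alive across an await point with one that keeps a `Vec` instead.

```rust
use std::future::ready;
use std::mem::size_of_val;

/// Hypothetical stand-in for `stream.read(&mut buf)`: it contains one
/// await point, then pretends to have filled the whole buffer.
async fn fake_read(buf: &mut [u8]) -> usize {
    ready(()).await;
    buf.fill(0xAB);
    buf.len()
}

/// The buffer is a local array borrowed across an await, so all 4096 bytes
/// have to be stored inside the future object itself.
async fn stack_buffer_task() -> usize {
    let mut buf = [0u8; 4096];
    fake_read(&mut buf).await
}

/// The buffer is a `Vec`, so the future only stores the pointer, length and
/// capacity; the 4096 bytes live on the heap.
async fn heap_buffer_task() -> usize {
    let mut buf = vec![0u8; 4096];
    fake_read(&mut buf).await
}

/// Size in bytes of the (unpolled) array-backed future.
fn stack_future_size() -> usize {
    size_of_val(&stack_buffer_task())
}

/// Size in bytes of the (unpolled) Vec-backed future.
fn heap_future_size() -> usize {
    size_of_val(&heap_buffer_task())
}

fn main() {
    println!("array-backed future: {} bytes", stack_future_size());
    println!("Vec-backed future:   {} bytes", heap_future_size());
}
```

The exact numbers depend on the compiler version, but the array-backed future cannot be smaller than the array it contains. Nesting several such futures, or using larger buffers, multiplies the amount of data that has to be moved when the task is handed to a spawn call.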

Of course, it could also just be an infinite recursive loop. Your backtrace would probably tell you in that case.
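If the overflow does turn out to come from recursion in retry logic (this is only an assumption, since the reporter's code is not available), one common way to rule it out is to write the retry as a loop with an attempt cap. The sketch below is synchronous and uses a hypothetical helper, `connect_with_retry`, with a closure standing in for the real connection attempt.

```rust
/// Hypothetical helper: keeps calling `connect` until it succeeds or
/// `max_attempts` calls have failed. The closure receives the 1-based
/// attempt number. A plain loop keeps the stack depth constant no matter
/// how many retries are needed. At least one attempt is always made.
fn connect_with_retry<T, E>(
    max_attempts: u32,
    mut connect: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match connect(attempt) {
            Ok(conn) => return Ok(conn),
            // Out of attempts: report the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            // Otherwise try again without growing the call stack.
            Err(_) => attempt += 1,
        }
    }
}

fn main() {
    // Simulated endpoint that refuses the first three attempts.
    let result = connect_with_retry(10, |attempt| {
        if attempt <= 3 {
            Err("connection refused")
        } else {
            Ok(attempt)
        }
    });
    println!("{:?}", result); // prints Ok(4)
}
```

A real proxy would presumably also wait between attempts; that is left out here to keep the sketch focused on the control flow.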

@carllerche
Member

Do a snapshot of the process (thread stacks) when stuck at 100% CPU. That should show which fn it is stuck in.

@mexus
Contributor Author

mexus commented Apr 20, 2020

@Darksonn @carllerche Thanks for the advice! I'll check both my stack-allocated buffers and a snapshot of the thread stacks, and come back with more facts in a couple of days.

@Darksonn Darksonn added the S-waiting-on-author Status: awaiting some action (such as code changes) from the PR or issue author. label Apr 21, 2020
@Darksonn
Contributor

I was wondering if you had any further details on this issue? If not, I will have to close the issue due to lack of details.

@mexus
Contributor Author

mexus commented May 11, 2020

Yep, I guess let's close it. Since I got rid of all my "custom" futures with twisted logic and shifted all the hard work onto tokio (via spawn and its friends), I haven't seen any issues.

So my best guess for now is that I made some mistakes in the app's logic initially, and it just "happened" to work as expected before the upgrade.

Thanks, everyone, for your attention, and sorry I was not able to provide further details.

@mexus mexus closed this as completed May 11, 2020