Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

investigate standard IO locking #4612

Open
Geal opened this issue Feb 7, 2024 · 3 comments · Fixed by #4625
Open

investigate standard IO locking #4612

Geal opened this issue Feb 7, 2024 · 3 comments · Fixed by #4625

Comments

@Geal
Copy link
Contributor

Geal commented Feb 7, 2024

when writing logs to stdout and stderr, we use the standard library's StdOut, which requires locking to make sure multiple threads don't try to write lines at the same time.

There's a performance issue in there, but also, there's the potential to deadlock the router. (I am not entirely clear on the details here) if for some reason one thread takes the stdout lock and does not release it, every other thread that would try to write a log line at that moment would be blocked too, and then the router would be unable to process any more requests.

It should be possible to take the lock once on startup, then keep it the entire time the format layer is using it. But then, what happens on reloads?

@Geal
Copy link
Contributor Author

Geal commented Feb 8, 2024

maybe we should look at https://docs.rs/tracing-appender/latest/tracing_appender/ ?

@Geal
Copy link
Contributor Author

Geal commented Feb 8, 2024

the idea of running a separate thread to handle logs does not sit well with me right now, but I'll give it a try.

The alternative would be to skip stdout's lock/LineWriter/BufWriter pipeline (because it blocks on flushing the lines) and write directly to stdout:

  • create a File from the stdout fd (with a special case for windows, or accepting that the bug would still happen on windows)
  • make this fd non blocking
  • add back a LlineWriter/BufWriter alternative that supports a kind of try_write option

making stdout non blocking has one risk: any use of println! can now panic

@Geal
Copy link
Contributor Author

Geal commented Feb 9, 2024

I tested with tracing_appender, and it is still blocking. To test a clogged stdout, I create a pipe with mkfifo, start the router with --log=trace and redirect the router's output to that pipe, and on the other end, read from the pipe with less -f, but staying on the first page. Then in another shell, I launch router queries. After a few, a router query will just never answer, with the current router from dev, with or without tracing_appender

@Geal Geal mentioned this issue Feb 9, 2024
6 tasks
Geal added a commit that referenced this issue Feb 21, 2024
Fixes #4612 

If the router's output is piped into another process, and that process
does not consume that output, it can entirely lock up the router.

In Rust, stdout access is protected by a lock, to prevent multiple
threads from interleaving their writes. And inside that lock, there is a
LineWriter wrapping a BufWriter to batch line writes.

When the process receiving the output from the router does not consume,
then the BufWriter's writes to the stream start to block, which means
the current thread is blocked while holding the stdout lock. And then
any other thread that might want to log something will end up blocked
too, waiting for that lock to be released.

In this PR we mark stdout as non blocking, which means that writing logs
will fail silently when the buffer is full. This is done in
`executable.rs` to avoid disrupting tests, as any use of println could
panic now.
Documentation is updated to call out that in the cases where the router would have hanged before, if a plugin is using `println!` directly, then that call will panic



Co-authored-by: Jesse Rosenberger <git@jro.cc>
@abernix abernix reopened this Mar 7, 2024
abernix added a commit that referenced this issue Mar 7, 2024
This reverts commit 11fad3c from #4625
on account of discomfort with the solution and not being able to
concretely tie it back an actual fix for the root cause. We have
re-opened #4612 to track the need to get back to this.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants