Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os.proc.call's timeout has a termination race-condition from SIGTERM and SIGKILL #284

Closed
j-mie6 opened this issue Jul 21, 2024 · 15 comments · Fixed by #286
Closed

os.proc.call's timeout has a termination race-condition from SIGTERM and SIGKILL #284

j-mie6 opened this issue Jul 21, 2024 · 15 comments · Fixed by #286
Milestone

Comments

@j-mie6
Copy link
Contributor

j-mie6 commented Jul 21, 2024

The current implementation of os.proc.call's timeout flag uses two consecutive calls to p.destroy() and p.forciblyDestroy(). The effects of these calls are to first send SIGTERM, and then send SIGKILL to the process.

The roles of these two signals are:

  • SIGTERM: instruct the process to terminate, the process may intercept this and perform necessary clean-up operations, or may decide to ignore it entirely
  • SIGKILL: instruct the process to terminate immediately -- this signal cannot be intercepted.

By sending these two signals back-to-back, the parent process produces a race-condition between how quickly the child can execute its SIGTERM handler and clean up resources and the issue of the SIGKILL. In my experiments, I've found that SIGKILL it the cause of process exit the vast majority of the time. This means that the (potentially necessary) clean-up of the process is often not performed or worse interrupted. If the handler itself contained code to write file contents back to disk, modify a database, and so on, these operations may be corrupted. If the child process itself has children that need terminating, this could not be issued, leading to the parent process hanging.

What are the possible desired outcomes?

There are three ways that the timeout should be terminating the process:

  1. Only send SIGTERM: it doesn't matter how long it takes, we need to ensure safe clean-up
  2. Only send SIGKILL: the process has no important state, it should be terminated immediately
  3. Send SIGTERM, wait an appropriate amount of time, then send SIGKILL: we want to offer the process an opportunity to clean-up, but if this takes too long (perhaps the clean-up process itself is hanging), we want to forcibly terminate -- this is the scenario done by os.lib, albeit without allowing sufficient time to perform the handler.

What is normally done?

The SIGKILL signal is useful to issue when a process is not responding in a timely fashion to its SIGTERM event and the two are usually sent together with a delay. The Linux timeout command offers this with the -k n flag, which sends a SIGKILL signal n seconds after the original timeout sent SIGTERM.

Solutions

The race condition caused by the consecutive calls to destroy and forciblyDestroy is a bug, and could be addressed by supporting a similar system allowing outcomes (1), (2), or (3) configurably and safely.

For backwards compatibility, however, it might be wise to just support (1) with the current system.

@sake92
Copy link
Collaborator

sake92 commented Jul 21, 2024

The option 3 feels like the most sensible/useful to me.
One thing I found is that java 8 doesnt destroyForcibly() the same as java 9+.
https://stackoverflow.com/a/52090564/4496364.
Just to take it into consideration/tests

@j-mie6
Copy link
Contributor Author

j-mie6 commented Jul 21, 2024

Option 3 can be ok if the kill delay is configurable or sufficient length. Some people might argue that's an unexpectedly longer timeout than they expected though. In my case, I don't want forcible termination at all.

At the very least, in addition to timeout you could have killAfter: Int, where killAfter = -1 is scenario (1), killAfter = 0 is scenario (2) and killAfter = n, n>0 is scenario (3) done correctly?

The default could then be -1 or perhaps 1000

@lihaoyi
Copy link
Member

lihaoyi commented Jul 22, 2024

I think a configurable kill delay with a default sounds good. Such an addition can be made binary compatible and semantically compatible, and mirrors the Linux timeout command as you have mentioned. Feel free to send a PR

@j-mie6
Copy link
Contributor Author

j-mie6 commented Jul 22, 2024

What do you envision the binary compatible change looking like?

I think a configurable kill delay with a default sounds good. Such an addition can be made binary compatible and semantically compatible, and mirrors the Linux timeout command as you have mentioned. Feel free to send a PR

@lihaoyi
Copy link
Member

lihaoyi commented Jul 22, 2024

Take all the methods currently taking timeout: Long, and also add a timeoutGracePeriod: Long. Keep a forwarder from the old overload to the new overload to preserve bincompat.

AFAICT, this affects os.proc.call, os.Subprocess#join, and os.Subprocess#waitFor

@j-mie6
Copy link
Contributor Author

j-mie6 commented Jul 26, 2024

Take all the methods currently taking timeout: Long, and also add a timeoutGracePeriod: Long. Keep a forwarder from the old overload to the new overload to preserve bincompat.

The problem is that you can't have two overloadings with default arguments. This is why default arguments is a bin-compat nightmare. I think if the argument is placed last and the underlying forwarder is stripped of its defaults this might work...

@j-mie6
Copy link
Contributor Author

j-mie6 commented Jul 26, 2024

os.Subprocess#waitFor is not affected, as it does not attempt to terminate the process.

@lihaoyi
Copy link
Member

lihaoyi commented Jul 26, 2024

Yes, keep the default arguments only on the longest overload and strip them from the shorter ones.

@j-mie6
Copy link
Contributor Author

j-mie6 commented Jul 26, 2024

Also, as a side note, ProcessLike is sealed, is it intentional that SubProcess and ProcessPipeline are not final? The addition of the new join overload will be implemented by both of them, but in theory a user could have overridden them and their behaviour is broken.

@lihaoyi
Copy link
Member

lihaoyi commented Jul 26, 2024

Probably not, but most people won't extend them anyway so a bit of sloppiness isn't a huge deal

@j-mie6
Copy link
Contributor Author

j-mie6 commented Jul 26, 2024

Well, it might well be because of the new overload: it might be better to @deprecatedInheritance them with a note about the new overload?

@lihaoyi
Copy link
Member

lihaoyi commented Jul 26, 2024

sure that works

@j-mie6
Copy link
Contributor Author

j-mie6 commented Jul 26, 2024

Alternatively, I notice we are still in early-semver, and I suspect you'll want this to land in a 0.11 update? I could just mark them as final now if you think that's fine

@lihaoyi
Copy link
Member

lihaoyi commented Jul 26, 2024

Let's preserve bincompat and release it as a 0.10.x for now. Although we can break compat whenever we want and just bump a version, it's also good to preserve compat where possible to give our users an easier time upgrading.

@j-mie6
Copy link
Contributor Author

j-mie6 commented Jul 26, 2024

Fair! (it's been so long since I've been in 0.x that I forgot you can release minor in the patch digit)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants