`os.proc.call`'s `timeout` has a termination race-condition from `SIGTERM` and `SIGKILL` #284

j-mie6 · 2024-07-21T09:58:20Z

The current implementation of `os.proc.call`'s `timeout` flag makes two consecutive calls, `p.destroy()` followed by `p.destroyForcibly()`. The effect is to send `SIGTERM` to the process and then immediately send `SIGKILL`.

The roles of these two signals are:

- `SIGTERM`: instructs the process to terminate. The process may intercept this signal and perform any necessary clean-up, or may decide to ignore it entirely.
- `SIGKILL`: instructs the process to terminate immediately. This signal cannot be intercepted.

By sending these two signals back-to-back, the parent process creates a race between the child running its `SIGTERM` handler (and cleaning up its resources) and the arrival of `SIGKILL`. In my experiments, `SIGKILL` is the cause of process exit the vast majority of the time. This means that the (potentially necessary) clean-up is often not performed or, worse, is interrupted partway through. If the handler writes file contents back to disk, modifies a database, and so on, those operations may be left corrupted. If the child process has children of its own that need terminating, it may never get the chance to terminate them, which can leave the parent process hanging.

What are the possible desired outcomes?

There are three ways a caller might want the timeout to terminate the process:

1. Only send `SIGTERM`: it doesn't matter how long it takes, we need to ensure safe clean-up.
2. Only send `SIGKILL`: the process has no important state and should be terminated immediately.
3. Send `SIGTERM`, wait an appropriate amount of time, then send `SIGKILL`: we want to give the process an opportunity to clean up, but if this takes too long (perhaps the clean-up itself is hanging), we want to terminate it forcibly. This is what os-lib attempts today, albeit without allowing enough time for the handler to run.

What is normally done?

`SIGKILL` is useful when a process does not respond to `SIGTERM` in a timely fashion, so the two are usually sent as a pair with a delay between them. The Linux `timeout` command offers this with the `-k n` flag, which sends `SIGKILL` if the process is still running `n` seconds after the original timeout sent `SIGTERM`.

Solutions

The race condition caused by the consecutive calls to `destroy` and `destroyForcibly` is a bug, and could be addressed by supporting a similar system that allows outcomes (1), (2), or (3) to be selected configurably and safely.

For backwards compatibility, however, it might be wise to just support (1) with the current system.

sake92 · 2024-07-21T11:46:51Z

Option 3 feels like the most sensible/useful to me.
One thing I found is that Java 8 does not implement `destroyForcibly()` the same way as Java 9+:
https://stackoverflow.com/a/52090564/4496364
Just something to take into consideration, including in tests.

j-mie6 · 2024-07-21T12:17:09Z

Option 3 can be OK if the kill delay is configurable or of sufficient length. Some people might argue that it makes the effective timeout longer than they expected, though. In my case, I don't want forcible termination at all.

At the very least, in addition to timeout you could have killAfter: Int, where killAfter = -1 is scenario (1), killAfter = 0 is scenario (2) and killAfter = n, n>0 is scenario (3) done correctly?

The default could then be -1 or perhaps 1000
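A minimal sketch of how the three `killAfter` cases could map onto the JVM `Process` API is given here for illustration. It is not os-lib's implementation: the class name `GracefulTermination`, the method `terminate`, and the parameter `killAfterMillis` are hypothetical, and it assumes a platform where `destroy()` delivers `SIGTERM` and `destroyForcibly()` delivers `SIGKILL`, which the JVM does not guarantee.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of os-lib or the JDK.
final class GracefulTermination {
    private GracefulTermination() {}

    /**
     * Terminates the process according to killAfterMillis:
     *   negative : scenario (1), request termination and wait however long it takes
     *   zero     : scenario (2), terminate forcibly straight away
     *   positive : scenario (3), request termination, wait up to killAfterMillis,
     *              then terminate forcibly if the process is still alive
     *
     * Returns true if forcible termination was used.
     */
    static boolean terminate(Process p, long killAfterMillis) throws InterruptedException {
        if (killAfterMillis == 0) {
            p.destroyForcibly();   // assumed to send SIGKILL on Unix-like systems
            p.waitFor();
            return true;
        }

        p.destroy();               // assumed to send SIGTERM on Unix-like systems

        if (killAfterMillis < 0) {
            p.waitFor();           // never escalate; the child controls its own exit
            return false;
        }

        // Give the child's handler a bounded window to finish its clean-up.
        if (p.waitFor(killAfterMillis, TimeUnit.MILLISECONDS)) {
            return false;          // exited within the grace period
        }

        p.destroyForcibly();       // grace period elapsed, escalate
        p.waitFor();
        return true;
    }
}
```

The key difference from two back-to-back calls is the bounded `waitFor` between `destroy()` and `destroyForcibly()`, which is what gives the child's handler time to run before escalation.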

lihaoyi · 2024-07-22T11:12:46Z

I think a configurable kill delay with a default sounds good. Such an addition can be made binary compatible and semantically compatible, and mirrors the Linux timeout command as you have mentioned. Feel free to send a PR

j-mie6 · 2024-07-22T11:33:10Z

> I think a configurable kill delay with a default sounds good. Such an addition can be made binary compatible and semantically compatible, and mirrors the Linux timeout command as you have mentioned. Feel free to send a PR

What do you envision the binary compatible change looking like?

lihaoyi · 2024-07-22T11:44:42Z

Take all the methods currently taking timeout: Long, and also add a timeoutGracePeriod: Long. Keep a forwarder from the old overload to the new overload to preserve bincompat.

AFAICT, this affects `os.proc.call`, `os.SubProcess#join`, and `os.SubProcess#waitFor`

j-mie6 · 2024-07-26T12:02:29Z

> Take all the methods currently taking timeout: Long, and also add a timeoutGracePeriod: Long. Keep a forwarder from the old overload to the new overload to preserve bincompat.

The problem is that you can't have two overloads that both declare default arguments. This is why default arguments are a bin-compat nightmare. I think that if the new argument is placed last and the underlying forwarder is stripped of its defaults, this might work...

j-mie6 · 2024-07-26T12:13:12Z

`os.SubProcess#waitFor` is not affected, as it does not attempt to terminate the process.

lihaoyi · 2024-07-26T12:17:13Z

Yes, keep the default arguments only on the longest overload and strip them from the shorter ones.

j-mie6 · 2024-07-26T12:18:23Z

Also, as a side note: `ProcessLike` is sealed, so is it intentional that `SubProcess` and `ProcessPipeline` are not final? Both of them will implement the new `join` overload, but in theory a user could have extended them and overridden the existing `join`, in which case their behaviour would be broken.

lihaoyi · 2024-07-26T12:19:22Z

Probably not, but most people won't extend them anyway so a bit of sloppiness isn't a huge deal

j-mie6 · 2024-07-26T12:20:13Z

Well, it might well become a big deal because of the new overload. It might be better to mark them with `@deprecatedInheritance`, with a note about the new overload?

lihaoyi · 2024-07-26T12:21:01Z

sure that works

j-mie6 · 2024-07-26T12:22:34Z

Alternatively, I notice we are still in early-semver, and I suspect you'll want this to land in a 0.11 update? I could just mark them as final now if you think that's fine

lihaoyi · 2024-07-26T12:24:29Z

Let's preserve bincompat and release it as a 0.10.x for now. Although we can break compat whenever we want and just bump a version, it's also good to preserve compat where possible to give our users an easier time upgrading.

j-mie6 · 2024-07-26T12:25:15Z

Fair! (it's been so long since I've been in 0.x that I forgot you can release minor in the patch digit)

j-mie6 mentioned this issue Jul 26, 2024

Added proper SIGTERM/SIGKILL handling for sub-processes #286

Merged

lihaoyi closed this as completed in #286 Aug 5, 2024

lihaoyi closed this as completed in 6609238 Aug 5, 2024

lefou added this to the after 0.10.3 milestone Aug 6, 2024
