Put process-creation into a thread #1109

Closed
njsmith opened this issue Jun 13, 2019 · 11 comments · Fixed by #1113 or #1496

Comments

@njsmith
Member

njsmith commented Jun 13, 2019

[Original title: Is process creation actually non-blocking?]

In #1104 I wrote:

the actual process startup is synchronous, so you could just as well have a synchronous version

But uh... it just occurred to me that I'm actually not sure if this is true! I mean, right now we just use subprocess.Popen, which is indeed a synchronous interface. And on Unix, spawning a new process and getting a handle on it is generally super cheap – it's just fork. The exec is expensive, but that happens after the child has split off – the parent doesn't wait for it.

But on Windows, you call CreateProcess, which I think might block the caller while doing all the disk access to set up the new process? Process creation on Windows is notoriously slow, and I don't know how much of that the parent process has to sit and wait for before CreateProcess can return.

And even on Unix, you might use vfork, in which case the parent process is blocked until the exec. And on recent Pythons, subprocess uses posix_spawn. On Linux this might use vfork (I'm not actually sure?). And on macOS it uses a native posix_spawn syscall, so who knows what that does. Again, this might not be a big deal... maybe the parent gets to go again the instant the child calls exec, or sooner, without having to wait for any disk access or anything. But I'm not sure!

So... we should figure this out. Because if process creation is slow enough that we need to treat it as a blocking operation, we might need to change the process API to give it an async constructor. (Presumably by making Process.__init__ private, and adding await trio.open_process(...) – similar to how we handle files.)
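To make the proposed API shape concrete, here is a minimal sketch of a private constructor plus an async factory. It uses the standard library's asyncio (asyncio.to_thread, Python 3.9+) as a stand-in rather than Trio, and the Process wrapper, the _PRIVATE token, and this open_process are hypothetical names for illustration, not Trio's actual implementation:

```python
import asyncio
import subprocess
import sys

# Hypothetical sentinel used to keep the constructor effectively private.
_PRIVATE = object()


class Process:
    """Hypothetical wrapper; not Trio's actual Process class."""

    def __init__(self, popen, *, _token=None):
        # Discourage direct construction: callers go through open_process().
        if _token is not _PRIVATE:
            raise TypeError("use 'await open_process(...)' instead")
        self._popen = popen

    @property
    def pid(self):
        return self._popen.pid

    async def wait(self):
        # Popen.wait() blocks, so it also runs in a worker thread here.
        return await asyncio.to_thread(self._popen.wait)


async def open_process(command, **kwargs):
    # The blocking Popen constructor runs in a worker thread, so the
    # event loop stays responsive while the OS sets up the child.
    popen = await asyncio.to_thread(subprocess.Popen, command, **kwargs)
    return Process(popen, _token=_PRIVATE)


async def main():
    proc = await open_process([sys.executable, "-c", "pass"])
    return await proc.wait()


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The point of the async factory is that callers never see a synchronous spawn, so whether a thread is used underneath can change later without touching the public API.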

@njsmith
Member Author

njsmith commented Jun 13, 2019

It's easy to find benchmarks of how long it takes a Windows process to start up (i.e., from CreateProcess start → the process actually doing things). It's harder to find benchmarks of CreateProcess's "return latency" (from CreateProcess start → CreateProcess returns in the parent, even if the child is still setting up). But here's a report of a CreateProcess call that occasionally takes 700 ms for no clear reason: https://social.msdn.microsoft.com/Forums/vstudio/en-US/80bcb045-c5ee-4514-b93d-694296747f91/function-createprocess-takes-almost-a-second-to-run?forum=windowsgeneraldevelopmentissues

@eryksun Sorry to bother, but you seem to know all kinds of mysterious things about Windows internals. Do you happen to know at what point during process startup CreateProcess returns?

@asvetlov

Interesting observation.
Regarding asyncio, I can say that I implemented the subprocess API in a sync manner because Twisted already did the same :)

@njsmith
Member Author

njsmith commented Jun 13, 2019

On Linux, posix_spawn(3) says:

       The posix_spawn() function commences by calling  fork(2),  or  possibly
       vfork(2) (see below).

       The  PID of the new child process is placed in *pid.  The posix_spawn()
       function then returns control to the parent process.
[...]
       In other words, vfork(2) is used if the caller requests it, or if there
       is no cleanup expected in the child before it  exec(3)s  the  requested
       file.
[...]
ERRORS
       The  posix_spawn()  and  posix_spawnp() functions fail only in the case
       where the underlying fork(2) or vfork(2) call fails

So that seems to make clear that the parent doesn't wait for the child at all..... though it's somewhat contradicted by these comments in the Python source claiming that Linux's posix_spawn does pass errors back to the parent:

https://github.com/python/cpython/blob/905e19a9bf9afd6439ea44fc6a4f3c8631750d6d/Lib/subprocess.py#L641-L647

macOS's posix_spawn(2), OTOH, lists some possible error codes like:

     [EACCES]           Search permission is denied for a component of the
                        path prefix.
     [ENOMEM]           The new process requires more virtual memory than is
                        allowed by the imposed maximum (getrlimit(2)).

...which strongly suggests that the parent process blocks to do significant disk I/O before posix_spawn returns.

(Of course it's always a little unclear how much we should care about disk I/O... does anyone run macOS executables off of spinning-platters anymore?)

It's also possible to force subprocess not to use posix_spawn, e.g. by passing a no-op preexec_fn.
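A minimal sketch of that workaround, assuming (as CPython's subprocess.py currently does) that any non-None preexec_fn disqualifies the posix_spawn fast path. The helper name spawn_without_posix_spawn is hypothetical, and preexec_fn is only supported on POSIX:

```python
import subprocess
import sys


def _noop():
    # Runs in the child between fork and exec; intentionally does nothing.
    pass


def spawn_without_posix_spawn(command):
    # Assumption: a non-None preexec_fn makes CPython's subprocess module
    # fall back to its fork/exec code path. POSIX only.
    return subprocess.Popen(command, preexec_fn=_noop)


if __name__ == "__main__":
    proc = spawn_without_posix_spawn([sys.executable, "-c", "pass"])
    print(proc.wait())
```

Note that this changes which code path is used for the spawn, not whether the parent waits: the fork/exec path still blocks until the child has called exec.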

@njsmith
Member Author

njsmith commented Jun 13, 2019

OK yeah looking at the current glibc sources: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/spawni.c;h=c1abf3f9608642fc530baa514b1703a9e7a1a0f2;hb=HEAD

...the Linux man page is a complete pack of lies. posix_spawn unconditionally uses vfork (well, CLONE_VFORK), and the parent process always blocks until the child has exec'd. Sometimes the child opens new files before calling exec, so that definitely could do disk I/O, and then I'm not sure how far exec has to get before the parent process can continue – but logically, I guess it has to get past the point of no return, because until then it might have to pass an error back. And the exec family can return stuff like EACCES, so it definitely has to do file I/O before it reaches the point of no return.

Oh also, if subprocess.py decides to use fork/exec, then the parent always waits for the child to call exec anyway (using a pipe to report errors, etc.).

So I guess we can consider it established that on Unixes, Popen.__init__ does block to do file-I/O, but probably nothing else.
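One way to observe this from Python: a failed exec surfaces as an exception raised by the Popen constructor itself, which is only possible if the parent waits for the outcome of the exec. A small sketch, where spawn_error is a hypothetical helper and the path is assumed not to exist on the machine running it:

```python
import subprocess


def spawn_error(command):
    """Return the exception raised by Popen(command), or None on success."""
    try:
        proc = subprocess.Popen(command)
    except OSError as exc:
        # The failure surfaced synchronously, inside the constructor.
        return exc
    proc.wait()
    return None


if __name__ == "__main__":
    # Hypothetical path that is assumed not to exist on this machine.
    print(type(spawn_error(["/no/such/executable-for-demo"])).__name__)
```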

On Windows, it might do file-I/O, and might also do other slow stuff, we're not sure.

@njsmith
Member Author

njsmith commented Jun 13, 2019

@asvetlov Yeah... Twisted treats file-I/O as non-blocking, and I'm not sure if Trio should do the same or not, so I'm not sure if this carries over or not.

It would be very easy to push subprocess spawns into a thread. It would add some overhead.

On my Linux laptop, I get:

In [2]: %timeit subprocess.Popen("/bin/true")                                   
3.14 ms ± 521 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

On a random Mac mini I happen to have access to, I get:

In [3]: %timeit subprocess.Popen("/usr/bin/true")
1000 loops, best of 3: 1.86 ms per loop

I'm not entirely sure I trust these numbers aren't paying some extra overhead as the GC calls waitpid or something. But if they're right, then the overhead of adding a thread is quite low in relative terms. I believe pushing work into a thread only costs on the order of ~100 µs, even with Trio's current unoptimized thread pool design (#6), so maybe just using a thread all the time is fine and appropriate – it only adds ~10% latency at worst.

@njsmith
Member Author

njsmith commented Jun 13, 2019

OK yeah Popen.__del__ is messing with my timings there. More accurate version:

Linux laptop:

In [2]: l = []                                                                  

In [3]: %timeit l.append(subprocess.Popen("/bin/true"))                         
2.69 ms ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Mac mini (have to drop the number of loops to avoid a resource exhaustion error....):

In [6]: l = []
In [7]: %timeit -n 100 l.append(subprocess.Popen("/usr/bin/true"))
100 loops, best of 3: 1.89 ms per loop

So the qualitative conclusions don't change.
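For anyone who wants to repeat the comparison outside IPython, here is a rough sketch that times the Popen call directly and via a worker thread. It uses concurrent.futures.ThreadPoolExecutor as a stand-in for Trio's thread machinery, so the thread overhead it shows is only indicative; time_spawns and compare are hypothetical helper names:

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def time_spawns(spawn, command, count=20):
    """Return mean seconds per call of spawn(command) over count calls."""
    procs = []  # keep handles alive so Popen.__del__ stays out of the timing
    start = time.perf_counter()
    for _ in range(count):
        procs.append(spawn(command))
    elapsed = time.perf_counter() - start
    for proc in procs:
        proc.wait()  # reap the children after the timed section
    return elapsed / count


def compare(command, count=20):
    """Return (direct, threaded) mean spawn latency in seconds."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        direct = time_spawns(subprocess.Popen, command, count)
        threaded = time_spawns(
            lambda cmd: pool.submit(subprocess.Popen, cmd).result(),
            command,
            count,
        )
    return direct, threaded


if __name__ == "__main__":
    direct, threaded = compare([sys.executable, "-c", "pass"])
    print(f"direct: {direct * 1e3:.2f} ms, threaded: {threaded * 1e3:.2f} ms")
```

The difference between the two numbers approximates the per-spawn cost of the thread hop on that machine.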

@njsmith
Member Author

njsmith commented Jun 13, 2019

Some war stories about bugs in Windows process spawning here:

While these are bugs, after fixing the bugs the best-case CreateProcess times he reports are still 5 ms in the first story, and 320 / 20 = ~15 ms in the second story, and this is a full-time Chrome dev's workstation, so I assume it runs at least as fast as any commodity hardware and configuration that normal people use.

@asvetlov

Yeah, in asyncio we already start subprocesses via async def functions.
Running the subprocess spawning code in a thread pool looks like a good idea, since it doesn't change asyncio's public API.

@eryksun

eryksun commented Jun 13, 2019

here's a report of a CreateProcess call that occasionally takes 700 ms for no clear reason: https://social.msdn.microsoft.com/Forums/vstudio/en-US/80bcb045-c5ee-4514-b93d-694296747f91/function-createprocess-takes-almost-a-second-to-run?forum=windowsgeneraldevelopmentissues

@eryksun Sorry to bother, but you seem to know all kinds of mysterious things about Windows internals. Do you happen to know at what point during process startup CreateProcess returns?

CreateProcessW and CreateProcessAsUserW call a common internal routine that does the initial work of parsing the executable path from the command line, searching for the executable, creating the process parameters, and so on. Prior to NT 6.0, most of the remaining work was implemented with individual system calls (e.g. NtOpenFile, NtCreateSection, NtCreateProcess, NtQueryInformationProcess, NtSetInformationProcess, NtAllocateVirtualMemory, NtWriteVirtualMemory, and NtCreateThread). In NT 6.0+, this work is orchestrated in the kernel with a single NtCreateUserProcess system call.

The issue with a 700 ms delay that's associated with querying extended attributes is likely due to the kernel security function SeGetCachedSigningLevel. This calls FsRtlQueryKernelEaFile to look for a "$Kernel." extended attribute, apparently to get the file's cached signing level. (Caching this in a kernel EA is secure. The value can be queried from user mode, but it can only be set from kernel mode by either FsRtlSetKernelEaFile or a custom IRP with the minor function IRP_MN_KERNEL_CALL.) 700 ms is probably the time required to verify the signature when the signing level hasn't been cached yet, not the time it takes to query the extended attribute.

The time related to the initial setup in the process itself is unrelated. The initial thread is suspended. When NtCreateUserProcess returns, CreateProcessW sends an LPC notification to the Windows session server (csrss.exe; one instance per session), which keeps its own bookkeeping for Windows subsystem processes and threads and also manages SxS activation (aka the fusion loader). Once the process is created and registered with the subsystem server, CreateProcessW calls ResumeThread and returns the process and thread handles.

A special asynchronous procedure call (APC) for ntdll!LdrInitializeThunk is initially queued to the main thread of the child process. This initializes the process via ntdll!LdrpInitializeProcess, which includes loading static DLL imports and possibly SxS activation that requires LPC messaging to the fusion loader. Once the process is initialized, the main thread resumes normal execution via NtContinue, which begins at ntdll!RtlUserThreadStart. This calls the WINAPI init function kernel32!BaseThreadInitThunk, which calls the image entry point. For a C application that uses wmain, this will be wmainCRTStartup, which sets up the CRT and (unlike Unix where this is the work of the parent process) parses the command line into argc and argv parameters, with optional support for simplistic wildcard expansion of arguments such as "*.txt".

@njsmith
Member Author

njsmith commented Jun 21, 2019

In #1113 we ended up not putting process creation into a thread, but instead just setting up the public API changes necessary to do that later. So I guess I'll re-open this as the "actually put it in a thread" issue.

@njsmith njsmith reopened this Jun 21, 2019
@njsmith njsmith changed the title from "Is process creation *actually* non-blocking?" to "Put process-creation into a thread" Jun 21, 2019
@njsmith
Member Author

njsmith commented Jun 21, 2019

So the remaining todo item here is: once open_process has been released and we're ready to drop support for the old Process(...) constructor, refactor trio/_subprocess.py to move the startup logic into open_process, and push the call to subprocess.Popen into a thread.
