Opened 6 years ago

Last modified 5 years ago

#5710 enhancement new

Use posix_spawn to spawn processes for more fasterness, especially on large processes

Reported by: Glyph Owned by:
Priority: normal Milestone:
Component: core Keywords:
Cc: Branch:
Author:

Description

(Note that since this is a performance ticket, a spawnProcess benchmark is a prerequisite.)

posix_spawn is not burdened with the unfortunate requirement that fork/ exec implies, which is that user code runs in the post-fork-pre-exec environment. Many platforms do actually have practical optimizations for posix_spawn beyond its idiomatic counterpart. These optimizations would likely be magnified by Python's often unfortunate reference-counting/gc behaviors across fork() boundaries.

Not to mention the fact that posix_spawn actually has a signature that is quite similar to Twisted's spawnProcess (with struct attributes instead of parameters, but hey, it's C, what do you want).

Change History (6)

comment:1 Changed 6 years ago by nekto0n

What about proposed preExecFn argument to spawnProcess?

comment:2 Changed 5 years ago by jknight

Doing something *like* this is actually a correctness fix, not a performance fix. I'm nots sure if posix_spawn is actually useful, though, given this (note RESOLVED/FIXED state; Ulrich Drepper at his finest): http://sources.redhat.com/bugzilla/show_bug.cgi?id=10354

But, twisted.internet.process does need to be rewritten to either use posix_spawn or python-subprocess32's _posixsubprocess.c C module (or a variant thereof adding additional functionality if necessary), so that there is no python code executed between the call to fork() and the call to exec().

comment:3 Changed 5 years ago by jknight

You currently can't do just about /anything/ if you want the vfork optimization in glibc. Doesn't seem too useful.

      /* If no major work is done, allow using vfork.  Note that we
         might perform the path searching.  But this would be done by
         a call to execvp(), too, and such a call must be OK according
         to POSIX.  */
      || ((flags & (POSIX_SPAWN_SETSIGMASK | POSIX_SPAWN_SETSIGDEF
                    | POSIX_SPAWN_SETSCHEDPARAM | POSIX_SPAWN_SETSCHEDULER
                    | POSIX_SPAWN_SETPGROUP | POSIX_SPAWN_RESETIDS)) == 0
          && file_actions == NULL))
    new_pid = __vfork ();
  else
    new_pid = __fork ();

comment:4 in reply to:  3 ; Changed 5 years ago by Glyph

Replying to jknight:

You currently can't do just about /anything/ if you want the vfork optimization in glibc. Doesn't seem too useful.

A few points:

  1. if you're doing process spawning for utilization of multiple cores (a quite popular use-case), you don't need to do any of that stuff.
  2. Did they get rid of the POSIX_SPAWN_USEVFORK flag? I thought you could tell it to be a little more permissive if you know that one of those things is "safe". (I don't know how you can ever really know that, in the face of how "all bets are off" vfork is, but still, I thought it was in there.)
  3. You can do some of that work in a helper C executable if you really need to, and just exec twice. If your main / master process is large and wants to fork a lot, you can still get a big win even paying twice for exec().

comment:5 Changed 5 years ago by Glyph

Worth noting that the glibc bug is open again. It seems that Drepper is no longer involved with glibc maintenance, so we no longer have to ... enjoy his inimitable style of maintaining relationships with users.

Also worth noting that glibc is not the only implementation of posix_spawn, so it might be better elsewhere. I am not sure how practically true this is right now, as I don't know the details of any other implementations, but it always might become true in the future; posix_spawn's semantics are such that it can be optimized, fork / exec aren't.

comment:6 in reply to:  4 Changed 5 years ago by Glyph

Replying to glyph:

Replying to jknight:

  1. You can do some of that work in a helper C executable if you really need to, and just exec twice. If your main / master process is large and wants to fork a lot, you can still get a big win even paying twice for exec().

Reading this now it strikes me as somewhat unclear. What I meant to say is, as a separate ticket, we could implement the following optimization:

If spawnProcess is passed parameters that, if passed to posix_spawn as arguments, would cause fork rather than vfork to be called on glibc (or an equivalent non-optimal behavior on a different libc), we could have spawnProcess detect that it's running on libc, then use posix_spawn - using only the attributes which would cause vfork (or equivalent) to be used, to start a short C helper program.

That program would take, as environment variables or command-line arguments, instructions on which process-attribute manipulations to perform (dup2-ing file descriptors, changing user IDs, etc). Said program would perform those instructions, then exec the program that the user intended to spawn.

This could still be a significant advantage over the current fork/exec-based implementation, because such a C program could allocate no memory, or very close to it. Therefore, despite two exec calls, a far smaller executable image would be mapped, and we would not have to worry about a post-fork Python reference count change causing random page copies all over the process's address space.

It might be advisable to have such a program for thread-safety as well, at least in the case where the subprocess is spawned as a different user. The recommended way to implement this with posix_spawn is to simply call setuid (and then use the POSIX_SPAWN_RESETIDS spawnattr flag), which affects the process globally, unlike the current strategy of calling setuid in the post-fork environment.

Note: See TracTickets for help on using tickets.