Opened 9 years ago

Last modified 7 years ago

#4159 enhancement new

Add pre-exec hook to spawnProcess

Reported by: spiv Owned by: Andrew Mahone
Priority: normal Milestone:
Component: core Keywords:
Cc: Jean-Paul Calderone, Michael Hudson-Doyle, itamarst Branch:
Author:

Description

Sometimes it's useful to be able to run some code after fork but before exec. For example I just had a case where it would be handy to be able to set the BZR_LOG environment variable to 'bzr.log.%d' % os.getpid() as a cheap solution to something. mwhudson says he could have used this feature recently as well.

The subprocess module provides a preexec_fn argument, which provides this. It also provides a way to work around bugs like http://bugs.python.org/issue1652! So I think if we provided this feature it would turn out to be useful.

I'm not sure what to provide for platforms without fork. preexec_fn is documented as "Unix only", so that's probably a reasonable answer for Twisted's equivalent. It's certainly all I care about :)

This ticket is perhaps related to #2415.

Attachments (6)

posix_process_preexec_fn.patch (2.4 KB) - added by Andrew Mahone 8 years ago.
preexec_fn support for posix Process
posix_process_preexec_fn.2.patch (6.4 KB) - added by Andrew Mahone 8 years ago.
preexec_fn support for posix Process, and tests
posix_process_preexec_fn.3.patch (7.4 KB) - added by Andrew Mahone 8 years ago.
Adds doc for preexec_fn parameter to spawnProcess.
posix_process_preexec_fn.4.patch (9.0 KB) - added by Andrew Mahone 8 years ago.
posix_process_preexec_fn.5.patch (9.3 KB) - added by Andrew Mahone 8 years ago.
posix_process_preexec_fn.6.patch (9.3 KB) - added by Andrew Mahone 8 years ago.

Download all attachments as: .zip

Change History (26)

comment:1 Changed 9 years ago by Jean-Paul Calderone

Cc: Jean-Paul Calderone added

It's very difficult to support this correctly. It may be impossible. There are some significant gc interaction issues with the feature in as implemented by the subprocess module.

In the use-case you describe, do you actually need the child's pid in the environment variable? Or is that just a way to get a unique log name? If the latter, there are plenty of other possible approaches that wouldn't involve extending the spawnProcess interface.

I'm also somewhat reluctant to further widen the gap between Windows and POSIX in Twisted's process support. The difference is already significant enough that it causes major issues for Twisted itself (largely because of the consequences for buildbot).

comment:2 Changed 9 years ago by spiv

In my use-case, yes, the child's pid in that environment variable would be highly desirable, because it would make finding the log file for a rogue process (e.g. one consuming an unreasonable amount of memory or CPU) trivial. (An arbitrary unique log name would probably be an improvement over the status quo as it happens, but still nowhere as nice as having the PID in the log name.)

I agree there are significant risks with using a pre-exec function, but I think it would be fine to simply document them and say "caller beware". The cases of setting an environment variable or resetting a signal handler are safe AFAICT. For everything else... well don't do that, but if you must then gc.disable() after the fork and cross your fingers? ;)

I'd be curious to know what mwhudson's use case was.

I'm not sure why adding an optional feature would have consequences for buildbot. Presumably buildbot wouldn't use it, so it would make no difference? The gap in Windows and POSIX process support is unfortunate but I think realistically we need to accept that from time-to-time people want (and even need) to make use of platform-specific features such as this. Dealing with a difference in platform features is much easier if the framework acknowledges and supports it. Just about anything half-sensible Twisted can provide here would be better than people reinventing Twisted's fork+exec code, or monkeypatching os.execvpe.

comment:3 Changed 9 years ago by Michael Hudson-Doyle

Cc: Michael Hudson-Doyle added

My use case was essentially the same as spiv's.

comment:4 Changed 9 years ago by Jean-Paul Calderone

I'm not sure why adding an optional feature would have consequences for buildbot. Presumably buildbot wouldn't use it, so it would make no difference?

Why presume that? I have no idea if BuildBot would use it. I know that it uses certain existing spawnProcess functionality which doesn't behave well on Windows.

Just about anything half-sensible Twisted can provide here would be better than people reinventing Twisted's fork+exec code, or monkeypatching os.execvpe.

When you put it that way, certainly. However, I wonder if that is really a false dichotomy. Is a hook to allow for the execution of arbitrary code after fork but before exec the best way to satisfy this use case?

While you're contemplating that, also think about where exactly in the process you want this hook to go. A number of things happen between fork and exec, so it's not as if there is a single obvious place to invoke such a hook.

comment:5 Changed 9 years ago by Glyph

Twisted itself already needs a variation on this functionality: the _BaseProcess._setupChild method. The fact that the "fdmap" functionality is mutually exclusive with using a PTY in the subprocess indicates to me that this might not be the best way to factor it, or the best time to invoke it, but nevertheless a PTYProcess needs to do something different than a Process after fork() but before exec().

For example, one of the things that needs to be done at this time is closing open FDs. That should really be done consistently regardless of the other things being done. And actually, another use-case which can't currently be satisfied is a subprocess which doesn't close any file descriptors. Granted, this isn't the way I'd do it, but for some quick and dirty applications it might be nice to have behavior similar to os.system(), where the spawned subprocess just uses your terminal for a little while.

So here's a starting-point for a proposal: maybe what we want is two parameters: preForkBehaviors and preExecBehaviors, each a list of callables. The childFDs parameter could then be interpreted as one entry for each of those lists, as could usePTY.

In any case, if this is implemented, I'd like to see it used to reduce the differences between PTYProcess and Process.

comment:6 Changed 9 years ago by Glyph

Trac seems to have dropped several other comments I attempted to make here... edited to remove the duplication with my previous comment:

Replying to spiv:

I agree there are significant risks with using a pre-exec function, but I think it would be fine to simply document them and say "caller beware".

I'm inclined to agree, because of the fact that (as I said above) Twisted needs something like this itself, but also because I strongly agree with this:

Just about anything half-sensible Twisted can provide here would be better than people reinventing Twisted's fork+exec code, or monkeypatching os.execvpe.

The cases of setting an environment variable or resetting a signal handler are safe AFAICT. For everything else... well don't do that, but if you must then gc.disable() after the fork and cross your fingers? ;)

Happily, gc is already disabled before the fork in the current implementation, presumably to deal with GC interaction issues in our own post-fork-pre-exec code ;-). I think that it makes sense to say in the "caller beware" statement that GC is already disabled and if you enable it it's your funeral.

comment:7 Changed 9 years ago by itamarst

Cc: itamarst added

I opened ticket #4199 for the signal issue mentioned as a use case (http://bugs.python.org/issue1652), since I feel it should be solved within Twisted.

comment:8 Changed 8 years ago by <automation>

Owner: Glyph deleted

comment:9 Changed 8 years ago by Jean-Paul Calderone

To make the state of this ticket clear, I think this is not the best way to provide the desired functionality, but I'm not going to oppose any work on this ticket. If someone wants to implement a pre-exec hook for spawnProcess on POSIX, go ahead.

comment:10 Changed 8 years ago by Michael Hudson-Doyle

I've run into another use case, which is to call setsid(). This is another thing that usePTY=True implies, but I also need to use childFDs...

comment:11 Changed 8 years ago by Andrew Mahone

I've hit another use case, redirecting process output to a FIFO. Open on a FIFO in write mode will block until a reader opens it, or fail immediately with O_NONBLOCK, so trying to open the output file in the main thread so that it can be placed in childFDs will fail, block-and-serialize, or deadlock if the reader is started "after" the writer. I'm attaching a patch with an initial implementation of preexec_fn for posix Process. If spawnProcess/Process are called with a callable object as the preexec_fn argument, it is passed the args, environment, and any kwargs by _fork, before calling _setupChild, and its return replaces args, environment, and kwargs. I am using it succesfully to make blocking open() calls after the fork, and insert the results into the fdmap passed to _setupChild.

Changed 8 years ago by Andrew Mahone

preexec_fn support for posix Process

Changed 8 years ago by Andrew Mahone

preexec_fn support for posix Process, and tests

comment:12 Changed 8 years ago by Andrew Mahone

This second patch adds test cases for modifying the fdmap, arguments, and environment in spawnProcess via preexec_fn.

comment:13 Changed 8 years ago by Andrew Mahone

Keywords: review added

Changed 8 years ago by Andrew Mahone

Adds doc for preexec_fn parameter to spawnProcess.

comment:14 Changed 8 years ago by Glyph

Keywords: review removed
Owner: set to Andrew Mahone

Hi Unhelpful! Thanks for your patch; this is a great first contribution to Twisted. It seems you're handily proving your handle a lie. :-) As all new contributors discover though, patches rarely make it on the first try. Please address these points and submit a new patch when you can.

  1. First, some general, boring, coding-standard/policy stuff:
    1. When submitting a patch, please include a [wiki:ReviewProcess#Newsfiles NEWS file]. It saves a bit of work for the person who ultimately commits it.
    2. Please submit patches against trunk; the offsets were kinda large
    3. Try to keep lines under 80 characters.
    4. The coding standard for variable and parameter names is camelCase; preexec_fn would be better written preExecFn; although just preExec or preExecFunction would be better.
  2. Now, more specifically about the patch itself; first, the mandatory parts which really need to be addressed for this to be considered for merging:
    1. If you're using ReactorBuilder-based tests, it's a bad idea to return a Deferred. ReactorBuilder has some special logic to build an inner reactor just for that test, whereas returning a Deferred involves allowing the "actual" reactor (the one running trial) to do I/O. The former are used when testing the reactor itself (as these are, since they're testing the reactor's process implementation); the latter are more commonly used as part of integration tests, when it's easier to do real async I/O than to build a hierarchy of stubs. The bottom line here is that by returning a Deferred in these tests you're running two reactors, and that really shouldn't be necessary.
    2. Since self.runReactor actually runs the built reactor, blocks, and completes all the work associated with that reactor, you shouldn't be perfoming assertions in Deferred callbacks. Consider that a spawnProcess implementation which simply called reactor.stop() would cause all of your tests to pass without running their assertions. Instead, use list.append as your callback, and after runReactor returns, assert something about the contents of your list.
    3. If I read this right (I did not actually try it) these tests will run on Windows, where they will error out spectacularly. They need to be adjusted.
    4. The parameter is silently ignored on Windows, but the documentation does not include any portability notes. spawnProcess does set a bad example here, since childFDs argument doesn't say anything about that either.
    5. By adding this argument to the interface of IReactorProcess, you've incompatibly changed the interface so that existing providers outside of Twisted will no longer conform. This is a problem because e.g. verifyObject will cause those projects' tests to fail with no warning, and if they're using fancy stuff like adapters or component registries it will cause functional changes. [wiki:CompatibilityPolicy The compatibility policy outlines the rules for making such changes.]
    6. The documentation of this new parameter is somewhat unclear. 'args, env, and kwargs'? 'env', I suppose, I could guess, but there's usually some parity between 'kwargs' and 'args'; in this case I would assume 'args' is the 'args' parameter to IReactorProcess, but then what's 'kwargs'?
      1. also, what the 'kwargs' argument actually does is exposing an internal, ostensibly private implementation detail of twisted.internet.process as part of the public interface. I think that you should expose this in a different way.
      2. I believe this will interact poorly with PTYProcess.
    7. The test for 'callable' on preexec_fn in Process.__init__ is terrible. I am guessing that you're concerned about passing in a non-callable object and having it blow up in the subprocess, but it might blow up in the subprocess anyway so you need good error reporting regardless.
    8. Don't ever use hasattr; it has peculiar behavior in error conditions. If you absolutely need to, use `getattr(x, y, None) is None`, but in this case you definitely don't absolutely need to. Just always set preexec_fn and check to see if it's None, not if it's present.
  3. And now for some more general, fluffy design stuff about this change. While it probably won't be mandatory - I try not to get legitimate improvements get derailed by a hypothetically optimal implementation - you might want to consider it because I'm always right about everything, all of the time.
    1. This is a real missed opportunity for some refactoring. usePTY and childFDs are both features which could be expressed in terms of a pre-fork and a pre-exec hook, as I noted earlier in the comments on this ticket. Granted, they'd have to be internal, since Twisted provides other stuff related to these features (like writeToChild), but they could use the same general mechanism. This would also avoid the need for a special 'do-you-have-a-preexec-hook' check, since everybody would always have some preexec hooks, just different ones depending on process type.
    2. I am a little concerned about the ever-growing list of parameters to spawnProcess. It seems like this might not be the last one. Take a moment to contemplate [ticket:2415 the ticket for overhauling the process-spawning interface] and see if you have any ideas for how this might fit in there. My current thinking is vaguely that uid, gid, childFDs, usePTY, and preexec_fn all ought to be encapsulated into a single object that does setup somehow, and then chained together by a composite implementation of that interface.
    3. There are definite drawbacks to using a preexec function as the solution to your particular use-case. Since the problem is that opening that FIFO might block, it might of course block for a long time. If it is blocked for a long time, that means that every single time your parent process touches any of its memory, it will be performing a copy, because all of its pages are now copy-on-write and the child may need a copy. It might just be worth paying the cost of a second exec() in order to get a nice, clean, minimal process environment in which to do that blocking open(). Since you're on POSIX anyway, asking the shell to do this for you would save you the cost of spawning Python twice; and shells are reasonably good at redirecting the input and output of things. (Don't get me wrong, shell programming is awful, but this is exactly the sort of thing they're intended for, and the shell will always be completely in filesystem cache because everything in the universe uses it.)

Again, thanks for your contribution, and thanks for diving right in to attack this thorny problem. I hope my review was useful and that we'll be seeing another patch from you soon!

Changed 8 years ago by Andrew Mahone

Changed 8 years ago by Andrew Mahone

Changed 8 years ago by Andrew Mahone

comment:15 Changed 8 years ago by Andrew Mahone

Keywords: review added

I've tried to address the coding style issues brought up, renamed preexec_fn as setupChild, and given it a uniform argument list of (args, environment, fdmap, winStartupInfo) - the last to be used to modify process setup info on windows. For win32 processes, and with usePTY, None will be passed as the fdmap argument. On non-win32 platforms, None will be passed as the winStartupInfo argument. I also removed the assignment from passed-back variables - the objects of interest are all mutable, and I don't see any reason that setupChild should need to replace them with new objects.

comment:16 Changed 8 years ago by Jean-Paul Calderone

Owner: Andrew Mahone deleted

comment:17 Changed 8 years ago by zseil

Owner: set to zseil

comment:18 Changed 8 years ago by zseil

Keywords: review removed
Owner: changed from zseil to Andrew Mahone

Thanks for working on this! My review will be mostly focused on Windows aspects:

  1. The new tests fail for the win32event and iocp reactors, those reactors have a custom spawnProcess implementation. Win32Reactor.spawnProcess can be simply deleted (#5304), but IOCPReactor.spawnProcess will need to be adjusted to the new signature (I think this is Glyph's item 2.3).
  2. While setupChild can't be added to IReactorProcess, it should still be documented in PosixReactorBase's spawnProcess implementation.
  3. There is no exception handling in Windows' Process for setupChild. Matching what posix implementation does in that case will be hard, but it would be great if you manage to add a test that checks what happens when setupChild raises an error.
  4. The use case for the winStartupInfo parameter is not entirely clear and the there are no tests for it. If it is just to be able to modify std* descriptors see the next item.
  5. You could provide fdmap on Windows containig the std* descriptors by wrapping the child's handles with win32file._open_osfhandle, unwrapping them with _get_handle after setupChild is done, and only then assign them to the StartupInfo.

The patch looks good to me otherwise, though I would also prefer if this was built on top of #2415, but feel free to ignore this if you're not in the mood for yak shaving :)

comment:19 Changed 7 years ago by jknight

Adding a preexec_fn is a very bad idea.

After fork in a multithreaded process, the functions you're allowed to call before exec are only those that are async-signal-safe. That means it's impossible to safely call any python code after fork(), before exec(), because malloc is not on that list. That means a preexec_fn is not safe to call.

Of course, that also means that the current implementation of t.i.process is totally broken, even without preexec_fn. Fortunately (or maybe unfortunately?), it only rarely causes an actual problem.

This same bug (of having fork/exec done by python code) was also fixed in cpython back in 2010 via the addition of Modules/_posixsubprocess.c, which reimplements the core of subprocess.py in C. Unfortunately, it was only ever committed to the Python3 fork (3.2), not in mainline Python versions. It's really quite unfortunate that upstream abandoned Python, so, despite having been fixed 2 years ago, is still a problem in the most recent version. But anyways...

twisted.internet.process is buggy now, because it uses python, and that should be fixed, instead of made worse by adding this feature.

(Sidenote: subprocess still has a preexec_fn, since they added it before realizing it was a bad idea, but the API docs warn you not to use it:

Warning: The preexec_fn parameter is not safe to use in the presence of threads in your application. The child process could deadlock before exec is called. If you must use it, keep it trivial! Minimize the number of libraries you call into.

Note: If you need to modify the environment for the child use the env parameter rather than doing it in a preexec_fn. The start_new_session parameter can take the place of a previously common use of preexec_fn to call os.setsid() in the child.

)

comment:20 Changed 7 years ago by Glyph

In case anyone interested in this ticket has not already seen it, see #5710 for more discussion of related issues.

Note: See TracTickets for help on using tickets.