[Twisted-Python] Re: Improving spawnProcess and friends

Nuutti Kotivuori naked at iki.fi
Wed Jul 19 17:25:01 EDT 2006


glyph at divmod.com wrote:
> On Wed, 19 Jul 2006 11:09:35 +0300, Nuutti Kotivuori <naked at iki.fi> wrote:
>> I'd like to write an improvement to spawnProcess - but I thought I
>> should check with the devs here first to get comments and to see if it
>> would be something that could be merged to the mainline at some point.
>
> Thanks :).
>
> It's probably worth mentioning these:
> http://twistedmatrix.com/trac/ticket/1218
> http://twistedmatrix.com/trac/ticket/950
> http://twistedmatrix.com/trac/ticket/1478
> as long as we're talking about this API.

Hopefully all of those will get fixed while we are at it.

>> The primary thing I want to do is to improve the childFDs
>> handling. Instead of the childFDs being a dictionary, I'd like to make
>> it be an object that is set up beforehand.
>
> I'd like to see a proposed API for setting up this object.  A
> different way of doing this (which I think I prefer) would be to add
> a feature to ProcessProtocol rather than modifying the behavior of
> the childFDs argument.  After all, it's generally a particular
> process protocol that wants to be able to communicate on various
> channels.  This way it would be possible - although not necessarily
> easy - to allow your ProcessProtocol object to have a few of its
> methods called post-fork, but pre-exec.

Hmmh. Well, firstly, I don't think I explained the object idea
properly. My original idea was just to use the object as a holder for
the configuration - and nothing more: a syntax extension to allow a
nicely readable way of specifying childFDs, instead of mappings from
integer to tuple of lists or whatever a complex setup like that would
end up being. But now I'm not sure I want the object there at all.

I'm not sure I understand what you are saying, though - at least not
in a concrete way. I agree that generally a particular process
protocol is coded to explicitly read and write on certain
communication channels - so it is reasonable that the process protocol
would specify those channels.

However, I'm not sure I like the idea of ProcessProtocol methods
firing post-fork but pre-exec. We are living on the child side there -
and calling methods on the transport, the reactor or many other
objects would be a bad idea. I'd like the parts that get executed on
the child side to be very well defined and generally not supplied by
the user.

>> In addition to the current functionality of duplicating one of the
>> parent's fds and providing simple pipes, I'd like to add features to
>> configure arbitrary PTYs to be set up, for socketpairs to be set up
>> and to allow the duplication of some pipe into multiple child fds. To
>> clarify a bit, I'd like for a way to be able to start a process and
>> say that a socketpair should be created, it should be dup'd to fd 0
>> and fd 1 on the child and a pipe should be created for reading and it
>> should be dup'd to fd 2 on the child. The same with PTYs as well, so
>> this change would unify Process and PTYProcess.
>
>> Also, I would want to make it possible for each created reader,
>> writer, PTY, socketpair, whatever to have a separate protocol. That
>> is, I want to be able to say that this PTY should be handled by my
>> protocol here that inherits from LineReceiver - instead of having to
>> implement all of those in the ProcessProtocol.
>
> I believe that putting this into spawnProcess is operating at the
> wrong layer of abstraction.  The feature you're describing is highly
> useful though, so providing a utility implementation of
> ProcessProtocol that allows you to hook up regular protocols to
> various FDs would be very good.  IMHO ProcessProtocol looks
> deceptively like a regular protocol, and probably its dataReceived
> and connectionLost methods should be (softly) deprecated.  The right
> way to write a ProcessProtocol is to override childDataReceived,
> childConnectionLost and processEnded.  A ProcessProtocol that could
> hook up _arbitrary_ FDs as input and output to a Protocol would be
> incredibly useful but it would be a good start just to have a
> better-defined way to deal with separating process control from
> protocol logic; so useful it might even be worth putting this
> functionality on the base ProcessProtocol.

Yes - the outReceived, errReceived, write, writeSequence, closeStdin,
closeStdout, closeStderr etc. methods seem to be the cause of a lot of
grief. It would be nice if childDataReceived, childConnectionLost,
processEnded, transport.writeToChild, transport.closeChildFD and
transport.signalProcess were the only things to worry about.
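
For concreteness, here is a minimal sketch - using only the current
API, with /bin/cat purely as an example - of a ProcessProtocol that
sticks to just those generic methods:

  from twisted.internet import protocol, reactor

  class GenericProcessProtocol(protocol.ProcessProtocol):
      # Uses only the FD-generic methods, never outReceived/errReceived.

      def connectionMade(self):
          # Talk to the child's fd 0 through the generic transport API.
          self.transport.writeToChild(0, 'hello\n')
          self.transport.closeChildFD(0)

      def childDataReceived(self, childFD, data):
          print 'data on child fd %d: %r' % (childFD, data)

      def childConnectionLost(self, childFD):
          print 'child fd %d closed' % childFD

      def processEnded(self, reason):
          reactor.stop()

  reactor.spawnProcess(GenericProcessProtocol(), '/bin/cat', ['cat'],
                       childFDs={0: 'w', 1: 'r', 2: 'r'})
  reactor.run()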

I was thinking about having the process readers and writers directly
connected to separate Protocols - and having those readers and writers
be the transports for those Protocols. ProcessProtocol would then not
see the communication at all and would basically be just a holder for
signalProcess and processEnded.

Were you thinking of building the forwarding logic into
ProcessProtocol, so that the readers and writers talk to the
ProcessProtocol, which then forwards everything onwards to the
different Protocols? If so, what would be the transport for those
Protocols? I'd like them to be able to say transport.loseConnection()
to close the communication channel they are using, so at least some
kind of new transport forwarding these calls would be needed. Or would
they have the readers and writers as transports, with those readers
and writers communicating to them through the ProcessProtocol? I'm not
sure I'd like an asymmetric setup like that.

> It seems like a good first step here would be refactoring the
> various flavors of file descriptor (including the mess in
> _pollingfile and _dumbwin32proc) so that they can be easily and
> portably invoked by users wanting to create out-of-band (non
> stdin/stdout) channels of communication between their processes.  I
> think it would be best if these channels could have a "flavor
> requirement" specified (i.e. pipe, socketpair, PTY), but also have a
> general stream-based transport API that would work the same from a
> high-level application's point of view on Windows and Mac and Linux;
> for example, use pipes if available, numeric unix sockets if not,
> localdomain sockets if that's not available either, depending on
> platform.
>
> All of this is leaning towards making 0,1,2 as un-special as
> possible, which I like very much.  That also implies that you'll
> need to clean up stdio.StandardIO, and instead add an API like
> reactor.connectParentFD(factory, fileno, flavor=None).  On my first
> read through I started trying to describe a way to communicate the
> expected list of file descriptors and their settings to the child
> process but that seems best left up to the application.

Okay, I didn't get most of the stuff here. I need more explanation to
understand what's going on. I remember fighting with stdio.StandardIO
at some point, so cleaning that up sounds like something that should
be done.

When I'm narrowly looking at this process spawning stuff, I see a
communication channel as a single object that has three places it does
its magic at.

There is the setupPrefork() stage, which is called first, before
anything special is done. It does not return anything; it just does
stuff, if necessary. Then we fork. On the child side we call
setupChild(). This does everything necessary for setting up the
communication channel for the child process to use, and it returns a
single file descriptor (a file descriptor is just about the only
reasonable thing to survive an exec, so there is no need for variance
here). On the parent side we call setupParent(). This sets up the
communication channel on the parent side and returns a transport that
is capable of reading and/or writing the channel.

As an example, for ReadPipe(), setupPrefork() would call os.pipe()
and mark down the read and write file descriptors. Then setupChild()
would close the read fd and return the write fd. And setupParent()
would close the write fd and set up a ProcessReader handling the read
fd.

Another example: for ReadNamedPipe('/tmp/namedpipe'), setupPrefork()
would call mknod to create the named pipe if necessary, but would open
no file descriptors. Then setupChild() would open the pipe for writing
and return the fd. And setupParent() would open the pipe for reading
and set up a ProcessReader handling the fd.
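
Roughly, the interface I have in mind for these helpers - the method
names are provisional, and this is only a sketch of the ReadPipe
example above:

  import os

  class ReadPipe:
      # A channel the parent reads from and the child writes to.

      def setupPrefork(self):
          # Before the fork: create the pipe and mark down both ends.
          self.readFD, self.writeFD = os.pipe()

      def setupChild(self):
          # In the child: keep only the write end and return the one
          # fd that should survive the exec.
          os.close(self.readFD)
          return self.writeFD

      def setupParent(self):
          # In the parent: keep only the read end. The real version
          # would wrap it in a ProcessReader and return that as the
          # transport; here the bare fd is returned for illustration.
          os.close(self.writeFD)
          return self.readFD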

>> Backwards compatibility would be obtained by still allowing the dict
>> type in childFDs and just mapping them to the new object inside
>> spawnProcess. Also the custom protocols would be optional and the
>> default protocol would just call childDataReceived and
>> childConnectionLost on the ProcessProtocol.
>
> This is why I'd strongly prefer the logic remain in ProcessProtocol.
> There should be _one_ location for dealing with the state of a
> running subprocess and its shared FD map; the ProcessProtocol seems
> the logical place.  We could hide it all inside the reactor, but I
> am pretty sure that important aspects will forget to be exposed.

I'm still not entirely sure what you mean here. The Process object
(the "transport" or so) would hold the *state* of the running
subprocess and all its communication channels. The ProcessProtocol
would contain the stuff *dealing* with the state. And the initial
configuration of what to set up would either come from spawnProcess
(Process object initializer arguments) or from ProcessProtocol somehow
- or actually both.

It is obvious that ProcessProtocol can't be responsible for all the
configuration the Process needs - it shouldn't, for example, specify
the name of the executable being executed. And I don't think it should
be responsible for the chdirs, chroots and such either - those should
be changeable at execution time without touching the ProcessProtocol,
just as it is possible to change the address a TCP connection is made
to without changing the Protocol of that connection. But I do think it
could be responsible for the FD mapping of the process, since that is
intimately tied to the logic of the ProcessProtocol.

>> The other feature I would like to implement while mucking around
>> with the process module is the preexec_fn feature of the subprocess
>> module. That is, a function that gets executed just before the
>> child is exec'd. There can probably be other uses for this as well,
>> but the main usage for this would be a chroot call. I'm considering
>> if it should be a list of functions executed in order - since there
>> might be a need to do a chroot, a chdir and another chroot or
>> something similar - but of course the user can just supply a
>> function that does that.
>
> This gives me another good idea.  Right now there's a bunch of stuff
> that happens in the guts of the process module; chdir, setuid, fdmap
> setup.  Most of this could be factored into an UNIX-specific
> ProcessProtocol, centralized in "beforeFork" and "beforeExec"
> methods.  I'm not really sure how to best clearly expose that
> non-portable features are required when invoking spawnProcess, but
> the obvious idea is that if
> IUNIXProcessProtocol.providedBy(yourProtocol), barf on Windows,
> Jython (if they ever do another release and we add support for it
> again), etc.  Then you could explicitly subclass either
> UNIXProcessProtocol if you explicitly needed nonportable features
> (which would still allow you to use them on UNIX;
> UNIXProcessProtocol == ProcessProtocol, on platforms which support
> it); but subclassing ProcessProtocol would select a more
> platform-appropriate method.
>
> Originally I thought that the process-communication logic would be
> specific to a reactor, which is why it's factored as it is; after a
> few years of working with process spawning closely though it is
> clear to me that differences are entirely a feature of the OS and
> not of the reactor.

Most of this went past me as well.

If the setup stuff like chdir, setuid, fdmap, etc. is factored into a
UNIX-specific ProcessProtocol, then how is it configured when spawning
the process? I'd hate to have to make a new ProcessProtocol, or change
one, every time I wanted to spawn an executable in a different path.

I have no idea whatsoever what the best way is to expose the
non-portability of many of these things. As it stands now, some
communication channels obviously are non-portable - and in my mind
it's perfectly fine that SocketPair() throws an exception on
instantiation if the platform doesn't support it ;-)

In general, creating processes is such an OS-specific thing that there
are more differences than similarities. The feature sets are just too
disjoint. So I see this as simply allowing the user to pick the
features they want - and if those can't be supported, we can blow up.

> Well, I liked your ideas, and then they made me think of some ideas
> that I liked even more :).  I hope that you'll continue to work on
> this.  Please feel free to copy and paste chunks of this (both your
> ideas and mine) into a ticket.

I created issue #1939 for this and added some of this conversation to
it.

> I'm working on at least 3 projects right now which use process
> spawning and these changes would improve all of them; although only
> one will be running long enough to actually see any of these
> implemented, I'm still pretty excited about the idea.

Running long enough? I intend to be finished by tonight ;-)

Well, not really, but I do hope to finish this whole thing rather fast
and not let it linger.

On that note, I made a first attempt at some of the features. It's
filled with FIXMEs, has no documentation, bad class names and no
specific test cases. But it provides a compatible API, and the process
test suite passes with it.

What it does is define a set of communication channel helper objects.
Right now there are PassthroughFD, ReadPipe, WritePipe, PTY and
SocketPair. SocketPair is not implemented.

The API for using these is *really* convoluted now, but it does
support all the features.

The childFDs dictionary, in its native form, should now contain
entries of the form:

  { <name>: (<helper instance>, <list or tuple of fd numbers>) }

The name part is what the communication channel will be known as in
the ProcessProtocol - that is, the value passed to childDataReceived,
childConnectionLost etc. The helper instance is an instance of one of
the helper classes. The list or tuple of fd numbers is the child file
descriptor numbers the channel should ultimately be mapped to.
Additionally, a name of '_defaultWriter' specifies the target name
that transport.write() calls should write to. There are a few
shorthands:

 { <name>: <helper instance> } is mapped to:
 { <name>: (<helper instance>, (<name>,)) }

 { <name>: 'r' } is mapped to:
 { <name>: (ReadPipe(), (<name>,)) }

 { <name>: <int> } is mapped to:
 { <name>: (PassthroughFD(<int>), (<name>,)) }
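
To make the expansion concrete, the mapping above works out roughly
like this (only an illustration using the helper names above - the
real code is still full of FIXMEs; '_defaultWriter' is left alone
since its value is a channel name, not a channel spec):

  def expandChildFDs(childFDs):
      expanded = {}
      for name, spec in childFDs.items():
          if name == '_defaultWriter':
              # The value names the channel transport.write() targets.
              expanded[name] = spec
          elif isinstance(spec, tuple):
              # Already the full (helper instance, fd numbers) form.
              expanded[name] = spec
          elif spec == 'r':
              # Shorthand for a pipe the parent reads from.
              expanded[name] = (ReadPipe(), (name,))
          elif isinstance(spec, int):
              # Shorthand for passing an existing parent fd through.
              expanded[name] = (PassthroughFD(spec), (name,))
          else:
              # A bare helper instance maps to its own name.
              expanded[name] = (spec, (name,))
      return expanded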

So, to achieve behaviour similar to that of the old PTYProcess, you
can say:

 { '_defaultWriter': 1,
   1: (PTY(), (0, 1, 2)) }

That is, create a PTY helper that allocates a new PTY, name it 1 in
the ProcessProtocol, and map it to file descriptors 0, 1 and 2 on the
child. Also, make the default write method write to the helper named
1.

This way, the outReceived() method is called on the ProcessProtocol
when data is received from the PTY - and transport.write() will write
data to the PTY.
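
Put together, spawning a shell with that old-PTYProcess-style mapping
would look roughly like this with the draft (the exact spawnProcess
signature is still in flux, and /bin/sh is just an example):

  from twisted.internet import protocol, reactor

  class ShellProtocol(protocol.ProcessProtocol):
      def connectionMade(self):
          # write() goes to the '_defaultWriter' channel, i.e. the PTY.
          self.transport.write('echo hello\n')

      def childDataReceived(self, name, data):
          # name is 1 here - the name given to the PTY channel below.
          print 'from child %r: %r' % (name, data)

      def processEnded(self, reason):
          reactor.stop()

  childFDs = {'_defaultWriter': 1,
              1: (PTY(), (0, 1, 2))}

  reactor.spawnProcess(ShellProtocol(), '/bin/sh', ['sh'],
                       childFDs=childFDs)
  reactor.run()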

I shall ponder these things for a while - and start implementing the
next phase when I have a good idea of what it should be.

-- Naked





