[Twisted-Python] spawnProcess - reapProcess not retrying on failures

Justin Mazzola Paluska jmp at editshare.com
Tue Sep 2 05:05:33 MDT 2014


On 09/02/2014 05:08 AM, Adi Roiban wrote:
> Hi,
>
> While using spawnProcess on Linux I found out that when an invalid
> executable is called there is a corner case in which a zombie process
> is left until the main process exits and cannot be reaped.
>
> I wrote a test for this but I was not able to reproduce this error in
> isolation, even if I run the test 10,000 times. reapProcess will
> always succeed on the first call.
>
> For the production code I can always reproduce the problem.
>
> Inspecting the execution thread I found out that all pipes are closed
> but the spawned process has not exited yet. Due to this,
> Process.maybeCallProcessEnded() will call self.reapProcess().
>
> In my case, os.waitpid(pid, os.WNOHANG) returns 0, and
> self.reapProcess() will just ignore this case.
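
For reference, that is standard non-blocking waitpid behaviour: os.waitpid(pid, 
os.WNOHANG) returns (0, 0) while the child is still alive, and only returns the 
real pid once the child has terminated, so a reapProcess that never retries 
never collects it.  A minimal sketch outside Twisted (the fork and the sleep 
durations are purely for illustration):

-----

import os
import time

pid = os.fork()
if pid == 0:
    # Child: stay alive briefly, then exit.
    time.sleep(2)
    os._exit(0)

# Parent: a non-blocking waitpid on a live child returns (0, 0),
# which is exactly the case reapProcess currently ignores.
print(os.waitpid(pid, os.WNOHANG))

time.sleep(3)

# Once the child has exited, the same call reaps it and returns
# (pid, status).
print(os.waitpid(pid, os.WNOHANG))

-----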

We encountered this problem in our code too.  We worked around it with the 
following code, which basically monkey-patches Twisted to "try again later" when 
waitpid returns 0.  (Most of the code below is just copied from _BaseProcess; 
the important part is the "elif pid == 0" branch.)

-----

"""Workarounds for problems with Twisted."""

import errno
import os

from twisted.python import log
from twisted.internet.process import (
    _BaseProcess,
    reapAllProcesses,
    unregisterReapProcessHandler,
)

def workaround_reapProcess(reactor):
    """Install a workaround to unstick reapProcess.

    Sometimes a child process takes long enough to die that reapProcess
    doesn't catch it in time.  We add a hack that schedules a delayed
    call on the reactor to try reaping again later.
    """

    def reapProcess(self):
        """
        Try to reap a process (without blocking) via waitpid.

        This is called when SIGCHLD is caught or a Process object loses
        its "connection" (stdout is closed).  This ought to result in
        reaping all zombie processes, since it will be called twice as
        often as it needs to be.

        (Unfortunately, this is a slightly experimental approach, since
        UNIX has no way to be really sure that your process is going to
        go away w/o blocking.  I don't want to block.)
        """
        try:
            try:
                pid, status = os.waitpid(self.pid, os.WNOHANG)
            except OSError as e:
                if e.errno == errno.ECHILD:
                    # No child process.
                    pid = None
                else:
                    raise
        except:
            log.msg('Failed to reap %d:' % self.pid)
            log.err()
            pid = None
            status = None
        if pid:
            self.processEnded(status)
            unregisterReapProcessHandler(pid, self)
        elif pid == 0:
            # Twisted seems to get stuck if pid is 0, which means the
            # child process hasn't changed status yet.  When this
            # happens right after SIGCHLD, the child is probably in
            # the middle of dying but hasn't quite died yet, so kick
            # the reactor to reap processes again in a bit.
            #
            # We test specifically against 0 because pid may also be
            # None in the error cases above.
            def unstick():
                reapAllProcesses()
            reactor.callLater(1, unstick)

    _BaseProcess.reapProcess = reapProcess

-----

To use this, import your reactor and then call workaround_reapProcess(reactor).
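
For example, a hypothetical usage sketch (the module name "workarounds" is just 
a placeholder for wherever the code above lives):

-----

from twisted.internet import reactor

# "workarounds" is a placeholder module name for the patch code above.
from workarounds import workaround_reapProcess

# Install the patch once, before spawning any child processes.
workaround_reapProcess(reactor)

# ... then reactor.spawnProcess(...) and reactor.run() as usual ...

-----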

Now that two of us have seen the same problem, we should probably file a ticket 
in the bug tracker.
     --Justin



