Opened 3 years ago

Closed 9 months ago

#5151 defect closed fixed (fixed)

_SIGCHLDWaker in twisted.internet.posixbase doesn't check its pipe is still open when registering signal handlers

Reported by: hodgestar Owned by: hodgestar
Priority: normal Milestone:
Component: core Keywords:
Cc: itamar@… Branch: branches/process-exited-errors-5151-2
(diff, github, buildbot, log)
Author: itamarst, hodgestar Launchpad Bug:

Description

The function install in _SIGCHLDWaker looks like:

    def install(self):
        """
        Install the handler necessary to make this waker active.
        """
        _signals.installHandler(self.o)

Here self.o may evaluate to None causing installHandler to throw an exception. This may happen if connectionLost is somehow called before installHandler.

One possible fix is:

    def install(self):
        """
        Install the handler necessary to make this waker active.
        """
        if self.o is not None:
            _signals.installHandler(self.o)

I've observed this in the wild inside a moderately complex test suite. Assistance in understanding the "somehow" that causes self.o to become None would be appreciated.

The following note inside _handleSignals in posixbase might be relevant since it explains one possible scenario in which related code may be executed before the event handlers are installed:

# Also reap all processes right now, in case we missed any
# signals before we installed the SIGCHLD waker/handler.
# This should only happen if someone used spawnProcess
# before calling reactor.run (and the process also exited
# already).

Attachments (6)

waker-process.patch (1.0 KB) - added by exarkun 3 years ago.
test-process-exited-raises.diff (1.2 KB) - added by hodgestar 3 years ago.
The first draft of a test that spawns a process and explodes the reactor.
test-process-exited-raises.2.diff (1.2 KB) - added by hodgestar 3 years ago.
Less broken version of the simple test that raises during processExited and explodes the reactor. Fails (and explodes all later tests) without exarkun's patch. Passes with it.
test-process-exited-raises.3.diff (1.4 KB) - added by hodgestar 3 years ago.
Version of previous patch that doesn't wreck the rest of the test suite if the test fails.
test-process-exited-raises.4.diff (1.5 KB) - added by hodgestar 3 years ago.
Even better test that explicitly raises an error on failure and doesn't erroneously fail if the processExited exception is logged.
issue-5151-1.diff (2.6 KB) - added by hodgestar 3 years ago.
Single patch with fix, test and NEWS update.

Download all attachments as: .zip

Change History (33)

comment:1 Changed 3 years ago by hodgestar

Steps to reproduce:

wget http://pypi.python.org/packages/source/k/katcp/katcp-0.3.4.tar.gz
tar xzf katcp-0.3.4.tar.gz 
cd katcp-0.3.4/
sed -i -e 's/30)/35)/' katcp/tx/test/test_core.py # arrange for a test failure
trial katcp/tx/test # watch all sorts of other tests explode in interesting ways

comment:2 Changed 3 years ago by exarkun

Part of the problem here is that _SIGCHLDWaker.doRead might raise an exception due to its call to process.reapAllProcesses. Normally reapAllProcesses should not raise any exceptions, but it appears to be in this case for some reason.

Either _SIGCHLDWaker should put a try/except around that call, or reapAllProcesses should be more careful not to raise any exceptions.

The exception being raised here is:

          File "/home/exarkun/Projects/Twisted/trunk/twisted/internet/posixbase.py", line 575, in _doReadOrWrite
            why = selectable.doRead()
          File "/home/exarkun/Projects/Twisted/trunk/twisted/internet/posixbase.py", line 230, in doRead
            process.reapAllProcesses()
          File "/home/exarkun/Projects/Twisted/trunk/twisted/internet/process.py", line 46, in reapAllProcesses
            process.reapProcess()
          File "/home/exarkun/Projects/Twisted/trunk/twisted/internet/process.py", line 308, in reapProcess
            log.msg('Failed to reap %d:' % self.pid)
        exceptions.TypeError: %d format: a number is required, not NoneType

comment:3 Changed 3 years ago by exarkun

This happens if application code raises an exception from processExited. The transport implementation doesn't deal with this well, and ends up leaving the dead process's reap handler in the dictionary. Next time a SIGCHLD arrives, the dead process's handler blows up like this and kills the waker.

Attaching a simple patch that probably fixes the issue; lacking unit tests, though.

Changed 3 years ago by exarkun

Changed 3 years ago by hodgestar

The first draft of a test that spawns a process and explodes the reactor.

Changed 3 years ago by hodgestar

Less broken version of the simple test that raises during processExited and explodes the reactor. Fails (and explodes all later tests) without exarkun's patch. Passes with it.

Changed 3 years ago by hodgestar

Version of previous patch that doesn't wreck the rest of the test suite if the test fails.

comment:4 Changed 3 years ago by hodgestar

  • Keywords review added

comment:5 follow-up: Changed 3 years ago by itamar

  • Keywords review removed

As I see it, there are two problems:

  1. If reapAllProcess raises, sigchld handler install blows up. Notice there's another path which can raise exceptions (unregisterReapProcessHandler).
  2. If processExited raises, reapAllProcesses stops. This is different because it might e.g. cause a second process to never be reaped (at least, until reapAllProcesses is called again, which may be never.)

The patch fixes (1). The test is made to pass by fixing (1), but fixing (2) would also make it pass.

The test should be changed to only test (1) (e.g. make reapAllProcesses raise via monkeypatching or something?). (2) should either be fixed (try/except around processExited call) and tested, or should get its own ticket, up to you.

comment:6 Changed 3 years ago by itamar

  • Cc itamar@… added

comment:7 in reply to: ↑ 5 Changed 3 years ago by hodgestar

Replying to itamar:

As I see it, there are two problems:

The intent of the test is to check whether the reactor is still healthy after processExited raises. This seems closer to (2) rather than (1).

Regarding (1), is there an example other than (2) in which reapAllProcesses might raise? If there is another way, I suggest that other way be spun out into a separate ticket instead.

Agreed that only the patch to either posixbase.py or _baseprocess.py is needed to make the test pass. I would prefer the patch to _baseprocess.py to go in.

Changed 3 years ago by hodgestar

Even better test that explicitly raises an error on failure and doesn't erroneously fail if the processExited exception is logged.

comment:8 Changed 3 years ago by itamar

As I mentioned, unregisterReapProcessHandler may in theory raise an exception, which will also break the reactor. So the initial reported symptom is (1), not (2).

I still don't think the test is sufficient, because it (I assume) is passing, it is passing with a fix for (1) even though it's intended to be testing a fix for (2).

comment:9 Changed 3 years ago by hodgestar

Actually my latest test does appear to distinguish between exarkun's two fixes (I should have noticed this earlier). The fix to posixbase no longer makes the test pass. The fix to _baseprocess does. If processEnded raises unregisterReapProcessHandler is never called and the test picks up the un-reaped process. Any fix to only unregisterReapProcessHandler no longer passes the test because the unregisterReapProcessHandler isn't called in the case where the test fails.

Changed 3 years ago by hodgestar

Single patch with fix, test and NEWS update.

comment:10 Changed 3 years ago by hodgestar

  • Keywords review added

comment:11 Changed 3 years ago by exarkun

  • Owner set to exarkun

comment:12 Changed 3 years ago by exarkun

  • Author set to exarkun
  • Branch set to branches/process-exited-errors-5151

(In [32119]) Branching to 'process-exited-errors-5151'

comment:13 Changed 3 years ago by exarkun

(In [32120]) Apply hodgestar's issue-5151-1.diff

refs #5151

comment:14 follow-up: Changed 3 years ago by exarkun

  • Author changed from exarkun to hodgestar
  • Keywords review removed
  • Owner changed from exarkun to hodgestar

Thanks!

  1. Some minor stuff:
    1. Test docstrings should just say what behavior they're testing, without words like "Test that ..." or "should". Also, the test doesn't actually verify that the reactor doesn't explode, and no one really knows what "explode" means here anyway, so it's not a good thing to say. :)
    2. You can use flushLoggedErrors like this, and you probably should: self.assertEqual(1, len(self.flushLoggedErrors(TestException))). This both lets your test pass even with the error logged and makes you confident it actually was logged.
    3. The news fragment doesn't go in the NEWS file yet. :) It goes into "twisted/topfiles/5151.bugfix" file. This is a trick we use to avoid lots of NEWS conflicts.
  2. Some deeper things:
    1. In a ReactorBuilder-style test, use self.runReactor(reactor) instead of reactor.run(). This gives you a timeout and maybe some other stuff that's nice.
    2. reactor.callLater(0.01, ...) is a little gross. Magic timing numbers mean fragile tests and unreproducable behavior. reactor.callLater(0, ...) is usually okay, because 0 really just means "after I'm done here". Does that work in this scenario, or is the 0.01 actually doing something? Another approach might be to stop the reactor in the processEnded callback. And another might even be to just stop it synchronously in processExited.
    3. The test needs to skip on Windows, where there is no process module (notice how the process module is imported at the top of test_process.py. There are some examples of platform-specific skips in this module already that should demonstrate how to do this.
    4. The inspection of the process module state is also somewhat whitebox-y, in what could otherwise be a nice blackbox test. I wonder if it could be eliminated (and therefore the need to skip on windows eliminated, too) by verifying success a different way - perhaps running a second process successfully? However, some of the earlier comments on this ticket confuse me, and I haven't read the earlier versions of the patch. Feel free to tell my why the way the test is written now is the way it should stay written.

(suddenly out of time, but I think that's most of a review, sorry in advance if it turns out I missed anything :)

I checked your patch and some minor fixes to it into the branch now linked on the ticket. Please make future patches against the branch. Thanks again!

comment:15 Changed 3 years ago by exarkun

(In [32122]) minor fixes

refs #5151

comment:16 in reply to: ↑ 14 Changed 3 years ago by hodgestar

  • Status changed from new to assigned

Replying to exarkun:

Test docstrings should just say what behavior they're testing, without words like "Test that ..." or "should". Also, the test doesn't actually verify that the reactor doesn't explode, and no one really knows what "explode" means here anyway, so it's not a good thing to say. :)

Here "explode" means that global state the reactor relies on (i.e. the registered process reap handlers) becomes inconsistent and the reactor is no longer usable to spawn new processes or proceed with other tests. The whitebox-y check at the end of the test cleans up any broken state so that later tests can proceed.

You can use flushLoggedErrors like this, and you probably should: self.assertEqual(1, len(self.flushLoggedErrors(TestException))). This both lets your test pass even with the error logged and makes you confident it actually was logged.

Cool (I learned something new) and thanks for fixing this in the branch.

The news fragment doesn't go in the NEWS file yet. :) It goes into "twisted/topfiles/5151.bugfix" file. This is a trick we use to avoid lots of NEWS conflicts.

Cool (and thanks for fixing this too).

In a ReactorBuilder-style test, use self.runReactor(reactor) instead of reactor.run(). This gives you a timeout and maybe some other stuff that's nice.

Will fix.

reactor.callLater(0.01, ...) is a little gross. Magic timing numbers mean fragile tests and unreproducable behavior. reactor.callLater(0, ...) is usually okay, because 0 really just means "after I'm done here". Does that work in this scenario, or is the 0.01 actually doing something? Another approach might be to stop the reactor in the processEnded callback. And another might even be to just stop it synchronously in processExited.

Using 0 here is fine. Just reactor.stop() works too. Will fix. The reactor.callLater(0.01, ...) was the result of a variety of experiments to try regain control in the case were the test fails.

The test needs to skip on Windows, where there is no process module (notice how the process module is imported at the top of test_process.py. There are some examples of platform-specific skips in this module already that should demonstrate how to do this.

The inspection of the process module state is also somewhat whitebox-y, in what could otherwise be a nice blackbox test. I wonder if it could be eliminated (and therefore the need to skip on windows eliminated, too) by verifying success a different way - perhaps running a second process successfully? However, some of the earlier comments on this ticket confuse me, and I haven't read the earlier versions of the patch. Feel free to tell my why the way the test is written now is the way it should stay written.

See comment replying to the first question -- something needs to fix reapProcessHandlers if the test fails. I'm not sure if similar clean up needs to happen on Windows (I don't have Windows). If not, we could just skip this part of the test if process is not imported (since now the logging of the TestException is checked).

Thanks for the review.

comment:17 Changed 3 years ago by hodgestar

(In [32127]) Use self.runReactor instead of calling reactor.run directly.

refs #5151

comment:18 Changed 3 years ago by hodgestar

(In [32128]) Call reactor.stop directly rather than via an unnecessary reactor.callLater.

refs #5151

comment:19 Changed 3 years ago by hodgestar

(In [32129]) Perform process module read handler clean-up only when the process module
is available. Move assertion that process exited exception was logged until
after the clean-up is finished.

refs #5151

comment:20 Changed 3 years ago by hodgestar

  • Keywords review added

All the comments from the previous review have been dealt with in the branch now so I think it's ready for the next one.

comment:21 Changed 3 years ago by hodgestar

  • Owner hodgestar deleted
  • Status changed from assigned to new

comment:22 Changed 3 years ago by exarkun

  • Keywords review removed
  • Owner set to hodgestar

Thanks! I think the branch is in good shape to be merged.

However, it would be nice if test_processExitedRaises could eventually stop poking the process module. I think this means we should have some direct unit tests for twisted.internet.process (perhaps in twisted/internet/test/test_posixprocess.py) for various exceptional cases of reapAllProcesses, registerReapProcessHandler, and unregisterReapProcessHandler. These can be as whitebox as necessary, since twisted.internet.process is just a big implementation detail. The other cases where an exception might be raised (mentioned by Itamar above) are interesting too, and we want to make sure they're also being handled correctly.

One other trivial thing: the news file puts "Broken process handlers" up front, but I think people reading the news file will be more interested in "a ProcessProtocol.processExited that raises an exception". Can you reformulate the news fragment to change the focus so it's more understandable to an application developer who might be affected by the change?

If you do that and file a new ticket the first point and mention it in the comment above the process-inspecting part of test_processExitedRaises (and if the build results look good), then I think the branch will be in great shape to be merged.

Thanks again.

comment:23 Changed 3 years ago by hodgestar

(In [32204]) Re-phrase NEWS file entry so that the root cause from an application
developer's perspective is more prominent.

refs #5151

comment:24 Changed 13 months ago by itamar

Looks like all that needs doing is filing a new ticket, mentioning in a test, and then this can be merged?

comment:25 Changed 9 months ago by itamar

comment:26 Changed 9 months ago by itamarst

  • Author changed from hodgestar to itamarst, hodgestar
  • Branch changed from branches/process-exited-errors-5151 to branches/process-exited-errors-5151-2

(In [41114]) Branching to 'process-exited-errors-5151-2'

comment:27 Changed 9 months ago by itamarst

  • Resolution set to fixed
  • Status changed from new to closed

(In [41116]) Merge process-exited-errors-5151-2.

Fixes: #5151
Author: hodgestar
Reviewer: exarkun

Handle exceptions raised by ProcessProtocol.processExited.

Note: See TracTickets for help on using tickets.