Opened 8 years ago

Closed 6 years ago

#1997 defect closed fixed (fixed)

perhaps wakeUp could be slightly simpler

Reported by: wingo Owned by: exarkun
Priority: lowest Milestone:
Component: core Keywords:
Cc: spiv, jknight, exarkun, glyph Branch: branches/fix-wakeup-1997
(diff, github, buildbot, log)
Author: exarkun Launchpad Bug:

Description (last modified by glyph)

The test for what thread we're in is unnecessary, and slightly worsens a race with select() because the handler could be called after processing events in the queue but before entering select().

We're leaving it this way for now, because the race is easier to trigger in its current configuration (and still exists in other configurations), but will be removing this redundant and unnecessary code either after or during the fix for the real underlying issue.

See #791.

Attachments (1)

nocalllater.diff (679 bytes) - added by glyph 8 years ago.
my suggested fix, hopefully incredibly simple

Download all attachments as: .zip

Change History (22)

comment:1 follow-up: Changed 8 years ago by spiv

  • Cc spiv added

comment:2 Changed 8 years ago by glyph

Probably a good idea. The waker also needs unit tests.

comment:3 Changed 8 years ago by jknight

Huh "oops". Indeed, that isInIOThread shouldn't be there. I saw this bug in passing quite a while ago, and had a false memory that I'd already fixed it. :(

comment:4 Changed 8 years ago by exarkun

Suggestions for a unit test for this behavior?

comment:5 Changed 8 years ago by exarkun

Here's a suggested fix for the problem. Still no ideas for unit testing, but perhaps this suggestion will lend itself to some ideas.

Get rid of sigInt, sigBreak, sigTerm and _handleSigchld. Add a single signal handler utility method that conceptually does nothing more than:

signals.append('name of signal')
wakeUp()

The reactor I/O thread will check signals and dispatch to registered handlers as necessary.

comment:6 in reply to: ↑ 1 Changed 8 years ago by glyph

A simpler fix would simply be to remove the callLater branch entirely. The only reason that was there was that previously the waker pipe was blocking; it was long ago made non-blocking for other reasons.

I don't see any reasonable way to unit-test this behavior, since it is completely asynchronous and may occur at any time. However, sigBreak, sigTerm and _handleSigchld are definitely ugly and could be removed and replaced with one function, perhaps by a separate ticket... it really seems like we ought to be able to do something else by centralizing signal handling but I can't think of what.

Changed 8 years ago by glyph

my suggested fix, hopefully incredibly simple

comment:7 Changed 8 years ago by jknight

  • Cc jknight exarkun glyph added

Look, I've no idea what ya'll are going on about, but the problem is in wakeUp itself, as I thought had been made clear above. Below patch should be applied.

Index: twisted/internet/base.py
===================================================================
--- twisted/internet/base.py    (revision 17708)
+++ twisted/internet/base.py    (working copy)
@@ -296,11 +296,10 @@
 
     def wakeUp(self):
         """Wake up the event loop."""
-        if not threadable.isInIOThread():
-            if self.waker:
-                self.waker.wakeUp()
-            # if the waker isn't installed, the reactor isn't running, and
-            # therefore doesn't need to be woken up
+        if self.waker:
+            self.waker.wakeUp()
+        # if the waker isn't installed, the reactor isn't running, and
+        # therefore doesn't need to be woken up
 
     def doIteration(self, delay):
         """Do one iteration over the readers and writers we know about."""

comment:8 Changed 8 years ago by exarkun

Yes, that patch probably fixes the problem. I was suggesting simplifying the entire signal handling implementation, since it is basically impossible to unit test. As far as I can tell, your patch is a necessary part of the solution I was suggesting.

comment:9 Changed 8 years ago by glyph

Oh, yes. A necessary part of mine as well. Without the change that I suggested, the waker won't even be invoked in the SIGCHLD case (although your fix will fix _other_ signal-handling race conditions).

comment:10 Changed 8 years ago by glyph

Whoops. I'm totally wrong.

The patch I posted is probably a good idea (along with other changes that would make it actually work in a thread-less environment) but is a separate issue entirely. Thread support has nothing to do with whether you are running in a thread or not.

For the normal case James's patch is adequate. I think we should apply it, because I still can't think of a way to test it.

comment:11 Changed 8 years ago by exarkun

Quotient's unit tests semi-reliably produce some kind of process reaping bug. FWIW, applying James' patch above didn't fix it. This might be a trial bug rather than a real reactor bug, but I can't see how. I don't understand why the patch wouldn't have fixed it otherwise.

comment:12 Changed 8 years ago by glyph

  • Resolution set to wontfix
  • Status changed from new to closed

The bug is more insidious than we had previously thought.

There is an inherent race in Python here, perhaps an unfixable one. Everywhere from the last trip through the ceval loop to the start of the actual 'select' syscall, a signal may arrive, be recognized by Python's signal handler, filed for later execution, and then select runs with its given timeout (probably None), which blocks forever.

JP verified this by inserting a raise(SIGINT) into the line before the actual select call in selectmodule.c. A Python SIGINT handler installed with that modified select module would only run _after_ the select call.

The only potentially correct solution in Python is described in #791. We may add an optimization in C later, which would register a syscall that would directly write to the waker's file descriptor, without waiting for Python code to execute, thus obviating the need for the looping call.

comment:13 Changed 8 years ago by jknight

  • Resolution wontfix deleted
  • Status changed from closed to reopened

Reopening, as this fix should be made anyways.

comment:14 Changed 8 years ago by glyph

  • Priority changed from normal to lowest
  • Summary changed from signals handled via callFromThread, yet the reactor still drops into select() to perhaps SIGCHLD handler could be slightly simpler

I think the fixes in #791 will probably obsolete this fix, and it doesn't actually "fix" anything, it just makes the race infinitesimally less race-y. The thread-support thing is also silly, and should also be applied (along with other changes) purely in the interests of code cleanliness, but it also isn't a fix for any real issue.

comment:15 Changed 8 years ago by glyph

  • Description modified (diff)
  • Owner glyph deleted
  • Status changed from reopened to new
  • Summary changed from perhaps SIGCHLD handler could be slightly simpler to perhaps wakeUp could be slightly simpler

Changing the description too, since apparently we're repurposing this ticket for a minor change rather than the actual fix:

The original description (thanks trac) was:

reactor.iterate() is defined like this:

    def iterate(self, delay=0):
       """See twisted.internet.interfaces.IReactorCore.iterate.
       """
       self.runUntilCurrent()
       self.doIteration(delay)

runUntilCurrent runs all the procs on the threadCallQueue. So if
something is added to the threadCallQueue between runUntilCurrent()
and doIteration(), the reactor needs to have an fd ready for reading
to shortcut the select(). This is done by callFromThread() calling
reactor.wakeUp(), which will write on the wakeup FD.

HOWEVER. For some reason reactor.wakeUp() only writes on the fd if it
is being called from another thread. This is obviously borked in the
signal-handling case, when a signal arrives between runUntilCurrent()
and doIteration(), and is processed via reactor.callFromThread(), as
is the case with SIGCHLD.

Solution: have reactor.wakeUp() always wake the waker, regardless of
the thread it comes from.

The solution is incorrect, for reasons given above.

comment:16 Changed 7 years ago by exarkun

In addition to #791, see #733 and #2535.

comment:17 Changed 6 years ago by glyph

  • Owner set to glyph
  • Status changed from new to assigned

comment:18 Changed 6 years ago by exarkun

  • Author set to exarkun
  • Branch set to branches/fix-wakeup-1997

(In [25654]) Branching to 'fix-wakeup-1997'

comment:19 Changed 6 years ago by exarkun

  • Keywords review added
  • Owner glyph deleted
  • Status changed from assigned to new

Put the fix james recommended into the branch, plus sort of kind of a unit test maybe a bit.

comment:20 Changed 6 years ago by washort

  • Keywords review removed
  • Owner set to exarkun

well, the test fails without the change, and passes with it, so I'd call that a success :)
Looks good to merge.

comment:21 Changed 6 years ago by exarkun

  • Resolution set to fixed
  • Status changed from new to closed

(In [25693]) Merge fix-wakeup-1997

Author: exarkun
Reviewer: washort
Fixes: #1997

Change the reactor wake up method so that it tries to wake up the reactor no
matter what thread it is called from. This avoids a potential deadlock where
the wake up method is called from a signal handler in the main thread after
the thread call queue has been checked but before the event notification API
(ie select, poll, etc) has been entered.

Note: See TracTickets for help on using tickets.