Ticket #1997 (assigned defect )

Opened 2 years ago

Last modified 5 months ago

perhaps wakeUp could be slightly simpler

Reported by: wingo Assigned to: glyph (accepted)
Type: defect Priority: lowest
Milestone: Component: core
Keywords: Cc: spiv, jknight, exarkun, glyph
Branch: Author:
Launchpad Bug:

Description (last modified by glyph)

The test for what thread we're in is unnecessary, and slightly worsens a race with select() because the handler could be called after processing events in the queue but before entering select().

We're leaving it this way for now, because the race is easier to trigger in its current configuration (and still exists in other configurations), but will be removing this redundant and unnecessary code either after or during the fix for the real underlying issue.

See #791.

Attachments

nocalllater.diff (0.7 kB) - added by glyph 2 years ago.
my suggested fix, hopefully incredibly simple

Change History

follow-up: ↓ 6   2006-08-14 13:23:08+00:00 changed by spiv

  • cc set to spiv

  2006-08-14 14:04:48+00:00 changed by glyph

Probably a good idea. The waker also needs unit tests.

  2006-08-14 15:13:54+00:00 changed by jknight

Huh "oops". Indeed, that isInIOThread shouldn't be there. I saw this bug in passing quite a while ago, and had a false memory that I'd already fixed it. :(

  2006-08-21 01:44:30+00:00 changed by exarkun

Suggestions for a unit test for this behavior?

  2006-08-23 19:56:35+00:00 changed by exarkun

Here's a suggested fix for the problem. Still no ideas for unit testing, but perhaps this suggestion will lend itself to some ideas.

Get rid of sigInt, sigBreak, sigTerm and _handleSigchld. Add a single signal handler utility method that conceptually does nothing more than:

signals.append('name of signal')
wakeUp()

The reactor I/O thread will check signals and dispatch to registered handlers as necessary.

in reply to: ↑ 1   2006-08-26 04:46:17+00:00 changed by glyph

A simpler fix would simply be to remove the callLater branch entirely. The only reason that was there was that previously the waker pipe was blocking; it was long ago made non-blocking for other reasons.

I don't see any reasonable way to unit-test this behavior, since it is completely asynchronous and may occur at any time. However, sigBreak, sigTerm and _handleSigchld are definitely ugly and could be removed and replaced with one function, perhaps by a separate ticket... it really seems like we ought to be able to do something else by centralizing signal handling but I can't think of what.

  2006-08-26 04:47:57+00:00 changed by glyph

  • attachment nocalllater.diff added

my suggested fix, hopefully incredibly simple

  2006-08-28 02:02:18+00:00 changed by jknight

  • cc changed from spiv to spiv, jknight, exarkun, glyph

Look, I've no idea what ya'll are going on about, but the problem is in wakeUp itself, as I thought had been made clear above. Below patch should be applied.

Index: twisted/internet/base.py
===================================================================
--- twisted/internet/base.py    (revision 17708)
+++ twisted/internet/base.py    (working copy)
@@ -296,11 +296,10 @@
     def wakeUp(self):
         """Wake up the event loop."""
-        if not threadable.isInIOThread():
-            if self.waker:
-                self.waker.wakeUp()
-            # if the waker isn't installed, the reactor isn't running, and
-            # therefore doesn't need to be woken up
+        if self.waker:
+            self.waker.wakeUp()
+        # if the waker isn't installed, the reactor isn't running, and
+        # therefore doesn't need to be woken up
     def doIteration(self, delay):
         """Do one iteration over the readers and writers we know about."""

  2006-08-28 12:23:38+00:00 changed by exarkun

Yes, that patch probably fixes the problem. I was suggesting simplifying the entire signal handling implementation, since it is basically impossible to unit test. As far as I can tell, your patch is a necessary part of the solution I was suggesting.

  2006-08-28 16:20:25+00:00 changed by glyph

Oh, yes. A necessary part of mine as well. Without the change that I suggested, the waker won't even be invoked in the SIGCHLD case (although your fix will fix _other_ signal-handling race conditions).

  2006-08-28 19:47:44+00:00 changed by glyph

Whoops. I'm totally wrong.

The patch I posted is probably a good idea (along with other changes that would make it actually work in a thread-less environment) but is a separate issue entirely. Thread support has nothing to do with whether you are running in a thread or not.

For the normal case James's patch is adequate. I think we should apply it, because I still can't think of a way to test it.

  2006-08-29 14:08:20+00:00 changed by exarkun

Quotient's unit tests semi-reliably produce some kind of process reaping bug. FWIW, applying James' patch above didn't fix it. This might be a trial bug rather than a real reactor bug, but I can't see how. I don't understand why the patch wouldn't have fixed it otherwise.

  2006-09-12 01:29:28+00:00 changed by glyph

  • status changed from new to closed
  • resolution set to wontfix

The bug is more insidious than we had previously thought.

There is an inherent race in Python here, perhaps an unfixable one. Everywhere from the last trip through the ceval loop to the start of the actual 'select' syscall, a signal may arrive, be recognized by Python's signal handler, filed for later execution, and then select runs with its given timeout (probably None), which blocks forever.

JP verified this by inserting a raise(SIGINT) into the line before the actual select call in selectmodule.c. A Python SIGINT handler installed with that modified select module would only run _after_ the select call.

The only potentially correct solution in Python is described in #791. We may add an optimization in C later, which would register a syscall that would directly write to the waker's file descriptor, without waiting for Python code to execute, thus obviating the need for the looping call.

  2006-09-12 02:21:35+00:00 changed by jknight

  • status changed from closed to reopened
  • resolution deleted

Reopening, as this fix should be made anyways.

  2006-09-12 03:12:22+00:00 changed by glyph

  • priority changed from normal to lowest
  • summary changed from signals handled via callFromThread, yet the reactor still drops into select() to perhaps SIGCHLD handler could be slightly simpler

I think the fixes in #791 will probably obsolete this fix, and it doesn't actually "fix" anything, it just makes the race infinitesimally less race-y. The thread-support thing is also silly, and should also be applied (along with other changes) purely in the interests of code cleanliness, but it also isn't a fix for any real issue.

  2006-09-12 03:17:25+00:00 changed by glyph

  • owner deleted
  • status changed from reopened to new
  • description deleted
  • summary changed from perhaps SIGCHLD handler could be slightly simpler to perhaps wakeUp could be slightly simpler

Changing the description too, since apparently we're repurposing this ticket for a minor change rather than the actual fix:

The original description (thanks trac) was:

reactor.iterate() is defined like this:
    def iterate(self, delay=0):
       """See twisted.internet.interfaces.IReactorCore.iterate.
       """
       self.runUntilCurrent()
       self.doIteration(delay)
runUntilCurrent runs all the procs on the threadCallQueue. So if
something is added to the threadCallQueue between runUntilCurrent()
and doIteration(), the reactor needs to have an fd ready for reading
to shortcut the select(). This is done by callFromThread() calling
reactor.wakeUp(), which will write on the wakeup FD.
HOWEVER. For some reason reactor.wakeUp() only writes on the fd if it
is being called from another thread. This is obviously borked in the
signal-handling case, when a signal arrives between runUntilCurrent()
and doIteration(), and is processed via reactor.callFromThread(), as
is the case with SIGCHLD.
Solution: have reactor.wakeUp() always wake the waker, regardless of
the thread it comes from.

The solution is incorrect, for reasons given above.

  2008-02-01 17:07:51+00:00 changed by exarkun

  • branch deleted
  • author deleted

In addition to #791, see #733 and #2535.

  2008-05-27 20:39:41+00:00 changed by glyph

  • owner set to glyph
  • status changed from new to assigned
Note: See TracTickets for help on using tickets.