[Twisted-Python] Deferreds vs sys.getrecursionlimit()

Fri Nov 14 19:55:00 MST 2008

About once every six months, I wind up debugging a python stack
overflow in my Deferred-using code. The symptom is usually a log
message that ends with:

  File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 344, in _runCallbacks
    self.result = failure.Failure()
  File "/usr/lib/python2.5/site-packages/twisted/python/failure.py", line 265, in __init__
    parentCs = reflect.allYourBase(self.type)
  File "/usr/lib/python2.5/site-packages/twisted/python/reflect.py", line 542, in allYourBase
    accumulateBases(classObj, l, baseClass)
  File "/usr/lib/python2.5/site-packages/twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "/usr/lib/python2.5/site-packages/twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "/usr/lib/python2.5/site-packages/twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "/usr/lib/python2.5/site-packages/twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
exceptions.RuntimeError: maximum recursion depth exceeded

It's always something weird. This time, I took notes. I offer these hints to
help future searchers find a starting point in their own debugging efforts.
The executive summary:

 Certain patterns of using Deferreds work fine while testing, but will fail
 in mysterious ways later on as the application's workload grows larger.

These notes are formatted for Trac, as they were originally written as a
comment for http://allmydata.org/trac/tahoe/ticket/237 . Also, the
"solutions" suggested require the eventual-send operator as provided by
Foolscap, and until/unless reactor.eventually() makes it into Twisted proper,
these solutions may not be convenient for projects that aren't already using
Foolscap.

cheers,
 -Brian

== Problem One: long list of callbacks, all of them are ready ==

Each Deferred (we'll call the first one Deferred A) has a list of callback
functions. Each time you do d.addCallback(), this list grows by one element.

When Deferred A fires, the list is executed in a 'while' loop, in
Deferred._runCallbacks. If the callbacks all return either a normal value or
a Failure, then the list is completely consumed during the one call to
_runCallbacks, and everything is fine.

However, when a callback returns another Deferred B (chaining), the first
Deferred A must wait for the second to finish. The code that does this looks
like:

{{{
    if isinstance(self.result, Deferred):
        self.pause()
        self.result.addBoth(self._continue)
        break
}}}

The second Deferred B might have already been fired by this point, either
because it was born ready (created with defer.succeed, or
defer.maybeDeferred), or because whatever was being waited upon has already
occurred.

If this occurs, the subsequent callback in Deferred A's chain will fire (with
B's result), but it will fire through a 6-frame recursive loop instead of
firing on the next pass of the 'while' loop. As a result, each such
ready-to-fire Deferred will add 6 stack frames. 166 such loops are enough to
put more than 1000 frames on the stack, which will exceed Python's default
sys.getrecursionlimit() .

The 6-frame cycle is:

{{{
  File "twisted/internet/defer.py", line 214, in addBoth
    callbackKeywords=kw, errbackKeywords=kw)
  File "twisted/internet/defer.py", line 186, in addCallbacks
    self._runCallbacks()
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
  File "twisted/internet/defer.py", line 341, in _runCallbacks
    self.result.addBoth(self._continue)
}}}

The following sample code will cause this situation:

{{{
import traceback
from twisted.internet import defer

def fire(res, which):
    #print "FIRE", which, "stack:", len(traceback.extract_stack())
    #if which == 2:
    #    traceback.print_stack()
    return defer.succeed(None)

d = defer.Deferred()
for i in range(170):
    d.addCallback(fire, i)

d.callback("go")
}}}

The exception that this provokes is caught by the Deferred's Failure
mechanisms, but then Twisted has an internal failure while trying to capture
it. The actual Unhandled error in Deferred that gets put into the logs is:

{{{
Unhandled error in Deferred:
Traceback (most recent call last):
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
  File "twisted/internet/defer.py", line 341, in _runCallbacks
    self.result.addBoth(self._continue)
  File "twisted/internet/defer.py", line 214, in addBoth
    callbackKeywords=kw, errbackKeywords=kw)
  File "twisted/internet/defer.py", line 186, in addCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
  File "twisted/internet/defer.py", line 344, in _runCallbacks
    self.result = failure.Failure()
  File "twisted/python/failure.py", line 265, in __init__
    parentCs = reflect.allYourBase(self.type)
  File "twisted/python/reflect.py", line 542, in allYourBase
    accumulateBases(classObj, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
exceptions.RuntimeError: maximum recursion depth exceeded
}}}

This problem frequently shows up in code which returns a Deferred for
generality (i.e. some day it might be async), but is using defer.succeed() or
defer.maybeDeferred(some_immediate_call) in the meanwhile.

== Problem Two: deep chain of callbacks, e.g. recursive delayed polling ==

The other kind of recursion-limit-violation failures that occurs with
Deferreds involves a long chain that finally fires. The most common way to
generate such a chain is with a recursive method that separates each call
with a Deferred, such as a polling function that returns a Deferred:

{{{
   def wait_until_done(self, ignored=None):
       if self.done:
          return True
       else:
          d = Deferred()
          reactor.callLater(1.0, d.callback, None)
          d.addCallback(self.wait_until_done)
          return d
}}}

If this function must poll more than 331 times, the reactor tick which
notices the expired timer and fires d.callback will see a
recursion-depth-exceeded exception. The last Deferred fires, which triggers
the _continue callback on the next-to-last Deferred, which allows it to fire,
which triggers the {{{[-2]}}} Deferred, etc. This recursive cycle is of
length 3 and has the following frames:

{{{
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
}}}

This one is trickier to find, because the root of the stack will be some
internal reactor call rather than application code. In particular, the bottom
of the stack will look like:

{{{
  File "/tmp/t.py", line 26, in <module>
    reactor.run()
  File "twisted/internet/base.py", line 1048, in run
    self.mainLoop()
  File "twisted/internet/base.py", line 1057, in mainLoop
    self.runUntilCurrent()
  File "twisted/internet/base.py", line 705, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "twisted/internet/defer.py", line 243, in callback
    self._startRunCallbacks(result)
  File "twisted/internet/defer.py", line 312, in _startRunCallbacks
    self._runCallbacks()
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
}}}

The other tricky thing about this failure is that the application code is
sitting on the end of the stack: any callback that is attached to the
Deferred that {{{wait_until_done}}} returns will run in a low-stack
environment. As a result, recursion-depth-exceeded exceptions will be
triggered by seemingly innocent application code. Note how the "DONE" number
changes as you modify the self.count comparsion value in this example:

{{{
#! /usr/bin/python

import traceback
from twisted.internet import reactor
from twisted.internet.defer import Deferred

class Poller:
    count = 0
    def wait_until_done(self, ignored=None):
        self.count += 1
        if self.count > 301: # 331 works, 332 fails.
            return True
        else:
            d = Deferred()
            reactor.callLater(0.0, d.callback, None)
            d.addCallback(self.wait_until_done)
            return d

p = Poller()
def done(res):
    #traceback.print_stack()
    print "DONE", len(traceback.extract_stack())
d = p.wait_until_done()
d.addCallback(done)
reactor.run()
}}}

When this fails, the traceback that shows up in the logs looks like:

{{{
Unhandled error in Deferred:
Traceback (most recent call last):
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
--- <exception caught here> ---
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
  File "twisted/internet/defer.py", line 344, in _runCallbacks
    self.result = failure.Failure()
  File "twisted/python/failure.py", line 265, in __init__
    parentCs = reflect.allYourBase(self.type)
  File "twisted/python/reflect.py", line 542, in allYourBase
    accumulateBases(classObj, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
exceptions.RuntimeError: maximum recursion depth exceeded
}}}

== Combinations ==

Note that these two problems can interact. Each ready-to-fire callback
attached to a single Deferred uses 6 stack frames, and each chained callback
uses 3 stack frames. If X*6+Y*3 > 1000, the code will fail.

== Solutions ==

For problem one, the requirement is that Deferreds never wind up with more
than 166 callbacks that are ready to fire. In other words, there must be at
least one not-ready-to-fire Deferred in each span of 166 callbacks.

One way to accomplish this is to have every 100th call return
{{{foolscap.eventual.fireEventually(result)}}} instead of
{{{defer.succeed(result)}}}. Having every call do this works too, it just
slows things down a bit. (note that the reactor must be running for
fireEventually to work)

{{{
def fire(res, which):
    return defer.fireEventually(None)

d = defer.Deferred()
for i in range(170):
    d.addCallback(fire, i)
}}}

For problem two, the requirement is that the depth of the tail-recursion
chain not exceed 331 cycles, minus some room for the code you're eventually
going to attach to the end. One way to accomplish this is to have every 300th
call (or every single call, if you are willing to accept the slowdown) add an
additional {{{fireEventually}}} to break up the stack.

{{{
    def wait_until_done(self, ignored=None):
        self.count += 1
        if self.count > 301: # 331 works, 332 fails.
            return True
        else:
            d = Deferred()
            reactor.callLater(0.0, d.callback, None)
            d.addCallback(self.wait_until_done)
            d.addCallback(lambda res: fireEventually(res))
            return d
}}}