Opened 11 years ago

Closed 4 years ago

#411 defect closed fixed (fixed)

Returning a Deferred from the callback of another Deferred too many times results a RecursionError

Reported by: etrepum Owned by:
Priority: highest Milestone:
Component: core Keywords:
Cc: glyph, radix, exarkun, spiv, itamarst, etrepum, titty, oubiwann, mithrandi@…, thijs, zooko@… Branch: branches/iterative-deferreds-411-7
(diff, github, buildbot, log)
Author: exarkun, therve, mithrandi, jml Launchpad Bug:

Description (last modified by glyph)

Returning a Deferred from a callback to another Deferred will result in a RecursionError if it is done too many times with Defererds that have already fired.

The traceback in question doesn't have any application code on it and can be very difficult to interepret. At worst, Twisted should alert you to what has happened in a comprehensible manner; however, it should really be possible to just return Deferreds from callbacks indefinitely.

This is related to the more general issue of calling a one Deferred's callback from another Deferred's callback, but we can fix this specific case in a potentially more general way.

Attachments (2)

defer.py (4.4 KB) - added by etrepum 11 years ago.
single_while_loop.patch (22.7 KB) - added by TimothyFitz 4 years ago.
_runCallbacks implementation with only one loop.

Download all attachments as: .zip

Change History (59)

comment:1 Changed 11 years ago by etrepum

The magic stack-blowing number for me is 250, YMMV:

    from twisted.internet.defer import succeed
    def print_(v): print v
    def blowstack(howmany):
        def _blowstack(v):
            print v
            if v < howmany:
                return succeed(v+1).addCallback(_blowstack)
            return 'did %d iterations' % (v,)
        return _blowstack(0)
    blowstack(250).addBoth(print_)

I have a rewrite of the deferred module that uses "tail call optimization" and "very un-
threadsafeness" that fixes this problem and probably makes deferreds much faster (at least 
chaining wise).  The bonus/problem is that it also does the implicit chaining (see issue #410).
I'll probably post the source this weekend.

Changed 11 years ago by etrepum

comment:2 Changed 11 years ago by etrepum

ah what the hell, here's what I have so far.  not quite pluggable as a replacement for twisted's 
deferred yet.  replace microfailure with t.p.failure.

comment:3 Changed 10 years ago by titty

You also get very nasty bugs, which aren't easy to track down.
I was using the following class to transfer a 30mb file, after the file had been
transferred, the program suddenly used up to 500MB ram, before dumping core or
exiting with a MemoryError. Problem seems to be that the deferred code tries to
create a Failure object (which also fails, because of RecursionError), then a
failure object for the upper level and so on. I can supply a complete example if
anyone wants it.


class RemoteFile(object):
    def __init__(self, file=None):
        self.file = RemoteFileReference(file)
        self.data = []

    def readAll(self):
        def gotData(d):
            if d=='':
                return "".join(self.data)
            else:
                self.data.append(d)
                return self.read().addCallback(gotData)


        return self.read().addCallback(gotData)

    def read(self, size=0x10000):
        log.msg("%s: calling remote read(%s)" % (self, size))
        return self.file.callRemote('read', size)

comment:4 Changed 10 years ago by etrepum

My best guess in that particular example is that it should be rewritten using a callLater in there 
somewhere, because on localhost it will almost certainly recurse, since it doesn't have to wait for the 
data.

The fact that deferred fires back IMMEDIATELY instead of giving the runloop a chance isn't always what 
you want :)

comment:5 Changed 10 years ago by glyph

Could you work on streamlining your Deferred improvements for inclusion?

comment:6 Changed 10 years ago by etrepum

I don't have any time to work on this for another week or two.. so if you want it ASAP then you'll have to 
have someone else give it a shot.

comment:7 Changed 10 years ago by glyph

Nope, not high priority.  Just trying to make sure that everything is assigned
to *somebody*.

comment:8 Changed 8 years ago by oubiwann

  • Cc oubiwann added
  • Component set to conch

comment:9 Changed 8 years ago by radix

  • Component changed from conch to core

comment:10 Changed 7 years ago by antoine

Ok, I've looked at this and such an optimization looks simply incompatible with the expectations in the test cases, and the fact that Trial itself works with deferreds.
(the replacement defer.py doesn't work at all with twisted.test.test_defer even if the modifications are carefully merged)

The killer is that most tests in test_defer expect callbacks to be called immediately if possible, which kills any possibility of transforming recursive calls to runCallbacks() into an iterative loop, because if you do so the callbacks are run *after* the test checks whether the callbacks have been run :-))
(they are run when the test method returns to Trial, which is too late).

Perhaps it is possible to optimize for some very specific cases rather than the general runCallbacks() recursivity, but I'm not sure it's worth the additional complexity (and I'm not even sure it's possible, because it probably depends on exactly what kind of things Trial does with callbacks). Also, witnessing my observations above, it would slightly change the semantics of Deferred which may break some existing code.

comment:11 Changed 7 years ago by Peaker

  • Keywords core removed
  • Resolution set to invalid
  • Status changed from new to closed

This breaks commonly expected behavior of Deferreds.
It does not pass the test suite.

Deferreds calling callbacks synchronously is akin to normal function calls, and deferred users should be and are aware of this.

comment:12 follow-up: Changed 6 years ago by glyph

  • Description modified (diff)
  • Priority changed from high to normal
  • Resolution invalid deleted
  • Status changed from closed to reopened
  • Summary changed from Deferreds are good at blowing the stack to Returning a Deferred from the callback of another Deferred too many times results a RecursionError

While the specific optimization in question might not work, there's definitely something we can do to improve this situation.

Brian Warner did a really good writeup of the implications of this problem in a ticket on the allmydata tracker which he later copied as a message to the Twisted list.

However, the suggested fix there (reactor.eventually()) is wrong, in my opinion. You don't need to rip the whole stack in order to fix this problem, only the portion that is already running inside defer.py.

comment:13 Changed 5 years ago by exarkun

  • Author set to exarkun
  • Branch set to branches/iterative-deferreds-411

(In [26584]) Branching to 'iterative-deferreds-411'

comment:14 Changed 5 years ago by exarkun

(In [26585]) Add a test for the stack depth at which callbacks are invoked

refs #411

comment:15 Changed 4 years ago by jml

  • Author changed from exarkun to exarkun, jml
  • Branch changed from branches/iterative-deferreds-411 to branches/iterative-deferreds-411-2

(In [28384]) Branching to 'iterative-deferreds-411-2'

comment:16 Changed 4 years ago by jml

  • Branch changed from branches/iterative-deferreds-411-2 to branches/iterative-deferreds-411-3

(In [28455]) Branching to 'iterative-deferreds-411-3'

comment:17 Changed 4 years ago by therve

  • Owner changed from etrepum to therve
  • Status changed from reopened to new

comment:18 Changed 4 years ago by therve

  • Author changed from exarkun, jml to exarkun, therve, jml
  • Branch changed from branches/iterative-deferreds-411-3 to branches/iterative-deferreds-411-4

(In [28621]) Branching to 'iterative-deferreds-411-4'

comment:19 Changed 4 years ago by therve

  • Author changed from exarkun, therve, jml to exarkun, jml, therve
  • Keywords review added
  • Owner therve deleted

This is ready to review in the attached branch, collective work with jml and exarkun.

I looked at Brian's email, and it looks like some issues are still here, but the situation improves.

comment:20 follow-up: Changed 4 years ago by spiv

Some quick comments, although this patch deserves more thought than I can give it right now:

  • pyflakes reports many duplicate method names in test_defer.py. This must be fixed.
  • _Continue could use some more docstring or comment text explaining more about how it is used, specifically that it is intended to be used as an element of Deferred.callbacks that acts somewhat like addBoth(d._continue), which explains the wacky __getitem__. Maybe cross-reference to Deferred._notifyDeferred too.
  • _notifyDeferred is a very generic and abstract name. I'm not sure what would be better, maybe _addContinue? (and _removeContinue)
  • The except: self.result = failure.Failure() now covers an alarmingly broad try block
  • I haven't yet fully digested the iterative unwinding issue, but intuitively I don't understand why there's a need to operate on up to three Deferreds in _runCallbacks simultaneously rather than two.

To elaborate on the last point, perhaps someone can summarise the cases for me. After _runCallbacks has run the next callback (or errback) in the self.callbacks chain, we have three cases:

  • a result that is not a Deferred
  • a result that is a Deferred, and the result-Deferred has an immediately available result
  • a result that is a Deferred, and the result-Deferred does not have an immediately available result

The first two cases seem easy to do without recursion. Am I wrong? (Is the meaning of "immediately available" subtler than it might appear?)

For the third case, why isn't it just "break out of 'while self.callbacks'"?

comment:21 Changed 4 years ago by spiv

Apologies for the bad trac markup :(

comment:22 Changed 4 years ago by spiv

Also, there appears to be an invariant that self.result will never be a Deferred once self._runCallbacks has returned. Stylistically it might be clearer to never assign a Deferred to self.result in the first place, rather than assign it and then furiously loop until you can assign something else. Or does that impact the usefulness of the repr?

comment:23 Changed 4 years ago by spiv

Oh, I guess it can be a Deferred so long as this one is paused... don't mind me I'm up past my bedtime ;)

comment:24 Changed 4 years ago by itamar

  • Keywords review removed
  • Owner set to therve

Andrew added enough comments that this can probably go out of review until they are fixed.

comment:25 Changed 4 years ago by therve

  • Author changed from exarkun, jml, therve to exarkun, therve, jml
  • Branch changed from branches/iterative-deferreds-411-4 to branches/iterative-deferreds-411-5

(In [28940]) Branching to 'iterative-deferreds-411-5'

comment:26 in reply to: ↑ 20 Changed 4 years ago by therve

  • Keywords review added

Replying to spiv:
Thanks for the review!

  • pyflakes reports many duplicate method names in test_defer.py. This must be fixed.

Fixed?

  • _Continue could use some more docstring or comment text explaining more about how it is used, specifically that it is intended to be used as an element of Deferred.callbacks that acts somewhat like addBoth(d._continue), which explains the wacky __getitem__. Maybe cross-reference to Deferred._notifyDeferred too.

I added some comments.

  • _notifyDeferred is a very generic and abstract name. I'm not sure what would be better, maybe _addContinue? (and _removeContinue)

Those seems like good names. Renamed.

  • The except: self.result = failure.Failure() now covers an alarmingly broad try block

Yeah :/

  • I haven't yet fully digested the iterative unwinding issue, but intuitively I don't understand why there's a need to operate on up to three Deferreds in _runCallbacks simultaneously rather than two.

I think the current implementation is as simple as it could be, unfortunately. I tried to simplify according to your suggestion, but some tests failed each time. I think there is reentrancy in there (with the addBoth call) which makes it confusing.

comment:27 Changed 4 years ago by therve

  • Owner therve deleted

comment:28 Changed 4 years ago by glyph

  • Keywords review removed
  • Owner set to exarkun
  • Priority changed from normal to highest

First, I want to say, to everyone who has worked on this thus far: this is a truly heroic effort. Just trying to read this code made my eyes bleed, and I'm pretty sure that it is one of those rare bits of code which was harder to write than to read. I am in awe that you managed to accomplish this in a compatible manner.

  1. Copyright bump in test_defer.
  2. If you're going to be sneaky and take this opportunity to replace ints with bools, can you also take the opportunity to add some @ivars to document the meaning of those attributes?
  3. It needs a NEWS entry. (The NEWS entry should also mention the documentation improvements/boolean changes.)
  4. _Continue is a bit tricky to understand.
    1. One way to go with this would be to document it more clearly as emulating a callback pair (it took me a minute to figure out that that's what it was doing), and give it a suite of comparison methods, but...
    2. It would be easier if it were just a method on Deferred like the following, which would automatically compare equal, __repr__ in a sensible way, etc. This would allow _removeContinue to be implemented more simply as self.callbacks.remove(other._continuation()).
      def _continuation(self):
          return ((self._continue, None, None),
                  (self._continue, None, None))
      
  5. We should prefer docstrings to comments.
    1. _NO_RESULT should be documented by a @var in the module docstring.
    2. If it continues to exist, _Continue.__getitem__'s comment should just be a docstring.
  6. Sorry to be really nitpicky about comments, but this makes _runCallbacks even harder to understand, and it wasn't a picnic already.
    1. _runCallbacks needs a docstring. There is certainly enough modification going on here to trigger the requirement that undocumented methods being modified need a docstring :). Without a docstring, I have to think really hard about both what it's doing and how it's doing it at the same time, and either one of those things is enough to occupy my full brain.
    2. "Iteratively unwind as much of the Deferred chain as possible"... hmm. It would be helpful to be painfully clear about what, exactly, the "deferred chain" is, given that the comments also talk about "the chain Deferred" and "the callback chain".
    3. I might go so far as to say that the whole while isinstance(self.result, Deferred) loop should really be in a method of its own, and that method should have a pretty substantial docstring of its own.
      1. I'd be happy to be told that there's an efficiency reason why making this a method would be bad. But if you want to tell me that you have to show me a benchmark.
      2. If you're not going to make it a method, pretend it's a method and write a comment at the top that would be worthy of a docstring: lay out the desired behavior when a Deferred is returned from a callback.
    4. The comment about "'third' stealing 'first' from 'second'" makes no sense at all to me. Was this a record of some conclusion reached during a conversation that happened while some people were pairing on this code? Please add some antecedents to clarify what first, second, and third are. Or use variable names, and say why this theft by 'third' is morally justified.
  7. A terminology suggestion - please ignore if it is not helpful - would it any easier to talk about the feature there are so many tests for if we called it something different, instead of "implicit chaining" vs. "explicit chaining" or "callback chains", or sometimes just "chaining". Perhaps "cascading" instead?
  8. Would you mind calling the loop variable in test_boundedStackDepth 'ignored' instead of 'i'? Also, the 'sanity check' comment in that method is >80 chars.

Finally, the implementation looks pretty solid, and I particularly liked test_boundedStackDepth. The issues I've got here are all documentation and comprehensibility stuff, and should hopefully be fairly minor, so if you can address them to your own satisfaction I don't see the need for another review.

comment:29 Changed 4 years ago by therve

  • Owner changed from exarkun to therve

comment:30 Changed 4 years ago by therve

  • Keywords review added
  • Owner therve deleted

I think I handled most comments.

  1. I didn't make a method because my brain hurt when I tried to think about the loop logic with the break and continue.
  1. I think we want to use chain everywhere, but maybe I'm wrong.

Thanks!

comment:31 Changed 4 years ago by exarkun

  • Branch changed from branches/iterative-deferreds-411-5 to branches/iterative-deferreds-411-6

(In [29565]) Branching to 'iterative-deferreds-411-6'

comment:32 Changed 4 years ago by exarkun

  • Keywords review removed
  • Owner set to exarkun

I merged forward and dropped a bunch of changes which are now in the #4300 branch and the #4538 branch.

James recently voiced some concern over the correctness of the basic change here. I'm going to try to write some unit tests which demonstrate the concern is valid.

comment:33 Changed 4 years ago by exarkun

No luck so far. :/

comment:34 Changed 4 years ago by mithrandi

  • Owner changed from exarkun to mithrandi

comment:35 Changed 4 years ago by mithrandi

  • Owner changed from mithrandi to exarkun

Giving it back to exarkun because I'm giving up on this for now. The approach currently implemented in the branch right doesn't work because it ends up "stealing" results from deferreds that might still need them, but I couldn't figure out how to avoid this. I also experimented with some other crazy stuff, but couldn't get any of that to work, either.

If someone can figure out how to rewrite _runCallbacks so that it is tail-recursive, it should be trivial to manually implement the tail call optimization to avoid accumulating stack frames, but I couldn't figure out how to do that either.

comment:36 Changed 4 years ago by mithrandi

  • Author changed from exarkun, therve, jml to exarkun, therve, mithrandi, jml
  • Branch changed from branches/iterative-deferreds-411-6 to branches/iterative-deferreds-411-7

(In [29830]) Branching to 'iterative-deferreds-411-7'

comment:37 Changed 4 years ago by mithrandi

  • Cc mithrandi@… added

comment:38 Changed 4 years ago by exarkun

There's a new approach to _runCallbacks in the branch now. Hopefully it's both more correct an simpler than the previous attempts. There are still a few minor issues with the code, notably:

  • _debugInfo interaction. This functionality is not well tested in trunk. We need to improve test coverage to make sure everything's still working right.
  • _runningCallbacks probably isn't properly handled in all cases. This is another thing that likely needs better test coverage.
  • The way the avoid-recursion case is detected is a little gross and can probably be made cleaner.

comment:39 Changed 4 years ago by thijs

  • Cc thijs added

comment:40 Changed 4 years ago by exarkun

  • Keywords review added
  • Owner exarkun deleted

This seems to be in reasonable shape now. Build results

comment:41 Changed 4 years ago by TimothyFitz

  • Owner set to TimothyFitz

comment:42 in reply to: ↑ 12 ; follow-up: Changed 4 years ago by zooko

Replying to glyph:

However, the suggested fix there (reactor.eventually()) is wrong, in my opinion. You don't need to rip the whole stack in order to fix this problem, only the portion that is already running inside defer.py.

I never understood this comment. Let's assume you are right that you don't need to rip the whole stack in order to fix this problem. But why does ripping the whole stack make the suggested fix wrong? Thanks.

comment:43 Changed 4 years ago by zooko

  • Cc zooko@… added

comment:44 in reply to: ↑ 42 Changed 4 years ago by glyph

Replying to zooko:

Replying to glyph:

However, the suggested fix there (reactor.eventually()) is wrong, in my opinion. You don't need to rip the whole stack in order to fix this problem, only the portion that is already running inside defer.py.

I never understood this comment. Let's assume you are right that you don't need to rip the whole stack in order to fix this problem.

I'm fairly sure that's accurate, since the current solution doesn't rip the whole stack, and it works fine ;).

But why does ripping the whole stack make the suggested fix wrong? Thanks.

Thanks for asking. This really should be spelled out clearly, lest we doom future generations to make this mistake.

First, and most importantly, Deferred shouldn't depend on the Twisted main loop, or any main loop, for that matter. It's very close, at this point, to being properly usable as an external, independent component. The ability to do this is important; it's a selling point that one can write asynchronous libraries without giving up the ability to use that same code in a synchronous context. If you depend on the ability to segregate non-reentrant vat turns, you depend on the presence of a top-level mainloop.

Second - and I'd like to emphasize, this is really a footnote, compared to the other reason - it's slow. Each stack frame that you rip is an extra pile of opcodes and potentially a function that you need to call again later. Twisted should be faster at getting back from the reactor into application code, but even if all reactors were completely optimally efficient, it's faster to avoid the select()-ish syscall between turns.

comment:45 Changed 4 years ago by zooko

Thanks for the answer. I very much agree that making Deferred independent of the reactor and separately distributable/re-usable would be potentially a big win, if only for educational purposes! But I guess the eventually() proposal by Brian Warner doesn't actually require a single top-level main loop, right? It just requires that you can execute some Python code with a nice short stack. One could imagine multiple main loops or even (shudder) a new thread or a thread-pool to do that. Not a serious proposal, just wanting to understand what the possibilities are around this.

comment:46 follow-up: Changed 4 years ago by exarkun

One could imagine multiple main loops or even (shudder) a new thread or a thread-pool to do that.

Perhaps. But there's no reason to want this given an implementation that doesn't require it, right?

comment:47 in reply to: ↑ 46 Changed 4 years ago by zooko

Replying to exarkun:

Perhaps. But there's no reason to want this given an implementation that doesn't require it, right?

If you didn't mind making Deferred depend on reactor then I would still wonder if Brian's proposed hack would have advantages over your implementation. But given that we want to make Deferred independent of reactor, I agree that your implementation would be better than one that depended on "some unspecified way to get a nice short stack to start with".

By the way, is there a good way to write down "we want to make Deferred independent of reactor" so that other people reviewing other tickets will be aware of that desideratum?

Thanks!

comment:48 Changed 4 years ago by glyph

  • Owner TimothyFitz deleted

Timothy, feel free to continue with the review, but I'm un-assigning it because it's been a couple of days and I don't want to discourage anyone else from picking it up in the meanwhile.

comment:49 Changed 4 years ago by TimothyFitz

  • Keywords review removed
  • Owner set to exarkun

Glyph, I'll do you one better and just finish the review :)

All of this is for the _runCallbacks function:

  1. Don't reassign self. If you miss that one line (as I did on my first read through) you misinterpret the code completely.
  2. "chain" is misleading. While it's related to deferred chaining, it's not a chain but really a stack. Part of what complicates reading the code is that you have a stack ('chain') of queues ('current.callbacks').
  3. _debugInfo manipulation should be refactored out of this function. There are four different cases where ._debugInfo is checked and then maybe updated, most of which are fairly similar.
  4. There are two places where results are being "stolen" from one deferred and injected into another, which could be a Deferred._stealResult method.

I'm also prototyping a single while loop version of the code which might streamline things a bit more, but the idea is half-baked and shouldn't hold up review of this branch.

comment:50 Changed 4 years ago by exarkun

  • Keywords review added
  • Owner exarkun deleted

Don't reassign self.

Changed to use current instead.

"chain" is misleading.

I don't think so. The Deferred chain being followed is exactly what it represents. I think it's clear that the data structure it refers to is a stack.

_debugInfo manipulation should be refactored out of this function.

It was factored into a separate function in an intermediate version of the code. There appeared to be a slight slowdown caused by the extra method call, though. As it is now, I think there's a slight speedup instead. So I'm inclined to leave it inlined. ;) I also hope to have some time to improve the Deferred benchmark soon, after which point we might be able to be more precise than "slight" and then I'll feel better about a potential performance/factoring tradeoff.

I'm also prototyping a single while loop version of the code which might streamline things a bit more, but the idea is half-baked and shouldn't hold up review of this branch.

I can sort of imagine one form this change might take. Do you think this will help readability? Or is there some other motivation?

comment:51 Changed 4 years ago by TimothyFitz

Yeah, getting rid of the flag and 2 breaks + 2 while loops would really make the code a lot easier to follow.

Changed 4 years ago by TimothyFitz

_runCallbacks implementation with only one loop.

comment:52 Changed 4 years ago by TimothyFitz

Attached a single-while-loop version of _runCallbacks.

The only big wart left that bothers me is _debugInfo. I think the right solution is a better separation of concerns. I'm thinking it should be renamed to _resultOwner and be the canonical location of the result itself. That way _resultOwner's del method (or better yet weakref callback) would be the canonical check to see if it's a failure instance.

Of course, that's a fairly meaty refactoring that shouldn't hold up this bug from getting fixed.

comment:53 Changed 4 years ago by therve

I'm not sure this patch succeeds at being easier to understand, though...

comment:54 Changed 4 years ago by glyph

  • Keywords review removed
  • Owner set to exarkun

After numerous irrelevant distractions that prevented me from getting a clean
test run, and several uninterrupted hours on a plane where I could stare at the
code and think really hard about it, I have completed a review of this code.

The result is (drum roll please): Looks good, please merge.

The only modification I'd request is that that the documentation and comments
now refer to "chaining" in several places, but chainDeferred doesn't mention
that what it's doing is subtly different. You can still get a recursion
depth exceeded error from doing chainDeferred too many times, like this:

from twisted.internet.defer import Deferred

firstd = prevd = Deferred()
for x in range(300):
    newd = Deferred()
    prevd.chainDeferred(newd)
    prevd = newd

firstd.callback(None)

(I think it might be possible to prevent this as well by making both things
called 'chaining' actually be the same, or at least more similar, but let's
leave that for another ticket, this one has been brain-bending enough.)

This review was made possible by bazaar, which let me work on this on the
plane, bzr-svn, which lets me work on Twisted using bazaar without a
cataclysmic migration effort, and qbzr - specifically bzr qdiff, which made
reading it not drive me crazy. Thank you, bazaar developers.

comment:55 Changed 4 years ago by exarkun

(In [29954]) Note about how chainDeferred is not as cool

refs #411

comment:56 Changed 4 years ago by exarkun

  • Resolution set to fixed
  • Status changed from new to closed

(In [29955]) Merge iterative-deferreds-411-7

Author: exarkun, therve, mithrandi, jml
Reviewers: spiv, glyph, jknight, TimothyFitz
Fixes: #411

Replace the recursive implementation of implicit Deferred chaining
(that is, chaining caused by callbacks which return Deferreds) with
an implementation which is iterative instead. This prevents result
propagation from failing when a long Deferred chain is fired.

comment:57 Changed 4 years ago by <automation>

  • Owner exarkun deleted
Note: See TracTickets for help on using tickets.