Opened 8 years ago

Closed 8 years ago

#5011 defect closed fixed (fixed)

Raising exception in a Deferred callback is outrageously expensive

Reported by: spiv Owned by: spiv
Priority: high Milestone:
Component: core Keywords:
Cc: Tristan Seligmann, fijal Branch: branches/faster-failures-5011
branch-diff, diff-cov, branch-cov, buildbot
Author: spiv

Description (last modified by Jean-Paul Calderone)

tl;dr summary: raise foo in a callback is very very slow, and I have patch to fix it.

At Canonical we recently noticed Launchpad's codehosting spending a surprising amount of time in twisted.python.failure.Failure.__getstate__, as reported in http://twistedmatrix.com/pipermail/twisted-python/2011-March/023755.html. The result is that an operation that should take a small handful of milliseconds is taking 2-5× longer, because Twisted is spending that time cleaning failures that are about to be discarded entirely.

The problem is essentially that capturing the full traceback is expensive (and the deeper your call stack, and the more variables you have, the worse it is), and Deferred doesn't have a good idea of when that information will be useful, so much of the time capturing and processing it wasted.

The current policy is that if a call/errback raises an error, it constructs a failure by calling Failure(), which will capture the traceback in the Failure (unless the error is already a Failure instance, in which case the new Failure shares a __dict__ with the raised one). At this point it's mainly just keeping a reference to a traceback object, so not particularly expensive by itself.

However, if _runCallbacks gets to the end of the callbacks it can run because we've reached the end of the chain (or the current result is a Deferred with no result yet, but that case is not relevant here), and the current result is Failure, it calls cleanFailure on it. That is an enormously expensive operation.

This particular affects us I think because of the combination of fairly deep call stacks (there are many layers of infrastructure/abstraction involved across Launchpad in general, Launchpad's codehosting, bzrlib's abstractions, and Twisted bits), and calling addCallback on Deferreds that have already fired.

In the affected part of Launchpad we have a largely synchronous API wrapped in a Twisted-style async API that returns Deferreds, so those Deferreds will run their callbacks as soon as they are added. That means that a single logical operation that produces multiple Failures along the way will call the expensive cleanFailure method for each of them, because _runCallbacks can only consider one callback at a time. e.g. consider this case:

d = somethingThatReturnsDeferred()
d.addCallback(doSomething)
d.addErrback(translateError)
d.addBoth(finishStuff)

In Launchpad I've got a fairly ugly patch that to avoid raising errors in callbacks instead return traceback-less Failures: <http://bazaar.launchpad.net/~spiv/launchpad/codehosting-cheaper-failures/revision/12686>. It has fixed maybe half the slow down (as compared with John's hack to unconditionally do self.tb = None in his mailing list post), but chasing down all the remaining possible places this can happen in our code and making them uglier is not a very delightful prospect, so I see this as a temporary workaround rather than a solution.

In Twisted I think we probably don't want to be capturing and stringifying tracebacks (and all the locals/globals of every stack frame!) so frequently. We already disable the much cheaper capturing of creator/invoker tracebacks unless defer.setDebugging(True) is called. So I propose that capturing tracebacks in callback exceptions should also be controlled by that debug flag. I have a patch that does this: <http://bazaar.launchpad.net/~spiv/twisted/deferred-no-failure-tracebacks-by-default/revision/15784>

I've also created a crude benchmark demonstrating the difference it makes.

Trunk:

759.6 deferred_callback_chains/sec (3798 deferred_callback_chains in 5 seconds)

My patch:

12646.2 deferred_callback_chains/sec (63231 deferred_callback_chains in 5 seconds)

The benchmark can be found in <lp:~spiv/+junk/twisted-benchmarks>.

Change History (37)

comment:1 Changed 8 years ago by spiv

Another option would be to keep collecting tracebacks in _runCallbacks, but stop collecting the locals/globals of the stack frames in the traceback. That's probably the bulk of the cost, but Deferred never uses that extra information itself, and I bet almost no errbacks do either.

comment:2 Changed 8 years ago by spiv

In addition to the crude microbenchmark being 16× faster, running trial twisted is a few percent faster for me too.

comment:3 Changed 8 years ago by Glyph

Priority: normalhigh

comment:4 Changed 8 years ago by John Arbash Meinel

Note that this is missing one of the primary cases that I uncovered in the Launchpad code: maybeDeferred. Which does:

try:

result = f(*args, kw)

except:

return fail(failure.Failure())

if isinstance(result, Deferred):

return result

elif isinstance(result, failure.Failure):

return fail(result)

else:

return succeed(result)

Note that we don't have a Deferred at this point, to check self.debug. However, we could certainly add a "Deferred.debug" check. I'll post an update to Andrew's proposal.

comment:5 Changed 8 years ago by Tristan Seligmann

Cc: Tristan Seligmann added

comment:6 Changed 8 years ago by spiv

Good point John. I've beaten you to the update though :)

The new diff can be seen at <http://bazaar.launchpad.net/~spiv/twisted/deferred-no-failure-tracebacks-by-default/revision/15785?compare_revid=15783>

comment:7 Changed 8 years ago by spiv

I should point out that although the new patch is only 10× (rather than 16×) faster on my crude benchmark, it appears to be significantly faster at running trial twisted, presumably because of the improved maybeDeferred out-weighing the less aggressive optimisation of Failure: I still keep basic traceback info now (as per my earlier comment), I just added a flag to skip capturing locals and globals.

comment:8 Changed 8 years ago by spiv

Thanks to jml and therve for some quick reviews on IRC. An updated patch can be seen at <http://bazaar.launchpad.net/~spiv/twisted/deferred-no-failure-tracebacks-by-default/revision/15787?compare_revid=15783>.

The most notable change is the addition of a couple of unit tests for Failure(noVars=True). The other tweaks are basically cosmetic (e.g. updating d.has_key to the 21st century while I'm there).

Currently this patch runs twisted trial ~75s on my laptop vs ~85s for trunk! John reports it is almost as good for our Launchpad problem as unconditionally setting self.tb=None.

The only significant change I'm tempted to make is defaulting noVars to True: it's a change in behaviour that's not strictly backwards compatible, but I wonder if it's worth it?

comment:9 Changed 8 years ago by Jean-Paul Calderone

Keywords: review removed
Owner: set to spiv

When I talked to jml on IRC about this performance problem (prior to the creation of this ticket), an idea that came up was to have an alternate Failure implementation that skipped this work. From a maintenance standpoint, this seems preferable to adding more complexity to (the already ridiculous) Failure.__init__. It looks like the noVars flag doesn't add any new untested code paths through that method, but it does modify some untested code paths (by changing them to only run sometimes).

Separately from that, don't make flags that start with "no". :P If a flag to Failure is the approach to take, name it without a negation. Also, there is already a debug Failure mode. If you wanted stack capturing behavior to be off by default, then re-using that debug mode might make more sense (or maybe that's conflating unrelated things, I'm not sure).

Also, I noticed that the bzr branch is based on trunk from about two weeks ago. I don't know if other people who compared the branch to trunk took that into account. Measuring Twisted test suite runs, the branch seems to be about a 10% speedup.

I ran the benchmark a few times, the results for trunk are visible at <http://speed.twistedmatrix.com/timeline/?exe=2&base=none&ben=deferred_callback_chains&env=d3&revs=200>. No results for the branch because 1) the benchmark slave can't check out bzr branches and 2) codespeed doesn't support benchmarking branches.

Overall I think the reduction in functionality in non-debug mode makes sense. I don't remember the last time I saw any code using detailed tracebacks, nor the last time anyone asked a question about that feature. I would like to see more unit test coverage for printTraceback and printBriefTraceback than is being added here (the test just calls the methods, doesn't assert anything about their behavior). Also, perhaps the detailed traceback should explicitly mention that it has no details, if it is formatted from a failure which has not held on to these values?

Also, some coding standard violations. Use test_foo, put docstring triple quotes on their own lines.

comment:10 Changed 8 years ago by Tristan Seligmann

Don't twisted.web / Nevow use the detailed traceback when rendering a traceback in HTML form? I've often found the contents of variables to be useful in debugging an unexpected error a user has somehow stumbled over, especially when it can't be easily or consistently reproduced, but I'm not sure I would want to pay the full cost of Deferred debugging in my production environments.

comment:11 Changed 8 years ago by Jean-Paul Calderone

Indeed, twisted.web.util.formatFailure and nevow.failure.formatFailure. Lots of people object to their current behavior, since it can provide end-users with much more information than a developer might expect.

comment:12 Changed 8 years ago by spiv

Keywords: review added

Updated patch: <http://bazaar.launchpad.net/~spiv/twisted/deferred-no-failure-tracebacks-by-default/revision/15789?compare_revid=15783>

I don't think an alternative Failure implementation is better from a maintenance standpoint: it would be more code, and more code paths. There's no regression in test coverage here, and in fact in the current patch a slight improvement to coverage of the existing code paths. I agree that test_failure is far from comprehensive, but that sounds like a separate ticket to me rather than something that should block this patch.

The flag in my patch is now called captureVars, and as glyph suggested on the mailing list it now defaults to False, i.e. it changes the default behaviour. As a result I've been able to leave defer.py untouched vs. trunk. If we really wanted we could instead arrange for defer.setDebugging(True) to start passing captureVars=True, but my guess is no-one really wants this, and if they do glyph's proposal to add a dedicated debug flag for failures is probably better.

I've also arranged for the detailed traceback to explicitly mention when it has no details.

I think I've fixed the coding standard violations, but my habits are somewhat different so it's possible I've missed something :)

comment:13 Changed 8 years ago by spiv

Owner: spiv deleted

comment:14 Changed 8 years ago by Jean-Paul Calderone

I don't think an alternative Failure implementation is better from a maintenance standpoint: it would be more code, and more code paths. [...] I agree that test_failure is far from comprehensive, but that sounds like a separate ticket to me rather than something that should block this patch.

I disagree. I think we put off cleanups and refactoring far too much and excuse it as being easier right now. That just leaves us with code like the the current Failure.__init__ though (which will be even worse after this change).

Sadly, I'd rather see the speedup land than have you abandon the change (or just maintain it downstream) because cleaning up failure.py is too much work, so I'm not actually going to say this should block this ticket.

If we really wanted we could instead arrange for defer.setDebugging(True) to start passing captureVars=True, but my guess is no-one really wants this, and if they do glyph's proposal to add a dedicated debug flag for failures is probably better.

This means there is no way at all to capture variables in Deferred-created Failures now?

comment:15 Changed 8 years ago by spiv

Sadly, I'd rather see the speedup land than have you abandon the change (or just maintain it downstream) because cleaning up failure.py is too much work, so I'm not actually going to say this should block this ticket.

I agree clean-ups are important, but I'm glad you don't think holding other improvements hostage is the answer.

This means there is no way at all to capture variables in Deferred-created Failures now?

Right. Do you think that matters? If so the 3 obvious options I see are:

  1. pass captureVars=Deferred.debug
  1. add a new setFailureDebugging function to t.i.defer, and pass the flag it sets to captureVars. Default would be False.
  1. glyph's proposal. This would be nice, but needs some fleshing out before it's implementable. FWIW bzrlib's implementation of a similar idea works well for us. It has a bzrlib.debug module which basically just contains debug_flags = set(), and then code elsewhere does stuff like if 'transport' in debug.debug_flags: …

Do you have a preference (possibly for some other option)? For sheer simplicity I'm tempted by the first one.

comment:16 Changed 8 years ago by Jean-Paul Calderone

Keywords: review removed
Owner: set to spiv
  1. pass captureVars=Deferred.debug

This seems like a good idea for now. I do think we should avoid completely disabling the variable capture feature, even if we turn it off by default.

comment:17 Changed 8 years ago by spiv

Keywords: review added
Owner: spiv deleted

Ok, done. Updated patch: <http://bazaar.launchpad.net/~spiv/twisted/deferred-no-failure-tracebacks-by-default/revision/15791?compare_revid=15783>

As before, I've only changed invocations of Failure() in defer.py that weren't passing an exc_value, so it's always possible to explicitly use a detailed Failure by returning one from a callback or passing one to d.errback, regardless of the current value of Deferred.debug. And of course the reverse is also true: you can explicitly use a Failure with no tracebacks today. So I haven't changed the semantics of how defer.py deals with Failures it didn't construct.

You could argue that the ability to pass in your own Failures means my previous patch didn't completely disable the variable capture feature either :)

I've also created some news files for this change.

Any other comments?

comment:18 Changed 8 years ago by spiv

I think the actual diff is the same, but for what it's worth here is the diff after merging current trunk into that branch: <http://bazaar.launchpad.net/~spiv/twisted/deferred-no-failure-tracebacks-by-default/revision/15792?compare_revid=15783.1.67>

comment:19 Changed 8 years ago by spiv

Author: spiv
Branch: branches/faster-failures-5011

(In [31501]) Branching to 'faster-failures-5011'

comment:21 Changed 8 years ago by Jean-Paul Calderone

Description: modified (diff)

comment:22 Changed 8 years ago by Jean-Paul Calderone

Keywords: review removed
Owner: set to spiv

Okay, I am convinced this is a change we should make. So here are some boring things:

  1. Could several more of the new test methods uses getDivisionFailure instead of doing the try/etc thing? That is, test_printingCapturedVarsSmokeTest, test_printingNoVars, and test_printingCaptureVars.
  2. Methods should be separated from each other by two blank lines
  3. I'm not sure what test_getTracebackObjectFromCaptureVars is intending to test. Would an isinstance(..., types.TracebackType) go with the spirit of it?
  4. I think the smoke test style tests aren't great. Fortunately I guess there's really only one of those being added in this branch. Can you add a couple minimal assertions to test_printingCapturedVarsSmokeTest?
  5. In the Failure.__init__ signature, captureVars=False should be indented to align with the ( on the previous line.
  6. It would be great to see some API docs for the frames and stack attributes of Failure. (It would have been even greater if they were there already, but I think you're changing what one might find in them, so I feel okay asking for docs now :)
  7. In __getstate__, this line confuses me: (v[3] is None) and. Is this really necessary? It looks like _safe_repr_vars handles this itself. (And if it were necessary, I'd expect to see the same treatment for v[4]).
  8. _safe_repr_vars should be something like _safeReprVars and have a docstring.

One other thing I'm wondering about is the Failure() calls in other parts of Twisted. There are not very many of these, but there will be no way to capture vars from them. I guess I'm not very worried about this because of the marginal nature of the feature... but it does seem a little lame. I guess your option 2 from the earlier comment would address this, so that might be good to do eventually. But waiting for someone to ask for it seems like a good idea. :)

Thanks! I'm looking forward to Twisted actually getting faster. :)

comment:23 Changed 8 years ago by spiv

Keywords: review added

Thanks for the review!

  1. I'm a touch uncomfortable with getDivisionFailure, because of the risk of skew between its arguments and Failure's constructor. I've made it take *args, **kwargs to avoid that problem, and used it in those places. I added a short docstring to getDivisionFailure while I was there.
  2. Hopefully I've caught all cases of this that I added. It's a bit hard to notice when many of these places weren't using that convention consistently beforehand (because violations don't look unusual).
  3. It is intended to express that the value of captureVars has no effect on the result of getTraceback, i.e. that if you are only consuming non-detailed tracebacks then whether captureVars was used is irrelevant. You get an identical result either way. A broader statement in the same spirit might be “the value passed to captureVars only affects printDetailedTraceback [and get/printTraceback(…,detail='verbose')]”
  4. Some really minimal assertions added: that it emits a string starting with whatever is expected in each case. I also assert that a local variable explicitly put on the stack is visible in the detailed traceback via a crude regex check. If you had something more specific in mind let me know what.
  5. Done.
  6. Done.
  7. You're right, it is unnecessary. Removed.
  8. Renamed, and docstring written. Also tweaked the variable names it uses.

As a bonus: I added a test to catch a bug introduced by a typo, a trailing comma in (what was) _safe_repr_vars. I have of course also fixed that bug.

comment:24 Changed 8 years ago by spiv

(In [31542]) Various tweaks suggested by exarkun's review.

Refs #5011

comment:25 Changed 8 years ago by Michael Hudson-Doyle

Author: spiv

comment:26 Changed 8 years ago by Michael Hudson-Doyle

Author: spiv
Owner: spiv deleted

comment:28 Changed 8 years ago by fijal

Cc: fijal added

Hello.

I propose that this patch should avoid storing the traceback alltogether in case debug is off. While definitely convinient debugging info, in case of PyPy just the call to sys.exc_info forces all the frames and aborts all the jitting. This means huge performance hit in case traceback is stored.

comment:29 Changed 8 years ago by spiv

fijal: I think that is a change worth discussing, but I think it's out of scope for this ticket.

The problem is it would be a pretty disruptive change to default to simply turning off tracebacks in Failures entirely. One obvious casualty is it would cause lots of Trial failures to have no tracebacks (which breaks Trial's own test suite, which quite reasonably asserts at various points that tracebacks are shown to the user). Another problem is that when an unhandled failure happens in a Deferred, or when an errback e.g. uses log.err to report a failure, there'd be no traceback to indicate where the error happened. You'd often just get a context-less “TypeError: NoneType is not callable” or something similarly mysterious.

I think probably the ideal is something a bit smarter than just a single boolean debug flag. Perhaps different flags for different parts of the code, perhaps have a list of exception types to automatically capture or not capture tracebacks for, perhaps something else…

But that's a discussion for a separate ticket. This patch is clearly a step closer towards your ideal, so you should be in favour even if it doesn't go as far as you'd like :)

comment:30 Changed 8 years ago by Jean-Paul Calderone

Owner: set to Jean-Paul Calderone
Status: newassigned

comment:31 Changed 8 years ago by Jean-Paul Calderone

Keywords: review removed
Owner: changed from Jean-Paul Calderone to spiv
Status: assignednew
  1. twisted.web.util.formatFailure has a problem with the None locals and globals values which will now frequently be found in Failure.frames. It tries to iterate over it which will fail. You can observe by running a twistd web server and trying to load, eg, a unix socket from it. There's probably not a lot of code like this out there, but maybe there's some. It would be great if we could keep these as lists (albeit empty ones when Deferred debugging is not enabled) somehow.
  2. There appears to be no direct unit test which covers the true case of if fail is None: in Deferred.errback (ie, running test_defer indicates the body of that condition is unexecuted).
  3. There are also no new Deferred unit tests at all here. The other lines in defer.py which are changing are executed, but nothing verifies that they actually cause Failures with no variables to be constructed.
  4. Please drop the shoulds from the test method docstrings.
  5. In test_printingCapturedVarsSmokeTest, the exampleLocalVar local is unused. Also the indentation for the second line of the re.search is off (it is aligned with the paren of the assertNotEqual, instead of the paren of the re.search).
  6. Likewise on both counts for test_printingCapturedVarsCleanedSmokeTest.

Concerning the first point, the only difficulty with using [] instead of None is that it makes detecting the case in format_frames a little trickier. Perhaps the best way to deal with that is to propagate the captureVars flag into format_frames so it doesn't need to guess based on the state it encounters.

Thanks. :)

comment:32 Changed 8 years ago by spiv

Keywords: review added
Owner: spiv deleted
  1. Ok, done. Rather than add a new param to format_failure to modify the output of format_failure(…, detail='verbose') I added a new value for the detail param. Either way would work, this just felt slightly cleaner to me.
  2. Ok, added. As you note this was coverage that's been lacking for a long time! I explicitly test with and without debugging active.
  3. Ok, tests added for the other changed case too.
  4. Done. The “shoulds” originated in a pre-existing docstring which I've also updated.
  5. It is used: the test asserts that that local variable appears in the output of printDetailedTraceback, although I suppose it appears unused to pyflakes. I've fixed the indentation.
  6. Likewise on both counts.

comment:33 in reply to:  32 Changed 8 years ago by spiv

Here's that comment again, with more Trac-happy markup. Grr.

  1. Ok, done. Rather than add a new param to format_failure to modify the output of format_failure(…, detail='verbose') I added a new value for the detail param. Either way would work, this just felt slightly cleaner to me.
  2. Ok, added. As you note this was coverage that's been lacking for a long time! I explicitly test with and without debugging active.
  3. Ok, tests added for the other changed case too.
  4. Done. The “shoulds” originated in a pre-existing docstring which I've also updated.
  5. It is used: the test asserts that that local variable appears in the output of printDetailedTraceback, although I suppose it appears unused to pyflakes. I've fixed the indentation.
  6. Likewise on both counts.

comment:34 Changed 8 years ago by Jean-Paul Calderone

Keywords: review removed
Owner: set to spiv
  1. @param detail: Three detail ... - should be Four now ;)
  2. It looks like _safeReprVars no longer needs to handle the None case (and it isn't exercised anywhere in the test suite).
  3. The docs in Failure.__doc__ about stack and frame still talks about the possibility of a None, but that's eliminated now.
  4. In test_errbackWithNoArgsNoDebug, test_errbackWithNoArgs, test_errorInCallbackDoesNotCaptureVars, and test_errorInCallbackCapturesVarsWhenDebugging, I suppose func_name et al should be camel case, (funcName, etc). Or _. Or slice localz and globalz off the end of frames[0]. Or something.
  5. There's a left-over print in test_printingCapturedVarsCleanedSmokeTest

Latest build results look good, please merge when the above points are addressed.

comment:35 Changed 8 years ago by spiv

  1. Fixed.
  2. Fixed.
  3. Fixed.
  4. Fixed (sliced).
  5. Fixed.

comment:36 Changed 8 years ago by spiv

(In [31726]) Address review comments.

Refs #5011

comment:37 Changed 8 years ago by spiv

Resolution: fixed
Status: newclosed

(In [31727]) Merge faster-failures-5011: Don't capture all locals and globals in Failures by default.

Author: spiv

Reviewers: jml, therve, exarkun

Fixes: #5011

Failure no longer captures locals and globals by default. This avoids a lot of costly introspection that is usually ignored when a Deferred callback raises an exception. You can pass captureVars=True when constructing a Failure (or for errors during Deferred callbacks use Deferred.setDebugging(True)) if you really want to capture that information.

Note: See TracTickets for help on using tickets.