[Twisted-Python] Unhandled exceptions and observability

Jean-Paul Calderone exarkun at twistedmatrix.com
Thu Dec 28 06:29:15 MST 2017


On Wed, Dec 27, 2017 at 11:18 PM, Svein Seldal <sveinse at seldal.com> wrote:

> Hi
>
>
> I'm not sure how to write this email, but please let me try. I'd like to
> address something that I see as a limitation in Twisted. It might be that
> my use case is odd or that I'm outside the scope of Twisted, but
> nonetheless I hope this could be a relevant topic.
>
> Problem:
>
> Unhandled exceptions can leave the application in a half-working state,
> and in-app observability of them is difficult to obtain. Instead of the
> whole application terminating, the rest of the app keeps running,
> completely unaware of the failure.
>
> This applies to unhandled errors in Deferreds and, in principle, to any
> other reactor callback. E.g. it can occur in Deferreds used internally by
> Twisted, where direct access to the object isn't available to the caller.
>
> As a user of Twisted, I would like to have the option to catch or fail my
> application completely when these unhandled exceptions occur, as would be
> expected in a sequential program.
>
>
I'm not sure I agree with the problem statement or your idea (below) for
solving it.  However, it's straightforward to implement your idea with
current and all recent versions of Twisted.  This program will exit rapidly:


from twisted.internet import reactor
from twisted.python.log import addObserver

def stop_on_errors(event):
    if event['isError']:
        reactor.stop()

addObserver(stop_on_errors)

def fail():
    1/0

reactor.callLater(0, fail)
reactor.run()


Someone else can probably demonstrate how to do the same thing using
twisted.logger instead.
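
Off the top of my head, an unverified sketch with twisted.logger might look
like the following. It assumes the failure-bearing events of interest carry
either the new-style "log_failure" key or the legacy "isError" flag; check
against the Twisted version you actually run:

from twisted.internet import reactor
from twisted.logger import globalLogPublisher

def stop_on_failures(event):
    # New-style failure events carry "log_failure"; events forwarded from
    # the legacy log API may instead carry the old "isError" flag.
    if event.get("log_failure") is not None or event.get("isError"):
        reactor.stop()

globalLogPublisher.addObserver(stop_on_failures)

def fail():
    1/0

reactor.callLater(0, fail)
reactor.run()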


>
> Background:
>
> I have a larger application using many simultaneous TCP, UDP and UNIX
> connections. Like Twisted itself, the app is organized into functional
> groups, with most of the heavy lifting done in black-box-ish modules. There
> is, of course, no guarantee that everything works smoothly, and if
> something fails, the entire application stops as a clear indication of the
> failure. However, there have been occasions where this application was
> found to be half-dead, due to a failure occurring in a reactor-based
> callback that could only be seen by reading the logs. The main application
> is unfortunately unaware of its own failure.
>

The black boxes should probably not be so black that they hide whether they
are working or broken from the calling code.  What if they are broken in a
way that doesn't raise an exception?  What if they are broken in a way that
doesn't signal whatever ad hoc channel you invent or discover for
determining if they are broken?  The only real solution is for error
signaling to be a guaranteed part of the interface to the black box.
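
For example (a sketch only; BlackBox and its method names are invented for
illustration), one common shape for such an interface is to hand the caller
a Deferred, so that failure is part of the contract rather than something
that disappears into the logs:

from __future__ import print_function

from twisted.internet import defer, task

class BlackBox(object):
    """Hypothetical component whose interface guarantees error signaling:
    start() returns a Deferred that fires with the result or the failure."""

    def start(self):
        return defer.maybeDeferred(self._work)

    def _work(self):
        raise RuntimeError("something broke inside the black box")

def report_and_exit(failure):
    # The caller decides what a failure means; here it ends the program.
    print("black box failed:", failure.value)

def main(reactor):
    d = BlackBox().start()
    d.addErrback(report_and_exit)
    return d

task.react(main)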


>
> AFAIK Twisted has no direct mechanism for handling errors that occur when
> user code is called from the reactor. Or even worse, the caller does not
> know about the failure unless it has direct access to the failing object.
> I believe this is more dangerous to reliability than a plainly failing
> application, due to the lower observability.
>

Correct.  No *direct* mechanism.  Various indirect mechanisms exist,
though, such as the logging example given above.


>
> Lets say the following code is used in a running application:
>
>    from twisted.internet.task import LoopingCall
>    class Foo:
>      def __init__(self):
>        self.loop = LoopingCall(self.cb)
>        self.loop.start(2, False)
>      def cb(self):
>        self.count += 1
>
>    # Main app does this:
>    try:
>      foo = Foo()
>    except:
>      print "Won't happen"
>      raise
>
> The code will fail due to the programming error in cb, but the calling
> application won't fail and thinks everything is fine. Debugging errors
> like this comes down to looking through the logs.
>

Foo is broken.  It uses the global reactor.  It creates side-effects in
__init__.  It creates a Deferred (LoopingCall.start) without attaching
callbacks or errbacks.
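
A sketch of one way to restructure it (the constructor signature and the
separate start() method are my additions, not anything from the original):

from twisted.internet.task import LoopingCall

class Foo(object):
    def __init__(self, reactor):
        # Take the reactor as a dependency rather than using the global one.
        self.count = 0
        self.loop = LoopingCall(self.cb)
        self.loop.clock = reactor

    def start(self):
        # Return the LoopingCall's Deferred so the caller can attach an
        # errback and actually observe a failure in cb.
        return self.loop.start(2, now=False)

    def cb(self):
        self.count += 1

The main app can then do something like d = Foo(reactor).start();
d.addErrback(handle_failure) and decide for itself whether a failure there
should stop the process.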


>
>
> The 0-solution:
>
> Everywhere a function is being called from the reactor, the user is
> responsible for handling all exceptions. As is the current case.
>
> However, this is not completely straightforward. try-except is great for
> catching expected errors, but it's easy to forget and ignore the unexpected
> ones, like in the example above. The safeguard for this would be something
> like:
>
>    def cb(self):
>      try:
>         self.count += 1
>      except:
>         print "Whoops. Unexpected"
>         signal_main_app()
>
> And in a large application there are many entry points (e.g. methods in a
> protocol handler), so the code becomes very cluttered. Plus it puts the
> responsibility on the user to implement the signal_main_app() framework.
>

You say "cluttered".  Someone else might say "has correct error handling
code".  You also don't need to modify every piece of application code this
way.  You can compose this error handling into just about anything.  For
example:

from functools import wraps

def handle_errors_for_broken_app_code(f):
    @wraps(f)
    def g(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except Exception:
            signal_main_app()
    return g


Or whatever variation of function composition strikes your fancy.
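
As a usage sketch (foo and signal_main_app() are the hypothetical names from
the earlier messages), the wrapping happens at the point where a callback is
handed to the reactor:

from twisted.internet.task import LoopingCall

# Wrap the possibly-buggy callback so an unexpected exception reaches
# signal_main_app() instead of only the log.
loop = LoopingCall(handle_errors_for_broken_app_code(foo.cb))
loop.start(2, now=False)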

An entirely different approach, if you don't want to have to rely on your
black boxes having a reliable error signal, is to create a monitoring system
instead.  Flip everything around and check the system for working-ness
instead of broken-ness.  If you ever can't confirm working-ness, it's a
good bet something has gone wrong and action should be taken.
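
A sketch of what that could look like (all of the names here are
hypothetical; the point is the shape, not the details):

from twisted.internet.task import LoopingCall

class Watchdog(object):
    """Components call beat(name) whenever they make progress; if any of
    them goes quiet for too long, the whole application is stopped."""

    def __init__(self, reactor, timeout=30):
        self._reactor = reactor
        self._timeout = timeout
        self._last_beat = {}

    def beat(self, name):
        self._last_beat[name] = self._reactor.seconds()

    def check(self):
        now = self._reactor.seconds()
        for name, when in self._last_beat.items():
            if now - when > self._timeout:
                # Working-ness could not be confirmed; take action.
                self._reactor.stop()

    def start(self):
        loop = LoopingCall(self.check)
        loop.clock = self._reactor
        return loop.start(self._timeout / 2.0)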


>
>
> Proposal:
>
> The ideal solution would be if there were a way to configure Twisted to
> report unhandled exceptions. It could be an addSystemEventTrigger(), or a
> software signal, or a process signal, or perhaps a global
> execute-last-errback function. Possibly in a debug context.
>
>
Basically, there is a way now: the logging system - since "unhandled
exceptions" are actually always handled by the reactor and logged.  The
logging-based approach has some properties I would identify as "problems",
but it may work for you.

The idea of a "last errback" is flawed in various ways and has been
discussed and discarded many times in the past (I would love to provide a
link to such discussion and apologize for not doing so; perhaps someone
else can do the necessary digging to find one).



> With this, one could inform the application that a Deferred has not
> handled its errors. Then the main application is given a choice to respond
> appropriately, like shutting down.
>
>
> Is my concern about the non-observability of unhandled exceptions at all
> warranted? Is the thinking wrong? Are there any other types of solutions to
> this problem? (I would like to avoid having to patch Twisted to do it.)
>
>
Hopefully the above gives you some ideas for alternate solutions.  If
they're not workable, discussion about the particulars of why not might be
interesting and could generate some other ideas.

Thanks,
Jean-Paul