[Twisted-web] Defers, the reactor, and idiomatic/proper usage -- new user needs some advice?

Dave Gray dgray at omniti.com
Thu Jun 23 10:18:06 MDT 2005


I'm not familiar with feedlib, etc, but I'll answer what I can.

Richard Meraz wrote:
> MAXTIME = 60 # Kill crawl after this time
> TIMEOUT = 20 # Kill page retrieval after this time inactive
> MAXDEPTH = 3 # Recurse this depth when crawling page.
> 
> # Question: There seem to be many idioms to aggregate information from 
> different deferred call-back chains in twisted.  Since everything runs 
> in a single thread I just stuck my stuff in a global class and everybody 
> modifies the vars there as I pass it around to the call-backs that 
> should see it.  Seems okay for a small script like this?

That seems fine, yeah. I think I would pass around the StateVars 
instance as a context if I were coding this. Probably the same effect.
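
Something like this is what I had in mind (just a sketch, reusing your
names): one StateVars instance created up front, then handed to every
callback that needs it.

    class StateVars:
        '''One shared state object, passed explicitly to the callbacks.'''
        def __init__(self):
            self.connections = 1
            self.links_checked = {}  # {url: (is RSS/ATOM/RDF, page-content)}

    state = StateVars()

    # then each callback gets the same instance as an extra argument, e.g.
    #   d.addCallback(process_link, uri, depth, state)
    #   d.addErrback(process_error, uri, state)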

> class StateVars:
>     '''Keep global state for starting/stopping feedfinding and a record
>     of links we have checked and their status'''
>     connections = 1
>     links_checked = {} # Structure: {url: (is RSS/ATOM/RDF, page-content)}
> 
> # Question: start_feed_crawl is where I set up my defers.  getPage
> # returns a defer and I attach my call-back process_link.
> # addCallbacks adds a callback/errback in parallel so only one or the
> # other is called?  So I need to add the final errback to catch errors
> # from callback process_link?

Correct. Well, sort of. See below.

> def start_feed_crawl(uri, depth):
>     '''Harvest feeds from a uri'''
> # Question: how to time-out this deferred chain if getPage is taking too
> # long to finish its work.  What exactly does the argument timeout to
> # getPage do?  Does it time out the socket after no response, or does it
> # put an upper bound on how long getPage has to finish its work?
> 
>     getPage(uri, timeout=TIMEOUT).addCallbacks(callback=process_link,
>                                                callbackArgs=(uri, depth, StateVars),
>                                                errback=process_error,
>                                                errbackArgs=(uri, StateVars)
>                                                ).addErrback(process_error, uri, StateVars)

It seems clearer to me to write this as follows, but that's personal 
preference:

     d = getPage(...)
     d.addCallbacks(...)
     d.addErrback(...)

But since you're adding the same errback in two places, you could 
simplify this to:

     d = getPage(...)
     d.addCallback(process_link, uri, depth, StateVars)
     d.addErrback(process_error, uri, StateVars)

With addCallback followed by addErrback, the single errback catches both 
a failure from getPage and an error raised inside process_link, which is 
what your two-errback version was after.

<http://twistedmatrix.com/projects/core/documentation/howto/defer.html#auto4> 
has a nice visual explanation of what happens when.

> # Question: since I'm starting up these defers in a callback they are
> # being created after I've called reactor.run() since we call start_feed_crawl
> # as we find new links that meet our criteria.  Am I doing anything bad here?
> # All the examples I've seen (eg. p. 548-552 Python Cookbook, great eg by V. Volonghi
> # and P. Cogolo) have their data up-front and therefore set-up all the defers before calling
> # reactor.run().  Here I'm discovering my data as I go along and setting up defers while
> # the reactor is spinning.  Here is my fundamental lack of understanding.  While this script
> # seems to run okay, is it okay to do this?

Yes, that's fine. I think the pattern you've seen the most is actually 
the odd case: being able to set up all the Deferreds beforehand. Creating 
new Deferreds while the reactor is running is perfectly normal.

>     # Question: Is this how I kill the reactor -- ie. using some sort of
>     # state condition.  Is there a better way, should I try better to
>     # understand deferred-list?  For example: a top-level deferred-list
>     # that contains other deferred-lists which get created to hold all
>     # the defers (created by start_feed_crawl) for the links on a given
>     # page.  Could this deferred-list be told to stop the reactor when
>     # the other lists have fired their callback (after the component
>     # defers have finished)?  (Sorry for the convoluted question here,
>     # I'm new at this)

What you want to do is stop the reactor once everything is done 
processing. So after you call start_feed_crawl the first time, have it 
return the Deferred that getPage gives you, and add a callback to that 
which stops the reactor. The trick is that if you stuff that Deferred 
into a DeferredList before adding the reactor-stopping callback, and a 
callback somewhere in the chain returns another Deferred, the chain (and 
therefore the DeferredList) won't fire until that new Deferred completes. 
So you keep stacking Deferreds inside the first one, and the reactor.stop 
callback on the DeferredList only fires once a callback finally returns 
something that isn't a Deferred.
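
Roughly, applied to your crawler, the shape would be something like this 
(just a sketch, not tested against your script; process_link and 
process_error here are simplified stand-ins for your functions, and 
re-fetching the same uri stands in for following newly discovered links):

    from twisted.internet import defer, reactor
    from twisted.web.client import getPage

    TIMEOUT = 20

    def process_link(content, uri, depth):
        print 'fetched %s (%d bytes)' % (uri, len(content))
        if depth > 0:
            # Returning this Deferred pauses the outer chain (and the
            # DeferredList below) until the nested crawl completes.
            return start_feed_crawl(uri, depth - 1)
        # Returning a plain value lets the chain finish, so the
        # DeferredList fires and the reactor gets stopped.

    def process_error(failure, uri):
        print 'error fetching %s: %s' % (uri, failure.getErrorMessage())

    def start_feed_crawl(uri, depth):
        d = getPage(uri, timeout=TIMEOUT)
        d.addCallback(process_link, uri, depth)
        d.addErrback(process_error, uri)
        return d

    dl = defer.DeferredList([start_feed_crawl('http://www.google.com/', 2)])
    dl.addCallback(lambda result: reactor.stop())
    reactor.run()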

There might be an easier way to do this, but this is the way I know 
(example attached). Someone please let me know if there's an easier way. 
To see the example, run it with 'twistd -noy fetchpage.tac' then do 
'telnet localhost 9000' and send:

GET /?target=http://www.google.com/ HTTP/1.1
Host: localhost



> Final question: occasionally I get errors that come from the http.py 
> code in twisted.  These get printed to the console, but don't necessarily 
> stop my program.  Should my errbacks be catching these?  How do I keep 
> errors from getting logged to the console (besides redirecting stderr). I 
> can post an example if necessary of the errors I'm getting.

When you create the DeferredList, pass in consumeErrors=1 so the failures 
from the component Deferreds aren't logged as unhandled errors. This will 
make debugging that much more annoying, though...
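
For example (a sketch; with consumeErrors=1 the failures end up in the 
DeferredList's result list instead of on the console, so you can report 
them yourself):

    from twisted.internet import defer, reactor
    from twisted.web.client import getPage

    uris = ['http://www.google.com/', 'http://no.such.host.invalid/']
    ds = [getPage(uri) for uri in uris]

    def report(results):
        # DeferredList fires with a list of (success, result) pairs; with
        # consumeErrors=1 the Failures land here instead of being logged.
        for uri, (ok, result) in zip(uris, results):
            if ok:
                print uri, len(result), 'bytes'
            else:
                print uri, 'failed:', result.getErrorMessage()
        reactor.stop()

    dl = defer.DeferredList(ds, consumeErrors=1)
    dl.addCallback(report)
    reactor.run()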

HTH,
Dave
-------------- next part --------------
from twisted.web import server
from twisted.web.resource import Resource
from twisted.web.client import getPage

from twisted.internet import defer, reactor
from twisted.python import log
from cgi import escape
class Foo(Resource):
    counter = 0
    isLeaf = True

    def render_GET(self, request):
        self.rq = request
        target = escape(request.args['target'][0])
        # getPage returns a Deferred; chain the page handler onto it
        d = getPage(target).addCallback(self.print_page)
        d.addErrback(log.err)
        # wrap that Deferred in a DeferredList whose callback stops the
        # reactor; it won't fire until the whole chain below has finished
        dl = defer.DeferredList([d])
        dl.addCallback(stopNow)
        dl.addErrback(log.err)
        return server.NOT_DONE_YET

    def print_page(self, html):
        if Foo.counter < 5:
            Foo.counter += 1
            print 'request ' + str(Foo.counter)
            # return a fresh Deferred from this callback: the outer chain
            # (and the DeferredList above) pauses until it fires, one
            # second later, with the same html
            d = defer.Deferred()
            d.addCallback(self.print_page)
            d.addErrback(log.err)
            reactor.callLater(1, d.callback, html)
            return d
        else:
            print 'now we can write stuff back'
            self.rq.write(str(len(html)) + ' ' + str(Foo.counter))
            self.rq.finish()
            self.rq.transport.loseConnection()
            # no deferred being returned, stopNow fires

def stopNow(cbval):
    # can't add reactor.stop as a callback directly
    # because it doesn't know what to do with the extra
    # argument being returned from the callback
    print cbval
    reactor.stop()

resource = Foo()
site = server.Site(resource)

from twisted.application import service, internet
application = service.Application("Foo")
internet.TCPServer(9000, site).setServiceParent(application)

# vim: ai sts=4 sw=4 expandtab syntax=python :

