Dave Gray dgray at omniti.com
Thu Jun 23 10:18:06 MDT 2005

I'm not familiar with feedlib, etc, but I'll answer what I can.

Richard Meraz wrote:
> MAXTIME = 60 # Kill crawl after this time
> TIMEOUT = 20 # Kill page retrieval after this time inactive
> MAXDEPTH = 3 # Recurse this depth when crawling page.
> # Question: There seem to be many idioms to aggregate information from 
> different defered call-back chains in twisted..  Since everything runs 
> in a single thread I just stuck my stuff in a global class and everybody 
> modifies the vars there as I pass it around to the call-backs that 
> should see it.  Seems okay for a small script like this?

That seems fine, yeah. I think I would pass around the StateVars 
instance as a context if I were coding this. Probably the same effect.

> class StateVars:
>     '''Keep Global state for starting/stopping feedfinding and a record 
> of links we have checked and their status'''
>     connections = 1 
>     links_checked = {} # Structure: {url: (is RSS/ATOM/RDF, page-content)}
> # Question: start_feed_crawl is where I set up my defers.  getPage 
> returns a defer and I attach my call-back process_link.
> # addCallbacks adds a callback/errback in parallel so only one or the 
> other is called?  so I need to add
> # the final errback to catch errors from callback process_link ?

Correct. Well, sort of. See below.

> def start_feed_crawl(uri,depth):
>     '''Harvest feeds from a uri'''
> # Question: how to time-out this deferred chain if getPage is taking too 
> long to finish its work.
> # what exactly does the argument timeout to getPage do,  does it timeout 
> the socket after a no-response
> # or does it put an upper-bound on how long getPage has to finish its work?
>     getPage(uri, timeout=TIMEOUT).addCallbacks(
>         callback=process_link,
>         callbackArgs=(uri, depth, StateVars),
>         errback=process_error,
>         errbackArgs=(uri, StateVars)
>         ).addErrback(process_error, uri, StateVars)

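It seems clearer to me to write this as follows, but that's personal 
preference:

     d = getPage(...)
     d.addCallbacks(process_link, process_error,
                    callbackArgs=(uri, depth, StateVars),
                    errbackArgs=(uri, StateVars))
     d.addErrback(process_error, uri, StateVars)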

But since you're setting up the call to the same errback twice, you 
could simplify this to:

     d = getPage(...)
     d.addCallback(process_link, uri, depth, StateVars)
     d.addErrback(process_error, uri, StateVars)

The Deferred howto in the Twisted documentation has a nice visual 
explanation of what happens when.

> # Question: since I'm starting up these defers in a callback they are
> # being created after I've called reactor.run() since we call start_feed_crawl
> # as we find new links that meet our criteria.  Am I doing anything bad here?
> # All the examples I've seen (eg. p. 548-552 Python Cookbook, great eg by V. Volonghi
> # and P. Cogolo) have their data up-front and therefore set-up all the defers before calling
> # reactor.run().  Here I'm discovering my data as I go along and setting up deferrs while
> # the reactor is spinning.  Here is my fundamental lack of understanding.  While this script
> # seems to run okay, is it okay to do this?

Yes, that's fine. I think the case you've seen most often, where all 
the Deferreds can be set up beforehand, is actually the unusual one.

>     # Question: Is this how I kill the reactor -- ie. using some sort of 
> state condition.  Is there a better way,
>     # should I try better to understand deferred-list.  For example.  A 
> top-level deferred-list that contains
>     # other deferred-lists which get created to hold all the defers 
> (created by start_feed_crawl) for the
>     # links on a given page.  Could this deferred-list be told to stop 
> the reactor when the other lists have
>     # fired their callback (after the component defers have finished) ?  
> (Sorry for the convoluted question here
>     # I'm new at this)

What you want is to stop the reactor once everything has finished 
processing. Have start_feed_crawl return the Deferred that getPage 
gives you, and after the first call add a callback to it that stops 
the reactor. The trick is to put that Deferred into a DeferredList 
first and add the reactor-stopping callback to the DeferredList. If a 
callback on the first Deferred itself returns a Deferred, the 
DeferredList won't call its callbacks until that inner Deferred 
completes. So you end up stacking a whole series of Deferreds inside 
the first one, and the callback on the DeferredList that does the 
reactor.stop won't fire until a callback returns something other than 
a Deferred.

There might be an easier way to do this, but this is the way I know 
(example attached). Someone please let me know if there's an easier way. 
To see the example, run it with 'twistd -noy fetchpage.tac' then do 
'telnet localhost 9000' and send:

GET /?target=http://www.google.com/ HTTP/1.1
Host: localhost

> Final question: occasionally I get errors that come from the http.py 
> code in twisted.  This get printed to the console, but don't necessarily 
> stop my program.  Should my errbacks be catching these?  How do I keep 
> errors from getting logged to the console (beside redirecting stderr). I 
> can post an example if necessary of the errors I'm getting.

When you create the DeferredList, pass in consumeErrors=1 so that 
failures in the Deferreds it holds are not logged as unhandled errors. 
This will make debugging that much more annoying, though...

from twisted.web import server
from twisted.web.resource import Resource
from twisted.web.client import getPage

from twisted.internet import defer, reactor
from twisted.python import log
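from cgi import escape


class Foo(Resource):
    counter = 0

    def render_GET(self, request):
        # keep the request so print_page can write the response later
        self.rq = request
        target = escape(request.args['target'][0])
        d = getPage(target).addCallback(self.print_page)
        # dl fires only once print_page stops returning Deferreds
        dl = defer.DeferredList([d])
        dl.addCallback(stopNow)
        return server.NOT_DONE_YET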

    def print_page(self, html):
        if Foo.counter < 5:
            Foo.counter += 1
            print 'request '+str(Foo.counter)
            # returning a Deferred pauses the outer chain until it fires
            d = defer.Deferred()
            d.addCallback(self.print_page)
            reactor.callLater(1, d.callback, html)
            return d
        print 'now we can write stuff back'
        self.rq.write(str(len(html))+' '+str(Foo.counter))
        self.rq.finish()
        # no deferred being returned, stopNow fires

def stopNow(cbval):
    # can't add reactor.stop as a callback directly
    # because it doesn't know what to do with the extra
    # argument being returned from the callback
    print cbval
    reactor.stop()

resource = Foo()
site = server.Site(resource)

from twisted.application import service, internet
application = service.Application("Foo")
internet.TCPServer(9000, site).setServiceParent(application)

# vim: ai sts=4 sw=4 expandtab syntax=python :

