[Twisted-Python] defers, reactor, idiomatic/proper usage -- new user questions.

Richard Meraz rfmeraz at gmail.com
Tue Jun 21 11:13:07 MDT 2005


Hello all,
 
 Just getting started with Twisted. Thanks to the community for the
tremendous work.  I diligently went through the entire archives of the
mailing list -- reading choice threads -- and I read through all the
documentation on the twisted-matrix website.  I've never done any
event-driven programming, but there was enough on the site for me to
start getting a handle on things.  Below is a short script that crawls
the links from a URI looking for RSS-type feeds.  I'm hoping that some
of the more experienced developers would be willing to give some
advice about whether I'm using twisted.internet.reactor and
twisted.web.client.getPage correctly.  I've put some comments labeled
as #Question where I'm unsure whether I understand exactly what I'm
doing -- I'm hoping someone can refute/chastise/critique the
understanding that is implied by the code and questions.  I have a
thick skin, so the more talented/vitriolic the response the better.
 
 # Find RSS Feeds.
 # Richard Meraz -- rfmeraz at gmail.com. 
from twisted.internet import reactor
from twisted.web.client import getPage
 
import feedlib # Includes a modified version of M. Pilgrim's feedfinder.py
               # and a modified version of D. Mertz's code for URL
               # extraction from p. 228 of __TPIP__.

MAXTIME = 60  # Kill crawl after this time.
TIMEOUT = 20  # Kill page retrieval after this time inactive.
MAXDEPTH = 3  # Recurse to this depth when crawling pages.
 
 # Question: There seem to be many idioms for aggregating information
 # from different deferred callback chains in Twisted.  Since everything
 # runs in a single thread, I just stuck my stuff in a global class, and
 # everybody modifies the vars there as I pass it around to the
 # callbacks that should see it.  Does that seem okay for a small script
 # like this?
 
class StateVars:
    '''Keep global state for starting/stopping feed-finding and a
    record of links we have checked and their status.'''
    connections = 1
    links_checked = {} # Structure: {url: (is RSS/ATOM/RDF, page-content)}
 
 # Question: start_feed_crawl is where I set up my deferreds.  getPage
 # returns a deferred and I attach my callback process_link.
 # addCallbacks adds a callback/errback in parallel, so only one or the
 # other is called?  So I need to add the final errback to catch errors
 # from the callback process_link?  (A minimal sketch of my
 # understanding follows.)
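 If I understand correctly, the semantics are something like this (a
made-up, self-contained example, not part of the script; the lambdas
are just for illustration):

from twisted.internet import defer

d = defer.Deferred()
# The callback/errback pair given to one addCallbacks call sits at the
# same level of the chain, so exactly one of the pair runs.
d.addCallbacks(callback=lambda result: result.upper(),
               errback=lambda failure: 'handled at the same level')
# An errback added afterwards also catches exceptions raised inside
# the callback above.
d.addErrback(lambda failure: 'handled later in the chain')
d.callback('ok')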
      
def start_feed_crawl(uri, depth):
    '''Harvest feeds from a uri.'''
    # Question: how do I time out this deferred chain if getPage is
    # taking too long to finish its work?  What exactly does the
    # timeout argument to getPage do: does it time out the socket
    # after no response, or does it put an upper bound on how long
    # getPage has to finish its work?  (My guess at a manual deadline
    # helper is sketched after this function.)
 
    getPage(uri, timeout=TIMEOUT).addCallbacks(
        callback=process_link,
        callbackArgs=(uri, depth, StateVars),
        errback=process_error,
        errbackArgs=(uri, StateVars),
    ).addErrback(process_error, uri, StateVars)
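 In case getPage's timeout doesn't bound the whole retrieval, here is
my guess at a manual deadline helper (purely a sketch, not used in the
script; with_deadline and the wrapper deferred are my own invention):

from twisted.internet import defer, reactor
from twisted.python import failure

def with_deadline(d, seconds):
    '''Return a deferred that mirrors d, but errbacks with
    TimeoutError if d has not fired within `seconds`.'''
    out = defer.Deferred()
    def on_timeout():
        if not out.called:
            out.errback(defer.TimeoutError('deadline exceeded'))
    timer = reactor.callLater(seconds, on_timeout)
    def forward(result):
        # Whichever fires first (d or the timer) wins; the loser's
        # result is quietly dropped.
        if timer.active():
            timer.cancel()
        if not out.called:
            if isinstance(result, failure.Failure):
                out.errback(result)
            else:
                out.callback(result)
    d.addBoth(forward)
    return out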

def process_link(data, uri, depth, state):
    '''Recursive link-processing callback. Determines whether a link
    is an RSS/ATOM/RDF feed. If not, extracts all xml-like links
    that could point to feeds and starts a crawl on those.'''
    if feedlib.couldBeFeedData(data):
        #print 'Feed: %s' % uri
        state.links_checked[uri] = (True,data)
    else:
        state.links_checked[uri] = (False,None)
        if depth <= MAXDEPTH:
            alinks = feedlib.getALinks(data,uri)
            links = feedlib.getLinks(data,uri)
            rawurls = feedlib.extract_urls(data)        
            links_to_check = [feedlib.makeFullURI(u)
                              for u in set(alinks+links+rawurls)
                              if feedlib.isXMLRelatedLink(u)]
            for l in links_to_check:
                # Don't need to see it again.
                if l in state.links_checked:
                    continue
                state.connections += 1
                # Question: since I'm starting up these deferreds in a
                # callback, they are being created after I've called
                # reactor.run(), because we call start_feed_crawl as we
                # find new links that meet our criteria.  Am I doing
                # anything bad here?  All the examples I've seen
                # (e.g. pp. 548-552 of the Python Cookbook, great
                # example by V. Volonghi and P. Cogolo) have their data
                # up front and therefore set up all the deferreds before
                # calling reactor.run().  Here I'm discovering my data
                # as I go along and setting up deferreds while the
                # reactor is spinning.  Here is my fundamental lack of
                # understanding: while this script seems to run okay,
                # is it okay to do this?
                start_feed_crawl(l, depth + 1)
 
    state.connections -= 1
 
    # Question: Is this how I kill the reactor -- i.e. using some sort
    # of state condition?  Is there a better way; should I try harder
    # to understand DeferredList?  For example: a top-level
    # DeferredList that contains other DeferredLists, which get created
    # to hold all the deferreds (created by start_feed_crawl) for the
    # links on a given page.  Could this DeferredList be told to stop
    # the reactor when the other lists have fired their callbacks
    # (after the component deferreds have finished)?  (Sorry for the
    # convoluted question; I'm new at this.  A rough sketch of what I
    # mean follows this function.)
 
    if state.connections <= 0:
        reactor.stop()
    return
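
 Something like this is what I have in mind (a made-up, simplified
sketch with a fixed list of URIs rather than links discovered on the
fly; crawl_all is my own name for it):

from twisted.internet import defer

def crawl_all(uris):
    # One deferred per page; each gets an errback so a single failed
    # fetch doesn't abort the whole batch.
    ds = [getPage(u, timeout=TIMEOUT).addErrback(lambda f: None)
          for u in uris]
    # consumeErrors=True keeps failures in the component deferreds
    # from also being logged as unhandled errors.
    dl = defer.DeferredList(ds, consumeErrors=True)
    # Fires only after every component deferred has fired.
    dl.addCallback(lambda results: reactor.stop())
    return dl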

def process_error(error,uri,state):
    '''Catch errors in link processing'''
    state.connections -= 1
    if state.connections  <= 0:
        reactor.stop()
    return ''

if __name__ == '__main__':
    import sys
    if len(sys.argv) < 2:
        print 'feedfinder_new.py <uri>'
        sys.exit()

    uri = feedlib.makeFullURI(sys.argv[1])
    start_feed_crawl(uri,1)
  
    # Question: I'm killing the process after a pre-determined amount
    # of time.  However, reactor.stop() seems to kill network
    # connections.  Is there a way to stop the reactor but let the
    # connections finish?  (One alternative I can think of is sketched
    # after the script.)
 
    # Hack to blow out any connections that are hung or uncalled after MAXTIME.
    reactor.callLater(MAXTIME, reactor.stop)
    reactor.run()
 
    for l in StateVars.links_checked:
        if StateVars.links_checked[l][0]:
            print l
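
 One alternative I can think of (just a sketch; crawl_open and
close_crawl are names I made up) is to stop scheduling new fetches
after MAXTIME and let the existing connection counter wind the
reactor down once the outstanding deferreds have fired:

# Mutable flag so the nested function can rebind it (no nonlocal
# statement in Python 2).
crawl_open = [True]

def close_crawl():
    crawl_open[0] = False

reactor.callLater(MAXTIME, close_crawl)

# ...and in process_link, guard the recursive call with the flag:
#     if crawl_open[0]:
#         start_feed_crawl(l, depth + 1)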

Final question: occasionally I get errors that come from the http.py
code in Twisted.  These get printed to the console, but don't
necessarily stop my program.  Should my errbacks be catching these?
How do I keep errors from getting logged to the console (besides
redirecting stderr)?  I can post an example of the errors I'm getting
if necessary.
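 The only knob I've found so far is twisted.python.log (a guess on my
part, not something the script above uses): once logging is started
with a file, Twisted's error output seems to go there instead of the
console.

from twisted.python import log

# Send Twisted's log output (including unhandled-error tracebacks) to
# a file instead of stderr; setStdout=0 leaves normal prints alone.
log.startLogging(open('crawl.log', 'w'), setStdout=0)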
 
 Thanks for your help.
 
 Richard F. Meraz
 
-- 
Never think there is anything impossible for the soul. It is the
greatest heresy to think so. If there is sin, this is the only sin –
to say that you are weak, or others are weak.

Swami Vivekananda

