[Twisted-web] Re: Twisted-web Digest, Vol 15, Issue 20
Richard Meraz
rfmeraz at gmail.com
Thu Jun 23 12:35:15 MDT 2005
Thanks Dave: very clear and easy to follow answer and example code. I
definitely appreciate your time.
Final question: is there a convenient way to put an upper bound on
how long twisted.web.client.getPage is allowed to take to complete its
work? I know getPage takes a timeout parameter, but this seems more
like a socket timeout, which won't kill, for example, a getPage
waiting on a low-bandwidth server. (Is my understanding correct?)
For example, if I'm using the asyncore.py framework to manage I/O, I can
use the channel.timestamp attribute to examine how long things have
been running and kill long-running I/O in a polling loop. Of
course, with asyncore I can manage my own polling loop, which I can't
see an easy way to do with the reactor. (Would someone care to comment
on that? I'm probably missing something.)
-Thanks again
-Richard Meraz
On 6/23/05, twisted-web-request at twistedmatrix.com
<twisted-web-request at twistedmatrix.com> wrote:
> Today's Topics:
>
> 1. Re: Defers, the reactor, and idiomatic/proper usage -- new
> user needs some advice? (Dave Gray)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 23 Jun 2005 12:18:06 -0400
> From: Dave Gray <dgray at omniti.com>
> Subject: Re: [Twisted-web] Defers, the reactor, and idiomatic/proper
> usage -- new user needs some advice?
> To: Richard Meraz <rfmeraz at gmail.com>, "Discussion of twisted.web,
> Nevow, and Woven" <twisted-web at twistedmatrix.com>
> Message-ID: <42BAE0BE.1040209 at omniti.com>
> Content-Type: text/plain; charset="windows-1252"
>
> I'm not familiar with feedlib, etc, but I'll answer what I can.
>
> Richard Meraz wrote:
> > MAXTIME = 60 # Kill crawl after this time
> > TIMEOUT = 20 # Kill page retrieval after this time inactive
> > MAXDEPTH = 3 # Recurse this depth when crawling page.
> >
> > # Question: There seem to be many idioms to aggregate information from
> > # different deferred callback chains in Twisted. Since everything runs
> > # in a single thread I just stuck my stuff in a global class and everybody
> > # modifies the vars there as I pass it around to the callbacks that
> > # should see it. Seems okay for a small script like this?
>
> That seems fine, yeah. I think I would pass around the StateVars
> instance as a context if I were coding this. Probably the same effect.
>
> > class StateVars:
> >     '''Keep global state for starting/stopping feedfinding and a record
> >     of links we have checked and their status'''
> >     connections = 1
> >     links_checked = {}  # Structure: {url: (is RSS/ATOM/RDF, page-content)}
> >
> > # Question: start_feed_crawl is where I set up my Deferreds. getPage
> > # returns a Deferred and I attach my callback process_link.
> > # addCallbacks adds a callback/errback in parallel, so only one or the
> > # other is called? So I need to add
> > # the final errback to catch errors from the callback process_link?
>
> Correct. Well, sort of. See below.
>
> > def start_feed_crawl(uri, depth):
> >     '''Harvest feeds from a uri'''
> >     # Question: how to time-out this deferred chain if getPage is taking too
> >     # long to finish its work.
> >     # what exactly does the argument timeout to getPage do, does it timeout
> >     # the socket after a no-response
> >     # or does it put an upper-bound on how long getPage has to finish its work?
> >
> >     getPage(uri, timeout=TIMEOUT).addCallbacks(
> >         callback=process_link,
> >         callbackArgs=(uri, depth, StateVars),
> >         errback=process_error,
> >         errbackArgs=(uri, StateVars)
> >     ).addErrback(process_error, uri, StateVars)
>
> It seems clearer to me to write this as follows, but that's personal
> preference:
>
> d = getPage(...)
> d.addCallbacks(...)
> d.addErrback(...)
>
> But since you're setting up the call to the same errback twice, you
> could simplify this to:
>
> d = getPage(...)
> d.addCallback(process_link, uri, depth, StateVars)
> d.addErrback(process_error, uri, StateVars)
>
> <http://twistedmatrix.com/projects/core/documentation/howto/defer.html#auto4>
> has a nice visual explanation of what happens when.
>
> > # Question: since I'm starting up these Deferreds in a callback they are
> > # being created after I've called reactor.run(), since we call start_feed_crawl
> > # as we find new links that meet our criteria. Am I doing anything bad here?
> > # All the examples I've seen (e.g. pp. 548-552 of the Python Cookbook, a great example by V. Volonghi
> > # and P. Cogolo) have their data up front and therefore set up all the Deferreds before calling
> > # reactor.run(). Here I'm discovering my data as I go along and setting up Deferreds while
> > # the reactor is spinning. Here is my fundamental lack of understanding. While this script
> > # seems to run okay, is it okay to do this?
>
> Yes, that's fine. I think the one you've seen the most is the odd case -
> being able to set up all the Deferreds beforehand.
>
> > # Question: Is this how I kill the reactor, i.e. using some sort of
> > # state condition? Is there a better way?
> > # Should I try harder to understand DeferredList? For example, a
> > # top-level DeferredList that contains
> > # other DeferredLists, which get created to hold all the Deferreds
> > # (created by start_feed_crawl) for the
> > # links on a given page. Could this DeferredList be told to stop
> > # the reactor when the other lists have
> > # fired their callbacks (after the component Deferreds have finished)?
> > # (Sorry for the convoluted question here;
> > # I'm new at this.)
>
> What you want to do is stop the reactor when everything is done
> processing. So after you call start_feed_crawl the first time, returning
> the Deferred that getPage gives you, you can add a callback to that
> which stops the reactor. The trick is to put that Deferred into a
> DeferredList and add the reactor-stopping callback to the DeferredList.
> Then, if your first operation itself returns a Deferred, the
> DeferredList won't call its callbacks until that other Deferred
> completes. So you'll be stacking up a whole bunch of Deferreds inside
> the first one, and the callback on the DeferredList that does the
> reactor.stop won't fire until a callback finally returns something
> other than a Deferred.
>
> There might be an easier way to do this, but this is the way I know
> (example attached). Someone please let me know if there's an easier way.
> To see the example, run it with 'twistd -noy fetchpage.tac' then do
> 'telnet localhost 9000' and send:
>
> GET /?target=http://www.google.com/ HTTP/1.1
> Host: localhost
>
>
>
> > Final question: occasionally I get errors that come from the http.py
> > code in Twisted. These get printed to the console, but don't necessarily
> > stop my program. Should my errbacks be catching these? How do I keep
> > errors from getting logged to the console (besides redirecting stderr)? I
> > can post an example of the errors I'm getting if necessary.
>
> When you create the DeferredList, pass in consumeErrors=1 - this will
> make debugging that much more annoying though...
>
> HTH,
> Dave
> -------------- next part --------------
> from twisted.web import server
> from twisted.web.resource import Resource
> from twisted.web.client import getPage
>
> from twisted.internet import defer, reactor
> from twisted.python import log
> from cgi import escape
>
> class Foo(Resource):
>     counter = 0
>     isLeaf = True
>
>     def render_GET(self, request):
>         self.rq = request
>         target = escape(request.args['target'][0])
>         d = getPage(target).addCallback(self.print_page)
>         d.addErrback(log.err)
>         dl = defer.DeferredList([d])
>         dl.addCallback(stopNow)
>         dl.addErrback(log.err)
>         return server.NOT_DONE_YET
>
>     def print_page(self, html):
>         if Foo.counter < 5:
>             Foo.counter += 1
>             print 'request '+str(Foo.counter)
>             d = defer.Deferred()
>             d.addCallback(self.print_page)
>             d.addErrback(log.err)
>             reactor.callLater(1, d.callback, html)
>             return d
>         else:
>             print 'now we can write stuff back'
>             self.rq.write(str(len(html))+' '+str(Foo.counter))
>             self.rq.finish()
>             self.rq.transport.loseConnection()
>             # no deferred being returned, stopNow fires
>
> def stopNow(cbval):
>     # can't add reactor.stop as a callback directly
>     # because it doesn't know what to do with the extra
>     # argument being returned from the callback
>     print cbval
>     reactor.stop()
>
> resource = Foo()
> site = server.Site(resource)
>
> from twisted.application import service, internet
> application = service.Application("Foo")
> internet.TCPServer(9000, site).setServiceParent(application)
>
> # vim: ai sts=4 sw=4 expandtab syntax=python :
>
--
Never think there is anything impossible for the soul. It is the
greatest heresy to think so. If there is sin, this is the only sin –
to say that you are weak, or others are weak.
Swami Vivekananda