[Twisted-Python] Scalability of an rss-aggregator

Andrew Bennetts andrew-twisted at puzzling.org
Wed Mar 31 07:34:06 EST 2004

On Wed, Mar 31, 2004 at 01:27:49PM +0200, Valentino Volonghi aka Dialtone wrote:
> Andrew Bennetts wrote:
> >On Wed, Mar 31, 2004 at 09:33:58AM +0200, Valentino Volonghi aka Dialtone wrote:
> >
> >>Hi all,
> >>attached you will find my rss-aggregator made with twisted.
> >>
> >>It's really fast although when I tried with 745 feeds I got some problems.
> >>When the download reached 300 parsed feeds (more or less) it locked till 
> >>I pressed Ctrl+C and then it
> >>processed the remaining 340 feeds in less than 30 seconds... I think 
> >>that my design has at least an issue
> >>but  I cannot find it so easily and I hope someone on this list can help 
> >>me to improve it.
> >
> >By default, Twisted uses the platform name resolver, which is blocking.
> >Perhaps a non-existent domain is causing gethostbyname to block?
> >
> Uhmm... dunno, but I tried to remove the 'locking' feed-source and it 
> didn't change.

Hmm, it's unlikely to be DNS lookups causing it, then.

We need some way to narrow down where it's happening.  There are a few
options I can think of, but they're all a bit heavyweight...

  - Use strace to get some idea what it's doing
  - Use the --spew option of twistd (or manually install the spewer with
    "from twisted.python.util import spewer; sys.settrace(spewer)")
  - Use gdb to attach to the process, then look at the backtrace there.
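
For the tracing option, here is a minimal sketch of the sys.settrace mechanism that the spewer relies on. It is not Twisted's spewer (which prints far more); make_call_tracer and trace_calls are hypothetical helper names, and this version only records function calls rather than every line:

```python
import sys


def make_call_tracer(log):
    """Return a trace function that records one entry per Python call."""
    def tracer(frame, event, arg):
        if event == "call":
            code = frame.f_code
            log.append((code.co_filename, frame.f_lineno, code.co_name))
        # Returning None disables line-by-line tracing inside the call.
        return None
    return tracer


def trace_calls(func, *args, **kwargs):
    """Run func with call tracing enabled; return (result, call log)."""
    log = []
    previous = sys.gettrace()
    sys.settrace(make_call_tracer(log))
    try:
        result = func(*args, **kwargs)
    finally:
        # Always restore whatever trace function was installed before.
        sys.settrace(previous)
    return result, log
```

The last few entries in the log before a hang should point at roughly where the program stopped making progress.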

(You can apparently get the python backtrace in gdb by putting this macro in
your .gdbinit:

define ppystack
    while $pc < Py_Main || $pc > Py_GetArgcArgv
        if $pc > eval_frame && $pc < PyEval_EvalCodeEx
            set $__fn = PyString_AsString(co->co_filename)
            set $__n = PyString_AsString(co->co_name)
            printf "%s (%d): %s\n",  $__fn, f->f_lineno, $__n
        end
        up-silently 1
    end
    select-frame 0
end

But I've never tried this...)

Is it possible that feedparser is hanging on trying to parse that feed?
Perhaps try putting print statements before and after the
feedparser.parse call.
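
A minimal sketch of that idea, assuming a small wrapper is acceptable; timed_call is a hypothetical helper, not part of feedparser or Twisted:

```python
import time


def timed_call(label, func, *args, **kwargs):
    """Print a line before and after calling func, with the elapsed time."""
    print("start  %s" % label, flush=True)
    started = time.monotonic()
    try:
        return func(*args, **kwargs)
    finally:
        elapsed = time.monotonic() - started
        print("finish %s after %.3fs" % (label, elapsed), flush=True)
```

With it, something like timed_call(url, feedparser.parse, data) would replace the bare feedparser.parse(data) call (url and data are placeholder names for whatever the script has at that point). A "start" line with no matching "finish" line identifies the feed it got stuck on.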

> >You should be able to test this theory by installing Twisted's resolver:
> >
> >   from twisted.names import client
> >   reactor.installResolver(client.createResolver())
> >
> >client.createResolver makes a reasonable effort to use your system's DNS
> >configuration (by looking at /etc/resolv.conf on posix systems, for
> >example), so it should work without any special arguments.
> >
> ok, it changes into a totally non-working script :)
> I get a lot of these:
> [Failure instance: Traceback: exceptions.TypeError, unsubscriptable object
> /usr/lib/python2.3/site-packages/twisted/internet/defer.py:313:_runCallbacks
> /usr/lib/python2.3/site-packages/twisted/names/resolve.py:44:__call__
> /usr/lib/python2.3/site-packages/twisted/names/common.py:36:query
> /usr/lib/python2.3/site-packages/twisted/names/common.py:104:lookupAllRecords
> /usr/lib/python2.3/site-packages/twisted/names/client.py:266:_lookup
> /usr/lib/python2.3/site-packages/twisted/names/client.py:214:queryUDP
> ]

Ouch.  I wonder how that bug crept in?  The twisted.names code is expecting a
sequence of timeouts (one per retry of the query, before finally giving up), but
twisted.internet is only giving it a single integer.  I've filed a bug
report for this: http://twistedmatrix.com/bugs/issue570, if you care :)
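
To illustrate the mismatch (this is not the actual Twisted code; first_timeout and normalize_timeouts are hypothetical names), indexing into a bare integer is what produces the TypeError, and a caller could guard against it by accepting either form:

```python
def first_timeout(timeouts):
    """Stand-in for code that expects a sequence of retry timeouts."""
    return timeouts[0]


def normalize_timeouts(timeout):
    """Accept either a single number or a sequence of numbers."""
    if isinstance(timeout, (int, float)):
        return (timeout,)
    return tuple(timeout)


try:
    first_timeout(10)  # a single integer, as twisted.internet passes
except TypeError as exc:
    print("TypeError:", exc)

print(first_timeout(normalize_timeouts(10)))
print(normalize_timeouts([1, 3, 11]))
```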

> >>BTW When it finishes (with all 740 feeds) it reports an awesome 330 
> >>seconds which is an impressive time, less than half a second
> >>for each feed, and It downloads more than 50Mb of feeds from the net 
> >>(with 745 feeds to download).
> >>   
> >
> >Nice!
> >
> >
> Yup, I was going to ask the Straw developers to use my script instead of
> asyncore.
> Straw has a lot of problems with 200 feeds, i.e. it resets the connection and
> such. This would be an awesome improvement.

Absolutely.  I've heard similar complaints about straw, and I've been hoping
some keen person would apply Twisted to fix the problem :)


