[Twisted-Python] Lots and lots and lots and lots... of deferreds

Matt Perry matt at unshift.net
Tue Oct 6 23:00:03 EDT 2009


Your limit will usually be the number of file descriptors available to
your process, which can usually be changed via ulimit or your system's
equivalent.  On Linux the soft limit typically defaults to 1024, so you
should be able to handle roughly 1024 simultaneous connections (a few
descriptors are already taken by stdin/stdout and the like).
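
If you want to check or raise that limit from inside your program, the
standard library's resource module can do it.  A minimal sketch (the
4096 target below is just an illustrative value):

    import resource

    # Query the current per-process file descriptor limits.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # Raise the soft limit as far as the hard limit allows; raising
    # the hard limit itself requires root.
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))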

One thing of note: you say you have the concurrency issues handled --
but with asynchronous I/O there are no concurrency issues, since there's
no concurrency (at least, not at the application level).  The reactor
runs all of your callbacks one at a time in a single thread, so nothing
needs locking.  This is confusing at first, but it's important to
understand.

All that said, you probably want to maintain a queue of URLs and some
sort of graph representation of your data (or at least a visited set)
for the purpose of detecting loops (e.g. A links to B, B links to C, C
links to A).  You can then set an upper limit on the number of
concurrent connections (say 1000) and track the number of deferreds in
the system based on when you start connections and when they finish (via
callbacks).  Your initial seed can start one URL, its callback can hit
all the linked pages, and so on.
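
Here's a minimal sketch of that scheme, assuming twisted.web.client's
getPage for the fetching and a DeferredSemaphore to cap concurrency.
The extract_links() helper and the seed URL are placeholders you would
supply yourself:

    from twisted.internet import defer, reactor
    from twisted.python import log
    from twisted.web.client import getPage

    MAX_CONCURRENT = 1000        # upper bound on in-flight fetches
    tokens = defer.DeferredSemaphore(MAX_CONCURRENT)
    visited = set()              # every URL ever queued; breaks A->B->C->A loops
    outstanding = [0]            # mutable counter of fetches not yet finished

    def crawl(url):
        if url in visited:
            return               # already fetched or in flight
        visited.add(url)
        outstanding[0] += 1
        # run() waits for a free token, calls getPage(url), and releases
        # the token when the fetch's deferred fires.
        d = tokens.run(getPage, url)
        d.addCallback(page_done, url)
        d.addErrback(page_failed, url)
        d.addBoth(finish_one)

    def page_done(body, url):
        for link in extract_links(body, url):   # hypothetical link extractor
            crawl(link)

    def page_failed(failure, url):
        log.msg("failed to fetch %s: %s" % (url, failure.getErrorMessage()))

    def finish_one(result):
        outstanding[0] -= 1
        if outstanding[0] == 0:
            reactor.stop()       # nothing left in flight: crawl is done
        return result

    crawl("http://www.example.com/")   # your seed URL goes here
    reactor.run()

Note that visited and outstanding are plain Python objects mutated from
callbacks with no locking at all, which is safe precisely because the
reactor runs those callbacks one at a time.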

You might be hitting a cycle in the page-traversal graph, which would
cause all sorts of problems in terms of recursion depth or running out
of file descriptors.  Without seeing your code or your target site,
though, it's impossible to say.

Have you considered using another library for web spidering?  I believe
Scrapy (http://scrapy.org) is a good spidering tool, and it might be
easier to use a decent existing library than to roll your own.
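
For a sense of how little code that can be, here's a rough sketch of a
whole-site spider against a recent Scrapy release (the start URL is a
placeholder; Scrapy's built-in duplicate filter and its
CONCURRENT_REQUESTS setting cover the visited-set and concurrency-cap
bookkeeping described above):

    import scrapy

    class SiteSpider(scrapy.Spider):
        name = "site"
        start_urls = ["http://www.example.com/"]   # placeholder seed

        def parse(self, response):
            # Follow every link on the page; requests to already-seen
            # URLs are dropped by Scrapy's duplicate filter.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)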


  - Matt



On Tue, Oct 6, 2009 at 10:40 PM, Steve Steiner (listsin) <
listsin at integrateddevcorp.com> wrote:

> So, I have a situation...
>
>        I have an application whose basic function is, in simplified form:
>
>        def main():
>                get_web_page(main_page_from_params)
>
>        def get_web_page(page_name):
>                set up a page-getter deferred; one of its callbacks
>                gets the links out of the page and sends them to
>                get_them()
>
>        def get_them(links):
>                for l in links:
>                        if l is not being gotten or hasn't been got:
>                                deferred = get_web_page(l)
>
>        In other words, I feed in the top level page, then recursively feed
> in any pages linked to by the current page, and they feed in all their
> links, until all pages are gotten.
>
>        I understand the concurrency issues with multiple deferreds trying
> to add pages to the "get list" -- it's properly handled in the code
> (as far as I can tell, so far).
>
>        So, here's the question...
>
>        I have a "pages"  list containing all of the pages.
>
>        They are set to either gotten or in-flight.
>
>        In-flight means I have a deferred that's going to go get it (in
> get_web_page()).
>
>        IOW, right now, if I don't already have the page, and I have a link
> to it, I just start a deferred to go get it.
>
>        Should I limit the number of "in-flight" pages?
>
>        Currently, I'm scanning sites that have upwards of 5000 pages and it
> seems that, when I get too many deferreds in flight, the app
> *appears* to crash.
>
>        I'm not sure whether it's actually going out to lunch or just
> appears that way.  Before I go instrumenting the app to death, can
> anyone tell me whether there is some sort of practical limit to how
> many "in-flight" deferreds might start to cause issues, just due to
> the sheer number?
>
>        Thanks for any insight on this that anyone might offer.
>
>        I expect the usual flurry of  "you must post your exact code or we
> can't help you at all, moron" posts, but...
>
>        In spite of my not having posted specific code, could someone with
> some actual experience in this please give me a clue, within an order
> of magnitude, how many deferreds might start to cause real trouble?
>
> Thanks,
>
> S
>