[Twisted-web] Limit the simultaneous twisted.web.client.downloadPage requests

exarkun at twistedmatrix.com
Sat Oct 24 09:32:48 EDT 2009

On 01:20 pm, descentspb at gmail.com wrote:
>I am a newbie in twisted, sorry if my question sounds awkward.
>I have written a fairly simple recursive page downloader, which parses
>an HTML page, extracts all the needed links from it, and starts
>downloading them. The links point to video files, so they are pretty
>large. The problem is that the downloader works TOO FAST :) I want to
>set something like a global bandwidth limit or a maximum number of
>concurrently downloading files.
>I am using twisted.web.client.downloadPage to download the files,
>using the Deferred that it returns.
>I can't work out how to make it still return a Deferred for that file
>without starting the download right away, and instead start
>downloading on some kind of event (i.e. make a manager-like wrapper
>for that function).
>So I want the code to still look simple like this:
>for link in links:
>    d = downloadPage_limited(link, filename)
>The wrapper (the function downloadPage_limited) will manage the number
>of concurrent downloads and will still return the Deferred that is
>returned by twisted.web.client.downloadPage.
>Is my idea about a "wrapper" practical and what's the general way to
>write it?
>On which event is it best to decrement the count of files that are
>currently downloading?

Yes, that's a good idea.

You might be able to use twisted.internet.defer.DeferredSemaphore to 
handle all of the counting for you.  For example,

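    from twisted.internet.defer import DeferredSemaphore
    from twisted.web.client import downloadPage

    class LimitedDownloader:
        def __init__(self, howMany):
            # howMany is the maximum number of simultaneous downloads.
            self._semaphore = DeferredSemaphore(howMany)

        def downloadPage(self, *a, **kw):
            # Takes the same arguments as twisted.web.client.downloadPage,
            # but the real call is delayed until the semaphore has a
            # free token.
            return self._semaphore.run(downloadPage, *a, **kw)

    # Share one instance so every download counts against the same limit.
    downloader = LimitedDownloader(3)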

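In this example, DeferredSemaphore.run lets at most 3 downloadPage 
calls run concurrently.  If a 4th is attempted before any earlier one 
finishes, it is queued and only actually called once one of the earlier 
ones finishes.  The Deferred returned by downloader.downloadPage still 
fires with the result (or failure) of the underlying downloadPage call, 
and the semaphore releases its slot when that call's Deferred fires, so 
you don't need to decrement a counter yourself.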

