[Twisted-web] Limit the simultaneous twisted.web.client.downloadPage requests

exarkun at twistedmatrix.com
Sat Oct 24 09:32:48 EDT 2009

On 01:20 pm, descentspb at gmail.com wrote:
>I am new to Twisted, so sorry if my question sounds awkward.
>I have written a fairly simple recursive page downloader, which parses
>an HTML page, extracts all the needed links from it, and starts
>downloading them. The links are video files, so they are pretty large.
>The problem is that the downloader works TOO FAST :) I want to set
>something like a global bandwidth limit or a maximum number of
>concurrently downloading files.
>I am using twisted.web.client.downloadPage to download the files,
>using the Deferred that it returns.
>I can't work out how to make it still return a Deferred for that file
>but, instead of starting the download right away, start
>downloading on some kind of event (make a manager-like wrapper for
>that function).
>So I want the code to still look simple like this:
>for link in links:
>    d = downloadPage_limited(link, filename)
>And the wrapper (function downloadPage_limited) will manage the number of
>concurrent downloads, and will still return the Deferred which will be
>returned by twisted.web.client.downloadPage.
>Is my idea of a "wrapper" practical, and what's the general way to
>write it?
>On which event is it best to decrement the count of
>currently downloading files?

Yes, that's a good idea.

You might be able to use twisted.internet.defer.DeferredSemaphore to 
handle all of the counting for you.  For example,

    from twisted.internet.defer import DeferredSemaphore
    from twisted.web.client import downloadPage

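    class LimitedDownloader:
        def __init__(self, howMany):
            # howMany is the maximum number of downloads allowed at once.
            self._semaphore = DeferredSemaphore(howMany)

        def downloadPage(self, *a, **kw):
            # run() waits for a free slot, calls downloadPage, and releases
            # the slot when the Deferred it returns fires.
            return self._semaphore.run(downloadPage, *a, **kw)

    # Allow at most 3 concurrent downloads.
    downloader = LimitedDownloader(3)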

In this example, DeferredSemaphore.run will let only 3 downloadPage 
calls run concurrently.  If a 4th is attempted before any of the earlier 
ones finishes, run still returns a Deferred right away, but downloadPage 
itself is not called until one of the earlier downloads finishes, 
whether it succeeds or fails.
