[Twisted-Python] clientfactory cleanup slow-down (after many http requests)

Manish Tomar manish.tomar at gmail.com
Thu Aug 11 15:19:45 MDT 2016


Wow! This is the friendliest way to welcome a new Twisted programmer. Great
job Glyph! :)

Regards,
Manish

On Sat, Aug 6, 2016 at 3:51 PM, Glyph Lefkowitz <glyph at twistedmatrix.com>
wrote:

>
> On Aug 6, 2016, at 03:48, Randomcoder <randomcoder1 at gmail.com> wrote:
>
> Hello,
>
> I've been working on a small Twisted program.
>
>
> Cool, thanks for using Twisted.
>
> The program makes HTTP requests to a large number of feeds.
> Twisted is used to speed up the entire process.
> After the feeds are fetched, they're parsed. Finally they should be
> written to a database (to simplify the code, that part is left out).
>
>
> Thanks for including examples, so we know exactly what you're talking
> about! :)
>
> Feeds are fetched in parallel using gatherResults, and a batch is
> built. Then all batches are again gathered into a set of batches, and
> a DeferredList is built out of those. A semaphore controls the
> batch-level list of deferreds, and a semaphore also controls the
> DeferredList that holds the entire set of batches.
>
> Currently, the program works fine with 100-150 feeds and a BATCH_SIZE
> between 5 and 20.
>
>
> This all seems pretty reasonable, and it follows best practices and such...
>
> However, I notice that the program starts to hang for a long time when
> the number of feeds goes over 150-200.
>
>
> Two key questions: what do you mean by "hang" and what is "a long time"?
> Do you mean it's totally unresponsive, or do you mean it's just failing to
> make progress on downloading more feeds?
>
>
> To be more precise, at the end of running the program, messages
> like these are printed, but the program seems to not be very active:
>
>    Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f0b7d5f3908>
>
> It seems like this is the cleanup phase.
>
>
> This just means that it is finished making connections.  We have to do
> some clean-up around the usefulness of these log messages, sorry :-\.
>
> I've read what I could find on the topic. I wasn't able to make progress
> on it, so I'm posting to the mailing list to ask if someone has
> encountered this before. Maybe it's a common pitfall or issue that other
> people have also bumped into.
>
>
> Right now, my guess is this: some of the sites you're contacting have very
> slow proxies, or for some other reason let you *connect* to them, but
> then hang when sent requests.  If you're simultaneously requesting stuff
> from a very large number of different sites, this is sort of inevitably
> bound to happen, either based on network problems, or issues with the sites
> themselves.  I suspect you thought that the connectTimeout argument to
> Agent would save you from this, but that timeout is just for making the
> initial underlying TCP connection, not receiving a full response.  What you
> actually want to do is cancel the Deferred returned by Agent.request.
>
> Luckily, https://treq.readthedocs.io/en/latest/ already implements this
> high-level timeout functionality for you, in the form of the 'timeout='
> argument it accepts.  If you give that a try, do you see more connections
> timing out as it runs, rather than "hanging" the process for long periods
> of time?
>
> As long as I'm looking at your code, as a way of thanking you for
> providing such a nice specific runnable example, I have a few other random
> thoughts which may be useful to you:
>
> - I see you're importing psycopg.  Do you know about
> https://txpostgres.readthedocs.io/en/latest/ ?  You can talk to postgres
> asynchronously with Twisted.
> - d.addCallback(lambda out: out).addCallback(lambda resp: client.readBody(resp))
> can be much more briefly spelled "d.addCallback(client.readBody)".
> d.addErrback(lambda err: err) does nothing and can just be removed.
> - BrowserLikePolicyForHTTPS() is the default, so you don't need to pass
> that.
> - clean_up_and_exit will only be called if batchesDef doesn't fail, and if
> it does fail, it will swallow the exception message.  Rather than manually
> calling `reactor.stop`, you probably want to use react(),
> <https://twistedmatrix.com/documents/16.3.0/api/twisted.internet.task.html#react>.
> This way your function is an API that anyone who wants to use it can call -
> it just returns a Deferred when it's done - but your __main__ block calls
> react() which will both start and stop the reactor, as well as reporting
> errors if there's a problem while still shutting down.
>
> Hope some of that code review is helpful - let us know if the treq timeout
> solves the problem or if the issue is somewhere else!
>
> -glyph