[Twisted-Python] clientfactory cleanup slow-down (after many http requests)

Glyph Lefkowitz glyph at twistedmatrix.com
Sat Aug 6 16:51:39 MDT 2016


> On Aug 6, 2016, at 03:48, Randomcoder <randomcoder1 at gmail.com> wrote:
> 
> Hello,
> 
> I've been working on a small Twisted program.

Cool, thanks for using Twisted.

> The program makes HTTP requests to a large number of feeds.
> Twisted is used to speed up the entire process.
> After the feeds are fetched, they're parsed. Finally they should be
> written to a database (to simplify the code, that part is left out).

Thanks for including examples, so we know exactly what you're talking about! :)

> Feeds are fetched in parallel using gatherResults, and a batch is
> built. Then all batches are gathered into a set of batches, and
> a DeferredList is built out of those. One semaphore controls the
> batch-level list of deferreds, and another semaphore controls the
> entire list of batches.
> 
> Currently, the program works ok on 100-150 feeds, and BATCH_SIZE between
> 5 and 20.

This all seems pretty reasonable, and it follows best practices and such...

> However, I notice the program starts to hang for a long time, when the
> number of feeds goes over 150-200.

Two key questions: what do you mean by "hang" and what is "a long time"?  Do you mean it's totally unresponsive, or do you mean it's just failing to make progress on downloading more feeds?

> 
> To be more precise, at the end of running the program, messages
> like these are printed, but the program seems to not be very active:
> 
>    Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f0b7d5f3908>
> 
> It seems like this is the cleanup phase.

This just means that it is finished making connections.  We have to do some clean-up around the usefulness of these log messages, sorry :-\.

> I've read what I could find on the topic. I wasn't able to make progress
> on it, so I'm posting to the mailing list to ask if someone has encountered this
> before. Maybe it's a common pitfall or issue that other people have also
> bumped into.

Right now, my guess is this: some of the sites you're contacting have very slow proxies, or for some other reason let you connect to them, but then hang when sent requests.  If you're simultaneously requesting stuff from a very large number of different sites, this is more or less bound to happen, whether because of network problems or issues with the sites themselves.  I suspect you thought that the connectTimeout argument to Agent would save you from this, but that timeout only covers making the initial underlying TCP connection, not receiving a full response.  What you actually want to do is cancel the Deferred returned by Agent.request.

Luckily, treq <https://treq.readthedocs.io/en/latest/> already implements this high-level timeout functionality for you, in the form of the 'timeout=' argument it accepts.  If you give that a try, do you see more connections timing out as it runs, rather than "hanging" the process for long periods of time?

As long as I'm looking at your code, as a way of thanking you for providing such a nice specific runnable example, I have a few other random thoughts which may be useful to you:

- I see you're importing psycopg.  Do you know about txpostgres <https://txpostgres.readthedocs.io/en/latest/>?  You can talk to postgres asynchronously with Twisted.
- d.addCallback(lambda out: out).addCallback(lambda resp: client.readBody(resp)) can be much more briefly spelled "d.addCallback(client.readBody)". d.addErrback(lambda err: err) does nothing and can just be removed.
- BrowserLikePolicyForHTTPS() is the default, so you don't need to pass that.
- clean_up_and_exit will only be called if batchesDef doesn't fail, and if it does fail, it will swallow the exception message.  Rather than manually calling `reactor.stop`, you probably want to use react() <https://twistedmatrix.com/documents/16.3.0/api/twisted.internet.task.html#react>.  This way your function is an API that anyone who wants to use it can call - it just returns a Deferred when it's done - but your __main__ block calls react() which will both start and stop the reactor, as well as reporting errors if there's a problem while still shutting down.

Hope some of that code review is helpful - let us know if the treq timeout solves the problem or if the issue is somewhere else!

-glyph