[Twisted-Python] Many connections and TIME_WAIT
exarkun at twistedmatrix.com
Wed Jan 27 08:05:16 EST 2010
On 04:50 am, donal.mcmullan at gmail.com wrote:
>I've been prototyping a client that connects to thousands of servers and
>calls some method. It's not real important to me at this stage whether
>that's via xmlrpc, perspective broker, or something else.
>What seems to happen on the client machine is that each network
>connection that gets opened and then closed goes into a TIME_WAIT state,
>and there are so many connections in that state that it's impossible to
>open new ones.
Yep. That's what happens to a TCP connection when you close it.
>I'm keeping an eye on the output of
>netstat -an | wc -l
>Initially I've got 569 entries there. When I run my test client, that
>number goes up really quickly and peaks at about 2824. At that point,
>the client starts failing.
Presumably these numbers have something to do with how quickly you're
opening and closing new connections. TIME_WAIT lasts for 2MSL (twice the
maximum segment lifetime; four minutes with the RFC 793 default MSL of
two minutes) to ensure that a future connection doesn't receive data
intended for a previous connection (clearly a bad thing).
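If you want to watch just the TIME_WAIT population rather than every
entry netstat reports, something like this sketch works on Linux; it
parses /proc/net/tcp, where the hex state code 06 means TIME_WAIT (this
is Linux-specific; on other platforms stick with netstat):

```python
def count_time_wait(path="/proc/net/tcp"):
    """Count IPv4 sockets currently in TIME_WAIT (Linux only)."""
    count = 0
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            # Field 3 ("st") is the socket state; 06 is TIME_WAIT.
            if len(fields) > 3 and fields[3] == "06":
                count += 1
    return count

if __name__ == "__main__":
    print(count_time_wait())
```

Add /proc/net/tcp6 as well if your client connects over IPv6.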
However... 2824 is a pretty low number at which to run out of sockets.
Perhaps you're running this software on Windows? I think Windows has a
ridiculously small number of "client sockets" allocated by default. I
seem to recall this being something you can change with a registry edit
or something like that.
Another option would be to switch to a POSIX platform instead.
If you're *not* on Windows, then this is odd and perhaps bears further
investigation.
>callRemoteFailure [Failure instance: Traceback (failure with no
>frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to
>the other side was lost in a non-clean fashion: Connection lost.]
This isn't exactly how I'd expect it to fail, but I also don't know what
"callRemoteFailure" is or where it comes from, so maybe that's not too
surprising.
>Increasing the file descriptor limits doesn't seem to have any effect
Quite so. The process has, after all, already closed these sockets.
They no longer count towards the process's file descriptor limit (oh
dear, I suppose you're not using Windows if you have a file descriptor
limit to raise).
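For completeness, inspecting that limit from Python looks like the
sketch below (stdlib resource module); as noted above, raising it does
nothing for TIME_WAIT sockets, which are already closed:

```python
import resource

# Read the per-process open-file-descriptor limits.  Sockets in
# TIME_WAIT have already been closed, so they don't count against
# these limits -- raising them won't help with TIME_WAIT exhaustion.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft=%r hard=%r" % (soft, hard))

# An unprivileged process may raise its soft limit up to the hard limit.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```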
>Is there an established Twisted sanctioned canonical way to free up
>this resource? Or am I doing something wrong? I'm looking into tweaking
>SO_REUSEADDR and SO_LINGER - does that sound sane?
>Just tapping the lazywebs to see if anyone's already seen this in the
>wild.
On most reasonably configured Linux machines, you shouldn't run into
this problem until you're doing at least an order of magnitude more
work. Many times, I have run clients that do many thousands of new
connections per second, resulting in tens of thousands of TIME_WAIT
sockets on the system with no problem. So, I'm not sure why you're
running into this after only a few thousand.
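As for those socket options: SO_REUSEADDR mostly helps a *server* rebind
a listening port, so it won't do much here. SO_LINGER with a zero
timeout does make close() send RST instead of FIN, skipping TIME_WAIT
entirely, but it discards unsent data and forfeits the protection
TIME_WAIT exists to provide, so treat it as a last resort. A minimal
sketch with plain stdlib sockets (with Twisted you could reach the
underlying socket via the transport's getHandle(), but that's an
implementation detail):

```python
import socket
import struct

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# struct linger { int l_onoff; int l_linger; } -- enabled, zero timeout.
# With this set, close() aborts the connection with RST and the socket
# never enters TIME_WAIT.
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))

# Read the option back to confirm what the kernel stored.
onoff, linger = struct.unpack(
    "ii", s.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, 8))

s.close()
```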