[Twisted-Python] connectionLost never reached after calling loseConnection: stuck in CLOSE_WAIT forever

exarkun at twistedmatrix.com
Fri Oct 29 09:22:56 MDT 2010


On 28 Oct, 04:52 pm, ste at demaledetti.net wrote:
>On 18/10/2010 17:21, Stefano Debenedetti wrote:
>>Anyway, I ran the Twisted tests on my installation after the patch I
>>mentioned in my previous mail and I got the same results as before
>>applying it, so at least it seems it doesn't break anything obvious.
>
>
>Sorry for replying to myself, but for the record: the patch I sent
>does break stuff (connections are sometimes closed before all data
>has been sent), so don't use it.
>
>The partially good news is that I managed to write a self-contained
>and quite short example that can reproduce the exact same problem
>I'm witnessing in my app. The bad news is that it does so only about
>50% of the time, but I thought I would share it while I keep trying
>to make it more reliable.

Thanks.  Frequent sharing can definitely be more productive than keeping
everything secret until it's "done". :)

>Please find attached one .sh file and one .py file; save them
>somewhere and make them executable. You will also need netcat (nc).
>
>If you run the .sh file and after three seconds type in a short line
>of text followed by the enter key, you should see the same line you
>typed printed back many times on your terminal and quite a lot of
>network activity going through the localhost interface for about a
>minute. Don't redirect the .sh output to /dev/null; the problem
>seems to occur only when the terminal application you run it in gets
>to 100% CPU while it's printing data received by netcat. Hopefully
>you have a multicore machine and this won't disrupt your desktop.
>
>If you're lucky and nothing bad happens, after a while the .sh
>script will terminate and all connections opened by it and by the
>.py file will be closed. Please remember to kill the three python
>processes launched by the script before trying again.
>
>If you're unlucky like I am, after a while all connections will be
>closed except the one between netcat and one of the three servers
>powered by the .py file.
>
>That connection will be in this state according to netstat:
>
># netstat -np --inet 2> /dev/null | grep 127.0.0.1
>tcp        0      0 127.0.0.1:8080          127.0.0.1:36815         ESTABLISHED 10042/python2.6
>tcp        0      0 127.0.0.1:36815         127.0.0.1:8080          ESTABLISHED 10051/nc
>
>If you then CTRL-C the .sh script so that netcat gets terminated,
>you will get to the dreaded CLOSE_WAIT forever state:
>
># netstat -np --inet 2> /dev/null | grep 127.0.0.1
>tcp        1      0 127.0.0.1:8080          127.0.0.1:36815         CLOSE_WAIT  10042/python2.6
>
>
>Please note that even though the .py file is called three times and
>launches a different server application each time, the only one I'm
>interested in is the first one ("one"); the other two are just there
>to simulate the third-party apps that my server is dealing with.
>This is why servers "two" and "three" do seemingly silly stuff,
>including closing some of their connections at some point.
>
>My goal is that no matter how and when the client and the "two" and
>"three" servers close their connections to "one", the client
>connection to "one" is always properly terminated and never gets
>stuck in the CLOSE_WAIT state.
>
>Thanks for any feedback you might have,

After a few runs, I managed to reproduce the problem.  I instrumented 
the reactor with some extra logging and test_producer.py with a manhole 
server.
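
For the curious, a rough sketch of that kind of manhole setup, using
twisted.manhole.telnet (the port, credentials and install_manhole name
here are made up for the example):

from twisted.internet import reactor
from twisted.manhole import telnet

def install_manhole(namespace, port=4040):
    # Hypothetical helper: serve a Python prompt inside the running
    # process so its objects can be inspected while the bug happens.
    factory = telnet.ShellFactory()
    factory.username = 'admin'
    factory.password = 'changeme'
    # Expose whatever we want to poke at from the manhole prompt.
    factory.namespace.update(namespace)
    reactor.listenTCP(port, factory, interface='127.0.0.1')

Connecting with "telnet localhost 4040" then gives an interactive
interpreter inside the running server.
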
The sequence of events appears to be something like this:

  OneA has a producer of OneE
  OneA has a consumer of OneB
  At some point OneB gives up and tells OneA to stopProducing (loseConnection)
  OneA.loseConnection stops the reactor from reading OneA and starts it writing
  OneA.doWrite happens
    it finds the send buffer empty
    it finds a registered producer (OneE) and resumes it
  OneE never produces any more bytes
  OneE loses its connection at some point and unregisters itself from OneA
  OneA takes note that it has no more producer, but does nothing about it

So the bug is likely that FileDescriptor.unregisterProducer doesn't do 
anything special when disconnecting=True.
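
For illustration only, a guess at the shape a fix could take (not an
actual patch): unregisterProducer could restart the write loop so that
a close requested earlier by loseConnection can still complete once
the producer goes away.

    # Hypothetical sketch of FileDescriptor.unregisterProducer; the
    # guard and the startWriting() call are assumptions, not the real fix.
    def unregisterProducer(self):
        self.producer = None
        if self.connected:
            # With no producer and an empty send buffer, the next
            # doWrite() will see self.disconnecting and finish the close.
            self.startWriting()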

You should be able to reproduce this very simply by setting up a 
transport-producer/consumer pair, calling loseConnection on the 
transport, then unregistering the producer.
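
Something like the following (a rough, untested sketch; the
IdleProducer/StuckProtocol names and the port are made up) ought to
show it:

from zope.interface import implements
from twisted.internet import protocol, reactor
from twisted.internet.interfaces import IPullProducer


class IdleProducer(object):
    # A pull producer that never actually produces anything, standing
    # in for "OneE" above.
    implements(IPullProducer)

    def resumeProducing(self):
        pass

    def stopProducing(self):
        pass


class StuckProtocol(protocol.Protocol):
    def connectionMade(self):
        self.transport.registerProducer(IdleProducer(), False)
        # Ask for the connection to be closed while the producer is
        # still registered...
        self.transport.loseConnection()
        # ...then unregister it a moment later.  If unregisterProducer
        # doesn't restart the write loop, the close never completes.
        reactor.callLater(1, self.transport.unregisterProducer)

    def connectionLost(self, reason):
        # Never reached when the bug is present.
        print 'connectionLost:', reason


factory = protocol.ServerFactory()
factory.protocol = StuckProtocol
reactor.listenTCP(8080, factory)
reactor.run()

Run it, connect with "nc localhost 8080", and hit CTRL-C on nc after a
couple of seconds: on an affected Twisted the connectionLost line is
never printed and netstat shows the server's socket stuck in
CLOSE_WAIT.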

This all sounds somewhat familiar, but I don't see an existing ticket 
for it, so maybe that's my imagination.

Jean-Paul



