[Twisted-Python] connectionLost never reached after calling loseConnection: stuck in CLOSE_WAIT forever

Stefano Debenedetti ste at demaledetti.net
Sun Oct 17 11:00:27 MDT 2010


Hello Glyph, thanks for your reply and suggestions.

I don't have a self-contained sample yet, but I have at least managed
to reproduce the issue reliably on my installation, and after a few
more experiments I think I am narrowing it down; please read below.

> On Oct 16, 2010, at 11:22 AM, Stefano Debenedetti wrote:
> 
>> Does this sound familiar in any way? Any suggestions off top of head
>> while I try to come up with a self-contained sample which is
>> reliably reproducing the issue I'm seeing happening only "sometimes
>> and quite seldom[TM]"?
> 
> I can't recall having seen this exact issue in the past, but as you've described it it sounds like you may have discovered a Twisted bug.  I'm looking forward to your example.
> 
> I do have a few questions:
> 
>     * What version of Twisted are you using?
>     * Have you tried a more recent version? Trunk?


I'm using 10.1.0. I haven't tested on trunk because I see basically
no difference in internet/abstract.py and internet/tcp.py, but if you
really think I should, I will give trunk a try.


>     * What reactor are you using?
>     * Have you tried a different reactor?


Same behavior with select, poll and epoll.
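
(For reference, in case it matters for reproducing: the non-default
reactors have to be installed before twisted.internet.reactor is first
imported, e.g. for epoll:

from twisted.internet import epollreactor  # or pollreactor
epollreactor.install()
from twisted.internet import reactor

selectreactor is what you get by default.)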


>     * What platform/OS are you on?  What version?
>     * Have you tried a different platform?


I am using Debian lenny with a self-compiled 2.6.35.2 kernel.
/etc/debian_version says: 5.0.5


> I am also curious whether changing
> 
>    proto.transport.loseConnection()
> 
> to
>    reactor.callLater(0, proto.transport.loseConnection)
> 
> makes any difference to your example.


I tried this and it didn't make any difference. Using a 1-second
delay didn't improve things either.

What did make a difference was commenting out this line; the problem
never happens without it:

to.transport.registerProducer(_from.transport, True)
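
For context, that line wires one connection's transport up as a
streaming (push) producer for another, similar in spirit to
twisted.protocols.portforward. A stripped-down, hypothetical sketch of
the pattern (names invented for illustration, not taken from my actual
app):

from twisted.internet import protocol

class RelayProtocol(protocol.Protocol):
    """One side of a two-way relay; data read here is written to the peer."""
    peer = None

    def setPeer(self, peer):
        self.peer = peer
        # The call under discussion: register the peer's transport as a
        # streaming producer on this transport, so that when this side's
        # write buffer fills up, reading from the peer is paused.
        self.transport.registerProducer(peer.transport, True)

    def dataReceived(self, data):
        if self.peer is not None:
            self.peer.transport.write(data)

    def connectionLost(self, reason):
        if self.peer is not None:
            self.peer.transport.unregisterProducer()
            self.peer.transport.loseConnection()
            self.peer = None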

Next test I did was to try registering the producer as non-streaming:

to.transport.registerProducer(_from.transport, False)

This also fixes the problem, but it causes an exception to be printed
in the log once per set of A, B and C connections:

Traceback (most recent call last):
  File "/home/lala/lib/python/twisted/python/log.py", line 84, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/home/lala/lib/python/twisted/python/log.py", line 69, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/home/lala/lib/python/twisted/python/context.py", line 59, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/home/lala/lib/python/twisted/python/context.py", line 37, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/home/lala/lib/python/twisted/internet/pollreactor.py", line 184, in _doReadOrWrite
    why = selectable.doWrite()
  File "/home/lala/lib/python/twisted/internet/tcp.py", line 428, in doWrite
    result = abstract.FileDescriptor.doWrite(self)
  File "/home/lala/lib/python/twisted/internet/abstract.py", line 145, in doWrite
    self.producer.resumeProducing()
  File "/home/lala/lib/python/twisted/internet/abstract.py", line 339, in resumeProducing
    assert self.connected and not self.disconnecting
exceptions.AssertionError:
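
The resumeProducing that trips that assertion is the transport's own
producer implementation in abstract.py, which (quoting roughly from
the 10.1 source, from memory) is just:

def resumeProducing(self):
    assert self.connected and not self.disconnecting
    self.startReading()

i.e. with a non-streaming (pull) producer, doWrite asks the registered
transport for more data every time the write buffer drains, and here
that transport is apparently already disconnecting.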


This led me to change the following lines in the doWrite code in
internet/abstract.py:

if self.disconnecting:
    # But if I was previously asked to let the connection die, do
    # so.
    return self._postLoseConnection()
elif self.producer is not None and ((not self.streamingProducer)
                                    or self.producerPaused):
    # tell them to supply some more.
    self.producerPaused = 0
    self.producer.resumeProducing()
#elif self.disconnecting:
#    # But if I was previously asked to let the connection die, do
#    # so.
#    return self._postLoseConnection()

Basically this just inverts the order of the checks: first see whether
the connection is disconnecting, then see whether a producer should be
resumed.

This makes the above traceback disappear and still fixes my
CLOSE_WAIT problem.

But using a non-streaming producer makes my app consume a lot more
memory, so I reverted my code to register the producer as streaming:

to.transport.registerProducer(_from.transport, True)

Now the CLOSE_WAIT issue is gone, no traceback appears in the log
and my app consumes the same memory as before. Victory?

I will still try to come up with a self-contained sample which
reproduces the CLOSE_WAIT problem, but in the meantime I would like
to ask whether the above-mentioned change to the doWrite definition in
internet/abstract.py is likely to destroy the universe in the near
future or whether it actually sounds like a good idea.


> Thanks, and good luck,
> 
> -glyph


Thanks a lot for your help!

ciao
ste
