Opened 7 years ago

Closed 7 years ago

#4719 defect closed fixed (fixed)

connectionLost never reached after calling loseConnection: stuck in CLOSE_WAIT forever

Reported by: Stefano Debenedetti Owned by:
Priority: high Milestone:
Component: core Keywords:
Cc: Stefano Debenedetti, jesstess Branch: branches/close_wait-forever-4719-2
branch-diff, diff-cov, branch-cov, buildbot
Author: exarkun, demaledetti2

Description

A is a twisted.internet.tcp.Connection transport connecting a socket to a protocol. A has a producer C. A is a consumer of B. More than 64KB are written to A. At some point B gives up and tells A to stopProducing (loseConnection). A.loseConnection stops the reactor from reading OneA and starts it writing. A.doWrite happens:

  • it finds the send buffer empty
  • it finds a registered producer (C) and resumes it

C never produces any more bytes. After some time (reactor.callLater 0 is not enough), C unregisters itself from A. A takes note that it has no more producer, but does nothing about it. A's protocol's connectionLost is never called even if its peer shuts down the connection. A's socket is stuck in CLOSE_WAIT state forever with 1 byte in the Recv-Q and 0 bytes in Send-Q.

Exarkun said that the bug is likely that FileDescriptor.unregisterProducer doesn't do anything special when disconnecting=True.

Please see the attached file that demonstrates the problem.

Please also see: http://twistedmatrix.com/pipermail/twisted-python/2010-October/023088.html

Attachments (4)

test_producer2.py (1.1 KB) - added by Stefano Debenedetti 7 years ago.
Self-contained example reproducing the problem.
avoid_hang_in_CLOSE_WAIT.patch (2.5 KB) - added by Stefano Debenedetti 7 years ago.
Proposed fix with unit test
bad-file-descriptor-errors.txt (3.8 KB) - added by Stefano Debenedetti 7 years ago.
Log of new test case errors if the above patch is applied to FileDescriptor.unregisterProducer
avoid_hang_in_CLOSE_WAIT_take2.patch (4.6 KB) - added by Stefano Debenedetti 7 years ago.
New proposed fix now passing both new unit tests

Download all attachments as: .zip

Change History (29)

Changed 7 years ago by Stefano Debenedetti

Attachment: test_producer2.py added

Self-contained example reproducing the problem.

comment:1 Changed 7 years ago by Stefano Debenedetti

Cc: Stefano Debenedetti added

Changed 7 years ago by Stefano Debenedetti

Proposed fix with unit test

comment:2 Changed 7 years ago by Stefano Debenedetti

Keywords: review added

comment:3 Changed 7 years ago by Stefano Debenedetti

The proposed patch introduces other errors related to trying to use an invalid file descriptor. I added a new test to demonstrate this, please find attached a trial log of errors caused by this new test if the above patch to FileDescriptor.unregisterProducer is applied.

Please also find attached a new proposed patch, now passing both new unit tests.

Changed 7 years ago by Stefano Debenedetti

Log of new test case errors if the above patch is applied to FileDescriptor.unregisterProducer

Changed 7 years ago by Stefano Debenedetti

New proposed fix now passing both new unit tests

comment:4 Changed 7 years ago by jesstess

Author: jesstess
Branch: branches/close_wait-forever-4719

(In [30186]) Branching to 'close_wait-forever-4719'

comment:5 Changed 7 years ago by jesstess

(In [30187]) Apply patch avoid_hang_in_CLOSE_WAIT_take2.patch by demaledetti2.

refs #4719

comment:6 Changed 7 years ago by jesstess

Cc: jesstess added

I didn't realize there was so much state on the mailing list for this ticket! I don't think I can review this thoroughly in the immediate future, so others should feel free to jump in, but at least it's in a branch for easier testing: build results here.

comment:7 Changed 7 years ago by Jean-Paul Calderone

Author: jesstessexarkun, jesstess
Branch: branches/close_wait-forever-4719branches/close_wait-forever-4719-2

(In [30234]) Branching to 'close_wait-forever-4719-2'

comment:8 Changed 7 years ago by Jean-Paul Calderone

(In [30235]) Merge forward the fix, but try a different testing approach

refs #4719

comment:9 Changed 7 years ago by Jean-Paul Calderone

(In [30237]) Branching to 'close_wait-forever-4719-2'

comment:10 Changed 7 years ago by Jean-Paul Calderone

(In [30238]) Merge forward the fix, but try a different testing approach

refs #4719

comment:11 Changed 7 years ago by Jean-Paul Calderone

Author: exarkun, jesstessexarkun, demaledetti2
Keywords: review removed
Owner: changed from Glyph to Stefano Debenedetti

I tried simplifying the tests further. I don't think I oversimplified. I don't think writing is necessary to trigger any of the cases being discussed here. I think only having a producer registered at loseConnection time is necessary. The test is a little gross still, because unregistering the producer in the same iteration as closing the connection means something different than unregistering it in a later iteration. But I like this way better than keeping dictionaries of multiple protocols around and trying to operate on them both.

The solution seems just right to me though, so I left it alone.

What do you think of the new test?

comment:12 in reply to:  11 Changed 7 years ago by Stefano Debenedetti

Owner: changed from Stefano Debenedetti to Jean-Paul Calderone

Replying to exarkun:

What do you think of the new test?

It seems good to me. I write here some thoughts spawned by some further experiments but please don't consider these as blocking the merge. I write them here just as a reminder to myself so hopefully one day I will understand better what's going on in the tested code:

1) The new test doesn't cover the error that was introduced by my first patch attempt (i.e. where there was no check if transport.connected was True). This was tested in my test_disconnectEvent2 but it's probably not important at all to test for an error introduced by a broken patch attempt that was never committed anyway.

2) The new test uses a non-streaming producer rather than a streaming one. It's probably not important either but it's still worth noting that the error is not triggered any more by the test if a streaming producer is used there instead. Maybe, in order to replicate the bug, the need for writing data only arises when a streaming producer is used but I won't go as far as stating that a separate test case is needed for this if you don't think it's necessary :)

Thank you.

comment:13 Changed 7 years ago by Glyph

exarkun: I'm just looking at everything in the branch up to the point where you started messing around with timeouts and log messages, in the interests of expediting the review (since I assume that stuff is going to be backed out). Since _ConsumerMixin is a new thing, it needs a docstring, and @ivars for its attributes. Also as long as it's being changed, it would be nice to adjust the whitespace a little; newlines after the initial """, 2 blank lines between the methods.

unregisterProducerAfterDisconnect is a bit hard to follow. I feel like it might be a little easier if you moved some of those class definitions to module scope, or at least grouped them towards the beginning of the test. I realize a lot of the test logic is in the methods there, but the names can say a lot: reactor.listenTCP(0, serverFactoryFor(Disconnect)), for example, tells me enough that I don't need to see the whole class definition, factory creation, etc.

This unfortunately isn't a full review, but I'm pretty comfortable with the code assuming we can get a test skip in there for gtk/win32 (and report a bug there, since the misbehavior we've observed seems pretty bad); I can do a quick spot check when the code is cleaned up.

comment:14 Changed 7 years ago by Jean-Paul Calderone

(In [30328]) Document _ConsumerMixin; remove unused import and debug log message

refs #4719

comment:15 Changed 7 years ago by Jean-Paul Calderone

(In [30329]) remove debug log messages

refs #4719

comment:16 Changed 7 years ago by Jean-Paul Calderone

(In [30330]) Hoist some code out of test_unregisterProducerAfterDisconnect and generally try to make the test more readable

refs #4719

comment:17 Changed 7 years ago by Jean-Paul Calderone

(In [30331]) Some cleanups to test_disconnectWhileProducing

refs #4719

comment:18 Changed 7 years ago by Jean-Paul Calderone

(In [30333]) Skip on Windows on gtk2/glib2

refs #4719

comment:19 Changed 7 years ago by Jean-Paul Calderone

Keywords: review added
Owner: Jean-Paul Calderone deleted
Priority: normalhigh

Thanks. I think I hit all the points you mentioned. Here are the build results, Build results.

comment:20 Changed 7 years ago by Glyph

Keywords: review removed
Owner: set to Jean-Paul Calderone
  1. Ending a line with :: means "put the indented block in a <pre>" in epitext. You really just wanted one colon in _ConsumerMixin's docstring. Otherwise, I love it.
  2. File a ticket on that awful PyGTK/Windows bug, just to remind somebody who knows what's going on (you or me, I guess) to file a ticket on their end when we are feeling more energetic about it. Don't worry about making a coherent description, just point at the code that's broken.
  3. pyflakes on abstract.py: "15: Error 'log' imported but unused"
  4. The whitespace change in gtk2reactor looks pretty pointless; maybe just don't bother to commit it?
  5. The tests are still pretty chatty in the log. If this is intentional for helping future maintainers, then by all means keep it.
  6. serverFactoryFor is pretty handy. Maybe that should be a public API somewhere, someday. (Just thinking out loud; ignore this comment.)
  7. Love the change to iocpreactor.

Address to your satisfaction and land it.

comment:21 Changed 7 years ago by Jean-Paul Calderone

(In [30346]) Remove unused import. Fix epytext markup to be prettier.

refs #4719

comment:22 Changed 7 years ago by Jean-Paul Calderone

(In [30347]) Back out whitespace change

refs #4719

comment:23 Changed 7 years ago by Jean-Paul Calderone

Yay, thanks. 1-4 addressed. I like the chatty tests, so leaving the log messages. Merging.

comment:24 Changed 7 years ago by Jean-Paul Calderone

Resolution: fixed
Status: newclosed

(In [30349]) Merge close_wait-forever-4719-2

Author: exarkun, demaledetti2 Reviewer: exarkun, glyph Fixes: #4719

Fix the implementation of unregisterProducer for most transports so that if loseConnection was called while the producer was registered the, then the connection is then closed. Previously these connections would hang around indefinitely.

comment:25 Changed 6 years ago by <automation>

Owner: Jean-Paul Calderone deleted
Note: See TracTickets for help on using tickets.