Opened 4 years ago

Closed 4 years ago

#4719 defect closed fixed (fixed)

connectionLost never reached after calling loseConnection: stuck in CLOSE_WAIT forever

Reported by: demaledetti2 Owned by:
Priority: high Milestone:
Component: core Keywords:
Cc: ste@…, jessica.mckellar@… Branch: branches/close_wait-forever-4719-2
(diff, github, buildbot, log)
Author: exarkun, demaledetti2 Launchpad Bug:

Description

A is a twisted.internet.tcp.Connection transport connecting a socket to a protocol.
A has a producer C.
A is a consumer of B.
More than 64KB are written to A.
At some point B gives up and tells A to stopProducing (loseConnection).
A.loseConnection stops the reactor from reading OneA and starts it writing.
A.doWrite happens:

  • it finds the send buffer empty
  • it finds a registered producer (C) and resumes it

C never produces any more bytes.
After some time (reactor.callLater 0 is not enough), C unregisters itself from A.
A takes note that it has no more producer, but does nothing about it.
A's protocol's connectionLost is never called even if its peer shuts down the connection.
A's socket is stuck in CLOSE_WAIT state forever with 1 byte in the Recv-Q and 0 bytes in Send-Q.

Exarkun said that the bug is likely that FileDescriptor.unregisterProducer doesn't do anything special when disconnecting=True.

Please see the attached file that demonstrates the problem.

Please also see: http://twistedmatrix.com/pipermail/twisted-python/2010-October/023088.html

Attachments (4)

test_producer2.py (1.1 KB) - added by demaledetti2 4 years ago.
Self-contained example reproducing the problem.
avoid_hang_in_CLOSE_WAIT.patch (2.5 KB) - added by demaledetti2 4 years ago.
Proposed fix with unit test
bad-file-descriptor-errors.txt (3.8 KB) - added by demaledetti2 4 years ago.
Log of new test case errors if the above patch is applied to FileDescriptor.unregisterProducer
avoid_hang_in_CLOSE_WAIT_take2.patch (4.6 KB) - added by demaledetti2 4 years ago.
New proposed fix now passing both new unit tests

Download all attachments as: .zip

Change History (29)

Changed 4 years ago by demaledetti2

Self-contained example reproducing the problem.

comment:1 Changed 4 years ago by demaledetti2

  • Cc ste@… added

Changed 4 years ago by demaledetti2

Proposed fix with unit test

comment:2 Changed 4 years ago by demaledetti2

  • Keywords review added

comment:3 Changed 4 years ago by demaledetti2

The proposed patch introduces other errors related to trying to use an invalid file descriptor. I added a new test to demonstrate this, please find attached a trial log of errors caused by this new test if the above patch to FileDescriptor.unregisterProducer is applied.

Please also find attached a new proposed patch, now passing both new unit tests.

Changed 4 years ago by demaledetti2

Log of new test case errors if the above patch is applied to FileDescriptor.unregisterProducer

Changed 4 years ago by demaledetti2

New proposed fix now passing both new unit tests

comment:4 Changed 4 years ago by jesstess

  • Author set to jesstess
  • Branch set to branches/close_wait-forever-4719

(In [30186]) Branching to 'close_wait-forever-4719'

comment:5 Changed 4 years ago by jesstess

(In [30187]) Apply patch avoid_hang_in_CLOSE_WAIT_take2.patch by demaledetti2.

refs #4719

comment:6 Changed 4 years ago by jesstess

  • Cc jessica.mckellar@… added

I didn't realize there was so much state on the mailing list for this ticket! I don't think I can review this thoroughly in the immediate future, so others should feel free to jump in, but at least it's in a branch for easier testing: build results here.

comment:7 Changed 4 years ago by exarkun

  • Author changed from jesstess to exarkun, jesstess
  • Branch changed from branches/close_wait-forever-4719 to branches/close_wait-forever-4719-2

(In [30234]) Branching to 'close_wait-forever-4719-2'

comment:8 Changed 4 years ago by exarkun

(In [30235]) Merge forward the fix, but try a different testing approach

refs #4719

comment:9 Changed 4 years ago by exarkun

(In [30237]) Branching to 'close_wait-forever-4719-2'

comment:10 Changed 4 years ago by exarkun

(In [30238]) Merge forward the fix, but try a different testing approach

refs #4719

comment:11 follow-up: Changed 4 years ago by exarkun

  • Author changed from exarkun, jesstess to exarkun, demaledetti2
  • Keywords review removed
  • Owner changed from glyph to demaledetti2

I tried simplifying the tests further. I don't think I oversimplified. I don't think writing is necessary to trigger any of the cases being discussed here. I think only having a producer registered at loseConnection time is necessary. The test is a little gross still, because unregistering the producer in the same iteration as closing the connection means something different than unregistering it in a later iteration. But I like this way better than keeping dictionaries of multiple protocols around and trying to operate on them both.

The solution seems just right to me though, so I left it alone.

What do you think of the new test?

comment:12 in reply to: ↑ 11 Changed 4 years ago by demaledetti2

  • Owner changed from demaledetti2 to exarkun

Replying to exarkun:

What do you think of the new test?

It seems good to me. I write here some thoughts spawned by some further experiments but please don't consider these as blocking the merge. I write them here just as a reminder to myself so hopefully one day I will understand better what's going on in the tested code:

1) The new test doesn't cover the error that was introduced by my first patch attempt (i.e. where there was no check if transport.connected was True). This was tested in my test_disconnectEvent2 but it's probably not important at all to test for an error introduced by a broken patch attempt that was never committed anyway.

2) The new test uses a non-streaming producer rather than a streaming one. It's probably not important either but it's still worth noting that the error is not triggered any more by the test if a streaming producer is used there instead. Maybe, in order to replicate the bug, the need for writing data only arises when a streaming producer is used but I won't go as far as stating that a separate test case is needed for this if you don't think it's necessary :)

Thank you.

comment:13 Changed 4 years ago by glyph

exarkun: I'm just looking at everything in the branch up to the point where you started messing around with timeouts and log messages, in the interests of expediting the review (since I assume that stuff is going to be backed out). Since _ConsumerMixin is a new thing, it needs a docstring, and @ivars for its attributes. Also as long as it's being changed, it would be nice to adjust the whitespace a little; newlines after the initial """, 2 blank lines between the methods.

unregisterProducerAfterDisconnect is a bit hard to follow. I feel like it might be a little easier if you moved some of those class definitions to module scope, or at least grouped them towards the beginning of the test. I realize a lot of the test logic is in the methods there, but the names can say a lot: reactor.listenTCP(0, serverFactoryFor(Disconnect)), for example, tells me enough that I don't need to see the whole class definition, factory creation, etc.

This unfortunately isn't a full review, but I'm pretty comfortable with the code assuming we can get a test skip in there for gtk/win32 (and report a bug there, since the misbehavior we've observed seems pretty bad); I can do a quick spot check when the code is cleaned up.

comment:14 Changed 4 years ago by exarkun

(In [30328]) Document _ConsumerMixin; remove unused import and debug log message

refs #4719

comment:15 Changed 4 years ago by exarkun

(In [30329]) remove debug log messages

refs #4719

comment:16 Changed 4 years ago by exarkun

(In [30330]) Hoist some code out of test_unregisterProducerAfterDisconnect and generally try to make the test more readable

refs #4719

comment:17 Changed 4 years ago by exarkun

(In [30331]) Some cleanups to test_disconnectWhileProducing

refs #4719

comment:18 Changed 4 years ago by exarkun

(In [30333]) Skip on Windows on gtk2/glib2

refs #4719

comment:19 Changed 4 years ago by exarkun

  • Keywords review added
  • Owner exarkun deleted
  • Priority changed from normal to high

Thanks. I think I hit all the points you mentioned. Here are the build results, Build results.

comment:20 Changed 4 years ago by glyph

  • Keywords review removed
  • Owner set to exarkun
  1. Ending a line with :: means "put the indented block in a <pre>" in epitext. You really just wanted one colon in _ConsumerMixin's docstring. Otherwise, I love it.
  2. File a ticket on that awful PyGTK/Windows bug, just to remind somebody who knows what's going on (you or me, I guess) to file a ticket on their end when we are feeling more energetic about it. Don't worry about making a coherent description, just point at the code that's broken.
  3. pyflakes on abstract.py: "15: Error 'log' imported but unused"
  4. The whitespace change in gtk2reactor looks pretty pointless; maybe just don't bother to commit it?
  5. The tests are still pretty chatty in the log. If this is intentional for helping future maintainers, then by all means keep it.
  6. serverFactoryFor is pretty handy. Maybe that should be a public API somewhere, someday. (Just thinking out loud; ignore this comment.)
  7. Love the change to iocpreactor.


Address to your satisfaction and land it.

comment:21 Changed 4 years ago by exarkun

(In [30346]) Remove unused import. Fix epytext markup to be prettier.

refs #4719

comment:22 Changed 4 years ago by exarkun

(In [30347]) Back out whitespace change

refs #4719

comment:23 Changed 4 years ago by exarkun

Yay, thanks. 1-4 addressed. I like the chatty tests, so leaving the log messages. Merging.

comment:24 Changed 4 years ago by exarkun

  • Resolution set to fixed
  • Status changed from new to closed

(In [30349]) Merge close_wait-forever-4719-2

Author: exarkun, demaledetti2
Reviewer: exarkun, glyph
Fixes: #4719

Fix the implementation of unregisterProducer for most transports so that
if loseConnection was called while the producer was registered the, then
the connection is then closed. Previously these connections would hang
around indefinitely.

comment:25 Changed 3 years ago by <automation>

  • Owner exarkun deleted
Note: See TracTickets for help on using tickets.