Opened 5 years ago

Closed 5 years ago

Last modified 5 years ago

#5233 defect closed fixed (fixed)

win32eventreactor doesn't notice TCP connection is lost in certain (hopefully rare) cases

Reported by: itamar Owned by: exarkun
Priority: low Milestone:
Component: core Keywords: windows
Cc: zooko@… Branch: branches/win32er-close-5233-2
branch-diff, diff-cov, branch-cov, buildbot
Author: exarkun

Description -- test_resumeProducingThrows -- demonstrates the win32eventreactor problem: it doesn't notice a lost TCP connection. The problem doesn't appear to have anything to do with the abortConnection call; the test still fails if that line is commented out.

Protocol A connects over loopback to Protocol B. Protocol A's resumeProducing gets called from its transport's doWrite; resumeProducing throws an exception. At this point we expect the following to happen:

  1. Exception thrown from resumeProducing is logged, and Protocol A has its connectionLost called.
  2. Protocol B, which is connected to A over TCP, should also get its connectionLost called.

While item 1 happens, item 2 does not. As far as I can tell, MsgWaitForMultipleObjects never notices that B has lost its connection.

Only workaround I found: making the socket close with SO_LINGER set to 1,0, i.e. force RST instead of FIN. In this case MWFMO does notice the connection was lost.

Attachments (1)

WSAEnumNetworkEvents.patch (22.1 KB) - added by zseil 5 years ago.
standalone WSAEnumNetworkEvents

Download all attachments as: .zip

Change History (29)

comment:1 Changed 5 years ago by exarkun

  • Owner set to exarkun

#5285 was a duplicate of this. There's a self-contained example attached there.

comment:2 Changed 5 years ago by exarkun

  • Author set to exarkun
  • Branch set to branches/win32er-close-5233

(In [32695]) Branching to 'win32er-close-5233'

comment:3 Changed 5 years ago by exarkun

This is probably fixed in the branch. However, it's not tested. The test will be annoying to write. Maybe I'll give it a try soon. I also observed that this issue is probably also the cause of glib2/gtk2reactor flakiness on Windows. glib2 I/O notification propagates the one-shot nature of FD_CLOSE notification through to us, and we don't do anything special to handle it.

comment:4 Changed 5 years ago by exarkun

Oops, we don't use glib2/gtk2 on Windows, we use PortableGtkReactor. So nevermind about those bits.

comment:5 Changed 5 years ago by exarkun

  • Branch changed from branches/win32er-close-5233 to branches/win32er-close-5233-2

(In [32909]) Branching to 'win32er-close-5233-2'

comment:6 Changed 5 years ago by exarkun

  • Keywords review added
  • Owner exarkun deleted

Okay. I think I added a decent test, and made the fix really work in all cases.

Build results

comment:7 Changed 5 years ago by itamar

  • Owner set to itamar

I'll try to look at this tomorrow. For now, two notes:

  1. The test that was disabled in #78 should be re-enabled (if it's not already, haven't looked in detail at branch yet).
  1. We should open new ticket, and add the potential explanation, for pygtk issue noted in branch.

comment:8 Changed 5 years ago by exarkun

One issue that the build results pointed out is that WSAEnumNetworkEvents is somewhat new - it was added in late 2010. And several of our build slaves have a version of pywin32 which is old enough to not have it (disabling win32eventreactor entirely).

comment:9 Changed 5 years ago by itamar

  • Keywords review removed
  • Owner changed from itamar to exarkun

As far as availability of WSAEnumNetworkEvents, we can require people to upgrade pywin32, try to recreate the functionality with ctypes, or just fall back to the buggy behavior if old version of pywin32 is installed (in which case we should emit warning when the module is imported).

Given that going the ctypes route requires a bunch of extra work and tests, I think my preferred option would be requiring an upgrade. Mysterious non-closing sockets are not cool, and it's not like Windows users want to stick to particular versions of libraries distributed with the OS as you would on Linux.

I see item 1 above has been addressed - excellent.

  1. Item 2 above - add the guess about reason for pygtk issues to appropriate ticket.
  1. Any reason you're not using sets?
  1. The way in which an event leaves the closingReading pair of data structures is very... implicit. It seems like it work though, so I guess it's OK, but it seems tricksy enough that I'd want a test that it's not actually leaking objects.

Otherwise, looks good -- great job debugging a mysterious issue.

Changed 5 years ago by zseil

standalone WSAEnumNetworkEvents

comment:10 Changed 5 years ago by zseil

I've attached standalone WSAEnumNetworkEvents module if you'd prefer that over requiring users to upgrade the pywin32 package. The code is basically the same as in pywin32, there are also some tests.

comment:11 Changed 5 years ago by exarkun

  • Keywords review added
  • Owner exarkun deleted
  1. r32969 makes pywin32 215 the explicitly required version.
  2. Filed #5333 for pygtk/windows/close
  3. Not really. I'm weakly attached to the symmetry allowed by using a dict for _closedAndReading since there is no WeakSet to use for _closedAndNotReading, only a WeakKeyDictionary.
  4. I added a test in r32972.

As far as having our own WSAEnumNetworkEvents extension module... I'm not extremely opposed, but it means figuring out a behavior on PyPy (maybe not immediately, I don't think pywin32 works there yet), maintaining some more C, etc. I think in general we shouldn't worry about bumping our Windows dependencies, though, particularly if it lets us skip maintaining more C code. In any case, though, thanks very much for making that option easier for us, zseil. :)

comment:12 Changed 5 years ago by exarkun

comment:13 Changed 5 years ago by zseil

  • Keywords review removed
  • Owner set to exarkun

Nice! Only one small issue, at, you should also catch ValueError which is raised when fd.fileno() returns an invalid handle. This will happen e.g. for SerialPorts.

comment:14 Changed 5 years ago by zseil

Ups, my last comment was wrong, pywintypes.error is raised if fd is not a socket but looks like a valid handle.

comment:15 follow-up: Changed 5 years ago by exarkun

Does the branch in its current form break serial port support on Windows? I'm guessing so, but I have no way to test the serial port code.

comment:16 in reply to: ↑ 15 Changed 5 years ago by zseil

Replying to exarkun:

Does the branch in its current form break serial port support on Windows? I'm guessing so, but I have no way to test the serial port code.

Yes, the reactor explodes if a SerialPort is added to it.

comment:17 Changed 5 years ago by exarkun

(In [32981]) Handle pywintypes.error from WSAEnumNetworkEvents, to keep serial ports working

refs #5233

comment:18 Changed 5 years ago by exarkun

  • Keywords review added
  • Owner exarkun deleted

Okay, handling pywintypes.error now.

comment:19 Changed 5 years ago by glyph

  • Owner set to glyph
  • Status changed from new to assigned

comment:20 Changed 5 years ago by glyph

  • Owner changed from glyph to exarkun
  • Status changed from assigned to new

I think I've reviewed this as well as I am able (since I don't have a physical serial port on Windows: is it possible to create one with VMWare somehow?)

This looks pretty good. Excellent diagnostic work, I must say, given how obscure this condition is.

The one problem is that I really don't like the idea of the surprise explosion when upgrading Twisted on an existing installation without upgrading pywin32 first. Let's have a warning before we have IOCP and win32er just stop working completely. It's certainly better to very, very occasionally lose a connectionLost than to be completely unable to use the reactor. Just because we don't have to be as concerned about win32 users upgrading dependencies doesn't mean we don't have to be concerned at all; there is still the issue of upgrading development environments, synchronizing build systems, etc.

That said, I mostly agree with exarkun's observations about not wanting to maintain that much more C. Thanks, zseil, for implementing that mammoth function wrapper extension module, but I suspect that it will make more work for us than it saves, in the long run :).

I think we should get handle an ImportError on attempting to import WSAEnumNetworkEvents, and not depend on the results from GetFileVersionInfo. It's always possible that they'll change their versioning scheme in the future and this number will be wrong. Although it may be useful to call GetFileVersionInfo in the except block to provide the build number of the currently installed version so the user has some idea of how far behind they are. We should change the error to a warning, and the warning should say where you download pywin32 from, so that this doesn't become a FAQ for frustrated Windows users :).

Then, since we already have to handle the case where WSAEnumNetworkEvents doesn't actually work (the serial port stuff) we can just have a stub WSAEnumNetworkEvents that raises TypeError (and warns you again, maybe). This will still cause the test suite to fail, and that's fine; the failures can totally tell you to upgrade :).

Despite the volumes of prose here, I suspect the actual diff for making this change will be tiny, so if you agree with this plan then please just go ahead and land it after upgrading the buildbots.

comment:21 Changed 5 years ago by glyph

  • Keywords review removed

comment:22 Changed 5 years ago by zseil

I just noticed another thing, if you'd call WSAEnumNetworkEvents only for fd's that are in self._reads, you could disconnect the fd when WSAEnumNetworkEvents raises an error and unskip the twisted.internet.test.test_fdset.ReactorFDSetTestsBuilder.test_negativeOneFileDescriptor test.

This would also be more backwards compatible, fd.fileno() was previously never called for fds added directly with reactor.addEvent().

comment:23 Changed 5 years ago by exarkun

  • Keywords review added
  • Owner exarkun deleted

Thanks for the review, and special thanks for the additional comments, zseil. It's actually quite important to not call fileno for non-socket fds, since that method is not part of the interface for those objects.

I added an explicit fileno() == -1 check and got rid of the explicit pywin32 version check.

Latest build results

comment:24 Changed 5 years ago by zseil

  • Owner set to zseil

comment:25 Changed 5 years ago by zseil

  • Keywords review removed
  • Owner changed from zseil to exarkun

I think this is ready now. Both iocp and win32er reactors can now be imported with older pywin32 versions, test_resumeProducingThrows expectedly fails in that configuration. While there is no new test checking that fileno() doesn't get called for descriptors added with addEvent, existing tests would fail if that would happen now that there is an explicit fileno() call (test_win32events.Listener doesn't have a fileno method). Buildbot failures either look unrelated (hardy32-py2.5-select) or should get fixed once the branch is merged to trunk.

Thanks for working on this, even when Windows is not your favorite platform. Please merge!

comment:26 Changed 5 years ago by exarkun

  • Resolution set to fixed
  • Status changed from new to closed

(In [33049]) Merge win32er-close-5233-2

Author: exarkun Reviewer: itamar, zseil, glyph Fixes: #5233

win32eventreactor is now more careful in its handling of socket close notifications. Where previously it could miss a close notification entirely if it arrived at the same time as the last chunk of bytes on a connection, it will now handle that case and properly disconnect the transport and protocol. However, this functionality requires

pywin32 215 or later; it will be disabled with a warning if an older version of pywin32 is installed.

comment:27 Changed 5 years ago by zooko

You may enjoy (?) the consternation that this bugfix causes me over on the Tahoe-LAFS trac:

comment:28 Changed 5 years ago by zooko

  • Cc zooko@… added
Note: See TracTickets for help on using tickets.