Opened 6 years ago

Closed 6 years ago

Last modified 6 years ago

#3218 defect closed fixed (fixed)

SSL disconnection sometimes hangs indefinitely with pyOpenSSL 0.7

Reported by: exarkun Owned by:
Priority: highest Milestone:
Component: core Keywords:
Cc: jknight, therve Branch: branches/ssl-disconnection-3218-4
(diff, github, buildbot, log)
Author: exarkun Launchpad Bug:

Description

pyOpenSSL 0.7 introduced a Connection.set_shutdown API which allows Twisted to properly detect... something that I don't entirely understand yet. However, in the case where set_shutdown returns 0, it calls startWriting and not startReading so a closed socket is never detected. This results in SSL connections sometimes hanging indefinitely at shutdown.

Change History (36)

comment:1 Changed 6 years ago by exarkun

  • Type changed from enhancement to defect

comment:2 Changed 6 years ago by exarkun

  • Status changed from new to assigned

comment:3 Changed 6 years ago by jknight

Don't closed sockets showed up on both readable and writable lists? In any case, startWriting -> startReading is the wrong change, because sending the shutdown notification requires writing. If the socket isn't writable (buffer full or whatever) when it wants to write that data, it'd need to get a writable notification..

comment:4 Changed 6 years ago by exarkun

  • Status changed from assigned to new

It seems as though this is a Linux bug. In certain cases, the socket does not show up in the writable list, only in the readable list. Testing demonstrated the bug in 2.4.x and a number of 2.6.x kernels. FreeBSD and OS X 10.5 didn't exhibit the behavior.

I'm going to try to get #3198 merged before working on this, since #3198 has a bunch of code that should make testing this easier.

comment:5 Changed 6 years ago by exarkun

It also seems that this isn't a new problem with pyOpenSSL 0.7, although that release may make the failure more likely to happen. I can apparently reproduce the failure to disconnect with Twisted trunk@HEAD and pyOpenSSL 0.6.

comment:6 Changed 6 years ago by exarkun

Or not. Scratch my last, I just confused myself about what version was being used.

comment:7 Changed 6 years ago by jknight

While you may not have noticed, it seems to me rather likely that it *is* possible even without the shutdown stuff.

The prerequisite is: a send call which results in EPIPE, which error is swallowed before it gets back to Twisted. It's not inconceivable that other APIs might swallow a send() error somewhere...

comment:8 Changed 6 years ago by exarkun

  • author set to exarkun
  • Branch set to branches/ssl-disconnection-3218-2

(In [23636]) Branching to 'ssl-disconnection-3218-2'

comment:9 Changed 6 years ago by exarkun

  • Keywords review added
  • Owner exarkun deleted

James is probably right that this may be possible in other circumstances. For now I'll just worry about fixing this one though.

comment:10 Changed 6 years ago by therve

  • Keywords review removed
  • Owner set to exarkun
  • the new server.pem file is unfortunate. Could you create a temporary file during the tests, from a string in the module? We already have 2 pem files for tests, I don't like that much.
  • the ReactorBuilder test shows 2 problems: there is no timeout, and there is not assertFailure. If I remove the fix from tcp, the test hangs forever, so that's not really great. Maybe for now a method that call reactor.run() and fires a timeout is enough.

Thanks!

comment:11 Changed 6 years ago by jknight

I'm a bit worried about the startReading call: what happens if it triggers because there's actually data, rather than the socket closing? From a cursory glance, it looks like that would cause the doRead() function to be called, and perhaps actually deliver data to the protocol.

comment:12 Changed 6 years ago by zooko

By the way, here is the ticket at allmydata.org, where the current workaround is to require pyOpenSSL != v0.7:

http://allmydata.org/trac/tahoe/ticket/402 # (bug in Twisted, triggered by pyOpenSSL-0.7)

comment:13 Changed 6 years ago by zooko

There is new information on http://allmydata.org/trac/tahoe/ticket/402#comment:11 about how this issue does not seem to arise on certain combinations of software on Ubuntu Hardy, but it still does on Debian Sid...

comment:14 Changed 6 years ago by exarkun

(In [23976]) Get rid of the server.pem, store the certificate data in the test module

refs #3218

comment:15 Changed 6 years ago by jknight

BTW, it's worth noting this is only an issue for the selectreactor. The reason it is a problem is easily explained by the translation from poll flags to select bits:

#define POLLIN_SET (POLLRDNORM | POLLRDBAND | POLLIN | POLLHUP | POLLERR)
#define POLLOUT_SET (POLLWRBAND | POLLWRNORM | POLLOUT | POLLERR)
#define POLLEX_SET (POLLPRI)

Notice especially that POLLHUP only appears on POLLIN_SET. POLLHUP is what gets set when the remote side closes the connection.

This brings up an interesting question, though: is Twisted's pollreactor broken?

Consider, a remote server that calls shutdown(SHUT_WR), while the Twisted client is only in poll-for-writable mode. It seems this will cause the connection to be torn down, because event & POLL_DISCONNECTED and not (event & POLLIN): will be true (since POLLIN wasn't asked for).

comment:16 Changed 6 years ago by exarkun

See also #764

comment:17 Changed 6 years ago by exarkun

AFAICT, POLLHUP isn't set when SHUT_WR happens:

from socket import SHUT_WR, socket
from select import POLLIN, POLLOUT, POLLERR, POLLNVAL, POLLHUP, select, poll

def show((fd, event)):
    print fd, event & POLLIN and 'IN' or '', \
              event & POLLOUT and 'OUT' or '', \
              event & POLLERR and 'ERR' or '', \
              event & POLLNVAL and 'NVAL' or '', \
              event & POLLHUP and 'HUP' or ''

port = socket()
port.listen(3)

client = socket()
client.setblocking(False)
client.connect_ex(port.getsockname())
client.setblocking(True)

server, addr = port.accept()
poller = poll()
poller.register(client, POLLIN|POLLOUT|POLLERR|POLLNVAL|POLLHUP)

server.shutdown(SHUT_WR)
print 'Poll after shutdown:'
map(show, poller.poll(0))

assert client.recv(1024) == ''
print 'Poll after recv:'
map(show, poller.poll(0))

On Linux 2.6.15(-52-k7), glibc 2.3.6, this produces the following output:

Poll after shutdown:
4 IN OUT
Poll after recv:
4 IN OUT

So I guess it's okay.

comment:18 Changed 6 years ago by exarkun

  • Branch changed from branches/ssl-disconnection-3218-2 to branches/ssl-disconnection-3218-3

(In [24219]) Branching to 'ssl-disconnection-3218-3'

comment:19 Changed 6 years ago by exarkun

(In [24221]) some code that's untested but should handle the case james pointed out

in case the other side shuts down tls and then starts speaking tcp again, don't
even bother trying to read it, just give up on the connection.

refs #3218

comment:20 Changed 6 years ago by exarkun

(In [24222]) add a bit more detail to the comment about why select on linux is of inferior quality

refs #3218

comment:21 Changed 6 years ago by exarkun

(In [24223]) by the way, I didn't write any tests for this code

refs #3218

comment:22 Changed 6 years ago by exarkun

  • Keywords review added
  • Owner exarkun deleted
  • Priority changed from high to highest

Okay. I'm not sure what else I can do with this. Maybe the new code in _TLSMixin.doRead shouldn't be there, but at least the comment is really long.

Build results:

http://buildbot.twistedmatrix.com/boxes-supported?num_builds=1&branch=/branches/ssl-disconnection-3218-3

comment:23 Changed 6 years ago by therve

  • Cc therve added
  • Keywords review removed
  • Owner set to exarkun

Some flakes

twisted/internet/test/test_tls.py:13: 'FilePath' imported but unused
twisted/internet/test/test_tls.py:20: undefined name 'importError'

More problematic, the test doesn't fail when I rever the changes in tcp. They used, so it must be linked with changes made after my last review. But afaict, it was only the change from a certificate file.

comment:24 Changed 6 years ago by exarkun

  • Keywords review added
  • Owner changed from exarkun to therve

twisted.internet.test.test_tls.SSLClientTestsMixin_SelectReactor.test_disconnectAfterWriteAfterStartTLS still fails for me if I revert all of the changes to twisted/internet/tcp.py. Unfortunately, I think it's possible there's a race condition in the test which makes it not fail all the time. In ShortProtocol.dataReceived, if both sides of the connection manage to do their self.transport.write("y") before either side notices that the other has closed the connection, then when OpenSSL writes the TLS close alert, it won't fail with ENOTCONN or EPIPE. This means the socket will still show up in the input set reported by select, and so the changes to tcp.py aren't necessary to notice the connection is closed. Unfortunately, I can't see a simple way to test this hypothesis or fix the problem if it exists.

comment:25 Changed 6 years ago by therve

  • Keywords review removed
  • Owner changed from therve to exarkun

I wonder if it may be solved by a recent linux/openssl version. It's not failing for me, even with -u.

But anyway, there's now a small conflict with trunk. Once that solved, along with the flakes I mentioned, please merge.

comment:26 Changed 6 years ago by exarkun

  • Resolution set to fixed
  • Status changed from new to closed

(In [24404]) Merge ssl-disconnection-3218-3

Author: exarkun
Reviewer: jknight, therve
Fixes: #3218

Work around a bad behavior in the Linux implementation of select(2)
which interacted with OpenSSL to cause SelectReactor to sometimes
be unable to detect that a TCP connection which had switched to TLS
had been disconnected. This change causes SelectReactor to check
such connections for readability (in addition to the existing check
for writeability), since Linux adds these sockets only to the read-
set.

This also adds the utility method ReactorBuilder.runReactor which
wraps reactor.run() with additional logic to enforce timeouts for
tests.

comment:27 Changed 6 years ago by jknight

A note on some of the comments you added: Twisted does /not/ actually wait for the peer to respond with a close alert, so the sentence "Only after the peer responds with a TLS close alert of their own will we disconnect the underlying TCP connection. " and "The TCP connection is still up and we're waiting for the peer to send us back a TLS close alert. " are wrong. They should say instead, that we're waiting until we've been able to send the TLS close alert to the peer.

Achieving that is what the set_shutdown mess is about. Because, of course, OpenSSL doesn't actually have a /proper/ API for doing it (or, heck, just do it by default).

comment:28 Changed 6 years ago by exarkun

  • Resolution fixed deleted
  • Status changed from closed to reopened

(In [24430]) Revert r24404 - win32/gtk2 test suite regression

Reopens #3218

The Gtk2 reactor on Windows fails the newly added TLS disconnection
tests.

comment:29 Changed 6 years ago by exarkun

  • Branch changed from branches/ssl-disconnection-3218-3 to branches/ssl-disconnection-3218-4

(In [24433]) Branching to 'ssl-disconnection-3218-4'

comment:30 Changed 6 years ago by exarkun

  • Keywords rev added
  • Owner changed from exarkun to radix
  • Status changed from reopened to new

I fixed up the comment jknight pointed out was incorrect and I made gtk2/glib2 skip the new test on windows.

comment:31 Changed 6 years ago by exarkun

  • Keywords review added; rev removed

comment:32 Changed 6 years ago by radix

  • Keywords review removed
  • Owner changed from radix to exarkun

The changes since the last merge attempt look good. If the branch passes on the supported windows slaves, +1

comment:33 Changed 6 years ago by zooko

For what it is worth, over at http://allmydata.org/trac/tahoe/ticket/402#comment:17 we've worked-around this by specifying --reactor=poll on Linux. Our builders are all green: http://allmydata.org/buildbot/waterfall?show_events=false

Thanks!

comment:34 Changed 6 years ago by exarkun

  • Resolution set to fixed
  • Status changed from new to closed

(In [24473]) Merge ssl-disconnection-3218-4

Author: exarkun
Reviewer: jknight, therve, radix
Fixes: #3218

Work around a bad behavior in the Linux implementation of select(2)
which interacted with OpenSSL to cause SelectReactor to sometimes be
unable to detect that a TCP connection which had switched to TLS had
been disconnected. This change causes SelectReactor to check such
connections for readability (in addition to the existing check for
writeability), since Linux adds these sockets only to the read set.

This also adds the utility method ReactorBuilder.runReactor which
wraps reactor.run() with additional logic to enforce timeouts for
tests.

This re-merge fixes failures on Windows of the new TLS tests under
the glib2 and gtk2 reactors by skipping testing that configuration.
See #3371.

comment:35 Changed 6 years ago by warner

FYI, I've confirmed that this fixes the Foolscap (0.3.0) unit test failures that I was seeing against Twisted-8.1.0 and pyopenssl-0.7 . The tests pass when run against twisted trunk r24575.

comment:36 Changed 4 years ago by <automation>

  • Owner exarkun deleted
Note: See TracTickets for help on using tickets.