#3218 defect closed fixed (fixed)
SSL disconnection sometimes hangs indefinitely with pyOpenSSL 0.7
Reported by: | Jean-Paul Calderone | Owned by: | |
---|---|---|---|
Priority: | highest | Milestone: | |
Component: | core | Keywords: | |
Cc: | jknight, therve | Branch: |
branches/ssl-disconnection-3218-4
branch-diff, diff-cov, branch-cov, buildbot |
Author: | exarkun |
Description
pyOpenSSL 0.7 introduced a Connection.set_shutdown
API which allows Twisted to properly detect... something that I don't entirely understand yet. However, in the case where set_shutdown
returns 0, it calls startWriting
and not startReading
so a closed socket is never detected. This results in SSL connections sometimes hanging indefinitely at shutdown.
Change History (36)
comment:1 Changed 10 years ago by
Type: | enhancement → defect |
---|
comment:2 Changed 10 years ago by
Status: | new → assigned |
---|
comment:3 Changed 10 years ago by
comment:4 Changed 10 years ago by
Status: | assigned → new |
---|
It seems as though this is a Linux bug. In certain cases, the socket does not show up in the writable list, only in the readable list. Testing demonstrated the bug in 2.4.x and a number of 2.6.x kernels. FreeBSD and OS X 10.5 didn't exhibit the behavior.
I'm going to try to get #3198 merged before working on this, since #3198 has a bunch of code that should make testing this easier.
comment:5 Changed 10 years ago by
It also seems that this isn't a new problem with pyOpenSSL 0.7, although that release may make the failure more likely to happen. I can apparently reproduce the failure to disconnect with Twisted trunk@HEAD and pyOpenSSL 0.6.
comment:6 Changed 10 years ago by
Or not. Scratch my last, I just confused myself about what version was being used.
comment:7 Changed 10 years ago by
While you may not have noticed, it seems to me rather likely that it *is* possible even without the shutdown stuff.
The prerequisite is: a send call which results in EPIPE, which error is swallowed before it gets back to Twisted. It's not inconceivable that other APIs might swallow a send() error somewhere...
comment:8 Changed 10 years ago by
author: | → exarkun |
---|---|
Branch: | → branches/ssl-disconnection-3218-2 |
(In [23636]) Branching to 'ssl-disconnection-3218-2'
comment:9 Changed 10 years ago by
Keywords: | review added |
---|---|
Owner: | Jean-Paul Calderone deleted |
James is probably right that this may be possible in other circumstances. For now I'll just worry about fixing this one though.
comment:10 Changed 10 years ago by
Keywords: | review removed |
---|---|
Owner: | set to Jean-Paul Calderone |
- the new server.pem file is unfortunate. Could you create a temporary file during the tests, from a string in the module? We already have 2 pem files for tests, I don't like that much.
- the
ReactorBuilder
test shows 2 problems: there is no timeout, and there is not assertFailure. If I remove the fix from tcp, the test hangs forever, so that's not really great. Maybe for now a method that call reactor.run() and fires a timeout is enough.
Thanks!
comment:11 Changed 10 years ago by
I'm a bit worried about the startReading call: what happens if it triggers because there's actually data, rather than the socket closing? From a cursory glance, it looks like that would cause the doRead() function to be called, and perhaps actually deliver data to the protocol.
comment:12 Changed 10 years ago by
By the way, here is the ticket at allmydata.org, where the current workaround is to require pyOpenSSL != v0.7:
http://allmydata.org/trac/tahoe/ticket/402 # (bug in Twisted, triggered by pyOpenSSL-0.7)
comment:13 Changed 10 years ago by
There is new information on http://allmydata.org/trac/tahoe/ticket/402#comment:11 about how this issue does not seem to arise on certain combinations of software on Ubuntu Hardy, but it still does on Debian Sid...
comment:14 Changed 10 years ago by
comment:15 Changed 10 years ago by
BTW, it's worth noting this is only an issue for the selectreactor. The reason it is a problem is easily explained by the translation from poll flags to select bits:
#define POLLIN_SET (POLLRDNORM | POLLRDBAND | POLLIN | POLLHUP | POLLERR) #define POLLOUT_SET (POLLWRBAND | POLLWRNORM | POLLOUT | POLLERR) #define POLLEX_SET (POLLPRI)
Notice especially that POLLHUP only appears on POLLIN_SET. POLLHUP is what gets set when the remote side closes the connection.
This brings up an interesting question, though: is Twisted's pollreactor broken?
Consider, a remote server that calls shutdown(SHUT_WR), while the Twisted client is only in poll-for-writable mode. It seems this will cause the connection to be torn down, because event & POLL_DISCONNECTED and not (event & POLLIN):
will be true (since POLLIN wasn't asked for).
comment:17 Changed 10 years ago by
AFAICT, POLLHUP isn't set when SHUT_WR happens:
from socket import SHUT_WR, socket from select import POLLIN, POLLOUT, POLLERR, POLLNVAL, POLLHUP, select, poll def show((fd, event)): print fd, event & POLLIN and 'IN' or '', \ event & POLLOUT and 'OUT' or '', \ event & POLLERR and 'ERR' or '', \ event & POLLNVAL and 'NVAL' or '', \ event & POLLHUP and 'HUP' or '' port = socket() port.listen(3) client = socket() client.setblocking(False) client.connect_ex(port.getsockname()) client.setblocking(True) server, addr = port.accept() poller = poll() poller.register(client, POLLIN|POLLOUT|POLLERR|POLLNVAL|POLLHUP) server.shutdown(SHUT_WR) print 'Poll after shutdown:' map(show, poller.poll(0)) assert client.recv(1024) == '' print 'Poll after recv:' map(show, poller.poll(0))
On Linux 2.6.15(-52-k7), glibc 2.3.6, this produces the following output:
Poll after shutdown: 4 IN OUT Poll after recv: 4 IN OUT
So I guess it's okay.
comment:18 Changed 10 years ago by
Branch: | branches/ssl-disconnection-3218-2 → branches/ssl-disconnection-3218-3 |
---|
(In [24219]) Branching to 'ssl-disconnection-3218-3'
comment:19 Changed 10 years ago by
comment:20 Changed 10 years ago by
comment:21 Changed 10 years ago by
comment:22 Changed 10 years ago by
Keywords: | review added |
---|---|
Owner: | Jean-Paul Calderone deleted |
Priority: | high → highest |
Okay. I'm not sure what else I can do with this. Maybe the new code in _TLSMixin.doRead
shouldn't be there, but at least the comment is really long.
Build results:
comment:23 Changed 10 years ago by
Cc: | therve added |
---|---|
Keywords: | review removed |
Owner: | set to Jean-Paul Calderone |
Some flakes
twisted/internet/test/test_tls.py:13: 'FilePath' imported but unused twisted/internet/test/test_tls.py:20: undefined name 'importError'
More problematic, the test doesn't fail when I rever the changes in tcp. They used, so it must be linked with changes made after my last review. But afaict, it was only the change from a certificate file.
comment:24 Changed 10 years ago by
Keywords: | review added |
---|---|
Owner: | changed from Jean-Paul Calderone to therve |
twisted.internet.test.test_tls.SSLClientTestsMixin_SelectReactor.test_disconnectAfterWriteAfterStartTLS
still fails for me if I revert all of the changes to twisted/internet/tcp.py
. Unfortunately, I think it's possible there's a race condition in the test which makes it not fail all the time. In ShortProtocol.dataReceived
, if both sides of the connection manage to do their self.transport.write("y")
before either side notices that the other has closed the connection, then when OpenSSL writes the TLS close alert, it won't fail with ENOTCONN
or EPIPE
. This means the socket will still show up in the input set reported by select
, and so the changes to tcp.py
aren't necessary to notice the connection is closed. Unfortunately, I can't see a simple way to test this hypothesis or fix the problem if it exists.
comment:25 Changed 10 years ago by
Keywords: | review removed |
---|---|
Owner: | changed from therve to Jean-Paul Calderone |
I wonder if it may be solved by a recent linux/openssl version. It's not failing for me, even with -u.
But anyway, there's now a small conflict with trunk. Once that solved, along with the flakes I mentioned, please merge.
comment:26 Changed 10 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
(In [24404]) Merge ssl-disconnection-3218-3
Author: exarkun Reviewer: jknight, therve Fixes: #3218
Work around a bad behavior in the Linux implementation of select(2)
which interacted with OpenSSL to cause SelectReactor
to sometimes
be unable to detect that a TCP connection which had switched to TLS
had been disconnected. This change causes SelectReactor
to check
such connections for readability (in addition to the existing check
for writeability), since Linux adds these sockets only to the read-
set.
This also adds the utility method ReactorBuilder.runReactor
which
wraps reactor.run()
with additional logic to enforce timeouts for
tests.
comment:27 Changed 10 years ago by
A note on some of the comments you added: Twisted does /not/ actually wait for the peer to respond with a close alert, so the sentence "Only after the peer responds with a TLS close alert of their own will we disconnect the underlying TCP connection. " and "The TCP connection is still up and we're waiting for the peer to send us back a TLS close alert. " are wrong. They should say instead, that we're waiting until we've been able to send the TLS close alert to the peer.
Achieving that is what the set_shutdown mess is about. Because, of course, OpenSSL doesn't actually have a /proper/ API for doing it (or, heck, just do it by default).
comment:28 Changed 10 years ago by
Resolution: | fixed |
---|---|
Status: | closed → reopened |
comment:29 Changed 10 years ago by
Branch: | branches/ssl-disconnection-3218-3 → branches/ssl-disconnection-3218-4 |
---|
(In [24433]) Branching to 'ssl-disconnection-3218-4'
comment:30 Changed 10 years ago by
Keywords: | rev added |
---|---|
Owner: | changed from Jean-Paul Calderone to radix |
Status: | reopened → new |
I fixed up the comment jknight pointed out was incorrect and I made gtk2/glib2 skip the new test on windows.
comment:31 Changed 10 years ago by
Keywords: | review added; rev removed |
---|
comment:32 Changed 10 years ago by
Keywords: | review removed |
---|---|
Owner: | changed from radix to Jean-Paul Calderone |
The changes since the last merge attempt look good. If the branch passes on the supported windows slaves, +1
comment:33 Changed 10 years ago by
For what it is worth, over at http://allmydata.org/trac/tahoe/ticket/402#comment:17 we've worked-around this by specifying --reactor=poll on Linux. Our builders are all green: http://allmydata.org/buildbot/waterfall?show_events=false
Thanks!
comment:34 Changed 10 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
(In [24473]) Merge ssl-disconnection-3218-4
Author: exarkun Reviewer: jknight, therve, radix Fixes: #3218
Work around a bad behavior in the Linux implementation of select(2) which interacted with OpenSSL to cause SelectReactor to sometimes be unable to detect that a TCP connection which had switched to TLS had been disconnected. This change causes SelectReactor to check such connections for readability (in addition to the existing check for writeability), since Linux adds these sockets only to the read set.
This also adds the utility method ReactorBuilder.runReactor which wraps reactor.run() with additional logic to enforce timeouts for tests.
This re-merge fixes failures on Windows of the new TLS tests under the glib2 and gtk2 reactors by skipping testing that configuration. See #3371.
comment:35 Changed 10 years ago by
FYI, I've confirmed that this fixes the Foolscap (0.3.0) unit test failures that I was seeing against Twisted-8.1.0 and pyopenssl-0.7 . The tests pass when run against twisted trunk r24575.
comment:36 Changed 7 years ago by
Owner: | Jean-Paul Calderone deleted |
---|
Don't closed sockets showed up on both readable and writable lists? In any case, startWriting -> startReading is the wrong change, because sending the shutdown notification requires writing. If the socket isn't writable (buffer full or whatever) when it wants to write that data, it'd need to get a writable notification..