Opened 11 years ago

Last modified 11 years ago

#4455 defect new

SSL Connection Shutdown Causes Reactor To Spin

Reported by: moxie Owned by:
Priority: normal Milestone:
Component: core Keywords: spinning, spin, ssl
Cc: itamarst Branch:
Author:

Description (last modified by Jean-Paul Calderone)

I'm using the twisted.web framework for a high performance HTTP proxy server that very closely resembles the HTTP proxy server example that comes with the twisted package, except that it accepts connections via SSL.

Under heavy load, I occasionally (once every ~3 days) run into a problem where the reactor appears to start spinning on an "empty" write. An strace of the process reveals these system calls:

epoll_wait(3, {{EPOLLOUT, {u32=81, u64=13194703840475086929}}}, 269, 8) = 1
write(81, "", 0)                        = 0
gettimeofday({1272980323, 567613}, NULL) = 0
gettimeofday({1272980323, 567696}, NULL) = 0
epoll_wait(3, {{EPOLLOUT, {u32=81, u64=13194703840475086929}}}, 269, 8) = 1
write(81, "", 0)                        = 0
gettimeofday({1272980323, 567988}, NULL) = 0
gettimeofday({1272980323, 568071}, NULL) = 0
epoll_wait(3, {{EPOLLOUT, {u32=81, u64=13194703840475086929}}}, 269, 7) = 1
write(81, "", 0)                        = 0
gettimeofday({1272980323, 568501}, NULL) = 0
gettimeofday({1272980323, 568613}, NULL) = 0
epoll_wait(3, {{EPOLLOUT, {u32=81, u64=13194703840475086929}}}, 269, 7) = 1
write(81, "", 0)                        = 0
gettimeofday({1272980323, 568978}, NULL) = 0
gettimeofday({1272980323, 569084}, NULL) = 0
epoll_wait(3, {{EPOLLOUT, {u32=81, u64=13194703840475086929}}}, 269, 6) = 1
write(81, "", 0)                        = 0

...it seems to be waiting for write availability on a FD, trying to write an empty buffer, seeing that it wrote zero bytes, and so waiting for availability again.

I've confirmed that when this occurs, whatever FD it is spinning on does correspond with a network socket connected to the proxy. I've seen this with both Twisted 8.2 and 9.0 running on Linux using both the epoll and select reactors.

Digging through the 10.0 release code, I found the _sendCloseAlert method in tcp.py: [source:/trunk/twisted/internet/tcp.py?rev=26296#L224]

I feel like this is almost certainly the culprit. Most suspiciously, there's an explicit empty write on pastebin-line 23:

os.write(self.socket.fileno(), '')

It looks to me like there's a code path where this write "succeeds," the SSL socket reports that the connection is still not shut down, and then the transport is resumed via startWriting and startReading. I assume that this then repeats for as long as that socket is up. This directly corresponds with the system calls I'm seeing (poll, empty write, poll, empty write...)

I still don't have an entirely comprehensive grasp of whats' going on here, but while these efforts to "cleanly" shut down the SSL connection by exchanging alerts are heroic, I feel like it's much more common (and probably good enough) to just send a best-effort alert and then immediately close the underlying socket.

I've seen this with twisted 8.2 and 9.0, pyopenssl 0.9, and openssl 0.9.8g

Change History (6)

comment:1 Changed 11 years ago by Jean-Paul Calderone

Description: modified (diff)

comment:2 Changed 11 years ago by David Reid

From discussion on IRC:

[09:48:01] < exarkun> It's probably not load related at all
[09:48:22] < exarkun> You probably just need a crapton of real world clients before you run into one that behaves in a way that triggers this case
[09:48:40] < exarkun> Because most people know how to close sockets
[09:50:07] < foom> I wonder if the self.startReading() thing is the problem.
[09:50:46] < exarkun> foom: so what if someone decided to stop reading from their SSL socket just after receiving the close alert twisted sends?
[09:50:51] < exarkun> stops reading and stops writing
[09:51:29] < exarkun> eh actually I don't really remember how all these things fit together, so I'm probably not going to say anything useful
[09:51:42] < exarkun> dreid: I might be able to look in january sometime.
[09:51:43] < foom> The code is *supposed* to only wait for a sent notification
[09:52:02] < foom> that is: wait for write, when it can write, send some bytes. done.
[09:52:16] < foom> that is what code in other servers does too
[09:52:36] < exarkun> foom: maybe I broke it when I actually implemented the set_shutdown stuff
[09:52:36] < foom> unfortunately, other servers use blocking mode, where openssl's HORRIBLE SUCKY API doesn't get in the way quite so much
[09:54:14] < foom> Well, that's the only case in which it will attempt to retry
[09:54:40] < foom> So the question for someone is: what can make SSL_shutdown fail, besides a failed write()

comment:3 Changed 11 years ago by Jean-Paul Calderone

I would expect #4854 to resolve this, by removing the need to interact with OpenSSL's I/O systems at all.

comment:4 Changed 11 years ago by <automation>

Owner: Glyph deleted

comment:5 Changed 11 years ago by Itamar Turner-Trauring

Cc: itamarst added

And, if it's not clear: in older versions of Twisted, switching to twisted.internet.tls would probably fix it too (that's what #4854 would do, just a bit more transparently.)

comment:6 Changed 11 years ago by Jean-Paul Calderone

(In [31537]) Merge protocol-ssl-4854-6

Author: exarkun Reviewer: itamar, glyph, thijs Fixes: #4854 Refs: #4974 Refs: #5014 Refs: #4455

Add an implementation of IReactorSSL and ITLSTransport which uses the memory BIO APIs present in pyOpenSSL 0.10 and newer. This implementation will be preferred by all reactors if the pyOpenSSL dependency is satisfied, otherwise the old implementation will still be used.

This appears to have slightly better performance than the old implementation and should avoid bugs like #4455.

Note: See TracTickets for help on using tickets.