Opened 11 years ago
Closed 11 years ago
#5118 release blocker: regression closed fixed (fixed)
Memory BIO TLS transport which receives a TLS close alert never sends a close alert
Reported by: | Itamar Turner-Trauring | Owned by: | Glyph |
---|---|---|---|
Priority: | normal | Milestone: | Twisted-11.1 |
Component: | core | Keywords: | |
Cc: | ivank, jesstess | Branch: |
branches/tls-close-5118-3
branch-diff, diff-cov, branch-cov, buildbot |
Author: | itamarst, exarkun, glyph |
Description
BIO TLS transports don't disconnect cleanly; the side that initiated the disconnect always gets unclean disconnect. There is a test that attempts to check that the close handshake works correctly for BIO TLS transports (test_loseConnectionAfterHandshake), but it fails to assert some things and therefore doesn't catch the problem.
There is a fix for this, with test changes, in [31954]. However, this doesn't cover an additional problem: once a TLS transport receives the close alert, only one 215 chunk is written out before the underlying transport is closed, and any other buffered data is dropped and never sent.
Change History (40)
comment:1 Changed 11 years ago by
Author: | → itamarst |
---|---|
Branch: | → branches/tls-close-5118 |
comment:2 Changed 11 years ago by
Author: | itamarst → itamarst, exarkun |
---|---|
Keywords: | review added |
Ready for review. I expanded the test to have a large write to cover the additional problem mentioned in the original description and, strangely, it passes. Perhaps OpenSSL only exposes the TLS close alert once all the data has been flushed from the bio? Or perhaps my expanded test isn't covering that case correctly.
http://buildbot.twistedmatrix.com/boxes-supported?branch=/branches/tls-close-5118 for test runs.
comment:3 Changed 11 years ago by
Milestone: | → Twisted-11.1 |
---|
comment:4 Changed 11 years ago by
Keywords: | review removed |
---|---|
Owner: | set to Itamar Turner-Trauring |
My brain hurts a little now. But it seems fine, please merge.
comment:5 Changed 11 years ago by
I rather dislike the style of test involving:
chunkOfBytes = "123456890" * 100000
particular if the author doesn't even claim to understand why it's there. I think the test should stick to what it was doing before, or there should be a better explanation of why it should write a huge pile of bytes.
comment:6 Changed 11 years ago by
Keywords: | review added |
---|---|
Owner: | Itamar Turner-Trauring deleted |
I expanded the comment inside cbHandshake - is that sufficient? Is it understandable now?
What's unclear to me is why this passes, not why writing a huge pile of bytes is a good idea.
comment:7 Changed 11 years ago by
I tested the branch against a Divmod branch which tries to port Q2Q from JUICE to AMP. Sadly, this does not fix an issue caused by the Unexpected EOF
error. It might not be the fault of twisted.protocols.tls
, it could be that AMP or Q2Q is doing something wrong, I'm not entirely sure yet. The code in question is at lp:~divmod-dev/divmod.org/amp-vertex-2580
, and the test is vertex.test.test_q2q.VirtualConnection.testConnectWithIntroduction
. It fails because the Unexpected EOF
handler in twisted.protocols.tls
turns the connectionLost
reason into ConnectionLost
and the test wants ConnectionDone
.
comment:8 Changed 11 years ago by
Keywords: | review removed |
---|---|
Owner: | set to Itamar Turner-Trauring |
I got the vertex test to pass... Basically remove self.transport.loseConnection from TLSMemoryBIOProtocol.loseConnection, and make TLSMemoryBIOProtocol return immediately if self.disconnecting is True.
I am investigating - seems to break SSL tests, and it's not quite clear how it fixes things :)
comment:10 Changed 11 years ago by
(In [32020]) OK, I got it to point where vertex passes, only one ssl test fails (see below), and some HTTPS and POP3S tests fail due to dirty reactor. I have no test reproducing whatever vertex is triggering.
OTOH I improved things so that loseConnection() can be called twice without wacky side-effects, and write() after loseConnection() just drops the bytes (the latter at least should have a test).
The failing SSL test is testImmediateDisconnect: quite possibly we need handshake-done callback, it seems like calling shutdown() before the handshake is done doesn't actually work.
refs #5118
comment:11 Changed 11 years ago by
comment:12 Changed 11 years ago by
Keywords: | review added |
---|---|
Owner: | changed from Itamar Turner-Trauring to Jean-Paul Calderone |
OK, ready for review again. Changes:
- Implemented equivalent test to the failing vertex test, and fixed the issue. The test ended up in
twisted.test.test_ssl
, since my attempt to reproduce it intwisted.protocols.test.test_tls
failed. - Code refactoring and cleanup.
- Writes after
loseConnection
are dropped on the floor, as is the case with TCP connections. - The user-level protocol has
connectionLost
called after the underlying transport has disconnected, instead of when the TLS shutdown has finished. - Nagle is disabled on TCP transports when TLS shutdown is initiated; this speeds up shutdown, which otherwise would be delayed, since shutdown involves sending a small write.
- Some tests started failing due to TLS shutdown delaying disconnection a little. The
pop3client
test was fixed cleanly, but thewebclient
tests were fixed horribly. Perhaps a ticket for a longer-term solution would be in order, but I'm not sure what the fix should be. Most obvious solution is to delay firing Deferred returned from downloadPage until connection is lost, rather than until HTTP response is received; this is a non-trivial change so probably shouldn't be in this branch.
comment:13 Changed 11 years ago by
Cc: | ivank added |
---|
I don't have a full review, and this is assigned to exarkun, so I've left the review keyword alone.
I started up 5 instances of trial -u twisted.web.test.test_webclient
on my 4 core Ubuntu VM, and within a few seconds, this test failed:
=============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean. Selectables: <TLSMemoryBIOProtocol #0 on 35386> <TLSMemoryBIOProtocol #1 on 35386> twisted.web.test.test_webclient.WebClientSSLTestCase.test_downloadTimeout
I don't think raising the number 0.001
in slightDelay
would be a good idea, though. I see two options, which itamar mentioned in #twisted:
- Teach trial about TLS connections that are closing and don't count the reactor as dirty.
- Make
_newtls
do unclean closes like the old TLS code.
comment:14 Changed 11 years ago by
Keywords: | review removed |
---|---|
Owner: | changed from Jean-Paul Calderone to Itamar Turner-Trauring |
The current plan:
- make HTTPClient SSL tests spin the reactor a bit more, so they pass more consistently (apparently they are fragile with old TLS implementation as well).
- File ticket for using half-close to speed up TLS shutdown. This also "fixes" the downloadPage tests, though still leaving them fragile, so maybe we'll remove point 1 once we do that.
The motivation: we're trying to get rid of downloadPage and friends anyway, so putting lots of work into them is a waste of time, and this problem is not really user-facing.
comment:15 Changed 11 years ago by
Keywords: | review added |
---|---|
Owner: | changed from Itamar Turner-Trauring to Jean-Paul Calderone |
Done and done (#5150). Ready for review.
Actually: I notice the comment about Error case in TLSMemoryBIOProtocol._write is no longer correct, I'm pretty sure the error *will* reach user protocol's connectionLost, unlike old code. Please note in review which behavior is preferred and I'll fix the comment or the code. Personally I think the error should be passed to user; this is useful debugging info which otherwise will be swallowed.
comment:16 Changed 11 years ago by
Keywords: | review removed |
---|---|
Owner: | changed from Jean-Paul Calderone to Itamar Turner-Trauring |
- in
test_pop3client.py
POP3TLSTestCase
andPOP3.testStartTLS
are missing docstrings, andtestStartTLS
should be renamed in accordance with the coding standard.- There's some trailing whitespace on the new
return connLostDeferred
line - The test could probably be simplified by doing an
addCleanup(port.stopListening)
early on, instead of adding it to the Deferred chain near the end.
- The test coverage of
twisted/protocols/tls.py
is worse in the branch than it was in trunk; 5 new uncovered lines (total of 10) and 3 new uncovered branches (total of 4). The output ofcoverage.py
can enumerate the cases more easily than I can. - The
if self.disconnecting:
check inwrite
is going to immediately cause problems, isn't it? It will break some producer/consumer behavior and we'll have to fix it in #5063. - There's a pyflakes warning in
twisted/protocols/tls.py
I think I complained already about some of the (new/changed in the branch) tests, it probably doesn't help to repeat those. Maybe I can try to write some better ones at some point though.
comment:17 Changed 11 years ago by
Re 3, I think that breaking producers slighty more is fine if it improves things for the non-producer case. Producers are messed up already and this will be fixed ASAP in the producer branch (just need to make sure it has test for write after loseConnection).
Re 2... am I responsible for fixing branches I didn't write? Cause I'm not sure I understand the code well enough to do all of them. My plan is to remove a few branches that are probably unnecessary, and add some more tests to test_tcp to improve coverage, but I'm not sure I'll be able to do everything. We'll see how far I get.
As far as your complaints about the tests... is it that some of them were in test_ssl? The vertex-like test is in test_ssl and will continue to be there unless someone else can reproduce it some other way; my attempt to do so in test_tls failed.
comment:18 Changed 11 years ago by
Re 2... am I responsible for fixing branches I didn't write?
If your changes caused the tests to no longer exercise them. :)
comment:19 Changed 11 years ago by
comment:20 Changed 11 years ago by
comment:21 Changed 11 years ago by
1 and 4 were fixed, 3 will be dealt with in #5063 branch. I have no idea how to test the remaining two edge cases (Unexpected EOF, and the error case in _write); they were tested before, but possibly due to unclean shutdown, or worse error handling elsewhere. Suggestions?
comment:22 Changed 11 years ago by
Keywords: | review added |
---|---|
Owner: | Itamar Turner-Trauring deleted |
OK, I have added test coverage for remaining two branches. Note that for full coverage you need to run twisted.protocols.test.test_tls, twisted.internet.test.test_tls and twisted.test.test_ssl. Some branches, in particular _tlsShutdown in _write(), only cause failures when removed on IOCP reactors, but they are necessary.
All tests pass: http://buildbot.twistedmatrix.com/boxes-supported?branch=/branches/tls-close-5118
comment:23 Changed 11 years ago by
Replying to itamar:
There is a fix for this, with test changes, in [31954]. However, this doesn't cover an additional problem: once a TLS transport receives the close alert, only one 215 chunk is written out before the underlying transport is closed, and any other buffered data is dropped and never sent.
This problem actually sounds like an almost-correct interpretation of the spec:
The other party (which has received a close_notify) MUST respond with a close_notify alert of its own and close down the connection immediately, discarding any pending writes. It is not required for the initiator of the close to wait for the responding close_notify alert before closing the read side of the connection.
(emphasis mine).
comment:24 Changed 11 years ago by
Based on the original quote it sounds like the "additional problem" noted in the ticket description is not an actual problem, and can be passed over. In any case, this ticket doesn't cover this "additional problem"; it is rather than expanded, much improved version of the fix originally done in a branch, in [31954]. That branch will be merged forward once this ticket is fixed to use the much better fix implemented here.
comment:25 Changed 11 years ago by
Keywords: | review removed |
---|---|
Owner: | set to Itamar Turner-Trauring |
Looking pretty good. I had a hard time understanding what was going on in this ticket until I independently rediscovered one of the bugs that it fixes and spent a good deal of time tracking it down. I think it's ready to land.
- Fix the merge conflicts; they're all pretty trivial (just imports).
- The test in
twisted.test.test_ssl
seems misplaced. It explicitly doesn't run unless it's using thetwisted.protocols.tls
/ memory BIO implementation, and I can't see anything specific toconnectSSL
/listenSSL
about the test; if it were intest_tls
and didn't go through those APIs, it could avoid using real sockets completely, and thereby be faster and more reliable. If this is actually covering code intwisted.internet._newtls
, its docstring should be a little clearer, because it sounds like it's just covering twisted.internet.ssl. - The '.bugfix' file ought to mention TLS close alerts specifically; "clean shutdowns" is a little vague.
- The fix for nondeterminism in
WebClientSSLTestCase.tearDown
is crappy. It doesn't need to be that bad. It does involve writing a bit of wrapping code, though, and this is probably deterministic enough for now. Feel free to just file a ticket for fixing that, though, I would be happy to take that over. - Also possibly for a separate documentation ticket, unless the way to
communicate this is immediately clear to you: we should have some language
somewhere around the transport guarantees provided by
twisted.protocols.tls
andreactor.(listen|connect)SSL
, and the required steps to avoid truncation attacks. For that matter, is there already a thing that says something about whenconnectionMade
is called with respect to negotiation? I can see in the implementation that it's called before negotiation is initiated, but the only thing I can find resembling documentation is a review comment on #4905; and it's a question, not a statement.
Thanks for doing this. Another tricky ticket down on the road to 11.1!
comment:26 Changed 11 years ago by
Thanks!
As far as 2 goes: first, the test is not specific to the new code. Rather, the old code is broken, and I don't feel like fixing it since it's going away, thus the skip. Second, the reason it's in test_ssl is that I was unable to reproduce the issue using the TLSMemoryBIOProtocol APIs directly. Doesn't mean it's impossible, but I tried and failed.
I'd be happy to open a new ticket for moving it, would that be sufficient? Having it in right place seems nice, but not necessary.
Re 5, what's a truncation attack?
comment:27 Changed 11 years ago by
Branch: | branches/tls-close-5118 → branches/tls-close-5118-2 |
---|
(In [32254]) Branching to 'tls-close-5118-2'
comment:29 Changed 11 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:30 Changed 11 years ago by
comment:31 Changed 11 years ago by
Resolution: | fixed |
---|---|
Status: | closed → reopened |
comment:32 Changed 11 years ago by
I cleared out the de-nagling feature and that seems to have fixed all the observed issues.
comment:33 Changed 11 years ago by
Author: | itamarst, exarkun → itamarst, exarkun, glyph |
---|---|
Keywords: | review added |
Owner: | Itamar Turner-Trauring deleted |
Status: | reopened → new |
comment:34 Changed 11 years ago by
Build Results, for the reviewer's convenience.
My own personal testing of Calendar Server indicates that the problems which caused the rollback are gone. Since the feature being removed may still be marginally useful to someone, I've filed #5196 for re-adding it.
comment:35 Changed 11 years ago by
Cc: | jesstess added |
---|---|
Keywords: | review removed |
Owner: | set to Glyph |
glyph, thanks for tackling this regression.
After forcing a few more builds, it looks like the downloadPage tests will consistently fail on Lucid on this branch:
http://buildbot.twistedmatrix.com/builders/lucid64-py2.6-poll http://buildbot.twistedmatrix.com/builders/lucid64-py2.6-epoll
comment:36 Changed 11 years ago by
#3796 fixes the downloadPage/ssl issues the real way. If that gets merged then this branch can drop all of the test_webclient.py
changes, I think.
comment:37 Changed 11 years ago by
The downloadPage tests fail because nagle is still enabled, so it can take many milliseconds more (I saw 30ms) to do the TLS close alert and then the TCP disconnect. (And now, back to my vacation - thanks for all your followup work!)
comment:38 Changed 11 years ago by
Branch: | branches/tls-close-5118-2 → branches/tls-close-5118-3 |
---|
(In [32330]) Branching to 'tls-close-5118-3'
comment:40 Changed 11 years ago by
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
(In [31970]) Branching to 'tls-close-5118'