Opened 11 years ago

Closed 11 years ago

#5118 release blocker: regression closed fixed (fixed)

Memory BIO TLS transport which receives a TLS close alert never sends a close alert

Reported by: Itamar Turner-Trauring Owned by: Glyph
Priority: normal Milestone: Twisted-11.1
Component: core Keywords:
Cc: ivank, jesstess Branch: branches/tls-close-5118-3
branch-diff, diff-cov, branch-cov, buildbot
Author: itamarst, exarkun, glyph

Description

BIO TLS transports don't disconnect cleanly; the side that initiated the disconnect always gets unclean disconnect. There is a test that attempts to check that the close handshake works correctly for BIO TLS transports (test_loseConnectionAfterHandshake), but it fails to assert some things and therefore doesn't catch the problem.

There is a fix for this, with test changes, in [31954]. However, this doesn't cover an additional problem: once a TLS transport receives the close alert, only one 215 chunk is written out before the underlying transport is closed, and any other buffered data is dropped and never sent.

Change History (40)

comment:1 Changed 11 years ago by itamarst

Author: itamarst
Branch: branches/tls-close-5118

(In [31970]) Branching to 'tls-close-5118'

comment:2 Changed 11 years ago by Itamar Turner-Trauring

Author: itamarstitamarst, exarkun
Keywords: review added

Ready for review. I expanded the test to have a large write to cover the additional problem mentioned in the original description and, strangely, it passes. Perhaps OpenSSL only exposes the TLS close alert once all the data has been flushed from the bio? Or perhaps my expanded test isn't covering that case correctly.

http://buildbot.twistedmatrix.com/boxes-supported?branch=/branches/tls-close-5118 for test runs.

comment:3 Changed 11 years ago by Itamar Turner-Trauring

Milestone: Twisted-11.1

comment:4 Changed 11 years ago by therve

Keywords: review removed
Owner: set to Itamar Turner-Trauring

My brain hurts a little now. But it seems fine, please merge.

comment:5 Changed 11 years ago by Jean-Paul Calderone

I rather dislike the style of test involving:

chunkOfBytes = "123456890" * 100000

particular if the author doesn't even claim to understand why it's there. I think the test should stick to what it was doing before, or there should be a better explanation of why it should write a huge pile of bytes.

comment:6 Changed 11 years ago by Itamar Turner-Trauring

Keywords: review added
Owner: Itamar Turner-Trauring deleted

I expanded the comment inside cbHandshake - is that sufficient? Is it understandable now?

What's unclear to me is why this passes, not why writing a huge pile of bytes is a good idea.

comment:7 Changed 11 years ago by Jean-Paul Calderone

I tested the branch against a Divmod branch which tries to port Q2Q from JUICE to AMP. Sadly, this does not fix an issue caused by the Unexpected EOF error. It might not be the fault of twisted.protocols.tls, it could be that AMP or Q2Q is doing something wrong, I'm not entirely sure yet. The code in question is at lp:~divmod-dev/divmod.org/amp-vertex-2580, and the test is vertex.test.test_q2q.VirtualConnection.testConnectWithIntroduction. It fails because the Unexpected EOF handler in twisted.protocols.tls turns the connectionLost reason into ConnectionLost and the test wants ConnectionDone.

comment:8 Changed 11 years ago by Itamar Turner-Trauring

Keywords: review removed
Owner: set to Itamar Turner-Trauring

I got the vertex test to pass... Basically remove self.transport.loseConnection from TLSMemoryBIOProtocol.loseConnection, and make TLSMemoryBIOProtocol return immediately if self.disconnecting is True.

I am investigating - seems to break SSL tests, and it's not quite clear how it fixes things :)

comment:9 Changed 11 years ago by Itamar Turner-Trauring

Hm. Is it possible AMP actually wants half-close?

comment:10 Changed 11 years ago by itamarst

(In [32020]) OK, I got it to point where vertex passes, only one ssl test fails (see below), and some HTTPS and POP3S tests fail due to dirty reactor. I have no test reproducing whatever vertex is triggering.

OTOH I improved things so that loseConnection() can be called twice without wacky side-effects, and write() after loseConnection() just drops the bytes (the latter at least should have a test).

The failing SSL test is testImmediateDisconnect: quite possibly we need handshake-done callback, it seems like calling shutdown() before the handshake is done doesn't actually work.

refs #5118

comment:11 Changed 11 years ago by itamarst

(In [32027]) Docstring for _write.

Remaining work: look at tests that fail with dirty reactor, write test for fact that write()/writeSequence() drop additional bytes after loseConnection.

refs #5118

comment:12 Changed 11 years ago by Itamar Turner-Trauring

Keywords: review added
Owner: changed from Itamar Turner-Trauring to Jean-Paul Calderone

OK, ready for review again. Changes:

  1. Implemented equivalent test to the failing vertex test, and fixed the issue. The test ended up in twisted.test.test_ssl, since my attempt to reproduce it in twisted.protocols.test.test_tls failed.
  2. Code refactoring and cleanup.
  3. Writes after loseConnection are dropped on the floor, as is the case with TCP connections.
  4. The user-level protocol has connectionLost called after the underlying transport has disconnected, instead of when the TLS shutdown has finished.
  5. Nagle is disabled on TCP transports when TLS shutdown is initiated; this speeds up shutdown, which otherwise would be delayed, since shutdown involves sending a small write.
  6. Some tests started failing due to TLS shutdown delaying disconnection a little. The pop3client test was fixed cleanly, but the webclient tests were fixed horribly. Perhaps a ticket for a longer-term solution would be in order, but I'm not sure what the fix should be. Most obvious solution is to delay firing Deferred returned from downloadPage until connection is lost, rather than until HTTP response is received; this is a non-trivial change so probably shouldn't be in this branch.

comment:13 Changed 11 years ago by ivank

Cc: ivank added

I don't have a full review, and this is assigned to exarkun, so I've left the review keyword alone.

I started up 5 instances of trial -u twisted.web.test.test_webclient on my 4 core Ubuntu VM, and within a few seconds, this test failed:

===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
Selectables:
<TLSMemoryBIOProtocol #0 on 35386>
<TLSMemoryBIOProtocol #1 on 35386>

twisted.web.test.test_webclient.WebClientSSLTestCase.test_downloadTimeout

I don't think raising the number 0.001 in slightDelay would be a good idea, though. I see two options, which itamar mentioned in #twisted:

  1. Teach trial about TLS connections that are closing and don't count the reactor as dirty.
  1. Make _newtls do unclean closes like the old TLS code.

comment:14 Changed 11 years ago by Itamar Turner-Trauring

Keywords: review removed
Owner: changed from Jean-Paul Calderone to Itamar Turner-Trauring

The current plan:

  1. make HTTPClient SSL tests spin the reactor a bit more, so they pass more consistently (apparently they are fragile with old TLS implementation as well).
  2. File ticket for using half-close to speed up TLS shutdown. This also "fixes" the downloadPage tests, though still leaving them fragile, so maybe we'll remove point 1 once we do that.

The motivation: we're trying to get rid of downloadPage and friends anyway, so putting lots of work into them is a waste of time, and this problem is not really user-facing.

comment:15 Changed 11 years ago by Itamar Turner-Trauring

Keywords: review added
Owner: changed from Itamar Turner-Trauring to Jean-Paul Calderone

Done and done (#5150). Ready for review.

Actually: I notice the comment about Error case in TLSMemoryBIOProtocol._write is no longer correct, I'm pretty sure the error *will* reach user protocol's connectionLost, unlike old code. Please note in review which behavior is preferred and I'll fix the comment or the code. Personally I think the error should be passed to user; this is useful debugging info which otherwise will be swallowed.

comment:16 Changed 11 years ago by Jean-Paul Calderone

Keywords: review removed
Owner: changed from Jean-Paul Calderone to Itamar Turner-Trauring
  1. in test_pop3client.py
    1. POP3TLSTestCase and POP3.testStartTLS are missing docstrings, and testStartTLS should be renamed in accordance with the coding standard.
    2. There's some trailing whitespace on the new return connLostDeferred line
    3. The test could probably be simplified by doing an addCleanup(port.stopListening) early on, instead of adding it to the Deferred chain near the end.
  2. The test coverage of twisted/protocols/tls.py is worse in the branch than it was in trunk; 5 new uncovered lines (total of 10) and 3 new uncovered branches (total of 4). The output of coverage.py can enumerate the cases more easily than I can.
  3. The if self.disconnecting: check in write is going to immediately cause problems, isn't it? It will break some producer/consumer behavior and we'll have to fix it in #5063.
  4. There's a pyflakes warning in twisted/protocols/tls.py

I think I complained already about some of the (new/changed in the branch) tests, it probably doesn't help to repeat those. Maybe I can try to write some better ones at some point though.

comment:17 Changed 11 years ago by Itamar Turner-Trauring

Re 3, I think that breaking producers slighty more is fine if it improves things for the non-producer case. Producers are messed up already and this will be fixed ASAP in the producer branch (just need to make sure it has test for write after loseConnection).

Re 2... am I responsible for fixing branches I didn't write? Cause I'm not sure I understand the code well enough to do all of them. My plan is to remove a few branches that are probably unnecessary, and add some more tests to test_tcp to improve coverage, but I'm not sure I'll be able to do everything. We'll see how far I get.

As far as your complaints about the tests... is it that some of them were in test_ssl? The vertex-like test is in test_ssl and will continue to be there unless someone else can reproduce it some other way; my attempt to do so in test_tls failed.

comment:18 Changed 11 years ago by Jean-Paul Calderone

Re 2... am I responsible for fixing branches I didn't write?

If your changes caused the tests to no longer exercise them. :)

comment:19 Changed 11 years ago by itamarst

(In [32133]) AFAICT Unexpected EOF only happens when using BIO that talks to e.g. socket (in old SSL code it's only checked on SysCallError), and so it doesn't make sense for memory BIO. I therefore removed the check.

Refs #5118

comment:20 Changed 11 years ago by itamarst

(In [32137]) Get rid of edge case (write blocked on read, and we get valid bytes to BIO *after* TLS connection has closed) which seems implausible, and should, anyway, be caught by the BIO complaining about post-shutdown writes.

Refs #5118

comment:21 Changed 11 years ago by Itamar Turner-Trauring

1 and 4 were fixed, 3 will be dealt with in #5063 branch. I have no idea how to test the remaining two edge cases (Unexpected EOF, and the error case in _write); they were tested before, but possibly due to unclean shutdown, or worse error handling elsewhere. Suggestions?

comment:22 Changed 11 years ago by Itamar Turner-Trauring

Keywords: review added
Owner: Itamar Turner-Trauring deleted

OK, I have added test coverage for remaining two branches. Note that for full coverage you need to run twisted.protocols.test.test_tls, twisted.internet.test.test_tls and twisted.test.test_ssl. Some branches, in particular _tlsShutdown in _write(), only cause failures when removed on IOCP reactors, but they are necessary.

All tests pass: http://buildbot.twistedmatrix.com/boxes-supported?branch=/branches/tls-close-5118

comment:23 in reply to:  description Changed 11 years ago by Glyph

Replying to itamar:

There is a fix for this, with test changes, in [31954]. However, this doesn't cover an additional problem: once a TLS transport receives the close alert, only one 215 chunk is written out before the underlying transport is closed, and any other buffered data is dropped and never sent.

This problem actually sounds like an almost-correct interpretation of the spec:

The other party (which has received a close_notify) MUST respond with a close_notify alert of its own and close down the connection immediately, discarding any pending writes. It is not required for the initiator of the close to wait for the responding close_notify alert before closing the read side of the connection.

(emphasis mine).

comment:24 Changed 11 years ago by Itamar Turner-Trauring

Based on the original quote it sounds like the "additional problem" noted in the ticket description is not an actual problem, and can be passed over. In any case, this ticket doesn't cover this "additional problem"; it is rather than expanded, much improved version of the fix originally done in a branch, in [31954]. That branch will be merged forward once this ticket is fixed to use the much better fix implemented here.

comment:25 Changed 11 years ago by Glyph

Keywords: review removed
Owner: set to Itamar Turner-Trauring

Looking pretty good. I had a hard time understanding what was going on in this ticket until I independently rediscovered one of the bugs that it fixes and spent a good deal of time tracking it down. I think it's ready to land.

  1. Fix the merge conflicts; they're all pretty trivial (just imports).
  2. The test in twisted.test.test_ssl seems misplaced. It explicitly doesn't run unless it's using the twisted.protocols.tls / memory BIO implementation, and I can't see anything specific to connectSSL/listenSSL about the test; if it were in test_tls and didn't go through those APIs, it could avoid using real sockets completely, and thereby be faster and more reliable. If this is actually covering code in twisted.internet._newtls, its docstring should be a little clearer, because it sounds like it's just covering twisted.internet.ssl.
  3. The '.bugfix' file ought to mention TLS close alerts specifically; "clean shutdowns" is a little vague.
  4. The fix for nondeterminism in WebClientSSLTestCase.tearDown is crappy. It doesn't need to be that bad. It does involve writing a bit of wrapping code, though, and this is probably deterministic enough for now. Feel free to just file a ticket for fixing that, though, I would be happy to take that over.
  5. Also possibly for a separate documentation ticket, unless the way to communicate this is immediately clear to you: we should have some language somewhere around the transport guarantees provided by twisted.protocols.tls and reactor.(listen|connect)SSL, and the required steps to avoid truncation attacks. For that matter, is there already a thing that says something about when connectionMade is called with respect to negotiation? I can see in the implementation that it's called before negotiation is initiated, but the only thing I can find resembling documentation is a review comment on #4905; and it's a question, not a statement.

Thanks for doing this. Another tricky ticket down on the road to 11.1!

comment:26 Changed 11 years ago by Itamar Turner-Trauring

Thanks!

As far as 2 goes: first, the test is not specific to the new code. Rather, the old code is broken, and I don't feel like fixing it since it's going away, thus the skip. Second, the reason it's in test_ssl is that I was unable to reproduce the issue using the TLSMemoryBIOProtocol APIs directly. Doesn't mean it's impossible, but I tried and failed.

I'd be happy to open a new ticket for moving it, would that be sufficient? Having it in right place seems nice, but not necessary.

Re 5, what's a truncation attack?

comment:27 Changed 11 years ago by itamarst

Branch: branches/tls-close-5118branches/tls-close-5118-2

(In [32254]) Branching to 'tls-close-5118-2'

comment:28 Changed 11 years ago by itamarst

(In [32256]) Address review comment 3. Refs #5118

comment:29 Changed 11 years ago by itamarst

Resolution: fixed
Status: newclosed

(In [32257]) Merge tls-close-5118-2.

Author: itamar, exarkun Reviewer: exarkun, glyph Fixes: #5118

TLS close alerts are now exchanged correctly, leading to clean TLS disconnects. Also, lots of code cleanup.

comment:30 Changed 11 years ago by Itamar Turner-Trauring

I created tickets #5182, #5183, #5184, #5185 to address comments from the review as well as one I found on my own.

comment:31 Changed 11 years ago by Jean-Paul Calderone

Resolution: fixed
Status: closedreopened

(In [32274]) Revert r32257 - this leads to problems in some real SSL-using software

Reopens: #5118

See #5187 and #5188 for details about the problems introduced.

comment:32 Changed 11 years ago by Glyph

I cleared out the de-nagling feature and that seems to have fixed all the observed issues.

comment:33 Changed 11 years ago by Glyph

Author: itamarst, exarkunitamarst, exarkun, glyph
Keywords: review added
Owner: Itamar Turner-Trauring deleted
Status: reopenednew

comment:34 Changed 11 years ago by Glyph

Build Results, for the reviewer's convenience.

My own personal testing of Calendar Server indicates that the problems which caused the rollback are gone. Since the feature being removed may still be marginally useful to someone, I've filed #5196 for re-adding it.

comment:35 Changed 11 years ago by jesstess

Cc: jesstess added
Keywords: review removed
Owner: set to Glyph

glyph, thanks for tackling this regression.

After forcing a few more builds, it looks like the downloadPage tests will consistently fail on Lucid on this branch:

http://buildbot.twistedmatrix.com/builders/lucid64-py2.6-poll http://buildbot.twistedmatrix.com/builders/lucid64-py2.6-epoll

comment:36 Changed 11 years ago by Jean-Paul Calderone

#3796 fixes the downloadPage/ssl issues the real way. If that gets merged then this branch can drop all of the test_webclient.py changes, I think.

comment:37 Changed 11 years ago by Itamar Turner-Trauring

The downloadPage tests fail because nagle is still enabled, so it can take many milliseconds more (I saw 30ms) to do the TLS close alert and then the TCP disconnect. (And now, back to my vacation - thanks for all your followup work!)

comment:38 Changed 11 years ago by Jean-Paul Calderone

Branch: branches/tls-close-5118-2branches/tls-close-5118-3

(In [32330]) Branching to 'tls-close-5118-3'

comment:39 Changed 11 years ago by Glyph

Status: newassigned

Landing. Thanks for the review, jess.

comment:40 Changed 11 years ago by Glyph

Resolution: fixed
Status: assignedclosed

(In [32332]) Merge tls-close-5118-3.

Author: itamar, exarkun, glyph

Reviewer: therve, exarkun, glyph, jesstess

Fixes: #5118

twisted.protocols.tls, and SSL/TLS support in general, now do clean TLS close alerts when disconnecting.

Note: See TracTickets for help on using tickets.