Opened 11 years ago

Closed 3 years ago

Last modified 2 years ago

#78 defect closed fixed (fixed)

abortConnection() method for transports

Reported by: itamarst Owned by: itamar
Priority: high Milestone:
Component: core Keywords:
Cc: glyph, itamarst, jknight, naked, exarkun, ivank, spiv, PenguinOfDoom, me@…, ishimoto@…, accounts@…, itamar@… Branch: branches/abortConnection-78-10
(diff, github, buildbot, log)
Author: itamarst, jknight, exarkun, ivank, glyph Launchpad Bug:

Description


Attachments (5)

abortConnection.diff (1.8 KB) - added by naked 10 years ago.
abortConnection2.diff (4.1 KB) - added by jknight 10 years ago.
abortConnection-78-20090422.diff (7.5 KB) - added by ivank 5 years ago.
patch based on abortConnection-78 r17458 but fixed for today's trunk, as well as some code style changes
abortConnection-78-20091225.diff (10.8 KB) - added by ivank 5 years ago.
see comment below
builditbetter.py (1.0 KB) - added by glyph 3 years ago.
fd leak test

Download all attachments as: .zip

Change History (93)

comment:1 Changed 11 years ago by itamarst

Similar to loseConnection(), but doesn't flush the write
buffers or wait for producer to finish. Useful for when you
know connection should die ASAP, because the other side is
totally screwed up anyway.

comment:2 Changed 11 years ago by glyph

I hope you don't object.  BUT I DON'T CARE IF YOU DO, HAHA

comment:3 Changed 11 years ago by glyph

This should probably just be an optional argument to
loseConnection, i.e.

    loseConnection(terminateTimeout=600)

A lot of code uses loseConnection() and expects the socket
to be closed eventually, but it is pretty easy to construct
an attack which will cause Twisted to never ever close a
socket this way.  By adding an optional argument to this
interface, we can enhance the semantics of the existing
method in a mostly-backwards-compatible way.

comment:4 Changed 11 years ago by itamarst

Sounds good. Maybe with a default timeout? Which should
probably be high enough that existing code won't be affected
by slow connections (and for producers this is especially an
issue, though I'm not sure if we have any code that depends
on loseConnection's producer interaction).

comment:5 Changed 11 years ago by glyph

I bet that our existing code deals with
loseConnection+Producers differently in different reactors.
 That should be on our list of things to clean up when we
revamp the Producer API.

comment:6 Changed 11 years ago by itamarst

Actually it's in tcp.py (or actually maybe abstract.py), so
the reactor shouldn't matter.

comment:7 Changed 10 years ago by itamarst

Hm. Wouldn't loseConnection time out potentially break existing code? Like,
"send really huge file and then loseConnection". Of course, if the default
timeout was large enough it'd mostly be OK...

Changed 10 years ago by naked

comment:8 Changed 10 years ago by naked

I wrote a patch to this effect, although the method is called abortConnection().

My needs were to explicitly close a connection without waiting for write
availability on the socket/whatever first. Also this means that with SSL, we
don't send or wait for the reply to close notify, which is important.

There is a problem though, that in the normal case, connectionLost() is invoked
from the reactor after doWrite or doRead has returned CONNECTION_LOST or
CONNECTION_DONE. In this case, connectionLost() is called directly as a result
of calling abortConnection(). I do not know enough about twisted internals to
say if this is a serious problem or not, and how it should best be fixed.

comment:9 Changed 10 years ago by glyph

We do have code that depends on that behavior.  Quite a bit, actually.  I think I prefer 
abortConnection(), since terminateTimeout is a convenience and will have all the issues normally 
associated with callLater.

comment:10 Changed 10 years ago by jknight

I think there's two issues here. One is a API to abort the connection. The other is the timeout 
issue.

For the write timeout issue, I don't think an absolute timeout on loseConnection is the right 
thing. Instead, it should have behavior like: "If there is data in the write buffer, and I have not sent any 
bytes for the last minute, the connection is as good as dead." I believe this behavior can and should 
default to on without compatibility problems.

Changed 10 years ago by jknight

comment:11 Changed 10 years ago by jknight

Here's a different patch. transport.abortConnection() loses the connection and tcp's _closeSocket is 
updated to do an abrupt TCP disconnect unless the reason is ConnectionDone. Seems like it works but 
mostly untested.

comment:12 Changed 8 years ago by foom

(In [17458]) An abortConnection() function for transport, and some tests.

Refs #78

comment:13 Changed 8 years ago by jknight

  • Owner changed from itamarst to jknight
  • Summary changed from terminateConnection() method for transports to abortConnection() method for transports

comment:14 Changed 7 years ago by kvogt

This is actually a pretty evil security hole. I logged into a server to see why it was out of memory, and found 1434 open connections with no activity. A malicious user was opening hundreds of connections to a twisted server but not reading from them. Since loseConnection() won't close the sockets unless the write buffer is empty, I have no way to fight this attack at the application level.

We have had a twisted IRC server and an entire twisted live video cluster (93 nodes) get destroyed by this exact exploit once a certain hacking group figured it out. In the mean time I'm patching in abortConnection2.diff, but I thought this real world example might be useful information for people trying to fix this.

comment:15 Changed 7 years ago by itamarst

  • Priority changed from normal to high
  • Type changed from enhancement to defect

I'm torn between happiness at live video clustering and sadness about the exploit. Upping priority, changing to defect.

comment:16 Changed 7 years ago by glyph

I'm a little concerned about the form of the solution here. It means that every protocol that is not specifically designed to be resistant to this particular exploit (in other words, all the protocols in Twisted) will continue to be a problem. Perhaps a better way to deal with the exploit would be to have a default timeout on loseConnection if no data is being consumed by the client? Or an easy way to set up administrative limits on the number of simultaneous connections from a single IP?

I haven't thought too hard about the solution yet, but I definitely don't think we could consider this form of DoS dealt with if you have to hand-patch some code for every port that you want to open.

comment:17 Changed 7 years ago by exarkun

Mechanism separate from policy.

comment:18 Changed 7 years ago by itamarst

  • Cc exarkun added

Obviously we should have abortConnection that is separate from loseConnection. The question is, should loseConnection additionally have some sort of a default timeout case where it calls abortConnection. loseConnection essentially is a policy, since it can be implemented in terms of producer API + abortConnection.

comment:19 Changed 7 years ago by glyph

I was not trying to make a point about the validity of this ticket.

I was responding to the comment by kvogt that "This is actually a pretty evil security hole". This ticket isn't going to fix any security hole (or, more accurately, DoS). It may be a necessary prerequisite, but we should probably have a separate issue to deal with the larger problem.

comment:20 Changed 5 years ago by exarkun

  • Branch set to /branches/abortConnection-78

Changed 5 years ago by ivank

patch based on abortConnection-78 r17458 but fixed for today's trunk, as well as some code style changes

comment:21 Changed 5 years ago by ivank

  • Keywords security added

comment:22 Changed 5 years ago by ivank

Many tests fail with abortConnection-78 when using iocpreactor. One issue is that it doesn't pass in the 'orderly' argument. I think there are other issues too.

comment:23 Changed 5 years ago by ivank

The patch causes almost all SSL-related tests to fail on Windows, even with the select reactor. If I discover why, I will post here.

comment:24 Changed 5 years ago by ivank

By SSL-related, I meant anything that uses SSL. Not SSL tests in particular.

comment:25 Changed 5 years ago by ivank

Why the tests fail on Windows:

On Windows, twisted.internet.tcp.Connection.connectionLost is incorrectly called with reason ConnectionLost even for "normal" disconnects.

[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.

Sometimes you see ConnectionDone, but most of the time it is the incorrect ConnectionLost. It "should" be called with reason ConnectionDone (the correct behavior, which you see on Linux)

This incorrect reason causes a disorderly _closeSocket on almost every SSL connection, which causes data loss and test failures.

Hopefully someone who knows how PyOpenSSL and OpenSSL work will comment :-)

comment:26 Changed 5 years ago by itamar

Could you open a ticket for the windows connectionlost/connectiondone issue you discovered? With a failing test or minimal example if at all possible. Thanks!

comment:27 Changed 5 years ago by exarkun

Shouldn't we just fix this patch? The ticket is still open, after all...

comment:28 Changed 5 years ago by itamar

Oh, I hadn't realized the bug was in the patch. Good plan!

comment:29 Changed 5 years ago by ivank

  • Cc ivank added

comment:30 Changed 5 years ago by truekonrads

  • Cc truekonrads added

Lo and behold!
I have the connectionLost example nailed down:

from twisted.internet import win32eventreactor 
win32eventreactor.install()

from twisted.internet import reactor
from twisted.protocols.policies import TrafficLoggingFactory
from twisted.web.client import HTTPClientFactory,_parse
from twisted.python.util import println
from twisted.internet.defer import DeferredList
from OpenSSL import SSL

class NoVerifyClientContextFactory:
    """A context factory for SSL clients."""

    isClient = 1
    method = SSL.SSLv3_METHOD

    def getContext(self):
        def x(*args):
            return True
        ctx=SSL.Context(self.method)
        #print dir(ctx)
        ctx.set_verify(SSL.VERIFY_NONE,x)
        return ctx


url="https://online.lkb.lv/"
httpfactory=HTTPClientFactory(url)
dl=[]
for i in xrange(0,10):
    print i
    factory = TrafficLoggingFactory(httpfactory,"C:\\foo%i" %i)
    dl.append(httpfactory.deferred)
    pscheme,host, port,ppath = _parse(url)
    reactor.connectSSL(host, port, factory,NoVerifyClientContextFactory())
reactor.run()
DeferredList(dl).addCallback(lambda _:reactor.stop())                                     

Requirmentes are win32eventreactor and SSL connections!

comment:31 Changed 5 years ago by ivank

I've been looking at this again. I noticed that abortConnection in abstract.py does a very wrong thing -- it calls connectionLost synchronously, leading to some really broken behavior where a server calling abortConnection would have connectionLost called twice.

The 20091225 patch makes abortConnection empty the buffers and go through the typical loseConnection codepath, but with changes to return the correct error reason in _postLoseConnection. Hopefully someone knows if these changes are on the right track.

Like the previous patches, this patch patch breaks the IOCP reactor, and still has the OpenSSL problem mentioned above (so it breaks many tests that use SSL on Windows). It also needs a test for a server aborting a client, not just a client aborting. There's also likely an untested codepath in _sendCloseAlert, but it is beyond my abilities to describe.

Changed 5 years ago by ivank

see comment below

comment:32 Changed 5 years ago by ivank

Not below, above.

comment:33 Changed 5 years ago by spiv

  • Cc spiv added

comment:34 Changed 5 years ago by ivank

Here is why I suspected there is an untested codepath in _sendCloseAlert: before I changed this in _closeWriteConnection, all of the tests still passed:

-	        if result is main.CONNECTION_DONE: 
+	        if shouldClose:

comment:35 Changed 5 years ago by PenguinOfDoom

  • Cc PenguinOfDoom added

comment:36 Changed 5 years ago by truekonrads

  • Cc truekonrads removed

comment:37 Changed 5 years ago by glyph

  • Author set to glyph
  • Branch changed from /branches/abortConnection-78 to branches/abortConnection-78-2

(In [28708]) Branching to 'abortConnection-78-2'

comment:38 Changed 5 years ago by glyph

(In [28709]) Pull the changes forward and make the changes _slight_ less awful re #78

comment:39 Changed 4 years ago by exarkun

  • Author changed from glyph to exarkun, glyph
  • Branch changed from branches/abortConnection-78-2 to branches/abortConnection-78-3

(In [29980]) Branching to 'abortConnection-78-3'

comment:40 Changed 4 years ago by smira

  • Cc me@… added

comment:41 Changed 4 years ago by atsuoi

  • Cc ishimoto@… added

comment:42 Changed 4 years ago by kontinuity

  • Cc accounts@… added

comment:43 Changed 4 years ago by <automation>

  • Owner jknight deleted

comment:44 Changed 4 years ago by ivank

Given the SSL-related issues here, this will be a lot easier after #4854 is done.

comment:45 Changed 3 years ago by itamar

  • Cc itamar@… added

What's the motivation for sending a RST instead of a FIN when doing abortConnection()? I'm preparing to rebranch this, and then maybe try to get everything working, so I'm trying to understand the code and its choices.

comment:46 follow-up: Changed 3 years ago by naked

I think sending RST is somewhat a separate issue.

The main issue is freeing resources for the userland process. Just calling close() on the socket achieves this well enough (as long as no pending writes are waited and no SSL close is waited). This looks like an orderly shutdown to the remote end (unless there's SSL) in most cases. The exception to this is if the other end sends a data packet after we have called close() - in this case, the kernel will send a RST packet to the connection. (This could be avoided by never calling close(), but by calling shutdown() to close the sending end of the connection only.)

However, if the connections are indeed malicious, it might be that the wish is to free up resources for the kernel as well - atleast as much as is possible. Normally, the kernel keeps sending already written data after close(), and keeps handling the socket shutdown FIN exchanges. If a socket is closed with SO_LINGER on and it's timeout set to zero, the kernel will discard its write buffers and send a RST to the other end. (I'm not sure if this could end up being an orderly shutdown if buffers are empty.) This way the kernel can free up most of the memory immediately, but it will still keep TIME_WAIT sockets around for quite a while.

But, I must say, I believe Twisted is going to be the only project anywhere that is going to bother with tricking around with SO_LINGER. And the whole thing is actually even somewhat non-portable.

The feature has been missing for 8 years - and for 8 years I've had to add the abortConnection() API to every single server I've written, since every one of them have had a need for it. Perhaps it is time to finally slay the beast?

comment:47 Changed 3 years ago by itamar

I don't think the RST issue is a problem, I was just curious. And finally fixing this is my goal, yes.

More problematic: calling connectionLost directly from abortConnection will have reentrancy issues, leading to potential connectionLost being called twice. This was the original patch. The new patch (apparently never applied to subversion for some reason) relies via loseConnection on the socket eventually becoming writeable... which is exactly what won't happen if you're dealing with a wedged other side that has stopped reading.

The solution is probably going to be a horrible combination of callLater(0, self.connectionLost), overriding doRead and doWrite to always return CONNECTION_LOST, and then making sure connectionLost can handle being called twice. Unless someone has a better idea?

(It's time to replace FileDescriptor with something that has an explicit state machine... but I'll only think about that once this is done, since this is more immediately important.)

comment:48 Changed 3 years ago by itamarst

  • Author changed from exarkun, glyph to itamarst, exarkun, glyph
  • Branch changed from branches/abortConnection-78-3 to branches/abortConnection-78-4

(In [31596]) Branching to 'abortConnection-78-4'

comment:49 Changed 3 years ago by itamar

  • Author changed from itamarst, exarkun, glyph to exarkun, glyph
  • Branch changed from branches/abortConnection-78-4 to branches/abortConnection-78-3

So, I wrote some tests, and read some code. What I learned:

  1. There *already* exists code that calls FileDescriptor.connectionLost directly in a code path that may be reentrant: loseConnection if you've previously done loseWriteConnection. It wouldn't be reentrant with doWrite though.
  1. I wrote a test that simulates the likely use case of "abortConnection if other sides stop reading what I'm writing". It works both with the "call FileDescriptor.connectionLost immediately" implementation, and, contrary to my expectations, with another implementation based on loseConnection.
  1. Writing tests for this is hard.

Item 2 has me very confused. Anyone care to take a look and see what exactly is wrong with my code and/or conceptual model? I would think the file descriptor would never be writeable if the other side stopped reading, thus doWrite would never be called, and yet it is.

comment:50 Changed 3 years ago by naked

Sorry I can't test this more fully, but some just a simple comment:

Do verify with wireshark that the connection is really in the state you expect it to be in. Basically, what it should be in is a state where the the TCP window advertised by the recipient is zero, and the sender is periodically testing whether the window is still zero. The timeout is doubled on each try, so it might take a long time for you to see the retransmits.

So, tcpdump or wireshark, and the response packets from server should end up looking like this:

10:40:07.413776 IP localhost.14141 > localhost.41522: Flags ., ack 1, win 0, options [nop,nop,TS val 735552 ecr 707404], length 0

The important parts are: ACK packet, window size 0.

When the connection is in a stable state like this, the normal loseConnection() should be called.

comment:51 Changed 3 years ago by naked

And obviously, the normal loseConnection() shouldn't be able to close the socket, while abortConnection() should! :-)

comment:52 Changed 3 years ago by itamarst

  • Author changed from exarkun, glyph to itamarst, exarkun, glyph
  • Branch changed from branches/abortConnection-78-3 to branches/abortConnection-78-5

(In [31784]) Branching to 'abortConnection-78-5'

comment:53 Changed 3 years ago by itamar

  • Keywords review added

Review time! Out of desperation, in part. My questions for reviewers, in addition to whatever you can find that's wrong:

  1. What tests are still missing? I still suspect the possibility of FileDescriptor.connectionLost being called twice, somehow.
  1. http://buildbot.twistedmatrix.com/boxes-supported?branch=/branches/abortConnection-78-5 has py-without-middles not skipping the new TLS tests I added. Any idea why?
  1. What is causing the failing tests in (some of) the Windows buildslaves? They pass on trunk.

comment:54 Changed 3 years ago by exarkun

  • Keywords review removed
  • Owner set to itamar
  1. For TLS, test_fullWriteBufferAfterByteExchange takes foreeeeeeeeeeeeeeever. I would really like to not add half a dozen new tests that take 45 seconds of 100% CPU each. The branch's twisted.internet.test.test_tls takes longer than the whole trunk test suite. I don't really understand why this is. It happens with _newtls but not _oldtls. Does that mean it's caused by the producer bugs in _newtls? But then why does it peg the CPU instead of idle for a while?
  2. The skip logic for ReactorBuilder tests is in buildReactor. AbortSSLConnectionTest uses SSL APIs before it calls buildReactor though.
  3. Lots of trailing whitespace to clean up
  4. You should merge forward again:
    1. pick up the Failure fix so 'trial --debug' works
    2. there are some new tests in trunk that fail when this branch is merged into it, twisted.internet.test.test_tcp.TCPPortTestsBuilder.test_serverGetHostOnIPv4. I expect this is straightforward to fix, just client.getpeername() earlier.
  5. Might the previous point suggest the cause of the failing tests on Windows? They mainly appear to be failing with unexpected ConnectionLost failures. It could be that whereas previously the connections were not shut down hard enough to for the client to notice (or to notice quickly), now they are and the tests don't account for this. Why it only happens on Windows, I don't know, I can only conjecture that it is similar in cause to all of the other issues in which Windows behaves differently from other platforms.
  6. classImplements in twisted/internet/tcp.py is imported but unused
  7. It would be nice to use endpoints instead of ClientCreator in all new code, I think.
  8. The if self.disconnected: check in readConnectionLost in twisted/internet/tcp.py doesn't appear to be executed by the test suite
  9. Any reason to directly compare reason.value.__class__ in connectionLost in twisted/internet/tcp.py and twisted/internet/iocpreactor/tcp.py instead of using Failure.check?
  10. in twisted/internet/_oldtls.py, the modified conditionals in stopReading and stopWriting are only half tested. The test suite doesn't exercise the don't-call-self._base-method case of either one. Maybe I'm not really worried about this, but I'd like to at least know which case the new if self.disconnected check is handling there. Add a comment or something?
  11. In runAbortTest, I think rather than deferLater, it might be better to just verify there are no extra event sources left in the reactor and then finish right away.

Each of the actual new unit tests probably need more consideration than I have given them. Broadly the idea seems correct, but I don't really understand them completely yet. Hopefully this is enough of a review for you to make some further progress though!

comment:55 Changed 3 years ago by itamarst

  • Branch changed from branches/abortConnection-78-5 to branches/abortConnection-78-6

(In [31911]) Branching to 'abortConnection-78-6'

comment:56 Changed 3 years ago by itamarst

(In [31916]) "Fix" slowness in new TLS code by doing 100 smaller transport.write()s instead of one large one.

Refs #78

comment:57 Changed 3 years ago by itamar

Further investigation of the TLS test slowness suggests the problem may not be in TLS producer code, though I did see hints it may be buggy, but perhaps in different way than I originally thought. Rather, it seems that writing a really large chunk of data to a (t.p.tls) TLS transport is far more expensive than using multiple writes in a row (see [31916]). This seems like a bug in the new TLS code.

comment:58 Changed 3 years ago by itamar

Remaining work include figuring out why the changes to readConnectionLost and old SSL code's start/stopReading are necessary, getting rid of ClientCreator and maybe some whitespace cleanup.

Work will continue once #5063 is fixed and we can merge forward; this will allow getting rid of some code to workaround #5063.

comment:59 Changed 3 years ago by itamarst

  • Branch changed from branches/abortConnection-78-6 to branches/abortConnection-78-7

(In [32373]) Branching to 'abortConnection-78-7'

comment:60 Changed 3 years ago by itamarst

  • Branch changed from branches/abortConnection-78-7 to branches/abortConnection-78-8

(In [32457]) Branching to 'abortConnection-78-8'

comment:61 Changed 3 years ago by itamar

  • Owner changed from itamar to PenguinOfDoom

Pavel, could you take a look at the IOCP code and see if it makes sense to you? Thanks!

comment:62 Changed 3 years ago by itamar

  • Keywords review added

Ready for review, I guess. Some notes:

  1. I'd like feedback from Pavel about IOCP version.
  2. http://buildbot.twistedmatrix.com/boxes-supported?branch=/branches/abortConnection-78-8 has some remaining mysterious errors which seem like they shouldn't be my fault but maybe they are (py-without-models and the fedora builder - the win32 interface/adapter ones are known issue, I think.) Thoughts?
  3. The change to _oldtls appears necessary for tests to pass, but makes me slightly wary given _oldtls sets disconnected=True even when it's not *really* disconnected...
  4. Use callLater(0, f) to make protocol's connectionLost call non-reentrant. Sorry, but there's no better API at the moment; we can fix it along with all the other internal uses of this idiom when we do have a better API.

comment:63 Changed 3 years ago by itamar

  • Owner PenguinOfDoom deleted

Anyone can review!

comment:64 Changed 3 years ago by exarkun

  • Keywords review removed
  • Owner set to itamar
  1. The buildbot failures look pretty unrelated to me, but the rhel/fedora failures on gtk may mean that these new tests push us over some limit (remember, gtk leaks file descriptors like crazy, probably at least 1 per reactor).
  2. The howto says that connectionLost will be invoked re-entrantly, but it looks like the implementation goes for non-re-entrant.
  3. in _SocketCloser._closeSocket, I'm not sure how accurate the last bit of the comment is. I think SO_LINGER affects shutdown the same way it affects close. It seems fine to skip the shutdown in the non-orderly case, but if it's not actually required then the comment shouldn't say it is.
  4. Oh no, Connection._aborting. This absolutely must to be the last ad hoc state attribute we add to transports. After this, anything new needs to be added to an explicit state machine. For now, please add documentation for the attribute.
  5. Oh no, hasattr in Connection.connectionLost. Sure would be nice to have an explicit state machine. :)
  6. the abortConnection in twisted/internet/tcp.py is almost identical to twisted/internet/iocpreactor/tcp.py. They probably should be identical, and it would be nice to avoid the duplication.
  7. In twisted/internet/test/test_tcp.py, minimizeTCPBuffers docstring is misformatted. Also, it'd be nice to know why this is a useful thing to do, either with more docstring or comments at the call sites.
  8. Also, it has an XXX in it.
  9. StreamingProducerClient.write seems to provoke pauseProducing after just a single write. Does it need to do 99 more? Could it early out after pauseProducing? And should it be upset if it gets through all the writes without being paused?
  10. Why 0.05 in StreamingProducerClient.pauseProducing?
  11. in twisted/internet/test/test_tls.py, ContextGeneratingMixin should be new-style.
  12. And it looks like AbortSSLConnectionTest.buildReactor could use self.patch to be a little simpler.
  13. The news fragment could talk about TCP and SSL I guess, since SSL seems to support this feature as well. Also that comma in the middle is gross. :) ... abortConnection() which, unlike loseConnection(), always ...
  14. Which test exercises the self.stopWriting() fix in twisted/internet/iocpreactor/abstract.py (line ~365)?
  15. I have some reservations about the ReadAbortServerProtocol-based tests. These are providing a pretty strong (but fuzzy) guarantee - no bytes written in the same reactor iteration as an abortConnection call will be sent. If we wanted to switch back to synchronous write attempts, these tests will become unhappy. I don't know what to do about this, maybe nothing.
  16. The docstrings for the multi-abortConnection-call test methods confused me into thinking that multiple calls to loseConnection would raise an exception.
  17. All the runAbortTest-based tests are great, as far as readability goes. It took a couple minutes to figure out what runAbortTest does, but then the individual test methods were a breeze.
  18. Some of their docstrings could be expanded though. dataReceived calls abortConnection(), and then throws exception. - yea, then what?

comment:65 Changed 3 years ago by itamar

  • Author changed from itamarst, exarkun, glyph to itamarst, exarkun, glyph, jknight, ivank
  • Keywords security removed

Re #1... not really sure what you're suggesting. None of the new tests open many connections...

Re #14, quite possibly there is no test covering it. But I've written enough convoluted tests. If you *really* feel this is a problem, I can revert that fix :)

All other comments have been addressed. I just have to figure out why merging forward breaks a test.

comment:66 Changed 3 years ago by itamarst

  • Author changed from itamarst, exarkun, glyph, jknight, ivank to itamarst, jknight, exarkun, ivank, glyph
  • Branch changed from branches/abortConnection-78-8 to branches/abortConnection-78-9

(In [32604]) Branching to 'abortConnection-78-9'

comment:67 Changed 3 years ago by itamar

  • Keywords review added

Noooooooooooooooooooooooooooooooooo f u cfreactor:

[ERROR]
Traceback (most recent call last):
Failure: exceptions.RuntimeError: BUG: We should be paused a this point.

twisted.internet.test.test_tls.AbortSSLConnectionTest_CFReactor.test_fullWriteBufferAfterByteExchange

Doesn't happen on the other OS X builder, strangely. Can I just ignore that? Please? Or mark it failing? I don't want to have to deal with another crappy broken reactor :(

There's still also random mystery errors elsewhere (item 1) -- suggestions?

http://buildbot.twistedmatrix.com/boxes-supported?branch=/branches/abortConnection-78-9

comment:68 Changed 3 years ago by itamar

  • Owner itamar deleted

comment:69 follow-up: Changed 3 years ago by exarkun

Re #1... not really sure what you're suggesting. None of the new tests open many connections...

You added new tests though. And each test probably leaks a file descriptor.

comment:70 in reply to: ↑ 69 Changed 3 years ago by glyph

Replying to exarkun:

Re #1... not really sure what you're suggesting. None of the new tests open many connections...

You added new tests though. And each test probably leaks a file descriptor.

For those who might see this and not be hip enough to have our entire ticket database memorized, exarkun's talking about: http://twistedmatrix.com/trac/ticket/4482

comment:71 Changed 3 years ago by glyph

Or... no, that's a different bug? Hrm. I can't find a bug report about FD leaks in gtkreactor. I thought I knew about this though. Anyone have a reference handy?

Changed 3 years ago by glyph

fd leak test

comment:72 Changed 3 years ago by glyph

I assert that GTK does not leak file descriptors when simply running the reactor, and I have attached a file which I believe proves it. Perhaps this isn't true of our rhel6 buildbot, so someone can run this file there and let me know if there's a problem. However, I've run this program on every Ubuntu release back to Hardy and none of them exhibit the described behavior.

comment:73 Changed 3 years ago by exarkun

Not exactly what I thought I was talking about, but it seems likely to be related anyway: <https://bugzilla.gnome.org/show_bug.cgi?id=579406>. I think you didn't see the problem on Ubuntu because eventually Ubuntu backported the fix, but maybe RHEL did not.

comment:74 Changed 3 years ago by itamar

  • Keywords review removed

OK, I think I know what to do now. Taking off from review until I can get buildbot in better shape.

comment:75 Changed 3 years ago by itamar

  • Keywords review added

http://buildbot.twistedmatrix.com/boxes-supported?branch=/branches/abortConnection-78-9

We're down to failures that are even more mysterious and not-my-faulty.

Only unaddressed review comment is 14; see comments above. I can, if you want, remove the fix and open new ticket (Jerub claims to have seen bugs caused by it, so possibly writing a test isn't impossible).

comment:76 Changed 3 years ago by exarkun

  • Keywords review removed
  • Owner set to itamar
  1. The twisted/test/test_tcp_internals.py will fail to set the new rlimit if the number of open files is close to (within 10) the open file limit. It would also be nice if this change were made in a separate ticket/branch.
  2. The twisted/test/test_ssl.py change seems fairly unrelated. It would be nice if it were made in a separate ticket/branch.
  3. The twisted/internet/iocpreactor/abstract.py still needs a test. Maybe if the ratio of test difficulty to "fix" difficulty is so high, the fix can be made before the test - but that still involves understanding the case it fixes and being confident that it really works. Either way, it would be nice if this were done in a separate ticket/branch.
  4. lambda foo: bar(foo) is an antipattern. In twisted/internet/test/test_tls.py, in AbortSSLConnectionTest.buildReactor, just use cooperator.cooperate as the patch value.
  5. In twisted/internet/test/test_tcp.py, in the minimizeTCPBuffers docstring, likely makes me uncomfortable. Are the tests that use this helper non-deterministic? And is the IOCP case going to be more likely to fail (or pass?) spuriously due to its lack of buffer size tweaks?
  6. In twisted/internet/test/test_tcp.py, the trick with Xs and Ys used by ReadAbortServerProtocol and AbortingClient could probably be explained.
  7. Typo in StreamingProducerClient.pauseProducing's docstring - Wwe.
    1. Also, the behavior the callLater(0.05) is working around seems like a bug in Twisted. The reactor should really be moving bytes on to the next level more aggressively (I have some numbers that show this is a performance win too). Separate ticket, of course. And then the comment above the 0.05 can link to that ticket, and we can get rid of the delayed call when that ticket is resolved.
  8. The SSL handling in runAbortTest is sort of inverted. It would be better to let a subclass override something to define any extra exceptions. This can be fixed later though (it's a nice, easy refactoring ticket for some newbie).
  9. in r32606, was it the addition of the loop, or the corrected handling of the paused attribute that fixed the tests with a "modern version of pyOpenSSL"?
  10. TLS test_fullWriteBufferAfterByteExchange is still pretty slow. It does 300 writes now instead of 100, and writes 90MB instead of the 30MB it wrote last time. I really want all of twisted.internet.test.test_tls to take under a second. In my ideal universe, it also wouldn't take 90MB of data to test basic transport features.

Thanks.

comment:77 Changed 3 years ago by itamar

Created ticket #5301 for item 1.

Created ticket #5302 for item 2.

Created ticket #5300 for item 3.

comment:78 Changed 3 years ago by itamarst

(In [32809]) Address review items 4-7. Refs #78

comment:79 Changed 3 years ago by itamar

I got rid of the minimize thing, it's not strictly necessary and the tests are still reasonable without it, and also dealt with 6 and 7.

Re 8, I will file a ticket after this branch is merged.

As far as 9, I think it was the bug fix, not the loop, so I lowered the number down to 100 (30MB write)

As far as 10 goes, the above means it's faster and writes less... but I think there may actually be a bottleneck in TLS. I don't think crypto is so slow that it should add add 2 CPU seconds, and it's *much* slower if I transfer same number of bytes with a smaller number of (larger) writes. Anyway, it's down to 3 seconds on my computer now. I don't want to lower it much more, because the idea is to fill up buffers.

Since I've addressed all review comments, I'll put this up for review once I can merge forward to include the fix for #5301.

comment:80 Changed 3 years ago by itamarst

  • Branch changed from branches/abortConnection-78-9 to branches/abortConnection-78-10

(In [32824]) Branching to 'abortConnection-78-10'

comment:81 Changed 3 years ago by thijs

Can you also add @since for the new public methods and classes?

comment:82 Changed 3 years ago by itamar

Good point, will do. And maybe rename the abort mixin to start with an _ if it doesn't already.

comment:83 Changed 3 years ago by itamar

  • Keywords review added
  • Owner itamar deleted

comment:84 Changed 3 years ago by exarkun

  • Owner set to exarkun

comment:85 Changed 3 years ago by exarkun

  • Keywords review removed
  • Owner changed from exarkun to itamar

Okay. I think this is good to merge. Thanks!

comment:86 Changed 3 years ago by itamarst

  • Resolution set to fixed
  • Status changed from new to closed

(In [32851]) Merge abortConnection-78-10.

Author: itamar, jknight, exarkun, ivank, glyph
Review: exarkun, thijs
Fixes: #78

Add abortConnection() method to TCP and TLS transports.

comment:87 in reply to: ↑ 46 Changed 2 years ago by koebenhavn

Just for refrence regarding SO_LINGER:

Portability note 1: Some implementations of the BSD socket API do not implement SO_LINGER at all. On such systems, applying SO_LINGER either fails with EINVAL or is (silently) ignored. Having SO_LINGER defined in the headers is no guarantee that SO_LINGER is actually implemented.

Portability note 2: Since the BSD documentation on SO_LINGER is sparse and inadequate, it is not surprising to find the various implementations interpreting the effect of SO_LINGER differently. For instance, the effect of SO_LINGER on non-blocking sockets is not mentioned at all in BSD documentation, and is consequently treated differently on different platforms. Taking case 3 for example: Some implementations behave as described above. With others, a non-blocking socket close() succeed immediately leaving the rest to a background process. Others ignore non-blocking'ness and behave as if the socket were blocking. Yet others behave as if SO_LINGER wasn't in effect [as if the case 1, the default, was in effect], or ignore linger->l_linger [case 3 is treated as case 2]. Given the lack of
adequate documentation, such differences are not (by themselves) indicative of an "incomplete" or "broken" implementation. They are simply different, not incorrect.
Portability note 3: Some implementations of the BSD socket API do not implement SO_LINGER completely. On such systems, the value of linger->l_linger is ignored (always treated as if it were zero).

/Event.

Replying to naked:

I think sending RST is somewhat a separate issue.

The main issue is freeing resources for the userland process. Just calling close() on the socket achieves this well enough (as long as no pending writes are waited and no SSL close is waited). This looks like an orderly shutdown to the remote end (unless there's SSL) in most cases. The exception to this is if the other end sends a data packet after we have called close() - in this case, the kernel will send a RST packet to the connection. (This could be avoided by never calling close(), but by calling shutdown() to close the sending end of the connection only.)

However, if the connections are indeed malicious, it might be that the wish is to free up resources for the kernel as well - atleast as much as is possible. Normally, the kernel keeps sending already written data after close(), and keeps handling the socket shutdown event FIN exchanges. If a socket is closed with SO_LINGER on and it's timeout set to zero, the kernel will discard its write buffers and send a RST to the other end. (I'm not sure if this could end up being an orderly shutdown if buffers are empty.) This way the kernel can free up most of the memory immediately, but it will still keep TIME_WAIT sockets around for quite a while.

But, I must say, I believe Twisted is going to be the only project anywhere that is going to bother with tricking around with SO_LINGER. And the whole thing is actually even somewhat non-portable.

The feature has been missing for 8 years - and for 8 years I've had to add the abortConnection() API to every single server I've written, since every one of them have had a need for it. Perhaps it is time to finally slay the beast?

comment:88 Changed 2 years ago by itamar

Thanks for the info. So it sounds like there's a problem case where certain BSDs (which?) will block on close() if SO_LINGER is set to 0, even for non-blocking sockets? Or am I misunderstanding?

Note: See TracTickets for help on using tickets.