Opened 6 years ago

Closed 6 years ago

Last modified 6 years ago

#8014 defect closed fixed (fixed)

HostnameEndpoint: AlreadyCalledError when cancelling connection

Reported by: warner Owned by: Glyph
Priority: normal Milestone:
Component: core Keywords:
Cc: ashfall, warner Branch: branches/hostname-endpoint-8014
branch-diff, diff-cov, branch-cov, buildbot
Author: glyph

Description

I've been updating Foolscap to use endpoints instead of raw connectTCP, using HostnameEndpoint in particular (so I can take advantage of IPv6 in the future). Foolscap tries to make multiple connections in parallel, using endpoint.connect(), and the first connection which passes negotiation wins. The rest are cancelled.

My tests are throwing timing-sensitive AlreadyCalledErrors in some of those cancelled connections. There's a window in the HostnameEndpoint cancellation logic during which a d.cancel() will errback the connect() Deferred twice.

The smallest test case I was able to build is here, but it depends upon the connection taking a long time to complete. The most reliable way is to use an unassigned address on the local subnet (so the only thing that will stop the connection is a timeout).

from twisted.internet import reactor, endpoints, protocol

ep = endpoints.HostnameEndpoint(reactor, "10.0.0.99", 8999)
f = protocol.Factory.forProtocol(protocol.Protocol)
d = ep.connect(f)
reactor.callLater(1.0, d.cancel)
def connected(p): print "ep connected"
def abandoned(f): print "ep failed"
d.addCallbacks(connected, abandoned)
d.addCallback(lambda _: reactor.stop())
reactor.run()

The sequence of events (reactor turns) inside HostnameEndpoint.connect(), for a hostname that resolves to exactly one IP address, looks roughly like:

  • 1. deferToThread spawns a thread to run getaddrinfo
  • 2. Some time later, that returns. The addresses are turned into sub-endpoints, and a LoopingCall is created to try the addresses at 300ms intervals. The LoopingCall is started, which immediately starts .connect() on the first sub-endpoint. The resulting Deferred is added to pending.
  • 3. 300ms later, the LoopingCall fires again. It attempts to pull another endpoint from endpoints, but since there was only one address, this raises StopIteration. The code sets the endpointsListExhausted flag, stops the LoopingCall, and calls checkDone(), which does nothing because we still have connections in pending.
  • (here is the critical window)
  • 4. the connection is established, the sub-endpoint's .connect()'s Deferred fires, HostnameEndpoint's usedEndpointRemoval() removes the Deferred from pending, then afterConnectionAttempt() would stop the LoopingCall (but it is already stopped), cancels everything in (the now-empty) pending list, then fires winner.callback(). The caller's callback chain is executed.

If the HostnameEndpoint's winner Deferred is cancelled during the critical window between turns 3 and 4, the winner Deferred is errback'ed (this is the first call), then the one sub-endpoint Deferred in pending is cancelled. This errbacks the sub-endpoint Deferred, which causes usedEndpointRemoval() to remove it from pending, then invokes connectFailed(), which invokes checkDone(), which sees:

  • endpointsListExhausted: [True], because we hit StopIteration earlier
  • not pending: True, because we removed the last (only) entry in usedEndpointRemoval
  • not successful: True, because no sub-Endpoint managed a successful connection

So checkDone() errbacks winner, which is the second call, causing the AlreadyCalledError.

If the caller uses d.cancel() too early (during address resolution, between turns 1 and 2), then we errback past the entire attemptConnection() call, and never hit this case. If the cancel happens between turns 2 and 3 (during the first 300ms after address resolution), we won't have hit the StopIteration yet, so endpointsListExhausted will be False, so checkDone won't errback anything (it thinks we still have more to go). And if it happens after turn 4, even if there are additional sub-endpoints to cancel, the fact that successful is True causes the errback to be skipped.

"First winner of multiple parallel connections, which all might succeed or fail, but should be cancellable quickly" is one of those annoyingly-difficult patterns that breeds bugs faster than an entomologist's lab. I've probably implemented it a dozen times in my career, and I'm not convinced any of them actually work :-). I see #6698 proposes to rewrite this as a state machine, which I think is a great idea, and might help save the sanity of whoever tries to fix this particular AlreadyCalledError problem.

If I were going to try and fix this with the current not-really-a-state-machine layout, I'd probably declare checkDone() to be the only function that's allowed to callback/errback the winner Deferred, and call it after any state transition (timer fires, sub-endpoint succeeds, sub-endpoint fails, cancel is called). It should remember whether it's fired the Deferred already (maybe via an active flag), and any events that occur after that are just ignored.

Cancelling the HostnameEndpoint.connect should set a flag that says "stop trying", stop the timer, cancel any pending connections, then run checkDone, which will probably *not* errback because there are still pending connections waiting to resolve. Only when the last pending connection has errback'ed (which should be on the next reactor turn), should the main Deferred be fired with the ConnectingCancelledError.

Also, I'd try to find a way to remove the 300ms delay between a single sub-endpoint failing to connect, and the HostnameEndpoint.connect Deferred errbacking. In the present code, we won't notice the StopIteration until the next LoopingCall tick, which feels like an unnecessary slowdown for callers (who may be trying multiple hostnames or ports or whatever themselves).

I wish I knew how to turn this into a reliable network-independent unit test. I wrote one version that makes two separate endpoint.connect() calls to a 127.0.0.1 listener, and cancels the second one when the first one finishes. That got me about a 50% success rate, as it sort of "factors out" the getaddressinfo() delay. I was also able to improve the success rate by reducing the 300ms LoopingCall interval. It might help to change _endpoints() to return a list (instead of an iterator) and then set endpointsListExhausted as soon as the list becomes empty (instead of on the next timer pass). Those would all increase the duration of the window.

Another approach might be to create a 127.0.0.1 listener, then remove its socket from the reactor (or equivalently, do a raw socket.listen(), without the reactor, so nobody ever calls accept() on it). A connection attempt to that ought to never be accepted but never get rejected either. Then the test just needs to wait until the 300ms timer has expired to get a reliable failure.

Attachments (1)

delayed-calls-8014.diff (4.2 KB) - added by warner 6 years ago.
patch to cancel timer and add tests

Download all attachments as: .zip

Change History (20)

comment:1 Changed 6 years ago by Glyph

Thank you for the remarkably thorough diagnosis :-).

comment:2 Changed 6 years ago by Glyph

Author: glyph
Branch: branches/hostname-endpoint-8014

(In [46777]) Branching to hostname-endpoint-8014.

comment:3 Changed 6 years ago by Glyph

Cc: warner added

I wish I knew how to turn this into a reliable network-independent unit test.

Thanks to your, again, staggeringly comprehensive explanation, this was actually pretty easy if you know where to look.

For future reference, r46778 is the commit that introduced the test; there are already network-independent tests for pretty much everything HostnameEndpoint does, and this was just testing one specific deterministic case that hadn't been handled by the previous tests. If you have any questions about what that test is doing, we should definitely discuss on the mailing list.

comment:4 Changed 6 years ago by Glyph

Keywords: review added

I opted to fix this with the present pseudo-state-machine because even with an explicit state machine, there are a lot of counters and interacting bits of state to keep consistent. (Finishing the state-machine rewrite at some point is still probably a good idea, I just didn't want to tackle it because it is more nuanced than it looks; not to mention the selection of a state-machine library.) I did, however, remove distracting and unnecessary verbiage from the implementation of connect() (all the callbacks had quite verbose docstrings, possibly to pacify an earlier version of twistedchecker) and try to re-order the callback definitions so they follow the order they're executed in, to make it easier to understand and be sure my fix made sense to me.

comment:5 Changed 6 years ago by Glyph

I am intentionally leaving the speedup for another ticket - #8202 - so as not to make the reviewer's job any harder than it has to be.

comment:6 Changed 6 years ago by Glyph

The buildbot has been poked and test results should be ready when a reviewer is around.

comment:7 Changed 6 years ago by Chris Wolfe

Hi Glyph,

Thanks for your patch :-)

A few comments:

  1. line 695 - Please change the spelling of Defered to Deferred.
  2. line 705 - The docstring for attemptConnection(endpoints) is somewhat confusing. It could just be that The trial attempts for each endpoints should be endpoint, but even then, it seems a bit difficult to understand.
  3. line 719 - Instead of using multiple or conditions, any seems clearer:

if pending or checkDone.completed or checkDone.endpointsLeft:

if any([pending, checkDone.completed, checkDone.endpointsLeft]):

  1. line 1701 - Would you mind adding a comment or modifying the docstring to detail the significance of why 0.31 is enough time for all attempts the test to have passed? The concern I have is that this looks a bit like a magic number. That being said, you do state in the docstring that this should fire after enough time has passed. This is by no means a blocker, I just find it a bit unclear.

Once you've addressed the comments, please resubmit for review.

Thank you for working on this!

herrwolfe

comment:8 Changed 6 years ago by Chris Wolfe

Keywords: review removed

comment:9 Changed 6 years ago by Chris Wolfe

Owner: set to Glyph

comment:10 Changed 6 years ago by warner

Foolscap tests pass with this branch. Thanks!

comment:11 Changed 6 years ago by Glyph

(In [46781]) address review feedback re: #8014

comment:12 Changed 6 years ago by Glyph

Keywords: review added
Owner: changed from Glyph to Chris Wolfe

Hi Glyph,

Thanks for your patch :-)

Thanks for your quick review!

A few comments:

  1. line 695 - Please change the spelling of Defered to Deferred.

You didn't actually say which file this was in, just the line ;). Fixed.

  1. line 705 - The docstring for attemptConnection(endpoints) is somewhat confusing. It could just be that The trial attempts for each endpoints should be endpoint, but even then, it seems a bit difficult to understand.

This docstring was more or less in place already, but I guess doing a little refactoring and reorganizing here touched all of these lines. In for a penny, in for a pound, I suppose - I've totally rewritten the docstring to be as clear as I can make it.

  1. line 719 - Instead of using multiple or conditions, any seems clearer:

if pending or checkDone.completed or checkDone.endpointsLeft:

if any([pending, checkDone.completed, checkDone.endpointsLeft]):

I think I disagree; or is something you learn pretty early on in Python, whereas any is a slightly obscure builtin. Also, not that efficiency is a serious concern in this case, but in terms of habits / style, any forces the evaluation of its entire argument (consructing a list, and then getting all 3 values) whereas or only evaluates enough stuff to get the truth value of the test.

  1. line 1701 - Would you mind adding a comment or modifying the docstring to detail the significance of why 0.31 is enough time for all attempts the test to have passed? The concern I have is that this looks a bit like a magic number. That being said, you do state in the docstring that this should fire after enough time has passed. This is by no means a blocker, I just find it a bit unclear.

Rather than documenting this with a docstring or a comment, instead I made the default value a symbolic name. I had originally considered the magical constant OK because the spec mandated the value, but in fact the spec just says this is what chrome and firefox do, and it is only actually specified to be "a short delay".

Once you've addressed the comments, please resubmit for review.

Comments all addressed; resubmitting.

comment:13 Changed 6 years ago by Chris Wolfe

Keywords: review removed
Owner: changed from Chris Wolfe to Glyph

Glyph - Thanks for the super speedy turnaround on the changes! Thank you also for explaining why you prefer using or in this case, instead of any.

All of the buildbots are green (I did have to rerun the windows iocp builder, but after having done so, the build succeeded). Please feel free to merge :-).

herrwolfe

comment:14 Changed 6 years ago by Glyph

Resolution: fixed
Status: newclosed

(In [46786]) Merge hostname-endpoint-8014: HostnameEndpoint fix

Author: glyph

Reviewer: herrwolfe

Fixes: #8014

twisted.internet.endpoints.HostnameEndpoint.connect no longer fails with an AlreadyCalledError when the Deferred it returns is cancelled after all outgoing connection attempts have been made but none have yet succeeded or failed.

comment:15 Changed 6 years ago by warner

Resolution: fixed
Status: closedreopened

I'm seeing one lingering problem: a DirtyReactorError in my HostnameEndpoint-using integration tests.

In my failure, the endpoint is using a single IPv4 address. I believe that endpoint.connect() starts, and then the returned Deferred is cancelled while the connection is still outstanding (my program opens multiple connections in parallel, to different hostnames, and cancels every one that loses the race). This leaves the 300ms timer in the LoopingCall still running, which of course trial notices when the test claims to be complete. I think the problem is in this clause (endpoints.py HostnameEndpoint / connect / iterateEndpoint / failed) around line 89:

                def failed(reason):
                    failures.append(reason)
                    checkDone()

I think that it ought to be:

                def failed(reason):
                    if iterateEndpoint.running:
                        iterateEndpoint.stop()
                    failures.append(reason)
                    checkDone()

I'm working on a test: changing test_endpointConnectingCancelled to end with:

        self.assertFalse(mreactor.getDelayedCalls())

seems to catch the error (and goes green after patching failed()). However other tests start failing with the patch (in fact my python then segfaults shortly after trial exists), so something weird is happening on my system that I haven't tracked down yet.

comment:16 Changed 6 years ago by warner

I've got a patch. I moved the iterateEndpoint.stop() down into cancelRemainingPending, where it can get run for all of:

  • run out of endpoints
  • success
  • non-timeout failure (cancellation)
  • timeout failure (so stop() is unnecessary)

I also added assertions to the appropriate (I hope) tests, that there are no lingering DelayedCalls left around.

Changed 6 years ago by warner

Attachment: delayed-calls-8014.diff added

patch to cancel timer and add tests

comment:17 Changed 6 years ago by warner

Keywords: review added

I don't know how Twisted's current branch-and -test tooling works (I gather it involves Subversion??), but this patch is in need of review and landing by someone with commit privs. My apologies for not having it in the most convenient format :).

comment:18 Changed 6 years ago by warner

Resolution: fixed
Status: reopenedclosed

Moved to #8223 .

comment:19 Changed 6 years ago by warner

Keywords: review removed
Note: See TracTickets for help on using tickets.