[Twisted-Python] Twisted HTTP client supporting failover for multiple A records?

Thu Jul 15 17:18:23 MDT 2010

Luke Marsden <luke-lists at hybrid-logic.co.uk> writes:

> We're actually using it to provide redundancy in this instance. In our
> application any request for any site can be made to any (live) server,
> so having dead servers in the pool of A records doesn't matter so long
> as real web browsers failover to some other A record within a second,
> which they do! http://crypto.stanford.edu/dns/dns-rebinding.pdf

Be aware that failing over to an alternate A record is not necessarily
that fast; it depends on the sort of failure involved.  Failover is
only quick when the outage (network unreachable, port no longer active
on the host, etc.) causes the connection attempt to be explicitly
rejected by the target host or a router along the way.

If it's a more complicated outage (e.g., a routing loop or total
machine failure) for which the client receives no explicit failure
response, you'll be subject to the client's connect timeout (one for
each failing address, and for each attempt the client makes to that
address).  These timeouts vary by client and platform, but can easily
be 30-60s, certainly long enough for the human involved to want to
give up.  Also, since web browsers typically cache DNS responses, if a
bad address is early in the list, a timeout will be encountered for
each and every individual browser request generated.

I did a quick test with a stock Firefox 3.6 under OS X and with a bad
initial A record (non-existent host) it took about 75s to fail over to
the next A record.  In my test case even that was unusable, since the
home page I was loading had other references to the same host that
were needed to load it, and each of those references took another 75
seconds to time out.  So it took more than 2 minutes for me to see the
page I wanted, which I presume most people would give up on.
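
The arithmetic behind that delay can be sketched roughly as follows.
This is only an illustration: the function name and parameters are
made up here, and it assumes the requests run one after another and
that each one hits the same dead address before reaching a live one.

```python
def total_timeout_delay(connect_timeout_s, dead_addresses_before_live,
                        sequential_requests):
    """Seconds spent waiting on connect timeouts across all requests.

    Assumes every request retries the dead addresses from the start of
    the list (e.g. because the browser cached the DNS response) and
    that the requests are not made in parallel.
    """
    per_request = connect_timeout_s * dead_addresses_before_live
    return per_request * sequential_requests


# One dead A record ahead of a live one, a 75s connect timeout, and
# two sequential requests (the page plus one further reference).
print(total_timeout_delay(75, 1, 2))  # 150
```

With those assumed inputs the wait is 150 seconds, which matches the
"more than 2 minutes" above; a browser that fetches references in
parallel would see less.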

That's not to say using multiple A records isn't a helpful practice
for many sorts of outages (especially to permit controlled
maintenance).  Just don't expect it to be sufficient in every failure
mode, depending on the behavior you want clients to experience.

If this is strictly limited to a client you control, it's much less of
an issue, since you can set the TCP connect timeout much lower than
its default.  You still probably can't match the speed of failover for
rejected connections, since you'll want to leave enough room for
occasional latency or response time issues without immediately failing
over.  But you can do a lot better than the system defaults.
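
As a minimal sketch of that idea, here is a client that walks the A
records in order with a short per-address connect timeout.  It uses
only the standard library socket module rather than Twisted, to keep
it self-contained; the function names and the 3 second default are
assumptions for illustration, not a recommendation.

```python
import socket


def resolve_a_records(hostname, port):
    """Return the (address, port) pairs for the host's IPv4 records."""
    infos = socket.getaddrinfo(hostname, port,
                               socket.AF_INET, socket.SOCK_STREAM)
    return [info[4] for info in infos]


def connect_with_failover(addresses, connect_timeout=3.0):
    """Try each (host, port) pair in order; return the first connection.

    connect_timeout is a hypothetical per-address limit in seconds.
    Refused connections fail almost at once; unresponsive addresses
    cost up to connect_timeout each before the next one is tried.
    """
    last_error = None
    for host, port in addresses:
        try:
            # The timeout stays set on the returned socket; call
            # settimeout() afterwards if reads need a different limit.
            return socket.create_connection((host, port),
                                            timeout=connect_timeout)
        except OSError as exc:  # covers refusals and timeouts alike
            last_error = exc
    if last_error is None:
        raise ValueError("no addresses given")
    raise last_error
```

A caller would do something like
connect_with_failover(resolve_a_records("www.example.com", 80)), where
the hostname is a placeholder.  The timeout value is the trade-off
described above: too low and a slow but healthy server gets skipped.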

-- David