[Twisted-Python] Handling PBConnectionLost errors

Fri Jul 20 11:52:51 EDT 2007

Hello,

Twisted PB sometimes loses its connection to the server. In this  
case, a PBConnectionLost exception is raised on the client. It would  
be nice to implement a fail-safe(er) way of calling remote methods  
that would retry when necessary until the remote method has been  
called successfully and the result has been returned. Note that this  
is only necessary when the remote method call should be invoked  
exactly once on the server (i.e. for POST-like calls that change  
server state). In the case of GET-like requests, a simpler retry  
mechanism will do.

The motivation for this is based on my experience of using Twisted in  
an application I am developing. The network communications are all  
happening on a LAN. The good end of the network is connected directly  
to a 100Mbps switch at the server. Failures occur more frequently at  
the other end (my end) of the network, which is connected through a  
10/100 hub that is connected to the main switch. I rigged up a quick  
test with a 1000-request sample size; failures ranged from 28/1000 on  
the good end of the network to 83/1000 on the bad end of the network.  
One request consists of a single remote method call through PB. A
request counted as a success when I got the expected result and as a
failure when I got a PBConnectionLost error.

The following is pseudo code that I came up with to mitigate the  
problem.

Simple request (GET - repeatedly call method until success or  
RETRY_LIMIT is reached)
    Client flow:
       for x in range(RETRY_LIMIT)
          invoke remote method without unique call identifier
          if result is not PBConnectionLost
             break
       if result is PBConnectionLost
          raise server not responding error
    Server flow:
       (nothing special, just plain PB)
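
As a rough illustration of the simple (GET-like) retry loop above, here is a minimal synchronous sketch. It does not use Twisted itself: ConnectionLost is a hypothetical stand-in for PBConnectionLost, and call_with_retry, ServerNotResponding and the default RETRY_LIMIT value are names invented for this example. In real PB code, callRemote returns a Deferred, so the same loop would be driven from an errback rather than a try/except.

```python
class ConnectionLost(Exception):
    """Hypothetical stand-in for PBConnectionLost in this sketch."""


class ServerNotResponding(Exception):
    """Raised when every attempt ended in a lost connection."""


RETRY_LIMIT = 3  # assumed value; the original post does not fix one


def call_with_retry(remote_call, retry_limit=RETRY_LIMIT):
    """Call remote_call() until it succeeds or retry_limit is reached.

    Only safe for idempotent (GET-like) calls, because the server may
    have executed the method even though the connection was lost.
    """
    for _attempt in range(retry_limit):
        try:
            return remote_call()
        except ConnectionLost:
            continue  # connection dropped: try again
    raise ServerNotResponding(
        "no response after %d attempts" % retry_limit)


if __name__ == "__main__":
    outcomes = iter([ConnectionLost(), ConnectionLost(), "result"])

    def flaky_call():
        # Fails twice with a lost connection, then succeeds.
        outcome = next(outcomes)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    print(call_with_retry(flaky_call))
```

Any other exception raised by the remote method is deliberately not retried here; it propagates to the caller on the first attempt.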

Complex request (POST - server-side method is invoked exactly once)
    Client flow:
       use simple retry method to get a unique call identifier from server
          a timeout value is also sent along to tell the server how long to hold the results of this request
       for x in range(RETRY_LIMIT)
          invoke remote method with identifier
          if return value is not PBConnectionLost
             break
       if result is PBConnectionLost
          raise server not responding error
       using simple retry method tell server to discard unique call identifier
    Server flow:
       receive request for unique call identifier
          create and store identifier with UNCALLED token
          schedule identifier to be discarded with timeout value supplied by client
          return identifier to client
       receive remote method invocation with unique call identifier
          branch on value stored with unique call identifier
          if UNCALLED
             update identifier with CALLED token
             invoke method
             while result is deferred
                get defer result
             store COMPLETED token and result with unique call identifier
             if there is another invocation WAITING
                this means the connection was lost
                signal the WAITING request with the result
             else
                return result to client
          if CALLED
             store WAITING token with unique identifier (must not overwrite other call tokens)
             defer until COMPLETED
          if COMPLETED
             return result to client
          if unique call identifier does not exist
             raise error
       receive request to discard unique call identifier
          if identifier exists
             discard identifier, tokens, and result
          return True
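
The server-side bookkeeping for the complex (POST-like) request could be sketched roughly as follows. This is a simplified, synchronous illustration and not Twisted code: CallRegistry and its method names are invented for the example, identifiers are generated with uuid, and both the WAITING state and the timeout-based discard are left out because they need Deferreds and a reactor to model properly.

```python
import uuid

UNCALLED, CALLED, COMPLETED = "UNCALLED", "CALLED", "COMPLETED"


class CallRegistry:
    """Tracks unique call identifiers so a method runs at most once."""

    def __init__(self):
        self._calls = {}  # identifier -> [state token, stored result]

    def new_identifier(self):
        """Create and store a new identifier with the UNCALLED token."""
        call_id = uuid.uuid4().hex
        self._calls[call_id] = [UNCALLED, None]
        return call_id

    def invoke(self, call_id, method, *args):
        """Run method once per identifier; repeats get the stored result."""
        entry = self._calls[call_id]  # KeyError if identifier is unknown
        state = entry[0]
        if state == COMPLETED:
            return entry[1]  # retry after a lost connection
        if state == CALLED:
            # The pseudocode would store WAITING and defer here;
            # this synchronous sketch cannot, so it refuses instead.
            raise RuntimeError("call already in progress")
        entry[0] = CALLED
        # Error handling is out of scope: if method raises, the
        # entry stays CALLED until it is discarded.
        entry[1] = method(*args)
        entry[0] = COMPLETED
        return entry[1]

    def discard(self, call_id):
        """Drop identifier, token and result; always returns True."""
        self._calls.pop(call_id, None)
        return True


if __name__ == "__main__":
    registry = CallRegistry()
    balance = []

    def deposit(amount):
        balance.append(amount)
        return sum(balance)

    call_id = registry.new_identifier()
    print(registry.invoke(call_id, deposit, 10))  # runs deposit
    print(registry.invoke(call_id, deposit, 10))  # replays stored result
    print(registry.discard(call_id))
```

On the client side, each of the three steps (request an identifier, invoke with it, discard it) would be wrapped in the simple retry loop, since all three are safe to repeat once the registry is in place.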

I realize that implementing this would not eliminate network errors.
It would simply reduce the likelihood of failed method calls due to
dropped connections. If I have my math correct (I always struggle a
bit with statistics), and assuming that failures are independent of
each other, even a RETRY_LIMIT of 2 would reduce the probability of a
lost connection to about 0.7% at the worst (<0.1% on the good end of
the network).
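
The arithmetic behind that estimate can be checked with a few lines of Python. This assumes each attempt fails independently with the measured rate (83/1000 on the bad end, 28/1000 on the good end), so a call only fails overall if every attempt fails; failure_probability is a name made up for this example.

```python
def failure_probability(per_call_failure_rate, attempts):
    """Chance that all attempts fail, assuming independent failures."""
    return per_call_failure_rate ** attempts


bad_end = failure_probability(83 / 1000, 2)
good_end = failure_probability(28 / 1000, 2)

print("bad end:  %.2f%%" % (bad_end * 100))   # bad end:  0.69%
print("good end: %.2f%%" % (good_end * 100))  # good end: 0.08%
```

If dropped connections come in bursts (which seems plausible for a flaky hub), the attempts are not independent and the real figure would be higher than this.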

I have two questions:

1. Does something like this already exist?
2. Is this a totally stupid idea? (Would it be better to improve our
physical network than to try to band-aid the problem with something
like this?)

~ Daniel