[Twisted-Python] Multicast XMLRPC

Chaz. eprparadocs at gmail.com
Fri Aug 25 16:33:52 EDT 2006


glyph at divmod.com wrote:
> On Fri, 25 Aug 2006 14:43:43 -0400, "Chaz." <eprparadocs at gmail.com> wrote:
>> Perhaps the simple way to say this is that I need to do group 
>> communications that support RPC semantics with minimal overhead.
> 
> I'm still not really clear on what the application is.
> 

The application is a massively scalable data storage system. I plan on 
releasing it into the open source community within the next month or so. 
I've been working on it for almost two years now. Twisted and Python 
have made much easier work of it than my first implementation in C/C++.

>> You ask about the network topology; all I can say is that it supports 
>> the normal communication means: unicast, broadcast and maybe multicast.
> 
> Heh.  "Normal" communication means?  After writing a P2P layer and 
> working on a SIP implementation, I have come to understand that the only 
> "normal" communication available is an *outgoing*, *unencrypted* HTTP 
> request on port 80... ;-)
> 
> More seriously, if you're writing an application for distributing 
> compute nodes to home computers, multicast is a non-starter.  If it's an 
> intranet, then maybe it's feasible.  Or, if you're on Internet 2 for 
> some reason.  (Is anybody actually on internet 2 these days?)
> 
This is not a "home application" but an enterprise and/or SSP 
application. Most likely it sits behind a firewall and if remote offices 
need to access it, they will get access via VPN portals.

I guess this is sort of a topology! Doh. All I know is that I have 
multicast and, with some effort, broadcast support.

> At any rate, producing a functioning multiunicast prototype with, e.g. 
> PB, would be the easiest way to get started if you need to fall back to 
> that sort of topology anyway in the case where a multicast solution 
> doesn't work.  Then you can collect data and determine how much 
> bandwidth is going to be saved in a realistic scenario...
>

So I can use PB with multicast support? How would I deal with all the 
target machines getting responses back?
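
Thinking out loud about the response side: with multicast the request is 
one datagram to the group, and the replies come back as ordinary unicast 
datagrams that I have to track myself. A rough sketch of what I have in 
mind (the group address, port and payload are made up, and this isn't PB, 
just raw datagrams to illustrate the bookkeeping):

    from twisted.internet import reactor
    from twisted.internet.protocol import DatagramProtocol

    GROUP, PORT = "228.0.0.5", 8005        # made-up multicast group and port

    class GroupCall(DatagramProtocol):
        """Send one request to the group; remember which peers answered."""

        def __init__(self, expected_peers, timeout=15):
            self.expected = set(expected_peers)   # hosts I expect a reply from
            self.replied = set()
            self.timeout = timeout

        def startProtocol(self):
            self.transport.joinGroup(GROUP)
            self.transport.setLoopbackMode(False)         # don't hear my own request
            self.transport.write(b"PING", (GROUP, PORT))  # one datagram, N listeners
            reactor.callLater(self.timeout, self.finish)

        def datagramReceived(self, data, addr):
            self.replied.add(addr[0])                     # unicast reply from a peer

        def finish(self):
            missing = self.expected - self.replied
            print("no reply, retry individually:", missing)

    # reactor.listenMulticast(PORT, GroupCall(["10.0.0.2", "10.0.0.3"]),
    #                         listenMultiple=True)
    # reactor.run()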

>> I am being intentionally vague since I don't want to have any specific 
>> network architecture.
> 
> If you want to support arbitrary network architecture, you _definitely_ 
> can't use multicast, at all.  Even determining if *unicast* datagrams 
> work on an arbitrary network is a hard problem.

As I said, the entire system sits behind a firewall and remote sites will 
use a VPN to get to the system. This gives me multicast and broadcast (so 
long as everything is on the same subnet, even the remote sites).
> 
>> I don't want to use overlay networks, if at all possible. While they 
>> are nice, I would prefer something a little more direct (though that 
>> might not be possible). The reason? Direct operations are faster.
> 
> Sometimes.  If your topology involves an extremely well-connected 
> overlay hub peer and a bunch of intermittently or poorly connected edge 
> peers, direct operations can be significantly slower.  While I'm not a 
> big fan of IRC's network architecture, the math on what happens if every 
> client is responsible for all of their own messages on a channel of 1000 
> people is really eye-opening.
> 

I had thought of the hub-and-spoke model and designed the system that 
way originally. But I have to respond to instantaneous demands, which 
caused me to change the design. Each of the servers can act as both a 
server (providing a service to a client app) and an end point (providing 
storage features). So a hub-and-spoke architecture is really out of the 
picture for me (at least I can't see an easy way to make it work).

I could probably do a self-organizing overlay network on top of the 
machines, taking advantage of how they are connected together (the real 
physical topology), but even that presents me with an issue: I want the 
system to be more or less self-configuring, and as it stands I don't have 
a way to auto-detect connection speeds.

I had thought of using a clock-synchronization algorithm to figure out 
bandwidth throttling but I thought it better to leave that to another 
day (or days). I also don't want the user (or owner of this beast) to 
have to manually configure that stuff.
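
If I ever do tackle it, I'd probably start with something much cruder 
than clock synchronization, e.g. timing one tiny remote call per peer and 
using that as a rough link-quality signal. A sketch of the idea (the 
"ping" remote method is hypothetical, not anything in my code yet):

    import time

    from twisted.internet import reactor
    from twisted.spread import pb

    def measure_rtt(host, port=8800):
        """Very rough link probe: time a single tiny remote call to a peer."""
        factory = pb.PBClientFactory()
        reactor.connectTCP(host, port, factory)
        start = time.time()

        def ping(root):
            # "ping" is a hypothetical no-op method the peer would expose.
            return root.callRemote("ping")

        def elapsed(_):
            return time.time() - start   # seconds for connect plus one round trip

        return factory.getRootObject().addCallback(ping).addCallback(elapsed)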


>> I don't particular care if it is PB, XML-RPC or SOAP as the 
>> marshalling mechanism. I mention them since they allow me to solve one 
>> problem at a time. I would like to build the solution a piece at a 
>> time to do some measurements and testing. Today the underlying 
>> transport and tomorrow the marshallings.
> 
> It still seems to me like this is backwards.
> 
> The application can be complete, end-to-end, if you start marshalling 
> data and sending it over a simplistic (but possibly too-expensive) 
> mechanism.  Then, you can replace the transport as necessary later.  
> Preserving the semantics of the marshalling between things as radically 
> different as XMLRPC and PB would be very hard; but as you've said, the 
> semantics of your transport must remain identical.

That is certainly one way. I tend to think all my hard problems are 
going to be transport issues, so I work up the stack. I have had my share 
of algorithm issues too; nothing is quite obvious when you have 1,000 
or 10,000 machines to deal with!

> 
>> Now let me address the issue of TCP. It is a pretty heavy protocol to 
>> use. It takes a lot of resources on the sender and target and can take 
>> some time to establish a connection. Opening a 1000 or more sockets 
>> consumes a lot of resources in the underlying OS and in the Twisted 
>> client!
> 
> I still don't know what you mean by "resources", and as compared to 
> what.  In my experience all the alternatives to TCP end up consuming an 
> equivalent amount of RAM and CPU time... although in some cases you 
> might save on bandwidth.
> 

By resources I mean memory and time. Granted, on a 1 GB system with 3 GB 
of virtual memory, memory isn't a big deal most of the time. But I have 
seen memory leaks kill this sucker more times than I care to recall. Once 
I ran the application for a few days and saw all my swap being used! It 
was a very subtle memory leak in one of the libraries (in fact, one 
library leak consumed 584 MB in less than an hour!).

>> If I use TCP and stick to the serial, synchronized semantics of RPC, 
>> doing one call at a time, I have only a few ways to solve the problem. 
>> Do one call at a time, repeat N times, and that could take quite a while.
> 
> I'm not sure what you mean by "at a time".  The operations can be quite 
> effectively parallelized, both by TCP and by Twisted talking to the OS: 
> if you keep a list of all your open connections and do the naive thing, 
> i.e., for each heartbeat:
> 
>    for connection in connections:
>        connection.sendPing(timeout=30).addErrback(connection.uhOh)
> 
> the initial loop will not take very long even with a very large number 
> of connections, and Twisted will send out traffic as network conditions 
> permit.
> 
> Most importantly, you do not need to wait for any of the calls to 
> complete to issue more calls, regardless of whether they're unicast or 
> multicast.  This same API could be refactored internally to group 
> together peers in the same multicast group and coalesce their pings; but 
> you still need to do the same complexity order of work, because you have 
> to track each peer's response individually.

Yes, I completely forgot that I would see them all in parallel. I tend 
to overlook Twisted's state machine architecture when I think of 
solutions. I am getting better but not quite there yet...
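
Just to make sure I have the idiom right, a minimal sketch of what I 
think that loop looks like in practice (sendPing and uhOh are the 
hypothetical per-connection API from your example, not anything I have 
written yet):

    from twisted.internet import defer

    def heartbeat_all(connections, timeout=30):
        """Fire every ping up front; Twisted overlaps all the network I/O."""
        deferreds = []
        for connection in connections:
            d = connection.sendPing(timeout=timeout)   # hypothetical per-peer API
            d.addErrback(connection.uhOh)              # mark this peer as suspect
            deferreds.append(d)
        # A single deferred that fires once every peer has answered or failed.
        return defer.DeferredList(deferreds, consumeErrors=True)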

> 
> Finally, if all you're concerned with is clients dying, you can remove 
> Python from the equation entirely and let the TCP stack do its thing: 
> set SO_KEEPALIVE on all your sockets [in Twisted-ese: 
> self.transport.setTcpKeepAlive(True)] and just wait for connectionLost 
> to be called when a ping fails.  No user-space work _at all_, and 
> probably pretty minimal bandwidth usage.
> 
>> I could do M spawnProcesses and have each do N/M RPC calls.
> 
> Yow.  That definitely doesn't make sense unless you have a massively SMP 
> box.
> 
>> Or I could use M threads and do it that way.
> 
> ... and that would basically _never_ make sense, under any conditions.  
> Python's GIL negates any SMP benefits, Twisted won't send network 
> messages from threads anyway, and it would be substantially more complex.

Yes, see my mea culpa above... it is hard to stop thinking in terms of 
threads and processes!
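
The SO_KEEPALIVE tip is a good one, by the way. If I read it right, it's 
just a one-liner in connectionMade and then reacting to connectionLost; a 
sketch of how I'd wire it in (the peerGone bookkeeping hook is made up):

    from twisted.internet.protocol import Protocol

    class PeerLink(Protocol):
        """Let the TCP stack do the liveness checking."""

        def connectionMade(self):
            # Kernel-level keepalive probes; no ping traffic from Python at all.
            self.transport.setTcpKeepAlive(True)

        def connectionLost(self, reason):
            # Fires when keepalive (or anything else) decides the peer is gone.
            self.factory.peerGone(self.transport.getPeer(), reason)   # made-up hook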

> 
>> Granted I have M sockets open at a time, it is possible for this to 
>> take quite a while to execute. Performance would be terrible (and yes 
>> I want an approach that has good to very good performance. After all 
>> who would want poor to terrible performance?)
> 
> "performance 'would be' terrible" sounds like premature optimization to 
> me.  At least, I have lots of experience with systems where this 
> performance was more than good enough.  Huge massively multiplayer games 
> use such systems and manage to deal with tens of thousands of concurrent 
> clients per game node with (relative) ease, over the public internet, 
> with good performance, and without breaking the bank on bandwidth.
> 

Do the games use TCP or UDP? I would have thought they save state about 
each of the players on the server and use UDP for message passing. I 
thought that was part of the reason most game developers were 
interested in STUN?


>> So I divided the problem down to two parts. One, can I reduce the 
>> amount of traffic on the invoking side of the RPC request? Second, is 
>> how to deal with the response. Obviously I have to deal with the issue 
>> of failure, since RPC semantics require EXACTLY-ONCE.
> 
> If you're concerned about bandwidth *as a resource of its own* then this 
> is perhaps a legitimate concern.  But if you're concerned about reducing 
> bandwidth as a means to increase the real-time performance of the system 
> I don't think that it's actually going to save you a lot.  You save some 
> bandwidth, but then you move a bunch of request/response tracking out of 
> hardware and into Python.  Unless your new algorithm is more efficient 
> by a large margin, and N is very big indeed (100,000 is not "big", 
> especially when you can partition it using techniques like overlay 
> networks).

Bandwidth is a very important issue in this system. No one would run 
this on their network if it could bring the network down (or congest 
it so badly that it might as well be down; the old packet-storm issue).

Minimizing bandwidth usage is only one way to deal with performance. A 
congested network will drop packets (requiring retransmission, etc.), so 
I try to minimize the impact on the network.

> 
>> That leads me to the observation that on an uncongested ethernet (...)
> 
> "uncongested ethernet" implies something very concrete about your 
> network topology.  Certainly it implies that you have enough spare 
> bandwidth that you don't need to be compressing every byte.  Want to 
> expound? :)
> 
Well, a congested network gives you about half the nominal bandwidth, so 
I can expect about 5 Mb/s on 10 Mb/s Ethernet, etc. So the idea would be 
to keep traffic to a minimum.

As I mentioned earlier, if I were to do normal heartbeat messages with N 
machines, I would have roughly N^2 messages moving around. As long as N 
is small, a few hundred machines (and I have built systems in the telecom 
world with 100 machines), the load on the network is reasonable. But 
once you have 1,000 machines, you have about 1,000,000 messages flying 
around. Through the work I did, I got that number down to 2,000. And 
spread over 15 seconds, that isn't too bad.
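
Just to put the arithmetic in one place (illustrative numbers only, 
nothing from the real code):

    n = 1000
    all_pairs = n * (n - 1)   # naive all-to-all heartbeats: 999,000 messages per round
    grouped = 2000            # roughly what the grouped scheme costs per 15-second round
    print(all_pairs, grouped, all_pairs // grouped)   # ~500x less traffic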

>> Hope this helps cast the problem...I didn't mean to sound terse before 
>> I just figured everyone had already thought about the problem and knew 
>> the issues.
> 
> I still really don't know what the problem at hand is.  I gather it has 
> something to do with sending a lot of traffic to a lot of peers but that 
> is still a description of an implementation technique, not a problem.  
> Are you making toast?  Doing distributed testing?  Sequencing genomes?  
> Cracking encryption?  Writing some kind of monster distributed 
> enterprise calendar server?  (I'm still not sure what you meant by 
> "communicating groups", above.)  Is the "problem" in this case to 
> develop a generic infrastructure for some wider set of problems, like an 
> open-source implementation of a MapReduce daemon?  If so, what are the 
> initial problems it's expected to be applied to?  What does all this 
> data, other than heartbeats, that you're slinging around *represent*?
> 