[Twisted-Python] Multicast XMLRPC

Sat Aug 26 10:14:48 EDT 2006

Phil Mayers wrote:
> Chaz. wrote:
> 
>>
>> I started out using Spread some time ago (more than 2 years ago). The 
>> implementation was limited to a hundred or so nodes (that is in the 
>> notes on the spread implementation). Secondly it isn't quite so 
>> lightweight as you think (I've measured the performance).
>>
>> It is a very nice system but when it gets to 1000s of machines very 
>> little work has been done on solving many of the problems. My research 
>> on it goes back almost a decade starting out with Horus.
> 
> I must admit to not having attempted to scale it that far, but I was 
> under the impression that only the more expensive delivery modes were 
> that costly. But by the sounds of it, you don't need me to tell you that.
> 

Originally I had started out thinking about work surrounding Horus and 
researched a lot of the group communication stuff. When I got to Spread 
I tried it thinking it would solve all my problems. I actually built a 
system using it only to be sadly disappointed.

First I hit the 100+ node limit. Then I got to the static configuration, 
which I spent time trying to overcome. Finally when I did some 
measurements I decided that 1000s of machines would require a 
"hub-and-spoke" like architecture that Glyph suggested. I decided it was 
much too complicated and backed away.

Since that time I let the pendulum swing to the other extreme - no 
predefined architecture (or aggregation of machines). I want to examine 
what happens when I have 1000s of machines without a topology; can I 
solve the problems. As I said I solved the decentralized membership list 
issue. Now I am on to the harder problem: can I get RPC-like semantics 
with reasonable performance over the 1000s of machines? I don't know.

>>
>> Actually I am part of the IRTF group on P2P, E2E and SAM. I know the 
>> approaches they are being tossed about. I have tried to implement some 
>> of them. I just am not of the opinion that smart people can't find 
>> solutions to tough problems.
> 
> 
> Ok, in which case my apologies. My reading of your posts had lead me to 
> believe, incorrectly, you may not be familiar with the various issues. 
> In that case, you can (should) disregard most of it.
> 

I think the problem is on my part. I asked what I thought was an obvious 
question without laying the groundwork as to what I knew or how.

>>
>> Is multicast or broadcast the right way? I don't know, but I do know 
>> that without trying we will never know. Having been part of the IETF 
> 
> It's clearly right for some things - I'm just not sure how much 
> bi-directional distribution would be helped by it, since you've got at 
> some point to get the replies back.
>

I think I feel comfortable with using multicast (or broadcast) for the 
invoking RPC call. What I don't have a clear feeling for is how to 
correctly handle the response - I know I can't send them all within some 
small delta without congesting the network. So I am looking at all sorts 
of techniques (like holding off the responses, randomly...but I don't 
know how that will impact retries, etc).

>> community for a lot of years (I was part of the group that worked on 
>> SNMP v1 and the WinSock standard), I know that when the "pedal meets 
>> the metal" sometimes you discover interesting things.
> 
> I didn't realise winsock went near the IETF. You learn something new 
> every day.
> 

Me too!

>>>
>>> Knuth and his comments on early optimisation apply here. Have you 
>>> tried it? You might be surprised.
>>>
>>
>> I am sorry to say I don't know the paper or research you are referring 
>> to. Can you point me to some references?
> 
> Sorry, it's a phrase from Donald Knuth's (excellent) three-volume 
> programming book, "The Art of Computer Programming". Highly recommended.
>

Ah, ok. Having read them so many years ago I forgot most of it. lol..

>>
>> Thanks for the information. This is what makes me think that I want 
>> something based on UDP and not TCP! And if I can do RMT (or some 
>> variant of it) I might be able to get better performance. But, as I 
>> said it is the nice thing about not having someone telling me I need 
>> to get a product out the door tomorrow! I have time to experiment and 
>> learn.
> 
> When I wrote my reply I hadn't seen your comment on the app being 
> distributed storage.
> 
>>> How many calls per second are you doing, and approximately what 
>>> volume of data will each call exchange?
>>>
>> This is information I can't provide since the system I have designing 
>> has no equivalent in the marketplace today (either commercial or open 
>> source). All I know is that the first version of the system I built - 
>> using C/C++ and a traditional architecture (a few dozens of machines) 
>> was able to handle 200 transactions/minute (using SOAP). While there 
>> were some "short messages" (less than an normal MTU), I had quite a 
>> few that topped out 50K bytes and some up to 100Mbytes.
> 
> Oh. In which case more or less everything I wrote is useless!
> 

Well I don't think so. Based on your multicast comment I wonder about 
broadcast...have you ever seen the same thing happen? When you say 
"short interruptions" are we talking more than seconds? Can you 
elaborate a bit?

>>>
>>> Successful transmission is really the easy bit for multicast. There 
>>> is IGMP snooping, IGMP querier misbehaviour, loss of forwarding on an 
>>> upstream IGP flap, flooding issues due to global MSDP issues, and so 
>>> forth.
>>>
>>
>> I agree about the successful transmission. You've lost me on the IGMP 
>> part. Can you elaborate as to your thoughts?
> 
> Well, my experience of large multicast IPv4 networks is that short 
> interruptions in multicast connectivity are not uncommon. There are a 
> number of reasons for this, which can be broadly broken down into 1st 
> hop and subsequent hop issues.
> 
> Basically, in a routed-multicast environment, I've seen the subnet IGMP 
> querier (normally the gateway) get pre-empted by badly configured or 
> plain broken OS stacks (e.g. someone running Linux with the IGMPv3 early 
> patches). I've also seen confusion for highly-available subnets (e.g. 
> VRRPed networks) where the IGMP querier and the multicast designated 
> forwarder are different. This can cause issues with the IGMP snooping on 
> the downstream layer2 switches when the DF is no longer on the path 
> which the layer2 snooping builds.
> 
> You also get issues with upstream changes in the unicast routing 
> topology affecting PIM.
> 
> Most of these are only issues with routed multicast. Subnet-local is a 
> lot simpler, though you do still need an IGMP querier and switches with 
> IGMP snooping.
> 
>>
>> I do know I need EXACTLY-ONCE semantics but how and where I implement 
>> them is the unknown. When you use TCP you assume the network provides 
>> the bulk of the solution. I have been thinking that if I use a less 
>> reliable network - one with low overhead - that I can provide the 
>> server part to do the EXACTLY-ONCE piece.
>>
>> As to why I need EXACTLY-ONCE, well if I have to store something I 
>> know I absolutely need to store it. I can't be in the position that I 
>> don't know it has been stored - it must be there.
>>
>> Thanks for the great remarks....I look forward to reading more.
> 
> This makes a lot more sense now I know it's storage related.
> 
> You're right, this is a tricky and uncommon problem.
> 
> Let me see if I've got this right:
> 
> You're building some kind of distributed storage service. Clients will 
> access the storage by a "normal" protocol to one of the nodes. Reads 
> from the store are relatively easy, but writes to the store will need to 
> be distributed to all or a subset of the nodes. Obviously you'll have a 
> mix of lots of small writes and some very large writes.
> 
> Hmm.
> 
> Are you envisioning that you might have >1 storage set on the nodes, and 
> using a different multicast group per storage set to build optimal 
> distribution?
> 
> You might be able to perform some tricks depending on whether this 
> service provides block- or filesystem-level semantics. If it's the 
> latter, you could import some techniques from the distributed version 
> control arena - broadly speaking, node+version number each file and 
> "broadcast" (in the application sense) just the file + 
> newnode+newversion to the other store nodes, and have them lock the 
> local copy and initiate a pull from the updated node.
> 
> For block-level storage, that's going to be a lot harder.
> 
Definitely! Right now I dealing on the filesystem level. Doing block 
level would be incredibly difficult. I am trying to solve the simpler 
problem first! lol.

> For the multicast, something like NORM, which as you probably know is 
> basically forward-error-corrected transmit channel with 
> receiver-triggered re-transmits, would probably work. An implementation 
> would likely be non-trivial, but a fascinating project.
> 

Right now I am trying to find a solution to an interesting problem: how 
to find a file without knowing exactly where it exists in the network. 
You have to do this to make the system scale nicely.

Basically each node holds information about the files (aka objects) it 
stores. I do this so that I don't have a central database any where 
(this allows the system to scale differently. With a central database I 
would have that set of servers scale differently than the storage nodes).

Now I can build a set of machines that are the distributed database 
machines - each storing something - and querying them for where the file 
lives; this would narrow the machines I have to directly talk to, but it 
feels wrong. This is sort of a variation of the hub-and-spoke that Glyph 
talked about. But having said that I am trying to determine if I can get 
away from that and just go to a very unstructured environment (without 
intermediate database nodes).

As I said I have time to experiment before I put the code in the open 
source community ...

Peace,
Chaz