[Twisted-Python] Multicast XMLRPC
eprparadocs at gmail.com
Sat Aug 26 19:25:40 EDT 2006
glyph at divmod.com wrote:
> On Sat, 26 Aug 2006 10:14:48 -0400, "Chaz." <eprparadocs at gmail.com> wrote:
>> Right now I am trying to find a solution to an interesting problem:
>> how to find a file without knowing exactly where it exists in the
>> network. You have to do this to make the system scale nicely.
>> Basically each node holds information about the files (aka objects) it
>> stores. I do this so that I don't have a central database anywhere
>> (this allows the system to scale uniformly; with a central database,
>> that set of servers would scale differently than the storage nodes).
>> Now I can build a set of machines that are the distributed database
>> machines - each storing something - and querying them for where the
>> file lives; this would narrow the machines I have to directly talk to,
>> but it feels wrong. This is sort of a variation of the hub-and-spoke
>> that Glyph talked about. But having said that I am trying to determine
>> if I can get away from that and just go to a very unstructured
>> environment (without intermediate database nodes).
> This sounds an awful lot like a distributed hashtable. It does
> implicitly use an overlay network, but not a hub-and-spoke overlay network.
> I'm not intimately familiar with the algorithms involved, so rather than
> try to describe them, I'll just refer you to the relatively nice
> wikipedia page on the topic:
> There is also a project in Python (not Twisted though) which may serve
> as an example:
> Are these ideas useful? Have you looked at them before?
As I understand DHTs, the concept is to create a hash identifier,
partition it into "chunks", and use the chunks to locate the file. It is
an interesting idea and certainly one approach, and I am keeping it in my
back pocket. But there are several reasons I don't like this approach.
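To make the idea above concrete, here is a minimal sketch of the DHT key
scheme I'm describing: hash a name to an identifier and use the identifier's
position on a sorted ring of nodes to decide who stores the file. The node
names and the use of SHA-1 are my assumptions for illustration, not anything
a particular DHT prescribes.

```python
import hashlib
from bisect import bisect_right

# Hypothetical node names; in a real DHT each node would hash its own address.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def key_for(name: str) -> int:
    """160-bit identifier for a file name or a node address."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

# A sorted ring of (node_key, node_name) pairs.
RING = sorted((key_for(n), n) for n in NODES)

def locate(filename: str) -> str:
    """Return the node responsible for filename: the first node whose
    key is >= the file's key, wrapping around the end of the ring."""
    k = key_for(filename)
    i = bisect_right(RING, (k, ""))
    return RING[i % len(RING)][1]
```

Any node with a copy of the ring can answer `locate()` locally; the
indirection cost in a real DHT comes from no node holding the whole ring.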
First, with a poorly segmented hash, you can have a few levels of
indirection before reaching the file. You can see this in a lot of p2p
file sharing systems. I would like to see if I can overcome this
performance penalty. (Another problem is that DHTs work well in a very
sparse environment, so the hash keys have to be pretty big, which means
more overhead per key.)
The second issue is one unique to data storage systems: I need to have
multiple copies of the file around. So I had thought that if I do a DHT I
will just keep copies all along the lookup path. That should address both
the access-latency problem and the replication problem at once.
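A toy model of "keep copies all along the path": each hop visited during a
lookup caches the file on the way back, so later lookups terminate earlier.
The `stores` mapping and function names here are hypothetical, just to show
the mechanism.

```python
def fetch_with_path_caching(path, filename, stores, origin_data):
    """Walk the lookup path; on a cache hit return early, otherwise
    take the data from the path's final authority. Either way, leave
    a copy at every hop visited before the hit.

    path: list of node names in lookup order.
    stores: dict mapping node name -> {filename: data}.
    origin_data: the file's contents at the authoritative node.
    """
    visited = []
    for node in path:
        if filename in stores[node]:
            data = stores[node][filename]
            break
        visited.append(node)
    else:
        data = origin_data  # no hop had it; final node is the authority
    for node in visited:
        stores[node][filename] = data  # replicate along the path
    return data
```

After one fetch, a repeat lookup along the same path hits at the first hop,
which is exactly the latency win (and the extra copies) described above.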
The third issue - and this one I had more difficulty grasping - is that
once an intermediate node disappears, its contents have to be passed on
to someone else. Also, the link from the prior node to this one (the one
going away) has to be adjusted. What is the problem? It is quite
possible that the node would have millions of files on it, so copying
it all at departure time is impractical. That means I have to keep exact
copies at multiple sites, at the same time (a set definitely smaller than
the entire space of all nodes).
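One common way to pre-position those copies (so a departure never forces a
bulk transfer) is to keep each key on the next few nodes clockwise on the
ring; when one of them vanishes, the data is already on its successors. This
is a sketch of that idea, not from the original post; the ring layout and
replica count are assumptions.

```python
from bisect import bisect_right

def replica_nodes(ring, file_key, replicas=3):
    """ring: sorted list of (node_key, node_name) pairs.
    Return the `replicas` nodes clockwise from the file's position;
    all of them hold a copy, so losing any one of them loses nothing."""
    i = bisect_right(ring, (file_key, ""))
    return [ring[(i + j) % len(ring)][1] for j in range(replicas)]
```

The cost is that every write fans out to `replicas` nodes up front, which is
the "exact copies at multiple sites, at the same time" trade-off.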
But the real problem is that in a network of 1000s of machines it is
quite possible that the two I am using to store indices on can disappear
at the same time (granted, a small chance, but still a problem). So I opted to
look at another approach, the one that I started talking about - using
broadcast or multicast with some sort of RPC-like mechanism and light
weight protocol applied over a lot of machines.
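As a rough sketch of what that lightweight multicast RPC could look like: the
querier sends one self-describing datagram to a multicast group, and every
node that holds the file replies directly to the sender. The group address,
port, and JSON wire format below are my assumptions, not a protocol the
original post specifies.

```python
import json
import socket
import struct
import uuid

MCAST_GROUP = "239.255.42.42"   # assumed site-local multicast group
MCAST_PORT = 9999               # assumed port

def encode_query(filename: str) -> bytes:
    """Build a locate request; the unique id lets the sender match
    (and deduplicate) the replies that come back."""
    return json.dumps({"id": str(uuid.uuid4()),
                       "op": "locate",
                       "file": filename}).encode()

def make_listener() -> socket.socket:
    """UDP socket joined to the group; each storage node would run one
    and answer queries for files it holds."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", MCAST_PORT))
    mreq = struct.pack("4sl", socket.inet_aton(MCAST_GROUP), socket.INADDR_ANY)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return s
```

The attraction is that the querier needs no routing state at all: one
datagram reaches every candidate node, and silence simply means "not here".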
This approach hasn't been well researched, and is almost always dismissed
out of hand. I decided it was at least worth investigating. It solves some
problems, like scalability and easy management. The downside is that I
have to worry about building a lightweight protocol and handling RPC with
AT-LEAST-ONCE semantics instead of EXACTLY-ONCE.
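AT-LEAST-ONCE delivery means a node may see the same request twice (e.g.
after a retransmit), so the usual mitigation is to make handlers idempotent
by remembering request ids. A minimal sketch of that technique, with
hypothetical names:

```python
class IdempotentHandler:
    """Caches replies by request id so that a re-delivered request
    returns the original reply instead of re-running the side effect."""

    def __init__(self):
        self._seen = {}

    def handle(self, request_id, payload, do_work):
        if request_id in self._seen:
            return self._seen[request_id]   # duplicate: replay cached reply
        reply = do_work(payload)            # first delivery: do the work once
        self._seen[request_id] = reply
        return reply
```

With handlers written this way, AT-LEAST-ONCE transport gives the caller
effectively-once behavior, which is usually all an RPC layer needs.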
Glyph, thanks for the references. I will definitely look up 'thecircle'
stuff. That one I didn't know about!