[Twisted-Python] Send many large files with PB
Justin Mazzola Paluska
jmp at MIT.EDU
Tue May 16 19:24:24 MDT 2006
Hi,
I'm using PB in a distributed application that has suddenly grown the
requirement to copy directories of files between the servers.
From lurking on the mailing list archives, it seems that the best way
to move large amounts of data between Twisted servers is to use a
twisted.spread.util.Pager sub-class to pipe the data. Between that
information and the "How to use twisted pb pager" [1] document, I'm
probably good to go on how to transfer large amounts of data.
However, I wanted to step back and ask what's the best way to actually
package the files that I'm going to send. To make things concrete,
suppose I need to send data from SRC to DEST and that SRC has a PB
RemoteReference to DEST. Also, most files will be huge (gigabytes)
and nested in directories.
- Should I send the files from SRC to DEST one-by-one? That is, make
a new PB request for a new Pager reference for each file, stream the
file using a twisted.spread.util.FilePager instance, then repeat
with the next file, and so on. This has the advantage that I think
I can do it fairly easily, but has the disadvantage of requiring
many PB calls (with the associated bookkeeping in my application).
- Or, is it better to use something like tarfile module to create a
stream of bytes that I stream to the other side and decode? There's
something appealing about using tarfile--it's like the oft-seen "tar
-cf - dir | ssh user@host 'tar -xf -'" way of transferring files. Plus,
the tarfile module takes care of making directories and the like for
me.
This method has the advantage of a single PB call, but the
disadvantage that I can't quite figure out how to use tarfile with
Twisted. The tarfile module requires a file-like object to stream
to or stream from. I don't think the naive approach of just adding
a write() method to a Pager or a read() method to a
CallbackPageCollector will work without taking up all of the memory
in my system or blocking in some way.
- Finally, should I be doing something completely different?
Normally, outside of my application, I'd just use rsync, scp, or
some such. However, the users of this application don't know how to
use these tools. I can't spawn these programs without getting into
authentication issues between the machines. Doing this within
Twisted seems like a good idea because the machines are already
authenticated to each other through PB, but I could be wrong.
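To make the tarfile option above concrete, here is a minimal sketch of what
I have in mind, using only the standard library. It relies on tarfile's
non-seekable stream modes ("w|" and "r|"), which only need write() or read()
on the file object and work in fixed-size blocks, so memory use should stay
bounded. The names ChunkSink, ChunkSource, pack_directory, unpack_stream and
demo are made up for illustration, and the in-memory list in demo() merely
stands in for whatever PB transport would carry the chunks. The calls are
still synchronous, so inside a reactor they would presumably have to run in
a thread (for example via deferToThread); that part is not shown.

```python
import os
import tarfile
import tempfile


class ChunkSink:
    """Write-only file-like object that hands each chunk to a callback."""

    def __init__(self, send_chunk):
        # send_chunk is a hypothetical hook, e.g. "queue this page for PB".
        self.send_chunk = send_chunk

    def write(self, data):
        if data:
            self.send_chunk(bytes(data))
        return len(data)


class ChunkSource:
    """Read-only file-like object fed from an iterator of byte chunks."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._buffer = b""

    def read(self, size=-1):
        # Pull chunks until the request can be satisfied or input runs out.
        while size < 0 or len(self._buffer) < size:
            try:
                self._buffer += next(self._chunks)
            except StopIteration:
                break
        if size < 0:
            size = len(self._buffer)
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data


def pack_directory(src_dir, send_chunk):
    """Tar src_dir recursively, emitting the archive block by block."""
    with tarfile.open(fileobj=ChunkSink(send_chunk), mode="w|") as tar:
        tar.add(src_dir, arcname=".")


def unpack_stream(chunks, dest_dir):
    """Recreate the directory tree under dest_dir from the chunk stream."""
    with tarfile.open(fileobj=ChunkSource(chunks), mode="r|") as tar:
        # With untrusted peers, an extraction filter (filter="data" on
        # Python 3.12+) would be worth considering.
        tar.extractall(dest_dir)


def demo():
    """Round-trip a small nested directory through an in-memory chunk list."""
    with tempfile.TemporaryDirectory() as src, \
            tempfile.TemporaryDirectory() as dest:
        os.makedirs(os.path.join(src, "sub"))
        with open(os.path.join(src, "sub", "data.bin"), "wb") as f:
            f.write(b"x" * 50000)
        chunks = []  # stands in for the network between SRC and DEST
        pack_directory(src, chunks.append)
        unpack_stream(chunks, dest)
        with open(os.path.join(dest, "sub", "data.bin"), "rb") as f:
            return len(chunks), f.read()
```

Calling demo() packs a temporary directory into a list of chunks and unpacks
it elsewhere. In a real setup the sink's callback would feed the pager on
SRC, and the source's iterator would be driven by the pages arriving on DEST.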
I apologize if this is rambling. I've been thinking about this for
a while and am now a bit bleary-eyed.
--Justin
[1] http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/457670