[Twisted-Python] Send many large files with PB

Tue May 16 19:24:24 MDT 2006

Hi,

I'm using PB in a distributed application that has suddenly grown the
requirement to copy directories of files between the servers.

From lurking on the mailing list archives, it seems that the best way
to move large amounts of data between Twisted servers is to use a
twisted.spread.util.Pager sub-class to pipe the data.  Between that
information and the "How to use twisted pb pager" [1] document, I'm
probably good to go on how to transfer large amounts of data.

However, I wanted to step back and ask what's the best way to actually
package the files that I'm going to send.  To make things concrete,
suppose I need to send data from SRC to DEST and that SRC has a PB
RemoteReference to DEST.  Also, most files will be huge (gigabytes)
and nested in directories.

- Should I send the files from SRC to DEST one-by-one?  That is, make
  a new PB request for a new Pager reference for each file, stream the
  file using a twisted.spread.util.FilePager instance, then repeat
  with the next file, and so on.  This has the advantage that I think
  I can do it fairly easily, but has the disadvantage of requiring
  many PB calls (with the associated bookkeeping in my application).

- Or, is it better to use something like the tarfile module to create
  a stream of bytes that I stream to the other side and decode?
  There's something appealing about using tarfile--it's like the
  oft-seen "tar -cf - . | ssh user@host 'tar -xf -'" way of
  transferring files.  Plus, the tarfile module takes care of making
  directories and the like for me.

  This method has the advantage of a single PB call, but the
  disadvantage that I can't quite figure out how to use tarfile with
  Twisted.  The tarfile module requires a file-like object to stream
  to or stream from.  I don't think the naive approach of just adding
  a write() method to a Pager or a read() method to a
  CallbackPageCollector will work without taking up all of the memory
  in my system or blocking in some way.

- Finally, should I be doing something completely different?
  Normally, outside of my application, I'd just use rsync, scp, or
  some such.  However, the users of this application don't know how to
  use these tools.  I can't spawn these programs without getting into
  authentication issues between the machines.  Doing this within
  Twisted seems like a good idea because the machines are already
  authenticated to each other through PB, but I could be wrong.
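To make the tarfile option more concrete, here is a rough sketch of what I have in mind, using only the standard library.  It relies on tarfile's stream modes ("w|" and "r|"), which only ever call write() or read() on the object they are given and never seek.  The names QueueWriter, QueueReader, pack, copy_tree, and CHUNK_LIMIT are made up for illustration, and the bounded queue is only a stand-in for the PB connection.  In a real Twisted process I assume the packing would have to run in a thread and the chunks would be forwarded by a Pager or by remote calls; none of that wiring is shown here.

```python
import queue
import tarfile
import threading

CHUNK_LIMIT = 8  # hypothetical cap on the number of chunks held in memory


class QueueWriter:
    """File-like sink: tarfile calls write(), chunks go onto a bounded queue."""

    def __init__(self, chunks):
        self.chunks = chunks

    def write(self, data):
        self.chunks.put(bytes(data))  # blocks while the queue is full
        return len(data)


class QueueReader:
    """File-like source: tarfile calls read(size), data comes from the queue."""

    def __init__(self, chunks):
        self.chunks = chunks
        self.buffer = b""
        self.eof = False

    def read(self, size):
        while len(self.buffer) < size and not self.eof:
            chunk = self.chunks.get()
            if chunk is None:  # sentinel: the sender is finished
                self.eof = True
            else:
                self.buffer += chunk
        data, self.buffer = self.buffer[:size], self.buffer[size:]
        return data


def pack(src_dir, chunks):
    """Runs in a worker thread: tar src_dir into the queue, then signal EOF."""
    try:
        with tarfile.open(mode="w|", fileobj=QueueWriter(chunks)) as tar:
            tar.add(src_dir, arcname=".")
    finally:
        chunks.put(None)


def copy_tree(src_dir, dest_dir):
    """Stream src_dir into dest_dir through the bounded queue."""
    chunks = queue.Queue(maxsize=CHUNK_LIMIT)
    worker = threading.Thread(target=pack, args=(src_dir, chunks), daemon=True)
    worker.start()
    reader = QueueReader(chunks)
    with tarfile.open(mode="r|", fileobj=reader) as tar:
        tar.extractall(dest_dir)
    while not reader.eof:  # drain trailing padding so the worker can exit
        reader.read(1 << 16)
    worker.join()
```

Calling copy_tree("/some/src", "/some/dest") would recreate the directory tree on the receiving side while holding only a handful of chunks in memory at a time.  The catch is that both write() and read() block, which is exactly what I can't do in the reactor thread, so this only shows the shape of the data flow and not a working Twisted integration.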

I apologize if this is rambling.  I've been thinking about this for
a while and am now a bit bleary-eyed.
  --Justin

[1] http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/457670