[Twisted-Python] Uploading multiple files using ftpclient in Twisted

David Bolen db3l.net at gmail.com
Sat Jul 10 18:56:20 EDT 2010


Jaepyoung Kim <jaepyoung.kim at gmail.com> writes:

> The current script uploads using ftplib and takes about 1 hour.
> I want to change this script to use Twisted's asynchronous functions.
> I thought that if I used an asynchronous function in Twisted like the
> following, then the file uploads would be executed in parallel.
> But they were executed sequentially.  Uploading the second file starts
> only after the first file upload completes.
> Could you check what is wrong in my source code?  Or am I wrong in my
> understanding of asynchronous functions?

I'm pretty sure you'll need separate connections to an FTP server to
achieve parallel transfers, regardless of how you write the client,
at least as long as you stick with regular get/put commands.  So while
a Twisted approach makes it easy to manage those parallel streams,
you'll still need a distinct connection for each transfer, and your
code will have to track which file transfer is using which connection.

Essentially, a store or fetch FTP operation initiates a transfer over
the dedicated data channel, so that channel is in use until the
transfer completes or is aborted.  The data on the data channel is
neither encapsulated nor multiplexed in any way, so only a single
transfer can use the data channel at once.  Passive transfers do
create new data channels, but the FTP protocol specifically says a
server needs to stop accepting connections and shut down any open
connections on old passive ports once a new passive request is
received, so you're still limited to one transfer at a time.

Thus, your callbacks for each store operation will only fire when the
store has completed, and you'll only be able to initiate the next
store request at that point, since it's only then that the channel to
the server is free to transfer another file.
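
To make that concrete, here's a rough, untested sketch of a single
upload using twisted.protocols.ftp.FTPClient (host, credentials and
filenames are just placeholders).  The second Deferred returned by
storeFile() is the one that fires when the transfer finishes, so a
second upload on the same connection can only be started from that
point:

    from twisted.internet import reactor
    from twisted.internet.protocol import ClientCreator
    from twisted.protocols.ftp import FTPClient
    from twisted.protocols.basic import FileSender

    def upload(client, local_path, remote_path):
        # storeFile() returns two Deferreds: the first fires with a
        # consumer once the data channel is open, the second fires
        # when the server acknowledges that the STOR has completed.
        d_consumer, d_done = client.storeFile(remote_path)

        def push(consumer):
            fobj = open(local_path, 'rb')
            d = FileSender().beginFileTransfer(fobj, consumer)
            def cleanup(result):
                consumer.finish()  # end of upload on the data channel
                fobj.close()
                return result
            return d.addBoth(cleanup)

        d_consumer.addCallback(push)
        return d_done

    def go(client):
        # The second upload can only begin once d_done for the first
        # has fired, which is exactly the sequential behaviour
        # described above.
        d = upload(client, 'first.dat', 'first.dat')
        d.addCallback(lambda _: upload(client, 'second.dat', 'second.dat'))
        d.addCallback(lambda _: client.quit())
        d.addBoth(lambda _: reactor.stop())
        return d

    ClientCreator(reactor, FTPClient, 'user', 'secret',
                  passive=True).connectTCP('ftp.example.com', 21).addCallback(go)
    reactor.run()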

I believe some servers have custom extensions that allow parallel
operations at a finer-grained level than a single file, but I don't
think they're commonly supported by FTP libraries (nor by the servers
commonly in use).

What I'd suggest, in terms of your code, is to instantiate a pool of
FTPClients connected to the same server, initiate transfers on them in
parallel, and then, as each one completes, use it to pick up the next
file.  You'll need to handle the distribution of files amongst the
pool of clients yourself.
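
Reusing the upload() helper from the sketch above, such a pool might
look roughly like this (again untested; the pool size, host,
credentials and file list are placeholders).  Each connection pulls
the next file off a shared queue as soon as its current transfer
finishes:

    from twisted.internet import reactor, defer
    from twisted.internet.protocol import ClientCreator
    from twisted.protocols.ftp import FTPClient

    POOL_SIZE = 3                                 # parallel control connections
    FILES = ['a.dat', 'b.dat', 'c.dat', 'd.dat']  # placeholder file list

    def drain(client, queue):
        # Pull files off the shared queue until it is empty, then QUIT.
        if not queue:
            return client.quit()
        path = queue.pop(0)
        d = upload(client, path, path)   # upload() from the sketch above
        d.addCallback(lambda _: drain(client, queue))
        return d

    def main():
        queue = list(FILES)
        workers = []
        for _ in range(min(POOL_SIZE, len(queue))):
            cc = ClientCreator(reactor, FTPClient, 'user', 'secret',
                               passive=True)
            d = cc.connectTCP('ftp.example.com', 21)
            d.addCallback(drain, queue)
            workers.append(d)
        # Stop the reactor once every connection has drained its share.
        defer.DeferredList(workers).addBoth(lambda _: reactor.stop())

    main()
    reactor.run()

Error handling is omitted; in real code you'd want errbacks on each
worker so that one failed transfer doesn't stall the whole queue.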

Is there any particular reason you expect this to yield an improvement
in overall time?  Unless you're transferring very large numbers of
files that are very small relative to the bandwidth*latency of your
network connection to the server (which doesn't sound like the case
here), the overhead of the protocol itself will be quite small, and
your bottleneck is going to be either the network throughput or the
slower of the disk I/O on either end.

Neither of those bottlenecks is likely to be improved by doing
multiple transfers in parallel, and in fact your total time can worsen
if the bottleneck was disk I/O, since you'll have competing operations
on the disks as opposed to simple sequential access.  Or you may find
that you get a very marginal benefit at the expense of code that is
much more complicated to maintain.

You might grab an existing FTP client that supports parallel transfers
and use it to run some tests before trying to re-implement things
yourself.  There are several options; for example, I believe FileZilla
supports it under Windows, and lftp does under Linux.

-- David
