[Twisted-Python] Memory usage in large file transfers

Andrew Bennetts andrew-twisted at puzzling.org
Mon Dec 1 07:11:51 EST 2003


On Mon, Dec 01, 2003 at 11:25:13AM +0200, Nikolaos Krontiris wrote:
>    Hi there.
>    I am writing a file transfer program using twisted as my framework.
>    I have been having some problems as far as memory usage is concerned
>    (i.e. both client and server just eat through available memory without
>    ever releasing it back to the kernel while transferring data). I am aware
>    that in theory, the client and server will consume at least as much memory
>    as the file to be transferred, but this memory should also be made
>    available to the O/S after the operation has completed.
>    I also use a garbage collector, which makes things just marginally better
>    and the only TWISTED operations I use are a few transport write and
>    callLater commands.

You don't say how large "large" is, but you should probably be using the
producer/consumer APIs rather than plain transport.write(data).  See
twisted.protocols.basic.FileSender for an example.  This technique
doesn't require holding the entire file in memory to transfer it, so if
I'm understanding your problem correctly, you should see a significant
improvement.
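
For example, a server protocol that streams a file with FileSender
might look something like this (an untested sketch; the class name and
filename are made up):

    from twisted.internet import protocol
    from twisted.protocols import basic

    class FileSendingProtocol(protocol.Protocol):
        """Send one file to whoever connects, then hang up."""

        path = "bigfile.bin"   # hypothetical: the file to serve

        def connectionMade(self):
            self._file = open(self.path, "rb")
            sender = basic.FileSender()
            # FileSender registers itself as a producer with the
            # transport and feeds it one chunk at a time, only when
            # the transport is ready for more, so memory use stays
            # roughly constant regardless of file size.
            d = sender.beginFileTransfer(self._file, self.transport)
            d.addBoth(self._finished)

        def _finished(self, resultOrFailure):
            self._file.close()
            self.transport.loseConnection()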

I'm not sure what you mean about using a garbage collector -- Python
automatically cleans up objects with zero reference counts, and periodically
finds and collects unreachable object cycles.
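
If you mean you're calling the gc module by hand: an explicit
gc.collect() rarely frees much that reference counting hasn't already,
but it can at least tell you whether cycles are involved.  A minimal
check (a sketch, assuming Python 2.x's gc module) might look like:

    import gc

    n = gc.collect()          # force a full cycle-collection pass
    print "collected %d cyclic objects" % n
    print "uncollectable:", gc.garbage  # cycles gc can't free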

>    The only culprits responsible for this I can imagine to be a difference
>    between the hardcoded buffer sizes in TWISTED and the amount of data I
>    send (I send 64kB of data per request for faster delivery in LANs) and/or
>    possibly that this memory lost is in many small chunks of data -- in this
>    case no O/S can free this data, since there are always limits only above
>    which the kernel will deem an amount of memory worth the trouble to be
>    released (I think glibc has around a 2MB limit)...

Memory fragmentation can prevent the OS from reclaiming memory, but
generally you'd expect memory growth to slow as it asymptotically
approaches a limit high enough to accommodate all the allocations for
your load, even with fragmentation.

I believe Python 2.3's pymalloc allocates memory for different types in
different "arenas", which are separately mmapped, so fragmentation in
e.g. the string arena (strings being the type that holds data read from
files, split up, sent over the network, etc.) hopefully won't impact
other memory allocation.  So with 2.3 vs. 2.2 (or earlier) you should
see... different memory use characteristics.  Hopefully better, but you
never know :)

Also, transport.write and callLater in 64kB chunks is unlikely to be
the fastest or most memory-efficient technique.  Producers/consumers
should be best, but I suspect that even a single transport.write of the
entire contents would probably do better.  Actual benchmarks to support
this claim would be very welcome!
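
Just guessing at what your 64kB send loop looks like (a hypothetical
reconstruction, not your actual code), the trouble with a pattern like
this is that each write only appends to Twisted's internal send buffer:

    from twisted.internet import reactor

    CHUNK_SIZE = 64 * 1024

    def sendNextChunk(transport, infile):
        # Without producer/consumer flow control, transport.write()
        # just queues the data; if the file is read faster than the
        # socket drains, the unsent bytes pile up in memory.
        data = infile.read(CHUNK_SIZE)
        if data:
            transport.write(data)
            reactor.callLater(0, sendNextChunk, transport, infile)
        else:
            infile.close()
            transport.loseConnection()

A producer (like FileSender above) avoids this because the reactor only
asks it for more data once the transport's buffer has drained.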

>    As professional network programmers, do you believe my diagnosis is
>    correct? Have you encountered such problems in the past? Are there
>    workarounds for this?

I really can't say.  You've given no specific data at all... How large are
the files?  How much memory does your server appear to lose per request?
How much memory does the server take overall (both initially and after
running for a while)?  How many concurrent requests are you dealing with?
What platform, version of Python, and version of Twisted?  Anything else you
think is relevant?  :)

If you could answer some of these questions, we could tell you whether
what you're seeing is expected behaviour or unusual, and perhaps
suggest specific remedies.

-Andrew.