[Twisted-Python] Memory usage in large file transfers

Tue Dec 2 08:46:07 EST 2003

Hi again.

----- Original Message ----- 
From: "Andrew Bennetts" <andrew-twisted at puzzling.org>
To: <twisted-python at twistedmatrix.com>
Sent: Monday, December 01, 2003 2:11 PM
Subject: Re: [Twisted-Python] Memory usage in large file transfers

> On Mon, Dec 01, 2003 at 11:25:13AM +0200, Nikolaos Krontiris wrote:
> >    Hi there.
> >    I am writing a file transfer program using twisted as my framework.
> >    I have been having some problems as far as memory usage is concerned
> >    (i.e. both client and server just eat through available memory without
> >    ever releasing it back to the kernel while transferring data). I am
> >    aware that in theory, the client and server will consume at least as
> >    much memory as the file to be transferred, but this memory should also
> >    be made available to the O/S after the operation has completed.
> >    I also use a garbage collector, which makes things just marginally
> >    better, and the only TWISTED operations I use are a few transport write
> >    and callLater commands.
>
> You don't say how large "large" is, but you probably should be using
> producer/consumer APIs rather than just plain transport.write(data).  See
> twisted.protocols.basic.FileSender for an example.  If I'm understanding
> your problem correctly, you should see a significant improvement.  This
> technique doesn't require holding the entire file in memory to transfer it.
>
> I'm not sure what you mean about using a garbage collector -- Python
> automatically cleans up objects with zero reference counts, and
> periodically finds and collects unreachable object cycles.
>
nk: The size of files I am referring to can be anything from 20MB up to
500MB, but right now I'm taking it easy with the client/server model; I'm
sending a single 43MB file, and as I debug and improve performance, I will
increase this file size...
nk: I had originally thought about using basic.FileSender, but a) it has been
commented as unstable by the Twisted development team and b) I need to send a
client ID each time I send a single buffer (security... what can you
say...). To make sure that I'm not holding the entire file's contents in
memory, I read (at most) 64K of the file each time, and send this data
away. After it has been sent, this data buffer is flushed. I guess I can try
to change this to file.open, file.seek, file.read and file.close each time I
read the file, so that the only contents of the file in system memory are
the ones necessary...
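
A minimal sketch of the read-one-buffer-at-a-time approach described above,
using only the standard library. The frame layout (a 4-byte client ID followed
by a 4-byte payload length) and the names HEADER and iter_frames are
assumptions made for illustration; they are not part of Twisted or of the
program discussed in this thread.

```python
import struct

# Hypothetical frame header: 4-byte client ID, then 4-byte payload length,
# both unsigned and in network byte order.
HEADER = struct.Struct("!II")


def iter_frames(path, client_id, chunk_size=64 * 1024):
    """Yield framed chunks of the file, reading one chunk at a time.

    The file is opened once and read sequentially, so at most one chunk
    (plus its 8-byte header) is built per iteration.
    """
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            yield HEADER.pack(client_id, len(chunk)) + chunk
```

Each yielded frame could then be handed to transport.write() as the previous
one drains. Because the generator keeps the file open and advances through it,
there should be no need to reopen and seek for every buffer.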
nk: When talking about the garbage collector, I'm just referring to Python's
gc.enable() and gc.collect() commands, nothing more... Unfortunately I don't
believe that the built-in periodic find-and-collect of unreachable object
cycles is very useful in the case of the client, since it shuts down after
the file's EOF...
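
To illustrate the distinction Andrew draws between reference counting and
cycle collection, here is a small sketch. It assumes CPython, where an object
is freed as soon as its reference count reaches zero; the Chunk class and both
function names are made up for this example.

```python
import gc
import weakref


class Chunk:
    """Stand-in for a 64K data buffer (hypothetical helper class)."""

    def __init__(self, size):
        self.data = bytearray(size)


def plain_buffer_is_freed_on_del():
    # No cycle: in CPython, dropping the last reference frees the object
    # immediately, without any call to gc.collect().
    chunk = Chunk(64 * 1024)
    probe = weakref.ref(chunk)
    del chunk
    return probe() is None


def cyclic_buffer_needs_collector():
    # A self-referencing object survives "del" until the cycle collector runs.
    was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic collector from interfering
    try:
        chunk = Chunk(64 * 1024)
        chunk.me = chunk  # create a reference cycle
        probe = weakref.ref(chunk)
        del chunk
        alive_before = probe() is not None
        gc.collect()
        return alive_before and probe() is None
    finally:
        if was_enabled:
            gc.enable()
```

If the transfer buffers are not part of reference cycles, explicit
gc.collect() calls would not be expected to change how much memory they hold.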
> >    The only culprits responsible for this I can imagine to be a
> >    difference between the hardcoded buffer sizes in TWISTED and the
> >    amount of data I send (I send 64KB of data per request for faster
> >    delivery in LANs) and/or possibly that this memory lost is in many
> >    small chunks of data -- in this case no O/S can free this data, since
> >    there are always limits only above which the kernel will deem an
> >    amount of memory worth the trouble to be released (I think glibc has
> >    around a 2MB limit)...
>
> Memory fragmentation can prevent the OS reclaiming memory, but generally
> you'd expect memory growth to slow as it asymptotically reaches a high
> enough limit to accommodate all memory allocations for your load, even with
> fragmentation.
>
> I believe Python 2.3's pymalloc allocates memory for different types in
> different "arenas", which are separately mmapped, so fragmentation in e.g.
> the string arena (strings being the type that is read from files, split
> up, sent over the network, etc) hopefully won't impact other memory
> allocation.  So 2.3 vs. 2.2 (or earlier) you should see... different
> memory use characteristics.  Hopefully better, but you never know :)
>
> Also, transport.write and callLater in 64kB chunks is unlikely to be the
> fastest or most memory-efficient technique.  Producers/consumers should be
> best, but I'd suspect that even a single transport.write of the entire
> content would probably be better.  Actual benchmarks to support this claim
> would be very welcome!
>
> >    As professional network programmers, do you believe my diagnosis is
> >    correct? Have you encountered such problems in the past? Are there
> >    workarounds for this?
>
> I really can't say.  You've given no specific data at all... How large are
> the files?  How much memory does your server appear to lose per request?
> How much memory does the server take overall (both initially and after
> running for a while)?  How many concurrent requests are you dealing with?
> What platform, version of Python, and version of Twisted?  Anything else
> you think is relevant?  :)
>
nk: Right now, I'm testing the server/client model with a 43MB file. The
memory consumed on a WinMe system using Python 2.3.2 and Twisted 1.1.0 is
58MB with a 64K buffer, and around 80MB with a 4K buffer. On Linux using
Python 2.3.2 and Twisted 1.1.0, the memory consumed with a 4K buffer is
always a bit more than 100MB. I can't use very large buffers on my Linux
system, because of the ID I have to send with each buffer. It seems that the
Linux default SOL_SOCKET, SO_RCVBUF sizes are relatively small, so the
receiving side gets packets of different sizes and can no longer tell where
the client ID is... Note that these results are for 1 server and 1 client.
I have not yet dared to try 2 concurrent clients at once!
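
TCP delivers a byte stream, so the sizes seen by the receiver need not match
the sizes passed to each write. One common way to cope is to prefix every
buffer with its length and reassemble on the receiving side. The sketch below
assumes a hypothetical header of a 4-byte client ID plus a 4-byte payload
length; FrameDecoder and HEADER are illustrative names, not Twisted APIs.

```python
import struct

# Hypothetical frame header: 4-byte client ID, then 4-byte payload length,
# both unsigned and in network byte order.
HEADER = struct.Struct("!II")


class FrameDecoder:
    """Reassemble (client_id, payload) frames from arbitrarily split data."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data):
        """Add received bytes; return the list of frames completed so far."""
        self._buf += data
        frames = []
        while len(self._buf) >= HEADER.size:
            client_id, length = HEADER.unpack_from(self._buf)
            end = HEADER.size + length
            if len(self._buf) < end:
                break  # payload not fully received yet; wait for more data
            frames.append((client_id, bytes(self._buf[HEADER.size:end])))
            del self._buf[:end]  # drop the consumed frame from the buffer
        return frames
```

In a Twisted protocol, feed() would typically be called from dataReceived()
with whatever bytes arrive. With framing like this, the sender's buffer size
would no longer need to match the receiver's socket buffer size.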
nk: The server consumes only 3MB of memory while idle. Unfortunately, I
cannot tell if the erratic memory consumption lies on the server or client
side (or both), since I only have 1 PC...
> If you could answer some of these sorts of questions, we could maybe tell
> you if what you're seeing is expected behaviour, or unusual, and maybe
> suggest specific remedies.
>
> -Andrew.