[Twisted-Python] debugging a memory leak

Wed Feb 24 16:57:16 EST 2010

Desperate after failing to find the real memory leak on the production server,
I wrote a test server whose RSS memory I can push arbitrarily high. I am far
from sure that this is the same leak that I observe in production, but I would
like to understand what this one is.

This is the server code:

Server.py:

import twisted.protocols.basic
from twisted.internet.protocol import Factory
from twisted.internet import reactor

class HiRate(twisted.protocols.basic.LineOnlyReceiver):
        MAX_LENGTH = 20000

        def lineReceived(self, line):
                if line == 'get':
                        out = 'a'*4000+'\r\r' 
                        self.transport.write(out)

factory = Factory()
factory.protocol = HiRate

reactor.listenTCP(8007, factory, backlog=50, interface='10.18.0.2')
reactor.run()   

This server has to be flooded with "get" requests from this client:

Client.py:

import socket, time

HOST='10.18.0.2'
PORT=8007
def client():
    """high rate client, needs a dedicated CPU to run"""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.connect((HOST,PORT))
    except socket.error, e:
        print 'client error is %s' %e
        return
    n=0
    while 1:
        #print "iter %s" %n
        #time.sleep(.001)
        s.send('get\r\n')
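        # Switch to non-blocking mode and drain whatever has already arrived.
        s.setblocking(0)
        try:
            r = s.recv(1024)
        except socket.error:
            # Nothing to read yet (the non-blocking recv would block).
            r = 0
        while r:
            #print r
            try:
                r = s.recv(1024)
            except socket.error:
                r = 0
        # Back to blocking mode for the next send.
        s.setblocking(1)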
        n+=1

client()

To reproduce the memory leak, I need either two machines with a fast LAN
between them (since the client program takes 100% CPU), or possibly one
machine with a dual-core CPU (I have not tried that). It is important that
client.py has a CPU to itself.
When the response from the server is long enough (out = 'a'*4000+'\r\r';
4000 is enough in my case), the RSS of the server process grows without
bound.
If you introduce a small delay in the client (uncomment time.sleep(.001)),
the leak does not occur.

Looking at tcpdump on the server machine, I sometimes see many "get" packets
in a row from the client that are not followed by response packets from the
server with payload 'aaaaa...'. The memory seems to grow without bound only
while the server is in this "overwhelmed" state.
I first thought it might be an issue of an unbounded send queue on the
server, but examining Send-Q with netstat shows that Send-Q saturates at a
ceiling value, while the RSS memory of the server process continues to grow.
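
One hypothesis that would be consistent with these observations, and that is
not verified anywhere in this thread, is that responses are queued in a
userspace buffer inside the server process before they reach the kernel send
queue. The toy model below only illustrates that idea: it is not Twisted
code, and the function name simulate, its parameters and all the numbers are
made up for the illustration.

```python
def simulate(ticks, produced_per_tick, drained_per_tick, kernel_limit):
    """Toy model of a capped kernel send queue fed by an unbounded buffer.

    Each tick the application queues produced_per_tick bytes in a userspace
    buffer, the peer reads up to drained_per_tick bytes from the kernel
    queue, and the application then moves as much as fits from the userspace
    buffer into the kernel queue (capped at kernel_limit bytes).

    Returns a list of (kernel_queue, user_buffer) sizes, one pair per tick.
    """
    user_buffer = 0
    kernel_queue = 0
    history = []
    for _ in range(ticks):
        # The application writes one response into its own buffer.
        user_buffer += produced_per_tick
        # The peer reads some bytes out of the kernel queue.
        kernel_queue -= min(kernel_queue, drained_per_tick)
        # Flush whatever fits into the remaining kernel queue space.
        moved = min(user_buffer, kernel_limit - kernel_queue)
        user_buffer -= moved
        kernel_queue += moved
        history.append((kernel_queue, user_buffer))
    return history


# Hypothetical numbers: 4002-byte responses, a peer that reads 1024 bytes
# per tick, and a 64 kB kernel queue.
fast = simulate(100, 4002, 1024, 65536)
# Same setup, but the application produces less than the peer can read.
slow = simulate(100, 1000, 1024, 65536)
```

In this model the kernel queue (the analogue of Send-Q) stops at its limit
while the userspace buffer keeps growing, and the growth disappears once the
producer is slower than the reader. Whether the real server behaves this way
would have to be checked separately.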

Here are some commands I used to watch the parameters of the server:
Watch Send-Q and Recv-Q:
root$ watch -n1 netstat -an
RSS memory of the server:
root$ watch -n1 ps -orss -p`netstat -nlp | grep :8007 | awk '{print $7}' | cut -d/ -f1`
Traffic to/from the server (in my case I use eth1 for the LAN to the client):
root$ tcpdump -A -s10024 -nn -i eth1 'port 8007'
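
The RSS of the server can also be sampled from Python instead of through the
ps/netstat pipeline. This is only a sketch, assuming Linux, where
/proc/<pid>/status contains a VmRSS line; the function names
parse_vm_rss_kb, read_rss_kb and watch_rss are made up for illustration.

```python
import re
import time


def parse_vm_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid):
    """Read the resident set size of a process in kB (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        return parse_vm_rss_kb(f.read())


def watch_rss(pid, interval=1.0, samples=10):
    """Print and return a fixed number of RSS samples taken interval apart."""
    values = []
    for _ in range(samples):
        rss = read_rss_kb(pid)
        print("%s kB" % rss)
        values.append(rss)
        time.sleep(interval)
    return values
```

For example, watch_rss(12345, samples=60) would sample a server with the
(hypothetical) PID 12345 once per second for a minute, which makes it easy to
log the numbers next to the netstat output.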

> -----Original Message-----
> From: twisted-python-bounces at twistedmatrix.com [mailto:twisted-python-bounces at twistedmatrix.com] On Behalf Of Werner Thie
> Sent: Monday, February 22, 2010 11:39 PM
> To: Twisted general discussion
> Subject: Re: [Twisted-Python] debugging a memory leak
> 
> Hi
> 
> If memory that is not released to the OS can be reused by the interpreter,
> thanks to the suballocation system the interpreter uses, overall memory
> usage should eventually level out over time. That's what I observe with our
> processes (sitting at several 100 MB per process). We are using external C
> libraries which do lots of malloc/free, and one of the bigger sources of
> pain is indeed bringing such a library to a point where it's clean, not only
> by freeing all allocated memory in every circumstance but also Python
> refcounting-wise. I usually go through all the motions to build up a
> complete debug chain for all modules involved in a project and write a test
> bed to prove a clean and proper implementation.
> 
> So if you're using C/C++ based modules in your project, I would mark them
> as highly suspicious to be responsible for leaks until proven otherwise.
> 
> Not to bother you with numbers but I usually allocate about 30% of
> overall project time to bring a server into a production ready state,
> meaning uptimes of months/years, no fishy feelings, no performance
> oscillations, predictable caving and recuperating when overloaded, just
> all the things you have to tick to sign off a project as completed,
> meaning you don't have to do daily 'tire kicking' maintenance and
> periodic reboots.
> 
> Werner
> 
> Alec Matusis wrote:
> > Hi Maarten,
> >
> > Your link
> > http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
> > seems to suggest that even though the interpreter does not release memory
> > back to the OS, it can be re-used by the interpreter.
> > If this was our problem, I'd expect the memory to be set by the highest
> > usage, as opposed to it constantly leaking: in my case, the load is
> > virtually constant, but the memory still leaks over time.
> >
> > The environment is Linux 2.6.24 x86-64, the extensions used are MySQLdb,
> > pyCrypto (latest stable releases for both).
> >
> >> -----Original Message-----
> >> From: twisted-python-bounces at twistedmatrix.com [mailto:twisted-python-bounces at twistedmatrix.com] On Behalf Of Maarten ter Huurne
> >> Sent: Monday, February 22, 2010 6:24 PM
> >> To: Twisted general discussion
> >> Subject: Re: [Twisted-Python] debugging a memory leak
> >>
> >> On Tuesday 23 February 2010, Alec Matusis wrote:
> >>
> >>> When I start the process, both python object sizes and their counts rise
> >>> proportionally to the numbers of reconnected clients, and then they
> >>> stabilize after all clients have reconnected.
> >>> At that moment, the "external" RSS process size is about 260MB. The
> >>> "internal size" of all python objects reported by Heapy is about 150MB.
> >>> After two days, the internal sizes/counts stay the same, but the external
> >>> size grows to 1500MB.
> >>>
> >>> Python object counts/total sizes are measured from the manhole.
> >>> Is this sufficient to conclude that this is a C memory leak in one of the
> >>> external modules or in the Python interpreter itself?
> >> In general, there are other reasons why heap size and RSS size do not match:
> >> 1. pages are empty but not returned to the OS
> >> 2. pages cannot be returned to the OS because they are not completely empty
> >>
> >> It seems Python has different allocators for small and large objects:
> >> http://www.mail-archive.com/python-list@python.org/msg256116.html
> >> http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
> >>
> >> Assuming Python uses malloc for all its allocations (does it?), it is the
> >> malloc implementation that determines whether empty pages are returned to
> >> the OS. Under Linux with glibc (your system?), empty pages are returned, so
> >> there reason 1 does not apply.
> >>
> >> Depending on the allocation behaviour of Python, the pages may not be empty
> >> though, so reason 2 is a likely suspect.
> >>
> >> Python extensions written in C could also leak or fragment memory. Are you
> >> using any extensions that are not pure Python?
> >>
> >> Bye,
> >> 		Maarten
> >>
> >> _______________________________________________
> >> Twisted-Python mailing list
> >> Twisted-Python at twistedmatrix.com
> >> http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python