[Twisted-Python] debugging a memory leak

Alec Matusis matusis at yahoo.com
Fri Feb 19 23:00:48 MST 2010


Thank you Werner!

I am playing with your object counter. But before I can interpret the
results, I have to ask a possibly stupid question.
Can someone explain to me this behavior of Python 2.6 on x86 Linux 2.6.24:

1) start interpreter
Check the size of the interpreter process:
$ps -orss -p24316
  RSS
 4212

RSS is 4212KB, about 4MB; fine.

2) Allocate a huge chunk of memory:
>>> x=range(1,10000000)
Check the size of the interpreter process:
$ps -orss -p24316
  RSS
 322460

The RSS is now 322460KB (about 315MB), so the process grew by roughly 311MB,
as I expected.

3) Delete this object:
>>> del x
Check the size of the interpreter process:
$ps -orss -p24316
  RSS
 244332

So the memory is far from its original value! It's still up by about 235MB,
and it never gets released. Why is that?

4) Try an explicit garbage collection:
>>> import gc
>>> gc.collect()
0
$ps -orss -p24316
  RSS
244404
Not much luck here!
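
For convenience, a tiny Linux-only helper (it just parses /proc, so it should
agree with $ps -orss; rss_kb is my own name for it) lets me repeat the same
check without leaving the interpreter:

def rss_kb(pid='self'):
    # current resident set size in kB, read from /proc/<pid>/status
    for line in open('/proc/%s/status' % pid):
        if line.startswith('VmRSS:'):
            return int(line.split()[1])

>>> rss_kb()                 # ~4212 right after startup
>>> x = range(1, 10000000)
>>> rss_kb()                 # ~322460
>>> del x
>>> rss_kb()                 # still ~244000, nowhere near the starting size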


> -----Original Message-----
> From: twisted-python-bounces at twistedmatrix.com [mailto:twisted-python-
> bounces at twistedmatrix.com] On Behalf Of Werner Thie
> Sent: Friday, February 19, 2010 4:10 PM
> To: Twisted general discussion
> Subject: Re: [Twisted-Python] debugging a memory leak
> 
> Hi Alec
> 
> ...and they promised you that with a gc'ed language there would never be
> a memory problem again; you could just plain forget about it.
> 
> I was stuck in the same position as you, and after lots of probing the
> following approach helped a lot to correct what later proved to be overly
> optimistic coding: holding on to objects inside other objects for
> performance/practical reasons. Producing uncollectable cycles in Twisted
> is about as easy as forgetting things when you have Alzheimer's.
> 
> Proving and working on the problem was only possible on the production
> machine, under real load. I went ahead and created a manhole service on
> the production server, allowing me to peek at the Python object space
> without disturbing it too much. The tool I used is the code included
> below.
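> 
> In case it helps, the manhole itself only takes a few lines to bolt onto
> an existing twistd service. Roughly something like this (telnet variant;
> the port and credentials are placeholders, and you want it bound to
> localhost only):
> 
>    from twisted.internet import reactor
>    from twisted.manhole import telnet
> 
>    def startManhole(namespace, port=2323):
>        # interactive Python shell inside the running server process
>        f = telnet.ShellFactory()
>        f.username = 'admin'
>        f.password = 'changeme'
>        f.namespace.update(namespace)   # e.g. {'dumpObjects': dumpObjects}
>        reactor.listenTCP(port, f, interface='127.0.0.1')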
> 
> After cleaning out all the self-produced cycles, our server processes
> stabilized at roughly 280 to 320 MB per process and have now been running
> stably for months, with more than 20k logins per day and an average
> connect time of about 25 minutes per user, playing games delivered by
> nevow/athena LivePages.
> 
> As I said before, all cycles I found in our SW were introduced by
> patterns like
> 
> def beforeRender(self, ctx):
>      self.session = inevow.ISession(ctx)
> 
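> In our case the cure was mostly to stop parking such references on self
> at all, i.e. look the session up again wherever it is needed, or at most
> keep a weak reference, roughly:
> 
>    import weakref
> 
>    def beforeRender(self, ctx):
>        # keep only a weak reference so the page does not keep the session
>        # (and everything hanging off it) alive; self._session() returns
>        # the session, or None once it is gone
>        self._session = weakref.ref(inevow.ISession(ctx))
> 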
> The included code helps to identify how many objects of each kind are
> around. Although it's a primitive tool, it shines a light where it's
> needed, and if you see certain object counts run away you have at least
> identified the area where the uncollectable cycles are being built.
> 
> Why didn't I use heapy/guppy and find it out that way? I wasn't able to
> find evidence for what I was suspecting with any of the tools I tried
> (and boy, I tried for WEEKS). Avid users of heapy will most probably
> disagree and tell me it would have been easy, but in a situation like
> this, anything that works to get you out of the pothole you're in is the
> right thing to do.
> 
> HTH, Werner
> 
> import gc
> 
> exc = [
>    "function",
>    "type",
>    "list",
>    "dict",
>    "tuple",
>    "wrapper_descriptor",
>    "module",
>    "method_descriptor",
>    "member_descriptor",
>    "instancemethod",
>    "builtin_function_or_method",
>    "frame",
>    "classmethod",
>    "classmethod_descriptor",
>    "_Environ",
>    "MemoryError",
>    "_Printer",
>    "_Helper",
>    "getset_descriptor",
>    "weakreaf"
> ]
> 
> inc = [
>    'myFirstSuspect',
>    'mySecondSuspect'
> ]
> 
> prev = {}
> 
> # print, for every class name seen among the gc-tracked objects, how many
> # instances exist and how the count changed since the previous call
> def dumpObjects(delta=True, limit=0, include=inc, exclude=[]):
>    global prev
>    if include != [] and exclude != []:
>      print 'cannot use include and exclude at the same time'
>      return
>    print 'working with:'
>    print '   delta: ', delta
>    print '   limit: ', limit
>    print ' include: ', include
>    print ' exclude: ', exclude
>    objects = {}
>    gc.collect()
>    oo = gc.get_objects()
>    for o in oo:
>      if getattr(o, "__class__", None):
>        name = o.__class__.__name__
>        if ((exclude == [] and include == [])       or \
>            (exclude != [] and name not in exclude) or \
>            (include != [] and name in include)):
>          objects[name] = objects.get(name, 0) + 1
> ##    if more:
> ##      print o
>    pk = prev.keys()
>    pk.sort()
>    names = objects.keys()
>    names.sort()
>    for name in names:
>      if limit == 0 or objects[name] > limit:
>        if not prev.has_key(name):
>          prev[name] = objects[name]
>        dt = objects[name] - prev[name]
>        if delta or dt != 0:
>          print '%0.6d -- %0.6d -- ' % (dt, objects[name]),  name
>        prev[name] = objects[name]
> 
> def getObjects(oname):
>    """
>    gets an object list with all the named objects out of the sea of
>    gc'ed objects
>    """
>    olist = []
>    objects = {}
>    gc.collect()
>    oo = gc.get_objects()
>    for o in oo:
>      if getattr(o, "__class__", None):
>        name = o.__class__.__name__
>        if (name == oname):
>          olist.append(o)
>    return olist
> 
> # dumpObj lives in a separate helper module ('dumpobj', not included here)
> dumpObject = dumpobj.dumpObj
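> 
> Typical use from the manhole, once the module above is in the shell's
> namespace (class names like TagSession are of course just examples):
> 
>    >>> dumpObjects(delta=False, limit=100, include=[], exclude=exc)
>    >>> sessions = getObjects('TagSession')
>    >>> len(sessions)
>    >>> import gc
>    >>> gc.get_referrers(sessions[0])   # who is still holding on to one?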
> 
> Alec Matusis wrote:
> > I modified a tolerably leaking (about 40MB/day) Twisted server: when the
> > new code was pushed, the memory leak became catastrophic (about 100MB/hr).
> > We could tolerate 40MB/day, but the new code needs to be debugged.
> > First, I reverted to the old version, which leaks 40MB/day. (The leak
> > rate is actually proportional to the number of new connections per second,
> > which correlates with the CPU utilization of the process: if the CPU as
> > measured by $top jumps to >90%, the leak can accelerate to 50MB/hr.)
> > I took two steps to debug the leak:
> >
> > 1) Using guppy/heapy via manhole:
> >
> > >>> hp = hpy()
> > >>> h = hp.heap()
> > >>> h
> > Partition of a set of 1157084 objects. Total size = 140911144 bytes.
> >  Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
> >      0 785798  68 48463680  34  48463680  34 str
> >      1   7357   1 24660664  18  73124344  52 dict of service.TagSession
> >      2  11735   1 12298280   9  85422624  61 dict of twisted.internet.base.DelayedCall
> >      3   7377   1  7731096   5  93153720  66 dict of twisted.internet.tcp.Server
> >      4   7375   1  7729000   5 100882720  72 dict of protocols.TagProtocol
> >      5  30925   3  7174600   5 108057320  77 __builtin__.set
> >      6   9193   1  6373336   5 114430656  81 dict (no owner)
> >      7  15557   1  3396904   2 117827560  84 list
> >      8  44833   4  3227976   2 121055536  86 types.BuiltinFunctionType
> >      9  38142   3  3051360   2 124106896  88 types.MethodType
> > <328 more rows. Type e.g. '_.more' to view.>
> >
> > Note that the total size of all objects is 140911144 bytes.
> > The 1st, 3rd and 4th items in this list show the actual number of
> > connected clients. I wait for about 48 hrs, then execute the same
> > command, and I see approximately the same total size, about 130MB.
> > So the total size that Heapy sees via the manhole is stable, fluctuating
> > around 140MB.
> >
> > The problem is that the total RSS size of the process visible to the OS
> > is much larger: it is 871680KB = 851MB:
> > $ps -o pid,vsz,rss,sz,size -p 11236
> >   PID     VSZ           RSS       SZ           SZ
> > 11236 1303180 871680 325795 1174388
> >
> > It is this total RSS size that keeps leaking at 40MB per day.
> >
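> > To watch the two numbers side by side, one could log both periodically
> > from inside the process, for example with a
> > twisted.internet.task.LoopingCall; a rough sketch (hp is the same hpy()
> > instance as above):
> >
> >    from guppy import hpy
> >    from twisted.internet import task
> >
> >    hp = hpy()
> >
> >    def logSizes():
> >        # total size of live Python objects vs. what the OS charges us
> >        heap_bytes = hp.heap().size
> >        for line in open('/proc/self/status'):
> >            if line.startswith('VmRSS:'):
> >                print 'heapy: %d bytes, RSS: %s kB' % (heap_bytes,
> >                                                       line.split()[1])
> >
> >    task.LoopingCall(logSizes).start(3600)   # once an hour
> >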
> > As far as I understand, this means that the problem is not in pure
> > Python code, since Heapy shows that the total size of all Python
> > objects is more or less constant.
> > Is this a correct assumption?
> >
> > 2) So now I turn to valgrind. I am no expert in using valgrind, so what I
> > did was based only on general logic/rough guesses.
> > Since I cannot run this under valgrind on a production machine due to
> > performance reasons, I recompile python on the staging machine:
> > ./configure --enable-shared --without-pymalloc
> > --prefix=/nail/encap/python-2.6.4-valgrind
> > I then follow the instructions in
> > http://svn.python.org/projects/python/trunk/Misc/README.valgrind
> > Then I run twistd process like this:
> >
> > valgrind  --tool=memcheck
> > --suppressions=/nail/sys/src/Python-2.6.4/Misc/valgrind-python.supp
> > --leak-check=full --log-file=/tmp/valgrind.log  /usr/local/bin/twistd
> > --no_save --reactor=epoll --pidfile=logs/tagserv.pid
> > --logfile=logs/tagserv.log --python=tagserv.py
> >
> > The memory for the process shown by the OS is now 5x normal, and the
> > performance is about 5x worse, since it's running inside valgrind's
> > synthetic CPU.
> > Because this is done on the staging box where I cannot accurately
> > reproduce the real load, the memory leaks from simulated load seen by
> > $ps -o rss are pretty small, about 1 to 10MB.
> > Still, I am interested in finding out what they are.
> >
> > Now I encounter problems with my understanding of how to use valgrind.
> > This may not be the appropriate list for this, but perhaps someone will
> > recognize the problem.
> > When I start the server, about 240 lines are written to the valgrind log
> > file (--log-file=/tmp/valgrind.log).
> > When I shut it down, it adds another 100 lines.
> > No matter what I do in between, I always end up with a log file of
> > exactly 343 lines.
> > I can have server runs with a leak of 1MB, or with 10MB, but in the end I
> > get pretty much the same log file. Moreover, when I look at all the lost
> > memory reports:
> > $grep lost valgrind.log.1
> >  17,352 bytes in 31 blocks are possibly lost in loss record 49 of 62
> >  203,312 bytes in 478 blocks are possibly lost in loss record 57 of 62
> >     definitely lost: 0 bytes in 0 blocks.
> >       possibly lost: 220,664 bytes in 509 blocks.
> >  64 bytes in 2 blocks are definitely lost in loss record 12 of 63
> >  17,352 bytes in 31 blocks are possibly lost in loss record 50 of 63
> >  203,824 bytes in 479 blocks are possibly lost in loss record 58 of 63
> >     definitely lost: 64 bytes in 2 blocks.
> >       possibly lost: 221,176 bytes in 510 blocks.
> >  47 bytes in 1 blocks are definitely lost in loss record 8 of 63
> >  128 bytes in 4 blocks are definitely lost in loss record 16 of 63
> >  584 (104 direct, 480 indirect) bytes in 2 blocks are definitely lost in loss record 26 of 63
> >  1,008 bytes in 6 blocks are definitely lost in loss record 31 of 63
> >  22,296 bytes in 41 blocks are possibly lost in loss record 50 of 63
> >  183,368 bytes in 381 blocks are possibly lost in loss record 59 of 63
> >     definitely lost: 1,287 bytes in 13 blocks.
> >     indirectly lost: 480 bytes in 20 blocks.
> >       possibly lost: 205,664 bytes in 422 blocks.
> >
> > If I add up all those numbers, I get less than 1MB. How do I track down
> > the 10MB leak?
> >
> > Are there any alternative strategies in finding this leak?
> >
> 
> 
> _______________________________________________
> Twisted-Python mailing list
> Twisted-Python at twistedmatrix.com
> http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python




