[Twisted-Python] debugging a memory leak

Mon Feb 22 20:52:17 EST 2010

Werner

I am using your code, and it shows essentially the same thing as Heapy:
the counts of all common objects more or less agree.
The 'Total size' shown in Heapy

When I start the process, both python object sizes and their counts rise
proportionally to the numbers of reconnected clients, and then they
stabilize after all clients have reconnected.
At that moment, the "external" RSS process size is about 260MB. The
"internal size" of all python objects reported by Heapy is about 150MB.
After two days, the internal sizes/counts stay the same, but the external
size grows to 1500MB.

Python object counts/total sizes are measured from the manhole.
Is this sufficient to conclude that this is a C memory leak in one of the
external modules or in the Python interpreter itself?

> -----Original Message-----
> From: twisted-python-bounces at twistedmatrix.com [mailto:twisted-python-
> bounces at twistedmatrix.com] On Behalf Of Werner Thie
> Sent: Friday, February 19, 2010 4:10 PM
> To: Twisted general discussion
> Subject: Re: [Twisted-Python] debugging a memory leak
> 
> Hi Alec
> 
> ...and they promised you that with a gc'ed language there will never be
> a memory problem again, you just plain forget about it.
> 
> I was stuck in the same position as you and after lots of probing the
> following attempt helped a lot to correct what was later proofed to be
> overly optimistic coding by holding on to objects for
> performance/practical reasons in other objects. Producing non collect
> able cycles in twisted is probably as easy as to forget about memory
> when you have Alzheimer.
> 
> Proofing and working on the problem was only possible on the production
> machine under real load situations. I went ahead and created a manhole
> service on the production server, allowing me to peek at the python
> object space without disturbing it too much. What I used as a tool was
> the code you find later on included.
> 
> After cleaning all the self produced cycles out our servers processes
> stabilized at roughly 280 to 320 MB per process and are now running
> stable for months with more than 20k logins per day and a usual time of
> connect per user on the average of 25 minutes playing games delivered by
> nevow/athena LivePages.
> 
> As I said before, all cycles I found in our SW were introduced by
> patterns like
> 
> def beforeRender(self, ctx):
>      self.session = inevow.ISession(ctx)
> 
> The included code helps to identify the amount of objects being around.
> Although it's a primitive tool it shines the light where its needed and
> if you see certain object counts run away then you have at least
> identified the surrounding where the non collect able cycles are built.
> 
> Why didn't I use heapy/guppy and found out that way? I wasn't able to
> find the evidence for what I was suspecting with all the tools I tried
> (and boy I tried for WEEKS). Avid users of heapy will most probably
> disagree and tell me it would have been easy. But in a situation as this
> everything that works to move you out of that pothole you're in is the
> right thing to do.
> 
> HTH, Werner
> 
> exc = [
>    "function",
>    "type",
>    "list",
>    "dict",
>    "tuple",
>    "wrapper_descriptor",
>    "module",
>    "method_descriptor",
>    "member_descriptor",
>    "instancemethod",
>    "builtin_function_or_method",
>    "frame",
>    "classmethod",
>    "classmethod_descriptor",
>    "_Environ",
>    "MemoryError",
>    "_Printer",
>    "_Helper",
>    "getset_descriptor",
>    "weakreaf"
> ]
> 
> inc = [
>    'myFirstSuspect',
>    'mySecondSuspect'
> ]
> 
> prev = {}
> 
> def dumpObjects(delta=True, limit=0, include=inc, exclude=[]):
>    global prev
>    if include != [] and exclude != []:
>      print 'cannot use include and exclude at the same time'
>      return
>    print 'working with:'
>    print '   delta: ', delta
>    print '   limit: ', limit
>    print ' include: ', include
>    print ' exclude: ', exclude
>    objects = {}
>    gc.collect()
>    oo = gc.get_objects()
>    for o in oo:
>      if getattr(o, "__class__", None):
>        name = o.__class__.__name__
>        if ((exclude == [] and include == [])       or \
>            (exclude != [] and name not in exclude) or \
>            (include != [] and name in include)):
>          objects[name] = objects.get(name, 0) + 1
> ##    if more:
> ##      print o
>    pk = prev.keys()
>    pk.sort()
>    names = objects.keys()
>    names.sort()
>    for name in names:
>      if limit == 0 or objects[name] > limit:
>        if not prev.has_key(name):
>          prev[name] = objects[name]
>        dt = objects[name] - prev[name]
>        if delta or dt != 0:
>          print '%0.6d -- %0.6d -- ' % (dt, objects[name]),  name
>        prev[name] = objects[name]
> 
> def getObjects(oname):
>    """
>    gets an object list with all the named objects out of the sea of
>    gc'ed objects
>    """
>    olist = []
>    objects = {}
>    gc.collect()
>    oo = gc.get_objects()
>    for o in oo:
>      if getattr(o, "__class__", None):
>        name = o.__class__.__name__
>        if (name == oname):
>          olist.append(o)
>    return olist
> 
> dumpObject = dumpobj.dumpObj
> 
> Alec Matusis wrote:
> > I modified a tolerably leaking (about 40MB/day) Twisted server: when the
> new
> > code was pushed, the memory leak  became catastrophic (about
> 100MB/hr).
> > We could tolerate 40MB/day, but the new code needs to be debugged.
> > First, I reverted to the old version, that is leaking 40MB/day (The leak
> > rate is actually proportional to the number of new connections per
second,
> > (which correlates with the CPU utilization of the process): if CPU as
> > measured by $top jumps to >90%, the leak can accelerate to 50MB/hr)
> > I took two steps to debug the leak:
> >
> > 1) Using guppy/heapy via manhole:
> >
> >>>> hp = hpy()
> >>>> h = hp.heap()
> >>>> h
> > Partition of a set of 1157084 objects. Total size = 140911144 bytes.
> >  Index  Count   %     Size   % Cumulative  % Kind (class / dict of
class)
> >      0 785798  68 48463680  34  48463680  34 str
> >      1   7357   1 24660664  18  73124344  52 dict of service.TagSession
> >      2  11735   1 12298280   9  85422624  61 dict of
> > twisted.internet.base.DelayedCall
> >      3   7377   1  7731096   5  93153720  66 dict of
> > twisted.internet.tcp.Server
> >      4   7375   1  7729000   5 100882720  72 dict of
protocols.TagProtocol
> >      5  30925   3  7174600   5 108057320  77 __builtin__.set
> >      6   9193   1  6373336   5 114430656  81 dict (no owner)
> >      7  15557   1  3396904   2 117827560  84 list
> >      8  44833   4  3227976   2 121055536  86 types.BuiltinFunctionType
> >      9  38142   3  3051360   2 124106896  88 types.MethodType
> > <328 more rows. Type e.g. '_.more' to view.>
> >
> > Note that the total size of all objects is 140911144 bytes
> > The 1st, 3d and 4th items in this list show the actual number of
connected
> > clients. I wait for about 48 hrs, and then execute the same command, and
I
> > see approximately the same Total size, of 130MB.
> > So the total size that Heapy sees via the manhole is stable, fluctuating
> > around 140MB.
> >
> > The problem is that the total RSS size of the process visible by the OS
is
> > much larger, it is  871680KB = 851MB:
> > $ps -o pid,vsz,rss,sz,size -p 11236
> >   PID     VSZ           RSS       SZ           SZ
> > 11236 1303180 871680 325795 1174388
> >
> > It is this total RSS size that keeps leaking at 40MB per day.
> >
> > As far as I understand, this means that this is not the problem with
purely
> > Python code, since Heapy shows that the total size of all Python objects
is
> > more or less constant.
> > Is this a correct assumption?
> >
> > 2) So now I turn to valgrind. I am no expert in using valgrind, so what
I
> > did was based only on general logic/rough guesses.
> > Since I cannot run this under valgrind on a production machine due to
> > performance reasons, I recompile python on the staging machine:
> > ./configure --enable-shared --without-pymalloc
> > --prefix=/nail/encap/python-2.6.4-valgrind
> > I then follow the instructions in
> > http://svn.python.org/projects/python/trunk/Misc/README.valgrind
> > Then I run twistd process like this:
> >
> > valgrind  --tool=memcheck
> > --suppressions=/nail/sys/src/Python-2.6.4/Misc/valgrind-python.supp
> > --leak-check=full --log-file=/tmp/valgrind.log  /usr/local/bin/twistd
> > --no_save --reactor=epoll --pidfile=logs/tagserv.pid
> > --logfile=logs/tagserv.log --python=tagserv.py
> >
> > The memory for the process shown by the OS is now 5x normal, and the
> > performance is about 5x worse, since it's running inside valgrind's
> > synthetic CPU.
> > Because this is done on the staging box where I cannot accurately
> reproduce
> > the real load, the memory leaks from simulated load seen by $ps -o rss
are
> > pretty small, about 1 to 10MB.
> > Still, I am interested in finding out what they are.
> >
> > Now I encounter the problems with my understanding of how to use
> valgrind.
> > This may not be the appropriate list for this, but perhaps someone could
> > recognize the problem.
> > When I start the server, about 240 lines is written to valgrind log
file,
> > --log-file=/tmp/valgrind.log
> > When I shut it down, it adds another 100 lines.
> > No matter what I do in between, it always results in the log file with
> > exactly 343 lines.
> > I can have server runs with a leak of 1MB, or with 10MB, but in the end,
I
> > get pretty much the same log file. Moreover, when look for all lost
memory
> > reports:
> > $grep lost valgrind.log.1
> >  17,352 bytes in 31 blocks are possibly lost in loss record 49 of 62
> >  203,312 bytes in 478 blocks are possibly lost in loss record 57 of 62
> >     definitely lost: 0 bytes in 0 blocks.
> >       possibly lost: 220,664 bytes in 509 blocks.
> >  64 bytes in 2 blocks are definitely lost in loss record 12 of 63
> >  17,352 bytes in 31 blocks are possibly lost in loss record 50 of 63
> >  203,824 bytes in 479 blocks are possibly lost in loss record 58 of 63
> >     definitely lost: 64 bytes in 2 blocks.
> >       possibly lost: 221,176 bytes in 510 blocks.
> >  47 bytes in 1 blocks are definitely lost in loss record 8 of 63
> >  128 bytes in 4 blocks are definitely lost in loss record 16 of 63
> >  584 (104 direct, 480 indirect) bytes in 2 blocks are definitely lost in
> > loss record 26 of 63
> >  1,008 bytes in 6 blocks are definitely lost in loss record 31 of 63
> >  22,296 bytes in 41 blocks are possibly lost in loss record 50 of 63
> >  183,368 bytes in 381 blocks are possibly lost in loss record 59 of 63
> >     definitely lost: 1,287 bytes in 13 blocks.
> >     indirectly lost: 480 bytes in 20 blocks.
> >       possibly lost: 205,664 bytes in 422 blocks.
> >
> > If I add up all those numbers, I get less than 1MB. How do I track down
the
> > 10MB leak?
> >
> > Are there any alternative strategies in finding this leak?
> >
> >
> >
> >
> > _______________________________________________
> > Twisted-Python mailing list
> > Twisted-Python at twistedmatrix.com
> > http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
> 
> 
> _______________________________________________
> Twisted-Python mailing list
> Twisted-Python at twistedmatrix.com
> http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python