[Twisted-Python] debugging a memory leak
Alec Matusis
matusis at yahoo.com
Fri Feb 19 23:00:48 MST 2010
Thank you Werner!
I am playing with your object counter. But before I can interpret the
results, I have to ask a possibly stupid question.
Can someone explain to me this behavior of Python 2.6 on x86 Linux 2.6.24:
1) start interpreter
Check the size of the interpreter process:
$ps -orss -p24316
RSS
4212
Size is 4MB, fine.
2) Allocate a huge chunk of memory:
>>> x=range(1,10000000)
Check the size of the interpreter process:
$ps -orss -p24316
RSS
322460
The RSS grew to 322460 KB ≈ 315 MB (an increase of about 311 MB), roughly what I
expected for a list of ten million ints.
3) Delete this object:
>>> del x
Check the size of the interpreter process:
$ps -orss -p24316
RSS
244332
So the memory is far from its original value: at 244332 KB ≈ 239 MB, the RSS is
still up by nearly 235 MB, and it never gets released. Why is that?
4)
>>> import gc
>>> gc.collect()
0
$ps -orss -p24316
RSS
244404
Not much luck here!
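For anyone who wants to repeat this without a second shell watching ps, here is a
minimal sketch for reading the RSS from inside the interpreter on Linux (it assumes
a /proc filesystem; the helper name is made up):

def rss_kb():
    """Resident set size of the current process in KB, read from /proc."""
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])  # /proc reports the value in kB

print rss_kb()            # baseline
x = range(1, 10000000)    # allocate ~10 million ints
print rss_kb()            # after allocation
del x
print rss_kb()            # after deleting the list
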
> -----Original Message-----
> From: twisted-python-bounces at twistedmatrix.com [mailto:twisted-python-
> bounces at twistedmatrix.com] On Behalf Of Werner Thie
> Sent: Friday, February 19, 2010 4:10 PM
> To: Twisted general discussion
> Subject: Re: [Twisted-Python] debugging a memory leak
>
> Hi Alec
>
> ...and they promised you that with a gc'ed language there would never be
> a memory problem again; you could just plain forget about it.
>
> I was stuck in the same position as you, and after lots of probing the
> following approach helped a lot to correct what later proved to be
> overly optimistic coding: holding on to objects in other objects for
> performance/practical reasons. Producing non-collectable cycles in
> Twisted is probably as easy as forgetting about memory when you have
> Alzheimer's.
>
> Proving and working on the problem was only possible on the production
> machine under real load. I went ahead and created a manhole service on
> the production server, allowing me to peek at the Python object space
> without disturbing it too much. The tool I used is the code included
> below.
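>
> (For reference, a bare-bones manhole along these lines can be set up with
> the old twisted.manhole telnet shell; this is only a sketch, and the port,
> credentials and namespace entry are placeholders:)
>
> from twisted.internet import reactor
> from twisted.manhole import telnet
>
> factory = telnet.ShellFactory()
> factory.username = 'debug'       # placeholder credentials, do not expose publicly
> factory.password = 'debug'
> factory.namespace['dumpObjects'] = dumpObjects   # make the counter below reachable
> reactor.listenTCP(2222, factory, interface='127.0.0.1')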
>
> After cleaning all the self-produced cycles out, our server processes
> stabilized at roughly 280 to 320 MB per process and have now been running
> stably for months, with more than 20k logins per day and an average
> connection time of about 25 minutes per user, playing games delivered by
> nevow/athena LivePages.
>
> As I said before, all cycles I found in our software were introduced by
> patterns like
>
> def beforeRender(self, ctx):
>     self.session = inevow.ISession(ctx)
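>
> (As an illustration of what "non-collectable" means here, and not our
> actual classes: in CPython 2.x a reference cycle whose objects define
> __del__ is parked in gc.garbage instead of being freed:)
>
> import gc
>
> class Page(object):
>     def __del__(self):     # a finalizer makes the cycle uncollectable
>         pass
>
> class Session(object):
>     pass
>
> page, session = Page(), Session()
> page.session = session     # page -> session
> session.page = page        # session -> page closes the cycle
> del page, session
>
> gc.collect()
> print gc.garbage           # the object with __del__ is reported here; the cycle is never freed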
>
> The included code helps to identify how many objects of each kind are
> around. Although it's a primitive tool, it shines a light where it's
> needed, and if you see certain object counts run away you have at least
> identified the area where the non-collectable cycles are being built.
>
> Why didn't I use heapy/guppy and find out that way? I wasn't able to
> find evidence for what I was suspecting with any of the tools I tried
> (and boy, I tried for WEEKS). Avid users of heapy will most probably
> disagree and tell me it would have been easy, but in a situation like
> this, anything that works to get you out of the pothole you're in is
> the right thing to do.
>
> HTH, Werner
>
> import gc
>
> # Class names to skip when counting (mostly interpreter plumbing).
> exc = [
>     "function",
>     "type",
>     "list",
>     "dict",
>     "tuple",
>     "wrapper_descriptor",
>     "module",
>     "method_descriptor",
>     "member_descriptor",
>     "instancemethod",
>     "builtin_function_or_method",
>     "frame",
>     "classmethod",
>     "classmethod_descriptor",
>     "_Environ",
>     "MemoryError",
>     "_Printer",
>     "_Helper",
>     "getset_descriptor",
>     "weakref"
> ]
>
> # Class names to watch; put your own suspects here.
> inc = [
>     'myFirstSuspect',
>     'mySecondSuspect'
> ]
>
> # Counts seen on the previous call, used to compute deltas.
> prev = {}
>
> def dumpObjects(delta=True, limit=0, include=inc, exclude=[]):
>     global prev
>     if include != [] and exclude != []:
>         print 'cannot use include and exclude at the same time'
>         return
>     print 'working with:'
>     print '   delta: ', delta
>     print '   limit: ', limit
>     print ' include: ', include
>     print ' exclude: ', exclude
>     objects = {}
>     gc.collect()
>     oo = gc.get_objects()
>     for o in oo:
>         if getattr(o, "__class__", None):
>             name = o.__class__.__name__
>             if ((exclude == [] and include == []) or
>                     (exclude != [] and name not in exclude) or
>                     (include != [] and name in include)):
>                 objects[name] = objects.get(name, 0) + 1
>     names = objects.keys()
>     names.sort()
>     for name in names:
>         if limit == 0 or objects[name] > limit:
>             if name not in prev:
>                 prev[name] = objects[name]
>             dt = objects[name] - prev[name]
>             if delta or dt != 0:
>                 print '%0.6d -- %0.6d -- ' % (dt, objects[name]), name
>             prev[name] = objects[name]
>
> def getObjects(oname):
>     """
>     gets a list of all objects with the given class name out of the sea
>     of gc'ed objects
>     """
>     olist = []
>     gc.collect()
>     oo = gc.get_objects()
>     for o in oo:
>         if getattr(o, "__class__", None):
>             name = o.__class__.__name__
>             if name == oname:
>                 olist.append(o)
>     return olist
>
> # dumpobj is a separate helper module (not included in this mail).
> dumpObject = dumpobj.dumpObj
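>
> (A typical manhole session with this, once the suspect class names are in
> inc, might look like the following; the counts and the class name are made
> up for illustration:)
>
> >>> dumpObjects(delta=False)           # first call just records a baseline
> ... let the server run under load for a while ...
> >>> dumpObjects(delta=False)           # now only classes whose count changed are printed
> 000412 -- 007769 --  TagSession
> >>> leaked = getObjects('TagSession')  # grab the live instances for closer inspection
> >>> len(leaked)
> 7769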
>
> Alec Matusis wrote:
> > I modified a tolerably leaking (about 40MB/day) Twisted server: when the
> > new code was pushed, the memory leak became catastrophic (about 100MB/hr).
> > We could tolerate 40MB/day, but the new code needs to be debugged.
> > First, I reverted to the old version, which is leaking 40MB/day. (The leak
> > rate is actually proportional to the number of new connections per second,
> > which correlates with the CPU utilization of the process: if CPU as
> > measured by $top jumps to >90%, the leak can accelerate to 50MB/hr.)
> > I took two steps to debug the leak:
> >
> > 1) Using guppy/heapy via manhole:
> >
> > >>> hp = hpy()
> > >>> h = hp.heap()
> > >>> h
> > Partition of a set of 1157084 objects. Total size = 140911144 bytes.
> >  Index   Count   %      Size   % Cumulative  % Kind (class / dict of class)
> >      0  785798  68  48463680  34   48463680  34 str
> >      1    7357   1  24660664  18   73124344  52 dict of service.TagSession
> >      2   11735   1  12298280   9   85422624  61 dict of twisted.internet.base.DelayedCall
> >      3    7377   1   7731096   5   93153720  66 dict of twisted.internet.tcp.Server
> >      4    7375   1   7729000   5  100882720  72 dict of protocols.TagProtocol
> >      5   30925   3   7174600   5  108057320  77 __builtin__.set
> >      6    9193   1   6373336   5  114430656  81 dict (no owner)
> >      7   15557   1   3396904   2  117827560  84 list
> >      8   44833   4   3227976   2  121055536  86 types.BuiltinFunctionType
> >      9   38142   3   3051360   2  124106896  88 types.MethodType
> > <328 more rows. Type e.g. '_.more' to view.>
> >
> > Note that the total size of all objects is 140911144 bytes.
> > The 1st, 3rd and 4th items in this list show the actual number of
> > connected clients. I wait for about 48 hrs, and then execute the same
> > command, and I see approximately the same Total size, of 130MB.
> > So the total size that heapy sees via the manhole is stable, fluctuating
> > around 140MB.
> >
> > The problem is that the total RSS size of the process visible to the OS
> > is much larger: it is 871680KB = 851MB:
> > $ps -o pid,vsz,rss,sz,size -p 11236
> >   PID     VSZ    RSS     SZ      SZ
> > 11236 1303180 871680 325795 1174388
> >
> > It is this total RSS size that keeps leaking at 40MB per day.
> >
> > As far as I understand, this means that the problem is not in the pure
> > Python code, since heapy shows that the total size of all Python objects
> > is more or less constant.
> > Is this a correct assumption?
> >
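> > (For completeness, heapy can also diff two snapshots directly, which makes
> > growth between them easier to spot; a minimal sketch from the same manhole,
> > assuming the standard guppy API:)
> >
> > >>> from guppy import hpy
> > >>> hp = hpy()
> > >>> hp.setrelheap()       # measure relative to this point
> > >>> h1 = hp.heap()
> > ... wait under load for a while ...
> > >>> h2 = hp.heap()
> > >>> (h2 - h1).byrcs       # what grew, grouped by referrer pattern
> >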
> > 2) So now I turn to valgrind. I am no expert in using valgrind, so what I
> > did was based only on general logic/rough guesses.
> > Since I cannot run this under valgrind on a production machine for
> > performance reasons, I recompile python on the staging machine:
> > ./configure --enable-shared --without-pymalloc
> > --prefix=/nail/encap/python-2.6.4-valgrind
> > I then follow the instructions in
> > http://svn.python.org/projects/python/trunk/Misc/README.valgrind
> > Then I run the twistd process like this:
> >
> > valgrind --tool=memcheck
> > --suppressions=/nail/sys/src/Python-2.6.4/Misc/valgrind-python.supp
> > --leak-check=full --log-file=/tmp/valgrind.log /usr/local/bin/twistd
> > --no_save --reactor=epoll --pidfile=logs/tagserv.pid
> > --logfile=logs/tagserv.log --python=tagserv.py
> >
> > The memory for the process shown by the OS is now 5x normal, and the
> > performance is about 5x worse, since it's running inside valgrind's
> > synthetic CPU.
> > Because this is done on the staging box, where I cannot accurately
> > reproduce the real load, the memory leaks from the simulated load seen by
> > $ps -o rss are pretty small, about 1 to 10MB.
> > Still, I am interested in finding out what they are.
> >
> > Now I run into the limits of my understanding of how to use valgrind.
> > This may not be the appropriate list for this, but perhaps someone will
> > recognize the problem.
> > When I start the server, about 240 lines are written to the valgrind log
> > file, --log-file=/tmp/valgrind.log
> > When I shut it down, it adds another 100 lines.
> > No matter what I do in between, it always results in a log file with
> > exactly 343 lines.
> > I can have server runs with a leak of 1MB, or with 10MB, but in the end I
> > get pretty much the same log file. Moreover, when I look for all the lost
> > memory reports:
> > $grep lost valgrind.log.1
> > 17,352 bytes in 31 blocks are possibly lost in loss record 49 of 62
> > 203,312 bytes in 478 blocks are possibly lost in loss record 57 of 62
> > definitely lost: 0 bytes in 0 blocks.
> > possibly lost: 220,664 bytes in 509 blocks.
> > 64 bytes in 2 blocks are definitely lost in loss record 12 of 63
> > 17,352 bytes in 31 blocks are possibly lost in loss record 50 of 63
> > 203,824 bytes in 479 blocks are possibly lost in loss record 58 of 63
> > definitely lost: 64 bytes in 2 blocks.
> > possibly lost: 221,176 bytes in 510 blocks.
> > 47 bytes in 1 blocks are definitely lost in loss record 8 of 63
> > 128 bytes in 4 blocks are definitely lost in loss record 16 of 63
> > 584 (104 direct, 480 indirect) bytes in 2 blocks are definitely lost in
> > loss record 26 of 63
> > 1,008 bytes in 6 blocks are definitely lost in loss record 31 of 63
> > 22,296 bytes in 41 blocks are possibly lost in loss record 50 of 63
> > 183,368 bytes in 381 blocks are possibly lost in loss record 59 of 63
> > definitely lost: 1,287 bytes in 13 blocks.
> > indirectly lost: 480 bytes in 20 blocks.
> > possibly lost: 205,664 bytes in 422 blocks.
> >
> > If I add up all those numbers, I get less than 1MB. How do I track down
> > the 10MB leak?
> >
> > Are there any alternative strategies for finding this leak?
> >
> >
>
>
> _______________________________________________
> Twisted-Python mailing list
> Twisted-Python at twistedmatrix.com
> http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python