Opened 16 years ago

Last modified 16 years ago

#1079 closed defect (fixed)

HTTPClientFactory leaks memory

Reported by: jojo
Owned by:
Priority: high
Milestone:
Component: web
Keywords:
Cc: Jean-Paul Calderone, jknight, titty, jeffsilver, jojo
Branch:
Author:

Description


Attachments (5)

t.py (723 bytes) - added by jojo 16 years ago.
patched-log.txt (665.3 KB) - added by titty 16 years ago.
urls.txt.gz (2.1 MB) - added by titty 16 years ago.
dumbserver.py (663 bytes) - added by jeffsilver 16 years ago.
twisted-1079.tgz (1.0 MB) - added by titty 16 years ago.

Change History (28)

Changed 16 years ago by jojo

Attachment: t.py added

comment:1 Changed 16 years ago by jojo

The attached test program (a minimal version of a web-page crawler) uses more
and more memory the longer it runs.

To use the test, feed a text file containing many (HTTP) URLs (one per line) to
the program through stdin, like so: "python t.py <urls.txt". [If someone needs
some URLs to test with, feel free to mail me at johannes.beigel(AT)brainbot.com.]

Watching the running program with top shows memory growth of about 1 MB every
two seconds or so; this may of course depend on the speed of the network
connection.

We can reproduce this behaviour on different machines running Linux (kernel
2.6.*) with Python 2.3 and 2.4 (compiled with and without --with-pydebug) and
with the latest Twisted from SVN. The program does not seem to leak on
FreeBSD 4.11 (Python 2.4), though.

The commented-out part (the keyword argument "maxLength") in the call to
getPage() is used by us in a patched version of Twisted that fetches only up to
maxLength bytes. It makes no difference to the leak, so we don't think the leak
is caused by large pages being crawled.
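
Roughly, t.py does something like the following (a sketch only, not the
attachment verbatim; the maxLength= comment marks where our patched getPage()
takes the extra keyword):

import sys
from twisted.internet import reactor
from twisted.web.client import getPage

urls = [line.strip() for line in sys.stdin if line.strip()]
concurrency = 10
active = [concurrency]

def fetch_next(result=None):
    if urls:
        d = getPage(urls.pop(0))   # our patched getPage() also accepts maxLength=
        d.addBoth(fetch_next)      # ignore the result, just start the next fetch
    else:
        active[0] -= 1
        if active[0] == 0:
            reactor.stop()

for _ in range(concurrency):
    fetch_next()
reactor.run()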

We did some tests with Python's garbage collection module but have not seen any
irregularities so far; in particular, gc.garbage stays empty.

This is a very urgent problem for us and we will keep testing and debugging, but
any help would be greatly appreciated!

Thanks in advance!
jojo.

comment:2 Changed 16 years ago by jojo

It seems the problem has nothing to do with DNS resolution: when using
HTTPClientFactory and reactor.connectTCP() directly with pre-resolved
IP addresses, the memory leak remains. (reactor.resolve() is never called in
this case.)
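
For the record, the DNS-free variant boils down to something like this (sketch):

from twisted.internet import reactor
from twisted.web.client import HTTPClientFactory

def fetch(ip, url):
    # url is still passed for the request path/Host header; ip is pre-resolved
    factory = HTTPClientFactory(url)
    reactor.connectTCP(ip, 80, factory)
    return factory.deferred        # fires with the page body or a Failure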

What's really strange: when we crawl 400000 different pages from a single host
on our intranet, memory usage does not grow!

Any ideas? Where to look or what to try?

comment:3 Changed 16 years ago by Jean-Paul Calderone

On my machine, memory usage quickly goes to 98MB (within 5000 requests) and
stays there for the duration of the run (60000 requests).

The number of objects in the interpreter varies between 20k and 43k.
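
(For reference, the object count can be sampled with something like the
following while the reactor is running; a sketch, not necessarily how these
numbers were obtained:)

import gc
from twisted.internet import task

def report():
    print 'live objects:', len(gc.get_objects())

task.LoopingCall(report).start(5.0)    # print a sample every 5 seconds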

Python 2.4.1, Ubuntu version
Twisted 2.0.1
Linux 2.6.10-5-686-smp #1, Ubuntu version

comment:4 Changed 16 years ago by Jean-Paul Calderone

Additionally, if I pre-resolve all hostnames, memory usage only goes to 18MB,
and stays there.

comment:5 Changed 16 years ago by titty

Can you try with the latest SVN version? With Twisted 2.0.1 I can't reproduce
the problem either.
With the following patch everything works fine for me, though I honestly don't
understand why. In the unpatched version handleResponse is sometimes not called
from handleResponseEnd, but that doesn't seem to be the problem: I changed
handleResponseEnd to
def handleResponseEnd(self):
        self.__buffer = None
        self.handleResponse('')
and the program still leaked.

Also, there is no need to crawl that many pages:

#req     unpatched      patched
1000     26 MB          18 MB
2000     45 MB          21 MB
3000     65 MB          22 MB





Index: web/http.py
===================================================================
--- web/http.py (Revision 13960)
+++ web/http.py (Arbeitskopie)
@@ -356,7 +356,7 @@
     """
     length = None
     firstLine = 1
-    __buffer = None
+    __b = ''
 
     def sendCommand(self, command, path):
         self.transport.write('%s %s HTTP/1.0\r\n' % (command, path))
@@ -387,7 +387,7 @@
             if key.lower() == 'content-length':
                 self.length = int(val)
         else:
-            self.__buffer = []
+            self.__b = ''
             self.handleEndHeaders()
             self.setRawMode()
 
@@ -395,13 +395,12 @@
         self.handleResponseEnd()
 
     def handleResponseEnd(self):
-        if self.__buffer != None:
-            b = ''.join(self.__buffer)
-            self.__buffer = None
-            self.handleResponse(b)
+        b = self.__b
+        self.__b = ''
+        self.handleResponse(b)
     
     def handleResponsePart(self, data):
-        self.__buffer.append(data)
+        self.__b += data
 
     def connectionMade(self):
         pass

comment:6 Changed 16 years ago by titty

Ah, that table got screwed up: the unpatched version used 26/45/65 MB after
1000/2000/3000 requests, and the patched version used 18/21/22 MB.

comment:7 Changed 16 years ago by jknight

I believe this is in the main unreproducible, and the reproducible part of the
issue was fixed by the changes that stop the threadpool from always spawning its
maximum number of threads.

Closing as resolved.

comment:8 Changed 16 years ago by titty

Did you actually try to reproduce it?
If so, on which systems?
We've invested a lot of time in this bug, and it's rather sad to see it closed
because someone "believes" it's unreproducible...

comment:9 Changed 16 years ago by jknight

Yes, and so did exarkun, above. I tried Twisted SVN head with Python 2.3.5 on
OS X 10.4, and Twisted SVN head with Python 2.3.5 on Debian Linux sarge. On
Linux the process stays at 10 MB; on OS X it stays at 30 MB.

Additionally, the patch above could not possibly fix a memory leak. It may make
the process use less memory under some conditions, but it cannot stop a leak:
the memory is recovered after the request finishes either way.

So, yes, I do believe it is unreproducible.

comment:10 Changed 16 years ago by titty

exarkun seems to have pretty much the same system I used for my tests (i.e.
Ubuntu, 2.6 kernel, Python 2.4.1).
When I ran my tests with Twisted 2.0.1 I could not reproduce the problem either,
so I tried to find out which change applied on trunk caused the leak. It turned
out to be the reverse of my patch that seems to cause the leak for me (yes, I
know my patch should not have any significant effect on memory usage).

Maybe I was just too drunk the last time I tested it, so I'll give it another
try in the next few days. If I can reproduce it, I'll attach the URLs; otherwise
I'll stop drinking and bothering you :)

Thanks anyway.

Changed 16 years ago by titty

Attachment: patched-log.txt added

comment:11 Changed 16 years ago by titty

Drunk again, but I now have evidence.

Changed 16 years ago by titty

Attachment: urls.txt.gz added

comment:12 Changed 16 years ago by titty

Those are the URLs.

comment:13 Changed 16 years ago by titty

patched-log.txt is the wrong name; it's actually the output of a shell session.
I first crawled 5500 URLs using a patched version, then again with an unpatched
version. Afterwards the patched version uses 30 MB of resident memory, while the
unpatched one uses 91 MB.
The test program has been modified slightly: it uses a cookbook recipe from
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/286222 to report memory usage.
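
(That recipe essentially reads the Vm* lines from /proc/self/status on Linux;
a paraphrased sketch, not the recipe verbatim:)

def memory_usage(keys=('VmSize', 'VmRSS')):
    usage = {}
    for line in open('/proc/self/status'):
        fields = line.split()
        if fields and fields[0].rstrip(':') in keys:
            usage[fields[0].rstrip(':')] = ' '.join(fields[1:])   # e.g. '91432 kB'
    return usage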

I'll try the same thing in a Debian sarge chroot...

And no, I didn't fake those numbers...

comment:14 Changed 16 years ago by jknight

The fact that it uses 91 MB after one version and 30 MB after the other doesn't
necessarily mean there is a memory leak. It may just be that Python has
allocated more peak memory and that it will stop growing and stay at 91 MB
forever after. However, I will try out your concrete example myself and
hopefully be able to provide more insight afterwards.

comment:15 Changed 16 years ago by titty

Peak memory usage is as high as 319 MB, and (just a wild guess) I don't think
this is really the peak, since I stopped the program after having crawled 23000
pages. When I do the same thing with the patched version, it uses around 65 MB.

Changed 16 years ago by jeffsilver

Attachment: dumbserver.py added

comment:16 Changed 16 years ago by jeffsilver

I've recently run into this too. The attached file demonstrates that objects do
not get deleted because of mutual references. Comment out the line marked 'Fix'
and run this little server, connect to it (telnet is good), and send a rubbish
request: DumbHandler's __del__ method is NOT called. Uncomment the 'Fix' line
and run it again: this time the __del__ method IS called, showing that all
references are gone.

self.requests is a list of requests, each of which holds a reference back to
the DumbHandler instance. I think the 'Fix' line should be added to
HTTPChannel.connectionLost.
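
Roughly, the demonstration looks like this (a sketch from memory, not the
attachment verbatim; the __del__ is only there as instrumentation):

from twisted.internet import reactor
from twisted.web import http

class DumbHandler(http.Request):
    def process(self):
        self.setResponseCode(400)
        self.finish()

    def __del__(self):
        print 'DumbHandler deleted'    # instrumentation only

class DumbChannel(http.HTTPChannel):
    requestFactory = DumbHandler

    def connectionLost(self, reason):
        http.HTTPChannel.connectionLost(self, reason)
        del self.requests              # the 'Fix' line

class DumbFactory(http.HTTPFactory):
    protocol = DumbChannel

reactor.listenTCP(8080, DumbFactory())
reactor.run()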

comment:17 Changed 16 years ago by Jean-Paul Calderone

Instrumenting that demonstration with __del__ alters the actual behavior: it
turns a collectable cycle into uncollectable cyclic garbage. Python will collect
the HTTPChannel and its requests whether 'del self.requests' is run or not, as
long as HTTPChannel doesn't have a __del__. It will collect them faster if the
cycle is broken manually, though, so maybe this is a good change to make anyway.
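
(A standalone illustration of that point, independent of Twisted; on the
Python 2.x versions in this ticket, a cycle containing an object with __del__
ends up in gc.garbage instead of being freed:)

import gc

class Plain(object):
    pass

class WithDel(object):
    def __del__(self):
        pass

def make_cycle(cls):
    a, b = cls(), cls()
    a.other, b.other = b, a            # a <-> b reference cycle

make_cycle(Plain)
gc.collect()
print 'after Plain cycle:', len(gc.garbage)    # 0 - the cycle was collected

make_cycle(WithDel)
gc.collect()
print 'after WithDel cycle:', len(gc.garbage)  # 2 - uncollectable garbage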

Should requests be able to get notification of this event?

comment:18 Changed 16 years ago by jeffsilver

I see what you mean about using __del__; thanks.

My suggestion to put 'del self.requests' into HTTPChannel.connectionLost has
problems. HTTPChannel.connectionLost is called from HTTPChannel.requestDone,
and HTTPChannel.requestDone is called from Request._cleanup. I think _cleanup
is called both on request completion and on connection loss, and in the former
case we mustn't mess with 'requests' or we break the request-queueing mechanism.
(My 'fix' broke Nevow.)

I tried protecting against this by putting 'self.requests = []' into
HTTPChannel.connectionLost (instead of the 'del') and guarding the call to
requestDone in Request._cleanup with 'if self.channel.requests:'. This seems to
work, but it looks very hacky to me, so I don't have a lot of confidence in it.

Changed 16 years ago by titty

Attachment: twisted-1079.tgz added

comment:19 Changed 16 years ago by titty

I've come to the conclusion that this is just an 'odd' memory-fragmentation
issue. I'm attaching a C module that simulates _socketmodule's recv behaviour
(when called from Twisted, recv preallocates 64 KB of memory and later shrinks
that allocation).
After compiling it, one can run the attached sim.py, which simulates one run of
t.py crawling around 15000 pages.
Virtual memory size is around 215 MB after it has finished; with the attached
patch (stringio-patch.diff inside the archive) it is around 14 MB.

comment:20 Changed 16 years ago by titty

Note: the numbers above are for Python 2.4.2 running on Ubuntu 5.10.
On FreeBSD 4.11 (also with 2.4.2) I get around 50 MB for the unpatched version
and around 11 MB for the patched one.

comment:21 Changed 16 years ago by jknight

Okay, finally an explanation that makes sense. When resizing a string, Python
uses the system's realloc(), so I'm guessing what happens here is that the
system's realloc() isn't actually releasing any memory when it shrinks the block.

In any case, the patch seems reasonable; applied.

comment:22 Changed 16 years ago by titty

Thanks for the fast reply.

comment:23 Changed 11 years ago by <automation>

Owner: jknight deleted