[Twisted-web] reduce deferred stack in nevow

Andrea Arcangeli andrea at cpushare.com
Thu Jan 20 09:44:55 MST 2005

On Thu, Jan 20, 2005 at 12:38:06PM +0000, Valentino Volonghi wrote:
> There are some issues to consider:
> 1) The optimizations branch will surely help a lot. I got a 2x speedup
> after using it (the branch caches every flattener lookup and does a
> great job in context lookups) unfortunately this is not yet merged and
> it's a bit old, so you may have to merge trunk with it first.
> 2) After that branch you may have pages that render in 70ms with the
> most complex that could take between 400-600ms.
> Keep in mind that taking 70ms to render a page is not so slow for a single page. 

This is great news. Is this going to be merged into trunk soon? It would
hide at least the troubles I get with components. I make quite heavy use
of fragments, and those rend.Fragments only change once every 10 seconds (I
already cache the SQL queries for the fragments).

But I suspect something can be improved in the rendering itself too. The
slowdown I get seems a bit excessive, and the caching doesn't help the
heavily dynamic part.

> Another thing that could possibly work is rewriting the flatteners in
> Pyrex or even your page module in Pyrex, nothing stops you from that.

Indeed, but I'd leave this as a last resort ;).

> But first you should try to cache rendered fragments so that if data
> doesn't change you will be able to directly serve the pre-rendered
> html. 

Definitely agreed. I'm already caching some SQL queries in my module, so
I guarantee they're not going to change too often.
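
A minimal sketch of the kind of time-based cache discussed above, assuming
the fragment can be rendered to a plain string by some callable. The names
RenderCache, render and ttl are hypothetical and are not part of nevow:

```python
import time


class RenderCache:
    """Hypothetical helper: keep one rendered string for `ttl` seconds."""

    def __init__(self, render, ttl=10.0, clock=time.time):
        self._render = render    # callable returning the rendered HTML string
        self._ttl = ttl          # how long a rendered copy stays valid
        self._clock = clock      # injectable clock, handy for testing
        self._html = None
        self._expires = 0.0

    def get(self):
        now = self._clock()
        # Re-render only when nothing is cached yet or the copy has expired.
        if self._html is None or now >= self._expires:
            self._html = self._render()
            self._expires = now + self._ttl
        return self._html
```

With a 10 second ttl, at most one request in each 10 second window pays the
rendering cost; every other request gets the stored string.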

> And before everything you can use the load balancer, and once the
> session management will be improved you will also be able to share the
> session on many different servers.

Yep, but in the short term I won't have the finances to use more than a
single system, and taking 800 msec to render a page suggests something is
wrong, even taking Python's slowness into account. It'd probably work
since my initial bandwidth will be so low, but currently my homepage
cannot deliver more than 25k/sec due to the CPU limit.

Here is a bench of a static file (the CSS):

Document Path:          /css
Document Length:        5690 bytes

Concurrency Level:      1
Time taken for tests:   0.721704 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      588300 bytes
HTML transferred:       569000 bytes
Requests per second:    138.56 [#/sec] (mean)
Time per request:       7.217 [ms] (mean)
Time per request:       7.217 [ms] (mean, across all concurrent requests)
Transfer rate:          795.34 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     6    6   2.1      6      26
Waiting:        1    6   2.0      6      25
Total:          6    6   2.2      6      27

Percentage of the requests served within a certain time (ms)
  50%      6
  66%      6
  75%      7
  80%      7
  90%      7
  95%      7
  98%      8
  99%     27
 100%     27 (longest request)

And here is the homepage, about the same size:

Document Path:          /
Document Length:        5505 bytes

Concurrency Level:      1
Time taken for tests:   21.268182 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      569330 bytes
HTML transferred:       550500 bytes
Requests per second:    4.70 [#/sec] (mean)
Time per request:       212.682 [ms] (mean)
Time per request:       212.682 [ms] (mean, across all concurrent requests)
Transfer rate:          26.10 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   196  212  67.0    201     871
Waiting:      196  207  23.1    201     423
Total:        196  212  67.0    201     871

Percentage of the requests served within a certain time (ms)
  50%    201
  66%    209
  75%    211
  80%    212
  90%    218
  95%    224
  98%    227
  99%    871
 100%    871 (longest request)

It's about 5 requests per second vs 138 requests per second, so the
difference between static and dynamic data is extreme. I'm betting the
rendering can be heavily optimized by rewriting it with efficiency in mind.
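
To put numbers on that gap, here is a small worked calculation using only
the totals from the two ab runs above (the variable names are mine):

```python
# Totals reported by ab for 100 sequential requests each.
requests = 100
static_total_s = 0.721704     # /css
dynamic_total_s = 21.268182   # /

# Mean time per request, in milliseconds.
static_ms = static_total_s / requests * 1000
dynamic_ms = dynamic_total_s / requests * 1000

# How many times slower the dynamic page is, and how much of each
# request is spent beyond what serving a static file already costs.
slowdown = dynamic_total_s / static_total_s
render_overhead_ms = dynamic_ms - static_ms

print("static: %.3f ms, dynamic: %.3f ms" % (static_ms, dynamic_ms))
print("slowdown: %.1fx, overhead: %.1f ms" % (slowdown, render_overhead_ms))
```

So the homepage is roughly 29 times slower than the static file, and about
205 ms of each request goes to something other than plain file serving.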

My profiling shows the interface layer (getInterfaces paths) to be one
of the biggest offenders (second only to the flattener, which is in turn
second to the loader).
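
For anyone who wants to reproduce this kind of ranking, a rough sketch using
the standard cProfile and pstats modules, sorted by call count. This is an
assumption about tooling, not necessarily the profiler I used, and
render_page is a hypothetical stand-in for the real page rendering:

```python
import cProfile
import io
import pstats


def render_page():
    """Hypothetical stand-in for whatever renders one page."""
    return "".join(str(i) for i in range(1000))


def profile_calls(func, repeat=100, top=10):
    """Run func `repeat` times and return a report sorted by call count."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(repeat):
        func()
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # "calls" sorts by number of calls, so the most called functions come first.
    stats.sort_stats("calls").print_stats(top)
    return out.getvalue()


report = profile_calls(render_page)
print(report)
```

Sorting by "cumulative" instead of "calls" shows where the time goes rather
than which function is entered most often; both views are useful here.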

The getInterfaces thing seems to loop and recurse too much. Plus it's
all obsolete stuff that raises a warning too. So can all this
inefficient stuff in compy be dropped and rewritten using
zope.interface? (I'm fine if you leave it to run w/o twisted, but
within twisted it seems bad to have two implementations of the same
thing, especially if this one is deprecated and so slow.) Is
zope.interface just as inefficient, or does it have a chance to work in
O(1)? At first glance these look like complexity problems caused by bad
algorithms. In fact, thinking about it, the whole __implements__ API seems
unfixable: it shouldn't be a tuple but a hash. Otherwise, to find out if a
certain feature is implemented, one has to walk the whole tuple in O(N)
(which is probably more or less what getInterfaces does). Then there's this
stuff recursing and messing it up, which probably makes it even worse than a
linear search in practice, and it's probably getting called more than
once. This is the most called function in my profiling at least (hundreds
of thousands of times with just a hundred queries).
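
To illustrate the tuple-versus-hash point, a minimal sketch. It is not the
compy or zope.interface code; the interface classes and the helper names
are made up for the example:

```python
def flatten(tree):
    """Flatten a nested tuple of interface markers into a flat list."""
    if isinstance(tree, tuple):
        flat = []
        for item in tree:
            flat.extend(flatten(item))
        return flat
    return [tree]


# Hypothetical interface markers.
class IRenderer: pass
class IResource: pass
class ISession: pass

# Nested declaration, in the style of an __implements__ tuple.
declared = (IRenderer, (IResource, (ISession,)))


def implements_linear(declared, iface):
    # Walks and flattens the whole tree on every single query: O(N) each time.
    return iface in flatten(declared)


# Flatten once, then every query is a single hash lookup (O(1) on average).
declared_set = frozenset(flatten(declared))


def implements_hashed(iface):
    return iface in declared_set
```

The flattening cost is still paid once per declaration, but no longer once
per lookup, which is what matters if the check runs hundreds of thousands of
times.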

So I'm not really happy about this interface mania that is apparently
hurting performance so much. If the interfaces cannot be implemented
better, it's much better to use a dirty pointer that adds a field to a
class but that doesn't run in O(N) (or O(N**2)). It clobbers the
namespace and it's not as clean, but it's usable in production.
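
By "dirty pointer" I mean something along these lines. This is only a
sketch; Page, Fragment and is_renderer are hypothetical names, not nevow
classes:

```python
class Page:
    # One plain class attribute acts as the marker; the default is "no".
    is_renderer = False


class Fragment(Page):
    # Classes that provide the feature just override the field.
    is_renderer = True


def can_render(obj):
    # A single attribute lookup; the cost does not grow with the number
    # of features a class declares.
    return getattr(obj, "is_renderer", False)
```

The lookup still walks the class hierarchy, so it is not strictly constant
time, but it never has to scan or flatten a list of declared interfaces.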

I mean, I don't want to rewrite the thing in C just to be able to
run an O(N) loop faster when a dirty pointer would have fixed it in
Python. That would be a mistake. So I'd like to get some explanation if
I'm missing something, and to get this fixed if I'm correct about this
theory (so far it's mostly a theory since I don't understand the whole
internals of the interface stuff yet, but stuff like the below makes me
wonder whether something is wrong there).

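import types

def tupleTreeToList(t, l=None):
    """Convert an instance, or tree of tuples, into list."""
    if l is None: l = []
    if isinstance(t, types.TupleType):
        # Recurse into nested tuples, accumulating into the shared list.
        for o in t:
            tupleTreeToList(o, l)
    else:
        # A leaf (a single instance): collect it.
        l.append(t)
    return l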

Note that even if we have to break the API to make it faster it's not
too bad: all that __implements__ slowdown is very easy to grep, so a
conversion would be quick. Perhaps this is what zope.interface already
does? I've noticed the API has changed slightly to implements()
instead of __implements__ = .... I seriously hope the API didn't change
gratuitously without providing runtime benefits.

Comments welcome, thanks!
