Ticket #2477 (new enhancement)

Opened 3 years ago

Last modified 3 weeks ago

xml serialize optimizations

Reported by: jack Owned by: exarkun
Priority: normal Milestone:
Component: words Keywords: xish domish xml
Cc: jack, twonds, ralphm, exarkun, thijs Branch: branches/faster-domish-serialization-2477
Author: exarkun Launchpad Bug:

Description

twisted.words.xish's xml serialization is quite slow. It is routinely showing up as the top cycle eater in profiles. The following two patches attempt to address this.

The first one gets rid of all string formatting operations in favor of more calls to write(). This results in about a 17% speedup.

The second patch caches the results of escapeToXml and achieves about another 25% performance boost at the cost of memory. This patch is usable but could probably be done better.

The performance is still not great, but I'm unsure how further gains can be made. I tried amny things includes cStringIO and a non-recursive implementation (both made the code slower, not faster). I think the real solution is to make domish a light wrapper around some fast underlying C library.

Attachments

domish.diff Download (4.9 KB) - added by jack 3 years ago.
replace string formatting with more calls to write()
domish2.diff Download (1.1 KB) - added by jack 3 years ago.
cache results of escapeToXml
test.log.bz2 Download (127.0 KB) - added by jack 3 years ago.
test data from our xmpp component
timer.py Download (1.6 KB) - added by jack 3 years ago.
benchmark script
cserialize.diff Download (1.7 KB) - added by thijs 7 weeks ago.
diff from #3633

Change History

Changed 3 years ago by jack

replace string formatting with more calls to write()

Changed 3 years ago by jack

cache results of escapeToXml

Changed 3 years ago by exarkun

Can you attach the benchmarks you used?

Changed 3 years ago by jack

Sure. I'm attaching a small data file with xml extracted from our components logs (ie, real world data for our application) and the timing code I used for benchmarking performance on this data.

Changed 3 years ago by jack

test data from our xmpp component

Changed 3 years ago by jack

benchmark script

Changed 3 years ago by ralphm

The problem with using a c based XML library like elementtree, is that it seems hard to control the serialization. XMPP has special requirements in this regard in terms of how namespaces are used and that we serialize stanzas, not complete documents. I'd like to be proven wrong, though.

Changed 3 years ago by ralphm

  • owner changed from exarkun to ralphm

Changed 2 years ago by ralphm

  • owner changed from ralphm to jack

Jack, I'm not sure how the log file relates to the timer script here. Could you expand on that? It would be nice to have a script that we can use to test performance before and after applying the patches, that we can put alongside the Words documentation, similar to the ones in doc/core/benchmarks.

Changed 2 years ago by twonds

It is just data to test with the profile script. You can just use that data or generate some other xml to run the script and profile before and after the patches right? We could just switch this to use timeit instead of profile. I could do this or get jack to do it if that is what you want.

Changed 14 months ago by jack

  • owner changed from jack to ralphm

Changed 14 months ago by glyph

  • keywords review added
  • owner ralphm deleted

So, according to  this blog post, Jack has already been using this for "2 years", and has started distributing a fork of Twisted that includes this patch. In that same post he complained that it hasn't been merged yet, so maybe somebody should review it.

Changed 14 months ago by exarkun

  • branch set to branches/faster-domish-serialization-2477
  • branch_author set to exarkun

(In [26066]) Branching to 'faster-domish-serialization-2477'

Changed 14 months ago by exarkun

(In [26075]) Fix a bug in the cache optimization which results in incorrect XML escaping

refs #2477

Changed 14 months ago by exarkun

  • cc exarkun added
  • owner set to ralphm

I applied both patches to a branch. I also grabbed some of the test data and worked it into a modified version of the benchmark which can easily be run to produce some performance data.

The caching part of the patch had two problems which I have addressed (with different success, I think). First, it caused escapeToXml to return incorrect results sometimes because the cache did not take into account the value of isattrib. The code now makes sure to respect this value. Second, the cache was allowed to grow to a million items, then completely discarded. A million items is too large for an upper bound for a private cache which users have no control over. The behavior of discarding all entries when the cache became full also seems strange; I suspect this case was never encountered, or was at least encountered only very rarely, but I don't know. It seems as though most keys in the cache would have to be very low value, only being computed once or twice in the lifetime of the cache. However, I don't really know what a typical xmpp stream's values look like (possibly there is no such thing as a typical stream). So, I have replaced the caching scheme with a different one. This one imposes a (comparatively) low limit and uses a very simple two-generation scheme to attempt to preserve keys with frequent hits while discarding those which are not often requested. As to whether it really succeeds at this goal, I do not know. It likely depends on the actual data used with it, which again I am not very knowledgeable of.

Changed 14 months ago by jack

That sounds like the right improvement. I believe we used fill and discard because it is very fast. Most other methods we could think of incurred enough overhead to make the performance gains less interesting.

Does your new benchmarking find similar results to what I reported with the original cache method? What are they like with the new cache method?

Changed 14 months ago by ralphm

  • owner changed from ralphm to exarkun
  • keywords review removed

I've run the benchmark a bunch of times on trunk, branch with only Jack's patches (r26063), branch up till exarkun's caching experiments (r26073), and the branch as it is now (r26082). On my machine, I consistently get respectively 2.2⋅10³, 3.4⋅10³, 3.7⋅10³ and 4.1⋅10³ elements/s. Thanks for the improvements!

Some review comments:

  • domish_serialization.py import sys, but doesn't use it.
  • serialize should have a proper docstring.
  • A bunch of docstrings don't have their opening on a line of their own, unfortunately, and this should at least be fixed for escapeToXml.
  • escapeToXml's docstring should mention the caching.
  • The cache limit should probably have a name.
  • The change to addContent made me discover an issue with SerializedXml. I filed #3623 for that, but maybe it could be solved in this branch.

Changed 14 months ago by exarkun

Jack, is there any chance you can try this code out and compare its performance to trunk and to your original patches?

Changed 14 months ago by jack

I can confirm Ralph's findings. Trunk gives ~2100 elements/sec. My patches give ~3100 elements/sec. And the final branch code gives ~3700 elements/sec. Great job.

I've now written a C serializer for domish Elements in and attached it to a new bug. This code gives ~7500 elements/sec. See #3633.

Changed 14 months ago by exarkun

Jack, have you tried using psyco with this code? In my tests, it produces roughly a 3x speedup in this branch (distinctly less in trunk, I suppose there are fewer opportunities for it there). How does this compare to the C serialize you've written?

Changed 12 months ago by exarkun

Ping?

Changed 10 months ago by jack

We used to run psyco but ran into some weird problems and turned it off some time back. I'm a bit wary of turning that on in our production systems unless I can dig up the reasons we turned it off, or we can do some load testing with it enabled on the staging server.

If you are saying it's 3x faster, that would make it approximately (and hypothetically since I haven't run it) 11k elements/s, which is a fair bit faster than my C version. However, I think the C version can be sped up considerably.

I see no reason not to commit this work since there is still outstanding work to do on the C version, and there may be good reasons to have both implementations (ie, C may be faster for CPython, but unworkable on Jython or IronPython?).

Changed 10 months ago by jack

Addressing exarkun's comments from #3633:

What's interesting is that XML serialization is fast. In the light of the 2x improvement of the approach of using a C library instead of a Python library, one thought I have is that the Python optimizations are irrelevant and should be discarded. They obscure the code and don't bring its performance even close to the performance this optimization gives. That said, I'd also really like to explore using an existing C library to do this serialization rather than including a completely custom C extension.

I disagree that the Python optimizations are irrelevant now for the reasons stated above. I agree that the python optimization is a tradeoff on code readability, but I think as long as the rest of the domish stuff is nice and readable, it's not a big deal.

The problem with using an existing C library to do serialization is that it will need the XML represented in its internal format to start with. Translating from domish's format to say the libxml2 format is probably just as expensive an operation as serializing to a string. This story would be radically different if domish itself were just a wrapper around libxml2 or something similar. This has been discussed before (see #3201 for example).

Changed 7 weeks ago by thijs

  • cc thijs added

From #3633:

For most XMPP applications, XML serialization is the bottleneck. While #2477 improves the Python serializer quite a bit, this can be made faster by dropping down to C.

I've written a Python c module called cserialize which serializes domish.Element objects. This module is compatible with domish and passes most tests. It gives slightly more than a factor of 2 performance increase over the code in #2477.

The failing tests have to do with namespace issues, and after investigating the code, I believe the tests are wrong and need to be rewritten. I haven't rewritten them as I wanted to discuss it first.

You can find the cserialize code here:  http://github.com/metajack/cserialize

Changed 7 weeks ago by thijs

diff from #3633

Changed 7 weeks ago by thijs

  • owner exarkun deleted
  • keywords review added

Changed 6 weeks ago by exarkun

I think we should ignore the cserializer stuff. If someone wants to review what's in the branch referenced here, I guess that's fine. I'm not too excited about this, since it adds a lot of code (much of which is not idiomatic), but there's a benchmark and it shows this is a performance improve so great I guess.

It would be better if we could let someone else deal with this kind of performance issue, of course.

Changed 3 weeks ago by jkakar

  • owner set to exarkun
  • keywords review removed

[1]

+ for p, u in localPrefixes.iteritems():

It would be nice if 'p' and 'u' were words that describe the data each variable contains. There are a couple of other places where one letter variables could be expanded to words to improve readability.

[2]

+ tuple_ = tuple + type_ = type

Are you using locals here because they can be accessed faster than global variables? When I remove these and use tuple and type directly I see a slight speedup, which was a bit surprising...

[3]

+def escapeToXml(text, isattrib=0):

Out of curiosity, I tried changing the global _g1 and _g2 caches to:

def escapeToXml(text, isattrib=0, _g1={}, _g2={}):

I ran the benchmark a few times... the results weren't a clear win, but this approach seems to be very slightly faster (like 0.1%). It's probably not worth the ugliness/confusion of abusing the function signature.

[4]

+_g1 = {} +_g2 = {}

These would be easier to understand if they had names that reflected their purpose. I recommend _escapeXmlCache1 and _escapeXmlCache2.

[5]

key = (text, isattrib)

I see a slight speedup when the order of these values are inverted.

[6]

+ if isattrib == 1:

Is there a reason you're comparing to 1 here instead of just using 'if isattrib'?

The branch looks very nice!

Note: See TracTickets for help on using tickets.