[Twisted-web] twisted.web.template output encoding

Mon Dec 5 15:19:24 EST 2011

Sorry it took me so long to get to this.  Hopefully it's still relevant ;).

On Nov 26, 2011, at 11:52 AM, exarkun at twistedmatrix.com wrote:

> Apart from various issues relating to the lack of patterns in twisted.web.template,

I had some trepidation about marking <http://twistedmatrix.com/trac/ticket/5040> as "closed" :).  What kind of issues came up with patterns?  Anything you feel needs fixing?

> the main difficulty is in handling non-ascii contents in the traceback.  Apart from any unicode that may show up in the source code being rendered (or, perhaps, eventually, the values of variables to be rendered - though for now I do not plan to implement this) the no-break space characters which are necessary to get traceback lines indented properly mean that there is always some non-ascii to include in the output.

Looking at the actual output now, these &nbsp; characters strike me as an accident of how browsers collapse different types of whitespace.  They could be replaced with a <span style="display: inline-block; width: 4em;" /> (the inline-block is needed because width is ignored on plain inline elements) to avoid this problem for now, which is probably more expressive.

> twisted.web.template encodes its output using UTF-8, and this is not customizable.  Thus, using twisted.web.template, formatFailure's result will be a str containing UTF-8 encoded text.  Previously the result was a str containing only ASCII encoded text, with no-break space represented as `&nbsp;'.  Consequently, callers of `formatFailure' will probably mishandle the result - the caller in `twisted.web.server' does, at least, including the bytes in a page with a content type of "text/html".
> 
> The solutions that come to mind are all about removing this incompatible change and making it so `formatFailure' can continue to return a str with ASCII-encoded text.
> 
> One solution is to add support for named entities or numeric character references to twisted.web.template.  Very likely this is a good idea regardless (Nevow supported these).

I think that this is probably a necessary feature regardless, eventually.  Did you end up filing a ticket for it?

> Another solution is to use a different encoding in `twisted.web.template' - ASCII, with xmlcharrefreplace as the error handler.  This is tempting since it avoids an obtrusive non-ASCII support API (the way Nevow supports these is via `nevow.entities', which must be used rather than normal Python unicode objects).

I like this idea, because it's so hard to get wrong even if you have other problems (missing charset, buggy proxies, overly aggressive encoding detection, etc).  We can still say it's UTF-8 but it will work anywhere ASCII will work :).
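
A minimal sketch of what the ASCII-plus-xmlcharrefreplace idea looks like at the codec level.  The function name encode_ascii_safe is made up for illustration; it is not part of twisted.web.template.  Only the standard "xmlcharrefreplace" error handler is relied on.

```python
def encode_ascii_safe(text):
    """Encode text to ASCII bytes; anything non-ASCII becomes a &#NNN; reference."""
    return text.encode("ascii", "xmlcharrefreplace")


# A traceback-style line: four no-break spaces of indentation plus an accented letter.
line = u"\xa0\xa0\xa0\xa0raise ValueError(u'caf\xe9')"
encoded = encode_ascii_safe(line)

# The result is pure ASCII, so it decodes identically whether a client
# guesses ASCII, UTF-8, or Latin-1 for the page.
as_utf8 = encoded.decode("utf-8")
as_latin1 = encoded.decode("latin-1")
```

Note that this error handler only deals with characters outside ASCII.  Escaping of "&", "<" and ">" is a separate step and still has to happen before encoding.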

> Perhaps another question is whether the encoding used by `twisted.web.template' should be a parameter.  A related question raised might be whether `twisted.web.template' should encode to bytes at all, or delegate the responsibility for that to code closer to a socket.

Personal experience looking at profiles of applications which serialize a lot of XML suggests to me that encoding and decoding text in Python is a huge chunk of CPU work and memory footprint; keeping the encoding in t.w.t provides an opportunity for a potentially important optimization which might not be possible if it were done closer to the socket.

For example, if we're rendering a long table that produces 10MB of HTML, and this is encoded incrementally (even forgoing any smarter optimizations, like caching the encoded form of strings), then there's a small working set of encoded data which can be collected as the template renders, and by the time the final string is emitted by cStringIO.getvalue() or what have you, you're using 20-ish megabytes of heap to store your UTF-8 bytes (10 in the StringIO and 10 in the str).  If you build this as a unicode string instead, you'll end up using 50MB: 40MB for your unicode string (at 4 bytes per character, as on a wide build) and 10MB for the encoded bytes.  Part of this is just an implementation issue, but even if Python gets a smarter unicode representation, you still need more space, because you need to store the encoded and decoded representations concurrently.
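
To make the two strategies concrete, here is a rough sketch, not how twisted.web.template actually does it.  All three function names are hypothetical, and io.BytesIO stands in for cStringIO so the snippet also runs on current Python.  The memory figures above are not reproduced here; the point is only the shape of the two approaches.

```python
import io


def render_rows(count):
    """Stand-in for a template: yields small unicode fragments one at a time."""
    for i in range(count):
        yield u"<tr><td>%d</td><td>caf\xe9</td></tr>\n" % (i,)


def encode_incrementally(fragments, encoding="utf-8"):
    """Encode each fragment as it is produced; only bytes accumulate."""
    buf = io.BytesIO()
    for fragment in fragments:
        buf.write(fragment.encode(encoding))
    return buf.getvalue()


def encode_at_end(fragments, encoding="utf-8"):
    """Build the whole unicode document first, then encode it in one go.

    The full decoded text and the full encoded bytes exist at the same time.
    """
    return u"".join(fragments).encode(encoding)
```

Both functions produce the same bytes; they differ in how much decoded text has to be alive at once, and the first leaves room for caching the encoded form of fragments later.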

It might be a while until I get around to implementing something smart in this area, but I'd prefer we have an interface that makes such optimizations possible without breaking compatibility.

> As a work-around in `formatFailure' I can decode the output of the flattener using UTF-8 and then re-encode it to avoid non-ASCII, but it seems like this should be solved in `twisted.web.template' rather than over and over again in application code.

If this does end up happening in formatFailure or anywhere else, please (whoever does it) make sure to file a ticket to fix it; this should never be more than a temporary workaround.
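
For reference, the temporary workaround described in the quoted text would look roughly like this.  The helper name reencode_as_ascii is invented for illustration, and the sample bytes stand in for whatever the flattener returns.

```python
def reencode_as_ascii(utf8_bytes):
    """Turn UTF-8 flattener output back into ASCII-only bytes.

    Decodes the bytes as UTF-8, then re-encodes as ASCII with non-ASCII
    characters replaced by numeric character references.
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode("ascii", "xmlcharrefreplace")


# Stand-in for flattener output containing two no-break spaces.
flattened = u"<span>\xa0\xa0foo()</span>".encode("utf-8")
ascii_only = reencode_as_ascii(flattened)
```

This pays for an extra decode and encode of the whole document on every call, which is one more reason it should stay a stopgap.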