[Twisted-web] twisted.web.template output encoding

Tue Jan 3 10:54:29 EST 2012

On 5 Dec 2011, 08:19 pm, glyph at twistedmatrix.com wrote:
>Sorry it took me so long to get to this.  Hopefully it's still relevant 
>;).

Heh.  Heh heh heh.  Heh.

>On Nov 26, 2011, at 11:52 AM, exarkun at twistedmatrix.com wrote:
>>Apart from various issues relating to the lack of patterns in 
>>twisted.web.template,
>
>I had some trepidation about marking 
><http://twistedmatrix.com/trac/ticket/5040> as "closed" :).  What kind 
>of issues came up with patterns?  Anything you feel needs fixing?

The approach facilitated by #5040 seems to result in much more 
boilerplate than the approach facilitated by Nevow's patterns.  The code 
for #4896 has many, many Elements.  An implementation using Nevow 
probably would have had far fewer, perhaps only one.

Which of these is better, I don't know.  I certainly got bored very 
early on in the #4896 work, though.

>>the main difficulty is in handling non-ascii contents in the 
>>traceback.  Apart from any unicode that may show up in the source code 
>>being rendered (or, perhaps, eventually, the values of variables to be 
>>rendered - though for now I do not plan to implement this) the no- 
>>break space characters which are necessary to get traceback lines 
>>indented properly mean that there is always some non-ascii to include 
>>in the output.
>
>Looking at the actual output now, these &nbsp; characters strike me as 
>an accident of how browsers collapse different types of whitespace. 
>They could be replaced with a <span style="width: 4em;" /> to avoid 
>this problem for now, which is probably more expressive.

If I understood Jonathan's reply properly, it sounds like the &nbsp; 
hack is the best we've got.

>>twisted.web.template encodes its output using UTF-8, and this is not 
>>customizable.  Thus, using twisted.web.template, formatFailure's 
>>result will be a str containing UTF-8 encoded text.  Previously the 
>>result was a str containing only ASCII encoded text, with no-break 
>>space represented as `&nbsp;´.  Consequently, callers of 
>>`formatFailure´ will probably mishandle the result - the caller in 
>>`twisted.web.server´ does, at least, including the bytes in a page 
>>with a content type of "text/html".
>>
>>The solutions that come to mind are all about removing this 
>>incompatible change and making it so `formatFailure´ can continue to 
>>return a str with ASCII-encoded text.
>>
>>One solution is to add support for named entities or numeric character 
>>references to twisted.web.template.  Very likely this is a good idea 
>>regardless (Nevow supported these).
>
>I think that this is probably a necessary feature regardless, 
>eventually.  Did you end up filing a ticket for it?

Yep, this has been filed and is up for review (for weeks now ;): #5408.

>>Another solution is to use a different encoding in 
>>`twisted.web.template´ - ASCII, with xmlcharrefreplace as the error 
>>handler.  This is tempting since it avoids an obtrusive non-ASCII 
>>support API (the way Nevow supports these is via `nevow.entities´, 
>>which must be used rather than normal Python unicode objects).
>
>I like this idea, because it's so hard to get wrong even if you have 
>other problems (missing charset, buggy proxies, overly aggressive 
>encoding detection, etc).  We can still say it's UTF-8 but it will work 
>anywhere ASCII will work :).
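A minimal sketch of the ASCII-plus-xmlcharrefreplace idea, using only the 
standard Python codecs.  It illustrates the error handler and nothing 
more; it is not what twisted.web.template currently does, and 
`encode_ascii_safe` is a made-up name for the example.

```python
def encode_ascii_safe(text):
    """Encode text to ASCII bytes, replacing each non-ASCII character
    with a numeric character reference instead of raising an error."""
    return text.encode("ascii", "xmlcharrefreplace")


# Two no-break spaces (U+00A0), as used to indent a traceback line.
line = "\u00a0\u00a0raise ValueError()"
encoded = encode_ascii_safe(line)
```

The resulting bytes are pure ASCII, so they decode identically whether 
the page is labelled ASCII, UTF-8, or left without any charset.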
>>Perhaps another question is whether the encoding used by 
>>`twisted.web.template´ should be a parameter.  A related question 
>>raised might be whether `twisted.web.template´ should encode to bytes 
>>at all, or delegate the responsibility for that to code closer to a 
>>socket.
>
>Personal experience looking at profiles of applications which serialize 
>a lot of XML suggests to me that encoding and decoding text in Python 
>is a huge chunk of CPU work and memory footprint; keeping the encoding 
>in t.w.t provides an opportunity for a potentially important 
>optimization which might not be possible if it were done closer to the 
>socket.
>
>For example, if we're generating a long table that generates 10MB of 
>HTML, if this is encoded incrementally (even foregoing any smarter 
>optimizations, like caching the encoded form of strings) then there's a 
>small working set of encoded data which can be collected as the 
>template renders, and by the time the final string is emitted by 
>cStringIO.getvalue() or what have you, you're using 20-ish megabytes of 
>heap to store your UTF-8 bytes (10 in the StringIO and 10 in the str). 
>If you build this as a unicode string instead, you'll end up using 
>50MB; 40MB for your unicode string, 10MB for the decoded bytes.  Part 
>of this is just an implementation issue, but even if Python gets a 
>smarter unicode representation, you still need more space, because you 
>need to store the encoded and decoded representations concurrently.

This all seems to suppose the non-existence of the 
twisted.web.template.flatten-style interface.  Doesn't that give you 
what's needed to do your incremental encoding outside of the flattener?
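A rough sketch of what incremental encoding outside the flattener could 
look like, assuming a hypothetical flattener that hands text chunks to a 
write callback one at a time.  Nothing here uses Twisted: the list of 
chunks stands in for flattener output, and `make_encoding_writer` is an 
invented helper built on the standard `codecs` module.

```python
import codecs


def make_encoding_writer(write_bytes, encoding="utf-8"):
    """Wrap a bytes-accepting callback so it can be fed text chunks.

    Each chunk is encoded as it arrives, so the whole document never
    has to exist as one large text string.
    """
    encoder = codecs.getincrementalencoder(encoding)()

    def write_text(chunk):
        data = encoder.encode(chunk)
        if data:
            write_bytes(data)

    return write_text


# Stand-in for a transport: collect the encoded pieces in a list.
received = []
write = make_encoding_writer(received.append)

# Stand-in for the chunks a flatten-style interface might produce.
for chunk in ["<td>", "caf\u00e9", "</td>"]:
    write(chunk)
```

Only one small chunk is held in decoded form at any moment, which is the 
same working-set property described for encoding inside the flattener.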
>
>
>It might be a while until I get around to implementing something smart 
>in this area, but I'd prefer we have an interface that makes such 
>optimizations possible without breaking compatibility.
>>As a work-around in `formatFailure´ I can decode the output of the 
>>flattener using UTF-8 and then re-encode it to avoid non-ASCII, but it 
>>seems like this should be solved in `twisted.web.template´ rather than 
>>over and over again in application code.
>
>If this does end up happening in formatFailure or anywhere else, please 
>(whoever does it) make sure to file a ticket to fix it; this should 
>never be more than a temporary workaround.

Okay.  #4896 is still up for review, and the branch implementing it does 
use the decode/encode hack.  I'll file a ticket for fixing that if I 
ever get to merge the branch (someone review it please).
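For reference, the decode/encode hack amounts to something like the 
following.  This is a sketch of the general idea rather than the code in 
the branch; `utf8_to_ascii_charrefs` is a made-up name, and the sample 
bytes merely stand in for real flattener output.

```python
def utf8_to_ascii_charrefs(data):
    """Turn UTF-8 encoded bytes into ASCII-only bytes, writing each
    non-ASCII character as a numeric character reference."""
    return data.decode("utf-8").encode("ascii", "xmlcharrefreplace")


# Stand-in for flattener output: an indented line, UTF-8 encoded.
flattened = "\u00a0\u00a0x = 1".encode("utf-8")
result = utf8_to_ascii_charrefs(flattened)
```

The round trip holds the UTF-8 bytes, the decoded text and the ASCII 
bytes all at once, which is part of why it should stay temporary.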

Jean-Paul