[Twisted-web] twisted.web.template output encoding
exarkun at twistedmatrix.com
Tue Jan 3 10:54:29 EST 2012
On 5 Dec 2011, 08:19 pm, glyph at twistedmatrix.com wrote:
>Sorry it took me so long to get to this. Hopefully it's still relevant
>;).
Heh. Heh heh heh. Heh.
>On Nov 26, 2011, at 11:52 AM, exarkun at twistedmatrix.com wrote:
>>Apart from various issues relating to the lack of patterns in
>>twisted.web.template,
>
>I had some trepidation about marking
><http://twistedmatrix.com/trac/ticket/5040> as "closed" :). What kind
>of issues came up with patterns? Anything you feel needs fixing?
The approach facilitated by #5040 seems to result in much more
boilerplate than the approach facilitated by Nevow's patterns. The code
for #4896 has many, many Elements. An implementation using Nevow
probably would have had far fewer, perhaps only one.
Which of these is better, I don't know. I certainly got bored very
early on in the #4896 work, though.
>>the main difficulty is in handling non-ascii contents in the
>>traceback. Apart from any unicode that may show up in the source code
>>being rendered (or, perhaps, eventually, the values of variables to be
>>rendered - though for now I do not plan to implement this) the no-
>>break space characters which are necessary to get traceback lines
>>indented properly mean that there is always some non-ascii to include
>>in the output.
>
>Looking at the actual output now, these characters strike me as
>an accident of how browsers collapse different types of whitespace.
>They could be replaced with a <span style="width: 4em;" /> to avoid
>this problem for now, which is probably more expressive.
If I understood Jonathan's reply properly, it sounds like the no-break
space hack is the best we've got.
>>twisted.web.template encodes its output using UTF-8, and this is not
>>customizable. Thus, using twisted.web.template, formatFailure's
>>result will be a str containing UTF-8 encoded text. Previously the
>>result was a str containing only ASCII encoded text, with no-break
>>space represented as ` ´. Consequently, callers of
>>`formatFailure´ will probably mishandle the result - the caller in
>>`twisted.web.server´ does, at least, including the bytes in a page
>>with a content type of "text/html".
>>
>>The solutions that come to mind are all about removing this
>>incompatible change and making it so `formatFailure´ can continue to
>>return a str with ASCII-encoded text.
>>
>>One solution is to add support for named entities or numeric character
>>references to twisted.web.template. Very likely this is a good idea
>>regardless (Nevow supported these).
>
>I think that this is probably a necessary feature regardless,
>eventually. Did you end up filing a ticket for it?
Yep, this has been filed and is up for review (for weeks now ;): #5408.
>>Another solution is to use a different encoding in
>>`twisted.web.template´ - ASCII, with xmlcharrefreplace as the error
>>handler. This is tempting since it avoids an obtrusive non-ASCII
>>support API (the way Nevow supports these is via `nevow.entities´,
>>which must be used rather than normal Python unicode objects).
>
>I like this idea, because it's so hard to get wrong even if you have
>other problems (missing charset, buggy proxies, overly aggressive
>encoding detection, etc). We can still say it's UTF-8 but it will work
>anywhere ASCII will work :).
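A minimal sketch of the ASCII-plus-xmlcharrefreplace idea, written for Python 3 rather than the Python 2 of this thread. The sample text is made up for illustration and nothing here is twisted.web.template API; it only shows what the standard codec error handler does.

```python
# Hypothetical sample: two no-break spaces of indentation plus an accented letter.
text = "\u00a0\u00a0return caf\u00e9"

# Characters outside ASCII become numeric character references.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")
print(ascii_bytes)

# The result is pure ASCII, so it decodes identically as ASCII or as
# UTF-8, whatever charset the page ends up being labelled with.
assert ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
```

Because every byte is below 0x80, a missing or wrong charset declaration should not change how a browser reads the page.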
>>Perhaps another question is whether the encoding used by
>>`twisted.web.template´ should be a parameter. A related question
>>raised might be whether `twisted.web.template´ should encode to bytes
>>at all, or delegate the responsibility for that to code closer to a
>>socket.
>
>Personal experience looking at profiles of applications which serialize
>a lot of XML suggests to me that encoding and decoding text in Python
>is a huge chunk of CPU work and memory footprint; keeping the encoding
>in t.w.t provides an opportunity for a potentially important
>optimization which might not be possible if it were done closer to the
>socket.
>
>For example, if we're generating a long table that generates 10MB of
>HTML, if this is encoded incrementally (even foregoing any smarter
>optimizations, like caching the encoded form of strings) then there's a
>small working set of encoded data which can be collected as the
>template renders, and by the time the final string is emitted by
>cStringIO.getvalue() or what have you, you're using 20-ish megabytes of
>heap to store your UTF-8 bytes (10 in the StringIO and 10 in the str).
>If you build this as a unicode string instead, you'll end up using
>50MB; 40MB for your unicode string, 10MB for the decoded bytes. Part
>of this is just an implementation issue, but even if Python gets a
>smarter unicode representation, you still need more space, because you
>need to store the encoded and decoded representations concurrently.
This all seems to suppose the non-existence of the
twisted.web.template.flatten-style interface. Doesn't that give you
what's needed to do your incremental encoding outside of the flattener?
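A rough sketch of incremental encoding done outside a flattener, using only the Python 3 standard library. The chunk generator is a hypothetical stand-in for whatever a flatten-style write callback would deliver, and encode_chunks is an illustrative name, not a Twisted function.

```python
import codecs
import io


def encode_chunks(chunks, write, encoding="ascii", errors="xmlcharrefreplace"):
    """Encode text chunks one at a time, passing each result to write()."""
    encoder = codecs.getincrementalencoder(encoding)(errors)
    for chunk in chunks:
        write(encoder.encode(chunk))
    # Flush any state the encoder is still holding.
    write(encoder.encode("", final=True))


# Stand-in for template output arriving piece by piece.
rows = ("<tr><td>\u00a0row %d</td></tr>" % i for i in range(3))

buffer = io.BytesIO()
encode_chunks(rows, buffer.write)
print(buffer.getvalue())
```

Only one decoded chunk is alive at a time, so the full unicode document never has to exist alongside its encoded form.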
>
>
>It might be a while until I get around to implementing something smart
>in this area, but I'd prefer we have an interface that makes such
>optimizations possible without breaking compatibility.
>>As a work-around in `formatFailure´ I can decode the output of the
>>flattener using UTF-8 and then re-encode it to avoid non-ASCII, but it
>>seems like this should be solved in `twisted.web.template´ rather than
>>over and over again in application code.
>
>If this does end up happening in formatFailure or anywhere else, please
>(whoever does it) make sure to file a ticket to fix it; this should
>never be more than a temporary workaround.
Okay. #4896 is still up for review, and the branch implementing it does
use the decode/encode hack. I'll file a ticket for fixing that if I
ever get to merge the branch (someone review it please).
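For reference, the decode/encode hack amounts to roughly the following. This is a Python 3 sketch, not the code in the branch, and reencode_as_ascii is a hypothetical helper name.

```python
def reencode_as_ascii(flattened):
    """Turn UTF-8 encoded markup into ASCII-only bytes.

    Markup characters (<, >, &) are already ASCII, so only the
    non-ASCII text is rewritten as numeric character references.
    """
    return flattened.decode("utf-8").encode("ascii", "xmlcharrefreplace")


# Made-up flattener output: an indented traceback line.
utf8_output = "<span>\u00a0\u00a0x = 1</span>".encode("utf-8")
ascii_output = reencode_as_ascii(utf8_output)
print(ascii_output)
```

The cost is a full extra decode and encode pass over the output, which is why it is only meant as a temporary workaround.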
Jean-Paul