[Twisted-web] twisted.web.template output encoding

Glyph glyph at twistedmatrix.com
Wed Jan 4 20:52:24 EST 2012

On Jan 3, 2012, at 10:54 AM, exarkun at twistedmatrix.com wrote:

> On 5 Dec 2011, 08:19 pm, glyph at twistedmatrix.com wrote:
>> Sorry it took me so long to get to this.  Hopefully it's still relevant 
>> ;).
> Heh.  Heh heh heh.  Heh.

So it goes ;-).

>> On Nov 26, 2011, at 11:52 AM, exarkun at twistedmatrix.com wrote:
>>> Apart from various issues relating to the lack of patterns in 
>>> twisted.web.template,
>> I had some trepidation about marking 
>> <http://twistedmatrix.com/trac/ticket/5040> as "closed" :).  What kind 
>> of issues came up with patterns?  Anything you feel needs fixing?
> The approach facilitated by #5040 seems to result in much more 
> boilerplate than the approach facilitated by Nevow's patterns.  The code 
> for #4896 has many, many Elements.  An implementation using Nevow 
> probably would have had far fewer, perhaps only one.
> Which of these is better, I don't know.  I certainly got bored very 
> early on in the #4896 work, though.

Well, if the approach on #5040 is way more verbose, what does it have in its favor?  Simplicity?  I must imagine that we can get both somehow.

>>> the main difficulty is in handling non-ascii contents in the 
>>> traceback.  Apart from any unicode that may show up in the source code 
>>> being rendered (or, perhaps, eventually, the values of variables to be 
>>> rendered - though for now I do not plan to implement this) the
>>> no-break space characters which are necessary to get traceback lines 
>>> indented properly mean that there is always some non-ascii to include 
>>> in the output.
>> Looking at the actual output now, these &nbsp; characters strike me as 
>> an accident of how browsers collapse different types of whitespace. 
>> They could be replaced with a <span style="width: 4em;" /> to avoid 
>> this problem for now, which is probably more expressive.
> If I understood Jonathan's reply properly, it sounds like the &nbsp; 
> hack is the best we've got.

I don't _want_ to read Jonathan's reply thoroughly enough to understand it, so I'll have to take your word for it.

>>> twisted.web.template encodes its output using UTF-8, and this is not 
>>> customizable.  Thus, using twisted.web.template, formatFailure's 
>>> result will be a str containing UTF-8 encoded text.  Previously the 
>>> result was a str containing only ASCII encoded text, with no-break 
>>> space represented as `&nbsp;´.  Consequently, callers of 
>>> `formatFailure´ will probably mishandle the result - the caller in 
>>> `twisted.web.server´ does, at least, including the bytes in a page 
>>> with a content type of "text/html".
>>> The solutions that come to mind are all about removing this 
>>> incompatible change and making it so `formatFailure´ can continue to 
>>> return a str with ASCII-encoded text.
>>> One solution is to add support for named entities or numeric character 
>>> references to twisted.web.template.  Very likely this is a good idea 
>>> regardless (Nevow supported these).
>> I think that this is probably a necessary feature regardless, 
>> eventually.  Did you end up filing a ticket for it?
> Yep, this has been filed and is up for review (for weeks now ;): #5408.

Great, okay.

>>> Another solution is to use a different encoding in 
>>> `twisted.web.template´ - ASCII, with xmlcharrefreplace as the error 
>>> handler.  This is tempting since it avoids an obtrusive non-ASCII 
>>> support API (the way Nevow supports these is via `nevow.entities´, 
>>> which must be used rather than normal Python unicode objects).
>> I like this idea, because it's so hard to get wrong even if you have 
>> other problems (missing charset, buggy proxies, overly aggressive 
>> encoding detection, etc).  We can still say it's UTF-8 but it will work 
>> anywhere ASCII will work :).
>>> Perhaps another question is whether the encoding used by 
>>> `twisted.web.template´ should be a parameter.  A related question 
>>> raised might be whether `twisted.web.template´ should encode to bytes 
>>> at all, or delegate the responsibility for that to code closer to a 
>>> socket.
>> Personal experience looking at profiles of applications which serialize 
>> a lot of XML suggests to me that encoding and decoding text in Python 
>> is a huge chunk of CPU work and memory footprint; keeping the encoding 
>> in t.w.t provides an opportunity for a potentially important 
>> optimization which might not be possible if it were done closer to the 
>> socket.
>> For example, if we're generating a long table that generates 10MB of 
>> HTML, if this is encoded incrementally (even foregoing any smarter 
>> optimizations, like caching the encoded form of strings) then there's a 
>> small working set of encoded data which can be collected as the 
>> template renders, and by the time the final string is emitted by 
>> cStringIO.getvalue() or what have you, you're using 20-ish megabytes of 
>> heap to store your UTF-8 bytes (10 in the StringIO and 10 in the str). 
>> If you build this as a unicode string instead, you'll end up using 
>> 50MB; 40MB for your unicode string, 10MB for the encoded bytes.  Part 
>> of this is just an implementation issue, but even if Python gets a 
>> smarter unicode representation, you still need more space, because you 
>> need to store the encoded and decoded representations concurrently.
> This all seems to suppose the non-existence of the 
> twisted.web.template.flatten style interface.  Doesn't that give you 
> what's needed to do your incremental encoding outside of the flattener?

Hmmmmmm.  Okay, generating a couple of short encoded strings does leave one with a much smaller working set.  There should definitely be a lot more convenience functions in this area to just do the right thing in the various contexts one might want to flatten something (for which there are already a few tickets, such as <http://tm.tl/5395>).  As I recall you've spoken against the flatten() style interface because it makes error-handling somewhat more challenging, but if #5395 were fixed it could take care of those complexities internally.
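
To make the working-set point concrete, here is a minimal sketch of encoding outside the flattener, one small chunk at a time.  It does not call twisted.web.template at all: `render_rows` is a hypothetical stand-in for whatever produces pieces of markup, `encode_incrementally` is a made-up helper name, and the `write` callable plays the role of the one a flatten()-style interface is given.  It is written for Python 3, where text is `str` and encoded data is `bytes`.

```python
import io


def encode_incrementally(chunks, write, encoding="utf-8"):
    """Encode each text chunk as it arrives and hand the bytes to write().

    Only one small encoded chunk is alive at a time; the full decoded
    document is never built.  Returns the number of bytes written.
    """
    total = 0
    for chunk in chunks:
        data = chunk.encode(encoding)
        write(data)
        total += len(data)
    return total


def render_rows(count):
    # Hypothetical stand-in for a template yielding small pieces of markup.
    # Each row contains a no-break space (U+00A0), so the output is not ASCII.
    for i in range(count):
        yield "<tr><td>row\u00a0%d</td></tr>\n" % i


# A BytesIO stands in for the transport; anything with write() would do.
sink = io.BytesIO()
written = encode_incrementally(render_rows(3), sink.write)
```

Whether the bytes then pile up in a buffer or go straight to a transport is up to the caller, which is the part the convenience functions would have to get right.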

>> It might be a while until I get around to implementing something smart 
>> in this area, but I'd prefer we have an interface that makes such 
>> optimizations possible without breaking compatibility.
>>> As a work-around in `formatFailure´ I can decode the output of the 
>>> flattener using UTF-8 and then re-encode it to avoid non-ASCII, but it 
>>> seems like this should be solved in `twisted.web.template´ rather than 
>>> over and over again in application code.
>> If this does end up happening in formatFailure or anywhere else, please 
>> (whoever does it) make sure to file a ticket to fix it; this should 
>> never be more than a temporary workaround.
> Okay.  #4896 is still up for review, and the branch implementing it does 
> use the decode/encode hack.  I'll file a ticket for fixing that if I 
> ever get to merge the branch (someone review it please).

Why not just file the ticket now?  As you said before: "Heh.  Heh heh heh.  Heh."  It might be a while before sufficient review bandwidth becomes available.  (If history is any indicator, things will stall out between now and February, and March will be crazily active.)
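
For anyone following along, the decode/encode hack amounts to something like the sketch below.  This is not the code from the #4896 branch; `ascii_only` is a hypothetical name, the sample traceback line is made up, and it is written for Python 3 (`bytes` in, `bytes` out).  It relies only on the standard `xmlcharrefreplace` error handler, which replaces each unencodable character with a numeric character reference.

```python
def ascii_only(utf8_bytes):
    """Re-encode UTF-8 markup as pure ASCII.

    Non-ASCII characters become numeric character references, so the
    result is valid under ASCII and under any ASCII-compatible charset.
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode("ascii", "xmlcharrefreplace")


# Made-up flattener output: two no-break spaces of indentation and a
# non-ASCII character in a file name.
flattened = "\u00a0\u00a0File \"caf\u00e9.py\", line 1".encode("utf-8")
result = ascii_only(flattened)
# result == b'&#160;&#160;File "caf&#233;.py", line 1'
```

The cost is exactly the extra decoded copy discussed earlier in the thread, which is one more reason this belongs in the template code rather than in every caller.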

