[Twisted-web] twisted.web.template output encoding

Glyph glyph at twistedmatrix.com
Wed Jan 4 20:52:24 EST 2012


On Jan 3, 2012, at 10:54 AM, exarkun at twistedmatrix.com wrote:

> On 5 Dec 2011, 08:19 pm, glyph at twistedmatrix.com wrote:
>> Sorry it took me so long to get to this.  Hopefully it's still relevant 
>> ;).
> 
> Heh.  Heh heh heh.  Heh.

So it goes ;-).

>> On Nov 26, 2011, at 11:52 AM, exarkun at twistedmatrix.com wrote:
>>> Apart from various issues relating to the lack of patterns in 
>>> twisted.web.template,
>> 
>> I had some trepidation about marking 
>> <http://twistedmatrix.com/trac/ticket/5040> as "closed" :).  What kind 
>> of issues came up with patterns?  Anything you feel needs fixing?
> 
> The approach facilitated by #5040 seems to result in much more 
> boilerplate than the approach facilitated by Nevow's patterns.  The code 
> for #4896 has many, many Elements.  An implementation using Nevow 
> probably would have had far fewer, perhaps only one.
> 
> Which of these is better, I don't know.  I certainly got bored very 
> early on in the #4896 work, though.

Well, if the approach on #5040 is way more verbose, what does it have in its favor?  Simplicity?  I must imagine that we can get both somehow.

>>> the main difficulty is in handling non-ascii contents in the 
>>> traceback.  Apart from any unicode that may show up in the source code 
>>> being rendered (or, perhaps, eventually, the values of variables to be 
>>> rendered - though for now I do not plan to implement this) the 
>>> no-break space characters which are necessary to get traceback lines 
>>> indented properly mean that there is always some non-ascii to include 
>>> in the output.
>> 
>> Looking at the actual output now, these &nbsp; characters strike me as 
>> an accident of how browsers collapse different types of whitespace. 
>> They could be replaced with a <span style="width: 4em;" /> to avoid 
>> this problem for now, which is probably more expressive.
> 
> If I understood Jonathan's reply properly, it sounds like the &nbsp; 
> hack is the best we've got.

I don't _want_ to read Jonathan's reply thoroughly enough to understand it, so I'll have to take your word for it.

>>> twisted.web.template encodes its output using UTF-8, and this is not 
>>> customizable.  Thus, using twisted.web.template, formatFailure's 
>>> result will be a str containing UTF-8 encoded text.  Previously the 
>>> result was a str containing only ASCII encoded text, with no-break 
>>> space represented as `&nbsp;´.  Consequently, callers of 
>>> `formatFailure´ will probably mishandle the result - the caller in 
>>> `twisted.web.server´ does, at least, including the bytes in a page 
>>> with a content type of "text/html".
>>> 
>>> The solutions that come to mind are all about removing this 
>>> incompatible change and making it so `formatFailure´ can continue to 
>>> return a str with ASCII-encoded text.
>>> 
>>> One solution is to add support for named entities or numeric character 
>>> references to twisted.web.template.  Very likely this is a good idea 
>>> regardless (Nevow supported these).
>> 
>> I think that this is probably a necessary feature regardless, 
>> eventually.  Did you end up filing a ticket for it?
> 
> Yep, this has been filed and is up for review (for weeks now ;): #5408.

Great, okay.

>>> Another solution is to use a different encoding in 
>>> `twisted.web.template´ - ASCII, with xmlcharrefreplace as the error 
>>> handler.  This is tempting since it avoids an obtrusive non-ASCII 
>>> support API (the way Nevow supports these is via `nevow.entities´, 
>>> which must be used rather than normal Python unicode objects).
>> 
>> I like this idea, because it's so hard to get wrong even if you have 
>> other problems (missing charset, buggy proxies, overly aggressive 
>> encoding detection, etc).  We can still say it's UTF-8 but it will work 
>> anywhere ASCII will work :).
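
A minimal sketch of the "ASCII, with xmlcharrefreplace" idea, using only the standard library and written for Python 3 (where the `str` of this thread corresponds to `bytes`). The sample text is an illustrative assumption, not taken from the traceback code. It shows that the result is pure ASCII and is therefore also valid UTF-8.

```python
# Illustrative input: two no-break spaces for indentation plus one accented letter.
text = "\u00a0\u00a0return caf\u00e9"

# Encoding roughly as the thread describes twisted.web.template doing: UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

# The proposed alternative: ASCII output, with anything non-ASCII replaced
# by a numeric character reference.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

print(ascii_bytes)  # b'&#160;&#160;return caf&#233;'

# ASCII is a subset of UTF-8, so a consumer that assumes UTF-8 still decodes it.
assert ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")
```

Note that this error handler only replaces characters the target encoding cannot represent; escaping `&`, `<` and `>` remains a separate step.
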
>>> Perhaps another question is whether the encoding used by 
>>> `twisted.web.template´ should be a parameter.  A related question 
>>> raised might be whether `twisted.web.template´ should encode to bytes 
>>> at all, or delegate the responsibility for that to code closer to a 
>>> socket.
>> 
>> Personal experience looking at profiles of applications which serialize 
>> a lot of XML suggests to me that encoding and decoding text in Python 
>> is a huge chunk of CPU work and memory footprint; keeping the encoding 
>> in t.w.t provides an opportunity for a potentially important 
>> optimization which might not be possible if it were done closer to the 
>> socket.
>> 
>> For example, if we're generating a long table that generates 10MB of 
>> HTML, if this is encoded incrementally (even foregoing any smarter 
>> optimizations, like caching the encoded form of strings) then there's a 
>> small working set of encoded data which can be collected as the 
>> template renders, and by the time the final string is emitted by 
>> cStringIO.getvalue() or what have you, you're using 20-ish megabytes of 
>> heap to store your UTF-8 bytes (10 in the StringIO and 10 in the str). 
>> If you build this as a unicode string instead, you'll end up using 
>> 50MB; 40MB for your unicode string, 10MB for the encoded bytes.  Part 
>> of this is just an implementation issue, but even if Python gets a 
>> smarter unicode representation, you still need more space, because you 
>> need to store the encoded and decoded representations concurrently.
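
The two strategies compared above can be sketched as follows, in Python 3 with only the standard library. `render_rows` is a hypothetical stand-in for a template renderer, not part of twisted.web.template. The 40MB figure in the paragraph corresponds to 4 bytes per character for 10 million characters; the sketch only demonstrates the shape of the two approaches, not their memory use.

```python
import io


def render_rows(n):
    """Hypothetical renderer: yields small text fragments one at a time."""
    for i in range(n):
        yield "<tr><td>\u00a0row {}</td></tr>\n".format(i)


def encode_incrementally(fragments, encoding="utf-8"):
    """Encode each fragment as it is produced; only one small decoded
    fragment is alive at any moment, alongside the growing byte buffer."""
    buf = io.BytesIO()
    for fragment in fragments:
        buf.write(fragment.encode(encoding))
    return buf.getvalue()


def encode_at_the_end(fragments, encoding="utf-8"):
    """Build the whole document as text first, then encode it; the full
    decoded and encoded forms exist at the same time."""
    whole = "".join(fragments)
    return whole.encode(encoding)


incremental = encode_incrementally(render_rows(1000))
at_end = encode_at_the_end(render_rows(1000))
assert incremental == at_end
```

Both functions produce identical bytes; the difference is only in how much decoded text has to be held while producing them.
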
> 
> This all seems to suppose the non-existence of the 
> twisted.web.template.flatten style interface.  Doesn't that give you 
> what's needed to do your incremental encoding outside of the flattener?

Hmmmmmm.  Okay, generating a couple of short encoded strings does leave one with a much shorter working set.  There should definitely be a lot more convenience functions in this area to just do the right thing in the various contexts one might want to flatten something (for which there are already a few tickets, such as <http://tm.tl/5395>).  As I recall you've spoken against the flatten() style interface because it makes error-handling somewhat more challenging, but if #5395 were fixed it could take care of those complexities internally.
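
A rough sketch of what "incremental encoding outside of the flattener" could look like, in Python 3 with only the standard library. It assumes a flatten()-style interface that hands short UTF-8 encoded chunks to a write callback; `make_ascii_writer` is a hypothetical helper, and no Twisted API is called here. An incremental decoder is used so that a multi-byte character split across two chunks is still handled.

```python
import codecs


def make_ascii_writer(write):
    """Wrap `write` so that UTF-8 chunks are re-encoded as ASCII with
    numeric character references before being passed along."""
    decoder = codecs.getincrementaldecoder("utf-8")()

    def ascii_write(chunk):
        # The decoder buffers an incomplete trailing byte sequence, if any.
        text = decoder.decode(chunk)
        if text:
            write(text.encode("ascii", "xmlcharrefreplace"))

    return ascii_write


received = []
writer = make_ascii_writer(received.append)

# Chunks as a flattener might emit them; the last two split one character.
for chunk in [b"<p>", "caf\u00e9".encode("utf-8"), b"\xc2", b"\xa0", b"</p>"]:
    writer(chunk)

result = b"".join(received)
print(result)  # b'<p>caf&#233;&#160;</p>'
```

Only one short chunk is decoded at a time, so the working set stays small regardless of the size of the whole document.
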

>> It might be a while until I get around to implementing something smart 
>> in this area, but I'd prefer we have an interface that makes such 
>> optimizations possible without breaking compatibility.
>>> As a work-around in `formatFailure´ I can decode the output of the 
>>> flattener using UTF-8 and then re-encode it to avoid non-ASCII, but it 
>>> seems like this should be solved in `twisted.web.template´ rather than 
>>> over and over again in application code.
>> 
>> If this does end up happening in formatFailure or anywhere else, please 
>> (whoever does it) make sure to file a ticket to fix it; this should 
>> never be more than a temporary workaround.
> 
> Okay.  #4896 is still up for review, and the branch implementing it does 
> use the decode/encode hack.  I'll file a ticket for fixing that if I 
> ever get to merge the branch (someone review it please).
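
For reference, the decode/encode hack described above amounts to something like the following sketch (Python 3, standard library only). `to_ascii_html` and the sample bytes are illustrative assumptions; this is not the code from the #4896 branch.

```python
def to_ascii_html(flattened_utf8):
    """Turn UTF-8 flattener output back into text, then re-encode it so
    that every non-ASCII character becomes a numeric character reference."""
    text = flattened_utf8.decode("utf-8")
    return text.encode("ascii", "xmlcharrefreplace")


# Illustrative flattener output: an indented traceback line.
sample = "\u00a0\u00a0File \"caf\u00e9.py\", line 1".encode("utf-8")
ascii_only = to_ascii_html(sample)
print(ascii_only)  # b'&#160;&#160;File "caf&#233;.py", line 1'
```

The whole output is decoded and re-encoded in one go, which is the extra work the thread would rather see handled inside `twisted.web.template`.
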

Why not just file the ticket now?  As you said before: "Heh.  Heh heh heh.  Heh."  It might be a while before sufficient review bandwidth becomes available.  (If history is any indicator, things will stall out between now and February, and March will be crazily active.)

-glyph



