[Twisted-web] twisted.web.template output encoding
Glyph
glyph at twistedmatrix.com
Wed Jan 4 20:52:24 EST 2012
On Jan 3, 2012, at 10:54 AM, exarkun at twistedmatrix.com wrote:
> On 5 Dec 2011, 08:19 pm, glyph at twistedmatrix.com wrote:
>> Sorry it took me so long to get to this. Hopefully it's still relevant
>> ;).
>
> Heh. Heh heh heh. Heh.
So it goes ;-).
>> On Nov 26, 2011, at 11:52 AM, exarkun at twistedmatrix.com wrote:
>>> Apart from various issues relating to the lack of patterns in
>>> twisted.web.template,
>>
>> I had some trepidation about marking
>> <http://twistedmatrix.com/trac/ticket/5040> as "closed" :). What kind
>> of issues came up with patterns? Anything you feel needs fixing?
>
> The approach facilitated by #5040 seems to result in much more
> boilerplate than the approach facilitated by Nevow's patterns. The code
> for #4896 has many, many Elements. An implementation using Nevow
> probably would have had far fewer, perhaps only one.
>
> Which of these is better, I don't know. I certainly got bored very
> early on in the #4896 work, though.
Well, if the approach on #5040 is way more verbose, what does it have in its favor? Simplicity? I must imagine that we can get both somehow.
>>> the main difficulty is in handling non-ascii contents in the
>>> traceback. Apart from any unicode that may show up in the source code
>>> being rendered (or, perhaps, eventually, the values of variables to be
>>> rendered - though for now I do not plan to implement this) the no-
>>> break space characters which are necessary to get traceback lines
>>> indented properly mean that there is always some non-ascii to include
>>> in the output.
>>
>> Looking at the actual output now, these characters strike me as
>> an accident of how browsers collapse different types of whitespace.
>> They could be replaced with a <span style="width: 4em;" /> to avoid
>> this problem for now, which is probably more expressive.
>
> If I understood Jonathan's reply properly, it sounds like the
> hack is the best we've got.
I don't _want_ to read Jonathan's reply thoroughly enough to understand it, so I'll have to take your word for it.
>>> twisted.web.template encodes its output using UTF-8, and this is not
>>> customizable. Thus, using twisted.web.template, formatFailure's
>>> result will be a str containing UTF-8 encoded text. Previously the
>>> result was a str containing only ASCII encoded text, with no-break
>>> space represented as `&nbsp;´. Consequently, callers of
>>> `formatFailure´ will probably mishandle the result - the caller in
>>> `twisted.web.server´ does, at least, including the bytes in a page
>>> with a content type of "text/html".
>>>
>>> The solutions that come to mind are all about removing this
>>> incompatible change and making it so `formatFailure´ can continue to
>>> return a str with ASCII-encoded text.
>>>
>>> One solution is to add support for named entities or numeric character
>>> references to twisted.web.template. Very likely this is a good idea
>>> regardless (Nevow supported these).
>>
>> I think that this is probably a necessary feature regardless,
>> eventually. Did you end up filing a ticket for it?
>
> Yep, this has been filed and is up for review (for weeks now ;): #5408.
Great, okay.
>>> Another solution is to use a different encoding in
>>> `twisted.web.template´ - ASCII, with xmlcharrefreplace as the error
>>> handler. This is tempting since it avoids an obtrusive non-ASCII
>>> support API (the way Nevow supports these is via `nevow.entities´,
>>> which must be used rather than normal Python unicode objects).
>>
>> I like this idea, because it's so hard to get wrong even if you have
>> other problems (missing charset, buggy proxies, overly aggressive
>> encoding detection, etc). We can still say it's UTF-8 but it will work
>> anywhere ASCII will work :).
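To make the ASCII-plus-xmlcharrefreplace idea concrete, here is a minimal sketch that uses only Python's built-in codec error handler. The sample line is made up for illustration, and nothing here is twisted.web.template's actual API.

```python
# Hypothetical traceback line, indented with no-break spaces (U+00A0).
line = u"\u00a0\u00a0raise ValueError(u'caf\u00e9')"

# Encode to ASCII; any character outside ASCII becomes a numeric
# character reference instead of raising UnicodeEncodeError.
encoded = line.encode("ascii", "xmlcharrefreplace")

# The result is pure ASCII, so it decodes identically as UTF-8 or
# Latin-1, which is why a missing or wrong charset cannot garble it.
assert encoded.decode("utf-8") == encoded.decode("latin-1")
print(encoded)
```

Since every byte is below 0x80, a page labelled UTF-8 stays correct even when something downstream guesses a different ASCII-compatible encoding.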
>>> Perhaps another question is whether the encoding used by
>>> `twisted.web.template´ should be a parameter. A related question
>>> raised might be whether `twisted.web.template´ should encode to bytes
>>> at all, or delegate the responsibility for that to code closer to a
>>> socket.
>>
>> Personal experience looking at profiles of applications which serialize
>> a lot of XML suggests to me that encoding and decoding text in Python
>> is a huge chunk of CPU work and memory footprint; keeping the encoding
>> in t.w.t provides an opportunity for a potentially important
>> optimization which might not be possible if it were done closer to the
>> socket.
>>
>> For example, if we're generating a long table that generates 10MB of
>> HTML, if this is encoded incrementally (even foregoing any smarter
>> optimizations, like caching the encoded form of strings) then there's a
>> small working set of encoded data which can be collected as the
>> template renders, and by the time the final string is emitted by
>> cStringIO.getvalue() or what have you, you're using 20-ish megabytes of
>> heap to store your UTF-8 bytes (10 in the StringIO and 10 in the str).
>> If you build this as a unicode string instead, you'll end up using
>> 50MB; 40MB for your unicode string, 10MB for the encoded bytes. Part
>> of this is just an implementation issue, but even if Python gets a
>> smarter unicode representation, you still need more space, because you
>> need to store the encoded and decoded representations concurrently.
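The figures above can be reproduced with some back-of-the-envelope arithmetic. This sketch assumes the markup is essentially ASCII (one byte per code point in UTF-8) and that the interpreter stores unicode at four bytes per code point, as a wide build would. Both are assumptions for illustration, not measurements.

```python
MB = 2 ** 20
code_points = 10 * MB            # 10 MB of mostly-ASCII HTML

utf8_size = code_points * 1      # assumption: ~1 byte per code point in UTF-8
wide_size = code_points * 4      # assumption: 4 bytes per code point in memory

# Incremental encoding: the buffer's contents plus the final str copy.
incremental_peak = 2 * utf8_size

# Build-then-encode: the whole unicode string plus its encoded form.
unicode_peak = wide_size + utf8_size

print(incremental_peak // MB, unicode_peak // MB)
```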
>
> This all seems to suppose the non-existence of the
> twisted.web.template.flatten
> style interface. Doesn't that give you what's needed to do your
> incremental encoding outside of the flattener?
Hmmmmmm. Okay, generating a couple of short encoded strings does leave one with a much shorter working set. There should definitely be a lot more convenience functions in this area to just do the right thing in the various contexts one might want to flatten something (for which there are already a few tickets, such as <http://tm.tl/5395>). As I recall you've spoken against the flatten() style interface because it makes error-handling somewhat more challenging, but if #5395 were fixed it could take care of those complexities internally.
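As a rough illustration of why a write-callback interface keeps the working set small, here is a sketch in plain Python. `flatten_chunks` and `write_encoded` are hypothetical stand-ins, not the real twisted.web.template.flatten; the point is only that each short string is encoded and handed off before the next one is produced.

```python
import io


def flatten_chunks(rows):
    """Hypothetical stand-in for a flattener: yields short unicode strings.

    The row values are assumed to be already HTML-escaped.
    """
    yield u"<table>"
    for row in rows:
        yield u"<tr><td>%s</td></tr>" % (row,)
    yield u"</table>"


def write_encoded(chunks, write, encoding="utf-8"):
    """Encode each chunk as it is produced and pass the bytes to write().

    Returns the size of the largest encoded chunk, i.e. the most encoded
    data this function ever holds at one time.
    """
    largest = 0
    for chunk in chunks:
        data = chunk.encode(encoding)
        largest = max(largest, len(data))
        write(data)
    return largest


sink = io.BytesIO()  # stands in for a transport or request
largest = write_encoded(flatten_chunks([u"caf\u00e9", u"\u00a0x"]), sink.write)
print(largest, len(sink.getvalue()))
```

Here the sink accumulates everything, but with a real transport the bytes could leave the process as they are written, so only one short encoded chunk needs to be alive at a time.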
>> It might be a while until I get around to implementing something smart
>> in this area, but I'd prefer we have an interface that makes such
>> optimizations possible without breaking compatibility.
>>> As a work-around in `formatFailure´ I can decode the output of the
>>> flattener using UTF-8 and then re-encode it to avoid non-ASCII, but it
>>> seems like this should be solved in `twisted.web.template´ rather than
>>> over and over again in application code.
>>
>> If this does end up happening in formatFailure or anywhere else, please
>> (whoever does it) make sure to file a ticket to fix it; this should
>> never be more than a temporary workaround.
>
> Okay. #4896 is still up for review, and the branch implementing it does
> use the decode/encode hack. I'll file a ticket for fixing that if I
> ever get to merge the branch (someone review it please).
Why not just file the ticket now? As you said before: "Heh. Heh heh heh. Heh." It might be a while before sufficient review bandwidth becomes available. (If history is any indicator, things will stall out between now and February, and March will be crazily active.)
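For reference, the decode/encode hack being discussed amounts to something like the following. This is a sketch, not the code from the #4896 branch; `ascii_only` is a hypothetical helper name and the sample bytes are made up.

```python
def ascii_only(utf8_bytes):
    """Re-encode UTF-8 markup as ASCII with numeric character references."""
    return utf8_bytes.decode("utf-8").encode("ascii", "xmlcharrefreplace")


# Hypothetical flattener output: UTF-8 bytes containing no-break spaces.
flattened = u"<span>\u00a0\u00a0File \"x.py\"</span>".encode("utf-8")
result = ascii_only(flattened)
print(result)
```

It works, but it costs an extra full decode and encode of output that was only just encoded, which is part of why it should stay a temporary workaround.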
-glyph