[Twisted-web] nevow 0.3 + encoding
James Y Knight
foom at fuhm.net
Mon Sep 27 21:05:19 MDT 2004
On Sep 28, 2004, at 12:47 AM, Bartek Bargiel wrote:
> My first comment after having installed Nevow 0.3: my polish cp1250
> encoding chars are not displayed:
> - htmlfile replaces them all with question marks (it works OK when
> UTF-8 encoding)
> - xmlfile works fine when reading the text from file but again it
> fails to display national charset when it gets it returned from Python
> Maybe it's me doing something wrong somewhere, I'm getting more&more
> confused with that encoding stuff :)
Firstly, Nevow is and always will be designed to use only unicode
internally. Doing anything else at this point in time is complete
This has a few consequences:
1) you should always use unicode strings in your python code if they
have any non-core-ASCII characters in them.
e.g. u"새카만 커피 oh no~ 새하얀 우유 oh yes~"
Additionally, you have to make sure your source code file encoding is
set properly <http://www.python.org/peps/pep-0263.html> or else use
unicode escapes instead of the actual characters,
e.g. u"\uc0c8\uce74\ub9cc \ucee4\ud53c oh no~ \uc0c8\ud558\uc580
\uc6b0\uc720 oh yes~"
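To make point 1 concrete, here is a minimal sketch (plain Python, nothing Nevow-specific). It assumes the source file is saved as UTF-8 and declares that with a PEP 263 coding line; the variable names are invented for the example. The literal and the escaped spelling end up as the same unicode string:

```python
# -*- coding: utf-8 -*-
# The coding line above (PEP 263) tells Python how the bytes of this
# source file map to characters. It assumes the file is saved as UTF-8.

# Hypothetical names, for illustration only.
literal = u"새카만 커피 oh no~ 새하얀 우유 oh yes~"

# The same text spelled with escapes. This form is pure ASCII, so it
# does not depend on the encoding the source file is saved in.
escaped = (u"\uc0c8\uce74\ub9cc \ucee4\ud53c oh no~ "
           u"\uc0c8\ud558\uc580 \uc6b0\uc720 oh yes~")

# Both spellings produce the same unicode string.
same = (literal == escaped)
```

The escaped form is the safer choice when you cannot control how an editor saves the file.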
2) xmlfile and htmlfile must decode from the file's encoding to
unicode. However, htmlfile is completely broken in this regard: it does
not decode the file encoding at all. If the file happens to be in UTF-8
already, it will "work", but only because it returns byte strings,
which are not encoded upon output.
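A rough sketch of why undecoded bytes only "work" for UTF-8 files. This is plain Python rather than Nevow code, the sample text is made up, and it assumes the client interprets the response as UTF-8. That is one plausible path to the garbled output, not a trace of what htmlfile actually does:

```python
# Hypothetical sample text; any non-ASCII Polish text would do.
text = u"za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"

# The same text as it would sit on disk in two different encodings.
cp1250_bytes = text.encode("cp1250")
utf8_bytes = text.encode("utf-8")

# What a client sees if the raw file bytes are passed through untouched
# and then interpreted as UTF-8.
seen_from_utf8_file = utf8_bytes.decode("utf-8")
seen_from_cp1250_file = cp1250_bytes.decode("utf-8", "replace")

# The UTF-8 file survives by accident; the cp1250 file does not.
utf8_survives = (seen_from_utf8_file == text)
cp1250_survives = (seen_from_cp1250_file == text)
```

Here `utf8_survives` is true and `cp1250_survives` is false: the cp1250 bytes are not valid UTF-8, so the non-ASCII characters come out as replacement characters.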
This really ought to be fixed; people have lots of pre-existing files
in strange encodings, and UTF-8 editor support isn't quite all there
yet, either. htmlfile should do META content-type tag sniffing (like a
browser would), and also allow the developer to specify a default
encoding in the htmlfile constructor.
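The suggestion above could look roughly like this. It is only a sketch: `sniff_html_encoding` and `decode_html` are hypothetical helpers, not part of Nevow, and the regular expression is a crude stand-in for what a browser really does when it sniffs a META tag.

```python
import re

# Crude pattern: find "charset=..." inside a <meta ...> tag.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

def sniff_html_encoding(raw, default="utf-8"):
    """Return the charset named in a META tag, or the caller's default.

    Hypothetical helper. Only the first few KB are searched, on the
    assumption that the META tag sits near the top of the file.
    """
    match = _META_CHARSET.search(raw[:4096])
    if match:
        return match.group(1).decode("ascii")
    return default

def decode_html(raw, default="utf-8"):
    """Decode the raw file bytes to unicode using the sniffed encoding."""
    return raw.decode(sniff_html_encoding(raw, default))
```

A file with no META tag falls back to `default`, which plays the role of the constructor argument mentioned above. An encoding name that Python does not recognise would raise `LookupError` in `decode_html`; a real implementation would need to decide what to do in that case.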
Fortunately, xmlfile does work right: use a standard <?xml
version="1.0" encoding="cp1250"?> declaration at the top of the file
and it'll do the right thing.
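To see the XML declaration being honoured, here is a small sketch that uses the standard library's ElementTree parser as a stand-in. It is not xmlfile itself, and the sample text is made up. The point is only that a conforming XML parser reads the declaration and hands back unicode:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text containing Polish characters.
text = u"za\u017c\u00f3\u0142\u0107"

# Build a small document and store it as cp1250 bytes, the way it
# would sit on disk.
document = u'<?xml version="1.0" encoding="cp1250"?>\n<p>' + text + u'</p>'
raw = document.encode("cp1250")

# The parser reads the encoding from the declaration and decodes for us.
root = ET.fromstring(raw)
decoded = root.text
```

After parsing, `decoded` is a unicode string equal to `text`, with no manual decode step in the calling code.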
3) When writing the response to the client, Nevow must encode from
unicode into the proper response encoding. Currently there is no way
to specify any response encoding besides UTF-8. I do not believe this
needs to be (or even should be) fixed: any browser that cannot handle
UTF-8 encoding is utterly worthless, and I don't think there are any
browsers that worthless still in use. At least I hope there aren't.
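The output step in point 3 amounts to something like the following sketch. `encode_response` is a hypothetical helper, not a Nevow function, and the header values are simply what one would normally send for a UTF-8 HTML response:

```python
def encode_response(text):
    """Encode a unicode page to UTF-8 and describe it in the headers.

    Hypothetical helper, for illustration only.
    """
    body = text.encode("utf-8")
    headers = {
        # Tell the client which encoding the body bytes are in.
        "Content-Type": "text/html; charset=utf-8",
        # The length is counted in bytes, not characters.
        "Content-Length": str(len(body)),
    }
    return headers, body
```

Everything before this step stays unicode; bytes only appear at the very edge, when the response is written out.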