[Twisted-web] Matt's "forms" - validation error when field name is not ASCII
Jean-Paul Calderone
exarkun at divmod.com
Sat Dec 10 12:23:36 MST 2005
On Sat, 10 Dec 2005 19:48:02 +0100, Valentino Volonghi aka Dialtone <dialtone at divmod.com> wrote:
>On Sat, Dec 10, 2005 at 06:43:28PM +0100, Paul Reznicek wrote:
>> Yes, this is a way too, but than you must take care in each
>> ".addField(" line, my patch save those sorrows ...
>
>Actually that is the only way, not one way. All the other ways, if they work,
>they do by chance and are very fragile.
>
>For example your 'solution' is simply broken by using a different encoding
>that cannot be decoded with the default one used by unicode().
Here are some concrete examples:
exarkun at boson:~$ python
Python 2.4.2 (#2, Sep 30 2005, 21:19:01)
[GCC 4.0.2 20050808 (prerelease) (Ubuntu 4.0.1-4ubuntu8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('é')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>>
Oops, I typed an accented e, which ended up in the code as UTF-8, since
that's how my terminal is configured. But Python has no way of knowing
this, so it has to use the system-specified default encoding to decode
the bytes my terminal sent to it. Since that's ASCII (the only sane
system encoding, unfortunately), not UTF-8, the decode explodes on the
first non-ASCII byte, 0xc3.
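The same failure can be reproduced without depending on how a terminal is
configured, by spelling the bytes out. This is only a minimal sketch, assuming
Python 2.6 or later (it also runs on Python 3); the variable names are
illustrative and not part of forms or Nevow.

```python
# U+00E9 (e with acute accent) as a unicode string, written with an escape
# so the example does not depend on the encoding of this file.
e_acute = u'\xe9'

# What a UTF-8 terminal sends for that character: two bytes, 0xc3 0xa9.
utf8_bytes = e_acute.encode('utf-8')

# Decoding with the wrong codec fails on the first non-ASCII byte.
ascii_error = None
try:
    utf8_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding with the codec that actually produced the bytes works.
decoded = utf8_bytes.decode('utf-8')
```

After this runs, `ascii_error` holds the UnicodeDecodeError and `decoded`
equals the original unicode string.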
Well, fine. How about I change the system encoding to UTF-8? Then things
will work, right?
>>> [magic censored - the system encoding is utf-8 now, trust me]
>>> unicode('é')
u'\xe9'
Hey cool, that worked. Let's try another character.
>>> unicode('弱')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 0: unexpected code byte
>>>
Ahh crud. This time I typed in some Big-5 encoded bytes to represent a
Han character. The system encoding is still UTF-8 though, and those
bytes are not valid UTF-8, so again it is broken.
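To see that no single default can be right for both inputs, here is a rough
sketch that builds the Big-5 bytes explicitly and tries them against both
codecs. It assumes Python 2.6 or later and that the Han character above is
U+5F31; the variable names are illustrative only.

```python
# The Han character from the session above, written as an escape.
han = u'\u5f31'

# The bytes a Big-5 terminal or library would produce for it.
big5_bytes = han.encode('big5')

# A UTF-8 default cannot make sense of them.
utf8_error = None
try:
    big5_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    utf8_error = exc

# Only the codec that matches the bytes recovers the character.
round_tripped = big5_bytes.decode('big5')
```

Whatever default is chosen, one of the two inputs fails, which is why the
encoding has to travel with the bytes.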
Now, you might think it is unreasonable to be typing things in UTF-8 to
start with and then randomly switch over to Big-5 (actually, doing
anything with Big-5 is crazy, but I digress). But just imagine you are
trying to use two different libraries in your program: one developed in
Sweden and one developed in China. Nevow has to work with both of
these *at the same time*, so it cannot assume an encoding anywhere. It
cannot even assume the system default encoding (you might be trying to
run the program in the United States!). The encoding therefore has to
be provided at the location of the actual bytes themselves. The way you
do this in Python is by using unicode strings (u'...') rather than byte
strings ('...'), and by declaring the encoding of the source file the
literals appear in.
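As a sketch of what that looks like in practice: the two "library" functions
below are hypothetical stand-ins invented for this example, not real APIs.
Each returns bytes in its own encoding, and the caller decodes at the boundary
where the encoding is known. It assumes Python 2.6 or later and a source file
saved as UTF-8.

```python
# -*- coding: utf-8 -*-
# On Python 2 the declaration above (which must sit on the first or second
# line of the file) says how this file's bytes are encoded. Python 3
# assumes UTF-8 by default.

def swedish_library_label():
    # Hypothetical library that hands back Latin-1 encoded bytes.
    return u'sm\xf6rg\xe5s'.encode('latin-1')

def chinese_library_label():
    # Hypothetical library that hands back Big-5 encoded bytes.
    return u'\u5f31'.encode('big5')

# Decode each value where its encoding is actually known, and keep only
# unicode strings from then on.
labels = [
    swedish_library_label().decode('latin-1'),
    chinese_library_label().decode('big5'),
    u'é',  # unicode literal, interpreted using the source file encoding
]
```

Nothing here consults the system default encoding, so the result is the same
wherever the program runs.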
Hope this helps clear things up,
Jean-Paul