[Twisted-web] Matt's "forms" - validation error when field name is not ASCII

Sat Dec 10 12:23:36 MST 2005

On Sat, 10 Dec 2005 19:48:02 +0100, Valentino Volonghi aka Dialtone <dialtone at divmod.com> wrote:
>On Sat, Dec 10, 2005 at 06:43:28PM +0100, Paul Reznicek wrote:
>> Yes, this is a way too, but then you must take care in each
>> ".addField(" line; my patch saves those sorrows ...
>
>Actually that is the only way, not one way. Any other way that works does
>so by chance and is very fragile.
>
>For example, your 'solution' breaks as soon as the input is in an encoding
>that the default codec used by unicode() cannot decode.

Here are some concrete examples:

exarkun at boson:~$ python
Python 2.4.2 (#2, Sep 30 2005, 21:19:01)
[GCC 4.0.2 20050808 (prerelease) (Ubuntu 4.0.1-4ubuntu8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('é')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>>

Oops, I typed an accented e, which ended up in the code as UTF-8, since 
that's how my terminal is configured.  But Python has no way of knowing 
this, so it has to use the system-specified default encoding to decode 
the bytes my terminal sent to it.  Since that's ASCII (the only sane 
system encoding, unfortunately), not UTF-8, the decode explodes.

Well fine.  How about I change the system encoding to UTF-8, then things 
will work, right?

>>> [magic censored - the system encoding is utf-8 now, trust me]
>>> unicode('é')
u'\xe9'

Hey cool, that worked.  Let's try another character.

>>> unicode('弱')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 0: unexpected code byte
>>>

Ahh crud.  This time I typed in some Big-5 encoded bytes to represent a 
Han character.  The system encoding is still UTF-8 though, so the decode 
fails again: 0xae is not a valid first byte in UTF-8.
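
For anyone reading this on a newer Python, the three outcomes above can 
be reproduced without touching the system default encoding, by passing 
an explicit codec to decode().  This is only a sketch: it is written for 
Python 3 (where byte strings are spelled b'...'), not the 2.4 session 
shown above, and the names in it are made up for illustration.

```python
# Sketch for Python 3; the interpreter session above is Python 2.4.
# All names here are illustrative, not part of Nevow or forms.

# e-acute (U+00E9) as a UTF-8 terminal would send it: b'\xc3\xa9'
e_acute_utf8 = "\u00e9".encode("utf-8")
# The Han character U+5F31 as Big-5 encoded bytes
han_big5 = "\u5f31".encode("big5")


def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# UTF-8 bytes are not ASCII, so guessing ASCII fails.
ascii_guess = try_decode(e_acute_utf8, "ascii")
# Guessing UTF-8 happens to be right for these bytes.
utf8_guess = try_decode(e_acute_utf8, "utf-8")
# The same guess is wrong for the Big-5 bytes.
big5_as_utf8 = try_decode(han_big5, "utf-8")
# Only the encoding the bytes were actually written in decodes them.
big5_as_big5 = try_decode(han_big5, "big5")
```

Whichever single default you pick, one of the two inputs fails; only the 
caller who knows where the bytes came from can name the right codec.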

Now, you might think it is unreasonable to type things in UTF-8 to 
start with and then randomly switch over to Big-5 (actually, doing 
anything with Big-5 is crazy, but I digress).  But just imagine you are 
trying to use two different libraries in your program: one developed in 
Sweden and one developed in China.  Nevow has to work with both of 
these *at the same time*.  So it cannot assume an encoding anywhere, 
not even the system default encoding (you might be trying to run the 
program in the United States!).  The encoding has to be provided at the 
location of the actual bytes themselves.  The way you do this in Python 
is by using unicode strings (u'...'), not byte strings ('...'), and 
declaring the encoding of the source file the literals appear in.
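
A rough sketch of the two-library situation, under some loud 
assumptions: the two "library" functions are hypothetical stand-ins I 
invented for illustration, the Swedish one is assumed to return Latin-1 
bytes and the Chinese one Big-5 bytes, and the code targets Python 3, 
which accepts the u'...' spelling and treats it the same as '...'.

```python
# -*- coding: utf-8 -*-
# Sketch only. The two "library" functions below are hypothetical;
# nothing here is Nevow's or forms' actual API.


def swedish_library_bytes():
    # Hypothetical library that hands back Latin-1 encoded bytes.
    return b"Malm\xf6"


def chinese_library_bytes():
    # Hypothetical library that hands back Big-5 encoded bytes.
    return u"\u5f31".encode("big5")


def to_text(raw, encoding):
    # Decode with the encoding stated by whoever produced the bytes;
    # never fall back on a process-wide default.
    return raw.decode(encoding)


# A unicode literal: its meaning is fixed by the encoding of this
# source file, not by the machine the program happens to run on.
label = u"é"

swedish = to_text(swedish_library_bytes(), "latin-1")
chinese = to_text(chinese_library_bytes(), "big5")

# Once everything is unicode, the pieces combine safely.
combined = u" / ".join([label, swedish, chinese])
```

The point of the design is that each decode happens exactly once, at the 
place where the encoding is actually known, and everything downstream 
only ever sees unicode.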

Hope this helps clear things up,

Jean-Paul