[Twisted-Python] Unicode

Tue Oct 4 04:37:57 EDT 2005

On Mon, 03 Oct 2005 18:19:44 -0600, Ken Kinder <ken at kenkinder.com> wrote:

>The purpose of Python's unicode type is transparent exchange of string
>objects, whether those string objects are of type str or type unicode.
>Pretending that isn't so and raising a TypeError is not helpful. I would
>urge you to AT LEAST provide a detailed explanation in that error,
>explaining the philosophical disagreement you have with Python's
>unicode-string conversion behavior and have a flag you can set to
>disable that check.

From http://docs.python.org/api/stringObjects.html:

    "Only string objects are supported; no Unicode objects should be passed."

So there is a precedent for this in the very APIs you are citing :).

You seem to have misunderstood the intent of Python's unicode support.  Python allows byte strings to be treated in the same way as character strings in the areas where such a transposition is useful and semantically valid; in some cases it (uncharacteristically) guesses based on the default encoding.  I say "uncharacteristically" because Python refuses the temptation to guess when presented with, say, an array object containing bytes, integers, or a list of smaller strings.  Automatic conversion is not the norm in Python.

I see others have already referred you to the FAQ.  Please read the articles attached to it.

As long as I'm writing a list post about this though, let me include another example which may explain why accepting unicode objects in .write() is an absolutely horrible idea.  There are basically 2 modes that .write() could use to accept a unicode object: one where it would cause random exceptions at runtime based on input, and one where it would generate corrupt data on the network.

Let's say I've got a very simple protocol that writes 2 bytes indicating the length of a string, then a string, like so:

 import struct

 def writeChunk(self, x):
     # 2-byte big-endian length prefix, followed by the string itself
     self.transport.write(struct.pack("!H", len(x)))
     self.transport.write(x)

If 'x' were a unicode object in this case, we could do one of 2 things:

 A - Write it to the transport as UTF-8/UTF-16 (an encoding that can accept any unicode data)
 B - Write it to the transport using ascii/charmap (the default encoding, or an encoding that will only produce single-byte characters)

Given option A with UTF-8, this code will appear to work until it is passed a unicode string with a code point > '\u007f' (with UTF-16 the prefix is wrong for every string, since each character takes at least 2 bytes and a byte-order mark is prepended).  At that point, the 'length' prefix will be incorrect: len() works in terms of code points and not bytes, so a phrase like u'Shoot me with a \u2022' will be truncated by the receiving end, possibly into a string which can't even be decoded:

>>> len(u'Shoot me with a \u2022')
17
>>> len(u'Shoot me with a \u2022'.encode('utf8'))
19
>>> len(u'Shoot me with a \u2022'.encode('utf16'))
36
>>> u'Shoot me with a \u2022'.encode('utf16')[:17].decode('utf16')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x65 in position 16: truncated data
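
The same failure can be seen from the receiving side with UTF-8.  What follows is only a sketch: frame_option_a and read_chunk are hypothetical helpers made up for this illustration (they are not Twisted APIs), and the receiver is assumed to read exactly as many bytes as the prefix announces.

```python
import struct

def frame_option_a(text):
    # Hypothetical "option A": the prefix counts code points,
    # but the payload that follows is UTF-8 encoded bytes.
    return struct.pack("!H", len(text)) + text.encode("utf-8")

def read_chunk(data):
    # Receiver: read the 2-byte length, then take that many bytes.
    (length,) = struct.unpack("!H", data[:2])
    return data[2:2 + length]

ascii_wire = frame_option_a(u"Shoot me")
bullet_wire = frame_option_a(u"Shoot me with a \u2022")

# Pure-ASCII text happens to round-trip, so a simple test passes.
ascii_text = read_chunk(ascii_wire).decode("utf-8")

# The bullet takes 3 bytes in UTF-8, so 17 code points become 19 bytes
# and the receiver stops in the middle of the last character.
truncated = read_chunk(bullet_wire)
try:
    truncated.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```

The ASCII-only chunk comes back intact, while the chunk with the bullet arrives 2 bytes short and cannot be decoded at all.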

Using option B, we won't produce any invalid data on the network, but we will have to raise exceptions when presented with any *actual* unicode data (as opposed to just ASCII stuck into a unicode-type object):

>>> u'Shoot me with a \u2022'.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 16: ordinal not in range(128)

In either case, simple tests where your code is passed English ASCII "unicode" strings will pass, but any actual exercise of unicode for the purpose it was designed (i.e. creating a clear distinction between transport encoding and character set) will fail horribly and possibly inexplicably.
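
For contrast, here is a minimal sketch of what happens when the caller makes the transport encoding explicit before the length is computed.  The names frame_bytes and read_chunk are hypothetical, the choice of UTF-8 is an assumption made only for the example, and the TypeError here merely imitates the kind of check under discussion rather than reproducing Twisted's actual code.

```python
import struct

def frame_bytes(payload):
    # Refuse text outright instead of guessing an encoding for it.
    if not isinstance(payload, bytes):
        raise TypeError("payload must be a byte string, not %s"
                        % type(payload).__name__)
    # len() now counts bytes, which is what the receiver will read.
    return struct.pack("!H", len(payload)) + payload

def read_chunk(data):
    # Receiver: read the 2-byte length, then take that many bytes.
    (length,) = struct.unpack("!H", data[:2])
    return data[2:2 + length]

text = u"Shoot me with a \u2022"
payload = text.encode("utf-8")   # the caller chooses the encoding
wire = frame_bytes(payload)
received = read_chunk(wire).decode("utf-8")

# Passing the unicode object itself is rejected immediately.
try:
    frame_bytes(text)
    rejected = False
except TypeError:
    rejected = True
```

Because the length is taken after encoding, the prefix and the payload agree for any input, and the failure for un-encoded text happens at the call site rather than on the wire.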

I hope that now you can see why "a flag you can set to disable that check" could not possibly help anyone, and the code will remain as it is.