[Twisted-Python] Python 3: bytes vs. str in twisted.python.filepath

Zooko Wilcox-OHearn zooko at leastauthority.com
Wed Sep 11 11:48:08 MDT 2013


Hello, Harry!

I just noticed this thread.

I opened a ticket for this a while back:

https://twistedmatrix.com/trac/ticket/5203
(ticket #5203: "FilePath.children() should return FilePath objects with
unicodes in them instead of strs")

There is some discussion on that ticket.

For what it is worth, I agree with Itamar that porting to Python3
shouldn't be combined with changing the functionality or API, but I
also agree with Harry (at least what Harry originally said) that
FilePath objects should not carry around a "path" that is just bytes
and doesn't specify what encoding those bytes are in.

I know this is a subtle topic, in the sense that I can see the
argument on the other side, too, and I don't think either approach can
satisfy all users, but I still think it is a better idea to require
unicode-only, and so I'd like to try to explain why a little bit,
below, in addition to the discussion that is recorded on #5203.

Here's my basic argument: a sequence of bytes without an accompanying
encoding is an *insufficiently typed* thing. That is, there is no way
to use it safely without first restoring a type, and it has to be the
*correct* type. The traditional way to handle pathnames in Linux has
been to let them be under-typed, and then restore the type
heuristically. This traditionally worked most of the time, because the
most common thing you would do with a sequence of bytes like that is
plug it back into the same filesystem from which it came. However, I
make two claims:

1. In the modern world, it is very common to send it over the network
instead of plugging it back into the same filesystem from which it
came, and

2. there's not very much need for this "forget what type it was, guess
the type later, and guess correctly" hack! We can instead *require*
the user to supply a type with the bytestring originally, and then
remember the type that the user supplied. This breaks only a few use
cases that are probably very rare, and in fact might be unfixable
anyway, but it prevents failures which are very common, which is what
happens when you guess the wrong type during the restore. This is what
we've done in Tahoe-LAFS, and we've had few or no complaints from
users about it. Certainly if there were any, it was in the early days
of Tahoe-LAFS, around 5 years ago, when ill-typed Linux filesystems
hadn't quite finished dying out (i.e. the bytes on there are actually
encoded in iso8859, but sys.getfilesystemencoding() returns 'utf-8').
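
To make the two claims above concrete, here is a minimal Python 3
sketch. It is only an illustration under stated assumptions: TypedPath
is a hypothetical name invented for this example, not anything in
Twisted or Tahoe-LAFS, and the filename is made up. It shows the two
ways a wrong guess can go (a loud failure and a silent one), and then
the alternative of requiring the encoding up front and remembering it.

```python
# Hypothetical illustration; TypedPath is not a Twisted or Tahoe-LAFS API.

# The same filename as two differently configured filesystems might store it.
raw_latin1 = "caf\u00e9.txt".encode("latin-1")  # b'caf\xe9.txt'
raw_utf8 = "caf\u00e9.txt".encode("utf-8")      # b'caf\xc3\xa9.txt'

# Wrong guess, loud failure: Latin-1 bytes are not valid UTF-8 here.
try:
    raw_latin1.decode("utf-8")
    guessed_ok = True
except UnicodeDecodeError:
    guessed_ok = False

# Wrong guess, silent failure: UTF-8 bytes always decode as Latin-1,
# but to the wrong characters.
mojibake = raw_utf8.decode("latin-1")


class TypedPath:
    """A path whose bytes arrive together with their encoding."""

    def __init__(self, raw_bytes, encoding):
        # Decode at the boundary, so a wrong encoding is reported where
        # the bytes enter the program rather than somewhere downstream.
        self.text = raw_bytes.decode(encoding)
        self.encoding = encoding

    def as_bytes(self):
        # Re-encode with the remembered encoding, not a guessed one.
        return self.text.encode(self.encoding)


path = TypedPath(raw_latin1, "latin-1")
```

With the encoding carried along, path.text is ordinary unicode that can
be sent over the network as-is, and path.as_bytes() gives back exactly
the bytes that came from the filesystem.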

We wrote unit tests and did careful code-review when we converted
Tahoe-LAFS from bytes to unicode-only a few years ago, and so I'd be
happy to share the knowledge I gleaned from that experience.

Regards,

Zooko


