[Twisted-Python] Python 3: bytes vs. str in twisted.python.filepath

Sun Jul 14 19:57:27 MDT 2013

First off, hi Harry!  I am super glad that someone has taken an interest in this.  Please let me know if I can be helpful in your effort to fix this.  FilePath totally has the right sort of shape to handle all these problems very gracefully, but its current implementation is (as you have noticed!) a disaster.  Regardless of Python 2/3 issues, it doesn't handle text/bytes correctly even on Python 2.

Also, sorry for being a bit late to the party, been on vacation for a week :-).

On Jul 14, 2013, at 7:18 AM, Harry Bock <bock.harryw at gmail.com> wrote:

> On Sun, Jul 14, 2013 at 8:16 AM, Itamar Turner-Trauring <itamar at itamarst.org> wrote:
> On 07/13/2013 10:00 PM, Harry Bock wrote:
> Hi all,
> 
> My name is Harry Bock.  I'm interested in helping out porting Twisted to Python 3, and I've popped in IRC a few times to introduce myself and ask a few questions. A few developers agreed that working on trial dependencies would be a big help.
> 
> In doing some porting work on trial, I stumbled upon a previous porting effort (possibly by Itamar?) for twisted.python.filepath and related modules.  It seemed like the porting effort included forcing all pathname inputs to be byte strings instead of native strings.
> 
> You imply that this was a change, somehow, but it wasn't. The API was *always* bytes and it continues to be bytes on Python 3.
> 
> Ah, I understand now.  Since the native string type was used in Python 2, it follows that in Python 3 the API should be bytes.

It doesn't really make sense to talk about "native strings" unless you're talking about Python code objects; __doc__ and func_name are "native strings"; the inputs to FilePath are bytes, pure and simple.  This is mostly just because FilePath was designed way back when I only really knew about the way path names worked on Linux.

One of the several design errors in Python 3's allegedly superior unicode support was calling the text type "str".  This was a confusing name in the first place, and is now ambiguous, confusing, and arguably wrong all at once; at the cost of one additional letter, it could have been "text", which is both a whole word and a more accurate description of what it does.  I generally use "text" rather than "string" to describe the text type anyway, because it's a lot less ambiguous and requires less backtracking ("oh I was talking about python 2 there, let me rephrase").

> It's a common Python 3 porting mistake to change everything from bytes to unicode just because. E.g. Python standard library does this in many places for no good reason, resulting in bugs that are still being fixed (http://bugs.python.org/issue12411) or APIs that are less useful (zipfile docs explicitly state that there is no standard encoding in zip files, but Python 3 zipfile module only supports one specific encoding because they switched to Unicode and didn't bother reading the module's own docs). Our goal in porting was backwards compatibility with Python 2 code, so porters don't have to change everything, and correctness. And, in this particular case, to get something working in the minimal amount of time - *adding* Unicode support is useful and should be done.
> 
> 
> After some investigation, I believe this is the wrong approach, but I wanted to start a discussion here first.  Some thoughts:
> 
> (a) As of Python 3.3, use of the ANSI API in Windows is deprecated[1], so many functions in os and os.path raise DeprecationWarning when given byte strings as input.  Although win32 is not an initial target of the porting effort, we'll have to support it eventually and the API should be supported before then.
> 
> (b) Misunderstandings at the application level about the underlying filesystem's path encoding are not the problem of the Twisted API.  Correct me if I'm wrong, but it's the responsibility of the system administrator or individual user (at least on UNIX) to set the LANG environment variable, or of the application to call setlocale(3) to explicitly override it.
> Given operating systems that don't really know about encodings on the filesystem level, forcing everything to be unicode doesn't make sense. I'm pretty sure you can end up with files in multiple different Unicode encodings on same filesystem on Linux, for example.
> 
> This is very true and I didn't consider it in my initial investigation.  While I think it would be uncommon to have files in multiple encodings on the same filesystem, it certainly would not be rare - to Tristan's point, copying names from filesystem to filesystem could easily result in multiple encodings.  The operating system may not need to understand the encodings, but applications do in order to display them correctly.  Which leads to your last point...

This is not really true.  This is how Linux and BSD handle file names; it is not how OS X handles file names.  (Nor is it how Windows works, as you've mentioned above.)

On OS X, file names are normalized (I forget the normalization at the moment, but you can look it up) UTF-8.  They _must_ be normalized UTF-8; it doesn't matter what $LANG is.  If you try to deal with filenames that are invalid UTF-8 byte sequences, the OS will URL-encode portions of the filename for you and _force_ its name (as returned by listdir() at least) to be a valid UTF-8 sequence.  If you give it something non-normalized, it will normalize it for you.
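
To make the normalization point concrete, here is a minimal sketch using the standard unicodedata module.  It only shows that one visible name can be spelled as two different byte sequences until it is normalized; the helper name normalized_utf8 and the choice of NFD as the default form are assumptions made for illustration, not a statement of exactly what OS X does.

```python
import unicodedata

# The same visible name, spelled two different ways.
composed = "caf\u00e9"      # 'e with acute' as a single code point
decomposed = "cafe\u0301"   # plain 'e' followed by a combining acute accent

# Without normalization these are different text and different bytes.
assert composed != decomposed
assert composed.encode("utf-8") != decomposed.encode("utf-8")


def normalized_utf8(name, form="NFD"):
    """Hypothetical helper: normalize text, then encode it as UTF-8."""
    return unicodedata.normalize(form, name).encode("utf-8")


# After normalization both spellings collapse to one byte sequence.
assert normalized_utf8(composed) == normalized_utf8(decomposed)
```

On a filesystem that normalizes for you, the name you get back from listdir() may therefore differ byte-for-byte from the name you passed in when creating the file.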

> Thus, my vote is that on Python 2.x, Twisted should accept either the native str or unicode types for path names, and on Python 3.x, only accept the str type to prevent deprecation issues with system calls.  I have a patch set that will make this happen including unittest modifications; if there's a consensus I'm happy to open a ticket and submit the patches.
> 
> The ideal situation would be to support bytes and Unicode on Python 2 *and* Python 3, for maximum compatibility. Even if deprecated on Windows, filesystem operations on Python 3 still do accept bytes (and they're not deprecated elsewhere). Given existing code that already takes bytes, switching to only doing Unicode on Python 3 would not be backwards compatible, so we can't really do that without a bunch of deprecation warnings and a few releases. Instead we should just do what Python does: if you start with bytes path you always get back bytes, if you start with Unicode path you always get back Unicode.
> 
> Yes, you're right, that's probably the best solution.  It would not be terribly hard to do so - then application developers can choose whether to defer to the local user's interpretation of the setting, or explicitly use byte paths.  Thanks so much for your input!
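
For reference, the "what Python does" behaviour mentioned above can be seen with os.listdir on Python 3.  This is a small sketch, not a proposal for FilePath's implementation; the helper list_names and the file name example.txt are made up for the example.

```python
import os
import tempfile


def list_names(directory):
    """Return the sorted directory listing, in the same type as the input."""
    return sorted(os.listdir(directory))


# Create a scratch directory containing a single empty file.
tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "example.txt"), "w").close()

# Text path in, text names out.
text_names = list_names(tmp)

# Bytes path in, bytes names out.
bytes_names = list_names(os.fsencode(tmp))
```

The type of the argument decides the type of the result; nothing is guessed from the contents of the directory.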

The design should not be as naive as "support bytes" or "support unicode", or even "support both".  In order to deal with some of these nastier edge-cases, you need methods that can give you each of: a "human readable" name to display to a user, a weird-Python-broken-surrogates-trick unicode object, and some bytes.  Then there are possibly some extra methods that could be added which are only sometimes available, like "driveLetter()" or somesuch.  (Maybe we could do better and have some kind of general mount-point object, but I digress.)

In other words, we need to give the developer an expressive enough API to clearly indicate their intent, and then have clear enough API documentation for them to figure out what their intent is :).

At the implementation level, these potential methods are both platform-specific and subtly distinct.  For example, the "human readable name" implementation of a broken FilePath should include replacement characters rather than broken-surrogate hacks.  Replacement characters have a defined method for displaying them; since broken surrogates are just invalid garbage, some software might elect not to display the string at all, or throw an error.

It might also be sensible (as a future enhancement; this is not something we should try to do as a basic part of proper unicode support) to do some encoding-guessing and mojibake detection when trying to compute the human-readable name.  This name is just for display, so it makes sense to work as hard as possible to display something sensible; it does NOT need to be able to be fed back in to FilePath.

But of course on OS X, the thing to do would just be to convert to the percent-escaped version, since that's what the platform presents.  And on Windows, it might be sensible for the thing that gives you bytes to give you a faithful UTF-8 version of the filename rather than some platform-dependent ANSI junk, since as far as I can tell there's no need to ever get a byte sequence you could pass back to some other ANSI API.  If there were, that could be an explicitly separate API.
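
As a rough illustration of the difference between the three representations on a UNIX-ish platform, here is a Python 3 sketch for a name that is not valid UTF-8.  The function names as_text, as_bytes and display_name are hypothetical, not existing or proposed FilePath methods, and UTF-8 is assumed as the filesystem encoding purely for the example.

```python
# A file name in Latin-1; the byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"


def as_text(name_bytes):
    """Broken-surrogates trick: undecodable bytes become lone surrogates."""
    return name_bytes.decode("utf-8", "surrogateescape")


def as_bytes(name_text):
    """Inverse of as_text: lone surrogates turn back into the original bytes."""
    return name_text.encode("utf-8", "surrogateescape")


def display_name(name_bytes):
    """For display only: undecodable bytes become U+FFFD; not reversible."""
    return name_bytes.decode("utf-8", "replace")


text = as_text(raw)

# The surrogate form round-trips, so it can be handed back to the OS...
assert as_bytes(text) == raw

# ...but it is not valid text, so a strict encoder refuses it.
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

shown = display_name(raw)
```

The display form is safe to print or put in a UI, but it has thrown away the original byte, which is exactly why it should be a separate method from the one that gives you something to pass back in.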

Finally, the fact that FilePath exposes the internal representation of the path (as ".path") is sort of a design error, and we should eventually deprecate that attribute, since there are multiple use-cases you might want that string for and we should return the appropriate version depending on which one you want.  I wouldn't worry about getting that attribute to do anything useful beyond a very rudimentary level of compatibility; in fact it would be great if the internal storage of the path were always unicode on Windows and always bytes on UNIX-ish platforms, and ".path" were just a proxy that always gave you bytes.  (Although possibly the internal representation should just be unicode too on OS X, I keep finding myself on the fence about that.)
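
A very rough Python 3 sketch of that storage idea, under stated assumptions: SketchPath is a hypothetical stand-in and not Twisted's FilePath, the storeAsText flag exists only so the example can be exercised on any platform, and os.fsencode / os.fsdecode are used as a convenient stand-in for whatever conversion the real thing would need.

```python
import os
import sys


class SketchPath(object):
    """Hypothetical illustration only; not Twisted's FilePath."""

    def __init__(self, path, storeAsText=(sys.platform == "win32")):
        # Internal storage: text on Windows, bytes on UNIX-ish platforms.
        # Both os.fsdecode and os.fsencode accept either str or bytes.
        if storeAsText:
            self._stored = os.fsdecode(path)
        else:
            self._stored = os.fsencode(path)

    @property
    def path(self):
        # Compatibility proxy: always bytes, whatever is stored internally.
        return os.fsencode(self._stored)


asText = SketchPath("spam.txt", storeAsText=True)
asBytes = SketchPath(b"spam.txt", storeAsText=False)
```

Both objects report the same bytes from ".path" even though they store different types internally.  Note that on older Windows Pythons os.fsencode gives you ANSI-code-page bytes rather than UTF-8, so a real implementation would probably want an explicit encoding there.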

> Is this something I can open a ticket for?

Hopefully the existing ticket is sufficient, but, open as many as you need :).  There might be a bunch of methods that need modification here, and at least e.g. the ZipPath work could be done separately.

-glyph
