[Twisted-Python] Python3: should paths be bytes or str?

Glyph glyph at twistedmatrix.com
Tue Sep 9 00:01:50 MDT 2014


On Sep 7, 2014, at 7:14 PM, exarkun at twistedmatrix.com wrote:

> On 01:26 am, wolfgang.kde at rohdewald.de wrote:
>> The porting guide says
>> 
>> No byte paths in sys.path.
> 
> What porting guide is that?
>> 
>> doc for FilePath says
>>   On both Python 2 and Python 3, paths can only be bytes.
>> 
>> 
>> I stumbled upon this while trying to find out how much work it might be
>> to make bin/trial run with python3
>> 
>> admin/run-python3-tests already passes for all twisted.spread related
>> tests but I still need to clean up a lot.
>> 
>> after adding an assert to FilePath.__init__, python3 bin/trial ... gives
>> 
>> File "/home/wr/ssdsrc/Twisted/twisted/scripts/trial.py", line 601, in run
>>   config.parseOptions()
>> File "/home/wr/ssdsrc/Twisted/twisted/python/usage.py", line 277, in parseOptions
>>   self.postOptions()
>> File "/home/wr/ssdsrc/Twisted/twisted/scripts/trial.py", line 472, in postOptions
>>   _BasicOptions.postOptions(self)
>> File "/home/wr/ssdsrc/Twisted/twisted/scripts/trial.py", line 382, in postOptions
>>   self['reporter'] = self._loadReporterByName(self['reporter'])
>> File "/home/wr/ssdsrc/Twisted/twisted/scripts/trial.py", line 369, in _loadReporterByName
>>   for p in plugin.getPlugins(itrial.IReporter):
>> File "/home/wr/ssdsrc/Twisted/twisted/plugin.py", line 209, in getPlugins
>>   allDropins = getCache(package)
>> File "/home/wr/ssdsrc/Twisted/twisted/plugin.py", line 134, in getCache
>>   mod = getModule(module.__name__)
>> File "/home/wr/ssdsrc/Twisted/twisted/python/modules.py", line 781, in getModule
>>   return theSystemPath[moduleName]
>> File "/home/wr/ssdsrc/Twisted/twisted/python/modules.py", line 702, in __getitem__
>>   self._findEntryPathString(moduleObject)),
>> File "/home/wr/ssdsrc/Twisted/twisted/python/modules.py", line 627, in _findEntryPathString
>>   if _isPackagePath(FilePath(topPackageObj.__file__)):
>> File "/home/wr/ssdsrc/Twisted/twisted/python/filepath.py", line 664, in __init__
>>   assert isinstance(path, bytes), 'path must be bytes: %r' % (path,)
>> AssertionError: path must be bytes: '/home/wr/ssdsrc/Twisted/twisted/__init__.py'
> 
> If paths are being represented using unicode somewhere and you want to use them with FilePath then you have to encode them (or you have to add unicode path support to FilePath and let FilePath encode them).
> 
> Unfortunately it's not entirely obvious how to make FilePath support unicode paths since not all platforms Twisted supports represent filesystem paths using unicode.
> 
> The choice python-dev made to bridge this gap was the creation of the "surrogateescape" error handler for the UTF-8 codec.  This lets you pretend that any time you need to convert between bytes and unicode the correct codec is UTF-8 (with this special error handler).
> 
> It's not clear this was a good choice (since the result is unicode strings that may contain garbage which will confuse other software) but it's also not clear it's possible for Twisted to try to make any other choice (at some point Twisted has to interoperate with the path-related APIs in Python itself - `sys.path`, for example).
> 
> Not sure if that helps you at all.  Maybe it outlines the problem a little more clearly, at least.

The problem with making FilePath support unicode is that we want to provide an interface that applications can rely upon, specified in terms of specific types (bytes or text) so that when you get an IFilePath you know what you can do with it.

As it is currently implemented, FilePath exposes its internal representation fairly directly, most notably as the ‘.path’ attribute, but also in the return-type of methods like "basename" and "segmentsFrom".

FilePath doesn't exactly "support" unicode, in that it's specifically documented not to, but it's sort of hard to tell, since you can instantiate one with a unicode string in both python 2 and python 3, and get (apparently) correct results out of it for some methods.  However, methods that need a string constant as part of their implementation, like siblingExtensionSearch and globChildren, will break unceremoniously when presented with unicode.
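
To illustrate the kind of breakage I mean, here is a minimal sketch.  globPatternFor and tryPattern are made-up helpers for this example, not anything in Twisted; the assumption is only that a bytes-only implementation ends up joining a bytes constant onto the path somewhere:

```python
def globPatternFor(path):
    # Hypothetical helper, not Twisted's code: it appends a bytes constant,
    # the way a bytes-only implementation naturally would.
    return path + b"/*"


def tryPattern(path):
    # Return the pattern, or the name of the exception raised instead.
    try:
        return globPatternFor(path)
    except TypeError as e:
        return type(e).__name__


print(tryPattern(b"/tmp"))  # bytes path: the concatenation works
print(tryPattern(u"/tmp"))  # text path: TypeError on Python 3
```

On Python 2 the second call would quietly coerce instead of failing, which is part of why the problem is hard to notice there.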

Another decision that python-dev made to bridge the gap was to randomly allow different string types to be passed to platform APIs, like this:

>>> import os
>>> os.listdir(u".")
['a', 'b', 'c']
>>> os.listdir(b".")
[b'a', b'b', b'c']
>>> os.path.basename(b".")
b'.'
>>> os.path.basename(".")
'.'

This implies a parallel structure might be possible for FilePath: if you pass its constructor bytes, you get a BytesFilePath; if you pass it text, you get a TextFilePath.  You can't mix the two, and once you've chosen one type of path you can't switch to the other.
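
A rough sketch of that shape, written for Python 3 only.  Everything here is hypothetical: the class names come from the paragraph above, but _BaseFilePath, _check and the filePathFor factory are invented for the example, and only two methods are shown:

```python
import os


class _BaseFilePath(object):
    # Subclasses set this to the one string type they accept.
    _pathType = None

    def __init__(self, path):
        self.path = self._check(path)

    def _check(self, value):
        # Refuse to mix bytes and text within one path object.
        if not isinstance(value, self._pathType):
            raise TypeError("expected %s, got %r"
                            % (self._pathType.__name__, value))
        return value

    def basename(self):
        # os.path returns the same type it was given.
        return os.path.basename(self.path)

    def child(self, name):
        # Children keep the type chosen at construction time.
        return type(self)(os.path.join(self.path, self._check(name)))


class BytesFilePath(_BaseFilePath):
    _pathType = bytes


class TextFilePath(_BaseFilePath):
    _pathType = str


def filePathFor(path):
    # Hypothetical stand-in for FilePath's constructor dispatch.
    if isinstance(path, bytes):
        return BytesFilePath(path)
    if isinstance(path, str):
        return TextFilePath(path)
    raise TypeError("path must be bytes or str: %r" % (path,))
```

With this, filePathFor(b"/tmp").child(b"a") stays bytes all the way down, and filePathFor(b"/tmp").child("a") raises TypeError instead of half-working.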

IFilePath could then document that all of its existing methods have the return type of "whatever got passed to __init__" (which is what the current implementation does about 2/3 of the time anyway on py3, and about 9/10 of the time on py2; we would just be making it work intentionally, all the way).

It would then also be possible to give BytesFilePath an "asText()" method and, vice versa, TextFilePath an "asBytes()" method - since it's the filesystem, metadata about encodings exists outside your program and you would not need to guess at encodings, you'd simply indicate what return value you'd like from methods like .basename() et al.
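
As a sketch of how the conversion could avoid guessing, assuming Python 3's os.fsdecode and os.fsencode (which use the interpreter's filesystem encoding and error handler) are an acceptable source of that metadata.  The two classes are stripped-down, hypothetical stand-ins, not a proposed implementation:

```python
import os


class BytesFilePath(object):
    def __init__(self, path):
        self.path = path  # bytes

    def basename(self):
        return os.path.basename(self.path)

    def asText(self):
        # Decode with the filesystem encoding; no codec argument needed.
        return TextFilePath(os.fsdecode(self.path))


class TextFilePath(object):
    def __init__(self, path):
        self.path = path  # str

    def basename(self):
        return os.path.basename(self.path)

    def asBytes(self):
        # Encode with the filesystem encoding; inverse of asText().
        return BytesFilePath(os.fsencode(self.path))


p = BytesFilePath(b"/tmp/notes.txt")
print(p.basename())           # bytes in, bytes out
print(p.asText().basename())  # same path, text results
```

On POSIX the error handler involved is surrogateescape, so a bytes path that isn't valid in the filesystem encoding should still survive asText().asBytes() unchanged, at the cost of the text form containing lone surrogates.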

The more I think about this, the more I like it - it's a bit of annoying and subtle implementation work, but I think it would supply the behavior that most people want, remain compatible with most of the existing unspecified behavior, and it would address clean text/bytes separation without having a giant deprecation cycle and inventing a new interface.  It's also the sort of implementation work which, after some discussion and consideration, we could be reasonably sure is *correct* rather than guessing at things.

Thoughts?

-glyph

