Opened 3 years ago

Last modified 3 years ago

#5203 enhancement new

FilePath.children() should return FilePath objects with unicodes in them instead of strs

Reported by: zooko Owned by:
Priority: normal Milestone:
Component: core Keywords:
Cc: zooko@… Branch:
Author: Launchpad Bug:

Description

This is an epic tale of a ticket. I'm sorry for that, but I didn't have time to make it shorter. During the middle chapter (where I'm arguing that Tahoe-LAFS ought to adopt filepath) you may think I've wandered off topic, but rest assured that we return to the topic of this ticket in the finale.

Summary: 1. Tahoe-LAFS would be improved by the use of filepath instead of its legacy path manipulation code. 2. FilePath.children is insufficient for Tahoe-LAFS's needs (because it doesn't decode bytes on Linux into unicodes), 3. Maye we could contribute code to help with that.

(Inspired by #2366.)

Twisted folks: for context, Zancas and I are experimenting with using filepath in Tahoe-LAFS, and David-Sarah questions whether filepath supports the handling of non-ASCII pathnames like Tahoe-LAFS already does. If filepath doesn't, switching to it might introduce a regression.

Zancas: A great way to tell whether it does would be to find a unit test that makes sure that it does. If you find one, and it currently passes, then great!--You know that it works. If you find one and it fails, then great!--You know that it doesn't. (This is why it can be useful to have a unit test that is known to go red and is marked as a "TODO" test.) If you've done a thorough scan and you're pretty sure that there is no test that would go red if this code had bugs with regard to unicode handling, then you know that this functionality isn't automatically tested, and we treat it as though it is likely to have bugs. Having searched myself in order to write this ticket (see below), I've learned that the answer is that filepath has, not bugs exactly, but limited functionality--FilePath.children returns a set of FilePath instances that have str-type paths instead of unicode-type paths.

All the filepath tests are here:

http://twistedmatrix.com/trac/browser/trunk/twisted/test/test_paths.py

(I found that by grepping the twisted/test directory for "filepath".)

A quick scan through there confirms that, as exarkun and glyph mentioned on #4736, there aren't any tests for handling of non-ASCII characters.

Now, jknight mentioned on #4736 that there was a branch attached to #2366 that had some relevant changes! I don't see any branches attached to #2366. I guess I don't know how to find them.

The Tahoe-LAFS project has done extensive work, especially thanks to David-Sarah, and benefiting from some good advice I got a couple of years ago from Glyph and JP, to make sure that non-ASCII characters are supported in every way that makes sense for Tahoe-LAFS. We did this in a test-driven manner and have thorough unit tests. Perhaps they would be useful to filepath maintainers, and certainly they can be used to evaluate whether a Tahoe-LAFS-with-filepath had any regressions compared to a Tahoe-LAFS-with-os.path/__builtin__.open/shutil/etc.

The unicode-specific tests are in test_encodingutil.py. They are testing encodingutil.py.

There are many tests of other Tahoe-LAFS functionality which make sure it handles non-ASCII characters correctly, e.g.:

I'm looking at a sample of code in which we might replace Tahoe-LAFS's own filesystem manipulation with filepath. Here is an excerpt from our recent patch:

-        tmpfile = self.statefname + ".tmp"
-        f = open(tmpfile, "wb")
-        pickle.dump(self.state, f)
-        f.close()
-        fileutil.move_into_place(tmpfile, self.statefname)
+        self.statefp.setContent(pickle.dumps(self.state))

Our code uses fileutil.move_into_place:

def move_into_place(source, dest):
    """Atomically replace a file, or as near to it as the platform allows.
    The dest file may or may not exist."""
    if "win32" in sys.platform.lower():
        remove_if_possible(dest)
    os.rename(source, dest)

Which uses remove_if_possible:

def remove_if_possible(f):
    try:
        remove(f)
    except:
        pass

(Ugh! A bare except:.)

Which uses a horrible function that I am rather ashamed of which has, I think, been in continuous use since the days of Mojo Nation:

def remove(f, tries=4, basedelay=0.1):
    """ Here is a superkludge to workaround the fact that occasionally on
    Windows some other process (e.g. an anti-virus scanner, a local search
    engine, etc.) is looking at your file when you want to delete or move it,
    and hence you can't.  The horrible workaround is to sit and spin, trying
    to delete it, for a short time and then give up.

    With the default values of tries and basedelay this can block for less
    than a second.

    @param tries: number of tries -- each time after the first we wait twice
    as long as the previous wait
    @param basedelay: how long to wait before the second try
    """
    try:
        os.chmod(f, stat.S_IWRITE | stat.S_IEXEC | stat.S_IREAD)
    except:
        pass
    for i in range(tries-1):
        try:
            return os.remove(f)
        except EnvironmentError, le:
            # XXX Tighten this to check if this is a permission denied error
            # (possibly due to another Windows process having the file open
            # and execute the superkludge only in this case.
            if not os.path.exists(f):
                return
            log.msg("XXX KLUDGE Attempting to remove file %s; got %s;" + \
                        "sleeping %s seconds" % (f, le, basedelay,))
            time.sleep(basedelay)
            basedelay *= 2
    return os.remove(f) # The last try.

Okay, so this is a great example of the benefits of using filepath instead of our own utility functions. The patch above eliminates code, some of which is bad code, in favor of invoking FilePath.setContents. But to the topic of this ticket: is this likely to introduce a regression in handling of non-ASCII characters?

In this particular example it is not likely to introduce a regression, because this particular code doesn't have mechanisms for handling non-ASCII characters any more than filepath does, and the current Tahoe-LAFS tests don't test whether this functionality handles non-ASCII characters.

In fact, for this example the filepath code and the current code result in almost the exact same sequence of calls to the exact same Python standard library functions, so if one has bugs in case of non-ASCII filenames, the other probably does too.

So what's an example of Tahoe-LAFS code which has been specifically crafted to handle non-ASCII characters and which has tests of that? I think the only such piece of code which filepath would replace is listdir_unicode. This is thoroughly tested by unit tests in test_encodingutil.py (as well as indirectly tested by other functional tests of Tahoe-LAFS).

Where do we use it? Here is a great example: the "tahoe backup" command lists children of a directory in order to back them all up: tahoe_backup.py. In addition to the unit tests of listdir_unicode itself, "tahoe backup" has tests of the "tahoe backup" functionality and tests of the command-line interface to it. The former includes tests of handling non-ASCII chars (look for the string "unicode") and the latter currently doesn't.

Okay, so this suggests something that Zancas and I can do: try the unit tests for listdir_unicode on FilePath.children, and assuming that they fail, then make sure not to replace any uses of listdir_unicode with FilePath.children. This may also prevent us from using filepath to manipulate the resulting children in the (client-side) backup code, but in the (server-side) storage code that Zancas and I are working on, this isn't a problem.

(In addition, of course, to keeping our eyes out for other potential regressions that I've overlooked in this analysis.)

This also suggests something that someone could contribute to filepath: the tests and the implementation which would cause FilePath.children to return an iterable of unicode objects. Are the filepath hackers interested in that?

Note that the current listdir_unicode raises exception if it gets a child name which can't be decoded in the nominal filesystem encoding. (This can't happen on Windows or Mac OS X.) I currently think that this behavior is strictly superior to the current behavior of FilePath.children, and so this should not be considered a regression from filepath's perspective, but I could be persuaded otherwise. I can imagine an API which handles both decodable and non-decodable child names, but I suspect that for most users raising exception on a non-decodable child name would be preferable. For extensive exploration of that issue in the context of Tahoe-LAFS, please see Tahoe-LAFS #371 (what to do with filenames that are illegal on some systems), which is also directly relevant to #2366.

Change History (2)

comment:1 Changed 3 years ago by zooko

We talked about this on IRC for a bit. Glyph mentioned that he didn't see a reason FilePath couldn't have a canDecodeCleanly method. With that (and with a method named textIdentifier that would return unicode objects for the FilePaths that could decode cleanly using the declared filesystem encoding), we rewrite Tahoe-LAFS's listdir_unicode_fallback() from:

def listdir_unicode_fallback(path):
    """
    This function emulates a fallback Unicode API similar to one available
    under Windows or MacOS X.

    If badly encoded filenames are encountered, an exception is raised.
    """
    precondition(isinstance(path, unicode), path)

    try:
        byte_path = path.encode(filesystem_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        raise FilenameEncodingError(path)

    try:
        return [unicode(fn, filesystem_encoding) for fn in os.listdir(byte_path)]
    except UnicodeDecodeError:
        raise FilenameEncodingError(fn)

to

def listdir_unicode_fallback(fp):
    """
    This function emulates a fallback Unicode API similar to one available
    under Windows or MacOS X.

    If badly encoded filenames are encountered, an exception is raised.
    """
    precondition(isinstance(fp, FilePath), fp)

    for cfp in fp.children():
        if not cfp.canDecodeCleanly():
           raise cfp.giveMeTheDecodeErrorThen()
        yield cfp.textIdentifier()              

Then we would still have to maintain this function:

def listdir_unicode(fp):
    """
    Wrapper around listdir() which provides safe access to the convenient
    Unicode API even under platforms that don't provide one natively.
    """
    precondition(isinstance(fp, FilePath), fp)

    # On Windows and MacOS X, the Unicode API is used
    # On other platforms (ie. Unix systems), the byte-level API is used

    if is_unicode_platform:
        return os.listdir(fp.textIdentifier())
    else:
        return listdir_unicode_fallback(fp)

But there are a few ways that this could be improved for our use case by changing or extending the FilePath API. First, the implementation of of listdir_unicode_fallback would be nicer if FilePath offered a .textIdentifierStrict which gave a unicode resulting from a clean decode using the declared filesystem encoding, alongside a .textIdentifierLossy which gave its best attempt to guess, replace, mojibake, or \uFFFE (�). Then listdir_unicode_fallback could be written as:

def listdir_unicode_fallback(fp):
    """
    This function emulates a fallback Unicode API similar to one available
    under Windows or MacOS X.

    If badly encoded filenames are encountered, an exception is raised.
    """
    precondition(isinstance(fp, FilePath), fp)

    for cfp in fp.children():
        yield cfp.textIdentifierStrict()              

Next, it would be great if FilePath would adopt this code under its API, so that I wouldn't have to maintain any of this code and could just use FilePath.children, or FilePath.childrenStrictUnicode, or StrictlyUnicodeFilePath.children.

comment:2 Changed 3 years ago by zooko

  • Cc zooko@… added
Note: See TracTickets for help on using tickets.