Opened 4 months ago

Last modified 7 weeks ago

#9379 enhancement reopened

Distributed trial's AMP commands should use Unicode on Python 3

Reported by: mark williams Owned by:
Priority: normal Milestone:
Component: trial Keywords: review
Cc: Branch:
Author:

Description

One part of #8943's port of trial._dist to Python 3 has the Worker encode and decode native strings like a TestCase's name to and from UTF-8.

Unfortunately this is a source of regressions like #9378. The encoding only occurs because the AMP commands are implemented in terms of String.

They should instead use Unicode on Python 3 so that the worker code can work in terms of native strings.

Change History (8)

comment:2 Changed 4 months ago by mark williams

Keywords: review added

comment:3 Changed 4 months ago by Jean-Paul Calderone

Thanks for digging deeper into this issue.

I think varying the protocol based on the Python version in use (or anything else) is questionable. Such practice challenges the definition of the word "protocol".

One could imagine a future version of disttrial (if anyone ever cares enough to maintain and develop it further) that supports running tests with different versions of Python in a single run... What would the "protocol" be in that case?

Instead, it seems clear that Python identifiers are text even though they're represented as ASCII-only bytes on Python 2. So the protocol should be in terms of unicode and the Python 2 implementation should convert between bytes and unicode as necessary.

comment:4 in reply to:  3 Changed 4 months ago by mark williams

Replying to Jean-Paul Calderone:

I think varying the protocol based on the Python version in use (or anything else) is questionable. Such practice challenges the definition of the word "protocol".

One could imagine a future version of disttrial (if anyone ever cares enough to maintain and develop it further) that supports running tests with different versions of Python in a single run... What would the "protocol" be in that case?

That's a fair point. I understood the protocol to be an implementation detail specific to managing local processes whose lifetime was bound to a test suite's. Its purpose doesn't have to be that limited, and introducing ambiguity on the wire will make it so.

Instead, it seems clear that Python identifiers are text even though they're represented as ASCII-only bytes on Python 2.

The hard parts are exception messages, log output managercommands.TestWrite, and the directory path (sent to the worker in workercommands.Start).

On Python 2 exception messages can be an arbitrary byte sequence, which practically means they can be in any encoding with the expectation that the final presentation system, such as a terminal, interprets them correctly.

Considering only exception messages first, I see the following options:

  1. Make the protocol represent exception data as text, and on Python 2, base 64 encode and decode exception messages
  2. Make the protocol represent exception data as text and have disttrial require that exception messages on Python 2 be encoded as UTF-8.
  3. Split the commands in managercommands into binary and text versions, and call the binary versions on Python 2 and the text versions on Python 3. The manager process can then respond to either.

I like option 3 the most because it doesn't impose restrictions on test code, which seems infeasible, because it uses the protocol to unambiguously communicate data instead of layering a second encoding on top, and finally because it gives the manager flexibility to present data from either version of Python without assume the version it itself is running under.

I think the same approach would work for log messages for the same reasons.

On Python 3, file system paths can be encoded in a mess of different ways depending on platform and version (e.g., 3.6 made the default file system encoding for Windows UTF-8 with the surrogate escape handler). AMP's Unicode type is insufficient because it uses UTF-8 without surrogate escapes. Note that the existing code works encodes and decodes paths above the AMP layer in terms of UTF-8 *without* surrogateescape, which I believe is a bug.

For the same reasons listed above, I think managercommands.Start should be split into at least two new commands: one that sends paths in terms of bytes on Python 2 and one that sends paths as binary data encoded and decoded with os.fsencode and os.fsdecode on Python 3. I have no idea what the correct way to handle text paths on Python 2 is; perhaps these could just be UTF-8 encoded via a third Start command.

Are these concerns well-founded and, if so, are new protocol commands a good way to address them?

So the protocol should be in terms of unicode and the Python 2 implementation should convert between bytes and unicode as necessary.

comment:5 Changed 3 months ago by Adi Roiban

Resolution: wontfix
Status: newclosed

comment:6 Changed 7 weeks ago by Glyph

Right now trial -j crashes without an error message on Python 3 when tests fail, due in large part to the fact that the types of e.g. tracebacks are expected to be native strings by the implementation but in fact are bytes.

This seems to be the most straightforward way to fix this issue, so I disagree with the code reviews here: the wire protocol is already somewhat ambiguous, and is exclusively used to communicate between pythons of the same version on the same system. The wire protocol is also private, so if we want to expand its scope in future versions we can remove NativeString in future versions.

Furthermore, since AMP is a layered protocol, it is perfectly fine for different clients to interpret the same protocol definition differently. We can specify that abstractly, these identifiers should be text, but the range of that text is fundamentally wider on py3 (where it encompasses all of Unicode) than on py2 (where you are restricted to ASCII: although -*- coding -*- will let you put unicode in your strings, identifiers are still ASCII-only per the grammar).

I'll file a more detailed bug later when I have time to diagnose the precise circumstances under which it crashes.

comment:7 in reply to:  6 Changed 7 weeks ago by mark williams

Replying to Glyph:

Right now trial -j crashes without an error message on Python 3 when tests fail, due in large part to the fact that the types of e.g. tracebacks are expected to be native strings by the implementation but in fact are bytes.

#9436

This seems to be the most straightforward way to fix this issue, so I disagree with the code reviews here: the wire protocol is already somewhat ambiguous, and is exclusively used to communicate between pythons of the same version on the same system. The wire protocol is also private, so if we want to expand its scope in future versions we can remove NativeString in future versions.

It occurs to me now that we could also address this by adding a version identifier to the disttrial, which we should do anyway if we want it to do more than it does right now.

Furthermore, since AMP is a layered protocol, it is perfectly fine for different clients to interpret the same protocol definition differently. We can specify that abstractly, these identifiers should be text, but the range of that text is fundamentally wider on py3 (where it encompasses all of Unicode) than on py2 (where you are restricted to ASCII: although -*- coding -*- will let you put unicode in your strings, identifiers are still ASCII-only per the grammar).

I think this means: we could have a layer on top of the existing wire protocol that decodes the "native string" according to the rules of the source interpreter?

I'll file a more detailed bug later when I have time to diagnose the precise circumstances under which it crashes.

That would be nice but I think we have sufficient evidence that the existing code enables a whole class of bugs.

comment:8 Changed 7 weeks ago by mark williams

Resolution: wontfix
Status: closedreopened
Note: See TracTickets for help on using tickets.