[Twisted-Python] Converting str to bytes in Py3 port
glyph at twistedmatrix.com
Tue Sep 8 01:44:17 MDT 2015
> On Sep 2, 2015, at 3:38 AM, Adi Roiban <adi at roiban.ro> wrote:
> While reviewing the latest patch related to porting Twisted to py3 I
> saw that many values which were supposed to hold just text were ported
> as bytes.
This is fairly ambiguous, since it depends quite closely on what exact values they were.
> The argument for this conversion was that Twisted is a low level
> framework and that other high level frameworks like Treq or Klein
> should implement the required code so that end user can just use text.
> For example the HTTP response messages are now bytes, even if the RFC
> specified that they should only contain text with a single encoding.
By "response messages" do you mean:
- response headers (which are ASCII-punned bytes, or text with a restricted character set, depending on your philosophical disposition)
- response bodies (which are unambiguously bytes, unless you interpret them according to their content-type, in which case they are arbitrary objects, one type of which may be text)
- response status codes (which are sort of like headers, except not, except maybe they're integers)
- response status text (which is arguably text, but does not feature an encoding, and is therefore in the same grey area as headers)?
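To make the four cases concrete, here is a minimal Python 3 sketch that pulls them apart. The sample response and the name split_response are invented for illustration; this is not how twisted.web parses anything.

```python
# Hypothetical sample response, written by hand for this example.
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"caf\xc3\xa9"
)


def split_response(raw):
    """Split a raw response into (code, phrase, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.split(b"\r\n")
    _version, code, phrase = status_line.split(b" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    # Only the status code is converted; everything else stays as bytes.
    return int(code), phrase, headers, body


code, phrase, headers, body = split_response(RAW)

# The body becomes text only once the caller decides to trust the
# charset declared in the content-type header.
text = body.decode("utf-8")
```

Everything except the status code comes back as bytes; deciding whether any of it is text is left to the caller.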
> I have little experience with twisted.web, so maybe there are many
> users of twisted.web which use binary data for response messages or
> maybe there is a good use case for putting random bytes in the HTTP
> response message.
The main use-case is the Python zen, "refuse the temptation to guess". Treq and Klein can provide good default ways to interpret things, but some applications will need to get underneath those defaults and treat things differently.
> If for twisted.web there are Treq or Klein to implement the user
> friendly interfaces, I don't know what can be used for twisted.conch
Clearly we need to write some new code.
> In the ticket for porting twisted.conch.ssh.key to py3 the name of
> the ssh key algorithms like 'ssh-rsa' or 'ssh-dsa', encryption
> algorithm names like AES-128-CBC and ssh key components like p, q, y,
> x, n, e are now all bytes.
> Do you think that this is ok?
Yes, and here's why: those values are all enumerated constants. They come off the wire as bytes, in no particular encoding (these happen to be ASCII, but is there a guarantee that all future algorithm names will also be?), and then they have to be treated specially. A good, high-level API for this would use twisted.python.constants, and not bother application code with bytes or text. Given that what is implemented is all pretty low-level, bytes make sense.
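A rough sketch of what "enumerated constants" could look like from application code. It uses the standard library's enum module as a stand-in for twisted.python.constants so that it runs without Twisted installed; KeyAlgorithm and algorithm_from_wire are hypothetical names, not part of twisted.conch.

```python
from enum import Enum


class KeyAlgorithm(Enum):
    # Each value is the exact byte string seen on the wire.
    RSA = b"ssh-rsa"
    ED25519 = b"ssh-ed25519"


def algorithm_from_wire(name):
    """Map wire bytes to a constant; unknown names raise ValueError."""
    return KeyAlgorithm(name)


algorithm = algorithm_from_wire(b"ssh-rsa")
```

Application code then compares against KeyAlgorithm.RSA and never has to decide whether the name is bytes or text; the bytes are only needed again when writing back to the wire, via algorithm.value.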
That said, if there were a good case for ASCII being the declared encoding and having some authoritative sense that that's what we should use, then we should use it. Except that the 2.x types already use bytes, and so we'd have to go with "native str", which is a very problematic type and should be avoided for everything except Python identifiers and docstrings, or things that need to be processed into them.
> Why allow or encourage people to use random bytes for fields which
> should contain human readable text?
If you encounter an entity that can read the string "ssh-rsa" and truly comprehend it, chances are good you are not dealing with a human.
> For HTTP response line and response headers I think that all values should be text and encoded in ISO-8859-1.
> RFC 4819 only talks about using US-ASCII for all names used in the
> SSH public key subsystem.
> Why use bytes to represent these names?
You might be correct according to the specification (although it remains to be seen if you're right as far as implementations are concerned); however, why would it be useful to decode these values into text? Should we be processing them as text? In what context?
> RFC 4716 specifies that header tags must be US-ASCII while header
> value UTF-8 ... while all IANA names are US-ASCII.... and names in the
> private namespace (name at domain) should also be US-ASCII.
> As a reviewer I don't know what is the design/architecture chosen by
> Twisted and how to review such changes.
It is not necessarily possible to rationalize every decision that has been made thus far as being part of one grand plan. For one thing, many of them have been taken by different people. For another, we learn things as we go along, and so some of the decisions made thus far are now recognized as mistakes. So at this point I think it is best that you just state your preferred design and we discuss the pros and cons of that.
To the extent that there has been a conscious design strategy, it's something like this: every API needs one layer at which it needs to treat most of its data as bytes. (Sadly) few Twisted APIs have nice, discrete higher layers with objects that represent meaningful user actions rather than protocol trivia. So the existing strategy has been around making the lower levels consistently manipulate bytes everywhere, in the hope that we will promote these objects to more high-level types in a different layer later (hence the "well, in klein and treq..." answers).
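As a sketch of that layering (all class and method names here are made up for the example, and the encoding is whatever the caller passes in, not something the lower layer guesses):

```python
class LowLevelHeaders:
    """Lower layer: stores exactly the bytes it was given."""

    def __init__(self):
        self._raw = {}

    def set_raw(self, name, value):
        if not isinstance(name, bytes) or not isinstance(value, bytes):
            raise TypeError("the lower layer accepts bytes only")
        self._raw[name.lower()] = value

    def get_raw(self, name):
        return self._raw[name.lower()]


class TextHeaders:
    """Higher layer: text in, text out, with an explicit encoding."""

    def __init__(self, low, encoding):
        self._low = low
        self._encoding = encoding

    def set(self, name, value):
        self._low.set_raw(
            name.encode(self._encoding), value.encode(self._encoding)
        )

    def get(self, name):
        raw = self._low.get_raw(name.encode(self._encoding))
        return raw.decode(self._encoding)


low = LowLevelHeaders()
friendly = TextHeaders(low, "iso-8859-1")
friendly.set("Server", "Twisted")
```

An application that needs to get underneath the defaults can still call low.get_raw(b"server") and see the bytes exactly as stored.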
> As a developer I prefer to have as much text as possible so that I can
> do text manipulation operation on these values and directly include
> them in logs or error messages.
OK, so the main utility of treating them as text is being able to concatenate them into diagnostic messages? Can this not be done with bytes equally well? This is the main thing to focus on, I think: concrete useful things you could do with text in these places where bytes are sub-optimal.
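For reference, a small sketch of what building a diagnostic message from bytes looks like, assuming Python 3.5 or later (bytes %-formatting and backslashreplace on decode both need it). The message text is invented for the example.

```python
algorithm = b"ssh-rsa"

# Formatting bytes into a text message on Python 3 inserts the repr,
# so the b'' prefix and quotes end up in the message.
naive = "unsupported key type: %s" % (algorithm,)

# Staying in bytes end to end works, as on 2.x.
as_bytes = b"unsupported key type: %s" % (algorithm,)

# Decoding for display only; undecodable bytes are escaped, not guessed.
for_log = "unsupported key type: " + algorithm.decode(
    "ascii", "backslashreplace"
)
```

So bytes can be concatenated with other bytes without trouble; the friction shows up only where they meet a text-based log message.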
> I assume that all the people involved in writing the RFC had a good
> reason to require those fields to be text rather than any bytes.
> Thanks for your feedback!
Thank you for prompting this conversation, Adi. We do need a better-communicated strategy for how we handle text and encodings when protocols stipulate that the things they're dealing with are text.