Opened 4 years ago

Closed 4 years ago

#7994 enhancement closed fixed (fixed)

twisted.python.urlpath should work on bytes, not str

Reported by: hawkowl Owned by: hawkowl
Priority: normal Milestone: Python-3.x
Component: web Keywords:
Cc: Branch: branches/bytesurlpath-7994-2
branch-diff, diff-cov, branch-cov, buildbot
Author: hawkowl

Description (last modified by hawkowl)

Everywhere else in Twisted Web, paths are bytes, but twisted.python.urlpath thinks that it is working on strings. This causes issues on Python 3 (ie. str(request.URLPath()) explodes).

Change History (19)

comment:1 Changed 4 years ago by hawkowl

Description: modified (diff)

comment:2 Changed 4 years ago by hawkowl

Author: hawkowl
Branch: branches/bytesurlpath-7994

(In [45421]) Branching to bytesurlpath-7994.

comment:3 Changed 4 years ago by hawkowl

Keywords: review added

Builders are looking good. Please review.

This adds a fromBytes, which is used by fromRequest, which means it will work fine on Py3. Calling str() on it also returns a string.

comment:4 Changed 4 years ago by Adi Roiban

Do we need _BaseURLPathTests as a private class? I think that we can leave the _ out and just add a docstring informting that this is a mixin shared by bytes/str test for URLPath

test_mustBeBytes has no content.


Now some subjective comments as I don't like to work with text content as bytes :)

For me twisted.python.urlpath should be I high level wrapper for a HTTP URL and I should not do manual encoding/decoding.

It is ok for low level HTTP protocol handling to use bytes, as this is how data is transferred over the network.

I see that this patch convert all URLPath attributes from str to bytes. Do we need this?

Do we need to have the scheme as bytes? Is there anyone there using a non-ascii scheme?

Do we need to have netloc as bytes? I think that netloc is IDNA so we can always keep it as Unicode as it support a single encoding.

I see that the new standard defined UTF8 as the only encoding for non-ascii characters in the url - http://stackoverflow.com/questions/912811/what-is-the-proper-way-to-url-encode-unicode-characters ... since in previous versions of Twisted the urlpath was str, why not keep that type?


I am leaving this open for other to review.

Thanks!

comment:5 in reply to:  4 Changed 4 years ago by Glyph

Replying to adiroiban:

Do we need _BaseURLPathTests as a private class?

Indeed not, it's in a test module, by default private, no need for belt-and-suspenders underscores.

I think that we can leave the _ out and just add a docstring informting that this is a mixin shared by bytes/str test for URLPath

test_mustBeBytes has no content.


Now some subjective comments as I don't like to work with text content as bytes :)

Here we go with the "what even is an URL" discussion :).

For me twisted.python.urlpath should be I high level wrapper for a HTTP URL and I should not do manual encoding/decoding.

It is ok for low level HTTP protocol handling to use bytes, as this is how data is transferred over the network.

URLs are fraught objects. They are frequently constructed from random collections of flotsam bytes gathered from a variety of disparate sources. It's not as simple as "it's a string". The rules for parsing unicode URLs are in a totally different IRC: http://www.ietf.org/rfc/rfc3987.txt There are also quite a few other issues with that RFC: http://trac.tools.ietf.org/wg/iri/trac/report/6. It is by no means a simple problem to get from bytes to text.

I see that this patch convert all URLPath attributes from str to bytes. Do we need this?

These attributes are already strs most of the time on python 2; I think having them be one type precisely is a good idea.

That said, to preserve compatibility if anyone was actually making text_types out of these objects, we may need a FilePath-like mode-split here :-\.

Do we need to have the scheme as bytes? Is there anyone there using a non-ascii scheme?

It seems to me like if everything else is bytes, it would make sense for scheme to be as well.

Do we need to have netloc as bytes? I think that netloc is IDNA so we can always keep it as Unicode as it support a single encoding.

netloc is actually the best example of why you really, really need to preserve the original bytes as soon as you start doing anything iwth URLs. For one thing, there are different versions of IDNA. Some of them reject certain netlocs as invalid. Nevertheless, it is possible to parse those netlocs as URLs and even act upon them: while 'xn--c6h' is an invalid IDNA sequence, it's a perfectly valid hostname. I might not be able to register it as a TLD, but I could totally have a record in DNS for xn--c6h.mydomain.example.com, and I could have a URL to it. Really, IDNA is a transcoding, since it encodes characters as other characters, not straight from bytes to characters.

I see that the new standard defined UTF8 as the only encoding for non-ascii characters in the url - http://stackoverflow.com/questions/912811/what-is-the-proper-way-to-url-encode-unicode-characters ... since in previous versions of Twisted the urlpath was str, why not keep that type?

That new standard came after years of hundreds of millions of people tossing random garbage into URLs. So there are quite a few URLs whose content is ambiguous and need to be decoded differently. Not to mention that URL-encoding itself is a way to encode binary data in URLs, which only might map to valid UTF-8!

As an instance of the pattern I mentioned on the mailing list, my hope was always that we would get a higher-level "IRI" abstraction, wrapped around a URLPath; bytes on the inside, and all the encoding operations you want in all their highly ambiguous glory wrapped around the outside.

comment:6 Changed 4 years ago by Glyph

Keywords: review removed
Owner: set to hawkowl

Hi hawkowl,

Nice to see this helpful little corner of Twisted get some maintenance. I appreciate the additional documentation and clarity around what URLPath is. That said, some issues:

  1. there are pyflakes errors
  2. there are twistedchecker errors
  3. given that this was previously quite happy to do everything with str or possibly unicode before, it would be worth reviewing whether it would have been possible to get sensible behavior out of a unicode URLPath before, to ensure we're not doing a FilePath-style breakage where reasonable things previously would have been fine but are now broken.
  4. many of the modified tests have no docstring - if you're changing it, even a little, you need to update it to explain what you think it's testing now :).
  5. many of the modified implementation methods have no docstring; if you're changing them they should say what they do now.
  6. This may point to the need for a fromText; surely fromString ought to be deprecated eventually.

Please resubmit when you have dealt with these!

comment:7 Changed 4 years ago by hawkowl

Keywords: review added
Owner: hawkowl deleted

Okay, I'm a bit happier with this.

  1. Fixed.
  2. Fixed (the ones that make sense)
  3. Maybe, but only if you were constructing them manually. URLPath.fromRequest, what's used by Request.URLPath, was completely broken.
  4. Fixed.
  5. Fixed, mostly.
  6. I think that's where our mythical IRI implementation would come in -- which wraps a URLPath completely.

Please review.

comment:8 Changed 4 years ago by Adi Roiban

If the design / goal for URLPath is to do low level byte stuff and handle all possible encoding I think that handling only bytes is a good thing.... and we can wait for IRI stuff.

I think that the changes are a big improvement over the existing code, expecially the new docstrings.

I have never used URLPath before and I have no idea why would anybody wanted to work with such an low level object... so I find it hard to do a proper review.


Just a small comment.

I see that fromString encodes the data to UTF-8. Is that ok? Maybe we can have UTF-8 as default, but also allow an explicit encoding... or just update the docstring to inform that UTF-8 bytes are used.

If someone want a different encodings they can just encode the bytes in whatever format they want and then use fromBytes().

comment:9 Changed 4 years ago by Glyph

If the design / goal for URLPath is to do low level byte stuff and handle all possible encoding I think that handling only bytes is a good thing.... and we can wait for IRI stuff.

... except for the thing about potentially needing to support strs if strs were in use before ...

comment:10 in reply to:  8 Changed 4 years ago by Glyph

Replying to adiroiban:

Just a small comment.

I see that fromString encodes the data to UTF-8. Is that ok? Maybe we can have UTF-8 as default, but also allow an explicit encoding... or just update the docstring to inform that UTF-8 bytes are used.

The IRI RFC specifically stipulates UTF-8 as the one and only encoding allowed for the IRIs (rfc 3987 sections 3.2 and 6.4 and appendix A.2), so I think it's OK to hard-code this as an encoding.

comment:11 Changed 4 years ago by Adi Roiban

Keywords: review removed
Owner: set to hawkowl

Thanks for the latest comments.

I think that fromString should also accept Unicode.

I know that previous version was not documented, but I think that is sane to conside Unicode to also be a 'str' (text)

Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> isinstance(u'', str)
False
>>> 

I saw that URLPath is only used to implement a redirection page and is also a member of IRequest.

Can we deprecate it and replace it with a new IRIPath which is documented from the start?

Since we have no idea how people are using the URLPath and URLPath was not documented to specify how we expect people to use it, I think that deprecating it is a good idea.

The new IRIPath could have a similar interface, but it will be documented that it implements RFC 3987.

Thanks!

comment:12 in reply to:  11 Changed 4 years ago by Glyph

Replying to adiroiban:

Thanks for the latest comments.

I think that fromString should also accept Unicode.

Upon some further reflection, I think you might be right.

There are lots and lots of places you need to create a URLPath from a string. Let's just go ahead and let you create it with unicode, since that is if anything less ambiguous.

Can we deprecate it and replace it with a new IRIPath which is documented from the start?

We absolutely should deprecate it and replace it with something better designed. However, I want to make sure we have something good, and as compatible as possible, in place beforehand. Right now, today, if you want to implement something with URLs, I think you should use URLPath, since while it is problematic it is better than just manually using strings and stdlib url splitting/merging functions all over the place.

The new IRIPath could have a similar interface, but it will be documented that it implements RFC 3987.

Yup, absolutely. I hope to start work on that soon.

comment:13 Changed 4 years ago by Glyph

Oh hey right I don't need to start work on it I need to resume work on it: #5388

comment:14 Changed 4 years ago by hawkowl

Branch: branches/bytesurlpath-7994branches/bytesurlpath-7994-2

(In [45750]) Branching to bytesurlpath-7994-2.

comment:15 Changed 4 years ago by hawkowl

Keywords: review added

Builders spun, URLPath will now take Unicode and there's tests for it.

comment:16 Changed 4 years ago by Glyph

Thanks hawkie!

comment:17 Changed 4 years ago by Glyph

LGTM as far as it goes. One other thing that you might want to do a follow-up branch about though: the type that click takes is undocumented and somewhat ambiguous. I think it needs to be able to do either text or bytes. OTOH I'm working on that over in #5388 so maybe I'll just finish it there and the follow-up will be to hollow out URLPath and make it a wrapper over twisted.python.url.URL.

Just double-check those twistedchecker errors and make sure none of them are actually new (I don't think they are) and land.

comment:18 Changed 4 years ago by Glyph

Keywords: review removed

Land please!

comment:19 Changed 4 years ago by hawkowl

Resolution: fixed
Status: newclosed

(In [45763]) Merge bytesurlpath-7994-2: t.p.urlpath should operate on bytes, not str

Author: hawkowl Reviewers: adiroiban, glyph Fixes: #7994

Note: See TracTickets for help on using tickets.