Opened 4 years ago

Closed 4 months ago

#5388 enhancement closed fixed (fixed)

IRI implementation

Reported by: itamar Owned by: glyph
Priority: normal Milestone:
Component: core Keywords:
Cc: Branch: branches/iri-5388-4
(github, patch, buildbot, log)
Author: itamarst, glyph Launchpad Bug:

Description

Our current URLPath code doesn't support quoting correctly, nor does it support Unicode. We should add a new RFC 3987-compliant implementation, based on the code in lp:~divmod-dev/divmod.org/URL-IRI-2409.

Change History (29)

comment:1 Changed 4 years ago by itamarst

  • Author set to itamarst
  • Branch set to branches/iri-5388

(In [33134]) Branching to 'iri-5388'

comment:2 Changed 4 years ago by itamar

Remaining todos, at a minimum:

  1. Full test coverage
  2. Code style and docstrings.
  3. Make sure we're doing the right thing for domain names (is this really what Firefox would send e.g. in Referrer or Host header? Apache in redirects?)

comment:3 Changed 4 years ago by itamar

Apparently for domain names we should be doing IDNA, not the current iridecode. Query strings may be in random encodings maybe sometimes? Needs investigation.

comment:4 Changed 4 years ago by itamar

Based on some research, I think when parsing and unparsing need to support an encoding for path and an encoding for queries. By default they should be UTF-8, but they might be something else, and they might be different than each other.

We should also support the ability to ask for non-decoded passthrough, where the path and/or query are basically not decoded (e.g. "/foo%BB/bar" becomes [u"", u"foo%BB", u"bar"]). For an HTTP proxy, for example, it might be impossible for figure out what encoding was used; if URL is going to be used for twisted.web's dispatching mechanism, we must support some way just passing random crappy data through.

I assume non-decoding mode will still have values that splittable on & in query, and / in paths, and the issue is just that unicode encoding of %XX is unknown. Even if not, and sometimes just random crap is shoved in there, we'll still get *something* which can be reassembled into original URL, just not split up sufficiently, so I think that's a reasonable assumption for passthrough mode.

References: http://kb.mozillazine.org/Network.standard-url.encode-utf8 says that IE and Opera do paths in UTF-8, but queries in source page encoding. Firefox used to not have network.standard-url.encode-utf8 true by default, AFAICT, but in my browser it is the default. And a whole lots of Firefox bugs suggest that any global default breaks someone's site.

comment:5 Changed 4 years ago by exarkun

(e.g. "/foo%BB/bar" becomes [u"", u"foo%BB", u"bar"])

This is ambiguous, I think, and so doesn't work for the passthrough. "/foo%25BB/bar" would also become [u"", u"foo%BB", u"bar"], right?

Preserving the full structure of the input might mean parsing such paths into [u"", "foo%BB", u"bar"] (note the non-unicode middle element) or something equivalent (I might prefer [Segment(decoded=u""), Segment(undecoded="foo%BB"), Segment(decoded=u"bar")] to mixing str and unicode).

An entirely different approach could be to just make the original bytes available somewhere.

comment:6 Changed 4 years ago by itamar

  • Owner set to itamar

comment:7 Changed 4 years ago by itamar

This page looks like it summarizes exactly what we need to know: https://code.google.com/p/browsersec/wiki/Part1#Unicode_in_URLs

comment:8 Changed 4 years ago by itamar

The plan: path and query to be preserved as bytes, decoded on demand with default encoding UTF-8 which can be overridden. Query mutation will be dropped, and a new ticket opened for something like clone-with-mutation.

comment:9 Changed 4 years ago by itamar

Django seems to only do UTF-8 paths and queries, I think: https://docs.djangoproject.com/en/dev/ref/unicode/

comment:10 Changed 4 years ago by itamar

Some notes on API design --

Assumptions

  1. Path and query may have different unicode "encodings" (the quotes are because it's actually unicode encoding + URL % quoting).
  2. The encodings may not be known at time of creation of the IRI object. In fact, they may never be known.
  3. Certain APIs that might want to accept IRIs don't actually need the decoded version, e.g. an HTTP client just needs the bytes.
  4. Operations like "child path" or "add a value to the query string" should operate by creating new objects, not by mutation. This is already the case, I think. They can also only work in those situations when you can decode the path/query.

Use cases

  1. Give me the domain/port/protocol decoded. This is always possible since domains will always use IDNA.
  2. Give me the decoded path in a structured manner (e.g. a list, or something more sophisticated); by default use UTF-8, but allow specifying a different unicode encoding.
  3. Give me the decoded query string in a structured manner; by default use UTF-8, but allow specifying a different unicode encoding.
  4. Give me the original bytes of the IRI.
  5. Give me the original bytes of the host/port section.
  6. Give me the original bytes of the path.
  7. Give me the original bytes of the query string.

All of the above suggests that internally the IRI should just be bytes.

Other

Having separate objects for path and query (created on demand by the IRI object using some encoding) might simplify certain things, or maybe not.

I haven't thought about fragments.

comment:11 Changed 3 years ago by ralphm

I'd like to note that the current implementation will have an issue with Python 3 because of the way the IDNA decoding is done if the input string to parseIRI is a unicode string. The netloc returned from urllib.parse.urlsplit (the module was moved in Python 3), will still be a unicode string, which doesn't have a decode method in Python 3. I *think* we should first encode it it ascii. There might be more of such issues, but I haven't looked into more detail.

comment:12 Changed 3 years ago by exarkun

Fun fact, "Python 3" is yiddish for "the current implementation will have issues".

comment:13 Changed 3 years ago by itamar

I think we concluded that that we should always keeping the original bytes along across all transformations. This may allow the sane default of UTF-8 which everyone should always use anyway (but almost certainly don't in practice).

comment:14 Changed 4 months ago by glyph

  • Author changed from itamarst to itamarst, glyph
  • Branch changed from branches/iri-5388 to branches/iri-5388-2

(In [45748]) Branching to iri-5388-2.

comment:15 Changed 4 months ago by glyph

After trying to fix up this implementation a bit and bring it up to more modern standards, I think that there is too much work being conflated here.

  1. we're trying to integrate nevow.url
  2. we're trying to come up with a radically different, and much better, API for IRIs
  3. we're attempting to discover all of the nuanced encoding rules which might happen in a multidimensional matrix of encodings and IRIs and URIs and bytes
  4. we're trying to expose an API for all of those operations on strings and lists
  5. we're trying to implement basic IRI encoding and decoding.

I think we need to pare this down and I've been trying to remove the public APIs for working on strings and lists; we should be providing a single object and all the conversion functions should just be methods on it.

Luckily, spec-wise, URIs are all restricted to ASCII, and therefore all URIs are IRIs. There are rules for going back and forth, and we can just implement those conversions as methods.

comment:16 Changed 4 months ago by glyph

  • Keywords review added
  • Owner itamar deleted

OK, it should be good on Py3, and I'm not quite sure what I'm missing in the API just yet, so I think it's time for a review. Please give lots of feedback. Next step might also be to remove as much implementation logic as possible from twisted.web.client.URI and twisted.python.urlpath.URLPath and turn them into frontends for this object, but I didn't want to get ahead of myself.

comment:17 Changed 4 months ago by glyph

(builders are spinning)

comment:18 Changed 4 months ago by glyph

In the course of attempting to port Mimic to py3, it became clear that URLPath actually has divergent behavior on py2/py3 (like, the click method returns totally different URLs in some cases) so I think I'm going to need to fold in this functionality sooner rather than later. What's in the branch now still deserves a review but I will be adding some more commits.

comment:19 Changed 4 months ago by glyph

  • Branch changed from branches/iri-5388-2 to branches/iri-5388-3

(In [45900]) Branching to iri-5388-3.

comment:20 follow-up: Changed 4 months ago by jonathanj

  • Keywords review removed
  • Owner set to glyph

It's really exciting to see a sturdy and modern URL implementation in Twisted finally getting to the end of development! I think it would be a huge boon to the Python ecosystem to see this spun out into it's own library for use in other projects. Thank you!

On to the review:

  1. Query.get's return value's docstring example confusingly leaves out the part where the key is "x".
  1. Query.add seems like an inconvenient API to use, consider you have 3 query arguments you with to add to a URL: url.query.add('x', 1).query.add('y', 2).query.add('z', 3) seems a little unwieldy. Keyword arguments may be the answer here (although they do not preserve order) but then either they need to be the only way of invoking add (so that you don't have name collisions) or there would need to be another method.

2.1. I think this concern is true for Query.set too.

  1. URL.__eq__'s docstring says "that are structurally similar" but the code looks like they have to be more than similar, perhaps the docstring could indicate where they only need to be similar and where they need to be equal?
  1. url.child(x).child(y).child(z) (and similarly for FilePath) has always seemed unnecessarily wordy and while you can write reduce(URL.child, [x, y, z], url) I think it slightly obscures the intention, I'm not sure what your thoughts are on this but this is a nuisance I encounter on an almost daily basis writing with FilePath or URLPath.
  1. URL.asURI, URL.asIRI and URL.asText don't have docstrings that would help someone understand which one to call in their specific situation. This information is kind of present in the class docstring but I think having a slightly better explanation in the docstring would be valuable.
  1. TestURL.test_roundtrip mentions __str__ in the docstring but calls asText, I guess this is a refactoring artifact?
  1. While there is test_fragmentEquality it might be useful to include a test in test_similarEqual that compares http://localhost/ and http://localhost/# which should be equal if I'm understanding correctly.

7.1 And http://localhost/ and http://localhost?

Overall the implementation is pretty clean, I think deciding to accept one canonical input form and then having methods to produce what the user requires alleviates the internals from handling considerable encoding complexity.

comment:21 Changed 4 months ago by glyph

  • Branch changed from branches/iri-5388-3 to branches/iri-5388-4

(In [45929]) Branching to iri-5388-4.

comment:22 in reply to: ↑ 20 Changed 4 months ago by glyph

Replying to jonathanj:

It's really exciting to see a sturdy and modern URL implementation in Twisted finally getting to the end of development! I think it would be a huge boon to the Python ecosystem to see this spun out into it's own library for use in other projects. Thank you!

On to the review:

  1. Query.get's return value's docstring example confusingly leaves out the part where the key is "x".

Fixed.

  1. Query.add seems like an inconvenient API to use, consider you have 3 query arguments you with to add to a URL: url.query.add('x', 1).query.add('y', 2).query.add('z', 3) seems a little unwieldy. Keyword arguments may be the answer here (although they do not preserve order) but then either they need to be the only way of invoking add (so that you don't have name collisions) or there would need to be another method.

2.1. I think this concern is true for Query.set too.

I think you're right; it's pretty unwieldy. It was written to be nice for the maintainer, not the library user.

So, what am I going to want to do if I'm constructing a URL? I think something like this:

values = dict(x=1, y=2, z=[3, 4, 5])
return url.query.add(values)
# ...
# note I actually want to use 'name' and 'value' as params here...
return url.query.add(name=u"value")
# maybe I want to add multiple values to the same key
return url.query.add(name=[u"name one", u"name two"])
# if I care about order
return url.query.add([(u"b", u"a"), (u"a", u"b")])
# change a value
return url.query.set(u"x", u"y")
# change a value to multiple values
return url.query.set(u"x", [u"a", u"b", u"c"])
# clear it and add new stuff
return url.query.clear(name=u"value")
# new multiple stuff
return url.query.clear(name=[u"multiple", u"values"], other=u"more stuff")

so that's a consistent idiom for add, set, and clear, where:

  1. I want to be able to be flexible and accept integers, strings, or lists; it's unambiguous what I mean in each case, and I don't want to have to go through and manually convert everything into a query-string-format data structure; the whole point of this layer is for convenience, so it should be convenient and DWIM as long as it's unambiguous
  2. I want to be able to specify order, but if I don't care (which is most of the time) I want to be able to specify keyword arguments
  3. the simple 2-argument case is actually the least-likely.

I also realized that now that the URL namespace is cleaned up somewhat, it actually has no conflict with add, set, get, remove, and clear, and the separate Query object is somewhat superflous. I am contemplating just putting these methods directly on URL.

  1. URL.__eq__'s docstring says "that are structurally similar" but the code looks like they have to be more than similar, perhaps the docstring could indicate where they only need to be similar and where they need to be equal?

I was using "structural" in the sense of "structural typing"; it was a misapplication of a formal term and hopefully the simpler docstrings are more evocative of the actual rules.

  1. url.child(x).child(y).child(z) (and similarly for FilePath) has always seemed unnecessarily wordy and while you can write reduce(URL.child, [x, y, z], url) I think it slightly obscures the intention, I'm not sure what your thoughts are on this but this is a nuisance I encounter on an almost daily basis writing with FilePath or URLPath.

I think there's probably no harm in adding a url.child(x, y, z). This is nicely compatible, doesn't bother you in the simple case where you want to add one segment, and avoids this annoyance. The * case where you're adding an arbitrary number of segments strikes me as unlikely, but the .child(x).child(y).child(z) thing where there's a fixed set of 3 or 4 segments seems pretty common, with dates and /people/name/attribute type URLs, so varargs seems appropriate.

  1. URL.asURI, URL.asIRI and URL.asText don't have docstrings that would help someone understand which one to call in their specific situation. This information is kind of present in the class docstring but I think having a slightly better explanation in the docstring would be valuable.

Yes, this definitely needs some improving.

  1. TestURL.test_roundtrip mentions __str__ in the docstring but calls asText, I guess this is a refactoring artifact?

Yup.

  1. While there is test_fragmentEquality it might be useful to include a test in test_similarEqual that compares http://localhost/ and http://localhost/# which should be equal if I'm understanding correctly.

OK, point taken.

7.1 And http://localhost/ and http://localhost?

These actually are different and oh my gosh it is so confusing. They're different because for different schemes that might actually make a different, but they are equivalent within HTTP. Fragments, on the other hand, can be normalized scheme-agnostically. I'm not sure what I'm going to do with this...

Overall the implementation is pretty clean, I think deciding to accept one canonical input form and then having methods to produce what the user requires alleviates the internals from handling considerable encoding complexity.

Thanks. I think we might want to have a little more support around getting binary data out of "bad" URLs, but that is a nasty edge case and not what the API should be pushing towards, so I also like where it's ending up :).

comment:23 Changed 4 months ago by glyph

OK I think I've addressed everything at this point. This also fixes #8063, because URLPath has no implementation to call its own any more.

comment:24 Changed 4 months ago by glyph

  • Keywords review added
  • Owner glyph deleted

comment:25 Changed 4 months ago by hawkowl

  • Keywords review removed
  • Owner set to glyph

HI GLYPH HERE'S AN REVIEW

  • twisted.python.url is not in dist3.
  • URLPath.fromString() now takes bytes as well, but the docstring isn't updated.
  • There's two topfiles for some reason.
  • There's a few little uncovered branches in _url that look trivial to test.

Otherwise, this looks good. Fix these, then land. Let's put this to bed.

comment:26 Changed 4 months ago by glyph

(In [45970]) review point 1 re #5388

comment:27 Changed 4 months ago by glyph

There's two topfiles because it fixes two issues, so I'm leaving that as-is...

comment:28 Changed 4 months ago by glyph

The branches were not quite as "trivial" to test as they seemed, but thanks for pointing them out, there were a couple of gnarly bugs there. I ended up ripping out pretty much all the typechecking and replacing it even though mostly it was technically "covered" the few holes in coverage indicated that it wasn't really tested or useful.

comment:29 Changed 4 months ago by glyph

  • Resolution set to fixed
  • Status changed from new to closed

(In [45992]) Merge iri-5388-4: new, internationalizable URL

Author: glyph, itamar

Reviewer: jonathanj, hawkowl

Fixes: #5388 Fixes: #8063

twisted.python.url is a new abstraction for URLs, supporting RFC 3987 IRIs.

This change also replaces the guts of twisted.python.urlpath, reducing duplication since we cannot deprecate it immediately.

Note: See TracTickets for help on using tickets.