Opened 3 years ago

Last modified 16 months ago

#5388 enhancement new

IRI implementation

Reported by: itamar Owned by: itamar
Priority: normal Milestone:
Component: core Keywords:
Cc: Branch: branches/iri-5388
Author: itamarst Launchpad Bug:

Description

Our current URLPath code doesn't support quoting correctly, nor does it support Unicode. We should add a new RFC 3987-compliant implementation, based on the code in lp:~divmod-dev/divmod.org/URL-IRI-2409.

Change History (13)

comment:1 Changed 3 years ago by itamarst

  • Author set to itamarst
  • Branch set to branches/iri-5388

(In [33134]) Branching to 'iri-5388'

comment:2 Changed 3 years ago by itamar

Remaining todos, at a minimum:

  1. Full test coverage
  2. Code style and docstrings.
  3. Make sure we're doing the right thing for domain names (is this really what Firefox would send, e.g. in a Referer or Host header? Apache in redirects?)

comment:3 Changed 3 years ago by itamar

Apparently for domain names we should be doing IDNA, not the current iridecode. Query strings may sometimes be in arbitrary encodings? Needs investigation.
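
A minimal Python 3 sketch of the IDNA point above, using the standard library's built-in "idna" codec (which implements the older IDNA 2003 rules). The helper names are hypothetical and not taken from the branch:

```python
def host_to_wire(host: str) -> bytes:
    """Encode a Unicode host name to its ASCII (Punycode) wire form."""
    return host.encode("idna")


def host_from_wire(host: bytes) -> str:
    """Decode an ASCII (possibly Punycode) host name back to Unicode."""
    return host.decode("idna")


wire = host_to_wire("bücher.example")
print(wire)                  # b'xn--bcher-kva.example'
print(host_from_wire(wire))  # bücher.example
```

Unlike paths and queries, there is no encoding to guess here: the wire form is always ASCII.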

comment:4 Changed 3 years ago by itamar

Based on some research, I think parsing and unparsing need to support one encoding for the path and one encoding for the query. By default both should be UTF-8, but they might be something else, and they might differ from each other.

We should also support the ability to ask for non-decoded passthrough, where the path and/or query are basically not decoded (e.g. "/foo%BB/bar" becomes [u"", u"foo%BB", u"bar"]). For an HTTP proxy, for example, it might be impossible to figure out what encoding was used; if URL is going to be used for twisted.web's dispatching mechanism, we must support some way of just passing random crappy data through.

I assume non-decoding mode will still have values that are splittable on & in the query and on / in paths, and the issue is just that the unicode encoding behind the %XX escapes is unknown. Even if not, and sometimes just random crap is shoved in there, we'll still get *something* which can be reassembled into the original URL, just not split up sufficiently, so I think that's a reasonable assumption for passthrough mode.

References:
http://kb.mozillazine.org/Network.standard-url.encode-utf8 says that IE and Opera do paths in UTF-8, but queries in source page encoding. AFAICT Firefox used to not set network.standard-url.encode-utf8 to true by default, but in my browser it is the default. And a whole lot of Firefox bugs suggest that any global default breaks someone's site.
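
A rough Python 3 sketch of the two modes described in this comment: per-component encodings, plus a passthrough mode that only splits and never decodes. The function names and the `encoding=None` convention are assumptions for illustration, not an existing API. It also ignores the "+ means space" convention used by HTML forms:

```python
from urllib.parse import unquote_to_bytes


def split_path(raw: bytes, encoding=None):
    """Split on "/"; decode each segment only if an encoding is given."""
    parts = raw.split(b"/")
    if encoding is None:
        return parts  # passthrough: %XX escapes are left untouched
    return [unquote_to_bytes(p).decode(encoding) for p in parts]


def split_query(raw: bytes, encoding=None):
    """Split on "&" and the first "="; decode only if an encoding is given."""
    pairs = [item.partition(b"=")[::2] for item in raw.split(b"&") if item]
    if encoding is None:
        return pairs
    return [
        tuple(unquote_to_bytes(x).decode(encoding) for x in pair)
        for pair in pairs
    ]


# Path in UTF-8, query in Latin-1, as IE and Opera reportedly send them.
print(split_path(b"/caf%C3%A9/menu", "utf-8"))
print(split_query(b"q=caf%E9", "latin-1"))
# Unknown encoding: still splittable, and trivially reassembled.
print(split_path(b"/foo%BB/bar"))
```

In passthrough mode `b"/".join(split_path(raw))` gives back the original bytes, which is the property a proxy needs.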

comment:5 Changed 3 years ago by exarkun

(e.g. "/foo%BB/bar" becomes [u"", u"foo%BB", u"bar"])

This is ambiguous, I think, and so doesn't work for the passthrough. "/foo%25BB/bar" would also become [u"", u"foo%BB", u"bar"], right?

Preserving the full structure of the input might mean parsing such paths into [u"", "foo%BB", u"bar"] (note the non-unicode middle element) or something equivalent (I might prefer [Segment(decoded=u""), Segment(undecoded="foo%BB"), Segment(decoded=u"bar")] to mixing str and unicode).
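
To make the Segment idea concrete, here is a small Python 3 sketch. The `Segment` fields follow the spelling above; `parse_path` and the fall-back-on-decode-error policy are assumptions made for illustration:

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import unquote_to_bytes


@dataclass(frozen=True)
class Segment:
    # Exactly one of the two fields is set.
    decoded: Optional[str] = None
    undecoded: Optional[bytes] = None


def parse_path(raw: bytes, encoding: str = "utf-8") -> List[Segment]:
    """Decode each segment if possible, else keep its original bytes."""
    segments = []
    for part in raw.split(b"/"):
        try:
            text = unquote_to_bytes(part).decode(encoding)
        except UnicodeDecodeError:
            segments.append(Segment(undecoded=part))
        else:
            segments.append(Segment(decoded=text))
    return segments


a = parse_path(b"/foo%BB/bar")
b = parse_path(b"/foo%25BB/bar")
print(a[1])  # Segment(decoded=None, undecoded=b'foo%BB')
print(b[1])  # Segment(decoded='foo%BB', undecoded=None)
```

The two inputs no longer collapse to the same value, so the ambiguity goes away.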

An entirely different approach could be to just make the original bytes available somewhere.

comment:6 Changed 3 years ago by itamar

  • Owner set to itamar

comment:7 Changed 3 years ago by itamar

This page looks like it summarizes exactly what we need to know:
https://code.google.com/p/browsersec/wiki/Part1#Unicode_in_URLs

comment:8 Changed 3 years ago by itamar

The plan: path and query to be preserved as bytes, decoded on demand with default encoding UTF-8 which can be overridden. Query mutation will be dropped, and a new ticket opened for something like clone-with-mutation.

comment:9 Changed 3 years ago by itamar

Django seems to only do UTF-8 paths and queries, I think: https://docs.djangoproject.com/en/dev/ref/unicode/

comment:10 Changed 3 years ago by itamar

Some notes on API design --

Assumptions

  1. Path and query may have different unicode "encodings" (the quotes are because it's actually unicode encoding + URL % quoting).
  2. The encodings may not be known at time of creation of the IRI object. In fact, they may never be known.
  3. Certain APIs that might want to accept IRIs don't actually need the decoded version, e.g. an HTTP client just needs the bytes.
  4. Operations like "child path" or "add a value to the query string" should operate by creating new objects, not by mutation. This is already the case, I think. They can also only work in those situations when you can decode the path/query.

Use cases

  1. Give me the domain/port/protocol decoded. This is always possible since domains will always use IDNA.
  2. Give me the decoded path in a structured manner (e.g. a list, or something more sophisticated); by default use UTF-8, but allow specifying a different unicode encoding.
  3. Give me the decoded query string in a structured manner; by default use UTF-8, but allow specifying a different unicode encoding.
  4. Give me the original bytes of the IRI.
  5. Give me the original bytes of the host/port section.
  6. Give me the original bytes of the path.
  7. Give me the original bytes of the query string.

All of the above suggests that internally the IRI should just be bytes.
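
A sketch of what a bytes-backed object covering the seven use cases might look like in Python 3. The class and method names are hypothetical, and it assumes the input is already ASCII on the wire (percent-encoded path and query, Punycode host), since urlsplit rejects non-ASCII bytes:

```python
from urllib.parse import urlsplit, unquote_to_bytes


class IRI:
    """Hypothetical immutable wrapper around the original bytes."""

    def __init__(self, raw: bytes):
        self._raw = raw
        self._parts = urlsplit(raw)  # bytes in, bytes out

    def host(self) -> str:                       # use case 1
        return self._parts.hostname.decode("idna")

    def path(self, encoding: str = "utf-8"):     # use case 2
        return [unquote_to_bytes(s).decode(encoding)
                for s in self._parts.path.split(b"/")]

    def query(self, encoding: str = "utf-8"):    # use case 3
        pairs = []
        for item in self._parts.query.split(b"&"):
            if item:
                name, _, value = item.partition(b"=")
                pairs.append((unquote_to_bytes(name).decode(encoding),
                              unquote_to_bytes(value).decode(encoding)))
        return pairs

    def to_bytes(self) -> bytes:                 # use case 4
        return self._raw

    def netloc_bytes(self) -> bytes:             # use case 5
        return self._parts.netloc

    def path_bytes(self) -> bytes:               # use case 6
        return self._parts.path

    def query_bytes(self) -> bytes:              # use case 7
        return self._parts.query


iri = IRI(b"http://xn--bcher-kva.example:8080/caf%C3%A9?q=caf%E9")
print(iri.host(), iri.path(), iri.query("latin-1"))
```

Decoding happens only when asked for, so an HTTP client can call `to_bytes()` without ever knowing the encodings.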

Other

Having separate objects for path and query (created on demand by the IRI object using some encoding) might simplify certain things, or maybe not.

I haven't thought about fragments.

comment:11 Changed 19 months ago by ralphm

I'd like to note that the current implementation will have an issue with Python 3 because of the way the IDNA decoding is done if the input string to parseIRI is a unicode string. The netloc returned from urllib.parse.urlsplit (the module was moved in Python 3) will still be a unicode string, which doesn't have a decode method in Python 3. I *think* we should first encode it to ascii. There might be more such issues, but I haven't looked into it in more detail.
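
A minimal Python 3 sketch of the encode-to-ascii idea. `decode_host` is a hypothetical helper, not the code in the branch, and it assumes a unicode input already carries an ASCII (Punycode) host; a non-ASCII host would raise UnicodeEncodeError here:

```python
from urllib.parse import urlsplit


def decode_host(iri):
    """Return the IDNA-decoded host of ``iri`` (str or ASCII bytes)."""
    host = urlsplit(iri).hostname
    if host is None:
        return None
    if isinstance(host, str):
        # A Python 3 str has no .decode(); go through ASCII bytes first.
        host = host.encode("ascii")
    return host.decode("idna")


print(decode_host("http://xn--bcher-kva.example/path"))
print(decode_host(b"http://xn--bcher-kva.example/path"))
```

Both calls take the same decoding path once the host is bytes.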

comment:12 Changed 19 months ago by exarkun

Fun fact, "Python 3" is yiddish for "the current implementation will have issues".

comment:13 Changed 16 months ago by itamar

I think we concluded that we should always keep the original bytes around across all transformations. This may allow the sane default of UTF-8, which everyone should always use anyway (but almost certainly doesn't in practice).
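
As an illustration of keeping the original bytes across a transformation, here is a Python 3 sketch of a "child path" operation that only appends to the raw path and never re-encodes what is already there. `child` is a hypothetical free function, not the URLPath method:

```python
from urllib.parse import quote


def child(raw_path: bytes, segment: str, encoding: str = "utf-8") -> bytes:
    """Return a new raw path with ``segment`` appended, percent-encoded."""
    encoded = quote(segment, safe="", encoding=encoding).encode("ascii")
    if raw_path.endswith(b"/"):
        return raw_path + encoded
    return raw_path + b"/" + encoded


# The undecodable %BB in the parent survives untouched.
print(child(b"/foo%BB", "café"))  # b'/foo%BB/caf%C3%A9'
```

Only the new segment needs an encoding, so UTF-8 can be the default even when the parent path's encoding is unknown.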
