Opened 4 years ago

Closed 3 years ago

#5053 enhancement closed fixed (fixed)

Add Gzip support to web client

Reported by: therve Owned by: therve
Priority: normal Milestone:
Component: web Keywords:
Cc: jknight, thijs Branch: branches/gzip-client-5053-3
(diff, github, buildbot, log)
Author: therve Launchpad Bug:

Description

Twisted web client should support Content-Encoding: gzip, decompressing the body transparently.

Change History (20)

comment:1 Changed 4 years ago by DefaultCC Plugin

  • Cc jknight added

comment:2 Changed 4 years ago by therve

  • Owner set to therve

comment:3 Changed 4 years ago by therve

  • Author set to therve
  • Branch set to branches/gzip-client-5053

(In [31684]) Branching to 'gzip-client-5053'

comment:4 Changed 4 years ago by therve

(In [31685]) Implement gzip support

Refs #5053

comment:5 Changed 4 years ago by therve

  • Keywords review added
  • Owner therve deleted

The branch adds gzip supports for the new client. You need to explicitly pass the Accept-Encoding header for it to do so. I'm wondering about passing it by default, but I'm not sure about backward compatibility.

comment:6 Changed 4 years ago by therve

  • Keywords review removed
  • Owner set to therve

Talked with exarkun about using a composition class instead:

So GzipAgent.request would add the Accept-Encoding header and send the call on to the wrapped agent.
It would add a callback to the returned Deferred to inspect the response.
If there's not a 'Content-Encoding: gzip' header in the response, it just returns the response on.
If there is one, then it wraps the Response so that it can intercept the Response.deliverBody call.
In its deliveryBody implementation, it wraps the protocol passed in with another protocol that does the decoding and passes the uncompressed bytes to the real protocol.
The real protocol can have the original body producer as its "transport" because there's only IProducer methods for it to use, and GzipAgent doesn't care about those.

comment:7 Changed 4 years ago by therve

(In [31700]) Implement the support in GzipAgent instead of the client parser

Refs #5053

comment:8 Changed 4 years ago by therve

  • Keywords review added
  • Owner therve deleted

Back to review, with a brand new implementation.

comment:9 Changed 4 years ago by jonathanj

  • Keywords review removed
  • Owner set to therve

Looking good.

  1. There was talk about introducing an IResponse interface and doing some stuff which would make _GzipResponse.__getattr__ exist less. On IRC you suggested that the IResponse work could be done first and then this branch modified, which sounds like a good idea to me. Could you file a ticket for that work?
  1. The docstring for _GzipProtocol looks incomplete. It also needs a newline between the description and the start of the fields.
  1. There is an extra newline between _GzipProtocol.__init__ and _GzipProtocol.dataReceived.
  1. There is an epydoc syntax error in the @see field of GzipAgent.request.
  1. In GzipAgent.request wouldn't it be slightly cleaner if you set headers to an empty object if it were None and then always perform the addRawHeader step?
  1. There is a missing newline after GzipAgentTests.test_acceptHeaders.
  1. This is kind of a big one: GzipAgent only handles gzipped data in the case where it is the only content encoding although, according to the RFC, it is possible to specify multiple ordered content encodings. You suggested being able to register encoders with Agent, something like Agent.addEncoderSupport(GzipEncoder()), which would then take care of constructing Accept-Encodings and dispatching the decoders upon receiving data. Presumably this should go into a separate ticket and its design discussed further, perhaps exarkun has some helpful advice.

comment:10 Changed 4 years ago by exarkun

An alternative to Agent.addEncoderSupport(GzipEncoder()) might be ContentDecoder(Agent(...), [Gzip(), ...]). This way Agent doesn't need a new method (which it would be nice to avoid, since it means every wrapper of Agent will need the same new method).

comment:11 Changed 3 years ago by therve

  • Branch changed from branches/gzip-client-5053 to branches/gzip-client-5053-2

(In [31821]) Branching to 'gzip-client-5053-2'

comment:12 Changed 3 years ago by therve

  • Keywords review added
  • Owner therve deleted

Alright, back to review. The implementation completely changed, and is now a generic content decoding wrapper, with initial gzip support. It's a little bit more complicated than I hope, but it does the job I think.

comment:13 Changed 3 years ago by jonathanj

  • Keywords review removed
  • Owner set to therve
  1. ContentDecoderAgentTests, ContentDecoderAgentTests.test_unsupportedEncoding, `ContentDecoderAgentWithGzipTests need docstrings.
  1. ContentDecoderAgentTests.test_acceptHeaders, ContentDecoderAgentTests.test_existingHeaders, ContentDecoderAgentTests.test_unknownEncoding assume something about the ordering of the encoding names, when in fact the ordering will depend on the hashing implementation. Point 9 might make this point moot.
  1. There is an epydoc syntax error in the @see field of ContentDecoderAgent.request.
  1. According to the docstrings for Response some of your parameters to Response.__init__ (Response('HTTP/1.1', 200, 'GET', headers, None)) are incorrect. I believe the first parameter should be something like ('HTTP', 1, 1) while the third parameter should be something like OK (as in "200 OK").
  1. ContentDecoderAgentTests and ContentDecoderAgentWithGzipTests (probably also a bunch of other unit tests) have the same setUp and _dummyConnect code. Maybe it could be nice to refactor this (although setUp should have another name) into a mixin.
  1. I can't help but notice that there are no tests for the cases where the server says the data is gzipped (or otherwise encoded) and sends back data that causes the decoder to break.
  1. I believe the standard is now to use assertEqual instead of assertEquals, can you fix this?
  1. I've noticed some code in the form below. Would it not be better to always call addRawHeader and avoid having the values duplicated?
    if headers is None:
        headers = Headers({'foo': 'bar'})
    else:
        headers = headers.copy()
        headers.addRawHeader('foo', 'bar')
    
  1. GzipDecoder.name is not documented. Although it feels like this should be documented by an interface.
  1. I don't see any way to specify the qvalue for the decoders capable of being decoded. If the order of the decoders is retained you could automatically generate qvalues based on their position in the ordering. At the moment this isn't really relevant since there is only one decoder (although it might be useful to people implementing their own decoders) so I'm happy if you want to file a ticket for this, although it doesn't seem too hard.
  1. According to the HTTP 1.1 protocol RFC some content encodings have synonyms (quoted below), I'm not sure how widely spread this usage is. I'm happy if you want to file a ticket for this (or ignore it if you can reason that it is not valuable.)
    For compatibility with previous implementations of HTTP,
    applications SHOULD consider "x-gzip" and "x-compress" to be
    equivalent to "gzip" and "compress" respectively.
    

Phew.

comment:14 Changed 3 years ago by therve

  • Keywords review added
  • Owner therve deleted

Thanks for the review!

1, 3, 4, 5, 6, 7, 8: Fixed:

2, 9: I've changed the implementation to take a list of (names, decoder).

8: Yeah I don't really care about qvalues, so I'll postpone if that's fine with you.

11: With the new structure, it's just a matter of passing ('x-gzip', GzipDecoder) in the list of supported decoders if you care about it.

comment:15 Changed 3 years ago by jonathanj

  • Keywords review removed
  • Owner set to therve

Looking good, some minor points:

  1. Decoder1 and Decoder2 in t.w.t.test_webclient should have docstrings briefly describing their purpose instead of pass.
  1. In the interests of being able to cleanly handle failed responses in the future (when you've got 10 Agent wrappers) I think that instead of just letting zlib.error be raised in deliverBody you should wrap it in ReponseFailed and pass the original error as one of the reasons.
  1. The docstring for ContentDecoderAgent.decoders says the value is to be an iterable, however you iterate it twice without copying it into something that can be iterated multiple times first.

I'm happy for you to merge this after addressing these points.

comment:16 Changed 3 years ago by mithrandi

"iterable" is not the same thing as "iterator"; iterating an iterable multiple times is fine, as iter(theIterable) will (must?) return a new iterator each time.

comment:17 Changed 3 years ago by mithrandi

Whoops, apparently "container" is the word I was looking for, not "iterable".

comment:18 Changed 3 years ago by therve

  • Branch changed from branches/gzip-client-5053-2 to branches/gzip-client-5053-3

(In [31974]) Branching to 'gzip-client-5053-3'

comment:19 Changed 3 years ago by thijs

  • Cc thijs added

The new methods and classes also need an @since marker.

comment:20 Changed 3 years ago by therve

  • Resolution set to fixed
  • Status changed from new to closed

(In [31995]) Merge gzip-client-5053-3

Author: therve
Reviewer: jonathanj
Fixes: #5053

Add Gzip support to twisted web client via the new ContentDecoderAgent class.

Note: See TracTickets for help on using tickets.