Opened 14 years ago

Closed 11 years ago

#2062 defect closed fixed (fixed)

Handle multiple-line headers in twisted.web.http.HTTPClient

Reported by: kkinder Owned by:
Priority: normal Milestone:
Component: web Keywords: httpclient
Cc: jesstess, Thijs Triemstra Branch: branches/multi-line-headers-2062
branch-diff, diff-cov, branch-cov, buildbot
Author: jesstess

Description

Many HTTP servers send multiple line headers, such as this:

Http-header: text
        (text-continued)

The header is continued with a newline and a tab.

Twisted's http.py does not handle this:

        if line:
            key, val = line.split(':', 1)
            val = val.lstrip()
            self.handleHeader(key, val)
            if key.lower() == 'content-length':
                self.length = int(val)

Line, in this case, is only part of the header. I've attached a patch which addresses this problem.

Attachments (2)

http.py.diff (2.0 KB) - added by kkinder 14 years ago.
Patch to have http.py handle multiple headers.
multiline-http-request-field.patch (4.2 KB) - added by jesstess 11 years ago.

Download all attachments as: .zip

Change History (25)

Changed 14 years ago by kkinder

Attachment: http.py.diff added

Patch to have http.py handle multiple headers.

comment:1 Changed 14 years ago by kkinder

BTW, I recall having sent this in before, but I couldn't find a ticket for it, and it hasn't been fixed.

comment:2 Changed 12 years ago by Jean-Paul Calderone

Summary: Handle multiple-line headers in web/http.pyHandle multiple-line headers in twisted.web.http.HTTPClient

comment:3 Changed 12 years ago by Jean-Paul Calderone

Keywords: httpclient added

Changed 11 years ago by jesstess

comment:4 Changed 11 years ago by jesstess

Cc: jesstess added
Keywords: review added
Owner: jknight deleted

Thanks kkinder. As it turns out, RFCs 1945 (HTTP 1.0) and 2616 (HTTP 1.1) state that HTTP message header fields can span multiple lines if each extra line is preceded by at least one space or horizontal tab.

kkinder's patch broke web/test/test_http test cases and seemed to be using a lot of temp variables, so multiline-http-request-field.patch simplifies things and handles the space case as well. It also includes a test case in web/test/test_http, a doc string for HTTPClient.lineReceived as it is affected by the patch, and documentation for the instance variables for HTTPClient since I was adding one anyway (_header, which holds a possibly incomplete HTTP header).

comment:5 Changed 11 years ago by z3p

Keywords: review removed

Some quick comments:

  1. Can you put this into a branch?
  2. lineReceived()s docstring says it parses the headers, but it also parses the initial status line.
  3. flushHeader()
    1. It should get broken out into an actual method.
    2. It should have a name that describes what it does; maybe extractHeader()?
    3. It shouldn't reference self._header directly; it should use the header argument passed in. Alternatively, it shouldn't take an argument.
    4. It should have a test.
  4. MultilineHeaders
    1. The name should have TestCase in it.
    2. test_MultilineHeaders should be camelCased.
    3. expectedHeaders and ourHandleHeader need documentation.

comment:6 Changed 11 years ago by jesstess

Owner: set to jesstess

comment:7 Changed 11 years ago by jesstess

Author: jesstess
Branch: branches/multi-line-headers-2062

(In [27825]) Branching to 'multi-line-headers-2062'

comment:8 Changed 11 years ago by jesstess

(In [27830]) Address review comments by z3p.

refs #2062

comment:9 Changed 11 years ago by jesstess

Keywords: review added
Owner: jesstess deleted

comment:10 Changed 11 years ago by TimAllen

Keywords: added; review removed
Owner: set to jesstess

When I check out this branch and run the tests, twisted.web.test.test_proxy.ProxyClientTestCase.test_statusWithMessage repeatably fails for me:

[ERROR]: twisted.web.test.test_proxy.ProxyClientTestCase.test_statusWithMessage

Traceback (most recent call last):
  File "/mnt/aquaria/nobackup/home/tim/tmp/Twisted-git/twisted/web/test/test_proxy.py", line 201, in test_statusWithMessage
    return self._testDataForward(
  File "/mnt/aquaria/nobackup/home/tim/tmp/Twisted-git/twisted/web/test/test_proxy.py", line 153, in _testDataForward
    client.dataReceived("\r\n" + body)
  File "/mnt/aquaria/nobackup/home/tim/tmp/Twisted-git/twisted/protocols/basic.py", line 251, in dataReceived
    why = self.lineReceived(line)
  File "/mnt/aquaria/nobackup/home/tim/tmp/Twisted-git/twisted/web/http.py", line 461, in lineReceived
    self.extractHeader(self._header)
  File "/mnt/aquaria/nobackup/home/tim/tmp/Twisted-git/twisted/web/http.py", line 424, in extractHeader
    key, val = header.split(':', 1)
exceptions.ValueError: need more than 1 value to unpack

Looks like HTTPClient.lineReceived() breaks if the response doesn't have any headers. Might be a useful thing to add to the HTTPClient unit-tests.

It kind of bugs me that HTTPClient returns the (arbitrary) whitespace used to mark continued header values as part of the header values. On the other hand, we aren't canonicalizing header names or decoding things to Unicode, so at least we're consistently presenting raw-ish data.

Apart from that one test-failure, looks good to me.

comment:11 Changed 11 years ago by TimAllen

Oh, I nearly forgot: you'll need to add a description of the change to twisted/web/topfiles/2062.bugfix or perhaps .news.

comment:12 Changed 11 years ago by jesstess

(In [27866]) Handle header-less requests and strip leading whitespace from the individual lines of multiline headers.

refs #2062

comment:13 Changed 11 years ago by jesstess

Keywords: review added; removed
Owner: jesstess deleted

Review points by TimAllen addressed:

  • check that headers exist before trying to process them added in lineReceived
  • test_noHeaders added to cover that case
  • stripping the leading whitespace from the individual lines of the multiline headers
  • 2062.feature added in twisted/web/topfiles

comment:14 Changed 11 years ago by Thijs Triemstra

Cc: Thijs Triemstra added
Keywords: review removed
Owner: set to jesstess

Can you get rid of oneliners like:

"""Convert chunk to string.

    @returns: tuple (result, remaining), may raise ValueError.
"""

as much as possible in surrounding code and apply PEP-8 standards as much as possible? People here tend to dislike fixing up a whole module like this for a single ticket but readability now should count as well..

Thanks for working on those old tickets, fresh Twisted == good.

comment:15 Changed 11 years ago by jesstess

(In [27882]) Bring docstring formatting in http.py up to the standard.

refs #2062

comment:16 Changed 11 years ago by jesstess

Keywords: review added
Owner: jesstess deleted

comment:17 Changed 11 years ago by Glyph

Keywords: review removed
Owner: set to jesstess

It's always good to fix bugs, but twisted.web.client.Agent is hopefully the wave of the future. We should be deprecating stuff like HTTPClient as soon as possible, or, where it's a reasonable amount of effort to do so compatibly, re-implementing it in terms of the _newclient stuff. This branch should probably land, but further changes to fix bugs in other twisted.web.client stuff should lean in the direction of adding deprecation warnings and pointing folks at the new APIs.

  1. The "Multiline headers are possible" comment in lineReceived's docstring is unnecessary. The docstring should not list every possible bug that might be covered: if we followed this example, it should also say "Headers which contain whitespace are accepted. Headers which have a ":" in them are accepted. Unknown status codes will be accepted.", etc etc. It would be more helpful to draw attention to the fact that request bodies will be parsed by something else (rawDataReceived).
  2. This is a bugfix, not a feature. Multi-line headers are a required part of the HTTP spec. Therefore, the file in topfiles should be .bugfix, not .feature.
  3. The test docstrings should be more descriptive. Phrases like "can successfully do X" or "will not fail when X" should be avoided if possible. For example, instead of "HTTPClient can successfully parse header-less requests", say, "An HTTP request with no headers will cause handleHeaders and handleEndHeaders to be called on HTTPClient subclasses".
  4. As I was writing that previous comment, I realized that the new tests don't make particularly good assertions. They effectively only assert that handleStatus was called, not that handleEndHeaders was called, or even that any headers were received. If ourHandleHeader is never called, then its assertions will never be made. Any test which asserts that a callback should be invoked really ought to assert that the callback was in fact called.
  5. The use of lineReceived rather than dataReceived is a little confusing; especially whe it comes to including the delimiter ('\r\n') in lineReceived's argument, something that should never happen in real life.

Thanks for keeping up the maintenance work!

comment:18 Changed 11 years ago by jesstess

(In [27952]) The description of "unfolding" multiline headers in RFC 822 suggests that the non-CRLF whitespace between lines should be preserved, so change lineReceived to have that behavior and adjust the test cases accordingly.

refs #2062

comment:19 Changed 11 years ago by jesstess

Keywords: review added
Owner: jesstess deleted
Type: enhancementdefect

Points 1-5 from glyph's review are now addressed in the branch. The comment about CRLFs being passed to lineReceived sent me back to the RFCs looking for clarification on how whitespace between the lines of multiline headers is supposed to be handle, as TimAllen had brought this up previously in the ticket. RFC 2616 (HTTP/1.1) references RFC 822 (STANDARD FOR THE FORMAT OF ARPA INTERNET TEXT MESSAGES) regarding the format for multiline headers. Section 3.1.1. of that RFC says:

"The process of moving from this folded multiple-line representation of a header field to its single line representation is called "unfolding. Unfolding is accomplished by regarding CRLF immediately followed by a LWSP-char as equivalent to the LWSP-char."

which implies to me that leading whitespace on the continuing lines of multiline headers should be preserved in what's passed to extractHeader and then handleHeader. That's what's done now in the branch.

comment:20 Changed 11 years ago by Moshe Zadka

Keywords: review removed
Owner: set to jesstess

In http.py:lineReceived:line 464: Stylistically I would like it better if it was:

if not line:
    ...stuff under else...
    return
...stuff under if...

That would allow dedenting the stuff if/elif/else stuff currently there.

Thanks for changing the 1/0 stuff to True/False in the .firstLine stuff.

Overall, patch well done. Please merge.

comment:21 Changed 11 years ago by jesstess

(In [28021]) Rework flow control in lineReceived.

refs #2062

comment:22 Changed 11 years ago by jesstess

Resolution: fixed
Status: newclosed

(In [28022]) Merge multi-line-headers-2062

Authors: kkinder, jesstess Reviewers: z3p, TimAllen, thijs, glyph, moshez Fixes: #2062

twisted.web.http.HTTPClient now supports multi-line headers, which are a required part of the HTTP specification.

comment:23 Changed 10 years ago by <automation>

Owner: jesstess deleted
Note: See TracTickets for help on using tickets.