Opened 7 years ago

Last modified 3 years ago

#2842 defect new

[PATCH] Handle bad formatted headers in lineRecieved() in twisted/web/http.py

Reported by: alexis Owned by:
Priority: normal Milestone:
Component: web Keywords: http, key, val, line, split, lineRecieved, httpclient
Cc: thijs, exarkun, jesstess, jasonjwwilliams Branch:
Author: Launchpad Bug:

Description

Some web servers send us bad formatted headers. In these cases, the instruction [ key, val = line.split(':', 1) ] just don't work because there is no ":" in line. So there is a little patch that makes some tests before splitting. I don't know if this is useful for twisted but there is a lot of servers that send headers like that...

Attachments (1)

http.py.diff (682 bytes) - added by alexis 7 years ago.

Download all attachments as: .zip

Change History (20)

Changed 7 years ago by alexis

comment:1 follow-up: Changed 7 years ago by exarkun

Which web servers do this? Arguably, they are not HTTP servers.

In any case, can you provide unit tests for this change? You may want to read <http://twistedmatrix.com/trac/wiki/TwistedDevelopment> and the pages it links to, particularly <http://twistedmatrix.com/trac/wiki/ReviewProcess> and <http://twistedmatrix.com/trac/browser/trunk/doc/development/policy/coding-standard.xhtml?format=raw>.

comment:2 in reply to: ↑ 1 Changed 7 years ago by alexis

I have no url in mind now. I'll give unit tests and example urls tomorrow.
Thank you for your advices.

comment:3 Changed 7 years ago by alexis

Here are some urls that do not send correct headers:

these 2 send HTML code like if it was headers:
line="<HTML>
<TITLE> Not Found </TITLE>
<BODY BGCOLOR="#FFFFFF">
<FONT FACE="Arial,Helvetica,Geneva" SIZE=-1>
<H2> Not Found </H2>
</BODY>
</HTML>"

I think there must be loads of servers like these...
The trunk version of twisted will throw an exception on the line 416: key, val = line.split(':', 1)
The patched version will just set key and val to ""

Here is my unit test, you can add it in twisted/web/test/test_http.py:

class HTTPClientLineRecievedTestCase(unittest.TestCase):
	def _testLineReceived(self, line, result_key, result_val):
		c=http.HTTPClient()
		c.lineReceived("HTTP/1.x 200 OK")
		c.factory=object()
		def myTest(key, val):
			self.assertEquals(key, result_key)
			self.assertEquals(val, result_val)
		c.handleHeader=myTest
		c.lineReceived(line)
	
	def testBadLineReceived(self):
		self._testLineReceived("This is not a correct header", "", "")
	
	def testGoodLineReceived(self):
		self._testLineReceived("Content-Type: text/html; charset=UTF-8", "Content-Type", "text/html; charset=UTF-8")

comment:4 Changed 7 years ago by alexis

  • Type changed from enhancement to defect

comment:5 Changed 6 years ago by thijs

  • Cc thijs added
  • Keywords review added
  • Owner jknight deleted

Adding review keyword for unit tests.

comment:6 Changed 6 years ago by exarkun

  • Cc exarkun added
  • Keywords review removed
  • Owner set to alexis

Of the HTTP servers listed above, I don't see a problem with any of their responses. One of the addresses no longer resolves and one of them isn't for an HTTP server, it's for an NNTP server. The rest respond with valid HTTP which can be parsed without the attached patch.

So it's still unclear to me what this change is fixing.

The proposed test also has some problems:

  1. lots of coding standard violations
  2. only performs assertions indirectly, HTTPClient could easily swallow the failures from those assertions
  3. there's no coverage of the slice parameters in the line[1:-1] part of the implementation, and I'm not even sure why that slice is there.

I'm somewhat skeptical that this change is useful. If there is some server software somewhere out there with a particular known bad behavior, it may make sense to work around that. We can't work around all broken HTTP servers though, and we can't even work around the brokenness of one unless we know what it is.

comment:7 Changed 6 years ago by alexis

The example domains that I've posted here 1 year ago may have changed their content.

One of these urls is still raising the same error in twisted: http://bubunet.net/ which sends an "\n" as a line for one of it's headers. The server's headers say that it's a Microsoft-IIS/6.0. But other IISs do not raise that error.

I know that it's impossible to work around all broken servers but in my patch the main idea was just to ignore the recieved line if it doesn't contain a ":" instead of raising a ValueError exception on the line "key, val = line.split(':', 1)".

A simplier solution would be to replace the test (line 412) "if line:" by "if line and ':' in line:"

comment:8 Changed 6 years ago by exarkun

  • Keywords httpclient added

Okay, rechecking server behavior (as part of testing the new HTTP client), I see that the listed servers all have one misbehavior in common. They freely mix \n and \r\n terminated lines in the header section. Changing the protocol to accept either lets it receive these broken responses.

comment:9 Changed 6 years ago by jknight

And it does appear that browsers' http response parsers don't mind the broken line termination.

comment:10 Changed 4 years ago by jesstess

  • Cc jesstess added

I'm not sure what is supposed to be fixed now. Is the goal supporting both \n and \r\n, supporting a lack of :, to do nothing and close the ticket, or something else? I'll work on it if someone tells me.

comment:11 Changed 4 years ago by exarkun

Just support \n and \r\n, I guess.

comment:12 Changed 4 years ago by jesstess

  • Owner changed from alexis to jesstess

comment:13 Changed 4 years ago by ivank

Related bug for the new HTTP client: #3833

comment:14 Changed 4 years ago by jknight

  • Cc jasonjwwilliams added

comment:15 Changed 3 years ago by binjured

  • Keywords review added
  • Owner jesstess deleted

Opening for review of patch mentioned in previous comment.

comment:16 Changed 3 years ago by jasonjwwilliams

Porting the problem description that the above patch addresses from #4814:

t.w.h.HTTPClient just drops content if the sending server uses LF to terminate headers and lines instead of CRLF. It's broken behavior to issue LF only, but notable sites like http://news.ycombinator.com do it. The problem is exacerbated when you're using ProxyClient and some sites come up blank. Most proxies handle this busted behavior, and all browsers appear to be tolerant of it.

The actual issue is in the underlying LineReceiver, but since it's mostly an HTTP issue it seems like HTTPClient is the place to deal with it.

The attached patch gives HTTPClient it's own dataReceived method that splits on LF if it can't find CRLFs in the response. Patch includes tests for this behavior.

comment:17 Changed 3 years ago by jonathanj

  • Keywords review removed

There are a number of issues with this patch. The patch on #886 (as referenced on #3833) seems like it could easily be ported to twisted.web.http, resulting in a less messy fix.

comment:18 Changed 3 years ago by jonathanj

To expand on my previous comment about the patch issues:

  • Half of LineReceiver (dataReceived) is copied into http.Request, with seemingly no explanation.
  • dataReceived is incorrectly indented, yet the tests pass.
  • DummyHTTPBrokenSeparatorHandler seems overly complicated (when compared with DummyHTTPHandler).
  • There is another test_bufferBrokenSeparator method with no code or docstring in it.

Compare these points with the patch in #886.

comment:19 Changed 3 years ago by <automation>

Note: See TracTickets for help on using tickets.