Ticket #3896 (new defect )

Opened 7 months ago

Last modified 2 months ago

Passing a unicode object to request.write corrupts the entire response

Reported by: radix Assigned to: jknight
Type: defect Priority: normal
Milestone: Component: web
Keywords: Cc: terry@jon.es, msabramo@yahoo.com
Branch: Author:
Launchpad Bug:

Description

I passed a unicode object to request.write, and my HTTP response came out like this:

'H\x00\x00\x00T\x00\x00\x00T\x00\x00\x00P\x00\x00\x00/\x00\x00\x001\x00\x00\x00.\x00\x00\x001\x00\x00\x00'

and so on. (That's the string "HTTP/1.1" encoded in, apparently, UCS-4).

idnar points out that this may be because of cStringIO usage:

<idnar> >>> StringIO(u'abc').getvalue()
<idnar> 'a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'

In my opinion, either an exception should be immediately raised when unicode is passed to request.write, or the data should be encoded with the response's encoding, if one has been set.

Attachments

web-sux.tac (461 bytes) - added by radix 7 months ago.
reproducer
unicodeinrequesthandler.py (416 bytes) - added by msabramo 2 months ago.
twisted.internet.ticket_3896.patch (0.6 kB) - added by msabramo 2 months ago.
Proposed patch (output from "diff -F def -u abstract.py.orig abstract.py")

Change History

  2009-07-01 15:23:26+00:00 changed by radix

  • attachment web-sux.tac added

reproducer

  2009-07-06 00:05:40+00:00 changed by terrycojones

  • cc set to terry@jon.es

Funnily enough, I ran into exactly this issue today. In my case I inadvertently put some unicode into the value passed to request.setHeader before calling request.write with a body. I don't think it's (only) unicode in the response payload/body that causes this problem, as the output shown above is produced by self.transport.writeSequence(l) in request.write as self.startedWriting is 0.

It would be good to get this fixed, as it's non-trivial to figure out what's going on. My client (with very slightly tweaked protocol and factory subclassing those in twisted.web.client) got a failure from its getPage with a twisted.web.error.Error / ConnectionDone? with message 'connection closed cleanly'. The server thought it was done and closed the connection. The client protocol and factory saw very little of the action though as the incoming data was a mess and so lots of methods that you'd expect to be called, like handleHeader, handleStatus, etc were not called, and yet failed was 0 in the HTTPPageGetter instance.

Sorry to be a little vague on the client details - they're not too important seeing as the issue will be fixed in the server. I'm just trying to highlight that the problem takes some tracking down (at least when using twisted.web.client tools to process the response).

  2009-12-19 03:24:20+00:00 changed by msabramo

I just ran into this issue as well and it took me a LONG time to figure out what was going on.

In my case, I switched my program from reading data from MySQL to sqlite3 (via adbapi.ConnectionPool?) and didn't realize that sqlite3 was returning Unicode strings. The symptom I saw was that Firefox would hang forever while trying to load the page.

After hours of looking at the issue, I figured out what was happening. As mentioned in the ticket, writing any Unicode causes the response to get written in some kind of multi-byte Unicode encoding:

$ curl -i http://localhost:8000/ > output.bin
$ hexdump -c output.bin | head -2 | cut -c5-
000   H  \0   T  \0   T  \0   P  \0   /  \0   1  \0   .  \0   1  \0
010      \0   2  \0   0  \0   0  \0      \0   O  \0   K  \0  \r  \0

This of course confuses the heck out of any browser, whether it's Firefox or curl.

Interestingly, this problem only seems to occur when the response uses HTTP/1.1 chunked encoding. If the client requests HTTP/1.0, then attempting to write Unicode causes Twisted to raise TypeError("Data must not be unicode"), which seems to be the expected behavior (according to the Twisted FAQ).

The difference between behavior for non-chunked vs. chunked seems to come down to the twisted.internet.abstract.FileDescriptor class.

For non-chunked data, the write method is called and that checks if isinstance(data, unicode). For chunked data, the writeSequence method is called, which does no such check. Later on, the response gets garbled in twisted.internet.tcp.Connection.writeSomeData, which does:

return self.socket.send(buffer(data, 0, self.SEND_LIMIT))

Now I know that I'm not supposed to write Unicode, but it's nice for the framework to let me know when I'm doing something wrong. It already does that if the response is unchunked, so I propose to have the same behavior for when the response is chunked.

  2009-12-19 03:25:44+00:00 changed by msabramo

  • attachment unicodeinrequesthandler.py added

  2009-12-19 12:19:07+00:00 changed by msabramo

  • attachment twisted.internet.ticket_3896.patch added

Proposed patch (output from "diff -F def -u abstract.py.orig abstract.py")

  2009-12-19 12:21:15+00:00 changed by msabramo

  • cc changed from terry@jon.es to terry@jon.es, msabramo@yahoo.com
Note: See TracTickets for help on using tickets.