Opened 3 years ago

Closed 3 years ago

#8129 enhancement closed fixed (fixed)

twisted.web.http_headers.Headers should work on bytes and Unicode

Reported by: hawkowl Owned by: hawkowl
Priority: normal Milestone: Python-3.x
Component: web Keywords:
Cc: Branch: branches/unicode-headers-8129-2
branch-diff, diff-cov, branch-cov, buildbot
Author: hawkowl


Since the encoding is fairly unambiguous (keys are always iso-8859-1-encodable), we should accept Unicode input. If people want to send binary garbage, they can use the bytes interface still.

Change History (11)

comment:1 Changed 3 years ago by hawkowl

Author: hawkowl
Branch: branches/unicode-headers-8129

(In [46351]) Branching to unicode-headers-8129.

comment:2 Changed 3 years ago by hawkowl

Keywords: review added
Milestone: Python-3.x

I'm happy with this, builders spun, please review.

Coverage report is here:

comment:3 Changed 3 years ago by Adi Roiban

Keywords: review removed
Owner: set to hawkowl

Many thanks for the changes.

Happy to see more Unicode handling in Twisted :)

Can you please rephrase this phrase... or break it in smaller sentences. As a non-native English speaker I find it hard to read.

If bytes that cannot be interpreted using either of these encodings are required, using only L{bytes} when interfacing with this class will not do any encoding or decoding, and pass the raw bytes through.

for _encodeValue doctring, I think that we should have something like this, as it can return the same C{value} is no encoding is required.

@return: C{value} or encoded C{value}

same for _decodeValues

For _encodeName can we always return the encoded value in lower case?

In addRawHeader why do we need to to encode the name and value... I expect that it will be encoded later in setRawHeaders


        if isinstance(name, unicode):
            return self._decodeName(canonicalName)

can we move this into _decodeName so that we don't have to check the type before calling _decodeName ?

For UnicodeHeadersTests can we have some non-ascii values ?

With ascii only values it is hard to check utf-8 or iso-8859-1 as it should look just like the case when no explicit encodign was used... beside the type

It looks like UnicodeHeadersTests ends with 4 emply lines and twisted/web/test/ with no empty line.

Some more linter errors

************* Module twisted.web.http_headers
W9202:162,4:Headers._decodeName: Missing epytext markup @param for argument "name"
W9202:173,4:Headers._encodeName: Missing epytext markup @param for argument "name"
W9202:186,4:Headers._encodeValue: Missing epytext markup @param for argument "value"
W9202:198,4:Headers._encodeValues: Missing epytext markup @param for argument "values"
W9202:213,4:Headers._decodeValues: Missing epytext markup @param for argument "values"

comment:4 Changed 3 years ago by hawkowl

Keywords: review added
Owner: hawkowl deleted

Hi Adi,

I fixed up the TwistedChecker errors, and removed the requirement for the if statement you mentioned there. There was also some non-ASCII tests in the first patch, but I've made them a bit better. I also fixed up addRawHeaders doing encoding when it didn't need to. Doing the lowercasing in encodeName is also a good idea.

Builders spun, please review.

comment:5 Changed 3 years ago by Adi Roiban

Keywords: review removed
Owner: set to hawkowl

What is the reason for creating links to bytes, unicode, bool ... types?

In __init__ do we need to convert before calling setRawHeaders. I was expecting that setRawHeaders will handle the conversion and we don't need the conversion before each call.

Maybe we can update the docstring for getRawHeaders as it no longer needs to raw string.

Do we need _decodeName ?

For UnicodeHeadersTests

for test_initializer I don't understand how the docstring matches to actual test code... it looks like the test is for getRawHeaders and check that if a bytes key is requested, it returns bytes values.. and when unicode key is requested, it returns unicode values.

For test_setRawHeaders can we have a key value in the formatat content-md5 to check capitalization ?

Also, some strings don't have the unicode marker. Is that ok?

Can you please consider writing testing using AAA as this should make them much easier to read.

h = Headers()

h.setRawHeaders("test", rawValue)

rawValue = [u"value1", u"value2"]
self.assertEqual(h.getRawHeaders("test"), rawValue)
rawEncodedValue = [b"value1", b"value2"]
self.assertEqual(h.getRawHeaders(b"test"), rawEncodedValue)

test_nameEncoding looks like 3 tests in 1. setRawHeaders with invalid encodig, hasHeader with invalid encoding and setRawHeaders with valid encoding.

Maybe test_valueEncoding can be renamed to test_setRawHeadersValueEncoding

test_rawHeadersTypeChecking can be renamed to test_setRawHeadersNonBytes

Instead of self.assertIdentical(None, something) can we use self.assestIsNone... we should be over python 2.6

For test_hasHeaderTrue can we use some headers names with capitalization and non-ascii values?

For test_removeHeader can we have a non-ascii header name?

test_otherComparison has 3 empty lines before it ... and test_headersComparison is one long blob... hard to read :(

For test_repr can we have non-ascii names and valus ...also shouln't the values be Unicode encoded ?

Shouldn't we have an empty line at the end of each file ?

Please check my comments and resubmit.


comment:6 Changed 3 years ago by hawkowl

Keywords: review added
Owner: hawkowl deleted

Hi Adi,

The reason for the links are because L{} are for Python FQDN identifiers -- C{} is sometimes used for it, but we should be using L{} for things that can be externally identified.

I've amended the rest. Please review.

comment:7 Changed 3 years ago by Adi Roiban

Keywords: review removed
Owner: set to hawkowl

I think that docstring for test_nameNotEncodable is out of sync. I still think that there are 2 tests there.

I think that test_repr is broken as values should be UTF-8 not latin-1. Same for test_subclassRepr

For repr can we have a test in which both key and values are raw bytes without any direct text encoding?

is it ok to have native strings for repr ? I think that on Python2 some implicit encoding is done for native strings

You have changed the Unicode tests to non-ascii key names but I think that they also need non-ascii value names... so that we can check that keys are serialized to latin-1 and values to utf-8

The values should not only be unicode but also be decoded with the right encoding.

for ascii values utf-8 and latin-1 will look exactly the same.

Can we have test_copy with non-ascii key and values to check that copy is ok and encoding is coreclty handled?

Please check my comment and resubmit.


comment:8 Changed 3 years ago by hawkowl

Branch: branches/unicode-headers-8129branches/unicode-headers-8129-2

(In [46439]) Branching to unicode-headers-8129-2.

comment:9 Changed 3 years ago by hawkowl

Keywords: review added
Owner: hawkowl deleted

Hi Adi,

  • Fixed. I've also added a raw bytes one.
  • It's fine to do, as the repr reprs the bytestrings that makes up the values, meaning it will show the (escaped) raw bytes, and not attempt to interpret them.
  • There's a test with this -- it tests giving Unicode (u2603, SNOWMAN), and then testing that it matches the known byte values.
  • Altered the copying tests to check encoding, and have non-ascii keys and values.

Builders spun, please review.

comment:10 Changed 3 years ago by Adi Roiban

Keywords: review removed
Owner: set to hawkowl

Great work. Many, many thanks!

Happy to see more Unicode handling in Twisted :)

Please merge! Thanks again!

comment:11 Changed 3 years ago by hawkowl

Resolution: fixed
Status: newclosed

(In [46526]) Merge unicode-headers-8129-2: Make twisted.web.http_headers.Headers support both bytestrings and Unicode

Author: hawkowl Reviewer: adiroiban Fixes: #8129

Note: See TracTickets for help on using tickets.