Opened 4 years ago

Last modified 11 months ago

#7678 defect assigned

stalls on serving some files over http with gzip compression

Reported by: Elmir Jagudin Owned by: Arthur Kantor
Priority: normal Milestone:
Component: web Keywords:
Cc: jknight, Adi Roiban Branch:
Author:

Description

The Twisted HTTP server can stall when serving files with on-the-fly gzip compression.

For example serving a file with a single letter repeated 131072 times (StaticProducer.bufferSize * 2) will trigger this bug.

See the attached example.

Attachments (3)

gzipbug.py (825 bytes) - added by Elmir Jagudin 4 years ago.
An example of a twisted http server which triggers the gzip stall bug.
issue7678.patch (769 bytes) - added by David Baker 3 years ago.
Patch for issue 7678
issue7678_2.patch (555 bytes) - added by David Baker 3 years ago.
Patch, mk2


Change History (17)

comment:1 Changed 4 years ago by DefaultCC Plugin

Cc: jknight added

Changed 4 years ago by Elmir Jagudin

Attachment: gzipbug.py added

An example of a twisted http server which triggers the gzip stall bug.

comment:2 Changed 4 years ago by Elmir Jagudin

I have spent some time investigating this issue, and may as well share my findings.

The _GzipEncoder object uses a zlib.Compress object to compress data. _GzipEncoder reads the file in chunks of 65536 bytes and compresses them with the zlib.Compress.compress() method. compress() returns the compressed data, which is written to the client socket.

The problem arises when compress() does not return any new compressed data, which happens, for example, while compressing highly redundant data. When there is no compressed data, Twisted skips writing to the client socket. It seems that it then fails to initiate reading of the next chunk of input data, so it stops sending the file to the client.
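This behaviour can be demonstrated with zlib alone. The snippet below is not from the ticket; it simply feeds two 65536-byte chunks of identical bytes to a compressor, mirroring the chunked reads described above, and shows that zlib can buffer such input internally and return little or no output until flush():

```python
import zlib

compressor = zlib.compressobj()

# Two 65536-byte chunks of identical bytes, mirroring the
# 65536-byte reads described above.
chunk = b"a" * 65536

first = compressor.compress(chunk)
second = compressor.compress(chunk)

# Highly redundant input fits in zlib's internal buffers, so these
# calls typically return few or no compressed bytes at all.
print(len(first), len(second))

# Only flushing forces the buffered data out.
tail = compressor.flush()
print(zlib.decompress(first + second + tail) == chunk * 2)
```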

Changed 3 years ago by David Baker

Attachment: issue7678.patch added

Patch for issue 7678

comment:3 Changed 3 years ago by David Baker

Patch included.

I agree with Elmir's investigation, and as far as I can see this is a bigger problem than just the gzip encoder. When a FileDescriptor has written all the data in its buffer, it removes itself from the reactor. It then calls resumeProducing() once which normally will write some more data to the FileDescriptor, causing it to be added back in to the reactor. If no data is written as a result of the resumeProducing() call, the file descriptor gets stuck since it never gets added back in to the reactor. Encoders such as the gzip encoder are one such way this might happen.

I think the correct behaviour for a FileDescriptor here would be to repeatedly call resumeProducing() until either it has data or the producer is removed. I have attached a patch that does this. This fixes the problem in question although I'm no expert in the Twisted reactor event loop so I'd be grateful for feedback.
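The attached patch is the authoritative version; purely as an illustration of the looping idea, the behaviour described above might look roughly like the sketch below. All names here are invented for the example, not Twisted's internals:

```python
class LoopingConsumer:
    """Toy consumer illustrating the proposed FileDescriptor change."""

    def __init__(self):
        self.buffer = b""
        self.producer = None

    def registerProducer(self, producer):
        self.producer = producer

    def unregisterProducer(self):
        self.producer = None

    def write(self, data):
        self.buffer += data

    def bufferEmptied(self):
        # The proposed change: instead of exactly one resumeProducing()
        # call, loop until data arrives or the producer goes away.
        while self.producer is not None and not self.buffer:
            self.producer.resumeProducing()


class SilentThenChattyProducer:
    """Pull producer that writes nothing for the first few pulls,
    mimicking an encoder that swallows input before emitting output."""

    def __init__(self, consumer, silentPulls):
        self.consumer = consumer
        self.silentPulls = silentPulls

    def resumeProducing(self):
        if self.silentPulls > 0:
            self.silentPulls -= 1  # produce nothing this time
        else:
            self.consumer.write(b"payload")


consumer = LoopingConsumer()
consumer.registerProducer(SilentThenChattyProducer(consumer, silentPulls=3))
consumer.bufferEmptied()
print(consumer.buffer)
```

With a single resumeProducing() call the buffer would stay empty after three silent pulls; the loop survives them.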

Last edited 3 years ago by David Baker (previous) (diff)

comment:4 Changed 3 years ago by Jean-Paul Calderone

Keywords: review added; gzip removed

comment:5 Changed 3 years ago by Jean-Paul Calderone

Keywords: review removed
Owner: set to David Baker

Thanks.

If no data is written as a result of the resumeProducing() call, the file descriptor gets stuck since it never gets added back in to the reactor. Encoders such as the gzip encoder are one such way this might happen.

This is the bug. The encoder is mis-behaving. The reactor's behavior is well-specified. If you are a "pull producer" and someone calls "resumeProducing" on you, it is now your responsibility (and yours alone) to write some data to your consumer. If you don't do it, it's not the reactor's job to prod you again.

The encoder support in twisted.web.server should reflect this. If it puts some data into an encoder and no data comes out, perhaps it should try putting some more in (and continue doing so until some data comes out, I suppose).

comment:6 Changed 3 years ago by David Baker

Is this possible though? write() on the request doesn't appear to return the number of bytes actually written (it returns None). The producer will know if it wrote any bytes to the layer below it, but I can't see any way for it to know whether any of those bytes actually made it through the encoder to the FileDescriptor, and therefore whether it needs to try writing more.

If this is the model then I think there needs to be some feedback from the encoder on how many bytes were written, or possibly a way for the producer to query the FileDescriptor (presumably via the Request) for the state of the pull producer. Personally I'd argue that a pull producer should continue to be asked for data for as long as it's registered, but I appreciate the design choice.

What do you think would be the best way to feed back necessary information to the producer, or is there already such a mechanism?

comment:7 Changed 3 years ago by Jean-Paul Calderone

write() on the request doesn't appear to return the amount of bytes actually written (it returns None).

Indeed. Because the only guarantee it makes is that it will buffer the data you hand it and probably eventually send it over some kind of transport, or at least try to.

The producer will know if it wrote any bytes to the layer below it

Yes, exactly correct. And if it is a pull producer, then it knows it has fulfilled its part of the IProducer/IConsumer contract and properly responded to the resumeProducing call that was made on it. Its job is done at this point until resumeProducing is called on it again.

That's what makes the patch you supplied a tempting solution. However, the solution doesn't generalize. There's no requirement that a producer write data *instantly* upon a resumeProducing call. Perhaps it's a producer that makes an HTTP request for the next chunk of data and can't write it until some more network operations have completed. A while loop that keeps asking it for more data won't help - the reactor will end up spinning forever asking it for data it can't provide (and since the reactor has stopped making progress at this point, the data really will never be available).

If this is the model then I think there needs to be some feedback from the encoder on how many bytes were written

I think this is the right track. Specifically, the implementation of Request.write (and, incidentally, at least one other place, just because Request is implemented such that it might need to write response body bytes from multiple places in its implementation, oh well - maybe this can be refactored to remove the duplication) gets some bytes handed to it - perhaps by a pull producer (let's just consider that case for now). It passes the bytes through the encoder to get some new, different bytes. Then it passes the new, different bytes on (ultimately to the transport).

I'll suggest that this is exactly the place where some extra logic is required. This code should notice if the new, different bytes turn out to be no bytes at all (b"") and (if a pull producer is registered), generate a new resumeProducing call to it. It only needs to do this once because, as stated at the top, after resumeProducing is called it is the producer's responsibility to perform another write call. If that future write also gets encoded to b"" the same logic will activate again to repeat the process - and there's the "loop" that you recognized was required (and included in your patch).

The key here is that when Request.write does something that changes the flow of bytes between the producer and the ultimate consumer (the transport) - ie, *completely removes* part of that flow - then it is responsible for maintaining the producer/consumer contract.

Does that make sense?
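To make the suggestion concrete, here is a toy sketch in which EncodingRequest, BufferingEncoder, and the other names are all invented for illustration, not Twisted's API. The write method re-pulls a registered pull producer whenever the encoder swallows a write entirely:

```python
class EncodingRequest:
    """Illustrative stand-in for the encoder-aware part of Request."""

    def __init__(self, transport, encoder):
        self.transport = transport
        self.encoder = encoder           # has .encode(bytes) -> bytes
        self.producer = None
        self.streamingProducer = False

    def registerProducer(self, producer, streaming):
        self.producer = producer
        self.streamingProducer = streaming

    def write(self, data):
        encoded = self.encoder.encode(data)
        if encoded:
            self.transport.write(encoded)
        elif self.producer is not None and not self.streamingProducer:
            # The encoder swallowed this write entirely; the consumer
            # side of the contract must pull the producer again.  One
            # call suffices: if that write is also encoded to b"",
            # this same branch runs again, forming the "loop".
            self.producer.resumeProducing()


class BufferingEncoder:
    """Hypothetical encoder that buffers input until it has 12 bytes."""

    def __init__(self):
        self.pending = b""

    def encode(self, data):
        self.pending += data
        if len(self.pending) >= 12:
            out, self.pending = self.pending, b""
            return out
        return b""


class ListTransport:
    def __init__(self):
        self.written = b""

    def write(self, data):
        self.written += data


class ChunkProducer:
    """Pull producer handing out fixed chunks on each pull."""

    def __init__(self, request, chunks):
        self.request = request
        self.chunks = list(chunks)

    def resumeProducing(self):
        if self.chunks:
            self.request.write(self.chunks.pop(0))


transport = ListTransport()
request = EncodingRequest(transport, BufferingEncoder())
producer = ChunkProducer(request, [b"aaaa", b"bbbb", b"cccc"])
request.registerProducer(producer, streaming=False)
producer.resumeProducing()  # one initial pull; the rest chain automatically
print(transport.written)
```

The first two writes are encoded to b"", so each one triggers another pull; the third write finally flushes all twelve bytes to the transport.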

comment:8 in reply to:  7 Changed 3 years ago by David Baker

That's what makes the patch you supplied a tempting solution. However, the solution doesn't generalize. There's no requirement that a producer write data *instantly* upon a resumeProducing call. Perhaps it's a producer that makes an HTTP request for the next chunk of data and can't write it until some more network operations have completed. A while loop that keeps asking it for more data won't help - the reactor will end up spinning forever asking it for data it can't provide (and since the reactor has stopped making progress at this point, the data really will never be available).

Okay - from looking at the design it looked like this would be a push producer.

Yes, it does make sense that the Request's encoder is breaking the contract here. I've made a new patch to web/server.py that calls resumeProducing if no data is produced, as you suggested. I haven't been able to find the other places where response body bytes are written (apart from finish(), but that's okay because it calls finish() on the encoder too). This patch fixes the problem in my test setup.

Changed 3 years ago by David Baker

Attachment: issue7678_2.patch added

Patch, mk2

comment:9 Changed 3 years ago by Jean-Paul Calderone

Keywords: review added

Thanks. You may want to take a look over at ReviewProcess to get familiar with the issue tracker work flow.

comment:10 Changed 3 years ago by David Baker

Owner: David Baker deleted

comment:11 Changed 3 years ago by Adi Roiban

Cc: Adi Roiban added
Keywords: review removed
Owner: set to David Baker

Many thanks for your contribution matrix_dave!

In order for the patch to be accepted, it needs the following:

  1. A test which can be run on trunk to trigger the error, and on your patched code to show that the error no longer occurs. Your gzipbug.py file is the first step.

In the end, the test needs to be an automated test so that we can check that it works on all supported operating systems and environments, and also make sure that any future change/upgrade will not reintroduce this error.

I can help with the steps required to make it an automated test. Please let me know if you need help.

  2. The changes should have a topfile (news file) describing the change. The content of that file will be automatically included in the release notes so that other users can discover your change/fix.

If you have any questions regarding the patch/review process please let me know.

Other Twisted developers and I are also available on IRC.

Thanks again!

comment:12 Changed 11 months ago by Arthur Kantor

Owner: changed from David Baker to Arthur Kantor
Status: new → assigned

I'll take a crack at adding the test and the topfile.

comment:13 Changed 11 months ago by Arthur Kantor

This is actually really hard to reproduce as a trial test. When running under trial, there is some sort of cooperator task which keeps pumping the File producer.

So the following runs just fine when added to twisted.web.test.test_web.GzipEncoderTests:

    @inlineCallbacks
    def test_veryCompressibleData(self):
        from twisted.web.client import Agent, readBody
        from twisted.web.http_headers import Headers
        from twisted.web.resource import Resource
        with open('veryCompressibleData.txt', 'wb') as f:
            f.write(b"A" * 200000)
        r = File('veryCompressibleData.txt')
        wrapped = resource.EncodingResourceWrapper(
            r, [server.GzipEncoderFactory()])
        site = server.Site(Resource())
        site.resource.putChild(b"bar", wrapped)
        port = reactor.listenTCP(54879, site)

        agent = Agent(reactor)

        try:
            resp = yield agent.request(
                b'GET',
                b'http://localhost:54879/bar',
                Headers({b'User-Agent': [b'Twisted Web Client Example'],
                         b'Accept-Encoding': [b'gzip']}),
                None)
            body = yield readBody(resp)
        except Exception as e:
            print(e)

        self.assertEqual(b"A" * 200000,
                         zlib.decompress(body, 16 + zlib.MAX_WBITS))
        port.stopListening()

However, the stall does happen if you take the server out into its own script and run it as a separate process.

from twisted.web import server, resource
from twisted.web.static import File
from twisted.internet import reactor
from twisted.web.resource import Resource

with open('veryCompressibleData.txt', 'wb') as f:
    f.write(b"A" * 20000000)
r = File('veryCompressibleData.txt')
wrapped = resource.EncodingResourceWrapper(
    r, [server.GzipEncoderFactory()])
site = server.Site(Resource())
site.resource.putChild(b"bar", wrapped)
port = reactor.listenTCP(54879, site)
reactor.run()

with the test changed to

    @inlineCallbacks
    def test_veryCompressibleData(self):
        from twisted.web.client import Agent, readBody
        from twisted.web.http_headers import Headers
        from twisted.web.resource import Resource
#         with open('veryCompressibleData.txt', 'wb') as f:
#             f.write(b"A" * 200000)
#         r = File('veryCompressibleData.txt')
#         wrapped = resource.EncodingResourceWrapper(
#             r, [server.GzipEncoderFactory()])
#         site = server.Site(Resource())
#         site.resource.putChild(b"bar", wrapped)
#         port = reactor.listenTCP(54879, site)
        #self.channel.site.resource.putChild(b"bar", wrapped)

        agent = Agent(reactor)

        try:
            resp = yield agent.request(
                b'GET',
                b'http://localhost:54879/bar',
                Headers({b'User-Agent': [b'Twisted Web Client Example'],
                         b'Accept-Encoding': [b'gzip']}),
                None)
            body = yield readBody(resp)
        except Exception as e:
            print(e)

        self.assertEqual(b"A" * 200000,
                         zlib.decompress(body, 16 + zlib.MAX_WBITS))
#         port.stopListening()

Is spawning a process with just the server inside the test acceptable? Is there an elegant way of writing this test?

comment:14 Changed 11 months ago by Jean-Paul Calderone

Any test that does real network operations is going to be slower than a unit test that exercises the same application code. Most tests that do real network operations are also going to be less reliable (i.e., they will fail because of problems that have nothing to do with the application code under test) than a unit test that exercises the same application code.

This all goes twice as much for tests that do real network operations *and* launch subprocesses.

So, in a perfect world, the test for this bug would involve neither real network operations nor subprocesses.

However, having a minimal program that does real network operations and demonstrates the bug is a very useful tool for understanding *exactly* where the problem is. Once you understand that, you've taken a big step towards knowing how to write a unit test that doesn't need to perform real network I/O.

(Note that it may seem like a conclusion of this line of reasoning is that unit tests should always be preferred to integration tests. This is not actually my intent. The point here is just that for this particular bug it is almost certainly the case that a unit test will be sufficient and preferable.)
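In that spirit, a network-free regression test could be sketched as below. Every class here (RecordingTransport, GzipWriteLoop, ChunkedFileProducer) is invented for illustration, standing in for the real transport, Request.write plus the proposed fix, and the static-file producer; only the gzip framing and the 131072-byte payload come from the ticket:

```python
import unittest
import zlib


class RecordingTransport:
    """Fake transport: records bytes and, like the reactor, pulls the
    registered producer once each time its buffer 'drains'."""

    def __init__(self):
        self.data = b""
        self.producer = None

    def write(self, data):
        self.data += data
        if self.producer is not None:
            self.producer.resumeProducing()


class GzipWriteLoop:
    """Hypothetical stand-in for Request.write plus the proposed fix:
    when compression yields b"", pull the producer again."""

    def __init__(self, transport):
        self.transport = transport
        self.compressor = zlib.compressobj(
            9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)  # gzip framing
        self.producer = None

    def write(self, data):
        encoded = self.compressor.compress(data)
        if encoded:
            self.transport.write(encoded)
        elif self.producer is not None:
            # The fix under review: the write was swallowed, so the
            # consumer side must pull the producer again.
            self.producer.resumeProducing()

    def finish(self):
        self.transport.write(self.compressor.flush())


class ChunkedFileProducer:
    """Pull producer feeding 64 KiB chunks of a compressible 'file'."""

    def __init__(self, consumer, payload, chunkSize=65536):
        self.consumer = consumer
        self.payload = payload
        self.offset = 0
        self.chunkSize = chunkSize
        self.finished = False

    def resumeProducing(self):
        if self.finished:
            return
        if self.offset < len(self.payload):
            chunk = self.payload[self.offset:self.offset + self.chunkSize]
            self.offset += self.chunkSize
            self.consumer.write(chunk)
        else:
            self.finished = True
            self.consumer.finish()


class StallRegressionTest(unittest.TestCase):
    def test_compressibleFileIsFullyDelivered(self):
        payload = b"A" * 131072  # StaticProducer.bufferSize * 2
        transport = RecordingTransport()
        request = GzipWriteLoop(transport)
        producer = ChunkedFileProducer(request, payload)
        request.producer = producer
        transport.producer = producer
        producer.resumeProducing()  # the single initial pull
        decompressed = zlib.decompress(transport.data, 16 + zlib.MAX_WBITS)
        self.assertEqual(decompressed, payload)
```

Without the re-pull branch in GzipWriteLoop.write, the chain of pulls dies as soon as a write is compressed to b"" and the assertion fails with a truncated stream, which is exactly the stall in miniature.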
