[Twisted-Python] Streaming HTTP

Glyph Lefkowitz glyph at twistedmatrix.com
Wed Nov 18 05:18:13 MST 2015


> On Nov 13, 2015, at 4:36 AM, Cory Benfield <cory at lukasa.co.uk> wrote:
> 
> Folks,
> 
> # Problem Statement
> 
> Thanks for your feedback on my HTTP/2 questions. I’ve started work implementing a spike of a HTTP/2 protocol for twisted.web. I’m aiming to have something that works in at least some cases by the end of the day.

Hooray!

> As part of my dive into twisted.web, I noticed something that surprised me: it seems to have no support for ‘streaming’ request bodies. By this I mean that the Request.requestReceived() method is not actually called until the complete request body has been received. This is a somewhat unexpected limitation for Twisted: why should I have to wait until the entire body has been uploaded to start doing things with it?

This is exactly what <https://twistedmatrix.com/trac/ticket/288> is about (as you note).

> This problem is thrown into sharp relief with HTTP/2, which essentially always chunks the body, even if a content-length is provided. This means that it is now very easy to receive data in delimited chunks, which an implementation may want to have semantic meaning. However, the request is unable to access this data in this way. It also makes it impossible to use a HTTP/2 request/response pair as a long-running communication channel, as we cannot safely call requestReceived until the response is terminated (which also terminates the HTTP/2 stream).
> 
> Adi pointed me at a related issue, #6928[0], which itself points at what appears to be an issue tracking exactly this request. That issue is issue #288[1], which is 12 years old(!). This has clearly been a pain point for quite some time.
> 
> Issue #6928 has glyph suggesting that we come to the mailing list to discuss this, but the last time it was raised no responses were received[2]. I believe that with HTTP/2 on the horizon, this issue is more acute than it was before, and needs solving if Twisted is going to continue to remain relevant for the web. It should also allow people to build more performant web applications, as they should be able to handle how the data queues up in their apps.
> 
> This does not immediately block my HTTP/2 work, so we can take some time and get this right.

I'm very glad to hear you say this, because:

(A) we should absolutely make progress on this ticket now that there is some impetus to do so, but,
(B) we should absolutely NOT let this new API block any purely protocol-level HTTP2 work that needs to proceed (I say "protocol level" as a contrast to "feature level" because we could totally implement a thing that speaks HTTP 2 against the current API, although it might be awkward to expose the advantageous parts of the protocol as API features until we do some cleanup)

> # Proposed Solution
> 
> To help us move forward, I’m providing a proposal for how I’d solve this problem. This is not necessarily going to be the final approach, but is instead a straw-man we can use to form the basis of a discussion about what the correct fix should be.
> 
> My proposal is to deprecate the current Request/Resource model. It currently functions and should continue to function, but as of this point we should consider it a bad way to do things, and we should push people to move to a fully asynchronous model.

We have considered it a bad way to do things for a long time.  There have been several attempts to rewrite it (Nevow's request model, web2) but none of them have really been the comprehensive re-design we need.

> We should then move to an API that is much more like the one used by Go: specifically, that by default all requests/responses are streamed. Request objects (and, logically, any other object that handles requests/responses, such as Resource) should be extended to have a chunkReceived method that can be overridden by users.

No.

Let me elaborate :-).

First of all, this is already the case (sort of).

twisted.web.server.Request inherits from twisted.web.http.Request; as you can see in http.Request, there is already a method called handleContentChunk, which is called by HTTPChannel. By overriding this method, you can already handle request data streaming in from off the wire.  This is one of the reasons that #288 is so interminable: using Twisted's public API, today, you can totally write an HTTP-based application that happily streams data from the wire.  The only problem is that this API does not propagate up to Resource objects, because Resource objects can expect the (still, as of this writing, undocumented) "content" attribute to have been filled out in getChild.  Mostly, they don't, actually!  But it's impossible to tell if they might in the general case.

You can (and many applications have) just broken the technical compatibility contract with Resource, and written a subclass of twisted.web.server.Site that has a custom requestFactory method that returns a 'streamed' resource.

So, if we're already doing this, why "no"?

Superclasses with overridable methods are a terrible mechanism for exposing extensibility.  These are used extensively throughout Twisted, the older the API the more inheritance it uses.  Newer code, you may notice, is generally written much more in a pattern of delegation to formal interfaces.  So we have tried to learn our lesson here.

> If a user chooses not to override that method, the default implementation would continue to do what is done now (save to a buffer). Once the request/response is complete (marked by receipt of a zero-length chunk, or a frame with END_STREAM set, or when the remaining content-length is 0),

For what it's worth, I hope that all of these will be exposed as the same event to applications, since the fact that these differ on the wire is entirely an implementation detail?

> request/responseComplete would be called. For users that did not override chunkReceived can now safely access the content buffer: other users can do whatever they see fit. We’d also update requestReceived to ensure that it’s called when all the *headers* are received, rather than waiting for the body.

Again, this is very similar to what already happens, at the layer of the HTTP protocol.  The question is, how do you indicate that you're delegating to a Resource object which may expect the .content attribute to already be populated during .getChild?

> A similar approach should be taken with sending data: we should assume that users want to chunk it if they do not provide a content-length. An extreme position to take (and I do) is that this should be sufficiently easy that most users actually *accidentally* end up chunking their data: that is, we do not provide special helpers to set content-length, instead just checking whether that’s a header users actually send, and if they don’t we chunk the data.

request.write() already basically does this, I think?  Here, at least, we have lots of opportunity to make the implementation do smarter things (better error checking regarding content-length, for example) without changing the interface at all.

> This logic would make it much easier to work with HTTP/2 *and* with WebSockets, requiring substantially less special-case code to handle the WebSocket upgrade (when the headers are complete, we can spot the upgrade easily).
> 
> What do people think of this approach?

So I think you're roughly on the right track but there are probably some Twisted-level gaps to fill in.

I've already gestured in the direction of Tubes (as have others) and it's something to think about.  But before we get to that, let's talk about a much more basic deficiency in the API: although there's an "IRequest", and an "IResource", there's no such thing as an "IResponse".  Instead, "IRequest" stands in for both the request and the response, because you write directly to a request (implicitly filling out its response as you do so).

Luckily we have an existing interface that might point the way to a better solution, both for requests and responses: specifically, the client IResponse: https://twistedmatrix.com/documents/15.4.0/api/twisted.web.iweb.IResponse.html.

This interface is actually pretty close to what we want for a server IResponse as well.  Perhaps even identical.  Its static data is all exposed as attributes which can be relatively simply inspected, and the way it delivers a streaming response is that it delivers its body to an IProtocol implementation (via .deliverBody(aProtocol)).  This is not quite as graceful as having a .bodyFount() method that returns an IFount from the tubes package; however, the tubes package is still not exactly mature software, so we may not want to block on depending on it.  Importantly though, this delivers all the events you need as a primitive for interfacing with such a high-level interface; it would definitely be better to add this sort of interface Real Soon Now, because then the tubes package could simply have a method, responseToFount (which it will need anyway to work with Agent) that calls deliverBody internally.

This works as a primitive because you have all the hooks you need for flow-control.  This protocol receives, to its 'makeConnection' method, an ITransport which can provide the IProducer https://twistedmatrix.com/documents/15.4.0/api/twisted.internet.interfaces.IProducer.html and IConsumer https://twistedmatrix.com/documents/15.4.0/api/twisted.internet.interfaces.IConsumer.html interfaces for flow-control.  It receives dataReceived to tell it a chunk has arrived and connectionLost to tell it the stream has terminated.

Unfortunately the client IRequest https://twistedmatrix.com/documents/15.4.0/api/twisted.web.iweb.IClientRequest.html isn't quite as useful (although its relative minimalism should be an inspiration to anyone designing a next-generation IRequest more than the current IRequest's sprawling kitchen-sink aesthetic).  However, IResponse.deliverBody could be applied to IGoodRequest as well.  If we have a very similar-to-IResponse shaped IRequest object, say with 'method', 'uri' and 'headers', and then a 'deliverBody' that delivers the request body in much the same way, we could get a gracefully structured streaming request with works with a lot of existing code within Twisted.

Then the question is: what to do with IResource?

Right now the flow of processing a request is, roughly:

-> wait for full request to arrive
  -> have HTTPChannel fill out IRequest object
-> look at request.site.resource for the root
 *-> call getChildWithDefault repeatedly, mutating "cursor" state on the IRequest as you move (specifically: "prepath" and "postpath" attributes)
  -> eventually reach the leaf Resource, or one with 'isLeaf' set on it, and delegate producing the response to that resource
*-> call resource.render(request)
-> examine the return value; if it's bytes, deliver them and close the connection; NOT_DONE_YET, just leave the connection open, 

Instead, I think a good flow would be:

-> receive method/headers from request
-> recurse down from request.site.resource, calling something like nevow's or web2's locateChild, but not modifying 'request' at each stage; instead, pass a "cursor" object - perhaps indeed just a twisted.python.url.URL - indicating where we are in the resource traversal hierarchy.  the reason the request.prepath and request.postpath attributes exist is mainly for Resource objects to be able to orient themselves within a resource tree and generate links.

also, it probably bears some explanation; the signature of the current "get the next resource" call is resource.getChildWithDefault(onePathSegment, request) -> resource. This is somewhat limiting as it requires you to consume only an individual path segment at a time, which can be highly awkward for implementing sites that have a URL structure that is, for example, /YYYY/MM/DD/HH/MM/index.html.  

Instead, locateChild took the entire remaining path, and returned a 2-tuple of a resource, and the _still_ remaining path.  So for the above, you could do:

def locateChild(self, request, path):
    y, m, d = path[:3]
    return ymdresource(y, m, d), path[3:]

that 2-tuple instructs the traversal machinery, "keep going".  One alternative that we toyed with for this was to make consuming the path destructive, since that made it a lot easier to tell what resource you were "looking at":

def locateChild(self, request, path):
    y, m, d = path.consume(3)
    return ymdresource(y, m, d)

Either of these approaches also let you implement 'isLeaf' attribute without special support from the framework; you simply return leaf(), () or path.consume(path.length)

-> finally, call .responseForRequest(request) -> IResponse on the final Resource and deliver the IResponse to the network.

The way compatibility could be achieved here is to write a wrapper that would implement .responseForRequest to first collect the entire body, then synthesize a gross old-style-IRequest-like object out of the combination of that body and the other information about the resource, then call .getChildWithDefault on it a few times, then call the old-style .render_GET, et. al.  The IResponse returned from this compatibility .responseForRequest would wrap up calls like request.write and turn them into write() calls.

This is long and increasingly rambly, so I should probably stop now, send it, and get your feedback.  Does the strategy I'm proposing make sense?  I'm sure I'm leaving a ton out so feel free to ask for clarification.  Hopefully I didn't leave too many incomplete sentences in the middle.

-glyph
-------------- next part --------------
An HTML attachment was scrubbed...
URL: </pipermail/twisted-python/attachments/20151118/b361c1b5/attachment-0002.html>


More information about the Twisted-Python mailing list