Opened 11 years ago

Last modified 5 months ago

#288 enhancement new

no way to access the data of an upload which is in-progress

Reported by: dalke Owned by:
Priority: normal Milestone:
Component: web Keywords:
Cc: rainhead, glyph, itamarst, dp, jknight, dalke, ivank, rikyu, exarkun, thijs, redshift, jamesh, peter.westlake@…, zooko@…, msteder, davidsarah, goffi@…, andre.cruz@…, teratorn@…, tom.sheffler@…, daf@…, nejucomo@…, radix Branch:
Author: Launchpad Bug:

Description


Attachments (2)

http_request_args_property.diff (2.6 KB) - added by rikyu 5 years ago.
a patch to http.Request that uses a property for args that parses the request body at first reference
scribbles.py (1.5 KB) - added by exarkun 5 years ago.
Some unexplained sketches of what an implementation might look like

Download all attachments as: .zip

Change History (56)

comment:1 Changed 11 years ago by dalke

I have large files I want to upload over HTTP.  I want to show some
sort of progress bar.  There's two approaches I've found.  One installs
something on the server, either a signed java app or ActiveX control.

The other is to have the onSubmit create some unique id and
add it to the submission.  Call that id 'ABC'.  The OnSubmit action
then creates a new window pointing to server_url/status?ABC
and completes the submit.  There are now *2* channels open
to the server.  One doing the upload and the other a simple
request for status.

Because of the collusion between the two server sides of
the communication, via the ABC id, the upload site can
tell the status side how much data has been uploaded,
which can be turned into an appropriate HTML display.

Pretty cool, eh?  However, there's no easy way in Twisted's
existing HTTP server code to know the current upload size
(except by doing some method redefinitions) and there's
no way for a Request to know any of this until the data is
fully uploaded.

In other words, this can't be done under Twisted without
some non-trivial rewrites.

comment:2 Changed 11 years ago by itamarst

Resource.render is only called once the request is finished
processing. At some point the file uploading will be fixed
so it is processed as it comes in, at which point this may
be feasable... very likely by totally breaking abstractions.

I actually wanted something similar for t.w.distrib
(forwarding http requests to other servers) so that you
could start forwarding immediately rather than when the
whole request has been uploaded already. So... here's how it
might work:

The looking up which resource will handle the request can be
done immediately after headers have finished being received.
At that point perhaps the resource should be checked to see
if it has .isPrerender and then pass unfinished request to
resource.prerender() which could do this sort of stuff.

This however doesn't work for cases where you have an
isLeaf=1 resource that continues doing getChild lookups in
its render (e.g. guard), unless I suppose its prerender also
does that. Hmmm...

comment:3 Changed 11 years ago by itamarst

We worked out how to do this at PyCon - basically adapt result of resource
lookup to IReceivesHTTPChunks or something, with a default implementation that
just feeds it to the IResource more or less like we do now. Proxies, distrib and
this use case could override it with a different adapter.

comment:4 Changed 6 years ago by rainhead

  • Cc rainhead added

+1 on this, for what it's worth. I'd very much like to be able to read POSTs progressively, ideally with a LineReceiver style interface.

comment:5 Changed 6 years ago by ivank

  • Cc ivank added

comment:6 Changed 5 years ago by glyph

  • Cc rikyu added

#4118 was a duplicate of this.

comment:7 Changed 5 years ago by rikyu

Replying to glyph@#4118:

This is a duplicate of #288. The problem with your approach here is making sure that none of the calls to getChild along the way rely on request.args. As you've implemented it, the parsing is deferred, but you still have to wait for all the data in the request to be received before processing it.

Well, my biggest concern is in regards to WSGI support. No other WSGI container pre-parses POST data, so I figure neither should Twisted's. My solution was meant to be a "enough rope to hang yourself by" approach, and would definitely work in the case of WSGI, since WSGIResource doesn't allow use of getChild/putChild anyways.

If there's a preferred way to do this that would only provide the fix for WSGI apps, I'd be glad to write the patch.

comment:8 Changed 5 years ago by exarkun

  • Cc exarkun added

Hm. To some degree, the "all the other WSGI containers do X" argument is convincing. However, I'm not sure it is here, at least not without a little more explanation.

Are there particular issues that arise from Twisted's behavior here? If so, those are likely to make a more convincing argument.

(Of course, I absolutely think that this ticket should be resolved and Twisted Web should provide incremental upload access, I just don't think the WSGI case is particularly more compelling than the non-WSGI case (at least not without more details).)

comment:9 Changed 5 years ago by jknight

Note that his patch didn't actually change the behavior of anything, it just moved some code around. That makes sense in its own right.

But, we really ought to stop supporting request.args from POST data by default. We also really shouldn't support POST at all except if the resource wants such support. And we really should never read the request content except when the resource requests it.

I think solving the above issues are some of the trickiest things to solve with the plan to merge twisted.web2's featureset back into twisted.web. And I sure don't know the answers.

comment:10 Changed 5 years ago by rikyu

I think there's a few considerations in addition to allowing interactive file uploads.

Conceptual: Pre-parsing of POST data is a fine feature for Twisted.Web: A Web Application Development Framework, but it is less appropriate as part of Twisted Web: HTTP Protocol Support.

Efficiency: If you do need to parse the request body differently, you will always be iterating through the data for a second time.

Security: It also seems to me that the current implementation makes it possible to perform a DOS on a TwistedWeb server just by sending particularly large requests. Of course, this is a problem by default for most HTTP server implementations, but if an application could override the default handling of POST body data, it could enforce a maximum body size, and kill the request before reading the rest of it.

comment:11 Changed 5 years ago by exarkun

(replying to jknight)

There has been a bit of discussion. The conclusion goes something like this.

Introduce a new interface between requests and resources. The new interface will provide the request body as a producer.

For for the particulars... I dunno. Perhaps this means introducing a new version of IResource, or adding a new method toIResource` that is optional (or optional for a while). Anyone want to make any specific suggestions?

comment:12 follow-up: Changed 5 years ago by jknight

Adding a new IResource sounds like twisted.web2.IResource/Resource.

Do you really want to go there? It means that everything in the entire hierarchy above the resource you care about *also* has to be a new resource, so that they don't cause the request content to be read in early.

An alternative might be a t.w.Server flag that says whether or not to pre-parse the request content before resource.render(). It can start at True, and have reading the content or request.args (with POST-args) be deprecated from outside of render(). And eventually can be changed to False.

comment:13 Changed 5 years ago by rikyu

So, just FYI, right now I'm working on a patch that uses a property for args. It doesn't parse the request body until the first time you refer to args. I'm trying to get to the bottom of a couple of failing tests, but if I can get it working, it would be a fine solution to my problem.

Can anyone think of any particular reason this wouldn't work, or would be a bad idea?

Changed 5 years ago by rikyu

a patch to http.Request that uses a property for args that parses the request body at first reference

comment:14 in reply to: ↑ 12 ; follow-up: Changed 5 years ago by exarkun

Replying to jknight:

Adding a new IResource sounds like twisted.web2.IResource/Resource.

Do you really want to go there? It means that everything in the entire hierarchy above the resource you care about *also* has to be a new resource, so that they don't cause the request content to be read in early.

IResource needs to be improved. I think everyone has always agreed on this point. getChild alone is a sufficient argument for this, let alone anything to do with streaming uploads. So I think we're going to have to deal with this eventually. I think the way to be successful with such a change is to ensure that the new pieces continue to work with the old and that minimal effort is required to take advantage of new features. And I still don't have a specific proposal to achieve that yet, no. :)

An alternative might be a t.w.Server flag that says whether or not to pre-parse the request content before resource.render(). It can start at True, and have reading the content or request.args (with POST-args) be deprecated from outside of render(). And eventually can be changed to False.

Pre-parsing is only half the issue though. We want to get to the point where we can do resource traversal before the request body is even read. I don't see how we get there without completely breaking access to request.args (with POST-args), if we do anything to change the existing APIs.

comment:15 in reply to: ↑ 14 Changed 5 years ago by rikyu

Replying to exarkun:

Pre-parsing is only half the issue though. We want to get to the point where we can do resource traversal before the request body is even read.

What's the reasoning behind this idea? It seems reasonable to me that if you use POST data to determine the proper resource to return, then you have to deal with the consequences of already having read the request body.

As far as I can see, the only scenario that would be affected by this implementation detail is if you are accepting authentication data via POST along with a file upload that you wish to stream. The only pattern I can think of that would fit this scenario is a REST-based webservice that accepts a file upload with credentials that also wishes to provide interactive upload progress.

As it stands, I can't see how it would ever be possible to use request body data for resource traversal without potentially reading the entire request body, since AFAIK you have no guarantee of the order of fields in a POST request.

I suggest that on-demand parsing of POST data is a separate issue from being able to traverse the resource tree without reading the request body.

comment:16 Changed 5 years ago by exarkun

I suggest that on-demand parsing of POST data is a separate issue from being able to traverse the resource tree without reading the request body.

Okay. It's too bad this ticket has no description. I have always interpreted it as being the ticket for "being able to traverse the resource tree without reading the request body". That's the feature I've been talking about so far. AFAICT, it's not the feature you implemented in your patch, but I don't think your patch implements the feature you want either. ;) In #4118 you talked about how it is impossible to "implement interactive file uploads, or to prohibit large attachment sizes." You have to give control to application code before the request body is read to do these things. Just delaying parsing isn't enough. Does that make sense? Am I misunderstanding something?

comment:17 follow-up: Changed 5 years ago by rikyu

I think we've covered a lot of different issues ;-)

The one I care about is being able to implement streaming HTTP uploads in a WSGI app.

The patch I attached to #4118 just moved some parsing stuff to a separate method, so that my WSGIRequest subclass could simply redefine that method to be a no-op, instead of having to reimplement requestReceived() altogether. It's true that this didn't do much of anything for anyone not using a custom Request subclass.

The patch I attached to this ticket took a better approach. It changes request.args to be a property, tied to a memoized getter function that parses the input the first time it is referenced. This fixes my problem as well, but also makes it possible to implement streaming HTTP uploads in a standard Twisted Web app, as long as you do your own POST parsing and don't refer to request.args.

comment:18 in reply to: ↑ 17 ; follow-up: Changed 5 years ago by exarkun

Replying to rikyu:

I think we've covered a lot of different issues ;-)

The one I care about is being able to implement streaming HTTP uploads in a WSGI app.

The patch I attached to #4118 just moved some parsing stuff to a separate method, so that my WSGIRequest subclass could simply redefine that method to be a no-op, instead of having to reimplement requestReceived() altogether. It's true that this didn't do much of anything for anyone not using a custom Request subclass.

The patch I attached to this ticket took a better approach. It changes request.args to be a property, tied to a memoized getter function that parses the input the first time it is referenced. This fixes my problem as well, but also makes it possible to implement streaming HTTP uploads in a standard Twisted Web app, as long as you do your own POST parsing and don't refer to request.args.

I don't think it makes it possible to implement streaming HTTP uploads in a standard Twisted Web app: requestReceived still isn't even called until the entire request body has been received.

comment:19 in reply to: ↑ 18 Changed 5 years ago by rikyu

Replying to exarkun:

I don't think it makes it possible to implement streaming HTTP uploads in a standard Twisted Web app: requestReceived still isn't even called until the entire request body has been received.

Ah yes. You are entirely correct, not sure how I missed that. I've got a couple ideas to fix this, but I need to do a great deal more research.

comment:20 Changed 5 years ago by thijs

  • Cc thijs added

comment:21 Changed 5 years ago by redshift

  • Cc redshift added

Changed 5 years ago by exarkun

Some unexplained sketches of what an implementation might look like

comment:22 Changed 5 years ago by exarkun

At the Flumotion sprint, Esteve and I mainly talked about what the new request dispatch code might look like, in order to offer this new feature while also retaining backwards compatibility. The attached scribbles.py is some scribbling I did while we were talking. It's certainly not complete, and likely not very coherent. What it does show is where we might hook the new implementation in to the existing code (Request.allHeadersReceived) and how it might support a new kind of streaming resource while also providing the current behavior to old-style resources via a couple of adapters.

comment:23 Changed 5 years ago by zooko

  • Summary changed from no way to access upload progress to no way to access the data of an upload which is in-progress

I would really like this feature and here's why:

Tahoe-LAFS is a distributed storage system. It is somewhat like Amazon S3. The API by which programmers upload and download files is a RESTful web API, so to upload a file to the distributed storage grid you send a PUT or a POST request containing the entire contents of your file in the request body.

Currently, if you upload a large file from your web browser to the Tahoe-LAFS process (which is serving as an HTTP server using twisted.web and typically runs on localhost), then the entire files gets spooled to the Tahoe-LAFS process, which stores it in a temporary file on disk (very likely the same disk which your web browser is reading the file from), and after the whole thing has been written out to this temp file then the Tahoe-LAFS source code accesses it as req.fields["file"].file and then reads the whole thing back from disk to upload it (after encrypting and erasure-coding it) to the remote servers.

http://tahoe-lafs.org/trac/tahoe-lafs/browser/src/allmydata/web/unlinked.py?rev=4170#L35

It would be a nice improvement if we could change the Tahoe-LAFS code so that it could access the content of the request body incrementally as it arrives and process it. The processing in this case involves encrypting and erasure-coding and then uploading it to a remote server and then forgetting about it so that it doesn't use up RAM nor temporary disk space for the rest of the upload process.

Here is the ticket in the Tahoe-LAFS trac about this issue:

http://tahoe-lafs.org/trac/tahoe-lafs/ticket/320 (add streaming (on-line) upload to HTTP interface)

comment:24 Changed 5 years ago by jamesh

  • Cc jamesh added

Having something like the http_request_args_property.diff patch applied would be quite useful.

The current request.args handling makes twisted.web unsuitable for servers that will receive large post bodies. It seems weird that the request object goes to the trouble of using a temporary file to buffer large request bodies, only to turn around and load that body into memory if it is a form post.

While it would be nice to have a system where the resource traversal could occur before reading the request body, I'd settle for not going out of memory when people post large files.

comment:25 Changed 5 years ago by exarkun

Having something like the http_request_args_property.diff patch applied would be quite useful.

Perhaps you can help out with that by adding unit tests for the change, and updating the patch to comply with the coding standard? When that's done, add the "review" keyword and someone will take a look.

Also, I'd suggest attaching the updated patch to a new ticket, not this one, since it's not really a solution to the issue this ticket describes.

comment:26 Changed 4 years ago by jamesh

I've created a new ticket with an updated patch as ticket #4519.

comment:27 Changed 4 years ago by jupi

So why is the approach that cgi.FieldStorage has towards multipart/form-data not possible with twisted.web and additionally maintaining backwards compatibility?

The only problem is, that because of twisted's async nature we won't have a file-like object we could pass on to FieldStorage, so an async FieldStorage had to be implemented.
One had to re-implement handleContentChunk and requestReceived of Request to

a) start parsing of body already when first data is received, in a way like FieldStorage handles parsing (which means, watch out for content-type, if it's multipart/form-data, get boundary and then read appropriate content-length and -disposition headers after each boundary). FieldStorage writes any binary data of one boundary to a file it retrieves via it's make_file method (which may be overridden in subclasses to supply custom files), any plain-text content (e.g. from text input fields) is stored in a dictionary, very much like twisted does it these days in its body parsing after the whole body has been received.

For each uploaded file, FieldStorage provides a key named after the file input field's name with the properties file (which is a file-like object/the result of make_file) and value (for backwards compatibility, does nothing else than reading the temporary file's contents into memory and passing it on to the requesting method). So your own application has the choice to either let FieldStorage read the file's content into memory so that you can use it easily, or choose to do something else with the received data by e.g. reading only parts from the written file

b) Change HTTPChannel, so that request method and uri are passed on to the Request as soon as they are known, they should be set directly after being decoded by HTTPChannel's lineReceived so that Request is able to do the on-the-fly body processing. A property could be placed for _command and _path in a subclass to achieve this behaviour by utilizing the setter method to set the value for _command to self.requests[-1]

The way of FieldStorage provides great flexibility, e.g. in one of my projects I used cherrypy (which allows you to do body processing even in a streaming manner in your own subroutines, which is, given that cherrypy is not async but sync, much easier than in twisted) and parsed the body on a certain request uri not with cherrypy's own methods (which are, btw, more elegant than twisted.web's approach, because they save a single file of a multipart-body to a single temp file on disk instead of loading it to memory) but by utilizing a subclass of FieldStorage whose make_field created a NamedTemporaryFile instead of one that is immediately unlinked after having been opened so that my application is able to simply rename the temporary file, after some sanity checks have been performed, to the final destination.

This behaviour is very important for this application because it's a firmware uploader for an embedded device featuring only 128MB of RAM and 512MB NAND storage, so there is not enough space for keeping a firmware update of 150MB in RAM or to have several copies of the same file on disk just because you can't rename the temporary file (because it has been unlinked). With the solution I implemented the user can upload the firmware update, the application verifies sanity and then links it to another than the temporary filename to allow update procedures to take place, no more space than the initial 150MB firmware size and some KBytes of RAM are wasted for this.

comment:28 Changed 4 years ago by pmw

  • Cc peter.westlake@… added

comment:29 Changed 4 years ago by zooko

  • Cc zooko@… added

comment:30 Changed 4 years ago by <automation>

  • Owner jknight deleted

comment:31 Changed 4 years ago by ivank

A summary (?) from 2010-11-23:

<exarkun> Here's how to make it work in twisted.web

<exarkun> introduce a new kind of resource that can handle streaming request bodies

<exarkun> In HTTPChannel, when the request headers are done, start resource traversal in the usual way

<exarkun> as soon as you get to a resource that isn't the new kind of resource,
give up and let the normal request processing code run (ie, wait for the full request body, then do resource traversal)

<exarkun> but if you get to the end of the url without hitting an old resource, dispatch the request

<exarkun> And finally, extend the request with a method for getting the request body as a producer, or supply the request with a protocol that's used to interpret the body, or something else like that

<exarkun> As chunks from the request body arrive, you either let the old code run (write to request.content) or dispatch to the new thing (call some protocol.dataReceived, or some consumer.write)

comment:32 Changed 4 years ago by jknight

That's what web2 did -- native web2 resources didn't get content until they requested it (which the default impl of resource.render_POST did for you), while the compat.OldNevowResourceAdapter and compat.OldResourceAdapter triggered reading of the content before calling the old resource getChild or render methods.

I never implemented the ability to nest a new resource inside an old resource; that'd take some more adaptation fun to stream the already-read data through the new resource's dataReceived/whatever.

Don't forget that this means the new resource kind can't access POSTed variables from request.args in its getChild.

Also note that if a resource is sent data but fails to read all of it, that means you can't even reliably return an error page -- you have to either read and discard all of the data, or else abort the connection (after a delay) because the client might be a blocking client and will never read the response until it finishes writing the request (web2 got that wrong, #1873).

comment:33 Changed 3 years ago by msteder

  • Cc msteder added

comment:34 Changed 3 years ago by davidsarah

  • Cc davidsarah added

comment:35 Changed 3 years ago by Goffi

  • Cc goffi@… added

comment:36 Changed 3 years ago by EDevil

Will this new functionality be implemented in the near future?

comment:37 Changed 3 years ago by EDevil

  • Cc andre.cruz@… added

comment:38 Changed 3 years ago by itamar

I plan on getting to this once I finish #5379, a necessary pre-requisite.

comment:39 Changed 3 years ago by teratorn

  • Cc teratorn@… added

comment:40 Changed 3 years ago by tomsheffler

  • Cc tom.sheffler@… added

comment:41 Changed 3 years ago by daf

  • Cc daf@… added

Hi! I've been experimenting with approaches to addressing this.

Straw man API proposal: IResource grows a new method — let's call it render_early() that receives a request that has headers but no body yet. If the resource doesn't support early processing for the request's method (or for some other reason), it raises UnsupportedEarlyProcessing, in which case we fall back to receiving the entire body and passing the request to IResource.render(). twisted.web.server.Resource implements render_early() by looking for a method corresponding to the request — let's call them receive_{GET,HEAD,POST,...}.

(If we're concerned about the new IResource.render_early conflicting with user code, we could introduce an IEarlyResource/EarlyResource that would be implemented/subclassed by users wishing to implement/use the new traversal.)

With regards to the resource not reading the entire request, I think a sensible default is for the remaining body content to be ignored. If you're worried about wasted bandwidth, you can send a response and call request.channel.transport.loseConnection().

Also, as soon as this is fixed, I think it makes sense to extend the Twisted WSGI implementation to use it.

comment:42 follow-up: Changed 3 years ago by daf

The bit I left out: render_early() can call registerConsumer() on the request to get the body as it arrives.

One question: what happens if no consumer is registered by the time render_early() returns? Should it be an error? Should we silently fall back to traditional procesing? Should the body be ignored? I'm leaning towards the latter option.

comment:43 Changed 3 years ago by itamar

If we choose to implement this without a completely new resource traversal model, we would need to do the following:

  1. New resource marker interface, which supports marking resources that don't use the body during traversal.
  2. New request interface, with a method for streaming the body as it arrives to an application-supplied object (probably a protocol)
  3. Updated request implementation that starts traversal before the body arrives, and either stops when it hits a resource that doesn't implement the new resource interface or when it gets to the end of traversal.
  4. An updated request implementation that implements the new request interface to make the body available to whoever asks for it, and prevents this new method from being called more than once.

comment:44 Changed 3 years ago by itamar

Oh, and, the protocol which would get body delivered to it would have a transport that implements IProducer, so it can pause/resume data delivery (proxying the HTTPChannel, basically).


comment:45 Changed 3 years ago by tomsheffler

In these proposals, is it a correct assumption that response headers and response chunks can be sent back as early as when the render_early() method is called?

comment:46 Changed 3 years ago by itamar

My proposal does not have a render_early. Technically right now you can send response data back during traversal before you hit render(), I bet, but you really shouldn't do that.

comment:47 follow-up: Changed 3 years ago by tomsheffler

Thanks. Let me ask this question in a more general way. Can response chunks be streamed back as request chunks are being consumed? (i.e., before the entire request is read.)

comment:48 in reply to: ↑ 47 ; follow-up: Changed 3 years ago by glyph

Replying to tomsheffler:

Thanks. Let me ask this question in a more general way. Can response chunks be streamed back as request chunks are being consumed? (i.e., before the entire request is read.)

Probably there are some clients that support that, but generally speaking, no; the server's supposed to receive the whole request before writing the response (unless the response is a 100-continue or a 417-please-don't-continue).

comment:49 in reply to: ↑ 42 Changed 3 years ago by glyph

Replying to daf:

The bit I left out: render_early() can call registerConsumer() on the request to get the body as it arrives.

One question: what happens if no consumer is registered by the time render_early() returns? Should it be an error? Should we silently fall back to traditional procesing? Should the body be ignored? I'm leaning towards the latter option.

  1. Even for a strawman, render_early is not a great name. That would be rendering the 'early' HTTP method. What you would want is renderEarly and then (presumably) renderEarly_GET etc.
  2. Keep in mind that getChild() can currently consume the request body. If the problem were render() then this would be quite a bit easier :). (This is what itamar was talking about with the "marker interface" that indicates a resource won't use the request body in getChild.)

comment:50 in reply to: ↑ 48 Changed 3 years ago by zooko

Replying to glyph:

Probably there are some clients that support that, but generally speaking, no; the server's supposed to receive the whole request before writing the response (unless the response is a 100-continue or a 417-please-don't-continue).

What? Are you sure?

http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html#sec6

After receiving and interpreting a request message, a server responds with an HTTP response message. 

Bleah! I guess you're right. But if Twisted allows servers to start sending a response before the request has been completely received, that's okay, right? Some folks might find it useful.

comment:51 follow-up: Changed 3 years ago by zooko

According to Ry4an Brase: "In practice you can start responding early and a good client handles it and a "bad" one provides backpressure."

comment:52 in reply to: ↑ 51 Changed 3 years ago by glyph

Replying to zooko:

According to Ry4an Brase: "In practice you can start responding early and a good client handles it and a "bad" one provides backpressure."

Thanks for both the spec reference and this tidbit about real-world behavior.

(Being right about the HTTP spec is one of my least favorite activities, but I seem to be doing more and more of it these days.)

comment:53 Changed 18 months ago by nejucomo

  • Cc nejucomo@… added

comment:54 Changed 5 months ago by glyph

  • Cc radix added
Note: See TracTickets for help on using tickets.