Opened 4 years ago

Last modified 3 years ago

#4519 enhancement new

Delay parsing of request body until Request.args is accessed

Reported by: jamesh Owned by: rlotun
Priority: normal Milestone:
Component: web Keywords:
Cc: Branch:
Author: Launchpad Bug:

Description

At the moment, if the request body for a post is form data (either url encoded or multipart), it is read into memory as the Request.args dictionary. This means that large requests can chew up large amounts of memory.

Attached is a patch based on one from ticket #288 that delays parsing of the request body until Request.args is accessed. This should maintain compatibility with the existing API while allowing applications that are concerned about memory usage to parse the request body themselves.

Attachments (2)

twisted-web-request-args.patch (6.6 KB) - added by jamesh 4 years ago.
twisted-web-request-args.patch
twisted-web-request-args-2-rlotun.patch (5.6 KB) - added by rlotun 3 years ago.
Patch to address review concerns.

Download all attachments as: .zip

Change History (8)

Changed 4 years ago by jamesh

twisted-web-request-args.patch

comment:1 Changed 4 years ago by jamesh

  • Keywords review added

comment:2 Changed 4 years ago by exarkun

  • Keywords review removed
  • Owner changed from jknight to jamesh

Thanks! A few simple things:

  1. Request._args should be documented. The Request.args docs are fine as-is, since the interface isn't changing here.
  2. The args getter function should have a docstring, though.
  3. There are some potential ways in which application code could break arg parsing now:
    1. Changing the method or uri attribute before accessing the args attribute. I'm not sure how important this is. It would be easy to save extra references to the original value and use those in the args getter, though.
    2. Reading from content before accessing args. This would prevent the args getter from getting all of the request bytes and probably cause the parsing to fail. A similar issue is that accessing the args attribute will now cause a self.content.seek(0, 0) which could be surprising.
  4. In the new test, the triples quotes of the docstring should be on lines by themselves.

comment:3 in reply to: ↑ description Changed 4 years ago by jupi

Replying to jamesh:

At the moment, if the request body for a post is form data (either url encoded or multipart), it is read into memory as the Request.args dictionary. This means that large requests can chew up large amounts of memory.

Question concerning mem usage:
As I read from the twisted http.py sourcecode, for large file uploads there will be a high mem usage nevertheless, because the whole request body is read into Request's content property before any parsing whatsoever is done. So delaying parsing of the body does IMHO not address this problem, you have to hook in at an earlier point, i.e. when the content variable is built. So you had to patch HTTPChannel and not Request. I need a twisted application which allows me to do large file uploads without storing the whole file to memory first, because it will serve as a firmware upgrader for an embedded device, which has only 128MB of RAM while firmware upgrades might take up to 150MB. So my approach will be to rewrite some of the functionality of cgi.FieldStorage in a class that handles async input, because FieldStorage requires an input stream it can read from which doesn't exist in twisted because of its async nature (or am I wrong in this?). This should result in a body parsing "on-the-fly" while body is received not after receival is completed, so that I could store files directly in a temporary file on disk and not into memory.
Do you see any better solution to the problem than my approach?

comment:4 Changed 4 years ago by exarkun

because the whole request body is read into Request's content property before any parsing whatsoever is done

If the content-length is large, Request.content is a temporary file, not in-memory storage.

I need a twisted application which allows me to do large file uploads without storing the whole file to memory first, because it will serve as a firmware upgrader for an embedded device, which has only 128MB of RAM while firmware upgrades might take up to 150MB

You're really after the functionality from #288.

Changed 3 years ago by rlotun

Patch to address review concerns.

comment:5 Changed 3 years ago by rlotun

  • Keywords review added
  • Owner jamesh deleted

This is an interesting bit of work and perhaps even a minor optimization. It's been a year since there has been any work on this so I thought I'd step in and continue the work. Please let me know if I'm stepping on anyone's toes.

I believe I have addressed all the review points except one: could you please elaborate why a self.content.seek(0, 0) could be problematic? In this patch I've saved the current position, did a seek(0, 0) to read the data, and then did a seek to the saved position. I am now unsure if there's a bit of logic elsewhere that perhaps advances the pointer for some reason.

comment:6 Changed 3 years ago by exarkun

  • Keywords review removed
  • Owner set to rlotun

Thanks for picking this up rlotun.

My earlier comment about self.content.seek(0, 0) is addressed perfectly by your position saving/restoring additions. A few other minor comments about the patch:

  1. The parse_qs check is pretty whitebox. If we're comfortable doing that, then we should be comfortable looking at the _args attribute instead. In fact, since it doesn't involve any monkey patching, I'd say we should be more comfortable doing that. :) So how about making the test just check that _args is uninitialized until the appropriate time, rather than monkey patching parse_qs?
  2. The content stream position won't be restored if an exception is raised during parsing - the code could use a try/finally.
  3. The new attributes should have camelCase names
  4. It'd be nice to verify that changes to uri and path don't affect the value of args. Similarly, it would be nice to exercise all of the codepaths through the args property. Right now there's no test coverage for the 'multipart/form-data' case (which is probably whence will come exceptions that can prevent seek(0, 0) from happening in the patch's current form). The arg parsing isn't really changed by this patch, so I guess there's no coverage for these cases in trunk either: it may make sense to add the test coverage separately, since it should apply equally to the old and new versions of this code. Doing it separately keeps the patch small and narrowly focused.

Thanks again!

Note: See TracTickets for help on using tickets.