Opened 9 years ago

Last modified 7 years ago

#1109 enhancement new

twisted.web.proxy doesn't reverse-map redirects like ProxyPassReverse

Reported by: kragen
Owned by:
Priority: low
Milestone:
Component: web
Keywords:
Cc: Tv, jknight, kragen, exarkun
Branch:
Author:
Launchpad Bug:

Description


Attachments (2)

proxymap.py (14.5 KB) - added by kragen 9 years ago.
LICENSE (1.1 KB) - added by kragen 9 years ago.

Change History (16)

comment:1 Changed 9 years ago by kragen

Summary: twisted.web.proxy lacks crucial reverse-proxy functionality and cannot
be deployed in a situation where the backend server being reverse-proxied
generates redirects to itself

Details:
A reverse proxy server stands in front of a web server that is, or should be,
invisible to the outside world.  That "backend web server" need not know about
the reverse proxy server, and indeed there's no way (that I know of) for the
reverse proxy server to tell the backend web server of its "public" URL. 
Consequently the backend web server must be careful to avoid creating absolute
links to its own URIs, because they will say something useless like
"http://localhost:3000/index.html", but "localhost" is meaningful only to the
reverse proxy server.  At the same time, the reverse proxy server does not have
the latitude to send the "host" header sent by the end-user's browser on to the
backend web server, since that will screw up name-based virtual hosting on the
backend server.  (Normally, anyway.)

However, if the backend web server generates HTTP redirects (code 301 or 302,
for example) it is required to use an absolute URI in the "Location:" header of
its response.  Some user-agents don't actually work with relative URLs there. 
Unfortunately, this means that the backend web server has no alternative but to
insert some hostname and port number in this URI, but it does not have enough
information to insert the publicly-visible hostname and port number there.

Apache's mod_proxy has a directive known as ProxyPassReverse to handle this
problem.  ProxyPassReverse rewrites Location: fields with certain prefixes (such
as "http://localhost:3000/") into absolute URLs in the URL space owned by the
proxy (such as "http://publicsite/app1".)  Typically a ProxyPass directive (the
equivalent of attaching a twisted.web.proxy.ReverseProxyResource at some
location in the tree of Resources) is accompanied by a ProxyPassReverse
directive with the same arguments.
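
As a rough illustration of the rewrite described above, here is a minimal
sketch in plain Python.  The function name and the example URLs are
assumptions made up for this sketch; it is not mod_proxy's implementation and
not part of twisted.web.proxy.  Both prefixes end in a slash so that the
joined result stays well-formed.

```python
def rewrite_location(location, backend_prefix, public_prefix):
    """Map a backend Location value into the proxy's public URL space.

    Hypothetical helper for illustration only.  The value is returned
    unchanged when it does not start with backend_prefix.
    """
    if location.startswith(backend_prefix):
        # Keep everything after the backend prefix and graft it onto
        # the publicly visible prefix.
        return public_prefix + location[len(backend_prefix):]
    return location


# A redirect generated by the backend, as seen by the reverse proxy.
print(rewrite_location("http://localhost:3000/login",
                       "http://localhost:3000/",
                       "http://publicsite/app1/"))
# prints http://publicsite/app1/login
```

Redirects that point somewhere else entirely (another site, say) pass through
untouched, which is the behaviour wanted from a Location: rewriter.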

Without this functionality, I do not know of a way to deploy a working reverse
proxy server.

Unfortunately, it is not sufficient to rewrite Location: headers from requests
proxied by a particular ReverseProxyResource with the single Resource-to-URL
mapping used by that ReverseProxyResource, because multiple subtrees of the same
backend server may be reverse-proxied at different locations in the URL space
located on the same reverse-proxy server, and handling redirects between these
subtrees is important.  At least, it's important in the application where I'm
using it.

I think the correct solution is to create a single RedirectRewriter object for a
whole Site, containing all the reverse-proxy mappings, and use that
RedirectRewriter in twisted.web.proxy.ProxyClient.handleHeader to rewrite responses.
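
To make the proposal concrete, here is a minimal sketch of what such a
site-wide object could look like.  Only the name RedirectRewriter comes from
the paragraph above; the method names, the longest-prefix rule, and the
example URLs are assumptions of this sketch, and it is not the code in the
attached proxymap.py.  The hook into ProxyClient.handleHeader is not shown.

```python
class RedirectRewriter:
    """Holds every backend-prefix -> public-prefix mapping for one site.

    Illustrative sketch only; method names are hypothetical.
    """

    def __init__(self):
        self._mappings = []

    def add_mapping(self, backend_prefix, public_prefix):
        self._mappings.append((backend_prefix, public_prefix))
        # Keep the longest backend prefix first so that the most
        # specific proxied subtree wins when prefixes nest.
        self._mappings.sort(key=lambda pair: len(pair[0]), reverse=True)

    def rewrite(self, location):
        for backend_prefix, public_prefix in self._mappings:
            if location.startswith(backend_prefix):
                return public_prefix + location[len(backend_prefix):]
        # Not a URL inside any proxied subtree: leave it alone.
        return location


rewriter = RedirectRewriter()
rewriter.add_mapping("http://localhost:3000/shop/", "http://publicsite/store/")
rewriter.add_mapping("http://localhost:3000/admin/", "http://publicsite/manage/")

# A response proxied under /store/ that redirects into the backend's
# /admin/ subtree is still mapped to the right public location.
print(rewriter.rewrite("http://localhost:3000/admin/login"))
# prints http://publicsite/manage/login
```

Because the mappings live in one shared object rather than in each
ReverseProxyResource, a redirect from one proxied subtree into another is
rewritten correctly.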

comment:2 Changed 9 years ago by jknight

ProxyPassReverse is stupid, because it cannot rewrite all uses of the internal 
url, only those uses in Location. Much better is to actually tell the backend 
host its real URL. Twisted.web2, for example, has support for getting told by 
the frontend proxy what URL it's actually processing. Or, if your frontend 
cannot add headers to the request as it sends it through (like apache1 cannot), 
you can also hardcode the url. I don't know what backend server you're trying to 
use, so I don't know if it's possible to tell it its real URL or not.

Nevertheless, if it really is impossible to have the backend server know its 
real URL, ProxyPassReverse is at least a partial fix, so it might be a valuable 
feature to have. It's certainly not as completely necessary as you make out, 
though.

This should be something that's trivial to do without modifying proxy at all, 
except that twisted.web is annoying and inflexible. However, it sounds like you 
have a good handle on how to go about implementing the functionality if it turns 
out you really do need it.

This is certainly a feature request, not a bug report, and thus should be
rejected as not for t.web, since t.web is basically frozen now. t.web2 could do 
with an output filter which can do this, although until it has client support, 
that's not very useful.

comment:3 Changed 9 years ago by kragen

It's true that ProxyPassReverse cannot rewrite all uses of the internal
host:port, only those uses in Location; however, Location is the only place
(that I know of) where you can't use a relative URL, and is therefore the only
place in which rewriting cannot be avoided.

It's true that you could (at least in some cases) configure the backend server
with knowledge of the public host and port, but this is usually difficult, and
in this case, the primary reason for the reverse proxy is so that none of the
servers need know the hostname, for reasons too painful to detail in this report.

The idea that we should cease fixing deficiencies in twisted.web, even though
twisted.web2 does not yet work even as well as twisted.web, strikes me as rather
impractical.

comment:4 Changed 9 years ago by kragen

Oops, didn't mean to set status back to new.  Silly browser.  Thank goodness for
Roundup.

comment:5 Changed 9 years ago by jknight

If you write the subclass that does what you need, you can attach it to this bug 
report.

comment:6 Changed 9 years ago by kragen

Thanks for your help, foom!

comment:7 Changed 9 years ago by kragen

#1117 is a separate bug related to twisted.web.proxy's reverse-proxying.

Changed 9 years ago by kragen

comment:8 Changed 9 years ago by kragen

Here's a passel of subclasses that implement this feature, and some limited
tests.  I clearly need to learn to use twisted.trial, and I think this would be
much cleaner as a patch to twisted/web/proxy.py and twisted/python/urlpath.py
than as a bunch of subclasses.

Changed 9 years ago by kragen

comment:9 Changed 9 years ago by kragen

As the attached changes were written as part of my efforts on the Zocalo
project, my employer CommerceNet licenses them under the attached MIT/X11 license.

comment:10 Changed 9 years ago by Tv

You are supposed to deploy the backend server with a hack like
twisted.web.vhost.VHostMonsterResource. Just Use It.

Rewriting is way too risky -- do you want to rewrite embedded
javascript, too?

The actual mechanism used to transfer the public host name, port
and URL path could be cleaner, and will likely be implemented with
HTTP headers in web2 or something.
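
For readers unfamiliar with the VHostMonsterResource approach mentioned
above: the frontend proxy encodes the public scheme, host, and port into the
path it requests from the backend, and the backend strips that prefix and
adopts those values as its own.  A minimal sketch of building such a path in
plain Python follows.  The helper name is hypothetical, and "vhost" is only
the name under which the resource is assumed to be mounted on the backend.

```python
def vhost_monster_path(scheme, host, port, path, mount="vhost"):
    """Build the backend request path that carries the public URL.

    Hypothetical helper for illustration.  'mount' is whatever child
    name the VHostMonsterResource was attached under on the backend.
    """
    return "/%s/%s/%s:%d/%s" % (mount, scheme, host, port, path.lstrip("/"))


print(vhost_monster_path("http", "publicsite", 80, "/app1/index.html"))
# prints /vhost/http/publicsite:80/app1/index.html
```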

comment:11 Changed 9 years ago by kragen

It's true that VHostMonsterResource is another way to solve the problem, and if
the backend server is a Twisted.web server, it's a simpler solution.  If the
backend server is running Rails, Tomcat, Jetty, Apache, or nearly any other
server software, this will require reimplementing VHostMonsterResource in the
local dialect, which will generally involve more total complexity than rewriting
Location: headers in the reverse proxy.  (In my case, the backends are running
Jetty and asyncore, and Jetty was the culprit.)  All of this remains true if the
public-hostname information moves from the URL into HTTP headers, as Virtanen
suggests may be possible in the future.

It's true that embedded JavaScript or HTML may contain a non-public hostname;
but in these cases, it is possible to avoid it, because they can contain
relative URLs instead of absolute ones.  Some people may prefer to avoid this
risk, but this risk is not introduced or increased by rewriting Location:
headers; it is introduced by using a reverse proxy.  The only way to avoid it is
not to deploy a reverse proxy.  Therefore this risk is not relevant to the
question of whether a reverse proxy should have an option to rewrite Location:
headers.

As I explained in my original request, it is possible to avoid including your
hostname in your JavaScript and your HTML, but it is not possible to avoid
including it in your Location: headers.

I think this code would be much simpler as a patch, by the way; it's only 160
lines because it has to duplicate much of the class structure of
twisted.web.proxy.  I'd be happy to rewrite it as a patch, and include Trial tests.

comment:12 Changed 9 years ago by Tv

I'm not really enthusiastic about adding features that make people _believe_ they
have a fully functional setup, when that is not true. I'd much rather see work
go e.g. into an apache module that understands the information the rproxy gives
it, overriding its concept of host, port and path.

comment:13 Changed 7 years ago by exarkun

  • Cc exarkun added
  • Priority changed from high to low
  • Type changed from defect to enhancement

The attached code isn't suitable for inclusion in Twisted in its current form:

  • much of it is undocumented
  • it is lacking pyunit-style tests (although it does have some tests - but they don't look complete)
  • various Twisted naming and formatting conventions aren't followed
  • the implementation is very long; I'm not sure why jknight asked for a subclass-based implementation. This would be much simpler as a patch to proxy.py, and that's how it should be supplied (obviously backward compatibility should be retained in any modified APIs).

comment:14 Changed 4 years ago by <automation>

  • Owner jknight deleted