Tuesday, April 19, 2011

HTTP Proxy Server

How does an HTTP proxy server work? I learned a little about this recently while trying to implement one using Apache, mod_python, and Python. HTTP is a relatively simple protocol: the client opens a connection to a server and sends a request as plain text. For example, to get a web page, you send a simple GET request and wait for the page (or an error code) to be sent back to you.

GET /path/to/file/index.html HTTP/1.0

However, this request contains no server information, so how does a proxy server know what you want? Once you configure your browser or other client to use a proxy, the client opens a connection to the proxy server (rather than the destination server) for each request, and sends a slightly different GET request for each page.

GET http://someserver:80/path/to/file/index.html HTTP/1.0

Notice that this GET request includes the full URL (including the server and port). This tells the proxy server which server holds the actual content. The proxy server then connects to that server, requests the content, and pipes the response back to you.
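
To make the difference between the two request lines concrete, here is a minimal sketch in Python 3 of how a proxy might split the full-URL form into the pieces it needs. It is not the mod_python code from my implementation; the function names parse_proxy_request_line and origin_request_line are made up for illustration, and the sketch assumes a well-formed request line with a plain http or https URL.

```python
from urllib.parse import urlsplit


def parse_proxy_request_line(line):
    """Split a proxy-style request line into (method, host, port, path, version).

    Hypothetical helper: assumes the target is an absolute URL such as
    http://someserver:80/path/to/file/index.html.
    """
    method, target, version = line.strip().split(" ", 2)
    parts = urlsplit(target)
    if not parts.scheme or not parts.hostname:
        raise ValueError("expected an absolute URL, got %r" % target)
    # Fall back to the scheme's default port when the URL does not name one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return method, parts.hostname, port, path, version


def origin_request_line(method, path, version):
    """Build the plain request line the proxy would send to the real server."""
    return "%s %s %s\r\n" % (method, path, version)


method, host, port, path, version = parse_proxy_request_line(
    "GET http://someserver:80/path/to/file/index.html HTTP/1.0"
)
print(host, port)
print(origin_request_line(method, path, version), end="")
```

The host and port tell the proxy where to connect, and the rebuilt request line is the same plain GET shown earlier. A real proxy would also have to forward headers and handle malformed input, which this sketch ignores.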

The problem I face is that I need a way to track groups of requests coming from a particular client process. For example, suppose the user has two browsers and I want a separate log file containing all the downloads made by each browser. I can't track by IP address, because both browsers share the same one. I can't track by modifying the URL either, since most client apps don't support adding information other than the server name and port. That leaves the authentication mechanism: each browser sends a unique username, and the proxy uses that username to tell which client is making the request. For now, that's the solution I'm going with.
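
As a rough illustration of the username idea, the sketch below pulls the username out of a Proxy-Authorization header value. It assumes the client uses HTTP Basic proxy authentication, where the header carries "Basic " followed by the base64 encoding of username:password; I haven't settled on a scheme, so treat that as an assumption. The function name proxy_username and the example username browser-a are invented for this example.

```python
import base64


def proxy_username(header_value):
    """Return the username from a Basic Proxy-Authorization value, or None.

    Hypothetical helper: anything that is not well-formed Basic
    credentials is treated as "no username".
    """
    if not header_value:
        return None
    scheme, _, encoded = header_value.strip().partition(" ")
    if scheme.lower() != "basic" or not encoded:
        return None
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:
        # Covers bad base64 (binascii.Error) and bad UTF-8 (UnicodeDecodeError).
        return None
    username, sep, _password = decoded.partition(":")
    return username if sep else None


# Build a header value the way a client would, then read the username back.
value = "Basic " + base64.b64encode(b"browser-a:anything").decode("ascii")
print(proxy_username(value))
```

The password is ignored here because the username serves only as a label for grouping requests into a log file, not as a security check. A proxy that receives no credentials would normally answer with a 407 status and a Proxy-Authenticate header to make the client send them.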

For more information, there is a nice introduction to HTTP here:
http://www.jmarshall.com/easy/http/
For even more detail, see RFC 2616:
http://www.w3.org/Protocols/rfc2616/rfc2616.html