At Fathom Data we have a few projects which require us to send HTTP requests from an evolving selection of IP addresses. This post details one approach which uses Tor (The Onion Router) as a proxy.
What is a Proxy Server?
A proxy server acts as an intermediary between a client and a server. When a request goes through a proxy server there is no direct connection between the client and the server. The client connects to the proxy and the proxy then connects to the server. Requests and responses pass through the proxy.
HTTP & SOCKS Proxies
HTTP (HyperText Transfer Protocol) is the dominant protocol for information exchange on the internet. HTTP is stateless. This means that the client (often a browser) sends a request to a server and the server replies with a response. Each request is handled independently of the ones before it, so the server keeps no memory of the client between interactions (although the underlying TCP connection may be reused for several requests).
An HTTP proxy uses the HTTP protocol for all interactions with the client and server. As a result, an HTTP proxy is only able to handle HTTP and HTTPS requests. An HTTP proxy is also able to filter or modify the content of the requests and responses passing through it.
SOCKS is another internet protocol. Whereas HTTP is an application layer protocol (at the top of the OSI Model), SOCKS is a lower level protocol in the session layer.
A SOCKS proxy uses the SOCKS protocol. Despite the name (often expanded as “SOCKet Secure”), SOCKS does not itself encrypt traffic. A SOCKS proxy simply relays the bytes of a connection without interpreting them, so it is unable to modify or filter the contents of requests or responses. Since it operates at a lower level in the networking hierarchy, a SOCKS proxy is more flexible than an HTTP proxy (it can carry any TCP traffic, not just HTTP) and typically adds less overhead.
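To make the difference in level concrete, here is a rough Python sketch of the first messages a client sends to a SOCKS5 proxy, following the layout in RFC 1928. The function names are hypothetical and only the no-authentication case is covered. Nothing in these messages mentions HTTP: the client simply asks the proxy to open a TCP connection to a host and port, and whatever follows is relayed as-is.

```python
import struct


def socks5_greeting():
    """Client greeting: version 5, one auth method offered, 'no authentication'."""
    return bytes([0x05, 0x01, 0x00])


def socks5_connect_request(host, port):
    """CONNECT request addressed by domain name (address type 0x03)."""
    name = host.encode("ascii")  # sketch only: plain ASCII host names
    if not 0 < len(name) <= 255:
        raise ValueError("host name must be between 1 and 255 bytes")
    # version 5, command 1 (CONNECT), reserved 0, address type 3, name length
    header = bytes([0x05, 0x01, 0x00, 0x03, len(name)])
    # The port is sent as a 2-byte big-endian integer.
    return header + name + struct.pack("!H", port)


request = socks5_connect_request("httpbin.org", 80)
print(request.hex())
```

In practice a client library takes care of this handshake. The point is only that the proxy sees a destination and a byte stream, not requests and responses.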
Tor Proxy Docker Image
We constructed a Docker image which uses the Tor network to expose both SOCKS and HTTP proxies. The image uses the following components:
- Tor
- HAProxy and
- Privoxy.
The relationship between these components is detailed in the figure below.
Tor
Tor provides an anonymous SOCKS proxy. The image will run multiple Tor instances, each of which will (in general) have a different exit node. This means that requests being routed through each instance will appear to come from a distinct IP address.
HAProxy
HAProxy is a high availability proxy server and load balancer (spreads requests across multiple services). HAProxy is used to distribute SOCKS requests across the Tor instances using a round robin scheduling strategy.
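As a rough illustration of round robin scheduling (not HAProxy's actual implementation, which also handles weights and health checks), the Python sketch below hands out backends in a fixed repeating order. The port numbers and the function name are purely illustrative.

```python
from itertools import cycle

# Illustrative internal SOCKS ports, one per Tor instance.
TOR_PORTS = [10000, 10001, 10002, 10003, 10004]


def round_robin(backends):
    """Return a function that yields the next backend on every call."""
    rotation = cycle(backends)
    return lambda: next(rotation)


pick = round_robin(TOR_PORTS)
assignments = [pick() for _ in range(7)]
print(assignments)
# [10000, 10001, 10002, 10003, 10004, 10000, 10001]
```

After the fifth request the rotation wraps around to the first backend again.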
Privoxy
To provide for services that prefer to communicate via HTTP or cannot communicate via SOCKS, Privoxy is used to accept HTTP requests and forward them as SOCKS requests to HAProxy.
Running the Docker Image
Let’s spin up a container and take a look.
docker run \
-p 8800:8800 \
-p 8888:8888 \
-p 1080:1080 \
-p 2090:2090 \
datawookie/medusa-proxy
We’ve mapped a lot of ports.
- 8800: list of proxy URLs (as text/plain)
- 8888: Privoxy port (HTTP protocol)
- 1080: HAProxy port (SOCKS protocol) and
- 2090: HAProxy statistics port.
Not all of them are required, but they all fulfill a distinct purpose. Below are some alternative ways to invoke the image.
# HTTP proxy on port 8888
docker run -p 8888:8888 datawookie/medusa-proxy
# SOCKS proxy on port 1080
docker run -p 1080:1080 datawookie/medusa-proxy
# Both HTTP and SOCKS proxies
docker run -p 8888:8888 -p 1080:1080 datawookie/medusa-proxy
Once we’ve got a running container we can set up a client to use the proxy. We’ll start with a browser and then look at curl on the command line.
Browser
Set up your browser to use the HTTP proxy, which will have an IP address of 127.0.0.1 and port 8888. You could equally choose to use the SOCKS proxy, which will have an IP address of 127.0.0.1 and port 1080.
Once you’ve configured the proxy settings in your browser, head over to What Is My IP Address to check on your effective IP address. Refresh the page to confirm that the IP address changes.
Testing
Further testing is easier to do (and record) on the command line. To illustrate how the proxy works we’ll send requests to http://httpbin.org/ip to retrieve our effective IP address.
First let’s look at the set of IP addresses reported by the container. Below is an extract from the Docker logs.
2021-09-29 04:24:40 [INFO] Testing proxy (port 10000): 195.154.35.52.
2021-09-29 04:24:40 [INFO] Testing proxy (port 10001): 178.20.55.18.
2021-09-29 04:24:40 [INFO] Testing proxy (port 10002): 195.176.3.24.
2021-09-29 04:24:40 [INFO] Testing proxy (port 10003): 185.220.100.252.
2021-09-29 04:24:40 [INFO] Testing proxy (port 10004): 185.220.101.198.
Don’t worry about the port numbers (those are only relevant within the container). Note that there are five proxies (one for each Tor instance) and that each has a different IP address.
Now send out a series of requests.
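curl --proxy http://127.0.0.1:8888 http://httpbin.org/ip
{
  "origin": "195.154.35.52"
}
curl --proxy http://127.0.0.1:8888 http://httpbin.org/ip
{
  "origin": "178.20.55.18"
}
curl --proxy http://127.0.0.1:8888 http://httpbin.org/ip
{
  "origin": "195.176.3.24"
}
curl --proxy http://127.0.0.1:8888 http://httpbin.org/ip
{
  "origin": "185.220.100.252"
}
curl --proxy http://127.0.0.1:8888 http://httpbin.org/ip
{
  "origin": "185.220.101.198"
}
curl --proxy http://127.0.0.1:8888 http://httpbin.org/ip
{
  "origin": "195.154.35.52"
}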
Notice that each request appears to originate from a distinct IP address and that once we’ve cycled through all of the Tor instances, we wrap back to the first one.
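The same check can be scripted. Below is a minimal Python sketch using only the standard library, assuming the HTTP proxy is reachable at 127.0.0.1:8888 as mapped earlier. The function names are hypothetical.

```python
import json
import urllib.request

# Assumption: the container's Privoxy port is mapped to localhost:8888.
PROXY_URL = "http://127.0.0.1:8888"


def make_proxy_opener(proxy_url=PROXY_URL):
    """Build a URL opener which routes HTTP and HTTPS requests via the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def effective_ip(opener, url="http://httpbin.org/ip", timeout=30):
    """Return the origin IP address reported by httpbin for one request."""
    with opener.open(url, timeout=timeout) as response:
        return json.load(response)["origin"]
```

With a container running, calling effective_ip(make_proxy_opener()) in a loop should print a sequence of addresses similar to the curl output above.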
Rotating
The exit nodes are periodically rotated. Some time later we see that we’re using a different set of IP addresses.
2021-09-29 04:39:13 [INFO] Testing proxy (port 10000): 171.25.193.20.
2021-09-29 04:39:13 [INFO] Testing proxy (port 10001): 199.249.230.87.
2021-09-29 04:39:14 [INFO] Testing proxy (port 10002): 185.220.101.132.
2021-09-29 04:39:14 [INFO] Testing proxy (port 10003): 199.249.230.184.
2021-09-29 04:39:15 [INFO] Testing proxy (port 10004): 23.129.64.161.
A Weakness
So this is great, but there’s one major weakness: if any one of the Tor instances became unavailable then the proxy would be marked as broken (regardless of whether the other Tor instances were fine or not). To get around this I have been running multiple containers, each assigned to a different port. With this setup, even if a few of the containers are marked as broken, there are still others which are considered healthy and able to accept requests. Although easy enough to automate, the logistics associated with this setup are a little onerous.
Wouldn’t it be convenient if there was just a single container which exposes multiple proxies, each of which is hooked up to a distinct set of Tor instances?
Beast with Many Heads
So, rather than exposing just a single unit per container, the Medusa proxy can cater for multiple proxy units or heads. What this means is that there can be multiple copies of the components illustrated in the diagram above. All heads are served from the same network location but on different ports.
There are a few environment variables which can be used to tweak the configuration:
- HEADS: Number of heads (default: 2)
- TORS: Number of Tor instances per head (default: 5)
- HAPROXY_LOGIN: Username for HAProxy (default: “admin”)
- HAPROXY_PASSWORD: Password for HAProxy (default: “admin”)
Let’s give this a try. We’ll launch Medusa with 4 heads (each linking to 3 Tor instances) and only map the ports for the HTTP proxies.
docker run \
-e TORS=3 \
-e HEADS=4 \
-p 8800:8800 \
-p 8888:8888 -p 8889:8889 -p 8890:8890 -p 8891:8891 \
datawookie/medusa-proxy
So now we have four proxy connections at ports 8888, 8889, 8890 and 8891. If we look at the Docker logs then we see that there are 3 Tor endpoints for each head.
2021-09-30 08:05:34,606 [INFO] Testing proxies.
2021-09-30 08:05:34,606 [INFO] * Privoxy 0
2021-09-30 08:05:35,295 [INFO] Testing proxy (port 10000): 185.220.101.2.
2021-09-30 08:05:35,753 [INFO] Testing proxy (port 10001): 198.98.62.74.
2021-09-30 08:05:36,041 [INFO] Testing proxy (port 10002): 185.220.100.245.
2021-09-30 08:05:36,041 [INFO] * Privoxy 1
2021-09-30 08:05:36,652 [INFO] Testing proxy (port 10003): 37.123.163.58.
2021-09-30 08:05:37,304 [INFO] Testing proxy (port 10004): 37.187.196.70.
2021-09-30 08:05:37,634 [INFO] Testing proxy (port 10005): 77.68.20.217.
2021-09-30 08:05:37,634 [INFO] * Privoxy 2
2021-09-30 08:05:37,909 [INFO] Testing proxy (port 10006): 89.163.143.8.
2021-09-30 08:05:38,954 [INFO] Testing proxy (port 10007): 185.220.101.10.
2021-09-30 08:05:39,721 [INFO] Testing proxy (port 10008): 195.206.105.217.
2021-09-30 08:05:39,721 [INFO] * Privoxy 3
2021-09-30 08:05:40,050 [INFO] Testing proxy (port 10009): 185.220.100.243.
2021-09-30 08:05:41,010 [INFO] Testing proxy (port 10010): 185.220.101.43.
2021-09-30 08:05:41,415 [INFO] Testing proxy (port 10011): 185.185.170.27.
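Typing out one port mapping per head gets tedious as the number of heads grows. The Python sketch below assembles the docker run arguments, assuming (as in the example above) that the HTTP proxies are numbered sequentially from 8888. The helper name is hypothetical.

```python
def medusa_docker_args(heads, tors, first_http_port=8888,
                       image="datawookie/medusa-proxy"):
    """Build the argument list for launching Medusa with one HTTP port per head."""
    args = ["docker", "run",
            "-e", f"TORS={tors}",
            "-e", f"HEADS={heads}",
            "-p", "8800:8800"]  # proxy list port
    for head in range(heads):
        port = first_http_port + head
        args += ["-p", f"{port}:{port}"]
    args.append(image)
    return args


command = " ".join(medusa_docker_args(heads=4, tors=3))
print(command)
```

The argument list can also be passed directly to subprocess.run if you want to launch the container from a script.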
Proxy List
A proxy list is served as a plain text file on port 8800. This can be used to configure rotating proxies in clients.
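The exact layout of the list is not shown here, so the Python sketch below assumes one proxy URL per line and parses a made-up sample rather than real output. The function name is hypothetical.

```python
def parse_proxy_list(text):
    """Extract proxy URLs from plain text, assuming one URL per line."""
    proxies = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            proxies.append(line)
    return proxies


# Made-up sample standing in for the response from port 8800.
sample = "http://127.0.0.1:8888\nhttp://127.0.0.1:8889\n\n"
proxies = parse_proxy_list(sample)
print(proxies)
```

In a real client the text would be fetched from port 8800 on the container and the resulting list handed to whatever rotation mechanism the client supports.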
Statistics
You can monitor the performance of the proxies via the statistics interface, which each HAProxy instance exposes via a port numbered sequentially from 2090.
Using Medusa with Scrapy
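One possible way to wire Medusa into Scrapy is a small downloader middleware which picks a head for each request. This is a sketch under assumptions rather than a tested recipe: the class name and the MEDUSA_PROXIES setting are made up, and it relies on Scrapy honouring the proxy key in request.meta.

```python
import random


class MedusaProxyMiddleware:
    """Hypothetical downloader middleware: assign a Medusa head to each request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # MEDUSA_PROXIES is a made-up setting holding a list of proxy URLs.
        return cls(crawler.settings.getlist("MEDUSA_PROXIES"))

    def process_request(self, request, spider):
        # Leave requests alone if a proxy has already been chosen for them.
        if "proxy" not in request.meta:
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # let Scrapy continue processing the request
```

The middleware would be registered in DOWNLOADER_MIDDLEWARES with an order below 750, so that it runs before Scrapy's built-in HttpProxyMiddleware, and MEDUSA_PROXIES would list the HTTP proxy URLs (for example http://127.0.0.1:8888 and http://127.0.0.1:8889).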
Technical Details
The image is derived from the official Alpine image, onto which Python 3, Tor, HAProxy and Privoxy are installed. The configuration files for each of the services are generated from Jinja templates.