A proxy can hide your IP, but what happens when that IP gets banned? You would need a new one. Or you could maintain a list of them and rotate proxies for each request. The final option would be to use Smart Rotating Proxies; more on that later.
For now, we will focus on building our custom proxy rotator. We will start from a list of regular proxies, check them to mark the working ones, and provide simple monitoring to remove the failing ones from the working list. The provided examples use Python, but the idea will work in any language.
Let's dive in!
Prerequisites
For the code to work, you will need Python 3 installed. Some systems have it pre-installed. After that, install the necessary libraries by running pip install.
pip install aiohttp
Proxy List
You might not have a proxy provider with a list of domains and ports. Do not worry; we'll see how to get one.
There are several lists of free proxies online. For the demo, grab one of those and save its content (just the URLs) in a text file (rotating_proxies_list.txt). Or use the ones below; a short script to download a list programmatically is sketched after them.
Free proxies aren't reliable, and the ones below probably won't work for you. They are usually short-lived.
167.71.230.124:8080
192.155.107.211:1080
77.238.79.111:8080
167.71.5.83:3128
195.189.123.213:3128
8.210.83.33:80
80.48.119.28:8080
152.0.209.175:8080
187.217.54.84:80
169.57.1.85:8123
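If you'd rather fetch a list programmatically, here is a minimal sketch using only the standard library. The URL is a hypothetical placeholder for whichever free-proxy source you choose, and it assumes the source serves plain ip:port lines.
import urllib.request

# Hypothetical placeholder URL: replace with the free-proxy list you picked.
PROXY_LIST_URL = "https://example.com/free-proxies.txt"

with urllib.request.urlopen(PROXY_LIST_URL) as response:
    content = response.read().decode("utf-8")

# Keep only non-empty lines and write them to the file used in the rest of the article.
lines = [line.strip() for line in content.splitlines() if line.strip()]
with open("rotating_proxies_list.txt", "w") as file:
    file.write("\n".join(lines))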
Then, we will read that file and create an array with all the proxies. Read the file, strip empty spaces, and split each line. Be careful when saving the file since we won't perform any sanity checks for valid IP:port strings. We'll keep it simple.
proxies_list = open("rotating_proxies_list.txt", "r").read().strip().split("\n")
Check Proxies
Let's assume that we want to run the scrapers at scale. The demo is simplified, but the idea would be to store the proxies and their "health state" in a reliable medium like a database. We will use in-memory data structures that disappear after each run, but you get the idea.
First, let's write a simple function to check that the proxy works. For that, call ident.me, which will return the IP. It is a simple page that fits our use case. We will use asyncio and aiohttp, an "Asynchronous HTTP Client/Server" similar to the famous requests. It suits us better since its purpose is to work asynchronously, and it will help us when checking several proxies simultaneously.
For the moment, it takes an item from the proxies list and calls the provided URL. Most of the code is boilerplate that will soon prove useful. There are two possible results:
- If everything goes OK, it prints the response's status code (i.e., 200) and content, which will probably be the proxy's IP.
- An error gets printed due to a timeout or some other reason. It usually means that the proxy is not available or cannot process the request. Many of these will appear when using free proxies.
import aiohttp
import asyncio

proxies_list = open("rotating_proxies_list.txt", "r").read().strip().split("\n")
timeout = aiohttp.ClientTimeout(total=30)

async def get(url, session, proxy):
    try:
        async with session.get(url, proxy=f"http://{proxy}", timeout=timeout) as response:
            print(response.status, await response.text())
    except Exception as e:
        print(e)

async def check_proxies():
    proxy = proxies_list.pop()
    async with aiohttp.ClientSession() as session:
        await get("http://ident.me/", session, proxy=proxy)

asyncio.run(check_proxies())
We intentionally use HTTP instead of HTTPS because many free proxies don't support SSL.
Add More Checks To Validate the Results
An exception means that the request failed, but there are other cases we should check, such as status codes. We will consider only specific codes valid and mark the rest as errors. The list is not exhaustive; adjust it to your needs. You might decide, for example, that 404 "Not Found" isn't valid and that the proxy should be tested again.
We could also add other checks, like validating that the response contains an IP address; a sketch of that check follows the next code block.
VALID_STATUSES = [200, 301, 302, 307, 404]

async def get(url, session, proxy):
    try:
        async with session.get(url, proxy=f"http://{proxy}", timeout=timeout) as response:
            if response.status in VALID_STATUSES: # valid proxy
                print(response.status, await response.text())
            else:
                print(response.status)
    except Exception as e:
        print('Exception: ', type(e))
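As a sketch of the extra check mentioned above, here is one way to verify that the response body looks like an IPv4 address before accepting the proxy. The is_ip_address helper is our own addition, not part of the original script.
import re

# Hypothetical helper: returns True when the text looks like an IPv4 address.
def is_ip_address(text):
    return re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", text.strip()) is not None

async def get(url, session, proxy):
    try:
        async with session.get(url, proxy=f"http://{proxy}", timeout=timeout) as response:
            text = await response.text()
            if response.status in VALID_STATUSES and is_ip_address(text):
                print(response.status, text)  # valid proxy returning an IP
            else:
                print(response.status)  # reachable, but the response doesn't look right
    except Exception as e:
        print('Exception: ', type(e))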
Iterate Over All the Proxies
Great! We now need to run the checks for each proxy in the array. We will loop over the proxy list calling get just as before. But instead of doing it sequentially, we will use asyncio.gather to launch all the requests and wait for them to finish. Async makes the code more complicated, but it speeds up web scraping.
The list is hardcoded to a maximum of 10 items as a precaution, to avoid hundreds of involuntary requests.
async def check_proxies():
    proxies = proxies_list[0:10] # limited to 10 to avoid too many requests
    async with aiohttp.ClientSession() as session:
        tasks = [
            get("http://ident.me/", session, proxy=proxy)
            for proxy in proxies
        ]
        await asyncio.gather(*tasks, return_exceptions=True)
We should also limit the number of concurrent requests. We'll do it using Semaphore, an object that acquires and releases a lock. It maintains an internal counter that allows only so many simultaneous calls (10 in this case), thus capping the concurrency.
We need to change how we call check_proxies too.
sem = asyncio.Semaphore(10)

# ...

async def get(url, session, proxy):
    async with sem:
        try:
            # ...

loop = asyncio.get_event_loop()
loop.run_until_complete(check_proxies())
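If the Semaphore behavior is unclear, here is a tiny standalone sketch, independent of the proxy code, showing that at most 10 of the 30 tasks run at the same time.
import asyncio

async def limited_task(i, sem):
    async with sem:  # at most 10 tasks get past this point simultaneously
        print(f"task {i} running")
        await asyncio.sleep(0.1)

async def main():
    sem = asyncio.Semaphore(10)  # allow up to 10 concurrent tasks
    await asyncio.gather(*[limited_task(i, sem) for i in range(30)])

asyncio.run(main())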
Separate Working Proxies From the Failed Ones
Examining an output log is far from ideal, isn't it? We should keep an internal state for the proxy list. We will separate them into three groups:
- unchecked: unknown state, to be checked.
- working: the last call using this proxy was successful.
- not working: the last request failed.
It is easier to add or remove items from sets than from arrays, and they come with the advantage of avoiding duplicates. We can move proxies between groups without worrying about having the same one twice: if it's already present, it just won't be added. That will simplify our code: remove an item from one set and add it to another. To achieve that, we need to modify the proxy storage slightly.
Three sets will exist, one for each group seen above. The initial one, unchecked, will contain the proxies from the file. A set can be initialized from an array, making it easy for us to create it.
proxies_list = open("rotating_proxies_list.txt", "r").read().strip().split("\n")
unchecked = set(proxies_list[0:10]) # limited to 10 to avoid too many requests
# unchecked = set(proxies_list)
working = set()
not_working = set()

# ...

async def check_proxies():
    async with aiohttp.ClientSession() as session:
        tasks = [
            get("http://ident.me/", session, proxy=proxy)
            for proxy in unchecked # use the new set for the loop
        ]
        await asyncio.gather(*tasks, return_exceptions=True)

#...
Now, write helper functions to move proxies between states, one helper for each state. They will add the proxy to a set and remove it - if present - from the other two. Here is where sets come in handy since we don't need to worry about checking whether the proxy is present or looping over arrays. Calling discard removes the item if present and does nothing otherwise; no exception is raised.
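As a quick aside (not part of the rotator itself), this is the difference between discard and remove:
s = {"167.71.230.124:8080"}
s.discard("1.2.3.4:80")  # not in the set: silently ignored
# s.remove("1.2.3.4:80")  # would raise KeyError
s.discard("167.71.230.124:8080")  # present: removed
print(s)  # set()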
For example, we will call set_working when a request is successful. That function will remove the proxy from the unchecked or not_working sets while adding it to the working set.
def reset_proxy(proxy):
    unchecked.add(proxy)
    working.discard(proxy)
    not_working.discard(proxy)

def set_working(proxy):
    unchecked.discard(proxy)
    working.add(proxy)
    not_working.discard(proxy)

def set_not_working(proxy):
    unchecked.discard(proxy)
    working.discard(proxy)
    not_working.add(proxy)
We are missing the crucial part! We need to edit get to call these functions after each request: set_working for the successful ones and set_not_working for the rest.
async def get(url, session, proxy):
    async with sem:
        try:
            async with session.get(url, proxy=f"http://{proxy}", timeout=timeout) as response:
                if response.status in VALID_STATUSES: # valid proxy
                    set_working(proxy)
                else:
                    set_not_working(proxy)
        except Exception as e:
            set_not_working(proxy)
For the moment, add some traces at the end of the script to see if it's working well. The unchecked set should be empty since we ran all the items, and those items will populate the other two sets. Hopefully, working is not empty - although an empty set might happen with free proxies.
#...
loop = asyncio.get_event_loop()
loop.run_until_complete(check_proxies())
print('unchecked ->', unchecked)
# unchecked -> set()
print('working ->', working)
# working -> {'152.0.209.175:8080', ...}
print('not_working ->', not_working)
# not_working -> {'167.71.5.83:3128', ...}
Using the Working Proxies
That was a straightforward way to check proxies, but it's not truly useful yet. We now need a way to get the working proxies and use them for the real goal: scraping actual content. We will create a function that selects a random proxy.
We include both working and unchecked proxies in our example; feel free to use only the working ones if that fits your needs. We will see later why the unchecked ones are present too.
random.choice doesn't work with sets, so we'll convert them to a tuple.
import random

def get_random_proxy():
    # create a tuple from unchecked and working sets
    available_proxies = tuple(unchecked.union(working))
    if not available_proxies:
        raise Exception("no proxies available")
    return random.choice(available_proxies)
Next, we can edit the get function to use a random proxy if none is provided. The proxy parameter is now optional. We will use that param to check the initial proxies, as we were doing before. But after that, we can forget about the proxy list and call get without it; a random one will be used and added to the not_working set in case of failure.
Since we now want to get actual content, we need to return the response or raise the exception. With aiohttp, unlike requests, the response's content must be awaited. Here is the final version.
async def get(url, session, proxy=None):
    if not proxy:
        proxy = get_random_proxy()

    async with sem:
        try:
            async with session.get(url, proxy=f"http://{proxy}", timeout=timeout) as response:
                if response.status in VALID_STATUSES:
                    set_working(proxy)
                else:
                    set_not_working(proxy)

                await response.text() # content needs to be "awaited"
                return response # return response
        except Exception as e:
            set_not_working(proxy)
            raise e # raise exception
Below that, add the content you want to scrape. For the demo, we will merely call the same test URL once again.
The idea is to use this as the backbone for a real-world scraper. To scale it, store the items in persistent storage, such as a database (e.g., Redis); a brief sketch of that follows the code below.
#....
loop = asyncio.get_event_loop()
loop.run_until_complete(check_proxies())

# real scraping part comes here
async def main():
    async with aiohttp.ClientSession() as session:
        result = await get("http://ident.me/", session)
        print(result.ok) # True
        print(result.status) # 200
        print(await result.text()) # 152.0.209.175

asyncio.run(main())
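As a sketch of that persistence idea (not part of the original script), the three groups could live in Redis sets so the proxy state survives between runs. It assumes the redis package (pip install redis) and a local Redis server; the key names are arbitrary.
import redis

# Assumes a Redis server on localhost; adjust host/port as needed.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def set_working(proxy):
    # Atomically move the proxy into the "working" set.
    pipe = r.pipeline()
    pipe.srem("proxies:unchecked", proxy)
    pipe.sadd("proxies:working", proxy)
    pipe.srem("proxies:not_working", proxy)
    pipe.execute()

def get_random_proxy():
    # SRANDMEMBER picks a random element without removing it.
    proxy = r.srandmember("proxies:working") or r.srandmember("proxies:unchecked")
    if not proxy:
        raise Exception("no proxies available")
    return proxy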
What happens with false negatives or one-time errors? Once we send a proxy to the not_working set, it will remain there forever. There's no way back.
Re-Checking Not Working Proxies
We should re-check the failed proxies from time to time. There are many reasons a proxy might work again: the failure might have been due to a networking issue or a bug, or the proxy provider might have fixed it.
In any case, Python allows us to set Timers, "an action that should be run only after a certain amount of time has passed". There are different ways to achieve the same end, and this one is simple enough to implement in three lines.
Remember the reset_proxy function? We haven't used it until now. We will set a Timer to run that function for every proxy marked as not working. Twenty seconds is a small number for a real-world case but enough for our test: we exclude a failing proxy and move it back to unchecked after some time.
And this is the reason to use both working and unchecked sets in get_random_proxy. Modify that function to use only working proxies for a more robust use case. Then you can run check_proxies periodically, which will loop over the unchecked elements - in this case, failed proxies that spent some time in the sin bin.
from threading import Timer

def set_not_working(proxy):
    unchecked.discard(proxy)
    working.discard(proxy)
    not_working.add(proxy)

    # move to unchecked after a certain time (20s in the example)
    Timer(20.0, reset_proxy, [proxy]).start()
There is a final option for even more robust systems, but we'll leave the implementation up to you. Store analytics and usage data for each proxy, for example, the number of times it failed and when the last failure happened. Using that info, adjust the time before re-checking - longer times for proxies that failed several times. You could even set alerts if the number of working proxies drops below a threshold. A minimal sketch of this idea follows.
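Here is one way to do it, assuming an in-memory failure counter and an exponential backoff; the failure_count dict and the delay formula are illustrative additions, not part of the original script.
from threading import Timer
from collections import defaultdict

failure_count = defaultdict(int)  # proxy -> number of consecutive failures

def set_not_working(proxy):
    unchecked.discard(proxy)
    working.discard(proxy)
    not_working.add(proxy)

    failure_count[proxy] += 1
    # Back off longer for proxies that keep failing: 20s, 40s, 80s, ... capped at 10 minutes.
    delay = min(20.0 * (2 ** (failure_count[proxy] - 1)), 600.0)
    Timer(delay, reset_proxy, [proxy]).start()

def set_working(proxy):
    unchecked.discard(proxy)
    working.add(proxy)
    not_working.discard(proxy)
    failure_count[proxy] = 0  # reset the counter on success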
Conclusion
Building a plain proxy rotator might seem doable for small scraping scripts, but it can grow painful. But, hey, you did it!
These are the steps we followed:
- Store proxy list as plain text
- Import from the file as an array
- Check each of them
- Separate the working ones
- Check for failures while scraping and remove them from the working list
- Re-check not working proxies from time to time
As a note of caution, do not rotate IPs when scraping sites behind a login or any other kind of session/cookies.
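For illustration, a minimal sketch of keeping the same IP for a whole session on top of the code above: pick one proxy up front and pass it explicitly so get never rotates. The logged_in_session function is hypothetical, not part of the original script.
async def logged_in_session(urls):
    # Pin a single proxy so the site sees one consistent IP for the whole session.
    session_proxy = get_random_proxy()
    async with aiohttp.ClientSession() as session:
        for url in urls:
            # Passing the proxy explicitly prevents get() from picking a random one.
            response = await get(url, session, proxy=session_proxy)
            print(response.status, url)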
If you don't want to worry about rotating proxies manually, you can always use ZenRows, a Web Scraping API that includes Smart Rotating Proxies. It works as a regular proxy - with a single URL - but provides different IPs for each request.
Did you find the content helpful? Please, spread the word and share it.
Originally published at https://www.zenrows.com