kaelscion

Bye Bye 403: Building a Filter Resistant Web Crawler Part II: Building a Proxy List


originally published on the Coding Duck blog: www.ccstechme.com/coding-duck-blog

Woohooo! We've got our environment set up and are ready to start building our bot! Seeing as the last post in this series (located here) was mostly informational, let's get right into the code for part two.

So, my assumption is that if you're reading this series, you already know the basics of web scraping in Python and are here to learn how to avoid getting blocked rather than how to scrape. So let's get right into the first line of defense we have against web filters: IP address cycling.

Many web scrapers purchase proxies that they are allowed to change once or twice a month so they don't get filtered by address. But an IP is usually filtered in under 100 requests and, seeing as most scrapers make at least 100 requests an hour, purchasing enough proxy IPs to last a month can get expensive.

There are free options out there, but the disadvantage is that those IP address lists are usually monitored and only last a couple of hours before they're blacklisted. The good news: free proxy providers (good ones, anyway) originate new IPs faster than their old ones get flagged. So we're going to use my personal favorite: free-proxy-list.net
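
For a sense of where this is headed, here is roughly how a proxy from a list like that eventually gets used with requests. This is a minimal sketch, not part of the crawler yet: the address and port are placeholders, and httpbin.org is just a convenient way to see which IP the target server observes.


import requests

# Placeholder proxy pulled from a free list (not a real, working proxy)
proxy = {
    'http': 'http://203.0.113.45:8080',
    'https': 'http://203.0.113.45:8080',
}

# requests routes the call through the proxy, so the target site sees
# the proxy's address instead of ours
r = requests.get('https://httpbin.org/ip', proxies=proxy, timeout=10)
print(r.json())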

To use this effectively and account for the IPs we've used and abandoned, we're going to build a web scraper to retrieve the IPs dynamically! The first step in that process, and the topic of this article, is building our proxy list. Start by creating a new file called proxy_retriever.py that contains the following:


import requests
import pandas as pd

from bs4 import BeautifulSoup

class proxyRetriever():

    def __init__(self):
        self.s = requests.Session()
        self.ip_list = []


Pretty simple so far. We've imported our HTTP library (requests) and our HTML parser (BeautifulSoup), as well as the library we'll use to build the CSV file containing our proxies (pandas). Next, we create a new class called proxyRetriever to handle all the actions involved in building our list. Now, you may want to lay out your project differently as far as the OOP concepts you use. But in my opinion, an application that is as compartmentalized as possible makes life easier for you and for everybody who works with your code after you. Basically, each class handles one general task (building a proxy list, for example), and each function within that class handles one specific task, the more specific the better. This way your code stays DRY, and debugging becomes a cinch: if something doesn't work correctly, you'll have a very good idea of where the problem is based solely on which class and function is responsible for that one, very specific thing.

You will also notice that we are creating a requests Session here rather than simply calling requests.get(). There are a few reasons I find this superior to repeated one-off HTTP requests, but a dedicated post on that is coming soon, so I won't go into them here. Just play along for now.

The second attribute is the ip_list we will be building. Defining our attributes this way makes it easier for the class's functions to share access to that variable without passing data around to each other, which can A) get messy when multiple threads and/or processes become involved, and B) get messy even in a single-threaded environment. To me, it does not matter one whit how elegant your solution is or how brilliant it makes you look if the person who maintains it after you has no clue how to work with it. Making yourself look more clever than the next guy is the kind of A-type grandstanding that has no place in a field as collaborative and diverse as software development and engineering. Simple, readable, easy-to-follow solutions are superior to ones that might be a trifle faster or more efficient but are totally untenable to maintain.

Now that our class is defined and has some initialization behavior, let's add our first function:


    # This method lives inside the proxyRetriever class
    def connect_and_parse(self, website):
        r = self.s.get(website)
        soup = BeautifulSoup(r.text, "html.parser")
        # The first <tbody> on the page is the proxy table
        proxy_table = soup.find('tbody')
        proxy_list = proxy_table.find_all('tr')
        # Keep only the rows marked as "elite" proxies
        elites = [tr for tr in proxy_list if 'elite' in tr.text]
        tds = []
        for tr in elites:
            tds.append([td.text for td in tr.find_all('td')])
        return tds


With the principle of "one function per function" in mind, connect_and_parse takes a web address as an argument (in this case 'https://free-proxy-list.net/'), connects via our requests Session, and pulls down the HTML located at that address. For the sake of simplicity, I tend to use the built-in "html.parser" backend from Python's standard library, rather than rely on another dependency like lxml.

Pro Tip: Remember, BeautifulSoup is a markup parser. It works with markup strings and nothing else. I see many folks get into a groove, pass the response object (in this case r) straight to BeautifulSoup, and hit an exception. Our response is an object that contains properties BeautifulSoup can work with (like .text and .content), but it is useless to the parser by itself. Just remember to pass r.text or r.content to avoid that annoying holdup.
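
In code, the gotcha looks like this (a tiny standalone sketch of the same idea):


import requests
from bs4 import BeautifulSoup

r = requests.get('https://free-proxy-list.net/')

# soup = BeautifulSoup(r, "html.parser")     # raises an exception: r is a Response object, not markup
soup = BeautifulSoup(r.text, "html.parser")  # works: .text is the decoded HTML string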

Using the browser tools of your preference (I use Chrome DevTools), examine the site's markup and you'll find that the table containing our proxy information sits inside the first 'tbody' element on the page. Calling soup.find('tbody') returns the first occurrence of that element, which we assign to the proxy_table variable.

To create a list of the rows of this table, assign proxy_table.find_all('tr') to the proxy_list variable.

Pro Tip: Any item returned by soup.find() is a BeautifulSoup Tag that you can parse further with find(), find_all(), find_next_sibling(), etc. However, soup.find_all() returns a list of BeautifulSoup objects. Even if that list contains only one item, you still need to access it specifically before parsing further. In this example, if we tried to call something like proxy_list.find('some-other-element'), we would get an exception, because proxy_list is a list containing soup objects rather than a soup object itself. To further parse the results of find_all(), iterate through the list or access an item by its index.
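
A quick illustration of the difference, using a toy table (the HTML here is made up just for the example):


from bs4 import BeautifulSoup

html = "<table><tbody><tr><td>1.2.3.4</td><td>8080</td></tr></tbody></table>"
soup = BeautifulSoup(html, "html.parser")

proxy_table = soup.find('tbody')          # a single Tag: safe to keep parsing
rows = proxy_table.find_all('tr')         # a ResultSet (a list of Tags)

first_row_cells = rows[0].find_all('td')  # index into the list before parsing further
# rows.find('td')                         # AttributeError: the list itself has no find()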

The next line creates a list of "elite" status proxies. These proxies are fully anonymous and very difficult for filters to pin down and block on the fly. Most of them also support HTTPS, so your traffic can be both anonymous and encrypted with SSL, both things we want our crawler to have. This is done with a list comprehension, which you can get a crash course in here

The next line is an example of readability versus cleverness. We are creating a list of the cells in each row stored in the elites list. This could be done with a single nested list comprehension, but that would be a bit of a brain teaser: harder to read, and you'd need to go over it several times just to follow the logic. That would make debugging any issue it later develops take much longer than necessary. So instead, we use a traditional for loop to walk the rows in elites, then leverage a simpler, more readable list comprehension to pull the cells of each row into the proxy details we are collecting.
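
For comparison, the "clever" version would be a single nested comprehension. It produces the same tds list as the loop above, but it packs both loops into one line, which is exactly what slows the next reader down:


# Equivalent to the for loop in connect_and_parse, but harder to scan at a glance
tds = [[td.text for td in tr.find_all('td')] for tr in elites]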

Cool! So we've got our proxy information in a neat little list of lists. Now to process it and toss it into a CSV file we can reference later:


    # This method also lives inside the proxyRetriever class
    def clean_and_sort(self, data_set):
        columns = ['Address', 'Port', 'Country_Code', 'Country', 'Proxy_Type']
        mapped_lists = {}
        i = 0
        # Only the first five cells of each row matter to us
        while i < 5:
            mapped_lists.update({columns[i]: [row[i] for row in data_set]})
            i += 1
        df = pd.DataFrame(mapped_lists, columns=columns)
        df.to_csv('proxy_ips.csv')

This next function takes the output of our previous function (our collection of lists of proxy details), cleans it, sorts it, and writes it out to a CSV. The columns variable is a list containing the column names we'll use for the CSV file. We then create an empty dictionary (mapped_lists) and use a while loop to process only the first 5 items in each list we built in connect_and_parse. This is necessary because, if we re-examine free-proxy-list.net, the table contains 8 columns per row and we only care about the first 5. Now, we could have written our scraper to retrieve only the first 5. However, connect_and_parse is not responsible for cleaning and sorting our data, only for grabbing it and returning it as an unceremonious blob. Is this the most efficient solution? Perhaps, perhaps not. But when it comes time to debug a problem with sorting the data once it's been retrieved, you'll be glad you went about it this way: no matter what you change (and potentially break) while debugging clean_and_sort, connect_and_parse will still work just fine. Could you say the same if both responsibilities were given to connect_and_parse alone?
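
To make the column-mapping step concrete, here's a small worked example with made-up rows. The dict comprehension is just a compact stand-in for the function's while loop, and the values are placeholders shaped like what connect_and_parse returns:


# Two hypothetical rows as returned by connect_and_parse (8 cells each)
data_set = [
    ['203.0.113.45', '8080', 'US', 'United States', 'elite proxy', 'no', 'yes', '1 minute ago'],
    ['198.51.100.7', '3128', 'DE', 'Germany', 'elite proxy', 'no', 'yes', '2 minutes ago'],
]

columns = ['Address', 'Port', 'Country_Code', 'Country', 'Proxy_Type']

# Keep only the first five cells of each row, grouped under the matching column name
mapped_lists = {columns[i]: [row[i] for row in data_set] for i in range(5)}
# {'Address': ['203.0.113.45', '198.51.100.7'], 'Port': ['8080', '3128'], ...}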

From there, we call on pandas to create a DataFrame from the data in our mapped_lists dictionary, using our columns list as the column headers, and export the DataFrame to our new file, proxy_ips.csv. Once opened, the contents will look like this:

[screenshot: the resulting proxy_ips.csv opened as a spreadsheet]
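
Putting the pieces together, a small driver script for the class might look like this. The class, methods, and output file match what we wrote above; wiring them up this way is just one reasonable sketch of how to run it:


from proxy_retriever import proxyRetriever

pr = proxyRetriever()

# Scrape the raw proxy rows, then clean them and write proxy_ips.csv
raw_rows = pr.connect_and_parse('https://free-proxy-list.net/')
pr.clean_and_sort(raw_rows)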

Okay! Our first line of defense is in place! But before we can start going after the data we want, we have one more common filtering vector we need to confuse: User-Agents. Tune in for part three to continue your studies in filter-proofing your web crawler!

Latest comments (2)

Ben Halpern

Very cool project!

kaelscion

I always thought so! Thanks. One of my personal projects over the past couple years is a web-server level filter meant to detect bots like the end product of this entire post series, as well as those that use more advanced methods such as reverse engineering JSON requests and simply asking the web server for the information in the header rather than parsing HTML, XML, or using webdrivers etc. This series is meant to introduce novice web scrapers to the idea of fooling the current, admittedly kinda dumb, filters on most websites and to help expose the idea of how arbitrary most filter checks are on the modern web. My hope is that, if enough people know about the problem (in the detail of how to execute it), others will rally to the cause of stopping offshore trolls from making it necessary to have an explicit view in Google Analytics that factors out bot traffic and makes us depressed about how many human beings actually visit our sites and services :D