Whether it be Kaggle, Google Cloud, or the federal government, there's plenty of reliable open-sourced data on the web. While there are plenty of reasons to hate being alive in our current chapter of humanity, open data is one of the few redeeming qualities of life on Earth today. But what is the opposite of "open" data, anyway?
As with anything free and easily accessible, open data is rarely the valuable kind: the only data inherently worth anything is either harvested privately or stolen from sources that would prefer you didn't. This is the sort of data business models can be built around, as social media platforms such as LinkedIn have shown us while our personal information is bought and sold by data brokers. These companies attempted to sue individual programmers like ourselves for scraping the data they collected via the same means, and epically lost in a court of law.
The topic of scraping data on the web tends to raise questions about the ethics and legality of scraping, to which I plead: don't hold back. If you aren't personally disgusted by the prospect of your life being transcribed, sold, and frequently leaked, the court system has ruled that you legally have a right to scrape data. The name of this publication is not People Who Play It Safe And Slackers. We're a home for those who fight to take power back, and we're going to scrape the shit out of you.
Web scraping in Python is dominated by three major libraries: BeautifulSoup, Scrapy, and Selenium. Each of these libraries solves for a very different use case, so it's essential to understand what we're choosing and why.
- BeautifulSoup is one of the most prolific Python libraries in existence, in some part having shaped the web as we know it. BeautifulSoup is a lightweight, easy-to-learn, and highly effective way to programmatically isolate information on a single webpage at a time. It's common to use BeautifulSoup in conjunction with the requests library, where requests will fetch a page, and BeautifulSoup will extract the resulting data.
- Scrapy has an agenda much closer to mass pillaging than BeautifulSoup. Scrapy is a tool for building crawlers: these are absolute monstrosities unleashed upon the web like a swarm, loosely following links, and hastily grabbing data where data exists to be grabbed. Because Scrapy serves the purpose of mass-scraping, it is much easier to get in trouble with Scrapy.
- Selenium isn't exclusively a scraping tool as much as an automation tool that can be used to scrape sites. Selenium is the nuclear option for attempting to navigate sites programmatically, and should be treated as such: there are much better options for simple data extraction.
We'll be using BeautifulSoup, which should genuinely be anybody's default choice until the circumstances ask for more. BeautifulSoup is more than enough to steal data.
Before we steal any data, we need to set the stage. We'll start by installing our two libraries of choice, requests and beautifulsoup4, which pip handles with pip install requests beautifulsoup4.
As mentioned before, requests will provide us with our target's HTML, and beautifulsoup4 will parse that data.
We need to recognize that a lot of sites have precautions to fend off scrapers from accessing their data. The first thing we can do to get around this is spoofing the headers we send along with our requests to make our scraper look like a legitimate browser:
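A sketch of what such headers might look like follows. The specific User-Agent string and the other values are illustrative assumptions, not values any particular site requires:

```python
# Illustrative request headers that mimic a desktop browser.
# The exact values are assumptions; any realistic browser User-Agent serves the same purpose.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
```

A dictionary like this is handed to requests through the headers keyword argument, as in requests.get(url, headers=headers).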
This is only a first line of defense (or offense, in our case). There are plenty of ways sites can still keep us at bay, but setting headers works shockingly well to fix most issues.
Now let's fetch a page and inspect it with BeautifulSoup:
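A minimal sketch of that fetch-and-parse step might look like this. The fetch_soup helper name, the timeout, and the User-Agent value are assumptions made for illustration:

```python
import requests
from bs4 import BeautifulSoup

# Assumed browser-like headers; the User-Agent value is only an example.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def fetch_soup(url):
    """Fetch a URL and return its HTML parsed into a BeautifulSoup object (hypothetical helper)."""
    req = requests.get(url, headers=headers, timeout=10)
    # req.content is the raw response body; 'html.parser' is Python's built-in HTML parser.
    soup = BeautifulSoup(req.content, "html.parser")
    return soup
```

Calling fetch_soup("http://example.com") performs the request and hands back the parsed page.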
We set things up by making a request to http://example.com. We then create a BeautifulSoup object which accepts the raw content of that response via req.content. The second parameter, 'html.parser', is our way of telling BeautifulSoup that this is an HTML document. There are other parsers available for parsing stuff like XML, if you're into that.
When we create a BeautifulSoup object from a page's HTML, our object contains the HTML structure of that page, which can now be easily parsed by all sorts of methods. First, let's see what our variable soup looks like by printing it out.
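One way to get a readable printout is BeautifulSoup's prettify() method, which returns the parsed document as an indented string. The HTML below is a small stand-in assumed for illustration, so the sketch runs without a network request:

```python
from bs4 import BeautifulSoup

# Stand-in HTML (an assumption) so this runs without fetching anything.
html = "<html><body><h1>Example Domain</h1><p>Some text.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the document as a string with one tag per line, indented by depth.
pretty = soup.prettify()
print(pretty)
```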
There are many methods available to us for pinpointing and grabbing the information we're trying to get out of a page. Finding the exact information we want out of a web page is a bit of an art form: effective scraping requires us to recognize patterns in a document's HTML that we can take advantage of to ensure we only grab the pieces we need. This is especially the case when dealing with sites that actively try to prevent us from doing just that.
Understanding the tools we have at our disposal is the first step to developing a keen eye for what's possible. We'll start with the meat and potatoes.
The most straightforward way to find information in our soup variable is by using soup.find(...) or soup.find_all(...). These two methods work the same with one exception: find returns the first HTML element found, whereas find_all returns a list of all elements matching the criteria (even if only one element is found, find_all will return a list of a single item).
We can search for DOM elements in our soup variable by searching for certain criteria. Passing a tag name such as "a" as the positional argument to find_all will return all anchor tags on the page:
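Here is a small sketch of both methods side by side. The HTML is made up for illustration:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<div>
  <a href="/one">One</a>
  <a href="/two">Two</a>
  <span>Not a link</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

anchors = soup.find_all("a")   # a list of every <a> tag
first_anchor = soup.find("a")  # only the first <a> tag

print(len(anchors))
print(first_anchor["href"])
```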
We can also find all anchor tags which have the class name "boy". Passing the class_ argument allows us to filter by class name. Note the underscore!
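A quick sketch of filtering anchors by class, using made-up HTML. The underscore exists because class is a reserved word in Python:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<a class="boy" href="/tom">Tom</a>
<a class="girl" href="/ann">Ann</a>
<span class="boy">Sam</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Only <a> tags whose class attribute includes "boy".
boy_links = soup.find_all("a", class_="boy")
print([link.get_text() for link in boy_links])
```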
If we wanted to get every element with the class name "boy", not just anchor tags, we can do that too:
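Dropping the tag name and keeping only class_ does the trick. Again, the HTML here is made up for illustration:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<a class="boy" href="/tom">Tom</a>
<a class="girl" href="/ann">Ann</a>
<span class="boy">Sam</span>
"""
soup = BeautifulSoup(html, "html.parser")

# No tag name given, so any tag carrying the "boy" class matches.
boys = soup.find_all(class_="boy")
print([tag.name for tag in boys])
```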
We can search for elements by id in the same way we searched for classes. Remember that we should only expect a single element to be returned with an id, so we should use find here:
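A sketch of an id lookup follows. The ids "sidebar" and "home-link" are hypothetical names chosen for this example:

```python
from bs4 import BeautifulSoup

# Made-up HTML; the ids are hypothetical.
html = '<div id="sidebar"><a id="home-link" href="/">Home</a></div>'
soup = BeautifulSoup(html, "html.parser")

# Combine a tag name with an id...
home_link = soup.find("a", id="home-link")
# ...or search by id alone.
sidebar = soup.find(id="sidebar")

print(home_link["href"])
print(sidebar.name)
```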
Oftentimes we'll run into situations where elements don't have reliable class or id values. Luckily we can search for DOM elements with any attribute, including non-standard ones:
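Arbitrary attributes go through the attrs dictionary, which is handy for names like data-* that aren't valid Python keyword arguments. The data-role attribute and its values below are hypothetical:

```python
from bs4 import BeautifulSoup

# Made-up HTML; "data-role" is a hypothetical non-standard attribute.
html = """
<div data-role="teaser">Teaser one</div>
<p data-role="teaser">Teaser two</p>
<div data-role="footer">Footer</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Match any tag whose data-role attribute equals "teaser".
teasers = soup.find_all(attrs={"data-role": "teaser"})
print([tag.get_text() for tag in teasers])
```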
Searching HTML using CSS selectors is one of the most powerful ways to find what you're looking for, especially for sites trying to make your life difficult. Using CSS selectors enables us to find and leverage highly-specific patterns in the target's DOM structure. This is the best way to ensure we're grabbing exactly the content we need. If you're rusty on CSS selectors, I highly recommend becoming reacquainted. Here are a few examples:
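As a sketch, select() takes a CSS selector string and returns a list of matching elements. The "widget" and "author" class names match the description that follows, while the HTML itself is made up:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<div class="widget author">
  <p>First paragraph</p>
  <p>Second paragraph</p>
</div>
<div class="widget">
  <p>Unrelated paragraph</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Paragraphs inside an element that has BOTH the "widget" and "author" classes.
author_paragraphs = soup.select(".widget.author p")
print([p.get_text() for p in author_paragraphs])
```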
In this example, we're looking for an element that has a "widget" class, as well as an "author" class. Once we have that element, we go deeper to find any paragraph tags held within that widget. We could also modify this to get only the second paragraph tag inside the author widget:
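One way to express "only the second paragraph" is the :nth-of-type(2) pseudo-class, sketched here against made-up HTML:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<div class="widget author">
  <p>First paragraph</p>
  <p>Second paragraph</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() still returns a list, even when only one element can match.
second_paragraph = soup.select(".widget.author p:nth-of-type(2)")
print(second_paragraph[0].get_text())
```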
To understand why this is so powerful, imagine a site that intentionally has no identifying attributes on its tags to keep people like you from scraping their data. Even without names to select by, we could observe the DOM structure of the page and find a unique way to navigate to the element we want:
soup.select("body > div:first-of-type > div > ul li")
A specific pattern like this is likely unique to only a single collection of <li> tags on the page we're exploiting. The downside of this method is we're at the whim of the site owner, as their HTML structure could change.
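To see that selector at work, here is a sketch run against a made-up page with no classes or ids at all:

```python
from bs4 import BeautifulSoup

# Made-up HTML with no identifying attributes anywhere.
html = """
<body>
  <div>
    <div>
      <ul><li>Target one</li><li>Target two</li></ul>
    </div>
  </div>
  <div>
    <div>
      <ul><li>Decoy</li></ul>
    </div>
  </div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# Only the first top-level <div> qualifies, so the decoy list is skipped.
items = soup.select("body > div:first-of-type > div > ul li")
print([li.get_text() for li in items])
```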
Chances are we'll almost always want the contents or the attributes of a tag, as opposed to the entirety of a tag's HTML. If we're scraping anchor tags, for instance, we probably just want the href value, as opposed to the entire tag. The .get method can be used here to retrieve values of attributes on a tag:
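A sketch using made-up HTML. Note that .get returns None for a missing attribute instead of raising an error:

```python
from bs4 import BeautifulSoup

# Made-up HTML; the last anchor deliberately has no href.
html = """
<a href="/one">One</a>
<a href="https://example.com/two">Two</a>
<a name="top">No destination</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect the href of every <a> tag; tags without one yield None.
hrefs = [a.get("href") for a in soup.find_all("a")]
print(hrefs)
```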
The above finds the destination URLs for all <a> tags on a page. Another example can have us grab a site's logo image:
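How a logo is marked up varies from site to site, so this sketch assumes a hypothetical page that tags its logo with id="logo":

```python
from bs4 import BeautifulSoup

# Made-up HTML; the id "logo" is an assumption about the target page.
html = '<header><img id="logo" src="/static/logo.png" alt="Site logo"></header>'
soup = BeautifulSoup(html, "html.parser")

# Find the logo image, then pull out its src attribute.
logo_src = soup.find("img", id="logo").get("src")
print(logo_src)
```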
Sometimes it's not attributes we're looking for, but just the text within a tag:
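For text, get_text() returns everything inside a tag with nested markup removed. A sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

# Made-up HTML with a nested tag inside the heading.
html = "<h1>Breaking: <em>scrapers</em> win in court</h1>"
soup = BeautifulSoup(html, "html.parser")

# get_text() joins the text of the tag and all of its children.
headline = soup.find("h1").get_text()
print(headline)
```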
In our example of creating link previews, a good first source of information would obviously be the page's meta tags: specifically the og tags they've specified to openly provide the bite-sized information we're looking for. Grabbing these tags is a bit more difficult:
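A sketch of one such lookup, using og:description and a made-up page. The tag name alone isn't enough, so we also filter on the property attribute before reading content:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<head>
  <meta charset="utf-8">
  <meta property="og:title" content="A Page">
  <meta property="og:description" content="A short summary.">
</head>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name + property attribute pick the right meta tag; .get reads its content.
description = soup.find("meta", property="og:description").get("content")
print(description)
```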
Now that's ugly. Meta tags are an especially interesting case; they're all uselessly dubbed 'meta', thus we need a second identifier (in addition to the tag name) to specify which meta tag we care about. Only then can we bother to get the actual content of said tag.
If we were to try the above selector on an HTML page that did not contain an og:description, our script would break unforgivingly. Not only do we miss this data, but we miss out on everything entirely. This means we always need to build in a plan B, and at the very least deal with a missing tag.
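The failure is easy to reproduce: find returns None when nothing matches, and None has no .get method. A minimal sketch of the problem and the simplest guard against it, using made-up HTML:

```python
from bs4 import BeautifulSoup

# Made-up page with no og tags at all.
html = "<html><head><title>No OG tags here</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("meta", property="og:description")
print(tag)  # find returns None when nothing matches

# Calling tag.get("content") on None would raise AttributeError, so check first.
description = tag.get("content") if tag else None
print(description)
```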
It's best to break out this logic one tag at a time. First, let's look at an example for a base scraper with all the knowledge we have so far:
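A base scraper along these lines could be sketched as follows. The function names, the timeout, and the header value are assumptions, and the three helpers are kept to the simplest possible lookups here, without the fallbacks discussed afterwards:

```python
import requests
from bs4 import BeautifulSoup

# Assumed browser-like headers; the User-Agent value is only an example.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def get_title(html):
    """Simplest possible lookup: the <title> tag's text, or None."""
    return html.title.get_text(strip=True) if html.title else None


def get_description(html):
    """Simplest possible lookup: the description meta tag's content, or None."""
    tag = html.find("meta", attrs={"name": "description"})
    return tag.get("content") if tag else None


def get_image(html):
    """Simplest possible lookup: the og:image meta tag's content, or None."""
    tag = html.find("meta", property="og:image")
    return tag.get("content") if tag else None


def extract_metadata(html):
    """Build the metadata dictionary from an already-parsed page."""
    return {
        "title": get_title(html),
        "description": get_description(html),
        "image": get_image(html),
    }


def scrape_page_metadata(url):
    """Fetch a URL and return a dictionary of whatever metadata we could scrape."""
    req = requests.get(url, headers=HEADERS, timeout=10)
    html = BeautifulSoup(req.content, "html.parser")
    metadata = extract_metadata(html)
    return metadata
```

Splitting the parsing into extract_metadata is a design choice of this sketch: it lets the extraction logic be exercised on a local HTML string without touching the network.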
This function lays the foundation for snatching a given URL's metadata. The result we're looking for is a dictionary named metadata, which contains the data we manage to scrape successfully.
Each key in our dictionary has a corresponding function which attempts to scrape the corresponding information. Here's what we have for fetching a page's title, description, and social image values:
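The three functions could be sketched like this, each trying progressively less reliable sources. The _meta_content helper is hypothetical, and the exact meta tag names checked (og:* via property, twitter:* via name) are assumptions based on how those tags are commonly written:

```python
from bs4 import BeautifulSoup


def _meta_content(html, **attrs):
    """Return the content of the first <meta> tag matching attrs, or None (hypothetical helper)."""
    tag = html.find("meta", attrs=attrs)
    return tag.get("content") if tag else None


def get_title(html):
    """Try <title>, then og:title, then twitter:title, then the first <h1>."""
    if html.title and html.title.string:
        return html.title.string.strip()
    title = _meta_content(html, property="og:title") or _meta_content(html, name="twitter:title")
    if title:
        return title
    h1 = html.find("h1")
    return h1.get_text(" ", strip=True) if h1 else None


def get_description(html):
    """Try the description meta tags, then fall back to the first paragraph."""
    description = (
        _meta_content(html, name="description")
        or _meta_content(html, property="og:description")
        or _meta_content(html, name="twitter:description")
    )
    if description:
        return description
    p = html.find("p")
    return p.get_text(" ", strip=True) if p else None


def get_image(html):
    """Try the share-image meta tags, then fall back to the first <img> that has a src."""
    image = _meta_content(html, property="og:image") or _meta_content(html, name="twitter:image")
    if image:
        return image
    img = html.find("img", src=True)
    return img.get("src") if img else None
```

Every function returns None when all of its sources come up empty, so a single missing tag never takes down the rest of the scrape.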
- get_title tries to get the <title> tag, which has a very low chance of failing. Just in case the target page actually is missing this tag, we fall back to Facebook and Twitter meta tags. If all of this still fails, we finally resort to trying to pull the first <h1> tag on the page (if we get to this point, we're probably scraping a garbage site).
- get_description is nearly identical to our method for scraping page titles. The last resort is a desperate attempt to pull the first paragraph on the page.
- get_image looks for the page's "share" image, which is used to generate link previews on social media platforms. Our last resort is to pull the first <img> tag containing a source image.
This simple script we just threw together is the basis for how most services generate "link previews": an embedded widget containing a synopsis of a site before clicking in (think Facebook, Slack, Discord, etc.). There are even some services which charge around $10/month to provide the service we've just built. Instead of paying for something like that, feel free to take my source code and use it as you please.
I've uploaded the source code for this tutorial to GitHub, which contains instructions on how to download and run this script yourself. Enjoy, and join us next time when we up the ante with more nefarious scraping tactics!