Unit Testing Your Web Scraper

#python #webscraping #tdd #datascience

Goals:

By the end of this tutorial, you will have a starting point for writing unit tests for a web scraper. I also hope it motivates you to learn more about test-driven development (TDD). This tutorial is less a step-by-step how-to and more a suggestion for how to set up and think about testing your web scraping scripts.

Tools and prereqs:

You should know how to run tests with pytest. If you don't, I suggest reading the first part of clean architecture, listed in the resources section, as a primer.
It will also help to have done at least one web scrape using Beautiful Soup before going through this tutorial.

Step 1

The example website we will be using is https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html. This website is a service to practice web scraping.
The first step is to decide what data we will want to collect.

Luckily, the "Product information" section makes this an easy task. We will collect the fields: UPC, Product Type, Price (excl. tax), Tax, and Availability (the number in stock).

Step 2

My first suggestion applies before we write any code. I think all of your cleaning code should be organized together and tested together. Therefore, you should separate your cleaning code into a "clean.py" file.

touch clean.py

Step 3

With test-driven development, you write your test before writing your program code. I like to follow a format similar to the one Ken Youens-Clark uses in "Tiny Python Projects" (link below). He starts by testing that the file exists in the right location. He then separates each test with a commented divider line.

import os

prg = './clean.py'


# -------------------------------------------------
def test_exists():
    """checks if the file exists"""

    assert os.path.isfile(prg)

You can run the test by typing this:
pytest -xv test.py
This test should pass. If it doesn't, make sure clean.py is in the directory you are running pytest from.

Step 4

For the next step, we go back to the Product Information section of the webpage. We want to store the Price and Tax values as floats. However, the scraped text includes the pound symbol, which prevents Python from converting it directly to a float. This problem lends itself to our first test.

# -------------------------------------------------
def test_price():
    """£51.77 -> 51.77 type float"""

    # monetary will live in clean.py, so test.py needs to import it from there
    res = monetary('£51.77')
    assert res == 51.77
    assert isinstance(res, float)

Now when we run the test with:
pytest -xv test.py
we will see an error, because we have not written monetary yet. That is a great thing.

I want to take the time to discuss this step a little more because it is the main focus of the tutorial. There are much better TDD sources and excellent web scraping tutorials, but I rarely see advice on where to start with TDD for your scrapes. Starting with how you want your data to look is a great way to get started with TDD and a great way to ensure your data is clean. As a data analyst, data engineer, or data scientist, you will most likely have several data cleaning steps. When you are getting your data from the web, this can be your first one.
I know I got a bit wordy, so here is a summary of these thoughts and of the tutorial: write a test that reflects what your data should look like, and then write the code.

Step 5

Now we can write the code for this test. This is how I chose to write it; there is more than one way. When I write code for small projects, I like to keep in mind two pieces of advice that I have read from really smart developers. First, get the code to pass the test and make sure it does what you are trying to get done. Second, don't overcomplicate it with features you think you'll need in the future. Here is my code:

import re


def monetary(value_field):
    """Returns a float for items that are values"""
    amount = re.sub('[^0-9.]', '', value_field)
    return float(amount)

Sure, I could have written it this way:

def monetary(value_field):
    """Returns a float for items that are values"""
    amount = value_field[1:]
    return float(amount)

and for this project, it would be fine. I used regex because any time I see stray characters, I think to myself, "just use regex". However, it doesn't matter as long as you are getting the desired result.
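To make the trade-off between the two versions concrete, here is a small side-by-side sketch. The names monetary_regex and monetary_slice are hypothetical labels for the two versions above, and the messy input is an assumed example of text that arrives with an extra character in front of the pound sign (for instance, from a decoding problem). It is only an illustration, not part of the project code.

```python
import re


def monetary_regex(value_field):
    """Keep only digits and the decimal point, then convert to float."""
    return float(re.sub(r'[^0-9.]', '', value_field))


def monetary_slice(value_field):
    """Drop exactly one leading character, then convert to float."""
    return float(value_field[1:])


# Both versions agree on the clean input used in the test.
print(monetary_regex('£51.77'), monetary_slice('£51.77'))

# Assumed messy input: one stray character before the pound sign.
messy = 'Â£51.77'
print(monetary_regex(messy))
try:
    monetary_slice(messy)
except ValueError as err:
    # Slicing removed only the stray character, so '£' is still there.
    print('slice version failed:', err)
```

If your source text is always a single symbol followed by the number, both pass the same test. The regex version simply tolerates more variation in what comes before the digits.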

Step 6

Now it is time for you to do the same thing for writing a test and code that returns an integer for the availability column! You can see my complete code and "answer" at the link below under resources.
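If you get stuck, here is one possible shape for the exercise. This is a sketch and not the "answer" from my project: the function name in_stock is made up for illustration, and the input string assumes the availability text looks like "In stock (22 available)". Try it yourself first, and adjust the test to match the text you actually scrape.

```python
import re


def in_stock(availability_field):
    """Return the first whole number found in the availability text.

    Hypothetical helper; the name and the input format are assumptions.
    """
    match = re.search(r'\d+', availability_field)
    if match is None:
        raise ValueError(f'no quantity found in {availability_field!r}')
    return int(match.group())


# -------------------------------------------------
def test_in_stock():
    """In stock (22 available) -> 22 type int"""

    res = in_stock('In stock (22 available)')
    assert res == 22
    assert isinstance(res, int)
```

As in Step 4, the test would be written first and would fail until in_stock exists in clean.py.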

Conclusion

We went over a starting point for creating unit tests for a web scraper. Thank you for reading, and please let me know if you have any questions or suggestions!

Quick side note

In my project, I save the data as a NamedTuple in a model file. There will be a link at the end of this article with more information about NamedTuples if you have not used one before.
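For readers who have not seen one, here is a minimal sketch of what such a model could look like. The class name Book, the field names, and the sample values are all made up for illustration; the model file in my project may differ.

```python
from typing import NamedTuple


class Book(NamedTuple):
    """One cleaned product record; names here are illustrative only."""
    upc: str
    product_type: str
    price_excl_tax: float
    tax: float
    availability: int


# Placeholder values, not real scraped data.
book = Book(
    upc='example-upc',
    product_type='Books',
    price_excl_tax=51.77,
    tax=0.0,
    availability=22,
)

print(book.price_excl_tax)  # fields are read by name
print(book._asdict())       # or converted to a dict for saving
```

The type annotations document what the cleaning functions are expected to return, which pairs nicely with tests that check those same types.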