
Web Scraping With RegEx

The demand for digital content has grown exponentially, and the resulting competition pushes websites to change rapidly. Quick updates benefit general consumers, but they are a considerable hassle for businesses that collect public data: web scraping relies on routines tailored to the specific conditions of each website, and frequent updates tend to break them.

Let’s see how RegEx comes into play here by simplifying some of the trickier parts of the acquisition and parsing process.

What is RegEx?

RegEx stands for Regular Expressions, a way to describe text patterns. A pattern acts as a filter: any input that matches it is captured, which makes it easy to extract the desired output.

How to use RegEx for web scraping?

RegEx can be used to validate all types of character combinations, including special characters like line breaks. One of its biggest advantages is that, no matter the type or size of the input, it is always matched against the same single expression, which keeps the code concise and efficient.

Regular Expressions are universal and can be implemented in any programming language.
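As a minimal illustration of the idea, here’s how Python’s built-in re module applies a single pattern to a string (the sample text is invented for this example):

import re

# \d+\.\d{2} matches one or more digits, a dot, and exactly two digits.
text = "Books cost £12.99 and £5.50 at the sale."
print(re.findall(r"\d+\.\d{2}", text))  # ['12.99', '5.50']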

Overview of RegEx tokens

Token          Matches
^              Start of a string
$              End of a string
.              Any character (except \n)
|              Characters on either side of the symbol
\              Escapes special characters
Char           The character given
*              Zero or more of the previous character
?              Zero or one of the previous character
+              One or more of the previous character
{Digit}        Exactly that number of repetitions
{Digit,Digit}  A number of repetitions within that range
\d             Any digit
\s             Any whitespace character
\w             Any word character
\b             Word boundary character
\D             Inverse of \d
\S             Inverse of \s
\W             Inverse of \w
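To make a few of these tokens concrete, here is a small example (the sample strings are invented for illustration):

import re

print(re.findall(r"^\w+", "Hello world"))        # ['Hello'] - ^ anchors to the start
print(re.findall(r"\d{4}", "Year: 2023"))        # ['2023'] - exactly four digits
print(re.findall(r"cats?", "cat cats catalog"))  # ['cat', 'cats', 'cat'] - ? makes the s optional
print(re.findall(r"\bcat\b", "cat catalog"))     # ['cat'] - \b stops partial matches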

Collecting data using RegEx

In this tutorial, the scraping target is book titles and prices from https://books.toscrape.com/, a dummy website intended for training purposes.

Project requirements:

Python 3

Requests

Beautiful Soup 4

re (built into Python)

Importing libraries

Let’s begin with creating a virtual environment for the project:

python3 -m venv scrapingdemo

Activate the newly created virtual environment (this example is for Linux):

source ./scrapingdemo/bin/activate

Now, install the required Python modules.

Requests is a library responsible for sending HTTP requests to websites and returning their responses. To install Requests, enter the following:

pip install requests

Beautiful Soup is a module used to parse and extract data from the HTML response. To install Beautiful Soup, enter the following:

pip install beautifulsoup4

re is Python’s built-in module for working with Regular Expressions, so there is nothing extra to install.

Next, create an empty Python file, for example, demo.py.

To import the required libraries, enter the following:

import requests

from bs4 import BeautifulSoup

import re

Sending the GET request

Use the Requests library to send a request to the web page you want to scrape, in this case https://books.toscrape.com/. To do so, enter the following:

page = requests.get('https://books.toscrape.com/')
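As a small, optional addition, you can confirm the request succeeded before parsing; raise_for_status() is part of the Requests API and raises an exception for 4xx/5xx responses:

page.raise_for_status()  # Raises requests.HTTPError if the server returned an error code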

Selecting data

First, create a Beautiful Soup object and pass the page content received from your request during the initialization, including the parser type. As you’re working with HTML, select html.parser as the parser type.

(Image: inspecting the HTML code element)

By inspecting the elements (right-click and select Inspect) in a browser, you can see that each book’s title and price sit inside an article element with the class product_pod. Use Beautiful Soup to get all the data inside these elements, and then convert the result to a string:

soup = BeautifulSoup(page.content, 'html.parser')

content = soup.find_all(class_='product_pod')

content = str(content)

Processing the data using RegEx

Since the acquired content contains a lot of unnecessary markup, create two regular expressions to extract only the desired data.

(Image: content of the acquired data)
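For orientation, here is a rough, abridged sketch of what a single product_pod entry looks like inside the stringified content; the exact markup on the live page may differ slightly, but the title attribute and the price paragraph are the parts the expressions below rely on:

# An abridged, approximate excerpt of one entry (for illustration only).
sample_entry = (
    '<h3><a href="catalogue/..." title="A Light in the Attic">A Light in the ...</a></h3>'
    ' ... <p class="price_color">£51.77</p>'
)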

Expression # 1

Finding the pattern

First, inspect the title of a book to find the pattern. You can see above that every title appears after the text title= in the format title="Titlename">.

Generating the expression

Then, create an expression that returns the data inside the quotation marks after title= by specifying "(.*?)". The *? makes the capture lazy, so it stops at the first closing quote.

The first expression is as follows:

re_titles = r'title="(.*?)">'
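A quick sanity check of the pattern against a made-up sample string:

import re

re_titles = r'title="(.*?)">'
sample = '<a href="catalogue/..." title="A Light in the Attic">'
print(re.findall(re_titles, sample))  # ['A Light in the Attic']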

Expression # 2

Finding the pattern

First, inspect the price of a book. Every price appears after the £ symbol in the format £price, just before the closing paragraph tag </p>.

Generating the expression

Then, create an expression that returns the data after the £ and before the </p> by specifying £(.*?)</p>.

The second expression is as follows:

re_prices = '£(.*?)</p>'
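And the same kind of check for the price pattern:

import re

re_prices = '£(.*?)</p>'
sample = '<p class="price_color">£51.77</p>'
print(re.findall(re_prices, sample))  # ['51.77']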

To conclude, use the expressions with re.findall to find the substrings matching the patterns. Lastly, save them in the variables titles_list and price_list.

titles_list = re.findall(re_titles, content)

price_list = re.findall(re_prices, content)

Saving the output

To save the output, loop over the pairs for the titles and prices and write them to the output.txt file.

with open("output.txt", "w") as f:

    for title, price in zip(titles_list, price_list):
        f.write(title + "\t" + price + "\n")

(Image: the output file)
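As a design note: tab-separated text is fine for a quick look, but if you plan to open the results in a spreadsheet, Python’s built-in csv module handles commas and quoting in titles for you. A minimal sketch, assuming titles_list and price_list from above:

import csv

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price"])  # Header row
    for title, price in zip(titles_list, price_list):
        writer.writerow([title, price])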

Putting everything together, this is the complete code that can be run by calling python demo.py:

# Importing the required libraries.

import requests

from bs4 import BeautifulSoup

import re

# Requesting the HTML from the web page.

page = requests.get("https://books.toscrape.com/")

# Selecting the data.

soup = BeautifulSoup(page.content, "html.parser")

content = soup.find_all(class_="product_pod")

content = str(content)

# Processing the data using Regular Expressions.

re_titles = r'title="(.*?)">'

titles_list = re.findall(re_titles, content)

re_prices = "£(.*?)</p>"

price_list = re.findall(re_prices, content)

# Saving the output.

with open("output.txt", "w") as f:

    for title, price in zip(titles_list, price_list):
        f.write(title + "\t" + price + "\n")

Conclusion

In this post, we explained what Regular Expressions are, how to use them, and what the most commonly used tokens do. We also walked through an example of scraping book titles and prices from a web page using Python and Regular Expressions.

Find this tutorial helpful? Don’t hesitate to like and leave a comment below with any questions or impressions regarding the post!
