
Percival Villalva for Apify

Originally published at blog.apify.com.

What are the best Python web scraping libraries?

Introduction

Web scraping is essentially a way to automate the process of extracting data from the web, and as a Python developer, you have access to some of the best libraries and frameworks available to help you get the job done.

We're going to take a look at some of the most popular Python libraries and frameworks for web scraping and compare their pros and cons, so you know exactly what tool to use to tackle any web scraping project you might come across.

HTTP Libraries - Requests and HTTPX

First up, let's talk about HTTP libraries. These are the foundation of web scraping since every scraping job starts by making a request to a website and retrieving its contents, usually as HTML.

Two popular HTTP libraries in Python are Requests and HTTPX.

Requests is easy to use and great for simple scraping tasks, while HTTPX offers some advanced features like async and HTTP/2 support.

Their core functionality and syntax are very similar, so I would recommend HTTPX even for smaller projects since you can easily scale up in the future without compromising performance.

Feature             HTTPX   Requests
Asynchronous        Yes     No
HTTP/2 support      Yes     No
Timeout support     Yes     Yes
Proxy support       Yes     Yes
TLS verification    Yes     Yes
Custom exceptions   Yes     Yes
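
To illustrate how similar the two libraries feel in practice, here's a minimal sketch of the same GET request made with each of them, plus HTTPX's async client (the Hacker News URL is simply the example used throughout this article):

import asyncio

import httpx
import requests

URL = "https://news.ycombinator.com/news"

# Synchronous GET with Requests
print(requests.get(URL, timeout=10).status_code)

# Synchronous GET with HTTPX: nearly identical syntax
print(httpx.get(URL, timeout=10).status_code)

# HTTPX also ships an asynchronous client for making many requests concurrently
async def fetch(url):
    async with httpx.AsyncClient() as client:
        response = await client.get(url, timeout=10)
        return response.status_code

print(asyncio.run(fetch(URL)))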

Parsing HTML with Beautiful Soup

Once you have the HTML content, you need a way to parse it and extract the data you're interested in.

Beautiful Soup is the most popular HTML parser in Python, allowing you to easily navigate and search through the HTML tree structure. Its straightforward syntax and easy setup also make Beautiful Soup a great option for small to medium web scraping projects as well as web scraping beginners.

The two major drawbacks of Beautiful Soup are its inability to scrape JavaScript-heavy websites and its limited scalability, which results in low performance in large-scale projects. For large projects, you would be better off using Scrapy, but more about that later.

Related reading on blog.apify.com: Web scraping with Beautiful Soup and Requests, a detailed tutorial with code examples and some handy tricks.

Next, let's take a look at how Beautiful Soup works in practice:

from bs4 import BeautifulSoup
import httpx

# Send an HTTP GET request to the specified URL using the httpx library
response = httpx.get("https://news.ycombinator.com/news")

# Save the content of the response
yc_web_page = response.content

# Use the Beautiful Soup library to parse the HTML content of the webpage
soup = BeautifulSoup(yc_web_page, "html.parser")

# Find all elements with the class "athing" (which represent articles on Hacker News)
articles = soup.find_all(class_="athing")

# Loop through each article and extract relevant data, such as the URL, title, and rank
for article in articles:
    data = {
        "URL": article.find(class_="titleline").find("a").get("href"),  # first "a" tag inside the "titleline" element
        "title": article.find(class_="titleline").getText(),  # text content of the "titleline" element
        "rank": article.find(class_="rank").getText().replace(".", ""),  # text of the "rank" element, without the trailing period
    }
    # Print the extracted data for the current article
    print(data)


Explaining the code:

1. We start by sending an HTTP GET request to the specified URL using the HTTPX library. Then, we save the retrieved content to a variable.

2. Next, we use the Beautiful Soup library to parse the HTML content of the webpage.

3. This enables us to manipulate the parsed content using Beautiful Soup methods, such as find_all to find the content we need. In this particular case, we are finding all elements with the class athing, which represent articles on Hacker News.

4. Finally, we loop through all the articles on the page, use Beautiful Soup's find method to pick out the data we want from each article, and print the scraped data to the console.
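
Beautiful Soup also understands CSS selectors through its select and select_one methods. Here's a rough equivalent of the loop above using that interface, assuming the same soup object as before:

# Same extraction as above, but using Beautiful Soup's CSS selector interface
for article in soup.select("tr.athing"):
    link = article.select_one(".titleline a")
    rank = article.select_one(".rank")
    print({
        "URL": link.get("href") if link else None,
        "title": link.getText() if link else None,
        "rank": rank.getText().replace(".", "") if rank else None,
    })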

Browser automation libraries - Selenium and Playwright

What if the website you're scraping relies on JavaScript to load its content? In that case, an HTML parser won't be enough: you'll need to spin up a browser instance to render the page's JavaScript using a browser automation tool like Selenium or Playwright.

These are primarily testing and automation tools that allow you to control a web browser programmatically, including clicking buttons, filling out forms, and more. However, they are also often used in web scraping as a means to access dynamically generated data on a webpage.

While Selenium and Playwright are very similar in their core functionality, Playwright is more modern and complete than Selenium.

For example, Playwright offers some unique built-in features, such as automatically waiting for elements to be visible before performing actions on them and an asynchronous version of its API based on asyncio.
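
Selenium isn't shown in the walkthrough below, so here is a rough sketch of what a comparable Hacker News scrape might look like with it, assuming Selenium 4 (which can download the browser driver for you via Selenium Manager):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a visible Firefox window (Selenium Manager fetches the driver if needed)
driver = webdriver.Firefox()
driver.get("https://news.ycombinator.com/news")

# Grab the first title link inside each article row
for row in driver.find_elements(By.CSS_SELECTOR, "tr.athing"):
    link = row.find_element(By.CSS_SELECTOR, ".titleline a")
    print(link.text, link.get_attribute("href"))

driver.quit()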

Related reading on blog.apify.com: What is Playwright automation? Learn why Playwright is ideal for web scraping and automation.

To illustrate how we can use Playwright for web scraping, let's quickly walk through a code snippet where we extract data from an Amazon product page and save a screenshot of the page while we're at it.

import asyncio

from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.firefox.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C")

        # CSS selectors for the data we want to scrape
        selectors = ['#productTitle', 'span.author a', '#productSubtitle', '.a-size-base.a-color-price.a-color-price']

        # Query all the selectors concurrently
        book_data = await asyncio.gather(*(page.query_selector(sel) for sel in selectors))

        # Create a dictionary with the scraped data, skipping elements that weren't found
        book = {}
        book["book_title"], book["author"], book["edition"], book["price"] = [
            await elem.inner_text() for elem in book_data if elem
        ]

        print(book)

        # Save a screenshot of the page and close the browser
        await page.screenshot(path="book.png")
        await browser.close()

asyncio.run(main())


Explaining the code:

  1. Import the necessary modules: asyncio and, from Playwright's async API, async_playwright.

  2. We then define an async function called main that launches a Firefox browser instance with headless mode set to False so we can actually see the browser working, creates a new page in the browser using the new_page method, and finally navigates to the Amazon product page using the goto method.

  3. Next, we define a list of CSS selectors for the data we want to scrape. Then, we use asyncio.gather to run page.query_selector on all the selectors in the list concurrently and store the results in the book_data variable.

  4. Now we can iterate over book_data to populate the book dictionary with the scraped data. Note that we also check that each element is not None and only add the elements that exist. This is considered good practice, since websites can make small changes that will break your scraper. You could even expand on this example and add more thorough checks to make sure the extracted data isn't missing any values.

  5. Finally, we print the book dictionary contents to the console and take a screenshot of the scraped page, saving it as a file called book.png.

  6. As a last step, we make sure to close the browser instance.

Related reading on blog.apify.com: How to scrape the web with Playwright in 2023, a complete Playwright web scraping and crawling tutorial.

But wait! If browser automation tools can scrape virtually any webpage and, on top of that, make it easier to automate tasks, test your code, and watch it run, why don't we just always use Playwright or Selenium for web scraping?

Well, despite being powerful scraping tools, these libraries and frameworks have a noticeable drawback: spinning up a browser instance is a very resource-heavy operation compared to simply retrieving a page's HTML. This can easily become a huge performance bottleneck in large scraping jobs, which will not only take longer to complete but also become considerably more expensive. For that reason, we usually want to limit the use of these tools to the tasks that actually need them and, when possible, combine them with Beautiful Soup or Scrapy.
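
One common pattern, sketched below on the assumption that you only need the browser for rendering, is to let Playwright load the page, grab the rendered HTML, and hand it off to Beautiful Soup for parsing (Hacker News is reused here purely as the article's running example):

import asyncio

from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

async def get_rendered_html(url):
    # Use the browser only to render the page's JavaScript
    async with async_playwright() as p:
        browser = await p.firefox.launch()
        page = await browser.new_page()
        await page.goto(url)
        html = await page.content()
        await browser.close()
        return html

html = asyncio.run(get_rendered_html("https://news.ycombinator.com/news"))

# Parse the rendered HTML with Beautiful Soup instead of keeping the browser open
soup = BeautifulSoup(html, "html.parser")
for article in soup.find_all(class_="athing"):
    print(article.find(class_="titleline").getText())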

Related reading on blog.apify.com: Web Scraping with Scrapy, a hands-on guide.

Scrapy

Next up, we have the most popular and, arguably, the most powerful web scraping framework for Python.

If you find yourself needing to scrape large amounts of data regularly, then Scrapy could be a great option.

The Scrapy framework offers a full-fledged suite of tools to aid you even in the most complex scraping jobs.

On top of its superior performance compared to Beautiful Soup, Scrapy can also be easily integrated with other Python data-processing tools and even other libraries, such as Playwright.

Not only that, but it comes with a handy collection of built-in features tailored specifically to web scraping, such as:

  • Powerful and flexible spidering framework: Scrapy provides a built-in spidering framework that allows you to easily define and customize web crawlers to extract the data you need.

  • Fast and efficient: Scrapy is designed to be fast and efficient, allowing you to extract data from large websites quickly and with minimal resource usage.

  • Support for handling common web data formats: export scraped data in multiple formats, such as CSV, XML, and JSON.

  • Extensible architecture: easily add custom functionality through middleware, pipelines, and extensions.

  • Distributed scraping: Scrapy supports distributed scraping, allowing you to scale up your web scraping operation across multiple machines.

  • Error handling: Scrapy has robust error-handling capabilities, allowing you to handle common errors and exceptions that may occur during web scraping.

  • Support for authentication and cookies: handle authentication and cookies to scrape websites that require login credentials.

  • Integration with other Python tools: Scrapy can be easily integrated with other Python tools, such as data processing and storage libraries, making it a powerful tool for end-to-end data processing pipelines.

Here's an example of how to use a Scrapy Spider to scrape data from a website:

import scrapy

class HackernewsSpiderSpider(scrapy.Spider):
    name = 'hackernews_spider'
    allowed_domains = ['news.ycombinator.com']
    start_urls = ['http://news.ycombinator.com/']

    def parse(self, response):
        articles = response.css('tr.athing')
        for article in articles:
            yield {
                "URL": article.css(".titleline a::attr(href)").get(),
                "title": article.css(".titleline a::text").get(),
                "rank": article.css(".rank::text").get().replace(".", "")
            }


We can use the following command to run this script and save the resulting data to a JSON file:

scrapy crawl hackernews_spider -o hackernews.json


Explaining the code:

The code example uses Scrapy to scrape data from the Hacker News website (news.ycombinator.com). Let's break down the code step by step:

After importing the necessary modules, we define the Spider class we want to use:


class HackernewsSpiderSpider(scrapy.Spider):


Next, we set the Spider properties:

  • name: The name of the spider (used to identify it).

  • allowed_domains: A list of domains that the spider is allowed to crawl.

  • start_urls: A list of URLs to start crawling from.


name = 'hackernews_spider'
allowed_domains = ['news.ycombinator.com']
start_urls = ['http://news.ycombinator.com/']


Then, we define the parse method. This method is the entry point for the spider and is called with the response for each URL specified in start_urls.


def parse(self, response):


In the parse method, we extract data from the HTML response. The response object represents the HTML page received from the website, and the spider uses CSS selectors to pull the relevant data out of its structure.


articles = response.css('tr.athing')


Now we use a for loop to iterate over each article found on the page.


for article in articles:


Finally, for each article, the spider extracts the URL, title, and rank information using CSS selectors and yields a Python dictionary containing this data.


yield {
    "URL": article.css(".titleline a::attr(href)").get(),
    "title": article.css(".titleline a::text").get(),
    "rank": article.css(".rank::text").get().replace(".", "")
}

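If you need to post-process the items a spider yields (the extensible architecture mentioned in the feature list above), item pipelines are Scrapy's usual hook for that. A minimal, hypothetical pipeline, which would live in your project's pipelines.py and be enabled through the ITEM_PIPELINES setting, might look like this:

# A hypothetical pipeline that normalizes the scraped titles
class CleanTitlePipeline:
    def process_item(self, item, spider):
        # Strip stray whitespace from the title before the item is exported
        if item.get("title"):
            item["title"] = item["title"].strip()
        return item
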

Related reading on blog.apify.com: Scrapy alternatives, five other web scraping libraries you need to try.

Which Python scraping library is right for you?

So, which library should you use for your web scraping project? The answer depends on the specific needs and requirements of your project. Each web scraping library and framework presented here has a unique purpose in an expert scraper's toolkit. Learning to use each one will give you the flexibility to select the best tool for each job, so don't be afraid to try each of them before deciding!

Whether you are scraping with Beautiful Soup, Scrapy, Selenium, or Playwright, the Apify Python SDK helps you run your project in the cloud at any scale.
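
As a rough idea of what that looks like, based on the Actor pattern in the SDK's documentation (double-check the current docs, as the API may have changed), wrapping the Beautiful Soup example in an Apify Actor might be structured like this:

import asyncio

import httpx
from apify import Actor
from bs4 import BeautifulSoup

async def main():
    # The Actor context manager handles startup and teardown on the Apify platform
    async with Actor:
        async with httpx.AsyncClient() as client:
            response = await client.get("https://news.ycombinator.com/news")
        soup = BeautifulSoup(response.content, "html.parser")
        for article in soup.find_all(class_="athing"):
            # Push each scraped item to the Actor's default dataset
            await Actor.push_data({
                "title": article.find(class_="titleline").getText(),
                "rank": article.find(class_="rank").getText().replace(".", ""),
            })

asyncio.run(main())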
