agenthustler

BeautifulSoup Web Scraping Tutorial in 2026: From Basics to Advanced Techniques

Web scraping is one of the most practical Python skills you can learn. Whether you're collecting product prices, monitoring competitor websites, or building datasets for analysis, BeautifulSoup remains the go-to library for parsing HTML in Python.

In this tutorial, I'll walk you through everything from installing BeautifulSoup to handling real-world scraping challenges like pagination and JavaScript-rendered pages.

What You'll Learn

  • Installing and setting up BeautifulSoup
  • Parsing HTML and navigating the DOM
  • Using CSS selectors to extract data
  • Handling pagination
  • Dealing with JavaScript-rendered content
  • Best practices for production scraping

1. Installing BeautifulSoup

BeautifulSoup4 (bs4) works alongside a parser. The recommended setup:

pip install beautifulsoup4 requests lxml
  • beautifulsoup4 — the parsing library
  • requests — for fetching web pages
  • lxml — a fast HTML/XML parser (faster than the built-in html.parser)

Quick verification:

from bs4 import BeautifulSoup
import requests

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "lxml")
print(soup.title.string)
# Output: Example Domain

2. Parsing HTML: The Fundamentals

Let's work with a practical example. Say you want to scrape book data from a website:

from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

# Find all book containers
books = soup.find_all("article", class_="product_pod")

for book in books:
    title = book.h3.a["title"]
    price = book.select_one(".price_color").text
    availability = book.select_one(".availability").text.strip()
    print(f"{title}{price}{availability}")

Key Methods

Method         Use Case
find()         First matching element
find_all()     All matching elements
select()       CSS selector (returns list)
select_one()   CSS selector (first match)
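
To make the table concrete, here's a small sketch, reusing the books.toscrape.com page from the example above, that writes the same query with both the find family and the CSS-selector family:

from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get("https://books.toscrape.com/").text, "lxml")

# find()/find_all() take a tag name plus keyword filters
# (class_ has a trailing underscore because "class" is a Python keyword)
first_book = soup.find("article", class_="product_pod")
all_books = soup.find_all("article", class_="product_pod")

# select()/select_one() take a CSS selector string and return equivalent results
first_book_css = soup.select_one("article.product_pod")
all_books_css = soup.select("article.product_pod")

print(len(all_books), len(all_books_css))  # both should print 20 on the first page

Both families work on the same soup object, so you can mix them freely; select() tends to read better for nested queries, which is why the next section focuses on it.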

3. CSS Selectors — The Power Tool

CSS selectors are the most flexible way to target elements. Here's a cheat sheet:

# By class
soup.select(".product_pod")

# By ID
soup.select("#main-content")

# Nested elements
soup.select("div.row > article.product_pod h3 a")

# Attribute selectors
soup.select('a[href*="catalogue"]')

# Multiple selectors
soup.select("h1, h2, h3")

# Nth child
soup.select("tr:nth-of-type(2) td")

Real Example: Extracting a Data Table

url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

# Extract product information table
table = soup.select_one("table.table-striped")
rows = table.select("tr")

product_info = {}
for row in rows:
    key = row.select_one("th").text.strip()
    value = row.select_one("td").text.strip()
    product_info[key] = value

print(json.dumps(product_info, indent=2))

4. Handling Pagination

Most websites split data across multiple pages. Here's a robust pattern:

import requests
from bs4 import BeautifulSoup
import time

BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"
all_books = []

for page_num in range(1, 51):  # 50 pages total
    url = BASE_URL.format(page_num)
    response = requests.get(url)

    if response.status_code != 200:
        print(f"Stopped at page {page_num}")
        break

    soup = BeautifulSoup(response.text, "lxml")
    books = soup.select("article.product_pod")

    for book in books:
        all_books.append({
            "title": book.h3.a["title"],
            "price": book.select_one(".price_color").text,
            "rating": book.select_one("p.star-rating")["class"][1]
        })

    # Be respectful — don't hammer the server
    time.sleep(1)

print(f"Scraped {len(all_books)} books across {page_num} pages")

Dynamic "Next Page" Pattern

When you don't know total pages upfront:

url = "https://books.toscrape.com/"

while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")

    # Process current page...
    books = soup.select("article.product_pod")
    for book in books:
        print(book.h3.a["title"])

    # Find the next page link
    next_btn = soup.select_one("li.next a")
    if next_btn:
        next_href = next_btn["href"]
        url = requests.compat.urljoin(url, next_href)
    else:
        url = None  # No more pages

    time.sleep(1)

5. JavaScript-Rendered Pages

BeautifulSoup only parses static HTML. Many modern sites load content via JavaScript. You have three options:
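
Before picking an option, it's worth confirming the content really is JavaScript-rendered. A minimal check, with a placeholder URL and selector you'd swap for your own: fetch the page with plain requests and see whether your target selector matches anything in the raw HTML.

import requests
from bs4 import BeautifulSoup

url = "https://example.com/spa-page"  # placeholder URL
soup = BeautifulSoup(requests.get(url, timeout=10).text, "lxml")

# If the selector matches nothing here but the data shows up in your browser,
# the page is almost certainly rendered client-side by JavaScript.
if not soup.select(".product-card"):
    print("Nothing in the raw HTML; look for an API or use a headless browser")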

Option A: Find the API

Before reaching for a headless browser, check the Network tab in DevTools. Many sites load data from a JSON API:

import requests

# Often the actual data comes from an API endpoint
api_url = "https://api.example.com/products?page=1&limit=20"
response = requests.get(api_url)
data = response.json()

for product in data["results"]:
    print(product["name"], product["price"])

Hitting the API directly is typically far faster than rendering JavaScript in a browser. Always check for APIs first.

Option B: Use a Headless Browser

When there's no API, use Playwright or Selenium:

pip install playwright
playwright install chromium

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/spa-page")

    # Wait for dynamic content to load
    page.wait_for_selector(".product-card")

    # Get the rendered HTML
    html = page.content()
    soup = BeautifulSoup(html, "lxml")

    products = soup.select(".product-card")
    for product in products:
        print(product.select_one(".name").text)

    browser.close()

Option C: Use a Scraping API

For production workloads, a scraping API handles JavaScript rendering, proxies, and CAPTCHAs for you. Services like ScrapeOps provide a simple API that returns rendered HTML:

import requests

# ScrapeOps handles rendering, proxies, and retries
params = {
    "api_key": "YOUR_API_KEY",
    "url": "https://example.com/dynamic-page",
    "render_js": "true"
}
response = requests.get("https://proxy.scrapeops.io/v1/", params=params)
soup = BeautifulSoup(response.text, "lxml")

This is especially valuable when you need reliable, high-volume scraping without managing browser infrastructure.


6. Putting It All Together: A Complete Scraper

Here's a production-ready scraper with error handling, retries, and structured output:

import requests
from bs4 import BeautifulSoup
import csv
import time
from urllib.parse import urljoin

class BookScraper:
    def __init__(self, base_url, output_file="books.csv"):
        self.base_url = base_url
        self.output_file = output_file
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (compatible; BookScraper/1.0)"
        })

    def fetch_page(self, url, retries=3):
        for attempt in range(retries):
            try:
                response = self.session.get(url, timeout=10)
                response.raise_for_status()
                return BeautifulSoup(response.text, "lxml")
            except requests.RequestException as e:
                print(f"Attempt {attempt + 1} failed: {e}")
                time.sleep(2 ** attempt)  # Exponential backoff
        return None

    def parse_book(self, article):
        return {
            "title": article.h3.a["title"],
            "price": article.select_one(".price_color").text.strip(),
            "rating": article.select_one("p.star-rating")["class"][1],
            "available": "In stock" in article.select_one(".availability").text
        }

    def scrape_all(self):
        all_books = []
        url = self.base_url

        while url:
            soup = self.fetch_page(url)
            if not soup:
                break

            for article in soup.select("article.product_pod"):
                all_books.append(self.parse_book(article))

            next_link = soup.select_one("li.next a")
            url = urljoin(url, next_link["href"]) if next_link else None
            time.sleep(1)

        return all_books

    def save_csv(self, books):
        if not books:
            return
        with open(self.output_file, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=books[0].keys())
            writer.writeheader()
            writer.writerows(books)
        print(f"Saved {len(books)} books to {self.output_file}")

# Usage
scraper = BookScraper("https://books.toscrape.com/")
books = scraper.scrape_all()
scraper.save_csv(books)

7. Best Practices

  1. Respect robots.txt — Check it before scraping any site
  2. Add delays — time.sleep(1) between requests at a minimum
  3. Use sessions — requests.Session() reuses connections efficiently
  4. Set a User-Agent — Identify your scraper honestly
  5. Handle errors gracefully — Retries with exponential backoff
  6. Cache responses — Don't re-scrape pages you already have (see the sketch after this list)
  7. Use proxies for scale — Services like ThorData provide residential proxies to avoid IP blocks when you need to collect data at higher volumes
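
Item 6 is the one this post doesn't otherwise demonstrate, so here's a minimal caching sketch. The cache directory and the cached_get helper are illustrative names, not part of requests or any other library:

import hashlib
import time
from pathlib import Path

import requests

CACHE_DIR = Path(".scrape_cache")  # illustrative directory name
CACHE_DIR.mkdir(exist_ok=True)

def cached_get(url, session=None, delay=1.0):
    """Return page HTML, reading from the local cache if we've fetched this URL before."""
    cache_file = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    response = (session or requests).get(url, timeout=10)
    response.raise_for_status()
    cache_file.write_text(response.text, encoding="utf-8")
    time.sleep(delay)  # only delay when we actually hit the network
    return response.text

# Re-running the script skips every page that's already on disk
html = cached_get("https://books.toscrape.com/")

For bigger projects, a library such as requests-cache gives you similar behavior (plus expiry) with less code.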

What's Next?

Once you've mastered BeautifulSoup basics, the next step is scaling up. Check out the next article in this series where we cover async scraping, task queues, and distributed architectures for handling 100K+ pages.

If you want to skip the infrastructure headaches and jump straight to production scraping, take a look at ScrapeOps for managed proxy rotation and CAPTCHA solving.

Happy scraping! 🕷️


Have questions? Drop them in the comments — I read every one.
