Web scraping is one of the most practical Python skills you can learn. Whether you're collecting product prices, monitoring competitor websites, or building datasets for analysis, BeautifulSoup remains the go-to library for parsing HTML in Python.
In this tutorial, I'll walk you through everything from installing BeautifulSoup to handling real-world scraping challenges like pagination and JavaScript-rendered pages.
## What You'll Learn
- Installing and setting up BeautifulSoup
- Parsing HTML and navigating the DOM
- Using CSS selectors to extract data
- Handling pagination
- Dealing with JavaScript-rendered content
- Best practices for production scraping
## 1. Installing BeautifulSoup

BeautifulSoup4 (bs4) works alongside a parser. The recommended setup:

```bash
pip install beautifulsoup4 requests lxml
```

- `beautifulsoup4` — the parsing library
- `requests` — for fetching web pages
- `lxml` — a fast HTML/XML parser (faster than the built-in `html.parser`)
Quick verification:

```python
from bs4 import BeautifulSoup
import requests

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "lxml")
print(soup.title.string)
# Output: Example Domain
```
## 2. Parsing HTML: The Fundamentals

Let's work with a practical example. Say you want to scrape book data from a website:

```python
from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

# Find all book containers
books = soup.find_all("article", class_="product_pod")

for book in books:
    title = book.h3.a["title"]
    price = book.select_one(".price_color").text
    availability = book.select_one(".availability").text.strip()
    print(f"{title} — {price} — {availability}")
```
### Key Methods

| Method | Use Case |
|---|---|
| `find()` | First matching element |
| `find_all()` | All matching elements |
| `select()` | CSS selector (returns list) |
| `select_one()` | CSS selector (first match) |
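To make the differences concrete, here's a small self-contained comparison that runs on an inline HTML snippet (no network needed); it uses the built-in `html.parser` so `lxml` isn't required:

```python
from bs4 import BeautifulSoup

html = """
<div class="row">
  <article class="product_pod"><h3><a title="Book One">Book One</a></h3></article>
  <article class="product_pod"><h3><a title="Book Two">Book Two</a></h3></article>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("article", class_="product_pod")      # first matching element
every = soup.find_all("article", class_="product_pod")  # list of all matches
via_css = soup.select("article.product_pod")            # same result via CSS selector
one_css = soup.select_one("article.product_pod h3 a")   # first match via CSS selector

print(first.h3.a["title"])        # Book One
print(len(every), len(via_css))   # 2 2
print(one_css["title"])           # Book One
```

`find()`/`find_all()` and `select_one()`/`select()` are interchangeable for simple cases; the CSS variants shine once you need nesting or attribute conditions.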
## 3. CSS Selectors — The Power Tool

CSS selectors are the most flexible way to target elements. Here's a cheat sheet:

```python
# By class
soup.select(".product_pod")

# By ID
soup.select("#main-content")

# Nested elements
soup.select("div.row > article.product_pod h3 a")

# Attribute selectors
soup.select('a[href*="catalogue"]')

# Multiple selectors
soup.select("h1, h2, h3")

# Nth child
soup.select("tr:nth-of-type(2) td")
```
### Real Example: Extracting a Data Table

```python
import json

import requests
from bs4 import BeautifulSoup

url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

# Extract product information table
table = soup.select_one("table.table-striped")
rows = table.select("tr")

product_info = {}
for row in rows:
    key = row.select_one("th").text.strip()
    value = row.select_one("td").text.strip()
    product_info[key] = value

print(json.dumps(product_info, indent=2))
```
## 4. Handling Pagination

Most websites split data across multiple pages. Here's a robust pattern:

```python
import requests
from bs4 import BeautifulSoup
import time

BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"
all_books = []

for page_num in range(1, 51):  # 50 pages total
    url = BASE_URL.format(page_num)
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Stopped at page {page_num}")
        break

    soup = BeautifulSoup(response.text, "lxml")
    books = soup.select("article.product_pod")

    for book in books:
        all_books.append({
            "title": book.h3.a["title"],
            "price": book.select_one(".price_color").text,
            "rating": book.select_one("p.star-rating")["class"][1]
        })

    # Be respectful — don't hammer the server
    time.sleep(1)

print(f"Scraped {len(all_books)} books across {page_num} pages")
```
### Dynamic "Next Page" Pattern

When you don't know the total page count upfront:

```python
url = "https://books.toscrape.com/"

while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")

    # Process current page...
    books = soup.select("article.product_pod")
    for book in books:
        print(book.h3.a["title"])

    # Find the next page link
    next_btn = soup.select_one("li.next a")
    if next_btn:
        next_href = next_btn["href"]
        url = requests.compat.urljoin(url, next_href)
    else:
        url = None  # No more pages

    time.sleep(1)
```
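The `requests.compat.urljoin` used above is just the standard library's `urllib.parse.urljoin`: it turns a relative `href` into an absolute URL against the page you fetched it from. A quick offline illustration with this site's link structure:

```python
from urllib.parse import urljoin

# Relative link found on the first page
first = urljoin("https://books.toscrape.com/", "catalogue/page-2.html")
print(first)  # https://books.toscrape.com/catalogue/page-2.html

# On a later page, the relative link replaces the last path segment
later = urljoin("https://books.toscrape.com/catalogue/page-2.html", "page-3.html")
print(later)  # https://books.toscrape.com/catalogue/page-3.html
```

This is why you pass the *current* page's URL as the base: the same relative `href` resolves correctly wherever you are in the catalogue.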
## 5. JavaScript-Rendered Pages

BeautifulSoup only parses static HTML. Many modern sites load content via JavaScript. You have three options:

### Option A: Find the API

Before reaching for a headless browser, check the Network tab in DevTools. Many sites load data from a JSON API:

```python
import requests

# Often the actual data comes from an API endpoint
api_url = "https://api.example.com/products?page=1&limit=20"
response = requests.get(api_url)
data = response.json()

for product in data["results"]:
    print(product["name"], product["price"])
```
Hitting the API directly is typically an order of magnitude faster than rendering JavaScript, and far less brittle. Always check for an API first.
### Option B: Use a Headless Browser

When there's no API, use Playwright or Selenium:

```bash
pip install playwright
playwright install chromium
```

```python
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/spa-page")

    # Wait for dynamic content to load
    page.wait_for_selector(".product-card")

    # Get the rendered HTML
    html = page.content()
    soup = BeautifulSoup(html, "lxml")

    products = soup.select(".product-card")
    for product in products:
        print(product.select_one(".name").text)

    browser.close()
```
### Option C: Use a Scraping API

For production workloads, a scraping API handles JavaScript rendering, proxies, and CAPTCHAs for you. Services like ScrapeOps provide a simple API that returns rendered HTML:

```python
import requests
from bs4 import BeautifulSoup

# ScrapeOps handles rendering, proxies, and retries
params = {
    "api_key": "YOUR_API_KEY",
    "url": "https://example.com/dynamic-page",
    "render_js": "true"
}
response = requests.get("https://proxy.scrapeops.io/v1/", params=params)
soup = BeautifulSoup(response.text, "lxml")
```
This is especially valuable when you need reliable, high-volume scraping without managing browser infrastructure.
## 6. Putting It All Together: A Complete Scraper

Here's a production-ready scraper with error handling, retries, and structured output:

```python
import requests
from bs4 import BeautifulSoup
import csv
import time
from urllib.parse import urljoin


class BookScraper:
    def __init__(self, base_url, output_file="books.csv"):
        self.base_url = base_url
        self.output_file = output_file
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (compatible; BookScraper/1.0)"
        })

    def fetch_page(self, url, retries=3):
        for attempt in range(retries):
            try:
                response = self.session.get(url, timeout=10)
                response.raise_for_status()
                return BeautifulSoup(response.text, "lxml")
            except requests.RequestException as e:
                print(f"Attempt {attempt + 1} failed: {e}")
                time.sleep(2 ** attempt)  # Exponential backoff
        return None

    def parse_book(self, article):
        return {
            "title": article.h3.a["title"],
            "price": article.select_one(".price_color").text.strip(),
            "rating": article.select_one("p.star-rating")["class"][1],
            "available": "In stock" in article.select_one(".availability").text
        }

    def scrape_all(self):
        all_books = []
        url = self.base_url
        while url:
            soup = self.fetch_page(url)
            if not soup:
                break
            for article in soup.select("article.product_pod"):
                all_books.append(self.parse_book(article))
            next_link = soup.select_one("li.next a")
            url = urljoin(url, next_link["href"]) if next_link else None
            time.sleep(1)
        return all_books

    def save_csv(self, books):
        if not books:
            return
        with open(self.output_file, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=books[0].keys())
            writer.writeheader()
            writer.writerows(books)
        print(f"Saved {len(books)} books to {self.output_file}")


# Usage
scraper = BookScraper("https://books.toscrape.com/")
books = scraper.scrape_all()
scraper.save_csv(books)
```
## 7. Best Practices

- **Respect `robots.txt`** — check it before scraping any site
- **Add delays** — at least `time.sleep(1)` between requests
- **Use sessions** — `requests.Session()` reuses connections efficiently
- **Set a User-Agent** — identify your scraper honestly
- **Handle errors gracefully** — retry with exponential backoff
- **Cache responses** — don't re-scrape pages you already have
- **Use proxies for scale** — services like ThorData provide residential proxies to avoid IP blocks when you need to collect data at higher volumes
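The `robots.txt` check from the first bullet can even be automated with the standard library's `urllib.robotparser`. A minimal sketch (the rules below are made up for illustration; in practice you'd call `rp.set_url(".../robots.txt")` and `rp.read()` instead of feeding lines in directly):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse rules directly so this example runs offline
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
])

allowed = rp.can_fetch("BookScraper/1.0", "https://example.com/catalogue/page-1.html")
blocked = rp.can_fetch("BookScraper/1.0", "https://example.com/private/data.html")
print(allowed)  # True
print(blocked)  # False
```

Call `can_fetch()` with your scraper's User-Agent before each new section of a site, and skip anything it disallows.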
## What's Next?
Once you've mastered BeautifulSoup basics, the next step is scaling up. Check out the next article in this series where we cover async scraping, task queues, and distributed architectures for handling 100K+ pages.
If you want to skip the infrastructure headaches and jump straight to production scraping, take a look at ScrapeOps for managed proxy rotation and CAPTCHA solving.
Happy scraping! 🕷️
Have questions? Drop them in the comments — I read every one.