ZyVOP

Posted on • Originally published at zyvop.com

Web Scraping with Python: A Complete BeautifulSoup & Requests Guide

Billions of web pages sit on the internet, full of prices, headlines, job listings, research data, and more. Most of that data has no official API. Web scraping is how you collect it programmatically, turning raw HTML into clean, structured datasets you can actually use.

Python is the gold standard for web scraping. It has a rich ecosystem, readable syntax, and two libraries in particular that make scraping feel almost effortless: requests (for fetching web pages) and BeautifulSoup (for parsing them).

By the end of this guide, you will:

  • Understand how HTTP requests and HTML parsing work together

  • Write a scraper that collects data from real websites

  • Handle pagination, headers, and common errors

  • Export your data to CSV using pandas

Let's dig in.

How Web Scraping Works

When you type a URL into a browser, your browser sends an HTTP GET request to a server. The server responds with HTML. Your browser renders that HTML into the visual page you see.

Web scraping does the same thing — but instead of a browser rendering the HTML visually, Python reads it programmatically and extracts exactly the data you want.

Your Script  →  HTTP GET Request  →  Web Server
Web Server   →  HTML Response     →  Your Script
Your Script  →  Parse HTML        →  Structured Data

There are two key parts:

  • requests handles the first half: sending the HTTP request and receiving the HTML

  • BeautifulSoup handles the second half: parsing that HTML so you can navigate and extract from it

Installation

Install all required libraries with a single pip command:

pip install requests beautifulsoup4 pandas lxml

Why lxml? BeautifulSoup supports multiple parsers. lxml is very fast and tolerant of malformed HTML, which matters because real-world HTML is often messy. (html5lib is more lenient still, but much slower.)

Your First Scraper: Fetching a Page

Let's start simple. Here is how to fetch the HTML of any webpage:

import requests

url = "https://books.toscrape.com/"

# A User-Agent tells the server what kind of client is making the request.
# Without this, many servers will block or return a different response.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36"
}

response = requests.get(url, headers=headers, timeout=10)

# Always check the status code before parsing
print(response.status_code)   # 200 = success
print(len(response.text))     # Length of the HTML string

About status codes:

Code      Meaning
200       Success
301/302   Redirect (requests follows these automatically)
403       Forbidden: you're being blocked
404       Page not found
429       Too many requests: you're being rate-limited
500       Server error

If you get a 403, your User-Agent is probably missing or being rejected. If you get a 429, you are scraping too fast.
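
The status-code table above can be turned into a small decision helper. This is only a sketch of one possible policy: the function name `next_action` and its action labels are invented for illustration, and it assumes any `Retry-After` header is given in whole seconds.

```python
def next_action(status_code, headers=None, default_wait=30):
    """Map an HTTP status code to an (action, wait_seconds) pair.

    The action labels are illustrative; they are not part of requests.
    """
    headers = headers or {}

    if 200 <= status_code < 300:
        return ("parse", 0)
    if status_code == 429:
        # Rate-limited: honour Retry-After when it is a plain number of seconds.
        raw = str(headers.get("Retry-After", "")).strip()
        wait = int(raw) if raw.isdigit() else default_wait
        return ("wait_and_retry", wait)
    if status_code in (403, 404):
        # Repeating the same request unchanged is unlikely to help.
        return ("skip", 0)
    if 500 <= status_code < 600:
        # Server-side problem: worth another try after a pause.
        return ("wait_and_retry", default_wait)
    return ("skip", 0)


print(next_action(200))
print(next_action(429, {"Retry-After": "12"}))
print(next_action(403))
```

In a real scraper you would pass `response.status_code` and `response.headers` from the earlier example.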

Parsing HTML with BeautifulSoup

Once you have the HTML string, you pass it to BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "lxml")

# The `soup` object now represents the entire HTML document.
# You can navigate it like a tree.
print(soup.title.text)         # Page title
print(soup.find("h1").text)    # First h1 on the page

BeautifulSoup gives you several ways to find elements:

Method 1: find() — returns the first match

# Find the first <h2> element
heading = soup.find("h2")

# Find the first element with a specific class
box = soup.find("div", class_="product-box")

# Find by ID
sidebar = soup.find("div", id="sidebar")

Method 2: find_all() — returns a list of all matches

# Find ALL <a> tags
all_links = soup.find_all("a")

# Iterate and extract
for link in all_links:
    print(link.text, link.get("href"))

Method 3: CSS Selectors with select() — the most powerful

If you know CSS, you already know this. .select() accepts any CSS selector string.

# All elements with class "product_pod"
products = soup.select("article.product_pod")

# All anchors inside elements with class "titleline" (select() returns a list)
title_links = soup.select(".titleline a")

# Nested selectors — p tags inside div.content
paragraphs = soup.select("div.content p")

# select_one() is like find() but uses CSS syntax
price = soup.select_one(".price_color")

Tip: Use your browser's DevTools to get selectors instantly. Right-click any element → Inspect → Right-click the highlighted HTML → Copy → Copy selector.
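
To compare the three lookup styles side by side without touching the network, here is a small sketch that runs them against an inline HTML string. The markup is made up for illustration (loosely modelled on a product listing), and it uses Python's built-in `html.parser` so it runs even where lxml is not installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only.
html = """
<div id="sidebar"><a href="/help">Help</a></div>
<article class="product_pod">
  <h3><a href="book-1.html" title="First Book">First...</a></h3>
  <p class="price_color">£10.00</p>
</article>
<article class="product_pod">
  <h3><a href="book-2.html" title="Second Book">Second...</a></h3>
  <p class="price_color">£12.50</p>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")        # first <a> in document order
all_links = soup.find_all("a")     # every <a> on the page
# select() takes a CSS selector and returns a list of matches
titles = [a["title"] for a in soup.select("article.product_pod h3 a")]
first_price = soup.select_one(".price_color").get_text(strip=True)
missing = soup.select_one(".does-not-exist")  # None when nothing matches

print(first_link["href"])
print(len(all_links))
print(titles)
print(first_price)
print(missing)
```

Note that `find()` and `select_one()` return `None` when nothing matches, which is why later sections check the result before reading text from it.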

Real Example: Scraping Book Data

books.toscrape.com is a sandbox website built specifically for scraping practice. Let's scrape its catalog.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; BookScraper/1.0)"
}

def parse_rating(class_string):
    """Convert word-based star rating to number."""
    rating_map = {
        "One": 1, "Two": 2, "Three": 3,
        "Four": 4, "Five": 5
    }
    # class_string looks like "star-rating Three"
    word = class_string.split()[-1]
    return rating_map.get(word, 0)

def scrape_page(url):
    """Scrape all books from a single catalogue page."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # raises exception on 4xx/5xx

    soup = BeautifulSoup(response.text, "lxml")
    books = []

    for article in soup.select("article.product_pod"):
        title = article.select_one("h3 a")["title"]
        price = article.select_one(".price_color").text.strip()
        rating_class = article.select_one(".star-rating")["class"]
        rating = parse_rating(" ".join(rating_class))
        in_stock = "In stock" in article.select_one(".availability").text

        books.append({
            "title": title,
            "price": price,
            "rating": rating,
            "in_stock": in_stock
        })

    return books

def scrape_catalog(pages=5):
    """Scrape multiple pages with polite delays."""
    base_url = "https://books.toscrape.com/catalogue/page-{}.html"
    all_books = []

    for page_num in range(1, pages + 1):
        url = base_url.format(page_num)
        print(f"Scraping page {page_num}...")

        page_books = scrape_page(url)
        all_books.extend(page_books)

        time.sleep(1.5)  # Be polite — don't hammer the server

    return pd.DataFrame(all_books)

# Run the scraper
df = scrape_catalog(pages=10)
print(f"Scraped {len(df)} books")
print(df.head())

# Save to CSV
df.to_csv("books.csv", index=False)

Sample output:

Scraped 200 books
                                           title  price  rating  in_stock
0                          A Light in the Attic  £51.77       3      True
1                            Tipping the Velvet  £53.74       1      True
2                                    Soumission  £50.10       1      True
...
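
The price column in the sample output is still a string, so it cannot be summed or sorted numerically yet. One way to convert it is sketched below; `parse_price` is a hypothetical helper, and it assumes prices use a dot as the decimal separator with no thousands separators.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return the numeric part of a price string as a Decimal, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return Decimal(match.group()) if match else None


# The third value simulates a mis-decoded currency symbol; the digits still parse.
prices = ["£51.77", "£53.74", "Â£50.10", "free"]
parsed = [parse_price(p) for p in prices]
print(parsed)
```

With pandas, the same helper could be applied to the whole column with `df["price"].map(parse_price)`.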

Handling Pagination Automatically

The previous example used hard-coded page numbers. A better approach is to follow "Next" links dynamically — this way your scraper adapts to any number of pages.

# Reuses requests, BeautifulSoup, pd, time, and HEADERS from the previous example
from urllib.parse import urljoin

def scrape_all_pages(start_url):
    """Follow pagination links until there are no more pages."""
    all_books = []
    current_url = start_url

    while current_url:
        print(f"Scraping: {current_url}")
        response = requests.get(current_url, headers=HEADERS, timeout=10)
        response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
        soup = BeautifulSoup(response.text, "lxml")

        # Scrape current page
        for article in soup.select("article.product_pod"):
            title = article.select_one("h3 a")["title"]
            price = article.select_one(".price_color").text.strip()
            all_books.append({"title": title, "price": price})

        # Find the "next" button — returns None if we're on the last page
        next_btn = soup.select_one("li.next a")
        if next_btn:
            # Build the absolute URL from the relative href
            current_url = urljoin(current_url, next_btn["href"])
        else:
            current_url = None  # No more pages, stop the loop

        time.sleep(1)

    return pd.DataFrame(all_books)

df = scrape_all_pages("https://books.toscrape.com/catalogue/page-1.html")
print(f"Total books scraped: {len(df)}")

This pattern works for most paginated sites that expose a "next" link in their HTML, such as product listings, news archives, and search results. Usually only the selector for the link needs to change.
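
The pagination loop depends on `urljoin` to turn a relative `href` into an absolute URL. The short sketch below shows how it resolves a few kinds of links against the current page; the `href` values are examples, not taken from any particular site's markup.

```python
from urllib.parse import urljoin

page = "https://books.toscrape.com/catalogue/page-1.html"

# A bare filename replaces the last path segment of the current URL.
sibling = urljoin(page, "page-2.html")

# A leading slash resolves from the root of the host.
rooted = urljoin(page, "/robots.txt")

# ".." climbs one directory before resolving.
parent = urljoin(page, "../index.html")

# An absolute URL is returned unchanged.
absolute = urljoin(page, "https://example.com/other.html")

for url in (sibling, rooted, parent, absolute):
    print(url)
```

Passing the current page URL (rather than the site root) as the first argument is what keeps relative links correct as the scraper moves from page to page.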

Extracting Common Data Types

Extracting text

# .text gives raw text including whitespace
raw = element.text

# .get_text(strip=True) strips whitespace from each piece of text
clean = element.get_text(strip=True)

# .get_text(separator=", ") joins multiple text nodes
joined = element.get_text(separator=", ")

Extracting attributes

# Get the href from a link
url = soup.find("a")["href"]
url = soup.find("a").get("href")  # safer — returns None instead of KeyError

# Get the src from an image
img_src = soup.find("img").get("src")

# Get a data attribute
product_id = element.get("data-product-id")

Extracting tables

HTML tables are tedious to parse manually. pandas does it in one line:

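from io import StringIO

import pandas as pd

# pd.read_html() returns a list of all tables on the page as DataFrames.
# Recent pandas versions expect a file-like object rather than a raw HTML string.
tables = pd.read_html(StringIO(response.text))
df = tables[0]  # first table on the page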
print(df)

Handling Errors Gracefully

Real-world scraping always involves errors — network timeouts, missing elements, rate limiting. Here is a robust error-handling pattern:

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session():
    """Create a session with automatic retries on network errors."""
    session = requests.Session()
    session.headers.update({"User-Agent": "Mozilla/5.0"})

    # Retry up to 3 times on connection errors and 500/502/503/504
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,           # Exponential backoff; exact delays depend on the urllib3 version
        status_forcelist=[500, 502, 503, 504],
        allowed_methods=["GET"]
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

def safe_get_text(element, selector, default="N/A"):
    """Extract text from a CSS selector, with a fallback default."""
    found = element.select_one(selector)
    return found.get_text(strip=True) if found else default

# Usage
session = create_session()

try:
    response = session.get("https://example.com", timeout=15)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")

    title = safe_get_text(soup, "h1")
    price = safe_get_text(soup, ".price", default="Price not found")

except requests.exceptions.Timeout:
    print("Request timed out")
except requests.exceptions.HTTPError as e:
    print(f"HTTP error: {e.response.status_code}")
except requests.exceptions.ConnectionError:
    print("Could not connect to the server")
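
To make the idea of retrying with exponential backoff concrete, here is a hand-rolled sketch of the same strategy. It is not how urllib3's `Retry` is implemented; `retry_with_backoff` is a hypothetical helper, and the `sleep` parameter exists only so the demo can record delays instead of actually waiting.

```python
import time


def retry_with_backoff(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out.

    Waits base_delay * 2**n seconds after the n-th failure (n starts at 0).
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:  # in a scraper, narrow this to requests.exceptions.RequestException
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Demo: a fake fetch that fails twice, and a fake sleep that records the delays.
calls = {"n": 0}
delays = []


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "<html>ok</html>"


result = retry_with_backoff(flaky_fetch, sleep=delays.append)
print(result, delays)
```

The session-based approach above is generally preferable for HTTP requests; a wrapper like this is more useful for retrying a whole step, such as fetch plus parse.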

Respecting robots.txt

Before scraping any site, check its robots.txt file. It lives at the root of the host (for example, https://example.com/robots.txt) and specifies which paths are off-limits for bots.

import urllib.robotparser
from urllib.parse import urlparse

def is_allowed(url):
    """Check if robots.txt permits scraping this URL."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    return rp.can_fetch("*", url)

print(is_allowed("https://books.toscrape.com/"))  # True

Ignoring robots.txt is considered impolite and can have legal implications depending on your jurisdiction and the site's Terms of Service.
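
To see how the rules are evaluated without making a network request, `RobotFileParser` can also parse lines you hand it directly. The robots.txt content below is hypothetical, written only to illustrate a disallowed path and a crawl delay.

```python
import urllib.robotparser

# Hypothetical robots.txt content, for illustration only.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_lines)  # parse() takes an iterable of lines; no network needed

allowed = rp.can_fetch("*", "https://example.com/catalogue/page-1.html")
blocked = rp.can_fetch("*", "https://example.com/private/report.html")
delay = rp.crawl_delay("*")  # None when the file sets no Crawl-delay

print(allowed, blocked, delay)
```

When a site does declare a crawl delay, using it as the pause between requests is a reasonable default.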

Exporting Data

To CSV

df.to_csv("output.csv", index=False, encoding="utf-8-sig")
# utf-8-sig adds a BOM that makes Excel read accented characters correctly

To JSON

df.to_json("output.json", orient="records", indent=2, force_ascii=False)

To SQLite

import sqlite3

conn = sqlite3.connect("scraping_results.db")
df.to_sql("books", conn, if_exists="replace", index=False)
conn.close()
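
Once the data is in SQLite it can be queried with plain SQL. The sketch below uses an in-memory database as a stand-in for scraping_results.db, with three rows taken from the sample output earlier; it assumes the prices have already been converted to numbers.

```python
import sqlite3

# Stand-in rows: (title, price, rating). Prices are assumed to be numeric already.
rows = [
    ("A Light in the Attic", 51.77, 3),
    ("Tipping the Velvet", 53.74, 1),
    ("Soumission", 50.10, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, price REAL, rating INTEGER)")
conn.executemany("INSERT INTO books VALUES (?, ?, ?)", rows)

# Parameterised query: titles rated at least 3 stars, most expensive first.
query = "SELECT title FROM books WHERE rating >= ? ORDER BY price DESC"
top_rated = [title for (title,) in conn.execute(query, (3,))]
count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
conn.close()

print(top_rated, count)
```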

A Complete, Production-Ready Scraper

Here is the complete, polished version combining everything above:

import requests
import pandas as pd
import time
import logging
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
logger = logging.getLogger(__name__)

class BookScraper:
    BASE_URL = "https://books.toscrape.com/catalogue/page-1.html"
    HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; BookScraper/1.0)"}
    DELAY = 1.5  # seconds between requests

    RATING_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

    def __init__(self):
        self.session = self._create_session()

    def _create_session(self):
        session = requests.Session()
        session.headers.update(self.HEADERS)
        retries = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503])
        session.mount("https://", HTTPAdapter(max_retries=retries))
        return session

    def _fetch(self, url):
        response = self.session.get(url, timeout=15)
        response.raise_for_status()
        return BeautifulSoup(response.text, "lxml")

    def _parse_book(self, article):
        title = article.select_one("h3 a").get("title", "Unknown")
        price = article.select_one(".price_color").get_text(strip=True)
        rating_word = article.select_one(".star-rating")["class"][-1]
        rating = self.RATING_MAP.get(rating_word, 0)
        in_stock = "In stock" in article.select_one(".availability").text
        return {"title": title, "price": price, "rating": rating, "in_stock": in_stock}

    def scrape(self):
        all_books = []
        current_url = self.BASE_URL

        while current_url:
            logger.info(f"Scraping: {current_url}")
            soup = self._fetch(current_url)

            for article in soup.select("article.product_pod"):
                all_books.append(self._parse_book(article))

            next_btn = soup.select_one("li.next a")
            current_url = urljoin(current_url, next_btn["href"]) if next_btn else None
            time.sleep(self.DELAY)

        logger.info(f"Done. Scraped {len(all_books)} books.")
        return pd.DataFrame(all_books)

if __name__ == "__main__":
    scraper = BookScraper()
    df = scraper.scrape()
    df.to_csv("all_books.csv", index=False)
    print(df.describe())

Common Pitfalls and How to Avoid Them

1. Missing User-Agent. Many servers return a 403 or a bot-detection page if no User-Agent header is set. Always include one that mimics a real browser.

2. Not handling missing elements. If a single product is missing its price tag, find() or select_one() returns None, and calling .text on None raises an AttributeError that crashes your entire scraper. Check for None before reading text, or use the safe_get_text() helper pattern shown earlier.

3. Scraping too fast. Without delays, you can overwhelm small servers, get IP-banned, or cause real harm. A delay of 1–2 seconds between requests is standard practice. For large jobs, use asyncio (covered in Blog 03).

4. Ignoring encoding. Some sites serve Latin-1 or Windows-1252 pages, and others omit the charset, in which case requests falls back to ISO-8859-1 for text responses. If your text looks garbled (for example "Â£" instead of "£"), check response.encoding and set it explicitly before reading response.text, for example response.encoding = "utf-8" or response.encoding = response.apparent_encoding.

5. Parsing JavaScript-rendered content. BeautifulSoup only parses the raw HTML sent by the server. If the data you need is loaded by JavaScript after the page renders, BeautifulSoup cannot see it. You need Selenium or Playwright for those cases (covered in Blog 02).
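
The encoding pitfall (item 4) can be reproduced offline with nothing but string methods. This sketch encodes a price as UTF-8 bytes, as a server might send it, and then decodes those bytes with the wrong and the right codec.

```python
# The bytes a server might send for the text "£51.77", encoded as UTF-8.
body = "£51.77".encode("utf-8")

wrong = body.decode("iso-8859-1")  # wrong guess: every byte becomes its own character
right = body.decode("utf-8")       # correct codec

print(wrong)  # Â£51.77
print(right)  # £51.77

# If garbled text has already been produced this way, the round trip reverses it.
repaired = wrong.encode("iso-8859-1").decode("utf-8")
print(repaired)
```

Setting `response.encoding` before reading `response.text` avoids the wrong decode in the first place, which is cleaner than repairing the text afterwards.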

What to Learn Next

You now have a solid foundation in synchronous scraping with BeautifulSoup and requests. The natural next steps are:

  • Dynamic pages (JavaScript-rendered): Move to Playwright or Selenium when the data is loaded by JS

  • Async scraping: Use httpx + asyncio to fetch many pages concurrently, which is often much faster than one request at a time (Blog 03)

  • Anti-bot evasion: Learn how sites detect scrapers and how to avoid detection (Blog 04)

  • Production pipelines: Use Scrapy for large-scale, fault-tolerant crawling (Blog 05)

Summary

Concept          What you learned
HTTP basics      requests.get(), status codes, headers
Parsing          BeautifulSoup, find(), find_all(), select()
Navigation       CSS selectors, attribute extraction, text extraction
Pagination       Following next-page links dynamically
Error handling   Retry sessions, safe element access
Data export      CSV, JSON, SQLite via pandas
Ethics           robots.txt, rate limiting, Terms of Service

Web scraping is a superpower. Use it responsibly.


Originally published on ZyVOP
