ZyVOP

Posted on Jun 2 • Edited on Jun 8 • Originally published at zyvop.com

Web Scraping with Python: A Complete BeautifulSoup & Requests Guide

#python #beautifulsoup #requests #webscraping

Every day, billions of web pages sit on the internet — full of prices, headlines, job listings, research data, and more. Most of it has no official API. Web scraping is how you collect that data programmatically, turning raw HTML into clean, structured datasets you can actually use.

Python is the gold standard for web scraping. It has a rich ecosystem, readable syntax, and two libraries in particular that make scraping feel almost effortless: requests (for fetching web pages) and BeautifulSoup (for parsing them).

By the end of this guide, you will:

Understand how HTTP requests and HTML parsing work together
Write a scraper that collects data from real websites
Handle pagination, headers, and common errors
Export your data to CSV using pandas

Let's dig in.

How Web Scraping Works

When you type a URL into a browser, your browser sends an HTTP GET request to a server. The server responds with HTML. Your browser renders that HTML into the visual page you see.

Web scraping does the same thing — but instead of a browser rendering the HTML visually, Python reads it programmatically and extracts exactly the data you want.

Your Script  →  HTTP GET Request  →  Web Server
Web Server   →  HTML Response     →  Your Script
Your Script  →  Parse HTML        →  Structured Data

There are two key parts:

requests handles the first half: sending the HTTP request and receiving the HTML
BeautifulSoup handles the second half: parsing that HTML so you can navigate and extract from it

Installation

Install all required libraries with a single pip command:

pip install requests beautifulsoup4 pandas lxml

Why lxml? BeautifulSoup supports multiple parsers. lxml is very fast and tolerant of malformed HTML, which matters because real-world HTML is often messy. (html5lib is more lenient still, but much slower.)

Your First Scraper: Fetching a Page

Let's start simple. Here is how to fetch the HTML of any webpage:

import requests

url = "https://books.toscrape.com/"

# A User-Agent tells the server what kind of client is making the request.
# Without this, many servers will block or return a different response.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36"
}

response = requests.get(url, headers=headers, timeout=10)

# Always check the status code before parsing
print(response.status_code)   # 200 = success
print(len(response.text))     # Length of the HTML string

About status codes:

Code	Meaning
200	Success
301/302	Redirect (requests follows these automatically)
403	Forbidden — you're being blocked
404	Page not found
429	Too many requests — you're being rate-limited
500	Server error

If you get a 403, your User-Agent is probably missing or being rejected. If you get a 429, you are scraping too fast.
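One way to act on these codes in a script is a small helper that turns a status code into a next step. This is only a sketch: the function name `next_action` and the action labels are made up for this example, and the branches simply mirror the table above.

```python
from http import HTTPStatus

def next_action(status_code):
    """Map an HTTP status code to a coarse scraping decision (hypothetical labels)."""
    if status_code == HTTPStatus.OK:
        return "parse"
    if status_code == HTTPStatus.TOO_MANY_REQUESTS:
        return "slow down"        # 429: rate-limited, wait before retrying
    if status_code == HTTPStatus.FORBIDDEN:
        return "check headers"    # 403: often a missing or rejected User-Agent
    if status_code == HTTPStatus.NOT_FOUND:
        return "skip"             # 404: the page does not exist
    if 500 <= status_code < 600:
        return "retry later"      # server-side problem
    return "inspect manually"

for code in (200, 403, 404, 429, 500):
    print(code, "->", next_action(code))
```

In a real scraper you would call it as `next_action(response.status_code)` right after `requests.get()`.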

Parsing HTML with BeautifulSoup

Once you have the HTML string, you pass it to BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "lxml")

# The `soup` object now represents the entire HTML document.
# You can navigate it like a tree.
print(soup.title.text)         # Page title
print(soup.find("h1").text)    # First h1 on the page

BeautifulSoup gives you several ways to find elements:

Method 1: `find()` — returns the first match

# Find the first element with tag <h2>
heading = soup.find("h2")

# Find the first element with a specific class
box = soup.find("div", class_="product-box")

# Find by ID
sidebar = soup.find("div", id="sidebar")

Method 2: `find_all()` — returns a list of all matches

# Find ALL <a> tags
all_links = soup.find_all("a")

# Iterate and extract
for link in all_links:
    print(link.text, link.get("href"))

Method 3: CSS Selectors with `select()` — the most powerful

If you know CSS, you already know this. .select() accepts any CSS selector string.

# All elements with class "product_pod"
products = soup.select("article.product_pod")

# All anchors inside elements with class "titleline" (select() returns a list)
title_links = soup.select(".titleline a")

# Nested selectors — p tags inside div.content
paragraphs = soup.select("div.content p")

# select_one() is like find() but uses CSS syntax
price = soup.select_one(".price_color")

Tip: Use your browser's DevTools to get selectors instantly. Right-click any element → Inspect → Right-click the highlighted HTML → Copy → Copy selector.
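To compare the three lookup styles side by side without touching the network, here is a small sketch that parses an inline HTML string. The HTML is made up for illustration, and it uses Python's built-in "html.parser" so it runs even without lxml installed; "lxml" should behave the same way on markup this simple.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
html = """
<div id="sidebar"><a href="/about">About</a></div>
<article class="product_pod"><h3><a href="/b/1" title="Book One">Book...</a></h3>
  <p class="price_color">£10.00</p></article>
<article class="product_pod"><h3><a href="/b/2" title="Book Two">Book...</a></h3>
  <p class="price_color">£12.50</p></article>
"""

soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")        # first match only
all_links = soup.find_all("a")     # every <a> in the document
titles = [a["title"] for a in soup.select("article.product_pod h3 a")]
first_price = soup.select_one(".price_color").get_text(strip=True)

print(first_link["href"])   # /about
print(len(all_links))       # 3
print(titles)               # ['Book One', 'Book Two']
print(first_price)          # £10.00
```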

Real Example: Scraping Book Data

books.toscrape.com is a sandbox website built specifically for scraping practice. Let's scrape its catalog.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; BookScraper/1.0)"
}

def parse_rating(class_string):
    """Convert word-based star rating to number."""
    rating_map = {
        "One": 1, "Two": 2, "Three": 3,
        "Four": 4, "Five": 5
    }
    # class_string looks like "star-rating Three"
    word = class_string.split()[-1]
    return rating_map.get(word, 0)

def scrape_page(url):
    """Scrape all books from a single catalogue page."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # raises exception on 4xx/5xx

    soup = BeautifulSoup(response.text, "lxml")
    books = []

    for article in soup.select("article.product_pod"):
        title = article.select_one("h3 a")["title"]
        price = article.select_one(".price_color").text.strip()
        rating_class = article.select_one(".star-rating")["class"]
        rating = parse_rating(" ".join(rating_class))
        in_stock = "In stock" in article.select_one(".availability").text

        books.append({
            "title": title,
            "price": price,
            "rating": rating,
            "in_stock": in_stock
        })

    return books

def scrape_catalog(pages=5):
    """Scrape multiple pages with polite delays."""
    base_url = "https://books.toscrape.com/catalogue/page-{}.html"
    all_books = []

    for page_num in range(1, pages + 1):
        url = base_url.format(page_num)
        print(f"Scraping page {page_num}...")

        page_books = scrape_page(url)
        all_books.extend(page_books)

        time.sleep(1.5)  # Be polite — don't hammer the server

    return pd.DataFrame(all_books)

# Run the scraper
df = scrape_catalog(pages=10)
print(f"Scraped {len(df)} books")
print(df.head())

# Save to CSV
df.to_csv("books.csv", index=False)

Sample output:

Scraped 200 books
                                           title  price  rating  in_stock
0                          A Light in the Attic  £51.77       3      True
1                            Tipping the Velvet  £53.74       1      True
2                                    Soumission  £50.10       1      True
...
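Note that the price column holds strings such as "£51.77", so you cannot sum or average it directly. A minimal sketch of one way to convert them, assuming every price has a single number with a dot as the decimal separator; the helper name `parse_price` is hypothetical:

```python
import re
from decimal import Decimal

def parse_price(text):
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return Decimal(match.group())

prices = ["£51.77", "£53.74", "£50.10"]
values = [parse_price(p) for p in prices]
print(values)
print(sum(values))   # 155.61
```

Decimal avoids floating-point rounding surprises with money; if you only need rough statistics, `float(match.group())` would do as well.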

Handling Pagination Automatically

The previous example used hard-coded page numbers. A better approach is to follow "Next" links dynamically — this way your scraper adapts to any number of pages.

# Reuses requests, BeautifulSoup, pd, time, and HEADERS from the previous example
from urllib.parse import urljoin

def scrape_all_pages(start_url):
    """Follow pagination links until there are no more pages."""
    all_books = []
    current_url = start_url

    while current_url:
        print(f"Scraping: {current_url}")
        response = requests.get(current_url, headers=HEADERS, timeout=10)
        response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
        soup = BeautifulSoup(response.text, "lxml")

        # Scrape current page
        for article in soup.select("article.product_pod"):
            title = article.select_one("h3 a")["title"]
            price = article.select_one(".price_color").text.strip()
            all_books.append({"title": title, "price": price})

        # Find the "next" button — returns None if we're on the last page
        next_btn = soup.select_one("li.next a")
        if next_btn:
            # Build the absolute URL from the relative href
            current_url = urljoin(current_url, next_btn["href"])
        else:
            current_url = None  # No more pages, stop the loop

        time.sleep(1)

    return pd.DataFrame(all_books)

df = scrape_all_pages("https://books.toscrape.com/catalogue/page-1.html")
print(f"Total books scraped: {len(df)}")

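This pattern works for most sites that paginate with ordinary "next" links in the HTML: product listings, news archives, search results. Only the selector for the link changes. It does not cover infinite scroll or pagination driven by JavaScript.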
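The part of the loop that is easiest to get wrong is turning a relative href into a full URL. Here is how urljoin resolves a few typical cases; the hrefs are illustrative values, not taken from a specific page:

```python
from urllib.parse import urljoin

page = "https://books.toscrape.com/catalogue/page-1.html"

# A relative href replaces the last path segment of the current URL
print(urljoin(page, "page-2.html"))
# https://books.toscrape.com/catalogue/page-2.html

# A root-relative href replaces the whole path
print(urljoin(page, "/index.html"))
# https://books.toscrape.com/index.html

# An absolute href is returned unchanged
print(urljoin(page, "https://example.com/other"))
# https://example.com/other
```

This is why the loop passes current_url, not the site root, as the base: relative links are resolved against the page they appear on.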

Extracting Common Data Types

Extracting text

# .text gives raw text including whitespace
raw = element.text

# .get_text(strip=True) is cleaner
clean = element.get_text(strip=True)

# .get_text(separator=", ") joins multiple text nodes
joined = element.get_text(separator=", ")

Extracting attributes

# Get the href from a link
url = soup.find("a")["href"]
url = soup.find("a").get("href")  # safer — returns None instead of KeyError

# Get the src from an image
img_src = soup.find("img").get("src")

# Get a data attribute
product_id = element.get("data-product-id")
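The text and attribute accessors above can be tried offline on a tiny inline snippet. The HTML below is made up, and "html.parser" is used so the sketch does not depend on lxml:

```python
from bs4 import BeautifulSoup

html = '<p class="desc">  Hello <b>big</b> world  </p><a id="x">no link</a>'
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")
a = soup.find("a")

print(repr(p.text))                                   # '  Hello big world  '
print(repr(p.get_text(strip=True)))                   # 'Hellobigworld'
print(repr(p.get_text(separator=" ", strip=True)))    # 'Hello big world'

print(a.get("href"))          # None: the attribute is missing
try:
    a["href"]                 # square brackets raise instead
except KeyError:
    print("no href attribute")
```

Notice that strip=True on its own glues neighbouring text nodes together, so pairing it with a separator is usually what you want for elements that contain nested tags.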

Extracting tables

HTML tables are tedious to parse manually. pandas does it in one line:

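from io import StringIO

import pandas as pd

# pd.read_html() returns a list of all tables on the page as DataFrames.
# Recent pandas versions deprecate passing a raw HTML string, so wrap it in StringIO.
tables = pd.read_html(StringIO(response.text))
df = tables[0]  # first table on the page
print(df)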

Handling Errors Gracefully

Real-world scraping always involves errors — network timeouts, missing elements, rate limiting. Here is a robust error-handling pattern:

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session():
    """Create a session with automatic retries on network errors."""
    session = requests.Session()
    session.headers.update({"User-Agent": "Mozilla/5.0"})

    # Retry up to 3 times on connection errors and 500/502/503/504
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,           # Exponential backoff; exact delays depend on the urllib3 version
        status_forcelist=[500, 502, 503, 504],
        allowed_methods=["GET"]
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

def safe_get_text(element, selector, default="N/A"):
    """Extract text from a CSS selector, with a fallback default."""
    found = element.select_one(selector)
    return found.get_text(strip=True) if found else default

# Usage
session = create_session()

try:
    response = session.get("https://example.com", timeout=15)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")

    title = safe_get_text(soup, "h1")
    price = safe_get_text(soup, ".price", default="Price not found")

except requests.exceptions.Timeout:
    print("Request timed out")
except requests.exceptions.HTTPError as e:
    print(f"HTTP error: {e.response.status_code}")
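except requests.exceptions.ConnectionError:
    print("Could not connect to the server")
except requests.exceptions.RequestException as e:
    # Catches anything else, e.g. RetryError once the retry budget is exhausted
    print(f"Request failed: {e}")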
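The Retry adapter hides the retry loop inside urllib3. To see the idea in isolation, here is a hand-rolled sketch of retry with exponential backoff. It is not how requests implements it; `retry_call`, its parameters, and the fake `flaky_fetch` function are all hypothetical names for this illustration.

```python
import time

def retry_call(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:   # broad on purpose for the demo; narrow this in real code
            if attempt == retries:
                raise       # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demo with a fake fetch that fails twice, then succeeds
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "<html>ok</html>"

waits = []
result = retry_call(flaky_fetch, sleep=waits.append)  # record waits instead of sleeping
print(result)   # <html>ok</html>
print(waits)    # [1.0, 2.0]
```

Passing the sleep function as a parameter keeps the example fast and makes the backoff schedule easy to check.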

Respecting robots.txt

Before scraping any site, check its robots.txt file. This file, always located at domain.com/robots.txt, specifies which paths are off-limits for bots.

import urllib.robotparser

def is_allowed(url):
    """Check if robots.txt permits scraping this URL."""
    from urllib.parse import urlparse
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    return rp.can_fetch("*", url)

print(is_allowed("https://books.toscrape.com/"))  # True

Ignoring robots.txt is considered impolite and can have legal implications depending on your jurisdiction and the site's Terms of Service.
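To see how the rules are evaluated without making a network request, you can feed RobotFileParser the lines directly with parse(). The robots.txt content below is made up for illustration:

```python
import urllib.robotparser

# Made-up robots.txt content
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_lines)   # parse in-memory lines instead of fetching a URL

print(rp.can_fetch("*", "https://example.com/catalogue/page-1.html"))  # True
print(rp.can_fetch("*", "https://example.com/private/data.html"))      # False
print(rp.crawl_delay("*"))                                             # 2
```

If a site declares a Crawl-delay, it is reasonable to use that value as the minimum pause between your requests.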

Exporting Data

To CSV

df.to_csv("output.csv", index=False, encoding="utf-8-sig")
# utf-8-sig adds a BOM that makes Excel read accented characters correctly
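If you are curious what the BOM actually is, this short sketch compares the two encodings on a sample string (the string itself is arbitrary):

```python
import codecs

text = "Café £5"
plain = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")

print(with_bom.startswith(codecs.BOM_UTF8))   # True
print(len(with_bom) - len(plain))             # 3
print(with_bom.decode("utf-8-sig") == text)   # True
```

The only difference is three marker bytes at the start of the file, which Excel uses as a hint that the content is UTF-8.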

To JSON

df.to_json("output.json", orient="records", indent=2, force_ascii=False)

To SQLite

import sqlite3

conn = sqlite3.connect("scraping_results.db")
df.to_sql("books", conn, if_exists="replace", index=False)
conn.close()
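Once the data is in SQLite you can query it with plain SQL. The sketch below uses an in-memory database and inserts rows with the standard sqlite3 module directly, rather than through pandas, so it runs without creating a file; the rows and the query are made up for illustration.

```python
import sqlite3

rows = [
    {"title": "Book One", "price": 10.0, "rating": 3},
    {"title": "Book Two", "price": 12.5, "rating": 5},
]

conn = sqlite3.connect(":memory:")   # in-memory database, nothing written to disk
conn.execute("CREATE TABLE books (title TEXT, price REAL, rating INTEGER)")
conn.executemany("INSERT INTO books VALUES (:title, :price, :rating)", rows)
conn.commit()

# Titles rated 4 stars or more, cheapest first
top = conn.execute(
    "SELECT title FROM books WHERE rating >= 4 ORDER BY price"
).fetchall()
print(top)   # [('Book Two',)]
conn.close()
```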

A Complete, Production-Ready Scraper

Here is the complete, polished version combining everything above:

import requests
import pandas as pd
import time
import logging
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
logger = logging.getLogger(__name__)

class BookScraper:
    BASE_URL = "https://books.toscrape.com/catalogue/page-1.html"
    HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; BookScraper/1.0)"}
    DELAY = 1.5  # seconds between requests

    RATING_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

    def __init__(self):
        self.session = self._create_session()

    def _create_session(self):
        session = requests.Session()
        session.headers.update(self.HEADERS)
        retries = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503])
        session.mount("https://", HTTPAdapter(max_retries=retries))
        return session

    def _fetch(self, url):
        response = self.session.get(url, timeout=15)
        response.raise_for_status()
        return BeautifulSoup(response.text, "lxml")

    def _parse_book(self, article):
        title = article.select_one("h3 a").get("title", "Unknown")
        price = article.select_one(".price_color").get_text(strip=True)
        rating_word = article.select_one(".star-rating")["class"][-1]
        rating = self.RATING_MAP.get(rating_word, 0)
        in_stock = "In stock" in article.select_one(".availability").text
        return {"title": title, "price": price, "rating": rating, "in_stock": in_stock}

    def scrape(self):
        all_books = []
        current_url = self.BASE_URL

        while current_url:
            logger.info(f"Scraping: {current_url}")
            soup = self._fetch(current_url)

            for article in soup.select("article.product_pod"):
                all_books.append(self._parse_book(article))

            next_btn = soup.select_one("li.next a")
            current_url = urljoin(current_url, next_btn["href"]) if next_btn else None
            time.sleep(self.DELAY)

        logger.info(f"Done. Scraped {len(all_books)} books.")
        return pd.DataFrame(all_books)

if __name__ == "__main__":
    scraper = BookScraper()
    df = scraper.scrape()
    df.to_csv("all_books.csv", index=False)
    print(df.describe())

Common Pitfalls and How to Avoid Them

1. Missing User-Agent: Many servers return a 403 or a bot-detection page if no User-Agent header is set. Always include one that mimics a real browser.

2. Not handling missing elements: find() and select_one() return None when nothing matches, so if a single product is missing its price tag, calling .text on the result raises an AttributeError and stops the whole scraper. Check for None before reading the text, or use the safe_get_text() helper pattern shown earlier.

3. Scraping too fast: Without delays, you can overwhelm small servers, get IP-banned, or cause real harm. A delay of 1–2 seconds between requests is standard practice. For large jobs, use asyncio (covered in Blog 03).

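4. Ignoring encoding: Some sites serve Latin-1 or Windows-1252 pages without declaring the charset correctly. If your text looks garbled, compare response.encoding with response.apparent_encoding and set the right value before reading response.text, for example response.encoding = response.apparent_encoding.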

5. Parsing JavaScript-rendered content: BeautifulSoup only parses the raw HTML sent by the server. If the data you need is loaded by JavaScript after the page renders, BeautifulSoup cannot see it. You need Selenium or Playwright for those cases (covered in Blog 02).
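To make pitfall 4 concrete, here is what an encoding mismatch looks like at the byte level. The sample string is arbitrary; with requests, the equivalent fix is setting response.encoding before reading response.text.

```python
# Bytes as a Windows-1252 server would send them
raw = "Café £5".encode("cp1252")

wrong = raw.decode("utf-8", errors="replace")   # decoded with the wrong codec
right = raw.decode("cp1252")                    # decoded with the matching codec

print(wrong)   # accented and currency characters become replacement characters
print(right)   # Café £5
```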

What to Learn Next

You now have a solid foundation in synchronous scraping with BeautifulSoup and requests. The natural next steps are:

Dynamic pages (JavaScript-rendered): Move to Playwright or Selenium when the data is loaded by JS
Async scraping: Use httpx + asyncio to fetch many pages concurrently, which is often much faster for large jobs (Blog 03)
Anti-bot evasion: Learn how sites detect scrapers and how to avoid detection (Blog 04)
Production pipelines: Use Scrapy for large-scale, fault-tolerant crawling (Blog 05)

Summary

Concept	What you learned
HTTP basics	requests.get(), status codes, headers
Parsing	BeautifulSoup, find(), find_all(), select()
Navigation	CSS selectors, attribute extraction, text extraction
Pagination	Following next-page links dynamically
Error handling	Retry sessions, safe element access
Data export	CSV, JSON, SQLite via pandas
Ethics	robots.txt, rate limiting, Terms of Service

Web scraping is a superpower. Use it responsibly.
