Art Baker

How to Build a Web Scraper in Python (Step by Step)

Web scraping is one of the most practical Python skills you can learn. Here's how to build one from scratch.

What You Need

pip install requests beautifulsoup4

Step 1: Fetch the Page

import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"
response = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
})
soup = BeautifulSoup(response.text, "html.parser")

Always set a User-Agent. Many sites block requests without one.
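If you're making several requests, a requests.Session lets you set that header once and reuse it across every call (it also reuses the underlying TCP connection). A quick sketch:

```python
import requests

# A Session carries the same headers on every request it makes,
# so the User-Agent only needs to be set once.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
})

# Every session.get() from here on sends that header automatically:
# response = session.get("https://example.com/products")
```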

Step 2: Find Your Data

Use your browser's DevTools (F12 → Inspect) to identify the HTML structure. Then:

# Find all product cards
products = soup.find_all("div", class_="product-card")

for product in products:
    name = product.find("h2").text.strip()
    price = product.find("span", class_="price").text.strip()
    print(f"{name}: {price}")
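Chained find() calls work, but BeautifulSoup also supports CSS selectors via select(), which read closer to what you see in DevTools. A self-contained sketch — the HTML literal is hypothetical sample markup mirroring the product-card structure above:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for a fetched page
html = """
<div class="product-card"><h2>Widget</h2><span class="price">$9.99</span></div>
<div class="product-card"><h2>Gadget</h2><span class="price">$19.50</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() takes CSS selectors; select_one() returns the first match
products = [
    {"name": card.select_one("h2").text.strip(),
     "price": card.select_one("span.price").text.strip()}
    for card in soup.select("div.product-card")
]
print(products)
```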

Step 3: Handle Pagination

import time

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def scrape_all_pages(base_url):
    all_data = []
    page = 1
    while True:
        response = requests.get(f"{base_url}?page={page}", headers=HEADERS)
        soup = BeautifulSoup(response.text, "html.parser")
        items = soup.find_all("div", class_="product-card")
        if not items:
            break
        for item in items:
            all_data.append({
                "name": item.find("h2").text.strip(),
                "price": item.find("span", class_="price").text.strip(),
            })
        page += 1
        time.sleep(1)  # be polite: pause between pages
    return all_data
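Counting ?page= numbers assumes the site exposes them. Many sites instead render a "next" link, which you can follow until it disappears — more robust when page numbering changes. A sketch, where the rel="next" markup is an assumption about the target site:

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def next_page_url(soup, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    link = soup.select_one('a[rel="next"]')
    return urljoin(current_url, link["href"]) if link else None

# Example: a page whose footer links to page 2 (hypothetical markup)
soup = BeautifulSoup('<a rel="next" href="/products?page=2">Next</a>', "html.parser")
print(next_page_url(soup, "https://example.com/products"))
```

In the scraping loop, you'd fetch, parse, collect items, then set the URL to next_page_url(...) and stop when it returns None.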

Step 4: Save to CSV

import csv

data = scrape_all_pages("https://example.com/products")
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(data)
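If downstream tools prefer JSON, the same rows serialize just as easily — the data literal below stands in for the output of scrape_all_pages():

```python
import json

# Stand-in for scrape_all_pages() output
data = [{"name": "Widget", "price": "$9.99"}]

with open("products.json", "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps non-ASCII product names readable in the file
    json.dump(data, f, indent=2, ensure_ascii=False)
```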

Step 5: Add Error Handling

import time

def safe_request(url, retries=3):
    for attempt in range(retries):
        try:
            r = requests.get(url, timeout=10, headers={
                "User-Agent": "Mozilla/5.0"
            })
            r.raise_for_status()
            return r
        except requests.RequestException as e:
            print(f"Attempt {attempt+1} failed: {e}")
            time.sleep(2 ** attempt)
    return None
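The retry-with-backoff pattern inside safe_request generalizes to any callable, which also makes it easy to test without touching the network. A sketch with a simulated flaky fetch standing in for requests.get:

```python
import time

def with_retries(fn, retries=3, backoff=2):
    """Call fn(); on exception, wait backoff**attempt seconds and retry."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < retries - 1:
                time.sleep(backoff ** attempt)
    return None  # all retries exhausted

# Simulated flaky fetch: fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

result = with_retries(flaky)
print(result)
```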

Common Pitfalls

  1. Rate limiting — Add time.sleep(1) between requests
  2. Dynamic content — If data loads via JavaScript, use Playwright or Selenium instead
  3. Changing HTML — Your selectors will break when the site updates. Prefer stable hooks like IDs or data-* attributes over long chains of presentational class names.
  4. Legal — Check the site's robots.txt and terms of service

Want ready-to-use scraping scripts? My Web Scraping Starter Kit includes 5 production scripts covering tables, pagination, login-protected sites, and API extraction.

Also check out: Python Automation Toolkit — 10 scripts for common dev tasks.
