DEV Community

agenthustler

Complete Scrapy Tutorial in 2026: Build a Production Web Scraper from Scratch

Web scraping is one of the most valuable skills in a Python developer's toolkit. Whether you're building a price tracker, aggregating job listings, or feeding data into an ML pipeline, you need a reliable scraping framework. Scrapy is that framework.

In this tutorial, you'll build a complete, production-ready web scraper from scratch using Scrapy. By the end, you'll understand spiders, pipelines, middlewares, and how to deploy your scraper for recurring jobs.

Why Scrapy Over BeautifulSoup + Requests?

Before we dive in, let's settle a common question.

| Feature | BeautifulSoup + Requests | Scrapy |
| --- | --- | --- |
| Learning curve | Low | Medium |
| Async requests | No (manual threading) | Built-in |
| Rate limiting | Manual | Built-in (DOWNLOAD_DELAY) |
| Data pipelines | Manual | Built-in pipeline system |
| Middleware support | None | Full middleware stack |
| Retry logic | Manual | Built-in |
| Export formats | Manual | JSON, CSV, XML built-in |
| Best for | Quick one-off scripts | Production scrapers |

Rule of thumb: Use BeautifulSoup for quick scripts under 100 lines. Use Scrapy for anything that runs on a schedule, needs to handle errors gracefully, or scrapes more than a few hundred pages.

Setting Up Your Scrapy Project

pip install scrapy
scrapy startproject bookstore
cd bookstore

This generates the standard Scrapy project structure:

bookstore/
├── scrapy.cfg
└── bookstore/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py

Each file has a purpose:

  • items.py — Data models (like dataclasses for your scraped data)
  • pipelines.py — Post-processing: cleaning, validation, database storage
  • middlewares.py — Request/response hooks: proxy rotation, header injection
  • settings.py — Configuration: delays, concurrency, pipeline order

Step 1: Define Your Data Model

Open bookstore/items.py:

import scrapy

class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    availability = scrapy.Field()
    url = scrapy.Field()

Items enforce structure: assigning to a field that was never declared raises a KeyError immediately, so typos fail loudly instead of silently corrupting your data downstream.

Step 2: Build Your Spider

Create bookstore/spiders/books_spider.py:

import scrapy
from bookstore.items import BookItem

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/catalogue/page-1.html"]

    rating_map = {
        "One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5
    }

    def parse(self, response):
        # Each book on the listing page is an <article class="product_pod">
        for book in response.css("article.product_pod"):
            item = BookItem()
            item["title"] = book.css("h3 a::attr(title)").get()
            item["price"] = book.css(".price_color::text").get()
            # The rating is encoded in the class attribute, e.g. "star-rating Three"
            item["rating"] = self.rating_map.get(
                book.css(".star-rating::attr(class)").re_first(
                    r"star-rating (\w+)"
                ), 0
            )
            # The last text node holds the actual "In stock" string
            item["availability"] = book.css(
                ".availability::text"
            ).getall()[-1].strip()
            item["url"] = response.urljoin(
                book.css("h3 a::attr(href)").get()
            )
            yield item

        # Follow pagination until the "next" link disappears on the last page
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Key Scrapy concepts here:

  • start_urls — Entry points. Scrapy sends GET requests to each.
  • response.css() — CSS selectors (also supports XPath via response.xpath()).
  • yield item — Sends the item through your pipeline.
  • response.follow() — Follows links with automatic URL resolution.
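The rating extraction is the trickiest line: the class attribute arrives as a string like "star-rating Three", and re_first pulls out the word that rating_map converts to an integer. The same logic can be checked with plain re, as a standalone sketch of what the spider does:

```python
import re

# Word-to-number mapping, as used in BooksSpider.rating_map
rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_rating(class_attr):
    """Extract the star-rating word from a class string and map it to an int."""
    match = re.search(r"star-rating (\w+)", class_attr)
    return rating_map.get(match.group(1), 0) if match else 0

print(parse_rating("star-rating Three"))  # → 3
print(parse_rating("product_pod"))        # → 0 (no rating present)
```

The 0 default means a missing or malformed rating never crashes the spider; you can filter zero-rated rows out later in a pipeline if you prefer.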

Step 3: Add a Data Pipeline

Pipelines process items after extraction. Open bookstore/pipelines.py:

from scrapy.exceptions import DropItem


class CleanPricePipeline:
    def process_item(self, item, spider):
        # Strip the currency symbol and convert "£51.77" to 51.77
        if item.get("price"):
            raw = item["price"].replace("£", "").strip()
            item["price"] = float(raw)
        return item


class DuplicateFilterPipeline:
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # Raising DropItem stops the item from reaching later pipelines and feeds
        if item["url"] in self.seen_urls:
            raise DropItem(f"Duplicate: {item['url']}")
        self.seen_urls.add(item["url"])
        return item

Enable them in settings.py:

ITEM_PIPELINES = {
    "bookstore.pipelines.CleanPricePipeline": 100,
    "bookstore.pipelines.DuplicateFilterPipeline": 200,
}

The numbers set execution order. Lower runs first.
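You can sanity-check pipeline behavior outside a crawl by chaining process_item calls by hand. This sketch reproduces the two pipelines above with a local DropItem stand-in so it runs without Scrapy installed:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs standalone."""

class CleanPricePipeline:
    def process_item(self, item, spider):
        # Strip the currency symbol and convert "£51.77" to 51.77
        if item.get("price"):
            item["price"] = float(item["price"].replace("£", "").strip())
        return item

class DuplicateFilterPipeline:
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        if item["url"] in self.seen_urls:
            raise DropItem(f"Duplicate: {item['url']}")
        self.seen_urls.add(item["url"])
        return item

# Run an item through the chain in priority order (100, then 200)
pipelines = [CleanPricePipeline(), DuplicateFilterPipeline()]
item = {"price": "£51.77", "url": "https://example.com/book-1"}
for p in pipelines:
    item = p.process_item(item, spider=None)
print(item["price"])  # → 51.77
```

Feeding the same URL through again raises DropItem from the second pipeline, which is exactly what Scrapy does mid-crawl (it logs the drop and moves on).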

Step 4: Configure Settings for Production

Edit bookstore/settings.py:

# Politeness: wait between requests and cap per-domain concurrency
DOWNLOAD_DELAY = 1.5
CONCURRENT_REQUESTS_PER_DOMAIN = 4
RANDOMIZE_DOWNLOAD_DELAY = True

# Respect robots.txt rules
ROBOTSTXT_OBEY = True

# Retry transient server errors and rate-limit responses
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# Identify your bot honestly
USER_AGENT = "Mozilla/5.0 (compatible; BookstoreBot/1.0)"

# Adapt request rate to server latency automatically
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Export every run to JSON without extra CLI flags
FEEDS = {
    "output/books.json": {
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,
    },
}

AutoThrottle is a killer feature. It dynamically adjusts request speed based on server response times. If the server slows down, Scrapy backs off automatically.
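Under the hood, AutoThrottle aims for a delay of latency / target_concurrency and moves the current delay halfway toward that target after each response, clamped between the configured floor and AUTOTHROTTLE_MAX_DELAY. This is a simplified model of the algorithm described in the Scrapy docs (the real implementation also refuses to lower the delay after error responses); the numbers here are hypothetical:

```python
def next_delay(current_delay, latency, target_concurrency=2.0,
               min_delay=1.5, max_delay=10.0):
    """Simplified AutoThrottle update: average current delay with the target."""
    target = latency / target_concurrency
    new = (current_delay + target) / 2
    return max(min_delay, min(max_delay, new))

# A slow response (4s latency) pushes the delay up...
print(next_delay(1.5, latency=4.0))   # → 1.75
# ...a fast one (0.2s) pulls it back down to the floor.
print(next_delay(1.75, latency=0.2))  # → 1.5
```

The averaging step is what makes the adjustment gradual: one slow response nudges the delay rather than spiking it.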

Step 5: Custom Middleware for Proxy Rotation

For production scraping, you'll need proxies. Here's a simple rotation middleware in bookstore/middlewares.py:

import random

class ProxyRotationMiddleware:
    def __init__(self):
        # Placeholder endpoints: swap in your real proxy pool
        self.proxies = [
            "http://proxy1:8080",
            "http://proxy2:8080",
            "http://proxy3:8080",
        ]

    def process_request(self, request, spider):
        # Scrapy routes the request through whatever proxy is set in meta
        request.meta["proxy"] = random.choice(self.proxies)

Enable it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    "bookstore.middlewares.ProxyRotationMiddleware": 350,
}

Need reliable proxies? ThorData offers residential proxies with high success rates for scraping workloads, and ScrapeOps provides a proxy aggregator that automatically routes through the best provider for each domain.

Step 6: Run Your Scraper

scrapy crawl books                 # run, writing to the FEEDS target in settings.py
scrapy crawl books -O books.json   # overwrite books.json instead (-o appends)
scrapy crawl books -L INFO         # quieter logs than the default DEBUG

You should see output like:

2026-03-09 10:00:01 [scrapy.core.engine] INFO: Spider opened
2026-03-09 10:00:03 [scrapy.core.scraper] DEBUG: Scraped from <200 ...>
{'title': 'A Light in the Attic', 'price': 51.77, 'rating': 3, ...}
...
2026-03-09 10:00:45 [scrapy.core.engine] INFO: Closing spider (finished)
 'item_scraped_count': 1000,

Running Scrapy in Production

For recurring scrapes, you have several options:

Option 1: Cron job

0 2 * * * cd /path/to/bookstore && scrapy crawl books >> /var/log/scraper.log 2>&1

Option 2: CrawlerProcess for programmatic execution

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("books")
process.start()

Option 3: Scrapyd, a daemon with a JSON API for scheduling. Deploy the project egg with scrapyd-client first (after filling in the [deploy] section of scrapy.cfg), then schedule runs over HTTP:

pip install scrapyd scrapyd-client
scrapyd &
scrapyd-deploy default -p bookstore
curl http://localhost:6800/schedule.json -d project=bookstore -d spider=books

Handling Common Challenges

JavaScript-Rendered Pages

Scrapy alone can't execute JavaScript. For SPAs, add scrapy-playwright:

pip install scrapy-playwright
playwright install chromium
scrapy-playwright also needs its download handler registered under DOWNLOAD_HANDLERS and TWISTED_REACTOR set to the asyncio reactor in settings.py (see its README). Then request pages with PageMethod:

from scrapy_playwright.page import PageMethod

yield scrapy.Request(
    url,
    meta={
        "playwright": True,
        "playwright_page_methods": [
            PageMethod("wait_for_selector", ".product-list"),
        ],
    },
)

Rate Limiting and Bans

Signs you're being blocked: 403 or 429 responses, CAPTCHAs, or pages that come back empty.

Solutions:

  1. Increase DOWNLOAD_DELAY
  2. Rotate User-Agents and proxies
  3. Use ScrapeOps Monitoring to track success rates across domains
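User-Agent rotation follows the same pattern as the proxy middleware from Step 5. A minimal sketch, with illustrative UA strings and a stub object standing in for Scrapy's Request (in a real crawl, Scrapy passes its own Request with a headers mapping):

```python
import random

class UserAgentRotationMiddleware:
    """Pick a random User-Agent per request, using the same hook as proxies."""
    def __init__(self):
        # Illustrative strings: use a maintained UA list in production
        self.user_agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
        ]

    def process_request(self, request, spider):
        # Overwrite the header so each request presents a different browser
        request.headers["User-Agent"] = random.choice(self.user_agents)

# Quick check with a stub request object
class StubRequest:
    def __init__(self):
        self.headers = {}

req = StubRequest()
UserAgentRotationMiddleware().process_request(req, spider=None)
print(req.headers["User-Agent"].startswith("Mozilla/5.0"))  # → True
```

Enable it in DOWNLOADER_MIDDLEWARES alongside the proxy middleware; rotating both together makes each request look far less like a bot fingerprint.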

Storing Data in a Database

Add a database pipeline:

import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("books.db")
        # UNIQUE on url makes re-runs idempotent via INSERT OR IGNORE
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS books"
            " (title TEXT, price REAL, rating INT,"
            " availability TEXT, url TEXT UNIQUE)"
        )

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT OR IGNORE INTO books VALUES (?, ?, ?, ?, ?)",
            (item["title"], item["price"], item["rating"],
             item["availability"], item["url"])
        )
        # Committing per item is simple; batch commits for very large crawls
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
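The UNIQUE constraint on url plus INSERT OR IGNORE is what makes repeated crawls idempotent: re-scraping the same book never creates a second row. You can verify that with an in-memory database (a standalone sketch, no Scrapy required; the row values are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS books"
    " (title TEXT, price REAL, rating INT, availability TEXT, url TEXT UNIQUE)"
)

row = ("A Light in the Attic", 51.77, 3, "In stock", "https://example.com/b1")
for _ in range(2):  # simulate the same item arriving in two separate runs
    conn.execute("INSERT OR IGNORE INTO books VALUES (?, ?, ?, ?, ?)", row)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
print(count)  # → 1, the duplicate insert was silently ignored
```

If you would rather update stale prices than ignore duplicates, swap INSERT OR IGNORE for an upsert (INSERT ... ON CONFLICT(url) DO UPDATE).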

Architecture Recap

Scheduler -> Request -> Downloader Middlewares -> Downloader -> Response
Response -> Spider (parse) -> Items -> Item Pipelines -> Storage
Spider (parse) -> New Requests -> back to Scheduler (loop)

Every component is pluggable. You can swap the scheduler, add custom middlewares, chain pipelines. Scrapy is designed for extension.

What's Next

You now have a production-grade Scrapy scraper with:

  • Structured data models
  • Cleaning and deduplication pipelines
  • Rate limiting and auto-throttle
  • Proxy rotation middleware
  • Multiple output formats

From here, explore:

  • Scrapy Cloud (Zyte) for hosted deployments
  • scrapy-splash or scrapy-playwright for JavaScript rendering
  • Item Loaders for complex field processing
  • ThorData proxies for high-volume production scraping

The full code for this tutorial is based on books.toscrape.com — a safe sandbox for scraping practice. Start there, then adapt the patterns to your real-world targets.


Building scrapers that run reliably in production? ScrapeOps provides monitoring dashboards that track success rates, response times, and proxy performance across all your spiders.
