DEV Community

agenthustler

Complete Scrapy Tutorial in 2026: Build a Production Web Scraper from Scratch

Web scraping is one of the most valuable skills in a Python developer's toolkit. Whether you're building a price tracker, aggregating job listings, or feeding data into an ML pipeline, you need a reliable scraping framework. Scrapy is that framework.

In this tutorial, you'll build a complete, production-ready web scraper from scratch using Scrapy. By the end, you'll understand spiders, pipelines, middlewares, and how to deploy your scraper for recurring jobs.

Why Scrapy Over BeautifulSoup + Requests?

Before we dive in, let's settle a common question.

| Feature | BeautifulSoup + Requests | Scrapy |
| --- | --- | --- |
| Learning curve | Low | Medium |
| Async requests | No (manual threading) | Built-in |
| Rate limiting | Manual | Built-in (DOWNLOAD_DELAY) |
| Data pipelines | Manual | Built-in pipeline system |
| Middleware support | None | Full middleware stack |
| Retry logic | Manual | Built-in |
| Export formats | Manual | JSON, CSV, XML built-in |
| Best for | Quick one-off scripts | Production scrapers |

Rule of thumb: Use BeautifulSoup for quick scripts under 100 lines. Use Scrapy for anything that runs on a schedule, needs to handle errors gracefully, or scrapes more than a few hundred pages.

Setting Up Your Scrapy Project

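The scaffolding is standard Scrapy (the original snippet was not included); a minimal sketch, assuming the project name `bookstore` used throughout this post:

```shell
# Install Scrapy, then generate a new project skeleton
pip install scrapy
scrapy startproject bookstore
cd bookstore
```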

This generates the standard Scrapy project structure:

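For reference, this is the layout `scrapy startproject` creates:

```text
bookstore/
    scrapy.cfg            # deploy configuration
    bookstore/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```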

Each file has a purpose:

  • items.py — Data models (like dataclasses for your scraped data)
  • pipelines.py — Post-processing: cleaning, validation, database storage
  • middlewares.py — Request/response hooks: proxy rotation, header injection
  • settings.py — Configuration: delays, concurrency, pipeline order

Step 1: Define Your Data Model

Open bookstore/items.py:


Items enforce structure. Every scraped record must match this schema, which prevents silent data corruption down the line.

Step 2: Build Your Spider

Create bookstore/spiders/books_spider.py:

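A sketch of the spider; the CSS selectors assume books.toscrape.com's markup (`article.product_pod`, `p.price_color`, and so on), so verify them in your browser's inspector before relying on them:

```python
import scrapy

from bookstore.items import BookItem

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # Each book on a listing page is an <article class="product_pod">
        for book in response.css("article.product_pod"):
            item = BookItem()
            item["title"] = book.css("h3 a::attr(title)").get()
            item["price"] = book.css("p.price_color::text").get()
            # The rating is encoded as a class name, e.g. "star-rating Three"
            words = book.css("p.star-rating::attr(class)").get("").split()
            item["rating"] = RATING_WORDS.get(words[-1]) if words else None
            item["availability"] = " ".join(
                book.css("p.availability::text").getall()
            ).strip()
            item["url"] = response.urljoin(book.css("h3 a::attr(href)").get())
            yield item

        # response.follow resolves the relative pagination link for us
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```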

Key Scrapy concepts here:

  • start_urls — Entry points. Scrapy sends GET requests to each.
  • response.css() — CSS selectors (also supports XPath via response.xpath()).
  • yield item — Sends the item through your pipeline.
  • response.follow() — Follows links with automatic URL resolution.

Step 3: Add a Data Pipeline

Pipelines process items after extraction. Open bookstore/pipelines.py:


Enable them in settings.py:

ITEM_PIPELINES = {
    "bookstore.pipelines.CleanPricePipeline": 100,
    "bookstore.pipelines.DuplicateFilterPipeline": 200,
}

The numbers set execution order. Lower runs first.

Step 4: Configure Settings for Production

Edit bookstore/settings.py:

DOWNLOAD_DELAY = 1.5
CONCURRENT_REQUESTS_PER_DOMAIN = 4
RANDOMIZE_DOWNLOAD_DELAY = True

ROBOTSTXT_OBEY = True

RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

USER_AGENT = "Mozilla/5.0 (compatible; BookstoreBot/1.0)"

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

FEEDS = {
    "output/books.json": {
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,
    },
}

AutoThrottle is a killer feature. It dynamically adjusts request speed based on server response times. If the server slows down, Scrapy backs off automatically.

Step 5: Custom Middleware for Proxy Rotation

For production scraping, you'll need proxies. Here's a simple rotation middleware in bookstore/middlewares.py:

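A sketch of `ProxyRotationMiddleware` (the proxy URLs are placeholders; substitute your own pool or load it from settings). Setting `request.meta["proxy"]` is how Scrapy's built-in HttpProxyMiddleware picks up a per-request proxy:

```python
import random


class ProxyRotationMiddleware:
    """Attach a random proxy from a static pool to each outgoing request."""

    # Placeholder endpoints; replace with real proxies or read from settings
    PROXIES = [
        "http://proxy1.example.com:8000",
        "http://proxy2.example.com:8000",
        "http://proxy3.example.com:8000",
    ]

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.PROXIES)
        # Returning None tells Scrapy to continue processing the request
        return None
```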

Enable it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    "bookstore.middlewares.ProxyRotationMiddleware": 350,
}

Need reliable proxies? ThorData offers residential proxies with high success rates for scraping workloads, and ScrapeOps provides a proxy aggregator that automatically routes through the best provider for each domain.

Step 6: Run Your Scraper

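Running the spider is a single command; `books` is the spider's `name` attribute, the same identifier the Scrapyd example uses later:

```shell
scrapy crawl books

# Or override the feed destination from the command line
# (-O overwrites the file, -o appends):
scrapy crawl books -O output/books.json
```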

When the crawl finishes, Scrapy logs a stats summary. The `item_scraped_count` entry tells you how many items were extracted: for books.toscrape.com that should be 1,000 (50 pages of 20 books).

Running Scrapy in Production

For recurring scrapes, you have several options:

Option 1: Cron job

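A crontab entry for a nightly run might look like this (the paths are illustrative; use absolute paths, since cron runs with a minimal environment):

```shell
# Run the bookstore spider every night at 03:00, appending logs
0 3 * * * cd /path/to/bookstore && /usr/local/bin/scrapy crawl books >> /var/log/bookstore-scrape.log 2>&1
```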

Option 2: CrawlerProcess for programmatic execution

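A sketch using `CrawlerProcess`, which runs the crawl from a regular Python script (e.g. a `run.py` at the project root, so `get_project_settings()` can locate `scrapy.cfg`):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from bookstore.spiders.books_spider import BooksSpider

# Load settings.py so pipelines, throttling, and feeds all apply
process = CrawlerProcess(get_project_settings())
process.crawl(BooksSpider)
process.start()  # blocks here until the crawl is finished
```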

Option 3: Scrapyd — a daemon with a JSON API for scheduling:

pip install scrapyd
scrapyd
curl http://localhost:6800/schedule.json -d project=bookstore -d spider=books

Handling Common Challenges

JavaScript-Rendered Pages

Scrapy alone can't execute JavaScript. For SPAs, add scrapy-playwright:

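After `pip install scrapy-playwright` and `playwright install chromium`, you wire it in through settings and opt in per request. A sketch (the handler and reactor paths are scrapy-playwright's documented values):

```python
# settings.py: route HTTP(S) downloads through Playwright's browser
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright requires Twisted's asyncio reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In a spider, request pages with JavaScript rendering enabled:
# yield scrapy.Request(url, meta={"playwright": True})
```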

Rate Limiting and Bans

Signs you're being blocked: 403 responses, CAPTCHAs, empty pages.

Solutions:

  1. Increase DOWNLOAD_DELAY
  2. Rotate User-Agents and proxies
  3. Use ScrapeOps Monitoring to track success rates across domains
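For point 2, the same downloader-middleware hook used for proxy rotation works for User-Agents; a minimal sketch with an illustrative UA pool:

```python
import random


class RandomUserAgentMiddleware:
    """Pick a random User-Agent header for each request (pool is illustrative)."""

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
```

Enable it in `DOWNLOADER_MIDDLEWARES` alongside the proxy middleware.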

Storing Data in a Database

Add a database pipeline:

import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("books.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS books"
            " (title TEXT, price REAL, rating INT,"
            " availability TEXT, url TEXT UNIQUE)"
        )

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT OR IGNORE INTO books VALUES (?, ?, ?, ?, ?)",
            (item["title"], item["price"], item["rating"],
             item["availability"], item["url"])
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
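Remember to register it in `settings.py` after the cleaning and dedupe pipelines, so it only stores already-cleaned items:

```python
ITEM_PIPELINES = {
    "bookstore.pipelines.CleanPricePipeline": 100,
    "bookstore.pipelines.DuplicateFilterPipeline": 200,
    "bookstore.pipelines.SQLitePipeline": 300,
}
```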

Architecture Recap

Request -> Downloader Middlewares -> Download -> Response
    |
  Spider (parse) -> Items -> Item Pipelines -> Storage
    |
  New Requests -> Scheduler -> Loop

Every component is pluggable. You can swap the scheduler, add custom middlewares, chain pipelines. Scrapy is designed for extension.

What's Next

You now have a production-grade Scrapy scraper with:

  • Structured data models
  • Cleaning and deduplication pipelines
  • Rate limiting and auto-throttle
  • Proxy rotation middleware
  • Multiple output formats

From here, explore:

  • Scrapy Cloud (Zyte) for hosted deployments
  • scrapy-splash or scrapy-playwright for JavaScript rendering
  • Item Loaders for complex field processing
  • ThorData proxies for high-volume production scraping

The full code for this tutorial is based on books.toscrape.com — a safe sandbox for scraping practice. Start there, then adapt the patterns to your real-world targets.


Building scrapers that run reliably in production? ScrapeOps provides monitoring dashboards that track success rates, response times, and proxy performance across all your spiders.
