Web scraping is one of the most valuable skills in a Python developer's toolkit. Whether you're building a price tracker, aggregating job listings, or feeding data into an ML pipeline, you need a reliable scraping framework. Scrapy is that framework.
In this tutorial, you'll build a complete, production-ready web scraper from scratch using Scrapy. By the end, you'll understand spiders, pipelines, middlewares, and how to deploy your scraper for recurring jobs.
## Why Scrapy Over BeautifulSoup + Requests?
Before we dive in, let's settle a common question.
| Feature | BeautifulSoup + Requests | Scrapy |
|---|---|---|
| Learning curve | Low | Medium |
| Async requests | No (manual threading) | Built-in |
| Rate limiting | Manual | Built-in (DOWNLOAD_DELAY) |
| Data pipelines | Manual | Built-in pipeline system |
| Middleware support | None | Full middleware stack |
| Retry logic | Manual | Built-in |
| Export formats | Manual | JSON, CSV, XML built-in |
| Best for | Quick one-off scripts | Production scrapers |
**Rule of thumb:** Use BeautifulSoup for quick scripts under 100 lines. Use Scrapy for anything that runs on a schedule, needs to handle errors gracefully, or scrapes more than a few hundred pages.
## Setting Up Your Scrapy Project

```bash
pip install scrapy
scrapy startproject bookstore
cd bookstore
```
This generates the standard Scrapy project structure:
```text
bookstore/
├── scrapy.cfg
└── bookstore/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```
Each file has a purpose:
- `items.py` — Data models (like dataclasses for your scraped data)
- `pipelines.py` — Post-processing: cleaning, validation, database storage
- `middlewares.py` — Request/response hooks: proxy rotation, header injection
- `settings.py` — Configuration: delays, concurrency, pipeline order
## Step 1: Define Your Data Model

Open `bookstore/items.py`:

```python
import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    availability = scrapy.Field()
    url = scrapy.Field()
```
Items enforce structure: assigning a field that isn't declared here raises a `KeyError`, which catches typos and schema drift early instead of letting bad records corrupt your data downstream.
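If you want a feel for that enforcement without spinning up a project, it behaves roughly like this minimal dict subclass (a simplified stand-in for illustration, not Scrapy's actual implementation):

```python
class StrictItem(dict):
    """Simplified stand-in for scrapy.Item: only declared fields may be set."""
    fields = {"title", "price", "rating", "availability", "url"}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)


item = StrictItem()
item["title"] = "A Light in the Attic"  # fine: declared field
try:
    item["isbn"] = "1234567890"          # not declared, raises KeyError
except KeyError as exc:
    print(f"Rejected: {exc}")
```

The real `scrapy.Item` adds serialization hooks and field metadata on top, but the "undeclared field means loud failure" behavior is the part that protects your pipeline.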
## Step 2: Build Your Spider

Create `bookstore/spiders/books_spider.py`:

```python
import scrapy

from bookstore.items import BookItem


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/catalogue/page-1.html"]

    rating_map = {
        "One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5
    }

    def parse(self, response):
        for book in response.css("article.product_pod"):
            item = BookItem()
            item["title"] = book.css("h3 a::attr(title)").get()
            item["price"] = book.css(".price_color::text").get()
            item["rating"] = self.rating_map.get(
                book.css(".star-rating::attr(class)").re_first(
                    r"star-rating (\w+)"
                ),
                0,
            )
            item["availability"] = book.css(
                ".availability::text"
            ).getall()[-1].strip()
            item["url"] = response.urljoin(
                book.css("h3 a::attr(href)").get()
            )
            yield item

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```
Key Scrapy concepts here:
- `start_urls` — Entry points. Scrapy sends a GET request to each.
- `response.css()` — CSS selectors (XPath is also supported via `response.xpath()`).
- `yield item` — Sends the item through your pipelines.
- `response.follow()` — Follows links with automatic relative-URL resolution.
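The rating extraction above leans on `re_first`. Stripped of the Scrapy plumbing, that step is just a regex over the element's `class` attribute, which you can try with the standard library alone (the sample strings mimic what books.toscrape.com serves):

```python
import re

rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_rating(class_attr):
    """Extract the word after 'star-rating' and map it to an int, 0 if absent."""
    match = re.search(r"star-rating (\w+)", class_attr)
    return rating_map.get(match.group(1), 0) if match else 0


print(parse_rating("star-rating Three"))  # 3
print(parse_rating("product_pod"))        # 0: no star-rating class at all
```

Defaulting to `0` on a failed match means a markup change degrades your data visibly (a cluster of zero ratings) rather than crashing the crawl mid-run.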
## Step 3: Add a Data Pipeline
Pipelines process items after extraction. Open `bookstore/pipelines.py`:

```python
from scrapy.exceptions import DropItem


class CleanPricePipeline:
    def process_item(self, item, spider):
        if item.get("price"):
            raw = item["price"].replace("£", "").strip()
            item["price"] = float(raw)
        return item


class DuplicateFilterPipeline:
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        if item["url"] in self.seen_urls:
            raise DropItem(f"Duplicate: {item['url']}")
        self.seen_urls.add(item["url"])
        return item
```
Enable them in `settings.py`:

```python
ITEM_PIPELINES = {
    "bookstore.pipelines.CleanPricePipeline": 100,
    "bookstore.pipelines.DuplicateFilterPipeline": 200,
}
```
The numbers set execution order. Lower runs first.
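You can see the ordering in action without running a crawl by chaining the same logic over plain dicts (a standalone sketch: the real pipelines receive `Item` objects and a `spider` argument, and `DropItem` from `scrapy.exceptions` replaces the `ValueError` used here):

```python
def clean_price(item):
    # Mirrors CleanPricePipeline: strip the currency sign, cast to float.
    if item.get("price"):
        item["price"] = float(item["price"].replace("£", "").strip())
    return item


seen_urls = set()


def drop_duplicates(item):
    # Mirrors DuplicateFilterPipeline, with ValueError standing in for DropItem.
    if item["url"] in seen_urls:
        raise ValueError(f"Duplicate: {item['url']}")
    seen_urls.add(item["url"])
    return item


results = []
for raw in [
    {"price": "£51.77", "url": "/book-1"},
    {"price": "£13.99", "url": "/book-2"},
    {"price": "£51.77", "url": "/book-1"},  # duplicate, gets dropped
]:
    try:
        # Lower pipeline number runs first: clean, then deduplicate.
        results.append(drop_duplicates(clean_price(raw)))
    except ValueError:
        pass

print(len(results))  # 2: the duplicate never reached the results list
```

Order matters here: if deduplication ran at 100 and cleaning at 200, dropped items would still have paid the (trivial, in this case) cleaning cost first.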
## Step 4: Configure Settings for Production

Edit `bookstore/settings.py`:

```python
DOWNLOAD_DELAY = 1.5
CONCURRENT_REQUESTS_PER_DOMAIN = 4
RANDOMIZE_DOWNLOAD_DELAY = True
ROBOTSTXT_OBEY = True

RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

USER_AGENT = "Mozilla/5.0 (compatible; BookstoreBot/1.0)"

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

FEEDS = {
    "output/books.json": {
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,
    },
}
```
AutoThrottle is a killer feature. It dynamically adjusts request speed based on server response times. If the server slows down, Scrapy backs off automatically.
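Scrapy's documented throttling algorithm is roughly: the target delay is the observed latency divided by `AUTOTHROTTLE_TARGET_CONCURRENCY`, and the next delay is the average of the current delay and that target, clamped between `DOWNLOAD_DELAY` and `AUTOTHROTTLE_MAX_DELAY`. A simplified sketch of that adjustment (not Scrapy's actual code, which also ignores latency decreases from error responses):

```python
def next_delay(current_delay, latency, target_concurrency=2.0,
               min_delay=1.5, max_delay=10.0):
    """Simplified version of the AutoThrottle delay adjustment."""
    target = latency / target_concurrency
    new = (current_delay + target) / 2  # averaging smooths out latency spikes
    return min(max(new, min_delay), max_delay)


# A fast server lets the delay drift down toward min_delay...
print(next_delay(current_delay=3.0, latency=0.4))
# ...while a slow one pushes it back up.
print(next_delay(current_delay=3.0, latency=12.0))
```

Because each step averages rather than jumps, a single slow response nudges the crawl rate instead of halving it outright.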
## Step 5: Custom Middleware for Proxy Rotation

For production scraping, you'll often need proxies. Here's a simple rotation middleware in `bookstore/middlewares.py`:

```python
import random


class ProxyRotationMiddleware:
    def __init__(self):
        self.proxies = [
            "http://proxy1:8080",
            "http://proxy2:8080",
            "http://proxy3:8080",
        ]

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)
```
Enable it in `settings.py`:

```python
DOWNLOADER_MIDDLEWARES = {
    "bookstore.middlewares.ProxyRotationMiddleware": 350,
}
```
Need reliable proxies? ThorData offers residential proxies with high success rates for scraping workloads, and ScrapeOps provides a proxy aggregator that automatically routes through the best provider for each domain.
## Step 6: Run Your Scraper

```bash
scrapy crawl books                 # basic run
scrapy crawl books -O books.json   # export items, overwriting the file
scrapy crawl books -L INFO         # set the log level
```
You should see output like:
```text
2026-03-09 10:00:01 [scrapy.core.engine] INFO: Spider opened
2026-03-09 10:00:03 [scrapy.core.scraper] DEBUG: Scraped from <200 ...>
{'title': 'A Light in the Attic', 'price': 51.77, 'rating': 3, ...}
...
2026-03-09 10:00:45 [scrapy.core.engine] INFO: Closing spider (finished)
'item_scraped_count': 1000,
```
## Running Scrapy in Production
For recurring scrapes, you have several options:
**Option 1: Cron job**

```bash
0 2 * * * cd /path/to/bookstore && scrapy crawl books >> /var/log/scraper.log 2>&1
```
**Option 2: `CrawlerProcess` for programmatic execution**

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("books")
process.start()
```
**Option 3: Scrapyd** — a daemon with a JSON API for scheduling:

```bash
pip install scrapyd
scrapyd
curl http://localhost:6800/schedule.json -d project=bookstore -d spider=books
```
## Handling Common Challenges
### JavaScript-Rendered Pages
Scrapy alone can't execute JavaScript. For SPAs, add scrapy-playwright (the second command downloads the browser binaries Playwright drives):

```bash
pip install scrapy-playwright
playwright install
```

Then route requests through Playwright. Note that `playwright_page_methods` expects `PageMethod` objects, not plain dicts:

```python
from scrapy_playwright.page import PageMethod

yield scrapy.Request(
    url,
    meta={
        "playwright": True,
        "playwright_page_methods": [
            PageMethod("wait_for_selector", ".product-list"),
        ],
    },
)
```
### Rate Limiting and Bans
Signs you're being blocked: 403 responses, CAPTCHAs, empty pages.
Solutions:
- Increase `DOWNLOAD_DELAY`
- Rotate User-Agents and proxies
- Use ScrapeOps Monitoring to track success rates across domains
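User-Agent rotation can be as simple as a downloader middleware that picks a header per request, mirroring the proxy middleware from Step 5. A sketch (the UA strings below are illustrative placeholders; in production you'd use a maintained list of real browser strings):

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBot/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBot/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBot/1.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: assign a random User-Agent per request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)


# Outside Scrapy, the selection logic is just:
ua = random.choice(USER_AGENTS)
print(ua)
```

Enable it in `DOWNLOADER_MIDDLEWARES` like the proxy middleware; if you do, also remove the static `USER_AGENT` setting so the two don't conflict.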
### Storing Data in a Database
Add a database pipeline:
```python
import sqlite3


class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("books.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS books"
            " (title TEXT, price REAL, rating INT,"
            " availability TEXT, url TEXT UNIQUE)"
        )

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT OR IGNORE INTO books VALUES (?, ?, ?, ?, ?)",
            (item["title"], item["price"], item["rating"],
             item["availability"], item["url"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```
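The `UNIQUE` constraint on `url` plus `INSERT OR IGNORE` gives you database-level deduplication on top of the in-memory pipeline filter. You can verify the behavior with an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS books"
    " (title TEXT, price REAL, rating INT,"
    " availability TEXT, url TEXT UNIQUE)"
)

row = ("A Light in the Attic", 51.77, 3, "In stock", "/book-1")
for _ in range(2):  # second insert hits the UNIQUE constraint and is skipped
    conn.execute("INSERT OR IGNORE INTO books VALUES (?, ?, ?, ?, ?)", row)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
print(count)  # 1: the duplicate insert was silently ignored
conn.close()
```

This matters for recurring crawls: re-running the spider against the same pages won't pile up duplicate rows, even across separate processes where the in-memory `seen_urls` set is empty.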
## Architecture Recap
```text
Request -> Downloader Middlewares -> Download -> Response
                                                     |
                                                     v
                  Spider (parse) -> Items -> Item Pipelines -> Storage
                        |
                        v
                  New Requests -> Scheduler -> (back to Request)
```
Every component is pluggable: you can swap the scheduler, add custom middlewares, or chain pipelines. Scrapy is designed for extension.
## What's Next
You now have a production-grade Scrapy scraper with:
- Structured data models
- Cleaning and deduplication pipelines
- Rate limiting and auto-throttle
- Proxy rotation middleware
- Multiple output formats
From here, explore:
- Scrapy Cloud (Zyte) for hosted deployments
- scrapy-splash or scrapy-playwright for JavaScript rendering
- Item Loaders for complex field processing
- ThorData proxies for high-volume production scraping
The full code for this tutorial is based on books.toscrape.com — a safe sandbox for scraping practice. Start there, then adapt the patterns to your real-world targets.
Building scrapers that run reliably in production? ScrapeOps provides monitoring dashboards that track success rates, response times, and proxy performance across all your spiders.