Web scraping is one of the most valuable skills in a Python developer's toolkit. Whether you're building a price tracker, aggregating job listings, or feeding data into an ML pipeline, you need a reliable scraping framework. Scrapy is that framework.
In this tutorial, you'll build a complete, production-ready web scraper from scratch using Scrapy. By the end, you'll understand spiders, pipelines, middlewares, and how to deploy your scraper for recurring jobs.
## Why Scrapy Over BeautifulSoup + Requests?
Before we dive in, let's settle a common question.
| Feature | BeautifulSoup + Requests | Scrapy |
|---|---|---|
| Learning curve | Low | Medium |
| Async requests | No (manual threading) | Built-in |
| Rate limiting | Manual | Built-in (DOWNLOAD_DELAY) |
| Data pipelines | Manual | Built-in pipeline system |
| Middleware support | None | Full middleware stack |
| Retry logic | Manual | Built-in |
| Export formats | Manual | JSON, CSV, XML built-in |
| Best for | Quick one-off scripts | Production scrapers |
**Rule of thumb:** Use BeautifulSoup for quick scripts under 100 lines. Use Scrapy for anything that runs on a schedule, needs to handle errors gracefully, or scrapes more than a few hundred pages.
## Setting Up Your Scrapy Project

```bash
pip install scrapy
scrapy startproject bookstore
cd bookstore
```
This generates the standard Scrapy project structure:
```text
bookstore/
├── scrapy.cfg
└── bookstore/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```
Each file has a purpose:
- `items.py` — Data models (like dataclasses for your scraped data)
- `pipelines.py` — Post-processing: cleaning, validation, database storage
- `middlewares.py` — Request/response hooks: proxy rotation, header injection
- `settings.py` — Configuration: delays, concurrency, pipeline order
## Step 1: Define Your Data Model

Open `bookstore/items.py`:

```python
import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    availability = scrapy.Field()
    url = scrapy.Field()
```
Items enforce structure: assigning a field that isn't declared here raises a `KeyError`, which catches typos and schema drift early instead of letting bad records corrupt your data downstream.
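If you want a feel for that enforcement without spinning up a project, it behaves roughly like this minimal dict subclass (a simplified stand-in for illustration, not Scrapy's actual implementation):

```python
class StrictItem(dict):
    """Simplified stand-in for scrapy.Item: only declared fields may be set."""
    fields = {"title", "price", "rating", "availability", "url"}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)


item = StrictItem()
item["title"] = "A Light in the Attic"  # fine: declared field
try:
    item["isbn"] = "1234567890"          # not declared, raises KeyError
except KeyError as exc:
    print(f"Rejected: {exc}")
```

The real `scrapy.Item` adds serialization hooks and field metadata on top, but the "undeclared field means loud failure" behavior is the part that protects your pipeline.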
## Step 2: Build Your Spider

Create `bookstore/spiders/books_spider.py`:

```python
import scrapy

from bookstore.items import BookItem


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/catalogue/page-1.html"]

    rating_map = {
        "One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5
    }

    def parse(self, response):
        for book in response.css("article.product_pod"):
            item = BookItem()
            item["title"] = book.css("h3 a::attr(title)").get()
            item["price"] = book.css(".price_color::text").get()
            item["rating"] = self.rating_map.get(
                book.css(".star-rating::attr(class)").re_first(
                    r"star-rating (\w+)"
                ),
                0,
            )
            item["availability"] = book.css(
                ".availability::text"
            ).getall()[-1].strip()
            item["url"] = response.urljoin(
                book.css("h3 a::attr(href)").get()
            )
            yield item

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```
Key Scrapy concepts here:
- `start_urls` — Entry points. Scrapy sends a GET request to each.
- `response.css()` — CSS selectors (XPath is also supported via `response.xpath()`).
- `yield item` — Sends the item through your pipelines.
- `response.follow()` — Follows links with automatic relative-URL resolution.
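The rating extraction above leans on `re_first`. Stripped of the Scrapy plumbing, that step is just a regex over the element's `class` attribute, which you can try with the standard library alone (the sample strings mimic what books.toscrape.com serves):

```python
import re

rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_rating(class_attr):
    """Extract the word after 'star-rating' and map it to an int, 0 if absent."""
    match = re.search(r"star-rating (\w+)", class_attr)
    return rating_map.get(match.group(1), 0) if match else 0


print(parse_rating("star-rating Three"))  # 3
print(parse_rating("product_pod"))        # 0: no star-rating class at all
```

Defaulting to `0` on a failed match means a markup change degrades your data visibly (a cluster of zero ratings) rather than crashing the crawl mid-run.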
## Step 3: Add a Data Pipeline
Pipelines process items after extraction. Open `bookstore/pipelines.py`:

```python
from scrapy.exceptions import DropItem


class CleanPricePipeline:
    def process_item(self, item, spider):
        if item.get("price"):
            raw = item["price"].replace("£", "").strip()
            item["price"] = float(raw)
        return item


class DuplicateFilterPipeline:
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        if item["url"] in self.seen_urls:
            raise DropItem(f"Duplicate: {item['url']}")
        self.seen_urls.add(item["url"])
        return item
```
Enable them in `settings.py`:

```python
ITEM_PIPELINES = {
    "bookstore.pipelines.CleanPricePipeline": 100,
    "bookstore.pipelines.DuplicateFilterPipeline": 200,
}
```
The numbers set execution order. Lower runs first.
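You can see the ordering in action without running a crawl by chaining the same logic over plain dicts (a standalone sketch: the real pipelines receive `Item` objects and a `spider` argument, and `DropItem` from `scrapy.exceptions` replaces the `ValueError` used here):

```python
def clean_price(item):
    # Mirrors CleanPricePipeline: strip the currency sign, cast to float.
    if item.get("price"):
        item["price"] = float(item["price"].replace("£", "").strip())
    return item


seen_urls = set()


def drop_duplicates(item):
    # Mirrors DuplicateFilterPipeline, with ValueError standing in for DropItem.
    if item["url"] in seen_urls:
        raise ValueError(f"Duplicate: {item['url']}")
    seen_urls.add(item["url"])
    return item


results = []
for raw in [
    {"price": "£51.77", "url": "/book-1"},
    {"price": "£13.99", "url": "/book-2"},
    {"price": "£51.77", "url": "/book-1"},  # duplicate, gets dropped
]:
    try:
        # Lower pipeline number runs first: clean, then deduplicate.
        results.append(drop_duplicates(clean_price(raw)))
    except ValueError:
        pass

print(len(results))  # 2: the duplicate never reached the results list
```

Order matters here: if deduplication ran at 100 and cleaning at 200, dropped items would still have paid the (trivial, in this case) cleaning cost first.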
## Step 4: Configure Settings for Production

Edit `bookstore/settings.py`:

```python
DOWNLOAD_DELAY = 1.5
CONCURRENT_REQUESTS_PER_DOMAIN = 4
RANDOMIZE_DOWNLOAD_DELAY = True
ROBOTSTXT_OBEY = True

RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

USER_AGENT = "Mozilla/5.0 (compatible; BookstoreBot/1.0)"

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

FEEDS = {
    "output/books.json": {
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,
    },
}
```
AutoThrottle is a killer feature. It dynamically adjusts request speed based on server response times. If the server slows down, Scrapy backs off automatically.
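Scrapy's documented throttling algorithm is roughly: the target delay is the observed latency divided by `AUTOTHROTTLE_TARGET_CONCURRENCY`, and the next delay is the average of the current delay and that target, clamped between `DOWNLOAD_DELAY` and `AUTOTHROTTLE_MAX_DELAY`. A simplified sketch of that adjustment (not Scrapy's actual code, which also ignores latency decreases from error responses):

```python
def next_delay(current_delay, latency, target_concurrency=2.0,
               min_delay=1.5, max_delay=10.0):
    """Simplified version of the AutoThrottle delay adjustment."""
    target = latency / target_concurrency
    new = (current_delay + target) / 2  # averaging smooths out latency spikes
    return min(max(new, min_delay), max_delay)


# A fast server lets the delay drift down toward min_delay...
print(next_delay(current_delay=3.0, latency=0.4))
# ...while a slow one pushes it back up.
print(next_delay(current_delay=3.0, latency=12.0))
```

Because each step averages rather than jumps, a single slow response nudges the crawl rate instead of halving it outright.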
## Step 5: Custom Middleware for Proxy Rotation

For production scraping, you'll often need proxies. Here's a simple rotation middleware in `bookstore/middlewares.py`:

```python
import random


class ProxyRotationMiddleware:
    def __init__(self):
        self.proxies = [
            "http://proxy1:8080",
            "http://proxy2:8080",
            "http://proxy3:8080",
        ]

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)
```
Enable it in `settings.py`:

```python
DOWNLOADER_MIDDLEWARES = {
    "bookstore.middlewares.ProxyRotationMiddleware": 350,
}
```
Need reliable proxies? ThorData offers residential proxies with high success rates for scraping workloads, and ScrapeOps provides a proxy aggregator that automatically routes through the best provider for each domain.
## Step 6: Run Your Scraper

```bash
scrapy crawl books                 # basic run
scrapy crawl books -O books.json   # export items, overwriting the file
scrapy crawl books -L INFO         # set the log level
```
You should see output like:
```text
2026-03-09 10:00:01 [scrapy.core.engine] INFO: Spider opened
2026-03-09 10:00:03 [scrapy.core.scraper] DEBUG: Scraped from <200 ...>
{'title': 'A Light in the Attic', 'price': 51.77, 'rating': 3, ...}
...
2026-03-09 10:00:45 [scrapy.core.engine] INFO: Closing spider (finished)
'item_scraped_count': 1000,
```
## Running Scrapy in Production
For recurring scrapes, you have several options:
**Option 1: Cron job**

```bash
0 2 * * * cd /path/to/bookstore && scrapy crawl books >> /var/log/scraper.log 2>&1
```
**Option 2: `CrawlerProcess` for programmatic execution**

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("books")
process.start()
```
**Option 3: Scrapyd** — a daemon with a JSON API for scheduling:

```bash
pip install scrapyd
scrapyd
curl http://localhost:6800/schedule.json -d project=bookstore -d spider=books
```
## Handling Common Challenges
### JavaScript-Rendered Pages
Scrapy alone can't execute JavaScript. For SPAs, add scrapy-playwright (the second command downloads the browser binaries Playwright drives):

```bash
pip install scrapy-playwright
playwright install
```

Then route requests through Playwright. Note that `playwright_page_methods` expects `PageMethod` objects, not plain dicts:

```python
from scrapy_playwright.page import PageMethod

yield scrapy.Request(
    url,
    meta={
        "playwright": True,
        "playwright_page_methods": [
            PageMethod("wait_for_selector", ".product-list"),
        ],
    },
)
```
### Rate Limiting and Bans
Signs you're being blocked: 403 responses, CAPTCHAs, empty pages.
Solutions:
- Increase `DOWNLOAD_DELAY`
- Rotate User-Agents and proxies
- Use ScrapeOps Monitoring to track success rates across domains
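User-Agent rotation can be as simple as a downloader middleware that picks a header per request, mirroring the proxy middleware from Step 5. A sketch (the UA strings below are illustrative placeholders; in production you'd use a maintained list of real browser strings):

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBot/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBot/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBot/1.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: assign a random User-Agent per request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)


# Outside Scrapy, the selection logic is just:
ua = random.choice(USER_AGENTS)
print(ua)
```

Enable it in `DOWNLOADER_MIDDLEWARES` like the proxy middleware; if you do, also remove the static `USER_AGENT` setting so the two don't conflict.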
### Storing Data in a Database
Add a database pipeline:
```python
import sqlite3


class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("books.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS books"
            " (title TEXT, price REAL, rating INT,"
            " availability TEXT, url TEXT UNIQUE)"
        )

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT OR IGNORE INTO books VALUES (?, ?, ?, ?, ?)",
            (item["title"], item["price"], item["rating"],
             item["availability"], item["url"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```
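The `UNIQUE` constraint on `url` plus `INSERT OR IGNORE` gives you database-level deduplication on top of the in-memory pipeline filter. You can verify the behavior with an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS books"
    " (title TEXT, price REAL, rating INT,"
    " availability TEXT, url TEXT UNIQUE)"
)

row = ("A Light in the Attic", 51.77, 3, "In stock", "/book-1")
for _ in range(2):  # second insert hits the UNIQUE constraint and is skipped
    conn.execute("INSERT OR IGNORE INTO books VALUES (?, ?, ?, ?, ?)", row)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
print(count)  # 1: the duplicate insert was silently ignored
conn.close()
```

This matters for recurring crawls: re-running the spider against the same pages won't pile up duplicate rows, even across separate processes where the in-memory `seen_urls` set is empty.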
## Architecture Recap
```text
Request -> Downloader Middlewares -> Download -> Response
                                                     |
                                                     v
                  Spider (parse) -> Items -> Item Pipelines -> Storage
                        |
                        v
                  New Requests -> Scheduler -> (back to Request)
```
Every component is pluggable: you can swap the scheduler, add custom middlewares, or chain pipelines. Scrapy is designed for extension.
## What's Next
You now have a production-grade Scrapy scraper with:
- Structured data models
- Cleaning and deduplication pipelines
- Rate limiting and auto-throttle
- Proxy rotation middleware
- Multiple output formats
From here, explore:
- Scrapy Cloud (Zyte) for hosted deployments
- scrapy-splash or scrapy-playwright for JavaScript rendering
- Item Loaders for complex field processing
- ThorData proxies for high-volume production scraping
The full code for this tutorial is based on books.toscrape.com — a safe sandbox for scraping practice. Start there, then adapt the patterns to your real-world targets.
Building scrapers that run reliably in production? ScrapeOps provides monitoring dashboards that track success rates, response times, and proxy performance across all your spiders.