Web scraping is one of the most valuable skills in a Python developer's toolkit. Whether you're building a price tracker, aggregating job listings, or feeding data into an ML pipeline, you need a reliable scraping framework. Scrapy is that framework.
In this tutorial, you'll build a complete, production-ready web scraper from scratch using Scrapy. By the end, you'll understand spiders, pipelines, middlewares, and how to deploy your scraper for recurring jobs.
Why Scrapy Over BeautifulSoup + Requests?
Before we dive in, let's settle a common question.
| Feature | BeautifulSoup + Requests | Scrapy |
|---|---|---|
| Learning curve | Low | Medium |
| Async requests | No (manual threading) | Built-in |
| Rate limiting | Manual | Built-in (DOWNLOAD_DELAY) |
| Data pipelines | Manual | Built-in pipeline system |
| Middleware support | None | Full middleware stack |
| Retry logic | Manual | Built-in |
| Export formats | Manual | JSON, CSV, XML built-in |
| Best for | Quick one-off scripts | Production scrapers |
Rule of thumb: Use BeautifulSoup for quick scripts under 100 lines. Use Scrapy for anything that runs on a schedule, needs to handle errors gracefully, or scrapes more than a few hundred pages.
Setting Up Your Scrapy Project
pip install scrapy
scrapy startproject bookstore
This generates the standard Scrapy project structure:
bookstore/
    scrapy.cfg
    bookstore/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
Each file has a purpose:
- items.py — Data models (like dataclasses for your scraped data)
- pipelines.py — Post-processing: cleaning, validation, database storage
- middlewares.py — Request/response hooks: proxy rotation, header injection
- settings.py — Configuration: delays, concurrency, pipeline order
Step 1: Define Your Data Model
Open bookstore/items.py:
Items enforce structure. Every scraped record must match this schema, which prevents silent data corruption down the line.
Step 2: Build Your Spider
Create bookstore/spiders/books_spider.py:
Key Scrapy concepts here:
- start_urls — Entry points. Scrapy sends GET requests to each.
- response.css() — CSS selectors (also supports XPath via response.xpath()).
- yield item — Sends the item through your pipeline.
- response.follow() — Follows links with automatic URL resolution.
Step 3: Add a Data Pipeline
Pipelines process items after extraction. Open bookstore/pipelines.py:
Enable them in settings.py:
ITEM_PIPELINES = {
"bookstore.pipelines.CleanPricePipeline": 100,
"bookstore.pipelines.DuplicateFilterPipeline": 200,
}
The numbers set execution order. Lower runs first.
Step 4: Configure Settings for Production
Edit bookstore/settings.py:
DOWNLOAD_DELAY = 1.5
CONCURRENT_REQUESTS_PER_DOMAIN = 4
RANDOMIZE_DOWNLOAD_DELAY = True
ROBOTSTXT_OBEY = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
USER_AGENT = "Mozilla/5.0 (compatible; BookstoreBot/1.0)"
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
FEEDS = {
"output/books.json": {
"format": "json",
"encoding": "utf8",
"overwrite": True,
},
}
AutoThrottle is a killer feature. It dynamically adjusts request speed based on server response times. If the server slows down, Scrapy backs off automatically.
Step 5: Custom Middleware for Proxy Rotation
For production scraping, you'll need proxies. Here's a simple rotation middleware in bookstore/middlewares.py:
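A minimal sketch of such a middleware (the proxy URLs are placeholders; substitute your provider's endpoints):

```python
import random

class ProxyRotationMiddleware:
    """Assign a random proxy from a fixed pool to each outgoing request."""
    PROXIES = [
        "http://proxy1.example.com:8000",  # placeholder
        "http://proxy2.example.com:8000",  # placeholder
    ]

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.PROXIES)
        return None  # let Scrapy continue processing the request normally
```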
Enable it in settings.py:
DOWNLOADER_MIDDLEWARES = {
"bookstore.middlewares.ProxyRotationMiddleware": 350,
}
Need reliable proxies? ThorData offers residential proxies with high success rates for scraping workloads, and ScrapeOps provides a proxy aggregator that automatically routes through the best provider for each domain.
Step 6: Run Your Scraper
scrapy crawl books
You should see log output ending with Scrapy's stats dump, along these lines (abbreviated):
[scrapy.core.engine] INFO: Spider opened
[scrapy.core.engine] INFO: Closing spider (finished)
[scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished', 'item_scraped_count': 1000, ...}
books.toscrape.com lists exactly 1,000 books, so item_scraped_count should land at 1000.
Running Scrapy in Production
For recurring scrapes, you have several options:
Option 1: Cron job
# crontab entry: run the crawl every day at 06:00 (adjust the path to your project)
0 6 * * * cd /path/to/bookstore && scrapy crawl books
Option 2: CrawlerProcess for programmatic execution
Option 3: Scrapyd, a daemon with a JSON API for scheduling:
pip install scrapyd
scrapyd
curl http://localhost:6800/schedule.json -d project=bookstore -d spider=books
Handling Common Challenges
JavaScript-Rendered Pages
Scrapy alone can't execute JavaScript. For SPAs, add scrapy-playwright:
pip install scrapy-playwright
playwright install chromium
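With the package installed, wire it into the project (the handler and reactor paths below follow scrapy-playwright's documented setup):

```python
# settings.py additions for scrapy-playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

Then, in a spider, mark individual requests for rendering by passing meta={"playwright": True} to scrapy.Request; unmarked requests still use the plain HTTP downloader.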
Rate Limiting and Bans
Signs you're being blocked: 403 responses, CAPTCHAs, empty pages.
Solutions:
- Increase DOWNLOAD_DELAY
- Rotate User-Agents and proxies
- Use ScrapeOps Monitoring to track success rates across domains
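For the second point, a minimal User-Agent rotation middleware can live alongside the proxy middleware (the UA strings here are illustrative; use current real-browser strings in production):

```python
import random

class RandomUserAgentMiddleware:
    """Pick a random User-Agent header for each outgoing request."""
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # continue normal request processing
```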
Storing Data in a Database
Add a database pipeline:
import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("books.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS books"
            " (title TEXT, price REAL, rating INT,"
            " availability TEXT, url TEXT UNIQUE)"
        )

    def process_item(self, item, spider):
        # UNIQUE constraint on url + INSERT OR IGNORE skips duplicates
        self.conn.execute(
            "INSERT OR IGNORE INTO books VALUES (?, ?, ?, ?, ?)",
            (item["title"], item["price"], item["rating"],
             item["availability"], item["url"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
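Like the other pipelines, this one only runs once registered in settings.py (assuming the class lives in bookstore/pipelines.py; the priority 300 is an arbitrary choice that runs it after cleaning and deduplication):

```python
ITEM_PIPELINES = {
    "bookstore.pipelines.CleanPricePipeline": 100,
    "bookstore.pipelines.DuplicateFilterPipeline": 200,
    "bookstore.pipelines.SQLitePipeline": 300,
}
```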
Architecture Recap
Request -> Downloader Middlewares -> Download -> Response
Response -> Spider (parse) -> Items -> Item Pipelines -> Storage
Spider (parse) -> New Requests -> Scheduler -> Loop
Every component is pluggable. You can swap the scheduler, add custom middlewares, chain pipelines. Scrapy is designed for extension.
What's Next
You now have a production-grade Scrapy scraper with:
- Structured data models
- Cleaning and deduplication pipelines
- Rate limiting and auto-throttle
- Proxy rotation middleware
- Multiple output formats
From here, explore:
- Scrapy Cloud (Zyte) for hosted deployments
- scrapy-splash or scrapy-playwright for JavaScript rendering
- Item Loaders for complex field processing
- ThorData proxies for high-volume production scraping
The full code for this tutorial is based on books.toscrape.com — a safe sandbox for scraping practice. Start there, then adapt the patterns to your real-world targets.
Building scrapers that run reliably in production? ScrapeOps provides monitoring dashboards that track success rates, response times, and proxy performance across all your spiders.