When I started with Scrapy, I only used the basic Spider class for everything. I'd manually write pagination logic, manually follow category links, manually handle sitemaps.
Then I discovered CrawlSpider. Suddenly, pagination and link following became automatic. My code got shorter and cleaner.
Later, I found SitemapSpider. For sites with sitemaps, it was even simpler than CrawlSpider.
Each spider type has its purpose. Let me show you when to use which.
The Three Spider Types
Spider (Basic Spider)
- Manual control over everything
- You write all the logic
- Most flexible, most code
CrawlSpider (Rule-Based Spider)
- Automatically follows links based on rules
- Less code, less control
- Perfect for structured sites
SitemapSpider (Sitemap-Based Spider)
- Automatically crawls from sitemap.xml
- Minimal code, minimal control
- Perfect when sitemaps exist
Spider: The Basic One (Full Control)
This is what you've been using. You control everything manually.
When to Use
- You need complete control
- Site structure is complex or unusual
- You're learning Scrapy
- You need custom logic for each page type
Basic Example
import scrapy

class BasicSpider(scrapy.Spider):
    name = 'basic'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Scrape products
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

        # Follow pagination manually
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

        # Follow category links manually
        for category in response.css('.category a::attr(href)').getall():
            yield response.follow(category, callback=self.parse_category)

    def parse_category(self, response):
        # Different handling for category pages
        for product in response.css('.product-list .item'):
            yield response.follow(
                product.css('a::attr(href)').get(),
                callback=self.parse_product
            )

    def parse_product(self, response):
        # Detailed product scraping
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').get()
        }
Characteristics
Pros:
- Complete control over logic
- Can handle any site structure
- Easy to understand
- Easy to debug
Cons:
- More code
- Manual pagination handling
- Manual link following
- Easy to make mistakes
CrawlSpider: The Rule-Based One (Automatic Link Following)
CrawlSpider uses rules to automatically follow links. You define patterns, Scrapy handles the rest.
When to Use
- Site has clear URL patterns
- You want automatic link following
- Pagination is straightforward
- Category/product structure is consistent
Basic Example
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AutoSpider(CrawlSpider):
    name = 'auto'
    start_urls = ['https://example.com/products']

    rules = (
        # Rule 1: Follow category links
        Rule(
            LinkExtractor(allow=r'/category/'),
            follow=True  # Follow but don't scrape
        ),
        # Rule 2: Follow pagination
        Rule(
            LinkExtractor(allow=r'/page/\d+'),
            follow=True
        ),
        # Rule 3: Scrape product pages
        Rule(
            LinkExtractor(allow=r'/product/\d+'),
            callback='parse_product'
        ),
    )

    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get()
        }
What the Docs Don't Tell You
1. Rules are processed in order
First matching rule wins:
rules = (
    # Specific rule first
    Rule(LinkExtractor(allow=r'/product/special/'), callback='parse_special'),
    # General rule second
    Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
)
2. parse() method is reserved
Don't override parse() in CrawlSpider. It's used internally. Use parse_start_url() instead:
class MyCrawlSpider(CrawlSpider):
    # DON'T do this: overriding parse() breaks CrawlSpider's rule handling
    def parse(self, response):
        pass

    # DO this instead
    def parse_start_url(self, response):
        # Handle start_urls differently
        return self.parse_product(response)
3. follow=True vs callback
# Follow links but don't scrape
Rule(LinkExtractor(allow=r'/category/'), follow=True)

# Scrape but don't follow further (the default when a callback is set)
Rule(LinkExtractor(allow=r'/product/'), callback='parse_product')

# Scrape AND follow further links
Rule(
    LinkExtractor(allow=r'/product/'),
    callback='parse_product',
    follow=True  # Explicit, since follow defaults to False once a callback is given
)
Advanced CrawlSpider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AdvancedSpider(CrawlSpider):
    name = 'advanced'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    rules = (
        # Follow categories (multiple levels)
        Rule(
            LinkExtractor(
                allow=r'/category/',
                restrict_css='.navigation'  # Only extract links inside the navigation
            ),
            follow=True,
            process_links='process_category_links'  # Custom link processing
        ),
        # Follow pagination
        Rule(
            LinkExtractor(
                allow=r'/page/\d+',
                restrict_css='.pagination'
            ),
            follow=True
        ),
        # Scrape products
        Rule(
            LinkExtractor(
                allow=r'/product/\d+',
                deny=r'/product/\d+/reviews'  # Exclude review pages
            ),
            callback='parse_product',
            cb_kwargs={'product_type': 'regular'}  # Pass extra data to the callback
        ),
        # Scrape special products differently
        Rule(
            LinkExtractor(allow=r'/special-offer/\d+'),
            callback='parse_special_product',
            cb_kwargs={'product_type': 'special'}
        ),
    )

    def process_category_links(self, links):
        # Custom link processing: modify URLs, filter links, etc.
        for link in links:
            if 'old-category' not in link.url:
                yield link

    def parse_product(self, response, product_type):
        yield {
            'type': product_type,
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get()
        }

    def parse_special_product(self, response, product_type):
        yield {
            'type': product_type,
            'name': response.css('h1::text').get(),
            'price': response.css('.special-price::text').get(),
            'original_price': response.css('.original-price::text').get()
        }
Characteristics
Pros:
- Much less code
- Automatic link following
- Declarative (rules are clear)
- Perfect for structured sites
Cons:
- Less flexible than basic Spider
- Learning curve for rules
- Can't do complex per-page logic easily
- Debugging rules is harder (a quick way to test them in the shell is sketched below)
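One way to take the guesswork out of rules is to test each LinkExtractor on its own in the Scrapy shell before wiring it into a Rule. A minimal sketch, run inside scrapy shell against a page from your own target site (the URL and the /product/ pattern are placeholders):

# Start the shell with: scrapy shell https://example.com/category/electronics
# (placeholder URL; use a real page from the site you're scraping)
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow=r'/product/\d+')
for link in le.extract_links(response):  # `response` is provided by the shell
    print(link.url)

If the printed URLs aren't what you expect, adjust allow, deny, or restrict_css before touching the spider itself.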
SitemapSpider: The Sitemap-Based One (Easiest of All)
If a site has a sitemap.xml, SitemapSpider is the easiest option.
When to Use
- Site has sitemap.xml
- You want to scrape all pages listed
- Site structure doesn't matter
- Fastest way to crawl large sites
Basic Example
from scrapy.spiders import SitemapSpider

class MySitemapSpider(SitemapSpider):
    name = 'sitemap'
    sitemap_urls = ['https://example.com/sitemap.xml']

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('h1::text').get()
        }
That's it! Scrapy:
- Downloads sitemap.xml
- Extracts all URLs
- Scrapes each one
- Calls parse() for each page
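If you'd rather run it from a plain script than the scrapy crawl command, CrawlerProcess does the same job. A minimal sketch, assuming the MySitemapSpider class above is in scope; the FEEDS setting and file name are just one way to write the items out:

from scrapy.crawler import CrawlerProcess

# Export scraped items to a JSON file (file name is an example)
process = CrawlerProcess(settings={
    'FEEDS': {'pages.json': {'format': 'json'}},
})
process.crawl(MySitemapSpider)
process.start()  # Blocks until the crawl finishes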
Multiple Sitemaps
class MultipleSitemapSpider(SitemapSpider):
    name = 'multiple'
    sitemap_urls = [
        'https://example.com/sitemap.xml',
        'https://example.com/sitemap-products.xml',
        'https://example.com/sitemap-articles.xml'
    ]

    def parse(self, response):
        yield {'url': response.url}
Sitemap Rules (Filter URLs)
Only scrape certain URLs from sitemap:
class FilteredSitemapSpider(SitemapSpider):
    name = 'filtered'
    sitemap_urls = ['https://example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),  # Product URLs
        ('/article/', 'parse_article'),  # Article URLs
    ]

    def parse_product(self, response):
        yield {
            'type': 'product',
            'name': response.css('h1::text').get()
        }

    def parse_article(self, response):
        yield {
            'type': 'article',
            'title': response.css('h1::text').get()
        }
Follow Sitemap Index
Some sites have a sitemap index (sitemap of sitemaps):
class IndexSpider(SitemapSpider):
    name = 'index'
    sitemap_urls = ['https://example.com/sitemap-index.xml']
    sitemap_follow = ['/sitemap-products']  # Only follow product sitemaps

    def parse(self, response):
        yield {'url': response.url}
Alternate URLs (Multilingual Sites)
class MultilingualSpider(SitemapSpider):
    name = 'multilingual'
    sitemap_urls = ['https://example.com/sitemap.xml']
    sitemap_alternate_links = True  # Also follow alternate-language URLs

    def parse(self, response):
        yield {
            'url': response.url,
            'language': response.url.split('/')[3]  # Extract language code from the path
        }
What the Docs Don't Tell You
1. Not all sitemaps are at /sitemap.xml
Check robots.txt for the actual location:
https://example.com/robots.txt
Look for:
Sitemap: https://example.com/actual-sitemap.xml
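You don't even have to copy the URL out by hand: sitemap_urls can point straight at robots.txt, and Scrapy will pull the Sitemap: entries from it. A minimal sketch (the domain is a placeholder):

from scrapy.spiders import SitemapSpider

class RobotsSitemapSpider(SitemapSpider):
    name = 'robots_sitemap'
    # Pointing at robots.txt works: Scrapy extracts the Sitemap: lines listed there
    sitemap_urls = ['https://example.com/robots.txt']

    def parse(self, response):
        yield {'url': response.url}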
2. Large sitemaps might be compressed
sitemap_urls = ['https://example.com/sitemap.xml.gz'] # Works fine
3. SitemapSpider ignores CrawlSpider rules
SitemapSpider only understands sitemap_rules, sitemap_follow, and sitemap_alternate_links — a CrawlSpider-style rules attribute on a SitemapSpider subclass is never processed. If you need to follow extra links from the pages you land on, do it manually in your callback:
class HybridSitemapSpider(SitemapSpider):
    name = 'hybrid_sitemap'
    sitemap_urls = ['https://example.com/sitemap.xml']

    def parse(self, response):
        yield {'url': response.url}
        # Follow related links by hand; CrawlSpider Rules won't fire here
        for href in response.css('a[href*="/related/"]::attr(href)').getall():
            yield response.follow(href, callback=self.parse)
Characteristics
Pros:
- Minimal code
- Very fast (no link discovery needed)
- Reliably covers every page the site lists in its sitemap
- Perfect for large sites
Cons:
- Only works if sitemap exists
- No control over crawl order
- Filtering is limited to sitemap_rules and sitemap_follow
- Less flexible
Choosing the Right Spider Type
Use Spider When:
- Site structure is complex
- You need complete control
- Custom logic per page type
- You're learning Scrapy
- Site is unusual
Example:
Site with dynamic navigation, AJAX-loaded content,
or complex multi-step workflows
Use CrawlSpider When:
- Clear URL patterns exist
- Automatic link following is needed
- Site structure is consistent
- You want less code
Example:
E-commerce site with categories → subcategories → products
News site with sections → articles
Use SitemapSpider When:
- Site has sitemap.xml
- You want to scrape all pages
- Fastest crawl needed
- Site structure doesn't matter
Example:
Large content sites (WordPress, Drupal)
E-commerce with good SEO
Any site that publishes sitemaps
Real-World Comparison
Let's scrape the same site with all three approaches:
Site Structure
Homepage
├── Category: Electronics
│ ├── Product: Laptop
│ ├── Product: Phone
│ └── Page 2 → More products
└── Category: Books
└── Product: Novel
Approach 1: Basic Spider
class BasicSpider(scrapy.Spider):
    name = 'basic'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Follow categories
        for cat in response.css('.category a'):
            yield response.follow(cat, self.parse_category)

    def parse_category(self, response):
        # Scrape products
        for prod in response.css('.product a'):
            yield response.follow(prod, self.parse_product)

        # Pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse_category)

    def parse_product(self, response):
        yield {'name': response.css('h1::text').get()}
Lines of code: ~20
Approach 2: CrawlSpider
class CrawlSpiderVersion(CrawlSpider):
    name = 'crawl'
    start_urls = ['https://example.com']

    rules = (
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
    )

    def parse_product(self, response):
        yield {'name': response.css('h1::text').get()}
Lines of code: ~12
Approach 3: SitemapSpider
class SitemapVersion(SitemapSpider):
    name = 'sitemap'
    sitemap_urls = ['https://example.com/sitemap.xml']
    sitemap_rules = [('/product/', 'parse_product')]

    def parse_product(self, response):
        yield {'name': response.css('h1::text').get()}
Lines of code: ~7
Converting Between Spider Types
From Spider to CrawlSpider
Before:
class MySpider(scrapy.Spider):
    def parse(self, response):
        for link in response.css('a.product'):
            yield response.follow(link, self.parse_product)
After:
class MySpider(CrawlSpider):
    rules = (
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
    )
From CrawlSpider to Spider
Sometimes you need more control. Just convert rules to manual logic.
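As a rough sketch, a rule like Rule(LinkExtractor(allow=r'/product/'), callback='parse_product') becomes a manual loop over the matching links; the selector and pattern here are placeholders:

import scrapy

class ManualSpider(scrapy.Spider):
    name = 'manual'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Manual equivalent of a /product/ rule: filter links yourself, then follow them
        for href in response.css('a::attr(href)').getall():
            if '/product/' in href:
                yield response.follow(href, callback=self.parse_product)

    def parse_product(self, response):
        yield {'name': response.css('h1::text').get()}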
Mixing Approaches
You can combine rule-based crawling with custom per-page logic:
class HybridSpider(CrawlSpider):
    name = 'hybrid'
    start_urls = ['https://example.com']

    # CrawlSpider rules for most pages
    rules = (
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
    )

    # Custom logic for start URLs
    def parse_start_url(self, response):
        # Special handling for the homepage
        featured = response.css('.featured-product a')
        for link in featured:
            yield response.follow(link, self.parse_featured)

    def parse_product(self, response):
        yield {'name': response.css('h1::text').get()}

    def parse_featured(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'featured': True
        }
Quick Decision Tree
Does the site have a sitemap.xml?
├─ Yes → Use SitemapSpider
└─ No → Does the site have clear URL patterns?
         ├─ Yes → Use CrawlSpider
         └─ No  → Use the basic Spider
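To answer the first question quickly, you can peek at robots.txt for Sitemap: entries. A small helper sketch (the domain is a placeholder, and it assumes robots.txt is reachable over HTTPS — not every site lists its sitemaps there):

import urllib.request

def find_sitemaps(domain):
    # Read robots.txt and collect any Sitemap: entries (a quick heuristic check)
    url = f'https://{domain}/robots.txt'
    with urllib.request.urlopen(url, timeout=10) as resp:
        text = resp.read().decode('utf-8', errors='replace')
    return [line.split(':', 1)[1].strip()
            for line in text.splitlines()
            if line.lower().startswith('sitemap:')]

print(find_sitemaps('example.com'))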
Summary
Spider (Basic):
- Full control
- Most code
- Use when: complex sites, learning, custom logic
CrawlSpider (Rules):
- Automatic link following
- Less code
- Use when: clear patterns, structured sites
SitemapSpider (Sitemap):
- Minimal code
- Very fast
- Use when: sitemap exists, want all pages
Start with: Basic Spider (learn fundamentals)
Graduate to: CrawlSpider (save time)
Use when available: SitemapSpider (fastest)
Don't overthink it. Start with what you know and upgrade when needed.
Happy scraping! 🕷️