If you've been writing Scrapy spiders, you've probably found yourself doing this:
```python
def parse(self, response):
    # Extract data from current page
    yield {'title': 'something'}

    # Find all links
    for link in response.css('a::attr(href)'):
        yield response.follow(link, self.parse)
```
This works, but there's a problem. You're manually following every single link. What if you only want to follow certain links? What if different types of links need different handling?
This is where Scrapy Rules come in.
Rules let you say "follow this type of link" and "scrape this type of page" without writing tons of repetitive code. They're like setting up traffic rules for your spider.
Let me show you how they work.
What Are Scrapy Rules?
Think of rules like instructions you give your spider:
Rule 1: "When you see a category link, follow it but don't scrape it yet."
Rule 2: "When you see a product link, scrape the product details."
Rule 3: "When you see a pagination link, follow it but ignore everything else."
Instead of writing if statements and loops for every type of link, you define rules once, and Scrapy handles the rest automatically.
Regular Spider vs CrawlSpider (With Rules)
Here's a regular spider:
```python
import scrapy

class RegularSpider(scrapy.Spider):
    name = 'regular'
    start_urls = ['https://example.com']

    def parse(self, response):
        # You manually handle everything
        for product in response.css('.product'):
            yield {'name': product.css('h2::text').get()}

        # You manually follow links
        for link in response.css('a.next::attr(href)'):
            yield response.follow(link, self.parse)
```
Here's the same thing with CrawlSpider and rules:
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RulesSpider(CrawlSpider):
    name = 'with_rules'
    start_urls = ['https://example.com']

    rules = (
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
        Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
    )

    def parse_product(self, response):
        yield {'name': response.css('h2::text').get()}
```
See the difference? With rules, you tell Scrapy what to follow and what to scrape. Scrapy does the rest.
When Should You Use Rules?
Use CrawlSpider with rules when:
- You need to follow links automatically
- Different types of links need different handling
- You're crawling multiple pages (pagination, categories, etc.)
- You want cleaner, more organized code
Don't use rules when:
- You're scraping a single page
- Your scraping logic is very complex and custom
- You're doing something unusual that rules can't handle
My advice: Start with regular spiders. When you find yourself writing lots of link-following code, switch to rules.
The Basics: Your First Rule
Let's build a spider that scrapes a bookstore. We'll start simple and add complexity.
Basic Setup
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BookSpider(CrawlSpider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    rules = (
        # We'll add rules here
    )
```
Notice we're using CrawlSpider instead of scrapy.Spider. This is important. You can't use rules with regular spiders.
Rule #1: Following Links
Let's say we want to follow pagination links (next page, page 2, page 3, etc.).
```python
rules = (
    Rule(LinkExtractor(allow=r'catalogue/page'), follow=True),
)
```
What this does:
- Looks for any link containing "catalogue/page"
- Follows those links
- Doesn't scrape anything (no callback)
follow=True means "follow these links and keep looking for more links on those pages."
Rule #2: Scraping Pages
Now let's scrape individual book pages:
```python
rules = (
    Rule(LinkExtractor(allow=r'catalogue/page'), follow=True),
    Rule(LinkExtractor(allow=r'catalogue/.*_\d+'), callback='parse_book'),
)

def parse_book(self, response):
    yield {
        'title': response.css('h1::text').get(),
        'price': response.css('.price_color::text').get(),
        'rating': response.css('.star-rating::attr(class)').get(),
    }
```
What this does:
- First rule: Follow pagination links
- Second rule: When you find a book page (matches the pattern), scrape it using parse_book
Important: Don't name your callback parse. CrawlSpider uses that name internally. Use parse_item, parse_product, parse_book, etc.
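Putting the pieces together, here's a minimal sketch of the spider as it stands with these two rules (same setup as above, scraping books.toscrape.com):

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BookSpider(CrawlSpider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    rules = (
        # Follow pagination links, keep crawling
        Rule(LinkExtractor(allow=r'catalogue/page'), follow=True),
        # Scrape book detail pages
        Rule(LinkExtractor(allow=r'catalogue/.*_\d+'), callback='parse_book'),
    )

    def parse_book(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price_color::text').get(),
            'rating': response.css('.star-rating::attr(class)').get(),
        }
```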
Understanding LinkExtractor
LinkExtractor is what finds the links. It has several options:
allow (Include Links)
```python
LinkExtractor(allow=r'/product/\d+')
```
This says "only extract links whose URLs match this pattern." The r prefix makes it a raw string, so the backslashes in your regular expression are passed through untouched.
Common patterns:
- r'/category/': Any link with /category/ in it
- r'/product/\d+': Links like /product/123
- r'/page-\d+\.html': Links like /page-1.html
- r'\.pdf$': Links ending in .pdf
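If you're unsure whether a pattern matches the URLs you care about, you can sanity-check it with plain Python before touching the spider. As far as I know, LinkExtractor applies allow patterns with a regex search against the absolute URL, so re.search is a close approximation. The URLs below are made up for illustration:

```python
import re

# Hypothetical URLs you expect the spider to encounter
urls = [
    'https://example.com/product/123',
    'https://example.com/category/shoes',
    'https://example.com/page-2.html',
]

pattern = r'/product/\d+'

for url in urls:
    # re.search approximates how allow patterns are checked against absolute URLs
    matched = bool(re.search(pattern, url))
    print(f'{url} -> {"match" if matched else "no match"}')
```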
deny (Exclude Links)
```python
LinkExtractor(allow=r'/category/', deny=r'/category/books')
```
This says "extract category links, but NOT if they contain /category/books."
restrict_css (Only Look in Specific Areas)
```python
LinkExtractor(restrict_css='.product-list')
```
Only looks for links inside elements with class "product-list". Super useful for avoiding navigation links, footer links, etc.
restrict_xpaths (XPath Version)
```python
LinkExtractor(restrict_xpaths='//div[@class="content"]')
```
Same as restrict_css but using XPath.
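You can exercise these options without running a full crawl. The sketch below builds an HtmlResponse by hand (the HTML and class names are made up for illustration) and runs a LinkExtractor over it, combining allow, deny, and restrict_css:

```python
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# Made-up HTML just to show what the extractor options do
html = b"""
<div class="product-list">
  <a href="/product/1">Widget</a>
  <a href="/product/2">Gadget</a>
  <a href="/product/admin">Admin page</a>
</div>
<div class="footer">
  <a href="/product/3">Footer link (outside the product list)</a>
</div>
"""

response = HtmlResponse(url='https://example.com', body=html, encoding='utf-8')

le = LinkExtractor(
    allow=r'/product/',           # only product URLs
    deny=r'/product/admin',       # but never the admin page
    restrict_css='.product-list'  # and only look inside the listing
)

for link in le.extract_links(response):
    print(link.url)
# Expected: the absolute URLs for /product/1 and /product/2 only
```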
Real Example: E-commerce Site
Let's scrape an e-commerce site with categories, subcategories, and products.
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ShopSpider(CrawlSpider):
    name = 'shop'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    rules = (
        # Rule 1: Follow category links (but don't scrape them)
        Rule(
            LinkExtractor(allow=r'/category/'),
            follow=True
        ),
        # Rule 2: Follow pagination (next page, page numbers)
        Rule(
            LinkExtractor(allow=r'/page/\d+'),
            follow=True
        ),
        # Rule 3: Scrape product pages
        Rule(
            LinkExtractor(allow=r'/product/\d+'),
            callback='parse_product'
        ),
    )

    def parse_product(self, response):
        yield {
            'name': response.css('h1.product-name::text').get(),
            'price': response.css('span.price::text').get(),
            'description': response.css('div.description::text').get(),
            'url': response.url,
        }
```
What happens:
- Spider starts at homepage
- Finds category links (like /category/electronics), follows them
- On category pages, finds pagination links, follows them
- On any page, finds product links (like /product/123), scrapes them
All automatic. You just defined the rules.
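One practical note: a rule set like this can fan out across a large site very quickly. If you want to cap a test run, Scrapy's built-in DEPTH_LIMIT and CLOSESPIDER_PAGECOUNT settings are an easy brake. Here's a sketch of the same spider (rules trimmed) with those limits set via custom_settings:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ShopSpider(CrawlSpider):
    name = 'shop'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    # DEPTH_LIMIT stops following links more than 3 hops from the start URLs;
    # CLOSESPIDER_PAGECOUNT closes the spider after roughly 200 responses.
    custom_settings = {
        'DEPTH_LIMIT': 3,
        'CLOSESPIDER_PAGECOUNT': 200,
    }

    rules = (
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        Rule(LinkExtractor(allow=r'/product/\d+'), callback='parse_product'),
    )

    def parse_product(self, response):
        yield {'name': response.css('h1.product-name::text').get(), 'url': response.url}
```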
Advanced: Using Multiple Patterns
You can use lists in allow and deny:
```python
Rule(
    LinkExtractor(
        allow=(r'/product/\d+', r'/item/\d+'),   # Match either pattern
        deny=(r'/product/admin', r'/item/test')  # But exclude these
    ),
    callback='parse_product'
)
```
The follow Parameter (Important!)
This confuses a lot of beginners. Let me explain:
follow=True
```python
Rule(LinkExtractor(allow=r'/category/'), follow=True)
```
Means: "Follow these links AND keep looking for more links on those pages."
Use when: You want to keep crawling deeper (categories, pagination).
No follow (or follow=False)
```python
Rule(LinkExtractor(allow=r'/product/'), callback='parse_product')
```
Means: "Extract data from these pages, but don't follow any more links."
Use when: You've reached your target pages (products, articles).
Important: If you set a callback, follow defaults to False. If you want both, you need to explicitly set follow=True:
```python
Rule(
    LinkExtractor(allow=r'/product/'),
    callback='parse_product',
    follow=True  # Scrape AND keep following links
)
```
Common Patterns and Examples
Pattern 1: Blog with Categories and Posts
```python
rules = (
    # Follow category links
    Rule(LinkExtractor(allow=r'/category/'), follow=True),
    # Follow pagination
    Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
    # Scrape individual posts
    Rule(LinkExtractor(allow=r'/\d{4}/\d{2}/'), callback='parse_post'),
)

def parse_post(self, response):
    yield {
        'title': response.css('h1::text').get(),
        'author': response.css('.author::text').get(),
        'date': response.css('.date::text').get(),
        'content': ' '.join(response.css('.content p::text').getall()),
    }
```
Pattern 2: News Site with Sections
```python
rules = (
    # Follow section links (sports, tech, politics)
    Rule(LinkExtractor(allow=r'/section/'), follow=True),
    # Scrape articles
    Rule(
        LinkExtractor(allow=r'/article/\d+'),
        callback='parse_article',
        follow=False  # Don't follow links inside articles
    ),
)
```
Pattern 3: Forum with Threads
```python
rules = (
    # Follow forum sections
    Rule(LinkExtractor(allow=r'/forum/\d+'), follow=True),
    # Follow thread pages
    Rule(LinkExtractor(allow=r'/thread/\d+'), follow=True),
    # Scrape individual posts
    Rule(
        LinkExtractor(restrict_css='.post'),
        callback='parse_post'
    ),
)
```
)
Restricting Where to Look for Links
This is super important. You don't want to follow navigation links, footer links, or ads.
Use restrict_css
```python
Rule(
    LinkExtractor(
        allow=r'/product/',
        restrict_css='.product-list'  # Only look in this section
    ),
    callback='parse_product'
)
```
Use restrict_xpaths
```python
Rule(
    LinkExtractor(
        allow=r'/product/',
        restrict_xpaths='//div[@id="content"]'  # Only look in main content
    ),
    callback='parse_product'
)
```
Use Multiple Restrictions
```python
Rule(
    LinkExtractor(
        allow=r'/article/',
        restrict_css='.article-list',  # Look in the article list
        deny=r'/article/sponsored'     # But ignore sponsored articles
    ),
    callback='parse_article'
)
```
Debugging Your Rules
Your rules aren't working? Here's how to debug:
Step 1: Check What Links Are Being Extracted
Add this to your spider:
```python
def parse_start_url(self, response):
    self.logger.info(f'Starting at: {response.url}')
    for link in response.css('a::attr(href)').getall():
        self.logger.info(f'Found link: {link}')
    return super().parse_start_url(response)
```
Step 2: Test LinkExtractor in Shell
```bash
scrapy shell "https://example.com"
```
Then test your extractor:
```python
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(allow=r'/product/')
>>> links = le.extract_links(response)
>>> for link in links:
...     print(link.url)
```
Step 3: Add Logging to Callbacks
```python
def parse_product(self, response):
    self.logger.info(f'Scraping product: {response.url}')
    yield {
        'name': response.css('h1::text').get(),
    }
```
Common Mistakes (And How to Fix Them)
Mistake 1: Using "parse" as Callback
```python
# WRONG
rules = (
    Rule(LinkExtractor(allow=r'/product/'), callback='parse'),
)

# RIGHT
rules = (
    Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
)
```
CrawlSpider uses parse internally. Don't override it.
Mistake 2: Forgetting follow=True
```python
# WRONG (won't keep crawling)
rules = (
    Rule(LinkExtractor(allow=r'/category/'), callback='parse_category'),
)

# RIGHT (keeps crawling)
rules = (
    Rule(LinkExtractor(allow=r'/category/'), callback='parse_category', follow=True),
)
```
Mistake 3: Too Broad Patterns
```python
# WRONG (matches too much)
rules = (
    Rule(LinkExtractor(allow=r'/'), callback='parse_page'),
)

# RIGHT (specific pattern)
rules = (
    Rule(LinkExtractor(allow=r'/product/\d+'), callback='parse_product'),
)
```
Mistake 4: Not Using restrict_css/restrict_xpaths
```python
# WRONG (follows navigation, footer, etc.)
rules = (
    Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
)

# RIGHT (only follows links in product listings)
rules = (
    Rule(
        LinkExtractor(allow=r'/product/', restrict_css='.product-grid'),
        callback='parse_product'
    ),
)
```
Rule Order Matters
Rules are checked in order, and if more than one rule matches the same link, only the first matching rule is applied.
```python
rules = (
    # This runs first
    Rule(LinkExtractor(allow=r'/product/special'), callback='parse_special'),
    # This runs second (for products that aren't special)
    Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
)
```
If a link matches the first rule, the second rule never runs for that link.
Real-World Example: Complete Bookstore Spider
Let me show you a complete, working example:
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BookstoreSpider(CrawlSpider):
    name = 'bookstore'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    rules = (
        # Rule 1: Follow category links
        Rule(
            LinkExtractor(
                allow=r'catalogue/category/',
                restrict_css='.side_categories'
            ),
            follow=True
        ),
        # Rule 2: Follow pagination
        Rule(
            LinkExtractor(
                allow=r'catalogue/page',
                restrict_css='.pager'
            ),
            follow=True
        ),
        # Rule 3: Scrape individual books
        Rule(
            LinkExtractor(
                allow=r'catalogue/.+_\d+/index\.html',
                restrict_css='.product_pod'
            ),
            callback='parse_book',
            follow=False
        ),
    )

    def parse_book(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price_color::text').get(),
            'availability': response.css('.availability::text').getall()[1].strip(),
            'rating': response.css('.star-rating::attr(class)').get().split()[-1],
            'description': response.css('#product_description + p::text').get(),
            'category': response.css('.breadcrumb li:nth-child(3) a::text').get(),
            'url': response.url,
        }
```
Run it:
```bash
scrapy crawl bookstore -o books.json
```
This will:
- Start at the homepage
- Follow all category links
- Follow pagination on each category
- Scrape every book it finds
- Save everything to books.json
All automatic!
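One small note on the output flag: -o appends to books.json if the file already exists, which for JSON usually leaves you with a broken file across runs. On Scrapy 2.1 or newer, a capital -O overwrites the file instead:

```bash
scrapy crawl bookstore -O books.json
```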
When NOT to Use Rules
Rules are great, but sometimes they're not the right choice:
Don't use rules when:
- Your logic is too complex. If you need lots of conditional logic, custom request handling, or complex state management, a regular spider is clearer.
- You're scraping a single page. Rules are for following links. If you're not following links, you don't need them.
- You need fine-grained control. Sometimes you need to inspect each link before deciding to follow it. Rules are mostly all-or-nothing (though see the process_links sketch right after this list).
- The site structure is unpredictable. If the site doesn't follow consistent URL patterns, rules won't help.
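If the all-or-nothing part is what bothers you, there is a middle ground: Rule accepts process_links and process_request hooks, so you can keep CrawlSpider but still veto individual links or tweak each request. Here's a sketch; the filtering logic and field names are made up for illustration:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class PickySpider(CrawlSpider):
    name = 'picky'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    rules = (
        Rule(
            LinkExtractor(allow=r'/product/'),
            callback='parse_product',
            process_links='drop_clearance_links',  # filter the extracted links
            process_request='tag_request',         # tweak each outgoing request
        ),
    )

    def drop_clearance_links(self, links):
        # Made-up filtering logic: skip any link whose URL mentions "clearance"
        return [link for link in links if 'clearance' not in link.url]

    def tag_request(self, request, response):
        # In recent Scrapy versions process_request also receives the response
        # the request came from. Return the request (possibly modified), or
        # None to drop it entirely.
        request.meta['found_on'] = response.url
        return request

    def parse_product(self, response):
        yield {'url': response.url, 'found_on': response.meta.get('found_on')}
```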
Rules vs Manual Link Following
Use rules when:
- URL patterns are consistent
- You're crawling many pages
- Different link types need different handling
- You want cleaner code
Use manual following when:
- You need custom logic for each link
- You need to pass data between pages (see the sketch after this list)
- You're doing something unusual
- The learning curve of rules isn't worth it
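For the passing-data case in particular, here's the kind of manual pattern rules can't easily express: carrying a value from a listing page into the detail callback with cb_kwargs. The selectors and field names are placeholders:

```python
import scrapy

class ManualSpider(scrapy.Spider):
    name = 'manual'
    start_urls = ['https://example.com/catalog']

    def parse(self, response):
        for row in response.css('.product-row'):
            link = row.css('a::attr(href)').get()
            list_price = row.css('.price::text').get()
            if link:
                # Pass data from the listing page into the detail callback
                yield response.follow(
                    link,
                    callback=self.parse_detail,
                    cb_kwargs={'list_price': list_price},
                )

    def parse_detail(self, response, list_price):
        yield {
            'name': response.css('h1::text').get(),
            'list_price': list_price,  # value carried over from the listing page
            'detail_price': response.css('.price::text').get(),
        }
```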
Quick Reference
Basic Rule Structure
```python
Rule(
    LinkExtractor(
        allow=r'pattern',          # What links to extract (omit to extract every link)
        deny=r'pattern',           # Optional: What links to ignore
        restrict_css='.class',     # Optional: Where to look
        restrict_xpaths='//div',   # Optional: Where to look (XPath)
    ),
    callback='parse_item',         # Optional: Function to call
    follow=True,                   # Optional: Keep following links
)
```
Common Patterns
```python
# Pagination
Rule(LinkExtractor(allow=r'/page/\d+'), follow=True)

# Product pages
Rule(LinkExtractor(allow=r'/product/\d+'), callback='parse_product')

# Blog posts by date
Rule(LinkExtractor(allow=r'/\d{4}/\d{2}/\d{2}/'), callback='parse_post')

# Categories
Rule(LinkExtractor(allow=r'/category/'), follow=True)
```
Final Thoughts
Rules make your Scrapy code cleaner and more maintainable. Instead of writing loops and if statements, you define patterns once and let Scrapy handle the rest.
Start simple. Add one rule. Test it. Then add another rule. Don't try to write all your rules at once.
And remember: you don't always need rules. Sometimes a regular spider with manual link following is clearer and simpler. Use the right tool for the job.
Now go build something!
Questions? Drop a comment below. Rules can be confusing at first, but once you get them, they're incredibly powerful.
Happy scraping! 🕷️