If you've been writing Scrapy spiders, you've probably found yourself doing this:
```python
def parse(self, response):
    # Extract data from current page
    yield {'title': 'something'}

    # Find all links
    for link in response.css('a::attr(href)'):
        yield response.follow(link, self.parse)
```
This works, but there's a problem. You're manually following every single link. What if you only want to follow certain links? What if different types of links need different handling?
This is where Scrapy Rules come in.
Rules let you say "follow this type of link" and "scrape this type of page" without writing tons of repetitive code. They're like setting up traffic rules for your spider.
Let me show you how they work.
What Are Scrapy Rules?
Think of rules like instructions you give your spider:
Rule 1: "When you see a category link, follow it but don't scrape it yet."
Rule 2: "When you see a product link, scrape the product details."
Rule 3: "When you see a pagination link, follow it but ignore everything else."
Instead of writing if statements and loops for every type of link, you define rules once, and Scrapy handles the rest automatically.
Regular Spider vs CrawlSpider (With Rules)
Here's a regular spider:
```python
import scrapy

class RegularSpider(scrapy.Spider):
    name = 'regular'
    start_urls = ['https://example.com']

    def parse(self, response):
        # You manually handle everything
        for product in response.css('.product'):
            yield {'name': product.css('h2::text').get()}

        # You manually follow links
        for link in response.css('a.next::attr(href)'):
            yield response.follow(link, self.parse)
```
Here's the same thing with CrawlSpider and rules:
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RulesSpider(CrawlSpider):
    name = 'with_rules'
    start_urls = ['https://example.com']

    rules = (
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
        Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
    )

    def parse_product(self, response):
        yield {'name': response.css('h2::text').get()}
```
See the difference? With rules, you tell Scrapy what to follow and what to scrape. Scrapy does the rest.
When Should You Use Rules?
Use CrawlSpider with rules when:
- You need to follow links automatically
- Different types of links need different handling
- You're crawling multiple pages (pagination, categories, etc.)
- You want cleaner, more organized code
Don't use rules when:
- You're scraping a single page
- Your scraping logic is very complex and custom
- You're doing something unusual that rules can't handle
My advice: Start with regular spiders. When you find yourself writing lots of link-following code, switch to rules.
The Basics: Your First Rule
Let's build a spider that scrapes a bookstore. We'll start simple and add complexity.
Basic Setup
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BookSpider(CrawlSpider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    rules = (
        # We'll add rules here
    )
```
Notice we're using CrawlSpider instead of scrapy.Spider. This is important. You can't use rules with regular spiders.
Rule #1: Following Links
Let's say we want to follow pagination links (next page, page 2, page 3, etc.).
```python
rules = (
    Rule(LinkExtractor(allow=r'catalogue/page'), follow=True),
)
```
What this does:
- Looks for any link containing "catalogue/page"
- Follows those links
- Doesn't scrape anything (no callback)
follow=True means "follow these links and keep looking for more links on those pages."
Rule #2: Scraping Pages
Now let's scrape individual book pages:
```python
rules = (
    Rule(LinkExtractor(allow=r'catalogue/page'), follow=True),
    Rule(LinkExtractor(allow=r'catalogue/.*_\d+'), callback='parse_book'),
)

def parse_book(self, response):
    yield {
        'title': response.css('h1::text').get(),
        'price': response.css('.price_color::text').get(),
        'rating': response.css('.star-rating::attr(class)').get(),
    }
```
What this does:
- First rule: Follow pagination links
- Second rule: When you find a book page (matches the pattern), scrape it using parse_book
Important: Don't name your callback parse. CrawlSpider uses that name internally. Use parse_item, parse_product, parse_book, etc.
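Putting the pieces together, here's a minimal sketch of the spider as it stands with these two rules (same setup as above, scraping books.toscrape.com):

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BookSpider(CrawlSpider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    rules = (
        # Follow pagination links, keep crawling
        Rule(LinkExtractor(allow=r'catalogue/page'), follow=True),
        # Scrape book detail pages
        Rule(LinkExtractor(allow=r'catalogue/.*_\d+'), callback='parse_book'),
    )

    def parse_book(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price_color::text').get(),
            'rating': response.css('.star-rating::attr(class)').get(),
        }
```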
Understanding LinkExtractor
LinkExtractor is what finds the links. It has several options:
allow (Include Links)
```python
LinkExtractor(allow=r'/product/\d+')
```
This says "only extract links whose URLs match this pattern." The r prefix makes it a raw string, so the backslashes in your regular expression are passed through untouched.
Common patterns:
- r'/category/': Any link with /category/ in it
- r'/product/\d+': Links like /product/123
- r'/page-\d+\.html': Links like /page-1.html
- r'\.pdf$': Links ending in .pdf
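If you're unsure whether a pattern matches the URLs you care about, you can sanity-check it with plain Python before touching the spider. As far as I know, LinkExtractor applies allow patterns with a regex search against the absolute URL, so re.search is a close approximation. The URLs below are made up for illustration:

```python
import re

# Hypothetical URLs you expect the spider to encounter
urls = [
    'https://example.com/product/123',
    'https://example.com/category/shoes',
    'https://example.com/page-2.html',
]

pattern = r'/product/\d+'

for url in urls:
    # re.search approximates how allow patterns are checked against absolute URLs
    matched = bool(re.search(pattern, url))
    print(f'{url} -> {"match" if matched else "no match"}')
```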
deny (Exclude Links)
```python
LinkExtractor(allow=r'/category/', deny=r'/category/books')
```
This says "extract category links, but NOT if they contain /category/books."
restrict_css (Only Look in Specific Areas)
```python
LinkExtractor(restrict_css='.product-list')
```
Only looks for links inside elements with class "product-list". Super useful for avoiding navigation links, footer links, etc.
restrict_xpaths (XPath Version)
```python
LinkExtractor(restrict_xpaths='//div[@class="content"]')
```
Same as restrict_css but using XPath.
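You can exercise these options without running a full crawl. The sketch below builds an HtmlResponse by hand (the HTML and class names are made up for illustration) and runs a LinkExtractor over it, combining allow, deny, and restrict_css:

```python
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# Made-up HTML just to show what the extractor options do
html = b"""
<div class="product-list">
  <a href="/product/1">Widget</a>
  <a href="/product/2">Gadget</a>
  <a href="/product/admin">Admin page</a>
</div>
<div class="footer">
  <a href="/product/3">Footer link (outside the product list)</a>
</div>
"""

response = HtmlResponse(url='https://example.com', body=html, encoding='utf-8')

le = LinkExtractor(
    allow=r'/product/',           # only product URLs
    deny=r'/product/admin',       # but never the admin page
    restrict_css='.product-list'  # and only look inside the listing
)

for link in le.extract_links(response):
    print(link.url)
# Expected: the absolute URLs for /product/1 and /product/2 only
```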
Real Example: E-commerce Site
Let's scrape an e-commerce site with categories, subcategories, and products.
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ShopSpider(CrawlSpider):
    name = 'shop'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    rules = (
        # Rule 1: Follow category links (but don't scrape them)
        Rule(
            LinkExtractor(allow=r'/category/'),
            follow=True
        ),
        # Rule 2: Follow pagination (next page, page numbers)
        Rule(
            LinkExtractor(allow=r'/page/\d+'),
            follow=True
        ),
        # Rule 3: Scrape product pages
        Rule(
            LinkExtractor(allow=r'/product/\d+'),
            callback='parse_product'
        ),
    )

    def parse_product(self, response):
        yield {
            'name': response.css('h1.product-name::text').get(),
            'price': response.css('span.price::text').get(),
            'description': response.css('div.description::text').get(),
            'url': response.url,
        }
```
What happens:
- Spider starts at homepage
- Finds category links (like /category/electronics), follows them
- On category pages, finds pagination links, follows them
- On any page, finds product links (like /product/123), scrapes them
All automatic. You just defined the rules.
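One practical note: a rule set like this can fan out across a large site very quickly. If you want to cap a test run, Scrapy's built-in DEPTH_LIMIT and CLOSESPIDER_PAGECOUNT settings are an easy brake. Here's a sketch of the same spider (rules trimmed) with those limits set via custom_settings:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ShopSpider(CrawlSpider):
    name = 'shop'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    # DEPTH_LIMIT stops following links more than 3 hops from the start URLs;
    # CLOSESPIDER_PAGECOUNT closes the spider after roughly 200 responses.
    custom_settings = {
        'DEPTH_LIMIT': 3,
        'CLOSESPIDER_PAGECOUNT': 200,
    }

    rules = (
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        Rule(LinkExtractor(allow=r'/product/\d+'), callback='parse_product'),
    )

    def parse_product(self, response):
        yield {'name': response.css('h1.product-name::text').get(), 'url': response.url}
```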
Advanced: Using Multiple Patterns
You can use lists in allow and deny:
```python
Rule(
    LinkExtractor(
        allow=(r'/product/\d+', r'/item/\d+'),   # Match either pattern
        deny=(r'/product/admin', r'/item/test')  # But exclude these
    ),
    callback='parse_product'
)
```
The follow Parameter (Important!)
This confuses a lot of beginners. Let me explain:
follow=True
```python
Rule(LinkExtractor(allow=r'/category/'), follow=True)
```
Means: "Follow these links AND keep looking for more links on those pages."
Use when: You want to keep crawling deeper (categories, pagination).
No follow (or follow=False)
```python
Rule(LinkExtractor(allow=r'/product/'), callback='parse_product')
```
Means: "Extract data from these pages, but don't follow any more links."
Use when: You've reached your target pages (products, articles).
Important: If you set a callback, follow defaults to False. If you want both, you need to explicitly set follow=True:
```python
Rule(
    LinkExtractor(allow=r'/product/'),
    callback='parse_product',
    follow=True  # Scrape AND keep following links
)
```
Common Patterns and Examples
Pattern 1: Blog with Categories and Posts
```python
rules = (
    # Follow category links
    Rule(LinkExtractor(allow=r'/category/'), follow=True),
    # Follow pagination
    Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
    # Scrape individual posts
    Rule(LinkExtractor(allow=r'/\d{4}/\d{2}/'), callback='parse_post'),
)

def parse_post(self, response):
    yield {
        'title': response.css('h1::text').get(),
        'author': response.css('.author::text').get(),
        'date': response.css('.date::text').get(),
        'content': ' '.join(response.css('.content p::text').getall()),
    }
```
Pattern 2: News Site with Sections
```python
rules = (
    # Follow section links (sports, tech, politics)
    Rule(LinkExtractor(allow=r'/section/'), follow=True),
    # Scrape articles
    Rule(
        LinkExtractor(allow=r'/article/\d+'),
        callback='parse_article',
        follow=False  # Don't follow links inside articles
    ),
)
```
Pattern 3: Forum with Threads
```python
rules = (
    # Follow forum sections
    Rule(LinkExtractor(allow=r'/forum/\d+'), follow=True),
    # Follow thread pages
    Rule(LinkExtractor(allow=r'/thread/\d+'), follow=True),
    # Scrape individual posts
    Rule(
        LinkExtractor(restrict_css='.post'),
        callback='parse_post'
    ),
)
```
)
Restricting Where to Look for Links
This is super important. You don't want to follow navigation links, footer links, or ads.
Use restrict_css
```python
Rule(
    LinkExtractor(
        allow=r'/product/',
        restrict_css='.product-list'  # Only look in this section
    ),
    callback='parse_product'
)
```
Use restrict_xpaths
```python
Rule(
    LinkExtractor(
        allow=r'/product/',
        restrict_xpaths='//div[@id="content"]'  # Only look in main content
    ),
    callback='parse_product'
)
```
Use Multiple Restrictions
```python
Rule(
    LinkExtractor(
        allow=r'/article/',
        restrict_css='.article-list',  # Look in the article list
        deny=r'/article/sponsored'     # But ignore sponsored articles
    ),
    callback='parse_article'
)
```
Debugging Your Rules
Your rules aren't working? Here's how to debug:
Step 1: Check What Links Are Being Extracted
Add this to your spider:
```python
def parse_start_url(self, response):
    self.logger.info(f'Starting at: {response.url}')
    for link in response.css('a::attr(href)').getall():
        self.logger.info(f'Found link: {link}')
    return super().parse_start_url(response)
```
Step 2: Test LinkExtractor in Shell
```bash
scrapy shell "https://example.com"
```
Then test your extractor:
```python
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(allow=r'/product/')
>>> links = le.extract_links(response)
>>> for link in links:
...     print(link.url)
```
Step 3: Add Logging to Callbacks
```python
def parse_product(self, response):
    self.logger.info(f'Scraping product: {response.url}')
    yield {
        'name': response.css('h1::text').get(),
    }
```
Common Mistakes (And How to Fix Them)
Mistake 1: Using "parse" as Callback
```python
# WRONG
rules = (
    Rule(LinkExtractor(allow=r'/product/'), callback='parse'),
)

# RIGHT
rules = (
    Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
)
```
CrawlSpider uses parse internally. Don't override it.
Mistake 2: Forgetting follow=True
```python
# WRONG (won't keep crawling)
rules = (
    Rule(LinkExtractor(allow=r'/category/'), callback='parse_category'),
)

# RIGHT (keeps crawling)
rules = (
    Rule(LinkExtractor(allow=r'/category/'), callback='parse_category', follow=True),
)
```
Mistake 3: Too Broad Patterns
```python
# WRONG (matches too much)
rules = (
    Rule(LinkExtractor(allow=r'/'), callback='parse_page'),
)

# RIGHT (specific pattern)
rules = (
    Rule(LinkExtractor(allow=r'/product/\d+'), callback='parse_product'),
)
```
Mistake 4: Not Using restrict_css/restrict_xpaths
```python
# WRONG (follows navigation, footer, etc.)
rules = (
    Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
)

# RIGHT (only follows links in product listings)
rules = (
    Rule(
        LinkExtractor(allow=r'/product/', restrict_css='.product-grid'),
        callback='parse_product'
    ),
)
```
Rule Order Matters
Rules are checked in order, and if more than one rule matches the same link, only the first matching rule is applied.
```python
rules = (
    # This runs first
    Rule(LinkExtractor(allow=r'/product/special'), callback='parse_special'),
    # This runs second (for products that aren't special)
    Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
)
```
If a link matches the first rule, the second rule never runs for that link.
Real-World Example: Complete Bookstore Spider
Let me show you a complete, working example:
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BookstoreSpider(CrawlSpider):
    name = 'bookstore'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    rules = (
        # Rule 1: Follow category links
        Rule(
            LinkExtractor(
                allow=r'catalogue/category/',
                restrict_css='.side_categories'
            ),
            follow=True
        ),
        # Rule 2: Follow pagination
        Rule(
            LinkExtractor(
                allow=r'catalogue/page',
                restrict_css='.pager'
            ),
            follow=True
        ),
        # Rule 3: Scrape individual books
        Rule(
            LinkExtractor(
                allow=r'catalogue/.+_\d+/index\.html',
                restrict_css='.product_pod'
            ),
            callback='parse_book',
            follow=False
        ),
    )

    def parse_book(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price_color::text').get(),
            'availability': response.css('.availability::text').getall()[1].strip(),
            'rating': response.css('.star-rating::attr(class)').get().split()[-1],
            'description': response.css('#product_description + p::text').get(),
            'category': response.css('.breadcrumb li:nth-child(3) a::text').get(),
            'url': response.url,
        }
```
Run it:
```bash
scrapy crawl bookstore -o books.json
```
This will:
- Start at the homepage
- Follow all category links
- Follow pagination on each category
- Scrape every book it finds
- Save everything to books.json
All automatic!
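One small note on the output flag: -o appends to books.json if the file already exists, which for JSON usually leaves you with a broken file across runs. On Scrapy 2.1 or newer, a capital -O overwrites the file instead:

```bash
scrapy crawl bookstore -O books.json
```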
When NOT to Use Rules
Rules are great, but sometimes they're not the right choice:
Don't use rules when:
- Your logic is too complex. If you need lots of conditional logic, custom request handling, or complex state management, a regular spider is clearer.
- You're scraping a single page. Rules are for following links. If you're not following links, you don't need them.
- You need fine-grained control. Sometimes you need to inspect each link before deciding to follow it. Rules are mostly all-or-nothing (though see the process_links sketch right after this list).
- The site structure is unpredictable. If the site doesn't follow consistent URL patterns, rules won't help.
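If the all-or-nothing part is what bothers you, there is a middle ground: Rule accepts process_links and process_request hooks, so you can keep CrawlSpider but still veto individual links or tweak each request. Here's a sketch; the filtering logic and field names are made up for illustration:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class PickySpider(CrawlSpider):
    name = 'picky'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    rules = (
        Rule(
            LinkExtractor(allow=r'/product/'),
            callback='parse_product',
            process_links='drop_clearance_links',  # filter the extracted links
            process_request='tag_request',         # tweak each outgoing request
        ),
    )

    def drop_clearance_links(self, links):
        # Made-up filtering logic: skip any link whose URL mentions "clearance"
        return [link for link in links if 'clearance' not in link.url]

    def tag_request(self, request, response):
        # In recent Scrapy versions process_request also receives the response
        # the request came from. Return the request (possibly modified), or
        # None to drop it entirely.
        request.meta['found_on'] = response.url
        return request

    def parse_product(self, response):
        yield {'url': response.url, 'found_on': response.meta.get('found_on')}
```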
Rules vs Manual Link Following
Use rules when:
- URL patterns are consistent
- You're crawling many pages
- Different link types need different handling
- You want cleaner code
Use manual following when:
- You need custom logic for each link
- You need to pass data between pages (see the sketch after this list)
- You're doing something unusual
- The learning curve of rules isn't worth it
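For the passing-data case in particular, here's the kind of manual pattern rules can't easily express: carrying a value from a listing page into the detail callback with cb_kwargs. The selectors and field names are placeholders:

```python
import scrapy

class ManualSpider(scrapy.Spider):
    name = 'manual'
    start_urls = ['https://example.com/catalog']

    def parse(self, response):
        for row in response.css('.product-row'):
            link = row.css('a::attr(href)').get()
            list_price = row.css('.price::text').get()
            if link:
                # Pass data from the listing page into the detail callback
                yield response.follow(
                    link,
                    callback=self.parse_detail,
                    cb_kwargs={'list_price': list_price},
                )

    def parse_detail(self, response, list_price):
        yield {
            'name': response.css('h1::text').get(),
            'list_price': list_price,  # value carried over from the listing page
            'detail_price': response.css('.price::text').get(),
        }
```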
Quick Reference
Basic Rule Structure
```python
Rule(
    LinkExtractor(
        allow=r'pattern',          # What links to extract (omit to extract every link)
        deny=r'pattern',           # Optional: What links to ignore
        restrict_css='.class',     # Optional: Where to look
        restrict_xpaths='//div',   # Optional: Where to look (XPath)
    ),
    callback='parse_item',         # Optional: Function to call
    follow=True,                   # Optional: Keep following links
)
```
Common Patterns
```python
# Pagination
Rule(LinkExtractor(allow=r'/page/\d+'), follow=True)

# Product pages
Rule(LinkExtractor(allow=r'/product/\d+'), callback='parse_product')

# Blog posts by date
Rule(LinkExtractor(allow=r'/\d{4}/\d{2}/\d{2}/'), callback='parse_post')

# Categories
Rule(LinkExtractor(allow=r'/category/'), follow=True)
```
Final Thoughts
Rules make your Scrapy code cleaner and more maintainable. Instead of writing loops and if statements, you define patterns once and let Scrapy handle the rest.
Start simple. Add one rule. Test it. Then add another rule. Don't try to write all your rules at once.
And remember: you don't always need rules. Sometimes a regular spider with manual link following is clearer and simpler. Use the right tool for the job.
Now go build something!
Questions? Drop a comment below. Rules can be confusing at first, but once you get them, they're incredibly powerful.
Happy scraping! 🕷️