Muhammad Ikramullah Khan

Scrapy Rules: A Complete Beginner's Guide (With Real Examples)

If you've been writing Scrapy spiders, you've probably found yourself doing this:

def parse(self, response):
    # Extract data from current page
    yield {'title': 'something'}

    # Find all links
    for link in response.css('a::attr(href)'):
        yield response.follow(link, self.parse)

This works, but there's a problem. You're manually following every single link. What if you only want to follow certain links? What if different types of links need different handling?

This is where Scrapy Rules come in.

Rules let you say "follow this type of link" and "scrape this type of page" without writing tons of repetitive code. They're like setting up traffic rules for your spider.

Let me show you how they work.


What Are Scrapy Rules?

Think of rules like instructions you give your spider:

Rule 1: "When you see a category link, follow it but don't scrape it yet."
Rule 2: "When you see a product link, scrape the product details."
Rule 3: "When you see a pagination link, follow it but ignore everything else."

Instead of writing if statements and loops for every type of link, you define rules once, and Scrapy handles the rest automatically.
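
In code, those three instructions become a rules tuple, roughly like this (the URL patterns and the callback name are hypothetical, chosen just to mirror the wording above; a full spider follows in the next section):

from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

rules = (
    # "When you see a category link, follow it but don't scrape it yet"
    Rule(LinkExtractor(allow=r'/category/'), follow=True),

    # "When you see a product link, scrape the product details"
    Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),

    # "When you see a pagination link, follow it but ignore everything else"
    Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
)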


Regular Spider vs CrawlSpider (With Rules)

Here's a regular spider:

import scrapy

class RegularSpider(scrapy.Spider):
    name = 'regular'
    start_urls = ['https://example.com']

    def parse(self, response):
        # You manually handle everything
        for product in response.css('.product'):
            yield {'name': product.css('h2::text').get()}

        # You manually follow links
        for link in response.css('a.next::attr(href)'):
            yield response.follow(link, self.parse)

Here's the same thing with CrawlSpider and rules:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RulesSpider(CrawlSpider):
    name = 'with_rules'
    start_urls = ['https://example.com']

    rules = (
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
        Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
    )

    def parse_product(self, response):
        yield {'name': response.css('h2::text').get()}

See the difference? With rules, you tell Scrapy what to follow and what to scrape. Scrapy does the rest.


When Should You Use Rules?

Use CrawlSpider with rules when:

  • You need to follow links automatically
  • Different types of links need different handling
  • You're crawling multiple pages (pagination, categories, etc.)
  • You want cleaner, more organized code

Don't use rules when:

  • You're scraping a single page
  • Your scraping logic is very complex and custom
  • You're doing something unusual that rules can't handle

My advice: Start with regular spiders. When you find yourself writing lots of link-following code, switch to rules.


The Basics: Your First Rule

Let's build a spider that scrapes a bookstore. We'll start simple and add complexity.

Basic Setup

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BookSpider(CrawlSpider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    rules = (
        # We'll add rules here
    )

Notice we're using CrawlSpider instead of scrapy.Spider. This is important. You can't use rules with regular spiders.


Rule #1: Following Links

Let's say we want to follow pagination links (next page, page 2, page 3, etc.).

rules = (
    Rule(LinkExtractor(allow=r'catalogue/page'), follow=True),
)

What this does:

  • Looks for any link containing "catalogue/page"
  • Follows those links
  • Doesn't scrape anything (no callback)

follow=True means "follow these links and keep looking for more links on those pages."


Rule #2: Scraping Pages

Now let's scrape individual book pages:

rules = (
    Rule(LinkExtractor(allow=r'catalogue/page'), follow=True),
    Rule(LinkExtractor(allow=r'catalogue/.*_\d+', deny=r'/category/'), callback='parse_book'),
)

def parse_book(self, response):
    yield {
        'title': response.css('h1::text').get(),
        'price': response.css('.price_color::text').get(),
        'rating': response.css('.star-rating::attr(class)').get(),
    }

What this does:

  • First rule: Follow pagination links
  • Second rule: When a link looks like a book page (matches the pattern), scrape it using parse_book. The deny keeps category pages out, since their URLs happen to match the same pattern (deny is explained below)

Important: Don't name your callback parse. CrawlSpider uses that name internally. Use parse_item, parse_product, parse_book, etc.


Understanding LinkExtractor

LinkExtractor is what finds the links. It has several options:

allow (Include Links)

LinkExtractor(allow=r'/product/\d+')

This says "only extract links whose URL matches this pattern." The pattern is a regular expression; the r prefix just makes it a raw Python string so backslashes like \d don't need escaping.

Common patterns:

  • r'/category/' : Any link with /category/ in it
  • r'/product/\d+' : Links like /product/123
  • r'/page-\d+\.html' : Links like /page-1.html
  • r'\.pdf$' : Links ending in .pdf
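
Not sure whether a pattern will match the URLs you care about? You can sanity-check it with Python's re module before putting it in a rule (the URLs below are made up for illustration):

import re

pattern = r'/product/\d+'
urls = ['/product/123', '/product/abc', '/category/shoes']  # hypothetical URLs

for url in urls:
    # allow/deny patterns are searched anywhere in the URL, not matched against the whole thing
    print(url, bool(re.search(pattern, url)))

Only /product/123 matches, which is the behaviour you want from the rule.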

deny (Exclude Links)

LinkExtractor(allow=r'/category/', deny=r'/category/books')

This says "extract category links, but NOT if they contain /category/books."

restrict_css (Only Look in Specific Areas)

LinkExtractor(restrict_css='.product-list')

Only looks for links inside elements with class "product-list". Super useful for avoiding navigation links, footer links, etc.

restrict_xpaths (XPath Version)

LinkExtractor(restrict_xpaths='//div[@class="content"]')

Same as restrict_css but using XPath.


Real Example: E-commerce Site

Let's scrape an e-commerce site with categories, subcategories, and products.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ShopSpider(CrawlSpider):
    name = 'shop'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    rules = (
        # Rule 1: Follow category links (but don't scrape them)
        Rule(
            LinkExtractor(allow=r'/category/'),
            follow=True
        ),

        # Rule 2: Follow pagination (next page, page numbers)
        Rule(
            LinkExtractor(allow=r'/page/\d+'),
            follow=True
        ),

        # Rule 3: Scrape product pages
        Rule(
            LinkExtractor(allow=r'/product/\d+'),
            callback='parse_product'
        ),
    )

    def parse_product(self, response):
        yield {
            'name': response.css('h1.product-name::text').get(),
            'price': response.css('span.price::text').get(),
            'description': response.css('div.description::text').get(),
            'url': response.url,
        }

What happens:

  1. Spider starts at homepage
  2. Finds category links (like /category/electronics), follows them
  3. On category pages, finds pagination links, follows them
  4. On any page, finds product links (like /product/123), scrapes them

All automatic. You just defined the rules.


Advanced: Using Multiple Patterns

You can use lists in allow and deny:

Rule(
    LinkExtractor(
        allow=(r'/product/\d+', r'/item/\d+'),  # Match either pattern
        deny=(r'/product/admin', r'/item/test')  # But exclude these
    ),
    callback='parse_product'
)

The follow Parameter (Important!)

This confuses a lot of beginners. Let me explain:

follow=True

Rule(LinkExtractor(allow=r'/category/'), follow=True)

Means: "Follow these links AND keep looking for more links on those pages."

Use when: You want to keep crawling deeper (categories, pagination).

No follow (or follow=False)

Rule(LinkExtractor(allow=r'/product/'), callback='parse_product')

Means: "Extract data from these pages, but don't follow any more links."

Use when: You've reached your target pages (products, articles).

Important: If you set a callback, follow defaults to False. If you want both, you need to explicitly set follow=True:

Rule(
    LinkExtractor(allow=r'/product/'),
    callback='parse_product',
    follow=True  # Scrape AND keep following links
)

Common Patterns and Examples

Pattern 1: Blog with Categories and Posts

rules = (
    # Follow category links
    Rule(LinkExtractor(allow=r'/category/'), follow=True),

    # Follow pagination
    Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),

    # Scrape individual posts
    Rule(LinkExtractor(allow=r'/\d{4}/\d{2}/'), callback='parse_post'),
)

def parse_post(self, response):
    yield {
        'title': response.css('h1::text').get(),
        'author': response.css('.author::text').get(),
        'date': response.css('.date::text').get(),
        'content': ' '.join(response.css('.content p::text').getall()),
    }

Pattern 2: News Site with Sections

rules = (
    # Follow section links (sports, tech, politics)
    Rule(LinkExtractor(allow=r'/section/'), follow=True),

    # Scrape articles
    Rule(
        LinkExtractor(allow=r'/article/\d+'),
        callback='parse_article',
        follow=False  # Don't follow links inside articles
    ),
)

Pattern 3: Forum with Threads

rules = (
    # Follow forum sections
    Rule(LinkExtractor(allow=r'/forum/\d+'), follow=True),

    # Follow thread pages
    Rule(LinkExtractor(allow=r'/thread/\d+'), follow=True),

    # Scrape individual posts
    Rule(
        LinkExtractor(restrict_css='.post'),
        callback='parse_post'
    ),
)

Restricting Where to Look for Links

This is super important. You don't want to follow navigation links, footer links, or ads.

Use restrict_css

Rule(
    LinkExtractor(
        allow=r'/product/',
        restrict_css='.product-list'  # Only look in this section
    ),
    callback='parse_product'
)

Use restrict_xpaths

Rule(
    LinkExtractor(
        allow=r'/product/',
        restrict_xpaths='//div[@id="content"]'  # Only look in main content
    ),
    callback='parse_product'
)

Use Multiple Restrictions

Rule(
    LinkExtractor(
        allow=r'/article/',
        restrict_css='.article-list',  # Look in article list
        deny=r'/article/sponsored'  # But ignore sponsored articles
    ),
    callback='parse_article'
)

Debugging Your Rules

Your rules aren't working? Here's how to debug:

Step 1: Check What Links Are Being Extracted

Add this to your spider:

def parse_start_url(self, response):
    self.logger.info(f'Starting at: {response.url}')
    for link in response.css('a::attr(href)').getall():
        self.logger.info(f'Found link: {link}')
    return super().parse_start_url(response)

Step 2: Test LinkExtractor in Shell

scrapy shell "https://example.com"

Then test your extractor:

>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(allow=r'/product/')
>>> links = le.extract_links(response)
>>> for link in links:
...     print(link.url)

Step 3: Add Logging to Callbacks

def parse_product(self, response):
    self.logger.info(f'Scraping product: {response.url}')
    yield {
        'name': response.css('h1::text').get(),
    }

Common Mistakes (And How to Fix Them)

Mistake 1: Using "parse" as Callback

# WRONG
rules = (
    Rule(LinkExtractor(allow=r'/product/'), callback='parse'),
)

# RIGHT
rules = (
    Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
)

CrawlSpider uses parse internally. Don't override it.

Mistake 2: Forgetting follow=True

# WRONG (won't keep crawling)
rules = (
    Rule(LinkExtractor(allow=r'/category/'), callback='parse_category'),
)

# RIGHT (keeps crawling)
rules = (
    Rule(LinkExtractor(allow=r'/category/'), callback='parse_category', follow=True),
)

Mistake 3: Too Broad Patterns

# WRONG (matches too much)
rules = (
    Rule(LinkExtractor(allow=r'/'), callback='parse_page'),
)

# RIGHT (specific pattern)
rules = (
    Rule(LinkExtractor(allow=r'/product/\d+'), callback='parse_product'),
)

Mistake 4: Not Using restrict_css/restrict_xpaths

# WRONG (follows navigation, footer, etc.)
rules = (
    Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
)

# RIGHT (only follows links in product listings)
rules = (
    Rule(
        LinkExtractor(allow=r'/product/', restrict_css='.product-grid'),
        callback='parse_product'
    ),
)

Rule Order Matters

Rules are checked in order, and each extracted link is handled only by the first rule that matches it.

rules = (
    # This runs first
    Rule(LinkExtractor(allow=r'/product/special'), callback='parse_special'),

    # This runs second (for products that aren't special)
    Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
)

If a link matches the first rule, the second rule never runs for that link.
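
To see why this matters, note that a URL like /product/special/42 (a made-up example) matches both patterns, so whichever rule is listed first claims it:

import re

url = 'https://example.com/product/special/42'  # hypothetical URL

print(bool(re.search(r'/product/special', url)))  # True: first rule matches
print(bool(re.search(r'/product/', url)))         # True: would match too, but never gets the chance

If the two rules were swapped, special product pages would end up in the generic parse_product callback instead of parse_special.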


Real-World Example: Complete Bookstore Spider

Let me show you a complete, working example:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BookstoreSpider(CrawlSpider):
    name = 'bookstore'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    rules = (
        # Rule 1: Follow category links
        Rule(
            LinkExtractor(
                allow=r'catalogue/category/',
                restrict_css='.side_categories'
            ),
            follow=True
        ),

        # Rule 2: Follow pagination
        Rule(
            LinkExtractor(
                allow=r'catalogue/page',
                restrict_css='.pager'
            ),
            follow=True
        ),

        # Rule 3: Scrape individual books
        Rule(
            LinkExtractor(
                allow=r'catalogue/.+_\d+/index\.html',
                restrict_css='.product_pod'
            ),
            callback='parse_book',
            follow=False
        ),
    )

    def parse_book(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price_color::text').get(),
            'availability': response.css('.availability::text').getall()[1].strip(),
            'rating': response.css('.star-rating::attr(class)').get().split()[-1],
            'description': response.css('#product_description + p::text').get(),
            'category': response.css('.breadcrumb li:nth-child(3) a::text').get(),
            'url': response.url,
        }

Run it:

scrapy crawl bookstore -o books.json

This will:

  1. Start at the homepage
  2. Follow all category links
  3. Follow pagination on each category
  4. Scrape every book it finds
  5. Save everything to books.json

All automatic!


When NOT to Use Rules

Rules are great, but sometimes they're not the right choice:

Don't use rules when:

  1. Your logic is too complex
    If you need lots of conditional logic, custom request handling, or complex state management, a regular spider is clearer.

  2. You're scraping a single page
    Rules are for following links. If you're not following links, you don't need them.

  3. You need fine-grained control
    Sometimes you need to inspect each link before deciding to follow it. Rules are all-or-nothing.

  4. The site structure is unpredictable
    If the site doesn't follow consistent URL patterns, rules won't help.


Rules vs Manual Link Following

Use rules when:

  • URL patterns are consistent
  • You're crawling many pages
  • Different link types need different handling
  • You want cleaner code

Use manual following when:

  • You need custom logic for each link
  • You need to pass data between pages (see the sketch after this list)
  • You're doing something unusual
  • The learning curve of rules isn't worth it
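
To make the "pass data between pages" point concrete, here's a minimal sketch of manual following with response.follow and cb_kwargs (the selectors and field names are hypothetical). Rules don't give you an obvious hook for carrying data from a listing page over to a detail page like this:

import scrapy

class ManualSpider(scrapy.Spider):
    name = 'manual'
    start_urls = ['https://example.com/products']  # hypothetical listing page

    def parse(self, response):
        for row in response.css('.product-row'):
            url = row.css('a::attr(href)').get()
            # Grab data from the listing page and hand it to the detail-page callback
            yield response.follow(
                url,
                callback=self.parse_product,
                cb_kwargs={'listing_price': row.css('.price::text').get()},
            )

    def parse_product(self, response, listing_price):
        yield {
            'name': response.css('h1::text').get(),
            'listing_price': listing_price,
            'detail_price': response.css('.price::text').get(),
        }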

Quick Reference

Basic Rule Structure

Rule(
    LinkExtractor(
        allow=r'pattern',           # Optional: What links to extract (all links if omitted)
        deny=r'pattern',            # Optional: What links to ignore
        restrict_css='.class',      # Optional: Where to look
        restrict_xpaths='//div',    # Optional: Where to look (XPath)
    ),
    callback='parse_item',          # Optional: Function to call
    follow=True,                    # Optional: Keep following links
)

Common Patterns

# Pagination
Rule(LinkExtractor(allow=r'/page/\d+'), follow=True)

# Product pages
Rule(LinkExtractor(allow=r'/product/\d+'), callback='parse_product')

# Blog posts by date
Rule(LinkExtractor(allow=r'/\d{4}/\d{2}/\d{2}/'), callback='parse_post')

# Categories
Rule(LinkExtractor(allow=r'/category/'), follow=True)

Final Thoughts

Rules make your Scrapy code cleaner and more maintainable. Instead of writing loops and if statements, you define patterns once and let Scrapy handle the rest.

Start simple. Add one rule. Test it. Then add another rule. Don't try to write all your rules at once.

And remember: you don't always need rules. Sometimes a regular spider with manual link following is clearer and simpler. Use the right tool for the job.

Now go build something!


Questions? Drop a comment below. Rules can be confusing at first, but once you get them, they're incredibly powerful.

Happy scraping! 🕷️
