When I started with Scrapy, I only used the basic Spider class for everything. I'd manually write pagination logic, manually follow category links, manually handle sitemaps.
Then I discovered CrawlSpider. Suddenly, pagination and link following became automatic. My code got shorter and cleaner.
Later, I found SitemapSpider. For sites with sitemaps, it was even simpler than CrawlSpider.
Each spider type has its purpose. Let me show you when to use which.
The Three Spider Types
Spider (Basic Spider)
- Manual control over everything
- You write all the logic
- Most flexible, most code
CrawlSpider (Rule-Based Spider)
- Automatically follows links based on rules
- Less code, less control
- Perfect for structured sites
SitemapSpider (Sitemap-Based Spider)
- Automatically crawls from sitemap.xml
- Minimal code, minimal control
- Perfect when sitemaps exist
Spider: The Basic One (Full Control)
This is what you've been using. You control everything manually.
When to Use
- You need complete control
- Site structure is complex or unusual
- You're learning Scrapy
- You need custom logic for each page type
Basic Example
import scrapy

class BasicSpider(scrapy.Spider):
    name = 'basic'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Scrape products
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

        # Follow pagination manually
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

        # Follow category links manually
        for category in response.css('.category a::attr(href)').getall():
            yield response.follow(category, callback=self.parse_category)

    def parse_category(self, response):
        # Different handling for category pages
        for product in response.css('.product-list .item'):
            yield response.follow(
                product.css('a::attr(href)').get(),
                callback=self.parse_product
            )

    def parse_product(self, response):
        # Detailed product scraping
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').get()
        }
Characteristics
Pros:
- Complete control over logic
- Can handle any site structure
- Easy to understand
- Easy to debug
Cons:
- More code
- Manual pagination handling
- Manual link following
- Easy to make mistakes
CrawlSpider: The Rule-Based One (Automatic Link Following)
CrawlSpider uses rules to automatically follow links. You define patterns, Scrapy handles the rest.
When to Use
- Site has clear URL patterns
- You want automatic link following
- Pagination is straightforward
- Category/product structure is consistent
Basic Example
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AutoSpider(CrawlSpider):
    name = 'auto'
    start_urls = ['https://example.com/products']

    rules = (
        # Rule 1: Follow category links
        Rule(
            LinkExtractor(allow=r'/category/'),
            follow=True  # Follow but don't scrape
        ),
        # Rule 2: Follow pagination
        Rule(
            LinkExtractor(allow=r'/page/\d+'),
            follow=True
        ),
        # Rule 3: Scrape product pages
        Rule(
            LinkExtractor(allow=r'/product/\d+'),
            callback='parse_product'
        ),
    )

    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get()
        }
What the Docs Don't Tell You
1. Rules are processed in order
First matching rule wins:
rules = (
    # Specific rule first
    Rule(LinkExtractor(allow=r'/product/special/'), callback='parse_special'),
    # General rule second
    Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
)
2. parse() method is reserved
Don't override parse() in CrawlSpider. It's used internally. Use parse_start_url() instead:
class MyCrawlSpider(CrawlSpider):
    # DON'T do this: overriding parse() breaks CrawlSpider's rule handling
    def parse(self, response):
        pass

    # DO this instead
    def parse_start_url(self, response):
        # Handle start_urls differently
        return self.parse_product(response)
3. follow=True vs callback
# Follow links but don't scrape
Rule(LinkExtractor(allow=r'/category/'), follow=True)

# Scrape but don't follow further (the default when a callback is set)
Rule(LinkExtractor(allow=r'/product/'), callback='parse_product')

# Scrape AND follow further links
Rule(
    LinkExtractor(allow=r'/product/'),
    callback='parse_product',
    follow=True  # Explicit, since follow defaults to False once a callback is given
)
Advanced CrawlSpider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AdvancedSpider(CrawlSpider):
    name = 'advanced'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    rules = (
        # Follow categories (multiple levels)
        Rule(
            LinkExtractor(
                allow=r'/category/',
                restrict_css='.navigation'  # Only extract links inside the navigation
            ),
            follow=True,
            process_links='process_category_links'  # Custom link processing
        ),
        # Follow pagination
        Rule(
            LinkExtractor(
                allow=r'/page/\d+',
                restrict_css='.pagination'
            ),
            follow=True
        ),
        # Scrape products
        Rule(
            LinkExtractor(
                allow=r'/product/\d+',
                deny=r'/product/\d+/reviews'  # Exclude review pages
            ),
            callback='parse_product',
            cb_kwargs={'product_type': 'regular'}  # Pass extra data to the callback
        ),
        # Scrape special products differently
        Rule(
            LinkExtractor(allow=r'/special-offer/\d+'),
            callback='parse_special_product',
            cb_kwargs={'product_type': 'special'}
        ),
    )

    def process_category_links(self, links):
        # Custom link processing: modify URLs, filter links, etc.
        for link in links:
            if 'old-category' not in link.url:
                yield link

    def parse_product(self, response, product_type):
        yield {
            'type': product_type,
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get()
        }

    def parse_special_product(self, response, product_type):
        yield {
            'type': product_type,
            'name': response.css('h1::text').get(),
            'price': response.css('.special-price::text').get(),
            'original_price': response.css('.original-price::text').get()
        }
Characteristics
Pros:
- Much less code
- Automatic link following
- Declarative (rules are clear)
- Perfect for structured sites
Cons:
- Less flexible than basic Spider
- Learning curve for rules
- Can't do complex per-page logic easily
- Debugging rules is harder (a quick way to test them in the shell is sketched below)
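One way to take the guesswork out of rules is to test each LinkExtractor on its own in the Scrapy shell before wiring it into a Rule. A minimal sketch, run inside scrapy shell against a page from your own target site (the URL and the /product/ pattern are placeholders):

# Start the shell with: scrapy shell https://example.com/category/electronics
# (placeholder URL; use a real page from the site you're scraping)
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow=r'/product/\d+')
for link in le.extract_links(response):  # `response` is provided by the shell
    print(link.url)

If the printed URLs aren't what you expect, adjust allow, deny, or restrict_css before touching the spider itself.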
SitemapSpider: The Sitemap-Based One (Easiest of All)
If a site has a sitemap.xml, SitemapSpider is the easiest option.
When to Use
- Site has sitemap.xml
- You want to scrape all pages listed
- Site structure doesn't matter
- Fastest way to crawl large sites
Basic Example
from scrapy.spiders import SitemapSpider

class MySitemapSpider(SitemapSpider):
    name = 'sitemap'
    sitemap_urls = ['https://example.com/sitemap.xml']

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('h1::text').get()
        }
That's it! Scrapy:
- Downloads sitemap.xml
- Extracts all URLs
- Scrapes each one
- Calls parse() for each page
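If you'd rather run it from a plain script than the scrapy crawl command, CrawlerProcess does the same job. A minimal sketch, assuming the MySitemapSpider class above is in scope; the FEEDS setting and file name are just one way to write the items out:

from scrapy.crawler import CrawlerProcess

# Export scraped items to a JSON file (file name is an example)
process = CrawlerProcess(settings={
    'FEEDS': {'pages.json': {'format': 'json'}},
})
process.crawl(MySitemapSpider)
process.start()  # Blocks until the crawl finishes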
Multiple Sitemaps
class MultipleSitemapSpider(SitemapSpider):
    name = 'multiple'
    sitemap_urls = [
        'https://example.com/sitemap.xml',
        'https://example.com/sitemap-products.xml',
        'https://example.com/sitemap-articles.xml'
    ]

    def parse(self, response):
        yield {'url': response.url}
Sitemap Rules (Filter URLs)
Only scrape certain URLs from sitemap:
class FilteredSitemapSpider(SitemapSpider):
    name = 'filtered'
    sitemap_urls = ['https://example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),  # Product URLs
        ('/article/', 'parse_article'),  # Article URLs
    ]

    def parse_product(self, response):
        yield {
            'type': 'product',
            'name': response.css('h1::text').get()
        }

    def parse_article(self, response):
        yield {
            'type': 'article',
            'title': response.css('h1::text').get()
        }
Follow Sitemap Index
Some sites have a sitemap index (sitemap of sitemaps):
class IndexSpider(SitemapSpider):
    name = 'index'
    sitemap_urls = ['https://example.com/sitemap-index.xml']
    sitemap_follow = ['/sitemap-products']  # Only follow product sitemaps

    def parse(self, response):
        yield {'url': response.url}
Alternate URLs (Multilingual Sites)
class MultilingualSpider(SitemapSpider):
    name = 'multilingual'
    sitemap_urls = ['https://example.com/sitemap.xml']
    sitemap_alternate_links = True  # Also follow alternate-language URLs

    def parse(self, response):
        yield {
            'url': response.url,
            'language': response.url.split('/')[3]  # Extract language code from the path
        }
What the Docs Don't Tell You
1. Not all sitemaps are at /sitemap.xml
Check robots.txt for the actual location:
https://example.com/robots.txt
Look for:
Sitemap: https://example.com/actual-sitemap.xml
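You don't even have to copy the URL out by hand: sitemap_urls can point straight at robots.txt, and Scrapy will pull the Sitemap: entries from it. A minimal sketch (the domain is a placeholder):

from scrapy.spiders import SitemapSpider

class RobotsSitemapSpider(SitemapSpider):
    name = 'robots_sitemap'
    # Pointing at robots.txt works: Scrapy extracts the Sitemap: lines listed there
    sitemap_urls = ['https://example.com/robots.txt']

    def parse(self, response):
        yield {'url': response.url}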
2. Large sitemaps might be compressed
sitemap_urls = ['https://example.com/sitemap.xml.gz'] # Works fine
3. SitemapSpider ignores CrawlSpider rules
SitemapSpider only understands sitemap_rules, sitemap_follow, and sitemap_alternate_links — a CrawlSpider-style rules attribute on a SitemapSpider subclass is never processed. If you need to follow extra links from the pages you land on, do it manually in your callback:
class HybridSitemapSpider(SitemapSpider):
    name = 'hybrid_sitemap'
    sitemap_urls = ['https://example.com/sitemap.xml']

    def parse(self, response):
        yield {'url': response.url}
        # Follow related links by hand; CrawlSpider Rules won't fire here
        for href in response.css('a[href*="/related/"]::attr(href)').getall():
            yield response.follow(href, callback=self.parse)
Characteristics
Pros:
- Minimal code
- Very fast (no link discovery needed)
- Reliably covers every page the site lists in its sitemap
- Perfect for large sites
Cons:
- Only works if sitemap exists
- No control over crawl order
- Filtering is limited to sitemap_rules and sitemap_follow
- Less flexible
Choosing the Right Spider Type
Use Spider When:
- Site structure is complex
- You need complete control
- Custom logic per page type
- You're learning Scrapy
- Site is unusual
Example:
Site with dynamic navigation, AJAX-loaded content,
or complex multi-step workflows
Use CrawlSpider When:
- Clear URL patterns exist
- Automatic link following is needed
- Site structure is consistent
- You want less code
Example:
E-commerce site with categories → subcategories → products
News site with sections → articles
Use SitemapSpider When:
- Site has sitemap.xml
- You want to scrape all pages
- Fastest crawl needed
- Site structure doesn't matter
Example:
Large content sites (WordPress, Drupal)
E-commerce with good SEO
Any site that publishes sitemaps
Real-World Comparison
Let's scrape the same site with all three approaches:
Site Structure
Homepage
├── Category: Electronics
│ ├── Product: Laptop
│ ├── Product: Phone
│ └── Page 2 → More products
└── Category: Books
└── Product: Novel
Approach 1: Basic Spider
class BasicSpider(scrapy.Spider):
    name = 'basic'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Follow categories
        for cat in response.css('.category a'):
            yield response.follow(cat, self.parse_category)

    def parse_category(self, response):
        # Scrape products
        for prod in response.css('.product a'):
            yield response.follow(prod, self.parse_product)

        # Pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse_category)

    def parse_product(self, response):
        yield {'name': response.css('h1::text').get()}
Lines of code: ~20
Approach 2: CrawlSpider
class CrawlSpiderVersion(CrawlSpider):
    name = 'crawl'
    start_urls = ['https://example.com']

    rules = (
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
    )

    def parse_product(self, response):
        yield {'name': response.css('h1::text').get()}
Lines of code: ~12
Approach 3: SitemapSpider
class SitemapVersion(SitemapSpider):
    name = 'sitemap'
    sitemap_urls = ['https://example.com/sitemap.xml']
    sitemap_rules = [('/product/', 'parse_product')]

    def parse_product(self, response):
        yield {'name': response.css('h1::text').get()}
Lines of code: ~7
Converting Between Spider Types
From Spider to CrawlSpider
Before:
class MySpider(scrapy.Spider):
    def parse(self, response):
        for link in response.css('a.product'):
            yield response.follow(link, self.parse_product)
After:
class MySpider(CrawlSpider):
    rules = (
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
    )
From CrawlSpider to Spider
Sometimes you need more control. Just convert rules to manual logic.
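As a rough sketch, a rule like Rule(LinkExtractor(allow=r'/product/'), callback='parse_product') becomes a manual loop over the matching links; the selector and pattern here are placeholders:

import scrapy

class ManualSpider(scrapy.Spider):
    name = 'manual'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Manual equivalent of a /product/ rule: filter links yourself, then follow them
        for href in response.css('a::attr(href)').getall():
            if '/product/' in href:
                yield response.follow(href, callback=self.parse_product)

    def parse_product(self, response):
        yield {'name': response.css('h1::text').get()}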
Mixing Approaches
You can combine rule-based crawling with custom per-page logic:
class HybridSpider(CrawlSpider):
    name = 'hybrid'
    start_urls = ['https://example.com']

    # CrawlSpider rules for most pages
    rules = (
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
    )

    # Custom logic for start URLs
    def parse_start_url(self, response):
        # Special handling for the homepage
        featured = response.css('.featured-product a')
        for link in featured:
            yield response.follow(link, self.parse_featured)

    def parse_product(self, response):
        yield {'name': response.css('h1::text').get()}

    def parse_featured(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'featured': True
        }
Quick Decision Tree
Does the site have a sitemap.xml?
├─ Yes → Use SitemapSpider
└─ No → Does the site have clear URL patterns?
         ├─ Yes → Use CrawlSpider
         └─ No  → Use the basic Spider
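To answer the first question quickly, you can peek at robots.txt for Sitemap: entries. A small helper sketch (the domain is a placeholder, and it assumes robots.txt is reachable over HTTPS — not every site lists its sitemaps there):

import urllib.request

def find_sitemaps(domain):
    # Read robots.txt and collect any Sitemap: entries (a quick heuristic check)
    url = f'https://{domain}/robots.txt'
    with urllib.request.urlopen(url, timeout=10) as resp:
        text = resp.read().decode('utf-8', errors='replace')
    return [line.split(':', 1)[1].strip()
            for line in text.splitlines()
            if line.lower().startswith('sitemap:')]

print(find_sitemaps('example.com'))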
Summary
Spider (Basic):
- Full control
- Most code
- Use when: complex sites, learning, custom logic
CrawlSpider (Rules):
- Automatic link following
- Less code
- Use when: clear patterns, structured sites
SitemapSpider (Sitemap):
- Minimal code
- Very fast
- Use when: sitemap exists, want all pages
Start with: Basic Spider (learn fundamentals)
Graduate to: CrawlSpider (save time)
Use when available: SitemapSpider (fastest)
Don't overthink it. Start with what you know and upgrade when needed.
Happy scraping! 🕷️