Muhammad Ikramullah Khan

Scrapy Spider Types: Spider vs CrawlSpider vs SitemapSpider (When to Use What)

When I started with Scrapy, I only used the basic Spider class for everything. I'd manually write pagination logic, manually follow category links, manually handle sitemaps.

Then I discovered CrawlSpider. Suddenly, pagination and link following became automatic. My code got shorter and cleaner.

Later, I found SitemapSpider. For sites with sitemaps, it was even simpler than CrawlSpider.

Each spider type has its purpose. Let me show you when to use which.


The Three Spider Types

Spider (Basic Spider)

  • Manual control over everything
  • You write all the logic
  • Most flexible, most code

CrawlSpider (Rule-Based Spider)

  • Automatically follows links based on rules
  • Less code, less control
  • Perfect for structured sites

SitemapSpider (Sitemap-Based Spider)

  • Automatically crawls from sitemap.xml
  • Minimal code, minimal control
  • Perfect when sitemaps exist

Spider: The Basic One (Full Control)

This is what you've been using. You control everything manually.

When to Use

  • You need complete control
  • Site structure is complex or unusual
  • You're learning Scrapy
  • You need custom logic for each page type

Basic Example

import scrapy

class BasicSpider(scrapy.Spider):
    name = 'basic'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Scrape products
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

        # Follow pagination manually
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

        # Follow category links manually
        for category in response.css('.category a::attr(href)').getall():
            yield response.follow(category, callback=self.parse_category)

    def parse_category(self, response):
        # Different handling for category pages
        for product in response.css('.product-list .item'):
            yield response.follow(
                product.css('a::attr(href)').get(),
                callback=self.parse_product
            )

    def parse_product(self, response):
        # Detailed product scraping
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').get()
        }

Characteristics

Pros:

  • Complete control over logic
  • Can handle any site structure
  • Easy to understand
  • Easy to debug

Cons:

  • More code
  • Manual pagination handling
  • Manual link following
  • Easy to make mistakes

CrawlSpider: The Rule-Based One (Automatic Link Following)

CrawlSpider uses rules to follow links automatically. You define the URL patterns; Scrapy handles the rest.

When to Use

  • Site has clear URL patterns
  • You want automatic link following
  • Pagination is straightforward
  • Category/product structure is consistent

Basic Example

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AutoSpider(CrawlSpider):
    name = 'auto'
    start_urls = ['https://example.com/products']

    rules = (
        # Rule 1: Follow category links
        Rule(
            LinkExtractor(allow=r'/category/'),
            follow=True  # Follow but don't scrape
        ),

        # Rule 2: Follow pagination
        Rule(
            LinkExtractor(allow=r'/page/\d+'),
            follow=True
        ),

        # Rule 3: Scrape product pages
        Rule(
            LinkExtractor(allow=r'/product/\d+'),
            callback='parse_product'
        ),
    )

    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get()
        }

What the Docs Don't Tell You

1. Rules are processed in order

If more than one rule matches a link, only the first matching rule is used, so put specific rules before general ones:

rules = (
    # Specific rule first
    Rule(LinkExtractor(allow=r'/product/special/'), callback='parse_special'),

    # General rule second
    Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
)

2. parse() method is reserved

Don't override parse() in a CrawlSpider: the class uses it internally to drive its rules, and overriding it breaks link following. To handle the responses from start_urls, use parse_start_url() instead:

class MyCrawlSpider(CrawlSpider):
    # DON'T do this
    def parse(self, response):
        pass

    # DO this instead
    def parse_start_url(self, response):
        # Handle start_urls differently
        return self.parse_product(response)

3. follow=True vs callback

# Follow links but don't scrape (follow defaults to True when no callback is set)
Rule(LinkExtractor(allow=r'/category/'), follow=True)

# Scrape but don't follow further (follow defaults to False when a callback is set)
Rule(LinkExtractor(allow=r'/product/'), callback='parse_product')

# Scrape AND follow further links
Rule(
    LinkExtractor(allow=r'/product/'),
    callback='parse_product',
    follow=True  # Must be explicit when a callback is set
)

Advanced CrawlSpider

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AdvancedSpider(CrawlSpider):
    name = 'advanced'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    rules = (
        # Follow categories (multiple levels)
        Rule(
            LinkExtractor(
                allow=r'/category/',
                restrict_css='.navigation'  # Only in navigation
            ),
            follow=True,
            process_links='process_category_links'  # Custom processing
        ),

        # Follow pagination
        Rule(
            LinkExtractor(
                allow=r'/page/\d+',
                restrict_css='.pagination'
            ),
            follow=True
        ),

        # Scrape products
        Rule(
            LinkExtractor(
                allow=r'/product/\d+',
                deny=r'/product/\d+/reviews'  # Exclude reviews pages
            ),
            callback='parse_product',
            cb_kwargs={'product_type': 'regular'}  # Pass extra data
        ),

        # Scrape special products differently
        Rule(
            LinkExtractor(allow=r'/special-offer/\d+'),
            callback='parse_special_product',
            cb_kwargs={'product_type': 'special'}
        ),
    )

    def process_category_links(self, links):
        # Custom link processing
        for link in links:
            # Modify URLs, filter links, etc.
            if 'old-category' not in link.url:
                yield link

    def parse_product(self, response, product_type):
        yield {
            'type': product_type,
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get()
        }

    def parse_special_product(self, response, product_type):
        yield {
            'type': product_type,
            'name': response.css('h1::text').get(),
            'price': response.css('.special-price::text').get(),
            'original_price': response.css('.original-price::text').get()
        }

Characteristics

Pros:

  • Much less code
  • Automatic link following
  • Declarative (rules are clear)
  • Perfect for structured sites

Cons:

  • Less flexible than basic Spider
  • Learning curve for rules
  • Can't do complex per-page logic easily
  • Debugging rules is harder

SitemapSpider: The Sitemap-Based One (Easiest of All)

If a site has a sitemap.xml, SitemapSpider is the easiest option.

When to Use

  • Site has sitemap.xml
  • You want to scrape all pages listed
  • Site structure doesn't matter
  • Fastest way to crawl large sites

Basic Example

from scrapy.spiders import SitemapSpider

class MySitemapSpider(SitemapSpider):
    name = 'sitemap'
    sitemap_urls = ['https://example.com/sitemap.xml']

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('h1::text').get()
        }

That's it! Scrapy:

  1. Downloads sitemap.xml
  2. Extracts all the URLs it lists
  3. Requests each one
  4. Calls parse() with each response

Multiple Sitemaps

class MultipleSitemapSpider(SitemapSpider):
    name = 'multiple'
    sitemap_urls = [
        'https://example.com/sitemap.xml',
        'https://example.com/sitemap-products.xml',
        'https://example.com/sitemap-articles.xml'
    ]

    def parse(self, response):
        yield {'url': response.url}

Sitemap Rules (Filter URLs)

Only scrape the sitemap URLs that match certain patterns, routing each pattern to its own callback:

class FilteredSitemapSpider(SitemapSpider):
    name = 'filtered'
    sitemap_urls = ['https://example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),  # Product URLs
        ('/article/', 'parse_article'),  # Article URLs
    ]

    def parse_product(self, response):
        yield {
            'type': 'product',
            'name': response.css('h1::text').get()
        }

    def parse_article(self, response):
        yield {
            'type': 'article',
            'title': response.css('h1::text').get()
        }

Follow Sitemap Index

Some sites have a sitemap index (sitemap of sitemaps):

class IndexSpider(SitemapSpider):
    name = 'index'
    sitemap_urls = ['https://example.com/sitemap-index.xml']
    sitemap_follow = ['/sitemap-products']  # Only follow product sitemaps

    def parse(self, response):
        yield {'url': response.url}

Alternate URLs (Multilingual Sites)

class MultilingualSpider(SitemapSpider):
    name = 'multilingual'
    sitemap_urls = ['https://example.com/sitemap.xml']
    sitemap_alternate_links = True  # Follow alternate language URLs

    def parse(self, response):
        yield {
            'url': response.url,
            'language': response.url.split('/')[3]  # Language code, assumes URLs like https://example.com/en/page
        }

What the Docs Don't Tell You

1. Not all sitemaps are at /sitemap.xml

Check robots.txt for the actual location:

https://example.com/robots.txt

Look for:

Sitemap: https://example.com/actual-sitemap.xml
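
To check programmatically, here's a minimal sketch using only the Python standard library (example.com is a placeholder; swap in the domain you're targeting):

import urllib.request

def find_sitemaps(domain):
    """Return the Sitemap: URLs declared in a site's robots.txt."""
    with urllib.request.urlopen(f'https://{domain}/robots.txt') as resp:
        body = resp.read().decode('utf-8', errors='ignore')
    return [
        line.split(':', 1)[1].strip()
        for line in body.splitlines()
        if line.lower().startswith('sitemap:')
    ]

print(find_sitemaps('example.com'))  # e.g. ['https://example.com/actual-sitemap.xml']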

2. Large sitemaps might be compressed

sitemap_urls = ['https://example.com/sitemap.xml.gz']  # Works fine

3. CrawlSpider rules don't work on a SitemapSpider

SitemapSpider doesn't process a rules attribute (that's CrawlSpider machinery), so adding one is silently ignored. If you want to follow extra links from the scraped pages, one option is to extract them manually in the callback:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import SitemapSpider

class HybridSpider(SitemapSpider):
    name = 'hybrid'
    sitemap_urls = ['https://example.com/sitemap.xml']
    related_extractor = LinkExtractor(allow=r'/related/')

    def parse(self, response):
        yield {'url': response.url}
        # Follow related-page links found on each scraped page
        for link in self.related_extractor.extract_links(response):
            yield response.follow(link, callback=self.parse)

Characteristics

Pros:

  • Minimal code
  • Very fast (no link discovery needed)
  • Covers every page listed in the sitemap
  • Perfect for large sites

Cons:

  • Only works if sitemap exists
  • No control over crawl order
  • URL filtering is limited to the sitemap_rules patterns
  • Less flexible

Choosing the Right Spider Type

Use Spider When:

  • Site structure is complex
  • You need complete control
  • Custom logic per page type
  • You're learning Scrapy
  • Site is unusual

Example:

Site with dynamic navigation, AJAX-loaded content, 
or complex multi-step workflows

Use CrawlSpider When:

  • Clear URL patterns exist
  • Automatic link following is needed
  • Site structure is consistent
  • You want less code

Example:

E-commerce site with categories → subcategories → products
News site with sections → articles

Use SitemapSpider When:

  • Site has sitemap.xml
  • You want to scrape all pages
  • Fastest crawl needed
  • Site structure doesn't matter

Example:

Large content sites (WordPress, Drupal)
E-commerce with good SEO
Any site that publishes sitemaps

Real-World Comparison

Let's scrape the same site with all three approaches:

Site Structure

Homepage
├── Category: Electronics
│   ├── Product: Laptop
│   ├── Product: Phone
│   └── Page 2 → More products
└── Category: Books
    └── Product: Novel

Approach 1: Basic Spider

class BasicSpider(scrapy.Spider):
    name = 'basic'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Follow categories
        for cat in response.css('.category a'):
            yield response.follow(cat, self.parse_category)

    def parse_category(self, response):
        # Scrape products
        for prod in response.css('.product a'):
            yield response.follow(prod, self.parse_product)

        # Pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse_category)

    def parse_product(self, response):
        yield {'name': response.css('h1::text').get()}

Lines of code: ~20

Approach 2: CrawlSpider

class CrawlSpiderVersion(CrawlSpider):
    name = 'crawl'
    start_urls = ['https://example.com']

    rules = (
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
    )

    def parse_product(self, response):
        yield {'name': response.css('h1::text').get()}

Lines of code: ~12

Approach 3: SitemapSpider

class SitemapVersion(SitemapSpider):
    name = 'sitemap'
    sitemap_urls = ['https://example.com/sitemap.xml']
    sitemap_rules = [('/product/', 'parse_product')]

    def parse_product(self, response):
        yield {'name': response.css('h1::text').get()}

Lines of code: ~7


Converting Between Spider Types

From Spider to CrawlSpider

Before:

class MySpider(scrapy.Spider):
    def parse(self, response):
        for link in response.css('a.product'):
            yield response.follow(link, self.parse_product)

After:

class MySpider(CrawlSpider):
    rules = (
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
    )

From CrawlSpider to Spider

Sometimes you need more control. Just convert rules to manual logic.
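
For example, the product rule from the previous snippet could become explicit link extraction in parse() (a minimal sketch, assuming the same /product/ URL pattern):

import scrapy
from scrapy.linkextractors import LinkExtractor

class ManualSpider(scrapy.Spider):
    name = 'manual'
    start_urls = ['https://example.com']

    # The same pattern the Rule used, now applied by hand
    product_extractor = LinkExtractor(allow=r'/product/')

    def parse(self, response):
        # You decide per page which links to follow; that's the extra control
        for link in self.product_extractor.extract_links(response):
            yield response.follow(link, callback=self.parse_product)

    def parse_product(self, response):
        yield {'name': response.css('h1::text').get()}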


Mixing Approaches

You can combine spider types:

class HybridSpider(CrawlSpider):
    name = 'hybrid'
    start_urls = ['https://example.com']

    # CrawlSpider rules for most pages
    rules = (
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
    )

    # Custom logic for start URLs
    def parse_start_url(self, response):
        # Special handling for homepage
        featured = response.css('.featured-product a')
        for link in featured:
            yield response.follow(link, self.parse_featured)

    def parse_product(self, response):
        yield {'name': response.css('h1::text').get()}

    def parse_featured(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'featured': True
        }

Quick Decision Tree

Does site have sitemap.xml?
├─ Yes → Use SitemapSpider
└─ No
   │
   Does site have clear URL patterns?
   ├─ Yes → Use CrawlSpider
   └─ No → Use basic Spider

Summary

Spider (Basic):

  • Full control
  • Most code
  • Use when: complex sites, learning, custom logic

CrawlSpider (Rules):

  • Automatic link following
  • Less code
  • Use when: clear patterns, structured sites

SitemapSpider (Sitemap):

  • Minimal code
  • Very fast
  • Use when: sitemap exists, want all pages

Start with: Basic Spider (learn fundamentals)
Graduate to: CrawlSpider (save time)
Use when available: SitemapSpider (fastest)

Don't overthink it. Start with what you know and upgrade when needed.

Happy scraping! 🕷️
