Muhammad Ikramullah Khan

Handling Pagination in Scrapy: Scrape Every Page Without Breaking

The first time I tried scraping a paginated site, I only got the first page. I knew there were 50 pages, but my spider stopped after page 1.

I didn't know about response.follow() or pagination patterns. I was manually building URLs and getting it wrong.

Once I learned the patterns, scraping multi-page sites became trivial. Let me walk through every common pagination pattern and how to handle each one.


Why Pagination Matters

Most websites split content across pages:

  • Product listings (page 1, 2, 3...)
  • Search results
  • Blog archives
  • Category pages

If you don't handle pagination, you:

  • Only scrape the first page
  • Miss most of the data
  • End up with incomplete results

Pattern 1: Next Button (Most Common)

Website has a "Next" button linking to the next page.

HTML Example

<a class="next" href="/products?page=2">Next</a>

Spider Code

import scrapy

class NextButtonSpider(scrapy.Spider):
    name = 'next_button'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Scrape items on current page
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

        # Follow next page link
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Key points:

  • response.follow() handles relative URLs automatically (see the response.urljoin() sketch below)
  • Check if next_page exists before following
  • Use same callback (self.parse) to repeat on next page
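
If you ever need to build the request yourself, response.urljoin() gives you the same relative-URL handling that response.follow() applies for you. A minimal equivalent sketch, using the same hypothetical selector as above:

# Equivalent to: yield response.follow(next_page, callback=self.parse)
next_page = response.css('.next::attr(href)').get()
if next_page:
    absolute_url = response.urljoin(next_page)  # '/products?page=2' -> full absolute URL
    yield scrapy.Request(absolute_url, callback=self.parse)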

What the Docs Don't Tell You

Different "Next" selectors:

# Try multiple selectors
next_page = response.css('.next::attr(href)').get()
next_page = next_page or response.css('a.pagination-next::attr(href)').get()
next_page = next_page or response.css('a[rel="next"]::attr(href)').get()
next_page = next_page or response.xpath('//a[contains(text(), "Next")]/@href').get()

Pattern 2: Page Numbers (1, 2, 3...)

Website shows page numbers with links.

HTML Example

<div class="pagination">
    <a href="/products?page=1">1</a>
    <a href="/products?page=2">2</a>
    <a href="/products?page=3">3</a>
</div>

Spider Code

class PageNumberSpider(scrapy.Spider):
    name = 'page_numbers'
    start_urls = ['https://example.com/products?page=1']

    def parse(self, response):
        # Scrape items
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get()
            }

        # Follow all pagination links
        for page_link in response.css('.pagination a::attr(href)').getall():
            yield response.follow(page_link, callback=self.parse)

Scrapy's built-in duplicate filter drops requests for URLs it has already seen, so this won't visit the same page twice!
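
If you want to confirm the filter is actually dropping repeats, the crawl stats record how many requests were discarded. A small sketch you could add to the spider above, assuming the default duplicate filter and stats collector are in place:

    def closed(self, reason):
        # 'dupefilter/filtered' counts requests dropped as duplicates
        filtered = self.crawler.stats.get_value('dupefilter/filtered', 0)
        self.logger.info(f'Duplicate requests filtered: {filtered}')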


Pattern 3: Known Number of Pages

You know there are exactly N pages.

Spider Code

class KnownPagesSpider(scrapy.Spider):
    name = 'known_pages'

    def start_requests(self):
        base_url = 'https://example.com/products?page={}'

        # Scrape pages 1 through 50
        for page_num in range(1, 51):
            url = base_url.format(page_num)
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get()
            }

Simple and fast!


Pattern 4: Infinite Scroll (AJAX Loading)

Page loads more content as you scroll down.

Method 1: Find the AJAX API

Most infinite scroll sites load data via AJAX. Find the API:

  1. Open DevTools (F12)
  2. Network tab → XHR filter
  3. Scroll down
  4. Look for JSON responses

Example API found:

https://example.com/api/products?offset=0&limit=20
https://example.com/api/products?offset=20&limit=20

Spider:

import json

class InfiniteScrollSpider(scrapy.Spider):
    name = 'infinite_scroll'

    def start_requests(self):
        url = 'https://example.com/api/products?offset=0&limit=20'
        yield scrapy.Request(url, callback=self.parse_api)

    def parse_api(self, response):
        data = json.loads(response.text)

        # Extract items
        for product in data['products']:
            yield {
                'name': product['name'],
                'price': product['price']
            }

        # Check if more data
        if data['has_more']:
            # Get next offset
            offset = int(response.url.split('offset=')[1].split('&')[0])
            next_offset = offset + 20

            next_url = f'https://example.com/api/products?offset={next_offset}&limit=20'
            yield scrapy.Request(next_url, callback=self.parse_api)

Method 2: Use Playwright to Scroll

If no API is available, you can render the page with the scrapy-playwright plugin and scroll it before parsing (the settings it needs are sketched after the code):

import scrapy

class ScrollSpider(scrapy.Spider):
    name = 'scroll'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products',
            meta={
                'playwright': True,
                'playwright_include_page': True
            },
            callback=self.parse
        )

    async def parse(self, response):
        page = response.meta['playwright_page']

        # Scroll down 10 times
        for i in range(10):
            await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
            await page.wait_for_timeout(2000)  # Wait 2 seconds

        # Get final HTML
        content = await page.content()
        await page.close()

        # Parse with Scrapy
        from scrapy.http import HtmlResponse
        new_response = HtmlResponse(
            url=response.url,
            body=content.encode('utf-8'),
            encoding='utf-8'
        )

        for product in new_response.css('.product'):
            yield {
                'name': product.css('h2::text').get()
            }
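
This approach assumes the scrapy-playwright plugin is installed (pip install scrapy-playwright) and wired into the project settings. Roughly, the required settings look like this:

# settings.py -- needed for the playwright meta keys above to take effect
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'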

Pattern 5: "Load More" Button

Button that loads more items via AJAX.

Find the API

Same as infinite scroll: check Network tab when clicking "Load More".

Example:

POST https://example.com/load-more
payload: {"page": 2}

Spider:

import json
import scrapy

class LoadMoreSpider(scrapy.Spider):
    name = 'load_more'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products',
            callback=self.parse
        )

    def parse(self, response):
        # Scrape initial items
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get()
            }

        # Simulate "Load More" clicks
        for page_num in range(2, 11):  # Load 10 more times
            yield scrapy.FormRequest(
                'https://example.com/load-more',
                formdata={'page': str(page_num)},
                callback=self.parse_more
            )

    def parse_more(self, response):
        data = json.loads(response.text)

        for product in data['products']:
            yield {
                'name': product['name']
            }
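
One caveat: FormRequest sends the payload form-encoded. If the endpoint expects a JSON body (as the {"page": 2} payload above suggests), JsonRequest is a closer match to what the browser sends. A sketch against the same hypothetical endpoint:

from scrapy.http import JsonRequest

# Inside parse(): send the payload as JSON instead of form data
for page_num in range(2, 11):
    yield JsonRequest(
        'https://example.com/load-more',
        data={'page': page_num},  # serialized to a JSON body, sent as POST
        callback=self.parse_more
    )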

Pattern 6: Cursor-Based Pagination

API uses cursor tokens instead of page numbers.

Example Response

{
  "products": [...],
  "next_cursor": "abc123xyz"
}

Spider:

class CursorSpider(scrapy.Spider):
    name = 'cursor'

    def start_requests(self):
        url = 'https://api.example.com/products'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.text)

        # Extract items
        for product in data['products']:
            yield {
                'name': product['name']
            }

        # Follow next cursor
        next_cursor = data.get('next_cursor')
        if next_cursor:
            next_url = f'https://api.example.com/products?cursor={next_cursor}'
            yield scrapy.Request(next_url, callback=self.parse)
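
Since the response is JSON, response.json() (available in Scrapy 2.2+) can replace the manual json.loads() call. The same parse method, slightly shortened:

    def parse(self, response):
        data = response.json()  # parses the JSON body directly, no json import needed

        for product in data['products']:
            yield {'name': product['name']}

        next_cursor = data.get('next_cursor')
        if next_cursor:
            next_url = f'https://api.example.com/products?cursor={next_cursor}'
            yield scrapy.Request(next_url, callback=self.parse)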

Pattern 7: URL Parameters (Offset/Limit)

URLs use offset and limit parameters.

Example URLs

/products?offset=0&limit=20
/products?offset=20&limit=20
/products?offset=40&limit=20

Spider:

class OffsetSpider(scrapy.Spider):
    name = 'offset'

    def start_requests(self):
        # Start with offset 0
        url = 'https://example.com/products?offset=0&limit=20'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        products = response.css('.product')

        # Scrape items
        for product in products:
            yield {
                'name': product.css('h2::text').get()
            }

        # If we got items, there might be more
        if products:
            # Extract current offset
            offset = int(response.url.split('offset=')[1].split('&')[0])

            # Next offset
            next_offset = offset + 20
            next_url = f'https://example.com/products?offset={next_offset}&limit=20'

            yield scrapy.Request(next_url, callback=self.parse)
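
Splitting the URL string to recover the offset works, but it breaks if the parameter order ever changes. An alternative is to carry the offset with the request via cb_kwargs (Scrapy 1.7+); a sketch of the same spider under that approach:

import scrapy

class OffsetKwargsSpider(scrapy.Spider):
    name = 'offset_kwargs'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products?offset=0&limit=20',
            callback=self.parse,
            cb_kwargs={'offset': 0}  # passed into parse() as a keyword argument
        )

    def parse(self, response, offset):
        products = response.css('.product')

        for product in products:
            yield {'name': product.css('h2::text').get()}

        # If this page had items, request the next slice
        if products:
            next_offset = offset + 20
            yield scrapy.Request(
                f'https://example.com/products?offset={next_offset}&limit=20',
                callback=self.parse,
                cb_kwargs={'offset': next_offset}
            )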

Pattern 8: Date-Based Pagination

Archive pages organized by date.

Example URLs

/archive/2024/01
/archive/2024/02
/archive/2024/03

Spider:

from datetime import datetime, timedelta

class DateSpider(scrapy.Spider):
    name = 'date'

    def start_requests(self):
        # Start date
        start_date = datetime(2024, 1, 1)
        end_date = datetime(2024, 12, 31)

        current_date = start_date

        while current_date <= end_date:
            url = f'https://example.com/archive/{current_date.year}/{current_date.month:02d}'
            yield scrapy.Request(url, callback=self.parse)

            # Next month
            current_date = current_date + timedelta(days=32)
            current_date = current_date.replace(day=1)

    def parse(self, response):
        for article in response.css('.article'):
            yield {
                'title': article.css('h2::text').get()
            }
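
The timedelta(days=32) / replace(day=1) trick does land on the first of the next month. If python-dateutil is available, relativedelta(months=1) says the same thing more directly; a small sketch assuming that extra dependency:

from datetime import datetime
from dateutil.relativedelta import relativedelta  # requires python-dateutil

current_date = datetime(2024, 1, 1)
end_date = datetime(2024, 12, 31)

while current_date <= end_date:
    print(f'https://example.com/archive/{current_date.year}/{current_date.month:02d}')
    current_date += relativedelta(months=1)  # straight to the first of the next month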

Stopping Pagination

Don't scrape forever! Add stop conditions.

Stop After N Pages

class LimitedSpider(scrapy.Spider):
    name = 'limited'
    max_pages = 10
    page_count = 0

    def parse(self, response):
        self.page_count += 1

        # Scrape items
        for product in response.css('.product'):
            yield {'name': product.css('h2::text').get()}

        # Stop if reached limit
        if self.page_count >= self.max_pages:
            self.logger.info(f'Reached max pages: {self.max_pages}')
            return

        # Continue pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Stop on Empty Page

def parse(self, response):
    products = response.css('.product')

    # If no products, stop
    if not products:
        self.logger.info('No more products, stopping')
        return

    # Scrape items
    for product in products:
        yield {'name': product.css('h2::text').get()}

    # Continue
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)
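
If you'd rather not thread a counter through parse(), Scrapy's CloseSpider extension can enforce the same kind of limits from settings. For example:

import scrapy

class LimitedSettingsSpider(scrapy.Spider):
    name = 'limited_settings'
    start_urls = ['https://example.com/products']

    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 10,   # close after 10 responses have been crawled
        'CLOSESPIDER_ITEMCOUNT': 500,  # ...or after 500 scraped items, whichever comes first
    }

    def parse(self, response):
        for product in response.css('.product'):
            yield {'name': product.css('h2::text').get()}

        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)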

Common Mistakes

Mistake #1: Not Using response.follow()

# BAD (scrapy.Request does not resolve relative URLs)
next_page = response.css('.next::attr(href)').get()
yield scrapy.Request(next_page)  # Fails if next_page is relative (missing scheme) or None!

# GOOD
next_page = response.css('.next::attr(href)').get()
if next_page:
    yield response.follow(next_page)  # Handles relative URLs

Mistake #2: Creating Duplicate URLs

# BAD (dont_filter=True bypasses the duplicate filter)
for page in range(1, 100):
    url = f'https://example.com/page/{page}'
    yield scrapy.Request(url, dont_filter=True)  # Repeated URLs would be crawled again!

# GOOD
for page in range(1, 100):
    url = f'https://example.com/page/{page}'
    yield scrapy.Request(url)  # Scrapy deduplicates repeated URLs automatically

Mistake #3: Not Checking If Next Page Exists

# BAD (crashes if no next page)
next_page = response.css('.next::attr(href)').get()
yield response.follow(next_page)  # next_page might be None!

# GOOD
next_page = response.css('.next::attr(href)').get()
if next_page:
    yield response.follow(next_page)

Testing Pagination

Make sure pagination works:

from scrapy import Request
from scrapy.http import HtmlResponse

def test_follows_pagination():
    html = '''
    <div class="product">Product 1</div>
    <a class="next" href="/page2">Next</a>
    '''

    # Reuse the NextButtonSpider from Pattern 1
    spider = NextButtonSpider()
    response = HtmlResponse(url='http://example.com', body=html.encode(), encoding='utf-8')

    results = list(spider.parse(response))

    # Should yield one item and one next-page request
    items = [r for r in results if not isinstance(r, Request)]
    requests = [r for r in results if isinstance(r, Request)]

    assert len(items) == 1
    assert len(requests) == 1
    assert 'page2' in requests[0].url

Complete Real-World Example

Production-ready pagination handler:

import scrapy

class ProductionPaginationSpider(scrapy.Spider):
    name = 'pagination'
    start_urls = ['https://example.com/products']

    # Configuration
    max_pages = 100
    page_count = 0
    min_items_per_page = 5

    def parse(self, response):
        self.page_count += 1
        self.logger.info(f'Scraping page {self.page_count}')

        # Extract products
        products = response.css('.product')

        # Log if few items (might indicate end)
        if len(products) < self.min_items_per_page:
            self.logger.warning(
                f'Only {len(products)} items on page {self.page_count} '
                f'(expected at least {self.min_items_per_page})'
            )

        # If no products, we've reached the end
        if not products:
            self.logger.info('No products found, stopping pagination')
            return

        # Scrape all products
        for product in products:
            name = product.css('h2::text').get()
            price = product.css('.price::text').get()

            if name and price:
                yield {
                    'name': name.strip(),
                    'price': price.strip(),
                    'page': self.page_count,
                    'url': response.url
                }

        # Check if reached max pages
        if self.page_count >= self.max_pages:
            self.logger.info(f'Reached max pages limit: {self.max_pages}')
            return

        # Try multiple next page selectors
        next_page = (
            response.css('.next::attr(href)').get() or
            response.css('a.pagination-next::attr(href)').get() or
            response.css('a[rel="next"]::attr(href)').get() or
            response.xpath('//a[contains(text(), "Next")]/@href').get()
        )

        if next_page:
            self.logger.info(f'Following next page: {next_page}')
            yield response.follow(next_page, callback=self.parse)
        else:
            self.logger.info('No next page link found, stopping')

    def closed(self, reason):
        self.logger.info('='*60)
        self.logger.info('PAGINATION STATISTICS')
        self.logger.info(f'Total pages scraped: {self.page_count}')
        self.logger.info(f'Close reason: {reason}')
        self.logger.info('='*60)

Summary

Common pagination patterns:

  1. Next button - Most common, use response.follow()
  2. Page numbers - Follow all pagination links
  3. Known pages - Generate URLs in start_requests()
  4. Infinite scroll - Find AJAX API or use Playwright
  5. Load more - POST request to load-more endpoint
  6. Cursor-based - Follow next_cursor in API
  7. Offset/limit - Increment offset parameter
  8. Date-based - Generate date-based URLs

Best practices:

  • Always use response.follow() for relative URLs
  • Check if next page exists before following
  • Add stop conditions (max pages or empty results)
  • Log pagination progress
  • Test pagination logic

Debugging tips:

  • Check if next_page selector is correct
  • Verify URLs are being generated correctly
  • Watch for infinite loops
  • Check Scrapy stats for request count

Remember:

  • Scrapy deduplicates URLs automatically
  • response.follow() handles relative URLs
  • Stop when no more items found
  • Log progress for debugging

Start with "Next" button pattern, it covers 80% of cases!

Happy scraping! 🕷️
