The first time I tried scraping a paginated site, I only got the first page. I knew there were 50 pages, but my spider stopped after page 1.
I didn't know about response.follow() or pagination patterns. I was manually building URLs and getting it wrong.
Once I learned the pagination patterns, scraping multi-page sites became trivial. Let me show you every pagination pattern and how to handle it.
Why Pagination Matters
Most websites split content across pages:
- Product listings (page 1, 2, 3...)
- Search results
- Blog archives
- Category pages
If you don't handle pagination:
- You only scrape the first page
- You miss most of the data
- Your results are incomplete
Pattern 1: Next Button (Most Common)
Website has a "Next" button linking to the next page.
HTML Example
<a class="next" href="/products?page=2">Next</a>
Spider Code
import scrapy

class NextButtonSpider(scrapy.Spider):
    name = 'next_button'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Scrape items on current page
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

        # Follow next page link
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Key points:
- response.follow() handles relative URLs automatically (a couple of handy shortcuts are sketched below)
- Check that next_page exists before following it
- Use the same callback (self.parse) so the next page is parsed the same way
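On recent Scrapy versions there are two shortcuts worth knowing: response.follow() also accepts the link Selector itself, and response.follow_all() (Scrapy 2.0+) builds a request for every matching link in one call. A minimal sketch, assuming the same a.next markup as above and a .pagination container like the one in Pattern 2:

# response.follow() can take the <a> selector directly; Scrapy extracts the href
next_link = response.css('a.next')
if next_link:
    yield response.follow(next_link[0], callback=self.parse)

# Or queue every pagination link at once (Scrapy 2.0+)
yield from response.follow_all(css='.pagination a', callback=self.parse)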
What the Docs Don't Tell You
Different "Next" selectors:
# Try multiple selectors
next_page = response.css('.next::attr(href)').get()
next_page = next_page or response.css('a.pagination-next::attr(href)').get()
next_page = next_page or response.css('a[rel="next"]::attr(href)').get()
next_page = next_page or response.xpath('//a[contains(text(), "Next")]/@href').get()
Pattern 2: Page Numbers (1, 2, 3...)
Website shows page numbers with links.
HTML Example
<div class="pagination">
<a href="/products?page=1">1</a>
<a href="/products?page=2">2</a>
<a href="/products?page=3">3</a>
</div>
Spider Code
class PageNumberSpider(scrapy.Spider):
    name = 'page_numbers'
    start_urls = ['https://example.com/products?page=1']

    def parse(self, response):
        # Scrape items
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get()
            }

        # Follow all pagination links
        for page_link in response.css('.pagination a::attr(href)').getall():
            yield response.follow(page_link, callback=self.parse)
Scrapy automatically deduplicates URLs, so this won't visit the same page twice!
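If you want to watch the duplicate filter work, Scrapy can log every request it drops and reports a count in the crawl stats. A small settings.py tweak (DUPEFILTER_DEBUG is a standard Scrapy setting):

# settings.py
# Log every filtered duplicate request instead of only the first one
DUPEFILTER_DEBUG = True

The dupefilter/filtered entry in the final stats then tells you how many pagination links were skipped as duplicates.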
Pattern 3: Known Number of Pages
You know there are exactly N pages.
Spider Code
class KnownPagesSpider(scrapy.Spider):
    name = 'known_pages'

    def start_requests(self):
        base_url = 'https://example.com/products?page={}'
        # Scrape pages 1 through 50
        for page_num in range(1, 51):
            url = base_url.format(page_num)
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get()
            }
Simple and fast!
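If the page count changes from run to run, you can pass it in on the command line instead of hard-coding 50. Spider arguments set with -a arrive as strings, so cast them; last_page is just an argument name I picked for this sketch:

class KnownPagesSpider(scrapy.Spider):
    name = 'known_pages'

    def start_requests(self):
        # Run with: scrapy crawl known_pages -a last_page=50
        last_page = int(getattr(self, 'last_page', 50))
        base_url = 'https://example.com/products?page={}'
        for page_num in range(1, last_page + 1):
            yield scrapy.Request(base_url.format(page_num), callback=self.parse)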
Pattern 4: Infinite Scroll (AJAX Loading)
Page loads more content as you scroll down.
Method 1: Find the AJAX API
Most infinite scroll sites load data via AJAX. Find the API:
- Open DevTools (F12)
- Network tab → XHR filter
- Scroll down
- Look for JSON responses
Example API found:
https://example.com/api/products?offset=0&limit=20
https://example.com/api/products?offset=20&limit=20
Spider:
import json

class InfiniteScrollSpider(scrapy.Spider):
    name = 'infinite_scroll'

    def start_requests(self):
        url = 'https://example.com/api/products?offset=0&limit=20'
        yield scrapy.Request(url, callback=self.parse_api)

    def parse_api(self, response):
        data = json.loads(response.text)

        # Extract items
        for product in data['products']:
            yield {
                'name': product['name'],
                'price': product['price']
            }

        # Check if more data
        if data['has_more']:
            # Get next offset
            offset = int(response.url.split('offset=')[1].split('&')[0])
            next_offset = offset + 20
            next_url = f'https://example.com/api/products?offset={next_offset}&limit=20'
            yield scrapy.Request(next_url, callback=self.parse_api)
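Side note: if you're on Scrapy 2.2 or newer, TextResponse has a built-in json() helper, so the manual json.loads() call isn't needed:

# Scrapy >= 2.2
data = response.json()  # same result as json.loads(response.text)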
Method 2: Use Playwright to Scroll
If no API is available, use Playwright to scroll:
import scrapy
from scrapy.http import HtmlResponse

class ScrollSpider(scrapy.Spider):
    name = 'scroll'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products',
            meta={
                'playwright': True,
                'playwright_include_page': True
            },
            callback=self.parse
        )

    async def parse(self, response):
        page = response.meta['playwright_page']

        # Scroll down 10 times
        for i in range(10):
            await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
            await page.wait_for_timeout(2000)  # Wait 2 seconds

        # Get final HTML
        content = await page.content()
        await page.close()

        # Wrap the rendered HTML so we can use Scrapy selectors on it
        new_response = HtmlResponse(
            url=response.url,
            body=content.encode('utf-8'),
            encoding='utf-8'
        )

        for product in new_response.css('.product'):
            yield {
                'name': product.css('h2::text').get()
            }
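This spider assumes the scrapy-playwright package is installed and enabled. The settings below follow that package's documented setup; adjust them to your project:

# settings.py (requires: pip install scrapy-playwright)
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"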
Pattern 5: "Load More" Button
Button that loads more items via AJAX.
Find the API
Same as infinite scroll: check the Network tab while clicking "Load More".
Example:
POST https://example.com/load-more
payload: {"page": 2}
Spider:
import json

class LoadMoreSpider(scrapy.Spider):
    name = 'load_more'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products',
            callback=self.parse
        )

    def parse(self, response):
        # Scrape initial items
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get()
            }

        # Simulate "Load More" clicks for pages 2-10
        for page_num in range(2, 11):
            yield scrapy.FormRequest(
                'https://example.com/load-more',
                formdata={'page': str(page_num)},
                callback=self.parse_more
            )

    def parse_more(self, response):
        data = json.loads(response.text)
        for product in data['products']:
            yield {
                'name': product['name']
            }
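One caveat: FormRequest sends page=2 as form-encoded data. If the endpoint expects a JSON body, as the {"page": 2} payload above suggests, scrapy.http.JsonRequest serializes the dict and sets the Content-Type header for you. A sketch of the same loop with JsonRequest:

from scrapy.http import JsonRequest

# Inside parse(), replacing the FormRequest loop
for page_num in range(2, 11):
    yield JsonRequest(
        'https://example.com/load-more',
        data={'page': page_num},  # sent as a JSON body
        callback=self.parse_more
    )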
Pattern 6: Cursor-Based Pagination
API uses cursor tokens instead of page numbers.
Example Response
{
  "products": [...],
  "next_cursor": "abc123xyz"
}
Spider:
import json

class CursorSpider(scrapy.Spider):
    name = 'cursor'

    def start_requests(self):
        url = 'https://api.example.com/products'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.text)

        # Extract items
        for product in data['products']:
            yield {
                'name': product['name']
            }

        # Follow next cursor
        next_cursor = data.get('next_cursor')
        if next_cursor:
            next_url = f'https://api.example.com/products?cursor={next_cursor}'
            yield scrapy.Request(next_url, callback=self.parse)
Pattern 7: URL Parameters (Offset/Limit)
URLs use offset and limit parameters.
Example URLs
/products?offset=0&limit=20
/products?offset=20&limit=20
/products?offset=40&limit=20
Spider:
class OffsetSpider(scrapy.Spider):
    name = 'offset'

    def start_requests(self):
        # Start with offset 0
        url = 'https://example.com/products?offset=0&limit=20'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        products = response.css('.product')

        # Scrape items
        for product in products:
            yield {
                'name': product.css('h2::text').get()
            }

        # If we got items, there might be more
        if products:
            # Extract current offset
            offset = int(response.url.split('offset=')[1].split('&')[0])

            # Next offset
            next_offset = offset + 20
            next_url = f'https://example.com/products?offset={next_offset}&limit=20'
            yield scrapy.Request(next_url, callback=self.parse)
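Splitting the URL by hand works until the parameter order changes. The standard library's urllib.parse is a sturdier way to read and bump the offset; a sketch of the same logic as a small helper:

from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def next_offset_url(url, step=20):
    # Read the current offset, add the step, and rebuild the query string
    parts = urlparse(url)
    query = parse_qs(parts.query)
    offset = int(query.get('offset', ['0'])[0])
    query['offset'] = [str(offset + step)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

Then the pagination step becomes yield scrapy.Request(next_offset_url(response.url), callback=self.parse).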
Pattern 8: Date-Based Pagination
Archive pages organized by date.
Example URLs
/archive/2024/01
/archive/2024/02
/archive/2024/03
Spider:
from datetime import datetime, timedelta

class DateSpider(scrapy.Spider):
    name = 'date'

    def start_requests(self):
        # Date range to cover
        start_date = datetime(2024, 1, 1)
        end_date = datetime(2024, 12, 31)

        current_date = start_date
        while current_date <= end_date:
            url = f'https://example.com/archive/{current_date.year}/{current_date.month:02d}'
            yield scrapy.Request(url, callback=self.parse)

            # Advance to the next month: jump past the month's end, then snap back to the 1st
            current_date = current_date + timedelta(days=32)
            current_date = current_date.replace(day=1)

    def parse(self, response):
        for article in response.css('.article'):
            yield {
                'title': article.css('h2::text').get()
            }
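The timedelta(days=32) jump-then-snap-back trick works, but if you'd rather avoid date arithmetic entirely, iterating year/month pairs directly gives the same URLs for the 2024 range:

def start_requests(self):
    for year in (2024,):
        for month in range(1, 13):
            url = f'https://example.com/archive/{year}/{month:02d}'
            yield scrapy.Request(url, callback=self.parse)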
Stopping Pagination
Don't scrape forever! Add stop conditions.
Stop After N Pages
class LimitedSpider(scrapy.Spider):
    name = 'limited'

    max_pages = 10
    page_count = 0

    def parse(self, response):
        self.page_count += 1

        # Scrape items
        for product in response.css('.product'):
            yield {'name': product.css('h2::text').get()}

        # Stop if reached limit
        if self.page_count >= self.max_pages:
            self.logger.info(f'Reached max pages: {self.max_pages}')
            return

        # Continue pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Stop on Empty Page
def parse(self, response):
    products = response.css('.product')

    # If no products, stop
    if not products:
        self.logger.info('No more products, stopping')
        return

    # Scrape items
    for product in products:
        yield {'name': product.css('h2::text').get()}

    # Continue
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)
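You can also let Scrapy enforce limits for you: the built-in CloseSpider extension stops the crawl once a threshold is reached, configured entirely from settings (the numbers below are just examples):

# settings.py - stop conditions handled by the CloseSpider extension
CLOSESPIDER_PAGECOUNT = 100    # stop after 100 responses have been crawled
CLOSESPIDER_ITEMCOUNT = 5000   # or after 5000 items have been scraped
CLOSESPIDER_TIMEOUT = 3600     # or after an hour, whichever comes first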
Common Mistakes
Mistake #1: Not Using response.follow()
# BAD (doesn't handle relative URLs)
next_page = response.css('.next::attr(href)').get()
yield scrapy.Request(next_page)  # Might be relative!

# GOOD
next_page = response.css('.next::attr(href)').get()
if next_page:
    yield response.follow(next_page)  # Handles relative URLs
Mistake #2: Creating Duplicate URLs
# BAD (might visit same page twice)
for page in range(1, 100):
    url = f'https://example.com/page/{page}'
    yield scrapy.Request(url, dont_filter=True)  # Forces duplicates!

# GOOD
for page in range(1, 100):
    url = f'https://example.com/page/{page}'
    yield scrapy.Request(url)  # Scrapy deduplicates automatically
Mistake #3: Not Checking If Next Page Exists
# BAD (crashes if no next page)
next_page = response.css('.next::attr(href)').get()
yield response.follow(next_page)  # next_page might be None!

# GOOD
next_page = response.css('.next::attr(href)').get()
if next_page:
    yield response.follow(next_page)
Testing Pagination
Make sure pagination works:
from scrapy.http import HtmlResponse, Request

def test_follows_pagination():
    spider = NextButtonSpider()  # the Pattern 1 spider from above
    html = '''
    <div class="product">Product 1</div>
    <a class="next" href="/page2">Next</a>
    '''
    response = HtmlResponse(url='http://example.com', body=html.encode(), encoding='utf-8')
    results = list(spider.parse(response))

    # Should yield one item and one follow-up request
    items = [r for r in results if not isinstance(r, Request)]
    requests = [r for r in results if isinstance(r, Request)]

    assert len(items) == 1
    assert len(requests) == 1
    assert 'page2' in requests[0].url
Complete Real-World Example
Production-ready pagination handler:
import scrapy

class ProductionPaginationSpider(scrapy.Spider):
    name = 'pagination'
    start_urls = ['https://example.com/products']

    # Configuration
    max_pages = 100
    page_count = 0
    min_items_per_page = 5

    def parse(self, response):
        self.page_count += 1
        self.logger.info(f'Scraping page {self.page_count}')

        # Extract products
        products = response.css('.product')

        # Log if few items (might indicate end)
        if len(products) < self.min_items_per_page:
            self.logger.warning(
                f'Only {len(products)} items on page {self.page_count} '
                f'(expected at least {self.min_items_per_page})'
            )

        # If no products, we've reached the end
        if not products:
            self.logger.info('No products found, stopping pagination')
            return

        # Scrape all products
        for product in products:
            name = product.css('h2::text').get()
            price = product.css('.price::text').get()

            if name and price:
                yield {
                    'name': name.strip(),
                    'price': price.strip(),
                    'page': self.page_count,
                    'url': response.url
                }

        # Check if reached max pages
        if self.page_count >= self.max_pages:
            self.logger.info(f'Reached max pages limit: {self.max_pages}')
            return

        # Try multiple next page selectors
        next_page = (
            response.css('.next::attr(href)').get() or
            response.css('a.pagination-next::attr(href)').get() or
            response.css('a[rel="next"]::attr(href)').get() or
            response.xpath('//a[contains(text(), "Next")]/@href').get()
        )

        if next_page:
            self.logger.info(f'Following next page: {next_page}')
            yield response.follow(next_page, callback=self.parse)
        else:
            self.logger.info('No next page link found, stopping')

    def closed(self, reason):
        self.logger.info('='*60)
        self.logger.info('PAGINATION STATISTICS')
        self.logger.info(f'Total pages scraped: {self.page_count}')
        self.logger.info(f'Close reason: {reason}')
        self.logger.info('='*60)
Summary
Common pagination patterns:
- Next button - most common, use response.follow()
- Page numbers - follow all pagination links
- Known pages - generate URLs in start_requests()
- Infinite scroll - find the AJAX API or use Playwright
- Load more - POST request to the load-more endpoint
- Cursor-based - follow next_cursor in the API response
- Offset/limit - increment the offset parameter
- Date-based - generate date-based URLs
Best practices:
- Always use response.follow() for relative URLs
- Check that the next page exists before following it
- Add stop conditions (max pages or empty results)
- Log pagination progress
- Test your pagination logic
Debugging tips:
- Check if next_page selector is correct
- Verify URLs are being generated correctly
- Watch for infinite loops
- Check Scrapy stats for request count
Remember:
- Scrapy deduplicates URLs automatically
- response.follow() handles relative URLs
- Stop when no more items are found
- Log progress for debugging
Start with the "Next" button pattern - it covers 80% of cases!
Happy scraping! 🕷️