Every scraper eventually hits a wall: the data you need spans multiple pages. Pagination is the single most common challenge in web scraping, and the approach varies wildly depending on the site.
This guide covers the five main pagination patterns you'll encounter, with working Python code for each.
1. URL-Based Page Numbers
The simplest pattern. The page number appears directly in the URL.
```
https://example.com/products?page=1
https://example.com/products?page=2
https://example.com/products/page/3
```
Solution: Increment the page number in a loop.
```python
import requests
from bs4 import BeautifulSoup

def scrape_numbered_pages(base_url, max_pages=50):
    all_items = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        resp = requests.get(url, timeout=10)
        if resp.status_code != 200:
            break
        soup = BeautifulSoup(resp.text, "html.parser")
        items = soup.select(".product-card")
        if not items:  # No more results
            break
        for item in items:
            all_items.append({
                "name": item.select_one(".title").text.strip(),
                "price": item.select_one(".price").text.strip(),
            })
        print(f"Page {page}: {len(items)} items")
    return all_items

data = scrape_numbered_pages("https://example.com/products")
```
Stop condition is critical. Without it, you'll loop forever or hammer the server with 404 requests. Check for:
- Empty result set
- Non-200 status code
- A "no results" message on the page
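The three checks can be folded into one helper so the loop body stays clean. This is a minimal sketch; the `"no results"` string check is site-specific, so adapt it to whatever message the target site actually renders:

```python
def should_stop(status_code, items, page_text=""):
    """Return True when pagination has reached the end.

    Covers the three stop conditions: non-200 status, an empty
    result set, and a "no results" message in the page body.
    """
    if status_code != 200:
        return True
    if not items:
        return True
    if "no results" in page_text.lower():  # site-specific; adjust as needed
        return True
    return False

# Inside the loop:
#     if should_stop(resp.status_code, items, resp.text):
#         break
```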
2. Offset and Limit Pagination
Common in APIs. Instead of page numbers, the URL uses offset and limit parameters.
```
/api/products?offset=0&limit=20
/api/products?offset=20&limit=20
/api/products?offset=40&limit=20
```
```python
import requests

def scrape_offset_api(api_url, limit=20):
    all_items = []
    offset = 0
    total = None
    while True:
        resp = requests.get(
            api_url, params={"offset": offset, "limit": limit}, timeout=10
        )
        data = resp.json()
        items = data.get("results", [])
        if not items:
            break
        all_items.extend(items)
        offset += limit

        # Some APIs tell you the total count
        total = data.get("total_count")
        if total and offset >= total:
            break
        print(f"Fetched {len(all_items)}/{total or '?'} items")
    return all_items
```
Pro tip: Many APIs return a total_count or has_more field. Always check for it to avoid unnecessary requests.
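To make the savings concrete, here is a generic offset paginator that honors `total_count` when present, demonstrated against a fake in-memory endpoint — `fake_fetch` is a stand-in for the real `requests.get(...).json()` call, not any particular API:

```python
def paginate_offset(fetch, limit=20):
    """Collect all items from an offset/limit endpoint.

    `fetch(offset, limit)` must return a dict shaped like the API
    response: {"results": [...], "total_count": N}.
    """
    all_items, offset = [], 0
    while True:
        data = fetch(offset, limit)
        items = data.get("results", [])
        if not items:
            break
        all_items.extend(items)
        offset += limit
        # Stop as soon as total_count says we have everything,
        # saving one final empty request.
        total = data.get("total_count")
        if total is not None and offset >= total:
            break
    return all_items

# Fake endpoint standing in for a real HTTP call.
DATASET = list(range(45))

def fake_fetch(offset, limit):
    return {"results": DATASET[offset:offset + limit], "total_count": len(DATASET)}

items = paginate_offset(fake_fetch)
print(len(items))  # 45 items collected in 3 requests instead of 4
```

Without the `total_count` check, the loop would make a fourth request just to receive an empty page and stop.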
3. Cursor-Based Pagination
Modern APIs (Twitter, Shopify, GitHub) use cursors instead of page numbers. Each response includes a token pointing to the next batch.
```json
{
  "data": ["..."],
  "pagination": {
    "next_cursor": "eyJpZCI6MTAwfQ==",
    "has_next": true
  }
}
```
```python
import requests

def scrape_cursor_api(api_url, headers=None):
    all_items = []
    cursor = None
    while True:
        params = {}
        if cursor:
            params["cursor"] = cursor
        resp = requests.get(api_url, params=params, headers=headers or {}, timeout=10)
        data = resp.json()
        items = data.get("data", [])
        all_items.extend(items)

        pagination = data.get("pagination", {})
        cursor = pagination.get("next_cursor")
        # Stop when there's no cursor, or the API explicitly says no next page.
        # (Don't default a missing has_next to False — some APIs omit it.)
        if not cursor or pagination.get("has_next") is False:
            break
        print(f"Fetched {len(all_items)} items, next cursor: {cursor[:20]}...")
    return all_items
```
Why cursors? They're stable. With page numbers, if someone inserts a row while you're scraping, you'll get duplicates or miss items. Cursors always point to a consistent position.
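The drift problem is easy to demonstrate with a toy dataset: insert one row between two page fetches of a newest-first list and the page boundary shifts, re-serving an item. This is a self-contained simulation of both schemes, not a real API:

```python
def get_page(rows, page, per_page=3):
    """Page-number pagination over a newest-first list."""
    start = (page - 1) * per_page
    return rows[start:start + per_page]

def get_after_cursor(rows, cursor, per_page=3):
    """Cursor pagination: everything strictly older than `cursor`."""
    idx = rows.index(cursor)
    return rows[idx + 1:idx + 1 + per_page]

rows = ["e", "d", "c", "b", "a"]        # newest first
page1 = get_page(rows, 1)                # ['e', 'd', 'c']

rows.insert(0, "f")                      # new row arrives mid-scrape
page2 = get_page(rows, 2)                # ['c', 'b', 'a'] — 'c' again!
print(sorted(set(page1) & set(page2)))   # ['c'] appears on both pages

# The cursor version is unaffected by the insert:
print(get_after_cursor(rows, "c"))       # ['b', 'a'] — no duplicates
```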
4. Infinite Scroll
No "Next" button. New content loads when you scroll to the bottom. This is the hardest pattern because it requires JavaScript execution.
Under the hood, infinite scroll usually triggers an AJAX request. Your first move should be to open DevTools, go to the Network tab, and look for that API call. If you find it, scrape the API directly (much faster).
If there's no accessible API, use Playwright:
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_infinite_scroll(url, max_scrolls=20):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)

        previous_height = 0
        scroll_count = 0
        while scroll_count < max_scrolls:
            # Scroll to bottom
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(2000)  # Wait for content to load

            # Check if page height changed
            current_height = await page.evaluate("document.body.scrollHeight")
            if current_height == previous_height:
                break  # No new content loaded
            previous_height = current_height
            scroll_count += 1
            print(f"Scroll {scroll_count}: height = {current_height}")

        # Now extract all loaded items
        items = await page.query_selector_all(".item-card")
        results = []
        for item in items:
            title = await item.query_selector(".title")
            results.append(await title.inner_text() if title else "N/A")

        await browser.close()
        return results

data = asyncio.run(scrape_infinite_scroll("https://example.com/feed"))
```
Key details:
- Always set `max_scrolls` to prevent infinite loops
- The `wait_for_timeout(2000)` gives the AJAX request time to complete
- Compare `scrollHeight` before and after to detect when you've reached the end
Tip: Running headless browsers at scale? A reliable proxy service like ThorData prevents IP blocks when making hundreds of browser-based requests. For a managed approach, ScraperAPI handles browser rendering and proxy rotation in a single API call.
5. "Load More" Buttons
Similar to infinite scroll, but requires clicking a button. Again, check DevTools first since the button usually triggers an API call you can replicate directly.
Approach A: Replicate the API call
```python
import requests

def scrape_load_more_api(api_url):
    all_items = []
    page = 1
    while True:
        resp = requests.post(api_url, json={"page": page, "per_page": 24}, timeout=10)
        data = resp.json()
        items = data.get("items", [])
        if not items:
            break
        all_items.extend(items)
        page += 1
    return all_items
```
Approach B: Click the button with Playwright
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_load_more_button(url, button_selector=".load-more-btn"):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)

        while True:
            try:
                button = await page.wait_for_selector(
                    button_selector, timeout=5000
                )
                await button.click()
                await page.wait_for_timeout(1500)
            except Exception:
                break  # Button no longer appears

        # Extract all items
        items = await page.query_selector_all(".product")
        results = []
        for item in items:
            text = await item.inner_text()
            results.append(text)

        await browser.close()
        return results
```
Choosing the Right Approach
| Pattern | Detection | Best Tool |
|---|---|---|
| URL page numbers | `?page=N` in URL | `requests` + BeautifulSoup |
| Offset/limit | `?offset=N&limit=M` in API | `requests` |
| Cursor-based | `next_cursor` in response | `requests` |
| Infinite scroll | Content loads on scroll | Playwright (or find hidden API) |
| Load more button | Button triggers new content | Playwright (or find hidden API) |
Always check the Network tab first. 90% of "infinite scroll" and "load more" sites have a clean API underneath. Scraping that API is 10x faster and more reliable than browser automation.
Handling Pagination at Scale
When you're scraping thousands of pages, you need to think about:
Rate limiting: Space out your requests. A 1-2 second delay between pages prevents bans and is polite to the server.
```python
import time
import random

delay = random.uniform(1.0, 2.5)
time.sleep(delay)
```
Retries: Network errors happen. Wrap your requests in retry logic:
```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
def fetch_page(url):
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp
```
Proxy rotation: After a few hundred requests, many sites will throttle or block your IP. Use rotating proxies to distribute your traffic.
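A simple way to rotate is to cycle through a pool with `itertools.cycle`. This is a minimal sketch: the proxy URLs below are hypothetical placeholders, so substitute your provider's actual addresses and credentials:

```python
import itertools

import requests

# Hypothetical proxy endpoints — substitute your provider's addresses.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]
proxy_pool = itertools.cycle(PROXIES)

def fetch_with_rotation(url):
    """Send each request through the next proxy in the pool."""
    proxy = next(proxy_pool)
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
```

Round-robin cycling is the simplest policy; a production setup would also drop proxies that repeatedly fail or get banned.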
For high-volume pagination scraping, ScraperAPI handles retries, proxy rotation, and CAPTCHA solving automatically. ThorData offers residential proxies if you prefer managing rotation yourself.
Checkpointing: Save your progress so you can resume if the scraper crashes:
```python
import json

def save_checkpoint(page_num, items, filename="checkpoint.json"):
    with open(filename, "w") as f:
        json.dump({"last_page": page_num, "items": items}, f)

def load_checkpoint(filename="checkpoint.json"):
    try:
        with open(filename) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"last_page": 0, "items": []}
```
Common Pitfalls
- No stop condition — Your scraper runs forever, making thousands of empty requests
- Not handling duplicates — Page boundaries shift, causing repeated items
- Ignoring rate limits — Getting your IP banned after 50 pages
- Scraping HTML when an API exists — Always check DevTools Network tab first
- Hardcoding page counts — Sites add and remove pages. Always detect the end dynamically
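For the duplicates pitfall, tracking a set of seen IDs as you merge pages is usually enough. A minimal sketch, assuming each item carries a unique `id` field (adjust the key to whatever uniquely identifies items on your target site):

```python
def dedupe_items(pages):
    """Merge paginated results, skipping items whose id was already seen."""
    seen, unique = set(), []
    for page in pages:
        for item in page:
            if item["id"] in seen:
                continue  # page boundary shifted; item repeated
            seen.add(item["id"])
            unique.append(item)
    return unique

pages = [
    [{"id": 1}, {"id": 2}, {"id": 3}],
    [{"id": 3}, {"id": 4}],  # id 3 repeated after a boundary shift
]
print(len(dedupe_items(pages)))  # 4
```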
Wrapping Up
Pagination is a solved problem once you identify which pattern a site uses. Start with the simplest approach (URL parameters), check for hidden APIs before reaching for browser automation, and always build in stop conditions and retry logic.
The patterns in this guide cover 95% of real-world pagination. For the remaining edge cases involving token-based authentication, CAPTCHA-gated pages, or custom JavaScript, a managed service like ScraperAPI can save you days of debugging.
Need proxies for large-scale scraping? ThorData offers residential and datacenter proxies optimized for web scraping workloads.