Every scraper eventually hits a wall: the data you need spans multiple pages. Pagination is the single most common challenge in web scraping, and the approach varies wildly depending on the site.
This guide covers the five main pagination patterns you'll encounter, with working Python code for each.
## 1. URL-Based Page Numbers

The simplest pattern. The page number appears directly in the URL.

```
https://example.com/products?page=1
https://example.com/products?page=2
https://example.com/products/page/3
```
Solution: Increment the page number in a loop.
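A minimal sketch of the loop, assuming a hypothetical `?page=N` listing and a "No results found" marker text (swap both for the real site's details). The `fetch` parameter defaults to `requests.get` but can be replaced for testing:

```python
def fetch_all_pages(base_url, max_pages=1000, fetch=None):
    """Collect HTML for ?page=1, ?page=2, ... until a stop condition fires."""
    if fetch is None:                          # default to requests.get
        import requests
        fetch = requests.get
    pages = []
    for page in range(1, max_pages + 1):       # max_pages is a safety cap
        resp = fetch(base_url, params={"page": page})
        if resp.status_code != 200:            # e.g. 404 past the last page
            break
        if "No results found" in resp.text:    # assumed "no results" marker
            break
        pages.append(resp.text)
    return pages
```

Each page's HTML can then be parsed with BeautifulSoup; injecting `fetch` keeps the loop testable without hitting the network.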
A stop condition is critical. Without one, you'll loop forever or hammer the server with 404 requests. Check for:
- Empty result set
- Non-200 status code
- A "no results" message on the page
## 2. Offset and Limit Pagination

Common in APIs. Instead of page numbers, the URL uses offset and limit parameters.

```
/api/products?offset=0&limit=20
/api/products?offset=20&limit=20
/api/products?offset=40&limit=20
```
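A sketch of the offset walk, assuming the endpoint returns its items under a `results` key (read the real key off the actual payload):

```python
def fetch_all_offsets(api_url, limit=20, fetch=None):
    """Advance offset by limit until the API returns an empty batch."""
    if fetch is None:                            # default to requests.get
        import requests
        fetch = requests.get
    items, offset = [], 0
    while True:
        resp = fetch(api_url, params={"offset": offset, "limit": limit})
        resp.raise_for_status()
        batch = resp.json().get("results", [])   # assumed response key
        if not batch:                            # empty batch -> past the end
            break
        items.extend(batch)
        offset += limit
    return items
```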
Pro tip: Many APIs return a `total_count` or `has_more` field. Always check for it to avoid unnecessary requests.
## 3. Cursor-Based Pagination

Modern APIs (Twitter, Shopify, GitHub) use cursors instead of page numbers. Each response includes a token pointing to the next batch.

```json
{
  "data": ["..."],
  "pagination": {
    "next_cursor": "eyJpZCI6MTAwfQ==",
    "has_next": true
  }
}
```
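A sketch of the cursor loop, matching the response shape above. The `cursor` query-parameter name is an assumption; every API names it differently:

```python
def fetch_all_cursors(api_url, fetch=None):
    """Follow next_cursor tokens until has_next goes false."""
    if fetch is None:                        # default to requests.get
        import requests
        fetch = requests.get
    items, cursor = [], None
    while True:
        params = {"cursor": cursor} if cursor else {}
        payload = fetch(api_url, params=params).json()
        items.extend(payload["data"])
        pagination = payload.get("pagination", {})
        if not pagination.get("has_next"):   # no more batches
            break
        cursor = pagination["next_cursor"]
    return items
```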
Why cursors? They're stable. With page numbers, if someone inserts a row while you're scraping, you'll get duplicates or miss items. Cursors always point to a consistent position.
## 4. Infinite Scroll
No "Next" button. New content loads when you scroll to the bottom. This is the hardest pattern because it requires JavaScript execution.
Under the hood, infinite scroll usually triggers an AJAX request. Your first move should be to open DevTools, go to the Network tab, and look for that API call. If you find it, scrape the API directly (much faster).
If there's no accessible API, use Playwright:
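A sketch using Playwright's sync API (assumes `pip install playwright` plus `playwright install chromium`). The scroll distance and the 2-second wait are tunable guesses, not site-specific values:

```python
def reached_end(prev_height, new_height):
    """A scroll that doesn't grow the page means we've hit the bottom."""
    return new_height == prev_height

def scrape_infinite_scroll(url, max_scrolls=20):
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        prev_height = 0
        for _ in range(max_scrolls):                 # hard cap: no infinite loops
            page.mouse.wheel(0, 10_000)              # scroll toward the bottom
            page.wait_for_timeout(2000)              # let the AJAX request finish
            new_height = page.evaluate("document.body.scrollHeight")
            if reached_end(prev_height, new_height):
                break
            prev_height = new_height
        html = page.content()
        browser.close()
    return html
```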
Key details:
- Always set `max_scrolls` to prevent infinite loops
- The `wait_for_timeout(2000)` gives the AJAX request time to complete
- Compare `scrollHeight` before and after to detect when you've reached the end
Tip: Running headless browsers at scale? A reliable proxy service like ThorData prevents IP blocks when making hundreds of browser-based requests. For a managed approach, ScraperAPI handles browser rendering and proxy rotation in a single API call.
## 5. "Load More" Buttons
Similar to infinite scroll, but requires clicking a button. Again, check DevTools first since the button usually triggers an API call you can replicate directly.
### Approach A: Replicate the API call
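A sketch, assuming the button's request takes `page`/`per_page` parameters and returns items under an `items` key. The real names come from the Network tab, not from this code:

```python
def fetch_via_load_more_api(api_url, per_page=24, fetch=None):
    """Call the endpoint the button hits; a short batch marks the last page."""
    if fetch is None:                        # default to requests.get
        import requests
        fetch = requests.get
    items, page = [], 1
    while True:
        payload = fetch(api_url,
                        params={"page": page, "per_page": per_page}).json()
        batch = payload.get("items", [])     # assumed response key
        items.extend(batch)
        if len(batch) < per_page:            # short batch -> nothing left
            break
        page += 1
    return items
```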
### Approach B: Click the button with Playwright
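A sketch with Playwright; `button.load-more` is a placeholder selector, and `max_clicks` caps the loop so a button that never disappears can't trap the scraper:

```python
def scrape_via_clicks(url, button_selector="button.load-more", max_clicks=50):
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        button = page.locator(button_selector)   # locators re-query on each use
        clicks = 0
        while clicks < max_clicks and button.count() > 0 and button.is_visible():
            button.click()
            page.wait_for_timeout(2000)          # let the new content render
            clicks += 1
        html = page.content()                    # now contains every loaded batch
        browser.close()
    return html
```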
## Choosing the Right Approach

| Pattern | Detection | Best Tool |
|---|---|---|
| URL page numbers | `?page=N` in URL | `requests` + BeautifulSoup |
| Offset/limit | `?offset=N&limit=M` in API | `requests` |
| Cursor-based | `next_cursor` in response | `requests` |
| Infinite scroll | Content loads on scroll | Playwright (or find hidden API) |
| Load more button | Button triggers new content | Playwright (or find hidden API) |
Always check the Network tab first. 90% of "infinite scroll" and "load more" sites have a clean API underneath. Scraping that API is 10x faster and more reliable than browser automation.
## Handling Pagination at Scale
When you're scraping thousands of pages, you need to think about:
**Rate limiting:** Space out your requests. A 1-2 second delay between pages prevents bans and is polite to the server.

```python
import time
import random

delay = random.uniform(1.0, 2.5)
time.sleep(delay)
```
**Retries:** Network errors happen. Wrap your requests in retry logic:
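A minimal retry wrapper with exponential backoff; the attempt count and backoff base are arbitrary defaults:

```python
import time

def fetch_with_retries(url, fetch=None, max_attempts=3, backoff=2.0):
    """Retry transient failures, sleeping longer after each attempt."""
    if fetch is None:                        # default to requests.get
        import requests
        fetch = requests.get
    for attempt in range(1, max_attempts + 1):
        try:
            resp = fetch(url, timeout=10)
            resp.raise_for_status()
            return resp
        except Exception:                    # requests.RequestException in practice
            if attempt == max_attempts:
                raise                        # out of attempts: surface the error
            time.sleep(backoff ** attempt)   # 2s, then 4s, then 8s, ...
```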
**Proxy rotation:** After a few hundred requests, many sites will throttle or block your IP. Use rotating proxies to distribute your traffic.
For high-volume pagination scraping, ScraperAPI handles retries, proxy rotation, and CAPTCHA solving automatically. ThorData offers residential proxies if you prefer managing rotation yourself.
**Checkpointing:** Save your progress so you can resume if the scraper crashes:

```python
import json

def save_checkpoint(page_num, items, filename="checkpoint.json"):
    with open(filename, "w") as f:
        json.dump({"last_page": page_num, "items": items}, f)

def load_checkpoint(filename="checkpoint.json"):
    try:
        with open(filename) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"last_page": 0, "items": []}
```
## Common Pitfalls
- No stop condition — Your scraper runs forever, making thousands of empty requests
- Not handling duplicates — Page boundaries shift, causing repeated items
- Ignoring rate limits — Getting your IP banned after 50 pages
- Scraping HTML when an API exists — Always check DevTools Network tab first
- Hardcoding page counts — Sites add and remove pages. Always detect the end dynamically
## Wrapping Up
Pagination is a solved problem once you identify which pattern a site uses. Start with the simplest approach (URL parameters), check for hidden APIs before reaching for browser automation, and always build in stop conditions and retry logic.
The patterns in this guide cover 95% of real-world pagination. For the remaining edge cases involving token-based authentication, CAPTCHA-gated pages, or custom JavaScript, a managed service like ScraperAPI can save you days of debugging.
Need proxies for large-scale scraping? ThorData offers residential and datacenter proxies optimized for web scraping workloads.