Every scraper eventually hits a wall: the data you need spans multiple pages. Pagination is the single most common challenge in web scraping, and the approach varies wildly depending on the site.
This guide covers the five main pagination patterns you'll encounter, with working Python code for each.
1. URL-Based Page Numbers
The simplest pattern. The page number appears directly in the URL.
```
https://example.com/products?page=1
https://example.com/products?page=2
https://example.com/products/page/3
```
Solution: Increment the page number in a loop.
```python
import requests
from bs4 import BeautifulSoup

def scrape_numbered_pages(base_url, max_pages=50):
    all_items = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        resp = requests.get(url, timeout=10)
        if resp.status_code != 200:
            break
        soup = BeautifulSoup(resp.text, "html.parser")
        items = soup.select(".product-card")
        if not items:  # No more results
            break
        for item in items:
            all_items.append({
                "name": item.select_one(".title").text.strip(),
                "price": item.select_one(".price").text.strip(),
            })
        print(f"Page {page}: {len(items)} items")
    return all_items

data = scrape_numbered_pages("https://example.com/products")
```
Stop condition is critical. Without it, you'll loop forever or hammer the server with 404 requests. Check for:
- Empty result set
- Non-200 status code
- A "no results" message on the page
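The three checks can be folded into one helper so the loop body stays clean. This is a minimal sketch; the `"no results"` string check is site-specific, so adapt it to whatever message the target site actually renders:

```python
def should_stop(status_code, items, page_text=""):
    """Return True when pagination has reached the end.

    Covers the three stop conditions: non-200 status, an empty
    result set, and a "no results" message in the page body.
    """
    if status_code != 200:
        return True
    if not items:
        return True
    if "no results" in page_text.lower():  # site-specific; adjust as needed
        return True
    return False

# Inside the loop:
#     if should_stop(resp.status_code, items, resp.text):
#         break
```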
2. Offset and Limit Pagination
Common in APIs. Instead of page numbers, the URL uses offset and limit parameters.
```
/api/products?offset=0&limit=20
/api/products?offset=20&limit=20
/api/products?offset=40&limit=20
```
```python
import requests

def scrape_offset_api(api_url, limit=20):
    all_items = []
    offset = 0
    total = None
    while True:
        resp = requests.get(
            api_url, params={"offset": offset, "limit": limit}, timeout=10
        )
        data = resp.json()
        items = data.get("results", [])
        if not items:
            break
        all_items.extend(items)
        offset += limit

        # Some APIs tell you the total count
        total = data.get("total_count")
        if total and offset >= total:
            break
        print(f"Fetched {len(all_items)}/{total or '?'} items")
    return all_items
```
Pro tip: Many APIs return a total_count or has_more field. Always check for it to avoid unnecessary requests.
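To make the savings concrete, here is a generic offset paginator that honors `total_count` when present, demonstrated against a fake in-memory endpoint — `fake_fetch` is a stand-in for the real `requests.get(...).json()` call, not any particular API:

```python
def paginate_offset(fetch, limit=20):
    """Collect all items from an offset/limit endpoint.

    `fetch(offset, limit)` must return a dict shaped like the API
    response: {"results": [...], "total_count": N}.
    """
    all_items, offset = [], 0
    while True:
        data = fetch(offset, limit)
        items = data.get("results", [])
        if not items:
            break
        all_items.extend(items)
        offset += limit
        # Stop as soon as total_count says we have everything,
        # saving one final empty request.
        total = data.get("total_count")
        if total is not None and offset >= total:
            break
    return all_items

# Fake endpoint standing in for a real HTTP call.
DATASET = list(range(45))

def fake_fetch(offset, limit):
    return {"results": DATASET[offset:offset + limit], "total_count": len(DATASET)}

items = paginate_offset(fake_fetch)
print(len(items))  # 45 items collected in 3 requests instead of 4
```

Without the `total_count` check, the loop would make a fourth request just to receive an empty page and stop.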
3. Cursor-Based Pagination
Modern APIs (Twitter, Shopify, GitHub) use cursors instead of page numbers. Each response includes a token pointing to the next batch.
```json
{
  "data": ["..."],
  "pagination": {
    "next_cursor": "eyJpZCI6MTAwfQ==",
    "has_next": true
  }
}
```
```python
import requests

def scrape_cursor_api(api_url, headers=None):
    all_items = []
    cursor = None
    while True:
        params = {}
        if cursor:
            params["cursor"] = cursor
        resp = requests.get(api_url, params=params, headers=headers or {}, timeout=10)
        data = resp.json()
        items = data.get("data", [])
        all_items.extend(items)

        pagination = data.get("pagination", {})
        cursor = pagination.get("next_cursor")
        # Stop when there's no cursor, or the API explicitly says no next page.
        # (Don't default a missing has_next to False — some APIs omit it.)
        if not cursor or pagination.get("has_next") is False:
            break
        print(f"Fetched {len(all_items)} items, next cursor: {cursor[:20]}...")
    return all_items
```
Why cursors? They're stable. With page numbers, if someone inserts a row while you're scraping, you'll get duplicates or miss items. Cursors always point to a consistent position.
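The drift problem is easy to demonstrate with a toy dataset: insert one row between two page fetches of a newest-first list and the page boundary shifts, re-serving an item. This is a self-contained simulation of both schemes, not a real API:

```python
def get_page(rows, page, per_page=3):
    """Page-number pagination over a newest-first list."""
    start = (page - 1) * per_page
    return rows[start:start + per_page]

def get_after_cursor(rows, cursor, per_page=3):
    """Cursor pagination: everything strictly older than `cursor`."""
    idx = rows.index(cursor)
    return rows[idx + 1:idx + 1 + per_page]

rows = ["e", "d", "c", "b", "a"]        # newest first
page1 = get_page(rows, 1)                # ['e', 'd', 'c']

rows.insert(0, "f")                      # new row arrives mid-scrape
page2 = get_page(rows, 2)                # ['c', 'b', 'a'] — 'c' again!
print(sorted(set(page1) & set(page2)))   # ['c'] appears on both pages

# The cursor version is unaffected by the insert:
print(get_after_cursor(rows, "c"))       # ['b', 'a'] — no duplicates
```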
4. Infinite Scroll
No "Next" button. New content loads when you scroll to the bottom. This is the hardest pattern because it requires JavaScript execution.
Under the hood, infinite scroll usually triggers an AJAX request. Your first move should be to open DevTools, go to the Network tab, and look for that API call. If you find it, scrape the API directly (much faster).
If there's no accessible API, use Playwright:
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_infinite_scroll(url, max_scrolls=20):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)

        previous_height = 0
        scroll_count = 0
        while scroll_count < max_scrolls:
            # Scroll to bottom
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(2000)  # Wait for content to load

            # Check if page height changed
            current_height = await page.evaluate("document.body.scrollHeight")
            if current_height == previous_height:
                break  # No new content loaded
            previous_height = current_height
            scroll_count += 1
            print(f"Scroll {scroll_count}: height = {current_height}")

        # Now extract all loaded items
        items = await page.query_selector_all(".item-card")
        results = []
        for item in items:
            title = await item.query_selector(".title")
            results.append(await title.inner_text() if title else "N/A")

        await browser.close()
        return results

data = asyncio.run(scrape_infinite_scroll("https://example.com/feed"))
```
Key details:
- Always set `max_scrolls` to prevent infinite loops
- The `wait_for_timeout(2000)` gives the AJAX request time to complete
- Compare `scrollHeight` before and after to detect when you've reached the end
Tip: Running headless browsers at scale? A reliable proxy service like ThorData prevents IP blocks when making hundreds of browser-based requests. For a managed approach, ScraperAPI handles browser rendering and proxy rotation in a single API call.
5. "Load More" Buttons
Similar to infinite scroll, but requires clicking a button. Again, check DevTools first since the button usually triggers an API call you can replicate directly.
Approach A: Replicate the API call
```python
import requests

def scrape_load_more_api(api_url):
    all_items = []
    page = 1
    while True:
        resp = requests.post(api_url, json={"page": page, "per_page": 24}, timeout=10)
        data = resp.json()
        items = data.get("items", [])
        if not items:
            break
        all_items.extend(items)
        page += 1
    return all_items
```
Approach B: Click the button with Playwright
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_load_more_button(url, button_selector=".load-more-btn"):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)

        while True:
            try:
                button = await page.wait_for_selector(
                    button_selector, timeout=5000
                )
                await button.click()
                await page.wait_for_timeout(1500)
            except Exception:
                break  # Button no longer appears

        # Extract all items
        items = await page.query_selector_all(".product")
        results = []
        for item in items:
            text = await item.inner_text()
            results.append(text)

        await browser.close()
        return results
```
Choosing the Right Approach
| Pattern | Detection | Best Tool |
|---|---|---|
| URL page numbers | `?page=N` in URL | `requests` + BeautifulSoup |
| Offset/limit | `?offset=N&limit=M` in API | `requests` |
| Cursor-based | `next_cursor` in response | `requests` |
| Infinite scroll | Content loads on scroll | Playwright (or find hidden API) |
| Load more button | Button triggers new content | Playwright (or find hidden API) |
Always check the Network tab first. 90% of "infinite scroll" and "load more" sites have a clean API underneath. Scraping that API is 10x faster and more reliable than browser automation.
Handling Pagination at Scale
When you're scraping thousands of pages, you need to think about:
Rate limiting: Space out your requests. A 1-2 second delay between pages prevents bans and is polite to the server.
```python
import time
import random

delay = random.uniform(1.0, 2.5)
time.sleep(delay)
```
Retries: Network errors happen. Wrap your requests in retry logic:
```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
def fetch_page(url):
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp
```
Proxy rotation: After a few hundred requests, many sites will throttle or block your IP. Use rotating proxies to distribute your traffic.
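A simple way to rotate is to cycle through a pool with `itertools.cycle`. This is a minimal sketch: the proxy URLs below are hypothetical placeholders, so substitute your provider's actual addresses and credentials:

```python
import itertools

import requests

# Hypothetical proxy endpoints — substitute your provider's addresses.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]
proxy_pool = itertools.cycle(PROXIES)

def fetch_with_rotation(url):
    """Send each request through the next proxy in the pool."""
    proxy = next(proxy_pool)
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
```

Round-robin cycling is the simplest policy; a production setup would also drop proxies that repeatedly fail or get banned.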
For high-volume pagination scraping, ScraperAPI handles retries, proxy rotation, and CAPTCHA solving automatically. ThorData offers residential proxies if you prefer managing rotation yourself.
Checkpointing: Save your progress so you can resume if the scraper crashes:
```python
import json

def save_checkpoint(page_num, items, filename="checkpoint.json"):
    with open(filename, "w") as f:
        json.dump({"last_page": page_num, "items": items}, f)

def load_checkpoint(filename="checkpoint.json"):
    try:
        with open(filename) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"last_page": 0, "items": []}
```
Common Pitfalls
- No stop condition — Your scraper runs forever, making thousands of empty requests
- Not handling duplicates — Page boundaries shift, causing repeated items
- Ignoring rate limits — Getting your IP banned after 50 pages
- Scraping HTML when an API exists — Always check DevTools Network tab first
- Hardcoding page counts — Sites add and remove pages. Always detect the end dynamically
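For the duplicates pitfall, tracking a set of seen IDs as you merge pages is usually enough. A minimal sketch, assuming each item carries a unique `id` field (adjust the key to whatever uniquely identifies items on your target site):

```python
def dedupe_items(pages):
    """Merge paginated results, skipping items whose id was already seen."""
    seen, unique = set(), []
    for page in pages:
        for item in page:
            if item["id"] in seen:
                continue  # page boundary shifted; item repeated
            seen.add(item["id"])
            unique.append(item)
    return unique

pages = [
    [{"id": 1}, {"id": 2}, {"id": 3}],
    [{"id": 3}, {"id": 4}],  # id 3 repeated after a boundary shift
]
print(len(dedupe_items(pages)))  # 4
```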
Wrapping Up
Pagination is a solved problem once you identify which pattern a site uses. Start with the simplest approach (URL parameters), check for hidden APIs before reaching for browser automation, and always build in stop conditions and retry logic.
The patterns in this guide cover 95% of real-world pagination. For the remaining edge cases involving token-based authentication, CAPTCHA-gated pages, or custom JavaScript, a managed service like ScraperAPI can save you days of debugging.
Need proxies for large-scale scraping? ThorData offers residential and datacenter proxies optimized for web scraping workloads.