DEV Community

agenthustler

How to Handle Pagination in Web Scraping: URL Patterns, Infinite Scroll, and Load More

Every scraper eventually hits a wall: the data you need spans multiple pages. Pagination is the single most common challenge in web scraping, and the approach varies wildly depending on the site.

This guide covers the five main pagination patterns you'll encounter, with working Python code for each.

1. URL-Based Page Numbers

The simplest pattern. The page number appears directly in the URL.

https://example.com/products?page=1
https://example.com/products?page=2
https://example.com/products/page/3

Solution: Increment the page number in a loop.

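A minimal sketch of that loop. The actual fetching is left to a `fetch_page` callable you supply (for instance a `requests.get` against the real URL that returns `[]` on a non-200 status), so the loop and its stop condition stay visible:

```python
import time

def scrape_numbered_pages(fetch_page, max_pages=500, delay=1.0):
    """Walk ?page=1, ?page=2, ... until a page comes back empty.

    fetch_page(page_num) should return a list of items, or an empty
    list when the page has no results (the stop condition).
    """
    all_items = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:       # empty page -> we've run out of results
            break
        all_items.extend(items)
        time.sleep(delay)   # be polite between requests
    return all_items
```

The `max_pages` cap is a safety net on top of the empty-page check, so a site that returns filler content forever can't trap the loop.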

A stop condition is critical. Without one, you'll loop forever or hammer the server with 404 requests. Check for:

  • Empty result set
  • Non-200 status code
  • A "no results" message on the page

2. Offset and Limit Pagination

Common in APIs. Instead of page numbers, the URL uses offset and limit parameters.

/api/products?offset=0&limit=20
/api/products?offset=20&limit=20
/api/products?offset=40&limit=20
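A sketch of the offset walk. Here `fetch(offset, limit)` is a placeholder for the real request — against a live API it might be `requests.get(url, params={"offset": offset, "limit": limit}).json()["items"]`:

```python
def scrape_offset_pages(fetch, limit=20, max_requests=500):
    """Page through an offset/limit API until a short or empty batch.

    fetch(offset, limit) should return the list of items starting at
    that offset. A batch shorter than limit signals the last page.
    """
    items = []
    for _ in range(max_requests):
        batch = fetch(len(items), limit)
        items.extend(batch)
        if len(batch) < limit:  # short batch -> we've hit the end
            break
    return items
```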

Pro tip: Many APIs return a total_count or has_more field. Always check for it to avoid unnecessary requests.

3. Cursor-Based Pagination

Modern APIs (Twitter, Shopify, GitHub) use cursors instead of page numbers. Each response includes a token pointing to the next batch.

{
  "data": ["..."],
  "pagination": {
    "next_cursor": "eyJpZCI6MTAwfQ==",
    "has_next": true
  }
}
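A sketch of the cursor-following loop, assuming responses shaped like the JSON above. `fetch(cursor)` is a placeholder for the real request (pass the cursor as whatever query parameter the API expects, with `None` for the first call):

```python
def scrape_with_cursor(fetch, max_requests=500):
    """Follow next_cursor tokens until has_next is false.

    fetch(cursor) returns a dict shaped like the example above:
    {"data": [...], "pagination": {"next_cursor": ..., "has_next": ...}}.
    """
    items, cursor = [], None
    for _ in range(max_requests):
        resp = fetch(cursor)
        items.extend(resp["data"])
        page = resp["pagination"]
        if not page.get("has_next"):  # API says there's nothing left
            break
        cursor = page["next_cursor"]  # opaque token for the next batch
    return items
```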

Why cursors? They're stable. With page numbers, if someone inserts a row while you're scraping, you'll get duplicates or miss items. Cursors always point to a consistent position.

4. Infinite Scroll

No "Next" button. New content loads when you scroll to the bottom. This is the hardest pattern because it requires JavaScript execution.

Under the hood, infinite scroll usually triggers an AJAX request. Your first move should be to open DevTools, go to the Network tab, and look for that API call. If you find it, scrape the API directly (much faster).

If there's no accessible API, use Playwright:

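A sketch with Playwright's sync API (requires `pip install playwright` and `playwright install chromium`; the URL is whatever page you're targeting):

```python
def scroll_to_bottom(url, max_scrolls=20):
    """Scroll a page until its height stops growing, then return the HTML."""
    # Imported inside the function so the module loads without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        for _ in range(max_scrolls):  # hard cap: never loop forever
            previous_height = page.evaluate("document.body.scrollHeight")
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(2000)  # let the AJAX request complete
            new_height = page.evaluate("document.body.scrollHeight")
            if new_height == previous_height:  # no growth -> reached the end
                break
        html = page.content()
        browser.close()
        return html
```

Parse the returned HTML with BeautifulSoup as usual once all content is loaded.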

Key details:

  • Always set max_scrolls to prevent infinite loops
  • The wait_for_timeout(2000) gives the AJAX request time to complete
  • Compare scrollHeight before and after to detect when you've reached the end

Tip: Running headless browsers at scale? A reliable proxy service like ThorData prevents IP blocks when making hundreds of browser-based requests. For a managed approach, ScraperAPI handles browser rendering and proxy rotation in a single API call.

5. "Load More" Buttons

Similar to infinite scroll, but requires clicking a button. Again, check DevTools first since the button usually triggers an API call you can replicate directly.

Approach A: Replicate the API call

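A sketch of replaying that request. The endpoint, parameter name, and `items` key below are hypothetical — copy the real URL, params, and headers from the Network tab ("Copy as cURL" makes this easy):

```python
def fetch_batch(page_num):
    """Replicate the XHR behind the Load More button (found in DevTools)."""
    import requests  # third-party: pip install requests

    resp = requests.get(
        "https://example.com/api/items",      # hypothetical endpoint
        params={"page": page_num},
        headers={"X-Requested-With": "XMLHttpRequest"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["items"]
```

Loop over `fetch_batch(1)`, `fetch_batch(2)`, … with the same stop conditions as URL-based pagination.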

Approach B: Click the button with Playwright

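A sketch of the click loop (the `button.load-more` selector is a guess — inspect the real page for the actual one):

```python
def click_load_more(url, button_selector="button.load-more", max_clicks=50):
    """Click a Load More button until it disappears, then return the HTML."""
    from playwright.sync_api import sync_playwright  # pip install playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        for _ in range(max_clicks):  # hard cap on clicks
            button = page.locator(button_selector)
            if button.count() == 0 or not button.first.is_visible():
                break                # button gone -> everything is loaded
            button.first.click()
            page.wait_for_timeout(2000)  # let the new items render
        html = page.content()
        browser.close()
        return html
```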

Choosing the Right Approach

| Pattern          | Detection                    | Best Tool                       |
| ---------------- | ---------------------------- | ------------------------------- |
| URL page numbers | ?page=N in URL               | requests + BeautifulSoup        |
| Offset/limit     | ?offset=N&limit=M in API     | requests                        |
| Cursor-based     | next_cursor in response      | requests                        |
| Infinite scroll  | Content loads on scroll      | Playwright (or find hidden API) |
| Load more button | Button triggers new content  | Playwright (or find hidden API) |

Always check the Network tab first. 90% of "infinite scroll" and "load more" sites have a clean API underneath. Scraping that API is 10x faster and more reliable than browser automation.

Handling Pagination at Scale

When you're scraping thousands of pages, you need to think about:

Rate limiting: Space out your requests. A 1-2 second delay between pages prevents bans and is polite to the server.

import time
import random

delay = random.uniform(1.0, 2.5)
time.sleep(delay)

Retries: Network errors happen. Wrap your requests in retry logic:

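A minimal retry wrapper with exponential backoff. `fetch` is any zero-argument callable that raises on failure — for example `lambda: requests.get(url, timeout=10)`:

```python
import time

def fetch_with_retries(fetch, max_attempts=3, backoff=2.0):
    """Call fetch() until it succeeds, sleeping longer after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise                       # out of attempts: surface the error
            time.sleep(backoff ** attempt)  # e.g. 2s, 4s, 8s, ...
```

In production, catch only the exceptions you expect (e.g. `requests.RequestException`) so genuine bugs still fail fast.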

Proxy rotation: After a few hundred requests, many sites will throttle or block your IP. Use rotating proxies to distribute your traffic.

For high-volume pagination scraping, ScraperAPI handles retries, proxy rotation, and CAPTCHA solving automatically. ThorData offers residential proxies if you prefer managing rotation yourself.

Checkpointing: Save your progress so you can resume if the scraper crashes:

import json

def save_checkpoint(page_num, items, filename="checkpoint.json"):
    with open(filename, "w") as f:
        json.dump({"last_page": page_num, "items": items}, f)

def load_checkpoint(filename="checkpoint.json"):
    try:
        with open(filename) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"last_page": 0, "items": []}

Common Pitfalls

  1. No stop condition — Your scraper runs forever, making thousands of empty requests
  2. Not handling duplicates — Page boundaries shift, causing repeated items
  3. Ignoring rate limits — Getting your IP banned after 50 pages
  4. Scraping HTML when an API exists — Always check DevTools Network tab first
  5. Hardcoding page counts — Sites add and remove pages. Always detect the end dynamically

Wrapping Up

Pagination is a solved problem once you identify which pattern a site uses. Start with the simplest approach (URL parameters), check for hidden APIs before reaching for browser automation, and always build in stop conditions and retry logic.

The patterns in this guide cover 95% of real-world pagination. For the remaining edge cases involving token-based authentication, CAPTCHA-gated pages, or custom JavaScript, a managed service like ScraperAPI can save you days of debugging.


Need proxies for large-scale scraping? ThorData offers residential and datacenter proxies optimized for web scraping workloads.
