DEV Community

agenthustler

How to Handle Pagination in Web Scraping: URL Patterns, Infinite Scroll, and Load More

Every scraper eventually hits a wall: the data you need spans multiple pages. Pagination is the single most common challenge in web scraping, and the approach varies wildly depending on the site.

This guide covers the five main pagination patterns you'll encounter, with working Python code for each.

1. URL-Based Page Numbers

The simplest pattern. The page number appears directly in the URL.

https://example.com/products?page=1
https://example.com/products?page=2
https://example.com/products/page/3

Solution: Increment the page number in a loop.

import requests
from bs4 import BeautifulSoup

def scrape_numbered_pages(base_url, max_pages=50):
    all_items = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        resp = requests.get(url, timeout=10)

        if resp.status_code != 200:
            break

        soup = BeautifulSoup(resp.text, "html.parser")
        items = soup.select(".product-card")

        if not items:  # No more results
            break

        for item in items:
            all_items.append({
                "name": item.select_one(".title").text.strip(),
                "price": item.select_one(".price").text.strip(),
            })

        print(f"Page {page}: {len(items)} items")

    return all_items

data = scrape_numbered_pages("https://example.com/products")

A stop condition is critical. Without one, you'll loop forever or hammer the server with 404 requests. Check for:

  • Empty result set
  • Non-200 status code
  • A "no results" message on the page
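
The first two checks are already in the function above; the third isn't. A minimal sketch of it, assuming you've inspected the target site and found its empty-state markup (the marker strings below are hypothetical placeholders — in practice you'd match the site's real "no results" element, e.g. with a BeautifulSoup selector):

```python
def page_says_no_results(html):
    # Hypothetical marker strings -- replace with the site's actual empty-state text/classes
    markers = ("No results found", "no-results", "empty-state")
    return any(marker in html for marker in markers)
```

Call it right after fetching each page and break out of the loop when it returns True.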

2. Offset and Limit Pagination

Common in APIs. Instead of page numbers, the URL uses offset and limit parameters.

/api/products?offset=0&limit=20
/api/products?offset=20&limit=20
/api/products?offset=40&limit=20
import requests

def scrape_offset_api(api_url, limit=20):
    all_items = []
    offset = 0

    while True:
        resp = requests.get(api_url, params={"offset": offset, "limit": limit})
        data = resp.json()

        items = data.get("results", [])
        if not items:
            break

        all_items.extend(items)
        offset += limit

        # Some APIs tell you the total count
        total = data.get("total_count")
        if total and offset >= total:
            break

        print(f"Fetched {len(all_items)}/{total or '?'} items")

    return all_items

Pro tip: Many APIs return a total_count or has_more field. Always check for it to avoid unnecessary requests.
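Checking both signals could look like this — note that `has_more`, `total_count`, and `results` are assumed field names, so adapt them to whatever the API you're hitting actually returns:

```python
def should_continue(payload, fetched_so_far):
    # Assumed field names (has_more, total_count, results) -- adapt to your API
    if "has_more" in payload:
        return bool(payload["has_more"])
    total = payload.get("total_count")
    if total is not None:
        return fetched_so_far < total
    # No pagination metadata at all: keep going until a batch comes back empty
    return bool(payload.get("results"))
```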

3. Cursor-Based Pagination

Modern APIs (Twitter, Shopify, GitHub) use cursors instead of page numbers. Each response includes a token pointing to the next batch.

{
  "data": ["..."],
  "pagination": {
    "next_cursor": "eyJpZCI6MTAwfQ==",
    "has_next": true
  }
}
import requests

def scrape_cursor_api(api_url, headers=None):
    all_items = []
    cursor = None

    while True:
        params = {}
        if cursor:
            params["cursor"] = cursor

        resp = requests.get(api_url, params=params, headers=headers or {})
        data = resp.json()

        items = data.get("data", [])
        all_items.extend(items)

        pagination = data.get("pagination", {})
        cursor = pagination.get("next_cursor")

        # Stop when there's no cursor, or the API explicitly says there's no next page
        if not cursor or pagination.get("has_next") is False:
            break

        print(f"Fetched {len(all_items)} items, next cursor: {cursor[:20]}...")

    return all_items

Why cursors? They're stable. With page numbers, if someone inserts a row while you're scraping, you'll get duplicates or miss items. Cursors always point to a consistent position.

4. Infinite Scroll

No "Next" button. New content loads when you scroll to the bottom. This is the hardest pattern because it requires JavaScript execution.

Under the hood, infinite scroll usually triggers an AJAX request. Your first move should be to open DevTools, go to the Network tab, and look for that API call. If you find it, scrape the API directly (much faster).

If there's no accessible API, use Playwright:

import asyncio
from playwright.async_api import async_playwright

async def scrape_infinite_scroll(url, max_scrolls=20):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)

        previous_height = 0
        scroll_count = 0

        while scroll_count < max_scrolls:
            # Scroll to bottom
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(2000)  # Wait for content to load

            # Check if page height changed
            current_height = await page.evaluate("document.body.scrollHeight")
            if current_height == previous_height:
                break  # No new content loaded

            previous_height = current_height
            scroll_count += 1
            print(f"Scroll {scroll_count}: height = {current_height}")

        # Now extract all loaded items
        items = await page.query_selector_all(".item-card")
        results = []
        for item in items:
            title = await item.query_selector(".title")
            results.append(await title.inner_text() if title else "N/A")

        await browser.close()
        return results

data = asyncio.run(scrape_infinite_scroll("https://example.com/feed"))

Key details:

  • Always set max_scrolls to prevent infinite loops
  • The wait_for_timeout(2000) gives the AJAX request time to complete
  • Compare scrollHeight before and after to detect when you've reached the end

Tip: Running headless browsers at scale? A reliable proxy service like ThorData prevents IP blocks when making hundreds of browser-based requests. For a managed approach, ScraperAPI handles browser rendering and proxy rotation in a single API call.

5. "Load More" Buttons

Similar to infinite scroll, but requires clicking a button. Again, check DevTools first since the button usually triggers an API call you can replicate directly.

Approach A: Replicate the API call

import requests

def scrape_load_more_api(api_url):
    all_items = []
    page = 1

    while True:
        resp = requests.post(api_url, json={"page": page, "per_page": 24})
        data = resp.json()

        items = data.get("items", [])
        if not items:
            break

        all_items.extend(items)
        page += 1

    return all_items

Approach B: Click the button with Playwright

import asyncio
from playwright.async_api import async_playwright

async def scrape_load_more_button(url, button_selector=".load-more-btn"):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)

        while True:
            try:
                button = await page.wait_for_selector(
                    button_selector, timeout=5000
                )
                await button.click()
                await page.wait_for_timeout(1500)
            except Exception:
                break  # Button no longer appears

        # Extract all items
        items = await page.query_selector_all(".product")
        results = []
        for item in items:
            text = await item.inner_text()
            results.append(text)

        await browser.close()
        return results

Choosing the Right Approach

Pattern             Detection                     Best Tool
URL page numbers    ?page=N in URL                requests + BeautifulSoup
Offset/limit        ?offset=N&limit=M in API      requests
Cursor-based        next_cursor in response       requests
Infinite scroll     Content loads on scroll       Playwright (or find hidden API)
Load more button    Button triggers new content   Playwright (or find hidden API)

Always check the Network tab first. 90% of "infinite scroll" and "load more" sites have a clean API underneath. Scraping that API is 10x faster and more reliable than browser automation.

Handling Pagination at Scale

When you're scraping thousands of pages, you need to think about:

Rate limiting: Space out your requests. A 1-2 second delay between pages prevents bans and is polite to the server.

import time
import random

delay = random.uniform(1.0, 2.5)
time.sleep(delay)

Retries: Network errors happen. Wrap your requests in retry logic:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
def fetch_page(url):
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp

Proxy rotation: After a few hundred requests, many sites will throttle or block your IP. Use rotating proxies to distribute your traffic.
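A simple round-robin rotator is often enough to start with; the dict it returns plugs straight into `requests.get(..., proxies=...)`. The proxy URLs here are placeholders — substitute your provider's actual gateway addresses:

```python
import itertools

def make_proxy_rotator(proxy_urls):
    # Cycle through the pool; each call hands back the next proxy in requests format
    pool = itertools.cycle(proxy_urls)
    def next_proxies():
        proxy = next(pool)
        return {"http": proxy, "https": proxy}
    return next_proxies
```

Usage: `rotator = make_proxy_rotator([...])`, then `requests.get(url, proxies=rotator(), timeout=10)` inside your pagination loop.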

For high-volume pagination scraping, ScraperAPI handles retries, proxy rotation, and CAPTCHA solving automatically. ThorData offers residential proxies if you prefer managing rotation yourself.

Checkpointing: Save your progress so you can resume if the scraper crashes:

import json

def save_checkpoint(page_num, items, filename="checkpoint.json"):
    with open(filename, "w") as f:
        json.dump({"last_page": page_num, "items": items}, f)

def load_checkpoint(filename="checkpoint.json"):
    try:
        with open(filename) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"last_page": 0, "items": []}
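Resuming is then just: load the checkpoint and start one page past `last_page`. A standalone sketch (it repeats `load_checkpoint` so it runs on its own; the filename is a placeholder):

```python
import json

def load_checkpoint(filename="checkpoint.json"):
    # Same helper as above, repeated so this sketch is self-contained
    try:
        with open(filename) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"last_page": 0, "items": []}

state = load_checkpoint("nonexistent-checkpoint.json")  # fresh run: no file yet
start_page = state["last_page"] + 1  # resume one page past the last saved one
items = list(state["items"])
# ...scrape from start_page onward, calling save_checkpoint after each page...
```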

Common Pitfalls

  1. No stop condition — Your scraper runs forever, making thousands of empty requests
  2. Not handling duplicates — Page boundaries shift, causing repeated items
  3. Ignoring rate limits — Getting your IP banned after 50 pages
  4. Scraping HTML when an API exists — Always check DevTools Network tab first
  5. Hardcoding page counts — Sites add and remove pages. Always detect the end dynamically
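
Pitfall 2 has a cheap fix: dedupe on a stable key as you collect. The `id` field below is an assumption — use whatever unique identifier (SKU, URL, slug) the site actually exposes:

```python
def dedupe_by_key(items, key="id"):
    # Keep the first occurrence of each key; later duplicates from shifted pages are dropped
    seen = set()
    unique = []
    for item in items:
        k = item.get(key)
        if k in seen:
            continue
        seen.add(k)
        unique.append(item)
    return unique
```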

Wrapping Up

Pagination is a solved problem once you identify which pattern a site uses. Start with the simplest approach (URL parameters), check for hidden APIs before reaching for browser automation, and always build in stop conditions and retry logic.

The patterns in this guide cover 95% of real-world pagination. For the remaining edge cases involving token-based authentication, CAPTCHA-gated pages, or custom JavaScript, a managed service like ScraperAPI can save you days of debugging.


Need proxies for large-scale scraping? ThorData offers residential and datacenter proxies optimized for web scraping workloads.
