Scraping Paginated Sites Without Getting It Wrong


Pagination is where a lot of scrapers quietly go wrong — not with errors, but with missing data. A scraper that stops two pages early produces no exceptions. Neither does one that re-fetches the same page in a loop. The output just looks thin, and you may not notice until you check it against the actual record count.

There are three common pagination patterns. Identifying which one you're dealing with takes 30 seconds in DevTools; implementing it correctly takes another ten minutes. This post covers all three.

Pattern 1: Page number in the URL

The simplest pattern. The URL contains a page parameter — either as a query string (?page=2) or as part of the path (/catalogue/page-2.html). Increment it until you get a 404 or an empty result set.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://books.toscrape.com/catalogue/"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

all_books = []
page = 1

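while True:
    # Build each page URL from BASE instead of repeating the prefix
    url = urljoin(BASE, f"page-{page}.html")
    resp = session.get(url, timeout=15)

    if resp.status_code == 404:
        break  # past the last page
    resp.raise_for_status()  # any other error status should fail loudly, not end the crawl quietly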

    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")
    books = soup.find_all("article", class_="product_pod")

    if not books:
        break  # empty page — also done

    for book in books:
        all_books.append({
            "title":  book.find("h3").find("a")["title"],
            "price":  book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        })

    print(f"Page {page}: {len(books)} books")
    page += 1

print(f"\nTotal: {len(all_books)} books")

Two termination conditions, not one. Some sites return an empty 200 for out-of-range pages rather than a 404. Check both.
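The loop above covers the path-style variant. For the query-string variant (?page=2), a small helper can set the page parameter without disturbing any other parameters in the URL. This is a minimal sketch using only the standard library; the function name with_page, the default parameter name "page", and the example.com URL are assumptions for illustration, not part of any site's API.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url: str, page: int, param: str = "page") -> str:
    """Return url with its page query parameter set to the given number."""
    parts = urlsplit(url)
    # Keep every existing parameter except the page one, preserving order
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical listing URL that already carries a sort parameter
print(with_page("https://example.com/items?sort=price&page=1", 2))
# -> https://example.com/items?sort=price&page=2
```

The same two termination checks apply: stop on a 404, and stop when the page parses to zero items.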

Pattern 2: Following the "next" link

A cleaner approach for HTML-paginated sites: let the page tell you where to go next, rather than constructing URLs yourself. Most paginated sites include a "Next" link in the HTML. Follow it until it disappears.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://books.toscrape.com/catalogue/"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

url = urljoin(BASE, "page-1.html")
all_books = []

while url:
    resp = session.get(url, timeout=15)
    resp.raise_for_status()  # an error page has no "next" link and would end the loop silently
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")

    for book in soup.find_all("article", class_="product_pod"):
        all_books.append({
            "title":  book.find("h3").find("a")["title"],
            "price":  book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        })

    next_btn = soup.select_one("li.next a")
    # Resolve the href against the page it came from, not a fixed base
    url = urljoin(resp.url, next_btn["href"]) if next_btn else None

print(f"Scraped {len(all_books)} books")

The urljoin(resp.url, next_btn["href"]) call is worth noting. The href in a "next" link is often relative (page-2.html, ../page-2.html), and a relative path only has meaning relative to the page that contains it. Passing resp.url as the base lets urljoin resolve the link correctly whatever form it takes, including after a redirect. Resolving against a hard-coded base, or concatenating strings, works only while every page happens to sit in the same directory.
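To make the difference concrete, here is a small sketch that resolves the same target written three ways and compares the result with plain string concatenation. The href values are illustrative spellings, not necessarily what books.toscrape.com serves.

```python
from urllib.parse import urljoin

# Page the "next" link was found on
page_url = "https://books.toscrape.com/catalogue/page-1.html"

# The same target written as a sibling path, a parent-relative path,
# and a root-relative path
hrefs = ["page-2.html", "../catalogue/page-2.html", "/catalogue/page-2.html"]

resolved = [urljoin(page_url, href) for href in hrefs]
naive = ["https://books.toscrape.com/catalogue/" + href for href in hrefs]

for href, good, bad in zip(hrefs, resolved, naive):
    print(f"{href!r}\n  urljoin: {good}\n  concat:  {bad}")
```

All three urljoin results come out as https://books.toscrape.com/catalogue/page-2.html. Concatenation gets only the first one right: the second keeps a literal ".." segment, and the third produces a doubled /catalogue//catalogue/ path.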

Pattern 3: API cursor / continuation token

JSON APIs often paginate differently. Instead of page numbers, they return a token or flag telling you whether more results exist, and sometimes a cursor to pass back on the next request.

The simplest version: a has_next boolean.

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "application/json",
    "Accept-Language": "en-GB,en;q=0.9",
})

all_quotes = []
page = 1

while True:
    resp = session.get(
        "https://quotes.toscrape.com/api/quotes",
        params={"page": page},
        timeout=15,
    )
    resp.raise_for_status()
    data = resp.json()

    all_quotes.extend(data["quotes"])
    print(f"Page {page}: {len(data['quotes'])} quotes")

    if not data["has_next"]:
        break
    page += 1

print(f"\nTotal: {len(all_quotes)} quotes")

Some APIs use a cursor instead — the response includes a next_cursor or next_page_token field that you pass as a parameter on the subsequent request. The structure changes but the loop logic is the same: keep going until the cursor field is null or absent.

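# Generic cursor pattern (example.com and the field names are placeholders)
import requests

session = requests.Session()
items = []
params = {"limit": 100}

while True:
    resp = session.get("https://example.com/api/items", params=params, timeout=15)
    resp.raise_for_status()
    data = resp.json()

    items.extend(data["results"])

    cursor = data.get("next_cursor")
    if not cursor:
        break  # null, empty, or missing cursor means no more pages
    params["cursor"] = cursor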
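The cursor loop can also be wrapped in a generator so the stopping logic is written once and can be exercised without a network. The sketch below is one possible shape, not a standard API: iter_cursor_pages, the fetch callable, and the "results" / "next_cursor" field names are assumptions carried over from the generic example, and the in-memory FAKE_PAGES dict stands in for a real endpoint. It adds two guards against an API that misbehaves: a repeated-cursor check and a page cap.

```python
from typing import Any, Callable, Iterator, Optional


def iter_cursor_pages(
    fetch: Callable[[Optional[str]], dict],
    max_pages: int = 1000,
) -> Iterator[Any]:
    """Yield items from a cursor-paginated source until the cursor runs out."""
    cursor = None
    seen = set()
    for _ in range(max_pages):
        data = fetch(cursor)
        yield from data["results"]
        cursor = data.get("next_cursor")
        if not cursor:
            return  # null, empty, or missing cursor: no more pages
        if cursor in seen:
            raise RuntimeError(f"cursor {cursor!r} repeated, stopping")
        seen.add(cursor)
    raise RuntimeError(f"gave up after {max_pages} pages")


# Fake in-memory API standing in for a real endpoint
FAKE_PAGES = {
    None: {"results": [1, 2], "next_cursor": "a"},
    "a": {"results": [3, 4], "next_cursor": "b"},
    "b": {"results": [5], "next_cursor": None},
}

items = list(iter_cursor_pages(lambda cursor: FAKE_PAGES[cursor]))
print(items)  # [1, 2, 3, 4, 5]
```

With requests, fetch would be a small function that calls session.get with the cursor in params, calls raise_for_status(), and returns resp.json().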

Rate limiting

Sending requests as fast as the network allows is not scraping; it's a load test. Many sites rate-limit or block traffic that arrives faster than a human could generate it. A 1-2 second delay between pages is a reasonable starting point; adjust based on the site's response times and any explicit rate-limit signals it sends, such as a 429 status or a Retry-After header.

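import time

# Continues the Pattern 2 loop: session, url, BeautifulSoup and urljoin are set up there
while url:
    resp = session.get(url, timeout=15)
    soup = BeautifulSoup(resp.text, "html.parser")
    # ... process page ...

    next_btn = soup.select_one("li.next a")
    url = urljoin(resp.url, next_btn["href"]) if next_btn else None

    if url:
        time.sleep(1)  # only sleep if there's another request coming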

Sleeping after the last page is unnecessary. Put the sleep before the next request or, as above, after confirming there is a next request.
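One explicit rate-limit signal is the Retry-After header, which a server may send with a 429 or 503 response. Its value is either a number of seconds or an HTTP date. The helper below is a sketch of how it could be turned into a wait time; the name retry_after_seconds and the injectable now argument are assumptions made for illustration and testing.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After header value into a number of seconds to wait."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # unparseable: let the caller fall back to its default delay
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(retry_after_seconds("120"))  # 120.0
```

In the loop, something like wait = retry_after_seconds(resp.headers.get("Retry-After")) or 1.0 would use the server's hint when it is present and fall back to the fixed delay otherwise.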

For higher-volume work, time.sleep with a fixed value is blunt. A better approach uses a random delay within a range — time.sleep(random.uniform(0.5, 2.0)) — which avoids the metronomic request timing that fixed delays produce.
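A jittered delay is easy to wrap in a helper. This is a minimal sketch; polite_sleep is a hypothetical name, the 0.5 to 2.0 second range is taken from the text above, and the sleep argument exists only so the function can be tested without actually waiting.

```python
import random
import time


def polite_sleep(low: float = 0.5, high: float = 2.0, sleep=time.sleep) -> float:
    """Sleep for a random duration between low and high seconds and return it."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay
```

Calling polite_sleep() in place of time.sleep(1) in the loop above is the only change needed.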

Duplicate URL detection

Some sites have inconsistent pagination — "next" links that eventually loop back, or page parameters that wrap around. A simple seen_urls set catches this before it turns into an infinite loop:

from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

BASE = "https://books.toscrape.com/catalogue/"
url = urljoin(BASE, "page-1.html")
seen_urls = set()
all_books = []

while url:
    if url in seen_urls:
        print(f"Loop detected at {url}, stopping")
        break
    seen_urls.add(url)

    resp = session.get(url, timeout=15)
    resp.raise_for_status()
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")

    for book in soup.find_all("article", class_="product_pod"):
        all_books.append({
            "title":  book.find("h3").find("a")["title"],
            "price":  book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        })

    next_btn = soup.select_one("li.next a")
    # Resolve against the page the link came from so seen_urls compares absolute URLs
    url = urljoin(resp.url, next_btn["href"]) if next_btn else None

print(f"Scraped {len(all_books)} books across {len(seen_urls)} pages")
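A plain set only catches URLs that repeat character for character. If a site emits the same page under slightly different spellings (reordered query parameters, a fragment, a differently cased hostname), normalizing the URL before the membership check helps. The sketch below shows one possible normalization; canonical_url is a hypothetical helper, the example.com URLs are made up, and the rules it applies are a judgment call that may be too aggressive for sites where parameter order matters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    # Sort query parameters so their order does not affect the comparison
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        query,
        "",  # drop the fragment; it is never sent to the server
    ))


seen_urls = set()
for candidate in (
    "https://example.com/items?page=2&sort=asc",
    "https://EXAMPLE.com/items?sort=asc&page=2#top",
):
    key = canonical_url(candidate)
    print(key, "(already seen)" if key in seen_urls else "(new)")
    seen_urls.add(key)
```

Both candidates normalize to https://example.com/items?page=2&sort=asc, so the second is reported as already seen.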

Using Scrapy's CrawlSpider

If you're building on Scrapy, CrawlSpider handles link following automatically via rules. This is the idiomatic Scrapy approach for sites where pagination follows a consistent pattern:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BooksSpider(CrawlSpider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/catalogue/page-1.html"]

    rules = (
        # Follow "next page" links
        Rule(
            LinkExtractor(restrict_css="li.next a"),
            callback="parse_page",
            follow=True,
        ),
    )

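    def parse_page(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title":  book.css("h3 a::attr(title)").get(),
                "price":  book.css("p.price_color::text").get(default="").strip(),
                "rating": book.css("p.star-rating::attr(class)").get(default="").split()[-1],
            }

    # Rule callbacks never receive the start_urls responses; CrawlSpider sends
    # those to parse_start_url, so reuse parse_page to keep the first page's items.
    def parse_start_url(self, response, **kwargs):
        return self.parse_page(response)

CrawlSpider deduplicates URLs by default (Scrapy's built-in duplicate filter handles it), respects DOWNLOAD_DELAY in your settings, and handles retries. For a site with straightforward pagination, it removes most of the boilerplate above.

Two things to know. First, CrawlSpider implements its rule processing inside its own parse() method. Rules are applied to the start_urls responses and to every response reached through a rule with follow=True, so a page that contains both items and a "next" link is handled correctly. If you override parse() on a CrawlSpider, the rules stop running; use a separately named callback, as above. Second, the start_urls responses go to parse_start_url rather than to the rule callback, which is why the spider defines it. Without it, the first page's items are dropped with no error, which is exactly the kind of quiet data loss this post is about.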
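The delay and retry behaviour mentioned above is driven by settings. As a rough sketch, the dictionary below collects the relevant Scrapy setting names with placeholder values; the numbers are assumptions to be tuned per site, and CUSTOM_SETTINGS is just a local name for the example.

```python
# Placeholder values; tune them for the site being crawled.
CUSTOM_SETTINGS = {
    "DOWNLOAD_DELAY": 1.0,               # base delay in seconds between requests to the same site
    "RANDOMIZE_DOWNLOAD_DELAY": True,    # vary the wait between 0.5x and 1.5x DOWNLOAD_DELAY
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1, # one request at a time per domain
    "AUTOTHROTTLE_ENABLED": True,        # adapt the delay to observed latency
    "AUTOTHROTTLE_START_DELAY": 1.0,     # initial delay used by AutoThrottle
    "AUTOTHROTTLE_MAX_DELAY": 10.0,      # upper bound on the adaptive delay
    "RETRY_TIMES": 2,                    # retries in addition to the first attempt
}
```

Assigning this dictionary to the spider's custom_settings class attribute applies it to that spider only; the same keys can also go in the project's settings.py.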

Quick decision guide

Situation	Approach
URL has `/page/2` or `?page=2`	Increment, stop on 404 or empty
Page has a "Next" link in HTML	Follow href with `urljoin`, stop when absent
JSON API with `has_next` flag	Loop until flag is false
JSON API with cursor/token	Pass cursor back each request, stop when null
Building on Scrapy	`CrawlSpider` + `LinkExtractor`