Pagination is where a lot of scrapers quietly go wrong — not with errors, but with missing data. A scraper that stops two pages early produces no exceptions. Neither does one that re-fetches the same page in a loop. The output just looks thin, and you may not notice until you check it against the actual record count.
There are three common pagination patterns. Identifying which one you're dealing with takes 30 seconds in DevTools; implementing it correctly takes another ten minutes. This post covers all three.
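The record-count check mentioned in the opening can be automated when the listing page prints a total. The following is a minimal sketch, assuming the page contains text such as "1000 results"; the function names `expected_total` and `check_count` and the sample HTML are illustrative, not part of any library or of the original scraper.

```python
import re


def expected_total(html):
    """Return the total the page advertises, or None if none is found."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping, enough for a sanity check
    match = re.search(r"(\d[\d,]*)\s+results", text)
    return int(match.group(1).replace(",", "")) if match else None


def check_count(scraped, expected):
    """Compare the number of scraped records with the advertised total."""
    if expected is None:
        return "no total advertised; cannot verify"
    if scraped == expected:
        return "ok"
    return f"mismatch: scraped {scraped}, site reports {expected}"


# Hypothetical snippet of a listing page header.
sample = "<form><strong>1,000</strong> results - showing <strong>1</strong> to <strong>20</strong>.</form>"
print(check_count(980, expected_total(sample)))
```

Running a check like this at the end of every crawl turns a silent shortfall into a visible message.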
Pattern 1: Page number in the URL
The simplest pattern. The URL contains a page parameter — either as a query string (?page=2) or as part of the path (/catalogue/page-2.html). Increment it until you get a 404 or an empty result set.
import requests
from bs4 import BeautifulSoup

BASE = "https://books.toscrape.com/catalogue/"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

all_books = []
page = 1

while True:
    url = f"{BASE}page-{page}.html"
    resp = session.get(url, timeout=15)
    if resp.status_code == 404:
        break  # past the last page
    resp.raise_for_status()  # any other error should fail loudly, not end the crawl silently
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")
    books = soup.find_all("article", class_="product_pod")
    if not books:
        break  # empty page, so we are also done
    for book in books:
        all_books.append({
            "title": book.find("h3").find("a")["title"],
            "price": book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        })
    print(f"Page {page}: {len(books)} books")
    page += 1

print(f"\nTotal: {len(all_books)} books")
Two termination conditions, not one. Some sites return an empty 200 for out-of-range pages rather than a 404. Check both.
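To see both stop conditions in isolation, here is a small sketch that separates the loop from the HTTP layer. The `fetch_page` callable, the stub sites, and the `max_pages` safety cap are assumptions made for illustration; the cap is an extra guard that the scraper above does not have.

```python
def paginate_by_number(fetch_page, start=1, max_pages=500):
    """Yield (page, items) until a 404, an empty page, or the safety cap."""
    for page in range(start, start + max_pages):
        status, items = fetch_page(page)
        if status == 404 or not items:
            return
        yield page, items


def site_with_404(page):
    # Stub: three pages of data, then a 404.
    if page <= 3:
        return 200, [f"item-{page}-{i}" for i in range(2)]
    return 404, []


def site_with_empty_200(page):
    # Stub: three pages of data, then an empty 200.
    items = [f"item-{page}-{i}" for i in range(2)] if page <= 3 else []
    return 200, items


def endless_site(page):
    # Stub: never signals the end, so only the cap stops the loop.
    return 200, ["item"]


pages_404 = [p for p, _ in paginate_by_number(site_with_404)]
pages_empty = [p for p, _ in paginate_by_number(site_with_empty_200)]
pages_capped = [p for p, _ in paginate_by_number(endless_site, max_pages=5)]
print(pages_404, pages_empty, pages_capped)
```

Both stub sites stop after page 3, and the endless one stops at the cap instead of running forever.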
Pattern 2: Following the "next" link
A cleaner approach for HTML-paginated sites: let the page tell you where to go next, rather than constructing URLs yourself. Most paginated sites include a "Next" link in the HTML. Follow it until it disappears.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://books.toscrape.com/catalogue/"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

url = "https://books.toscrape.com/catalogue/page-1.html"
all_books = []

while url:
    resp = session.get(url, timeout=15)
    resp.raise_for_status()  # an error page has no "next" link and would end the crawl silently
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")
    for book in soup.find_all("article", class_="product_pod"):
        all_books.append({
            "title": book.find("h3").find("a")["title"],
            "price": book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        })
    # Follow the "next" link if there is one; otherwise url becomes None and the loop ends
    next_btn = soup.select_one("li.next a")
    url = urljoin(BASE, next_btn["href"]) if next_btn else None

print(f"Scraped {len(all_books)} books")
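The urljoin(BASE, next_btn["href"]) call is worth noting. The href in a "next" link is often relative (page-2.html, ../page-2.html), and urljoin resolves it the way a browser would, whatever form the relative path takes. Plain string concatenation breaks on anything other than a bare filename. One caveat: a fixed BASE is only correct while every page lives in the same directory, as it does here. If the crawl can cross directories, resolve against the URL of the page the link came from (resp.url) instead.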
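A few concrete cases make the difference visible. This sketch uses only the standard library; the category URL and the href values are made-up examples of the forms a "next" link can take.

```python
from urllib.parse import urljoin

page = "https://books.toscrape.com/catalogue/page-1.html"

# Bare filename, resolved against the directory of the current page.
same_dir = urljoin(page, "page-2.html")

# Root-relative path, resolved against the host.
root_relative = urljoin(page, "/catalogue/page-2.html")

# Parent-directory segments are collapsed.
parent = urljoin(
    "https://books.toscrape.com/catalogue/category/books_1/index.html",
    "../../page-2.html",
)

# Naive concatenation glues the filename onto the full page URL.
naive = page + "page-2.html"

# A fixed base goes wrong once the href is relative to a different directory.
fixed_base = urljoin("https://books.toscrape.com/catalogue/", "catalogue/page-2.html")
from_index = urljoin("https://books.toscrape.com/index.html", "catalogue/page-2.html")

for name, value in [("same_dir", same_dir), ("root_relative", root_relative),
                    ("parent", parent), ("naive", naive),
                    ("fixed_base", fixed_base), ("from_index", from_index)]:
    print(f"{name:14} {value}")
```

The last two lines resolve the same href against two different bases. Only the one that uses the page the link actually came from gives a valid URL.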
Pattern 3: API cursor / continuation token
JSON APIs often paginate differently. Instead of page numbers, they return a token or flag telling you whether more results exist, and sometimes a cursor to pass back on the next request.
The simplest version: a has_next boolean.
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "application/json",
    "Accept-Language": "en-GB,en;q=0.9",
})

all_quotes = []
page = 1

while True:
    resp = session.get(
        "https://quotes.toscrape.com/api/quotes",
        params={"page": page},
        timeout=15,
    )
    resp.raise_for_status()
    data = resp.json()
    all_quotes.extend(data["quotes"])
    print(f"Page {page}: {len(data['quotes'])} quotes")
    if not data["has_next"]:
        break
    page += 1

print(f"\nTotal: {len(all_quotes)} quotes")
Some APIs use a cursor instead — the response includes a next_cursor or next_page_token field that you pass as a parameter on the subsequent request. The structure changes but the loop logic is the same: keep going until the cursor field is null or absent.
# Generic cursor pattern (reuses the session defined above)
items = []
params = {"limit": 100}

while True:
    resp = session.get("https://example.com/api/items", params=params, timeout=15)
    resp.raise_for_status()
    data = resp.json()
    items.extend(data["results"])
    cursor = data.get("next_cursor")
    if not cursor:
        break  # null or missing cursor means this was the last page
    params["cursor"] = cursor
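The same loop can be written as a generator and exercised without a network. This is a sketch under stated assumptions: `fetch` stands in for the HTTP call, the `results` and `next_cursor` field names follow the generic example above rather than any particular API, and the repeated-cursor guard is an extra safety check, not something an API requires.

```python
def iter_cursor(fetch, limit=100):
    """Yield items from a cursor-paginated source until the cursor runs out."""
    cursor = None
    seen = set()
    while True:
        data = fetch(cursor=cursor, limit=limit)
        yield from data["results"]
        cursor = data.get("next_cursor")
        if not cursor:
            return  # null or absent cursor: last page reached
        if cursor in seen:
            raise RuntimeError(f"cursor {cursor!r} repeated; stopping to avoid a loop")
        seen.add(cursor)


# Hypothetical API responses keyed by the cursor that requests them.
PAGES = {
    None: {"results": [1, 2], "next_cursor": "a"},
    "a": {"results": [3, 4], "next_cursor": "b"},
    "b": {"results": [5], "next_cursor": None},
}


def fake_fetch(cursor, limit):
    return PAGES[cursor]


items = list(iter_cursor(fake_fetch))
print(items)
```

Keeping the fetch function separate makes the termination logic testable on its own before pointing it at a real endpoint.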
Rate limiting
Sending requests as fast as the network allows is not scraping — it's a load test. Most sites will rate-limit or block traffic that arrives faster than a human could generate it. A 1-2 second delay between pages is a reasonable starting point; adjust based on the site's response times and any explicit rate-limit headers it sends.
import time

while url:
    resp = session.get(url, timeout=15)
    # ... process page ...
    next_btn = soup.select_one("li.next a")
    url = urljoin(BASE, next_btn["href"]) if next_btn else None
    if url:
        time.sleep(1)  # only sleep if there's another request coming
Sleeping after the last page is unnecessary. Put the sleep before the next request or, as above, after confirming there is a next request.
For higher-volume work, time.sleep with a fixed value is blunt. A better approach uses a random delay within a range — time.sleep(random.uniform(0.5, 2.0)) — which avoids the metronomic request timing that fixed delays produce.
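Both ideas, a randomised delay and honouring an explicit rate-limit header, fit in one small helper. This is a rough sketch: `next_delay` is a made-up name, and it only handles the numeric form of the standard `Retry-After` header (the header may also carry an HTTP date, which this sketch ignores).

```python
import random


def next_delay(headers, low=0.5, high=2.0):
    """Return how many seconds to wait before the next request."""
    retry_after = headers.get("Retry-After", "")
    if retry_after.strip().isdigit():
        return float(retry_after)  # the server asked for a specific pause
    return random.uniform(low, high)  # otherwise jitter within the range


print(next_delay({"Retry-After": "30"}))
print(next_delay({}))
```

In the loop above, `time.sleep(1)` would become `time.sleep(next_delay(resp.headers))`.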
Duplicate URL detection
Some sites have inconsistent pagination — "next" links that eventually loop back, or page parameters that wrap around. A simple seen_urls set catches this before it turns into an infinite loop:
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

BASE = "https://books.toscrape.com/catalogue/"
url = "https://books.toscrape.com/catalogue/page-1.html"
seen_urls = set()
all_books = []

while url:
    # Stop as soon as pagination points back at a page we have already fetched
    if url in seen_urls:
        print(f"Loop detected at {url}, stopping")
        break
    seen_urls.add(url)

    resp = session.get(url, timeout=15)
    resp.raise_for_status()
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")
    for book in soup.find_all("article", class_="product_pod"):
        all_books.append({
            "title": book.find("h3").find("a")["title"],
            "price": book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        })
    next_btn = soup.select_one("li.next a")
    url = urljoin(BASE, next_btn["href"]) if next_btn else None

print(f"Scraped {len(all_books)} books across {len(seen_urls)} pages")
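An exact string comparison only catches loops when the site repeats a URL character for character. If a site varies the fragment or the order of query parameters, normalising the URL before adding it to the set may help. The following is one possible sketch; `normalize_url` is a hypothetical helper, and the rules it applies (lowercase host, sorted query, dropped fragment) are assumptions that will not suit every site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Return a canonical form of url for use as a seen-set key."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 compare equal.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lowercase scheme and host, default the path to "/", drop the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", query, ""))


a = normalize_url("https://Example.com/list?b=2&a=1#top")
b = normalize_url("https://example.com/list?a=1&b=2")
print(a == b, a)
```

With this in place, the loop would check and store `normalize_url(url)` rather than the raw string.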
Using Scrapy's CrawlSpider
If you're building on Scrapy, CrawlSpider handles link following automatically via rules. This is the idiomatic Scrapy approach for sites where pagination follows a consistent pattern:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BooksSpider(CrawlSpider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/catalogue/page-1.html"]

    rules = (
        # Follow "next page" links
        Rule(
            LinkExtractor(restrict_css="li.next a"),
            callback="parse_page",
            follow=True,
        ),
    )

    def parse_start_url(self, response, **kwargs):
        # Responses for start_urls are not passed to rule callbacks, so
        # without this hook the books on page 1 would be skipped.
        return self.parse_page(response)

    def parse_page(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(default="").strip(),
                "rating": book.css("p.star-rating::attr(class)").get(default="").split()[-1],
            }
CrawlSpider deduplicates URLs by default (Scrapy's built-in duplicate filter handles it), respects DOWNLOAD_DELAY in your settings, and handles retries. For a site with straightforward pagination, it removes most of the boilerplate above.
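For reference, the politeness knobs mentioned here are ordinary Scrapy settings. The dictionary below is a sketch with illustrative values, not recommendations; it could be assigned to a spider's `custom_settings` attribute or copied into the project's settings.py.

```python
# Illustrative values only; tune them for the site being crawled.
POLITE_SETTINGS = {
    "DOWNLOAD_DELAY": 1.0,                # base delay in seconds between requests to a site
    "RANDOMIZE_DOWNLOAD_DELAY": True,     # vary the delay between 0.5x and 1.5x of the base
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per domain
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to observed response latency
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "AUTOTHROTTLE_MAX_DELAY": 10.0,
    "RETRY_TIMES": 2,                     # retries per failed request
}

for key, value in POLITE_SETTINGS.items():
    print(f"{key} = {value!r}")
```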
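One thing to know: CrawlSpider applies its rules to every response it receives, including pages reached by following a rule (that is what follow=True does). If a page contains both items and a "next" link, the callback extracts the items and the rule still follows the link. CrawlSpider implements this in its own parse() method, so if you override parse() directly, you'll break the rule processing. Use a separate callback method, as above. Note also that responses for start_urls are not passed to rule callbacks; they go to parse_start_url(), which returns nothing by default, so route it to your callback if the first page holds items.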
Quick decision guide
| Situation | Approach |
|---|---|
| URL has /page/2 or ?page=2 | Increment, stop on 404 or empty |
| Page has a "Next" link in HTML | Follow href with urljoin, stop when absent |
| JSON API with has_next flag | Loop until flag is false |
| JSON API with cursor/token | Pass cursor back each request, stop when null |
| Building on Scrapy | CrawlSpider + LinkExtractor |