Goodreads no longer offers a public API: it was deprecated in 2020. But the site still holds one of the richest book databases on the internet. Here's how to extract the data you need using Python.
What You Can Scrape
Goodreads serves most of its content as plain HTML. The main data targets are:
- Book pages — title, author, rating, review count, genres, description, ISBN, publication date
- Search results — paginated lists of books matching a query
- Author pages — biography, book list, average rating
- Lists — curated collections like "Best of 2025" or genre-specific rankings
- Shelves — user-created collections (public ones only)
Setup
Install the required packages:

```shell
pip install requests beautifulsoup4 lxml
```
We'll use lxml as the parser for speed. html.parser works too if you prefer no extra dependencies.
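If you want your script to run even on machines where lxml isn't installed, a small fallback helper covers both parsers (`make_soup` is our own convenience name, not a bs4 API):

```python
from bs4 import BeautifulSoup

def make_soup(html):
    """Parse with lxml when available, falling back to the stdlib parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except Exception:  # bs4 raises FeatureNotFound if lxml isn't installed
        return BeautifulSoup(html, "html.parser")

soup = make_soup("<h1 class='Text__title1'>The Hobbit</h1>")
```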
Scraping Book Details
Every Goodreads book has a URL like goodreads.com/book/show/12345. Here's how to extract structured data from a book page:
```python
import json
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

def scrape_book(url):
    resp = requests.get(url, headers=HEADERS)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    # Title and author
    title_el = soup.select_one("h1.Text__title1")
    author_el = soup.select_one("span.ContributorLink__name")

    # Rating info
    rating_el = soup.select_one("div.RatingStatistics__rating")
    count_el = soup.select_one("span[data-testid='ratingsCount']")

    # Genres
    genre_els = soup.select("span.BookPageMetadataSection__genreButton a")

    # Description
    desc_el = soup.select_one("div.DetailsLayoutRightParagraph span.Formatted")

    return {
        "url": url,
        "title": title_el.text.strip() if title_el else None,
        "author": author_el.text.strip() if author_el else None,
        "rating": float(rating_el.text.strip()) if rating_el else None,
        "ratings_count": count_el.text.strip() if count_el else None,
        "genres": [g.text.strip() for g in genre_els],
        "description": desc_el.text.strip() if desc_el else None,
    }

# Example usage
book = scrape_book("https://www.goodreads.com/book/show/5907.The_Hobbit")
print(json.dumps(book, indent=2))
```
Important: Add `time.sleep(1)` between requests. Goodreads will block you if you hit it too fast.
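The scraper above returns `ratings_count` as raw text like `"1,234,567 ratings"`. If you want a number to sort or filter on, a small normalizer helps (`parse_count` is our own helper, not part of the page data):

```python
import re

def parse_count(text):
    """Turn a string like '1,234,567 ratings' into an int (None if absent)."""
    if not text:
        return None
    m = re.search(r"\d[\d,]*", text)
    return int(m.group(0).replace(",", "")) if m else None
```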
Scraping Search Results
To find books by keyword:
```python
from urllib.parse import quote_plus

def search_books(query, max_pages=3):
    results = []
    for page in range(1, max_pages + 1):
        # URL-encode the query so multi-word searches work
        url = (
            f"https://www.goodreads.com/search"
            f"?q={quote_plus(query)}&page={page}"
        )
        resp = requests.get(url, headers=HEADERS)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "lxml")

        rows = soup.select("tr[itemtype]")
        if not rows:
            break

        for row in rows:
            title_link = row.select_one("a.bookTitle")
            author_link = row.select_one("a.authorName")
            rating_el = row.select_one("span.minirating")
            results.append({
                "title": title_link.text.strip() if title_link else None,
                "url": "https://www.goodreads.com" + title_link["href"]
                if title_link else None,
                "author": author_link.text.strip() if author_link else None,
                "mini_rating": rating_el.text.strip() if rating_el else None,
            })
        time.sleep(2)  # Be respectful
    return results

books = search_books("machine learning")
print(f"Found {len(books)} books")
```
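Search pages can repeat the same book across pages, so it's worth deduplicating by URL before saving. A minimal pass that keeps the first occurrence (`dedupe_books` is our own helper, not a Goodreads feature):

```python
def dedupe_books(books):
    """Drop duplicate search results, keeping the first occurrence of each URL."""
    seen = set()
    unique = []
    for book in books:
        key = book.get("url")
        if key in seen:
            continue
        seen.add(key)
        unique.append(book)
    return unique
```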
Scraping Author Profiles
Author pages contain biography, book lists, and aggregate stats:
```python
def scrape_author(url):
    resp = requests.get(url, headers=HEADERS)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    name_el = soup.select_one("h1.authorName span")
    bio_el = soup.select_one("div.aboutAuthorInfo span")

    books = []
    book_els = soup.select("tr[itemtype] a.bookTitle")
    for b in book_els[:20]:
        books.append({
            "title": b.text.strip(),
            "url": "https://www.goodreads.com" + b["href"],
        })

    return {
        "name": name_el.text.strip() if name_el else None,
        "bio": bio_el.text.strip() if bio_el else None,
        "books": books,
    }
```
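Author URLs follow a `/author/show/<id>.<Name>` pattern, much like book URLs, so you can pull out the numeric id as a stable key for your dataset. A sketch, assuming that URL shape holds (`author_id` is our own helper):

```python
import re

def author_id(url):
    """Extract the numeric author id from a Goodreads author URL."""
    m = re.search(r"/author/show/(\d+)", url)
    return m.group(1) if m else None
```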
Handling Common Issues
Rate Limiting
Goodreads blocks IPs that make too many requests. Mitigations:
```python
import random

def polite_get(url):
    time.sleep(random.uniform(1.5, 3.0))
    return requests.get(url, headers=HEADERS)
```
For large-scale scraping, use rotating proxies or a managed platform.
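If you'd rather handle throttling yourself first, a retry loop with exponential backoff is the usual pattern. This is a sketch under our own names (`get_with_retries`, `backoff_delay`); the proxy address is a placeholder, and the `proxies` dict follows the standard requests format:

```python
import random
import time

import requests

def backoff_delay(attempt):
    """Exponential backoff with jitter: ~2s, ~4s, ~8s, ..."""
    return 2 ** (attempt + 1) + random.uniform(0, 1)

def get_with_retries(url, headers, max_retries=4, proxies=None):
    """Retry throttled requests (429/503) with growing delays.

    `proxies` is an optional requests-style dict, e.g.
    {"https": "http://user:pass@proxy.example.com:8000"}  (placeholder address)
    """
    resp = None
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, proxies=proxies, timeout=30)
        if resp.status_code not in (429, 503):
            return resp
        time.sleep(backoff_delay(attempt))
    resp.raise_for_status()
    return resp
```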
JavaScript-Rendered Content
Some elements (review pagination, "Show more" buttons) require JavaScript. Use Playwright for these:
```python
from playwright.sync_api import sync_playwright

def get_rendered_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        content = page.content()
        browser.close()
    return BeautifulSoup(content, "lxml")
```
Selector Changes
Goodreads changes its CSS class names occasionally. If your scraper breaks:
- Open the target page in a browser
- Inspect the element you need
- Update the selector
- Prefer `data-testid` attributes where available, since they change less often than class names
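To see why `data-testid` attributes are the more durable hook, compare the two kinds of selector on a snippet (the class name below is made up to illustrate the churn):

```python
from bs4 import BeautifulSoup

html = '<span data-testid="ratingsCount" class="xYz_4f2">1,234 ratings</span>'
soup = BeautifulSoup(html, "html.parser")

# Attribute selector: keeps working even if xYz_4f2 becomes aBc_9k1 tomorrow
el = soup.select_one("span[data-testid='ratingsCount']")
```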
Scaling Up with Apify
If you need to scrape thousands of books, doing it yourself means managing proxies, retries, and scheduling. Cloud platforms like Apify handle this infrastructure for you.
We're building a dedicated Goodreads scraper actor at apify.com/cryptosignals/goodreads-scraper that handles:
- Automatic proxy rotation
- Retry on failures
- Structured JSON/CSV output
- Scheduled recurring runs
The actor is still in development; check the link for availability.
Output Formats
Whatever method you use, save your data in a reusable format:
```python
import csv

def save_to_csv(books, filename="books.csv"):
    if not books:
        return
    keys = books[0].keys()
    with open(filename, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(books)
```
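CSV flattens nested fields like the genres list into one cell. For richer records, JSON Lines is a handy alternative (`save_to_jsonl` is our own helper, mirroring the CSV one):

```python
import json

def save_to_jsonl(books, filename="books.jsonl"):
    """One JSON object per line; nested fields like genre lists survive intact."""
    with open(filename, "w", encoding="utf-8") as f:
        for book in books:
            f.write(json.dumps(book, ensure_ascii=False) + "\n")
```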
Ethical Scraping Guidelines
- Rate-limit your requests (minimum 1-2 seconds between requests)
- Respect `robots.txt`
- Don't scrape private user data
- Use the data for research, analysis, or building better book tools — not for cloning Goodreads
- Cache results to avoid redundant requests
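The caching point deserves a sketch. A minimal on-disk cache keyed by a hash of the URL avoids re-fetching pages you already have; `cached_get` and the `.goodreads_cache` directory name are our own choices:

```python
import hashlib
import pathlib

CACHE_DIR = pathlib.Path(".goodreads_cache")  # arbitrary cache location

def cached_get(url, fetch):
    """Return cached HTML for `url` if present; otherwise call fetch(url) and store it."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html
```

Wire it into the earlier helpers with something like `cached_get(url, lambda u: polite_get(u).text)`.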
Goodreads still has the best book data on the internet. With the right tools and respectful scraping practices, you can access it programmatically even without the old API.