Goodreads no longer offers a public API: it was deprecated in 2020. But the site still holds one of the richest book databases on the internet. Here's how to extract the data you need using Python.
What You Can Scrape
Goodreads serves most of its content as plain HTML. The main data targets are:
- Book pages — title, author, rating, review count, genres, description, ISBN, publication date
- Search results — paginated lists of books matching a query
- Author pages — biography, book list, average rating
- Lists — curated collections like "Best of 2025" or genre-specific rankings
- Shelves — user-created collections (public ones only)
Setup
Install the required packages:

```shell
pip install requests beautifulsoup4 lxml
```
We'll use lxml as the parser for speed. html.parser works too if you prefer no extra dependencies.
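If you want your script to run even on machines where lxml isn't installed, a small fallback helper covers both parsers (`make_soup` is our own convenience name, not a bs4 API):

```python
from bs4 import BeautifulSoup

def make_soup(html):
    """Parse with lxml when available, falling back to the stdlib parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except Exception:  # bs4 raises FeatureNotFound if lxml isn't installed
        return BeautifulSoup(html, "html.parser")

soup = make_soup("<h1 class='Text__title1'>The Hobbit</h1>")
```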
Scraping Book Details
Every Goodreads book has a URL like goodreads.com/book/show/12345. Here's how to extract structured data from a book page:
```python
import json
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

def scrape_book(url):
    resp = requests.get(url, headers=HEADERS)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    # Title and author
    title_el = soup.select_one("h1.Text__title1")
    author_el = soup.select_one("span.ContributorLink__name")

    # Rating info
    rating_el = soup.select_one("div.RatingStatistics__rating")
    count_el = soup.select_one("span[data-testid='ratingsCount']")

    # Genres
    genre_els = soup.select("span.BookPageMetadataSection__genreButton a")

    # Description
    desc_el = soup.select_one("div.DetailsLayoutRightParagraph span.Formatted")

    return {
        "url": url,
        "title": title_el.text.strip() if title_el else None,
        "author": author_el.text.strip() if author_el else None,
        "rating": float(rating_el.text.strip()) if rating_el else None,
        "ratings_count": count_el.text.strip() if count_el else None,
        "genres": [g.text.strip() for g in genre_els],
        "description": desc_el.text.strip() if desc_el else None,
    }

# Example usage
book = scrape_book("https://www.goodreads.com/book/show/5907.The_Hobbit")
print(json.dumps(book, indent=2))
```
Important: Add `time.sleep(1)` between requests. Goodreads will block you if you hit it too fast.
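The scraper above returns `ratings_count` as raw text like `"1,234,567 ratings"`. If you want a number to sort or filter on, a small normalizer helps (`parse_count` is our own helper, not part of the page data):

```python
import re

def parse_count(text):
    """Turn a string like '1,234,567 ratings' into an int (None if absent)."""
    if not text:
        return None
    m = re.search(r"\d[\d,]*", text)
    return int(m.group(0).replace(",", "")) if m else None
```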
Scraping Search Results
To find books by keyword:
```python
from urllib.parse import quote_plus

def search_books(query, max_pages=3):
    results = []
    for page in range(1, max_pages + 1):
        # URL-encode the query so multi-word searches work
        url = (
            f"https://www.goodreads.com/search"
            f"?q={quote_plus(query)}&page={page}"
        )
        resp = requests.get(url, headers=HEADERS)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "lxml")

        rows = soup.select("tr[itemtype]")
        if not rows:
            break

        for row in rows:
            title_link = row.select_one("a.bookTitle")
            author_link = row.select_one("a.authorName")
            rating_el = row.select_one("span.minirating")
            results.append({
                "title": title_link.text.strip() if title_link else None,
                "url": "https://www.goodreads.com" + title_link["href"]
                if title_link else None,
                "author": author_link.text.strip() if author_link else None,
                "mini_rating": rating_el.text.strip() if rating_el else None,
            })
        time.sleep(2)  # Be respectful
    return results

books = search_books("machine learning")
print(f"Found {len(books)} books")
```
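Search pages can repeat the same book across pages, so it's worth deduplicating by URL before saving. A minimal pass that keeps the first occurrence (`dedupe_books` is our own helper, not a Goodreads feature):

```python
def dedupe_books(books):
    """Drop duplicate search results, keeping the first occurrence of each URL."""
    seen = set()
    unique = []
    for book in books:
        key = book.get("url")
        if key in seen:
            continue
        seen.add(key)
        unique.append(book)
    return unique
```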
Scraping Author Profiles
Author pages contain biography, book lists, and aggregate stats:
```python
def scrape_author(url):
    resp = requests.get(url, headers=HEADERS)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    name_el = soup.select_one("h1.authorName span")
    bio_el = soup.select_one("div.aboutAuthorInfo span")

    books = []
    book_els = soup.select("tr[itemtype] a.bookTitle")
    for b in book_els[:20]:
        books.append({
            "title": b.text.strip(),
            "url": "https://www.goodreads.com" + b["href"],
        })

    return {
        "name": name_el.text.strip() if name_el else None,
        "bio": bio_el.text.strip() if bio_el else None,
        "books": books,
    }
```
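Author URLs follow a `/author/show/<id>.<Name>` pattern, much like book URLs, so you can pull out the numeric id as a stable key for your dataset. A sketch, assuming that URL shape holds (`author_id` is our own helper):

```python
import re

def author_id(url):
    """Extract the numeric author id from a Goodreads author URL."""
    m = re.search(r"/author/show/(\d+)", url)
    return m.group(1) if m else None
```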
Handling Common Issues
Rate Limiting
Goodreads blocks IPs that make too many requests. Mitigations:
```python
import random

def polite_get(url):
    time.sleep(random.uniform(1.5, 3.0))
    return requests.get(url, headers=HEADERS)
```
For large-scale scraping, use rotating proxies or a managed platform.
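If you'd rather handle throttling yourself first, a retry loop with exponential backoff is the usual pattern. This is a sketch under our own names (`get_with_retries`, `backoff_delay`); the proxy address is a placeholder, and the `proxies` dict follows the standard requests format:

```python
import random
import time

import requests

def backoff_delay(attempt):
    """Exponential backoff with jitter: ~2s, ~4s, ~8s, ..."""
    return 2 ** (attempt + 1) + random.uniform(0, 1)

def get_with_retries(url, headers, max_retries=4, proxies=None):
    """Retry throttled requests (429/503) with growing delays.

    `proxies` is an optional requests-style dict, e.g.
    {"https": "http://user:pass@proxy.example.com:8000"}  (placeholder address)
    """
    resp = None
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, proxies=proxies, timeout=30)
        if resp.status_code not in (429, 503):
            return resp
        time.sleep(backoff_delay(attempt))
    resp.raise_for_status()
    return resp
```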
JavaScript-Rendered Content
Some elements (review pagination, "Show more" buttons) require JavaScript. Use Playwright for these:
```python
from playwright.sync_api import sync_playwright

def get_rendered_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        content = page.content()
        browser.close()
    return BeautifulSoup(content, "lxml")
```
Selector Changes
Goodreads changes its CSS class names occasionally. If your scraper breaks:
- Open the target page in a browser
- Inspect the element you need
- Update the selector
- Prefer `data-testid` attributes where available, since they change less often than class names
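To see why `data-testid` attributes are the more durable hook, compare the two kinds of selector on a snippet (the class name below is made up to illustrate the churn):

```python
from bs4 import BeautifulSoup

html = '<span data-testid="ratingsCount" class="xYz_4f2">1,234 ratings</span>'
soup = BeautifulSoup(html, "html.parser")

# Attribute selector: keeps working even if xYz_4f2 becomes aBc_9k1 tomorrow
el = soup.select_one("span[data-testid='ratingsCount']")
```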
Scaling Up with Apify
If you need to scrape thousands of books, doing it yourself means managing proxies, retries, and scheduling. Cloud platforms like Apify handle this infrastructure for you.
We're building a dedicated Goodreads scraper actor at apify.com/cryptosignals/goodreads-scraper that handles:
- Automatic proxy rotation
- Retry on failures
- Structured JSON/CSV output
- Scheduled recurring runs
The actor is still in development; check the link for availability.
Output Formats
Whatever method you use, save your data in a reusable format:
```python
import csv

def save_to_csv(books, filename="books.csv"):
    if not books:
        return
    keys = books[0].keys()
    with open(filename, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(books)
```
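CSV flattens nested fields like the genres list into one cell. For richer records, JSON Lines is a handy alternative (`save_to_jsonl` is our own helper, mirroring the CSV one):

```python
import json

def save_to_jsonl(books, filename="books.jsonl"):
    """One JSON object per line; nested fields like genre lists survive intact."""
    with open(filename, "w", encoding="utf-8") as f:
        for book in books:
            f.write(json.dumps(book, ensure_ascii=False) + "\n")
```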
Ethical Scraping Guidelines
- Rate-limit your requests (minimum 1-2 seconds between requests)
- Respect `robots.txt`
- Don't scrape private user data
- Use the data for research, analysis, or building better book tools — not for cloning Goodreads
- Cache results to avoid redundant requests
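The caching point deserves a sketch. A minimal on-disk cache keyed by a hash of the URL avoids re-fetching pages you already have; `cached_get` and the `.goodreads_cache` directory name are our own choices:

```python
import hashlib
import pathlib

CACHE_DIR = pathlib.Path(".goodreads_cache")  # arbitrary cache location

def cached_get(url, fetch):
    """Return cached HTML for `url` if present; otherwise call fetch(url) and store it."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html
```

Wire it into the earlier helpers with something like `cached_get(url, lambda u: polite_get(u).text)`.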
Goodreads still has the best book data on the internet. With the right tools and respectful scraping practices, you can access it programmatically even without the old API.