DEV Community

agenthustler

Scraping Goodreads in 2026: Books, Ratings & Author Data

Goodreads no longer has a public API (it was deprecated in 2020), but the site still holds one of the richest book databases on the internet. Here's how to extract the data you need using Python.

What You Can Scrape

Goodreads serves most of its content as plain HTML. The main data targets are:

  • Book pages — title, author, rating, review count, genres, description, ISBN, publication date
  • Search results — paginated lists of books matching a query
  • Author pages — biography, book list, average rating
  • Lists — curated collections like "Best of 2025" or genre-specific rankings
  • Shelves — user-created collections (public ones only)

Setup

Install the required packages:

pip install requests beautifulsoup4 lxml

We'll use lxml as the parser because it's fast; Python's built-in html.parser works too if you'd rather avoid the extra dependency.

Scraping Book Details

Every Goodreads book has a URL like goodreads.com/book/show/12345. Here's how to extract structured data from a book page:

import requests
from bs4 import BeautifulSoup
import json
import time

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

def scrape_book(url):
    resp = requests.get(url, headers=HEADERS)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    # Title and author
    title_el = soup.select_one("h1.Text__title1")
    author_el = soup.select_one("span.ContributorLink__name")

    # Rating info
    rating_el = soup.select_one("div.RatingStatistics__rating")
    count_el = soup.select_one(
        "span[data-testid='ratingsCount']"
    )

    # Genres
    genre_els = soup.select(
        "span.BookPageMetadataSection__genreButton a"
    )

    # Description
    desc_el = soup.select_one(
        "div.DetailsLayoutRightParagraph span.Formatted"
    )

    return {
        "url": url,
        "title": title_el.text.strip() if title_el else None,
        "author": author_el.text.strip() if author_el else None,
        "rating": float(rating_el.text.strip())
            if rating_el else None,
        "ratings_count": count_el.text.strip()
            if count_el else None,
        "genres": [g.text.strip() for g in genre_els],
        "description": desc_el.text.strip()
            if desc_el else None,
    }

# Example usage
book = scrape_book(
    "https://www.goodreads.com/book/show/5907.The_Hobbit"
)
print(json.dumps(book, indent=2))

Important: Add time.sleep(1) between requests. Goodreads will block you if you hit it too fast.

Scraping Search Results

To find books by keyword:

def search_books(query, max_pages=3):
    results = []
    for page in range(1, max_pages + 1):
        # Pass the query via params so requests URL-encodes it
        resp = requests.get(
            "https://www.goodreads.com/search",
            params={"q": query, "page": page},
            headers=HEADERS,
        )
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "lxml")

        rows = soup.select("tr[itemtype]")
        if not rows:
            break

        for row in rows:
            title_link = row.select_one("a.bookTitle")
            author_link = row.select_one("a.authorName")
            rating_el = row.select_one("span.minirating")

            results.append({
                "title": title_link.text.strip()
                    if title_link else None,
                "url": "https://www.goodreads.com"
                    + title_link["href"]
                    if title_link else None,
                "author": author_link.text.strip()
                    if author_link else None,
                "mini_rating": rating_el.text.strip()
                    if rating_el else None,
            })

        time.sleep(2)  # Be respectful
    return results

books = search_books("machine learning")
print(f"Found {len(books)} books")

Scraping Author Profiles

Author pages contain biography, book lists, and aggregate stats:

def scrape_author(url):
    resp = requests.get(url, headers=HEADERS)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    name_el = soup.select_one("h1.authorName span")
    bio_el = soup.select_one("div.aboutAuthorInfo span")
    books = []

    book_els = soup.select("tr[itemtype] a.bookTitle")
    for b in book_els[:20]:
        books.append({
            "title": b.text.strip(),
            "url": "https://www.goodreads.com" + b["href"]
        })

    return {
        "name": name_el.text.strip() if name_el else None,
        "bio": bio_el.text.strip() if bio_el else None,
        "books": books,
    }

Handling Common Issues

Rate Limiting

Goodreads blocks IPs that make too many requests. Mitigations:

import random

def polite_get(url):
    time.sleep(random.uniform(1.5, 3.0))
    return requests.get(url, headers=HEADERS)
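When a block does happen, retrying with exponential backoff recovers more gracefully than giving up or hammering at a fixed interval. A minimal sketch (treating 429/403 as throttling signals is an assumption; adjust to the responses you actually observe):

```python
import time
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # abbreviated; use the full UA string from above

def backoff_delay(attempt, base=2.0):
    """Exponential backoff: 2s, 4s, 8s, ... for attempts 0, 1, 2, ..."""
    return base * (2 ** attempt)

def get_with_retry(url, max_retries=3):
    resp = None
    for attempt in range(max_retries):
        resp = requests.get(url, headers=HEADERS, timeout=30)
        # 429/403 usually indicate throttling; anything else is returned as-is
        if resp.status_code not in (429, 403):
            break
        time.sleep(backoff_delay(attempt))
    return resp
```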

For large-scale scraping, use rotating proxies or a managed platform.
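requests accepts per-request proxies, so a simple round-robin rotation is a few lines. A sketch, assuming you have proxy endpoints from a provider (the URLs below are placeholders):

```python
import itertools
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # abbreviated; use the full UA string from above

# Placeholder endpoints -- substitute your provider's proxy URLs
PROXIES = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
]
_proxy_pool = itertools.cycle(PROXIES)

def next_proxy():
    """Round-robin over the configured proxy list."""
    url = next(_proxy_pool)
    return {"http": url, "https": url}

def get_via_proxy(url):
    return requests.get(url, headers=HEADERS, proxies=next_proxy(), timeout=30)
```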

JavaScript-Rendered Content

Some elements (review pagination, "Show more" buttons) require JavaScript. Use Playwright for these:

from playwright.sync_api import sync_playwright

def get_rendered_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        content = page.content()
        browser.close()
        return BeautifulSoup(content, "lxml")

Selector Changes

Goodreads changes its CSS class names occasionally. If your scraper breaks:

  1. Open the target page in a browser
  2. Inspect the element you need
  3. Update the selector
  4. Consider using data-testid attributes — they change less often than class names
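A small helper that tries a data-testid selector first and only falls back to a class name keeps breakage localized when Goodreads renames classes. The fallback selector below is illustrative:

```python
from bs4 import BeautifulSoup

def select_first(soup, selectors):
    """Try selectors in order; return the first match's text, else None."""
    for sel in selectors:
        el = soup.select_one(sel)
        if el:
            return el.get_text(strip=True)
    return None

# Tiny sample fragment standing in for a fetched book page
html = '<div><span data-testid="ratingsCount">1,234 ratings</span></div>'
soup = BeautifulSoup(html, "html.parser")

count = select_first(soup, [
    "span[data-testid='ratingsCount']",   # stable attribute first
    "span.RatingStatistics__meta",        # class-name fallback (illustrative)
])
```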

Scaling Up with Apify

If you need to scrape thousands of books, doing it yourself means managing proxies, retries, and scheduling. Cloud platforms like Apify handle this infrastructure for you.

We're building a dedicated Goodreads scraper actor at apify.com/cryptosignals/goodreads-scraper that handles:

  • Automatic proxy rotation
  • Retry on failures
  • Structured JSON/CSV output
  • Scheduled recurring runs

The actor is still in development; check the link for availability.

Output Formats

Whatever method you use, save your data in a reusable format:

import csv

def save_to_csv(books, filename="books.csv"):
    if not books:
        return
    keys = books[0].keys()
    with open(filename, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(books)
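CSV flattens nested fields like the genres list into strings; JSON Lines keeps them intact and is just as easy to append to. A companion to save_to_csv:

```python
import json

def save_to_jsonl(books, filename="books.jsonl"):
    """One JSON object per line -- easy to append, stream, and re-parse."""
    with open(filename, "w", encoding="utf-8") as f:
        for book in books:
            f.write(json.dumps(book, ensure_ascii=False) + "\n")
```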

Ethical Scraping Guidelines

  • Rate-limit your requests (minimum 1-2 seconds between requests)
  • Respect robots.txt
  • Don't scrape private user data
  • Use the data for research, analysis, or building better book tools — not for cloning Goodreads
  • Cache results to avoid redundant requests
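The last point is easy to implement with a small on-disk cache keyed by a hash of the URL. A minimal sketch (the cache directory name is arbitrary; `fetch` is any function that returns HTML for a URL, e.g. the polite_get helper above wrapped as `lambda u: polite_get(u).text`):

```python
import hashlib
import os

CACHE_DIR = "goodreads_cache"  # arbitrary directory name

def cache_key(url):
    """Stable, filesystem-safe filename for a URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"

def cached_fetch(url, fetch, cache_dir=CACHE_DIR):
    """Return cached HTML for url if present; otherwise call fetch(url) and store it."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, cache_key(url))
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    html = fetch(url)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html
```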

Goodreads still holds some of the best book data on the internet. With the right tools and respectful scraping practices, you can access it programmatically even without the old API.
