agenthustler


Best Goodreads Scrapers in 2026: Book Data Without the Old API

Goodreads shut down its public API in December 2020, first refusing new developer keys and then retiring existing ones. If you need book data — ratings, reviews, author info, shelves — web scraping is now the only realistic option.

This guide covers the best tools available in 2026 for scraping Goodreads, from cloud-based actors to DIY approaches.

Why Goodreads Data Still Matters

Even without an API, Goodreads remains the largest book database on the web:

  • 200+ million reviews across millions of titles
  • 150+ million registered users generating reading activity
  • Rich metadata: genres, series info, edition details, author bios

Common use cases include:

  • Book recommendation engines — pull ratings and genre tags to build collaborative filtering models
  • Publisher market research — track which genres trend, what ratings new releases get
  • Author tracking — monitor new releases, rating changes, review sentiment
  • Academic research — reading habits, genre popularity over time
  • Library collection development — identify high-demand titles by rating volume

The Scraping Landscape in 2026

Goodreads serves mostly server-rendered HTML, which makes it relatively straightforward to scrape compared to single-page applications. However, there are challenges:

  • Rate limiting — aggressive request patterns get IP-blocked quickly
  • Dynamic content — some sections, such as paginated reviews, load via JavaScript
  • Anti-bot measures — CAPTCHAs appear after sustained scraping
  • Layout changes — Goodreads updates its HTML structure periodically

This is why managed scraping platforms like Apify have become popular — they handle proxies, retries, and browser rendering so you don't have to.
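If you do roll your own scraper, rate limiting is the challenge you will hit first. A common, Goodreads-agnostic pattern is capped exponential backoff with jitter between retries; the helper below is a minimal sketch (status codes 429/503 as the retry triggers are an assumption about how you get throttled):

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Capped exponential backoff: 1s, 2s, 4s, ... up to `cap` seconds."""
    return min(cap, base * (2 ** attempt))

def polite_get(session, url, max_attempts=5):
    """Retry a GET with backoff when the server rate-limits us."""
    for attempt in range(max_attempts):
        resp = session.get(url, timeout=30)
        if resp.status_code not in (429, 503):
            return resp
        # Jitter avoids many workers retrying in lockstep
        time.sleep(backoff_delay(attempt) + random.uniform(0, 1))
    return resp
```

Pass in a requests.Session so connections are reused across the crawl.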

Option 1: Apify Store Actors

The Apify Store hosts several Goodreads scrapers built by the community. These run in the cloud with built-in proxy rotation and scheduling.

What to look for in an Apify actor:

  • Recent updates (actors abandoned for 6+ months may break on layout changes)
  • Proxy support (residential proxies work best for Goodreads)
  • Structured output (JSON with consistent field names)
  • Search + detail page support (not just one or the other)
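Whichever actor you pick, running it from Python looks roughly the same. The sketch below uses the official apify-client package; the actor ID is a placeholder, and the startUrls/maxItems input fields are illustrative, so check the actual input schema of the actor you choose:

```python
def build_input(urls, max_items=100):
    """Assemble an actor run input; exact field names vary per actor."""
    return {"startUrls": [{"url": u} for u in urls], "maxItems": max_items}

if __name__ == "__main__":
    # pip install apify-client; requires an Apify API token
    from apify_client import ApifyClient

    client = ApifyClient("YOUR_APIFY_TOKEN")
    run_input = build_input(["https://www.goodreads.com/book/show/2767052"])
    # Placeholder actor ID; pick a real one from the Apify Store
    run = client.actor("someuser/goodreads-scraper").call(run_input=run_input)
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        print(item)
```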

Our Upcoming Actor

We're building a dedicated Goodreads scraper at apify.com/cryptosignals/goodreads-scraper focused on:

  • Book search — scrape search results by keyword, genre, or list
  • Book details — title, author, rating, review count, genres, description, ISBN, page count
  • Author profiles — bio, book list, average rating, follower count
  • List scraping — pull entire Goodreads lists (e.g., "Best Science Fiction of 2025")

This actor is upcoming and not yet publicly available — check the link for launch updates.

Option 2: DIY with Python

If you prefer to build your own scraper, here's a minimal approach using requests and BeautifulSoup4:

```python
import requests
from bs4 import BeautifulSoup

def scrape_book(url):
    # A realistic User-Agent reduces the chance of an immediate block
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # These class names match Goodreads' current markup and will
    # break when the site changes its HTML
    title = soup.select_one("h1.Text__title1")
    author = soup.select_one("span.ContributorLink__name")
    rating = soup.select_one("div.RatingStatistics__rating")

    return {
        "title": title.text.strip() if title else None,
        "author": author.text.strip() if author else None,
        "rating": rating.text.strip() if rating else None,
    }
```

Pros: Full control, no platform fees, customizable output.

Cons: You handle proxies, rate limiting, retries, and maintenance when Goodreads changes its HTML. Expect to spend significant time on infrastructure rather than data analysis.
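One small addition worth making early: Goodreads book URLs embed a numeric ID (e.g. /book/show/2767052-the-hunger-games), and keeping that ID in your output makes deduplication and re-scraping much simpler than matching on titles. A minimal sketch:

```python
import re

BOOK_ID_RE = re.compile(r"/book/show/(\d+)")

def extract_book_id(url):
    """Pull the numeric book ID out of a Goodreads book URL, or None."""
    match = BOOK_ID_RE.search(url)
    return match.group(1) if match else None
```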

Option 3: Browser Automation

For JavaScript-heavy pages (like paginated reviews), you may need Playwright or Puppeteer:

```python
from playwright.sync_api import sync_playwright

def scrape_reviews(book_url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(book_url)
        # Reviews render client-side; wait until they appear
        page.wait_for_selector("article.ReviewCard")

        reviews = page.query_selector_all("article.ReviewCard")
        data = []
        for review in reviews[:10]:  # first page of reviews only
            text = review.query_selector("span.Formatted")
            stars = review.query_selector("span.RatingStars")
            data.append({
                "text": text.inner_text() if text else "",
                "rating": stars.get_attribute("aria-label") if stars else "",
            })
        browser.close()
        return data
```

This is slower and more resource-intensive but handles dynamic content that plain HTTP requests miss.
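Note that the aria-label captured above is a human-readable string, not a number. Assuming it follows the "Rating 4 out of 5" format (an assumption about current Goodreads markup), a small parsing step turns it into a float:

```python
import re

def parse_star_rating(aria_label):
    """Extract the leading number from a label like 'Rating 4 out of 5'."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*out of\s*5", aria_label or "")
    return float(match.group(1)) if match else None
```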

Comparison Table

| Feature | Apify Actors | DIY Python | Browser Automation |
| --- | --- | --- | --- |
| Setup time | Minutes | Hours | Hours |
| Proxy handling | Built-in | Manual | Manual |
| JavaScript support | Yes | No | Yes |
| Cost | Pay per usage | Free (+ proxy costs) | Free (+ proxy costs) |
| Maintenance | Actor maintainer | You | You |
| Scalability | High | Medium | Low |

Legal Considerations

Goodreads' Terms of Service prohibit automated scraping. In practice:

  • Scraping public data for research or personal use is generally low-risk
  • Scraping at high volume or for commercial redistribution carries more legal exposure
  • The Ninth Circuit's 2022 hiQ v. LinkedIn ruling held that scraping publicly accessible data does not violate the CFAA, but the case later settled, contract-based claims survived, and this is not settled law everywhere
  • Always respect robots.txt and rate-limit your requests
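Checking robots.txt is easy to automate with the standard library. The sketch below parses an inline ruleset for clarity; in practice you would point RobotFileParser at https://www.goodreads.com/robots.txt with set_url() and read(). The rules shown are illustrative, not Goodreads' actual file:

```python
from urllib.robotparser import RobotFileParser

def allowed(rules_lines, user_agent, url):
    """Return True if `url` is fetchable for `user_agent` under these rules."""
    rp = RobotFileParser()
    rp.parse(rules_lines)
    return rp.can_fetch(user_agent, url)

# Illustrative ruleset, not Goodreads' real robots.txt
example_rules = [
    "User-agent: *",
    "Disallow: /admin/",
]
```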

Recommendations

For most users: Start with an Apify actor. The time saved on proxy management and maintenance pays for itself quickly. Check our upcoming Goodreads scraper or browse the Apify Store for alternatives.

For developers who need full control: Build with requests + BeautifulSoup for metadata, add Playwright only for review scraping. Budget time for ongoing maintenance.

For one-off research: A simple Python script with time.sleep(2) between requests is often enough. No need for infrastructure.

Whatever approach you choose, Goodreads remains one of the richest sources of book data on the web — it just takes a bit more work to access it now that the API is gone.
