agenthustler


Best Goodreads Scrapers in 2026: Book Data Without the Old API

Goodreads shut down its public API in December 2020, first refusing new developer keys and then retiring existing ones. If you need book data — ratings, reviews, author info, shelves — web scraping is now the only realistic option.

This guide covers the best tools available in 2026 for scraping Goodreads, from cloud-based actors to DIY approaches.

Why Goodreads Data Still Matters

Even without an API, Goodreads remains the largest book database on the web:

  • 200+ million reviews across millions of titles
  • 150+ million registered users generating reading activity
  • Rich metadata: genres, series info, edition details, author bios

Common use cases include:

  • Book recommendation engines — pull ratings and genre tags to build collaborative filtering models
  • Publisher market research — track which genres trend, what ratings new releases get
  • Author tracking — monitor new releases, rating changes, review sentiment
  • Academic research — reading habits, genre popularity over time
  • Library collection development — identify high-demand titles by rating volume

The Scraping Landscape in 2026

Goodreads serves mostly server-rendered HTML, which makes it relatively straightforward to scrape compared to single-page applications. However, there are challenges:

  • Rate limiting — aggressive request patterns get IP-blocked quickly
  • Dynamic content — some sections, such as paginated reviews, load via JavaScript
  • Anti-bot measures — CAPTCHAs appear after sustained scraping
  • Layout changes — Goodreads updates its HTML structure periodically

This is why managed scraping platforms like Apify have become popular — they handle proxies, retries, and browser rendering so you don't have to.
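If you do roll your own scraper, rate limiting is the challenge you will hit first. A common, Goodreads-agnostic pattern is capped exponential backoff with jitter between retries; the helper below is a minimal sketch (status codes 429/503 as the retry triggers are an assumption about how you get throttled):

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Capped exponential backoff: 1s, 2s, 4s, ... up to `cap` seconds."""
    return min(cap, base * (2 ** attempt))

def polite_get(session, url, max_attempts=5):
    """Retry a GET with backoff when the server rate-limits us."""
    for attempt in range(max_attempts):
        resp = session.get(url, timeout=30)
        if resp.status_code not in (429, 503):
            return resp
        # Jitter avoids many workers retrying in lockstep
        time.sleep(backoff_delay(attempt) + random.uniform(0, 1))
    return resp
```

Pass in a requests.Session so connections are reused across the crawl.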

Option 1: Apify Store Actors

The Apify Store hosts several Goodreads scrapers built by the community. These run in the cloud with built-in proxy rotation and scheduling.

What to look for in an Apify actor:

  • Recent updates (actors abandoned for 6+ months may break on layout changes)
  • Proxy support (residential proxies work best for Goodreads)
  • Structured output (JSON with consistent field names)
  • Search + detail page support (not just one or the other)
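Whichever actor you pick, running it from Python looks roughly the same. The sketch below uses the official apify-client package; the actor ID is a placeholder, and the startUrls/maxItems input fields are illustrative, so check the actual input schema of the actor you choose:

```python
def build_input(urls, max_items=100):
    """Assemble an actor run input; exact field names vary per actor."""
    return {"startUrls": [{"url": u} for u in urls], "maxItems": max_items}

if __name__ == "__main__":
    # pip install apify-client; requires an Apify API token
    from apify_client import ApifyClient

    client = ApifyClient("YOUR_APIFY_TOKEN")
    run_input = build_input(["https://www.goodreads.com/book/show/2767052"])
    # Placeholder actor ID; pick a real one from the Apify Store
    run = client.actor("someuser/goodreads-scraper").call(run_input=run_input)
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        print(item)
```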

Our Upcoming Actor

We're building a dedicated Goodreads scraper at apify.com/cryptosignals/goodreads-scraper focused on:

  • Book search — scrape search results by keyword, genre, or list
  • Book details — title, author, rating, review count, genres, description, ISBN, page count
  • Author profiles — bio, book list, average rating, follower count
  • List scraping — pull entire Goodreads lists (e.g., "Best Science Fiction of 2025")

This actor is upcoming and not yet publicly available — check the link for launch updates.

Option 2: DIY with Python

If you prefer to build your own scraper, here's a minimal approach using requests and BeautifulSoup4:

```python
import requests
from bs4 import BeautifulSoup

def scrape_book(url):
    # A realistic User-Agent reduces the chance of an immediate block
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # These class names match Goodreads' current markup and will
    # break when the site changes its HTML
    title = soup.select_one("h1.Text__title1")
    author = soup.select_one("span.ContributorLink__name")
    rating = soup.select_one("div.RatingStatistics__rating")

    return {
        "title": title.text.strip() if title else None,
        "author": author.text.strip() if author else None,
        "rating": rating.text.strip() if rating else None,
    }
```

Pros: Full control, no platform fees, customizable output.

Cons: You handle proxies, rate limiting, retries, and maintenance when Goodreads changes its HTML. Expect to spend significant time on infrastructure rather than data analysis.
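One small addition worth making early: Goodreads book URLs embed a numeric ID (e.g. /book/show/2767052-the-hunger-games), and keeping that ID in your output makes deduplication and re-scraping much simpler than matching on titles. A minimal sketch:

```python
import re

BOOK_ID_RE = re.compile(r"/book/show/(\d+)")

def extract_book_id(url):
    """Pull the numeric book ID out of a Goodreads book URL, or None."""
    match = BOOK_ID_RE.search(url)
    return match.group(1) if match else None
```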

Option 3: Browser Automation

For JavaScript-heavy pages (like paginated reviews), you may need Playwright or Puppeteer:

```python
from playwright.sync_api import sync_playwright

def scrape_reviews(book_url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(book_url)
        # Reviews render client-side; wait until they appear
        page.wait_for_selector("article.ReviewCard")

        reviews = page.query_selector_all("article.ReviewCard")
        data = []
        for review in reviews[:10]:  # first page of reviews only
            text = review.query_selector("span.Formatted")
            stars = review.query_selector("span.RatingStars")
            data.append({
                "text": text.inner_text() if text else "",
                "rating": stars.get_attribute("aria-label") if stars else "",
            })
        browser.close()
        return data
```

This is slower and more resource-intensive but handles dynamic content that plain HTTP requests miss.
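Note that the aria-label captured above is a human-readable string, not a number. Assuming it follows the "Rating 4 out of 5" format (an assumption about current Goodreads markup), a small parsing step turns it into a float:

```python
import re

def parse_star_rating(aria_label):
    """Extract the leading number from a label like 'Rating 4 out of 5'."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*out of\s*5", aria_label or "")
    return float(match.group(1)) if match else None
```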

Comparison Table

| Feature | Apify Actors | DIY Python | Browser Automation |
| --- | --- | --- | --- |
| Setup time | Minutes | Hours | Hours |
| Proxy handling | Built-in | Manual | Manual |
| JavaScript support | Yes | No | Yes |
| Cost | Pay per usage | Free (+ proxy costs) | Free (+ proxy costs) |
| Maintenance | Actor maintainer | You | You |
| Scalability | High | Medium | Low |

Legal Considerations

Goodreads' Terms of Service prohibit automated scraping. In practice:

  • Scraping public data for research or personal use is generally low-risk
  • Scraping at high volume or for commercial redistribution carries more legal exposure
  • The Ninth Circuit's 2022 hiQ v. LinkedIn ruling held that scraping publicly accessible data does not violate the CFAA, but the case later settled, contract-based claims survived, and this is not settled law everywhere
  • Always respect robots.txt and rate-limit your requests
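Checking robots.txt is easy to automate with the standard library. The sketch below parses an inline ruleset for clarity; in practice you would point RobotFileParser at https://www.goodreads.com/robots.txt with set_url() and read(). The rules shown are illustrative, not Goodreads' actual file:

```python
from urllib.robotparser import RobotFileParser

def allowed(rules_lines, user_agent, url):
    """Return True if `url` is fetchable for `user_agent` under these rules."""
    rp = RobotFileParser()
    rp.parse(rules_lines)
    return rp.can_fetch(user_agent, url)

# Illustrative ruleset, not Goodreads' real robots.txt
example_rules = [
    "User-agent: *",
    "Disallow: /admin/",
]
```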

Recommendations

For most users: Start with an Apify actor. The time saved on proxy management and maintenance pays for itself quickly. Check our upcoming Goodreads scraper or browse the Apify Store for alternatives.

For developers who need full control: Build with requests + BeautifulSoup for metadata, add Playwright only for review scraping. Budget time for ongoing maintenance.

For one-off research: A simple Python script with time.sleep(2) between requests is often enough. No need for infrastructure.

Whatever approach you choose, Goodreads remains one of the richest sources of book data on the web — it just takes a bit more work to access it now that the API is gone.
