Goodreads is the world's largest site for readers and book recommendations, with over 150 million members and billions of data points about books, reviews, ratings, and reading habits. Whether you're building a book recommendation engine, conducting publishing market research, or analyzing reading trends, Goodreads holds the data you need.
In this comprehensive guide, we'll explore Goodreads' web structure, demonstrate how to extract book metadata, reviews, author information, and reading lists, and discuss practical approaches to scaling your scraping pipeline.
Why Scrape Goodreads?
Since Goodreads deprecated its public API in December 2020, scraping has become the primary method for accessing its data programmatically. Common use cases include:
- Publishing market research: Analyze what genres, themes, and cover styles are trending
- Book recommendation engines: Build personalized recommendation systems using rating and review data
- Author analytics: Track an author's review velocity, rating trends, and audience growth
- Competitive analysis for authors: Compare your book's performance against similar titles
- Library and bookstore inventory: Match inventory to trending titles and reader demand
- Academic research: Study reading patterns, literary trends, and cultural analysis
- Price comparison tools: Cross-reference Goodreads ratings with prices from retailers
Since the official API was retired, thousands of apps and researchers have lost their primary data access method — which has only increased demand for scraping solutions.
Understanding Goodreads' Web Structure
Goodreads is historically a Rails application, though newer page types (such as book pages) are now served by a React frontend — which is why selectors differ across page types. The URL structure itself is relatively straightforward. Here are the key page types you'll encounter:
Book Pages
Every book has a page at https://www.goodreads.com/book/show/{book_id}. For example:
https://www.goodreads.com/book/show/5907.The_Hobbit
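The numeric ID before the dot is what identifies the book; the slug after it is optional. As a small illustration (the helper name is ours, not part of Goodreads' tooling), you can pull the ID out of any book URL or slug like this:

```javascript
// Extract the numeric book ID from a Goodreads book URL or slug.
// Handles both "5907.The_Hobbit" and full /book/show/ URLs.
function extractBookId(urlOrSlug) {
  const match = String(urlOrSlug).match(/(?:\/book\/show\/)?(\d+)/);
  return match ? match[1] : null;
}
```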
Book pages contain rich metadata:
- Title, author, and cover image
- Average rating and total number of ratings
- Number of reviews and text reviews
- ISBN, ISBN13, and ASIN identifiers
- Page count, publication date, and edition info
- Genres and shelves (community-assigned categories)
- Book series information
- Similar book recommendations
Author Pages
Author pages live at https://www.goodreads.com/author/show/{author_id}:
- Author bio, photo, and website links
- Complete bibliography
- Average rating across all books
- Follower count
- Author quotes
List Pages
Goodreads Listopia is a community-curated collection of book lists:
https://www.goodreads.com/list/show/{list_id}
Lists include rankings, vote counts, and book metadata.
Search Results
Search is accessible at https://www.goodreads.com/search?q={query} and returns books, authors, and series.
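Queries need URL encoding before they go into that endpoint. A minimal sketch (the `search_type` parameter restricting results to books reflects the current site's behavior, but treat it as an assumption that may change):

```javascript
// Build a Goodreads search URL with a safely encoded query.
// searchType narrows results; "books" is the common case.
function buildSearchUrl(query, searchType = 'books') {
  const params = new URLSearchParams({ q: query, search_type: searchType });
  return `https://www.goodreads.com/search?${params.toString()}`;
}
```

The resulting HTML can then be parsed with the same cheerio patterns used throughout this guide.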
Extracting Book Metadata
Let's start with extracting detailed book information. Goodreads pages use a mix of server-rendered HTML and embedded JSON-LD structured data:
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeBookPage(bookId) {
  const url = `https://www.goodreads.com/book/show/${bookId}`;
  try {
    const response = await axios.get(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'text/html,application/xhtml+xml',
        'Accept-Language': 'en-US,en;q=0.9'
      },
      timeout: 15000
    });

    const $ = cheerio.load(response.data);

    // Extract JSON-LD data if available
    let jsonLd = {};
    $('script[type="application/ld+json"]').each((i, el) => {
      try {
        const parsed = JSON.parse($(el).html());
        if (parsed['@type'] === 'Book') {
          jsonLd = parsed;
        }
      } catch (e) {
        // Ignore malformed JSON-LD blocks
      }
    });

    // JSON-LD "author" can be a single object or an array
    const jsonLdAuthor = Array.isArray(jsonLd.author)
      ? jsonLd.author[0]?.name
      : jsonLd.author?.name;

    // Extract from HTML elements, falling back to JSON-LD
    const bookData = {
      title: $('h1[data-testid="bookTitle"]').text().trim()
        || jsonLd.name || '',
      author: $('span[data-testid="name"]').first().text().trim()
        || jsonLdAuthor || '',
      rating: parseFloat(
        $('div[class*="RatingStatistics__rating"]').first().text()
      ) || jsonLd.aggregateRating?.ratingValue || null,
      ratingsCount: parseInt(
        $('span[data-testid="ratingsCount"]').text()
          .replace(/[^0-9]/g, '')
      ) || jsonLd.aggregateRating?.ratingCount || 0,
      reviewsCount: parseInt(
        $('span[data-testid="reviewsCount"]').text()
          .replace(/[^0-9]/g, '')
      ) || jsonLd.aggregateRating?.reviewCount || 0,
      description: $('div[data-testid="description"]')
        .find('span').last().text().trim(),
      genres: $('span[class*="BookPageMetadataSection__genreButton"]')
        .map((i, el) => $(el).text().trim()).get(),
      pages: parseInt(
        $('p[data-testid="pagesFormat"]').text().match(/\d+/)?.[0]
      ) || null,
      publishDate: $('p[data-testid="publicationInfo"]').text().trim(),
      isbn: jsonLd.isbn || null,
      coverImage: $('img[class*="ResponsiveImage"]').first().attr('src')
        || jsonLd.image || null,
      url: url
    };

    return bookData;
  } catch (error) {
    console.error(`Error scraping book ${bookId}:`, error.message);
    return null;
  }
}

// Example usage
scrapeBookPage('5907.The_Hobbit').then(data => {
  console.log(JSON.stringify(data, null, 2));
});
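The `pagesFormat` field above typically reads like "310 pages, Hardcover", mixing page count and binding format into one string. A small parsing helper can split it (the helper and the assumed text shape are ours, based on current book pages):

```javascript
// Parse a "pagesFormat" string like "310 pages, Hardcover" into parts.
// The exact text shape is an assumption and may vary by edition.
function parsePagesFormat(text) {
  const pagesMatch = text.match(/(\d+)\s+pages/);
  const parts = text.split(',').map(s => s.trim());
  return {
    pages: pagesMatch ? parseInt(pagesMatch[1], 10) : null,
    format: parts.length > 1 ? parts[parts.length - 1] : null
  };
}
```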
Extracting Series Information
Many books belong to a series, and extracting that relationship is valuable:
async function getSeriesBooks(seriesId) {
  const url = `https://www.goodreads.com/series/${seriesId}`;
  try {
    const response = await axios.get(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
      }
    });

    const $ = cheerio.load(response.data);
    const books = [];

    $('div[class*="listWithDividers__item"]').each((i, el) => {
      const $el = $(el);
      const bookLink = $el.find('a[href*="/book/show/"]');
      const bookUrl = bookLink.attr('href') || '';
      const bookIdMatch = bookUrl.match(/\/book\/show\/(\d+)/);

      books.push({
        position: i + 1,
        title: bookLink.text().trim(),
        bookId: bookIdMatch ? bookIdMatch[1] : null,
        rating: parseFloat(
          $el.find('span[class*="minirating"]').text()
            .match(/[\d.]+/)?.[0]
        ) || null,
        url: bookUrl ? `https://www.goodreads.com${bookUrl}` : null
      });
    });

    return {
      seriesName: $('h1').first().text().trim(),
      totalBooks: books.length,
      books
    };
  } catch (error) {
    console.error(`Error scraping series ${seriesId}:`, error.message);
    return null;
  }
}
Scraping Book Reviews
Reviews are the heart of Goodreads. Note that the current interface loads most reviews via JavaScript, so plain HTTP requests typically return only the first server-rendered batch; exhaustive collection requires a headless browser. Here's how to extract what's available:
async function scrapeBookReviews(bookId, maxPages = 5) {
  const allReviews = [];

  for (let page = 1; page <= maxPages; page++) {
    // The page param may only affect the server-rendered batch of reviews
    const url = `https://www.goodreads.com/book/show/${bookId}`;
    const params = { page };

    try {
      const response = await axios.get(url, {
        params,
        headers: {
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
          'Accept': 'text/html,application/xhtml+xml'
        }
      });

      const $ = cheerio.load(response.data);

      $('article[class*="ReviewCard"]').each((i, el) => {
        const $review = $(el);

        // Extract star rating from filled stars
        const stars = $review
          .find('span[class*="RatingStars"]')
          .attr('aria-label');
        const ratingMatch = stars?.match(/(\d)/);

        const reviewData = {
          reviewer: $review
            .find('div[class*="ReviewerProfile__name"]')
            .text().trim(),
          rating: ratingMatch ? parseInt(ratingMatch[1]) : null,
          reviewText: $review
            .find('section[class*="ReviewText"]')
            .find('span').last().text().trim(),
          date: $review
            .find('span[class*="ReviewCard__date"]')
            .text().trim(),
          likes: parseInt(
            $review
              .find('button[class*="SocialFooter__action"]')
              .first().text().match(/\d+/)?.[0]
          ) || 0
        };

        if (reviewData.reviewText) {
          allReviews.push(reviewData);
        }
      });

      // Check if there are more pages
      const hasNext = $('a[class*="next_page"]').length > 0;
      if (!hasNext) break;

      // Rate limiting between pages
      await new Promise(resolve => setTimeout(resolve, 2000));
    } catch (error) {
      console.error(`Error on page ${page}:`, error.message);
      break;
    }
  }

  return allReviews;
}
Analyzing Review Data
Once you have reviews, you can derive insights:
function analyzeBookReviews(reviews) {
  if (reviews.length === 0) return null;

  const ratings = reviews
    .filter(r => r.rating)
    .map(r => r.rating);

  // Guard against division by zero when no review carries a star rating
  const avgRating = ratings.length
    ? ratings.reduce((a, b) => a + b, 0) / ratings.length
    : null;

  // Rating distribution
  const distribution = { 1: 0, 2: 0, 3: 0, 4: 0, 5: 0 };
  ratings.forEach(r => distribution[r]++);

  // Review length analysis
  const avgReviewLength = reviews.reduce(
    (sum, r) => sum + r.reviewText.length, 0
  ) / reviews.length;

  // Most liked reviews
  const topReviews = [...reviews]
    .sort((a, b) => b.likes - a.likes)
    .slice(0, 5);

  return {
    totalReviews: reviews.length,
    averageRating: avgRating !== null ? avgRating.toFixed(2) : null,
    ratingDistribution: distribution,
    averageReviewLength: Math.round(avgReviewLength),
    topReviews: topReviews.map(r => ({
      reviewer: r.reviewer,
      rating: r.rating,
      likes: r.likes,
      excerpt: r.reviewText.substring(0, 200) + '...'
    }))
  };
}
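One caveat when comparing books by average rating: a 5.0 average from two reviews shouldn't outrank a 4.5 from a thousand. A common fix in recommendation work is a Bayesian (weighted) average that shrinks small samples toward a prior. This helper is an illustrative addition of ours; the default prior of 3.9 is just a placeholder for a site-wide mean you'd estimate from your own data:

```javascript
// Bayesian average: shrink a book's mean rating toward a prior mean,
// weighted by how many real ratings it has. priorWeight acts like
// "this many phantom ratings at priorMean".
function bayesianRating(avgRating, ratingsCount, priorMean = 3.9, priorWeight = 50) {
  return (avgRating * ratingsCount + priorMean * priorWeight)
    / (ratingsCount + priorWeight);
}
```

With this, a book rated 5.0 by 2 readers scores below one rated 4.5 by 1,000 readers, which usually matches intuition.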
Scraping Author Pages
Author data provides valuable context for publishing analytics:
async function scrapeAuthorPage(authorId) {
  const url = `https://www.goodreads.com/author/show/${authorId}`;
  try {
    const response = await axios.get(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
      }
    });

    const $ = cheerio.load(response.data);
    const books = [];

    $('tr[itemtype="http://schema.org/Book"]').each((i, el) => {
      const $book = $(el);
      books.push({
        title: $book.find('a.bookTitle').text().trim(),
        rating: parseFloat(
          $book.find('span.minirating').text()
            .match(/[\d.]+/)?.[0]
        ) || null,
        url: 'https://www.goodreads.com' +
          ($book.find('a.bookTitle').attr('href') || '')
      });
    });

    return {
      name: $('h1.authorName span[itemprop="name"]').text().trim(),
      bio: $('div.aboutAuthorInfo span').last().text().trim(),
      website: $('div.dataItem a[href*="http"]').first().attr('href') || null,
      bornDate: $('div.dataItem[itemprop="birthDate"]').text().trim(),
      genres: $('div.dataItem a[href*="/genres/"]')
        .map((i, el) => $(el).text().trim()).get(),
      averageRating: parseFloat(
        $('span.average[itemprop="ratingValue"]').text()
      ) || null,
      followerCount: $('div[class*="followerCount"]').text().trim(),
      books: books,
      totalBooks: books.length,
      url: url
    };
  } catch (error) {
    console.error(`Error scraping author ${authorId}:`, error.message);
    return null;
  }
}
Extracting Reading Lists (Listopia)
Goodreads lists are goldmines for trending book analysis:
async function scrapeList(listId, maxPages = 3) {
  const allBooks = [];

  for (let page = 1; page <= maxPages; page++) {
    const url = `https://www.goodreads.com/list/show/${listId}`;
    const params = { page };

    try {
      const response = await axios.get(url, {
        params,
        headers: {
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
      });

      const $ = cheerio.load(response.data);

      $('tr[itemtype="http://schema.org/Book"]').each((i, el) => {
        const $book = $(el);
        const bookLink = $book.find('a.bookTitle');
        const href = bookLink.attr('href') || '';
        const bookIdMatch = href.match(/\/book\/show\/(\d+)/);

        allBooks.push({
          rank: allBooks.length + 1,
          title: bookLink.text().trim(),
          author: $book.find('a.authorName span').text().trim(),
          rating: parseFloat(
            $book.find('span.minirating').text()
              .match(/[\d.]+/)?.[0]
          ) || null,
          votes: parseInt(
            $book.find('span[class*="votes"]').text()
              .replace(/[^0-9]/g, '')
          ) || 0,
          bookId: bookIdMatch ? bookIdMatch[1] : null,
          url: href ? `https://www.goodreads.com${href}` : null
        });
      });

      await new Promise(resolve => setTimeout(resolve, 2000));
    } catch (error) {
      console.error(`Error on list page ${page}:`, error.message);
      break;
    }
  }

  return allBooks;
}
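Since the same book often appears on several lists, a natural next step after scraping multiple lists is to merge the results. This helper is our own illustrative addition — it deduplicates by `bookId` (as produced by `scrapeList` above) and treats cross-list appearances as a trend signal:

```javascript
// Merge books gathered from several Listopia lists, deduplicating by
// bookId, summing votes, and counting how many lists each book appears on.
function mergeListBooks(...lists) {
  const byId = new Map();
  for (const books of lists) {
    for (const book of books) {
      if (!book.bookId) continue; // skip entries we couldn't identify
      const existing = byId.get(book.bookId);
      if (existing) {
        existing.votes += book.votes;
        existing.listAppearances += 1;
      } else {
        byId.set(book.bookId, { ...book, listAppearances: 1 });
      }
    }
  }
  // Books on more lists (then with more votes) rank first
  return [...byId.values()]
    .sort((a, b) => b.listAppearances - a.listAppearances || b.votes - a.votes);
}
```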
Handling Goodreads' Anti-Scraping Measures
Goodreads has become increasingly aggressive about blocking scrapers. Common challenges include:
- Rate limiting: Goodreads will serve 403 or 503 errors after too many requests
- CAPTCHA pages: Automated traffic triggers CAPTCHA challenges
- Dynamic content loading: Some review content loads via JavaScript
- Login walls: Certain data requires authentication
- Cloudflare protection: Additional bot detection layer
Best practices for avoiding blocks:
// Request throttling with jitter
function randomDelay(min = 2000, max = 5000) {
  const delay = Math.floor(Math.random() * (max - min + 1) + min);
  return new Promise(resolve => setTimeout(resolve, delay));
}

// Rotate user agents
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0'
];

function getRandomUA() {
  return userAgents[Math.floor(Math.random() * userAgents.length)];
}

// Respectful scraping wrapper
async function fetchWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await axios.get(url, {
        headers: {
          'User-Agent': getRandomUA(),
          'Accept': 'text/html,application/xhtml+xml',
          'Accept-Language': 'en-US,en;q=0.9',
          'Accept-Encoding': 'gzip, deflate, br'
        },
        timeout: 15000
      });
      if (response.status === 200) return response;
    } catch (error) {
      if (error.response?.status === 429
          || error.response?.status === 503) {
        // Exponential backoff
        const wait = Math.pow(2, attempt) * 5000;
        console.log(`Rate limited. Waiting ${wait / 1000}s...`);
        await new Promise(r => setTimeout(r, wait));
      } else {
        throw error;
      }
    }
  }
  return null;
}
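Throttling and retries control per-request behavior, but when crawling hundreds of book pages you also want to cap how many requests run at once. A minimal concurrency limiter (an illustrative helper of ours, not tied to any library) looks like this:

```javascript
// Run async tasks over a list of items with a cap on how many are
// in flight at once — useful for crawling many book pages without
// hammering the server.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }

  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

Combined with `fetchWithRetry` and `randomDelay`, this gives you a polite crawler: e.g. `mapWithConcurrency(bookIds, 2, async id => { await randomDelay(); return fetchWithRetry(...); })`.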
Scaling with Apify
For production-grade Goodreads scraping, managing proxies, retries, and rate limits manually becomes untenable. The Apify Store provides managed scraping infrastructure that handles these challenges.
Apify actors for Goodreads data provide:
- Residential proxy rotation to bypass Cloudflare and rate limits
- Automatic retry logic with exponential backoff
- Headless browser support for JavaScript-rendered content
- Scheduled runs for monitoring new releases and trending lists
- Built-in storage with export to JSON, CSV, or direct database integration
Example of running a Goodreads actor via the Apify API:
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
  token: 'YOUR_APIFY_TOKEN'
});

async function scrapeGoodreadsBooks(bookUrls) {
  const run = await client.actor('your-goodreads-actor').call({
    startUrls: bookUrls.map(url => ({ url })),
    maxReviewsPerBook: 100,
    includeAuthorData: true,
    proxy: {
      useApifyProxy: true,
      apifyProxyGroups: ['RESIDENTIAL']
    }
  });

  const { items } = await client
    .dataset(run.defaultDatasetId)
    .listItems();

  return items;
}

// Scrape multiple books
scrapeGoodreadsBooks([
  'https://www.goodreads.com/book/show/5907.The_Hobbit',
  'https://www.goodreads.com/book/show/3.Harry_Potter_and_the_Sorcerer_s_Stone',
  'https://www.goodreads.com/book/show/11127.The_Hitchhiker_s_Guide_to_the_Galaxy'
]).then(books => {
  console.log(`Scraped ${books.length} books`);
  books.forEach(b => console.log(`${b.title}: ${b.rating}/5`));
});
Practical Use Case: Building a Genre Trend Tracker
Here's a complete example that combines multiple scraping techniques to track genre trends:
async function trackGenreTrends(genreSlugs) {
  const trends = {};

  for (const genre of genreSlugs) {
    const url = `https://www.goodreads.com/shelf/show/${genre}`;
    try {
      const response = await fetchWithRetry(url);
      if (!response) continue;

      const $ = cheerio.load(response.data);
      const books = [];

      $('div.leftAlignedImage').each((i, el) => {
        if (i >= 20) return false; // stop after the top 20
        const $el = $(el);
        books.push({
          title: $el.find('a.bookTitle').text().trim(),
          author: $el.find('a.authorName').text().trim()
        });
      });

      trends[genre] = {
        topBooks: books,
        bookCount: parseInt(
          $('div.mediumText')
            .text().match(/([\d,]+) books/)?.[1]
            ?.replace(/,/g, '')
        ) || 0,
        scrapedAt: new Date().toISOString()
      };

      await randomDelay(3000, 6000);
    } catch (error) {
      console.error(`Error tracking ${genre}:`, error.message);
    }
  }

  return trends;
}

// Track multiple genres
trackGenreTrends([
  'fantasy', 'science-fiction', 'romance',
  'thriller', 'literary-fiction'
]).then(trends => {
  for (const [genre, data] of Object.entries(trends)) {
    console.log(`\n${genre}: ${data.bookCount} total books`);
    data.topBooks.slice(0, 5).forEach((b, i) => {
      console.log(`  ${i + 1}. ${b.title} by ${b.author}`);
    });
  }
});
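A single snapshot only tells you where genres stand today; the trend emerges when you compare snapshots taken over time. As a sketch (this diffing helper is our illustrative addition, operating on objects shaped like `trackGenreTrends` output):

```javascript
// Compare two genre-trend snapshots and report which genres grew the
// most by shelf book count. Each snapshot maps genre -> { bookCount }.
function diffTrends(older, newer) {
  return Object.keys(newer)
    .filter(genre => genre in older)
    .map(genre => ({
      genre,
      delta: newer[genre].bookCount - older[genre].bookCount,
      growthPct: older[genre].bookCount
        ? ((newer[genre].bookCount - older[genre].bookCount)
            / older[genre].bookCount) * 100
        : null
    }))
    .sort((a, b) => b.delta - a.delta); // biggest gainers first
}
```

Run `trackGenreTrends` on a schedule, persist each snapshot with its `scrapedAt` timestamp, and feed consecutive pairs into `diffTrends` to spot rising genres.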
Legal and Ethical Considerations
Goodreads scraping comes with important caveats:
- Terms of Service: Goodreads' ToS prohibits automated scraping. Use at your own risk and for legitimate purposes
- Data privacy: Don't collect or store personal user data beyond what's publicly visible
- Rate limiting: Be respectful of their infrastructure — aggressive scraping hurts everyone
- robots.txt: Check and respect their robots.txt directives
- Commercial use: If building a commercial product with Goodreads data, consult a lawyer
- Attribution: If displaying Goodreads data publicly, provide proper attribution
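Checking robots.txt can be partially automated. The sketch below is a deliberate simplification — it only reads `Disallow` rules under the `User-agent: *` group and ignores `Allow` precedence, wildcards, and agent-specific groups, so treat it as a first-pass filter rather than a full implementation of the robots exclusion standard:

```javascript
// Minimal robots.txt check: collect Disallow rules under "User-agent: *"
// and test whether a given path is allowed to be crawled.
function isPathAllowed(robotsTxt, path) {
  const lines = robotsTxt.split('\n').map(l => l.trim());
  let applies = false;
  const disallowed = [];
  for (const line of lines) {
    const [rawKey, ...rest] = line.split(':');
    const key = rawKey.toLowerCase().trim();
    const value = rest.join(':').trim();
    if (key === 'user-agent') applies = value === '*';
    else if (key === 'disallow' && applies && value) disallowed.push(value);
  }
  return !disallowed.some(prefix => path.startsWith(prefix));
}
```

Fetch `https://www.goodreads.com/robots.txt` once per session and gate every request URL through a check like this before crawling.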
Conclusion
Despite the API deprecation, Goodreads remains one of the richest sources of book data on the internet. With careful scraping techniques, proper rate limiting, and respect for the platform's resources, you can extract valuable insights for publishing research, recommendation engines, and literary analysis.
For production workloads, platforms like Apify handle the infrastructure complexity — proxy rotation, CAPTCHA solving, scheduling, and data storage — so you can focus on deriving value from the data rather than maintaining scraping infrastructure.
Whether you're an indie author researching your market, a publisher tracking trends, or a developer building the next great book discovery tool, the techniques in this guide give you a solid foundation for accessing Goodreads' wealth of book data.