Goodreads is the world's largest site for readers and book recommendations, with over 150 million members and billions of data points about books, reviews, ratings, and reading habits. Whether you're building a book recommendation engine, conducting publishing market research, or analyzing reading trends, Goodreads holds the data you need.
In this comprehensive guide, we'll explore Goodreads' web structure, demonstrate how to extract book metadata, reviews, author information, and reading lists, and discuss practical approaches to scaling your scraping pipeline.
Why Scrape Goodreads?
Since Goodreads deprecated its public API in December 2020, scraping has become the primary method for accessing its data programmatically. Common use cases include:
- Publishing market research: Analyze what genres, themes, and cover styles are trending
- Book recommendation engines: Build personalized recommendation systems using rating and review data
- Author analytics: Track an author's review velocity, rating trends, and audience growth
- Competitive analysis for authors: Compare your book's performance against similar titles
- Library and bookstore inventory: Match inventory to trending titles and reader demand
- Academic research: Study reading patterns, literary trends, and cultural analysis
- Price comparison tools: Cross-reference Goodreads ratings with prices from retailers
Since the official API was retired, thousands of apps and researchers have lost their primary data access method — which has only increased demand for scraping solutions.
Understanding Goodreads' Web Structure
Goodreads is historically a Rails application, though newer page types (such as book pages) are now served by a React frontend — which is why selectors differ across page types. The URL structure itself is relatively straightforward. Here are the key page types you'll encounter:
Book Pages
Every book has a page at https://www.goodreads.com/book/show/{book_id}. For example:
https://www.goodreads.com/book/show/5907.The_Hobbit
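The numeric ID before the dot is what identifies the book; the slug after it is optional. As a small illustration (the helper name is ours, not part of Goodreads' tooling), you can pull the ID out of any book URL or slug like this:

```javascript
// Extract the numeric book ID from a Goodreads book URL or slug.
// Handles both "5907.The_Hobbit" and full /book/show/ URLs.
function extractBookId(urlOrSlug) {
  const match = String(urlOrSlug).match(/(?:\/book\/show\/)?(\d+)/);
  return match ? match[1] : null;
}
```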
Book pages contain rich metadata:
- Title, author, and cover image
- Average rating and total number of ratings
- Number of reviews and text reviews
- ISBN, ISBN13, and ASIN identifiers
- Page count, publication date, and edition info
- Genres and shelves (community-assigned categories)
- Book series information
- Similar book recommendations
Author Pages
Author pages live at https://www.goodreads.com/author/show/{author_id}:
- Author bio, photo, and website links
- Complete bibliography
- Average rating across all books
- Follower count
- Author quotes
List Pages
Goodreads Listopia is a community-curated collection of book lists:
https://www.goodreads.com/list/show/{list_id}
Lists include rankings, vote counts, and book metadata.
Search Results
Search is accessible at https://www.goodreads.com/search?q={query} and returns books, authors, and series.
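Queries need URL encoding before they go into that endpoint. A minimal sketch (the `search_type` parameter restricting results to books reflects the current site's behavior, but treat it as an assumption that may change):

```javascript
// Build a Goodreads search URL with a safely encoded query.
// searchType narrows results; "books" is the common case.
function buildSearchUrl(query, searchType = 'books') {
  const params = new URLSearchParams({ q: query, search_type: searchType });
  return `https://www.goodreads.com/search?${params.toString()}`;
}
```

The resulting HTML can then be parsed with the same cheerio patterns used throughout this guide.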
Extracting Book Metadata
Let's start with extracting detailed book information. Goodreads pages use a mix of server-rendered HTML and embedded JSON-LD structured data:
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeBookPage(bookId) {
  const url = `https://www.goodreads.com/book/show/${bookId}`;
  try {
    const response = await axios.get(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'text/html,application/xhtml+xml',
        'Accept-Language': 'en-US,en;q=0.9'
      },
      timeout: 15000
    });

    const $ = cheerio.load(response.data);

    // Extract JSON-LD data if available
    let jsonLd = {};
    $('script[type="application/ld+json"]').each((i, el) => {
      try {
        const parsed = JSON.parse($(el).html());
        if (parsed['@type'] === 'Book') {
          jsonLd = parsed;
        }
      } catch (e) {
        // Ignore malformed JSON-LD blocks
      }
    });

    // JSON-LD "author" can be a single object or an array
    const jsonLdAuthor = Array.isArray(jsonLd.author)
      ? jsonLd.author[0]?.name
      : jsonLd.author?.name;

    // Extract from HTML elements, falling back to JSON-LD
    const bookData = {
      title: $('h1[data-testid="bookTitle"]').text().trim()
        || jsonLd.name || '',
      author: $('span[data-testid="name"]').first().text().trim()
        || jsonLdAuthor || '',
      rating: parseFloat(
        $('div[class*="RatingStatistics__rating"]').first().text()
      ) || jsonLd.aggregateRating?.ratingValue || null,
      ratingsCount: parseInt(
        $('span[data-testid="ratingsCount"]').text()
          .replace(/[^0-9]/g, '')
      ) || jsonLd.aggregateRating?.ratingCount || 0,
      reviewsCount: parseInt(
        $('span[data-testid="reviewsCount"]').text()
          .replace(/[^0-9]/g, '')
      ) || jsonLd.aggregateRating?.reviewCount || 0,
      description: $('div[data-testid="description"]')
        .find('span').last().text().trim(),
      genres: $('span[class*="BookPageMetadataSection__genreButton"]')
        .map((i, el) => $(el).text().trim()).get(),
      pages: parseInt(
        $('p[data-testid="pagesFormat"]').text().match(/\d+/)?.[0]
      ) || null,
      publishDate: $('p[data-testid="publicationInfo"]').text().trim(),
      isbn: jsonLd.isbn || null,
      coverImage: $('img[class*="ResponsiveImage"]').first().attr('src')
        || jsonLd.image || null,
      url: url
    };

    return bookData;
  } catch (error) {
    console.error(`Error scraping book ${bookId}:`, error.message);
    return null;
  }
}

// Example usage
scrapeBookPage('5907.The_Hobbit').then(data => {
  console.log(JSON.stringify(data, null, 2));
});
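The `pagesFormat` field above typically reads like "310 pages, Hardcover", mixing page count and binding format into one string. A small parsing helper can split it (the helper and the assumed text shape are ours, based on current book pages):

```javascript
// Parse a "pagesFormat" string like "310 pages, Hardcover" into parts.
// The exact text shape is an assumption and may vary by edition.
function parsePagesFormat(text) {
  const pagesMatch = text.match(/(\d+)\s+pages/);
  const parts = text.split(',').map(s => s.trim());
  return {
    pages: pagesMatch ? parseInt(pagesMatch[1], 10) : null,
    format: parts.length > 1 ? parts[parts.length - 1] : null
  };
}
```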
Extracting Series Information
Many books belong to a series, and extracting that relationship is valuable:
async function getSeriesBooks(seriesId) {
  const url = `https://www.goodreads.com/series/${seriesId}`;
  try {
    const response = await axios.get(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
      }
    });

    const $ = cheerio.load(response.data);
    const books = [];

    $('div[class*="listWithDividers__item"]').each((i, el) => {
      const $el = $(el);
      const bookLink = $el.find('a[href*="/book/show/"]');
      const bookUrl = bookLink.attr('href') || '';
      const bookIdMatch = bookUrl.match(/\/book\/show\/(\d+)/);

      books.push({
        position: i + 1,
        title: bookLink.text().trim(),
        bookId: bookIdMatch ? bookIdMatch[1] : null,
        rating: parseFloat(
          $el.find('span[class*="minirating"]').text()
            .match(/[\d.]+/)?.[0]
        ) || null,
        url: bookUrl ? `https://www.goodreads.com${bookUrl}` : null
      });
    });

    return {
      seriesName: $('h1').first().text().trim(),
      totalBooks: books.length,
      books
    };
  } catch (error) {
    console.error(`Error scraping series ${seriesId}:`, error.message);
    return null;
  }
}
Scraping Book Reviews
Reviews are the heart of Goodreads. Note that the current interface loads most reviews via JavaScript, so plain HTTP requests typically return only the first server-rendered batch; exhaustive collection requires a headless browser. Here's how to extract what's available:
async function scrapeBookReviews(bookId, maxPages = 5) {
  const allReviews = [];

  for (let page = 1; page <= maxPages; page++) {
    // The page param may only affect the server-rendered batch of reviews
    const url = `https://www.goodreads.com/book/show/${bookId}`;
    const params = { page };

    try {
      const response = await axios.get(url, {
        params,
        headers: {
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
          'Accept': 'text/html,application/xhtml+xml'
        }
      });

      const $ = cheerio.load(response.data);

      $('article[class*="ReviewCard"]').each((i, el) => {
        const $review = $(el);

        // Extract star rating from filled stars
        const stars = $review
          .find('span[class*="RatingStars"]')
          .attr('aria-label');
        const ratingMatch = stars?.match(/(\d)/);

        const reviewData = {
          reviewer: $review
            .find('div[class*="ReviewerProfile__name"]')
            .text().trim(),
          rating: ratingMatch ? parseInt(ratingMatch[1]) : null,
          reviewText: $review
            .find('section[class*="ReviewText"]')
            .find('span').last().text().trim(),
          date: $review
            .find('span[class*="ReviewCard__date"]')
            .text().trim(),
          likes: parseInt(
            $review
              .find('button[class*="SocialFooter__action"]')
              .first().text().match(/\d+/)?.[0]
          ) || 0
        };

        if (reviewData.reviewText) {
          allReviews.push(reviewData);
        }
      });

      // Check if there are more pages
      const hasNext = $('a[class*="next_page"]').length > 0;
      if (!hasNext) break;

      // Rate limiting between pages
      await new Promise(resolve => setTimeout(resolve, 2000));
    } catch (error) {
      console.error(`Error on page ${page}:`, error.message);
      break;
    }
  }

  return allReviews;
}
Analyzing Review Data
Once you have reviews, you can derive insights:
function analyzeBookReviews(reviews) {
  if (reviews.length === 0) return null;

  const ratings = reviews
    .filter(r => r.rating)
    .map(r => r.rating);

  // Guard against division by zero when no review carries a star rating
  const avgRating = ratings.length
    ? ratings.reduce((a, b) => a + b, 0) / ratings.length
    : null;

  // Rating distribution
  const distribution = { 1: 0, 2: 0, 3: 0, 4: 0, 5: 0 };
  ratings.forEach(r => distribution[r]++);

  // Review length analysis
  const avgReviewLength = reviews.reduce(
    (sum, r) => sum + r.reviewText.length, 0
  ) / reviews.length;

  // Most liked reviews
  const topReviews = [...reviews]
    .sort((a, b) => b.likes - a.likes)
    .slice(0, 5);

  return {
    totalReviews: reviews.length,
    averageRating: avgRating !== null ? avgRating.toFixed(2) : null,
    ratingDistribution: distribution,
    averageReviewLength: Math.round(avgReviewLength),
    topReviews: topReviews.map(r => ({
      reviewer: r.reviewer,
      rating: r.rating,
      likes: r.likes,
      excerpt: r.reviewText.substring(0, 200) + '...'
    }))
  };
}
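One caveat when comparing books by average rating: a 5.0 average from two reviews shouldn't outrank a 4.5 from a thousand. A common fix in recommendation work is a Bayesian (weighted) average that shrinks small samples toward a prior. This helper is an illustrative addition of ours; the default prior of 3.9 is just a placeholder for a site-wide mean you'd estimate from your own data:

```javascript
// Bayesian average: shrink a book's mean rating toward a prior mean,
// weighted by how many real ratings it has. priorWeight acts like
// "this many phantom ratings at priorMean".
function bayesianRating(avgRating, ratingsCount, priorMean = 3.9, priorWeight = 50) {
  return (avgRating * ratingsCount + priorMean * priorWeight)
    / (ratingsCount + priorWeight);
}
```

With this, a book rated 5.0 by 2 readers scores below one rated 4.5 by 1,000 readers, which usually matches intuition.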
Scraping Author Pages
Author data provides valuable context for publishing analytics:
async function scrapeAuthorPage(authorId) {
  const url = `https://www.goodreads.com/author/show/${authorId}`;
  try {
    const response = await axios.get(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
      }
    });

    const $ = cheerio.load(response.data);
    const books = [];

    $('tr[itemtype="http://schema.org/Book"]').each((i, el) => {
      const $book = $(el);
      books.push({
        title: $book.find('a.bookTitle').text().trim(),
        rating: parseFloat(
          $book.find('span.minirating').text()
            .match(/[\d.]+/)?.[0]
        ) || null,
        url: 'https://www.goodreads.com' +
          ($book.find('a.bookTitle').attr('href') || '')
      });
    });

    return {
      name: $('h1.authorName span[itemprop="name"]').text().trim(),
      bio: $('div.aboutAuthorInfo span').last().text().trim(),
      website: $('div.dataItem a[href*="http"]').first().attr('href') || null,
      bornDate: $('div.dataItem[itemprop="birthDate"]').text().trim(),
      genres: $('div.dataItem a[href*="/genres/"]')
        .map((i, el) => $(el).text().trim()).get(),
      averageRating: parseFloat(
        $('span.average[itemprop="ratingValue"]').text()
      ) || null,
      followerCount: $('div[class*="followerCount"]').text().trim(),
      books: books,
      totalBooks: books.length,
      url: url
    };
  } catch (error) {
    console.error(`Error scraping author ${authorId}:`, error.message);
    return null;
  }
}
Extracting Reading Lists (Listopia)
Goodreads lists are goldmines for trending book analysis:
async function scrapeList(listId, maxPages = 3) {
  const allBooks = [];

  for (let page = 1; page <= maxPages; page++) {
    const url = `https://www.goodreads.com/list/show/${listId}`;
    const params = { page };

    try {
      const response = await axios.get(url, {
        params,
        headers: {
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
      });

      const $ = cheerio.load(response.data);

      $('tr[itemtype="http://schema.org/Book"]').each((i, el) => {
        const $book = $(el);
        const bookLink = $book.find('a.bookTitle');
        const href = bookLink.attr('href') || '';
        const bookIdMatch = href.match(/\/book\/show\/(\d+)/);

        allBooks.push({
          rank: allBooks.length + 1,
          title: bookLink.text().trim(),
          author: $book.find('a.authorName span').text().trim(),
          rating: parseFloat(
            $book.find('span.minirating').text()
              .match(/[\d.]+/)?.[0]
          ) || null,
          votes: parseInt(
            $book.find('span[class*="votes"]').text()
              .replace(/[^0-9]/g, '')
          ) || 0,
          bookId: bookIdMatch ? bookIdMatch[1] : null,
          url: href ? `https://www.goodreads.com${href}` : null
        });
      });

      await new Promise(resolve => setTimeout(resolve, 2000));
    } catch (error) {
      console.error(`Error on list page ${page}:`, error.message);
      break;
    }
  }

  return allBooks;
}
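Since the same book often appears on several lists, a natural next step after scraping multiple lists is to merge the results. This helper is our own illustrative addition — it deduplicates by `bookId` (as produced by `scrapeList` above) and treats cross-list appearances as a trend signal:

```javascript
// Merge books gathered from several Listopia lists, deduplicating by
// bookId, summing votes, and counting how many lists each book appears on.
function mergeListBooks(...lists) {
  const byId = new Map();
  for (const books of lists) {
    for (const book of books) {
      if (!book.bookId) continue; // skip entries we couldn't identify
      const existing = byId.get(book.bookId);
      if (existing) {
        existing.votes += book.votes;
        existing.listAppearances += 1;
      } else {
        byId.set(book.bookId, { ...book, listAppearances: 1 });
      }
    }
  }
  // Books on more lists (then with more votes) rank first
  return [...byId.values()]
    .sort((a, b) => b.listAppearances - a.listAppearances || b.votes - a.votes);
}
```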
Handling Goodreads' Anti-Scraping Measures
Goodreads has become increasingly aggressive about blocking scrapers. Common challenges include:
- Rate limiting: Goodreads will serve 403 or 503 errors after too many requests
- CAPTCHA pages: Automated traffic triggers CAPTCHA challenges
- Dynamic content loading: Some review content loads via JavaScript
- Login walls: Certain data requires authentication
- Cloudflare protection: Additional bot detection layer
Best practices for avoiding blocks:
// Request throttling with jitter
function randomDelay(min = 2000, max = 5000) {
  const delay = Math.floor(Math.random() * (max - min + 1) + min);
  return new Promise(resolve => setTimeout(resolve, delay));
}

// Rotate user agents
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0'
];

function getRandomUA() {
  return userAgents[Math.floor(Math.random() * userAgents.length)];
}

// Respectful scraping wrapper
async function fetchWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await axios.get(url, {
        headers: {
          'User-Agent': getRandomUA(),
          'Accept': 'text/html,application/xhtml+xml',
          'Accept-Language': 'en-US,en;q=0.9',
          'Accept-Encoding': 'gzip, deflate, br'
        },
        timeout: 15000
      });
      if (response.status === 200) return response;
    } catch (error) {
      if (error.response?.status === 429
          || error.response?.status === 503) {
        // Exponential backoff
        const wait = Math.pow(2, attempt) * 5000;
        console.log(`Rate limited. Waiting ${wait / 1000}s...`);
        await new Promise(r => setTimeout(r, wait));
      } else {
        throw error;
      }
    }
  }
  return null;
}
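Throttling and retries control per-request behavior, but when crawling hundreds of book pages you also want to cap how many requests run at once. A minimal concurrency limiter (an illustrative helper of ours, not tied to any library) looks like this:

```javascript
// Run async tasks over a list of items with a cap on how many are
// in flight at once — useful for crawling many book pages without
// hammering the server.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }

  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

Combined with `fetchWithRetry` and `randomDelay`, this gives you a polite crawler: e.g. `mapWithConcurrency(bookIds, 2, async id => { await randomDelay(); return fetchWithRetry(...); })`.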
Scaling with Apify
For production-grade Goodreads scraping, managing proxies, retries, and rate limits manually becomes untenable. The Apify Store provides managed scraping infrastructure that handles these challenges.
Apify actors for Goodreads data provide:
- Residential proxy rotation to bypass Cloudflare and rate limits
- Automatic retry logic with exponential backoff
- Headless browser support for JavaScript-rendered content
- Scheduled runs for monitoring new releases and trending lists
- Built-in storage with export to JSON, CSV, or direct database integration
Example of running a Goodreads actor via the Apify API:
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
  token: 'YOUR_APIFY_TOKEN'
});

async function scrapeGoodreadsBooks(bookUrls) {
  const run = await client.actor('your-goodreads-actor').call({
    startUrls: bookUrls.map(url => ({ url })),
    maxReviewsPerBook: 100,
    includeAuthorData: true,
    proxy: {
      useApifyProxy: true,
      apifyProxyGroups: ['RESIDENTIAL']
    }
  });

  const { items } = await client
    .dataset(run.defaultDatasetId)
    .listItems();

  return items;
}

// Scrape multiple books
scrapeGoodreadsBooks([
  'https://www.goodreads.com/book/show/5907.The_Hobbit',
  'https://www.goodreads.com/book/show/3.Harry_Potter_and_the_Sorcerer_s_Stone',
  'https://www.goodreads.com/book/show/11127.The_Hitchhiker_s_Guide_to_the_Galaxy'
]).then(books => {
  console.log(`Scraped ${books.length} books`);
  books.forEach(b => console.log(`${b.title}: ${b.rating}/5`));
});
Practical Use Case: Building a Genre Trend Tracker
Here's a complete example that combines multiple scraping techniques to track genre trends:
async function trackGenreTrends(genreSlugs) {
  const trends = {};

  for (const genre of genreSlugs) {
    const url = `https://www.goodreads.com/shelf/show/${genre}`;
    try {
      const response = await fetchWithRetry(url);
      if (!response) continue;

      const $ = cheerio.load(response.data);
      const books = [];

      $('div.leftAlignedImage').each((i, el) => {
        if (i >= 20) return false; // stop after the top 20
        const $el = $(el);
        books.push({
          title: $el.find('a.bookTitle').text().trim(),
          author: $el.find('a.authorName').text().trim()
        });
      });

      trends[genre] = {
        topBooks: books,
        bookCount: parseInt(
          $('div.mediumText')
            .text().match(/([\d,]+) books/)?.[1]
            ?.replace(/,/g, '')
        ) || 0,
        scrapedAt: new Date().toISOString()
      };

      await randomDelay(3000, 6000);
    } catch (error) {
      console.error(`Error tracking ${genre}:`, error.message);
    }
  }

  return trends;
}

// Track multiple genres
trackGenreTrends([
  'fantasy', 'science-fiction', 'romance',
  'thriller', 'literary-fiction'
]).then(trends => {
  for (const [genre, data] of Object.entries(trends)) {
    console.log(`\n${genre}: ${data.bookCount} total books`);
    data.topBooks.slice(0, 5).forEach((b, i) => {
      console.log(`  ${i + 1}. ${b.title} by ${b.author}`);
    });
  }
});
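A single snapshot only tells you where genres stand today; the trend emerges when you compare snapshots taken over time. As a sketch (this diffing helper is our illustrative addition, operating on objects shaped like `trackGenreTrends` output):

```javascript
// Compare two genre-trend snapshots and report which genres grew the
// most by shelf book count. Each snapshot maps genre -> { bookCount }.
function diffTrends(older, newer) {
  return Object.keys(newer)
    .filter(genre => genre in older)
    .map(genre => ({
      genre,
      delta: newer[genre].bookCount - older[genre].bookCount,
      growthPct: older[genre].bookCount
        ? ((newer[genre].bookCount - older[genre].bookCount)
            / older[genre].bookCount) * 100
        : null
    }))
    .sort((a, b) => b.delta - a.delta); // biggest gainers first
}
```

Run `trackGenreTrends` on a schedule, persist each snapshot with its `scrapedAt` timestamp, and feed consecutive pairs into `diffTrends` to spot rising genres.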
Legal and Ethical Considerations
Goodreads scraping comes with important caveats:
- Terms of Service: Goodreads' ToS prohibits automated scraping. Use at your own risk and for legitimate purposes
- Data privacy: Don't collect or store personal user data beyond what's publicly visible
- Rate limiting: Be respectful of their infrastructure — aggressive scraping hurts everyone
- robots.txt: Check and respect their robots.txt directives
- Commercial use: If building a commercial product with Goodreads data, consult a lawyer
- Attribution: If displaying Goodreads data publicly, provide proper attribution
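Checking robots.txt can be partially automated. The sketch below is a deliberate simplification — it only reads `Disallow` rules under the `User-agent: *` group and ignores `Allow` precedence, wildcards, and agent-specific groups, so treat it as a first-pass filter rather than a full implementation of the robots exclusion standard:

```javascript
// Minimal robots.txt check: collect Disallow rules under "User-agent: *"
// and test whether a given path is allowed to be crawled.
function isPathAllowed(robotsTxt, path) {
  const lines = robotsTxt.split('\n').map(l => l.trim());
  let applies = false;
  const disallowed = [];
  for (const line of lines) {
    const [rawKey, ...rest] = line.split(':');
    const key = rawKey.toLowerCase().trim();
    const value = rest.join(':').trim();
    if (key === 'user-agent') applies = value === '*';
    else if (key === 'disallow' && applies && value) disallowed.push(value);
  }
  return !disallowed.some(prefix => path.startsWith(prefix));
}
```

Fetch `https://www.goodreads.com/robots.txt` once per session and gate every request URL through a check like this before crawling.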
Conclusion
Despite the API deprecation, Goodreads remains one of the richest sources of book data on the internet. With careful scraping techniques, proper rate limiting, and respect for the platform's resources, you can extract valuable insights for publishing research, recommendation engines, and literary analysis.
For production workloads, platforms like Apify handle the infrastructure complexity — proxy rotation, CAPTCHA solving, scheduling, and data storage — so you can focus on deriving value from the data rather than maintaining scraping infrastructure.
Whether you're an indie author researching your market, a publisher tracking trends, or a developer building the next great book discovery tool, the techniques in this guide give you a solid foundation for accessing Goodreads' wealth of book data.