Crawlbase

Posted on Oct 15, 2024 • Edited on Oct 17, 2024 • Originally published at crawlbase.com

Scrape Goodreads for Book Ratings and Comments

#scrapegoodreads #findbookratings

This blog was originally posted to Crawlbase Blog

Goodreads stands out as a top online destination for people to share their thoughts on books. With its community of over 90 million signed-up users, the site buzzes with reviews, comments, and ratings on countless books. This wealth of user-created content offers a goldmine to anyone looking to extract valuable information such as book scores and reader feedback.

This post will guide you through making a program to gather book ratings and comments using Python and the Crawlbase Crawling API. We'll walk you through setting up your workspace, dealing with page-by-page results, and saving the information in an organized way.

Ready to dive in?

Why Scrape Goodreads?

Goodreads is a great resource for book lovers, researchers, and businesses. Scraping it gives you a large body of user-generated data that you can use to analyze book trends, gather reader feedback, or build lists of popular books. Here are a few reasons why scraping Goodreads can be useful:

Rich Data: Goodreads provides ratings, reviews, and comments on books, making it an ideal place to understand the preferences of readers.
Large User Base: With millions of active users, Goodreads offers a massive dataset that is ideal for in-depth analysis.
Market Research: Goodreads data can help businesses understand market trends, popular books, and customer feedback, which is useful for marketing or product development.
Personal Projects: Scraping Goodreads can be handy if you are working on a personal project, like building your own book recommendation engine or analyzing reading habits.

Key Data Points to Extract from Goodreads

When scraping Goodreads, you should focus on the most important data points to get useful insights. Here are the key ones to collect:

Book Title: This is essential for any analysis or reporting.
Author Name: Lets you categorize and organize books and track popular authors.
Average Rating: The Goodreads average rating, calculated from user ratings. It is a key signal of how well a book is received.
Number of Ratings: The total number of ratings, which indicates how many readers have rated the book.
User Comments/Reviews: User reviews are great for qualitative analysis of what readers liked or disliked.
Genres: Goodreads books are often tagged with genres, which helps you categorize books and recommend similar ones.
Publication Year: Useful to track trends over time or compare books published in the same year.
Book Synopsis: The synopsis provides a summary of the book’s plot and gives context to what the book is about.
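
If you plan to collect several of these fields, it can help to decide on a record shape up front. The sketch below is only an illustration: the BookRecord and Review class names and their field names are assumptions made for this example, not something Goodreads or Crawlbase defines, and the values are sample data.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Review:
    # One reader review: who wrote it and what they said
    user: str
    text: str


@dataclass
class BookRecord:
    # Only the title is required; everything else may be missing on a page
    title: str
    author: Optional[str] = None
    average_rating: Optional[float] = None
    ratings_count: Optional[int] = None
    genres: List[str] = field(default_factory=list)
    publication_year: Optional[int] = None
    synopsis: Optional[str] = None
    reviews: List[Review] = field(default_factory=list)


# Sample values for illustration only
book = BookRecord(title="The Great Gatsby", average_rating=3.93)
book.reviews.append(Review(user="Inge", text="It was short."))

# asdict() converts nested dataclasses into plain dicts, ready for json.dump
record = asdict(book)
```

Optional fields default to None or an empty list, so a page that lacks a genre or synopsis still produces a valid record.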

Crawlbase Crawling API for Goodreads Scraping

When scraping dynamic websites like Goodreads, traditional request methods struggle due to JavaScript rendering and complex pagination. This is where the Crawlbase Crawling API comes in handy. It handles JavaScript rendering, paginated content, and captchas so Goodreads scraping is smoother.

Why Use Crawlbase for Goodreads Scraping?

JavaScript Rendering: Crawlbase handles the JavaScript Goodreads uses to display ratings, comments and other dynamic content.
Effortless Pagination: Parameters such as css_click_selector let the API click pagination buttons for you, so you don't have to drive a browser yourself.
Prevention Against Blocks: Crawlbase manages proxies and captchas for you, reducing the risk of being blocked or detected.

Crawlbase Python Library

Crawlbase has a Python library that makes web scraping a lot easier. This library requires an access token to authenticate. You can get a token after creating an account on Crawlbase.

Here’s an example function demonstrating how to use the Crawlbase Crawling API to send requests:

from crawlbase import CrawlingAPI

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_TOKEN' })

def make_crawlbase_request(url):
    response = crawling_api.get(url)

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        return html_content
    else:
        print(f"Failed to fetch the page. Crawlbase status code: {response['headers']['pc_status']}")
        return None
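
Since a request function like the one above returns None on failure, you may want to retry a few times before giving up. The following is a minimal sketch, not part of the Crawlbase library: fetch_with_retries, its parameters, and the delay values are all assumptions for illustration. It accepts any function that takes a URL and returns HTML or None.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay_seconds=2.0):
    """Call fetch(url) until it returns HTML or the attempts run out.

    `fetch` is any callable that returns a string on success and None on
    failure, such as the make_crawlbase_request function shown earlier.
    """
    for attempt in range(1, attempts + 1):
        html = fetch(url)
        if html is not None:
            return html
        # Wait a little longer after each failed attempt, except the last one
        if attempt < attempts:
            time.sleep(delay_seconds * attempt)
    return None


# Hypothetical usage:
# html = fetch_with_retries(make_crawlbase_request, 'https://www.goodreads.com/...')
```

Passing the fetch function in as an argument keeps the retry logic separate from the API call, which also makes it easy to try out without spending API requests.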

Note: Crawlbase offers two types of tokens:

Normal Token for static sites.
JavaScript (JS) Token for dynamic or browser-based requests.

For scraping dynamic sites like Goodreads, you’ll need the JS Token. Crawlbase provides 1,000 free requests to get you started, and no credit card is required for this trial. For more details, check out the Crawlbase Crawling API documentation.

Setting Up Your Python Environment

Before scraping Goodreads for book ratings and comments, you need to set up your Python environment properly. Here’s a quick guide to get started.

Installing Python and Required Libraries

Download Python: Go to the Python website and download the current version for your OS. During installation, remember to add Python to the system PATH.
Verify the Installation: Check that Python was installed successfully by running the following command in your terminal or command prompt:

python --version

Install Libraries: Use pip to install the required libraries: crawlbase, to make HTTP requests through the Crawlbase Crawling API, and beautifulsoup4, which provides BeautifulSoup (imported from bs4) to parse web pages:

pip install crawlbase
pip install beautifulsoup4

Choosing an IDE

A good IDE simplifies your coding. Below are some of the popular ones:

VS Code: Simple and lightweight, multi-purpose, free with Python extensions.
PyCharm: A robust Python IDE with many built-in tools for professional development.
Jupyter Notebooks: Good for running code interactively, especially for data projects.

With your environment ready, you can now move on to scraping Goodreads.

Scraping Goodreads for Book Ratings and Comments

When scraping book ratings and comments from Goodreads, keep in mind that the content changes constantly. Comments and reviews are loaded asynchronously, and pagination is done through buttons. This section describes how to get this information and handle pagination with Crawlbase, using a JS Token and the css_click_selector parameter for button navigation.

Inspecting the HTML for Selectors

First, inspect the HTML of the Goodreads page you want to scrape. For example, to scrape reviews for The Great Gatsby, use the URL:

https://www.goodreads.com/book/show/4671.The_Great_Gatsby/reviews

Open the developer tools in your browser and navigate to this URL.

Here are some key selectors to focus on:

Book Title: Found in an h1 tag with class H1Title, specifically in an anchor tag with data-testid="title".
Ratings: Located in a div with class RatingStatistics, with the value in a span tag of class RatingStars (using the aria-label attribute).
Reviews: Each review is an article with class ReviewCard inside a div with class ReviewsList. Each review includes:
- User's name in a div with data-testid="name".
- Review text in a section with class ReviewText, containing a span with class Formatted.
Load More Button: The "Show More Reviews" button in the review section for pagination, identified by button:has(span[data-testid="loadMore"]).

Writing the Goodreads Scraper for Ratings and Comments

The Crawlbase Crawling API provides multiple parameters that you can pass with each request. With Crawlbase’s JS Token, you can handle dynamic content loading on Goodreads. The ajax_wait and page_wait parameters give the page time to load.

Here’s a Python script to scrape Goodreads for book details, ratings, and comments using Crawlbase Crawling API.

from crawlbase import CrawlingAPI
import json
from bs4 import BeautifulSoup

# Initialize Crawlbase API with JS Token
crawling_api = CrawlingAPI({ 'token': 'CRAWLBASE_JS_TOKEN' })

# Function to fetch and process Goodreads book details and reviews
def scrape_goodreads_reviews(base_url):
    page_data = {}  # Stays empty if the request fails

    # Fetch initial page and reviews
    response = crawling_api.get(base_url, {
        'ajax_wait': 'true',
        'page_wait': '5000'
    })

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        page_data = extract_book_details(html_content)

    return page_data

# Function to extract the book title, rating, and reviews from the page
def extract_book_details(html):
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.select_one('h1.H1Title a[data-testid="title"]').text.strip()
    rating = soup.select_one('div.RatingStatistics span.RatingStars')['aria-label']

    reviews = []
    for review_div in soup.select('div.ReviewsList article.ReviewCard'):
        user = review_div.select_one('div[data-testid="name"]').text.strip()
        review_text = review_div.select_one('section.ReviewText span.Formatted').text.strip()
        reviews.append({'user': user, 'review': review_text})

    return {'title': title, 'rating': rating, 'reviews': reviews}
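
One thing to watch for: select_one returns None when a selector does not match, so calling .text on the result raises an AttributeError if Goodreads changes its markup or a review lacks a field. A small pair of helpers can guard against that. This is a sketch only; text_or_none and attr_or_none are hypothetical names introduced here, not BeautifulSoup functions. They rely on the get_text() and get() methods that BeautifulSoup tags provide.

```python
def text_or_none(node):
    """Return the stripped text of a parsed tag, or None if nothing matched."""
    if node is None:
        return None
    return node.get_text().strip()


def attr_or_none(node, name):
    """Return an attribute value from a parsed tag, or None if it is absent."""
    if node is None:
        return None
    return node.get(name)


# Hypothetical usage inside extract_book_details:
# title = text_or_none(soup.select_one('h1.H1Title a[data-testid="title"]'))
# rating = attr_or_none(soup.select_one('div.RatingStatistics span.RatingStars'), 'aria-label')
```

With these helpers, a missing element shows up as None in your output instead of stopping the whole scrape.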

Handling Pagination

Goodreads uses a button-based pagination system to load more reviews. You can use Crawlbase's css_click_selector parameter to simulate clicking the "Show More Reviews" button so that additional reviews are loaded before the page is returned. This helps you collect more reviews than the initial page load provides.

Here’s how the pagination can be handled:

def scrape_goodreads_reviews_with_pagination(base_url):
    page_data = {}  # Stays empty if the request fails

    # Fetch initial page and reviews
    response = crawling_api.get(base_url, {
        'ajax_wait': 'true',
        'page_wait': '5000',
        'css_click_selector': 'button:has(span[data-testid="loadMore"])'
    })

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        page_data = extract_book_details(html_content)

    return page_data
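
If you end up combining review lists from more than one request (for example, one fetched with the button click and one without), the same review can appear twice. The sketch below shows one way to merge such lists. merge_reviews is a hypothetical helper, and it assumes that a review is uniquely identified by its user and review text, which is an approximation rather than a Goodreads guarantee.

```python
def merge_reviews(*batches):
    """Combine lists of review dicts, keeping the first copy of each review."""
    seen = set()
    merged = []
    for batch in batches:
        for review in batch:
            # Assumption: the same user posting the same text is a duplicate
            key = (review.get('user'), review.get('review'))
            if key not in seen:
                seen.add(key)
                merged.append(review)
    return merged


first = [{'user': 'Inge', 'review': 'It was short.'}]
second = [
    {'user': 'Inge', 'review': 'It was short.'},
    {'user': 'Kemper', 'review': 'Jay Gatsby, you poor doomed bastard.'},
]
all_reviews = merge_reviews(first, second)
```

The order of first appearance is preserved, so reviews stay in the order Goodreads displayed them.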

Storing Data in a JSON File

After extracting the book details and reviews, you can write the scraped data to a JSON file. This format is well suited to structured data and is easy to process later.

Here’s how to save the data:

# Function to save scraped reviews to a JSON file
def save_reviews_to_json(data, filename='goodreads_reviews.json'):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

# Example usage
book_reviews = scrape_goodreads_reviews_with_pagination('https://www.goodreads.com/book/show/4671.The_Great_Gatsby/reviews')
save_reviews_to_json(book_reviews)
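
If you later want to open the reviews in a spreadsheet, you can flatten the saved JSON into CSV using only the standard library. This is a minimal sketch: reviews_json_to_csv is a hypothetical helper, and it assumes the file has the title, rating, and reviews keys produced by the scraper in this post.

```python
import csv
import json


def reviews_json_to_csv(json_path, csv_path):
    """Write one CSV row per review and return the number of reviews written."""
    with open(json_path, encoding='utf-8') as f:
        data = json.load(f)

    # A failed scrape may have saved an empty structure instead of a book dict
    reviews = data.get('reviews', []) if isinstance(data, dict) else []

    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'rating', 'user', 'review'])
        for review in reviews:
            writer.writerow([
                data.get('title'),
                data.get('rating'),
                review.get('user'),
                review.get('review'),
            ])
    return len(reviews)


# Hypothetical usage:
# reviews_json_to_csv('goodreads_reviews.json', 'goodreads_reviews.csv')
```

The book title and rating are repeated on every row so each line of the CSV can be read on its own.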

Complete Code Example

Here is the complete code that scrapes Goodreads for book ratings and reviews, handles button-based pagination, and saves the data in a JSON file:

from crawlbase import CrawlingAPI
import json
from bs4 import BeautifulSoup

# Initialize Crawlbase API with JS Token
crawling_api = CrawlingAPI({ 'token': 'CRAWLBASE_JS_TOKEN' })

# Function to extract book details and reviews from the HTML content
def extract_book_details(html):
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.select_one('h1.H1Title a[data-testid="title"]').text.strip()
    rating = soup.select_one('div.RatingStatistics span.RatingStars')['aria-label']

    reviews = []
    for review_div in soup.select('div.ReviewsList article.ReviewCard'):
        user = review_div.select_one('div[data-testid="name"]').text.strip()
        review_text = review_div.select_one('section.ReviewText span.Formatted').text.strip()
        reviews.append({'user': user, 'review': review_text})

    return {'title': title, 'rating': rating, 'reviews': reviews}

# Function to scrape Goodreads with pagination
def scrape_goodreads_reviews_with_pagination(base_url):
    page_data = {}  # Stays empty if the request fails

    # Fetch initial page and reviews
    response = crawling_api.get(base_url, {
        'ajax_wait': 'true',
        'page_wait': '5000',
        'css_click_selector': 'button:has(span[data-testid="loadMore"])'
    })

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        page_data = extract_book_details(html_content)

    return page_data

# Function to save the reviews in JSON format
def save_reviews_to_json(data, filename='goodreads_reviews.json'):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

# Example usage
book_reviews = scrape_goodreads_reviews_with_pagination('https://www.goodreads.com/book/show/4671.The_Great_Gatsby/reviews')
save_reviews_to_json(book_reviews)

By using Crawlbase’s JS Token and handling button-based pagination, this scraper efficiently extracts Goodreads book ratings and reviews and stores them in a usable format.

Example Output:

{
    "title": "The Great Gatsby",
    "rating": "Rating 3.93 out of 5",
    "reviews": [
        {
            "user": "Alex",
            "review": "The Great Gatsby is your neighbor you're best friends with until you find out he's a drug dealer. It charms you with some of the most elegant English prose ever published, making it difficult to discuss the novel without the urge to stammer awestruck about its beauty. It would be evidence enough to argue that F. Scott Fitzgerald was superhuman, if it wasn't for the fact that we know he also wrote This Side of Paradise.But despite its magic, the rhetoric is just that, and it is a cruel facade. Behind the stunning glitter lies a story with all the discontent and intensity of the early Metallica albums. At its heart, The Great Gatsby throws the very nature of our desires into a harsh, shocking light. There may never be a character who so epitomizes tragically misplaced devotion as Jay Gatsby, and Daisy, his devotee, plays her part with perfect, innocent malevolence. Gatsby's competition, Tom Buchanan, stands aside watching, taunting and provoking with piercing vocal jabs and the constant boast of his enviable physique. The three jostle for position in an epic love triangle that lays waste to countless innocent victims, as well as both Eggs of Long Island. Every jab, hook, and uppercut is relayed by the instantly likable narrator Nick Carraway, seemingly the only voice of reason amongst all the chaos. But when those boats are finally borne back ceaselessly by the current, no one is left afloat. It is an ethical massacre, and Fitzgerald spares no lives; there is perhaps not a single character of any significance worthy even of a Sportsmanship Award from the Boys and Girls Club.In a word, The Great Gatsby is about deception; Fitzgerald tints our glasses rosy with gorgeous prose and a narrator you want so much to trust, but leaves the lenses just translucent enough for us to see that Gatsby is getting the same treatment. And if Gatsby represents the truth of the American Dream, it means trouble for us all. \nConsider it the most pleasant insult you'll ever receive."
        },
        {
            "user": "Lisa of Troy",
            "review": "Fitzgerald, you have ruined me.Fitzgerald can set a scene so perfectly, flawlessly. He paints a world of magic and introduces one of the greatest characters of all time, Jay Gatsby. Gatsby is the embodiment of hope, and no one can dissuade him from his dreams. Have you ever had a dream that carried you to heights you could never have dreamed otherwise? When Gatsby is reunited with Daisy Buchanan, he fills the space to the brim with flowers, creating a living dream. How is anyone supposed to compete with that?The Great Gatsby perfectly makes use of a narrator, Nick. Why is Gatsby so great? Because Nick tells us. If Gatsby told us, we would just think that he is a braggard, the least humble person in the world. This book is wildly addictive, so intricate yet perfectly woven together, a brilliant literary masterpiece. I have to keep going back to reconnect with Jay Gatsby, a naïve but beautiful and charming hope, perfectly imperfect, a relentless dreamer.2025 Reading ScheduleJan\tA Town Like AliceFeb\tBirdsongMar\tCaptain Corelli's Mandolin - Louis De BerniereApr\tWar and PeaceMay\tThe Woman in WhiteJun\tAtonementJul\tThe Shadow of the WindAug\tJude the ObscureSep\tUlyssesOct\tVanity FairNov\tA Fine BalanceDec\tGerminalConnect With Me!Blog Twitter BookTube Facebook Insta My Bookstore at Pango"
        },
        {
            "user": "Kemper",
            "review": "Jay Gatsby, you poor doomed bastard. You were ahead of your time. If you would have pulled your scam after the invention of reality TV, you would have been a huge star on a show like The Bachelor and a dozen shameless Daisy-types would have thrown themselves at you. Mass media and modern fame would have embraced the way you tried to push your way into a social circle you didn’t belong to in an effort to fulfill a fool’s dream as your entire existence became a lie and you desperately sought to rewrite history to an ending you wanted. You had a talent for it, Jay, but a modern PR expert would have made you bigger than Kate Gosselin. Your knack for self-promotion and over the top displays of wealth to try and buy respectability would have fit right in these days. I can just about see you on a red carpet with Paris Hilton. And the ending would have been different. No aftermath for rich folks these days. Lawyers and pay-off money would have quietly settled the matter. No harm, no foul. But then you’d have realized how worthless Daisy really was at some point. I’m sure you couldn’t have dealt with that. So maybe it is better that your story happened in the Jazz Age where you could keep your illusions intact to the bitter end.The greatest American novel? I don’t know if there is such an animal. But I think you'd have to include this one in the conversation."
        },
        {
            "user": "Inge",
            "review": "There was one thing I really liked about The Great Gatsby.It was short."
        },
        {
            "user": "may ➹",
            "review": "the only thing I got from this is that Nick is gay2.5"
        },
        .... more
    ]
}
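
Note that the rating field in this output is the raw aria-label text rather than a number. If you need a numeric value for analysis, you can parse it. The sketch below assumes the label keeps the "Rating X out of Y" shape seen in the example output; parse_rating is a hypothetical helper name.

```python
import re


def parse_rating(label):
    """Turn a label like 'Rating 3.93 out of 5' into (3.93, 5.0).

    Returns None if the label is missing or does not match the expected shape.
    """
    match = re.search(r'(\d+(?:\.\d+)?)\s+out of\s+(\d+(?:\.\d+)?)', label or '')
    if not match:
        return None
    return float(match.group(1)), float(match.group(2))


parsed = parse_rating('Rating 3.93 out of 5')  # (3.93, 5.0)
```

Returning None for unexpected input lets you spot markup changes instead of silently storing a wrong number.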

Final Thoughts

Scraping Goodreads for book ratings and comments gives you valuable insights from readers. Using Python with the Crawlbase Crawling API makes this easier, especially when dealing with dynamic content and button-based pagination on Goodreads. With us handling the technical complexities, you can focus on extracting the data.

Follow the steps in this guide and you’ll be able to scrape reviews and ratings and store the data in a structured format for analysis. If you want to do more web scraping, check out our guides on scraping other key websites.

📜 How to Scrape Monster.com
📜 How to Scrape Groupon
📜 How to Scrape TechCrunch
📜 How to Scrape X.com Tweet Pages
📜 How to Scrape Clutch.co

If you have questions or want to give feedback, our support team can help with web scraping. Happy scraping!

Frequently Asked Questions

Q. What is the best way to scrape Goodreads for book ratings and comments?

The best way to scrape Goodreads is to use Python with the Crawlbase Crawling API. This combination lets you scrape dynamic content like book ratings and comments. The Crawlbase Crawling API handles JavaScript rendering and button-based pagination, so you can retrieve content that a plain HTTP request would miss.

Q. What data points can I extract when scraping Goodreads?

When scraping Goodreads, you can extract the following data points: book titles, authors, average ratings, individual user ratings, comments, and total review counts. This data gives you insight into how readers are receiving books and helps you make informed decisions for book recommendations or analysis.

Q. How does pagination work when scraping reviews from Goodreads?

Goodreads uses button-based pagination to load more reviews. With the Crawlbase Crawling API, you can click the "Show More Reviews" button programmatically by setting the css_click_selector parameter in the API call. This loads additional reviews without manually navigating the site.
