
Kingsley Ubah

Originally published at letsusetech.com

How to Scrape Amazon Product Reviews Behind a Login

Amazon is arguably the largest e-commerce website in the world. With a massive product catalog and a vast database of product reviews, it can be a valuable data source for e-commerce businesses and market researchers looking to make informed decisions.

Scraping Amazon product reviews can give you valuable insights into product quality, customer satisfaction, and market trends. However, the review scraping process isn’t always straightforward, especially when a login is required.

This guide will walk you through the steps for scraping Amazon product reviews behind a login. You’ll learn to log in to Amazon, access the product page, retrieve its HTML source code, parse the review data, and export the reviews to CSV.

Are you feeling excited? Let’s begin!

Step 1: Prerequisites and Project Setup

To run the code samples in this tutorial, install Python on your local development machine (Python 3.8 or later is recommended, since recent versions of Selenium require it). If you don’t have it, visit the official Python website and install it.

After installing Python, install Selenium and the undetected_chromedriver library. You’ll use Selenium to access the Amazon page behind the login and retrieve its complete HTML code. The undetected_chromedriver library helps you bypass Amazon’s bot-detection mechanisms.

You need to install both of these tools in a clean, isolated environment. That entails setting up a virtual environment locally and installing the dependencies there. By doing so, you’ll prevent future conflicts among the dependencies in your Python projects. Let’s do that.

Open a terminal and create a new folder. Name the folder anything you want (for this tutorial, it’s product_reviews):

mkdir product_reviews

Change your current directory to the newly created folder:

cd product_reviews

Awesome! Now that you’re inside product_reviews, execute this command to create a virtual environment:

# Create a virtual environment named 'env'
python -m venv env

Replace ‘env’ with your preferred virtual environment name.

If you’re on Windows, execute this to activate the virtual environment:

env\Scripts\activate

If you’re on macOS or Linux, use this instead:

source env/bin/activate

On your terminal prompt, you should see an indication that the virtual environment is active: typically the environment name, (env), prefixed to the prompt.

Now install the dependencies with the following command:

pip install selenium undetected-chromedriver

Below is a screenshot of the process:

Screenshot of my terminal

Now open the folder in any code editor you prefer (this tutorial uses Visual Studio Code). Your directory structure should look like this:

Screenshot of my VS Code Editor

NOTE: You’ll also need an Amazon account to proceed with the rest of this tutorial. If you don’t have an Amazon account, go to their sign-up page and register for one.

Step 2: Access the Public Page

The image below shows the product page you’ll be scraping. From that page, you’ll extract each review’s author name, review text, and date.

Screenshot of the product page on Amazon

Here's the product URL: https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/

Create a new file named script.py in the root folder of your project. The code snippet below shows how to create a Chrome WebDriver instance, configure it, fetch the HTML code from the product URL, and output the HTML on the terminal. The code goes inside script.py:

import undetected_chromedriver as uc
import time

# Configure Chrome options and specify not to run it headless
chromeOptions = uc.ChromeOptions()
chromeOptions.headless = False

# Create an instance of the Chrome WebDriver and enable subprocess support
driver = uc.Chrome(use_subprocess=True, options=chromeOptions)

# The URL of the Amazon product page
product_url = "https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/"

driver.get(product_url)

time.sleep(10) # Wait for a few seconds to ensure the page loads completely

# Extract the full HTML of the page and store it in the 'page_html' variable
page_html = driver.page_source

print(page_html)

driver.close() # Close the WebDriver instance
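
The time.sleep(10) call above is a simple but blunt way to wait for the page: it either wastes time or may not be long enough on a slow connection. If you want something more robust, Selenium’s WebDriverWait can pause until a specific element appears. The sketch below reuses the driver object from the snippet above and waits for an element with the ID productTitle, which is an assumption about the page’s markup; verify the ID in your browser’s inspector and adjust it if needed:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Replaces the time.sleep(10) line above: wait up to 20 seconds for the
# element to appear, then continue immediately once it does.
# 'productTitle' is an assumed element ID; verify it in your browser first.
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.ID, "productTitle"))
)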

Run the script by executing the Python file on your command line (inside your active virtual environment):

python script.py

NOTE: Ensure that your Google Chrome browser is up-to-date before running the above script. Otherwise, you might get an error.
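
If your Chrome version and the driver still get out of sync, undetected_chromedriver also accepts a version_main argument that pins the major Chrome version it targets. This is optional; the value below (120) is only an example and should match the major version shown on chrome://version:

import undetected_chromedriver as uc

# Optionally pin the major Chrome version that undetected_chromedriver targets.
# 120 is only an example; use the major version of your installed Chrome.
driver = uc.Chrome(use_subprocess=True, version_main=120)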

You should see the HTML markup of the Amazon page printed in the terminal.

Screenshot of scraped HTML output

Though scraping public web pages is relatively straightforward, some websites may restrict you from accessing certain content if you don’t log in. In the next section, you’ll get a walk-through of how to scrape pages behind a login using Selenium.

Step 3: Scrape Behind the Login

To sign in to Amazon, you go through two pages: one for your email address and one for your password.

With Selenium, you can sign in to Amazon programmatically, just as you would through the web interface. The process entails opening the login page, targeting the IDs of the email and password fields, entering your credentials into those fields, and then clicking the submit button.

To kick-start the process, you’ll need to get your Amazon login URL from your browser. First, sign out of your Amazon account. Then go to amazon.com and click the “Sign in” button. Once you’re on the login page, copy the URL from the address bar; you’ll need it in the code.

To get the IDs of the HTML elements, right-click the form on the login page, click Inspect, and switch to the Elements tab. Expand the HTML markup until you can see the IDs of the input fields and buttons you need: the email field, the Continue button, the password field, and the Sign-In button.

Screenshot of Amazon page inspected

Screenshot of Amazon page inspected

The code snippet below shows how to log in and access those elements. It uses Selenium to locate the fields where you enter your email and password, submits your login details by clicking the submit button, and finally retrieves the HTML code and prints it to the terminal:

import undetected_chromedriver as uc
import time
from selenium.webdriver.common.by import By

chromeOptions = uc.ChromeOptions()
chromeOptions.headless = False

driver = uc.Chrome(use_subprocess=True, options=chromeOptions)


# Replace the Amazon link below with your login URL
driver.get("https://www.amazon.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=xxxxxxxxxxxx")

time.sleep(5) # Wait for a few seconds to ensure the page loads completely

email = driver.find_element(By.ID, "ap_email")

# Replace the xxx with your Amazon email
email.send_keys("xxxxxxxxx")

driver.find_element(By.ID, "continue").click()

time.sleep(5)

password = driver.find_element(By.ID, "ap_password")


# Replace the xxx with your Amazon password
password.send_keys("xxxxxxxx")

driver.find_element(By.ID, "signInSubmit").click()

time.sleep(10)

# Extract the full HTML of the page and store it in the 'page_html' variable
page_html = driver.page_source

print(page_html)

driver.close() # Close the WebDriver instance

The above code gives the same output as Step 2, but this time you’re logged in to Amazon.

Note: Before pushing your code to a public repository, secure your login details so they never appear in the source code.
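
A minimal way to do that, sketched below, is to read the credentials from environment variables instead of hardcoding them in the script. The variable names AMAZON_EMAIL and AMAZON_PASSWORD are just examples; set them in your shell before running the script:

import os

# Read credentials from environment variables instead of hardcoding them.
# Example (macOS/Linux): export AMAZON_EMAIL="you@example.com"
# Example (Windows):     set AMAZON_EMAIL=you@example.com
amazon_email = os.environ["AMAZON_EMAIL"]
amazon_password = os.environ["AMAZON_PASSWORD"]

# Then pass the values to Selenium instead of literal strings:
# email.send_keys(amazon_email)
# password.send_keys(amazon_password)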

Parse Review Data

If you only need a subset of the retrieved data, parsing lets you extract just the information you need. As before, the process starts with inspecting the HTML to find the class or ID of the element that contains each detail you want to parse.

For this tutorial, you’ll be extracting the author names, review texts, and review dates. You’ll locate these elements using their respective CSS selectors. The image below shows a review author’s name and its corresponding class ‘a-profile-name’:

HTML elements

You’ll do the same to get the class names for the review text and date.

Once you have the class names of those three elements, you can access their text content in your Python code and print them to the console. The code snippet below does that:

import undetected_chromedriver as uc
import time
from selenium.webdriver.common.by import By

chromeOptions = uc.ChromeOptions()
chromeOptions.headless = False

driver = uc.Chrome(use_subprocess=True, options=chromeOptions)

# Replace the Amazon link below with your login URL
driver.get("https://www.amazon.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=xxxxxxxxxxxx")

time.sleep(5) # Wait for a few seconds to ensure the page loads completely

email = driver.find_element(By.ID, "ap_email")

# Replace the xxx with your Amazon email
email.send_keys("xxxxxxxxx")

driver.find_element(By.ID, "continue").click()

time.sleep(5)

password = driver.find_element(By.ID, "ap_password")


# Replace the xxx with your Amazon password
password.send_keys("xxxxxxxx")

driver.find_element(By.ID, "signInSubmit").click()

time.sleep(10)

# Navigate to the Amazon product page
product_url = "https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/"

driver.get(product_url)

time.sleep(10) # Wait for a few seconds to ensure the page loads completely

# Locate and extract review elements
review_elements = driver.find_elements(By.CSS_SELECTOR, '.a-section.review')

for review_element in review_elements:

    # Extract the author name
    author_name = review_element.find_element(By.CLASS_NAME, 'a-profile-name').text

    # Extract review text
    review_text = review_element.find_element(By.CLASS_NAME, 'review-text').text

    # Extract review date
    review_date = review_element.find_element(By.CLASS_NAME, 'review-date').text

    # Print the extracted information
    print("Author: ", author_name)
    print("Review: ", review_text)
    print("Review Date: ", review_date)
    print("\n")

driver.close()

Execute the code by running the Python file:

python script.py

On the console, you should see three pieces of information for each review: the author’s name, the review text, and the review date. Your output should look like the following:

Screenshot of the parsed review data in the terminal

Now that you’ve extracted the information you need, the next step is to present this data in a format that’s easy to share and analyze. Your best bet is the CSV (Comma-Separated Values) format, which stores data in a tabular form. The section below explores how to export reviews to a CSV file.

Step 4: Export Reviews to CSV

Python’s built-in csv module allows you to work with CSV files in your Python script. After importing the module, you open a CSV file and write a header row consisting of “Author”, “Review”, and “Review Date”. Then, for each review retrieved from the product page, you add the author name, review text, and date as a new row.

The code snippet below shows how to do this:

import undetected_chromedriver as uc
import time
import csv
from selenium.webdriver.common.by import By

chromeOptions = uc.ChromeOptions()
chromeOptions.headless = False

driver = uc.Chrome(use_subprocess=True, options=chromeOptions)


# Replace the Amazon link below with your login URL
driver.get("https://www.amazon.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=xxxxxxxxxxxx")

time.sleep(5) # Wait for a few seconds to ensure the page loads completely

email = driver.find_element(By.ID, "ap_email")

# Replace the xxx with your Amazon email
email.send_keys("xxxxxxxxx")

driver.find_element(By.ID, "continue").click()

time.sleep(5)

password = driver.find_element(By.ID, "ap_password")


# Replace the xxx with your Amazon password
password.send_keys("xxxxxxxx")

driver.find_element(By.ID, "signInSubmit").click()

time.sleep(10)

# Navigate to the Amazon product page
product_url = "https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/"

driver.get(product_url)

time.sleep(10) # Wait for a few seconds to ensure the page loads completely

# Locate and extract review elements
review_elements = driver.find_elements(By.CSS_SELECTOR, '.a-section.review')

csv_filename = 'product_reviews.csv'
with open(csv_filename, 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Author', 'Review', 'Review Date'])

    for review_element in review_elements:

        # Extract the author's name
        author_name = review_element.find_element(By.CLASS_NAME, 'a-profile-name').text

        # Extract review text
        review_text = review_element.find_element(By.CLASS_NAME, 'review-text').text

        # Extract review date
        review_date = review_element.find_element(By.CLASS_NAME, 'review-date').text

        # Print the extracted information
        print("Author: ", author_name)
        print("Review: ", review_text)
        print("Review Date: ", review_date)
        print("\n")

        csv_writer.writerow([author_name, review_text, review_date])

driver.close()

When you run this code, it’ll create a CSV file that neatly organizes the scraped Amazon review data. Your output should look like the following:

Screenshot of scraped data in CSV format

You can easily share this CSV file, import it into a data analysis tool, or process it further for deeper insights into the product’s marketability.
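
For example, if you install pandas (pip install pandas; it isn’t part of this tutorial’s setup), a few lines are enough to load the file and take a quick look at the data:

import pandas as pd

# Load the exported reviews into a DataFrame for further analysis.
reviews = pd.read_csv("product_reviews.csv")

print(reviews.head())            # Preview the first few reviews
print(len(reviews), "reviews")   # Total number of exported reviews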

Conclusion

This tutorial walked you through the steps for accessing Amazon review pages and extracting data from them. For websites like Amazon that require a login, you learned how to sign in programmatically using Selenium. You also saw how to parse the returned data and export it to a CSV file.

You can build on this knowledge by exploring more advanced web scraping techniques. One improvement we recommend is automating how you obtain the login URL instead of copying it from your browser. That’ll surely be a fun challenge!
