Amazon product reviews are one of the most valuable data sources for e-commerce analytics, sentiment analysis, and competitive research. With millions of reviews posted daily, extracting this data at scale requires the right approach.
In this guide, I'll show you how to collect Amazon product reviews programmatically and run sentiment analysis on the results.
Why Scrape Amazon Reviews?
- Product research: Understand what customers love or hate before launching a competing product
- Sentiment analysis: Track brand perception over time across thousands of SKUs
- Competitive intelligence: Monitor competitor product reception in real-time
- Quality monitoring: Detect emerging product defects from review patterns
Setting Up Your Environment
import requests
from bs4 import BeautifulSoup
import pandas as pd
from textblob import TextBlob
import time
# Configure your session with browser-like headers
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
})
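A single static User-Agent is easy to fingerprint across thousands of requests. One common mitigation is rotating through a small pool of realistic browser strings per request — a minimal sketch (the strings below are illustrative examples, not a curated production pool):

```python
import random

# Illustrative pool of desktop browser User-Agent strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]

def random_headers():
    """Build per-request headers with a randomly chosen User-Agent."""
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept-Language': 'en-US,en;q=0.9',
    }
```

Passing `headers=random_headers()` to an individual `session.get()` call overrides the session defaults for just that request.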
Extracting Reviews from a Product Page
def scrape_amazon_reviews(asin, pages=5):
    """Scrape reviews for a given Amazon product ASIN."""
    all_reviews = []
    for page in range(1, pages + 1):
        url = f"https://www.amazon.com/product-reviews/{asin}"
        params = {
            'pageNumber': page,
            'sortBy': 'recent',
            'reviewerType': 'all_reviews'
        }
        response = session.get(url, params=params)
        soup = BeautifulSoup(response.text, 'html.parser')
        review_elements = soup.select('[data-hook="review"]')
        for review in review_elements:
            title = review.select_one('[data-hook="review-title"]')
            body = review.select_one('[data-hook="review-body"]')
            rating = review.select_one('[data-hook="review-star-rating"]')
            date = review.select_one('[data-hook="review-date"]')
            all_reviews.append({
                'title': title.get_text(strip=True) if title else '',
                'body': body.get_text(strip=True) if body else '',
                'rating': rating.get_text(strip=True) if rating else '',
                'date': date.get_text(strip=True) if date else '',
            })
        time.sleep(2)  # Respect rate limits
    return all_reviews
# Example usage
reviews = scrape_amazon_reviews('B09V3KXJPB', pages=3)
print(f"Collected {len(reviews)} reviews")
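Note that the scraped rating comes back as display text like "4.0 out of 5 stars", not a number. A small helper can extract the numeric value before analysis — a sketch that assumes the en-US wording (the exact text varies by locale):

```python
import re

def parse_rating(raw):
    """Extract the leading numeric star value from text like '4.0 out of 5 stars'.

    Returns a float, or None when no number is found.
    """
    match = re.search(r'(\d+(?:\.\d+)?)', raw or '')
    return float(match.group(1)) if match else None
```

Apply it to each review dict (e.g. `review['rating'] = parse_rating(review['rating'])`) so ratings can be averaged and compared later.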
Running Sentiment Analysis
Once you have reviews, analyze the sentiment:
def analyze_sentiment(reviews):
    """Add sentiment scores to review data."""
    for review in reviews:
        blob = TextBlob(review['body'])
        review['polarity'] = blob.sentiment.polarity
        review['subjectivity'] = blob.sentiment.subjectivity
        review['sentiment'] = (
            'positive' if blob.sentiment.polarity > 0.1
            else 'negative' if blob.sentiment.polarity < -0.1
            else 'neutral'
        )
    return reviews
analyzed = analyze_sentiment(reviews)
df = pd.DataFrame(analyzed)
# Summary statistics
print(f"Positive: {len(df[df.sentiment == 'positive'])}")
print(f"Negative: {len(df[df.sentiment == 'negative'])}")
print(f"Neutral: {len(df[df.sentiment == 'neutral'])}")
print(f"Average polarity: {df.polarity.mean():.3f}")
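A useful sanity check is cross-tabulating star ratings against text sentiment: reviews where the two disagree (say, five stars paired with negative text) are often sarcasm, mis-clicks, or parsing errors worth inspecting. A sketch, assuming ratings have already been converted to numbers:

```python
import pandas as pd

def sentiment_rating_crosstab(df):
    """Count reviews by (star rating, text sentiment) to surface mismatches."""
    return pd.crosstab(df['rating'], df['sentiment'])

# Hypothetical example data for illustration
sample = pd.DataFrame({
    'rating': [5.0, 5.0, 1.0, 3.0],
    'sentiment': ['positive', 'negative', 'negative', 'neutral'],
})
table = sentiment_rating_crosstab(sample)
```

Any nonzero count in the (high rating, negative) or (low rating, positive) cells is a good candidate for manual review.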
Scaling Up with Cloud Scrapers
For serious data collection across hundreds of ASINs, DIY scraping hits limits fast — CAPTCHAs, IP blocks, and dynamic rendering. Cloud-based scrapers solve this.
For review extraction specifically, ready-made scrapers on platforms like Apify (its Trustpilot Scraper is one example of the genre) show how cloud platforms handle the job at scale. The same pattern applies regardless of the target site: define your inputs, run in the cloud, get structured JSON output.
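That pattern usually boils down to one authenticated HTTP call. As a sketch of how such a run might be assembled (the actor ID, token, and input field names below are hypothetical placeholders — check your platform's documentation for the real schema):

```python
import json

APIFY_TOKEN = 'your-api-token'          # hypothetical placeholder
ACTOR_ID = 'some-user~amazon-reviews'   # hypothetical actor ID

def build_run_request(asins, max_reviews=100):
    """Assemble the URL and JSON payload for a synchronous scraper run.

    Request construction only -- actually sending it requires a valid token.
    """
    url = (f"https://api.apify.com/v2/acts/{ACTOR_ID}"
           f"/run-sync-get-dataset-items?token={APIFY_TOKEN}")
    payload = json.dumps({'asins': asins, 'maxReviews': max_reviews})
    return url, payload
```

The response from such an endpoint is the dataset itself — structured JSON records ready to feed into the sentiment pipeline above, with no HTML parsing on your side.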
Handling Common Challenges
Anti-Bot Detection
Amazon is aggressive with bot detection. Best practices:
import random

PROXIES = [
    # Use a rotating proxy service for production
    # Services like ThorData provide residential proxies
    # that handle rotation automatically
]

def get_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            proxy = random.choice(PROXIES) if PROXIES else None
            response = session.get(
                url,
                proxies={'http': proxy, 'https': proxy} if proxy else None,
                timeout=30
            )
            if response.status_code == 200:
                return response
            time.sleep(random.uniform(2, 5))
        except requests.RequestException:
            time.sleep(random.uniform(5, 10))
    return None
Data Storage
For large-scale collection, stream results to a database:
import sqlite3
def store_reviews(reviews, db_path='amazon_reviews.db'):
    conn = sqlite3.connect(db_path)
    df = pd.DataFrame(reviews)
    df.to_sql('reviews', conn, if_exists='append', index=False)
    conn.close()
    print(f"Stored {len(reviews)} reviews")
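One caveat: appending blindly will duplicate reviews when runs overlap (e.g. re-scraping the same pages daily). A defensive sketch — assuming title + body + date is unique enough to key on — uses a UNIQUE constraint with INSERT OR IGNORE so repeats are silently skipped:

```python
import sqlite3

def store_reviews_dedup(reviews, db_path='amazon_reviews.db'):
    """Insert reviews, silently skipping rows already present.

    Returns the number of rows actually inserted.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS reviews (
            title TEXT, body TEXT, rating TEXT, date TEXT,
            UNIQUE(title, body, date)
        )
    """)
    conn.executemany(
        "INSERT OR IGNORE INTO reviews (title, body, rating, date) "
        "VALUES (:title, :body, :rating, :date)",
        reviews
    )
    conn.commit()
    inserted = conn.total_changes
    conn.close()
    return inserted
```

For production volumes you would likely swap SQLite for Postgres (which supports the equivalent `ON CONFLICT DO NOTHING`), but the dedup-at-insert pattern carries over unchanged.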
Proxy Services for Scale
When scraping Amazon at scale, reliable proxies are essential. ThorData provides residential rotating proxies optimized for e-commerce sites, handling session management and geographic targeting automatically.
Conclusion
Amazon review scraping at scale requires a combination of proper request handling, proxy rotation, and structured data storage. Start with the code above for small-scale collection, then move to cloud-based solutions when you need to monitor hundreds or thousands of products continuously.
The key is building a pipeline: collect → clean → analyze → alert. Sentiment shifts in reviews often predict product issues or competitor opportunities weeks before they show up in sales data.
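The alert stage can start as something very simple: a rolling-average threshold on polarity. A minimal sketch (window size and threshold here are arbitrary starting points to tune against your own data):

```python
def detect_sentiment_drop(polarities, window=5, threshold=-0.1):
    """Return indices where the rolling mean polarity falls below threshold.

    `polarities` is a list of per-review polarity scores in time order.
    """
    alerts = []
    for i in range(window - 1, len(polarities)):
        window_mean = sum(polarities[i - window + 1:i + 1]) / window
        if window_mean < threshold:
            alerts.append(i)
    return alerts
```

Wire the returned indices to whatever notification channel you use (email, Slack webhook), and you have the collect → clean → analyze → alert loop end to end.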