agenthustler
Amazon Product Reviews: How to Scrape Them at Scale in 2026

Amazon product reviews are one of the most valuable data sources for e-commerce analytics, sentiment analysis, and competitive research. With millions of reviews posted daily, extracting this data at scale requires the right approach.

In this guide, I'll show you how to collect Amazon product reviews programmatically and run sentiment analysis on the results.

Why Scrape Amazon Reviews?

  • Product research: Understand what customers love or hate before launching a competing product
  • Sentiment analysis: Track brand perception over time across thousands of SKUs
  • Competitive intelligence: Monitor competitor product reception in real time
  • Quality monitoring: Detect emerging product defects from review patterns

Setting Up Your Environment

import requests
from bs4 import BeautifulSoup
import pandas as pd
from textblob import TextBlob
import time
import json

# Configure your session
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
})

Extracting Reviews from a Product Page

def scrape_amazon_reviews(asin, pages=5):
    """Scrape reviews for a given Amazon product ASIN."""
    all_reviews = []

    for page in range(1, pages + 1):
        url = f"https://www.amazon.com/product-reviews/{asin}"
        params = {
            'pageNumber': page,
            'sortBy': 'recent',
            'reviewerType': 'all_reviews'
        }

        response = session.get(url, params=params)
        if response.status_code != 200:
            break  # blocked or rate-limited; stop rather than parse an error page
        soup = BeautifulSoup(response.text, 'html.parser')

        review_elements = soup.select('[data-hook="review"]')
        if not review_elements:
            break  # no reviews on this page, so later pages will be empty too
        for review in review_elements:
            title = review.select_one('[data-hook="review-title"]')
            body = review.select_one('[data-hook="review-body"]')
            rating = review.select_one('[data-hook="review-star-rating"]')
            date = review.select_one('[data-hook="review-date"]')

            all_reviews.append({
                'title': title.get_text(strip=True) if title else '',
                'body': body.get_text(strip=True) if body else '',
                'rating': rating.get_text(strip=True) if rating else '',
                'date': date.get_text(strip=True) if date else '',
            })

        time.sleep(2)  # Respect rate limits

    return all_reviews

# Example usage
reviews = scrape_amazon_reviews('B09V3KXJPB', pages=3)
print(f"Collected {len(reviews)} reviews")
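Note that the rating comes back as a display string such as "4.0 out of 5 stars". A small helper can normalize it to a float for analysis; this is a sketch, and the exact wording can vary by locale:

```python
import re

def parse_rating(rating_text):
    """Extract the numeric star value from text like '4.0 out of 5 stars'."""
    match = re.search(r'(\d+(?:\.\d+)?)', rating_text or '')
    return float(match.group(1)) if match else None
```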

Running Sentiment Analysis

Once you have reviews, analyze the sentiment:

def analyze_sentiment(reviews):
    """Add sentiment scores to review data."""
    for review in reviews:
        blob = TextBlob(review['body'])
        review['polarity'] = blob.sentiment.polarity
        review['subjectivity'] = blob.sentiment.subjectivity
        review['sentiment'] = (
            'positive' if blob.sentiment.polarity > 0.1
            else 'negative' if blob.sentiment.polarity < -0.1
            else 'neutral'
        )
    return reviews

analyzed = analyze_sentiment(reviews)
df = pd.DataFrame(analyzed)

# Summary statistics
print(f"Positive: {len(df[df.sentiment == 'positive'])}")
print(f"Negative: {len(df[df.sentiment == 'negative'])}")
print(f"Neutral: {len(df[df.sentiment == 'neutral'])}")
print(f"Average polarity: {df.polarity.mean():.3f}")
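Beyond point-in-time counts, it is often the trend that matters. One way to surface it, shown here with a few synthetic rows in place of real scraped data, is to average polarity per calendar month with pandas:

```python
import pandas as pd

# Synthetic example rows; in practice, use the analyzed DataFrame built above
df = pd.DataFrame({
    'date': pd.to_datetime(['2026-01-05', '2026-01-12', '2026-02-02', '2026-02-20']),
    'polarity': [0.4, -0.2, 0.1, 0.5],
})

# Average polarity per calendar month: a simple trend signal
monthly = df.groupby(df['date'].dt.to_period('M'))['polarity'].mean()
print(monthly)
```

A rising or falling monthly average is the kind of signal worth charting alongside sales data.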

Scaling Up with Cloud Scrapers

For serious data collection across hundreds of ASINs, DIY scraping hits limits fast: CAPTCHAs, IP blocks, and dynamically rendered pages. Cloud-based scrapers solve these problems.

Tools like Apify's Trustpilot Scraper show how cloud platforms handle review extraction at scale, and the same pattern applies to Amazon reviews: define your inputs, run in the cloud, and get structured JSON output.
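The input to such a cloud run is usually just a JSON document. Here is a minimal sketch built in Python; the field names are illustrative placeholders, not any real platform's schema:

```python
import json

# Hypothetical run configuration: field names are placeholders,
# not any real platform's schema
run_input = {
    'asins': ['B09V3KXJPB'],
    'maxReviewsPerProduct': 500,
    'sortBy': 'recent',
}

config = json.dumps(run_input, indent=2)
print(config)
```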

Handling Common Challenges

Anti-Bot Detection

Amazon is aggressive with bot detection. Best practices:

import random

PROXIES = [
    # Use a rotating proxy service for production
    # Services like ThorData provide residential proxies
    # that handle rotation automatically
]

def get_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            proxy = random.choice(PROXIES) if PROXIES else None
            response = session.get(
                url,
                proxies={'http': proxy, 'https': proxy} if proxy else None,
                timeout=30
            )
            if response.status_code == 200:
                return response
            time.sleep(random.uniform(2, 5))
        except requests.RequestException:
            time.sleep(random.uniform(5, 10))
    return None
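Proxy rotation works best when paired with header rotation, since a fixed User-Agent across thousands of requests is an easy fingerprint. A minimal sketch, with a small illustrative pool of desktop User-Agent strings:

```python
import random

# Small illustrative pool of desktop User-Agent strings to rotate per request
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]

def rotate_user_agent(headers):
    """Set a randomly chosen User-Agent on a headers mapping and return it."""
    headers['User-Agent'] = random.choice(USER_AGENTS)
    return headers['User-Agent']
```

Calling `rotate_user_agent(session.headers)` before each `get_with_retry` varies the fingerprint alongside the proxy.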

Data Storage

For large-scale collection, stream results to a database:

import sqlite3

def store_reviews(reviews, db_path='amazon_reviews.db'):
    conn = sqlite3.connect(db_path)
    df = pd.DataFrame(reviews)
    df.to_sql('reviews', conn, if_exists='append', index=False)
    conn.close()
    print(f"Stored {len(reviews)} reviews")
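One caveat with repeated `if_exists='append'` runs: re-scraping the same pages inserts duplicate rows. A sketch of one way around this, using a SQLite uniqueness constraint with `INSERT OR IGNORE` (the column choice for uniqueness is an assumption):

```python
import sqlite3

def init_db(db_path='amazon_reviews.db'):
    """Create the reviews table with a uniqueness constraint on (title, body, date)."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS reviews (
            title TEXT, body TEXT, rating TEXT, date TEXT,
            UNIQUE(title, body, date)
        )
    """)
    conn.commit()
    return conn

def store_unique(conn, reviews):
    """Insert review dicts, silently skipping rows already in the table."""
    conn.executemany(
        "INSERT OR IGNORE INTO reviews (title, body, rating, date) "
        "VALUES (:title, :body, :rating, :date)",
        reviews,
    )
    conn.commit()
```

Running the same batch twice leaves the row count unchanged, which keeps downstream sentiment statistics honest.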

Proxy Services for Scale

When scraping Amazon at scale, reliable proxies are essential. ThorData provides residential rotating proxies optimized for e-commerce sites, handling session management and geographic targeting automatically.

Conclusion

Amazon review scraping at scale requires a combination of proper request handling, proxy rotation, and structured data storage. Start with the code above for small-scale collection, then move to cloud-based solutions when you need to monitor hundreds or thousands of products continuously.

The key is building a pipeline: collect → clean → analyze → alert. Sentiment shifts in reviews often predict product issues or competitor opportunities weeks before they show up in sales data.
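The "alert" step of that pipeline can start very simple. One possible rule, a sketch with an arbitrarily chosen threshold, flags any period-over-period drop in average polarity:

```python
def sentiment_drop_alert(prev_avg, curr_avg, threshold=0.15):
    """Return True when average polarity dropped by at least `threshold`
    between two periods, the 'alert' step of the pipeline."""
    return (prev_avg - curr_avg) >= threshold
```

Tuning the threshold against historical data for your own products is what makes the rule useful in practice.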
