Product reviews are one of the most valuable datasets on the internet. They tell you what customers actually think — not what marketing says. Whether you're building a sentiment analysis pipeline, monitoring brand reputation, or training an NLP model, scraping reviews at scale is a core competency.
Here's how to extract reviews from the four major platforms in 2026, including the real technical challenges and working solutions.
Use Cases for Review Data
Review scraping isn't just for e-commerce. Here's where the data creates real value:
- Brand monitoring — Track sentiment across platforms in real time. Catch PR issues before they trend.
- Competitive analysis — What do customers love/hate about competing products? Map feature gaps from actual user feedback.
- Product development — Mine thousands of reviews for feature requests and pain points. Better than surveys.
- Lead generation — Identify unhappy customers of competitors (negative reviewers) for targeted outreach.
- Market research — Aggregate ratings across categories to identify underserved markets.
- AI/ML training — Build sentiment classifiers, aspect-based analysis models, or recommendation engines.
Platform Overview
| Platform | Review Volume | Anti-Bot | Data Quality | Best For |
|---|---|---|---|---|
| Amazon | 1B+ reviews | Very High | Verified purchases, rich metadata | Consumer products, e-commerce |
| G2 | 2.5M+ reviews | Medium | Detailed pros/cons, feature ratings | B2B software, SaaS |
| Trustpilot | 300M+ reviews | Medium-High | Company-level, response tracking | Service businesses, D2C |
| Yelp | 265M+ reviews | High | Local business, photos, check-ins | Restaurants, local services |
Scraping Amazon Reviews
Amazon has the largest review corpus but also the most sophisticated bot detection (PerimeterX/HUMAN).
What You Get
Each Amazon review includes:
- Star rating and title
- Full review text
- Verified purchase badge
- Helpful vote count
- Reviewer profile (name, rank)
- Date and product variant
- Image/video attachments
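The fields above map naturally onto a small record type. This is a sketch of my own — the field names are illustrative, not Amazon's markup:

```python
from dataclasses import dataclass, field

@dataclass
class AmazonReview:
    """One scraped Amazon review (field names are illustrative)."""
    rating: float
    title: str
    body: str
    verified: bool = False
    helpful_votes: int = 0
    reviewer_name: str = ""
    date: str = ""
    variant: str = ""
    media_urls: list = field(default_factory=list)
```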
Technical Approach
```python
import requests
from bs4 import BeautifulSoup

def scrape_amazon_reviews(asin, page=1):
    url = f"https://www.amazon.com/product-reviews/{asin}"
    params = {
        "pageNumber": page,
        "sortBy": "recent",
        "filterByStar": "all_stars"
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        "Accept-Language": "en-US,en;q=0.9"
    }
    # Residential proxies are essential for Amazon
    proxies = {"https": "http://user:pass@proxy.thordata.com:9000"}
    resp = requests.get(url, params=params, headers=headers,
                        proxies=proxies, timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")

    reviews = []
    for review_div in soup.select("[data-hook='review']"):
        rating_el = review_div.select_one("[data-hook='review-star-rating']")
        title_el = review_div.select_one("[data-hook='review-title']")
        body_el = review_div.select_one("[data-hook='review-body']")
        date_el = review_div.select_one("[data-hook='review-date']")
        reviews.append({
            "rating": rating_el.text.split()[0] if rating_el else None,
            "title": title_el.text.strip() if title_el else None,
            "body": body_el.text.strip() if body_el else None,
            "date": date_el.text.strip() if date_el else None,
            "verified": bool(review_div.select_one("[data-hook='avp-badge']"))
        })
    return reviews
```
Scaling Challenges
Amazon rate-limits aggressively and uses device fingerprinting. For production workloads:
- Rotate residential proxies — ThorData provides residential pools with automatic rotation. Essential for Amazon since datacenter IPs are instantly flagged.
- Respect rate limits — 1-3 seconds between requests minimum, with random jitter.
- Handle pagination — Amazon caps at 500 reviews via the web UI. For products with 10K+ reviews, you'll need to use filter combinations (by star rating, by date) to access the full corpus.
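Those filter combinations can be planned up front. A minimal sketch, assuming the `filterByStar`/`sortBy`/`pageNumber` query parameters shown earlier — the helper names are my own:

```python
import itertools
import random

STAR_FILTERS = ["one_star", "two_star", "three_star", "four_star", "five_star"]
SORTS = ["recent", "helpful"]

def plan_requests(max_pages=10):
    """Yield one query-param dict per request, covering each star/sort
    slice -- each slice gets its own pagination window, which is how
    you reach past the per-listing review cap."""
    for star, sort in itertools.product(STAR_FILTERS, SORTS):
        for page in range(1, max_pages + 1):
            yield {"filterByStar": star, "sortBy": sort, "pageNumber": page}

def jitter_delay(base=2.0, spread=1.0):
    """Random delay in [base, base + spread) seconds between requests."""
    return base + random.random() * spread
```

Sleep for `jitter_delay()` seconds between each planned request rather than using a fixed interval — uniform timing is itself a bot signal.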
Scraping G2 Reviews
G2 is the go-to platform for B2B software reviews. The data is uniquely valuable because reviews include structured pros/cons, feature ratings, and detailed user profiles (company size, role, industry).
Why G2 Data Is Special
Unlike Amazon's free-form text, G2 reviews are structured:
- Separate Pros and Cons sections
- Feature-by-feature star ratings
- User segment data (company size, industry)
- Implementation feedback
- Competitor comparisons mentioned in reviews
This structure makes G2 data immediately useful for competitive analysis without heavy NLP processing.
Approach
G2 uses standard Cloudflare protection. A stealth browser or managed scraper handles it well.
I maintain a G2 Reviews Scraper on Apify that extracts all review fields including the structured pros/cons and user metadata.
For DIY scraping, G2 loads reviews via XHR requests to their internal API:
```python
# G2 reviews load via paginated API calls
review_api = f"https://www.g2.com/products/{slug}/reviews.json"
params = {"page": 1, "sort": "most_recent"}
# Requires session cookies from an authenticated browser session
```
Pro tip: G2 requires login to see full review text. The free preview truncates after ~200 characters. Plan for authentication in your pipeline.
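Once you have the raw payload, flattening the structured answers into pros/cons is straightforward. The key names below are hypothetical — inspect the real XHR response in DevTools and adjust to the actual schema:

```python
def extract_g2_review(raw: dict) -> dict:
    """Flatten one review object into analysis-ready fields.
    All key names here are guesses at the payload shape, not
    a documented G2 API -- verify against a live response."""
    answers = {a.get("question"): a.get("text") for a in raw.get("answers", [])}
    user = raw.get("user", {})
    return {
        "rating": raw.get("star_rating"),
        "pros": answers.get("What do you like best?"),
        "cons": answers.get("What do you dislike?"),
        "company_size": user.get("company_size"),
        "industry": user.get("industry"),
    }
```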
Scraping Trustpilot
Trustpilot is one of the more scraper-friendly platforms, though they've tightened up in 2026. The key advantage: Trustpilot includes company responses to reviews, giving you both sides of the conversation.
Data Structure
```python
trustpilot_review = {
    "rating": 4,
    "title": "Great service, slow shipping",
    "text": "Product quality is excellent but took 3 weeks...",
    "date": "2026-03-01",
    "verified": True,
    "reply": {
        "text": "Thank you for your feedback. We have improved...",
        "date": "2026-03-02"
    },
    "reviewer": {
        "name": "John D.",
        "reviews_count": 12,
        "location": "New York, US"
    }
}
```
Technical Notes
Trustpilot serves reviews as server-rendered HTML, making extraction straightforward with BeautifulSoup. They also expose a semi-public business API.
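Because the HTML is server-rendered, a plain BeautifulSoup pass is enough. A sketch against a stand-in review card — the class names and `data-rating` attribute are illustrative, not Trustpilot's real markup, so swap in the selectors you see in the live page:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for a server-rendered review card; the markup
# here is illustrative, not Trustpilot's actual structure.
SAMPLE_HTML = """
<article class="review">
  <div class="review-rating" data-rating="4"></div>
  <h3 class="review-title">Great service, slow shipping</h3>
  <p class="review-text">Product quality is excellent...</p>
</article>
"""

def parse_reviews(html):
    """Extract rating/title/text from each review card in the page."""
    soup = BeautifulSoup(html, "html.parser")
    out = []
    for card in soup.select("article.review"):
        rating_el = card.select_one("[data-rating]")
        title_el = card.select_one(".review-title")
        text_el = card.select_one(".review-text")
        out.append({
            "rating": int(rating_el["data-rating"]) if rating_el else None,
            "title": title_el.get_text(strip=True) if title_el else None,
            "text": text_el.get_text(strip=True) if text_el else None,
        })
    return out
```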
For production-scale extraction, I have a Trustpilot Scraper on Apify that handles pagination and exports clean JSON/CSV.
To manage request volume yourself, ScrapeOps provides proxy rotation and monitoring specifically designed for web scraping — useful for tracking success rates and identifying when sites change their structure.
Scraping Yelp
Yelp is arguably the hardest platform to scrape in 2026. They use aggressive bot detection (Datadome) and actively pursue legal action against scrapers.
What Makes Yelp Challenging
- Datadome protection — Sophisticated JavaScript challenges and behavioral analysis
- Review filtering — Yelp hides reviews it considers unreliable (sometimes 30-40% of total reviews)
- Dynamic content — Reviews load via JavaScript, requiring browser automation
- Legal stance — Yelp has sued scrapers before (hiQ Labs precedent helps, but tread carefully)
Working Approach
```python
from playwright.async_api import async_playwright

async def scrape_yelp_reviews(business_url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 ..."
        )
        page = await context.new_page()
        await page.goto(business_url, wait_until="networkidle")

        reviews = []
        review_elements = await page.query_selector_all("[data-review-id]")
        for el in review_elements:
            rating_el = await el.query_selector("[aria-label*='star rating']")
            text_el = await el.query_selector("p[lang]")
            reviews.append({
                "rating": await rating_el.get_attribute("aria-label") if rating_el else None,
                "text": await text_el.inner_text() if text_el else None
            })
        await browser.close()
        return reviews
```
For reliable Yelp scraping, residential proxies are non-negotiable. ThorData provides the residential IP diversity needed to avoid Datadome's fingerprinting.
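The `aria-label` captured above is a string like "4 star rating", not a number. A small helper converts it — the label format is assumed from typical Yelp markup, so verify against what you actually scrape:

```python
import re
from typing import Optional

def parse_star_rating(aria_label: Optional[str]) -> Optional[float]:
    """Pull the numeric value out of an aria-label like '4.5 star rating'.
    The label format is an assumption based on typical Yelp markup."""
    if not aria_label:
        return None
    m = re.match(r"([\d.]+)\s+star", aria_label)
    return float(m.group(1)) if m else None
```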
Building a Multi-Source Review Pipeline
The real power comes from aggregating reviews across platforms. Here's a production architecture:
```
Scrapers --> Normalizer --> Database (PostgreSQL + pgvector)
                                         |
                                         v
Dashboard <---------------- Sentiment Analysis (NLP/LLM)
```
Unified Review Schema
Normalize across sources for consistent analysis:
```python
unified_review = {
    "source": "g2",                       # amazon, g2, trustpilot, yelp
    "source_id": "rev_abc123",
    "product": "Slack",
    "rating": 4,                          # normalized to 1-5
    "title": "...",
    "text": "...",
    "pros": "...",                        # G2-specific, null for others
    "cons": "...",                        # G2-specific, null for others
    "verified": True,
    "date": "2026-03-01",
    "sentiment": 0.72,                    # computed post-scraping
    "aspects": ["pricing", "support"],    # extracted topics
    "scraped_at": "2026-03-09T10:00:00Z"
}
```
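Each source then gets its own mapper onto this schema. A sketch for Trustpilot, using the record shape shown earlier — the `id` and `company` keys are assumptions about what your scraper emits:

```python
from datetime import datetime, timezone

def normalize_trustpilot(raw: dict) -> dict:
    """Map a Trustpilot-shaped record onto the unified schema.
    Keys like 'id' and 'company' are assumptions about the
    scraper's output; Trustpilot ratings are already on a 1-5 scale."""
    return {
        "source": "trustpilot",
        "source_id": raw.get("id"),
        "product": raw.get("company"),
        "rating": raw["rating"],
        "title": raw.get("title"),
        "text": raw.get("text"),
        "pros": None,                 # G2-specific
        "cons": None,                 # G2-specific
        "verified": raw.get("verified", False),
        "date": raw.get("date"),
        "sentiment": None,            # computed post-scraping
        "aspects": [],
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }
```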
Sentiment Analysis at Scale
For processing thousands of reviews, use a lightweight model locally rather than paying per-API-call:
```python
from transformers import pipeline

sentiment = pipeline("sentiment-analysis",
                     model="cardiffnlp/twitter-roberta-base-sentiment-latest")

def analyze_review(text):
    result = sentiment(text[:512])[0]
    return {
        "label": result["label"],
        "score": round(result["score"], 3)
    }
```
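With sentiment scores attached to the unified records, aggregation is plain Python. A sketch that averages sentiment per extracted aspect, assuming the `sentiment` and `aspects` fields from the unified schema above:

```python
from collections import defaultdict

def sentiment_by_aspect(reviews):
    """Average sentiment score per aspect across normalized reviews.
    Expects each review dict to carry 'sentiment' (float or None)
    and 'aspects' (list of topic strings)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in reviews:
        if r.get("sentiment") is None:
            continue  # not yet analyzed
        for aspect in r.get("aspects", []):
            totals[aspect] += r["sentiment"]
            counts[aspect] += 1
    return {a: round(totals[a] / counts[a], 3) for a in totals}
```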
Handling Common Challenges
Rate Limiting
Every platform rate-limits. Space requests 2-5 seconds apart. ScrapeOps can help monitor your success rates across sources and alert you when a site starts blocking more requests.
Data Quality
Reviews can be fake, incentivized, or machine-generated. Filter by:
- Verified purchase badges
- Reviewer history (accounts with 1 review are suspicious)
- Text length (very short reviews carry less signal)
- Duplicate text detection across reviews
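The last filter — duplicate detection — can start as simple fingerprinting: normalize the text, hash it, and keep only the first occurrence. A minimal sketch (exact duplicates only; near-duplicates need shingling or embeddings):

```python
import hashlib
import re

def text_fingerprint(text: str) -> str:
    """Fingerprint for exact-duplicate detection: lowercase, strip
    punctuation, collapse whitespace, then hash."""
    normalized = re.sub(r"[^a-z0-9 ]+", "", text.lower())
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def drop_duplicates(reviews):
    """Keep the first review for each distinct fingerprint."""
    seen, unique = set(), []
    for r in reviews:
        fp = text_fingerprint(r.get("text") or "")
        if fp not in seen:
            seen.add(fp)
            unique.append(r)
    return unique
```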
Legal Considerations
Scraping publicly available data is generally permissible in the US — the Ninth Circuit's hiQ v. LinkedIn ruling (2022) held that it does not violate the CFAA — but terms-of-service and contract claims can still apply. In practice:
- Respect robots.txt as a best practice
- Don't circumvent authentication barriers
- Don't overload servers with aggressive request rates
- Use data responsibly — don't republish raw reviews as your own content
Conclusion
Review scraping in 2026 comes down to three things: residential proxies for reliable access, structured extraction for each platform's unique data format, and normalization for cross-platform analysis.
For quick starts, use managed scrapers like the G2 Reviews Scraper or Trustpilot Scraper on Apify. For custom pipelines, combine proxy services with the code patterns above.
The most valuable insight often comes not from any single platform, but from triangulating sentiment across all of them.
What review data are you scraping? Share your pipeline architecture in the comments — always curious to see how others handle multi-source aggregation.