DEV Community

agenthustler

How to Scrape Product Reviews at Scale in 2026: Amazon, G2, Trustpilot, and Yelp

Product reviews are one of the most valuable datasets on the internet. They tell you what customers actually think — not what marketing says. Whether you're building a sentiment analysis pipeline, monitoring brand reputation, or training an NLP model, scraping reviews at scale is a core competency.

Here's how to extract reviews from the four major platforms in 2026, including the real technical challenges and working solutions.

Use Cases for Review Data

Review scraping isn't just for e-commerce. Here's where the data creates real value:

  • Brand monitoring — Track sentiment across platforms in real time. Catch PR issues before they trend.
  • Competitive analysis — What do customers love/hate about competing products? Map feature gaps from actual user feedback.
  • Product development — Mine thousands of reviews for feature requests and pain points. Better than surveys.
  • Lead generation — Identify unhappy customers of competitors (negative reviewers) for targeted outreach.
  • Market research — Aggregate ratings across categories to identify underserved markets.
  • AI/ML training — Build sentiment classifiers, aspect-based analysis models, or recommendation engines.

Platform Overview

Platform   | Review Volume | Anti-Bot    | Data Quality                        | Best For
-----------|---------------|-------------|-------------------------------------|------------------------------
Amazon     | 1B+ reviews   | Very High   | Verified purchases, rich metadata   | Consumer products, e-commerce
G2         | 2.5M+ reviews | Medium      | Detailed pros/cons, feature ratings | B2B software, SaaS
Trustpilot | 300M+ reviews | Medium-High | Company-level, response tracking    | Service businesses, D2C
Yelp       | 265M+ reviews | High        | Local business, photos, check-ins   | Restaurants, local services

Scraping Amazon Reviews

Amazon has the largest review corpus but also the most sophisticated bot detection (PerimeterX/HUMAN).

What You Get

Each Amazon review includes:

  • Star rating and title
  • Full review text
  • Verified purchase badge
  • Helpful vote count
  • Reviewer profile (name, rank)
  • Date and product variant
  • Image/video attachments

Technical Approach

import requests
from bs4 import BeautifulSoup

def scrape_amazon_reviews(asin, page=1):
    url = f"https://www.amazon.com/product-reviews/{asin}"
    params = {
        "pageNumber": page,
        "sortBy": "recent",
        "filterByStar": "all_stars"
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        "Accept-Language": "en-US,en;q=0.9"
    }

    # Residential proxies are essential for Amazon
    proxies = {"https": "http://user:pass@proxy.thordata.com:9000"}
    resp = requests.get(url, params=params, headers=headers,
                       proxies=proxies, timeout=30)

    soup = BeautifulSoup(resp.text, "html.parser")
    reviews = []

    for review_div in soup.select("[data-hook='review']"):
        rating_el = review_div.select_one("[data-hook='review-star-rating']")
        title_el = review_div.select_one("[data-hook='review-title']")
        body_el = review_div.select_one("[data-hook='review-body']")
        date_el = review_div.select_one("[data-hook='review-date']")

        reviews.append({
            "rating": rating_el.text.split()[0] if rating_el else None,
            "title": title_el.text.strip() if title_el else None,
            "body": body_el.text.strip() if body_el else None,
            "date": date_el.text.strip() if date_el else None,
            "verified": bool(review_div.select_one("[data-hook='avp-badge']"))
        })

    return reviews

Scaling Challenges

Amazon rate-limits aggressively and uses device fingerprinting. For production workloads:

  1. Rotate residential proxies — ThorData provides residential pools with automatic rotation. Essential for Amazon, since datacenter IPs are flagged almost instantly.
  2. Respect rate limits — 1-3 seconds between requests minimum, with random jitter.
  3. Handle pagination — Amazon caps at 500 reviews via the web UI. For products with 10K+ reviews, you'll need to use filter combinations (by star rating, by date) to access the full corpus.
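The filter-combination trick from step 3, plus the jittered delays from step 2, can be sketched as a simple job planner. The `filterByStar` values and the 50-page cap here mirror what Amazon's review URLs commonly accept, but treat them as assumptions to verify against the live site:

```python
import random

# The web UI caps each listing view at roughly 500 reviews (~10 per page x 50 pages),
# so enumerate (star filter, page) combinations to reach more of the corpus.
STAR_FILTERS = ["one_star", "two_star", "three_star", "four_star", "five_star"]
MAX_PAGES = 50  # ~10 reviews per page

def plan_review_jobs(max_pages=MAX_PAGES):
    """Yield (filterByStar, pageNumber) pairs covering each star bucket."""
    for star in STAR_FILTERS:
        for page in range(1, max_pages + 1):
            yield star, page

def polite_delay(base=1.0, jitter=2.0):
    """Seconds to sleep between requests: a base wait plus random jitter."""
    return base + random.uniform(0, jitter)

jobs = list(plan_review_jobs())  # 5 star buckets x 50 pages = 250 fetches per product
```

Each `(star, page)` pair feeds the `params` dict in `scrape_amazon_reviews` above; sleep for `polite_delay()` between fetches.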

Scraping G2 Reviews

G2 is the go-to platform for B2B software reviews. The data is uniquely valuable because reviews include structured pros/cons, feature ratings, and detailed user profiles (company size, role, industry).

Why G2 Data Is Special

Unlike Amazon's free-form text, G2 reviews are structured:

  • Separate Pros and Cons sections
  • Feature-by-feature star ratings
  • User segment data (company size, industry)
  • Implementation feedback
  • Competitor comparisons mentioned in reviews

This structure makes G2 data immediately useful for competitive analysis without heavy NLP processing.

Approach

G2 uses standard Cloudflare protection. A stealth browser or managed scraper handles it well.

I maintain a G2 Reviews Scraper on Apify that extracts all review fields including the structured pros/cons and user metadata.

For DIY scraping, G2 loads reviews via XHR requests to their internal API:

# G2 reviews load via paginated API calls
slug = "slack"  # product slug, taken from the G2 URL (g2.com/products/<slug>)
review_api = f"https://www.g2.com/products/{slug}/reviews.json"
params = {"page": 1, "sort": "most_recent"}
# Requires session cookies from an authenticated browser session

Pro tip: G2 requires login to see full review text. The free preview truncates after ~200 characters. Plan for authentication in your pipeline.
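Planning for authentication mostly means assembling the paginated URL and forwarding a session cookie captured from a logged-in browser. A minimal helper, assuming the `reviews.json` path above; the `_g2_session` cookie name is a placeholder — copy the real one from DevTools:

```python
from urllib.parse import urlencode

def build_g2_review_request(slug, page=1, session_cookie=None):
    """Assemble the URL and headers for a paginated G2 review fetch.

    session_cookie should be the Cookie header value copied from an
    authenticated browser session (DevTools > Application > Cookies).
    """
    base = f"https://www.g2.com/products/{slug}/reviews.json"
    query = urlencode({"page": page, "sort": "most_recent"})
    headers = {
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",  # mimic the site's own XHR calls
    }
    if session_cookie:
        headers["Cookie"] = session_cookie  # e.g. "_g2_session=<value>"
    return f"{base}?{query}", headers

url, headers = build_g2_review_request("slack", page=2, session_cookie="_g2_session=abc")
```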

Scraping Trustpilot

Trustpilot is one of the more scraper-friendly platforms, though they've tightened up in 2026. The key advantage: Trustpilot includes company responses to reviews, giving you both sides of the conversation.

Data Structure

trustpilot_review = {
    "rating": 4,
    "title": "Great service, slow shipping",
    "text": "Product quality is excellent but took 3 weeks...",
    "date": "2026-03-01",
    "verified": True,
    "reply": {
        "text": "Thank you for your feedback. We have improved...",
        "date": "2026-03-02"
    },
    "reviewer": {
        "name": "John D.",
        "reviews_count": 12,
        "location": "New York, US"
    }
}

Technical Notes

Trustpilot serves reviews as server-rendered HTML, making extraction straightforward with BeautifulSoup. They also expose a semi-public business API.
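Server-rendered pages often also embed review data as JSON-LD structured data in a `<script>` tag, which tends to be more stable than CSS selectors. A minimal extraction sketch on an inline sample — the real markup and field names may differ, so verify against a live page:

```python
import json
import re

# Inline sample mimicking a JSON-LD block as server-rendered pages often embed it.
html = '''
<script type="application/ld+json">
{"@type": "LocalBusiness", "review": [
  {"@type": "Review", "reviewRating": {"ratingValue": 4},
   "headline": "Great service, slow shipping",
   "reviewBody": "Product quality is excellent but took 3 weeks...",
   "datePublished": "2026-03-01"}
]}
</script>
'''

def extract_jsonld_reviews(page_html):
    """Pull Review objects out of any JSON-LD blocks on the page."""
    reviews = []
    for block in re.findall(
        r'<script type="application/ld\+json">(.*?)</script>',
        page_html, re.DOTALL,
    ):
        data = json.loads(block)
        for r in data.get("review", []):
            reviews.append({
                "rating": r["reviewRating"]["ratingValue"],
                "title": r.get("headline"),
                "text": r.get("reviewBody"),
                "date": r.get("datePublished"),
            })
    return reviews

parsed = extract_jsonld_reviews(html)
```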

For production-scale extraction, I have a Trustpilot Scraper on Apify that handles pagination and exports clean JSON/CSV.

To manage request volume yourself, ScrapeOps provides proxy rotation and monitoring specifically designed for web scraping — useful for tracking success rates and identifying when sites change their structure.

Scraping Yelp

Yelp is arguably the hardest platform to scrape in 2026. They use aggressive bot detection (Datadome) and actively pursue legal action against scrapers.

What Makes Yelp Challenging

  • Datadome protection — Sophisticated JavaScript challenges and behavioral analysis
  • Review filtering — Yelp hides reviews it considers unreliable (sometimes 30-40% of total reviews)
  • Dynamic content — Reviews load via JavaScript, requiring browser automation
  • Legal stance — Yelp has sued scrapers before (hiQ Labs precedent helps, but tread carefully)

Working Approach

from playwright.async_api import async_playwright

async def scrape_yelp_reviews(business_url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 ..."
        )
        page = await context.new_page()
        await page.goto(business_url, wait_until="networkidle")

        reviews = []
        review_elements = await page.query_selector_all("[data-review-id]")

        for el in review_elements:
            rating_el = await el.query_selector("[aria-label*='star rating']")
            text_el = await el.query_selector("p[lang]")
            reviews.append({
                "rating": await rating_el.get_attribute("aria-label") if rating_el else None,
                "text": await text_el.inner_text() if text_el else None
            })

        await browser.close()
        return reviews

For reliable Yelp scraping, residential proxies are non-negotiable. ThorData provides the residential IP diversity needed to avoid Datadome's fingerprinting.
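Playwright accepts proxies through a launch-time `proxy` option shaped as `{"server", "username", "password"}`. A small helper can split a `user:pass@host:port` proxy URL into that shape (the hostname here is just the example from earlier):

```python
from urllib.parse import urlparse

def playwright_proxy(proxy_url):
    """Convert a user:pass@host:port proxy URL into Playwright's proxy dict."""
    parsed = urlparse(proxy_url)
    config = {"server": f"{parsed.scheme}://{parsed.hostname}:{parsed.port}"}
    if parsed.username:
        config["username"] = parsed.username
    if parsed.password:
        config["password"] = parsed.password
    return config

# Used as: await p.chromium.launch(headless=True, proxy=playwright_proxy(...))
cfg = playwright_proxy("http://user:pass@proxy.thordata.com:9000")
```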

Building a Multi-Source Review Pipeline

The real power comes from aggregating reviews across platforms. Here's a production architecture:

Scrapers --> Normalizer --> Database (PostgreSQL + pgvector)
                                |
              Dashboard <-- Sentiment Analysis (NLP/LLM)

Unified Review Schema

Normalize across sources for consistent analysis:

unified_review = {
    "source": "g2",            # amazon, g2, trustpilot, yelp
    "source_id": "rev_abc123",
    "product": "Slack",
    "rating": 4,               # normalized to 1-5
    "title": "...",
    "text": "...",
    "pros": "...",             # G2-specific, null for others
    "cons": "...",             # G2-specific, null for others
    "verified": True,
    "date": "2026-03-01",
    "sentiment": 0.72,         # computed post-scraping
    "aspects": ["pricing", "support"],  # extracted topics
    "scraped_at": "2026-03-09T10:00:00Z"
}
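The normalizer stage is mostly field mapping. A sketch for the Amazon scraper's output above (each other source gets its own mapper); the `id` field is not extracted by the earlier snippet, so it stays `None` unless you capture it:

```python
from datetime import datetime, timezone

def normalize_amazon(raw, product):
    """Map a raw dict from scrape_amazon_reviews to the unified schema."""
    return {
        "source": "amazon",
        "source_id": raw.get("id"),   # populate if your scraper captures review IDs
        "product": product,
        "rating": int(float(raw["rating"])) if raw.get("rating") else None,
        "title": raw.get("title"),
        "text": raw.get("body"),
        "pros": None,                 # G2-specific fields stay null for other sources
        "cons": None,
        "verified": raw.get("verified", False),
        "date": raw.get("date"),
        "sentiment": None,            # filled in by the analysis stage
        "aspects": [],
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }

row = normalize_amazon(
    {"rating": "4.0", "title": "Solid", "body": "Works well",
     "date": "March 1, 2026", "verified": True},
    product="Widget",
)
```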

Sentiment Analysis at Scale

For processing thousands of reviews, use a lightweight model locally rather than paying per-API-call:

from transformers import pipeline

sentiment = pipeline("sentiment-analysis",
                    model="cardiffnlp/twitter-roberta-base-sentiment-latest")

def analyze_review(text):
    # Truncate to 512 characters -- a rough character-level proxy for the model's token limit
    result = sentiment(text[:512])[0]
    return {
        "label": result["label"],
        "score": round(result["score"], 3)
    }
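At scale, per-review scores need rolling up into per-product summaries. A simple reduction over results shaped like `analyze_review`'s output:

```python
from collections import Counter

def summarize_sentiment(results):
    """Aggregate a list of {label, score} dicts into label counts and mean confidence."""
    if not results:
        return {"counts": {}, "mean_confidence": None}
    labels = Counter(r["label"] for r in results)
    mean_score = round(sum(r["score"] for r in results) / len(results), 3)
    return {"counts": dict(labels), "mean_confidence": mean_score}

summary = summarize_sentiment([
    {"label": "positive", "score": 0.91},
    {"label": "negative", "score": 0.77},
    {"label": "positive", "score": 0.88},
])
```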

Handling Common Challenges

Rate Limiting

Every platform rate-limits. Space requests 2-5 seconds apart. ScrapeOps can help monitor your success rates across sources and alert you when a site starts blocking more requests.

Data Quality

Reviews can be fake, incentivized, or machine-generated. Filter by:

  • Verified purchase badges
  • Reviewer history (accounts with 1 review are suspicious)
  • Text length (very short reviews carry less signal)
  • Duplicate text detection across reviews
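Duplicate detection from the last bullet can start with hashing a normalized version of the text for exact matches, plus `difflib` for near-duplicates. A minimal sketch (thresholds are a starting point to tune):

```python
import hashlib
import re
from difflib import SequenceMatcher

def text_fingerprint(text):
    """Hash a lowercased, punctuation-stripped, whitespace-collapsed copy of the text."""
    normalized = re.sub(r"[^a-z0-9 ]", "", text.lower())
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def is_near_duplicate(a, b, threshold=0.9):
    """Flag pairs whose similarity ratio meets the threshold (O(n*m); sample at scale)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

exact = text_fingerprint("Great product!") == text_fingerprint("great   product")
```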

Legal Considerations

Web scraping of publicly available data is generally legal (hiQ v. LinkedIn, 2022). However:

  • Respect robots.txt as a best practice
  • Don't circumvent authentication barriers
  • Don't overload servers with aggressive request rates
  • Use data responsibly — don't republish raw reviews as your own content

Conclusion

Review scraping in 2026 comes down to three things: residential proxies for reliable access, structured extraction for each platform's unique data format, and normalization for cross-platform analysis.

For quick starts, use managed scrapers like the G2 Reviews Scraper or Trustpilot Scraper on Apify. For custom pipelines, combine proxy services with the code patterns above.

The most valuable insight often comes not from any single platform, but from triangulating sentiment across all of them.


What review data are you scraping? Share your pipeline architecture in the comments — always curious to see how others handle multi-source aggregation.
