Product reviews are one of the most valuable datasets on the internet. They tell you what customers actually think — not what marketing says. Whether you're building a sentiment analysis pipeline, monitoring brand reputation, or training an NLP model, scraping reviews at scale is a core competency.
Here's how to extract reviews from the four major platforms in 2026, including the real technical challenges and working solutions.
Use Cases for Review Data
Review scraping isn't just for e-commerce. Here's where the data creates real value:
- Brand monitoring — Track sentiment across platforms in real time. Catch PR issues before they trend.
- Competitive analysis — What do customers love/hate about competing products? Map feature gaps from actual user feedback.
- Product development — Mine thousands of reviews for feature requests and pain points. Better than surveys.
- Lead generation — Identify unhappy customers of competitors (negative reviewers) for targeted outreach.
- Market research — Aggregate ratings across categories to identify underserved markets.
- AI/ML training — Build sentiment classifiers, aspect-based analysis models, or recommendation engines.
Platform Overview
| Platform | Review Volume | Anti-Bot | Data Quality | Best For |
|---|---|---|---|---|
| Amazon | 1B+ reviews | Very High | Verified purchases, rich metadata | Consumer products, e-commerce |
| G2 | 2.5M+ reviews | Medium | Detailed pros/cons, feature ratings | B2B software, SaaS |
| Trustpilot | 300M+ reviews | Medium-High | Company-level, response tracking | Service businesses, D2C |
| Yelp | 265M+ reviews | High | Local business, photos, check-ins | Restaurants, local services |
Scraping Amazon Reviews
Amazon has the largest review corpus but also the most sophisticated bot detection (PerimeterX/HUMAN).
What You Get
Each Amazon review includes:
- Star rating and title
- Full review text
- Verified purchase badge
- Helpful vote count
- Reviewer profile (name, rank)
- Date and product variant
- Image/video attachments
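The fields above map naturally onto a small record type. This is a sketch of my own — the field names are illustrative, not Amazon's markup:

```python
from dataclasses import dataclass, field

@dataclass
class AmazonReview:
    """One scraped Amazon review (field names are illustrative)."""
    rating: float
    title: str
    body: str
    verified: bool = False
    helpful_votes: int = 0
    reviewer_name: str = ""
    date: str = ""
    variant: str = ""
    media_urls: list = field(default_factory=list)
```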
Technical Approach
```python
import requests
from bs4 import BeautifulSoup

def scrape_amazon_reviews(asin, page=1):
    url = f"https://www.amazon.com/product-reviews/{asin}"
    params = {
        "pageNumber": page,
        "sortBy": "recent",
        "filterByStar": "all_stars"
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        "Accept-Language": "en-US,en;q=0.9"
    }
    # Residential proxies are essential for Amazon
    proxies = {"https": "http://user:pass@proxy.thordata.com:9000"}
    resp = requests.get(url, params=params, headers=headers,
                        proxies=proxies, timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")

    reviews = []
    for review_div in soup.select("[data-hook='review']"):
        rating_el = review_div.select_one("[data-hook='review-star-rating']")
        title_el = review_div.select_one("[data-hook='review-title']")
        body_el = review_div.select_one("[data-hook='review-body']")
        date_el = review_div.select_one("[data-hook='review-date']")
        reviews.append({
            "rating": rating_el.text.split()[0] if rating_el else None,
            "title": title_el.text.strip() if title_el else None,
            "body": body_el.text.strip() if body_el else None,
            "date": date_el.text.strip() if date_el else None,
            "verified": bool(review_div.select_one("[data-hook='avp-badge']"))
        })
    return reviews
```
Scaling Challenges
Amazon rate-limits aggressively and uses device fingerprinting. For production workloads:
- Rotate residential proxies — ThorData provides residential pools with automatic rotation. Essential for Amazon since datacenter IPs are instantly flagged.
- Respect rate limits — 1-3 seconds between requests minimum, with random jitter.
- Handle pagination — Amazon caps at 500 reviews via the web UI. For products with 10K+ reviews, you'll need to use filter combinations (by star rating, by date) to access the full corpus.
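Those filter combinations can be planned up front. A minimal sketch, assuming the `filterByStar`/`sortBy`/`pageNumber` query parameters shown earlier — the helper names are my own:

```python
import itertools
import random

STAR_FILTERS = ["one_star", "two_star", "three_star", "four_star", "five_star"]
SORTS = ["recent", "helpful"]

def plan_requests(max_pages=10):
    """Yield one query-param dict per request, covering each star/sort
    slice -- each slice gets its own pagination window, which is how
    you reach past the per-listing review cap."""
    for star, sort in itertools.product(STAR_FILTERS, SORTS):
        for page in range(1, max_pages + 1):
            yield {"filterByStar": star, "sortBy": sort, "pageNumber": page}

def jitter_delay(base=2.0, spread=1.0):
    """Random delay in [base, base + spread) seconds between requests."""
    return base + random.random() * spread
```

Sleep for `jitter_delay()` seconds between each planned request rather than using a fixed interval — uniform timing is itself a bot signal.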
Scraping G2 Reviews
G2 is the go-to platform for B2B software reviews. The data is uniquely valuable because reviews include structured pros/cons, feature ratings, and detailed user profiles (company size, role, industry).
Why G2 Data Is Special
Unlike Amazon's free-form text, G2 reviews are structured:
- Separate Pros and Cons sections
- Feature-by-feature star ratings
- User segment data (company size, industry)
- Implementation feedback
- Competitor comparisons mentioned in reviews
This structure makes G2 data immediately useful for competitive analysis without heavy NLP processing.
Approach
G2 uses standard Cloudflare protection. A stealth browser or managed scraper handles it well.
I maintain a G2 Reviews Scraper on Apify that extracts all review fields including the structured pros/cons and user metadata.
For DIY scraping, G2 loads reviews via XHR requests to their internal API:
```python
# G2 reviews load via paginated API calls
review_api = f"https://www.g2.com/products/{slug}/reviews.json"
params = {"page": 1, "sort": "most_recent"}
# Requires session cookies from an authenticated browser session
```
Pro tip: G2 requires login to see full review text. The free preview truncates after ~200 characters. Plan for authentication in your pipeline.
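Once you have the raw payload, flattening the structured answers into pros/cons is straightforward. The key names below are hypothetical — inspect the real XHR response in DevTools and adjust to the actual schema:

```python
def extract_g2_review(raw: dict) -> dict:
    """Flatten one review object into analysis-ready fields.
    All key names here are guesses at the payload shape, not
    a documented G2 API -- verify against a live response."""
    answers = {a.get("question"): a.get("text") for a in raw.get("answers", [])}
    user = raw.get("user", {})
    return {
        "rating": raw.get("star_rating"),
        "pros": answers.get("What do you like best?"),
        "cons": answers.get("What do you dislike?"),
        "company_size": user.get("company_size"),
        "industry": user.get("industry"),
    }
```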
Scraping Trustpilot
Trustpilot is one of the more scraper-friendly platforms, though they've tightened up in 2026. The key advantage: Trustpilot includes company responses to reviews, giving you both sides of the conversation.
Data Structure
```python
trustpilot_review = {
    "rating": 4,
    "title": "Great service, slow shipping",
    "text": "Product quality is excellent but took 3 weeks...",
    "date": "2026-03-01",
    "verified": True,
    "reply": {
        "text": "Thank you for your feedback. We have improved...",
        "date": "2026-03-02"
    },
    "reviewer": {
        "name": "John D.",
        "reviews_count": 12,
        "location": "New York, US"
    }
}
```
Technical Notes
Trustpilot serves reviews as server-rendered HTML, making extraction straightforward with BeautifulSoup. They also expose a semi-public business API.
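Because the HTML is server-rendered, a plain BeautifulSoup pass is enough. A sketch against a stand-in review card — the class names and `data-rating` attribute are illustrative, not Trustpilot's real markup, so swap in the selectors you see in the live page:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for a server-rendered review card; the markup
# here is illustrative, not Trustpilot's actual structure.
SAMPLE_HTML = """
<article class="review">
  <div class="review-rating" data-rating="4"></div>
  <h3 class="review-title">Great service, slow shipping</h3>
  <p class="review-text">Product quality is excellent...</p>
</article>
"""

def parse_reviews(html):
    """Extract rating/title/text from each review card in the page."""
    soup = BeautifulSoup(html, "html.parser")
    out = []
    for card in soup.select("article.review"):
        rating_el = card.select_one("[data-rating]")
        title_el = card.select_one(".review-title")
        text_el = card.select_one(".review-text")
        out.append({
            "rating": int(rating_el["data-rating"]) if rating_el else None,
            "title": title_el.get_text(strip=True) if title_el else None,
            "text": text_el.get_text(strip=True) if text_el else None,
        })
    return out
```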
For production-scale extraction, I have a Trustpilot Scraper on Apify that handles pagination and exports clean JSON/CSV.
To manage request volume yourself, ScrapeOps provides proxy rotation and monitoring specifically designed for web scraping — useful for tracking success rates and identifying when sites change their structure.
Scraping Yelp
Yelp is arguably the hardest platform to scrape in 2026. They use aggressive bot detection (Datadome) and actively pursue legal action against scrapers.
What Makes Yelp Challenging
- Datadome protection — Sophisticated JavaScript challenges and behavioral analysis
- Review filtering — Yelp hides reviews it considers unreliable (sometimes 30-40% of total reviews)
- Dynamic content — Reviews load via JavaScript, requiring browser automation
- Legal stance — Yelp has sued scrapers before (hiQ Labs precedent helps, but tread carefully)
Working Approach
```python
from playwright.async_api import async_playwright

async def scrape_yelp_reviews(business_url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 ..."
        )
        page = await context.new_page()
        await page.goto(business_url, wait_until="networkidle")

        reviews = []
        review_elements = await page.query_selector_all("[data-review-id]")
        for el in review_elements:
            rating_el = await el.query_selector("[aria-label*='star rating']")
            text_el = await el.query_selector("p[lang]")
            reviews.append({
                "rating": await rating_el.get_attribute("aria-label") if rating_el else None,
                "text": await text_el.inner_text() if text_el else None
            })
        await browser.close()
        return reviews
```
For reliable Yelp scraping, residential proxies are non-negotiable. ThorData provides the residential IP diversity needed to avoid Datadome's fingerprinting.
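The `aria-label` captured above is a string like "4 star rating", not a number. A small helper converts it — the label format is assumed from typical Yelp markup, so verify against what you actually scrape:

```python
import re
from typing import Optional

def parse_star_rating(aria_label: Optional[str]) -> Optional[float]:
    """Pull the numeric value out of an aria-label like '4.5 star rating'.
    The label format is an assumption based on typical Yelp markup."""
    if not aria_label:
        return None
    m = re.match(r"([\d.]+)\s+star", aria_label)
    return float(m.group(1)) if m else None
```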
Building a Multi-Source Review Pipeline
The real power comes from aggregating reviews across platforms. Here's a production architecture:
```
Scrapers --> Normalizer --> Database (PostgreSQL + pgvector)
                                         |
                                         v
Dashboard <---------------- Sentiment Analysis (NLP/LLM)
```
Unified Review Schema
Normalize across sources for consistent analysis:
```python
unified_review = {
    "source": "g2",                       # amazon, g2, trustpilot, yelp
    "source_id": "rev_abc123",
    "product": "Slack",
    "rating": 4,                          # normalized to 1-5
    "title": "...",
    "text": "...",
    "pros": "...",                        # G2-specific, null for others
    "cons": "...",                        # G2-specific, null for others
    "verified": True,
    "date": "2026-03-01",
    "sentiment": 0.72,                    # computed post-scraping
    "aspects": ["pricing", "support"],    # extracted topics
    "scraped_at": "2026-03-09T10:00:00Z"
}
```
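Each source then gets its own mapper onto this schema. A sketch for Trustpilot, using the record shape shown earlier — the `id` and `company` keys are assumptions about what your scraper emits:

```python
from datetime import datetime, timezone

def normalize_trustpilot(raw: dict) -> dict:
    """Map a Trustpilot-shaped record onto the unified schema.
    Keys like 'id' and 'company' are assumptions about the
    scraper's output; Trustpilot ratings are already on a 1-5 scale."""
    return {
        "source": "trustpilot",
        "source_id": raw.get("id"),
        "product": raw.get("company"),
        "rating": raw["rating"],
        "title": raw.get("title"),
        "text": raw.get("text"),
        "pros": None,                 # G2-specific
        "cons": None,                 # G2-specific
        "verified": raw.get("verified", False),
        "date": raw.get("date"),
        "sentiment": None,            # computed post-scraping
        "aspects": [],
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }
```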
Sentiment Analysis at Scale
For processing thousands of reviews, use a lightweight model locally rather than paying per-API-call:
```python
from transformers import pipeline

sentiment = pipeline("sentiment-analysis",
                     model="cardiffnlp/twitter-roberta-base-sentiment-latest")

def analyze_review(text):
    result = sentiment(text[:512])[0]
    return {
        "label": result["label"],
        "score": round(result["score"], 3)
    }
```
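With sentiment scores attached to the unified records, aggregation is plain Python. A sketch that averages sentiment per extracted aspect, assuming the `sentiment` and `aspects` fields from the unified schema above:

```python
from collections import defaultdict

def sentiment_by_aspect(reviews):
    """Average sentiment score per aspect across normalized reviews.
    Expects each review dict to carry 'sentiment' (float or None)
    and 'aspects' (list of topic strings)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in reviews:
        if r.get("sentiment") is None:
            continue  # not yet analyzed
        for aspect in r.get("aspects", []):
            totals[aspect] += r["sentiment"]
            counts[aspect] += 1
    return {a: round(totals[a] / counts[a], 3) for a in totals}
```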
Handling Common Challenges
Rate Limiting
Every platform rate-limits. Space requests 2-5 seconds apart. ScrapeOps can help monitor your success rates across sources and alert you when a site starts blocking more requests.
Data Quality
Reviews can be fake, incentivized, or machine-generated. Filter by:
- Verified purchase badges
- Reviewer history (accounts with 1 review are suspicious)
- Text length (very short reviews carry less signal)
- Duplicate text detection across reviews
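The last filter — duplicate detection — can start as simple fingerprinting: normalize the text, hash it, and keep only the first occurrence. A minimal sketch (exact duplicates only; near-duplicates need shingling or embeddings):

```python
import hashlib
import re

def text_fingerprint(text: str) -> str:
    """Fingerprint for exact-duplicate detection: lowercase, strip
    punctuation, collapse whitespace, then hash."""
    normalized = re.sub(r"[^a-z0-9 ]+", "", text.lower())
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def drop_duplicates(reviews):
    """Keep the first review for each distinct fingerprint."""
    seen, unique = set(), []
    for r in reviews:
        fp = text_fingerprint(r.get("text") or "")
        if fp not in seen:
            seen.add(fp)
            unique.append(r)
    return unique
```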
Legal Considerations
Scraping publicly available data is generally permissible in the US — the Ninth Circuit's hiQ v. LinkedIn ruling (2022) held that it does not violate the CFAA — but terms-of-service and contract claims can still apply. In practice:
- Respect robots.txt as a best practice
- Don't circumvent authentication barriers
- Don't overload servers with aggressive request rates
- Use data responsibly — don't republish raw reviews as your own content
Conclusion
Review scraping in 2026 comes down to three things: residential proxies for reliable access, structured extraction for each platform's unique data format, and normalization for cross-platform analysis.
For quick starts, use managed scrapers like the G2 Reviews Scraper or Trustpilot Scraper on Apify. For custom pipelines, combine proxy services with the code patterns above.
The most valuable insight often comes not from any single platform, but from triangulating sentiment across all of them.
What review data are you scraping? Share your pipeline architecture in the comments — always curious to see how others handle multi-source aggregation.