Trustpilot is one of the world's most influential consumer review platforms, hosting over 300 million reviews across nearly a million businesses. For market researchers, competitive analysts, and data-driven companies, extracting Trustpilot data at scale provides invaluable insights into customer satisfaction, brand reputation, and industry trends.
In this guide, I'll cover everything you need to know about scraping Trustpilot — from understanding the platform's structure to building production-grade scrapers that handle pagination, rate limiting, and data extraction.
Understanding Trustpilot's Structure
Trustpilot organizes its data around three core entities:
- Business Profiles — Company pages with aggregate ratings, review counts, and company information
- Reviews — Individual user reviews with ratings, text, dates, and reply data
- Categories — Industry groupings that enable discovery and comparison
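Before writing any scraping code, it helps to model these entities explicitly. A minimal sketch — the field names here are my own choices, mirroring the data extracted later in this guide, not an official Trustpilot schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Review:
    rating: int
    title: str
    body: str
    author: str
    date: str                        # ISO 8601 date string as published
    verified: bool = False
    company_reply: Optional[str] = None

@dataclass
class BusinessProfile:
    domain: str
    name: str
    trust_score: float
    total_reviews: int
    reviews: List[Review] = field(default_factory=list)
```

Having typed containers like these makes it obvious which fields each scraper below is responsible for filling in.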
URL Patterns
Trustpilot follows predictable URL patterns:
- Business profile: `https://www.trustpilot.com/review/example.com`
- Review pages: `https://www.trustpilot.com/review/example.com?page=2`
- Category listing: `https://www.trustpilot.com/categories/software_company`
- Search results: `https://www.trustpilot.com/search?query=saas+tools`
Understanding these patterns is the foundation of any scraping strategy.
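Because the patterns are so regular, they can be captured in a few small helpers. A sketch — the function names are mine, not an official API:

```python
from urllib.parse import quote_plus

BASE = "https://www.trustpilot.com"

def business_url(domain, page=1):
    """Build a business review page URL; page 1 omits the query parameter."""
    url = f"{BASE}/review/{domain}"
    return url if page == 1 else f"{url}?page={page}"

def category_url(slug, page=1):
    """Build a category listing URL from a category slug."""
    url = f"{BASE}/categories/{slug}"
    return url if page == 1 else f"{url}?page={page}"

def search_url(query):
    """Build a search URL, URL-encoding the query string."""
    return f"{BASE}/search?query={quote_plus(query)}"
```

Centralizing URL construction like this keeps the scrapers below free of string-formatting duplication.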
Extracting Business Profile Data
Every business on Trustpilot has a profile page containing structured data — the overall TrustScore, total review count, star distribution, and company information.
JavaScript Approach (Node.js with Cheerio)
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeBusinessProfile(businessDomain) {
  const url = `https://www.trustpilot.com/review/${businessDomain}`;
  const response = await axios.get(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Accept-Language': 'en-US,en;q=0.9'
    }
  });
  const $ = cheerio.load(response.data);

  // Extract JSON-LD structured data (most reliable method)
  const jsonLdScripts = $('script[type="application/ld+json"]');
  let businessData = null;
  jsonLdScripts.each((_, script) => {
    try {
      const data = JSON.parse($(script).html());
      if (data['@type'] === 'LocalBusiness' || data['@type'] === 'Organization') {
        businessData = data;
      }
    } catch (e) {
      // Not all scripts contain valid JSON-LD
    }
  });

  // Extract from page elements as a fallback
  const trustScore = $('[data-testid="trust-score"]').text().trim();
  const totalReviews = $('[data-testid="total-review-count"]').text().trim();
  const companyName = $('[data-testid="company-name"]').text().trim();

  // Star distribution
  const starDistribution = {};
  $('[data-testid="star-distribution-row"]').each((_, row) => {
    const stars = $(row).find('[class*="StarLabel"]').text().trim();
    const percentage = $(row).find('[class*="Percentage"]').text().trim();
    if (stars && percentage) {
      starDistribution[stars] = percentage;
    }
  });

  return {
    domain: businessDomain,
    name: companyName || (businessData && businessData.name),
    trustScore: trustScore || (businessData && businessData.aggregateRating?.ratingValue),
    totalReviews: totalReviews || (businessData && businessData.aggregateRating?.reviewCount),
    starDistribution,
    url,
    scrapedAt: new Date().toISOString()
  };
}

// Usage
scrapeBusinessProfile('example.com')
  .then(data => console.log(JSON.stringify(data, null, 2)))
  .catch(err => console.error(err.message));
```
Python Approach
```python
import requests
from bs4 import BeautifulSoup
import json

def scrape_business_profile(business_domain):
    """Extract business profile data from Trustpilot."""
    url = f"https://www.trustpilot.com/review/{business_domain}"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept-Language': 'en-US,en;q=0.9'
    }
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract JSON-LD structured data
    business_data = None
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string)
            if data.get('@type') in ['LocalBusiness', 'Organization']:
                business_data = data
                break
        except (json.JSONDecodeError, TypeError):
            continue

    # Build profile from structured data and page elements
    profile = {
        'domain': business_domain,
        'url': url,
    }
    if business_data:
        profile['name'] = business_data.get('name', '')
        aggregate = business_data.get('aggregateRating', {})
        profile['trust_score'] = aggregate.get('ratingValue')
        profile['total_reviews'] = aggregate.get('reviewCount')
        profile['best_rating'] = aggregate.get('bestRating')
        profile['worst_rating'] = aggregate.get('worstRating')

    # Extract star distribution from page
    star_distribution = {}
    rows = soup.select('[data-testid="star-distribution-row"]')
    for row in rows:
        label = row.select_one('[class*="StarLabel"]')
        pct = row.select_one('[class*="Percentage"]')
        if label and pct:
            star_distribution[label.get_text(strip=True)] = pct.get_text(strip=True)
    profile['star_distribution'] = star_distribution

    return profile

# Usage
profile = scrape_business_profile("example.com")
print(json.dumps(profile, indent=2))
```
Extracting Reviews at Scale
The real value in Trustpilot scraping lies in the individual reviews. Each review contains a rating, text content, date, verification status, and any company reply.
Paginated Review Extraction
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

class TrustpilotReviewScraper {
  constructor(options = {}) {
    this.delay = options.delay || 2000;
    this.maxRetries = options.maxRetries || 3;
  }

  async scrapeReviews(businessDomain, maxPages = null) {
    const allReviews = [];
    let page = 1;
    let hasMore = true;
    let retries = 0;

    while (hasMore) {
      if (maxPages && page > maxPages) break;
      try {
        const reviews = await this.scrapePage(businessDomain, page);
        if (reviews.length === 0) {
          hasMore = false;
          break;
        }
        allReviews.push(...reviews);
        console.log(`Page ${page}: ${reviews.length} reviews (total: ${allReviews.length})`);
        page++;
        retries = 0;
        await this.sleep(this.delay);
      } catch (error) {
        if (error.response?.status === 404) {
          hasMore = false;
        } else if (error.response?.status === 429 && retries < this.maxRetries) {
          retries++;
          console.log('Rate limited. Waiting 60 seconds...');
          await this.sleep(60000);
        } else {
          console.error(`Error on page ${page}: ${error.message}`);
          hasMore = false;
        }
      }
    }
    return allReviews;
  }

  async scrapePage(businessDomain, page) {
    const url = `https://www.trustpilot.com/review/${businessDomain}?page=${page}`;
    const response = await axios.get(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept-Language': 'en-US,en;q=0.9'
      }
    });
    const $ = cheerio.load(response.data);
    const reviews = [];

    // Extract reviews from JSON-LD
    $('script[type="application/ld+json"]').each((_, script) => {
      try {
        const data = JSON.parse($(script).html());
        if (data['@graph']) {
          for (const item of data['@graph']) {
            if (item['@type'] === 'Review') {
              reviews.push({
                id: item.identifier || null,
                rating: item.reviewRating?.ratingValue,
                title: item.headline || '',
                body: item.reviewBody || '',
                author: item.author?.name || 'Anonymous',
                date: item.datePublished,
                language: item.inLanguage,
                verified: false
              });
            }
          }
        }
      } catch (e) {
        // Ignore scripts that aren't valid JSON-LD
      }
    });

    // Fall back to page elements if JSON-LD yielded nothing
    if (reviews.length === 0) {
      $('[data-testid="review-card"]').each((_, card) => {
        const $card = $(card);
        const ratingEl = $card.find('[data-testid="review-star-rating"]');
        const rating = ratingEl.attr('data-rating') || null;
        const title = $card.find('[data-testid="review-title"]').text().trim();
        const body = $card.find('[data-testid="review-content"]').text().trim();
        const author = $card.find('[data-testid="reviewer-name"]').text().trim();
        const date = $card.find('time').attr('datetime') || '';
        const verified = $card.find('[data-testid="verified-badge"]').length > 0;
        // Check for a company reply
        const reply = $card.find('[data-testid="company-reply"]').text().trim();
        reviews.push({
          rating: rating ? parseInt(rating, 10) : null,
          title,
          body,
          author,
          date,
          verified,
          companyReply: reply || null
        });
      });
    }
    return reviews;
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage
const scraper = new TrustpilotReviewScraper({ delay: 2000 });
scraper.scrapeReviews('example.com', 50)
  .then(reviews => {
    const fs = require('fs');
    fs.writeFileSync('reviews.json', JSON.stringify(reviews, null, 2));
    console.log(`Saved ${reviews.length} reviews`);
  });
```
Python Review Scraper with Retry Logic
```python
import requests
from bs4 import BeautifulSoup
import json
import time

class TrustpilotScraper:
    BASE_URL = "https://www.trustpilot.com"

    def __init__(self, delay=2.0):
        self.delay = delay
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
            'Accept-Language': 'en-US,en;q=0.9'
        })

    def scrape_reviews(self, business_domain, max_pages=None):
        """Scrape all reviews for a business, one page at a time."""
        all_reviews = []
        page = 1
        while True:
            if max_pages and page > max_pages:
                break
            url = f"{self.BASE_URL}/review/{business_domain}?page={page}"
            for attempt in range(3):
                try:
                    response = self.session.get(url, timeout=30)
                    if response.status_code == 404:
                        return all_reviews  # past the last page
                    if response.status_code == 429:
                        wait_time = 60 * (attempt + 1)
                        print(f"Rate limited. Waiting {wait_time}s...")
                        time.sleep(wait_time)
                        continue
                    response.raise_for_status()
                    break
                except requests.exceptions.RequestException as e:
                    if attempt == 2:
                        print(f"Failed after 3 attempts: {e}")
                        return all_reviews
                    time.sleep(5)
            else:
                # Every attempt hit a 429; don't parse the error page
                print("Still rate limited after 3 attempts. Stopping.")
                return all_reviews

            reviews = self._parse_reviews(response.text)
            if not reviews:
                break
            all_reviews.extend(reviews)
            print(f"Page {page}: {len(reviews)} reviews (total: {len(all_reviews)})")
            page += 1
            time.sleep(self.delay)
        return all_reviews

    def _parse_reviews(self, html):
        """Parse reviews from an HTML page."""
        soup = BeautifulSoup(html, 'html.parser')
        reviews = []

        # Try JSON-LD first
        for script in soup.find_all('script', type='application/ld+json'):
            try:
                data = json.loads(script.string)
                graph = data.get('@graph', [])
                for item in graph:
                    if item.get('@type') == 'Review':
                        reviews.append({
                            'rating': item.get('reviewRating', {}).get('ratingValue'),
                            'title': item.get('headline', ''),
                            'body': item.get('reviewBody', ''),
                            'author': item.get('author', {}).get('name', 'Anonymous'),
                            'date': item.get('datePublished', ''),
                            'language': item.get('inLanguage', 'en'),
                        })
            except (json.JSONDecodeError, TypeError):
                continue

        # Fall back to HTML parsing
        if not reviews:
            cards = soup.select('[data-testid="review-card"]')
            for card in cards:
                rating_el = card.select_one('[data-testid="review-star-rating"]')
                rating = rating_el.get('data-rating') if rating_el else None
                title_el = card.select_one('[data-testid="review-title"]')
                body_el = card.select_one('[data-testid="review-content"]')
                author_el = card.select_one('[data-testid="reviewer-name"]')
                time_el = card.select_one('time')
                review = {
                    'rating': int(rating) if rating else None,
                    'title': title_el.get_text(strip=True) if title_el else '',
                    'body': body_el.get_text(strip=True) if body_el else '',
                    'author': author_el.get_text(strip=True) if author_el else 'Anonymous',
                    'date': time_el.get('datetime', '') if time_el else '',
                }
                verified = card.select_one('[data-testid="verified-badge"]')
                review['verified'] = bool(verified)
                reply = card.select_one('[data-testid="company-reply"]')
                review['company_reply'] = reply.get_text(strip=True) if reply else None
                reviews.append(review)
        return reviews

# Usage
scraper = TrustpilotScraper(delay=2.0)
reviews = scraper.scrape_reviews("example.com", max_pages=100)
with open("trustpilot_reviews.json", "w") as f:
    json.dump(reviews, f, indent=2, default=str)
print(f"Total reviews collected: {len(reviews)}")
```
Sentiment Analysis on Extracted Reviews
Once you have the review data, basic sentiment analysis adds a powerful analytical layer:
```python
import json
from collections import Counter

def analyze_review_sentiment(reviews):
    """Basic keyword-based sentiment analysis on Trustpilot reviews."""
    total = len(reviews)
    if total == 0:
        return {}

    # Rating distribution
    ratings = Counter(r.get('rating') for r in reviews if r.get('rating'))

    # Average rating
    rated_reviews = [r for r in reviews if r.get('rating')]
    avg_rating = sum(r['rating'] for r in rated_reviews) / len(rated_reviews) if rated_reviews else 0

    # Keyword frequency analysis
    positive_keywords = ['great', 'excellent', 'amazing', 'love', 'best', 'fantastic',
                         'wonderful', 'perfect', 'outstanding', 'recommend']
    negative_keywords = ['terrible', 'horrible', 'awful', 'worst', 'scam', 'fraud',
                         'disappointing', 'avoid', 'waste', 'never']
    positive_count = 0
    negative_count = 0
    keyword_freq = Counter()
    for review in reviews:
        text = (review.get('body', '') + ' ' + review.get('title', '')).lower()
        for kw in positive_keywords:
            if kw in text:
                positive_count += 1
                keyword_freq[f"+{kw}"] += 1
        for kw in negative_keywords:
            if kw in text:
                negative_count += 1
                keyword_freq[f"-{kw}"] += 1

    # Company response rate
    replied = sum(1 for r in reviews if r.get('company_reply'))
    # Verified buyer percentage
    verified = sum(1 for r in reviews if r.get('verified'))

    return {
        'total_reviews': total,
        'average_rating': round(avg_rating, 2),
        'rating_distribution': dict(ratings),
        'positive_mentions': positive_count,
        'negative_mentions': negative_count,
        'sentiment_ratio': round(positive_count / max(negative_count, 1), 2),
        'top_keywords': keyword_freq.most_common(20),
        'company_response_rate': f"{(replied / total * 100):.1f}%",
        'verified_buyer_rate': f"{(verified / total * 100):.1f}%"
    }

# Usage
analysis = analyze_review_sentiment(reviews)
print(json.dumps(analysis, indent=2))
```
Scraping Category and Search Results
Beyond individual businesses, Trustpilot's category pages and search functionality let you discover businesses programmatically:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeCategory(category, maxPages = 5) {
  const businesses = [];
  for (let page = 1; page <= maxPages; page++) {
    const url = `https://www.trustpilot.com/categories/${category}?page=${page}`;
    const response = await axios.get(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
      }
    });
    const $ = cheerio.load(response.data);
    $('[data-testid="business-card"]').each((_, card) => {
      const $card = $(card);
      const name = $card.find('[data-testid="business-name"]').text().trim();
      const link = $card.find('a').attr('href');
      const rating = $card.find('[data-testid="trust-score"]').text().trim();
      const reviewCount = $card.find('[data-testid="review-count"]').text().trim();
      if (name) {
        businesses.push({
          name,
          profileUrl: link ? `https://www.trustpilot.com${link}` : null,
          trustScore: rating,
          reviewCount,
          category
        });
      }
    });
    console.log(`Category ${category}, page ${page}: ${businesses.length} total businesses`);
    await new Promise(r => setTimeout(r, 2000));
  }
  return businesses;
}

// Scrape multiple categories
async function scrapeMultipleCategories(categories) {
  const results = {};
  for (const cat of categories) {
    results[cat] = await scrapeCategory(cat);
    await new Promise(r => setTimeout(r, 5000));
  }
  return results;
}

// Usage
scrapeMultipleCategories(['software_company', 'hosting_company', 'bank'])
  .then(results => {
    const fs = require('fs');
    fs.writeFileSync('trustpilot_categories.json', JSON.stringify(results, null, 2));
  });
```
Using Apify for Production Trustpilot Scraping
For production-scale Trustpilot scraping, cloud-based solutions handle the infrastructure challenges — proxy rotation, browser rendering, and IP management — that make self-hosted scraping difficult to maintain.
Apify offers pre-built actors that specialize in review platform scraping. These handle Trustpilot's anti-bot measures and provide clean, structured output.
Why Use Apify for Trustpilot
- Anti-bot handling: Trustpilot uses Cloudflare and other protections. Apify actors manage this automatically
- Browser rendering: Some Trustpilot content requires JavaScript execution
- Proxy pools: Residential and datacenter proxy rotation prevents IP-based blocking
- Scheduling: Set up automated weekly or daily review monitoring
- Webhooks: Get notified when new data is available
Integration Example
```python
from apify_client import ApifyClient

def scrape_trustpilot_via_apify(business_domain, max_reviews=1000):
    client = ApifyClient("YOUR_APIFY_TOKEN")
    run_input = {
        "startUrls": [
            {"url": f"https://www.trustpilot.com/review/{business_domain}"}
        ],
        "maxReviews": max_reviews,
        "includeCompanyInfo": True,
        "proxy": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"]
        }
    }
    run = client.actor("apify/trustpilot-scraper").call(run_input=run_input)
    items = []
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        items.append(item)
    return items

# Scrape and analyze
reviews = scrape_trustpilot_via_apify("example.com", max_reviews=5000)
print(f"Collected {len(reviews)} reviews")
```
Data Storage for Review Analytics
Storing reviews in a structured database enables powerful temporal analysis:
```python
import sqlite3

def create_review_database(db_path="trustpilot_data.db"):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS businesses (
            domain TEXT PRIMARY KEY,
            name TEXT,
            trust_score REAL,
            total_reviews INTEGER,
            last_scraped TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    ''')
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS reviews (
            id TEXT PRIMARY KEY,
            business_domain TEXT,
            rating INTEGER,
            title TEXT,
            body TEXT,
            author TEXT,
            date TEXT,
            verified BOOLEAN,
            company_reply TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            FOREIGN KEY (business_domain) REFERENCES businesses(domain)
        )
    ''')
    cursor.execute('''
        CREATE INDEX IF NOT EXISTS idx_reviews_business
        ON reviews(business_domain)
    ''')
    cursor.execute('''
        CREATE INDEX IF NOT EXISTS idx_reviews_date
        ON reviews(date)
    ''')
    conn.commit()
    return conn

def store_reviews(conn, business_domain, reviews):
    cursor = conn.cursor()
    for review in reviews:
        # Synthetic key: domain + date + author, since review IDs aren't always available
        review_id = f"{business_domain}:{review.get('date', '')}:{review.get('author', '')}"
        cursor.execute('''
            INSERT OR REPLACE INTO reviews
            (id, business_domain, rating, title, body, author, date, verified, company_reply)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
        ''', (
            review_id,
            business_domain,
            review.get('rating'),
            review.get('title', ''),
            review.get('body', ''),
            review.get('author', 'Anonymous'),
            review.get('date', ''),
            review.get('verified', False),
            review.get('company_reply')
        ))
    conn.commit()

# Usage
conn = create_review_database()
store_reviews(conn, "example.com", reviews)
```
Best Practices for Trustpilot Scraping
Rate Limiting
Trustpilot is more aggressive with rate limiting than many other platforms. Follow these guidelines:
- Minimum 2-second delay between requests to the same domain
- 5-second delay between different business profile scrapes
- Back off exponentially on 429 responses (60s, 120s, 240s)
- Rotate User-Agent strings across a pool of realistic browser signatures
- Use residential proxies for large-scale operations
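The backoff guideline above can be sketched as a small wrapper. The helper names are my own; `session` is any object with a `requests`-style `get` method:

```python
import time

def backoff_schedule(base_wait=60, max_retries=3):
    """Successive waits after 429 responses: 60s, 120s, 240s by default."""
    return [base_wait * (2 ** i) for i in range(max_retries)]

def get_with_backoff(session, url, max_retries=3, base_wait=60):
    """GET a URL, sleeping through the backoff schedule on 429 responses."""
    waits = backoff_schedule(base_wait, max_retries)
    for attempt in range(max_retries + 1):
        response = session.get(url, timeout=30)
        if response.status_code != 429:
            return response
        if attempt < max_retries:
            print(f"Rate limited. Waiting {waits[attempt]}s...")
            time.sleep(waits[attempt])
    return response  # still 429 after exhausting retries
```

Routing every request through a wrapper like this keeps the retry policy in one place instead of duplicating it in each scraper.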
Legal and Ethical Considerations
- Always check Trustpilot's Terms of Service before scraping
- Use data for legitimate purposes such as market research and competitive analysis
- Don't republish review content without proper attribution
- Respect privacy — don't use reviewer personal information inappropriately
- Consider using Trustpilot's official API for authorized access to review data
Handling Anti-Bot Measures
```python
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2.1 Safari/605.1.15',
]

def get_random_headers():
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Cache-Control': 'max-age=0'
    }
```
Monitoring Reputation Over Time
The most powerful application of Trustpilot scraping is tracking reputation changes over time:
```python
def generate_reputation_report(conn, business_domain, days=30):
    """Generate a reputation trend report from stored reviews."""
    cursor = conn.cursor()
    cursor.execute('''
        SELECT
            date(date) as review_date,
            COUNT(*) as review_count,
            AVG(rating) as avg_rating,
            SUM(CASE WHEN rating >= 4 THEN 1 ELSE 0 END) as positive,
            SUM(CASE WHEN rating <= 2 THEN 1 ELSE 0 END) as negative
        FROM reviews
        WHERE business_domain = ?
          AND date >= date('now', ?)
        GROUP BY date(date)
        ORDER BY review_date
    ''', (business_domain, f'-{days} days'))
    rows = cursor.fetchall()

    report = {
        'business': business_domain,
        'period_days': days,
        'daily_breakdown': [],
        'summary': {}
    }
    total_reviews = 0
    total_rating_sum = 0
    for row in rows:
        day_data = {
            'date': row[0],
            'reviews': row[1],
            'avg_rating': round(row[2], 2),
            'positive': row[3],
            'negative': row[4]
        }
        report['daily_breakdown'].append(day_data)
        total_reviews += row[1]
        total_rating_sum += row[2] * row[1]

    if total_reviews > 0:
        report['summary'] = {
            'total_reviews': total_reviews,
            'overall_avg_rating': round(total_rating_sum / total_reviews, 2),
            'reviews_per_day': round(total_reviews / days, 1)
        }
    return report
```
Conclusion
Trustpilot scraping opens up powerful possibilities for competitive intelligence, reputation monitoring, and market research. The platform's structured data and predictable URL patterns make it technically accessible, though its anti-bot protections require careful handling.
Key takeaways:
- Start with JSON-LD extraction — it's the most reliable data source on Trustpilot pages
- Implement robust rate limiting — Trustpilot is stricter than most platforms, so 2+ second delays are essential
- Use cloud platforms like Apify for production workloads that need proxy rotation and anti-bot handling
- Store data in databases for temporal analysis and trend tracking
- Add sentiment analysis to transform raw reviews into actionable intelligence
- Respect Terms of Service and use data ethically
Whether you're monitoring your own brand's reputation, analyzing competitors, or building a review aggregation service, the techniques in this guide provide a solid foundation for extracting meaningful insights from Trustpilot's vast review database.