Trustpilot is one of the most widely used online review platforms, hosting over 300 million reviews for more than 1 million businesses. For market researchers, brand managers, competitive analysts, and data scientists, extracting Trustpilot review data at scale gives direct insight into customer sentiment, brand reputation, and industry trends.
In this guide, we'll break down Trustpilot's architecture, walk through extracting reviews, business profiles, and sentiment data, tackle pagination and anti-scraping measures, and show how to scale everything using Apify.
Understanding Trustpilot's Structure
Before writing a single line of code, understanding how Trustpilot organizes its data will save you hours of debugging.
URL Patterns
Trustpilot follows a clean, predictable URL structure:
- Business profile: https://www.trustpilot.com/review/company-domain.com
- Reviews page N: https://www.trustpilot.com/review/company-domain.com?page=2
- Filtered reviews: https://www.trustpilot.com/review/company-domain.com?stars=5
- Categories: https://www.trustpilot.com/categories/electronics
- Search: https://www.trustpilot.com/search?query=company+name
The key insight: Trustpilot identifies businesses by their domain name, not by an internal ID. So trustpilot.com/review/amazon.com gives you Amazon's reviews. This makes it trivial to look up any business programmatically.
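Because the URL is derived from the domain, building review-page URLs can be reduced to a small helper. The sketch below is illustrative only: `buildReviewUrl` is a hypothetical name, and the query parameters (`page`, `stars`, `languages`) are simply the ones that appear in the URL examples in this guide.

```javascript
// Hypothetical helper: builds a review-page URL from a business domain.
// The option names mirror the query strings used elsewhere in this guide.
function buildReviewUrl(domain, { page, stars, languages } = {}) {
  const url = new URL(
    `https://www.trustpilot.com/review/${encodeURIComponent(domain)}`
  );
  if (languages) url.searchParams.set('languages', languages);
  if (stars) url.searchParams.set('stars', String(stars));
  // Page 1 is the default, so only add the parameter for later pages
  if (page && page > 1) url.searchParams.set('page', String(page));
  return url.toString();
}

console.log(buildReviewUrl('amazon.com'));
console.log(buildReviewUrl('amazon.com', { stars: 5, page: 2 }));
```

Using `URL` and `searchParams` rather than string concatenation keeps the encoding correct when filters are combined.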
Page Structure
Every Trustpilot business profile page contains several data-rich sections:
- Business header: Company name, overall rating, total review count, TrustScore, claimed/unclaimed status, response rate
- Review cards: Individual reviews with star rating, title, body text, author, date, verification status, company reply
- Rating distribution: Breakdown by star count (e.g., 65% 5-star, 15% 4-star...)
- Business details: Category, location, website, contact info
Embedded Structured Data
Trustpilot embeds rich JSON-LD structured data in every page — this is your primary extraction target:
// Trustpilot pages contain JSON-LD with aggregate rating
{
"@type": "LocalBusiness",
"name": "Company Name",
"aggregateRating": {
"@type": "AggregateRating",
"ratingValue": "4.5",
"bestRating": "5",
"ratingCount": "12453"
}
}
This structured data is more reliable than DOM selectors because Trustpilot maintains it for SEO purposes.
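JSON-LD payloads are not always a single flat object: a page may emit an array of nodes or wrap them in an `@graph` array. The sketch below assumes those three shapes (they are common in schema.org markup, but not confirmed for Trustpilot specifically) and returns the first node that carries an `aggregateRating`. The function name `findAggregateRating` is hypothetical.

```javascript
// Walks a parsed JSON-LD payload and returns the first node with an aggregateRating.
// Assumed shapes: a single object, an array of objects, or an "@graph" wrapper.
function findAggregateRating(payload) {
  const queue = Array.isArray(payload) ? [...payload] : [payload];
  while (queue.length > 0) {
    const node = queue.shift();
    if (!node || typeof node !== 'object') continue;
    if (node.aggregateRating) {
      const rating = node.aggregateRating;
      return {
        name: node.name ?? null,
        ratingValue: Number(rating.ratingValue),
        // Some publishers emit reviewCount instead of ratingCount
        ratingCount: Number(rating.ratingCount ?? rating.reviewCount ?? 0)
      };
    }
    // Descend into an @graph wrapper if one is present
    if (Array.isArray(node['@graph'])) queue.push(...node['@graph']);
  }
  return null;
}

const sample = {
  '@context': 'https://schema.org',
  '@graph': [
    { '@type': 'WebSite', name: 'Example' },
    {
      '@type': 'LocalBusiness',
      name: 'Company Name',
      aggregateRating: { ratingValue: '4.5', bestRating: '5', ratingCount: '12453' }
    }
  ]
};
console.log(findAggregateRating(sample));
```

Note that the values arrive as strings in the markup, so the helper converts them to numbers before returning.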
Extracting Business Profile Data
Let's start with the business profile — the summary data that appears at the top of every company's Trustpilot page.
Using Cheerio for Profile Extraction
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $ }) {
// Extract JSON-LD structured data
const jsonLdScripts = $('script[type="application/ld+json"]');
let businessData = {};
jsonLdScripts.each((i, el) => {
try {
const data = JSON.parse($(el).html());
if (data['@type'] === 'LocalBusiness' || data.aggregateRating) {
businessData = data;
}
} catch {}
});
// Extract from DOM for additional details
const profile = {
name: businessData.name ||
$('h1[data-business-unit-name]').text().trim(),
trustScore: parseFloat(
$('[data-rating-typography]').first().text() ||
businessData.aggregateRating?.ratingValue || '0'
),
totalReviews: parseInt(
$('[data-reviews-count-typography]').text()
.replace(/[^0-9]/g, '') ||
businessData.aggregateRating?.ratingCount || '0'
),
// Rating distribution
ratingDistribution: extractRatingDistribution($),
// Business metadata
category: $('[data-business-unit-category]').text().trim(),
website: businessData.url || $('a[data-business-url]').attr('href'),
location: businessData.address?.addressLocality || '',
claimed: $('[data-claimed-status]').length > 0,
// Response metrics
responseRate: $('[data-response-time]').text().trim(),
url: request.url,
scrapedAt: new Date().toISOString()
};
await Dataset.pushData(profile);
}
});
function extractRatingDistribution($) {
const distribution = {};
// Trustpilot shows rating bars with percentages
$('[data-rating-distribution] label, .rating-distribution__bar').each((i, el) => {
const stars = 5 - i; // bars go from 5 to 1
const percentage = $(el).find('[data-rating-distribution-percentage]').text().trim()
|| $(el).text().match(/(\d+)%/)?.[0]; // match[0] already includes the % sign; undefined if no match
if (percentage) distribution[`${stars}_star`] = percentage;
});
return distribution;
}
Extracting Profile Data via Next.js Props
Trustpilot is built with Next.js, which means page data is often available in the __NEXT_DATA__ script tag:
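// Helper to call from requestHandler. Crawlee ignores a handler's return value,
// so return the profile from a plain function and push it to the Dataset in the caller.
function extractProfileFromNextData($) {
// Try Next.js data first - most reliable source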
const nextDataScript = $('#__NEXT_DATA__').html();
if (nextDataScript) {
try {
const nextData = JSON.parse(nextDataScript);
const pageProps = nextData.props?.pageProps;
if (pageProps?.businessUnit) {
const bu = pageProps.businessUnit;
return {
name: bu.displayName,
trustScore: bu.trustScore,
totalReviews: bu.numberOfReviews,
stars: bu.stars,
category: bu.categories?.[0]?.displayName,
website: bu.websiteUrl,
identifyingName: bu.identifyingName,
profileImageUrl: bu.profileImageUrl
};
}
} catch {}
}
// Fall back to DOM parsing...
}
This __NEXT_DATA__ approach is powerful because it gives you pre-processed, structured data without worrying about CSS selectors changing between Trustpilot redesigns.
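When you only have the raw HTML string (for example from a cached response) and no Cheerio instance, the same data can be pulled out with a regular expression. This is a minimal sketch: `parseNextData` and `getPath` are hypothetical helpers, and the `props.pageProps.businessUnit` path is the one used in the snippet above, which may change without notice.

```javascript
// Extracts and parses the __NEXT_DATA__ JSON from a raw HTML string.
// Returns null if the tag is missing or the JSON is malformed.
function parseNextData(html) {
  const match = html.match(
    /<script[^>]*id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/
  );
  if (!match) return null;
  try {
    return JSON.parse(match[1]);
  } catch {
    return null;
  }
}

// Safely reads a nested path such as 'props.pageProps.businessUnit'
function getPath(obj, path) {
  return path
    .split('.')
    .reduce((acc, key) => (acc == null ? undefined : acc[key]), obj);
}

const html = `<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props":{"pageProps":{"businessUnit":{"displayName":"Example Co","trustScore":4.5}}}}
</script></body></html>`;

const nextData = parseNextData(html);
console.log(getPath(nextData, 'props.pageProps.businessUnit.displayName'));
```

Reading through `getPath` means a renamed or missing key yields `undefined` instead of throwing, which makes structure changes easier to detect and log.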
Extracting Review Data
Reviews are the core value of Trustpilot scraping. Each review contains multiple data points worth capturing.
Individual Review Extraction
function extractReviews($) {
const reviews = [];
// Trustpilot wraps each review in an article tag
$('article[data-service-review-card-paper]').each((i, el) => {
const $review = $(el);
// Star rating - Trustpilot uses data attributes
const ratingEl = $review.find('[data-service-review-rating]');
const rating = parseInt(ratingEl.attr('data-service-review-rating'), 10) ||
parseInt($review.find('.star-rating').attr('data-rating'), 10) || // keep the fallback numeric too
$review.find('img[alt*="star"]').length;
// Review dates
const dateEl = $review.find('time[datetime]');
const reviewDate = dateEl.attr('datetime') || dateEl.text().trim();
// Experience date (when the transaction happened)
const experienceDateText = $review.find('[data-service-review-date-of-experience-typography]')
.text().trim();
// Review content
const title = $review.find('[data-service-review-title-typography]')
.text().trim() || $review.find('h2').text().trim();
const body = $review.find('[data-service-review-text-typography]')
.text().trim() || $review.find('.review-content__text').text().trim();
// Author details
const authorName = $review.find('[data-consumer-name-typography]')
.text().trim();
const authorLocation = $review.find('[data-consumer-country-typography]')
.text().trim();
const reviewCount = $review.find('[data-consumer-reviews-count-typography]')
.text().trim();
// Verification status
const verified = $review.find('[data-review-verification-label]').length > 0
|| $review.text().includes('Verified');
// Company reply
const reply = $review.find('[data-service-review-business-reply-text-typography]')
.text().trim();
const replyDate = $review.find('[data-service-review-business-reply-date]')
.attr('datetime') || '';
reviews.push({
rating,
title,
body,
reviewDate,
experienceDate: experienceDateText,
author: {
name: authorName,
location: authorLocation,
totalReviews: reviewCount
},
verified,
companyReply: reply || null,
companyReplyDate: replyDate || null,
useful: parseInt(
$review.find('[data-service-review-useful-count]').text() || '0'
)
});
});
return reviews;
}
Using Next.js Data for Reviews
Again, the __NEXT_DATA__ approach often yields cleaner results:
function extractReviewsFromNextData(nextData) {
const reviews = nextData.props?.pageProps?.reviews || [];
return reviews.map(review => ({
id: review.id,
rating: review.rating,
title: review.title,
text: review.text,
language: review.language,
createdAt: review.dates?.publishedDate,
experiencedAt: review.dates?.experiencedDate,
updatedAt: review.dates?.updatedDate,
author: {
id: review.consumer?.id,
displayName: review.consumer?.displayName,
countryCode: review.consumer?.countryCode,
numberOfReviews: review.consumer?.numberOfReviews
},
verified: review.labels?.verification?.isVerified || false,
verificationSource: review.labels?.verification?.verificationSource,
companyReply: review.reply ? {
text: review.reply.message,
publishedDate: review.reply.publishedDate,
updatedDate: review.reply.updatedDate
} : null,
likes: review.likes || 0,
report: review.report || null
}));
}
Handling Pagination
Trustpilot limits reviews to 20 per page and caps visible pages at around 50 (1,000 reviews). For businesses with tens of thousands of reviews, you need strategies to access the full dataset.
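The arithmetic behind that cap is worth making explicit before queuing requests. The sketch below uses the figures quoted above (20 reviews per page, roughly 50 visible pages) as constants; `planPagination` is a hypothetical helper, and the cap is approximate, so treat the result as an estimate.

```javascript
const REVIEWS_PER_PAGE = 20;  // page size stated above
const MAX_VISIBLE_PAGES = 50; // approximate cap stated above

// Estimates how many pages a business needs and how many are reachable
// through plain ?page=N pagination.
function planPagination(totalReviews) {
  const pagesNeeded = Math.ceil(totalReviews / REVIEWS_PER_PAGE);
  const pagesReachable = Math.min(pagesNeeded, MAX_VISIBLE_PAGES);
  return {
    pagesNeeded,
    pagesReachable,
    reviewsReachable: Math.min(totalReviews, pagesReachable * REVIEWS_PER_PAGE),
    truncated: pagesNeeded > MAX_VISIBLE_PAGES
  };
}

console.log(planPagination(12453));
console.log(planPagination(130));
```

When `truncated` is true, plain pagination will miss reviews and one of the filtered strategies described below is needed.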
Basic Pagination
import { CheerioCrawler, Dataset } from 'crawlee';
async function scrapeAllReviews(businessDomain, maxPages = 50) {
const baseUrl = `https://www.trustpilot.com/review/${businessDomain}`;
const startUrls = [];
// Generate page URLs upfront
for (let page = 1; page <= maxPages; page++) {
startUrls.push(`${baseUrl}?page=${page}`);
}
const crawler = new CheerioCrawler({
maxConcurrency: 2, // Be gentle with Trustpilot
maxRequestsPerMinute: 15,
async requestHandler({ request, $ }) {
const reviews = extractReviews($);
// Skip pages past the last one (they contain no reviews); other queued pages still run
if (reviews.length === 0) return;
// Add page context to each review
const pageNum = new URL(request.url).searchParams.get('page') || '1';
for (const review of reviews) {
review.pageNumber = parseInt(pageNum);
review.businessDomain = businessDomain;
}
await Dataset.pushData(reviews);
}
});
await crawler.run(startUrls);
}
Star-Filtered Pagination for Large Datasets
To access more than 1,000 reviews, paginate through each star rating separately:
async function scrapeAllReviewsByStars(businessDomain) {
const baseUrl = `https://www.trustpilot.com/review/${businessDomain}`;
const allUrls = [];
// Each star filter has its own pagination
for (const stars of [1, 2, 3, 4, 5]) {
for (let page = 1; page <= 50; page++) {
allUrls.push(`${baseUrl}?stars=${stars}&page=${page}`);
}
}
// This gives you access to up to 5,000 reviews (5 x 1,000)
const crawler = new CheerioCrawler({
maxConcurrency: 2,
maxRequestsPerMinute: 12,
async requestHandler({ request, $ }) {
const reviews = extractReviews($);
if (reviews.length === 0) return;
const url = new URL(request.url);
for (const review of reviews) {
review.filterStars = url.searchParams.get('stars');
review.page = url.searchParams.get('page');
}
await Dataset.pushData(reviews);
},
async failedRequestHandler({ request }) {
console.log(`Failed: ${request.url}`);
}
});
await crawler.run(allUrls);
}
Language-Based Pagination
For international businesses, you can also paginate by language:
// Combine star filter + language for even more coverage
const languages = ['en', 'de', 'fr', 'es', 'it', 'nl', 'da', 'sv', 'nb'];
const urls = [];
for (const lang of languages) {
for (const stars of [1, 2, 3, 4, 5]) {
for (let page = 1; page <= 20; page++) {
urls.push(
`${baseUrl}?languages=${lang}&stars=${stars}&page=${page}`
);
}
}
}
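// With 20 pages per combination this reaches up to 18,000 reviews (9 x 5 x 20 pages x 20 reviews);
// raising the page limit to 50 would reach up to 45,000 (9 x 5 x 1,000)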
Sentiment Analysis on Extracted Data
Once you've extracted reviews, the real value comes from analysis. Here's how to add basic sentiment scoring to your pipeline:
// Simple keyword-based sentiment scoring
function analyzeSentiment(reviewText) {
const text = reviewText.toLowerCase();
const positiveWords = [
'excellent', 'amazing', 'fantastic', 'great', 'wonderful',
'outstanding', 'perfect', 'love', 'best', 'recommend',
'reliable', 'professional', 'helpful', 'quick', 'easy',
'friendly', 'efficient', 'impressed', 'satisfied', 'happy'
];
const negativeWords = [
'terrible', 'awful', 'horrible', 'worst', 'scam',
'fraud', 'avoid', 'never', 'poor', 'bad',
'disappointing', 'waste', 'rude', 'slow', 'broken',
'refund', 'complaint', 'problem', 'issue', 'unresponsive'
];
let positiveCount = 0;
let negativeCount = 0;
for (const word of positiveWords) {
if (text.includes(word)) positiveCount++;
}
for (const word of negativeWords) {
if (text.includes(word)) negativeCount++;
}
const total = positiveCount + negativeCount;
if (total === 0) return { score: 0, label: 'neutral' };
const score = (positiveCount - negativeCount) / total;
const label = score > 0.2 ? 'positive' : score < -0.2 ? 'negative' : 'mixed';
return {
score: Math.round(score * 100) / 100,
label,
positiveSignals: positiveCount,
negativeSignals: negativeCount
};
}
// Apply to extracted reviews
function enrichReviewsWithSentiment(reviews) {
return reviews.map(review => ({
...review,
sentiment: analyzeSentiment(review.body || ''),
titleSentiment: analyzeSentiment(review.title || '')
}));
}
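One caveat with substring matching: `includes` also fires inside longer words, so "badge" counts as "bad". A possible refinement, sketched below under the assumption that reviews are English text, is to tokenize first and match whole words. The helpers `tokenize` and `countMatches` are hypothetical names, not part of the pipeline above.

```javascript
// Splits text into lowercase word tokens (letters and apostrophes only)
function tokenize(text) {
  return text.toLowerCase().match(/[a-z']+/g) || [];
}

// Counts how many words from the list appear as whole tokens in the text
function countMatches(text, words) {
  const tokens = new Set(tokenize(text));
  return words.filter(word => tokens.has(word)).length;
}

const negativeWords = ['bad', 'slow', 'rude'];
const text = 'The driver wore a badge and was not slow at all.';

// Substring matching counts "bad" inside "badge"
const substringHits = negativeWords.filter(w =>
  text.toLowerCase().includes(w)
).length;
// Token matching only counts whole words
const tokenHits = countMatches(text, negativeWords);

console.log({ substringHits, tokenHits }); // { substringHits: 2, tokenHits: 1 }
```

The trade-off is that whole-word matching misses inflected forms such as "recommended", and neither variant handles negation ("not slow" still counts as a negative signal here), so treat both as rough heuristics.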
Aggregating Sentiment Across Reviews
function generateSentimentReport(reviews) {
const enriched = enrichReviewsWithSentiment(reviews);
const sentimentCounts = { positive: 0, negative: 0, mixed: 0, neutral: 0 };
enriched.forEach(r => sentimentCounts[r.sentiment.label]++);
// Identify trending topics in negative reviews
const negativeReviews = enriched.filter(r => r.sentiment.label === 'negative');
const topicFrequency = {};
const topics = ['delivery', 'refund', 'customer service', 'quality',
'price', 'shipping', 'communication', 'warranty'];
for (const review of negativeReviews) {
const text = `${review.body || ''} ${review.title || ''}`.toLowerCase(); // guard against missing fields
for (const topic of topics) {
if (text.includes(topic)) {
topicFrequency[topic] = (topicFrequency[topic] || 0) + 1;
}
}
}
return {
totalReviews: reviews.length,
sentimentBreakdown: sentimentCounts,
averageRating: reviews.reduce((s, r) => s + r.rating, 0) / reviews.length,
negativeTopics: Object.entries(topicFrequency)
.sort((a, b) => b[1] - a[1])
.map(([topic, count]) => ({ topic, count, percentage: Math.round(count / negativeReviews.length * 100) })),
responseRate: reviews.filter(r => r.companyReply).length / reviews.length * 100
};
}
Scaling with Apify
For production-grade Trustpilot scraping, Apify provides the infrastructure you need.
Why Use Apify for Trustpilot?
Trustpilot aggressively blocks scrapers. You need:
- Proxy rotation: Residential proxies to avoid IP bans
- Fingerprint randomization: Browser-like request patterns
- Retry logic: Automatic retries on blocked requests
- Scheduling: Daily/weekly monitoring of competitor reviews
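To make the retry requirement concrete, here is a minimal sketch of exponential backoff with full jitter, a common way to space out retries after a block. It is not how Apify or Crawlee implement retries internally (Crawlee has its own `maxRequestRetries` handling); `backoffDelayMs` and its default values are assumptions for illustration.

```javascript
// Returns a delay in milliseconds for the given retry attempt (0-based).
// "Full jitter": a random value between 0 and an exponentially growing ceiling.
// The random source is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt,
  { baseMs = 1000, capMs = 60000, random = Math.random } = {}
) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Example: delays for the first five attempts
for (let attempt = 0; attempt < 5; attempt++) {
  console.log(`attempt ${attempt}: wait ${backoffDelayMs(attempt)} ms`);
}
```

The jitter matters when many requests fail at once: without it, all of them would retry at the same instant and trigger the same block again.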
Using Apify Store Actors
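The Apify Store lists purpose-built Trustpilot scrapers that handle proxy rotation and retries for you. The actor ID and input fields below are placeholders; check the README of the actor you choose for its real ID and input schema: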
import { ApifyClient } from 'apify-client';
const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });
// Scrape reviews for multiple businesses
const run = await client.actor('trustpilot-review-scraper').call({
businessUrls: [
'https://www.trustpilot.com/review/example1.com',
'https://www.trustpilot.com/review/example2.com'
],
maxReviewsPerBusiness: 5000,
includeReplies: true,
sortBy: 'recency',
proxyConfiguration: {
useApifyProxy: true,
apifyProxyGroups: ['RESIDENTIAL']
}
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Extracted ${items.length} reviews`);
// Export in multiple formats
const csv = await client.dataset(run.defaultDatasetId).downloadItems('csv');
const jsonl = await client.dataset(run.defaultDatasetId).downloadItems('jsonl');
Building a Custom Trustpilot Actor
import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';
await Actor.init();
const input = await Actor.getInput();
const { domains, maxReviewsPerDomain = 1000 } = input;
const proxyConfig = await Actor.createProxyConfiguration({
groups: ['RESIDENTIAL']
});
const crawler = new CheerioCrawler({
proxyConfiguration: proxyConfig,
maxConcurrency: 2,
maxRequestsPerMinute: 10,
additionalMimeTypes: ['application/json'],
async requestHandler({ request, $, enqueueLinks }) {
const domain = request.userData.domain;
const pageNum = request.userData.page || 1;
// Extract reviews from current page
const reviews = extractReviews($);
for (const review of reviews) {
review.businessDomain = domain;
review.sentiment = analyzeSentiment(review.body || '');
}
await Dataset.pushData(reviews);
// Enqueue next page if reviews exist and under limit
if (reviews.length > 0 && pageNum < Math.ceil(maxReviewsPerDomain / 20)) {
const nextUrl = `https://www.trustpilot.com/review/${domain}?page=${pageNum + 1}`;
await enqueueLinks({
urls: [nextUrl],
userData: { domain, page: pageNum + 1 }
});
}
},
async failedRequestHandler({ request, error }) {
console.log(`Failed ${request.url}: ${error.message}`);
}
});
// Create start URLs from domains
const startUrls = domains.map(domain => ({
url: `https://www.trustpilot.com/review/${domain}`,
userData: { domain, page: 1 }
}));
await crawler.run(startUrls);
await Actor.exit();
Dealing with Anti-Scraping on Trustpilot
Trustpilot has invested heavily in bot detection. Here are proven strategies to handle their protections:
Request Headers
const headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'DNT': '1',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1'
};
Session Management
// Let the crawler manage its own session pool; a standalone SessionPool
// would not be used by the crawler, so configure it through sessionPoolOptions
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
useSessionPool: true,
persistCookiesPerSession: true,
sessionPoolOptions: {
maxPoolSize: 20,
sessionOptions: {
maxAgeSecs: 300, // 5 minute sessions
maxUsageCount: 10 // Max 10 requests per session
}
},
// ...
});
Handling CAPTCHAs
When Trustpilot serves a CAPTCHA, detect it, retire the session, and throw so that the crawler retries the request with a fresh session:
async requestHandler({ request, $, session }) {
// Check for CAPTCHA or block page
if ($('title').text().includes('Attention Required') ||
$('[data-captcha]').length > 0 ||
$.html().includes('cf-challenge')) {
// Mark session as blocked
session?.retire();
// Re-enqueue with different session
throw new Error('CAPTCHA detected - retiring session');
}
// Proceed with extraction...
}
Practical Use Cases
1. Competitive Brand Monitoring
Track how your competitors' ratings change over time by scheduling daily scrapes and comparing:
async function compareCompetitors(domains) {
const results = {};
for (const domain of domains) {
results[domain] = {
trustScore: await getTrustScore(domain),
recentSentiment: await getRecentReviewSentiment(domain, 30), // last 30 days
responseRate: await getResponseRate(domain),
topComplaints: await getTopComplaints(domain)
};
}
return results;
}
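The comparison step itself can be as simple as diffing two stored snapshots. The sketch below assumes each daily run saves an object keyed by domain with `trustScore` and `totalReviews` fields; that storage shape and the `diffSnapshots` name are hypothetical.

```javascript
// Compares two snapshots keyed by domain and reports the change per business.
// Domains that are new in the current snapshot get null deltas.
function diffSnapshots(previous, current) {
  return Object.keys(current).map(domain => {
    const before = previous[domain];
    const after = current[domain];
    return {
      domain,
      // Round to one decimal to avoid floating-point noise like -0.2000000002
      trustScoreChange: before
        ? Math.round((after.trustScore - before.trustScore) * 10) / 10
        : null,
      newReviews: before ? after.totalReviews - before.totalReviews : null
    };
  });
}

const yesterday = { 'example1.com': { trustScore: 4.5, totalReviews: 1200 } };
const today = {
  'example1.com': { trustScore: 4.3, totalReviews: 1230 },
  'example2.com': { trustScore: 3.9, totalReviews: 80 }
};

console.log(diffSnapshots(yesterday, today));
```

A sudden drop in `trustScoreChange` combined with a spike in `newReviews` is the kind of signal worth alerting on.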
2. Lead Generation
Identify businesses with poor ratings in your industry — they might need your product or service:
// Find businesses in a category with ratings below 3.0
async function findUnhappyCustomers(category) {
const url = `https://www.trustpilot.com/categories/${category}?sort=rating_asc`;
// Extract businesses with low ratings
// These businesses' customers are looking for alternatives
}
3. Product Development Insights
Analyze negative reviews to identify common pain points in your industry — then build products that solve those specific problems.
Best Practices for Trustpilot Scraping
- Start with __NEXT_DATA__: this is the most reliable and structured data source on Trustpilot pages.
- Use star-filtered pagination: this gives access to more reviews than the default 1,000-review limit per business.
- Respect rate limits: keep requests under 15/minute. Trustpilot will temporarily block aggressive scrapers.
- Rotate residential proxies: datacenter IPs are quickly detected and blocked by Trustpilot's bot protection.
- Monitor for page structure changes: Trustpilot updates their frontend regularly. Build alerts for extraction failures.
- Deduplicate reviews: when using star-filtered pagination, some reviews may appear in multiple filters. Use the review ID for deduplication.
- Comply with Trustpilot's Terms: review their terms of service and ensure your use case is legitimate.
- Cache aggressively: historical reviews rarely change. Only scrape new reviews after your initial full extraction.
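The deduplication and caching advice above can be sketched in a few lines. This assumes reviews carry the `id` and `createdAt` fields produced by the `__NEXT_DATA__` extraction earlier in this guide; `dedupeById` and `newerThan` are hypothetical helper names.

```javascript
// Keeps one review per ID. Later copies overwrite earlier ones,
// so the most recently scraped version wins.
function dedupeById(reviews) {
  const byId = new Map();
  for (const review of reviews) {
    byId.set(review.id, review);
  }
  return [...byId.values()];
}

// Returns only reviews published after the last one already stored
function newerThan(reviews, lastSeenIso) {
  const cutoff = Date.parse(lastSeenIso);
  return reviews.filter(review => Date.parse(review.createdAt) > cutoff);
}

const scraped = [
  { id: 'a1', rating: 5, createdAt: '2024-01-10T09:00:00.000Z' },
  { id: 'b2', rating: 1, createdAt: '2024-02-01T12:30:00.000Z' },
  { id: 'a1', rating: 5, createdAt: '2024-01-10T09:00:00.000Z' }
];

const unique = dedupeById(scraped);
const fresh = newerThan(unique, '2024-01-15T00:00:00.000Z');
console.log(unique.length, fresh.map(r => r.id));
```

Storing the newest `createdAt` after each run gives the next run its cutoff, so only the first few pages (sorted by recency) need to be fetched.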
Legal and Ethical Considerations
Trustpilot's terms of service restrict automated access. When scraping Trustpilot:
- Use data for legitimate purposes: Market research, competitive analysis, brand monitoring
- Don't republish reviews without proper attribution and compliance
- Respect GDPR: Reviewer names and locations are personal data in the EU
- Don't overwhelm their servers: Rate limit properly and use caching
- Consider their API: Trustpilot offers a paid API for businesses — evaluate whether it meets your needs before scraping
- Stay informed: Web scraping laws evolve rapidly — consult legal counsel for commercial use
Conclusion
Trustpilot's consistent page structure and embedded JSON-LD data make it a viable target for review extraction, but its aggressive bot detection requires a sophisticated approach. The combination of __NEXT_DATA__ parsing, star-filtered pagination for full coverage, sentiment analysis for actionable insights, and cloud infrastructure through the Apify Store for scale creates a robust pipeline.
Whether you're monitoring your own brand reputation, tracking competitor sentiment, or building a review aggregation product, the techniques in this guide give you a solid foundation. Start small with a single business domain, validate your extraction logic, then scale progressively using residential proxies and Apify's Actor infrastructure.
The key to successful Trustpilot scraping is patience: respect rate limits, rotate sessions, handle failures gracefully, and always enrich your raw data with sentiment and trend analysis to extract maximum value from every review you collect.