DEV Community

agenthustler

Trustpilot Scraping: Extract Business Reviews and Ratings at Scale

Trustpilot is one of the most widely used online review platforms, hosting over 300 million reviews for more than 1 million businesses. For market researchers, brand managers, competitive analysts, and data scientists, extracting Trustpilot review data at scale offers insight into customer sentiment, brand reputation, and industry trends.

In this guide, we'll break down Trustpilot's architecture, walk through extracting reviews, business profiles, and sentiment data, tackle pagination and anti-scraping measures, and show how to scale everything using Apify.


Understanding Trustpilot's Structure

Before writing a single line of code, understanding how Trustpilot organizes its data will save you hours of debugging.

URL Patterns

Trustpilot follows a clean, predictable URL structure:

  • Business profile: https://www.trustpilot.com/review/company-domain.com
  • Reviews page N: https://www.trustpilot.com/review/company-domain.com?page=2
  • Filtered reviews: https://www.trustpilot.com/review/company-domain.com?stars=5
  • Categories: https://www.trustpilot.com/categories/electronics
  • Search: https://www.trustpilot.com/search?query=company+name

The key insight: Trustpilot identifies businesses by their domain name, not by an internal ID. So trustpilot.com/review/amazon.com gives you Amazon's reviews. This makes it trivial to look up any business programmatically.
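
Because every URL is derived from the domain, a small builder function can generate any of the patterns above. The sketch below is one way to do it; the function names and the option names are illustrative choices made here, not part of any Trustpilot API, and the query parameters are simply the ones listed in this guide.

```javascript
// Hypothetical helpers that build the URL patterns listed above.
const TRUSTPILOT_BASE = 'https://www.trustpilot.com';

function buildReviewUrl(domain, { page, stars, languages } = {}) {
    // Accept a bare domain or a full website URL and normalise it.
    const cleanDomain = String(domain)
        .trim()
        .toLowerCase()
        .replace(/^https?:\/\//, '')
        .replace(/\/.*$/, '');

    const url = new URL(`/review/${encodeURIComponent(cleanDomain)}`, TRUSTPILOT_BASE);
    // Only add the query parameters that were actually provided.
    if (stars) url.searchParams.set('stars', String(stars));
    if (languages) url.searchParams.set('languages', languages);
    if (page && page > 1) url.searchParams.set('page', String(page));
    return url.toString();
}

function buildSearchUrl(query) {
    const url = new URL('/search', TRUSTPILOT_BASE);
    url.searchParams.set('query', query);
    return url.toString();
}

console.log(buildReviewUrl('https://Amazon.com/', { stars: 5, page: 2 }));
console.log(buildSearchUrl('company name'));
```

Normalising the input up front means callers can pass whatever they have on hand (a website URL from a CRM export, for example) without building malformed profile URLs.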

Page Structure

Every Trustpilot business profile page contains several data-rich sections:

  1. Business header: Company name, overall rating, total review count, TrustScore, claimed/unclaimed status, response rate
  2. Review cards: Individual reviews with star rating, title, body text, author, date, verification status, company reply
  3. Rating distribution: Breakdown by star count (e.g., 65% 5-star, 15% 4-star...)
  4. Business details: Category, location, website, contact info

Embedded Structured Data

Trustpilot embeds JSON-LD structured data in its pages, which makes it a good first extraction target for the aggregate rating:

// Trustpilot pages contain JSON-LD with aggregate rating
{
  "@type": "LocalBusiness",
  "name": "Company Name",
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.5",
    "bestRating": "5",
    "ratingCount": "12453"
  }
}

This structured data tends to be more stable than DOM selectors, because search engines depend on it and Trustpilot has an SEO incentive to keep it intact.
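
If you only have the raw HTML (for example from a cached response), the JSON-LD can be pulled out without a DOM library. The following is a minimal sketch under a few assumptions: the function names are made up for this example, and the sample markup is invented to show that a page may ship a single object, an array, or an `@graph` wrapper. Which of those Trustpilot uses at any given time is not something this guide can promise.

```javascript
// Sketch: collect JSON-LD nodes from raw HTML and pick the one with an aggregateRating.
function extractJsonLdNodes(html) {
    const nodes = [];
    const scriptPattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
    for (const match of html.matchAll(scriptPattern)) {
        let parsed;
        try {
            parsed = JSON.parse(match[1]);
        } catch {
            continue; // Skip malformed blocks instead of failing the whole page.
        }
        // A block may hold one object, an array of objects, or an @graph wrapper.
        const candidates = Array.isArray(parsed) ? parsed : [parsed];
        for (const candidate of candidates) {
            if (candidate && Array.isArray(candidate['@graph'])) {
                nodes.push(...candidate['@graph']);
            } else if (candidate) {
                nodes.push(candidate);
            }
        }
    }
    return nodes;
}

function findAggregateRating(html) {
    const node = extractJsonLdNodes(html).find(n => n && n.aggregateRating);
    if (!node) return null;
    const rating = node.aggregateRating;
    return {
        name: node.name ?? null,
        ratingValue: Number(rating.ratingValue),
        ratingCount: Number(rating.ratingCount ?? rating.reviewCount)
    };
}

// Invented sample markup, shaped like the JSON-LD shown above.
const sampleHtml = `
<script type="application/ld+json">
{"@context":"https://schema.org","@graph":[
  {"@type":"WebSite","name":"Example"},
  {"@type":"LocalBusiness","name":"Company Name",
   "aggregateRating":{"@type":"AggregateRating","ratingValue":"4.5","bestRating":"5","ratingCount":"12453"}}
]}
</script>`;

console.log(findAggregateRating(sampleHtml));
```

Note that the values arrive as strings in the markup, so the sketch converts them to numbers before returning them.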


Extracting Business Profile Data

Let's start with the business profile — the summary data that appears at the top of every company's Trustpilot page.

Using Cheerio for Profile Extraction

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Extract JSON-LD structured data
        const jsonLdScripts = $('script[type="application/ld+json"]');
        let businessData = {};

        jsonLdScripts.each((i, el) => {
            try {
                const data = JSON.parse($(el).html());
                if (data['@type'] === 'LocalBusiness' || data.aggregateRating) {
                    businessData = data;
                }
            } catch {}
        });

        // Extract from DOM for additional details
        const profile = {
            name: businessData.name ||
                $('h1[data-business-unit-name]').text().trim(),

            trustScore: parseFloat(
                $('[data-rating-typography]').first().text() ||
                businessData.aggregateRating?.ratingValue || '0'
            ),

            totalReviews: parseInt(
                $('[data-reviews-count-typography]').text()
                    .replace(/[^0-9]/g, '') ||
                businessData.aggregateRating?.ratingCount || '0',
                10 // always parse as base 10
            ),

            // Rating distribution
            ratingDistribution: extractRatingDistribution($),

            // Business metadata
            category: $('[data-business-unit-category]').text().trim(),
            website: businessData.url || $('a[data-business-url]').attr('href'),
            location: businessData.address?.addressLocality || '',
            claimed: $('[data-claimed-status]').length > 0,

            // Response metrics
            responseRate: $('[data-response-time]').text().trim(),

            url: request.url,
            scrapedAt: new Date().toISOString()
        };

        await Dataset.pushData(profile);
    }
});

function extractRatingDistribution($) {
    const distribution = {};
    // Trustpilot shows rating bars with percentages
    $('[data-rating-distribution] label, .rating-distribution__bar').each((i, el) => {
        const stars = 5 - i; // bars go from 5 to 1
        const explicit = $(el).find('[data-rating-distribution-percentage]').text().trim();
        // Fall back to the first "NN%" in the bar's text; skip the bar if neither exists
        // (concatenating '%' onto a failed match would produce the string "undefined%")
        const percentage = explicit || $(el).text().match(/\d+%/)?.[0];
        if (percentage) distribution[`${stars}_star`] = percentage;
    });
    return distribution;
}

Extracting Profile Data via Next.js Props

Trustpilot is built with Next.js, which means page data is often available in the __NEXT_DATA__ script tag:

// Returns the business profile from __NEXT_DATA__, or null if it is unavailable.
// Call this from requestHandler and fall back to DOM parsing when it returns null.
function extractProfileFromNextData($) {
    // Try Next.js data first - most reliable source
    const nextDataScript = $('#__NEXT_DATA__').html();
    if (!nextDataScript) return null;

    try {
        const nextData = JSON.parse(nextDataScript);
        const bu = nextData.props?.pageProps?.businessUnit;
        if (!bu) return null;

        return {
            name: bu.displayName,
            trustScore: bu.trustScore,
            totalReviews: bu.numberOfReviews,
            stars: bu.stars,
            category: bu.categories?.[0]?.displayName,
            website: bu.websiteUrl,
            identifyingName: bu.identifyingName,
            profileImageUrl: bu.profileImageUrl
        };
    } catch {
        // Malformed JSON: let the caller fall back to DOM parsing
        return null;
    }
}

This __NEXT_DATA__ approach is powerful because it gives you pre-processed, structured data without depending on CSS selectors that change between Trustpilot redesigns. The shape of pageProps is undocumented, though, so keep the DOM parser as a fallback.


Extracting Review Data

Reviews are the core value of Trustpilot scraping. Each review contains multiple data points worth capturing.

Individual Review Extraction

function extractReviews($) {
    const reviews = [];

    // Trustpilot wraps each review in an article tag
    $('article[data-service-review-card-paper]').each((i, el) => {
        const $review = $(el);

        // Star rating - Trustpilot uses data attributes
        const ratingEl = $review.find('[data-service-review-rating]');
        const rating = parseInt(ratingEl.attr('data-service-review-rating'), 10) ||
            parseInt($review.find('.star-rating').attr('data-rating'), 10) ||
            $review.find('img[alt*="star"]').length; // every branch yields a number

        // Review dates
        const dateEl = $review.find('time[datetime]');
        const reviewDate = dateEl.attr('datetime') || dateEl.text().trim();

        // Experience date (when the transaction happened)
        const experienceDateText = $review.find('[data-service-review-date-of-experience-typography]')
            .text().trim();

        // Review content
        const title = $review.find('[data-service-review-title-typography]')
            .text().trim() || $review.find('h2').text().trim();
        const body = $review.find('[data-service-review-text-typography]')
            .text().trim() || $review.find('.review-content__text').text().trim();

        // Author details
        const authorName = $review.find('[data-consumer-name-typography]')
            .text().trim();
        const authorLocation = $review.find('[data-consumer-country-typography]')
            .text().trim();
        const reviewCount = $review.find('[data-consumer-reviews-count-typography]')
            .text().trim();

        // Verification status
        const verified = $review.find('[data-review-verification-label]').length > 0
            || $review.text().includes('Verified');

        // Company reply
        const reply = $review.find('[data-service-review-business-reply-text-typography]')
            .text().trim();
        const replyDate = $review.find('[data-service-review-business-reply-date]')
            .attr('datetime') || '';

        reviews.push({
            rating,
            title,
            body,
            reviewDate,
            experienceDate: experienceDateText,
            author: {
                name: authorName,
                location: authorLocation,
                totalReviews: reviewCount
            },
            verified,
            companyReply: reply || null,
            companyReplyDate: replyDate || null,
            // Falls back to 0 when the counter is missing or not numeric
            useful: parseInt(
                $review.find('[data-service-review-useful-count]').text(), 10
            ) || 0
        });
    });

    return reviews;
}

Using Next.js Data for Reviews

Again, the __NEXT_DATA__ approach often yields cleaner results:

function extractReviewsFromNextData(nextData) {
    const reviews = nextData.props?.pageProps?.reviews || [];

    return reviews.map(review => ({
        id: review.id,
        rating: review.rating,
        title: review.title,
        text: review.text,
        language: review.language,
        createdAt: review.dates?.publishedDate,
        experiencedAt: review.dates?.experiencedDate,
        updatedAt: review.dates?.updatedDate,
        author: {
            id: review.consumer?.id,
            displayName: review.consumer?.displayName,
            countryCode: review.consumer?.countryCode,
            numberOfReviews: review.consumer?.numberOfReviews
        },
        verified: review.labels?.verification?.isVerified || false,
        verificationSource: review.labels?.verification?.verificationSource,
        companyReply: review.reply ? {
            text: review.reply.message,
            publishedDate: review.reply.publishedDate,
            updatedDate: review.reply.updatedDate
        } : null,
        likes: review.likes || 0,
        report: review.report || null
    }));
}
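
The DOM extractor and the __NEXT_DATA__ mapper produce slightly different shapes (`body` versus `text`, a string reply versus a reply object), which matters once both feed the same downstream code. A small normaliser can smooth that over. This is a sketch: `normalizeReview` and its output field names are choices made for this example, not something either extractor defines.

```javascript
// Sketch: map both review shapes used in this guide onto one schema.
function normalizeReview(raw) {
    const reply = raw.companyReply;
    return {
        id: raw.id ?? null,
        rating: Number(raw.rating) || null,
        title: raw.title ?? '',
        // DOM extraction uses `body`; the __NEXT_DATA__ mapper uses `text`.
        body: raw.body ?? raw.text ?? '',
        publishedAt: raw.reviewDate ?? raw.createdAt ?? null,
        authorName: raw.author?.name ?? raw.author?.displayName ?? null,
        verified: Boolean(raw.verified),
        // The reply is a plain string in one shape and an object in the other.
        replyText: typeof reply === 'string' ? reply : reply?.text ?? null
    };
}

// Invented sample records, one per shape.
const fromDom = {
    rating: 5, title: 'Great', body: 'Fast delivery',
    reviewDate: '2024-01-02T00:00:00.000Z',
    author: { name: 'Ann' }, verified: true, companyReply: 'Thanks!'
};
const fromNextData = {
    id: 'abc', rating: 1, title: 'Late', text: 'Parcel arrived late',
    createdAt: '2024-01-03T00:00:00.000Z',
    author: { displayName: 'Bob' }, verified: false, companyReply: { text: 'Sorry' }
};

console.log(normalizeReview(fromDom));
console.log(normalizeReview(fromNextData));
```

Normalising early keeps later steps such as deduplication and sentiment scoring from needing to know which extractor produced a record.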

Handling Pagination

Trustpilot limits reviews to 20 per page and caps visible pages at around 50 (1,000 reviews). For businesses with tens of thousands of reviews, you need strategies to access the full dataset.
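
To see what those limits mean for a given business, it helps to do the arithmetic before generating any URLs. The sketch below takes the 20-per-page and roughly-50-page figures from the paragraph above as constants; `planPagination` is a name invented for this example.

```javascript
const REVIEWS_PER_PAGE = 20;  // per-page limit described above
const MAX_VISIBLE_PAGES = 50; // approximate page cap described above

function planPagination(totalReviews) {
    const pagesNeeded = Math.ceil(totalReviews / REVIEWS_PER_PAGE);
    const pagesToFetch = Math.min(pagesNeeded, MAX_VISIBLE_PAGES);
    return {
        pagesNeeded,
        pagesToFetch,
        // How many reviews plain ?page=N pagination can reach
        reachableReviews: Math.min(totalReviews, pagesToFetch * REVIEWS_PER_PAGE),
        // True when filters are required to see the rest
        needsFiltering: pagesNeeded > MAX_VISIBLE_PAGES
    };
}

console.log(planPagination(250));
console.log(planPagination(12453));
```

For a business with 12,453 reviews, plain pagination would need 623 pages but can only reach the first 1,000 reviews, which is the gap that filtered pagination is meant to close.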

Basic Pagination

import { CheerioCrawler, Dataset } from 'crawlee';

async function scrapeAllReviews(businessDomain, maxPages = 50) {
    const baseUrl = `https://www.trustpilot.com/review/${businessDomain}`;
    const startUrls = [];

    // Generate page URLs upfront
    for (let page = 1; page <= maxPages; page++) {
        startUrls.push(`${baseUrl}?page=${page}`);
    }

    const crawler = new CheerioCrawler({
        maxConcurrency: 2, // Be gentle with Trustpilot
        maxRequestsPerMinute: 15,

        async requestHandler({ request, $ }) {
            const reviews = extractReviews($);

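            // Skip empty pages (past the last page). The remaining pre-generated
            // URLs are still requested, so keep maxPages close to the real page count.
            if (reviews.length === 0) return;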

            // Add page context to each review
            const pageNum = new URL(request.url).searchParams.get('page') || '1';
            for (const review of reviews) {
                review.pageNumber = parseInt(pageNum);
                review.businessDomain = businessDomain;
            }

            await Dataset.pushData(reviews);
        }
    });

    await crawler.run(startUrls);
}

Star-Filtered Pagination for Large Datasets

To access more than 1,000 reviews, paginate through each star rating separately:

async function scrapeAllReviewsByStars(businessDomain) {
    const baseUrl = `https://www.trustpilot.com/review/${businessDomain}`;
    const allUrls = [];

    // Each star filter has its own pagination
    for (const stars of [1, 2, 3, 4, 5]) {
        for (let page = 1; page <= 50; page++) {
            allUrls.push(`${baseUrl}?stars=${stars}&page=${page}`);
        }
    }

    // This gives you access to up to 5,000 reviews (5 x 1,000)
    const crawler = new CheerioCrawler({
        maxConcurrency: 2,
        maxRequestsPerMinute: 12,

        async requestHandler({ request, $ }) {
            const reviews = extractReviews($);
            if (reviews.length === 0) return;

            const url = new URL(request.url);
            for (const review of reviews) {
                review.filterStars = url.searchParams.get('stars');
                review.page = url.searchParams.get('page');
            }

            await Dataset.pushData(reviews);
        },

        async failedRequestHandler({ request }) {
            console.log(`Failed: ${request.url}`);
        }
    });

    await crawler.run(allUrls);
}

Language-Based Pagination

For international businesses, you can also paginate by language:

// Combine star filter + language for even more coverage
const baseUrl = 'https://www.trustpilot.com/review/example1.com'; // placeholder domain
const languages = ['en', 'de', 'fr', 'es', 'it', 'nl', 'da', 'sv', 'nb'];
const maxPages = 20;
const urls = [];

for (const lang of languages) {
    for (const stars of [1, 2, 3, 4, 5]) {
        for (let page = 1; page <= maxPages; page++) {
            urls.push(
                `${baseUrl}?languages=${lang}&stars=${stars}&page=${page}`
            );
        }
    }
}
// With 20 pages per combination: up to 18,000 reviews (9 x 5 x 20 pages x 20 reviews).
// Raising maxPages to 50 would lift the ceiling to 45,000 (9 x 5 x 1,000).

Sentiment Analysis on Extracted Data

Once you've extracted reviews, the real value comes from analysis. Here's how to add basic sentiment scoring to your pipeline:

// Simple keyword-based sentiment scoring
function analyzeSentiment(reviewText) {
    const text = reviewText.toLowerCase();

    const positiveWords = [
        'excellent', 'amazing', 'fantastic', 'great', 'wonderful',
        'outstanding', 'perfect', 'love', 'best', 'recommend',
        'reliable', 'professional', 'helpful', 'quick', 'easy',
        'friendly', 'efficient', 'impressed', 'satisfied', 'happy'
    ];

    const negativeWords = [
        'terrible', 'awful', 'horrible', 'worst', 'scam',
        'fraud', 'avoid', 'never', 'poor', 'bad',
        'disappointing', 'waste', 'rude', 'slow', 'broken',
        'refund', 'complaint', 'problem', 'issue', 'unresponsive'
    ];

    let positiveCount = 0;
    let negativeCount = 0;

    for (const word of positiveWords) {
        if (text.includes(word)) positiveCount++;
    }
    for (const word of negativeWords) {
        if (text.includes(word)) negativeCount++;
    }

    const total = positiveCount + negativeCount;
    if (total === 0) return { score: 0, label: 'neutral' };

    const score = (positiveCount - negativeCount) / total;
    const label = score > 0.2 ? 'positive' : score < -0.2 ? 'negative' : 'mixed';

    return {
        score: Math.round(score * 100) / 100,
        label,
        positiveSignals: positiveCount,
        negativeSignals: negativeCount
    };
}

// Apply to extracted reviews
function enrichReviewsWithSentiment(reviews) {
    return reviews.map(review => ({
        ...review,
        // `body` comes from DOM extraction, `text` from the __NEXT_DATA__ mapper
        sentiment: analyzeSentiment(review.body || review.text || ''),
        titleSentiment: analyzeSentiment(review.title || '')
    }));
}
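
One caveat with `text.includes(word)` is that it matches substrings, so "bad" is counted inside "badge" and "love" inside "gloves". A word-boundary match is a cheap refinement. The helper below is a sketch with an invented name; it relies on the regex `\b` anchor, which only understands ASCII word characters, so non-English reviews would need a different tokeniser.

```javascript
// Sketch: count how many list words appear as whole words in the text.
function countWholeWordMatches(text, words) {
    const lowered = text.toLowerCase();
    let count = 0;
    for (const word of words) {
        // Escape regex metacharacters so each word is matched literally.
        const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
        if (new RegExp(`\\b${escaped}\\b`).test(lowered)) count++;
    }
    return count;
}

const sampleWords = ['bad', 'love'];
console.log(countWholeWordMatches('I got a badge and new gloves', sampleWords));
console.log(countWholeWordMatches('Bad service, but I love the price', sampleWords));
```

The trade-off is that whole-word matching no longer catches inflections such as "recommended" for "recommend", so you may want to list those forms explicitly or stem the text first.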

Aggregating Sentiment Across Reviews

function generateSentimentReport(reviews) {
    const enriched = enrichReviewsWithSentiment(reviews);

    const sentimentCounts = { positive: 0, negative: 0, mixed: 0, neutral: 0 };
    enriched.forEach(r => sentimentCounts[r.sentiment.label]++);

    // Identify trending topics in negative reviews
    const negativeReviews = enriched.filter(r => r.sentiment.label === 'negative');
    const topicFrequency = {};
    const topics = ['delivery', 'refund', 'customer service', 'quality',
        'price', 'shipping', 'communication', 'warranty'];

    for (const review of negativeReviews) {
        // Avoid the literal string "undefined" when a field is missing
        const text = `${review.body || review.text || ''} ${review.title || ''}`.toLowerCase();
        for (const topic of topics) {
            if (text.includes(topic)) {
                topicFrequency[topic] = (topicFrequency[topic] || 0) + 1;
            }
        }
    }

    const total = reviews.length;
    return {
        totalReviews: total,
        sentimentBreakdown: sentimentCounts,
        // Guard against division by zero when no reviews were extracted
        averageRating: total ? reviews.reduce((s, r) => s + r.rating, 0) / total : null,
        negativeTopics: Object.entries(topicFrequency)
            .sort((a, b) => b[1] - a[1])
            .map(([topic, count]) => ({ topic, count, percentage: Math.round(count / negativeReviews.length * 100) })),
        responseRate: total ? reviews.filter(r => r.companyReply).length / total * 100 : null
    };
}

Scaling with Apify

For production-grade Trustpilot scraping, Apify provides the infrastructure you need.

Why Use Apify for Trustpilot?

Trustpilot aggressively blocks scrapers. You need:

  • Proxy rotation: Residential proxies to avoid IP bans
  • Fingerprint randomization: Browser-like request patterns
  • Retry logic: Automatic retries on blocked requests
  • Scheduling: Daily/weekly monitoring of competitor reviews
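
Crawlee and Apify handle retries for you, but if you roll your own fetch loop it helps to see what a polite retry delay looks like. The following is a sketch of exponential backoff with full jitter; the function name, the 2-second base, and the 60-second cap are arbitrary values picked for this example, not figures published by Trustpilot.

```javascript
// Sketch: exponential backoff with "full jitter" for a hand-rolled retry loop.
function backoffDelayMs(attempt, { baseMs = 2000, capMs = 60000, random = Math.random } = {}) {
    // Exponential growth: base, 2x base, 4x base, ... capped at capMs.
    const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
    // Pick a random delay in [0, ceiling) so parallel retries do not line up.
    return Math.floor(random() * ceiling);
}

// The `random` option exists so the delay can be made deterministic in tests.
for (let attempt = 0; attempt < 5; attempt++) {
    console.log(`attempt ${attempt}: wait up to ${Math.min(60000, 2000 * 2 ** attempt)} ms,`,
        `picked ${backoffDelayMs(attempt)} ms`);
}
```

Spreading retries out this way avoids hammering the site at the exact moment it has started blocking you.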

Using Apify Store Actors

The Apify Store has purpose-built Trustpilot scrapers that handle most anti-bot measures for you:

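import { ApifyClient } from 'apify-client';

// Read the token from the environment rather than hard-coding it
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Scrape reviews for multiple businesses.
// 'trustpilot-review-scraper' is a placeholder: use the 'username/actor-name' ID of the
// Store Actor you pick, and match the input fields below to its input schema.
const run = await client.actor('trustpilot-review-scraper').call({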
    businessUrls: [
        'https://www.trustpilot.com/review/example1.com',
        'https://www.trustpilot.com/review/example2.com'
    ],
    maxReviewsPerBusiness: 5000,
    includeReplies: true,
    sortBy: 'recency',
    proxyConfiguration: {
        useApifyProxy: true,
        apifyProxyGroups: ['RESIDENTIAL']
    }
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Extracted ${items.length} reviews`);

// Export in multiple formats
const csv = await client.dataset(run.defaultDatasetId).downloadItems('csv');
const jsonl = await client.dataset(run.defaultDatasetId).downloadItems('jsonl');

Building a Custom Trustpilot Actor

import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

// getInput() resolves to null when the Actor is started without input
const input = (await Actor.getInput()) ?? {};
const { domains = [], maxReviewsPerDomain = 1000 } = input;

const proxyConfig = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL']
});

const crawler = new CheerioCrawler({
    proxyConfiguration: proxyConfig,
    maxConcurrency: 2,
    maxRequestsPerMinute: 10,
    additionalMimeTypes: ['application/json'],

    async requestHandler({ request, $, enqueueLinks }) {
        const domain = request.userData.domain;
        const pageNum = request.userData.page || 1;

        // Extract reviews from current page
        const reviews = extractReviews($);

        for (const review of reviews) {
            review.businessDomain = domain;
            review.sentiment = analyzeSentiment(review.body || '');
        }

        await Dataset.pushData(reviews);

        // Enqueue next page if reviews exist and under limit
        if (reviews.length > 0 && pageNum < Math.ceil(maxReviewsPerDomain / 20)) {
            const nextUrl = `https://www.trustpilot.com/review/${domain}?page=${pageNum + 1}`;
            await enqueueLinks({
                urls: [nextUrl],
                userData: { domain, page: pageNum + 1 }
            });
        }
    },

    async failedRequestHandler({ request, error }) {
        console.log(`Failed ${request.url}: ${error.message}`);
    }
});

// Create start URLs from domains
const startUrls = domains.map(domain => ({
    url: `https://www.trustpilot.com/review/${domain}`,
    userData: { domain, page: 1 }
}));

await crawler.run(startUrls);
await Actor.exit();

Dealing with Anti-Scraping on Trustpilot

Trustpilot has invested heavily in bot detection. Here are proven strategies to handle their protections:

Request Headers

const headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1'
};

Session Management

// Let the crawler manage sessions through Crawlee's built-in session pool.
// A separately opened SessionPool would not be used by the crawler, so the
// session limits are passed through sessionPoolOptions instead.
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 20,
        sessionOptions: {
            maxAgeSecs: 300, // 5 minute sessions
            maxUsageCount: 10 // Max 10 requests per session
        }
    },
    // ...
});

Handling CAPTCHAs

When Trustpilot serves a CAPTCHA, detect it, retire the session, and let the crawler retry the request:

async requestHandler({ request, $, session }) {
    // Check for CAPTCHA or block page
    if ($('title').text().includes('Attention Required') ||
        $('[data-captcha]').length > 0 ||
        $.html().includes('cf-challenge')) {

        // Mark session as blocked
        session?.retire();

        // Throwing marks the request as failed, so the crawler retries it with another session
        throw new Error('CAPTCHA detected - retiring session');
    }

    // Proceed with extraction...
}

Practical Use Cases

1. Competitive Brand Monitoring

Track how your competitors' ratings change over time by scheduling daily scrapes and comparing:

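// getTrustScore, getRecentReviewSentiment, getResponseRate and getTopComplaints are
// placeholders for helpers you would build on the extractors shown earlier
async function compareCompetitors(domains) {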
    const results = {};
    for (const domain of domains) {
        results[domain] = {
            trustScore: await getTrustScore(domain),
            recentSentiment: await getRecentReviewSentiment(domain, 30), // last 30 days
            responseRate: await getResponseRate(domain),
            topComplaints: await getTopComplaints(domain)
        };
    }
    return results;
}
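
The "comparing" half of that workflow comes down to diffing two snapshots keyed by domain. Here is a minimal sketch, assuming each daily snapshot stores `trustScore` and `totalReviews` per domain; the function name, the snapshot shape, and the sample numbers are all invented for illustration.

```javascript
// Sketch: compare yesterday's and today's snapshots, keyed by domain.
function diffSnapshots(previous, current) {
    const changes = [];
    for (const [domain, now] of Object.entries(current)) {
        const before = previous[domain];
        if (!before) {
            changes.push({ domain, status: 'new', trustScore: now.trustScore });
            continue;
        }
        // Round to one decimal to hide floating-point noise (4.3 - 4.5 is not exactly -0.2).
        const delta = Math.round((now.trustScore - before.trustScore) * 10) / 10;
        changes.push({
            domain,
            status: delta > 0 ? 'up' : delta < 0 ? 'down' : 'flat',
            trustScore: now.trustScore,
            delta,
            newReviews: now.totalReviews - before.totalReviews
        });
    }
    // Largest movements first; entries without a delta sort last.
    return changes.sort((a, b) => Math.abs(b.delta ?? 0) - Math.abs(a.delta ?? 0));
}

const yesterday = {
    'example1.com': { trustScore: 4.5, totalReviews: 100 },
    'example2.com': { trustScore: 3.9, totalReviews: 50 }
};
const today = {
    'example1.com': { trustScore: 4.3, totalReviews: 120 },
    'example2.com': { trustScore: 3.9, totalReviews: 52 },
    'example3.com': { trustScore: 4.0, totalReviews: 10 }
};

console.log(diffSnapshots(yesterday, today));
```

Storing one snapshot per scheduled run and diffing consecutive pairs gives you a change log without re-scraping historical data.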

2. Lead Generation

Identify businesses with poor ratings in your industry — they might need your product or service:

// Find businesses in a category with ratings below 3.0
async function findUnhappyCustomers(category) {
    const url = `https://www.trustpilot.com/categories/${category}?sort=rating_asc`;
    // Extract businesses with low ratings
    // These businesses' customers are looking for alternatives
}

3. Product Development Insights

Analyze negative reviews to identify common pain points in your industry — then build products that solve those specific problems.


Best Practices for Trustpilot Scraping

  1. Start with __NEXT_DATA__ — this is the most reliable and structured data source on Trustpilot pages.

  2. Use star-filtered pagination to access more reviews than the default 1,000-review limit per business.

  3. Respect rate limits — keep requests under 15/minute. Trustpilot will temporarily block aggressive scrapers.

  4. Rotate residential proxies — datacenter IPs are quickly detected and blocked by Trustpilot's bot protection.

  5. Monitor for page structure changes — Trustpilot updates their frontend regularly. Build alerts for extraction failures.

  6. Deduplicate reviews — when using star-filtered pagination, some reviews may appear in multiple filters. Use the review ID for deduplication.

  7. Comply with Trustpilot's Terms — review their terms of service and ensure your use case is legitimate.

  8. Cache aggressively — historical reviews rarely change. Only scrape new reviews after your initial full extraction.
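
Practices 6 and 8 are easy to sketch in code. The helpers below assume reviews carry an `id` (as the __NEXT_DATA__ mapper provides) and fall back to a composite key for DOM-extracted reviews that lack one; the function names and the fallback key are choices made for this example.

```javascript
// Sketch: drop duplicates collected through overlapping filters.
function dedupeReviews(reviews) {
    const seen = new Map();
    for (const review of reviews) {
        // Prefer the review ID; otherwise build a best-effort key from stable fields.
        const key = review.id ??
            `${review.author?.name ?? ''}|${review.reviewDate ?? ''}|${review.title ?? ''}`;
        if (!seen.has(key)) seen.set(key, review); // keep the first copy
    }
    return [...seen.values()];
}

// Sketch: keep only reviews published after the last successful run.
// Reviews without a parseable date are dropped, because NaN comparisons are false.
function onlyNewerThan(reviews, lastSeenIso) {
    const cutoff = Date.parse(lastSeenIso);
    return reviews.filter(r => Date.parse(r.createdAt ?? r.reviewDate) > cutoff);
}

// Invented sample data.
const collected = [
    { id: 'r1', createdAt: '2024-03-01T10:00:00.000Z', title: 'Great' },
    { id: 'r2', createdAt: '2024-03-05T10:00:00.000Z', title: 'Slow' },
    { id: 'r1', createdAt: '2024-03-01T10:00:00.000Z', title: 'Great' }
];

const unique = dedupeReviews(collected);
console.log(unique.length);
console.log(onlyNewerThan(unique, '2024-03-02T00:00:00.000Z').map(r => r.id));
```

Persisting the newest `createdAt` value after each run gives the next run its cutoff, so only the first pages sorted by recency need to be fetched again.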


Legal and Ethical Considerations

Trustpilot's terms of service restrict automated access. When scraping Trustpilot:

  • Use data for legitimate purposes: Market research, competitive analysis, brand monitoring
  • Don't republish reviews without proper attribution and compliance
  • Respect GDPR: Reviewer names and locations are personal data in the EU
  • Don't overwhelm their servers: Rate limit properly and use caching
  • Consider their API: Trustpilot offers a paid API for businesses — evaluate whether it meets your needs before scraping
  • Stay informed: Web scraping laws evolve rapidly — consult legal counsel for commercial use

Conclusion

Trustpilot's consistent page structure and embedded JSON-LD data make it a viable target for review extraction, but its aggressive bot detection requires a sophisticated approach. The combination of __NEXT_DATA__ parsing, star-filtered pagination for full coverage, sentiment analysis for actionable insights, and cloud infrastructure through the Apify Store for scale creates a robust pipeline.

Whether you're monitoring your own brand reputation, tracking competitor sentiment, or building a review aggregation product, the techniques in this guide give you a solid foundation. Start small with a single business domain, validate your extraction logic, then scale progressively using residential proxies and Apify's Actor infrastructure.

The key to successful Trustpilot scraping is patience: respect rate limits, rotate sessions, handle failures gracefully, and always enrich your raw data with sentiment and trend analysis to extract maximum value from every review you collect.
