agenthustler

Posted on Apr 9 • Edited on Apr 19

Trustpilot Scraping: Extract Business Reviews and Ratings at Scale

#webdev #javascript #programming #webscraping

Trustpilot is one of the most widely used online review platforms, hosting over 300 million reviews for more than 1 million businesses. For market researchers, brand managers, competitive analysts, and data scientists, extracting Trustpilot review data at scale offers insight into customer sentiment, brand reputation, and industry trends.

In this guide, we'll break down Trustpilot's architecture, walk through extracting reviews, business profiles, and sentiment data, tackle pagination and anti-scraping measures, and show how to scale everything using Apify.

Understanding Trustpilot's Structure

Before writing a single line of code, understanding how Trustpilot organizes its data will save you hours of debugging.

URL Patterns

Trustpilot follows a clean, predictable URL structure:

Business profile: https://www.trustpilot.com/review/company-domain.com
Reviews page N: https://www.trustpilot.com/review/company-domain.com?page=2
Filtered reviews: https://www.trustpilot.com/review/company-domain.com?stars=5
Categories: https://www.trustpilot.com/categories/electronics
Search: https://www.trustpilot.com/search?query=company+name

The key insight: Trustpilot identifies businesses by their domain name, not by an internal ID. So trustpilot.com/review/amazon.com gives you Amazon's reviews. This makes it trivial to look up any business programmatically.
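Because the URL is derived from the domain, building request URLs can be reduced to a small helper. The sketch below is illustrative: `buildReviewUrl` is a hypothetical name, and the query parameters (`page`, `stars`, `languages`) are simply the ones used in this guide.

```javascript
// Hypothetical helper: builds a Trustpilot review URL from a business domain.
// Assumes the query parameter names used in this guide.
function buildReviewUrl(domain, { page, stars, languages } = {}) {
    const url = new URL(`https://www.trustpilot.com/review/${encodeURIComponent(domain)}`);
    if (languages) url.searchParams.set('languages', languages);
    if (stars) url.searchParams.set('stars', String(stars));
    // Page 1 is the default, so only add the parameter for later pages
    if (page && page > 1) url.searchParams.set('page', String(page));
    return url.toString();
}

console.log(buildReviewUrl('amazon.com'));
// → https://www.trustpilot.com/review/amazon.com
console.log(buildReviewUrl('amazon.com', { stars: 5, page: 2 }));
// → https://www.trustpilot.com/review/amazon.com?stars=5&page=2
```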

Page Structure

Every Trustpilot business profile page contains several data-rich sections:

Business header: Company name, overall rating, total review count, TrustScore, claimed/unclaimed status, response rate
Review cards: Individual reviews with star rating, title, body text, author, date, verification status, company reply
Rating distribution: Breakdown by star count (e.g., 65% 5-star, 15% 4-star...)
Business details: Category, location, website, contact info

Embedded Structured Data

Trustpilot embeds JSON-LD structured data in its business pages, which makes it a good first extraction target for aggregate ratings:

// Trustpilot pages contain JSON-LD with aggregate rating
{
  "@type": "LocalBusiness",
  "name": "Company Name",
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.5",
    "bestRating": "5",
    "ratingCount": "12453"
  }
}

This structured data is more reliable than DOM selectors because Trustpilot maintains it for SEO purposes.

Extracting Business Profile Data

Let's start with the business profile — the summary data that appears at the top of every company's Trustpilot page.

Using Cheerio for Profile Extraction

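As a rough idea of what profile extraction with Cheerio could look like, here is a minimal sketch that reads the JSON-LD block described earlier. `extractProfile` is a hypothetical name, `$` is assumed to be a Cheerio root (from `cheerio.load(html)` or a Crawlee handler), and the `h1` fallback for the company name is an assumption about the page markup.

```javascript
// Hypothetical helper: pulls name and aggregate rating out of JSON-LD blocks.
// `$` is assumed to be a Cheerio root object.
function extractProfile($) {
    const profile = { name: null, ratingValue: null, reviewCount: null };

    $('script[type="application/ld+json"]').each((_, el) => {
        let data;
        try {
            data = JSON.parse($(el).html() || '');
        } catch {
            return; // skip malformed blocks and continue with the next one
        }
        if (!data || typeof data !== 'object') return;

        // JSON-LD may be a single node, an array of nodes, or wrapped in @graph
        const nodes = Array.isArray(data) ? data : (data['@graph'] || [data]);
        for (const node of nodes) {
            const agg = node && node.aggregateRating;
            if (!agg) continue;
            profile.name = node.name ?? profile.name;
            profile.ratingValue = Number(agg.ratingValue);
            profile.reviewCount = Number(agg.reviewCount ?? agg.ratingCount);
        }
    });

    // Assumed fallback: the page heading usually carries the company name
    if (!profile.name) profile.name = $('h1').first().text().trim() || null;
    return profile;
}
```

Keeping the function dependent only on `$` means it can be called from any Cheerio-based crawler and exercised against saved HTML fixtures.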

Extracting Profile Data via Next.js Props

Trustpilot is built with Next.js, which means page data is often available in the __NEXT_DATA__ script tag:

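import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request, log }) {
        // Try Next.js data first - most reliable source
        const nextDataScript = $('#__NEXT_DATA__').html();
        let bu = null;

        if (nextDataScript) {
            try {
                bu = JSON.parse(nextDataScript).props?.pageProps?.businessUnit ?? null;
            } catch (err) {
                // Malformed JSON: log it and fall through to DOM parsing
                log.warning(`Could not parse __NEXT_DATA__ on ${request.url}: ${err.message}`);
            }
        }

        if (bu) {
            // A value returned from requestHandler is discarded, so store the result explicitly
            await Dataset.pushData({
                url: request.url,
                name: bu.displayName,
                trustScore: bu.trustScore,
                totalReviews: bu.numberOfReviews,
                stars: bu.stars,
                category: bu.categories?.[0]?.displayName,
                website: bu.websiteUrl,
                identifyingName: bu.identifyingName,
                profileImageUrl: bu.profileImageUrl
            });
            return;
        }
        // Fall back to DOM parsing...
    }
});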

This __NEXT_DATA__ approach is powerful because it gives you pre-processed, structured data without worrying about CSS selectors changing between Trustpilot redesigns.
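The same payload can also be read from raw HTML when no Cheerio instance is available, for example when the page was fetched with a plain HTTP client. The following is a small sketch under that assumption; `parseNextData` is a hypothetical name, and the regular expression assumes the script tag carries `id="__NEXT_DATA__"` as Next.js normally emits it.

```javascript
// Hypothetical helper: extracts and parses the __NEXT_DATA__ payload from raw HTML.
// Returns null when the tag is missing or the JSON is malformed.
function parseNextData(html) {
    const match = html.match(/<script[^>]*id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/);
    if (!match) return null;
    try {
        return JSON.parse(match[1]);
    } catch {
        return null;
    }
}

const sampleHtml = '<html><body><script id="__NEXT_DATA__" type="application/json">' +
    '{"props":{"pageProps":{"businessUnit":{"displayName":"Acme","trustScore":4.5}}}}' +
    '</script></body></html>';

const data = parseNextData(sampleHtml);
console.log(data.props.pageProps.businessUnit.displayName); // → Acme
```

Returning `null` instead of throwing lets the caller fall back to DOM parsing without a separate error path.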

Extracting Review Data

Reviews are the core value of Trustpilot scraping. Each review contains multiple data points worth capturing.

Individual Review Extraction

function extractReviews($) {
    const reviews = [];

    // Trustpilot wraps each review in an article tag
    $('article[data-service-review-card-paper]').each((i, el) => {
        const $review = $(el);

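        // Star rating - Trustpilot uses data attributes
        const ratingEl = $review.find('[data-service-review-rating]');
        const altText = $review.find('img[alt*="star"]').attr('alt') || '';
        // Always parse to a number; the alt-text fallback assumes wording like "Rated 4 out of 5 stars"
        const rating = parseInt(ratingEl.attr('data-service-review-rating'), 10) ||
            parseInt($review.find('.star-rating').attr('data-rating'), 10) ||
            parseInt(altText.match(/\d/)?.[0], 10) ||
            null;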

        // Review dates
        const dateEl = $review.find('time[datetime]');
        const reviewDate = dateEl.attr('datetime') || dateEl.text().trim();

        // Experience date (when the transaction happened)
        const experienceDateText = $review.find('[data-service-review-date-of-experience-typography]')
            .text().trim();

        // Review content
        const title = $review.find('[data-service-review-title-typography]')
            .text().trim() || $review.find('h2').text().trim();
        const body = $review.find('[data-service-review-text-typography]')
            .text().trim() || $review.find('.review-content__text').text().trim();

        // Author details
        const authorName = $review.find('[data-consumer-name-typography]')
            .text().trim();
        const authorLocation = $review.find('[data-consumer-country-typography]')
            .text().trim();
        const reviewCount = $review.find('[data-consumer-reviews-count-typography]')
            .text().trim();

        // Verification status
        const verified = $review.find('[data-review-verification-label]').length > 0
            || $review.text().includes('Verified');

        // Company reply
        const reply = $review.find('[data-service-review-business-reply-text-typography]')
            .text().trim();
        const replyDate = $review.find('[data-service-review-business-reply-date]')
            .attr('datetime') || '';

        reviews.push({
            rating,
            title,
            body,
            reviewDate,
            experienceDate: experienceDateText,
            author: {
                name: authorName,
                location: authorLocation,
                totalReviews: reviewCount
            },
            verified,
            companyReply: reply || null,
            companyReplyDate: replyDate || null,
            useful: parseInt(
                $review.find('[data-service-review-useful-count]').text(), 10
            ) || 0
        });
    });

    return reviews;
}

Using Next.js Data for Reviews

Again, the __NEXT_DATA__ approach often yields cleaner results:

function extractReviewsFromNextData(nextData) {
    const reviews = nextData.props?.pageProps?.reviews || [];

    return reviews.map(review => ({
        id: review.id,
        rating: review.rating,
        title: review.title,
        text: review.text,
        language: review.language,
        createdAt: review.dates?.publishedDate,
        experiencedAt: review.dates?.experiencedDate,
        updatedAt: review.dates?.updatedDate,
        author: {
            id: review.consumer?.id,
            displayName: review.consumer?.displayName,
            countryCode: review.consumer?.countryCode,
            numberOfReviews: review.consumer?.numberOfReviews
        },
        verified: review.labels?.verification?.isVerified || false,
        verificationSource: review.labels?.verification?.verificationSource,
        companyReply: review.reply ? {
            text: review.reply.message,
            publishedDate: review.reply.publishedDate,
            updatedDate: review.reply.updatedDate
        } : null,
        likes: review.likes || 0,
        report: review.report || null
    }));
}

Handling Pagination

Trustpilot limits reviews to 20 per page and caps visible pages at around 50 (1,000 reviews). For businesses with tens of thousands of reviews, you need strategies to access the full dataset.
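To make the arithmetic concrete, the sketch below estimates how many pages are worth requesting for a business and whether the cap truncates the result. `planPages` is a hypothetical helper, and the two constants are the approximate figures quoted above rather than documented limits.

```javascript
const REVIEWS_PER_PAGE = 20; // page size quoted in this guide
const MAX_PAGES = 50;        // approximate page cap quoted in this guide

// Hypothetical helper: how many pages to fetch and how many reviews that reaches
function planPages(totalReviews) {
    const pagesNeeded = Math.ceil(totalReviews / REVIEWS_PER_PAGE);
    const pagesToFetch = Math.min(pagesNeeded, MAX_PAGES);
    return {
        pagesToFetch,
        reachable: Math.min(totalReviews, pagesToFetch * REVIEWS_PER_PAGE),
        truncated: pagesNeeded > MAX_PAGES
    };
}

console.log(planPages(45));
// → { pagesToFetch: 3, reachable: 45, truncated: false }
console.log(planPages(12453));
// → { pagesToFetch: 50, reachable: 1000, truncated: true }
```

When `truncated` is true, unfiltered pagination alone cannot reach the full review set.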

Basic Pagination

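A basic pagination loop might look like the sketch below: request page after page until one comes back empty or the page cap is reached. `paginateReviews` and the `fetchPage` callback are hypothetical; `fetchPage` stands in for whatever fetches a URL and returns the parsed reviews (for example, a wrapper around `extractReviews`). The base URL is assumed to have no query string of its own.

```javascript
// Hypothetical helper: walks ?page=N until an empty page or the page cap.
// fetchPage(url) is assumed to resolve to an array of review objects.
async function paginateReviews(baseUrl, fetchPage, maxPages = 50) {
    const all = [];
    for (let page = 1; page <= maxPages; page++) {
        const url = page === 1 ? baseUrl : `${baseUrl}?page=${page}`;
        const reviews = await fetchPage(url);
        if (reviews.length === 0) break; // ran past the last page
        all.push(...reviews);
    }
    return all;
}
```

Injecting `fetchPage` keeps the loop independent of the HTTP client and makes it easy to test with canned pages.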

Star-Filtered Pagination for Large Datasets

To access more than 1,000 reviews, paginate through each star rating separately:

import { CheerioCrawler, Dataset } from 'crawlee';

async function scrapeAllReviewsByStars(businessDomain) {
    const baseUrl = `https://www.trustpilot.com/review/${businessDomain}`;
    const allUrls = [];

    // Each star filter has its own pagination
    for (const stars of [1, 2, 3, 4, 5]) {
        for (let page = 1; page <= 50; page++) {
            allUrls.push(`${baseUrl}?stars=${stars}&page=${page}`);
        }
    }

    // This gives you access to up to 5,000 reviews (5 x 1,000)
    const crawler = new CheerioCrawler({
        maxConcurrency: 2,
        maxRequestsPerMinute: 12,

        async requestHandler({ request, $ }) {
            const reviews = extractReviews($);
            if (reviews.length === 0) return;

            const url = new URL(request.url);
            for (const review of reviews) {
                review.filterStars = url.searchParams.get('stars');
                review.page = url.searchParams.get('page');
            }

            await Dataset.pushData(reviews);
        },

        async failedRequestHandler({ request }) {
            console.log(`Failed: ${request.url}`);
        }
    });

    await crawler.run(allUrls);
}

Language-Based Pagination

For international businesses, you can also paginate by language:

// Combine star filter + language for even more coverage
const baseUrl = 'https://www.trustpilot.com/review/company-domain.com';
const languages = ['en', 'de', 'fr', 'es', 'it', 'nl', 'da', 'sv', 'nb'];
const urls = [];

for (const lang of languages) {
    for (const stars of [1, 2, 3, 4, 5]) {
        // 50 pages x 20 reviews = up to 1,000 reviews per language/star combination
        for (let page = 1; page <= 50; page++) {
            urls.push(
                `${baseUrl}?languages=${lang}&stars=${stars}&page=${page}`
            );
        }
    }
}
// Potential access to 45,000 reviews (9 languages x 5 star filters x 1,000)

Sentiment Analysis on Extracted Data

Once you've extracted reviews, the real value comes from analysis. Here's how to add basic sentiment scoring to your pipeline:

// Simple keyword-based sentiment scoring
function analyzeSentiment(reviewText) {
    const text = reviewText.toLowerCase();

    const positiveWords = [
        'excellent', 'amazing', 'fantastic', 'great', 'wonderful',
        'outstanding', 'perfect', 'love', 'best', 'recommend',
        'reliable', 'professional', 'helpful', 'quick', 'easy',
        'friendly', 'efficient', 'impressed', 'satisfied', 'happy'
    ];

    const negativeWords = [
        'terrible', 'awful', 'horrible', 'worst', 'scam',
        'fraud', 'avoid', 'never', 'poor', 'bad',
        'disappointing', 'waste', 'rude', 'slow', 'broken',
        'refund', 'complaint', 'problem', 'issue', 'unresponsive'
    ];

    let positiveCount = 0;
    let negativeCount = 0;

    for (const word of positiveWords) {
        if (text.includes(word)) positiveCount++;
    }
    for (const word of negativeWords) {
        if (text.includes(word)) negativeCount++;
    }

    const total = positiveCount + negativeCount;
    if (total === 0) return { score: 0, label: 'neutral' };

    const score = (positiveCount - negativeCount) / total;
    const label = score > 0.2 ? 'positive' : score < -0.2 ? 'negative' : 'mixed';

    return {
        score: Math.round(score * 100) / 100,
        label,
        positiveSignals: positiveCount,
        negativeSignals: negativeCount
    };
}

// Apply to extracted reviews
function enrichReviewsWithSentiment(reviews) {
    return reviews.map(review => ({
        ...review,
        // DOM extraction stores the text as `body`, the __NEXT_DATA__ path as `text`
        sentiment: analyzeSentiment(review.body || review.text || ''),
        titleSentiment: analyzeSentiment(review.title || '')
    }));
}
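One limitation of the substring check in `analyzeSentiment` is that short keywords also match inside longer words, so "bad" is counted in "badge". A possible refinement, sketched below with a hypothetical `countWholeWords` helper, is to tokenize the text and count whole-word matches only.

```javascript
// Hypothetical helper: counts how many of the given keywords appear as whole words
function countWholeWords(text, words) {
    const tokens = new Set(text.toLowerCase().match(/[a-z']+/g) || []);
    return words.filter((word) => tokens.has(word)).length;
}

console.log(countWholeWords('The badge arrived quickly', ['bad', 'quick'])); // → 0
console.log(countWholeWords('Bad service, slow and rude!', ['bad', 'slow', 'rude'])); // → 3
```

The trade-off is that inflected forms such as "recommended" no longer match "recommend", so the keyword lists would need to include those variants. Multi-word phrases still need a substring or phrase check.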

Aggregating Sentiment Across Reviews

function generateSentimentReport(reviews) {
    const enriched = enrichReviewsWithSentiment(reviews);

    const sentimentCounts = { positive: 0, negative: 0, mixed: 0, neutral: 0 };
    enriched.forEach(r => sentimentCounts[r.sentiment.label]++);

    // Identify trending topics in negative reviews
    const negativeReviews = enriched.filter(r => r.sentiment.label === 'negative');
    const topicFrequency = {};
    const topics = ['delivery', 'refund', 'customer service', 'quality',
        'price', 'shipping', 'communication', 'warranty'];

    for (const review of negativeReviews) {
        const text = `${review.body || review.text || ''} ${review.title || ''}`.toLowerCase();
        for (const topic of topics) {
            if (text.includes(topic)) {
                topicFrequency[topic] = (topicFrequency[topic] || 0) + 1;
            }
        }
    }

    return {
        totalReviews: reviews.length,
        sentimentBreakdown: sentimentCounts,
        // Guard against an empty input so the ratios are null rather than NaN
        averageRating: reviews.length
            ? reviews.reduce((s, r) => s + r.rating, 0) / reviews.length
            : null,
        negativeTopics: Object.entries(topicFrequency)
            .sort((a, b) => b[1] - a[1])
            .map(([topic, count]) => ({ topic, count, percentage: Math.round(count / negativeReviews.length * 100) })),
        responseRate: reviews.length
            ? reviews.filter(r => r.companyReply).length / reviews.length * 100
            : null
    };
}

Scaling with Apify

For production-grade Trustpilot scraping, Apify provides the infrastructure you need.

Why Use Apify for Trustpilot?

Trustpilot aggressively blocks scrapers. You need:

Proxy rotation: Residential proxies to avoid IP bans
Fingerprint randomization: Browser-like request patterns
Retry logic: Automatic retries on blocked requests
Scheduling: Daily/weekly monitoring of competitor reviews

Using Apify Store Actors

The Apify Store has purpose-built Trustpilot scrapers that handle all anti-bot measures:

import { ApifyClient } from 'apify-client';

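// Read the token from the environment rather than hard-coding it
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Scrape reviews for multiple businesses.
// 'trustpilot-review-scraper' is a placeholder: Store actors are addressed as
// 'username/actor-name', and the input fields below depend on the actor you pick.
const run = await client.actor('trustpilot-review-scraper').call({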
    businessUrls: [
        'https://www.trustpilot.com/review/example1.com',
        'https://www.trustpilot.com/review/example2.com'
    ],
    maxReviewsPerBusiness: 5000,
    includeReplies: true,
    sortBy: 'recency',
    proxyConfiguration: {
        useApifyProxy: true,
        apifyProxyGroups: ['RESIDENTIAL']
    }
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Extracted ${items.length} reviews`);

// Export in multiple formats
const csv = await client.dataset(run.defaultDatasetId).downloadItems('csv');
const jsonl = await client.dataset(run.defaultDatasetId).downloadItems('jsonl');

Building a Custom Trustpilot Actor

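One piece of a custom actor that is easy to sketch in isolation is turning the actor input into a list of start requests. The input shape below (`domains`, `stars`, `maxPages`) and the `buildStartRequests` name are hypothetical, chosen only for illustration. The `{ url, userData }` objects follow the request format that Crawlee crawlers accept.

```javascript
// Hypothetical input shape: { domains: string[], stars?: number[], maxPages?: number }
function buildStartRequests(input) {
    const { domains = [], stars = [], maxPages = 50 } = input;
    const filters = stars.length > 0 ? stars : [null]; // null means no star filter
    const requests = [];

    for (const domain of domains) {
        for (const star of filters) {
            for (let page = 1; page <= maxPages; page++) {
                const params = new URLSearchParams();
                if (star !== null) params.set('stars', String(star));
                if (page > 1) params.set('page', String(page));
                const query = params.toString();
                requests.push({
                    url: `https://www.trustpilot.com/review/${domain}${query ? `?${query}` : ''}`,
                    // userData travels with the request, so the handler knows its context
                    userData: { domain, star, page }
                });
            }
        }
    }
    return requests;
}
```

Carrying `domain`, `star`, and `page` in `userData` avoids re-parsing the URL inside the request handler.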

Dealing with Anti-Scraping on Trustpilot

Trustpilot has invested heavily in bot detection. Here are proven strategies to handle their protections:

Request Headers

const headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1'
};

Session Management

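If the crawler is built on Crawlee, session handling is mostly a matter of configuration. The options object below is a sketch intended to be passed to `new CheerioCrawler(crawlerOptions)`. The option names are Crawlee's, but the numeric values are illustrative assumptions, not tuned recommendations.

```javascript
// Options intended for `new CheerioCrawler(crawlerOptions)` from the `crawlee` package.
// All numeric values are illustrative assumptions.
const crawlerOptions = {
    useSessionPool: true,           // rotate requests through a pool of sessions
    persistCookiesPerSession: true, // keep cookies tied to the session that received them
    maxRequestRetries: 5,           // retry failed requests up to 5 times
    sessionPoolOptions: {
        maxPoolSize: 20,            // upper bound on concurrent sessions in the pool
        sessionOptions: {
            maxUsageCount: 30,      // retire a session after this many requests
            maxErrorScore: 1        // retire a session quickly once it starts failing
        }
    }
};
```

Pairing each session with its own proxy IP and cookies makes traffic from one session look like a single consistent visitor.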

Handling CAPTCHAs

When Trustpilot serves a CAPTCHA or block page, detect it, retire the session, and throw so the crawler retries the request with a fresh session:

async requestHandler({ request, $, session }) {
    // Check for CAPTCHA or block page
    if ($('title').text().includes('Attention Required') ||
        $('[data-captcha]').length > 0 ||
        $.html().includes('cf-challenge')) {

        // Mark session as blocked
        session?.retire();

        // Re-enqueue with different session
        throw new Error('CAPTCHA detected - retiring session');
    }

    // Proceed with extraction...
}
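The detection checks can also be pulled into a pure function so they are easy to unit test against saved block pages. `looksBlocked` is a hypothetical helper. The title and markup checks mirror the handler above, while treating HTTP 403 and 429 as blocks is an added assumption.

```javascript
// Hypothetical helper: returns true when a response looks like a block or CAPTCHA page
function looksBlocked({ statusCode, title = '', html = '' }) {
    // Assumption: 403 and 429 are treated as blocks
    if (statusCode === 403 || statusCode === 429) return true;
    if (title.includes('Attention Required')) return true;
    return html.includes('cf-challenge') || html.includes('data-captcha');
}

console.log(looksBlocked({ statusCode: 200, title: 'Acme Reviews', html: '<article></article>' })); // → false
console.log(looksBlocked({ statusCode: 200, title: 'Attention Required!', html: '' })); // → true
```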

Practical Use Cases

1. Competitive Brand Monitoring

Track how your competitors' ratings change over time by scheduling daily scrapes and comparing:

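// getTrustScore, getRecentReviewSentiment, getResponseRate, and getTopComplaints are
// placeholders for helpers built on the extraction functions above.
async function compareCompetitors(domains) {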
    const results = {};
    for (const domain of domains) {
        results[domain] = {
            trustScore: await getTrustScore(domain),
            recentSentiment: await getRecentReviewSentiment(domain, 30), // last 30 days
            responseRate: await getResponseRate(domain),
            topComplaints: await getTopComplaints(domain)
        };
    }
    return results;
}

2. Lead Generation

Identify businesses with poor ratings in your industry — they might need your product or service:

// Find businesses in a category with ratings below 3.0
async function findUnhappyCustomers(category) {
    const url = `https://www.trustpilot.com/categories/${category}?sort=rating_asc`;
    // Extract businesses with low ratings
    // These businesses' customers are looking for alternatives
}

3. Product Development Insights

Analyze negative reviews to identify common pain points in your industry — then build products that solve those specific problems.

Best Practices for Trustpilot Scraping

Start with __NEXT_DATA__ — this is the most reliable and structured data source on Trustpilot pages.
Use star-filtered pagination to access more reviews than the default 1,000-review limit per business.
Respect rate limits — keep requests under 15/minute. Trustpilot will temporarily block aggressive scrapers.
Rotate residential proxies — datacenter IPs are quickly detected and blocked by Trustpilot's bot protection.
Monitor for page structure changes — Trustpilot updates their frontend regularly. Build alerts for extraction failures.
Deduplicate reviews by review ID. Pagination can shift while a crawl is running, and repeated or overlapping crawls can collect the same review more than once.
Comply with Trustpilot's Terms — review their terms of service and ensure your use case is legitimate.
Cache aggressively — historical reviews rarely change. Only scrape new reviews after your initial full extraction.
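The deduplication and caching points can be sketched with two small helpers. Both names are hypothetical, and the `id` and `createdAt` fields are assumed to match the output of `extractReviewsFromNextData` above.

```javascript
// Hypothetical helper: keeps the first occurrence of each review ID
function dedupeById(reviews) {
    const seen = new Map();
    for (const review of reviews) {
        if (!seen.has(review.id)) seen.set(review.id, review);
    }
    return [...seen.values()];
}

// Hypothetical helper: keeps only reviews published after the last one already stored
function newerThan(reviews, lastSeenIso) {
    const cutoff = new Date(lastSeenIso).getTime();
    return reviews.filter((r) => new Date(r.createdAt).getTime() > cutoff);
}

const scraped = [
    { id: 'a', createdAt: '2024-01-01T00:00:00Z' },
    { id: 'b', createdAt: '2024-02-01T00:00:00Z' },
    { id: 'a', createdAt: '2024-01-01T00:00:00Z' }
];

console.log(dedupeById(scraped).length); // → 2
console.log(newerThan(dedupeById(scraped), '2024-01-15T00:00:00Z').map((r) => r.id)); // → [ 'b' ]
```

Storing the newest `createdAt` after each run gives later runs a cutoff, so only recent pages need to be fetched.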

Legal and Ethical Considerations

Trustpilot's terms of service restrict automated access. When scraping Trustpilot:

Use data for legitimate purposes: Market research, competitive analysis, brand monitoring
Don't republish reviews without proper attribution and compliance
Respect GDPR: Reviewer names and locations are personal data in the EU
Don't overwhelm their servers: Rate limit properly and use caching
Consider their API: Trustpilot offers a paid API for businesses — evaluate whether it meets your needs before scraping
Stay informed: Web scraping laws evolve rapidly — consult legal counsel for commercial use

Conclusion

Trustpilot's consistent page structure and embedded JSON-LD data make it a viable target for review extraction, but its aggressive bot detection requires a sophisticated approach. The combination of __NEXT_DATA__ parsing, star-filtered pagination for full coverage, sentiment analysis for actionable insights, and cloud infrastructure through the Apify Store for scale creates a robust pipeline.

Whether you're monitoring your own brand reputation, tracking competitor sentiment, or building a review aggregation product, the techniques in this guide give you a solid foundation. Start small with a single business domain, validate your extraction logic, then scale progressively using residential proxies and Apify's Actor infrastructure.

The key to successful Trustpilot scraping is patience: respect rate limits, rotate sessions, handle failures gracefully, and always enrich your raw data with sentiment and trend analysis to extract maximum value from every review you collect.
