agenthustler

Trustpilot Scraping: Extract Business Reviews and Ratings at Scale

Trustpilot is one of the world's most influential consumer review platforms, hosting over 300 million reviews across nearly a million businesses. For market researchers, competitive analysts, and data-driven companies, extracting Trustpilot data at scale provides invaluable insights into customer satisfaction, brand reputation, and industry trends.

In this guide, I'll cover everything you need to know about scraping Trustpilot — from understanding the platform's structure to building production-grade scrapers that handle pagination, rate limiting, and data extraction.

Understanding Trustpilot's Structure

Trustpilot organizes its data around three core entities:

  1. Business Profiles — Company pages with aggregate ratings, review counts, and company information
  2. Reviews — Individual user reviews with ratings, text, dates, and reply data
  3. Categories — Industry groupings that enable discovery and comparison
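If you are building a scraper around these entities, it helps to pin down the record shapes early. Here is a minimal Python sketch; the field names are mine, not Trustpilot's schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Review:
    rating: int                        # 1-5 stars
    title: str
    body: str
    author: str
    date: str                          # ISO 8601 date string
    verified: bool = False
    company_reply: Optional[str] = None

@dataclass
class BusinessProfile:
    domain: str                        # e.g. "example.com"
    name: str
    trust_score: float                 # aggregate TrustScore
    total_reviews: int
    categories: List[str] = field(default_factory=list)
```

Everything that follows, whichever language you scrape in, is ultimately producing records of roughly this shape.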

URL Patterns

Trustpilot follows predictable URL patterns:

Business profile:  https://www.trustpilot.com/review/example.com
Review pages:      https://www.trustpilot.com/review/example.com?page=2
Category listing:  https://www.trustpilot.com/categories/software_company
Search results:    https://www.trustpilot.com/search?query=saas+tools

Understanding these patterns is the foundation of any scraping strategy.
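The patterns are easy to encode in a few helpers so the rest of the scraper never hand-assembles URLs (the function names here are illustrative):

```python
from urllib.parse import quote_plus

BASE = "https://www.trustpilot.com"

def profile_url(domain: str, page: int = 1) -> str:
    """Business profile URL; pages after the first take a ?page= parameter."""
    url = f"{BASE}/review/{domain}"
    return url if page <= 1 else f"{url}?page={page}"

def category_url(category: str, page: int = 1) -> str:
    """Category listing URL, e.g. category='software_company'."""
    url = f"{BASE}/categories/{category}"
    return url if page <= 1 else f"{url}?page={page}"

def search_url(query: str) -> str:
    """Search results URL with a URL-encoded query."""
    return f"{BASE}/search?query={quote_plus(query)}"
```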

Extracting Business Profile Data

Every business on Trustpilot has a profile page containing structured data — the overall TrustScore, total review count, star distribution, and company information.

JavaScript Approach (Node.js with Cheerio)

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeBusinessProfile(businessDomain) {
    const url = `https://www.trustpilot.com/review/${businessDomain}`;

    const response = await axios.get(url, {
        headers: {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept-Language': 'en-US,en;q=0.9'
        }
    });

    const $ = cheerio.load(response.data);

    // Extract JSON-LD structured data (most reliable method)
    const jsonLdScripts = $('script[type="application/ld+json"]');
    let businessData = null;

    jsonLdScripts.each((_, script) => {
        try {
            const data = JSON.parse($(script).html());
            if (data['@type'] === 'LocalBusiness' || data['@type'] === 'Organization') {
                businessData = data;
            }
        } catch (e) {
            // Not all scripts contain valid JSON-LD
        }
    });

    // Extract from page elements as fallback
    const trustScore = $('[data-testid="trust-score"]').text().trim();
    const totalReviews = $('[data-testid="total-review-count"]').text().trim();
    const companyName = $('[data-testid="company-name"]').text().trim();

    // Star distribution
    const starDistribution = {};
    $('[data-testid="star-distribution-row"]').each((_, row) => {
        const stars = $(row).find('[class*="StarLabel"]').text().trim();
        const percentage = $(row).find('[class*="Percentage"]').text().trim();
        if (stars && percentage) {
            starDistribution[stars] = percentage;
        }
    });

    return {
        domain: businessDomain,
        name: companyName || businessData?.name,
        trustScore: trustScore || businessData?.aggregateRating?.ratingValue,
        totalReviews: totalReviews || businessData?.aggregateRating?.reviewCount,
        starDistribution,
        url,
        scrapedAt: new Date().toISOString()
    };
}

// Usage
scrapeBusinessProfile('example.com')
    .then(data => console.log(JSON.stringify(data, null, 2)))
    .catch(err => console.error(err.message));

Python Approach

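For parity with the Node.js version, here is a minimal Python sketch using requests and BeautifulSoup. It follows the same strategy (JSON-LD first, data-testid selectors as a fallback), but treat the selectors as assumptions that may need adjusting as Trustpilot's markup changes:

```python
import json
from datetime import datetime, timezone

import requests
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

def parse_business_profile(html, business_domain):
    """Parse a profile page: JSON-LD first, data-testid selectors as fallback."""
    soup = BeautifulSoup(html, 'html.parser')

    # JSON-LD structured data is the most reliable source
    business_data = None
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string or '')
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get('@type') in ('LocalBusiness', 'Organization'):
            business_data = data

    def text_of(selector):
        el = soup.select_one(selector)
        return el.get_text(strip=True) if el else ''

    rating = (business_data or {}).get('aggregateRating', {})
    return {
        'domain': business_domain,
        'name': text_of('[data-testid="company-name"]') or (business_data or {}).get('name'),
        'trust_score': text_of('[data-testid="trust-score"]') or rating.get('ratingValue'),
        'total_reviews': text_of('[data-testid="total-review-count"]') or rating.get('reviewCount'),
        'scraped_at': datetime.now(timezone.utc).isoformat(),
    }

def scrape_business_profile(business_domain):
    url = f"https://www.trustpilot.com/review/{business_domain}"
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return parse_business_profile(response.text, business_domain)
```

Splitting fetching from parsing keeps the parsing logic testable against saved HTML fixtures without touching the network.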

Extracting Reviews at Scale

The real value in Trustpilot scraping is the individual reviews. Each review contains the rating, text content, date, verification status, and any company reply.

Paginated Review Extraction

const axios = require('axios');
const cheerio = require('cheerio');

class TrustpilotReviewScraper {
    constructor(options = {}) {
        this.delay = options.delay || 2000;
        this.maxRetries = options.maxRetries || 3;
    }

    async scrapeReviews(businessDomain, maxPages = null) {
        const allReviews = [];
        let page = 1;
        let hasMore = true;

        while (hasMore) {
            if (maxPages && page > maxPages) break;

            try {
                const reviews = await this.scrapePage(businessDomain, page);

                if (reviews.length === 0) {
                    hasMore = false;
                    break;
                }

                allReviews.push(...reviews);
                console.log(`Page ${page}: ${reviews.length} reviews (total: ${allReviews.length})`);

                page++;
                await this.sleep(this.delay);

            } catch (error) {
                if (error.response?.status === 404) {
                    hasMore = false;
                } else if (error.response?.status === 429) {
                    console.log('Rate limited. Waiting 60 seconds...');
                    await this.sleep(60000);
                } else {
                    console.error(`Error on page ${page}: ${error.message}`);
                    hasMore = false;
                }
            }
        }

        return allReviews;
    }

    async scrapePage(businessDomain, page) {
        const url = `https://www.trustpilot.com/review/${businessDomain}?page=${page}`;

        const response = await axios.get(url, {
            headers: {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
                'Accept-Language': 'en-US,en;q=0.9'
            }
        });

        const $ = cheerio.load(response.data);
        const reviews = [];

        // Extract reviews from JSON-LD
        $('script[type="application/ld+json"]').each((_, script) => {
            try {
                const data = JSON.parse($(script).html());
                if (data['@graph']) {
                    for (const item of data['@graph']) {
                        if (item['@type'] === 'Review') {
                            reviews.push({
                                id: item.identifier || null,
                                rating: item.reviewRating?.ratingValue,
                                title: item.headline || '',
                                body: item.reviewBody || '',
                                author: item.author?.name || 'Anonymous',
                                date: item.datePublished,
                                language: item.inLanguage,
                                verified: false // JSON-LD does not expose verification status
                            });
                        }
                    }
                }
            } catch (e) {}
        });

        // Fall back to parsing the DOM when JSON-LD yields nothing
        if (reviews.length === 0) {
            $('[data-testid="review-card"]').each((_, card) => {
                const $card = $(card);

                const ratingEl = $card.find('[data-testid="review-star-rating"]');
                const rating = ratingEl.attr('data-rating') || null;

                const title = $card.find('[data-testid="review-title"]').text().trim();
                const body = $card.find('[data-testid="review-content"]').text().trim();
                const author = $card.find('[data-testid="reviewer-name"]').text().trim();
                const date = $card.find('time').attr('datetime') || '';

                const verified = $card.find('[data-testid="verified-badge"]').length > 0;

                // Check for company reply
                const reply = $card.find('[data-testid="company-reply"]').text().trim();

                reviews.push({
                    rating: rating ? parseInt(rating, 10) : null,
                    title,
                    body,
                    author,
                    date,
                    verified,
                    companyReply: reply || null
                });
            });
        }

        return reviews;
    }

    sleep(ms) {
        return new Promise(resolve => setTimeout(resolve, ms));
    }
}

// Usage
const scraper = new TrustpilotReviewScraper({ delay: 2000 });
scraper.scrapeReviews('example.com', 50)
    .then(reviews => {
        const fs = require('fs');
        fs.writeFileSync('reviews.json', JSON.stringify(reviews, null, 2));
        console.log(`Saved ${reviews.length} reviews`);
    });

Python Review Scraper with Retry Logic

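Here is a Python equivalent of the class above, reduced to its essentials: JSON-LD extraction plus a fetch helper that backs off exponentially on 429 responses. This is an illustrative sketch rather than a drop-in production scraper; the DOM-fallback parsing from the Node.js version is omitted for brevity:

```python
import json
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

def extract_reviews_from_json_ld(html):
    """Pull Review objects out of the page's JSON-LD @graph."""
    soup = BeautifulSoup(html, 'html.parser')
    reviews = []
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string or '')
        except json.JSONDecodeError:
            continue
        graph = data.get('@graph', []) if isinstance(data, dict) else []
        for item in graph:
            if item.get('@type') == 'Review':
                reviews.append({
                    'rating': item.get('reviewRating', {}).get('ratingValue'),
                    'title': item.get('headline', ''),
                    'body': item.get('reviewBody', ''),
                    'author': item.get('author', {}).get('name', 'Anonymous'),
                    'date': item.get('datePublished'),
                })
    return reviews

def fetch_with_retry(url, max_retries=3, base_backoff=60):
    """GET with exponential backoff (60s, 120s, 240s) on 429 responses."""
    for attempt in range(max_retries):
        response = requests.get(url, headers=HEADERS, timeout=30)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        time.sleep(base_backoff * (2 ** attempt))
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts: {url}")

def scrape_reviews(business_domain, max_pages=50, delay=2.0):
    all_reviews = []
    for page in range(1, max_pages + 1):
        url = f"https://www.trustpilot.com/review/{business_domain}?page={page}"
        try:
            response = fetch_with_retry(url)
        except requests.HTTPError as e:
            if e.response is not None and e.response.status_code == 404:
                break                  # ran past the last page
            raise
        reviews = extract_reviews_from_json_ld(response.text)
        if not reviews:
            break
        all_reviews.extend(reviews)
        time.sleep(delay)              # stay under Trustpilot's rate limits
    return all_reviews
```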

Sentiment Analysis on Extracted Reviews

Once you have the review data, basic sentiment analysis adds a powerful analytical layer:

import json
from collections import Counter

def analyze_review_sentiment(reviews):
    # Basic sentiment analysis on Trustpilot reviews
    total = len(reviews)
    if total == 0:
        return {}

    # Rating distribution
    ratings = Counter(r.get('rating') for r in reviews if r.get('rating'))

    # Calculate averages
    rated_reviews = [r for r in reviews if r.get('rating')]
    avg_rating = sum(r['rating'] for r in rated_reviews) / len(rated_reviews) if rated_reviews else 0

    # Keyword frequency analysis
    positive_keywords = ['great', 'excellent', 'amazing', 'love', 'best', 'fantastic',
                        'wonderful', 'perfect', 'outstanding', 'recommend']
    negative_keywords = ['terrible', 'horrible', 'awful', 'worst', 'scam', 'fraud',
                        'disappointing', 'avoid', 'waste', 'never']

    positive_count = 0
    negative_count = 0
    keyword_freq = Counter()

    for review in reviews:
        text = (review.get('body', '') + ' ' + review.get('title', '')).lower()

        for kw in positive_keywords:
            if kw in text:
                positive_count += 1
                keyword_freq[f"+{kw}"] += 1

        for kw in negative_keywords:
            if kw in text:
                negative_count += 1
                keyword_freq[f"-{kw}"] += 1

    # Response rate analysis
    # Accept both the snake_case key and the JS scraper's camelCase key
    replied = sum(1 for r in reviews if r.get('company_reply') or r.get('companyReply'))

    # Verified buyer percentage
    verified = sum(1 for r in reviews if r.get('verified'))

    return {
        'total_reviews': total,
        'average_rating': round(avg_rating, 2),
        'rating_distribution': dict(ratings),
        'positive_mentions': positive_count,
        'negative_mentions': negative_count,
        'sentiment_ratio': round(positive_count / max(negative_count, 1), 2),
        'top_keywords': keyword_freq.most_common(20),
        'company_response_rate': f"{(replied / total * 100):.1f}%",
        'verified_buyer_rate': f"{(verified / total * 100):.1f}%"
    }

# Usage
analysis = analyze_review_sentiment(reviews)
print(json.dumps(analysis, indent=2))

Scraping Category and Search Results

Beyond individual businesses, Trustpilot's category pages and search functionality let you discover businesses programmatically:

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeCategory(category, maxPages = 5) {
    const businesses = [];

    for (let page = 1; page <= maxPages; page++) {
        const url = `https://www.trustpilot.com/categories/${category}?page=${page}`;

        const response = await axios.get(url, {
            headers: {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            }
        });

        const $ = cheerio.load(response.data);

        $('[data-testid="business-card"]').each((_, card) => {
            const $card = $(card);
            const name = $card.find('[data-testid="business-name"]').text().trim();
            const link = $card.find('a').attr('href');
            const rating = $card.find('[data-testid="trust-score"]').text().trim();
            const reviewCount = $card.find('[data-testid="review-count"]').text().trim();

            if (name) {
                businesses.push({
                    name,
                    profileUrl: link ? `https://www.trustpilot.com${link}` : null,
                    trustScore: rating,
                    reviewCount,
                    category
                });
            }
        });

        console.log(`Category ${category}, page ${page}: ${businesses.length} total businesses`);
        await new Promise(r => setTimeout(r, 2000));
    }

    return businesses;
}

// Scrape multiple categories
async function scrapeMultipleCategories(categories) {
    const results = {};

    for (const cat of categories) {
        results[cat] = await scrapeCategory(cat);
        await new Promise(r => setTimeout(r, 5000));
    }

    return results;
}

// Usage
scrapeMultipleCategories(['software_company', 'hosting_company', 'bank'])
    .then(results => {
        const fs = require('fs');
        fs.writeFileSync('trustpilot_categories.json', JSON.stringify(results, null, 2));
    });

Using Apify for Production Trustpilot Scraping

For production-scale Trustpilot scraping, cloud-based solutions handle the infrastructure challenges — proxy rotation, browser rendering, and IP management — that make self-hosted scraping difficult to maintain.

Apify offers pre-built actors that specialize in review platform scraping. These handle Trustpilot's anti-bot measures and provide clean, structured output.

Why Use Apify for Trustpilot

  • Anti-bot handling: Trustpilot uses Cloudflare and other protections. Apify actors manage this automatically
  • Browser rendering: Some Trustpilot content requires JavaScript execution
  • Proxy pools: Residential and datacenter proxy rotation prevents IP-based blocking
  • Scheduling: Set up automated weekly or daily review monitoring
  • Webhooks: Get notified when new data is available

Integration Example

from apify_client import ApifyClient

def scrape_trustpilot_via_apify(business_domain, max_reviews=1000):
    client = ApifyClient("YOUR_APIFY_TOKEN")

    run_input = {
        "startUrls": [
            {"url": f"https://www.trustpilot.com/review/{business_domain}"}
        ],
        "maxReviews": max_reviews,
        "includeCompanyInfo": True,
        "proxy": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"]
        }
    }

    run = client.actor("apify/trustpilot-scraper").call(run_input=run_input)

    items = []
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        items.append(item)

    return items

# Scrape and analyze
reviews = scrape_trustpilot_via_apify("example.com", max_reviews=5000)
print(f"Collected {len(reviews)} reviews")

Data Storage for Review Analytics

Storing reviews in a structured database enables powerful temporal analysis:

import sqlite3
import json

def create_review_database(db_path="trustpilot_data.db"):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS businesses (
            domain TEXT PRIMARY KEY,
            name TEXT,
            trust_score REAL,
            total_reviews INTEGER,
            last_scraped TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    ''')

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS reviews (
            id TEXT PRIMARY KEY,
            business_domain TEXT,
            rating INTEGER,
            title TEXT,
            body TEXT,
            author TEXT,
            date TEXT,
            verified BOOLEAN,
            company_reply TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            FOREIGN KEY (business_domain) REFERENCES businesses(domain)
        )
    ''')

    cursor.execute('''
        CREATE INDEX IF NOT EXISTS idx_reviews_business
        ON reviews(business_domain)
    ''')

    cursor.execute('''
        CREATE INDEX IF NOT EXISTS idx_reviews_date
        ON reviews(date)
    ''')

    conn.commit()
    return conn

def store_reviews(conn, business_domain, reviews):
    cursor = conn.cursor()

    for review in reviews:
        review_id = f"{business_domain}:{review.get('date', '')}:{review.get('author', '')}"

        cursor.execute('''
            INSERT OR REPLACE INTO reviews
            (id, business_domain, rating, title, body, author, date, verified, company_reply)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
        ''', (
            review_id,
            business_domain,
            review.get('rating'),
            review.get('title', ''),
            review.get('body', ''),
            review.get('author', 'Anonymous'),
            review.get('date', ''),
            review.get('verified', False),
            review.get('company_reply')
        ))

    conn.commit()

# Usage
conn = create_review_database()
store_reviews(conn, "example.com", reviews)

Best Practices for Trustpilot Scraping

Rate Limiting

Trustpilot is more aggressive with rate limiting than many other platforms. Follow these guidelines:

  1. Minimum 2-second delay between requests to the same domain
  2. 5-second delay between different business profile scrapes
  3. Back off exponentially on 429 responses (60s, 120s, 240s)
  4. Rotate User-Agent strings across a pool of realistic browser signatures
  5. Use residential proxies for large-scale operations
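The schedule in point 3 is plain doubling from a 60-second base, which is simple to encode:

```python
def backoff_delays(base_seconds=60, max_retries=3):
    """Exponential backoff schedule: each retry waits twice as long."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]
```

After exhausting the schedule, a scraper should give up on the page and surface the failure rather than hammer the site further.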

Legal and Ethical Considerations

  • Always check Trustpilot's Terms of Service before scraping
  • Use data for legitimate purposes such as market research and competitive analysis
  • Don't republish review content without proper attribution
  • Respect privacy — don't use reviewer personal information inappropriately
  • Consider using Trustpilot's official API for authorized access to review data

Handling Anti-Bot Measures

import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2.1 Safari/605.1.15',
]

def get_random_headers():
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Cache-Control': 'max-age=0'
    }

Monitoring Reputation Over Time

The most powerful application of Trustpilot scraping is tracking reputation changes over time:

def generate_reputation_report(conn, business_domain, days=30):
    # Generate a reputation trend report
    cursor = conn.cursor()

    cursor.execute('''
        SELECT
            date(date) as review_date,
            COUNT(*) as review_count,
            AVG(rating) as avg_rating,
            SUM(CASE WHEN rating >= 4 THEN 1 ELSE 0 END) as positive,
            SUM(CASE WHEN rating <= 2 THEN 1 ELSE 0 END) as negative
        FROM reviews
        WHERE business_domain = ?
          AND date >= date('now', ?)
        GROUP BY date(date)
        ORDER BY review_date
    ''', (business_domain, f'-{days} days'))

    rows = cursor.fetchall()

    report = {
        'business': business_domain,
        'period_days': days,
        'daily_breakdown': [],
        'summary': {}
    }

    total_reviews = 0
    total_rating_sum = 0

    for row in rows:
        day_data = {
            'date': row[0],
            'reviews': row[1],
            'avg_rating': round(row[2], 2),
            'positive': row[3],
            'negative': row[4]
        }
        report['daily_breakdown'].append(day_data)
        total_reviews += row[1]
        total_rating_sum += row[2] * row[1]

    if total_reviews > 0:
        report['summary'] = {
            'total_reviews': total_reviews,
            'overall_avg_rating': round(total_rating_sum / total_reviews, 2),
            'reviews_per_day': round(total_reviews / days, 1)
        }

    return report

Conclusion

Trustpilot scraping opens up powerful possibilities for competitive intelligence, reputation monitoring, and market research. The platform's structured data and predictable URL patterns make it technically accessible, though its anti-bot protections require careful handling.

Key takeaways:

  • Start with JSON-LD extraction — it's the most reliable data source on Trustpilot pages
  • Implement robust rate limiting — Trustpilot is stricter than most platforms, so 2+ second delays are essential
  • Use cloud platforms like Apify for production workloads that need proxy rotation and anti-bot handling
  • Store data in databases for temporal analysis and trend tracking
  • Add sentiment analysis to transform raw reviews into actionable intelligence
  • Respect Terms of Service and use data ethically

Whether you're monitoring your own brand's reputation, analyzing competitors, or building a review aggregation service, the techniques in this guide provide a solid foundation for extracting meaningful insights from Trustpilot's vast review database.
