DEV Community

agenthustler

Tripadvisor Scraping: Extract Hotel Reviews, Ratings and Travel Data

The travel industry generates massive amounts of publicly available data every day. Tripadvisor alone hosts over 1 billion reviews across hotels, restaurants, attractions, and experiences worldwide. For businesses in hospitality, market research firms, and data analysts, this information is incredibly valuable — but manually collecting it is practically impossible.

In this guide, we'll walk through how Tripadvisor structures its data, what you can extract, and how to build reliable scrapers that handle the platform's complexity. Whether you're tracking competitor hotel ratings, analyzing restaurant review sentiment, or building a travel data aggregator, this article covers everything you need to know.

Understanding Tripadvisor's Data Structure

Before writing any code, it's essential to understand how Tripadvisor organizes its content. The platform follows a hierarchical structure:

Location Pages → Entity Pages → Review Pages

Location Hierarchy

Tripadvisor organizes content geographically:

  • Continent → Country → State/Region → City → Neighborhood
  • Each level has its own URL pattern: tripadvisor.com/Tourism-g187147-Paris-Vacations.html
  • The g parameter represents a geographic ID that's consistent across the platform

Entity Types

Within each location, entities are categorized:

  • Hotels: /Hotel_Review-g187147-d188150-Reviews-Hotel_Name.html
  • Restaurants: /Restaurant_Review-g187147-d1751525-Reviews-Restaurant_Name.html
  • Attractions: /Attraction_Review-g187147-d188757-Reviews-Attraction_Name.html

The d parameter is the unique entity ID. Understanding these URL patterns is crucial for building systematic scrapers.
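Because the g and d IDs are embedded in every URL, you can pull them out with a short regular expression. A minimal sketch in Python, using the example URLs above:

```python
import re

def parse_tripadvisor_url(url):
    """Extract the geographic ID (g) and entity ID (d) from a Tripadvisor URL."""
    geo = re.search(r"-g(\d+)", url)
    entity = re.search(r"-d(\d+)", url)
    return {
        "geo_id": geo.group(1) if geo else None,
        "entity_id": entity.group(1) if entity else None,
    }

print(parse_tripadvisor_url(
    "/Hotel_Review-g187147-d188150-Reviews-Hotel_Name.html"
))
# → {'geo_id': '187147', 'entity_id': '188150'}
```

Storing these IDs alongside scraped records makes deduplication trivial: two URLs with the same d value refer to the same entity.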

Review Pagination

Reviews on Tripadvisor follow a predictable pagination pattern:

  • First page: /Hotel_Review-g187147-d188150-Reviews-Hotel_Name.html
  • Second page: /Hotel_Review-g187147-d188150-Reviews-or10-Hotel_Name.html
  • Third page: /Hotel_Review-g187147-d188150-Reviews-or20-Hotel_Name.html

The or parameter increments by 10 (the default number of reviews per page). This means for a hotel with 500 reviews, you'd need to paginate through 50 pages.
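The offset pattern above can be turned into a URL generator, assuming the default 10 reviews per page:

```python
def review_page_urls(base_url, total_reviews, per_page=10):
    """Build the list of paginated review URLs by inserting the -orN- offset."""
    urls = [base_url]  # first page has no offset
    for offset in range(per_page, total_reviews, per_page):
        urls.append(base_url.replace("-Reviews-", f"-Reviews-or{offset}-"))
    return urls

urls = review_page_urls(
    "/Hotel_Review-g187147-d188150-Reviews-Hotel_Name.html",
    total_reviews=35,
)
# 35 reviews at 10 per page → 4 page URLs (offsets 0, 10, 20, 30)
```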

What Data Can You Extract?

Hotel Data Points

From a typical hotel listing page, you can extract:

Data Point           Location                Notes
-------------------  ----------------------  ---------------------------
Hotel name           H1 tag                  Primary identifier
Overall rating       span.biGQs              Scale of 1-5 (bubbles)
Number of reviews    Review count section    Total across all languages
Price range          Price section           Nightly rate range
Amenities            Amenities section       Pool, WiFi, parking, etc.
Location/address     Address section         Full street address
Star classification  Hotel class             1-5 star rating
Photos count         Photo gallery           Total uploaded photos
Nearby attractions   Nearby section          Points of interest

Note: obfuscated class names like biGQs are generated by Tripadvisor's build pipeline and change frequently. Prefer stable data-automation attributes where they exist, and treat any class-based selector as something to re-verify regularly.

Review Data Points

Each individual review contains:

  • Reviewer name and profile link
  • Rating (1-5 bubbles)
  • Review title and full text
  • Date of stay (month and year)
  • Trip type (business, couples, family, solo, friends)
  • Room tip (optional)
  • Photos attached to the review
  • Helpful votes count
  • Management response (if any)
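Taken together, one scraped review maps naturally onto a flat record. A hypothetical example — the field names here are this guide's own choice, not anything Tripadvisor defines:

```python
sample_review = {
    "reviewer": "traveler123",              # display name
    "profile_url": "/Profile/traveler123",
    "rating": 4,                             # 1-5 bubbles
    "title": "Great location, small rooms",
    "text": "We stayed three nights...",
    "date_of_stay": "March 2024",            # month and year only
    "trip_type": "couples",
    "room_tip": None,                        # optional field
    "photo_urls": [],
    "helpful_votes": 2,
    "management_response": None,             # present only if the owner replied
}
```

Fixing the schema up front, including which fields may legitimately be None, saves a lot of cleanup work once thousands of records start accumulating.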

Traveler Photos Metadata

Tripadvisor's photo section is rich with metadata:

  • Upload date
  • Caption text
  • Associated review link
  • Photo category (room, view, pool, food, etc.)
  • Uploader profile information

Building a Tripadvisor Scraper with Node.js

Let's build a practical scraper. We'll use Crawlee, Apify's open-source web scraping and crawling library.

Setting Up the Project

// Install the dependency first: npm install crawlee
// (CheerioCrawler bundles cheerio, so no separate install is needed)

const { CheerioCrawler, Dataset } = require('crawlee');

const crawler = new CheerioCrawler({
    maxConcurrency: 2, // Be respectful with request rate
    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 60,

    async requestHandler({ request, $, enqueueLinks, log }) {
        const url = request.url;

        if (url.includes('/Hotels-g')) {
            // This is a hotel listing page
            await handleHotelList($, enqueueLinks, log);
        } else if (url.includes('/Hotel_Review-')) {
            // This is an individual hotel page
            await handleHotelDetail($, request, log);
        }
    },
});

// Start the crawl from one or more listing pages (illustrative URL;
// top-level await requires an ES module or an async wrapper)
await crawler.run([
    'https://www.tripadvisor.com/Hotels-g187147-Paris-Hotels.html',
]);

Extracting Hotel Listings

async function handleHotelList($, enqueueLinks, log) {
    const hotels = [];

    $('div[data-automation="hotel-card-title"]').each((i, el) => {
        const name = $(el).text().trim();
        const link = $(el).find('a').attr('href');
        const fullUrl = `https://www.tripadvisor.com${link}`;

        hotels.push({ name, url: fullUrl });
    });

    log.info(`Found ${hotels.length} hotels on listing page`);

    // Enqueue individual hotel pages for detail scraping
    await enqueueLinks({
        urls: hotels.map(h => h.url),
        label: 'HOTEL_DETAIL',
    });

    // Handle pagination - find next page link
    const nextPage = $('a[data-page-number]').last().attr('href');
    if (nextPage) {
        await enqueueLinks({
            urls: [`https://www.tripadvisor.com${nextPage}`],
            label: 'HOTEL_LIST',
        });
    }
}

Extracting Hotel Details and Reviews

async function handleHotelDetail($, request, log) {
    const hotelData = {
        url: request.url,
        name: $('h1[data-automation="hotel-header-name"]')
              .text().trim(),
        overallRating: parseFloat(
            $('span.biGQs._P.fiohW.uuBRH').first().text()
        ),
        totalReviews: parseInt(
            $('span.biGQs._P.pZUbB.biKBZ').first()
             .text().replace(/[^0-9]/g, '')
        ),
        ranking: $('span.biGQs._P.pZUbB.hmDzD')
                 .first().text().trim(),
        priceRange: $('div[data-automation="hotel-price"]')
                    .text().trim(),
        address: $('span.biGQs._P.pZUbB.egaXP.KxBGd')
                 .text().trim(),
        amenities: [],
        reviews: [],
    };

    // Extract amenities
    $('div[data-automation="amenity"]').each((i, el) => {
        hotelData.amenities.push($(el).text().trim());
    });

    // Extract reviews from current page
    $('div[data-automation="reviewCard"]').each((i, el) => {
        const review = {
            rating: $(el).find('svg.UctUV').length,
            title: $(el).find('a.Qwuub span')
                        .text().trim(),
            text: $(el).find('span.QewHA div span')
                       .text().trim(),
            dateOfStay: $(el).find('span.teHYY')
                            .text().replace('Date of stay: ', ''),
            tripType: $(el).find('span.TDKzw')
                          .text().trim(),
            reviewer: $(el).find('a.ui_header_link')
                          .text().trim(),
            helpfulVotes: parseInt(
                $(el).find('span.biGQs._P.FwFXZ')
                     .text() || '0'
            ),
        };
        hotelData.reviews.push(review);
    });

    log.info(`Extracted ${hotelData.reviews.length} reviews for ${hotelData.name}`);
    await Dataset.pushData(hotelData);
}

Python Approach with Beautiful Soup

For those who prefer Python, here's how to approach the same task:

import requests
from bs4 import BeautifulSoup
import json
import time
import random

class TripadvisorScraper:
    BASE_URL = "https://www.tripadvisor.com"
    HEADERS = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    }

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update(self.HEADERS)

    def get_hotel_reviews(self, hotel_url, max_pages=5):
        all_reviews = []

        for page in range(max_pages):
            if page == 0:
                url = hotel_url
            else:
                offset = page * 10
                url = hotel_url.replace(
                    "-Reviews-",
                    f"-Reviews-or{offset}-"
                )

            response = self.session.get(url)
            if response.status_code != 200:
                print(
                    f"Got HTTP {response.status_code} "
                    f"on page {page + 1}; stopping."
                )
                break
            soup = BeautifulSoup(response.text, "html.parser")

            review_cards = soup.find_all(
                "div",
                attrs={"data-automation": "reviewCard"}
            )

            if not review_cards:
                print(f"No reviews found on page {page + 1}.")
                break

            for card in review_cards:
                review = self._parse_review(card)
                all_reviews.append(review)

            print(
                f"Page {page + 1}: "
                f"extracted {len(review_cards)} reviews"
            )

            # Respectful delay between requests
            time.sleep(random.uniform(2, 5))

        return all_reviews

    def _parse_review(self, card):
        return {
            "title": self._safe_text(
                card.find("a", class_="Qwuub")
            ),
            "text": self._safe_text(
                card.find("span", class_="QewHA")
            ),
            "rating": len(
                card.find_all("svg", class_="UctUV")
            ),
            "date_of_stay": self._safe_text(
                card.find("span", class_="teHYY")
            ),
            "trip_type": self._safe_text(
                card.find("span", class_="TDKzw")
            ),
        }

    @staticmethod
    def _safe_text(element):
        return element.text.strip() if element else None


# Usage
scraper = TripadvisorScraper()
reviews = scraper.get_hotel_reviews(
    "https://www.tripadvisor.com/Hotel_Review-g60763"
    "-d93450-Reviews-The_Plaza-New_York_City.html",
    max_pages=3
)
print(f"Total reviews collected: {len(reviews)}")

Handling Tripadvisor's Anti-Scraping Measures

Tripadvisor employs several techniques to prevent automated access:

1. Rate Limiting

The platform monitors request frequency. Best practices:

  • Keep requests under 1 per 3 seconds per IP
  • Randomize delays between requests
  • Use exponential backoff on HTTP 429 responses
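The backoff rule above can be sketched in a few lines. The base delay and retry cap here are illustrative defaults, not tuned values, and the session can be any object with a requests-style get method:

```python
import random
import time

def fetch_with_backoff(session, url, max_retries=5):
    """GET a URL, sleeping exponentially longer after each HTTP 429."""
    delay = 3  # base delay in seconds, per the rate-limit guideline above
    for attempt in range(max_retries):
        response = session.get(url)
        if response.status_code != 429:
            return response
        # Randomized exponential backoff: ~3s, ~6s, ~12s, ... plus jitter
        time.sleep(delay + random.uniform(0, 1))
        delay *= 2
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts")
```

Jitter matters: if several workers back off on the same schedule, they all retry at the same instant and trip the rate limiter again.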

2. Dynamic Content Loading

Many review sections load via JavaScript. For these cases, you need a browser-based approach:

const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
        },
    },
    async requestHandler({ page, request, log }) {
        // Wait for reviews to load
        await page.waitForSelector(
            'div[data-automation="reviewCard"]',
            { timeout: 15000 }
        );

        // Click "Read more" on truncated reviews
        const readMoreButtons = await page.$$(
            'button[data-automation="readMore"]'
        );
        for (const btn of readMoreButtons) {
            await btn.click();
            await page.waitForTimeout(500);
        }

        // Now extract the full review text
        const reviews = await page.$$eval(
            'div[data-automation="reviewCard"]',
            (cards) => cards.map(card => ({
                title: card.querySelector('a.Qwuub span')
                           ?.textContent?.trim(),
                text: card.querySelector('span.QewHA div span')
                          ?.textContent?.trim(),
            }))
        );

        log.info(`Extracted ${reviews.length} full reviews`);
    },
});

3. IP Blocking

For large-scale scraping, proxy rotation is essential:

const { CheerioCrawler, ProxyConfiguration } = require('crawlee');

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1:8080',
        'http://proxy2:8080',
        'http://proxy3:8080',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    // ... rest of config
});

Using Apify for Tripadvisor Scraping

Building and maintaining your own scraper infrastructure is time-consuming. Apify provides ready-made actors that handle all the complexity:

Why Use Apify?

  • Pre-built actors for Tripadvisor that handle anti-scraping measures
  • Proxy management built in with residential and datacenter proxies
  • Automatic retries and error handling
  • Scheduled runs for regular data collection
  • Dataset export in JSON, CSV, Excel, or via API

Running a Tripadvisor Scraper on Apify

const Apify = require('apify'); // SDK v2 style; in SDK v3 use Actor.call()

// Run a Tripadvisor actor via the Apify SDK. The actor name and input
// fields below are illustrative -- check the actor's page in Apify Store
// for its actual input schema.
const run = await Apify.call('tripadvisor/scraper', {
    startUrls: [
        {
            url: 'https://www.tripadvisor.com/Hotels-g60763-New_York_City-Hotels.html'
        }
    ],
    maxItems: 100,
    includeReviews: true,
    reviewsPerHotel: 50,
    language: 'en',
    proxy: {
        useApifyProxy: true,
        apifyProxyGroups: ['RESIDENTIAL'],
    },
});

// Fetch results
const dataset = await Apify.openDataset(run.defaultDatasetId);
const { items } = await dataset.getData();
console.log(`Scraped ${items.length} hotels with reviews`);

Scheduling Regular Data Collection

For ongoing monitoring (tracking competitor ratings, price changes, new reviews), Apify's scheduling feature is invaluable:

// Create a scheduled task via Apify API
const schedule = {
    name: 'tripadvisor-weekly-scrape',
    cronExpression: '0 6 * * 1', // Every Monday at 6 AM
    actions: [{
        type: 'RUN_ACTOR',
        actorId: 'tripadvisor/scraper',
        runInput: {
            startUrls: [
                {
                    url: 'https://www.tripadvisor.com/'
                         + 'Hotels-g60763-New_York_City.html'
                }
            ],
            maxItems: 200,
        },
    }],
};

Practical Use Cases

1. Competitive Hotel Analysis

Track how your hotel compares to competitors:

import pandas as pd

def analyze_competitors(scraped_data):
    df = pd.DataFrame(scraped_data)

    analysis = df.groupby('hotel_name').agg({
        'overall_rating': 'mean',
        'total_reviews': 'max',
        'price_min': 'min',
        'price_max': 'max',
    }).round(2)

    # Rating trend over time
    review_df = pd.json_normalize(
        scraped_data,
        record_path='reviews',
        meta=['hotel_name']
    )
    # "Month Year" strings parse via dateutil; coerce anything malformed to NaT
    review_df['date'] = pd.to_datetime(
        review_df['date_of_stay'], errors='coerce'
    )

    monthly_ratings = review_df.groupby(
        [review_df['date'].dt.to_period('M'), 'hotel_name']
    )['rating'].mean()

    return analysis, monthly_ratings

2. Review Sentiment Analysis

Combine scraped reviews with NLP:

from collections import Counter

def extract_themes(reviews):
    positive_keywords = [
        'clean', 'friendly', 'location', 'comfortable',
        'spacious', 'helpful', 'breakfast', 'view'
    ]
    negative_keywords = [
        'noise', 'dirty', 'small', 'rude', 'expensive',
        'old', 'broken', 'slow'
    ]

    pos_counts = Counter()
    neg_counts = Counter()

    for review in reviews:
        # the text field may be None if the element was missing during scraping
        text = (review.get('text') or '').lower()
        for kw in positive_keywords:
            if kw in text:
                pos_counts[kw] += 1
        for kw in negative_keywords:
            if kw in text:
                neg_counts[kw] += 1

    return {
        'top_positives': pos_counts.most_common(5),
        'top_negatives': neg_counts.most_common(5),
    }

3. Travel Photo Data Mining

Extract and categorize traveler photos for market research:

async function extractPhotoMetadata($, hotelUrl) {
    const photos = [];

    $('div[data-automation="photo-card"]').each((i, el) => {
        photos.push({
            imageUrl: $(el).find('img').attr('src'),
            caption: $(el).find('span.caption')
                         .text().trim(),
            category: $(el).find('span.category')
                          .text().trim(),
            uploadDate: $(el).find('span.date')
                            .text().trim(),
            uploaderName: $(el).find('a.profile-link')
                              .text().trim(),
        });
    });

    return photos;
}

Data Storage and Export Best Practices

Once you've scraped the data, proper storage matters:

// Export to multiple formats with Apify Dataset
const dataset = await Dataset.open('tripadvisor-hotels');

// Push data during scraping
await dataset.pushData(hotelData);

// After scraping, export as needed
await dataset.exportToCSV('hotels_export');
await dataset.exportToJSON('hotels_export');

For large datasets, consider streaming:

import json

def stream_to_jsonl(reviews, output_file):
    with open(output_file, 'w') as f:
        for review in reviews:
            f.write(json.dumps(review) + '\n')
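Reading the file back can be just as memory-friendly with a generator, which yields one record at a time instead of loading the whole file:

```python
import json

def stream_from_jsonl(input_file):
    """Lazily yield one review dict per line of a JSONL file."""
    with open(input_file) as f:
        for line in f:
            if line.strip():  # skip any blank lines
                yield json.loads(line)
```

Because JSONL is line-delimited, this also tolerates partially written files: a crash mid-write corrupts at most the final line.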

Legal and Ethical Considerations

Web scraping exists in a legal gray area. Key points for Tripadvisor:

  • Tripadvisor's ToS restricts automated access — understand the risks
  • Only scrape publicly available data — never attempt to access private user data
  • Respect robots.txt directives
  • Rate limit your requests to avoid impacting the platform's performance
  • GDPR compliance: if collecting data about EU users, ensure you have a lawful basis
  • Use data responsibly: don't republish reviews as your own content
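Checking robots.txt takes only a few lines with Python's standard library. A sketch — whether a given path is allowed depends on Tripadvisor's current robots.txt, so treat the result as advisory and re-check periodically:

```python
from urllib.robotparser import RobotFileParser

def load_robots(url="https://www.tripadvisor.com/robots.txt"):
    """Fetch and parse the live robots.txt (performs a network request)."""
    parser = RobotFileParser(url)
    parser.read()
    return parser

def is_allowed(parser, path, user_agent="*"):
    """Check whether crawling a Tripadvisor path is permitted for this agent."""
    return parser.can_fetch(user_agent, f"https://www.tripadvisor.com{path}")
```

Usage would be `is_allowed(load_robots(), "/Hotel_Review-g187147-d188150-Reviews-Hotel_Name.html")`, called once per crawl rather than per request, since robots.txt rarely changes mid-run.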

Conclusion

Tripadvisor scraping opens up powerful possibilities for travel industry analytics, competitive intelligence, and market research. The platform's structured data — from hotel ratings and review text to traveler photos and pricing — provides a goldmine of insights when properly collected and analyzed.

The key to successful Tripadvisor scraping is understanding the platform's structure, respecting rate limits, handling dynamic content properly, and using reliable infrastructure. Whether you build a custom scraper or leverage Apify's ready-made solutions, the techniques covered in this guide will help you extract the travel data you need efficiently and responsibly.

Start small with a single hotel or restaurant, validate your extraction logic, and then scale up gradually. The travel data landscape is vast — the opportunities for analysis and insight are virtually unlimited.
