The travel industry generates massive amounts of publicly available data every day. Tripadvisor alone hosts over 1 billion reviews across hotels, restaurants, attractions, and experiences worldwide. For businesses in hospitality, market research firms, and data analysts, this information is incredibly valuable — but manually collecting it is practically impossible.
In this guide, we'll walk through how Tripadvisor structures its data, what you can extract, and how to build reliable scrapers that handle the platform's complexity. Whether you're tracking competitor hotel ratings, analyzing restaurant review sentiment, or building a travel data aggregator, this article covers everything you need to know.
Understanding Tripadvisor's Data Structure
Before writing any code, it's essential to understand how Tripadvisor organizes its content. The platform follows a hierarchical structure:
Location Pages → Entity Pages → Review Pages
Location Hierarchy
Tripadvisor organizes content geographically:
- Continent → Country → State/Region → City → Neighborhood
- Each level has its own URL pattern, e.g. tripadvisor.com/Tourism-g187147-Paris-Vacations.html
- The g parameter represents a geographic ID that's consistent across the platform
Entity Types
Within each location, entities are categorized:
- Hotels: /Hotel_Review-g187147-d188150-Reviews-Hotel_Name.html
- Restaurants: /Restaurant_Review-g187147-d1751525-Reviews-Restaurant_Name.html
- Attractions: /Attraction_Review-g187147-d188757-Reviews-Attraction_Name.html
The d parameter is the unique entity ID. Understanding these URL patterns is crucial for building systematic scrapers.
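Because the g and d IDs appear in every URL, they are easy to recover programmatically. A minimal sketch (the helper name is our own, not part of any library):

```python
import re

def parse_tripadvisor_ids(url):
    """Extract the geographic (g) and entity (d) IDs from a Tripadvisor URL."""
    geo = re.search(r"-g(\d+)", url)
    entity = re.search(r"-d(\d+)", url)
    return {
        "geo_id": geo.group(1) if geo else None,
        "entity_id": entity.group(1) if entity else None,
    }

# A hotel URL carries both IDs; a location page carries only the g ID
ids = parse_tripadvisor_ids(
    "/Hotel_Review-g187147-d188150-Reviews-Hotel_Name.html"
)
# → {'geo_id': '187147', 'entity_id': '188150'}
```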
Review Pagination
Reviews on Tripadvisor follow a predictable pagination pattern:
- First page: /Hotel_Review-g187147-d188150-Reviews-Hotel_Name.html
- Second page: /Hotel_Review-g187147-d188150-Reviews-or10-Hotel_Name.html
- Third page: /Hotel_Review-g187147-d188150-Reviews-or20-Hotel_Name.html
The or parameter increments by 10 (the default number of reviews per page). This means for a hotel with 500 reviews, you'd need to paginate through 50 pages.
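Given this offset pattern, you can precompute every review-page URL for a listing up front. A sketch assuming the default 10 reviews per page (the function name is illustrative):

```python
import math

def review_page_urls(base_url, total_reviews, per_page=10):
    """Generate the URL of every review page for a listing.

    Page N (N >= 2) inserts an -orOFFSET- segment into the URL,
    e.g. -Reviews-or10- for the second page.
    """
    pages = math.ceil(total_reviews / per_page)
    urls = [base_url]  # first page has no offset segment
    for page in range(1, pages):
        offset = page * per_page
        urls.append(base_url.replace("-Reviews-", f"-Reviews-or{offset}-"))
    return urls

urls = review_page_urls(
    "/Hotel_Review-g187147-d188150-Reviews-Hotel_Name.html", 25
)
# 25 reviews → 3 pages: the base URL, then -or10- and -or20- variants
```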
What Data Can You Extract?
Hotel Data Points
From a typical hotel listing page, you can extract:
| Data Point | Location | Notes |
|---|---|---|
| Hotel name | H1 tag | Primary identifier |
| Overall rating | span.biGQs | Scale of 1-5 (bubbles) |
| Number of reviews | Review count section | Total across all languages |
| Price range | Price section | Nightly rate range |
| Amenities | Amenities section | Pool, WiFi, parking, etc. |
| Location/address | Address section | Full street address |
| Star classification | Hotel class | 1-5 star rating |
| Photos count | Photo gallery | Total uploaded photos |
| Nearby attractions | Nearby section | Points of interest |
Review Data Points
Each individual review contains:
- Reviewer name and profile link
- Rating (1-5 bubbles)
- Review title and full text
- Date of stay (month and year)
- Trip type (business, couples, family, solo, friends)
- Room tip (optional)
- Photos attached to the review
- Helpful votes count
- Management response (if any)
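Taken together, these fields suggest a record shape like the following dataclass — a sketch only; the field names are our own choice, not Tripadvisor's:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Review:
    reviewer_name: str
    reviewer_profile_url: str
    rating: int                     # 1-5 bubbles
    title: str
    text: str
    date_of_stay: str               # month and year, e.g. "March 2024"
    trip_type: str                  # business, couples, family, solo, friends
    room_tip: Optional[str] = None  # not every review includes one
    photo_urls: List[str] = field(default_factory=list)
    helpful_votes: int = 0
    management_response: Optional[str] = None
```

Defining the schema up front makes it easier to validate scraped records before they enter your pipeline.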
Traveler Photos Metadata
Tripadvisor's photo section is rich with metadata:
- Upload date
- Caption text
- Associated review link
- Photo category (room, view, pool, food, etc.)
- Uploader profile information
Building a Tripadvisor Scraper with Node.js
Let's build a practical scraper. We'll use Crawlee, the open-source web scraping library that powers Apify actors.
Setting Up the Project
// Install dependencies first: npm install crawlee
const { CheerioCrawler, Dataset } = require('crawlee');

const crawler = new CheerioCrawler({
    maxConcurrency: 2, // Be respectful with request rate
    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 60,
    async requestHandler({ request, $, enqueueLinks, log }) {
        const url = request.url;
        if (url.includes('/Hotels-g')) {
            // This is a hotel listing page
            await handleHotelList($, enqueueLinks, log);
        } else if (url.includes('/Hotel_Review-')) {
            // This is an individual hotel page
            await handleHotelDetail($, request, log);
        }
    },
});

// Start the crawl from a city's hotel listing page
crawler.run([
    'https://www.tripadvisor.com/Hotels-g187147-Paris-Hotels.html',
]);
Extracting Hotel Listings
async function handleHotelList($, enqueueLinks, log) {
    const hotels = [];
    $('div[data-automation="hotel-card-title"]').each((i, el) => {
        const name = $(el).text().trim();
        const link = $(el).find('a').attr('href');
        const fullUrl = `https://www.tripadvisor.com${link}`;
        hotels.push({ name, url: fullUrl });
    });
    log.info(`Found ${hotels.length} hotels on listing page`);

    // Enqueue individual hotel pages for detail scraping
    await enqueueLinks({
        urls: hotels.map(h => h.url),
        label: 'HOTEL_DETAIL',
    });

    // Handle pagination - find next page link
    const nextPage = $('a[data-page-number]').last().attr('href');
    if (nextPage) {
        await enqueueLinks({
            urls: [`https://www.tripadvisor.com${nextPage}`],
            label: 'HOTEL_LIST',
        });
    }
}
Extracting Hotel Details and Reviews
async function handleHotelDetail($, request, log) {
    const hotelData = {
        url: request.url,
        name: $('h1[data-automation="hotel-header-name"]')
            .text().trim(),
        overallRating: parseFloat(
            $('span.biGQs._P.fiohW.uuBRH').first().text()
        ),
        totalReviews: parseInt(
            $('span.biGQs._P.pZUbB.biKBZ').first()
                .text().replace(/[^0-9]/g, '')
        ),
        ranking: $('span.biGQs._P.pZUbB.hmDzD')
            .first().text().trim(),
        priceRange: $('div[data-automation="hotel-price"]')
            .text().trim(),
        address: $('span.biGQs._P.pZUbB.egaXP.KxBGd')
            .text().trim(),
        amenities: [],
        reviews: [],
    };

    // Extract amenities
    $('div[data-automation="amenity"]').each((i, el) => {
        hotelData.amenities.push($(el).text().trim());
    });

    // Extract reviews from current page
    $('div[data-automation="reviewCard"]').each((i, el) => {
        const review = {
            rating: $(el).find('svg.UctUV').length,
            title: $(el).find('a.Qwuub span')
                .text().trim(),
            text: $(el).find('span.QewHA div span')
                .text().trim(),
            dateOfStay: $(el).find('span.teHYY')
                .text().replace('Date of stay: ', ''),
            tripType: $(el).find('span.TDKzw')
                .text().trim(),
            reviewer: $(el).find('a.ui_header_link')
                .text().trim(),
            helpfulVotes: parseInt(
                $(el).find('span.biGQs._P.FwFXZ')
                    .text() || '0'
            ),
        };
        hotelData.reviews.push(review);
    });

    log.info(`Extracted ${hotelData.reviews.length} reviews for ${hotelData.name}`);
    await Dataset.pushData(hotelData);
}
Python Approach with Beautiful Soup
For those who prefer Python, here's how to approach the same task:
import requests
from bs4 import BeautifulSoup
import json
import time
import random
class TripadvisorScraper:
    BASE_URL = "https://www.tripadvisor.com"
    HEADERS = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    }

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update(self.HEADERS)

    def get_hotel_reviews(self, hotel_url, max_pages=5):
        all_reviews = []
        for page in range(max_pages):
            if page == 0:
                url = hotel_url
            else:
                offset = page * 10
                url = hotel_url.replace(
                    "-Reviews-",
                    f"-Reviews-or{offset}-"
                )
            response = self.session.get(url)
            soup = BeautifulSoup(response.text, "html.parser")
            review_cards = soup.find_all(
                "div",
                attrs={"data-automation": "reviewCard"}
            )
            if not review_cards:
                print(f"No reviews found on page {page + 1}.")
                break
            for card in review_cards:
                review = self._parse_review(card)
                all_reviews.append(review)
            print(
                f"Page {page + 1}: "
                f"extracted {len(review_cards)} reviews"
            )
            # Respectful delay between requests
            time.sleep(random.uniform(2, 5))
        return all_reviews

    def _parse_review(self, card):
        return {
            "title": self._safe_text(
                card.find("a", class_="Qwuub")
            ),
            "text": self._safe_text(
                card.find("span", class_="QewHA")
            ),
            "rating": len(
                card.find_all("svg", class_="UctUV")
            ),
            "date_of_stay": self._safe_text(
                card.find("span", class_="teHYY")
            ),
            "trip_type": self._safe_text(
                card.find("span", class_="TDKzw")
            ),
        }

    @staticmethod
    def _safe_text(element):
        return element.text.strip() if element else None

# Usage
scraper = TripadvisorScraper()
reviews = scraper.get_hotel_reviews(
    "https://www.tripadvisor.com/Hotel_Review-g60763"
    "-d93450-Reviews-The_Plaza-New_York_City.html",
    max_pages=3
)
print(f"Total reviews collected: {len(reviews)}")
Handling Tripadvisor's Anti-Scraping Measures
Tripadvisor employs several techniques to prevent automated access:
1. Rate Limiting
The platform monitors request frequency. Best practices:
- Keep requests under 1 per 3 seconds per IP
- Randomize delays between requests
- Use exponential backoff on HTTP 429 responses
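The backoff recommendation above can be sketched as full-jitter exponential backoff (the function names here are illustrative; `session` is assumed to be a `requests.Session`):

```python
import random
import time

def backoff_delay(attempt, base=3.0, cap=60.0):
    """Delay before retry `attempt` (0-indexed): exponential growth
    with full jitter, capped to avoid unbounded waits."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def fetch_with_backoff(session, url, max_retries=5):
    # Illustrative retry loop: back off only on HTTP 429 responses
    for attempt in range(max_retries):
        response = session.get(url)
        if response.status_code != 429:
            return response
        time.sleep(backoff_delay(attempt))
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts")
```

Jitter matters here: retrying at fixed intervals makes your traffic pattern easier to fingerprint and can synchronize retries across workers.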
2. Dynamic Content Loading
Many review sections load via JavaScript. For these cases, you need a browser-based approach:
const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
        },
    },
    async requestHandler({ page, request, log }) {
        // Wait for reviews to load
        await page.waitForSelector(
            'div[data-automation="reviewCard"]',
            { timeout: 15000 }
        );

        // Click "Read more" on truncated reviews
        const readMoreButtons = await page.$$(
            'button[data-automation="readMore"]'
        );
        for (const btn of readMoreButtons) {
            await btn.click();
            await page.waitForTimeout(500);
        }

        // Now extract the full review text
        const reviews = await page.$$eval(
            'div[data-automation="reviewCard"]',
            (cards) => cards.map(card => ({
                title: card.querySelector('a.Qwuub span')
                    ?.textContent?.trim(),
                text: card.querySelector('span.QewHA div span')
                    ?.textContent?.trim(),
            }))
        );
        log.info(`Extracted ${reviews.length} full reviews`);
    },
});
3. IP Blocking
For large-scale scraping, proxy rotation is essential:
const { CheerioCrawler, ProxyConfiguration } = require('crawlee');

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1:8080',
        'http://proxy2:8080',
        'http://proxy3:8080',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    // ... rest of config
});
Using Apify for Tripadvisor Scraping
Building and maintaining your own scraper infrastructure is time-consuming. Apify provides ready-made actors that handle all the complexity:
Why Use Apify?
- Pre-built actors for Tripadvisor that handle anti-scraping measures
- Proxy management built in with residential and datacenter proxies
- Automatic retries and error handling
- Scheduled runs for regular data collection
- Dataset export in JSON, CSV, Excel, or via API
Running a Tripadvisor Scraper on Apify
const Apify = require('apify');

// Using the Apify SDK to run a Tripadvisor actor
Apify.main(async () => {
    const run = await Apify.call('tripadvisor/scraper', {
        startUrls: [
            {
                url: 'https://www.tripadvisor.com/Hotels-g60763-New_York_City-Hotels.html'
            }
        ],
        maxItems: 100,
        includeReviews: true,
        reviewsPerHotel: 50,
        language: 'en',
        proxy: {
            useApifyProxy: true,
            apifyProxyGroups: ['RESIDENTIAL'],
        },
    });

    // Fetch results
    const dataset = await Apify.openDataset(run.defaultDatasetId);
    const { items } = await dataset.getData();
    console.log(`Scraped ${items.length} hotels with reviews`);
});
Scheduling Regular Data Collection
For ongoing monitoring (tracking competitor ratings, price changes, new reviews), Apify's scheduling feature is invaluable:
// Create a scheduled task via Apify API
const schedule = {
    name: 'tripadvisor-weekly-scrape',
    cronExpression: '0 6 * * 1', // Every Monday at 6 AM
    actions: [{
        type: 'RUN_ACTOR',
        actorId: 'tripadvisor/scraper',
        runInput: {
            startUrls: [
                {
                    url: 'https://www.tripadvisor.com/'
                        + 'Hotels-g60763-New_York_City.html'
                }
            ],
            maxItems: 200,
        },
    }],
};
Practical Use Cases
1. Competitive Hotel Analysis
Track how your hotel compares to competitors:
import pandas as pd

def analyze_competitors(scraped_data):
    df = pd.DataFrame(scraped_data)
    analysis = df.groupby('hotel_name').agg({
        'overall_rating': 'mean',
        'total_reviews': 'max',
        'price_min': 'min',
        'price_max': 'max',
    }).round(2)

    # Rating trend over time
    review_df = pd.json_normalize(
        scraped_data,
        record_path='reviews',
        meta=['hotel_name']
    )
    review_df['date'] = pd.to_datetime(
        review_df['date_of_stay']
    )
    monthly_ratings = review_df.groupby(
        [review_df['date'].dt.to_period('M'), 'hotel_name']
    )['rating'].mean()

    return analysis, monthly_ratings
2. Review Sentiment Analysis
Combine scraped reviews with NLP:
from collections import Counter

def extract_themes(reviews):
    positive_keywords = [
        'clean', 'friendly', 'location', 'comfortable',
        'spacious', 'helpful', 'breakfast', 'view'
    ]
    negative_keywords = [
        'noise', 'dirty', 'small', 'rude', 'expensive',
        'old', 'broken', 'slow'
    ]
    pos_counts = Counter()
    neg_counts = Counter()
    for review in reviews:
        text = review['text'].lower()
        for kw in positive_keywords:
            if kw in text:
                pos_counts[kw] += 1
        for kw in negative_keywords:
            if kw in text:
                neg_counts[kw] += 1
    return {
        'top_positives': pos_counts.most_common(5),
        'top_negatives': neg_counts.most_common(5),
    }
3. Travel Photo Data Mining
Extract and categorize traveler photos for market research:
async function extractPhotoMetadata($, hotelUrl) {
    const photos = [];
    $('div[data-automation="photo-card"]').each((i, el) => {
        photos.push({
            imageUrl: $(el).find('img').attr('src'),
            caption: $(el).find('span.caption')
                .text().trim(),
            category: $(el).find('span.category')
                .text().trim(),
            uploadDate: $(el).find('span.date')
                .text().trim(),
            uploaderName: $(el).find('a.profile-link')
                .text().trim(),
        });
    });
    return photos;
}
Data Storage and Export Best Practices
Once you've scraped the data, proper storage matters:
// Export to multiple formats with Apify Dataset
const dataset = await Dataset.open('tripadvisor-hotels');
// Push data during scraping
await dataset.pushData(hotelData);
// After scraping, export as needed
await dataset.exportToCSV('hotels_export');
await dataset.exportToJSON('hotels_export');
For large datasets, consider streaming:
import json

def stream_to_jsonl(reviews, output_file):
    with open(output_file, 'w') as f:
        for review in reviews:
            f.write(json.dumps(review) + '\n')
Legal and Ethical Considerations
Web scraping exists in a legal gray area. Key points for Tripadvisor:
- Tripadvisor's ToS restricts automated access — understand the risks
- Only scrape publicly available data — never attempt to access private user data
- Respect robots.txt directives
- Rate limit your requests to avoid impacting the platform's performance
- GDPR compliance: if collecting data about EU users, ensure you have a lawful basis
- Use data responsibly: don't republish reviews as your own content
Conclusion
Tripadvisor scraping opens up powerful possibilities for travel industry analytics, competitive intelligence, and market research. The platform's structured data — from hotel ratings and review text to traveler photos and pricing — provides a goldmine of insights when properly collected and analyzed.
The key to successful Tripadvisor scraping is understanding the platform's structure, respecting rate limits, handling dynamic content properly, and using reliable infrastructure. Whether you build a custom scraper or leverage Apify's ready-made solutions, the techniques covered in this guide will help you extract the travel data you need efficiently and responsibly.
Start small with a single hotel or restaurant, validate your extraction logic, and then scale up gradually. The travel data landscape is vast — the opportunities for analysis and insight are virtually unlimited.