Tripadvisor has over 1 billion reviews across 8 million listings. Whether you're building a travel aggregator, running competitive analysis for a hotel chain, or doing sentiment research on tourism trends, that data is incredibly valuable.
In this guide, I'll show you how to scrape Tripadvisor hotel listings, reviews, and attraction data using Python in 2026 — including how to handle their aggressive anti-bot protection.
Why Scrape Tripadvisor?
- Competitive intelligence: Track competitor hotel ratings, review counts, and pricing across markets
- Sentiment analysis: Aggregate guest feedback to identify common complaints or praise
- Market research: Analyze attraction popularity and seasonal trends in specific destinations
- Price monitoring: Track nightly rates across hotels in a region
What You'll Need
pip install requests beautifulsoup4 lxml
You'll also need a proxy solution — Tripadvisor sits behind Cloudflare and uses aggressive fingerprinting. I recommend ScraperAPI, which handles rotation, CAPTCHAs, and headers automatically, or ThorData residential proxies if you want more control.
Step 1: Scrape Hotel Search Results
Tripadvisor's search URLs follow a predictable pattern. Here's how to grab hotel listings for a given city:
import requests
from bs4 import BeautifulSoup
import json
import time
SCRAPER_API_KEY = "YOUR_SCRAPERAPI_KEY"
import re  # Used to insert the pagination offset into the URL

def scrape_hotels(city_url: str, pages: int = 3) -> list[dict]:
    """Scrape hotel listings from Tripadvisor search results."""
    hotels = []
    for page in range(pages):
        offset = page * 30
        # Tripadvisor paginates by inserting "oa{offset}-" right after the
        # geo code, e.g. Hotels-g187497-oa30-Barcelona_Catalonia-Hotels.html.
        # Page 0 uses the base URL with no offset segment.
        if offset:
            url = re.sub(r"(-g\d+-)", rf"\g<1>oa{offset}-", city_url, count=1)
        else:
            url = city_url
        # Use ScraperAPI to bypass anti-bot
        api_url = f"http://api.scraperapi.com?api_key={SCRAPER_API_KEY}&url={url}&render=true"
        response = requests.get(api_url, timeout=60)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        # Extract structured data from JSON-LD
        for script in soup.find_all("script", type="application/ld+json"):
            try:
                data = json.loads(script.string)
            except (json.JSONDecodeError, TypeError):
                continue
            if isinstance(data, dict) and data.get("@type") == "Hotel":
                hotels.append({
                    "name": data.get("name"),
                    "rating": data.get("aggregateRating", {}).get("ratingValue"),
                    "review_count": data.get("aggregateRating", {}).get("reviewCount"),
                    "price_range": data.get("priceRange"),
                    "address": data.get("address", {}).get("streetAddress"),
                })
        time.sleep(2)  # Be respectful with request timing
    return hotels
# Example: Hotels in Barcelona
hotels = scrape_hotels(
"https://www.tripadvisor.com/Hotels-g187497-Barcelona_Catalonia-Hotels.html"
)
for h in hotels[:5]:
    print(f"{h['name']} — {h['rating']}⭐ ({h['review_count']} reviews)")
The key insight here is using JSON-LD structured data that Tripadvisor embeds for SEO. It's cleaner and more reliable than parsing HTML classes that change frequently.
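One wrinkle: JSON-LD blocks aren't always a single flat object. Publishers commonly emit a top-level list or wrap nodes in an @graph container, so a small normalizer keeps the extraction robust. The payloads below are illustrative samples, not live Tripadvisor markup:

```python
import json

def iter_jsonld_nodes(raw_scripts):
    """Yield every JSON-LD node, unwrapping top-level lists and @graph containers."""
    for raw in raw_scripts:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        nodes = data if isinstance(data, list) else [data]
        for node in nodes:
            if isinstance(node, dict) and "@graph" in node:
                yield from (n for n in node["@graph"] if isinstance(n, dict))
            elif isinstance(node, dict):
                yield node

# Sample script contents showing the three shapes you'll encounter
scripts = [
    '{"@context": "https://schema.org", "@graph": [{"@type": "Hotel", "name": "Hotel A"}]}',
    '[{"@type": "Hotel", "name": "Hotel B"}, {"@type": "BreadcrumbList"}]',
    'not json',
]
hotels = [n["name"] for n in iter_jsonld_nodes(scripts) if n.get("@type") == "Hotel"]
print(hotels)  # ['Hotel A', 'Hotel B']
```

Feed it `script.string` from every `application/ld+json` tag and filter on `@type` afterward, and the parsing loop no longer cares which shape a given page uses.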
Step 2: Scrape Individual Hotel Reviews
Once you have hotel URLs, you can drill into individual reviews:
def scrape_reviews(hotel_url: str, max_pages: int = 5) -> list[dict]:
    """Scrape reviews for a specific hotel."""
    reviews = []
    for page in range(max_pages):
        offset = page * 10
        # Reviews paginate with "or{offset}-" inserted after "-Reviews-";
        # page 0 uses the base URL unchanged
        url = hotel_url if offset == 0 else hotel_url.replace("-Reviews-", f"-Reviews-or{offset}-")
        api_url = f"http://api.scraperapi.com?api_key={SCRAPER_API_KEY}&url={url}&render=true"
        response = requests.get(api_url, timeout=60)
        soup = BeautifulSoup(response.text, "lxml")
        # The class names below rotate with Tripadvisor deployments; verify
        # them against the live page before relying on this in production
        review_cards = soup.select("[data-reviewid]")
        for card in review_cards:
            title_el = card.select_one(".yCeTE")
            body_el = card.select_one(".QewHA span")
            rating_el = card.select_one("svg.UctUV title")
            date_el = card.select_one(".teHYY")
            reviews.append({
                "title": title_el.text.strip() if title_el else None,
                "body": body_el.text.strip() if body_el else None,
                "rating": rating_el.text.split()[0] if rating_el else None,
                "date": date_el.text.strip() if date_el else None,
            })
        time.sleep(2)
    return reviews
reviews = scrape_reviews(
"https://www.tripadvisor.com/Hotel_Review-g187497-d228489-Reviews-Hotel_Arts_Barcelona.html"
)
print(f"Scraped {len(reviews)} reviews")
Step 3: Scrape Attractions and Things to Do
Tripadvisor's attraction pages follow a similar structure. Here's how to pull the top things to do in a city:
def scrape_attractions(city_url: str) -> list[dict]:
    """Scrape top attractions for a city."""
    api_url = f"http://api.scraperapi.com?api_key={SCRAPER_API_KEY}&url={city_url}&render=true"
    response = requests.get(api_url, timeout=60)
    soup = BeautifulSoup(response.text, "lxml")
    attractions = []
    for card in soup.select("div.alPVI"):
        name_el = card.select_one("header.aJiSb")
        rating_el = card.select_one("svg.UctUV title")
        count_el = card.select_one("span.biGQs")
        attractions.append({
            "name": name_el.text.strip() if name_el else None,
            "rating": rating_el.text.split()[0] if rating_el else None,
            "review_count": count_el.text.strip() if count_el else None,
        })
    return attractions
attractions = scrape_attractions(
"https://www.tripadvisor.com/Attractions-g187497-Activities-Barcelona_Catalonia.html"
)
for a in attractions[:10]:
    print(f"{a['name']} — {a['rating']}⭐ ({a['review_count']})")
Handling Anti-Bot Protection
Tripadvisor is one of the harder sites to scrape in 2026. Here's what you'll encounter:
- Cloudflare challenges — JavaScript challenges that block simple HTTP requests
- Fingerprint detection — They track TLS fingerprints, canvas hashes, and WebGL data
- Rate limiting — Too many requests from one IP will get you blocked fast
- Dynamic selectors — CSS class names change with each deployment
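The dynamic-selector problem is worth defending against in code: prefer stable attributes like `data-reviewid` over obfuscated class names, and try a ranked list of selectors so a single redeploy doesn't break your whole pipeline. A sketch of the pattern — the helper works with any object exposing a BeautifulSoup-style `select_one`, and the selectors in the usage comment are examples, not guarantees:

```python
def select_first(node, selectors):
    """Try CSS selectors in priority order; return the first match or None.

    Put stable, attribute-based selectors first and fragile class names
    last, so a site redeploy degrades gracefully instead of failing hard.
    """
    for sel in selectors:
        el = node.select_one(sel)
        if el is not None:
            return el
    return None

# Usage inside a review loop (selector names are illustrative):
# title_el = select_first(card, ["[data-automation='reviewTitle']", ".yCeTE"])
```

Logging which selector actually matched also gives you an early warning when the primary ones start failing.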
Solutions
- ScraperAPI handles all of this automatically — Cloudflare bypass, residential proxy rotation, and JavaScript rendering. It's the fastest way to get started.
- ThorData residential proxies give you a pool of real residential IPs. Pair them with Playwright for JavaScript rendering if you want full control.
- Pre-built scrapers: If you don't want to maintain your own code, check out ready-made Tripadvisor actors on Apify that handle all the edge cases.
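If you go the residential-proxy route, the proxy wiring is the only new piece. The host, port, and credential format below are placeholders, not real ThorData endpoints — check your provider dashboard for the actual values:

```python
def residential_proxy(user: str, password: str,
                      host: str = "proxy.example.net", port: int = 9999) -> dict:
    """Build a requests-style proxies dict for an authenticated residential proxy."""
    auth = f"http://{user}:{password}@{host}:{port}"
    return {"http": auth, "https": auth}

# With requests:
#   requests.get(url, proxies=residential_proxy(USER, PASS), timeout=60)
# With Playwright, pass the same details via
#   browser = p.chromium.launch(proxy={"server": ..., "username": ..., "password": ...})
```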
Scaling with Async Requests
For production workloads scraping hundreds of pages, use async to speed things up:
import asyncio
import aiohttp

async def scrape_batch(urls: list[str], concurrency: int = 5) -> list[str]:
    """Scrape multiple pages concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(concurrency)

    async def fetch(session: aiohttp.ClientSession, url: str) -> str:
        async with semaphore:
            api_url = f"http://api.scraperapi.com?api_key={SCRAPER_API_KEY}&url={url}&render=true"
            async with session.get(api_url) as resp:
                html = await resp.text()
            await asyncio.sleep(1)  # Pace requests even within the concurrency cap
            return html

    async with aiohttp.ClientSession() as session:
        # gather() returns results in input order, regardless of completion order
        return await asyncio.gather(*(fetch(session, url) for url in urls))
Keep concurrency at 3-5 for Tripadvisor. Going higher risks triggering their rate limiters even with rotating proxies.
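The semaphore pattern is easy to exercise offline by swapping in a stub fetcher, which makes the rate-limiting logic unit-testable without aiohttp or a network connection. Note that asyncio.gather hands results back in input order:

```python
import asyncio

async def run_all(urls, worker, concurrency: int = 3, delay: float = 0.01):
    """Run worker(url) for every URL, with at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def limited(url):
        async with sem:
            result = await worker(url)
            await asyncio.sleep(delay)  # Per-request pacing inside the cap
            return result

    # gather() preserves input order regardless of completion order
    return await asyncio.gather(*(limited(u) for u in urls))

async def fake_fetch(url: str) -> str:  # Stand-in for the real ScraperAPI call
    await asyncio.sleep(0.005)
    return f"html:{url}"

pages = asyncio.run(run_all([f"page{i}" for i in range(10)], fake_fetch))
print(pages[0], pages[-1])  # html:page0 html:page9
```

Swap `fake_fetch` for the real fetch coroutine once the pacing behaves the way you want.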
Legal Considerations
Tripadvisor's Terms of Service prohibit scraping. That said, U.S. courts have held that scraping publicly available data does not violate the Computer Fraud and Abuse Act (see hiQ Labs v. LinkedIn), though that case ultimately settled and breach-of-contract claims remain a live risk. This isn't legal advice; a few ground rules:
- Don't scrape private user data (emails, full names with identifying info)
- Respect robots.txt rate limits
- Don't overload their servers — add delays between requests
- Use the data for analysis, not for republishing their content verbatim
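Checking robots.txt is easy to automate with the standard library. The rules below are illustrative only, not Tripadvisor's actual file — fetch https://www.tripadvisor.com/robots.txt for the real thing:

```python
from urllib.robotparser import RobotFileParser

def parse_robots(robots_txt: str) -> RobotFileParser:
    """Parse raw robots.txt text into a queryable RobotFileParser."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

# Illustrative rules only — use the site's real robots.txt in practice
rules = parse_robots("""\
User-agent: *
Disallow: /Search
Crawl-delay: 10
""")
print(rules.can_fetch("my-bot", "https://example.com/Search?q=hotels"))  # False
print(rules.can_fetch("my-bot", "https://example.com/Hotels-g187497"))   # True
print(rules.crawl_delay("my-bot"))  # 10
```

Wiring `can_fetch` and `crawl_delay` into your request loop turns the "respect robots.txt" rule from a good intention into enforced behavior.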
Wrapping Up
Tripadvisor scraping in 2026 is very doable with the right tools. The JSON-LD approach is the cleanest path for structured data, and a proxy service like ScraperAPI or ThorData will save you hours of fighting anti-bot systems.
If you want to skip the code entirely, Apify has pre-built scrapers that run in the cloud with zero infrastructure to manage.
Questions? Drop them in the comments.