Devil Scrapes

Posted on Jun 3

Viator Scraper + GetYourGuide in One Run: OTA price data for $3.05/1K

#webscraping #python #apify #data

Quick answer: There is no unified API for Viator and GetYourGuide. To compare how both platforms price and rank the same tour, you scrape them. A Viator scraper + GetYourGuide scraper in a single Actor call returns every activity on the first search-results page for a destination — titles, prices, ratings, durations, booking URLs — normalized into one Pydantic-validated row. The Tours and Activity Unified Actor does this for $0.003 per activity row (~$3.05 per 1,000), with TLS fingerprinting, proxy rotation, and per-platform failure isolation handled for you.

If you work in the tours-and-activities layer of the travel industry, you already know the pain: Viator and GetYourGuide list many of the same experiences under different SKUs, different prices, and different ranking signals. A quick manual check for one destination takes twenty minutes. A systematic comparison across twenty destinations is a two-developer-week project — write a scraper per platform, figure out both HTML structures, normalize the schemas, reconcile the currency formats. Nobody ships this as a repeatable data feed.

Until now you had to run two single-source scrapers and write the glue yourself. This Actor runs both scrapers and the normalization step in one API call.

What are Viator and GetYourGuide? 🗺

Viator (owned by Tripadvisor) and GetYourGuide are the two dominant online marketplaces for tours, activities, and experiences — think skip-the-line museum tickets, food tours, sailing trips, cooking classes. Together they list millions of bookable experiences across every major destination. Both display structured search results: title, price, duration, rating, review count, and a booking URL per activity card.

Neither publishes an official data API for their search results.

Do Viator or GetYourGuide have an API? 🔌

No official public API for search results exists on either platform. Viator has a closed partner API (for booking integration, not price research); GetYourGuide has a private affiliate feed accessible only to approved partners. Neither exposes a documented endpoint you can call for arbitrary destination queries and get back structured activity data. The only programmatic surface is each platform's own search page — which means scraping is the path.

And scraping these two platforms is genuinely difficult for reasons that go beyond writing a CSS selector.

What the data looks like

Each activity comes back as one flat, typed row. Here is a real Viator result for Paris, verified 2026-05-16:

{
  "platform": "viator",
  "activity_id": "382015P1",
  "activity_title": "Eiffel Tower Dedicated Reserved Access Top or 2nd floor by lift",
  "location_query": "Paris",
  "location_city": "Paris",
  "location_country": null,
  "price_usd": null,
  "currency_original": "EUR",
  "price_original": 39.0,
  "duration_hours": 2.0,
  "rating": 4.6,
  "review_count": 3704,
  "operator_name": null,
  "category": "tour",
  "booking_url": "https://www.viator.com/tours/Paris/Eiffel-Tower-Summit-Access/d479-382015P1",
  "image_url": "https://dynamic-media.tacdn.com/media/photo-o/2e/a7/07/c5/caption.jpg?w=800&h=600&s=1",
  "scraped_at": "2026-05-16T22:00:00.000Z"
}

Seventeen fields, same shape across both platforms, validated with Pydantic v2 before anything is written. Drops straight into Pandas, a BI dashboard, or your own warehouse — no per-platform wrangling on your side.

The naive approach (and why it falls apart) 🧱

The obvious path:

1. Open Chrome DevTools on viator.com/searchResults/all?text=Paris
2. Find the card HTML, write a requests.get() loop
3. Repeat for GetYourGuide, build a small normalization layer
4. Ship

Here is where each step breaks:

Viator's TLS fingerprint wall. Viator's Apache server passes a 200 to browsers, but datacenter Python requests without a real browser TLS profile get throttled or silently served degraded pages. We rotate through Chrome 131, Chrome 124, and Firefox 147 impersonation profiles via curl-cffi — so the TLS ClientHello, ALPN extension order, and HTTP/2 SETTINGS frame match a real browser. Without this, you get a page; the page just doesn't have your cards on it.

GetYourGuide sits behind Cloudflare. GetYourGuide's search endpoint returns a 200 with full server-rendered card content for clean traffic — but Cloudflare activates at scale and for datacenter IP ranges. We thread Apify residential proxies (BUYPROXIES94952) with sticky sessions so each destination query keeps one stable exit IP and cookie jar. We retry on 408/429/503 with exponential backoff (base 2 s, doubles, capped at 30 s, up to 5 attempts), honouring Retry-After headers when the platform sends them.
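The retry policy described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Actor's actual code: `retry_delay` and `fetch_with_retry` are hypothetical names, `fetch` stands in for whatever performs the HTTP request, "up to 5 attempts" is read as five total tries, and only the numeric (seconds) form of Retry-After is handled.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

RETRYABLE = {408, 429, 503}  # status codes the article says are retried

# (status, headers, body) -- a hypothetical response shape for this sketch
Response = Tuple[int, Mapping[str, str], str]


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 2.0, cap: float = 30.0) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if retry_after is not None:
        try:
            # Honour the server's hint when it is a plain number of seconds.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form is not handled in this sketch
    # Exponential backoff: 2 s, 4 s, 8 s, 16 s, then capped at 30 s.
    return min(base * 2 ** (attempt - 1), cap)


def fetch_with_retry(fetch: Callable[[], Response], max_attempts: int = 5,
                     sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call `fetch` until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        response = fetch()
        status, headers, _body = response
        if status not in RETRYABLE or attempt == max_attempts:
            return response
        # Header lookup is case-sensitive here; a real client normalizes names.
        sleep(retry_delay(attempt, headers.get("Retry-After")))
    return response  # unreachable, kept for type checkers
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. With five total attempts the 30 s cap is never reached in practice; it only matters if you raise `max_attempts`.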

Two completely different HTML structures. Viator uses data-automation test-attribute selectors (ttd-product-list-card, ttd-product-list-card-title, etc.) — stable testid-style attributes that survive most redesigns. GetYourGuide uses Vue.js component class names (granular-layout-activity-card-<id>) with the activity ID embedded in the class name itself. A currency normalization layer maps €/$/£/¥ to EUR/USD/GBP/JPY. A duration parser resolves every format both platforms use — "3 hours" → 3.0, "30 minutes" → 0.5, "5 to 9 hours" → 7.0 (midpoint), "1 day" → 24.0. All of this is the glue you don't want to maintain.
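To make the normalization concrete, here is a rough sketch of a symbol-to-code map and a duration parser that reproduces the four examples above. The names `currency_code` and `parse_duration_hours` are hypothetical and the regex is an assumption, not the Actor's parser. It only covers spelled-out units (minutes, hours, days) and treats a bare `$` as USD, which a production parser would need to disambiguate.

```python
import re
from typing import Optional

# Symbol -> ISO 4217 code, as listed in the article.
CURRENCY_BY_SYMBOL = {"€": "EUR", "$": "USD", "£": "GBP", "¥": "JPY"}

# Minutes per unit; working in minutes keeps "30 minutes" exactly 0.5 hours.
UNIT_MINUTES = {"minute": 1, "hour": 60, "day": 1440}

# Matches "3 hours", "30 minutes", "1 day" and ranges like "5 to 9 hours".
_DURATION = re.compile(
    r"(\d+(?:\.\d+)?)(?:\s*(?:to|-|–)\s*(\d+(?:\.\d+)?))?\s*(minute|hour|day)s?",
    re.IGNORECASE,
)


def currency_code(price_text: str) -> Optional[str]:
    """Return the ISO code for the first known symbol found in a price string."""
    for symbol, code in CURRENCY_BY_SYMBOL.items():
        if symbol in price_text:
            return code
    return None


def parse_duration_hours(text: str) -> Optional[float]:
    """Convert a display duration to hours; ranges resolve to their midpoint."""
    match = _DURATION.search(text)
    if match is None:
        return None
    low = float(match.group(1))
    high = float(match.group(2)) if match.group(2) else low
    midpoint = (low + high) / 2
    return midpoint * UNIT_MINUTES[match.group(3).lower()] / 60
```

Returning `None` for anything unparseable, rather than guessing, matches the nullable fields in the sample row.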

Per-platform failure isolation. If GetYourGuide is temporarily blocked on a given run, the Viator rows still arrive. The Actor never aborts on a single-platform failure — it surfaces a partial-success message and keeps going. One platform's Cloudflare mood does not nuke your entire dataset.
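The isolation pattern itself is small. A minimal sketch, assuming each platform scraper is a plain callable that either returns rows or raises; `scrape_all` and the scraper callables are hypothetical names, not the Actor's internals:

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


def scrape_all(
    scrapers: Dict[str, Callable[[str], List[Row]]], query: str
) -> Tuple[List[Row], Dict[str, str]]:
    """Run every platform scraper; collect rows and per-platform failures."""
    rows: List[Row] = []
    failures: Dict[str, str] = {}
    for platform, scrape in scrapers.items():
        try:
            rows.extend(scrape(query))
        except Exception as exc:
            # One platform failing must not abort the others: record and move on.
            failures[platform] = f"{type(exc).__name__}: {exc}"
    return rows, failures
```

A non-empty `failures` dict alongside non-empty `rows` is the partial-success case: the caller can report which platform was blocked while still writing the rows that arrived.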

None of this is glamorous. All of it is exactly the difference between a script that worked once and a feed you can schedule weekly.

The Actor ⚙️

The result is packaged as an Apify Actor: Tours and Activity Unified.

Type a destination in the Apify Console and click Start, or run it programmatically:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the Actor and block until the run finishes.
run = client.actor("DevilScrapes/tours-activity-unified").call(
    run_input={
        "locationQuery": "Paris",
        "platforms": ["viator", "getyourguide"],
        "maxPerPlatform": 50,
    }
)

# Each item in the run's default dataset is one activity row.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["platform"], item["activity_title"], item["price_original"], item["currency_original"])

Input parameters from the schema:

Parameter	Type	Default	Notes
`locationQuery`	string	(required)	Any destination: `"Paris"`, `"Tokyo"`, `"New York"`
`platforms`	array	`[]` (= all)	Subset of `viator`, `getyourguide`, `klook`
`maxPerPlatform`	integer	`20`	Cap per platform, 1–100
`useProxy`	boolean	`true`	Recommended on; GetYourGuide is Cloudflare-fronted

One call, two platforms, one schema. The location_query field echoes your input on every row so downstream pivot tables work correctly when you batch multiple cities.

Use cases 💡

Tour operator competitive intelligence. Your competitor lists the same Eiffel Tower experience at €42 on Viator and €38 on GetYourGuide. You didn't know that until this run. A two-platform Paris sweep puts both first pages side by side: about 40–48 rows in v1 (the first-page limit applies even at maxPerPlatform=50), for roughly $0.17–$0.19. Schedule weekly, diff the price columns, and you know exactly when a competitor repriced.
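The weekly diff can be as simple as joining two snapshots on platform and activity ID. A minimal sketch, assuming you have already loaded last week's and this week's rows as lists of dicts in the Actor's row shape; `price_changes` is a hypothetical helper name:

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def price_changes(previous: Iterable[Row], current: Iterable[Row]) -> List[Row]:
    """Return one record per activity whose price moved between two snapshots."""

    def key(row: Row) -> Tuple[str, str]:
        return (row["platform"], row["activity_id"])

    before = {key(row): row for row in previous}
    changes: List[Row] = []
    for row in current:
        old = before.get(key(row))
        if old is None:
            continue  # new on page one this week; nothing to compare against
        old_price, new_price = old["price_original"], row["price_original"]
        if old_price is None or new_price is None:
            continue  # price missing in one snapshot
        if old["currency_original"] != row["currency_original"]:
            continue  # display currency changed; amounts are not comparable
        if new_price != old_price:
            changes.append({
                "platform": row["platform"],
                "activity_id": row["activity_id"],
                "activity_title": row["activity_title"],
                "currency": row["currency_original"],
                "old_price": old_price,
                "new_price": new_price,
                "delta": new_price - old_price,
            })
    return changes
```

Activities that dropped off page one are silently ignored here. If ranking churn matters to you, compare the key sets of the two snapshots as well.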

OTA analyst dashboards. Feed a Metabase, Looker, or Looker Studio (formerly Google Data Studio) instance with weekly snapshots for 20 destinations. Watch how each platform's ranking algorithm surfaces different operators for the same query over time. That is useful both for operators trying to improve their listing rank and for investors benchmarking platform behaviour.

Dynamic pricing strategy. If you are a tour operator listed on both platforms, you can pull your own activity_id across both and watch how your price position changes relative to the first page of results. Adjust and re-measure.

Travel-blogger affiliate research. Which Kyoto experiences consistently land on page 1 of both platforms with a 4.8+ rating and 300+ reviews? That is your affiliate content calendar. Pull the data, sort by rating DESC, review_count DESC, and pick your top five for the month's guide.
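The filter-and-sort step for that shortlist is a few lines of plain Python. A sketch assuming rows in the Actor's shape; `shortlist` is a hypothetical name, and it does not try to match the same experience across platforms, since the two use different IDs:

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def shortlist(rows: Iterable[Row], min_rating: float = 4.8,
              min_reviews: int = 300, top_n: int = 5) -> List[Row]:
    """Keep well-reviewed activities, sorted by rating then review count."""
    eligible = [
        row for row in rows
        # rating and review_count can be null, so treat missing values as 0
        if (row.get("rating") or 0) >= min_rating
        and (row.get("review_count") or 0) >= min_reviews
    ]
    # Equivalent to ORDER BY rating DESC, review_count DESC
    eligible.sort(key=lambda row: (row["rating"], row["review_count"]), reverse=True)
    return eligible[:top_n]
```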

Travel-tech investor diligence. Track top-of-funnel pricing trends across the experience-booking layer of the travel stack. How does average first-page price move for "Tokyo" between January and July? The dataset answers that question at the cost of a coffee per destination per week.

Pricing — exact numbers 💰

Pay-per-event. You pay for rows you receive, nothing for rows that don't arrive. No data, no charge (beyond the $0.05 run warm-up).

Event	Rate
Actor start (once per run)	$0.05
Per activity row emitted	$0.003

Run size	Cost
40 rows (~default, 2 platforms × 20)	~$0.17
100 rows	~$0.35
1,000 rows	~$3.05
10,000 rows	~$30.05

Apify's $5 free trial credit covers your first ~1,600 activity rows with no credit card. For comparison, doing this manually across two platforms for a single destination — opening two browser tabs, scrolling, copying data — takes 30–45 minutes per destination and produces an untyped spreadsheet you still need to normalize.
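The pricing tables reduce to one formula: cost = $0.05 + $0.003 × rows. A small sketch for budgeting your own runs, with hypothetical helper names; it works in thousandths of a dollar so the results are not affected by floating-point drift:

```python
START_FEE_MILLS = 50  # $0.05 per run, in thousandths of a dollar
ROW_FEE_MILLS = 3     # $0.003 per activity row


def run_cost_usd(rows: int) -> float:
    """Cost of a single run that emits `rows` activity rows."""
    return (START_FEE_MILLS + ROW_FEE_MILLS * rows) / 1000


def rows_for_budget(budget_usd: float, runs: int = 1) -> int:
    """Rows affordable within a budget spread across `runs` runs."""
    budget_mills = round(budget_usd * 1000)
    remaining = budget_mills - START_FEE_MILLS * runs
    return max(0, remaining // ROW_FEE_MILLS)
```

`run_cost_usd(40)` gives 0.17 and `run_cost_usd(1000)` gives 3.05, matching the table. `rows_for_budget(5)` comes to 1,650 rows if the whole credit goes into one run, and fewer when it is split across several runs because each run pays the start fee, which is where the ~1,600 figure lands.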

The technically interesting part

The most non-obvious piece of this build is the GetYourGuide ID extraction. GetYourGuide does not expose an activity ID in any consistent data-id or href attribute on the card container. The ID is encoded in the component's CSS class name — granular-layout-activity-card-508441 — so the extractor uses re.search(r'granular-layout-activity-card-(\d+)', class_string) to pull it. This is fragile in the sense that a Vue.js component rename would break it, but it is the only reliable ID surface on the card, and the class name pattern has been stable across multiple verified scrapes. The QA fixture for Paris locks a regression — if GetYourGuide renames the component, the fixture fails immediately and the parser gets updated before any user sees a broken run.
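Here is that extraction as a standalone function. The regex is the one quoted above. The function name and the other class names in the usage example are made up for illustration.

```python
import re
from typing import Optional

# Pattern from the article: the activity ID is the numeric suffix of the class name.
_CARD_ID = re.compile(r"granular-layout-activity-card-(\d+)")


def extract_gyg_activity_id(class_string: str) -> Optional[str]:
    """Pull the GetYourGuide activity ID out of a card's class attribute."""
    match = _CARD_ID.search(class_string)
    return match.group(1) if match else None


# Hypothetical class attribute: only the granular-layout-... token is from the article.
example = "card card--vertical granular-layout-activity-card-508441"
print(extract_gyg_activity_id(example))  # 508441
```

Returning `None` instead of raising lets the caller decide whether a card without an ID is skipped or treated as a parser regression.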

The Viator side uses data-automation testid attributes, which are typically more stable than class names because they are wired to end-to-end tests and engineers try not to rename them casually. Both approaches are documented in the spec so the next person maintaining the parser knows exactly why the selectors look the way they do.

Limitations 🚧

Klook returns 0 rows in v1. Klook gates every search endpoint behind DataDome, which requires full browser execution to clear. The platform literal is in the schema so v2 can land without breaking the dataset shape, but for now platforms: ["viator", "getyourguide"] is the working set.
First page only. Each platform returns ~20–24 cards on the first search-results page, and v1 does not paginate. maxPerPlatform accepts values up to 100, but in v1 a run will not return more than those ~20–24 unique cards per platform.
No detail-page scraping. Itineraries, availability calendars, photo galleries, cancellation policies, and meeting points are out of scope. v1 is the search-results marketing surface.
Currency follows platform display. price_usd is only populated when the platform itself shows a USD price. No FX conversion is performed — the Actor emits currency_original + price_original for you to convert at query time if needed.
Search relevance is the platform's. A "Paris" query can return Versailles or nearby destinations, depending on how each platform's relevance engine ranks the result. That is expected behaviour.
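On the currency limitation: converting at query time is straightforward once you supply your own rates. A minimal sketch, where `to_usd` is a hypothetical helper and the rate table is a placeholder you would fill from whatever FX source you trust; the Actor does not provide rates.

```python
from typing import Any, Dict, Mapping, Optional

Row = Dict[str, Any]


def to_usd(row: Row, usd_per_unit: Mapping[str, float]) -> Optional[float]:
    """Return a USD price for a row, converting only when the platform gave none."""
    if row.get("price_usd") is not None:
        return row["price_usd"]  # platform displayed USD; use it as-is
    price = row.get("price_original")
    currency = row.get("currency_original")
    if price is None or currency not in usd_per_unit:
        return None  # no price, or no rate supplied for this currency
    return round(price * usd_per_unit[currency], 2)


# Placeholder rate for illustration only; not real market data.
rates = {"USD": 1.0, "EUR": 1.10}
```

Keeping the conversion outside the dataset means historical rows stay in the currency the platform actually displayed, and you can re-run the conversion with different rates later.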

FAQ ❓

Is scraping Viator and GetYourGuide legal?
Both platforms publish publicly accessible search pages with no authentication required to view activity listings. This Actor reads only what any anonymous visitor sees — activity titles, prices, ratings, and booking URLs. It does not access any account-gated content or booking engine. As with any scraping project, check your jurisdiction and intended use. Respecting robots.txt and rate limits is the operating posture here.

Does Viator have a public data API?
Viator has a partner API for booking integration, available to approved affiliates. It is not a general-purpose data feed. The search results the Actor scrapes are from the public website, not the partner API.

Does GetYourGuide have a public data API?
GetYourGuide has a closed supplier API and a private affiliate feed. Neither is available for arbitrary search queries without an approval process. The Actor uses the public search page.

Can I schedule this to track a destination weekly?
Yes. Set up an Apify scheduled task with your desired locationQuery and a named dataset (Actor.open_dataset(name="paris-weekly")) to accumulate rows across runs. Apify's default dataset retention is 7 days; named datasets persist until you delete them.

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/tours-activity-unified.

Free $5 trial credit, no credit card. Run it on "Paris" or "Tokyo" and you will have a cross-platform activity dataset in under a minute. If you find a parser edge case — a price format we don't handle, a platform that started returning different HTML — drop it in the comments. The parsers are updated based on what buyers actually hit in production.

Built by Devil Scrapes — Apify Actors for the data work nobody wants to do by hand. Pay-per-event, transparent pricing, no junk fields. 😈
