When you need Komoot Hiking & Outdoor Routes as a recurring feed, the gap between "got a few rows out" and "have a clean nightly dataset in the warehouse" is wider than it looks. Here is the pipeline I sketched out, with the decisions I made at each step.
Source survey
The source here is the Komoot Hiking & Outdoor Routes Scraper, which scrapes hiking routes, cycling tours and other outdoor activities from Komoot by location or coordinates -- Komoot being Europe's leading outdoor navigation platform, with 200M+ planned routes across 50+ countries. For pipeline purposes, the relevant questions are: how stable is the source markup, what is the natural pagination unit, and how aggressively does it rate-limit. For this source the answer is "stable enough, list-based pagination, moderate rate-limiting" -- which makes it a good candidate for a daily incremental job rather than a streaming one.
Output schema
The actor I used emits records with these fields:
- tourId -- tour ID
- name -- tour name
- sport -- sport/activity type
- status -- visibility status
- distanceM -- distance in metres
- distanceKm -- distance in kilometres
- durationMin -- duration in minutes
- elevationUp -- elevation gain
- elevationDown -- elevation loss
- difficulty -- difficulty rating
- visitors -- visitor count
- ratingScore -- rating score
- ratingCount -- rating count
- startLat -- start latitude
- startLng -- start longitude
- startAlt -- start altitude
- surfaces -- surface types
- wayTypes -- way types
- coverImage -- cover image
- mapImageUrl -- map image URL
- highlightsCount -- highlights count
- highlights -- highlights
- createdAt -- created at
- updatedAt -- updated at
- url -- URL
- scrapedAt -- scraped at
For warehouse ingestion I would keep this almost as-is. Promote the obvious identifier field to a primary key, cast the timestamp columns to native types, and stash any deeply nested or free-text fields in a TEXT column rather than trying to normalise them.
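A minimal sketch of that typing pass in DuckDB, driven from Python. The staging table name (stg_komoot_raw) and the exact casts are assumptions about what the actor emits -- adjust to taste:

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")

# Assumes the raw rows have already been landed into stg_komoot_raw
# (e.g. via read_json_auto over the scraper's JSON output).
con.execute("""
    CREATE OR REPLACE TABLE komoot_tours AS
    SELECT
        tourId                          AS tour_id,        -- natural key; uniqueness enforced in the dedupe step
        name,
        sport,
        status,
        CAST(distanceM     AS DOUBLE)   AS distance_m,
        CAST(durationMin   AS INTEGER)  AS duration_min,
        CAST(elevationUp   AS INTEGER)  AS elevation_up,
        CAST(elevationDown AS INTEGER)  AS elevation_down,
        difficulty,
        CAST(startLat AS DOUBLE)        AS start_lat,
        CAST(startLng AS DOUBLE)        AS start_lng,
        CAST(createdAt AS TIMESTAMP)    AS created_at,
        CAST(updatedAt AS TIMESTAMP)    AS updated_at,
        CAST(scrapedAt AS TIMESTAMP)    AS scraped_at,
        -- nested / free-text fields kept as raw text rather than normalised
        CAST(surfaces   AS VARCHAR)     AS surfaces_raw,
        CAST(highlights AS VARCHAR)     AS highlights_raw,
        url
    FROM stg_komoot_raw
""")
```

Note that distanceKm is derivable from distanceM, so I would keep one of the two here rather than carry both forward.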
Sample records
A peek at two raw rows from a sample run:
```json
{
  "tourId": "e28260717",
  "name": "Wasserläufer Waalweg Mooserstegle – Wandern im Ötztal",
  "sport": "hike",
  "status": "public",
  "distanceM": "6160",
  "distanceKm": "6.16",
  "durationMin": "117",
  "elevationUp": "241",
  "elevationDown": "241",
  "difficulty": "easy"
}

{
  "tourId": "e985847069",
  "name": "Small tour at Moos in Passeier - Stieber Waterfall",
  "sport": "hike",
  "status": "public",
  "distanceM": "3202",
  "distanceKm": "3.20",
  "durationMin": "59",
  "elevationUp": "111",
  "elevationDown": "110",
  "difficulty": "easy"
}
```
The flat structure is forgiving. You can drop this straight into a staging table with CREATE TABLE ... AS SELECT * FROM read_json_auto(...) in DuckDB, or pd.json_normalize(rows) in Python, and the downstream model layer barely needs any work.
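Concretely, assuming the raw items have been written to items.json (the path is illustrative), either route gets you to a queryable table or DataFrame in a few lines:

```python
import json

import duckdb
import pandas as pd

# Option A: DuckDB reads the JSON straight into a staging table.
con = duckdb.connect("warehouse.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS stg_komoot_raw AS
    SELECT * FROM read_json_auto('items.json')
""")

# Option B: flatten into a pandas DataFrame for quick exploration.
with open("items.json") as f:
    rows = json.load(f)
df = pd.json_normalize(rows)
print(df.dtypes)
```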
Pipeline stages
For data engineers and analysts this is the rough shape I would build:
- Extract: schedule the scraper to run every N hours, write the raw JSON to object storage partitioned by date.
- Land: load the raw JSON into a staging table with minimal type coercion -- you want to be able to replay history without re-scraping.
- Transform: dedupe on the natural key, enrich with reference data, surface a curated view for powering dashboards, feeding ML pipelines and answering ad-hoc analytical questions. A sketch of the dedupe step follows this list.
- Serve: expose a thin API or dashboard on the curated view. This is the layer your stakeholders actually touch.
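Here is the dedupe sketch referenced above, assuming overlapping scrape windows can emit the same tourId more than once and that the most recent scrapedAt should win:

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")

# Keep exactly one row per tour, preferring the freshest scrape.
con.execute("""
    CREATE OR REPLACE TABLE komoot_tours_curated AS
    SELECT *
    FROM stg_komoot_raw
    QUALIFY row_number() OVER (
        PARTITION BY tourId
        ORDER BY scrapedAt DESC
    ) = 1
""")
```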
Operational considerations
Three things bite people on these pipelines: schema drift in the upstream source, duplicate records from overlapping scrape windows, and quietly failing runs. Wire up record-count assertions early -- a sudden 50% drop is almost always a sign that the site changed and your selectors need a refresh, not a real shift in supply.
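A cheap version of that assertion, reusing the staging table from earlier: compare the latest day's row count against a trailing average and fail loudly on a big drop.

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")

# Daily row counts, most recent first (latest day plus up to 7 days of baseline).
daily = con.execute("""
    SELECT CAST(CAST(scrapedAt AS TIMESTAMP) AS DATE) AS d, count(*) AS n
    FROM stg_komoot_raw
    GROUP BY 1
    ORDER BY d DESC
    LIMIT 8
""").fetchall()

latest_count = daily[0][1]
baseline = sum(n for _, n in daily[1:]) / max(len(daily) - 1, 1)

# A sudden drop of more than 50% usually means broken selectors, not fewer tours.
if latest_count < 0.5 * baseline:
    raise RuntimeError(
        f"Row count dropped to {latest_count} vs trailing avg {baseline:.0f} -- check the scraper"
    )
```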
Tooling choices
A few opinionated picks I would default to for this kind of pipeline: object storage (S3, GCS, R2) for the raw landing zone because it is cheap and replayable; a columnar warehouse (BigQuery, Snowflake, DuckDB if you are small) for the staging and curated layers because the analytical queries you will run over this dataset are pretty much exclusively column-scans; a tiny dbt or SQLMesh project for the transformations because version-controlled, tested SQL is much nicer to maintain than ad-hoc queries; and a workflow orchestrator (Airflow, Prefect, GitHub Actions on a cron) for scheduling. None of those are exotic choices, which is the point -- the boring stack is the right stack for a feed like this.
Verdict
For a single-source feed like Komoot Hiking & Outdoor Routes, the work is mostly in the staging and dedup logic. The extraction itself is a solved problem if you do not insist on rolling your own crawler. Once the data is landing reliably, the analytical layer is where you spend your time -- and that is the layer where the dataset actually pays for itself.
For live, customizable extractions of this data, the actor that produced the dataset shown above is published on the Apify Store: logiover/komoot-hiking-outdoor-routes-scraper. It supports JSON, CSV and Excel exports and runs on a schedule.
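For completeness, a minimal sketch of triggering the actor and pulling its dataset with the apify-client Python package. The run_input keys below are placeholders, not the actor's documented input schema:

```python
from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")

# Kick off a run; the input keys here are illustrative only.
run = client.actor("logiover/komoot-hiking-outdoor-routes-scraper").call(
    run_input={"location": "Ötztal", "maxResults": 100}
)

# Pull the scraped items out of the run's default dataset.
rows = list(client.dataset(run["defaultDatasetId"]).iterate_items())
print(f"Fetched {len(rows)} tours")
```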