<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Muhammad Farizal Afkar</title>
    <description>The latest articles on DEV Community by Muhammad Farizal Afkar (@mfarizala).</description>
    <link>https://dev.to/mfarizala</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3881632%2Ff1fe166e-85c5-449c-9560-962177427780.png</url>
      <title>DEV Community: Muhammad Farizal Afkar</title>
      <link>https://dev.to/mfarizala</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mfarizala"/>
    <language>en</language>
    <item>
      <title>How I Built an End-to-End Data Engineering Pipeline for Hong Kong's Public Transport Network</title>
      <dc:creator>Muhammad Farizal Afkar</dc:creator>
      <pubDate>Thu, 16 Apr 2026 05:31:52 +0000</pubDate>
      <link>https://dev.to/mfarizala/how-i-built-an-end-to-end-data-engineering-pipeline-for-hong-kongs-public-transport-network-1bkl</link>
      <guid>https://dev.to/mfarizala/how-i-built-an-end-to-end-data-engineering-pipeline-for-hong-kongs-public-transport-network-1bkl</guid>
      <description>&lt;p&gt;&lt;em&gt;From raw GTFS files to a live dashboard — batch, streaming, and CI/CD all on GCP free tier&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Hong Kong runs one of the world's most complex and efficient public transport networks. MTR heavy rail, Light Rail, over 700 bus routes, trams, and ferries all coexist in a city of 7.5 million people. But the raw data behind all of this — GTFS feeds, MTR open data CSVs, real-time schedule APIs — sits scattered across government portals in formats that are nearly impossible for the public to actually use.&lt;/p&gt;

&lt;p&gt;I wanted to answer a few simple questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which stops and routes carry the most traffic?&lt;/li&gt;
&lt;li&gt;How does service differ between weekdays and weekends?&lt;/li&gt;
&lt;li&gt;When do the first and last buses run on each route?&lt;/li&gt;
&lt;li&gt;How do the MTR and bus networks complement each other geographically?&lt;/li&gt;
&lt;li&gt;What does live MTR train activity look like, right now?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project — &lt;strong&gt;HK Transit Pulse (香港交通脈搏)&lt;/strong&gt; — is my answer. It's an end-to-end data engineering pipeline that ingests, transforms, and visualises Hong Kong public transport data, with a batch pipeline running daily and a streaming layer updating every minute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live dashboard:&lt;/strong&gt; &lt;a href="https://hk-transit-pulse-fclvz6lara-de.a.run.app" rel="noopener noreferrer"&gt;https://hk-transit-pulse-fclvz6lara-de.a.run.app&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;p&gt;Before diving in, here's what I used and why:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Bruin&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unified Python + SQL pipeline in one tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;OpenTofu&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open-source Terraform, reproducible IaC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Lake&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Google Cloud Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simple, cheap raw zone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Warehouse&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;BigQuery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Serverless, columnar, free tier generous&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Redpanda Cloud&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kafka-compatible, no ZooKeeper, free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dashboard&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Streamlit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pure Python, no frontend skills needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cloud Run&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Serverless containers, scales to zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GitHub Actions + WIF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keyless auth, no service account keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scheduling&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cloud Scheduler&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Trigger Cloud Run Jobs on a cron&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Everything runs on the GCP free tier.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://mermaid.live/edit#pako:eNqdVs2O40QQfpWW95pk7XiSTXJAyv9kYJiZJOIAQVbHbv9sHLfpbs8QZkYCCQkJIS0SKwF74cABxIkbnHmUfQF4BKrbvzObHcL45O6ur-r7qqrLvtZs6hCtp7khvbJ9zARajlYRgmd6_NFK--enl7-gaSCOkzXq2yKgEV-tojjhPhIUbXEQrbSPc3tUr79z45A4pLvGbhveoOHcGvUX0s_rH774-88XaBjSxEHzJEILwi4Dm4Azf1MXDEc8EPU4CbncWghG8DYMBBph7q8pZs79MGssbL-IMugvh3vDnNA1-FPG9efqfcCSIELnQUzCICLKbep4uDCkhxe_Z_CF7RMnCQkDELkkbIcMtK3oHS6a_9PekMwl3fP52ejtbGNGncQmTBGuRMvRw7P3F29H21CiZFtBp_i-KuY336Ljd9FSpjumUOvpcrJAC4FFYAPUwQI3PHrZ8DdF3IHCvfoKnS7n6CwmEZREYKDzgWwEChsKtRWsYdNtFQkAq38-q-LzFCHYBzQTjfshM7FZRZXi_p69QSYqTYlMxqsfZTLS4s4ij3DZqxAjUO-WJ1xucSW0Ee9KdcrB6AAHINCy-WUOzjgp-FhJ_O1nNKXUA3FZOwjKsCe7uRL7ad7sMvtPCxoj5WfyH34khcxNhcNYXYfBBXqPYkf2wA2aKkfff6kEBd5FIpvx9ecvEcNXOR9GE0E4-usPlNGjcbkSLIh5yZzGlgi2FWsbh7LurBAweZPE8cMkpBh5AbPMQNNK73J3nWR0MisXM1IchoHnC4vhIMxsykRMVRJn-dBKKwm97UEFUZ9zIqRHLrxSrFyUiZCrXHluVgqXO2_oTrtxllOYqeWJovDd1xmFUzlUi_hbWKUEQEW0IY5SJjdVaCsmLKWU28YEbyyfJszCEQ53POAFQtlZPJ2kyqaMkG26jHySkMjeVcJA-7kQxU_WhfkVIRsH76xLrl5BZGEf0vQCpHnKAQSzcGdFaTWKDBY6ZJTnxIbWDQi_e7Hl7FNpyubDnWGh2giGZyT4DZqfp4n8Fc2JE2NIfHol0m8GdEOd5_ME8gkTLK_K_Lw6KovIcqECcPV5kW0RRJAocYMGF9ZiOR_3T9_WtAUkv4dZZEt2c6nvJA8sP3vZiMld3z1KD7nYAXv4pLlBGPaeNPVOZzKp2TSkrPfEdd0axKUbkp9UQZmnDHnU7LQmR_uQ6ck9ZDpJHwVV9XsUUuX_EUj4bqYo86jfaZn7UOnJXVTzESjonBQ0bjUHLX0fKD2pgvoZhnTcI7fzgJLBfsNhx9Cb4zvkM0OIaxK9MJx0n5lGu2o4OtRwnIc23abrPMBxcqjhtBBDsNt-wPD4UMPZoWJODjUs796DFLSa5rHA0XqCJaSmwc8T_NjCUruWzlaa8MkW_hR78OoQFyehWGmr6BZgMJU-pHSbI2ESer7WczH8xta0JIa_GzIKsMdwaQKzlbAhTSKh9Yy2oXxovWvtU61nGnrjyNQNs2sanaZZ03aw1210zWctXW93zVZb11u3Ne0zFVJvdOS-rhsto9sy25327b_9cgdF" rel="noopener noreferrer"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fpako%3AeNqdVs2O40QQfpWW95pk7XiSTXJAyv9kYJiZJOIAQVbHbv9sHLfpbs8QZkYCCQkJIS0SKwF74cABxIkbnHmUfQF4BKrbvzObHcL45O6ur-r7qqrLvtZs6hCtp7khvbJ9zARajlYRgmd6_NFK--enl7-gaSCOkzXq2yKgEV-tojjhPhIUbXEQrbSPc3tUr79z45A4pLvGbhveoOHcGvUX0s_rH774-88XaBjSxEHzJEILwi4Dm4Azf1MXDEc8EPU4CbncWghG8DYMBBph7q8pZs79MGssbL-IMugvh3vDnNA1-FPG9efqfcCSIELnQUzCICLKbep4uDCkhxe_Z_CF7RMnCQkDELkkbIcMtK3oHS6a_9PekMwl3fP52ejtbGNGncQmTBGuRMvRw7P3F29H21CiZFtBp_i-KuY336Ljd9FSpjumUOvpcrJAC4FFYAPUwQI3PHrZ8DdF3IHCvfoKnS7n6CwmEZREYKDzgWwEChsKtRWsYdNtFQkAq38-q-LzFCHYBzQTjfshM7FZRZXi_p69QSYqTYlMxqsfZTLS4s4ij3DZqxAjUO-WJ1xucSW0Ee9KdcrB6AAHINCy-WUOzjgp-FhJ_O1nNKXUA3FZOwjKsCe7uRL7ad7sMvtPCxoj5WfyH34khcxNhcNYXYfBBXqPYkf2wA2aKkfff6kEBd5FIpvx9ecvEcNXOR9GE0E4-usPlNGjcbkSLIh5yZzGlgi2FWsbh7LurBAweZPE8cMkpBh5AbPMQNNK73J3nWR0MisXM1IchoHnC4vhIMxsykRMVRJn-dBKKwm97UEFUZ9zIqRHLrxSrFyUiZCrXHluVgqXO2_oTrtxllOYqeWJovDd1xmFUzlUi_hbWKUEQEW0IY5SJjdVaCsmLKWU28YEbyyfJszCEQ53POAFQtlZPJ2kyqaMkG26jHySkMjeVcJA-7kQxU_WhfkVIRsH76xLrl5BZGEf0vQCpHnKAQSzcGdFaTWKDBY6ZJTnxIbWDQi_e7Hl7FNpyubDnWGh2giGZyT4DZqfp4n8Fc2JE2NIfHol0m8GdEOd5_ME8gkTLK_K_Lw6KovIcqECcPV5kW0RRJAocYMGF9ZiOR_3T9_WtAUkv4dZZEt2c6nvJA8sP3vZiMld3z1KD7nYAXv4pLlBGPaeNPVOZzKp2TSkrPfEdd0axKUbkp9UQZmnDHnU7LQmR_uQ6ck9ZDpJHwVV9XsUUuX_EUj4bqYo86jfaZn7UOnJXVTzESjonBQ0bjUHLX0fKD2pgvoZhnTcI7fzgJLBfsNhx9Cb4zvkM0OIaxK9MJx0n5lGu2o4OtRwnIc23abrPMBxcqjhtBBDsNt-wPD4UMPZoWJODjUs796DFLSa5rHA0XqCJaSmwc8T_NjCUruWzlaa8MkW_hR78OoQFyehWGmr6BZgMJU-pHSbI2ESer7WczH8xta0JIa_GzIKsMdwaQKzlbAhTSKh9Yy2oXxovWvtU61nGnrjyNQNs2sanaZZ03aw1210zWctXW93zVZb11u3Ne0zFVJvdOS-rhsto9sy25327b_9cgdF%3Ftype%3Dpng" alt="HK Transit Pulse data architecture" width="1356" height="1478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The full pipeline splits into two tracks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch (daily at 06:00 HKT):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Cloud Scheduler → Cloud Run Job (Bruin) → GCS → BigQuery (raw → staging → marts)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Streaming (every minute):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Cloud Scheduler → producer Cloud Run Job → MTR API → Redpanda → consumer Cloud Run Job → BigQuery streaming&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Both tracks converge at the Streamlit dashboard, which queries BigQuery directly.&lt;/p&gt;

&lt;p&gt;GitHub Actions handles CI/CD: push to &lt;code&gt;main&lt;/code&gt; builds Docker images, pushes to Artifact Registry, and deploys the dashboard and batch job.&lt;/p&gt;




&lt;h2&gt;
  
  
  Data Sources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  HK Transport GTFS Static
&lt;/h3&gt;

&lt;p&gt;GTFS (General Transit Feed Specification) is the open standard for public transport schedules. Hong Kong's GTFS feed — published at &lt;code&gt;data.gov.hk&lt;/code&gt; — covers KMB buses, CTB/NWFB Citybus, trams, and ferries. It's a ZIP of text files: routes, stops, trips, stop times, and calendar.&lt;/p&gt;
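&lt;p&gt;Reading a feed of that shape needs nothing beyond the Python standard library. A minimal sketch, with a helper name of my own rather than the project's code:&lt;/p&gt;

```python
import csv
import io
import zipfile

def read_gtfs_table(zip_bytes, name):
    """Read one GTFS text file (e.g. 'stops.txt') out of a feed ZIP.

    utf-8-sig also tolerates a leading BOM, which some feeds carry.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(name) as f:
            return list(csv.DictReader(io.TextIOWrapper(f, encoding="utf-8-sig")))
```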

&lt;p&gt;One important caveat: &lt;strong&gt;MTR does not publish GTFS&lt;/strong&gt;. Trip-level data for the heavy rail network is not publicly available, so for the MTR I used its Open Data portal, which publishes CSV files for station lists, fares, and Light Rail routes.&lt;/p&gt;

&lt;h3&gt;
  
  
  MTR Open Data
&lt;/h3&gt;

&lt;p&gt;Four CSVs from &lt;code&gt;opendata.mtr.com.hk&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;mtr_lines_stations&lt;/strong&gt; — Heavy rail line and station data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mtr_bus_stops&lt;/strong&gt; — MTR feeder bus routes and stops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mtr_fares&lt;/strong&gt; — Station-to-station fare table&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mtr_light_rail_stops&lt;/strong&gt; — Light Rail routes and stops&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  MTR Schedule API
&lt;/h3&gt;

&lt;p&gt;Real-time train arrival predictions from &lt;code&gt;rt.data.gov.hk&lt;/code&gt;. Used exclusively for the streaming layer — polled every minute and pushed to Redpanda.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pipeline in Detail
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Orchestration with Bruin
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://getbruin.com" rel="noopener noreferrer"&gt;Bruin&lt;/a&gt; is a modern data pipeline framework that lets you define Python ingestion scripts and SQL transformation assets under one orchestrator. This was my first time using it — it was chosen as the sponsor tool for DE Zoomcamp 2026.&lt;/p&gt;

&lt;p&gt;The pipeline structure looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bruin/
└── assets/
    ├── ingestion/
    │   ├── ingest_gtfs_static.py
    │   └── ingest_mtr_csv.py
    ├── staging/
    │   ├── stg_stops.sql
    │   ├── stg_routes.sql
    │   └── ...
    └── marts/
        ├── mart_stops_ranked.sql
        ├── mart_peak_hour_analysis.sql
        └── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each asset declares its dependencies and type in a header comment block. Bruin resolves the DAG and runs assets in order. A single &lt;code&gt;bruin run .&lt;/code&gt; executes the full pipeline.&lt;/p&gt;
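&lt;p&gt;A header of that kind looks roughly like this in a SQL asset. The field values are illustrative, based on Bruin's documented &lt;code&gt;@bruin&lt;/code&gt; comment format, not copied from the repo:&lt;/p&gt;

```sql
/* @bruin
name: staging.stg_stops
type: bq.sql
depends:
  - raw.gtfs_stops
@bruin */

SELECT ...
```

&lt;p&gt;Bruin parses these headers to build the DAG, so dependency order never lives in a separate config file.&lt;/p&gt;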

&lt;h3&gt;
  
  
  Ingestion
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcayjj0n93xf7tfw503q6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcayjj0n93xf7tfw503q6.png" alt="BigQuery Raw" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ingest_gtfs_static.py&lt;/code&gt; downloads the GTFS ZIP from &lt;code&gt;data.gov.hk&lt;/code&gt;, extracts each file, uploads to GCS, and triggers BigQuery load jobs with &lt;code&gt;WRITE_TRUNCATE&lt;/code&gt; — so every daily run is a full refresh.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ingest_mtr_csv.py&lt;/code&gt; fetches four CSVs from the MTR portal. Two gotchas I hit here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The MTR site drops connections without a User-Agent header.&lt;/strong&gt; Add &lt;code&gt;User-Agent: Mozilla/5.0&lt;/code&gt; to every request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MTR CSVs have a UTF-8 BOM character (&lt;code&gt;\ufeff&lt;/code&gt;) at the start.&lt;/strong&gt; Decode with &lt;code&gt;utf-8-sig&lt;/code&gt;, not &lt;code&gt;utf-8&lt;/code&gt;, or the first column name will come through corrupted.&lt;/li&gt;
&lt;/ol&gt;
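&lt;p&gt;Both fixes fit in a few lines. A standard-library sketch, with function names of my own:&lt;/p&gt;

```python
import csv
import io
import urllib.request

def parse_mtr_csv(raw_bytes):
    """Decode MTR CSV bytes; utf-8-sig strips the leading BOM (gotcha 2)."""
    return list(csv.DictReader(io.StringIO(raw_bytes.decode("utf-8-sig"))))

def fetch_mtr_csv(url):
    """Fetch a CSV from the MTR portal.

    The User-Agent header is what avoids the dropped connections (gotcha 1).
    """
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return parse_mtr_csv(resp.read())
```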

&lt;h3&gt;
  
  
  Staging Layer
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhx4g2lfbnyb2czs1bzi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhx4g2lfbnyb2czs1bzi.png" alt="BigQuery Staging" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Five SQL assets clean and standardise the raw data — cast types, rename to snake_case, filter nulls on primary keys. Nothing fancy, just making the data reliable for downstream use.&lt;/p&gt;

&lt;p&gt;Example from &lt;code&gt;stg_stops.sql&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stop_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;INT64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;stop_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;stop_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stop_lat&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;FLOAT64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stop_lon&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;FLOAT64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;longitude&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gtfs_stops&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;stop_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mart Layer
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugfsp5kap4gm2pwr6r6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugfsp5kap4gm2pwr6r6l.png" alt="BigQuery Marts" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ten aggregated mart tables power the dashboard. These are always built from staging, never from raw. A few interesting ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;mart_peak_hour_analysis&lt;/strong&gt; — Groups all departures by hour of day to find peak periods&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mart_transfer_hubs&lt;/strong&gt; — Counts how many distinct routes serve each stop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mart_weekday_vs_weekend&lt;/strong&gt; — Compares service volume split by calendar type&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mart_trip_trajectories&lt;/strong&gt; — Joins trips with stop coordinates to build route paths for map rendering&lt;/li&gt;
&lt;/ul&gt;
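&lt;p&gt;As an illustration of how thin these marts are, the transfer-hub logic is essentially one aggregation. The staging table and column names below are assumed for the sketch, not copied from the repo:&lt;/p&gt;

```sql
SELECT
  s.stop_id,
  s.stop_name,
  COUNT(DISTINCT t.route_id) AS distinct_routes
FROM staging.stg_stop_times st
JOIN staging.stg_trips t USING (trip_id)
JOIN staging.stg_stops s USING (stop_id)
GROUP BY s.stop_id, s.stop_name
ORDER BY distinct_routes DESC
```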




&lt;h2&gt;
  
  
  Streaming Layer
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82kgmd7h28bks9vlickp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82kgmd7h28bks9vlickp.png" alt="Redpanda overview page" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The streaming layer adds real-time MTR train schedule data on top of the batch pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Producer (Cloud Run Job)
&lt;/h3&gt;

&lt;p&gt;Polls the MTR Schedule API for each configured line, serialises events as JSON, and publishes them to a Redpanda topic (&lt;code&gt;hk-mtr-schedule&lt;/code&gt;). It's &lt;strong&gt;one-shot&lt;/strong&gt; — no infinite loop — designed to run as a Cloud Run Job and exit cleanly after publishing.&lt;/p&gt;
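&lt;p&gt;Stripped of API details, the one-shot shape is serialise, produce, flush, return. A sketch that works against any Kafka-compatible client (for example &lt;code&gt;confluent_kafka.Producer&lt;/code&gt;); the event fields here are illustrative, not the project's schema:&lt;/p&gt;

```python
import json
import time

def serialise_event(line, station, payload):
    """Wrap one MTR schedule response as a JSON-encoded event."""
    return json.dumps({
        "line": line,
        "station": station,
        "fetched_at": int(time.time()),
        "payload": payload,
    }).encode("utf-8")

def publish_once(producer, events, topic="hk-mtr-schedule"):
    """One-shot publish: send everything, flush, and return.

    No infinite loop, so the Cloud Run Job exits cleanly afterwards.
    """
    for event in events:
        producer.produce(topic, value=event)
    producer.flush()
```

&lt;p&gt;The &lt;code&gt;flush()&lt;/code&gt; before returning matters: a Cloud Run Job's process ends as soon as the entrypoint returns, so any unflushed messages would be lost.&lt;/p&gt;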

&lt;h3&gt;
  
  
  Consumer (Cloud Run Job)
&lt;/h3&gt;

&lt;p&gt;Reads from Redpanda and writes to BigQuery using streaming inserts. It runs for &lt;strong&gt;50 seconds&lt;/strong&gt; then exits — a time-bounded window that fits within the 1-minute Cloud Scheduler cadence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;RUN_DURATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="n"&gt;deadline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;RUN_DURATION&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;deadline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;poll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# process and insert to BigQuery
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why Redpanda instead of Kafka? Redpanda is Kafka-compatible (same Python client), has no ZooKeeper dependency, and the free cloud tier is generous enough for this scale. Zero ops overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  CI/CD and Deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Keyless Auth with Workload Identity Federation
&lt;/h3&gt;

&lt;p&gt;I used Workload Identity Federation (WIF) for GitHub Actions — no service account JSON keys stored as secrets. The workflow authenticates via OIDC token exchange, which GCP validates against the WIF provider.&lt;/p&gt;

&lt;p&gt;The tricky part: when creating the OIDC provider, you &lt;strong&gt;must&lt;/strong&gt; include an &lt;code&gt;--attribute-condition&lt;/code&gt; that restricts which GitHub repositories can use it. Without it, GCP rejects the provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud iam workload-identity-pools providers create-oidc github-provider &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--attribute-condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"assertion.repository_owner=='YourGitHubUsername'"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
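&lt;p&gt;On the GitHub Actions side, the matching workflow fragment uses the &lt;code&gt;google-github-actions/auth&lt;/code&gt; action. A hedged sketch: the secret names, region, and deploy command are placeholders, not the project's actual workflow:&lt;/p&gt;

```yaml
permissions:
  contents: read
  id-token: write            # lets the job mint an OIDC token

steps:
  - uses: actions/checkout@v4
  # keyless auth: no service account JSON stored anywhere
  - uses: google-github-actions/auth@v2
    with:
      workload_identity_provider: ${{ secrets.WIF_PROVIDER }}   # full provider resource name
      service_account: ${{ secrets.WIF_SERVICE_ACCOUNT }}
  - uses: google-github-actions/setup-gcloud@v2
  - run: gcloud run deploy dashboard --image "$IMAGE" --region "$REGION"
```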



&lt;h3&gt;
  
  
  Three Cloud Run Deployments
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60simqtz4ih3h91zl5zb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60simqtz4ih3h91zl5zb.png" alt="Cloud Run Services and Jobs console showing the dashboard service, batch-job, producer-job, and consumer-jo" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The project deploys three separate workloads:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Run Service&lt;/strong&gt; — Streamlit dashboard, always available at a public URL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Run Job (batch-job)&lt;/strong&gt; — Bruin pipeline, triggered daily at 22:00 UTC (06:00 HKT) by Cloud Scheduler&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Run Jobs (producer-job + consumer-job)&lt;/strong&gt; — Streaming pipeline, triggered every minute&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All three share the same Artifact Registry but use different Docker images. The dashboard image runs &lt;code&gt;streamlit run dashboard/app.py&lt;/code&gt;. The batch image installs Bruin CLI and runs &lt;code&gt;bruin run .&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bruin in Docker — The Gotchas
&lt;/h3&gt;

&lt;p&gt;Getting Bruin to run inside a Docker container took a few iterations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;PATH issue&lt;/strong&gt; — Bruin's install script puts the binary in &lt;code&gt;~/.local/bin&lt;/code&gt;. Add &lt;code&gt;ENV PATH="/root/.local/bin:${PATH}"&lt;/code&gt; to the Dockerfile.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git repo required&lt;/strong&gt; — Bruin looks for a git repo root to resolve asset paths. Add &lt;code&gt;git init &amp;amp;&amp;amp; git commit&lt;/code&gt; in the Dockerfile.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.bruin.yml&lt;/code&gt; is gitignored&lt;/strong&gt; — The config file can't be committed (contains project ID). Create it inline with &lt;code&gt;printf&lt;/code&gt; in the Dockerfile.&lt;/li&gt;
&lt;/ol&gt;
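&lt;p&gt;Put together, the Dockerfile ends up with roughly this shape. This is a sketch assembled from the three fixes above, not the repo's actual file: the Bruin install step is elided and the &lt;code&gt;.bruin.yml&lt;/code&gt; contents are placeholders:&lt;/p&gt;

```dockerfile
FROM python:3.11-slim

# Gotcha 1: Bruin's installer drops the binary in ~/.local/bin
ENV PATH="/root/.local/bin:${PATH}"

# (install the Bruin CLI here — its installer is documented on getbruin.com)

WORKDIR /app
COPY bruin/ ./bruin/

# Gotcha 2: Bruin walks up to a git repo root to resolve asset paths
RUN apt-get update
RUN apt-get install -y git
RUN git init
RUN git config user.email "build@example.com"
RUN git config user.name "build"
RUN git add -A
RUN git commit -m "container snapshot"

# Gotcha 3: .bruin.yml is gitignored, so recreate it at build time
# (placeholder contents — the real file holds the GCP project/connection config)
RUN printf 'environments:\n  default:\n    connections: []\n' > .bruin.yml
```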




&lt;h2&gt;
  
  
  Dashboard
&lt;/h2&gt;

&lt;p&gt;The Streamlit dashboard has four tabs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6q8jl8gwztw8sn1suz6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6q8jl8gwztw8sn1suz6.png" alt="Network Analytics Tab" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network Analytics&lt;/strong&gt; — GTFS-based charts filtered by transport type:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stop locations map (pydeck ScatterplotLayer)&lt;/li&gt;
&lt;li&gt;Top 10 busiest stops by departure count&lt;/li&gt;
&lt;li&gt;Peak hour analysis chart&lt;/li&gt;
&lt;li&gt;Weekday vs weekend service comparison&lt;/li&gt;
&lt;li&gt;Transfer hubs, longest routes, first/last services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd93qk506xrei93tn30bz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd93qk506xrei93tn30bz.png" alt="MTR Live Tab" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MTR Live 港鐵&lt;/strong&gt; — MTR station and fare data from the open data CSVs&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fav34m49oie1lf4qhcq8a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fav34m49oie1lf4qhcq8a.png" alt="Streaming Analytics" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming Analytics 實時分析&lt;/strong&gt; — Real-time event counts from the Redpanda stream, refreshed on demand&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About&lt;/strong&gt; — Project background&lt;/p&gt;

&lt;p&gt;One Streamlit bug I hit in production: &lt;code&gt;st.pydeck_chart()&lt;/code&gt; throws a &lt;strong&gt;SessionInfo error&lt;/strong&gt; when the session isn't fully initialised — which happens on Cloud Run before the first request completes. The fix is twofold: upgrade to Streamlit 1.45.0 (which patches the underlying bug) and wrap every &lt;code&gt;pydeck_chart()&lt;/code&gt; call in a try-except as a belt-and-suspenders fallback.&lt;/p&gt;
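&lt;p&gt;The try-except half of that fix can be factored into one small helper (&lt;code&gt;render_safely&lt;/code&gt; is my name for it, not a Streamlit API):&lt;/p&gt;

```python
def render_safely(render, fallback):
    """Run a rendering callable; if it raises (e.g. the SessionInfo error
    on a cold Cloud Run instance), call fallback(exc) instead of crashing."""
    try:
        return render()
    except Exception as exc:
        return fallback(exc)
```

&lt;p&gt;Called as &lt;code&gt;render_safely(lambda: st.pydeck_chart(deck), lambda exc: st.warning("Map temporarily unavailable"))&lt;/code&gt;, a cold-start hiccup degrades one widget instead of taking down the whole page.&lt;/p&gt;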




&lt;h2&gt;
  
  
  Infrastructure as Code
&lt;/h2&gt;

&lt;p&gt;All GCP infrastructure is provisioned with OpenTofu (an open-source fork of Terraform):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GCS bucket for the data lake&lt;/li&gt;
&lt;li&gt;BigQuery datasets: &lt;code&gt;raw&lt;/code&gt;, &lt;code&gt;staging&lt;/code&gt;, &lt;code&gt;marts&lt;/code&gt;, &lt;code&gt;streaming&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Service account with appropriate IAM bindings&lt;/li&gt;
&lt;li&gt;Artifact Registry for Docker images&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One gotcha: BigQuery tables must be manually deleted before &lt;code&gt;tofu destroy&lt;/code&gt; — OpenTofu can't drop non-empty tables. Use &lt;code&gt;bq rm -f -t project.dataset.table&lt;/code&gt; first.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Bruin is genuinely elegant for combined Python + SQL pipelines.&lt;/strong&gt; The single-orchestrator approach removes the glue code you'd otherwise write to pass data between ingestion and transformation stages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. WIF is the right call over SA keys.&lt;/strong&gt; No key rotation, no leakage risk, and it fits cleanly into the GitHub Actions OIDC flow. The setup is more involved upfront but it's worth it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Cloud Run Jobs are perfect for one-shot workloads.&lt;/strong&gt; The producer and consumer aren't servers — they're functions that run, do their work, and exit. Cloud Run Jobs model this naturally, and Cloud Scheduler gives you the cron trigger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. MTR's public data coverage has real gaps.&lt;/strong&gt; No GTFS means no trip-level analysis for the heavy rail network — you can't do headway or crowding analysis with what's available. The streaming layer partially compensates for this, but real-time schedule data isn't the same as historical trip records.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. GTFS &lt;code&gt;stop_times.txt&lt;/code&gt; is large&lt;/strong&gt; — over 100 MB — but BigQuery handles it cleanly via the load job API. No need to chunk it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Watch your query scope.&lt;/strong&gt; I had a &lt;code&gt;LIMIT 20&lt;/code&gt; on a trips query that was feeding a KPI metric — it showed 16 routes instead of 700+. Always use a separate &lt;code&gt;COUNT(DISTINCT ...)&lt;/code&gt; query for aggregate KPIs rather than deriving them from display-scoped queries.&lt;/p&gt;
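&lt;p&gt;Concretely, that means keeping two queries with two different jobs (table and column names below are assumed for illustration):&lt;/p&gt;

```sql
-- Display query: fine to cap for the UI
SELECT route_id, route_long_name
FROM staging.stg_routes
LIMIT 20;

-- KPI query: aggregate over the full table, never the display result
SELECT COUNT(DISTINCT route_id) AS total_routes
FROM staging.stg_routes;
```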




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;A few things I'd improve with more time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real KMB stop IDs for the streaming layer&lt;/strong&gt; — the bus ETA producer currently uses placeholder stop IDs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical streaming data&lt;/strong&gt; — the current setup doesn't persist stream history beyond what's in &lt;code&gt;mtr_schedule_raw&lt;/code&gt;; proper partitioning and retention would make this more useful&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dbt instead of raw SQL&lt;/strong&gt; — Bruin's SQL assets work well but lack some of dbt's testing and documentation features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Looker Studio or Metabase&lt;/strong&gt; — Streamlit is fast to build but harder to maintain than a proper BI tool for this kind of analytical dashboard&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The full project is open source:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; TBA&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live dashboard:&lt;/strong&gt; &lt;a href="https://hk-transit-pulse-fclvz6lara-de.a.run.app" rel="noopener noreferrer"&gt;https://hk-transit-pulse-fclvz6lara-de.a.run.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The README has full reproduction steps — you can clone it, provision your own GCP project with &lt;code&gt;tofu apply&lt;/code&gt;, and run the full pipeline locally with a single &lt;code&gt;bruin run&lt;/code&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built as a capstone project for the Data Engineering Zoomcamp 2026, sponsored by Bruin.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Credit
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://unsplash.com/photos/a-bunch-of-different-types-of-signs-on-a-wall-aXbjkCp5-T8?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Cover Image Source&lt;/a&gt; by &lt;a href="https://unsplash.com/@nick_y?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Nick Young&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/a-bunch-of-different-types-of-signs-on-a-wall-aXbjkCp5-T8?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>datazoomcamp</category>
      <category>googlecloud</category>
      <category>bigquery</category>
    </item>
  </channel>
</rss>
