Egor Tarasenko
# Building a Flight Data Pipeline Without Trusting AI

A few months ago I was cleaning up my GitHub and found this repo -- https://github.com/dat-a-ish/data-gov. It was built at that weird inflection point -- AI tools existed, Copilot was already in my editor, but I still rewrote every suggestion by hand because I didn't fully trust what came out. So the code in here is genuinely mine, typed line by line, and that's exactly what makes it useful as a learning resource.

The irony? The project pulls real flight data from the Israeli Government API -- Ben Gurion Airport departures and arrivals. With the current situation in the Middle East making flights in and out of Tel Aviv a daily news story, a pipeline that tracks exactly that data feels more relevant than ever. But jokes aside -- this repo is not about geopolitics. It's a study in how a production-grade batch data pipeline actually fits together, from raw API ingestion all the way to a live dashboard.

My honest recommendation: clone it, read it, and then build your own version without AI help. Not because AI is bad -- but because the only way to truly understand how these tools fit together is to wire them up yourself at least once.


## What Does This Project Actually Do?

The pipeline collects real-time flight data from Israel's open government API, enriches it with geolocation data, transforms it through multiple modeling layers, validates it, and serves it on a Streamlit dashboard. Every layer is automated with Apache Airflow.

| Layer | Tool |
| --- | --- |
| Orchestration | Apache Airflow 2.8.1 |
| Transformation | dbt 1.7.8 |
| Database | PostgreSQL 15 |
| Visualization | Streamlit |
| Data Quality | Great Expectations |
| Geospatial | GeoPandas |
| Infra | Docker, Terraform, AWS |

## The Architecture at a Glance

The pipeline has four stages that flow into each other using Airflow's Dataset-based triggering -- no fixed cron dependencies, just data contracts:

```
flight_sensor -> flights_load -> flights_geocode -> flights_dbt -> dashboard
```

Each stage is a separate DAG. When one finishes and emits a Dataset event, the next one wakes up automatically. This is a clean way to build loosely-coupled, event-driven batch pipelines in Airflow without hardcoding schedules everywhere.


## Part 1: Data Ingestion -- base/pipelines/

This is where raw data enters the system. The flights.py pipeline hits the government API, pulls the latest flight records, and loads them into PostgreSQL using a passive table pattern.

The idea is simple but powerful: you never write directly to the production table. Instead you write to a _passive staging table, validate it, and only then atomically swap it with the production table. Zero downtime, easy rollback.

```python
# The core atomic swap
exchange_tables(source=flight_source_passive, target=flight_source)
create_or_replace_synonym_view(table=flight_source)
```
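Under the hood, a swap like this can be done with transactional renames, since PostgreSQL runs DDL inside transactions. The sketch below is my own illustration of the technique, not the repo's implementation; the function and table names are assumptions.

```python
def exchange_tables_sql(source: str, target: str) -> str:
    """Generate SQL that atomically swaps a staging table into place.

    PostgreSQL DDL is transactional, so the three renames either all
    happen or none do: readers never observe a half-swapped state.
    """
    return (
        "BEGIN;\n"
        f"ALTER TABLE {target} RENAME TO {target}_old;\n"
        f"ALTER TABLE {source} RENAME TO {target};\n"
        # The old production table becomes the next run's staging table.
        f"ALTER TABLE {target}_old RENAME TO {source};\n"
        "COMMIT;"
    )

def synonym_view_sql(table: str, view: str) -> str:
    """Repoint the stable view that downstream consumers read from."""
    return f"CREATE OR REPLACE VIEW {view} AS SELECT * FROM {table};"

print(exchange_tables_sql("flight_source_passive", "flight_source"))
```

Keeping the swap inside one transaction is what gives you the "zero downtime, easy rollback" property: a failed validation simply never reaches the `BEGIN`.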

The geocoding pipeline (flights_geocode.py) enriches the raw data by converting city/country pairs to latitude/longitude using GeoPandas and OpenStreetMap. This runs as a separate DAG triggered automatically after ingestion completes.
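The repo does its lookups with GeoPandas and OpenStreetMap; the sketch below shows only the shape of the enrichment step, with the geocoder injected as a plain callable and cached per unique city/country pair (all names here are mine, not the repo's).

```python
from functools import lru_cache
from typing import Callable, Optional, Tuple

Coords = Optional[Tuple[float, float]]

def make_enricher(geocode: Callable[[str], Coords]):
    """Wrap any geocoder (e.g. an OSM/Nominatim lookup) with a cache
    keyed on "city, country", so each unique pair is resolved once."""
    cached = lru_cache(maxsize=None)(geocode)

    def enrich(flight: dict) -> dict:
        coords = cached(f"{flight['city']}, {flight['country']}")
        lat, lon = coords if coords else (None, None)
        # Unknown locations get explicit nulls rather than failing the run.
        return {**flight, "lat": lat, "lon": lon}

    return enrich
```

Injecting the geocoder keeps the enrichment logic unit-testable with a stub dictionary, with no network calls.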


## Part 2: Orchestration -- airflow/dags/

Four DAGs, each with a single responsibility:

flight_sensor -- runs continuously, polls the source API for new data. When it detects a change, it fires a trigger to flights_load. This is your entry point.

flights_load -- the main ingestion DAG. Runs hourly, or on-demand from the sensor. Fetches data, loads to the passive table, runs dbt tests, runs Great Expectations validation, then emits a Dataset('flight_load') event.

flights_geocode -- triggered by the Dataset('flight_load') event. Enriches the dataset with coordinates. Emits Dataset('flights_geocode').

dbt_dags -- dynamically generates two dbt DAGs from config. flights_dbt builds the transformation models and is triggered when either Dataset('flight_load') or Dataset('flights_geocode') fires. clean_up_database_dbt runs daily and drops any database objects not touched in the current run -- a clean, automated way to handle schema drift.

One thing worth studying: the dag_template.py custom decorator. It wraps every DAG definition with consistent defaults -- retry logic, alerting, tags -- so you don't repeat yourself across DAG files.
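The pattern is a decorator factory that merges project-wide defaults with per-DAG overrides. Here is a pure-Python illustration of that idea -- the default values and names are assumptions, not the repo's actual dag_template.py:

```python
import functools

# Project-wide defaults every DAG should share (illustrative values).
DAG_DEFAULTS = {
    "retries": 2,
    "retry_delay_minutes": 5,
    "tags": ["data-gov"],
    "on_failure_callback": "alert_slack",
}

def dag_template(**overrides):
    """Decorator factory: merge shared defaults with per-DAG overrides,
    then pass the merged config to the wrapped DAG factory."""
    def decorator(build_dag):
        @functools.wraps(build_dag)
        def wrapper(*args, **kwargs):
            # Precedence: call-site kwargs > decorator overrides > defaults.
            merged = {**DAG_DEFAULTS, **overrides, **kwargs}
            return build_dag(*args, **merged)
        return wrapper
    return decorator

@dag_template(tags=["data-gov", "ingestion"])
def flights_load_dag(**config):
    return config  # a real version would construct and return an Airflow DAG
```

The payoff is that retry policy or alerting changes happen in one place instead of in every DAG file.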


## Part 3: Transformation -- base/dbt/

This is where raw flight records become usable analytical models. The dbt project implements three modeling layers -- a classic medallion-style architecture:

### Silver -- Normalized (3NF / Inmon)
Breaks the raw table into clean, normalized entities: flights, airports, cities, countries, flight_statuses, flight_companies. Good for integrity, joins, and historical tracking. Slowly Changing Dimensions (SCD Type 2) are implemented here.

### Gold Star Schema -- Dimensional (Kimball)
Fact and dimension tables optimized for analytical queries: fact_flights_amount, dim_airports, dim_terminals, dim_time. The classic BI-ready format.

### Gold OBT -- One Big Table
A fully denormalized flights_obt table. Fast for dashboards, no joins needed. This is what Streamlit reads.

The progression from raw -> silver -> gold -> OBT is something every data engineer should build manually at least once. It makes the trade-offs between normalization and query performance very concrete.
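Here is that trade-off in miniature: pre-joining normalized silver entities into one wide row, so the dashboard query becomes a plain scan with no joins. The toy data and column names are mine, not the actual model schemas.

```python
def build_obt(flights, airports, statuses):
    """Denormalize silver-layer entities into one-big-table rows.

    The joins are paid once at build time; read queries against the
    result need no joins at all -- the core OBT trade-off.
    """
    airports_by_id = {a["airport_id"]: a for a in airports}
    statuses_by_id = {s["status_id"]: s for s in statuses}
    return [
        {
            "flight_code": f["flight_code"],
            "airport_name": airports_by_id[f["airport_id"]]["name"],
            "country": airports_by_id[f["airport_id"]]["country"],
            "status": statuses_by_id[f["status_id"]]["label"],
        }
        for f in flights
    ]
```

The cost, of course, is storage and rebuild time: every dimension change means regenerating the wide table, which is exactly why the normalized silver layer stays the source of truth.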


## Part 4: Data Quality -- base/gx/

Two layers of validation run automatically after every ingestion:

  1. dbt tests -- schema tests on the raw source (not null, unique, accepted values). These run before the table swap, so bad data never reaches production.
  2. Great Expectations -- more complex expectation suites on the loaded data: column value ranges, row count thresholds, distribution checks.

If either layer fails, the DAG fails cleanly and the passive table is never promoted.
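To make the checks concrete, here is a plain-Python sketch of the *kinds* of rules these suites enforce (not null, accepted values, row-count threshold). The thresholds, column names, and status vocabulary are illustrative assumptions; the real suites live in base/gx/.

```python
ACCEPTED_STATUSES = frozenset({"DEPARTED", "LANDED", "CANCELED", "DELAYED"})

def validate_batch(rows, min_rows=1, accepted_statuses=ACCEPTED_STATUSES):
    """Return a list of failure messages; an empty list means the batch
    passes and the passive table may be promoted."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} below threshold {min_rows}")
    for i, row in enumerate(rows):
        if not row.get("flight_code"):          # not-null / not-empty check
            failures.append(f"row {i}: flight_code is null")
        if row.get("status") not in accepted_statuses:  # accepted-values check
            failures.append(f"row {i}: unexpected status {row.get('status')!r}")
    return failures
```

In the pipeline, a non-empty failure list translates to a failed Airflow task, which is what keeps the swap from ever running on a bad batch.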


## Part 5: Visualization -- streamlit/

A multi-page Streamlit app that reads directly from the flights_obt view in PostgreSQL. The main page shows a live map of flights; other pages break down data by airline, status, terminal, and time.

The interesting design choice here: Streamlit reads from a view (flight_source_v), not the underlying table. The atomic swap in the ingestion layer updates the view pointer, so the dashboard always reads a consistent snapshot without downtime or stale reads.


## Part 6: Infrastructure -- prod_deploy/ and .github/workflows/

Three separate CI/CD pipelines -- one each for Airflow, dbt, and Streamlit -- triggered only when relevant directories change. Each pipeline lints the code, builds a Docker image, pushes it to AWS ECR, and deploys to EC2 via SSH.

Terraform manages ECR repositories and the EC2 instance. The local development setup mirrors production using Docker Compose, so what runs locally runs in prod.


## Try It Yourself

Local setup takes about five minutes:

```shell
git clone https://github.com/dat-a-ish/data-gov
docker-compose -f local_deploy/docker-compose.yml up

# Airflow:   http://localhost:8080
# dbt docs:  http://localhost
# Streamlit: http://localhost:8501
```

If you want to run just the ingestion without Airflow:

```shell
docker-compose -f local_deploy/docker-compose-database.yml up
python -m base.pipelines.flights
```

And dbt alone:

```shell
cd base/dbt/data_gov_il
dbt run -t dev
```

## Why Build This Without AI?

Not as a purist exercise -- just because the concepts here (atomic swaps, passive tables, dataset-triggered DAGs, dbt layer separation, GX validation) become intuitive only when you've had to debug them yourself. When something breaks at 2am and your DAG is stuck, you need to know the system, not just remember that AI generated it for you.

Build it, break it, fix it. Then use AI to build the next one 10x faster.


Thanks for reading! If you found this useful, follow me here on dev.to for more Data & AI Engineering content.
