I built a full-featured retail data platform: 100% open-source, zero cloud lock-in.
✅ Synthetic data generation (Faker)
✅ Raw storage (MinIO)
✅ Transactional lakehouse (Iceberg + Nessie)
✅ Modular transformation (dbt)
✅ Orchestration with lineage (Dagster)
All running locally via Docker: no Snowflake, no Databricks.
Stack: Spark 3.5, Iceberg 1.10, dbt 1.10, Dagster 1.7
Perfect for learning, prototyping, or building cost-efficient pipelines in startups/SMEs.
Code: github.com/RidaMft/dagster-dbt-iceberg
What's your go-to stack for modern analytics engineering? OSS or cloud-managed?
#DataEngineering #Lakehouse #OpenSource #Dagster #dbt #Iceberg #Nessie #Spark #MinIO #AnalyticsEngineering
An open-source retail analytics pipeline with Dagster, dbt, Spark, Iceberg & Nessie
A few months ago, I set out to answer a simple question:
Can we build a production-grade data platform, from raw data to analytics, using only open-source tools, without relying on cloud-managed services?
The answer is yes. And here's the code.
This end-to-end pipeline simulates a retail business:
- Synthetic data (stores, products, employees, sales), sketched below
- Ingestion into an Iceberg lakehouse (via Spark)
- Transformation with dbt (modular, tested, documented)
- Orchestration & observability with Dagster
All running locally on Docker, with no $500/month dev clusters.
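For the synthetic-data step, here is a minimal sketch of what a Faker-based generator can look like. The table layout and column names (stores, sales, amounts) are illustrative assumptions, not the repo's exact schema.

```python
# Minimal sketch of a Faker-based generator (illustrative schema, not the repo's exact one).
import random
import uuid

import pandas as pd
from faker import Faker

fake = Faker()


def generate_stores(n: int = 10) -> pd.DataFrame:
    """Build a small store dimension."""
    return pd.DataFrame({
        "store_id": [str(uuid.uuid4()) for _ in range(n)],
        "store_name": [fake.company() for _ in range(n)],
        "city": [fake.city() for _ in range(n)],
        "opened_at": [fake.date_between("-5y", "today") for _ in range(n)],
    })


def generate_sales(stores: pd.DataFrame, n: int = 1_000) -> pd.DataFrame:
    """Build a sales fact table referencing the store dimension."""
    store_ids = stores["store_id"].tolist()
    return pd.DataFrame({
        "sale_id": [str(uuid.uuid4()) for _ in range(n)],
        "store_id": [random.choice(store_ids) for _ in range(n)],
        "amount": [round(random.uniform(1, 500), 2) for _ in range(n)],
        "sold_at": [fake.date_time_this_year() for _ in range(n)],
    })


if __name__ == "__main__":
    stores = generate_stores()
    sales = generate_sales(stores)
    # In the pipeline, files like these would land in MinIO as the raw layer.
    stores.to_parquet("raw_stores.parquet")
    sales.to_parquet("raw_sales.parquet")
```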
Why Bother? The Open-Source Lakehouse Advantage
| Use Case | Cloud-Managed (e.g., Databricks) | Open-Source Stack |
|---|---|---|
| Learning | Abstracted internals (Delta, Unity Catalog) | ✅ Deep understanding of Spark, Iceberg, Nessie |
| Cost (dev/test) | $200–500+/mo | ✅ $0: Docker on a t3a.xlarge |
| Portability | Vendor lock-in (proprietary formats) | ✅ MinIO → S3, Spark standalone → EMR/K8s |
| Innovation | Limited to vendor roadmap | ✅ Full control: custom dbt macros, Nessie branching, Iceberg maintenance |
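One concrete example of the "full control" point: Iceberg exposes table maintenance as plain Spark SQL procedures. The sketch below assumes a SparkSession named `spark` with a Nessie-backed Iceberg catalog called `nessie` (configured as in the stack section further down); the table name is a placeholder.

```python
# Sketch: Iceberg table maintenance via Spark SQL procedures.
# Assumes a SparkSession `spark` with an Iceberg catalog named `nessie`
# (see the configuration sketch in the stack section); `raw.sales` is a placeholder table.

# Compact the small files that frequent writes tend to produce.
spark.sql("CALL nessie.system.rewrite_data_files(table => 'raw.sales')")

# Rewrite manifests to keep metadata reads fast as the table grows.
spark.sql("CALL nessie.system.rewrite_manifests(table => 'raw.sales')")
```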
This isn't meant to replace Databricks in production, but it's ideal for:
- Upskilling engineers (data + analytics),
- Rapid prototyping,
- Startups/SMEs needing a low-cost MVP.
The Stack: Why Each Component?
| Tool | Role | Key Benefit |
|---|---|---|
| Dagster | Orchestration + asset lineage + checks | Observable data pipelines, no more "black box" DAGs |
| dbt | Transformation layer (SQL + DAG) | Tests, documentation, and modularity by design |
| Spark | Distributed processing (Thrift client) | Handles large-scale workloads, local or remote cluster |
| Iceberg | Table format (ACID, time travel, schema evolution) | Production-ready tables, no more ".parquet hell" |
| Nessie | Git-like branching for data | dev/main workflows, safe experiments, PR-like merges |
| MinIO | S3-compatible object storage | Local dev that mirrors cloud workflows |
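To make the wiring concrete, here is a sketch of how these pieces typically connect. It uses a direct PySpark session rather than the Thrift server the repo relies on, and the endpoints and warehouse path are placeholders for a local Docker setup, not the repo's exact configuration.

```python
# Sketch: a PySpark session wired to Iceberg + Nessie + MinIO.
# Endpoints and the warehouse path are placeholders for a local Docker setup.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("retail-lakehouse")
    # Iceberg + Nessie SQL extensions (MERGE INTO, branch/merge statements, ...)
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
        "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
    )
    # An Iceberg catalog backed by Nessie
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v2")
    .config("spark.sql.catalog.nessie.ref", "main")
    # Table data and metadata live in MinIO through Iceberg's S3FileIO
    .config("spark.sql.catalog.nessie.warehouse", "s3a://warehouse/")
    .config("spark.sql.catalog.nessie.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.nessie.s3.endpoint", "http://minio:9000")
    .getOrCreate()
)

# Example: land a DataFrame as an Iceberg table on the current Nessie reference.
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.raw")
df = spark.createDataFrame([(1, "store_a"), (2, "store_b")], ["store_id", "store_name"])
df.writeTo("nessie.raw.stores").createOrReplace()
```

In the repo, dbt talks to the same catalog through the Spark Thrift server instead, but the catalog properties play the same role: every table dbt builds is an ordinary Iceberg table on a Nessie reference.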
The Nessie + Iceberg combo is particularly powerful:
- Branch your data like code,
- Test transformations in isolation,
- Merge with confidence.
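As a rough illustration of that workflow, assuming the Nessie SQL extensions and the `nessie` catalog from the session sketch above, the branch/merge steps can be expressed as plain SQL (table names are placeholders):

```python
# Sketch: Git-like data workflow with the Nessie SQL extensions
# (assumes the `spark` session and `nessie` catalog configured above; tables are placeholders).

# 1. Branch the data like code: create an isolated `dev` branch from `main`.
spark.sql("CREATE BRANCH IF NOT EXISTS dev IN nessie FROM main")

# 2. Work on the branch: writes here are invisible to `main`.
spark.sql("USE REFERENCE dev IN nessie")
spark.sql("INSERT INTO nessie.raw.sales SELECT * FROM nessie.staging.sales_new")

# 3. After validation, merge back, much like merging a pull request.
spark.sql("MERGE BRANCH dev INTO main IN nessie")
```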
Key Technical Wins
- Idempotent dbt integration: Dagster recompiles only when models change, so there is no redundant `dbt compile` (see the sketch after this list).
- No resource conflicts: clean separation between `dagster/` (orchestration) and `dbt/retail_lakehouse/` (transformation).
- Full observability: every asset shows lineage, materialization history, and test results in the Dagster UI.
- Spark + Nessie config: verified working with Spark Thrift, the Iceberg catalog, and the MinIO S3 endpoint.
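For context, here is a minimal sketch of that dagster-dbt wiring using `DbtProject`, `@dbt_assets`, and `DbtCliResource`. The project path mirrors the layout described above, but the repo's actual definitions may differ.

```python
# Sketch: loading dbt models as Dagster assets with dagster-dbt.
# The project path mirrors the layout described above; the repo's actual definitions may differ.
from pathlib import Path

from dagster import AssetExecutionContext, Definitions
from dagster_dbt import DbtCliResource, DbtProject, dbt_assets

# prepare_if_dev() (re)compiles the manifest only during local development,
# which keeps the integration idempotent: no redundant `dbt compile` runs.
retail_dbt_project = DbtProject(project_dir=Path("dbt/retail_lakehouse"))
retail_dbt_project.prepare_if_dev()


@dbt_assets(manifest=retail_dbt_project.manifest_path)
def retail_dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
    # `dbt build` runs models and tests; Dagster streams per-model events,
    # which powers lineage, materialization history, and checks in the UI.
    yield from dbt.cli(["build"], context=context).stream()


defs = Definitions(
    assets=[retail_dbt_models],
    resources={"dbt": DbtCliResource(project_dir="dbt/retail_lakehouse")},
)
```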
What's Next?
- Add Trino for ad-hoc querying,
- Automate
dbt docs+ Dagster lineage publishing, - Kubernetes deployment (dev β staging β prod).
Try It Yourself
git clone https://github.com/RidaMft/dagster-dbt-iceberg.git
cd dagster-dbt-iceberg
docker compose -f docker-compose.yaml -f docker-compose-dagster.yaml --env-file .env up -d --build
Then open http://localhost:3000 and explore the assets.
Your Turn
- Are you using open-source or cloud-managed tools for your lakehouse?
- What's missing in the OSS ecosystem?
- Want a step-by-step tutorial to reproduce this?
Star/fork the repo; contributions and feedback welcome!
DM me for consulting or tailored workshops.
#DataEngineering #OpenSource #Lakehouse #dbt #Dagster #Iceberg #Nessie #Spark #AnalyticsEngineering #RetailAnalytics