Rida MEFTAH

Building a Modern Data Platform: Dagster, dbt, Iceberg

🚀 I built a full-featured retail data platform: 100% open-source, zero cloud lock-in.

✅ Synthetic data generation (Faker)

✅ Raw storage (MinIO)

✅ Transactional lakehouse (Iceberg + Nessie)

✅ Modular transformation (dbt)

✅ Orchestration with lineage (Dagster)

💡 All running locally via Docker: no Snowflake, no Databricks.

🔧 Stack: Spark 3.5, Iceberg 1.10, dbt 1.10, Dagster 1.7

🌱 Perfect for learning, prototyping, or building cost-efficient pipelines in startups/SMEs.

🔗 Code: github.com/RidaMft/dagster-dbt-iceberg

👇 What’s your go-to stack for modern analytics engineering? OSS or cloud-managed?

#DataEngineering #Lakehouse #OpenSource #Dagster #dbt #Iceberg #Nessie #Spark #MinIO #AnalyticsEngineering


An open-source retail analytics pipeline with Dagster, dbt, Spark, Iceberg & Nessie

A few months ago, I set out to answer a simple question:

Can we build a production-grade data platform, from raw data to analytics, using only open-source tools, without relying on cloud-managed services?

The answer is yes. And here’s the code.

This end-to-end pipeline simulates a retail business:

🔹 Synthetic data (stores, products, employees, sales)

🔹 Ingestion into an Iceberg lakehouse (via Spark)

🔹 Transformation with dbt (modular, tested, documented)

🔹 Orchestration & observability with Dagster

All running locally on Docker, with no $500/month dev clusters.
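
To make the first step concrete, here is a minimal sketch of the synthetic-data stage. The repository uses Faker; this version swaps in Python's standard `random` module so it runs without dependencies. The table layout, column names, and row counts are assumptions made for illustration, not the project's actual schema.

```python
import random
from datetime import date, timedelta


def generate_retail_data(n_stores=3, n_products=5, n_employees=6, n_sales=20, seed=42):
    """Build small, referentially consistent retail tables as lists of dicts.

    All column names are hypothetical; the real project generates data with Faker.
    """
    rng = random.Random(seed)  # seeded so every run yields the same rows

    stores = [{"store_id": i, "city": rng.choice(["Paris", "Lyon", "Lille"])}
              for i in range(1, n_stores + 1)]
    products = [{"product_id": i, "unit_price": round(rng.uniform(1, 100), 2)}
                for i in range(1, n_products + 1)]
    employees = [{"employee_id": i, "store_id": rng.choice(stores)["store_id"]}
                 for i in range(1, n_employees + 1)]

    start = date(2024, 1, 1)
    sales = []
    for i in range(1, n_sales + 1):
        product = rng.choice(products)
        employee = rng.choice(employees)
        quantity = rng.randint(1, 5)
        sales.append({
            "sale_id": i,
            "sale_date": (start + timedelta(days=rng.randint(0, 364))).isoformat(),
            "store_id": employee["store_id"],  # a sale is booked in the seller's store
            "employee_id": employee["employee_id"],
            "product_id": product["product_id"],
            "quantity": quantity,
            "amount": round(quantity * product["unit_price"], 2),
        })
    return {"stores": stores, "products": products, "employees": employees, "sales": sales}


if __name__ == "__main__":
    for name, rows in generate_retail_data().items():
        print(name, len(rows))
```

Seeding the generator keeps the raw layer reproducible, which makes downstream dbt tests easier to reason about.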


🔍 Why Bother? The Open-Source Lakehouse Advantage

| Use Case | Cloud-Managed (e.g., Databricks) | Open-Source Stack |
| --- | --- | --- |
| Learning | Abstracted internals (Delta, Unity Catalog) | ✅ Deep understanding of Spark, Iceberg, Nessie |
| Cost (dev/test) | $200–500+/mo | ✅ $0 in licensing, Docker on a t3a.xlarge |
| Portability | Vendor lock-in (proprietary formats) | ✅ MinIO → S3, Spark standalone → EMR/K8s |
| Innovation | Limited to vendor roadmap | ✅ Full control: custom dbt macros, Nessie branching, Iceberg maintenance |

👉 This isn’t meant to replace Databricks in production, but it’s ideal for:

  • Upskilling engineers (data + analytics),
  • Rapid prototyping,
  • Startups/SMEs needing a low-cost MVP.

🧱 The Stack: Why Each Component?

| Tool | Role | Key Benefit |
| --- | --- | --- |
| Dagster | Orchestration + asset lineage + checks | Observable data pipelines, no more “black box” DAGs |
| dbt | Transformation layer (SQL + DAG) | Tests, documentation, and modularity by design |
| Spark | Distributed processing (thrift client) | Handles large-scale workloads, local or remote cluster |
| Iceberg | Table format (ACID, time-travel, schema evolution) | Production-ready tables, no more “.parquet hell” |
| Nessie | Git-like branching for data | dev/main workflows, safe experiments, PR-like merges |
| MinIO | S3-compatible object storage | Local dev that mirrors cloud workflows |
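
Spark, Iceberg, Nessie, and MinIO are tied together through Spark catalog properties. The sketch below assembles a typical set of them; the property keys are the standard Iceberg and Nessie ones, but the catalog name `nessie`, the hostnames, the bucket, and the API version in the URI are placeholders rather than values taken from the repository.

```python
def nessie_iceberg_spark_conf(
    catalog="nessie",
    nessie_uri="http://nessie:19120/api/v2",
    warehouse="s3://warehouse/",
    s3_endpoint="http://minio:9000",
    ref="main",
):
    """Return Spark properties for an Iceberg catalog backed by Nessie and MinIO.

    Every default above is a placeholder; adjust to your own compose services.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enable Iceberg SQL (MERGE, CALL ...) and Nessie SQL (CREATE BRANCH ...).
        "spark.sql.extensions": ",".join([
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
        ]),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": nessie_uri,
        f"{prefix}.ref": ref,  # branch the session starts on
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,
        f"{prefix}.s3.path-style-access": "true",  # MinIO serves buckets by path
    }


def as_cli_args(conf):
    """Flatten the properties into repeated --conf key=value arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(as_cli_args(nessie_iceberg_spark_conf())))
```

Credentials and region are deliberately left out; `S3FileIO` can pick them up from the standard AWS environment variables set on the Spark container.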

💡 The Nessie + Iceberg combo is particularly powerful:

→ Branch your data like code,

→ Test transformations in isolation,

→ Merge with confidence.
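
The branch, test, merge loop maps onto Nessie's Spark SQL extensions. The sketch below only builds the statements, so it runs anywhere; executing them needs a Spark session with the Nessie extensions loaded. The catalog name `nessie` and the branch name `dev` are assumptions, not names confirmed by the repository.

```python
def nessie_branch_workflow(branch="dev", target="main", catalog="nessie"):
    """Return the Nessie Spark SQL statements for an isolated change, in order.

    The catalog and branch names are assumptions for illustration.
    """
    return [
        # 1. Fork the target branch; no data files are copied.
        f"CREATE BRANCH IF NOT EXISTS {branch} IN {catalog} FROM {target}",
        # 2. Point the session at the new branch so writes stay isolated.
        f"USE REFERENCE {branch} IN {catalog}",
        # ... run dbt models and tests against the branch here ...
        # 3. Publish the tested changes back to the target branch.
        f"MERGE BRANCH {branch} INTO {target} IN {catalog}",
    ]


if __name__ == "__main__":
    for statement in nessie_branch_workflow():
        print(statement)  # with a live session: spark.sql(statement)
```

Until the merge runs, readers of `main` see none of the writes made on the working branch.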


🛠️ Key Technical Wins

  • Idempotent dbt integration: Dagster recompiles the dbt project only when models change, so there are no redundant dbt compile runs.
  • No resource conflicts: Clean separation between dagster/ (orchestration) and dbt/retail_lakehouse/ (transformation).
  • Full observability: Every asset shows lineage, materialization history, and test results in the Dagster UI.
  • Spark + Nessie config: Verified working with the Spark Thrift server, the Iceberg catalog, and the MinIO S3 endpoint.
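
One way to get the recompile-only-on-change behaviour from the first point, not necessarily how the repository does it, is to fingerprint the dbt project files and skip compilation when the fingerprint is unchanged. In this sketch `project_fingerprint`, `needs_recompile`, the state file, and the model name are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def project_fingerprint(project_dir, patterns=("*.sql", "*.yml", "*.yaml")):
    """Hash the relative path and contents of every matching file under project_dir."""
    root = Path(project_dir)
    digest = hashlib.sha256()
    files = sorted({p for pattern in patterns for p in root.rglob(pattern) if p.is_file()})
    for path in files:
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def needs_recompile(project_dir, state_file):
    """Return True (and record the new fingerprint) only when the project changed."""
    current = project_fingerprint(project_dir)
    state = Path(state_file)
    if state.exists() and state.read_text() == current:
        return False  # nothing changed: the existing manifest can be reused
    state.write_text(current)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "retail_lakehouse"
        (project / "models").mkdir(parents=True)
        model = project / "models" / "stg_sales.sql"  # hypothetical model name
        model.write_text("select 1 as sale_id")
        state = Path(tmp) / "fingerprint.txt"

        print(needs_recompile(project, state))  # first run
        print(needs_recompile(project, state))  # unchanged project
        model.write_text("select 2 as sale_id")
        print(needs_recompile(project, state))  # edited model
```

A Dagster definitions module could call a check like this at load time and invoke `dbt parse` or `dbt compile` only when it returns true.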

📦 What’s Next?

  • Add Trino for ad-hoc querying,
  • Automate dbt docs + Dagster lineage publishing,
  • Kubernetes deployment (dev → staging → prod).

🔗 Try It Yourself

git clone https://github.com/RidaMft/dagster-dbt-iceberg.git
cd dagster-dbt-iceberg
docker compose -f docker-compose.yaml -f docker-compose-dagster.yaml --env-file .env up -d --build

→ Open http://localhost:3000 and explore the assets.


🤝 Your Turn

  • Are you using open-source or cloud-managed tools for your lakehouse?
  • What’s missing in the OSS ecosystem?
  • Want a step-by-step tutorial to reproduce this?

👉 Star/fork the repo. Contributions and feedback welcome!

👉 DM me for consulting or tailored workshops.

#DataEngineering #OpenSource #Lakehouse #dbt #Dagster #Iceberg #Nessie #Spark #AnalyticsEngineering #RetailAnalytics
