I built a full-featured retail data platform: 100% open-source, zero cloud lock-in.
✅ Synthetic data generation (Faker)
✅ Raw storage (MinIO)
✅ Transactional lakehouse (Iceberg + Nessie)
✅ Modular transformation (dbt)
✅ Orchestration with lineage (Dagster)
All running locally via Docker: no Snowflake, no Databricks.
Stack: Spark 3.5, Iceberg 1.10, dbt 1.10, Dagster 1.7
Perfect for learning, prototyping, or building cost-efficient pipelines in startups/SMEs.
Code: github.com/RidaMft/dagster-dbt-iceberg
What's your go-to stack for modern analytics engineering? OSS or cloud-managed?
#DataEngineering #Lakehouse #OpenSource #Dagster #dbt #Iceberg #Nessie #Spark #MinIO #AnalyticsEngineering
An open-source retail analytics pipeline with Dagster, dbt, Spark, Iceberg & Nessie
A few months ago, I set out to answer a simple question:
Can we build a production-grade data platform, from raw data to analytics, using only open-source tools, without relying on cloud-managed services?
The answer is yes. And here's the code.
This end-to-end pipeline simulates a retail business:
- Synthetic data (stores, products, employees, sales), sketched below
- Ingestion into an Iceberg lakehouse (via Spark)
- Transformation with dbt (modular, tested, documented)
- Orchestration & observability with Dagster
All running locally on Docker, with no $500/month dev clusters.
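For the synthetic-data step, here is a minimal sketch of what a Faker-based generator can look like. The table layout and column names (stores, sales, amounts) are illustrative assumptions, not the repo's exact schema.

```python
# Minimal sketch of a Faker-based generator (illustrative schema, not the repo's exact one).
import random
import uuid

import pandas as pd
from faker import Faker

fake = Faker()


def generate_stores(n: int = 10) -> pd.DataFrame:
    """Build a small store dimension."""
    return pd.DataFrame({
        "store_id": [str(uuid.uuid4()) for _ in range(n)],
        "store_name": [fake.company() for _ in range(n)],
        "city": [fake.city() for _ in range(n)],
        "opened_at": [fake.date_between("-5y", "today") for _ in range(n)],
    })


def generate_sales(stores: pd.DataFrame, n: int = 1_000) -> pd.DataFrame:
    """Build a sales fact table referencing the store dimension."""
    store_ids = stores["store_id"].tolist()
    return pd.DataFrame({
        "sale_id": [str(uuid.uuid4()) for _ in range(n)],
        "store_id": [random.choice(store_ids) for _ in range(n)],
        "amount": [round(random.uniform(1, 500), 2) for _ in range(n)],
        "sold_at": [fake.date_time_this_year() for _ in range(n)],
    })


if __name__ == "__main__":
    stores = generate_stores()
    sales = generate_sales(stores)
    # In the pipeline, files like these would land in MinIO as the raw layer.
    stores.to_parquet("raw_stores.parquet")
    sales.to_parquet("raw_sales.parquet")
```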
Why Bother? The Open-Source Lakehouse Advantage
| Use Case | Cloud-Managed (e.g., Databricks) | Open-Source Stack |
|---|---|---|
| Learning | Abstracted internals (Delta, Unity Catalog) | ✅ Deep understanding of Spark, Iceberg, Nessie |
| Cost (dev/test) | $200–500+/mo | ✅ $0: Docker on a t3a.xlarge |
| Portability | Vendor lock-in (proprietary formats) | ✅ MinIO → S3, Spark standalone → EMR/K8s |
| Innovation | Limited to vendor roadmap | ✅ Full control: custom dbt macros, Nessie branching, Iceberg maintenance |
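One concrete example of the "full control" point: Iceberg exposes table maintenance as plain Spark SQL procedures. The sketch below assumes a SparkSession named `spark` with a Nessie-backed Iceberg catalog called `nessie` (configured as in the stack section further down); the table name is a placeholder.

```python
# Sketch: Iceberg table maintenance via Spark SQL procedures.
# Assumes a SparkSession `spark` with an Iceberg catalog named `nessie`
# (see the configuration sketch in the stack section); `raw.sales` is a placeholder table.

# Compact the small files that frequent writes tend to produce.
spark.sql("CALL nessie.system.rewrite_data_files(table => 'raw.sales')")

# Rewrite manifests to keep metadata reads fast as the table grows.
spark.sql("CALL nessie.system.rewrite_manifests(table => 'raw.sales')")
```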
This isn't meant to replace Databricks in production, but it's ideal for:
- Upskilling engineers (data + analytics),
- Rapid prototyping,
- Startups/SMEs needing a low-cost MVP.
The Stack: Why Each Component?
| Tool | Role | Key Benefit |
|---|---|---|
| Dagster | Orchestration + asset lineage + checks | Observable data pipelines, no more "black box" DAGs |
| dbt | Transformation layer (SQL + DAG) | Tests, documentation, and modularity by design |
| Spark | Distributed processing (Thrift client) | Handles large-scale workloads, local or remote cluster |
| Iceberg | Table format (ACID, time travel, schema evolution) | Production-ready tables, no more ".parquet hell" |
| Nessie | Git-like branching for data | dev/main workflows, safe experiments, PR-like merges |
| MinIO | S3-compatible object storage | Local dev that mirrors cloud workflows |
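To make the wiring concrete, here is a sketch of how these pieces typically connect. It uses a direct PySpark session rather than the Thrift server the repo relies on, and the endpoints and warehouse path are placeholders for a local Docker setup, not the repo's exact configuration.

```python
# Sketch: a PySpark session wired to Iceberg + Nessie + MinIO.
# Endpoints and the warehouse path are placeholders for a local Docker setup.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("retail-lakehouse")
    # Iceberg + Nessie SQL extensions (MERGE INTO, branch/merge statements, ...)
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
        "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
    )
    # An Iceberg catalog backed by Nessie
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v2")
    .config("spark.sql.catalog.nessie.ref", "main")
    # Table data and metadata live in MinIO through Iceberg's S3FileIO
    .config("spark.sql.catalog.nessie.warehouse", "s3a://warehouse/")
    .config("spark.sql.catalog.nessie.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.nessie.s3.endpoint", "http://minio:9000")
    .getOrCreate()
)

# Example: land a DataFrame as an Iceberg table on the current Nessie reference.
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.raw")
df = spark.createDataFrame([(1, "store_a"), (2, "store_b")], ["store_id", "store_name"])
df.writeTo("nessie.raw.stores").createOrReplace()
```

In the repo, dbt talks to the same catalog through the Spark Thrift server instead, but the catalog properties play the same role: every table dbt builds is an ordinary Iceberg table on a Nessie reference.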
The Nessie + Iceberg combo is particularly powerful:
- Branch your data like code,
- Test transformations in isolation,
- Merge with confidence.
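As a rough illustration of that workflow, assuming the Nessie SQL extensions and the `nessie` catalog from the session sketch above, the branch/merge steps can be expressed as plain SQL (table names are placeholders):

```python
# Sketch: Git-like data workflow with the Nessie SQL extensions
# (assumes the `spark` session and `nessie` catalog configured above; tables are placeholders).

# 1. Branch the data like code: create an isolated `dev` branch from `main`.
spark.sql("CREATE BRANCH IF NOT EXISTS dev IN nessie FROM main")

# 2. Work on the branch: writes here are invisible to `main`.
spark.sql("USE REFERENCE dev IN nessie")
spark.sql("INSERT INTO nessie.raw.sales SELECT * FROM nessie.staging.sales_new")

# 3. After validation, merge back, much like merging a pull request.
spark.sql("MERGE BRANCH dev INTO main IN nessie")
```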
Key Technical Wins
- Idempotent dbt integration: Dagster recompiles only when models change, so there is no redundant `dbt compile` (see the sketch after this list).
- No resource conflicts: clean separation between `dagster/` (orchestration) and `dbt/retail_lakehouse/` (transformation).
- Full observability: every asset shows lineage, materialization history, and test results in the Dagster UI.
- Spark + Nessie config: verified working with Spark Thrift, the Iceberg catalog, and the MinIO S3 endpoint.
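For context, here is a minimal sketch of that dagster-dbt wiring using `DbtProject`, `@dbt_assets`, and `DbtCliResource`. The project path mirrors the layout described above, but the repo's actual definitions may differ.

```python
# Sketch: loading dbt models as Dagster assets with dagster-dbt.
# The project path mirrors the layout described above; the repo's actual definitions may differ.
from pathlib import Path

from dagster import AssetExecutionContext, Definitions
from dagster_dbt import DbtCliResource, DbtProject, dbt_assets

# prepare_if_dev() (re)compiles the manifest only during local development,
# which keeps the integration idempotent: no redundant `dbt compile` runs.
retail_dbt_project = DbtProject(project_dir=Path("dbt/retail_lakehouse"))
retail_dbt_project.prepare_if_dev()


@dbt_assets(manifest=retail_dbt_project.manifest_path)
def retail_dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
    # `dbt build` runs models and tests; Dagster streams per-model events,
    # which powers lineage, materialization history, and checks in the UI.
    yield from dbt.cli(["build"], context=context).stream()


defs = Definitions(
    assets=[retail_dbt_models],
    resources={"dbt": DbtCliResource(project_dir="dbt/retail_lakehouse")},
)
```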
What's Next?
- Add Trino for ad-hoc querying,
- Automate
dbt docs+ Dagster lineage publishing, - Kubernetes deployment (dev β staging β prod).
Try It Yourself
git clone https://github.com/RidaMft/dagster-dbt-iceberg.git
cd dagster-dbt-iceberg
docker compose -f docker-compose.yaml -f docker-compose-dagster.yaml --env-file .env up -d --build
Then open http://localhost:3000 and explore the assets.
Your Turn
- Are you using open-source or cloud-managed tools for your lakehouse?
- What's missing in the OSS ecosystem?
- Want a step-by-step tutorial to reproduce this?
Star/fork the repo; contributions and feedback welcome!
DM me for consulting or tailored workshops.
#DataEngineering #OpenSource #Lakehouse #dbt #Dagster #Iceberg #Nessie #Spark #AnalyticsEngineering #RetailAnalytics