<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rida MEFTAH</title>
    <description>The latest articles on DEV Community by Rida MEFTAH (@ridameftah).</description>
    <link>https://dev.to/ridameftah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3694228%2F296902c0-f3e5-4120-b33b-14084eacc9fb.png</url>
      <title>DEV Community: Rida MEFTAH</title>
      <link>https://dev.to/ridameftah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ridameftah"/>
    <language>en</language>
    <item>
      <title>Building a Modern Data Platform: Dagster, dbt, Iceberg</title>
      <dc:creator>Rida MEFTAH</dc:creator>
      <pubDate>Mon, 05 Jan 2026 11:56:12 +0000</pubDate>
      <link>https://dev.to/ridameftah/building-a-modern-data-platform-dagster-dbt-iceberg-5b25</link>
      <guid>https://dev.to/ridameftah/building-a-modern-data-platform-dagster-dbt-iceberg-5b25</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🚀 I built a &lt;em&gt;full-featured retail data platform&lt;/em&gt; — 100% open-source, zero cloud lock-in.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ Synthetic data generation (Faker)&lt;br&gt;&lt;br&gt;
✅ Raw storage (MinIO)&lt;br&gt;&lt;br&gt;
✅ Transactional lakehouse (Iceberg + Nessie)&lt;br&gt;&lt;br&gt;
✅ Modular transformation (dbt)&lt;br&gt;&lt;br&gt;
✅ Orchestration with lineage (Dagster)  &lt;/p&gt;

&lt;p&gt;💡 All running locally via Docker — no Snowflake, no Databricks.&lt;br&gt;&lt;br&gt;
🔧 Stack: Spark 3.5, Iceberg 1.10, dbt 1.10, Dagster 1.7  &lt;/p&gt;

&lt;p&gt;🌱 Perfect for learning, prototyping, or building cost-efficient pipelines in startups/SMEs.  &lt;/p&gt;

&lt;p&gt;🔗 Code: &lt;a href="https://github.com/RidaMft/dagster-dbt-iceberg" rel="noopener noreferrer"&gt;github.com/RidaMft/dagster-dbt-iceberg&lt;/a&gt;&lt;br&gt;&lt;br&gt;
👇 What’s your go-to stack for modern analytics engineering? OSS or cloud-managed?&lt;/p&gt;

&lt;p&gt;#DataEngineering #Lakehouse #OpenSource #Dagster #dbt #Iceberg #Nessie #Spark #MinIO #AnalyticsEngineering&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h4&gt;
  &lt;em&gt;An open-source retail analytics pipeline with Dagster, dbt, Spark, Iceberg &amp;amp; Nessie&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;A few months ago, I set out to answer a simple question:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Can we build a production-grade data platform — from raw data to analytics — using only open-source tools, without relying on cloud-managed services?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer is &lt;strong&gt;yes&lt;/strong&gt;. And here’s &lt;strong&gt;&lt;a href="https://github.com/RidaMft/dagster-dbt-iceberg" rel="noopener noreferrer"&gt;the code&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This end-to-end pipeline simulates a retail business:&lt;br&gt;&lt;br&gt;
🔹 Synthetic data (stores, products, employees, sales)&lt;br&gt;&lt;br&gt;
🔹 Ingestion into an Iceberg lakehouse (via Spark)&lt;br&gt;&lt;br&gt;
🔹 Transformation with dbt (modular, tested, documented)&lt;br&gt;&lt;br&gt;
🔹 Orchestration &amp;amp; observability with Dagster  &lt;/p&gt;

&lt;p&gt;All running &lt;strong&gt;locally&lt;/strong&gt; on Docker — no $500/month dev clusters.&lt;/p&gt;
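
&lt;p&gt;To give a feel for the synthetic-data step, here is a minimal sketch of a seeded generator for fake sales rows. It is not the repo's code: the project uses Faker, while this sketch swaps in Python's standard &lt;code&gt;random&lt;/code&gt; module so it runs with no dependencies, and the &lt;code&gt;Sale&lt;/code&gt; schema and &lt;code&gt;generate_sales&lt;/code&gt; name are assumptions made for illustration.&lt;/p&gt;

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Sale:
    """Hypothetical sales record; the real schema lives in the repo."""
    sale_id: int
    store_id: int
    product_id: int
    quantity: int
    unit_price: float
    sold_on: date


def generate_sales(n, n_stores=5, n_products=20, seed=42):
    """Return n fake sales rows; the same seed always yields the same rows."""
    rng = random.Random(seed)  # private, seeded generator for reproducibility
    start = date(2025, 1, 1)
    sales = []
    for sale_id in range(1, n + 1):
        sales.append(
            Sale(
                sale_id=sale_id,
                store_id=rng.randint(1, n_stores),
                product_id=rng.randint(1, n_products),
                quantity=rng.randint(1, 10),
                unit_price=round(rng.uniform(1.0, 100.0), 2),
                sold_on=start + timedelta(days=rng.randint(0, 364)),
            )
        )
    return sales


if __name__ == "__main__":
    for row in generate_sales(3):
        print(row)
```

&lt;p&gt;Seeding the generator keeps reruns deterministic, which makes downstream dbt tests easier to reason about.&lt;/p&gt;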




&lt;h3&gt;
  🔍 Why Bother? The Open-Source Lakehouse Advantage
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Cloud-Managed (e.g., Databricks)&lt;/th&gt;
&lt;th&gt;Open-Source Stack&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Learning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Abstracted internals (Delta, Unity Catalog)&lt;/td&gt;
&lt;td&gt;✅ Deep understanding of Spark, Iceberg, Nessie&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost (dev/test)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$200–500+/mo&lt;/td&gt;
&lt;td&gt;✅ &lt;strong&gt;$0&lt;/strong&gt; in license fees: Docker on a laptop or a single &lt;code&gt;t3a.xlarge&lt;/code&gt;-class machine (4 vCPU, 16 GiB)
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Portability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Vendor lock-in (proprietary formats)&lt;/td&gt;
&lt;td&gt;✅ MinIO → S3, Spark standalone → EMR/K8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Innovation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited to vendor roadmap&lt;/td&gt;
&lt;td&gt;✅ Full control: custom dbt macros, Nessie branching, Iceberg maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;👉 This isn’t meant to &lt;em&gt;replace&lt;/em&gt; Databricks in production — but it’s ideal for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upskilling engineers (data + analytics),&lt;/li&gt;
&lt;li&gt;Rapid prototyping,&lt;/li&gt;
&lt;li&gt;Startups/SMEs needing a low-cost MVP.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  🧱 The Stack — Why Each Component?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Key Benefit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dagster&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Orchestration + asset lineage + checks&lt;/td&gt;
&lt;td&gt;Observable data pipelines — no more “black box” DAGs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;dbt&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transformation layer (SQL + DAG)&lt;/td&gt;
&lt;td&gt;Tests, documentation, and modularity by design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spark&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distributed processing (served through the Spark Thrift Server)&lt;/td&gt;
&lt;td&gt;Handles large-scale workloads — local or remote cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Iceberg&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Table format (ACID, time-travel, schema evolution)&lt;/td&gt;
&lt;td&gt;Production-ready tables — no more “.parquet hell”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Nessie&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Git-like branching for data&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;dev&lt;/code&gt;/&lt;code&gt;main&lt;/code&gt; workflows, safe experiments, PR-like merges&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MinIO&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;S3-compatible object storage&lt;/td&gt;
&lt;td&gt;Local dev that mirrors cloud workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
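
&lt;p&gt;How these pieces plug together is mostly a matter of Spark configuration. The sketch below builds the catalog settings that point Spark at an Iceberg catalog backed by Nessie, with files stored in MinIO. The property names are the standard Iceberg and Nessie ones; the hostnames, bucket, catalog name, and the &lt;code&gt;lakehouse_spark_conf&lt;/code&gt; helper are assumptions, so check the repo's compose files for the real values.&lt;/p&gt;

```python
def lakehouse_spark_conf(
    nessie_uri="http://nessie:19120/api/v1",  # assumed service name and port
    minio_endpoint="http://minio:9000",       # assumed service name and port
    warehouse="s3://warehouse/",              # assumed bucket
    ref="main",
    catalog="nessie",
):
    """Return Spark properties for an Iceberg catalog backed by Nessie."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enable Iceberg and Nessie SQL extensions (branching, maintenance).
        "spark.sql.extensions": ",".join([
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
        ]),
        # Register the catalog and tell Iceberg to use Nessie for metadata.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": nessie_uri,
        f"{prefix}.ref": ref,
        f"{prefix}.warehouse": warehouse,
        # Read and write data files through the S3 API that MinIO exposes.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": minio_endpoint,
        f"{prefix}.s3.path-style-access": "true",
    }


def as_submit_args(conf):
    """Render the properties as --conf arguments for spark-submit."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


if __name__ == "__main__":
    print(" ".join(as_submit_args(lakehouse_spark_conf())))
```

&lt;p&gt;Swapping MinIO for S3 then comes down to changing the endpoint and warehouse values. Credentials are deliberately left out here; they would normally come from environment variables.&lt;/p&gt;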

&lt;p&gt;💡 The &lt;strong&gt;Nessie + Iceberg&lt;/strong&gt; combo is particularly powerful:&lt;br&gt;&lt;br&gt;
→ Branch your &lt;em&gt;data&lt;/em&gt; like code,&lt;br&gt;&lt;br&gt;
→ Test transformations in isolation,&lt;br&gt;&lt;br&gt;
→ Merge with confidence.&lt;/p&gt;
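
&lt;p&gt;In practice, the branch, test, merge loop maps onto a handful of statements from Nessie's Spark SQL extensions. The sketch below only builds those statements, so it runs without a cluster. The helper names are hypothetical, and the catalog name &lt;code&gt;nessie&lt;/code&gt; is an assumption.&lt;/p&gt;

```python
def _ident(name):
    """Allow only plain identifiers so names cannot inject extra SQL.

    Simplification: real Nessie reference names may also contain
    characters such as '-' or '/', which this check rejects.
    """
    if not name.isidentifier():
        raise ValueError(f"unsafe reference name: {name!r}")
    return name


def start_branch_sql(branch="dev", source="main", catalog="nessie"):
    """Statements that create a branch from source and switch the session to it."""
    branch, source, catalog = _ident(branch), _ident(source), _ident(catalog)
    return [
        f"CREATE BRANCH IF NOT EXISTS {branch} IN {catalog} FROM {source}",
        f"USE REFERENCE {branch} IN {catalog}",
    ]


def merge_branch_sql(branch="dev", target="main", catalog="nessie"):
    """Statement that merges the branch back once tests have passed."""
    branch, target, catalog = _ident(branch), _ident(target), _ident(catalog)
    return [f"MERGE BRANCH {branch} INTO {target} IN {catalog}"]


def run_all(spark, statements):
    """Execute each statement on a SparkSession-like object exposing .sql()."""
    for statement in statements:
        spark.sql(statement)


if __name__ == "__main__":
    for line in start_branch_sql() + merge_branch_sql():
        print(line)
```

&lt;p&gt;With a live session, &lt;code&gt;run_all(spark, start_branch_sql())&lt;/code&gt; would isolate subsequent writes on &lt;code&gt;dev&lt;/code&gt;, and &lt;code&gt;run_all(spark, merge_branch_sql())&lt;/code&gt; would publish them to &lt;code&gt;main&lt;/code&gt;.&lt;/p&gt;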




&lt;h3&gt;
  🛠️ Key Technical Wins
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Idempotent dbt integration&lt;/strong&gt;: Dagster recompiles the dbt project only when models change, so unchanged runs skip a redundant &lt;code&gt;dbt compile&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No resource conflicts&lt;/strong&gt;: Clean separation between &lt;code&gt;dagster/&lt;/code&gt; (orchestration) and &lt;code&gt;dbt/retail_lakehouse/&lt;/code&gt; (transformation).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full observability&lt;/strong&gt;: Every asset shows lineage, materialization history, and test results in the Dagster UI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark + Nessie config&lt;/strong&gt;: Verified working with Spark Thrift, Iceberg catalog, and MinIO S3 endpoint.&lt;/li&gt;
&lt;/ul&gt;
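
&lt;p&gt;One way to get the "recompile only when models change" behaviour is to fingerprint the dbt project files and compare against the fingerprint recorded at the last successful compile. The sketch below illustrates that idea only; it is not necessarily how the repo does it, and the function names and stamp-file location are assumptions.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def project_fingerprint(project_dir, patterns=("*.sql", "*.yml", "*.yaml")):
    """Hash the relative path and content of every matching project file."""
    root = Path(project_dir)
    files = sorted({p for pat in patterns for p in root.rglob(pat) if p.is_file()})
    digest = hashlib.sha256()
    for path in files:
        # Include the path so renames change the fingerprint too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def needs_compile(project_dir, stamp_file):
    """True when the project differs from the last recorded fingerprint."""
    stamp = Path(stamp_file)
    if not stamp.exists():
        return True
    return stamp.read_text().strip() != project_fingerprint(project_dir)


def record_compile(project_dir, stamp_file):
    """Store the current fingerprint; call this only after a successful compile."""
    stamp = Path(stamp_file)
    stamp.parent.mkdir(parents=True, exist_ok=True)
    stamp.write_text(project_fingerprint(project_dir))
```

&lt;p&gt;Keeping the check and the write separate means a failed compile never marks the project as up to date. The stamp file should live outside the hashed patterns, for example under a state directory.&lt;/p&gt;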




&lt;h3&gt;
  📦 What’s Next?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Add Trino for ad-hoc querying,&lt;/li&gt;
&lt;li&gt;Automate &lt;code&gt;dbt docs&lt;/code&gt; + Dagster lineage publishing,&lt;/li&gt;
&lt;li&gt;Deploy to Kubernetes (dev → staging → prod).&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  🔗 Try It Yourself
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository and enter it&lt;/span&gt;
git clone https://github.com/RidaMft/dagster-dbt-iceberg.git
&lt;span class="nb"&gt;cd &lt;/span&gt;dagster-dbt-iceberg
&lt;span class="c"&gt;# Build and start both compose stacks in the background, reading variables from .env&lt;/span&gt;
docker compose &lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose.yaml &lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose-dagster.yaml &lt;span class="nt"&gt;--env-file&lt;/span&gt; .env up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ Open &lt;a href="http://localhost:3000" rel="noopener noreferrer"&gt;&lt;code&gt;http://localhost:3000&lt;/code&gt;&lt;/a&gt; and explore the assets.&lt;/p&gt;
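
&lt;p&gt;The containers can take a little while to come up after &lt;code&gt;docker compose up&lt;/code&gt;. A small polling helper, sketched below with only the standard library, can tell you when the UI on port 3000 starts answering. The function names and retry defaults are assumptions, not part of the repo.&lt;/p&gt;

```python
import time
import urllib.error
import urllib.request


def ui_is_up(url="http://localhost:3000", timeout_s=2.0):
    """Return True if the URL answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until(probe, attempts=30, delay_s=2.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, pausing between tries.

    Returns True as soon as probe() succeeds, False if every try fails.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt != attempts - 1:  # no pause after the final try
            sleep(delay_s)
    return False
```

&lt;p&gt;Calling &lt;code&gt;wait_until(ui_is_up)&lt;/code&gt; from a Python shell blocks for about a minute at most with these defaults and returns &lt;code&gt;True&lt;/code&gt; once the page loads.&lt;/p&gt;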




&lt;h3&gt;
  🤝 Your Turn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Are you using open-source or cloud-managed tools for your lakehouse?&lt;/li&gt;
&lt;li&gt;What’s missing in the OSS ecosystem?&lt;/li&gt;
&lt;li&gt;Want a &lt;strong&gt;step-by-step tutorial&lt;/strong&gt; to reproduce this?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Star/fork the repo — contributions and feedback welcome!&lt;br&gt;&lt;br&gt;
👉 DM me for consulting or tailored workshops.&lt;/p&gt;

&lt;p&gt;#DataEngineering #OpenSource #Lakehouse #dbt #Dagster #Iceberg #Nessie #Spark #AnalyticsEngineering #RetailAnalytics&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>docker</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
