<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anthony Gicheru</title>
    <description>The latest articles on DEV Community by Anthony Gicheru (@anthony-gicheru).</description>
    <link>https://dev.to/anthony-gicheru</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1186529%2Fed9dc374-bfac-4eee-bce7-90ea63105510.jpeg</url>
      <title>DEV Community: Anthony Gicheru</title>
      <link>https://dev.to/anthony-gicheru</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anthony-gicheru"/>
    <language>en</language>
    <item>
      <title>Why Your Code Breaks in Production (and How Docker Fixes It)</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Tue, 12 May 2026 05:28:52 +0000</pubDate>
      <link>https://dev.to/anthony-gicheru/why-your-code-breaks-in-production-and-how-docker-fixes-it-3meo</link>
      <guid>https://dev.to/anthony-gicheru/why-your-code-breaks-in-production-and-how-docker-fixes-it-3meo</guid>
      <description>&lt;h2&gt;
  
  
  1. Why This Matters
&lt;/h2&gt;

&lt;p&gt;You write your code.&lt;br&gt;
You test it locally.&lt;br&gt;
Everything works perfectly.&lt;/p&gt;

&lt;p&gt;Then it goes to production… and breaks.&lt;/p&gt;

&lt;p&gt;You spend hours debugging, only to realize:&lt;br&gt;
&lt;strong&gt;nothing is wrong with your code — the environment is the problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In data engineering, this happens all the time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Spark job runs locally but fails in production&lt;/li&gt;
&lt;li&gt;Airflow works on Ubuntu but breaks on macOS&lt;/li&gt;
&lt;li&gt;Kafka pipelines behave differently across environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At its core, the issue is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Your environment is not consistent.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Containerization solves this&lt;/strong&gt; by packaging everything your application needs into a single, portable unit that runs the same way anywhere.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. Core Concept — What is Containerization?
&lt;/h2&gt;

&lt;p&gt;Let’s simplify it with an analogy.&lt;/p&gt;
&lt;h3&gt;
  
  
  Analogy: A Fully Equipped House
&lt;/h3&gt;

&lt;p&gt;Imagine being placed in an empty field with nothing around you.&lt;/p&gt;

&lt;p&gt;No food.&lt;br&gt;
No water.&lt;br&gt;
No electricity.&lt;br&gt;
No shelter.&lt;/p&gt;

&lt;p&gt;You might survive for a while, but functioning properly would be difficult.&lt;/p&gt;

&lt;p&gt;Now imagine being placed inside a fully equipped house.&lt;/p&gt;

&lt;p&gt;Everything you need is already there:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;food&lt;/li&gt;
&lt;li&gt;water&lt;/li&gt;
&lt;li&gt;electricity&lt;/li&gt;
&lt;li&gt;furniture&lt;/li&gt;
&lt;li&gt;internet&lt;/li&gt;
&lt;li&gt;a bed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No matter where that house is moved, you can still live comfortably because your essentials move with you.&lt;/p&gt;

&lt;p&gt;Applications work the same way.&lt;/p&gt;

&lt;p&gt;An application needs certain things to function:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;libraries&lt;/li&gt;
&lt;li&gt;runtime versions&lt;/li&gt;
&lt;li&gt;system tools&lt;/li&gt;
&lt;li&gt;environment variables&lt;/li&gt;
&lt;li&gt;dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without them, the application breaks.&lt;/p&gt;

&lt;p&gt;Containerization solves this problem by packaging the application together with everything it needs to run.&lt;/p&gt;

&lt;p&gt;Think of a container as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a fully equipped house for your application.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryt9amwrext4s0y67ptq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryt9amwrext4s0y67ptq.png" alt="Diagram comparing a Docker container to a fully equipped house containing everything an application needs to run consistently across different environments." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Inside the container, the app already has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;its dependencies&lt;/li&gt;
&lt;li&gt;configurations&lt;/li&gt;
&lt;li&gt;runtime environment&lt;/li&gt;
&lt;li&gt;required tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So whether the container runs on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;your laptop&lt;/li&gt;
&lt;li&gt;a cloud server&lt;/li&gt;
&lt;li&gt;a teammate’s machine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…the application still behaves the same way.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Mental Model
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Containerization gives your application its own portable environment with everything it needs to survive and run consistently.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  3. Docker Basics
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Key Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Image&lt;/strong&gt; - A blueprint/template&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container&lt;/strong&gt; - A running instance of that image&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dockerfile&lt;/strong&gt; - Instructions to build the image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frusvrsj0vtnh3eey8x6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frusvrsj0vtnh3eey8x6m.png" alt="A clean modern DevOps-style diagram showing the relationship between a Dockerfile, a Docker Image, and a running Docker Container." width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Let’s Make It Real
&lt;/h3&gt;

&lt;p&gt;Here’s the smallest possible Docker setup for a Python app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;app.py&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello from Docker!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dockerfile&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.10-slim&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; app.py .&lt;/span&gt;

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["python", "app.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Build and Run
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; my-python-app &lt;span class="nb"&gt;.&lt;/span&gt;
docker run my-python-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Notice what we didn’t do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install Python manually&lt;/li&gt;
&lt;li&gt;Manage versions&lt;/li&gt;
&lt;li&gt;Configure anything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The environment is fully defined in the Dockerfile.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Why Docker is Useful in Data Engineering
&lt;/h2&gt;

&lt;p&gt;In real-world data systems, you work with tools like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Airflow&lt;/li&gt;
&lt;li&gt;Spark / PySpark&lt;/li&gt;
&lt;li&gt;PostgreSQL or another data warehouse&lt;/li&gt;
&lt;li&gt;Reporting tools or dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different dependencies&lt;/li&gt;
&lt;li&gt;Different configurations&lt;/li&gt;
&lt;li&gt;Different runtime requirements&lt;/li&gt;
&lt;li&gt;Different ports&lt;/li&gt;
&lt;li&gt;Different environment variables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without Docker, they often conflict.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Airflow may require specific Python packages&lt;/li&gt;
&lt;li&gt;PySpark may need Java and Spark installed&lt;/li&gt;
&lt;li&gt;PostgreSQL may need database credentials and storage&lt;/li&gt;
&lt;li&gt;Dashboard tools may need access to the processed data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Docker:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;each tool runs in its own isolated environment — no conflicts, no surprises.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is especially useful in batch data pipelines because the entire workflow can be reproduced across different machines and environments.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Docker Compose — Managing Multiple Containers
&lt;/h2&gt;

&lt;p&gt;Real systems are never just one container.&lt;/p&gt;

&lt;p&gt;A Dockerized data engineering pipeline may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An Airflow webserver&lt;/li&gt;
&lt;li&gt;An Airflow scheduler&lt;/li&gt;
&lt;li&gt;A PostgreSQL database&lt;/li&gt;
&lt;li&gt;A Spark / PySpark processing service&lt;/li&gt;
&lt;li&gt;Shared folders for DAGs, logs, scripts, and data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running each service manually quickly becomes painful.&lt;/p&gt;




&lt;h3&gt;
  
  
  Docker vs Docker Compose
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; - runs one container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Compose&lt;/strong&gt; - runs an entire system made up of multiple containers&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  The Key Insight
&lt;/h3&gt;

&lt;p&gt;Without Docker Compose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple terminals&lt;/li&gt;
&lt;li&gt;Manual startup order&lt;/li&gt;
&lt;li&gt;Constant configuration issues&lt;/li&gt;
&lt;li&gt;Harder networking between services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Docker Compose:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;one command starts everything.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkm87ktdgl98ec2vm9yf9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkm87ktdgl98ec2vm9yf9.png" alt="Docker Compose orchestration diagram showing multiple services being started and managed from one docker-compose.yml file." width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Example: Multi-Service Setup
&lt;/h3&gt;

&lt;p&gt;A simplified Docker Compose setup for a batch pipeline may include Airflow and PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;docker-compose.yml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;airflow-webserver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apache/airflow:3.2.1&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow_webserver&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow webserver&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080:8080"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;AIRFLOW__CORE__EXECUTOR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LocalExecutor&lt;/span&gt;
      &lt;span class="na"&gt;AIRFLOW__DATABASE__SQL_ALCHEMY_CONN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql+psycopg2://airflow:airflow@postgres:5432/airflow&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./dags:/opt/airflow/dags&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./logs:/opt/airflow/logs&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./jobs:/opt/airflow/jobs&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;

  &lt;span class="na"&gt;airflow-scheduler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apache/airflow:3.2.1&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow_scheduler&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow scheduler&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;AIRFLOW__CORE__EXECUTOR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LocalExecutor&lt;/span&gt;
      &lt;span class="na"&gt;AIRFLOW__DATABASE__SQL_ALCHEMY_CONN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql+psycopg2://airflow:airflow@postgres:5432/airflow&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./dags:/opt/airflow/dags&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./logs:/opt/airflow/logs&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./jobs:/opt/airflow/jobs&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;

  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:16&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres_db&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5433:5432"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres_data:/var/lib/postgresql/data&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  6. Common Mistakes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Using &lt;code&gt;localhost&lt;/code&gt; inside containers&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This trips up almost everyone at first.&lt;/p&gt;

&lt;p&gt;Inside a container:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;localhost&lt;/code&gt; refers to the container itself, not your machine.&lt;/p&gt;
&lt;/blockquote&gt;
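
&lt;p&gt;For example, with the Compose file above, the Airflow containers must reach the database through the service name, not &lt;code&gt;localhost&lt;/code&gt; (a sketch; the values mirror the Compose example, where Docker's internal DNS resolves the service name &lt;code&gt;postgres&lt;/code&gt;):&lt;/p&gt;

```yaml
# Wrong: inside the Airflow container, localhost is the Airflow container itself
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

# Right: the Compose service name is the hostname on the shared network
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
```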

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Forgetting environment variables&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Missing configs often cause silent failures.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Not persisting data&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Containers are ephemeral. Without volumes, your data disappears when the container is removed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres_data:/var/lib/postgresql/data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Rebuilding unnecessarily&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Poor Dockerfile structure can slow builds significantly.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Best Practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use lightweight images&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;  FROM python:3.10-slim
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add a &lt;code&gt;.dockerignore&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  node_modules
  .git
  .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Avoid &lt;code&gt;latest&lt;/code&gt; in production&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use fixed versions to keep builds predictable.&lt;/p&gt;
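
&lt;p&gt;A quick sketch of the difference:&lt;/p&gt;

```dockerfile
# Risky: "latest" is a moving target; the base image can change between builds
FROM python:latest

# Predictable: a pinned tag pulls the same base image every time
FROM python:3.10-slim
```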

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Separate dev and production setups&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They have different requirements.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use Docker Compose for local development&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It helps simulate real systems easily.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use clear service names&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kafka&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;postgres&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;airflow&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This simplifies networking and debugging.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Conclusion
&lt;/h2&gt;

&lt;p&gt;Containerization changes how you think about environments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker packages your application into a portable unit.&lt;/li&gt;
&lt;li&gt;Docker Compose runs entire systems with one command.&lt;/li&gt;
&lt;li&gt;Your pipelines become reproducible and consistent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real shift is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You stop debugging environments — and start defining them as code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And once you reach that point:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You’re no longer just writing code — you’re building systems.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>containers</category>
      <category>docker</category>
      <category>devops</category>
    </item>
    <item>
      <title>Data Warehouses, Data Marts, Data Lakes, and Lakehouses - Explained Like You’re Building Them in Real Life</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Sun, 03 May 2026 08:10:27 +0000</pubDate>
      <link>https://dev.to/anthony-gicheru/data-warehouses-data-marts-data-lakes-and-lakehouses-explained-like-youre-building-them-in-3hmm</link>
      <guid>https://dev.to/anthony-gicheru/data-warehouses-data-marts-data-lakes-and-lakehouses-explained-like-youre-building-them-in-3hmm</guid>
      <description>&lt;p&gt;If you’ve been around data engineering long enough, you’ve probably heard these terms thrown around in meetings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Just dump it in the data lake”&lt;/li&gt;
&lt;li&gt;“We’ll expose it through the warehouse”&lt;/li&gt;
&lt;li&gt;“That goes into the mart”&lt;/li&gt;
&lt;li&gt;“We’re moving to a lakehouse architecture”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And honestly… it can sound like four different ways of saying the same thing.&lt;/p&gt;

&lt;p&gt;They’re not.&lt;/p&gt;

&lt;p&gt;Each one solves a slightly different problem in the data ecosystem. Once you understand the “why” behind each, the architecture suddenly feels a lot less like buzzwords and more like a clean system design.&lt;/p&gt;

&lt;p&gt;Let’s break it down in a practical, engineer-first way.&lt;/p&gt;




&lt;h1&gt;
  
  
  1. The Big Picture (Why all these systems exist)
&lt;/h1&gt;

&lt;p&gt;In most companies, data doesn’t come from one place — it flows in from everywhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User clicks from web/mobile apps&lt;/li&gt;
&lt;li&gt;Payments and transactions&lt;/li&gt;
&lt;li&gt;Logs from servers&lt;/li&gt;
&lt;li&gt;Third-party APIs (Stripe, Shopify, etc.)&lt;/li&gt;
&lt;li&gt;IoT or streaming data (Kafka, sensors, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now here’s the problem:&lt;/p&gt;

&lt;p&gt;Raw data is messy. Business users don’t want messy.&lt;/p&gt;

&lt;p&gt;So we build systems that progressively refine data from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Raw → Clean → Structured → Business-ready&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s where these four concepts come in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Lake&lt;/strong&gt; → store everything raw&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Warehouse&lt;/strong&gt; → structured analytics-ready data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Mart&lt;/strong&gt; → department-specific slices of warehouse data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lakehouse&lt;/strong&gt; → hybrid of lake + warehouse&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  2. Data Lake — “Store everything first, figure it out later”
&lt;/h1&gt;

&lt;p&gt;A &lt;strong&gt;data lake&lt;/strong&gt; is basically a massive storage system where you dump raw data in its original format.&lt;/p&gt;

&lt;p&gt;Think of it like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A giant warehouse where you throw every box in as-is, without opening it.&lt;br&gt;
Or even better: a farm storage system where everything is stored right after harvest, unprocessed and mixed together.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Characteristics:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Stores structured, semi-structured, and unstructured data&lt;/li&gt;
&lt;li&gt;Cheap storage (usually object storage like S3)&lt;/li&gt;
&lt;li&gt;Schema is applied &lt;strong&gt;when reading&lt;/strong&gt;, not writing (schema-on-read)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example tools:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Amazon S3&lt;/li&gt;
&lt;li&gt;Azure Data Lake Storage&lt;/li&gt;
&lt;li&gt;Google Cloud Storage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example:
&lt;/h3&gt;

&lt;p&gt;You might store:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/events/clicks/2026/05/01.json
/logs/api/2026/05/01.log
/payments/stripe/2026/05/01.parquet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No transformations. No enforcement. Just storage.&lt;/p&gt;
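
&lt;p&gt;Schema-on-read in miniature (a toy Python sketch, not tied to any particular lake engine): the event is stored exactly as it arrived, and structure is imposed only when a reader parses it.&lt;/p&gt;

```python
import json

# The lake stores the raw event untouched
raw_line = '{"order_id": 101, "user_id": 55, "amount": 250}'

# Schema-on-read: parsing, typing, and defaults happen at read time
record = json.loads(raw_line)
amount = float(record.get("amount", 0))  # the reader decides the type
print(amount)  # 250.0
```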

&lt;h3&gt;
  
  
  Here’s the catch:
&lt;/h3&gt;

&lt;p&gt;If you’re not careful, a data lake becomes a &lt;strong&gt;data swamp&lt;/strong&gt; — lots of data, zero usability.&lt;/p&gt;




&lt;h1&gt;
  
  
  3. Data Warehouse — “Clean, structured, and business-ready”
&lt;/h1&gt;

&lt;p&gt;A &lt;strong&gt;data warehouse&lt;/strong&gt; is where data goes after it has been cleaned, transformed, and modeled for analytics.&lt;/p&gt;

&lt;p&gt;Think of it like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A well-organized supermarket where everything is cleaned, packaged, labeled, and placed on the right shelves.&lt;br&gt;
You don’t pick raw potatoes from the soil — you get them washed, sorted, and priced.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Characteristics:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Structured data only&lt;/li&gt;
&lt;li&gt;Schema-on-write (you define structure before loading)&lt;/li&gt;
&lt;li&gt;Optimized for analytics queries (OLAP systems)&lt;/li&gt;
&lt;li&gt;Highly curated and trustworthy&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example tools:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Redshift&lt;/li&gt;
&lt;li&gt;Snowflake&lt;/li&gt;
&lt;li&gt;Google BigQuery&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Typical workflow:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Extract data from sources&lt;/li&gt;
&lt;li&gt;Transform (clean, join, aggregate)&lt;/li&gt;
&lt;li&gt;Load into warehouse tables&lt;/li&gt;
&lt;/ol&gt;
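
&lt;p&gt;The three steps can be sketched end-to-end in a toy Python example (SQLite stands in for the warehouse; the rows and values are made up for illustration):&lt;/p&gt;

```python
import sqlite3

# 1. Extract: raw rows as they might arrive from a source system
raw_orders = [
    {"order_id": 1, "amount": "250.00", "order_date": "2026-05-01"},
    {"order_id": 2, "amount": "99.50", "order_date": "2026-05-01"},
]

# 2. Transform: clean and type the fields before loading (schema-on-write)
clean = [(r["order_id"], float(r["amount"]), r["order_date"]) for r in raw_orders]

# 3. Load into a warehouse-style fact table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (order_id INT, amount REAL, order_date TEXT)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", clean)

total = conn.execute("SELECT SUM(amount) FROM sales_fact").fetchone()[0]
print(total)  # 349.5
```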

&lt;h3&gt;
  
  
  Example SQL model:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales_fact&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now business analysts can run queries like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_fact&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  4. Data Marts — “Department-specific mini warehouses”
&lt;/h1&gt;

&lt;p&gt;A &lt;strong&gt;data mart&lt;/strong&gt; is a subset of a data warehouse focused on a specific business domain.&lt;/p&gt;

&lt;p&gt;Think of it like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A grocery store or specialty shop — like a bakery, butcher, or vegetable shop.&lt;br&gt;
It doesn’t sell everything. It only sells what its customers actually need.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Characteristics:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Smaller scope than a warehouse&lt;/li&gt;
&lt;li&gt;Built for a specific team (finance, marketing, sales)&lt;/li&gt;
&lt;li&gt;Faster queries for targeted use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example:
&lt;/h3&gt;

&lt;p&gt;A marketing data mart might include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Campaign performance&lt;/li&gt;
&lt;li&gt;Customer acquisition metrics&lt;/li&gt;
&lt;li&gt;Ad spend data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example structure:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;marketing_campaign_performance&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;campaign_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_clicks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_impressions&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ad_events&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;campaign_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why it exists:
&lt;/h3&gt;

&lt;p&gt;Instead of everyone querying a massive warehouse, teams get &lt;strong&gt;pre-optimized datasets&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  5. Data Lakehouse — “Best of both worlds”
&lt;/h1&gt;

&lt;p&gt;Now this is where things get interesting.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;lakehouse&lt;/strong&gt; combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The flexibility of a data lake&lt;/li&gt;
&lt;li&gt;The structure and performance of a data warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A modern retail system where the warehouse and supermarket are combined into one smart facility.&lt;br&gt;
Raw goods arrive, but they are immediately tracked, organized, and made queryable without losing flexibility.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Characteristics:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Uses low-cost storage (like a lake)&lt;/li&gt;
&lt;li&gt;Adds structure, ACID transactions, and governance&lt;/li&gt;
&lt;li&gt;Supports both analytics and ML workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example tools:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Apache Spark + Delta Lake&lt;/li&gt;
&lt;li&gt;Apache Iceberg&lt;/li&gt;
&lt;li&gt;Apache Hudi&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why it matters:
&lt;/h3&gt;

&lt;p&gt;In traditional setups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data lakes = flexible but messy&lt;/li&gt;
&lt;li&gt;Warehouses = clean but expensive and rigid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lakehouses try to remove that tradeoff.&lt;/p&gt;




&lt;h1&gt;
  
  
  6. How They Work Together (Real Architecture Flow)
&lt;/h1&gt;

&lt;p&gt;A modern data pipeline often looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Data Sources ]
      ↓
   DATA LAKE (raw storage)
      ↓
ETL / ELT pipelines (Airflow, Spark)
      ↓
DATA WAREHOUSE (modeled data)
      ↓
DATA MARTS (team-specific views)
      ↓
Dashboards / BI tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or in a lakehouse setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Data Sources ]
      ↓
DATA LAKEHOUSE (single system)
      ↓
BI + ML + Analytics directly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  7. Practical Example (Mini Pipeline)
&lt;/h1&gt;

&lt;p&gt;Let’s say we’re processing e-commerce data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Raw data in S3 (Data Lake)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"order_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;101&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-01T10:00:00Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Spark transformation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;etl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://datalake/raw/orders/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;clean_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
             &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;clean_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://warehouse/sales_fact/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Load into warehouse (Redshift example)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;sales_fact&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'s3://warehouse/sales_fact/'&lt;/span&gt;
&lt;span class="n"&gt;IAM_ROLE&lt;/span&gt; &lt;span class="s1"&gt;'arn:aws:iam::123456:role/RedshiftRole'&lt;/span&gt;
&lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Create a data mart
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales_summary&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_fact&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
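You can see the mart's shape without a warehouse at all: the same aggregation runs against an in-memory SQLite table with a few hypothetical sample rows.

```python
import sqlite3

# In-memory stand-in for the warehouse fact table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (order_id INTEGER, order_date TEXT, order_amount REAL)"
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [(101, "2026-05-01", 250.0), (102, "2026-05-01", 100.0), (103, "2026-05-02", 75.0)],
)

# The mart is just a pre-aggregated view over the fact table
conn.execute(
    """CREATE TABLE sales_summary AS
       SELECT DATE(order_date) AS date, SUM(order_amount) AS revenue
       FROM sales_fact
       GROUP BY DATE(order_date)"""
)
rows = conn.execute("SELECT date, revenue FROM sales_summary ORDER BY date").fetchall()
print(rows)  # [('2026-05-01', 350.0), ('2026-05-02', 75.0)]
```

Same query shape as the Redshift version; only the engine differs.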






&lt;h1&gt;
  
  
  8. Common Pitfalls (Where most teams mess up)
&lt;/h1&gt;

&lt;h3&gt;
  
  
  1. Turning the data lake into a swamp
&lt;/h3&gt;

&lt;p&gt;Dumping everything without metadata or structure leads to chaos.&lt;/p&gt;
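One lightweight guard against the swamp is enforcing a predictable, partitioned naming convention in code, so every raw object is discoverable by source, dataset, and date. The helper below is illustrative, not a standard API:

```python
from datetime import date

def lake_key(source: str, dataset: str, ingest_date: date, fmt: str = "json") -> str:
    """Build a predictable, partitioned object key so raw data stays discoverable."""
    return (
        f"raw/{source}/{dataset}/"
        f"year={ingest_date.year}/month={ingest_date.month:02d}/day={ingest_date.day:02d}/"
        f"part-0000.{fmt}"
    )

print(lake_key("shop", "orders", date(2026, 5, 1)))
# raw/shop/orders/year=2026/month=05/day=01/part-0000.json
```

If every ingestion job builds keys through one function like this, the lake stays navigable even before a catalog exists.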

&lt;h3&gt;
  
  
  2. Over-modeling too early
&lt;/h3&gt;

&lt;p&gt;Trying to build perfect schemas upfront slows everything down.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Duplicating logic across marts
&lt;/h3&gt;

&lt;p&gt;You end up with inconsistent metrics like “Revenue_v1”, “Revenue_final”, “Revenue_real_final”.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. No governance layer
&lt;/h3&gt;

&lt;p&gt;Without access control and cataloging, nobody trusts the data.&lt;/p&gt;




&lt;h1&gt;
  
  
  9. Best Practices
&lt;/h1&gt;

&lt;h3&gt;
  
  
  1. Use layered architecture
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Raw (lake)&lt;/li&gt;
&lt;li&gt;Cleaned (staging)&lt;/li&gt;
&lt;li&gt;Modeled (warehouse)&lt;/li&gt;
&lt;li&gt;Aggregated (marts)&lt;/li&gt;
&lt;/ul&gt;
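These layers often map directly to storage prefixes. A minimal sketch, with the bucket names and layer labels purely illustrative:

```python
# One place that defines where each layer lives (names are examples, not a standard)
LAYERS = {
    "raw": "s3://datalake/raw/",          # lake: untouched source data
    "staging": "s3://datalake/staging/",  # cleaned, typed, deduplicated
    "warehouse": "s3://warehouse/",       # modeled facts and dimensions
    "marts": "s3://warehouse/marts/",     # team-specific aggregates
}

def layer_path(layer: str, dataset: str) -> str:
    """Resolve a dataset's storage path from its layer."""
    return LAYERS[layer] + dataset + "/"

print(layer_path("raw", "orders"))  # s3://datalake/raw/orders/
```

Centralizing the mapping means pipelines never hard-code paths, and promoting a dataset between layers is a one-line change.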

&lt;h3&gt;
  
  
  2. Standardize transformations
&lt;/h3&gt;

&lt;p&gt;Use tools like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dbt&lt;/li&gt;
&lt;li&gt;Apache Airflow&lt;/li&gt;
&lt;li&gt;Spark jobs with clear ownership&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Define a single source of truth
&lt;/h3&gt;

&lt;p&gt;One metric definition per business KPI. No duplicates.&lt;/p&gt;
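One pragmatic pattern is a single metric registry that every mart reads from, so "revenue" can only ever mean one thing. A hypothetical sketch (dbt's metrics layer solves the same problem declaratively):

```python
# metrics.py — the one place each KPI is defined; every mart imports from here
METRICS = {
    "revenue": "SUM(order_amount)",
    "orders": "COUNT(DISTINCT order_id)",
}

def metric_sql(name: str) -> str:
    """Look up the canonical SQL expression for a KPI."""
    return METRICS[name]

print(metric_sql("revenue"))  # SUM(order_amount)
```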

&lt;h3&gt;
  
  
  4. Treat data like software
&lt;/h3&gt;

&lt;p&gt;Version it, test it, document it.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Monitor everything
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline failures&lt;/li&gt;
&lt;li&gt;Data freshness&lt;/li&gt;
&lt;li&gt;Schema changes&lt;/li&gt;
&lt;/ul&gt;
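A data-freshness check can be as simple as comparing the newest record's timestamp to an SLA. The threshold below is an example, not a recommendation:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_loaded_at: datetime, max_age: timedelta = timedelta(hours=24)) -> bool:
    """Flag a dataset whose newest record is older than the freshness SLA."""
    return datetime.now(timezone.utc) - last_loaded_at > max_age

fresh = datetime.now(timezone.utc) - timedelta(hours=1)
stale = datetime.now(timezone.utc) - timedelta(days=3)
print(is_stale(fresh), is_stale(stale))  # False True
```

Wired into a scheduled job that alerts on `True`, this catches silently stalled pipelines before dashboards go wrong.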




&lt;h1&gt;
  
  
  10. Conclusion — The mental model that matters
&lt;/h1&gt;

&lt;p&gt;If you strip away the jargon, it’s really simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Lake&lt;/strong&gt; → store everything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Warehouse&lt;/strong&gt; → clean and organize it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Mart&lt;/strong&gt; → tailor it for teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lakehouse&lt;/strong&gt; → unify storage and analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real skill in data engineering isn’t memorizing definitions.&lt;/p&gt;

&lt;p&gt;It’s knowing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When to keep data raw, when to structure it, and when to specialize it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once that clicks, designing data systems becomes a lot more intuitive — and honestly, more fun to build.&lt;/p&gt;

</description>
      <category>datawarehouse</category>
      <category>datamarts</category>
      <category>datalake</category>
      <category>lakehouse</category>
    </item>
    <item>
      <title>Refactoring Airflow Pipelines: From PythonOperator to TaskFlow</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Fri, 24 Apr 2026 10:52:39 +0000</pubDate>
      <link>https://dev.to/anthony-gicheru/refactoring-airflow-pipelines-from-pythonoperator-to-taskflow-25mk</link>
      <guid>https://dev.to/anthony-gicheru/refactoring-airflow-pipelines-from-pythonoperator-to-taskflow-25mk</guid>
      <description>&lt;h1&gt;
  
  
  Actually Embracing TaskFlow After a Year of Doing It the “Old Way”
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. Introduction: This Isn’t New… But It &lt;em&gt;Feels&lt;/em&gt; New
&lt;/h2&gt;

&lt;p&gt;If you’ve been using Airflow for a while, like I have, you probably didn’t start with the TaskFlow API.&lt;/p&gt;

&lt;p&gt;You likely started with the classic Airflow 2.x style:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PythonOperator&lt;/li&gt;
&lt;li&gt;&lt;code&gt;**kwargs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ti.xcom_push()&lt;/code&gt; and &lt;code&gt;ti.xcom_pull()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Explicit task chaining with &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I spent over a year building pipelines this way. And to be clear: it works. It’s stable, production-ready, and widely used.&lt;/p&gt;

&lt;p&gt;But here’s the interesting part:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The TaskFlow API has existed since Airflow 2.0. I just didn’t fully adopt it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Honestly, I ignored TaskFlow for a long time because I thought it was just ‘syntactic sugar’. That’s more common than people admit.&lt;/p&gt;

&lt;p&gt;Most production systems and tutorials still rely on operators, so you naturally stay in that pattern. It’s only later, when readability and maintainability start to matter, that TaskFlow becomes interesting.&lt;/p&gt;

&lt;p&gt;And once it clicks, it changes how you think about Airflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Core Concepts: Same Engine, Different Experience
&lt;/h2&gt;

&lt;p&gt;TaskFlow doesn’t replace Airflow concepts; it abstracts them.&lt;/p&gt;

&lt;p&gt;You still work with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tasks&lt;/li&gt;
&lt;li&gt;DAGs&lt;/li&gt;
&lt;li&gt;Scheduling&lt;/li&gt;
&lt;li&gt;XComs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference is &lt;em&gt;how&lt;/em&gt; you express them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional Approach
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ti&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;traditional_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transform&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works, but it introduces a lot of orchestration boilerplate into your business logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  TaskFlow Approach
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;taskflow_dag&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;taskflow_dag&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This feels simpler because it is.&lt;/p&gt;

&lt;p&gt;TaskFlow removes explicit XCom handling and lets function returns define data flow.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The Real Shift: From Wiring Tasks to Modeling Data Flow
&lt;/h2&gt;

&lt;p&gt;With the traditional approach, your mental model looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task A - XCom - Task B - XCom - Task C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With TaskFlow, it becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj2rveoa9vso89mowp9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj2rveoa9vso89mowp9p.png" alt="Airflow DAG: Traditional vs Taskflow" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Same execution engine. Different abstraction.&lt;/p&gt;

&lt;p&gt;The shift is from &lt;em&gt;task orchestration&lt;/em&gt; to &lt;em&gt;data flow composition&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. XComs: Manual vs Automatic
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Manual XComs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ti&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You manage everything explicitly.&lt;/p&gt;

&lt;h3&gt;
  
  
  TaskFlow XComs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Airflow handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;serialization&lt;/li&gt;
&lt;li&gt;storage&lt;/li&gt;
&lt;li&gt;retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You focus on logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  When You Still Need Control
&lt;/h3&gt;

&lt;p&gt;TaskFlow still allows explicit control when needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.models.xcom_arg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;XComArg&lt;/span&gt;

&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numbers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numbers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;XComArg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. Real-World Example: Gas Prices ETL Refactor
&lt;/h2&gt;

&lt;p&gt;I didn’t build two versions of this pipeline at once.&lt;/p&gt;

&lt;p&gt;I originally built it using the traditional Airflow 2.x approach and later refactored it using TaskFlow.&lt;/p&gt;

&lt;p&gt;That’s when the difference became clear.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline Overview
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API - Extract gas prices - Transform - Store in PostgreSQL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  GitHub Reference
&lt;/h3&gt;

&lt;p&gt;Full project: &lt;a href="https://github.com/Anthony-Gicheru/Gas-Prices-ETL-with-Apache-Airflow" rel="noopener noreferrer"&gt;Gas-Prices-ETL-with-Apache-Airflow on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It includes both the original DAG and the TaskFlow refactor.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional Version
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw_gas_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decoded_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fetch_gas_prices&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw_gas_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach tightly couples logic with Airflow internals.&lt;/p&gt;

&lt;p&gt;Data must be serialized manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;json_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  TaskFlow Version
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_gas_prices&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decoded_data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the pipeline becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_gas_prices&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;store_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reads like standard Python.&lt;/p&gt;
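&lt;p&gt;To see why, here is the same three-step flow as plain, Airflow-free Python (the sample payload and field names are hypothetical stand-ins for the real API):&lt;/p&gt;

```python
import json

# Airflow-free sketch of the same fetch -> transform -> store flow.
# Sample payload and field names are hypothetical.
def fetch_gas_prices():
    # Stand-in for the real API call: returns a raw JSON string.
    return json.dumps([
        {"station": "A1", "price": 1.899},
        {"station": "B2"},                 # malformed record
    ])

def transform_gas_prices(raw_data):
    records = json.loads(raw_data)
    # Drop records without a price and round the rest.
    cleaned = [r for r in records if "price" in r]
    for r in cleaned:
        r["price"] = round(r["price"], 2)
    return json.dumps(cleaned)

def store_gas_prices(cleaned):
    # Stand-in for a database insert.
    return json.loads(cleaned)

raw = fetch_gas_prices()
cleaned = transform_gas_prices(raw)
rows = store_gas_prices(cleaned)
print(rows)  # [{'station': 'A1', 'price': 1.9}]
```

&lt;p&gt;Swap &lt;code&gt;@task&lt;/code&gt; decorators onto these functions and the wiring above becomes a real DAG; the data flow itself doesn't change.&lt;/p&gt;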

&lt;h3&gt;
  
  
  What Changed
&lt;/h3&gt;

&lt;p&gt;The logic stayed the same. The structure changed completely.&lt;/p&gt;

&lt;p&gt;Instead of manually managing XComs, data flows naturally between functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before vs After
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Traditional&lt;/th&gt;
&lt;th&gt;TaskFlow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Task definition&lt;/td&gt;
&lt;td&gt;PythonOperator&lt;/td&gt;
&lt;td&gt;@task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data passing&lt;/td&gt;
&lt;td&gt;Manual XCom&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Readability&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Boilerplate&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mental model&lt;/td&gt;
&lt;td&gt;Wiring tasks&lt;/td&gt;
&lt;td&gt;Data flow&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  6. Lessons From the Refactor
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. TaskFlow doesn’t remove XComs
&lt;/h3&gt;

&lt;p&gt;It only hides them.&lt;/p&gt;

&lt;p&gt;You still need to respect serialization limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;big_dataframe&lt;/span&gt;  &lt;span class="c1"&gt;# still not ideal
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Passing data is easier, but not always better
&lt;/h3&gt;

&lt;p&gt;TaskFlow makes it easy to pass data between tasks, but large payloads should still live in external storage.&lt;/p&gt;
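&lt;p&gt;The usual workaround is to pass a reference instead of the payload. A minimal sketch of that pattern, with a temp directory standing in for S3 or another external store:&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path

# Tasks exchange only a small file path (XCom-friendly);
# the bulky data lives in external storage. A temp dir stands in
# for S3/GCS here.
STAGE = Path(tempfile.mkdtemp())

def extract():
    rows = [{"id": i, "value": i * 10} for i in range(1000)]
    path = STAGE / "raw.json"
    path.write_text(json.dumps(rows))
    return str(path)                # only this string crosses tasks

def transform(raw_path):
    rows = json.loads(Path(raw_path).read_text())
    kept = [r for r in rows if r["value"] % 20 == 0]
    out = STAGE / "clean.json"
    out.write_text(json.dumps(kept))
    return str(out)

clean_path = transform(extract())
print(len(json.loads(Path(clean_path).read_text())))  # 500
```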

&lt;h3&gt;
  
  
  3. Refactoring was mostly structural
&lt;/h3&gt;

&lt;p&gt;Most of the work was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;removing &lt;code&gt;**kwargs&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;replacing XCom logic with returns&lt;/li&gt;
&lt;li&gt;simplifying task boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. The biggest change is mental
&lt;/h3&gt;

&lt;p&gt;The shift was not technical, it was conceptual.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From: How do I connect tasks?&lt;br&gt;
To: How does data flow through this pipeline?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  7. Pitfalls to Avoid
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Don’t push large objects through XCom&lt;/li&gt;
&lt;li&gt;Don’t mix styles without intention&lt;/li&gt;
&lt;li&gt;Don’t overuse TaskFlow just because it’s cleaner&lt;/li&gt;
&lt;li&gt;Don’t forget serialization still exists&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8. Conclusion
&lt;/h2&gt;

&lt;p&gt;TaskFlow isn’t new, but adopting it after using the traditional approach makes its benefits clearer.&lt;/p&gt;

&lt;p&gt;It moves you from writing orchestration-heavy DAGs to writing clean Python workflows.&lt;/p&gt;

&lt;p&gt;And that shift improves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;readability&lt;/li&gt;
&lt;li&gt;maintainability&lt;/li&gt;
&lt;li&gt;reasoning about pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;TaskFlow simplifies DAG structure without changing Airflow’s core engine&lt;/li&gt;
&lt;li&gt;XComs still exist but are abstracted&lt;/li&gt;
&lt;li&gt;The real improvement is cleaner data flow modeling&lt;/li&gt;
&lt;li&gt;Refactoring old DAGs is one of the best ways to understand it&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>etl</category>
      <category>apacheairflow</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Data Pipelines Explained Simply (and How to Build Them with Python)</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Fri, 17 Apr 2026 07:34:55 +0000</pubDate>
      <link>https://dev.to/anthony-gicheru/data-pipelines-explained-simply-and-how-to-build-them-with-python-555</link>
      <guid>https://dev.to/anthony-gicheru/data-pipelines-explained-simply-and-how-to-build-them-with-python-555</guid>
      <description>&lt;p&gt;Data pipelines are the backbone of modern data-driven organizations. They automate the movement, transformation, and storage of data - from raw sources to actionable insights.&lt;/p&gt;

&lt;p&gt;Python has become the go-to language for building scalable pipelines because of its rich ecosystem, flexibility, and ease of use.&lt;/p&gt;

&lt;p&gt;This guide walks through the fundamentals, tools, and best practices for building robust data pipelines using Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding Data Pipelines&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Imagine you need to supply clean water to a village. The process involves collecting water from different sources (rivers, wells, rain), purifying it, transporting it, and storing it so people can access it whenever they need it.&lt;/p&gt;

&lt;p&gt;A data pipeline works in a very similar way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sonxpmecasd03c5xzhw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sonxpmecasd03c5xzhw.png" alt="A data pipeline represented as a water system, showing how raw data flows through ingestion, transformation, storage, and finally consumption." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It automates the journey of raw, unstructured data from multiple sources (like databases, APIs, or IoT devices) and transforms it into clean, usable data stored in a destination (like a data warehouse) for analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Components of a Data Pipeline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s break it down using the same analogy:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Collecting Water (Data Ingestion)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Just like gathering water from lakes or wells, a pipeline starts by extracting data from sources such as databases, APIs, spreadsheets, or sensors.&lt;/p&gt;

&lt;p&gt;The goal here is simple: get all the data into one system, no matter how scattered it is.&lt;/p&gt;
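&lt;p&gt;In its smallest form, ingestion is just pulling scattered records into one structure. A stdlib-only sketch, with a hypothetical CSV export and API response inlined as strings:&lt;/p&gt;

```python
import csv
import io
import json

# Two "sources": a CSV export and a JSON API response (both hypothetical).
csv_export = "order_id,amount\n1,250\n2,99\n"
api_response = '[{"order_id": 3, "amount": 410}]'

# Ingest everything into one list of records, however scattered.
records = list(csv.DictReader(io.StringIO(csv_export)))
records += json.loads(api_response)

print(len(records))  # 3
```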

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Filtering and Purifying (Data Transformation)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Raw water isn’t clean—and neither is raw data.&lt;/p&gt;

&lt;p&gt;At this stage, the pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removes duplicates&lt;/li&gt;
&lt;li&gt;Handles missing values&lt;/li&gt;
&lt;li&gt;Standardizes formats&lt;/li&gt;
&lt;li&gt;Enriches data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where messy data becomes usable.&lt;/p&gt;
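&lt;p&gt;A minimal transformation pass over some hypothetical ingested records, covering three of those steps:&lt;/p&gt;

```python
raw = [
    {"order_id": "1", "amount": "250"},
    {"order_id": "1", "amount": "250"},   # duplicate
    {"order_id": "2", "amount": None},    # missing value
]

seen = set()
clean = []
for row in raw:
    if row["order_id"] in seen:
        continue                          # remove duplicates
    seen.add(row["order_id"])
    # Handle the missing value and standardize types.
    amount = float(row["amount"]) if row["amount"] is not None else 0.0
    clean.append({"order_id": int(row["order_id"]), "amount": amount})

print(clean)  # [{'order_id': 1, 'amount': 250.0}, {'order_id': 2, 'amount': 0.0}]
```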

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Transporting Through Pipes (Data Movement)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once cleaned, water flows through pipes. In data pipelines, this represents the movement of data between systems.&lt;/p&gt;

&lt;p&gt;This can involve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETL processes&lt;/li&gt;
&lt;li&gt;Message queues (like Kafka)&lt;/li&gt;
&lt;li&gt;Cloud data transfer services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to move data efficiently without delays or bottlenecks.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Storing in Tanks (Data Storage)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Clean water is stored in tanks. Similarly, processed data is stored in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data warehouses (like Snowflake)&lt;/li&gt;
&lt;li&gt;Data lakes (like AWS S3)&lt;/li&gt;
&lt;li&gt;Databases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where data becomes ready for use.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Accessing on Demand (Data Consumption)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Finally, people use the water.&lt;/p&gt;

&lt;p&gt;In the same way, data is consumed through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dashboards&lt;/li&gt;
&lt;li&gt;APIs&lt;/li&gt;
&lt;li&gt;Machine learning models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where insights actually happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Essential Python Libraries and Tools&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Python supports every stage of a pipeline:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data Ingestion&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;requests&lt;/code&gt; - API calls&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pandas&lt;/code&gt; - handling CSV/JSON files&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Transformation&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pandas&lt;/code&gt; - cleaning and aggregation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PySpark&lt;/code&gt; - large-scale distributed processing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Storage&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SQLAlchemy&lt;/code&gt; - database interaction&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;boto3&lt;/code&gt; - AWS S3 integration&lt;/li&gt;
&lt;/ul&gt;
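&lt;p&gt;The storage step follows the same pattern regardless of backend. A sketch using the standard library's &lt;code&gt;sqlite3&lt;/code&gt; so it runs anywhere; in practice &lt;code&gt;SQLAlchemy&lt;/code&gt; (databases) or &lt;code&gt;boto3&lt;/code&gt; (S3) would play this role:&lt;/p&gt;

```python
import sqlite3

# In-memory database standing in for a real warehouse or store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")

# Load the cleaned records (hypothetical rows).
clean = [(1, 250.0), (2, 99.0)]
conn.executemany("INSERT INTO orders VALUES (?, ?)", clean)
conn.commit()

total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 349.0
```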

&lt;h3&gt;
  
  
  &lt;strong&gt;Orchestration&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Apache Airflow&lt;/code&gt; - workflow scheduling and automation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Dagster&lt;/code&gt; - modern pipeline orchestration with observability&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Best Practices&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Error Handling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Implement retries and proper logging to avoid silent failures.&lt;/p&gt;
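&lt;p&gt;One simple way to do this is a small retry wrapper; the helper and the flaky source below are hypothetical:&lt;/p&gt;

```python
import logging
import time

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("pipeline")

def with_retries(step, attempts=3, delay=0.0):
    # Re-run a flaky step, logging each failure instead of
    # letting the pipeline die silently.
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as exc:
            log.warning("attempt %d failed: %s", attempt, exc)
            if attempt == attempts:
                raise
            time.sleep(delay)

calls = {"n": 0}
def flaky_fetch():
    # Hypothetical source that fails on the first call only.
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("source unavailable")
    return {"rows": 42}

print(with_retries(flaky_fetch))  # {'rows': 42}
```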

&lt;h3&gt;
  
  
  &lt;strong&gt;Monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Track pipeline health using tools like Airflow’s UI.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Documentation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Keep clear documentation for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code&lt;/li&gt;
&lt;li&gt;Dependencies&lt;/li&gt;
&lt;li&gt;Workflow logic&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Test each stage of the pipeline using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unit tests&lt;/li&gt;
&lt;li&gt;Sample datasets&lt;/li&gt;
&lt;/ul&gt;
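&lt;p&gt;Because each stage is just a function, it can be tested in isolation against a tiny sample dataset. A sketch with a hypothetical transform:&lt;/p&gt;

```python
def drop_negative_amounts(rows):
    # Hypothetical transform stage: keep only non-negative amounts.
    return [r for r in rows if r["amount"] >= 0]

# Unit tests against a small, hand-written sample dataset.
sample = [{"amount": 10}, {"amount": -5}, {"amount": 0}]

assert drop_negative_amounts(sample) == [{"amount": 10}, {"amount": 0}]
assert drop_negative_amounts([]) == []
print("all stage tests passed")
```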

&lt;h2&gt;
  
  
  &lt;strong&gt;Popular Frameworks for Advanced Use Cases&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Airflow&lt;/strong&gt; - Best for complex workflows with dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dagster&lt;/strong&gt; - Strong focus on testing and data asset visibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefect&lt;/strong&gt; - Simplifies building fault-tolerant pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Luigi&lt;/strong&gt; - Good for batch processing and dependency management&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>etl</category>
      <category>python</category>
      <category>datapipeline</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>ETL vs ELT: Which One Should You Use and Why?</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Sun, 12 Apr 2026 21:36:22 +0000</pubDate>
      <link>https://dev.to/anthony-gicheru/etl-vs-elt-which-one-should-you-use-and-why-412e</link>
      <guid>https://dev.to/anthony-gicheru/etl-vs-elt-which-one-should-you-use-and-why-412e</guid>
      <description>&lt;p&gt;When I first started learning data engineering, ETL and ELT honestly felt like the same thing with just swapped letters. Everyone kept mentioning them like they were obvious concepts, but I had to sit down and really break them apart before it made sense.&lt;/p&gt;

&lt;p&gt;If you’re in the same place, don’t worry, you’re not alone.&lt;/p&gt;

&lt;p&gt;Let’s make it simple.&lt;/p&gt;

&lt;h2&gt;
  
  
  First things first: what do ETL and ELT even mean?
&lt;/h2&gt;

&lt;p&gt;Both ETL and ELT are ways of moving and processing data from one place to another.&lt;/p&gt;

&lt;h3&gt;
  
  
  ETL (Extract, Transform, Load)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt; data from a source (like an API or database)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt; it before storing it (cleaning, filtering, joining, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt; the final cleaned data into a target system (like a data warehouse)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key idea: &lt;em&gt;you clean the data before storing it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mu9s8n6tstb1jvvl1rn.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mu9s8n6tstb1jvvl1rn.PNG" alt="ELT" width="800" height="797"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ELT (Extract, Load, Transform)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt; data from the source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt; it directly into the storage system first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt; it inside the database/warehouse later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key idea: &lt;em&gt;you store raw data first, then clean it inside the system.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F108qm1z9tg391xj0sqrb.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F108qm1z9tg391xj0sqrb.PNG" alt="ETL" width="800" height="797"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  So what’s the real difference?
&lt;/h2&gt;

&lt;p&gt;The biggest difference is &lt;strong&gt;where the transformation happens&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETL → Transform happens outside the warehouse&lt;/li&gt;
&lt;li&gt;ELT → Transform happens inside the warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That one shift changes a lot more than you’d think.&lt;/p&gt;

&lt;h2&gt;
  
  
  When ETL makes sense
&lt;/h2&gt;

&lt;p&gt;ETL is usually used when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have smaller datasets&lt;/li&gt;
&lt;li&gt;You need strict data control before loading&lt;/li&gt;
&lt;li&gt;Your system can’t handle heavy processing&lt;/li&gt;
&lt;li&gt;Data quality must be enforced early&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like cleaning your room before putting things in storage.&lt;/p&gt;

&lt;p&gt;You don’t want messy data entering your system at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  When ELT makes sense
&lt;/h2&gt;

&lt;p&gt;ELT is more common in modern systems, especially with cloud platforms.&lt;/p&gt;

&lt;p&gt;It works well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have large volumes of data&lt;/li&gt;
&lt;li&gt;You’re using powerful cloud warehouses (like Snowflake or BigQuery)&lt;/li&gt;
&lt;li&gt;You want flexibility in how data is transformed&lt;/li&gt;
&lt;li&gt;You want to keep raw data for future use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like dumping everything into a warehouse first, then organizing it later when needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  A simple real-world example
&lt;/h2&gt;

&lt;p&gt;Imagine you’re building a dashboard for an e-commerce app.&lt;/p&gt;

&lt;h3&gt;
  
  
  With ETL:
&lt;/h3&gt;

&lt;p&gt;You:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull order data&lt;/li&gt;
&lt;li&gt;Clean it (remove duplicates, fix missing values)&lt;/li&gt;
&lt;li&gt;Then load it into your database ready for reporting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is neat before it even arrives.&lt;/p&gt;
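&lt;p&gt;ETL in miniature, with &lt;code&gt;sqlite3&lt;/code&gt; standing in for the reporting database and hypothetical order rows:&lt;/p&gt;

```python
import sqlite3

raw_orders = [
    {"id": 1, "total": "250"},
    {"id": 1, "total": "250"},   # duplicate to remove
    {"id": 2, "total": None},    # missing value to fix
]

# Transform FIRST...
seen, cleaned = set(), []
for o in raw_orders:
    if o["id"] in seen:
        continue
    seen.add(o["id"])
    cleaned.append((o["id"], float(o["total"] or 0)))

# ...then load only the neat rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, total REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", cleaned)

print(db.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 2
```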

&lt;h3&gt;
  
  
  With ELT:
&lt;/h3&gt;

&lt;p&gt;You:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull raw order data&lt;/li&gt;
&lt;li&gt;Load everything into a data warehouse&lt;/li&gt;
&lt;li&gt;Later write SQL transformations to clean and structure it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives you more flexibility if business rules change later.&lt;/p&gt;
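&lt;p&gt;The same miniature example as ELT: everything lands raw first, and the cleanup happens later as SQL inside the warehouse. Here &lt;code&gt;sqlite3&lt;/code&gt; stands in for the warehouse and the rows are hypothetical; the SQL step is the role &lt;strong&gt;dbt&lt;/strong&gt; plays in real stacks:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Load the raw data as-is: duplicates, missing values and all.
db.execute("CREATE TABLE raw_orders (id INTEGER, total TEXT)")
db.executemany("INSERT INTO raw_orders VALUES (?, ?)",
               [(1, "250"), (1, "250"), (2, None)])

# Later: a SQL transformation builds the clean table in place.
db.execute("""
    CREATE TABLE orders AS
    SELECT id, CAST(COALESCE(total, '0') AS REAL) AS total
    FROM raw_orders
    GROUP BY id
""")

print(db.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 2
```

&lt;p&gt;Notice the raw table is still there: if the business rules change, you just rewrite the SQL and rebuild.&lt;/p&gt;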

&lt;h2&gt;
  
  
  My key takeaway
&lt;/h2&gt;

&lt;p&gt;When I first learned this, I thought ETL was “old” and ELT was “new,” but that’s not really true.&lt;/p&gt;

&lt;p&gt;They both still matter.&lt;/p&gt;

&lt;p&gt;Here’s a simple way I now remember it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ETL = Clean first, store later&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ELT = Store first, clean later&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common mistakes beginners make
&lt;/h2&gt;

&lt;p&gt;A few things that confused me at the start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thinking ELT means “no cleaning” (it still involves transformation!)&lt;/li&gt;
&lt;li&gt;Mixing up where SQL transformations happen&lt;/li&gt;
&lt;li&gt;Assuming one is always better than the other (it depends on the system)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  So… which one should YOU use?
&lt;/h2&gt;

&lt;p&gt;There’s no universal winner.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you’re working with traditional systems → ETL is common&lt;/li&gt;
&lt;li&gt;If you’re in modern cloud data engineering → ELT is more popular&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most real companies actually use a &lt;strong&gt;mix of both&lt;/strong&gt;, depending on the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Tools in Real ETL and ELT Workflows
&lt;/h2&gt;

&lt;p&gt;To make this even more practical, here are some common tools used in real ETL and ELT workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  ETL Tools (Transformation happens before loading)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Airflow&lt;/strong&gt; – for scheduling and orchestrating ETL workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Informatica PowerCenter&lt;/strong&gt; – widely used in enterprise ETL pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Talend&lt;/strong&gt; – open-source tool for data integration and transformation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache NiFi&lt;/strong&gt; – good for real-time data flow and routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSIS (SQL Server Integration Services)&lt;/strong&gt; – Microsoft-based ETL tool&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tools usually handle data cleaning and transformation before sending data to a warehouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  ELT Tools (Transformation happens after loading)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake&lt;/strong&gt; – modern cloud data warehouse with strong ELT support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google BigQuery&lt;/strong&gt; – popular for serverless ELT workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Redshift&lt;/strong&gt; – widely used in AWS-based data stacks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dbt (Data Build Tool)&lt;/strong&gt; – one of the most popular tools for transformations inside the warehouse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks (Apache Spark)&lt;/strong&gt; – used for large-scale ELT processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In ELT setups, tools like &lt;strong&gt;dbt&lt;/strong&gt; handle transformation using SQL after data is loaded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Once I understood this difference, a lot of other concepts like data pipelines, warehouses, and analytics started to make way more sense.&lt;/p&gt;

&lt;p&gt;If you’re learning data engineering right now, don’t rush it. Build a small pipeline, try both approaches, and you’ll see the difference quickly.&lt;/p&gt;

&lt;p&gt;That’s where it really clicks.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>etl</category>
      <category>elt</category>
    </item>
  </channel>
</rss>
