<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anthony Gicheru</title>
    <description>The latest articles on DEV Community by Anthony Gicheru (@anthony-gicheru).</description>
    <link>https://dev.to/anthony-gicheru</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1186529%2Fed9dc374-bfac-4eee-bce7-90ea63105510.jpeg</url>
      <title>DEV Community: Anthony Gicheru</title>
      <link>https://dev.to/anthony-gicheru</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anthony-gicheru"/>
    <language>en</language>
    <item>
      <title>Apache Kafka Explained: A Practical Beginner Guide for Data Engineers</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Tue, 09 Jun 2026 05:38:19 +0000</pubDate>
      <link>https://dev.to/anthony-gicheru/apache-kafka-explained-a-practical-beginner-guide-for-data-engineers-1bh5</link>
      <guid>https://dev.to/anthony-gicheru/apache-kafka-explained-a-practical-beginner-guide-for-data-engineers-1bh5</guid>
      <description>&lt;p&gt;If you're learning data engineering, you'll probably meet Apache Kafka very early. You'll see it in job descriptions, system design diagrams, real-time analytics projects, fraud detection systems, and streaming pipelines. But Kafka can feel confusing at first because most explanations jump straight into brokers, partitions, and consumer groups before explaining the actual problem Kafka solves.&lt;/p&gt;

&lt;p&gt;So let's start there.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Kafka Exists
&lt;/h2&gt;

&lt;p&gt;Traditional batch pipelines are built to answer one question: &lt;em&gt;What does the data look like right now?&lt;/em&gt; Pull from the source, transform it, load it somewhere. Run it every hour, or every night. That works fine until you need to react to something the moment it happens.&lt;/p&gt;

&lt;p&gt;A payment system that catches fraud &lt;em&gt;after&lt;/em&gt; the transaction clears isn't catching fraud. A recommendation engine that updates overnight isn't personalizing anything in real time. A location tracking system that refreshes every hour is useless for logistics. These systems don't just need data. They need to know what's happening &lt;em&gt;now&lt;/em&gt;, as events occur.&lt;/p&gt;

&lt;p&gt;That's the gap Kafka fills. It's a system built to capture events and move them to wherever they're needed, reliably and at scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Concepts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Events
&lt;/h3&gt;

&lt;p&gt;An event is a record of something that happened. Not a current state, not an aggregation. Just a fact, timestamped and immutable. A user placed an order. A payment failed. A package changed location. A temperature sensor crossed a threshold.&lt;/p&gt;

&lt;p&gt;In Kafka, events are stored as key-value records, typically in JSON or Avro:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4821&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"purchase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"item_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SKU-009"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-05T14:32:00Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once an event is written to Kafka, it doesn't change. It becomes part of a permanent, ordered history.&lt;/p&gt;




&lt;h3&gt;
  
  
  Topics
&lt;/h3&gt;

&lt;p&gt;A topic is a named stream of related events, basically a category. You might have an &lt;code&gt;orders&lt;/code&gt; topic that stores every order placed on your platform, a &lt;code&gt;payments&lt;/code&gt; topic for every transaction, and a &lt;code&gt;user-activity&lt;/code&gt; topic for clicks, logins, and page views. Producers write &lt;em&gt;to&lt;/em&gt; topics; consumers read &lt;em&gt;from&lt;/em&gt; them. That's the core of how Kafka organizes data.&lt;/p&gt;

&lt;p&gt;If you're coming from a relational database background, the closest mental model is a table. An &lt;code&gt;orders&lt;/code&gt; topic is roughly analogous to an &lt;code&gt;orders&lt;/code&gt; table: it holds a collection of related records, each with the same general shape. But there are two important differences. First, you can only append to a Kafka topic. No UPDATE, no DELETE. Second, a table stores current state (what the order looks like right now), while a topic stores history (every event that ever happened to every order, in the order it happened). A more accurate comparison is an append-only audit log table, not a regular transactional one. If you've ever used event sourcing patterns or tracked changes in a &lt;code&gt;created_at&lt;/code&gt;/&lt;code&gt;updated_at&lt;/code&gt; log table, that's the right mental model.&lt;/p&gt;




&lt;h3&gt;
  
  
  Producers and Consumers
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;producer&lt;/strong&gt; is any application that sends events into Kafka. Your e-commerce backend produces an order event every time a customer checks out. It doesn't need to know who's going to use that event. It just writes it to the &lt;code&gt;orders&lt;/code&gt; topic and moves on.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;consumer&lt;/strong&gt; is any application that reads events from Kafka. Your fraud detection service, your analytics pipeline, and your notification system are all consumers. Here's the part that surprises most people coming from traditional message queues: Kafka doesn't delete an event after a consumer reads it. Each consumer tracks its own position in the log independently. One topic can serve ten different consumer systems simultaneously, and none of them interfere with each other.&lt;/p&gt;




&lt;h3&gt;
  
  
  Partitions
&lt;/h3&gt;

&lt;p&gt;Topics are split into &lt;strong&gt;partitions&lt;/strong&gt;. Each partition is an ordered, append-only log stored on disk on one of Kafka's servers. Splitting a topic across multiple partitions is what gives Kafka its scalability: different consumers can read different partitions in parallel, and partitions can be spread across multiple servers for throughput and fault tolerance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Topic: orders
  ├── Partition 0  →  [event][event][event]...
  ├── Partition 1  →  [event][event][event]...
  └── Partition 2  →  [event][event][event]...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a producer sends an event, Kafka uses a &lt;strong&gt;partition key&lt;/strong&gt; to decide which partition it lands in. If you use &lt;code&gt;user_id&lt;/code&gt; as the key, every event from the same user always goes to the same partition, which guarantees they'll be read in order. That's key-based partitioning, and it's what you'll use most of the time when the sequence of events for a specific entity matters.&lt;/p&gt;

&lt;p&gt;If you don't supply a key, Kafka distributes records across partitions evenly. That's fine when ordering doesn't matter, like application logs or metrics — but you lose the ordering guarantee entirely.&lt;/p&gt;

&lt;p&gt;Choosing the right partitioning strategy goes deeper than just picking a key, and the consequences of getting it wrong aren't always obvious until you're in production. That's worth a dedicated article on its own.&lt;/p&gt;




&lt;h3&gt;
  
  
  Offsets
&lt;/h3&gt;

&lt;p&gt;Every event inside a partition is assigned a sequential integer called an &lt;strong&gt;offset&lt;/strong&gt;, starting at zero and incrementing forever.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Partition 0:  [0][1][2][3][4][5]...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Consumers use offsets to track their progress. When a consumer crashes and restarts, it picks up from the last offset it committed. This is also what makes Kafka replayable: you can reset a consumer's offset back to zero and reprocess every event from the beginning. That's not something you get with traditional queues where messages disappear on consumption. It's one of Kafka's most useful features in practice. You can backfill a new table, reprocess events after fixing a bug, or onboard a new downstream system against months of historical data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Brokers, Clusters, and the Metadata Layer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Brokers and Clusters
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;broker&lt;/strong&gt; is a single Kafka server: one machine running the Kafka process, storing partitions on disk, and handling reads and writes. When your producer sends an event, it's connecting to a broker. When your consumer reads, it's reading from a broker.&lt;/p&gt;

&lt;p&gt;In production, you don't run one broker. You run a &lt;strong&gt;cluster&lt;/strong&gt;, which is just a group of brokers working together. More brokers means more storage, more throughput, and fault tolerance. When a topic has a replication factor of 3, Kafka stores each partition on 3 different brokers. One of those copies is the &lt;strong&gt;leader&lt;/strong&gt;, the broker that handles all reads and writes for that partition. The other two are &lt;strong&gt;followers&lt;/strong&gt; that stay in sync. If the leader broker goes down, Kafka promotes a follower automatically. Producers and consumers reconnect to the new leader, and the cluster keeps running. Done correctly, your consumers don't even notice.&lt;/p&gt;

&lt;p&gt;Replication factor isn't just about data safety, it's about availability. A replication factor of 1 means any single broker failure takes that partition offline completely.&lt;/p&gt;




&lt;h3&gt;
  
  
  ZooKeeper and KRaft: The Part Everyone Finds Confusing
&lt;/h3&gt;

&lt;p&gt;If you've read any Kafka tutorials from the last few years, you've probably seen &lt;strong&gt;Apache ZooKeeper&lt;/strong&gt; mentioned alongside Kafka. ZooKeeper's job was to manage cluster metadata: tracking which brokers were alive, which broker was the leader for each partition, and coordinating elections when a broker went down.&lt;/p&gt;

&lt;p&gt;It worked, but it was operationally awkward. Running Kafka in production meant also running, monitoring, and scaling a completely separate ZooKeeper cluster. Two systems to deploy, two systems to debug, two failure surfaces to worry about. If ZooKeeper had a problem, Kafka had a problem.&lt;/p&gt;

&lt;p&gt;Kafka 3.x introduced &lt;strong&gt;KRaft&lt;/strong&gt; (Kafka Raft) to replace it. Instead of depending on an external system, a subset of brokers act as controllers and manage cluster metadata internally using a built-in consensus protocol. One system, not two.&lt;/p&gt;

&lt;p&gt;The practical takeaway: if you're setting up Kafka today, use KRaft mode. It's the default. If you're following an older tutorial and there's a step to configure ZooKeeper separately, that tutorial is out of date. Kafka 4.0 dropped ZooKeeper support entirely, so this isn't a future transition you need to plan for. It's already done. The reason it still comes up is that a lot of teams are running older Kafka versions in production and haven't migrated yet, so you'll still see ZooKeeper in the wild. Now you know what it is and why it's being replaced.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Kafka Fits Into a Real Pipeline
&lt;/h2&gt;

&lt;p&gt;In production, Kafka typically sits between your source systems and everything downstream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[App / Services]
      │
      ▼
[Kafka: orders topic]
      │
      ├──▶ [Fraud Detection Service]  ──▶  [PostgreSQL: flagged orders]
      ├──▶ [Spark Streaming]  ──▶  [Snowflake]  ──▶  Analytics Dashboard
      └──▶ [Notification Service]  ──▶  [Email / SMS]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your application writes the order event once. Every downstream system consumes it independently, at its own pace, with its own offset. Adding a fifth consumer (like a new inventory service) means creating a new consumer group, not touching the producer or any existing consumers.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Practical Example
&lt;/h2&gt;

&lt;p&gt;You're building the data infrastructure for an e-commerce platform. Every user action generates an event. Your fraud detection service needs to score transactions in under a second. Your analytics pipeline needs to update a Snowflake dashboard every few minutes. Your notification service sends confirmation emails.&lt;/p&gt;

&lt;p&gt;Here's a minimal Python producer writing order events to Kafka:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaProducer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;value_serializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4821&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purchase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SKU-009&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;orders&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_4821&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a consumer for the fraud detection service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaConsumer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;orders&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;group_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fraud-detection-group&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;value_deserializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;score_transaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the fraud detection service goes down for 20 minutes, it resumes from its last committed offset when it restarts. The analytics pipeline and notification service never knew there was a problem. They have their own offsets, their own pace, their own position in the log.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Mistakes (And How to Avoid Them)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Treating Kafka Like a Queue
&lt;/h3&gt;

&lt;p&gt;This is the most common mental model mistake. In a traditional queue, a message is consumed and gone. Kafka doesn't work that way. Events stay in a partition for your configured retention period (often 7 days), regardless of how many consumers have read them. That means your consumers need to be &lt;strong&gt;idempotent&lt;/strong&gt;: processing the same event twice shouldn't produce incorrect results. The practical fix is simple: use a unique event ID or order ID when writing to your final database, so duplicate processing is harmless.&lt;/p&gt;




&lt;h3&gt;
  
  
  Picking Partition Count Arbitrarily
&lt;/h3&gt;

&lt;p&gt;Most tutorials say "set it to 3" and don't explain why. Here's what they don't tell you: you can increase partition count later, but you can't decrease it. And increasing it after you're already using partition keys breaks your key-to-partition mapping. An event from &lt;code&gt;user_id: 4821&lt;/code&gt; that used to land in partition 3 might now land in partition 7. Ordering guarantees break, and stateful consumers get incomplete data.&lt;/p&gt;

&lt;p&gt;Think about partition count before you create the topic. A reasonable starting point is 2–3× your initial consumer count, with room to grow. Write down why you chose it. Treat it like a schema decision; once it's set, it's very hard to undo.&lt;/p&gt;




&lt;h3&gt;
  
  
  Ignoring Consumer Lag
&lt;/h3&gt;

&lt;p&gt;Consumer lag is the gap between the latest offset in a partition and the offset your consumer has actually processed. A small, stable lag is normal. A lag that keeps growing is a silent production incident.&lt;/p&gt;

&lt;p&gt;If your fraud detection service is 200,000 events behind, it isn't doing real-time fraud detection. It's doing very delayed batch processing with extra steps. Set up monitoring and alerts on consumer lag before you go to production, not after something breaks.&lt;/p&gt;




&lt;h3&gt;
  
  
  Auto-Committing Offsets for Critical Pipelines
&lt;/h3&gt;

&lt;p&gt;Auto-commit sounds convenient. It commits your offset on a timer, which means it can commit &lt;em&gt;before&lt;/em&gt; your consumer has actually finished processing an event. If your service crashes in that window, the event is silently skipped. For payments, orders, or anything where losing an event matters, use manual offset commits and only commit after processing is confirmed complete.&lt;/p&gt;




&lt;h2&gt;
  
  
  Best Practices Worth Knowing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use meaningful topic names.&lt;/strong&gt; A topic called &lt;code&gt;events&lt;/code&gt; is going to cause you problems in six months. &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;payments&lt;/code&gt;, &lt;code&gt;user-activity&lt;/code&gt;. Specific names, clear ownership.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose partition keys deliberately.&lt;/strong&gt; If ordering matters by user, use &lt;code&gt;user_id&lt;/code&gt;. If it matters by account, use &lt;code&gt;account_id&lt;/code&gt;. Don't use random or null keys unless you genuinely don't care about ordering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set replication factor to 3 in production.&lt;/strong&gt; A replication factor of 1 means a single broker failure takes your data with it. In production, 3 is the baseline, not optional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use schemas for serious pipelines.&lt;/strong&gt; JSON is easy to start with and easy to break. If a producer team renames a field, every downstream consumer breaks silently. Avro with a Schema Registry enforces a contract between producers and consumers and lets schemas evolve safely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't use Kafka when you don't need it.&lt;/strong&gt; If your pipeline runs once a day and has one consumer, Airflow and a batch job are simpler and easier to maintain. Kafka earns its operational overhead when you need real-time throughput, multiple consumers, or replayability. Reach for it when those requirements are real, not just anticipated.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The mental shift Kafka requires isn't about learning a new API. It's about designing systems around events rather than state. Capture what happened once, and let every system that cares about it consume that record independently, at its own pace, with the ability to replay.&lt;/p&gt;

&lt;p&gt;The best way to make that concrete: spin up Kafka locally with Docker, create a topic with 3 partitions, write the producer and consumer from the example above, and then reset your consumer offset to zero and watch it replay every event from the beginning. That one experiment will do more than any amount of reading.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>dataengineering</category>
      <category>beginners</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Slowly Changing Dimensions Explained: How Data Warehouses Keep History Accurate</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Sun, 17 May 2026 07:11:45 +0000</pubDate>
      <link>https://dev.to/anthony-gicheru/slowly-changing-dimensions-explained-how-data-warehouses-keep-history-accurate-2mim</link>
      <guid>https://dev.to/anthony-gicheru/slowly-changing-dimensions-explained-how-data-warehouses-keep-history-accurate-2mim</guid>
      <description>&lt;h2&gt;
  
  
  1. Why Slowly Changing Dimensions Matter
&lt;/h2&gt;

&lt;p&gt;In data engineering, not all data changes the same way.&lt;/p&gt;

&lt;p&gt;Some data changes constantly, like transactions, clicks, payments, and sensor readings. These are usually facts: events that happen at a specific point in time.&lt;/p&gt;

&lt;p&gt;But other data changes slowly.&lt;/p&gt;

&lt;p&gt;A customer changes their address.&lt;br&gt;
A product changes category.&lt;br&gt;
An employee moves to a new department.&lt;br&gt;
A supplier changes region.&lt;br&gt;
A user upgrades from a free plan to a premium plan.&lt;/p&gt;

&lt;p&gt;These changes do not happen every second, but when they happen, they matter a lot.&lt;/p&gt;

&lt;p&gt;Imagine you are building a sales report. A customer originally lived in Nairobi, then moved to Mombasa. If you simply update the customer record, all their historical sales may suddenly appear as if they happened in Mombasa.&lt;/p&gt;

&lt;p&gt;That is a problem.&lt;/p&gt;

&lt;p&gt;The business may ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How much revenue did we make from Nairobi customers last year?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But if you overwrote the customer’s location, your report may give the wrong answer.&lt;/p&gt;

&lt;p&gt;This is the exact problem Slowly Changing Dimensions solve.&lt;/p&gt;

&lt;p&gt;Slowly Changing Dimensions help data teams manage changes in descriptive data over time while keeping analytics accurate.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. What Is a Slowly Changing Dimension?
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Slowly Changing Dimension&lt;/strong&gt;, often shortened to &lt;strong&gt;SCD&lt;/strong&gt;, is a technique used in data warehousing to manage changes in dimension tables over time.&lt;/p&gt;

&lt;p&gt;A dimension table stores descriptive information.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;customer_type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;Regular&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is not a transaction. It describes the customer.&lt;/p&gt;

&lt;p&gt;Now imagine Mary moves from Nairobi to Kisumu.&lt;/p&gt;

&lt;p&gt;The question becomes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Should we overwrite Nairobi with Kisumu, or should we keep a history of both?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That decision is what SCD is all about.&lt;/p&gt;

&lt;p&gt;This is where Slowly Changing Dimensions become useful.&lt;/p&gt;

&lt;p&gt;They give data teams a structured way to decide how changes should be stored.&lt;/p&gt;

&lt;p&gt;Sometimes we only care about the latest value.&lt;/p&gt;

&lt;p&gt;Sometimes we want to preserve the original value.&lt;/p&gt;

&lt;p&gt;Sometimes we need the full history of every meaningful change.&lt;/p&gt;

&lt;p&gt;And sometimes we only need a simple previous-and-current comparison.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. How Slowly Changing Dimensions Work
&lt;/h2&gt;

&lt;p&gt;In a data warehouse, data is usually organized into &lt;strong&gt;fact tables&lt;/strong&gt; and &lt;strong&gt;dimension tables&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Fact Tables
&lt;/h3&gt;

&lt;p&gt;Fact tables store business events.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sales&lt;/li&gt;
&lt;li&gt;Orders&lt;/li&gt;
&lt;li&gt;Payments&lt;/li&gt;
&lt;li&gt;Website clicks&lt;/li&gt;
&lt;li&gt;Deliveries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A sales fact table might look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sale_id&lt;/th&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;product_key&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;th&gt;sale_date&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5001&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;2025-01-10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Dimension Tables
&lt;/h3&gt;

&lt;p&gt;Dimension tables describe the facts.&lt;/p&gt;

&lt;p&gt;A customer dimension might look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;customer_type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;Regular&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The fact table tells us &lt;strong&gt;what happened&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The dimension table tells us &lt;strong&gt;who, what, where, or how&lt;/strong&gt; it happened.&lt;/p&gt;

&lt;p&gt;The challenge is that dimension data changes.&lt;/p&gt;

&lt;p&gt;When Mary moves from Nairobi to Kisumu, we need to decide how to store that change.&lt;/p&gt;

&lt;p&gt;There are different SCD types, but the most commonly used are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SCD Type 0&lt;/li&gt;
&lt;li&gt;SCD Type 1&lt;/li&gt;
&lt;li&gt;SCD Type 2&lt;/li&gt;
&lt;li&gt;SCD Type 3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s go through them practically.&lt;/p&gt;


&lt;h2&gt;
  
  
  4. SCD Type 0: Keep the Original Value
&lt;/h2&gt;

&lt;p&gt;SCD Type 0 means a value does not change in the data warehouse.&lt;/p&gt;

&lt;p&gt;Once the value is loaded, it stays the same, even if the source system changes later.&lt;/p&gt;

&lt;p&gt;In simple terms, Type 0 says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Keep the original value as it was first recorded.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In real data warehouse work, Type 0 appears often for fields that should represent the original state of something. But many teams do not always call it “SCD Type 0” explicitly.&lt;/p&gt;

&lt;p&gt;They may simply say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“This field should never be updated.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Preserve the original value.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So conceptually, Type 0 is common. The name “Type 0” is just less commonly emphasized.&lt;/p&gt;

&lt;p&gt;Good examples of Type 0 fields are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Original signup date&lt;/li&gt;
&lt;li&gt;First purchase date&lt;/li&gt;
&lt;li&gt;Original registration country&lt;/li&gt;
&lt;li&gt;Original acquisition channel&lt;/li&gt;
&lt;li&gt;Original product launch date&lt;/li&gt;
&lt;li&gt;Original employee hire date&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, imagine Mary Wanjiku first registered as a customer while living in Kenya through an Instagram campaign.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;original_signup_date&lt;/th&gt;
&lt;th&gt;original_country&lt;/th&gt;
&lt;th&gt;acquisition_channel&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;Kenya&lt;/td&gt;
&lt;td&gt;Instagram Ads&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Later, Mary may move cities, upgrade her customer type, or start coming through email campaigns.&lt;/p&gt;

&lt;p&gt;But the original acquisition channel should still remain &lt;strong&gt;Instagram Ads&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because it tells the business how Mary was first acquired.&lt;/p&gt;

&lt;p&gt;If we overwrite that value, we lose the ability to answer questions like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Which marketing channel originally brought us our best customers?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is where Type 0 is useful.&lt;/p&gt;

&lt;p&gt;It protects values that describe the original state of a record.&lt;/p&gt;


&lt;h2&gt;
  
  
  5. SCD Type 1: Overwrite the Old Value
&lt;/h2&gt;

&lt;p&gt;SCD Type 1 is the simplest approach.&lt;/p&gt;

&lt;p&gt;When a value changes, you overwrite the old value with the new one.&lt;/p&gt;

&lt;p&gt;Before:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After Mary moves:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Kisumu&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The old city is gone.&lt;/p&gt;
&lt;h3&gt;
  
  
  When Type 1 Makes Sense
&lt;/h3&gt;

&lt;p&gt;SCD Type 1 is useful when history does not matter.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fixing a spelling mistake&lt;/li&gt;
&lt;li&gt;Correcting wrong data&lt;/li&gt;
&lt;li&gt;Updating an email address&lt;/li&gt;
&lt;li&gt;Updating a phone number&lt;/li&gt;
&lt;li&gt;Correcting a product name typo&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the original value was wrong, you usually do not want to preserve it.&lt;/p&gt;
&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Imagine a customer’s name was loaded as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mary Wanjikuuu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then later corrected to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mary Wanjiku
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You do not need historical tracking for the typo. You just update the record.&lt;/p&gt;

&lt;p&gt;That is SCD Type 1.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Risk
&lt;/h3&gt;

&lt;p&gt;The risk with Type 1 is that it destroys history.&lt;/p&gt;

&lt;p&gt;If city changes from Nairobi to Kisumu, all past reports will now treat Mary as a Kisumu customer, even if she lived in Nairobi when the sales happened.&lt;/p&gt;

&lt;p&gt;So Type 1 is simple, but dangerous when historical accuracy matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. SCD Type 2: Keep Full History
&lt;/h2&gt;

&lt;p&gt;SCD Type 2 is the most important and most commonly used SCD technique in analytics.&lt;/p&gt;

&lt;p&gt;Instead of overwriting the old record, you create a new row when important attributes change.&lt;/p&gt;

&lt;p&gt;Before Mary moves:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;start_date&lt;/th&gt;
&lt;th&gt;end_date&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;2024-01-01&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After Mary moves to Kisumu:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;start_date&lt;/th&gt;
&lt;th&gt;end_date&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;2024-01-01&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Kisumu&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice something important.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;customer_id&lt;/code&gt; stays the same because it represents the real-world customer.&lt;/p&gt;

&lt;p&gt;But the &lt;code&gt;customer_key&lt;/code&gt; changes because each historical version gets its own unique warehouse key.&lt;/p&gt;

&lt;p&gt;This is usually called a &lt;strong&gt;surrogate key&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;Now, if Mary made a purchase while living in Nairobi, the fact table can point to the Nairobi version of her customer record.&lt;/p&gt;

&lt;p&gt;If she made another purchase after moving to Kisumu, that sale can point to the Kisumu version.&lt;/p&gt;

&lt;p&gt;This allows historical reports to stay accurate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example Fact Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sale_id&lt;/th&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;th&gt;sale_date&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5001&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;2025-02-10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5002&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4500&lt;/td&gt;
&lt;td&gt;2025-04-20&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first sale belongs to Mary when she was in Nairobi.&lt;/p&gt;

&lt;p&gt;The second sale belongs to Mary when she was in Kisumu.&lt;/p&gt;

&lt;p&gt;That is the power of SCD Type 2.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. SCD Type 3: Store Limited History in Columns
&lt;/h2&gt;

&lt;p&gt;SCD Type 3 keeps limited history by adding extra columns.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;current_city&lt;/th&gt;
&lt;th&gt;previous_city&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Kisumu&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This lets you see the current value and one previous value.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Type 3 Makes Sense
&lt;/h3&gt;

&lt;p&gt;SCD Type 3 is useful when you only care about a small amount of history.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Previous region and current region&lt;/li&gt;
&lt;li&gt;Previous plan and current plan&lt;/li&gt;
&lt;li&gt;Previous department and current department&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it does not scale well if changes happen many times.&lt;/p&gt;

&lt;p&gt;What happens if Mary moves from Nairobi to Kisumu, then Nakuru, then Eldoret?&lt;/p&gt;

&lt;p&gt;You would need more columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;previous_city_1
previous_city_2
previous_city_3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That becomes messy quickly.&lt;/p&gt;

&lt;p&gt;So Type 3 is useful, but only for very specific cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Practical Example in a Data Warehousing Project
&lt;/h2&gt;

&lt;p&gt;Let’s say you are building a sales analytics warehouse for an e-commerce company.&lt;/p&gt;

&lt;p&gt;You have data coming from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL for application data&lt;/li&gt;
&lt;li&gt;Kafka for order events&lt;/li&gt;
&lt;li&gt;Airflow for orchestration&lt;/li&gt;
&lt;li&gt;dbt for transformations&lt;/li&gt;
&lt;li&gt;Snowflake, BigQuery, Redshift, or PostgreSQL as the warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your source customer table in PostgreSQL looks like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;customer_type&lt;/th&gt;
&lt;th&gt;acquisition_channel&lt;/th&gt;
&lt;th&gt;updated_at&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;Regular&lt;/td&gt;
&lt;td&gt;Instagram Ads&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Later, the same customer changes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;customer_type&lt;/th&gt;
&lt;th&gt;acquisition_channel&lt;/th&gt;
&lt;th&gt;updated_at&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Kisumu&lt;/td&gt;
&lt;td&gt;Premium&lt;/td&gt;
&lt;td&gt;Instagram Ads&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now the data team must decide:&lt;/p&gt;

&lt;p&gt;Do we overwrite the old record?&lt;/p&gt;

&lt;p&gt;Or do we preserve the old version?&lt;/p&gt;

&lt;p&gt;And what do we do with the original acquisition channel?&lt;/p&gt;

&lt;p&gt;In this example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;acquisition_channel&lt;/code&gt; can be treated as Type 0 because it represents how Mary was originally acquired.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;city&lt;/code&gt; and &lt;code&gt;customer_type&lt;/code&gt; can be treated as Type 2 because they affect historical reporting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For analytics, this is why we often combine different SCD behaviors in the same dimension table. Some fields preserve the original value, while others keep full history.&lt;/p&gt;

&lt;p&gt;A Type 2 customer dimension may look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;customer_type&lt;/th&gt;
&lt;th&gt;acquisition_channel&lt;/th&gt;
&lt;th&gt;valid_from&lt;/th&gt;
&lt;th&gt;valid_to&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;Regular&lt;/td&gt;
&lt;td&gt;Instagram Ads&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary Wanjiku&lt;/td&gt;
&lt;td&gt;Kisumu&lt;/td&gt;
&lt;td&gt;Premium&lt;/td&gt;
&lt;td&gt;Instagram Ads&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now your reports can answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many sales came from Nairobi customers in February?&lt;/li&gt;
&lt;li&gt;How much revenue came from Premium customers after March?&lt;/li&gt;
&lt;li&gt;What was the customer type at the time of purchase?&lt;/li&gt;
&lt;li&gt;How many customers upgraded from Regular to Premium?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without SCD Type 2, these questions become difficult or inaccurate.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. A Simple SCD Type 2 Flow
&lt;/h2&gt;

&lt;p&gt;A typical SCD Type 2 pipeline works like this:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Load the Latest Source Data
&lt;/h3&gt;

&lt;p&gt;You extract the latest customer data from the source system.&lt;/p&gt;

&lt;p&gt;This could come from PostgreSQL, an API, a CSV file, or CDC events from Kafka.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Compare Source Data With Current Dimension Records
&lt;/h3&gt;

&lt;p&gt;You compare the incoming record with the current active record in the warehouse.&lt;/p&gt;

&lt;p&gt;For example, compare:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source.city
source.customer_type
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;against:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dim_customer.city
dim_customer.customer_type
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Detect Changes
&lt;/h3&gt;

&lt;p&gt;If nothing changed, do nothing.&lt;/p&gt;

&lt;p&gt;If important attributes changed, expire the old record.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;valid_to&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 4: Insert a New Current Record
&lt;/h3&gt;

&lt;p&gt;Then insert the new version:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;valid_from&lt;/th&gt;
&lt;th&gt;valid_to&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Kisumu&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 5: Use the Correct Dimension Version in Fact Tables
&lt;/h3&gt;

&lt;p&gt;When loading fact data, join the fact date to the correct dimension record using the validity period.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;staging_orders&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
   &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_from&lt;/span&gt;
   &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_to&lt;/span&gt;
        &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_to&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
   &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures the order connects to the correct version of the customer.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Common Mistakes Beginners Make
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mistake 1: Using Type 1 When History Matters
&lt;/h3&gt;

&lt;p&gt;This is probably the most common mistake.&lt;/p&gt;

&lt;p&gt;Beginners often overwrite dimension records because it feels simple.&lt;/p&gt;

&lt;p&gt;But later, when the business asks historical questions, the warehouse cannot answer correctly.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What was revenue by customer region last year?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you overwrote all customer regions with the latest value, the report will be wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Avoid It
&lt;/h3&gt;

&lt;p&gt;Before choosing Type 1, ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Will the business ever need to know what this value was in the past?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If yes, consider Type 2.&lt;/p&gt;




&lt;h3&gt;
  
  
  Mistake 2: Tracking Every Column as Type 2
&lt;/h3&gt;

&lt;p&gt;Not every change deserves a new historical version.&lt;/p&gt;

&lt;p&gt;For example, do you really need a new customer dimension row when the phone number changes?&lt;/p&gt;

&lt;p&gt;Maybe not.&lt;/p&gt;

&lt;p&gt;If you track every small change, your dimension table can grow unnecessarily large and become harder to manage.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Avoid It
&lt;/h3&gt;

&lt;p&gt;Classify columns carefully.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;SCD Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;customer_name typo fix&lt;/td&gt;
&lt;td&gt;Type 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;city&lt;/td&gt;
&lt;td&gt;Type 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;customer_type&lt;/td&gt;
&lt;td&gt;Type 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;phone_number&lt;/td&gt;
&lt;td&gt;Type 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;email&lt;/td&gt;
&lt;td&gt;Type 1 or Type 2 depending on business need&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The decision should be based on reporting needs, not just technical preference.&lt;/p&gt;




&lt;h3&gt;
  
  
  Mistake 3: Not Using a Surrogate Key
&lt;/h3&gt;

&lt;p&gt;A big mistake is using the source system ID as the primary key for the dimension table.&lt;/p&gt;

&lt;p&gt;For example, using &lt;code&gt;customer_id&lt;/code&gt; as the only key.&lt;/p&gt;

&lt;p&gt;That becomes a problem in Type 2 because the same customer can have multiple historical versions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Better Approach
&lt;/h3&gt;

&lt;p&gt;Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;customer_id&lt;/code&gt; as the business key&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;customer_key&lt;/code&gt; as the warehouse surrogate key&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Kisumu&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The surrogate key uniquely identifies each version.&lt;/p&gt;




&lt;h3&gt;
  
  
  Mistake 4: Forgetting the Current Flag
&lt;/h3&gt;

&lt;p&gt;Without an &lt;code&gt;is_current&lt;/code&gt; column, it becomes harder to query the latest version of each record.&lt;/p&gt;

&lt;p&gt;You would have to check for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;valid_to&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That works, but &lt;code&gt;is_current&lt;/code&gt; makes queries easier and clearer.&lt;/p&gt;

&lt;p&gt;A good SCD Type 2 table usually has:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;valid_from
valid_to
is_current
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some teams also add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;created_at
updated_at
record_hash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Mistake 5: Poor Date Handling
&lt;/h3&gt;

&lt;p&gt;SCD Type 2 depends heavily on dates.&lt;/p&gt;

&lt;p&gt;If your &lt;code&gt;valid_from&lt;/code&gt; and &lt;code&gt;valid_to&lt;/code&gt; values are wrong, your historical joins will be wrong.&lt;/p&gt;

&lt;p&gt;Common problems include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overlapping date ranges&lt;/li&gt;
&lt;li&gt;Gaps between versions&lt;/li&gt;
&lt;li&gt;Incorrect timezone handling&lt;/li&gt;
&lt;li&gt;Using load date instead of actual business effective date&lt;/li&gt;
&lt;li&gt;Not handling late-arriving data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to Avoid It
&lt;/h3&gt;

&lt;p&gt;Be very intentional about what your dates mean.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;valid_from&lt;/code&gt;: when this version became valid&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;valid_to&lt;/code&gt;: when this version stopped being valid&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;loaded_at&lt;/code&gt;: when the data entered the warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not always the same thing.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Best Practices for SCD
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use SCD Type 2 for Business-Critical History
&lt;/h3&gt;

&lt;p&gt;If a change affects reporting, segmentation, revenue analysis, or compliance, preserve history.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer region&lt;/li&gt;
&lt;li&gt;Customer plan&lt;/li&gt;
&lt;li&gt;Product category&lt;/li&gt;
&lt;li&gt;Sales territory&lt;/li&gt;
&lt;li&gt;Employee department&lt;/li&gt;
&lt;li&gt;Account status&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are usually worth tracking with Type 2.&lt;/p&gt;




&lt;h3&gt;
  
  
  Use Hashing to Detect Changes
&lt;/h3&gt;

&lt;p&gt;Instead of comparing many columns one by one, you can create a hash from the important attributes.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;MD5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONCAT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then compare the source hash with the current dimension hash.&lt;/p&gt;

&lt;p&gt;If the hash changes, something important changed.&lt;/p&gt;

&lt;p&gt;This makes SCD pipelines easier to maintain, especially when there are many columns.&lt;/p&gt;




&lt;h3&gt;
  
  
  Keep Your SCD Logic Clear
&lt;/h3&gt;

&lt;p&gt;SCD logic can become confusing quickly.&lt;/p&gt;

&lt;p&gt;Use clear column names like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;valid_from
valid_to
is_current
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Avoid unclear names like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;start
end
flag
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your future self and your teammates will thank you.&lt;/p&gt;




&lt;h3&gt;
  
  
  Document Which Columns Are Tracked
&lt;/h3&gt;

&lt;p&gt;Do not leave SCD behavior hidden inside SQL code only.&lt;/p&gt;

&lt;p&gt;Document which columns are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Type 0&lt;/li&gt;
&lt;li&gt;Type 1&lt;/li&gt;
&lt;li&gt;Type 2&lt;/li&gt;
&lt;li&gt;Type 3&lt;/li&gt;
&lt;li&gt;Ignored&lt;/li&gt;
&lt;li&gt;Used for change detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is especially important in team environments.&lt;/p&gt;




&lt;h3&gt;
  
  
  Avoid Duplicates in Current Records
&lt;/h3&gt;

&lt;p&gt;For a Type 2 dimension, each business key should have only one current record.&lt;/p&gt;

&lt;p&gt;This should be true:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;One customer_id = one current record
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can test this with SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;current_records&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this query returns rows, your dimension has a problem.&lt;/p&gt;




&lt;h3&gt;
  
  
  Test for Overlapping Validity Periods
&lt;/h3&gt;

&lt;p&gt;For each business key, the date ranges should not overlap.&lt;/p&gt;

&lt;p&gt;Bad example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;valid_from&lt;/th&gt;
&lt;th&gt;valid_to&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Nairobi&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;2025-04-01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Kisumu&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These overlap between March 15 and April 1.&lt;/p&gt;

&lt;p&gt;That means a sale on March 20 could match both records.&lt;/p&gt;

&lt;p&gt;That is dangerous.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. When to Use Slowly Changing Dimensions
&lt;/h2&gt;

&lt;p&gt;Use SCD when dimension data changes and those changes affect analysis.&lt;/p&gt;

&lt;p&gt;Good use cases include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer address history&lt;/li&gt;
&lt;li&gt;Product category history&lt;/li&gt;
&lt;li&gt;Subscription plan changes&lt;/li&gt;
&lt;li&gt;Employee department changes&lt;/li&gt;
&lt;li&gt;Supplier region changes&lt;/li&gt;
&lt;li&gt;Account status changes&lt;/li&gt;
&lt;li&gt;Sales territory changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SCD is especially useful in data warehouses and analytics systems where historical accuracy matters.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Show revenue by the customer’s region at the time of purchase.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a classic SCD Type 2 problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. When SCD May Not Be the Best Choice
&lt;/h2&gt;

&lt;p&gt;SCD is powerful, but not every situation needs it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do Not Use Type 2 for Every Small Change
&lt;/h3&gt;

&lt;p&gt;If a field changes often and does not matter historically, Type 2 may create unnecessary complexity.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Last login timestamp&lt;/li&gt;
&lt;li&gt;Profile picture URL&lt;/li&gt;
&lt;li&gt;Phone number&lt;/li&gt;
&lt;li&gt;Minor spelling corrections&lt;/li&gt;
&lt;li&gt;Temporary status fields&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these, Type 1 may be enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  Be Careful With Very High-Volume Changes
&lt;/h3&gt;

&lt;p&gt;If a dimension changes too frequently, Type 2 can grow very fast.&lt;/p&gt;

&lt;p&gt;At that point, you may need a different modeling approach, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event sourcing&lt;/li&gt;
&lt;li&gt;Audit tables&lt;/li&gt;
&lt;li&gt;Snapshot tables&lt;/li&gt;
&lt;li&gt;Data lake versioning&lt;/li&gt;
&lt;li&gt;Change Data Capture history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SCD is best for slowly changing descriptive attributes, not every event that happens in the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  14. SCD in Modern Data Tools
&lt;/h2&gt;

&lt;p&gt;SCD is not limited to traditional warehouses.&lt;/p&gt;

&lt;p&gt;You can implement SCD patterns in many modern data stacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  In dbt
&lt;/h3&gt;

&lt;p&gt;dbt supports snapshots, which are commonly used to implement SCD Type 2.&lt;/p&gt;

&lt;p&gt;A dbt snapshot can track when records change and automatically create historical versions.&lt;/p&gt;

&lt;h3&gt;
  
  
  In Airflow
&lt;/h3&gt;

&lt;p&gt;Airflow can orchestrate SCD pipelines by scheduling extraction, staging, comparison, and dimension loading tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  In Spark
&lt;/h3&gt;

&lt;p&gt;Spark is useful when you are handling large-scale dimension updates.&lt;/p&gt;

&lt;p&gt;You can compare source and target datasets, detect changes, and write updated records to a lakehouse or warehouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  In Kafka and CDC
&lt;/h3&gt;

&lt;p&gt;Kafka can stream changes from source systems.&lt;/p&gt;

&lt;p&gt;For example, using CDC tools, you can capture customer updates from PostgreSQL and send them into Kafka.&lt;/p&gt;

&lt;p&gt;From there, you can process those changes and update your dimension tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  In Warehouses
&lt;/h3&gt;

&lt;p&gt;SCD can be implemented in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake&lt;/li&gt;
&lt;li&gt;BigQuery&lt;/li&gt;
&lt;li&gt;Redshift&lt;/li&gt;
&lt;li&gt;PostgreSQL&lt;/li&gt;
&lt;li&gt;Databricks&lt;/li&gt;
&lt;li&gt;SQL Server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The concept stays the same, even if the syntax differs.&lt;/p&gt;




&lt;h2&gt;
  
  
  15. Final Summary
&lt;/h2&gt;

&lt;p&gt;Slowly Changing Dimensions help data engineers manage changes in dimension data over time.&lt;/p&gt;

&lt;p&gt;They solve an important problem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do we keep analytics accurate when descriptive data changes?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The most common SCD types are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Type 0&lt;/td&gt;
&lt;td&gt;Keep the original value unchanged&lt;/td&gt;
&lt;td&gt;Original values that should not change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type 1&lt;/td&gt;
&lt;td&gt;Overwrite old values&lt;/td&gt;
&lt;td&gt;Corrections and non-historical changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type 2&lt;/td&gt;
&lt;td&gt;Keep full history using new rows&lt;/td&gt;
&lt;td&gt;Historical reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type 3&lt;/td&gt;
&lt;td&gt;Keep limited history in columns&lt;/td&gt;
&lt;td&gt;Simple previous/current comparisons&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In real data engineering projects, SCD Type 2 is especially important because it allows the warehouse to answer historical questions correctly.&lt;/p&gt;

&lt;p&gt;Without SCD, reports can quietly become wrong.&lt;/p&gt;

&lt;p&gt;A customer’s current city may overwrite their past city.&lt;br&gt;
A product’s new category may rewrite old sales history.&lt;br&gt;
An employee’s new department may change old performance reports.&lt;/p&gt;

&lt;p&gt;That is why SCD matters.&lt;/p&gt;

&lt;p&gt;The practical takeaway is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Whenever a dimension value changes, ask whether the business needs to remember the old value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the value should never change, Type 0 may be the right choice.&lt;/p&gt;

&lt;p&gt;If the old value does not matter, Type 1 may be enough.&lt;/p&gt;

&lt;p&gt;If the business needs full history, Type 2 is usually the best choice.&lt;/p&gt;

&lt;p&gt;If the business only needs a simple previous-and-current comparison, Type 3 may work.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>datawarehouse</category>
      <category>dataengineering</category>
      <category>scd</category>
    </item>
    <item>
      <title>Why Your Code Breaks in Production (and How Docker Fixes It)</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Tue, 12 May 2026 05:28:52 +0000</pubDate>
      <link>https://dev.to/anthony-gicheru/why-your-code-breaks-in-production-and-how-docker-fixes-it-3meo</link>
      <guid>https://dev.to/anthony-gicheru/why-your-code-breaks-in-production-and-how-docker-fixes-it-3meo</guid>
      <description>&lt;h2&gt;
  
  
  1. Why This Matters
&lt;/h2&gt;

&lt;p&gt;You write your code.&lt;br&gt;
You test it locally.&lt;br&gt;
Everything works perfectly.&lt;/p&gt;

&lt;p&gt;Then it goes to production… and breaks.&lt;/p&gt;

&lt;p&gt;You spend hours debugging, only to realize:&lt;br&gt;
&lt;strong&gt;nothing is wrong with your code — the environment is the problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In data engineering, this happens all the time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Spark job runs locally but fails in production&lt;/li&gt;
&lt;li&gt;Airflow works on Ubuntu but breaks on macOS&lt;/li&gt;
&lt;li&gt;Kafka pipelines behave differently across environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At its core, the issue is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Your environment is not consistent.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Containerization solves this&lt;/strong&gt; by packaging everything your application needs into a single, portable unit that runs the same way anywhere.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. Core Concept — What is Containerization?
&lt;/h2&gt;

&lt;p&gt;Let’s simplify it with an analogy.&lt;/p&gt;
&lt;h2&gt;
  
  
  Analogy: A Fully Equipped House
&lt;/h2&gt;

&lt;p&gt;Imagine being placed in an empty field with nothing around you.&lt;/p&gt;

&lt;p&gt;No food.&lt;br&gt;
No water.&lt;br&gt;
No electricity.&lt;br&gt;
No shelter.&lt;/p&gt;

&lt;p&gt;You might survive for a while, but functioning properly would be difficult.&lt;/p&gt;

&lt;p&gt;Now imagine being placed inside a fully equipped house.&lt;/p&gt;

&lt;p&gt;Everything you need is already there:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;food&lt;/li&gt;
&lt;li&gt;water&lt;/li&gt;
&lt;li&gt;electricity&lt;/li&gt;
&lt;li&gt;furniture&lt;/li&gt;
&lt;li&gt;internet&lt;/li&gt;
&lt;li&gt;a bed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No matter where that house is moved, you can still live comfortably because your essentials move with you.&lt;/p&gt;

&lt;p&gt;Applications work the same way.&lt;/p&gt;

&lt;p&gt;An application needs certain things to function:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;libraries&lt;/li&gt;
&lt;li&gt;runtime versions&lt;/li&gt;
&lt;li&gt;system tools&lt;/li&gt;
&lt;li&gt;environment variables&lt;/li&gt;
&lt;li&gt;dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without them, the application breaks.&lt;/p&gt;

&lt;p&gt;Containerization solves this problem by packaging the application together with everything it needs to run.&lt;/p&gt;

&lt;p&gt;Think of a container as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a fully equipped house for your application.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryt9amwrext4s0y67ptq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryt9amwrext4s0y67ptq.png" alt="Diagram comparing a Docker container to a fully equipped house containing everything an application needs to run consistently across different environments." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Inside the container, the app already has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;its dependencies&lt;/li&gt;
&lt;li&gt;configurations&lt;/li&gt;
&lt;li&gt;runtime environment&lt;/li&gt;
&lt;li&gt;required tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So whether the container runs on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;your laptop&lt;/li&gt;
&lt;li&gt;a cloud server&lt;/li&gt;
&lt;li&gt;a teammate’s machine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…the application still behaves the same way.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Mental Model
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Containerization gives your application its own portable environment with everything it needs to survive and run consistently.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  3. Docker Basics
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Key Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Image&lt;/strong&gt; - A blueprint/template&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container&lt;/strong&gt; - A running instance of that image&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dockerfile&lt;/strong&gt; - Instructions to build the image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frusvrsj0vtnh3eey8x6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frusvrsj0vtnh3eey8x6m.png" alt="A clean modern DevOps-style diagram showing the relationship between a Dockerfile, a Docker Image, and a running Docker Container." width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Let’s Make It Real
&lt;/h3&gt;

&lt;p&gt;Here’s the smallest possible Docker setup for a Python app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;app.py&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello from Docker!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dockerfile&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.10-slim&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; app.py .&lt;/span&gt;

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["python", "app.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Build and Run
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; my-python-app &lt;span class="nb"&gt;.&lt;/span&gt;
docker run my-python-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Notice what we didn’t do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install Python manually&lt;/li&gt;
&lt;li&gt;Manage versions&lt;/li&gt;
&lt;li&gt;Configure anything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The environment is fully defined in the Dockerfile.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Why Docker is Useful in Data Engineering
&lt;/h2&gt;

&lt;p&gt;In real-world data systems, you work with tools like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Airflow&lt;/li&gt;
&lt;li&gt;Spark / PySpark&lt;/li&gt;
&lt;li&gt;PostgreSQL or another data warehouse&lt;/li&gt;
&lt;li&gt;Reporting tools or dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different dependencies&lt;/li&gt;
&lt;li&gt;Different configurations&lt;/li&gt;
&lt;li&gt;Different runtime requirements&lt;/li&gt;
&lt;li&gt;Different ports&lt;/li&gt;
&lt;li&gt;Different environment variables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without Docker, they often conflict.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Airflow may require specific Python packages&lt;/li&gt;
&lt;li&gt;PySpark may need Java and Spark installed&lt;/li&gt;
&lt;li&gt;PostgreSQL may need database credentials and storage&lt;/li&gt;
&lt;li&gt;Dashboard tools may need access to the processed data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Docker:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;each tool runs in its own isolated environment — no conflicts, no surprises.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is especially useful in batch data pipelines because the entire workflow can be reproduced across different machines and environments.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Docker Compose — Managing Multiple Containers
&lt;/h2&gt;

&lt;p&gt;Real systems are never just one container.&lt;/p&gt;

&lt;p&gt;A Dockerized data engineering pipeline may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An Airflow webserver&lt;/li&gt;
&lt;li&gt;An Airflow scheduler&lt;/li&gt;
&lt;li&gt;A PostgreSQL database&lt;/li&gt;
&lt;li&gt;A Spark / PySpark processing service&lt;/li&gt;
&lt;li&gt;Shared folders for DAGs, logs, scripts, and data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running each service manually quickly becomes painful.&lt;/p&gt;




&lt;h3&gt;
  
  
  Docker vs Docker Compose
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; - runs one container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Compose&lt;/strong&gt; - runs an entire system made up of multiple containers&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  The Key Insight
&lt;/h3&gt;

&lt;p&gt;Without Docker Compose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple terminals&lt;/li&gt;
&lt;li&gt;Manual startup order&lt;/li&gt;
&lt;li&gt;Constant configuration issues&lt;/li&gt;
&lt;li&gt;Harder networking between services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Docker Compose:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;one command starts everything.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkm87ktdgl98ec2vm9yf9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkm87ktdgl98ec2vm9yf9.png" alt="Docker Compose orchestration diagram showing multiple services being started and managed from one docker-compose.yml file." width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Example: Multi-Service Setup
&lt;/h3&gt;

&lt;p&gt;A simplified Docker Compose setup for a batch pipeline may include Airflow and PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;docker-compose.yml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;airflow-webserver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apache/airflow:3.2.1&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow_webserver&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow webserver&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080:8080"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;AIRFLOW__CORE__EXECUTOR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LocalExecutor&lt;/span&gt;
      &lt;span class="na"&gt;AIRFLOW__DATABASE__SQL_ALCHEMY_CONN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql+psycopg2://airflow:airflow@postgres:5432/airflow&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./dags:/opt/airflow/dags&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./logs:/opt/airflow/logs&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./jobs:/opt/airflow/jobs&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;

  &lt;span class="na"&gt;airflow-scheduler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apache/airflow:3.2.1&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow_scheduler&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow scheduler&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;AIRFLOW__CORE__EXECUTOR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LocalExecutor&lt;/span&gt;
      &lt;span class="na"&gt;AIRFLOW__DATABASE__SQL_ALCHEMY_CONN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql+psycopg2://airflow:airflow@postgres:5432/airflow&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./dags:/opt/airflow/dags&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./logs:/opt/airflow/logs&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./jobs:/opt/airflow/jobs&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;

  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:16&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres_db&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5433:5432"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres_data:/var/lib/postgresql/data&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  8. Common Mistakes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Using &lt;code&gt;localhost&lt;/code&gt; inside containers&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This breaks almost everyone at first.&lt;/p&gt;

&lt;p&gt;Inside a container:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;localhost&lt;/code&gt; refers to the container itself, not your machine.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Forgetting environment variables&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Missing configs often cause silent failures.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Not persisting data&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Containers are temporary. Without volumes, your data disappears.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres_data:/var/lib/postgresql/data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Rebuilding unnecessarily&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Poor Dockerfile structure can slow builds significantly.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Best Practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use lightweight images&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;  FROM python:3.10-slim
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add a &lt;code&gt;.dockerignore&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  node_modules
  .git
  .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Avoid &lt;code&gt;latest&lt;/code&gt; in production&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use fixed versions to keep builds predictable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Separate dev and production setups&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They have different requirements.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use Docker Compose for local development&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It helps simulate real systems easily.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use clear service names&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kafka&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;postgres&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;airflow&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This simplifies networking and debugging.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Conclusion
&lt;/h2&gt;

&lt;p&gt;Containerization changes how you think about environments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker packages your application into a portable unit.&lt;/li&gt;
&lt;li&gt;Docker Compose runs entire systems with one command.&lt;/li&gt;
&lt;li&gt;Your pipelines become reproducible and consistent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real shift is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You stop debugging environments — and start defining them as code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And once you reach that point:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You’re no longer just writing code — you’re building systems.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>containers</category>
      <category>docker</category>
      <category>devops</category>
    </item>
    <item>
      <title>Data Warehouses, Data Marts, Data Lakes, and Lakehouses - Explained Like You’re Building Them in Real Life</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Sun, 03 May 2026 08:10:27 +0000</pubDate>
      <link>https://dev.to/anthony-gicheru/data-warehouses-data-marts-data-lakes-and-lakehouses-explained-like-youre-building-them-in-3hmm</link>
      <guid>https://dev.to/anthony-gicheru/data-warehouses-data-marts-data-lakes-and-lakehouses-explained-like-youre-building-them-in-3hmm</guid>
      <description>&lt;p&gt;If you’ve been around data engineering long enough, you’ve probably heard these terms thrown around in meetings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Just dump it in the data lake”&lt;/li&gt;
&lt;li&gt;“We’ll expose it through the warehouse”&lt;/li&gt;
&lt;li&gt;“That goes into the mart”&lt;/li&gt;
&lt;li&gt;“We’re moving to a lakehouse architecture”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And honestly… it can sound like four different ways of saying the same thing.&lt;/p&gt;

&lt;p&gt;They’re not.&lt;/p&gt;

&lt;p&gt;Each one solves a slightly different problem in the data ecosystem. Once you understand the “why” behind each, the architecture suddenly feels a lot less like buzzwords and more like a clean system design.&lt;/p&gt;

&lt;p&gt;Let’s break it down in a practical, engineer-first way.&lt;/p&gt;




&lt;h1&gt;
  
  
  1. The Big Picture (Why all these systems exist)
&lt;/h1&gt;

&lt;p&gt;In most companies, data doesn’t come from one place — it flows in from everywhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User clicks from web/mobile apps&lt;/li&gt;
&lt;li&gt;Payments and transactions&lt;/li&gt;
&lt;li&gt;Logs from servers&lt;/li&gt;
&lt;li&gt;Third-party APIs (Stripe, Shopify, etc.)&lt;/li&gt;
&lt;li&gt;IoT or streaming data (Kafka, sensors, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now here’s the problem:&lt;/p&gt;

&lt;p&gt;Raw data is messy. Business users don’t want messy.&lt;/p&gt;

&lt;p&gt;So we build systems that progressively refine data from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Raw → Clean → Structured → Business-ready&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s where these four concepts come in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Lake&lt;/strong&gt; → store everything raw&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Warehouse&lt;/strong&gt; → structured analytics-ready data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Mart&lt;/strong&gt; → department-specific slices of warehouse data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lakehouse&lt;/strong&gt; → hybrid of lake + warehouse&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  2. Data Lake — “Store everything first, figure it out later”
&lt;/h1&gt;

&lt;p&gt;A &lt;strong&gt;data lake&lt;/strong&gt; is basically a massive storage system where you dump raw data in its original format.&lt;/p&gt;

&lt;p&gt;Think of it like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A giant warehouse where you throw every box in as-is, without opening it.&lt;br&gt;
Or even better: a farm storage system where everything is stored right after harvest, unprocessed and mixed together.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Characteristics:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Stores structured, semi-structured, and unstructured data&lt;/li&gt;
&lt;li&gt;Cheap storage (usually object storage like S3)&lt;/li&gt;
&lt;li&gt;Schema is applied &lt;strong&gt;when reading&lt;/strong&gt;, not writing (schema-on-read)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example tools:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Amazon S3&lt;/li&gt;
&lt;li&gt;Azure Data Lake Storage&lt;/li&gt;
&lt;li&gt;Google Cloud Storage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example:
&lt;/h3&gt;

&lt;p&gt;You might store:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/events/clicks/2026/05/01.json
/logs/api/2026/05/01.log
/payments/stripe/2026/05/01.parquet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No transformations. No enforcement. Just storage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Here’s the catch:
&lt;/h3&gt;

&lt;p&gt;If you’re not careful, a data lake becomes a &lt;strong&gt;data swamp&lt;/strong&gt; — lots of data, zero usability.&lt;/p&gt;




&lt;h1&gt;
  
  
  3. Data Warehouse — “Clean, structured, and business-ready”
&lt;/h1&gt;

&lt;p&gt;A &lt;strong&gt;data warehouse&lt;/strong&gt; is where data goes after it has been cleaned, transformed, and modeled for analytics.&lt;/p&gt;

&lt;p&gt;Think of it like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A well-organized supermarket where everything is cleaned, packaged, labeled, and placed on the right shelves.&lt;br&gt;
You don’t pick raw potatoes from the soil — you get them washed, sorted, and priced.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Characteristics:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Structured data only&lt;/li&gt;
&lt;li&gt;Schema-on-write (you define structure before loading)&lt;/li&gt;
&lt;li&gt;Optimized for analytics queries (OLAP systems)&lt;/li&gt;
&lt;li&gt;Highly curated and trustworthy&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example tools:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Redshift&lt;/li&gt;
&lt;li&gt;Snowflake&lt;/li&gt;
&lt;li&gt;Google BigQuery&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Typical workflow:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Extract data from sources&lt;/li&gt;
&lt;li&gt;Transform (clean, join, aggregate)&lt;/li&gt;
&lt;li&gt;Load into warehouse tables&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Example SQL model:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales_fact&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now business analysts can run queries like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_fact&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  4. Data Marts — “Department-specific mini warehouses”
&lt;/h1&gt;

&lt;p&gt;A &lt;strong&gt;data mart&lt;/strong&gt; is a subset of a data warehouse focused on a specific business domain.&lt;/p&gt;

&lt;p&gt;Think of it like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A grocery store or specialty shop — like a bakery, butcher, or vegetable shop.&lt;br&gt;
It doesn’t sell everything. It only sells what its customers actually need.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Characteristics:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Smaller scope than a warehouse&lt;/li&gt;
&lt;li&gt;Built for a specific team (finance, marketing, sales)&lt;/li&gt;
&lt;li&gt;Faster queries for targeted use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example:
&lt;/h3&gt;

&lt;p&gt;A marketing data mart might include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Campaign performance&lt;/li&gt;
&lt;li&gt;Customer acquisition metrics&lt;/li&gt;
&lt;li&gt;Ad spend data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example structure:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;marketing_campaign_performance&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;campaign_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_clicks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_impressions&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ad_events&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;campaign_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why it exists:
&lt;/h3&gt;

&lt;p&gt;Instead of everyone querying a massive warehouse, teams get &lt;strong&gt;pre-optimized datasets&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  5. Data Lakehouse — “Best of both worlds”
&lt;/h1&gt;

&lt;p&gt;Now this is where things get interesting.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;lakehouse&lt;/strong&gt; combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The flexibility of a data lake&lt;/li&gt;
&lt;li&gt;The structure and performance of a data warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A modern retail system where the warehouse and supermarket are combined into one smart facility.&lt;br&gt;
Raw goods arrive, but they are immediately tracked, organized, and made queryable without losing flexibility.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Characteristics:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Uses low-cost storage (like a lake)&lt;/li&gt;
&lt;li&gt;Adds structure, ACID transactions, and governance&lt;/li&gt;
&lt;li&gt;Supports both analytics and ML workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example tools:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Apache Spark + Delta Lake&lt;/li&gt;
&lt;li&gt;Apache Iceberg&lt;/li&gt;
&lt;li&gt;Apache Hudi&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why it matters:
&lt;/h3&gt;

&lt;p&gt;In traditional setups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data lakes = flexible but messy&lt;/li&gt;
&lt;li&gt;Warehouses = clean but expensive and rigid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lakehouses try to remove that tradeoff.&lt;/p&gt;




&lt;h1&gt;
  
  
  6. How They Work Together (Real Architecture Flow)
&lt;/h1&gt;

&lt;p&gt;A modern data pipeline often looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Data Sources ]
      ↓
   DATA LAKE (raw storage)
      ↓
ETL / ELT pipelines (Airflow, Spark)
      ↓
DATA WAREHOUSE (modeled data)
      ↓
DATA MARTS (team-specific views)
      ↓
Dashboards / BI tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or in a lakehouse setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Data Sources ]
      ↓
DATA LAKEHOUSE (single system)
      ↓
BI + ML + Analytics directly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  7. Practical Example (Mini Pipeline)
&lt;/h1&gt;

&lt;p&gt;Let’s say we’re processing e-commerce data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Raw data in S3 (Data Lake)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"order_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;101&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-01T10:00:00Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Spark transformation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;etl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://datalake/raw/orders/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;clean_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
             &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;clean_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://warehouse/sales_fact/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Load into warehouse (Redshift example)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;sales_fact&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'s3://warehouse/sales_fact/'&lt;/span&gt;
&lt;span class="n"&gt;IAM_ROLE&lt;/span&gt; &lt;span class="s1"&gt;'arn:aws:iam::123456:role/RedshiftRole'&lt;/span&gt;
&lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Create a data mart
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales_summary&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_fact&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  8. Common Pitfalls (Where most teams mess up)
&lt;/h1&gt;

&lt;h3&gt;
  
  
  1. Turning the data lake into a swamp
&lt;/h3&gt;

&lt;p&gt;Dumping everything without metadata or structure leads to chaos.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Over-modeling too early
&lt;/h3&gt;

&lt;p&gt;Trying to build perfect schemas upfront slows everything down.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Duplicating logic across marts
&lt;/h3&gt;

&lt;p&gt;You end up with inconsistent metrics like “Revenue_v1”, “Revenue_final”, “Revenue_real_final”.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. No governance layer
&lt;/h3&gt;

&lt;p&gt;Without access control and cataloging, nobody trusts the data.&lt;/p&gt;




&lt;h1&gt;
  
  
  9. Best Practices
&lt;/h1&gt;

&lt;h3&gt;
  
  
  1. Use layered architecture
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Raw (lake)&lt;/li&gt;
&lt;li&gt;Cleaned (staging)&lt;/li&gt;
&lt;li&gt;Modeled (warehouse)&lt;/li&gt;
&lt;li&gt;Aggregated (marts)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Standardize transformations
&lt;/h3&gt;

&lt;p&gt;Use tools like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dbt&lt;/li&gt;
&lt;li&gt;Apache Airflow&lt;/li&gt;
&lt;li&gt;Spark jobs with clear ownership&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Define a single source of truth
&lt;/h3&gt;

&lt;p&gt;One metric definition per business KPI. No duplicates.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Treat data like software
&lt;/h3&gt;

&lt;p&gt;Version it, test it, document it.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Monitor everything
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline failures&lt;/li&gt;
&lt;li&gt;Data freshness&lt;/li&gt;
&lt;li&gt;Schema changes&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  10. Conclusion — The mental model that matters
&lt;/h1&gt;

&lt;p&gt;If you strip away the jargon, it’s really simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Lake&lt;/strong&gt; → store everything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Warehouse&lt;/strong&gt; → clean and organize it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Mart&lt;/strong&gt; → tailor it for teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lakehouse&lt;/strong&gt; → unify storage and analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real skill in data engineering isn’t memorizing definitions.&lt;/p&gt;

&lt;p&gt;It’s knowing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When to keep data raw, when to structure it, and when to specialize it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once that clicks, designing data systems becomes a lot more intuitive — and honestly, more fun to build.&lt;/p&gt;

</description>
      <category>datawarehouse</category>
      <category>datamarts</category>
      <category>datalake</category>
      <category>lakehouse</category>
    </item>
    <item>
      <title>Refactoring Airflow Pipelines: From PythonOperator to TaskFlow</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Fri, 24 Apr 2026 10:52:39 +0000</pubDate>
      <link>https://dev.to/anthony-gicheru/refactoring-airflow-pipelines-from-pythonoperator-to-taskflow-25mk</link>
      <guid>https://dev.to/anthony-gicheru/refactoring-airflow-pipelines-from-pythonoperator-to-taskflow-25mk</guid>
      <description>&lt;h1&gt;
  
  
  Actually Embracing TaskFlow After a Year of Doing It the “Old Way”
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. Introduction: This Isn’t New… But It &lt;em&gt;Feels&lt;/em&gt; New
&lt;/h2&gt;

&lt;p&gt;If you’ve been using Airflow for a while-like I have-you probably didn’t start with the TaskFlow API.&lt;/p&gt;

&lt;p&gt;You likely started with the classic Airflow 2.x style:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PythonOperator&lt;/li&gt;
&lt;li&gt;&lt;code&gt;**kwargs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ti.xcom_push()&lt;/code&gt; and &lt;code&gt;ti.xcom_pull()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Explicit task chaining with &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I spent over a year building pipelines this way. And to be clear-it works. It’s stable, production-ready, and widely used.&lt;/p&gt;

&lt;p&gt;But here’s the interesting part:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The TaskFlow API has existed since Airflow 2.0. I just didn’t fully adopt it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Honestly, I ignored TaskFlow for a long time because I thought it was just ‘syntactic sugar’. That’s more common than people admit.&lt;/p&gt;

&lt;p&gt;Most production systems and tutorials still rely on operators, so you naturally stay in that pattern. It’s only later-when readability and maintainability start to matter-that TaskFlow becomes interesting.&lt;/p&gt;

&lt;p&gt;And once it clicks, it changes how you think about Airflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Core Concepts: Same Engine, Different Experience
&lt;/h2&gt;

&lt;p&gt;TaskFlow doesn’t replace Airflow concepts-it abstracts them.&lt;/p&gt;

&lt;p&gt;You still work with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tasks&lt;/li&gt;
&lt;li&gt;DAGs&lt;/li&gt;
&lt;li&gt;Scheduling&lt;/li&gt;
&lt;li&gt;XComs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference is &lt;em&gt;how&lt;/em&gt; you express them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional Approach
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ti&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;traditional_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transform&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works, but it introduces a lot of orchestration boilerplate into your business logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  TaskFlow Approach
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;taskflow_dag&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;taskflow_dag&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This feels simpler because it is.&lt;/p&gt;

&lt;p&gt;TaskFlow removes explicit XCom handling and lets function returns define data flow.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The Real Shift: From Wiring Tasks to Modeling Data Flow
&lt;/h2&gt;

&lt;p&gt;With the traditional approach, your mental model looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task A - XCom - Task B - XCom - Task C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With TaskFlow, it becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj2rveoa9vso89mowp9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj2rveoa9vso89mowp9p.png" alt="Airflow DAG: Traditional vs Taskflow" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Same execution engine. Different abstraction.&lt;/p&gt;

&lt;p&gt;The shift is from &lt;em&gt;task orchestration&lt;/em&gt; to &lt;em&gt;data flow composition&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. XComs: Manual vs Automatic
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Manual XComs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ti&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You manage everything explicitly.&lt;/p&gt;

&lt;h3&gt;
  
  
  TaskFlow XComs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Airflow handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;serialization&lt;/li&gt;
&lt;li&gt;storage&lt;/li&gt;
&lt;li&gt;retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You focus on logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  When You Still Need Control
&lt;/h3&gt;

&lt;p&gt;TaskFlow still allows explicit control when needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.models.xcom_arg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;XComArg&lt;/span&gt;

&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numbers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numbers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;XComArg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. Real-World Example: Gas Prices ETL Refactor
&lt;/h2&gt;

&lt;p&gt;I didn’t build two versions of this pipeline at once.&lt;/p&gt;

&lt;p&gt;I originally built it using the traditional Airflow 2.x approach and later refactored it using TaskFlow.&lt;/p&gt;

&lt;p&gt;That’s when the difference became clear.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline Overview
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API - Extract gas prices - Transform - Store in PostgreSQL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  GitHub Reference
&lt;/h3&gt;

&lt;p&gt;Full project: &lt;a href="https://github.com/Anthony-Gicheru/Gas-Prices-ETL-with-Apache-Airflow" rel="noopener noreferrer"&gt;Github Link to the project&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It includes both the original DAG and the TaskFlow refactor.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional Version
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw_gas_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decoded_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fetch_gas_prices&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw_gas_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach tightly couples logic with Airflow internals.&lt;/p&gt;

&lt;p&gt;Data must be serialized manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;json_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  TaskFlow Version
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_gas_prices&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decoded_data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the pipeline becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_gas_prices&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;store_gas_prices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reads like standard Python.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Changed
&lt;/h3&gt;

&lt;p&gt;The logic stayed the same. The structure changed completely.&lt;/p&gt;

&lt;p&gt;Instead of manually managing XComs, data flows naturally between functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before vs After
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Traditional&lt;/th&gt;
&lt;th&gt;TaskFlow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Task definition&lt;/td&gt;
&lt;td&gt;PythonOperator&lt;/td&gt;
&lt;td&gt;@task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data passing&lt;/td&gt;
&lt;td&gt;Manual XCom&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Readability&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Boilerplate&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mental model&lt;/td&gt;
&lt;td&gt;Wiring tasks&lt;/td&gt;
&lt;td&gt;Data flow&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  6. Lessons From the Refactor
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. TaskFlow doesn’t remove XComs
&lt;/h3&gt;

&lt;p&gt;It only hides them.&lt;/p&gt;

&lt;p&gt;You still need to respect serialization limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;big_dataframe&lt;/span&gt;  &lt;span class="c1"&gt;# still not ideal
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Passing data is easier-but not always better
&lt;/h3&gt;

&lt;p&gt;TaskFlow makes it easy to pass data between tasks, but large payloads should still live in external storage.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Refactoring was mostly structural
&lt;/h3&gt;

&lt;p&gt;Most of the work was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;removing &lt;code&gt;**kwargs&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;replacing XCom logic with returns&lt;/li&gt;
&lt;li&gt;simplifying task boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. The biggest change is mental
&lt;/h3&gt;

&lt;p&gt;The shift was not technical-it was conceptual.&lt;br&gt;
From:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do I connect tasks?&lt;br&gt;
to:&lt;/p&gt;
&lt;h2&gt;
  
  
  How does data flow through this pipeline?
&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  7. Pitfalls to Avoid
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Don’t push large objects through XCom&lt;/li&gt;
&lt;li&gt;Don’t mix styles without intention&lt;/li&gt;
&lt;li&gt;Don’t overuse TaskFlow just because it’s cleaner&lt;/li&gt;
&lt;li&gt;Don’t forget serialization still exists&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8. Conclusion
&lt;/h2&gt;

&lt;p&gt;TaskFlow isn’t new-but adopting it after using the traditional approach makes its benefits clearer.&lt;/p&gt;

&lt;p&gt;It moves you from writing orchestration-heavy DAGs to writing clean Python workflows.&lt;/p&gt;

&lt;p&gt;And that shift improves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;readability&lt;/li&gt;
&lt;li&gt;maintainability&lt;/li&gt;
&lt;li&gt;reasoning about pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;TaskFlow simplifies DAG structure without changing Airflow’s core engine&lt;/li&gt;
&lt;li&gt;XComs still exist but are abstracted&lt;/li&gt;
&lt;li&gt;The real improvement is cleaner data flow modeling&lt;/li&gt;
&lt;li&gt;Refactoring old DAGs is one of the best ways to understand it&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>etl</category>
      <category>apacheairflow</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Data Pipelines Explained Simply (and How to Build Them with Python)</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Fri, 17 Apr 2026 07:34:55 +0000</pubDate>
      <link>https://dev.to/anthony-gicheru/data-pipelines-explained-simply-and-how-to-build-them-with-python-555</link>
      <guid>https://dev.to/anthony-gicheru/data-pipelines-explained-simply-and-how-to-build-them-with-python-555</guid>
      <description>&lt;p&gt;Data pipelines are the backbone of modern data-driven organizations. They automate the movement, transformation, and storage of data - from raw sources to actionable insights.&lt;/p&gt;

&lt;p&gt;Python has become the go-to language for building scalable pipelines because of its rich ecosystem, flexibility, and ease of use.&lt;/p&gt;

&lt;p&gt;This guide walks through the fundamentals, tools, and best practices for building robust data pipelines using Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding Data Pipelines&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Imagine you need to supply clean water to a village. The process involves collecting water from different sources (rivers, wells, rain), purifying it, transporting it, and storing it so people can access it whenever they need it.&lt;/p&gt;

&lt;p&gt;A data pipeline works in a very similar way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sonxpmecasd03c5xzhw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sonxpmecasd03c5xzhw.png" alt="A data pipeline represented as a water system, showing how raw data flows through ingestion, transformation, storage, and finally consumption." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It automates the journey of raw, unstructured data from multiple sources (like databases, APIs, or IoT devices) and transforms it into clean, usable data stored in a destination (like a data warehouse) for analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Components of a Data Pipeline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s break it down using the same analogy:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Collecting Water (Data Ingestion)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Just like gathering water from lakes or wells, a pipeline starts by extracting data from sources such as databases, APIs, spreadsheets, or sensors.&lt;/p&gt;

&lt;p&gt;The goal here is simple: get all the data into one system, no matter how scattered it is.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Filtering and Purifying (Data Transformation)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Raw water isn’t clean—and neither is raw data.&lt;/p&gt;

&lt;p&gt;At this stage, the pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removes duplicates&lt;/li&gt;
&lt;li&gt;Handles missing values&lt;/li&gt;
&lt;li&gt;Standardizes formats&lt;/li&gt;
&lt;li&gt;Enriches data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where messy data becomes usable.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Transporting Through Pipes (Data Movement)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once cleaned, water flows through pipes. In data pipelines, this represents the movement of data between systems.&lt;/p&gt;

&lt;p&gt;This can involve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETL processes&lt;/li&gt;
&lt;li&gt;Message queues (like Kafka)&lt;/li&gt;
&lt;li&gt;Cloud data transfer services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to move data efficiently without delays or bottlenecks.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Storing in Tanks (Data Storage)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Clean water is stored in tanks. Similarly, processed data is stored in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data warehouses (like Snowflake)&lt;/li&gt;
&lt;li&gt;Data lakes (like AWS S3)&lt;/li&gt;
&lt;li&gt;Databases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where data becomes ready for use.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Accessing on Demand (Data Consumption)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Finally, people use the water.&lt;/p&gt;

&lt;p&gt;In the same way, data is consumed through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dashboards&lt;/li&gt;
&lt;li&gt;APIs&lt;/li&gt;
&lt;li&gt;Machine learning models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where insights actually happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Essential Python Libraries and Tools&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Python supports every stage of a pipeline:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data Ingestion&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;requests&lt;/code&gt; - API calls&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pandas&lt;/code&gt; - handling CSV/JSON files&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Transformation&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pandas&lt;/code&gt; - cleaning and aggregation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PySpark&lt;/code&gt; - large-scale distributed processing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Storage&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SQLAlchemy&lt;/code&gt; - database interaction&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;boto3&lt;/code&gt; - AWS S3 integration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Orchestration&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Apache Airflow&lt;/code&gt; - workflow scheduling and automation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Dagster&lt;/code&gt; - modern pipeline orchestration with observability&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Best Practices&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Error Handling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Implement retries and proper logging to avoid silent failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Track pipeline health using tools like Airflow’s UI.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Documentation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Keep clear documentation for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code&lt;/li&gt;
&lt;li&gt;Dependencies&lt;/li&gt;
&lt;li&gt;Workflow logic&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Test each stage of the pipeline using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unit tests&lt;/li&gt;
&lt;li&gt;Sample datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Popular Frameworks for Advanced Use Cases&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Airflow&lt;/strong&gt; - Best for complex workflows with dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dagster&lt;/strong&gt; - Strong focus on testing and data asset visibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefect&lt;/strong&gt; - Simplifies building fault-tolerant pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Luigi&lt;/strong&gt; - Good for batch processing and dependency management&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>etl</category>
      <category>python</category>
      <category>datapipeline</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>ETL vs ELT: Which One Should You Use and Why?</title>
      <dc:creator>Anthony Gicheru</dc:creator>
      <pubDate>Sun, 12 Apr 2026 21:36:22 +0000</pubDate>
      <link>https://dev.to/anthony-gicheru/etl-vs-elt-which-one-should-you-use-and-why-412e</link>
      <guid>https://dev.to/anthony-gicheru/etl-vs-elt-which-one-should-you-use-and-why-412e</guid>
      <description>&lt;p&gt;When I first started learning data engineering, ETL and ELT honestly felt like the same thing with just swapped letters. Everyone kept mentioning them like they were obvious concepts, but I had to sit down and really break them apart before it made sense.&lt;/p&gt;

&lt;p&gt;If you’re in the same place, don’t worry, you’re not alone.&lt;/p&gt;

&lt;p&gt;Let’s make it simple.&lt;/p&gt;

&lt;h2&gt;
  
  
  First things first: what do ETL and ELT even mean?
&lt;/h2&gt;

&lt;p&gt;Both ETL and ELT are ways of moving and processing data from one place to another.&lt;/p&gt;

&lt;h3&gt;
  
  
  ETL (Extract, Transform, Load)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt; data from a source (like an API or database)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt; it before storing it (cleaning, filtering, joining, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt; the final cleaned data into a target system (like a data warehouse)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key idea: &lt;em&gt;you clean the data before storing it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mu9s8n6tstb1jvvl1rn.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mu9s8n6tstb1jvvl1rn.PNG" alt="ELT" width="800" height="797"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ELT (Extract, Load, Transform)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt; data from the source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt; it directly into the storage system first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt; it inside the database/warehouse later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key idea: &lt;em&gt;you store raw data first, then clean it inside the system.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F108qm1z9tg391xj0sqrb.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F108qm1z9tg391xj0sqrb.PNG" alt="ETL" width="800" height="797"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  So what’s the real difference?
&lt;/h2&gt;

&lt;p&gt;The biggest difference is &lt;strong&gt;where the transformation happens&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETL → Transform happens outside the warehouse&lt;/li&gt;
&lt;li&gt;ELT → Transform happens inside the warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That one shift changes a lot more than you’d think.&lt;/p&gt;

&lt;h2&gt;
  
  
  When ETL makes sense
&lt;/h2&gt;

&lt;p&gt;ETL is usually used when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have smaller datasets&lt;/li&gt;
&lt;li&gt;You need strict data control before loading&lt;/li&gt;
&lt;li&gt;Your system can’t handle heavy processing&lt;/li&gt;
&lt;li&gt;Data quality must be enforced early&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like cleaning your room before putting things in storage.&lt;/p&gt;

&lt;p&gt;You don’t want messy data entering your system at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  When ELT makes sense
&lt;/h2&gt;

&lt;p&gt;ELT is more common in modern systems, especially with cloud platforms.&lt;/p&gt;

&lt;p&gt;It works well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have large volumes of data&lt;/li&gt;
&lt;li&gt;You’re using powerful cloud warehouses (like Snowflake or BigQuery)&lt;/li&gt;
&lt;li&gt;You want flexibility in how data is transformed&lt;/li&gt;
&lt;li&gt;You want to keep raw data for future use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like dumping everything into a warehouse first, then organizing it later when needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  A simple real-world example
&lt;/h2&gt;

&lt;p&gt;Imagine you’re building a dashboard for an e-commerce app.&lt;/p&gt;

&lt;h3&gt;
  
  
  With ETL:
&lt;/h3&gt;

&lt;p&gt;You:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull order data&lt;/li&gt;
&lt;li&gt;Clean it (remove duplicates, fix missing values)&lt;/li&gt;
&lt;li&gt;Then load it into your database ready for reporting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is neat before it even arrives.&lt;/p&gt;

&lt;h3&gt;
  
  
  With ELT:
&lt;/h3&gt;

&lt;p&gt;You:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull raw order data&lt;/li&gt;
&lt;li&gt;Load everything into a data warehouse&lt;/li&gt;
&lt;li&gt;Later write SQL transformations to clean and structure it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives you more flexibility if business rules change later.&lt;/p&gt;

&lt;h2&gt;
  
  
  My key takeaway
&lt;/h2&gt;

&lt;p&gt;When I first learned this, I thought ETL was “old” and ELT was “new,” but that’s not really true.&lt;/p&gt;

&lt;p&gt;They both still matter.&lt;/p&gt;

&lt;p&gt;Here’s a simple way I now remember it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ETL = Clean first, store later&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ELT = Store first, clean later&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common mistakes beginners make
&lt;/h2&gt;

&lt;p&gt;A few things that confused me at the start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thinking ELT means “no cleaning” (it still involves transformation!)&lt;/li&gt;
&lt;li&gt;Mixing up where SQL transformations happen&lt;/li&gt;
&lt;li&gt;Assuming one is always better than the other (it depends on the system)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  So… which one should YOU use?
&lt;/h2&gt;

&lt;p&gt;There’s no universal winner.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you’re working with traditional systems → ETL is common&lt;/li&gt;
&lt;li&gt;If you’re in modern cloud data engineering → ELT is more popular&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most real companies actually use a &lt;strong&gt;mix of both&lt;/strong&gt;, depending on the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  To make this even more practical, here are some common tools used in real ETL and ELT workflows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ETL Tools (Transformation happens before loading)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Airflow&lt;/strong&gt; – for scheduling and orchestrating ETL workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Informatica PowerCenter&lt;/strong&gt; – widely used in enterprise ETL pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Talend&lt;/strong&gt; – open-source tool for data integration and transformation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache NiFi&lt;/strong&gt; – good for real-time data flow and routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSIS (SQL Server Integration Services)&lt;/strong&gt; – Microsoft-based ETL tool&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tools usually handle data cleaning and transformation before sending data to a warehouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  ELT Tools (Transformation happens after loading)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake&lt;/strong&gt; – modern cloud data warehouse with strong ELT support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google BigQuery&lt;/strong&gt; – popular for serverless ELT workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Redshift&lt;/strong&gt; – widely used in AWS-based data stacks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dbt (Data Build Tool)&lt;/strong&gt; – one of the most popular tools for transformations inside the warehouse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks (Apache Spark)&lt;/strong&gt; – used for large-scale ELT processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In ELT setups, tools like &lt;strong&gt;dbt&lt;/strong&gt; handle transformation using SQL after data is loaded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Once I understood this difference, a lot of other concepts like data pipelines, warehouses, and analytics started to make way more sense.&lt;/p&gt;

&lt;p&gt;If you’re learning data engineering right now, don’t rush it. Build a small pipeline, try both approaches, and you’ll see the difference quickly.&lt;/p&gt;

&lt;p&gt;That’s where it really clicks.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>etl</category>
      <category>elt</category>
    </item>
  </channel>
</rss>
