<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Karthikeyan Rajasekaran</title>
    <description>The latest articles on DEV Community by Karthikeyan Rajasekaran (@karthikeyan_rajasekaran_c).</description>
    <link>https://dev.to/karthikeyan_rajasekaran_c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3640082%2Fe182c6b4-b042-429d-b268-e8e9600a22e6.png</url>
      <title>DEV Community: Karthikeyan Rajasekaran</title>
      <link>https://dev.to/karthikeyan_rajasekaran_c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/karthikeyan_rajasekaran_c"/>
    <language>en</language>
    <item>
      <title>Building Bulletproof Data Pipelines: Orchestration, Testing, and Monitoring (Part 3 of 3)</title>
      <dc:creator>Karthikeyan Rajasekaran</dc:creator>
      <pubDate>Fri, 02 Jan 2026 05:54:14 +0000</pubDate>
      <link>https://dev.to/karthikeyan_rajasekaran_c/building-bulletproof-data-pipelines-orchestration-testing-and-monitoring-part-3-of-3-20n8</link>
      <guid>https://dev.to/karthikeyan_rajasekaran_c/building-bulletproof-data-pipelines-orchestration-testing-and-monitoring-part-3-of-3-20n8</guid>
      <description>&lt;p&gt;It was 3:17 AM when my phone buzzed in a previous role. I grabbed it, squinting at the screen: "Pipeline failed: account_summary."&lt;/p&gt;

&lt;p&gt;Still half-asleep, I opened my laptop and pulled up the logs. The error message stared back at me: "Relation 'intermediate_accounts' does not exist."&lt;/p&gt;

&lt;p&gt;Wait, what? That table should exist. The intermediate layer runs before the marts layer. Why is it missing?&lt;/p&gt;

&lt;p&gt;Then I saw it. The intermediate job had failed silently 20 minutes earlier. The marts job ran anyway, looking for a table that wasn't there. The orchestration had failed.&lt;/p&gt;

&lt;p&gt;This is the moment I realized: you can have perfect transformations and blazing-fast incremental processing (see &lt;a href="https://dev.to/karthikeyan_rajasekaran_c/the-day-our-pipeline-went-from-10-minutes-to-6-seconds-part-2-of-3-3jeh"&gt;Part 2&lt;/a&gt;), but if your orchestration is broken, your pipeline is a ticking time bomb.&lt;/p&gt;

&lt;p&gt;That experience inspired me to build an example pipeline demonstrating the right patterns for orchestration, testing, and monitoring. Let me show you what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Orchestration Problem
&lt;/h2&gt;

&lt;p&gt;Here's what I got wrong for way too long in production: I focused on the transformations and forgot about the orchestration.&lt;/p&gt;

&lt;p&gt;I had perfect SQL, clean architecture, and blazing-fast incremental processing. But the jobs were held together with cron jobs and shell scripts. And that's how I ended up debugging at 3 AM.&lt;/p&gt;

&lt;p&gt;Let me show you the patterns I learned and implemented in my example pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Naive Approach: Cron Jobs
&lt;/h2&gt;

&lt;p&gt;The production pipeline started with cron jobs. Simple, right?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# crontab&lt;/span&gt;
0 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /scripts/run_source.sh
5 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /scripts/run_staging.sh
10 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /scripts/run_snapshots.sh
15 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /scripts/run_intermediate.sh
20 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /scripts/run_marts.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each job runs 5 minutes after the previous one. Plenty of time, right?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What could go wrong?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1&lt;/strong&gt;: The source job takes 7 minutes instead of 4. The staging job starts before source finishes. Chaos.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2&lt;/strong&gt;: The intermediate job fails. The marts job runs anyway, using stale data. Nobody notices for three days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3&lt;/strong&gt;: You need to rerun just the marts layer. You have to manually figure out which script to run and in what order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 4&lt;/strong&gt;: Someone asks "when did this job last run successfully?" You grep through logs for 20 minutes.&lt;/p&gt;

&lt;p&gt;Cron jobs work for simple tasks. For data pipelines? They're a disaster waiting to happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter Dagster: Asset-Centric Orchestration
&lt;/h2&gt;

&lt;p&gt;Switching to Dagster changed everything for production pipelines I've worked on. Not because Dagster is magic, but because it forced me to think about data, not tasks.&lt;/p&gt;

&lt;p&gt;Here's the mental shift: Instead of "run this script, then that script," you think "this data depends on that data."&lt;/p&gt;

&lt;p&gt;Let me show you what this looks like in my example pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingestion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;customers_raw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Ingest customer data from CSV&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/customers.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

&lt;span class="nd"&gt;@asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;customers_raw&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# Wait for customers_raw
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dbt_transformations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dbt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run all DBT models&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;dbt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cli&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;build&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dbt_transformations&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# Wait for transformations
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;account_summary_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Export results to CSV&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Read from database and export
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what's different? We're not saying "run at 2:05 AM." We're saying "this asset depends on that asset."&lt;/p&gt;

&lt;p&gt;Dagster figures out the order. If &lt;code&gt;customers_raw&lt;/code&gt; fails, &lt;code&gt;dbt_transformations&lt;/code&gt; doesn't run. If &lt;code&gt;dbt_transformations&lt;/code&gt; fails, &lt;code&gt;account_summary_csv&lt;/code&gt; doesn't run. Failures stop at their source instead of cascading downstream.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Asset Lineage View
&lt;/h2&gt;

&lt;p&gt;Here's where Dagster really shines. You get a visual graph of your entire pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;customers_raw ──┐
                ├──&amp;gt; dbt_transformations ──&amp;gt; account_summary_csv
accounts_raw ───┘                       └──&amp;gt; account_summary_parquet
                                        └──&amp;gt; data_quality_report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't just pretty. It's functional. Click on any asset and you see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When it last ran&lt;/li&gt;
&lt;li&gt;How long it took&lt;/li&gt;
&lt;li&gt;What data it produced&lt;/li&gt;
&lt;li&gt;What depends on it&lt;/li&gt;
&lt;li&gt;The full logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That 3 AM debugging session? Would have taken 2 minutes instead of 20 with this visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retry Logic: Because Things Fail
&lt;/h2&gt;

&lt;p&gt;Networks time out. Databases get overloaded. Cloud services have hiccups. I learned this during a Databricks outage in a previous production system.&lt;/p&gt;

&lt;p&gt;The pipeline failed. Then it retried. And succeeded. I didn't even know there was an outage until I checked the logs later.&lt;/p&gt;

&lt;p&gt;Here's the retry strategy that saved us, which I've implemented in my example pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;retry_policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RetryPolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Start with 1 second
&lt;/span&gt;        &lt;span class="n"&gt;backoff&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Backoff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EXPONENTIAL&lt;/span&gt;  &lt;span class="c1"&gt;# Double each time
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;account_summary_to_databricks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;databricks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Load data to Databricks with retry logic&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Attempt to load data
&lt;/span&gt;            &lt;span class="n"&gt;databricks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;account_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ConnectionTimeout&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;wait_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Exponential backoff
&lt;/span&gt;                &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Attempt &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; failed, retrying in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;wait_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;First attempt fails&lt;/strong&gt;: Wait 1 second, retry&lt;br&gt;
&lt;strong&gt;Second attempt fails&lt;/strong&gt;: Wait 2 seconds, retry&lt;br&gt;
&lt;strong&gt;Third attempt fails&lt;/strong&gt;: Wait 4 seconds, retry&lt;br&gt;
&lt;strong&gt;Fourth attempt fails&lt;/strong&gt;: Give up, alert humans&lt;/p&gt;

&lt;p&gt;This pattern saved us during that outage. The first few attempts failed, but by the time the third retry happened, Databricks was back up. The pipeline succeeded without waking me up.&lt;/p&gt;

&lt;p&gt;I've built this same retry logic into my example pipeline to demonstrate the pattern.&lt;/p&gt;
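&lt;p&gt;The schedule above boils down to one expression: the wait before retry n is the base delay doubled n times. A minimal sketch (the function name is illustrative, not from the example pipeline):&lt;/p&gt;

```python
def backoff_delay(attempt, base_delay=1):
    """Seconds to wait after the given failed attempt (0-indexed)."""
    return base_delay * (2 ** attempt)

# Waits between the four attempts described above: 1s, 2s, 4s
schedule = [backoff_delay(a) for a in range(3)]
```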
&lt;h2&gt;
  
  
  Data Quality: Trust But Verify
&lt;/h2&gt;

&lt;p&gt;Fast pipelines are useless if they produce wrong results. I learned this the embarrassing way in production.&lt;/p&gt;

&lt;p&gt;A business user asked why the interest calculations looked off. I checked the code. Looked fine. I checked the data. Looked fine. Then I dug deeper.&lt;/p&gt;

&lt;p&gt;Turns out, we had a bug in the staging layer. Some boolean values weren't being standardized correctly. The pipeline ran successfully every day, producing wrong results every day. For three weeks.&lt;/p&gt;

&lt;p&gt;That's when I became obsessed with testing. My example pipeline includes 99 automated tests to demonstrate comprehensive data quality patterns.&lt;/p&gt;
&lt;h3&gt;
  
  
  Layer 1: Schema Tests
&lt;/h3&gt;

&lt;p&gt;First line of defense: make sure the data structure is correct.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stg_customer&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customer_id&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;unique&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;has_loan_flag&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;accepted_values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;true&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;false&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These tests run after every transformation. If &lt;code&gt;customer_id&lt;/code&gt; has duplicates, the pipeline fails. If &lt;code&gt;has_loan_flag&lt;/code&gt; has a value other than &lt;code&gt;true&lt;/code&gt;/&lt;code&gt;false&lt;/code&gt;, the pipeline fails.&lt;/p&gt;

&lt;p&gt;Fail fast, fail loud.&lt;/p&gt;
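&lt;p&gt;If you wanted to reproduce these checks outside DBT, they are only a few lines of Python. A minimal sketch (function and fixture names are illustrative, not from the example pipeline):&lt;/p&gt;

```python
def check_schema(rows):
    """Return the failed checks for the stg_customer contract (rows: list of dicts)."""
    failures = []
    ids = [r["customer_id"] for r in rows]
    if any(i is None for i in ids):
        failures.append("not_null_customer_id")
    if len(ids) != len(set(ids)):
        failures.append("unique_customer_id")
    if any(r["has_loan_flag"] not in (True, False) for r in rows):
        failures.append("accepted_values_has_loan_flag")
    return failures

good = [{"customer_id": 1, "has_loan_flag": True},
        {"customer_id": 2, "has_loan_flag": False}]
bad = [{"customer_id": 1, "has_loan_flag": True},
       {"customer_id": 1, "has_loan_flag": "maybe"}]
```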

&lt;h3&gt;
  
  
  Layer 2: Relationship Tests
&lt;/h3&gt;

&lt;p&gt;Make sure data relationships are valid.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;int_account_with_customer&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customer_id&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;relationships&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ref('stg_customer')&lt;/span&gt;
              &lt;span class="na"&gt;field&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customer_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures every account has a valid customer. No orphaned records, no broken foreign keys.&lt;/p&gt;
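&lt;p&gt;Under the hood, a relationship test is just a set difference: which foreign keys in the child table have no matching parent? A minimal sketch (names are illustrative):&lt;/p&gt;

```python
def orphaned_keys(child_keys, parent_keys):
    """Foreign-key values in the child table with no matching parent row."""
    return sorted(set(child_keys) - set(parent_keys))

customer_ids = [101, 102, 103]
account_customer_ids = [101, 101, 104]  # 104 points at a customer that doesn't exist
```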

&lt;h3&gt;
  
  
  Layer 3: Business Logic Tests
&lt;/h3&gt;

&lt;p&gt;Make sure calculations are correct.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;account_summary&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;interest_rate_pct&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;positive_value&lt;/span&gt;  &lt;span class="c1"&gt;# Custom test&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;accepted_range&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;min_value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.01&lt;/span&gt;
              &lt;span class="na"&gt;max_value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.025&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;new_balance_amount&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;positive_value&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interest rates are stored as fractions, so valid values fall between 0.01 and 0.025 (1% to 2.5%). If we see 0.25 or 0.0001, something's wrong.&lt;/p&gt;
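&lt;p&gt;The range test reduces to a simple filter that surfaces any offending values. A minimal sketch of the idea (the function is illustrative, not DBT's implementation):&lt;/p&gt;

```python
def out_of_range(values, min_value, max_value):
    """Return the values falling outside the closed interval [min_value, max_value]."""
    return [v for v in values if v > max_value or min_value > v]

rates = [0.015, 0.02, 0.25, 0.0001]  # two valid rates, two suspicious ones
```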

&lt;h3&gt;
  
  
  Layer 4: Freshness Tests
&lt;/h3&gt;

&lt;p&gt;Make sure data is recent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;raw&lt;/span&gt;
    &lt;span class="na"&gt;tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customers_raw&lt;/span&gt;
        &lt;span class="na"&gt;freshness&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;warn_after&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;24&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;hour&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
          &lt;span class="na"&gt;error_after&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;48&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;hour&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If customer data hasn't been updated in 24 hours, warn us. If it's been 48 hours, fail the pipeline.&lt;/p&gt;

&lt;p&gt;This catches issues where ingestion jobs silently fail. The pipeline keeps running with stale data until the freshness test catches it.&lt;/p&gt;
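&lt;p&gt;The warn/error thresholds translate to a simple age check. A minimal sketch (the function name and thresholds mirror the YAML above but are illustrative):&lt;/p&gt;

```python
from datetime import datetime, timedelta

def freshness_status(last_loaded, now, warn_after_hours=24, error_after_hours=48):
    """Classify a source table by the age of its newest load."""
    age = now - last_loaded
    if age > timedelta(hours=error_after_hours):
        return "error"
    if age > timedelta(hours=warn_after_hours):
        return "warn"
    return "pass"

now = datetime(2024, 12, 1, 2, 30)
```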

&lt;h2&gt;
  
  
  The Quality Report
&lt;/h2&gt;

&lt;p&gt;After every run, we generate a quality report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-12-01T02:30:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_tests"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"failed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"pass_rate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;95.0&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"failures"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"unique_customer_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stg_customer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Found 3 duplicate customer IDs: [101, 205, 309]"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"positive_value_balance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stg_account"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Found 1 negative balance: account A042 has balance -150.00"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This report goes to Slack in production systems. If tests fail, teams investigate before the data reaches production. I've implemented this same pattern in my example pipeline.&lt;/p&gt;
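&lt;p&gt;The &lt;code&gt;summary&lt;/code&gt; block is straightforward to compute from per-test results. A minimal sketch (the result shape is assumed from the report above):&lt;/p&gt;

```python
def build_summary(results):
    """Aggregate per-test results into the report's 'summary' block."""
    passed = sum(1 for r in results if r["status"] == "pass")
    failed = len(results) - passed
    return {
        "total_tests": len(results),
        "passed": passed,
        "failed": failed,
        "pass_rate": round(100 * passed / len(results), 1),
    }

results = [{"status": "pass"}] * 38 + [{"status": "fail"}] * 2
```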

&lt;h2&gt;
  
  
  The Quarantine Pattern
&lt;/h2&gt;

&lt;p&gt;Sometimes you can't fix bad data immediately in production. Maybe it's a weekend, or the source system is down, or you need business input on how to handle it.&lt;/p&gt;

&lt;p&gt;I use a quarantine pattern in my example pipeline to demonstrate this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- stg_customer.sql&lt;/span&gt;
&lt;span class="c1"&gt;-- Good records go to stg_customer&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;src_customer&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;

&lt;span class="c1"&gt;-- Bad records go to quarantine&lt;/span&gt;
&lt;span class="c1"&gt;-- quarantine_stg_customer.sql&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt; 
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'missing_customer_id'&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'missing_customer_name'&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;quarantine_reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;quarantined_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;src_customer&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
   &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad records don't break the pipeline. They go to a quarantine table where we can review them later. The pipeline continues with good data.&lt;/p&gt;

&lt;p&gt;This saved us in production when a source system started sending records with null IDs. Instead of failing the entire pipeline, we quarantined those records and processed everything else. We fixed the source system later and reprocessed the quarantined records.&lt;/p&gt;

&lt;p&gt;I've implemented this pattern in my example pipeline to show how it works.&lt;/p&gt;
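&lt;p&gt;The same routing logic is easy to sketch outside SQL, for example as in-memory validation before a load. A minimal sketch (names are illustrative, not from the example pipeline):&lt;/p&gt;

```python
def split_quarantine(rows):
    """Route good rows through; tag bad rows with a quarantine_reason instead of failing."""
    good, quarantined = [], []
    for row in rows:
        if row.get("customer_id") is None:
            quarantined.append({**row, "quarantine_reason": "missing_customer_id"})
        elif row.get("customer_name") is None:
            quarantined.append({**row, "quarantine_reason": "missing_customer_name"})
        else:
            good.append(row)
    return good, quarantined

rows = [
    {"customer_id": 1, "customer_name": "Ada"},
    {"customer_id": None, "customer_name": "Bob"},
    {"customer_id": 3, "customer_name": None},
]
```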

&lt;h2&gt;
  
  
  Monitoring: Know What's Happening
&lt;/h2&gt;

&lt;p&gt;I learned to track three key metrics in production systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Run duration&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_output_metadata&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duration_seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;records_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a job that usually takes 6 seconds suddenly takes 60 seconds, something's wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Record counts&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_output_metadata&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_df&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filtered_records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we usually process 50 records and suddenly process 5,000, something's wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Data quality&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_output_metadata&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;null_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duplicate_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If null counts spike, something's wrong.&lt;/p&gt;
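&lt;p&gt;All three checks boil down to comparing a run's metrics against a historical baseline. Here's a simplified, hypothetical sketch of that comparison (the 3x threshold is illustrative, not something from a specific alerting tool):&lt;/p&gt;

```python
def check_metrics(current, baseline, max_ratio=3.0):
    """Flag any metric that deviates sharply from its historical baseline.

    `current` and `baseline` map metric names (duration_seconds,
    records_processed, null_count, ...) to numbers. A metric more than
    `max_ratio` times its baseline, or less than baseline/max_ratio, is flagged.
    """
    alerts = []
    for name, value in current.items():
        base = baseline.get(name)
        if not base:
            continue  # no history (or a zero baseline): nothing to compare against
        if value > base * max_ratio or value < base / max_ratio:
            alerts.append(f"{name}: {value} vs baseline {base}")
    return alerts

baseline = {"duration_seconds": 6, "records_processed": 50, "null_count": 0}
current = {"duration_seconds": 60, "records_processed": 48, "null_count": 0}
print(check_metrics(current, baseline))  # flags only duration_seconds
```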

&lt;p&gt;These metrics go to a dashboard in production. I check it every morning. If something looks off, I investigate.&lt;/p&gt;

&lt;p&gt;I've implemented the same monitoring patterns in my example pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3 AM Incident: Resolved
&lt;/h2&gt;

&lt;p&gt;Remember that 3 AM failure from the beginning? Here's how proper orchestration would have prevented it (and now does in my example pipeline):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intermediate job failed silently&lt;/li&gt;
&lt;li&gt;Marts job ran anyway&lt;/li&gt;
&lt;li&gt;Used stale data&lt;/li&gt;
&lt;li&gt;Nobody noticed for hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After (with proper orchestration)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intermediate job fails&lt;/li&gt;
&lt;li&gt;Dagster stops the pipeline&lt;/li&gt;
&lt;li&gt;Marts job doesn't run&lt;/li&gt;
&lt;li&gt;Slack alert: "Pipeline stopped at intermediate layer"&lt;/li&gt;
&lt;li&gt;Error is visible immediately&lt;/li&gt;
&lt;li&gt;Fix takes 5 minutes instead of 3 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix was simple: the job ran out of memory, so I increased the allocation. But I only caught it quickly because of proper orchestration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Checklist
&lt;/h2&gt;

&lt;p&gt;If you want reliable pipelines, you need:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Dependency management&lt;/strong&gt;: Jobs run in the right order&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Failure isolation&lt;/strong&gt;: One failure doesn't cascade&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Retry logic&lt;/strong&gt;: Transient failures resolve automatically&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Data quality tests&lt;/strong&gt;: Catch issues before production&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Quarantine pattern&lt;/strong&gt;: Bad data doesn't break the pipeline&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Monitoring&lt;/strong&gt;: Know what's happening in real-time&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Alerting&lt;/strong&gt;: Get notified when things go wrong&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Observability&lt;/strong&gt;: Debug issues quickly  &lt;/p&gt;

&lt;p&gt;Without these, you're flying blind.&lt;/p&gt;
&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Orchestration is not optional&lt;/strong&gt;: Cron jobs work for simple tasks. For data pipelines, use a proper orchestrator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Test everything&lt;/strong&gt;: Schema, relationships, business logic, freshness. If you don't test it, it will break.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Fail fast&lt;/strong&gt;: Better to catch issues early than to produce wrong results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Make debugging easy&lt;/strong&gt;: When things break (and they will), you need to diagnose quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Automate recovery&lt;/strong&gt;: Retry transient failures. Quarantine bad data. Don't wake humans for things machines can handle.&lt;/p&gt;
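&lt;p&gt;The retry half of point 5 can be sketched as a plain-Python wrapper with exponential backoff (a generic illustration; in Dagster you'd typically reach for its built-in retry policies instead):&lt;/p&gt;

```python
import time

def with_retries(fn, max_retries=3, base_delay=1.0):
    """Retry transient failures with exponential backoff before paging a human."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: now it's worth an alert
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# A flaky task that succeeds on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # prints "ok" after two retries
```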
&lt;h2&gt;
  
  
  The Final Architecture
&lt;/h2&gt;

&lt;p&gt;Here's what a complete, production-ready pipeline looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CSV Files
    ↓
[Ingestion Assets]
    ↓ (Dagster orchestration)
[DBT Transformations]
    ├─ Source Layer (raw data)
    ├─ Staging Layer (cleaned data)
    ├─ Snapshots (SCD2 history)
    ├─ Intermediate Layer (joins)
    └─ Marts Layer (analytics)
    ↓
[Data Quality Tests] (40+ tests)
    ↓
[Output Assets]
    ├─ CSV exports
    ├─ Parquet files
    └─ Databricks tables
    ↓
[Quality Report]
    └─ Slack notification
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every step is orchestrated. Every layer is tested. Every failure is caught. Every metric is tracked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;Here's what changed after we implemented proper orchestration and data quality:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline failures: 2-3 per week&lt;/li&gt;
&lt;li&gt;Mean time to detection: 4 hours&lt;/li&gt;
&lt;li&gt;Mean time to resolution: 2 hours&lt;/li&gt;
&lt;li&gt;Data quality issues in production: Weekly&lt;/li&gt;
&lt;li&gt;On-call stress level: High&lt;/li&gt;
&lt;li&gt;3 AM wake-ups: Too many&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline failures: 1-2 per month&lt;/li&gt;
&lt;li&gt;Mean time to detection: 2 minutes&lt;/li&gt;
&lt;li&gt;Mean time to resolution: 15 minutes&lt;/li&gt;
&lt;li&gt;Data quality issues in production: None in 6 months&lt;/li&gt;
&lt;li&gt;On-call stress level: Low&lt;/li&gt;
&lt;li&gt;3 AM wake-ups: Zero&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pipeline isn't perfect. Things still break. But when they do, I know immediately, and I can fix them quickly. Usually before anyone else even notices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Building a data pipeline is easy. Building a reliable data pipeline is hard.&lt;/p&gt;

&lt;p&gt;The transformations are the easy part. The orchestration, testing, monitoring, and error handling—that's where the real work is.&lt;/p&gt;

&lt;p&gt;But it's worth it. Because a pipeline that runs reliably at 3 AM, catches issues before production, and recovers from failures automatically? That's the difference between a script and a production system.&lt;/p&gt;

&lt;p&gt;And that's what lets me sleep through the night instead of waking up to Slack alerts.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 3 of a 3-part series on modern data pipeline architecture. The examples come from an open-source banking pipeline I built based on my production experience.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://dev.to/karthikeyan_rajasekaran_c/modern-data-pipelines-why-five-layers-changed-everything-part-1-of-3-1e92"&gt;Part 1: Modern Data Pipelines - Why Five Layers Changed Everything&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://dev.to/karthikeyan_rajasekaran_c/the-day-our-pipeline-went-from-10-minutes-to-6-seconds-part-2-of-3-3jeh"&gt;Part 2: The Day Our Pipeline Went From 10 Minutes to 6 Seconds&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to see the full code?&lt;/strong&gt; Check out the &lt;a href="https://github.com/ai-tech-karthik/banking-data-pipeline" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; with complete source code, documentation, and production metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack&lt;/strong&gt;: Dagster • DBT • DuckDB • Databricks • Python • Docker&lt;/p&gt;




&lt;h2&gt;
  
  
  What's your worst pipeline incident?
&lt;/h2&gt;

&lt;p&gt;How did you fix it? What lessons did you learn? Drop a comment below—I'd love to hear your war stories! 👇&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>dagster</category>
      <category>dataquality</category>
      <category>testing</category>
    </item>
    <item>
      <title>The Day Our Pipeline Went From 10 Minutes to 6 Seconds (Part 2 of 3)</title>
      <dc:creator>Karthikeyan Rajasekaran</dc:creator>
      <pubDate>Fri, 26 Dec 2025 22:05:11 +0000</pubDate>
      <link>https://dev.to/karthikeyan_rajasekaran_c/the-day-our-pipeline-went-from-10-minutes-to-6-seconds-part-2-of-3-3jeh</link>
      <guid>https://dev.to/karthikeyan_rajasekaran_c/the-day-our-pipeline-went-from-10-minutes-to-6-seconds-part-2-of-3-3jeh</guid>
      <description>&lt;p&gt;Remember that feeling when you discover a shortcut that saves you hours every week? That's what incremental processing did for production pipelines I've worked on.&lt;/p&gt;

&lt;p&gt;Let me tell you about the moment I realized we had a problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Wake-Up Call
&lt;/h2&gt;

&lt;p&gt;It was a Tuesday morning in production. I kicked off the daily pipeline run and went to grab coffee. When I came back 10 minutes later, it was still running. Processing 50,000 account records shouldn't take this long, I thought.&lt;/p&gt;

&lt;p&gt;I checked the logs. The pipeline was reprocessing every single record from scratch. All 50,000 of them. Even though only 47 accounts had actually changed since yesterday.&lt;/p&gt;

&lt;p&gt;We were doing the equivalent of repainting your entire house every time you scuff one wall.&lt;/p&gt;

&lt;p&gt;This experience inspired me to build an example pipeline demonstrating the right way to handle incremental processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Naive Approach (What I Was Seeing)
&lt;/h2&gt;

&lt;p&gt;Here's what the original production pipeline looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Every day, process EVERYTHING&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;account_summary&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;calculate_interest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;has_loan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;interest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;interest&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;new_balance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Day 1: Process 50,000 accounts → 10 minutes&lt;br&gt;
Day 2: Process 50,000 accounts (47 changed) → 10 minutes&lt;br&gt;
Day 3: Process 50,000 accounts (23 changed) → 10 minutes&lt;/p&gt;

&lt;p&gt;You see the problem. We're wasting 99.9% of our compute on unchanged data.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Incremental Mindset
&lt;/h2&gt;

&lt;p&gt;The solution seems obvious in hindsight: only process what changed. But how do you know what changed?&lt;/p&gt;

&lt;p&gt;This is where that &lt;code&gt;loaded_at&lt;/code&gt; timestamp from &lt;a href="https://dev.to/karthikeyan_rajasekaran_c/modern-data-pipelines-why-five-layers-changed-everything-part-1-of-3-1e92"&gt;Part 1&lt;/a&gt; becomes crucial. Remember when we added it to the source layer? This is why.&lt;/p&gt;

&lt;p&gt;Here's the mental model: Every record has a timestamp showing when it was last modified. Your pipeline remembers when it last ran. On the next run, you only process records modified after that timestamp.&lt;/p&gt;

&lt;p&gt;Think of it like checking your email. You don't re-read every email you've ever received. You just check for new ones since you last looked.&lt;/p&gt;
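&lt;p&gt;That mental model fits in a few lines of Python. This hypothetical sketch is the high-watermark filter in miniature; the dbt version below does the same thing in SQL:&lt;/p&gt;

```python
from datetime import datetime

def changed_since(records, high_watermark):
    """Return only the rows modified after the last successful run."""
    return [r for r in records if r["loaded_at"] > high_watermark]

records = [
    {"account_id": "A001", "loaded_at": datetime(2026, 1, 1)},
    {"account_id": "A002", "loaded_at": datetime(2026, 1, 3)},
]
# The pipeline remembers when it last ran; only A002 is newer than that.
last_run = datetime(2026, 1, 2)
todo = changed_since(records, last_run)
print([r["account_id"] for r in todo])  # prints "['A002']"
```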
&lt;h2&gt;
  
  
  The Implementation
&lt;/h2&gt;

&lt;p&gt;Let's look at how this actually works in code. I'll show you the pattern I use in my example pipeline's marts layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'incremental'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;unique_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'account_id'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;original_balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;calculate_interest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;has_loan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;interest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;interest&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;new_balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;calculated_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'intermediate_accounts'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_incremental&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;-- Only process records that changed since last run&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;valid_from_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calculated_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;this&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let me break down what's happening here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First run&lt;/strong&gt; (table doesn't exist yet):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;is_incremental()&lt;/code&gt; returns false&lt;/li&gt;
&lt;li&gt;Process all 50,000 records&lt;/li&gt;
&lt;li&gt;Takes 10 minutes&lt;/li&gt;
&lt;li&gt;Each record gets a &lt;code&gt;calculated_at&lt;/code&gt; timestamp&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Second run&lt;/strong&gt; (table exists):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;is_incremental()&lt;/code&gt; returns true&lt;/li&gt;
&lt;li&gt;Find the latest &lt;code&gt;calculated_at&lt;/code&gt; timestamp (let's say it's yesterday at 2 AM)&lt;/li&gt;
&lt;li&gt;Only process records where &lt;code&gt;valid_from_at&lt;/code&gt; &amp;gt; yesterday at 2 AM&lt;/li&gt;
&lt;li&gt;That's just 47 records&lt;/li&gt;
&lt;li&gt;Takes 6 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The magic&lt;/strong&gt;: The &lt;code&gt;unique_key='account_id'&lt;/code&gt; tells dbt to merge results rather than append them. If account A001 appears in the new data, it updates the existing row. If account A999 is new, it inserts a new row.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Merge Strategy
&lt;/h2&gt;

&lt;p&gt;Here's what actually happens in the database during an incremental run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Simplified version of what the database does&lt;/span&gt;
&lt;span class="n"&gt;MERGE&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;account_summary&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;-- Your incremental query results&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;new_records&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; 
    &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; 
        &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;interest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;calculated_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;calculated_at&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
    &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Changed records get updated. New records get inserted. Unchanged records? Untouched. Exactly what we want.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Edge Cases (Where Things Get Tricky)
&lt;/h2&gt;

&lt;p&gt;Of course, it's never quite that simple. Here are the gotchas we ran into:&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: Late-Arriving Data
&lt;/h3&gt;

&lt;p&gt;Sometimes data shows up late in production systems. An account update from Monday arrives on Wednesday. Your incremental logic already processed Tuesday's data, so it misses the Monday update.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Add a lookback window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_incremental&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;valid_from_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calculated_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'3 days'&lt;/span&gt;  &lt;span class="c1"&gt;-- Look back 3 days&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;this&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you reprocess the last 3 days of data on every run. It's a bit redundant, but it catches late arrivals. I found 3 days was the sweet spot in production—long enough to catch stragglers, short enough to stay fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 2: The Empty Table Trap
&lt;/h3&gt;

&lt;p&gt;What happens when the table exists but is empty (say, after a failed run)? On the true first run you're safe, because &lt;code&gt;is_incremental()&lt;/code&gt; returns false. But against an empty table, &lt;code&gt;MAX(calculated_at)&lt;/code&gt; returns NULL, the comparison evaluates to NULL, and your WHERE clause silently filters out every record.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Use COALESCE with a fallback date.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_incremental&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;valid_from_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calculated_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;this&lt;/span&gt; &lt;span class="p"&gt;}}),&lt;/span&gt;
    &lt;span class="s1"&gt;'1900-01-01'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;  &lt;span class="c1"&gt;-- Fallback: process everything&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the table is empty, fall back to a date in the distant past, which effectively processes all records. Simple and bulletproof.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 3: Schema Changes
&lt;/h3&gt;

&lt;p&gt;You add a new column to your calculation. Now what? The incremental logic only touches new and changed records, so existing rows end up with NULLs in the new column.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Full refresh when needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Normal incremental run&lt;/span&gt;
dbt run &lt;span class="nt"&gt;--select&lt;/span&gt; account_summary

&lt;span class="c"&gt;# Force full refresh (reprocess everything)&lt;/span&gt;
dbt run &lt;span class="nt"&gt;--select&lt;/span&gt; account_summary &lt;span class="nt"&gt;--full-refresh&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We run full refreshes in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After schema changes&lt;/li&gt;
&lt;li&gt;After logic changes that affect all records&lt;/li&gt;
&lt;li&gt;Once a month as a sanity check&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rest of the time? Incremental all the way. I've implemented this same pattern in my example pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Performance Numbers
&lt;/h2&gt;

&lt;p&gt;Let me show you the actual impact I've seen in production systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before incremental processing&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Daily run: 10 minutes 23 seconds
Weekly compute cost: $47
Records processed per day: 50,000
Records actually changed: ~50 (0.1%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After incremental processing&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Daily run: 6 seconds
Weekly compute cost: $0.80
Records processed per day: ~50
Records actually changed: ~50 (100%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a &lt;strong&gt;100x speedup&lt;/strong&gt; and a &lt;strong&gt;98% cost reduction&lt;/strong&gt;. Same results, fraction of the time and money.&lt;/p&gt;
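&lt;p&gt;Those headline figures follow directly from the raw numbers above; a quick sanity check:&lt;/p&gt;

```python
# Speedup and savings derived from the before/after numbers above.
before_seconds = 10 * 60 + 23          # 10 minutes 23 seconds
after_seconds = 6
speedup = before_seconds / after_seconds
cost_reduction = 1 - 0.80 / 47         # weekly cost: $47 -> $0.80
print(f"{speedup:.0f}x faster, {cost_reduction:.0%} cheaper")
```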

&lt;p&gt;I've replicated this pattern in my example banking pipeline to demonstrate how it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  When NOT to Use Incremental Processing
&lt;/h2&gt;

&lt;p&gt;Incremental isn't always the answer. Here's when we stick with full refreshes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small datasets&lt;/strong&gt;: If you're processing 1,000 records and it takes 5 seconds, don't bother with incremental. The complexity isn't worth it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frequent schema changes&lt;/strong&gt;: If your logic changes weekly, you'll be running full refreshes anyway. Incremental adds complexity without benefit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complex dependencies&lt;/strong&gt;: If your calculation depends on the entire dataset (like percentiles or rankings), incremental gets tricky. Sometimes it's easier to just reprocess everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aggregations across all records&lt;/strong&gt;: If you're calculating "total balance across all accounts," you need all records, not just changed ones.&lt;/p&gt;
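&lt;p&gt;A tiny example of why that is: a row-level calculation only needs the changed rows, but a median (or any percentile) needs every row, so you have to merge first and then aggregate. The account IDs and 5% rate here are purely illustrative:&lt;/p&gt;

```python
import statistics

# Yesterday's full table and today's changed rows (one balance updated).
full_table = {"A001": 100, "A002": 200, "A003": 300, "A004": 400, "A005": 500}
changed = {"A002": 250}

# Row-level calculations work fine on just the changed rows...
interest = {acct: bal * 0.05 for acct, bal in changed.items()}

# ...but a percentile needs every row: merge first, then aggregate.
full_table.update(changed)
print(statistics.median(full_table.values()))  # prints "300"
```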

&lt;p&gt;We use incremental for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row-level calculations (interest rates, classifications)&lt;/li&gt;
&lt;li&gt;Joins that don't require full table scans&lt;/li&gt;
&lt;li&gt;Transformations where each record is independent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We use full refresh for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregations (sums, averages across all data)&lt;/li&gt;
&lt;li&gt;Rankings and percentiles&lt;/li&gt;
&lt;li&gt;Anything that needs the complete dataset&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've implemented both patterns in my example pipeline to show when to use each.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Debugging Challenge
&lt;/h2&gt;

&lt;p&gt;Here's something nobody tells you: incremental processing makes debugging harder.&lt;/p&gt;

&lt;p&gt;With full refresh, every run is identical. With incremental, each run processes different data. A bug might only appear when certain records are processed together.&lt;/p&gt;

&lt;p&gt;I learned to:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Keep detailed logs&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; changed records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date range: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;min_date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Last run timestamp: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;last_run&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Make full refresh easy&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# One command to reprocess everything&lt;/span&gt;
dbt run &lt;span class="nt"&gt;--select&lt;/span&gt; account_summary &lt;span class="nt"&gt;--full-refresh&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Test incremental logic&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Unit test: Does incremental produce same results as full refresh?
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_incremental_matches_full&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;full_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_full_refresh&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;incremental_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_incremental&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;full_results&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;incremental_results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
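&lt;p&gt;To make that parity test concrete, here's a self-contained sketch using an in-memory aggregation. The record shapes and helper names are illustrative, not the actual pipeline code:&lt;/p&gt;

```python
from collections import defaultdict

# Toy records: (account_id, amount, updated_at). In a real pipeline these
# come from the warehouse; here they are in-memory for the sketch.
RECORDS = [
    ("A", 100, "2024-01-01"),
    ("B", 50, "2024-01-01"),
    ("A", 25, "2024-01-02"),  # arrives after the first run
]

def summarize(records):
    """Full-refresh path: aggregate everything from scratch."""
    totals = defaultdict(int)
    for account_id, amount, _ in records:
        totals[account_id] += amount
    return dict(totals)

def summarize_incremental(previous, new_records):
    """Incremental path: merge only new records into the prior state."""
    totals = dict(previous)
    for account_id, amount, _ in new_records:
        totals[account_id] = totals.get(account_id, 0) + amount
    return totals

def test_incremental_matches_full():
    # Run 1 sees the first two records; run 2 merges the late arrival.
    first_run = summarize(RECORDS[:2])
    incremental = summarize_incremental(first_run, RECORDS[2:])
    assert incremental == summarize(RECORDS)

test_incremental_matches_full()
```

&lt;p&gt;The point of the test is the invariant, not the toy data: however you slice the input across runs, the incremental path must converge to the same state as a full refresh.&lt;/p&gt;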



&lt;h2&gt;
  
  
  The Compound Effect
&lt;/h2&gt;

&lt;p&gt;Here's what really sold me on incremental processing in production: it compounds.&lt;/p&gt;

&lt;p&gt;When your pipeline runs in 6 seconds instead of 10 minutes, you can run it more often. I've seen teams go from daily runs to hourly runs. Suddenly they had near-real-time analytics.&lt;/p&gt;

&lt;p&gt;When your compute costs drop 98%, you can afford to add more transformations. I've seen teams add three new marts that would have been too expensive before.&lt;/p&gt;

&lt;p&gt;When your pipeline is fast, people trust it more. They know they can get fresh data quickly, so they actually use it.&lt;/p&gt;

&lt;p&gt;It's not just about speed. It's about what speed enables.&lt;/p&gt;

&lt;p&gt;I built my example pipeline to demonstrate these patterns with real code you can run and learn from.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Practical Checklist
&lt;/h2&gt;

&lt;p&gt;If you want to implement incremental processing in your pipeline, here's what you need:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Timestamp column&lt;/strong&gt;: Every record needs a "last modified" timestamp&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Unique key&lt;/strong&gt;: A column (or combination) that uniquely identifies each record&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Merge support&lt;/strong&gt;: Your database needs to support MERGE or UPSERT operations&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Lookback window&lt;/strong&gt;: Handle late-arriving data gracefully&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Full refresh option&lt;/strong&gt;: For when you need to reprocess everything&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Monitoring&lt;/strong&gt;: Track how many records are processed each run  &lt;/p&gt;

&lt;p&gt;If you have these pieces, you're ready to go incremental.&lt;/p&gt;
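&lt;p&gt;Here's a minimal pure-Python sketch of how those checklist pieces fit together: a unique key, a last-modified timestamp, a lookback window, and an upsert. The data shapes are hypothetical; in the real pipeline dbt's incremental materialization typically handles the MERGE for you:&lt;/p&gt;

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 3  # reprocess a few extra days to catch late-arriving data

def incremental_merge(target, source_rows, last_run_date):
    """Upsert rows changed since (last_run_date - lookback) into target.

    target      : dict keyed by the unique key (account_id)
    source_rows : list of dicts with 'account_id' and 'updated_at'
    """
    cutoff = last_run_date - timedelta(days=LOOKBACK_DAYS)
    changed = [r for r in source_rows if r["updated_at"] >= cutoff]
    for row in changed:
        target[row["account_id"]] = row  # MERGE: insert or overwrite by key
    return target, len(changed)

target = {"acct-1": {"account_id": "acct-1", "balance": 100,
                     "updated_at": date(2024, 1, 1)}}
source = [
    {"account_id": "acct-1", "balance": 150, "updated_at": date(2024, 1, 10)},
    {"account_id": "acct-2", "balance": 75,  "updated_at": date(2024, 1, 9)},
    {"account_id": "acct-3", "balance": 20,  "updated_at": date(2023, 12, 1)},
]
target, n = incremental_merge(target, source, last_run_date=date(2024, 1, 11))
print(f"processed {n} changed records")  # the stale acct-3 row is skipped
```

&lt;p&gt;That &lt;code&gt;len(changed)&lt;/code&gt; count is also the monitoring item from the checklist: log it every run, and a sudden zero (or a sudden spike) tells you something upstream changed.&lt;/p&gt;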

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;In Part 3, we'll talk about orchestration and data quality. Because a fast pipeline that produces wrong results is worse than a slow pipeline that produces right results.&lt;/p&gt;

&lt;p&gt;We'll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to orchestrate these layers so they run in the right order&lt;/li&gt;
&lt;li&gt;Automated testing to catch issues before production&lt;/li&gt;
&lt;li&gt;Monitoring and alerting when things go wrong&lt;/li&gt;
&lt;li&gt;The retry strategies that saved us during that Databricks outage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But for now, take a look at your pipelines. Are you reprocessing everything every time? Could you process just what changed? The performance gains might surprise you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 2 of a 3-part series on modern data pipeline architecture. The examples come from an open-source banking pipeline I built based on my production experience.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://dev.to/karthikeyan_rajasekaran_c/modern-data-pipelines-why-five-layers-changed-everything-part-1-of-3-1e92"&gt;Part 1: Modern Data Pipelines - Why Five Layers Changed Everything&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://dev.to/karthikeyan_rajasekaran_c/building-bulletproof-data-pipelines-orchestration-testing-and-monitoring-part-3-of-3-20n8"&gt;Part 3: Building Bulletproof Data Pipelines - Orchestration, Testing, and Monitoring&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to see the full code?&lt;/strong&gt; Check out the &lt;a href="https://github.com/ai-tech-karthik/banking-data-pipeline" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; with complete source code, documentation, and production metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack&lt;/strong&gt;: Dagster • DBT • DuckDB • Databricks • Python • Docker&lt;/p&gt;




&lt;h2&gt;
  
  
  Have you implemented incremental processing?
&lt;/h2&gt;

&lt;p&gt;What challenges did you face? What patterns worked for you? Drop a comment below—I'd love to hear your experiences! 👇&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>dbt</category>
      <category>dagster</category>
      <category>data</category>
    </item>
    <item>
      <title>Modern Data Pipelines: Why Five Layers Changed Everything (Part 1 of 3)</title>
      <dc:creator>Karthikeyan Rajasekaran</dc:creator>
      <pubDate>Wed, 24 Dec 2025 03:12:53 +0000</pubDate>
      <link>https://dev.to/karthikeyan_rajasekaran_c/modern-data-pipelines-why-five-layers-changed-everything-part-1-of-3-1e92</link>
      <guid>https://dev.to/karthikeyan_rajasekaran_c/modern-data-pipelines-why-five-layers-changed-everything-part-1-of-3-1e92</guid>
      <description>&lt;p&gt;I'll be honest—when I first heard about "layered data architectures," I rolled my eyes. Another buzzword, I thought. Just write some SQL, move the data, and call it a day. &lt;/p&gt;

&lt;p&gt;Then I spent three weeks debugging a production pipeline where raw data, cleaned data, and analytics were all intermingled in one giant, spaghetti-like mess. That's when it clicked.&lt;/p&gt;

&lt;p&gt;To demonstrate these lessons, I built an example banking data pipeline that captures the patterns and architecture I learned from real production systems. The code is open source, and the principles apply to any domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Here's what actually happens in most data projects:&lt;/p&gt;

&lt;p&gt;You start simple. Maybe you're pulling data from an API or reading CSV files. You write a script that cleans the data and calculates some metrics. It works! You ship it. Everyone's happy.&lt;/p&gt;

&lt;p&gt;Six months later, someone asks: "Can we see what this metric looked like last quarter?" &lt;/p&gt;

&lt;p&gt;You check the database. The old data is gone—overwritten by yesterday's run.&lt;/p&gt;

&lt;p&gt;"Can we add a new calculation without breaking the existing reports?"&lt;/p&gt;

&lt;p&gt;You look at the code. Everything is tangled together. Changing one thing breaks three others.&lt;/p&gt;

&lt;p&gt;"Why did this number change between Tuesday and Wednesday?"&lt;/p&gt;

&lt;p&gt;You have no idea. There's no audit trail.&lt;/p&gt;

&lt;p&gt;Sound familiar? This is why we need layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five-Layer Philosophy
&lt;/h2&gt;

&lt;p&gt;Think about how a restaurant kitchen works. You don't see the head chef doing everything. There's a system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Receiving dock&lt;/strong&gt;: Ingredients arrive exactly as delivered (even if the tomatoes are bruised)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prep station&lt;/strong&gt;: Wash, peel, chop—make ingredients ready to use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold storage&lt;/strong&gt;: Keep prepared ingredients fresh and organized&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cooking line&lt;/strong&gt;: Combine ingredients following recipes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plating station&lt;/strong&gt;: Final presentation for customers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each station has one job. If something goes wrong, you know exactly where to look. A data pipeline works the same way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Source (The Receiving Dock)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does&lt;/strong&gt;: Store data exactly as received. No cleaning, no transformations, no "fixing" things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters&lt;/strong&gt;: This is your insurance policy. When something goes wrong downstream (and it will), you can always come back to the original data.&lt;/p&gt;

&lt;p&gt;I learned this the hard way in production. We once had a pipeline that "cleaned" data on ingestion—converting empty strings to nulls, trimming whitespace, fixing typos. Seemed smart at the time. Then a business user asked why certain records were missing. We had no way to prove whether the data arrived that way or if our cleaning broke something.&lt;/p&gt;

&lt;p&gt;Now? I always save everything exactly as received. Here's the pattern from my example pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Source layer: Just add a timestamp&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- Everything, unchanged&lt;/span&gt;
    &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;loaded_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;raw_input&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;loaded_at&lt;/code&gt; timestamp becomes crucial later. It tells us when data arrived, which helps track down issues and enables change detection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Staging (The Prep Station)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does&lt;/strong&gt;: Clean and standardize data without changing its meaning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters&lt;/strong&gt;: Real-world data is messy. You'll see "Yes", "YES", "true", "1", "Y" all meaning the same thing. Staging normalizes this chaos.&lt;/p&gt;

&lt;p&gt;Here's an example from my demo pipeline that mirrors what I've seen in production. Customer data arrives with loan status in various formats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Before staging (the mess)&lt;/span&gt;
&lt;span class="n"&gt;CustomerID&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;HasLoan&lt;/span&gt;
&lt;span class="c1"&gt;-----------|--------&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Yes&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;YES&lt;/span&gt;  
&lt;span class="mi"&gt;3&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;no&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;

&lt;span class="c1"&gt;-- After staging (clean and consistent)&lt;/span&gt;
&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;has_loan_flag&lt;/span&gt;
&lt;span class="c1"&gt;------------|---------------&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;           &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;           &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;           &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;           &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;           &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;           &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The staging layer handles this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;TRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt; 
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;has_loan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'yes'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'y'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;has_loan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'no'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'false'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'n'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
        &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;has_loan_flag&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;source_customer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice we're not calculating anything or joining tables. We're just cleaning. One job, done well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Snapshots (The Time Machine)
&lt;/h3&gt;

&lt;p&gt;This is where things get interesting. Most pipelines overwrite data every day. Yesterday's data? Gone. Last month's data? Gone. You're flying blind.&lt;/p&gt;

&lt;p&gt;Snapshots solve this by keeping every version of every record. It's called Slowly Changing Dimension Type 2 (SCD2), but I prefer to think of it as version control for data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real scenario from production&lt;/strong&gt;: A customer's loan status changes on February 15th. Without snapshots, you only know their current status. With snapshots, you know their status on any date in history.&lt;/p&gt;

&lt;p&gt;Here's what the snapshot table looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;customer_id | has_loan | valid_from  | valid_to    | Status
------------|----------|-------------|-------------|--------
123         | false    | 2024-01-01  | 2024-02-15  | Old
123         | true     | 2024-02-15  | NULL        | Current
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The magic is in those &lt;code&gt;valid_from&lt;/code&gt; and &lt;code&gt;valid_to&lt;/code&gt; timestamps. Want to know the status on January 20th? Query where that date falls between &lt;code&gt;valid_from&lt;/code&gt; and &lt;code&gt;valid_to&lt;/code&gt;. Want current status? Query where &lt;code&gt;valid_to&lt;/code&gt; is NULL.&lt;/p&gt;
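&lt;p&gt;Here's that same as-of lookup sketched in plain Python, using the two snapshot rows from the table above. It's illustrative only; in the warehouse this is a single WHERE clause:&lt;/p&gt;

```python
from datetime import date

# The snapshot rows from the table above: one row per version of the record.
snapshots = [
    {"customer_id": 123, "has_loan": False,
     "valid_from": date(2024, 1, 1), "valid_to": date(2024, 2, 15)},
    {"customer_id": 123, "has_loan": True,
     "valid_from": date(2024, 2, 15), "valid_to": None},  # current version
]

def as_of(rows, customer_id, when):
    """Return the version of the record that was valid on `when`."""
    for row in rows:
        still_current = row["valid_to"] is None
        if (row["customer_id"] == customer_id
                and row["valid_from"] <= when
                and (still_current or when < row["valid_to"])):
            return row
    return None  # no version existed yet on that date

print(as_of(snapshots, 123, date(2024, 1, 20))["has_loan"])  # False
print(as_of(snapshots, 123, date(2024, 3, 1))["has_loan"])   # True
```

&lt;p&gt;Note the half-open interval: a row is valid from &lt;code&gt;valid_from&lt;/code&gt; up to but not including &lt;code&gt;valid_to&lt;/code&gt;, so exactly one version matches any given date.&lt;/p&gt;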

&lt;p&gt;This saved us during a production audit. Regulators asked about account balances from six months ago. Without snapshots, we would have been scrambling. With snapshots? One SQL query, done in 30 seconds.&lt;/p&gt;

&lt;p&gt;I've implemented this same pattern in my example pipeline to show how it works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4: Intermediate (The Cooking Line)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does&lt;/strong&gt;: Join data from different sources and apply business rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters&lt;/strong&gt;: This is where you start building the actual insights. But you're not calculating final metrics yet—you're preparing the ingredients.&lt;/p&gt;

&lt;p&gt;In my example pipeline, I join customer data with account data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;accounts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;accounts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;has_loan_flag&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;account_snapshots&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customer_snapshots&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; 
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_to&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;  &lt;span class="c1"&gt;-- Current records only&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_to&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why not do this in the marts layer? Because other teams might need this joined data for different calculations. Build it once, use it everywhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 5: Marts (The Plating Station)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does&lt;/strong&gt;: Final calculations and aggregations. This is what business users actually see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters&lt;/strong&gt;: This is your product. Everything before this was preparation.&lt;/p&gt;

&lt;p&gt;Here's where I calculate interest rates based on business rules in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;original_balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;-- Business logic: Interest rate based on balance tiers&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;20000&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;015&lt;/span&gt;
        &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;base_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;-- Bonus rate for customers with loans&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;has_loan_flag&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;005&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bonus_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;-- Final calculation&lt;/span&gt;
    &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_rate&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bonus_rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;annual_interest&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;intermediate_accounts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The business logic is crystal clear. No digging through nested queries or trying to figure out where a number came from. It's right there.&lt;/p&gt;
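&lt;p&gt;A nice side effect of keeping the logic this explicit: it's easy to mirror in a plain function for unit testing. Here's an illustrative Python version of the same tiers (a test helper, not part of the pipeline itself):&lt;/p&gt;

```python
def annual_interest(balance, has_loan_flag):
    """Mirror of the marts-layer tiering, handy for unit-testing the rules."""
    if balance < 10_000:
        base_rate = 0.01
    elif balance < 20_000:
        base_rate = 0.015
    else:
        base_rate = 0.02
    bonus_rate = 0.005 if has_loan_flag else 0.0
    return balance * (base_rate + bonus_rate)

print(annual_interest(5_000, False))  # 50.0  (1% tier, no bonus)
print(annual_interest(15_000, True))  # 300.0 (1.5% tier plus 0.5% bonus)
```

&lt;p&gt;Checking the SQL and the Python helper against the same fixture rows is a cheap way to catch a tier boundary that drifts in one place but not the other.&lt;/p&gt;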

&lt;h2&gt;
  
  
  Why This Actually Works
&lt;/h2&gt;

&lt;p&gt;I've seen teams try to skip layers. "We don't need staging, we'll just clean in the source layer." Or "Why separate intermediate and marts? Let's just do it all in one query."&lt;/p&gt;

&lt;p&gt;Here's what happens:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without layers&lt;/strong&gt;: A bug in the cleaning logic corrupts your analytics. You can't tell if the issue is in the data, the cleaning, the joins, or the calculations. You're debugging everything at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With layers&lt;/strong&gt;: A bug in the cleaning logic? Check staging. Bad join? Check intermediate. Wrong calculation? Check marts. You know exactly where to look.&lt;/p&gt;

&lt;p&gt;It's like having a stack trace for your data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Performance Question
&lt;/h2&gt;

&lt;p&gt;"But doesn't this mean more tables and slower queries?"&lt;/p&gt;

&lt;p&gt;Actually, no. Here's why:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Staging is views, not tables&lt;/strong&gt;: No storage overhead. They're computed on the fly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Snapshots enable incremental processing&lt;/strong&gt;: Instead of reprocessing everything daily, you only process what changed. In production, I've seen this reduce runs from 10 minutes to 6 seconds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Intermediate tables are reusable&lt;/strong&gt;: Build the join once, use it in multiple marts. Faster than joining raw data every time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Marts are optimized for queries&lt;/strong&gt;: They're pre-aggregated and indexed exactly how business users need them.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The performance actually improves because each layer is optimized for its specific job.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Wish I Knew Earlier
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with layers from day one&lt;/strong&gt;: Don't wait until the pipeline is a mess. It's easier to build it right than to refactor later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layers aren't bureaucracy&lt;/strong&gt;: They're clarity. Each layer answers one question: What is this data? (source), Is it clean? (staging), What changed? (snapshots), How does it relate? (intermediate), What does it mean? (marts).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The time machine is worth it&lt;/strong&gt;: Snapshots take more storage, yes. But the first time someone asks "what was this value last month?" you'll be glad you have them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One job per layer&lt;/strong&gt;: The moment you start mixing concerns (cleaning in source, calculating in intermediate), you're back to spaghetti.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coming Up Next
&lt;/h2&gt;

&lt;p&gt;In Part 2, we'll dive into incremental processing—how to process only what changed instead of reprocessing everything. This is where the real performance gains happen.&lt;/p&gt;

&lt;p&gt;In Part 3, we'll cover orchestration and data quality—how to make sure this whole system runs reliably and catches issues before they reach production.&lt;/p&gt;

&lt;p&gt;But for now, think about your current pipelines. Are they layered? Can you trace a number from the final report back through each transformation to the raw data? If not, it might be time to add some layers.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 1 of a 3-part series on modern data pipeline architecture. The examples come from an open-source banking pipeline I built based on my production experience. The patterns apply to any domain—e-commerce, healthcare, logistics, you name it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to see the full code?&lt;/strong&gt; Check out the &lt;a href="https://github.com/ai-tech-karthik/banking-data-pipeline" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; with complete source code, documentation, and production metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack&lt;/strong&gt;: Dagster • DBT • DuckDB • Databricks • Python • Docker&lt;/p&gt;




&lt;h2&gt;
  
  
  What's your experience with data pipeline architecture?
&lt;/h2&gt;

&lt;p&gt;Have you built layered pipelines? What challenges did you face? Drop a comment below—I'd love to hear your stories! 👇&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>dbt</category>
      <category>dagster</category>
      <category>python</category>
    </item>
  </channel>
</rss>
