<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Victor Kipruto</title>
    <description>The latest articles on DEV Community by Victor Kipruto (@victor_kipruto_0f44ad2a3b).</description>
    <link>https://dev.to/victor_kipruto_0f44ad2a3b</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3970711%2F63a5cd58-3eb3-4e19-b400-eaf4b6da7d69.png</url>
      <title>DEV Community: Victor Kipruto</title>
      <link>https://dev.to/victor_kipruto_0f44ad2a3b</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/victor_kipruto_0f44ad2a3b"/>
    <language>en</language>
    <item>
      <title>Building Scalable Systems with Apache Airflow: A Data Engineer's Guide</title>
      <dc:creator>Victor Kipruto</dc:creator>
      <pubDate>Tue, 09 Jun 2026 00:25:01 +0000</pubDate>
      <link>https://dev.to/victor_kipruto_0f44ad2a3b/building-scalable-systems-with-apache-airflow-a-data-engineers-guide-5f27</link>
      <guid>https://dev.to/victor_kipruto_0f44ad2a3b/building-scalable-systems-with-apache-airflow-a-data-engineers-guide-5f27</guid>
      <description>&lt;h1&gt;
  
  
  Building Scalable Systems with Apache Airflow: A Data Engineer's Guide
&lt;/h1&gt;

&lt;p&gt;It is 3:00 AM. Your Slack channels are lighting up with alerts. Your analytical database is reporting connection timeouts, critical customer dashboards are displaying stale data, and the Apache Airflow web server is completely unresponsive, throwing 504 Gateway Timeouts. If you have managed data platforms at scale, this nightmare scenario probably feels all too familiar.&lt;/p&gt;

&lt;p&gt;When teams first adopt workflow orchestration, setting up a single virtual machine with a local executor and a handful of cron-like DAGs is easy. However, as your data organization matures from 10 to 1,000 pipelines, you will inevitably hit the scalability wall. The symptoms are always the same: missed Service Level Agreements (SLAs), severe scheduler latency, metadata database lock contention, and random task failures caused by Out-of-Memory (OOM) errors. &lt;/p&gt;

&lt;p&gt;Welcome to &lt;strong&gt;Building Scalable Systems with Apache Airflow: A Data Engineer's Guide&lt;/strong&gt;. In this post, we will look at how to design, optimize, and maintain production-ready Airflow environments that can scale to thousands of daily tasks without breaking a sweat.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Challenge: Understanding the Scalability Wall
&lt;/h2&gt;

&lt;p&gt;Before we dive into configuration files and Python code, we must understand &lt;em&gt;why&lt;/em&gt; Airflow deployments fail under heavy load.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────────────┐
│                   AIRFLOW ARCHITECTURE                 │
└──────────────────────────┬─────────────────────────────┘
                           │
             ┌─────────────┴─────────────┐
             ▼                           ▼
      ┌─────────────┐             ┌─────────────┐
      │  Scheduler  │             │ Web Server  │
      └──────┬──────┘             └──────┬──────┘
             │                           │
             │     ┌───────────────┐     │
             └────►│  Metadata DB  │◄────┘
                   │ (PostgreSQL)  │
                   └───────┬───────┘
                           │
             ┌─────────────┴─────────────┐
             ▼                           ▼
      ┌─────────────┐             ┌─────────────┐
      │   Worker 1  │             │   Worker 2  │
      │  (Compute)  │             │  (Compute)  │
      └─────────────┘             └─────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most common misconception is that Airflow is a data processing engine. &lt;strong&gt;It is not.&lt;/strong&gt; Airflow is strictly a workflow orchestrator. Its core components—the Web Server, Scheduler, and Metadata Database—are designed to coordinate work, not execute it. &lt;/p&gt;

&lt;p&gt;When you run heavy data processing workloads (like Pandas transformations, API extractions, or model training) directly on Airflow workers, you consume local CPU and RAM. When multiple tasks run concurrently, workers quickly run out of memory, killing the processes and causing the Scheduler to lose track of task states. This guide focuses on teaching you how to build a decoupled, scalable, and resilient orchestration architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  Section 1: Architecting Your Airflow Cluster for Enterprise Scale
&lt;/h2&gt;

&lt;p&gt;Scaling your infrastructure begins with choosing the right executor. If you are still running the &lt;code&gt;SequentialExecutor&lt;/code&gt; or &lt;code&gt;LocalExecutor&lt;/code&gt; in production, migrations should be your top priority. To build the &lt;strong&gt;best Building Scalable Systems with Apache Airflow: A Data Engineer's Guide&lt;/strong&gt;, we must look at distributed executors.&lt;/p&gt;

&lt;h3&gt;
  
  
  CeleryExecutor vs. KubernetesExecutor
&lt;/h3&gt;

&lt;p&gt;For enterprise-scale orchestration, you generally have two choices:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;CeleryExecutor&lt;/th&gt;
&lt;th&gt;KubernetesExecutor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Task Isolation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (tasks share worker resources)&lt;/td&gt;
&lt;td&gt;High (each task runs in its own isolated pod)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Startup Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (&amp;lt; 1 second)&lt;/td&gt;
&lt;td&gt;Medium (5–15 seconds for pod creation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Static (workers run continuously)&lt;/td&gt;
&lt;td&gt;Dynamic (pods spin up and down on demand)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scaling Mechanism&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scaled via Celery Workers / &lt;a href="https://keda.sh/" rel="noopener noreferrer"&gt;KEDA&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Native Kubernetes Autoscaler&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  The Hybrid Solution: CeleryKubernetesExecutor
&lt;/h4&gt;

&lt;p&gt;For organizations handling high-velocity, mixed-profile tasks, the &lt;code&gt;CeleryKubernetesExecutor&lt;/code&gt; offers the best of both worlds. It directs small, fast tasks (such as simple API triggers or sensor checks) to a persistent pool of Celery workers to avoid container startup overhead. Simultaneously, it routes resource-heavy, complex tasks directly to Kubernetes pods for runtime isolation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro Tip 1: Decouple Your Compute Completely&lt;/strong&gt;&lt;br&gt;
To build highly scalable systems, follow the golden rule of orchestration: &lt;strong&gt;Airflow should only push buttons and monitor statuses.&lt;/strong&gt; Offload actual data processing to distributed, external compute engines like Snowflake, BigQuery, Apache Spark (on AWS EMR / Databricks), or serverless containers (AWS ECS/Fargate). Airflow should trigger the job, poll for its completion, and yield resources.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Section 2: Building Scalable Systems with Apache Airflow: A Data Engineer's Guide Tutorial
&lt;/h2&gt;

&lt;p&gt;Writing scalable pipelines is as much about software engineering best practices as it is about systems architecture. Let's look at &lt;strong&gt;how to use Building Scalable Systems with Apache Airflow: A Data Engineer's Guide&lt;/strong&gt; to write clean, performance-minded DAGs.&lt;/p&gt;

&lt;p&gt;One of the most common mistakes data engineers make is running top-level code inside their DAG files. The Airflow Scheduler regularly parses every file in your &lt;code&gt;DAGS_FOLDER&lt;/code&gt; (defined by &lt;code&gt;min_file_process_interval&lt;/code&gt;). If your DAG file contains active database connections, HTTP requests, or heavy package imports at the top level, those queries execute on &lt;em&gt;every parse cycle&lt;/em&gt;, which can overwhelm your database and slow down the Scheduler.&lt;/p&gt;

&lt;p&gt;Here is a step-by-step tutorial on how to construct a scalable DAG using the modern &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/taskflow.html" rel="noopener noreferrer"&gt;TaskFlow API&lt;/a&gt; while keeping Scheduler overhead minimal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
DAG Title: Scalable E-Commerce Analytics ETL Pipeline
Author: Victor Kipruto Rop
Description: Demonstrates scalable DAG patterns including the TaskFlow API,
             minimized top-level imports, and external compute delegation.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.decorators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;

&lt;span class="c1"&gt;# Keep top-level imports minimal to reduce scheduler parsing latency
# Avoid importing heavy libraries like pandas, numpy, or tensorflow at the top level
&lt;/span&gt;
&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scalable_ecommerce_analytics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retry_delay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;depends_on_past&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analytics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ecommerce_analytics_dag&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_order_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Extracts transactional metadata from source databases.
        Notice that heavy libraries or database connections are initialized
        INSIDE the task function scope, not at the top level.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;  &lt;span class="c1"&gt;# Localized import to prevent scheduler parsing overhead
&lt;/span&gt;
        &lt;span class="c1"&gt;# Simulated extraction metadata (Do not pass raw, heavy data via XCom)
&lt;/span&gt;        &lt;span class="n"&gt;extracted_metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execution_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3_raw_data_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-data-lake-raw/orders/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/data.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;record_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;450000&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;extracted_metadata&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_on_external_compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Simulates running heavy transformation on external compute (Spark/EMR).
        Airflow merely initiates the process and monitors its status.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
        &lt;span class="n"&gt;raw_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3_raw_data_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;transformed_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transformed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Instead of reading the parquet file into memory with pandas on the Airflow worker:
&lt;/span&gt;        &lt;span class="c1"&gt;# We would trigger an EMR Serverless Spark Job or run a Snowflake query here.
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Triggering Spark job to process records from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;raw_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Simulating API poll
&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3_processed_data_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;transformed_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SUCCESS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;quality_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processed_info&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Executes lightweight quality assertions using Great Expectations or standard SQL validation.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;processed_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processed_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3_processed_data_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Asserting schema conformity and record completeness on: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;processed_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Set explicit dependencies using TaskFlow API syntax
&lt;/span&gt;    &lt;span class="n"&gt;order_metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_order_metadata&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;transformed_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_on_external_compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;quality_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transformed_info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiate the DAG
&lt;/span&gt;&lt;span class="nf"&gt;ecommerce_analytics_dag&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Design Pattern Scales:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No Top-Level Database Connections&lt;/strong&gt;: Heavy database drivers (&lt;code&gt;psycopg2&lt;/code&gt;) are imported locally inside the task function, which protects your CPU during parsing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata-Only XComs&lt;/strong&gt;: Instead of passing raw transaction dataframes through Airflow’s XCom engine (which bloats the PostgreSQL metadata DB), we store raw data in cloud storage (&lt;code&gt;S3&lt;/code&gt;) and pass only the path references between tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External Compute Delegation&lt;/strong&gt;: Task execution processes are simulated as remote operations (like Spark triggers) rather than running locally on the worker.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Section 3: Advanced Optimization — Dynamic DAG Generation
&lt;/h2&gt;

&lt;p&gt;To scale your operations efficiently, you must avoid hand-crafting thousands of separate DAG files for similar ingestion pipelines. Let's look at how to dynamically generate DAGs using external configuration schemas.&lt;/p&gt;

&lt;p&gt;Instead of writing 100 files for 100 database tables, you can write a single generator script that reads a JSON configuration and populates the Airflow &lt;code&gt;globals()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.empty&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EmptyOperator&lt;/span&gt;

&lt;span class="c1"&gt;# Path to the config file defining our ingestion pipelines
&lt;/span&gt;&lt;span class="n"&gt;CONFIG_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingestion_config.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Factory function to dynamically construct a DAG configuration.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dynamic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingestion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EmptyOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Imagine a real ingestion operator here querying the source DB
&lt;/span&gt;        &lt;span class="n"&gt;ingest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EmptyOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingest_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EmptyOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ingest&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;

&lt;span class="c1"&gt;# Read the configurations dynamically
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONFIG_PATH&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONFIG_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipelines&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;d_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dynamic_ingest_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="c1"&gt;# Inject the generated DAG into the global namespace so Airflow can parse it
&lt;/span&gt;        &lt;span class="nf"&gt;globals&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="n"&gt;d_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;d_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schedule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this to work seamlessly, create an adjacent &lt;code&gt;ingestion_config.json&lt;/code&gt; file in your DAG directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pipelines"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"users"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"schedule"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@daily"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"orders"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"schedule"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0 2 * * *"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"products"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"schedule"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@weekly"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern simplifies administrative overhead. Adding a new pipeline is as simple as adding a new configuration block to your JSON file, without writing any additional Python code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Section 4: Tuning Airflow Configuration Variables for Enterprise Performance
&lt;/h2&gt;

&lt;p&gt;Tuning your Airflow cluster requires adjusting specific configuration variables within your &lt;code&gt;airflow.cfg&lt;/code&gt; file. These parameters act as control valves for your system's resource consumption. Below are some of the most critical tuning variables and the standard benchmarks used by senior architects:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Concurrency Management
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;parallelism&lt;/code&gt;&lt;/strong&gt;: This defines the maximum number of task instances that can run concurrently across your entire Airflow deployment.

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Scale Benchmark&lt;/em&gt;: Default is &lt;code&gt;32&lt;/code&gt;. For heavy workloads, scale this to &lt;code&gt;256&lt;/code&gt; or &lt;code&gt;512&lt;/code&gt; depending on your executor resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;max_active_runs_per_dag&lt;/code&gt;&lt;/strong&gt;: Prevents a single DAG from spawning too many concurrent active DAG runs, which is especially useful when catching up historical datasets.

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Scale Benchmark&lt;/em&gt;: Set this to &lt;code&gt;3&lt;/code&gt; or &lt;code&gt;4&lt;/code&gt; to prevent resource starvation on your cluster.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;max_active_tasks_per_dag&lt;/code&gt;&lt;/strong&gt;: Limits the number of tasks that can run concurrently within a single DAG run.

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Scale Benchmark&lt;/em&gt;: Set this to &lt;code&gt;16&lt;/code&gt; or &lt;code&gt;32&lt;/code&gt; to ensure one pipeline doesn't consume your entire worker pool.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Scheduler Performance Tuning
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;scheduler_heartbeat_sec&lt;/code&gt;&lt;/strong&gt;: Determines how often the scheduler checks for tasks that need to run.

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Scale Benchmark&lt;/em&gt;: Default is &lt;code&gt;5&lt;/code&gt;. Increase this to &lt;code&gt;10&lt;/code&gt; or &lt;code&gt;15&lt;/code&gt; on busy systems to reduce CPU pressure on your metadata database.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;min_file_process_interval&lt;/code&gt;&lt;/strong&gt;: Controls how frequently the scheduler parses your DAG files.

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Scale Benchmark&lt;/em&gt;: Default is &lt;code&gt;30&lt;/code&gt; seconds. On large clusters, increase this to &lt;code&gt;120&lt;/code&gt; or &lt;code&gt;300&lt;/code&gt; seconds to significantly reduce parsing overhead.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Section 5: Case Study: Scaling Retail Pipelines for Black Friday
&lt;/h2&gt;

&lt;p&gt;Let’s look at a real-world case study. A large multinational retail client faced catastrophic failures on their analytical platform during a major holiday sales event.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;During peak hours, their Airflow deployment (which processed 1,200 analytical tasks per hour using a single VM on a &lt;code&gt;LocalExecutor&lt;/code&gt; with a PostgreSQL backend) collapsed. The scheduler latency soared to over &lt;strong&gt;15 minutes&lt;/strong&gt;, and task executions began failing with PostgreSQL connection timeouts.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Diagnostic Process
&lt;/h3&gt;

&lt;p&gt;Using diagnostic queries against their metadata database, we identified two main bottlenecks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Metadata Bloat&lt;/strong&gt;: Over &lt;strong&gt;15 million rows&lt;/strong&gt; of historical task logs and XComs were clogging their metadata tables, slowing down the scheduler's check queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Connection Saturation&lt;/strong&gt;: The &lt;code&gt;LocalExecutor&lt;/code&gt; spawned processes that directly queried the database, exceeding its connection limit of &lt;strong&gt;100&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Solution &amp;amp; Metrics
&lt;/h3&gt;

&lt;p&gt;To fix these issues, we implemented several performance improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connection Pooling&lt;/strong&gt;: We deployed &lt;a href="https://www.pgbouncer.org/" rel="noopener noreferrer"&gt;PgBouncer&lt;/a&gt; in transaction mode in front of PostgreSQL, reducing database connection overhead by over &lt;strong&gt;80%&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Maintenance&lt;/strong&gt;: We scheduled an automated daily cleanup process using the &lt;code&gt;airflow db clean&lt;/code&gt; command to prune historical data older than 14 days, reducing database query times from seconds to milliseconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture Migration&lt;/strong&gt;: We migrated the cluster from a &lt;code&gt;LocalExecutor&lt;/code&gt; on a single VM to the &lt;code&gt;KubernetesExecutor&lt;/code&gt; on an AWS EKS cluster, using &lt;a href="https://keda.sh/" rel="noopener noreferrer"&gt;KEDA&lt;/a&gt; to scale worker pods dynamically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The impact was immediate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────────────┐
│               SCHEDULER LATENCY METRICS                │
├───────────────────────────┬────────────────────────────┤
│ Before Optimization       │ █░░░░░░░░░░░░░░░ 15.2 mins │
│ After PgBouncer &amp;amp; KEDA    │ █ 1.4 seconds              │
└───────────────────────────┴────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system successfully handled the holiday peak without a single service outage.&lt;/p&gt;




&lt;h2&gt;
  
  
  Section 6: Airflow vs. Modern Alternatives
&lt;/h2&gt;

&lt;p&gt;No guide is complete without looking at how Airflow compares to other tools on the market. Let's look at &lt;strong&gt;Building Scalable Systems with Apache Airflow: A Data Engineer's Guide vs alternatives&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                              ┌───────────────┐
                              │  Orchestrator │
                              │   Selection   │
                              └───────┬───────┘
                                      │
                     ┌────────────────┴────────────────┐
                     ▼                                 ▼
           Complex, Multi-System               Dynamic Python Code
            Enterprise Workflows                 Asset-First Focus
                     │                                 │
                     ▼                                 ▼
             ┌───────────────┐                 ┌───────────────┐
             │  Use Airflow  │                 │  Use Dagster  │
             └───────────────┘                 └───────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Airflow&lt;/strong&gt;: The industry standard. Its main strengths are its massive ecosystem of operators, active open-source community, and managed offerings (such as AWS MWAA or Astronomer). It excels in large enterprises with complex, multi-system integration requirements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dagster&lt;/strong&gt;: A modern alternative focusing on "software-defined assets." It is an excellent choice for teams prioritizing testability, local development, and data-asset lineage over infrastructure management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefect&lt;/strong&gt;: A highly dynamic orchestrator that treats tasks as standard Python functions, reducing boilerplate code. It is great for dynamic orchestration where workflow structures can change during runtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temporal&lt;/strong&gt;: Unlike database-backed orchestrators, Temporal is a high-throughput, stateful code execution engine. It is ideal for microservices and mission-critical transactions, though it has a steeper learning curve for traditional data analysts.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Pro Tips and Best Practices
&lt;/h2&gt;

&lt;p&gt;To help you get the most out of your Airflow environment, here is a consolidated list of &lt;strong&gt;Building Scalable Systems with Apache Airflow: A Data Engineer's Guide best practices&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro Tip 2: Implement a Custom XCom Backend&lt;/strong&gt;&lt;br&gt;
Instead of writing heavy task inputs and outputs directly to your metadata database, configure a custom XCom backend in your &lt;code&gt;airflow.cfg&lt;/code&gt; file using the &lt;code&gt;xcom_backend&lt;/code&gt; parameter to store payloads directly in AWS S3 or Google Cloud Storage.&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;Pro Tip 3: Always Use PgBouncer&lt;/strong&gt;&lt;br&gt;
Airflow schedulers query your database continuously. Running a connection pooler like PgBouncer in front of your PostgreSQL instance prevents connection starvation and reduces database CPU utilization by up to &lt;strong&gt;60%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;Pro Tip 4: Configure SLA Timeouts Carefully&lt;/strong&gt;&lt;br&gt;
Avoid using Airflow’s default SLA mechanism for large-scale pipelines, as it can cause performance bottlenecks on the scheduler. Instead, use custom callback alerts (&lt;code&gt;on_failure_callback&lt;/code&gt;, &lt;code&gt;on_retry_callback&lt;/code&gt;) or task-level execution timeouts (&lt;code&gt;execution_timeout=timedelta(...)&lt;/code&gt;) for better reliability.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Keep Compute Separate&lt;/strong&gt;: Keep your Airflow worker nodes lightweight by offloading heavy data transformations to cloud platforms like Snowflake, Spark, or BigQuery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimize Top-Level Imports&lt;/strong&gt;: Write clean DAGs with localized task imports to reduce parsing latency on your scheduler.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Distributed Executors&lt;/strong&gt;: Migrate to the &lt;code&gt;KubernetesExecutor&lt;/code&gt; or &lt;code&gt;CeleryExecutor&lt;/code&gt; to handle unpredictable, high-volume workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tune Your Metadata DB&lt;/strong&gt;: Use tools like PgBouncer, schedule regular cleanups with the &lt;code&gt;airflow db clean&lt;/code&gt; command, and keep your metadata database running smoothly.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Resources &amp;amp; Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html" rel="noopener noreferrer"&gt;Apache Airflow Best Practices Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/scheduler.html" rel="noopener noreferrer"&gt;Scaling the Airflow Scheduler (Official Guide)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://keda.sh/docs/2.12/scalers/celery/" rel="noopener noreferrer"&gt;KEDA (Kubernetes Event-driven Autoscaling) for Celery Workers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.pgbouncer.org/usage.html" rel="noopener noreferrer"&gt;PgBouncer Setup Guide for PostgreSQL Databases&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Join the Discussion!
&lt;/h3&gt;

&lt;p&gt;What are the biggest scaling bottlenecks you have encountered in your Airflow environments? How is your team balancing workloads between Airflow workers and external compute engines? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drop a comment below with your thoughts and experiences, or let me know if you have any questions!&lt;/strong&gt; You can also find additional configuration templates and deployment scripts on my &lt;a href="https://github.com/kipruto45" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;. Let’s keep the conversation going!&lt;/p&gt;

</description>
      <category>apacheairflow</category>
      <category>dataengineering</category>
      <category>scalability</category>
      <category>devops</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Victor Kipruto</dc:creator>
      <pubDate>Mon, 08 Jun 2026 23:55:36 +0000</pubDate>
      <link>https://dev.to/victor_kipruto_0f44ad2a3b/-1k69</link>
      <guid>https://dev.to/victor_kipruto_0f44ad2a3b/-1k69</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/victor_kipruto_0f44ad2a3b/advanced-kubernetes-patterns-for-data-engineers-1pk9" class="crayons-story__hidden-navigation-link"&gt;Advanced Kubernetes Patterns for Data Engineers&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/victor_kipruto_0f44ad2a3b" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3970711%2F63a5cd58-3eb3-4e19-b400-eaf4b6da7d69.png" alt="victor_kipruto_0f44ad2a3b profile" class="crayons-avatar__image" width="96" height="96"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/victor_kipruto_0f44ad2a3b" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Victor Kipruto
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Victor Kipruto
                
              
              &lt;div id="story-author-preview-content-3852037" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/victor_kipruto_0f44ad2a3b" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3970711%2F63a5cd58-3eb3-4e19-b400-eaf4b6da7d69.png" class="crayons-avatar__image" alt="" width="96" height="96"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Victor Kipruto&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/victor_kipruto_0f44ad2a3b/advanced-kubernetes-patterns-for-data-engineers-1pk9" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jun 8&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/victor_kipruto_0f44ad2a3b/advanced-kubernetes-patterns-for-data-engineers-1pk9" id="article-link-3852037"&gt;
          Advanced Kubernetes Patterns for Data Engineers
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/kubernetes"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;kubernetes&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/dataengineering"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;dataengineering&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/docker"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;docker&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devops"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devops&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/victor_kipruto_0f44ad2a3b/advanced-kubernetes-patterns-for-data-engineers-1pk9" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/exploding-head-daceb38d627e6ae9b730f36a1e390fca556a4289d5a41abb2c35068ad3e2c4b5.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;5&lt;span class="hidden s:inline"&gt;&amp;nbsp;reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/victor_kipruto_0f44ad2a3b/advanced-kubernetes-patterns-for-data-engineers-1pk9#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              

              &lt;span class="hidden s:inline"&gt;Add&amp;nbsp;Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            1 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial crayons-icon c-btn__icon"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success crayons-icon c-btn__icon"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Advanced Kubernetes Patterns for Data Engineers</title>
      <dc:creator>Victor Kipruto</dc:creator>
      <pubDate>Mon, 08 Jun 2026 23:49:32 +0000</pubDate>
      <link>https://dev.to/victor_kipruto_0f44ad2a3b/advanced-kubernetes-patterns-for-data-engineers-1dn1</link>
      <guid>https://dev.to/victor_kipruto_0f44ad2a3b/advanced-kubernetes-patterns-for-data-engineers-1dn1</guid>
      <description>&lt;p&gt;Master production-ready Kubernetes deployment patterns specifically designed for data engineering workloads including operators, StatefulSets, monitoring, and GitOps workflows.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>dataengineering</category>
      <category>devops</category>
      <category>containers</category>
    </item>
    <item>
      <title>Advanced Kubernetes Patterns for Data Engineers</title>
      <dc:creator>Victor Kipruto</dc:creator>
      <pubDate>Mon, 08 Jun 2026 23:30:43 +0000</pubDate>
      <link>https://dev.to/victor_kipruto_0f44ad2a3b/advanced-kubernetes-patterns-for-data-engineers-1pk9</link>
      <guid>https://dev.to/victor_kipruto_0f44ad2a3b/advanced-kubernetes-patterns-for-data-engineers-1pk9</guid>
      <description>&lt;h2&gt;
  
  
  Mastering Kubernetes for Data Workloads
&lt;/h2&gt;

&lt;p&gt;Production-ready Kubernetes deployment patterns for data engineering.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Topics Covered
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operators and CRDs&lt;/strong&gt; for custom data pipeline management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;StatefulSets&lt;/strong&gt; for deploying Kafka, Airflow, and databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal Pod Autoscaling&lt;/strong&gt; based on queue depth metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Policies&lt;/strong&gt; for securing data flow between microservices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent Volumes&lt;/strong&gt; for stateful data processing workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Kubernetes for Data Engineering?
&lt;/h3&gt;

&lt;p&gt;Data engineers increasingly need to deploy and manage complex data pipelines in containerized environments. Kubernetes provides the orchestration layer that makes this possible at scale.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Kubernetes is not just for web services — it's the new operating system for data infrastructure." - Data Engineering Weekly&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Architecture Pattern
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Producer → Kafka (StatefulSet) → Consumer (Deployment) → Sink
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Code Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kubernetes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_in_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;v1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CoreV1Api&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Create persistent volume for data pipeline
&lt;/span&gt;&lt;span class="n"&gt;pvc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;V1PersistentVolumeClaim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;V1PersistentVolumeClaimSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;access_modes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;V1ResourceRequirements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;storage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50Gi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Monitoring Stack
&lt;/h3&gt;

&lt;p&gt;Deploy Prometheus + Grafana for pipeline monitoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumer lag metrics&lt;/li&gt;
&lt;li&gt;Pipeline throughput dashboards
&lt;/li&gt;
&lt;li&gt;Alert rules for data freshness SLAs&lt;/li&gt;
&lt;li&gt;Cost tracking per namespace&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Kubernetes provides the foundation for building resilient, scalable data platforms. These patterns have been battle-tested in production environments processing millions of records daily.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Follow for more data engineering content&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>dataengineering</category>
      <category>docker</category>
      <category>devops</category>
    </item>
    <item>
      <title>Building Data Pipelines with Apache Airflow</title>
      <dc:creator>Victor Kipruto</dc:creator>
      <pubDate>Sun, 07 Jun 2026 14:55:58 +0000</pubDate>
      <link>https://dev.to/victor_kipruto_0f44ad2a3b/building-data-pipelines-with-apache-airflow-ij1</link>
      <guid>https://dev.to/victor_kipruto_0f44ad2a3b/building-data-pipelines-with-apache-airflow-ij1</guid>
      <description>&lt;p&gt;content/posts/airflow-pipelines.md&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>etl</category>
      <category>datapipelines</category>
      <category>python</category>
    </item>
  </channel>
</rss>
