Victor Kipruto

Posted on Jun 9 • Originally published at victorkirpruto.dev

Building Scalable Systems with Apache Airflow: A Data Engineer's Guide

#apacheairflow #dataengineering #scalability #devops

Building Scalable Systems with Apache Airflow: A Data Engineer's Guide

It is 3:00 AM. Your Slack channels are lighting up with alerts. Your analytical database is reporting connection timeouts, critical customer dashboards are displaying stale data, and the Apache Airflow web server is completely unresponsive, throwing 504 Gateway Timeouts. If you have managed data platforms at scale, this nightmare scenario probably feels all too familiar.

When teams first adopt workflow orchestration, setting up a single virtual machine with a local executor and a handful of cron-like DAGs is easy. However, as your data organization matures from 10 to 1,000 pipelines, you will inevitably hit the scalability wall. The symptoms are always the same: missed Service Level Agreements (SLAs), severe scheduler latency, metadata database lock contention, and random task failures caused by Out-of-Memory (OOM) errors.

Welcome to Building Scalable Systems with Apache Airflow: A Data Engineer's Guide. In this post, we will look at how to design, optimize, and maintain production-ready Airflow environments that can scale to thousands of daily tasks without breaking a sweat.

The Core Challenge: Understanding the Scalability Wall

Before we dive into configuration files and Python code, we must understand why Airflow deployments fail under heavy load.

┌────────────────────────────────────────────────────────┐
│                   AIRFLOW ARCHITECTURE                 │
└──────────────────────────┬─────────────────────────────┘
                           │
             ┌─────────────┴─────────────┐
             ▼                           ▼
      ┌─────────────┐             ┌─────────────┐
      │  Scheduler  │             │ Web Server  │
      └──────┬──────┘             └──────┬──────┘
             │                           │
             │     ┌───────────────┐     │
             └────►│  Metadata DB  │◄────┘
                   │ (PostgreSQL)  │
                   └───────┬───────┘
                           │
             ┌─────────────┴─────────────┐
             ▼                           ▼
      ┌─────────────┐             ┌─────────────┐
      │   Worker 1  │             │   Worker 2  │
      │  (Compute)  │             │  (Compute)  │
      └─────────────┘             └─────────────┘

The most common misconception is that Airflow is a data processing engine. It is not. Airflow is strictly a workflow orchestrator. Its core components—the Web Server, Scheduler, and Metadata Database—are designed to coordinate work, not execute it.

When you run heavy data processing workloads (like Pandas transformations, API extractions, or model training) directly on Airflow workers, you consume local CPU and RAM. When multiple tasks run concurrently, workers quickly run out of memory, killing the processes and causing the Scheduler to lose track of task states. This guide focuses on teaching you how to build a decoupled, scalable, and resilient orchestration architecture.

Section 1: Architecting Your Airflow Cluster for Enterprise Scale

Scaling your infrastructure begins with choosing the right executor. If you are still running the SequentialExecutor or LocalExecutor in production, migrations should be your top priority. To build the best Building Scalable Systems with Apache Airflow: A Data Engineer's Guide, we must look at distributed executors.

CeleryExecutor vs. KubernetesExecutor

For enterprise-scale orchestration, you generally have two choices:

Feature	CeleryExecutor	KubernetesExecutor
Task Isolation	Low (tasks share worker resources)	High (each task runs in its own isolated pod)
Startup Latency	Low (< 1 second)	Medium (5–15 seconds for pod creation)
Resource Efficiency	Static (workers run continuously)	Dynamic (pods spin up and down on demand)
Scaling Mechanism	Scaled via Celery Workers / KEDA	Native Kubernetes Autoscaler

The Hybrid Solution: CeleryKubernetesExecutor

For organizations handling high-velocity, mixed-profile tasks, the CeleryKubernetesExecutor offers the best of both worlds. It directs small, fast tasks (such as simple API triggers or sensor checks) to a persistent pool of Celery workers to avoid container startup overhead. Simultaneously, it routes resource-heavy, complex tasks directly to Kubernetes pods for runtime isolation.

💡 Pro Tip 1: Decouple Your Compute Completely
To build highly scalable systems, follow the golden rule of orchestration: Airflow should only push buttons and monitor statuses. Offload actual data processing to distributed, external compute engines like Snowflake, BigQuery, Apache Spark (on AWS EMR / Databricks), or serverless containers (AWS ECS/Fargate). Airflow should trigger the job, poll for its completion, and yield resources.

Section 2: Building Scalable Systems with Apache Airflow: A Data Engineer's Guide Tutorial

Writing scalable pipelines is as much about software engineering best practices as it is about systems architecture. Let's look at how to use Building Scalable Systems with Apache Airflow: A Data Engineer's Guide to write clean, performance-minded DAGs.

One of the most common mistakes data engineers make is running top-level code inside their DAG files. The Airflow Scheduler regularly parses every file in your DAGS_FOLDER (defined by min_file_process_interval). If your DAG file contains active database connections, HTTP requests, or heavy package imports at the top level, those queries execute on every parse cycle, which can overwhelm your database and slow down the Scheduler.

Here is a step-by-step tutorial on how to construct a scalable DAG using the modern TaskFlow API while keeping Scheduler overhead minimal.

"""
DAG Title: Scalable E-Commerce Analytics ETL Pipeline
Author: Victor Kipruto Rop
Description: Demonstrates scalable DAG patterns including the TaskFlow API,
             minimized top-level imports, and external compute delegation.
"""

from datetime import datetime, timedelta
from airflow.decorators import dag, task

# Keep top-level imports minimal to reduce scheduler parsing latency
# Avoid importing heavy libraries like pandas, numpy, or tensorflow at the top level

@dag(
    dag_id="scalable_ecommerce_analytics",
    schedule_interval="@daily",
    start_date=datetime(2023, 10, 1),
    catchup=False,
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=5),
        "depends_on_past": False,
    },
    tags=["production", "analytics"],
)
def ecommerce_analytics_dag():

    @task()
    def extract_order_metadata(ds: str = None) -> dict:
        """
        Extracts transactional metadata from source databases.
        Notice that heavy libraries or database connections are initialized
        INSIDE the task function scope, not at the top level.
        """
        import psycopg2  # Localized import to prevent scheduler parsing overhead

        # Simulated extraction metadata (Do not pass raw, heavy data via XCom)
        extracted_metadata = {
            "execution_date": ds,
            "s3_raw_data_path": f"s3://my-data-lake-raw/orders/{ds}/data.parquet",
            "record_count": 450000
        }
        return extracted_metadata

    @task()
    def transform_on_external_compute(metadata: dict) -> dict:
        """
        Simulates running heavy transformation on external compute (Spark/EMR).
        Airflow merely initiates the process and monitors its status.
        """
        import time
        raw_path = metadata["s3_raw_data_path"]
        transformed_path = raw_path.replace("raw", "transformed")

        # Instead of reading the parquet file into memory with pandas on the Airflow worker:
        # We would trigger an EMR Serverless Spark Job or run a Snowflake query here.
        print(f"Triggering Spark job to process records from {raw_path}")
        time.sleep(5)  # Simulating API poll

        return {
            "s3_processed_data_path": transformed_path,
            "status": "SUCCESS"
        }

    @task()
    def quality_check(processed_info: dict) -> None:
        """
        Executes lightweight quality assertions using Great Expectations or standard SQL validation.
        """
        processed_path = processed_info["s3_processed_data_path"]
        print(f"Asserting schema conformity and record completeness on: {processed_path}")

    # Set explicit dependencies using TaskFlow API syntax
    order_metadata = extract_order_metadata()
    transformed_info = transform_on_external_compute(order_metadata)
    quality_check(transformed_info)

# Instantiate the DAG
ecommerce_analytics_dag()

Why This Design Pattern Scales:

No Top-Level Database Connections: Heavy database drivers (psycopg2) are imported locally inside the task function, which protects your CPU during parsing.
Metadata-Only XComs: Instead of passing raw transaction dataframes through Airflow’s XCom engine (which bloats the PostgreSQL metadata DB), we store raw data in cloud storage (S3) and pass only the path references between tasks.
External Compute Delegation: Task execution processes are simulated as remote operations (like Spark triggers) rather than running locally on the worker.

Section 3: Advanced Optimization — Dynamic DAG Generation

To scale your operations efficiently, you must avoid hand-crafting thousands of separate DAG files for similar ingestion pipelines. Let's look at how to dynamically generate DAGs using external configuration schemas.

Instead of writing 100 files for 100 database tables, you can write a single generator script that reads a JSON configuration and populates the Airflow globals().

import json
import os
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Path to the config file defining our ingestion pipelines
CONFIG_PATH = os.path.join(os.path.dirname(__file__), "ingestion_config.json")

def create_dag(dag_id, table_name, schedule):
    """Factory function to dynamically construct a DAG configuration."""
    dag = DAG(
        dag_id=dag_id,
        schedule_interval=schedule,
        start_date=datetime(2023, 1, 1),
        catchup=False,
        tags=["dynamic", "ingestion"]
    )

    with dag:
        start = EmptyOperator(task_id="start")
        # Imagine a real ingestion operator here querying the source DB
        ingest = EmptyOperator(task_id=f"ingest_{table_name}")
        end = EmptyOperator(task_id="end")

        start >> ingest >> end

    return dag

# Read the configurations dynamically
if os.path.exists(CONFIG_PATH):
    with open(CONFIG_PATH, "r") as f:
        config = json.load(f)

    for pipeline in config["pipelines"]:
        d_id = f"dynamic_ingest_{pipeline['table']}"
        # Inject the generated DAG into the global namespace so Airflow can parse it
        globals()[d_id] = create_dag(
            dag_id=d_id,
            table_name=pipeline["table"],
            schedule=pipeline["schedule"]
        )

For this to work seamlessly, create an adjacent ingestion_config.json file in your DAG directory:

{
  "pipelines": [
    {"table": "users", "schedule": "@daily"},
    {"table": "orders", "schedule": "0 2 * * *"},
    {"table": "products", "schedule": "@weekly"}
  ]
}

This pattern simplifies administrative overhead. Adding a new pipeline is as simple as adding a new configuration block to your JSON file, without writing any additional Python code.

Section 4: Tuning Airflow Configuration Variables for Enterprise Performance

Tuning your Airflow cluster requires adjusting specific configuration variables within your airflow.cfg file. These parameters act as control valves for your system's resource consumption. Below are some of the most critical tuning variables and the standard benchmarks used by senior architects:

1. Concurrency Management

parallelism: This defines the maximum number of task instances that can run concurrently across your entire Airflow deployment.
- Scale Benchmark: Default is 32. For heavy workloads, scale this to 256 or 512 depending on your executor resources.
max_active_runs_per_dag: Prevents a single DAG from spawning too many concurrent active DAG runs, which is especially useful when catching up historical datasets.
- Scale Benchmark: Set this to 3 or 4 to prevent resource starvation on your cluster.
max_active_tasks_per_dag: Limits the number of tasks that can run concurrently within a single DAG run.
- Scale Benchmark: Set this to 16 or 32 to ensure one pipeline doesn't consume your entire worker pool.

2. Scheduler Performance Tuning

scheduler_heartbeat_sec: Determines how often the scheduler checks for tasks that need to run.
- Scale Benchmark: Default is 5. Increase this to 10 or 15 on busy systems to reduce CPU pressure on your metadata database.
min_file_process_interval: Controls how frequently the scheduler parses your DAG files.
- Scale Benchmark: Default is 30 seconds. On large clusters, increase this to 120 or 300 seconds to significantly reduce parsing overhead.

Section 5: Case Study: Scaling Retail Pipelines for Black Friday

Let’s look at a real-world case study. A large multinational retail client faced catastrophic failures on their analytical platform during a major holiday sales event.

The Problem

During peak hours, their Airflow deployment (which processed 1,200 analytical tasks per hour using a single VM on a LocalExecutor with a PostgreSQL backend) collapsed. The scheduler latency soared to over 15 minutes, and task executions began failing with PostgreSQL connection timeouts.

The Diagnostic Process

Using diagnostic queries against their metadata database, we identified two main bottlenecks:

Metadata Bloat: Over 15 million rows of historical task logs and XComs were clogging their metadata tables, slowing down the scheduler's check queries.
Database Connection Saturation: The LocalExecutor spawned processes that directly queried the database, exceeding its connection limit of 100.

The Solution & Metrics

To fix these issues, we implemented several performance improvements:

Connection Pooling: We deployed PgBouncer in transaction mode in front of PostgreSQL, reducing database connection overhead by over 80%.
Database Maintenance: We scheduled an automated daily cleanup process using the airflow db clean command to prune historical data older than 14 days, reducing database query times from seconds to milliseconds.
Architecture Migration: We migrated the cluster from a LocalExecutor on a single VM to the KubernetesExecutor on an AWS EKS cluster, using KEDA to scale worker pods dynamically.

The impact was immediate:

┌────────────────────────────────────────────────────────┐
│               SCHEDULER LATENCY METRICS                │
├───────────────────────────┬────────────────────────────┤
│ Before Optimization       │ █░░░░░░░░░░░░░░░ 15.2 mins │
│ After PgBouncer & KEDA    │ █ 1.4 seconds              │
└───────────────────────────┴────────────────────────────┘

The system successfully handled the holiday peak without a single service outage.

Section 6: Airflow vs. Modern Alternatives

No guide is complete without looking at how Airflow compares to other tools on the market. Let's look at Building Scalable Systems with Apache Airflow: A Data Engineer's Guide vs alternatives:

                              ┌───────────────┐
                              │  Orchestrator │
                              │   Selection   │
                              └───────┬───────┘
                                      │
                     ┌────────────────┴────────────────┐
                     ▼                                 ▼
           Complex, Multi-System               Dynamic Python Code
            Enterprise Workflows                 Asset-First Focus
                     │                                 │
                     ▼                                 ▼
             ┌───────────────┐                 ┌───────────────┐
             │  Use Airflow  │                 │  Use Dagster  │
             └───────────────┘                 └───────────────┘

Apache Airflow: The industry standard. Its main strengths are its massive ecosystem of operators, active open-source community, and managed offerings (such as AWS MWAA or Astronomer). It excels in large enterprises with complex, multi-system integration requirements.
Dagster: A modern alternative focusing on "software-defined assets." It is an excellent choice for teams prioritizing testability, local development, and data-asset lineage over infrastructure management.
Prefect: A highly dynamic orchestrator that treats tasks as standard Python functions, reducing boilerplate code. It is great for dynamic orchestration where workflow structures can change during runtime.
Temporal: Unlike database-backed orchestrators, Temporal is a high-throughput, stateful code execution engine. It is ideal for microservices and mission-critical transactions, though it has a steeper learning curve for traditional data analysts.

Pro Tips and Best Practices

To help you get the most out of your Airflow environment, here is a consolidated list of Building Scalable Systems with Apache Airflow: A Data Engineer's Guide best practices:

💡 Pro Tip 2: Implement a Custom XCom Backend
Instead of writing heavy task inputs and outputs directly to your metadata database, configure a custom XCom backend in your airflow.cfg file using the xcom_backend parameter to store payloads directly in AWS S3 or Google Cloud Storage.

💡 Pro Tip 3: Always Use PgBouncer
Airflow schedulers query your database continuously. Running a connection pooler like PgBouncer in front of your PostgreSQL instance prevents connection starvation and reduces database CPU utilization by up to 60%.

💡 Pro Tip 4: Configure SLA Timeouts Carefully
Avoid using Airflow’s default SLA mechanism for large-scale pipelines, as it can cause performance bottlenecks on the scheduler. Instead, use custom callback alerts (on_failure_callback, on_retry_callback) or task-level execution timeouts (execution_timeout=timedelta(...)) for better reliability.

Key Takeaways

Keep Compute Separate: Keep your Airflow worker nodes lightweight by offloading heavy data transformations to cloud platforms like Snowflake, Spark, or BigQuery.
Minimize Top-Level Imports: Write clean DAGs with localized task imports to reduce parsing latency on your scheduler.
Use Distributed Executors: Migrate to the KubernetesExecutor or CeleryExecutor to handle unpredictable, high-volume workloads.
Tune Your Metadata DB: Use tools like PgBouncer, schedule regular cleanups with the airflow db clean command, and keep your metadata database running smoothly.

Resources & Further Reading

Join the Discussion!

What are the biggest scaling bottlenecks you have encountered in your Airflow environments? How is your team balancing workloads between Airflow workers and external compute engines?

Drop a comment below with your thoughts and experiences, or let me know if you have any questions! You can also find additional configuration templates and deployment scripts on my GitHub repository. Let’s keep the conversation going!

Top comments (1)

Linux Kipruto • Jun 22

Very helpfull