Data Tech Bridge

Apache Airflow: Complete Guide for Basic to Advanced Developers

The Beginning of Your Airflow Adventure πŸš€β˜ΈοΈ

πŸ“š Table of Contents

PART 1: FUNDAMENTALS

PART 2: ADVANCED CONCEPTS

PART 3: OPERATIONS & DEPLOYMENT

FINAL SECTIONS


SECTION 1: INTRODUCTION

What is Apache Airflow?

Apache Airflow is a workflow orchestration platform that allows you to define, schedule, and monitor complex data pipelines as code. Instead of using UI-based workflow builders, Airflow uses Python to define Directed Acyclic Graphs (DAGs) representing your workflows.

Key Benefits

  • Programmatic workflows: Define complex pipelines in Python with full control
  • Scalable: From single-machine setups to distributed Kubernetes deployments
  • Extensible: Create custom operators, sensors, and hooks without modifying core
  • Rich monitoring: Web UI with real-time pipeline visibility and control
  • Flexible scheduling: Time-based, event-driven, or custom business logic schedules
  • Reliable: Built-in retry logic, SLAs, and comprehensive failure handling
  • Community-driven: Active ecosystem with hundreds of provider integrations

Why Airflow Over Alternatives?

| Alternative | Why Airflow is Better |
| --- | --- |
| Cron scripts | Better reliability, monitoring, dependency management, version control |
| Shell scripts | Testable, debuggable, proper error handling, team collaboration |
| Commercial tools | Open source, customizable, no vendor lock-in, self-hosted |
| Cloud-native tools | Language agnostic, can self-host, avoid provider lock-in |

Who Uses Airflow?

  • Data engineers building ETL/ELT pipelines
  • ML engineers orchestrating training workflows
  • DevOps teams managing infrastructure automation
  • Analytics teams automating reports
  • Any organization with complex scheduled workflows

↑ Back to Table of Contents

SECTION 2: ARCHITECTURE

Core Components Overview

Apache Airflow consists of several interconnected services working together:

Web Server

  • Provides REST API and web UI
  • Runs on port 8080 by default
  • Allows manual DAG triggering, viewing logs, and managing connections
  • Multiple instances can be deployed for high availability
  • Stateless (can run on different machines)

Scheduler

  • Monitors all DAGs and determines when tasks should execute
  • Re-parses DAG files continuously (each file at least every 30 seconds by default; new files are discovered within about 5 minutes)
  • Handles task dependency resolution and sequencing
  • Manages retry logic and failure callbacks
  • Multiple scheduler instances can run in parallel for high availability (Airflow 2.0+)
  • Continuously runs in background

Executor

  • Determines HOW tasks are physically executed
  • Different executors for different deployment models:
    • LocalExecutor: Single machine, limited parallelism (~10-50 tasks)
    • SequentialExecutor: One task at a time (testing only)
    • CeleryExecutor: Distributed across worker nodes (100s of tasks)
    • KubernetesExecutor: Each task runs in separate pod (elastic scaling)

Metadata Database

  • Central repository for all Airflow state
  • Stores: DAG definitions, task instance states, variables, connections, logs metadata
  • Should be PostgreSQL 12+ or MySQL 8+ in production
  • SQLite is for development only (no safe concurrent access)
  • Requires daily backups and monitoring

Message Broker (for distributed setups)

  • Redis or RabbitMQ
  • Queues tasks from Scheduler to Workers
  • Enables horizontal scaling
  • Required only for CeleryExecutor

Triggerer Service (Airflow 2.2+)

  • Manages all deferrable/async tasks
  • Single lightweight async service per deployment
  • Can handle 1000+ concurrent deferred tasks
  • Monitors events efficiently
  • Fires triggers when conditions met

Execution Flow Diagram

1. DAG File Loaded
   ↓
2. Scheduler Parses
   ↓
3. Check Task Ready?
   β”œβ”€ Dependencies met?
   β”œβ”€ Start date passed?
   β”œβ”€ Not already running?
   ↓
4. Task State: SCHEDULED
   ↓
5. Queue to Message Broker
   ↓
6. Worker Picks Up Task
   ↓
7. Execution Phase
   β”œβ”€ Run operator code
   β”œβ”€ OR Defer to Trigger
   ↓
8. Log Output
   ↓
9. Record Result
   β”œβ”€ Success β†’ COMPLETED
   β”œβ”€ Failure β†’ FAILED
   β”œβ”€ Skipped β†’ SKIPPED
   ↓
10. Execute Callback (if configured)
    ↓
11. Mark downstream ready
    ↓
12. Update Dashboard

Architecture Patterns

Single Machine Setup

  • One server with all components
  • LocalExecutor
  • SQLite database (acceptable for development)
  • Best for: Development, testing, small workflows

Distributed Setup

  • Separate scheduler machine
  • Multiple worker machines
  • CeleryExecutor
  • PostgreSQL database
  • Redis message broker
  • Best for: Production, high volume, 100+ daily DAGs

Kubernetes Native

  • Scheduler pod
  • Web server pod (multiple replicas)
  • KubernetesExecutor
  • Managed PostgreSQL (Cloud SQL, RDS)
  • Dynamic pod creation per task
  • Best for: Cloud-native, elastic scaling, containerized workflows

↑ Back to Table of Contents

SECTION 3: DAG & BUILDING BLOCKS

What is a DAG?

A DAG (Directed Acyclic Graph) is the fundamental unit in Airflow. It represents:

  • Directed: Tasks execute in a specific direction (flow from A β†’ B β†’ C)
  • Acyclic: No circular dependencies allowed (prevents infinite loops)
  • Graph: Visual representation of tasks and relationships
  • Configurable: Start date, schedule, retries, timeouts, SLAs, etc.

Key DAG Properties

| Property | Purpose | Example |
| --- | --- | --- |
| dag_id | Unique identifier | 'daily_etl' |
| start_date | First execution date | datetime(2024, 1, 1) |
| schedule | When to run | '@daily', CronTimetable, Dataset |
| catchup | Backfill missed runs | True/False |
| max_active_runs | Concurrent runs allowed | 1-5 |
| default_args | Task defaults | retries, timeout, owner |
| description | What the DAG does | "Daily customer ETL" |
| tags | Categorization | ['etl', 'daily'] |
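
Here is a minimal sketch of how these properties fit together (the dag_id, tags, and other values are illustrative, and the schedule parameter assumes Airflow 2.4+, where it replaces schedule_interval):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_etl",                      # unique identifier
    start_date=datetime(2024, 1, 1),         # first logical date
    schedule="@daily",                       # when to run
    catchup=False,                           # don't backfill missed runs
    max_active_runs=1,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5), "owner": "data-team"},
    description="Daily customer ETL",
    tags=["etl", "daily"],
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")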

Operators (Task Types)

Operators are classes that execute actual work. Each operator is a reusable task that does something specific.

Common Built-in Operators

| Operator | What It Does | Common Parameters |
| --- | --- | --- |
| PythonOperator | Executes Python function | python_callable, op_kwargs |
| BashOperator | Runs bash commands | bash_command |
| SqlOperator | Executes SQL query | sql, conn_id |
| EmailOperator | Sends emails | to, subject, html_content |
| S3Operator | S3 file operations | bucket, key, source/dest |
| HttpOperator | Makes HTTP requests | endpoint, method, data |
| SlackOperator | Sends Slack messages | text, channel, token |
| BranchOperator | Conditional branching | python_callable returning task_id |
| DummyOperator | Placeholder/no-op | Used for flow control |

When to Use Each

  • PythonOperator: Data processing, API calls, complex logic
  • BashOperator: Run scripts, system commands, existing tools
  • SqlOperator: Database transformations, queries
  • EmailOperator: Notifications, reports
  • S3Operator: Data movement, file management
  • HttpOperator: External service integration
  • BranchOperator: Conditional workflows
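
As a rough sketch, here is how two of these operators are typically wired together inside a DAG (the DAG id, command, and callable are illustrative):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform(**context):
    # Placeholder for real transformation logic
    print("transforming data")

with DAG(dag_id="operator_demo", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
    download = BashOperator(task_id="download", bash_command="echo downloading")
    process = PythonOperator(task_id="process", python_callable=transform)
    download >> process   # BashOperator runs first, then the PythonOperator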

Sensors (Waiting for Conditions)

Sensors are operators that wait for something to happen. They poll a condition until it becomes true, blocking further execution.

Common Sensors

| Sensor | Waits For | Use Case |
| --- | --- | --- |
| FileSensor | File creation on disk | Wait for uploaded data |
| S3KeySensor | S3 object existence | Wait for data in S3 |
| SqlSensor | SQL query condition | Wait for record count |
| TimeSensor | Specific time of day | Wait until 2 PM |
| ExternalTaskSensor | External DAG task | Cross-DAG dependency |
| HttpSensor | HTTP endpoint response | Wait for API ready |
| TimeDeltaSensor | Time duration elapsed | Wait 30 minutes |

Sensor Behavior

  • Poke Mode (default): Worker thread continuously checks condition, holds worker slot
  • Reschedule Mode: Worker released, task rescheduled when ready
  • Smart Mode: Multiple sensors batched for efficiency (Smart Sensors are deprecated and were removed in Airflow 2.4; prefer deferrable operators)

Performance Consideration: Many idle sensors waste resources. Use deferrable operators for long waits.
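
A hedged example of a sensor configured for a longer wait, using reschedule mode so the worker slot is freed between checks (the file path and connection id are hypothetical):

from datetime import datetime
from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(dag_id="sensor_demo", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/data/incoming/orders.csv",  # hypothetical path
        fs_conn_id="fs_default",
        poke_interval=300,        # check every 5 minutes
        timeout=6 * 60 * 60,      # give up after 6 hours
        mode="reschedule",        # release the worker slot between pokes
    )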

Hooks (Connection Abstractions)

Hooks encapsulate connection and authentication logic to external systems. They're the reusable layer between operators and external services.

Common Hooks

| Hook | Connects To | Purpose |
| --- | --- | --- |
| PostgresHook | PostgreSQL | Query execution |
| MySqlHook | MySQL | Database operations |
| S3Hook | AWS S3 | File operations |
| HttpHook | REST APIs | HTTP requests |
| SlackHook | Slack | Send messages |
| SSHHook | Remote servers | Execute commands |
| DockerHook | Docker daemon | Container operations |

Benefits of Hooks

  • Centralize credential management (stored in Airflow connections)
  • Reusable across multiple tasks
  • No credentials in DAG code
  • Easier testing (mock connections)
  • Connection pooling support
  • Error handling abstraction

Tasks

A task is an instance of an operator within a DAG. It's the smallest unit of work.

Task Configuration

| Setting | Purpose | Example |
| --- | --- | --- |
| task_id | Unique identifier | 'extract_customers' |
| retries | Retry attempts on failure | 2-3 |
| retry_delay | Wait between retries | timedelta(minutes=5) |
| execution_timeout | Maximum execution time | timedelta(hours=1) |
| pool | Concurrency control | 'default_pool' |
| priority_weight | Execution priority | 1-100 (higher = more urgent) |
| trigger_rule | When to execute | 'all_success' (default) |
| depends_on_past | Depend on previous run | False (usually) |

XCom (Cross-Communication)

XCom (cross-communication) allows tasks to pass data to each other.

How XCom Works

  • Tasks push values: task_instance.xcom_push(key='count', value=1000)
  • Tasks pull values: task_instance.xcom_pull(task_ids='upstream', key='count')
  • Automatically stored in metadata database
  • Serialized to JSON by default
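
A small sketch of explicit push and pull between two PythonOperator callables (the task ids 'extract' and 'report' are hypothetical; 'ti' is the TaskInstance from the context):

def extract(**context):
    context["ti"].xcom_push(key="count", value=1000)

def report(**context):
    count = context["ti"].xcom_pull(task_ids="extract", key="count")
    print(f"extracted {count} rows")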

XCom Limitations

  • Database-backed (not for large data)
  • Intended for metadata, not big data
  • Consider external storage for large datasets
  • Can cause database bloat if overused

↑ Back to Table of Contents

SECTION 4: PLUGINS

What are Plugins?

Plugins allow you to extend Airflow with custom components without modifying core code. They're organized Python modules that register custom functionality.

Types of Plugins

| Plugin Type | Purpose | When to Create |
| --- | --- | --- |
| Custom Operators | New task types | Business-specific operations |
| Custom Sensors | New wait conditions | Proprietary systems |
| Custom Hooks | New connections | Proprietary APIs |
| Custom Executors | New execution patterns | Specialized hardware |
| Custom Macros | New template variables | Common business calculations |
| UI Extensions | Dashboard widgets | Custom monitoring |

Plugin Organization

Plugins live in the plugins/ directory with this structure:

plugins/
β”œβ”€β”€ operators/
β”‚   └── custom_operator.py
β”œβ”€β”€ sensors/
β”‚   └── custom_sensor.py
β”œβ”€β”€ hooks/
β”‚   └── custom_hook.py
β”œβ”€β”€ macros/
β”‚   └── custom_macros.py
└── airflow_plugins.py

When to Create Plugins

CREATE plugins for:

  • Integrating proprietary/internal systems
  • Reusable patterns across organization
  • Complex business logic encapsulation
  • Specialized monitoring/alerting

DON'T create plugins for:

  • One-off tasks (use standard operators)
  • Simple data transformations (use PythonOperator)
  • Ad-hoc scripts (use BashOperator)
  • If built-in operator already exists

Plugin Development

Custom operators extend BaseOperator and override execute() method.
Custom sensors extend BaseSensorOperator and override poke() method.
Custom hooks extend BaseHook with connection logic.

Plugins are loaded automatically from plugins/ directory on scheduler startup.
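
A minimal sketch of those base classes in use (class names, parameters, and logic are hypothetical; only the methods to override are shown):

from airflow.models.baseoperator import BaseOperator
from airflow.sensors.base import BaseSensorOperator

class MyApiToS3Operator(BaseOperator):
    def __init__(self, endpoint: str, bucket: str, **kwargs):
        super().__init__(**kwargs)
        self.endpoint = endpoint
        self.bucket = bucket

    def execute(self, context):
        # Do the actual work here; the return value is pushed to XCom.
        self.log.info("Copying %s to %s", self.endpoint, self.bucket)
        return {"rows": 0}

class RecordCountSensor(BaseSensorOperator):
    def poke(self, context) -> bool:
        # Return True when the condition is met, False to keep waiting.
        return False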

↑ Back to Table of Contents

SECTION 5: TASKFLOW API

What is TaskFlow API?

TaskFlow API (Airflow 2.0+) is the modern, Pythonic way to write DAGs using decorators. It simplifies DAG creation by using functions instead of operator instantiation.

Comparison: Traditional vs TaskFlow

| Aspect | Traditional | TaskFlow |
| --- | --- | --- |
| Syntax | Verbose operator creation | Clean function decorators |
| Data Passing | Manual XCom push/pull | Implicit via return values |
| Dependencies | Explicit >> operators | Implicit via parameters |
| Code Length | More lines, repetitive | Fewer lines, concise |
| Type Hints | Optional | Strongly recommended |
| Learning Curve | Moderate | Easier for Python devs |
| IDE Support | Basic | Excellent (autocomplete) |

Key Decorators

@dag

  • Decorates function that contains tasks
  • Function body defines workflow
  • Returns DAG instance
  • Replaces DAG() class creation

@task

  • Decorates function that becomes a task
  • Function is task logic
  • Return value becomes XCom value
  • Parameters are upstream dependencies

@task_group

  • Groups related tasks together
  • Creates visual subgraph in UI
  • Simplifies large DAGs
  • Can be nested

@task with .expand() (Dynamic Mapping)

  • .expand() is called on a @task-decorated function (it is not a decorator itself) to create one mapped task instance per input item
  • Reduces repetitive task creation
  • Dynamic fan-out pattern
  • Available in Airflow 2.3+ (see the sketch below)
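
A short TaskFlow sketch showing @dag, @task, implicit dependencies, and .expand() for dynamic mapping (assumes Airflow 2.3+; all names are illustrative):

from datetime import datetime
from airflow.decorators import dag, task

@dag(dag_id="taskflow_demo", start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def taskflow_demo():
    @task
    def extract() -> list[int]:
        return [1, 2, 3]

    @task
    def total(values: list[int]) -> int:
        return sum(values)

    @task
    def square(value: int) -> int:
        return value * value

    numbers = extract()
    total(numbers)                 # dependency and XCom handled via the return value
    square.expand(value=numbers)   # dynamic mapping: one task instance per element

taskflow_demo()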

Advantages of TaskFlow

  • Tasks are just Python functions
  • Type hints enable IDE autocomplete
  • Natural data flow through function parameters
  • Automatic XCom handling (no manual push/pull)
  • Cleaner, more Pythonic code
  • Easier testing (function-based)
  • Less boilerplate
  • Better code organization

When to Use TaskFlow

Use for:

  • New DAGs
  • Complex dependencies
  • Python-savvy teams
  • Modern deployments (2.0+)

Alternative if:

  • Legacy Airflow 1.10 requirement
  • Using operators not available as tasks
  • Team unfamiliar with decorators

↑ Back to Table of Contents

SECTION 6: DEFERRABLE OPERATORS & TRIGGERS

The Problem: Resource Waste

Traditional Sensors (Blocking)

  • Sensor holds worker slot while waiting
  • Worker CPU ~5-10% even during idle
  • Memory ~50MB per idle task
  • Can handle ~50-100 concurrent waits
  • Resource waste for long waits (hours/days)

Example: 1000 tasks waiting 1 hour = 1000 worker slots held continuously = massive waste

The Solution: Deferrable Operators

Deferrable Approach (Non-blocking)

  • Task releases worker slot immediately after deferring
  • Minimal CPU usage (~0.1%) during wait
  • Memory ~1MB per deferred task
  • Can handle 1000+ concurrent waits
  • ~95% resource savings compared to traditional

How Deferrable Operators Work

4-Phase Execution:

  1. Execution Phase: Task executes initial logic

    • Setup, validation, prerequisites
    • Then: Defer to trigger
    • Worker slot RELEASED
  2. Deferred Phase: Trigger monitors event

    • Async/event-based monitoring
    • Triggerer service handles
    • Very lightweight
    • No worker slot needed
  3. Completion Phase: Event occurs

    • Trigger detects condition met
    • Returns event data
    • Trigger fires
  4. Resume Phase: Task completes

    • Worker acquires new slot
    • Task resumes with trigger result
    • Final processing
    • Task completes
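
A hedged example using a built-in deferrable sensor; it requires the triggerer service (airflow triggerer) to be running, and the two-hour wait is purely illustrative:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.sensors.time_delta import TimeDeltaSensorAsync

with DAG(dag_id="deferrable_demo", start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False) as dag:
    # Defers to the triggerer instead of holding a worker slot for two hours.
    wait_two_hours = TimeDeltaSensorAsync(task_id="wait_two_hours", delta=timedelta(hours=2))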

Available Deferrable Operators

| Operator | Defers For | Best For |
| --- | --- | --- |
| TimeSensorAsync | Specific time | Waiting until business hours |
| S3KeySensorAsync | S3 object | Long-running file uploads |
| ExternalTaskSensorAsync | External DAG | Cross-DAG long waits |
| SqlSensorAsync | SQL condition | Database state changes |
| HttpSensorAsync | HTTP response | External service availability |
| TimeDeltaSensorAsync | Duration | Fixed time wait |

Triggerer Service

  • Manages all active triggers
  • Single lightweight async service per deployment
  • Default capacity: 1000 triggers
  • Requires: airflow triggerer process
  • Monitors events, fires when ready
  • Stateless (can restart without losing state)

When to Use Deferrable

Use when:

  • Wait duration > 5 minutes
  • Many concurrent waits
  • Resource-constrained environment
  • Can tolerate event latency (<1 second)
  • Long-running waits (hours, days)

Don't use when:

  • Wait < 5 minutes (overhead not worth it)
  • Real-time detection critical (sub-second)
  • Simple local operations
  • Frequent detection required

Performance Impact

1000 Tasks Waiting 1 Hour

Traditional: 1000 worker slots Γ— 1 hour = 1000 slot-hours wasted
Deferrable: 0 worker slots, minimal CPU in trigger service

Cost Savings: ~95% for typical scenarios

↑ Back to Table of Contents

SECTION 7: DEPENDENCIES & RELATIONSHIPS

Task Dependencies

Linear Chain: Task A β†’ Task B β†’ Task C β†’ Task D

  • Sequential execution
  • Each waits for previous
  • Easiest to understand
  • Best for: ETL pipelines with stages

Fan-out / Fan-in: One task triggers many, then merge

           β”Œβ”€β”€ Process A ──┐
Extract β”€β”€β”€                β”œβ”€β”€ Merge
           └── Process B β”€β”€β”˜
  • Parallel processing
  • Best for: Multiple independent operations
  • Example: Extract once, transform multiple ways

Branching: Conditional execution

      β”Œβ”€ Success path
Branch ─
      └─ Error path
  • Different paths based on condition
  • Best for: Alternative workflows
  • Example: If data valid, load; else alert
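
A minimal branching sketch using BranchPythonOperator (the validity check and task names are hypothetical; EmptyOperator is the Airflow 2.4+ name for DummyOperator):

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator

def choose_path(**context):
    data_is_valid = True          # placeholder for a real validation check
    return "load" if data_is_valid else "alert"   # return the task_id to follow

with DAG(dag_id="branching_demo", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
    branch = BranchPythonOperator(task_id="branch", python_callable=choose_path)
    load = EmptyOperator(task_id="load")
    alert = EmptyOperator(task_id="alert")
    branch >> [load, alert]       # only the chosen path runs; the other is skipped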

Trigger Rules

Trigger rules determine WHEN downstream tasks execute based on UPSTREAM status.

| Rule | Executes When | Use Case |
| --- | --- | --- |
| all_success | All upstream succeed | Default, sequential flow |
| all_failed | All upstream fail | Cleanup after failures |
| all_done | All upstream complete (any status) | Final reporting |
| one_success | At least one upstream succeeds | Alternative paths |
| one_failed | At least one upstream fails | Error alerts |
| none_failed | No upstream failures (skips OK) | Robust pipelines βœ“ Recommended |
| none_failed_min_one_success (formerly none_failed_or_skipped) | No failures and at least one success | Stricter validation |

Best Practice: Use none_failed for fault-tolerant pipelines.

DAG-to-DAG Dependencies

Methods to Create

| Method | Mechanism | Best For |
| --- | --- | --- |
| ExternalTaskSensor | Poll for completion | Different schedules |
| TriggerDagRunOperator | Actively trigger | Orchestrate workflows |
| Datasets | Event-driven | Data pipelines |

Datasets (Airflow 2.4+)

  • Producer DAG updates dataset
  • Consumer DAG triggered automatically
  • Real-time responsiveness (<1 second)
  • Low latency, efficient
  • Best for data-driven workflows

Example:

  • ETL DAG extracts and updates orders_dataset
  • Analytics DAG triggered automatically when dataset updates
  • No waiting, real-time processing
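
A sketch of that producer/consumer pair, assuming Airflow 2.4+ (the dataset URI, DAG ids, and callables are illustrative):

from datetime import datetime
from airflow import DAG, Dataset
from airflow.operators.python import PythonOperator

orders_dataset = Dataset("s3://data-lake/orders/")   # hypothetical URI

with DAG(dag_id="etl_orders", start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False):
    PythonOperator(
        task_id="load_orders",
        python_callable=lambda: print("orders written"),
        outlets=[orders_dataset],            # marks the dataset as updated on success
    )

with DAG(dag_id="orders_analytics", start_date=datetime(2024, 1, 1), schedule=[orders_dataset], catchup=False):
    PythonOperator(task_id="refresh_dashboard", python_callable=lambda: print("dashboard refreshed"))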

↑ Back to Table of Contents

SECTION 8: SCHEDULING

Overview of Scheduling Methods

Three ways to trigger DAG runs:

  1. Time-based: Specific times/intervals
  2. Event-driven: When data arrives/updates
  3. Manual: User-triggered via UI

Time-Based: Cron Expressions (Legacy)

Cron format: minute hour day month day_of_week

Common Patterns

  • @hourly: Every hour
  • @daily: Every day at midnight
  • @weekly: Every Sunday at midnight
  • @monthly: First day of month
  • 0 9 * * MON-FRI: 9 AM on weekdays
  • */15 * * * *: Every 15 minutes
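
A quick sketch of one of these patterns applied to a DAG (times are interpreted in the DAG's timezone, UTC unless configured otherwise; the schedule parameter assumes Airflow 2.4+):

from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="weekday_report",
    start_date=datetime(2024, 1, 1),
    schedule="0 9 * * MON-FRI",   # 9 AM on weekdays
    catchup=False,
) as dag:
    ...   # tasks go here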

Limitations

  • Simple time-only scheduling
  • Limited to time expressions
  • No business logic
  • Timezone handling tricky

Time-Based: Timetable API (Modern)

Timetables (Airflow 2.2+) provide object-oriented scheduling with unlimited flexibility.

Built-in Timetables

  • CronDataIntervalTimetable: cron-based with data intervals (what a cron schedule string maps to)
  • DeltaDataIntervalTimetable: fixed timedelta intervals
  • CronTriggerTimetable: cron-based without data intervals (Airflow 2.3+)

Custom Timetables Allow

  • Business day scheduling (Mon-Fri)
  • Holiday-aware schedules (skip holidays)
  • Timezone-specific runs
  • Complex business logic
  • End-of-month runs
  • Business hour runs (9 AM - 5 PM)

Advantages

  • Unlimited flexibility
  • Timezone support
  • Readable code
  • Testable
  • Documented rationale

Event-Driven: Datasets (Airflow 2.4+)

Datasets enable data-driven workflows instead of time-driven.

How it Works

  1. Producer task completes and updates dataset
  2. Scheduler detects dataset update
  3. Consumer DAG triggered immediately
  4. No scheduled waiting

Benefits

  • Real-time responsiveness
  • Resource efficient (no idle waits)
  • Data-quality focused
  • Reduced latency (<1 second)
  • Perfect for data pipelines

When to Use

  • Multi-stage ETL (extract β†’ transform β†’ load)
  • Event streaming
  • Real-time processing
  • Data-dependent workflows

Hybrid Scheduling

Combine time-based and event-driven triggers (supported natively via DatasetOrTimeSchedule in Airflow 2.9+):

  • DAG runs on schedule OR when data arrives
  • Whichever comes first
  • Best of both worlds
  • Flexible and responsive

Scheduling Decision Tree

TIME-BASED?
β”œβ”€ YES β†’ Complex logic?
β”‚        β”œβ”€ YES β†’ Custom Timetable
β”‚        └─ NO β†’ Cron Expression
β”‚
└─ NO β†’ Event-driven?
         β”œβ”€ YES β†’ Use Datasets
         └─ NO β†’ Manual trigger or reconsider design

↑ Back to Table of Contents

SECTION 9: FAILURE HANDLING

Retry Mechanisms

Purpose: Allow tasks to recover from transient failures (network timeout, API rate limit, etc.)

Configuration

  • retries: Number of retry attempts
  • retry_delay: Wait time between retries
  • retry_exponential_backoff: Exponentially increasing delays
  • max_retry_delay: Cap on maximum wait

Example Retry Sequence

Attempt 1: Fail β†’ Wait 1 minute
Attempt 2: Fail β†’ Wait 2 minutes (1 Γ— 2^1)
Attempt 3: Fail β†’ Wait 4 minutes (1 Γ— 2^2)
Attempt 4: Fail β†’ Wait 8 minutes (1 Γ— 2^3)
Attempt 5: Fail β†’ Wait 16 minutes (1 Γ— 2^4)
Attempt 6: Fail β†’ Max 1 hour (capped)
Attempt 7: Fail β†’ Task FAILED

Best Practice: 2-3 retries with exponential backoff for most tasks.
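
A hedged example of those settings on a single task (the DAG shell and callable are placeholders):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def call_flaky_api():
    ...   # work that may hit transient errors (timeouts, rate limits)

with DAG(dag_id="retry_demo", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
    fetch = PythonOperator(
        task_id="fetch_from_api",
        python_callable=call_flaky_api,
        retries=3,
        retry_delay=timedelta(minutes=1),
        retry_exponential_backoff=True,          # 1 min, 2 min, 4 min, ...
        max_retry_delay=timedelta(minutes=30),   # cap the backoff
    )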

Trigger Rules (Already Covered in Section 7)

But worth repeating: Use none_failed for robust pipelines.

Callbacks

Execute custom logic when task state changes:

  • on_failure_callback: When task fails
  • on_retry_callback: When task retries
  • on_success_callback: When task succeeds

Common Uses

  • Send notifications (Slack, email, PagerDuty)
  • Log to external monitoring systems
  • Trigger incident management
  • Update dashboards
  • Cleanup resources
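
A minimal failure-callback sketch: the callback receives the task context as a single dict, and the notification here is just a log line (swap in Slack, PagerDuty, etc. as needed; names are illustrative):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_on_failure(context):
    ti = context["task_instance"]
    print(f"Task {ti.task_id} failed on {context['ds']}: {context.get('exception')}")

with DAG(dag_id="callback_demo", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
    critical = PythonOperator(
        task_id="critical_step",
        python_callable=lambda: 1 / 0,           # fails on purpose for illustration
        on_failure_callback=notify_on_failure,
    )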

Error Handling Patterns

Pattern 1: Idempotency

  • Make tasks safe to retry
  • Delete-then-insert instead of insert
  • Same result regardless of executions
  • Safe for retries

Pattern 2: Circuit Breaker

  • Stop retrying after N attempts
  • Raise critical exceptions
  • Alert operations team
  • Prevent cascading failures

Pattern 3: Graceful Degradation

  • Continue with partial data
  • Log warnings
  • Don't fail entire pipeline
  • Better UX than all-or-nothing

Pattern 4: Skip on Condition

  • Raise AirflowSkipException
  • Skip task without marking as failure
  • Allow downstream to continue
  • Best for conditional workflows

↑ Back to Table of Contents

SECTION 10: SLA MANAGEMENT

What is SLA?

SLA (Service Level Agreement) is a guarantee that a task will complete within a specified time window. (Note: the classic SLA mechanism described here applies to Airflow 2.x; it was removed in Airflow 3.0.)

If an SLA is missed:

  • Task marked as "SLA Missed"
  • Notifications triggered
  • Callbacks executed
  • Logged for auditing/metrics

Setting SLAs

At DAG Level: All tasks inherit SLA

from datetime import timedelta

default_args = {
    'sla': timedelta(hours=1)
}

At Task Level: Specific to individual task

task = PythonOperator(
    task_id='critical_task',
    python_callable=check_data,  # required callable (name is illustrative)
    sla=timedelta(minutes=30)
)

Per Task Overrides: Task SLA overrides DAG default
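
SLA misses can also invoke a DAG-level sla_miss_callback; a minimal sketch (the five-argument signature below is the one documented for Airflow 2.x, and the alert logic is illustrative):

def alert_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Called by the scheduler when one or more tasks in the DAG miss their SLA.
    print(f"SLA missed in {dag.dag_id}: {task_list}")

# Attached at the DAG level, e.g. DAG(..., sla_miss_callback=alert_sla_miss)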

SLA Monitoring

Key Metrics

  • SLA compliance rate (%)
  • Average overdue duration
  • Most violated tasks
  • Impact on downstream

Alerts

  • Send notifications when an SLA is at risk (e.g., 80% of the window elapsed)
  • Escalate on missed SLA
  • Create incidents for critical SLAs

Dashboard

  • Track compliance trends
  • Identify problematic tasks
  • Alert history
  • Remediation tracking

SLA Best Practices

| Practice | Benefit |
| --- | --- |
| Set realistic SLAs | Avoid false positives |
| Different tiers | Critical < Normal < Optional |
| Review quarterly | Account for growth |
| Document rationale | Team alignment |
| Escalation policy | Proper incident response |
| Callbacks | Automated notifications |
| Monitor compliance | Identify trends |

↑ Back to Table of Contents

SECTION 11: CATCHUP BEHAVIOR

What is Catchup?

Catchup is automatic backfill of missed DAG runs.

Example Scenario

Start date: 2024-01-01
Today: 2024-05-15
Schedule: @daily (every day)
Catchup: True

RESULT: Airflow creates 135 DAG runs instantly!
(One for each day from Jan 1 to May 14)

Catchup Settings

| Setting | Behavior | When to Use |
| --- | --- | --- |
| catchup=True | Backfill all missed runs | Historical data loading |
| catchup=False | No backfill, start fresh | New pipelines |
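
A short sketch of the usual production default, with catchup disabled and concurrency capped in case it is ever turned on (values are illustrative):

from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="orders_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,       # do not backfill missed runs
    max_active_runs=1,   # at most one run in flight if backfilling later
) as dag:
    ...   # tasks go here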

Catchup Risks

Potential Issues

  • Resource exhaustion (CPU, memory, database)
  • Network bandwidth spike
  • Third-party API rate limits exceeded
  • Database connection pool exhausted
  • Worker node crashes
  • Scheduler paralysis (busy processing backfill)

Safe Catchup Strategy

Step 1: Use Separate Backfill DAG

  • Main DAG: catchup=False (normal operations)
  • Backfill DAG: catchup=True, end_date set (historical only)
  • Run backfill during maintenance window

Step 2: Control Parallelism

  • max_active_runs: Limit concurrent DAG runs (1-3)
  • max_active_tasks_per_dag: Limit concurrent tasks (16-32)
  • Prevents resource exhaustion

Step 3: Staged Approach

  • Test with subset of dates first
  • Monitor resource usage
  • Gradually increase pace
  • Have rollback plan

Step 4: Validation

  • Verify data completeness
  • Check for duplicates
  • Validate transformations
  • Compare with expected results

Catchup Checklist

Before enabling catchup, verify:

  • Database capacity available
  • Worker resources adequate
  • Source data exists for all periods
  • Data schema hasn't changed
  • No API rate limits exceeded
  • Network bandwidth available
  • Stakeholders informed
  • Rollback procedure ready

↑ Back to Table of Contents

SECTION 12: CONFIGURATION

Configuration Hierarchy

Airflow reads configuration in this order (first match wins):

  1. Environment Variables (AIRFLOW__{SECTION}__{KEY}, e.g. AIRFLOW__CORE__PARALLELISM)
  2. airflow.cfg file (main config)
  3. Built-in defaults

Individual commands can also override behavior at runtime via CLI flags.

Critical Settings

| Setting | Purpose | Production Value |
| --- | --- | --- |
| sql_alchemy_conn | Metadata database | PostgreSQL connection |
| executor | Task execution | CeleryExecutor |
| parallelism | Max tasks globally | 32-128 |
| max_active_runs_per_dag | Concurrent DAG runs | 1-3 |
| max_active_tasks_per_dag | Concurrent tasks | 16-32 |
| remote_logging | Centralized logs | True (to S3) |
| email_backend | Email provider | SMTP |
| base_log_folder | Log storage | NFS mount |

Database Configuration

Production Database

  • PostgreSQL 12+ (recommended) or MySQL 8+
  • Dedicated database server (not shared)
  • Daily backups automated
  • Connection pooling enabled
  • Monitoring for slow queries
  • Regular maintenance (VACUUM, ANALYZE)

Why Not SQLite

  • Not concurrent-access safe
  • Locks on writes
  • Single machine only
  • Not for production

Executor Configuration

| Executor | Where Configured | Best For |
| --- | --- | --- |
| LocalExecutor | [core] executor | Development |
| CeleryExecutor | [core] executor + [celery] settings | Distributed, high volume |
| KubernetesExecutor | [core] executor + [kubernetes_executor] settings | Cloud-native |
| SequentialExecutor | [core] executor (default with SQLite) | Testing only |

Performance Tuning

Database

  • Connection pooling (pool_size=5, max_overflow=10)
  • Indexes on frequently queried columns
  • Archive old logs/task instances
  • Query optimization

Scheduler

  • Tune the DAG file parsing interval (min_file_process_interval) for your load
  • Optimize DAG file count
  • Use pool resources wisely

Web Server

  • Cache results
  • Limit search scope
  • Monitor response times

↑ Back to Table of Contents

SECTION 13: AWS INTEGRATION

Connection Setup

Three Methods

  1. Airflow UI: Admin β†’ Connections β†’ Create
  2. Environment Variables: AIRFLOW_CONN_AWS_DEFAULT
  3. Programmatic: Python script creating Connection objects

Authentication Types

  • Access Key / Secret Key (discouraged)
  • IAM Roles (recommended for EC2)
  • AssumeRole (cross-account access)
  • Temporary credentials

Common AWS Services

| Service | What It Does | Use Case |
| --- | --- | --- |
| S3 | File storage | Data lake, logs, backups |
| Redshift | Data warehouse | Analytics, aggregations |
| Lambda | Serverless compute | Lightweight processing |
| SNS | Notifications | Alerts, messaging |
| SQS | Message queue | Task decoupling |
| Glue | ETL service | Spark/Python jobs |
| RDS/Aurora | Managed database | Transactional data |
| EMR | Big data | Spark, Hadoop clusters |
| CloudWatch | Monitoring | Metrics, logs, alarms |

Operators for AWS Services

S3 Operations

  • Upload/download files
  • Copy within S3
  • List objects
  • Delete files
  • Check existence

Redshift Operations

  • Execute SQL queries
  • Load data from S3
  • Create/delete snapshots
  • Pause/resume clusters

Lambda Operations

  • Invoke functions
  • Async execution
  • Sync execution with polling

SQS/SNS Operations

  • Send messages to queue
  • Receive from queue
  • Publish to topics
  • Batch operations
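
A hedged sketch of an S3 upload from a Python callable using S3Hook (requires the apache-airflow-providers-amazon package; the bucket, key layout, and connection id are hypothetical):

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def upload_report(**context):
    hook = S3Hook(aws_conn_id="aws_default")
    hook.load_string(
        string_data="order_id,total\n1,99.50",
        key=f"reports/{context['ds']}/summary.csv",
        bucket_name="my-data-lake",
        replace=True,
    )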

AWS Best Practices

| Practice | Benefit |
| --- | --- |
| Use IAM roles | No credentials in code |
| Secrets Manager | Centralized credential storage |
| Enable CloudTrail | Audit all API calls |
| Use VPC endpoints | S3 access without internet |
| Monitor costs | CloudWatch cost alerts |
| Implement budgets | Prevent overspending |
| Regular backups | Disaster recovery |
| Region strategy | Latency, compliance |

↑ Back to Table of Contents

SECTION 14: VERSION COMPARISON

Version Timeline

1.10.x (Deprecated, EOL 2021)
    ↓ Major Redesign
2.0 (Dec 2020) - TaskFlow API introduced
    ↓
2.2 (2021) - Deferrable operators & custom timetables
    ↓
2.4 (Sept 2022) - Datasets / event-driven scheduling
    ↓
2.5+ (Mature, production-ready)
    ↓
3.0 (2025) - Performance focus, async improvements

Key Feature Comparison

| Feature | 1.10 | 2.x | 3.0 |
| --- | --- | --- | --- |
| TaskFlow API | ❌ | βœ… | βœ… |
| Deferrable Ops | ❌ | βœ… (2.2+) | βœ… |
| Datasets | ❌ | βœ… (2.4+) | βœ… (renamed Assets) |
| Performance | Baseline | 3-5x faster | 5-6x faster |
| Python 3.10 support | ❌ | βœ… | βœ… |
| Async/Await | ❌ | Limited | βœ… Full |

Performance Improvements

| Metric | 1.10 | 2.5 | 3.0 |
| --- | --- | --- | --- |
| DAG parsing (1000 DAGs) | ~120s | ~45s | ~15s |
| Task scheduling latency | ~5-10s | ~2-3s | <1s |
| Memory (idle scheduler) | ~500MB | ~300MB | ~150MB |
| Web UI response | ~800ms | ~200ms | ~50ms |

Migration Path

From 1.10 to 2.x

  • Python 3.6+ required
  • Import paths changed (reorganized)
  • Operators moved to providers
  • Test in staging thoroughly
  • Gradual production rollout (1-2 weeks)

From 2.x to 3.0

  • Most DAGs work without changes
  • Drop Python < 3.10
  • Update async patterns
  • Leverage new performance features
  • Simpler migration (backward compatible)

Version Selection

| Scenario | Recommended |
| --- | --- |
| New projects | 2.5+ or 3.0 |
| Production (stable) | 2.4.x |
| Enterprise | 2.5+ |
| Performance-critical | 3.0 |
| Learning | 2.5+ or 3.0 |

↑ Back to Table of Contents

SECTION 15: BEST PRACTICES

DAG Design DO's

βœ“ Use TaskFlow API for modern code
βœ“ Keep tasks small and focused
βœ“ Name tasks descriptively
βœ“ Document complex logic
βœ“ Use type hints
βœ“ Add comprehensive logging
βœ“ Handle errors gracefully
βœ“ Test DAGs before production

DAG Design DON'Ts

βœ— Put heavy processing in DAG definition
βœ— Create 1000s of tasks per DAG
βœ— Hardcode credentials
βœ— Ignore data validation
βœ— Use DAG-level database connections
βœ— Forget about idempotency
βœ— Skip error handling
βœ— Assume tasks always succeed

Task Design DO's

βœ“ Make tasks idempotent
βœ“ Set appropriate timeouts
βœ“ Implement retries with backoff
βœ“ Use pools for resource control
βœ“ Skip tasks on errors (when appropriate)
βœ“ Log important events
βœ“ Clean up resources

Task Design DON'Ts

βœ— Assume tasks always succeed
βœ— Run long operations without timeout
βœ— Hold resources unnecessarily
βœ— Store large data in XCom
βœ— Use blocking operations
βœ— Ignore upstream failures
βœ— Forget cleanup

Scheduling DO's

βœ“ Match frequency to data needs
βœ“ Account for upstream dependencies
βœ“ Plan for failures
βœ“ Test schedule changes
βœ“ Document rationale
βœ“ Monitor adherence

Scheduling DON'Ts

βœ— Schedule too frequently
βœ— Ignore resource constraints
βœ— Forget about catchup
βœ— Use ambiguous expressions
βœ— Create tight couplings
βœ— Skip monitoring

↑ Back to Table of Contents

SECTION 16: MONITORING

Key Metrics

| Metric | What It Shows | Alert Threshold |
| --- | --- | --- |
| Scheduler heartbeat | Scheduler running | >30s = ALERT |
| Task failure rate | Pipeline health | >5% = INVESTIGATE |
| SLA violations | Performance issues | >10/day = REVIEW |
| DAG parsing time | System load | >60s = OPTIMIZE |
| Task queue depth | Worker capacity | >1000 = ADD WORKERS |
| DB connections | Connection health | >80% = INCREASE |
| Log size | Storage usage | Monitor growth |

Common Issues & Solutions

| Issue | Likely Cause | Solution |
| --- | --- | --- |
| Tasks stuck | Worker crashed | Restart workers |
| Scheduler lagging | High load | Add resources |
| Memory leaks | Bad code | Profile and fix |
| Database slow | No indexes | Optimize queries |
| Triggers not firing | Triggerer down | Check service |
| SLA misses | Unrealistic goals | Adjust SLA |
| Catchup explosion | Too many runs | Limit with max_active_runs |

Monitoring Stack Components

Metrics Collection

  • Prometheus for collection
  • Custom StatsD metrics
  • CloudWatch for AWS

Visualization

  • Grafana dashboards
  • CloudWatch dashboards
  • Custom visualizations

Alerting

  • Threshold-based alerts
  • Anomaly detection
  • Escalation policies
  • Integration with incident systems

Logging

  • Centralized logs (ELK, Splunk)
  • Remote logging to S3
  • Log aggregation
  • Searchable archives

↑ Back to Table of Contents

SECTION 17: CONNECTIONS AND SECRETS

What are Connections?

Connections store authentication credentials to external systems. Instead of hardcoding database passwords or API keys in DAGs, Airflow stores these centrally.

Components:

  • Connection ID: Unique identifier (e.g., postgres_default)
  • Connection Type: System type (postgres, aws, http, mysql, etc.)
  • Host: Server address
  • Login: Username
  • Password: Secret credential
  • Port: Service port
  • Schema: Database or default path
  • Extra: JSON for additional parameters

Why Use Connections?

  • Credentials not in DAG code (can't accidentally commit to git)
  • Centralized credential management
  • Easy credential rotation
  • Different credentials per environment (dev, staging, prod)
  • Audit trail of access

Connection Types

| Type | Used For | Example |
| --- | --- | --- |
| postgres | PostgreSQL databases | Analytics warehouse |
| mysql | MySQL databases | Transactional data |
| snowflake | Snowflake warehouse | Cloud data platform |
| aws | AWS services (S3, Redshift, Lambda) | Cloud infrastructure |
| gcp | Google Cloud services | BigQuery, Cloud Storage |
| http | REST APIs | External services |
| ssh | Remote server access | Bastion hosts |
| slack | Slack messaging | Notifications |
| sftp | SFTP file transfer | Remote servers |

What are Secrets/Variables?

Variables store non-sensitive configuration values.
Secrets store sensitive data (marked as secret).

| Aspect | Variables | Secrets |
| --- | --- | --- |
| Purpose | Configuration settings | Passwords, API keys |
| Visibility | Visible in UI | Masked/hidden in UI |
| Storage | Metadata database | External backend (Vault, etc.) |
| Example | batch_size, email | api_key, db_password |

How to Add Connections

Method 1: Web UI (Easiest)

  • Airflow UI β†’ Admin β†’ Connections β†’ Create Connection
  • Fill form with credentials
  • Click Save

Method 2: Environment Variables

Format: AIRFLOW_CONN_{CONNECTION_ID}={CONN_TYPE}://{USER}:{PASSWORD}@{HOST}:{PORT}/{SCHEMA}

Example:

export AIRFLOW_CONN_POSTGRES_DEFAULT=postgresql://user:password@localhost:5432/mydb
export AIRFLOW_CONN_AWS_DEFAULT=aws://ACCESS_KEY:SECRET_KEY@

Method 3: Python Script

Create Connection objects programmatically and add to database.

Method 4: Airflow CLI

airflow connections add postgres_prod \
    --conn-type postgres \
    --conn-host localhost \
    --conn-login user \
    --conn-password password

How to Add Variables/Secrets

Method 1: Web UI

  • Airflow UI β†’ Admin β†’ Variables β†’ Create Variable
  • Enter Key and Value
  • Values are masked automatically when the key contains a sensitive word such as "secret", "password", or "api_key"
  • Click Save

Method 2: Environment Variables

export AIRFLOW_VAR_BATCH_SIZE=1000
export AIRFLOW_VAR_API_KEY=secret123

Method 3: Python Script

Create Variable objects and add to database.

Method 4: Airflow CLI

airflow variables set batch_size 1000
airflow variables set api_key secret123

Using Connections in DAGs

Connections are accessed through Hooks (connection abstraction layer).

from airflow.providers.postgres.hooks.postgres import PostgresHook

def query_database(**context):
    hook = PostgresHook(postgres_conn_id='postgres_default')
    result = hook.get_records("SELECT * FROM users")
    return result

Benefits of Hooks:

  • Encapsulate connection logic
  • Reusable across tasks
  • Handle connection pooling
  • Error handling built-in
  • No credentials in task code

Using Variables in DAGs

Variables are accessed directly:

from airflow.models import Variable

batch_size = Variable.get('batch_size', default_var=1000)
api_key = Variable.get('api_key')
is_production = Variable.get('is_production', default_var='false')

# With JSON parsing
config = Variable.get('config', deserialize_json=True)

Secrets Backend: Advanced Security

For production, use external secrets backends instead of storing in Airflow database.

| Backend | Security Level | Best For |
| --- | --- | --- |
| Metadata DB | Low | Development only |
| AWS Secrets Manager | High | AWS deployments |
| HashiCorp Vault | Very High | Enterprise |
| Google Secret Manager | High | GCP deployments |
| Azure Key Vault | High | Azure deployments |

How it works:

  • Configure Airflow to use external backend
  • Airflow queries backend when credential needed
  • Secrets never stored in database
  • Automatic rotation supported
  • Audit logging in backend

Security Best Practices

DO βœ…

  • Use external secrets backend in production
  • Rotate credentials quarterly
  • Use IAM roles instead of hardcoded keys
  • Encrypt connections in transit (SSL/TLS)
  • Encrypt secrets at rest (database encryption)
  • Limit who can see connections (RBAC)
  • Use short-lived credentials (temporary tokens)
  • Audit access to secrets
  • Never commit credentials to git

DON'T ❌

  • Hardcode credentials in DAG code
  • Store passwords in comments or docstrings
  • Share credentials via Slack/Email
  • Use default passwords
  • Store credentials in config files in git
  • Forget to revoke old credentials
  • Use same credentials everywhere
  • Log credential values

Common Issues

| Issue | Cause | Fix |
| --- | --- | --- |
| Connection refused | Service not running | Start the service |
| Authentication failed | Wrong credentials | Verify in source system |
| Host not found | Wrong hostname/DNS | Check network, DNS resolution |
| Timeout | Network unreachable | Check firewall, routing |
| Variable not found | Wrong key name | Check variable key spelling |
| Permission denied | User lacks access | Check user RBAC permissions |

Testing Connections

Via Web UI:

  • Navigate to Admin β†’ Connections
  • Click on connection
  • Click "Test Connection"

Via CLI:

airflow connections test postgres_default
airflow connections test aws_default

Via Python:

from airflow.providers.postgres.hooks.postgres import PostgresHook

hook = PostgresHook(postgres_conn_id='postgres_default')
hook.get_conn()  # Will raise error if connection fails

Production Setup

Minimum Requirements:

βœ… All connections stored centrally (not hardcoded)
βœ… Credentials in external secrets backend
βœ… RBAC enabled (limit connection visibility)
βœ… No credentials in DAG code
βœ… No credentials in git history
βœ… Quarterly credential rotation
βœ… SSL/TLS enabled for external connections
βœ… Audit logging for credential access
βœ… Monitoring/alerts for failed authentication

↑ Back to Table of Contents

SUMMARY & CHECKLIST

Production Deployment Checklist

INFRASTRUCTURE
β–‘ PostgreSQL 12+ database
β–‘ Separate scheduler machine
β–‘ Multiple web server instances (load balanced)
β–‘ Multiple worker machines (if Celery)
β–‘ Message broker (Redis/RabbitMQ)
β–‘ Triggerer service running
β–‘ NFS for logs (centralized)
β–‘ Backup systems

CONFIGURATION
β–‘ Connection pooling configured
β–‘ Resource pools created
β–‘ Email/Slack configured
β–‘ Remote logging to S3/cloud
β–‘ Monitoring setup (Prometheus/Grafana)
β–‘ Alert thresholds defined
β–‘ Timezone configured correctly

SECURITY
β–‘ RBAC enabled in UI
β–‘ Credentials in Secrets Manager
β–‘ SSL/TLS enabled
β–‘ Database encryption at rest
β–‘ VPC configured
β–‘ Firewall rules tight
β–‘ Regular security updates
β–‘ No credentials in DAG code

RELIABILITY
β–‘ Metadata backup automated (daily)
β–‘ DAG code in Git repository
β–‘ Disaster recovery plan documented
β–‘ Rollback procedure tested
β–‘ SLA enforcement active
β–‘ Callback notifications configured
β–‘ Data validation implemented

MONITORING
β–‘ Scheduler health check
β–‘ Worker heartbeat monitoring
β–‘ Database performance monitoring
β–‘ Trigger service health
β–‘ DAG/task failure alerts
β–‘ SLA violation alerts
β–‘ Resource usage dashboards
β–‘ Cost tracking (if cloud)

PERFORMANCE
β–‘ Database indexes optimized
β–‘ Connection pool sized appropriately
β–‘ Max active runs limited
β–‘ Task priorities set
β–‘ Deferrable operators for long waits
β–‘ Sensor reschedule mode where deferrable versions aren't available (Smart Sensors were removed in Airflow 2.4)
β–‘ DAG parsing optimized

DOCUMENTATION
β–‘ DAG documentation complete
β–‘ SLA rationale documented
β–‘ Runbooks created for issues
β–‘ Team trained on platform
β–‘ Changes logged in git
β–‘ Disaster recovery plan shared
β–‘ Escalation procedures documented

↑ Back to Table of Contents

RECOMMENDATIONS

For New Projects Starting Today

  1. Choose Airflow 2.5+ or 3.0 (not 1.10)
  2. Use TaskFlow API (modern, clean code)
  3. Use Timetable API for scheduling (not cron strings)
  4. Use Deferrable operators for long waits
  5. Plan for scale from day one (don't start with LocalExecutor)
  6. Set up monitoring immediately (not later)
  7. Use event-driven scheduling (Datasets) where possible
  8. Store credentials externally (Secrets Manager, not hardcoded)

For Teams Upgrading from 1.10

  1. Plan 2-3 months for migration (not quick)
  2. Test thoroughly in staging environment
  3. Keep 1.10 backup for emergency rollback
  4. Update DAGs incrementally (not all at once)
  5. Monitor closely first month post-upgrade
  6. Document breaking changes for team
  7. Celebrate milestones (keep team motivated)

For Large Enterprise Deployments

  1. Use Kubernetes executor (scalability, isolation)
  2. Implement comprehensive monitoring (critical for troubleshooting)
  3. Set up disaster recovery (business continuity)
  4. Regular chaos engineering tests (find weak points before production)
  5. Capacity planning quarterly (predict growth)
  6. Performance optimization ongoing (never "done")
  7. Security audits regular (compliance, risk)

For Data Pipeline Use Cases

  1. Use event-driven scheduling (Datasets, not time-based)
  2. Implement data quality checks (validate early)
  3. Monitor SLAs closely (data freshness critical)
  4. Implement idempotent tasks (safe retries)
  5. Plan catchup carefully (historical data loads)
  6. Version control everything (DAGs, configurations)
  7. Use separate environment (dev, staging, prod)

↑ Back to Table of Contents

CONCLUSION

The Airflow Journey

Apache Airflow has evolved from a simple scheduler into a mature, production-grade workflow orchestration platform. The progression from version 1.10 through modern versions demonstrates significant improvements in usability, performance, flexibility, and reliability.

Key Takeaways

Architecture & Design

  • TaskFlow API provides clean, Pythonic DAG definitions
  • Deferrable operators enable massive resource savings for long-running waits
  • Plugins allow endless extensibility without modifying core

Scheduling & Flexibility

  • Choose appropriate scheduling method (time, event, or hybrid)
  • Event-driven scheduling (Datasets) perfect for data pipelines
  • Timetable API enables business logic in scheduling

Reliability & Operations

  • Retries with exponential backoff handle transient failures
  • Proper error handling prevents cascading failures
  • SLAs enforce performance agreements
  • Comprehensive monitoring catches issues early

Scalability & Performance

  • Multiple executor options for different scales
  • Deferrable operators handle 1000s of concurrent waits
  • Performance improvements: 1.10 β†’ 2.5 β†’ 3.0 shows 5-6x speedup

Real-World Success

  • Airflow powers ETL/ELT pipelines at hundreds of companies
  • Used for data engineering, ML orchestration, DevOps automation
  • Solves real problems: dependency management, scheduling, monitoring

Starting Your Airflow Journey

First Step: Understand architecture and core concepts (DAGs, operators, scheduling)
Next Step: Build simple DAGs with TaskFlow API
Then: Add error handling, retries, monitoring
Finally: Optimize for scale, implement production patterns

Airflow is Not

  • A replacement for application code
  • Suitable for real-time streaming (use Kafka, Spark Streaming)
  • A replacement for data warehouses (use Snowflake, BigQuery)
  • Simple enough to learn in an hour (takes weeks/months)

Airflow is Perfect For

  • Scheduled batch processing (daily, hourly, etc.)
  • Multi-stage ETL/ELT workflows
  • Cross-system orchestration
  • Data quality validation
  • DevOps automation
  • ML pipeline orchestration
  • Report generation

Final Wisdom

Success with Airflow depends not just on technical setup, but on organizational practices:

  • Treat DAGs as code (version control, code review, testing)
  • Monitor continuously (you can't fix what you don't see)
  • Document thoroughly (future you and teammates will thank you)
  • Automate responses (reduce manual incident response)
  • Iterate and improve (perfect is enemy of done)

Apache Airflow's strength lies in its flexibility, active community, and ecosystem. Whether you're building simple daily pipelines or complex multi-stage data workflows orchestrating hundreds of systems, Airflow provides the tools, patterns, and reliability to succeed.

Start small. Learn thoroughly. Scale thoughtfully. Monitor obsessively.

