Data Tech Bridge

Apache Airflow: Complete Guide for Basic to Advanced Developers

The Beginning of Your Airflow Adventure πŸš€β˜ΈοΈ

πŸ“š Table of Contents

PART 1: FUNDAMENTALS

PART 2: ADVANCED CONCEPTS

PART 3: OPERATIONS & DEPLOYMENT

FINAL SECTIONS


SECTION 1: INTRODUCTION

What is Apache Airflow?

Apache Airflow is a workflow orchestration platform that allows you to define, schedule, and monitor complex data pipelines as code. Instead of using UI-based workflow builders, Airflow uses Python to define Directed Acyclic Graphs (DAGs) representing your workflows.

Key Benefits

  • Programmatic workflows: Define complex pipelines in Python with full control
  • Scalable: From single-machine setups to distributed Kubernetes deployments
  • Extensible: Create custom operators, sensors, and hooks without modifying core
  • Rich monitoring: Web UI with real-time pipeline visibility and control
  • Flexible scheduling: Time-based, event-driven, or custom business logic schedules
  • Reliable: Built-in retry logic, SLAs, and comprehensive failure handling
  • Community-driven: Active ecosystem with hundreds of provider integrations

Why Airflow Over Alternatives?

| Alternative | Why Airflow is Better |
| --- | --- |
| Cron scripts | Better reliability, monitoring, dependency management, version control |
| Shell scripts | Testable, debuggable, proper error handling, team collaboration |
| Commercial tools | Open source, customizable, no vendor lock-in, self-hosted |
| Cloud-native tools | Language agnostic, can self-host, avoid provider lock-in |

Who Uses Airflow?

  • Data engineers building ETL/ELT pipelines
  • ML engineers orchestrating training workflows
  • DevOps teams managing infrastructure automation
  • Analytics teams automating reports
  • Any organization with complex scheduled workflows

↑ Back to Table of Contents

SECTION 2: ARCHITECTURE

Core Components Overview

Apache Airflow consists of several interconnected services working together:

Web Server

  • Provides REST API and web UI
  • Runs on port 8080 by default
  • Allows manual DAG triggering, viewing logs, and managing connections
  • Multiple instances can be deployed for high availability
  • Stateless (can run on different machines)

Scheduler

  • Monitors all DAGs and determines when tasks should execute
  • Re-parses DAG files continuously (each file at least every 30 seconds by default; new files are discovered within about 5 minutes)
  • Handles task dependency resolution and sequencing
  • Manages retry logic and failure callbacks
  • Multiple scheduler instances can run in parallel for high availability (Airflow 2.0+)
  • Continuously runs in background

Executor

  • Determines HOW tasks are physically executed
  • Different executors for different deployment models:
    • LocalExecutor: Single machine, limited parallelism (~10-50 tasks)
    • SequentialExecutor: One task at a time (testing only)
    • CeleryExecutor: Distributed across worker nodes (100s of tasks)
    • KubernetesExecutor: Each task runs in separate pod (elastic scaling)

Metadata Database

  • Central repository for all Airflow state
  • Stores: DAG definitions, task instance states, variables, connections, logs metadata
  • Should be PostgreSQL 12+ or MySQL 8+ in production
  • SQLite is for development only (no safe concurrent access)
  • Requires daily backups and monitoring

Message Broker (for distributed setups)

  • Redis or RabbitMQ
  • Queues tasks from Scheduler to Workers
  • Enables horizontal scaling
  • Required only for CeleryExecutor

Triggerer Service (Airflow 2.2+)

  • Manages all deferrable/async tasks
  • Single lightweight async service per deployment
  • Can handle 1000+ concurrent deferred tasks
  • Monitors events efficiently
  • Fires triggers when conditions met

Execution Flow Diagram

1. DAG File Loaded
   ↓
2. Scheduler Parses
   ↓
3. Check Task Ready?
   β”œβ”€ Dependencies met?
   β”œβ”€ Start date passed?
   β”œβ”€ Not already running?
   ↓
4. Task State: SCHEDULED
   ↓
5. Queue to Message Broker
   ↓
6. Worker Picks Up Task
   ↓
7. Execution Phase
   β”œβ”€ Run operator code
   β”œβ”€ OR Defer to Trigger
   ↓
8. Log Output
   ↓
9. Record Result
   β”œβ”€ Success β†’ COMPLETED
   β”œβ”€ Failure β†’ FAILED
   β”œβ”€ Skipped β†’ SKIPPED
   ↓
10. Execute Callback (if configured)
    ↓
11. Mark downstream ready
    ↓
12. Update Dashboard

Architecture Patterns

Single Machine Setup

  • One server with all components
  • LocalExecutor
  • SQLite database (acceptable for development)
  • Best for: Development, testing, small workflows

Distributed Setup

  • Separate scheduler machine
  • Multiple worker machines
  • CeleryExecutor
  • PostgreSQL database
  • Redis message broker
  • Best for: Production, high volume, 100+ daily DAGs

Kubernetes Native

  • Scheduler pod
  • Web server pod (multiple replicas)
  • KubernetesExecutor
  • Managed PostgreSQL (Cloud SQL, RDS)
  • Dynamic pod creation per task
  • Best for: Cloud-native, elastic scaling, containerized workflows

↑ Back to Table of Contents

SECTION 3: DAG & BUILDING BLOCKS

What is a DAG?

A DAG (Directed Acyclic Graph) is the fundamental unit in Airflow. It represents:

  • Directed: Tasks execute in a specific direction (flow from A β†’ B β†’ C)
  • Acyclic: No circular dependencies allowed (prevents infinite loops)
  • Graph: Visual representation of tasks and relationships
  • Configurable: Start date, schedule, retries, timeouts, SLAs, etc.

Key DAG Properties

| Property | Purpose | Example |
| --- | --- | --- |
| dag_id | Unique identifier | 'daily_etl' |
| start_date | First execution date | datetime(2024, 1, 1) |
| schedule | When to run | '@daily', CronTimetable, Dataset |
| catchup | Backfill missed runs | True/False |
| max_active_runs | Concurrent runs allowed | 1-5 |
| default_args | Task defaults | retries, timeout, owner |
| description | What the DAG does | "Daily customer ETL" |
| tags | Categorization | ['etl', 'daily'] |
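
Here is a minimal sketch of how these properties fit together (the dag_id, tags, and other values are illustrative, and the schedule parameter assumes Airflow 2.4+, where it replaces schedule_interval):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_etl",                      # unique identifier
    start_date=datetime(2024, 1, 1),         # first logical date
    schedule="@daily",                       # when to run
    catchup=False,                           # don't backfill missed runs
    max_active_runs=1,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5), "owner": "data-team"},
    description="Daily customer ETL",
    tags=["etl", "daily"],
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")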

Operators (Task Types)

Operators are classes that execute actual work. Each operator is a reusable task that does something specific.

Common Built-in Operators

| Operator | What It Does | Common Parameters |
| --- | --- | --- |
| PythonOperator | Executes Python function | python_callable, op_kwargs |
| BashOperator | Runs bash commands | bash_command |
| SqlOperator | Executes SQL query | sql, conn_id |
| EmailOperator | Sends emails | to, subject, html_content |
| S3Operator | S3 file operations | bucket, key, source/dest |
| HttpOperator | Makes HTTP requests | endpoint, method, data |
| SlackOperator | Sends Slack messages | text, channel, token |
| BranchOperator | Conditional branching | python_callable returning task_id |
| DummyOperator | Placeholder/no-op | Used for flow control |

When to Use Each

  • PythonOperator: Data processing, API calls, complex logic
  • BashOperator: Run scripts, system commands, existing tools
  • SqlOperator: Database transformations, queries
  • EmailOperator: Notifications, reports
  • S3Operator: Data movement, file management
  • HttpOperator: External service integration
  • BranchOperator: Conditional workflows
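
As a rough sketch, here is how two of these operators are typically wired together inside a DAG (the DAG id, command, and callable are illustrative):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform(**context):
    # Placeholder for real transformation logic
    print("transforming data")

with DAG(dag_id="operator_demo", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
    download = BashOperator(task_id="download", bash_command="echo downloading")
    process = PythonOperator(task_id="process", python_callable=transform)
    download >> process   # BashOperator runs first, then the PythonOperator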

Sensors (Waiting for Conditions)

Sensors are operators that wait for something to happen. They poll a condition until it becomes true, blocking further execution.

Common Sensors

| Sensor | Waits For | Use Case |
| --- | --- | --- |
| FileSensor | File creation on disk | Wait for uploaded data |
| S3KeySensor | S3 object existence | Wait for data in S3 |
| SqlSensor | SQL query condition | Wait for record count |
| TimeSensor | Specific time of day | Wait until 2 PM |
| ExternalTaskSensor | External DAG task | Cross-DAG dependency |
| HttpSensor | HTTP endpoint response | Wait for API ready |
| TimeDeltaSensor | Time duration elapsed | Wait 30 minutes |

Sensor Behavior

  • Poke Mode (default): Worker thread continuously checks condition, holds worker slot
  • Reschedule Mode: Worker released, task rescheduled when ready
  • Smart Mode: Multiple sensors batched for efficiency (Smart Sensors are deprecated and were removed in Airflow 2.4; prefer deferrable operators)

Performance Consideration: Many idle sensors waste resources. Use deferrable operators for long waits.
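
A hedged example of a sensor configured for a longer wait, using reschedule mode so the worker slot is freed between checks (the file path and connection id are hypothetical):

from datetime import datetime
from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(dag_id="sensor_demo", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/data/incoming/orders.csv",  # hypothetical path
        fs_conn_id="fs_default",
        poke_interval=300,        # check every 5 minutes
        timeout=6 * 60 * 60,      # give up after 6 hours
        mode="reschedule",        # release the worker slot between pokes
    )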

Hooks (Connection Abstractions)

Hooks encapsulate connection and authentication logic to external systems. They're the reusable layer between operators and external services.

Common Hooks

| Hook | Connects To | Purpose |
| --- | --- | --- |
| PostgresHook | PostgreSQL | Query execution |
| MySqlHook | MySQL | Database operations |
| S3Hook | AWS S3 | File operations |
| HttpHook | REST APIs | HTTP requests |
| SlackHook | Slack | Send messages |
| SSHHook | Remote servers | Execute commands |
| DockerHook | Docker daemon | Container operations |

Benefits of Hooks

  • Centralize credential management (stored in Airflow connections)
  • Reusable across multiple tasks
  • No credentials in DAG code
  • Easier testing (mock connections)
  • Connection pooling support
  • Error handling abstraction

Tasks

A task is an instance of an operator within a DAG. It's the smallest unit of work.

Task Configuration

| Setting | Purpose | Example |
| --- | --- | --- |
| task_id | Unique identifier | 'extract_customers' |
| retries | Retry attempts on failure | 2-3 |
| retry_delay | Wait between retries | timedelta(minutes=5) |
| execution_timeout | Maximum execution time | timedelta(hours=1) |
| pool | Concurrency control | 'default_pool' |
| priority_weight | Execution priority | 1-100 (higher = more urgent) |
| trigger_rule | When to execute | 'all_success' (default) |
| depends_on_past | Depend on previous run | False (usually) |

XCom (Cross-Communication)

XCom (cross-communication) allows tasks to pass data to each other.

How XCom Works

  • Tasks push values: task_instance.xcom_push(key='count', value=1000)
  • Tasks pull values: task_instance.xcom_pull(task_ids='upstream', key='count')
  • Automatically stored in metadata database
  • Serialized to JSON by default
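
A small sketch of explicit push and pull between two PythonOperator callables (the task ids 'extract' and 'report' are hypothetical; 'ti' is the TaskInstance from the context):

def extract(**context):
    context["ti"].xcom_push(key="count", value=1000)

def report(**context):
    count = context["ti"].xcom_pull(task_ids="extract", key="count")
    print(f"extracted {count} rows")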

XCom Limitations

  • Database-backed (not for large data)
  • Intended for metadata, not big data
  • Consider external storage for large datasets
  • Can cause database bloat if overused

↑ Back to Table of Contents

SECTION 4: PLUGINS

What are Plugins?

Plugins allow you to extend Airflow with custom components without modifying core code. They're organized Python modules that register custom functionality.

Types of Plugins

| Plugin Type | Purpose | When to Create |
| --- | --- | --- |
| Custom Operators | New task types | Business-specific operations |
| Custom Sensors | New wait conditions | Proprietary systems |
| Custom Hooks | New connections | Proprietary APIs |
| Custom Executors | New execution patterns | Specialized hardware |
| Custom Macros | New template variables | Common business calculations |
| UI Extensions | Dashboard widgets | Custom monitoring |

Plugin Organization

Plugins live in the plugins/ directory with this structure:

plugins/
β”œβ”€β”€ operators/
β”‚   └── custom_operator.py
β”œβ”€β”€ sensors/
β”‚   └── custom_sensor.py
β”œβ”€β”€ hooks/
β”‚   └── custom_hook.py
β”œβ”€β”€ macros/
β”‚   └── custom_macros.py
└── airflow_plugins.py

When to Create Plugins

CREATE plugins for:

  • Integrating proprietary/internal systems
  • Reusable patterns across organization
  • Complex business logic encapsulation
  • Specialized monitoring/alerting

DON'T create plugins for:

  • One-off tasks (use standard operators)
  • Simple data transformations (use PythonOperator)
  • Ad-hoc scripts (use BashOperator)
  • If built-in operator already exists

Plugin Development

Custom operators extend BaseOperator and override execute() method.
Custom sensors extend BaseSensorOperator and override poke() method.
Custom hooks extend BaseHook with connection logic.

Plugins are loaded automatically from plugins/ directory on scheduler startup.
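
A minimal sketch of those base classes in use (class names, parameters, and logic are hypothetical; only the methods to override are shown):

from airflow.models.baseoperator import BaseOperator
from airflow.sensors.base import BaseSensorOperator

class MyApiToS3Operator(BaseOperator):
    def __init__(self, endpoint: str, bucket: str, **kwargs):
        super().__init__(**kwargs)
        self.endpoint = endpoint
        self.bucket = bucket

    def execute(self, context):
        # Do the actual work here; the return value is pushed to XCom.
        self.log.info("Copying %s to %s", self.endpoint, self.bucket)
        return {"rows": 0}

class RecordCountSensor(BaseSensorOperator):
    def poke(self, context) -> bool:
        # Return True when the condition is met, False to keep waiting.
        return False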

↑ Back to Table of Contents

SECTION 5: TASKFLOW API

What is TaskFlow API?

TaskFlow API (Airflow 2.0+) is the modern, Pythonic way to write DAGs using decorators. It simplifies DAG creation by using functions instead of operator instantiation.

Comparison: Traditional vs TaskFlow

| Aspect | Traditional | TaskFlow |
| --- | --- | --- |
| Syntax | Verbose operator creation | Clean function decorators |
| Data Passing | Manual XCom push/pull | Implicit via return values |
| Dependencies | Explicit >> operators | Implicit via parameters |
| Code Length | More lines, repetitive | Fewer lines, concise |
| Type Hints | Optional | Strongly recommended |
| Learning Curve | Moderate | Easier for Python devs |
| IDE Support | Basic | Excellent (autocomplete) |

Key Decorators

@dag

  • Decorates function that contains tasks
  • Function body defines workflow
  • Returns DAG instance
  • Replaces DAG() class creation

@task

  • Decorates function that becomes a task
  • Function is task logic
  • Return value becomes XCom value
  • Parameters are upstream dependencies

@task_group

  • Groups related tasks together
  • Creates visual subgraph in UI
  • Simplifies large DAGs
  • Can be nested

@task with .expand() (Dynamic Mapping)

  • .expand() is called on a @task-decorated function (it is not a decorator itself) to create one mapped task instance per input item
  • Reduces repetitive task creation
  • Dynamic fan-out pattern
  • Available in Airflow 2.3+ (see the sketch below)
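
A short TaskFlow sketch showing @dag, @task, implicit dependencies, and .expand() for dynamic mapping (assumes Airflow 2.3+; all names are illustrative):

from datetime import datetime
from airflow.decorators import dag, task

@dag(dag_id="taskflow_demo", start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def taskflow_demo():
    @task
    def extract() -> list[int]:
        return [1, 2, 3]

    @task
    def total(values: list[int]) -> int:
        return sum(values)

    @task
    def square(value: int) -> int:
        return value * value

    numbers = extract()
    total(numbers)                 # dependency and XCom handled via the return value
    square.expand(value=numbers)   # dynamic mapping: one task instance per element

taskflow_demo()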

Advantages of TaskFlow

  • Tasks are just Python functions
  • Type hints enable IDE autocomplete
  • Natural data flow through function parameters
  • Automatic XCom handling (no manual push/pull)
  • Cleaner, more Pythonic code
  • Easier testing (function-based)
  • Less boilerplate
  • Better code organization

When to Use TaskFlow

Use for:

  • New DAGs
  • Complex dependencies
  • Python-savvy teams
  • Modern deployments (2.0+)

Alternative if:

  • Legacy Airflow 1.10 requirement
  • Using operators not available as tasks
  • Team unfamiliar with decorators

↑ Back to Table of Contents

SECTION 6: DEFERRABLE OPERATORS & TRIGGERS

The Problem: Resource Waste

Traditional Sensors (Blocking)

  • Sensor holds worker slot while waiting
  • Worker CPU ~5-10% even during idle
  • Memory ~50MB per idle task
  • Can handle ~50-100 concurrent waits
  • Resource waste for long waits (hours/days)

Example: 1000 tasks waiting 1 hour = 1000 worker slots held continuously = massive waste

The Solution: Deferrable Operators

Deferrable Approach (Non-blocking)

  • Task releases worker slot immediately after deferring
  • Minimal CPU usage (~0.1%) during wait
  • Memory ~1MB per deferred task
  • Can handle 1000+ concurrent waits
  • ~95% resource savings compared to traditional

How Deferrable Operators Work

4-Phase Execution:

  1. Execution Phase: Task executes initial logic

    • Setup, validation, prerequisites
    • Then: Defer to trigger
    • Worker slot RELEASED
  2. Deferred Phase: Trigger monitors event

    • Async/event-based monitoring
    • Triggerer service handles
    • Very lightweight
    • No worker slot needed
  3. Completion Phase: Event occurs

    • Trigger detects condition met
    • Returns event data
    • Trigger fires
  4. Resume Phase: Task completes

    • Worker acquires new slot
    • Task resumes with trigger result
    • Final processing
    • Task completes
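
A hedged example using a built-in deferrable sensor; it requires the triggerer service (airflow triggerer) to be running, and the two-hour wait is purely illustrative:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.sensors.time_delta import TimeDeltaSensorAsync

with DAG(dag_id="deferrable_demo", start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False) as dag:
    # Defers to the triggerer instead of holding a worker slot for two hours.
    wait_two_hours = TimeDeltaSensorAsync(task_id="wait_two_hours", delta=timedelta(hours=2))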

Available Deferrable Operators

| Operator | Defers For | Best For |
| --- | --- | --- |
| TimeSensorAsync | Specific time | Waiting until business hours |
| S3KeySensorAsync | S3 object | Long-running file uploads |
| ExternalTaskSensorAsync | External DAG | Cross-DAG long waits |
| SqlSensorAsync | SQL condition | Database state changes |
| HttpSensorAsync | HTTP response | External service availability |
| TimeDeltaSensorAsync | Duration | Fixed time wait |

Triggerer Service

  • Manages all active triggers
  • Single lightweight async service per deployment
  • Default capacity: 1000 triggers
  • Requires: airflow triggerer process
  • Monitors events, fires when ready
  • Stateless (can restart without losing state)

When to Use Deferrable

Use when:

  • Wait duration > 5 minutes
  • Many concurrent waits
  • Resource-constrained environment
  • Can tolerate event latency (<1 second)
  • Long-running waits (hours, days)

Don't use when:

  • Wait < 5 minutes (overhead not worth it)
  • Real-time detection critical (sub-second)
  • Simple local operations
  • Frequent detection required

Performance Impact

1000 Tasks Waiting 1 Hour

Traditional: 1000 worker slots Γ— 1 hour = 1000 slot-hours wasted
Deferrable: 0 worker slots, minimal CPU in trigger service

Cost Savings: ~95% for typical scenarios

↑ Back to Table of Contents

SECTION 7: DEPENDENCIES & RELATIONSHIPS

Task Dependencies

Linear Chain: Task A β†’ Task B β†’ Task C β†’ Task D

  • Sequential execution
  • Each waits for previous
  • Easiest to understand
  • Best for: ETL pipelines with stages

Fan-out / Fan-in: One task triggers many, then merge

           β”Œβ”€β”€ Process A ──┐
Extract β”€β”€β”€                β”œβ”€β”€ Merge
           └── Process B β”€β”€β”˜
  • Parallel processing
  • Best for: Multiple independent operations
  • Example: Extract once, transform multiple ways

Branching: Conditional execution

      β”Œβ”€ Success path
Branch ─
      └─ Error path
  • Different paths based on condition
  • Best for: Alternative workflows
  • Example: If data valid, load; else alert
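
A minimal branching sketch using BranchPythonOperator (the validity check and task names are hypothetical; EmptyOperator is the Airflow 2.4+ name for DummyOperator):

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator

def choose_path(**context):
    data_is_valid = True          # placeholder for a real validation check
    return "load" if data_is_valid else "alert"   # return the task_id to follow

with DAG(dag_id="branching_demo", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
    branch = BranchPythonOperator(task_id="branch", python_callable=choose_path)
    load = EmptyOperator(task_id="load")
    alert = EmptyOperator(task_id="alert")
    branch >> [load, alert]       # only the chosen path runs; the other is skipped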

Trigger Rules

Trigger rules determine WHEN downstream tasks execute based on UPSTREAM status.

| Rule | Executes When | Use Case |
| --- | --- | --- |
| all_success | All upstream succeed | Default, sequential flow |
| all_failed | All upstream fail | Cleanup after failures |
| all_done | All upstream complete (any status) | Final reporting |
| one_success | At least one upstream succeeds | Alternative paths |
| one_failed | At least one upstream fails | Error alerts |
| none_failed | No upstream failures (skips OK) | Robust pipelines βœ“ Recommended |
| none_failed_min_one_success (formerly none_failed_or_skipped) | No failures and at least one success | Stricter validation |

Best Practice: Use none_failed for fault-tolerant pipelines.

DAG-to-DAG Dependencies

Methods to Create

| Method | Mechanism | Best For |
| --- | --- | --- |
| ExternalTaskSensor | Poll for completion | Different schedules |
| TriggerDagRunOperator | Actively trigger | Orchestrate workflows |
| Datasets | Event-driven | Data pipelines |

Datasets (Airflow 2.4+)

  • Producer DAG updates dataset
  • Consumer DAG triggered automatically
  • Real-time responsiveness (<1 second)
  • Low latency, efficient
  • Best for data-driven workflows

Example:

  • ETL DAG extracts and updates orders_dataset
  • Analytics DAG triggered automatically when dataset updates
  • No waiting, real-time processing
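
A sketch of that producer/consumer pair, assuming Airflow 2.4+ (the dataset URI, DAG ids, and callables are illustrative):

from datetime import datetime
from airflow import DAG, Dataset
from airflow.operators.python import PythonOperator

orders_dataset = Dataset("s3://data-lake/orders/")   # hypothetical URI

with DAG(dag_id="etl_orders", start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False):
    PythonOperator(
        task_id="load_orders",
        python_callable=lambda: print("orders written"),
        outlets=[orders_dataset],            # marks the dataset as updated on success
    )

with DAG(dag_id="orders_analytics", start_date=datetime(2024, 1, 1), schedule=[orders_dataset], catchup=False):
    PythonOperator(task_id="refresh_dashboard", python_callable=lambda: print("dashboard refreshed"))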

↑ Back to Table of Contents

SECTION 8: SCHEDULING

Overview of Scheduling Methods

Three ways to trigger DAG runs:

  1. Time-based: Specific times/intervals
  2. Event-driven: When data arrives/updates
  3. Manual: User-triggered via UI

Time-Based: Cron Expressions (Legacy)

Cron format: minute hour day month day_of_week

Common Patterns

  • @hourly: Every hour
  • @daily: Every day at midnight
  • @weekly: Every Sunday at midnight
  • @monthly: First day of month
  • 0 9 * * MON-FRI: 9 AM on weekdays
  • */15 * * * *: Every 15 minutes
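
A quick sketch of one of these patterns applied to a DAG (times are interpreted in the DAG's timezone, UTC unless configured otherwise; the schedule parameter assumes Airflow 2.4+):

from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="weekday_report",
    start_date=datetime(2024, 1, 1),
    schedule="0 9 * * MON-FRI",   # 9 AM on weekdays
    catchup=False,
) as dag:
    ...   # tasks go here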

Limitations

  • Simple time-only scheduling
  • Limited to time expressions
  • No business logic
  • Timezone handling tricky

Time-Based: Timetable API (Modern)

Timetables (Airflow 2.2+) provide object-oriented scheduling with unlimited flexibility.

Built-in Timetables

  • CronDataIntervalTimetable: cron-based with data intervals (what a cron schedule string maps to)
  • DeltaDataIntervalTimetable: fixed timedelta intervals
  • CronTriggerTimetable: cron-based without data intervals (Airflow 2.3+)

Custom Timetables Allow

  • Business day scheduling (Mon-Fri)
  • Holiday-aware schedules (skip holidays)
  • Timezone-specific runs
  • Complex business logic
  • End-of-month runs
  • Business hour runs (9 AM - 5 PM)

Advantages

  • Unlimited flexibility
  • Timezone support
  • Readable code
  • Testable
  • Documented rationale

Event-Driven: Datasets (Airflow 2.4+)

Datasets enable data-driven workflows instead of time-driven.

How it Works

  1. Producer task completes and updates dataset
  2. Scheduler detects dataset update
  3. Consumer DAG triggered immediately
  4. No scheduled waiting

Benefits

  • Real-time responsiveness
  • Resource efficient (no idle waits)
  • Data-quality focused
  • Reduced latency (<1 second)
  • Perfect for data pipelines

When to Use

  • Multi-stage ETL (extract β†’ transform β†’ load)
  • Event streaming
  • Real-time processing
  • Data-dependent workflows

Hybrid Scheduling

Combine time-based and event-driven triggers (supported natively via DatasetOrTimeSchedule in Airflow 2.9+):

  • DAG runs on schedule OR when data arrives
  • Whichever comes first
  • Best of both worlds
  • Flexible and responsive

Scheduling Decision Tree

TIME-BASED?
β”œβ”€ YES β†’ Complex logic?
β”‚        β”œβ”€ YES β†’ Custom Timetable
β”‚        └─ NO β†’ Cron Expression
β”‚
└─ NO β†’ Event-driven?
         β”œβ”€ YES β†’ Use Datasets
         └─ NO β†’ Manual trigger or reconsider design

↑ Back to Table of Contents

SECTION 9: FAILURE HANDLING

Retry Mechanisms

Purpose: Allow tasks to recover from transient failures (network timeout, API rate limit, etc.)

Configuration

  • retries: Number of retry attempts
  • retry_delay: Wait time between retries
  • retry_exponential_backoff: Exponentially increasing delays
  • max_retry_delay: Cap on maximum wait

Example Retry Sequence

Attempt 1: Fail β†’ Wait 1 minute
Attempt 2: Fail β†’ Wait 2 minutes (1 Γ— 2^1)
Attempt 3: Fail β†’ Wait 4 minutes (1 Γ— 2^2)
Attempt 4: Fail β†’ Wait 8 minutes (1 Γ— 2^3)
Attempt 5: Fail β†’ Wait 16 minutes (1 Γ— 2^4)
Attempt 6: Fail β†’ Max 1 hour (capped)
Attempt 7: Fail β†’ Task FAILED

Best Practice: 2-3 retries with exponential backoff for most tasks.
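
A hedged example of those settings on a single task (the DAG shell and callable are placeholders):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def call_flaky_api():
    ...   # work that may hit transient errors (timeouts, rate limits)

with DAG(dag_id="retry_demo", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
    fetch = PythonOperator(
        task_id="fetch_from_api",
        python_callable=call_flaky_api,
        retries=3,
        retry_delay=timedelta(minutes=1),
        retry_exponential_backoff=True,          # 1 min, 2 min, 4 min, ...
        max_retry_delay=timedelta(minutes=30),   # cap the backoff
    )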

Trigger Rules (Already Covered in Section 7)

But worth repeating: Use none_failed for robust pipelines.

Callbacks

Execute custom logic when task state changes:

  • on_failure_callback: When task fails
  • on_retry_callback: When task retries
  • on_success_callback: When task succeeds

Common Uses

  • Send notifications (Slack, email, PagerDuty)
  • Log to external monitoring systems
  • Trigger incident management
  • Update dashboards
  • Cleanup resources
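
A minimal failure-callback sketch: the callback receives the task context as a single dict, and the notification here is just a log line (swap in Slack, PagerDuty, etc. as needed; names are illustrative):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_on_failure(context):
    ti = context["task_instance"]
    print(f"Task {ti.task_id} failed on {context['ds']}: {context.get('exception')}")

with DAG(dag_id="callback_demo", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
    critical = PythonOperator(
        task_id="critical_step",
        python_callable=lambda: 1 / 0,           # fails on purpose for illustration
        on_failure_callback=notify_on_failure,
    )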

Error Handling Patterns

Pattern 1: Idempotency

  • Make tasks safe to retry
  • Delete-then-insert instead of insert
  • Same result regardless of executions
  • Safe for retries

Pattern 2: Circuit Breaker

  • Stop retrying after N attempts
  • Raise critical exceptions
  • Alert operations team
  • Prevent cascading failures

Pattern 3: Graceful Degradation

  • Continue with partial data
  • Log warnings
  • Don't fail entire pipeline
  • Better UX than all-or-nothing

Pattern 4: Skip on Condition

  • Raise AirflowSkipException
  • Skip task without marking as failure
  • Allow downstream to continue
  • Best for conditional workflows

↑ Back to Table of Contents

SECTION 10: SLA MANAGEMENT

What is SLA?

SLA (Service Level Agreement) is a guarantee that a task will complete within a specified time window. (Note: the classic SLA mechanism described here applies to Airflow 2.x; it was removed in Airflow 3.0.)

If an SLA is missed:

  • Task marked as "SLA Missed"
  • Notifications triggered
  • Callbacks executed
  • Logged for auditing/metrics

Setting SLAs

At DAG Level: All tasks inherit SLA

from datetime import timedelta

default_args = {
    'sla': timedelta(hours=1)
}

At Task Level: Specific to individual task

task = PythonOperator(
    task_id='critical_task',
    python_callable=check_data,  # required callable (name is illustrative)
    sla=timedelta(minutes=30)
)

Per Task Overrides: Task SLA overrides DAG default
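
SLA misses can also invoke a DAG-level sla_miss_callback; a minimal sketch (the five-argument signature below is the one documented for Airflow 2.x, and the alert logic is illustrative):

def alert_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Called by the scheduler when one or more tasks in the DAG miss their SLA.
    print(f"SLA missed in {dag.dag_id}: {task_list}")

# Attached at the DAG level, e.g. DAG(..., sla_miss_callback=alert_sla_miss)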

SLA Monitoring

Key Metrics

  • SLA compliance rate (%)
  • Average overdue duration
  • Most violated tasks
  • Impact on downstream

Alerts

  • Send notifications when an SLA is at risk (e.g., 80% of the window elapsed)
  • Escalate on missed SLA
  • Create incidents for critical SLAs

Dashboard

  • Track compliance trends
  • Identify problematic tasks
  • Alert history
  • Remediation tracking

SLA Best Practices

| Practice | Benefit |
| --- | --- |
| Set realistic SLAs | Avoid false positives |
| Different tiers | Critical < Normal < Optional |
| Review quarterly | Account for growth |
| Document rationale | Team alignment |
| Escalation policy | Proper incident response |
| Callbacks | Automated notifications |
| Monitor compliance | Identify trends |

↑ Back to Table of Contents

SECTION 11: CATCHUP BEHAVIOR

What is Catchup?

Catchup is automatic backfill of missed DAG runs.

Example Scenario

Start date: 2024-01-01
Today: 2024-05-15
Schedule: @daily (every day)
Catchup: True

RESULT: Airflow creates 135 DAG runs instantly!
(One for each day from Jan 1 to May 14)

Catchup Settings

| Setting | Behavior | When to Use |
| --- | --- | --- |
| catchup=True | Backfill all missed runs | Historical data loading |
| catchup=False | No backfill, start fresh | New pipelines |
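
A short sketch of the usual production default, with catchup disabled and concurrency capped in case it is ever turned on (values are illustrative):

from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="orders_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,       # do not backfill missed runs
    max_active_runs=1,   # at most one run in flight if backfilling later
) as dag:
    ...   # tasks go here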

Catchup Risks

Potential Issues

  • Resource exhaustion (CPU, memory, database)
  • Network bandwidth spike
  • Third-party API rate limits exceeded
  • Database connection pool exhausted
  • Worker node crashes
  • Scheduler paralysis (busy processing backfill)

Safe Catchup Strategy

Step 1: Use Separate Backfill DAG

  • Main DAG: catchup=False (normal operations)
  • Backfill DAG: catchup=True, end_date set (historical only)
  • Run backfill during maintenance window

Step 2: Control Parallelism

  • max_active_runs: Limit concurrent DAG runs (1-3)
  • max_active_tasks_per_dag: Limit concurrent tasks (16-32)
  • Prevents resource exhaustion

Step 3: Staged Approach

  • Test with subset of dates first
  • Monitor resource usage
  • Gradually increase pace
  • Have rollback plan

Step 4: Validation

  • Verify data completeness
  • Check for duplicates
  • Validate transformations
  • Compare with expected results

Catchup Checklist

Before enabling catchup, verify:

  • Database capacity available
  • Worker resources adequate
  • Source data exists for all periods
  • Data schema hasn't changed
  • No API rate limits exceeded
  • Network bandwidth available
  • Stakeholders informed
  • Rollback procedure ready

↑ Back to Table of Contents

SECTION 12: CONFIGURATION

Configuration Hierarchy

Airflow reads configuration in this order (first match wins):

  1. Environment Variables (AIRFLOW__{SECTION}__{KEY}, e.g. AIRFLOW__CORE__PARALLELISM)
  2. airflow.cfg file (main config)
  3. Built-in defaults

Individual commands can also override behavior at runtime via CLI flags.

Critical Settings

| Setting | Purpose | Production Value |
| --- | --- | --- |
| sql_alchemy_conn | Metadata database | PostgreSQL connection |
| executor | Task execution | CeleryExecutor |
| parallelism | Max tasks globally | 32-128 |
| max_active_runs_per_dag | Concurrent DAG runs | 1-3 |
| max_active_tasks_per_dag | Concurrent tasks | 16-32 |
| remote_logging | Centralized logs | True (to S3) |
| email_backend | Email provider | SMTP |
| base_log_folder | Log storage | NFS mount |

Database Configuration

Production Database

  • PostgreSQL 12+ (recommended) or MySQL 8+
  • Dedicated database server (not shared)
  • Daily backups automated
  • Connection pooling enabled
  • Monitoring for slow queries
  • Regular maintenance (VACUUM, ANALYZE)

Why Not SQLite

  • Not concurrent-access safe
  • Locks on writes
  • Single machine only
  • Not for production

Executor Configuration

| Executor | Where Configured | Best For |
| --- | --- | --- |
| LocalExecutor | [core] executor | Development |
| CeleryExecutor | [core] executor + [celery] settings | Distributed, high volume |
| KubernetesExecutor | [core] executor + [kubernetes_executor] settings | Cloud-native |
| SequentialExecutor | [core] executor (default with SQLite) | Testing only |

Performance Tuning

Database

  • Connection pooling (pool_size=5, max_overflow=10)
  • Indexes on frequently queried columns
  • Archive old logs/task instances
  • Query optimization

Scheduler

  • Tune the DAG file parsing interval (min_file_process_interval) for your load
  • Optimize DAG file count
  • Use pool resources wisely

Web Server

  • Cache results
  • Limit search scope
  • Monitor response times

↑ Back to Table of Contents

SECTION 13: AWS INTEGRATION

Connection Setup

Three Methods

  1. Airflow UI: Admin β†’ Connections β†’ Create
  2. Environment Variables: AIRFLOW_CONN_AWS_DEFAULT
  3. Programmatic: Python script creating Connection objects

Authentication Types

  • Access Key / Secret Key (discouraged)
  • IAM Roles (recommended for EC2)
  • AssumeRole (cross-account access)
  • Temporary credentials

Common AWS Services

| Service | What It Does | Use Case |
| --- | --- | --- |
| S3 | File storage | Data lake, logs, backups |
| Redshift | Data warehouse | Analytics, aggregations |
| Lambda | Serverless compute | Lightweight processing |
| SNS | Notifications | Alerts, messaging |
| SQS | Message queue | Task decoupling |
| Glue | ETL service | Spark/Python jobs |
| RDS/Aurora | Managed database | Transactional data |
| EMR | Big data | Spark, Hadoop clusters |
| CloudWatch | Monitoring | Metrics, logs, alarms |

Operators for AWS Services

S3 Operations

  • Upload/download files
  • Copy within S3
  • List objects
  • Delete files
  • Check existence

Redshift Operations

  • Execute SQL queries
  • Load data from S3
  • Create/delete snapshots
  • Pause/resume clusters

Lambda Operations

  • Invoke functions
  • Async execution
  • Sync execution with polling

SQS/SNS Operations

  • Send messages to queue
  • Receive from queue
  • Publish to topics
  • Batch operations
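
A hedged sketch of an S3 upload from a Python callable using S3Hook (requires the apache-airflow-providers-amazon package; the bucket, key layout, and connection id are hypothetical):

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def upload_report(**context):
    hook = S3Hook(aws_conn_id="aws_default")
    hook.load_string(
        string_data="order_id,total\n1,99.50",
        key=f"reports/{context['ds']}/summary.csv",
        bucket_name="my-data-lake",
        replace=True,
    )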

AWS Best Practices

| Practice | Benefit |
| --- | --- |
| Use IAM roles | No credentials in code |
| Secrets Manager | Centralized credential storage |
| Enable CloudTrail | Audit all API calls |
| Use VPC endpoints | S3 access without internet |
| Monitor costs | CloudWatch cost alerts |
| Implement budgets | Prevent overspending |
| Regular backups | Disaster recovery |
| Region strategy | Latency, compliance |

↑ Back to Table of Contents

SECTION 14: VERSION COMPARISON

Version Timeline

1.10.x (Deprecated, EOL 2021)
    ↓ Major Redesign
2.0 (Dec 2020) - TaskFlow API introduced
    ↓
2.2 (2021) - Deferrable operators & custom timetables
    ↓
2.4 (Sept 2022) - Datasets / event-driven scheduling
    ↓
2.5+ (Mature, production-ready)
    ↓
3.0 (2025) - Performance focus, async improvements

Key Feature Comparison

| Feature | 1.10 | 2.x | 3.0 |
| --- | --- | --- | --- |
| TaskFlow API | ❌ | βœ… | βœ… |
| Deferrable Ops | ❌ | βœ… (2.2+) | βœ… |
| Datasets | ❌ | βœ… (2.4+) | βœ… (renamed Assets) |
| Performance | Baseline | 3-5x faster | 5-6x faster |
| Python 3.10 support | ❌ | βœ… | βœ… |
| Async/Await | ❌ | Limited | βœ… Full |

Performance Improvements

| Metric | 1.10 | 2.5 | 3.0 |
| --- | --- | --- | --- |
| DAG parsing (1000 DAGs) | ~120s | ~45s | ~15s |
| Task scheduling latency | ~5-10s | ~2-3s | <1s |
| Memory (idle scheduler) | ~500MB | ~300MB | ~150MB |
| Web UI response | ~800ms | ~200ms | ~50ms |

Migration Path

From 1.10 to 2.x

  • Python 3.6+ required
  • Import paths changed (reorganized)
  • Operators moved to providers
  • Test in staging thoroughly
  • Gradual production rollout (1-2 weeks)

From 2.x to 3.0

  • Most DAGs work without changes
  • Drop Python < 3.10
  • Update async patterns
  • Leverage new performance features
  • Simpler migration (backward compatible)

Version Selection

| Scenario | Recommended |
| --- | --- |
| New projects | 2.5+ or 3.0 |
| Production (stable) | 2.4.x |
| Enterprise | 2.5+ |
| Performance-critical | 3.0 |
| Learning | 2.5+ or 3.0 |

↑ Back to Table of Contents

SECTION 15: BEST PRACTICES

DAG Design DO's

βœ“ Use TaskFlow API for modern code
βœ“ Keep tasks small and focused
βœ“ Name tasks descriptively
βœ“ Document complex logic
βœ“ Use type hints
βœ“ Add comprehensive logging
βœ“ Handle errors gracefully
βœ“ Test DAGs before production

DAG Design DON'Ts

βœ— Put heavy processing in DAG definition
βœ— Create 1000s of tasks per DAG
βœ— Hardcode credentials
βœ— Ignore data validation
βœ— Use DAG-level database connections
βœ— Forget about idempotency
βœ— Skip error handling
βœ— Assume tasks always succeed

Task Design DO's

βœ“ Make tasks idempotent
βœ“ Set appropriate timeouts
βœ“ Implement retries with backoff
βœ“ Use pools for resource control
βœ“ Skip tasks on errors (when appropriate)
βœ“ Log important events
βœ“ Clean up resources

Task Design DON'Ts

βœ— Assume tasks always succeed
βœ— Run long operations without timeout
βœ— Hold resources unnecessarily
βœ— Store large data in XCom
βœ— Use blocking operations
βœ— Ignore upstream failures
βœ— Forget cleanup

Scheduling DO's

βœ“ Match frequency to data needs
βœ“ Account for upstream dependencies
βœ“ Plan for failures
βœ“ Test schedule changes
βœ“ Document rationale
βœ“ Monitor adherence

Scheduling DON'Ts

βœ— Schedule too frequently
βœ— Ignore resource constraints
βœ— Forget about catchup
βœ— Use ambiguous expressions
βœ— Create tight couplings
βœ— Skip monitoring

↑ Back to Table of Contents

SECTION 16: MONITORING

Key Metrics

| Metric | What It Shows | Alert Threshold |
| --- | --- | --- |
| Scheduler heartbeat | Scheduler running | >30s = ALERT |
| Task failure rate | Pipeline health | >5% = INVESTIGATE |
| SLA violations | Performance issues | >10/day = REVIEW |
| DAG parsing time | System load | >60s = OPTIMIZE |
| Task queue depth | Worker capacity | >1000 = ADD WORKERS |
| DB connections | Connection health | >80% = INCREASE |
| Log size | Storage usage | Monitor growth |

Common Issues & Solutions

| Issue | Likely Cause | Solution |
| --- | --- | --- |
| Tasks stuck | Worker crashed | Restart workers |
| Scheduler lagging | High load | Add resources |
| Memory leaks | Bad code | Profile and fix |
| Database slow | No indexes | Optimize queries |
| Triggers not firing | Triggerer down | Check service |
| SLA misses | Unrealistic goals | Adjust SLA |
| Catchup explosion | Too many runs | Limit with max_active_runs |

Monitoring Stack Components

Metrics Collection

  • Prometheus for collection
  • Custom StatsD metrics
  • CloudWatch for AWS

Visualization

  • Grafana dashboards
  • CloudWatch dashboards
  • Custom visualizations

Alerting

  • Threshold-based alerts
  • Anomaly detection
  • Escalation policies
  • Integration with incident systems

Logging

  • Centralized logs (ELK, Splunk)
  • Remote logging to S3
  • Log aggregation
  • Searchable archives

↑ Back to Table of Contents

SECTION 17: CONNECTIONS AND SECRETS

What are Connections?

Connections store authentication credentials to external systems. Instead of hardcoding database passwords or API keys in DAGs, Airflow stores these centrally.

Components:

  • Connection ID: Unique identifier (e.g., postgres_default)
  • Connection Type: System type (postgres, aws, http, mysql, etc.)
  • Host: Server address
  • Login: Username
  • Password: Secret credential
  • Port: Service port
  • Schema: Database or default path
  • Extra: JSON for additional parameters

Why Use Connections?

  • Credentials not in DAG code (can't accidentally commit to git)
  • Centralized credential management
  • Easy credential rotation
  • Different credentials per environment (dev, staging, prod)
  • Audit trail of access

Connection Types

| Type | Used For | Example |
| --- | --- | --- |
| postgres | PostgreSQL databases | Analytics warehouse |
| mysql | MySQL databases | Transactional data |
| snowflake | Snowflake warehouse | Cloud data platform |
| aws | AWS services (S3, Redshift, Lambda) | Cloud infrastructure |
| gcp | Google Cloud services | BigQuery, Cloud Storage |
| http | REST APIs | External services |
| ssh | Remote server access | Bastion hosts |
| slack | Slack messaging | Notifications |
| sftp | SFTP file transfer | Remote servers |

What are Secrets/Variables?

Variables store non-sensitive configuration values.
Secrets store sensitive data (marked as secret).

| Aspect | Variables | Secrets |
| --- | --- | --- |
| Purpose | Configuration settings | Passwords, API keys |
| Visibility | Visible in UI | Masked/hidden in UI |
| Storage | Metadata database | External backend (Vault, etc.) |
| Example | batch_size, email | api_key, db_password |

How to Add Connections

Method 1: Web UI (Easiest)

  • Airflow UI β†’ Admin β†’ Connections β†’ Create Connection
  • Fill form with credentials
  • Click Save

Method 2: Environment Variables

Format: AIRFLOW_CONN_{CONNECTION_ID}={CONN_TYPE}://{USER}:{PASSWORD}@{HOST}:{PORT}/{SCHEMA}

Example:

export AIRFLOW_CONN_POSTGRES_DEFAULT=postgresql://user:password@localhost:5432/mydb
export AIRFLOW_CONN_AWS_DEFAULT=aws://ACCESS_KEY:SECRET_KEY@

Method 3: Python Script

Create Connection objects programmatically and add to database.

Method 4: Airflow CLI

airflow connections add postgres_prod \
    --conn-type postgres \
    --conn-host localhost \
    --conn-login user \
    --conn-password password

How to Add Variables/Secrets

Method 1: Web UI

  • Airflow UI β†’ Admin β†’ Variables β†’ Create Variable
  • Enter Key and Value
  • Values are masked automatically when the key contains a sensitive word such as "secret", "password", or "api_key"
  • Click Save

Method 2: Environment Variables

export AIRFLOW_VAR_BATCH_SIZE=1000
export AIRFLOW_VAR_API_KEY=secret123

Method 3: Python Script

Create Variable objects and add to database.

Method 4: Airflow CLI

airflow variables set batch_size 1000
airflow variables set api_key secret123

Using Connections in DAGs

Connections are accessed through Hooks (connection abstraction layer).

from airflow.providers.postgres.hooks.postgres import PostgresHook

def query_database(**context):
    hook = PostgresHook(postgres_conn_id='postgres_default')
    result = hook.get_records("SELECT * FROM users")
    return result

Benefits of Hooks:

  • Encapsulate connection logic
  • Reusable across tasks
  • Handle connection pooling
  • Error handling built-in
  • No credentials in task code

Using Variables in DAGs

Variables are accessed directly:

from airflow.models import Variable

batch_size = Variable.get('batch_size', default_var=1000)
api_key = Variable.get('api_key')
is_production = Variable.get('is_production', default_var='false')

# With JSON parsing
config = Variable.get('config', deserialize_json=True)

Secrets Backend: Advanced Security

For production, use external secrets backends instead of storing in Airflow database.

| Backend | Security Level | Best For |
| --- | --- | --- |
| Metadata DB | Low | Development only |
| AWS Secrets Manager | High | AWS deployments |
| HashiCorp Vault | Very High | Enterprise |
| Google Secret Manager | High | GCP deployments |
| Azure Key Vault | High | Azure deployments |

How it works:

  • Configure Airflow to use external backend
  • Airflow queries backend when credential needed
  • Secrets never stored in database
  • Automatic rotation supported
  • Audit logging in backend

Security Best Practices

DO βœ…

  • Use external secrets backend in production
  • Rotate credentials quarterly
  • Use IAM roles instead of hardcoded keys
  • Encrypt connections in transit (SSL/TLS)
  • Encrypt secrets at rest (database encryption)
  • Limit who can see connections (RBAC)
  • Use short-lived credentials (temporary tokens)
  • Audit access to secrets
  • Never commit credentials to git

DON'T ❌

  • Hardcode credentials in DAG code
  • Store passwords in comments or docstrings
  • Share credentials via Slack/Email
  • Use default passwords
  • Store credentials in config files in git
  • Forget to revoke old credentials
  • Use same credentials everywhere
  • Log credential values

Common Issues

| Issue | Cause | Fix |
| --- | --- | --- |
| Connection refused | Service not running | Start the service |
| Authentication failed | Wrong credentials | Verify in source system |
| Host not found | Wrong hostname/DNS | Check network, DNS resolution |
| Timeout | Network unreachable | Check firewall, routing |
| Variable not found | Wrong key name | Check variable key spelling |
| Permission denied | User lacks access | Check user RBAC permissions |

Testing Connections

Via Web UI:

  • Navigate to Admin β†’ Connections
  • Click on connection
  • Click "Test Connection"

Via CLI:

airflow connections test postgres_default
airflow connections test aws_default

Via Python:

from airflow.providers.postgres.hooks.postgres import PostgresHook

hook = PostgresHook(postgres_conn_id='postgres_default')
hook.get_conn()  # Will raise error if connection fails

Production Setup

Minimum Requirements:

βœ… All connections stored centrally (not hardcoded)
βœ… Credentials in external secrets backend
βœ… RBAC enabled (limit connection visibility)
βœ… No credentials in DAG code
βœ… No credentials in git history
βœ… Quarterly credential rotation
βœ… SSL/TLS enabled for external connections
βœ… Audit logging for credential access
βœ… Monitoring/alerts for failed authentication

↑ Back to Table of Contents

SUMMARY & CHECKLIST

Production Deployment Checklist

INFRASTRUCTURE
β–‘ PostgreSQL 12+ database
β–‘ Separate scheduler machine
β–‘ Multiple web server instances (load balanced)
β–‘ Multiple worker machines (if Celery)
β–‘ Message broker (Redis/RabbitMQ)
β–‘ Triggerer service running
β–‘ NFS for logs (centralized)
β–‘ Backup systems

CONFIGURATION
β–‘ Connection pooling configured
β–‘ Resource pools created
β–‘ Email/Slack configured
β–‘ Remote logging to S3/cloud
β–‘ Monitoring setup (Prometheus/Grafana)
β–‘ Alert thresholds defined
β–‘ Timezone configured correctly

SECURITY
β–‘ RBAC enabled in UI
β–‘ Credentials in Secrets Manager
β–‘ SSL/TLS enabled
β–‘ Database encryption at rest
β–‘ VPC configured
β–‘ Firewall rules tight
β–‘ Regular security updates
β–‘ No credentials in DAG code

RELIABILITY
β–‘ Metadata backup automated (daily)
β–‘ DAG code in Git repository
β–‘ Disaster recovery plan documented
β–‘ Rollback procedure tested
β–‘ SLA enforcement active
β–‘ Callback notifications configured
β–‘ Data validation implemented

MONITORING
β–‘ Scheduler health check
β–‘ Worker heartbeat monitoring
β–‘ Database performance monitoring
β–‘ Trigger service health
β–‘ DAG/task failure alerts
β–‘ SLA violation alerts
β–‘ Resource usage dashboards
β–‘ Cost tracking (if cloud)

PERFORMANCE
β–‘ Database indexes optimized
β–‘ Connection pool sized appropriately
β–‘ Max active runs limited
β–‘ Task priorities set
β–‘ Deferrable operators for long waits
β–‘ Sensor reschedule mode where deferrable versions aren't available (Smart Sensors were removed in Airflow 2.4)
β–‘ DAG parsing optimized

DOCUMENTATION
β–‘ DAG documentation complete
β–‘ SLA rationale documented
β–‘ Runbooks created for issues
β–‘ Team trained on platform
β–‘ Changes logged in git
β–‘ Disaster recovery plan shared
β–‘ Escalation procedures documented

↑ Back to Table of Contents

RECOMMENDATIONS

For New Projects Starting Today

  1. Choose Airflow 2.5+ or 3.0 (not 1.10)
  2. Use TaskFlow API (modern, clean code)
  3. Use Timetable API for scheduling (not cron strings)
  4. Use Deferrable operators for long waits
  5. Plan for scale from day one (don't start with LocalExecutor)
  6. Set up monitoring immediately (not later)
  7. Use event-driven scheduling (Datasets) where possible
  8. Store credentials externally (Secrets Manager, not hardcoded)

For Teams Upgrading from 1.10

  1. Plan 2-3 months for migration (not quick)
  2. Test thoroughly in staging environment
  3. Keep 1.10 backup for emergency rollback
  4. Update DAGs incrementally (not all at once)
  5. Monitor closely first month post-upgrade
  6. Document breaking changes for team
  7. Celebrate milestones (keep team motivated)

For Large Enterprise Deployments

  1. Use Kubernetes executor (scalability, isolation)
  2. Implement comprehensive monitoring (critical for troubleshooting)
  3. Set up disaster recovery (business continuity)
  4. Regular chaos engineering tests (find weak points before production)
  5. Capacity planning quarterly (predict growth)
  6. Performance optimization ongoing (never "done")
  7. Security audits regular (compliance, risk)

For Data Pipeline Use Cases

  1. Use event-driven scheduling (Datasets, not time-based)
  2. Implement data quality checks (validate early)
  3. Monitor SLAs closely (data freshness critical)
  4. Implement idempotent tasks (safe retries)
  5. Plan catchup carefully (historical data loads)
  6. Version control everything (DAGs, configurations)
  7. Use separate environment (dev, staging, prod)

↑ Back to Table of Contents

CONCLUSION

The Airflow Journey

Apache Airflow has evolved from a simple scheduler into a mature, production-grade workflow orchestration platform. The progression from version 1.10 through modern versions demonstrates significant improvements in usability, performance, flexibility, and reliability.

Key Takeaways

Architecture & Design

  • TaskFlow API provides clean, Pythonic DAG definitions
  • Deferrable operators enable massive resource savings for long-running waits
  • Plugins allow endless extensibility without modifying core

Scheduling & Flexibility

  • Choose appropriate scheduling method (time, event, or hybrid)
  • Event-driven scheduling (Datasets) perfect for data pipelines
  • Timetable API enables business logic in scheduling

Reliability & Operations

  • Retries with exponential backoff handle transient failures
  • Proper error handling prevents cascading failures
  • SLAs enforce performance agreements
  • Comprehensive monitoring catches issues early

Scalability & Performance

  • Multiple executor options for different scales
  • Deferrable operators handle 1000s of concurrent waits
  • Performance improvements: 1.10 β†’ 2.5 β†’ 3.0 shows 5-6x speedup

Real-World Success

  • Airflow powers ETL/ELT pipelines at hundreds of companies
  • Used for data engineering, ML orchestration, DevOps automation
  • Solves real problems: dependency management, scheduling, monitoring

Starting Your Airflow Journey

First Step: Understand architecture and core concepts (DAGs, operators, scheduling)
Next Step: Build simple DAGs with TaskFlow API
Then: Add error handling, retries, monitoring
Finally: Optimize for scale, implement production patterns

Airflow is Not

  • A replacement for application code
  • Suitable for real-time streaming (use Kafka, Spark Streaming)
  • A replacement for data warehouses (use Snowflake, BigQuery)
  • Simple enough to learn in an hour (takes weeks/months)

Airflow is Perfect For

  • Scheduled batch processing (daily, hourly, etc.)
  • Multi-stage ETL/ELT workflows
  • Cross-system orchestration
  • Data quality validation
  • DevOps automation
  • ML pipeline orchestration
  • Report generation

Final Wisdom

Success with Airflow depends not just on technical setup, but on organizational practices:

  • Treat DAGs as code (version control, code review, testing)
  • Monitor continuously (you can't fix what you don't see)
  • Document thoroughly (future you and teammates will thank you)
  • Automate responses (reduce manual incident response)
  • Iterate and improve (perfect is enemy of done)

Apache Airflow's strength lies in its flexibility, active community, and ecosystem. Whether you're building simple daily pipelines or complex multi-stage data workflows orchestrating hundreds of systems, Airflow provides the tools, patterns, and reliability to succeed.

Start small. Learn thoroughly. Scale thoughtfully. Monitor obsessively.

