The Beginning of Your Airflow Adventure
Table of Contents
PART 1: FUNDAMENTALS
- Section 1: Introduction
- Section 2: Architecture
- Section 3: DAG & Building Blocks
- Section 4: Plugins
- Section 5: TaskFlow API
PART 2: ADVANCED CONCEPTS
- Section 6: Deferrable Operators
- Section 7: Dependencies & Relationships
- Section 8: Scheduling
- Section 9: Failure Handling
- Section 10: SLA Management
PART 3: OPERATIONS & DEPLOYMENT
- Section 11: Catchup Behavior
- Section 12: Configuration
- Section 13: AWS Integration
- Section 14: Version Comparison
- Section 15: Best Practices
- Section 16: Monitoring
- Section 17: Connections and Secrets
FINAL SECTIONS
SECTION 1: INTRODUCTION
What is Apache Airflow?
Apache Airflow is a workflow orchestration platform that allows you to define, schedule, and monitor complex data pipelines as code. Instead of using UI-based workflow builders, Airflow uses Python to define Directed Acyclic Graphs (DAGs) representing your workflows.
Key Benefits
- Programmatic workflows: Define complex pipelines in Python with full control
- Scalable: From single-machine setups to distributed Kubernetes deployments
- Extensible: Create custom operators, sensors, and hooks without modifying core
- Rich monitoring: Web UI with real-time pipeline visibility and control
- Flexible scheduling: Time-based, event-driven, or custom business logic schedules
- Reliable: Built-in retry logic, SLAs, and comprehensive failure handling
- Community-driven: Active ecosystem with hundreds of provider integrations
Why Airflow Over Alternatives?
| Alternative | Why Airflow is Better |
|---|---|
| Cron scripts | Better reliability, monitoring, dependency management, version control |
| Shell scripts | Testable, debuggable, proper error handling, team collaboration |
| Commercial tools | Open source, customizable, no vendor lock-in, self-hosted |
| Cloud-native tools | Language agnostic, can self-host, avoid provider lock-in |
Who Uses Airflow?
- Data engineers building ETL/ELT pipelines
- ML engineers orchestrating training workflows
- DevOps teams managing infrastructure automation
- Analytics teams automating reports
- Any organization with complex scheduled workflows
↑ Back to Table of Contents
SECTION 2: ARCHITECTURE
Core Components Overview
Apache Airflow consists of several interconnected services working together:
Web Server
- Provides REST API and web UI
- Runs on port 8080 by default
- Allows manual DAG triggering, viewing logs, and managing connections
- Multiple instances can be deployed for high availability
- Stateless (can run on different machines)
Scheduler
- Monitors all DAGs and determines when tasks should execute
- Parses DAG files continuously (scans the DAGs folder for new files every 5 minutes by default; re-parses existing files roughly every 30 seconds)
- Handles task dependency resolution and sequencing
- Manages retry logic and failure callbacks
- Multiple scheduler instances can run in parallel for high availability (supported since Airflow 2.0)
- Continuously runs in background
Executor
- Determines HOW tasks are physically executed
- Different executors for different deployment models:
- LocalExecutor: Single machine, limited parallelism (~10-50 tasks)
- SequentialExecutor: One task at a time (testing only)
- CeleryExecutor: Distributed across worker nodes (100s of tasks)
- KubernetesExecutor: Each task runs in separate pod (elastic scaling)
Metadata Database
- Central repository for all Airflow state
- Stores: DAG definitions, task instance states, variables, connections, logs metadata
- Should be PostgreSQL 12+ or MySQL 8+ in production
- SQLite only for development (not concurrent access)
- Requires daily backups and monitoring
Message Broker (for distributed setups)
- Redis or RabbitMQ
- Queues tasks from Scheduler to Workers
- Enables horizontal scaling
- Required only for CeleryExecutor
Triggerer Service (Airflow 2.2+)
- Manages all deferrable/async tasks
- Single lightweight async service per deployment
- Can handle 1000+ concurrent deferred tasks
- Monitors events efficiently
- Fires triggers when conditions met
Execution Flow Diagram
1. DAG File Loaded
   ↓
2. Scheduler Parses
   ↓
3. Check Task Ready?
   ├─ Dependencies met?
   ├─ Start date passed?
   └─ Not already running?
   ↓
4. Task State: SCHEDULED
   ↓
5. Queue to Message Broker
   ↓
6. Worker Picks Up Task
   ↓
7. Execution Phase
   ├─ Run operator code
   └─ OR defer to trigger
   ↓
8. Log Output
   ↓
9. Record Result
   ├─ Success → SUCCESS
   ├─ Failure → FAILED
   └─ Skipped → SKIPPED
   ↓
10. Execute Callback (if configured)
   ↓
11. Mark downstream ready
   ↓
12. Update Dashboard
Architecture Patterns
Single Machine Setup
- One server with all components
- LocalExecutor
- SQLite database (acceptable for development)
- Best for: Development, testing, small workflows
Distributed Setup
- Separate scheduler machine
- Multiple worker machines
- CeleryExecutor
- PostgreSQL database
- Redis message broker
- Best for: Production, high volume, 100+ daily DAGs
Kubernetes Native
- Scheduler pod
- Web server pod (multiple replicas)
- KubernetesExecutor
- Managed PostgreSQL (Cloud SQL, RDS)
- Dynamic pod creation per task
- Best for: Cloud-native, elastic scaling, containerized workflows
↑ Back to Table of Contents
SECTION 3: DAG & BUILDING BLOCKS
What is a DAG?
A DAG (Directed Acyclic Graph) is the fundamental unit in Airflow. It represents:
- Directed: Tasks execute in a specific direction (flow from A → B → C)
- Acyclic: No circular dependencies allowed (prevents infinite loops)
- Graph: Visual representation of tasks and relationships
- Configurable: Start date, schedule, retries, timeouts, SLAs, etc.
Key DAG Properties
| Property | Purpose | Example |
|---|---|---|
| dag_id | Unique identifier | 'daily_etl' |
| start_date | First execution date | datetime(2024, 1, 1) |
| schedule | When to run | '@daily', a Timetable, or a Dataset |
| catchup | Backfill missed runs | True/False |
| max_active_runs | Concurrent runs allowed | 1-5 |
| default_args | Task defaults | retries, timeout, owner |
| description | What the DAG does | "Daily customer ETL" |
| tags | Categorization | ['etl', 'daily'] |
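A minimal sketch of how these properties fit together, assuming Airflow 2.4+ (where schedule replaces schedule_interval) and 2.x import paths; the IDs, dates, and values are illustrative only:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_etl",                       # unique identifier
    start_date=datetime(2024, 1, 1),          # first execution date
    schedule="@daily",                        # when to run
    catchup=False,                            # do not backfill missed runs
    max_active_runs=1,                        # concurrent runs allowed
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5), "owner": "data-eng"},
    description="Daily customer ETL",
    tags=["etl", "daily"],
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
```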
Operators (Task Types)
Operators are classes that execute actual work. Each operator is a reusable task that does something specific.
Common Built-in Operators
| Operator | What It Does | Common Parameters |
|---|---|---|
| PythonOperator | Executes Python function | python_callable, op_kwargs |
| BashOperator | Runs bash commands | bash_command |
| SqlOperator | Executes SQL query | sql, conn_id |
| EmailOperator | Sends emails | to, subject, html_content |
| S3Operator | S3 file operations | bucket, key, source/dest |
| HttpOperator | Makes HTTP requests | endpoint, method, data |
| SlackOperator | Sends Slack messages | text, channel, token |
| BranchPythonOperator | Conditional branching | python_callable returning task_id |
| EmptyOperator (formerly DummyOperator) | Placeholder/no-op | Used for flow control |
When to Use Each
- PythonOperator: Data processing, API calls, complex logic
- BashOperator: Run scripts, system commands, existing tools
- SqlOperator: Database transformations, queries
- EmailOperator: Notifications, reports
- S3Operator: Data movement, file management
- HttpOperator: External service integration
- BranchOperator: Conditional workflows
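For instance, a BashOperator and PythonOperator wired together (inside a DAG context such as the one sketched above); the script path and callable are placeholders:

```python
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform_orders(**context):
    # Placeholder transformation; the return value is pushed to XCom automatically.
    return 42

extract = BashOperator(task_id="extract", bash_command="python /opt/scripts/extract.py")
transform = PythonOperator(task_id="transform", python_callable=transform_orders)
extract >> transform
```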
Sensors (Waiting for Conditions)
Sensors are operators that wait for something to happen. They poll a condition until it becomes true, blocking further execution.
Common Sensors
| Sensor | Waits For | Use Case |
|---|---|---|
| FileSensor | File creation on disk | Wait for uploaded data |
| S3KeySensor | S3 object existence | Wait for data in S3 |
| SqlSensor | SQL query condition | Wait for record count |
| TimeSensor | Specific time of day | Wait until 2 PM |
| ExternalTaskSensor | External DAG task | Cross-DAG dependency |
| HttpSensor | HTTP endpoint response | Wait for API ready |
| TimeDeltaSensor | Time duration elapsed | Wait 30 minutes |
Sensor Behavior
- Poke Mode (default): Worker thread continuously checks condition, holds worker slot
- Reschedule Mode: Worker released, task rescheduled when ready
- Smart Sensors (deprecated): batched sensor checking, removed in Airflow 2.4 in favor of deferrable operators
Performance Consideration: Many idle sensors waste resources. Use deferrable operators for long waits.
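A hedged example of a sensor configured in reschedule mode so it does not hold a worker slot between checks; the bucket and key are placeholders, and the Amazon provider is assumed to be installed:

```python
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_file = S3KeySensor(
    task_id="wait_for_orders_file",
    bucket_name="my-data-lake",        # placeholder bucket
    bucket_key="incoming/orders.csv",  # placeholder key
    poke_interval=300,                 # check every 5 minutes
    timeout=6 * 60 * 60,               # give up after 6 hours
    mode="reschedule",                 # release the worker slot between pokes
)
```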
Hooks (Connection Abstractions)
Hooks encapsulate connection and authentication logic to external systems. They're the reusable layer between operators and external services.
Common Hooks
| Hook | Connects To | Purpose |
|---|---|---|
| PostgresHook | PostgreSQL | Query execution |
| MySqlHook | MySQL | Database operations |
| S3Hook | AWS S3 | File operations |
| HttpHook | REST APIs | HTTP requests |
| SlackHook | Slack | Send messages |
| SSHHook | Remote servers | Execute commands |
| DockerHook | Docker daemon | Container operations |
Benefits of Hooks
- Centralize credential management (stored in Airflow connections)
- Reusable across multiple tasks
- No credentials in DAG code
- Easier testing (mock connections)
- Connection pooling support
- Error handling abstraction
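A sketch of a hook in use, assuming the Amazon provider and an aws_default connection; the bucket and key names are made up:

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def upload_summary(**context):
    # Credentials come from the Airflow connection, never from the DAG code.
    hook = S3Hook(aws_conn_id="aws_default")
    hook.load_string(
        string_data="order_count,1000",
        key="reports/daily_summary.csv",  # placeholder key
        bucket_name="my-data-lake",       # placeholder bucket
        replace=True,
    )
```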
Tasks
A task is an instance of an operator within a DAG. It's the smallest unit of work.
Task Configuration
| Setting | Purpose | Example |
|---|---|---|
| task_id | Unique identifier | 'extract_customers' |
| retries | Retry attempts on failure | 2-3 |
| retry_delay | Wait between retries | timedelta(minutes=5) |
| execution_timeout | Maximum execution time | timedelta(hours=1) |
| pool | Concurrency control | 'default_pool' |
| priority_weight | Execution priority | 1-100 (higher = more urgent) |
| trigger_rule | When to execute | 'all_success' (default) |
| depends_on_past | Depend on previous run | False (usually) |
XCom (Cross-Communication)
XCom (cross-communication) allows tasks to pass data to each other.
How XCom Works
- Tasks push values: task_instance.xcom_push(key='count', value=1000)
- Tasks pull values: task_instance.xcom_pull(task_ids='upstream', key='count')
- Automatically stored in the metadata database
- Serialized to JSON by default
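A small sketch of both calls inside Python callables (task IDs are illustrative); note that returning a value from a PythonOperator or @task function stores it under the return_value key automatically:

```python
def extract(**context):
    row_count = 1000
    # Explicit push under a named key.
    context["ti"].xcom_push(key="count", value=row_count)

def report(**context):
    # Pull the value pushed by the upstream task.
    count = context["ti"].xcom_pull(task_ids="extract", key="count")
    print(f"Upstream extracted {count} rows")
```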
XCom Limitations
- Database-backed (not for large data)
- Intended for metadata, not big data
- Consider external storage for large datasets
- Can cause database bloat if overused
↑ Back to Table of Contents
SECTION 4: PLUGINS
What are Plugins?
Plugins allow you to extend Airflow with custom components without modifying core code. They're organized Python modules that register custom functionality.
Types of Plugins
| Plugin Type | Purpose | When to Create |
|---|---|---|
| Custom Operators | New task types | Business-specific operations |
| Custom Sensors | New wait conditions | Proprietary systems |
| Custom Hooks | New connections | Proprietary APIs |
| Custom Executors | New execution patterns | Specialized hardware |
| Custom Macros | New template variables | Common business calculations |
| UI Extensions | Dashboard widgets | Custom monitoring |
Plugin Organization
Plugins live in the plugins/ directory with this structure:
plugins/
βββ operators/
β βββ custom_operator.py
βββ sensors/
β βββ custom_sensor.py
βββ hooks/
β βββ custom_hook.py
βββ macros/
β βββ custom_macros.py
βββ airflow_plugins.py
When to Create Plugins
CREATE plugins for:
- Integrating proprietary/internal systems
- Reusable patterns across organization
- Complex business logic encapsulation
- Specialized monitoring/alerting
DON'T create plugins for:
- One-off tasks (use standard operators)
- Simple data transformations (use PythonOperator)
- Ad-hoc scripts (use BashOperator)
- If built-in operator already exists
Plugin Development
Custom operators extend BaseOperator and override execute() method.
Custom sensors extend BaseSensorOperator and override poke() method.
Custom hooks extend BaseHook with connection logic.
Plugins are loaded automatically from plugins/ directory on scheduler startup.
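A minimal sketch of a custom operator; the class, endpoint, and parameters are hypothetical and only illustrate the extension points named above:

```python
from airflow.models.baseoperator import BaseOperator

class InternalReportOperator(BaseOperator):
    """Hypothetical operator that fetches a report from an internal API."""

    def __init__(self, endpoint: str, output_key: str, **kwargs):
        super().__init__(**kwargs)
        self.endpoint = endpoint
        self.output_key = output_key

    def execute(self, context):
        # A real implementation would call a hook here; this sketch only logs.
        self.log.info("Fetching %s and writing to %s", self.endpoint, self.output_key)
        return self.output_key
```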
↑ Back to Table of Contents
SECTION 5: TASKFLOW API
What is TaskFlow API?
TaskFlow API (Airflow 2.0+) is the modern, Pythonic way to write DAGs using decorators. It simplifies DAG creation by using functions instead of operator instantiation.
Comparison: Traditional vs TaskFlow
| Aspect | Traditional | TaskFlow |
|---|---|---|
| Syntax | Verbose operator creation | Clean function decorators |
| Data Passing | Manual XCom push/pull | Implicit via return values |
| Dependencies | Explicit >> operators | Implicit via parameters |
| Code Length | More lines, repetitive | Fewer lines, concise |
| Type Hints | Optional | Strongly recommended |
| Learning Curve | Moderate | Easier for Python devs |
| IDE Support | Basic | Excellent (autocomplete) |
Key Decorators
@dag
- Decorates the function that contains tasks
- Function body defines the workflow
- Returns a DAG instance
- Replaces DAG() class creation
@task
- Decorates function that becomes a task
- Function is task logic
- Return value becomes XCom value
- Parameters are upstream dependencies
@task_group
- Groups related tasks together
- Creates visual subgraph in UI
- Simplifies large DAGs
- Can be nested
Dynamic Task Mapping (.expand())
- Creates a task instance for each item in a list (see the sketch after this list)
- Reduces repetitive task creation
- Dynamic fan-out pattern
- Available in Airflow 2.3+
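A hedged TaskFlow sketch combining @dag, @task, and dynamic mapping; the DAG ID and table names are placeholders:

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def taskflow_demo():
    @task
    def extract() -> list:
        return ["customers", "orders", "payments"]

    @task
    def load(table: str) -> str:
        return f"loaded {table}"

    # Dynamic mapping: one 'load' task instance per extracted table.
    load.expand(table=extract())

taskflow_demo()
```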
Advantages of TaskFlow
- Tasks are just Python functions
- Type hints enable IDE autocomplete
- Natural data flow through function parameters
- Automatic XCom handling (no manual push/pull)
- Cleaner, more Pythonic code
- Easier testing (function-based)
- Less boilerplate
- Better code organization
When to Use TaskFlow
Use for:
- New DAGs
- Complex dependencies
- Python-savvy teams
- Modern deployments (2.0+)
Alternative if:
- Legacy Airflow 1.10 requirement
- Using operators not available as tasks
- Team unfamiliar with decorators
↑ Back to Table of Contents
SECTION 6: DEFERRABLE OPERATORS & TRIGGERS
The Problem: Resource Waste
Traditional Sensors (Blocking)
- Sensor holds worker slot while waiting
- Worker CPU ~5-10% even during idle
- Memory ~50MB per idle task
- Can handle ~50-100 concurrent waits
- Resource waste for long waits (hours/days)
Example: 1000 tasks waiting 1 hour = 1000 worker slots held continuously = massive waste
The Solution: Deferrable Operators
Deferrable Approach (Non-blocking)
- Task releases worker slot immediately after deferring
- Minimal CPU usage (~0.1%) during wait
- Memory ~1MB per deferred task
- Can handle 1000+ concurrent waits
- ~95% resource savings compared to traditional
How Deferrable Operators Work
4-Phase Execution:
1. Execution Phase: Task executes initial logic
   - Setup, validation, prerequisites
   - Then: defer to trigger
   - Worker slot RELEASED
2. Deferred Phase: Trigger monitors the event
   - Async/event-based monitoring
   - Handled by the triggerer service
   - Very lightweight
   - No worker slot needed
3. Completion Phase: Event occurs
   - Trigger detects condition met
   - Returns event data
   - Trigger fires
4. Resume Phase: Task completes
   - Worker acquires a new slot
   - Task resumes with the trigger result
   - Final processing
   - Task completes
Available Deferrable Operators
| Operator | Defers For | Best For |
|---|---|---|
| TimeSensorAsync | Specific time | Waiting until business hours |
| S3KeySensorAsync | S3 object | Long-running file uploads |
| ExternalTaskSensorAsync | External DAG | Cross-DAG long waits |
| SqlSensorAsync | SQL condition | Database state changes |
| HttpSensorAsync | HTTP response | External service availability |
| TimeDeltaSensorAsync | Duration | Fixed time wait |
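A minimal usage sketch with a core async sensor (it requires the triggerer service described below); the two-hour wait and task ID are arbitrary:

```python
from datetime import timedelta

from airflow.sensors.time_delta import TimeDeltaSensorAsync

wait_for_settlement = TimeDeltaSensorAsync(
    task_id="wait_for_settlement_window",
    delta=timedelta(hours=2),  # task defers; no worker slot is held during the wait
)
```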
Triggerer Service
- Manages all active triggers
- Single lightweight async service per deployment
- Default capacity: 1000 triggers
- Requires the airflow triggerer process
- Monitors events, fires when ready
- Stateless (can restart without losing triggers)
When to Use Deferrable
Use when:
- Wait duration > 5 minutes
- Many concurrent waits
- Resource-constrained environment
- Can tolerate event latency (<1 second)
- Long-running waits (hours, days)
Don't use when:
- Wait < 5 minutes (overhead not worth it)
- Real-time detection critical (sub-second)
- Simple local operations
- Frequent detection required
Performance Impact
1000 Tasks Waiting 1 Hour
Traditional: 1000 worker slots × 1 hour = 1000 slot-hours wasted
Deferrable: 0 worker slots, minimal CPU in trigger service
Cost Savings: ~95% for typical scenarios
↑ Back to Table of Contents
SECTION 7: DEPENDENCIES & RELATIONSHIPS
Task Dependencies
Linear Chain: Task A → Task B → Task C → Task D
- Sequential execution
- Each waits for previous
- Easiest to understand
- Best for: ETL pipelines with stages
Fan-out / Fan-in: One task triggers many, then merge
          ┌─ Process A ─┐
Extract ──┤             ├── Merge
          └─ Process B ─┘
- Parallel processing
- Best for: Multiple independent operations
- Example: Extract once, transform multiple ways
Branching: Conditional execution
         ┌─ Success path
Branch ──┤
         └─ Error path
- Different paths based on condition
- Best for: Alternative workflows
- Example: If data valid, load; else alert
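A sketch of the fan-out / fan-in pattern using the bit-shift syntax; the task names are placeholders and EmptyOperator stands in for real work:

```python
from airflow.operators.empty import EmptyOperator

extract = EmptyOperator(task_id="extract")
process_a = EmptyOperator(task_id="process_a")
process_b = EmptyOperator(task_id="process_b")
merge = EmptyOperator(task_id="merge")

# Fan-out to the two processors, then fan-in to merge.
extract >> [process_a, process_b] >> merge
```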
Trigger Rules
Trigger rules determine WHEN downstream tasks execute based on UPSTREAM status.
| Rule | Executes When | Use Case |
|---|---|---|
| all_success | All upstream succeed | Default, sequential flow |
| all_failed | All upstream fail | Cleanup after failures |
| all_done | All complete (any status) | Final reporting |
| one_success | At least one succeeds | Alternative paths |
| one_failed | At least one fails | Error alerts |
| none_failed | No upstream failures (skips OK) | Robust pipelines (recommended) |
| none_failed_min_one_success (formerly none_failed_or_skipped) | No failures and at least one success | Branching workflows |
Best Practice: Use none_failed for fault-tolerant pipelines.
DAG-to-DAG Dependencies
Methods to Create
| Method | Mechanism | Best For |
|---|---|---|
| ExternalTaskSensor | Poll for completion | Different schedules |
| TriggerDagRunOperator | Actively trigger | Orchestrate workflows |
| Datasets | Event-driven | Data pipelines |
Datasets (Airflow 2.4+)
- Producer DAG updates dataset
- Consumer DAG triggered automatically
- Real-time responsiveness (<1 second)
- Low latency, efficient
- Best for data-driven workflows
Example:
- ETL DAG extracts and updates orders_dataset
- Analytics DAG is triggered automatically when the dataset updates
- No waiting, real-time processing
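A hedged producer/consumer sketch, assuming Airflow 2.4+; the dataset URI and DAG IDs are examples only:

```python
from datetime import datetime

from airflow import DAG, Dataset
from airflow.operators.python import PythonOperator

orders_dataset = Dataset("s3://my-data-lake/orders/")  # placeholder URI

with DAG("orders_etl", start_date=datetime(2024, 1, 1), schedule="@hourly", catchup=False):
    PythonOperator(
        task_id="load_orders",
        python_callable=lambda: print("orders loaded"),
        outlets=[orders_dataset],  # marks the dataset as updated when the task succeeds
    )

with DAG("orders_analytics", start_date=datetime(2024, 1, 1), schedule=[orders_dataset], catchup=False):
    PythonOperator(task_id="refresh_dashboard", python_callable=lambda: print("refreshed"))
```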
↑ Back to Table of Contents
SECTION 8: SCHEDULING
Overview of Scheduling Methods
Three ways to trigger DAG runs:
- Time-based: Specific times/intervals
- Event-driven: When data arrives/updates
- Manual: User-triggered via UI
Time-Based: Cron Expressions (Legacy)
Cron format: minute hour day month day_of_week
Common Patterns
- @hourly: Every hour
- @daily: Every day at midnight
- @weekly: Every Sunday at midnight
- @monthly: First day of month
- 0 9 * * MON-FRI: 9 AM on weekdays
- */15 * * * *: Every 15 minutes
Limitations
- Simple time-only scheduling
- Limited to time expressions
- No business logic
- Timezone handling tricky
Time-Based: Timetable API (Modern)
Timetables (Airflow 2.2+) provide object-oriented scheduling with unlimited flexibility.
Built-in Timetables
- CronDataIntervalTimetable: cron-based with data intervals (what cron schedule_interval strings map to)
- CronTriggerTimetable: cron-based without data intervals
- DeltaDataIntervalTimetable: fixed timedelta intervals
Custom Timetables Allow
- Business day scheduling (Mon-Fri)
- Holiday-aware schedules (skip holidays)
- Timezone-specific runs
- Complex business logic
- End-of-month runs
- Business hour runs (9 AM - 5 PM)
Advantages
- Unlimited flexibility
- Timezone support
- Readable code
- Testable
- Documented rationale
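For example, a built-in timetable can express the weekday business-hours case directly; the DAG ID, cron string, and timezone below are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.timetables.trigger import CronTriggerTimetable

with DAG(
    dag_id="weekday_report",
    start_date=datetime(2024, 1, 1),
    schedule=CronTriggerTimetable("0 9 * * MON-FRI", timezone="America/New_York"),
    catchup=False,
):
    ...
```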
Event-Driven: Datasets (Airflow 2.4+)
Datasets enable data-driven workflows instead of time-driven.
How it Works
- Producer task completes and updates dataset
- Scheduler detects dataset update
- Consumer DAG triggered immediately
- No scheduled waiting
Benefits
- Real-time responsiveness
- Resource efficient (no idle waits)
- Data-quality focused
- Reduced latency (<1 second)
- Perfect for data pipelines
When to Use
- Multi-stage ETL (extract → transform → load)
- Event streaming
- Real-time processing
- Data-dependent workflows
Hybrid Scheduling
Combine time-based and event-driven:
- DAG runs on schedule OR when data arrives
- Whichever comes first
- Best of both worlds
- Flexible and responsive
Scheduling Decision Tree
TIME-BASED?
├─ YES → Complex logic?
│         ├─ YES → Custom Timetable
│         └─ NO → Cron Expression
│
└─ NO → Event-driven?
          ├─ YES → Use Datasets
          └─ NO → Manual trigger or reconsider design
↑ Back to Table of Contents
SECTION 9: FAILURE HANDLING
Retry Mechanisms
Purpose: Allow tasks to recover from transient failures (network timeout, API rate limit, etc.)
Configuration
- retries: Number of retry attempts
- retry_delay: Wait time between retries
- retry_exponential_backoff: Exponentially increasing delays
- max_retry_delay: Cap on maximum wait (a sample configuration follows below)
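A sample configuration matching the sequence shown in the next subsection; the exact numbers are illustrative:

```python
from datetime import timedelta

default_args = {
    "retries": 6,
    "retry_delay": timedelta(minutes=1),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(hours=1),
}
```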
Example Retry Sequence
Attempt 1: Fail → Wait 1 minute
Attempt 2: Fail → Wait 2 minutes (1 × 2^1)
Attempt 3: Fail → Wait 4 minutes (1 × 2^2)
Attempt 4: Fail → Wait 8 minutes (1 × 2^3)
Attempt 5: Fail → Wait 16 minutes (1 × 2^4)
Attempt 6: Fail → Wait capped at 1 hour (max_retry_delay)
Attempt 7: Fail → Task FAILED
Best Practice: 2-3 retries with exponential backoff for most tasks.
Trigger Rules (Already Covered in Section 7)
But worth repeating: Use none_failed for robust pipelines.
Callbacks
Execute custom logic when task state changes:
- on_failure_callback: when the task fails
- on_retry_callback: when the task retries
- on_success_callback: when the task succeeds
Common Uses
- Send notifications (Slack, email, PagerDuty)
- Log to external monitoring systems
- Trigger incident management
- Update dashboards
- Cleanup resources
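A minimal callback sketch; the notification itself is a placeholder for a real Slack, email, or PagerDuty call:

```python
def notify_failure(context):
    ti = context["task_instance"]
    # Replace the print with a real alerting call.
    print(f"Task {ti.task_id} in DAG {ti.dag_id} failed for run {context['ds']}")

default_args = {
    "on_failure_callback": notify_failure,
}
```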
Error Handling Patterns
Pattern 1: Idempotency
- Make tasks safe to retry
- Delete-then-insert instead of insert
- Same result regardless of executions
- Safe for retries
Pattern 2: Circuit Breaker
- Stop retrying after N attempts
- Raise critical exceptions
- Alert operations team
- Prevent cascading failures
Pattern 3: Graceful Degradation
- Continue with partial data
- Log warnings
- Don't fail entire pipeline
- Better UX than all-or-nothing
Pattern 4: Skip on Condition
- Raise AirflowSkipException
- Skip task without marking it as a failure
- Allow downstream to continue
- Best for conditional workflows
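A short sketch of the skip pattern; the row-count check is illustrative:

```python
from airflow.exceptions import AirflowSkipException

def load_if_data_present(row_count: int):
    if row_count == 0:
        # Marks the task as skipped rather than failed; with the default
        # all_success trigger rule, downstream tasks are skipped as well.
        raise AirflowSkipException("No new rows today; skipping load")
    print(f"Loading {row_count} rows")
```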
↑ Back to Table of Contents
SECTION 10: SLA MANAGEMENT
What is SLA?
SLA (Service Level Agreement) is a guarantee that a task will complete within a specified time window.
If SLA is missed:
- Task marked as "SLA Missed"
- Notifications triggered
- Callbacks executed
- Logged for auditing/metrics
Setting SLAs
At DAG Level: All tasks inherit SLA
default_args = {
'sla': timedelta(hours=1)
}
At Task Level: Specific to individual task
task = PythonOperator(
    task_id='critical_task',
    python_callable=process_critical_data,  # hypothetical callable, for illustration
    sla=timedelta(minutes=30)
)
Per Task Overrides: Task SLA overrides DAG default
SLA Monitoring
Key Metrics
- SLA compliance rate (%)
- Average overdue duration
- Most violated tasks
- Impact on downstream
Alerts
- Send notifications when SLA at risk (80%)
- Escalate on missed SLA
- Create incidents for critical SLAs
Dashboard
- Track compliance trends
- Identify problematic tasks
- Alert history
- Remediation tracking
SLA Best Practices
| Practice | Benefit |
|---|---|
| Set realistic SLAs | Avoid false positives |
| Different tiers | Critical < Normal < Optional |
| Review quarterly | Account for growth |
| Document rationale | Team alignment |
| Escalation policy | Proper incident response |
| Callbacks | Automated notifications |
| Monitor compliance | Identify trends |
↑ Back to Table of Contents
SECTION 11: CATCHUP BEHAVIOR
What is Catchup?
Catchup is automatic backfill of missed DAG runs.
Example Scenario
Start date: 2024-01-01
Today: 2024-05-15
Schedule: @daily (every day)
Catchup: True
RESULT: Airflow creates 135 DAG runs instantly!
(One for each day from Jan 1 to May 14)
Catchup Settings
| Setting | Behavior | When to Use |
|---|---|---|
| catchup=True | Backfill all missed runs | Historical data loading |
| catchup=False | No backfill, start fresh | New pipelines |
Catchup Risks
Potential Issues
- Resource exhaustion (CPU, memory, database)
- Network bandwidth spike
- Third-party API rate limits exceeded
- Database connection pool exhausted
- Worker node crashes
- Scheduler paralysis (busy processing backfill)
Safe Catchup Strategy
Step 1: Use Separate Backfill DAG
- Main DAG: catchup=False (normal operations)
- Backfill DAG: catchup=True, end_date set (historical only)
- Run backfill during maintenance window
Step 2: Control Parallelism
- max_active_runs: Limit concurrent DAG runs (1-3)
- max_active_tasks_per_dag: Limit concurrent tasks (16-32)
- Prevents resource exhaustion
Step 3: Staged Approach
- Test with subset of dates first
- Monitor resource usage
- Gradually increase pace
- Have rollback plan
Step 4: Validation
- Verify data completeness
- Check for duplicates
- Validate transformations
- Compare with expected results
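A sketch of a separate backfill DAG following this strategy, with a bounded date range and capped concurrency; the DAG ID and dates are placeholders:

```python
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="orders_backfill_2023",
    start_date=datetime(2023, 1, 1),
    end_date=datetime(2023, 12, 31),  # bound the backfill window
    schedule="@daily",
    catchup=True,                     # intentionally backfill
    max_active_runs=2,                # throttle concurrent DAG runs
    max_active_tasks=16,              # throttle concurrent tasks
):
    ...
```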
Catchup Checklist
Before enabling catchup, verify:
- Database capacity available
- Worker resources adequate
- Source data exists for all periods
- Data schema hasn't changed
- No API rate limits exceeded
- Network bandwidth available
- Stakeholders informed
- Rollback procedure ready
↑ Back to Table of Contents
SECTION 12: CONFIGURATION
Configuration Hierarchy
Airflow resolves each setting in this order (first found wins):
- Environment variables (AIRFLOW__{SECTION}__{KEY}, e.g. AIRFLOW__CORE__EXECUTOR)
- airflow.cfg file (main config)
- Default configuration (built-in)
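To check which value actually won, the effective configuration can be read in Python; a small sketch using the conf object:

```python
from airflow.configuration import conf

print(conf.get("core", "executor"))        # e.g. overridden by AIRFLOW__CORE__EXECUTOR
print(conf.getint("core", "parallelism"))  # falls back to airflow.cfg, then the default
```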
Critical Settings
| Setting | Purpose | Production Value |
|---|---|---|
| sql_alchemy_conn | Metadata database | PostgreSQL connection |
| executor | Task execution | CeleryExecutor |
| parallelism | Max tasks globally | 32-128 |
| max_active_runs_per_dag | Concurrent DAG runs | 1-3 |
| max_active_tasks_per_dag | Concurrent tasks | 16-32 |
| remote_logging | Centralized logs | True (to S3) |
| email_backend | Email provider | SMTP |
| base_log_folder | Log storage | NFS mount |
Database Configuration
Production Database
- PostgreSQL 12+ (recommended) or MySQL 8+
- Dedicated database server (not shared)
- Daily backups automated
- Connection pooling enabled
- Monitoring for slow queries
- Regular maintenance (VACUUM, ANALYZE)
Why Not SQLite
- Not concurrent-access safe
- Locks on writes
- Single machine only
- Not for production
Executor Configuration
| Executor | Config | Best For |
|---|---|---|
| Local | [local_executor] | Development |
| Celery | [celery] | Distributed, high volume |
| Kubernetes | [kubernetes] | Cloud-native |
| Sequential | (internal) | Testing only |
Performance Tuning
Database
- Connection pooling (pool_size=5, max_overflow=10)
- Indexes on frequently queried columns
- Archive old logs/task instances
- Query optimization
Scheduler
- Increase DAG parsing frequency if needed
- Optimize DAG file count
- Use pool resources wisely
Web Server
- Cache results
- Limit search scope
- Monitor response times
↑ Back to Table of Contents
SECTION 13: AWS INTEGRATION
Connection Setup
Three Methods
- Airflow UI: Admin → Connections → Create
- Environment Variables: AIRFLOW_CONN_AWS_DEFAULT
- Programmatic: Python script creating Connection objects
Authentication Types
- Access Key / Secret Key (discouraged)
- IAM Roles (recommended for EC2)
- AssumeRole (cross-account access)
- Temporary credentials
Common AWS Services
| Service | What It Does | Use Case |
|---|---|---|
| S3 | File storage | Data lake, logs, backups |
| Redshift | Data warehouse | Analytics, aggregations |
| Lambda | Serverless compute | Lightweight processing |
| SNS | Notifications | Alerts, messaging |
| SQS | Message queue | Task decoupling |
| Glue | ETL service | Spark/Python jobs |
| RDS/Aurora | Managed database | Transactional data |
| EMR | Big data | Spark, Hadoop clusters |
| CloudWatch | Monitoring | Metrics, logs, alarms |
Operators for AWS Services
S3 Operations
- Upload/download files
- Copy within S3
- List objects
- Delete files
- Check existence
Redshift Operations
- Execute SQL queries
- Load data from S3
- Create/delete snapshots
- Pause/resume clusters
Lambda Operations
- Invoke functions
- Async execution
- Sync execution with polling
SQS/SNS Operations
- Send messages to queue
- Receive from queue
- Publish to topics
- Batch operations
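As one example, a hedged S3-to-Redshift load using the Amazon provider's transfer operator; the bucket, schema, table, and connection IDs are placeholders:

```python
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

load_orders = S3ToRedshiftOperator(
    task_id="load_orders_to_redshift",
    s3_bucket="my-data-lake",               # placeholder bucket
    s3_key="incoming/orders.csv",           # placeholder key
    schema="analytics",                     # placeholder schema
    table="orders",                         # placeholder table
    copy_options=["CSV", "IGNOREHEADER 1"],
    redshift_conn_id="redshift_default",
    aws_conn_id="aws_default",
)
```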
AWS Best Practices
| Practice | Benefit |
|---|---|
| Use IAM roles | No credentials in code |
| Secrets Manager | Centralized credential storage |
| Enable CloudTrail | Audit all API calls |
| Use VPC endpoints | S3 access without internet |
| Monitor costs | CloudWatch cost alerts |
| Implement budgets | Prevent overspending |
| Regular backups | Disaster recovery |
| Region strategy | Latency, compliance |
↑ Back to Table of Contents
SECTION 14: VERSION COMPARISON
Version Timeline
1.10.x (Deprecated, EOL 2021)
  ↓ Major Redesign
2.0 (December 2020) - TaskFlow API introduced
  ↓
2.1-2.2 (Improvements & deferrable operators)
  ↓
2.4 (September 2022) - Datasets, event-driven scheduling
  ↓
2.5+ (Stable, production-ready)
  ↓
3.0 (Latest) - Performance focus, async improvements
Key Feature Comparison
| Feature | 1.10 | 2.x | 3.0 |
|---|---|---|---|
| TaskFlow API | ❌ | ✅ | ✅ |
| Deferrable Ops | ❌ | ✅ | ✅ |
| Datasets | ❌ | ✅ | ✅ |
| Performance | Baseline | 3-5x faster | 5-6x faster |
| Python 3.9+ | ❌ | ✅ | ✅ Required |
| Async/Await | ❌ | Limited | ✅ Full |
Performance Improvements
| Metric | 1.10 | 2.5 | 3.0 |
|---|---|---|---|
| DAG parsing (1000 DAGs) | ~120s | ~45s | ~15s |
| Task scheduling latency | ~5-10s | ~2-3s | <1s |
| Memory (idle scheduler) | ~500MB | ~300MB | ~150MB |
| Web UI response | ~800ms | ~200ms | ~50ms |
Migration Path
From 1.10 to 2.x
- Python 3.6+ required
- Import paths changed (reorganized)
- Operators moved to providers
- Test in staging thoroughly
- Gradual production rollout (1-2 weeks)
From 2.x to 3.0
- Most DAGs work without changes
- Drop Python < 3.9
- Update async patterns
- Leverage new performance features
- Simpler migration (backward compatible)
Version Selection
| Scenario | Recommended |
|---|---|
| New projects | 2.5+ or 3.0 |
| Production (stable) | 2.4.x |
| Enterprise | 2.5+ |
| Performance-critical | 3.0 |
| Learning | 2.5+ or 3.0 |
↑ Back to Table of Contents
SECTION 15: BEST PRACTICES
DAG Design DO's
✅ Use TaskFlow API for modern code
✅ Keep tasks small and focused
✅ Name tasks descriptively
✅ Document complex logic
✅ Use type hints
✅ Add comprehensive logging
✅ Handle errors gracefully
✅ Test DAGs before production
DAG Design DON'Ts
❌ Put heavy processing in DAG definition
❌ Create 1000s of tasks per DAG
❌ Hardcode credentials
❌ Ignore data validation
❌ Use DAG-level database connections
❌ Forget about idempotency
❌ Skip error handling
❌ Assume tasks always succeed
Task Design DO's
✅ Make tasks idempotent
✅ Set appropriate timeouts
✅ Implement retries with backoff
✅ Use pools for resource control
✅ Skip tasks on errors (when appropriate)
✅ Log important events
✅ Clean up resources
Task Design DON'Ts
❌ Assume tasks always succeed
❌ Run long operations without timeout
❌ Hold resources unnecessarily
❌ Store large data in XCom
❌ Use blocking operations
❌ Ignore upstream failures
❌ Forget cleanup
Scheduling DO's
✅ Match frequency to data needs
✅ Account for upstream dependencies
✅ Plan for failures
✅ Test schedule changes
✅ Document rationale
✅ Monitor adherence
Scheduling DON'Ts
❌ Schedule too frequently
❌ Ignore resource constraints
❌ Forget about catchup
❌ Use ambiguous expressions
❌ Create tight couplings
❌ Skip monitoring
↑ Back to Table of Contents
SECTION 16: MONITORING
Key Metrics
| Metric | What It Shows | Alert Threshold |
|---|---|---|
| Scheduler heartbeat | Scheduler running | >30s = ALERT |
| Task failure rate | Pipeline health | >5% = INVESTIGATE |
| SLA violations | Performance issues | >10/day = REVIEW |
| DAG parsing time | System load | >60s = OPTIMIZE |
| Task queue depth | Worker capacity | >1000 = ADD WORKERS |
| DB connections | Connection health | >80% = INCREASE |
| Log size | Storage usage | Monitor growth |
Common Issues & Solutions
| Issue | Likely Cause | Solution |
|---|---|---|
| Tasks stuck | Worker crashed | Restart workers |
| Scheduler lagging | High load | Add resources |
| Memory leaks | Bad code | Profile and fix |
| Database slow | No indexes | Optimize queries |
| Triggers not firing | Triggerer down | Check service |
| SLA misses | Unrealistic goals | Adjust SLA |
| Catchup explosion | Too many runs | Limit with max_active |
Monitoring Stack Components
Metrics Collection
- Prometheus for collection
- Custom StatsD metrics
- CloudWatch for AWS
Visualization
- Grafana dashboards
- CloudWatch dashboards
- Custom visualizations
Alerting
- Threshold-based alerts
- Anomaly detection
- Escalation policies
- Integration with incident systems
Logging
- Centralized logs (ELK, Splunk)
- Remote logging to S3
- Log aggregation
- Searchable archives
↑ Back to Table of Contents
SECTION 17: CONNECTIONS AND SECRETS
What are Connections?
Connections store authentication credentials to external systems. Instead of hardcoding database passwords or API keys in DAGs, Airflow stores these centrally.
Components:
- Connection ID: Unique identifier (e.g., postgres_default)
- Connection Type: System type (postgres, aws, http, mysql, etc.)
- Host: Server address
- Login: Username
- Password: Secret credential
- Port: Service port
- Schema: Database or default path
- Extra: JSON for additional parameters
Why Use Connections?
- Credentials not in DAG code (can't accidentally commit to git)
- Centralized credential management
- Easy credential rotation
- Different credentials per environment (dev, staging, prod)
- Audit trail of access
Connection Types
| Type | Used For | Example |
|---|---|---|
| postgres | PostgreSQL databases | Analytics warehouse |
| mysql | MySQL databases | Transactional data |
| snowflake | Snowflake warehouse | Cloud data platform |
| aws | AWS services (S3, Redshift, Lambda) | Cloud infrastructure |
| gcp | Google Cloud services | BigQuery, Cloud Storage |
| http | REST APIs | External services |
| ssh | Remote server access | Bastion hosts |
| slack | Slack messaging | Notifications |
| sftp | SFTP file transfer | Remote servers |
What are Secrets/Variables?
Variables store non-sensitive configuration values.
Secrets store sensitive data (marked as secret).
| Aspect | Variables | Secrets |
|---|---|---|
| Purpose | Configuration settings | Passwords, API keys |
| Visibility | Visible in UI | Masked/hidden in UI |
| Storage | Metadata database | External backend (Vault, etc.) |
| Example | batch_size, email | api_key, db_password |
How to Add Connections
Method 1: Web UI (Easiest)
- Airflow UI → Admin → Connections → Create Connection
- Fill form with credentials
- Click Save
Method 2: Environment Variables
Format: AIRFLOW_CONN_{CONNECTION_ID}={CONN_TYPE}://{USER}:{PASSWORD}@{HOST}:{PORT}/{SCHEMA}
Example:
export AIRFLOW_CONN_POSTGRES_DEFAULT=postgresql://user:password@localhost:5432/mydb
export AIRFLOW_CONN_AWS_DEFAULT=aws://ACCESS_KEY:SECRET_KEY@
Method 3: Python Script
Create Connection objects programmatically and add to database.
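A hedged sketch of that approach using the ORM session; the host and credentials are placeholders (real secrets belong in a secrets backend):

```python
from airflow import settings
from airflow.models import Connection

conn = Connection(
    conn_id="postgres_prod",
    conn_type="postgres",
    host="db.internal.example.com",  # placeholder host
    login="airflow_user",
    password="change-me",            # placeholder; use a secrets backend in production
    port=5432,
    schema="analytics",
)

session = settings.Session()
session.add(conn)
session.commit()
```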
Method 4: Airflow CLI
airflow connections add postgres_prod \
--conn-type postgres \
--conn-host localhost \
--conn-login user \
--conn-password password
How to Add Variables/Secrets
Method 1: Web UI
- Airflow UI → Admin → Variables → Create Variable
- Enter Key and Value
- Values are automatically masked in the UI when the key contains a sensitive word such as "secret", "password", or "api_key"
- Click Save
Method 2: Environment Variables
export AIRFLOW_VAR_BATCH_SIZE=1000
export AIRFLOW_VAR_API_KEY=secret123
Method 3: Python Script
Create Variable objects and add to database.
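A small sketch using the Variable model directly; the keys and values are examples only:

```python
from airflow.models import Variable

Variable.set("batch_size", 1000)
Variable.set("config", {"region": "us-east-1"}, serialize_json=True)

batch_size = Variable.get("batch_size", default_var=500)
```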
Method 4: Airflow CLI
airflow variables set batch_size 1000
airflow variables set api_key secret123
Using Connections in DAGs
Connections are accessed through Hooks (connection abstraction layer).
from airflow.providers.postgres.hooks.postgres import PostgresHook
def query_database(**context):
hook = PostgresHook(postgres_conn_id='postgres_default')
result = hook.get_records("SELECT * FROM users")
return result
Benefits of Hooks:
- Encapsulate connection logic
- Reusable across tasks
- Handle connection pooling
- Error handling built-in
- No credentials in task code
Using Variables in DAGs
Variables are accessed directly:
from airflow.models import Variable
batch_size = Variable.get('batch_size', default_var=1000)
api_key = Variable.get('api_key')
is_production = Variable.get('is_production', default_var='false')
# With JSON parsing
config = Variable.get('config', deserialize_json=True)
Secrets Backend: Advanced Security
For production, use external secrets backends instead of storing in Airflow database.
| Backend | Security Level | Best For |
|---|---|---|
| Metadata DB | Low | Development only |
| AWS Secrets Manager | High | AWS deployments |
| HashiCorp Vault | Very High | Enterprise |
| Google Secret Manager | High | GCP deployments |
| Azure Key Vault | High | Azure deployments |
How it works:
- Configure Airflow to use external backend
- Airflow queries backend when credential needed
- Secrets never stored in database
- Automatic rotation supported
- Audit logging in backend
Security Best Practices
DO ✅
- Use external secrets backend in production
- Rotate credentials quarterly
- Use IAM roles instead of hardcoded keys
- Encrypt connections in transit (SSL/TLS)
- Encrypt secrets at rest (database encryption)
- Limit who can see connections (RBAC)
- Use short-lived credentials (temporary tokens)
- Audit access to secrets
- Never commit credentials to git
DON'T ❌
- Hardcode credentials in DAG code
- Store passwords in comments or docstrings
- Share credentials via Slack/Email
- Use default passwords
- Store credentials in config files in git
- Forget to revoke old credentials
- Use same credentials everywhere
- Log credential values
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| Connection refused | Service not running | Start the service |
| Authentication failed | Wrong credentials | Verify in source system |
| Host not found | Wrong hostname/DNS | Check network, DNS resolution |
| Timeout | Network unreachable | Check firewall, routing |
| Variable not found | Wrong key name | Check variable ID spelling |
| Permission denied | User lacks access | Check user RBAC permissions |
Testing Connections
Via Web UI:
- Navigate to Admin → Connections
- Click on connection
- Click "Test Connection"
Via CLI:
airflow connections test postgres_default
airflow connections test aws_default
Via Python:
from airflow.providers.postgres.hooks.postgres import PostgresHook
hook = PostgresHook(postgres_conn_id='postgres_default')
hook.get_conn() # Will raise error if connection fails
Production Setup
Minimum Requirements:
✅ All connections stored centrally (not hardcoded)
✅ Credentials in external secrets backend
✅ RBAC enabled (limit connection visibility)
✅ No credentials in DAG code
✅ No credentials in git history
✅ Quarterly credential rotation
✅ SSL/TLS enabled for external connections
✅ Audit logging for credential access
✅ Monitoring/alerts for failed authentication
↑ Back to Table of Contents
SUMMARY & CHECKLIST
Production Deployment Checklist
INFRASTRUCTURE
☐ PostgreSQL 12+ database
☐ Separate scheduler machine
☐ Multiple web server instances (load balanced)
☐ Multiple worker machines (if Celery)
☐ Message broker (Redis/RabbitMQ)
☐ Triggerer service running
☐ NFS for logs (centralized)
☐ Backup systems
CONFIGURATION
☐ Connection pooling configured
☐ Resource pools created
☐ Email/Slack configured
☐ Remote logging to S3/cloud
☐ Monitoring setup (Prometheus/Grafana)
☐ Alert thresholds defined
☐ Timezone configured correctly
SECURITY
☐ RBAC enabled in UI
☐ Credentials in Secrets Manager
☐ SSL/TLS enabled
☐ Database encryption at rest
☐ VPC configured
☐ Firewall rules tight
☐ Regular security updates
☐ No credentials in DAG code
RELIABILITY
☐ Metadata backup automated (daily)
☐ DAG code in Git repository
☐ Disaster recovery plan documented
☐ Rollback procedure tested
☐ SLA enforcement active
☐ Callback notifications configured
☐ Data validation implemented
MONITORING
☐ Scheduler health check
☐ Worker heartbeat monitoring
☐ Database performance monitoring
☐ Trigger service health
☐ DAG/task failure alerts
☐ SLA violation alerts
☐ Resource usage dashboards
☐ Cost tracking (if cloud)
PERFORMANCE
☐ Database indexes optimized
☐ Connection pool sized appropriately
☐ Max active runs limited
☐ Task priorities set
☐ Deferrable operators for long waits
☐ Legacy Smart Sensors migrated to deferrable operators (Smart Sensors were removed in Airflow 2.4)
☐ DAG parsing optimized
DOCUMENTATION
☐ DAG documentation complete
☐ SLA rationale documented
☐ Runbooks created for issues
☐ Team trained on platform
☐ Changes logged in git
☐ Disaster recovery plan shared
☐ Escalation procedures documented
↑ Back to Table of Contents
RECOMMENDATIONS
For New Projects Starting Today
- Choose Airflow 2.5+ or 3.0 (not 1.10)
- Use TaskFlow API (modern, clean code)
- Use Timetable API for scheduling (not cron strings)
- Use Deferrable operators for long waits
- Plan for scale from day one (don't start with LocalExecutor)
- Set up monitoring immediately (not later)
- Use event-driven scheduling (Datasets) where possible
- Store credentials externally (Secrets Manager, not hardcoded)
For Teams Upgrading from 1.10
- Plan 2-3 months for migration (not quick)
- Test thoroughly in staging environment
- Keep 1.10 backup for emergency rollback
- Update DAGs incrementally (not all at once)
- Monitor closely first month post-upgrade
- Document breaking changes for team
- Celebrate milestones (keep team motivated)
For Large Enterprise Deployments
- Use Kubernetes executor (scalability, isolation)
- Implement comprehensive monitoring (critical for troubleshooting)
- Set up disaster recovery (business continuity)
- Regular chaos engineering tests (find weak points before production)
- Capacity planning quarterly (predict growth)
- Performance optimization ongoing (never "done")
- Security audits regular (compliance, risk)
For Data Pipeline Use Cases
- Use event-driven scheduling (Datasets, not time-based)
- Implement data quality checks (validate early)
- Monitor SLAs closely (data freshness critical)
- Implement idempotent tasks (safe retries)
- Plan catchup carefully (historical data loads)
- Version control everything (DAGs, configurations)
- Use separate environment (dev, staging, prod)
↑ Back to Table of Contents
CONCLUSION
The Airflow Journey
Apache Airflow has evolved from a simple scheduler into a mature, production-grade workflow orchestration platform. The progression from version 1.10 through modern versions demonstrates significant improvements in usability, performance, flexibility, and reliability.
Key Takeaways
Architecture & Design
- TaskFlow API provides clean, Pythonic DAG definitions
- Deferrable operators enable massive resource savings for long-running waits
- Plugins allow endless extensibility without modifying core
Scheduling & Flexibility
- Choose appropriate scheduling method (time, event, or hybrid)
- Event-driven scheduling (Datasets) perfect for data pipelines
- Timetable API enables business logic in scheduling
Reliability & Operations
- Retries with exponential backoff handle transient failures
- Proper error handling prevents cascading failures
- SLAs enforce performance agreements
- Comprehensive monitoring catches issues early
Scalability & Performance
- Multiple executor options for different scales
- Deferrable operators handle 1000s of concurrent waits
- Performance improvements: 1.10 → 2.5 → 3.0 shows 5-6x speedup
Real-World Success
- Airflow powers ETL/ELT pipelines at hundreds of companies
- Used for data engineering, ML orchestration, DevOps automation
- Solves real problems: dependency management, scheduling, monitoring
Starting Your Airflow Journey
First Step: Understand architecture and core concepts (DAGs, operators, scheduling)
Next Step: Build simple DAGs with TaskFlow API
Then: Add error handling, retries, monitoring
Finally: Optimize for scale, implement production patterns
Airflow is Not
- A replacement for application code
- Suitable for real-time streaming (use Kafka, Spark Streaming)
- A replacement for data warehouses (use Snowflake, BigQuery)
- Simple enough to learn in an hour (takes weeks/months)
Airflow is Perfect For
- Scheduled batch processing (daily, hourly, etc.)
- Multi-stage ETL/ELT workflows
- Cross-system orchestration
- Data quality validation
- DevOps automation
- ML pipeline orchestration
- Report generation
Final Wisdom
Success with Airflow depends not just on technical setup, but on organizational practices:
- Treat DAGs as code (version control, code review, testing)
- Monitor continuously (you can't fix what you don't see)
- Document thoroughly (future you and teammates will thank you)
- Automate responses (reduce manual incident response)
- Iterate and improve (perfect is enemy of done)
Apache Airflow's strength lies in its flexibility, active community, and ecosystem. Whether you're building simple daily pipelines or complex multi-stage data workflows orchestrating hundreds of systems, Airflow provides the tools, patterns, and reliability to succeed.
Start small. Learn thoroughly. Scale thoughtfully. Monitor obsessively.