From local development to enterprise-scale orchestration
Table of Contents
PART 1: FUNDAMENTALS
- Section 1: Introduction
- Section 2: Architecture
- Section 3: DAG & Building Blocks
- Section 4: TaskFlow API
PART 2: ADVANCED CONCEPTS
- Section 5: Deferrable Operators
- Section 6: Dependencies & Relationships
- Section 7: Scheduling
- Section 8: Failure Handling
- Section 9: SLA Management
- Section 10: Catchup Behavior
PART 3: OPERATIONS & DEPLOYMENT
- Section 11: Configuration
- Section 12: Version Comparison
- Section 13: Best Practices
- Section 14: Connections & Secrets
- Summary & Recommendations
SECTION 1: INTRODUCTION
What is Apache Airflow?
Apache Airflow is a workflow orchestration platform that allows you to define, schedule, and monitor complex data pipelines as code using Python to define Directed Acyclic Graphs (DAGs).
Key Benefits
- Programmatic workflows: Define complex pipelines in Python with full control
- Scalable: From single-machine setups to distributed Kubernetes deployments
- Extensible: Create custom operators, sensors, and hooks without modifying core
- Rich monitoring: Web UI with real-time pipeline visibility and control
- Flexible scheduling: Time-based, event-driven, or custom business logic
- Reliable: Built-in retry logic, SLAs, and comprehensive failure handling
- Community-driven: Active ecosystem with hundreds of provider integrations
Who Uses Airflow?
- Data engineers building ETL/ELT pipelines
- ML engineers orchestrating training workflows
- DevOps teams managing infrastructure automation
- Analytics teams automating reports
- Any organization with complex scheduled workflows
SECTION 2: ARCHITECTURE
Core Components
Web Server
- Provides REST API and web UI (port 8080 by default)
- Allows manual DAG triggering, viewing logs, managing connections
- Stateless (multiple instances can be deployed)
Scheduler
- Monitors all DAGs and determines when tasks should execute
- Re-parses DAG files continuously (roughly every 30 seconds per file by default; the DAGs folder is scanned for new files every 5 minutes)
- Handles task dependency resolution and sequencing
- Multiple scheduler instances can run in parallel for high availability (supported since Airflow 2.0)
Executor
- Determines HOW tasks are physically executed
- Options:
- LocalExecutor: Single machine (~10-50 parallel tasks)
- CeleryExecutor: Distributed across worker nodes (100s of tasks)
- KubernetesExecutor: Each task runs in separate pod (elastic scaling)
Metadata Database
- Central repository for DAG definitions, task states, variables, connections
- PostgreSQL 12+ or MySQL 8+ in production
- SQLite only for development
- Requires daily backups
Message Broker (for distributed setups)
- Redis or RabbitMQ
- Queues tasks from Scheduler to Workers
- Required only for CeleryExecutor
Triggerer Service (Airflow 2.2+)
- Manages all deferrable/async tasks
- Lightweight asyncio-based service (multiple instances can be run for high availability)
- Can handle 1000+ concurrent deferred tasks
- Fires triggers when conditions are met
SECTION 3: DAG & BUILDING BLOCKS
What is a DAG?
A DAG (Directed Acyclic Graph) is the fundamental unit in Airflow:
- Directed: Tasks execute in a specific direction (A → B → C)
- Acyclic: No circular dependencies allowed
- Graph: Visual representation of tasks and relationships
Key DAG Properties
| Property | Purpose | Example |
|---|---|---|
| `dag_id` | Unique identifier | `'daily_etl'` |
| `start_date` | First execution date | `datetime(2024, 1, 1)` |
| `schedule` | When to run | `'@daily'`, a Timetable, a `Dataset` |
| `max_active_runs` | Concurrent runs allowed | 1-5 |
| `default_args` | Task defaults | retries, timeout, owner |
| `description` | What the DAG does | "Daily customer ETL" |
| `tags` | Categorization | `['etl', 'daily']` |
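For concreteness, here is a minimal sketch of a DAG that uses the properties above; the identifiers and values are illustrative, and the `schedule` parameter assumes Airflow 2.4+.

```python
from datetime import datetime, timedelta

from airflow import DAG

with DAG(
    dag_id="daily_etl",                      # unique identifier
    start_date=datetime(2024, 1, 1),         # first logical date
    schedule="@daily",                       # when to run
    catchup=False,
    max_active_runs=1,                       # concurrent runs allowed
    default_args={                           # defaults inherited by every task
        "owner": "data-team",
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
    },
    description="Daily customer ETL",
    tags=["etl", "daily"],
) as dag:
    ...  # tasks are defined here
```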
Operators (Task Types)
Operators are classes that execute actual work. Each is a reusable task type.
| Operator | Purpose | Use Case |
|---|---|---|
| PythonOperator | Execute Python function | Data processing, API calls |
| BashOperator | Run bash commands | Scripts, system commands |
| SQLExecuteQueryOperator | Execute SQL query | Database transformations |
| EmailOperator | Send emails | Notifications, reports |
| S3 operators (provider package) | S3 file operations | Data movement, management |
| HttpOperator | Make HTTP requests | External service integration |
| BranchPythonOperator | Conditional branching | Different execution paths |
| EmptyOperator (formerly DummyOperator) | Placeholder/no-op | Flow control |
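A short sketch of two of these operators wired together (assumed to sit inside a `with DAG(...)` block; the task names and callable are illustrative):

```python
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform_orders():
    # placeholder for real transformation logic
    print("transforming orders...")

extract = BashOperator(
    task_id="extract",
    bash_command="echo 'pulling source files'",
)

transform = PythonOperator(
    task_id="transform",
    python_callable=transform_orders,
)

extract >> transform  # extract runs before transform
```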
Sensors (Waiting for Conditions)
Sensors are operators that wait for something to happen. They poll a condition until it becomes true, blocking further execution.
Common Sensors
| Sensor | Waits For | Use Case |
|---|---|---|
| FileSensor | File creation on disk | Wait for uploaded data |
| S3KeySensor | S3 object existence | Wait for data in S3 |
| SqlSensor | SQL query condition | Wait for record count |
| TimeSensor | Specific time of day | Wait until 2 PM |
| ExternalTaskSensor | External DAG task completion | Cross-DAG dependency |
| HttpSensor | HTTP endpoint response | Wait for API ready |
| TimeDeltaSensor | Time duration elapsed | Wait 30 minutes |
Sensor Modes
- Poke Mode (default): Worker slot is held while the sensor repeatedly checks the condition
- Reschedule Mode: Worker slot is released between checks; the task is rescheduled for the next poke
- Smart Sensors: Batched checking for efficiency (deprecated in 2.2 and removed in 2.4 in favor of deferrable operators)
⚠️ Performance Note: Many idle sensors waste resources. Use deferrable operators (the Async variants or `deferrable=True`) for waits longer than 5 minutes.
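A minimal sketch of a sensor running in reschedule mode; the file path and timings are placeholders:

```python
from airflow.sensors.filesystem import FileSensor

wait_for_file = FileSensor(
    task_id="wait_for_upload",
    filepath="/data/incoming/orders.csv",  # placeholder path
    poke_interval=300,                     # check every 5 minutes
    timeout=60 * 60 * 6,                   # give up after 6 hours
    mode="reschedule",                     # release the worker slot between pokes
)
```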
Hooks (Connection Abstractions)
Hooks encapsulate connection and authentication logic to external systems. They're the reusable layer between operators and external services.
| Hook | Connects To | Purpose |
|---|---|---|
| PostgresHook | PostgreSQL | Query execution |
| MySqlHook | MySQL | Database operations |
| S3Hook | AWS S3 | File operations |
| HttpHook | REST APIs | HTTP requests |
| SlackHook | Slack | Send messages |
| SSHHook | Remote servers | Execute commands |
Benefits:
- Centralize credential management
- Reusable across multiple tasks
- No credentials in DAG code
- Connection pooling support
- Error handling abstraction
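As a sketch of that layering, a task callable can use `S3Hook` so the AWS credentials live in the `aws_default` connection rather than in the DAG file; the bucket and key below are placeholders:

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def archive_report(**context):
    # Credentials are resolved from the "aws_default" Airflow connection
    hook = S3Hook(aws_conn_id="aws_default")
    hook.load_string(
        string_data="report generated",
        key="reports/latest.txt",           # placeholder object key
        bucket_name="my-analytics-bucket",  # placeholder bucket
        replace=True,
    )
```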
Tasks
A task is an instance of an operator within a DAG: the smallest unit of work.
| Setting | Purpose | Example |
|---|---|---|
| `task_id` | Unique identifier | `'extract_customers'` |
| `retries` | Retry attempts on failure | 2-3 |
| `retry_delay` | Wait between retries | `timedelta(minutes=5)` |
| `execution_timeout` | Maximum execution time | `timedelta(hours=1)` |
| `pool` | Concurrency control | `'default_pool'` |
| `priority_weight` | Execution priority | 1-100 (higher = more urgent) |
| `trigger_rule` | When to execute | `'all_success'` (default) |
XCom (Cross-Communication)
XCom allows tasks to pass data to each other (automatically serialized to JSON).
# Push (inside a task; task_instance comes from the task's execution context)
task_instance.xcom_push(key='count', value=1000)
# Pull (in a downstream task)
count = task_instance.xcom_pull(task_ids='upstream', key='count')
Limitation: Intended for metadata only, not large datasets (causes database bloat).
SECTION 4: TASKFLOW API
What is TaskFlow API?
TaskFlow API (Airflow 2.0+) is the modern, Pythonic way to write DAGs using decorators instead of verbose operator instantiation.
Comparison: Traditional vs TaskFlow
| Aspect | Traditional | TaskFlow |
|---|---|---|
| Syntax | Verbose operator creation | Clean function decorators |
| Data Passing | Manual XCom push/pull | Implicit via return values |
| Dependencies | Explicit `>>` operators | Implicit via function parameters |
| IDE Support | Basic | Excellent (autocomplete) |
| Code Length | More lines, repetitive | Fewer lines, concise |
Key Decorators
@dag
- Decorates the function that contains the tasks
- Function body defines the workflow
- Returns a DAG instance when called
@task
- Decorates function that becomes a task
- Function is task logic
- Return value becomes XCom value
- Parameters are upstream dependencies
@task_group
- Groups related tasks together
- Creates visual subgraph in UI
- Simplifies large DAGs
.expand() on a @task (Dynamic Task Mapping - Airflow 2.3+)
- Creates task for each item in list
- Reduces repetitive task creation
- Dynamic fan-out pattern
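A minimal sketch of these decorators in use (DAG and task names are illustrative; the `schedule` parameter assumes Airflow 2.4+). The return value of `extract()` becomes an XCom and passing it to `load()` creates the dependency:

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def order_pipeline():

    @task
    def extract() -> list[int]:
        return [1, 2, 3]                      # pushed to XCom automatically

    @task
    def load(order_ids: list[int]) -> None:
        print(f"loading {len(order_ids)} orders")

    load(extract())                           # extract >> load, with data passing

order_pipeline()
```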
When to Use TaskFlow
Use for:
- New DAGs (recommended)
- Complex dependencies
- Python-savvy teams
- Modern deployments (2.0+)
Alternative if:
- Legacy Airflow 1.10 requirement
- Using operators not available as tasks
SECTION 5: DEFERRABLE OPERATORS & TRIGGERS
The Problem: Traditional Sensors Waste Resources
Traditional Approach (Blocking)
- Sensor holds worker slot while waiting
- Worker CPU: ~5-10% even during idle
- Memory: ~50MB per idle task
- Can handle: ~50-100 concurrent waits
- Issue: Massive waste for long waits (hours/days)
Example: 1000 tasks waiting 1 hour = 1000 worker slots held continuously!
The Solution: Deferrable Operators
Deferrable Approach (Non-blocking)
- Task releases worker slot immediately after deferring
- Minimal CPU usage: ~0.1% during wait
- Memory: ~1MB per deferred task
- Can handle: 1000+ concurrent waits
- Savings: ~95% resource reduction
How Deferrable Operators Work
4-Phase Execution:
1. Execution Phase: Task runs its initial logic, then defers to a trigger
   - Worker slot RELEASED
2. Deferred Phase: Trigger monitors the event asynchronously
   - Handled by the Triggerer service
   - Very lightweight
   - No worker slot needed
3. Completion Phase: Event occurs
   - Trigger detects the condition is met
   - Returns event data
4. Resume Phase: Task resumes on a new worker slot
   - Task completes with the trigger result
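A sketch of that defer/resume cycle with a custom operator: `DateTimeTrigger` ships with Airflow, while the operator itself is purely illustrative.

```python
from datetime import timedelta

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.temporal import DateTimeTrigger
from airflow.utils import timezone

class WaitOneHourOperator(BaseOperator):
    def execute(self, context):
        # Phase 1: hand off to a trigger and release the worker slot
        self.defer(
            trigger=DateTimeTrigger(moment=timezone.utcnow() + timedelta(hours=1)),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Phase 4: resumes on a fresh worker slot once the trigger fires
        self.log.info("Wait finished, event=%s", event)
```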
Available Deferrable Operators
| Operator | Defers For | Best For |
|---|---|---|
| TimeSensorAsync | Specific time | Waiting until business hours |
| S3KeySensorAsync | S3 object | Long-running file uploads |
| ExternalTaskSensorAsync | External DAG | Cross-DAG long waits |
| SqlSensorAsync | SQL condition | Database state changes |
| HttpSensorAsync | HTTP response | External service availability |
When to Use Deferrable
Use when:
- Wait duration > 5 minutes
- Many concurrent waits
- Resource-constrained environment
- Can tolerate event latency (<1 second)
Don't use when:
- Wait < 5 minutes (overhead not worth it)
- Real-time detection critical (sub-second)
- Simple local operations
Performance Impact
1000 Tasks Waiting 1 Hour
| Approach | Cost |
|---|---|
| Traditional | 1000 worker slots × 1 hour = 1000 slot-hours wasted |
| Deferrable | 0 worker slots + minimal CPU = 95% savings |
SECTION 6: DEPENDENCIES & RELATIONSHIPS
Task Dependencies
Linear Chain: A → B → C → D
- Sequential execution
- Best for: ETL pipelines with stages
Fan-out / Fan-in: One task triggers many, then merge
            ┌─ Process A ─┐
Extract ────┤             ├──── Merge
            └─ Process B ─┘
- Parallel processing
- Best for: Multiple independent operations
Branching: Conditional execution
         ┌─ Success path
Branch ──┤
         └─ Error path
- Different paths based on condition
Trigger Rules
Determine WHEN downstream tasks execute based on UPSTREAM status.
| Rule | Executes When | Use Case |
|---|---|---|
| `all_success` | All upstream succeed | Default, sequential flow |
| `all_failed` | All upstream fail | Cleanup after failures |
| `all_done` | All upstream complete (any status) | Final reporting |
| `one_success` | At least one upstream succeeds | Alternative paths |
| `one_failed` | At least one upstream fails | Error alerts |
| `none_failed` | No upstream failures (skips OK) | Robust pipelines (recommended) |
| `none_failed_min_one_success` (formerly `none_failed_or_skipped`) | No failures and at least one success | Strict validation |
Best Practice: Use none_failed for fault-tolerant pipelines.
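As a sketch of that recommendation, a final reporting task that runs as long as nothing upstream failed, even if some branches were skipped (the callable is illustrative):

```python
from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

def send_report():
    print("pipeline finished")  # placeholder for real reporting logic

report = PythonOperator(
    task_id="send_report",
    python_callable=send_report,
    trigger_rule=TriggerRule.NONE_FAILED,  # equivalent to the string "none_failed"
)
```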
DAG-to-DAG Dependencies
| Method | Mechanism | Best For |
|---|---|---|
| ExternalTaskSensor | Poll for completion | Different schedules |
| TriggerDagRunOperator | Actively trigger | Orchestrate workflows |
| Datasets | Event-driven (2.4+) | Data pipelines |
Datasets (Airflow 2.4+)
- Producer DAG updates dataset
- Consumer DAG triggered automatically
- Real-time responsiveness: <1 second
- Perfect for data-driven workflows
Example:
- ETL DAG extracts data and updates `orders_dataset`
- Analytics DAG is triggered automatically when the dataset updates
- No waiting, real-time processing
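A minimal producer/consumer sketch of that pattern (assumes Airflow 2.4+; the dataset URI and DAG names are illustrative, and the URI is just an identifier Airflow tracks, not a validated storage path):

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

orders_dataset = Dataset("s3://analytics/orders.parquet")  # placeholder URI

@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def etl_producer():
    @task(outlets=[orders_dataset])
    def extract_orders():
        ...  # write the data; Airflow records a dataset update on success

    extract_orders()

@dag(schedule=[orders_dataset], start_date=datetime(2024, 1, 1), catchup=False)
def analytics_consumer():
    @task
    def refresh_dashboard():
        ...  # runs as soon as the producer updates the dataset

    refresh_dashboard()

etl_producer()
analytics_consumer()
```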
SECTION 7: SCHEDULING
Three Scheduling Methods
- Time-based: Specific times/intervals
- Event-driven: When data arrives/updates
- Manual: User-triggered via UI
Time-Based: Cron Expressions
Legacy method using cron format: minute hour day month day_of_week
Common Patterns
- `@hourly`: Every hour
- `@daily`: Every day at midnight
- `@weekly`: Every Sunday at midnight
- `@monthly`: First day of month at midnight
- `0 9 * * MON-FRI`: 9 AM on weekdays
- `*/15 * * * *`: Every 15 minutes
Limitations: Simple time-only scheduling, no business logic
Time-Based: Timetable API (Modern)
Object-oriented scheduling (Airflow 2.2+) with far more flexibility than plain cron strings.
Built-in Timetables
- `CronDataIntervalTimetable`: Cron-based with data intervals (what a cron `schedule_interval` maps to)
- `CronTriggerTimetable`: Cron-based without data intervals
- `DeltaDataIntervalTimetable`: Fixed `timedelta` intervals
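A minimal sketch of attaching a built-in timetable to a DAG (assumes a recent 2.x release where `CronTriggerTimetable` and the `schedule` parameter are available; the DAG id and cadence are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.timetables.trigger import CronTriggerTimetable

with DAG(
    dag_id="weekday_reporting",
    start_date=datetime(2024, 1, 1),
    # 9 AM UTC on weekdays, expressed as a timetable object instead of a cron string
    schedule=CronTriggerTimetable("0 9 * * MON-FRI", timezone="UTC"),
    catchup=False,
):
    ...
```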
Custom Timetables Allow
- Business day scheduling (Mon-Fri)
- Holiday-aware schedules
- Timezone-specific runs
- Complex business logic
- End-of-month runs
- Business hour runs (9 AM - 5 PM)
Event-Driven: Datasets (Airflow 2.4+)
Enable data-driven workflows instead of time-driven.
How it Works
- Producer task completes and updates dataset
- Scheduler detects dataset update
- Consumer DAG triggered immediately
- No scheduled waiting
Benefits
- Real-time responsiveness (<1 second)
- Resource efficient (no idle waits)
- Data-quality focused
- Perfect for data pipelines
Hybrid Scheduling
Combine time-based and event-driven:
- DAG runs on schedule OR when data arrives
- Whichever comes first
- Best of both worlds
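A hedged sketch of hybrid scheduling, assuming a newer 2.x release (2.9+) where `DatasetOrTimeSchedule` is available; if you are on an older version, check your release notes before relying on it. The DAG runs on the nightly cron or as soon as the dataset updates, whichever comes first:

```python
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.timetables.datasets import DatasetOrTimeSchedule
from airflow.timetables.trigger import CronTriggerTimetable

orders_dataset = Dataset("s3://analytics/orders.parquet")  # placeholder URI

with DAG(
    dag_id="hybrid_schedule_example",
    start_date=datetime(2024, 1, 1),
    schedule=DatasetOrTimeSchedule(
        timetable=CronTriggerTimetable("0 2 * * *", timezone="UTC"),
        datasets=[orders_dataset],
    ),
    catchup=False,
):
    ...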
Scheduling Decision Tree
TIME-BASED?
├─ YES → Complex logic?
│        ├─ YES → Custom Timetable
│        └─ NO → Cron Expression
│
└─ NO → Event-driven?
         ├─ YES → Use Datasets
         └─ NO → Manual trigger, or reconsider the design
SECTION 8: FAILURE HANDLING
Retry Mechanisms
Allow tasks to recover from transient failures (network timeout, API rate limit, etc.).
Configuration
- `retries`: Number of retry attempts
- `retry_delay`: Wait time between retries
- `retry_exponential_backoff`: Exponentially increasing delays
- `max_retry_delay`: Cap on the maximum wait
Example Retry Sequence
Attempt 1: Fail → Wait 1 minute
Attempt 2: Fail → Wait 2 minutes (1 × 2^1)
Attempt 3: Fail → Wait 4 minutes (1 × 2^2)
Attempt 4: Fail → Wait 8 minutes (1 × 2^3)
Attempt 5: Fail → Wait 16 minutes (1 × 2^4)
Attempt 6: Fail → Wait 1 hour max (capped)
Attempt 7: Fail → Task FAILED
Best Practice: 2-3 retries with exponential backoff for most tasks.
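A sketch of those settings applied through `default_args` (the exact values are illustrative):

```python
from datetime import timedelta

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=1),
    "retry_exponential_backoff": True,    # 1 min, 2 min, 4 min, ...
    "max_retry_delay": timedelta(hours=1),  # cap the backoff
}
```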
Callbacks
Execute custom logic when task state changes:
- `on_failure_callback`: When the task fails
- `on_retry_callback`: When the task retries
- `on_success_callback`: When the task succeeds
Common Uses
- Send notifications (Slack, email, PagerDuty)
- Log to external monitoring systems
- Trigger incident management
- Update dashboards
- Cleanup resources
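A small sketch of a failure callback; `notify_slack` is a placeholder for whatever alerting client you actually use:

```python
def notify_slack(message: str) -> None:
    print(f"ALERT: {message}")  # placeholder for a real Slack/PagerDuty call

def on_failure(context):
    # The context dict carries the failing task instance
    ti = context["task_instance"]
    notify_slack(
        f"Task {ti.task_id} in DAG {ti.dag_id} failed (try {ti.try_number})."
    )

default_args = {
    "on_failure_callback": on_failure,
}
```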
Error Handling Patterns
Pattern 1: Idempotency
- Make tasks safe to retry
- Delete-then-insert instead of insert
- Same result regardless of retry count
Pattern 2: Circuit Breaker
- Stop retrying after N attempts
- Raise critical exceptions
- Alert operations team
- Prevent cascading failures
Pattern 3: Graceful Degradation
- Continue with partial data
- Log warnings
- Don't fail entire pipeline
Pattern 4: Skip on Condition
- Raise `AirflowSkipException`
- Skips the task without marking it as failed
- Allow downstream to continue
- Best for conditional workflows
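A sketch of Pattern 4; `fetch_new_records` and `load` are hypothetical helpers standing in for your own logic:

```python
from airflow.exceptions import AirflowSkipException

def process_if_data(**context):
    records = fetch_new_records()  # hypothetical helper
    if not records:
        # Task ends in the "skipped" state, downstream can still run
        raise AirflowSkipException("No new records; skipping this run")
    load(records)                  # hypothetical helper
```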
SECTION 9: SLA MANAGEMENT
What is SLA?
An SLA (Service Level Agreement) is an expectation that a task will complete within a specified time window after its DAG run starts. Missing the SLA does not stop the task; it triggers alerting and record-keeping.
If SLA is missed:
- Task marked as "SLA Missed"
- Notifications triggered
- Callbacks executed
- Logged for auditing/metrics
Setting SLAs
At DAG Level (all tasks inherit):
default_args = {
'sla': timedelta(hours=1)
}
At Task Level (specific to individual task):
task = PythonOperator(
    task_id='critical_task',
    python_callable=run_critical_step,  # required argument; any callable
    sla=timedelta(minutes=30)
)
Per Task Overrides: Task SLA overrides DAG default
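For automated notifications, an SLA-miss callback can be attached at the DAG level; a minimal sketch (the callback body is illustrative):

```python
from datetime import datetime, timedelta

from airflow import DAG

def sla_alert(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Called when the scheduler records missed SLAs for this DAG
    print(f"SLA missed in {dag.dag_id}: {task_list}")

dag = DAG(
    dag_id="daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    default_args={"sla": timedelta(hours=1)},
    sla_miss_callback=sla_alert,
)
```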
SLA Monitoring
Key Metrics
- SLA compliance rate (%)
- Average overdue duration
- Most violated tasks
- Impact on downstream
Alerts
- Send notifications when an SLA is at risk (e.g., at 80% of the allowed window)
- Escalate on missed SLA
- Create incidents for critical SLAs
Dashboard
- Track compliance trends
- Identify problematic tasks
- Alert history
- Remediation tracking
SLA Best Practices
| Practice | Benefit |
|---|---|
| Set realistic SLAs | Avoid false positives |
| Different tiers | Critical < Normal < Optional |
| Review quarterly | Account for growth |
| Document rationale | Team alignment |
| Escalation policy | Proper incident response |
| Callbacks | Automated notifications |
| Monitor compliance | Identify trends |
SECTION 10: CATCHUP BEHAVIOR
What is Catchup?
Catchup is automatic backfill of missed DAG runs.
Example Scenario
Start date: 2024-01-01
Today: 2024-05-15
Schedule: @daily (every day)
Catchup: True
RESULT: Airflow creates 135 DAG runs instantly!
(One for each day from Jan 1 to May 14)
Catchup Settings
| Setting | Behavior | When to Use |
|---|---|---|
| `catchup=True` | Backfill all missed runs | Historical data loading |
| `catchup=False` | No backfill, start fresh | New pipelines |
Catchup Risks
Potential Issues
- Resource exhaustion (CPU, memory, database)
- Network bandwidth spike
- Third-party API rate limits exceeded
- Database connection pool exhausted
- Worker node crashes
- Scheduler paralysis
Safe Catchup Strategy
Step 1: Use a Separate Backfill DAG
- Main DAG: `catchup=False` (normal operations)
- Backfill DAG: `catchup=True` with an `end_date` set (historical only)
- Run the backfill during a maintenance window
Step 2: Control Parallelism
- `max_active_runs`: Limit concurrent DAG runs (1-3)
- `max_active_tasks_per_dag`: Limit concurrent tasks (16-32)
- Prevents resource exhaustion
Step 3: Staged Approach
- Test with subset of dates first
- Monitor resource usage
- Gradually increase pace
- Have rollback plan
Step 4: Validation
- Verify data completeness
- Check for duplicates
- Validate transformations
- Compare with expected results
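A sketch combining Steps 1 and 2: a dedicated backfill DAG with catchup enabled, a bounded date range, and tight concurrency limits (the DAG id and dates are illustrative, and the `schedule` parameter assumes Airflow 2.4+):

```python
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="orders_backfill",
    start_date=datetime(2024, 1, 1),
    end_date=datetime(2024, 5, 14),   # bound the historical range
    schedule="@daily",
    catchup=True,                     # backfill every missed interval
    max_active_runs=2,                # only two historical runs at a time
    max_active_tasks=16,              # cap concurrent tasks for this DAG
):
    ...
```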
Catchup Checklist
Before enabling catchup:
- ✅ Database capacity available
- ✅ Worker resources adequate
- ✅ Source data exists for all periods
- ✅ Data schema hasn't changed
- ✅ No API rate limits will be exceeded
- ✅ Network bandwidth available
- ✅ Stakeholders informed
- ✅ Rollback procedure ready
SECTION 11: CONFIGURATION
Configuration Hierarchy
Airflow reads configuration in this order (first match wins):
- Environment variables (`AIRFLOW__SECTION__KEY` format)
- `airflow.cfg` file (main config)
- Built-in defaults (hardcoded)
Command-line flags affect only the command being run; they do not override the settings above.
Critical Settings
| Setting | Purpose | Production Value |
|---|---|---|
| `sql_alchemy_conn` | Metadata database | PostgreSQL connection string |
| `executor` | Task execution | CeleryExecutor or KubernetesExecutor |
| `parallelism` | Max tasks globally | 32-128 |
| `max_active_runs_per_dag` | Concurrent DAG runs | 1-3 |
| `max_active_tasks_per_dag` | Concurrent tasks per DAG | 16-32 |
| `remote_logging` | Centralized logs | True (e.g., to S3) |
| `email_backend` | Email provider | SMTP |
Database Configuration
Production Database
- PostgreSQL 12+ (recommended) or MySQL 8+
- Dedicated database server (not shared)
- Daily backups automated
- Connection pooling enabled (pool_size=5, max_overflow=10)
- Monitoring for slow queries
- Regular maintenance (VACUUM, ANALYZE)
Why Not SQLite
- Not concurrent-access safe
- Locks on writes
- Single machine only
- Not suitable for production
Executor Configuration
| Executor | Config | Best For |
|---|---|---|
| LocalExecutor | `[local_executor]` | Development, small deployments |
| CeleryExecutor | `[celery]` | Distributed, high volume |
| KubernetesExecutor | `[kubernetes]` | Cloud-native, elastic scaling |
SECTION 12: VERSION COMPARISON
Version Timeline
1.10.x (Deprecated, EOL June 2021)
  ↓ Major Redesign
2.0 (December 2020) - TaskFlow API introduced
  ↓
2.2 (October 2021) - Deferrable operators & custom timetables
  ↓
2.3 (April 2022) - Dynamic task mapping
  ↓
2.4 (September 2022) - Datasets, event-driven scheduling
  ↓
2.5+ (stable, production-ready)
  ↓
3.0 (Latest) - Performance focus, async improvements
Key Feature Comparison
| Feature | 1.10 | 2.x | 3.0 |
|---|---|---|---|
| TaskFlow API | ❌ | ✅ | ✅ |
| Deferrable Operators | ❌ | ✅ (2.2+) | ✅ |
| Datasets (Event-driven) | ❌ | ✅ (2.4+) | ✅ |
| Performance | Baseline | 3-5x faster | 5-6x faster |
| Python 3.10+ | ❌ | ✅ | ✅ Required |
| Full Async/Await | ❌ | Limited | ✅ Full |
Performance Improvements
| Metric | 1.10 | 2.5 | 3.0 |
|---|---|---|---|
| DAG parsing (1000 DAGs) | ~120s | ~45s | ~15s |
| Task scheduling latency | ~5-10s | ~2-3s | <1s |
| Memory (idle scheduler) | ~500MB | ~300MB | ~150MB |
| Web UI response | ~800ms | ~200ms | ~50ms |
Migration Path
From 1.10 to 2.x
- Python 3.6+ required
- Import paths changed (reorganized)
- Operators moved to providers package
- Test thoroughly in staging
- Gradual production rollout (1-2 weeks)
From 2.x to 3.0
- Most DAGs work without changes
- Drop support for Python < 3.10
- Update async patterns
- Leverage new performance features
- Simpler migration (mostly backward compatible)
Version Selection
| Scenario | Recommended |
|---|---|
| New projects | 2.5+ or 3.0 |
| Production (stable) | 2.4.x or 2.5.x |
| Enterprise | 2.5+ |
| Performance-critical | 3.0 |
| Learning/Development | 2.5+ or 3.0 |
Why Not Use 1.10?
- EOL (End of Life) as of 2021
- No deferrable operators
- No TaskFlow API
- No datasets
- Slow DAG parsing
- No modern Python support
Action: If you're on 1.10, plan migration to 2.5+ now.
SECTION 13: BEST PRACTICES
DAG Design DO's ✅
- Use TaskFlow API for modern code
- Keep tasks small and focused
- Name tasks descriptively
- Document complex logic
- Use type hints
- Add comprehensive logging
- Handle errors gracefully
- Test DAGs before production
DAG Design DON'Ts ❌
- Put heavy processing in DAG definition
- Create 1000s of tasks per DAG
- Hardcode credentials
- Ignore data validation
- Use DAG-level database connections
- Forget about idempotency
- Skip error handling
- Assume tasks always succeed
Task Design DO's ✅
- Make tasks idempotent (safe to retry)
- Set appropriate timeouts
- Implement retries with backoff
- Use pools for resource control
- Skip tasks on errors (when appropriate)
- Log important events
- Clean up resources
Task Design DON'Ts ❌
- Assume tasks always succeed
- Run long operations without timeout
- Hold resources unnecessarily
- Store large data in XCom
- Use blocking operations
- Ignore upstream failures
- Forget cleanup
Scheduling DO's ✅
- Match frequency to data needs
- Account for upstream dependencies
- Plan for failures
- Test schedule changes
- Document rationale
- Monitor adherence
Scheduling DON'Ts ❌
- Schedule too frequently
- Ignore resource constraints
- Forget about catchup implications
- Use ambiguous expressions
- Create tight couplings
- Skip monitoring
Deployment DO's ✅
- Use external secrets backend
- Store DAG code in Git
- Implement monitoring
- Set up disaster recovery
- Rotate credentials quarterly
- Use separate environments (dev, staging, prod)
- Document runbooks
Deployment DON'Ts ❌
- Store credentials in DAGs
- Deploy without monitoring
- Skip backups
- Use same credentials everywhere
- Forget version control
- Deploy without testing
SECTION 14: CONNECTIONS & SECRETS
What are Connections?
Connections store authentication credentials to external systems centrally. Instead of hardcoding database passwords or API keys in DAGs, Airflow stores these securely.
Components:
- Connection ID: Unique identifier (e.g., `postgres_default`)
- Connection Type: System type (postgres, aws, http, mysql, etc.)
- Host: Server address
- Login: Username
- Password: Secret credential
- Port: Service port
- Schema: Database or default path
- Extra: JSON for additional parameters
Why Use Connections?
- Credentials not in DAG code (won't be accidentally committed to git)
- Centralized credential management
- Easy credential rotation
- Different credentials per environment (dev, staging, prod)
- Audit trail of access
Connection Types
| Type | Used For | Example |
|---|---|---|
| postgres | PostgreSQL databases | Analytics warehouse |
| mysql | MySQL databases | Transactional data |
| snowflake | Snowflake warehouse | Cloud data platform |
| aws | AWS services (S3, Redshift, Lambda) | Cloud infrastructure |
| gcp | Google Cloud services | BigQuery, Cloud Storage |
| http | REST APIs | External services |
| ssh | Remote server access | Bastion hosts |
| slack | Slack messaging | Notifications |
| sftp | SFTP file transfer | Remote servers |
How to Add Connections
Method 1: Web UI (Easiest)
- Airflow UI → Admin → Connections → Create Connection
- Fill form with credentials
- Click Save
Method 2: Environment Variables
export AIRFLOW_CONN_POSTGRES_DEFAULT=postgresql://user:password@localhost:5432/mydb
export AIRFLOW_CONN_AWS_DEFAULT=aws://ACCESS_KEY:SECRET_KEY@
Method 3: Airflow CLI
airflow connections add postgres_prod \
--conn-type postgres \
--conn-host localhost \
--conn-login user \
--conn-password password
Using Connections in DAGs
Connections are accessed through Hooks (connection abstraction layer):
from airflow.providers.postgres.hooks.postgres import PostgresHook

def query_database(**context):
    hook = PostgresHook(postgres_conn_id='postgres_default')
    result = hook.get_records("SELECT * FROM users")
    return result
Benefits of Hooks:
- Encapsulate connection logic
- Reusable across tasks
- Handle connection pooling
- Error handling built-in
- No credentials in task code
What are Variables/Secrets?
Variables store non-sensitive configuration values.
Secrets store sensitive data (marked as secret in UI).
| Aspect | Variables | Secrets |
|---|---|---|
| Purpose | Configuration settings | Passwords, API keys |
| Visibility | Visible in UI | Masked/hidden in UI |
| Storage | Metadata database | External backend (Vault, etc.) |
| Example | batch_size, email | api_key, db_password |
How to Add Variables/Secrets
Method 1: Web UI
- Airflow UI → Admin → Variables → Create Variable
- Enter Key and Value
- Check "Is Secret" for sensitive data
- Click Save
Method 2: Environment Variables
export AIRFLOW_VAR_BATCH_SIZE=1000
export AIRFLOW_VAR_API_KEY=secret123
Method 3: Airflow CLI
airflow variables set batch_size 1000
airflow variables set api_key secret123
Using Variables in DAGs
from airflow.models import Variable
batch_size = Variable.get('batch_size', default_var=1000)
api_key = Variable.get('api_key')
is_production = Variable.get('is_production', default_var='false')
# With JSON parsing
config = Variable.get('config', deserialize_json=True)
Secrets Backend: Advanced Security
For production, use external secrets backends instead of storing in Airflow database.
| Backend | Security Level | Best For |
|---|---|---|
| Metadata DB | Low | Development only |
| AWS Secrets Manager | High | AWS deployments |
| HashiCorp Vault | Very High | Enterprise |
| Google Secret Manager | High | GCP deployments |
| Azure Key Vault | High | Azure deployments |
How it works:
- Configure Airflow to use external backend
- Airflow queries backend when credential needed
- Secrets never stored in database
- Automatic rotation supported
- Audit logging in backend
Security Best Practices
DO ✅
- Use external secrets backend in production
- Rotate credentials quarterly
- Use IAM roles instead of hardcoded keys
- Encrypt connections in transit (SSL/TLS)
- Encrypt secrets at rest (database encryption)
- Limit who can see connections (RBAC)
- Use short-lived credentials (temporary tokens)
- Audit access to secrets
- Never commit credentials to git
DON'T ❌
- Hardcode credentials in DAG code
- Store passwords in comments or docstrings
- Share credentials via Slack/Email
- Use default passwords
- Store credentials in config files in git
- Forget to revoke old credentials
- Use same credentials everywhere
- Log credential values
Testing Connections
Via Web UI:
- Navigate to Admin → Connections
- Click on connection
- Click "Test Connection"
Via CLI:
airflow connections test postgres_default
airflow connections test aws_default
Via Python:
from airflow.providers.postgres.hooks.postgres import PostgresHook

hook = PostgresHook(postgres_conn_id='postgres_default')
hook.get_conn()  # raises an error if the connection fails
SUMMARY & RECOMMENDATIONS
For New Projects Starting Today
- Choose Airflow 2.5+ or 3.0 (not 1.10)
- Use TaskFlow API (modern, clean code)
- Use Timetable API for scheduling (not cron strings)
- Use Deferrable operators for long waits (>5 min)
- Plan for scale from day one (don't start with LocalExecutor)
- Set up monitoring immediately (not later)
- Use event-driven scheduling (Datasets) where possible
- Store credentials externally (Secrets Manager, not hardcoded)
For Teams Upgrading from 1.10
- Plan 2-3 months for migration (not quick)
- Test thoroughly in staging environment
- Keep 1.10 backup for emergency rollback
- Update DAGs incrementally (not all at once)
- Monitor closely first month post-upgrade
- Document breaking changes for team
- Celebrate milestones (keep team motivated)
For Large Enterprise Deployments
- Use Kubernetes executor (scalability, isolation)
- Implement comprehensive monitoring (critical for troubleshooting)
- Set up disaster recovery (business continuity)
- Regular chaos engineering tests (find weak points)
- Capacity planning quarterly (predict growth)
- Performance optimization ongoing (never "done")
- Security audits regular (compliance, risk)
For Data Pipeline Use Cases
- Use event-driven scheduling (Datasets, not time-based)
- Implement data quality checks (validate early)
- Monitor SLAs closely (data freshness critical)
- Implement idempotent tasks (safe retries)
- Plan catchup carefully (historical data loads)
- Version control everything (DAGs, configurations)
- Use separate environments (dev, staging, prod)
KEY TAKEAWAYS
✅ Fundamentals: DAGs, operators, sensors, and the TaskFlow API are the foundation
✅ Scheduling: Choose the appropriate method (time-based, event-driven, or hybrid)
✅ Reliability: Retries, error handling, SLAs, and monitoring ensure success
✅ Scale: Deferrable operators and proper architecture enable massive workloads
✅ Security: Centralized credential management with external backends
✅ Versions: Upgrade from 1.10 to 2.5+ now (3.0 for greenfield projects)
SUCCESS FORMULA
Treat DAGs as code → Version control, code review, testing
Monitor continuously → You can't fix what you don't see
Document thoroughly → Future you and teammates will thank you
Automate responses → Reduce manual incident response
Iterate and improve → Perfect is the enemy of done
Airflow is Perfect For
- ✅ Scheduled batch processing (daily, hourly, etc.)
- ✅ Multi-stage ETL/ELT workflows
- ✅ Cross-system orchestration
- ✅ Data quality validation
- ✅ DevOps automation
- ✅ ML pipeline orchestration
- ✅ Report generation
Airflow is NOT For
- ❌ Real-time streaming (use Kafka, Spark Streaming)
- ❌ Replacing data warehouses (use Snowflake, BigQuery)
- ❌ Sub-second latency requirements
- ❌ Complex transactional systems
Final Wisdom
Success with Airflow depends on:
- Technical setup: Infrastructure, configuration, security
- Organizational practices: Code discipline, documentation, monitoring
- Team culture: Knowledge sharing, continuous learning, collaboration
Apache Airflow's strength lies in its flexibility, active community, and rich ecosystem. Whether you're building simple daily pipelines or complex multi-stage workflows orchestrating hundreds of systems, Airflow provides the tools, patterns, and reliability to succeed.
Start small. Learn thoroughly. Scale thoughtfully. Monitor obsessively.
[End of Tutorial]