Data Tech Bridge
Apache Airflow for Production: Essential Concepts Every Developer Should Know

From local development to enterprise-scale orchestration

πŸ“š Table of Contents

PART 1: FUNDAMENTALS

PART 2: ADVANCED CONCEPTS

PART 3: OPERATIONS & DEPLOYMENT


SECTION 1: INTRODUCTION

What is Apache Airflow?

Apache Airflow is a workflow orchestration platform for defining, scheduling, and monitoring complex data pipelines as code, written in Python as Directed Acyclic Graphs (DAGs).

Key Benefits

  • Programmatic workflows: Define complex pipelines in Python with full control
  • Scalable: From single-machine setups to distributed Kubernetes deployments
  • Extensible: Create custom operators, sensors, and hooks without modifying core
  • Rich monitoring: Web UI with real-time pipeline visibility and control
  • Flexible scheduling: Time-based, event-driven, or custom business logic
  • Reliable: Built-in retry logic, SLAs, and comprehensive failure handling
  • Community-driven: Active ecosystem with hundreds of provider integrations

Who Uses Airflow?

  • Data engineers building ETL/ELT pipelines
  • ML engineers orchestrating training workflows
  • DevOps teams managing infrastructure automation
  • Analytics teams automating reports
  • Any organization with complex scheduled workflows

↑ Back to Table of Contents


SECTION 2: ARCHITECTURE

Core Components

Web Server

  • Provides REST API and web UI (port 8080 by default)
  • Allows manual DAG triggering, viewing logs, managing connections
  • Stateless (multiple instances can be deployed)

Scheduler

  • Monitors all DAGs and determines when tasks should execute
  • Parses DAG files continuously (new files in the DAGs folder are picked up every 5 minutes by default)
  • Handles task dependency resolution and sequencing
  • Multiple scheduler instances can run for high availability (Airflow 2.0+)

Executor

  • Determines HOW tasks are physically executed
  • Options:
    • LocalExecutor: Single machine (~10-50 parallel tasks)
    • CeleryExecutor: Distributed across worker nodes (100s of tasks)
    • KubernetesExecutor: Each task runs in separate pod (elastic scaling)

Metadata Database

  • Central repository for DAG definitions, task states, variables, connections
  • PostgreSQL 12+ or MySQL 8+ in production
  • SQLite only for development
  • Requires daily backups

Message Broker (for distributed setups)

  • Redis or RabbitMQ
  • Queues tasks from Scheduler to Workers
  • Required only for CeleryExecutor

Triggerer Service (Airflow 2.2+)

  • Manages all deferrable/async tasks
  • Lightweight asyncio-based service; one or more instances can run per deployment
  • Can handle 1000+ concurrent deferred tasks
  • Fires triggers when conditions are met

↑ Back to Table of Contents


SECTION 3: DAG & BUILDING BLOCKS

What is a DAG?

A DAG (Directed Acyclic Graph) is the fundamental unit in Airflow:

  • Directed: Tasks execute in a specific direction (A → B → C)
  • Acyclic: No circular dependencies allowed
  • Graph: Visual representation of tasks and relationships

Key DAG Properties

| Property | Purpose | Example |
| --- | --- | --- |
| dag_id | Unique identifier | 'daily_etl' |
| start_date | First execution date | datetime(2024, 1, 1) |
| schedule | When to run | '@daily', a Timetable, or a list of Datasets |
| max_active_runs | Concurrent runs allowed | 1-5 |
| default_args | Task defaults | retries, timeout, owner |
| description | What the DAG does | "Daily customer ETL" |
| tags | Categorization | ['etl', 'daily'] |

Operators (Task Types)

Operators are classes that execute the actual work; each is a reusable task type. A minimal wiring example follows the table below.

| Operator | Purpose | Use Case |
| --- | --- | --- |
| PythonOperator | Execute a Python function | Data processing, API calls |
| BashOperator | Run bash commands | Scripts, system commands |
| SQL operators (e.g., SQLExecuteQueryOperator) | Execute SQL queries | Database transformations |
| EmailOperator | Send emails | Notifications, reports |
| S3 operators | S3 file operations | Data movement, management |
| HttpOperator | Make HTTP requests | External service integration |
| BranchPythonOperator | Conditional branching | Different execution paths |
| EmptyOperator (formerly DummyOperator) | Placeholder/no-op | Flow control |
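
To make this concrete, here is a minimal sketch (assuming Airflow 2.4+ for the schedule argument) wiring a BashOperator and a PythonOperator into a DAG; the DAG id, command, and callable are illustrative placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def print_row_count():
    # Placeholder for real processing logic
    print("processed 1000 rows")


with DAG(
    dag_id="operator_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'extracting data'",
    )
    process = PythonOperator(
        task_id="process",
        python_callable=print_row_count,
    )

    extract >> process  # run extract first, then process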

Sensors (Waiting for Conditions)

Sensors are operators that wait for something to happen. They poll a condition until it becomes true, blocking further execution.

Common Sensors

| Sensor | Waits For | Use Case |
| --- | --- | --- |
| FileSensor | File creation on disk | Wait for uploaded data |
| S3KeySensor | S3 object existence | Wait for data in S3 |
| SqlSensor | SQL query condition | Wait for a record count |
| TimeSensor | Specific time of day | Wait until 2 PM |
| ExternalTaskSensor | External DAG task completion | Cross-DAG dependency |
| HttpSensor | HTTP endpoint response | Wait for an API to be ready |
| TimeDeltaSensor | Time duration elapsed | Wait 30 minutes |

Sensor Modes

  • Poke Mode (default): Worker process repeatedly checks the condition and holds the worker slot the whole time
  • Reschedule Mode: Worker slot is released between checks; the task is rescheduled for the next poke
  • Smart Sensors: An earlier batched-checking feature, deprecated and removed in Airflow 2.4 in favor of deferrable operators

⚠️ Performance Note: Many idle sensors waste resources. Use deferrable operators (Async variants, or sensors with deferrable=True) for waits longer than about 5 minutes, as in the sketch below.
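
A hedged sketch of the same wait expressed two ways: a FileSensor in reschedule mode and a core deferrable sensor (TimeDeltaSensorAsync, Airflow 2.2+). The file path, intervals, and connection defaults are illustrative.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.filesystem import FileSensor
from airflow.sensors.time_delta import TimeDeltaSensorAsync

with DAG(
    dag_id="sensor_modes_demo",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    # Reschedule mode: the worker slot is released between checks
    # (uses the fs_default connection unless fs_conn_id is set)
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/data/incoming/orders.csv",
        poke_interval=300,        # check every 5 minutes
        timeout=6 * 60 * 60,      # give up after 6 hours
        mode="reschedule",
    )

    # Deferrable: the wait is handed off to the triggerer service entirely
    wait_30_minutes = TimeDeltaSensorAsync(
        task_id="wait_30_minutes",
        delta=timedelta(minutes=30),
    )

    wait_for_file >> wait_30_minutes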

Hooks (Connection Abstractions)

Hooks encapsulate connection and authentication logic to external systems. They're the reusable layer between operators and external services.

| Hook | Connects To | Purpose |
| --- | --- | --- |
| PostgresHook | PostgreSQL | Query execution |
| MySqlHook | MySQL | Database operations |
| S3Hook | AWS S3 | File operations |
| HttpHook | REST APIs | HTTP requests |
| SlackHook | Slack | Send messages |
| SSHHook | Remote servers | Execute commands |

Benefits:

  • Centralize credential management
  • Reusable across multiple tasks
  • No credentials in DAG code
  • Connection pooling support
  • Error handling abstraction

Tasks

A task is an instance of an operator within a DAGβ€”the smallest unit of work.

| Setting | Purpose | Example |
| --- | --- | --- |
| task_id | Unique identifier | 'extract_customers' |
| retries | Retry attempts on failure | 2-3 |
| retry_delay | Wait between retries | timedelta(minutes=5) |
| execution_timeout | Maximum execution time | timedelta(hours=1) |
| pool | Concurrency control | 'default_pool' |
| priority_weight | Execution priority | 1-100 (higher = more urgent) |
| trigger_rule | When to execute | 'all_success' (default) |

A sketch applying several of these settings follows.
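
A minimal sketch applying these settings to a single operator; the callable and pool name are illustrative, and note that the actual keyword for the timeout is execution_timeout.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_customers():
    print("extracting customers")


with DAG(
    dag_id="task_settings_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    PythonOperator(
        task_id="extract_customers",
        python_callable=extract_customers,
        retries=3,
        retry_delay=timedelta(minutes=5),
        execution_timeout=timedelta(hours=1),  # the "timeout" setting above
        pool="default_pool",
        priority_weight=10,
        trigger_rule="all_success",
    )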

XCom (Cross-Communication)

XCom allows tasks to pass data to each other (automatically serialized to JSON).

# Push
task_instance.xcom_push(key='count', value=1000)

# Pull
count = task_instance.xcom_pull(task_ids='upstream', key='count')

Limitation: Intended for metadata only, not large datasets (causes database bloat).

↑ Back to Table of Contents


SECTION 4: TASKFLOW API

What is TaskFlow API?

TaskFlow API (Airflow 2.0+) is the modern, Pythonic way to write DAGs using decorators instead of verbose operator instantiation.

Comparison: Traditional vs TaskFlow

| Aspect | Traditional | TaskFlow |
| --- | --- | --- |
| Syntax | Verbose operator creation | Clean function decorators |
| Data Passing | Manual XCom push/pull | Implicit via return values |
| Dependencies | Explicit >> operators | Implicit via parameters |
| IDE Support | Basic | Excellent (autocomplete) |
| Code Length | More lines, repetitive | Fewer lines, concise |

Key Decorators

@dag

  • Decorates function that contains tasks
  • Function body defines workflow
  • Returns DAG instance

@task

  • Decorates function that becomes a task
  • Function is task logic
  • Return value becomes XCom value
  • Parameters are upstream dependencies

@task_group

  • Groups related tasks together
  • Creates visual subgraph in UI
  • Simplifies large DAGs

.expand() (Dynamic Task Mapping - Airflow 2.3+)

  • Calling task.expand(arg=list) creates one mapped task instance per item
  • Reduces repetitive task creation
  • Dynamic fan-out pattern (shown in the sketch below)
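
A hedged TaskFlow sketch, assuming Airflow 2.4+ for the schedule argument: dependencies and XComs come from ordinary function calls and return values, and .expand() fans out over a list. Names and data are illustrative.

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def taskflow_etl():
    @task
    def extract():
        return [1, 2, 3]                 # return value is pushed to XCom

    @task
    def transform(values):
        return sum(values)               # parameter creates the dependency

    @task
    def load(total):
        print(f"loaded total={total}")

    @task
    def square(x):
        return x * x

    values = extract()
    load(transform(values))

    # Dynamic task mapping (Airflow 2.3+): one mapped task instance per item
    square.expand(x=values)


taskflow_etl()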

When to Use TaskFlow

Use for:

  • New DAGs (recommended)
  • Complex dependencies
  • Python-savvy teams
  • Modern deployments (2.0+)

Alternative if:

  • Legacy Airflow 1.10 requirement
  • Using operators not available as tasks

↑ Back to Table of Contents


SECTION 5: DEFERRABLE OPERATORS & TRIGGERS

The Problem: Traditional Sensors Waste Resources

Traditional Approach (Blocking)

  • Sensor holds worker slot while waiting
  • Worker CPU: ~5-10% even during idle
  • Memory: ~50MB per idle task
  • Can handle: ~50-100 concurrent waits
  • Issue: Massive waste for long waits (hours/days)

Example: 1000 tasks waiting 1 hour = 1000 worker slots held continuously!

The Solution: Deferrable Operators

Deferrable Approach (Non-blocking)

  • Task releases worker slot immediately after deferring
  • Minimal CPU usage: ~0.1% during wait
  • Memory: ~1MB per deferred task
  • Can handle: 1000+ concurrent waits
  • Savings: ~95% resource reduction

How Deferrable Operators Work

4-Phase Execution:

  1. Execution Phase: Task runs its initial logic, then defers to a trigger; the worker slot is released
  2. Deferred Phase: The trigger monitors the event asynchronously on the Triggerer service; very lightweight, no worker slot needed
  3. Completion Phase: The event occurs; the trigger detects the condition is met and returns event data
  4. Resume Phase: The task resumes on a new worker slot and completes with the trigger result

Available Deferrable Operators

| Operator | Defers For | Best For |
| --- | --- | --- |
| TimeSensorAsync / DateTimeSensorAsync | Specific time | Waiting until business hours |
| S3KeySensorAsync | S3 object | Long-running file uploads |
| ExternalTaskSensorAsync | External DAG | Cross-DAG long waits |
| SqlSensorAsync | SQL condition | Database state changes |
| HttpSensorAsync | HTTP response | External service availability |

Note: some of these Async classes ship in provider or add-on packages; newer provider releases usually expose the same behavior through a deferrable=True argument on the standard sensor, as in the sketch below.
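
A hedged sketch of the deferrable pattern: DateTimeSensorAsync is a core sensor (Airflow 2.2+), while the deferrable=True flag on S3KeySensor assumes a recent Amazon provider release. The bucket, key, and times are illustrative.

import pendulum

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.sensors.date_time import DateTimeSensorAsync

with DAG(
    dag_id="deferrable_demo",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
):
    # Wait until 06:00 UTC without holding a worker slot
    wait_until_6am = DateTimeSensorAsync(
        task_id="wait_until_6am",
        target_time="{{ data_interval_end.replace(hour=6) }}",
    )

    # Wait for a file in S3; the wait runs on the triggerer, not a worker
    wait_for_export = S3KeySensor(
        task_id="wait_for_export",
        bucket_name="my-data-bucket",
        bucket_key="exports/{{ ds }}/orders.parquet",
        deferrable=True,  # assumes a provider version that supports this flag
    )

    wait_until_6am >> wait_for_export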

When to Use Deferrable

Use when:

  • Wait duration > 5 minutes
  • Many concurrent waits
  • Resource-constrained environment
  • Can tolerate event latency (<1 second)

Don't use when:

  • Wait < 5 minutes (overhead not worth it)
  • Real-time detection critical (sub-second)
  • Simple local operations

Performance Impact

1000 Tasks Waiting 1 Hour

| Approach | Cost |
| --- | --- |
| Traditional | 1000 worker slots × 1 hour = 1000 slot-hours wasted |
| Deferrable | 0 worker slots held + minimal triggerer CPU = ~95% savings |

↑ Back to Table of Contents


SECTION 6: DEPENDENCIES & RELATIONSHIPS

Task Dependencies

Linear Chain: A β†’ B β†’ C β†’ D

  • Sequential execution
  • Best for: ETL pipelines with stages

Fan-out / Fan-in: One task triggers many, then merge

      β”œβ”€ Process A ─┐
Extract ─            β”œβ”€ Merge
      └─ Process B β”€β”˜
Enter fullscreen mode Exit fullscreen mode
  • Parallel processing
  • Best for: Multiple independent operations

Branching: Conditional execution

      β”Œβ”€ Success path
Branch ─
      └─ Error path
Enter fullscreen mode Exit fullscreen mode
  • Different paths based on condition
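
A hedged sketch of the fan-out/fan-in and branching shapes above, using EmptyOperator placeholders and the @task.branch decorator (Airflow 2.3+); task names and the branch condition are illustrative.

from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="dependency_shapes",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    extract = EmptyOperator(task_id="extract")
    process_a = EmptyOperator(task_id="process_a")
    process_b = EmptyOperator(task_id="process_b")
    merge = EmptyOperator(task_id="merge", trigger_rule="none_failed")

    # Fan-out / fan-in
    extract >> [process_a, process_b] >> merge

    # Branching: return the task_id (or list of task_ids) to follow
    @task.branch
    def choose_path():
        return "success_path"        # condition is illustrative

    success_path = EmptyOperator(task_id="success_path")
    error_path = EmptyOperator(task_id="error_path")

    merge >> choose_path() >> [success_path, error_path]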

Trigger Rules

Determine WHEN downstream tasks execute based on UPSTREAM status.

| Rule | Executes When | Use Case |
| --- | --- | --- |
| all_success | All upstream tasks succeed | Default, sequential flow |
| all_failed | All upstream tasks fail | Cleanup after failures |
| all_done | All upstream tasks complete (any status) | Final reporting |
| one_success | At least one upstream succeeds | Alternative paths |
| one_failed | At least one upstream fails | Error alerts |
| none_failed | No upstream failures (skips OK) | ⭐ Robust pipelines |
| none_failed_min_one_success (formerly none_failed_or_skipped) | No failures and at least one success | Stricter validation |

Best Practice: Use none_failed for fault-tolerant pipelines.
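
A minimal sketch of a trigger rule in use: a final reporting task that runs no matter how its upstream tasks finished. Task names are illustrative.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="trigger_rule_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    load_a = EmptyOperator(task_id="load_a")
    load_b = EmptyOperator(task_id="load_b")

    # Runs even if load_a or load_b failed or was skipped
    final_report = EmptyOperator(
        task_id="final_report",
        trigger_rule=TriggerRule.ALL_DONE,
    )

    [load_a, load_b] >> final_report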

DAG-to-DAG Dependencies

| Method | Mechanism | Best For |
| --- | --- | --- |
| ExternalTaskSensor | Poll for completion | Different schedules |
| TriggerDagRunOperator | Actively trigger | Orchestrated workflows |
| Datasets | Event-driven (2.4+) | Data pipelines |

Datasets (Airflow 2.4+)

  • Producer DAG updates dataset
  • Consumer DAG triggered automatically
  • Real-time responsiveness: <1 second
  • Perfect for data-driven workflows

Example:

  • ETL DAG extracts and updates orders_dataset
  • Analytics DAG triggered automatically when dataset updates
  • No waiting, real-time processing
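
A hedged sketch of that producer/consumer pattern (Airflow 2.4+); the dataset URI and task bodies are illustrative.

from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

orders_dataset = Dataset("s3://my-data-bucket/orders/")


@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def etl_producer():
    @task(outlets=[orders_dataset])
    def extract_orders():
        print("orders written; dataset marked as updated")

    extract_orders()


@dag(schedule=[orders_dataset], start_date=datetime(2024, 1, 1), catchup=False)
def analytics_consumer():
    @task
    def build_report():
        print("report rebuilt from fresh orders")

    build_report()


etl_producer()
analytics_consumer()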

↑ Back to Table of Contents


SECTION 7: SCHEDULING

Three Scheduling Methods

  1. Time-based: Specific times/intervals
  2. Event-driven: When data arrives/updates
  3. Manual: User-triggered via UI

Time-Based: Cron Expressions

Legacy method using cron format: minute hour day month day_of_week

Common Patterns

  • @hourly: Every hour
  • @daily: Every day at midnight
  • @weekly: Every Sunday at midnight
  • @monthly: First day of month
  • 0 9 * * MON-FRI: 9 AM on weekdays
  • */15 * * * *: Every 15 minutes

Limitations: Simple time-only scheduling, no business logic
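
A minimal sketch attaching one of the cron patterns above to a DAG; the DAG id is illustrative.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="weekday_morning_job",
    start_date=datetime(2024, 1, 1),
    schedule="0 9 * * MON-FRI",   # 9 AM on weekdays
    catchup=False,
):
    EmptyOperator(task_id="run_report")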

Time-Based: Timetable API (Modern)

Object-oriented scheduling (Airflow 2.2+) that can express arbitrary business logic.

Built-in Timetables

  • CronDataIntervalTimetable: What a cron string passed to schedule maps to (cron schedule with data intervals)
  • CronTriggerTimetable: Cron-based schedule without data intervals
  • DeltaDataIntervalTimetable: Fixed timedelta intervals
  • Note: the unified schedule parameter (2.4+) accepts a cron string, a timedelta, a Timetable instance, or a list of Datasets

Custom Timetables Allow

  • Business day scheduling (Mon-Fri)
  • Holiday-aware schedules
  • Timezone-specific runs
  • Complex business logic
  • End-of-month runs
  • Business hour runs (9 AM - 5 PM)

Event-Driven: Datasets (Airflow 2.4+)

Enable data-driven workflows instead of time-driven.

How it Works

  1. Producer task completes and updates dataset
  2. Scheduler detects dataset update
  3. Consumer DAG triggered immediately
  4. No scheduled waiting

Benefits

  • Real-time responsiveness (<1 second)
  • Resource efficient (no idle waits)
  • Data-quality focused
  • Perfect for data pipelines

Hybrid Scheduling

Combine time-based and event-driven (via DatasetOrTimeSchedule in Airflow 2.9+):

  • DAG runs on schedule OR when data arrives
  • Whichever comes first
  • Best of both worlds

Scheduling Decision Tree

TIME-BASED?
β”œβ”€ YES β†’ Complex logic?
β”‚        β”œβ”€ YES β†’ Custom Timetable
β”‚        └─ NO β†’ Cron Expression
β”‚
└─ NO β†’ Event-driven?
         β”œβ”€ YES β†’ Use Datasets
         └─ NO β†’ Manual trigger or reconsider

↑ Back to Table of Contents


SECTION 8: FAILURE HANDLING

Retry Mechanisms

Allow tasks to recover from transient failures (network timeout, API rate limit, etc.).

Configuration

  • retries: Number of retry attempts
  • retry_delay: Wait time between retries
  • retry_exponential_backoff: Exponentially increasing delays
  • max_retry_delay: Cap on maximum wait

Example Retry Sequence

Attempt 1: Fail β†’ Wait 1 minute
Attempt 2: Fail β†’ Wait 2 minutes (1 Γ— 2^1)
Attempt 3: Fail β†’ Wait 4 minutes (1 Γ— 2^2)
Attempt 4: Fail β†’ Wait 8 minutes (1 Γ— 2^3)
Attempt 5: Fail β†’ Wait 16 minutes (1 Γ— 2^4)
Attempt 6: Fail β†’ Wait 1 hour max (capped)
Attempt 7: Fail β†’ Task FAILED

Best Practice: 2-3 retries with exponential backoff for most tasks.
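
A minimal sketch of these retry settings applied through default_args; the values mirror the sequence above and are illustrative.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=1),
    "retry_exponential_backoff": True,   # 1 min, 2 min, 4 min, ...
    "max_retry_delay": timedelta(hours=1),
}

with DAG(
    dag_id="retry_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
):
    EmptyOperator(task_id="flaky_api_call")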

Callbacks

Execute custom logic when task state changes:

  • on_failure_callback: When task fails
  • on_retry_callback: When task retries
  • on_success_callback: When task succeeds

Common Uses

  • Send notifications (Slack, email, PagerDuty)
  • Log to external monitoring systems
  • Trigger incident management
  • Update dashboards
  • Cleanup resources
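
A minimal sketch of an on_failure_callback; the body just logs, but a real callback could post to Slack, PagerDuty, or a dashboard. Names are illustrative.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator


def notify_on_failure(context):
    # The callback receives the task's context dictionary
    ti = context["task_instance"]
    print(f"Task {ti.task_id} in DAG {ti.dag_id} failed for run {context['run_id']}")


with DAG(
    dag_id="callback_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"on_failure_callback": notify_on_failure},
):
    EmptyOperator(task_id="might_fail")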

Error Handling Patterns

Pattern 1: Idempotency

  • Make tasks safe to retry
  • Delete-then-insert instead of insert
  • Same result regardless of retry count

Pattern 2: Circuit Breaker

  • Stop retrying after N attempts
  • Raise critical exceptions
  • Alert operations team
  • Prevent cascading failures

Pattern 3: Graceful Degradation

  • Continue with partial data
  • Log warnings
  • Don't fail entire pipeline

Pattern 4: Skip on Condition

  • Raise AirflowSkipException
  • Skip task without marking as failure
  • Allow downstream to continue
  • Best for conditional workflows
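
A minimal sketch of the skip-on-condition pattern using AirflowSkipException; the weekend check is illustrative.

from datetime import datetime

from airflow.decorators import dag, task
from airflow.exceptions import AirflowSkipException
from airflow.operators.python import get_current_context


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def skip_demo():
    @task
    def process_if_weekday():
        logical_date = get_current_context()["logical_date"]
        if logical_date.weekday() >= 5:  # Saturday or Sunday
            # Mark the task as skipped, not failed; downstream tasks with
            # skip-tolerant trigger rules keep running
            raise AirflowSkipException("Weekend run, nothing to process")
        print("processing weekday data")

    process_if_weekday()


skip_demo()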

↑ Back to Table of Contents


SECTION 9: SLA MANAGEMENT

What is SLA?

SLA (Service Level Agreement) is a guarantee that a task will complete within a specified time window.

If SLA is missed:

  • Task marked as "SLA Missed"
  • Notifications triggered
  • Callbacks executed
  • Logged for auditing/metrics

Setting SLAs

At DAG Level (all tasks inherit):

from datetime import timedelta

default_args = {
    'sla': timedelta(hours=1)
}

At Task Level (specific to individual task):

task = PythonOperator(
    task_id='critical_task',
    python_callable=check_critical_data,  # illustrative callable
    sla=timedelta(minutes=30)
)

Per Task Overrides: Task SLA overrides DAG default

SLA Monitoring

Key Metrics

  • SLA compliance rate (%)
  • Average overdue duration
  • Most violated tasks
  • Impact on downstream

Alerts

  • Send notifications when a task approaches its SLA (e.g., at 80% of the window)
  • Escalate on missed SLA
  • Create incidents for critical SLAs

Dashboard

  • Track compliance trends
  • Identify problematic tasks
  • Alert history
  • Remediation tracking

SLA Best Practices

| Practice | Benefit |
| --- | --- |
| Set realistic SLAs | Avoid false positives |
| Different tiers | Critical < Normal < Optional |
| Review quarterly | Account for growth |
| Document rationale | Team alignment |
| Escalation policy | Proper incident response |
| Callbacks | Automated notifications |
| Monitor compliance | Identify trends |

↑ Back to Table of Contents


SECTION 10: CATCHUP BEHAVIOR

What is Catchup?

Catchup is automatic backfill of missed DAG runs.

Example Scenario

Start date:     2024-01-01
Today:          2024-05-15
Schedule:       @daily (every day)
Catchup:        True

RESULT: Airflow creates 135 DAG runs instantly!
(One for each day from Jan 1 to May 14)

Catchup Settings

| Setting | Behavior | When to Use |
| --- | --- | --- |
| catchup=True | Backfill all missed runs | Historical data loading |
| catchup=False | No backfill, start fresh | New pipelines |
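
A minimal sketch of the recommended day-to-day setup: catchup disabled and concurrency capped. The values are illustrative.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,          # do not backfill missed runs
    max_active_runs=1,      # at most one DAG run at a time
    max_active_tasks=16,    # cap concurrent tasks within this DAG
):
    EmptyOperator(task_id="extract")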

Catchup Risks

Potential Issues

  • Resource exhaustion (CPU, memory, database)
  • Network bandwidth spike
  • Third-party API rate limits exceeded
  • Database connection pool exhausted
  • Worker node crashes
  • Scheduler paralysis

Safe Catchup Strategy

Step 1: Use Separate Backfill DAG

  • Main DAG: catchup=False (normal operations)
  • Backfill DAG: catchup=True, end_date set (historical only)
  • Run backfill during maintenance window

Step 2: Control Parallelism

  • max_active_runs: Limit concurrent DAG runs (1-3)
  • max_active_tasks_per_dag: Limit concurrent tasks (16-32)
  • Prevents resource exhaustion

Step 3: Staged Approach

  • Test with subset of dates first
  • Monitor resource usage
  • Gradually increase pace
  • Have rollback plan

Step 4: Validation

  • Verify data completeness
  • Check for duplicates
  • Validate transformations
  • Compare with expected results

Catchup Checklist

Before enabling catchup:

  • ☐ Database capacity available
  • ☐ Worker resources adequate
  • ☐ Source data exists for all periods
  • ☐ Data schema hasn't changed
  • ☐ No API rate limits exceeded
  • ☐ Network bandwidth available
  • ☐ Stakeholders informed
  • ☐ Rollback procedure ready

↑ Back to Table of Contents


SECTION 11: CONFIGURATION

Configuration Hierarchy

Airflow reads configuration in this order (first match wins):

  1. Environment variables (AIRFLOW__{SECTION}__{KEY}, e.g. AIRFLOW__CORE__PARALLELISM)
  2. airflow.cfg file (main config)
  3. Built-in defaults

Command-line flags apply only to the command being run.

Critical Settings

| Setting | Purpose | Production Value |
| --- | --- | --- |
| sql_alchemy_conn | Metadata database | PostgreSQL connection |
| executor | Task execution | CeleryExecutor or KubernetesExecutor |
| parallelism | Max tasks globally | 32-128 |
| max_active_runs_per_dag | Concurrent DAG runs | 1-3 |
| max_active_tasks_per_dag | Concurrent tasks | 16-32 |
| remote_logging | Centralized logs | True (e.g., to S3) |
| email_backend | Email provider | SMTP |

Database Configuration

Production Database

  • PostgreSQL 12+ (recommended) or MySQL 8+
  • Dedicated database server (not shared)
  • Daily backups automated
  • Connection pooling enabled (pool_size=5, max_overflow=10)
  • Monitoring for slow queries
  • Regular maintenance (VACUUM, ANALYZE)

Why Not SQLite

  • Not concurrent-access safe
  • Locks on writes
  • Single machine only
  • Not suitable for production

Executor Configuration

| Executor | Config Section | Best For |
| --- | --- | --- |
| LocalExecutor | [local_executor] | Development |
| CeleryExecutor | [celery] | Distributed, high volume |
| KubernetesExecutor | [kubernetes] | Cloud-native |

↑ Back to Table of Contents


SECTION 12: VERSION COMPARISON

Version Timeline

1.10.x (Deprecated, EOL June 2021)
    ↓ Major Redesign
2.0 (December 2020) - TaskFlow API introduced
    ↓
2.1-2.2 (Improvements & Deferrable Operators)
    ↓
2.3 (April 2022) - Dynamic task mapping
    ↓
2.4 (September 2022) - Datasets, event-driven scheduling
    ↓
2.5+ (Stable, production-ready)
    ↓
3.0 (Latest) - Performance focus, async improvements

Key Feature Comparison

| Feature | 1.10 | 2.x | 3.0 |
| --- | --- | --- | --- |
| TaskFlow API | ❌ | ✅ | ✅ |
| Deferrable Operators | ❌ | ✅ (2.2+) | ✅ |
| Datasets (event-driven) | ❌ | ✅ (2.4+) | ✅ (renamed Assets) |
| Performance | Baseline | 3-5x faster | 5-6x faster |
| Python 3.10+ | ❌ | ✅ | ✅ |
| Full Async/Await | ❌ | Limited | ✅ |

Performance Improvements

| Metric | 1.10 | 2.5 | 3.0 |
| --- | --- | --- | --- |
| DAG parsing (1000 DAGs) | ~120s | ~45s | ~15s |
| Task scheduling latency | ~5-10s | ~2-3s | <1s |
| Memory (idle scheduler) | ~500MB | ~300MB | ~150MB |
| Web UI response | ~800ms | ~200ms | ~50ms |

Migration Path

From 1.10 to 2.x

  • Python 3.6+ required
  • Import paths changed (reorganized)
  • Operators moved to providers package
  • Test thoroughly in staging
  • Gradual production rollout (1-2 weeks)

From 2.x to 3.0

  • Most DAGs work without changes
  • Drop support for Python < 3.10
  • Update async patterns
  • Leverage new performance features
  • Simpler migration (mostly backward compatible)

Version Selection

| Scenario | Recommended |
| --- | --- |
| New projects | 2.5+ or 3.0 |
| Production (stable) | 2.4.x or 2.5.x |
| Enterprise | 2.5+ |
| Performance-critical | 3.0 |
| Learning/Development | 2.5+ or 3.0 |

Why Not Use 1.10?

  • EOL (End of Life) as of 2021
  • No deferrable operators
  • No TaskFlow API
  • No datasets
  • Slow DAG parsing
  • No modern Python support

Action: If you're on 1.10, plan migration to 2.5+ now.

↑ Back to Table of Contents


SECTION 13: BEST PRACTICES

DAG Design DO's βœ…

  • Use TaskFlow API for modern code
  • Keep tasks small and focused
  • Name tasks descriptively
  • Document complex logic
  • Use type hints
  • Add comprehensive logging
  • Handle errors gracefully
  • Test DAGs before production

DAG Design DON'Ts ❌

  • Put heavy processing in DAG definition
  • Create 1000s of tasks per DAG
  • Hardcode credentials
  • Ignore data validation
  • Use DAG-level database connections
  • Forget about idempotency
  • Skip error handling
  • Assume tasks always succeed

Task Design DO's βœ…

  • Make tasks idempotent (safe to retry)
  • Set appropriate timeouts
  • Implement retries with backoff
  • Use pools for resource control
  • Skip tasks on errors (when appropriate)
  • Log important events
  • Clean up resources

Task Design DON'Ts ❌

  • Assume tasks always succeed
  • Run long operations without timeout
  • Hold resources unnecessarily
  • Store large data in XCom
  • Use blocking operations
  • Ignore upstream failures
  • Forget cleanup

Scheduling DO's βœ…

  • Match frequency to data needs
  • Account for upstream dependencies
  • Plan for failures
  • Test schedule changes
  • Document rationale
  • Monitor adherence

Scheduling DON'Ts ❌

  • Schedule too frequently
  • Ignore resource constraints
  • Forget about catchup implications
  • Use ambiguous expressions
  • Create tight couplings
  • Skip monitoring

Deployment DO's βœ…

  • Use external secrets backend
  • Store DAG code in Git
  • Implement monitoring
  • Set up disaster recovery
  • Rotate credentials quarterly
  • Use separate environments (dev, staging, prod)
  • Document runbooks

Deployment DON'Ts ❌

  • Store credentials in DAGs
  • Deploy without monitoring
  • Skip backups
  • Use same credentials everywhere
  • Forget version control
  • Deploy without testing

↑ Back to Table of Contents


SECTION 14: CONNECTIONS & SECRETS

What are Connections?

Connections store authentication credentials to external systems centrally. Instead of hardcoding database passwords or API keys in DAGs, Airflow stores these securely.

Components:

  • Connection ID: Unique identifier (e.g., postgres_default)
  • Connection Type: System type (postgres, aws, http, mysql, etc.)
  • Host: Server address
  • Login: Username
  • Password: Secret credential
  • Port: Service port
  • Schema: Database or default path
  • Extra: JSON for additional parameters

Why Use Connections?

  • Credentials not in DAG code (won't be accidentally committed to git)
  • Centralized credential management
  • Easy credential rotation
  • Different credentials per environment (dev, staging, prod)
  • Audit trail of access

Connection Types

| Type | Used For | Example |
| --- | --- | --- |
| postgres | PostgreSQL databases | Analytics warehouse |
| mysql | MySQL databases | Transactional data |
| snowflake | Snowflake warehouse | Cloud data platform |
| aws | AWS services (S3, Redshift, Lambda) | Cloud infrastructure |
| gcp | Google Cloud services | BigQuery, Cloud Storage |
| http | REST APIs | External services |
| ssh | Remote server access | Bastion hosts |
| slack | Slack messaging | Notifications |
| sftp | SFTP file transfer | Remote servers |

How to Add Connections

Method 1: Web UI (Easiest)

  1. Airflow UI β†’ Admin β†’ Connections β†’ Create Connection
  2. Fill form with credentials
  3. Click Save

Method 2: Environment Variables

export AIRFLOW_CONN_POSTGRES_DEFAULT=postgresql://user:password@localhost:5432/mydb
export AIRFLOW_CONN_AWS_DEFAULT=aws://ACCESS_KEY:SECRET_KEY@

Method 3: Airflow CLI

airflow connections add postgres_prod \
    --conn-type postgres \
    --conn-host localhost \
    --conn-login user \
    --conn-password password

Using Connections in DAGs

Connections are accessed through Hooks (connection abstraction layer):

from airflow.providers.postgres.hooks.postgres import PostgresHook

def query_database(**context):
    hook = PostgresHook(postgres_conn_id='postgres_default')
    result = hook.get_records("SELECT * FROM users")
    return result

Benefits of Hooks:

  • Encapsulate connection logic
  • Reusable across tasks
  • Handle connection pooling
  • Error handling built-in
  • No credentials in task code

What are Variables/Secrets?

Variables store non-sensitive configuration values in the metadata database.
Secrets are sensitive values (passwords, API keys); they are masked in the UI and are best served from an external secrets backend.

| Aspect | Variables | Secrets |
| --- | --- | --- |
| Purpose | Configuration settings | Passwords, API keys |
| Visibility | Visible in UI | Masked/hidden in UI |
| Storage | Metadata database | External backend (Vault, etc.) |
| Example | batch_size, email | api_key, db_password |

How to Add Variables/Secrets

Method 1: Web UI

  • Airflow UI β†’ Admin β†’ Variables β†’ Create Variable
  • Enter Key and Value
  • Check "Is Secret" for sensitive data
  • Click Save

Method 2: Environment Variables

export AIRFLOW_VAR_BATCH_SIZE=1000
export AIRFLOW_VAR_API_KEY=secret123

Method 3: Airflow CLI

airflow variables set batch_size 1000
airflow variables set api_key secret123

Using Variables in DAGs

from airflow.models import Variable

batch_size = Variable.get('batch_size', default_var=1000)
api_key = Variable.get('api_key')
is_production = Variable.get('is_production', default_var='false')

# With JSON parsing
config = Variable.get('config', deserialize_json=True)

Secrets Backend: Advanced Security

For production, use external secrets backends instead of storing in Airflow database.

| Backend | Security Level | Best For |
| --- | --- | --- |
| Metadata DB | Low | Development only |
| AWS Secrets Manager | High | AWS deployments |
| HashiCorp Vault | Very High | Enterprise |
| Google Secret Manager | High | GCP deployments |
| Azure Key Vault | High | Azure deployments |

How it works:

  • Configure Airflow to use external backend
  • Airflow queries backend when credential needed
  • Secrets never stored in database
  • Automatic rotation supported
  • Audit logging in backend

Security Best Practices

DO βœ…

  • Use external secrets backend in production
  • Rotate credentials quarterly
  • Use IAM roles instead of hardcoded keys
  • Encrypt connections in transit (SSL/TLS)
  • Encrypt secrets at rest (database encryption)
  • Limit who can see connections (RBAC)
  • Use short-lived credentials (temporary tokens)
  • Audit access to secrets
  • Never commit credentials to git

DON'T ❌

  • Hardcode credentials in DAG code
  • Store passwords in comments or docstrings
  • Share credentials via Slack/Email
  • Use default passwords
  • Store credentials in config files in git
  • Forget to revoke old credentials
  • Use same credentials everywhere
  • Log credential values

Testing Connections

Via Web UI:

  • Navigate to Admin β†’ Connections
  • Click on connection
  • Click "Test Connection"

Via CLI:

airflow connections test postgres_default
airflow connections test aws_default

Via Python:

from airflow.providers.postgres.hooks.postgres import PostgresHook

hook = PostgresHook(postgres_conn_id='postgres_default')
hook.get_conn()  # Will raise error if connection fails

↑ Back to Table of Contents


SUMMARY & RECOMMENDATIONS

For New Projects Starting Today

  1. Choose Airflow 2.5+ or 3.0 (not 1.10)
  2. Use TaskFlow API (modern, clean code)
  3. Use Timetable API for scheduling (not cron strings)
  4. Use Deferrable operators for long waits (>5 min)
  5. Plan for scale from day one (don't start with LocalExecutor)
  6. Set up monitoring immediately (not later)
  7. Use event-driven scheduling (Datasets) where possible
  8. Store credentials externally (Secrets Manager, not hardcoded)

For Teams Upgrading from 1.10

  1. Plan 2-3 months for migration (not quick)
  2. Test thoroughly in staging environment
  3. Keep 1.10 backup for emergency rollback
  4. Update DAGs incrementally (not all at once)
  5. Monitor closely first month post-upgrade
  6. Document breaking changes for team
  7. Celebrate milestones (keep team motivated)

For Large Enterprise Deployments

  1. Use Kubernetes executor (scalability, isolation)
  2. Implement comprehensive monitoring (critical for troubleshooting)
  3. Set up disaster recovery (business continuity)
  4. Regular chaos engineering tests (find weak points)
  5. Capacity planning quarterly (predict growth)
  6. Performance optimization ongoing (never "done")
  7. Security audits regular (compliance, risk)

For Data Pipeline Use Cases

  1. Use event-driven scheduling (Datasets, not time-based)
  2. Implement data quality checks (validate early)
  3. Monitor SLAs closely (data freshness critical)
  4. Implement idempotent tasks (safe retries)
  5. Plan catchup carefully (historical data loads)
  6. Version control everything (DAGs, configurations)
  7. Use separate environments (dev, staging, prod)

KEY TAKEAWAYS

βœ… Fundamentals: DAGs, operators, sensors, and TaskFlow API are the foundation

βœ… Scheduling: Choose appropriate method (time-based, event-driven, or hybrid)

βœ… Reliability: Retries, error handling, SLAs, and monitoring ensure success

βœ… Scale: Deferrable operators and proper architecture enable massive workloads

βœ… Security: Centralized credential management with external backends

βœ… Versions: Upgrade from 1.10 to 2.5+ now (3.0 for greenfield projects)


SUCCESS FORMULA

Treat DAGs as code β†’ Version control, code review, testing
Monitor continuously β†’ You can't fix what you don't see
Document thoroughly β†’ Future you and teammates will thank you
Automate responses β†’ Reduce manual incident response
Iterate and improve β†’ Perfect is enemy of done


Airflow is Perfect For

  • βœ… Scheduled batch processing (daily, hourly, etc.)
  • βœ… Multi-stage ETL/ELT workflows
  • βœ… Cross-system orchestration
  • βœ… Data quality validation
  • βœ… DevOps automation
  • βœ… ML pipeline orchestration
  • βœ… Report generation

Airflow is NOT For

  • ❌ Real-time streaming (use Kafka, Spark Streaming)
  • ❌ Replace data warehouses (use Snowflake, BigQuery)
  • ❌ Sub-second latency requirements
  • ❌ Complex transactional systems

Final Wisdom

Success with Airflow depends on:

  • Technical setup: Infrastructure, configuration, security
  • Organizational practices: Code discipline, documentation, monitoring
  • Team culture: Knowledge sharing, continuous learning, collaboration

Apache Airflow's strength lies in its flexibility, active community, and rich ecosystem. Whether you're building simple daily pipelines or complex multi-stage workflows orchestrating hundreds of systems, Airflow provides the tools, patterns, and reliability to succeed.

Start small. Learn thoroughly. Scale thoughtfully. Monitor obsessively.


[End of Tutorial]
