Data Tech Bridge
Apache Airflow for Production: Essential Concepts Every Developer Should Know

From local development to enterprise-scale orchestration

πŸ“š Table of Contents

PART 1: FUNDAMENTALS

PART 2: ADVANCED CONCEPTS

PART 3: OPERATIONS & DEPLOYMENT


SECTION 1: INTRODUCTION

What is Apache Airflow?

Apache Airflow is a workflow orchestration platform for defining, scheduling, and monitoring complex data pipelines as code, written in Python as Directed Acyclic Graphs (DAGs).

Key Benefits

  • Programmatic workflows: Define complex pipelines in Python with full control
  • Scalable: From single-machine setups to distributed Kubernetes deployments
  • Extensible: Create custom operators, sensors, and hooks without modifying core
  • Rich monitoring: Web UI with real-time pipeline visibility and control
  • Flexible scheduling: Time-based, event-driven, or custom business logic
  • Reliable: Built-in retry logic, SLAs, and comprehensive failure handling
  • Community-driven: Active ecosystem with hundreds of provider integrations

Who Uses Airflow?

  • Data engineers building ETL/ELT pipelines
  • ML engineers orchestrating training workflows
  • DevOps teams managing infrastructure automation
  • Analytics teams automating reports
  • Any organization with complex scheduled workflows

↑ Back to Table of Contents


SECTION 2: ARCHITECTURE

Core Components

Web Server

  • Provides REST API and web UI (port 8080 by default)
  • Allows manual DAG triggering, viewing logs, managing connections
  • Stateless (multiple instances can be deployed)

Scheduler

  • Monitors all DAGs and determines when tasks should execute
  • Parses DAG files continuously (new files in the DAGs folder are picked up every 5 minutes by default)
  • Handles task dependency resolution and sequencing
  • Multiple scheduler instances can run for high availability (Airflow 2.0+)

Executor

  • Determines HOW tasks are physically executed
  • Options:
    • LocalExecutor: Single machine (~10-50 parallel tasks)
    • CeleryExecutor: Distributed across worker nodes (100s of tasks)
    • KubernetesExecutor: Each task runs in separate pod (elastic scaling)

Metadata Database

  • Central repository for DAG definitions, task states, variables, connections
  • PostgreSQL 12+ or MySQL 8+ in production
  • SQLite only for development
  • Requires daily backups

Message Broker (for distributed setups)

  • Redis or RabbitMQ
  • Queues tasks from Scheduler to Workers
  • Required only for CeleryExecutor

Triggerer Service (Airflow 2.2+)

  • Manages all deferrable/async tasks
  • Lightweight asyncio-based service; one or more instances can run per deployment
  • Can handle 1000+ concurrent deferred tasks
  • Fires triggers when conditions are met

↑ Back to Table of Contents


SECTION 3: DAG & BUILDING BLOCKS

What is a DAG?

A DAG (Directed Acyclic Graph) is the fundamental unit in Airflow:

  • Directed: Tasks execute in a specific direction (A → B → C)
  • Acyclic: No circular dependencies allowed
  • Graph: Visual representation of tasks and relationships

Key DAG Properties

| Property | Purpose | Example |
| --- | --- | --- |
| dag_id | Unique identifier | 'daily_etl' |
| start_date | First execution date | datetime(2024, 1, 1) |
| schedule | When to run | '@daily', a Timetable, or a list of Datasets |
| max_active_runs | Concurrent runs allowed | 1-5 |
| default_args | Task defaults | retries, timeout, owner |
| description | What the DAG does | "Daily customer ETL" |
| tags | Categorization | ['etl', 'daily'] |

Operators (Task Types)

Operators are classes that execute the actual work; each is a reusable task type. A minimal wiring example follows the table below.

| Operator | Purpose | Use Case |
| --- | --- | --- |
| PythonOperator | Execute a Python function | Data processing, API calls |
| BashOperator | Run bash commands | Scripts, system commands |
| SQL operators (e.g., SQLExecuteQueryOperator) | Execute SQL queries | Database transformations |
| EmailOperator | Send emails | Notifications, reports |
| S3 operators | S3 file operations | Data movement, management |
| HttpOperator | Make HTTP requests | External service integration |
| BranchPythonOperator | Conditional branching | Different execution paths |
| EmptyOperator (formerly DummyOperator) | Placeholder/no-op | Flow control |
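
To make this concrete, here is a minimal sketch (assuming Airflow 2.4+ for the schedule argument) wiring a BashOperator and a PythonOperator into a DAG; the DAG id, command, and callable are illustrative placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def print_row_count():
    # Placeholder for real processing logic
    print("processed 1000 rows")


with DAG(
    dag_id="operator_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'extracting data'",
    )
    process = PythonOperator(
        task_id="process",
        python_callable=print_row_count,
    )

    extract >> process  # run extract first, then process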

Sensors (Waiting for Conditions)

Sensors are operators that wait for something to happen. They poll a condition until it becomes true, blocking further execution.

Common Sensors

| Sensor | Waits For | Use Case |
| --- | --- | --- |
| FileSensor | File creation on disk | Wait for uploaded data |
| S3KeySensor | S3 object existence | Wait for data in S3 |
| SqlSensor | SQL query condition | Wait for a record count |
| TimeSensor | Specific time of day | Wait until 2 PM |
| ExternalTaskSensor | External DAG task completion | Cross-DAG dependency |
| HttpSensor | HTTP endpoint response | Wait for an API to be ready |
| TimeDeltaSensor | Time duration elapsed | Wait 30 minutes |

Sensor Modes

  • Poke Mode (default): Worker process repeatedly checks the condition and holds the worker slot the whole time
  • Reschedule Mode: Worker slot is released between checks; the task is rescheduled for the next poke
  • Smart Sensors: An earlier batched-checking feature, deprecated and removed in Airflow 2.4 in favor of deferrable operators

⚠️ Performance Note: Many idle sensors waste resources. Use deferrable operators (Async variants, or sensors with deferrable=True) for waits longer than about 5 minutes, as in the sketch below.
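
A hedged sketch of the same wait expressed two ways: a FileSensor in reschedule mode and a core deferrable sensor (TimeDeltaSensorAsync, Airflow 2.2+). The file path, intervals, and connection defaults are illustrative.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.filesystem import FileSensor
from airflow.sensors.time_delta import TimeDeltaSensorAsync

with DAG(
    dag_id="sensor_modes_demo",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    # Reschedule mode: the worker slot is released between checks
    # (uses the fs_default connection unless fs_conn_id is set)
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/data/incoming/orders.csv",
        poke_interval=300,        # check every 5 minutes
        timeout=6 * 60 * 60,      # give up after 6 hours
        mode="reschedule",
    )

    # Deferrable: the wait is handed off to the triggerer service entirely
    wait_30_minutes = TimeDeltaSensorAsync(
        task_id="wait_30_minutes",
        delta=timedelta(minutes=30),
    )

    wait_for_file >> wait_30_minutes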

Hooks (Connection Abstractions)

Hooks encapsulate connection and authentication logic to external systems. They're the reusable layer between operators and external services.

| Hook | Connects To | Purpose |
| --- | --- | --- |
| PostgresHook | PostgreSQL | Query execution |
| MySqlHook | MySQL | Database operations |
| S3Hook | AWS S3 | File operations |
| HttpHook | REST APIs | HTTP requests |
| SlackHook | Slack | Send messages |
| SSHHook | Remote servers | Execute commands |

Benefits:

  • Centralize credential management
  • Reusable across multiple tasks
  • No credentials in DAG code
  • Connection pooling support
  • Error handling abstraction

Tasks

A task is an instance of an operator within a DAGβ€”the smallest unit of work.

| Setting | Purpose | Example |
| --- | --- | --- |
| task_id | Unique identifier | 'extract_customers' |
| retries | Retry attempts on failure | 2-3 |
| retry_delay | Wait between retries | timedelta(minutes=5) |
| execution_timeout | Maximum execution time | timedelta(hours=1) |
| pool | Concurrency control | 'default_pool' |
| priority_weight | Execution priority | 1-100 (higher = more urgent) |
| trigger_rule | When to execute | 'all_success' (default) |

A sketch applying several of these settings follows.
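
A minimal sketch applying these settings to a single operator; the callable and pool name are illustrative, and note that the actual keyword for the timeout is execution_timeout.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_customers():
    print("extracting customers")


with DAG(
    dag_id="task_settings_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    PythonOperator(
        task_id="extract_customers",
        python_callable=extract_customers,
        retries=3,
        retry_delay=timedelta(minutes=5),
        execution_timeout=timedelta(hours=1),  # the "timeout" setting above
        pool="default_pool",
        priority_weight=10,
        trigger_rule="all_success",
    )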

XCom (Cross-Communication)

XCom allows tasks to pass data to each other (automatically serialized to JSON).

# Push
task_instance.xcom_push(key='count', value=1000)

# Pull
count = task_instance.xcom_pull(task_ids='upstream', key='count')

Limitation: Intended for metadata only, not large datasets (causes database bloat).

↑ Back to Table of Contents


SECTION 4: TASKFLOW API

What is TaskFlow API?

TaskFlow API (Airflow 2.0+) is the modern, Pythonic way to write DAGs using decorators instead of verbose operator instantiation.

Comparison: Traditional vs TaskFlow

| Aspect | Traditional | TaskFlow |
| --- | --- | --- |
| Syntax | Verbose operator creation | Clean function decorators |
| Data Passing | Manual XCom push/pull | Implicit via return values |
| Dependencies | Explicit >> operators | Implicit via parameters |
| IDE Support | Basic | Excellent (autocomplete) |
| Code Length | More lines, repetitive | Fewer lines, concise |

Key Decorators

@dag

  • Decorates function that contains tasks
  • Function body defines workflow
  • Returns DAG instance

@task

  • Decorates function that becomes a task
  • Function is task logic
  • Return value becomes XCom value
  • Parameters are upstream dependencies

@task_group

  • Groups related tasks together
  • Creates visual subgraph in UI
  • Simplifies large DAGs

.expand() (Dynamic Task Mapping - Airflow 2.3+)

  • Calling task.expand(arg=list) creates one mapped task instance per item
  • Reduces repetitive task creation
  • Dynamic fan-out pattern (shown in the sketch below)
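
A hedged TaskFlow sketch, assuming Airflow 2.4+ for the schedule argument: dependencies and XComs come from ordinary function calls and return values, and .expand() fans out over a list. Names and data are illustrative.

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def taskflow_etl():
    @task
    def extract():
        return [1, 2, 3]                 # return value is pushed to XCom

    @task
    def transform(values):
        return sum(values)               # parameter creates the dependency

    @task
    def load(total):
        print(f"loaded total={total}")

    @task
    def square(x):
        return x * x

    values = extract()
    load(transform(values))

    # Dynamic task mapping (Airflow 2.3+): one mapped task instance per item
    square.expand(x=values)


taskflow_etl()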

When to Use TaskFlow

Use for:

  • New DAGs (recommended)
  • Complex dependencies
  • Python-savvy teams
  • Modern deployments (2.0+)

Alternative if:

  • Legacy Airflow 1.10 requirement
  • Using operators not available as tasks

↑ Back to Table of Contents


SECTION 5: DEFERRABLE OPERATORS & TRIGGERS

The Problem: Traditional Sensors Waste Resources

Traditional Approach (Blocking)

  • Sensor holds worker slot while waiting
  • Worker CPU: ~5-10% even during idle
  • Memory: ~50MB per idle task
  • Can handle: ~50-100 concurrent waits
  • Issue: Massive waste for long waits (hours/days)

Example: 1000 tasks waiting 1 hour = 1000 worker slots held continuously!

The Solution: Deferrable Operators

Deferrable Approach (Non-blocking)

  • Task releases worker slot immediately after deferring
  • Minimal CPU usage: ~0.1% during wait
  • Memory: ~1MB per deferred task
  • Can handle: 1000+ concurrent waits
  • Savings: ~95% resource reduction

How Deferrable Operators Work

4-Phase Execution:

  1. Execution Phase: Task runs its initial logic, then defers to a trigger; the worker slot is released
  2. Deferred Phase: The trigger monitors the event asynchronously on the Triggerer service; very lightweight, no worker slot needed
  3. Completion Phase: The event occurs; the trigger detects the condition is met and returns event data
  4. Resume Phase: The task resumes on a new worker slot and completes with the trigger result

Available Deferrable Operators

| Operator | Defers For | Best For |
| --- | --- | --- |
| TimeSensorAsync / DateTimeSensorAsync | Specific time | Waiting until business hours |
| S3KeySensorAsync | S3 object | Long-running file uploads |
| ExternalTaskSensorAsync | External DAG | Cross-DAG long waits |
| SqlSensorAsync | SQL condition | Database state changes |
| HttpSensorAsync | HTTP response | External service availability |

Note: some of these Async classes ship in provider or add-on packages; newer provider releases usually expose the same behavior through a deferrable=True argument on the standard sensor, as in the sketch below.
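
A hedged sketch of the deferrable pattern: DateTimeSensorAsync is a core sensor (Airflow 2.2+), while the deferrable=True flag on S3KeySensor assumes a recent Amazon provider release. The bucket, key, and times are illustrative.

import pendulum

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.sensors.date_time import DateTimeSensorAsync

with DAG(
    dag_id="deferrable_demo",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
):
    # Wait until 06:00 UTC without holding a worker slot
    wait_until_6am = DateTimeSensorAsync(
        task_id="wait_until_6am",
        target_time="{{ data_interval_end.replace(hour=6) }}",
    )

    # Wait for a file in S3; the wait runs on the triggerer, not a worker
    wait_for_export = S3KeySensor(
        task_id="wait_for_export",
        bucket_name="my-data-bucket",
        bucket_key="exports/{{ ds }}/orders.parquet",
        deferrable=True,  # assumes a provider version that supports this flag
    )

    wait_until_6am >> wait_for_export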

When to Use Deferrable

Use when:

  • Wait duration > 5 minutes
  • Many concurrent waits
  • Resource-constrained environment
  • Can tolerate event latency (<1 second)

Don't use when:

  • Wait < 5 minutes (overhead not worth it)
  • Real-time detection critical (sub-second)
  • Simple local operations

Performance Impact

1000 Tasks Waiting 1 Hour

| Approach | Cost |
| --- | --- |
| Traditional | 1000 worker slots × 1 hour = 1000 slot-hours wasted |
| Deferrable | 0 worker slots held + minimal triggerer CPU = ~95% savings |

↑ Back to Table of Contents


SECTION 6: DEPENDENCIES & RELATIONSHIPS

Task Dependencies

Linear Chain: A β†’ B β†’ C β†’ D

  • Sequential execution
  • Best for: ETL pipelines with stages

Fan-out / Fan-in: One task triggers many, then merge

      β”œβ”€ Process A ─┐
Extract ─            β”œβ”€ Merge
      └─ Process B β”€β”˜
Enter fullscreen mode Exit fullscreen mode
  • Parallel processing
  • Best for: Multiple independent operations

Branching: Conditional execution

      β”Œβ”€ Success path
Branch ─
      └─ Error path
Enter fullscreen mode Exit fullscreen mode
  • Different paths based on condition
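
A hedged sketch of the fan-out/fan-in and branching shapes above, using EmptyOperator placeholders and the @task.branch decorator (Airflow 2.3+); task names and the branch condition are illustrative.

from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="dependency_shapes",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    extract = EmptyOperator(task_id="extract")
    process_a = EmptyOperator(task_id="process_a")
    process_b = EmptyOperator(task_id="process_b")
    merge = EmptyOperator(task_id="merge", trigger_rule="none_failed")

    # Fan-out / fan-in
    extract >> [process_a, process_b] >> merge

    # Branching: return the task_id (or list of task_ids) to follow
    @task.branch
    def choose_path():
        return "success_path"        # condition is illustrative

    success_path = EmptyOperator(task_id="success_path")
    error_path = EmptyOperator(task_id="error_path")

    merge >> choose_path() >> [success_path, error_path]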

Trigger Rules

Determine WHEN downstream tasks execute based on UPSTREAM status.

| Rule | Executes When | Use Case |
| --- | --- | --- |
| all_success | All upstream tasks succeed | Default, sequential flow |
| all_failed | All upstream tasks fail | Cleanup after failures |
| all_done | All upstream tasks complete (any status) | Final reporting |
| one_success | At least one upstream succeeds | Alternative paths |
| one_failed | At least one upstream fails | Error alerts |
| none_failed | No upstream failures (skips OK) | ⭐ Robust pipelines |
| none_failed_min_one_success (formerly none_failed_or_skipped) | No failures and at least one success | Stricter validation |

Best Practice: Use none_failed for fault-tolerant pipelines.
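
A minimal sketch of a trigger rule in use: a final reporting task that runs no matter how its upstream tasks finished. Task names are illustrative.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="trigger_rule_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    load_a = EmptyOperator(task_id="load_a")
    load_b = EmptyOperator(task_id="load_b")

    # Runs even if load_a or load_b failed or was skipped
    final_report = EmptyOperator(
        task_id="final_report",
        trigger_rule=TriggerRule.ALL_DONE,
    )

    [load_a, load_b] >> final_report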

DAG-to-DAG Dependencies

| Method | Mechanism | Best For |
| --- | --- | --- |
| ExternalTaskSensor | Poll for completion | Different schedules |
| TriggerDagRunOperator | Actively trigger | Orchestrated workflows |
| Datasets | Event-driven (2.4+) | Data pipelines |

Datasets (Airflow 2.4+)

  • Producer DAG updates dataset
  • Consumer DAG triggered automatically
  • Real-time responsiveness: <1 second
  • Perfect for data-driven workflows

Example:

  • ETL DAG extracts and updates orders_dataset
  • Analytics DAG triggered automatically when dataset updates
  • No waiting, real-time processing
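
A hedged sketch of that producer/consumer pattern (Airflow 2.4+); the dataset URI and task bodies are illustrative.

from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

orders_dataset = Dataset("s3://my-data-bucket/orders/")


@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def etl_producer():
    @task(outlets=[orders_dataset])
    def extract_orders():
        print("orders written; dataset marked as updated")

    extract_orders()


@dag(schedule=[orders_dataset], start_date=datetime(2024, 1, 1), catchup=False)
def analytics_consumer():
    @task
    def build_report():
        print("report rebuilt from fresh orders")

    build_report()


etl_producer()
analytics_consumer()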

↑ Back to Table of Contents


SECTION 7: SCHEDULING

Three Scheduling Methods

  1. Time-based: Specific times/intervals
  2. Event-driven: When data arrives/updates
  3. Manual: User-triggered via UI

Time-Based: Cron Expressions

Legacy method using cron format: minute hour day month day_of_week

Common Patterns

  • @hourly: Every hour
  • @daily: Every day at midnight
  • @weekly: Every Sunday at midnight
  • @monthly: First day of month
  • 0 9 * * MON-FRI: 9 AM on weekdays
  • */15 * * * *: Every 15 minutes

Limitations: Simple time-only scheduling, no business logic
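
A minimal sketch attaching one of the cron patterns above to a DAG; the DAG id is illustrative.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="weekday_morning_job",
    start_date=datetime(2024, 1, 1),
    schedule="0 9 * * MON-FRI",   # 9 AM on weekdays
    catchup=False,
):
    EmptyOperator(task_id="run_report")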

Time-Based: Timetable API (Modern)

Object-oriented scheduling (Airflow 2.2+) that can express arbitrary business logic.

Built-in Timetables

  • CronDataIntervalTimetable: What a cron string passed to schedule maps to (cron schedule with data intervals)
  • CronTriggerTimetable: Cron-based schedule without data intervals
  • DeltaDataIntervalTimetable: Fixed timedelta intervals
  • Note: the unified schedule parameter (2.4+) accepts a cron string, a timedelta, a Timetable instance, or a list of Datasets

Custom Timetables Allow

  • Business day scheduling (Mon-Fri)
  • Holiday-aware schedules
  • Timezone-specific runs
  • Complex business logic
  • End-of-month runs
  • Business hour runs (9 AM - 5 PM)

Event-Driven: Datasets (Airflow 2.4+)

Enable data-driven workflows instead of time-driven.

How it Works

  1. Producer task completes and updates dataset
  2. Scheduler detects dataset update
  3. Consumer DAG triggered immediately
  4. No scheduled waiting

Benefits

  • Real-time responsiveness (<1 second)
  • Resource efficient (no idle waits)
  • Data-quality focused
  • Perfect for data pipelines

Hybrid Scheduling

Combine time-based and event-driven (via DatasetOrTimeSchedule in Airflow 2.9+):

  • DAG runs on schedule OR when data arrives
  • Whichever comes first
  • Best of both worlds

Scheduling Decision Tree

TIME-BASED?
β”œβ”€ YES β†’ Complex logic?
β”‚        β”œβ”€ YES β†’ Custom Timetable
β”‚        └─ NO β†’ Cron Expression
β”‚
└─ NO β†’ Event-driven?
         β”œβ”€ YES β†’ Use Datasets
         └─ NO β†’ Manual trigger or reconsider

↑ Back to Table of Contents


SECTION 8: FAILURE HANDLING

Retry Mechanisms

Allow tasks to recover from transient failures (network timeout, API rate limit, etc.).

Configuration

  • retries: Number of retry attempts
  • retry_delay: Wait time between retries
  • retry_exponential_backoff: Exponentially increasing delays
  • max_retry_delay: Cap on maximum wait

Example Retry Sequence

Attempt 1: Fail β†’ Wait 1 minute
Attempt 2: Fail β†’ Wait 2 minutes (1 Γ— 2^1)
Attempt 3: Fail β†’ Wait 4 minutes (1 Γ— 2^2)
Attempt 4: Fail β†’ Wait 8 minutes (1 Γ— 2^3)
Attempt 5: Fail β†’ Wait 16 minutes (1 Γ— 2^4)
Attempt 6: Fail β†’ Wait 1 hour max (capped)
Attempt 7: Fail β†’ Task FAILED

Best Practice: 2-3 retries with exponential backoff for most tasks.
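
A minimal sketch of these retry settings applied through default_args; the values mirror the sequence above and are illustrative.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=1),
    "retry_exponential_backoff": True,   # 1 min, 2 min, 4 min, ...
    "max_retry_delay": timedelta(hours=1),
}

with DAG(
    dag_id="retry_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
):
    EmptyOperator(task_id="flaky_api_call")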

Callbacks

Execute custom logic when task state changes:

  • on_failure_callback: When task fails
  • on_retry_callback: When task retries
  • on_success_callback: When task succeeds

Common Uses

  • Send notifications (Slack, email, PagerDuty)
  • Log to external monitoring systems
  • Trigger incident management
  • Update dashboards
  • Cleanup resources
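
A minimal sketch of an on_failure_callback; the body just logs, but a real callback could post to Slack, PagerDuty, or a dashboard. Names are illustrative.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator


def notify_on_failure(context):
    # The callback receives the task's context dictionary
    ti = context["task_instance"]
    print(f"Task {ti.task_id} in DAG {ti.dag_id} failed for run {context['run_id']}")


with DAG(
    dag_id="callback_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"on_failure_callback": notify_on_failure},
):
    EmptyOperator(task_id="might_fail")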

Error Handling Patterns

Pattern 1: Idempotency

  • Make tasks safe to retry
  • Delete-then-insert instead of insert
  • Same result regardless of retry count

Pattern 2: Circuit Breaker

  • Stop retrying after N attempts
  • Raise critical exceptions
  • Alert operations team
  • Prevent cascading failures

Pattern 3: Graceful Degradation

  • Continue with partial data
  • Log warnings
  • Don't fail entire pipeline

Pattern 4: Skip on Condition

  • Raise AirflowSkipException
  • Skip task without marking as failure
  • Allow downstream to continue
  • Best for conditional workflows
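
A minimal sketch of the skip-on-condition pattern using AirflowSkipException; the weekend check is illustrative.

from datetime import datetime

from airflow.decorators import dag, task
from airflow.exceptions import AirflowSkipException
from airflow.operators.python import get_current_context


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def skip_demo():
    @task
    def process_if_weekday():
        logical_date = get_current_context()["logical_date"]
        if logical_date.weekday() >= 5:  # Saturday or Sunday
            # Mark the task as skipped, not failed; downstream tasks with
            # skip-tolerant trigger rules keep running
            raise AirflowSkipException("Weekend run, nothing to process")
        print("processing weekday data")

    process_if_weekday()


skip_demo()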

↑ Back to Table of Contents


SECTION 9: SLA MANAGEMENT

What is SLA?

SLA (Service Level Agreement) is a guarantee that a task will complete within a specified time window.

If SLA is missed:

  • Task marked as "SLA Missed"
  • Notifications triggered
  • Callbacks executed
  • Logged for auditing/metrics

Setting SLAs

At DAG Level (all tasks inherit):

from datetime import timedelta

default_args = {
    'sla': timedelta(hours=1)
}

At Task Level (specific to individual task):

task = PythonOperator(
    task_id='critical_task',
    python_callable=check_critical_data,  # illustrative callable
    sla=timedelta(minutes=30)
)

Per Task Overrides: Task SLA overrides DAG default

SLA Monitoring

Key Metrics

  • SLA compliance rate (%)
  • Average overdue duration
  • Most violated tasks
  • Impact on downstream

Alerts

  • Send notifications when a task approaches its SLA (e.g., at 80% of the window)
  • Escalate on missed SLA
  • Create incidents for critical SLAs

Dashboard

  • Track compliance trends
  • Identify problematic tasks
  • Alert history
  • Remediation tracking

SLA Best Practices

| Practice | Benefit |
| --- | --- |
| Set realistic SLAs | Avoid false positives |
| Different tiers | Critical < Normal < Optional |
| Review quarterly | Account for growth |
| Document rationale | Team alignment |
| Escalation policy | Proper incident response |
| Callbacks | Automated notifications |
| Monitor compliance | Identify trends |

↑ Back to Table of Contents


SECTION 10: CATCHUP BEHAVIOR

What is Catchup?

Catchup is automatic backfill of missed DAG runs.

Example Scenario

Start date:     2024-01-01
Today:          2024-05-15
Schedule:       @daily (every day)
Catchup:        True

RESULT: Airflow creates 135 DAG runs instantly!
(One for each day from Jan 1 to May 14)

Catchup Settings

| Setting | Behavior | When to Use |
| --- | --- | --- |
| catchup=True | Backfill all missed runs | Historical data loading |
| catchup=False | No backfill, start fresh | New pipelines |
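
A minimal sketch of the recommended day-to-day setup: catchup disabled and concurrency capped. The values are illustrative.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,          # do not backfill missed runs
    max_active_runs=1,      # at most one DAG run at a time
    max_active_tasks=16,    # cap concurrent tasks within this DAG
):
    EmptyOperator(task_id="extract")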

Catchup Risks

Potential Issues

  • Resource exhaustion (CPU, memory, database)
  • Network bandwidth spike
  • Third-party API rate limits exceeded
  • Database connection pool exhausted
  • Worker node crashes
  • Scheduler paralysis

Safe Catchup Strategy

Step 1: Use Separate Backfill DAG

  • Main DAG: catchup=False (normal operations)
  • Backfill DAG: catchup=True, end_date set (historical only)
  • Run backfill during maintenance window

Step 2: Control Parallelism

  • max_active_runs: Limit concurrent DAG runs (1-3)
  • max_active_tasks_per_dag: Limit concurrent tasks (16-32)
  • Prevents resource exhaustion

Step 3: Staged Approach

  • Test with subset of dates first
  • Monitor resource usage
  • Gradually increase pace
  • Have rollback plan

Step 4: Validation

  • Verify data completeness
  • Check for duplicates
  • Validate transformations
  • Compare with expected results

Catchup Checklist

Before enabling catchup:

  • ☐ Database capacity available
  • ☐ Worker resources adequate
  • ☐ Source data exists for all periods
  • ☐ Data schema hasn't changed
  • ☐ No API rate limits exceeded
  • ☐ Network bandwidth available
  • ☐ Stakeholders informed
  • ☐ Rollback procedure ready

↑ Back to Table of Contents


SECTION 11: CONFIGURATION

Configuration Hierarchy

Airflow reads configuration in this order (first match wins):

  1. Environment variables (AIRFLOW__{SECTION}__{KEY}, e.g. AIRFLOW__CORE__PARALLELISM)
  2. airflow.cfg file (main config)
  3. Built-in defaults

Command-line flags apply only to the command being run.

Critical Settings

| Setting | Purpose | Production Value |
| --- | --- | --- |
| sql_alchemy_conn | Metadata database | PostgreSQL connection |
| executor | Task execution | CeleryExecutor or KubernetesExecutor |
| parallelism | Max tasks globally | 32-128 |
| max_active_runs_per_dag | Concurrent DAG runs | 1-3 |
| max_active_tasks_per_dag | Concurrent tasks | 16-32 |
| remote_logging | Centralized logs | True (e.g., to S3) |
| email_backend | Email provider | SMTP |

Database Configuration

Production Database

  • PostgreSQL 12+ (recommended) or MySQL 8+
  • Dedicated database server (not shared)
  • Daily backups automated
  • Connection pooling enabled (pool_size=5, max_overflow=10)
  • Monitoring for slow queries
  • Regular maintenance (VACUUM, ANALYZE)

Why Not SQLite

  • Not concurrent-access safe
  • Locks on writes
  • Single machine only
  • Not suitable for production

Executor Configuration

| Executor | Config Section | Best For |
| --- | --- | --- |
| LocalExecutor | [local_executor] | Development |
| CeleryExecutor | [celery] | Distributed, high volume |
| KubernetesExecutor | [kubernetes] | Cloud-native |

↑ Back to Table of Contents


SECTION 12: VERSION COMPARISON

Version Timeline

1.10.x (Deprecated, EOL June 2021)
    ↓ Major Redesign
2.0 (December 2020) - TaskFlow API introduced
    ↓
2.1-2.2 (Improvements & Deferrable Operators)
    ↓
2.3 (April 2022) - Dynamic task mapping
    ↓
2.4 (September 2022) - Datasets, event-driven scheduling
    ↓
2.5+ (Stable, production-ready)
    ↓
3.0 (Latest) - Performance focus, async improvements

Key Feature Comparison

| Feature | 1.10 | 2.x | 3.0 |
| --- | --- | --- | --- |
| TaskFlow API | ❌ | ✅ | ✅ |
| Deferrable Operators | ❌ | ✅ (2.2+) | ✅ |
| Datasets (event-driven) | ❌ | ✅ (2.4+) | ✅ (renamed Assets) |
| Performance | Baseline | 3-5x faster | 5-6x faster |
| Python 3.10+ | ❌ | ✅ | ✅ |
| Full Async/Await | ❌ | Limited | ✅ |

Performance Improvements

| Metric | 1.10 | 2.5 | 3.0 |
| --- | --- | --- | --- |
| DAG parsing (1000 DAGs) | ~120s | ~45s | ~15s |
| Task scheduling latency | ~5-10s | ~2-3s | <1s |
| Memory (idle scheduler) | ~500MB | ~300MB | ~150MB |
| Web UI response | ~800ms | ~200ms | ~50ms |

Migration Path

From 1.10 to 2.x

  • Python 3.6+ required
  • Import paths changed (reorganized)
  • Operators moved to providers package
  • Test thoroughly in staging
  • Gradual production rollout (1-2 weeks)

From 2.x to 3.0

  • Most DAGs work without changes
  • Drop support for Python < 3.10
  • Update async patterns
  • Leverage new performance features
  • Simpler migration (mostly backward compatible)

Version Selection

| Scenario | Recommended |
| --- | --- |
| New projects | 2.5+ or 3.0 |
| Production (stable) | 2.4.x or 2.5.x |
| Enterprise | 2.5+ |
| Performance-critical | 3.0 |
| Learning/Development | 2.5+ or 3.0 |

Why Not Use 1.10?

  • EOL (End of Life) as of 2021
  • No deferrable operators
  • No TaskFlow API
  • No datasets
  • Slow DAG parsing
  • No modern Python support

Action: If you're on 1.10, plan migration to 2.5+ now.

↑ Back to Table of Contents


SECTION 13: BEST PRACTICES

DAG Design DO's βœ…

  • Use TaskFlow API for modern code
  • Keep tasks small and focused
  • Name tasks descriptively
  • Document complex logic
  • Use type hints
  • Add comprehensive logging
  • Handle errors gracefully
  • Test DAGs before production

DAG Design DON'Ts ❌

  • Put heavy processing in DAG definition
  • Create 1000s of tasks per DAG
  • Hardcode credentials
  • Ignore data validation
  • Use DAG-level database connections
  • Forget about idempotency
  • Skip error handling
  • Assume tasks always succeed

Task Design DO's βœ…

  • Make tasks idempotent (safe to retry)
  • Set appropriate timeouts
  • Implement retries with backoff
  • Use pools for resource control
  • Skip tasks on errors (when appropriate)
  • Log important events
  • Clean up resources

Task Design DON'Ts ❌

  • Assume tasks always succeed
  • Run long operations without timeout
  • Hold resources unnecessarily
  • Store large data in XCom
  • Use blocking operations
  • Ignore upstream failures
  • Forget cleanup

Scheduling DO's βœ…

  • Match frequency to data needs
  • Account for upstream dependencies
  • Plan for failures
  • Test schedule changes
  • Document rationale
  • Monitor adherence

Scheduling DON'Ts ❌

  • Schedule too frequently
  • Ignore resource constraints
  • Forget about catchup implications
  • Use ambiguous expressions
  • Create tight couplings
  • Skip monitoring

Deployment DO's βœ…

  • Use external secrets backend
  • Store DAG code in Git
  • Implement monitoring
  • Set up disaster recovery
  • Rotate credentials quarterly
  • Use separate environments (dev, staging, prod)
  • Document runbooks

Deployment DON'Ts ❌

  • Store credentials in DAGs
  • Deploy without monitoring
  • Skip backups
  • Use same credentials everywhere
  • Forget version control
  • Deploy without testing

↑ Back to Table of Contents


SECTION 14: CONNECTIONS & SECRETS

What are Connections?

Connections store authentication credentials to external systems centrally. Instead of hardcoding database passwords or API keys in DAGs, Airflow stores these securely.

Components:

  • Connection ID: Unique identifier (e.g., postgres_default)
  • Connection Type: System type (postgres, aws, http, mysql, etc.)
  • Host: Server address
  • Login: Username
  • Password: Secret credential
  • Port: Service port
  • Schema: Database or default path
  • Extra: JSON for additional parameters

Why Use Connections?

  • Credentials not in DAG code (won't be accidentally committed to git)
  • Centralized credential management
  • Easy credential rotation
  • Different credentials per environment (dev, staging, prod)
  • Audit trail of access

Connection Types

| Type | Used For | Example |
| --- | --- | --- |
| postgres | PostgreSQL databases | Analytics warehouse |
| mysql | MySQL databases | Transactional data |
| snowflake | Snowflake warehouse | Cloud data platform |
| aws | AWS services (S3, Redshift, Lambda) | Cloud infrastructure |
| gcp | Google Cloud services | BigQuery, Cloud Storage |
| http | REST APIs | External services |
| ssh | Remote server access | Bastion hosts |
| slack | Slack messaging | Notifications |
| sftp | SFTP file transfer | Remote servers |

How to Add Connections

Method 1: Web UI (Easiest)

  1. Airflow UI β†’ Admin β†’ Connections β†’ Create Connection
  2. Fill form with credentials
  3. Click Save

Method 2: Environment Variables

export AIRFLOW_CONN_POSTGRES_DEFAULT=postgresql://user:password@localhost:5432/mydb
export AIRFLOW_CONN_AWS_DEFAULT=aws://ACCESS_KEY:SECRET_KEY@

Method 3: Airflow CLI

airflow connections add postgres_prod \
    --conn-type postgres \
    --conn-host localhost \
    --conn-login user \
    --conn-password password

Using Connections in DAGs

Connections are accessed through Hooks (connection abstraction layer):

from airflow.providers.postgres.hooks.postgres import PostgresHook

def query_database(**context):
    hook = PostgresHook(postgres_conn_id='postgres_default')
    result = hook.get_records("SELECT * FROM users")
    return result

Benefits of Hooks:

  • Encapsulate connection logic
  • Reusable across tasks
  • Handle connection pooling
  • Error handling built-in
  • No credentials in task code

What are Variables/Secrets?

Variables store non-sensitive configuration values in the metadata database.
Secrets are sensitive values (passwords, API keys); they are masked in the UI and are best served from an external secrets backend.

| Aspect | Variables | Secrets |
| --- | --- | --- |
| Purpose | Configuration settings | Passwords, API keys |
| Visibility | Visible in UI | Masked/hidden in UI |
| Storage | Metadata database | External backend (Vault, etc.) |
| Example | batch_size, email | api_key, db_password |

How to Add Variables/Secrets

Method 1: Web UI

  • Airflow UI β†’ Admin β†’ Variables β†’ Create Variable
  • Enter Key and Value
  • Check "Is Secret" for sensitive data
  • Click Save

Method 2: Environment Variables

export AIRFLOW_VAR_BATCH_SIZE=1000
export AIRFLOW_VAR_API_KEY=secret123

Method 3: Airflow CLI

airflow variables set batch_size 1000
airflow variables set api_key secret123

Using Variables in DAGs

from airflow.models import Variable

batch_size = Variable.get('batch_size', default_var=1000)
api_key = Variable.get('api_key')
is_production = Variable.get('is_production', default_var='false')

# With JSON parsing
config = Variable.get('config', deserialize_json=True)

Secrets Backend: Advanced Security

For production, use external secrets backends instead of storing in Airflow database.

| Backend | Security Level | Best For |
| --- | --- | --- |
| Metadata DB | Low | Development only |
| AWS Secrets Manager | High | AWS deployments |
| HashiCorp Vault | Very High | Enterprise |
| Google Secret Manager | High | GCP deployments |
| Azure Key Vault | High | Azure deployments |

How it works:

  • Configure Airflow to use external backend
  • Airflow queries backend when credential needed
  • Secrets never stored in database
  • Automatic rotation supported
  • Audit logging in backend

Security Best Practices

DO βœ…

  • Use external secrets backend in production
  • Rotate credentials quarterly
  • Use IAM roles instead of hardcoded keys
  • Encrypt connections in transit (SSL/TLS)
  • Encrypt secrets at rest (database encryption)
  • Limit who can see connections (RBAC)
  • Use short-lived credentials (temporary tokens)
  • Audit access to secrets
  • Never commit credentials to git

DON'T ❌

  • Hardcode credentials in DAG code
  • Store passwords in comments or docstrings
  • Share credentials via Slack/Email
  • Use default passwords
  • Store credentials in config files in git
  • Forget to revoke old credentials
  • Use same credentials everywhere
  • Log credential values

Testing Connections

Via Web UI:

  • Navigate to Admin β†’ Connections
  • Click on connection
  • Click "Test Connection"

Via CLI:

airflow connections test postgres_default
airflow connections test aws_default

Via Python:

from airflow.providers.postgres.hooks.postgres import PostgresHook

hook = PostgresHook(postgres_conn_id='postgres_default')
hook.get_conn()  # Will raise error if connection fails

↑ Back to Table of Contents


SUMMARY & RECOMMENDATIONS

For New Projects Starting Today

  1. Choose Airflow 2.5+ or 3.0 (not 1.10)
  2. Use TaskFlow API (modern, clean code)
  3. Use Timetable API for scheduling (not cron strings)
  4. Use Deferrable operators for long waits (>5 min)
  5. Plan for scale from day one (don't start with LocalExecutor)
  6. Set up monitoring immediately (not later)
  7. Use event-driven scheduling (Datasets) where possible
  8. Store credentials externally (Secrets Manager, not hardcoded)

For Teams Upgrading from 1.10

  1. Plan 2-3 months for migration (not quick)
  2. Test thoroughly in staging environment
  3. Keep 1.10 backup for emergency rollback
  4. Update DAGs incrementally (not all at once)
  5. Monitor closely first month post-upgrade
  6. Document breaking changes for team
  7. Celebrate milestones (keep team motivated)

For Large Enterprise Deployments

  1. Use Kubernetes executor (scalability, isolation)
  2. Implement comprehensive monitoring (critical for troubleshooting)
  3. Set up disaster recovery (business continuity)
  4. Regular chaos engineering tests (find weak points)
  5. Capacity planning quarterly (predict growth)
  6. Performance optimization ongoing (never "done")
  7. Security audits regular (compliance, risk)

For Data Pipeline Use Cases

  1. Use event-driven scheduling (Datasets, not time-based)
  2. Implement data quality checks (validate early)
  3. Monitor SLAs closely (data freshness critical)
  4. Implement idempotent tasks (safe retries)
  5. Plan catchup carefully (historical data loads)
  6. Version control everything (DAGs, configurations)
  7. Use separate environments (dev, staging, prod)

KEY TAKEAWAYS

βœ… Fundamentals: DAGs, operators, sensors, and TaskFlow API are the foundation

βœ… Scheduling: Choose appropriate method (time-based, event-driven, or hybrid)

βœ… Reliability: Retries, error handling, SLAs, and monitoring ensure success

βœ… Scale: Deferrable operators and proper architecture enable massive workloads

βœ… Security: Centralized credential management with external backends

βœ… Versions: Upgrade from 1.10 to 2.5+ now (3.0 for greenfield projects)


SUCCESS FORMULA

Treat DAGs as code β†’ Version control, code review, testing
Monitor continuously β†’ You can't fix what you don't see
Document thoroughly β†’ Future you and teammates will thank you
Automate responses β†’ Reduce manual incident response
Iterate and improve β†’ Perfect is enemy of done


Airflow is Perfect For

  • βœ… Scheduled batch processing (daily, hourly, etc.)
  • βœ… Multi-stage ETL/ELT workflows
  • βœ… Cross-system orchestration
  • βœ… Data quality validation
  • βœ… DevOps automation
  • βœ… ML pipeline orchestration
  • βœ… Report generation

Airflow is NOT For

  • ❌ Real-time streaming (use Kafka, Spark Streaming)
  • ❌ Replace data warehouses (use Snowflake, BigQuery)
  • ❌ Sub-second latency requirements
  • ❌ Complex transactional systems

Final Wisdom

Success with Airflow depends on:

  • Technical setup: Infrastructure, configuration, security
  • Organizational practices: Code discipline, documentation, monitoring
  • Team culture: Knowledge sharing, continuous learning, collaboration

Apache Airflow's strength lies in its flexibility, active community, and rich ecosystem. Whether you're building simple daily pipelines or complex multi-stage workflows orchestrating hundreds of systems, Airflow provides the tools, patterns, and reliability to succeed.

Start small. Learn thoroughly. Scale thoughtfully. Monitor obsessively.


[End of Tutorial]
