John Kioko

Introduction to Apache Airflow

If you're new to data engineering or workflow automation, you may have heard of Apache Airflow. It's a powerful open-source platform that simplifies creating, scheduling, and monitoring workflows using Python. Think of it as a conductor orchestrating your tasks to ensure they run in the right order. In this beginner-friendly guide, we'll explore what Airflow is, why it's valuable, and how to get started with a simple example.

What is Apache Airflow?

Apache Airflow is a tool for managing and automating workflows. It's widely used for data pipelines, such as ETL (Extract, Transform, Load) processes, but it can handle any sequence of tasks. Airflow organizes workflows as DAGs (Directed Acyclic Graphs), which are collections of tasks with defined dependencies, ensuring they execute in the correct order without looping.
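
For instance, a tiny ETL-style DAG is just a handful of tasks chained so that extraction always runs before transformation and loading. The sketch below is illustrative only (the DAG id is made up, and the tasks are EmptyOperator placeholders standing in for real work):

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id='etl_sketch',
    start_date=datetime(2025, 1, 1),
    schedule_interval='@daily',
    catchup=False
) as dag:
    # Placeholder tasks standing in for real extract/transform/load steps
    extract = EmptyOperator(task_id='extract')
    transform = EmptyOperator(task_id='transform')
    load = EmptyOperator(task_id='load')

    # The acyclic ordering: extract, then transform, then load
    extract >> transform >> load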

Why Use Apache Airflow?

Airflow is popular among data engineers and developers for several reasons:

  • Python-Based: Workflows are defined in Python, making it approachable if you know the basics.
  • Flexible Scheduling: Run tasks hourly, daily, or on custom schedules.
  • Scalable: Handles everything from small scripts to large-scale enterprise pipelines.
  • Extensible: Connects to databases, cloud platforms, or APIs with a variety of operators and plugins.
  • Monitoring: A web interface provides real-time tracking and debugging of tasks.

For beginners, Airflow is an excellent way to learn workflow automation while leveraging Python skills.

Key Concepts in Airflow

Here are the essential terms to understand:

  • DAG: A workflow represented as a Directed Acyclic Graph, defining tasks and their dependencies.
  • Task: A single unit of work, like running a Python script or querying a database.
  • Operator: Specifies what a task does (e.g., PythonOperator for Python functions, BashOperator for shell commands).
  • Scheduler: The engine that triggers tasks based on their schedule or dependencies.
  • Executor: Determines how tasks are executed, either locally or across multiple machines.
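
To tie a few of these terms together, here is a hedged sketch (the DAG id and the echo command are made up) of a DAG containing one task defined by a BashOperator; the scheduler would trigger it once a day:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='bash_example_dag',
    start_date=datetime(2025, 1, 1),
    schedule_interval='@daily',
    catchup=False
) as dag:
    # A single task whose behavior is defined by a BashOperator
    say_hello = BashOperator(
        task_id='say_hello',
        bash_command='echo "Hello from Bash"'
    )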

Getting Started with Apache Airflow

Let's set up Airflow and create a simple DAG with two tasks. This hands-on example will help you grasp the basics.

Step 1: Install Apache Airflow

You'll need Python 3.8 or higher (Airflow 2.7.x no longer supports Python 3.7). Use a virtual environment to avoid dependency conflicts, and make sure the Python version in the constraints URL (3.8 below) matches the interpreter you're using:

# Create and activate a virtual environment
python3 -m venv airflow_env
source airflow_env/bin/activate

# Install Airflow with a constraint file for compatibility
pip install apache-airflow==2.7.3 --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-3.8.txt"
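
A quick way to confirm the installation worked (the exact output depends on the version you installed) is to ask the CLI for its version:

# Print the installed Airflow version
airflow version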

Step 2: Set Up Airflow

Initialize the Airflow database and start the webserver and scheduler:

# Set the Airflow home directory
export AIRFLOW_HOME=~/airflow

# Initialize the database
airflow db init

# Start the webserver (runs on http://localhost:8080)
airflow webserver --port 8080 &

# Start the scheduler
airflow scheduler &
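
One note before opening the UI: airflow db init does not create a login account on its own, so create an admin user first (the username, password, and email below are placeholders; choose your own):

# Create an admin user for the web UI (values are placeholders)
airflow users create \
    --username admin \
    --password admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com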

Visit http://localhost:8080 in your browser and log in with the credentials you just created to access the Airflow web interface.

Step 3: Create Your First DAG

DAGs are defined in Python files placed in the ~/airflow/dags folder. Here's a simple DAG that runs two tasks: one prints "Hello" and the other prints "World!".

Create a file named hello_world_dag.py in the ~/airflow/dags directory:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Define Python functions for tasks
def print_hello():
    print("Hello")

def print_world():
    print("World!")

# Define the DAG
with DAG(
    dag_id='hello_world_dag',
    start_date=datetime(2025, 1, 1),
    schedule_interval='@daily',
    catchup=False
) as dag:
    # Define tasks
    task_hello = PythonOperator(
        task_id='print_hello_task',
        python_callable=print_hello
    )

    task_world = PythonOperator(
        task_id='print_world_task',
        python_callable=print_world
    )

    # Set task dependencies
    task_hello >> task_world

Explanation of the DAG

  • DAG Setup: The DAG object defines the workflow's ID, start date, and schedule (@daily runs it once a day).
  • Tasks: The PythonOperator creates two tasks that call the print_hello and print_world functions.
  • Dependencies: The task_hello >> task_world line ensures "Hello" prints before "World!".
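
Dependencies don't have to form a straight line, either. The >> operator also accepts a list, so one task can fan out to several downstream tasks that run in parallel. As a hedged variation on the DAG above (the third task and its print_again function are invented for illustration):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def print_hello():
    print("Hello")

def print_world():
    print("World!")

def print_again():
    print("World, again!")

with DAG(
    dag_id='hello_fan_out_dag',
    start_date=datetime(2025, 1, 1),
    schedule_interval='@daily',
    catchup=False
) as dag:
    task_hello = PythonOperator(task_id='print_hello_task', python_callable=print_hello)
    task_world = PythonOperator(task_id='print_world_task', python_callable=print_world)
    task_again = PythonOperator(task_id='print_again_task', python_callable=print_again)

    # task_hello runs first; both downstream tasks then run in parallel
    task_hello >> [task_world, task_again]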

Step 4: Run and Monitor Your DAG

Airflow automatically detects the DAG file. In the web interface, locate hello_world_dag, toggle it to "On," and trigger it manually by clicking the play button. Check the logs to confirm the tasks ran, printing "Hello" followed by "World!".
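
While developing, you can also exercise a single task from the command line without involving the scheduler. The airflow tasks test command runs one task for a given date and prints its output to the terminal without recording state in the database (the date here is just an example):

# Run one task in isolation and print its output
airflow tasks test hello_world_dag print_hello_task 2025-01-01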

Common Use Cases

Airflow is versatile and used for:

  • ETL Pipelines: Automating data extraction, transformation, and loading.
  • Machine Learning: Scheduling model training and deployment.
  • Monitoring: Running periodic checks on data quality or system health.

Next Steps and Resources

Want to learn more? The official Apache Airflow documentation and tutorials at airflow.apache.org are the best place to deepen your Airflow knowledge, covering operators, connections, and deployment options in detail.

Conclusion

Apache Airflow is a powerful, Python-based tool for automating workflows, making it ideal for beginners and seasoned developers alike. Its ability to manage complex dependencies and schedules sets it apart. By starting with a simple DAG and exploring the web interface, you'll quickly unlock its potential. Install Airflow, create your first DAG, and take charge of your workflows!

Got questions or Airflow projects to share? Drop a comment below and let’s keep the conversation going!
