Running Apache Airflow via Docker Compose

Hi everyone,
in this post, I will summarize how to run Apache Airflow using Docker Compose.

Apache Airflow is a platform to programmatically author, schedule, and monitor workflows.
It is simple to get started with because:

  • Apache Airflow has a nice UI
  • Apache Airflow lets you create workflows programmatically
  • Apache Airflow has a large community, which means plenty of courses and posts to get you started
  • Apache Airflow is simple compared to Apache NiFi (see my articles here and here)

For your first steps in the Apache Airflow world, you can use the development setup explained here, which uses SQLite as the backing database for running the tutorial.
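
If you just want to try that development setup without Docker, a minimal sketch looks like this (the version pin is only an example; the official quick start also recommends installing with a constraints file, see the Airflow docs):

    # install Airflow in a virtual environment (use a constraints file for reproducible installs)
    pip install "apache-airflow==2.3.4"
    # initialize a SQLite database, create an admin user, and start all components locally
    airflow standalone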

However, if you want to use this fantastic tool in production, you can learn how directly from the Apache Airflow website.

To summarize, the documentation tells you to:

  1. select a database backend like MySQL/MariaDB or PostgreSQL
  2. use the LocalExecutor on a single machine, or the Kubernetes executor or the Celery executor in a multi-node setup
  3. set up Stackdriver Logging, Elasticsearch, or Amazon CloudWatch for storing the logs

The above information is just the starting point for a robust orchestrator in production.
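
To give an idea of how these choices translate into configuration, here is a hedged sketch of environment-variable overrides; the variable names come from the Airflow 2.3 configuration reference, while the values are placeholders you would adapt to your own setup:

    environment:
      # 1. point Airflow to a real database backend instead of SQLite
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
      # 2. choose an executor suited to your deployment
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      # 3. ship task logs to a remote store (Elasticsearch, CloudWatch, ...)
      AIRFLOW__LOGGING__REMOTE_LOGGING: 'true'
      AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: my_remote_logs  # hypothetical connection id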

As usual, it is not always simple to set up everything on your local machine to test a production-like environment.
For this reason, Docker, and in particular Docker Compose, can help us.

As described here, you can run a production-like setup on your local machine.

The proposed configuration

An Airflow installation is composed of the following components:

  • a scheduler for triggering scheduled workflows
  • an executor for running the tasks
  • a webserver for managing, inspecting, triggering, and debugging DAGs and tasks
  • a folder with all the DAG files
  • a metadata database for saving the state of the scheduler, executor, and webserver

The image below represents the Airflow architecture:

The proposed configuration is not a production-ready Docker Compose Airflow installation. It is just a quick-start docker-compose to get your hands dirty with Airflow.

Of course, you have to install Docker and Docker Compose on your laptop.

It is recommended to reserve at least 4GB (better 8GB) of memory for Docker.
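
If you are not sure how much memory is available to Docker, the official quick start suggests a command along these lines to check it:

    docker run --rm "debian:bullseye-slim" bash -c 'numfmt --to iec $(echo $(($(getconf _PHYS_PAGES) * $(getconf PAGE_SIZE))))'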

The docker-compose.yaml file

The community file is available here; you can download it using the following command:

curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.3.4/docker-compose.yaml'

The docker-compose file defines the following services:

  • airflow-scheduler: The scheduler, which monitors all DAGs and triggers their tasks
  • airflow-webserver: The Airflow webserver (available at http://localhost:8080)
  • airflow-worker: The worker that executes the tasks of each DAG
  • airflow-triggerer: The triggerer, which runs the event loop for deferrable tasks
  • airflow-init: The initialization service
  • postgres: The database
  • redis: The broker that forwards messages from the scheduler to the workers
  • flower: The optional application that monitors the environment (you can start it with: docker-compose --profile flower up)

The docker-compose file mounts three volumes:

  • dags: the folder where you can put your DAG
  • logs: the folder that contains logs from task execution and scheduler
  • plugins: the folder where you can put your custom plugins

All of these volumes are persisted on your local machine, so it is simple to perform some tests locally and then move to another machine without losing anything.
By default, the folders are created in the same directory where the docker-compose.yaml file is located.
If you want, you can change the local folders at lines 63-65:

  volumes:
    - /tmp/airflow/dags:/opt/airflow/dags
    - /tmp/airflow/logs:/opt/airflow/logs
    - /tmp/airflow/plugins:/opt/airflow/plugins

If you do not want the example DAGs to be loaded, set the following value to 'false' at line 59:

    AIRFLOW__CORE__LOAD_EXAMPLES: 'true'

Another little improvement is to set the name of the PostgreSQL container by adding the following at line 77:

    container_name: db

In this way, you can refer to the PostgreSQL database with the name db.
You could also change line 83 to set the Postgres data folder.

Moreover, you can change some other basic configuration variables (an example .env sketch follows this list):

  • AIRFLOW_IMAGE_NAME: The Docker image name used to run Airflow (Default: apache/airflow:2.3.4)
  • AIRFLOW_UID: The user ID in Airflow containers (Default: 50000)
  • _AIRFLOW_WWW_USER_USERNAME: The username for the administrator account (Default: airflow)
  • _AIRFLOW_WWW_USER_PASSWORD: The password for the administrator account (Default: airflow)
  • _PIP_ADDITIONAL_REQUIREMENTS: The additional PIP requirements to add when starting all containers (Default: )
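
These variables are read by docker-compose from the .env file placed next to docker-compose.yaml (the same file where AIRFLOW_UID is set in the next section). A hypothetical example, with every value shown only for illustration:

    AIRFLOW_IMAGE_NAME=apache/airflow:2.3.4
    AIRFLOW_UID=50000
    _AIRFLOW_WWW_USER_USERNAME=airflow
    _AIRFLOW_WWW_USER_PASSWORD=airflow
    # extra Python packages installed at container start-up (example pin, not required)
    _PIP_ADDITIONAL_REQUIREMENTS=pandas==1.4.3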

Starting Airflow

Before starting everything, you have to do the following:

  1. Create the Airflow folders:

    mkdir -p ./dags ./logs ./plugins ./postgres-db

  2. Set the Airflow user:

    echo -e "AIRFLOW_UID=$(id -u)" > .env

  3. Initialize the database:

    docker-compose up airflow-init

In particular, after the last step, you will see the following output:

Attaching to airflow-init_1
....
airflow-init_1  | DB: postgresql+psycopg2://airflow:***@postgres/airflow
airflow-init_1  | Performing upgrade with database postgresql+psycopg2://airflow:***@postgres/airflow
airflow-init_1  | [2022-09-10 07:47:18,664] {db.py:1466} INFO - Creating tables
airflow-init_1  | INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
....
airflow-init_1  | Upgrades done
....
airflow-init_1  |   FutureWarning,
airflow-init_1  | 2.3.4
airflow-init_1 exited with code 0
  4. Start Airflow by typing:

    docker-compose up -d

If everything goes well, the output of docker ps is the following:

CONTAINER ID   IMAGE                  COMMAND                  CREATED        STATUS                    PORTS                    NAMES
5508c60831d4   apache/airflow:2.3.4   "/usr/bin/dumb-init …"   12 hours ago   Up 40 seconds (healthy)   0.0.0.0:8080->8080/tcp   resources_airflow-webserver_1
37f71f65f758   apache/airflow:2.3.4   "/usr/bin/dumb-init …"   12 hours ago   Up 40 seconds (healthy)   8080/tcp                 resources_airflow-worker_1
44c2588958cb   apache/airflow:2.3.4   "/usr/bin/dumb-init …"   12 hours ago   Up 40 seconds (healthy)   8080/tcp                 resources_airflow-scheduler_1
cc939447d676   apache/airflow:2.3.4   "/usr/bin/dumb-init …"   12 hours ago   Up 40 seconds (healthy)   8080/tcp                 resources_airflow-triggerer_1
d36e8e849ff8   redis:latest           "docker-entrypoint.s…"   12 hours ago   Up 40 seconds (healthy)   6379/tcp                 resources_redis_1
9ba46b104c7a   postgres:13            "docker-entrypoint.s…"   12 hours ago   Up 41 seconds (healthy)   5432/tcp                 resources_postgres_1
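
At this point you can already drop a DAG file into the ./dags folder and the scheduler will pick it up after a short while. A minimal sketch, assuming the Airflow 2.x Python API (the file name ./dags/hello_dag.py and the DAG id are made up for the example):

    # ./dags/hello_dag.py
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # a single daily DAG with one task that just prints a message
    with DAG(
        dag_id="hello_docker_compose",
        start_date=datetime(2022, 9, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="say_hello",
            bash_command="echo 'Hello from Airflow on Docker Compose!'",
        )

Once the DAG appears in the web UI, you can trigger it manually and inspect its task logs.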

Some useful commands

Some useful commands are the following:

Run airflow commands

To run an Airflow CLI command, type:

docker-compose run airflow-worker airflow info

Otherwise, you can download the wrapper script (macOS or Linux only):

curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.3.4/airflow.sh'
chmod +x airflow.sh

Finally, run:

./airflow.sh info
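
With the same pattern you can run any Airflow CLI command; a few examples (the DAG id below is one of the bundled example DAGs):

    # list all DAGs known to Airflow
    ./airflow.sh dags list
    # trigger a run of one of the example DAGs
    ./airflow.sh dags trigger example_bash_operator
    # open an interactive shell inside an Airflow container
    ./airflow.sh bash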

The Graphical User Interface

The Airflow GUI is available at http://localhost:8080 with the following credentials:

  • Username: airflow
  • Password: airflow

Shutting down Airflow and cleaning up

To shut everything down, type:

docker-compose down

If you also want to delete the volumes and the downloaded images, type:

docker-compose down --volumes --rmi all
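
Note that the dags, logs, and plugins folders (and postgres-db, if you changed line 83 as suggested above) are bind mounts on your machine, so docker-compose does not remove them; for a completely clean slate you can delete them manually:

    rm -rf ./dags ./logs ./plugins ./postgres-db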

Final recommendations

It is amusing to read the final recommendations that the Apache Airflow community gives here:

DO NOT attempt to customize images and the Docker Compose if you do not know exactly what you are doing, do not know Docker Compose, or are not prepared to debug and resolve problems on your own.
....
Even if many users think of Docker Compose as “ready to use”, it is really a developer tool ...
It is extremely easy to make mistakes that lead to difficult-to-diagnose problems and if you are not ready to spend your own time on learning and diagnosing and resolving those problems on your own do not follow this path. You have been warned.
...
DO NOT expect the Docker Compose below will be enough to run production-ready Docker Compose Airflow installation using it. This is truly quick-start docker-compose for you to get Airflow up and running locally and get your hands dirty with Airflow.

In any case, you can look at the Helm Chart for Apache Airflow for more information on how to install Airflow on Kubernetes.

The rise of Amazon Web Services

As is usually the case, where there is a configuration problem, companies see a sales opportunity. And it happened here, too: AWS provides Amazon Managed Workflows for Apache Airflow (MWAA).
MWAA is a managed orchestration service for Apache Airflow that lets you create end-to-end data pipelines in the cloud at scale.
Managed Workflows is optimized to use Airflow and Python to create workflows without having to take care of scalability, availability, and security.

Summary

In this post, we started to see the importance of setting up a production environment for Apache Airflow.
After a short introduction, we saw a simple docker-compose setup to get started with Apache Airflow locally.
Finally, we briefly looked at Amazon Managed Workflows for Apache Airflow (MWAA), an all-in-one solution for developers.

That's all for this post. In the next one, I will go deeper into the production concerns of Apache Airflow, and we will analyze the Helm chart and MWAA.

Originally published on [Davide Gazzè's Medium](https://medium.com/p/bcbb19f30cd6)
