Hi everyone,
in this post, I will summarize how to run Apache Airflow using Docker Compose.
Apache Airflow is a platform to programmatically author, schedule, and monitor workflows.
It is simple to get started with because:
- Apache Airflow has a nice UI
- Apache Airflow has a programmatic way to create workflows
- Apache Airflow has a large community, which means there are plenty of courses and posts to get you started
- Apache Airflow is simpler than Apache NiFi (see my articles here and here)
For your first steps in the Apache Airflow world, you can use the development setup explained here, which uses SQLite as the database for running the tutorial.
However, if you want to use this fantastic tool in production, you can learn how directly from the Apache Airflow website.
To summarize, that documentation tells you to:
- select a database backend such as MySQL/MariaDB or PostgreSQL
- use the LocalExecutor on a single machine, or the Kubernetes executor or the Celery executor in a multi-node setup
- set up Stackdriver Logging, Elasticsearch, or Amazon CloudWatch for storing the logs
The above information is just the starting point for a robust orchestrator in production.
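Just to give an idea of what those choices translate into: on a plain installation they typically end up as Airflow configuration options, which can also be set through environment variables. A minimal sketch, with placeholder values only (not a recommendation):
# Sketch only: placeholder values for a Postgres backend and the Celery executor
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
export AIRFLOW__CORE__EXECUTOR=CeleryExecutor
# Remote task logging (e.g. CloudWatch or Elasticsearch) is configured in the [logging] section
export AIRFLOW__LOGGING__REMOTE_LOGGING=True
export AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=cloudwatch://<your-log-group-arn>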
As usual, it is not always simple to set up everything on your local machine to test a production-like environment.
For this reason, Docker, and in particular Docker Compose, can help us.
As described here, you can test a production environment on your local machine.
The proposed configuration
An Airflow installation is composed of the following components:
- scheduler for triggering scheduled workflows
- executor for running the tasks
- webserver for managing, inspecting, triggering and debugging the DAGs and tasks
- folder with all DAG files
- metadata database for saving the states of the scheduler, executors and webserver.
The image below represents the Airflow architecture:
The proposed configuration is not a production-ready Docker Compose Airflow installation. It is just a quick-start docker-compose file to get your hands dirty with Airflow.
Of course, you have to install the following on your laptop:
- Docker Community Edition (CE)
- Docker Compose v1.29.1 or newer
It is recommended to reserve at least 4GB (better 8GB) of memory for Docker.
The docker-compose.yaml file
The community file is available here; you can download it with:
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.3.4/docker-compose.yaml'
The docker-compose file is composed of the following services:
- airflow-scheduler: The scheduler monitors all DAGs
- airflow-webserver: The Airflow webserver (available at http://localhost:8080)
- airflow-worker: The executor of each task in a DAG
- airflow-init: The initialization service
- postgres: The database
- redis: The broker that forwards messages from the scheduler to the workers
- flower: The optional application that monitors the environment (you can start it with docker-compose --profile flower up)
The docker-compose file mounts three local folders:
- dags: the folder where you can put your DAGs
- logs: the folder that contains logs from task execution and scheduler
- plugins: the folder where you can put your custom plugins
All of these folders are persisted on your local machine, so it is simple to run some tests locally and then move to another machine without losing anything. By default, they are created in the same directory as the docker-compose.yaml file.
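As an example of what goes into the dags folder, here is a minimal sketch of a DAG (the file name hello_airflow.py, the DAG id, and the schedule are purely illustrative); once the stack described below is running, the scheduler picks up the file and the DAG appears in the UI after a short while:
# Assumes the ./dags folder already exists (it is created in the steps below)
cat > ./dags/hello_airflow.py <<'EOF'
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A tiny DAG with a single Bash task, just to verify that the setup works
with DAG(
    dag_id="hello_airflow",
    start_date=datetime(2022, 9, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(task_id="say_hello", bash_command="echo 'Hello from Airflow'")
EOF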
If you want, you can change the local folders in lines 63-65:
volumes:
  - /tmp/airflow/dags:/opt/airflow/dags
  - /tmp/airflow/logs:/opt/airflow/logs
  - /tmp/airflow/plugins:/opt/airflow/plugins
If you do not want the example DAGs, change line 59 from 'true' to 'false':
AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
Another little improvement is to set the name of the PostgreSQL container by adding the following at line 77:
container_name: db
In this way, you can refer to the PostgreSQL database with the name db.
You can also change line 83 to set the Postgres data folder.
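For example, if you prefer a local bind mount over the default named volume (a sketch only; the ./postgres-db folder used here is the one created in the next section), line 83 could become something like:
- ./postgres-db:/var/lib/postgresql/data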
Moreover, you can change some other basic configurations:
- AIRFLOW_IMAGE_NAME: The Docker image name used to run Airflow (Default: apache/airflow:2.3.4)
- AIRFLOW_UID: The user ID in Airflow containers (Default: 50000)
- _AIRFLOW_WWW_USER_USERNAME: The username for the administrator account (Default: airflow)
- _AIRFLOW_WWW_USER_PASSWORD: The password for the administrator account (Default: airflow)
- _PIP_ADDITIONAL_REQUIREMENTS: Additional pip requirements to install when starting all containers (Default: empty)
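All of these are read from the environment (or from a .env file placed next to docker-compose.yaml). A purely illustrative sketch of such a file, with placeholder values:
AIRFLOW_IMAGE_NAME=apache/airflow:2.3.4
AIRFLOW_UID=50000
_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow
# Hypothetical extra dependency, just as an example
_PIP_ADDITIONAL_REQUIREMENTS=pandas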
Toward the start
Before starting everything, you have to do the following:
- Create the Airflow folders:
mkdir -p ./dags ./logs ./plugins ./postgres-db
- Set the Airflow user:
echo -e "AIRFLOW_UID=$(id -u)" > .env
- Initialize the database:
docker-compose up airflow-init
In particular, after the last step, you will see the following output:
Attaching to airflow-init_1
....
airflow-init_1 | DB: postgresql+psycopg2://airflow:***@postgres/airflow
airflow-init_1 | Performing upgrade with database postgresql+psycopg2://airflow:***@postgres/airflow
airflow-init_1 | [2022-09-10 07:47:18,664] {db.py:1466} INFO - Creating tables
airflow-init_1 | INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
....
airflow-init_1 | Upgrades done
....
airflow-init_1 | FutureWarning,
airflow-init_1 | 2.3.4
airflow-init_1 exited with code 0
- Start Airflow by typing:
docker-compose up -d
If everything goes well, the output of docker ps is the following:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5508c60831d4 apache/airflow:2.3.4 "/usr/bin/dumb-init …" 12 hours ago Up 40 seconds (healthy) 0.0.0.0:8080->8080/tcp resources_airflow-webserver_1
37f71f65f758 apache/airflow:2.3.4 "/usr/bin/dumb-init …" 12 hours ago Up 40 seconds (healthy) 8080/tcp resources_airflow-worker_1
44c2588958cb apache/airflow:2.3.4 "/usr/bin/dumb-init …" 12 hours ago Up 40 seconds (healthy) 8080/tcp resources_airflow-scheduler_1
cc939447d676 apache/airflow:2.3.4 "/usr/bin/dumb-init …" 12 hours ago Up 40 seconds (healthy) 8080/tcp resources_airflow-triggerer_1
d36e8e849ff8 redis:latest "docker-entrypoint.s…" 12 hours ago Up 40 seconds (healthy) 6379/tcp resources_redis_1
9ba46b104c7a postgres:13 "docker-entrypoint.s…" 12 hours ago Up 41 seconds (healthy) 5432/tcp resources_postgres_1
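Another quick sanity check (assuming port 8080 is mapped as above) is the health endpoint, which reports the status of the metadata database and the scheduler:
curl http://localhost:8080/health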
Some useful commands
Some useful commands are the following:
Run Airflow commands
To run an Airflow command, type:
docker-compose run airflow-worker airflow info
Otherwise, you can download the wrapper script (macOS and Linux only):
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.3.4/airflow.sh'
chmod +x airflow.sh
Finally, run:
./airflow.sh info
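The wrapper accepts any Airflow CLI command and, according to the Airflow documentation, also bash or python to open an interactive shell inside a container; for example:
# List the available DAGs
./airflow.sh dags list
# Open a bash shell inside an Airflow container
./airflow.sh bash
# Open a Python interpreter inside an Airflow container
./airflow.sh python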
Access the Graphical User Interface
The Airflow GUI is available at http://localhost:8080 with:
- Username: airflow
- Password: airflow
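The same credentials also work against the Airflow REST API, since this docker-compose setup enables basic authentication. For example, assuming the example DAGs are loaded and example_bash_operator is unpaused, you can trigger a run with:
curl -X POST "http://localhost:8080/api/v1/dags/example_bash_operator/dagRuns" \
  -H "Content-Type: application/json" \
  -u "airflow:airflow" \
  -d '{"conf": {}}'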
Shutting down Airflow and cleaning up
To shut down everything, type:
docker-compose down
If you also want to delete the volumes and remove the downloaded images, type:
docker-compose down --volumes --rmi all
Final recommendations
The final recommendations that the Apache Airflow community gives here are amusing:
DO NOT attempt to customize images and the Docker Compose if you do not know exactly what you are doing, do not know Docker Compose, or are not prepared to debug and resolve problems on your own.
....
Even if many users think of Docker Compose as “ready to use”, it is really a developer tool ...
It is extremely easy to make mistakes that lead to difficult-to-diagnose problems and if you are not ready to spend your own time on learning and diagnosing and resolving those problems on your own do not follow this path. You have been warned.
...
DO NOT expect the Docker Compose below will be enough to run production-ready Docker Compose Airflow installation using it. This is truly quick-start docker-compose for you to get Airflow up and running locally and get your hands dirty with Airflow.
In any case, you can look at the Helm Chart for Apache Airflow for more information on how to install Airflow on Kubernetes.
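Just as a pointer (a real Kubernetes deployment needs more care than this), the official chart can be installed with a couple of commands:
helm repo add apache-airflow https://airflow.apache.org
helm upgrade --install airflow apache-airflow/airflow --namespace airflow --create-namespace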
The rise of Amazon Web Services
As is usually the case, where there is a configuration problem, companies see a sales opportunity. It happened here, too: AWS provides Amazon Managed Workflows for Apache Airflow (MWAA).
MWAA is a managed orchestration service for Apache Airflow to create end-to-end data pipelines in the cloud at scale.
Managed Workflows lets you use Airflow and Python to create workflows without having to manage scalability, availability, and security yourself.
Summary
In this post, we started to see the importance of setting up a production environment for Apache Airflow.
After a short introduction, we saw a simple docker-compose setup to start with Apache Airflow locally.
Finally, we briefly looked at Amazon Managed Workflows for Apache Airflow, an all-in-one solution for developers.
That's all for this post. In the next one, I will go deeper into running Apache Airflow in production, and we will analyze the Helm chart and MWAA.
Originally published on [Davide Gazzè's Medium](https://medium.com/p/bcbb19f30cd6).