Quickly Setup Airflow for Development with Breeze

mucio · Updated · 6 min read

Disclaimer: I have submitted the PRs for two of the Breeze features mentioned in this article (the start-airflow command and the --init-scripts flag). I feel responsible for your user experience using them, so if you have questions or feedback please reach out to me.

TL;DR

To have Airflow running on your machine do the following:

  1. Install Docker and Docker Compose
  2. Clone the Airflow repository: `git clone git@github.com:apache/airflow.git`
  3. In the Airflow folder run `./breeze start-airflow`

On the first run, Breeze creates the folder `files/dags` inside the repo folder. Any DAG files you add to that folder will appear in Airflow.

Go to http://localhost:28080 to see your Airflow running.

Intro

If you do not like it when food recipes start with pages of blabbing, skip this part.

My problem

I started to write a few blog posts for people who are approaching Python and Apache Airflow for the first time. I needed a quick way for my readers to set up their own Airflow, and an even quicker way to explain how to do it.

I wanted something so simple that you and I could focus only on the DAG code. Enter Breeze.

What is Breeze?

Breeze is a command line tool to spin up a dockerized* Airflow instance for development or testing. It can be used to create an environment with specific properties to run tests before deploying to production. This is pretty cool if you are into CI/CD.

The first time I met Breeze I was working on automating the creation of our own environment to run tests (for a data warehouse, not for Airflow), and I was very intrigued by the idea.

Therefore, when I started thinking about how to easily get a dev Airflow running, Breeze was at the top of my list.

  • *Dockerized stands for "running in a container with no impact on your computer (called the host, while the container plays the role of the guest)." Well, no impact besides consuming CPU and RAM 😕

Setup

Prerequisites

I run my Airflow/Breeze using WSL2 on Windows 10. People using a Mac or a Linux machine will probably have a smoother experience than me, but WSL2 with Ubuntu is quite good (if you are on Windows 10, the WSL2 setup is covered here), and Breeze runs more easily on a Linux box (or a Mac).

What you need:

  • Docker
  • Docker Compose

In my case I installed all these tools in my Ubuntu WSL.

Installation and first run

Clone the Airflow repository from GitHub with:

```shell
git clone git@github.com:apache/airflow.git
```

Once the repo is downloaded, go to the Airflow folder and run Breeze:

```shell
cd airflow
./breeze start-airflow
```

Breeze will download a number of Docker images and will ask if you want to build some of them; just say "yes" when asked (you can use the flags `--assume-yes` or `--assume-no` if you find this annoying). The first build can take a few minutes, depending on your internet speed and machine.

If everything goes as expected you should see a screen like this:

"I love it when a plan comes together"

Congratulations, your Airflow is up and running.

If you go to http://localhost:28080/ you will see the Airflow UI. The default credentials are admin/admin.


Username: admin - Password: admin

How to use this?

What you see are three tmux panes (tmux is a Linux tool that creates a terminal session and splits it into multiple parts, called panes). In the lower left corner you have the Airflow Scheduler, which takes care of running things; on the right, the Webserver is waiting for you to visit the Airflow web UI. The top pane is for running additional commands.

If you press Ctrl+b followed by an arrow key, you can move between panes. There is not much you need to do in the bottom panes; you can stop the Scheduler and the Webserver with Ctrl+C. The top one is for Airflow CLI commands (run `airflow --help` if you want to know more).

To get out of tmux quickly, run the following command:

```shell
./stop_airflow.sh
```

The purpose of having these three panes is to let you observe what is happening in Airflow and, if needed, use the command line interface (although that is for more advanced use cases).

Developing with Breeze

Now that Airflow is running, you can just put your DAGs in the folder `files/dags` created in your Airflow repository folder. If the folder is not there, Breeze will create it. DAGs can take a few minutes to appear in the web UI.
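As a quick test, a minimal DAG you could drop into `files/dags` might look like the sketch below. The file name `hello_breeze.py` and the task are my own illustration, and the import paths assume Airflow 2.x (in 1.10.x the operator lives in `airflow.operators.python_operator`):

```python
# hello_breeze.py - a minimal DAG to drop into files/dags
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    # This runs inside the Breeze container when the task executes
    print("Hello from Breeze!")


with DAG(
    dag_id="hello_breeze",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # no schedule: trigger it manually from the web UI
    catchup=False,
) as dag:
    hello = PythonOperator(task_id="say_hello", python_callable=say_hello)
```

After a minute or two the `hello_breeze` DAG should show up in the UI, where you can unpause and trigger it.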

If a DAG's syntax is wrong, the Webserver pane shows the errors.

A few additional notes:

  • If you run Breeze with a SQLite database as the Airflow backend (see below), that database is recreated with every run. If you want to keep Airflow configuration objects (like connections to your databases, users, etc.), use a different backend or an initialization script (again, see below).
  • Environment variables can be entered in the file `files/airflow-breeze-config/variables.env` (create it if it is not there); they are set while preparing the Airflow environment.
  • If you want to initialize Airflow, you can put a file called `init.sh` in the folder `files/airflow-breeze-config`. The instructions in this file are executed before the Airflow Scheduler and Webserver start.
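For example, a `variables.env` could contain plain `KEY=value` lines, one per variable. The second variable below is my own made-up example; the first is a real Airflow setting that hides the example DAGs:

```shell
# files/airflow-breeze-config/variables.env
AIRFLOW__CORE__LOAD_EXAMPLES=False
MY_CUSTOM_SETTING=some_value
```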

Some details and recipes

The start-airflow command provides a simple way to start Airflow and monitor it. Behind the scenes, Breeze initializes the Airflow backend database and creates an admin user that can be used to log into the web UI (credentials admin/admin).

Recipe 1 - A persistent backend

As mentioned above, the default database is recreated with every execution. If you want something more persistent, you can use a different backend, for example PostgreSQL:

```shell
./breeze start-airflow -b postgres
```

This starts an additional container with a database dedicated to Airflow. Now your changes will survive a restart.

Recipe 2 - A different Airflow version

By default Breeze starts the most recent version of Airflow (2.0.0dev at the time of writing), which is probably different from what you have in production. The good thing is that Breeze lets you pick the version you need with another flag:

```shell
./breeze start-airflow --install-airflow-version 1.10.10
```

Of course, you can combine multiple flags:

```shell
./breeze start-airflow --install-airflow-version 1.10.10 -b postgres
```

Feel free to go ahead and explore the other possible flags.

Recipe 3 - Initialize Airflow with your own database connection

One way to do it is to use a persistent backend: you can add your connection in the web UI and it will survive restarts. At least, this is what I did when I first started using Airflow.

A more interesting approach is to use Breeze's optional initialization script to create the connection. This makes it easier to maintain connections and other Airflow settings, plus you can store the file in your version control tool (e.g. git).

Here is an example of an init.sh file (note that `--conn-extra '{}'` is quoted so the shell does not mangle the braces):

```shell
# Connections
airflow connections add \
    --conn-login my_user \
    --conn-password my_pwd \
    --conn-type jdbc \
    --conn-host localhost \
    --conn-port 9457 \
    --conn-extra '{}' \
    my_connection

# Variables
airflow variables set my_variable variable_content
```

Using this file creates a JDBC connection called my_connection and a Variable called my_variable. You can see them in the web UI by clicking the corresponding sections in the Admin menu.
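Inside a DAG you could then read these objects back with the standard Airflow helpers. A sketch, assuming Airflow 2.x import paths (in 1.10.x `BaseHook` lives in `airflow.hooks.base_hook`) and a running Airflow environment where the init.sh above has run:

```python
from airflow.hooks.base import BaseHook
from airflow.models import Variable

# Read the variable created by init.sh
my_value = Variable.get("my_variable")  # the "variable_content" set above

# Fetch the connection details created by init.sh
conn = BaseHook.get_connection("my_connection")
print(conn.host, conn.port)  # the localhost / 9457 set above
```

This keeps credentials and settings out of your DAG files, so the same DAG code can run against different environments.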

Additional information

The main point of Breeze was to provide an easy way to run automated tests for the core Airflow developers, the people building Airflow itself, not building with Airflow. Breeze's goal is to lay down the foundation to easily run Airflow, taking care of:

  • starting the needed Docker containers
  • exposing the ports for the Airflow components (e.g. webserver and backend database)
  • providing a convenient way to run new code in Airflow (e.g. putting the DAGs in files/dags)
  • optionally running tests

These features were too interesting to leave them just to the core developers ;)

But this is not everything: if you want to know more about the possibilities offered by Breeze, I suggest taking a look at this video (Airflow Breeze - Development and Test environment for Apache Airflow); it will not make your DAGs better, but it will give you more ideas on how to use Breeze and your new dev environment.

Final words

If you are still here, feel free to leave a comment and provide your feedback. I will be happy to assist you and answer your questions (if I am able to).

Shameless plug

In case you need support or assistance, feel free to reach out to me in the comments or via direct message. On Twitter you can find me with the handle @mucio.

If you need more structured help, the nice people at Untitled Data Company (which includes me) will be happy to help you with all your data needs.
