Josh Holbrook

Posted on Jul 12, 2020 • Edited on Oct 29, 2020

How To Run Airflow on Windows (with Docker)

#dataengineering #etl #airflow

A problem I've noticed a lot of aspiring data engineers running into recently is trying to run Airflow on Windows. This is harder than it sounds.

For many (most?) Python codebases, running on Windows is reasonable enough. For data, Anaconda even makes it easy - create an environment, install your library and go. Unfortunately, Airbnb handed us a pathologically non-portable codebase. I was flabbergasted to find that casually trying to run Airflow on Windows resulted in a bad shim script, a really chintzy pathing bug, a symlinking issue* and an attempt to use the Unix-only passwords database.

So running Airflow in Windows natively is dead in the water, unless you want to spend a bunch of months rewriting a bunch of the logic and arguing with the maintainers**. Luckily, there are two fairly sensible alternate approaches to consider which will let you run Airflow on a Windows machine: WSL and Docker.

WSL

WSL stands for the "Windows Subsystem for Linux", and it's actually really cool. Basically, the steps look something like this:

Install the WSL by running some cryptic PowerShell commands
Install Ubuntu from the Microsoft Store
Type "Ubuntu" into the search bar, mash enter, and be dumped into a containerized Linux environment

I have WSL 2 installed, which is faster and better in many ways, but which (until recently? unclear) needs an Insider build of Windows.

Given that this is a fully operational Ubuntu environment, any tutorial that you follow for Ubuntu should also work in this environment.

Docker

The alternative, and the one I'm going to demo in this post, is to use Docker.

Docker is a tool for managing Linux containers, which are a little like virtual machines without the virtualization, making them act like self-contained machines but much more lightweight than a full VM. Surprisingly it works on Windows - casually, even.

Brief sidebar: Docker isn't a silver bullet, and honestly it's kind of a pain in the butt. I personally find it tough to debug and its aggressive caching makes both cache busting and resource clearing difficult. Even so, the alternatives - such as Vagrant - are generally worse. Docker is also a pseudo-standard and Kubernetes - the heinously confusing thing your DevOps team makes you deploy to - works with Docker images, so it's overall a useful tool to reach for especially for problems like this one.

Setting up Docker Compose

Docker containers can be run in two ways: either in a bespoke capacity via the command line, or using a tool called Docker Compose that takes a YAML file which specifies which containers to run and how, and then does what's needed. For a single container the command line is often the thing you want - and we use it later on - but for a collection of services that need to talk to each other, Docker Compose is what we need.

So to get started, create a directory somewhere - mine's in ~\software\jfhbrook\airflow-docker-windows but yours can be anywhere - and create a docker-compose.yml file that looks like this:



version: '3.8'
services:
  metadb:
    image: postgres
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    networks:
      - airflow
    restart: unless-stopped
    volumes:
      - ./data:/var/lib/postgresql/data
  scheduler:
    image: apache/airflow
    command: scheduler
    depends_on:
      - metadb
    networks:
      - airflow
    restart: unless-stopped
    volumes:
      - ./airflow:/opt/airflow
  webserver:
    image: apache/airflow
    command: webserver
    depends_on:
      - metadb
    networks:
      - airflow
    ports:
      - 8080:8080
    restart: unless-stopped
    volumes:
      - ./airflow:/opt/airflow
networks:
  airflow:

There's a lot going on here. I'll try to go over the highlights, but I recommend referring to the file format reference docs.

First of all, we create three services: a metadb, a scheduler and a webserver. Architecturally, Airflow stores its state in a database (the metadb), the scheduler process connects to that database to figure out what to run when, and the webserver process puts a web UI in front of the whole thing. Individual jobs can connect to other databases, such as Redshift, to do actual ETL.

Docker containers are created based on Docker images, which hold the starting state for a container. We use two images here: apache/airflow, the official Airflow image, and postgres, the official PostgreSQL image.

Airflow also reads configuration, DAG files and so on, out of a directory specified by an environment variable called AIRFLOW_HOME. The default if installed on your MacBook is ~/airflow, but in the Docker image it's set to /opt/airflow.

We use Docker's volumes functionality to mount the directory ./airflow under /opt/airflow. We'll revisit the contents of this directory before trying to start the cluster.
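To make the AIRFLOW_HOME lookup concrete, here's a minimal Python sketch of the fallback described above. The function name resolve_airflow_home is made up for illustration and this is not Airflow's actual implementation; it only mirrors the "use the environment variable, else ~/airflow" behavior.

```python
import os


def resolve_airflow_home(environ):
    """Return the directory Airflow would treat as its home.

    `environ` is any mapping of environment variables (pass os.environ
    for the real thing). The name of this helper is hypothetical.
    """
    # Fall back to ~/airflow when AIRFLOW_HOME is not set.
    raw = environ.get("AIRFLOW_HOME", "~/airflow")
    # Expand a leading ~ to the current user's home directory.
    return os.path.expanduser(raw)


if __name__ == "__main__":
    # Inside the apache/airflow image the variable is set to /opt/airflow.
    print(resolve_airflow_home({"AIRFLOW_HOME": "/opt/airflow"}))
    # On a plain install with nothing set, we get the ~/airflow default.
    print(resolve_airflow_home({}))
```

This is why the compose file mounts ./airflow at /opt/airflow: whatever we put in that directory on the host is what Airflow finds as its home inside the container.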

The metadb implementation is pluggable and supports most SQL databases via SQLAlchemy. Airflow uses SQLite by default, but in practice most people use either MySQL or PostgreSQL. I'm partial to the latter, so I chose to set it up here.

On the PostgreSQL side: you need to configure it to have a user and database that Airflow can connect to. The Docker image supports this via environment variables. There are many variables that are supported, but the ones I used are POSTGRES_USER, POSTGRES_PASSWORD and POSTGRES_DB. By setting all of these to airflow, I ensured that there was a superuser named airflow, with a password of airflow and a default database of airflow.

Note that you'll definitely want to think about this harder before you go to production. Database security is out of scope of this post, but you'll probably want to create a regular user for Airflow, set up secrets management with your deploy system, and possibly change the authentication backend. Your DevOps team, if you have one, can probably help you here.

PostgreSQL stores all of its data in a volume as well. The location in the container is at /var/lib/postgresql/data, and I put it in ./data on my machine.

Docker has containers connect over virtual networks. Practically speaking, this means that you have to make sure that any containers that need to talk to each other are all connected to the same network (named "airflow" in this example), and that any containers that you need to talk to from outside have their ports explicitly exposed. You'll definitely want to expose port 8080 of the webserver to your host so that you can visit the UI in your browser. You may want to expose PostgreSQL as well, though I haven't done that here.

Finally, by default Docker Compose won't bother to restart a container if it crashes. This may be desired behavior, but in my case I wanted them to restart unless I told them to stop, and so set it to unless-stopped.

Setting Up Your Filesystem

As mentioned, a number of directories need to exist and be populated in order for Airflow to do something useful.

First, let's create the data directory, so that PostgreSQL has somewhere to put its data:



mkdir ./data

Next, let's create the airflow directory, which will contain the files inside Airflow's AIRFLOW_HOME:



mkdir ./airflow

When Airflow starts it looks for a file called airflow.cfg inside of the AIRFLOW_HOME directory, which is ini-formatted and which is used to configure Airflow. This file supports a number of options, but the only one we need for now is core.sql_alchemy_conn. This field contains a SQLAlchemy connection string for connecting to PostgreSQL.

Crack open ./airflow/airflow.cfg in your favorite text editor and make it look like this:



[core]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@metadb:5432/airflow

Some highlights:

The protocol is "postgresql+psycopg2", which tells SQLAlchemy to use the psycopg2 library when making the connection
The username is airflow, the password is airflow, the port is 5432 and the database is airflow.
The hostname is metadb. This is unintuitive and tripped me up - what's important here is that when Docker Compose sets up all of the networking stuff, it makes each container reachable at a hostname matching its service name as typed into the docker-compose.yml file. This service was called "metadb", so the hostname is likewise "metadb".
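If you want to double check how that connection string breaks down, here's a rough Python sketch that reads the same ini text and splits the URL into its parts. It only uses the standard library (configparser and urllib.parse) rather than SQLAlchemy's own URL parser, so treat it as an approximation; the describe_connection helper is a name I made up for the example.

```python
import configparser
from urllib.parse import urlsplit

# The same contents as ./airflow/airflow.cfg above.
AIRFLOW_CFG = """\
[core]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@metadb:5432/airflow
"""


def describe_connection(cfg_text):
    """Split core.sql_alchemy_conn into its components (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    url = urlsplit(parser["core"]["sql_alchemy_conn"])
    # The scheme is "<dialect>+<driver>".
    dialect, _, driver = url.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver,
        "username": url.username,
        "password": url.password,
        "host": url.hostname,
        "port": url.port,
        "database": url.path.lstrip("/"),
    }


if __name__ == "__main__":
    for key, value in describe_connection(AIRFLOW_CFG).items():
        print(f"{key}: {value}")
```

The host field is the part that has to match the service name in docker-compose.yml.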

Initializing the Database

Once you have those pieces together, you can let 'er rip:



docker-compose up

However, you'll notice that the Airflow services start crash-looping immediately, complaining that various tables don't exist. (If it complains that the db isn't up, shrug, ctrl-c and try again. Computers amirite?)

This is because we need to initialize the metadb to have all of the tables that Airflow expects. Airflow ships with a CLI command that will do this - unfortunately, our compose file doesn't handle it.

Keep the Airflow containers crash-looping in the background; we can use the Docker CLI to connect to the PostgreSQL instance running in our compose setup and ninja in a fix.

Create a file called ./Invoke-Airflow.ps1 with the following contents:



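# Compose names the network <project>_airflow; the project defaults to this directory's name.
$Network = "{0}_airflow" -f @(Split-Path $PSScriptRoot -Leaf)

# Run a one-off Airflow container on that network, passing arguments through to the CLI.
docker run --rm --network $Network --volume "${PSScriptRoot}\airflow:/opt/airflow" apache/airflow @Args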

The --rm flag removes the container after it's done running so it doesn't clutter things up. The --network flag tells Docker to connect to the virtual network you created in your docker-compose.yml file, and the --volume flag tells Docker how to mount your AIRFLOW_HOME. Finally, @Args uses a feature of PowerShell called splatting to pass arguments to your script through to Airflow.
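The network name is the fiddly part, so here's a small Python sketch of how I understand Docker Compose to derive it: the project name defaults to the directory name, lowercased and with characters outside a-z, 0-9, - and _ dropped, and the network is then called project_network. That normalization rule is an assumption on my part and may differ between Compose versions, and both the compose_network_name helper and the example paths are hypothetical.

```python
import re
from pathlib import PureWindowsPath


def compose_network_name(project_dir, network="airflow"):
    """Guess the Docker network name Compose creates for a project directory.

    Assumes the default project name is the directory name, lowercased,
    with anything other than a-z, 0-9, '-' and '_' removed.
    """
    # PureWindowsPath parses Windows-style paths on any operating system.
    directory = PureWindowsPath(project_dir).name
    project = re.sub(r"[^-_a-z0-9]", "", directory.lower())
    return f"{project}_{network}"


if __name__ == "__main__":
    # Hypothetical checkout location.
    print(compose_network_name(r"C:\Users\me\software\airflow-docker-windows"))
```

If that assumption holds, a directory name with capital letters or spaces would produce a network name that the PowerShell script, which uses the directory name as-is, wouldn't match. If docker run complains that the network doesn't exist, docker network ls will show you the real name.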

Once that's saved, we can run initdb against our Airflow install:



.\Invoke-Airflow.ps1 initdb

You should notice that Airflow is suddenly a lot happier. You should also be able to connect to Airflow by visiting localhost:8080 in your browser.
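If the page doesn't load right away, the webserver may still be starting up. Here's a minimal Python sketch that polls a URL from the host until something answers over HTTP, which is handier than mashing refresh. The wait_for_http helper and its timeout values are my own invention, not part of Airflow or Docker.

```python
import time
import urllib.request
from urllib.error import HTTPError


def wait_for_http(url, timeout=60.0, interval=1.0, request_timeout=5.0):
    """Poll `url` until a server responds or `timeout` seconds pass.

    Returns True as soon as any HTTP response arrives, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=request_timeout):
                return True
        except HTTPError:
            # The server answered, just with an error status; it's up.
            return True
        except OSError:
            # Connection refused, reset or timed out: not ready yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling wait_for_http("http://localhost:8080/") from the host should return True once the webserver is serving requests, since port 8080 is the one mapped in the compose file.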

For bonus points, we can use the postgres container to connect to the database with the psql CLI using a very similar trick. Put this in Invoke-Psql.ps1:



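# Compose names the network <project>_airflow; the project defaults to this directory's name.
$Network = "{0}_airflow" -f @(Split-Path $PSScriptRoot -Leaf)

# Start an interactive psql session against the metadb service as the airflow user.
docker run -it --rm --network $Network postgres psql -h metadb -U airflow --dbname airflow @Args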

and then run .\Invoke-Psql in the terminal.

Now you should be able to run \dt at the psql prompt and see all of the tables that airflow initdb created:



psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

airflow=# \dt
                    List of relations
 Schema |             Name              | Type  |  Owner
--------+-------------------------------+-------+---------
 public | alembic_version               | table | airflow
 public | chart                         | table | airflow
 public | connection                    | table | airflow
 public | dag                           | table | airflow
 public | dag_code                      | table | airflow
 public | dag_pickle                    | table | airflow
 public | dag_run                       | table | airflow
 public | dag_tag                       | table | airflow
 public | import_error                  | table | airflow
 public | job                           | table | airflow
 public | known_event                   | table | airflow
 public | known_event_type              | table | airflow
 public | kube_resource_version         | table | airflow
 public | kube_worker_uuid              | table | airflow
 public | log                           | table | airflow
 public | rendered_task_instance_fields | table | airflow
 public | serialized_dag                | table | airflow
 public | sla_miss                      | table | airflow
 public | slot_pool                     | table | airflow
 public | task_fail                     | table | airflow
 public | task_instance                 | table | airflow
 public | task_reschedule               | table | airflow
 public | users                         | table | airflow
 public | variable                      | table | airflow
 public | xcom                          | table | airflow
(25 rows)

Conclusions

Now we have a working Airflow install that we can mess with. You'll notice that I didn't really go into how to write a DAG - there are other tutorials for that which should now be follow-able. Whenever they say to run the airflow CLI tool, run Invoke-Airflow.ps1 instead.

Using Docker, Docker Compose and a few wrapper PowerShell scripts, we were able to get Airflow running on Windows, a platform that's otherwise unsupported. In addition, we were able to build tooling to run multiple services in a nice, self-contained way, including a PostgreSQL database. Finally, by using a little PowerShell, we were able to make using these tools easy.

Cheers!

* Symbolic links in Windows are a very long story. Windows traditionally has had no support for them at all - however, recent versions of NTFS technically allow symlinks but require Administrator privileges to create them, and none of the tooling works with them.

** I'm not saying that the Airflow maintainers would be hostile towards Windows support - I don't know them for one, but also I have to assume they would be stoked. However, I also have to assume that they would have opinions. Big changes require a lot of discussion.

Top comments (3)

Josh Holbrook • Jul 15 '20

Addendum: Running in Production

I had someone ask me today about using this process to run Airflow in production. It should be noted that Docker doesn't work on all Windows installs. In particular, this reportedly won't work with server instances on Azure.

That said, if you're trying to run Airflow in production, you should probably deploy to Linux - or, if using Docker, to a managed Kubernetes product such as AKS on Azure or GKE on Google Cloud. Luckily, the only Windows-specific aspects of the procedure laid out here are the PowerShell snippets, and even PowerShell can run on Linux/macOS if you install it.

Ovo Okpubuluku • Dec 4 '21

I think Airflow now comes with an authentication requirement too...

Josh Holbrook • Jan 3 '22

I don't have time to run through this tutorial to update the directions, but if someone tells me what changed and what they did I'm happy to post an update (with a /ht!)
