DEV Community

Cliffe Okoth

From Local Scripts to Cloud Servers: Demystifying Docker for DataOps

"...But it works on my machine."

If you spend enough time in data engineering or software development, you will inevitably hear this phrase. You might write a brilliant ETL script that works flawlessly on your laptop, but the moment you move that code to a cloud server, everything breaks: the server has the wrong version of Python, missing libraries, or conflicting dependencies.

This exact problem is why Docker exists.

To understand how Docker works in the real world, we are going to break down its role in a live DataOps project:
an automated NBA Analytics pipeline that extracts game statistics and transforms them using Apache Airflow and dbt.

What is Docker?
Docker is an open-source platform for developing, shipping, and running applications. It lets you separate your applications from your infrastructure so you can deliver software quickly. Look at it this way:

Instead of installing your code, libraries, and tools directly onto a computer, you package them all into a template known as a Docker Image. When you run this image, it becomes a Container: an isolated environment.
Because the container holds everything your application needs to run, you can drop it onto a laptop or a server of your choice, and it will run exactly the same way every single time.
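In command form, the image-to-container lifecycle looks like this (a minimal sketch; `etl-image` is a hypothetical tag, not a name from this project):

```shell
# Bake the code, libraries, and tools into a reusable template (the image)
docker build -t etl-image .

# Run the image; each run spawns a fresh, isolated container.
# --rm removes the container once it exits.
docker run --rm etl-image
```

The same two commands behave identically on a laptop or a cloud server, which is the whole point.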

In this project, the orchestrator, Apache Airflow, is hosted on an Azure Virtual Machine. It is supposed to trigger a local worker to extract data, and then execute transformations using dbt SQL models inside Snowflake.

This creates a massive dependency headache.

Instead of manually installing Airflow on the Azure server and hoping for the best, the project uses Docker to build a container where Airflow is strictly pinned to version 2.10.0.

Deconstructing the Dockerfile

The Dockerfile contains a set of instructions on how to build an image. Think of it as a recipe.

Here is the exact Dockerfile used to build the Airflow orchestrator for this NBA project:

FROM apache/airflow:2.10.0-python3.10

# Step 1: Install system-level tools
USER root
RUN apt-get update && apt-get install -y --no-install-recommends build-essential

# Step 2: Switch back to standard user for security
USER airflow

# Step 3: Install Python packages
COPY --chown=airflow:root requirements.txt /requirements.txt
RUN pip install --upgrade pip && \
    pip install --no-cache-dir -r /requirements.txt

# Step 4: Copy the dbt models into the container
COPY --chown=airflow:root nba_analytics /opt/airflow/nba_analytics

Let's break it down line by line:

  1. FROM apache/airflow:2.10.0...
    Every Dockerfile starts with a FROM command. This tells Docker what "base image" to start with. Instead of building an operating system from scratch, we tell Docker to grab the official Apache Airflow 2.10.0 image (with Python 3.10 baked in) from Docker Hub. This instantly guarantees we bypass the version-conflict issues mentioned earlier.

  2. USER root & RUN apt-get...: We temporarily switch to the administrative root user to install system tools, then safely switch back to USER airflow.

  3. COPY & RUN pip install: We copy the requirements.txt file from our local computer into the container. The RUN command then executes a terminal command to install all our necessary libraries. The --no-cache-dir flag tells pip not to keep its download cache, keeping the final image lightweight.

  4. COPY ... nba_analytics: By copying the nba_analytics folder directly into the container, we ensure our orchestrator has immediate access to the SQL models it needs to run.
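With the Dockerfile in place, you can build the image and sanity-check it (the `nba-airflow` tag here is just an illustrative name):

```shell
# Build the image from the Dockerfile in the current directory
docker build -t nba-airflow .

# Spot-check the pinned versions inside a throwaway container
docker run --rm nba-airflow airflow version   # expect 2.10.0
docker run --rm nba-airflow python --version  # expect Python 3.10.x
```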

Docker Architecture

Docker Compose

A Dockerfile is just the blueprint for a single service.

However, enterprise tools like Apache Airflow are rarely just one service. Airflow, for instance, requires at least three separate services to function: a Scheduler, a Webserver, and a metadata Database.

To spin up all of these services on our Azure VM, the project utilizes Docker Compose. This requires a docker-compose.yml file, which acts as a master blueprint.
Here is a simplified look at how it defines our architecture:

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_DB: airflow

  airflow-webserver:
    build: .
    ports:
      - "8080:8080"
    depends_on:
      - postgres

  airflow-scheduler:
    build: .
    depends_on:
      - postgres

Instead of running long, complex terminal commands to start each piece manually, Docker Compose reads this YAML file and handles the networking automatically.
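Part of that automatic networking: Compose puts every service on a shared network where containers can reach each other by service name. As a sketch of what that enables (this exact setting and the credentials are illustrative, not taken from the project's file), the Airflow services can address the database simply as `postgres`:

```yaml
  airflow-webserver:
    build: .
    environment:
      # "postgres" resolves to the database container on the Compose network
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
```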

To build the images and start every container, you only need to run one command:

docker compose up -d

Docker then pulls the Postgres image, builds your custom Airflow image from your Dockerfile, links everything together on a shared network, and boots up an isolated orchestration server.
The -d flag simply tells it to run in "detached" mode, meaning it runs quietly in the background so you can keep using your terminal.
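Once the stack is up, a few companion commands cover day-to-day operation:

```shell
docker compose ps                          # list running services and their ports
docker compose logs -f airflow-scheduler   # follow one service's logs
docker compose down                        # stop and remove the containers and network
```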

Summary

By containerizing the orchestrator, this data pipeline achieves perfect environment consistency. It doesn't matter whether you deploy this project on an Azure VM, a Google Cloud instance, or your laptop: Docker ensures that Airflow 2.10.0 and every other pinned Python library are locked in and ready to orchestrate your data.
