Understanding Docker for Data Engineering

#docker #dataengineering #beginners

In your data engineering journey, you may have pipelines running locally in your development environment, and they work beautifully, at least on your machine.

But what happens if you want to hand over a project to a colleague, deploy it to a shared server managed by your team leader, or push it to a cloud provider?

Suddenly, a storm of errors appears, such as:

"You don't have PostgreSQL 15 installed locally?"
"Your machine is running an older version of Python that doesn't support that syntax?"
"Your operating system is missing the specific database drivers needed for psycopg2."

This is called Dependency Hell. In data engineering, ensuring your pipelines run exactly the same way everywhere is just as important as writing the code itself. That is why we use Docker.

The Container Analogy

Before the 1950s, shipping goods across the world was incredibly messy. Workers had to manually load barrels of oil, sacks, and crates of electronics onto ships. Every item was a different shape and size, making loading slow, inefficient, and prone to accidents.

Then came the standardized shipping container.

It didn’t matter what was inside the box: cars, clothes, or frozen food. The shipping container had standard dimensions and the same corner fittings, and it fit on every ship, train, and crane built to handle it.

Docker does exactly this for software. Instead of shipping raw Python scripts and text files, Docker lets you package your application code, dependencies, runtime, and configurations into a single, standardized box called a Container. If a machine can run Docker, it can run your container seamlessly, whether it’s a Windows laptop, a Mac, or a Linux server.

A common question beginners ask is: "Why not just use a Virtual Machine (VM) to isolate our code?"

While VMs provide isolation, they are quite heavy. A VM runs an entire guest operating system (like a whole installation of Windows or Ubuntu), which can consume gigabytes of RAM and disk space, plus CPU time, before your code even starts running.

Docker containers are lightweight. They don't include a whole operating system; instead, they share the host machine’s operating system kernel and only pack the bare essentials (your app code and libraries). This means a container can spin up in seconds rather than minutes, using a fraction of the system resources.
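One way to see the shared kernel for yourself is to compare the kernel version reported on the host with the one reported inside a container. The sketch below assumes Docker is already installed and that the small public `alpine` image can be pulled; the variable names are illustrative. On a native Linux host the two values should match. With Docker Desktop on Windows or macOS, containers share the kernel of a lightweight Linux VM that Docker manages, so the container value reflects that VM instead.

```shell
#!/usr/bin/env bash
# Compare the kernel version seen by the host with the one seen inside a container.

host_kernel="$(uname -r)"
echo "Host kernel:      ${host_kernel}"

# Only attempt the container half if the docker CLI exists and the daemon is reachable.
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
    # --rm deletes the container as soon as the command finishes.
    container_kernel="$(docker run --rm alpine uname -r)"
    echo "Container kernel: ${container_kernel}"
else
    echo "Docker is not reachable from this shell; skipping the container check."
fi
```

If Docker on your machine requires `sudo`, the check is skipped rather than failing; run the script with the appropriate privileges to see both values.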

The Three Core Concepts in Docker

1. The Dockerfile

A text document containing step-by-step instructions on how to build a Docker image.
You specify the base image, install your Python packages, and copy your scripts into it.
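As a rough illustration, the shell snippet below writes a small Dockerfile for a Python pipeline, along with the two files it copies. Everything named here is an assumption made for the example: the `etl-demo` directory, the `pipeline.py` script, the contents of `requirements.txt`, and the `python:3.10-slim` base image.

```shell
#!/usr/bin/env bash
# Create a scratch project with a Dockerfile, a requirements file, and a script.
# All file and directory names are hypothetical placeholders.

project_dir="etl-demo"
mkdir -p "$project_dir"

# Python packages the pipeline needs (psycopg2-binary bundles the PostgreSQL driver).
cat > "$project_dir/requirements.txt" <<'EOF'
psycopg2-binary
EOF

# A stand-in for the real pipeline script.
cat > "$project_dir/pipeline.py" <<'EOF'
import psycopg2

print("psycopg2 version:", psycopg2.__version__)
EOF

# The Dockerfile itself: base image, packages, then scripts.
cat > "$project_dir/Dockerfile" <<'EOF'
# Start from an official Python base image on Docker Hub
FROM python:3.10-slim

# Directory inside the image where the following instructions run
WORKDIR /app

# Install Python packages first so this layer is reused when only the code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the pipeline script into the image
COPY pipeline.py .

# Command executed when a container starts from this image
CMD ["python", "pipeline.py"]
EOF

echo "Files written to ${project_dir}/:"
ls "$project_dir"
```

With Docker installed, `docker build -t etl-demo etl-demo` would turn this directory into an image.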

2. The Docker Image

When you run a build command on your Dockerfile, Docker builds it into a Docker Image. This is a read-only blueprint of your environment.
It stores your libraries and setup files as a stack of read-only layers. You can share this image on platforms like Docker Hub.
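Here is a minimal sketch of the build step, assuming Docker is installed and the daemon is reachable from your shell. The image name `hello-image:0.1` and the temporary build directory are placeholders chosen for this example.

```shell
#!/usr/bin/env bash
# Build an image from a tiny Dockerfile and inspect the result.

image_tag="hello-image:0.1"   # hypothetical name:tag for the example
build_dir="$(mktemp -d)"      # throwaway directory used as the build context

cat > "$build_dir/Dockerfile" <<'EOF'
FROM python:3.10-slim
CMD ["python", "-c", "print('hello from the image')"]
EOF

if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
    docker build -t "$image_tag" "$build_dir"   # Dockerfile -> image
    docker images "$image_tag"                  # the image now appears in the local list
    docker history "$image_tag"                 # the read-only layers that make up the image
else
    echo "Docker is not reachable from this shell; skipping the build."
fi

# Remove the temporary build context; the built image lives in Docker's own storage.
rm -rf "$build_dir"
```

Sharing the image would be a `docker push` to a registry you have access to, which requires logging in and tagging the image with your own repository name.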

3. The Docker Container

When you run an image, it becomes a Container. This is a running instance of the image, containing everything an application needs to run.
You can start it, stop it, write data inside it, and delete it when you're done. Data written inside a container is lost when the container is deleted, unless it is stored in a volume.
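The lifecycle described above can be walked through in a few commands. This is a sketch under assumptions: Docker is reachable from your shell, the public `alpine` image stands in for a real application, and the container name `demo-sleeper` is made up for the example.

```shell
#!/usr/bin/env bash
# Walk one container through its lifecycle: run, write, stop, start, delete.

container_name="demo-sleeper"   # hypothetical name for this example

if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
    # Start a container in the background (-d) that just sleeps for five minutes.
    docker run -d --name "$container_name" alpine sleep 300

    # Write a file inside the running container, then read it back.
    docker exec "$container_name" sh -c 'echo "written inside the container" > /tmp/note.txt'
    docker exec "$container_name" cat /tmp/note.txt

    # Stop it (wait at most 1 second before forcing), then list it among all containers.
    docker stop -t 1 "$container_name"
    docker ps -a --filter "name=$container_name"

    # A stopped container keeps its filesystem: start it again and the file is still there.
    docker start "$container_name"
    docker exec "$container_name" cat /tmp/note.txt

    # Deleting the container discards everything written inside it.
    docker rm -f "$container_name"
else
    echo "Docker is not reachable from this shell; skipping the lifecycle demo."
fi
```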

Below is a guide to installing Docker Engine on Linux, or on Windows if you have the Windows Subsystem for Linux (WSL). The commands assume an Ubuntu-based distribution that uses apt.

    # Step 1: Update the system and install prerequisites
    sudo apt update && sudo apt upgrade -y
    sudo apt install -y ca-certificates curl

    # Step 2: Add Docker's official GPG key and repository
    sudo install -m 0755 -d /etc/apt/keyrings  # create the keyring directory
    # Download Docker's GPG key
    sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
      -o /etc/apt/keyrings/docker.asc
    # Make the key file readable by all users
    sudo chmod a+r /etc/apt/keyrings/docker.asc
    # Add Docker's stable apt repository
    echo "deb [arch=$(dpkg --print-architecture) \
      signed-by=/etc/apt/keyrings/docker.asc] \
      https://download.docker.com/linux/ubuntu \
      $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
      sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

    # Step 3: Install Docker Engine (latest version)
    sudo apt update
    sudo apt install -y docker-ce docker-ce-cli containerd.io \
      docker-buildx-plugin docker-compose-plugin

    # Step 4: Verify installation
    docker --version
    docker compose version

    # Step 5: Start Docker and check that it is running
    # (on WSL without systemd, try: sudo service docker start)
    sudo systemctl start docker     # start the Docker daemon
    sudo systemctl status docker    # verify Docker is running
    sudo systemctl enable docker    # enable Docker to start on boot

Whether you use the Docker Desktop interface or run Docker commands in your terminal, you can see the images and containers in your environment at a glance.

To confirm Docker is working, pull and run the hello-world image (prefix the command with sudo if your user is not in the docker group):

    docker run hello-world

What happens behind the scenes:

1. Docker checks locally for the 'hello-world' image.
2. If it is not found locally, Docker pulls it from Docker Hub.
3. Docker creates a container from the image.
4. Docker runs the container, which prints the success message.
5. The container exits.
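The steps above can be reproduced by hand, which makes each stage visible. This sketch assumes Docker is reachable from your shell; `docker run` normally does all of this in one go.

```shell
#!/usr/bin/env bash
# Perform by hand roughly what `docker run hello-world` does in one step.

image="hello-world"

if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
    # Look for the image locally; pull it from Docker Hub only if it is missing.
    if docker image inspect "$image" >/dev/null 2>&1; then
        echo "Image '$image' found locally"
    else
        docker pull "$image"
    fi

    # Create a container from the image (the command prints the new container's ID).
    container_id="$(docker create "$image")"

    # Start it and attach to its output (-a) so the success message is shown.
    docker start -a "$container_id"

    # The container has exited: it is no longer running, but it still exists.
    docker ps -a --filter "id=$container_id"

    # Clean up the exited container.
    docker rm "$container_id"
else
    echo "Docker is not reachable from this shell; skipping."
fi
```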

Core Docker Commands

    # ── Containers ──────────────────────────────────────────────────────────────    
    docker ps                    # list RUNNING containers only    
    docker ps -a                 # list ALL containers (running + stopped + exited)    
    docker run <image>           # create and run a container from an image    
    docker run -it ubuntu bash   # run interactively (-i) with a terminal (-t)    
    docker stop <container_id>   # gracefully stop a running container    
    docker rm <container_id>     # remove a stopped container    
    docker rm $(docker ps -aq)   # remove ALL containers (running ones are skipped with an error)

    # ── Images ──────────────────────────────────────────────────────────────────    
    docker images                # list all locally stored images    
    docker pull python:3.10      # download an image from Docker Hub without running it    
    docker rmi <image_id>        # remove a local image    
    docker rmi $(docker images -q) # remove ALL local images    

    # ── Building ─────────────────────────────────────────────────────────────────    
    docker build -t myapp .      # build an image from the Dockerfile in the current folder
                                 # -t = tag (name) for the image
                                 # .  = build context (current directory); Docker looks
                                 #      for a file named Dockerfile there by default

    # ── Running Interactively ────────────────────────────────────────────────────    
    docker run -it ubuntu bash               # run Ubuntu container with bash shell    
    docker exec -it <container_id> bash      # open a shell inside a RUNNING container    

    # ── System Info ──────────────────────────────────────────────────────────────    
    docker version               # show Docker client and server (daemon) version    
    docker info                  # detailed Docker system information

What's Next?

Understanding individual containers is the first step. However, in data engineering, projects rarely rely on just one service. A project could need a Python environment to run a script and a PostgreSQL database to store the data.
That means running two containers. Starting them separately by hand and linking their networks together can get complicated.
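To give a feel for the manual route, here is a sketch of wiring a PostgreSQL container and a Python container together by hand. The network name, container name, and password are placeholders invented for this example, and the Python step only checks that the database port is reachable rather than running a real pipeline.

```shell
#!/usr/bin/env bash
# Manually connect a Python container to a PostgreSQL container over a shared network.

network="etl-net"        # hypothetical network name
db_container="etl-db"    # hypothetical container name; also its hostname on the network
db_password="example"    # placeholder only; never hard-code real credentials

if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
    # A user-defined network lets containers find each other by name.
    docker network create "$network"

    # Start PostgreSQL 15 in the background on that network.
    docker run -d --name "$db_container" --network "$network" \
        -e POSTGRES_PASSWORD="$db_password" postgres:15

    # From a separate Python container, retry until the database port accepts connections.
    for attempt in 1 2 3 4 5 6 7 8 9 10; do
        if docker run --rm --network "$network" python:3.10-slim \
            python -c "import socket; socket.create_connection(('$db_container', 5432), timeout=2)" \
            2>/dev/null; then
            echo "Python container reached ${db_container}:5432 on attempt ${attempt}"
            break
        fi
        sleep 2
    done

    # Tear everything down again.
    docker rm -f "$db_container"
    docker network rm "$network"
else
    echo "Docker is not reachable from this shell; skipping."
fi
```

Even this small setup needs a network, a startup order, a readiness loop, and cleanup, all typed out by hand.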

In my next article, we are going to look at Docker Compose, a tool that lets us define and run multi-container applications using a single YAML file.
We will package an entire ETL pipeline so that anyone can spin up a database and execution script with just one command: docker compose up.