In your data engineering journey, you may have pipelines running locally inside your development environment, and they work beautifully on your machine.
But what happens if you want to hand over a project to a colleague, deploy it to a shared server managed by your team leader, or push it to a cloud provider?
Suddenly, a storm of errors appears, along with questions like:
- "You don't have PostgreSQL 15 installed locally?"
- "Your machine is running an older version of Python that doesn't support that syntax?"
- "Your operating system is missing the specific database drivers needed for psycopg2?"
This is called Dependency Hell. In data engineering, ensuring your pipelines run exactly the same way everywhere is just as important as writing the code itself. That is why we use Docker.
The Container Analogy:
Before standardized containers arrived in the 1950s, shipping goods across the world was incredibly messy. Workers had to manually load barrels of oil, sacks, and crates of electronics onto ships. Every item was a different shape and size, making loading slow, inefficient, and prone to accidents.
Then came the standardized shipping container.
It didn't matter what was inside the box, whether cars, clothes, or frozen food. Containers came in a few standard sizes with the same corner fittings, so they fit the ships, trains, and cranes built to handle them anywhere in the world.
Docker does exactly this for software. Instead of shipping raw Python scripts and text files, Docker lets you package your application code, dependencies, runtime, and configurations into a single, standardized box called a Container. If a machine can run Docker, it can run your container seamlessly, whether it’s a Windows laptop, a Mac, or a Linux server.
A common question beginners ask is: "Why not just use a Virtual Machine (VM) to isolate our code?"
While VMs provide isolation, they are quite heavy. Each VM runs an entire guest operating system (like a whole installation of Windows or Ubuntu), which takes up gigabytes of RAM and disk space, plus a share of the CPU, before your code even starts running.
Docker containers are lightweight. They don't include a whole operating system; instead, they share the host machine’s operating system kernel and only pack the bare essentials (your app code and libraries). This means a container can spin up in seconds rather than minutes, using a fraction of the system resources.
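One way to see the shared kernel for yourself, once Docker is installed (the install steps come later in this article), is to compare the kernel release reported by the host with the one reported inside a container. This is a small sketch, assuming a Linux or WSL host and the public alpine image from Docker Hub. On Docker Desktop for Mac or Windows, the container reports the kernel of Docker's helper Linux VM instead.

```shell
# Kernel release as seen by the host
host_kernel="$(uname -r)"
echo "host kernel:      $host_kernel"

# Kernel release as seen from inside a container.
# Guarded so the snippet is harmless if Docker is not installed yet.
if command -v docker >/dev/null 2>&1; then
    container_kernel="$(docker run --rm alpine uname -r)" \
        || container_kernel="unavailable (is the Docker daemon running?)"
    echo "container kernel: $container_kernel"
else
    echo "docker not found; install it first, then rerun"
fi
```

On a Linux host the two values should match, because the container has no kernel of its own.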
The Three Core Concepts in Docker
1. The Dockerfile
A text document containing step-by-step instructions on how to build a Docker environment.
You specify the base image, install your Python packages, and copy your scripts into it.
2. The Docker Image
When you run a build command on your Dockerfile, Docker executes the instructions and produces a Docker Image. This is a read-only blueprint of your environment.
It contains a snapshot of your libraries, code, and setup files, stored as layers. You can share this image on platforms like Docker Hub.
3. The Docker Container
When you run an image, Docker creates a Container from it. This is a running instance of the image, containing everything an application needs to run.
You can start it, stop it, write data inside it, and delete it when you're done.
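To make the three concepts concrete, here is a minimal sketch that goes from Dockerfile to image to container. The folder name demo_pipeline, the script pipeline.py, and the image tag demo-pipeline are all made up for illustration, and psycopg2-binary is just an example package; swap in whatever your pipeline needs. The build and run steps are skipped if Docker is not installed yet.

```shell
# Hypothetical scratch project; nothing outside this folder is touched
mkdir -p demo_pipeline

# A stand-in for a real pipeline script
cat > demo_pipeline/pipeline.py <<'EOF'
print("pipeline ran inside a container")
EOF

# 1. The Dockerfile: step-by-step build instructions
cat > demo_pipeline/Dockerfile <<'EOF'
# Base image: the official Python 3.10 slim image
FROM python:3.10-slim
# Working directory inside the image
WORKDIR /app
# Install a Python package (psycopg2-binary is only an example choice)
RUN pip install --no-cache-dir psycopg2-binary
# Copy the script into the image
COPY pipeline.py .
# Command the container runs when it starts
CMD ["python", "pipeline.py"]
EOF

# 2. Build the image, then 3. run a container from it (--rm deletes it on exit)
if command -v docker >/dev/null 2>&1; then
    docker build -t demo-pipeline demo_pipeline \
        && docker run --rm demo-pipeline
else
    echo "docker not found; install it first, then rerun the build and run steps"
fi
```

If the build succeeds, the container prints the message from pipeline.py and exits. The image stays on your machine, so you can run it again without rebuilding.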
Below is a guide to installing Docker on Ubuntu Linux, or on Windows if you have the Windows Subsystem for Linux (WSL) with an Ubuntu distribution.
-- Using Docker Engine
# Step 1: Update the system and install prerequisites
sudo apt update && sudo apt upgrade -y
sudo apt install -y ca-certificates curl
# Step 2: Add Docker's official GPG key and repository
sudo install -m 0755 -d /etc/apt/keyrings # create the keyring directory
# Download Docker's GPG key
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
-o /etc/apt/keyrings/docker.asc
# Set correct permissions on the key file
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add Docker's stable apt repository
echo "deb [arch=$(dpkg --print-architecture) \
signed-by=/etc/apt/keyrings/docker.asc] \
https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Step 3: Install Docker Engine (latest version)
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io \
docker-buildx-plugin docker-compose-plugin
# Step 4: Verify installation
docker --version
docker compose version
# Step 5: Check if Docker is running
sudo systemctl start docker # start the Docker daemon
sudo systemctl status docker --no-pager # verify Docker is running (--no-pager returns you to the prompt)
sudo systemctl enable docker # enable Docker to start on boot
With Docker Engine installed, you manage your images and containers from the terminal. If you use Docker Desktop instead, the same information is shown in a graphical dashboard.
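To confirm Docker is working, pull and run the hello-world image. The steps above do not add your user to the docker group, so prefix the command with sudo if you get a permission error:
docker run hello-world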
What happens behind the scenes:
- Docker checks locally for the 'hello-world' image
- Not found locally → pulls it from Docker Hub
- Creates a container from the image
- Runs it → prints the success message
- Container exits
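You can check those steps after the fact. The sketch below assumes you have already run hello-world once. It lists the image that was pulled, shows the exited container, and then runs the image a second time to show that the pull is skipped when the image is already local.

```shell
IMAGE=hello-world

if command -v docker >/dev/null 2>&1; then
    # The image that was pulled from Docker Hub is now stored locally
    docker images "$IMAGE"

    # The container already exited, so it is only listed with -a
    docker ps -a --filter "ancestor=$IMAGE"

    # A second run finds the image locally and skips the pull
    docker run --rm "$IMAGE"
else
    echo "docker not found; install it first"
fi
```

The --rm flag on the second run removes that container as soon as it exits, so it does not pile up in docker ps -a.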
Core Docker Commands
# ── Containers ──────────────────────────────────────────────────────────────
docker ps # list RUNNING containers only
docker ps -a # list ALL containers (running + stopped + exited)
docker run <image> # create and run a container from an image
docker run -it ubuntu bash # run interactively (-i) with a terminal (-t)
docker stop <container_id> # gracefully stop a running container
docker rm <container_id> # remove a stopped container
docker rm $(docker ps -aq) # try to remove ALL containers; running ones fail with an error (docker container prune removes only stopped ones)
# ── Images ──────────────────────────────────────────────────────────────────
docker images # list all locally stored images
docker pull python:3.10 # download an image from Docker Hub without running it
docker rmi <image_id> # remove a local image
docker rmi $(docker images -q) # remove ALL local images
# ── Building ─────────────────────────────────────────────────────────────────
docker build -t myapp . # build an image from Dockerfile in current folder
# -t = tag (name) for the image
# . = build context (current directory); Docker looks for the Dockerfile here by default
# ── Running Interactively ────────────────────────────────────────────────────
docker run -it ubuntu bash # run Ubuntu container with bash shell
docker exec -it <container_id> bash # open a shell inside a RUNNING container
# ── System Info ──────────────────────────────────────────────────────────────
docker version # show Docker client and server (daemon) version
docker info # detailed Docker system information
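The commands above make more sense when chained into one container lifecycle. Here is a rough walk-through, assuming the public alpine image and a made-up container name, demo-sleeper. The container just sleeps so there is something running to inspect.

```shell
NAME=demo-sleeper   # hypothetical container name

if command -v docker >/dev/null 2>&1; then
    # Start a container in the background (-d) that sleeps for 5 minutes
    docker run -d --name "$NAME" alpine sleep 300

    # It appears in the list of running containers
    docker ps --filter "name=$NAME"

    # Run a command inside the running container
    docker exec "$NAME" ls /

    # Stop it; Docker waits up to 10 seconds by default before killing the process
    docker stop "$NAME"

    # Stopped containers are only listed with -a
    docker ps -a --filter "name=$NAME"

    # Remove the stopped container
    docker rm "$NAME"
else
    echo "docker not found; install it first"
fi
```

Giving the container a name with --name means you can refer to it directly instead of copying its ID from docker ps.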
What's Next?
Understanding individual containers is the first step. However, in data engineering, projects rarely rely on just one thing. A project could need a Python environment to run a script and a PostgreSQL database to store the data.
This will require two containers to run. Running those as separate containers manually and trying to link their networks together can get complicated.
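To give a feel for the manual approach, here is a sketch of wiring two containers together by hand. The network name etl-net, the container name etl-db, and the password value are placeholders chosen for this example. The Python container only resolves the database's hostname to show that the two can find each other; a real pipeline would connect on port 5432.

```shell
NETWORK=etl-net     # hypothetical network name
DB_NAME=etl-db      # hypothetical name for the database container

if command -v docker >/dev/null 2>&1; then
    # 1. Create a user-defined network so containers can reach each other by name
    docker network create "$NETWORK"

    # 2. Start PostgreSQL on that network (the official image requires POSTGRES_PASSWORD)
    docker run -d --name "$DB_NAME" --network "$NETWORK" \
        -e POSTGRES_PASSWORD=example postgres:15

    # 3. Start a Python container on the same network and look up the database by name
    docker run --rm --network "$NETWORK" python:3.10-slim \
        python -c "import socket; print(socket.gethostbyname('$DB_NAME'))"

    # 4. Clean up the database container and the network
    docker rm -f "$DB_NAME"
    docker network rm "$NETWORK"
else
    echo "docker not found; install it first"
fi
```

That is four steps for two containers, and the startup order and cleanup are all on you.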
In my next article, we are going to look at Docker Compose — a tool that lets us define and run multi-container applications using a single YAML file.
We will package an entire ETL pipeline so that anyone can spin up a database and execution script with just one command: docker compose up.