Introduction
Imagine being able to package your entire data pipeline into a neat, portable box that runs anywhere without the dreaded “but it works on my machine” excuse. That’s the magic of Docker.
For data engineers, whose workflows often span multiple environments, including local machines, cloud servers, and clusters, Docker provides consistency, scalability, and speed. But before we dive in, let’s get an idea of what Docker is and its history.
Docker is a containerization (OS-level virtualization) tool first released in 2013 by Solomon Hykes at dotCloud, a Platform as a Service (PaaS) company. It quickly gained popularity for making containerization mainstream. Instead of running heavy, resource-consuming virtual machines, Docker introduced lightweight, portable containers. Fast forward to today, and Docker has become a cornerstone of modern data engineering.
Functions of Docker in Data Engineering
Below are the functions of Docker that make it indispensable to data engineers:
Portability – Package your pipelines and dependencies into a single container that runs anywhere.
Consistency – No more environment conflicts between dev, staging, and production.
Isolation – Each container runs independently, so one failing service won’t break the rest.
Efficiency – Containers use fewer resources compared to virtual machines.
Scalability – Run multiple containers simultaneously and scale your pipelines with ease.
Docker vs Virtual Machines (VMs)
The table below shows the main differences between a Virtual Machine and Docker.
Feature | Docker (Containers) | Virtual Machines (VMs) |
---|---|---|
Startup Time | Seconds | Minutes, as it has to boot a full OS |
Resource Usage | Low. Shares host OS kernel | High. Each VM runs its own OS |
Isolation | Process-level isolation | Full OS-level isolation |
Portability | Runs anywhere with Docker installed | Limited, requires hypervisor support |
Efficiency | High, lightweight and fast | Lower, more overhead |
Compatibility | Natively runs Linux containers; Windows and macOS need Docker Desktop | Can run many guest operating systems (Linux, Windows, etc.)
Virtualization | Virtualizes the application layer of the OS | Virtualizes the full OS, both the application layer and the kernel
QUICK NOTE: The kernel handles communication between the machine/hardware and the application layer; see the traditional OS architecture chart below.
Docker Compatibility, Its Limitations, and the Solution
Docker was originally built to run on Linux machines, so it does not run natively on Windows or macOS.
To solve this, Docker later introduced Docker Desktop (well after Docker’s debut at PyCon 2013), which adds a hypervisor layer with its own lightweight Linux kernel so that Windows and Mac machines can build and run containers as if they were on Linux. It bridges the gap seamlessly for developers and data engineers. On Windows, this is powered by WSL 2 (Windows Subsystem for Linux 2) or Hyper-V; on macOS, Docker Desktop uses a LinuxKit VM.
The chart below shows how the hypervisor layer helped Windows and Mac machines run Docker smoothly.
Docker Images and Docker Containers
A Docker image is a read-only template containing an application and all its necessary components, including the application code, libraries, dependencies, and configuration files. It serves as a blueprint for creating Docker containers, which are isolated, runnable instances of the application.
A Docker container is a lightweight, isolated, and executable software package that bundles an application and all its dependencies, including the runtime, libraries, and configuration files, into a single unit. It is a running instance of a Docker image.
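For example, here is a quick sketch of the difference in practice, using the official python image from Docker Hub:

docker pull python:3.10        # downloads the image (the read-only template)
docker run --rm python:3.10 python -c "print('hello from a container')"   # starts a container from that image
docker images                  # lists images stored locally
docker ps -a                   # lists containers (running and stopped)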
Docker and Docker Compose
Running one container is cool, but data engineers usually need multiple services for their pipelines to run effectively e.g., PostgreSQL + Airflow + Spark. That’s where Docker Compose comes in handy.
With a simple docker-compose.yml file, you can:
- Define multiple services, e.g., postgres, spark.
- Configure networks and volumes.
- Start everything with a single command, docker compose up.
Below is an example of a docker-compose.yml file that defines the services producer, trainer, fraud_consumer, and transaction_consumer:
services:
  producer:
    build: .
    command: python3 scripts/producer.py
    environment:
      - BOOTSTRAP_SERVERS=${BOOTSTRAP_SERVERS}
      - USERNAME=${USERNAME}
      - PASSWORD=${PASSWORD}
      - TOPIC=${TOPIC}

  transaction_consumer:
    build: .
    command: python3 scripts/transaction_consumer.py
    environment:
      - BOOTSTRAP_SERVERS=${BOOTSTRAP_SERVERS}
      - USERNAME=${USERNAME}
      - PASSWORD=${PASSWORD}
      - DB_URL=${DB_URL}
      - DB_PASSWORD=${DB_PASSWORD}
      - DB_USER=${DB_USER}

  fraud_consumer:
    build: .
    command: python3 scripts/fraud_consumer.py
    environment:
      - BOOTSTRAP_SERVERS=${BOOTSTRAP_SERVERS}
      - USERNAME=${USERNAME}
      - PASSWORD=${PASSWORD}
      - DB_URL=${DB_URL}
      - DB_PASSWORD=${DB_PASSWORD}
      - DB_USER=${DB_USER}

  trainer:
    build: .
    command: python3 scripts/model.py
    environment:
      - DB_URL=${DB_URL}
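Assuming the referenced environment variables are defined in a .env file next to the compose file, a typical workflow for this stack might look like the following sketch:

docker compose up -d --build      # build the image and start all four services in the background
docker compose ps                 # check that the services are running
docker compose logs -f producer   # follow the producer's logs
docker compose down               # stop and remove the containers when done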
Public and Private Docker Registries
A Docker Registry is where images are stored and shared.
- Public Registry: Docker Hub is the most popular public Docker registry in the world. You can pull official images like postgres, python, or spark using docker pull IMAGE_NAME:TAG, for example:
docker pull postgres:13
- Private Registry: Companies often host private registries for security and control. Example: AWS Elastic Container Registry (ECR), Google Container Registry (GCR), or self-hosted Harbor.
As a data engineer, you might pull public images from Docker Hub but pull/push your team’s custom images to a private registry.
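As a sketch, pushing a custom image to a private registry such as AWS ECR usually looks like this (the account ID, region, and repository below are hypothetical placeholders; substitute your own):

# authenticate Docker against the (hypothetical) ECR registry
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
# tag the local image with the registry address, then push it
docker tag my-python-app:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-python-app:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-python-app:latest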
Port Binding
Containers run in isolation. To make a service inside a container accessible from your machine, you bind container ports to host ports.
For example, PostgreSQL runs on port 5432. To access it from your laptop, run:
docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=secret --name my_postgres postgres   # the official image requires POSTGRES_PASSWORD to initialize
Here:
- The first 5432 is the host port (your machine).
- The second 5432 is the container port (inside the container).

This way, you can connect to localhost:5432 on your machine, and Docker routes the traffic into the container.
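For instance, assuming you have the psql client installed on your laptop, you can connect to the containerized database through the bound port:

psql -h localhost -p 5432 -U postgres   # prompts for the password set via POSTGRES_PASSWORD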
Building Custom Images Using a Dockerfile
Sometimes the official images aren’t enough; you’ll need to build your own image. That’s where a Dockerfile comes in.
The example below shows how to build a custom Python image with pandas installed.
# start from official Python image
FROM python:3.10
# set working directory
WORKDIR /app
# copy project files
COPY . /app
# install dependencies
RUN pip install pandas
# run the script
CMD ["python", "main.py"]
Then, to build the image and run a container from it:
docker build -t my-python-app .
docker run --rm my-python-app
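In real projects, you would typically pin dependencies in a requirements.txt file instead of installing them one by one; a minimal sketch of that variant (assuming a requirements.txt sits next to main.py) could look like:

FROM python:3.10
WORKDIR /app
# copy and install pinned dependencies first so Docker can cache this layer
COPY requirements.txt .
RUN pip install -r requirements.txt
# then copy the rest of the project
COPY . .
CMD ["python", "main.py"]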
To learn more about Docker commands, see the official Docker CLI reference.
Mini Project: PostgreSQL and pgAdmin4 User Interface with Docker and Docker Compose
Let's practice what we've learnt by building a simple Docker project. We’ll run a PostgreSQL database and pgAdmin, a web-based user interface that lets data engineers manage PostgreSQL from the browser.
a. Create a docker-compose.yml file

Let's define our services in our docker-compose.yml file. Since we are not building a custom Dockerfile in this mini project, we can pull images from Docker Hub using the image parameter inside the postgres and pgadmin services:
services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: demo_db
    ports:
      - "5432:5432"
    volumes:
      - pg_data:/var/lib/postgresql/data

  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@admin.com
      PGADMIN_DEFAULT_PASSWORD: admin
    ports:
      - "5000:80"

volumes:
  pg_data:
NOTE: Always place your environment variables inside a .env file to avoid leaking sensitive data. For this mini project there's no need to put the credentials in a .env file, but do so in future projects. To reference an environment variable inside a docker-compose.yml file:
services:
  postgres:
    image: postgres:14
    container_name: postgres_db
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
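Docker Compose automatically reads a .env file placed in the same directory as the docker-compose.yml. A matching .env file for the snippet above might look like this (example values only):

# .env - example values only; never commit real credentials
POSTGRES_USER=admin
POSTGRES_PASSWORD=secret
POSTGRES_DB=demo_db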
b. Start the services
To start our services, run the following command in the terminal:
docker compose up -d # -d runs the services in the background
Postgres will now be available on localhost:5432, and the pgAdmin UI will be available on localhost:5000.
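You can verify that both containers are up before moving on:

docker compose ps              # both services should show a "running" status
docker compose logs postgres   # inspect the startup logs if something looks off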
c. Access pgAdmin
You can now access your pgAdmin instance at localhost:5000.
Log in with the credentials we specified in our docker-compose.yml file: PGADMIN_DEFAULT_EMAIL and PGADMIN_DEFAULT_PASSWORD.
We can now also add a new server in pgAdmin with the following credentials:
- Host: postgres (the service name from Docker Compose)
- Username: admin
- Password: secret
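If you prefer the command line, you can also confirm the database is reachable from inside the container using the same credentials:

docker compose exec postgres psql -U admin -d demo_db -c "SELECT version();"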
Congratulations! You now have a working PostgreSQL database with a nice web UI, all running inside Docker containers.
Conclusion
For data engineers, Docker is like a Swiss Army knife: it simplifies workflows, ensures consistency, and allows rapid experimentation without breaking your system.
Whether you’re spinning up a PostgreSQL database, deploying Airflow pipelines, or running Spark clusters, Docker makes it smooth and repeatable. So next time you hear “it worked on my machine”, you’ll know Docker could’ve saved the day. Where did you first use Docker, and how has it helped you containerize your applications and pipelines? Please share in the comment section.
Please like, comment, share widely, and follow for more data engineering content! For collaboration on projects, please email me at denzelkinyua11@gmail.com or visit any of my social media platforms linked on my GitHub page.