Denzel Kanyeki

Docker for Data Engineers: The Complete Beginner’s Guide

Introduction

Imagine being able to package your entire data pipeline into a neat, portable box that runs anywhere without the dreaded “but it works on my machine” excuse. That’s the magic of Docker.

For data engineers, whose workflows often span multiple environments (local machines, cloud servers, and clusters), Docker provides consistency, scalability, and speed. But before we dive in, let's get an idea of what Docker is and its history.

Docker is a virtualization tool first released in 2013 by Solomon Hykes at dotCloud, a Platform as a Service (PaaS) company. It quickly gained popularity for making containerization mainstream: instead of running heavy, resource-consuming virtual machines, Docker introduced lightweight, portable containers. Fast forward to today, and Docker has become a cornerstone of modern data engineering.

Functions of Docker in Data Engineering

Below are the functions of Docker that make it indispensable to data engineers:

  • Portability – Package your pipelines and dependencies into a single container that runs anywhere.

  • Consistency – No more environment conflicts between dev, staging, and production.

  • Isolation – Each container runs independently, so one failing service won’t break the rest.

  • Efficiency – Containers use fewer resources compared to virtual machines.

  • Scalability – Run multiple containers simultaneously and scale your pipelines with ease.

Docker vs Virtual Machines (VMs)

The table below shows the main differences between a Virtual Machine and Docker.

Feature | Docker (Containers) | Virtual Machines (VMs)
------- | ------------------- | ----------------------
Startup Time | Seconds | Minutes, as a VM has to boot a full OS
Resource Usage | Low; shares the host OS kernel | High; each VM runs its own OS
Isolation | Process-level isolation | Full OS-level isolation
Portability | Runs anywhere Docker is installed | Limited; requires hypervisor support
Efficiency | High; lightweight and fast | Lower; more overhead
Compatibility | Runs Linux containers natively (e.g. Ubuntu, Arch Linux); Windows and macOS need Docker Desktop | Supports multiple operating systems (OS)
Virtualization | Virtualizes the OS application layer | Virtualizes both the OS application and kernel layers

QUICK NOTE: The kernel allows communication between the machine/hardware and the application layer; see the Traditional OS Architecture chart below.

Traditional OS Architecture

Docker Compatibility: Its Limitations and the Solution

Docker was initially built to run on Linux machines, so it does not run natively on Windows or macOS.

Incompatibility Architecture diagram

To sort this issue out, Docker Desktop was introduced much later after Docker's initial release at PyCon 2013. It adds a hypervisor layer (its own lightweight Linux kernel running in a VM) that allows Windows and Mac machines to build and run containers as if they were on Linux, bridging the gap seamlessly for developers and data engineers. On Windows, this is powered by WSL2 (Windows Subsystem for Linux 2) or Hyper-V; on macOS, Docker Desktop uses a LinuxKit VM.

The chart below shows how the hypervisor layer helped Windows and Mac machines run Docker smoothly.

Compatibility diagram
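
If you ever want to confirm which backend you are running on, the Docker CLI can report the operating system of the Docker daemon. This is just a quick sanity check (a minimal sketch, assuming Docker or Docker Desktop is already installed):

# Report the OS the Docker daemon runs on.
# On Windows and macOS with Docker Desktop this still prints "linux",
# because the daemon lives inside the lightweight Linux VM described above.
docker info --format '{{.OSType}}'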

Docker Images and Docker Containers

A Docker image is a read-only template containing an application and all its necessary components, including the application code, libraries, dependencies, and configuration files. It serves as a blueprint for creating Docker containers, which are isolated, runnable instances of the application.

Docker image

A Docker container is a lightweight, isolated, and executable software package that bundles an application and all its dependencies, including the runtime, libraries, and configuration files, into a single unit. It is a running instance of a Docker image.

Docker containers
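
To make the distinction concrete, here is a short sequence of commands (a minimal sketch, assuming Docker is installed) that pulls an image and then starts containers from it:

# pull the image (the read-only template)
docker pull python:3.10

# list the images stored locally
docker images

# run a container (a running instance of the image) and remove it when it exits
docker run --rm python:3.10 python -c "print('hello from a container')"

# list running and stopped containers
docker ps -a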

Docker and Docker Compose

Running one container is cool, but data engineers usually need multiple services running together for their pipelines to work effectively, e.g. PostgreSQL + Airflow + Spark. That's where Docker Compose comes in handy.

With a simple docker-compose.yml file, you can:

  • Define multiple services e.g., postgres, spark.

  • Configure networks and volumes.

  • Start everything with a single command: docker compose up.

Below is an example of a docker-compose.yml file with four services defined: producer, transaction_consumer, fraud_consumer, and trainer:

services:
    producer:
        build: .
        command: python3 scripts/producer.py
        environment:
            - BOOTSTRAP_SERVERS=${BOOTSTRAP_SERVERS}
            - USERNAME=${USERNAME}
            - PASSWORD=${PASSWORD}
            - TOPIC=${TOPIC}

    transaction_consumer:
        build: .
        command: python3 scripts/transaction_consumer.py
        environment:
            - BOOTSTRAP_SERVERS=${BOOTSTRAP_SERVERS}
            - USERNAME=${USERNAME}
            - PASSWORD=${PASSWORD}
            - DB_URL=${DB_URL}
            - DB_PASSWORD=${DB_PASSWORD}
            - DB_USER=${DB_USER}

    fraud_consumer:
        build: .
        command: python3 scripts/fraud_consumer.py
        environment:
            - BOOTSTRAP_SERVERS=${BOOTSTRAP_SERVERS}
            - USERNAME=${USERNAME}
            - PASSWORD=${PASSWORD}
            - DB_URL=${DB_URL}
            - DB_PASSWORD=${DB_PASSWORD}
            - DB_USER=${DB_USER}

    trainer:
        build: .
        command: python3 scripts/model.py
        environment:
            - DB_URL=${DB_URL}

Public and Private Docker Registries

A Docker Registry is where images are stored and shared.

  • Public Registry: Docker Hub is the most popular public Docker registry in the world. You can pull official images like postgres, python, or spark using docker pull IMAGE_NAME:TAG.
  docker pull postgres:13
  • Private Registry: Companies often host private registries for security and control. Example: AWS Elastic Container Registry (ECR), Google Container Registry (GCR), or self-hosted Harbor.

As a data engineer, you might pull public images from Docker Hub but pull/push your team’s custom images to a private registry.
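
For example, pushing a custom image to a private registry usually means logging in, tagging the image with the registry's address, and then pushing it. The sketch below uses AWS ECR; the account ID, region, and repository name are placeholders you would replace with your own:

# authenticate Docker against the private registry (ECR example)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# tag the local image with the registry address, then push it
docker tag my-python-app:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-python-app:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-python-app:latest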

Port Binding

Containers run in isolation. To make a service inside a container accessible from your machine, you bind container ports to host ports.

For example, PostgreSQL runs on port 5432. To access it from your laptop, run:

# the official postgres image requires POSTGRES_PASSWORD to be set
docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=secret --name my_postgres postgres

Here:

  • The first 5432 is the host port (your machine).

  • The second 5432 is the container port (inside the container).

This way, you can connect to localhost:5432 on your machine, and Docker routes traffic into the container.
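
With the port bound, any client on your laptop can reach the database. For example, assuming the psql client is installed locally and using the password set in the docker run command above, you could connect with:

psql -h localhost -p 5432 -U postgres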

Building custom images using Dockerfile

Sometimes the official images aren’t enough; you’ll need to build your own image. That’s where a Dockerfile comes in.

The example below shows a custom Python image with pandas installed.

# start from official Python image
FROM python:3.10

# set working directory
WORKDIR /app

# copy project files
COPY . /app

# install dependencies
RUN pip install pandas

# run the script
CMD ["python", "main.py"]
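
The Dockerfile above copies the project into the image and runs main.py, so the build needs a main.py next to it. Here is a minimal, hypothetical main.py just to prove the image and its pandas dependency work; replace it with your own pipeline code:

# main.py - tiny placeholder script that exercises the pandas dependency
import pandas as pd

# build a small DataFrame and print a summary to confirm the container runs
df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.5, 20.0, 7.25]})
print(df.describe())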

Then, to build an image from the Dockerfile and run a container from it:

docker build -t my-python-app .

docker run --rm my-python-app

To learn more about Docker commands, visit this page.

Mini Project: PostgreSQL and pgAdmin4 User Interface with Docker and Docker Compose

Let's practice what we've learnt by building a simple Docker project. We'll run a PostgreSQL database together with pgAdmin, a web-based user interface that lets data engineers manage PostgreSQL from the browser.

a. Create a docker-compose.yml file:

Let's define our services in our docker-compose.yml file. Since we are not building a custom Dockerfile in this mini project, we pull ready-made images from Docker Hub using the image key inside the postgres and pgadmin services:

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: demo_db
    ports:
      - "5432:5432"
    volumes:
      - pg_data:/var/lib/postgresql/data

  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@admin.com
      PGADMIN_DEFAULT_PASSWORD: admin
    ports:
      - "5000:80"

volumes:
  pg_data:

NOTE: Always place your environment variables inside a .env file to avoid leaking sensitive data. For this mini project we hard-code the credentials for simplicity, but do use a .env file in future projects. To reference an environment variable inside a docker-compose.yml file:

services:
    postgres:
        image: postgres:14
        container_name: postgres_db
        environment:
           - POSTGRES_USER=${POSTGRES_USER}
           - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
           - POSTGRES_DB=${POSTGRES_DB}
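
The variables referenced above live in a .env file placed next to docker-compose.yml, which Docker Compose loads automatically. The values below are placeholders for illustration:

# .env (never commit this file to version control)
POSTGRES_USER=admin
POSTGRES_PASSWORD=secret
POSTGRES_DB=demo_db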

b. Start the services

To start our services, run the following command in the terminal:

docker compose up -d # -d runs the services in the background

Postgres will now be available on localhost:5432 and the pgAdmin UI on localhost:5000.
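
Before moving on, you can confirm that both containers came up correctly, for example:

# list the services from docker-compose.yml and their status
docker compose ps

# follow the logs of the postgres service
docker compose logs -f postgres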

c. Access pgAdmin

You can now access your pgAdmin instance on localhost:5000.
Let's log in with the credentials we specified in our docker-compose.yml file, which are PGADMIN_DEFAULT_EMAIL and PGADMIN_DEFAULT_PASSWORD.

pgAdmin dashboard

We can now also add a new server in pgAdmin with the following credentials:

  • Host: postgres, the service name from Docker Compose
  • Username: admin
  • Password: secret

psql database
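
If you prefer the terminal to the pgAdmin UI, you can also open psql directly inside the running container to confirm the database is up, using the service name and credentials from our docker-compose.yml:

docker compose exec postgres psql -U admin -d demo_db -c "\conninfo"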

Congratulations! You now have a working PostgreSQL database with a nice web UI, all running inside Docker containers.

Conclusion

For data engineers, Docker is like a Swiss Army knife: it simplifies workflows, ensures consistency, and allows rapid experimentation without breaking your system.

Whether you’re spinning up a PostgreSQL database, deploying Airflow pipelines, or running Spark clusters, Docker makes it smooth and repeatable. So next time you hear “it worked on my machine”, you’ll know Docker could’ve saved the day. Where did you first use Docker, and how has it helped you containerize your applications and pipelines? Please share in the comments section.

Please like, comment, share widely, and follow for more data engineering content! For collaboration on projects, please email me at denzelkinyua11@gmail.com or visit any of my social media platforms linked on my GitHub page.
