Lucy

How to Set Up Local Data Engineering Environments with Docker Compose

TL;DR: Docker Compose lets you spin up a full local data stack — Airflow, PostgreSQL, Spark, Redis — with a single YAML file and one command. This guide walks you through the exact setup, real compose configs, and the mistakes most engineers make along the way.


Why Your Local Data Environment Is Probably a Mess

Here's the thing: most data engineers I know have a local setup that technically works — but only on their machine, on a good day, when the stars align.

You install PostgreSQL manually. Pin a Python version. Struggle to get Airflow running without breaking something else. And then a new teammate joins and spends three days just trying to reproduce your environment.

That's not an engineering problem. It's a tooling problem. And Docker Compose solves it.

[Figure: Manual installs vs Docker Compose workflow]

Docker Compose lets you describe your entire local data stack as code — services, networks, volumes, environment variables — and spin it up or tear it down with one command. No more "works on my machine." No more three-day onboarding nightmares.

This guide covers the full picture: what Docker Compose actually is (and what it's not), the building blocks you need to understand, and a production-quality example stack with Airflow, PostgreSQL, Redis, and Spark.


What Is Docker Compose (and What It's Not)

Docker Compose is an orchestration tool for defining and running multi-container Docker applications on a single machine. You write a compose.yaml file that describes each service — what image it uses, what ports it exposes, how it connects to other services, and where it stores data.

A quick note before we go further: Docker Compose v1 reached end-of-life in July 2023, and the old docker-compose binary (with the hyphen) no longer receives updates. You should be using Docker Compose v2, which ships as a Docker CLI plugin. If you see docker-compose anywhere in your scripts or tutorials, replace it with docker compose (space, no hyphen).

Also worth knowing: the top-level version: field is now obsolete. Compose v2 ignores it and prints a warning, so drop it entirely from any new file you write.
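If you have older compose files lying around, removing the field is a small cleanup. Here's a minimal Python sketch that drops a top-level version: line from compose text and leaves nested keys alone. The helper name strip_version_field is my own, not part of any Docker tooling, and it uses a plain regex rather than a YAML parser, so treat it as a starting point.

```python
import re


def strip_version_field(compose_text: str) -> str:
    """Remove a top-level `version:` line from Compose YAML text."""
    # Match only at column 0 so nested keys that happen to be named `version` survive.
    return re.sub(r"(?m)^version:[^\n]*\n?", "", compose_text)


# Illustrative input: an old-style file with the obsolete field on top.
sample = 'version: "3.8"\nservices:\n  db:\n    image: postgres:16\n'
print(strip_version_field(sample))
```

Run it over a copy of the file first and diff the result before overwriting anything.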

Docker Compose is NOT:

  • A replacement for Kubernetes in production
  • A tool for managing distributed multi-machine deployments
  • A substitute for proper secrets management in prod

But for local development, CI pipelines, and single-machine staging? It's hard to beat.

Sources: Docker official documentation — docs.docker.com/compose; freeCodeCamp Docker Compose v2 guide (2026); Docker Compose specification at compose-spec.io


Prerequisites

Before we write a single line of YAML, make sure you have:

  • Docker Desktop (v4.0+) or Docker Engine + docker-compose-plugin on Linux
  • At least 8GB RAM available (data stacks eat memory)
  • 4 CPU cores minimum — Spark in particular needs headroom
  • Basic familiarity with the command line

Run this to confirm your setup is current:

docker compose version
# Should show v2.24 or later in 2026

If that command fails or shows v1.x, update Docker before continuing.


Understanding the Core Building Blocks

Before jumping into the full stack, you need a mental model of the four things Docker Compose actually manages.

[Figure: Docker Compose core concepts infographic]

Services

A service is a running container. Each entry under services: in your compose file becomes one or more containers. For a data engineering stack, your services are things like your database, your orchestrator, your message broker, your transformation tool.

Networks

By default, every service in a compose file can talk to every other service using the service name as the hostname. No IP addresses. No manual DNS. This is one of the most underrated features — your Airflow scheduler connects to Postgres by literally using postgres as the hostname.
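To see where that hostname shows up in practice, here's a small sketch that pulls the host and port out of connection URLs shaped like the ones this guide uses later. The endpoint helper, the placeholder credentials, and the default ports passed in are illustrative assumptions. The point is only that the host part of the URL is the Compose service name.

```python
from urllib.parse import urlsplit


def endpoint(url: str, default_port: int) -> tuple:
    """Return (host, port) for a connection URL, falling back to default_port."""
    parts = urlsplit(url)
    return parts.hostname, parts.port or default_port


# Example URLs in the same shape as the compose file in this guide.
urls = {
    "postgresql+psycopg2://dataeng:changeme_local@postgres/warehouse": 5432,
    "redis://:redis_local_pass@redis:6379/0": 6379,
}

for url, default_port in urls.items():
    host, port = endpoint(url, default_port)
    print(f"{host}:{port}")
```

Those names only resolve from inside the Compose network. From your host machine you would use localhost and the published port instead.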

Volumes

Volumes are how your data survives container restarts. There are two flavors: named volumes (managed by Docker, recommended for databases) and bind mounts (a folder on your host machine mounted into the container, useful for DAGs, scripts, and code you're actively editing).

Environment Variables

Never hardcode credentials in your compose file. Always use a .env file and reference variables with ${VARIABLE_NAME} syntax. Your .env file stays out of version control. Your compose.yaml doesn't.
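If the ${VARIABLE_NAME} substitution feels like magic, this rough Python model may help. It parses .env-style text and fills the placeholders into a template string. It is a simplification I wrote for illustration: the parse_dotenv helper is hypothetical, and real Compose interpolation supports more than this (default values, quoting rules), so don't treat it as a reimplementation.

```python
from string import Template


def parse_dotenv(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# Placeholder values, same shape as the .env file used in this guide.
dotenv = """# .env
POSTGRES_USER=dataeng
POSTGRES_PASSWORD=changeme_local
POSTGRES_DB=warehouse
"""

template = "postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres/${POSTGRES_DB}"

# substitute() raises KeyError if a referenced variable is missing from the mapping.
resolved = Template(template).substitute(parse_dotenv(dotenv))
print(resolved)
```

The strictness is deliberate: failing loudly on a missing variable is usually better than silently connecting with an empty password.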


Building the Stack: A Real Data Engineering Environment

Let's build something real. This stack covers the tools that appear in most data engineering workflows:

| Service | Role | Port |
| --- | --- | --- |
| PostgreSQL | Metadata DB + data warehouse | 5432 |
| Redis | Message broker for Celery tasks | 6379 |
| Apache Airflow | Workflow orchestration | 8080 |
| Apache Spark | Distributed data processing | 4040, 7077 |
| Adminer | Lightweight DB GUI | 8085 |

[Figure: Data engineering stack and flowchart]

Step 1 — Project Structure

Start with a clean folder structure. This matters more than most tutorials admit — messy folders create tangled volume mounts and confusing build contexts.

data-eng-local/
├── compose.yaml
├── .env
├── .env.example
├── airflow/
│   ├── dags/
│   ├── logs/
│   ├── plugins/
│   └── config/
├── spark/
│   └── jobs/
├── postgres/
│   └── init/
│       └── 01_create_schemas.sql
└── README.md

Two rules: put your compose.yaml at the root, and never commit .env to git. Add it to .gitignore now, before you forget.
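If you'd rather script the layout than create folders by hand, something like this works. It's a sketch under my own assumptions: the scaffold function is hypothetical, it creates empty placeholder files, and it appends .env to .gitignore only when that exact line is missing.

```python
from pathlib import Path

# Directory and file names taken from the tree shown above.
DIRS = [
    "airflow/dags",
    "airflow/logs",
    "airflow/plugins",
    "airflow/config",
    "spark/jobs",
    "postgres/init",
]
FILES = [
    "compose.yaml",
    ".env",
    ".env.example",
    "README.md",
    "postgres/init/01_create_schemas.sql",
]


def scaffold(root: Path) -> None:
    """Create the project skeleton under root; safe to run more than once."""
    for directory in DIRS:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)

    # Make sure .env is ignored by git without duplicating the entry.
    gitignore = root / ".gitignore"
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    if ".env" not in existing:
        gitignore.write_text("\n".join(existing + [".env"]) + "\n")
```

Call scaffold(Path("data-eng-local")) from wherever you want the project to live. Existing files are left untouched.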

Step 2 — The .env File

# .env — DO NOT commit to version control
POSTGRES_USER=dataeng
POSTGRES_PASSWORD=changeme_local
POSTGRES_DB=warehouse

AIRFLOW_UID=50000
AIRFLOW__CORE__FERNET_KEY=your_fernet_key_here
AIRFLOW__WEBSERVER__SECRET_KEY=your_secret_key_here

REDIS_PASSWORD=redis_local_pass

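Generate a Fernet key with the command below. It needs the cryptography package, so run pip install cryptography first if the import fails: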

python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
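The .env file also has a placeholder for AIRFLOW__WEBSERVER__SECRET_KEY. One way to fill it, assuming any sufficiently random string is acceptable there, is Python's standard secrets module. The make_secret_key name and the 32-byte length are my choices for this sketch, not an Airflow requirement.

```python
import secrets


def make_secret_key(num_bytes: int = 32) -> str:
    """Return a random hex string suitable for pasting into .env."""
    return secrets.token_hex(num_bytes)


print(make_secret_key())
```

Generate it once and keep it stable. Rotating the value between restarts will invalidate existing sessions.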

Step 3 — The compose.yaml File

Here's the full configuration. Read the inline comments — they explain the decisions, not just the syntax.

# compose.yaml — No version field needed (deprecated in Compose v2)

x-airflow-common: &airflow-common
  image: apache/airflow:3.0.4
  user: "${AIRFLOW_UID}:0"                    # Run as this UID so bind-mounted folders stay writable
  environment: &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    # Airflow 3 defaults to SimpleAuthManager; `airflow users create` below needs the FAB auth manager
    AIRFLOW__CORE__AUTH_MANAGER: airflow.providers.fab.auth_manager.fab_auth_manager.FabAuthManager
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres/${POSTGRES_DB}
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres/${POSTGRES_DB}
    AIRFLOW__CELERY__BROKER_URL: redis://:${REDIS_PASSWORD}@redis:6379/0
    # Airflow 3 tasks report to the API server's execution API instead of writing to the DB directly
    AIRFLOW__CORE__EXECUTION_API_SERVER_URL: http://airflow-webserver:8080/execution/
    AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW__CORE__FERNET_KEY}
    AIRFLOW__WEBSERVER__SECRET_KEY: ${AIRFLOW__WEBSERVER__SECRET_KEY}
    # Assumption: reuse that secret as the JWT signing key so all components share one value
    AIRFLOW__API_AUTH__JWT_SECRET: ${AIRFLOW__WEBSERVER__SECRET_KEY}
  env_file:
    - .env
  volumes:
    - ./airflow/dags:/opt/airflow/dags        # Bind mount: edit DAGs without rebuilding
    - ./airflow/logs:/opt/airflow/logs
    - ./airflow/plugins:/opt/airflow/plugins
    - ./airflow/config:/opt/airflow/config
  depends_on:
    postgres:
      condition: service_healthy              # Wait for postgres to be ready, not just started
    redis:
      condition: service_healthy

services:
  # ─── DATABASE ──────────────────────────────────────────────────────────────
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data          # Named volume: data survives restarts
      - ./postgres/init:/docker-entrypoint-initdb.d     # SQL files run on first startup
    healthcheck:
      # -d avoids log noise about a missing database named after the user
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  # ─── MESSAGE BROKER ────────────────────────────────────────────────────────
  redis:
    image: redis:7-alpine                    # Alpine = smaller image, same functionality
    command: redis-server --requirepass ${REDIS_PASSWORD}
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  # ─── AIRFLOW ───────────────────────────────────────────────────────────────
  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    command:
      - -c
      - |
        airflow db migrate &&
        airflow users create \
          --username admin \
          --password admin \
          --firstname Admin \
          --lastname User \
          --role Admin \
          --email admin@example.com
    restart: "no"                            # Run once and exit; not a long-running service

  airflow-webserver:
    <<: *airflow-common
    command: api-server                      # Airflow 3 replaced the `webserver` command with `api-server`
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/api/v2/version"]
      interval: 30s
      timeout: 10s
      retries: 5
    restart: unless-stopped

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    restart: unless-stopped

  airflow-dag-processor:
    <<: *airflow-common
    command: dag-processor                   # Airflow 3 parses DAGs in a separate, required process
    restart: unless-stopped

  airflow-worker:
    <<: *airflow-common
    command: celery worker
    restart: unless-stopped

  # ─── SPARK ─────────────────────────────────────────────────────────────────
  spark-master:
    image: bitnami/spark:3.5
    environment:
      - SPARK_MODE=master
    ports:
      - "4040:4040"                          # Spark application UI (live only while a job's driver runs here)
      - "7077:7077"                          # Spark master port
    volumes:
      - ./spark/jobs:/opt/spark-jobs
    restart: unless-stopped

  spark-worker:
    image: bitnami/spark:3.5
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
    depends_on:
      - spark-master
    restart: unless-stopped

  # ─── DB GUI ────────────────────────────────────────────────────────────────
  adminer:
    image: adminer:4.8.1                     # Pinned, in line with the advice on image tags
    ports:
      - "8085:8080"
    depends_on:
      - postgres
    restart: unless-stopped

# ─── VOLUMES ───────────────────────────────────────────────────────────────
volumes:
  postgres_data:        # Docker-managed, persists across down/up cycles
  redis_data:

Step 4 — Start the Stack

# First time — initialize Airflow's database and create the admin user
docker compose up airflow-init

# Then start everything else in detached mode
docker compose up -d

# Watch it come up
docker compose ps

# Stream logs for a specific service (useful for debugging)
docker compose logs -f airflow-scheduler

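That's it. Once everything is green, your services are at:

  • Airflow UI: http://localhost:8080 (log in with admin / admin, the user created by airflow-init)
  • Adminer: http://localhost:8085
  • PostgreSQL: localhost:5432
  • Redis: localhost:6379
  • Spark master: spark://localhost:7077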
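To check "everything is green" from a script instead of eyeballing the table, you can parse the output of docker compose ps --format json. The sketch below assumes each container object carries Service, State, and Health keys, and that the output is either a JSON array or one object per line depending on your Compose version. Verify both against your own output before relying on it. The sample text is made up for illustration.

```python
import json


def parse_ps_json(output: str) -> list:
    """Parse `docker compose ps --format json` output (array or one object per line)."""
    text = output.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def not_ready(containers: list) -> list:
    """Return service names that are not running, or running but not yet healthy."""
    pending = []
    for container in containers:
        state = container.get("State", "")
        health = container.get("Health", "")
        # An empty Health value means the service defines no health check.
        if state != "running" or health not in ("", "healthy"):
            pending.append(container.get("Service", "?"))
    return pending


# Made-up sample in the one-object-per-line shape.
SAMPLE = (
    '{"Service": "postgres", "State": "running", "Health": "healthy"}\n'
    '{"Service": "airflow-worker", "State": "running", "Health": "starting"}\n'
)
print(not_ready(parse_ps_json(SAMPLE)))
```

In real use you would feed it the stdout of subprocess.run(["docker", "compose", "ps", "--format", "json"], capture_output=True, text=True).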


Health Checks: The Feature Most People Skip

This is the part most tutorials skip — and it's genuinely important.

Without health checks, depends_on is almost useless. By default, Docker considers a container "started" the moment the process launches, not when the service inside it is actually ready to accept connections. PostgreSQL needs a few seconds to initialize. Redis needs a moment to come up. If Airflow tries to connect before they're ready, it crashes — and you end up with a confusing pile of restart loops.

The condition: service_healthy syntax in depends_on fixes this. It tells Docker: don't start this service until that other service's health check is passing. Pair it with a proper healthcheck block on the dependency, and your stack starts in the right order every time.

# This is how you do it properly
depends_on:
  postgres:
    condition: service_healthy  # ← This is the key

Without this, you're relying on timing. Timing is not a strategy.
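For scripts that run outside Compose (a test runner on your host, say), the same idea can be approximated by polling until a port accepts connections. This is a minimal sketch with a hypothetical wait_for_port helper. Its retries and interval arguments mirror the healthcheck settings above, but a successful TCP connect is a weaker signal than pg_isready or a Redis ping, so treat it as a fallback.

```python
import socket
import time


def wait_for_port(host: str, port: int, retries: int = 5,
                  interval: float = 1.0, timeout: float = 2.0) -> bool:
    """Return True once host:port accepts a TCP connection, False after all retries fail."""
    for attempt in range(1, retries + 1):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Connection refused, timed out, or name not resolvable yet.
            if attempt < retries:
                time.sleep(interval)
    return False
```

For example, wait_for_port("localhost", 5432) before running integration tests against the published Postgres port.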


Common Mistakes and How to Avoid Them

Hardcoding credentials in compose.yaml — Don't. Use .env files. It takes 30 seconds to set up and prevents you from accidentally committing passwords to a public repo. It happens more than you'd think.

Using docker-compose (v1) instead of docker compose (v2) — The old binary is dead. If you're copying configs from tutorials older than mid-2023, check for this.

Including version: in your compose file — This field is obsolete as of Docker Compose v2 and now triggers deprecation warnings in Docker Desktop. Remove it from any file you write or maintain.

Not pinning image versions: a floating tag like postgres:latest is a trap. One upgrade and your init SQL might fail, your connection string might change, your extension might not exist. Pin to postgres:16. Always.

Forgetting to add .env to .gitignore — Seriously. Do this first.

Running docker compose down -v when you meant docker compose down — The -v flag deletes named volumes. That means your database data is gone. There's no undo. Be very intentional with that flag.
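The image-pinning mistake is easy to catch automatically. Here's a rough lint sketch that scans compose text for image: lines with no tag or with :latest. The unpinned_images function is hypothetical and works on raw text with a regex, so it won't understand variable substitution or anchors. It's meant as a pre-commit sanity check, not a full validator.

```python
import re

# Capture the image reference on each `image:` line, ignoring quotes and trailing comments.
IMAGE_LINE = re.compile(r"^\s*image:\s*['\"]?([^\s'\"#]+)", re.MULTILINE)


def unpinned_images(compose_text: str) -> list:
    """Return image references that use no tag or the `latest` tag."""
    flagged = []
    for ref in IMAGE_LINE.findall(compose_text):
        if "@" in ref:
            continue  # Pinned by digest, which is stricter than a tag.
        # Look for the tag after the last slash so a registry port is not mistaken for one.
        name_part = ref.rsplit("/", 1)[-1]
        tag = name_part.partition(":")[2]
        if tag in ("", "latest"):
            flagged.append(ref)
    return flagged


# Illustrative compose fragment.
sample = """
services:
  db:
    image: postgres:16
  cache:
    image: redis
  gui:
    image: "adminer:latest"
"""
print(unpinned_images(sample))
```

Wire it into CI by exiting non-zero when the returned list is not empty.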


Working With Your Stack Day-to-Day

Once it's running, these are the commands you'll reach for most:

# Check what's running and the health status
docker compose ps

# Tail logs from everything
docker compose logs -f

# Tail logs from one service only
docker compose logs -f airflow-worker

# Restart a single service without touching the rest
docker compose restart airflow-scheduler

# Run a one-off command inside a running container
docker compose exec postgres psql -U dataeng -d warehouse

# Open a shell in a container for debugging
docker compose exec airflow-webserver bash

# Stop everything (preserves volumes — safe)
docker compose down

# Nuclear option — stops everything AND deletes all volumes
docker compose down -v

[Figure: Docker Compose developer workflow diagram]


Managing Multiple Environments with Profiles

Here's something worth knowing once your stack gets more complex: Docker Compose Profiles let you define services that only start in certain contexts. You tag a service with profiles: [dev] or profiles: [monitoring] and it only runs when you explicitly request that profile.

services:
  # This only starts when you run: docker compose --profile monitoring up
  prometheus:
    image: prom/prometheus:v2.53.0   # Pinned rather than latest, for the same reason as postgres
    profiles:
      - monitoring
    ports:
      - "9090:9090"

This is how you keep a single compose.yaml that works for all environments — local dev, CI, staging — without maintaining multiple files. It's one of the features that quietly makes Compose much more production-capable than its reputation suggests.
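If you start the stack from a task runner or test harness, it can help to build the command in one place. This sketch assembles the docker compose argument list with one --profile flag per requested profile. The compose_up_command helper is my own invention. The flags it emits (--profile, up, -d) are standard Compose CLI options.

```python
def compose_up_command(profiles=(), detach=True) -> list:
    """Build the argv for `docker compose up`, enabling the given profiles."""
    cmd = ["docker", "compose"]
    for profile in profiles:
        # --profile is a top-level option, so it goes before the `up` subcommand.
        cmd += ["--profile", profile]
    cmd.append("up")
    if detach:
        cmd.append("-d")
    return cmd


print(" ".join(compose_up_command(["monitoring"])))
```

Pass the list to subprocess.run(cmd, check=True) to execute it. Keeping it as a list avoids shell-quoting problems.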


E-E-A-T Reference: Tools, Versions, and Sources

Here's a quick reference table with the verified tool versions used in this guide — and where to find official documentation.

| Tool | Version Used | Official Docs |
| --- | --- | --- |
| Docker Compose | v2.24+ | docs.docker.com/compose |
| Apache Airflow | 3.0.4 | airflow.apache.org |
| PostgreSQL | 16 | postgresql.org/docs |
| Redis | 7 (Alpine) | redis.io/docs |
| Apache Spark | 3.5 (Bitnami) | spark.apache.org/docs |

Further reading:


Wrapping Up

Setting up a reliable local data engineering environment used to be a day-long exercise in frustration. With Docker Compose, it's a one-time investment: write the compose.yaml once, commit it to your repo, and everyone on your team gets the exact same environment with docker compose up.

A few things to remember as you go:

  • Drop the version: field — it's deprecated
  • Always use condition: service_healthy in depends_on
  • Pin your image versions, not latest
  • Keep credentials in .env, never in the compose file
  • Name your volumes — data you can't recover isn't worth much

The config in this guide is a starting point. As your stack grows, you'll want to look at override files (compose.override.yaml) for environment-specific tweaks, and Compose profiles for toggling monitoring tools, debug containers, or test databases without polluting your main setup.

The whole point is reproducibility. When a teammate clones your repo and runs docker compose up, they should get exactly what you have. That's not a nice-to-have — it's the baseline for any serious data engineering workflow.


Need a Data Engineer Who Already Knows This Stack?

Honestly, this is where a lot of projects stall. The setup is one thing — building production-grade pipelines on top of it, handling schema drift, optimizing Spark jobs, writing reliable Airflow DAGs that don't silently fail at 2am — that's the real work.

If your team is scaling and you need someone who has done this before (not just read about it), that's what we do.

Lucent Innovation helps product teams and growing companies hire experienced data engineers — people who can own the full stack, from local containerized environments to cloud-scale data infrastructure on AWS, GCP, or Azure.

Whether you need to augment your current team with a specialist or are looking to build a data engineering function from scratch, we can help you find and place the right person fast — usually within 2–3 weeks.

💬 Talk to us about hiring a data engineer →

No long intake forms. Just a conversation about what you're building and what you need.


Have questions about this setup or running into issues with a specific service? The comments are open. Also worth checking: the official Apache Airflow Docker Compose documentation gets updated with each release and is often more current than any tutorial you'll find.
