Sreekanth Kuruba

Dockerfile Internals and the Image Build Pipeline

When engineers say "Docker builds an image," they usually mean a single command.
In reality, docker build triggers a deterministic pipeline that transforms a text file into an OCI-compliant artifact, composed of immutable, content-addressed layers.

Understanding this pipeline explains why cache behaves the way it does, why instruction order matters, and why small Dockerfile changes can dramatically impact build time and image size.


From Dockerfile to Build Graph

The build process starts long before any filesystem changes occur.

Docker first parses the Dockerfile into an internal instruction graph.
This phase validates syntax, resolves build stages, and prepares the build context after applying .dockerignore. No layers are created here. The output is a dependency-aware plan for how the image could be built.

Only after this plan is constructed does execution begin.
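
A quick way to see stage resolution at work is --target, which asks the planner to realize only part of the graph (the stage name builder matches the multi-stage example later in this post):

# Build only the "builder" stage; later stages are never executed
docker build --target builder -t myapp:build-stage .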

Practical Impact: The .dockerignore Advantage

# Without .dockerignore:
Sending build context to Docker daemon  1.2GB  # Slow transfer

# With proper .dockerignore:
Sending build context to Docker daemon  12.3kB  # Fast transfer

Key files to exclude:

node_modules/
.git/
*.log
.env
# For multi-stage builds (note: .dockerignore comments must be on their own line)
dist/

Layer Creation Is Content, Not Commands

Each filesystem-changing instruction such as RUN, COPY, or ADD produces a new layer.
These layers are immutable and identified by a cryptographic hash derived from their content and their parent layer.

This is why Docker caching is reliable.
If the inputs are identical, the resulting layer hash is identical. The build system does not care why a command ran, only what it produced.

Cache Key Composition

# Conceptual model, not the literal algorithm:
Cache Key = SHA256(
  Parent Layer Hash +
  Instruction Content +
  File Content (for COPY/ADD) +
  Build Arguments consumed at this point
)

Example Cache Behavior:

# Layer 1: Always cached (base image)
FROM node:18-alpine

# Layer 2: Cached unless WORKDIR changes
WORKDIR /app

# Layer 3: Cache breaks if package.json changes
COPY package*.json ./

# Layer 4: Cache breaks if Layer 3 changes
RUN npm ci

# Layer 5: Cache breaks if ANY file changes
COPY . .

# Layer 6: Metadata only; no filesystem layer is created
CMD ["npm", "start"]

This design is what allows Docker to reuse layers across images, hosts, and even registries.
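
Because digests are content-derived, the sharing is easy to verify: two images that start from the same base report identical digests for those shared layers (a small sketch; myapp stands in for any locally built image):

# The leading entries of both lists match: shared base layers are stored once
docker inspect node:18-alpine --format '{{.RootFS.Layers}}'
docker inspect myapp --format '{{.RootFS.Layers}}'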


Why BuildKit Changed Everything

The classic Docker builder executed instructions sequentially, treating each step as an isolated operation.
BuildKit replaces this with a graph-based execution model.

With BuildKit, independent steps can execute in parallel, cache keys are more precise, and sensitive data such as credentials can be mounted at build time without ever becoming part of an image layer.

BuildKit vs Classic: A Performance Comparison

# Classic Builder (sequential)
Step 1/8 : FROM alpine:latest
Step 2/8 : RUN apk add --no-cache python3
Step 3/8 : RUN pip install pandas
... # Each step waits for previous

# BuildKit (concurrent possible)
[+] Building 8.2s (15/15) FINISHED
 => CACHED [stage-1 2/6] ...
 => CACHED [stage-1 3/6] ...  # Parallel execution
 => CACHED [stage-1 4/6] ...
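
BuildKit is the default builder in Docker Engine 23.0 and later; on older installs it can be enabled per build:

# Opt in explicitly on older Docker versions
DOCKER_BUILDKIT=1 docker build -t myapp .

# Or invoke BuildKit through buildx
docker buildx build -t myapp .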

Advanced BuildKit Features

1. Build Secrets (Never in Image Layers)

RUN --mount=type=secret,id=npm_token \
    echo "//registry.npmjs.org/:_authToken=$(cat /run/secrets/npm_token)" > .npmrc && \
    npm ci && \
    rm .npmrc  # Removed within the same RUN, so the token never persists in a layer
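
The matching secret is supplied on the command line at build time; the id must match the Dockerfile, and the source path here is only an example:

# The token is mounted only for the duration of that RUN, never stored in a layer
docker build --secret id=npm_token,src=$HOME/.npm_token -t myapp .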

2. Cache Mounts (Persistent Between Builds)

RUN --mount=type=cache,target=/var/cache/apt \
    apt-get update && apt-get install -y build-essential  # example package
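
The same pattern applies to language package managers. For example, npm's download cache (which lives at /root/.npm when building as root) can persist across builds while node_modules is still installed fresh each time:

RUN --mount=type=cache,target=/root/.npm \
    npm ci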

This is not just an optimization.
It is a fundamental shift in how image builds are modeled.


Multi-Stage Builds as a Security Boundary

Multi-stage builds are often described as a size optimization.
More importantly, they create a clean separation between build-time and runtime concerns.

Compilers, package managers, and secrets exist only in intermediate stages.
The final image contains exactly what is required to run the application, and nothing else.

Security Impact Analysis

# Single-Stage (Vulnerable)
FROM node:18
WORKDIR /app
COPY . .
RUN npm ci  # 600+ dev dependencies
RUN npm run build
CMD ["node", "dist/app.js"]
# Result: 1.2GB image with dev tools, compilers, secrets

# Multi-Stage (Secure)
FROM node:18 AS builder
WORKDIR /app
COPY . .
RUN npm ci && npm run build  # Dev dependencies here

FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/package*.json ./
RUN npm ci --omit=dev  # Only 40 prod dependencies
# Result: 180MB image, no dev tools, no build secrets

This reduces attack surface, simplifies vulnerability scanning, and makes image provenance easier to reason about.
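
After building both variants, the difference is easy to confirm (tags here are illustrative):

# Compare the resulting image sizes side by side
docker images --format '{{.Repository}}:{{.Tag}}  {{.Size}}' | grep myapp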


Debugging Builds Means Debugging Inputs

Most Docker build issues are not runtime problems.
They are cache invalidation problems.

Unexpected rebuilds almost always trace back to:

  • Changing inputs in early layers
  • Overly broad COPY instructions
  • Uncontrolled build arguments (see the sketch below)
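
A sketch of the last case: a build argument consumed early in the Dockerfile invalidates every step after it whenever its value changes:

ARG BUILD_DATE                 # Passed via --build-arg BUILD_DATE=...
RUN echo "built: $BUILD_DATE"  # Cache breaks here on every new value...
COPY package*.json ./          # ...and for everything below it
RUN npm ci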

Diagnostic Toolkit

1. Layer Inspection

docker history myimage --no-trunc --format "{{.CreatedBy}}"
dive myimage  # Interactive layer explorer

2. Cache Analysis

# See why cache invalidated
docker build --progress=plain .

# Check specific layer
docker inspect myimage --format='{{.RootFS.Layers}}'

3. Context Troubleshooting

# See what's being sent to the daemon; BuildKit reports this as
# "[internal] load build context" / "transferring context"
docker build --no-cache --progress=plain . 2>&1 | grep -i "context"

Tools like docker build --progress=plain, docker history, and layer inspection utilities expose these relationships directly, turning "Docker magic" back into observable behavior.


Production Patterns

1. Deterministic Builds

# Pin everything
FROM node:18.20.1-alpine3.19  # Not :latest
RUN npm ci  # Not npm install; ci fails if package-lock.json is out of sync
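
For complete immutability, the base image can also be pinned by digest; the digest below is a placeholder for the value reported by docker pull or docker buildx imagetools inspect:

# Tags can be re-pushed; digests cannot
FROM node:18.20.1-alpine3.19@sha256:<digest-from-your-registry>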

2. Build-Time Optimization

# Order matters: Stable → Changing
COPY package*.json ./     # Infrequent changes
RUN npm ci               # Expensive operation
COPY . .                 # Frequent changes
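
In CI, where every job starts cold, the layer cache can be exported to a registry and pulled back on the next run (a sketch: the registry.example.com/myapp reference is a placeholder, and registry cache export may require a docker-container builder created with docker buildx create --use):

docker buildx build \
  --cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
  --cache-from type=registry,ref=registry.example.com/myapp:buildcache \
  -t registry.example.com/myapp:latest .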

3. Size Optimization

# Clean up in the SAME RUN: files deleted by a later instruction
# still exist in earlier, immutable layers.
# "./build.sh" is a placeholder for your actual build step.
RUN apt-get update && \
    apt-get install -y build-essential && \
    ./build.sh && \
    apt-get purge -y build-essential && \
    apt-get autoremove -y && \
    rm -rf /var/lib/apt/lists/*

The OCI Artifact: What Actually Gets Built

At the end of the pipeline, Docker produces:

  1. Image Manifest - Metadata and layer references
  2. Image Config - Environment, entrypoint, working directory
  3. Layer Tarballs - Compressed filesystem diffs
  4. Index (multi-arch) - Platform-specific manifests

A simplified OCI image manifest:

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:def456...",  // Points to the config blob, which
    "size": 2345                   // holds Cmd, Env, WorkingDir, etc.
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:abc123...",  // Content hash
      "size": 1234567
    }
  ]
}
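
All of these pieces are inspectable on any published image (docker manifest inspect may require experimental CLI features on older Docker versions):

# Show the manifest, and for multi-arch images the index, without pulling
docker buildx imagetools inspect node:18-alpine
docker manifest inspect node:18-alpine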

Summary

The Docker build pipeline transforms human-readable instructions into a secure, efficient, distributable artifact through:

  1. Graph-based planning - Not linear execution
  2. Content-addressable storage - Deterministic layer creation
  3. Stage isolation - Build/runtime separation
  4. Observable behavior - Every layer is inspectable

Understanding these internals moves teams from "Docker builds" to "engineered artifact pipelines."

