DEV Community

Shireen Bano A


Production-Grade Container Hardening

Docker Security

Link to Dockerfile

In modern DevOps, running containers as root isn't just sloppy — it's an open invitation. If your application is compromised while running as root, the attacker isn't just inside your app. They own the entire container. Every secret, every mounted volume, every network socket.

The good news? You can architect containers where a successful exploit lands an attacker in a box with nothing — no shell, no tools, no write access, no privileges. That's what this article is about.

We're building a hardened, production-grade container designed to run on AWS ECS Fargate, using defense-in-depth at every layer: the image, the process manager, the filesystem, and the task definition itself.

Layer 1: The Multi-Stage Build — Asset Stripping, Not Just Space Saving

Most developers know multi-stage builds shrink image size. Fewer realize they're also your first line of defense.
The strategy is simple: build dirty, run clean. Your first stage installs compilers, pulls npm packages, runs tests — all the messy work. Your final stage inherits none of it.

# Stage 1: The dirty build environment
FROM node:20-alpine AS builder
WORKDIR /app
COPY . .
RUN npm ci && npm run build

# Stage 2: The clean runtime, with no npm, no git, no source code
FROM nginx:1.25-alpine
# The user must exist in this stage before COPY --chown can reference it by name
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
COPY --from=builder --chown=appuser:appgroup /app/client/dist /usr/share/nginx/html

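Notice the --chown flag on the COPY instruction. Files land with the correct ownership immediately, with no root-owned intermediate copy and no separate chown layer afterward. The named user and group must already exist in the final stage when this COPY runs.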

Layer 2: The Ghost Account — Least Privilege as Architecture

Unless the image says otherwise, everything in an Alpine-based container runs as root. We fix that immediately.

RUN addgroup -S appgroup && adduser -S appuser -G appgroup

The -S flag creates a system user: no password and no login shell. BusyBox adduser still creates a home directory unless you also pass -H, so add that flag if you want no home directory with a .bashrc to backdoor. It's a ghost account: it exists only so the kernel has a non-root identity to assign to your process.

USER appuser

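This single line changes everything. From this point forward, every RUN, CMD, and ENTRYPOINT executes as appuser. The ceiling is enforced by the OS itself. Because later RUN steps lose root too, place USER after any build steps that still need it, such as editing files under /etc or running chown.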
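A quick way to keep this guarantee from silently regressing is to check the Dockerfile itself in CI. The following is a minimal sketch, not a full Dockerfile parser: the function names are hypothetical, it ignores line continuations, and it assumes every base image starts as root.

```python
def final_stage_user(dockerfile_text):
    """Return the user the last build stage ends up running as."""
    user = "root"  # Docker's default when no USER instruction is present
    for raw_line in dockerfile_text.splitlines():
        parts = raw_line.split()
        if not parts or parts[0].startswith("#"):
            continue
        instruction = parts[0].upper()
        if instruction == "FROM":
            # Assumption: every base image starts as root. A base image that
            # sets its own USER would need to be inspected separately.
            user = "root"
        elif instruction == "USER" and len(parts) > 1:
            user = parts[1]
    return user


def runs_as_root(dockerfile_text):
    """True when the final stage would start its process as root (name or UID 0)."""
    name = final_stage_user(dockerfile_text).split(":")[0]
    return name in ("root", "0")


SAMPLE = """\
FROM node:20-alpine AS builder
RUN npm ci && npm run build
FROM nginx:1.25-alpine
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
USER appuser
"""

if __name__ == "__main__":
    print(final_stage_user(SAMPLE))  # appuser
    print(runs_as_root(SAMPLE))      # False
```

A pipeline step could fail the build whenever runs_as_root returns True for the Dockerfile being shipped.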

Layer 3: Taming Nginx — The Privileged Citizen Problem

Here's where it gets interesting. Standard Nginx assumes it's running as root. It wants to write its PID file to /var/run/nginx.pid and its logs to /var/log/nginx/. Our appuser is forbidden from touching either of those paths.
Rather than granting extra permissions, we patch Nginx to work within our constraints:

# Redirect Nginx internals to paths appuser actually owns
# (the pattern tolerates the run of spaces after "pid" in the stock config)
RUN sed -i -E 's|^pid[[:space:]]+[^;]*;|pid /tmp/nginx.pid;|' /etc/nginx/nginx.conf

# Pre-create the temp paths and hand them to appuser
RUN mkdir -p /tmp/client_body /tmp/proxy_temp /var/cache/nginx \
    && chown -R appuser:appgroup /tmp /var/cache/nginx

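We're not lowering the security bar to accommodate Nginx; we're forcing Nginx to operate within our security model. The PID file and all scratch storage move to /tmp, which we then mount as ephemeral, hardened tmpfs volumes in the task definition. Two caveats apply. A mount at /tmp hides anything the image pre-created there, so Nginx must be able to recreate its scratch directories at startup. And the ECS documentation lists the tmpfs parameter as unsupported on the Fargate launch type, where a task-level volume mounted at the same paths is the usual substitute, so check what your platform accepts before copying the fragment below.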

readonlyRootFilesystem = true   # a container-definition setting, alongside linuxParameters
linuxParameters = {
  tmpfs = [
    # size is in MiB
    { containerPath = "/tmp",      size = 128, mountOptions = ["noexec", "nosuid", "nodev"] },
    { containerPath = "/app/logs", size = 64,  mountOptions = ["noexec", "nosuid", "nodev"] },
  ]
}

Those three mount options are doing serious work:

noexec: Nothing on the mount can be executed directly. Even if an attacker writes a binary there, the kernel refuses to run it.
nosuid: Setuid and setgid bits on files in the volume are ignored, which blocks privilege escalation via dropped setuid binaries.
nodev: Device nodes on the mount are not honored, so a crafted device file can't be used to reach raw disks or memory.

And readonlyRootFilesystem = true is the crown jewel: the entire container filesystem is immutable at runtime. The only writable paths are the explicitly mounted tmpfs volumes — and those can't execute anything.
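To confirm at runtime that the options really took effect, the mount table can be read from inside the container. This is a small sketch under stated assumptions: the helper names are hypothetical, and the sample text mimics the /proc/mounts layout (device, mount point, type, options) rather than coming from a real task.

```python
REQUIRED_OPTIONS = {"noexec", "nosuid", "nodev"}


def mount_options(mounts_text, mount_point):
    """Return the option set of the last mount at mount_point (empty if none)."""
    found = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            found = set(fields[3].split(","))  # later mounts shadow earlier ones
    return found


def missing_hardening(mounts_text, mount_point):
    """Return the required options that are absent, sorted for stable output."""
    return sorted(REQUIRED_OPTIONS - mount_options(mounts_text, mount_point))


# Illustrative sample in /proc/mounts format (not captured from a real container)
SAMPLE = """\
overlay / overlay ro,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,noexec,relatime,size=131072k 0 0
tmpfs /app/logs tmpfs rw,nosuid,relatime,size=65536k 0 0
"""

if __name__ == "__main__":
    print(missing_hardening(SAMPLE, "/tmp"))       # []
    print(missing_hardening(SAMPLE, "/app/logs"))  # ['nodev', 'noexec']
```

Inside a running container the same check would read the real table, for example missing_hardening(open("/proc/mounts").read(), "/tmp").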

Layer 4: Supervisord Without the Crown

In a traditional setup, a process manager like systemd runs as root. We use Supervisord, and we strip its crown before it starts:

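[supervisord]
; stay in the foreground so the container keeps running
nodaemon=true
user=appuser
logfile=/tmp/supervisord.log
pidfile=/tmp/supervisord.pid

[program:nginx]
command=nginx -g 'daemon off;'
stdout_logfile=/dev/stdout
; disable rotation: /dev/stdout and /dev/stderr are not seekable files
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stderr
stderr_logfile_maxbytes=0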

user=appuser means even the manager of processes has no administrative power. It coordinates but cannot escalate.
The stdout_logfile=/dev/stdout line solves another problem quietly: logs are streamed directly to Docker's logging driver and never written to disk inside the container. No sensitive log data sitting in a writable layer. No persistence for an attacker to mine.
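Both properties, a non-root manager and no log files on disk, are easy to lint before the image is built. Here is a rough sketch using Python's standard configparser; the function name and the rule that log targets must start with /dev/ are assumptions made for illustration, not Supervisord behavior.

```python
import configparser


def lint_supervisord(conf_text):
    """Return a list of human-readable problems found in a supervisord config."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    problems = []

    user = parser.get("supervisord", "user", fallback=None)
    if user in (None, "root", "0"):
        problems.append("[supervisord] does not set a non-root user")

    for section in parser.sections():
        if not section.startswith("program:"):
            continue
        for key in ("stdout_logfile", "stderr_logfile"):
            target = parser.get(section, key, fallback="")
            # Assumption: anything outside /dev/ is a file on the container disk
            if not target.startswith("/dev/"):
                problems.append(f"[{section}] {key} is not streamed: {target or 'unset'}")
    return problems


SAMPLE = """\
[supervisord]
user=appuser
logfile=/tmp/supervisord.log
pidfile=/tmp/supervisord.pid

[program:nginx]
command=nginx -g 'daemon off;'
stdout_logfile=/dev/stdout
stderr_logfile=/dev/stderr
"""

if __name__ == "__main__":
    print(lint_supervisord(SAMPLE))  # []
```

An empty list means the config passes both checks; each string in a non-empty list names the section and setting to fix.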

Layer 5: Dropping Linux Capabilities — Cutting the Kernel's Leash

Even a non-root user can hold Linux capabilities — granular kernel permissions like the ability to bind low-numbered ports (NET_BIND_SERVICE), manipulate network interfaces (NET_ADMIN), or bypass file permission checks (DAC_OVERRIDE).

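Every capability your container holds is an attack surface. The principle is simple: if you don't need it, drop it. One consequence to plan for: without NET_BIND_SERVICE, a non-root Nginx may be unable to bind ports below 1024, depending on the runtime's unprivileged-port setting, so listening on a high port is the safer choice.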

In your Fargate task definition:

linuxParameters = {
  capabilities = {
    drop = ["ALL"]
  }
}

After dropping all capabilities, your container's /proc/self/status should show:

CapEff: 0000000000000000
CapBnd: 0000000000000000

CapEff at zero means the process has no active kernel privileges. CapBnd at zero means it can never acquire any — capabilities removed from the bounding set cannot be added back. The kernel's leash is cut.
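The hex values are bitmasks, one bit per capability, so the check can be automated. The sketch below parses the Cap* lines and lists which bit numbers are set; the function names are hypothetical, and the name table covers only the capabilities mentioned in this article (bit numbers as defined in the kernel's capability header).

```python
CAP_FIELDS = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")

# Partial table: only the capabilities named in this article
CAP_NAMES = {
    1: "DAC_OVERRIDE",
    10: "NET_BIND_SERVICE",
    12: "NET_ADMIN",
    13: "NET_RAW",
}


def parse_caps(status_text):
    """Extract the capability masks from /proc/<pid>/status text as integers."""
    caps = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in CAP_FIELDS:
            caps[key] = int(value.strip(), 16)
    return caps


def set_bits(mask):
    """Return the bit positions that are set in a capability mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def fully_dropped(status_text):
    """True when both the effective and bounding sets are empty."""
    caps = parse_caps(status_text)
    return caps.get("CapEff") == 0 and caps.get("CapBnd") == 0


SAMPLE = "Name:\tnginx\nCapEff:\t0000000000000000\nCapBnd:\t0000000000000000\n"

if __name__ == "__main__":
    print(fully_dropped(SAMPLE))                      # True
    print([CAP_NAMES.get(b, b) for b in set_bits(0x2402)])
    # ['DAC_OVERRIDE', 'NET_BIND_SERVICE', 'NET_RAW']
```

Inside the container, fully_dropped(open("/proc/self/status").read()) gives a one-line health check for this layer.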

Layer 6: The Task Definition as a Second Lock

Your Dockerfile hardens the image. Your Fargate Task Definition hardens the runtime. These are two independent locks on the same door.

"containerDefinitions": [
  {
    "privileged": false,
    "user": "appuser",
    "readonlyRootFilesystem": true,
    "linuxParameters": {
      "capabilities": { "drop": ["ALL"] }
    }
  }
]

Why does this matter if the Dockerfile already sets USER appuser? Because the Task Definition is enforced by the AWS Fargate agent at runtime, independently of what the image contains. Even if someone pushes a misconfigured image that forgot the USER directive, Fargate will still start the container as appuser, provided that account exists in the image. Defense-in-depth means each layer protects against the failure of the layer before it.

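privileged: false is the explicit rejection of the Docker --privileged flag, which would otherwise give the container near-full host access. Fargate does not support privileged containers in the first place, so on this platform the field mainly documents intent and keeps the definition safe if it is ever reused on EC2-backed clusters.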
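Since the second lock only works if it is actually set, it helps to audit the task definition before registering it. The following sketch checks the four settings shown above on a parsed JSON document; the function names and finding messages are hypothetical, and it reads only the keys used in this article.

```python
import json


def audit_container(container):
    """Return a list of hardening gaps for one container definition."""
    findings = []
    if container.get("privileged", False):
        findings.append("privileged is enabled")
    user = str(container.get("user", "root")).split(":")[0]
    if user in ("root", "0"):
        findings.append("user is missing or root")
    if not container.get("readonlyRootFilesystem", False):
        findings.append("root filesystem is writable")
    capabilities = container.get("linuxParameters", {}).get("capabilities", {})
    if "ALL" not in capabilities.get("drop", []):
        findings.append("capabilities are not all dropped")
    return findings


def audit_task_definition(task_definition):
    """Map each container name (or its index) to its list of findings."""
    report = {}
    for index, container in enumerate(task_definition.get("containerDefinitions", [])):
        name = container.get("name", f"container-{index}")
        report[name] = audit_container(container)
    return report


SAMPLE = json.loads("""
{
  "containerDefinitions": [
    {
      "privileged": false,
      "user": "appuser",
      "readonlyRootFilesystem": true,
      "linuxParameters": { "capabilities": { "drop": ["ALL"] } }
    }
  ]
}
""")

if __name__ == "__main__":
    print(audit_task_definition(SAMPLE))  # {'container-0': []}
```

A deploy script could refuse to continue when any container's list is non-empty.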

Layer 7: Trust But Verify — Container Image Signing with Notation

Hardening your runtime is only half the story. How do you know the image you're deploying is the one you built? Supply chain attacks — where a malicious image is substituted somewhere between your registry and your cluster — are a growing threat.

Notation (a CNCF project) lets you cryptographically sign container images and verify those signatures before deployment.

# Sign after pushing to ECR
notation sign <your-ecr-registry>/your-app:latest

# Verify before deploying
notation verify <your-ecr-registry>/your-app:latest

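Integrate this into your CI/CD pipeline: sign on push, verify on deploy. If verification fails, the deployment doesn't happen. Prefer signing and verifying by digest (@sha256:...) rather than a mutable tag such as latest, since a tag can be repointed after signing. You get cryptographic proof that what's running in Fargate is exactly what your pipeline built, with no substitutions and no tampering.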
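The verify-on-deploy gate can be a few lines in the deploy script. This is a sketch under stated assumptions: it shells out to the notation verify command shown above, it additionally rejects tag references in favor of digests (a policy choice not made in the original commands), and the function names and the injectable run parameter are hypothetical conveniences for testing.

```python
import re
import subprocess

# Matches references pinned to a sha256 digest, e.g. repo/app@sha256:<64 hex chars>
DIGEST_REFERENCE = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_digest_reference(image_ref):
    """True when the reference is pinned to a sha256 digest instead of a tag."""
    return bool(DIGEST_REFERENCE.match(image_ref))


def verify_image(image_ref, run=subprocess.run):
    """Return True only if the reference is digest-pinned and notation accepts it."""
    if not is_digest_reference(image_ref):
        return False  # assumption: mutable tags are rejected by policy
    result = run(["notation", "verify", image_ref], capture_output=True, text=True)
    return result.returncode == 0


def deploy_if_trusted(image_ref, deploy, run=subprocess.run):
    """Call deploy(image_ref) only after a successful verification."""
    if not verify_image(image_ref, run=run):
        raise RuntimeError(f"signature verification failed for {image_ref}")
    return deploy(image_ref)
```

Here deploy is whatever callable performs the rollout in your pipeline; raising before it runs is what makes a failed verification block the deployment.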

🛡️ Security Hardening Matrix

| Layer | Focus | Risk | Threat | Mitigation |
|---|---|---|---|---|
| L1 | Multi-Stage Build | Build-time residue (compilers, .git, secrets) left in image. | Lateral Movement: Attackers use leftover tools to compile malware or pivot deeper into the network. | Build Dirty, Run Clean: Separate build and runtime stages; only production assets move to the final image. |
| L2 | Non-Root Identity | Containers running as root (UID 0) by default. | Host Escape: Exploits (e.g., CVE-2024-21626) allow a root process to break out and control the host. | Ghost Accounts: Create a system appuser with no shell or home directory to enforce a permission "ceiling." |
| L3 | Hardened Nginx | App requires root access to write to system paths like /var/run. | Runtime Tampering: Attackers overwrite configs or web files to serve malware or redirect traffic. | User-Owned Paths: Patch Nginx to use /tmp for PIDs/cache and chown those paths to the appuser. |
| L4 | Unprivileged Manager | Process managers (Supervisord) traditionally running with root "crowns." | Privilege Escalation: A hijacked manager grants the attacker "God Mode" over all managed sub-processes. | Powerless Manager: Run supervisord as appuser and stream logs to stdout to prevent local data mining. |
| L5 | Immutable Filesystem | Writable runtime layers allow attackers to modify the OS environment. | Malware Persistence: Attackers download web shells or scripts that survive as long as the container runs. | The "DVD" Model: Enable readonlyRootFilesystem and use tmpfs mounts with noexec to kill execution. |
| L6 | Kernel Capabilities | Granular kernel permissions (Capabilities) active by default. | Privilege Jumping: Attackers use NET_RAW or DAC_OVERRIDE to sniff traffic or bypass file security. | The Blackout: Use drop = ["ALL"] to zero out CapEff and CapBnd, stripping all kernel-level privileges. |
| L7 | Image Signing | Unverified images pulled from an untrusted or compromised registry. | Supply Chain Attack: An attacker swaps a legitimate image with a "poisoned" version containing a backdoor. | Notation (CNCF): Cryptographically sign images in CI/CD and verify signatures before every deployment. |
