Kohei Aoki

How a Custom Docker Image Made My AWS Amplify Builds 10–20% Faster (and Killed Flaky Build Failures)

Intro

If you keep running an app on AWS Amplify for long enough, you'll hit this problem: your amplify.yml quietly gets more complex and your builds quietly get slower. Every new feature or tool swap adds one more line to preBuild, and before you know it, no one on the team has the whole build graph in their head. Builds creep up, external downloads pile up, and flaky failures start showing up at 3 a.m.

This is the kind of thing you have to revisit periodically, and this post is about one of those revisits.

Concretely: I swapped our AWS Amplify Gen2 build environment for a custom Docker image, and our staging build time dropped from a median of 9m43s to 8m45s — about 10% per build, roughly 1 minute. Depending on your framework, the realistic range is 10–20%, and Next.js apps can hit 50–70%. On top of the raw speed, download-caused build failures dropped to zero.

This post walks through the design, implementation, real-world measurements, and operational tips from running this in production. If you operate a monorepo on AWS Amplify Gen2 and you're thinking "our builds are slower than they should be" or "I want to kill these flaky download-caused failures," this is for you.

Why I built a custom build image

Our app is a pnpm workspace monorepo: apps/web-app (Vite + React) plus packages/gen2-shared-backend (Amplify Gen2 backend, Hono Lambda, CDK custom resources). With the default Amplify build image, every single build was paying these costs:

  • nvm install 22 to reinstall Node.js: ~15s
  • corepack enable && corepack prepare pnpm@10.28.1 --activate: ~10s
  • curl'ing a 64 MB Chromium Lambda Layer ZIP from GitHub Releases: ~20–40s
  • pnpm install --frozen-lockfile from an empty store: 60–90s

That's 2–3 minutes of pure setup every build. And as I'll show, the raw time wasn't even the biggest problem — stability was.

The trigger: a midnight build failure

The real push came on 2026-03-22 when our staging backend build broke. Digging in, I found the curl'd Chromium Layer ZIP was corrupt — Could not unzip uploaded file. GitHub Releases was apparently having a bad minute, and our curl didn't even have --fail, so the HTTP error body had been dutifully written to disk as a "ZIP file."

We filed it in our build-fix logs as "P4: silent failure on external asset download." It's a pattern we've hit multiple times.

At that point I had two choices:

  1. Patch the amplify.yml: add --fail --retry 3 --retry-delay 5, add file size validation, make curl careful.
  2. Bake the Chromium Layer into the build image itself and remove the download entirely.

Option 1 still leaves 20–40 seconds of downloading every build, and the network-failure risk never really goes to zero. Option 2 burns the file in once, at image build time, and also lets us delete the nvm and pnpm setup steps. I picked option 2.
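For the record, option 1 would have looked something like this (a sketch — the helper names and the 1 MB threshold are mine, not from our actual amplify.yml):

```shell
# Sketch of option 1: harden the download instead of removing it.
# curl -f (--fail) exits non-zero on HTTP errors instead of writing
# the error body to disk as a fake "ZIP file".
validate_size() {
  # Fail unless file $1 is at least $2 bytes — catches HTML error
  # pages and truncated downloads masquerading as archives.
  size=$(stat -c%s "$1" 2>/dev/null || echo 0)
  [ "$size" -ge "$2" ]
}

fetch_layer() {
  # $1 = URL, $2 = destination path
  curl -fSL --retry 3 --retry-delay 5 -o "$2" "$1" \
    && validate_size "$2" 1000000
}
```

Even hardened like this, every build still pays the 20–40 second download, and a long enough GitHub Releases outage still fails the build.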

The multi-stage Dockerfile design

I split docker/Dockerfile into multi-stage, producing two images from one base:

| Stage | Tag | Purpose | Size |
| --- | --- | --- | --- |
| build | latest | Amplify custom build + GitHub Actions lint/build | ~800 MB |
| e2e | e2e | GitHub Actions Playwright E2E | ~1.4 GB |

The base image mistake I made

I started with node:22-bookworm-slim. Debian-based, lightweight, official Node image — seemed like the obvious safe pick.

Except builds got flaky once Amplify ran it. The root cause: Amplify's default build image is Amazon Linux 2 (now 2023). Different glibc versions mean ampx and aws-cdk pull different native binaries, and the difference can bite you in non-obvious ways.

So I rewrote it to amazonlinux:2023. Match Amplify's default OS — it's the simplest, safest choice.

```dockerfile
FROM amazonlinux:2023 AS build

ENV PNPM_HOME="/root/.local/share/pnpm" \
    PNPM_STORE_DIR="/root/.local/share/pnpm/store" \
    BUN_INSTALL="/root/.bun" \
    NVM_DIR="/root/.nvm" \
    CHROMIUM_LAYER_DIR="/opt/chromium-layer" \
    CHROMIUM_LAYER_VERSION="v127.0.0"

ENV PATH="$PNPM_HOME:$BUN_INSTALL/bin:$PATH"

RUN dnf update -y && dnf install -y \
    git openssh-clients bash jq tar wget zip unzip gzip \
    which findutils procps-ng ca-certificates \
    && dnf clean all && rm -rf /var/cache/dnf
```

What's in the image

  • Node.js 22 LTS (via nvm)
  • pnpm 10.28.1 (via corepack — still pnpm for workspace management)
  • bun — we can't fully migrate (no --filter, and CDK/ampx compatibility is unclear), but we use bun for the Hono Lambda's isolated install and for tsx script execution
  • AWS CLI v2 — required for ampx, appsync, ssm
  • aws-cdk (global)
  • Chromium Lambda Layer v127.0.0 — pre-baked to /opt/chromium-layer/. We validate the file size at image build time and hard-fail if it's under 1 MB:
```dockerfile
RUN mkdir -p "$CHROMIUM_LAYER_DIR" \
    && curl -fSL --retry 3 --retry-delay 5 \
       -o "$CHROMIUM_LAYER_DIR/chromium-${CHROMIUM_LAYER_VERSION}-layer.zip" \
       "https://github.com/Sparticuz/chromium/releases/download/${CHROMIUM_LAYER_VERSION}/chromium-${CHROMIUM_LAYER_VERSION}-layer.zip" \
    && LAYER_SIZE=$(stat -c%s "$CHROMIUM_LAYER_DIR/chromium-${CHROMIUM_LAYER_VERSION}-layer.zip") \
    && if [ "$LAYER_SIZE" -lt 1000000 ]; then \
         echo "ERROR: Chromium layer is only ${LAYER_SIZE} bytes"; exit 1; \
       fi
```

The trick is running curl -fSL --retry 3 + size validation at image build time, not production build time. If something fails here, that image version just never makes it to ECR — our production Amplify builds never see the failure. We cut off the "3 a.m. build break" path at the root.

Lock down the pnpm store path across every context

This is the single biggest speedup. pnpm install speed is determined by whether the tarball is already in the content-addressable store:

  • Empty store → 60–90 seconds (full download every time)
  • Store populated → 5–15 seconds (hardlinks only)

You need the same PNPM_STORE_DIR in Dockerfile, Amplify build, and GitHub Actions, and you need to list that path in Amplify's cache.paths. If they don't agree, your second build still pays the full download cost.

```dockerfile
RUN . "$NVM_DIR/nvm.sh" \
    && corepack enable \
    && corepack prepare pnpm@10.28.1 --activate \
    && pnpm config set store-dir "$PNPM_STORE_DIR" \
    && pnpm --version
```

Simplifying amplify.yml

The clearest before/after is in the amplify.yml itself.

Before

```yaml
version: 1
backend:
  phases:
    preBuild:
      commands:
        - nvm install 22 && nvm use 22
        - corepack enable && corepack prepare pnpm@10.28.1 --activate
        - |
          curl -L \
            -o packages/gen2-shared-backend/layer/chromium-v127.0.0-layer.zip \
            https://github.com/Sparticuz/chromium/releases/download/v127.0.0/chromium-v127.0.0-layer.zip
        - pnpm install --frozen-lockfile
```

Every build: reinstall Node, set up pnpm, download a 64 MB ZIP from GitHub Releases, then finally run the real install. No --fail, no retries, no size check on that curl.

After

```yaml
version: 1
backend:
  phases:
    preBuild:
      commands:
        # Custom image already has pnpm. Fallback just in case.
        - command -v pnpm >/dev/null 2>&1 || { nvm install && nvm use && corepack enable && corepack prepare pnpm@10.28.1 --activate; }
        # Chromium Layer is baked in — just copy it
        - |
          LAYER_DIR=/opt/chromium-layer
          if [ -f "$LAYER_DIR/chromium-v127.0.0-layer.zip" ]; then
            cp "$LAYER_DIR/chromium-v127.0.0-layer.zip" packages/gen2-shared-backend/layer/
          fi
        - pnpm install --frozen-lockfile
cache:
  paths:
    - /root/.local/share/pnpm/store
    - node_modules/.pnpm
    - apps/web-app/.build-cache
    - packages/gen2-shared-backend/.build-cache
```

I kept the command -v pnpm fallback intentionally: the amplify.yml should still work even if the custom image isn't in effect. I learned this the hard way — one PR removed the fallback, Amplify quietly fell back to the standard image for a branch, and every build died with pnpm: command not found. Design the amplify.yml so custom image = fast, no custom image = still works.

The numbers

I pulled staging build times via aws amplify list-jobs, filtered to successful jobs, and compared the windows before and after the image switch:

| Window | Jobs | Median | Trimmed mean (trim top/bottom 10%) |
| --- | --- | --- | --- |
| Before | 38 | 9m43s | 9m48s |
| After | 57 | 8m45s | 8m53s |
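The extraction step can be sketched like this (hedging: the app id is a placeholder, the field names follow the Amplify JobSummary shape, and I'm assuming start/end times arrive as epoch seconds — adjust the jq if your CLI emits ISO 8601 timestamps):

```shell
# Median duration (in seconds) of successful jobs, computed from
# `aws amplify list-jobs` JSON on stdin.
median_duration() {
  jq '[.jobSummaries[]
       | select(.status == "SUCCEED")
       | (.endTime - .startTime)]
      | sort
      | .[length / 2 | floor]'
}

# Usage (app id hypothetical):
#   aws amplify list-jobs --app-id d1example --branch-name staging \
#     --max-items 100 | median_duration
```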

~58 seconds faster at the median, a 10.0% cut. A minute per build doesn't sound like much, but at ~50 builds a week that's roughly 48 minutes of saved lead time, every single week.
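The back-of-the-envelope math:

```shell
# 9m43s -> 8m45s at the median, at roughly 50 builds a week
before=$((9 * 60 + 43))    # 583 seconds
after=$((8 * 60 + 45))     # 525 seconds
saved=$((before - after))  # 58 seconds per build
echo "${saved}s per build, ~$((saved * 50 / 60)) minutes per week"
```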

Your framework caps your ceiling

The most important thing to understand: the upper bound of this speedup is set by your framework, not by how clever your Dockerfile is. Here's the table I put together when designing the migration, with expected speedup per framework:

| Framework | First build | Subsequent builds | Speedup |
| --- | --- | --- | --- |
| Next.js (SSR) | 90–150s | 30–60s | 50–70% |
| React Router v7 / Remix | 30–60s | 25–50s | 10–20% |
| Vite (our current setup) | 60–90s | 50–80s | 10–15% |

Our measured 10.0% landed right in the middle of the Vite band. Figuring out which band you're in tells you, up front, whether this project is worth the effort.

Vite doesn't benefit much because Vite has no persistent production build cache (see vitejs/vite#15092). node_modules/.vite/ is just the dev-server dependency pre-bundle — it does nothing for pnpm build. So the only win for Vite projects is the pnpm store cache.

Next.js is a different story entirely. .next/cache/webpack persists compiled Webpack/SWC chunks — that's a real production build cache, and the second-build speedup is an order of magnitude bigger. If you're on Next.js, this is unambiguously worth doing.
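For a Next.js app, the cache section would look roughly like this (a sketch — paths depend on your repo layout):

```yaml
cache:
  paths:
    - /root/.local/share/pnpm/store
    - node_modules/.pnpm
    # Next.js persistent production build cache (Webpack/SWC chunks)
    - .next/cache
```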

Three wins bigger than the minute of speedup

Honestly, the one-minute median speedup wasn't even the main win. The three side effects were:

1. Build failures dropped to zero

This is the biggest one. Before the image, we were losing builds a few times a month to curl failures on the Chromium Layer or GitHub Releases rate limits. You know the drill: stop the release at midnight, retry it before standup.

Since the switch, zero failures from external downloads. The ZIP is baked into the image, and the image build itself validates the size, so by the time a production Amplify build runs, "the file exists and is the right size" is guaranteed by construction.

That peace of mind is worth more than the minute to me.

2. The "which Node version are we on?" question went away

Amplify's standard image updates on its own schedule. One day Node's minor version jumps, a preinstalled tool changes, and suddenly something that passed locally fails in CI. We've been bitten by this multiple times.

With the custom image, amazonlinux:2023 + Node.js 22 LTS + pnpm 10.28.1 are pinned everywhere — local dev, CI, and all four Amplify environments (prod / staging + two regional variants) all run the same runtime derived from the same Dockerfile. Quietly huge for reproducibility.

3. The next project gets the payoff

Our next project is going to be Next.js. Same image, and we hit the 50–70% speedup by design. The 10% on the Vite app is really the down payment — the real ROI compounds as we add more projects.

If you run multiple monorepos, the horizontal expansion cost is ~zero after the first one. That matters.

Bonus: it works in GitHub Actions too

The image isn't Amplify-specific. You can drop it straight into GitHub Actions via container:, which removes setup-node and setup-pnpm from your workflow entirely. That's another 30–60 seconds shaved off CI.

```yaml
jobs:
  e2e:
    runs-on: ubuntu-latest
    container:
      image: public.ecr.aws/j9g5b1t3/amplify-build:e2e
    steps:
      - uses: actions/checkout@v6
      - uses: actions/cache@v5
        with:
          path: /root/.local/share/pnpm/store
          key: pnpm-store-${{ hashFiles('pnpm-lock.yaml') }}
          restore-keys: pnpm-store-
      - run: pnpm install --frozen-lockfile
      - run: pnpm test:e2e
```

The e2e stage also has Playwright + Chromium pre-baked, so npx playwright install is gone too. The really nice structural property here is that "Amplify and CI ran with slightly different runtimes" is now impossible — they're literally the same image. Use the same actions/cache key as Amplify's cache paths and your entire caching layer is unified.

If you've been putting off writing a Dockerfile "just for Amplify," the honest framing is: you're writing it for CI too, and that makes the investment much easier to justify.

Operational tips

A few things I learned the hard way.

The NODE_OPTIONS trap

I initially baked ENV NODE_OPTIONS="--max-old-space-size=6144" into the Dockerfile. Even on Amplify's STANDARD_8GB compute, the OOM killer still came for us. Turns out that when pnpm, Vite, and tsx spike simultaneously, peak memory can exceed 6 GB — and NODE_OPTIONS inherits into every Node process.

I dropped it to 4096, then eventually removed NODE_OPTIONS from the Dockerfile entirely. Amplify's runtime controls this when it needs to; pinning it in the image just overrode that logic. When in doubt, leave the image "plain" and let Amplify manage runtime constraints.
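The mechanism behind the trap is worth spelling out: NODE_OPTIONS is plain environment, so it propagates to every descendant process — and pnpm, Vite, and tsx are each separate Node processes, all getting the same heap cap at once:

```shell
# NODE_OPTIONS inherits into every child process, so one value baked
# into the image caps every Node process in the build simultaneously.
export NODE_OPTIONS="--max-old-space-size=6144"
sh -c 'echo "child process sees: $NODE_OPTIONS"'
```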

Assume rsync doesn't exist

I had a postBuild step doing rsync --exclude='*.map' dist/ out/ to strip source maps. It failed in Amplify's default environment because rsync isn't preinstalled.

```bash
# Before
rsync -a --exclude='*.map' dist/ out/

# After (find + cp alternative)
mkdir -p out && (cd dist && find . -type f ! -name '*.map' -exec cp --parents {} ../out/ \;)
```

I could have just added rsync to the custom image, but the principle here matters: the amplify.yml should work even when the custom image isn't active. If you lean too hard on "but my custom image has X," you make your deployment fragile. Keep the amplify.yml executable on vanilla Amplify.

Try the public image yourself

The actual image from this article is on ECR Public — no auth needed, pull it directly:

```bash
# build stage (for Amplify custom build / CI lint / build)
docker pull public.ecr.aws/j9g5b1t3/amplify-build:latest

# e2e stage (includes Playwright + Chromium)
docker pull public.ecr.aws/j9g5b1t3/amplify-build:e2e
```
  • Gallery: https://gallery.ecr.aws/j9g5b1t3/amplify-build
  • Contains: Amazon Linux 2023 / Node.js 22 / pnpm 10.28.1 / bun / AWS CLI v2 / CDK / Chromium Lambda Layer v127.0.0
  • Caveat: This registry is published for the article. For long-term production use, fork the image into your own ECR so you control tags and lifecycle.

Takeaways

  • Custom Docker image for AWS Amplify Gen2 monorepo (Vite): ~1 minute (10%) faster on median. Realistic range by framework: 10–20% for most, 50–70% for Next.js.
  • Framework characteristics set the ceiling. Vite: 10–15%. React Router v7 / Remix: 10–20%. Next.js: 50–70%. Decide whether to invest based on which band you're in, before you start.
  • The speedup isn't the main prize. Zero download-caused build failures, unified runtime across all environments, and near-zero cost to roll this out to the next project — those are what actually matter day to day.
  • Keep the amplify.yml backward-compatible. command -v pnpm fallback + existence check for /opt/chromium-layer/ = your production doesn't die if the custom image ever isn't in effect.

Build-time improvements tend to get framed as "make it faster," but in practice, "make it not fail" is just as important — maybe more. A custom Docker image gets you both for one piece of work. That's a good trade.
