Vitalii Buhaiov for MarketTrace

ChunkLoadError on every deploy: the in-place rebuild trap in Next.js standalone

We run a Next.js 16 site behind nginx on a single VPS. Recently Google Search Console reported a single 500 on one of our locale-prefixed pages. The page was working fine by the time I clicked through. I almost ignored it. I'm glad I didn't. The trail led to a bug that fires on every deploy, and the fix is short.

Here's the story and what the fix cost us.

The single 500

Search Console flagged a locale-prefixed product route. The URL returned a clean 200 when I curled it. So either the indexer hit a transient blip, or something in our deploy flow occasionally leaks a 500 to whichever request happens to be in flight at the wrong second.

The nginx access log made it concrete. One 500 for that URL, single timestamp, never before or after:

[06:58:05]  GET /es/products/details  500

Now the matching journalctl -u frontend output from that same moment:

06:58:04  Error [ChunkLoadError]: Failed to load chunk
          server/chunks/ssr/messages_es_json_[json]_cjs_xxxxxxxx._.js
          from module 83578
   [cause]: Error: Cannot find module
            '/opt/app/frontend/.next/standalone/.next/server/chunks/ssr/...'
06:58:04  Error [ChunkLoadError]: Failed to load chunk
          server/chunks/ssr/[root-of-the-server]__xxxxxxx._.js ...
06:58:04  ⨯ unhandledRejection: ChunkLoadError ...

Hundreds of these in a five-second window, then silence. That five-second window matched the deploy run from earlier that morning to the second. A later deploy left a bigger spread of 500s across other locale-prefixed routes. Same root cause, same five seconds, more URLs simply because more requests landed in the window.

What the rebuild was doing

Our deploy on master push was:

cd /opt/app/frontend
npm ci --prefer-offline
npm run build              # writes .next/standalone/ + .next/static/ + .next/server/
cp -r .next/static .next/standalone/.next/static
cp -r public        .next/standalone/public
systemctl restart frontend

The WorkingDirectory of the systemd unit was .next/standalone/. next build overwrites that directory in place. So during a 30–60-second rebuild, the running Node process held a heap full of in-memory references to chunk filenames (say, server/chunks/ssr/messages_es_json_[json]_cjs_xxxxxxxx._.js) that the new build had just deleted and replaced with a different hash. Then systemctl restart finally killed the old process and started a new one.

Any SSR request that hit the old process during that ~5-second window between "files replaced" and "process restarted" tried to lazy-load a chunk by its old filename. Node went to disk, didn't find it, threw ChunkLoadError. Next.js doesn't handle that in the SSR path. It bubbles up as a 500.

In-memory code that pre-loaded its chunks at boot kept working. Anything that touched a route that lazy-loaded (a different locale, an MDX-rendered page, a dynamic import) was a coin flip.

This isn't a Next.js bug. It's the cost of in-place rebuild deploys for any Node.js process that uses dynamic imports. We had lots of them: one per locale message bundle, one per MDX route, one per locale-prefixed page.
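The mechanism is easy to reproduce without Next.js at all. The following is a simulation only: a temp directory stands in for .next/standalone/, and the chunk names and hashes are made up.

```shell
#!/usr/bin/env bash
# Simulation: a temp dir plays the role of the in-place build directory.
build_dir=$(mktemp -d)

# "Build 1": the running process remembers this filename in memory.
touch "$build_dir/messages_es.aaaa1111.js"
remembered="$build_dir/messages_es.aaaa1111.js"

# "Build 2" runs in place: the old chunk is deleted, a new hash is written.
rm "$build_dir"/messages_es.*.js
touch "$build_dir/messages_es.bbbb2222.js"

# The old process now lazy-loads by the name it remembered.
if [ -e "$remembered" ]; then
  lazy_load_result="ok"
else
  lazy_load_result="ChunkLoadError (file gone)"
fi
echo "$lazy_load_result"
rm -rf "$build_dir"
```

The old process never learns the new hash; it only finds out at the moment it goes to disk.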

What we considered

Four options, in increasing order of "actually adequate":

  1. Stop, then build, then start. systemctl stop before npm run build. The running process never sees mismatched chunks because it isn't running. Cost: nginx returns 502 for 30–60 seconds while the build runs. A 502 (Bad Gateway) says the upstream was unreachable, which Google treats as transient. Much friendlier than 500. Users still see a maintenance-ish page for a minute.

  2. Atomic directory swap. Build into a sibling directory, then mv .next/standalone .next/standalone-old && mv .next/standalone-new .next/standalone && systemctl restart. The running process keeps reading its old (now-renamed) directory until restart. Window shrinks from 30 seconds of 502 to 3–5 seconds of 502. Still some downtime, no 500s.

  3. proxy_next_upstream with a backup server. Tell nginx to retry on a backup if the primary returns 500. Requires keeping two upstream instances in sync forever, including during deploys. That sync is exactly the problem we were trying to solve, so this just relocates it.

  4. Blue-green at the systemd + nginx layer. Two long-running pools on different ports. Build into the idle one. Health-check it. Atomically swap nginx upstream. Drain. Stop the old. Zero failed requests during deploy.

We chose 4. The first three each shave a different chunk off the failure window; 4 closes it entirely. And it costs almost nothing on a 16 GB box (more on this below).

The pieces

Two systemd instances from one template unit

# /etc/systemd/system/frontend@.service
[Unit]
Description=Frontend (Next.js standalone, %i pool)
After=network.target
ConditionPathExists=/opt/app/frontend/pools/%i/server.js

[Service]
Type=simple
User=app
WorkingDirectory=/opt/app/frontend/pools/%i
EnvironmentFile=/etc/frontend-%i.env
Environment=NODE_ENV=production
Environment=HOSTNAME=127.0.0.1
ExecStart=/usr/bin/node server.js
Restart=always
RestartSec=5

%i is the instance name. frontend@blue runs from pools/blue/, frontend@green from pools/green/. The per-color env files supply PORT=3000 and PORT=3001 respectively, kept VPS-local because they don't belong in git.

ConditionPathExists is doing real work. Without it, an empty pool slot (fresh install, partial deploy) would loop on Restart=always. With it, systemd just doesn't start the unit until the path appears.
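The per-color env files are tiny. Here is a minimal sketch of generating them; write_pool_env and ENV_DIR are hypothetical names, and a temp directory stands in for /etc so the snippet is safe to run anywhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write one env file per pool color.
# On the real box the target directory would be /etc (root required).
write_pool_env() {
  local dir=$1 color=$2 port=$3
  printf 'PORT=%s\n' "$port" > "$dir/frontend-$color.env"
}

ENV_DIR=$(mktemp -d)
write_pool_env "$ENV_DIR" blue  3000
write_pool_env "$ENV_DIR" green 3001

blue_env=$(cat "$ENV_DIR/frontend-blue.env")
green_env=$(cat "$ENV_DIR/frontend-green.env")
echo "$blue_env / $green_env"
```

Because the unit file references /etc/frontend-%i.env, starting frontend@blue picks up the blue file and frontend@green the green one, with no per-instance unit to maintain.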

Nginx upstream as an include file

# /etc/nginx/conf.d/frontend-upstream.conf
upstream frontend {
    include /etc/nginx/frontend-upstream-active.inc;
}

# /etc/nginx/frontend-upstream-active.inc
server 127.0.0.1:3001;

The deploy script never edits frontend-upstream.conf. It writes a new frontend-upstream-active.inc via temp-file + mv (which is atomic on a single filesystem), then sends nginx -s reload. The mv is a single rename(2), so the upstream pointer flips in one step; the reload gracefully rotates the workers.

One trap: name the include file with an extension that isn't .conf, or put it outside /etc/nginx/conf.d/. Otherwise the top-level include /etc/nginx/conf.d/*.conf will try to load it as a standalone config and choke on the bare server directive. We used .inc.
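The temp-file + mv step can be wrapped in a small function. This is a sketch, not the production script: write_upstream is a hypothetical name, and the demo writes into a temp directory rather than /etc/nginx.

```shell
#!/usr/bin/env bash
# Hypothetical helper: point the include file at a port atomically.
# The temp file is created next to the target so the final mv is a
# same-filesystem rename and readers never see a half-written file.
write_upstream() {
  local target=$1 port=$2 tmp
  tmp=$(mktemp "${target}.XXXXXX") || return 1
  printf 'server 127.0.0.1:%s;\n' "$port" > "$tmp"
  chmod 644 "$tmp"          # mktemp creates the file with mode 0600
  mv -f "$tmp" "$target"
}

# Demo against a temp dir instead of /etc/nginx.
demo_dir=$(mktemp -d)
inc="$demo_dir/frontend-upstream-active.inc"
write_upstream "$inc" 3001
write_upstream "$inc" 3000
upstream_line=$(cat "$inc")
leftovers=$(ls "$demo_dir" | wc -l | tr -d ' ')
echo "$upstream_line"
```

Creating the temp file in the target's own directory is the design choice that matters: a temp file in /tmp may live on a different filesystem, where mv falls back to copy-and-delete and is no longer atomic.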

Deploy script flow

NGINX_UPSTREAM=/etc/nginx/frontend-upstream-active.inc
ACTIVE=$(cat pools/active-color 2>/dev/null || echo blue)
if [ "$ACTIVE" = "blue" ]; then
  IDLE=green; IDLE_PORT=3001; ACTIVE_PORT=3000
else
  IDLE=blue;  IDLE_PORT=3000; ACTIVE_PORT=3001
fi

# Sanity check: abort if the world is inconsistent.
NGINX_PORT=$(grep -oE '127\.0\.0\.1:[0-9]+' "$NGINX_UPSTREAM" | cut -d: -f2)
[ "$NGINX_PORT" = "$ACTIVE_PORT" ] || { echo "FAIL: marker mismatch"; exit 1; }

# Build (only writes inside .next/, not pools/).
rm -rf "pools/$IDLE"
npm run build

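# Stage build into idle pool (standalone output needs static assets and public/ copied in).
mv .next/standalone "pools/$IDLE"
cp -r .next/static "pools/$IDLE/.next/static"
cp -r public       "pools/$IDLE/public"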

# Bring idle online and prove it works.
systemctl restart "frontend@$IDLE"
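HEALTHY=0
for i in $(seq 1 60); do
  if curl -sf -o /dev/null --max-time 2 \
       -H 'Host: example.com' \
       "http://127.0.0.1:$IDLE_PORT/"; then
    HEALTHY=1; break
  fi
  sleep 1
done
# Never came up: stop the idle pool and leave the active one serving.
[ "$HEALTHY" = 1 ] || { systemctl stop "frontend@$IDLE"; echo "FAIL: health check"; exit 1; }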

# Multi-route smoke (locale-prefixed + MDX + dynamic) before cutover.
for route in / /es/ /products/details /docs/guide /blog; do
  curl -s -L -o /dev/null --max-time 5 -w '%{http_code}' \
    -H 'Host: example.com' "http://127.0.0.1:$IDLE_PORT$route" | grep -qE '^[23]' \
    || { systemctl stop "frontend@$IDLE"; echo "FAIL: smoke $route"; exit 1; }
done

# Atomic upstream swap + reload.
printf 'server 127.0.0.1:%s;\n' "$IDLE_PORT" > "${NGINX_UPSTREAM}.new"
mv "${NGINX_UPSTREAM}.new" "$NGINX_UPSTREAM"
nginx -t && nginx -s reload || { echo "FAIL: nginx reload"; exit 1; }

# Mark, drain, retire.
echo "$IDLE" > pools/active-color
systemctl enable  "frontend@$IDLE"
systemctl disable "frontend@$ACTIVE"
sleep 30                              # drain in-flight requests on the old pool
systemctl stop "frontend@$ACTIVE"

The order matters more than it looks. Two specifics.

Write the marker file immediately after the nginx reload, before the drain. If the script crashes during the sleep or the systemctl stop, the marker reflects what nginx is doing right now. The next deploy reads truth, not stale state.

Sanity-check before destructive ops. rm -rf pools/$IDLE is fine if $IDLE really is idle. If the marker file lies (say a previous rollback was incomplete), $IDLE could be the pool that's serving traffic. The pre-flight check compares the marker against nginx's upstream port and refuses to proceed on a mismatch.
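The pre-flight check is worth isolating so it can be exercised on its own. A minimal sketch, assuming the blue=3000 / green=3001 mapping above; port_for and preflight_ok are hypothetical names, and file locations are passed in so the demo can run against temp files.

```shell
#!/usr/bin/env bash
# Map a pool color to its port (blue=3000, green=3001, as in the env files).
port_for() {
  case "$1" in
    blue)  echo 3000 ;;
    green) echo 3001 ;;
    *)     return 1 ;;
  esac
}

# Succeeds only if the marker file and the nginx include agree.
preflight_ok() {
  local marker=$1 upstream=$2 active nginx_port
  active=$(cat "$marker" 2>/dev/null || echo blue)
  nginx_port=$(grep -oE '127\.0\.0\.1:[0-9]+' "$upstream" | cut -d: -f2)
  [ "$nginx_port" = "$(port_for "$active")" ]
}

# Demo: the marker says blue, but nginx points at green's port, so refuse.
tmp=$(mktemp -d)
echo blue > "$tmp/active-color"
echo 'server 127.0.0.1:3001;' > "$tmp/active.inc"
if preflight_ok "$tmp/active-color" "$tmp/active.inc"; then
  verdict=proceed
else
  verdict=abort
fi
echo "$verdict"
```

An unknown color in the marker file also aborts, because port_for prints nothing and the comparison fails.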

What it costs us

Measured on the live VPS. Your absolute numbers will vary with bundle size and traffic; the ratios won't:

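| Metric                 | Before       | After (steady)                                | After (during 30-s cutover)        |
|------------------------|--------------|-----------------------------------------------|------------------------------------|
| Frontend RAM (RSS)     | 241 MB       | 241 MB                                        | 482 MB (both pools running)        |
| Disk used by pool dirs | 173 MB       | 346 MB (active + previous, kept for rollback) | 346 MB                             |
| Frontend CPU           | ~0 % idle    | ~0 %                                          | ~0 % (both pools idle during cutover) |
| Build-phase RAM peak   | ~1.0–1.5 GB  | unchanged                                     | unchanged                          |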

Against a 16 GB / 150 GB box where Redis already eats 4–5 GB resident, this is rounding error. The build itself is the expensive part of any deploy and it didn't change.

What it bought us:

  • Zero 500s during deploy. The old pool keeps serving its own (unchanged) chunks until it's gracefully stopped. The new pool starts from a complete on-disk build before nginx ever sends it a request.
  • Zero 502s during deploy. No restart window. nginx -s reload is graceful and doesn't drop in-flight connections.
  • Cheap rollback. The previous pool's directory is retained until the next deploy. To revert, the rollback script starts the old pool, writes the include file back, reloads nginx. No rebuild needed. About 10 seconds end-to-end.
  • Honest failure mode. If the build fails, the script aborts before touching nginx; the old pool is still serving. If the new pool fails health-check, the script stops it and exits non-zero; the old pool is still serving. There's no state in which the deploy can take the site down mid-flight.
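The rollback described in the list above can be sketched as a single function. This is an illustration under the layout from this post, not our actual script: rollback, SYSTEMCTL, and NGINX are hypothetical names, the last two being override points so the flow can be dry-run with stubs.

```shell
#!/usr/bin/env bash
SYSTEMCTL=${SYSTEMCTL:-systemctl}
NGINX=${NGINX:-nginx}

rollback() {
  local pools=$1 upstream=$2 active previous port
  active=$(cat "$pools/active-color") || return 1
  if [ "$active" = blue ]; then
    previous=green; port=3001
  else
    previous=blue;  port=3000
  fi

  # The previous build must still be on disk.
  [ -f "$pools/$previous/server.js" ] || { echo "no previous pool" >&2; return 1; }

  "$SYSTEMCTL" start "frontend@$previous" || return 1
  # (A health check like the one in the deploy script belongs here.)
  printf 'server 127.0.0.1:%s;\n' "$port" > "$upstream.new"
  mv "$upstream.new" "$upstream"
  "$NGINX" -t && "$NGINX" -s reload || return 1
  echo "$previous" > "$pools/active-color"
}
```

On the box it would be called as rollback /opt/app/frontend/pools /etc/nginx/frontend-upstream-active.inc. Draining and stopping the pool being rolled back from is left out of the sketch.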

Gotchas

next build cleans .next/ at start

The first cut of this had pool directories at .next/standalone-blue/ and .next/standalone-green/. They got wiped on every rebuild. next build does a recursive clean of .next/ before running. If you want anything to survive across builds, keep it outside .next/. We moved pools to pools/<color>/ (sibling of .next/).

Not Next.js-specific. Most build tools assume their output dir is theirs to own. Don't squat in it.

mv is safe under a running Linux process

While migrating prod to the new layout I had to move pools/blue/ while frontend@blue was actively serving from inside it. Linux inode semantics make this fine: a process holds inode references through its open file descriptors and CWD, not path strings. mv within a single filesystem is just a rename(2); the inodes don't move. The running pool kept serving without noticing.

Same reason tail -f keeps working when you rotate a log file by renaming it. Useful primitive once you remember it.
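A small experiment makes the inode behavior visible. Everything here happens in a temp directory; the directory and file names are illustrative.

```shell
#!/usr/bin/env bash
work=$(mktemp -d)
mkdir "$work/blue"
echo "chunk-v1" > "$work/blue/chunk.js"

# Open a descriptor first, the way a running process would.
exec 3< "$work/blue/chunk.js"

# A subshell whose CWD is inside the directory renames it, then reads a
# file by relative path.
via_cwd=$(cd "$work/blue" && mv "$work/blue" "$work/blue-moved" && cat chunk.js)

# The descriptor opened before the rename still reads the same inode.
read -r via_fd <&3
exec 3<&-

# The old absolute path, however, no longer resolves.
if [ -e "$work/blue/chunk.js" ]; then old_path=present; else old_path=gone; fi
echo "cwd: $via_cwd, fd: $via_fd, old path: $old_path"
rm -rf "$work"
```

Note the last check: anything that looks a file up by its old absolute path after the rename will miss. The guarantee covers open descriptors and CWD-relative lookups only.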

Don't put backup files in sites-enabled/

I made a backup of /etc/nginx/sites-enabled/default next to the original, then nginx -t started warning about "conflicting server name" entries. The top-level include /etc/nginx/sites-enabled/* was loading my .bak as a config. Move backups elsewhere or rename them so the glob misses.
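nginx include patterns follow ordinary glob rules, so the shell can preview what a pattern will pick up. A sketch in a temp directory; the file names mirror the ones in this post, with the .inc file placed in conf.d purely to show that the extension keeps it out of the match.

```shell
#!/usr/bin/env bash
demo=$(mktemp -d)
mkdir "$demo/sites-enabled" "$demo/conf.d"
touch "$demo/sites-enabled/default" "$demo/sites-enabled/default.bak"
touch "$demo/conf.d/frontend-upstream.conf" "$demo/conf.d/frontend-upstream-active.inc"

# What a pattern like sites-enabled/* matches: everything, backup included.
sites_loaded=$(cd "$demo/sites-enabled" && echo *)
# What a pattern like conf.d/*.conf matches: only the .conf file.
confd_loaded=$(cd "$demo/conf.d" && echo *.conf)
echo "$sites_loaded | $confd_loaded"
rm -rf "$demo"
```

Running the same echo against the real directories before nginx -t is a quick way to spot a stray file.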

systemd templates aren't auto-enabled by enable --now

Our CI workflow has a generic loop that auto-enables newly-installed singleton units. Templates (foo@.service) are explicitly skipped because they need an instance name. That's the right behavior for our case: we want exactly one of blue/green enabled at a time, and the deploy script decides which.
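The skip rule amounts to a name check. A hypothetical sketch of such a loop: is_template_unit and the unit names are made up for illustration, and the loop only collects names instead of calling systemctl.

```shell
#!/usr/bin/env bash
# A template unit is one whose file name ends in "@.service".
is_template_unit() {
  case "$1" in
    *@.service) return 0 ;;
    *)          return 1 ;;
  esac
}

to_enable=""
for unit in redis.service frontend@.service backend.service; do
  if is_template_unit "$unit"; then
    continue   # needs an instance name; the deploy script picks blue or green
  fi
  to_enable="$to_enable $unit"
done
to_enable=${to_enable# }
echo "$to_enable"
```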

Health-check should match production conditions

A bare curl http://127.0.0.1:$PORT/ will succeed in a lot of cases where production is broken. Add -H 'Host: example.com' if you're behind a reverse proxy, follow redirects with -L, and probe routes that exercise the middleware / SSR / MDX paths that you care about. We had a Next.js + Cloudflare + nginx interaction bug that only surfaced when the request Host header didn't match 127.0.0.1. A localhost-only health check wouldn't have caught it.

When this isn't worth it

This pattern is small enough to recommend for any single-VPS deploy that uses systemd + a reverse proxy. It scales to multiple boxes the same way. Replace "two systemd instances on one box" with "two server fleets behind a load balancer" and the swap mechanic is identical.

It is not worth it if:

  • Your app boots in under a second and has no in-flight state that matters across restarts. A plain restart is simpler and the failure window is too short to care.
  • You already use a real orchestrator. Kubernetes, Nomad, ECS: all of them do this for you as a rolling deploy. If you have it, use it.
  • You're on a serverless platform where the runtime owns the deploy lifecycle. Same reason.

For a single-VPS Node.js process behind nginx, though, blue-green is the proportionate fix. Half a day of work, no new dependencies.

What changed for us, concretely

  • The five-second ChunkLoadError window is gone by construction. The old pool's chunks never get touched; the new pool starts from a complete build before nginx ever sends it a request.
  • Rollback is a 10-second nginx-upstream rewrite, not a rebuild.
  • The next time next build evicts a critical chunk filename (and it will), nobody outside our journal will know.

Figuring out the original bug took us longer than implementing the fix. If you're running anything stateful behind nginx and your deploy is git pull && build && restart, look at what your single-500 window looks like.
