We run a Next.js 16 site behind nginx on a single VPS. Recently Google Search Console reported a single 500 on one of our locale-prefixed pages. The page was working fine by the time I clicked through. I almost ignored it. I'm glad I didn't. The trail led to a bug that fires on every deploy, and the fix is short.
Here's the story and what the fix cost us.
The single 500
Search Console flagged a locale-prefixed product route. The URL returned a clean 200 when I curled it. So either the indexer hit a transient blip, or something in our deploy flow occasionally leaks a 500 to whichever request happens to be in flight at the wrong second.
The nginx access log made it concrete. One 500 for that URL, single timestamp, never before or after:
```text
[06:58:05] GET /es/products/details 500
```
Now the matching journalctl -u frontend for the same second:
```text
06:58:04 Error [ChunkLoadError]: Failed to load chunk
    server/chunks/ssr/messages_es_json_[json]_cjs_xxxxxxxx._.js
    from module 83578
  [cause]: Error: Cannot find module
    '/opt/app/frontend/.next/standalone/.next/server/chunks/ssr/...'
06:58:04 Error [ChunkLoadError]: Failed to load chunk
    server/chunks/ssr/[root-of-the-server]__xxxxxxx._.js ...
06:58:04 ⨯ unhandledRejection: ChunkLoadError ...
```
Hundreds of these in a five-second window, then silence. That five-second window matched the deploy run from earlier that morning to the second. A later deploy left a bigger spread of 500s across other locale-prefixed routes. Same root cause, same five seconds, more URLs simply because more requests landed in the window.
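To see how tightly the errors cluster, a small filter can bucket ChunkLoadError lines by timestamp. This is a minimal sketch, not the command we actually ran: count_chunk_errors is a hypothetical helper name, and it assumes each log line starts with an HH:MM:SS timestamp as in the excerpt above (the default journalctl format puts the date and host first, so the field index would need adjusting).

```shell
#!/usr/bin/env bash
# Count ChunkLoadError lines per timestamp.
# Assumption: every log line begins with HH:MM:SS, as in the excerpt above.
count_chunk_errors() {
  grep -F 'ChunkLoadError' \
    | awk '{ n[$1]++ } END { for (t in n) print t, n[t] }' \
    | sort
}

# Demo on three lines shaped like the journal excerpt.
printf '%s\n' \
  '06:58:04 Error [ChunkLoadError]: Failed to load chunk' \
  '    from module 83578' \
  '06:58:04 ⨯ unhandledRejection: ChunkLoadError ...' \
  | count_chunk_errors
```

The demo prints `06:58:04 2`: continuation lines without the error name are ignored, and the two matching lines land in the same one-second bucket.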
What the rebuild was doing
Our deploy on master push was:
```shell
cd /opt/app/frontend
npm ci --prefer-offline
npm run build   # writes .next/standalone/ + .next/static/ + .next/server/
cp -r .next/static .next/standalone/.next/static
cp -r public .next/standalone/public
systemctl restart frontend
```
The WorkingDirectory of the systemd unit was .next/standalone/. next build overwrites that directory in place. So during a 3-minute rebuild, the running Node process held a heap full of in-memory references to chunk filenames (say, server/chunks/ssr/messages_es_json_[json]_cjs_xxxxxxxx._.js) that the new build had just deleted and replaced under a different hash. Only then did systemctl restart kill the old process and start a new one.
Any SSR request that hit the old process during that ~5-second window between "files replaced" and "process restarted" tried to lazy-load a chunk by its old filename. Node went to disk, didn't find it, threw ChunkLoadError. Next.js doesn't handle that in the SSR path. It bubbles up as a 500.
In-memory code that pre-loaded its chunks at boot kept working. Anything that touched a route that lazy-loaded (a different locale, an MDX-rendered page, a dynamic import) was a coin flip.
This isn't a Next.js bug. It's the cost of in-place rebuild deploys for any Node.js process that uses dynamic imports. We had lots of them: one per locale message bundle, one per MDX route, one per locale-prefixed page.
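The mechanism can be reproduced without Next.js at all. The sketch below is an illustration under made-up names (the chunk filenames and the simulate_inplace_rebuild function are hypothetical): a shell variable stands in for the filename the old process holds in memory, and a temp directory stands in for the build output.

```shell
#!/usr/bin/env bash
# Simulate an in-place rebuild invalidating a filename the old process remembers.
# All file and function names here are invented for illustration.
simulate_inplace_rebuild() {
  local dir remembered
  dir=$(mktemp -d)
  mkdir -p "$dir/chunks"

  # "Old build": the running process remembers this hashed filename.
  echo 'old bundle' > "$dir/chunks/messages_es.aaaa1111.js"
  remembered="$dir/chunks/messages_es.aaaa1111.js"

  # "New build": the output directory is rewritten in place under a new hash.
  rm -f "$dir/chunks/"*.js
  echo 'new bundle' > "$dir/chunks/messages_es.bbbb2222.js"

  # The old process now lazy-loads by the name it remembered.
  if cat "$remembered" >/dev/null 2>&1; then
    echo 'loaded'
  else
    echo 'ChunkLoadError: stale filename'
  fi
  rm -rf "$dir"
}

simulate_inplace_rebuild
```

Running it prints `ChunkLoadError: stale filename`. Nothing about the old process changed; only the directory underneath it did.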
What we considered
Four options, in increasing order of "actually adequate":
1. Stop, then build, then start. systemctl stop before npm run build. The running process never sees mismatched chunks because it isn't running. Cost: nginx returns 502 for 30–60 seconds while the build runs. 502 means "bad gateway" (nginx couldn't reach the app), which Google treats as transient. Much friendlier than 500. Users still see a maintenance-ish page for a minute.
2. Atomic directory swap. Build into a sibling directory, then mv .next/standalone .next/standalone-old && mv .next/standalone-new .next/standalone && systemctl restart frontend. The running process keeps reading its old (now-renamed) directory until restart. Window shrinks from 30–60 seconds of 502 to 3–5 seconds of 502. Still some downtime, no 500s.
3. proxy_next_upstream with a backup server. Tell nginx to retry on a backup if the primary returns 500. Requires keeping two upstream instances in sync forever, including during deploys. That sync is exactly the problem we were trying to solve, so this just relocates it.
4. Blue-green at the systemd + nginx layer. Two long-running pools on different ports. Build into the idle one. Health-check it. Atomically swap nginx upstream. Drain. Stop the old. Zero failed requests during deploy.
We chose 4. The first three each shave a different chunk off the failure window; 4 closes it entirely. And it costs almost nothing on a 16 GB box (more on this below).
The pieces
Two systemd instances from one template unit
```ini
# /etc/systemd/system/frontend@.service
[Unit]
Description=Frontend (Next.js standalone, %i pool)
After=network.target
ConditionPathExists=/opt/app/frontend/pools/%i/server.js

[Service]
Type=simple
User=app
WorkingDirectory=/opt/app/frontend/pools/%i
EnvironmentFile=/etc/frontend-%i.env
Environment=NODE_ENV=production
Environment=HOSTNAME=127.0.0.1
ExecStart=/usr/bin/node server.js
Restart=always
RestartSec=5
```
%i is the instance name. frontend@blue runs from pools/blue/, frontend@green from pools/green/. The per-color env files supply PORT=3000 and PORT=3001 respectively, kept VPS-local because they don't belong in git.
ConditionPathExists is doing real work. Without it, an empty pool slot (fresh install, partial deploy) would loop on Restart=always. With it, systemd just doesn't start the unit until the path appears.
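For completeness, here is one way the per-color env files could be generated. This is a sketch, not our provisioning code: pool_port and write_pool_env are hypothetical helpers, the target directory is a parameter so the function can be tried outside /etc, and the only content written is the PORT line described above.

```shell
#!/usr/bin/env bash
# Map a pool color to its port (blue=3000, green=3001, as in the article).
pool_port() {
  case "$1" in
    blue)  echo 3000 ;;
    green) echo 3001 ;;
    *)     echo "unknown pool: $1" >&2; return 1 ;;
  esac
}

# Write the file that frontend@<color> reads via EnvironmentFile=.
# $1 = color, $2 = directory to write into (would be /etc on the VPS).
write_pool_env() {
  local color=$1 dir=$2 port
  port=$(pool_port "$color") || return 1
  printf 'PORT=%s\n' "$port" > "$dir/frontend-$color.env"
}
```

On the box this would be called as write_pool_env blue /etc and write_pool_env green /etc, after which systemctl start frontend@blue picks up PORT=3000. Keeping the color-to-port mapping in one function also gives the deploy script a single place to look it up.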
Nginx upstream as an include file
```nginx
# /etc/nginx/conf.d/frontend-upstream.conf
upstream frontend {
    include /etc/nginx/frontend-upstream-active.inc;
}

# /etc/nginx/frontend-upstream-active.inc
server 127.0.0.1:3001;
```
The deploy script never edits frontend-upstream.conf. It writes a new frontend-upstream-active.inc via temp-file + mv (which is atomic on a single filesystem), then runs nginx -s reload. The rename(2) behind mv flips the upstream pointer in one atomic step, so nginx never reads a half-written file; the reload gracefully rotates the workers.
One trap: name the include file with an extension that isn't .conf, or put it outside /etc/nginx/conf.d/. Otherwise the top-level include /etc/nginx/conf.d/*.conf will try to load it as a standalone config and choke on the bare server directive. We used .inc.
Deploy script flow
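```shell
#!/usr/bin/env bash
set -euo pipefail
cd /opt/app/frontend
NGINX_UPSTREAM=/etc/nginx/frontend-upstream-active.inc

ACTIVE=$(cat pools/active-color 2>/dev/null || echo blue)
if [ "$ACTIVE" = "blue" ]; then
  IDLE=green; IDLE_PORT=3001; ACTIVE_PORT=3000
else
  IDLE=blue; IDLE_PORT=3000; ACTIVE_PORT=3001
fi

# Sanity check: abort if the marker file and nginx disagree about the live pool.
NGINX_PORT=$(grep -oE '127\.0\.0\.1:[0-9]+' "$NGINX_UPSTREAM" | cut -d: -f2 || true)
[ "$NGINX_PORT" = "$ACTIVE_PORT" ] || { echo "FAIL: marker mismatch"; exit 1; }

# Abort helper: stop the idle pool and leave the active pool serving.
abort() { echo "FAIL: $1"; systemctl stop "frontend@$IDLE"; exit 1; }

# Build (only writes inside .next/, not pools/). A failed build exits here via set -e.
rm -rf "pools/$IDLE"
npm run build

# Stage build into idle pool (same static/public copies as the old deploy).
cp -r .next/static .next/standalone/.next/static
cp -r public .next/standalone/public
mv .next/standalone "pools/$IDLE"

# Bring idle online and prove it works.
systemctl restart "frontend@$IDLE"
healthy=0
for i in $(seq 1 60); do
  if curl -sf -o /dev/null --max-time 2 \
       -H 'Host: example.com' \
       "http://127.0.0.1:$IDLE_PORT/"; then
    healthy=1; break
  fi
  sleep 1
done
[ "$healthy" = 1 ] || abort "idle pool never became healthy"

# Multi-route smoke (locale-prefixed + MDX + dynamic) before cutover.
for route in / /es/ /products/details /docs/guide /blog; do
  curl -s -L -o /dev/null --max-time 5 -w '%{http_code}' \
       -H 'Host: example.com' "http://127.0.0.1:$IDLE_PORT$route" | grep -qE '^[23]' \
    || abort "smoke test failed on $route"
done

# Atomic upstream swap + reload. If nginx rejects the config, restore the old pointer.
printf 'server 127.0.0.1:%s;\n' "$IDLE_PORT" > "${NGINX_UPSTREAM}.new"
mv "${NGINX_UPSTREAM}.new" "$NGINX_UPSTREAM"
if ! nginx -t; then
  printf 'server 127.0.0.1:%s;\n' "$ACTIVE_PORT" > "$NGINX_UPSTREAM"
  abort "nginx -t rejected the new upstream"
fi
nginx -s reload

# Mark, drain, retire.
echo "$IDLE" > pools/active-color
systemctl enable "frontend@$IDLE"
systemctl disable "frontend@$ACTIVE"
sleep 30   # drain in-flight requests on the old pool
systemctl stop "frontend@$ACTIVE"
```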
The order matters more than it looks. Two specifics.
Write the marker file immediately after the nginx reload, before the drain. If the script crashes during the sleep or the systemctl stop, the marker reflects what nginx is doing right now. The next deploy reads truth, not stale state.
Sanity-check before destructive ops. rm -rf pools/$IDLE is fine if $IDLE really is idle. If the marker file lies (say a previous rollback was incomplete), $IDLE could be the pool that's serving traffic. The pre-flight check compares the marker against nginx's upstream port and refuses to proceed on a mismatch.
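To watch the pre-flight check catch a lying marker, the comparison can be pulled into a function and pointed at throwaway files. A minimal sketch: marker_matches_upstream is a hypothetical name, both file paths are parameters, and the blue=3000 / green=3001 mapping is the one used throughout this post.

```shell
#!/usr/bin/env bash
# Succeed only when the marker file and the nginx include agree on the live pool.
# $1 = marker file (contains "blue" or "green"), $2 = nginx upstream include file.
marker_matches_upstream() {
  local marker_file=$1 upstream_file=$2 active expected actual
  active=$(cat "$marker_file" 2>/dev/null || echo blue)
  case "$active" in
    blue)  expected=3000 ;;
    green) expected=3001 ;;
    *)     return 1 ;;          # unknown color: never proceed
  esac
  actual=$(grep -oE '127\.0\.0\.1:[0-9]+' "$upstream_file" 2>/dev/null | cut -d: -f2)
  [ "$actual" = "$expected" ]
}
```

With a marker that says green while the include still reads server 127.0.0.1:3000; (the shape an incomplete rollback leaves behind), the function fails and the deploy stops before rm -rf can touch the pool that is actually serving traffic.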
What it costs us
Measured on the live VPS. Your absolute numbers will vary with bundle size and traffic; the ratios won't:
| | Before | After (steady) | After (during 30-s cutover) |
|---|---|---|---|
| Frontend RAM (RSS) | 241 MB | 241 MB | 482 MB (both pools running) |
| Disk used by pool dirs | 173 MB | 346 MB (active + previous, kept for rollback) | 346 MB |
| Frontend CPU | ~0 % idle | ~0 % | ~0 % (both pools idle during cutover) |
| Build-phase RAM peak | ~1.0–1.5 GB | unchanged | unchanged |
Against a 16 GB / 150 GB box where Redis already eats 4–5 GB resident, this is rounding error. The build itself is the expensive part of any deploy and it didn't change.
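To put a number on "rounding error", the peak figures from the table can be expressed as a share of the box. A quick sketch with a hypothetical helper, pct_of_capacity, which treats 1 GB as 1024 MB (an approximation):

```shell
#!/usr/bin/env bash
# Percentage of a capacity (in GB) taken up by a usage figure (in MB).
# Assumption: 1 GB = 1024 MB, close enough for a sanity check.
pct_of_capacity() {
  awk -v used_mb="$1" -v total_gb="$2" \
    'BEGIN { printf "%.1f\n", used_mb / (total_gb * 1024) * 100 }'
}

pct_of_capacity 482 16    # RAM with both pools running during cutover
pct_of_capacity 346 150   # disk with active + previous pool retained
```

This prints 2.9 and 0.2: just under 3 % of RAM at the cutover peak, and a fraction of a percent of disk.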
What it bought us:
- Zero 500s during deploy. The old pool keeps serving its own (unchanged) chunks until it's gracefully stopped. The new pool starts from a complete on-disk build before nginx ever sends it a request.
- Zero 502s during deploy. No restart window. nginx -s reload is graceful and doesn't drop in-flight connections.
- Cheap rollback. The previous pool's directory is retained until the next deploy. To revert, the rollback script starts the old pool, writes the include file back, reloads nginx. No rebuild needed. About 10 seconds end-to-end.
- Honest failure mode. If the build fails, the script aborts before touching nginx; the old pool is still serving. If the new pool fails health-check, the script stops it and exits non-zero; the old pool is still serving. There's no state in which the deploy can take the site down mid-flight.
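The rollback described above can be sketched as a single function. This is an illustration of the three steps (start old pool, write include back, reload), not our actual rollback script: the rollback function, its parameters, and the SYSTEMCTL / NGINX override variables are all assumptions, added so the flow can be dry-run with stubs.

```shell
#!/usr/bin/env bash
# Rollback sketch. SYSTEMCTL and NGINX can be overridden for a dry run.
SYSTEMCTL=${SYSTEMCTL:-systemctl}
NGINX=${NGINX:-nginx}

# $1 = pools directory (holds active-color, blue/, green/), $2 = nginx include file.
rollback() {
  local pools_dir=$1 upstream_file=$2 active previous port old_port
  active=$(cat "$pools_dir/active-color") || return 1
  if [ "$active" = blue ]; then
    previous=green; port=3001; old_port=3000
  else
    previous=blue; port=3000; old_port=3001
  fi

  # The previous build must still be on disk.
  [ -f "$pools_dir/$previous/server.js" ] || {
    echo "no previous pool to roll back to" >&2; return 1; }

  $SYSTEMCTL start "frontend@$previous" || return 1
  # (a health check against $port belongs here, as in the deploy script)

  # Point nginx back at the previous pool: temp file + rename, then reload.
  printf 'server 127.0.0.1:%s;\n' "$port" > "$upstream_file.new"
  mv "$upstream_file.new" "$upstream_file"
  if ! $NGINX -t; then
    printf 'server 127.0.0.1:%s;\n' "$old_port" > "$upstream_file"
    return 1
  fi
  $NGINX -s reload || return 1

  # Marker follows nginx, same ordering rule as the deploy.
  echo "$previous" > "$pools_dir/active-color"
}
```

On the VPS this would be invoked as rollback /opt/app/frontend/pools /etc/nginx/frontend-upstream-active.inc. Setting SYSTEMCTL=true NGINX=true first runs the file-handling logic without touching any service.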
Gotchas
next build cleans .next/ at start
The first cut of this had pool directories at .next/standalone-blue/ and .next/standalone-green/. They got wiped on every rebuild. next build does a recursive clean of .next/ before running. If you want anything to survive across builds, keep it outside .next/. We moved pools to pools/<color>/ (sibling of .next/).
Not Next.js-specific. Most build tools assume their output dir is theirs to own. Don't squat in it.
mv is safe under a running Linux process
While migrating prod to the new layout I had to move pools/blue/ while frontend@blue was actively serving from inside it. Linux inode semantics make this fine: a process holds inode references through its open file descriptors and CWD, not path strings. mv within a single filesystem is just a rename(2); the inodes don't move. The running pool kept serving without noticing.
Same reason tail -f keeps working when you rotate a log file by renaming it. Useful primitive once you remember it.
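The inode behavior is easy to verify in a scratch directory. A small sketch (rename_under_open_fd is a hypothetical name): a file descriptor plays the part of the running process, and the directory is renamed out from under it.

```shell
#!/usr/bin/env bash
# Show that an open file descriptor survives a rename of its parent directory.
rename_under_open_fd() {
  local base
  base=$(mktemp -d)
  mkdir "$base/pool"
  echo 'still here' > "$base/pool/chunk.js"

  exec 3< "$base/pool/chunk.js"        # the "running process" opens the file
  mv "$base/pool" "$base/pool-moved"   # rename(2) within one filesystem

  cat <&3                              # read goes through the inode, not the old path
  exec 3<&-
  rm -rf "$base"
}

rename_under_open_fd
```

It prints `still here` even though the path used to open the file no longer exists. The guarantee covers only what is already open (plus the CWD); a path opened after the rename is resolved fresh against the new layout.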
Don't put backup files in sites-enabled/
I made a backup of /etc/nginx/sites-enabled/default next to the original, then nginx -t started warning about "conflicting server name" entries. The top-level include /etc/nginx/sites-enabled/* was loading my .bak as a config. Move backups elsewhere or rename them so the glob misses.
systemd templates aren't auto-enabled by enable --now
Our CI workflow has a generic loop that auto-enables newly-installed singleton units. Templates (foo@.service) are explicitly skipped because they need an instance name. That's the right behavior for our case: we want exactly one of blue/green enabled at a time, and the deploy script decides which.
Health-check should match production conditions
A bare curl http://127.0.0.1:$PORT/ will succeed in a lot of cases where production is broken. Add -H 'Host: example.com' if you're behind a reverse proxy, follow redirects with -L, and probe routes that exercise the middleware / SSR / MDX paths that you care about. We had a Next.js + Cloudflare + nginx interaction bug that only surfaced when the request Host header didn't match 127.0.0.1. A localhost-only health check wouldn't have caught it.
When this isn't worth it
This pattern is small enough to recommend for any single-VPS deploy that uses systemd + a reverse proxy. It scales to multiple boxes the same way. Replace "two systemd instances on one box" with "two server fleets behind a load balancer" and the swap mechanic is identical.
It is not worth it if:
- Your app boots in under a second and has no in-flight state that matters across restarts. A plain restart is simpler and the failure window is too short to care.
- You already use a real orchestrator. Kubernetes, Nomad, ECS: all of them do this for you as a rolling deploy. If you have it, use it.
- You're on a serverless platform where the runtime owns the deploy lifecycle. Same reason.
For a single-VPS Node.js process behind nginx, though, blue-green is the proportionate fix. Half a day of work, no new dependencies.
What changed for us, concretely
- The five-second ChunkLoadError window is gone by construction. The old pool's chunks never get touched; the new pool starts from a complete build before nginx ever sends it a request.
- Rollback is a 10-second nginx-upstream rewrite, not a rebuild.
- The next time next build evicts a critical chunk filename (and it will), nobody outside our journal will know.
We spent longer figuring out the original bug than implementing the fix. If you're running anything stateful behind nginx and your deploy is git pull && build && restart, look at what your single-500 window looks like.