Next.js ISR works great on a single pod. But the moment you scale to multiple replicas — whether on Kubernetes, ECS, Cloud Foundry, or any orchestrator — you get a hidden efficiency problem: every pod independently regenerates the same pages, writes duplicate cache entries, and hammers your backend on every restart.
We run a high-traffic Next.js app serving tens of thousands of requests per minute across multiple pods (Kubernetes Deployment). After switching ISR's cache backend from local disk to Redis, we cut backend regeneration calls by 90%, eliminated cold-start storms, and reduced cache memory by 77% with gzip.
This article focuses on the Redis caching efficiency side:
- How duplicate writes happen and how to prevent them
- Compression strategy that saves 77% memory at <1ms cost
- Lock-based write deduplication
- Cold start prevention via route injection
- Load test numbers: disk vs Redis vs Redis+gzip
The Duplicate Write Problem
With disk-based ISR cache, each pod is an island:
```
Load Balancer / K8S Service
├── Pod 0 → local disk cache (ephemeral, independent)
├── Pod 1 → local disk cache (ephemeral, independent)
├── Pod 2 → local disk cache (ephemeral, independent)
└── Pod 7 → local disk cache (ephemeral, independent)
```
When /courses/javascript-101 is requested, the pod that handles it generates the page and writes to its own ephemeral disk. The other 7 pods don't know. Next request to the same page on a different pod? Full regeneration again. That's 8x the backend calls, 8x the compute, 8x the cache storage.
With Redis as a shared cache:
```
Load Balancer / K8S Service
├── Pod 0 ──┐
├── Pod 1 ──┤
├── Pod 2 ──┼──► Redis (single shared cache)
└── Pod 7 ──┘
```
One pod generates, one write to Redis, all 8 pods read from it. But just swapping to Redis isn't enough — you still need to handle concurrent writes and cold start storms.
Setting Up Redis as ISR Cache Backend
We used @neshca/cache-handler to replace Next.js's default disk cache with Redis.
```javascript
// cache-handler.mjs
import { CacheHandler } from '@neshca/cache-handler';
import createRedisHandler from '@neshca/cache-handler/redis-stack';
import createLruHandler from '@neshca/cache-handler/local-lru';
import { createClient } from 'redis';

CacheHandler.onCreation(async () => {
  let client;
  try {
    client = createClient({
      url: process.env.REDIS_URL,
      socket: {
        connectTimeout: 5000,
        reconnectStrategy: (retries) => Math.min(retries * 100, 5000),
      },
    });
    client.on('error', () => {
      // Silence connection errors — they flood logs in prod
    });
    await client.connect();
  } catch (error) {
    console.warn('Redis unavailable, falling back to LRU cache');
  }

  if (client?.isReady) {
    const handler = await createRedisHandler({
      client,
      keyPrefix: 'isr:',
      timeoutMs: 1000,
    });
    return { handlers: [handler] };
  }

  // Fallback: in-memory LRU (single-instance only)
  return { handlers: [createLruHandler()] };
});

export default CacheHandler;
```
Wire it in next.config.js:
```javascript
module.exports = {
  cacheHandler: require.resolve('./cache-handler.mjs'),
  cacheMaxMemorySize: 0, // Disable default in-memory cache, Redis only
};
```
This gets you shared caching. But two problems remain: write efficiency and cold start storms.
Write Deduplication: Redis Locks
When multiple pods receive requests for the same uncached page simultaneously, they all try to regenerate and write to Redis. This is wasteful — only one write is needed.
We use a simple Redis lock to ensure only one pod writes a given page at a time:
```javascript
async function acquireLock(client, key, ttlMs = 30000) {
  const lockKey = `lock:${key}`;
  const result = await client.set(lockKey, '1', {
    NX: true, // Only set if key doesn't exist
    PX: ttlMs, // Auto-expire after 30s (crash recovery)
  });
  return result === 'OK';
}

async function releaseLock(client, key) {
  await client.del(`lock:${key}`);
}
```
The flow becomes:
```
Pod 0: GET /page → cache miss → acquireLock("page") → OK   → regenerate → write Redis → releaseLock
Pod 1: GET /page → cache miss → acquireLock("page") → FAIL → wait/serve stale → read from Redis
Pod 2: GET /page → cache miss → acquireLock("page") → FAIL → wait/serve stale → read from Redis
```
One regeneration instead of 8. The 30-second TTL on the lock is critical: if the pod crashes mid-regeneration, the lock auto-releases so other pods aren't stuck waiting forever.
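Putting the lock and the cache write together, the whole flow can be wrapped in one helper. A minimal sketch — `regenerateOnce` and its polling strategy are illustrative, not our exact handler code; `client` is any node-redis v4 client:

```javascript
// Run `regenerate` under a Redis NX lock. The winner regenerates and writes;
// losers poll Redis for the winner's result instead of regenerating themselves.
async function regenerateOnce(client, key, regenerate, { lockTtlMs = 30000, pollMs = 100 } = {}) {
  const lockKey = `lock:${key}`;
  const gotLock = (await client.set(lockKey, '1', { NX: true, PX: lockTtlMs })) === 'OK';

  if (gotLock) {
    try {
      const page = await regenerate();
      await client.set(key, JSON.stringify(page));
      return page;
    } finally {
      await client.del(lockKey);
    }
  }

  // Lost the race: wait for the winner to publish the entry, then read it.
  for (let waited = 0; waited < lockTtlMs; waited += pollMs) {
    const cached = await client.get(key);
    if (cached !== null) return JSON.parse(cached);
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }

  // Winner crashed and the lock expired — fall back to regenerating ourselves.
  return regenerate();
}
```

The final fallback matters: without it, a crashed winner would leave losers serving nothing once the lock's TTL expires.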
On-Demand Revalidation Dedup
ISR supports on-demand revalidation via res.revalidate(path) — triggered by CMS webhooks when content changes. In a multi-pod setup, a webhook hits one random pod. That pod revalidates and writes to Redis, but the others don't know.
We solved this with Redis Pub/Sub where a single designated pod handles all revalidation:
```javascript
// Determine pod identity
// K8S StatefulSet: pod name ends with ordinal index (e.g., app-0, app-1)
// Also works on CF (CF_INSTANCE_INDEX), ECS, PM2, etc.
function getPodIndex() {
  // K8S StatefulSet — most common
  if (process.env.HOSTNAME) {
    const match = process.env.HOSTNAME.match(/-(\d+)$/);
    if (match) return parseInt(match[1], 10);
  }
  // Cloud Foundry
  if (process.env.CF_INSTANCE_INDEX) return parseInt(process.env.CF_INSTANCE_INDEX, 10);
  // PM2 cluster
  if (process.env.NODE_APP_INSTANCE) return parseInt(process.env.NODE_APP_INSTANCE, 10);
  return 0;
}

const POD_INDEX = getPodIndex();

if (POD_INDEX === 0) {
  // `client` is the connected Redis client from the cache handler setup
  const subscriber = client.duplicate();
  await subscriber.connect();

  await subscriber.subscribe('revalidate', async (message) => {
    const { path } = JSON.parse(message);
    const locked = await acquireLock(client, `reval:${path}`, 10000);
    if (!locked) return; // Another revalidation already in progress

    try {
      // No `res` object exists in this context — trigger regeneration through
      // this pod's own revalidate API route (one that calls res.revalidate;
      // adjust the path to your setup)
      await fetch(
        `http://localhost:${process.env.PORT || 3000}/api/revalidate?path=${encodeURIComponent(path)}`
      );
      console.log(`Revalidated: ${path}`);
    } catch (err) {
      console.error(`Failed to revalidate ${path}:`, err);
    } finally {
      await releaseLock(client, `reval:${path}`);
    }
  });
}
```
Any pod receiving the webhook publishes to the revalidate channel. Pod 0 picks it up, regenerates once, writes to Redis. All pods serve the updated page.
Note: Pod 0 as sole subscriber is a simplification. In production, we later moved revalidation to a dedicated worker pod that doesn't serve user traffic. More on that in the lessons section.
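The publisher side isn't shown above. Any pod that receives the webhook just validates the payload and publishes; the subscriber pod does the actual work. A sketch — the route path, handler factory, and helper names are illustrative:

```javascript
// Validate a revalidation request and publish it on the `revalidate` channel.
// `client` is any node-redis v4 client (or a compatible stub).
async function publishRevalidate(client, path) {
  if (typeof path !== 'string' || !path.startsWith('/')) {
    throw new Error(`invalid path: ${path}`);
  }
  await client.publish('revalidate', JSON.stringify({ path }));
  return path;
}

// Illustrative pages-router API route (e.g. pages/api/webhook-revalidate.js).
// Any pod can serve this; only the subscriber pod regenerates.
function makeWebhookHandler(client) {
  return async function handler(req, res) {
    if (req.method !== 'POST') return res.status(405).end();
    try {
      const path = await publishRevalidate(client, (req.body || {}).path);
      res.status(202).json({ queued: path });
    } catch {
      res.status(400).json({ error: 'path required' });
    }
  };
}
```

Returning 202 (accepted) rather than 200 is deliberate: the webhook response only confirms the message was queued, not that regeneration finished.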
Compression: 77% Memory Savings at <1ms Cost
Cached ISR pages are surprisingly large — HTML + JSON data routes. In our case, 50–200KB per page. With thousands of pages × locale variants × HTML + JSON versions, Redis memory grows fast.
We added gzip compression at level 1 (fastest compression, still great ratio):
```javascript
import { promisify } from 'util';
import zlib from 'zlib';

const gzip = promisify(zlib.gzip);
const gunzip = promisify(zlib.gunzip);

const GZIP_LEVEL = parseInt(process.env.CACHE_GZIP_LEVEL || '1', 10);

// Before writing to Redis
async function compress(data) {
  const json = JSON.stringify(data);
  return gzip(Buffer.from(json), { level: GZIP_LEVEL });
}

// After reading from Redis
async function decompress(buffer) {
  const decompressed = await gunzip(buffer);
  return JSON.parse(decompressed.toString());
}
```
Why Level 1?
| Gzip Level | Compression Ratio | Time per Page |
|---|---|---|
| 1 (fast) | ~77% reduction | <1ms |
| 6 (default) | ~82% reduction | ~3ms |
| 9 (max) | ~84% reduction | ~8ms |
Level 1 gives you most of the savings with almost zero CPU cost. The extra 5-7% from higher levels isn't worth the latency in a cache hot path.
Memory Impact
| Metric | No Gzip | Gzip Level 1 |
|---|---|---|
| Avg page size in Redis | 120KB | 28KB |
| Redis memory (10K pages) | ~1.2GB | ~280MB |
| Redis memory (50K pages) | ~6GB | ~1.4GB |
We started with a small Redis instance in production. Within a week, it was full. After adding gzip, we had headroom. We eventually scaled Redis capacity significantly with allkeys-lru eviction for long-term stability.
Tip: `allkeys-lru` eviction is perfect for ISR caching. Popular pages stay cached, cold pages get evicted automatically. Don't use `noeviction` — you'll get write failures when memory is full.
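Setting the policy is a one-liner; the `maxmemory` value below is just an example — size it to your instance:

```shell
# Apply at runtime (persist both settings in redis.conf to survive restarts)
redis-cli CONFIG SET maxmemory 2gb
redis-cli CONFIG SET maxmemory-policy allkeys-lru
```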
Cold Start Prevention: The Regeneration Storm
This is the most underreported problem with multi-pod ISR. On every pod restart — rolling deployment, crash recovery, HPA scaling event — every first request to every URL on that pod triggers backend regeneration, even if the page exists in Redis.
Why? Next.js maintains an in-memory route table for dynamic routes. On restart, this table is empty. The framework doesn't know which routes exist, so it treats every incoming request as a brand-new dynamic route and regenerates from scratch.
The Scale of the Problem
| Event | Without Fix | With Fix |
|---|---|---|
| Rolling deploy (8 pods restart) | ~12,000 regeneration calls in first 5 min | ~0 |
| HPA scale up (add 2 pods) | ~3,000 regeneration calls | ~0 |
| Single pod crash (OOMKilled, etc.) | ~1,500 regeneration calls | ~0 |
With disk cache, every rolling deployment is a mini-DDoS on your own backend.
The Fix: Route Injection into prerender-manifest.json
After next build, Next.js generates .next/prerender-manifest.json — an internal file that lists known routes. If a route is in this file, Next.js checks the cache first instead of regenerating.
The fix: inject your known URLs into this file after build:
```javascript
// scripts/inject-routes.mjs
import fs from 'fs';
import path from 'path';

const MANIFEST_PATH = path.join(process.cwd(), '.next', 'prerender-manifest.json');
const REVALIDATE_SECONDS = 86400; // 1 day

async function getKnownUrls() {
  // Read from sitemap, Redis cache keys, CMS API, or a cached URL list
  const sitemap = await fetch('https://your-site.com/sitemap.xml');
  return parseSitemap(await sitemap.text()); // parseSitemap: your own XML → paths helper
}

async function injectRoutes() {
  const manifest = JSON.parse(fs.readFileSync(MANIFEST_PATH, 'utf-8'));
  const knownUrls = await getKnownUrls();
  const buildId = fs
    .readFileSync(path.join(process.cwd(), '.next', 'BUILD_ID'), 'utf-8')
    .trim();

  for (const url of knownUrls) {
    if (!manifest.routes[url]) {
      manifest.routes[url] = {
        initialRevalidateSeconds: REVALIDATE_SECONDS,
        srcRoute: url,
        dataRoute: `/_next/data/${buildId}${url === '/' ? '/index' : url}.json`,
      };
    }
  }

  fs.writeFileSync(MANIFEST_PATH, JSON.stringify(manifest, null, 2));
  console.log(`Injected ${knownUrls.length} routes into prerender-manifest.json`);
}

await injectRoutes();
```
Add to your build pipeline:
```json
{
  "scripts": {
    "build": "next build && node scripts/inject-routes.mjs"
  }
}
```
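The `parseSitemap` helper in the injection script is left to the reader. A minimal sketch that pulls pathnames out of `<loc>` entries (a real implementation should use a proper XML parser):

```javascript
// Minimal sitemap parser: extract pathnames from <loc> entries.
function parseSitemap(xml) {
  const urls = [];
  const re = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match;
  while ((match = re.exec(xml)) !== null) {
    urls.push(new URL(match[1]).pathname);
  }
  return urls;
}
```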
Warning: `prerender-manifest.json` is an internal Next.js implementation detail, not a public API. We've tested it across Next.js 13.5–14.x and the structure has been stable, but verify after every upgrade.
Load Test Results
1,000 concurrent users, multiple pods, 30 minutes.
Response Times
| Metric | Disk Cache | Redis (no gzip) | Redis (gzip-1) |
|---|---|---|---|
| Avg response time | 320ms | 180ms | 185ms |
| P95 response time | 1,200ms | 420ms | 430ms |
| P99 response time | 3,500ms | 850ms | 880ms |
Efficiency
| Metric | Disk Cache | Redis (gzip-1) |
|---|---|---|
| Backend regeneration calls | 12,400 | 1,200 |
| Cache hit rate | 62% | 96% |
| Duplicate writes | ~8x per page | 1x per page (locked) |
| Pod restart recovery | ~45s (cold regen) | <1s (warm from Redis) |
Key findings:
- 90% fewer backend calls — shared cache + locks mean one regeneration serves all pods
- Gzip adds ~5ms average end-to-end latency (compression itself is <1ms per page) but saves 77% Redis memory — a no-brainer at level 1
- Restart recovery is instant — Redis persists across pod lifecycle
- P99 improves 4x — the cold-start + cache-miss worst case nearly disappears
Production Lessons
1. prerender-manifest.json is fragile
Not a public API. Test after every Next.js upgrade. Pin your Next.js version in production. If injection fails, worst case is the old cold-start behavior — degraded performance, not broken functionality.
2. Don't overload your revalidation pod
We initially had Pod 0 handle both user traffic AND pub/sub revalidation. It ended up processing tens of thousands of slow requests and significant timeouts because it was doing double duty. The fix: move revalidation to a dedicated worker pod (a separate Deployment with 1 replica) that doesn't serve user traffic.
3. Set revalidation TTL to days, not hours
With on-demand revalidation handling content changes, short time-based TTLs are unnecessary. We use revalidate: 604800 (7 days). Content changes trigger immediate on-demand revalidation; the long TTL is just a safety net.
4. Monitor Redis memory from day one
Pages × compression × locale variants × JSON + HTML versions = more than you expect. We scaled Redis capacity significantly over the first month. Use allkeys-lru and set up memory alerts.
5. This works on any multi-pod/multi-instance platform
The architecture is platform-agnostic. The only requirement: each pod needs a way to determine its ordinal index, and all pods must reach the same Redis.
| Platform | Pod/Instance Identity | Notes |
|---|---|---|
| Kubernetes (StatefulSet) | `HOSTNAME` → `app-0`, `app-1` | Most common. Use StatefulSet for stable ordinals |
| Kubernetes (Deployment) | Use leader election or dedicated worker pod | Deployment pods have random names |
| AWS ECS | Task metadata endpoint | Works with Fargate and EC2 launch types |
| Cloud Foundry | `CF_INSTANCE_INDEX` | Built-in ordinal index |
| Docker Compose | Custom `INSTANCE_ID` env var | Set per service replica |
| PM2 Cluster | `NODE_APP_INSTANCE` | Built-in worker index |
Architecture Summary
```
        ┌─────────────┐
        │ CMS Webhook │
        └──────┬──────┘
               │
               ▼
    ┌──────────────────────┐
    │   K8S Service / LB   │
    └──────────┬───────────┘
               │
   ┌───────┬───┴───┬───────┬───────────┐
   ▼       ▼       ▼       ▼           ▼
 Pod 0   Pod 1   Pod 2    ...        Pod 7
(+ pub/sub │       │                   │
subscriber)│       │                   │
   │       │       │                   │
   └───────┴───────┴───────────────────┘
               │
               ▼
      ┌────────────────┐
      │     Redis      │
      │                │
      │ • ISR cache    │
      │ • Write locks  │
      │ • Pub/sub      │
      │ • Gzip-1       │
      └────────────────┘
      Policy: allkeys-lru
```
TL;DR
| Problem | Solution | Impact |
|---|---|---|
| 8 pods each regenerate the same page | Redis shared cache | 90% fewer backend calls |
| Concurrent writes for same page | Redis NX lock (30s TTL) | 1 write instead of 8 |
| Cached pages eat Redis memory | Gzip level 1 compression | 77% memory reduction, <1ms cost |
| Cold start regeneration storm on deploy | Route injection into prerender-manifest.json | ~0 regeneration on pod restart |
| Revalidation webhook hits random pod | Pub/Sub → single subscriber pod | Consistent cache invalidation |
Note: Details have been generalized to focus on architectural patterns rather than any specific production system.