Filling a maintainer's "Help needed": shipping a Next.js 16 Redis cache handler


Next.js 16 split caching into two distinct handler interfaces:

  • cacheHandler (singular) — Pages Router ISR, on-demand revalidation
  • cacheHandlers (plural) — the new 'use cache' directive, cacheComponents: true

The most popular OSS Redis handler today is @fortedigital/nextjs-cache-handler@3.2.0. It declares peerDependencies.next: ">=16.1.5". But its README marks the entire plural-API column as ❌:

cacheHandlers config (plural)         ❌ Not yet supported - Help needed
'use cache' directive                 ❌ Not yet supported - Help needed
'use cache: remote' directive         ❌ Not yet supported - Help needed
'use cache: private' directive        ❌ Not yet supported - Help needed
cacheComponents                       ❌ Not yet supported - Help needed

The community attempt to fix this, PR #207, has been stalled for three months after the maintainer rejected it over a PHASE_PRODUCTION_BUILD regression. The maintainer also said in Issue #152: "Next.js does not care about any other cloud or cluster environment than Vercel", a candid acknowledgement that fortedigital's roadmap may not include this any time soon.

I had a multi-instance Next.js 16 deployment running on AWS ECS Fargate that needed all of this working today. So I built a separate small package focused on filling those gaps:

📦 @leejpsd/nextjs-cache-handler — currently 0.2.0, MIT licensed.

This post is the technical writeup — what it does, why it exists, the trap that almost shipped silently, and what live-traffic dogfood actually verified.


What it actually does

If you have a Next.js 16 app deployed across multiple containers (ECS task / Kubernetes pod / Fly.io machine), the default in-memory cache fragments per-instance. Two tasks behind one ALB will independently evaluate 'use cache' functions, write into their own local LRU, and never see each other's writes. revalidateTag('posts') only invalidates the task that received the call.

The fix Next.js documents is "register a custom cache handler that writes to a shared store". The interface is well-defined; the actual implementation has more landmines than the docs imply.
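For orientation, here is the rough shape of the singular interface. The method names (constructor options, get, set, revalidateTag) follow Next.js's documented CacheHandler API; the comment bodies are placeholders, not this package's implementation:

// cache-handler-shape.cjs (illustrative only, not the package source)
module.exports = class CacheHandler {
  constructor(options) {
    this.options = options;
  }

  async get(key) {
    // read and deserialize the entry from the shared store; null on miss
    return null;
  }

  async set(key, data, ctx) {
    // serialize and write the entry, indexing it under each tag in ctx.tags
  }

  async revalidateTag(tags) {
    // tags is a string or string[]; expire every entry indexed under them
  }
};

The plural cacheHandlers interface behind 'use cache' is a different contract, which is exactly why the two configs exist side by side.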

This package implements both interfaces in one wrapper, with a few production-driven defaults that the upstream OSS landscape currently doesn't cover.

// next.config.ts
const nextConfig = {
  cacheComponents: true,
  cacheHandler: require.resolve("./cache-incremental.cjs"),
  cacheHandlers: { default: require.resolve("./cache-components.cjs") },
};
// cache-components.cjs
const { createCacheComponentsHandler } = require("@leejpsd/nextjs-cache-handler/cache-components");
module.exports = createCacheComponentsHandler({
  client: { type: "redis", url: process.env.REDIS_URL },
  buildNamespace: process.env.DEPLOYMENT_VERSION, // auto deploy isolation
  abortTimeoutMs: 1500,
  staleWhileRevalidate: true,
  singleFlight: true, // optional, opt-in stampede protection (v0.2)
});

That's it. 'use cache', revalidateTag, updateTag, cacheLife all work. The library handles the build-time vs runtime split, the Lua-atomic tag updates, and the deploy-boundary key namespacing.
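For context, the consumer side is plain Next.js 16 and nothing package-specific. A minimal sketch, assuming the stable cacheTag/cacheLife exports from next/cache:

// app/posts/page.tsx (illustrative consumer code)
import { cacheTag, cacheLife } from "next/cache";

async function getPosts() {
  "use cache";
  cacheTag("posts");   // index this entry under the "posts" tag
  cacheLife("hours");  // standard revalidation profile
  const res = await fetch("https://example.com/api/posts");
  return res.json();
}

With the Redis handler registered, revalidateTag("posts") called on any instance invalidates the shared entry, so every task behind the ALB sees the change.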


Compatibility matrix

| Feature | this package | @fortedigital 3.2.0 | nextjs-turbo-redis-cache 1.13 |
| --- | --- | --- | --- |
| cacheHandlers config (plural) | ✅ | ❌ Help needed | ✅ since 1.11 |
| 'use cache' directive | ✅ | ❌ Help needed | ✅ since 1.11 |
| 'use cache: remote' directive | partial | ❌ Help needed | — |
| cacheComponents: true | ✅ Production-validated | ❌ Help needed | — |
| Build-phase skip (PHASE_PRODUCTION_BUILD) | ✅ default-on | ✅ (singular only) | — |
| Auto deploy isolation | BUILD_NAMESPACE, env-resolved | manual | BUILD_ID, since 1.13 |
| Lua-atomic SET+tag | ✅ Lua scripts | partial (MULTI) | partial |
| Single-flight refresh lock | ✅ opt-in (v0.2) | — | — |
| AbortSignal timeout | ✅ per-op | — | ✅ Proxy-wrapped |
| OpenTelemetry hook | onMetric (v0.2) | — | — |
| Integration tests vs real Redis | ✅ 21 scenarios (v0.2) | partial | — |
| Live-traffic dogfood report | ✅ public 24h soak | not published | not published |

(Verified 2026-05-10. Both upstream packages move quickly; please check their READMEs for the latest state.)


The trap that almost shipped silently

The most useful artifact in this whole exercise wasn't the handler implementation — it was a single landmine I tripped during dogfood deployment.

Setup: an env-var toggle in next.config.ts that flips between the in-tree handler (existing implementation) and the new library, so I could ship the library to staging behind a one-flag rollback.

// next.config.ts (the buggy version)
const useLibrary = process.env.USE_LIBRARY_HANDLER === "true";
const path = useLibrary
  ? "./lib-cache-components.cjs"
  : "./redis-handler.cjs";

const nextConfig = {
  cacheHandlers: { default: require.resolve(path) },
  // ...
};

Looks fine, right? Toggle flag, swap path, done.

I deployed this. CloudWatch confirmed USE_LIBRARY_HANDLER=true was set on the ECS task. Cache state inspection showed entries being written. But the cache key shapes were wrong — they had no BUILD_NAMESPACE prefix, which is the library's signature feature.

I added console.log("loaded") to the library wrapper. Re-deployed. Searched CloudWatch.

0 results.

The library wrapper was never being required at runtime. Despite:

  • USE_LIBRARY_HANDLER=true correctly set
  • The deploy commit hash showing the latest code
  • The library being installed in node_modules
  • The next.config.ts toggle logic being correct

What actually happened

next.config.ts is evaluated at build time. Specifically, require.resolve(...) resolves the absolute file path once during the Docker build, then bakes that resolved path into the standalone server bundle.

In the Docker build environment, USE_LIBRARY_HANDLER was not set. So:

build time:
  process.env.USE_LIBRARY_HANDLER === undefined
  → useLibrary === false
  → path = "./redis-handler.cjs"
  → require.resolve("./redis-handler.cjs") = "/abs/path/redis-handler.cjs"
  → that absolute path is what Next.js bakes into the server bundle

runtime:
  process.env.USE_LIBRARY_HANDLER === "true"  // (irrelevant — already baked)
  → Next.js loads /abs/path/redis-handler.cjs
  → the library is NEVER required

The runtime env var was completely ignored.

The fix: a request-time router module

Move the env check from next.config.ts into a dedicated module that's loaded at request time:

// cache-components-router.cjs
"use strict";
const useLibrary = process.env.USE_LIBRARY_HANDLER === "true";
module.exports = useLibrary
  ? require("./lib-cache-components.cjs")
  : require("./redis-handler.cjs");

Now next.config.ts always points at the same router file. The router reads the env var when it's actually loaded — which is when a request comes in, with the runtime environment fully populated.

// next.config.ts (fixed)
cacheHandlers: {
  default: require.resolve("./cache-components-router.cjs"),
},

Plus an outputFileTracingIncludes entry so Next.js's standalone build copies every handler file (the library wrappers AND the in-tree fallbacks) into .next/standalone/:

outputFileTracingIncludes: {
  "/**/*": [
    "./cache-components-router.cjs",
    "./incremental-router.cjs",
    "./lib-cache-components.cjs",
    "./lib-incremental-cache-handler.cjs",
    "./redis-handler.cjs",
    "./incremental-cache-handler.js",
    "./node_modules/@leejpsd/nextjs-cache-handler/**/*",
  ],
},

After this, the library activated correctly. CloudWatch logs showed the wrapper being loaded. Cache keys carried the BUILD_NAMESPACE prefix.

Why this matters

If you're writing a Next.js cache handler — or any module loaded by next.config.ts — the build-time vs runtime trap is silent in the worst possible way: no errors, no warnings, just the wrong code path at runtime. Even Next.js's own docs don't call it out clearly.
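One cheap tripwire, a habit of mine rather than a Next.js feature: log the resolved handler path while next.config.ts is being evaluated, so the Docker build output shows exactly which file got baked in.

// next.config.ts (diagnostic only)
const handlerPath = require.resolve("./cache-components-router.cjs");
// prints during `next build`; if this isn't the file you expected, you've
// caught the build-time-vs-runtime trap before it reaches production
console.log("[next.config] cacheHandlers.default =", handlerPath);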

The full writeup with diagrams is in docs/build-phase.md. It's also exactly the trap that the PR #207 maintainer review was pointing at — and which has now been resolved cleanly in this package's shouldUseRedis() helper.


What live-traffic dogfood verified

Before promoting 0.1.0-rc.1 to stable, I deployed it behind the env-var toggle described above and let production traffic flow through it on AWS ECS Fargate (multi-instance, ElastiCache Redis).

Snapshot from the validation window:

$ curl /api/cache-debug | jq '.cacheState'
{
  "entryKeys": 2,
  "tagKeys": 2,
  "tagExpirationKeys": 1,
  "incrementalEntryKeys": 9,
  "incrementalTagKeys": 1,
  "sample": "next-cache:entry:8d5a4f71c4cc:[\"build-...\"]"
}

$ curl /api/health | jq '.checks.redis'
{
  "ok": true,
  "latencyMs": 2,
  "reason": null
}

Two things to note:

  1. Cache key shapes carry the BUILD_NAMESPACE prefix (8d5a4f71c4cc is the deployment SHA). If the in-tree handler were active, keys would be next-cache:entry:["build-..."] with no namespace segment. That extra key segment is the deployment-isolation guarantee in action.
  2. Redis ping at 2ms — well within the 1500ms abortTimeoutMs budget. No timeout events recorded during the validation window.

The signals were stable enough to promote to 0.1.0.


v0.2: three differentiators

A pre-publish self-audit on 0.1.0 flagged three gaps the matrix didn't yet cover. v0.2 closes them — each one fills an area that no other Next.js Redis handler currently has:

1. Single-flight refresh lock

Stale entries in the SWR window are served instantly. With many instances all crossing the revalidate boundary at the same moment, each one independently triggers its own background refresh: N parallel re-renders for the same key, all hitting your origin at once.

singleFlight: true adds an opt-in Redis lock at the SWR boundary. The first instance to acquire it becomes the leader and runs the refresh; the rest become followers and keep serving the same stale entry. Lock acquisition uses a Lua-atomic SETNX-style script with a 10-second TTL (configurable):

-- refresh-tag-lock.lua
if redis.call('GET', KEYS[1]) then return 0 end
redis.call('SET', KEYS[1], ARGV[1], 'EX', tonumber(ARGV[2]))
return 1

Two new MetricEvent types appear on the onMetric hook so operators can verify leadership balance across the fleet:

| event | meaning |
| --- | --- |
| cache.stale.refresh.leader | this instance just acquired the lock |
| cache.stale.refresh.follower | another instance holds it; we serve stale |

If lock acquisition fails (Redis hiccup), the handler defaults to the follower path. The stale entry is always served, never dropped. The lock is an optimization, not a correctness primitive.
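In TypeScript terms, the decision at the SWR boundary looks roughly like this. Every helper name here is hypothetical, a stand-in for the internals:

// sketch of the SWR-boundary path (all helper names are hypothetical)
declare function tryAcquireRefreshLock(key: string, owner: string, ttlSec: number): Promise<boolean>;
declare function refreshInBackground(key: string): Promise<void>;
declare function onMetric(event: { type: string }): void;
declare const instanceId: string;

async function onStaleHit(key: string, entry: unknown): Promise<unknown> {
  // runs the refresh-tag-lock.lua script shown above: true = lock acquired
  const isLeader = await tryAcquireRefreshLock(`lock:${key}`, instanceId, 10);
  if (isLeader) {
    onMetric({ type: "cache.stale.refresh.leader" });
    void refreshInBackground(key); // exactly one re-render fleet-wide
  } else {
    onMetric({ type: "cache.stale.refresh.follower" });
  }
  return entry; // both paths serve the stale entry immediately
}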

2. OpenTelemetry reference adapter

The library deliberately ships zero observability dependencies — the onMetric(event) hook gives strictly-typed events you wire to whichever stack you already run.

examples/opentelemetry/ is a copy-paste reference wrapper that exposes a counter (nextjs_cache.events_total) and a histogram (nextjs_cache.op_latency_ms), both with bounded cardinality (no cache keys or tag names emitted as attributes).

Three suggested dashboards in the example README: hit-rate over time, single-flight leadership distribution, op latency p50/p95/p99.
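A minimal sketch of what such a wrapper looks like, assuming an event shape with type and optional durationMs fields (the shipped example may name things differently):

// otel-metrics.ts (sketch; the event field names are assumptions)
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("nextjs-cache-handler");
const eventsTotal = meter.createCounter("nextjs_cache.events_total");
const opLatencyMs = meter.createHistogram("nextjs_cache.op_latency_ms");

export function onMetric(event: { type: string; durationMs?: number }) {
  // only the event type becomes an attribute; never cache keys or tag
  // names, which would explode metric cardinality
  eventsTotal.add(1, { "event.type": event.type });
  if (event.durationMs !== undefined) {
    opLatencyMs.record(event.durationMs, { "event.type": event.type });
  }
}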

3. Integration tests against real Redis

72 unit tests with a MockRedisClient give fast, hermetic coverage of the spec. They don't catch:

  • redis@5 / ioredis method shape changes between minor versions
  • Lua EVAL/EVALSHA semantics on a real server
  • Cursor-based scanIterator chunk behavior (the redis@4 → redis@5 upgrade silently broke scanning in the reference deployment, surfacing only under live traffic)
  • TTL/EX behavior under real Redis time

v0.2 adds 21 integration scenarios that bring up Redis 7 in docker-compose and run the full test grid against both redis@5 and ioredis adapters. Same scenarios, swapped underlying client. CI runs them on every PR via a service container:

# .github/workflows/ci.yml
integration:
  services:
    redis:
      image: redis:7-alpine
      ports: ["6390:6379"]
  steps:
    - run: npm run test:integration
      env:
        INTEGRATION_REDIS_URL: redis://127.0.0.1:6390

The hardest bug they caught was during initial setup: a vitest transform hook combined with assetsInclude was running twice on .lua files, emitting export default "export default \"...\"", which Redis rejected with '=' expected near 'default'. A single load hook (no transform) fixed it.
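The fix, roughly sketched (the plugin name and placement are mine; the point is one load hook and no competing transform or assetsInclude rule for .lua files):

// vitest.config.ts (sketch of the single-load-hook approach)
import { readFile } from "node:fs/promises";
import { defineConfig } from "vitest/config";

export default defineConfig({
  plugins: [
    {
      name: "lua-as-string",
      async load(id) {
        if (!id.endsWith(".lua")) return null;
        const src = await readFile(id, "utf8");
        // wrap the raw Lua source exactly once as a JS string module
        return `export default ${JSON.stringify(src)};`;
      },
    },
  ],
});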


Honest limitations (v0.2)

  • The dogfood window is a starting point, not a completion criterion. Memory leaks, timer drift, and edge cases that only surface after extended uptime are not yet covered. Patch releases will accumulate live time as the package ages.
  • Redis Cluster is implemented but not load-tested at scale. The hashTag: true flag routes multi-key Lua scripts to the same slot (see the sketch after this list), but I haven't run a real-world cluster benchmark. v0.3 milestone.
  • Vercel KV / Upstash adapters ship in v0.3. Both work today via the standard redis@5 adapter against their Redis-compatible endpoints, but native adapters with edge-runtime support are scoped for v0.3.
  • Provenance attestation ships from the GitHub Actions OIDC publish path (now wired up in release.yml). The first stable was published from a local machine without provenance; v0.2.x tarballs published via the workflow will carry the verified attestation.
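On the hashTag point above, the slot routing itself is standard Redis Cluster behavior, not package-specific: the cluster hashes only the text inside {...}, so keys sharing that segment land in the same slot and one EVAL can touch all of them without a CROSSSLOT error. The exact prefixes below are illustrative assumptions:

// illustrative key shapes under hashTag: true (prefixes assumed)
const entryKey = "{next-cache}:entry:8d5a4f71c4cc:abc";
const tagKey = "{next-cache}:tag:posts";
// both hash on "next-cache", so a multi-key Lua script touching the entry
// and its tag index runs on a single cluster node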

These are spelled out in the README's Roadmap section; the goal is that "what's not yet supported" is as visible as "what is".


What I'd repeat / what I'd change

Repeat

  • Dogfood before promotion. Running the rc against my own production traffic surfaced the build-time-vs-runtime trap before it could embarrass me publicly. The dogfood plan (docs/staging-dogfood.md) is the single most useful piece of project hygiene I added.
  • Frozen spec snapshot in repo. I copied Next.js's official cacheHandlers spec verbatim into docs/next16-spec.md before writing any handler code. CI prints its sha256 on every run so spec drift gets noticed before it bites.
  • Compatibility matrix with timestamp. Both upstream packages I compare against are evolving. Putting "verified 2026-05-10" on the matrix is the difference between an honest snapshot and a future lie.

Change

  • Should have started with outputFileTracingIncludes from day one. I burned half a day on the build-phase trap because Next.js's output: standalone quietly strips files that aren't transitively required by the build-time code path. If you're building anything that gets loaded via require.resolve() from next.config.ts, pin it explicitly.
  • Should have shipped Redis Cluster load test results before claiming "Redis Cluster ✅" in the matrix. The current matrix line says "unit-tested, not yet load-tested at scale" — honest, but only because I caught the gap during the pre-publish self-audit. Future me writes the load test first.

Try it

npm install @leejpsd/nextjs-cache-handler redis
# or
npm install @leejpsd/nextjs-cache-handler ioredis

Wiring is two CommonJS wrapper files plus a next.config.ts toggle. The full quick-start is in the README.

Most useful entry points if you're considering it: the README quick-start, docs/build-phase.md (the build-phase trap), docs/staging-dogfood.md (the dogfood plan), and examples/opentelemetry/ (the metrics reference adapter).

Issues, PRs, and feedback — especially on Redis Cluster behavior under real load — are all welcome at github.com/leejpsd/nextjs-cache-handler.


Disclaimer: I'm not affiliated with @fortedigital, @neshca, or nextjs-turbo-redis-cache. The compatibility matrix is verified against their public READMEs as of 2026-05-10; all three projects move quickly and snapshots can go stale. The maintainers of all three deserve credit — the patterns I built on came directly from reading their source.
