Let's say you have dozens of Next.js instances in production, running in your Kubernetes cluster. Most of your pages use Incremental Static Regeneration (ISR), allowing pages to be generated and saved in file storage upon a user's first visit. Subsequent requests to the same page are served instantly from the saved version, bypassing regeneration, at least until the set revalidation period expires. Sounds good, right?
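For context, a typical ISR page in the Pages Router opts in with `fallback: 'blocking'` and a `revalidate` window, along these lines (a minimal sketch; the fetch URL is hypothetical):

```js
// pages/posts/[id].js — minimal ISR sketch; the API URL is made up
export async function getStaticPaths() {
  // Pre-build nothing; each page is generated on its first visit
  return { paths: [], fallback: 'blocking' };
}

export async function getStaticProps({ params }) {
  const res = await fetch(`https://api.example.com/posts/${params.id}`);
  const post = await res.json();
  // Serve the cached HTML for 300 s, then regenerate in the background
  return { props: { post }, revalidate: 300 };
}

export default function Post({ post }) {
  return <h1>{post.title}</h1>;
}
```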
Except it does not scale very well.
Problem
The data is generated but never cleaned up. Moreover, every instance of Next.js keeps its own copy of the same data, duplicated and isolated. Here at Odrabiamy.pl, we noticed that each of our k8s instances was taking up to 30 GB of storage. That is a massive amount of data for one node, and what if we have 20 nodes? That would be 600 GB of data, which could easily be shared.
Possible Solutions
We tried to come up with a solution to this problem, and these were our options:
- Use a Kubernetes persistent volume and share the inside of the `.next` directory, but it has its cons:
  - Every pod would have read/write access, which could cause massive problems with race conditions between pods. We would have to write our own cache handler to make sure everything is stable.
  - A mechanism would be needed to copy the `.next` directory to a shared volume during deployment and, after it is not needed anymore, to delete it.
- Use Redis and the existing Next.js config to store all the generated pages - which turned out to be perfect for us in terms of the required time to implement and the complexity of the solution.
Next.js and Redis
By default, Next.js uses a file-based cache handler. However, Vercel has published a new config option to customize that. To do this, we have to load a custom cache handler in our `next.config.js`:
```js
cacheHandler:
  process.env.NODE_ENV === 'production'
    ? require.resolve('./cache-handler.cjs')
    : undefined,
```
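For context, here is a sketch of where that option sits in a complete `next.config.js`; the `cacheMaxMemorySize: 0` line is an optional extra that disables Next.js's default in-memory cache once a custom handler takes over:

```js
// next.config.js — minimal sketch, assuming no other custom options
/** @type {import('next').NextConfig} */
const nextConfig = {
  cacheHandler:
    process.env.NODE_ENV === 'production'
      ? require.resolve('./cache-handler.cjs')
      : undefined,
  cacheMaxMemorySize: 0, // disable the default in-memory caching
};

module.exports = nextConfig;
```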
We only load it in the production environment, as it isn't necessary in development mode. Now it is time to implement the `cache-handler.cjs` file. (Note: depending on your package configuration, you might need to write this using ES modules.)
We will utilize the `@neshca/cache-handler` package, a library that ships with pre-written handlers. The plan is to:
- Set Redis as the primary cache handler
- As a backup, use an in-memory LRU (Least Recently Used) cache
The basic implementation will be as follows:
```js
// cache-handler.cjs
const { createClient } = require('redis');
const { CacheHandler } = require('@neshca/cache-handler');
const createLruCache = require('@neshca/cache-handler/local-lru').default;
const createRedisCache = require('@neshca/cache-handler/redis-strings').default;

CacheHandler.onCreation(async () => {
  const localCache = createLruCache({
    maxItemsNumber: 10000,
    maxItemSizeBytes: 1024 * 1024 * 250, // Limit to 250 MB
  });

  let redisCache;
  if (!process.env.REDIS_URL) {
    console.warn('REDIS_URL env is not set, using local cache only.');
  } else {
    try {
      const client = createClient({
        url: process.env.REDIS_URL,
      });

      client.on('error', (error) => {
        console.error('Redis error', error);
      });

      await client.connect();

      redisCache = createRedisCache({
        client,
        keyPrefix: `next-shared-cache-${process.env.NEXT_PUBLIC_BUILD_NUMBER}:`,
        // Timeout for Redis client operations like `get` and `set`.
        // After this timeout, the operation is considered failed
        // and `localCache` is used instead.
        timeoutMs: 5000,
      });
    } catch (error) {
      console.error(
        'Failed to initialize Redis cache, using local cache only.',
        error,
      );
    }
  }

  return {
    handlers: [redisCache, localCache],
    ttl: {
      // This value is also used as the revalidation time for every ISR page
      defaultStaleAge: Number(process.env.NEXT_PUBLIC_CACHE_IN_SECONDS),
      // Makes sure that resources without a set revalidation time
      // aren't stored in Redis indefinitely
      estimateExpireAge: (staleAge) => staleAge,
    },
  };
});

module.exports = CacheHandler;
```
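To sanity-check that pages actually land in Redis, a small one-off script along these lines can list the keys under the build-scoped prefix (this helper is ours, not part of the setup above):

```js
// check-cache.cjs — hypothetical helper to inspect what Next.js stored in Redis
const { createClient } = require('redis');

async function main() {
  const client = createClient({ url: process.env.REDIS_URL });
  await client.connect();

  // Keys are namespaced by the same build-number prefix used in cache-handler.cjs
  const keys = await client.keys(
    `next-shared-cache-${process.env.NEXT_PUBLIC_BUILD_NUMBER}:*`,
  );
  console.log(`${keys.length} cached entries`, keys.slice(0, 10));

  await client.quit();
}

main().catch(console.error);
```

Note that `KEYS` scans the whole keyspace, so treat this as a debugging aid rather than something to run against a busy production Redis.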
But here is one interesting caveat: what if Redis isn't available during server start? The `await client.connect()` line will fail, and the page will load with a delay. Worse, Next.js will then try to initialize a new CacheHandler every time someone visits any page. That is why we decided to use only the LRU cache in such cases. However, the solution to this problem is not trivial, as `createClient` doesn't throw reconnection errors; it reports them only through callbacks. So a workaround is needed:
```js
// ...
let isReady = false;

const client = createClient({
  url: process.env.REDIS_URL,
  socket: {
    // Retry every 5 s once connected; give up if the first connect fails
    reconnectStrategy: () => (isReady ? 5000 : false),
  },
});

client.on('error', (error) => {
  console.error('Redis error', error);
});

client.on('ready', () => {
  isReady = true;
});

await client.connect();
// ...
```
This ensures that Next.js will not try to reconnect if the initial connection fails. In other cases, reconnection is desired and works like a charm.
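If startup latency from a slow (rather than down) Redis is a concern, one extra safeguard we can imagine, beyond what the setup above does, is capping how long the initial connect may take, for example:

```js
// Sketch only: bound the initial connection attempt so a hanging Redis
// degrades to the LRU-only path instead of delaying server startup.
const connectWithTimeout = (client, ms) =>
  Promise.race([
    client.connect(),
    new Promise((_, reject) =>
      setTimeout(
        () => reject(new Error(`Redis connect timed out after ${ms} ms`)),
        ms,
      ),
    ),
  ]);

// If this rejects, the surrounding try/catch falls back to the local cache
await connectWithTimeout(client, 1000); // 1000 ms is an arbitrary budget
```

node-redis also accepts a `socket.connectTimeout` option that limits how long establishing the connection may take, which covers part of the same concern.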
Performance and stability
Our performance tests showed that CPU usage increased by about 2%, but response times stayed the same.
At Odrabiamy, our goal is not only to have a performant solution but also to keep infrastructure layers independent, so that a failure in one layer does not affect the functioning of the entire application. This is where the Least Recently Used (LRU) cache comes into play as a crucial fallback mechanism. During our performance tests, we manually terminated Redis multiple times, with zero resulting downtime. The transition between Redis and the LRU cache was so seamless that it wasn't even noticeable in our performance graphs.
Conclusion
In the case of multiple Next.js instances running on a Kubernetes cluster, it is worth considering replacing the default file-system-based cache with a Redis-backed one. This can free up your storage resources without any risk or performance downgrade. Setting this configuration up is quite easy and has already been battle-tested in our production environment.