Rafał Szczepanik
Scaling Next.js with Redis cache handler

Let's say you have dozens of Next.js instances in production, running in your Kubernetes cluster. Most of your pages use Incremental Static Regeneration (ISR), allowing pages to be generated and saved in file storage upon a user's first visit. Subsequent requests to the same page are served instantly from the saved version, bypassing regeneration, at least until the set revalidation period expires. Sounds good, right?

Except it does not scale very well.

Problem

The data is generated but never cleaned up. Moreover, every instance of Next.js keeps its own copy of the same data, duplicated and isolated. Here at Odrabiamy.pl, we noticed that each of our k8s instances was using up to 30 GB of storage. That is a massive amount of data for one node, but what if we have 20 nodes? That would be 600 GB of data that could easily be shared.

Possible Solutions

We tried to come up with a solution to this problem, and these were our options:

  1. Use a Kubernetes persistent volume and share the contents of the .next directory between pods, which has its cons:
    1. Every pod would have read/write access, which could cause massive problems with race conditions between pods. We would have to write our own cache handler to make sure everything stays stable.
    2. A mechanism would be needed to copy the .next directory to a shared volume during deployment and to delete it once it is no longer needed.
  2. Use Redis and the existing Next.js config options to store all the generated pages, which turned out to be perfect for us in terms of implementation time and the complexity of the solution.

Next.js and Redis

By default, Next.js uses a file-based cache handler. However, Vercel has published a new config option to customize that. To do this, we have to load a custom cache handler in our next.config.js:

// next.config.js
module.exports = {
  cacheHandler:
    process.env.NODE_ENV === 'production'
      ? require.resolve('./cache-handler.cjs')
      : undefined,
};


We only load it in the production environment, as it isn’t necessary in development mode. Now it is time to implement the cache-handler.cjs file. (Note: depending on your npm config, you might need to write this using ES modules.)
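
For reference, here is a rough sketch of what the ES-module shape could look like. It mirrors the CommonJS file built in the rest of this section, so treat the file name cache-handler.mjs and the default-style imports as assumptions rather than something prescribed by the article.

// cache-handler.mjs - a hypothetical ESM sketch; the full logic lives in the CommonJS example below
import { CacheHandler } from '@neshca/cache-handler';
import createLruCache from '@neshca/cache-handler/local-lru';

CacheHandler.onCreation(async () => {
  // The Redis setup from the CommonJS example would go here as well;
  // this minimal version only wires up the in-memory LRU fallback.
  const localCache = createLruCache({ maxItemsNumber: 10000 });
  return { handlers: [localCache] };
});

export default CacheHandler;

If you go this route, point the cacheHandler option in next.config.js at './cache-handler.mjs' instead.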

We will utilize the @neshca/cache-handler package, which is a library that comes with pre-written handlers. The plan is to:

  • Set Redis as the primary cache handler
  • As a backup, use LRU cache (Least Recently Used, in-memory cache)

The basic implementation will be as follows:

// cache-handler.cjs
const createClient = require('redis').createClient;

const CacheHandler = require('@neshca/cache-handler').CacheHandler;
const createLruCache = require('@neshca/cache-handler/local-lru').default;
const createRedisCache = require('@neshca/cache-handler/redis-strings').default;

CacheHandler.onCreation(async () => {
  const localCache = createLruCache({
    maxItemsNumber: 10000,
    maxItemSizeBytes: 1024 * 1024 * 250, // Limit to 250 MB
  });

  let redisCache;
  if (!process.env.REDIS_URL) {
    console.warn('REDIS_URL env is not set, using local cache only.');
  } else {
    try {
      const client = createClient({
        url: process.env.REDIS_URL,
      });

      client.on('error', (error) => {
        console.error('Redis error', error);
      });

      await client.connect();

      redisCache = createRedisCache({
        client,
        keyPrefix: `next-shared-cache-${process.env.NEXT_PUBLIC_BUILD_NUMBER}:`,
        // timeout for the Redis client operations like `get` and `set`
        // after this timeout, the operation will be considered failed and the `localCache` will be used
        timeoutMs: 5000,
      });
    } catch (error) {
      console.error(
        'Failed to initialize Redis cache, using local cache only.',
        error,
      );
    }
  }

  return {
    handlers: [redisCache, localCache],
    ttl: {
      // This value is also used as the revalidation time for every ISR page.
      // Env vars are strings, so convert the value to a number of seconds.
      defaultStaleAge: Number(process.env.NEXT_PUBLIC_CACHE_IN_SECONDS),
      // This makes sure that resources without a revalidation time set
      // aren't stored in Redis indefinitely.
      estimateExpireAge: (staleAge) => staleAge,
    },
  };
});

module.exports = CacheHandler;


But here is one interesting caveat: what if Redis isn't available during server start? The line await client.connect(); will fail and the page will load with a delay. Worse, Next.js will then try to initialize a new CacheHandler every time someone visits any page.

That is why we decided to use only the LRU cache in such cases. However, the solution to this problem is not trivial, as createClient doesn't throw errors; it only reports them through event callbacks. So a workaround is needed:

      ...
      let isReady = false;

      const client = createClient({
        url: process.env.REDIS_URL,
        socket: {
          // Retry every 5 seconds once we have connected at least once; give up otherwise.
          reconnectStrategy: () => (isReady ? 5000 : false),
        },
      });

      client.on('error', (error) => {
        console.error('Redis error', error);
      });

      client.on('ready', () => {
        isReady = true;
      });

      await client.connect();

      ...


This ensures that Next.js will not try to reconnect if the initial connection fails: returning false from reconnectStrategy tells the Redis client to stop reconnecting, while returning a number is the delay in milliseconds before the next attempt. In other cases, reconnection is desired and works like a charm.
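
With the handler deployed, it can be useful to check what actually lands in Redis under the build-scoped keyPrefix (the prefix contains NEXT_PUBLIC_BUILD_NUMBER, presumably so each deployment writes to its own namespace and entries from an older build are never served). The helper below is not part of the article's setup; it is a hypothetical sketch that assumes node-redis v4, where scanIterator yields individual keys.

// inspect-cache.cjs - hypothetical helper for peeking at what the cache handler stores
const { createClient } = require('redis');

async function main() {
  const client = createClient({ url: process.env.REDIS_URL });
  await client.connect();

  // The same prefix the cache handler uses, namespaced by build number.
  const prefix = `next-shared-cache-${process.env.NEXT_PUBLIC_BUILD_NUMBER}:`;

  // SCAN doesn't block Redis the way KEYS does, so it is safer to run against production.
  for await (const key of client.scanIterator({ MATCH: `${prefix}*`, COUNT: 100 })) {
    const ttl = await client.ttl(key);
    console.log(`${key} (ttl: ${ttl}s)`);
  }

  await client.quit();
}

main().catch(console.error);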

Performance and stability

Our performance tests showed that CPU usage increased by about 2%, but response times stayed the same.

At Odrabiamy, our goal is not only to have a performant solution but also to have independent infrastructure layers, so that any failure does not influence the functioning of the entire application. This is where the Least Recently Used (LRU) cache comes into play as a crucial fallback mechanism. During our performance tests, we manually terminated Redis multiple times, which resulted in zero downtime. The transition between Redis and the LRU cache was so seamless that it wasn’t even noticeable in our performance graphs.

Conclusion

In the case of multiple Next.js instances running on a Kubernetes cluster, it is worth considering replacing the default file-system-based cache with a Redis one. This can free up your storage resources without any risk or performance degradation. Setting this configuration up is quite easy, and it has already been battle-tested in our production environment.

Top comments (1)

Samuel Milton

Thanks for this great article.

Feedback: There's a missing )} in the workaround.

And two questions:

  1. How do you set the NEXT_PUBLIC_BUILD_NUMBER?
  2. If I use on-demand revalidation, how does that work if you also have local LRU caches on all pods? I guess a revalidate event would not empty the LRU caches but only the Redis one? At least not on all pods? We use Contentful and send revalidate events on entry publish; it's important that the cache is updated for all pods.

Again, thanks!