Every high-traffic system eventually hits the same wall: your data store can't keep up. For us, the breaking point came when a simple product lookup — backed by Elasticsearch — started showing tail latencies creeping past 80ms. At scale, that's the kind of number that keeps you up at night.
The solution wasn't a faster cluster. It was rethinking where data lives before it ever reaches Elasticsearch at all. This post walks through the two-level caching strategy we built using Caffeine as an in-process L1 cache and Redis as a distributed L2 cache, with Elasticsearch sitting behind as the source of truth.
The Problem with Single-Layer Caching
Most teams reach for Redis the moment they need a cache. It's fast, it's familiar, and it works. But Redis still lives over the network. Even on a low-latency internal network, you're paying 1–5ms per hop. Do that a few thousand times per second across many services, and it adds up.
The other option — caching inside the application process using something like Caffeine — gives you sub-millisecond reads, but it doesn't survive restarts, it doesn't share state between instances, and it can balloon your heap if you're not careful.
Neither option alone was good enough. What we needed was both.
The Architecture: Three Layers, One Request Path
The lookup flow works like this:
- A request arrives. We check Caffeine first. If the key exists in the local heap, we return immediately — no network call, no serialisation, typically under 0.1ms.
- On a Caffeine miss, we check Redis. If Redis has the value, we return it and asynchronously backfill Caffeine so the next request doesn't pay the Redis cost again.
- On a Redis miss, we hit Elasticsearch. We fetch the result, write it back into both Redis and Caffeine, and return the value to the caller.
Each layer is a safety net for the one above it.
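Stripped of frameworks, that read path fits in a few lines of plain Java. This is a dependency-free sketch (HashMaps stand in for Caffeine, Redis, and Elasticsearch; all names are illustrative, not from the real service):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Minimal sketch of the L1 -> L2 -> L3 read path.
// Plain maps stand in for Caffeine, Redis, and Elasticsearch.
public class TwoLevelLookupSketch {
    final Map<String, String> l1 = new HashMap<>();            // in-process cache
    final Map<String, String> l2 = new HashMap<>();            // distributed cache
    final Map<String, String> sourceOfTruth = new HashMap<>(); // Elasticsearch

    public Optional<String> get(String key) {
        String v = l1.get(key);                  // L1: no network call
        if (v != null) return Optional.of(v);

        v = l2.get(key);                         // L2: one network round trip
        if (v != null) {
            l1.put(key, v);                      // backfill L1 for the next read
            return Optional.of(v);
        }

        v = sourceOfTruth.get(key);              // L3: the expensive lookup
        if (v != null) {
            l2.put(key, v);                      // backfill both cache layers
            l1.put(key, v);
            return Optional.of(v);
        }
        return Optional.empty();
    }
}
```

Every miss falls through one level and repopulates the levels above it on the way back up — the shape the real implementation below follows.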
Why Caffeine for L1?
Caffeine is the go-to in-process cache for the JVM. It's built on a variant of the W-TinyLFU eviction policy, which gives it near-optimal hit rates in practice. It supports:
- Time-based expiry — both after write and after last access
- Size-based eviction — bounds by entry count or byte weight
- Async loading — blocking only the first thread for a cold key, queuing subsequent requests
For our use case, we set a short TTL (30–60 seconds) and a bounded size per service instance. The goal isn't to cache everything — it's to absorb the head of your access distribution: the hot keys that every instance sees repeatedly.
```java
// CaffeineConfig.java
@Bean
public Cache<String, Product> caffeineProductCache() {
    return Caffeine.newBuilder()
            .expireAfterWrite(Duration.ofSeconds(ttlSeconds)) // default 30s
            .maximumSize(maxSize)                             // default 5,000
            .recordStats() // exposes hit rate to Micrometer / Prometheus
            .build();
}
```
recordStats() wires Caffeine's internal hit/miss counters into Micrometer — your hit rate shows up in Prometheus or Cloud Monitoring with zero extra code.
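Conceptually, the counters behind `recordStats()` boil down to something like this hypothetical stand-in (Caffeine's real bookkeeping is more elaborate; this only shows what the exported hit-rate metric means):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the counters recordStats() maintains.
public class CacheStatsSketch {
    final AtomicLong hits = new AtomicLong();
    final AtomicLong misses = new AtomicLong();

    void recordHit()  { hits.incrementAndGet(); }
    void recordMiss() { misses.incrementAndGet(); }

    // Fraction of lookups served from cache — the number you watch in Prometheus.
    double hitRate() {
        long h = hits.get(), m = misses.get();
        long total = h + m;
        return total == 0 ? 1.0 : (double) h / total;
    }
}
```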
Why Redis for L2?
Redis bridges the gap between the ephemeral local cache and the durable source of truth. Its role here is twofold: it survives application restarts, and it's shared across all service instances. When a new pod spins up and Caffeine is cold, Redis absorbs the load that would otherwise spike straight to Elasticsearch.
We use Redis with a longer TTL than Caffeine — typically 5–15 minutes depending on the data domain — and we're careful about serialisation. We lean on a compact binary format (MessagePack) rather than JSON to reduce memory footprint and deserialisation cost.
```java
// RedisConfig.java — MessagePack gives ~32% smaller payloads vs JSON
@Bean
public ObjectMapper msgpackObjectMapper() {
    return new ObjectMapper(new MessagePackFactory())
            .registerModule(new JavaTimeModule())
            .disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);
}

@Bean
public RedisTemplate<String, Product> redisTemplate(
        RedisConnectionFactory connectionFactory,
        ObjectMapper msgpackObjectMapper) {
    // Target the concrete type: a serializer aimed at Object.class (with no
    // type info in the payload) would deserialize into a Map, not a Product,
    // and silently break the instanceof check in the lookup path.
    var serializer = new Jackson2JsonRedisSerializer<>(msgpackObjectMapper, Product.class);
    var template = new RedisTemplate<String, Product>();
    template.setConnectionFactory(connectionFactory);
    template.setKeySerializer(new StringRedisSerializer());
    template.setValueSerializer(serializer);
    template.afterPropertiesSet();
    return template;
}
```
One thing worth being explicit about: Redis is not a replacement for your data store. It is a cache. Plan your TTL strategy accordingly and make sure your invalidation logic is correct.
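One concrete piece of that TTL planning: keys written at the same moment (a deploy, a bulk import, a flush) also expire at the same moment and stampede Elasticsearch together. A common mitigation (not shown in the original setup, just a sketch) is to add random jitter to each TTL before writing to Redis:

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

public class TtlJitter {
    // Spread expiry over [base, base + base * jitterFraction) so keys
    // cached together don't all expire (and miss) together.
    public static Duration withJitter(Duration base, double jitterFraction) {
        long extraMs = (long) (base.toMillis() * jitterFraction
                * ThreadLocalRandom.current().nextDouble());
        return base.plusMillis(extraMs);
    }
}
```

With a 10-minute base and 20% jitter, expiries spread across a 2-minute window instead of landing on one tick.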
On GCP, point REDIS_HOST at your Memorystore for Redis instance IP — fully managed, private VPC, automatic failover. It's a drop-in replacement, no code changes needed.
The Cache-Aside Pattern in Code
The core logic is where everything comes together:
```java
// ProductCacheService.java
public LookupResult getProduct(String id) {
    long start = System.currentTimeMillis();

    // L1 — Caffeine
    Product cached = caffeineCache.getIfPresent(id);
    if (cached != null) {
        return LookupResult.fromCaffeine(cached, elapsed(start));
    }

    // L2 — Redis
    Object redisRaw = redisTemplate.opsForValue().get(redisKey(id));
    if (redisRaw instanceof Product redisProduct) {
        backfillCaffeineAsync(id, redisProduct); // don't block the caller
        return LookupResult.fromRedis(redisProduct, elapsed(start));
    }

    // L3 — Elasticsearch
    return productRepository.findById(id)
            .map(esProduct -> {
                backfillRedis(id, esProduct);     // TTL 10 min
                caffeineCache.put(id, esProduct); // warm L1 too
                return LookupResult.fromElasticsearch(esProduct, elapsed(start));
            })
            .orElseGet(() -> LookupResult.notFound(elapsed(start)));
}
```
The key discipline here is that every miss at a higher layer backfills that layer on the way back. This is the cache-aside (lazy-loading) pattern, and it keeps your caches warm without requiring a separate warming process.
Notice that the Caffeine backfill after an L2 hit is async — the caller doesn't wait. If it fails, the next request just hits Redis again. No correctness issue.
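A minimal version of that fire-and-forget backfill, assuming a ConcurrentMap standing in for the Caffeine cache (the method name matches the snippet above; everything else is illustrative):

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class AsyncBackfillSketch {
    final Map<String, String> l1 = new ConcurrentHashMap<>(); // stand-in for Caffeine

    // Fire-and-forget: the caller returns immediately with the Redis value.
    // If the async write fails, the next request simply hits Redis again.
    CompletableFuture<Void> backfillCaffeineAsync(String id, String value) {
        return CompletableFuture.runAsync(() -> l1.put(id, value));
    }
}
```

In production you would hand `runAsync` a bounded executor rather than the common ForkJoin pool, so a flood of backfills can't starve other async work.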
The LookupResult record tells the caller which layer answered, which makes warm-up visible in real time:
```java
public record LookupResult(Product product, String servedBy, long latencyMs) {
    public static LookupResult fromCaffeine(Product p, long ms) {
        return new LookupResult(p, "L1_CAFFEINE", ms);
    }
    public static LookupResult fromRedis(Product p, long ms) {
        return new LookupResult(p, "L2_REDIS", ms);
    }
    public static LookupResult fromElasticsearch(Product p, long ms) {
        return new LookupResult(p, "L3_ELASTICSEARCH", ms);
    }
    public static LookupResult notFound(long ms) {
        return new LookupResult(null, "NOT_FOUND", ms);
    }
}
```
Try it yourself — watch the layer flip as the cache warms up:
```bash
docker-compose up -d
./mvnw spring-boot:run

# Create a product
curl -X POST http://localhost:8080/api/products \
  -H 'Content-Type: application/json' \
  -d '{"id":"p1","name":"Widget Pro","category":"widgets","price":9.99,"inStock":true}'

# First call — cold cache
curl http://localhost:8080/api/products/p1
# → {"servedBy":"L3_ELASTICSEARCH","latencyMs":45}

# Second call — Caffeine is warm
curl http://localhost:8080/api/products/p1
# → {"servedBy":"L1_CAFFEINE","latencyMs":0}
```
Invalidation: The Hard Part
Cache invalidation in a two-level setup is where most teams get caught out. You now have two places holding potentially stale data, and they expire on different schedules.
Our approach has three layers:
- TTL-based expiry as the baseline. Short TTLs in Caffeine mean local caches self-heal quickly. Longer TTLs in Redis reduce ES load for moderately static data.
- Event-driven invalidation for critical updates. When a product is updated, we publish an event. Each service instance subscribes and evicts the key from both Caffeine and Redis immediately, shrinking the staleness window from a full TTL to milliseconds when it matters.
```java
public void invalidate(String id) {
    caffeineCache.invalidate(id);                           // local L1
    redisTemplate.delete(redisKey(id));                     // shared L2
    redisTemplate.convertAndSend(INVALIDATION_CHANNEL, id); // notify all pods
}
```
- Redis pub/sub for cross-instance Caffeine invalidation. A local eviction only clears one instance's heap. We use a lightweight Redis pub/sub channel so that an invalidation event on one instance propagates the eviction to all running instances within milliseconds.
```java
// CacheInvalidationListener.java — runs on every pod
@Override
public void onMessage(Message message, byte[] pattern) {
    // Decode explicitly rather than relying on the platform default charset
    String productId = new String(message.getBody(), StandardCharsets.UTF_8);
    productCacheService.evictLocalCaffeine(productId);
}
```
Without this, a product update on pod A evicts A's Caffeine but pods B and C serve stale data for up to 30 seconds. With it, eviction propagates to all pods within milliseconds at near-zero cost.
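One refinement worth considering (our assumption, not part of the code above): tag each message with the publishing instance's ID so a pod can skip the eviction it already performed locally. A sketch of a hypothetical `instanceId:productId` wire format:

```java
// Hypothetical message format "instanceId:productId" so the publishing pod
// can ignore its own invalidation events (it already evicted locally).
public class InvalidationMessage {
    public static String encode(String instanceId, String productId) {
        return instanceId + ":" + productId;
    }

    // Returns the productId to evict, or null if this pod published the event.
    public static String decode(String message, String localInstanceId) {
        int sep = message.indexOf(':');
        String origin = message.substring(0, sep);
        String productId = message.substring(sep + 1);
        return origin.equals(localInstanceId) ? null : productId;
    }
}
```

The self-eviction is harmless either way, so this only saves a redundant cache operation, but it makes per-pod invalidation metrics cleaner.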
Results
After rolling this out across our product lookup service:
- P50 latency dropped from ~15ms to ~0.3ms for hot keys
- Elasticsearch request volume fell by over 70% during peak traffic
- Cache hit rate across both layers held above 95% for our access pattern
The architecture is not novel — most high-scale systems use some variant of this. But the implementation details matter, and getting those details right in production is where the real engineering lives.
When Not to Use This Pattern
Two-level caching adds operational complexity. Before reaching for it, ask yourself:
- Is your data highly cacheable? Frequently changing data will see poor hit rates and risk serving stale values.
- Do you have strong consistency requirements? If stale reads are unacceptable, caching may not be appropriate at all without careful invalidation guarantees.
- Are you actually bottlenecked at the data layer? Profile first. Premature caching is its own kind of technical debt.
If your data churns quickly, stale reads are unacceptable, or the data layer isn't actually your bottleneck, this pattern will likely cause more pain than it relieves.
Final Thoughts
The two-level cache pattern is one of the highest-leverage architectural moves you can make for read-heavy systems. Caffeine keeps your hottest data at heap speed. Redis absorbs cross-instance and restart volatility. Elasticsearch stays your source of truth without being beaten to death by repetitive reads.
The full working implementation — including config, tests, docker-compose, and GCP deployment notes — is all in the repo:
👉 https://github.com/lalithaGovada/two-level-cache-demo

Top comments (1)
the pub/sub invalidation piece is underrated. most L1+L2 setups I've seen get the happy path right but miss this -- you end up with pods serving stale data for the full TTL window after an update. using redis pub/sub to propagate local Caffeine evictions across instances is a neat solution to that. have you run into any issues with the pub/sub listener lagging under high update rates?