
Young Gao

Posted on • Originally published at younggao.hashnode.dev

Building a High-Performance Cache Layer in Go

Your service is slow. You add Redis. It gets faster. Then Redis becomes the bottleneck -- every request still makes a network round-trip, serialization costs add up, and under load you start seeing latency spikes from connection pool contention.

Sound familiar? In this article, we'll build a two-tier cache layer in Go that combines a local in-memory cache with Redis, prevent cache stampedes using singleflight, and discuss the production considerations that separate a toy cache from a battle-tested one.

Why Not Just Redis?

Redis is excellent. But it's still a network hop away. For a typical service:

| Operation | Latency |
|---|---|
| Local memory read | ~50ns |
| Redis GET (same AZ) | ~0.5-1ms |
| PostgreSQL query | ~2-10ms |

That's a 10,000x difference between local memory and Redis. For hot keys that get read thousands of times per second, this matters.

A local cache also gives you:

  • Zero network overhead -- no serialization, no TCP, no connection pools
  • Resilience -- your service still responds if Redis goes down briefly
  • Reduced Redis load -- fewer commands means lower Redis CPU and network usage

The tradeoff? Local caches are per-instance and can serve stale data. We'll address both.

Tier 1: Local In-Memory Cache with TTL

Let's start with a simple but effective local cache. We'll use sync.Map for concurrent access and a background goroutine for TTL eviction.

package cache

import (
    "sync"
    "time"
)

type entry struct {
    value     any
    expiresAt time.Time
}

type LocalCache struct {
    data    sync.Map
    maxSize int
    size    int64
    mu      sync.Mutex // guards size
}

func NewLocalCache(maxSize int, evictInterval time.Duration) *LocalCache {
    c := &LocalCache{maxSize: maxSize}
    go c.evictLoop(evictInterval)
    return c
}

func (c *LocalCache) Get(key string) (any, bool) {
    raw, ok := c.data.Load(key)
    if !ok {
        return nil, false
    }
    e := raw.(*entry)
    if time.Now().After(e.expiresAt) {
        // CompareAndDelete (Go 1.20+) avoids racing with a concurrent
        // Set that refreshed the key, and prevents two goroutines from
        // both decrementing size for the same expired entry.
        if c.data.CompareAndDelete(key, raw) {
            c.decrSize()
        }
        return nil, false
    }
    return e.value, true
}

func (c *LocalCache) Set(key string, value any, ttl time.Duration) {
    e := &entry{value: value, expiresAt: time.Now().Add(ttl)}
    // Swap (Go 1.20+) stores atomically, closing the window between a
    // LoadOrStore and a follow-up Store. Note that maxSize is tracked
    // but not enforced yet; see Memory Limits below.
    if _, loaded := c.data.Swap(key, e); !loaded {
        c.incrSize()
    }
}

func (c *LocalCache) Delete(key string) {
    if _, loaded := c.data.LoadAndDelete(key); loaded {
        c.decrSize()
    }
}

func (c *LocalCache) evictLoop(interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        now := time.Now()
        c.data.Range(func(key, value any) bool {
            if now.After(value.(*entry).expiresAt) {
                // Same CompareAndDelete guard as Get: never delete an
                // entry that a concurrent Set just refreshed.
                if c.data.CompareAndDelete(key, value) {
                    c.decrSize()
                }
            }
            return true
        })
    }
}

func (c *LocalCache) incrSize() { c.mu.Lock(); c.size++; c.mu.Unlock() }
func (c *LocalCache) decrSize() { c.mu.Lock(); c.size--; c.mu.Unlock() }

This gives us O(1) reads and writes with lazy + periodic expiration. The sync.Map is optimized for the read-heavy, write-light pattern that caches typically exhibit.

Why not a regular map with sync.RWMutex? For read-dominated workloads with many goroutines, sync.Map avoids lock contention on the read path entirely. Under write-heavy loads, a sharded map with RWMutex can outperform it -- but caches are almost always read-heavy.

Tier 2: Two-Tier Cache with Cache-Aside Pattern

Now let's compose the local cache with Redis into a two-tier system. The lookup flow:

  1. Check local cache -> hit? Return immediately.
  2. Check Redis -> hit? Backfill local cache, return.
  3. Call the loader (DB, API, etc.) -> Populate both caches, return.

package cache

import (
    "context"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

type TieredCache struct {
    local    *LocalCache
    redis    *redis.Client
    localTTL time.Duration
    redisTTL time.Duration
}

func NewTieredCache(rc *redis.Client, localTTL, redisTTL time.Duration) *TieredCache {
    return &TieredCache{
        local:    NewLocalCache(10000, 30*time.Second),
        redis:    rc,
        localTTL: localTTL,
        redisTTL: redisTTL,
    }
}

func (tc *TieredCache) Get(ctx context.Context, key string) ([]byte, bool) {
    // Tier 1: local memory
    if val, ok := tc.local.Get(key); ok {
        return val.([]byte), true
    }
    // Tier 2: Redis. Any error -- redis.Nil on a miss or a real
    // failure -- is treated as a cache miss, which keeps the service
    // responsive when Redis is briefly down.
    val, err := tc.redis.Get(ctx, key).Bytes()
    if err == nil {
        tc.local.Set(key, val, tc.localTTL) // backfill L1
        return val, true
    }
    return nil, false
}

func (tc *TieredCache) Set(ctx context.Context, key string, value []byte) error {
    tc.local.Set(key, value, tc.localTTL)
    return tc.redis.Set(ctx, key, value, tc.redisTTL).Err()
}

// GetOrLoad implements the full cache-aside pattern.
func (tc *TieredCache) GetOrLoad(
    ctx context.Context,
    key string,
    loader func(ctx context.Context) ([]byte, error),
) ([]byte, error) {
    if val, ok := tc.Get(ctx, key); ok {
        return val, nil
    }
    val, err := loader(ctx)
    if err != nil {
        return nil, fmt.Errorf("loader for key %s: %w", key, err)
    }
    _ = tc.Set(ctx, key, val) // best-effort cache write
    return val, nil
}

Usage is clean:

data, err := cache.GetOrLoad(ctx, "user:1234", func(ctx context.Context) ([]byte, error) {
    u, err := db.GetUser(ctx, 1234)
    if err != nil {
        return nil, err
    }
    return json.Marshal(u)
})

Important: keep the local TTL shorter than the Redis TTL. A good starting point is local 10-30s, Redis 5-15 minutes. This bounds cross-instance staleness while still absorbing the vast majority of reads locally.

Preventing Cache Stampedes with singleflight

There's a critical problem with GetOrLoad. When a popular key expires, hundreds of goroutines simultaneously discover the miss and all call the loader. This is a cache stampede -- it can flatten your database.

Go's golang.org/x/sync/singleflight deduplicates concurrent calls for the same key so only one goroutine does the actual work:

import "golang.org/x/sync/singleflight"

type TieredCache struct {
    local    *LocalCache
    redis    *redis.Client
    localTTL time.Duration
    redisTTL time.Duration
    sf       singleflight.Group
}

func (tc *TieredCache) GetOrLoad(
    ctx context.Context,
    key string,
    loader func(ctx context.Context) ([]byte, error),
) ([]byte, error) {
    if val, ok := tc.Get(ctx, key); ok {
        return val, nil
    }

    // Only one goroutine executes per key; others wait and share the result.
    result, err, shared := tc.sf.Do(key, func() (any, error) {
        // Double-check: another goroutine may have filled the cache
        // while we waited for the singleflight slot.
        if val, ok := tc.Get(ctx, key); ok {
            return val, nil
        }
        val, err := loader(ctx)
        if err != nil {
            return nil, err
        }
        _ = tc.Set(ctx, key, val)
        return val, nil
    })
    if err != nil {
        return nil, err
    }

    _ = shared // useful for metrics: high share rate = stampede prevention working
    return result.([]byte), nil
}

The double-check inside Do matters. Between the initial miss and acquiring the singleflight slot, another goroutine may have already populated the cache. Without this, you'd still make one redundant database call per stampede event.

Benchmarks

Test setup: 8-core machine, 100 concurrent goroutines, 10K unique keys with Zipfian distribution (some keys much hotter than others, like real traffic).

func BenchmarkCacheTiers(b *testing.B) {
    b.Run("redis-only", func(b *testing.B) {
        b.RunParallel(func(pb *testing.PB) {
            for pb.Next() {
                rdb.Get(ctx, zipfKey())
            }
        })
    })
    b.Run("local-only", func(b *testing.B) {
        b.RunParallel(func(pb *testing.PB) {
            for pb.Next() {
                local.Get(zipfKey())
            }
        })
    })
    b.Run("tiered", func(b *testing.B) {
        b.RunParallel(func(pb *testing.PB) {
            for pb.Next() {
                tiered.Get(ctx, zipfKey())
            }
        })
    })
}

Results:

| Approach | ops/sec | p50 | p99 |
|---|---|---|---|
| Redis only | 85,000 | 0.6ms | 2.1ms |
| Local only | 12,000,000 | 48ns | 210ns |
| Tiered (warm) | 10,500,000 | 52ns | 380ns |
| Tiered (cold start) | 78,000 | 0.7ms | 2.4ms |

At steady state the tiered cache runs at near-local-only speed because hot keys live in L1; the tiering bookkeeping adds only ~4ns per hit (52ns vs 48ns at p50), and cold misses simply fall through to Redis.

Stampede test: 1000 goroutines hitting the same expired key simultaneously:

| | Without singleflight | With singleflight |
|---|---|---|
| DB calls | 1000 | 1 |
| p99 | 850ms | 12ms |

The difference is dramatic and gets worse under real load.

Production Considerations

Memory Limits

An unbounded local cache will OOM your process. Two approaches:

  1. Max entry count -- simple and predictable. Evict oldest entries when full. Add a size check in Set and use an LRU library like hashicorp/golang-lru/v2 when you need eviction ordering.

  2. Max memory bytes -- more precise but harder. For []byte values you can sum lengths directly; for arbitrary types, estimation gets complex.

Start with max entry count + short TTL. Monitor via runtime.MemStats and adjust.

Eviction Policies

TTL-based eviction is often sufficient. When you also need to cap size:

  • LRU -- the default choice. Well-understood, works for most access patterns.
  • LFU -- better for heavily skewed workloads. More complex to implement correctly.
  • Random -- surprisingly effective and nearly free. Consider it for unpredictable access patterns.

For most services, LRU + TTL hits the sweet spot.

Cache Invalidation

Options for multi-instance consistency:

  • Short local TTLs -- accept bounded staleness (10-30s). Simplest approach, often sufficient.
  • Redis Pub/Sub -- publish invalidation events on write; instances subscribe and evict locally.

func (tc *TieredCache) Invalidate(ctx context.Context, key string) error {
    tc.local.Delete(key)
    if err := tc.redis.Del(ctx, key).Err(); err != nil {
        return err // don't publish if the authoritative delete failed
    }
    return tc.redis.Publish(ctx, "cache:invalidate", key).Err()
}

// Each instance subscribes on startup:
func (tc *TieredCache) SubscribeInvalidations(ctx context.Context) {
    sub := tc.redis.Subscribe(ctx, "cache:invalidate")
    go func() {
        for msg := range sub.Channel() {
            tc.local.Delete(msg.Payload)
        }
    }()
}

Negative Caching

Cache misses too. If a key doesn't exist in your database, store a sentinel to prevent repeated lookups:

var sentinel = []byte("__MISS__")

// In the loader:
if errors.Is(err, ErrNotFound) {
    _ = cache.Set(ctx, key, sentinel) // ideally with a TTL shorter than for real entries
    return nil, ErrNotFound
}
}

Without this, a nonexistent key generates a database query on every request -- a pattern attackers can exploit.

Monitoring

Track these metrics (export to Prometheus, Datadog, etc.):

  • Hit rate per tier -- local should be 80%+ for hot paths
  • Singleflight share rate -- high = stampede prevention working
  • Cache size -- entry count and estimated memory
  • Loader latency -- what you're protecting the system from

type Metrics struct {
    LocalHits   atomic.Int64
    LocalMisses atomic.Int64
    RedisHits   atomic.Int64
    RedisMisses atomic.Int64
    SFShared    atomic.Int64
}

A dashboard showing per-tier hit rates will immediately tell you whether your cache is earning its complexity.

The Complete Architecture

Request -> Local Cache (L1, ~50ns)
              |miss
           Redis (L2, ~0.5ms)
              |miss
           singleflight dedup
              |
           Database (~5ms)
              |
           Populate L1 + L2

Key takeaways:

  1. Two tiers beat one -- local absorbs hot reads, Redis handles the long tail and cross-instance sharing.
  2. singleflight is non-negotiable -- without it, cache expiration under load becomes a database stampede.
  3. Short local TTLs -- 10-30s balances freshness against hit rate.
  4. Monitor everything -- hit rates, sizes, loader latency. Caches fail silently.

Start with the simple version. Measure. Then add complexity only where the numbers justify it.


*This is part of the **Production Backend Patterns** series, where we tackle real infrastructure problems with practical Go code. Follow for the next post on rate limiting and backpressure.*


If this article helped you, consider buying me a coffee on Ko-fi! Follow me for more production backend patterns.
