You have a slow endpoint. Someone suggests Redis. You add Redis. The endpoint gets faster. You ship it. Six months later you are debugging a production incident where users see stale data, your cache hit rate is 12%, and you have no idea what is actually in Redis anymore.
That is not a caching strategy. That is a prayer with an expiry time.
This post is not here to roast you. It is here to give you the patterns, the code, and the mental model to do this right.
Table of Contents
- Caching Is a Contract
- The Three Classic Patterns (And When to Use Each)
- Build a Cache Client Worth Using
- Cache Key Design: More Important Than It Looks
- Cache Invalidation: The Part Everyone Skips
- The Thundering Herd and How to Solve It
- Observability: Know What Is Actually Happening
- The Decision Checklist Before You Add a Cache
- Putting It All Together
Caching Is a Contract
Before you touch Redis, understand what you are agreeing to.
Caching is a contract. You are telling your system: "I accept that this data may be slightly wrong for a period of time, in exchange for speed." Most teams sign that contract without reading it.
Before you add any cache entry, answer these four questions:
- What is the acceptable staleness window for this data?
- Who invalidates this entry and when?
- What happens when the cache is cold?
- What happens when the cache is wrong?
If you cannot answer all four, you do not have a caching strategy. You have optimism.
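One way to stop signing the contract blindly is to write it down in code and check it in. A minimal sketch, assuming nothing beyond the four questions above (the `CacheContract` type, the event names, and the example entries are all illustrative, not from any library):

```typescript
// A cache contract: every cached entity answers the four questions explicitly.
type Staleness = { maxSeconds: number } | "never-stale";

interface CacheContract {
  entity: string;                          // what is cached
  staleness: Staleness;                    // 1. acceptable staleness window
  invalidatedBy: string[];                 // 2. who invalidates it, and on which events
  onCold: "fetch-sync" | "warm-on-deploy"; // 3. what happens when the cache is cold
  onWrong: string;                         // 4. blast radius if the entry is wrong
}

const contracts: CacheContract[] = [
  {
    entity: "user:profile",
    staleness: { maxSeconds: 300 },
    invalidatedBy: ["profile.updated"],
    onCold: "fetch-sync",
    onWrong: "user sees an outdated display name",
  },
  {
    entity: "user:permissions",
    staleness: "never-stale",
    invalidatedBy: ["permissions.updated", "account.suspended"],
    onCold: "fetch-sync",
    onWrong: "security incident: revoked access still works",
  },
];

// Fails fast if someone documents never-stale data without an invalidation event.
export function validateContracts(list: CacheContract[]): void {
  for (const c of list) {
    if (c.staleness === "never-stale" && c.invalidatedBy.length === 0) {
      throw new Error(`${c.entity}: never-stale data needs an invalidation event`);
    }
  }
}
```

Even if you never run the validator, a file like this forces the four questions to be answered per entity instead of per incident.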
The rest of this post is about answering each of those questions with real code.
The Three Classic Patterns (And When to Use Each)
There are three fundamental ways to integrate a cache with your database. Most teams only know one and use it everywhere.
| Pattern | How It Works | When to Use It |
|---|---|---|
| Cache Aside | App checks cache, misses go to DB, app writes to cache | Default. Works well for most read heavy workloads. |
| Write Through | Every write goes to DB and cache together, atomically | Read heavy data that changes infrequently. Keeps cache always warm. |
| Write Behind | Write to cache immediately, flush to DB asynchronously | Very high write throughput: analytics, metrics, event ingestion, rate limiting counters. |
Most engineers default to cache aside everywhere, which is fine until it is not. Write behind in particular is underused. When you are recording analytics events or incrementing rate limit counters, you do not need each write to round trip to a database. Write to Redis, flush to Postgres in batches. Your database handles a fraction of the load.
The important thing is that you choose consciously. Each pattern has tradeoffs. Write behind carries real risk of data loss if Redis fails before the flush. That is acceptable for a view counter and unacceptable for a financial transaction. Know which one you are dealing with.
Build a Cache Client Worth Using
Before diving into patterns, establish a typed, reusable cache client. This becomes the foundation for everything below.
import { createClient, RedisClientType } from "redis";

export interface CacheOptions {
  ttl: number; // base TTL in seconds
  jitter?: number; // max random seconds to add (prevents stampedes)
  stale?: number; // extra seconds to serve stale while revalidating
}

export interface CacheEntry<T> {
  value: T;
  cachedAt: number;
  expiresAt: number;
}

export class CacheClient {
  private client: RedisClientType;

  constructor(redisUrl: string) {
    this.client = createClient({ url: redisUrl }) as RedisClientType;
  }

  async connect(): Promise<void> {
    await this.client.connect();
  }

  protected getUnderlyingClient(): RedisClientType {
    return this.client;
  }

  private effectiveTTL(options: CacheOptions): number {
    const jitter = options.jitter
      ? Math.floor(Math.random() * options.jitter)
      : 0;
    return options.ttl + jitter;
  }

  async get<T>(key: string): Promise<CacheEntry<T> | null> {
    const raw = await this.client.get(key);
    if (!raw) return null;
    return JSON.parse(raw) as CacheEntry<T>;
  }

  async set<T>(key: string, value: T, options: CacheOptions): Promise<void> {
    const ttl = this.effectiveTTL(options);
    const staleTTL = ttl + (options.stale ?? 0);
    const entry: CacheEntry<T> = {
      value,
      cachedAt: Date.now(),
      expiresAt: Date.now() + ttl * 1000,
    };
    await this.client.setEx(key, staleTTL, JSON.stringify(entry));
  }

  async del(key: string): Promise<void> {
    await this.client.del(key);
  }

  async delByPattern(pattern: string): Promise<void> {
    // SCAN, not KEYS: KEYS walks the entire keyspace in one blocking call
    // and can stall a production Redis. scanIterator pages through safely.
    const keys: string[] = [];
    for await (const key of this.client.scanIterator({
      MATCH: pattern,
      COUNT: 100,
    })) {
      keys.push(key);
    }
    if (keys.length > 0) {
      await this.client.del(keys);
    }
  }
}
The CacheEntry wrapper stores cachedAt and expiresAt alongside the value. This unlocks stale while revalidate later without a second Redis call.
Cache Key Design: More Important Than It Looks
This is one of the most overlooked parts of a caching system. Your key schema is an architectural decision, not a naming convention.
A good cache key encodes five things, in order:

app : v2 : user : 123 : profile
│     │    │      │     └── shape or query variant
│     │    │      └── entity ID
│     │    └── entity type
│     └── cache version
└── app namespace
More examples:
app:v2:user:123:permissions
app:v2:feed:user:123:page:2:limit:20
app:v2:product:456:inventory:warehouse:uk
Bad keys look like this:
user:123
profile_123_v2
user_data_new_123
temp_user_123
No namespace means you cannot isolate patterns for deletion. No version means changing the shape of a cached object requires flushing all of Redis. No structure means you cannot delete "everything for user 123" with a single pattern.
The key schema also tells you something about your architecture. If your keys look inconsistent, your caching layer grew organically without a plan. Standardize the schema early and enforce it through a key builder:
const CACHE_VERSION = process.env.CACHE_VERSION ?? "v1";

export const keys = {
  userProfile: (userId: string) =>
    `app:${CACHE_VERSION}:user:${userId}:profile`,
  userPermissions: (userId: string) =>
    `app:${CACHE_VERSION}:user:${userId}:permissions`,
  userFeed: (userId: string, page: number, limit: number) =>
    `app:${CACHE_VERSION}:feed:user:${userId}:page:${page}:limit:${limit}`,
  userAll: (userId: string) =>
    `app:${CACHE_VERSION}:user:${userId}:*`,
};
Now when you need to invalidate all data for a user, it is one call: delByPattern(keys.userAll(userId)).
To handle a breaking shape change, bump CACHE_VERSION in your deploy config. Old keys expire naturally. New requests populate the new shape. No coordinated flush against production Redis.
Cache Invalidation: The Part Everyone Skips
Most cache bugs are not "we cached the wrong thing." They are "we forgot to uncache it when the underlying data changed."
Here is the pattern that causes incidents:
// The read path is carefully thought through
async function getUserProfile(userId: string) {
  const key = keys.userProfile(userId);
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);
  const profile = await db.findUser(userId);
  await redis.setEx(key, 3600, JSON.stringify(profile));
  return profile;
}

// The write path does not think about the cache at all
async function updateUserProfile(userId: string, data: Partial<User>) {
  await db.updateUser(userId, data);
  // cache is now wrong. silently. for the next hour.
}
The fix is to own both paths in the same service:
export class UserService {
  constructor(
    private db: Database,
    private cache: CacheClient
  ) {}

  async getProfile(userId: string): Promise<UserProfile> {
    const key = keys.userProfile(userId);
    const entry = await this.cache.get<UserProfile>(key);
    if (entry) return entry.value;
    const profile = await this.db.findUser(userId);
    await this.cache.set(key, profile, { ttl: 3600, jitter: 300 });
    return profile;
  }

  async updateProfile(userId: string, data: Partial<UserProfile>): Promise<UserProfile> {
    const updated = await this.db.updateUser(userId, data);
    // Write through: update cache immediately with fresh data
    await this.cache.set(keys.userProfile(userId), updated, {
      ttl: 3600,
      jitter: 300,
    });
    return updated;
  }

  async suspendAccount(userId: string): Promise<void> {
    await this.db.suspendUser(userId);
    // Permissions must be immediately consistent.
    // Never serve stale permission data. Delete, do not wait for TTL.
    await this.cache.del(keys.userPermissions(userId));
  }
}
Notice suspendAccount does not use write through. It deletes the permissions key outright. Serving a stale permission is a security issue, not a UX issue. The next read will hit the database and get the correct answer.
TTL vs. Event Driven Invalidation
These are two different tools for two different problems.
TTL invalidation is for data where being slightly stale is acceptable. Exchange rates, public blog posts, product catalog pages. Set a TTL and let entries expire naturally.
Event driven invalidation is for data where correctness matters. User permissions, account status, pricing. Delete or update the cache entry at the moment of the change, not on a timer.
Most systems use TTL for everything because it requires less upfront thinking. Then they get burned when a permissions update does not take effect for an hour. The fix is not to lower the TTL. Lowering the TTL is how you get a thundering herd.
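A sketch of event driven invalidation: a single map from domain events to the keys they touch, so the write path cannot forget. The event names and literal key formats below are illustrative (in a real codebase you would build them through the key builder):

```typescript
// Event-driven invalidation sketch: one table maps each domain event
// to the cache keys it must delete.
type Invalidator = (payload: { userId: string }) => string[];

const invalidations: Record<string, Invalidator> = {
  "permissions.updated": ({ userId }) => [`app:v1:user:${userId}:permissions`],
  "account.suspended": ({ userId }) => [
    `app:v1:user:${userId}:permissions`,
    `app:v1:user:${userId}:profile`,
  ],
};

// Call this from wherever domain events are dispatched; `del` is
// whatever delete function your cache client exposes.
export async function onDomainEvent(
  event: string,
  payload: { userId: string },
  del: (key: string) => Promise<void>
): Promise<string[]> {
  const keys = invalidations[event]?.(payload) ?? [];
  await Promise.all(keys.map(del));
  return keys;
}
```

The table is the point: invalidation rules live in one reviewable place instead of being scattered across every write path.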
The Thundering Herd and How to Solve It
A typical Redis hit takes 1 to 3ms. A typical database query takes 50 to 300ms. That means one cache miss can cost as much as 100 cache hits. At scale, getting this wrong does not just make things slow. It brings services down.
Imagine a hot cache entry expires at 3:00:00 AM. You have 500 concurrent users. At 3:00:01, all 500 get a cache miss simultaneously and fire a database query. Your database, handling 10 queries per second comfortably, suddenly receives 500 in the same second and collapses.
You just traded slightly stale data for a full outage.
Without protection:
500 requests ──► 500 DB queries ──► DB overwhelmed ──► Outage
With jitter + lock:
500 requests ──► 1 DB query ──► Cache filled ──► 499 served from cache
Three solutions, applied in layers.
Solution 1: Jitter
The simplest fix. Add randomness to expiry so entries do not all expire at the same moment. Already built into the CacheClient above via the jitter option.
// Without jitter: 500 entries for a popular endpoint all expire at 03:00:00
await cache.set(key, value, { ttl: 3600 });
// With jitter: entries expire anywhere between 3600 and 3900 seconds
// Stampede risk drops dramatically for zero added complexity
await cache.set(key, value, { ttl: 3600, jitter: 300 });
Add jitter everywhere. It costs nothing.
Solution 2: Stale While Revalidate
Serve the stale entry immediately. Trigger a background refresh. The next request gets fresh data. This is the stale-while-revalidate model HTTP caches have used for years (standardized in RFC 5861).
Request ──► Cache hit?
              ├── Yes, and fresh? ──► Return immediately
              ├── Yes, but stale? ──► Return immediately (user does not wait)
              │                        └──► Trigger background refresh
              └── No (full miss)  ──► Fetch from DB ──► Populate cache ──► Return
type FetchFn<T> = () => Promise<T>;

export async function getWithStaleRevalidate<T>(
  cache: CacheClient,
  key: string,
  fetchFn: FetchFn<T>,
  options: CacheOptions & { stale: number }
): Promise<T> {
  const entry = await cache.get<T>(key);
  if (entry) {
    const isStale = Date.now() > entry.expiresAt;
    if (isStale) {
      // Serve the stale value immediately, refresh in the background
      void refreshInBackground(cache, key, fetchFn, options);
    }
    return entry.value;
  }
  // Full miss: fetch synchronously
  const value = await fetchFn();
  await cache.set(key, value, options);
  return value;
}

async function refreshInBackground<T>(
  cache: CacheClient,
  key: string,
  fetchFn: FetchFn<T>,
  options: CacheOptions
): Promise<void> {
  try {
    const value = await fetchFn();
    await cache.set(key, value, options);
  } catch (err) {
    console.error(`Background cache refresh failed for ${key}:`, err);
  }
}
Usage at call sites is one line:
const profile = await getWithStaleRevalidate(
  cache,
  keys.userProfile(userId),
  () => db.findUser(userId),
  { ttl: 300, stale: 600, jitter: 60 }
);
Users never wait on a cache refresh. The background task handles it. The next user gets the fresh value.
Solution 3: Distributed Lock on Cache Miss
For hot single keys, when a miss happens only one process should fetch and repopulate. Others wait briefly. This prevents 500 processes all querying the database for the same row at the same time.
import Redlock from "redlock";

export class LockedCacheClient extends CacheClient {
  private redlock: Redlock;

  constructor(redisUrl: string) {
    super(redisUrl);
    // Note: the redlock package is written against ioredis-style clients.
    // If CacheClient wraps node-redis, run a separate ioredis connection
    // for locking, or build the lock on SET NX PX directly.
    this.redlock = new Redlock([this.getUnderlyingClient() as any]);
  }

  async getOrFetch<T>(
    key: string,
    fetchFn: FetchFn<T>,
    options: CacheOptions
  ): Promise<T> {
    const entry = await this.get<T>(key);
    if (entry) return entry.value;

    const lockKey = `lock:${key}`;
    let lock;
    try {
      lock = await this.redlock.acquire([lockKey], 5000);
      // Re-check after acquiring the lock.
      // Another process may have already populated the cache while we waited.
      const recheck = await this.get<T>(key);
      if (recheck) return recheck.value;
      const value = await fetchFn();
      await this.set(key, value, options);
      return value;
    } catch {
      // Lock contention: fall back to a direct DB fetch rather than failing the request
      const value = await fetchFn();
      await this.set(key, value, options);
      return value;
    } finally {
      // release() throws if the lock already expired; don't fail the request over it
      await lock?.release().catch(() => {});
    }
  }
}
The re-check after acquiring the lock is critical. Without it, the second process acquires the lock and queries the database anyway even though the first process just populated the cache. This is the double checked locking pattern applied to distributed systems.
Observability: Know What Is Actually Happening
You cannot improve what you cannot see. Wrap your cache client with metrics once and label call sites with a pattern name.
import { Counter, Histogram, Registry } from "prom-client";

export class ObservableCacheClient extends CacheClient {
  private hits: Counter;
  private misses: Counter;
  private hitLatency: Histogram;
  private missLatency: Histogram;

  constructor(redisUrl: string, registry: Registry) {
    super(redisUrl);
    this.hits = new Counter({
      name: "cache_hits_total",
      help: "Total cache hits",
      labelNames: ["key_pattern"],
      registers: [registry],
    });
    this.misses = new Counter({
      name: "cache_misses_total",
      help: "Total cache misses",
      labelNames: ["key_pattern"],
      registers: [registry],
    });
    this.hitLatency = new Histogram({
      name: "cache_hit_duration_seconds",
      help: "Latency of cache hits",
      labelNames: ["key_pattern"],
      registers: [registry],
    });
    this.missLatency = new Histogram({
      name: "cache_miss_duration_seconds",
      help: "Latency of cache misses including the upstream fetch",
      labelNames: ["key_pattern"],
      registers: [registry],
    });
  }

  async getOrFetchObserved<T>(
    key: string,
    keyPattern: string,
    fetchFn: FetchFn<T>,
    options: CacheOptions
  ): Promise<T> {
    const start = performance.now();
    const entry = await this.get<T>(key);
    const elapsed = () => (performance.now() - start) / 1000;
    if (entry) {
      this.hits.inc({ key_pattern: keyPattern });
      this.hitLatency.observe({ key_pattern: keyPattern }, elapsed());
      return entry.value;
    }
    this.misses.inc({ key_pattern: keyPattern });
    const value = await fetchFn();
    await this.set(key, value, options);
    this.missLatency.observe({ key_pattern: keyPattern }, elapsed());
    return value;
  }
}
The three metrics that matter most:
Hit rate per key pattern, not aggregate. If your overall hit rate is 80% but a critical key pattern sits at 20%, the aggregate number is hiding a real problem. Always break this down by pattern.
Eviction rate. If Redis is evicting keys because you are out of memory, you are thrashing, not caching. A Redis hit at 1ms becomes meaningless if the key you need was evicted 30 seconds ago. Increase memory, shorten TTLs, or stop caching data that expires before it is ever read again.
Miss latency. At scale, a cache miss costs 50 to 300ms of database time. If your service depends on sub-10ms responses, one cold endpoint can blow your entire p99. Miss latency tells you exactly how bad the degradation is when your cache fails.
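To see why the aggregate hides problems, here is the arithmetic with invented numbers: three patterns, one of them quietly failing, and an overall hit rate that still looks healthy.

```typescript
// Aggregate hit rate can hide a failing pattern. Counts are invented for illustration.
const counts = [
  { pattern: "user:profile", hits: 9000, misses: 1000 },   // 90%
  { pattern: "user:feed", hits: 7500, misses: 500 },       // ~94%
  { pattern: "user:permissions", hits: 200, misses: 800 }, // 20% -- the hidden problem
];

export function hitRates(rows: typeof counts) {
  const perPattern = rows.map((r) => ({
    pattern: r.pattern,
    rate: r.hits / (r.hits + r.misses),
  }));
  const totalHits = rows.reduce((s, r) => s + r.hits, 0);
  const total = rows.reduce((s, r) => s + r.hits + r.misses, 0);
  // Aggregate here is ~88%, which looks fine on a dashboard
  // while the permissions pattern misses 4 out of 5 reads.
  return { aggregate: totalHits / total, perPattern };
}
```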
The Decision Checklist Before You Add a Cache
Not every performance problem needs a cache. Run through this before reaching for Redis.
Is the query actually slow or just called too often? N+1 patterns make caching look like the answer when a JOIN is. Profile the query execution plan before adding a cache layer on top of a structural problem.
Can the data be denormalized instead? If you always cache the same join result, consider materializing it in the schema. A cache is sometimes a workaround for a schema designed for write convenience rather than read performance.
Is this actually read heavy? If data is written more often than it is read, cache invalidation overhead exceeds the savings. You are paying the write cost twice: once to the database, once to update or invalidate the cache entry.
What is the cost of serving stale data? Product listings: probably acceptable. Account balance: no. Permissions and access control: never. Make this decision explicitly. Saying "we will just set a short TTL" is not an answer. It is a way of avoiding the question.
What is your cold start story? If your service falls over every time it restarts because the cache is cold, write through caching or a warming script is the fix, not hoping traffic is light during the deploy window.
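A warming script does not need to be elaborate. A sketch with a small worker pool, under stated assumptions: the `WarmTarget` shape and the concurrency default are invented for illustration, and `setFn` stands in for your cache client's `set`.

```typescript
// Cache warming sketch: populate the hottest keys before traffic arrives.
interface WarmTarget<T> {
  key: string;
  fetch: () => Promise<T>;
}

export async function warmCache(
  targets: WarmTarget<unknown>[],
  setFn: (key: string, value: unknown) => Promise<void>,
  concurrency = 5
): Promise<number> {
  let warmed = 0;
  let next = 0;
  // Simple worker pool: `concurrency` workers pull from a shared cursor.
  // Safe in JS because the read-and-increment happens synchronously.
  const workers = Array.from({ length: concurrency }, async () => {
    while (next < targets.length) {
      const target = targets[next++];
      try {
        await setFn(target.key, await target.fetch());
        warmed++;
      } catch {
        // One failed key should not abort the whole warm-up run.
      }
    }
  });
  await Promise.all(workers);
  return warmed;
}
```

Run it as a deploy step or a readiness hook, seeded with whatever your hit-rate metrics say are the hottest key patterns.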
Putting It All Together
Here is a service that uses everything from this post: typed versioned keys, write through on mutations, stale while revalidate on reads, jitter throughout, immediate delete for permissions, and full observability.
export class ProductionUserService {
  constructor(
    private db: Database,
    private cache: ObservableCacheClient
  ) {}

  async getProfile(userId: string): Promise<UserProfile> {
    return getWithStaleRevalidate(
      this.cache,
      keys.userProfile(userId),
      () => this.db.findUser(userId),
      { ttl: 300, stale: 600, jitter: 60 }
    );
  }

  async updateProfile(userId: string, data: Partial<UserProfile>): Promise<UserProfile> {
    const updated = await this.db.updateUser(userId, data);
    // Write through: fresh data goes straight to cache on every mutation
    await this.cache.set(
      keys.userProfile(userId),
      updated,
      { ttl: 300, stale: 600, jitter: 60 }
    );
    return updated;
  }

  async updatePermissions(userId: string, permissions: Permission[]): Promise<void> {
    await this.db.updatePermissions(userId, permissions);
    // Access control data is deleted immediately, never served stale
    await this.cache.del(keys.userPermissions(userId));
  }
}
Reads use stale while revalidate so users never block on a cache refresh. Profile writes use write through so the cache stays warm after every mutation. Permission changes delete immediately because serving a stale permission is a security issue, not a UX issue.
Closing
The teams that handle caching well treat it as a first class architectural concern. They document cache contracts: what is cached, for how long, what triggers invalidation, and what the acceptable staleness window is. They test cold start and stampede scenarios explicitly. They monitor cache health with the same seriousness as database health.
The teams that handle caching poorly add Redis when things get slow and ship it.
Both approaches work right up until they do not. The difference is that the first team knows exactly what breaks and why when it does. The second team opens their laptop past midnight, starts reading the documentation for the first time, or pastes "pls help me, redis no work, idk what is redis no more" into ChatGPT.
The patterns in this post are not theoretical. They are the things you wish were already in the codebase when the incident starts. Put them in before it does.