Nitin Srivastava

How We Cut API Response Time from 2.3s to 180ms Using Redis + Smart Caching

p95 latency dropped from 2.3 seconds to 180 milliseconds. Same hardware, same database, same traffic. The only thing that changed was how we cached — and I don't mean slapping @lru_cache on a function.

I'm writing this because every Redis caching tutorial I read before this project showed me the same 15-line example: redis.get(key) or fetch_from_db(). That code works in a notebook. It will absolutely wreck you in production the first time real traffic hits it.

This is the layered strategy that actually survived. FastAPI + Python on the server, Redis 7 for caching, Postgres behind it. Everything below is from a real project we shipped for a B2B client earlier this year — roughly 800 requests per minute on the hot endpoints, with read-heavy traffic around product catalog and pricing.

The endpoint that was killing us

The problematic endpoint returned a pricing quote for a product variant, filtered by region, customer tier, and active promotions. Three joins, a couple of window functions for volume-based discounts, and a call to a legacy PHP service for tax lookup. Most requests landed somewhere between 1.8 and 2.6 seconds. p95 sat at 2.3s. We had complaints.

Here's the naive caching attempt we rolled out first. Guess how long it lasted in production.

# app/pricing.py
import json
import redis
from fastapi import APIRouter

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
router = APIRouter()

@router.get("/quote/{product_id}")
def get_quote(product_id: int, region: str, tier: str):
    key = f"quote:{product_id}:{region}:{tier}"
    cached = r.get(key)
    if cached:
        return json.loads(cached)

    quote = compute_quote(product_id, region, tier)  # slow path
    r.setex(key, 300, json.dumps(quote))  # 5 min TTL
    return quote

Three problems showed up within the first hour of production traffic. I want to walk through each, because these are the things the tutorials don't tell you.

Gotcha 1: The thundering herd

The first alert fired at 2:14 AM. Our compute_quote function was being called 400+ times per second for the same key when the cache expired. Postgres spiked, the legacy tax service timed out, everything melted.

This is called cache stampede (or thundering herd). When a popular key expires, every in-flight request misses the cache simultaneously and hammers the origin. The fix is request coalescing — only one request actually does the work, the rest wait for it.

# app/cache.py
import json
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_with_lock(key: str, ttl: int, loader):
    """
    Get value for `key` from Redis. If missing, acquire a short lock,
    call `loader()` to compute, write the value, release the lock.
    Other waiters poll until the value appears.
    """
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    lock_key = f"lock:{key}"
    acquired = r.set(lock_key, "1", nx=True, ex=10)

    if acquired:
        try:
            value = loader()
            r.setex(key, ttl, json.dumps(value))
            return value
        finally:
            r.delete(lock_key)

    # Someone else is computing. Poll briefly.
    for _ in range(50):
        time.sleep(0.05)
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)

    # Lock holder died or is too slow. Compute ourselves and cache the
    # result so later waiters don't repeat the work.
    value = loader()
    r.setex(key, ttl, json.dumps(value))
    return value

Two details matter. First, the lock TTL (10 seconds) must be longer than your worst-case loader time. If loader takes 15 seconds, the lock expires, and now you have two processes computing. Second, the polling fallback at the end prevents permanent deadlock if the lock holder crashes mid-compute.

You can get fancier with pub/sub notifications instead of polling, but the 50ms poll interval is cheap enough that it rarely matters. We tried the pub/sub version. It was 40 lines more code and saved us maybe 15ms on cache misses. Not worth it.

Gotcha 2: Cache key design is 80% of the game

Our first cache key was quote:{product_id}:{region}:{tier}. Looked reasonable. It was wrong in at least three ways.

First, active promotions affected the price but weren't in the key. So a customer would fetch a quote, the promo would end, and they'd keep getting the stale promoted price for 5 minutes. Support tickets rolled in.

Second, currency was an optional query param that defaulted to USD, and it wasn't in the original key at all — a quote computed for an explicit currency=EUR request could be served from cache to the next USD request. And once currency goes into the key, it needs normalizing: internal services always sent currency=USD, external clients omitted it, and without a default-and-uppercase step those identical requests fill the cache with duplicate entries for the same logical value.

Third, we were caching per-user-tier, but three tiers (silver, gold, platinum) had identical pricing for 70% of products. We were storing three copies of the same thing.
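
The tier fix doesn't appear in the v2 key function below, but the idea is easy to sketch: map tiers that share a price book onto a pricing group before keying, so identical prices share one cache entry. This is a hypothetical simplification — the table here (TIER_PRICING_GROUP) is global, while in our system the grouping actually varied per product:

```python
# Hypothetical: tiers that share a price book collapse to one group.
TIER_PRICING_GROUP = {
    "silver": "standard",
    "gold": "standard",
    "platinum": "standard",
    "enterprise": "enterprise",
}

def normalize_tier(tier: str) -> str:
    """Map a customer tier to the pricing group used in the cache key."""
    tier = tier.lower()
    # Unknown tiers fall through unchanged so they still get cached correctly.
    return TIER_PRICING_GROUP.get(tier, tier)
```

Feeding normalize_tier(tier) instead of the raw tier into the key function deduplicates those three copies wherever the prices really are identical.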

Here's the redesigned key structure:

# app/keys.py
import hashlib
import json

def quote_cache_key(product_id: int, region: str, tier: str,
                    currency: str, active_promo_ids: list[int]) -> str:
    """
    Stable cache key that accounts for every input that can change the output.
    """
    normalized = {
        "product_id": product_id,
        "region": region.upper(),
        "tier": tier.lower(),
        "currency": (currency or "USD").upper(),
        "promos": sorted(active_promo_ids or []),
    }
    payload = json.dumps(normalized, separators=(",", ":"), sort_keys=True)
    digest = hashlib.sha1(payload.encode()).hexdigest()[:16]
    return f"quote:v2:{digest}"

Three changes from the original. We normalize inputs before keying (uppercase region, lowercase tier, default currency). We include every input that can change the output, including promotion IDs. And we hash the whole thing with a version prefix so we can invalidate the entire cache namespace by bumping v2 to v3.

That version prefix has saved us twice since then. When we changed the pricing rules in March, we shipped with v3 and every old cached quote became instantly irrelevant — no flush, no downtime.

Gotcha 3: Writes should invalidate, not wait for TTL

TTL-based invalidation is fine for data that's allowed to be slightly stale. It's terrible for anything a user just changed themselves.

The pattern I see everywhere is: cache with TTL, shrug when users report they're seeing old data right after updating a record. "Wait 5 minutes and refresh." That's not acceptable in 2026.

We use a write-through-invalidate pattern. When a write happens, we delete related cache keys explicitly. The hard part is knowing which keys are related — this is where the key design from the previous section pays off.

# app/invalidation.py
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def invalidate_product(product_id: int):
    """
    A product's price data changed. Invalidate every cached quote for it.
    Uses SCAN instead of KEYS to avoid blocking Redis on large datasets.
    """
    pattern = "quote:v2:*"
    cursor = 0
    deleted = 0

    while True:
        cursor, batch = r.scan(cursor=cursor, match=pattern, count=500)
        for key in batch:
            meta_key = f"{key}:meta"
            stored_product_id = r.hget(meta_key, "product_id")
            if stored_product_id and int(stored_product_id) == product_id:
                r.delete(key, meta_key)
                deleted += 1
        if cursor == 0:
            break

    return deleted

Two things to notice. We use SCAN instead of KEYS. KEYS blocks the Redis event loop on large datasets — we learned this when a KEYS quote:* took 800ms during a traffic peak and we queued up a couple thousand waiting commands behind it. Don't use KEYS in production. Ever.

Second, we store a lightweight metadata hash alongside each cached value ({key}:meta) so we can look up what's inside the opaque hashed key. This costs us a tiny bit of memory but makes targeted invalidation possible without re-deriving keys from the invalidation side.
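
The write side of that metadata isn't shown above. A minimal sketch — write_quote_with_meta is a hypothetical helper name, and r stands for any redis.Redis-compatible client — pipelines the value, the meta hash, and matching TTLs in one round trip:

```python
import json

def write_quote_with_meta(r, key: str, ttl: int, value: dict, product_id: int):
    """
    Store the cached quote plus the {key}:meta hash that
    invalidate_product() inspects, with matching TTLs so both expire
    together. One pipeline means one network round trip.
    """
    pipe = r.pipeline()
    pipe.setex(key, ttl, json.dumps(value))
    pipe.hset(f"{key}:meta", mapping={"product_id": product_id})
    pipe.expire(f"{key}:meta", ttl)
    pipe.execute()
```

Keeping the two TTLs identical matters: if the meta hash outlived the value you'd scan dead entries, and if it expired first you'd have uninvalidatable quotes.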

Bringing it together

Here's the actual handler that ships in production, roughly.

# app/pricing.py
from fastapi import APIRouter, Query
from app.cache import cached_with_lock
from app.keys import quote_cache_key
from app.quotes import compute_quote, get_active_promo_ids

router = APIRouter()
QUOTE_TTL_SECONDS = 300  # 5 minutes

@router.get("/quote/{product_id}")
def get_quote(
    product_id: int,
    region: str,
    tier: str,
    currency: str = Query("USD"),
):
    promo_ids = get_active_promo_ids(product_id, region)
    key = quote_cache_key(product_id, region, tier, currency, promo_ids)

    def loader():
        return compute_quote(product_id, region, tier, currency, promo_ids)

    return cached_with_lock(key, QUOTE_TTL_SECONDS, loader)

Six lines of logic on top of the shared cached_with_lock helper. Most of the complexity lives in two places: the key function and the loader. The route itself stays boring, which is what you want. If you're building this kind of API layer and want a second pair of eyes, our team offers Python development services and has walked through this exact pattern with a few clients.

The numbers, measured

I hate when people claim a perf win without showing the measurement method. Here's how we actually verified it.

# scripts/benchmark.py
import asyncio
import statistics
import time
import httpx

async def hit(client, url):
    start = time.perf_counter()
    r = await client.get(url)
    r.raise_for_status()
    return time.perf_counter() - start

async def main():
    url = "http://api.local/quote/1234?region=EU&tier=gold"
    async with httpx.AsyncClient(timeout=10) as client:
        latencies = await asyncio.gather(*[hit(client, url) for _ in range(500)])

    latencies.sort()
    print(f"count     : {len(latencies)}")
    print(f"p50 (ms)  : {statistics.median(latencies) * 1000:.0f}")
    print(f"p95 (ms)  : {latencies[int(0.95 * len(latencies))] * 1000:.0f}")
    print(f"p99 (ms)  : {latencies[int(0.99 * len(latencies))] * 1000:.0f}")

if __name__ == "__main__":
    asyncio.run(main())

Results over 500 requests against the same endpoint, warmed cache:

| Metric | Before | After |
| ------ | ------ | ----- |
| p50    | 1.9s   | 110ms |
| p95    | 2.3s   | 180ms |
| p99    | 3.1s   | 240ms |

The p99 number is the one I care about most. p50 and p95 dropping is expected when cache hit rate goes up. But p99 represents the tail — the cache misses, the unlucky timing windows, the lock-wait fallbacks. If your p99 doesn't drop meaningfully, you've accidentally traded one set of slow requests for another.

What not to cache

Two things we deliberately left uncached, because caching them was making things worse.

User-specific data with low reuse — a user's cart, their last order, their notification state. The cache hit rate was under 4% and Redis memory usage ballooned. We moved these to an in-process LRU cache with a 60-second TTL and dropped Redis involvement entirely.
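
The in-process cache we moved to needs no external library — a minimal sketch of an LRU cache with a per-entry TTL, using only the standard library (the class name TTLLRUCache is my own; this is an illustration of the shape, not our exact code):

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Small in-process cache: least-recently-used eviction plus a TTL."""

    def __init__(self, maxsize: int = 1024, ttl: float = 60.0):
        self.maxsize = maxsize
        self.ttl = ttl
        self._data: OrderedDict = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        expires_at, value = item
        if time.monotonic() >= expires_at:
            del self._data[key]  # lazy expiry on read
            return default
        self._data.move_to_end(key)  # mark as recently used
        return value

    def set(self, key, value):
        self._data[key] = (time.monotonic() + self.ttl, value)
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used
```

Per-process caches mean each worker has its own copy, which is exactly why they only make sense for low-reuse data: you're trading hit rate you didn't have anyway for zero network hops and zero Redis memory.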

Anything with a write-read latency requirement under 500ms. If a user writes, then immediately reads, they expect to see their write. Cache-aside patterns will serve them stale data from the cache for a few hundred milliseconds until the invalidation propagates. For those endpoints, we skip the cache entirely. The DB is fast enough, and correctness matters more.

One concrete thing to do today

Go look at your slowest endpoint. Pull its last week of logs, find the keys (query params, path params, user ID) that actually determine the response. Now ask: if I cached this, what's my invalidation story? If the answer is "5-minute TTL, users will deal with it," that's fine for read-mostly data. If the answer is "I don't know," you have work to do — and that work starts with the key design, not the Redis client library.
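
That log exercise can be mechanized. A sketch, assuming JSON-lines access logs with a "params" dict per entry (adjust the parsing to your own format; key_cardinality and estimated_hit_rate are names I made up for this example):

```python
import json
from collections import Counter

def key_cardinality(log_lines, params):
    """Count how often each distinct combination of `params` appears."""
    combos = Counter()
    for line in log_lines:
        entry = json.loads(line)
        combo = tuple(entry.get("params", {}).get(p) for p in params)
        combos[combo] += 1
    return combos

def estimated_hit_rate(combos: Counter) -> float:
    """
    Upper-bound estimate: a combo seen N times would have hit a warm,
    never-expiring cache N-1 times, so hit rate = 1 - distinct/total.
    """
    total = sum(combos.values())
    if total == 0:
        return 0.0
    return 1 - len(combos) / total
```

If the estimated hit rate comes back under, say, 10%, caching that endpoint probably costs more in invalidation complexity than it saves in latency.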

Caching is like plumbing. Nobody notices it when it works, everyone's angry when it leaks, and the leaks usually trace back to decisions you made on day one. Spend the extra hour on the key design. You'll thank yourself the first time a customer asks why they're seeing last week's price.

If you're building this into a larger system and the architecture choices start to feel heavier — where to put the cache, how to handle multi-region, when to go from Redis to a proper CDN — that's the point where we usually get called in. We've helped a few clients through exactly this scaling inflection point as part of our custom software development work, and the answer is almost never "just add more Redis." It's usually "simplify the thing you're trying to cache."
