There is a class of bugs that is worse than crashes. Crashes are loud: they page someone, and they get fixed.
This one is quiet. Your cache is running, your hit rate looks fine. But your database is being hit on every single request for data that has not changed at all. I hit this while implementing cache-aside in a side project.
A Quick Recap of How Cache-Aside Works
Before getting into the bug, let me explain the pattern I was using. Cache-aside is the most common caching strategy:
- A request comes in. Check the cache first.
- If the value is there, return it. Done.
- If it is not there, fetch from the database, store the result in the cache, then return it.
The next request for the same data skips the database entirely.
The way you detect a hit is simple: cache.get(key) returns None when the key is not there, and the actual value when it is. So the logic ends up looking like this:
```python
raw = cache.get(key)
if raw is not None:
    return raw            # hit: return the cached value
# Miss: cache.get() returned None. Go to the database.
result = fetch_from_db()
cache.set(key, result)
return result
```
This works great until your database returns None.
When None Has Two Jobs
Here is the scenario. My application tracks LLM traces. Each trace can have evaluation results attached to it. When a trace is brand new, it has no evaluations yet. The database legitimately returns None or an empty list.
Walk through what happens with the code above.
First request:
```
cache.get("eval_results:trace-abc") → None   # key doesn't exist, miss
fetch_from_db() → None                       # DB says "no evals yet", correct answer
cache.set(key, None)                         # store it
```
Second request, same trace, 200ms later:
```
cache.get("eval_results:trace-abc") → None
```
And here is the problem. Is that None a cache miss, or is it the stored None from the previous request?
You cannot tell. The function sees None and concludes "miss" every single time. It calls the database again, gets None again, stores None again. The value is written to the cache on every request, but the read path can never recognize it as a hit, so the miss repeats indefinitely.
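The loop above is easy to reproduce end to end. Here is a minimal, self-contained sketch of the bug; the dict-backed InMemoryCache and the fake fetch_from_db are stand-ins I am using for illustration, not the real implementation:

```python
# Minimal reproduction of the "None means miss AND None is the value" bug.
class InMemoryCache:
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)  # returns None on a miss

    def set(self, key, value):
        self._data[key] = value

db_calls = 0

def fetch_from_db():
    global db_calls
    db_calls += 1
    return None  # "no evaluations yet" is the correct answer

def get_eval_results(cache, key):
    raw = cache.get(key)
    if raw is not None:       # a stored None looks identical to a miss...
        return raw
    result = fetch_from_db()  # ...so every request ends up here
    cache.set(key, result)
    return result

cache = InMemoryCache()
get_eval_results(cache, "eval_results:trace-abc")
get_eval_results(cache, "eval_results:trace-abc")
print(db_calls)  # 2: the second request hit the database anyway
```

Both calls reach the database even though the second one should have been a cache hit.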
Why This Becomes a Database Problem
On its own, one trace with no evaluations is harmless. The danger is the scale. In my project, after ingesting a batch of traces, none of them have evaluation results yet. That is the normal state right after ingestion. The UI requests eval status for every trace on the page.
With this bug, every page load fires a database query for every trace, every time. The cache offers zero protection in exactly this case, and it is the most common case right after ingestion, exactly when you need the cache most. A page of 50 traces means 50 queries per load; put five people refreshing the dashboard at the same time and that becomes 250 queries in seconds. All for data that has not changed. That is the cascade.
The failure is invisible. Just a database working much harder than it should, and an application that will fall over under load that a correctly working cache would have absorbed completely.
The Fix: Give the Stored Nothing Its Own Identity
The problem is that None is doing two jobs at once. It means "key not found in cache" and it means "the actual value is empty." These two things need to be distinguishable.
The fix is a sentinel. Instead of storing literal None, you store a specific string that your real data will never produce.
```python
_CACHED_NONE = "__myapp:cached_none__"
```
Now the write path becomes:
```python
stored = _CACHED_NONE if result is None else result
cache.set(key, stored)
```
And the read path unwraps it:
```python
raw = cache.get(key)
if raw is not None:  # None still means miss
    return None if raw == _CACHED_NONE else raw  # unwrap the sentinel
```
Let's walk through the same scenario:
First request:
```
cache.get(key) → None                     # miss, key doesn't exist
fetch_from_db() → None
cache.set(key, "__myapp:cached_none__")   # stored as sentinel
```
Second request:
```
cache.get(key) → "__myapp:cached_none__"  # not None, so it's a HIT
unwrap → return None to caller
```
Database never touched.
The caller still receives None. The API behaviour is identical, but now the empty answer is cached and the database is protected. The write path that saves evaluation results invalidates the cache key, so the sentinel is only ever served while the data is genuinely absent.
The Test That Proves It
What I appreciated most when writing tests for this was how clearly a single test exposes the bug.
```python
def test_none_result_is_cached_not_refetched():
    cache = InMemoryCache()
    call_count = 0

    def compute():
        nonlocal call_count
        call_count += 1
        return None  # DB says "no results"

    cache_aside(cache, "k", compute, ttl_s=60)
    cache_aside(cache, "k", compute, ttl_s=60)
    assert call_count == 1  # compute only ran once: None was cached
```
Remove the sentinel and run this against the original code. call_count is 2. That is the bug, made visible. No ambiguity.
The Broader Pattern
This is not specific to caching. Overloading None is a common source of silent bugs anywhere a single value is made to carry two meanings.
Python's own standard library handles this with dict.get(key, default). You pass a custom default so you can distinguish "key not found" from "key found, value is None." Same idea.
The rule I now follow: whenever None means two different things in the same flow, one of them needs a name. A sentinel string, a dedicated object, a wrapper type. Anything that gives absence of value its own distinct identity so the code can tell the difference.
The bug I described will not show up in development. Your test database is small, your traffic is low, and a few extra queries are invisible. It surfaces in production, under load, in the form of a database that is inexplicably struggling with read traffic on a day when nothing obvious changed.
Write the test. Make it fail on the old code. Then fix it.
