Daniel Romitelli

Posted on • Originally published at craftedbydaniel.com

I Bought the Cheapest Redis and Dared It to Fail: The Circuit Breaker That Made Cache Optional (Series Part 11)

I noticed it as a silence first.

A request that normally felt instant took long enough that I caught myself staring at the logs like they owed me money. Nothing was “down.” No exception. No crash. Just… slow. And the slow wasn’t mysterious: Redis had stopped answering.

This is Part 11 of my series “How to Architect an Enterprise AI System (And Why the Engineer Still Matters)”. Part 10 was about privacy as a runtime toggle; this one is about a different kind of toggle: making Redis optional—not by hand-waving, but by design.

The uncomfortable truth is that a cache is only “an optimization” if your system behaves that way when it disappears. Most systems don’t. They say cache is optional, then they couple correctness to it with a thousand tiny assumptions until the first hiccup turns into an outage.

My core decision was simple to state and annoying to execute:

I chose the cheapest Redis tier, and I wrote a circuit breaker so the app survives without Redis.

That single decision forced me to be honest about what caching is supposed to do in my stack: save time and cost when it’s there, and quietly get out of the way when it’s not.

The key insight: “Optional” isn’t a slogan—it’s a return type

A naive cache integration has a hidden contract:

  • If Redis is slow, the app is slow.
  • If Redis errors, the app errors.
  • If Redis is misconfigured, the app behaves unpredictably.

That’s not a cache. That’s a dependency.

The trick that made everything else click was this: every cache operation must be allowed to fail and return None. Not “raise and handle somewhere.” Not “retry until it works.” Just: no cache value, proceed normally.

Once I committed to that, the rest of the architecture became almost mechanical:

  1. Wrap every Redis operation in a single safe wrapper.
  2. Track Redis health with a tiny state machine.
  3. Use TTL tiers so the cheapest tier stays useful under eviction pressure.
  4. Put a semantic cache in front of the full workflow so a hit skips the expensive pipeline.

The result is a system where Redis going down doesn’t break anything—it just makes things slower because the pipeline does the work it would have skipped.

And that’s exactly the trade I wanted.

How it works under the hood

At a high level, I treat Redis like a battery you plug into your system: great when it’s charged, harmless when it’s missing.

Here’s the flow I built.

```mermaid
flowchart TD
  request[Request] --> semanticCache[Semantic cache lookup]
  semanticCache -->|hit| cachedArtifact[Return cached artifact]
  semanticCache -->|miss| workflow[Run processing pipeline]

  workflow --> redisOps[Redis get set delete]
  redisOps --> wrapper[_execute_with_fallback]
  wrapper -->|success| redisOk[Cache used]
  wrapper -->|failure| noCache[Return None and continue]

  wrapper --> breaker[Circuit breaker]
  breaker --> health[RedisHealthStatus]
```



The important thing about this diagram is what’s not there: no arrow from Redis failure to “error response.” Redis failure routes to “continue.”

The Redis tier choice I made (and why)

I ran Redis on the Basic SKU (250 MB, $16/month) with allkeys-lru eviction.

A typical architecture review will push you toward Standard or Premium “because production.” And to be fair: that advice is often correct when the cache is a dependency.

But I didn’t want “production Redis.” I wanted cheap memory—and I wanted the system’s reliability story to come from my code, not from my bill.

The moment you make cache optional, the reliability gap stops being existential. Redis going down doesn’t become an incident; it becomes a performance regression.

That’s why I was comfortable choosing Basic and spending engineering effort on a circuit breaker instead.

The circuit breaker: a tiny state machine that protects the rest of the app

I model Redis health explicitly as a state machine:

  • HEALTHY
  • DEGRADED
  • UNHEALTHY
  • UNKNOWN

And I open the breaker after 5 failures, then attempt recovery after 5 minutes.

This is one of those places where “simple” is a feature: I didn’t want a sprawling resiliency framework. I wanted something I could read in one sitting and trust at 2am.

Below is a grounded, minimal version of that shape—state enum + dataclass—because the state itself is the contract the rest of the system relies on.



```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RedisHealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"
    UNKNOWN = "unknown"


@dataclass
class CircuitBreakerState:
    status: RedisHealthStatus = RedisHealthStatus.UNKNOWN
    failure_count: int = 0
    opened_at: Optional[float] = None
```

What surprised me is how much calmer the rest of the codebase becomes once you can ask one question—“what state are we in?”—instead of scattering Redis exception handling everywhere.

The wrapper that makes Redis safe: _execute_with_fallback()

The wrapper is the whole point. It enforces the “optional cache” contract:

  • If Redis works: return the value.
  • If Redis fails: return None.

And it updates the circuit breaker state as it goes.

The other critical detail: I use exponential backoff with a 0.1s base and a 2.0s cap. That’s not to “make Redis work.” It’s to avoid turning a transient wobble into a thundering herd.

```python
from typing import Any, Awaitable, Callable, Optional


async def _execute_with_fallback(
    operation: Callable[[], Awaitable[Any]],
    *,
    breaker: "RedisCircuitBreaker",
) -> Optional[Any]:
    """Execute a Redis operation safely.

    Contract: on any failure, return None and allow the caller to proceed.
    """
    try:
        return await breaker.execute(operation)
    except Exception:
        return None
```

This wrapper looks almost insultingly small, but that’s exactly why it works: it’s hard to misuse. I’d rather have one boring choke point than fifty clever call sites.

The breaker behavior: HEALTHY → DEGRADED → UNHEALTHY → UNKNOWN

The breaker itself is where the state transitions live. The transitions I cared about:

  • Failures accumulate until the breaker opens after 5 failures.
  • Once opened, it stays open for 5 minutes.
  • While open, Redis is treated as unavailable (operations short-circuit).
  • UNKNOWN is the honest startup state until you’ve observed success/failure.

And because I wanted this to be predictable under load, the breaker uses exponential backoff with the 0.1s base and 2.0s cap during retries.
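Those two numbers pin down the entire retry schedule. A one-line helper (mine, for illustration; the breaker computes the same formula inline) makes the schedule checkable by hand:

```python
def backoff_delay(failure_count: int, base: float = 0.1, cap: float = 2.0) -> float:
    """Exponential backoff: base * 2^(n-1) seconds, clamped at the cap."""
    return min(base * (2 ** (failure_count - 1)), cap)
```

The first six failures wait 0.1, 0.2, 0.4, 0.8, 1.6, and 2.0 seconds; every failure after that stays pinned at the 2.0s cap.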

Here’s a compact version of the breaker surface that matches the contract my wrapper expects: an execute() method that either returns the operation result or raises.

```python
import asyncio
import time


class RedisCircuitBreaker:
    def __init__(self):
        self.state = CircuitBreakerState()

        # Tunables from my shipped design
        self.max_failures = 5
        self.recovery_seconds = 5 * 60
        self.backoff_base_seconds = 0.1
        self.backoff_cap_seconds = 2.0

    def _is_open(self) -> bool:
        if self.state.opened_at is None:
            return False
        return (time.time() - self.state.opened_at) < self.recovery_seconds

    async def execute(self, operation):
        if self._is_open():
            raise RuntimeError("Redis circuit open")

        try:
            result = await operation()
            self.state.status = RedisHealthStatus.HEALTHY
            self.state.failure_count = 0
            self.state.opened_at = None
            return result
        except Exception:
            self.state.failure_count += 1
            self.state.status = (
                RedisHealthStatus.DEGRADED
                if self.state.failure_count < self.max_failures
                else RedisHealthStatus.UNHEALTHY
            )

            if self.state.failure_count >= self.max_failures:
                self.state.opened_at = time.time()

            # Exponential backoff values are part of the design contract
            delay = min(
                self.backoff_base_seconds * (2 ** (self.state.failure_count - 1)),
                self.backoff_cap_seconds,
            )
            await asyncio.sleep(delay)
            raise
```

The non-obvious benefit here is psychological: once the breaker exists, I stop treating Redis errors as “bugs” and start treating them as “weather.” The system adapts instead of panicking.

TTL tiers: keeping cheap Redis useful under eviction

On Basic, memory is constrained. That’s not a problem if you accept the constraint and design around it.

I used three TTL tiers per data type:

  • 24 hours default
  • 48 hours batch
  • 90 days patterns

The reason isn’t academic. It’s about what deserves to survive when allkeys-lru starts evicting.

  • Default extractions are valuable for a day because they reduce repeated work and cost.
  • Batch work benefits from a little longer window because it tends to repeat in bursts.
  • Patterns are slow-changing and expensive to relearn; they deserve long memory.

This is where allkeys-lru becomes a quiet ally: the most-accessed keys naturally stick around under pressure.

Here’s what that TTL strategy looks like as explicit constants—because if TTLs live as magic numbers across call sites, you’ll never reason about cache behavior.

```python
from enum import Enum


class CacheDataType(Enum):
    DEFAULT = "default"
    BATCH = "batch"
    PATTERN = "pattern"


TTL_SECONDS_BY_TYPE = {
    CacheDataType.DEFAULT: 24 * 60 * 60,
    CacheDataType.BATCH: 48 * 60 * 60,
    CacheDataType.PATTERN: 90 * 24 * 60 * 60,
}
```

The thing I didn’t expect: once I made TTL a first-class concept, debugging cache behavior got dramatically easier. You stop asking “why did Redis forget?” and start asking “which bucket was this in?”
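It also means “which bucket?” is answerable from a live key: redis-py’s ttl(key) returns the remaining seconds, and a small helper (mine, illustrative; the thresholds assume the three tiers above) maps that back to a tier:

```python
def ttl_bucket(remaining_seconds: int) -> str:
    """Map a key's remaining TTL back to the tier it was written with.

    Assumes the three tiers above: 24h default, 48h batch, 90d patterns.
    """
    if remaining_seconds > 48 * 60 * 60:
        return "pattern"   # only the 90-day tier can exceed 48 hours
    if remaining_seconds > 24 * 60 * 60:
        return "batch"     # between 24h and 48h: written as batch
    return "default"
```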

The semantic cache: skipping the entire workflow when a hit is “close enough”

A normal cache key is brittle: change one character, miss the cache, pay full price again.

So I also put a semantic cache layer in front of the main processing workflow. The decision rule is simple:

  • Generate a cache key from canonical records.
  • Compare the new input to cached artifacts.
  • If the similarity score exceeds the threshold, skip the entire LLM pipeline and return the cached artifact.
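A minimal sketch of that decision rule, with the similarity scoring deliberately left as an abstract callable (the 0.9 threshold here is a placeholder, not the shipped value):

```python
from typing import Any, Callable, Optional


def semantic_lookup(
    candidate: Any,
    cached_entries: list[tuple[Any, Any]],    # (canonical input, artifact) pairs
    similarity: Callable[[Any, Any], float],  # scoring internals stay abstract
    threshold: float = 0.9,                   # placeholder, not the shipped value
) -> Optional[Any]:
    """Return a cached artifact when some cached input is 'close enough'."""
    best_score, best_artifact = 0.0, None
    for cached_input, artifact in cached_entries:
        score = similarity(candidate, cached_input)
        if score > best_score:
            best_score, best_artifact = score, artifact
    return best_artifact if best_score >= threshold else None
```

A miss falls through to the full pipeline; a hit returns the artifact and skips it entirely—the cache sits around the workflow, not inside it.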

I’m deliberately not going into the scoring internals here. What matters is the architectural move: the cache sits around the workflow graph, not inside it.

That’s the difference between saving a few calls and saving the whole run.

And it pairs beautifully with the circuit breaker: when Redis is healthy, you get fast returns; when Redis is unhealthy, you still get correct results—just slower.

Infrastructure: configuring Basic Redis to behave like a good citizen

I also made a few infrastructure choices that match the “cheap but safe” philosophy:

  • Basic/C0 SKU
  • allkeys-lru eviction
  • public_network_access_enabled = false
  • private endpoint

That last pair matters: if I’m going to treat Redis as a shared performance component, I don’t want it exposed.

Here’s the shape of the Terraform configuration I used for those decisions.

```hcl
resource "azurerm_redis_cache" "cache" {
  # name, location, and resource_group_name omitted for brevity
  sku_name                      = "Basic"
  family                        = "C" # Basic uses the C family; C0 = 250 MB
  capacity                      = 0
  public_network_access_enabled = false

  redis_configuration {
    maxmemory_policy = "allkeys-lru"
  }
}
```

The part I like most about this snippet is what it implies: I’m not paying for luxury features to cover for sloppy application behavior. I’m paying for a small, predictable box of memory—and I’m doing the reliability work in code.

What went wrong before I fixed it

Before the breaker and wrapper, Redis failures had a nasty habit: they didn’t always throw immediately.

Sometimes the system would just slow down, then slow down more, then start timing out in places that had nothing to do with caching. That’s the classic failure mode of a “soft dependency”: it poisons the whole request path.

The fix wasn’t “more retries.” Retries are how you turn a small outage into a synchronized traffic jam.

The fix was:

  • centralize the Redis call boundary (_execute_with_fallback()),
  • observe health with a state machine,
  • open the circuit after 5 failures,
  • wait 5 minutes before trying again.

That’s not fancy. It’s disciplined.

The tradeoff: cheaper Redis means you must accept performance variance

The limitation of this approach is the one I chose on purpose: when Redis is down, you pay the full cost of cache misses.

I’m fine with that because:

  • The system remains correct.
  • The system remains available.
  • The failure mode is “slower,” not “broken.”

The analogy I use for this (once, because it’s earned here) is a power tool with a removable battery: the battery makes you fast, but the tool still works when you swap it out—you just work harder. Redis is my battery.

The money part (and why the engineer still matters)

Here’s the cost comparison that made the decision feel obvious:

  • Basic: $16/month
  • Standard: $220/month
  • Premium: $440+/month

If your cache is a dependency, you buy the expensive tier because you’re buying reliability.

If your cache is optional—because every operation can safely return None—then reliability is something you implement once, in about 50 lines of code, and keep forever.

That’s the kind of engineering move that doesn’t show up in a demo, but it changes what the business can afford to run.

In Part 12, I’m going to close the series by tying the whole thesis back to the system that keeps my work continuous across sessions—because the real architecture payoff isn’t any single feature, it’s the fact that I can keep building without losing the plot.


🎧 Listen to the Enterprise AI Architecture audiobook
📖 Read the full 13-part series with an AI assistant
