Four modes, one cockpit: how I designed graceful degradation up front

#devops #agents #monitoring #python

The cockpit I built has to work in four different operating modes
without code changes. Live cloud memory plus live model. Live cloud
memory plus deterministic model. Local fallback memory plus live
model. Local fallback memory plus deterministic model. Every
combination has to render correctly, produce a complete audit
trace, and not lie to the user about what's happening.

This is the design choice I'm proudest of, because I made it on day
one and it kept paying back through every demo, every CI run, and
every connectivity hiccup since.

Why four modes

The two axes are the two external services the system depends on.

Memory: the agent memory layer is Hindsight, a managed cloud store
for retain/recall/reflect operations. When the cloud is reachable
and the API key is valid, recall returns memory matches. When it
isn't, the system falls back to a local JSON store that ships with
seed memories.

Model: the language model is called through cascadeflow's
Groq adapter. With CASCADEFLOW_LIVE_GROQ=true and an API key
present, calls are real. With the flag false or the key missing,
the same code path returns deterministic, prerecorded root-cause
analysis (RCA) output: no exceptions and no degraded shape, just
the same response structure with live_call=False set.
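A minimal sketch of that switch, to make the "same shape, different flag" idea concrete. This is not cascadeflow's real API: ModelResult, call_model, live_enabled, the GROQ_API_KEY variable name, and the canned text are all assumptions for illustration, and the live call is injected as a plain callable.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping, Optional

# Hypothetical canned response; the real project ships its own prerecorded RCA output.
_PRERECORDED_RCA = "Pod is crash-looping because a required config value is missing."


@dataclass(frozen=True)
class ModelResult:
    text: str
    live_call: bool  # True only when a real model call produced `text`


def live_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Live mode needs both the flag and a key; anything else is deterministic."""
    flag = env.get("CASCADEFLOW_LIVE_GROQ", "false").strip().lower() == "true"
    # GROQ_API_KEY is an assumed variable name, not confirmed by the source.
    return flag and bool(env.get("GROQ_API_KEY"))


def call_model(
    prompt: str,
    live_fn: Optional[Callable[[str], str]] = None,
    env: Mapping[str, str] = os.environ,
) -> ModelResult:
    """Return the same result shape in both modes; only `live_call` differs."""
    if live_fn is not None and live_enabled(env):
        return ModelResult(text=live_fn(prompt), live_call=True)
    return ModelResult(text=_PRERECORDED_RCA, live_call=False)
```

Callers never branch on the mode; they read live_call only when they need to report it.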

Multiply the two and you get four quadrants:

                  Live model          Deterministic model
Cloud memory      Live demo path      Demo without API spend
Local fallback    Disconnected dev    Hermetic CI

Each quadrant has a distinct failure mode if you don't plan for it.
The hermetic CI quadrant is the one most people skip — they assume
the cloud will be up — and that's the one that bites you the night
before a demo.
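The matrix above can be written down as a tiny lookup, which is handy for logging or for labeling test runs. A sketch only: the Quadrant enum and resolve_quadrant are hypothetical names, not something the project is known to contain.

```python
from enum import Enum


class Quadrant(Enum):
    LIVE_DEMO = "Live demo path"
    DEMO_NO_SPEND = "Demo without API spend"
    DISCONNECTED_DEV = "Disconnected dev"
    HERMETIC_CI = "Hermetic CI"


def resolve_quadrant(fallback_mode: bool, live_model: bool) -> Quadrant:
    """Map the two runtime flags onto the four-quadrant matrix."""
    if not fallback_mode:
        # Cloud memory row
        return Quadrant.LIVE_DEMO if live_model else Quadrant.DEMO_NO_SPEND
    # Local fallback row
    return Quadrant.DISCONNECTED_DEV if live_model else Quadrant.HERMETIC_CI
```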

The connection probe

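Both adapters are constructed inside functions decorated with
Streamlit's @st.cache_resource, so each is built once and reused
across reruns instead of on every script execution. By default
st.cache_resource shares the cached object across every session in
the server process, so "session" in this post means the lifetime of
that cached adapter. Each one runs a probe on construction: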

# incident_agent/memory.py
import os
from typing import Any


class IncidentMemory:
    def __init__(self) -> None:
        base_url = os.environ.get("HINDSIGHT_BASE_URL", "")
        api_key = os.environ.get("HINDSIGHT_API_KEY", "")
        bank_id = os.environ.get("HINDSIGHT_BANK_ID", "openrecall")
        self._bank_id = bank_id  # recall/retain pass this to the client

        self.fallback_mode = False
        self.status = "disconnected"
        self._client: Any | None = None

        if not base_url or not api_key:
            self._flip_to_fallback("missing HINDSIGHT_BASE_URL or HINDSIGHT_API_KEY")
            return

        try:
            self._client = _build_hindsight_client(base_url, api_key)
            # Cheap probe: list banks or fetch bank metadata
            self._probe()
            self.status = f"connected to Hindsight Cloud at {base_url} / bank {bank_id}"
        except Exception as e:
            self._flip_to_fallback(f"probe failed: {e!s}")

The probe does one cheap call. If it succeeds, the cloud client
sticks around for the session. If it fails, _flip_to_fallback
fires.
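The body of _probe isn't shown here, and the real one presumably goes through the Hindsight client so that it also validates the API key. As a rough stand-in for "one cheap call with a bounded wait", here is a standard-library sketch that only checks TCP reachability; probe_reachable and its 2-second default timeout are assumptions.

```python
import socket
from urllib.parse import urlsplit


def probe_reachable(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds.

    This checks reachability only. It does not authenticate, so a real
    probe would still need one cheap authenticated API call on top.
    """
    parts = urlsplit(base_url)
    if not parts.hostname:
        return False
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        # The timeout bounds the wait so a dead endpoint cannot hang startup.
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False
```

The explicit timeout is the important part: without it, an unreachable host can stall adapter construction far longer than a demo can tolerate.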

The flip-to-fallback contract

This is a single function with a single job, and it is the most
important function in the memory module:

# incident_agent/memory.py
def _flip_to_fallback(self, reason: str) -> None:
    if self._client is not None:
        try:
            close = getattr(self._client, "close", None)
            if callable(close):
                close()
        except Exception:
            pass
    self._client = None
    self.fallback_mode = True
    self.status = f"local fallback active — {reason}"

Three things happen, in order:

1. The cloud client gets close() called if it exposes one. This releases the underlying HTTP connection right away instead of leaving the socket open until the client object is garbage-collected.
2. The client reference is dropped to None. No retry on the same client. If the cloud comes back mid-session, the system stays on local fallback until the next session. This is deliberate: flapping between cloud and local would produce inconsistent memory recall results.
3. The status string and the fallback_mode flag are updated. The cockpit reads both.

The contract is a hard requirement: any failed client.recall or
client.retain call must go through this helper. No silent
retries. No "let me just try the cloud one more time."
The first failure flips the mode for the whole session.
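One way to pin the close, drop, flag sequence down is a unit test against a fake client. The sketch below is self-contained: FakeClient and MemoryStub are hypothetical stand-ins, and the stub carries a reduced copy of the helper (status wording shortened) rather than importing the real module.

```python
from typing import Any, Optional


class FakeClient:
    """Hypothetical stand-in for the cloud client; counts close() calls."""

    def __init__(self) -> None:
        self.closed = 0

    def close(self) -> None:
        self.closed += 1


class MemoryStub:
    """Carries only the state that _flip_to_fallback touches."""

    def __init__(self, client: Optional[Any]) -> None:
        self._client = client
        self.fallback_mode = False
        self.status = "connected"

    def _flip_to_fallback(self, reason: str) -> None:
        if self._client is not None:
            try:
                close = getattr(self._client, "close", None)
                if callable(close):
                    close()
            except Exception:
                pass  # a failing close() must not block the flip
        self._client = None
        self.fallback_mode = True
        self.status = f"local fallback active: {reason}"


def check_flip_contract() -> None:
    client = FakeClient()
    mem = MemoryStub(client)
    mem._flip_to_fallback("recall failed: timeout")
    assert client.closed == 1         # 1. closed
    assert mem._client is None        # 2. dropped
    assert mem.fallback_mode is True  # 3. flagged
    # A second flip is harmless: there is no client left to close.
    mem._flip_to_fallback("retain failed: timeout")
    assert client.closed == 1
```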

What every code path has to look like

Once you commit to this contract, every cloud call has the same
shape:

def recall(self, query: str, *, limit: int = 5) -> list[MemoryMatch]:
    if self.fallback_mode or self._client is None:
        return self._recall_local(query, limit=limit)
    try:
        results = self._client.recall(query=query, bank_id=self._bank_id, limit=limit)
        return [self._to_match(r) for r in results]
    except Exception as e:
        self._flip_to_fallback(f"recall failed: {e!s}")
        return self._recall_local(query, limit=limit)

The pattern is identical for retain. Try the cloud; on failure,
flip and fall back; never let the exception escape to the workflow
layer. The workflow layer can stay ignorant of which mode it's in,
and that's the point — the bypass logic, the cost curve, the audit
trace, none of them branch on cloud-vs-local.
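Since recall and retain share the shape, the try, flip, fall back sequence can be factored into one guard. This is a sketch of that idea, not the project's code: GuardedMemory, _cloud_or_local, and FlakyClient are hypothetical names, and the flip here skips the close() step for brevity.

```python
from typing import Any, Callable, List, Optional, TypeVar

T = TypeVar("T")


class GuardedMemory:
    """Hypothetical reduction of the memory adapter to its guard logic."""

    def __init__(self, client: Optional[Any]) -> None:
        self._client = client
        self.fallback_mode = client is None
        self.status = "local fallback active" if client is None else "connected"

    def _flip_to_fallback(self, reason: str) -> None:
        # Reduced version: the real helper also closes the client first.
        self._client = None
        self.fallback_mode = True
        self.status = f"local fallback active: {reason}"

    def _cloud_or_local(
        self, op: str, cloud: Callable[[Any], T], local: Callable[[], T]
    ) -> T:
        """Try the cloud once; on any failure flip the mode and answer locally."""
        if self.fallback_mode or self._client is None:
            return local()
        try:
            return cloud(self._client)
        except Exception as e:
            self._flip_to_fallback(f"{op} failed: {e!s}")
            return local()

    def recall(self, query: str) -> List[str]:
        return self._cloud_or_local(
            "recall",
            cloud=lambda c: list(c.recall(query=query)),
            local=lambda: [f"local match for {query!r}"],
        )


class FlakyClient:
    """Test double whose recall always raises and counts attempts."""

    def __init__(self) -> None:
        self.calls = 0

    def recall(self, query: str) -> List[str]:
        self.calls += 1
        raise TimeoutError("cloud unreachable")
```

With a FlakyClient, two consecutive recall calls both return the local result, but the client is only touched once: the first failure flips the mode and the second call never reaches the cloud.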

The badges in the cockpit

The cockpit renders three runtime badges in the hero card:

- Memory state: Hindsight connected (green) or Local fallback active (amber)
- Routing: Standard route (gray) or Escalated route (purple), set by the most recent triage decision
- Model output: Live Groq calls enabled (blue) or Deterministic model output (slate)

Each badge reads one piece of session state. The first reads
mem.fallback_mode. The second reads the last RouteTrace step's
route. The third reads the env flag plus the most recent
live_call field. The user can see at a glance which of the
four quadrants they're operating in.

The screenshot above shows the cloud-memory + deterministic-model
combination. If the cloud were unreachable, the first badge would
be amber and say "Local fallback active." If the model env flag
flipped to true, the third badge would say "Live Groq calls
enabled."
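Keeping the badge logic as a pure function of session state makes it testable without Streamlit. A sketch under assumptions: the Badge type, the runtime_badges name, the "escalated" route value, and the rule for combining the env flag with the last live_call are all illustrative guesses, since the exact combination isn't spelled out above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Badge:
    label: str
    color: str


def runtime_badges(
    fallback_mode: bool,
    last_route: Optional[str],
    live_flag: bool,
    last_live_call: Optional[bool],
) -> List[Badge]:
    """Derive the three hero-card badges from plain values."""
    memory = (
        Badge("Local fallback active", "amber")
        if fallback_mode
        else Badge("Hindsight connected", "green")
    )
    routing = (
        Badge("Escalated route", "purple")
        if last_route == "escalated"  # assumed route value
        else Badge("Standard route", "gray")
    )
    # Assumed rule: show "live" when the flag is on and the latest call,
    # if there was one, was not deterministic.
    is_live = live_flag and last_live_call is not False
    model = (
        Badge("Live Groq calls enabled", "blue")
        if is_live
        else Badge("Deterministic model output", "slate")
    )
    return [memory, routing, model]
```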

The Demo Mode end-to-end test

The most useful test in the suite checks that the hermetic
quadrant — local memory, deterministic model — produces the same
shape of output as the live quadrant:

# tests/property/test_workflow.py
import pytest


def test_demo_mode_end_to_end(monkeypatch: pytest.MonkeyPatch) -> None:
    """Feature: openrecall, Demo_Mode end-to-end pipeline.

    Local memory + deterministic model produces a complete
    AnalysisResult with full route_trace and audit_trace, with
    len(audit_trace) == len(route_trace).
    """
    # monkeypatch restores both variables after the test, so the forced
    # fallback settings cannot leak into other tests in the same process.
    monkeypatch.setenv("CASCADEFLOW_LIVE_GROQ", "false")
    monkeypatch.setenv("HINDSIGHT_BASE_URL", "http://127.0.0.1:9")  # unreachable

    workflow = build_workflow_for_demo_mode()
    result = workflow.analyze(SAMPLE_CRASHLOOP_ALERT)

    assert result.incident.error_type
    assert len(result.route_trace) >= 3
    assert len(result.audit_trace) == len(result.route_trace)
    assert all(step.live_call is False for step in result.route_trace)

This test runs in CI on every commit. The Hindsight base URL
points at port 9 (the discard port; nothing is listening), so the
probe fails fast and the system flips to local fallback. The Groq
flag is off, so the model adapter returns deterministic output.
Both axes are forced to their fallback positions and the workflow
still has to produce a complete result.

The first time I ran this test, it failed because I had a code
path that returned an empty audit trace when the Groq adapter was
deterministic. The fix was a one-liner — emit the audit entry
unconditionally — but I would have shipped without it if the
hermetic quadrant test hadn't existed.
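The shape of that fix can be illustrated with a toy trace recorder where the audit entry and the route step are appended in the same place, with no branch on live_call. This is a sketch, not the project's workflow code: RouteStep, Traces, run_steps, and the step names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RouteStep:
    name: str
    live_call: bool


@dataclass
class Traces:
    route_trace: List[RouteStep] = field(default_factory=list)
    audit_trace: List[str] = field(default_factory=list)

    def record(self, name: str, live_call: bool) -> None:
        """Append to both traces together so their lengths cannot drift."""
        self.route_trace.append(RouteStep(name, live_call))
        # No `if live_call:` guard here: deterministic steps are audited too.
        source = "live" if live_call else "deterministic"
        self.audit_trace.append(f"{name}: {source} model output")


def run_steps(live: bool) -> Traces:
    traces = Traces()
    for name in ("triage", "recall", "rca"):  # illustrative step names
        traces.record(name, live_call=live)
    return traces
```

Recording both entries in one method turns the len(audit_trace) == len(route_trace) assertion into a structural property instead of something each code path has to remember.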

The thing I almost got wrong

My first instinct was to make the _flip_to_fallback helper
attempt one retry against the cloud before giving up. I deleted
that idea after thinking about what it would mean for the audit
trace.

If the system retries silently, the user has no way to know whether
they're seeing cloud memory or local memory in their results. The
audit trace would show Hindsight Cloud recall steps that
sometimes succeeded and sometimes returned local results, with no
indication of which. That's a worse user experience than a clean
flip.

The clean flip costs you the cloud for the rest of the session if
the cloud has a transient blip. That's a real cost. But it buys
you a session-wide invariant: every recall in this session came
from the same source. The user can read the badge and trust the
output.

I picked the invariant. If a session-wide flip is the wrong choice
for someone else's use case, the helper is a single function in a
single module and they can swap in retry logic. But for a triage
co-pilot where the analyst has to be able to trace decisions, I
wanted the predictability.

The takeaway

When your agent depends on two external services, you have four
modes whether you planned for them or not. Plan for them. Pick a
fallback contract — close, drop, flag — and apply it identically
to every external call. Render the mode in the UI so the user
always knows which quadrant they're in. Test the worst-case
quadrant in CI on every commit.

The Hindsight memory client and the cascadeflow routing layer both
expose enough metadata to do this cleanly. The Vectorize
agent-memory overview explains why memory has to be auditable in
the first place; the cascadeflow docs cover the
live-vs-deterministic switch on the model side.

Code is at https://github.com/Dawn-Fighter/openrecall

The four-quadrant mode matrix is documented in docs/ARCHITECTURE.md,
and the _flip_to_fallback helper is in incident_agent/memory.py. If
you copy one pattern from this project for your own agent, copy
that one.