The cheapest model call is the one you don't make

#ai #devops #llm #automation

I spent the better part of a week building an alert triage co-pilot,
and the most useful thing it does is refuse to call the language
model.

That sounds like a contradiction, so let me explain what I built and
why the most boring path through the code is the one I'm proudest of.

The setup

I work with on-call engineers and SOC analysts. The shape of their
day is well documented: a queue of alerts that never empties, where
40 to 50 percent are noise — duplicates, known-benign rate spikes, a
cron job that fires twice every Tuesday — and the rest are split
between things that need attention and things that look like things
that need attention.

The standard playbook for "AI in incident response" is to take every
alert and run it through a strong reasoning model that produces a
root cause hypothesis and a runbook. It works. It also costs money,
adds latency, and — this is the part that bothered me — re-derives
the same answer the team already wrote down three weeks ago.

The team learned. The system didn't.

The premise

I wanted the system to learn the same way the team does. When the
fifth identical "checkout-service CrashLoopBackOff after deploy"
shows up, the analyst doesn't open a fresh investigation. They look
at it, recognize it, and either dispose of it or escalate based on
prior context.

That's the behavior I wanted to encode. Not "ask the model to do
better RCA," but "skip the model when the answer is already in
memory."

For the memory layer I picked Hindsight, an agent memory product from
Vectorize that exposes a clean retain/recall/reflect API and stores
fingerprint-keyed memories that survive across sessions. It's open
source, and the docs are direct enough that I had the integration
wired up in an afternoon.

For the routing layer I picked cascadeflow. The pitch
is "runtime intelligence inside the agent loop" — model selection,
budget enforcement, full audit trail per step. I'd been looking for
a clean way to plug in cost tracking without writing it from scratch,
and the cascadeflow Groq adapter handled the inference path while
giving me the trace metadata I needed downstream.

The bypass

Here is the rule I encoded.

A new alert arrives. I extract a structured fingerprint from it
(error class, service role, dependency pattern, signal shape, attack
pattern, environment), then ask the memory layer for the closest
prior incidents keyed on that fingerprint. The memory layer returns
matches with a similarity score in [0, 1] and the analyst's final
triage decision attached to each one.
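
A minimal sketch of the fingerprint step, assuming a frozen dataclass with the six fields listed above. The field names follow the prose; the `fingerprint_key` helper, the normalization, and the hashing scheme are illustrative assumptions, not Hindsight's API or the project's actual code.

```python
from dataclasses import dataclass, astuple
import hashlib


@dataclass(frozen=True)
class AlertFingerprint:
    """Structured fingerprint; fields mirror the six named in the text."""
    error_class: str
    service_role: str
    dependency_pattern: str
    signal_shape: str
    attack_pattern: str  # empty string means "no attack pattern"
    environment: str


def fingerprint_key(fp: AlertFingerprint) -> str:
    """Hypothetical stable key: a short hash of the normalized field values."""
    normalized = "|".join(part.strip().lower() for part in astuple(fp))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


# Example values are made up for illustration.
a = AlertFingerprint("CrashLoopBackOff", "checkout-service", "post-deploy",
                     "restart-burst", "", "prod")
b = AlertFingerprint("crashloopbackoff ", "checkout-service", "post-deploy",
                     "restart-burst", "", "prod")
c = AlertFingerprint("OOMKilled", "checkout-service", "post-deploy",
                     "restart-burst", "", "prod")

# Case and stray whitespace do not change the key; a different error class does.
same_key = fingerprint_key(a) == fingerprint_key(b)
different_key = fingerprint_key(a) != fingerprint_key(c)
```

Keying on normalized fields rather than raw alert text is what lets the fifth identical alert land on the same memory as the first four.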

If — and only if — all four of these are true, I do not call the
strong model:

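# incident_agent/triage.py
from typing import Final

STRONG_MATCH_THRESHOLD: Final[float] = 0.85
DECISION_CONSISTENCY_THRESHOLD: Final[float] = 0.9
BYPASS_CONFIDENCE_THRESHOLD: Final[float] = 0.85

# Only low-risk dispositions may skip the strong model.
BYPASS_ELIGIBLE: Final[frozenset[TriageDecision]] = frozenset(
    {"false_positive", "duplicate", "known_benign"}
)


# Clauses 1 and 2 (match score, decision consistency) are not shown in
# this excerpt; the function below covers clauses 3 and 4.
def is_bypass_eligible(result: TriageResult, fp: AlertFingerprint) -> bool:
    # Hard block: anything carrying an attack pattern goes to the strong model.
    if fp.attack_pattern != "":
        return False
    # The proposed decision must be one of the low-risk dispositions.
    if result.proposed_decision not in BYPASS_ELIGIBLE:
        return False
    # The composite triage confidence must clear the bypass threshold.
    if result.triage_confidence < BYPASS_CONFIDENCE_THRESHOLD:
        return False
    return True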

The four clauses, in plain English:

The closest prior incident matches the new alert with score ≥ 0.85.
Among the top-k matches, ≥ 90% carry the same triage decision.
The composite triage confidence is ≥ 0.85.
The fingerprint has no attack pattern AND the dominant decision is one of false_positive, duplicate, or known_benign.
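
The four clauses can be sketched as a single predicate. This is an illustration, not the project's code: `MemoryMatch` and the `all_clauses_pass` signature are hypothetical, and only the three threshold values and the eligible decision set come from the excerpt above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Final

STRONG_MATCH_THRESHOLD: Final[float] = 0.85
DECISION_CONSISTENCY_THRESHOLD: Final[float] = 0.9
BYPASS_CONFIDENCE_THRESHOLD: Final[float] = 0.85
BYPASS_ELIGIBLE: Final[frozenset[str]] = frozenset(
    {"false_positive", "duplicate", "known_benign"}
)


@dataclass(frozen=True)
class MemoryMatch:  # hypothetical shape of one recalled incident
    score: float    # similarity in [0, 1]
    decision: str   # the analyst's final triage decision


def all_clauses_pass(matches: list[MemoryMatch], triage_confidence: float,
                     attack_pattern: str) -> bool:
    """Return True only when every bypass clause holds for the top-k matches."""
    if not matches:
        return False  # sparse memory: nothing to match against
    # Clause 1: the closest prior incident is a strong match.
    if max(m.score for m in matches) < STRONG_MATCH_THRESHOLD:
        return False
    # Clause 2: the matches agree on one dominant decision.
    decision, votes = Counter(m.decision for m in matches).most_common(1)[0]
    if votes / len(matches) < DECISION_CONSISTENCY_THRESHOLD:
        return False
    # Clause 3: the composite triage confidence is high enough.
    if triage_confidence < BYPASS_CONFIDENCE_THRESHOLD:
        return False
    # Clause 4: no attack pattern, and the dominant decision is low-risk.
    return attack_pattern == "" and decision in BYPASS_ELIGIBLE


unanimous = [MemoryMatch(0.93, "duplicate")] * 10
split = [MemoryMatch(0.93, "duplicate")] * 8 + [MemoryMatch(0.93, "true_positive")] * 2

bypass_ok = all_clauses_pass(unanimous, 0.90, "")
blocked_by_attack = all_clauses_pass(unanimous, 0.90, "port_scan")
blocked_by_split = all_clauses_pass(split, 0.90, "")
```

With these inputs only the first call clears all four clauses; an attack pattern or an 80/20 split among the matches sends the alert to the strong model.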

If any clause fails, the strong model gets called. If all four pass,
I emit a synthetic routing step with model="memory-bypass", set the
alert's cost to zero, and move on.

The fourth clause is the one I argued with myself about the most.
Why hard-block bypass on attack patterns? Because false positives in
security have a different cost shape than false positives in
reliability. A misrouted CrashLoopBackOff costs you a wasted
investigation. A misrouted port-scan signature costs you a breach.
The asymmetry is not a knob, so it isn't a knob in the code.

The audit invariant

Every routing decision has to be inspectable. If a junior analyst
ever asks "why did this alert get auto-decided," the answer has to
be a row in a table, not a vibe.

I enforce that with a single property:

For every analysis, len(audit_trace) == len(route_trace).

In code:

# incident_agent/audit.py
def record_step(self, step: RouteTrace, *, decision_basis: str) -> AuditTraceEntry:
    """Append exactly one audit row for the given routing step."""
    entry = AuditTraceEntry(
        step_index=len(self._entries),  # position of this row in the ledger
        model=step.model,
        route=step.route,
        cost_usd=step.cost_usd,
        baseline_cost_usd=step.baseline_cost_usd,
        live_call=step.live_call,
        decision_basis=decision_basis,
    )
    self._entries.append(entry)
    return entry

Every RouteTrace step — alert normalization, fingerprint
extraction, memory recall, triage, RCA, and the synthetic
memory-bypass when it fires — gets one and only one
AuditTraceEntry. The cockpit reads them as a table; the property
suite reads them as an assertion.

I cannot overstate how much pain this saved me. The first time I
shipped the bypass, I forgot to emit the synthetic audit entry, and
the ledger had three RouteTrace steps and two AuditTraceEntry rows.
The property test failed in milliseconds with a counterexample.
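
The invariant is cheap to assert. A minimal sketch, using simplified stand-ins for `RouteTrace` and the audit ledger: the `AuditLedger` class and the `check_parallel` helper are hypothetical names, and the real suite may use a property-testing library rather than the single hand-built case shown here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RouteTrace:  # simplified stand-in for the project's type
    model: str
    cost_usd: float


@dataclass
class AuditLedger:  # hypothetical minimal ledger
    entries: list[dict] = field(default_factory=list)

    def record_step(self, step: RouteTrace, *, decision_basis: str) -> None:
        self.entries.append({
            "step_index": len(self.entries),
            "model": step.model,
            "cost_usd": step.cost_usd,
            "decision_basis": decision_basis,
        })


def check_parallel(route_trace: list[RouteTrace], ledger: AuditLedger) -> None:
    """Raise AssertionError unless there is one audit row per route step."""
    assert len(ledger.entries) == len(route_trace), (
        f"{len(route_trace)} route steps but {len(ledger.entries)} audit rows"
    )


steps = [RouteTrace("cheap-model", 0.0001), RouteTrace("cheap-model", 0.0001),
         RouteTrace("memory-bypass", 0.0)]
ledger = AuditLedger()
for step in steps[:2]:  # reproduces the bug: the bypass step is never recorded
    ledger.record_step(step, decision_basis="routed")

try:
    check_parallel(steps, ledger)
    caught = ""
except AssertionError as exc:
    caught = str(exc)  # the mismatch is reported with both counts

ledger.record_step(steps[2], decision_basis="memory match")
check_parallel(steps, ledger)  # passes once the synthetic entry is recorded
```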

What it actually does in the cockpit

There are two views. The single-alert tab is the legacy RCA workflow
— paste a free-text alert, get a structured incident brief, an RCA
hypothesis with a confidence score, suggested verification commands,
and a learning loop where the analyst confirms the final root cause
and retains it to memory.

The queue tab is the new one. You upload a JSON array of alerts (or
click "Use packaged seed alerts" for the 100-alert demo dataset),
hit Analyze, and watch the batch summary fill in: alerts processed,
how many were auto-decided by memory, how many were escalated to
the strong model, total cost, baseline cost (what it would have cost
with the strong model on every alert), savings band, percent saved.

The cost curve below it is layered: the actual per-alert cost in
blue, the strong-model-only baseline in red, and a green shaded
savings band between them. That chart is the only one on the
screen, and that's deliberate: the band is the one quantity that
grows as memory accumulates.
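
The three layers reduce to two running sums and their difference. A minimal sketch with made-up per-alert costs, assuming the curve is plotted cumulatively (which is what "grows as memory accumulates" suggests); the numbers and variable names are illustrative, not taken from the project.

```python
from itertools import accumulate

# Hypothetical per-alert costs in USD; 0.0 marks a memory-bypass alert.
actual = [0.0004, 0.0003, 0.0, 0.0004, 0.0]
baseline = [0.0004, 0.0004, 0.0004, 0.0004, 0.0004]  # strong model on every alert

cumulative_actual = list(accumulate(actual))      # the blue line
cumulative_baseline = list(accumulate(baseline))  # the red line
# The green band is the gap between the two lines at each alert.
savings_band = [b - a for a, b in zip(cumulative_actual, cumulative_baseline)]

# The band never shrinks as long as no alert costs more than its baseline.
never_shrinks = all(
    later >= earlier - 1e-12
    for earlier, later in zip(savings_band, savings_band[1:])
)
```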

When the bypass fires, the audit trace expander for that alert ends
in a memory-bypass row with cost_usd = $0.000000. That's the
shape the system was designed to produce.

Numbers from a real run

On the packaged 100-alert dataset, with a freshly seeded memory
bank of 18 prior incidents and no in-session retains:

Total cost: $0.0268
Baseline cost (strong-only): $0.0384
Savings: $0.0116, or 30.2%
Auto-decided by memory: 0 (memory is sparse on first run)
Escalated to strong model: 53
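
The savings figures follow directly from the two totals. A quick check of the arithmetic, using only the numbers reported above:

```python
total_cost = 0.0268
baseline_cost = 0.0384  # strong model on every alert

savings = baseline_cost - total_cost
percent_saved = 100 * savings / baseline_cost

summary = f"${savings:.4f} saved, {percent_saved:.1f}%"
print(summary)  # prints "$0.0116 saved, 30.2%"
```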

The 30% savings come purely from cascadeflow routing the cheap
extraction steps (alert normalization, fingerprint pull) to a
qwen-class model instead of the strong one. The bypass count being
zero on first run is the correct number — memory is sparse, no
fingerprint cleared all four bypass clauses.

The interesting result happens on the second run. After 20 retains,
the bypass starts firing on repeat fingerprints, the auto-decided
count climbs, and the green savings band widens with each alert. That's
the cost curve compounding. The team does the work once; the system
charges you nothing the second time.

What I'd tell another engineer building this

Three things.

Make the no-call path a first-class route, not an exception. I
spent a day trying to express the bypass as "if condition, return
early." It got messy. The moment I modeled it as a synthetic
RouteTrace step with model="memory-bypass" and cost_usd=0.0,
everything got cleaner — the audit trace stayed parallel, the cost
curve recorded a real point, and the cockpit didn't need a special
case to render it.
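
A minimal sketch of what "first-class" means here, assuming a `RouteTrace` with the fields that `record_step` reads. The `bypass_step` helper, the `"bypass"` route label, and the baseline figure are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteTrace:  # fields mirror those read by record_step; the shape is assumed
    model: str
    route: str
    cost_usd: float
    baseline_cost_usd: float
    live_call: bool


def bypass_step(baseline_cost_usd: float) -> RouteTrace:
    """Build the synthetic step emitted instead of a strong-model call."""
    return RouteTrace(
        model="memory-bypass",
        route="bypass",                       # hypothetical route label
        cost_usd=0.0,                         # nothing was spent
        baseline_cost_usd=baseline_cost_usd,  # what the strong model would have cost
        live_call=False,                      # no inference request was made
    )


step = bypass_step(baseline_cost_usd=0.0004)
```

Because the bypass is an ordinary step, it can flow through the same audit and cost code as every other route without a special case.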

Score memory matches client-side. Vector stores will return a
score field. Ignore it. The threshold logic for whether to bypass
the strong model is yours, not your vector store's, and putting it
in the client keeps the bypass rule auditable and reproducible.
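
One way to do this, sketched under the assumption that fingerprints are flat mappings from field name to value: compute the score yourself as a weighted fraction of matching fields. The weights and the `client_side_score` function are illustrative, not the project's actual scoring.

```python
from typing import Final, Mapping

# Hypothetical field weights; they sum to 1.0 so the score stays in [0, 1].
FIELD_WEIGHTS: Final[Mapping[str, float]] = {
    "error_class": 0.3,
    "service_role": 0.2,
    "dependency_pattern": 0.2,
    "signal_shape": 0.1,
    "attack_pattern": 0.1,
    "environment": 0.1,
}


def client_side_score(new: Mapping[str, str], prior: Mapping[str, str]) -> float:
    """Weighted fraction of fingerprint fields on which both alerts agree."""
    total = sum(
        weight
        for name, weight in FIELD_WEIGHTS.items()
        if name in new and new[name] == prior.get(name)
    )
    return min(1.0, total)  # guard against floating-point overshoot


new_alert = {
    "error_class": "CrashLoopBackOff", "service_role": "checkout-service",
    "dependency_pattern": "post-deploy", "signal_shape": "restart-burst",
    "attack_pattern": "", "environment": "prod",
}
prior_alert = dict(new_alert, environment="staging")

identical = client_side_score(new_alert, new_alert)
one_field_off = client_side_score(new_alert, prior_alert)
```

A score built this way is deterministic and explainable field by field, which is the property the bypass rule needs.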

Pin the thresholds with Final and never inline a literal.
0.85, 0.9, 0.85 — those three numbers determine when a model gets
skipped. They live in one file as Final[float] constants. If you
inline them into a comparison anywhere else, the next person to
tune them will tune one and miss the other two.

The one-line summary

The cheapest model call is the one you don't make. Memory tells you
when not to make it. The best part of the project is the route that
does nothing — quickly, cheaply, and on the record.

Code lives at https://github.com/Dawn-Fighter/openrecall

The Hindsight docs and cascadeflow docs are both worth a read if
you're putting memory and runtime intelligence into your own agent.