<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sidharth SP</title>
    <description>The latest articles on DEV Community by Sidharth SP (@wnxd).</description>
    <link>https://dev.to/wnxd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3940827%2F1587348a-88e1-461e-bd1c-83eb1f725d0d.jpg</url>
      <title>DEV Community: Sidharth SP</title>
      <link>https://dev.to/wnxd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wnxd"/>
    <language>en</language>
    <item>
      <title>The cheapest model call is the one you don't make</title>
      <dc:creator>Sidharth SP</dc:creator>
      <pubDate>Tue, 19 May 2026 17:32:54 +0000</pubDate>
      <link>https://dev.to/wnxd/the-cheapest-model-call-is-the-one-you-dont-make-36kb</link>
      <guid>https://dev.to/wnxd/the-cheapest-model-call-is-the-one-you-dont-make-36kb</guid>
      <description>&lt;p&gt;I spent the better part of a week building an alert triage co-pilot,&lt;br&gt;
and the most useful thing it does is refuse to call the language&lt;br&gt;
model.&lt;/p&gt;

&lt;p&gt;That sounds like a contradiction, so let me explain what I built and&lt;br&gt;
why the most boring path through the code is the one I'm proudest of.&lt;/p&gt;
&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;I work with on-call engineers and SOC analysts. The shape of their&lt;br&gt;
day is well documented: a queue of alerts that never empties, where&lt;br&gt;
40 to 50 percent are noise — duplicates, known-benign rate spikes, a&lt;br&gt;
cron job that fires twice every Tuesday — and the rest are split&lt;br&gt;
between things that need attention and things that look like things&lt;br&gt;
that need attention.&lt;/p&gt;

&lt;p&gt;The standard playbook for "AI in incident response" is to take every&lt;br&gt;
alert and run it through a strong reasoning model that produces a&lt;br&gt;
root cause hypothesis and a runbook. It works. It also costs money,&lt;br&gt;
adds latency, and — this is the part that bothered me — re-derives&lt;br&gt;
the same answer the team already wrote down three weeks ago.&lt;/p&gt;

&lt;p&gt;The team learned. The system didn't.&lt;/p&gt;
&lt;h2&gt;
  
  
  The premise
&lt;/h2&gt;

&lt;p&gt;I wanted the system to learn the same way the team does. When the&lt;br&gt;
fifth identical "checkout-service CrashLoopBackOff after deploy"&lt;br&gt;
shows up, the analyst doesn't open a fresh investigation. They look&lt;br&gt;
at it, recognize it, and either dispose of it or escalate based on&lt;br&gt;
prior context.&lt;/p&gt;

&lt;p&gt;That's the behavior I wanted to encode. Not "ask the model to do&lt;br&gt;
better RCA," but "skip the model when the answer is already in&lt;br&gt;
memory."&lt;/p&gt;

&lt;p&gt;For the memory layer I picked Hindsight, an&lt;br&gt;
&lt;a href="https://vectorize.io/what-is-agent-memory" rel="noopener noreferrer"&gt;agent memory product from Vectorize&lt;/a&gt;&lt;br&gt;
that exposes a clean retain/recall/reflect API and stores&lt;br&gt;
fingerprint-keyed memories that survive across sessions. It's&lt;br&gt;
&lt;a href="https://github.com/vectorize-io/hindsight" rel="noopener noreferrer"&gt;open source&lt;/a&gt;, and the&lt;br&gt;
&lt;a href="https://hindsight.vectorize.io/" rel="noopener noreferrer"&gt;docs&lt;/a&gt; are direct enough that I had&lt;br&gt;
the integration wired up in an afternoon.&lt;/p&gt;

&lt;p&gt;For the routing layer I picked&lt;br&gt;
&lt;a href="https://github.com/lemony-ai/cascadeflow" rel="noopener noreferrer"&gt;cascadeflow&lt;/a&gt;. The pitch&lt;br&gt;
is "runtime intelligence inside the agent loop" — model selection,&lt;br&gt;
budget enforcement, full audit trail per step. I'd been looking for&lt;br&gt;
a clean way to plug in cost tracking without writing it from scratch,&lt;br&gt;
and the cascadeflow Groq adapter handled the inference path while&lt;br&gt;
giving me the trace metadata I needed downstream.&lt;/p&gt;
&lt;h2&gt;
  
  
  The bypass
&lt;/h2&gt;

&lt;p&gt;Here is the rule I encoded.&lt;/p&gt;

&lt;p&gt;A new alert arrives. I extract a structured fingerprint from it&lt;br&gt;
(error class, service role, dependency pattern, signal shape, attack&lt;br&gt;
pattern, environment), then ask the memory layer for the closest&lt;br&gt;
prior incidents keyed on that fingerprint. The memory layer returns&lt;br&gt;
matches with a similarity score in &lt;code&gt;[0, 1]&lt;/code&gt; and the analyst's final&lt;br&gt;
triage decision attached to each one.&lt;/p&gt;
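
&lt;p&gt;A minimal sketch of the data shapes this step implies. The fingerprint fields follow the list above; &lt;code&gt;MemoryMatch&lt;/code&gt; and &lt;code&gt;closest_matches&lt;/code&gt; are hypothetical names for illustration and are not the Hindsight API.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertFingerprint:
    # Fields mirror the six attributes listed in the text.
    error_class: str
    service_role: str
    dependency_pattern: str
    signal_shape: str
    attack_pattern: str  # empty string means "no attack pattern"
    environment: str


@dataclass(frozen=True)
class MemoryMatch:
    # Hypothetical record: one prior incident returned by the memory layer.
    fingerprint: AlertFingerprint
    score: float   # similarity in [0, 1]
    decision: str  # the analyst's final triage decision


def closest_matches(matches: list[MemoryMatch], k: int = 5) -> list[MemoryMatch]:
    """Return the k highest-scoring prior incidents, best first."""
    return sorted(matches, key=lambda m: m.score, reverse=True)[:k]
```

&lt;p&gt;Keeping the analyst's decision on each match is what lets the later clauses reason about agreement, not just similarity.&lt;/p&gt;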

&lt;p&gt;If — and only if — all four of these are true, I do not call the&lt;br&gt;
strong model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# incident_agent/triage.py
&lt;/span&gt;&lt;span class="n"&gt;STRONG_MATCH_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Final&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;
&lt;span class="n"&gt;DECISION_CONSISTENCY_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Final&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;
&lt;span class="n"&gt;BYPASS_CONFIDENCE_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Final&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;

&lt;span class="n"&gt;BYPASS_ELIGIBLE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Final&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TriageDecision&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false_positive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duplicate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;known_benign&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_bypass_eligible&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TriageResult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AlertFingerprint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;fp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attack_pattern&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;proposed_decision&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;BYPASS_ELIGIBLE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;triage_confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;BYPASS_CONFIDENCE_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The four clauses, in plain English:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The closest prior incident matches the new alert with score
≥ 0.85.&lt;/li&gt;
&lt;li&gt;Among the top-k matches, ≥ 90% carry the same
triage decision.&lt;/li&gt;
&lt;li&gt;The composite triage confidence is ≥ 0.85.&lt;/li&gt;
&lt;li&gt;The fingerprint has no attack pattern AND the dominant decision
is one of &lt;code&gt;false_positive&lt;/code&gt;, &lt;code&gt;duplicate&lt;/code&gt;, or &lt;code&gt;known_benign&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
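
&lt;p&gt;The snippet above covers the confidence and safety checks. The match-score and consistency checks from the first two clauses could be sketched roughly as follows. This is an illustration under assumptions: &lt;code&gt;top_match_is_strong&lt;/code&gt;, &lt;code&gt;decision_consistency&lt;/code&gt;, and the plain &lt;code&gt;(score, decision)&lt;/code&gt; tuples are hypothetical and not taken from the repository.&lt;/p&gt;

```python
from collections import Counter

STRONG_MATCH_THRESHOLD = 0.85
DECISION_CONSISTENCY_THRESHOLD = 0.9


def top_match_is_strong(matches: list[tuple[float, str]]) -> bool:
    """Clause 1: the best prior incident scores at or above the threshold."""
    if not matches:
        return False
    return max(score for score, _ in matches) >= STRONG_MATCH_THRESHOLD


def decision_consistency(matches: list[tuple[float, str]]) -> tuple[str, float]:
    """Clause 2 input: the dominant decision and the share of matches that agree."""
    if not matches:
        return "", 0.0
    decision, count = Counter(d for _, d in matches).most_common(1)[0]
    return decision, count / len(matches)


# Nine of ten prior incidents were closed as duplicates.
matches = [(0.93, "duplicate")] * 9 + [(0.88, "known_benign")]
dominant, share = decision_consistency(matches)
clauses_1_and_2 = top_match_is_strong(matches) and share >= DECISION_CONSISTENCY_THRESHOLD
```

&lt;p&gt;An empty recall result fails both checks here, so a brand-new fingerprint always falls through to the strong model.&lt;/p&gt;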

&lt;p&gt;If any clause fails, the strong model gets called. If all four pass,&lt;br&gt;
I emit a synthetic routing step with &lt;code&gt;model="memory-bypass"&lt;/code&gt;, set the&lt;br&gt;
alert's cost to zero, and move on.&lt;/p&gt;
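
&lt;p&gt;One way the synthetic step might look, as a sketch only. The &lt;code&gt;RouteTrace&lt;/code&gt; fields below are limited to the ones that &lt;code&gt;record_step&lt;/code&gt; reads in this post; the helper name &lt;code&gt;memory_bypass_step&lt;/code&gt; and the &lt;code&gt;"bypass"&lt;/code&gt; route label are assumptions.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteTrace:
    # Hypothetical subset of fields, limited to those read by record_step.
    model: str
    route: str
    cost_usd: float
    baseline_cost_usd: float
    live_call: bool


def memory_bypass_step(baseline_cost_usd: float) -> RouteTrace:
    """Build the zero-cost step recorded when the strong model is skipped."""
    return RouteTrace(
        model="memory-bypass",
        route="bypass",  # assumed label, not taken from the repository
        cost_usd=0.0,
        baseline_cost_usd=baseline_cost_usd,  # what the strong model would have cost
        live_call=False,  # no inference request was sent
    )
```

&lt;p&gt;Carrying the baseline cost on the bypass step keeps the savings calculation uniform: every step, live or synthetic, contributes both an actual and a baseline figure.&lt;/p&gt;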

&lt;p&gt;The fourth clause is the one I argued with myself about the most.&lt;br&gt;
Why hard-block bypass on attack patterns? Because a wrong dismissal in&lt;br&gt;
security has a different cost shape than a wrong dismissal in&lt;br&gt;
reliability. A misrouted CrashLoopBackOff costs you a wasted&lt;br&gt;
investigation. A misrouted port-scan signature costs you a breach.&lt;br&gt;
The asymmetry is not a knob, so it isn't a knob in the code.&lt;/p&gt;
&lt;h2&gt;
  
  
  The audit invariant
&lt;/h2&gt;

&lt;p&gt;Every routing decision has to be inspectable. If a junior analyst&lt;br&gt;
ever asks "why did this alert get auto-decided," the answer has to&lt;br&gt;
be a row in a table, not a vibe.&lt;/p&gt;

&lt;p&gt;I enforce that with a single property:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For every analysis, &lt;code&gt;len(audit_trace) == len(route_trace)&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# incident_agent/audit.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RouteTrace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decision_basis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AuditTraceEntry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AuditTraceEntry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;step_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_entries&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;cost_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cost_usd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;baseline_cost_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;baseline_cost_usd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;live_call&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;live_call&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;decision_basis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decision_basis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_entries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every &lt;code&gt;RouteTrace&lt;/code&gt; step — alert normalization, fingerprint&lt;br&gt;
extraction, memory recall, triage, RCA, and the synthetic&lt;br&gt;
&lt;code&gt;memory-bypass&lt;/code&gt; when it fires — gets one and only one&lt;br&gt;
&lt;code&gt;AuditTraceEntry&lt;/code&gt;. The cockpit reads them as a table, the property&lt;br&gt;
suite reads them as an assertion.&lt;/p&gt;

&lt;p&gt;I cannot overstate how much pain this saved me. The first time I&lt;br&gt;
shipped the bypass, I forgot to emit the synthetic audit entry, and&lt;br&gt;
the ledger had three RouteTrace steps and two AuditTraceEntry rows.&lt;br&gt;
The property test failed in milliseconds with a counterexample.&lt;/p&gt;
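
&lt;p&gt;A stripped-down sketch of the invariant as a plain assertion. The &lt;code&gt;Ledger&lt;/code&gt; class and &lt;code&gt;check_invariant&lt;/code&gt; are hypothetical stand-ins for the real ledger; a property-based test would call the same check after feeding in generated step sequences.&lt;/p&gt;

```python
class Ledger:
    """Hypothetical ledger that keeps route and audit traces in lockstep."""

    def __init__(self) -> None:
        self.route_trace: list[str] = []
        self.audit_trace: list[dict] = []

    def record(self, step: str, decision_basis: str) -> None:
        # Appending to both lists in one place is what preserves the invariant.
        self.route_trace.append(step)
        self.audit_trace.append(
            {"step_index": len(self.audit_trace), "step": step, "basis": decision_basis}
        )


def check_invariant(ledger: Ledger) -> None:
    """Raise AssertionError when the two traces have drifted apart."""
    assert len(ledger.audit_trace) == len(ledger.route_trace), (
        f"{len(ledger.route_trace)} route steps but "
        f"{len(ledger.audit_trace)} audit entries"
    )
```

&lt;p&gt;A route step appended without its audit entry, which is the bug described above, trips the assertion on the next check.&lt;/p&gt;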

&lt;h2&gt;
  
  
  What it actually does in the cockpit
&lt;/h2&gt;

&lt;p&gt;There are two views. The single-alert tab is the legacy RCA workflow&lt;br&gt;
— paste a free-text alert, get a structured incident brief, an RCA&lt;br&gt;
hypothesis with a confidence score, suggested verification commands,&lt;br&gt;
and a learning loop where the analyst confirms the final root cause&lt;br&gt;
and retains it to memory.&lt;/p&gt;

&lt;p&gt;The queue tab is the new one. You upload a JSON array of alerts (or&lt;br&gt;
click "Use packaged seed alerts" for the 100-alert demo dataset),&lt;br&gt;
hit Analyze, and watch the batch summary fill in: alerts processed,&lt;br&gt;
how many were auto-decided by memory, how many were escalated to&lt;br&gt;
the strong model, total cost, baseline cost (what it would have cost&lt;br&gt;
with the strong model on every alert), savings band, percent saved.&lt;/p&gt;
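
&lt;p&gt;The batch summary is simple aggregation over per-alert results. A rough sketch, assuming a hypothetical &lt;code&gt;AlertResult&lt;/code&gt; record with an actual cost, a baseline cost, and a bypass flag:&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertResult:
    # Hypothetical per-alert outcome; field names are illustrative.
    cost_usd: float
    baseline_cost_usd: float
    bypassed: bool


def summarize(results: list[AlertResult]) -> dict[str, float]:
    """Aggregate per-alert costs into the figures shown in a batch summary."""
    total = sum(r.cost_usd for r in results)
    baseline = sum(r.baseline_cost_usd for r in results)
    saved = baseline - total
    return {
        "alerts_processed": len(results),
        "auto_decided": sum(r.bypassed for r in results),
        "total_cost_usd": total,
        "baseline_cost_usd": baseline,
        "saved_usd": saved,
        # Guard against an empty batch or a zero baseline.
        "percent_saved": 100.0 * saved / baseline if baseline > 0 else 0.0,
    }
```

&lt;p&gt;Percent saved is computed against the baseline, so a batch where every alert is bypassed reports 100% and a batch with no routing benefit reports 0%.&lt;/p&gt;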

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vfci9knwt81pj43cws4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vfci9knwt81pj43cws4.png" alt=" " width="799" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cost curve below it is layered: the actual per-alert cost in&lt;br&gt;
blue, the strong-model-only baseline in red, and a green shaded&lt;br&gt;
savings band between them. That chart is the only one on the&lt;br&gt;
screen, and that's deliberate: the band is the one quantity that grows as&lt;br&gt;
memory accumulates.&lt;/p&gt;

&lt;p&gt;When the bypass fires, the audit trace expander for that alert ends&lt;br&gt;
in a &lt;code&gt;memory-bypass&lt;/code&gt; row with &lt;code&gt;cost_usd = $0.000000&lt;/code&gt;. That's the&lt;br&gt;
shape the system was designed to produce.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numbers from a real run
&lt;/h2&gt;

&lt;p&gt;On the packaged 100-alert dataset, with a freshly seeded memory&lt;br&gt;
bank of 18 prior incidents and no in-session retains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total cost: $0.0268&lt;/li&gt;
&lt;li&gt;Baseline cost (strong-only): $0.0384&lt;/li&gt;
&lt;li&gt;Savings: $0.0116, or 30.2%&lt;/li&gt;
&lt;li&gt;Auto-decided by memory: 0 (memory is sparse on first run)&lt;/li&gt;
&lt;li&gt;Escalated to strong model: 53&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 30% savings come purely from cascadeflow routing the cheap&lt;br&gt;
extraction steps (alert normalization, fingerprint pull) to a&lt;br&gt;
qwen-class model instead of the strong one. The bypass count being&lt;br&gt;
zero on first run is the &lt;em&gt;correct&lt;/em&gt; number: memory is sparse, so no&lt;br&gt;
fingerprint cleared all four bypass clauses.&lt;/p&gt;

&lt;p&gt;The interesting result happens on the second run. After 20 retains,&lt;br&gt;
the bypass starts firing on repeat fingerprints, the auto-decided&lt;br&gt;
count climbs, and the green savings band widens with each alert. That's&lt;br&gt;
the cost curve compounding. The team does the work once; the system&lt;br&gt;
charges you nothing the second time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell another engineer building this
&lt;/h2&gt;

&lt;p&gt;Three things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Make the no-call path a first-class route, not an exception.&lt;/strong&gt; I&lt;br&gt;
spent a day trying to express the bypass as "if condition, return&lt;br&gt;
early." It got messy. The moment I modeled it as a synthetic&lt;br&gt;
&lt;code&gt;RouteTrace&lt;/code&gt; step with &lt;code&gt;model="memory-bypass"&lt;/code&gt; and &lt;code&gt;cost_usd=0.0&lt;/code&gt;,&lt;br&gt;
everything got cleaner — the audit trace stayed parallel, the cost&lt;br&gt;
curve recorded a real point, and the cockpit didn't need a special&lt;br&gt;
case to render it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Score memory matches client-side.&lt;/strong&gt; Vector stores will return a&lt;br&gt;
score field. Ignore it. The threshold logic for whether to bypass&lt;br&gt;
the strong model is yours, not your vector store's, and putting it&lt;br&gt;
in the client keeps the bypass rule auditable and reproducible.&lt;/p&gt;
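
&lt;p&gt;As one possible illustration of client-side scoring (not necessarily how the project does it), a score can be computed directly from the structured fingerprint. The function name and the equal weighting of fields are assumptions made for this sketch.&lt;/p&gt;

```python
FIELDS = (
    "error_class", "service_role", "dependency_pattern",
    "signal_shape", "attack_pattern", "environment",
)


def fingerprint_similarity(a: dict[str, str], b: dict[str, str]) -> float:
    """Fraction of fingerprint fields on which two alerts agree, in [0, 1].

    Missing fields compare as empty strings.
    """
    agree = sum(1 for f in FIELDS if a.get(f, "") == b.get(f, ""))
    return agree / len(FIELDS)
```

&lt;p&gt;Because the function is deterministic and depends only on the two fingerprints, the same pair of alerts always produces the same score, which is what makes a bypass decision reproducible after the fact.&lt;/p&gt;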

&lt;p&gt;&lt;strong&gt;Pin the thresholds with &lt;code&gt;Final&lt;/code&gt; and never inline a literal.&lt;/strong&gt;&lt;br&gt;
0.85, 0.9, 0.85 — those three numbers determine when a model gets&lt;br&gt;
skipped. They live in one file as &lt;code&gt;Final[float]&lt;/code&gt; constants. If you&lt;br&gt;
inline them into a comparison anywhere else, the next person to&lt;br&gt;
tune them will tune one and miss the other two.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one-line summary
&lt;/h2&gt;

&lt;p&gt;The cheapest model call is the one you don't make. Memory tells you&lt;br&gt;
when not to make it. The best part of the project is the route that&lt;br&gt;
does nothing — quickly, cheaply, and on the record.&lt;/p&gt;

&lt;p&gt;Code lives at &lt;a href="https://github.com/Dawn-Fighter/openrecall" rel="noopener noreferrer"&gt;https://github.com/Dawn-Fighter/openrecall&lt;/a&gt;.&lt;br&gt;
The &lt;a href="https://hindsight.vectorize.io/" rel="noopener noreferrer"&gt;Hindsight docs&lt;/a&gt; and&lt;br&gt;
&lt;a href="https://docs.cascadeflow.ai/" rel="noopener noreferrer"&gt;cascadeflow docs&lt;/a&gt; are both worth a&lt;br&gt;
read if you're putting memory and runtime intelligence into your&lt;br&gt;
own agent.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>devops</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
