<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Karthik S</title>
    <description>The latest articles on DEV Community by Karthik S (@karthik_s_599904b6f055c2c).</description>
    <link>https://dev.to/karthik_s_599904b6f055c2c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3944646%2F9bfc3b7c-06dd-448c-a6a5-e8291213fde1.png</url>
      <title>DEV Community: Karthik S</title>
      <link>https://dev.to/karthik_s_599904b6f055c2c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/karthik_s_599904b6f055c2c"/>
    <language>en</language>
    <item>
      <title>I built an agent to audit incidents, but it kept forgetting Tuesday by Wednesday.</title>
      <dc:creator>Karthik S</dc:creator>
      <pubDate>Fri, 22 May 2026 17:26:24 +0000</pubDate>
      <link>https://dev.to/karthik_s_599904b6f055c2c/i-built-an-agent-to-audit-incidents-but-it-kept-forgetting-tuesday-by-wednesday-149c</link>
      <guid>https://dev.to/karthik_s_599904b6f055c2c/i-built-an-agent-to-audit-incidents-but-it-kept-forgetting-tuesday-by-wednesday-149c</guid>
      <description>&lt;p&gt;When you build an enterprise AI agent, the honeymoon phase lasts exactly until you push to production. &lt;/p&gt;

&lt;p&gt;I built "SentinelOps", an AI agent designed to audit compliance risks and operational incidents for our engineering teams. Initially, it was a massive success. It could ingest a JSON dump of server logs and perfectly diagnose a memory leak, providing step-by-step remediation plans.&lt;/p&gt;

&lt;p&gt;But a week later, the exact same memory leak occurred. I asked SentinelOps what to do. It hallucinated three entirely different solutions, completely forgetting the successful remediation plan we executed just days prior. &lt;/p&gt;

&lt;p&gt;My brilliant AI had severe amnesia. Here is how I re-architected our agent using &lt;a href="https://github.com/vectorize-io/hindsight" rel="noopener noreferrer"&gt;Hindsight&lt;/a&gt; to give it persistent, semantic memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Stateless Agents
&lt;/h2&gt;

&lt;p&gt;Large Language Models are inherently stateless. Every time you open a new context window, the model is born yesterday. &lt;/p&gt;

&lt;p&gt;Many developers try to solve this by dumping raw conversation logs back into the context window. This is a terrible idea. Not only do you hit token limits incredibly fast, but you also destroy the agent's ability to focus. Feeding an agent 100 pages of raw chat history to help it solve a targeted compliance issue just confuses it.&lt;/p&gt;

&lt;p&gt;What I needed was true &lt;a href="https://vectorize.io/what-is-agent-memory" rel="noopener noreferrer"&gt;agent memory&lt;/a&gt;—a way for the agent to semantically search its past experiences and only retrieve the exact memories relevant to the current crisis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing Hindsight
&lt;/h2&gt;

&lt;p&gt;I turned to the &lt;a href="https://hindsight.vectorize.io/" rel="noopener noreferrer"&gt;Hindsight docs&lt;/a&gt; to build a persistent memory layer. Instead of storing raw chat logs, I modified my backend to only store the &lt;em&gt;outcomes&lt;/em&gt; and &lt;em&gt;decisions&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Every time SentinelOps successfully diagnosed an issue, it was forced to generate a structured JSON summary. We then piped that summary directly into Hindsight:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// backend/services/memoryService.js&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;HindsightClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@vectorize-io/hindsight-client&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hindsight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HindsightClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;HINDSIGHT_URL&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;rememberDecision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;interactionId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;memoryDocument&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`
      Incident Query: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
      Risk Level: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;riskLevel&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
      Remediation: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;recommendedAction&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
      Governance: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;governanceSeverity&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
    `&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;hindsight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;interactionId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;memoryDocument&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Failed to commit to memory:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, when a new incident comes in, the first thing the agent does is query its own history:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;recallContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;hindsight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;---&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The CascadeFlow Optimization
&lt;/h2&gt;

&lt;p&gt;Once the agent had memory, the context windows started getting a bit larger, which increased API costs. To mitigate this, I implemented &lt;a href="https://github.com/lemony-ai/cascadeflow" rel="noopener noreferrer"&gt;CascadeFlow&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Using the &lt;a href="https://docs.cascadeflow.ai/" rel="noopener noreferrer"&gt;cascadeflow docs&lt;/a&gt;, I built a routing engine that checks the complexity of the query &lt;em&gt;before&lt;/em&gt; hitting the expensive models. If the query is just a simple policy lookup, it routes to a cheap 8B parameter model. If it's a critical infrastructure failure, it pulls the Hindsight memory and routes to the massive 70B reasoning model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;The transformation was immediate. &lt;/p&gt;

&lt;p&gt;&lt;a href="/absolute/path/to/media__1779389401506.png" class="article-body-image-wrapper"&gt;&lt;img src="/absolute/path/to/media__1779389401506.png" alt="Agent Memory in Action"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a recurring Kubernetes crash happened the following week, SentinelOps didn't guess. It responded with: &lt;em&gt;"Based on a similar incident 4 days ago, this is likely a misconfigured readiness probe in the payment-gateway service. Applying previous remediation plan..."&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Don't store raw logs.&lt;/strong&gt; Storing raw conversation history is garbage-in, garbage-out. Force your agent to summarize its decisions before committing them to memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context is king.&lt;/strong&gt; Providing an LLM with 3 highly relevant historical examples yields significantly better results than zero-shot prompting it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory requires routing.&lt;/strong&gt; If you are doing RAG or memory injection, you need an intelligent router like CascadeFlow to manage the compute costs of those larger context windows.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>backend</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Our AI Inference Bill Dropped 65% After We Stopped Treating Every Query the Same</title>
      <dc:creator>Karthik S</dc:creator>
      <pubDate>Thu, 21 May 2026 19:03:56 +0000</pubDate>
      <link>https://dev.to/karthik_s_599904b6f055c2c/our-ai-inference-bill-dropped-65-after-we-stopped-treating-every-query-the-same-l1b</link>
      <guid>https://dev.to/karthik_s_599904b6f055c2c/our-ai-inference-bill-dropped-65-after-we-stopped-treating-every-query-the-same-l1b</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
Every query hitting our AI layer was going straight to the most powerful model we had. A user asking "what does HIPAA Section 164.312 say?" got the same compute budget as one asking "should we shut down the payment processor during this active incident?" That was expensive and stupid, and it took embarrassingly long to fix.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the story of how we built a routing layer called CascadeFlow into &lt;a href="https://github.com/lemony-ai/cascadeflow" rel="noopener noreferrer"&gt;SentinelOps AI&lt;/a&gt;, an enterprise decision intelligence platform, and what actually happened when we turned it on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With "One Model Fits All"
&lt;/h2&gt;

&lt;p&gt;When you're building an AI system for enterprise operations teams—people making real decisions about infrastructure, compliance posture, and incident response—you face a genuine tension. You need the model to be &lt;em&gt;good&lt;/em&gt; when it matters. But "good" on a documentation lookup is a different thing from "good" on "we have a potential SOC2 violation, walk me through the remediation path."&lt;/p&gt;

&lt;p&gt;Before routing, every query went to our primary reasoning model (Llama 3.3 70B via Groq). The latency was fine. The quality was fine. The cost was not fine. At scale, routing simple factual queries through a 70B parameter model is just burning money.&lt;/p&gt;

&lt;p&gt;The naive fix is to have engineers triage queries manually, which doesn't scale. The correct fix is a classifier that does it automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  CascadeFlow: A Lightweight Routing Engine
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We integrated &lt;a href="https://github.com/lemony-ai/cascadeflow" rel="noopener noreferrer"&gt;@cascadeflow/core&lt;/a&gt; as our routing middleware. The idea is straightforward: before a query hits the expensive model, a cheap, fast classifier decides which tier it belongs to.&lt;/p&gt;

&lt;p&gt;Our routing logic looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;CascadeFlow&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@cascadeflow/core&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cascade&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CascadeFlow&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llama-3.1-8b-instant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// fast, cheap&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;groq&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;tiers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;simple&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llama-3.1-8b-instant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;triggers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;documentation&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;lookup&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;definition&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;what is&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;complex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llama-3.3-70b-versatile&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;triggers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;incident&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;compliance&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;risk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;critical&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;breach&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The classifier runs first—it's an 8B model, so it's fast and cheap—and classifies the incoming query into a complexity tier. Simple queries (policy lookups, definition requests, status checks) stay on the 8B model. Complex queries (active incidents, compliance risk assessments, multi-system decisions) escalate to the 70B.&lt;/p&gt;

&lt;p&gt;From our LLM service layer, the routing call is transparent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;routeAndExecute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cascade&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tier&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;complex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llama-3.3-70b-versatile&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llama-3.1-8b-instant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;groq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;buildMessages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;response_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;json_object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;response_format: json_object&lt;/code&gt; constraint is important—we'll come back to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Routing Actually Costs You
&lt;/h2&gt;

&lt;p&gt;There's a hidden cost to routing that nobody talks about: &lt;strong&gt;the classifier itself can be wrong&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In our early testing, the 8B classifier was misrouting about 12% of complex queries down to the cheap tier. A question like "is our current encryption at rest sufficient for PHI storage?" looks superficially like a documentation query. The classifier saw "encryption" and "PHI" as lookup-adjacent terms and routed it to the cheap model, which gave a technically accurate but shallow answer that lacked the risk-weighted framing an auditor would need.&lt;/p&gt;

&lt;p&gt;We fixed this in two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Conservative misclassification bias.&lt;/strong&gt; When the classifier's confidence is below a threshold, escalate to the expensive tier. False positives (routing simple queries high) cost money. False negatives (routing complex queries low) cost credibility. In an enterprise governance context, credibility is more expensive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Domain keyword pre-checks.&lt;/strong&gt; Before the classifier even runs, we scan for a hardcoded list of high-stakes terms. If a query contains words like &lt;code&gt;breach&lt;/code&gt;, &lt;code&gt;PHI&lt;/code&gt;, &lt;code&gt;incident&lt;/code&gt;, &lt;code&gt;remediation&lt;/code&gt;, or &lt;code&gt;SOC2&lt;/code&gt;, it goes to the 70B model unconditionally.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;HIGH_STAKES_KEYWORDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;breach&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;incident&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PHI&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PII&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SOC2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;HIPAA&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;remediation&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;critical&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;violation&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;audit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;penalty&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;requiresComplexModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;HIGH_STAKES_KEYWORDS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;kw&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;kw&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not elegant, but it's safe. The performance overhead is a single &lt;code&gt;.includes()&lt;/code&gt; check per query.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;After deploying CascadeFlow routing against a realistic mix of enterprise queries, roughly 68% of queries fell into the "simple" tier. The remaining 32% were genuinely complex—incident-related, compliance-heavy, or multi-system risk assessments that benefited from the more capable model.&lt;/p&gt;

&lt;p&gt;That routing split—combined with the price difference between an 8B and 70B parameter model—accounts for most of the cost reduction. The exact figure depends on your query distribution and your provider's pricing, but 60-65% is a reasonable estimate for an enterprise operational workload where most interactions are informational rather than analytical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Forcing Structure Out of Both Models
&lt;/h2&gt;

&lt;p&gt;One consequence of routing to two different models is that you now have two sources of unstructured text to deal with. We solved this by enforcing a strict JSON response schema at the prompt level, regardless of which model is running.&lt;/p&gt;

&lt;p&gt;Every response from SentinelOps AI conforms to this shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"One-sentence decision summary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"risk_level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"LOW | MEDIUM | HIGH | CRITICAL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"recommendation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Specific, actionable recommendation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tradeoffs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Tradeoff A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Tradeoff B"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"governance_flags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"citations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The frontend renders this as a Decision Card—not a chat bubble. Risk level gets a color-coded badge. Confidence is displayed as a progress bar. Tradeoffs are rendered as a checklist. Governance flags trigger a separate UI element that routes to the compliance dashboard.&lt;/p&gt;

&lt;p&gt;When you force both the cheap and expensive model into the same output schema, the quality difference between tiers becomes &lt;em&gt;measurable&lt;/em&gt;. You can compare &lt;code&gt;confidence&lt;/code&gt; scores, count &lt;code&gt;governance_flags&lt;/code&gt;, and track whether the 8B model's recommendations match the 70B model's on borderline queries. This becomes a feedback loop for improving your routing thresholds over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Start with keyword gating, not just ML classification.&lt;/strong&gt; A simple list of high-stakes terms as a pre-filter saved us from the worst misrouting failures. ML classifiers are probabilistic. Safety-critical routing decisions shouldn't be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Misrouting in the wrong direction is asymmetric.&lt;/strong&gt; Routing a simple query to a powerful model costs you money. Routing a complex query to a weak model costs you trust. Size your misclassification bias accordingly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. A common output schema across tiers is essential.&lt;/strong&gt; Without it, you're comparing apples and oranges and your frontend needs to handle two different response shapes. Force the schema at the prompt level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Routing is a product decision, not just an infrastructure one.&lt;/strong&gt; The thresholds you set for escalation reflect your platform's risk tolerance. In a governance context, we erred conservative. A developer tool might err aggressive. Know which direction your users would rather you fail.&lt;/p&gt;

&lt;p&gt;You can read more about how CascadeFlow handles multi-tier routing in the &lt;a href="https://docs.cascadeflow.ai/" rel="noopener noreferrer"&gt;cascadeflow docs&lt;/a&gt;. The cost savings are real, but the more important outcome is that complex queries now get the compute they actually need instead of competing on the same tier as "what does this acronym stand for."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftchlk5a6pg6mwh5njrl1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftchlk5a6pg6mwh5njrl1.jpg" alt=" " width="800" height="329"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8b3ri4coe9aoo8tqusa.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8b3ri4coe9aoo8tqusa.jpg" alt=" " width="771" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>javascript</category>
      <category>llm</category>
      <category>backend</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Karthik S</dc:creator>
      <pubDate>Thu, 21 May 2026 18:43:31 +0000</pubDate>
      <link>https://dev.to/karthik_s_599904b6f055c2c/-3ohk</link>
      <guid>https://dev.to/karthik_s_599904b6f055c2c/-3ohk</guid>
      <description></description>
    </item>
  </channel>
</rss>
