<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Riddhiman </title>
    <description>The latest articles on DEV Community by Riddhiman  (@riddhiman_sasri_2ca955793).</description>
    <link>https://dev.to/riddhiman_sasri_2ca955793</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3946548%2F163c5c8d-ac41-4e08-b94b-d4e45b844c96.png</url>
      <title>DEV Community: Riddhiman </title>
      <link>https://dev.to/riddhiman_sasri_2ca955793</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/riddhiman_sasri_2ca955793"/>
    <language>en</language>
    <item>
      <title>Why One Model Is Never Enough: Routing Incident Analysis With cascadeflow</title>
      <dc:creator>Riddhiman </dc:creator>
      <pubDate>Fri, 22 May 2026 17:52:15 +0000</pubDate>
      <link>https://dev.to/riddhiman_sasri_2ca955793/why-one-model-is-never-enough-routing-incident-analysis-with-cascadeflow-1f8k</link>
      <guid>https://dev.to/riddhiman_sasri_2ca955793/why-one-model-is-never-enough-routing-incident-analysis-with-cascadeflow-1f8k</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcoq4639x5l6br3z3pwzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcoq4639x5l6br3z3pwzj.png" alt=" " width="800" height="591"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first time our incident assistant burned through a premium reasoning model to parse a three-line nginx log, I knew we had a problem. Not with the AI. With the assumption that one model, called blindly every time, is the right way to build anything production-worthy.&lt;/p&gt;

&lt;p&gt;That assumption is expensive. And in the context of real-time incident response—where you're getting paged at 2 AM and your Redis cluster is throwing connection errors—it's also slow in ways that hurt.&lt;/p&gt;

&lt;p&gt;This is the story of how I built IncidentOS, an AI-powered operational memory system for SRE teams, and why &lt;a href="https://github.com/lemony-ai/cascadeflow" rel="noopener noreferrer"&gt;cascadeflow&lt;/a&gt; became the piece that made the runtime actually usable.&lt;/p&gt;




&lt;h2&gt;
  
  
  What IncidentOS Actually Does
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0ya9qr476dri6pfaznw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0ya9qr476dri6pfaznw.png" alt=" " width="800" height="365"&gt;&lt;/a&gt;&lt;br&gt;
The core insight behind IncidentOS is blunt: engineering teams solve the same incidents over and over again. Redis timeouts, pod OOMKills, connection pool exhaustion, deployment-triggered latency spikes. The fixes exist. They live in a Slack thread from eight months ago, a Jira ticket that's been closed, or in the head of the one senior engineer who happened to be on-call that night.&lt;/p&gt;

&lt;p&gt;IncidentOS is an attempt to fix the memory problem, not the monitoring problem. Datadog and Grafana are good at telling you what's happening right now. They're not built to tell you &lt;em&gt;we've seen this exact pattern before, here's what caused it, and here's what fixed it&lt;/em&gt;. That's a different problem, and it needs a different tool.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_220337.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_220337.png" alt="IncidentOS Dashboard — Incident Simulator and Operational Memory panel" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The main dashboard: incident scenarios on the left, live operational memory on the right. The Reflection Insights panel surfaces cross-incident patterns automatically — in this case, memory-related and deployment-related issues are already flagged as recurring root causes across 8 stored incidents.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The system is structured in two layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory layer&lt;/strong&gt;: powered by &lt;a href="https://hindsight.vectorize.io/" rel="noopener noreferrer"&gt;Hindsight&lt;/a&gt;. Every incident that passes through IncidentOS gets stored: symptoms, root cause, affected services, deployment version, remediation steps, and whether the fix actually worked. When a new incident comes in, Hindsight runs a semantic search across that history and surfaces the closest matches. This is semantic similarity, not keyword search, so "Prisma connection timeout" and "database pool exhausted" can correctly resolve to the same underlying pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtime intelligence layer&lt;/strong&gt;: powered by &lt;a href="https://docs.cascadeflow.ai/" rel="noopener noreferrer"&gt;cascadeflow&lt;/a&gt;. Every AI request gets routed to the right model for that specific task. Memory recall, simple log classification, and data retrieval go to a fast, cheap model (Haiku). Complex root-cause reasoning that requires synthesizing multiple historical incidents gets escalated to a more capable model (Sonnet). The routing logic is explicit, auditable, and configurable.&lt;/p&gt;

&lt;p&gt;The backend is FastAPI (Python 3.11+). The frontend is React 18 with Vite. ChromaDB handles vector storage for the memory layer. The whole thing runs in Docker with a single &lt;code&gt;docker-compose up&lt;/code&gt;. It's designed to be a decision-support tool — it never touches your infrastructure. Engineers stay in control.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Architecture Before We Go Further
&lt;/h2&gt;

&lt;p&gt;Here's the actual project structure — not a cleaned-up diagram, the real thing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmg7zjwlef75smv19wh2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmg7zjwlef75smv19wh2.png" alt=" " width="473" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_221906.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_221906.png" alt="IncidentOS Architecture HUD — full project tree showing frontend React components, FastAPI backend, agent core, ChromaDB, and integrations" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The full project tree. The key file is &lt;code&gt;agent/routing.py&lt;/code&gt; — that's where the cascadeflow model router lives. &lt;code&gt;agent/memory.py&lt;/code&gt; is the ChromaDB + Hindsight integration. The &lt;code&gt;data/runbooks/&lt;/code&gt; folder contains markdown remediation guides that get injected as context during incidents. &lt;code&gt;agent/tools.py&lt;/code&gt; exposes 8 operational SRE tools the agent calls during investigation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The clean separation matters: &lt;code&gt;agent/core.py&lt;/code&gt; runs the agentic state loops, &lt;code&gt;agent/routing.py&lt;/code&gt; handles model selection, &lt;code&gt;agent/memory.py&lt;/code&gt; handles semantic recall. They're independent. You can swap the memory backend or change the routing rules without touching the agent logic.&lt;/p&gt;
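
&lt;p&gt;One way to get that kind of independence is to have the agent depend on small interfaces instead of concrete classes. The sketch below is an illustration only: &lt;code&gt;MemoryBackend&lt;/code&gt;, &lt;code&gt;Router&lt;/code&gt;, &lt;code&gt;InMemoryBackend&lt;/code&gt;, &lt;code&gt;StaticRouter&lt;/code&gt;, and &lt;code&gt;Agent&lt;/code&gt; are hypothetical names, not the classes in the IncidentOS repo.&lt;/p&gt;

```python
from typing import Protocol


class MemoryBackend(Protocol):
    """Anything that can recall past incidents for a query (hypothetical interface)."""

    def recall(self, query: str, limit: int) -> list[str]: ...


class Router(Protocol):
    """Anything that can pick a model name for a task label (hypothetical interface)."""

    def select(self, task: str) -> str: ...


class InMemoryBackend:
    """Toy backend: substring match over a list, standing in for a vector store."""

    def __init__(self, incidents: list[str]) -> None:
        self._incidents = incidents

    def recall(self, query: str, limit: int) -> list[str]:
        hits = [i for i in self._incidents if query.lower() in i.lower()]
        return hits[:limit]


class StaticRouter:
    """Toy router: returns one fixed model for every task."""

    def __init__(self, model: str) -> None:
        self._model = model

    def select(self, task: str) -> str:
        return self._model


class Agent:
    """Depends only on the two interfaces, never on a concrete backend."""

    def __init__(self, memory: MemoryBackend, router: Router) -> None:
        self.memory = memory
        self.router = router

    def plan(self, query: str) -> dict:
        return {
            "model": self.router.select("memory_recall"),
            "context": self.memory.recall(query, limit=3),
        }


agent = Agent(
    InMemoryBackend(["Redis timeout after deploy", "Pod OOMKilled"]),
    StaticRouter("small-model"),
)
plan = agent.plan("redis")
```

&lt;p&gt;Swapping in a different memory backend or router then means passing a different object to &lt;code&gt;Agent&lt;/code&gt;; the agent code itself does not change.&lt;/p&gt;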


&lt;h2&gt;
  
  
  The Routing Problem (And Why It's Harder Than It Looks)
&lt;/h2&gt;

&lt;p&gt;When I started building this, I did what most people do: picked one model and called it for everything. It worked fine in testing. Then I ran it against realistic incident volumes. Costs climbed fast and, more importantly, &lt;em&gt;latency&lt;/em&gt; became a problem.&lt;/p&gt;

&lt;p&gt;Here's the thing about incident response: when your API latency has spiked and you're trying to understand why, you don't want to wait fifteen seconds for a premium model to finish thinking about a log file you could have parsed in two. Every second of model latency is a second added to your mean time to recovery.&lt;/p&gt;

&lt;p&gt;The fix wasn't complicated in concept. Different tasks have genuinely different complexity requirements:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Semantic memory recall from ChromaDB&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Haiku&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Simple log classification&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Haiku&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data retrieval and log search&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Haiku&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex root-cause reasoning across incidents&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The hard part is making that routing logic explicit, consistent, and observable. If you're just writing &lt;code&gt;if/else&lt;/code&gt; branches around different API calls, you end up with a mess that's hard to audit and impossible to tune. That's where &lt;a href="https://github.com/lemony-ai/cascadeflow" rel="noopener noreferrer"&gt;cascadeflow&lt;/a&gt; came in.&lt;/p&gt;


&lt;h2&gt;
  
  
  How cascadeflow Handles the Routing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwa1ko4bi3142b8fmq5x2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwa1ko4bi3142b8fmq5x2.png" alt=" " width="695" height="667"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.cascadeflow.ai/" rel="noopener noreferrer"&gt;cascadeflow&lt;/a&gt; is a runtime intelligence layer for AI agents. The core idea is that you define routing rules declaratively, and the runtime handles model selection, escalation logic, latency tracking, and cost accounting. You get an audit trail for every request.&lt;/p&gt;

&lt;p&gt;Here's the actual routing logic from &lt;code&gt;agent/routing.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;enum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;MEMORY_RECALL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory_recall&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;SIMPLE_CLASSIFICATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple_classification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;DATA_RETRIEVAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_retrieval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;COMPLEX_REASONING&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complex_reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# ModelRouter maps task types to models at runtime
&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ModelRouter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Haiku for lightweight tasks — memory recall is semantic search,
# not reasoning. No point paying for Sonnet.
&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MEMORY_RECALL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-haiku-20241022&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SIMPLE_CLASSIFICATION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-haiku-20241022&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DATA_RETRIEVAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-haiku-20241022&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Sonnet only when reasoning depth actually requires it
&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COMPLEX_REASONING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-sonnet-20241022&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
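
&lt;p&gt;The excerpt above does not show the &lt;code&gt;ModelRouter&lt;/code&gt; class itself. As a rough mental model, a registry with the same &lt;code&gt;register&lt;/code&gt; call could look like the sketch below. This is a hypothetical stand-in, not cascadeflow's API and not the project's actual class; the &lt;code&gt;select&lt;/code&gt; method and the &lt;code&gt;default_model&lt;/code&gt; fallback are assumptions added for illustration.&lt;/p&gt;

```python
from enum import Enum


class TaskType(Enum):
    MEMORY_RECALL = "memory_recall"
    COMPLEX_REASONING = "complex_reasoning"


class ModelRouter:
    """Hypothetical stand-in for the router; not cascadeflow's actual class."""

    def __init__(self, default_model: str) -> None:
        self._routes: dict[TaskType, str] = {}
        self._default = default_model

    def register(self, task: TaskType, model: str) -> None:
        """Map one task type to one model name."""
        self._routes[task] = model

    def select(self, task: TaskType) -> str:
        # Unregistered task types fall back to the default model
        return self._routes.get(task, self._default)


# Default to the stronger model so an unknown task is never under-served
router = ModelRouter(default_model="claude-3-5-sonnet-20241022")
router.register(TaskType.MEMORY_RECALL, model="claude-3-5-haiku-20241022")
```

&lt;p&gt;Falling back to the stronger model is a deliberate choice in this sketch: an unregistered task costs more than it should, but it does not get a weaker answer.&lt;/p&gt;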



&lt;p&gt;The reasoning is explicit, not magic. When the agent selects a model, it logs &lt;em&gt;why&lt;/em&gt; — and that log is streamed live to the UI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_220618.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_220618.png" alt="Agent Reasoning Stream — AGENT_START, REASONING, MODEL_SELECTION showing claude-3-5-haiku for MEMORY_RECALL with reasoning " width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The agent's live reasoning stream. The &lt;code&gt;[MODEL_SELECTION]&lt;/code&gt; event is the cascadeflow routing decision made visible — not just which model, but exactly why. "Memory recall is a lightweight semantic search task — using Haiku for efficiency." That one line of transparency is what makes engineers trust the system.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And here's what it looks like when the task actually warrants escalation to the advanced model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_220705.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_220705.png" alt="Agent Reasoning Stream — MODEL_SELECTION showing claude-3-5-sonnet for COMPLEX_REASONING with reasoning " width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Escalation in action. Post-mortem generation — correlating log patterns, checking deployment diffs, synthesizing remediation steps — routes to Sonnet. The reasoning is logged. Every model decision is auditable.&lt;/em&gt;&lt;/p&gt;
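
&lt;p&gt;A routing decision like the ones in these screenshots can be represented as a small structured event that carries the reason with it. The sketch below is an assumption about the shape of such an event; the &lt;code&gt;ModelSelectionEvent&lt;/code&gt; class, its field names, and &lt;code&gt;to_stream_line&lt;/code&gt; are hypothetical and not taken from the IncidentOS code or from cascadeflow.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelSelectionEvent:
    """One routing decision, including the human-readable reason (hypothetical shape)."""

    task_type: str
    model: str
    reason: str
    event: str = "MODEL_SELECTION"


def to_stream_line(evt: ModelSelectionEvent) -> str:
    """Serialize one event as a JSON line, e.g. for a WebSocket or a log sink."""
    return json.dumps(asdict(evt), sort_keys=True)


line = to_stream_line(
    ModelSelectionEvent(
        task_type="memory_recall",
        model="claude-3-5-haiku-20241022",
        reason="Memory recall is a lightweight semantic search task",
    )
)
print(line)
```

&lt;p&gt;Keeping the reason in the same record as the model name means the UI stream and the audit log can be fed from one source.&lt;/p&gt;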

&lt;p&gt;And in &lt;code&gt;main.py&lt;/code&gt;, the clean import structure shows how these layers stay separated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py — separation of concerns in the imports
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;IncidentAgent&lt;/span&gt;      &lt;span class="c1"&gt;# agentic state loops
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MemorySystem&lt;/span&gt;     &lt;span class="c1"&gt;# ChromaDB + Hindsight
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent.routing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ModelRouter&lt;/span&gt;     &lt;span class="c1"&gt;# cascadeflow routing
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;initialize_tools&lt;/span&gt;  &lt;span class="c1"&gt;# 8 SRE tool definitions
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;integrations.logs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogSearcher&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;integrations.slack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SlackNotifier&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_222024.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_222024.png" alt="VS Code showing main.py with actual imports including FastAPI, WebSocket, IncidentAgent, MemorySystem, ModelRouter, initialize_tools, LogSearcher, SlackNotifier" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The real &lt;code&gt;main.py&lt;/code&gt;. &lt;code&gt;IncidentAgent&lt;/code&gt; orchestrates, &lt;code&gt;MemorySystem&lt;/code&gt; handles recall, &lt;code&gt;ModelRouter&lt;/code&gt; handles routing. Three separate modules with one job each.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Runtime Intelligence panel on the frontend surfaces all of this cost accounting in real time:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_220642.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_220642.png" alt="Runtime Intelligence panel — Total Cost $0.0610, Input Tokens 12,370, Output Tokens 3,740, Cost by Task Type: memory_recall $0.0028 (1 call), simple_classification $0.0028 (2 calls), data_retrieval $0.0061 (3 calls), complex_reasoning $0.0493 (2 calls), Routing Decisions" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A full P1 investigation cost $0.0610 total. The breakdown tells the real story: &lt;code&gt;complex_reasoning&lt;/code&gt; consumed $0.0493 across 2 Sonnet calls. The 6 Haiku calls combined cost $0.0117. Without routing, all 8 calls would have gone to Sonnet. At scale, that difference is not small.&lt;/em&gt;&lt;/p&gt;
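
&lt;p&gt;The arithmetic behind that breakdown is easy to check. The sketch below re-adds the per-task figures shown in the panel; the tuple layout and variable names are mine, and &lt;code&gt;Decimal&lt;/code&gt; is used only to avoid floating-point noise in the sums.&lt;/p&gt;

```python
from decimal import Decimal

# (task type, model tier, calls, cost in USD) as shown in the Runtime Intelligence panel
rows = [
    ("memory_recall", "haiku", 1, Decimal("0.0028")),
    ("simple_classification", "haiku", 2, Decimal("0.0028")),
    ("data_retrieval", "haiku", 3, Decimal("0.0061")),
    ("complex_reasoning", "sonnet", 2, Decimal("0.0493")),
]

total = sum((cost for _, _, _, cost in rows), Decimal("0"))
haiku_cost = sum((cost for _, tier, _, cost in rows if tier == "haiku"), Decimal("0"))
haiku_calls = sum(calls for _, tier, calls, _ in rows if tier == "haiku")
sonnet_share = (total - haiku_cost) / total

print(f"total=${total} haiku=${haiku_cost} over {haiku_calls} calls")
print(f"sonnet share of spend: {sonnet_share:.1%}")
```

&lt;p&gt;Two of the eight calls account for most of the spend, which is why the tier assignment for &lt;code&gt;complex_reasoning&lt;/code&gt; is the one worth watching.&lt;/p&gt;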


&lt;h2&gt;
  
  
  Hindsight in the Loop: Memory-Grounded Responses
&lt;/h2&gt;

&lt;p&gt;The routing layer handles &lt;em&gt;how&lt;/em&gt; to run the AI. &lt;a href="https://vectorize.io/what-is-agent-memory" rel="noopener noreferrer"&gt;Hindsight's persistent agent memory&lt;/a&gt; handles &lt;em&gt;what context&lt;/em&gt; the AI reasons over.&lt;/p&gt;

&lt;p&gt;When a new incident is triggered, the agent first recalls semantically similar historical incidents from ChromaDB before generating any analysis. Here's an excerpt from the tools layer (&lt;code&gt;agent/tools.py&lt;/code&gt;) showing how the memory system is wired in alongside log search and Slack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tools.py — initialize_tools wires up memory, logs, and Slack at agent startup
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;initialize_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_searcher_instance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slack_notifier_instance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;memory_system_instance&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;log_searcher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slack_notifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_system&lt;/span&gt;
    &lt;span class="n"&gt;log_searcher&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_searcher_instance&lt;/span&gt;
    &lt;span class="n"&gt;slack_notifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;slack_notifier_instance&lt;/span&gt;
    &lt;span class="n"&gt;memory_system&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory_system_instance&lt;/span&gt;  &lt;span class="c1"&gt;# ChromaDB + Hindsight
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;incident&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Incident&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Searches telemetry providers for matching log patterns
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_222040.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_222040.png" alt="VS Code showing tools.py with imports from models.schemas, integrations.logs, integrations.slack, and the initialize_tools and search_logs function definitions" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The tools layer. &lt;code&gt;recall_similar_incidents&lt;/code&gt; is the first tool the agent calls — it queries ChromaDB for semantically similar past incidents. Only after that retrieval does the agent proceed to log analysis and reasoning, so every response is grounded in real history.&lt;/em&gt;&lt;/p&gt;
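
&lt;p&gt;To make "semantically similar" concrete, here is a toy version of similarity-based recall using cosine similarity. It is a sketch under loud assumptions: the three-dimensional vectors are hand-made rather than produced by an embedding model, &lt;code&gt;recall_similar&lt;/code&gt; is a hypothetical helper and not the project's &lt;code&gt;recall_similar_incidents&lt;/code&gt; tool, and a real setup would delegate this to the vector store.&lt;/p&gt;

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Invented 3-d "embeddings" (axes: database, memory, network).
# Real vectors would come from an embedding model.
history = {
    "database pool exhausted": [0.9, 0.1, 0.2],
    "pod OOMKilled after deploy": [0.1, 0.9, 0.1],
    "DNS resolution failures": [0.1, 0.0, 0.9],
}


def recall_similar(query_vec: list[float], top_k: int = 2) -> list[tuple[str, float]]:
    """Return the top_k stored incidents, ranked by similarity to the query vector."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in history.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Assumed embedding for "Prisma connection timeout": mostly database, some network
matches = recall_similar([0.8, 0.0, 0.4])
for text, score in matches:
    print(f"{score:.2f}  {text}")
```

&lt;p&gt;The query shares no keywords with "database pool exhausted", yet that entry ranks first because the vectors point in nearly the same direction.&lt;/p&gt;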

&lt;p&gt;Here's a real example, a P1 memory leak in the User Service detected at 4:33 PM:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feexdhne4i88gacepeld8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feexdhne4i88gacepeld8.png" alt=" " width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_220354.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_220354.png" alt="Incident card for Memory Leak in User Service — P1 DETECTED, ID INC-20260519163339, description: User service experiencing gradual memory increase leading to OOM errors and pod restarts, Detected 19/5/2026 4:33:39 pm, Start Agent Investigation button" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;An active P1. "Start Agent Investigation" kicks off the full agentic loop — memory recall, log analysis, root cause reasoning, remediation suggestions. Engineers initiate. The agent investigates. Nothing touches infrastructure.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After the agent runs — recalling from memory, analyzing logs, correlating with deployment history — this is the analysis output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_220550.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_220550.png" alt="Analysis results — Summary: Memory Leak in User Service, Root Cause: Uncached user sessions not being properly cleaned up after logout in the session manager code, Confidence: 94%, Impact: P1 Single service affected user-service Duration 1 minute, Actions Taken: Recalled memory-leak incident INC-2024-001 from ChromaDB, triaged severity P1, analyzed heap allocation logs, correlated with session manager release. Incident Timeline on right showing AGENT_START through MODEL_SELECTION steps" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Root cause identified at 94% confidence: uncached user sessions not cleaned up after logout in the session manager. The Actions Taken panel shows the full chain of reasoning — recalled INC-2024-001 from ChromaDB memory, triaged from OOM signals, analyzed heap allocation logs, correlated with the session manager release. Every step cited.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The complete investigation view — reasoning stream and runtime cost panel side by side:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_220607.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/Screenshot_2026-05-19_220607.png" alt="Full investigation UI — Agent Reasoning Stream on left with AGENT_START through MODEL_SELECTION events, Runtime Intelligence panel on right with Total Cost $0.0610, cost by task type breakdown, and routing decisions" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The live investigation interface. Reasoning stream on the left, cost accounting on the right. Engineers can watch the agent's thinking in real time while the runtime panel tracks exactly what's being spent and which model handles each step.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After each incident resolves, the outcome gets written back to &lt;a href="https://hindsight.vectorize.io/" rel="noopener noreferrer"&gt;Hindsight&lt;/a&gt; — what fix worked, what didn't, how long resolution took. The memory compounds over time.&lt;/p&gt;
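
&lt;p&gt;A write-back record along these lines could be modeled as a small dataclass. The sketch below is an assumption about the shape, using the fields this post mentions; &lt;code&gt;IncidentOutcome&lt;/code&gt;, &lt;code&gt;to_document&lt;/code&gt;, and the example values (including the ID and version string) are hypothetical and do not come from the Hindsight or ChromaDB APIs.&lt;/p&gt;

```python
from dataclasses import asdict, dataclass, field


@dataclass
class IncidentOutcome:
    """Resolved-incident record (hypothetical shape, fields taken from the post)."""

    incident_id: str
    symptoms: str
    root_cause: str
    affected_services: list[str]
    deployment_version: str
    remediation_steps: list[str] = field(default_factory=list)
    fix_worked: bool = False
    minutes_to_resolve: int = 0

    def to_document(self) -> tuple[str, dict]:
        """Split into searchable text plus flat metadata with scalar values only."""
        text = f"{self.symptoms}. Root cause: {self.root_cause}"
        meta = asdict(self)
        # Flatten lists into strings so every metadata value is a scalar
        meta["affected_services"] = ",".join(self.affected_services)
        meta["remediation_steps"] = " | ".join(self.remediation_steps)
        return text, meta


outcome = IncidentOutcome(
    incident_id="INC-EXAMPLE-001",  # placeholder ID
    symptoms="Gradual memory increase leading to OOM errors and pod restarts",
    root_cause="User sessions not cleaned up after logout",
    affected_services=["user-service"],
    deployment_version="example-version",  # placeholder value
    remediation_steps=["Roll back session manager release", "Add session cleanup"],
    fix_worked=True,
    minutes_to_resolve=42,  # placeholder value
)
text, meta = outcome.to_document()
```

&lt;p&gt;Storing &lt;code&gt;fix_worked&lt;/code&gt; next to the remediation steps is what lets later recalls separate fixes that held from fixes that were only attempted.&lt;/p&gt;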




&lt;h2&gt;
  
  
  What I Learned Building This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Task classification is worth getting right early.&lt;/strong&gt; The routing logic is only as good as the task classifier upstream. A misclassified task, such as a complex root-cause question routed to Haiku, produces confidently wrong output, which is worse than a slow correct answer. I spent more time here than expected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Making model selection visible changes how engineers engage.&lt;/strong&gt; The &lt;code&gt;[MODEL_SELECTION]&lt;/code&gt; step in the reasoning stream wasn't a nice-to-have. Engineers who can see &lt;em&gt;why&lt;/em&gt; the system picked Haiku vs Sonnet trust the output more. It reframes the AI as a transparent tool rather than a black box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Audit trails matter more than dashboards.&lt;/strong&gt; The cascadeflow cost-by-task breakdown turned out to be useful not just for cost tracking, but for debugging the routing logic itself. If &lt;code&gt;complex_reasoning&lt;/code&gt; costs spike unexpectedly, the first thing to check is whether the classifier is misrouting lighter tasks upward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Memory without recency weighting is dangerous.&lt;/strong&gt; An incident from three years ago might involve infrastructure that no longer exists. I added recency decay to the ChromaDB recall step so older incidents are surfaced with lower confidence scores. This sounds obvious in retrospect; it wasn't when I was first designing the retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Never touch infrastructure automatically.&lt;/strong&gt; This was always the design, but I'll say it plainly: IncidentOS is a decision-support tool. It surfaces information. Engineers act on it. The moment you start automating production changes based on AI suggestions without human review, you've built a different kind of incident.&lt;/p&gt;
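
&lt;p&gt;The recency decay from point 4 can be sketched as an exponential half-life applied to the similarity score. This is one common way to do it, not necessarily the formula IncidentOS uses; &lt;code&gt;decayed_score&lt;/code&gt; and the 180-day half-life are assumptions chosen for illustration.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone


def decayed_score(
    similarity: float,
    occurred_at: datetime,
    now: datetime,
    half_life_days: float = 180.0,  # assumed value; tune per environment
) -> float:
    """Halve the similarity score for every half_life_days of incident age."""
    age_days = max((now - occurred_at).total_seconds() / 86400.0, 0.0)
    return similarity * 0.5 ** (age_days / half_life_days)


now = datetime(2026, 5, 19, tzinfo=timezone.utc)

# A slightly weaker match from last month vs. a stronger match from three years ago
recent = decayed_score(0.90, now - timedelta(days=30), now)
old = decayed_score(0.95, now - timedelta(days=3 * 365), now)

print(f"recent={recent:.3f} old={old:.3f}")
```

&lt;p&gt;With these inputs the month-old incident outranks the three-year-old one even though its raw similarity is lower, which is the behavior the decay is meant to produce.&lt;/p&gt;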




&lt;h2&gt;
  
  
  Where This Goes
&lt;/h2&gt;

&lt;p&gt;The memory layer gets more useful as the incident corpus grows. Eight incidents in, it's catching patterns across root causes. Five hundred incidents in, it starts to feel like having a very experienced colleague who has personally debugged every failure your systems have ever produced.&lt;/p&gt;

&lt;p&gt;The routing layer gets cheaper as model pricing drops and fast models get more capable. The architecture stays the same — you just update the tier assignments in &lt;code&gt;routing.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you're building anything that involves repeated AI calls over structured workflows, the &lt;a href="https://docs.cascadeflow.ai/" rel="noopener noreferrer"&gt;cascadeflow docs&lt;/a&gt; are worth reading for the routing primitives alone. And if you're working on anything that needs memory across sessions, &lt;a href="https://hindsight.vectorize.io/" rel="noopener noreferrer"&gt;Hindsight&lt;/a&gt; is the most direct path I've found to persistent semantic recall without building retrieval infrastructure from scratch.&lt;/p&gt;

&lt;p&gt;The core insight remains simple: not every problem needs your most expensive model, and your agents shouldn't have to rediscover the same answers every time they run.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>career</category>
    </item>
  </channel>
</rss>
