Riddhiman

Posted on May 22

Why One Model Is Never Enough: Routing Incident Analysis With cascadeflow

#ai #career #programming #productivity

The first time our incident assistant burned through a premium reasoning model to parse a three-line nginx log, I knew we had a problem. Not with the AI. With the assumption that one model, called blindly every time, is the right way to build anything production-worthy.

That assumption is expensive. And in the context of real-time incident response—where you're getting paged at 2 AM and your Redis cluster is throwing connection errors—it's also slow in ways that hurt.

This is the story of how I built IncidentOS, an AI-powered operational memory system for SRE teams, and why cascadeflow became the piece that made the runtime actually usable.

What IncidentOS Actually Does

The core insight behind IncidentOS is blunt: engineering teams solve the same incidents over and over again. Redis timeouts, pod OOMKills, connection pool exhaustion, deployment-triggered latency spikes. The fixes exist. They live in a Slack thread from eight months ago, a Jira ticket that's been closed, or in the head of the one senior engineer who happened to be on-call that night.

IncidentOS is an attempt to fix the memory problem, not the monitoring problem. Datadog and Grafana are good at telling you what's happening right now. They're not built to tell you "we've seen this exact pattern before, here's what caused it, and here's what fixed it." That's a different problem, and it needs a different tool.

The main dashboard: incident scenarios on the left, live operational memory on the right. The Reflection Insights panel surfaces cross-incident patterns automatically — in this case, memory-related and deployment-related issues are already flagged as recurring root causes across 8 stored incidents.

The system is structured in two layers:

Memory layer (powered by Hindsight): every incident that passes through IncidentOS gets stored, including symptoms, root cause, affected services, deployment version, remediation steps, and whether the fix actually worked. When a new incident comes in, Hindsight runs a semantic search across that history and surfaces the closest matches. This is semantic similarity, not keyword search, so "Prisma connection timeout" and "database pool exhausted" can resolve to the same underlying pattern.

Runtime intelligence layer (powered by cascadeflow): every AI request gets routed to the right model for that specific task. Simple log parsing goes to a fast, cheap model. Incident summarization goes to a mid-tier model. Complex root-cause reasoning that requires synthesizing multiple historical incidents gets escalated to an advanced reasoning model. The routing logic is explicit, auditable, and configurable.
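To make the memory layer's stored fields concrete, here is a minimal sketch of what one incident record could look like. The class name, field names, and the `to_document` / `to_metadata` helpers are all hypothetical, chosen to mirror the fields listed above; this is not the IncidentOS schema or the Hindsight API, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentRecord:
    """Hypothetical shape of one stored incident (illustrative field names)."""
    incident_id: str
    symptoms: str
    root_cause: str
    affected_services: List[str]
    deployment_version: str
    remediation_steps: List[str] = field(default_factory=list)
    fix_worked: bool = False

    def to_document(self) -> str:
        """Flatten the free-text fields into one string for embedding."""
        return (
            f"Symptoms: {self.symptoms}\n"
            f"Root cause: {self.root_cause}\n"
            f"Services: {', '.join(self.affected_services)}\n"
            f"Remediation: {'; '.join(self.remediation_steps)}"
        )

    def to_metadata(self) -> dict:
        """Scalar fields stored next to the embedding, usable as filters."""
        return {
            "incident_id": self.incident_id,
            "deployment_version": self.deployment_version,
            "fix_worked": self.fix_worked,
        }


# Invented sample data, only to show the shape.
record = IncidentRecord(
    incident_id="INC-EXAMPLE-1",
    symptoms="Prisma connection timeout on checkout API",
    root_cause="database pool exhausted",
    affected_services=["checkout-api"],
    deployment_version="v1.2.3",
    remediation_steps=["raise pool size", "add connection timeout"],
    fix_worked=True,
)
print(record.to_document())
```

Splitting the record into an embeddable document and a small metadata dict is one common way to feed a vector store: the text drives similarity, while the scalars let you filter, for example to fixes that actually worked.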

The backend is FastAPI (Python 3.11+). The frontend is React 18 with Vite. ChromaDB handles vector storage for the memory layer. The whole thing runs in Docker with a single docker-compose up. It's designed to be a decision-support tool — it never touches your infrastructure. Engineers stay in control.

The Architecture Before We Go Further

Here's the actual project structure — not a cleaned-up diagram, the real thing:

The full project tree. The key file is agent/routing.py — that's where the cascadeflow model router lives. agent/memory.py is the ChromaDB + Hindsight integration. The data/runbooks/ folder contains markdown remediation guides that get injected as context during incidents. agent/tools.py exposes 8 operational SRE tools the agent calls during investigation.

The clean separation matters: agent/core.py runs the agentic state loops, agent/routing.py handles model selection, agent/memory.py handles semantic recall. They're independent. You can swap the memory backend or change the routing rules without touching the agent logic.

The Routing Problem (And Why It's Harder Than It Looks)

When I started building this, I did what most people do: picked one model and called it for everything. It worked fine in testing. Then I started running it against realistic incident volumes and the costs climbed fast. More importantly, latency became a problem.

Here's the thing about incident response: when your API latency has spiked and you're trying to understand why, you don't want to wait fifteen seconds for a premium model to finish thinking about a log file you could have parsed in two. Every second of model latency is a second added to your mean time to recovery.

The fix wasn't complicated in concept. Different tasks have genuinely different complexity requirements:

| Task | Complexity | Model |
| --- | --- | --- |
| Semantic memory recall from ChromaDB | Low | Haiku |
| Simple log classification | Low | Haiku |
| Data retrieval and log search | Medium | Haiku |
| Complex root-cause reasoning across incidents | High | Sonnet |

The hard part is making that routing logic explicit, consistent, and observable. If you're just writing if/else branches around different API calls, you end up with a mess that's hard to audit and impossible to tune. That's where cascadeflow came in.

How cascadeflow Handles the Routing

cascadeflow is a runtime intelligence layer for AI agents. The core idea is that you define routing rules declaratively, and the runtime handles model selection, escalation logic, latency tracking, and cost accounting. You get an audit trail for every request.

Here's the actual routing logic from agent/routing.py:

from enum import Enum

class TaskType(Enum):
    MEMORY_RECALL = "memory_recall"
    SIMPLE_CLASSIFICATION = "simple_classification"
    DATA_RETRIEVAL = "data_retrieval"
    COMPLEX_REASONING = "complex_reasoning"

# ModelRouter comes from elsewhere in this module (omitted from this excerpt);
# it maps task types to models at runtime
router = ModelRouter()

# Haiku for lightweight tasks — memory recall is semantic search,
# not reasoning. No point paying for Sonnet.
router.register(TaskType.MEMORY_RECALL,         model="claude-3-5-haiku-20241022")
router.register(TaskType.SIMPLE_CLASSIFICATION, model="claude-3-5-haiku-20241022")
router.register(TaskType.DATA_RETRIEVAL,        model="claude-3-5-haiku-20241022")

# Sonnet only when reasoning depth actually requires it
router.register(TaskType.COMPLEX_REASONING,     model="claude-3-5-sonnet-20241022")

The reasoning is explicit, not magic. When the agent selects a model, it logs why — and that log is streamed live to the UI:

The agent's live reasoning stream. The [MODEL_SELECTION] event is the cascadeflow routing decision made visible — not just which model, but exactly why. "Memory recall is a lightweight semantic search task — using Haiku for efficiency." That one line of transparency is what makes engineers trust the system.

And here's what it looks like when the task actually warrants escalation to the advanced model:

Escalation in action. Post-mortem generation — correlating log patterns, checking deployment diffs, synthesizing remediation steps — routes to Sonnet. The reasoning is logged. Every model decision is auditable.
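The register-then-log pattern described above can be sketched in a few lines. This is an illustrative stand-in, not cascadeflow's API and not the project's actual `ModelRouter`: the `Route` dataclass, the `select` method, and the `reason` parameter on `register` are assumptions added to show how a routing decision and its justification can travel together.

```python
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Dict

logger = logging.getLogger("routing")


class TaskType(Enum):
    MEMORY_RECALL = "memory_recall"
    COMPLEX_REASONING = "complex_reasoning"


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


class ModelRouter:
    """Hypothetical router: one model and one stated reason per task type."""

    def __init__(self) -> None:
        self._routes: Dict[TaskType, Route] = {}

    def register(self, task: TaskType, model: str, reason: str = "") -> None:
        self._routes[task] = Route(model=model, reason=reason)

    def select(self, task: TaskType) -> Route:
        # Fail loudly on unregistered tasks instead of silently defaulting.
        if task not in self._routes:
            raise KeyError(f"no model registered for {task.value}")
        route = self._routes[task]
        # Emit the decision and its reason so it can be streamed or audited.
        logger.info(
            "[MODEL_SELECTION] task=%s model=%s reason=%s",
            task.value, route.model, route.reason,
        )
        return route


router = ModelRouter()
router.register(
    TaskType.MEMORY_RECALL,
    model="claude-3-5-haiku-20241022",
    reason="Memory recall is a lightweight semantic search task",
)
router.register(
    TaskType.COMPLEX_REASONING,
    model="claude-3-5-sonnet-20241022",
    reason="Root-cause reasoning needs the stronger model",
)
choice = router.select(TaskType.MEMORY_RECALL)
print(choice.model, "-", choice.reason)
```

Storing the reason next to the model at registration time means the log line can never drift from the rule that produced it, which is the property that makes the decision auditable.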

And in main.py, the clean import structure shows how these layers stay separated:

# main.py — separation of concerns in the imports
from agent.core import IncidentAgent      # agentic state loops
from agent.memory import MemorySystem     # ChromaDB + Hindsight
from agent.routing import ModelRouter     # cascadeflow routing
from agent.tools import initialize_tools  # 8 SRE tool definitions
from integrations.logs import LogSearcher
from integrations.slack import SlackNotifier

The real main.py. IncidentAgent orchestrates, MemorySystem handles recall, ModelRouter handles routing. Three separate modules with one job each.

The Runtime Intelligence panel on the frontend surfaces all of this cost accounting in real time:

A full P1 investigation cost $0.0610 total. The breakdown tells the real story: complex_reasoning consumed $0.0493 across 2 Sonnet calls. The 6 Haiku calls combined cost $0.0117. Without routing, all 8 calls would have gone to Sonnet. At scale, that difference is not small.
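For a rough sense of what "not small" means, the figures above can be plugged into a back-of-the-envelope estimate. The one number that does not come from the investigation is `PRICE_RATIO`: it assumes Sonnet costs about 3.75 times as much per token as Haiku and that the six lightweight calls would use the same token counts on either model. Treat both as assumptions and substitute your provider's current pricing.

```python
# Figures reported for the P1 investigation above.
sonnet_cost = 0.0493  # 2 complex_reasoning calls on Sonnet
haiku_cost = 0.0117   # 6 lightweight calls on Haiku
routed_total = sonnet_cost + haiku_cost

# ASSUMPTION: per-token price ratio of Sonnet to Haiku. Not from the article;
# check current pricing before relying on this value.
PRICE_RATIO = 3.75

# Estimated bill if the 6 lightweight calls had also gone to Sonnet,
# assuming identical token counts per call.
unrouted_total = sonnet_cost + haiku_cost * PRICE_RATIO
savings = unrouted_total - routed_total
savings_pct = 100 * savings / unrouted_total

print(f"routed:   ${routed_total:.4f}")
print(f"unrouted: ${unrouted_total:.4f} (estimate)")
print(f"saved:    ${savings:.4f} ({savings_pct:.1f}%)")
```

Under those assumptions the routed run comes out roughly a third cheaper than sending everything to Sonnet. The saving grows with the share of lightweight calls, since the Sonnet portion of the bill is the same either way.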

Hindsight in the Loop: Memory-Grounded Responses

The routing layer handles how to run the AI. Hindsight's persistent agent memory handles what context the AI reasons over.

When a new incident is triggered, the agent first recalls semantically similar historical incidents from ChromaDB before generating any analysis. Here's an excerpt from the tools layer (agent/tools.py) showing how the memory system is wired in:

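# tools.py: initialize_tools wires up memory, logs, and Slack at agent startup
from typing import Any, Dict

# Incident is the project's incident model; its import is omitted from this excerpt,
# so the annotation below is written as a string.

# Module-level handles, populated once by initialize_tools()
log_searcher = None
slack_notifier = None
memory_system = None

def initialize_tools(log_searcher_instance, slack_notifier_instance,
                     memory_system_instance):
    global log_searcher, slack_notifier, memory_system
    log_searcher   = log_searcher_instance
    slack_notifier = slack_notifier_instance
    memory_system  = memory_system_instance  # ChromaDB + Hindsight

async def search_logs(input_data: Dict[str, Any], incident: "Incident"):
    query = input_data.get("query", "")
    # Searches telemetry providers for matching log patterns
    # (rest of the body omitted from this excerpt)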

The tools layer. recall_similar_incidents is the first tool the agent calls — it queries ChromaDB for semantically similar past incidents. Only after that retrieval does the agent proceed to log analysis and reasoning, so every response is grounded in real history.
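The recall-first ordering can be sketched as a small async pipeline. Both tool functions here are stubs with invented return values, and their signatures are assumptions; the real `recall_similar_incidents` and `search_logs` in agent/tools.py take different arguments. The point is only the sequencing: history is fetched before anything else, and later steps consume it.

```python
import asyncio
from typing import Any, Dict, List


async def recall_similar_incidents(symptoms: str) -> List[Dict[str, Any]]:
    """Stub memory lookup; a real version would query the vector store."""
    return [{"id": "INC-EXAMPLE-1", "root_cause": "session objects never freed"}]


async def search_logs(query: str) -> List[str]:
    """Stub log search; a real version would call a telemetry provider."""
    return [f"log line matching '{query}'"]


async def investigate(symptoms: str) -> Dict[str, Any]:
    # Step 1: recall history first, so every later step can cite it.
    history = await recall_similar_incidents(symptoms)

    # Step 2: use recalled root causes to widen the log search.
    hints = [match["root_cause"] for match in history]
    logs = await search_logs(" OR ".join([symptoms, *hints]))

    # Step 3 (omitted): pass history and logs to the reasoning model.
    return {"history": history, "logs": logs, "steps": ["recall", "search_logs"]}


result = asyncio.run(investigate("memory leak in user service"))
print(result["steps"])
```

Because the log query is built from the recalled incidents, an empty memory still works (the query falls back to the raw symptoms), while a populated one narrows the search toward causes that have been seen before.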

Here's a real example, a P1 Memory Leak in User Service detected at 4:33 PM:

An active P1. "Start Agent Investigation" kicks off the full agentic loop — memory recall, log analysis, root cause reasoning, remediation suggestions. Engineers initiate. The agent investigates. Nothing touches infrastructure.

After the agent runs — recalling from memory, analyzing logs, correlating with deployment history — this is the analysis output:

Root cause identified at 94% confidence: uncached user sessions not cleaned up after logout in the session manager. The Actions Taken panel shows the full chain of reasoning — recalled INC-2024-001 from ChromaDB memory, triaged from OOM signals, analyzed heap allocation logs, correlated with the session manager release. Every step cited.

The complete investigation view — reasoning stream and runtime cost panel side by side:

The live investigation interface. Reasoning stream on the left, cost accounting on the right. Engineers can watch the agent's thinking in real time while the runtime panel tracks exactly what's being spent and which model handles each step.

After each incident resolves, the outcome gets written back to Hindsight — what fix worked, what didn't, how long resolution took. The memory compounds over time.

What I Learned Building This

1. Task classification is worth getting right early. The routing logic is only as good as the task classifier upstream. A misclassified task, such as a complex root-cause question routed to Haiku, produces confidently wrong output, which is worse than a slow correct answer. I spent more time here than expected.

2. Making model selection visible changes how engineers engage. The [MODEL_SELECTION] step in the reasoning stream wasn't a nice-to-have. Engineers who can see why the system picked Haiku vs Sonnet trust the output more. It reframes the AI as a transparent tool rather than a black box.

3. Audit trails matter more than dashboards. The cascadeflow cost-by-task breakdown turned out to be useful not just for cost tracking, but for debugging the routing logic itself. If complex_reasoning costs spike unexpectedly, a likely cause is the classifier mis-routing lighter tasks upward, and the per-task breakdown is where that shows up first.

4. Memory without recency weighting is dangerous. An incident from three years ago might involve infrastructure that no longer exists. I added recency decay to the ChromaDB recall step so older incidents are surfaced with lower confidence scores. This sounds obvious in retrospect; it wasn't when I was first designing the retrieval.

5. Never touch infrastructure automatically. This was always the design, but I'll say it plainly: IncidentOS is a decision-support tool. It surfaces information. Engineers act on it. The moment you start automating production changes based on AI suggestions without human review, you've built a different kind of incident.
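The recency weighting from lesson 4 can be sketched as a rerank step applied to whatever the vector store returns. The article does not say which decay function IncidentOS uses, so the exponential half-life here is an assumption, as are the 180-day value and the function names. It also assumes similarity scores in the range 0 to 1 where higher is better; ChromaDB returns distances, so a real integration would convert those first.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

# ASSUMPTION: after this many days, a match counts for half as much.
HALF_LIFE_DAYS = 180.0


def decayed_score(similarity: float, occurred_at: datetime, now: datetime,
                  half_life_days: float = HALF_LIFE_DAYS) -> float:
    """Scale a similarity score down exponentially with the incident's age."""
    age_days = max((now - occurred_at).total_seconds() / 86400.0, 0.0)
    return similarity * 0.5 ** (age_days / half_life_days)


def rerank(matches: List[Tuple[str, float, datetime]],
           now: datetime) -> List[Tuple[str, float]]:
    """Re-sort (id, similarity, timestamp) matches by age-adjusted score."""
    scored = [(incident_id, decayed_score(sim, ts, now))
              for incident_id, sim, ts in matches]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented sample matches: a near-perfect hit from three years ago
# against a weaker hit from last month.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
matches = [
    ("old-incident", 0.95, now - timedelta(days=3 * 365)),
    ("recent-incident", 0.80, now - timedelta(days=30)),
]
ranked = rerank(matches, now)
for incident_id, score in ranked:
    print(incident_id, round(score, 3))
```

With these numbers the month-old incident outranks the three-year-old one despite its lower raw similarity. A shorter half-life suits fast-changing infrastructure; a longer one suits stable systems where old fixes stay valid.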

Where This Goes

The memory layer gets more useful as the incident corpus grows. Eight incidents in, it's catching patterns across root causes. Five hundred incidents in, it starts to feel like having a very experienced colleague who has personally debugged every failure your systems have ever produced.

The routing layer gets cheaper as model pricing drops and fast models get more capable. The architecture stays the same — you just update the tier assignments in routing.py.

If you're building anything that involves repeated AI calls over structured workflows, the cascadeflow docs are worth reading for the routing primitives alone. And if you're working on anything that needs memory across sessions, Hindsight is the most direct path I've found to persistent semantic recall without building retrieval infrastructure from scratch.

The core insight remains simple: not every problem needs your most expensive model, and your agents shouldn't have to rediscover the same answers every time they run.

Top comments (1)

Paulina Cortes • May 22

is this open-source? can you share the repo? thanks a lot for your contributions