DEV Community: Karthik S

I built an agent to audit incidents, but it kept forgetting Tuesday by Wednesday.

Karthik S — Fri, 22 May 2026 17:26:24 +0000

When you build an enterprise AI agent, the honeymoon phase lasts exactly until you push to production.

I built "SentinelOps", an AI agent designed to audit compliance risks and operational incidents for our engineering teams. Initially, it was a massive success. It could ingest a JSON dump of server logs and perfectly diagnose a memory leak, providing step-by-step remediation plans.

But a week later, the exact same memory leak occurred. I asked SentinelOps what to do. It hallucinated three entirely different solutions, completely forgetting the successful remediation plan we executed just days prior.

My brilliant AI had severe amnesia. Here is how I re-architected our agent using Hindsight to give it persistent, semantic memory.

The Problem with Stateless Agents

Large Language Models are inherently stateless. Every time you open a new context window, the model starts with no recollection of anything it did in earlier sessions.

Many developers try to solve this by dumping raw conversation logs back into the context window. This is a terrible idea. Not only do you hit token limits incredibly fast, but you also destroy the agent's ability to focus. Feeding an agent 100 pages of raw chat history to help it solve a targeted compliance issue just confuses it.

What I needed was true agent memory—a way for the agent to semantically search its past experiences and only retrieve the exact memories relevant to the current crisis.

Implementing Hindsight

I turned to the Hindsight docs to build a persistent memory layer. Instead of storing raw chat logs, I modified my backend to only store the outcomes and decisions.

Every time SentinelOps successfully diagnosed an issue, it was forced to generate a structured JSON summary. We then piped that summary directly into Hindsight:

// backend/services/memoryService.js
import { HindsightClient } from '@vectorize-io/hindsight-client';

const hindsight = new HindsightClient({ url: process.env.HINDSIGHT_URL });

// Store a compact summary of a resolved incident, not the raw conversation.
export async function rememberDecision(interactionId, query, decision) {
  try {
    // Build the document line by line so template indentation does not leak into it.
    const memoryDocument = [
      `Incident Query: ${query}`,
      `Risk Level: ${decision.riskLevel}`,
      `Remediation: ${decision.recommendedAction}`,
      `Governance: ${decision.governanceSeverity}`,
    ].join('\n');

    await hindsight.store({
      id: interactionId,
      content: memoryDocument,
      metadata: {
        domain: decision.domain,
        timestamp: new Date().toISOString()
      }
    });
  } catch (err) {
    // A failed write is logged and swallowed so it never blocks the incident response.
    console.error("Failed to commit to memory:", err);
  }
}
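
Because the memory document is built from specific fields, it can help to check that a summary actually contains them before it is stored. The following is a minimal sketch, not part of the original service: the function name is hypothetical, the field list simply mirrors what rememberDecision reads, and it assumes every field is a string.

```javascript
// Hypothetical guard; field names mirror those read by rememberDecision.
const REQUIRED_FIELDS = ['riskLevel', 'recommendedAction', 'governanceSeverity', 'domain'];

// Returns the names of fields that are absent or blank (assumes string fields).
function missingDecisionFields(decision) {
  if (decision === null || typeof decision !== 'object') {
    return [...REQUIRED_FIELDS];
  }
  return REQUIRED_FIELDS.filter((field) => {
    const value = decision[field];
    return typeof value !== 'string' || value.trim() === '';
  });
}
```

A caller could skip the write, or ask the agent to regenerate its summary, whenever the returned list is non-empty.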

Now, when a new incident comes in, the first thing the agent does is query its own history:

export async function recallContext(query) {
  const matches = await hindsight.search({ query, topK: 3 });

  if (matches.length > 0) {
    return matches.map(m => m.content).join('\n---\n');
  }
  return null;
}

The CascadeFlow Optimization

Once the agent had memory, the context windows started getting a bit larger, which increased API costs. To mitigate this, I implemented CascadeFlow.

Using the cascadeflow docs, I built a routing engine that checks the complexity of the query before hitting the expensive models. If the query is just a simple policy lookup, it routes to a cheap 8B parameter model. If it's a critical infrastructure failure, it pulls the Hindsight memory and routes to the massive 70B reasoning model.

The Result

The transformation was immediate.

When a recurring Kubernetes crash happened the following week, SentinelOps didn't guess. It responded with: "Based on a similar incident 4 days ago, this is likely a misconfigured readiness probe in the payment-gateway service. Applying previous remediation plan..."

What I Learned

1. Don't store raw logs. Storing raw conversation history is garbage-in, garbage-out. Force your agent to summarize its decisions before committing them to memory.
2. Context is king. In our experience, giving the LLM 3 highly relevant historical examples produced noticeably better answers than prompting it with no history at all.
3. Memory requires routing. If you are doing RAG or memory injection, you need an intelligent router like CascadeFlow to manage the compute costs of those larger context windows.

Our AI Inference Bill Dropped 65% After We Stopped Treating Every Query the Same

Karthik S — Thu, 21 May 2026 19:03:56 +0000

Every query hitting our AI layer was going straight to the most powerful model we had. A user asking "what does HIPAA Section 164.312 say?" got the same compute budget as one asking "should we shut down the payment processor during this active incident?" That was expensive and stupid, and it took embarrassingly long to fix.

This is the story of how we built a routing layer called CascadeFlow into SentinelOps AI, an enterprise decision intelligence platform, and what actually happened when we turned it on.

The Problem With "One Model Fits All"

When you're building an AI system for enterprise operations teams—people making real decisions about infrastructure, compliance posture, and incident response—you face a genuine tension. You need the model to be good when it matters. But "good" on a documentation lookup is a different thing from "good" on "we have a potential SOC2 violation, walk me through the remediation path."

Before routing, every query went to our primary reasoning model (Llama 3.3 70B via Groq). The latency was fine. The quality was fine. The cost was not fine. At scale, routing simple factual queries through a 70B parameter model is just burning money.

The naive fix is to have engineers triage queries manually, which doesn't scale. The correct fix is a classifier that does it automatically.

CascadeFlow: A Lightweight Routing Engine

We integrated @cascadeflow/core as our routing middleware. The idea is straightforward: before a query hits the expensive model, a cheap, fast classifier decides which tier it belongs to.

Our routing logic looks roughly like this:

import { CascadeFlow } from '@cascadeflow/core';

const cascade = new CascadeFlow({
  classifier: {
    model: 'llama-3.1-8b-instant', // fast, cheap
    provider: 'groq',
  },
  tiers: [
    {
      name: 'simple',
      model: 'llama-3.1-8b-instant',
      triggers: ['documentation', 'lookup', 'definition', 'what is'],
    },
    {
      name: 'complex',
      model: 'llama-3.3-70b-versatile',
      triggers: ['incident', 'compliance', 'risk', 'critical', 'breach'],
    },
  ],
});

The classifier runs first—it's an 8B model, so it's fast and cheap—and classifies the incoming query into a complexity tier. Simple queries (policy lookups, definition requests, status checks) stay on the 8B model. Complex queries (active incidents, compliance risk assessments, multi-system decisions) escalate to the 70B.

From our LLM service layer, the routing call is transparent:

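// Assumes the official groq-sdk client; buildMessages is the service's own
// helper that assembles the chat messages from the query and context.
import Groq from 'groq-sdk';

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

async function routeAndExecute(query, context) {
  // Ask the classifier for a tier, then map the tier to a concrete model.
  const tier = await cascade.classify(query);
  const model = tier === 'complex'
    ? 'llama-3.3-70b-versatile'
    : 'llama-3.1-8b-instant';

  return groq.chat.completions.create({
    model,
    messages: buildMessages(query, context),
    response_format: { type: 'json_object' },
  });
}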

That response_format: json_object constraint is important—we'll come back to it.
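
The buildMessages helper is not shown here. A minimal sketch of what such a helper could look like follows, assuming context is a string of recalled incident summaries (or null when nothing was recalled). The system prompt wording is invented for illustration. JSON mode on OpenAI-compatible chat APIs typically expects the prompt itself to ask for JSON, so the sketch does that explicitly.

```javascript
// Hypothetical helper; the real buildMessages in SentinelOps is not shown.
function buildMessages(query, context) {
  const system = [
    'You are an incident and compliance assistant.',
    'Respond with a single JSON object and no surrounding text.',
  ].join(' ');

  const messages = [{ role: 'system', content: system }];

  // Only include past incidents when the recall step returned something.
  if (typeof context === 'string' && context.trim() !== '') {
    messages.push({
      role: 'system',
      content: `Relevant past incidents:\n${context}`,
    });
  }

  messages.push({ role: 'user', content: query });
  return messages;
}
```

Keeping the recalled context in its own message makes it easy to drop for simple-tier queries, where the extra tokens buy little.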

What Routing Actually Costs You

There's a hidden cost to routing that nobody talks about: the classifier itself can be wrong.

In our early testing, the 8B classifier was misrouting about 12% of complex queries down to the cheap tier. A question like "is our current encryption at rest sufficient for PHI storage?" looks superficially like a documentation query. The classifier saw "encryption" and "PHI" as lookup-adjacent terms and routed it to the cheap model, which gave a technically accurate but shallow answer that lacked the risk-weighted framing an auditor would need.

We fixed this in two ways:

1. Conservative misclassification bias. When the classifier's confidence is below a threshold, escalate to the expensive tier. False positives (routing simple queries high) cost money. False negatives (routing complex queries low) cost credibility. In an enterprise governance context, credibility is more expensive.
2. Domain keyword pre-checks. Before the classifier even runs, we scan for a hardcoded list of high-stakes terms. If a query contains words like breach, PHI, incident, remediation, or SOC2, it goes to the 70B model unconditionally.

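// Stored in lowercase because the query is lowercased before matching;
// uppercase entries such as 'PHI' would never match.
const HIGH_STAKES_KEYWORDS = [
  'breach', 'incident', 'phi', 'pii', 'soc2', 'hipaa',
  'remediation', 'critical', 'violation', 'audit', 'penalty'
];

function requiresComplexModel(query) {
  const lower = query.toLowerCase();
  // Substring match: any high-stakes term anywhere in the query forces escalation.
  return HIGH_STAKES_KEYWORDS.some(kw => lower.includes(kw));
}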

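This is not elegant, but it's safe. The performance overhead is one .includes() scan per keyword on each query, which is negligible next to a model call. Because the match is a plain substring test, a term like incident also fires on words such as "coincidental", which errs on the side of escalation.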
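
The two fixes can be combined into a single decision function. This is a rough sketch under stated assumptions: the function name, the { tier, confidence } shape of the classifier result, and the 0.8 threshold are all hypothetical, and the keyword list is a shortened stand-in for the full one.

```javascript
// Hypothetical names and values; tune the threshold against your own misrouting data.
const HIGH_STAKES_TERMS = ['breach', 'incident', 'phi', 'soc2', 'remediation'];
const CONFIDENCE_THRESHOLD = 0.8; // assumed value, not taken from the article

// classification is assumed to look like { tier: 'simple' | 'complex', confidence: 0..1 }.
function chooseTier(query, classification) {
  const lower = query.toLowerCase();

  // 1. Keyword gate: high-stakes terms bypass the classifier result entirely.
  if (HIGH_STAKES_TERMS.some((term) => lower.includes(term))) {
    return 'complex';
  }

  // 2. Conservative bias: a low-confidence classification escalates.
  if (classification.confidence < CONFIDENCE_THRESHOLD) {
    return 'complex';
  }

  return classification.tier;
}
```

In a real pipeline the keyword gate would run before the classifier is called at all, so a high-stakes query never pays for classification. The sketch takes the classification as an argument only to keep the logic easy to test.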

The Numbers

After deploying CascadeFlow routing against a realistic mix of enterprise queries, roughly 68% of queries fell into the "simple" tier. The remaining 32% were genuinely complex—incident-related, compliance-heavy, or multi-system risk assessments that benefited from the more capable model.

That routing split—combined with the price difference between an 8B and 70B parameter model—accounts for most of the cost reduction. The exact figure depends on your query distribution and your provider's pricing, but 60-65% is a reasonable estimate for an enterprise operational workload where most interactions are informational rather than analytical.
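
To see how the split translates into a savings figure, here is a back-of-the-envelope sketch. Apart from the 68% simple share, every input is an illustrative assumption rather than a measured price: the cheap model is assumed to cost 10% of the expensive one per query, the classifier call is assumed to add 1% of an expensive call, and both tiers are assumed to process similar token counts.

```javascript
// Relative cost model: sending every query to the expensive model costs 1 per query.
function estimateSavings({ simpleShare, cheapToExpensiveRatio, classifierOverhead = 0 }) {
  const complexShare = 1 - simpleShare;
  const routedCost =
    complexShare * 1 +                     // complex queries still pay full price
    simpleShare * cheapToExpensiveRatio +  // simple queries pay the cheap rate
    classifierOverhead;                    // every query pays for classification
  return 1 - routedCost;
}

const savings = estimateSavings({
  simpleShare: 0.68,           // routing split reported above
  cheapToExpensiveRatio: 0.1,  // assumption for illustration
  classifierOverhead: 0.01,    // assumption for illustration
});

console.log(`${(savings * 100).toFixed(1)}%`); // prints 60.2%
```

Under these assumptions the estimate lands at the low end of the 60-65% range. Plugging in your own provider's price ratio and query mix will move it in either direction.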

Forcing Structure Out of Both Models

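One consequence of routing to two different models is that you now have two sources of unstructured text to deal with. We solved this by enforcing a strict JSON response schema at the prompt level, regardless of which model is running. The response_format: json_object setting in the routing call asks the API for syntactically valid JSON; the prompt is what pins down the fields.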

Every response from SentinelOps AI conforms to this shape:

{
  "summary": "One-sentence decision summary",
  "risk_level": "LOW | MEDIUM | HIGH | CRITICAL",
  "confidence": 0.87,
  "recommendation": "Specific, actionable recommendation",
  "tradeoffs": ["Tradeoff A", "Tradeoff B"],
  "governance_flags": [],
  "citations": []
}

The frontend renders this as a Decision Card—not a chat bubble. Risk level gets a color-coded badge. Confidence is displayed as a progress bar. Tradeoffs are rendered as a checklist. Governance flags trigger a separate UI element that routes to the compliance dashboard.
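
A prompt-level schema is a request, not a guarantee, so a parse-and-validate step before rendering is a reasonable safety net. The sketch below is one way to write it and is not taken from the SentinelOps codebase. The function name is hypothetical, the 0 to 1 confidence range is inferred from the 0.87 example, and array element types are left unchecked.

```javascript
const RISK_LEVELS = ['LOW', 'MEDIUM', 'HIGH', 'CRITICAL'];

// Returns a list of problems; an empty list means the payload matches the shape above.
function validateDecisionCard(raw) {
  let card;
  try {
    card = JSON.parse(raw);
  } catch {
    return ['not valid JSON'];
  }
  if (card === null || typeof card !== 'object' || Array.isArray(card)) {
    return ['not a JSON object'];
  }

  const problems = [];
  for (const key of ['summary', 'recommendation']) {
    if (typeof card[key] !== 'string' || card[key].trim() === '') {
      problems.push(`${key} must be a non-empty string`);
    }
  }
  if (!RISK_LEVELS.includes(card.risk_level)) {
    problems.push('risk_level must be LOW, MEDIUM, HIGH or CRITICAL');
  }
  if (typeof card.confidence !== 'number' || card.confidence < 0 || card.confidence > 1) {
    problems.push('confidence must be a number between 0 and 1');
  }
  for (const key of ['tradeoffs', 'governance_flags', 'citations']) {
    if (!Array.isArray(card[key])) {
      problems.push(`${key} must be an array`);
    }
  }
  return problems;
}
```

A response that fails validation on the cheap tier is also a natural candidate for a retry on the expensive one.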

When you force both the cheap and expensive model into the same output schema, the quality difference between tiers becomes measurable. You can compare confidence scores, count governance_flags, and track whether the 8B model's recommendations match the 70B model's on borderline queries. This becomes a feedback loop for improving your routing thresholds over time.
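
One way to turn that comparison into a number is to replay a sample of borderline queries through both tiers and measure how often they assign the same risk_level. This is a minimal sketch with assumptions worth naming: the function name and the { cheap, expensive } pair shape are hypothetical, and agreement on risk_level is only a rough proxy for answer quality.

```javascript
// Each pair holds the parsed responses of both tiers for the same query (assumed shape).
function tierAgreement(pairs) {
  if (pairs.length === 0) return null; // nothing to compare
  const agreeing = pairs.filter(
    ({ cheap, expensive }) => cheap.risk_level === expensive.risk_level
  ).length;
  return agreeing / pairs.length;
}
```

A falling agreement rate on queries the router sent to the cheap tier suggests the escalation threshold is too permissive.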

Lessons

1. Start with keyword gating, not just ML classification. A simple list of high-stakes terms as a pre-filter saved us from the worst misrouting failures. ML classifiers are probabilistic. Safety-critical routing decisions shouldn't be.

2. Misrouting costs are asymmetric. Routing a simple query to a powerful model costs you money. Routing a complex query to a weak model costs you trust. Size your misclassification bias accordingly.

3. A common output schema across tiers is essential. Without it, you're comparing apples and oranges and your frontend needs to handle two different response shapes. Force the schema at the prompt level.

4. Routing is a product decision, not just an infrastructure one. The thresholds you set for escalation reflect your platform's risk tolerance. In a governance context, we erred conservative. A developer tool might err aggressive. Know which direction your users would rather you fail.

You can read more about how CascadeFlow handles multi-tier routing in the cascadeflow docs. The cost savings are real, but the more important outcome is that complex queries now get the compute they actually need instead of competing on the same tier as "what does this acronym stand for."

[Boost]

Karthik S — Thu, 21 May 2026 18:43:31 +0000