Sachin S

Why My Agent Repeated the Same Mistakes Across Sessions

I watched my incident-response agent suggest "restart the database" for the third Razorpay outage in a row. The database was fine every time. The agent just didn't remember that.

That's the thing nobody warns you about when you build LLM-powered tools: they're stateless by default. Every session is a blank slate. Your agent can be brilliant in the moment and completely amnesic five minutes later. I spent weeks building an incident copilot called Kairo, and the hardest problem wasn't prompt engineering or retrieval—it was getting the thing to stop repeating itself.

What Kairo Does

Kairo is an incident copilot for engineering teams. When a production alert fires—say, your payment gateway starts throwing 504s at 3 AM—Kairo recalls what happened the last time similar symptoms showed up, tells you what the root cause was, what steps actually fixed it, and critically, what checks your team wasted 30 minutes on before finding the real problem.

The stack is straightforward: a Next.js monolith with TypeScript API routes, no separate backend service. Five endpoints do the work:

  • /api/seed — bulk-load historical incidents into memory
  • /api/alert — ingest a live or simulated alert
  • /api/recall — query past incidents by symptoms
  • /api/chat — conversational triage grounded in memory
  • /api/retain — store a resolved incident for future recall

The idea was simple. The execution was not.

The Stateless Problem

My first version of Kairo used vanilla RAG. I embedded our runbooks and internal docs, chunked them, stuffed them into a vector store, and retrieved relevant paragraphs when an alert came in. It worked—technically. The retrieval was accurate. The LLM synthesized reasonable-sounding responses.

But it kept making the same bad suggestions.

Here's the pattern I kept seeing: Razorpay UPI goes down. Status page says green. Our internal health checks pass. The correct response is to failover payment traffic to Cashfree, our backup provider. But my agent would suggest checking the connection pool, inspecting Redis, and reviewing recent deploys—every single time. Those checks had never once been the answer for a vendor-side outage. They were in the runbook, so the agent dutifully retrieved and recommended them.

The problem wasn't retrieval quality. The problem was that runbooks don't encode experience. They don't know that "check the connection pool" wasted 20 minutes last time this exact scenario played out. They don't remember outcomes. They're procedural docs, not operational memory.

I needed something that could retain episodes—what happened, what we tried, what worked, what didn't—and recall them in context the next time similar symptoms appeared.

Plugging Into Hindsight

I'd been going back and forth on how to implement memory when I came across Hindsight, an agent-memory service, and decided to give it a shot. The pitch was simple: retain and recall structured memories with vector-based semantic search, without having to build and maintain my own embedding pipeline.

Honestly, getting Hindsight cloud connected to Kairo was one of those things that sounds trivial and then eats half your afternoon. Here's what I ran into.

First, the config. You need three environment variables—HINDSIGHT_API_KEY, HINDSIGHT_ORG_ID, and HINDSIGHT_BANK_ID—to talk to the Hindsight cloud API. I wrote a helper to gate all memory calls behind a config check:

function hasHindsightConfig(): boolean {
  return !!(
    process.env.HINDSIGHT_API_KEY &&
    process.env.HINDSIGHT_ORG_ID &&
    process.env.HINDSIGHT_BANK_ID
  )
}

This seems obvious in retrospect, but I initially hard-coded the keys during dev and broke the local fallback path. The dual-mode approach—real Hindsight when keys are present, local text-overlap scoring when they're not—was something I landed on after getting tired of deploying just to test memory. It meant I could iterate on the recall logic locally with zero API calls, then flip to the real thing on deploy.
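The fallback itself is just token overlap. Here's a simplified sketch of what such a scorer can look like; the names (localRecall, MemoryRecord) and the Jaccard-style normalization are illustrative, not the exact Kairo code:

```typescript
// Illustrative local fallback: rank stored incidents by token overlap with
// the query, no embeddings or API calls required.
interface MemoryRecord {
  embedding_text: string
  metadata: Record<string, unknown>
}

function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean))
}

function localRecall(query: string, store: MemoryRecord[], topK = 3): MemoryRecord[] {
  const queryTokens = tokenize(query)
  return store
    .map((record) => {
      const tokens = tokenize(record.embedding_text)
      let overlap = 0
      for (const t of queryTokens) if (tokens.has(t)) overlap++
      // Jaccard-style normalization so long records don't dominate.
      const score = overlap / (queryTokens.size + tokens.size - overlap || 1)
      return { record, score }
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((x) => x.record)
}
```

It's crude, but crude is fine for local iteration: the point is that it has the same call shape as the real recall, so the rest of the pipeline can't tell the difference.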

The second challenge was structuring the data for retain. Hindsight's retain call takes an embedding string, context, metadata, and tags. I had to figure out what metadata schema would make recall actually useful downstream. My first attempt was lazy—I just shoved the incident title and a blob of text in. The recall results came back, but they were too noisy to build structured briefs from.

What worked was being deliberate about the metadata fields:

export async function retainIncident(incident: IncidentMemory) {
  return hindsight.retain(getBankId(), incident.embedding_text, {
    context: incident.title,
    timestamp: incident.timestamp_start,
    metadata: {
      incident_id: incident.incident_id,
      vendor: incident.vendor,
      region: incident.region,
      classification: incident.classification,
      actual_root_cause: incident.actual_root_cause,
      successful_fix: incident.successful_fix,
      failed_checks: incident.failed_checks,
      time_to_resolution_minutes: incident.time_to_resolution_minutes,
    },
    tags: ["kairo", incident.vendor || "internal", incident.classification],
  })
}

Every incident gets stored with its classification (vendor-side, internal, or mixed), the successful_fix, and—this is the important part—failed_checks. That list of dead ends is what turns recall from "here's something similar" into "here's what not to do."

The third thing that tripped me up: the embedding text matters more than you'd think. I hand-crafted it to capture operational semantics rather than prose. Instead of "There was a timeout on the Razorpay UPI integration," the embedding text reads more like "Razorpay UPI timeout, status page green, internal systems healthy, payment capture failing." Dense with symptoms, light on filler. It made a real difference in recall precision.
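If you want to mechanize that instead of hand-writing it each time, a small helper can assemble the symptom-dense form from structured fields. This is an illustrative sketch; AlertFacts and buildEmbeddingText are not Kairo's actual names:

```typescript
// Illustrative helper for the symptom-dense embedding text style described
// above: comma-joined symptom fragments, no narrative filler.
interface AlertFacts {
  vendor?: string
  surface: string // e.g. "UPI", "card capture"
  symptoms: string[] // e.g. ["timeout", "status page green"]
}

function buildEmbeddingText(facts: AlertFacts): string {
  return [
    [facts.vendor, facts.surface].filter(Boolean).join(" "),
    ...facts.symptoms,
  ].join(", ")
}
```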

Where Self-Correction Actually Happens

The part I'm most pleased with is how Kairo classifies incidents using memory instead of guessing.

When an alert comes in, the system builds a query from the vendor, region, and symptoms, then asks Hindsight for similar past incidents. Here's the key bit:

const query = `${alert.vendor ?? "internal"} ${alert.region} ${alert.symptoms.join(" ")}`
const recalled = await recallIncidents(query)

const topClassification = recalled.matches[0]?.metadata?.classification ?? "unknown"

const newIncident = {
  incident_id: `inc_sim_${Date.now()}`,
  title: alert.title,
  vendor: alert.vendor,
  symptoms: alert.symptoms,
  classification: topClassification,
}

The current incident's fault boundary—vendor-side, internal, or mixed—comes from the top recalled episode's classification. Not from the LLM. Not from a heuristic. From what actually happened last time.

This is self-correction across sessions. If three past Razorpay 504 incidents were all resolved by vendor failover, the system classifies the current one as vendor_side before any LLM is involved. It skips the "is this us or them?" debate entirely. And it's auditable: you can trace exactly which past episodes informed the classification.

But here's where it gets interesting. Say that next month, a Razorpay 504 turns out to be caused by our own misconfigured proxy. That incident gets retained with classification: "internal". Now memory contains both vendor-side and internal Razorpay 504 episodes. The next time it happens, the classification will reflect that ambiguity. The system adapts because the memory adapts.
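One way to let the classification reflect that ambiguity is a majority vote over the top recalled matches instead of trusting only the first, falling back to "mixed" on a tie. This is a sketch of the idea, not what Kairo currently ships:

```typescript
// Illustrative: classify from the top-k recalled episodes by majority vote.
// A tie between vendor_side and internal is genuine ambiguity -> "mixed";
// no recalled episodes at all -> "unknown".
type Classification = "vendor_side" | "internal" | "mixed" | "unknown"

interface RecalledMatch {
  metadata: { classification?: Classification }
}

function classifyFromMemory(matches: RecalledMatch[], topK = 3): Classification {
  const counts = new Map<Classification, number>()
  for (const m of matches.slice(0, topK)) {
    const c = m.metadata.classification
    if (c) counts.set(c, (counts.get(c) ?? 0) + 1)
  }
  if (counts.size === 0) return "unknown"
  const ranked = [...counts.entries()].sort((a, b) => b[1] - a[1])
  if (ranked.length > 1 && ranked[0][1] === ranked[1][1]) return "mixed"
  return ranked[0][0]
}
```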

Grounding the LLM—Or Mostly Not Using It

I learned early on that letting the LLM generate resolution steps was a mistake. It would invent plausible-sounding but wrong advice. "Clear the application cache and restart" for a vendor outage. "Scale up the pod replicas" when the problem was a third-party API key rotation.

So I built a brief generator that doesn't use the LLM at all for the initial response:

export function buildKairoBrief(message: string, matches: MemoryMatch[]) {
  if (!matches.length) {
    return [
      "INSUFFICIENT_MEMORY: no Hindsight recall hits.",
      "Do not infer root cause. Collect: vendor status, p95 latency, deploy timeline.",
    ].join("\n")
  }

  const top = getMeta(matches[0])
  const allFailed = Array.from(
    new Set(matches.slice(0, 3).flatMap((m) => getMeta(m).failed_checks || []))
  )
  const topSteps = fixToSteps(top.successful_fix)

  return [
    `BOUNDARY: ${boundaryLine(top.classification)}`,
    `ROOT_CAUSE_MEMORY: ${top.actual_root_cause}`,
    "",
    "RESOLUTION_STEPS (verbatim from memory):",
    ...topSteps,
    "",
    "SKIP (what wasted time before):",
    ...allFailed.slice(0, 5),
  ].join("\n")
}
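The getMeta, fixToSteps, and boundaryLine helpers aren't shown above. Plausible minimal versions look something like this; the real implementations may differ in shape and edge-case handling:

```typescript
// Illustrative versions of the helpers the brief generator calls.
interface MemoryMatch {
  metadata?: Record<string, unknown>
}

// Metadata may be absent on a match; default to an empty object.
function getMeta(match: MemoryMatch): Record<string, any> {
  return match.metadata ?? {}
}

// Split a stored fix like "Fail over to Cashfree; monitor; restore traffic"
// into numbered steps.
function fixToSteps(fix?: string): string[] {
  if (!fix) return ["(no recorded fix)"]
  return fix
    .split(/;|\n/)
    .map((s) => s.trim())
    .filter(Boolean)
    .map((s, i) => `${i + 1}. ${s}`)
}

// Map the stored classification onto the human-readable boundary line.
function boundaryLine(classification?: string): string {
  switch (classification) {
    case "vendor_side": return "third-party vendor side"
    case "internal":    return "our infrastructure"
    case "mixed":       return "mixed vendor/internal"
    default:            return "unknown, verify manually"
  }
}
```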

When memory exists, the response is entirely deterministic. No generation, no temperature, no hallucination risk. The LLM only enters the picture in follow-up conversation turns, and even then it's constrained by the memory-backed brief as context. If memory is empty, the system explicitly says so and asks the engineer to collect telemetry. It never guesses.

This is the design choice I'd push hardest on anyone building a similar system: ground first, generate second. The LLM is a conversation layer, not the decision layer.

What It Looks Like in Practice

A Razorpay UPI alert fires. Kairo recalls three past incidents—all vendor-side, all resolved by traffic failover. The response:

BOUNDARY: third-party vendor side
ROOT_CAUSE_MEMORY: UPI acquiring-bank route saturation

RESOLUTION_STEPS:
1. Fail over UPI and card payments to Cashfree route
2. Monitor UPI payload on Razorpay dashboard
3. Restore traffic after Razorpay confirms recovery

SKIP (wasted time in similar incidents):
- DB connection pool inspection
- Redis failover
- Recent deploy investigation

Compare that to what the agent used to say before I added agent memory: "Consider checking your database connection pool and reviewing recent deployments. If the issue persists, contact Razorpay support." Technically not wrong. Operationally useless. And it said some version of that every time.

The SKIP section is my favorite part. Surfacing what didn't work is arguably more valuable than surfacing what did. It saves the on-call engineer from repeating the same wild goose chase that wasted 20 minutes three months ago.

What I'd Do Differently

If I were starting over, a few things:

I'd build the synonym expansion earlier. I added a query augmentation layer that expands "504" to timeout, latency, slow, delay and "webhook" to callback, capture before sending the query to Hindsight. It dramatically improved recall quality for incidents described with different vocabulary. I added it late and wished I'd done it from the start.
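The synonym pairs above are the real ones; the function shape here is an illustrative sketch:

```typescript
// Illustrative query augmentation: append synonyms for known tokens before
// the query goes to recall, then deduplicate.
const SYNONYMS: Record<string, string[]> = {
  "504": ["timeout", "latency", "slow", "delay"],
  webhook: ["callback", "capture"],
}

function expandQuery(query: string): string {
  const extra: string[] = []
  for (const token of query.toLowerCase().split(/\s+/)) {
    if (SYNONYMS[token]) extra.push(...SYNONYMS[token])
  }
  // Original query first, then the expansions, with duplicates removed.
  return [...new Set([query, ...extra].join(" ").split(/\s+/))].join(" ")
}
```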

I'd build a feedback UI sooner. Right now, closing the loop—marking an incident resolved, tagging what worked—is done through the API. A simple "Was this helpful? What was the actual fix?" button after each triage would make memory accumulation feel natural instead of manual.

I'd worry less about the LLM and more about the metadata. I spent days tuning prompts before I realized the prompts didn't matter much. What mattered was the structure of the memories going in. Get the metadata schema right—classification, failed_checks, successful_fix—and the recall outputs basically write themselves.

Lessons for Other Agent Builders

1. Stateless agents repeat mistakes. This sounds obvious but it's easy to miss when your agent gives good answers in a single session. The failure mode is across sessions, across weeks, across team members. If your agent can't remember what failed last time, it will suggest it again.

2. Episodic memory > document retrieval. Runbooks encode process. Episodes encode experience. The difference matters when the process says "check the logs" but experience says "checking the logs wasted 30 minutes last time—skip it."

3. Structure what you store. Unstructured text blobs are fine for search. They're bad for decision-making. Explicit fields for what worked, what failed, and how the incident was classified let you build deterministic responses instead of relying on the LLM to extract structure at query time.

4. Grounding beats generation. Build the response from memory first. Use the LLM to elaborate in follow-up turns, not to invent the initial answer. Your on-call engineer at 3 AM needs the fix that worked last time, not a creative interpretation of your documentation.

5. Build the fallback path first. My local text-overlap fallback let me iterate without API keys, demo without cloud dependencies, and ship faster. When I plugged in real Hindsight recall, it was a one-line swap. Fallbacks aren't tech debt—they're velocity.


Kairo is built with Hindsight for agent memory.
The code is at github.com/MORPHEUS-536/Kairo.
