I Built a Self-Correcting Incident Agent With Hindsight—Here's How
Every Friday at 3 AM, a payment processor goes dark. Status page: green. Your systems: healthy. But revenue is bleeding. The on-call team spends 20 minutes checking connection pools, Redis failover, recent deploys—checks that have never once caught a vendor problem. Meanwhile, the real fix (failover to a backup payment provider) sits buried in a Slack thread from three months ago. If only it were recallable on demand.
That frustration led me to build Kairo: an incident copilot that doesn't just chat about outages—it remembers them. And it learns which roads lead nowhere.
The Problem: Knowledge Buried in Episodes
When third-party platforms like Razorpay, MSG91, AWS S3, or Auth0 degrade, the cost isn't the outage itself. Industry data commonly cites around $5,600 per minute of critical-path downtime. But much of that waste is misdirected engineering time: hunting Slack threads, consulting stale wikis, running checks that failed in similar incidents before.
What troubled me most wasn't the lack of a status dashboard or uptime monitor—there are plenty of those. What was missing was operationally indexed memory. Not raw logs or generic runbooks, but episodic records of past incidents that captured:
- What the fault boundary actually was (vendor vs. internal vs. both)
- What worked last time this pattern appeared
- What checks wasted 30 minutes and should be skipped
- How long resolution actually took when memory turned out to be right
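Concretely, an episode record capturing those four things might look like this (a sketch: the field names mirror the metadata Kairo retains, but the exact type here is mine):

```ts
// Sketch of an incident episode record. Field names follow Kairo's
// retained metadata (classification, failed_checks, successful_fix, ...).
export interface IncidentEpisode {
  incident_id: string
  title: string
  vendor: string | null // null for purely internal incidents
  region: string
  classification: "vendor_side" | "internal" | "mixed" // the fault boundary
  actual_root_cause: string
  successful_fix: string // what worked last time
  failed_checks: string[] // dead ends to skip next time
  time_to_resolution_minutes: number // how long it really took
}
```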
I decided to try Hindsight agent memory for that recall layer. The core question: can an LLM-powered agent become self-correcting by grounding decisions in operational episodes instead of hallucinated best practices?
Architecture: Memory In, Boundary Out
Kairo is a Next.js monolith—no separate FastAPI service, all API routes in TypeScript. The system has five core routes:
| Route | Purpose |
|---|---|
| `POST /api/seed` | Ingest historical incidents into Hindsight |
| `POST /api/alert` | Simulate or ingest a live alert |
| `POST /api/recall` | Query memory for similar past incidents |
| `POST /api/chat` | Conversational triage with memory context |
| `POST /api/retain` | Store a resolved incident into memory |
What makes this different from a typical RAG pipeline is the classification layer. When an incident fires, Kairo doesn't just retrieve "similar incidents." It classifies the fault boundary using the most common classification from recalled episodes. This is self-correction in action: if three past incidents with the same symptoms were vendor-side, and all were resolved by failover, the system classifies the current incident as vendor-side before asking the LLM anything.
The Core Story: Episodic Memory With Hindsight
The magic is in how memory recall works. Here's the recall pipeline in lib/hindsight.ts:
```ts
export async function recallIncidents(query: string) {
  const hindsightQuery = augmentQueryForHindsight(query)
  if (!hasHindsightConfig()) {
    // Local fallback when Hindsight keys are absent:
    // score the seeded incident list by query-term overlap.
    const anchor = detectVendorAnchor(query)
    const queryTerms = hindsightQuery.toLowerCase().split(/\s+/).filter(Boolean)
    const matches = incidents
      .filter((incident) => incidentPassesAnchor(incident, anchor))
      .map((incident) => {
        const scoreMatches = [
          incident.title,
          incident.vendor,
          incident.region,
          incident.symptoms.join(" "),
          incident.embedding_text,
        ].join(" ").toLowerCase()
        const score = queryTerms.reduce(
          (total, term) => total + (scoreMatches.includes(term) ? 1 : 0), 0
        )
        return { incident, score }
      })
      .sort((a, b) => b.score - a.score)
      .slice(0, 4)
      .map(({ incident }) => ({ /* structured match */ }))
    return { matches, fallback: true }
  }
  // With Hindsight: vector recall with metadata
  const results = await hindsight.recall(getBankId(), hindsightQuery, {
    budget: "low",
    maxTokens: 4000,
    tags: ["kairo"],
  })
  return { matches: (results.results ?? []) as MemoryMatch[] }
}
```
The key insight: when an alert like "Razorpay UPI 504s in Mumbai" fires, the query detector expands symptom terms using a synonym map. "504" expands to timeout, latency, slow, delay. "webhook" expands to callback, capture. This query enrichment makes recall more semantically aware than simple string matching.
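A minimal sketch of that expansion step (the synonym table and the `expandQuery` helper here are illustrative, not Kairo's actual map):

```ts
// Illustrative symptom-synonym map; the real table in Kairo is larger.
const SYNONYMS: Record<string, string[]> = {
  "504": ["timeout", "latency", "slow", "delay"],
  webhook: ["callback", "capture"],
}

// Expand a raw alert query with known symptom synonyms so vector
// recall can match episodes that described the same failure differently.
export function expandQuery(query: string): string {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean)
  const expanded = new Set(terms)
  for (const term of terms) {
    for (const synonym of SYNONYMS[term] ?? []) expanded.add(synonym)
  }
  return [...expanded].join(" ")
}
```

So `expandQuery("razorpay 504 webhook")` yields a query that also carries "timeout", "latency", "callback", and "capture", widening recall without changing intent.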
Structured Metadata Is the Secret
The metadata attached to each recalled incident is structured and queryable:
```ts
export async function retainIncident(incident: IncidentMemory) {
  return hindsight.retain(getBankId(), incident.embedding_text, {
    context: incident.title,
    timestamp: incident.timestamp_start,
    metadata: {
      incident_id: incident.incident_id,
      title: incident.title,
      vendor: incident.vendor,
      region: incident.region,
      classification: incident.classification, // vendor_side | internal | mixed
      actual_root_cause: incident.actual_root_cause,
      successful_fix: incident.successful_fix,
      failed_checks: incident.failed_checks,
      time_to_resolution_minutes: incident.time_to_resolution_minutes,
      customer_impact: incident.customer_impact,
    },
    tags: ["kairo", incident.vendor || "internal", incident.classification],
  })
}
```
This is the episodic part: every incident is a complete episode with context, decisions, and outcomes. The embedding text is hand-crafted to capture operational semantics—"Razorpay UPI timeout, status page green, internal systems healthy"—not generic descriptions.
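One way to produce that kind of operationally dense embedding text is a small formatter over the incident fields (a sketch: the field names follow the metadata above, but the template itself is my assumption):

```ts
interface IncidentSummary {
  vendor: string | null
  region: string
  symptoms: string[]
  classification: "vendor_side" | "internal" | "mixed"
  successful_fix: string
}

// Build embedding text that leads with what responders actually observe
// (vendor, region, symptoms), then the learned boundary and fix.
export function buildEmbeddingText(i: IncidentSummary): string {
  return [
    `${i.vendor ?? "internal"} ${i.region}`,
    i.symptoms.join(", "),
    `boundary: ${i.classification}`,
    `fix: ${i.successful_fix}`,
  ].join(". ")
}
```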
How Self-Correction Happens: Grounded Classification
When /api/alert receives an incident, here's what happens:
```ts
const query = `${alert.vendor ?? "internal"} ${alert.region} ${alert.symptoms.join(" ")}`
const recalled = await recallIncidents(query)

// Use the most common classification among the recalled incidents
const counts = new Map<string, number>()
for (const match of recalled.matches) {
  const c = match.metadata?.classification
  if (c) counts.set(c, (counts.get(c) ?? 0) + 1)
}
const topClassification =
  [...counts.entries()].sort((a, b) => b[1] - a[1])[0]?.[0] ?? "unknown"

const newIncident = {
  incident_id: `inc_sim_${Date.now()}`,
  title: alert.title,
  vendor: alert.vendor,
  region: alert.region,
  symptoms: alert.symptoms,
  classification: topClassification, // SELF-CORRECTED BOUNDARY
  /* ... */
}
```
If memory contains three "Razorpay UPI 504" incidents, all classified as vendor_side, the current incident inherits that classification without guessing. This is self-correction because:
- Grounded in evidence. Classification comes from actual past episodes, not LLM instinct.
- Eliminates the human debate. "Is this our fault or Razorpay's?" The answer is already in memory.
- Auditable. "This incident was classified vendor-side because three similar past incidents (2026-02-03, 2026-01-14, ...) were also vendor-side."
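That audit trail can be generated mechanically from the recalled metadata (a sketch: `classification` and `timestamp_start` match the retained metadata fields, but the helper and sentence template are mine):

```ts
interface EpisodeMeta {
  classification: string
  timestamp_start: string // ISO-8601
}

// Explain a grounded classification by citing the episodes that support it.
export function explainClassification(
  chosen: string,
  episodes: EpisodeMeta[],
): string {
  const supporting = episodes.filter((e) => e.classification === chosen)
  const dates = supporting.map((e) => e.timestamp_start.slice(0, 10)).join(", ")
  return `Classified ${chosen}: ${supporting.length} similar past incident(s) (${dates}) shared this boundary.`
}
```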
The Conversational Layer: Memory-First Responses
The chat API doesn't return raw LLM prose. It returns controlled, memory-grounded briefs. Here's the core of lib/kairo-brief.ts:
```ts
export function buildKairoBrief(message: string, matches: MemoryMatch[]) {
  if (!matches.length) {
    return [
      "INSUFFICIENT_MEMORY: no Hindsight recall hits.",
      "Do not infer root cause. Collect: vendor status, p95 latency, deploy timeline, DB pool saturation.",
    ].join("\n")
  }
  const top = getMeta(matches[0])
  const evidence = matches.slice(0, 3).map((m) => getMeta(m))
  const allFailed = Array.from(
    new Set(evidence.flatMap((meta) => meta.failed_checks || []))
  )
  const topSteps = fixToSteps(top.successful_fix)
  const boundary = boundaryLine(top.classification)
  if (message.toLowerCase().includes("skip")) {
    return [
      "SKIP_FIRST (memory — prior incidents wasted time here):",
      ...allFailed.slice(0, 6).map((c) => `- ${c}`),
      "",
      "EXECUTE_NEXT (memory — prior resolution):",
      ...topSteps.map((step, i) => `${i + 1}. ${step}`),
    ].join("\n")
  }
  return [
    `BOUNDARY: ${boundary}`,
    `ROOT_CAUSE_MEMORY: ${top.actual_root_cause}`,
    "",
    "RESOLUTION_STEPS (verbatim from memory):",
    ...topSteps,
    "",
    "SKIP (what wasted time before):",
    ...allFailed.slice(0, 5),
  ].join("\n")
}
```
This is deliberately not creative. It doesn't hallucinate next steps. If memory says the fix is "failover to Cashfree payment route," that's what gets recommended. If memory is empty, it tells you to collect telemetry—no guessing.
Behavior in Practice: Three Real Scenarios
Scenario 1: Razorpay UPI Timeout
Query: Razorpay india-west 504 on payment capture, latency spike, status page green
Recall finds 3 matches from the past 8 months, all classified vendor_side. All three share the same pattern: route UPI traffic to a fallback provider.
```
BOUNDARY: third-party vendor side
ROOT_CAUSE_MEMORY: UPI acquiring-bank route saturation
RESOLUTION_STEPS:
1. Fail over UPI and card payments to Cashfree route
2. Monitor UPI payload on Razorpay dashboard
3. Restore traffic after Razorpay confirms recovery
SKIP (wasted time in similar incidents):
- DB connection pool inspection
- Redis failover
- Recent deploy investigation
- Internal checkout service health check
```
Scenario 2: Internal Postgres Exhaustion
Query: internal postgres connection pool 504 checkout, all payment methods
Recall finds 1 match, classified internal. Past fix: scale max_connections.
```
BOUNDARY: internal app/database side
ROOT_CAUSE_MEMORY: Connection pool configured too small
RESOLUTION_STEPS:
1. SSH to main-postgres-1, run: SELECT count(*) FROM pg_stat_activity
2. If > 35 connections in use, scale max_connections in postgresql.conf
3. Restart postgres, test checkout
```
Scenario 3: Auth0 Latency + Internal Cache Miss
Query: auth0 login latency, internal token cache miss
Recall finds a past incident classified mixed—Auth0 was slow, but the internal token cache was also stale.
```
BOUNDARY: mixed—vendor issue amplified by internal system
ROOT_CAUSE_MEMORY: Auth0 latency + stale token cache
RESOLUTION_STEPS:
1. Increase Redis cache TTL for tokens from 5m to 15m
2. Add cache-warming on service startup
3. Monitor Auth0 p95 latency dashboard
```
This is self-correction in action. The system doesn't know ahead of time that S3 timeouts are "usually vendor-side" or that connection pools are "usually internal." It learns from episodes. If tomorrow an S3 timeout turns out to be caused by a misconfigured Lambda VPC, that episode gets stored, and next time memory includes both the vendor-side and internal S3 episodes—letting classification adapt.
Technical Decisions and Trade-offs
Why Hindsight? I needed semantic recall, not keyword search. Hindsight's vector indexing lets me query "webhook timeout on Razorpay" and get matches for "capture callback delay" across months of incidents. The local fallback (recallIncidents() works without API keys via text-overlap scoring) made iteration fast during development.
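The fallback scorer is simple enough to show on its own (a distilled, self-contained version of the overlap scoring inside `recallIncidents()`, with illustrative helper names):

```ts
// Score a candidate's text against query terms by simple term overlap.
export function overlapScore(queryTerms: string[], haystack: string): number {
  const text = haystack.toLowerCase()
  return queryTerms.reduce(
    (total, term) => total + (text.includes(term.toLowerCase()) ? 1 : 0),
    0,
  )
}

// Rank candidates by overlap and keep the top k, mirroring the
// fallback recall pipeline (score, sort descending, slice).
export function rankByOverlap<T>(
  query: string,
  items: T[],
  textOf: (item: T) => string,
  k = 4,
): T[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean)
  return items
    .map((item) => ({ item, score: overlapScore(terms, textOf(item)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ item }) => item)
}
```

It is crude compared to vector recall, but it kept the whole triage loop testable offline.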
Why not call the LLM directly? Raw LLMs hallucinate. I've seen models suggest "clear the browser cache" for a vendor outage. By building a structured brief in buildKairoBrief() first, the LLM only elaborates on memory-backed facts in follow-up turns. It never generates unfounded resolution steps.
Why deterministic demo mode? During early testing, buildKairoBrief() returns structured, memory-grounded text without calling any LLM. This ensures the core value—classification, recall, dead-end pruning—works independently of model quality. You can enable KAIRO_DEMO_MODE=llm for conversational follow-ups, but the first response is always memory-first.
Why Next.js over Python? One codebase for API and dashboard. Deployment to Vercel in one click. TypeScript caught type mismatches between incident schemas and memory metadata early, before they became runtime bugs.
What This Unlocks
In structured triage scenarios where the correct next action is already in memory, Kairo achieved an ~86% reduction in mean time to remediate—measured against the five seeded test scenarios, not production traffic. Across those five, it classified vendor vs. internal correctly every time and surfaced the exact fix that had worked before.
More importantly, it changes how teams think about incident response. Instead of maintaining runbooks that rot, you maintain a memory bank. Every resolved incident becomes a training example for the next one. The agent isn't smart because it's an LLM—it's smart because it remembers what worked.
Lessons That Stick
1. Episodic memory beats procedural docs. A runbook says "check your logs." An episode says "last time Razorpay was slow, we checked logs for 30 minutes and found nothing—we toggled traffic to Cashfree and latency dropped instantly." Memory is higher-fidelity than process.
2. Structure the metadata before you query it. The incidents I stored had explicit fields for classification, failed_checks, and successful_fix. Fine-grained metadata lets the recall layer return structured decisions, not just similar text. Garbage in, garbage out applies to vector databases too.
3. Grounding beats generation. My first draft had the LLM synthesize next steps on the fly. It invented steps. By grounding responses in memory first and only using the LLM to elaborate on follow-ups, the system stayed credible.
4. Fallback isn't a weakness—it's a feature. The local recall (simple text overlap) let me develop and demo without Hindsight API keys. When I added real Hindsight recall, the API was a drop-in swap. Multi-tier fallbacks make systems more resilient, not less.
5. For incident response, reproducibility trumps novelty. The team doesn't need a chatbot that surprises them. They need a copilot that says: "Here's what worked before, here's what wasted time before, here's the boundary we learned last time." In incident response, memory is more valuable than creativity.