DEV Community: Shivangi Gupta

How Hindsight Memory Turned My Chatbot Into an Incident Commander

Shivangi Gupta — Sun, 14 Jun 2026 07:33:57 +0000

It was 11:45 PM on a Thursday when our checkout service started throwing 503s. I was the one on-call. I pulled up the logs, pinged three teammates on Slack, dug through a week-old Notion doc someone had half-written after a similar incident, and spent the next 52 minutes piecing together that the root cause was a Redis cache eviction policy silently dropping session tokens under load — something our team had already diagnosed and fixed four months earlier. Nobody remembered. The fix was sitting in a closed Jira ticket that nobody thought to search.

That night I decided to build something so the next person on-call wouldn't start from zero. Starting from zero when the answer already exists is a memory problem.

The Pattern Nobody Talks About

Every SRE team has the same dirty secret: we resolve the same incidents repeatedly. The symptoms change slightly, the runbooks get stale, and the engineer on-call at midnight reconstructs the same reasoning chain their colleague already built six weeks ago.

The knowledge exists. It just doesn't live anywhere an agent can find it.

When I started building On-Call Copilot, I didn't want another chatbot wrapping an LLM around an incident description. I wanted something that actually learns — getting faster and more accurate every time the team resolves an outage. The key was Hindsight agent memory.

What the System Does

On-Call Copilot is a FastAPI backend paired with a React/TypeScript frontend that gives on-call engineers a single interface for triage. When an incident comes in, the engineer describes what's happening — or selects from common incident presets — and the system does four things in sequence:

Recalls similar historical incidents from organizational memory via Hindsight
Analyzes the current incident description using a Groq LLM
Generates a probable root cause with supporting evidence
Drafts a customer-facing status update

The architecture is deliberately simple.

No RAG pipeline stitched from five libraries. No vector database to host yourself. Hindsight handles the memory layer entirely, so I could focus on reasoning logic rather than infrastructure.

The Core Technical Story: Making Memory Operational

The interesting engineering challenge here wasn't the LLM integration — Groq's API is straightforward, and generating root cause analysis from a well-structured prompt is table stakes at this point. The hard part was making past incident knowledge actually useful at query time.

Here's what the memory integration in backend/memory.py looks like at its core:

import os

from hindsight import HindsightClient

client = HindsightClient(
    api_key=os.environ["HINDSIGHT_API_KEY"],
    bank_id=os.environ["BANK_ID"]
)

def store_incident(incident: dict) -> str:
    """Retain a resolved incident into organizational memory."""
    content = f"""
    Incident: {incident['title']}
    Symptoms: {incident['symptoms']}
    Root Cause: {incident['root_cause']}
    Resolution: {incident['resolution']}
    Duration: {incident['duration_minutes']} minutes
    """
    result = client.retain(content=content, metadata={
        "type": "incident",
        "severity": incident.get("severity", "unknown"),
        "service": incident.get("service", "unknown")
    })
    return result.id

def recall_similar_incidents(description: str, top_k: int = 3) -> list:
    """Recall the most relevant past incidents for a given description."""
    results = client.recall(query=description, top_k=top_k)
    return [r.content for r in results]

Two functions. That's the entire memory layer. retain writes a resolved incident into Hindsight's vector store. recall queries it semantically at incident time. The bank_id scopes the memory to your organization, so you query your team's own incident history rather than a shared global pool.
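To exercise these two functions without network access, one option is a small in-memory stand-in. The sketch below is a test double built on assumptions: `FakeMemoryClient`, `Memory`, and the word-overlap scoring are hypothetical, and the class only mirrors the `retain` and `recall` call shapes used in the snippet above, not Hindsight's actual behavior. Word overlap is a crude proxy and does none of the semantic matching the real service provides.

```python
import re
from dataclasses import dataclass, field
from itertools import count
from typing import Optional


@dataclass
class Memory:
    """Hypothetical record shape: exposes the .id and .content read by memory.py."""
    id: str
    content: str
    metadata: dict = field(default_factory=dict)


def _words(text: str) -> set:
    # Lowercase alphanumeric tokens, used for naive overlap scoring.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class FakeMemoryClient:
    """In-memory stand-in for tests; not the real Hindsight client."""

    def __init__(self) -> None:
        self._items = []
        self._ids = count(1)

    def retain(self, content: str, metadata: Optional[dict] = None) -> Memory:
        item = Memory(id=f"mem-{next(self._ids)}", content=content, metadata=metadata or {})
        self._items.append(item)
        return item

    def recall(self, query: str, top_k: int = 3) -> list:
        # Score each item by shared words, drop non-matches, keep the best top_k.
        query_words = _words(query)
        scored = [(len(query_words & _words(m.content)), m) for m in self._items]
        matches = [pair for pair in scored if pair[0] > 0]
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return [m for _, m in matches[:top_k]]
```

Swapping this in for `client` in a unit test lets `store_incident` and `recall_similar_incidents` run end to end, since they only touch `result.id` and `r.content`.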

What surprised me was how much signal Hindsight extracts from unstructured incident descriptions. When an engineer types "payments timing out during peak load," the recall doesn't just keyword-match on "payments" or "timeout." It surfaces incidents where database latency caused downstream webhook failures, incidents where connection pool limits were hit under traffic spikes, and incidents where async job queues backed up. The semantic layer does real work here.

The Agent Reasoning Loop

The backend/agent.py file is where historical memory and live LLM reasoning come together. When an incident comes in through the /analyze endpoint, the agent runs this sequence:

async def analyze_incident(description: str) -> IncidentAnalysis:
    # Step 1: Pull relevant past incidents from Hindsight
    historical = recall_similar_incidents(description, top_k=3)

    # Step 2: Build a context-rich prompt
    context = "\n\n".join([
        f"Past Incident {i+1}:\n{h}" 
        for i, h in enumerate(historical)
    ])

    prompt = f"""You are an expert SRE. Analyze this incident using historical context.

Historical incidents from our systems:
{context}

Current incident:
{description}

Provide:
1. Most likely root cause
2. Recommended remediation steps
3. Estimated time to resolve
4. Customer communication draft
"""

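    # Step 3: Ask the LLM. If groq_client is the synchronous Groq client, this
    # call blocks the event loop; AsyncGroq with `await` avoids that.
    response = groq_client.chat.completions.create(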
        model="llama3-8b-8192",
        messages=[{"role": "user", "content": prompt}]
    )

    return parse_analysis(response.choices[0].message.content)

The key design decision here is injecting historical incidents into the prompt before the LLM reasons about the current one. Without Hindsight, the agent would be reasoning from general training data — useful, but generic. With Hindsight recall in the prompt context, the LLM is reasoning from your team's actual resolution history. It knows that the last three times you saw Stripe webhook timeouts, the root cause was database latency, and the fix was bumping the connection pool limit and moving invoice processing to an async worker.

That's a meaningfully different output.
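One case the snippet above does not show is what happens when recall returns nothing, for example on a fresh memory bank. A minimal sketch, assuming the prompt assembly is pulled into a pure function: `build_prompt` and the placeholder wording are hypothetical, and the prompt is abbreviated to the parts that depend on recall.

```python
def build_prompt(description: str, historical: list) -> str:
    """Assemble the analysis prompt and say so explicitly when recall found nothing."""
    if historical:
        context = "\n\n".join(
            f"Past Incident {i}:\n{h.strip()}"
            for i, h in enumerate(historical, start=1)
        )
    else:
        # An explicit marker keeps the model from inventing history.
        context = "(no similar past incidents were recalled)"
    return (
        "You are an expert SRE. Analyze this incident using historical context.\n\n"
        f"Historical incidents from our systems:\n{context}\n\n"
        f"Current incident:\n{description.strip()}\n"
    )
```

Because the function has no I/O, it can be unit-tested without calling Hindsight or Groq, and the empty-recall branch is visible in code rather than hidden in a blank section of the prompt.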

What the Before/After Looks Like

Here's a concrete example. The incident description:

"Stripe webhook processing is timing out during invoice creation. Payments are delayed and subscriptions are not being activated."

Without Hindsight memory (generic LLM response):

Root cause: "Possible network issues, third-party API downtime, or misconfigured webhook endpoint"
Fix: "Check Stripe dashboard, review webhook logs, verify endpoint availability"
Useful? Barely. Any engineer already knows to check the Stripe dashboard.

With Hindsight memory (after 6 months of incident history):

Root cause: "Database latency caused webhook processing to exceed the 30-second timeout limit. Similar incidents on 2024-09-14 and 2024-11-02 had the same signature — high read latency on invoices table during subscription renewal batches."
Fix: "Optimize the invoices query with index on (subscription_id, status). Increase webhook timeout to 60s in Stripe dashboard. Move invoice creation to async Celery task — see PR #847 from the November incident."
Customer update: "We are aware of delays affecting payment processing and subscription activation. Our team has identified the root cause and is applying a fix. Service will be fully restored within 30 minutes."

The second response isn't just more specific — it's referencing your past work, your specific table names, your previous PRs. That's what organizational memory looks like when it's actually wired into the reasoning loop.

Seeding Memory at Scale

The system is only as good as the incidents you've retained. For teams starting fresh, I built backend/seed_data.py to pre-populate Hindsight with representative incident patterns — connection pool exhaustion, pod OOMKill cycles, payment processor timeouts. This gives useful recall from day one while real incident history accumulates and gradually takes over.
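A seeding script along these lines could look like the sketch below. It is not the contents of backend/seed_data.py: the `seed_memory` helper is hypothetical, the three incidents are invented placeholders that only echo the pattern names above, and `store` stands for any callable with the shape of `store_incident`.

```python
from typing import Callable

# Illustrative placeholder incidents; replace with patterns from your own systems.
SEED_INCIDENTS = [
    {
        "title": "Connection pool exhaustion",
        "symptoms": "Requests queue and time out under peak traffic",
        "root_cause": "Pool size too small for concurrent load",
        "resolution": "Raised the pool limit and added saturation alerts",
        "duration_minutes": 40,
    },
    {
        "title": "Pod OOMKill cycle",
        "symptoms": "Pods restart repeatedly shortly after deploy",
        "root_cause": "Memory limit below the service's working set",
        "resolution": "Raised the memory limit and fixed a leak",
        "duration_minutes": 55,
    },
    {
        "title": "Payment processor timeouts",
        "symptoms": "Webhook handlers exceed their time budget",
        "root_cause": "Slow database reads during batch jobs",
        "resolution": "Moved heavy work to a background queue",
        "duration_minutes": 30,
    },
]

REQUIRED_FIELDS = {"title", "symptoms", "root_cause", "resolution", "duration_minutes"}


def seed_memory(store: Callable[[dict], str], incidents: list = SEED_INCIDENTS) -> list:
    """Validate and store each seed incident; return the ids reported by `store`."""
    ids = []
    for incident in incidents:
        missing = REQUIRED_FIELDS - set(incident)
        if missing:
            raise ValueError(f"seed incident missing fields: {sorted(missing)}")
        ids.append(store(incident))
    return ids
```

Validating required fields before storing is what makes the seed file double as a schema check: a malformed seed incident fails loudly instead of becoming a half-empty memory.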

What the Frontend Exposes

The React frontend, which I built, is organized into four views: incident input, memory recall, analysis results, and a live telemetry console. The most important piece was the memory recall view. Engineers needed to see which past incidents were driving the recommendation, not just a black-box output. Transparency in retrieval builds trust: when you can see the root cause is grounded in three real incidents from your own history, you act on it faster.

Lessons I'd Take Into the Next System

1. Memory quality matters more than model size. On the incident suite I ran, Llama 3 8B with Hindsight recall outperformed GPT-4o without it on domain-specific incident analysis. Context beats parameters.

2. Retain at resolution time, not creation time. You don't know the root cause when an incident opens. Retain after close, when you have the full picture.

3. Metadata filtering makes recall precise. Scope recalls by service, severity, or date range. A P1 database incident and a P3 CSS bug shouldn't surface each other.

4. Show your work. Transparency in memory retrieval builds trust. Engineers act faster on recommendations when they can see which past incidents are driving them.

5. Seed data is a forcing function. Writing realistic seed incidents forces you to define your memory schema before real data accumulates. Worth doing even if you overwrite it immediately.
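Lessons 2 and 3 can be sketched as two small guards. This is an assumption-laden illustration, not the project's code: the `status` field, both function names, and the plain-dict result shape are hypothetical, and the filter runs client-side on whatever recall returned, since the snippets in this post do not show a server-side metadata filter.

```python
from typing import Optional


def ready_to_retain(incident: dict) -> bool:
    """Retain only closed incidents that have a root cause and a resolution."""
    return (
        incident.get("status") == "resolved"
        and bool(incident.get("root_cause"))
        and bool(incident.get("resolution"))
    )


def filter_recalled(
    results: list,
    service: Optional[str] = None,
    severities: Optional[set] = None,
) -> list:
    """Keep recalled items whose metadata matches the requested scope."""
    kept = []
    for item in results:
        meta = item.get("metadata", {})
        if service is not None and meta.get("service") != service:
            continue
        if severities is not None and meta.get("severity") not in severities:
            continue
        kept.append(item)
    return kept
```

With these in place, a P3 frontend bug never reaches the prompt for a P1 payments incident, and a still-open incident is never written to memory with a guessed root cause.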

Where This Goes

The next step is proactive recall — as anomaly signals come in from observability tooling, the agent checks Hindsight for matching historical patterns before the alert even pages someone. The Hindsight documentation covers webhook-based retain flows that make this straightforward. The Hindsight GitHub repo has everything you need to get started.

Every incident your team resolves is a piece of institutional memory. The question is whether it lives in someone's head, in a runbook nobody reads, or in a system that surfaces it at 2 AM. I built the frontend for this because I wanted on-call to feel calm and clear — the opposite of 11:45 PM staring at a wall of logs. After putting it through a real incident suite, I'd take it over a Notion doc and a prayer any day.

I taught Hindsight to turn chat into database writes

Shivangi Gupta — Sat, 06 Jun 2026 11:42:41 +0000

The hard part was not getting a model to answer a student. The hard part was deciding when a sentence in chat should become a durable change in the system.

That distinction shaped the way I built Student Copilot. I did not want another assistant that could summarize a long thread and then forget the consequence five minutes later. I wanted a workspace where a student could paste an opportunity announcement, forward a team message, write a daily check-in, or ask about overdue work, and the system would keep the operational record in sync.

That sounds simple until you try to ship it. Chat is loose. Databases are not. Users write half-sentences, change their minds, omit dates, paste screenshots, and use names inconsistently. A useful assistant has to handle all of that without turning every ambiguous sentence into a bad write.

The core lesson I took from integrating Hindsight agent memory (the project is on GitHub) is that memory should not be treated as a prettier chat log. Memory is part of the execution model. It needs boundaries, evidence, retention rules, and a way to distinguish “remember this fact” from “mutate application state now.”

What Student Copilot does

Student Copilot is a personal workspace for tracking student commitments: opportunities, deadlines, teams, tasks, check-ins, weekly reports, memories, and conversation history. The React client is a single-page app served through a small Express backend. The backend owns the API surface, the local persistence layer, and the model calls.

The app has a few main loops:

Save and update opportunities with deadlines and registration links.
Create task backlogs manually or from pasted text.
Maintain team rosters from forms, messages, or images.
Record daily check-ins and generate weekly summaries.
Let the user ask a memory assistant about what they have committed to.
Store explicit memories separately from raw conversation history.

The structure is intentionally plain. server.ts defines the REST API. src/server/db.ts owns persistence and domain operations. src/server/gemini.ts owns model interaction, fallback parsing, structured extraction, and action execution. src/App.tsx carries the UI state and calls the API.

That split matters because the interesting part of this system is not the chat box. It is the path from natural language to a bounded database operation.

The through-line: chat is input, not state

The first version of any assistant-style app tends to make chat history the center of the universe. The user says something, the model responds, and the transcript becomes the only durable artifact.

That breaks down quickly in a productivity system.

If I say, “I submitted my application and Riya is handling the backend tasks,” I do not want that buried in message history. I want at least three possible updates:

The opportunity may need a new status.
The team roster may need a role update.
A memory about Riya’s responsibility may need to be retained.

Those are not the same operation. They have different lifetimes, different query patterns, and different failure modes.

This is where the Hindsight documentation for agent memory influenced the design. Hindsight’s retain, recall, and reflect model is a useful mental model because it pushes memory out of “append this transcript” thinking. In this app, I used that idea as an application architecture rule: memory is only useful when it can be recalled in the right context and converted into safer decisions later.

The backend expresses that rule with a small action vocabulary. The model is allowed to propose actions, but the server decides how to execute them.

interface ExtractionResult {
  reply: string;
  extractedActions?: Array<{
    action:
      | 'create_opportunity'
      | 'update_opportunity'
      | 'create_task'
      | 'update_task'
      | 'update_team'
      | 'create_memory'
      | 'daily_checkin';
    data: any;
  }>;
}

This is not elaborate, but it is the important boundary. The model does not get to write arbitrary records. It emits one of a small number of verbs. Each verb maps to code I can inspect, test, and harden.
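A type annotation alone does not stop a model from emitting an unexpected verb at runtime, so the vocabulary also needs a runtime check. Here is a minimal sketch of that check, written in Python for brevity (the app itself is TypeScript); `split_actions` is a hypothetical helper, and the verb set is copied from the interface above.

```python
ALLOWED_ACTIONS = frozenset({
    "create_opportunity",
    "update_opportunity",
    "create_task",
    "update_task",
    "update_team",
    "create_memory",
    "daily_checkin",
})


def split_actions(proposed) -> tuple:
    """Separate well-formed, allowed actions from everything else."""
    accepted, rejected = [], []
    for item in proposed or []:
        ok = (
            isinstance(item, dict)
            and isinstance(item.get("action"), str)
            and item["action"] in ALLOWED_ACTIONS
            and isinstance(item.get("data"), dict)
        )
        (accepted if ok else rejected).append(item)
    return accepted, rejected
```

Returning the rejected items instead of discarding them gives the server something to log, which is useful when a prompt change starts producing verbs the executor does not know.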

Building the memory context

When the user sends a message, the assistant does not answer from the transcript alone. It builds a compact workspace context from the current system state: opportunities, tasks, teams, memories, recent logs, and check-ins.

const contextData = {
  currentTime: new Date().toISOString(),
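  // Local calendar date as YYYY-MM-DD (the en-CA locale formats dates this way)
  localDate: new Date().toLocaleDateString('en-CA'),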
  opportunities: opportunities.map(o => ({
    id: o.id,
    title: o.title,
    type: o.type,
    status: o.status,
    deadline: o.deadline,
    notes: o.notes
  })),
  tasks: tasks.map(t => ({
    id: t.id,
    title: t.title,
    dueDate: t.dueDate,
    priority: t.priority,
    status: t.status
  })),
  teams: teams.map(t => ({
    id: t.id,
    name: t.name,
    opportunityId: t.opportunityId,
    members: t.members
  })),
  memories: memories.map(m => ({
    category: m.category,
    title: m.title,
    content: m.content
  }))
};

The important detail is that identifiers are included. If the assistant suggests updating an existing task or opportunity, it can reference a real target instead of guessing from a title. That is a small decision, but it prevents a lot of duplicate records.
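On the execution side, the same idea means an update should resolve its target by id and refuse to guess when the id is missing or unknown. A small Python sketch of that rule (the app is TypeScript; `apply_task_update`, the in-memory `tasks` mapping, and the updatable field list are hypothetical):

```python
UPDATABLE_FIELDS = ("title", "dueDate", "priority", "status")


def apply_task_update(tasks: dict, data: dict) -> bool:
    """Apply an update_task payload only if it names an existing task id."""
    task_id = data.get("id")
    task = tasks.get(task_id) if isinstance(task_id, str) else None
    if task is None:
        # Unknown or missing id: refuse instead of matching by title.
        return False
    for key in UPDATABLE_FIELDS:
        if key in data:
            task[key] = data[key]
    return True
```

A `False` return is a natural trigger for asking the user which record they meant, rather than creating a near-duplicate.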

This is also where Vectorize’s explanation of agent memory maps well to the app: the memory layer is not just about retrieval. It is about preserving enough state for the agent to reason across time without asking the user to restate everything.

In Student Copilot, a useful memory might be “I prefer backend tasks in the evening,” “Ananya usually owns UI work,” or “this fellowship requires a transcript before submission.” Those facts are not all tasks. They are context that should shape future task creation, team suggestions, and reminders.

Structured output made the system debuggable

The model call asks for JSON with a reply and optional extracted actions. The reply is what the user sees. The actions are what the server may execute.

const response = await generateWithRetry(ai, {
  model: 'gemini-3.5-flash',
  contents: userMessage,
  config: {
    systemInstruction,
    responseMimeType: 'application/json',
    responseSchema: {
      type: Type.OBJECT,
      properties: {
        reply: { type: Type.STRING },
        extractedActions: {
          type: Type.ARRAY,
          items: {
            type: Type.OBJECT,
            properties: {
              action: { type: Type.STRING },
              data: { type: Type.OBJECT }
            },
            required: ['action', 'data']
          }
        }
      },
      required: ['reply']
    }
  }
});

This is the difference between “AI feature” and maintainable software. I can log the action list. I can reject malformed actions. I can add validation per action. I can replay a user message and compare the proposed mutations. I can make the UI optimistic or conservative depending on the operation.

I also learned that schema-constrained output does not remove the need for defensive programming. It narrows the failure space. That is still valuable. A malformed JSON response is easier to handle than a persuasive paragraph that vaguely says it updated something.
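That defensive layer can be as small as one parsing function that never lets a bad payload reach the executor. The following is a sketch in Python rather than the app's TypeScript; `parse_model_response` and the fallback wording are hypothetical, while the `reply` and `extractedActions` keys come from the schema above.

```python
import json

FALLBACK_REPLY = "Sorry, I could not process that. Nothing was changed."


def parse_model_response(raw) -> tuple:
    """Return (reply, actions); on any malformed payload, return no actions."""
    try:
        payload = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return FALLBACK_REPLY, []
    if not isinstance(payload, dict) or not isinstance(payload.get("reply"), str):
        return FALLBACK_REPLY, []
    actions = payload.get("extractedActions")
    # extractedActions is optional in the schema, so a missing key is not an error.
    return payload["reply"], (actions if isinstance(actions, list) else [])
```

The important property is that every failure path yields an empty action list, so a parsing problem can degrade the reply but cannot cause a write.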

The execution boundary

Once the response is parsed, the server saves the conversation and then executes extracted actions. The implementation is intentionally boring.

switch (action) {
  case 'create_task': {
    db.createTask(userId, {
      title: data.title || 'AI Task',
      description: data.description || 'Auto-created via chat message details.',
      // Hardcoded fallback date for illustration; derive it from the current date in real use.
      dueDate: data.dueDate || '2026-06-15',
      priority: data.priority || 'Medium',
      status: data.status || 'Pending',
      opportunityId: data.opportunityId
    });
    break;
  }
  case 'create_memory': {
    db.createMemory(userId, {
      title: data.title || 'Factual Note',
      content: data.content || '',
      category: data.category || 'goal'
    });
    break;
  }
}

There are no hidden side effects here. A model-proposed create_task becomes exactly one task creation. A model-proposed create_memory becomes exactly one memory. If I want stricter validation, deduplication, approval prompts, or audit trails, this is the seam where those controls belong.
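As one example of a control that fits at this seam, here is a minimal deduplication sketch. It is written in Python for brevity and is not the app's code: `create_task_once`, the list-backed store, and the choice of normalized title plus due date as the duplicate key are all assumptions.

```python
def _normalize(title) -> str:
    # Case-insensitive, whitespace-collapsed form used only for comparison.
    return " ".join(str(title or "").lower().split())


def create_task_once(tasks: list, data: dict) -> bool:
    """Append a task unless one with the same normalized title and due date exists."""
    title = _normalize(data.get("title"))
    if not title:
        return False  # refuse untitled tasks rather than inventing a title
    key = (title, data.get("dueDate"))
    for existing in tasks:
        if (_normalize(existing.get("title")), existing.get("dueDate")) == key:
            return False
    tasks.append(dict(data))
    return True
```

This matters in chat because users repeat themselves: the same sentence pasted twice should not produce two identical tasks.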

I would not ship this kind of system with the model directly calling a generic database API. The action vocabulary is the contract. It is also the part that makes Hindsight useful: memory can inform the proposed action, but application code still owns authority.

Production systems need boring fallbacks

The least glamorous part of the repo is one of the most important: the assistant still works when the model path is unavailable.

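import { GoogleGenAI } from '@google/genai';

// Lazily created singleton; stays null until an API key is available.
let aiClient: GoogleGenAI | null = null;

export function getGeminiClient(): GoogleGenAI | null {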
  if (!aiClient) {
    const key = process.env.GEMINI_API_KEY;
    if (!key) {
      console.log('GEMINI_API_KEY environment variable is not defined. Using mock AI response fallback.');
      return null;
    }
    aiClient = new GoogleGenAI({ apiKey: key });
  }
  return aiClient;
}

The fallback path answers common questions about deadlines, teams, and pending tasks from the database. It can also create simple records from recognizable phrases. That is not as flexible as the model path, but it keeps the product honest: the workspace is still useful because the core state is first-class.
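A deterministic fallback of this kind can be little more than keyword rules over the stored records. The sketch below is in Python (the app is TypeScript) and is not the repo's implementation: `fallback_reply`, the keyword choices, and the reply wording are hypothetical, and it assumes deadlines are stored as ISO date strings so that string sorting matches date order.

```python
def fallback_reply(message: str, tasks: list, opportunities: list) -> str:
    """Answer common questions from stored records when no model is available."""
    text = message.lower()
    if "deadline" in text:
        dated = sorted(
            (o for o in opportunities if o.get("deadline")),
            key=lambda o: o["deadline"],  # ISO strings sort chronologically
        )
        if not dated:
            return "No deadlines on record."
        listing = "; ".join(f"{o['title']} ({o['deadline']})" for o in dated)
        return f"Upcoming deadlines: {listing}"
    if "task" in text or "pending" in text:
        pending = [t["title"] for t in tasks if t.get("status") == "Pending"]
        if not pending:
            return "No pending tasks."
        return "Pending tasks: " + ", ".join(pending)
    return "The assistant is unavailable, but your saved records are still in the tabs."
```

Every answer here is computed from records, so the fallback cannot claim a write it did not perform.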

I prefer this shape to an app where every button is secretly a prompt. Forms, REST endpoints, and deterministic queries remain the foundation. The assistant adds a faster input method and a memory-aware retrieval layer.

A concrete interaction

Here is the kind of flow the system is designed to support.

I paste this into chat:

Registered for the summer research fellowship. Deadline is June 30. Ananya will review the essays, and I need to prepare my transcript this week.

The assistant should not only respond with encouragement. It should propose concrete actions:

Create an opportunity with a June 30 deadline.
Create a task to prepare the transcript.
Save a memory that Ananya is reviewing essays.

The visible answer might say:

I added the fellowship to your opportunity tracker, created a transcript task for this week, and saved Ananya’s review role so I can remember it later.

The next day, if I ask, “What am I blocked on for the fellowship?”, the system should have enough retained state to answer from actual records rather than vibes. It can inspect the opportunity, related tasks, saved memories, and recent check-ins.

That is the product value of Hindsight in this architecture. It helps turn accumulated context into better future behavior without making the transcript the database.

What I learned

The first reusable lesson is that memory should have types. A chat message, a task, a preference, and a durable fact are not interchangeable. Storing them separately makes the assistant more useful and the system easier to debug.

The second lesson is that agents need verbs, not permissions. “You may update the database” is too broad. “You may propose create_task, update_task, and create_memory actions that this server validates” is something I can reason about.

The third lesson is that Hindsight works best when it is part of the application’s control flow. I do not want memory bolted on after response generation. I want recall to shape the context before a decision, retention to happen after meaningful events, and reflection to improve what gets carried forward.

The fourth lesson is that fallback behavior is product behavior. If the model is down or the key is missing, users should still be able to see deadlines, tasks, teams, and memories. The assistant can degrade. The workspace should not disappear.

The fifth lesson is that the UI should expose state, not magic. Student Copilot has tabs for opportunities, tasks, teams, check-ins, memories, reports, and profile data because users need to inspect and correct what the assistant inferred. A memory system without correction paths eventually becomes a liability.

The part I would keep

If I rebuilt this from scratch, I would keep the same core idea: chat is an input surface, not the source of truth.

Hindsight makes that more practical because it gives the agent a way to learn from prior interactions without stuffing the entire past into every prompt. But the application still needs strong boundaries. The database owns records. The server owns mutations. The assistant proposes typed actions. The memory layer improves context over time.

That architecture is less flashy than a fully autonomous assistant, but it is the one I trust. It lets the system get more useful as it learns while keeping the important question visible in code: what exactly is this message allowed to change?