Shivangi Gupta

Posted on Jun 14

How Hindsight Memory Turned My Chatbot Into an Incident Commander

#webdev #javascript #beginners #ai

It was 11:45 PM on a Thursday when our checkout service started throwing 503s. I was the one on-call. I pulled up the logs, pinged three teammates on Slack, dug through a week-old Notion doc someone had half-written after a similar incident, and spent the next 52 minutes piecing together that the root cause was a Redis cache eviction policy silently dropping session tokens under load — something our team had already diagnosed and fixed four months earlier. Nobody remembered. The fix was sitting in a closed Jira ticket that nobody thought to search.

That night I decided to build something so the next person on-call wouldn't start from zero. Starting from zero every time is a memory problem.

The Pattern Nobody Talks About

Every SRE team has the same dirty secret: we resolve the same incidents repeatedly. The symptoms change slightly, the runbooks get stale, and the engineer on-call at midnight reconstructs the same reasoning chain their colleague already built six weeks ago.

The knowledge exists. It just doesn't live anywhere an agent can find it.

When I started building On-Call Copilot, I didn't want another chatbot wrapping an LLM around an incident description. I wanted something that actually learns — getting faster and more accurate every time the team resolves an outage. The key was Hindsight agent memory.

What the System Does

On-Call Copilot is a FastAPI backend paired with a React/TypeScript frontend that gives on-call engineers a single interface for triage. When an incident comes in, the engineer describes what's happening — or selects from common incident presets — and the system does four things in sequence:

1. Recalls similar historical incidents from organizational memory via Hindsight
2. Analyzes the current incident description using a Groq LLM
3. Generates a probable root cause with supporting evidence
4. Drafts a customer-facing status update

The architecture is deliberately simple.

No RAG pipeline stitched from five libraries. No vector database to host yourself. Hindsight handles the memory layer entirely, so I could focus on reasoning logic rather than infrastructure.

The Core Technical Story: Making Memory Operational

The interesting engineering challenge here wasn't the LLM integration — Groq's API is straightforward, and generating root cause analysis from a well-structured prompt is table stakes at this point. The hard part was making past incident knowledge actually useful at query time.

Here's what the memory integration in backend/memory.py looks like at its core:

import os

from hindsight import HindsightClient

client = HindsightClient(
    api_key=os.environ["HINDSIGHT_API_KEY"],
    bank_id=os.environ["BANK_ID"]
)

def store_incident(incident: dict) -> str:
    """Retain a resolved incident into organizational memory."""
    content = f"""
    Incident: {incident['title']}
    Symptoms: {incident['symptoms']}
    Root Cause: {incident['root_cause']}
    Resolution: {incident['resolution']}
    Duration: {incident['duration_minutes']} minutes
    """
    result = client.retain(content=content, metadata={
        "type": "incident",
        "severity": incident.get("severity", "unknown"),
        "service": incident.get("service", "unknown")
    })
    return result.id

def recall_similar_incidents(description: str, top_k: int = 3) -> list:
    """Recall the most relevant past incidents for a given description."""
    results = client.recall(query=description, top_k=top_k)
    return [r.content for r in results]

Two functions. That's the entire memory layer. retain writes a resolved incident into Hindsight's vector store. recall queries it semantically at incident time. The bank_id scopes the memory to your organization — so you're not pulling from a shared global pool, you're querying your team's specific incident history.

What surprised me was how much signal Hindsight extracts from unstructured incident descriptions. When an engineer types "payments timing out during peak load," the recall doesn't just keyword-match on "payments" or "timeout." It surfaces incidents where database latency caused downstream webhook failures, incidents where connection pool limits were hit under traffic spikes, and incidents where async job queues backed up. The semantic layer does real work here.
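For local testing it can help to swap the real client for a stand-in that exposes the same two methods. The sketch below is not Hindsight's implementation. `FakeMemoryClient` and `Record` are hypothetical names, the class mirrors only the `retain`/`recall` call shape used above, and it ranks by plain word overlap rather than semantic similarity.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import count
from typing import Optional


@dataclass
class Record:
    id: str
    content: str
    metadata: dict = field(default_factory=dict)


class FakeMemoryClient:
    """Hypothetical test double; mirrors only the retain/recall shape used above."""

    def __init__(self) -> None:
        self._records: list[Record] = []
        self._ids = count(1)

    def retain(self, content: str, metadata: Optional[dict] = None) -> Record:
        record = Record(
            id=f"mem-{next(self._ids)}",
            content=content,
            metadata=metadata or {},
        )
        self._records.append(record)
        return record

    def recall(self, query: str, top_k: int = 3) -> list[Record]:
        # Word overlap is a crude stand-in for semantic search.
        query_words = set(query.lower().split())

        def overlap(record: Record) -> int:
            return len(query_words & set(record.content.lower().split()))

        ranked = sorted(self._records, key=overlap, reverse=True)
        return [r for r in ranked[:top_k] if overlap(r) > 0]


client = FakeMemoryClient()
client.retain("Redis eviction dropped session tokens under load", {"service": "checkout"})
client.retain("Pod OOMKill loop after memory limit change", {"service": "search"})
hits = client.recall("checkout sessions dropped under load", top_k=1)
print([h.content for h in hits])
```

A double like this keeps unit tests for the agent offline and deterministic. It says nothing about recall quality, which still has to be judged against the real service.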

The Agent Reasoning Loop

The backend/agent.py file is where historical memory and live LLM reasoning come together. When an incident comes in through the /analyze endpoint, the agent runs this sequence:

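import os

from groq import Groq

from memory import recall_similar_incidents  # assumed import path for backend/memory.py

# IncidentAnalysis and parse_analysis are not shown in this excerpt.
groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])

async def analyze_incident(description: str) -> IncidentAnalysis: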
    # Step 1: Pull relevant past incidents from Hindsight
    historical = recall_similar_incidents(description, top_k=3)

    # Step 2: Build a context-rich prompt
    context = "\n\n".join([
        f"Past Incident {i+1}:\n{h}" 
        for i, h in enumerate(historical)
    ])

    prompt = f"""You are an expert SRE. Analyze this incident using historical context.

Historical incidents from our systems:
{context}

Current incident:
{description}

Provide:
1. Most likely root cause
2. Recommended remediation steps
3. Estimated time to resolve
4. Customer communication draft
"""

    response = groq_client.chat.completions.create(
        model="llama3-8b-8192",
        messages=[{"role": "user", "content": prompt}]
    )

    return parse_analysis(response.choices[0].message.content)

The key design decision here is injecting historical incidents into the prompt before the LLM reasons about the current one. Without Hindsight, the agent would be reasoning from general training data — useful, but generic. With Hindsight recall in the prompt context, the LLM is reasoning from your team's actual resolution history. It knows that the last three times you saw Stripe webhook timeouts, the root cause was database latency, and the fix was bumping the connection pool limit and moving invoice processing to an async worker.

That's a meaningfully different output.
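The snippet above hands the raw model reply to `parse_analysis`, which isn't shown in this post. As a rough sketch of one way that step could work, assuming the model follows the four numbered items in the prompt, the reply can be split on the leading numbers. The `IncidentAnalysis` field names here are my own guesses chosen to match the prompt, not the project's actual schema.

```python
import re
from dataclasses import dataclass


@dataclass
class IncidentAnalysis:
    # Field names are assumptions chosen to match the four prompt items.
    root_cause: str = ""
    remediation: str = ""
    estimated_time: str = ""
    customer_update: str = ""


_FIELDS = ["root_cause", "remediation", "estimated_time", "customer_update"]


def parse_analysis(text: str) -> IncidentAnalysis:
    """Split a reply with sections numbered 1-4 into an IncidentAnalysis."""
    # A section starts with "1. ", "2. ", ... at the beginning of a line.
    sections = re.split(r"(?m)^[ \t]*([1-4])\.[ \t]+", text)
    analysis = IncidentAnalysis()
    # With a capture group, re.split yields [preamble, "1", body, "2", body, ...].
    for number, body in zip(sections[1::2], sections[2::2]):
        setattr(analysis, _FIELDS[int(number) - 1], body.strip())
    return analysis


reply = "1. Database latency\n2. Add an index\n3. 30 minutes\n4. We are investigating delays."
print(parse_analysis(reply))
```

This is brittle by design. If the model nests its own numbered list inside the remediation steps, the split will misfire, so asking the model for JSON output and validating it would likely be more robust.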

What the Before/After Looks Like

Here's a concrete example. The incident description:

"Stripe webhook processing is timing out during invoice creation. Payments are delayed and subscriptions are not being activated."

Without Hindsight memory (generic LLM response):

Root cause: "Possible network issues, third-party API downtime, or misconfigured webhook endpoint"
Fix: "Check Stripe dashboard, review webhook logs, verify endpoint availability"
Useful? Barely. Any engineer already knows to check the Stripe dashboard.

With Hindsight memory (after 6 months of incident history):

Root cause: "Database latency caused webhook processing to exceed the 30-second timeout limit. Similar incidents on 2024-09-14 and 2024-11-02 had the same signature — high read latency on invoices table during subscription renewal batches."
Fix: "Optimize the invoices query with index on (subscription_id, status). Increase webhook timeout to 60s in Stripe dashboard. Move invoice creation to async Celery task — see PR #847 from the November incident."
Customer update: "We are aware of delays affecting payment processing and subscription activation. Our team has identified the root cause and is applying a fix. Service will be fully restored within 30 minutes."

The second response isn't just more specific — it's referencing your past work, your specific table names, your previous PRs. That's what organizational memory looks like when it's actually wired into the reasoning loop.

Seeding Memory at Scale

The system is only as good as the incidents you've retained. For teams starting fresh, I built backend/seed_data.py to pre-populate Hindsight with representative incident patterns — connection pool exhaustion, pod OOMKill cycles, payment processor timeouts. This gives useful recall from day one while real incident history accumulates and gradually takes over.
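To make the seeding idea concrete, here is a minimal sketch of what such a script might contain. It is not the actual backend/seed_data.py. Every incident value below is a made-up placeholder, and `seed_memory` is a hypothetical helper that takes the storing function as a parameter so it can run without a Hindsight account.

```python
from typing import Callable

# Placeholder seed incidents: every value below is made up for illustration.
SEED_INCIDENTS = [
    {
        "title": "Connection pool exhaustion",
        "symptoms": "Requests queue and then time out under peak traffic",
        "root_cause": "Pool size too small for concurrent request volume",
        "resolution": "Raised the pool limit and added a pool-usage alert",
        "duration_minutes": 40,
        "severity": "P1",
        "service": "api",
    },
    {
        "title": "Pod OOMKill cycle",
        "symptoms": "Pods restart repeatedly shortly after becoming ready",
        "root_cause": "Memory limit set below steady-state usage",
        "resolution": "Raised the memory limit and fixed an unbounded cache",
        "duration_minutes": 25,
        "severity": "P2",
        "service": "worker",
    },
    {
        "title": "Payment processor timeouts",
        "symptoms": "Webhook handlers exceed their time budget",
        "root_cause": "Slow database reads during batch renewals",
        "resolution": "Moved heavy processing to an async worker",
        "duration_minutes": 55,
        "severity": "P1",
        "service": "payments",
    },
]


def seed_memory(store: Callable[[dict], str], incidents: list[dict]) -> list[str]:
    """Store each seed incident and return the ids the store function reports."""
    return [store(incident) for incident in incidents]


# In the real project this would be: seed_memory(store_incident, SEED_INCIDENTS)
```

The dictionaries use the same keys that `store_incident` reads, so seed data and real incidents go through one code path.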

What the Frontend Exposes

The React frontend, which I built, is organized into four panels: incident input, memory recall, analysis results, and a live telemetry console. The most important piece was the memory recall view. Engineers needed to see which past incidents were driving the recommendation, not just a black-box output. Transparency in retrieval builds trust: when you can see that the root cause is grounded in three real incidents from your own history, you act on it faster.

Lessons I'd Take Into the Next System

1. Memory quality matters more than model size. In my testing, Llama 3 8B with Hindsight recall outperformed GPT-4o without it on domain-specific incident analysis. Context beats parameters.

2. Retain at resolution time, not creation time. You don't know the root cause when an incident opens. Retain after close, when you have the full picture.

3. Metadata filtering makes recall precise. Scope recalls by service, severity, or date range. A P1 database incident and a P3 CSS bug shouldn't surface each other.

4. Show your work. Transparency in memory retrieval builds trust. Engineers act faster on recommendations when they can see which past incidents are driving them.

5. Seed data is a forcing function. Writing realistic seed incidents forces you to define your memory schema before real data accumulates. Worth doing even if you overwrite it immediately.
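Lessons 2 and 3 are easy to state and easy to skip, so here is a small sketch of both as plain functions. The `status` field, the `"resolved"` value, and the record shape with a `metadata` dict are assumptions for illustration. The filter runs client-side on already-recalled records, so check the Hindsight documentation for whether filtering can be pushed into the recall call itself.

```python
REQUIRED_FIELDS = ("title", "symptoms", "root_cause", "resolution")


def ready_to_retain(incident: dict) -> bool:
    """True only for closed incidents that have every field the memory needs."""
    if incident.get("status") != "resolved":  # "status" is an assumed field
        return False
    return all(str(incident.get(name) or "").strip() for name in REQUIRED_FIELDS)


def filter_by_metadata(records: list[dict], **wanted: str) -> list[dict]:
    """Keep records whose metadata matches every requested key/value pair."""
    return [
        record
        for record in records
        if all(record.get("metadata", {}).get(k) == v for k, v in wanted.items())
    ]


recalled = [
    {"content": "Replica lag on primary DB", "metadata": {"service": "db", "severity": "P1"}},
    {"content": "Misaligned footer CSS", "metadata": {"service": "web", "severity": "P3"}},
]
print(filter_by_metadata(recalled, service="db"))
```

Calling `ready_to_retain` before `store_incident` keeps half-diagnosed incidents out of memory, and the filter keeps a P3 CSS bug from showing up next to a P1 database outage.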

Where This Goes

The next step is proactive recall — as anomaly signals come in from observability tooling, the agent checks Hindsight for matching historical patterns before the alert even pages someone. The Hindsight documentation covers webhook-based retain flows that make this straightforward. The Hindsight GitHub repo has everything you need to get started.
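As a rough sketch of what that proactive check could look like, the function below turns an anomaly signal into a recall query and attaches any matches before a page goes out. Nothing here exists in the project yet. `precheck_alert` and the signal keys (`service`, `metric`, `summary`) are hypothetical, and the recall function is passed in so the sketch stays independent of any particular client.

```python
from typing import Callable


def precheck_alert(signal: dict, recall: Callable[[str], list[str]]) -> dict:
    """Attach recalled incidents to an anomaly signal before anyone is paged."""
    # Signal keys ("service", "metric", "summary") are assumed, not a real schema.
    query = f"{signal['service']} {signal['metric']}: {signal['summary']}"
    matches = recall(query)
    return {
        **signal,
        "query": query,
        "similar_incidents": matches,
        "has_history": bool(matches),
    }


# In the real project the second argument would be recall_similar_incidents.
```

The page that reaches the engineer would then arrive with the closest past incidents already attached, instead of requiring a separate lookup.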

Every incident your team resolves is a piece of institutional memory. The question is whether it lives in someone's head, in a runbook nobody reads, or in a system that surfaces it at 2 AM. I built the frontend for this because I wanted on-call to feel calm and clear — the opposite of 11:45 PM staring at a wall of logs. After putting it through a real incident suite, I'd take it over a Notion doc and a prayer any day.

DEV Community