Why I Built the Backend That Gives Our AI Incident Agent a Memory

The agent was working. It could take an incident description, reason through it with a Groq LLM, and return a structured root cause analysis in seconds. Impressive for about ten minutes — until we ran the same incident twice and got the same generic output both times. It had no idea we'd already diagnosed and fixed this exact problem before.

That's not an intelligence problem. That's a memory problem.

The Problem Every On-Call Engineer Knows

SRE and DevOps teams face the same incidents again and again: database connection pool exhaustion, API latency spikes, Kubernetes pod crashes, payment processing failures. Every time one hits, the on-call engineer digs through Slack history, stale runbooks, and closed tickets, rebuilding the same reasoning chain a colleague already built weeks ago.

The knowledge exists. It just doesn't live anywhere a system can find it at 2 AM.

When I built the backend for On-Call Copilot, I wasn't trying to build another LLM wrapper. I wanted something that actually accumulates institutional knowledge — getting smarter every time the team resolves an outage. That required solving the memory problem properly, which is where Hindsight agent memory came in.

How the System Works

On-Call Copilot is a FastAPI backend paired with a React/TypeScript frontend. When an incident comes in, the backend handles four things in sequence:

1. Recalls similar historical incidents from organizational memory
2. Analyzes the current incident using a Groq LLM
3. Generates a probable root cause with supporting evidence
4. Drafts a customer-facing status update

The architecture is intentionally minimal. Hindsight handles memory, Groq handles reasoning, and the backend orchestrates the two. The pieces stay cleanly separated: there is no RAG pipeline stitched together from five libraries and no self-hosted vector database.
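To make that separation concrete, here is a minimal sketch of the orchestration step with the memory and reasoning layers passed in as plain callables. The names handle_incident, IncidentReport, and draft_update are hypothetical and not taken from the project. In the actual backend the status update comes out of the same LLM call as the analysis, so the separate draft_update step is an assumption made for readability.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class IncidentReport:
    similar_incidents: List[str]  # step 1: recalled history
    analysis: str                 # steps 2-3: analysis with probable root cause
    status_update: str            # step 4: customer-facing draft


def handle_incident(
    description: str,
    recall: Callable[[str], List[str]],       # memory layer (Hindsight in the post)
    reason: Callable[[str, List[str]], str],  # reasoning layer (Groq in the post)
    draft_update: Callable[[str], str],       # hypothetical status-update writer
) -> IncidentReport:
    """Run the steps in order; the backend only wires the pieces together."""
    similar = recall(description)
    analysis = reason(description, similar)
    update = draft_update(analysis)
    return IncidentReport(similar, analysis, update)


# Stand-in callables, so the flow runs without any API keys.
report = handle_incident(
    "API latency spike on checkout",
    recall=lambda d: ["Past: connection pool exhaustion on checkout-db"],
    reason=lambda d, past: f"Likely cause based on {len(past)} past incident(s)",
    draft_update=lambda a: "We are investigating elevated latency.",
)
print(report.analysis)
```

Because each layer is only a callable here, either one can be swapped for a stub in tests without touching the orchestration code.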

The Memory Layer: Two Functions, Real Power

The most important file in the backend is memory.py. The entire organizational memory integration comes down to this:

import os

from hindsight import HindsightClient

client = HindsightClient(
    api_key=os.environ["HINDSIGHT_API_KEY"],
    bank_id=os.environ["BANK_ID"]
)

def store_incident(incident: dict) -> str:
    content = f"""
    Incident: {incident['title']}
    Symptoms: {incident['symptoms']}
    Root Cause: {incident['root_cause']}
    Resolution: {incident['resolution']}
    Duration: {incident['duration_minutes']} minutes
    """
    result = client.retain(content=content, metadata={
        "type": "incident",
        "severity": incident.get("severity", "unknown"),
        "service": incident.get("service", "unknown")
    })
    return result.id

def recall_similar_incidents(description: str, top_k: int = 3) -> list:
    results = client.recall(query=description, top_k=top_k)
    return [r.content for r in results]

Two functions. That's the entire memory layer. retain writes a resolved incident into Hindsight's vector store. recall queries it semantically when a new incident arrives. The bank_id scopes memory to your organization — you're querying your team's specific history, not a shared global pool.

What surprised me was how much signal Hindsight extracts from unstructured text. The recall doesn't keyword-match. When an engineer types "payments timing out during peak load," it surfaces incidents where database latency caused downstream processing failures — because that's what the description actually means.
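Real recall is semantic, as described above, and that is hard to reproduce locally. For unit-testing store_incident and recall_similar_incidents without an API key, a crude stand-in can be enough. The sketch below is a hypothetical test double that mirrors only the retain and recall call shapes used in memory.py. It ranks by word overlap, which is exactly the keyword matching Hindsight avoids, so it says nothing about real recall quality.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Memory:
    id: str
    content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def _words(text: str) -> set:
    """Lowercased alphanumeric tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class FakeMemoryClient:
    """Hypothetical test double; not part of the Hindsight library."""

    def __init__(self) -> None:
        self._items: List[Memory] = []

    def retain(self, content: str, metadata: Optional[Dict[str, str]] = None) -> Memory:
        item = Memory(id=f"mem-{len(self._items) + 1}", content=content,
                      metadata=metadata or {})
        self._items.append(item)
        return item

    def recall(self, query: str, top_k: int = 3) -> List[Memory]:
        # Word-overlap score only; this is NOT semantic search.
        query_words = _words(query)
        scored = [(len(query_words & _words(m.content)), m) for m in self._items]
        matches = [pair for pair in scored if pair[0] > 0]
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return [m for _, m in matches[:top_k]]


client = FakeMemoryClient()
client.retain("Incident: checkout latency. Root Cause: connection pool exhaustion")
client.retain("Incident: pod crash loop. Root Cause: OOMKilled worker")
hits = client.recall("latency on checkout", top_k=1)
print([h.id for h in hits])
```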

Wiring Memory Into the Reasoning Loop

The agent.py file is where historical memory and live LLM reasoning come together. The key design decision: inject recalled incidents into the prompt before the LLM reasons about the current one:

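import asyncio

async def analyze_incident(description: str) -> IncidentAnalysis:
    # Step 1: Pull relevant past incidents from Hindsight.
    # recall_similar_incidents is synchronous, so run it in a worker thread
    # to avoid blocking the event loop.
    historical = await asyncio.to_thread(recall_similar_incidents, description, top_k=3)

    # Step 2: Format the recalled incidents as numbered context blocks
    context = "\n\n".join(
        f"Past Incident {i+1}:\n{h}"
        for i, h in enumerate(historical)
    )

    # Step 3: Place the history ahead of the current incident in the prompt
    prompt = f"""You are an expert SRE. Analyze this incident using historical context.

Historical incidents from our systems:
{context}

Current incident:
{description}

Provide:
1. Most likely root cause
2. Recommended remediation steps
3. Estimated time to resolve
4. Customer communication draft
"""
    # Step 4: Call Groq off the event loop (this assumes groq_client is the
    # synchronous Groq client) and parse the reply
    response = await asyncio.to_thread(
        groq_client.chat.completions.create,
        model="llama3-8b-8192",
        messages=[{"role": "user", "content": prompt}],
    )
    return parse_analysis(response.choices[0].message.content)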

Without Hindsight, the LLM reasons from general training data, which is useful but generic. With recalled incidents in the context window, it reasons from your team's actual resolution history, so the output shifts from a list of plausible causes to a diagnosis grounded in what has broken before.
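One detail the injection step has to handle is what happens when recall returns nothing, or returns more text than the prompt should carry. The sketch below is one possible way to build the context string. The helper name build_context, the fallback sentence, and the 4000-character budget are all assumptions for illustration, not values from the project.

```python
from typing import List

# Hypothetical fallback text so the prompt never contains an empty history section.
NO_HISTORY_NOTE = "No similar past incidents were found in memory."


def build_context(historical: List[str], max_chars: int = 4000) -> str:
    """Format recalled incidents, stopping once the character budget is reached."""
    if not historical:
        return NO_HISTORY_NOTE
    blocks: List[str] = []
    used = 0
    for i, text in enumerate(historical, start=1):
        block = f"Past Incident {i}:\n{text.strip()}"
        if used + len(block) > max_chars:
            break  # assumes results arrive most-relevant first
        blocks.append(block)
        used += len(block)
    return "\n\n".join(blocks) if blocks else NO_HISTORY_NOTE


print(build_context(["Pool exhaustion on checkout-db", "OOMKilled worker pods"]))
print(build_context([]))
```

Telling the model explicitly that no history was found is a design choice: it makes a generic answer easier to recognize as generic, rather than letting an empty section imply that context was supplied.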

A Real Example: Before and After

Here is the actual example incident from the project:

"Stripe webhook processing is timing out during invoice creation. Payments are delayed and subscriptions are not being activated."

Without memory (generic LLM response):

Root cause: "Possible network issues, third-party API downtime, or misconfigured webhook endpoint"
Fix: "Check Stripe dashboard, review webhook logs, verify endpoint availability"

With Hindsight memory (after incident history has accumulated):

Root cause: "Database latency caused webhook processing to exceed timeout limits. Same signature seen in previous incidents — high read latency during subscription renewal batches."
Fix: "Optimize database queries on the invoices table. Increase webhook timeout in Stripe dashboard. Move heavy invoice processing to async workers."
Customer update: "We are aware of delays affecting payment processing and subscription activation. Our team has identified the root cause and is applying a fix. Service will be fully restored shortly."

The second response references your team's actual past patterns and recommends fixes your team has already validated. That only happens because memory is genuinely wired into the reasoning loop — not bolted on as an afterthought.
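The post calls parse_analysis without showing it. As a rough sketch of how a reply could be turned into the structured fields seen above, the code below splits on the four numbered sections the prompt asks for. The field names on IncidentAnalysis are guesses, and the approach assumes the model follows the numbering. A reply that nests its own numbered sub-steps inside a section would confuse it, so a production parser would likely ask the model for JSON instead.

```python
import re
from dataclasses import dataclass


@dataclass
class IncidentAnalysis:
    # Hypothetical field names; the project's real model is not shown in the post.
    root_cause: str = ""
    remediation: str = ""
    estimated_time: str = ""
    customer_update: str = ""


_FIELDS = {1: "root_cause", 2: "remediation", 3: "estimated_time", 4: "customer_update"}


def parse_analysis(text: str) -> IncidentAnalysis:
    """Split a reply on line-leading '1.' to '4.' markers; missing sections stay empty."""
    analysis = IncidentAnalysis()
    # With one capture group, re.split yields [preamble, num, body, num, body, ...]
    parts = re.split(r"^[ \t]*([1-4])[.)]\s*", text, flags=re.MULTILINE)
    for number, body in zip(parts[1::2], parts[2::2]):
        setattr(analysis, _FIELDS[int(number)], body.strip())
    return analysis


reply = (
    "Here is my analysis.\n"
    "1. Database latency during renewal batches\n"
    "2. Optimize queries on the invoices table\n"
    "3. About 30 minutes\n"
    "4. We are aware of delays affecting payment processing.\n"
)
parsed = parse_analysis(reply)
print(parsed.root_cause)
```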

What I Learned Building This

Retain at resolution, not at creation. When an incident opens, you don't know the root cause. Retain after close, when you have symptoms, root cause, and resolution all in one place. Storing incomplete data early pollutes the memory bank with noise.
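A small guard in front of store_incident is one way to enforce this rule. The sketch below assumes each incident dict carries a status field with the value "resolved" once closed. That field is hypothetical, while the required field names are the ones store_incident reads.

```python
from typing import Dict, List

# Fields that store_incident interpolates into the retained text.
REQUIRED_FIELDS = ("title", "symptoms", "root_cause", "resolution", "duration_minutes")


def missing_fields(incident: Dict[str, object]) -> List[str]:
    """Return the required fields that are absent or blank."""
    return [
        name for name in REQUIRED_FIELDS
        if incident.get(name) is None or str(incident[name]).strip() == ""
    ]


def ready_to_retain(incident: Dict[str, object]) -> bool:
    """Only resolved incidents with a complete write-up should be retained."""
    return incident.get("status") == "resolved" and not missing_fields(incident)


open_incident = {"status": "open", "title": "Webhook timeouts", "symptoms": "Invoices delayed"}
closed_incident = {
    "status": "resolved",
    "title": "Webhook timeouts",
    "symptoms": "Invoices delayed",
    "root_cause": "High read latency on the invoices table",
    "resolution": "Moved invoice processing to async workers",
    "duration_minutes": 42,
}
print(ready_to_retain(open_incident), ready_to_retain(closed_incident))
```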

Metadata filtering is not optional. Without scoping recalls by service and severity, a P1 database outage and a P3 CSS bug surface each other. The signal-to-noise ratio matters enormously at 2 AM. The service and severity metadata attached in retain is what makes that scoping possible.
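The post does not show how the filter is applied, and whether Hindsight can filter on the server side is not covered here. As a fallback that needs no library support, recalled results can be filtered on the client. The sketch below assumes each recalled record is available as a dict with content and metadata keys, which is a hypothetical shape chosen for illustration.

```python
from typing import Dict, List, Optional, Set


def filter_by_metadata(
    records: List[Dict],
    service: str,
    severities: Optional[Set[str]] = None,
) -> List[Dict]:
    """Keep records for the same service and, optionally, an allowed severity set."""
    kept = []
    for record in records:
        meta = record.get("metadata", {})
        if meta.get("service") != service:
            continue
        if severities is not None and meta.get("severity") not in severities:
            continue
        kept.append(record)
    return kept


recalled = [
    {"content": "DB outage", "metadata": {"service": "payments", "severity": "P1"}},
    {"content": "CSS glitch", "metadata": {"service": "web", "severity": "P3"}},
    {"content": "Slow refunds", "metadata": {"service": "payments", "severity": "P3"}},
]
urgent = filter_by_metadata(recalled, service="payments", severities={"P1", "P2"})
print([r["content"] for r in urgent])
```

Filtering after recall means some of the top_k slots may be spent on records that get discarded, so it can help to request more results than are ultimately needed.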

Seed data is a forcing function. I built seed_data.py to pre-populate Hindsight with representative incident patterns — connection pool exhaustion, pod OOMKills, payment processor timeouts — before real history accumulated. Writing those seeds forced me to define the memory schema early. Worth doing even if you overwrite everything later.
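For readers who want to try the same exercise, a seed script can be as small as a list of dicts in the shape store_incident expects plus a loop. The sketch below is not the project's seed_data.py. Every value in it is an invented placeholder, and the store function is injected so the script can be dry-run without touching a real memory bank.

```python
from typing import Callable, Dict, List

# Invented placeholder incidents covering the three patterns named above.
SEED_INCIDENTS: List[Dict[str, object]] = [
    {
        "title": "Connection pool exhaustion",
        "symptoms": "Requests queue up and time out waiting for a DB connection",
        "root_cause": "Pool size too small for peak traffic",
        "resolution": "Raised pool size and added connection timeouts",
        "duration_minutes": 30,
        "severity": "P1",
        "service": "database",
    },
    {
        "title": "Pod OOMKilled",
        "symptoms": "Worker pods restart repeatedly under load",
        "root_cause": "Memory limit below actual working set",
        "resolution": "Increased memory limit and fixed a leak",
        "duration_minutes": 45,
        "severity": "P2",
        "service": "workers",
    },
    {
        "title": "Payment processor timeout",
        "symptoms": "Webhook handlers exceed their timeout",
        "root_cause": "Slow queries during renewal batches",
        "resolution": "Moved heavy processing to async workers",
        "duration_minutes": 60,
        "severity": "P1",
        "service": "payments",
    },
]


def seed_memory(store: Callable[[Dict[str, object]], str]) -> List[str]:
    """Pass each seed incident to the given store function and collect the ids."""
    return [store(incident) for incident in SEED_INCIDENTS]


# Dry run with a stand-in; in the backend, store_incident would be passed instead.
stored_titles: List[str] = []


def fake_store(incident: Dict[str, object]) -> str:
    stored_titles.append(str(incident["title"]))
    return f"seed-{len(stored_titles)}"


ids = seed_memory(fake_store)
print(ids)
```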

Context beats parameters. In my experience on this project, a smaller model with relevant organizational context in its prompt outperformed larger models reasoning from general training data alone. Memory quality mattered more than model size.

Where This Goes

The live frontend is at https://on-call-copilot.vercel.app and the backend at https://on-call-copilot.onrender.com. The next step is proactive recall — as anomaly signals come in from observability tooling, the agent checks Hindsight for matching historical patterns before the alert even pages someone.

Every incident your team resolves is a piece of institutional knowledge. The question is whether it lives in someone's head, in a runbook nobody reads, or in a system that surfaces it at 2 AM when you actually need it. Building the backend for On-Call Copilot made one thing clear: the memory layer isn't a feature. It's the foundation everything else depends on.