How I Built an On-Call Agent That Never Forgets a Past Incident
It's 2:14am. Your phone screams. The payment API is down. You stumble to your laptop, eyes half-open, and start digging through logs. Somewhere in the back of your mind, you know you've seen this before — a similar Redis error, maybe three months ago. You remember it took two hours to fix. But you can't remember what the fix actually was.
That's the moment I decided to build something different.
The Problem With On-Call Memory
Every engineering team has institutional knowledge that lives in Slack threads, runbooks, and the heads of whoever happened to be on-call that night. When an incident recurs, you're essentially starting from scratch — Googling the same symptoms, retracing the same steps, making the same mistakes.
AI-powered assistants help, but only to a point. A generic LLM will tell you to "check your logs" and "verify your database connections." Technically correct. Completely useless at 2am when you need the exact Redis command that fixed this six weeks ago.
What I wanted was an agent with agent memory — one that actually remembers your specific infrastructure, your specific past failures, and your specific fixes.
What I Built
The Incident Response Agent takes a description of a live incident and returns a specific, memory-backed diagnosis with resolution steps. The core loop is simple:
- A new incident comes in
- The agent searches its memory for similar past incidents
- It feeds those past incidents as context to an LLM
- The LLM returns a targeted response — root cause, severity, and exact commands
The memory layer is powered by Hindsight, an open-source agent memory system built by Vectorize. Every resolved incident gets stored via a retain call. Every new incident triggers a recall call that fetches the most semantically similar past events.
The stack: Python FastAPI backend, Groq API running llama3-70b as the LLM, Hindsight Cloud for persistent memory, and a plain HTML frontend. No frameworks, no complexity — just the core loop working cleanly.
Architecture
```
[Frontend HTML]
       |
       v
[FastAPI Backend]
       |
       +---> [Hindsight recall] ---> past incidents as context
       |
       +---> [Groq LLM] ---> diagnosis + resolution steps
       |
       v
[Response to user]

[Store resolved incident] ---> [Hindsight retain]
```
The backend has three files that matter:
- `hindsight.py` — wraps the retain and recall API calls
- `agent.py` — builds the prompt with memory context and calls Groq
- `main.py` — FastAPI routes exposing `/analyze` and `/retain`
The Core Technical Pattern: Retain and Recall
This is the part that makes the whole thing work. The Hindsight docs describe two primitives — retain (store something) and recall (fetch semantically similar things). Here's how I implemented both:
```python
import httpx

# HINDSIGHT_PIPELINE_URL and HEADERS are module-level config: the
# Hindsight Cloud endpoint and the auth headers for your account.

async def retain_memory(content: str, session_id: str = "incident-ops"):
    """Store a resolved incident in Hindsight memory."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{HINDSIGHT_PIPELINE_URL}/retain",
            headers=HEADERS,
            json={"content": content, "session_id": session_id},
            timeout=30.0,
        )
        response.raise_for_status()
        return response.json()

async def recall_memory(query: str, session_id: str = "incident-ops", limit: int = 3):
    """Fetch the most semantically similar past incidents."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{HINDSIGHT_PIPELINE_URL}/recall",
            headers=HEADERS,
            json={"query": query, "session_id": session_id, "limit": limit},
            timeout=30.0,
        )
        response.raise_for_status()
        data = response.json()
        return data.get("memories", [])
```
In agent.py, the recalled memories get formatted into the system prompt before the LLM ever sees the incident description:
```python
from groq import AsyncGroq

groq_client = AsyncGroq()  # reads GROQ_API_KEY from the environment

async def analyze_incident(description: str, session_id: str):
    # Pull the most similar past incidents from Hindsight first
    past_incidents = await recall_memory(query=description, session_id=session_id)

    if past_incidents:
        memory_text = "\n".join([
            f"Past Incident {i+1}: {m.get('content', '')}"
            for i, m in enumerate(past_incidents)
        ])
        memory_context = f"""You have memory of these similar past incidents:
{memory_text}
Use this memory to give a SPECIFIC, actionable response."""
    else:
        memory_context = "No similar past incidents found. Provide general best-practice guidance."

    system_prompt = f"""You are an expert SRE and Incident Response Agent.
{memory_context}
Identify root cause, give specific resolution steps, estimate severity (P1/P2/P3),
and flag if this matches a known pattern from memory."""

    # Ground the model with memory, then ask about the live incident
    completion = await groq_client.chat.completions.create(
        model="llama3-70b-8192",  # Groq's hosted llama3-70b
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": description},
        ],
    )
    return completion.choices[0].message.content
```
The LLM receives a grounded, context-rich prompt instead of a cold, open-ended question. That's the entire difference between a useful response and a useless one.
Before and After: The Moment It Clicked
This was the demo that convinced me the architecture was right.
Before seeding any memory, I asked the agent:
"Redis timeouts causing 503 errors on the payment API"
Response:
"Check your Redis connection settings. Verify that your Redis instance is running and reachable. Review your application logs for more details. Consider increasing timeout values."
Technically not wrong. Completely useless.
After storing five past incidents — including one where we'd hit Redis connection pool exhaustion — I asked the exact same question:
"This matches a past incident where Redis connection pool was exhausted. Root cause: maxconn set too low under peak load. Resolution: increase Redis maxconn to 500, restart gateway pods. Command:
redis-cli CONFIG SET maxclients 500. Estimated severity: P1. This fix resolved the issue in approximately 8 minutes during the last occurrence."
Same question. Completely different answer. The memory layer is what changed.
This before/after moment is the core value of the project. It's not about the LLM being smarter — it's about giving the LLM the right context.
Lessons Learned
1. Memory architecture matters more than model choice.
I initially spent time optimizing which LLM to use. That was the wrong variable. Switching from a generic prompt to a memory-backed prompt improved response quality far more than any model upgrade would have. The retrieval layer is the product.
2. Semantic search handles noisy incident descriptions well.
Engineers don't describe incidents consistently. One person writes "Redis OOM", another writes "cache layer failing", another writes "connection refused on port 6379." Hindsight's vector-based recall handles these variations gracefully — it matches on meaning, not keywords. I didn't need to normalize incident descriptions for retrieval to work.
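The intuition behind "matches on meaning, not keywords" can be shown with a toy example. The hand-assigned vectors below are purely illustrative (real systems like Hindsight use learned embedding models, not hand-built weights), but cosine similarity works the same way:

```python
import math

# Toy "embeddings": hand-assigned weights over four concept axes
# (caching, memory pressure, networking, frontend). Illustrative only.
EMBED = {
    "Redis OOM":                       [0.9, 0.9, 0.1, 0.0],
    "cache layer failing":             [0.9, 0.4, 0.3, 0.0],
    "connection refused on port 6379": [0.8, 0.1, 0.9, 0.0],
    "payments dashboard slow":         [0.0, 0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

query = "Redis OOM"
ranked = sorted(
    (k for k in EMBED if k != query),
    key=lambda k: cosine(EMBED[query], EMBED[k]),
    reverse=True,
)
# Both Redis-related phrasings outrank the unrelated incident,
# even though neither shares a keyword with "Redis OOM".
print(ranked)
```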
3. FastAPI's async model is a natural fit for this pattern.
Both the Hindsight recall and the Groq completion are I/O-bound network calls. Using async/await throughout meant the agent could handle multiple concurrent incident analyses without blocking. The code stayed clean because both external APIs used the same httpx.AsyncClient pattern.
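The concurrency win is easy to demonstrate with stubs. Here `analyze_stub` stands in for the real recall-plus-Groq pipeline (the sleep simulates network latency); `asyncio.gather` runs all three analyses at once instead of back to back:

```python
import asyncio

async def analyze_stub(description: str) -> str:
    # Stand-in for the real analyze_incident: both the Hindsight recall
    # and the Groq completion are awaitable network I/O.
    await asyncio.sleep(0.1)
    return f"diagnosis for: {description}"

async def main():
    incidents = ["Redis timeouts", "Kafka consumer lag", "disk full on db-1"]
    # gather schedules all three analyses concurrently; total wall time
    # is roughly one call's latency, not three.
    return await asyncio.gather(*(analyze_stub(d) for d in incidents))

results = asyncio.run(main())
print(results)
```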
4. The demo sells itself.
I showed this to three engineers before the hackathon. Every single one said some version of "wait, can I actually use this?" The 2am pain is universal. You don't have to explain why memory-backed incident response is valuable — you just have to show the before/after.
The Honest Limitation
The agent is only as good as your incident documentation.
If past incidents are stored as vague summaries — "fixed Redis issue" — the recall context is nearly useless. The LLM gets garbage in and produces generic output. The quality of the memory directly determines the quality of the response.
This means the real work isn't building the agent. It's building the habit of documenting incidents thoroughly when they're resolved: what happened, what the root cause was, what commands were run, what the outcome was. The agent amplifies that documentation. It doesn't replace it.
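One way to build that habit is to capture resolved incidents in a structured template before flattening them into the free-form text that gets retained. This template is my own convention, not a format Hindsight requires:

```python
# Illustrative incident record: the fields mirror what makes recall
# useful later (root cause, exact commands, outcome, severity).
incident = {
    "summary": "Payment API 503s caused by Redis connection pool exhaustion",
    "root_cause": "maxclients set too low for peak-hour connection volume",
    "commands_run": "redis-cli CONFIG SET maxclients 500; restart gateway pods",
    "outcome": "Errors cleared within ~8 minutes of the config change",
    "severity": "P1",
}

def to_memory_text(record: dict) -> str:
    """Flatten the structured record into the text stored via retain."""
    return "\n".join(
        f"{key.replace('_', ' ').title()}: {value}" for key, value in record.items()
    )

print(to_memory_text(incident))
```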
For teams that already maintain good runbooks and post-mortems, this architecture slots in almost immediately. For teams that don't, the agent is the forcing function to start.
What's Next
The current version is single-session. The natural next step is multi-team memory — shared incident pools across services, with routing based on which team owns which system. The Hindsight API already supports session isolation, so this extension requires no significant architectural changes.
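Since Hindsight scopes memories by `session_id`, team isolation falls out of how you choose that key. A hypothetical routing table (service names and team sessions invented for illustration) could be as simple as:

```python
# Hypothetical mapping from service to the team-owned memory session.
# Each team's retain/recall calls then use their own session_id.
TEAM_SESSIONS = {
    "payment-api": "payments-team",
    "checkout": "payments-team",
    "search": "discovery-team",
}

def session_for(service: str, default: str = "incident-ops") -> str:
    """Pick the memory session for an incoming incident's service."""
    return TEAM_SESSIONS.get(service, default)

print(session_for("payment-api"))
```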
If you want to build your own version, the full code is on GitHub and the Hindsight memory system is open source. The setup takes less than an afternoon. The interesting part — curating your team's incident memory — is the ongoing work that makes it genuinely useful over time.
The 2am call is coming. The question is whether your agent will tell you to "check the logs" or tell you exactly what fixed this last time.