Shruti Gupta

Posted on Jun 14

Your On-Call Agent Forgot Everything. Ours Doesn't.

#python #programming #webdev #ai

The first time I used something that actually remembered a past production failure, I didn't fully trust it. I submitted the same incident twice just to make sure the result wasn't a coincidence.

It wasn't.

I was building On-Call Copilot — an incident response agent that doesn't just generate advice, it recalls what actually happened the last time something similar broke. The live app is at on-call-copilot.vercel.app. The memory layer is Hindsight. And the thing that surprised me most wasn't how hard it was to integrate — it was how immediately obvious the difference was once it was working.

What the system actually does

On-Call Copilot is an AI Incident Commander with organizational memory. The tagline on the app is "Learn from every outage. Resolve the next one faster." That's not marketing — it's literally the architecture.

When a production alert comes in — a Sentry traceback, a Datadog trigger, raw CLI logs — you paste it into the Incident Ingestion Console. The system runs it through a five-stage pipeline:

Production Alert → FastAPI Router → Hindsight Memory → Groq Reasoning → SRE Playbook

Stage three is the one that matters. Before Groq generates anything, Hindsight's semantic graph runs a recall against the full organizational incident history. It doesn't do keyword search. It finds semantically related past incidents — things that failed for the same underlying reason, even if the error messages look different on the surface.

What comes back isn't just "here's a similar incident." It's structured: historical root cause, successful fix, and critically — failed attempts to avoid. Things someone already tried that made it worse. That last part is what makes this different from any generic LLM response.
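To make that structure concrete, here is a minimal sketch of how a recalled incident could be represented and flattened into prompt context. The `RecalledIncident` class, its field names, and `render_memory_context` are hypothetical names chosen for illustration; they are not Hindsight's actual response schema.

```python
from dataclasses import dataclass, field


@dataclass
class RecalledIncident:
    """Hypothetical shape for one recalled incident (field names are assumptions)."""
    incident_id: str
    root_cause: str
    successful_fix: str
    failed_attempts: list[str] = field(default_factory=list)


def render_memory_context(incidents: list[RecalledIncident]) -> str:
    """Flatten recalled incidents into a plain-text block for an LLM prompt."""
    if not incidents:
        return "No similar past incidents found."
    lines = []
    for inc in incidents:
        lines.append(f"[{inc.incident_id}] Root cause: {inc.root_cause}")
        lines.append(f"  Worked: {inc.successful_fix}")
        # Failed attempts get their own prefix so the model can flag them clearly.
        for attempt in inc.failed_attempts:
            lines.append(f"  AVOID: {attempt}")
    return "\n".join(lines)
```

Keeping failed attempts as a separate list, rather than folding them into the fix text, is what lets the playbook call them out as explicit warnings.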

The FastAPI layer: how the triage request flows

Every triage request starts at a single POST endpoint, /analyze. The frontend sends the raw alert text; FastAPI handles the orchestration: first pulling memory context from Hindsight, then passing that context alongside the alert into the Groq reasoning chain. A second endpoint, /teach, stores resolved incidents, and the root route is a simple health check.

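# backend/api.py
from fastapi import FastAPI
from pydantic import BaseModel

# analyze_incident and store_incident live elsewhere in the backend (not shown here).

app = FastAPI()


class IncidentRequest(BaseModel):
    # Raw alert text; the field name is inferred from data.incident below.
    incident: str


@app.post("/analyze")
def analyze(data: IncidentRequest):
    # Triage: memory recall and LLM reasoning both happen inside analyze_incident.
    return {
        "analysis": analyze_incident(data.incident)
    }

@app.post("/teach")
def teach(data: IncidentRequest):
    # Store a resolved incident so later recalls can surface it.
    store_incident(data.incident)

    return {
        "status": "saved"
    }

@app.get("/")
def home():
    # Health check.
    return {
        "message": "On-Call Copilot API Running 🚀"
    }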

This ordering is the key design decision. The recall happens before the LLM sees anything. By the time Groq is reasoning about root cause, it already has the organizational context baked in — not as a separate lookup, but as part of the prompt.
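The real analyze_incident isn't shown in this post, so here is a minimal sketch of the recall-then-reason ordering under some assumptions. `analyze_with_memory` and `build_prompt` are hypothetical names, and the recall and LLM calls are passed in as plain callables so the sketch stays independent of the Hindsight and Groq clients.

```python
from typing import Callable


def build_prompt(alert: str, memory_context: str) -> str:
    """Put recalled history ahead of the current alert in a single prompt."""
    return (
        "You are an incident commander.\n\n"
        f"PAST INCIDENTS:\n{memory_context}\n\n"
        f"CURRENT ALERT:\n{alert}\n\n"
        "Suggest a root cause and a fix, and list what to avoid."
    )


def analyze_with_memory(
    alert: str,
    recall: Callable[[str], str],    # assumed: returns recalled context as text
    complete: Callable[[str], str],  # assumed: sends a prompt to the LLM
) -> str:
    # Step 1: memory comes first, before the model sees anything.
    memory_context = recall(alert)
    # Step 2: the model reasons over the alert and the history together.
    prompt = build_prompt(alert, memory_context)
    return complete(prompt)
```

Injecting the two callables also makes the ordering easy to check in a unit test without network access.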

What Hindsight's recall actually returns

I had never used Hindsight before this project. My mental model going in was that it would behave like search — give it keywords, get back matching documents.

What it actually does is closer to semantic reasoning over a knowledge graph. When I submitted "FATAL: database pool choked during active transaction," it recalled two past incidents:

INC-103 — Database connection pool exhaustion under high transactional traffic. 91% match. Successful fix: increment proxy pool limits to 50, implement transaction timeout safeguards. Failed attempt: scaling pool replicas dynamically (triggered DB lock storms).
INC-104 — Redis cluster memory allocation overrun. 87% match. Successful fix: configure maxmemory-policy to volatile-lru. Failed attempt: cold restarts of Redis service (nuked all active sessions).

The match percentages are real confidence scores from Hindsight's agent memory system. The "failed attempt" field is the part that earns its keep at 3 AM — it tells you what not to reach for before you waste 40 minutes on it.
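As a rough illustration of how those scores and failed attempts could be combined, the sketch below collects the "do not try" warnings from matches above a confidence cutoff. The dictionary keys (`id`, `score`, `failed_attempts`) and the 0.85 default are assumptions made for this example, not Hindsight's response format.

```python
def attempts_to_avoid(matches: list[dict], min_score: float = 0.85) -> list[str]:
    """Collect failed attempts from matches at or above min_score, best match first."""
    ranked = sorted(matches, key=lambda m: m["score"], reverse=True)
    warnings = []
    for match in ranked:
        if match["score"] < min_score:
            continue  # too weak a match to treat its history as a warning
        for attempt in match.get("failed_attempts", []):
            warnings.append(f'{match["id"]}: {attempt}')
    return warnings


# The two recalled incidents from above, in the assumed dictionary shape.
matches = [
    {"id": "INC-104", "score": 0.87,
     "failed_attempts": ["cold restarts of Redis service"]},
    {"id": "INC-103", "score": 0.91,
     "failed_attempts": ["scaling pool replicas dynamically"]},
]
print(attempts_to_avoid(matches))
```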

The two memory operations: retain and recall

The entire Hindsight integration in backend/memory.py is built on two calls. Here are both of them together:

# backend/memory.py
import os
from contextlib import contextmanager
from hindsight import HindsightClient

BANK_ID = os.getenv("BANK_ID")

@contextmanager
def _get_hindsight_client():
    client = HindsightClient(api_key=os.getenv("HINDSIGHT_API_KEY"))
    try:
        yield client
    finally:
        client.close()

def recall_similar_incidents(incident_description: str) -> list[dict]:
    with _get_hindsight_client() as client:
        results = client.recall(
            bank_id=BANK_ID,
            query=incident_description,
            top_k=5,
        )
    return results

def save_resolution(incident_description: str, resolution_summary: str):
    content = (
        f"INCIDENT: {incident_description}\n"
        f"RESOLUTION: {resolution_summary}"
    )
    with _get_hindsight_client() as client:
        client.retain(
            bank_id=BANK_ID,
            content=content,
            context="incident_postmortem",
        )

Two functions. One call each. The entire organizational memory layer is those ~30 lines. What the Hindsight retain/recall API does behind the scenes — semantic indexing, graph traversal, confidence scoring — you get all of that for free.

The pipeline in practice

The Incident Resolution Timeline in the UI makes the pipeline visible in real time:

Alert Received — raw metrics or trace ingested into buffer
Memory Retrieved — FastAPI semantic correlation against regional index maps
Root Cause Identified — LLM isolates anomalies, computes match confidence
Resolution Suggested — detailed playbook with avoidance warnings generated
Knowledge Stored — post-mortem answers indexed back into organizational memory

That fifth step is the learning loop. Every resolved incident feeds back into Hindsight via retain(). The next similar incident pulls it as recalled context. The system gets more specific over time — not because the model changed, but because the memory bank grew.
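The loop is easy to see with a toy stand-in for the memory bank. The `InMemoryBank` class below is hypothetical and scores by simple word overlap, which is not how Hindsight's semantic recall works; it only mirrors the retain-then-recall shape so the feedback effect can be demonstrated offline.

```python
class InMemoryBank:
    """Toy stand-in for a memory bank: word-overlap scoring, NOT semantic recall."""

    def __init__(self) -> None:
        self._items: list[str] = []

    def retain(self, content: str) -> None:
        """Store one resolved incident as plain text."""
        self._items.append(content)

    def recall(self, query: str, top_k: int = 5) -> list[str]:
        """Return stored items sharing at least one word with the query."""
        query_words = set(query.lower().split())
        scored = []
        for item in self._items:
            overlap = len(query_words & set(item.lower().split()))
            if overlap:
                scored.append((overlap, item))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [item for _, item in scored[:top_k]]


bank = InMemoryBank()
before = bank.recall("database pool choked")   # nothing learned yet
bank.retain("INCIDENT: database pool exhausted\nRESOLUTION: raised pool limit")
after = bank.recall("database pool choked")    # the stored resolution now surfaces
```

The model is untouched between the two recall calls; only the stored history changed.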

Before vs after — what memory actually changes

Without organizational memory:

Generic advice pulled from training data — "check network adapters," "reinstall OS"
No awareness of what's already been tried in your specific environment
Suggestions that have failed twice before in your cluster show up again
Every incident starts from zero

With Hindsight memory (150 incidents in the knowledge base):

Precise matches pulled from actual past outages, not textbook examples
Failed fixes flagged explicitly so engineers don't repeat them
One-step indexing after resolution so the next incident benefits immediately
42% estimated reduction in mean time to resolution shown live on the dashboard

That 42% figure isn't a benchmark I'm claiming — it's what the dashboard shows based on the system's historical recall performance across the loaded incident knowledge base.

What the Teach the System panel does

At the bottom of the app is a section called "Teach the System — Training Mode." Paste a resolution summary or an incident link, submit it, and Hindsight indexes it immediately. The Telemetry Console logs the whole thing in real time — you can watch the retain call go out, see the success response, and know that the next engineer who hits a similar issue will get this resolution in their recalled context.

The log from a real session:

[1:20:26 pm] SUCCESS [OUTPOST] Triage finished successfully in 44.97s. Status 200 OK.
[1:20:27 pm] SUCCESS [OUTPOST] Taught system successfully in 39.57s. Status 200 OK.

Triage and teach. Those two operations are the entire product loop.

What using Hindsight taught me

I went into this thinking about memory as a storage problem. I came out thinking about it as a retrieval design problem. What you store matters less than whether the right things surface at the right time.

Hindsight's retain/recall API is small — two core operations cover almost everything. But the quality of what you get back at recall time depends entirely on how well-structured the retained content is. A postmortem that clearly separates root cause, successful fix, and failed attempts produces recall that's immediately actionable. A vague free-text summary produces noise.
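One way to enforce that separation before anything is retained is a small formatter like the sketch below. The function name and the section labels are illustrative choices, not something Hindsight requires.

```python
def format_postmortem(
    incident: str,
    root_cause: str,
    successful_fix: str,
    failed_attempts: list[str],
) -> str:
    """Build retained content with one clearly labeled section per field."""
    required = (
        ("incident", incident),
        ("root_cause", root_cause),
        ("successful_fix", successful_fix),
    )
    for name, value in required:
        if not value.strip():
            raise ValueError(f"{name} must not be empty")
    # An explicit placeholder keeps the section present even when nothing failed.
    failed = "\n".join(f"- {a}" for a in failed_attempts) or "- none recorded"
    return (
        f"INCIDENT: {incident.strip()}\n"
        f"ROOT CAUSE: {root_cause.strip()}\n"
        f"SUCCESSFUL FIX: {successful_fix.strip()}\n"
        f"FAILED ATTEMPTS:\n{failed}"
    )
```

The resulting string could be passed as the content argument of a retain call in place of a free-text summary.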

One thing I'd do differently is seed the knowledge base earlier. The system only becomes convincingly better than a generic LLM once there's enough incident history to surface precise matches. With 150 incidents loaded, the difference is stark. With 5, it's marginal. Data quality and quantity are part of the product.

What's next

The current "Teach the System" input is free text. The obvious next step is a structured form — separate fields for root cause, fix steps, failed attempts, and the customer message that generated the least confusion. Structured inputs produce more consistent memories, which produce more reliable recall.

The architecture also has room to expand beyond a single knowledge bank. Right now there's one organizational memory shared across all incidents. A multi-tenant version with per-team or per-service memory banks would let different engineering teams maintain separate incident histories while still being able to query across them when needed.
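A rough sketch of what cross-bank querying might look like, assuming each bank can be queried independently and returns hits with a comparable score. The bank naming scheme, the `recall_fn` signature, and the `score` key are all assumptions for illustration, not part of the current app or of Hindsight's API.

```python
from typing import Callable


def bank_id_for(team: str) -> str:
    """Derive a per-team bank id (hypothetical naming scheme)."""
    return f"incidents-{team.strip().lower().replace(' ', '-')}"


def recall_across_banks(
    query: str,
    bank_ids: list[str],
    recall_fn: Callable[[str, str], list[dict]],  # assumed: (bank_id, query) -> hits
    top_k: int = 5,
) -> list[dict]:
    """Query each bank, tag every hit with its source bank, and keep the best top_k."""
    merged = []
    for bank_id in bank_ids:
        for hit in recall_fn(bank_id, query):
            merged.append({**hit, "bank_id": bank_id})
    merged.sort(key=lambda hit: hit["score"], reverse=True)
    return merged[:top_k]
```

Tagging each hit with its bank id lets the playbook say which team's history a suggestion came from. Merging by raw score only makes sense if scores are comparable across banks, which would need checking.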

The memory layer works. What I keep thinking about is how much better it gets with every incident that runs through it — and how most engineering teams are sitting on years of incident history that a system like this could immediately put to use.

DEV Community