Hindsight Gave My Agent a Memory

#python #ai #agents #webdev

The fourth time our agent suggested "restart the service" for a Postgres connection error it had already seen three times before, I realized the problem wasn't the agent's reasoning — it was the agent's memory. It had none.

That was the moment I started building something different.

What the System Does

The Incident Response Agent is an AI-powered oncall assistant that triages alerts, diagnoses errors, and suggests fixes — not from generic LLM knowledge, but from the actual history of incidents your infrastructure has already survived.

Every time an incident is resolved, the agent stores what happened: the error signature, the root cause, the fix that worked, and how long it took. The next time a similar alert fires, it recalls that history and leads with it. No more starting from zero at 2am.
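As a rough sketch of what one stored record might look like, the fields below mirror the four things listed above. The class name `IncidentRecord`, the `to_memory_text` helper, and the sample values are illustrative assumptions, not part of Hindsight or the project code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentRecord:
    """Hypothetical shape for one resolved incident (not a Hindsight type)."""
    id: str
    service: str
    error: str                 # error signature as it appeared in the alert
    cause: str                 # root cause identified at resolution time
    fix: str                   # the fix that actually worked
    resolved_in_minutes: int   # time from alert to resolution

    def to_memory_text(self) -> str:
        """Render the record as plain text suitable for storing as a memory."""
        return "\n".join([
            f"Error: {self.error}",
            f"Service: {self.service}",
            f"Root Cause: {self.cause}",
            f"Fix Applied: {self.fix}",
            f"Resolution Time: {self.resolved_in_minutes} minutes",
        ])


# Sample values are made up for illustration.
record = IncidentRecord(
    id="INC-007",
    service="postgres",
    error="Connection refused on port 5432",
    cause="pgbouncer ran out of connection slots",
    fix="Restarted pgbouncer and increased pool_size from 20 to 60",
    resolved_in_minutes=8,
)
print(record.to_memory_text())
```

Keeping the record typed makes it harder to store an incident with the root cause or fix missing, which matters because those two fields are what the agent later reasons from.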

The stack is straightforward:

Hindsight handles persistent agent memory — retain, recall, and reflect across sessions
cascadeflow handles runtime intelligence — routing urgent alerts to powerful models and routine checks to cheaper, faster ones
Groq provides the underlying LLM inference (fast, free-tier friendly)
A Python backend wires it all together with a simple CLI interface for demo purposes and a webhook endpoint for real alert ingestion

The architecture has three jobs: ingest an alert, search memory for anything similar, and generate a response that is grounded in what actually worked before.
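Those three jobs can be sketched as a single function. This is a minimal illustration, not the project's actual entry point: `handle_alert`, `fake_recall`, and `fake_respond` are hypothetical names, and the memory and model steps are passed in as plain callables so the sketch runs without any external service.

```python
from typing import Callable


def handle_alert(
    alert: dict,
    recall: Callable[[str], str],
    respond: Callable[[dict, str], str],
) -> str:
    """Run the three jobs in order: ingest, search memory, respond."""
    # 1. Ingest: reject alerts that lack the fields the later steps need.
    for key in ("error", "severity"):
        if key not in alert:
            raise ValueError(f"alert is missing required field: {key}")

    # 2. Search memory for anything similar to this error.
    memory_context = recall(alert["error"])

    # 3. Generate a response grounded in what was recalled.
    return respond(alert, memory_context)


# Stand-ins so the sketch runs on its own.
def fake_recall(error_message: str) -> str:
    return "No similar incidents found in memory."


def fake_respond(alert: dict, memory_context: str) -> str:
    return f"[{alert['severity']}] {alert['error']} | context: {memory_context}"


result = handle_alert(
    {"error": "Connection refused on port 5432", "severity": "P1"},
    recall=fake_recall,
    respond=fake_respond,
)
print(result)
```

In the real agent, the `recall_similar` and `route_incident` functions shown later in this post would take the place of the two stand-ins.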

The Core Technical Story: Memory That Actually Changes Behavior

Most agents are stateless by design. Every call to the LLM is a blank slate. This is fine for one-off tasks but catastrophic for incident response, where context is everything.

Without memory, the agent is just an expensive Stack Overflow search. With memory, it becomes something closer to that senior engineer who has seen everything and remembers all of it.

Here is how the retain/recall loop works in practice.

Storing an Incident

When an incident is resolved, we store a structured memory in Hindsight:

# memory.py
import os

from hindsight import HindsightClient

client = HindsightClient(api_key=os.getenv("HINDSIGHT_API_KEY"))

def store_incident(incident: dict) -> None:
    """Store one resolved incident as a memory in Hindsight."""
    content = f"""
    Error: {incident['error']}
    Service: {incident['service']}
    Root Cause: {incident['cause']}
    Fix Applied: {incident['fix']}
    Resolution Time: {incident['resolved_in_minutes']} minutes
    """
    # Metadata keeps the incident ID and service attached to the stored memory.
    client.retain(
        pipeline_id=os.getenv("HINDSIGHT_PIPELINE_ID"),
        content=content,
        metadata={"incident_id": incident["id"], "service": incident["service"]}
    )

This is not just logging. Hindsight indexes this as a semantic memory. When a future alert comes in, it does not do a keyword search — it does a meaning search. "Database refusing connections on port 5432" and "Postgres not accepting new clients" surface the same past incidents even though the wording is completely different.
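To see why keyword matching falls short here, consider a naive word-overlap score applied to those two alerts. This sketch is only a baseline for comparison: `keyword_overlap` is a hypothetical helper and says nothing about how Hindsight ranks memories internally.

```python
import re


def keyword_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word tokens (a naive keyword match)."""
    tokens_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


alert_a = "Database refusing connections on port 5432"
alert_b = "Postgres not accepting new clients"
print(keyword_overlap(alert_a, alert_b))  # 0.0: the two alerts share no words
```

The two messages describe the same failure yet have no tokens in common, so any ranking based purely on shared words would score them as unrelated.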

Recalling Relevant History

When a new alert fires, before we even touch the LLM, we ask Hindsight what it remembers:

def recall_similar(error_message: str, top_k: int = 3) -> str:
    """Return the top_k most similar past incidents as one prompt-ready string."""
    results = client.recall(
        pipeline_id=os.getenv("HINDSIGHT_PIPELINE_ID"),
        query=error_message,
        top_k=top_k
    )
    if not results:
        return "No similar incidents found in memory."

    # Separate memories with a rule so the model can tell them apart.
    memories = [r["content"] for r in results]
    return "\n\n---\n\n".join(memories)

Those recalled memories become part of the prompt context. The agent does not hallucinate a fix — it reads what actually worked last time and reasons from there.
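One way the recalled text might be folded into the prompt is sketched below. The routing code in the next section calls a `build_prompt` helper; this version is an assumption about its shape, not the project's actual implementation, and the wording of the instructions is purely illustrative.

```python
def build_prompt(alert: dict, memory_context: str) -> str:
    """Combine the incoming alert with recalled incidents into one user prompt."""
    return "\n".join([
        "A new alert has fired.",
        f"Severity: {alert.get('severity', 'unknown')}",
        f"Service: {alert.get('service', 'unknown')}",
        f"Error: {alert['error']}",
        "",
        "Past incidents recalled from memory:",
        memory_context,
        "",
        "Base your recommendation on the past incidents above where they apply, "
        "and say so explicitly when none of them match.",
    ])


prompt = build_prompt(
    {"severity": "P1", "error": "Connection refused on port 5432"},
    "No similar incidents found in memory.",
)
print(prompt)
```

Asking the model to state when nothing matches keeps it from presenting a generic answer as if it came from history.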

Runtime Intelligence: Not Every Alert Deserves the Biggest Model

This is the part that surprised me most in production.

We were routing every single alert — a disk usage warning, a minor latency spike, a full database outage — through the same model at the same cost. The math does not work at scale. A disk warning that fires 40 times a day costs the same per call as a P0 database incident.

cascadeflow fixes this with a routing layer that lives inside your agent loop:

# router.py
from cascadeflow import CascadeFlow

cf = CascadeFlow()

def get_model_for_severity(severity: str) -> str:
    routing_map = {
        "P0": "groq/llama3-70b-8192",      # most capable, highest cost
        "P1": "groq/llama3-70b-8192",
        "P2": "groq/llama3-8b-8192",        # faster, cheaper
        "P3": "groq/llama3-8b-8192",
        "INFO": "groq/gemma2-9b-it"         # cheapest, good enough
    }
    return routing_map.get(severity, "groq/llama3-8b-8192")

def route_incident(alert: dict, memory_context: str) -> str:
    # SYSTEM_PROMPT and build_prompt() are assumed to be defined elsewhere in this module.
    model = get_model_for_severity(alert["severity"])

    response = cf.complete(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": build_prompt(alert, memory_context)}
        ],
        budget_limit=0.05  # hard cap per call in USD
    )
    return response.content

The budget_limit parameter is the part I wish I had known about earlier. It puts a hard ceiling on what any single call can spend. When your agent wakes up at 3am and fires 20 parallel diagnosis calls, you want that ceiling.
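Assuming the per-call cap behaves as described above, the worst case for a burst of calls is simple arithmetic. The helper below is a hypothetical illustration and is not part of cascadeflow. It uses `Decimal` so the dollar amounts stay exact.

```python
from decimal import Decimal


def worst_case_spend(num_calls: int, cap_per_call_usd: str) -> Decimal:
    """Upper bound on spend if every call hits its per-call cap."""
    if num_calls < 0:
        raise ValueError("num_calls must be non-negative")
    return num_calls * Decimal(cap_per_call_usd)


# 20 parallel diagnosis calls, each capped at $0.05
print(worst_case_spend(20, "0.05"))  # 1.00
```

With the 0.05 cap from the router, 20 parallel calls can cost at most one dollar in total, which is a bound you can reason about before the pager goes off.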

Every call cascadeflow makes is logged with the model used, latency, cost, and the routing decision rationale. That audit trail matters when someone asks why the agent recommended a rollback.

What the Before/After Actually Looks Like

Without memory (first interaction):

Alert: Connection refused on port 5432
Agent: This error typically indicates the PostgreSQL database 
service is not running or is unreachable. Common causes include:
1. The PostgreSQL service has stopped
2. Firewall rules blocking port 5432
3. Incorrect connection string configuration
Recommended action: Check service status and restart if needed.

Generic. Textbook. Not wrong, but not useful if you have already seen this exact error several times.

With memory (after five similar incidents are stored):

Alert: Connection refused on port 5432
Agent: I found 3 similar incidents in memory.

Most recent match (INC-007, 4 days ago):
- Root cause: pgbouncer ran out of connection slots after a 
  deployment tripled concurrent app instances
- Fix: Restarted pgbouncer, increased pool_size from 20 to 60 
  in pgbouncer.ini
- Resolved in 8 minutes

Recommendation: Check pgbouncer connection pool utilization 
first before restarting Postgres. This has been the actual 
cause in 3 of the last 4 similar alerts on this service.

That second response is not generic LLM output. It is institutional knowledge — the kind that usually lives only in the head of whoever was oncall last month.

Lessons Learned

1. Memory quality depends on what you store at resolution time.
The agent is only as useful as the incident data it has seen. We spent more time thinking about the resolution workflow — making sure engineers captured root cause and fix clearly — than we did on the AI logic itself. Garbage in, garbage out.

2. Semantic recall beats keyword search for error messages.
Error messages are inconsistent. Engineers describe the same problem ten different ways. Hindsight's vector-based recall handles this naturally. A simple keyword match would miss half the relevant history.

3. Budget caps are not optional in production.
The first time an alert storm fires 80 agent calls in two minutes, you will be glad you set budget_limit. cascadeflow's per-call cap saved us from a surprise bill during load testing.

4. Route by severity, not by default.
Most incident alerts do not need your most expensive model. Routing INFO-level alerts to a lighter model and reserving the heavy model for P0s cut our inference cost significantly while keeping response quality where it mattered.

5. The audit trail is the product.
Especially for incident response, knowing why the agent made a recommendation is as important as the recommendation itself. cascadeflow's decision log gives you that without any extra instrumentation.
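The first lesson can be partly enforced in code. As a minimal sketch, a check like the one below could run before an incident is stored, so that records without a root cause or fix never reach memory. `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, the field list mirrors the keys used by `store_incident` earlier, and the sample incident is made up.

```python
REQUIRED_FIELDS = ("id", "service", "error", "cause", "fix", "resolved_in_minutes")


def missing_fields(incident: dict) -> list[str]:
    """Return the required fields that are absent or blank in an incident."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = incident.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


# Illustrative draft: the root cause is blank and the resolution time is absent.
draft = {
    "id": "INC-008",
    "service": "postgres",
    "error": "Connection refused on port 5432",
    "cause": "   ",
    "fix": "Restarted pgbouncer",
}
print(missing_fields(draft))  # ['cause', 'resolved_in_minutes']
```

A non-empty result is a signal to send the incident back to the engineer for details instead of storing a memory the agent cannot learn from.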

Where to Go From Here

This project uses Hindsight for persistent agent memory. The retain/recall/reflect primitives are documented at hindsight.vectorize.io, and the Vectorize site has a good conceptual overview of what agent memory means in practice.

For the runtime intelligence layer, cascadeflow is open source and installs in one line. The cascadeflow docs cover model routing, budget enforcement, and the full audit log format.

The core insight is simple: agents that remember are categorically more useful than agents that do not. The first time your oncall agent says "this happened last Tuesday and here is what fixed it," you will stop thinking of it as a demo and start thinking of it as infrastructure.

You can find the full project code in the incident-response-agent repository: https://github.com/Mahesh1215-babu/AI-Incident-Response-Agent