
Amitrajeet

How I Built an Incident Response Agent That Actually Gets Smarter Every Time Your System Breaks

Most developers have a ritual. Something breaks in production. You panic. You Google the error. You dig through six months of Slack messages trying to remember if this exact thing happened before. You find a half-baked fix from a thread three years ago. You apply it. It works. You move on.
Two weeks later, the same error hits. And you do the whole thing again from scratch.
I've been there more times than I'd like to admit. The worst part isn't the debugging itself — it's the feeling that you've solved this before, somewhere, sometime, but you can't quite remember where. Your brain has a vague memory of the fix. Your Slack history doesn't. Your runbooks are outdated. Your team has moved on. So you start from zero. Again.
That cycle is what I decided to break. I built an AI incident response agent that uses Hindsight agent memory to remember every single error your system has ever thrown, learn from every resolution, and get dramatically smarter with every single interaction. Not eventually. Visibly. By interaction five, it's a completely different tool than it was at interaction one.


The Incident Response Agent — error input on the left, live Memory Brain on the right

The Problem With Every Existing Tool
Before I built anything, I looked at what already existed. Datadog. New Relic. PagerDuty. Sentry. These are all excellent tools for what they do — log aggregation, alerting, performance monitoring. But they all share the same fundamental limitation.
They remember that something happened. They don't remember what you did about it.
You can scroll through a year of error logs in Datadog and find the exact timestamp of every database timeout your system ever threw. What you won't find is which fix worked, how long it took, whether it was a permanent solution or a temporary patch, or whether the same root cause is responsible for three different error types across your stack.
That's the gap. Logs are memory of events. What engineers actually need is memory of solutions — institutional knowledge that accumulates over time and gets smarter the longer it runs. That's what I built.

What the Agent Actually Does
The surface level is simple. You paste an error message. The agent gives you a fix.
But what happens underneath that is what makes this different from every other debugging tool. Before the agent touches an LLM, it calls recall() on Hindsight — searching its entire memory of past incidents for anything similar to what you just pasted. If it finds matches, it injects that context directly into the prompt. The fix you get isn't generic advice from a language model trained on Stack Overflow. It's advice grounded in your system's specific failure history.
After every interaction, the agent calls retain() — storing the error pattern, the suggested fix, the detected root cause category, and the exact timestamp. That information becomes part of the memory pool that informs every future interaction.
The stack I used: Python, Groq with qwen/qwen3-32b for LLM inference, Hindsight agent memory for persistence and recall, and Streamlit for the UI. The whole thing runs fast — Groq delivered consistent sub-two-second response times even with full memory context injected, which matters a lot when production is down.

The Moment It Clicked
I was testing the agent for the third time, running the same database connection timeout error I'd used in an earlier test. I expected a generic response — maybe slightly better than the first time, but nothing dramatic.
Instead, recall() pulled up the incident from two interactions ago. Not just that it had seen the error — it remembered the specific fix that worked, the category it had assigned to the root cause, and the timestamp. The response came back targeted and specific, and at the bottom it showed something I hadn't seen in the first two interactions: "This fix worked in 2 out of 2 similar past incidents — 100% confidence."
That was the moment I stopped thinking of this as a side project and started thinking of it as something genuinely useful.


The agent recalls past incidents

Five Features That Don't Exist Anywhere Else
I want to be specific about what makes this different, because "AI with memory" is a vague claim. Here's exactly what the memory enables that nothing else currently does.
Pattern memory across errors. The agent doesn't just log individual incidents in isolation — it actively looks for recurrence. Three database timeouts in two days triggers a flag: "This error has appeared 3 times this week. This is an architectural problem, not a one-time bug. A patch won't hold." No existing APM tool makes that distinction automatically. They surface frequency. They don't interpret what frequency means.
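The recurrence flag described above can be sketched with straightforward counting over the retained metadata. This is a minimal illustration, not the project's actual implementation; the incident dict shape and the `window_days`/`threshold` values are assumptions mirroring the metadata fields mentioned later in the post.

```python
from datetime import datetime, timedelta

def flag_recurrence(incidents, category, window_days=7, threshold=3):
    """Flag a category as architectural when it recurs within a window.

    `incidents` is a list of dicts with "category" and ISO "timestamp"
    keys, mirroring the metadata the agent retains (shape is illustrative).
    """
    cutoff = datetime.now() - timedelta(days=window_days)
    recent = [
        i for i in incidents
        if i["category"] == category
        and datetime.fromisoformat(i["timestamp"]) >= cutoff
    ]
    if len(recent) >= threshold:
        return (f"This error has appeared {len(recent)} times this week. "
                "This is an architectural problem, not a one-time bug.")
    return None
```

The point is that the signal comes from accumulated memory, not from any single incident: the function only fires once enough retained incidents share a category inside the window.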
Resolution confidence scoring. Every fix suggestion comes with a confidence percentage calculated from past resolution outcomes. First interaction: no memory, no score, general advice. After five interactions involving similar errors: "Fix A worked in 3 out of 4 similar past incidents — 75% confidence." The score is meaningful because it's based on your system's actual history, not generic training data. It gets more accurate the longer the agent runs.
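The scoring logic itself is simple enough to show in a few lines. Here's a hedged sketch under the assumption that each past outcome is recorded as a boolean (fix worked or didn't); the function name and return shape are illustrative, not the repo's API.

```python
def resolution_confidence(outcomes):
    """Score a fix by its past resolution outcomes (True = fix worked).

    Returns (confidence_pct, summary) or None when there is no history,
    matching the "no memory, no score" first-interaction behavior.
    """
    if not outcomes:
        return None
    worked = sum(outcomes)
    pct = round(100 * worked / len(outcomes))
    summary = (f"Fix worked in {worked} out of {len(outcomes)} "
               f"similar past incidents — {pct}% confidence")
    return pct, summary
```

Because the denominator is your system's own incident history, the score self-corrects: one failed application of a fix immediately drags the percentage down on the next recall.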
Failure DNA fingerprinting. Over time the agent builds a categorized map of your system's failure patterns — DB issues, network failures, memory leaks, auth problems, dependency errors. It surfaces this as a live "Your system's weak points" panel in the UI that updates in real time as memory grows. A log viewer shows you what happened. This panel shows you what your system tends to do, learned from accumulated evidence.
Time-aware memory. The agent stores timestamps with every retained incident and uses them to detect temporal patterns. "Your database times out consistently between 2AM and 3AM — this strongly suggests a scheduled job conflict." Engineers have always had the raw data to notice this. Nobody has built a tool that connects the dots automatically across incidents separated by days or weeks. This does.
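Detecting that kind of clustering takes little more than bucketing the retained timestamps by hour. A minimal sketch, assuming ISO-format timestamps as stored in the metadata; the thresholds are invented for illustration.

```python
from collections import Counter
from datetime import datetime

def hourly_hotspot(timestamps, min_share=0.5, min_count=3):
    """Detect whether incidents cluster in one hour of the day.

    `timestamps` are ISO strings as stored in the retained metadata.
    Returns the hot hour (0-23) or None; thresholds are illustrative.
    """
    hours = Counter(datetime.fromisoformat(t).hour for t in timestamps)
    if not hours:
        return None
    hour, count = hours.most_common(1)[0]
    if count >= min_count and count / len(timestamps) >= min_share:
        return hour
    return None
```

A non-None result is what would back a message like "your database times out consistently between 2AM and 3AM" — the agent connects timestamps across incidents that may be days apart.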
Runbook evolution tracking. The first time you hit an error, you get a general fix. The third time you hit a variation of the same error, the agent shows you how the approach has evolved — from the initial patch to the root cause discovery to the permanent solution. You can see the learning curve made visible. That's institutional memory in a form that's actually usable.

Agent analyzing error patterns like a real senior developer

The Code That Powers It
Two functions do most of the work.
Here's the core loop:

```python
from datetime import datetime

# Recall similar past incidents from memory
past_incidents = hindsight.recall(
    query=error_message,
    top_k=5,
)

# Build context from memory
memory_context = format_past_incidents(past_incidents)

# Generate fix with memory context injected
response = groq_client.chat.completions.create(
    model="qwen/qwen3-32b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Error: {error_message}\n\nPast incidents: {memory_context}"},
    ],
)

# Retain this incident for future recall
hindsight.retain(
    content=f"Error: {error_message}\nFix: {response}",
    metadata={
        "timestamp": datetime.now().isoformat(),
        "category": detected_category,
        "confidence": calculated_confidence,
    },
)
```
The pattern detection and confidence scoring happen in a separate layer that reads from the retained memory pool, categorizes errors by root cause type, counts resolution outcomes per category, and surfaces anomalies like recurrence spikes or time-based clustering. Simple logic — powerful because it operates on real accumulated data from your specific system rather than generic heuristics.


Analyzing real errors, checking all aspects

Making Memory Visible in the UI
The biggest UI decision I made was to never hide the memory layer. Every single response shows exactly how many past incidents were recalled, which category they belong to, and what confidence score the fix carries. On the first interaction it reads "Memory used: 0 past incidents — providing general guidance." By interaction five it reads "Memory used: 4 similar incidents recalled — confidence 80%."
That progression is the whole story of the product told in a single line of UI text. Anyone watching the demo can see the agent getting smarter in real time without any explanation needed.
The right panel — the Memory Brain — updates live with every interaction. Color coded for immediate readability: red for recurring patterns needing architectural attention, amber for errors the agent has seen before, green for genuinely new incident types.
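The status logic behind that panel and the memory line is small enough to sketch. This is an illustration of the mapping described above, not the actual UI code; function names and signatures are assumptions.

```python
def memory_status(recalled, recurring):
    """Map an incident's memory state to the Memory Brain color scheme.

    red: recurring pattern needing architectural attention,
    amber: seen before, green: genuinely new incident type.
    """
    if recurring:
        return "red"
    if recalled > 0:
        return "amber"
    return "green"

def memory_line(n_recalled, confidence=None):
    """Render the one-line memory summary shown under every response."""
    if n_recalled == 0:
        return "Memory used: 0 past incidents — providing general guidance"
    return (f"Memory used: {n_recalled} similar incidents recalled "
            f"— confidence {confidence}%")
```

In Streamlit, each of these would feed a colored status badge and a caption line that rerenders after every interaction, which is what makes the learning curve visible without explanation.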

Groq Was Faster Than I Expected
I used Groq for LLM inference with qwen/qwen3-32b. I expected decent speed. What I got was response times under two seconds consistently, even with full memory context injected into the prompt. For a real-time incident response tool where production is down and every second matters, that's not a nice-to-have — it's the whole point. Slow AI advice during an outage is useless advice.

What I Learned
A few things worth carrying into the next project.
Memory changes the product category entirely. Without Hindsight, this is a chatbot wrapper around an LLM. With it, it's institutional memory for your engineering team. Those are not the same product and they don't compete with the same tools.
The confidence score is more important than the fix itself. In a high-stakes moment like a production outage, what engineers need isn't just an answer — it's a calibrated answer. Knowing that a suggested fix has worked 3 out of 4 times in your specific system is more actionable than the most detailed generic explanation.
Temporal patterns are completely underexplored. Every monitoring tool captures timestamps. None of them use accumulated temporal data to surface behavioral patterns the way a memory-enabled agent can. There is a lot of unexplored territory here.
Tight scope is a feature, not a limitation. I built one workflow and made it excellent. Every feature in this project connects directly to one value proposition: an agent that gets smarter the longer it runs on your system. Nothing in the codebase exists outside that thesis.

What Comes Next
The immediate roadmap is direct log stream integration — Datadog, CloudWatch, Sentry webhooks — so the agent ingests incidents automatically rather than requiring manual paste. After that, cross-service memory: a single agent that holds the failure history of an entire microservices architecture and can connect the dots between incidents happening in different services simultaneously.
The long-term vision hasn't changed from the first line of code I wrote. Every engineering team deserves a senior developer on call around the clock who has perfect memory of every incident the system has ever had and never has to start from scratch.
If you want to explore Hindsight agent memory yourself, the documentation is thorough and getting started takes about fifteen minutes.
Full code on GitHub: github.com/Amitrajeetpaul/incident-response-agent
