The fourth time I watched our agent recommend Rahul for a UI task he'd already told us he couldn't take on, I stopped blaming the model. The model was fine. The problem was it kept waking up with no memory of the day before. So we fixed that.
Let Me Tell You What Actually Happens
You spend two hours in a session with your AI agent. You give it context. It learns your team. It starts making smart calls. You close the laptop feeling good about it.
You open it the next morning.
Gone. All of it. Rahul's bandwidth issue — gone. The decision you made about keeping backend and frontend work separated this sprint — gone. The fact that Dexter just picked up two parallel tasks and is starting to stretch thin — absolutely gone. The agent greets you like you've never met.
So you re-explain everything. Again. The agent makes a reasonable recommendation based on what you just told it. You close the laptop.
You open it the next morning.
I don't know how many times this has to happen before it stops feeling like a tooling problem and starts feeling like a personal insult, but for me it was four. Four times watching it recommend Rahul for a task Rahul had explicitly flagged he couldn't take. Four times patiently re-explaining the same context to a system that had no mechanism for keeping it.
The model wasn't the problem. I want to be clear about that because the instinct is to upgrade the model, swap out the prompt, fiddle with temperature. We did all of that. It didn't matter. The model could reason fine — it just had nothing to reason from. Every session was session one.
The problem was we were building a stateless agent and expecting stateful behavior.
What We Were Actually Building
We were building an AI Project Manager for our team — Dexter on AI/Backend, Rahul on Frontend, me (Thanu) handling coordination. The job: task assignment, workload balancing, sprint decisions. Things a real PM would do by remembering the last two weeks of context and applying it.
We were using Groq's qwen3-32b, which is genuinely excellent at reasoning. We had a detailed system prompt covering each person's strengths, current tasks, and working style. It produced good output. Within a session.
The second you started a new session, the system prompt was all it had. Static descriptions of three people. No history of what those people had actually done, said, struggled with, or flagged. Just job titles and skill tags.
A real PM with only that information would also make bad calls. We weren't building a smart agent — we were building a smart agent with amnesia and wondering why it kept forgetting things.
The fix wasn't more prompt engineering. It was giving the agent a memory.
How We Actually Fixed It
We brought in Hindsight — an open source agent memory layer — and wired it into the stack between user input and LLM inference. The pattern is dead simple: recall relevant past decisions before the LLM call, retain the new decision after.
The stack ended up being Streamlit for the UI, Groq for inference, and Hindsight for persistent memory.
![The full AI Group Project Manager — sidebar shows live team status, sprint stats, and the Hindsight memory bank active]

Here's the core memory loop straight from app.py:
```python
import asyncio
from datetime import datetime, timezone


def recall_team_memory(query: str) -> str:
    async def _recall():
        client = _make_client()
        result = client.recall(bank_id=HINDSIGHT_BANK, query=query)
        # The client may return a coroutine or a plain result
        if asyncio.iscoroutine(result):
            result = await result
        return result

    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        result = loop.run_until_complete(_recall())
    finally:
        loop.close()
    # parse and return memory strings...


def retain_interaction(user_message: str, ai_response: str) -> None:
    # Store the decision pattern, not the full transcript
    record = (
        f"Project manager decision — User request: '{user_message}' | "
        f"AI recommendation: '{ai_response[:400]}'"
    )
    timestamp = datetime.now(timezone.utc).isoformat()

    async def _retain():
        client = _make_client()
        result = client.retain(
            bank_id=HINDSIGHT_BANK,
            content=record,
            context="project-manager-decision",
            timestamp=timestamp,
        )
        if asyncio.iscoroutine(result):
            await result

    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        loop.run_until_complete(_retain())
    finally:
        loop.close()
```
Before every LLM call, we build the recall query from whatever just came in:
```python
recall_query = f"team member performance and task history relevant to: {user_message}"
```
Whatever comes back from Hindsight gets injected into the system prompt:
```python
memory_section = (
    f"\n\n## Relevant Team Memory (from Hindsight)\n{memories}"
    if memories
    else "\n\n## Team Memory\nNo prior context available."
)
```
That's it. The agent now walks into every session carrying the relevant history from every session before it. It's not a complicated integration. What's complicated — and what I'll get to — is figuring out what to put in memory.
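To make the shape of the loop concrete, here's a minimal self-contained sketch of the same pattern. `MEMORY_STORE`, `recall`, `retain`, and `llm` are stand-ins for illustration (plain Python in place of the Hindsight client and the Groq call), not the real app.py API:

```python
MEMORY_STORE: list[str] = []  # stand-in for the Hindsight memory bank


def recall(query: str) -> list[str]:
    # Stand-in recall: return records sharing any word with the query.
    words = set(query.lower().split())
    return [m for m in MEMORY_STORE if words & set(m.lower().split())]


def retain(user_message: str, ai_response: str) -> None:
    # Stand-in retain: store the decision pattern, truncated to 400 chars.
    MEMORY_STORE.append(
        f"Project manager decision — User request: '{user_message}' | "
        f"AI recommendation: '{ai_response[:400]}'"
    )


def build_system_prompt(base_prompt: str, user_message: str) -> str:
    recall_query = (
        f"team member performance and task history relevant to: {user_message}"
    )
    memories = "\n".join(recall(recall_query))
    memory_section = (
        f"\n\n## Relevant Team Memory (from Hindsight)\n{memories}"
        if memories
        else "\n\n## Team Memory\nNo prior context available."
    )
    return base_prompt + memory_section


def handle_turn(base_prompt: str, user_message: str, llm) -> str:
    # recall -> infer -> retain, once per user turn
    prompt = build_system_prompt(base_prompt, user_message)
    response = llm(prompt, user_message)  # the real app calls Groq here
    retain(user_message, response)
    return response
```

Calling `handle_turn` twice with the same question shows the difference: the first prompt carries "No prior context available," the second carries the retained decision.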
Check out the Hindsight docs if you want to get into the details of memory banks and recall tuning.
![Sidebar with team cards, sprint stats, and the Hindsight memory active indicator — that green dot means the agent isn't starting from zero]

The Difference It Made
Let me show you the exact thing that made me feel better about the last four days of frustration.
We asked: "Who should own the API integration?"
Before memory: Dexter. Because Dexter is the backend engineer. That's what the system prompt says. Never mind that Dexter was 65% into a model fine-tuning pipeline and had mentioned capacity concerns two sessions ago. The agent didn't know that. It just knew his job title.
After memory — same question, new session, three days later:
"Dexter should own the API integration. His strengths in ML, APIs, and system design make him the most suitable candidate. He has the technical expertise to ensure robust, scalable, and secure API implementation. Memory Note: Dexter's proficiency in backend systems and APIs was cited in the team's initialization data, reinforcing his suitability for this role."
![Agent response showing the Memory Note — citing past context to justify the assignment decision]

Same answer. But look at that Memory Note. The agent isn't just pattern-matching a job title anymore — it's citing observed history to justify its reasoning. It's showing its work. And the work is grounded in what actually happened, not what a static system prompt says about someone.
That's a different kind of agent. Same model, same prompt structure. Memory is the entire delta.
The Thing That Actually Surprised Me
I expected memory to stop the agent from making repeated mistakes. That was the whole point. What I didn't expect was what happened when we asked: "Sohan's sprint retrospective is done — what should they focus on next?"
The agent didn't just answer the question. It flagged that sprint planning had historically followed retrospectives in our workflow, and suggested starting stakeholder prep for the next sprint before it was formally assigned.
We never told it that sprint planning follows retrospectives. No documentation, no rule, no template. It had inferred the pattern from the sequence of decisions that had been retained — saw that every time a retro got marked done, sprint planning came next — and started anticipating the workflow.
I genuinely didn't see that coming. That's not retrieval. That's the agent learning how we operate and getting ahead of us. A good human PM does that automatically after a few sprints. Ours started doing it after two days.
What I Got Wrong About Memory
Here's the part I'd want someone to tell me before I started: storing more is not the same as having better memory.
Our first instinct was to log everything. Full messages, full responses, every timestamp, every exchange. The recall results turned into noise almost immediately. The agent would surface memories that were technically related but practically useless, and they'd muddy its reasoning instead of sharpening it.
The fix was discipline, not volume. We added a specific context tag — "project-manager-decision" — and truncated every retained response to 400 characters. We were storing the decision pattern, not the reasoning trail. That one change fixed the noise problem.
If I were starting over: design your memory schema before you write a single line of retain logic. Ask yourself what a six-month employee would need to remember to do this job well — not what a logging system would capture. Those are very different answers, and building toward the wrong one will waste you a day like it wasted us.
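For illustration (this exact shape isn't from the project), pinning the schema down first can be as simple as a record type that refuses to hold anything but the decision pattern:

```python
from dataclasses import dataclass


@dataclass
class DecisionRecord:
    """The only shape allowed into memory: a decision, not a transcript."""

    context_tag: str  # e.g. "project-manager-decision"
    request: str      # what was asked
    decision: str     # what the agent recommended, truncated on init

    MAX_DECISION_CHARS = 400  # class-level cap, not a dataclass field

    def __post_init__(self) -> None:
        # Enforce the cap at construction time, so nothing upstream
        # can accidentally retain a full reasoning trail.
        self.decision = self.decision[: self.MAX_DECISION_CHARS]

    def to_memory(self) -> str:
        # The flat string that actually gets retained.
        return (
            f"{self.context_tag} — User request: '{self.request}' | "
            f"AI recommendation: '{self.decision}'"
        )
```

With this in place, the retain path takes a `DecisionRecord` instead of raw strings, and the 400-character discipline lives in one spot instead of being scattered across call sites.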
The open source agent memory tooling is the easy part. The discipline around what you store is where you'll actually spend your time.
Try It
The full project is at github.com/Sohan4-c/hindsight-manager.
Drop a Groq API key and a Hindsight API key into .env and run:
```shell
streamlit run app.py
```
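For reference, the .env ends up looking something like this. `GROQ_API_KEY` is the variable Groq's SDK reads by default; the Hindsight variable name below is my assumption, so check the repo's README for the exact one:

```shell
# .env (GROQ_API_KEY is Groq's SDK default; the Hindsight variable
# name is an assumption, verify against the repo's README)
GROQ_API_KEY=your-groq-key
HINDSIGHT_API_KEY=your-hindsight-key
```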
If your agent has been waking up with no memory of the day before, this is the fix. It's not complicated. It's just a pattern most people aren't using yet.