Rakesh Goundi

I Gave Our Team an AI Manager. It Started Playing Favorites Based on Past Performance.

"It's routing this away from Dexter again." Our AI PM had quietly started avoiding assigning certain task types to certain people — not because we told it to, but because it remembered what happened last time.

That was the sentence that stopped our hackathon sprint cold. We looked at each other, then back at the logs. Nobody had written a rule that said "don't give Dexter the API tasks this week." The agent had just... noticed. And it was right.


The Problem: An Agent With No Memory Is Just an Expensive Coin Flip

We were building an AI-powered project manager for our three-person team: Dexter (AI/Backend), Rahul (Frontend), and Rakesh Goundi (Project Coordinator). The pitch was simple — ask it natural language questions like "who should own the login UI?" and get back a justified, context-aware recommendation. No more guessing. No more "whoever has capacity" being the only answer.

We started with Groq's qwen/qwen3-32b model. Fast, capable, surprisingly good at reasoning about team dynamics when you give it good context. But every session started from zero.

The agent knew the team's roles. It knew their names. What it didn't know was that Dexter had spent the last two sessions running behind on a model fine-tuning pipeline — and that the reason was scope creep, not skill. It didn't know Rahul's "Dashboard redesign sprint" had been stuck at 20% for a week because of a CSS system conflict. It didn't know Sohan had just finished a sprint retrospective and actually had bandwidth.

Ask it "who should own the API integration?" on Monday: one answer. Ask the same question Thursday: a completely different answer. Neither grounded in what had actually been happening.

We tried the obvious fix first: stuff the system prompt with a week's worth of meeting notes. It helped, a little. But manually reconstructing context before every conversation isn't memory — it's homework. It doesn't scale, it doesn't catch patterns you didn't think to write down, and it definitely doesn't survive the moment you forget to update the briefing.

"Just add more context" is not a memory system. We needed something that would learn.


What We Built

The app is a Streamlit chat interface that acts as a persistent AI project manager. You talk to it like a colleague: ask it to assign tasks, flag blockers, review workload, reason about who should own what and why. The sidebar shows the team's current tasks, sprint stats, and a live indicator showing whether Hindsight memory is active.

The key design decision: long-term memory wasn't a feature we added at the end. It was the reason the thing was worth building at all. A PM that can't recall past performance isn't a PM — it's a Magic 8-Ball with better grammar. We wanted recommendations that got better over time, the kind that could only come from an agent that had actually been paying attention across sessions.

That's what Hindsight gave us.

![The AI Group Project Manager welcome screen — sidebar shows each team member's current task and status, with quick-action suggestion buttons in the main area]

The app on first load. The sidebar shows live task status for each team member — this static data is the baseline the agent starts from. What Hindsight adds is everything that's happened across all previous sessions.


How Hindsight Sits in Our Stack

Hindsight lives between the user's message and the LLM. Before we touch Groq, we query Hindsight for relevant history. That history gets injected directly into the system prompt as a "Relevant Team Memory" section. After the model responds, we store the full interaction back into the project-manager-v1 memory bank.

Here's the core loop — the entire Hindsight integration in one function:

def run_hindsight_loop(user_message: str, groq_client) -> tuple[str, str]:
    # Step 1: Recall — query Hindsight for relevant team history
    recall_query = f"team member performance and task history relevant to: {user_message}"
    memories = recall_team_memory(recall_query)

    # Step 2 & 3: Inject + Decide — embed memories into system prompt, run LLM
    messages = build_agent_prompt(user_message, memories)
    ai_response = run_agent(groq_client, messages)

    # Step 4: Retain — store this interaction back into Hindsight
    retain_interaction(user_message, ai_response)

    # Surface what was recalled so the UI can show it alongside the answer
    snippet = "\n".join(memories) if memories else ""
    return ai_response, snippet
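The injection step lives in `build_agent_prompt`. Here's a minimal sketch of what that helper does — the base prompt text and the exact "Relevant Team Memory" wording are illustrative assumptions; the real app's strings may differ:

```python
# Sketch of the injection step. BASE_SYSTEM_PROMPT and the exact section
# wording are assumptions for illustration, not the app's literal strings.
BASE_SYSTEM_PROMPT = (
    "You are the AI project manager for a small software team. "
    "Recommend task owners and justify your reasoning."
)

def build_agent_prompt(user_message: str, memories: list[str]) -> list[dict]:
    """Embed recalled memories into the system prompt as a labeled section."""
    system = BASE_SYSTEM_PROMPT
    if memories:
        memory_block = "\n".join(f"- {m}" for m in memories)
        system += f"\n\nRelevant Team Memory:\n{memory_block}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_message},
    ]
```

If recall returns nothing, the section is simply omitted and the model falls back to the static team context — the prompt shape stays the same either way.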

The retain call stores a structured record of every PM decision:

# ai_safe is the model response coerced to a plain string; timestamp and
# client are created earlier in the same helper
record = (
    f"Project manager decision — User request: '{user_message}' | "
    f"AI recommendation: '{ai_safe[:400]}'"
)
client.retain(
    bank_id=HINDSIGHT_BANK,
    content=record,
    context="project-manager-decision",
    timestamp=timestamp,
)

What we chose to store as "experiences": every completed PM decision — the user's request and the agent's full recommendation, explicitly labeled with context. Not chat logs. Not embeddings of meeting notes we manually pasted. Structured records of what was asked and what was decided, timestamped and indexed.

What we query for: a natural-language description of what the current message is about. "Team member performance and task history relevant to: [user message]." Hindsight's retrieval handles the semantic matching — we don't tag, categorize, or manually curate.

One engineering note that will save you some pain: we create a fresh Hindsight client per call rather than caching one. Streamlit's event loop and asyncio.timeout() don't play well together when you share a client across requests. A fresh client per call costs a few milliseconds. Debugging event loop conflicts costs hours.

Memory is also designed to fail silently. If Hindsight is offline or the API key isn't configured, the app runs normally on its built-in team context. Memory-augmented agents should be better agents, not broken agents.
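Both of those decisions — fresh client per call and silent fallback — can live in one small wrapper. This is a sketch under stated assumptions: `make_client` is a hypothetical factory, and the `client.recall(...)` signature is assumed by analogy with the `client.retain(...)` call shown earlier; only the pattern matters here:

```python
# Sketch of the fail-silent recall path. make_client and the
# client.recall(...) signature are assumptions mirroring the retain()
# call above; check the Hindsight docs for the real API.
HINDSIGHT_BANK = "project-manager-v1"

def recall_team_memory(query: str, make_client) -> list[str]:
    """Return relevant memories, or an empty list if memory is unavailable."""
    try:
        # Fresh client per call: avoids sharing one client across
        # Streamlit's event loop (see the engineering note above).
        client = make_client()
        results = client.recall(bank_id=HINDSIGHT_BANK, query=query)
        return [r["content"] for r in results]
    except Exception:
        # Fail silently: the app falls back to its built-in team context.
        return []

# Simulate Hindsight being offline — the agent keeps working, memory-free.
def offline_client():
    raise ConnectionError("Hindsight unreachable")

memories = recall_team_memory("Rahul workload", offline_client)
print(memories)  # []
```

The broad `except Exception` is deliberate here: any memory failure — network, auth, timeout — degrades to "no memories" rather than crashing the chat.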

For more on the retrieval architecture, the Hindsight documentation and Vectorize's agent memory page cover the internals in detail.


Before / After: The Dexter Moment

Early on, without memory, the agent would consistently recommend Dexter for anything involving ML or APIs. Logical. That's his role. And it would explain its reasoning in careful detail every time — citing his strengths, explaining why the fit was good. Very confident. Completely ignorant.

![The team sidebar showing Dexter's In Progress status on the model fine-tuning pipeline, Rahul's Pending dashboard sprint, and Sohan's completed retrospective]

What the agent could always see: static role descriptions and current task status. What it couldn't see — until Hindsight — was why Dexter's task had been at 65% for two weeks.

After several sessions with Hindsight active, we asked: "Who should own the new data pipeline task?"

The response was different. It recalled — unprompted — that Dexter was currently mid-sprint on the model fine-tuning pipeline, that his current task was at 65% and had already been flagged in a previous conversation as a workload risk. It recommended holding the new pipeline assignment until his current task cleared, and suggested scoping a coordination piece to Sohan in the meantime to keep the dependency unblocked.

![The AI PM responding to 'Assign the login UI to the best person' — it recommends Rahul citing his UI/UX strengths, and ends with a Memory Note about the assignment]

Notice the "Memory Note" at the end of every response — the agent explicitly flags what it's storing back into Hindsight from this interaction. It's not just answering; it's updating its own memory for next time.

That's not a recommendation you can get from a role description. It required knowing what had actually happened. The agent wasn't smarter — it just finally had access to history, and it used it.


The Unexpected Moment

About halfway through the hackathon, we asked something we hadn't tried before: "Review Rahul's current workload."

We expected a summary drawn from the sidebar's static team data. Instead, the agent pulled from Hindsight and noted that Rahul had appeared in two recent discussions in the context of a CSS system conflict — and proactively flagged that adding any new frontend tasks this sprint looked risky, based on that pattern.

We hadn't told it to watch for workload risks. We hadn't written any logic that said "if a person appears in multiple conversations with delays, flag them." The agent surfaced a monitoring behavior entirely from accumulated memory.

"It's routing this away from Dexter again."

That moment — realizing the agent had developed preferences based on evidence — was the one that made the whole thing feel real. It wasn't just recalling facts. It was drawing conclusions from patterns we hadn't explicitly labeled.


The Non-Obvious Lesson: Structure What You Store

Here's the thing that surprised us most, and the thing we'd tell anyone about to build their own memory system: the quality of what Hindsight recalls depends almost entirely on what you store.

Our first version stored the raw user message and AI response as a flat blob. Retrieval was mediocre — it would surface vaguely related memories, but not the specific decisions we needed. We were getting the right era of history, not the right moments.

When we switched to storing a structured record — explicitly labeling it as a project-manager-decision with the request and recommendation clearly delimited — retrieval quality improved noticeably and immediately. We hadn't changed the retrieval query. We hadn't changed anything about Hindsight. We just gave it cleaner input.

The retrieval isn't magic. It's semantic similarity over what you gave it. Give it clean, structured, clearly-labeled content, and it finds what you need. Give it blobs and it returns blobs.

The other lesson: we thought what mattered most was how much we stored. It turned out what mattered was how we formatted it. One clear structured record per decision beat three verbose conversational logs every time.
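To make the contrast concrete, here's a hypothetical side-by-side of the two formats. The request and response values are invented for illustration; the structured format mirrors the retain record shown earlier:

```python
# Hypothetical example of the two storage formats we compared.
user_message = "Who should own the new data pipeline task?"
ai_response = "Hold the assignment until Dexter's fine-tuning task clears."

# v1: flat blob — retrieval surfaced the right era, not the right moments
flat_record = f"{user_message}\n{ai_response}"

# v2: structured and explicitly labeled — retrieval improved immediately,
# with no change to the recall query
structured_record = (
    f"Project manager decision — User request: '{user_message}' | "
    f"AI recommendation: '{ai_response[:400]}'"
)
print(structured_record.split(" — ")[0])  # Project manager decision
```

The label and delimiters give the retrieval layer clean anchors to match against; the blob gives it a wall of conversational text.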

If you're about to bolt open-source agent memory onto your own stack: think hard about your record structure before you write your first retain() call. That structure is the thing you can't easily change later.


The full project is on GitHub: https://github.com/Sohan4-c/hindsight-manager
