What if your project manager never forgot a single decision it ever made? Last weekend we built one. By Sunday night it was using a three-day-old conversation to override our instincts on task assignment — and it was right.
The Problem Before Hindsight
Every AI assistant we'd tried for project management had the same fatal flaw: goldfish memory.
Ask it to assign a task — reasonable answer. Start a new session the next day — it had completely forgotten that Dexter was already overloaded, that Rahul had struggled with a similar scope last sprint, or that we'd explicitly decided to keep backend and frontend work separated. Each session started from zero.
"Just use a bigger context window" was the obvious suggestion. We tried it. It helped within a session, but the moment a new conversation started, all that institutional knowledge evaporated. We then tried RAG over indexed meeting notes and sprint histories. Recall was shallow — it could surface relevant text, but it couldn't reason about patterns across time. It couldn't say: "Last three times we gave Rahul a design-heavy task mid-sprint, it got deprioritized." It just returned chunks.
The deeper problem: pure retrieval isn't the same as learned experience. A good PM doesn't just remember facts — they remember outcomes, judgment calls, and the subtle signals that certain work tends to go sideways with certain people under time pressure. That kind of memory requires something more structured than a vector search.
What We Built
The concept was simple: an AI Project Manager that gets genuinely smarter the longer it runs — not just within a session, but across every decision it's ever made.
The stack: Streamlit for the UI, Groq's qwen3-32b as the reasoning engine, and Hindsight as the persistent memory layer. The agent manages a three-person team — Dexter (AI/Backend), Rahul (Frontend), and Sohan (Project Coordinator) — handling task assignment, workload balancing, and sprint decisions.
The core bet was that agent memory — real structured memory, not just logged text — would let the agent stop repeating the same assignment mistakes, and eventually start making proactive recommendations we hadn't explicitly programmed.
![The full AI Group Project Manager interface — sidebar shows team status, sprint stats, and Hindsight memory active]

How We Used Hindsight
Hindsight sits between the user query and the LLM call. Every time a task assignment is made, we retain the interaction. Every time a new query comes in, we recall relevant experiences before the LLM ever sees the question.
Here's the actual recall/retain loop from app.py:
```python
import asyncio
from datetime import datetime, timezone


def recall_team_memory(query: str) -> str:
    async def _recall():
        client = _make_client()
        result = client.recall(bank_id=HINDSIGHT_BANK, query=query)
        if asyncio.iscoroutine(result):
            result = await result
        return result

    # Streamlit handlers are sync, so we spin up a fresh event loop per call.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        result = loop.run_until_complete(_recall())
    finally:
        loop.close()
    # parse and return memory strings...


def retain_interaction(user_message: str, ai_response: str) -> None:
    record = (
        f"Project manager decision — User request: '{user_message}' | "
        f"AI recommendation: '{ai_response[:400]}'"
    )
    timestamp = datetime.now(timezone.utc).isoformat()

    async def _retain():
        client = _make_client()
        result = client.retain(
            bank_id=HINDSIGHT_BANK,
            content=record,
            context="project-manager-decision",
            timestamp=timestamp,
        )
        if asyncio.iscoroutine(result):
            await result

    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        loop.run_until_complete(_retain())
    finally:
        loop.close()
```
What we store as "experiences": the full user request paired with a truncated version of the AI's recommendation, tagged with context="project-manager-decision". Intentionally broad — we capture the reasoning pattern, not just the outcome.
The recall query is constructed dynamically:
```python
recall_query = f"team member performance and task history relevant to: {user_message}"
```
That gets passed to Hindsight's semantic recall, which returns the most relevant past decisions. Those memories are then injected directly into the LLM system prompt before the model reasons about the current question:
```python
memory_section = (
    f"\n\n## Relevant Team Memory (from Hindsight)\n{memories}"
    if memories
    else "\n\n## Team Memory\nNo prior context available."
)
```
Clean, simple, and remarkably effective. The full Hindsight documentation covers structuring memory banks and tuning recall behavior in more depth.
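Putting the pieces together, the whole request cycle is recall, inject, retain. Here's a minimal, self-contained sketch of that flow; the Hindsight client is replaced by an in-memory list and `call_llm` stands in for the Groq qwen3-32b call, so the function names and matching logic here are illustrative, not the real API:

```python
# In-memory stand-in for a Hindsight memory bank, so the flow runs standalone.
MEMORY_BANK: list[str] = []

def recall_memories(query: str) -> str:
    # Real code calls client.recall(bank_id=..., query=query) with semantic ranking;
    # naive keyword overlap is used here just to make the loop observable.
    hits = [m for m in MEMORY_BANK if any(w in m.lower() for w in query.lower().split())]
    return "\n".join(hits)

def retain(user_message: str, ai_response: str) -> None:
    # Real code calls client.retain(..., context="project-manager-decision").
    MEMORY_BANK.append(
        f"Project manager decision — User request: '{user_message}' | "
        f"AI recommendation: '{ai_response[:400]}'"
    )

def call_llm(system_prompt: str, user_message: str) -> str:
    # Stand-in for the Groq chat completion call.
    return f"Recommendation for: {user_message}"

def handle_request(user_message: str) -> str:
    # 1. Recall relevant past decisions before the model sees the question.
    memories = recall_memories(
        f"team member performance and task history relevant to: {user_message}"
    )
    # 2. Inject recalled memories into the system prompt.
    memory_section = (
        f"\n\n## Relevant Team Memory (from Hindsight)\n{memories}"
        if memories
        else "\n\n## Team Memory\nNo prior context available."
    )
    answer = call_llm("You are the AI Project Manager." + memory_section, user_message)
    # 3. Retain this decision so future sessions can recall it.
    retain(user_message, answer)
    return answer
```

The key property is that retention happens after every answer, so the bank grows with each decision and the next recall has more to work with.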
![Sidebar showing team member cards with task status, sprint stats, and the Hindsight memory bank active indicator]

Before / After: The Moment It Clicked
Without memory: We asked the agent — "Who should own the settings page redesign?" It said Rahul. Makes sense on paper: frontend task, frontend engineer. We'd had a conversation about Rahul's workload two days earlier, but the agent had no way to know that. It just saw "UI task → frontend engineer" and fired.
With Hindsight: Same question, new session. The recalled memory included our earlier discussion about Rahul's load. The agent responded:
"Based on recent interactions, Rahul is currently at high utilization with the authentication UI in progress. Given this, I'd recommend either scoping this task for next sprint, or exploring whether Dexter can take a frontend-adjacent component while Rahul completes current work. Memory Note: Rahul's concurrent UI load is a pattern worth tracking."
Same model. Same prompt structure. The only difference was what Hindsight surfaced.
The agent didn't just give us an answer — it gave us a better answer because it remembered the context that made the obvious answer wrong.
That "Memory Note" line is worth calling out. The prompt instructs the model to end with a memory note, but the content — recognizing Rahul's concurrent load as a recurring pattern — came entirely from what Hindsight recalled. We didn't engineer that insight. It emerged from memory.
![The agent citing a Memory Note after assigning the API integration to Dexter — it pulled from past interactions to justify the decision]

The Unexpected Behavior
About halfway through the hackathon, we asked: "Sohan's sprint retrospective is done — what should they focus on next?"
The agent came back with something we hadn't anticipated. It flagged that based on previous interactions, sprint planning had historically followed retrospectives in our workflow, and proactively suggested Sohan begin stakeholder prep for the next sprint — before it was formally assigned.
We hadn't stored "sprint planning follows retrospectives" anywhere. The agent had inferred a workflow sequence from the order of retained interactions and applied it forward as a recommendation.
We re-read the response three times. It had learned a process from observation — not from instructions, not from documentation, but from the pattern of decisions retained over time. That was the moment the team stopped treating this as a hackathon project and started treating it as something worth finishing properly.
One Non-Obvious Lesson
We expected the hardest part to be the Hindsight integration. It wasn't — the open source agent memory library is clean, the recall/retain pattern is intuitive, and async support is solid.
The actually hard part was deciding what to retain.
Our first instinct was to store everything — every message, every response, full text, every timestamp. That backfired quickly. Recall results got noisy, and the agent started surfacing loosely related memories that confused its reasoning rather than sharpening it.
The fix was two things: specificity in the context tag, and truncating the retained AI response to 400 characters. We wanted the decision pattern, not the full reasoning chain. Once we made that change, recall quality improved dramatically.
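The change itself is tiny. A sketch of the retention discipline, assuming the record format shown earlier (the helper name `make_record` is ours, not from app.py):

```python
MAX_RESPONSE_CHARS = 400  # enough to capture the decision, not the full reasoning chain

def make_record(user_message: str, ai_response: str) -> str:
    # Pair the full user request with a truncated recommendation. The context
    # tag ("project-manager-decision") is passed separately on the retain call,
    # which keeps recall scoped to decisions rather than arbitrary chat.
    return (
        f"Project manager decision — User request: '{user_message}' | "
        f"AI recommendation: '{ai_response[:MAX_RESPONSE_CHARS]}'"
    )
```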
The lesson: storage discipline matters more than storage volume. A tightly scoped memory bank with well-tagged, truncated records will outperform a comprehensive log every time. You're not building an audit trail — you're building the institutional knowledge of a teammate who's been here for six months.
Try It Yourself
The full project is at github.com/Sohan4-c/hindsight-manager.
You'll need a Groq API key and a Hindsight API key — drop both into .env and run:
```shell
streamlit run app.py
```
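For reference, the `.env` file is just two lines; the variable names below are our best guess, so check the repo's README for the exact keys:

```shell
# .env — variable names are illustrative
GROQ_API_KEY=your-groq-key
HINDSIGHT_API_KEY=your-hindsight-key
```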
The agent starts learning from its first decision. By the tenth, it's a meaningfully different system than the one you booted up.