DEV Community

Keerthana Keerthu
How Hindsight's Recall Quality Surprised Me



I went into this project expecting recall to feel like a fancier keyword search. What I got was something that made me rethink how I'd been building AI agents entirely.

Our team built an AI Group Project Manager using Hindsight agent memory and Groq's LLM. My job was the LLM logic — specifically, making sure the agent gave useful, accurate answers based on what it remembered. That meant I spent more time than anyone else staring at what recall() actually returned, and whether it was good enough to build a response on.

Spoiler: it was better than I expected. But not always in the ways I anticipated.

What I Thought Recall Would Be

My mental model going in was basically: you store some text, you search it later, you get back the closest matches by embedding similarity. Standard RAG. Useful, but also limited in predictable ways — it struggles with names, exact terms, and anything time-related.
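For reference, that baseline mental model is tiny to sketch. The toy below uses bag-of-words counts in place of real embeddings, so it's only an illustration of plain similarity search, not anything Hindsight actually does:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector, not a real model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Standard cosine similarity over the two count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def naive_recall(query: str, memories: list[str], k: int = 2) -> list[str]:
    # Rank stored strings by similarity to the query, return the top-k.
    q = embed(query)
    return sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)[:k]

memories = [
    "Task assigned to Keerthana: 'Build API routes'.",
    "Team decision: use a REST architecture.",
    "Sprint retro scheduled for Friday.",
]
print(naive_recall("What is Keerthana working on?", memories))
```

This is the "store text, rank by similarity" loop I expected recall to be, with all the predictable failure modes that come with it.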

So I designed the LLM prompts defensively. I assumed the recalled context would be noisy and incomplete, and I wrote system prompts that told the model to be cautious about what it claimed to know.

Then I actually tested it.

What Recall Actually Returned

The first thing that surprised me was how clean the recalled memories were. When we retained a task assignment like this:

await self._retain(
    content=(
        f"Task assigned to {assigned_to}: '{task}'. "
        f"Deadline: {deadline}. Status: PENDING. "
        f"Assigned on: {datetime.utcnow().strftime('%Y-%m-%d')}."
    ),
    context="task assignment"
)

And then queried with something like "What is Keerthana working on?" — the recalled memory wasn't just the raw string we stored. Hindsight had extracted structured facts from it: who the task was assigned to, what it was, when it was due. The recall result reflected that structure.

This matters because it means the LLM receives clean, factual context rather than a blob of text it has to parse. The quality of the LLM's answer goes up significantly.
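I can't reproduce Hindsight's exact result schema here, but the recalled facts behaved roughly like the sketch below. The `RecalledFact` type and its field names are my own illustration, not the library's API:

```python
from dataclasses import dataclass

@dataclass
class RecalledFact:
    # Illustrative shape only -- not Hindsight's real schema.
    subject: str
    predicate: str
    object: str
    context: str
    timestamp: str

# We stored one flat string; what came back read more like this:
facts = [
    RecalledFact("Keerthana", "assigned_task", "Build API routes", "task assignment", "2024-11-02"),
    RecalledFact("Build API routes", "deadline", "2024-11-05", "task assignment", "2024-11-02"),
]

def to_prompt_context(facts: list[RecalledFact]) -> str:
    # Flatten structured facts into clean lines the LLM can anchor on.
    return "\n".join(
        f"- {f.subject} {f.predicate.replace('_', ' ')} {f.object} ({f.timestamp})"
        for f in facts
    )

print(to_prompt_context(facts))
```

Handing the model a list of subject-predicate-object lines instead of a raw paragraph is most of why the answers improved.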

The Four Search Strategies

The reason recall felt better than standard vector search is that Hindsight runs four strategies in parallel — semantic, keyword, graph, and temporal — and merges the results. This is called TEMPR.

In practice this meant:

Keyword search caught exact names. When someone typed "Keerthana" in a query, the keyword strategy found memories that contained that exact string, even if the semantic embedding wasn't a strong match.

Graph search caught relationships. After we logged that Keerthana owned the API routes task AND that we'd decided to use a REST architecture, a query about "backend decisions" surfaced both memories together — even though neither contained the word "backend."

Temporal search caught time references. When we asked "what was assigned yesterday?" it actually worked, because Hindsight stored timestamps with every retained memory and could reason about relative time.

I hadn't expected the graph and temporal strategies to matter much for a short hackathon demo. They mattered more than I thought.
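To make the merge idea concrete, here is a toy version of running two strategies in parallel and summing their scores. This is my sketch of the TEMPR concept, not Hindsight's implementation, and the scoring functions are deliberately naive:

```python
from datetime import date

memories = [
    {"text": "Task assigned to Keerthana: build the API routes.", "ts": date(2024, 11, 2)},
    {"text": "Team decision: use a REST architecture.", "ts": date(2024, 11, 1)},
    {"text": "Sprint retro scheduled for Friday.", "ts": date(2024, 10, 28)},
]

def keyword_score(query: str, mem: dict) -> float:
    # Exact-token overlap catches names like "Keerthana" even when
    # embeddings would miss them.
    q = set(query.lower().split())
    m = set(mem["text"].lower().replace(":", "").split())
    return float(len(q & m))

def temporal_score(query: str, mem: dict, today: date = date(2024, 11, 3)) -> float:
    # Boost memories from the right day when the query uses relative time.
    if "yesterday" in query.lower():
        return 1.0 if (today - mem["ts"]).days == 1 else 0.0
    return 0.0

def merged_recall(query: str, memories: list[dict], scorers: list) -> list[dict]:
    # Run every strategy over every memory, sum scores, return best-first.
    scored = [(sum(s(query, m) for s in scorers), m) for m in memories]
    return [m for score, m in sorted(scored, key=lambda p: -p[0]) if score > 0]

hits = merged_recall("what was assigned yesterday", memories, [keyword_score, temporal_score])
```

A real system would weight and normalise the strategies, but even this crude sum shows why "what was assigned yesterday" beats a single embedding lookup.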

Where Recall Struggled

It wasn't perfect. Two things bit us.

First: the first session is thin. With only one or two retained memories, recall doesn't have much to work with. The agent's answers in the first session were noticeably weaker than after we'd built up a few tasks and decisions. This is expected behaviour, but it means your demo needs to start with a seeded bank or run through a setup phase before showing off the recall quality.
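Our fix was a seeding pass that runs before the demo starts. In the sketch below `_retain` is stubbed with an in-memory list so the pattern is runnable on its own; in the real agent it calls Hindsight, and the seed contents are illustrative:

```python
import asyncio

# Stand-in for the agent's real _retain (which calls Hindsight) --
# stubbed here so the seeding pattern runs on its own.
bank: list[dict] = []

async def _retain(content: str, context: str) -> None:
    bank.append({"content": content, "context": context})

SEED_MEMORIES = [
    ("Task assigned to Keerthana: 'Build API routes'. Deadline: 2024-11-05. Status: PENDING.", "task assignment"),
    ("Team decision: use a REST architecture for the backend.", "team decision"),
    ("Task assigned to Sinchana: 'Design the dashboard UI'. Deadline: 2024-11-06. Status: PENDING.", "task assignment"),
]

async def seed_bank() -> None:
    # Run once before the demo so recall has material to work with.
    for content, context in SEED_MEMORIES:
        await _retain(content=content, context=context)

asyncio.run(seed_bank())
print(f"Seeded {len(bank)} memories")
```

A few tasks and one decision were enough to make first-query answers feel grounded instead of hollow.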

Second: generic context labels hurt recall. Early on we were storing everything with context="data". The recall results were flatter — less differentiated between task memories and decision memories. Once we switched to specific labels like "task assignment" and "team decision", the relevant memories started rising to the top more reliably.

The context parameter isn't just metadata. It gets injected into the fact extraction prompt, so it actively shapes what Hindsight extracts and stores. More specific context = better facts = better recall.
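One cheap guard against the generic-label mistake is defining the labels in a single place and failing loudly on anything unmapped. This is plain application code on our side, nothing Hindsight-specific:

```python
# One place to define context labels, so nothing silently falls back
# to a generic "data" bucket. The labels are our own conventions.
CONTEXT_LABELS = {
    "task": "task assignment",
    "decision": "team decision",
    "status": "status update",
}

def context_for(kind: str) -> str:
    # Fail loudly rather than storing a vague label by accident.
    try:
        return CONTEXT_LABELS[kind]
    except KeyError:
        raise ValueError(f"No context label for memory kind: {kind!r}")
```

Every `_retain` call then goes through `context_for(...)`, so a typo surfaces as an exception instead of as flat recall results two days later.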

What I Changed in the LLM Prompts

Once I understood what recall was actually returning, I rewrote the system prompts to be less defensive. Instead of hedging, I told the model to treat the recalled memories as ground truth:

system = f"""You are an AI Group Project Manager.
Team members: {', '.join(TEAM_MEMBERS)}.

Project memory recalled from Hindsight:
{context}

Use this memory to give accurate, personalised answers.
Reference specific tasks and decisions from memory when relevant.
Be concise and helpful. Today: {datetime.utcnow().strftime('%Y-%m-%d')}."""

The key line is: "Use this memory to give accurate, personalised answers." Without that explicit instruction, the model would sometimes ignore the recalled context and generate plausible-sounding but fabricated answers. With it, the model anchored to what Hindsight had actually returned.

This is a general lesson: giving an LLM memory isn't enough. You have to tell it to use the memory.
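Here is roughly how that prompt gets wired up. I've pulled the message construction into a pure function so it's easy to test; the Groq call itself is only sketched in a comment, and the model name there is just an example:

```python
from datetime import datetime, timezone

TEAM_MEMBERS = ["Keerthana", "Sinchana"]  # illustrative roster

def build_messages(context: str, question: str) -> list[dict]:
    # Assemble the system prompt around the recalled memory, then pair
    # it with the user's question in chat-completions message format.
    system = f"""You are an AI Group Project Manager.
Team members: {', '.join(TEAM_MEMBERS)}.

Project memory recalled from Hindsight:
{context}

Use this memory to give accurate, personalised answers.
Reference specific tasks and decisions from memory when relevant.
Be concise and helpful. Today: {datetime.now(timezone.utc).strftime('%Y-%m-%d')}."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

# The messages then go to Groq's OpenAI-compatible chat endpoint, e.g.:
#   client.chat.completions.create(model="llama-3.3-70b-versatile", messages=messages)

messages = build_messages(
    "- Keerthana assigned 'Build API routes', due 2024-11-05",
    "What is Keerthana working on?",
)
```

Keeping the prompt assembly pure means you can assert on exactly what the model sees, which is how we caught the missing "use this memory" instruction in the first place.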

The Moment That Changed My Mental Model

Midway through testing, I asked the agent: "What has Keerthana been assigned?"

It returned the API routes task with the correct deadline. Fine. But it also mentioned — unprompted — that the team had decided to use a REST architecture, and that this was relevant to the API work.

We hadn't asked it to connect those two things. Hindsight's observation consolidation had synthesised a relationship between the decision memory and the task memory, and the graph search surfaced them together.

That was the moment I stopped thinking of Hindsight as "storage with search" and started thinking of it as something closer to a knowledge graph that grows as you use it.

Lessons

  • Recall quality scales with bank size. Seed your bank before demoing.
  • Context labels are not optional. Specific labels produce meaningfully better recall.
  • Tell the LLM to use the memory. Explicit instructions in the system prompt matter.
  • The graph strategy is doing real work. Don't underestimate it just because you're not storing explicit relationships.
  • Test recall quality directly. Print what arecall() returns before wiring it into the LLM. You'll catch problems earlier.
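That last bullet is worth making concrete. Here is the debugging pattern with `arecall` stubbed, since the real call signature and result shape belong to Hindsight, not to this sketch:

```python
import asyncio

async def arecall(query: str) -> list[dict]:
    # Stub standing in for Hindsight's async recall -- the real API
    # and result shape come from the library, not this sketch.
    return [
        {"content": "Task assigned to Keerthana: 'Build API routes'.", "score": 0.91},
        {"content": "Team decision: use a REST architecture.", "score": 0.64},
    ]

async def debug_recall(query: str) -> list[dict]:
    # Print exactly what recall returns before any LLM sees it.
    results = await arecall(query)
    for i, r in enumerate(results, 1):
        print(f"[{i}] score={r['score']:.2f} {r['content']}")
    return results

results = asyncio.run(debug_recall("What is Keerthana working on?"))
```

Five minutes of staring at this output told us more about our context labels than an hour of judging the LLM's final answers did.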

The full project is on GitHub: github.com/SinchanaNagaraj/ai-group-project-manager

If you're evaluating memory systems for your agents, Hindsight's documentation is worth reading carefully — especially the section on retrieval strategies. The multi-strategy approach is what makes the recall quality meaningfully different from plain vector search.
