<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Keerthana Keerthu</title>
    <description>The latest articles on DEV Community by Keerthana Keerthu (@keerthana_nagraj).</description>
    <link>https://dev.to/keerthana_nagraj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3839655%2F4a07a155-c490-4505-aa6a-f511e1196e5c.png</url>
      <title>DEV Community: Keerthana Keerthu</title>
      <link>https://dev.to/keerthana_nagraj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/keerthana_nagraj"/>
    <language>en</language>
    <item>
      <title>How Hindsight's Recall Quality Surprised Me</title>
      <dc:creator>Keerthana Keerthu</dc:creator>
      <pubDate>Mon, 23 Mar 2026 09:01:07 +0000</pubDate>
      <link>https://dev.to/keerthana_nagraj/how-hindsights-recall-quality-surprised-me-3k32</link>
      <guid>https://dev.to/keerthana_nagraj/how-hindsights-recall-quality-surprised-me-3k32</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4qxovn32a1p2pw8lgu5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4qxovn32a1p2pw8lgu5.jpg" alt=" " width="800" height="376"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fteptwren1piy6gk1pmr7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fteptwren1piy6gk1pmr7.jpg" alt=" " width="800" height="376"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fno0xv2s8s4z4iiwty4qn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fno0xv2s8s4z4iiwty4qn.jpg" alt=" " width="800" height="376"&gt;&lt;/a&gt;I went into this project expecting recall to feel like a fancier keyword search. What I got was something that made me rethink how I'd been building AI agents entirely.&lt;/p&gt;

&lt;p&gt;Our team built an AI Group Project Manager using &lt;a href="https://vectorize.io/features/agent-memory" rel="noopener noreferrer"&gt;Hindsight agent memory&lt;/a&gt; and Groq's LLM. My job was the LLM logic — specifically, making sure the agent gave useful, accurate answers based on what it remembered. That meant I spent more time than anyone else staring at what &lt;code&gt;recall()&lt;/code&gt; actually returned, and whether it was good enough to build a response on.&lt;/p&gt;

&lt;p&gt;Spoiler: it was better than I expected. But not always in the ways I anticipated.&lt;/p&gt;
&lt;h2&gt;What I Thought Recall Would Be&lt;/h2&gt;

&lt;p&gt;My mental model going in was basically: you store some text, you search it later, you get back the closest matches by embedding similarity. Standard RAG. Useful, but also limited in predictable ways — it struggles with names, exact terms, and anything time-related.&lt;/p&gt;

&lt;p&gt;So I designed the LLM prompts defensively. I assumed the recalled context would be noisy and incomplete, and I wrote system prompts that told the model to be cautious about what it claimed to know.&lt;/p&gt;

&lt;p&gt;Then I actually tested it.&lt;/p&gt;
&lt;h2&gt;What Recall Actually Returned&lt;/h2&gt;

&lt;p&gt;The first thing that surprised me was how clean the recalled memories were. When we retained a task assignment like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_retain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task assigned to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;assigned_to&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deadline: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;deadline&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Status: PENDING. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Assigned on: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task assignment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then queried with something like &lt;code&gt;"What is Keerthana working on?"&lt;/code&gt; — the recalled memory wasn't just the raw string we stored. &lt;a href="https://hindsight.vectorize.io/" rel="noopener noreferrer"&gt;Hindsight&lt;/a&gt; had extracted structured facts from it: who the task was assigned to, what it was, when it was due. The recall result reflected that structure.&lt;/p&gt;

&lt;p&gt;This matters because it means the LLM receives clean, factual context rather than a blob of text it has to parse. The quality of the LLM's answer goes up significantly.&lt;/p&gt;
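To make that concrete, here is a toy sketch of what the LLM layer does with structured recall results. The fact schema below is my own illustration, not Hindsight's actual recall() return shape; the point is the job either way: flatten structured facts into clean, labelled context.

```python
# Illustrative only: this fact schema is invented, not Hindsight's
# actual recall() return type. It shows the LLM-side job: flattening
# structured facts into clean, labelled context lines.
def facts_to_context(facts):
    """Render recalled facts as numbered, labelled context lines."""
    return "\n".join(
        f"{i}. [{fact['context']}] {fact['text']}"
        for i, fact in enumerate(facts, start=1)
    )

recalled = [
    {"context": "task assignment",
     "text": "Keerthana is assigned 'API routes'. Deadline: 2026-03-27."},
    {"context": "team decision",
     "text": "The team decided on a REST architecture."},
]

print(facts_to_context(recalled))
```

Because each fact arrives already separated and labelled, the formatting step is trivial, and the LLM never has to guess where one fact ends and the next begins.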

&lt;h2&gt;The Four Search Strategies&lt;/h2&gt;

&lt;p&gt;The reason recall felt better than standard vector search is that Hindsight runs four strategies in parallel — semantic, keyword, graph, and temporal — and merges the results. This is called TEMPR.&lt;/p&gt;

&lt;p&gt;In practice this meant:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keyword search&lt;/strong&gt; caught exact names. When someone typed "Keerthana" in a query, the keyword strategy found memories that contained that exact string, even if the semantic embedding wasn't a strong match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graph search&lt;/strong&gt; caught relationships. After we logged that Keerthana owned the API routes task AND that we'd decided to use a REST architecture, a query about "backend decisions" surfaced both memories together — even though neither contained the word "backend."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temporal search&lt;/strong&gt; caught time references. When we asked "what was assigned yesterday?" it actually worked, because Hindsight stored timestamps with every retained memory and could reason about relative time.&lt;/p&gt;

&lt;p&gt;I hadn't expected the graph and temporal strategies to matter much for a short hackathon demo. They mattered more than I thought.&lt;/p&gt;
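For intuition, here is a toy merge over multiple scoring strategies. This is not Hindsight's TEMPR implementation, just one simple way to see why merging helps: a memory only has to win on a single axis, like an exact-name keyword hit, to rank well even when the other signals are weak.

```python
import re

# A toy multi-strategy merge, NOT Hindsight's TEMPR implementation.
# Each strategy scores every memory; keeping the best score per
# memory means a single strong signal (here, an exact keyword hit)
# is enough to surface it.
def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def keyword_score(query, memory):
    q = tokens(query)
    return len(q.intersection(tokens(memory))) / max(len(q), 1)

def merge_strategies(query, memories, strategies):
    # Score each memory under every strategy, keep the best score.
    scored = [(max(s(query, m) for s in strategies), m) for m in memories]
    return [m for score, m in sorted(scored, reverse=True) if score > 0]

memories = [
    "Task assigned to Keerthana: 'API routes'. Deadline: 2026-03-27.",
    "Decision: the team will use a REST architecture.",
]

print(merge_strategies("what is keerthana working on?", memories,
                       [keyword_score]))
```

Swap in semantic, graph, and temporal scorers for `strategies` and the same merge logic applies: the name "Keerthana" hits on keywords, "backend decisions" hits on graph links, "yesterday" hits on timestamps.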

&lt;h2&gt;Where Recall Struggled&lt;/h2&gt;

&lt;p&gt;It wasn't perfect. Two things bit us.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First: the first session is thin.&lt;/strong&gt; With only one or two retained memories, recall doesn't have much to work with. The agent's answers in the first session were noticeably weaker than after we'd built up a few tasks and decisions. This is expected behaviour, but it means your demo needs to start with a seeded bank or run through a setup phase before showing off the recall quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second: generic context labels hurt recall.&lt;/strong&gt; Early on we were storing everything with &lt;code&gt;context="data"&lt;/code&gt;. The recall results were flatter — less differentiated between task memories and decision memories. Once we switched to specific labels like &lt;code&gt;"task assignment"&lt;/code&gt; and &lt;code&gt;"team decision"&lt;/code&gt;, the relevant memories started rising to the top more reliably.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;context&lt;/code&gt; parameter isn't just metadata. It gets injected into the fact extraction prompt, so it actively shapes what Hindsight extracts and stores. More specific context = better facts = better recall.&lt;/p&gt;
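Here is a sketch of that mechanism. The prompt text is invented for illustration, it is not Hindsight's real extraction prompt, but it shows why a specific label changes what gets extracted.

```python
# Invented for illustration: this is not Hindsight's real extraction
# prompt. It only demonstrates the mechanism described above, where
# the context label is injected into the fact-extraction prompt.
def extraction_prompt(content, context):
    return (
        f"You are extracting facts from a memory about: {context}.\n"
        f"Pull out the entities, dates, and relationships that matter "
        f"for '{context}'.\n\nMemory:\n{content}"
    )

generic = extraction_prompt("Keerthana owns API routes.", "data")
specific = extraction_prompt("Keerthana owns API routes.", "task assignment")
# 'task assignment' steers the extractor toward an assignee, a task,
# and a deadline; 'data' gives it nothing to anchor on.
```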

&lt;h2&gt;What I Changed in the LLM Prompts&lt;/h2&gt;

&lt;p&gt;Once I understood what recall was actually returning, I rewrote the system prompts to be less defensive. Instead of hedging, I told the model to treat the recalled memories as ground truth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are an AI Group Project Manager.
Team members: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TEAM_MEMBERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.

Project memory recalled from Hindsight:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Use this memory to give accurate, personalised answers.
Reference specific tasks and decisions from memory when relevant.
Be concise and helpful. Today: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key line is: &lt;em&gt;"Use this memory to give accurate, personalised answers."&lt;/em&gt; Without that explicit instruction, the model would sometimes ignore the recalled context and generate plausible-sounding but fabricated answers. With it, the model anchored to what Hindsight had actually returned.&lt;/p&gt;

&lt;p&gt;This is a general lesson: giving an LLM memory isn't enough. You have to tell it to use the memory.&lt;/p&gt;
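For completeness, here is how a system string like the one earlier gets wired to the model. Groq's Python SDK follows the OpenAI chat-completions shape; the model name below is a placeholder, and `system` stands in for the prompt string built above.

```python
# Minimal sketch of sending the memory-laden system prompt to the
# model. Groq's Python SDK follows the OpenAI chat-completions shape;
# the model name is a placeholder.
system = "You are an AI Group Project Manager. ..."  # built as shown earlier

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "What has Keerthana been assigned?"},
]

# from groq import Groq
# reply = Groq().chat.completions.create(
#     model="llama-3.3-70b-versatile",
#     messages=messages,
# ).choices[0].message.content
```

The recalled memory lives entirely in the system message, so every user turn is answered against the same grounded context.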

&lt;h2&gt;The Moment That Changed My Mental Model&lt;/h2&gt;

&lt;p&gt;Midway through testing, I asked the agent: &lt;em&gt;"What has Keerthana been assigned?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It returned the API routes task with the correct deadline. Fine. But it also mentioned — unprompted — that the team had decided to use a REST architecture, and that this was relevant to the API work.&lt;/p&gt;

&lt;p&gt;We hadn't asked it to connect those two things. Hindsight's observation consolidation had synthesised a relationship between the decision memory and the task memory, and the graph search surfaced them together.&lt;/p&gt;

&lt;p&gt;That was the moment I stopped thinking of Hindsight as "storage with search" and started thinking of it as something closer to a knowledge graph that grows as you use it.&lt;/p&gt;

&lt;h2&gt;Lessons&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recall quality scales with bank size.&lt;/strong&gt; Seed your bank before demoing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context labels are not optional.&lt;/strong&gt; Specific labels produce meaningfully better recall.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tell the LLM to use the memory.&lt;/strong&gt; Explicit instructions in the system prompt matter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The graph strategy is doing real work.&lt;/strong&gt; Don't underestimate it just because you're not storing explicit relationships.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test recall quality directly.&lt;/strong&gt; Print what &lt;code&gt;arecall()&lt;/code&gt; returns before wiring it into the LLM. You'll catch problems earlier.&lt;/li&gt;
&lt;/ul&gt;
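In the spirit of that last bullet, here is the kind of debug helper I mean. `recall_fn` stands in for whatever async recall wrapper your agent exposes; the call shape is an assumption, and the stub exists only so the sketch runs standalone.

```python
import asyncio

# Debug helper: inspect raw recall output before any LLM touches it.
# `recall_fn` stands in for whatever async recall wrapper your agent
# exposes; the single-argument call shape is an assumption.
async def dump_recall(recall_fn, query, limit=5):
    results = await recall_fn(query)
    print(f"query: {query!r} -> {len(results)} memories")
    for i, memory in enumerate(results[:limit], start=1):
        print(f"  {i}. {memory}")
    return results

# Stub recall so the sketch runs without a memory backend:
async def fake_recall(query):
    return ["Task assigned to Keerthana: 'API routes'."]

asyncio.run(dump_recall(fake_recall, "What is Keerthana working on?"))
```

Five minutes of printing recall results told me more about prompt design than an hour of reading the LLM's final answers.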

&lt;p&gt;The full project is on GitHub: &lt;a href="https://github.com/SinchanaNagaraj/ai-group-project-manager" rel="noopener noreferrer"&gt;github.com/SinchanaNagaraj/ai-group-project-manager&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're evaluating memory systems for your agents, &lt;a href="https://hindsight.vectorize.io/" rel="noopener noreferrer"&gt;Hindsight's documentation&lt;/a&gt; is worth reading carefully — especially the section on retrieval strategies. The multi-strategy approach is what makes the recall quality meaningfully different from plain vector search.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>webdev</category>
      <category>python</category>
    </item>
  </channel>
</rss>
