<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: T S</title>
    <description>The latest articles on DEV Community by T S (@taraninadh).</description>
    <link>https://dev.to/taraninadh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3840548%2F6dea087d-113f-41de-8698-9b02b47637f9.png</url>
      <title>DEV Community: T S</title>
      <link>https://dev.to/taraninadh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/taraninadh"/>
    <language>en</language>
    <item>
      <title>How Hindsight improved partial solution feedback</title>
      <dc:creator>T S</dc:creator>
      <pubDate>Mon, 23 Mar 2026 17:51:17 +0000</pubDate>
      <link>https://dev.to/taraninadh/how-hindsight-improved-partial-solution-feedback-5bdc</link>
      <guid>https://dev.to/taraninadh/how-hindsight-improved-partial-solution-feedback-5bdc</guid>
      <description>&lt;h1&gt;
  
  
  How Hindsight improved partial solution feedback
&lt;/h1&gt;

&lt;p&gt;“Why is it flagging a correct answer?” I was staring at a solution that passed every test while our system marked it “incomplete”. That’s when Hindsight started surfacing how the user actually got there.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually built
&lt;/h2&gt;

&lt;p&gt;Codemind is a coding practice platform where users don’t just submit solutions—they iterate. Every keystroke, failed run, partial idea, and retry is part of the signal. The system executes code in a sandbox, evaluates it, and generates feedback in real time.&lt;/p&gt;

&lt;p&gt;At a high level, it looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: problem UI, code editor, submission flow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend (Python)&lt;/strong&gt;: handles submissions, evaluation, feedback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution layer&lt;/strong&gt;: runs code safely in isolation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI layer&lt;/strong&gt;: analyzes errors and generates hints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory layer (Hindsight)&lt;/strong&gt;: stores and retrieves user behavior over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interesting part—and the one that broke my assumptions—was the memory layer. I integrated &lt;a href="https://github.com/vectorize-io/hindsight" rel="noopener noreferrer"&gt;&lt;strong&gt;Hindsight’s GitHub repository&lt;/strong&gt;&lt;/a&gt; as the system that tracks how users solve problems, not just whether they solve them.&lt;/p&gt;

&lt;p&gt;I didn’t expect that to fundamentally change how feedback works. It did.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem: “Correct” answers weren’t enough
&lt;/h2&gt;

&lt;p&gt;Initially, feedback was simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run user code&lt;/li&gt;
&lt;li&gt;Compare output against test cases&lt;/li&gt;
&lt;li&gt;If it fails → show error&lt;/li&gt;
&lt;li&gt;If it passes → mark correct&lt;/li&gt;
&lt;/ol&gt;
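
&lt;p&gt;In code, that loop was roughly the following. This is an illustrative sketch; &lt;code&gt;run_sandboxed&lt;/code&gt; and the test-case shape are stand-ins for the real execution API:&lt;/p&gt;

```python
# Sketch of the original evaluation loop. run_sandboxed and the test-case
# shape are illustrative stand-ins, not the platform's real execution API.
def evaluate(code, test_cases, run_sandboxed):
    """Run code against each test case; stop at the first failure."""
    for case in test_cases:
        result = run_sandboxed(code, case["input"])
        if result.get("error"):
            return {"status": "error", "detail": result["error"]}
        if result["output"] != case["expected"]:
            return {"status": "failed", "case": case}
    return {"status": "correct"}
```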

&lt;p&gt;For partial solutions, I tried to be helpful by detecting patterns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recursion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;fails_test_cases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Check your base condition.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worked… until it didn’t.&lt;/p&gt;

&lt;p&gt;Two users could submit nearly identical incorrect solutions for completely different reasons. One misunderstood recursion. The other had an off-by-one bug. Same output, totally different thinking.&lt;/p&gt;

&lt;p&gt;My system treated them the same.&lt;/p&gt;

&lt;p&gt;That’s where Hindsight changed things.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I thought Hindsight would do
&lt;/h2&gt;

&lt;p&gt;Going in, I assumed Hindsight was just “better context storage”—basically a structured way to keep user history and feed it into prompts.&lt;/p&gt;

&lt;p&gt;Something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hindsight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_user_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
User history:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Current code:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Give feedback.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is technically correct—and also mostly useless.&lt;/p&gt;

&lt;p&gt;Dumping more history into a prompt doesn’t magically make feedback better. It just makes it noisier.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I actually built with Hindsight
&lt;/h2&gt;

&lt;p&gt;The shift was subtle but important: I stopped treating memory as context and started treating it as &lt;strong&gt;behavioral signals&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of asking “what has the user done?”, I started asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What patterns does this user repeat when they fail?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That led to a different integration pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storing attempts as structured events
&lt;/h3&gt;

&lt;p&gt;Every submission became a structured event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;problem_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;problem_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;classify_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;hindsight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key here wasn’t just storing code—it was attaching &lt;strong&gt;interpretation&lt;/strong&gt; (&lt;code&gt;error_type&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;This made retrieval meaningful.&lt;/p&gt;
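
&lt;p&gt;For reference, a minimal heuristic version of that &lt;code&gt;classify_error&lt;/code&gt; step could look like this. The rules and labels below are illustrative guesses, not the platform’s actual classifier:&lt;/p&gt;

```python
# Minimal heuristic sketch of classify_error; the rules and labels are
# illustrative guesses, not the real classifier.
def classify_error(code, output):
    """Map a failed run to a coarse, queryable error label."""
    if "RecursionError" in output:
        return "recursion_base_case"   # runaway recursion: stopping case suspect
    if "IndexError" in output:
        return "off_by_one"            # classic boundary mistake
    if "NoneType" in output:
        return "missing_return"        # some branch probably returns nothing
    return "wrong_output"              # ran cleanly, answer was wrong
```

The point is less which rules are used than that each event carries a label the memory layer can later filter on.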




&lt;h3&gt;
  
  
  Retrieving patterns, not history
&lt;/h3&gt;

&lt;p&gt;Instead of pulling the last N attempts, I started querying for patterns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;patterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hindsight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recursion_base_case&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now I wasn’t feeding raw history into the AI. I was feeding &lt;strong&gt;evidence of repeated mistakes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That changed the feedback completely.&lt;/p&gt;




&lt;h3&gt;
  
  
  Feedback generation shifted from static → adaptive
&lt;/h3&gt;

&lt;p&gt;Before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Check your recursion base condition.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ve missed the base condition in recursion multiple times.
Look at your stopping case carefully—what should happen when input is minimal?
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sounds simple, but it made feedback feel &lt;em&gt;intentional&lt;/em&gt; instead of generic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where it broke (and why)
&lt;/h2&gt;

&lt;p&gt;The first version of this system overfit immediately.&lt;/p&gt;

&lt;p&gt;If a user made one recursion mistake early on, every future problem triggered recursion-related hints—even when irrelevant.&lt;/p&gt;

&lt;p&gt;The issue wasn’t Hindsight. It was how I used it.&lt;/p&gt;

&lt;p&gt;I was treating all past behavior equally.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fixing it: adding decay and relevance
&lt;/h2&gt;

&lt;p&gt;I had to introduce two constraints:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Time decay
&lt;/h3&gt;

&lt;p&gt;Older mistakes matter less:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Problem context filtering
&lt;/h3&gt;

&lt;p&gt;Only consider similar problems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;patterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hindsight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;problem_tag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current_problem_tag&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This dramatically reduced noise.&lt;/p&gt;
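
&lt;p&gt;Combining the two constraints, a retrieved pattern can be scored by recency-weighted repetition instead of raw count. A sketch using the decay rule above (the &lt;code&gt;pattern_strength&lt;/code&gt; helper is my own addition):&lt;/p&gt;

```python
from datetime import datetime, timedelta

def weight(event):
    """Recency weight from the decay rule above, floored at 0.1."""
    age = datetime.now() - event["timestamp"]
    return max(0.1, 1 / (1 + age.days))

def pattern_strength(events):
    """Recency-weighted repetition count for a retrieved pattern."""
    return sum(weight(e) for e in events)

# Today's mistake weighs 1.0, yesterday's 0.5, a month-old one clamps to 0.1,
# so three stale repeats no longer outweigh one fresh mistake.
events = [{"timestamp": datetime.now() - timedelta(days=d)} for d in (0, 1, 30)]
```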




&lt;h2&gt;
  
  
  A concrete before vs after
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Before Hindsight
&lt;/h3&gt;

&lt;p&gt;User submits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fails for &lt;code&gt;n = 0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Feedback:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Check your base condition.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not wrong. Not helpful either.&lt;/p&gt;
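
&lt;p&gt;For completeness, the fix the hint is pointing toward is a base case that covers the edge input. One possible correction, renamed to avoid shadowing the built-in &lt;code&gt;sum&lt;/code&gt;:&lt;/p&gt;

```python
def sum_to_n(n):
    # base case now covers the edge input n == 0 (assumes n is non-negative)
    if n == 0:
        return 0
    return n + sum_to_n(n - 1)
```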




&lt;h3&gt;
  
  
  After Hindsight
&lt;/h3&gt;

&lt;p&gt;Hindsight sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same user failed 3 times on missing base cases&lt;/li&gt;
&lt;li&gt;All related to edge inputs (0, empty, null)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feedback becomes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“You tend to handle only the ‘main’ case in recursion and skip edge inputs like 0. What should happen when n = 0 here?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a completely different experience.&lt;/p&gt;

&lt;p&gt;It’s not just explaining the problem—it’s explaining &lt;em&gt;the user’s pattern&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this changed in the system
&lt;/h2&gt;

&lt;p&gt;This one shift—using Hindsight for pattern detection instead of history dumping—rippled through the architecture.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feedback became stateful without becoming messy&lt;/li&gt;
&lt;li&gt;Hints became shorter but more relevant&lt;/li&gt;
&lt;li&gt;The system stopped over-explaining and started nudging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also stopped relying on increasingly complex prompt engineering.&lt;/p&gt;

&lt;p&gt;Instead, I focused on shaping better inputs.&lt;/p&gt;

&lt;p&gt;If you’re curious how Hindsight structures and retrieves these signals, their &lt;a href="https://hindsight.vectorize.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;official Hindsight documentation&lt;/strong&gt;&lt;/a&gt; explains the retrieval model clearly. It’s closer to querying behavior than storing logs.&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture, simplified
&lt;/h2&gt;

&lt;p&gt;At this point, the flow looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Code → Execution → Error Classification
           ↓
       Hindsight Store
           ↓
   Pattern Retrieval
           ↓
     Feedback Engine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important part is that Hindsight sits &lt;strong&gt;between evaluation and feedback&lt;/strong&gt;, not just as a passive store.&lt;/p&gt;
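
&lt;p&gt;That placement is easier to see in code. Here is a toy end-to-end version of the flow, with an in-memory class standing in for the Hindsight client (the class and helper names are mine, not the library’s API):&lt;/p&gt;

```python
# Toy end-to-end flow: execution -> classification -> store -> retrieve -> feedback.
# MemoryStore is an in-memory stand-in for the Hindsight client, not its real API.
class MemoryStore:
    def __init__(self):
        self.events = []

    def store(self, event):
        self.events.append(event)

    def query(self, user_id, filters):
        return [e for e in self.events
                if e["user_id"] == user_id
                and all(e.get(k) == v for k, v in filters.items())]

def handle_submission(store, user_id, problem_tag, code, run, classify):
    result = run(code)                                # execution layer
    if result["passed"]:
        return {"verdict": "correct"}
    error_type = classify(code, result["output"])     # interpretation layer
    store.store({"user_id": user_id, "problem_tag": problem_tag,
                 "error_type": error_type})           # memory layer
    repeats = store.query(user_id, {"error_type": error_type,
                                    "problem_tag": problem_tag})
    hint = (f"You have hit {error_type} on similar problems before."
            if len(repeats) > 1 else f"This looks like {error_type}.")
    return {"verdict": "failed", "feedback": hint}
```

Because the store sits in the middle of the request path, the feedback string changes between the first and second failure of the same kind.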

&lt;p&gt;If you want a broader look at how memory systems like this fit into agent design, the &lt;a href="https://vectorize.io/features/agent-memory" rel="noopener noreferrer"&gt;&lt;strong&gt;agent memory overview on Vectorize&lt;/strong&gt;&lt;/a&gt; is a good reference.&lt;/p&gt;




&lt;h2&gt;
  
  
  What still isn’t great
&lt;/h2&gt;

&lt;p&gt;This system is better, but not perfect.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Error classification is still heuristic&lt;/li&gt;
&lt;li&gt;Similarity between problems is coarse (tags, not embeddings)&lt;/li&gt;
&lt;li&gt;Some users get “stuck” in a pattern loop where the system keeps nudging the same issue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, debugging this is painful.&lt;/p&gt;

&lt;p&gt;When feedback feels wrong, it’s not obvious whether the issue is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bad retrieval&lt;/li&gt;
&lt;li&gt;bad classification&lt;/li&gt;
&lt;li&gt;or bad prompt shaping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything looks correct in isolation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons I’d carry forward
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Memory is useless without structure
&lt;/h3&gt;

&lt;p&gt;Dumping past attempts into prompts doesn’t work. You need interpretation layers (like &lt;code&gt;error_type&lt;/code&gt;) to make memory actionable.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Patterns &amp;gt; history
&lt;/h3&gt;

&lt;p&gt;Raw logs are noisy. Repeated behavior is signal.&lt;/p&gt;

&lt;p&gt;Design your retrieval around patterns, not timelines.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Decay is not optional
&lt;/h3&gt;

&lt;p&gt;Without time decay, your system becomes biased toward old mistakes.&lt;/p&gt;

&lt;p&gt;This shows up fast and feels “haunted” from a user perspective.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Partial solutions are the real gold
&lt;/h3&gt;

&lt;p&gt;Correct answers don’t tell you much.&lt;/p&gt;

&lt;p&gt;Failed attempts + retries tell you exactly how someone thinks.&lt;/p&gt;

&lt;p&gt;That’s where Hindsight actually shines.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Simpler prompts, better inputs
&lt;/h3&gt;

&lt;p&gt;I spent way too long tweaking prompts.&lt;/p&gt;

&lt;p&gt;The real improvement came from feeding the model better, more structured signals.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;I started this project thinking feedback meant explaining why code failed.&lt;/p&gt;

&lt;p&gt;What I ended up building was a system that tries to understand &lt;em&gt;how someone fails repeatedly&lt;/em&gt;—and nudges them out of it.&lt;/p&gt;

&lt;p&gt;Hindsight didn’t make the system smarter by itself. It just forced me to stop treating user behavior as logs and start treating it as data worth modeling.&lt;/p&gt;

&lt;p&gt;That one shift made partial solutions more valuable than correct ones—and that wasn’t something I expected going in.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
  </channel>
</rss>
