How Hindsight improved partial solution feedback
“Why did it stop giving the correct answer?” I was staring at a solution that passed every test while our system flagged it as “incomplete”. That was the moment Hindsight started surfacing how the user had actually gotten there.
What I actually built
Codemind is a coding practice platform where users don’t just submit solutions—they iterate. Every keystroke, failed run, partial idea, and retry is part of the signal. The system executes code in a sandbox, evaluates it, and generates feedback in real time.
At a high level, it looks like this:
- Frontend: problem UI, code editor, submission flow
- Backend (Python): handles submissions, evaluation, feedback
- Execution layer: runs code safely in isolation
- AI layer: analyzes errors and generates hints
- Memory layer (Hindsight): stores and retrieves user behavior over time
The interesting part, and the one that broke my assumptions, was the memory layer. I pulled in Hindsight (via its GitHub repository) as the system that tracks how users solve problems, not just whether they solve them.
I didn’t expect that to fundamentally change how feedback works. It did.
The problem: “Correct” answers weren’t enough
Initially, feedback was simple:
- Run user code
- Compare output against test cases
- If it fails → show error
- If it passes → mark correct
For partial solutions, I tried to be helpful by detecting patterns:
if "recursion" in code and fails_test_cases:
    return "Check your base condition."
This worked… until it didn’t.
Two users could submit nearly identical incorrect solutions for completely different reasons. One misunderstood recursion. The other had an off-by-one bug. Same output, totally different thinking.
My system treated them the same.
That’s where Hindsight changed things.
What I thought Hindsight would do
Going in, I assumed Hindsight was just “better context storage”—basically a structured way to keep user history and feed it into prompts.
Something like:
memory = hindsight.get_user_history(user_id)
prompt = f"""
User history:
{memory}
Current code:
{code}
Give feedback.
"""
This is technically correct—and also mostly useless.
Dumping more history into a prompt doesn’t magically make feedback better. It just makes it noisier.
What I actually built with Hindsight
The shift was subtle but important: I stopped treating memory as context and started treating it as behavioral signals.
Instead of asking “what has the user done?”, I started asking:
“What patterns does this user repeat when they fail?”
That led to a different integration pattern.
Storing attempts as structured events
Every submission became a structured event:
event = {
    "user_id": user_id,
    "problem_id": problem_id,
    "code": code,
    "error_type": classify_error(code, output),
    "timestamp": now(),
}
hindsight.store(event)
The key here wasn’t just storing code—it was attaching interpretation (error_type).
This made retrieval meaningful.
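The classify_error step is where interpretation gets attached. A minimal heuristic version might look like this; the rule set below is illustrative, not the platform's actual classifier:

```python
def classify_error(code, output):
    """Map a failed run to a coarse error category (illustrative rules only)."""
    if "RecursionError" in output or "maximum recursion depth" in output:
        return "recursion_base_case"
    if "IndexError" in output or "list index out of range" in output:
        return "off_by_one"
    if "TypeError" in output and "NoneType" in output:
        return "missing_return"
    # Fell through: the code ran, but produced the wrong answer.
    return "wrong_output"
```

The categories double as query keys later, so coarse is fine: the goal is stable labels, not a perfect diagnosis.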
Retrieving patterns, not history
Instead of pulling the last N attempts, I started querying for patterns:
patterns = hindsight.query(
    user_id=user_id,
    filters={"error_type": "recursion_base_case"},
    limit=3
)
Now I wasn’t feeding raw history into the AI. I was feeding evidence of repeated mistakes.
That changed the feedback completely.
Feedback generation shifted from static → adaptive
Before:
return "Check your recursion base condition."
After:
if patterns:
    return f"""
You've missed the base condition in recursion multiple times.
Look at your stopping case carefully—what should happen when input is minimal?
"""
This sounds simple, but it made feedback feel intentional instead of generic.
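Wrapped up as a function, the branch above becomes a small feedback builder. build_feedback and the generic fallback text are hypothetical names, but the shape matches the before/after:

```python
def build_feedback(patterns, current_error):
    """Turn retrieved patterns into a targeted hint (sketch, not the real engine)."""
    if not patterns:
        # No repeated behavior on record: fall back to a generic hint.
        return f"Your solution fails with: {current_error}. Walk through a small input by hand."
    # Repeated behavior: name the pattern explicitly so the hint feels intentional.
    times = len(patterns)
    return (
        f"You've hit '{current_error}' {times} times before. "
        "Look at your stopping case carefully: what should happen when input is minimal?"
    )
```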
Where it broke (and why)
The first version of this system overfit immediately.
If a user made one recursion mistake early on, every future problem triggered recursion-related hints—even when irrelevant.
The issue wasn’t Hindsight. It was how I used it.
I was treating all past behavior equally.
Fixing it: adding decay and relevance
I had to introduce two constraints:
1. Time decay
Older mistakes matter less:
def weight(event):
    age = now() - event["timestamp"]
    return max(0.1, 1 / (1 + age.days))
2. Problem context filtering
Only consider similar problems:
patterns = hindsight.query(
    user_id=user_id,
    filters={
        "error_type": current_error,
        "problem_tag": current_problem_tag
    }
)
This dramatically reduced noise.
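The two constraints combine naturally: filter by context first, then weight what's left by recency, and only nudge when the weighted evidence crosses a threshold. The 0.5 cutoff below is an assumed tuning knob, not a value from the real system:

```python
from datetime import datetime, timedelta

def weight(event, now=None):
    """Recency weight: today's mistake counts 1.0, a two-month-old one hits the 0.1 floor."""
    now = now or datetime.now()
    age = now - event["timestamp"]
    return max(0.1, 1 / (1 + age.days))

def should_nudge(events, threshold=0.5):
    """Nudge only if recent, repeated evidence outweighs the threshold (assumed cutoff)."""
    return sum(weight(e) for e in events) >= threshold

# One fresh mistake is enough to nudge; three stale ones are not.
fresh = [{"timestamp": datetime.now()}]
stale = [{"timestamp": datetime.now() - timedelta(days=60)}] * 3
```

This is what stopped the "haunted" behavior: old mistakes still exist in the store, they just can't dominate the hint on their own.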
A concrete before vs after
Before Hindsight
User submits:
def sum(n):
    if n == 1:
        return 1
    return n + sum(n-1)
Fails for n = 0.
Feedback:
“Check your base condition.”
Not wrong. Not helpful either.
After Hindsight
Hindsight sees:
- Same user failed 3 times on missing base cases
- All related to edge inputs (0, empty, null)
Feedback becomes:
“You tend to handle only the ‘main’ case in recursion and skip edge inputs like 0. What should happen when n = 0 here?”
This is a completely different experience.
It’s not just explaining the problem—it’s explaining the user’s pattern.
What this changed in the system
This one shift—using Hindsight for pattern detection instead of history dumping—rippled through the architecture.
- Feedback became stateful without becoming messy
- Hints became shorter but more relevant
- The system stopped over-explaining and started nudging
I also stopped relying on increasingly complex prompt engineering.
Instead, I focused on shaping better inputs.
If you’re curious how Hindsight structures and retrieves these signals, their official Hindsight documentation explains the retrieval model clearly. It’s closer to querying behavior than storing logs.
The architecture, simplified
At this point, the flow looks like:
User Code → Execution → Error Classification
                              ↓
                      Hindsight Store
                              ↓
                     Pattern Retrieval
                              ↓
                      Feedback Engine
The important part is that Hindsight sits between evaluation and feedback, not just as a passive store.
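The whole flow fits in one function. Here the store and query are in-memory stand-ins for Hindsight (the real calls are the hindsight.store/hindsight.query usage shown earlier), and the hint strings are placeholders for the feedback engine:

```python
from datetime import datetime

STORE = []  # in-memory stand-in for the Hindsight store

def store(event):
    STORE.append(event)

def query(user_id, error_type, problem_tag):
    # Context-filtered retrieval: same user, same mistake class, similar problem.
    return [e for e in STORE
            if e["user_id"] == user_id
            and e["error_type"] == error_type
            and e["problem_tag"] == problem_tag]

def handle_submission(user_id, problem_id, problem_tag, code, error_type):
    # Execution and classification have already run; error_type comes from the classifier.
    store({"user_id": user_id, "problem_id": problem_id, "code": code,
           "problem_tag": problem_tag, "error_type": error_type,
           "timestamp": datetime.now()})
    patterns = query(user_id, error_type, problem_tag)
    if len(patterns) > 1:
        return f"Repeated pattern: {error_type} ({len(patterns)} times on {problem_tag} problems)."
    return f"First occurrence of {error_type}; giving a generic hint."
```

Note the ordering: the attempt is stored before patterns are retrieved, so the current failure counts as part of its own pattern.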
If you want a broader look at how memory systems like this fit into agent design, the agent memory overview on Vectorize is a good reference.
What still isn’t great
This system is better, but not perfect.
- Error classification is still heuristic
- Similarity between problems is coarse (tags, not embeddings)
- Some users get “stuck” in a pattern loop where the system keeps nudging the same issue
Also, debugging this is painful.
When feedback feels wrong, it’s not obvious whether the issue is:
- bad retrieval
- bad classification
- or bad prompt shaping
Everything looks correct in isolation.
Lessons I’d carry forward
1. Memory is useless without structure
Dumping past attempts into prompts doesn’t work. You need interpretation layers (like error_type) to make memory actionable.
2. Patterns > history
Raw logs are noisy. Repeated behavior is signal.
Design your retrieval around patterns, not timelines.
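One way to see the difference: aggregating raw events by error type turns a noisy timeline into a ranked list of habits. For a sketch, collections.Counter is all it takes (the event data is illustrative):

```python
from collections import Counter

events = [  # a raw timeline of one user's failed attempts
    {"error_type": "recursion_base_case"},
    {"error_type": "off_by_one"},
    {"error_type": "recursion_base_case"},
    {"error_type": "recursion_base_case"},
]

# Timeline view: four entries, no obvious signal.
# Pattern view: one dominant habit.
habits = Counter(e["error_type"] for e in events)
print(habits.most_common(1))  # [('recursion_base_case', 3)]
```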
3. Decay is not optional
Without time decay, your system becomes biased toward old mistakes.
This shows up fast and feels “haunted” from a user perspective.
4. Partial solutions are the real gold
Correct answers don’t tell you much.
Failed attempts + retries tell you exactly how someone thinks.
That’s where Hindsight actually shines.
5. Simpler prompts, better inputs
I spent way too long tweaking prompts.
The real improvement came from feeding the model better, more structured signals.
Closing
I started this project thinking feedback meant explaining why code failed.
What I ended up building was a system that tries to understand how someone fails repeatedly—and nudges them out of it.
Hindsight didn’t make the system smarter by itself. It just forced me to stop treating user behavior as logs and start treating it as data worth modeling.
That one shift made partial solutions more valuable than correct ones—and that wasn’t something I expected going in.