How Hindsight improved partial solution feedback
“Why did it stop giving the correct answer?” I was staring at a solution that passed every test while our system flagged it as “incomplete”. That was the moment Hindsight started surfacing how the user had actually gotten there.
What I actually built
Codemind is a coding practice platform where users don’t just submit solutions—they iterate. Every keystroke, failed run, partial idea, and retry is part of the signal. The system executes code in a sandbox, evaluates it, and generates feedback in real time.
At a high level, it looks like this:
- Frontend: problem UI, code editor, submission flow
- Backend (Python): handles submissions, evaluation, feedback
- Execution layer: runs code safely in isolation
- AI layer: analyzes errors and generates hints
- Memory layer (Hindsight): stores and retrieves user behavior over time
The interesting part, and the one that broke my assumptions, was the memory layer. I pulled in Hindsight (via its GitHub repository) as the system that tracks how users solve problems, not just whether they solve them.
I didn’t expect that to fundamentally change how feedback works. It did.
The problem: “Correct” answers weren’t enough
Initially, feedback was simple:
- Run user code
- Compare output against test cases
- If it fails → show error
- If it passes → mark correct
For partial solutions, I tried to be helpful by detecting patterns:
if "recursion" in code and fails_test_cases:
    return "Check your base condition."
This worked… until it didn’t.
Two users could submit nearly identical incorrect solutions for completely different reasons. One misunderstood recursion. The other had an off-by-one bug. Same output, totally different thinking.
My system treated them the same.
That’s where Hindsight changed things.
What I thought Hindsight would do
Going in, I assumed Hindsight was just “better context storage”—basically a structured way to keep user history and feed it into prompts.
Something like:
memory = hindsight.get_user_history(user_id)
prompt = f"""
User history:
{memory}
Current code:
{code}
Give feedback.
"""
This is technically correct—and also mostly useless.
Dumping more history into a prompt doesn’t magically make feedback better. It just makes it noisier.
What I actually built with Hindsight
The shift was subtle but important: I stopped treating memory as context and started treating it as behavioral signals.
Instead of asking “what has the user done?”, I started asking:
“What patterns does this user repeat when they fail?”
That led to a different integration pattern.
Storing attempts as structured events
Every submission became a structured event:
event = {
    "user_id": user_id,
    "problem_id": problem_id,
    "code": code,
    "error_type": classify_error(code, output),
    "timestamp": now(),
}
hindsight.store(event)
The key here wasn’t just storing code—it was attaching interpretation (error_type).
This made retrieval meaningful.
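The classify_error step is where interpretation gets attached. A minimal heuristic version might look like this; the rule set below is illustrative, not the platform's actual classifier:

```python
def classify_error(code, output):
    """Map a failed run to a coarse error category (illustrative rules only)."""
    if "RecursionError" in output or "maximum recursion depth" in output:
        return "recursion_base_case"
    if "IndexError" in output or "list index out of range" in output:
        return "off_by_one"
    if "TypeError" in output and "NoneType" in output:
        return "missing_return"
    # Fell through: the code ran, but produced the wrong answer.
    return "wrong_output"
```

The categories double as query keys later, so coarse is fine: the goal is stable labels, not a perfect diagnosis.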
Retrieving patterns, not history
Instead of pulling the last N attempts, I started querying for patterns:
patterns = hindsight.query(
    user_id=user_id,
    filters={"error_type": "recursion_base_case"},
    limit=3
)
Now I wasn’t feeding raw history into the AI. I was feeding evidence of repeated mistakes.
That changed the feedback completely.
Feedback generation shifted from static → adaptive
Before:
return "Check your recursion base condition."
After:
if patterns:
    return f"""
You've missed the base condition in recursion multiple times.
Look at your stopping case carefully—what should happen when input is minimal?
"""
This sounds simple, but it made feedback feel intentional instead of generic.
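Wrapped up as a function, the branch above becomes a small feedback builder. build_feedback and the generic fallback text are hypothetical names, but the shape matches the before/after:

```python
def build_feedback(patterns, current_error):
    """Turn retrieved patterns into a targeted hint (sketch, not the real engine)."""
    if not patterns:
        # No repeated behavior on record: fall back to a generic hint.
        return f"Your solution fails with: {current_error}. Walk through a small input by hand."
    # Repeated behavior: name the pattern explicitly so the hint feels intentional.
    times = len(patterns)
    return (
        f"You've hit '{current_error}' {times} times before. "
        "Look at your stopping case carefully: what should happen when input is minimal?"
    )
```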
Where it broke (and why)
The first version of this system overfit immediately.
If a user made one recursion mistake early on, every future problem triggered recursion-related hints—even when irrelevant.
The issue wasn’t Hindsight. It was how I used it.
I was treating all past behavior equally.
Fixing it: adding decay and relevance
I had to introduce two constraints:
1. Time decay
Older mistakes matter less:
def weight(event):
    age = now() - event["timestamp"]
    return max(0.1, 1 / (1 + age.days))
2. Problem context filtering
Only consider similar problems:
patterns = hindsight.query(
    user_id=user_id,
    filters={
        "error_type": current_error,
        "problem_tag": current_problem_tag
    }
)
This dramatically reduced noise.
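The two constraints combine naturally: filter by context first, then weight what's left by recency, and only nudge when the weighted evidence crosses a threshold. The 0.5 cutoff below is an assumed tuning knob, not a value from the real system:

```python
from datetime import datetime, timedelta

def weight(event, now=None):
    """Recency weight: today's mistake counts 1.0, a two-month-old one hits the 0.1 floor."""
    now = now or datetime.now()
    age = now - event["timestamp"]
    return max(0.1, 1 / (1 + age.days))

def should_nudge(events, threshold=0.5):
    """Nudge only if recent, repeated evidence outweighs the threshold (assumed cutoff)."""
    return sum(weight(e) for e in events) >= threshold

# One fresh mistake is enough to nudge; three stale ones are not.
fresh = [{"timestamp": datetime.now()}]
stale = [{"timestamp": datetime.now() - timedelta(days=60)}] * 3
```

This is what stopped the "haunted" behavior: old mistakes still exist in the store, they just can't dominate the hint on their own.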
A concrete before vs after
Before Hindsight
User submits:
def sum(n):
    if n == 1:
        return 1
    return n + sum(n-1)
Fails for n = 0.
Feedback:
“Check your base condition.”
Not wrong. Not helpful either.
After Hindsight
Hindsight sees:
- Same user failed 3 times on missing base cases
- All related to edge inputs (0, empty, null)
Feedback becomes:
“You tend to handle only the ‘main’ case in recursion and skip edge inputs like 0. What should happen when n = 0 here?”
This is a completely different experience.
It’s not just explaining the problem—it’s explaining the user’s pattern.
What this changed in the system
This one shift—using Hindsight for pattern detection instead of history dumping—rippled through the architecture.
- Feedback became stateful without becoming messy
- Hints became shorter but more relevant
- The system stopped over-explaining and started nudging
I also stopped relying on increasingly complex prompt engineering.
Instead, I focused on shaping better inputs.
If you’re curious how Hindsight structures and retrieves these signals, their official Hindsight documentation explains the retrieval model clearly. It’s closer to querying behavior than storing logs.
The architecture, simplified
At this point, the flow looks like:
User Code → Execution → Error Classification
                              ↓
                      Hindsight Store
                              ↓
                     Pattern Retrieval
                              ↓
                      Feedback Engine
The important part is that Hindsight sits between evaluation and feedback, not just as a passive store.
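The whole flow fits in one function. Here the store and query are in-memory stand-ins for Hindsight (the real calls are the hindsight.store/hindsight.query usage shown earlier), and the hint strings are placeholders for the feedback engine:

```python
from datetime import datetime

STORE = []  # in-memory stand-in for the Hindsight store

def store(event):
    STORE.append(event)

def query(user_id, error_type, problem_tag):
    # Context-filtered retrieval: same user, same mistake class, similar problem.
    return [e for e in STORE
            if e["user_id"] == user_id
            and e["error_type"] == error_type
            and e["problem_tag"] == problem_tag]

def handle_submission(user_id, problem_id, problem_tag, code, error_type):
    # Execution and classification have already run; error_type comes from the classifier.
    store({"user_id": user_id, "problem_id": problem_id, "code": code,
           "problem_tag": problem_tag, "error_type": error_type,
           "timestamp": datetime.now()})
    patterns = query(user_id, error_type, problem_tag)
    if len(patterns) > 1:
        return f"Repeated pattern: {error_type} ({len(patterns)} times on {problem_tag} problems)."
    return f"First occurrence of {error_type}; giving a generic hint."
```

Note the ordering: the attempt is stored before patterns are retrieved, so the current failure counts as part of its own pattern.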
If you want a broader look at how memory systems like this fit into agent design, the agent memory overview on Vectorize is a good reference.
What still isn’t great
This system is better, but not perfect.
- Error classification is still heuristic
- Similarity between problems is coarse (tags, not embeddings)
- Some users get “stuck” in a pattern loop where the system keeps nudging the same issue
Also, debugging this is painful.
When feedback feels wrong, it’s not obvious whether the issue is:
- bad retrieval
- bad classification
- or bad prompt shaping
Everything looks correct in isolation.
Lessons I’d carry forward
1. Memory is useless without structure
Dumping past attempts into prompts doesn’t work. You need interpretation layers (like error_type) to make memory actionable.
2. Patterns > history
Raw logs are noisy. Repeated behavior is signal.
Design your retrieval around patterns, not timelines.
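One way to see the difference: aggregating raw events by error type turns a noisy timeline into a ranked list of habits. For a sketch, collections.Counter is all it takes (the event data is illustrative):

```python
from collections import Counter

events = [  # a raw timeline of one user's failed attempts
    {"error_type": "recursion_base_case"},
    {"error_type": "off_by_one"},
    {"error_type": "recursion_base_case"},
    {"error_type": "recursion_base_case"},
]

# Timeline view: four entries, no obvious signal.
# Pattern view: one dominant habit.
habits = Counter(e["error_type"] for e in events)
print(habits.most_common(1))  # [('recursion_base_case', 3)]
```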
3. Decay is not optional
Without time decay, your system becomes biased toward old mistakes.
This shows up fast and feels “haunted” from a user perspective.
4. Partial solutions are the real gold
Correct answers don’t tell you much.
Failed attempts + retries tell you exactly how someone thinks.
That’s where Hindsight actually shines.
5. Simpler prompts, better inputs
I spent way too long tweaking prompts.
The real improvement came from feeding the model better, more structured signals.
Closing
I started this project thinking feedback meant explaining why code failed.
What I ended up building was a system that tries to understand how someone fails repeatedly—and nudges them out of it.
Hindsight didn’t make the system smarter by itself. It just forced me to stop treating user behavior as logs and start treating it as data worth modeling.
That one shift made partial solutions more valuable than correct ones—and that wasn’t something I expected going in.