How Hindsight improved partial solution feedback

“Why did it stop giving the correct answer?” I was staring at a passing solution while our system flagged it as “incomplete”—that’s when Hindsight started surfacing how the user actually got there.

What I actually built

Codemind is a coding practice platform where users don’t just submit solutions—they iterate. Every keystroke, failed run, partial idea, and retry is part of the signal. The system executes code in a sandbox, evaluates it, and generates feedback in real time.

At a high level, it looks like this:

  • Frontend: problem UI, code editor, submission flow
  • Backend (Python): handles submissions, evaluation, feedback
  • Execution layer: runs code safely in isolation
  • AI layer: analyzes errors and generates hints
  • Memory layer (Hindsight): stores and retrieves user behavior over time

The interesting part—and the one that broke my assumptions—was the memory layer. I integrated Hindsight (from its GitHub repository) as the system that tracks how users solve problems, not just whether they solve them.

I didn’t expect that to fundamentally change how feedback works. It did.


The problem: “Correct” answers weren’t enough

Initially, feedback was simple:

  1. Run user code
  2. Compare output against test cases
  3. If it fails → show error
  4. If it passes → mark correct
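The numbered steps above can be sketched as a single loop. This is a minimal sketch, not the production code: `run_in_sandbox` and the `solve` entry-point convention are stand-ins for the real execution layer.

```python
def run_in_sandbox(code: str, inputs):
    # Stand-in for the real isolated executor: exec the submission
    # and call a conventional entry point named `solve`.
    scope = {}
    exec(code, scope)
    return scope["solve"](*inputs)

def evaluate(code: str, test_cases):
    """Steps 1-4 of the original flow: run, compare, fail or pass."""
    for inputs, expected in test_cases:
        output = run_in_sandbox(code, inputs)
        if output != expected:
            return {"status": "failed",
                    "detail": f"expected {expected!r}, got {output!r}"}
    return {"status": "correct"}
```

Note what's missing: no record of *how* the user got here. Every submission is judged in isolation, which is exactly the limitation the rest of this post is about.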

For partial solutions, I tried to be helpful by detecting patterns:

```python
if "recursion" in code and fails_test_cases:
    return "Check your base condition."
```

This worked… until it didn’t.

Two users could submit nearly identical incorrect solutions for completely different reasons. One misunderstood recursion. The other had an off-by-one bug. Same output, totally different thinking.

My system treated them the same.

That’s where Hindsight changed things.


What I thought Hindsight would do

Going in, I assumed Hindsight was just “better context storage”—basically a structured way to keep user history and feed it into prompts.

Something like:

```python
memory = hindsight.get_user_history(user_id)
prompt = f"""
User history:
{memory}

Current code:
{code}

Give feedback.
"""
```

This is technically correct—and also mostly useless.

Dumping more history into a prompt doesn’t magically make feedback better. It just makes it noisier.


What I actually built with Hindsight

The shift was subtle but important: I stopped treating memory as context and started treating it as behavioral signals.

Instead of asking “what has the user done?”, I started asking:

“What patterns does this user repeat when they fail?”

That led to a different integration pattern.

Storing attempts as structured events

Every submission became a structured event:

```python
event = {
    "user_id": user_id,
    "problem_id": problem_id,
    "code": code,
    "error_type": classify_error(code, output),
    "timestamp": now()
}

hindsight.store(event)
```

The key here wasn’t just storing code—it was attaching interpretation (error_type).

This made retrieval meaningful.
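The article's classification is heuristic (more on that below), but the shape is easy to show. Here's a toy `classify_error`—the labels and string checks are illustrative assumptions, not the actual rules; a real version would inspect tracebacks and ASTs.

```python
def classify_error(code: str, output: str) -> str:
    """Map a failed run to a coarse, queryable label.

    Purely heuristic string matching on the run output -- the point is
    the shape (code + output in, one interpretable tag out), not the
    specific rules.
    """
    if "RecursionError" in output:
        return "recursion_base_case"
    if "IndexError" in output:
        return "off_by_one"
    if "TypeError" in output and "NoneType" in output:
        return "missing_return"
    return "wrong_output"
```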


Retrieving patterns, not history

Instead of pulling the last N attempts, I started querying for patterns:

```python
patterns = hindsight.query(
    user_id=user_id,
    filters={"error_type": "recursion_base_case"},
    limit=3
)
```

Now I wasn’t feeding raw history into the AI. I was feeding evidence of repeated mistakes.

That changed the feedback completely.
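To make "evidence, not history" concrete: the prompt builder collapses the pattern list into one summary line instead of dumping attempts. A sketch, assuming `patterns` is the list of matching events returned above (the exact field names are illustrative):

```python
def build_feedback_prompt(code: str, patterns: list) -> str:
    """Summarize repeated mistakes as one line of evidence
    instead of pasting raw attempt history into the prompt."""
    if patterns:
        evidence = (
            f"This user has made the same class of mistake "
            f"({patterns[0]['error_type']}) {len(patterns)} times recently."
        )
    else:
        evidence = "No repeated mistake pattern for this user."
    return f"{evidence}\n\nCurrent code:\n{code}\n\nGive one short, targeted hint."
```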


Feedback generation shifted from static → adaptive

Before:

```python
return "Check your recursion base condition."
```

After:

```python
if patterns:
    return f"""
You've missed the base condition in recursion multiple times.
Look at your stopping case carefully—what should happen when input is minimal?
"""
```

This sounds simple, but it made feedback feel intentional instead of generic.


Where it broke (and why)

The first version of this system overfit immediately.

If a user made one recursion mistake early on, every future problem triggered recursion-related hints—even when irrelevant.

The issue wasn’t Hindsight. It was how I used it.

I was treating all past behavior equally.


Fixing it: adding decay and relevance

I had to introduce two constraints:

1. Time decay

Older mistakes matter less:

```python
def weight(event):
    age = now() - event["timestamp"]
    return max(0.1, 1 / (1 + age.days))
```

2. Problem context filtering

Only consider similar problems:

```python
patterns = hindsight.query(
    user_id=user_id,
    filters={
        "error_type": current_error,
        "problem_tag": current_problem_tag
    }
)
```

This dramatically reduced noise.
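Both constraints can be applied in one pass. This sketch filters a plain list standing in for a `hindsight.query` result; the fixed clock, field names, and the `min_weight` threshold are illustrative assumptions, not tuned values.

```python
from datetime import datetime

FIXED_NOW = datetime(2024, 6, 15)  # fixed clock so the example is deterministic

def weight(event, now=FIXED_NOW):
    # Same decay curve as above: today -> 1.0, floor at 0.1.
    age = now - event["timestamp"]
    return max(0.1, 1 / (1 + age.days))

def relevant_patterns(events, current_error, current_tag, min_weight=0.2):
    """Context filter + time decay in one pass.

    Keeps only events that (a) match the current error and problem tag
    and (b) are recent enough to still carry weight.
    """
    return [
        e for e in events
        if e["error_type"] == current_error
        and e["problem_tag"] == current_tag
        and weight(e) >= min_weight
    ]
```

A month-old recursion mistake now decays below the threshold and stops triggering hints, which is exactly the "haunted" behavior the overfit version had.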


A concrete before vs after

Before Hindsight

User submits:

```python
def sum(n):
    if n == 1:
        return 1
    return n + sum(n-1)
```

Fails for n = 0: the recursion never reaches the base case and blows the stack.

Feedback:

“Check your base condition.”

Not wrong. Not helpful either.


After Hindsight

Hindsight sees:

  • Same user failed 3 times on missing base cases
  • All related to edge inputs (0, empty, null)

Feedback becomes:

“You tend to handle only the ‘main’ case in recursion and skip edge inputs like 0. What should happen when n = 0 here?”

This is a completely different experience.

It’s not just explaining the problem—it’s explaining the user’s pattern.


What this changed in the system

This one shift—using Hindsight for pattern detection instead of history dumping—rippled through the architecture.

  • Feedback became stateful without becoming messy
  • Hints became shorter but more relevant
  • The system stopped over-explaining and started nudging

I also stopped relying on increasingly complex prompt engineering.

Instead, I focused on shaping better inputs.

If you’re curious how Hindsight structures and retrieves these signals, their official Hindsight documentation explains the retrieval model clearly. It’s closer to querying behavior than storing logs.


The architecture, simplified

At this point, the flow looks like:

```
User Code → Execution → Error Classification
           ↓
       Hindsight Store
           ↓
    Pattern Retrieval
           ↓
      Feedback Engine
```

The important part is that Hindsight sits between evaluation and feedback, not just as a passive store.
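Wired together, the flow looks roughly like this. Everything here is a stand-in: an in-memory list plays the Hindsight store, and `retrieve` mimics the filtered query from earlier.

```python
class FeedbackPipeline:
    """Sketch of the evaluation -> memory -> feedback flow."""

    def __init__(self):
        self.events = []  # in-memory stand-in for the Hindsight store

    def store(self, event):
        self.events.append(event)

    def retrieve(self, user_id, error_type):
        # Mimics a filtered hindsight.query: same user, same error class.
        return [e for e in self.events
                if e["user_id"] == user_id and e["error_type"] == error_type]

    def feedback(self, user_id, problem_id, code, error_type):
        # Memory sits between evaluation and feedback, not off to the side.
        self.store({"user_id": user_id, "problem_id": problem_id,
                    "code": code, "error_type": error_type})
        patterns = self.retrieve(user_id, error_type)
        if len(patterns) > 1:
            return f"Repeated pattern: {error_type} ({len(patterns)} times)."
        return f"First occurrence of {error_type}."
```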

If you want a broader look at how memory systems like this fit into agent design, the agent memory overview on Vectorize is a good reference.


What still isn’t great

This system is better, but not perfect.

  • Error classification is still heuristic
  • Similarity between problems is coarse (tags, not embeddings)
  • Some users get “stuck” in a pattern loop where the system keeps nudging the same issue

Also, debugging this is painful.

When feedback feels wrong, it’s not obvious whether the issue is:

  • bad retrieval
  • bad classification
  • or bad prompt shaping

Everything looks correct in isolation.


Lessons I’d carry forward

1. Memory is useless without structure

Dumping past attempts into prompts doesn’t work. You need interpretation layers (like error_type) to make memory actionable.


2. Patterns > history

Raw logs are noisy. Repeated behavior is signal.

Design your retrieval around patterns, not timelines.
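One concrete way to read "patterns, not timelines": aggregate before you feed anything to the model. A sketch with a plain `Counter`—the repetition threshold is an assumption, and the aggregation is the point, not the data structure.

```python
from collections import Counter

def dominant_pattern(events, min_count=2):
    """Collapse raw attempt logs into one behavioral signal.

    Returns the most repeated error_type, or None if nothing
    repeats at least min_count times.
    """
    counts = Counter(e["error_type"] for e in events)
    if not counts:
        return None
    error_type, count = counts.most_common(1)[0]
    return error_type if count >= min_count else None
```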


3. Decay is not optional

Without time decay, your system becomes biased toward old mistakes.

This shows up fast and feels “haunted” from a user perspective.


4. Partial solutions are the real gold

Correct answers don’t tell you much.

Failed attempts + retries tell you exactly how someone thinks.

That’s where Hindsight actually shines.


5. Simpler prompts, better inputs

I spent way too long tweaking prompts.

The real improvement came from feeding the model better, more structured signals.


Closing

I started this project thinking feedback meant explaining why code failed.

What I ended up building was a system that tries to understand how someone fails repeatedly—and nudges them out of it.

Hindsight didn’t make the system smarter by itself. It just forced me to stop treating user behavior as logs and start treating it as data worth modeling.

That one shift made partial solutions more valuable than correct ones—and that wasn’t something I expected going in.
