Our AI tutor used to treat every submission like the first one—until Hindsight let it connect attempts, and suddenly it stopped explaining errors and started predicting them.
## What I Built (and Why It Wasn’t Working)
I’ve been working on a coding practice platform that behaves less like a judge and more like a tutor. You write code, run it, get feedback—but instead of a binary pass/fail, the system tries to explain what went wrong and nudge you forward.
At a high level, the system is pretty straightforward:
- A frontend where users write and submit code
- A Python backend that handles submissions and orchestration
- An execution layer (sandboxed) that runs code safely
- An AI layer that analyzes output and generates feedback
- A memory layer where things got interesting
The first four parts worked fine. Code executed, errors were captured, feedback was generated.
But something felt off.
The tutor wasn’t actually teaching. It was just reacting.
## The Problem: Stateless Feedback Is Useless
Every time a user submitted code, the system treated it as a completely new event.
It didn’t matter if:
- The user had already failed 3 times
- The mistake was identical to the previous attempt
- The hint we gave last time clearly didn’t help
We still generated feedback like this:
“You might want to check your recursion base case.”
Again. And again. And again.
From the system’s perspective, this was correct. From the user’s perspective, it was useless.
I initially thought I could solve this with a bigger prompt—just shove previous attempts into the context window and let the model “figure it out.”
That worked… until it didn’t:
- Context got bloated quickly
- Important patterns were lost in noise
- Cost and latency increased
- Behavior was inconsistent
What I needed wasn’t more context.
I needed memory that meant something.
## The Shift: Storing Attempts as First-Class Data
The turning point was when I stopped thinking of submissions as isolated requests and started treating them as a sequence.
Instead of this:
```python
def handle_submission(code, problem_id):
    result = execute(code)
    feedback = generate_feedback(code, result)
    return feedback
```
I moved to something closer to this:
```python
def handle_submission(user_id, code, problem_id):
    result = execute(code)
    event = {
        "user_id": user_id,
        "problem_id": problem_id,
        "code": code,
        "result": result,
    }
    store_event(event)
    history = get_recent_events(user_id, problem_id)
    feedback = generate_feedback(code, result, history)
    return feedback
```
This looks obvious in hindsight (no pun intended), but it alone didn’t fix the problem.
I had history, but I didn’t have learning.
The system still wasn’t connecting patterns. It just had more data to ignore.
## Enter Hindsight: Turning History Into Behavior
This is where I integrated Hindsight, which is essentially a structured memory layer for agents.
Instead of dumping raw history into prompts, Hindsight lets you:
- Store interactions as structured events
- Retrieve relevant past patterns
- Feed them back into decision-making in a controlled way
I wired it into the submission pipeline so that every attempt becomes a memory entry.
Conceptually, it looked like this:
```python
memory.store(
    user_id=user_id,
    input=code,
    output=result,
    metadata={
        "problem_id": problem_id,
        "error_type": classify_error(result),
    },
)
```
And on the next submission:
```python
patterns = memory.retrieve(
    user_id=user_id,
    problem_id=problem_id,
    limit=3,
)
feedback = generate_feedback(code, result, patterns)
```
The key difference: I wasn’t passing all history. I was passing relevant history.
If a user repeatedly messed up recursion, the system would surface those specific attempts—not everything they’d ever done.
If you’re curious how this works in detail, the Hindsight documentation explains the retrieval and structuring patterns pretty clearly.
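The feedback step itself isn’t shown above, so here is a minimal sketch of how retrieved attempts might be folded into a tutoring prompt. The function name `build_feedback_prompt` and the `code`/`result` keys on each pattern are illustrative assumptions on my part, not part of Hindsight’s API:

```python
def build_feedback_prompt(code, result, patterns):
    """Fold a handful of retrieved attempts into a tutoring prompt.

    `patterns` is assumed to be a list of dicts with "code" and
    "result" keys, newest first, as returned by the retrieval step.
    """
    history_lines = []
    for i, attempt in enumerate(patterns, start=1):
        history_lines.append(
            f"Previous attempt {i}:\n{attempt['code']}\nOutcome: {attempt['result']}"
        )
    history_block = "\n\n".join(history_lines) if history_lines else "No prior attempts."
    return (
        "You are a coding tutor. Compare the current attempt to the "
        "history and point out repeated mistakes explicitly.\n\n"
        f"{history_block}\n\n"
        f"Current attempt:\n{code}\nOutcome: {result}"
    )
```

Because the retrieval step already capped and filtered the history, the prompt stays small no matter how many attempts a user has made.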
## The First Real Change in Behavior
This is where things got interesting.
Before Hindsight:

- User writes incorrect recursion → System says: “Check base case.”
- User repeats the same mistake → System says: “Check base case.”

After Hindsight:

- User repeats the mistake → System says: “You’re making the same recursion mistake as your previous attempt—your base case still doesn’t stop when `n == 0`. Try returning 1 there.”
That’s a completely different experience.
It’s not just explaining the error. It’s connecting the dots.
## Why This Worked (and My First Attempt Didn’t)
The difference wasn’t just “adding memory.” It was how memory was used.
My earlier approach:
- Dump raw history into prompt
- Hope the model extracts patterns
Hindsight approach:
- Store structured events
- Retrieve relevant ones
- Inject them intentionally
This aligns much more closely with how we’d design any other system:
- Index data
- Query by relevance
- Use results predictably
If you look at how agent memory is framed more broadly, this is exactly the pattern described in systems like Vectorize’s agent memory architecture overview.
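The index/query/use shape can be sketched without any framework at all. The toy store below is not Hindsight, and its method names are hypothetical; it just shows the same pattern of structured events in, relevance-filtered events out:

```python
from collections import defaultdict


class AttemptStore:
    """Toy in-memory store illustrating the index/query/use pattern."""

    def __init__(self):
        # Index events by (user_id, problem_id) so lookups are targeted.
        self._by_key = defaultdict(list)

    def store(self, user_id, problem_id, code, result, error_type):
        self._by_key[(user_id, problem_id)].append({
            "code": code,
            "result": result,
            "error_type": error_type,
        })

    def retrieve(self, user_id, problem_id, error_type=None, limit=3):
        events = self._by_key[(user_id, problem_id)]
        if error_type is not None:
            events = [e for e in events if e["error_type"] == error_type]
        # Most recent first, capped, so prompt size stays bounded.
        return list(reversed(events))[:limit]
```

Swapping this toy out for a real memory layer changes the storage and ranking, but not the shape of the pipeline around it.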
## A Small but Important Design Choice
One thing that made a big difference: explicit error classification.
Instead of storing just raw outputs, I added a lightweight classifier:
```python
def classify_error(result):
    # `result` here is the captured traceback text from the sandbox run.
    if "RecursionError" in result:
        return "recursion"
    if "IndexError" in result:
        return "index"
    return "other"
```
This allowed retrieval to be more targeted:
```python
memory.retrieve(
    user_id=user_id,
    filters={"error_type": "recursion"},
)
```
Without this, the system sometimes pulled irrelevant past attempts, which diluted feedback quality.
This is one of those small, boring engineering decisions that ended up mattering more than expected.
## What Still Doesn’t Work Well
It’s not perfect.
A few things are still rough:
**Overfitting to past mistakes.** Sometimes the system assumes the user is repeating an error when they’re actually trying something new.

**Cold start problem.** First-time users still get generic feedback.

**Memory drift.** Old mistakes can become irrelevant, but still show up if not filtered properly.
I haven’t fully solved these yet. I’m experimenting with:
- Decay functions for memory relevance
- Weighting recent attempts higher
- Separating “resolved” vs “active” mistakes
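As a sketch of the first two experiments, a recency-weighted decay score might look like this. The `timestamp` and `resolved` fields, the half-life value, and the 0.25 penalty are all assumptions for illustration, not something the system above already stores:

```python
import math
import time


def relevance_score(event, now=None, half_life_s=7 * 24 * 3600):
    """Score a past attempt by recency, with exponential decay.

    Assumes each event dict carries a "timestamp" (epoch seconds) and
    an optional "resolved" flag; both are illustrative field names.
    """
    now = time.time() if now is None else now
    age = max(0.0, now - event["timestamp"])
    score = math.exp(-math.log(2) * age / half_life_s)  # halves every week
    if event.get("resolved"):
        score *= 0.25  # fixed mistakes should rarely resurface
    return score


def rank_events(events, limit=3, now=None):
    # Weight recent, unresolved mistakes highest before injecting them.
    return sorted(events, key=lambda e: relevance_score(e, now=now), reverse=True)[:limit]
```

Tuning the half-life per error type (recursion misconceptions fade slower than typos) is the part I haven’t settled yet.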
## What I Learned Building This
A few things I’d carry forward into any system like this:
1. More context is not better context
Dumping everything into a prompt is lazy and doesn’t scale. Retrieval beats brute force every time.
2. Memory needs structure, not just storage
Raw logs are not memory. Indexing, filtering, and shaping data is what makes it useful.
3. Small metadata decisions matter a lot
That simple `error_type` field improved retrieval quality more than any prompt tweak I tried.
4. Behavior > accuracy
The biggest improvement wasn’t “better explanations.” It was different behavior—the system started referencing past attempts.
5. Teaching requires continuity
If your system doesn’t connect attempts, it’s not a tutor. It’s just a smarter compiler error.
## Where This Leaves the System
Right now, the tutor does something I didn’t expect it to do this early: it adapts.
Not perfectly, not consistently—but enough that you notice.
You can make the same mistake twice, and it won’t respond the same way.
That alone changed how the system feels.
It stopped being reactive and started being contextual.
And that shift didn’t come from a better model.
It came from giving the system a memory it could actually use.