Our AI tutor used to treat every submission like the first one—until Hindsight let it connect attempts, and suddenly it stopped explaining errors and started predicting them.
## What I Built (and Why It Wasn’t Working)
I’ve been working on a coding practice platform that behaves less like a judge and more like a tutor. You write code, run it, get feedback—but instead of a binary pass/fail, the system tries to explain what went wrong and nudge you forward.
At a high level, the system is pretty straightforward:
- A frontend where users write and submit code
- A Python backend that handles submissions and orchestration
- An execution layer (sandboxed) that runs code safely
- An AI layer that analyzes output and generates feedback
- A memory layer where things got interesting
The first four parts worked fine. Code executed, errors were captured, feedback was generated.
But something felt off.
The tutor wasn’t actually teaching. It was just reacting.
## The Problem: Stateless Feedback Is Useless
Every time a user submitted code, the system treated it as a completely new event.
It didn’t matter if:
- The user had already failed 3 times
- The mistake was identical to the previous attempt
- The hint we gave last time clearly didn’t help
We still generated feedback like this:
“You might want to check your recursion base case.”
Again. And again. And again.
From the system’s perspective, this was correct. From the user’s perspective, it was useless.
I initially thought I could solve this with a bigger prompt—just shove previous attempts into the context window and let the model “figure it out.”
That worked… until it didn’t:
- Context got bloated quickly
- Important patterns were lost in noise
- Cost and latency increased
- Behavior was inconsistent
What I needed wasn’t more context.
I needed memory that meant something.
## The Shift: Storing Attempts as First-Class Data
The turning point was when I stopped thinking of submissions as isolated requests and started treating them as a sequence.
Instead of this:
```python
def handle_submission(code, problem_id):
    result = execute(code)
    feedback = generate_feedback(code, result)
    return feedback
```
I moved to something closer to this:
```python
def handle_submission(user_id, code, problem_id):
    result = execute(code)
    event = {
        "user_id": user_id,
        "problem_id": problem_id,
        "code": code,
        "result": result,
    }
    store_event(event)
    history = get_recent_events(user_id, problem_id)
    feedback = generate_feedback(code, result, history)
    return feedback
```
This looks obvious in hindsight (no pun intended), but it alone didn’t fix the problem.
I had history, but I didn’t have learning.
The system still wasn’t connecting patterns. It just had more data to ignore.
## Enter Hindsight: Turning History Into Behavior
This is where I integrated Hindsight, which is essentially a structured memory layer for agents.
Instead of dumping raw history into prompts, Hindsight lets you:
- Store interactions as structured events
- Retrieve relevant past patterns
- Feed them back into decision-making in a controlled way
I wired it into the submission pipeline so that every attempt becomes a memory entry.
Conceptually, it looked like this:
```python
memory.store(
    user_id=user_id,
    input=code,
    output=result,
    metadata={
        "problem_id": problem_id,
        "error_type": classify_error(result),
    },
)
```
And on the next submission:
```python
patterns = memory.retrieve(
    user_id=user_id,
    problem_id=problem_id,
    limit=3,
)
feedback = generate_feedback(code, result, patterns)
```
The key difference: I wasn’t passing all history. I was passing relevant history.
If a user repeatedly messed up recursion, the system would surface those specific attempts—not everything they’d ever done.
If you’re curious how this works in detail, the Hindsight documentation explains the retrieval and structuring patterns pretty clearly.
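The feedback step itself isn’t shown above, so here is a minimal sketch of how retrieved attempts might be folded into a tutoring prompt. The function name `build_feedback_prompt` and the `code`/`result` keys on each pattern are illustrative assumptions on my part, not part of Hindsight’s API:

```python
def build_feedback_prompt(code, result, patterns):
    """Fold a handful of retrieved attempts into a tutoring prompt.

    `patterns` is assumed to be a list of dicts with "code" and
    "result" keys, newest first, as returned by the retrieval step.
    """
    history_lines = []
    for i, attempt in enumerate(patterns, start=1):
        history_lines.append(
            f"Previous attempt {i}:\n{attempt['code']}\nOutcome: {attempt['result']}"
        )
    history_block = "\n\n".join(history_lines) if history_lines else "No prior attempts."
    return (
        "You are a coding tutor. Compare the current attempt to the "
        "history and point out repeated mistakes explicitly.\n\n"
        f"{history_block}\n\n"
        f"Current attempt:\n{code}\nOutcome: {result}"
    )
```

Because the retrieval step already capped and filtered the history, the prompt stays small no matter how many attempts a user has made.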
## The First Real Change in Behavior
This is where things got interesting.
Before Hindsight:

- User writes incorrect recursion → System says: “Check base case.”
- User repeats the same mistake → System says: “Check base case.”

After Hindsight:

- User repeats the mistake → System says: “You’re making the same recursion mistake as your previous attempt—your base case still doesn’t stop when `n == 0`. Try returning 1 there.”
That’s a completely different experience.
It’s not just explaining the error. It’s connecting the dots.
## Why This Worked (and My First Attempt Didn’t)
The difference wasn’t just “adding memory.” It was how memory was used.
My earlier approach:
- Dump raw history into prompt
- Hope the model extracts patterns
Hindsight approach:
- Store structured events
- Retrieve relevant ones
- Inject them intentionally
This aligns much more closely with how we’d design any other system:
- Index data
- Query by relevance
- Use results predictably
If you look at how agent memory is framed more broadly, this is exactly the pattern described in systems like Vectorize’s agent memory architecture overview.
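The index/query/use shape can be sketched without any framework at all. The toy store below is not Hindsight, and its method names are hypothetical; it just shows the same pattern of structured events in, relevance-filtered events out:

```python
from collections import defaultdict


class AttemptStore:
    """Toy in-memory store illustrating the index/query/use pattern."""

    def __init__(self):
        # Index events by (user_id, problem_id) so lookups are targeted.
        self._by_key = defaultdict(list)

    def store(self, user_id, problem_id, code, result, error_type):
        self._by_key[(user_id, problem_id)].append({
            "code": code,
            "result": result,
            "error_type": error_type,
        })

    def retrieve(self, user_id, problem_id, error_type=None, limit=3):
        events = self._by_key[(user_id, problem_id)]
        if error_type is not None:
            events = [e for e in events if e["error_type"] == error_type]
        # Most recent first, capped, so prompt size stays bounded.
        return list(reversed(events))[:limit]
```

Swapping this toy out for a real memory layer changes the storage and ranking, but not the shape of the pipeline around it.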
## A Small but Important Design Choice
One thing that made a big difference: explicit error classification.
Instead of storing just raw outputs, I added a lightweight classifier:
```python
def classify_error(result):
    # `result` here is the captured traceback text from the sandbox run.
    if "RecursionError" in result:
        return "recursion"
    if "IndexError" in result:
        return "index"
    return "other"
```
This allowed retrieval to be more targeted:
```python
memory.retrieve(
    user_id=user_id,
    filters={"error_type": "recursion"},
)
```
Without this, the system sometimes pulled irrelevant past attempts, which diluted feedback quality.
This is one of those small, boring engineering decisions that ended up mattering more than expected.
## What Still Doesn’t Work Well
It’s not perfect.
A few things are still rough:
**Overfitting to past mistakes.** Sometimes the system assumes the user is repeating an error when they’re actually trying something new.

**Cold start problem.** First-time users still get generic feedback.

**Memory drift.** Old mistakes can become irrelevant, but still show up if not filtered properly.
I haven’t fully solved these yet. I’m experimenting with:
- Decay functions for memory relevance
- Weighting recent attempts higher
- Separating “resolved” vs “active” mistakes
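As a sketch of the first two experiments, a recency-weighted decay score might look like this. The `timestamp` and `resolved` fields, the half-life value, and the 0.25 penalty are all assumptions for illustration, not something the system above already stores:

```python
import math
import time


def relevance_score(event, now=None, half_life_s=7 * 24 * 3600):
    """Score a past attempt by recency, with exponential decay.

    Assumes each event dict carries a "timestamp" (epoch seconds) and
    an optional "resolved" flag; both are illustrative field names.
    """
    now = time.time() if now is None else now
    age = max(0.0, now - event["timestamp"])
    score = math.exp(-math.log(2) * age / half_life_s)  # halves every week
    if event.get("resolved"):
        score *= 0.25  # fixed mistakes should rarely resurface
    return score


def rank_events(events, limit=3, now=None):
    # Weight recent, unresolved mistakes highest before injecting them.
    return sorted(events, key=lambda e: relevance_score(e, now=now), reverse=True)[:limit]
```

Tuning the half-life per error type (recursion misconceptions fade slower than typos) is the part I haven’t settled yet.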
## What I Learned Building This
A few things I’d carry forward into any system like this:
1. More context is not better context
Dumping everything into a prompt is lazy and doesn’t scale. Retrieval beats brute force every time.
2. Memory needs structure, not just storage
Raw logs are not memory. Indexing, filtering, and shaping data is what makes it useful.
3. Small metadata decisions matter a lot
That simple `error_type` field improved retrieval quality more than any prompt tweak I tried.
4. Behavior > accuracy
The biggest improvement wasn’t “better explanations.” It was different behavior—the system started referencing past attempts.
5. Teaching requires continuity
If your system doesn’t connect attempts, it’s not a tutor. It’s just a smarter compiler error.
## Where This Leaves the System
Right now, the tutor does something I didn’t expect it to do this early: it adapts.
Not perfectly, not consistently—but enough that you notice.
You can make the same mistake twice, and it won’t respond the same way.
That alone changed how the system feels.
It stopped being reactive and started being contextual.
And that shift didn’t come from a better model.
It came from giving the system a memory it could actually use.