Kinjal Jain | Team Clarion | Code Mentor AI
"Six of us. Six different laptops. Six different weeks. The same off-by-one error on a loop problem and every single time, LeetCode handed us the same generic hint like it had never seen us before. Scrimba taught us beautifully, but the moment we closed the tab, it forgot us completely. That gap between a platform that grades you and one that actually learns you kept coming up every single time our team sat down to talk about what we'd build. Code Mentor AI started as a shared frustration across six people. Hindsight is what turned it into a solution."
The Problem With Every Platform We Loved
LeetCode grades you. Scrimba teaches you. But neither of them remembers you. Every session is a clean slate. The platform has no idea you've failed the same loop boundary problem three times, or that you always forget to handle null before dereferencing. You get the same hint. You make the same mistake. You wonder why you're not improving.
Code Mentor AI is our answer to that. It's a personalized coding tutor built by Team Clarion that remembers every mistake a student makes, classifies it, and uses that history to teach smarter. The long-term memory layer powering it is Hindsight — open-source agent memory by Vectorize. (https://github.com/vectorize-io/hindsight)
My Piece
1) The Bug Fingerprint Engine
I built the Bug Fingerprint Engine, the feature that is the foundation of everything else. Every time a student makes a mistake, a FastAPI endpoint classifies it using the Groq LLM, embeds it using CodeBERT, and stores both the classification and the vector in PostgreSQL. Over time this builds a per-student weakness fingerprint.
The classification prompt is deliberately narrow:
```python
# LLM classification call (Groq)
prompt = f"""
Given this code and error, classify the mistake.
Choose exactly one: off-by-one | null/undefined |
wrong-loop-condition | logic-error | syntax-error | other.
Code: {code}
Error: {error}
"""
classification = await groq_client.complete(prompt)
```
Narrow classification matters. We tried open-ended prompts first, and the LLM would write paragraphs. Forcing a single label made the fingerprint actually queryable and useful for downstream features.
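Once every mistake carries a single label, building the per-student fingerprint reduces to aggregation. Here is a minimal sketch of that idea; the function name and record shape are hypothetical, and the real system also stores a CodeBERT embedding per record in PostgreSQL, which is omitted here:

```python
from collections import Counter

def weakness_fingerprint(mistakes, top_n=3):
    """Aggregate a student's classified mistakes into their most
    frequent weakness categories (hypothetical helper; the actual
    engine also persists a CodeBERT vector per mistake)."""
    counts = Counter(m["category"] for m in mistakes)
    return counts.most_common(top_n)
```

Because every record is one of six labels, a query like "what does this student struggle with most?" becomes a single counter lookup instead of a free-text search.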
2) The Adaptive Onboarding Test
The second feature I owned was the adaptive onboarding test. The problem with flat quizzes is that they waste everyone's time: a strong student sits through beginner questions, while a beginner gets overwhelmed by hard ones. We implemented a simplified 3-parameter Item Response Theory (IRT) model instead.
After each answer, the system updates an ability estimate θ using Bayesian updating and picks the next question that maximizes Fisher information at that θ. After 8–10 questions the estimate stabilises and seeds the student's initial weakness profile in Hindsight — so the memory system starts with a real prior, not a blank slate.
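The two moving parts above can be sketched in a few functions: the 3PL probability model, its Fisher information, a grid-based Bayesian update of θ, and a selector for the most informative next item. This is a simplified illustration under standard 3PL formulas, not the team's actual code; the item-bank shape is an assumption:

```python
import math

def p_correct(theta, a, b, c):
    """3PL probability a student with ability theta answers correctly
    (a: discrimination, b: difficulty, c: guessing floor)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_correct(theta, a, b, c)
    return (a ** 2) * ((p - c) ** 2 / (1 - c) ** 2) * ((1 - p) / p)

def update_posterior(grid, probs, correct, a, b, c):
    """Bayesian update of a discretised ability posterior after one answer."""
    lik = [p_correct(t, a, b, c) if correct else 1 - p_correct(t, a, b, c)
           for t in grid]
    post = [pr * lk for pr, lk in zip(probs, lik)]
    z = sum(post)
    return [p / z for p in post]

def next_question(theta, bank):
    """Pick the item whose Fisher information is maximal at the current theta."""
    return max(bank, key=lambda item: fisher_info(theta, *item["irt"]))
```

Maximising Fisher information is what makes the test adaptive: an item far from the student's current θ contributes almost nothing, so the selector naturally homes in on questions near their ability level.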
3) The Hindsight Student Onboarding Dataset
You can't validate an IRT model without data. So I built it. I created and documented a dataset of 100 simulated student onboarding records as part of the Hindsight intelligent tutoring system project — capturing each student's baseline skill profile, learning gaps, session history, and IRT-derived ability estimates.
This dataset is what we used to test and validate the entire onboarding module before a single real student touched it. Building the data before the product taught me more about the feature than building the feature itself did. When you have to simulate 100 realistic learners, you start to truly understand the edge cases your model needs to handle.
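As a rough illustration of what "simulating a learner" means, here is a minimal sketch that samples an ability value and a normalised gap profile over the same six mistake categories the classifier uses. The field names and sampling choices are hypothetical, not the actual dataset schema:

```python
import random

CATEGORIES = ["off-by-one", "null/undefined", "wrong-loop-condition",
              "logic-error", "syntax-error", "other"]

def simulate_student(student_id, rng):
    """Generate one simulated onboarding record (illustrative schema)."""
    theta = rng.gauss(0, 1)                      # latent ability
    weights = [rng.random() for _ in CATEGORIES]
    total = sum(weights)
    gaps = {c: w / total for c, w in zip(CATEGORIES, weights)}
    return {"id": student_id, "ability": theta, "gap_profile": gaps}

def build_dataset(n=100, seed=42):
    """Build a reproducible batch of simulated learners."""
    rng = random.Random(seed)
    return [simulate_student(f"s{i:03d}", rng) for i in range(n)]
```

Seeding the generator matters: a reproducible batch means a failing validation run can be replayed exactly, which is most of the point of testing against synthetic learners.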
4) Working on the Article & LinkedIn Prompts
Beyond my AI features, I also worked on crafting the content strategy for the team: writing the article structure and the LinkedIn prompt templates each member used to tell their own story. Getting six people with six different roles to each publish something authentic was its own engineering problem.
Before / After Hindsight
Before: A student makes an off-by-one error. Code Mentor AI says "Check your loop condition." The student comes back two days later, makes the same mistake, gets the same hint.
After: The Bug Fingerprint has logged three off-by-one errors. The Socratic hint system reads the fingerprint and asks: "You've hit this boundary condition before — what is your loop doing when left and right are equal?" The student pauses. They fix it themselves.
The Non-Obvious Lesson
I assumed storing more mistake history would produce better hints. It didn't. When the LLM received a student's entire mistake log, hints became vague and over-hedged. The fix was limiting recall to the 5 most similar past mistakes. Specificity beats completeness in memory retrieval: the agent got sharper when it knew less, but knew the right less.
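The "5 most similar" recall step is just a top-k nearest-neighbour lookup over the stored mistake embeddings. A minimal sketch using plain cosine similarity (the function names are hypothetical; in production this kind of ranking would typically run inside the vector store rather than in Python):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def recall_similar(query_vec, past_mistakes, k=5):
    """Return only the k past mistakes most similar to the current one,
    so the hint prompt stays specific instead of drowning in history."""
    ranked = sorted(past_mistakes,
                    key=lambda m: cosine(query_vec, m["vec"]),
                    reverse=True)
    return ranked[:k]
```

Capping k is the whole trick: the hint prompt sees three near-identical off-by-one errors instead of forty unrelated ones, which is what keeps the Socratic question pointed.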
If you're building something similar, the Hindsight documentation (https://hindsight.vectorize.io/) and its agent memory primitives (https://vectorize.io/features/agent-memory) are worth reading before you design your schema.
Team Clarion
- Aanchal & Pranati — Backend architecture & database
- Lakshay — Full frontend integration with Next.js
- Kinjal — Bug Fingerprint Engine, Adaptive Onboarding, Hindsight Student Dataset, Content Strategy
- Aman & Priyanshu — Dynamic AI models & AI assistance features