Krishna Hasare
My Coding Agent Remembered My Mistakes I Never Told It To



I was testing our app for the third time that night. Same username, new session. I expected it to greet me like a stranger.

Instead it said: "Last time you struggled with dynamic programming problems. Want to try something that builds on that?"

I stared at the screen for a second. I hadn't written any personalization logic. No if-statements. No special greeting templates. No "remember this user" flag anywhere in the codebase.

It just... knew.

That was the moment I actually understood what Hindsight was doing under the hood — and why it's different from everything I'd tried before.

Try it live: codemento.netlify.app


What We Built

CodeMentor AI is an AI-powered coding practice mentor built at HackWithIndia 2026. It looks like VS Code — Monaco Editor in the center, 75 LeetCode-style problems on the left, AI mentor chat on the right.

The idea was simple: every coding platform forgets you the moment you close the tab. We wanted to build one that didn't. A mentor that remembers your mistakes, your patterns, your weak spots — and uses that to actually help you improve.

The AI mentor is powered by Groq's llama-3.3-70b. The memory layer is powered by Hindsight. My job was making those two things work together in a way that felt natural — not robotic, not scripted, just genuinely useful.


The Problem I Was Actually Solving

Before Hindsight, our AI mentor was technically correct and completely useless.

It would give good hints. Generic good hints. The kind that are accurate for any developer in any situation, which means they're perfectly calibrated for no developer in any specific situation.

Someone fails a dynamic programming problem for the third time in a row. The mentor says: "Consider breaking this into subproblems." Correct. Useless. It said the exact same thing the first two times.

The mentor had no memory of the first two times. Every session was session one.

I tried fixing this with localStorage — dumping chat history into the browser and reloading it next session. It broke almost immediately. After three sessions you're feeding 40,000 tokens of irrelevant context into every LLM call. Responses got slower, vaguer, worse.

What I needed wasn't more context. I needed smarter context.
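For scale, here's a rough sketch of why the transcript-dumping approach blows up. The names and numbers are illustrative, not the actual CodeMentor code: the point is that every LLM call pays for every message from every past session.

```javascript
// Rough 4-characters-per-token heuristic, good enough for a sanity check.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// The naive approach: concatenate every past session's transcript.
function buildNaiveContext(sessions) {
  // sessions: array of { messages: [{ role, content }] }
  return sessions
    .flatMap((s) => s.messages)
    .map((m) => `${m.role}: ${m.content}`)
    .join('\n');
}

// Simulate three sessions of back-and-forth chat.
const sessions = Array.from({ length: 3 }, () => ({
  messages: Array.from({ length: 40 }, (_, i) => ({
    role: i % 2 ? 'assistant' : 'user',
    content: 'a fairly long hint or question, repeated to simulate real chat. '.repeat(5),
  })),
}));

const context = buildNaiveContext(sessions);
console.log(estimateTokens(context)); // grows linearly with session count
```

Token count scales linearly with session count, and almost none of it is relevant to the problem currently on screen.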


How Hindsight Actually Works Here

The Hindsight memory system gives you three operations. Here's exactly how I use each one in CodeMentor AI.

retain() — Store What Happened

After every code submission, I call retain() with a structured summary:

```javascript
await retain(hindsightKey, bankId,
  `${username} attempted "${problem.title}" (${problem.difficulty}).
   Result: ${pass ? 'SOLVED' : 'FAILED'}. Attempt #${attempts}.
   ${!pass
     ? `Error: ${result.error}. Failed ${failedCount}/${total} test cases.`
     : 'Solved successfully.'
   }`,
  {
    type: pass ? 'success' : 'failure',
    problem: problem.title,
    tags: problem.tags,        // ['array', 'dynamic-programming', etc.]
    language: lang,
    timestamp: new Date().toISOString()
  }
);
```

Key decision: I don't store raw code. I store meaning. "Krishna failed Coin Change attempt 2, dynamic programming, wrong base case" is a useful memory. The actual 30 lines of wrong JavaScript are noise.

I also call retain() when someone asks for a hint — that tells the system they got stuck, and on exactly what concept.
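The hint-request memory looks roughly like this. The helper and its field names are illustrative, not the shipped code; `retain` is the real Hindsight call, shown in a comment at the end.

```javascript
// Illustrative sketch: the shape of the memory stored when a user
// asks for a hint. Field names are my own, not a Hindsight schema.
function buildHintMemory(username, problem, attempts) {
  return {
    text: `${username} asked for a hint on "${problem.title}" ` +
          `(${problem.tags.join(', ')}) after ${attempts} attempt(s).`,
    meta: {
      type: 'hint-request',
      problem: problem.title,
      tags: problem.tags,
      timestamp: new Date().toISOString(),
    },
  };
}

const memory = buildHintMemory(
  'krishna',
  { title: 'Coin Change', tags: ['dynamic-programming'] },
  1
);
console.log(memory.text);
// then: await retain(hindsightKey, bankId, memory.text, memory.meta);
```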

recall() — Fetch What's Relevant

When someone selects a problem, I immediately fetch relevant memories before the editor even loads:

```javascript
const memories = await recall(
  hindsightKey,
  bankId,
  `${username} past attempts and errors on ${problem.title}`,
  5
);
```

On app load, I run a broader query:

```javascript
const patterns = await recall(
  hindsightKey,
  bankId,
  `${username} overall coding patterns and recurring mistakes`,
  8
);
```

These memories go straight into the AI mentor's system prompt. That's it. That's the whole trick. The LLM gets context, and it uses it naturally — I don't write special logic to handle it.

reflect() — Personalize the Response

Before the mentor responds to a failed submission, I call reflect():

```javascript
const reflection = await reflect(
  hindsightKey,
  bankId,
  `What are ${username}'s main weaknesses based on their coding history?`
);
```

The system prompt then looks like this:

```javascript
const systemPrompt = `
You are CodeMentor AI, an expert coding tutor for ${username}.

What you remember about this user:
${memories.slice(0, 3).map(m => m.content).join('\n')}

Synthesized insight:
${reflection}

Current problem: ${problem.title} (${problem.difficulty})
Their code: ${code.substring(0, 400)}

Give a targeted 2-3 sentence hint. Be specific to their history.
Don't give the full solution.
`;
```

The response stops being generic the moment memory is in the prompt. The LLM isn't guessing who you are anymore — it knows.


The Moment That Surprised Me

Session three. I opened the app as "krishna", same as before.

The mentor greeted me with a message referencing two specific things I'd struggled with in previous sessions. I had not engineered this. There was no code path that said "if user has memories, personalize greeting." The LLM received the recalled context and used it the way a human tutor would — naturally, without being told to.

That's what agent memory done right actually feels like. You don't notice the system working. You just notice the AI stopped being generic.

The other thing that caught me off guard: Hindsight automatically consolidates related memories into higher-level Observations in the background. After a few sessions, it had synthesized "this user struggles with base cases in recursive problems" — without me writing any consolidation logic. It inferred a pattern from individual facts.

I hadn't built that. It emerged.


What I Got Wrong First

I stored too much. My first version retained full code submissions, full error messages, full conversation transcripts. The recall results were huge and unfocused. Signal drowned in noise. Switching to structured summaries — mistake type, problem tag, outcome — fixed it immediately.
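The fix, sketched with illustrative names (not the shipped code): reduce a full submission to one compact, structured line before it ever reaches `retain()`.

```javascript
// "Store meaning, not data": compress a submission result into a
// structured summary. All field names here are illustrative.
function summarizeSubmission({ username, problem, pass, attempts, result }) {
  const outcome = pass
    ? 'SOLVED'
    : `FAILED (${result.failedCount}/${result.total} tests, error: ${result.errorKind})`;
  return `${username} attempted "${problem.title}" [${problem.tags.join(', ')}], ` +
         `attempt #${attempts}: ${outcome}`;
}

const summary = summarizeSubmission({
  username: 'krishna',
  problem: { title: 'Coin Change', tags: ['dynamic-programming'] },
  pass: false,
  attempts: 2,
  result: { failedCount: 3, total: 5, errorKind: 'wrong base case' },
});
console.log(summary);
```

One line like this recalls cleanly; a 30-line transcript doesn't.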

I tested wrong. I kept clearing the memory bank between tests and starting fresh. That's the worst way to test a memory system. The interesting behavior shows up at session three and four, when patterns have accumulated. I almost shipped without ever seeing the actual value of what I'd built.

I waited too long to call recall(). My first version only fetched memories when the user asked a question. Wrong. I should front-load context — call recall() the moment a problem is selected, the moment a session starts. Don't wait. Memory is most useful before the user says anything.
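The front-loaded flow can be sketched like this. `recallFn` is injected so the flow is testable; in the app it would be Hindsight's `recall()` bound to the key and bank id. The function name is mine, not the real code.

```javascript
// Front-load context: fetch memories the moment a problem is selected,
// before the user types anything. `recallFn` is an injected stand-in
// for Hindsight's recall() (illustrative, not the shipped code).
async function onProblemSelect(problem, username, recallFn) {
  const memories = await recallFn(
    `${username} past attempts and errors on ${problem.title}`,
    5
  );
  return { problem, memories }; // ready before the editor even mounts
}

// Demo with a stubbed recall:
const stubRecall = async (query, limit) => [{ content: `stub memory for: ${query}` }];
onProblemSelect({ title: 'Two Sum' }, 'krishna', stubRecall)
  .then(({ memories }) => console.log(memories[0].content));
```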


What It Looks Like Now

Open the app. Select a problem. Before you type a single character, the mentor has already pulled your relevant history. The skill radar on the left reflects your actual performance — not placeholder data, but scores derived from recalled Hindsight memories.
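Deriving those radar scores is straightforward once memories carry tags and outcomes in their metadata. This is an illustrative sketch under that assumption, not the shipped scoring code.

```javascript
// Sketch: per-tag skill scores from recalled memories, assuming each
// memory's metadata has { tags: [...], type: 'success' | 'failure' }.
function skillScores(memories) {
  const stats = {};
  for (const m of memories) {
    for (const tag of m.meta.tags ?? []) {
      const s = (stats[tag] ??= { solved: 0, total: 0 });
      s.total += 1;
      if (m.meta.type === 'success') s.solved += 1;
    }
  }
  // Percentage solved per tag, rounded for display on the radar.
  return Object.fromEntries(
    Object.entries(stats).map(([tag, s]) => [tag, Math.round((100 * s.solved) / s.total)])
  );
}

console.log(skillScores([
  { meta: { tags: ['dynamic-programming'], type: 'failure' } },
  { meta: { tags: ['dynamic-programming'], type: 'success' } },
  { meta: { tags: ['array'], type: 'success' } },
]));
```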

Fail a test case. The mentor's hint references your specific error pattern, not a generic category. Ask for a hint explicitly. It retains that too — "Krishna needed a hint on dynamic programming after 1 attempt" — and the next session, it adjusts.

Close the app. Open it again tomorrow. It still knows.

That's the whole thing. That's what makes it a mentor instead of a chatbot.


Lessons Worth Taking

Design your memory schema before writing agent logic. What you retain determines what you can recall. Spend time here. It's the actual hard part.

Store meaning, not data. Summaries outperform transcripts. Structured observations outperform raw logs. Every time.

The personalization emerges — you don't code it. Shape your LLM prompt with recalled memory, and the adaptation follows naturally. You don't need special if-statements for "if user has struggled before, say X."

Test at session 3, not session 1. Single-session demos make memory look optional. The value compounds. Show that.


Try It

codemento.netlify.app

Free Groq key from groq.com. Free Hindsight Cloud account from ui.hindsight.vectorize.io — use code MEMHACK315 for $50 free credits.

If you're building anything where the same user comes back more than once — a tutor, a support agent, a code reviewer — persistent memory isn't a nice-to-have. It's the difference between a tool that helps and a tool that actually improves.


Built at HackWithIndia 2026.

Team: Karan Patil, Sarvesh Gajakosh, Priya Vhatkar, Krishna Hasare, Siddhi Shinde, Aastha Sawade.
