My AI Warned Me About a Bug Before I Even Ran the Code
"Did it seriously just do that?"
I stared at the screen. The agent had just flagged a missing base case in my recursive function — before I clicked Run. Not a linter. Not static analysis. An AI that remembered I had made this exact mistake four times before, and interrupted me before I could make it a fifth.
That moment changed how I think about AI agents entirely.
What We Built
Kernel's Slap is an AI coding mentor with persistent memory. Unlike every coding platform you have used before — LeetCode, HackerRank, Codeforces — it never forgets you. It remembers your mistakes, your patterns, your blind spots, across every session you have ever had with it.
The main moving parts:
- Frontend: Next.js 14 + Monaco Editor (same editor as VS Code)
- Backend: FastAPI (Python) — handles all agent orchestration
- Memory: Hindsight — retain(), recall(), learn()
- LLM: Groq (llama-3.3-70b-versatile) — fast enough that hints feel instant
- Code Execution: Piston API — free, safe, multi-language sandbox (see the sketch after this list)
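For context, here is roughly what a call to Piston's public execute endpoint looks like from Python. This is a sketch, not the repo's actual wiring, and the runtime version string is an assumption:

```python
import requests

def run_student_code(source: str, language: str = "python", version: str = "3.10.0") -> str:
    """Run code in Piston's public sandbox; return stderr, or stdout on success."""
    resp = requests.post(
        "https://emkc.org/api/v2/piston/execute",
        json={
            "language": language,
            "version": version,  # must match a runtime the Piston instance has installed
            "files": [{"name": "main.py", "content": source}],
        },
        timeout=15,
    )
    resp.raise_for_status()
    run = resp.json()["run"]
    return run["stderr"] or run["stdout"]
```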
The core idea: every meaningful interaction — mistake, success, hint given, hint ignored — gets stored in Hindsight's agent memory. Every agent response reads that memory first. The agent never speaks without knowing who it is speaking to.
The Problem We Actually Solved
I want to be specific about this because "coding platforms need memory" is too vague to be useful.
There are three separate failures happening every day on every coding platform:
Failure 1: Repetition without recognition. You make an off-by-one error on Monday. You make it again on Wednesday, on a completely different problem. The platform has zero idea. It gives you the identical generic hint both times. Nobody ever says: "You have made this error eleven times. It is your biggest weakness. Let us fix it."
Failure 2: Generic help. When you get stuck, the hint you receive is the same hint that 2 million other users receive. It has no idea whether analogies work for you, or whether you need code examples, or whether the Socratic method is what actually clicks. It just picks one and hopes.
Failure 3: No real trajectory. You grind 200 problems over three months. Are you actually improving at recursion? You genuinely do not know. The platform shows you problem counts. It does not show you error rates by category over time.
Hindsight fixes all three — but only if you use it correctly. I will show you how we did it, including where we got it wrong at first.
How Hindsight Works in Our System
The rule we enforced was simple: recall() fires before every single agent response. No exceptions.
This sounds obvious. It is not obvious. Our first version only called recall() when the user explicitly asked for a hint. That was wrong. The agent was blind for the pre-mortem check, blind for the session greeting, blind for the challenge generator. When we changed it to fire before everything, the entire product transformed.
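One way to make the rule hard to break is to route every handler through a small wrapper, so nothing reaches the LLM without memory in hand. A minimal sketch, assuming the same recall() function the endpoints below use (recall_first and greet are illustrative names, not part of the repo):

```python
def recall_first(query: str):
    """Hypothetical decorator: the wrapped handler cannot run without memory."""
    def wrap(handler):
        async def inner(payload: dict):
            user_id = payload.get("user_id", "user_1")
            past = await recall(user_id, query)  # recall() fires first, always
            return await handler(payload, past)
        return inner
    return wrap

@recall_first("recent sessions and progress")
async def greet(payload: dict, past) -> dict:
    # By the time this body runs, `past` is guaranteed to exist.
    return {"greeting": "Welcome back." if past else "Welcome!"}
```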
Here is what our hint endpoint looks like:
@app.post("/api/hint")
async def get_hint(payload: dict):
user_id = payload.get("user_id", "user_1")
code = payload.get("code", "")
error = payload.get("error", "")
# Step 1 — recall FIRST, before anything else
past = await recall(user_id, f"mistakes with {error}")
past_text = str(past) if past else "No past mistakes found"
# Step 2 — now generate a hint that knows this person
response = groq_client.chat.completions.create(
model="llama-3.3-70b-versatile",
messages=[
{
"role": "system",
"content": f"""You are a personal coding mentor.
This student's past mistakes: {past_text}
Give a short personalised hint based on their history.
Reference their past mistakes if relevant."""
},
{
"role": "user",
"content": f"My code:\n{code}\n\nError:\n{error}\n\nGive me a hint."
}
]
)
hint = response.choices[0].message.content
# Step 3 — retain this mistake with full context
await retain(user_id, {
"event": "mistake",
"code": code,
"error": error,
"hint_given": hint,
"hint_style": "direct"
})
return {"hint": hint}
The difference between a generic tutor and a personal mentor is entirely in step 1. Without recall(), the system prompt is: "help this user." With recall(), the system prompt becomes: "this user has failed base case errors four times, last hint style was code example and it did not help, try analogy this time."
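If recall() hands back structured records rather than a blob, you can go one step further and compress them into prompt-ready lines instead of using str(past). A sketch, assuming a list-of-dicts return shape (our assumption, not a documented guarantee):

```python
def memory_to_prompt(past: list) -> str:
    """Condense recalled memory records into lines an LLM can act on."""
    if not past:
        return "No history for this student yet."
    lines = [
        f"- {m.get('category', 'unknown')}: "
        f"last hint style: {m.get('hint_style', 'n/a')}, "
        f"resolved: {m.get('resolved', 'unknown')}"
        for m in past
    ]
    return "This student's history:\n" + "\n".join(lines)
```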
The Pre-Mortem Feature (The One That Surprised Us Most)
This is the feature I did not expect to work as well as it did.
The idea: before the user runs their code, the agent scans what they have written, calls recall() to check their personal error history, and warns them if it detects a pattern match.
@app.post("/api/premortem")
async def premortem(payload: dict):
user_id = payload.get("user_id", "user_1")
code = payload.get("code", "")
# Recall this user's most common error patterns
past = await recall(user_id, "common mistakes errors")
past_text = str(past) if past else "No history yet"
response = groq_client.chat.completions.create(
model="llama-3.3-70b-versatile",
messages=[
{
"role": "system",
"content": f"""You are a coding mentor.
Student's past mistakes: {past_text}
Look at their code and warn them about likely bugs BEFORE they run it.
Keep it to 1-2 lines only. Be specific."""
},
{
"role": "user",
"content": f"My code:\n{code}\n\nWhat might go wrong?"
}
]
)
warning = response.choices[0].message.content
return {"warning": warning}
What the student sees:
⚠️ "Based on your last 4 recursive functions, you tend to miss the base case. I can see this code has the same pattern."
The agent warned them before the mistake happened. This is only possible because the system remembers their past mistakes. Without Hindsight, this feature is simply impossible — you cannot warn someone about their personal patterns if you have no memory of their personal patterns.
What learn() Actually Changed
This was the function we underestimated most.
After every solved problem, we call learn():
@app.post("/api/solve")
async def solved(payload: dict):
user_id = payload.get("user_id", "user_1")
problem = payload.get("problem", "")
hint_helped = payload.get("hint_helped", True)
await learn(user_id, {
"event": "solved",
"problem": problem,
"hint_helped": hint_helped
})
await retain(user_id, {
"event": "success",
"problem": problem
})
return {"status": "Memory updated!"}
Over time, the agent builds a model of what explanation style actually helps each student. If code examples consistently fail but analogies work, learn() accumulates that signal and the agent changes its behavior. This is the difference between a system that logs data and a system that actually learns from it.
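As a sketch of how that signal can close the loop, suppose recall() can surface the per-style outcome counts that learn() has accumulated (the aggregate shape in the comment is an assumption, not a documented Hindsight guarantee):

```python
async def pick_hint_style(user_id: str) -> str:
    """Choose the hint style with the best track record for this student."""
    # Assumed aggregate shape, built up by learn() over many sessions:
    #   {"analogy": {"helped": 7, "failed": 1}, "code_example": {"helped": 2, "failed": 5}}
    stats = await recall(user_id, "hint style outcomes")
    if not stats:
        return "socratic"  # reasonable default for a brand-new student

    def success_rate(item):
        _, outcome = item
        total = outcome["helped"] + outcome["failed"]
        return outcome["helped"] / total if total else 0.0

    return max(stats.items(), key=success_rate)[0]
```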
Before vs After — A Concrete Example
Without Kernel's Slap:
Student writes def fib(n): return fib(n-1) + fib(n-2). Runs it. Gets RecursionError. Platform says: "Check your base case." Student fixes it. Two weeks later, same error on a different problem. Platform says: "Check your base case." This repeats fifteen times over three months. Nobody notices.
With Kernel's Slap:
Same student. Same code. Before they click Run, the pre-mortem fires: "Based on your history with recursive functions, I think you have missed the base case again." Student fixes it before running. The retain() call stores: caught with pre-mortem assist. Next session: "Welcome back. You have been improving on recursion. You caught the base case yourself last time. Want to try a harder one?"
The error did not disappear. The pattern was caught, named, and addressed. That is what memory enables.
What We Got Wrong
I want to be honest about one dead end.
Our first attempt stored only the error message in retain() — just the raw stderr output. This turned out to be nearly useless for recall(). When the agent asked "what are this user's common mistakes," it got back a list of raw Python tracebacks with no categorization, no context, no hint about what caused them.
We had to refactor to store structured objects: error type, category, hint given, hint style, whether it resolved. The richness of what you store determines the quality of everything you can build on top of it. If you are building with Hindsight, design your retain() schema before you write a single API endpoint.
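Here is roughly what the structured schema looked like after the refactor, reconstructed from the fields described above (the exact field names in the repo may differ):

```python
from dataclasses import dataclass, asdict

@dataclass
class MistakeRecord:
    """One mistake, stored with enough context for recall() to be useful."""
    error_type: str   # e.g. "RecursionError"
    category: str     # e.g. "recursion/base_case"
    hint_given: str   # the exact hint text the agent produced
    hint_style: str   # "direct", "analogy", "code_example", "socratic"
    resolved: bool    # did the student get past it afterwards?

record = MistakeRecord(
    error_type="RecursionError",
    category="recursion/base_case",
    hint_given="What stops fib(0) from recursing forever?",
    hint_style="socratic",
    resolved=True,
)
# Stored as: await retain(user_id, {"event": "mistake", **asdict(record)})
```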
Lessons Learned
Store rich context, not raw events. A traceback is not a mistake. A structured object with error type, category, hint given, and outcome is a mistake. Design your schema first.
recall() before everything, not just when stuck. The moment we made this rule, the product changed completely. The agent became coherent across all features instead of only smart during hint requests.
learn() is not optional. Most teams treat it as a nice-to-have. It is the function that separates a system that remembers from a system that actually improves. If you are not calling learn(), you are logging, not learning.
The pre-mortem was our most unexpected win. We built it as a secondary feature. It became the one that made people say "wait, how did it know that?" That reaction is what you are building toward.
Memory is not logging. Logging is passive. Memory is active decision input. The difference is whether you call recall() before acting on anything.
Try It Yourself
The full project is on GitHub: github.com/1shrikantsc-spc/kernels-slap
If you want to build something similar, start with the Hindsight documentation — it is genuinely straightforward to integrate. The concepts of retain(), recall(), and learn() map directly to how you would design any memory-driven agent system.
The broader pattern — recall before every response, retain after every outcome, learn after every resolved interaction — applies far beyond coding mentors. Any agent that needs to know who it is talking to before it speaks can use this architecture.
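In Python terms, the whole pattern fits in one function. Here act() and the default query string are placeholders for whatever your agent actually does:

```python
async def agent_turn(user_id: str, event: dict):
    """The generic loop: recall, act, retain, and learn on resolution."""
    past = await recall(user_id, event.get("query", "relevant history"))  # recall before responding
    response = await act(event, context=past)                             # respond with full context
    await retain(user_id, {**event, "response": response})                # retain the outcome
    if event.get("resolved"):
        await learn(user_id, {**event, "response": response})             # learn on resolution
    return response
```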
Built with:
- Hindsight for persistent agent memory
- Hindsight documentation — retain, recall, learn API reference
- Agent memory concepts from Vectorize
The architecture:

[Architecture diagram]