What Happened When My Coding Agent Started Remembering User Mistakes

By: Shreya R Chittaragi — Memory & Adaptation Module

Hindsight Hackathon — Team 1/0 coders

The first time our mentor described a user as "someone who rushes through problems without reading carefully," based purely on behavioral signals with no labels, I knew the memory layer was working.

No one told the system this user was a rusher. No dropdown, no profile form, no manual tag. The agent watched how fast they submitted, counted their edits, saw the syntax errors, and concluded it on its own. Then it adapted its hint accordingly.

That's what behavioral memory looks like when it actually works.

What We Built

Our project is an AI Coding Practice Mentor — a system where users submit Python solutions to coding problems, get evaluated, and receive personalized hints. The personalization isn't based on what they tell us about themselves. It's based on how they actually behave while solving problems.

The stack:

  • FastAPI backend handling code execution and routing
  • Groq (LLaMA 3.3 70B) for generating hints and feedback
  • Hindsight for persistent behavioral memory across sessions
  • React frontend with a live code editor

My role was the memory and adaptation module — everything that sits between "user submitted code" and "here's a hint tailored to how this specific person thinks."
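End to end, that slice looks roughly like the sketch below. It is only an outline: evaluate_code and a couple of submission fields are placeholder names for the code-execution piece, while capture_signals, detect_patterns, store_session, and generate_hint are the memory-side functions described later in this post.

from fastapi import FastAPI

app = FastAPI()

@app.post("/submit")
def submit_solution(submission: CodeSubmission):
    # Run the user's code against the problem's test cases (placeholder name).
    result = evaluate_code(submission)
    # Turn raw behavior into signals, then into cognitive patterns.
    signals = capture_signals(submission, result)
    patterns = detect_patterns(signals)
    dominant = patterns[0]["pattern"] if patterns else "none"
    # Persist this session so the next one starts with context.
    store_session(submission.user_id, {"dominant_pattern": dominant, **signals})
    # Generate a hint tuned to how this user behaves, not just what failed.
    hint = generate_hint(submission.problem_id, submission.code, dominant)
    return {"passed": result.all_passed, "hint": hint}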

The Problem with Generic Hints

Before memory, every user got the same hint for the same wrong answer.

Submit an empty two_sum function? Here's a generic explanation of hash maps. Doesn't matter if you're someone who overthinks every edge case or someone who submits in 8 seconds without reading the problem. Same hint. Same tone. Same depth.

That's not mentoring. That's a FAQ page.

The insight behind our system is that how someone fails tells you more than what they got wrong. Two users can both fail the same test case for completely different cognitive reasons:

  • One spent 15 minutes overthinking and missed a simple edge case
  • One submitted in 5 seconds with a syntax error because they didn't read carefully

They need different responses. The first needs confidence. The second needs to slow down.
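In terms of the signals we capture, those two failures look nothing alike, even though the failing test case is identical. The values here are illustrative, not real data:

# Overthinker: long time, many edits, logic slip on an edge case
{"time_taken_sec": 900, "code_edit_count": 47, "error_types": ["wrong_output"]}

# Rusher: seconds in, barely any edits, code didn't even parse
{"time_taken_sec": 5, "code_edit_count": 1, "error_types": ["syntax_error"]}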

Building the Pattern Detection Layer

The first thing I built was cognitive_analyzer.py — a rule-based system that takes raw behavioral signals and converts them into cognitive pattern labels.

The signals come from signal_tracker.py:

def capture_signals(submission: CodeSubmission, result: EvalResult) -> dict:
    return {
        "user_id": submission.user_id,
        "problem_id": submission.problem_id,
        "attempt_number": submission.attempt_number,
        "time_taken_sec": submission.time_taken,
        "code_edit_count": submission.code_edit_count,
        "all_passed": result.all_passed,
        "error_types": classify_errors(result.error_types),
    }
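The classify_errors helper referenced above collapses raw error information into the coarse categories the detectors understand. Here's a simplified sketch of the idea; the input shape, keywords, and category names other than syntax_error are assumptions for illustration:

def classify_errors(raw_errors: list[str]) -> list[str]:
    # Map raw error strings into coarse categories the pattern detectors check for.
    categories = set()
    for err in raw_errors:
        lowered = err.lower()
        if "syntaxerror" in lowered or "indentationerror" in lowered:
            categories.add("syntax_error")
        elif "indexerror" in lowered or "keyerror" in lowered:
            categories.add("boundary_error")
        elif "assertionerror" in lowered:
            categories.add("wrong_output")
        else:
            categories.add("runtime_error")
    return sorted(categories)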

These signals feed into pattern detectors. Here's the rushing detector:

def _check_rushing(signals: dict) -> list:
    score = 0.0
    if signals["time_taken_sec"] < 15:
        score += 0.3
    if "syntax_error" in signals["error_types"]:
        score += 0.5
    if signals["code_edit_count"] <= 2:
        score += 0.2
    if score >= 0.4:
        return [{"pattern": "rushing", "confidence": round(score, 2)}]
    return []

Five patterns total: overthinking, guessing, rushing, concept_gap, boundary_weakness. Each has its own confidence score. The dominant pattern drives everything downstream — the hint tone, the next problem difficulty, the encouragement threshold.
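Picking the dominant pattern is just an aggregation step: every detector returns zero or more scored patterns, and the highest-confidence one wins. A simplified sketch (the other detector names and the tie-breaking are illustrative):

DETECTORS = [_check_rushing, _check_overthinking, _check_guessing,
             _check_concept_gap, _check_boundary_weakness]

def detect_patterns(signals: dict) -> list[dict]:
    # Run every detector and collect all scored patterns for this submission.
    patterns = []
    for detector in DETECTORS:
        patterns.extend(detector(signals))
    # Highest confidence first; the dominant pattern drives hints and problem selection.
    return sorted(patterns, key=lambda p: p["confidence"], reverse=True)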

The Memory Problem Nobody Warns You About

My first implementation of memory was a Python dict:

_memory_store: dict[str, UserMemoryProfile] = {}

It worked perfectly during testing. Patterns stored, profiles built, adaptive hints generating correctly. Then I restarted the server and every single user profile was gone.

A dict lives in RAM. RAM clears on restart. For a demo this would be catastrophic — judges submit code, close the tab, come back, and the system has no memory of them at all.

I moved to file-based persistence first — serializing the memory store to memory_data.json on every write:

def _save_to_disk():
    data = {uid: profile.model_dump() for uid, profile in _memory_store.items()}
    with open(MEMORY_FILE, "w") as f:
        json.dump(data, f, default=str)
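The counterpart runs once at startup to rebuild the in-memory store. A minimal sketch, assuming json and os are already imported in the module and that UserMemoryProfile can be rebuilt from its dumped fields:

def _load_from_disk():
    # Rebuild the in-memory store from the last snapshot, if one exists.
    if not os.path.exists(MEMORY_FILE):
        return
    with open(MEMORY_FILE) as f:
        data = json.load(f)
    for uid, profile in data.items():
        _memory_store[uid] = UserMemoryProfile(**profile)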

This survived restarts. But it was still local — not scalable, not shareable across instances, and not the real Hindsight integration we needed for the demo.

Integrating Real Hindsight Cloud Memory

Switching to Hindsight meant moving from a flat JSON file to a proper agent memory system with semantic recall and reflection built in.

The integration looked clean at first:

from hindsight_client import Hindsight

client = Hindsight(
    base_url=settings.HINDSIGHT_URL,
    api_key=settings.HINDSIGHT_API_KEY
)

def store_session(user_id: str, session_data: dict):
    client.retain(
        bank_id="coding-mentor",
        content=f"User {user_id} showed {session_data['dominant_pattern']} pattern...",
        metadata={"user_id": user_id}
    )

Then I hit this error the moment a real user submitted code:

RuntimeError: Timeout context manager should be used inside a task

hindsight_client uses async under the hood. FastAPI was running the route handler in a thread via run_in_threadpool, and calling an async client from a sync thread with no running event loop triggers exactly this crash. Worse, it only surfaces on real requests, never in unit tests.

The fix was running the async calls in a fresh event loop:

def _run_in_new_loop(coro):
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()

def store_session(user_id: str, session_data: dict):
    _run_in_new_loop(client.aretain(
        bank_id="coding-mentor",
        content=content,
        context=f"coding session for user {user_id}",
        metadata={"user_id": user_id}
    ))

Four lines. Two hours of debugging. Worth every minute.

The Adaptive Problem Selector

Once memory was working, I built adaptive_selector.py — a module that reads a user's dominant pattern from memory and picks the best next problem for them.

PATTERN_STRATEGY = {
    "overthinking": {"difficulty": "easy",
                     "reason": "Simpler problem to build confidence"},
    "guessing":     {"difficulty": "easy",
                     "reason": "Easy problem to force deliberate thinking"},
    "rushing":      {"difficulty": "medium",
                     "reason": "Harder problem that punishes rushing"},
    "concept_gap":  {"difficulty": "easy",
                     "reason": "Back to basics to fill the knowledge gap"},
}

A user who rushes gets a medium difficulty problem — something that actually punishes careless reading. A user who's overthinking gets an easy win to rebuild confidence. The logic is simple, but it's only possible because we have a real history of their behavior across sessions.
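The selection itself is a lookup into PATTERN_STRATEGY plus a filter over the problem bank. A simplified sketch of what adaptive_selector.py does; get_user_profile, the profile fields, and the problem-bank structure are assumptions for illustration:

def select_next_problem(user_id: str, problem_bank: list[dict]) -> dict:
    # Read the user's dominant pattern from memory and map it to a difficulty.
    profile = get_user_profile(user_id)  # hypothetical memory lookup
    strategy = PATTERN_STRATEGY.get(
        profile.dominant_pattern,
        {"difficulty": "easy", "reason": "No history yet"},
    )
    # Prefer an unsolved problem at that difficulty; fall back to anything left.
    candidates = [p for p in problem_bank
                  if p["difficulty"] == strategy["difficulty"]
                  and p["id"] not in profile.solved_ids]
    chosen = candidates[0] if candidates else problem_bank[0]
    return {"problem": chosen, "reason": strategy["reason"]}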

What It Looks Like Now

After a few submissions, the Insights panel in our UI shows:

Weak Areas: Rushing (56%), Overthinking (40%)

Latest: Rushing

These percentages come directly from pattern confidence scores stored in Hindsight. The mentor hint changes based on this — a rusher gets told to slow down and re-read. An overthinker gets told to trust their instinct and start simple.
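Concretely, the dominant pattern just gets folded into the prompt we send to Groq. A simplified sketch using the groq Python client; the tone map, model name, and prompt wording here are illustrative rather than our exact code:

from groq import Groq

groq_client = Groq(api_key=settings.GROQ_API_KEY)

TONE_BY_PATTERN = {
    "rushing": "Tell them to slow down and re-read the problem before the next attempt.",
    "overthinking": "Tell them to trust their instinct and start with the simplest approach.",
}

def generate_hint(problem: str, code: str, dominant_pattern: str) -> str:
    # Steer the mentor's tone based on the user's dominant behavioral pattern.
    tone = TONE_BY_PATTERN.get(dominant_pattern, "Give a neutral, encouraging hint.")
    response = groq_client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system",
             "content": f"You are a coding mentor. The user shows a '{dominant_pattern}' pattern. {tone}"},
            {"role": "user",
             "content": f"Problem:\n{problem}\n\nTheir code:\n{code}\n\nGive one short hint, no full solution."},
        ],
    )
    return response.choices[0].message.content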

In the Hindsight Cloud dashboard, entities are being tracked across sessions: hindsight_test, test_user, overthinking, syntax_struggles — Hindsight isn't just storing logs. It's building a semantic understanding of each user's behavior over time.

What I Learned

  • _Persistence is not optional._ An in-memory dict feels fine until the first restart. Design for persistence from day one, even if it's just a JSON file initially.
  • _Async context matters more than you think._ The RuntimeError from running async Hindsight calls inside a FastAPI thread cost two hours. Always check whether your client library is async before wiring it into a sync endpoint.
  • _Behavioral signals beat self-reported data._ Users don't know they're rushing. Watching what they actually do gives you a more honest picture than anything they'd type into a profile form.
  • _One source of truth for pattern detection._ We initially had two pattern detectors with overlapping but inconsistent logic. Consolidating into one was a small change with a big impact on reliability.
  • _Memory makes the LLM smarter without retraining._ The same Groq model gives dramatically different, more useful hints when it has behavioral context. You don't need a bigger model; you need better memory.

Resources & Links

Hindsight GitHub: https://github.com/vectorize-io/hindsight

Hindsight Docs: https://hindsight.vectorize.io/

Agent Memory: https://vectorize.io/features/agent-memory
