I Built Adaptive Hints Using Hindsight
“Why is it suddenly explaining recursion?” I hadn’t changed the prompt—Hindsight had quietly picked up the user’s past mistakes and rewired the hint mid-session.
What I actually built
I’ve been working on Codemind, a coding practice platform that tries to behave less like a judge and more like a mentor. Instead of just saying “wrong answer,” it watches how you fail and adjusts guidance over time.
At a high level, the system looks like this:
- Frontend: users solve problems, submit code
- Backend (Python): handles submissions and feedback orchestration
- Execution layer: runs code safely (Firecracker-style sandbox)
- AI layer: analyzes code and generates hints
- Memory layer (Hindsight): stores attempts and extracts patterns
The interesting part isn't the execution or the model; it's the memory. I wired in Hindsight to track user attempts over time and use that history to shape hints.
If you’ve never used it, the Hindsight documentation explains the idea well: instead of treating each interaction as isolated, you store and retrieve past events to influence future behavior. It’s basically a structured memory layer for agents.
That sounds obvious. It’s not.
The problem: stateless feedback is useless
Originally, my feedback loop looked like this:
```python
def handle_submission(code, problem_id):
    result = run_code(code, problem_id)
    if result.passed:
        return "Correct"
    hint = generate_hint(code, problem_id)
    return hint
```
Every submission was evaluated independently. Same wrong logic? Same hint.
This works fine until you watch a real user:
- Attempt 1 → wrong base case
- Attempt 2 → still wrong base case
- Attempt 3 → copy-paste variation, same mistake
And the system keeps saying:
“Check your recursion logic.”
That’s not helpful. That’s just noise.
What I needed was:
“You’ve made this mistake three times. Let’s change strategy.”
My first attempt at “memory” (and why it failed)
My initial solution was embarrassingly naive: just store past submissions in a database and pass them into the prompt.
```python
history = db.get_user_attempts(user_id, problem_id)

prompt = f"""
User history:
{history}

Current code:
{code}

Give a hint.
"""
```
This technically worked. But in practice:
- Prompts became huge
- The model ignored most of the history
- No consistent behavior across attempts
It wasn’t structured. It was just dumping text and hoping for magic.
What changed: introducing Hindsight
Instead of shoving raw history into prompts, I switched to using Hindsight as a proper memory layer.
The core idea: store events, not blobs of text.
```python
hindsight.log_event(
    user_id=user_id,
    event_type="submission",
    metadata={
        "problem_id": problem_id,
        "error_type": classify_error(code),
        "passed": result.passed,
    },
)
```
Now every submission becomes a structured record:
- Which problem
- What kind of mistake
- Whether it passed
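Conceptually, each record is just a small typed object. Here's a minimal sketch in plain Python of what one attempt looks like (my own illustration, not Hindsight's internal type):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SubmissionEvent:
    """One structured record per attempt (illustrative, not Hindsight's type)."""
    problem_id: str
    error_type: Optional[str]  # e.g. "recursion_error"; None if it passed
    passed: bool

event = SubmissionEvent("p7", "recursion_error", False)
```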
This is where things got interesting.
Instead of asking “what did the user do?”, I could ask:
“What patterns are emerging?”
Turning events into adaptive hints
Once events were stored, I started querying them before generating hints.
```python
patterns = hindsight.query(
    user_id=user_id,
    filters={"problem_id": problem_id},
    aggregation="error_frequency",
)
```
Then I used those patterns to decide how to hint.
```python
def generate_adaptive_hint(code, patterns):
    if patterns["recursion_error"] >= 2:
        return explain_recursion_basics(code)
    if patterns["off_by_one"] >= 2:
        return highlight_edge_cases(code)
    return generic_hint(code)
```
This is simple logic. No fancy ML. But it changed behavior dramatically.
The system stopped reacting to one submission and started reacting to trends.
The unexpected part: sequence mattered more than frequency
At first, I thought counting errors was enough.
It wasn’t.
Two identical mistakes in a row means something very different from:
- mistake → fix → new mistake
So I added sequence awareness.
```python
recent = hindsight.get_recent_events(user_id, limit=3)

if all(e["error_type"] == "recursion_error" for e in recent):
    return deep_recursion_hint()
```
That’s when the system started feeling… intentional.
Not just “you’re wrong,” but:
“You’re stuck in a loop. Let’s break it.”
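The "stuck" vs. "progressing" distinction is easy to make concrete. A sketch (the function name is mine) that classifies the last few error types, ordered oldest to newest:

```python
def classify_trajectory(recent_errors):
    """Label recent attempts as stuck, progressing, or mixed.

    recent_errors: error types ordered oldest -> newest.
    """
    if len(recent_errors) < 2:
        return "too_early"
    if len(set(recent_errors)) == 1:
        return "stuck"          # same mistake repeated back to back
    if recent_errors[-1] != recent_errors[0]:
        return "progressing"    # mistake -> fix -> new mistake
    return "mixed"

print(classify_trajectory(["recursion_error"] * 3))            # "stuck"
print(classify_trajectory(["recursion_error", "off_by_one"]))  # "progressing"
```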
What the system does now (concretely)
Here’s a real scenario:
Before (stateless)
User submits broken recursion:
- “Check your recursion logic.”
- “Check your recursion logic.”
- “Check your recursion logic.”
Nothing changes.
After (with Hindsight)
Same user:
- “Check your recursion logic.”
- “Your base case might be incorrect.”
- “Let’s walk through a simple recursion example step by step.”
The system escalates guidance based on observed struggle.
No prompt changes. No retraining. Just memory.
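The escalation above can be sketched as a simple ladder keyed on consecutive failures of the same type (the hint strings are illustrative):

```python
HINT_LADDER = [
    "Check your recursion logic.",
    "Your base case might be incorrect.",
    "Let's walk through a simple recursion example step by step.",
]

def escalating_hint(consecutive_failures):
    """Pick a progressively more explicit hint; clamp at the deepest one."""
    level = min(consecutive_failures, len(HINT_LADDER)) - 1
    return HINT_LADDER[max(level, 0)]

print(escalating_hint(1))  # gentlest hint
print(escalating_hint(5))  # stays clamped at the most explicit hint
```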
Where this shows up in the code
The architecture ended up separating concerns pretty cleanly:
- submission_handler.py → orchestrates flow
- execution_engine.py → runs code safely
- hindsight_client.py → logs + queries memory
- hint_engine.py → decides hint strategy
The key boundary is here:
```python
patterns = hindsight_client.get_patterns(user_id, problem_id)
hint = hint_engine.generate(code, patterns)
```
That separation matters.
Hindsight doesn’t generate hints. It just tells you what’s been happening.
The hint engine decides what to do with that information.
Why I didn’t let the model “figure it out”
It’s tempting to push everything into the LLM:
“Here’s history, figure out what to say.”
I tried that. It’s inconsistent.
Instead, I used Hindsight for state, and kept decision logic deterministic.
This gave me:
- Predictable behavior
- Easier debugging
- Lower token usage
And honestly, it feels more like engineering than prompting.
What I’d do differently
There are still rough edges.
1. Error classification is brittle
Right now, I’m using simple heuristics:
```python
def classify_error(code):
    if "recursion" in code:
        return "recursion_error"
    return "unknown"
```
This is obviously weak. A better approach would use AST analysis or execution traces.
2. Patterns are too rigid
Threshold-based logic (>= 2) works, but it’s crude.
Some users need help faster. Others don’t.
I’d like to make this adaptive per user.
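One way to do that is to lower the repeat-error threshold for users who historically need help sooner (a sketch; `help_rate` is a hypothetical per-user statistic, not something Hindsight provides):

```python
def hint_threshold(help_rate, base=2):
    """Lower the repeat-error threshold for users who often get stuck.

    help_rate: hypothetical fraction (0.0-1.0) of past problems where
    the user needed escalated hints.
    """
    return 1 if help_rate > 0.5 else base

print(hint_threshold(0.8))  # struggling user: escalate after 1 repeat
print(hint_threshold(0.1))  # confident user: wait for 2 repeats
```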
3. No cross-problem learning yet
Hindsight stores everything, but I’m only querying per problem.
Missed opportunity.
If someone struggles with recursion in one problem, I should anticipate it in the next.
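Cross-problem learning mostly means dropping the `problem_id` filter and looking for error types that recur across different problems. A plain-Python sketch (the helper name is mine; the event shape follows the `log_event` metadata above):

```python
def global_weaknesses(events, min_problems=2):
    """Find error types a user repeats across *different* problems."""
    seen = {}  # error_type -> set of problem_ids it appeared in
    for e in events:
        if not e["passed"]:
            seen.setdefault(e["error_type"], set()).add(e["problem_id"])
    return {err for err, probs in seen.items() if len(probs) >= min_problems}

events = [
    {"problem_id": "p1", "error_type": "recursion_error", "passed": False},
    {"problem_id": "p2", "error_type": "recursion_error", "passed": False},
    {"problem_id": "p2", "error_type": "off_by_one", "passed": False},
]
print(global_weaknesses(events))  # {'recursion_error'}
```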
Why Hindsight worked here
The key thing Hindsight gave me wasn’t storage—it was structure.
Instead of:
- dumping logs
- embedding conversations
- hoping retrieval works
I had:
- typed events
- queryable patterns
- explicit control over behavior
If you’re curious, the idea is similar to what’s described on Vectorize’s agent memory page.
It’s not about making the model smarter.
It’s about giving your system memory you can reason about.
Lessons I’d keep
- Stateless systems plateau fast. If behavior doesn’t change across attempts, users stop learning.
- Memory should be structured, not appended. Events > raw text history.
- Sequence beats frequency. Repeated mistakes in a row signal “stuck,” not just “wrong.”
- Keep decision logic outside the model. Deterministic systems are easier to debug and improve.
- Start simple, but make it observable. Logging patterns mattered more than clever algorithms.
Closing thought
I didn’t set out to build an “adaptive system.” I just wanted to stop repeating the same useless hint.
Hindsight gave me a way to notice patterns, and once I had that, adapting behavior became straightforward.
The surprising part wasn’t that it worked.
It’s that something this simple made the system feel like it was actually paying attention.