I Built Adaptive Hints Using Hindsight

PARDHU CHIPINAPI

“Why is it suddenly explaining recursion?” I hadn’t changed the prompt—Hindsight had quietly picked up the user’s past mistakes and rewired the hint mid-session.


What I actually built

I’ve been working on Codemind, a coding practice platform that tries to behave less like a judge and more like a mentor. Instead of just saying “wrong answer,” it watches how you fail and adjusts guidance over time.

At a high level, the system looks like this:

  • Frontend: users solve problems, submit code
  • Backend (Python): handles submissions and feedback orchestration
  • Execution layer: runs code safely (Firecracker-style sandbox)
  • AI layer: analyzes code and generates hints
  • Memory layer (Hindsight): stores attempts and extracts patterns

The interesting part isn’t the execution or the model; it’s the memory. I wired in Hindsight, via its GitHub repository, to track user attempts over time and use that history to shape hints.

If you’ve never used it, their Hindsight documentation explains the idea well: instead of treating each interaction as isolated, you store and retrieve past events to influence future behavior. It’s basically a structured memory layer for agents.

That sounds obvious. It’s not.


The problem: stateless feedback is useless

Originally, my feedback loop looked like this:

def handle_submission(code, problem_id):
    result = run_code(code, problem_id)

    if result.passed:
        return "Correct"

    hint = generate_hint(code, problem_id)
    return hint

Every submission was evaluated independently. Same wrong logic? Same hint.

This works fine until you watch a real user:

  • Attempt 1 → wrong base case
  • Attempt 2 → still wrong base case
  • Attempt 3 → copy-paste variation, same mistake

And the system keeps saying:

“Check your recursion logic.”

That’s not helpful. That’s just noise.

What I needed was:

“You’ve made this mistake three times. Let’s change strategy.”


My first attempt at “memory” (and why it failed)

My initial solution was embarrassingly naive: just store past submissions in a database and pass them into the prompt.

history = db.get_user_attempts(user_id, problem_id)

prompt = f"""
User history:
{history}

Current code:
{code}

Give a hint.
"""

This technically worked. But in practice:

  • Prompts became huge
  • The model ignored most of the history
  • No consistent behavior across attempts

It wasn’t structured. It was just dumping text and hoping for magic.


What changed: introducing Hindsight

Instead of shoving raw history into prompts, I switched to using Hindsight as a proper memory layer.

The core idea: store events, not blobs of text.

hindsight.log_event(
    user_id=user_id,
    event_type="submission",
    metadata={
        "problem_id": problem_id,
        "error_type": classify_error(code),
        "passed": result.passed
    }
)

Now every submission becomes a structured record:

  • Which problem
  • What kind of mistake
  • Whether it passed

This is where things got interesting.

Instead of asking “what did the user do?”, I could ask:

“What patterns are emerging?”


Turning events into adaptive hints

Once events were stored, I started querying them before generating hints.

patterns = hindsight.query(
    user_id=user_id,
    filters={"problem_id": problem_id},
    aggregation="error_frequency"
)
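If you don’t have an aggregation primitive handy, the error_frequency rollup is a few lines of stdlib Python. This is a sketch of the idea, not Hindsight’s implementation; the event dicts mirror the metadata logged above:

```python
from collections import Counter

def error_frequency(events):
    """Count error types across a user's failed submission events."""
    return Counter(e["error_type"] for e in events if not e["passed"])

events = [
    {"error_type": "recursion_error", "passed": False},
    {"error_type": "recursion_error", "passed": False},
    {"error_type": "off_by_one", "passed": False},
    {"error_type": None, "passed": True},  # passing attempts carry no error
]

patterns = error_frequency(events)
# patterns["recursion_error"] == 2, patterns["off_by_one"] == 1
```

The Counter doubles as the `patterns` dict the hint logic consumes below.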

Then I used those patterns to decide how to hint.

def generate_adaptive_hint(code, patterns):
    if patterns["recursion_error"] >= 2:
        return explain_recursion_basics(code)

    if patterns["off_by_one"] >= 2:
        return highlight_edge_cases(code)

    return generic_hint(code)

This is simple logic. No fancy ML. But it changed behavior dramatically.

The system stopped reacting to one submission and started reacting to trends.


The unexpected part: sequence mattered more than frequency

At first, I thought counting errors was enough.

It wasn’t.

Two identical mistakes in a row mean something very different from:

  • mistake → fix → new mistake

So I added sequence awareness.

recent = hindsight.get_recent_events(user_id, limit=3)

if all(e["error_type"] == "recursion_error" for e in recent):
    return deep_recursion_hint()

That’s when the system started feeling… intentional.

Not just “you’re wrong,” but:

“You’re stuck in a loop. Let’s break it.”
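The “stuck in a loop” check generalizes into one small helper. A sketch, assuming the event dicts returned by `get_recent_events`; the name `is_stuck` is mine:

```python
def is_stuck(recent_events, error_type, streak=3):
    """True if the last `streak` events all show the same error type."""
    if len(recent_events) < streak:
        return False
    return all(e["error_type"] == error_type
               for e in recent_events[-streak:])

history = [{"error_type": "recursion_error"}] * 3
is_stuck(history, "recursion_error")   # same mistake three times in a row

mixed = [
    {"error_type": "recursion_error"},
    {"error_type": "off_by_one"},
    {"error_type": "recursion_error"},
]
is_stuck(mixed, "recursion_error")     # alternating errors: not stuck
```

Keeping the streak length as a parameter makes the “how stuck is stuck?” question tunable per topic.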


What the system does now (concretely)

Here’s a real scenario:

Before (stateless)

User submits broken recursion:

  1. “Check your recursion logic.”
  2. “Check your recursion logic.”
  3. “Check your recursion logic.”

Nothing changes.


After (with Hindsight)

Same user:

  1. “Check your recursion logic.”
  2. “Your base case might be incorrect.”
  3. “Let’s walk through a simple recursion example step by step.”

The system escalates guidance based on observed struggle.

No prompt changes. No retraining. Just memory.
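One way to implement that escalation is a ladder keyed by consecutive failures on the same mistake. The hint strings mirror the scenario above; the ladder structure itself is a sketch, not the exact Codemind code:

```python
RECURSION_LADDER = [
    "Check your recursion logic.",
    "Your base case might be incorrect.",
    "Let's walk through a simple recursion example step by step.",
]

def escalated_hint(ladder, consecutive_failures):
    # Clamp so continued struggle stays at the most explicit rung
    # instead of indexing past the end of the ladder.
    level = min(consecutive_failures, len(ladder)) - 1
    return ladder[max(level, 0)]

escalated_hint(RECURSION_LADDER, 1)  # first, gentle hint
escalated_hint(RECURSION_LADDER, 3)  # full step-by-step walkthrough
escalated_hint(RECURSION_LADDER, 7)  # stays at the last rung
```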


Where this shows up in the code

The architecture ended up separating concerns pretty cleanly:

  • submission_handler.py → orchestrates flow
  • execution_engine.py → runs code safely
  • hindsight_client.py → logs + queries memory
  • hint_engine.py → decides hint strategy

The key boundary is here:

patterns = hindsight_client.get_patterns(user_id, problem_id)
hint = hint_engine.generate(code, patterns)

That separation matters.

Hindsight doesn’t generate hints. It just tells you what’s been happening.

The hint engine decides what to do with that information.
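A nice side effect of that boundary: because the hint engine is a pure function of (code, patterns), you can unit-test hint policy with no sandbox and no model in the loop. A sketch with a stubbed policy function shaped like `hint_engine.generate`:

```python
# Hypothetical policy function; the shape matches hint_engine.generate.
def generate(code, patterns):
    if patterns.get("recursion_error", 0) >= 2:
        return "recursion_deep_dive"
    return "generic"

# Pure data in, decision out: trivially testable.
assert generate("def f(): ...", {"recursion_error": 2}) == "recursion_deep_dive"
assert generate("def f(): ...", {}) == "generic"
```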


Why I didn’t let the model “figure it out”

It’s tempting to push everything into the LLM:

“Here’s history, figure out what to say.”

I tried that. It’s inconsistent.

Instead, I used Hindsight for state, and kept decision logic deterministic.

This gave me:

  • Predictable behavior
  • Easier debugging
  • Lower token usage

And honestly, it feels more like engineering than prompting.


What I’d do differently

There are still rough edges.

1. Error classification is brittle

Right now, I’m using simple heuristics:

def classify_error(code):
    # Naive keyword matching on the raw source text
    if "recursion" in code:
        return "recursion_error"
    return "unknown"

This is obviously weak. A better approach would use AST analysis or execution traces.
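For the AST route, a first recursion check can be as small as “does any function call itself?”. A sketch using Python’s ast module; real error classification would need much more than this:

```python
import ast

def calls_itself(source):
    """Return names of functions that contain a direct call to themselves."""
    recursive = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for inner in ast.walk(node):
                if (isinstance(inner, ast.Call)
                        and isinstance(inner.func, ast.Name)
                        and inner.func.id == node.name):
                    recursive.append(node.name)
                    break
    return recursive

calls_itself("def fact(n):\n    return 1 if n <= 1 else n * fact(n - 1)")
# ['fact']
```

Unlike the keyword check, this never misfires on a comment or variable that happens to contain the word “recursion”.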


2. Patterns are too rigid

Threshold-based logic (>= 2) works, but it’s crude.

Some users need help faster. Others don’t.

I’d like to make this adaptive per user.


3. No cross-problem learning yet

Hindsight stores everything, but I’m only querying per problem.

Missed opportunity.

If someone struggles with recursion in one problem, I should anticipate it in the next.
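Since events already carry error_type independent of problem_id, cross-problem learning is mostly a matter of dropping the problem filter and aggregating. The helper and threshold below are mine, built on the same event dicts as earlier:

```python
from collections import Counter

def weak_topics(events, threshold=3):
    """Error types a user has hit repeatedly, across all problems."""
    counts = Counter(e["error_type"] for e in events if not e["passed"])
    return [etype for etype, n in counts.items() if n >= threshold]

events = [
    {"problem_id": "p1", "error_type": "recursion_error", "passed": False},
    {"problem_id": "p2", "error_type": "recursion_error", "passed": False},
    {"problem_id": "p3", "error_type": "recursion_error", "passed": False},
    {"problem_id": "p3", "error_type": "off_by_one", "passed": False},
]
weak_topics(events)  # recursion struggles span three different problems
```

A hint engine could consult this list to pre-emptively soften the first hint on a new recursion problem.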


Why Hindsight worked here

The key thing Hindsight gave me wasn’t storage—it was structure.

Instead of:

  • dumping logs
  • embedding conversations
  • hoping retrieval works

I had:

  • typed events
  • queryable patterns
  • explicit control over behavior

If you’re curious, the idea is similar to what’s described on the agent memory page on Vectorize.

It’s not about making the model smarter.

It’s about giving your system memory you can reason about.


Lessons I’d keep

  • Stateless systems plateau fast
    If behavior doesn’t change across attempts, users stop learning.

  • Memory should be structured, not appended
    Events > raw text history.

  • Sequence beats frequency
    Repeated mistakes in a row signal “stuck,” not just “wrong.”

  • Keep decision logic outside the model
    Deterministic systems are easier to debug and improve.

  • Start simple, but make it observable
    Logging patterns mattered more than clever algorithms.


Closing thought

I didn’t set out to build an “adaptive system.” I just wanted to stop repeating the same useless hint.

Hindsight gave me a way to notice patterns, and once I had that, adapting behavior became straightforward.

The surprising part wasn’t that it worked.

It’s that something this simple made the system feel like it was actually paying attention.
