I Built Adaptive Hints Using Hindsight

PARDHU CHIPINAPI

“Why is it suddenly explaining recursion?” I hadn’t changed the prompt—Hindsight had quietly picked up the user’s past mistakes and rewired the hint mid-session.


What I actually built

I’ve been working on Codemind, a coding practice platform that tries to behave less like a judge and more like a mentor. Instead of just saying “wrong answer,” it watches how you fail and adjusts guidance over time.

At a high level, the system looks like this:

  • Frontend: users solve problems, submit code
  • Backend (Python): handles submissions and feedback orchestration
  • Execution layer: runs code safely (Firecracker-style sandbox)
  • AI layer: analyzes code and generates hints
  • Memory layer (Hindsight): stores attempts and extracts patterns

The interesting part isn’t the execution or the model; it’s the memory. I wired in Hindsight, via its GitHub repository, to track user attempts over time and use that history to shape hints.

If you’ve never used it, their Hindsight documentation explains the idea well: instead of treating each interaction as isolated, you store and retrieve past events to influence future behavior. It’s basically a structured memory layer for agents.

That sounds obvious. It’s not.


The problem: stateless feedback is useless

Originally, my feedback loop looked like this:

def handle_submission(code, problem_id):
    result = run_code(code, problem_id)

    if result.passed:
        return "Correct"

    hint = generate_hint(code, problem_id)
    return hint

Every submission was evaluated independently. Same wrong logic? Same hint.

This works fine until you watch a real user:

  • Attempt 1 → wrong base case
  • Attempt 2 → still wrong base case
  • Attempt 3 → copy-paste variation, same mistake

And the system keeps saying:

“Check your recursion logic.”

That’s not helpful. That’s just noise.

What I needed was:

“You’ve made this mistake three times. Let’s change strategy.”


My first attempt at “memory” (and why it failed)

My initial solution was embarrassingly naive: just store past submissions in a database and pass them into the prompt.

history = db.get_user_attempts(user_id, problem_id)

prompt = f"""
User history:
{history}

Current code:
{code}

Give a hint.
"""

This technically worked. But in practice:

  • Prompts became huge
  • The model ignored most of the history
  • No consistent behavior across attempts

It wasn’t structured. It was just dumping text and hoping for magic.


What changed: introducing Hindsight

Instead of shoving raw history into prompts, I switched to using Hindsight as a proper memory layer.

The core idea: store events, not blobs of text.

hindsight.log_event(
    user_id=user_id,
    event_type="submission",
    metadata={
        "problem_id": problem_id,
        "error_type": classify_error(code),
        "passed": result.passed
    }
)

Now every submission becomes a structured record:

  • Which problem
  • What kind of mistake
  • Whether it passed

This is where things got interesting.

Instead of asking “what did the user do?”, I could ask:

“What patterns are emerging?”


Turning events into adaptive hints

Once events were stored, I started querying them before generating hints.

patterns = hindsight.query(
    user_id=user_id,
    filters={"problem_id": problem_id},
    aggregation="error_frequency"
)
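If you don’t have an aggregation primitive handy, the error_frequency rollup is a few lines of stdlib Python. This is a sketch of the idea, not Hindsight’s implementation; the event dicts mirror the metadata logged above:

```python
from collections import Counter

def error_frequency(events):
    """Count error types across a user's failed submission events."""
    return Counter(e["error_type"] for e in events if not e["passed"])

events = [
    {"error_type": "recursion_error", "passed": False},
    {"error_type": "recursion_error", "passed": False},
    {"error_type": "off_by_one", "passed": False},
    {"error_type": None, "passed": True},  # passing attempts carry no error
]

patterns = error_frequency(events)
# patterns["recursion_error"] == 2, patterns["off_by_one"] == 1
```

The Counter doubles as the `patterns` dict the hint logic consumes below.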

Then I used those patterns to decide how to hint.

def generate_adaptive_hint(code, patterns):
    if patterns["recursion_error"] >= 2:
        return explain_recursion_basics(code)

    if patterns["off_by_one"] >= 2:
        return highlight_edge_cases(code)

    return generic_hint(code)

This is simple logic. No fancy ML. But it changed behavior dramatically.

The system stopped reacting to one submission and started reacting to trends.


The unexpected part: sequence mattered more than frequency

At first, I thought counting errors was enough.

It wasn’t.

Two identical mistakes in a row mean something very different from:

  • mistake → fix → new mistake

So I added sequence awareness.

recent = hindsight.get_recent_events(user_id, limit=3)

if all(e["error_type"] == "recursion_error" for e in recent):
    return deep_recursion_hint()

That’s when the system started feeling… intentional.

Not just “you’re wrong,” but:

“You’re stuck in a loop. Let’s break it.”
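The “stuck in a loop” check generalizes into one small helper. A sketch, assuming the event dicts returned by `get_recent_events`; the name `is_stuck` is mine:

```python
def is_stuck(recent_events, error_type, streak=3):
    """True if the last `streak` events all show the same error type."""
    if len(recent_events) < streak:
        return False
    return all(e["error_type"] == error_type
               for e in recent_events[-streak:])

history = [{"error_type": "recursion_error"}] * 3
is_stuck(history, "recursion_error")   # same mistake three times in a row

mixed = [
    {"error_type": "recursion_error"},
    {"error_type": "off_by_one"},
    {"error_type": "recursion_error"},
]
is_stuck(mixed, "recursion_error")     # alternating errors: not stuck
```

Keeping the streak length as a parameter makes the “how stuck is stuck?” question tunable per topic.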


What the system does now (concretely)

Here’s a real scenario:

Before (stateless)

User submits broken recursion:

  1. “Check your recursion logic.”
  2. “Check your recursion logic.”
  3. “Check your recursion logic.”

Nothing changes.


After (with Hindsight)

Same user:

  1. “Check your recursion logic.”
  2. “Your base case might be incorrect.”
  3. “Let’s walk through a simple recursion example step by step.”

The system escalates guidance based on observed struggle.

No prompt changes. No retraining. Just memory.
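One way to implement that escalation is a ladder keyed by consecutive failures on the same mistake. The hint strings mirror the scenario above; the ladder structure itself is a sketch, not the exact Codemind code:

```python
RECURSION_LADDER = [
    "Check your recursion logic.",
    "Your base case might be incorrect.",
    "Let's walk through a simple recursion example step by step.",
]

def escalated_hint(ladder, consecutive_failures):
    # Clamp so continued struggle stays at the most explicit rung
    # instead of indexing past the end of the ladder.
    level = min(consecutive_failures, len(ladder)) - 1
    return ladder[max(level, 0)]

escalated_hint(RECURSION_LADDER, 1)  # first, gentle hint
escalated_hint(RECURSION_LADDER, 3)  # full step-by-step walkthrough
escalated_hint(RECURSION_LADDER, 7)  # stays at the last rung
```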


Where this shows up in the code

The architecture ended up separating concerns pretty cleanly:

  • submission_handler.py → orchestrates flow
  • execution_engine.py → runs code safely
  • hindsight_client.py → logs + queries memory
  • hint_engine.py → decides hint strategy

The key boundary is here:

patterns = hindsight_client.get_patterns(user_id, problem_id)
hint = hint_engine.generate(code, patterns)

That separation matters.

Hindsight doesn’t generate hints. It just tells you what’s been happening.

The hint engine decides what to do with that information.
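A nice side effect of that boundary: because the hint engine is a pure function of (code, patterns), you can unit-test hint policy with no sandbox and no model in the loop. A sketch with a stubbed policy function shaped like `hint_engine.generate`:

```python
# Hypothetical policy function; the shape matches hint_engine.generate.
def generate(code, patterns):
    if patterns.get("recursion_error", 0) >= 2:
        return "recursion_deep_dive"
    return "generic"

# Pure data in, decision out: trivially testable.
assert generate("def f(): ...", {"recursion_error": 2}) == "recursion_deep_dive"
assert generate("def f(): ...", {}) == "generic"
```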


Why I didn’t let the model “figure it out”

It’s tempting to push everything into the LLM:

“Here’s history, figure out what to say.”

I tried that. It’s inconsistent.

Instead, I used Hindsight for state, and kept decision logic deterministic.

This gave me:

  • Predictable behavior
  • Easier debugging
  • Lower token usage

And honestly, it feels more like engineering than prompting.


What I’d do differently

There are still rough edges.

1. Error classification is brittle

Right now, I’m using simple heuristics:

def classify_error(code):
    # Naive keyword matching on the raw source text
    if "recursion" in code:
        return "recursion_error"
    return "unknown"

This is obviously weak. A better approach would use AST analysis or execution traces.
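For the AST route, a first recursion check can be as small as “does any function call itself?”. A sketch using Python’s ast module; real error classification would need much more than this:

```python
import ast

def calls_itself(source):
    """Return names of functions that contain a direct call to themselves."""
    recursive = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for inner in ast.walk(node):
                if (isinstance(inner, ast.Call)
                        and isinstance(inner.func, ast.Name)
                        and inner.func.id == node.name):
                    recursive.append(node.name)
                    break
    return recursive

calls_itself("def fact(n):\n    return 1 if n <= 1 else n * fact(n - 1)")
# ['fact']
```

Unlike the keyword check, this never misfires on a comment or variable that happens to contain the word “recursion”.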


2. Patterns are too rigid

Threshold-based logic (>= 2) works, but it’s crude.

Some users need help faster. Others don’t.

I’d like to make this adaptive per user.


3. No cross-problem learning yet

Hindsight stores everything, but I’m only querying per problem.

Missed opportunity.

If someone struggles with recursion in one problem, I should anticipate it in the next.
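Since events already carry error_type independent of problem_id, cross-problem learning is mostly a matter of dropping the problem filter and aggregating. The helper and threshold below are mine, built on the same event dicts as earlier:

```python
from collections import Counter

def weak_topics(events, threshold=3):
    """Error types a user has hit repeatedly, across all problems."""
    counts = Counter(e["error_type"] for e in events if not e["passed"])
    return [etype for etype, n in counts.items() if n >= threshold]

events = [
    {"problem_id": "p1", "error_type": "recursion_error", "passed": False},
    {"problem_id": "p2", "error_type": "recursion_error", "passed": False},
    {"problem_id": "p3", "error_type": "recursion_error", "passed": False},
    {"problem_id": "p3", "error_type": "off_by_one", "passed": False},
]
weak_topics(events)  # recursion struggles span three different problems
```

A hint engine could consult this list to pre-emptively soften the first hint on a new recursion problem.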


Why Hindsight worked here

The key thing Hindsight gave me wasn’t storage—it was structure.

Instead of:

  • dumping logs
  • embedding conversations
  • hoping retrieval works

I had:

  • typed events
  • queryable patterns
  • explicit control over behavior

If you’re curious, the idea is similar to what’s described on the agent memory page on Vectorize.

It’s not about making the model smarter.

It’s about giving your system memory you can reason about.


Lessons I’d keep

  • Stateless systems plateau fast
    If behavior doesn’t change across attempts, users stop learning.

  • Memory should be structured, not appended
    Events > raw text history.

  • Sequence beats frequency
    Repeated mistakes in a row signal “stuck,” not just “wrong.”

  • Keep decision logic outside the model
    Deterministic systems are easier to debug and improve.

  • Start simple, but make it observable
    Logging patterns mattered more than clever algorithms.


Closing thought

I didn’t set out to build an “adaptive system.” I just wanted to stop repeating the same useless hint.

Hindsight gave me a way to notice patterns, and once I had that, adapting behavior became straightforward.

The surprising part wasn’t that it worked.

It’s that something this simple made the system feel like it was actually paying attention.
