I used to think feedback systems were stateless—until I saw Hindsight pull a previous wrong approach and adjust the next hint without being told.
What I Actually Built
This project started as a fairly standard coding practice platform. A user writes code, we execute it in a sandbox, compare outputs, and return pass/fail with some explanation. If you’ve built anything like a mini-LeetCode, you already know the shape:
- A frontend where users solve problems
- A Python backend handling submissions
- An execution layer (sandboxed, Firecracker-style isolation)
- An “AI layer” that explains failures
At first, everything was request → response. Clean. Predictable. Also… kind of useless after the third failed attempt.
The interesting part of this repo is the memory layer—what I called “Hindsight”—that sits between code execution and feedback generation. Instead of treating each submission independently, it stores and reuses past attempts to influence future hints.
Concretely, the system looks like this:
Frontend → API → Execution Engine → Analyzer → Hindsight → Feedback back to user
The only non-trivial piece there is Hindsight. Everything else is plumbing.
The Problem: Stateless Feedback Breaks Down Fast
Initially, my feedback loop looked like this:
```python
def evaluate_submission(code, test_cases):
    result = run_in_sandbox(code)
    if result.passed:
        return {"status": "pass"}

    explanation = analyze_failure(code, result)
    return {
        "status": "fail",
        "hint": explanation
    }
```
This works fine for a single attempt. But once a user fails multiple times, it becomes obvious how broken this model is.
- The system repeats the same hint
- It doesn’t recognize patterns
- It doesn’t escalate guidance
- It doesn’t adapt
If someone makes the same off-by-one error three times, returning the same generic explanation is borderline insulting.
I didn’t need a better model. I needed memory.
The First Attempt: Just Store Attempts
My first version of Hindsight was embarrassingly simple. I just stored previous submissions and their outcomes.
```python
class AttemptStore:
    def __init__(self):
        self.attempts = {}

    def save(self, user_id, problem_id, attempt):
        key = (user_id, problem_id)
        self.attempts.setdefault(key, []).append(attempt)

    def get_history(self, user_id, problem_id):
        return self.attempts.get((user_id, problem_id), [])
```
Each attempt looked roughly like:
```python
{
    "code": "...",
    "error": "IndexError",
    "analysis": "off-by-one in loop",
    "timestamp": ...
}
```
Then I wired it into the feedback path:
```python
history = attempt_store.get_history(user_id, problem_id)
hint = generate_hint(current_attempt, history)
```
This felt like progress. It wasn’t.
All I had done was move from stateless → slightly stateful. The system had history, but it didn’t use it meaningfully.
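To see why, here's the store in isolation, with the class repeated so the snippet runs on its own. History accumulates correctly, but nothing downstream consumes it yet:

```python
# Same AttemptStore as above, restated so this snippet is standalone.
class AttemptStore:
    def __init__(self):
        self.attempts = {}

    def save(self, user_id, problem_id, attempt):
        key = (user_id, problem_id)
        self.attempts.setdefault(key, []).append(attempt)

    def get_history(self, user_id, problem_id):
        return self.attempts.get((user_id, problem_id), [])

store = AttemptStore()
store.save("u1", "p7", {"analysis": "off-by-one in loop", "error": "IndexError"})
store.save("u1", "p7", {"analysis": "off-by-one in loop", "error": "IndexError"})

# Two attempts recorded for (user, problem); unknown keys return [].
history = store.get_history("u1", "p7")
```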
The Shift: Memory Needs Structure, Not Just Storage
The real change came when I stopped thinking of Hindsight as a log and started treating it as a pattern extractor.
Instead of passing raw history into the hint generator, I introduced a small layer that compresses attempts into “mistake patterns”.
```python
def extract_patterns(history):
    patterns = {}
    for attempt in history:
        key = attempt["analysis"]
        patterns[key] = patterns.get(key, 0) + 1
    return patterns
```
Now instead of this:
```python
generate_hint(current, history)
```
I had this:
```python
patterns = extract_patterns(history)
generate_hint(current, patterns)
```
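For example, a three-attempt history collapses into counts that a hint generator can actually act on:

```python
def extract_patterns(history):
    patterns = {}
    for attempt in history:
        key = attempt["analysis"]
        patterns[key] = patterns.get(key, 0) + 1
    return patterns

history = [
    {"analysis": "off-by-one in loop"},
    {"analysis": "off-by-one in loop"},
    {"analysis": "unhandled empty input"},
]
patterns = extract_patterns(history)
# {'off-by-one in loop': 2, 'unhandled empty input': 1}
```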
This seems trivial, but it changes the behavior significantly.
Instead of:
“There might be an off-by-one error”
You get:
“You’ve had off-by-one issues in previous attempts—check your loop bounds again.”
That one line made the system feel like it was actually paying attention.
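The hint generator itself isn't shown in this post; a hypothetical version that escalates once a pattern repeats might look like this (the threshold and wording are mine, not the repo's actual logic):

```python
# Hypothetical generate_hint: escalates once the same analysis label
# has appeared repeat_threshold or more times in the pattern counts.
def generate_hint(current_analysis, patterns, repeat_threshold=2):
    count = patterns.get(current_analysis, 0)
    if count >= repeat_threshold:
        return (f"You've run into '{current_analysis}' {count} times already; "
                "revisit that exact spot before changing anything else.")
    return f"There might be a problem here: {current_analysis}."

patterns = {"off-by-one in loop": 3}
generate_hint("off-by-one in loop", patterns)   # escalated wording
generate_hint("unhandled empty input", patterns)  # generic first-time wording
```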
Where Hindsight Actually Fits
At a high level, Hindsight sits between analysis and feedback:
```python
def process_submission(user_id, problem_id, code):
    result = run_in_sandbox(code)
    # Only analyze actual failures; passing submissions have nothing to diagnose.
    analysis = None if result.passed else analyze_failure(code, result)

    attempt = {
        "code": code,
        "analysis": analysis,
        "passed": result.passed
    }
    attempt_store.save(user_id, problem_id, attempt)

    history = attempt_store.get_history(user_id, problem_id)
    patterns = extract_patterns(history)
    hint = generate_hint(analysis, patterns)

    return {
        "passed": result.passed,
        "hint": hint
    }
```
It’s deliberately dumb infrastructure:
- No embeddings
- No vector search
- No heavy retrieval
Just structured memory + aggregation.
That decision was intentional. I didn’t want infra complexity before I had behavior worth scaling.
Where Hindsight Surprised Me
The biggest surprise wasn’t that it worked—it was how quickly it changed user behavior.
Before Hindsight:
- Users brute-forced solutions
- Repeated the same mistake
- Ignored hints
After Hindsight:
- Users slowed down
- They started reading hints
- They corrected patterns instead of guessing
The system didn’t become “smarter”. It became contextual.
A Concrete Example
Problem: reverse a string
User attempts:
- Uses incorrect indexing → fails
- Fixes something else, same indexing bug → fails
- Tries slicing incorrectly → fails
Without memory:
“Check your indexing logic.”
With Hindsight:
“You’ve repeated indexing issues across attempts. Focus on how you're accessing elements rather than changing approach.”
That’s a completely different interaction.
Why I Didn’t Go Full Vector DB
At some point, I looked into integrating a proper memory system using embeddings and semantic retrieval. That’s where I came across the Hindsight GitHub repository and its docs.
They take a much more sophisticated approach:
- Store interactions as embeddings
- Retrieve relevant past context
- Feed it into agent reasoning
That’s the “correct” long-term direction.
But for this project, it would have been premature.
My constraints:
- Students using low-resource environments
- Need for predictable behavior
- No tolerance for latency spikes
A simple in-memory + structured pattern system beat a complex retrieval pipeline.
Tradeoffs I Accepted (and Felt)
1. Memory is shallow
I’m not capturing semantic similarity. If two mistakes look different but are conceptually the same, I miss it.
2. Patterns are naive
Counting occurrences works, but it doesn’t capture progression. A user improving still looks like “repeating errors”.
3. No cross-problem learning
Everything is scoped to (user_id, problem_id). If someone struggles with recursion everywhere, the system doesn’t connect that yet.
That’s the next obvious step.
What I’d Do Differently Next Time
If I were to rebuild this with more time:
1. Introduce embeddings only for pattern grouping
Not full retrieval—just clustering similar mistakes.
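As a rough sketch of that grouping idea, here's a version that uses plain string similarity (`difflib.SequenceMatcher`) as a stand-in for embeddings; the metric and threshold are placeholders for whatever clustering you'd actually use:

```python
from difflib import SequenceMatcher

# Sketch: group analyzer labels whose surface strings are similar.
# SequenceMatcher is a cheap stand-in for embedding similarity here;
# the point is the grouping structure, not the metric.
def group_similar(labels, threshold=0.6):
    groups = []
    for label in labels:
        for group in groups:
            if SequenceMatcher(None, label, group[0]).ratio() >= threshold:
                group.append(label)
                break
        else:
            groups.append([label])
    return groups

labels = [
    "off-by-one in loop",
    "off-by-one in while loop",
    "unhandled empty input",
]
groups = group_similar(labels)
# the two off-by-one variants cluster together; the third stands alone
```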
2. Track transitions, not just counts
Instead of:

```
off-by-one: 3
```

Track:

```
off-by-one → fixed → new error
```
That’s actual learning progression.
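A minimal way to derive those transitions from the ordered attempt history (field names assumed to match the attempt dicts above):

```python
# Sketch: turn an ordered attempt history into (state, next_state) pairs,
# so "off-by-one -> fixed -> new error" becomes visible, not just counts.
def extract_transitions(history):
    states = []
    for attempt in history:
        states.append("fixed" if attempt["passed"] else attempt["analysis"])
    return list(zip(states, states[1:]))

history = [
    {"analysis": "off-by-one in loop", "passed": False},
    {"analysis": None, "passed": True},
    {"analysis": "unhandled empty input", "passed": False},
]
extract_transitions(history)
# [('off-by-one in loop', 'fixed'), ('fixed', 'unhandled empty input')]
```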
3. Separate “mistake type” from “surface error”
Right now, I rely too much on analyzer output strings. I’d formalize a schema:
```json
{
  "type": "loop_boundary",
  "severity": "basic",
  "context": "array iteration"
}
```
This makes memory reusable and consistent.
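Sketched as a dataclass (field names are taken from the example above; the class name and frozen choice are mine):

```python
from dataclasses import dataclass

# Sketch of a formal mistake schema, decoupling the stable mistake
# type from whatever surface string the analyzer happened to emit.
@dataclass(frozen=True)
class MistakePattern:
    type: str      # stable category, e.g. "loop_boundary"
    severity: str  # e.g. "basic"
    context: str   # where it occurred, e.g. "array iteration"

m = MistakePattern(type="loop_boundary", severity="basic", context="array iteration")
```

Frozen instances are hashable, so they can be counted and grouped directly, which is exactly what the pattern extractor needs.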
Lessons Learned
Memory beats better explanations
You don’t need a smarter model if you remember what already happened.
Logs are not memory
Storing data isn’t enough. You need structure that influences decisions.
Start dumb, then layer complexity
A dictionary + counting got me surprisingly far.
Personalization changes behavior
The moment users feel “seen”, they stop brute forcing.
Scope aggressively
Per-user, per-problem memory sounds limited—but it’s what kept the system predictable.