I used to think feedback systems were stateless—until I saw Hindsight pull a previous wrong approach and adjust the next hint without being told.
What I Actually Built
This project started as a fairly standard coding practice platform. A user writes code, we execute it in a sandbox, compare outputs, and return pass/fail with some explanation. If you’ve built anything like a mini-LeetCode, you already know the shape:
- A frontend where users solve problems
- A Python backend handling submissions
- An execution layer (sandboxed, Firecracker-style isolation)
- An “AI layer” that explains failures
At first, everything was request → response. Clean. Predictable. Also… kind of useless after the third failed attempt.
The interesting part of this repo is the memory layer—what I called “Hindsight”—that sits between code execution and feedback generation. Instead of treating each submission independently, it stores and reuses past attempts to influence future hints.
Concretely, the system looks like this:
Frontend → API → Execution Engine → Analyzer → Hindsight → Feedback back to user
The only non-trivial piece there is Hindsight. Everything else is plumbing.
The Problem: Stateless Feedback Breaks Down Fast
Initially, my feedback loop looked like this:
```python
def evaluate_submission(code, test_cases):
    result = run_in_sandbox(code)
    if result.passed:
        return {"status": "pass"}

    explanation = analyze_failure(code, result)
    return {
        "status": "fail",
        "hint": explanation
    }
```
This works fine for a single attempt. But once a user fails multiple times, it becomes obvious how broken this model is.
- The system repeats the same hint
- It doesn’t recognize patterns
- It doesn’t escalate guidance
- It doesn’t adapt
If someone makes the same off-by-one error three times, returning the same generic explanation is borderline insulting.
I didn’t need a better model. I needed memory.
The First Attempt: Just Store Attempts
My first version of Hindsight was embarrassingly simple. I just stored previous submissions and their outcomes.
```python
class AttemptStore:
    def __init__(self):
        self.attempts = {}

    def save(self, user_id, problem_id, attempt):
        key = (user_id, problem_id)
        self.attempts.setdefault(key, []).append(attempt)

    def get_history(self, user_id, problem_id):
        return self.attempts.get((user_id, problem_id), [])
```
Each attempt looked roughly like:
```python
{
    "code": "...",
    "error": "IndexError",
    "analysis": "off-by-one in loop",
    "timestamp": ...
}
```
Then I wired it into the feedback path:
```python
history = attempt_store.get_history(user_id, problem_id)
hint = generate_hint(current_attempt, history)
```
This felt like progress. It wasn’t.
All I had done was move from stateless → slightly stateful. The system had history, but it didn’t use it meaningfully.
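To see why, here's the store in isolation, with the class repeated so the snippet runs on its own. History accumulates correctly, but nothing downstream consumes it yet:

```python
# Same AttemptStore as above, restated so this snippet is standalone.
class AttemptStore:
    def __init__(self):
        self.attempts = {}

    def save(self, user_id, problem_id, attempt):
        key = (user_id, problem_id)
        self.attempts.setdefault(key, []).append(attempt)

    def get_history(self, user_id, problem_id):
        return self.attempts.get((user_id, problem_id), [])

store = AttemptStore()
store.save("u1", "p7", {"analysis": "off-by-one in loop", "error": "IndexError"})
store.save("u1", "p7", {"analysis": "off-by-one in loop", "error": "IndexError"})

# Two attempts recorded for (user, problem); unknown keys return [].
history = store.get_history("u1", "p7")
```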
The Shift: Memory Needs Structure, Not Just Storage
The real change came when I stopped thinking of Hindsight as a log and started treating it as a pattern extractor.
Instead of passing raw history into the hint generator, I introduced a small layer that compresses attempts into “mistake patterns”.
```python
def extract_patterns(history):
    patterns = {}
    for attempt in history:
        key = attempt["analysis"]
        patterns[key] = patterns.get(key, 0) + 1
    return patterns
```
Now instead of this:
```python
generate_hint(current, history)
```
I had this:
```python
patterns = extract_patterns(history)
generate_hint(current, patterns)
```
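For example, a three-attempt history collapses into counts that a hint generator can actually act on:

```python
def extract_patterns(history):
    patterns = {}
    for attempt in history:
        key = attempt["analysis"]
        patterns[key] = patterns.get(key, 0) + 1
    return patterns

history = [
    {"analysis": "off-by-one in loop"},
    {"analysis": "off-by-one in loop"},
    {"analysis": "unhandled empty input"},
]
patterns = extract_patterns(history)
# {'off-by-one in loop': 2, 'unhandled empty input': 1}
```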
This seems trivial, but it changes the behavior significantly.
Instead of:
“There might be an off-by-one error”
You get:
“You’ve had off-by-one issues in previous attempts—check your loop bounds again.”
That one line made the system feel like it was actually paying attention.
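The hint generator itself isn't shown in this post; a hypothetical version that escalates once a pattern repeats might look like this (the threshold and wording are mine, not the repo's actual logic):

```python
# Hypothetical generate_hint: escalates once the same analysis label
# has appeared repeat_threshold or more times in the pattern counts.
def generate_hint(current_analysis, patterns, repeat_threshold=2):
    count = patterns.get(current_analysis, 0)
    if count >= repeat_threshold:
        return (f"You've run into '{current_analysis}' {count} times already; "
                "revisit that exact spot before changing anything else.")
    return f"There might be a problem here: {current_analysis}."

patterns = {"off-by-one in loop": 3}
generate_hint("off-by-one in loop", patterns)   # escalated wording
generate_hint("unhandled empty input", patterns)  # generic first-time wording
```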
Where Hindsight Actually Fits
At a high level, Hindsight sits between analysis and feedback:
```python
def process_submission(user_id, problem_id, code):
    result = run_in_sandbox(code)
    # Only analyze actual failures; passing submissions have nothing to diagnose.
    analysis = None if result.passed else analyze_failure(code, result)

    attempt = {
        "code": code,
        "analysis": analysis,
        "passed": result.passed
    }
    attempt_store.save(user_id, problem_id, attempt)

    history = attempt_store.get_history(user_id, problem_id)
    patterns = extract_patterns(history)
    hint = generate_hint(analysis, patterns)

    return {
        "passed": result.passed,
        "hint": hint
    }
```
It’s deliberately dumb infrastructure:
- No embeddings
- No vector search
- No heavy retrieval
Just structured memory + aggregation.
That decision was intentional. I didn’t want infra complexity before I had behavior worth scaling.
Where Hindsight Surprised Me
The biggest surprise wasn’t that it worked—it was how quickly it changed user behavior.
Before Hindsight:
- Users brute-forced solutions
- Repeated the same mistake
- Ignored hints
After Hindsight:
- Users slowed down
- They started reading hints
- They corrected patterns instead of guessing
The system didn’t become “smarter”. It became contextual.
A Concrete Example
Problem: reverse a string
User attempts:
- Uses incorrect indexing → fails
- Fixes something else, same indexing bug → fails
- Tries slicing incorrectly → fails
Without memory:
“Check your indexing logic.”
With Hindsight:
“You’ve repeated indexing issues across attempts. Focus on how you're accessing elements rather than changing approach.”
That’s a completely different interaction.
Why I Didn’t Go Full Vector DB
At some point, I looked into integrating a proper memory system using embeddings and semantic retrieval. That’s where I came across the Hindsight GitHub repository and its docs.
They take a much more sophisticated approach:
- Store interactions as embeddings
- Retrieve relevant past context
- Feed it into agent reasoning
That’s the “correct” long-term direction.
But for this project, it would have been premature.
My constraints:
- Students using low-resource environments
- Need for predictable behavior
- No tolerance for latency spikes
A simple in-memory + structured pattern system beat a complex retrieval pipeline.
Tradeoffs I Accepted (and Felt)
1. Memory is shallow
I’m not capturing semantic similarity. If two mistakes look different but are conceptually the same, I miss it.
2. Patterns are naive
Counting occurrences works, but it doesn’t capture progression. A user improving still looks like “repeating errors”.
3. No cross-problem learning
Everything is scoped to (user_id, problem_id). If someone struggles with recursion everywhere, the system doesn’t connect that yet.
That’s the next obvious step.
What I’d Do Differently Next Time
If I were to rebuild this with more time:
1. Introduce embeddings only for pattern grouping
Not full retrieval—just clustering similar mistakes.
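As a rough sketch of that grouping idea, here's a version that uses plain string similarity (`difflib.SequenceMatcher`) as a stand-in for embeddings; the metric and threshold are placeholders for whatever clustering you'd actually use:

```python
from difflib import SequenceMatcher

# Sketch: group analyzer labels whose surface strings are similar.
# SequenceMatcher is a cheap stand-in for embedding similarity here;
# the point is the grouping structure, not the metric.
def group_similar(labels, threshold=0.6):
    groups = []
    for label in labels:
        for group in groups:
            if SequenceMatcher(None, label, group[0]).ratio() >= threshold:
                group.append(label)
                break
        else:
            groups.append([label])
    return groups

labels = [
    "off-by-one in loop",
    "off-by-one in while loop",
    "unhandled empty input",
]
groups = group_similar(labels)
# the two off-by-one variants cluster together; the third stands alone
```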
2. Track transitions, not just counts
Instead of:

```
off-by-one: 3
```

Track:

```
off-by-one → fixed → new error
```
That’s actual learning progression.
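A minimal way to derive those transitions from the ordered attempt history (field names assumed to match the attempt dicts above):

```python
# Sketch: turn an ordered attempt history into (state, next_state) pairs,
# so "off-by-one -> fixed -> new error" becomes visible, not just counts.
def extract_transitions(history):
    states = []
    for attempt in history:
        states.append("fixed" if attempt["passed"] else attempt["analysis"])
    return list(zip(states, states[1:]))

history = [
    {"analysis": "off-by-one in loop", "passed": False},
    {"analysis": None, "passed": True},
    {"analysis": "unhandled empty input", "passed": False},
]
extract_transitions(history)
# [('off-by-one in loop', 'fixed'), ('fixed', 'unhandled empty input')]
```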
3. Separate “mistake type” from “surface error”
Right now, I rely too much on analyzer output strings. I’d formalize a schema:
```json
{
  "type": "loop_boundary",
  "severity": "basic",
  "context": "array iteration"
}
```
This makes memory reusable and consistent.
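Sketched as a dataclass (field names are taken from the example above; the class name and frozen choice are mine):

```python
from dataclasses import dataclass

# Sketch of a formal mistake schema, decoupling the stable mistake
# type from whatever surface string the analyzer happened to emit.
@dataclass(frozen=True)
class MistakePattern:
    type: str      # stable category, e.g. "loop_boundary"
    severity: str  # e.g. "basic"
    context: str   # where it occurred, e.g. "array iteration"

m = MistakePattern(type="loop_boundary", severity="basic", context="array iteration")
```

Frozen instances are hashable, so they can be counted and grouped directly, which is exactly what the pattern extractor needs.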
Lessons Learned
Memory beats better explanations
You don’t need a smarter model if you remember what already happened.
Logs are not memory
Storing data isn’t enough. You need structure that influences decisions.
Start dumb, then layer complexity
A dictionary + counting got me surprisingly far.
Personalization changes behavior
The moment users feel “seen”, they stop brute forcing.
Scope aggressively
Per-user, per-problem memory sounds limited—but it’s what kept the system predictable.