Hindsight caught repeated AST traversal bugs
“Why is it flagging recursion again?” I checked the logs, and the agent wasn’t looping—it was using Hindsight to call out the same AST traversal mistake I’d made three commits ago.
Last night I assumed my AST walker was fixed; by morning, Hindsight had surfaced the exact same traversal bug across three different submissions I thought were unrelated.
What I actually built
I’ve been working on Codemind, a coding practice platform that doesn’t just run user code—it tries to understand how someone is solving a problem and guide them while they’re doing it.
At a high level, the system looks like this:
- A frontend where users write and submit code
- A Python backend that parses and analyzes submissions
- A sandboxed execution layer (Firecracker-style isolation) to safely run code
- An analysis pipeline that walks ASTs and applies rules + taint tracking
- A memory layer powered by Hindsight that stores patterns across attempts
The interesting part isn’t execution—it’s the analysis loop:
- Parse code into an AST
- Traverse it to detect patterns (recursion misuse, unsafe flows, etc.)
- Generate feedback
- Store what happened
- Use that history to influence future feedback
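As a rough sketch, one iteration of that loop could look like the following. Everything here is illustrative, not Codemind's actual API: the plain `history` list stands in for the Hindsight memory layer, and the only "pattern" detected is a bare `while True:` loop.

```python
import ast

def analyze(submission: str, history: list) -> dict:
    """One pass of the loop: parse, traverse, give feedback, record."""
    tree = ast.parse(submission)                   # Step 1: parse into an AST
    issues = []
    for node in ast.walk(tree):
        # Step 2: detect patterns -- here, just bare `while True:` loops.
        if isinstance(node, ast.While) and isinstance(node.test, ast.Constant) \
                and node.test.value is True:
            issues.append("possible non-terminating loop")
    feedback = issues or ["looks fine"]            # Step 3: generate feedback
    history.append({"issues": issues})             # Step 4: store what happened
    repeats = sum(1 for e in history if e["issues"] == issues)  # Step 5: use history
    return {"feedback": feedback, "times_seen": repeats}
```

Calling `analyze` twice with the same broken submission would report `times_seen: 2` on the second run, which is the hook that later feedback escalation hangs on.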
That last step is where things got weird—in a good way.
The problem I thought I had
Originally, I thought my biggest challenge would be writing good static analysis rules.
Things like:
- Detect recursion without base case
- Identify unsafe variable flows (taint analysis)
- Catch incorrect loop termination
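Of those three, taint analysis is the least self-explanatory, so here is a deliberately naive sketch of the idea. `find_tainted_sinks` and its source/sink lists are hypothetical, and it only follows direct `x = input()` style assignments, not re-assignments or expressions:

```python
import ast

def find_tainted_sinks(code, sources=("input",), sinks=("eval", "exec")):
    """Flag sink calls whose argument traces back to a taint source."""
    tree = ast.parse(code)
    tainted = set()
    findings = []
    for node in ast.walk(tree):
        # Mark variables assigned straight from a source call as tainted.
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
            fn = node.value.func
            if isinstance(fn, ast.Name) and fn.id in sources:
                tainted |= {t.id for t in node.targets if isinstance(t, ast.Name)}
        # Flag sink calls that receive a tainted variable.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) \
                and node.func.id in sinks:
            for arg in node.args:
                if isinstance(arg, ast.Name) and arg.id in tainted:
                    findings.append(f"tainted '{arg.id}' passed to {node.func.id}()")
    return findings
```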
So I built a fairly standard AST walker. Something like:
```python
import ast

def visit(node):
    if isinstance(node, ast.FunctionDef):
        check_recursion(node)
    for child in ast.iter_child_nodes(node):
        visit(child)
```
Pretty normal. Walk the tree, inspect nodes, apply rules.
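For reference, Python's stdlib already provides this dispatch via `ast.NodeVisitor`; the recursion check inlined below is a crude stand-in for `check_recursion`, not the real rule:

```python
import ast

class RecursionVisitor(ast.NodeVisitor):
    """Same walk as above, using the stdlib's visitor dispatch."""

    def __init__(self):
        self.flagged = []

    def visit_FunctionDef(self, node):
        # Crude stand-in for check_recursion: flag functions that
        # call themselves anywhere in their body.
        if any(isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
               and n.func.id == node.name for n in ast.walk(node)):
            self.flagged.append(node.name)
        self.generic_visit(node)  # keep descending into nested defs
```

Usage is just `RecursionVisitor().visit(ast.parse(code))`, after which `flagged` holds the names of self-calling functions.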
And it mostly worked.
But then I started noticing something annoying:
I kept fixing the same class of bugs over and over again.
Not in the codebase—in the submissions.
Users (including me, while testing) would:
- Write a recursive function
- Forget the base condition
- Or place it incorrectly
- Or structure traversal in a subtly broken way
And every time, the analyzer would treat it as a fresh problem.
Stateless. No memory.
That’s when I added Hindsight.
What I expected Hindsight to do
I expected Hindsight to act like a simple log:
- Store past attempts
- Maybe surface similar ones
- Help generate slightly better hints
Basically, better context.
Instead, it started behaving like pattern recognition across time.
I integrated it using the official repo.
The idea was simple: every analysis run produces an “event”.
```python
event = {
    "user_id": user_id,
    "code": submission,
    "issues": detected_issues,
    "ast_patterns": extracted_patterns,
}
store_event(event)
```
Then, before generating feedback:
```python
history = hindsight.query(user_id=user_id)
similar = find_similar_patterns(history, current_patterns)
if similar:
    adjust_feedback(similar)
```
That’s it. No magic.
But the behavior changed dramatically.
The bug that wouldn’t stay fixed
The first time I noticed something different was with a recursion check.
My rule looked something like this:
```python
def check_recursion(func_node):
    calls = find_recursive_calls(func_node)
    if calls and not has_base_case(func_node):
        report("Missing base case in recursion")
```
Clean. Straightforward.
But users kept triggering false positives or weird edge cases:
- Base case present but unreachable
- Base case after recursive call
- Nested conditions that break termination
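Telling those cases apart comes down to where the base case sits relative to the recursive call. A minimal classifier might look like this; `base_case_position` is a hypothetical helper that only inspects top-level statements, so unreachable or deeply nested base cases fall through to `"missing"`:

```python
import ast

def base_case_position(func_node):
    """Classify where the base case sits relative to the recursive call."""
    def is_recursive(stmt):
        return any(isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
                   and n.func.id == func_node.name for n in ast.walk(stmt))

    seen_recursion = False
    for stmt in func_node.body:
        if is_recursive(stmt):
            seen_recursion = True
        elif isinstance(stmt, (ast.If, ast.Return)):
            # A branch or return with no recursive call inside: a base case.
            return "after_recursive_call" if seen_recursion else "before_recursive_call"
    return "missing"
```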
Individually, each case looked different.
Stateless analysis treated them as separate.
Hindsight didn’t.
Hindsight started grouping my mistakes
Once I started storing AST-derived patterns, I began extracting simple features:
```python
patterns = {
    "has_recursion": True,
    "base_case_position": "after_recursive_call",
    "branching_depth": compute_depth(node),
}
```
Nothing fancy. Just structural hints.
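Here is a sketch of how such features could be distilled from a tree. The helper is illustrative, not the exact Codemind implementation, and it assumes one top-level function per submission:

```python
import ast

def extract_patterns(tree):
    """Distill an AST into flat, comparable structural features."""
    func = next((n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)), None)
    if func is None:
        return {"has_recursion": False}
    recursive = any(isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
                    and n.func.id == func.name for n in ast.walk(func))

    def depth(node):
        # Branching depth: how deeply If/For/While nodes nest.
        kids = [depth(c) for c in ast.iter_child_nodes(node)]
        own = 1 if isinstance(node, (ast.If, ast.For, ast.While)) else 0
        return own + (max(kids) if kids else 0)

    return {"has_recursion": recursive, "branching_depth": depth(func)}
```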
But when I queried history, I started seeing clusters:
- Same incorrect ordering
- Same missing condition structure
- Same flawed traversal shape
Even across different users.
That’s when the feedback changed from:
“Missing base case”
to:
“Your base case is placed after the recursive call, which matches a previous failing pattern.”
That’s a completely different experience.
The moment it clicked
“This node ordering looks familiar.”
That line came from a debug print I had added while comparing ASTs.
I realized Hindsight wasn’t just storing failures—it was letting me compare structure over time.
I added a crude similarity function:
```python
def is_similar(p1, p2):
    return (
        p1["has_recursion"] == p2["has_recursion"] and
        p1["base_case_position"] == p2["base_case_position"]
    )
```
And suddenly:
- Different submissions mapped to the same pattern
- Feedback became consistent
- I stopped chasing individual bugs
I was debugging classes of mistakes, not instances.
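Grouping into classes can be as simple as bucketing stored events by their feature tuple, the same comparison `is_similar` makes, just applied across the whole history. Helper and field names are hypothetical:

```python
from collections import defaultdict

def cluster_by_pattern(events):
    """Group stored events whose distilled features match exactly."""
    clusters = defaultdict(list)
    for event in events:
        p = event["patterns"]
        key = (p.get("has_recursion"), p.get("base_case_position"))
        clusters[key].append(event["user_id"])
    return dict(clusters)
```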
Where my design broke
The first version of this system had a big flaw:
I stored too much raw data.
Entire ASTs, full code blobs, verbose logs.
It made retrieval slow and noisy.
Worse, similarity comparisons became meaningless.
I refactored to store only distilled features:
```python
event = {
    "user_id": user_id,
    "patterns": extract_patterns(ast_tree),
    "issues": issues,
    "timestamp": now(),
}
```
This made two things easier:
- Fast comparison
- Clear grouping of mistakes
Tradeoff: I lost some context.
But for feedback generation, structure mattered more than raw code.
How feedback actually changed
Before Hindsight:
- Each submission analyzed in isolation
- Generic messages
- No sense of progression
After Hindsight:
- Feedback references past mistakes
- Patterns influence hint priority
- Repeated issues get stricter guidance
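The escalation itself can be a plain lookup on repeat count. Thresholds and messages here are invented for illustration; `history` is just a list of pattern keys from past events:

```python
def escalate_feedback(pattern_key, history):
    """Pick feedback severity from how often a pattern has recurred."""
    repeats = history.count(pattern_key)
    if repeats == 0:
        return "Check your recursion logic."
    if repeats == 1:
        return "You've hit this pattern before -- re-check your base case."
    return ("You've repeated a pattern where the base case runs after the "
            "recursive call. Move it before the recursive call.")
```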
Example:
Before:
“Check your recursion logic.”
After:
“You’ve repeated a pattern where the base case is evaluated after recursion. Try moving it before the recursive call.”
That’s not just better—it’s targeted.
One real scenario
A user writes:
```python
def factorial(n):
    return n * factorial(n - 1) if n > 1 else 1
```
Looks fine.
Then they modify it:
```python
def factorial(n):
    return factorial(n - 1) * n
```
No base case.
First submission:
→ flagged as missing base case
Second submission (after hint):
→ same mistake, slightly different structure
Without memory: same feedback again.
With Hindsight:
- Recognizes repeated pattern
- Escalates feedback
- Suggests exact structural fix
That escalation turned out to be key.
What surprised me most
I didn’t expect Hindsight to help me debug my own analyzer.
But it did.
By clustering mistakes, I could see:
- Where my rules were too broad
- Where they were missing edge cases
- Which patterns kept slipping through
In a way, user mistakes became test cases.
Lessons learned
1. Stateless analysis hides patterns
If you only look at one submission at a time, you miss the bigger picture. Most bugs repeat in slightly different forms.
2. Store structure, not raw data
AST features worked better than full trees. Smaller, comparable representations made everything easier.
3. Similarity doesn’t need to be fancy
Simple comparisons (position, presence, ordering) were enough to group meaningful patterns.
4. Feedback quality depends on memory
Better rules didn’t improve feedback nearly as much as remembering past mistakes did.
5. Your system will learn things you didn’t design for
I added Hindsight to improve hints. It ended up exposing flaws in my analyzer and changing how I debug.
What I’d do differently
If I rebuilt this:
- I’d design pattern extraction first, not last
- I’d define similarity metrics upfront
- I’d treat memory as a core system, not an add-on
Right now, Hindsight sits alongside the analyzer.
It probably belongs inside it.
Closing
I went in thinking I was building a better static analyzer.
What I ended up building was a system that remembers how people fail—and uses that to guide them forward.
And somewhere along the way, it started catching my own repeated AST traversal bugs before I even noticed them.