DEV Community

Alekhya Bonamukkala

Hindsight caught repeated AST traversal bugs

“Why is it flagging recursion again?” I checked the logs, and the agent wasn’t looping—it was using Hindsight to call out the same AST traversal mistake I’d made three commits ago.

Last night I assumed my AST walker was fixed; by morning, Hindsight had surfaced the exact same traversal bug across three different submissions I thought were unrelated.


What I actually built

I’ve been working on Codemind, a coding practice platform that doesn’t just run user code—it tries to understand how someone is solving a problem and guide them while they’re doing it.

At a high level, the system looks like this:

  • A frontend where users write and submit code
  • A Python backend that parses and analyzes submissions
  • A sandboxed execution layer (Firecracker-style isolation) to safely run code
  • An analysis pipeline that walks ASTs and applies rules + taint tracking
  • A memory layer powered by Hindsight that stores patterns across attempts

The interesting part isn’t execution—it’s the analysis loop:

  1. Parse code into an AST
  2. Traverse it to detect patterns (recursion misuse, unsafe flows, etc.)
  3. Generate feedback
  4. Store what happened
  5. Use that history to influence future feedback
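The loop above can be sketched end to end. This is a minimal, self-contained version: `detect_issues` here is a placeholder rule (flag any function that calls itself), not the real analyzer.

```python
import ast

def detect_issues(tree):
    # Placeholder rule: flag any function that contains a call to itself.
    issues = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            calls = {c.func.id for c in ast.walk(node)
                     if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)}
            if node.name in calls:
                issues.append(f"recursion in {node.name}")
    return issues

def analyze(source, history):
    """One pass of the analysis loop."""
    tree = ast.parse(source)           # 1. parse code into an AST
    issues = detect_issues(tree)       # 2. traverse it to detect patterns
    feedback = issues or ["looks ok"]  # 3. generate feedback
    history.append(issues)             # 4. store what happened
    return feedback                    # 5. history shapes future runs
```

Each call both reads and extends `history`, which is exactly the feedback-over-time loop the rest of this post is about.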

That last step is where things got weird—in a good way.


The problem I thought I had

Originally, I thought my biggest challenge would be writing good static analysis rules.

Things like:

  • Detect recursion without base case
  • Identify unsafe variable flows (taint analysis)
  • Catch incorrect loop termination

So I built a fairly standard AST walker. Something like:

import ast

def visit(node):
    # Apply rules to function definitions, then recurse into children.
    if isinstance(node, ast.FunctionDef):
        check_recursion(node)

    for child in ast.iter_child_nodes(node):
        visit(child)

Pretty normal. Walk the tree, inspect nodes, apply rules.
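Made self-contained, with a stub `check_recursion` standing in for the real rule, the walker runs like this:

```python
import ast

found = []

def check_recursion(node):
    # Stub for illustration: just record which functions were inspected.
    found.append(node.name)

def visit(node):
    if isinstance(node, ast.FunctionDef):
        check_recursion(node)
    for child in ast.iter_child_nodes(node):
        visit(child)

tree = ast.parse("def fib(n):\n    return fib(n-1) + fib(n-2)\n")
visit(tree)
# found is now ["fib"]
```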

And it mostly worked.

But then I started noticing something annoying:
I kept fixing the same class of bugs over and over again.

Not in the codebase—in the submissions.

Users (including me, while testing) would:

  • Write a recursive function
  • Forget the base condition
  • Or place it incorrectly
  • Or structure traversal in a subtly broken way

And every time, the analyzer would treat it as a fresh problem.

Stateless. No memory.

That’s when I added Hindsight.


What I expected Hindsight to do

I expected Hindsight to act like a simple log:

  • Store past attempts
  • Maybe surface similar ones
  • Help generate slightly better hints

Basically, better context.

Instead, it started behaving like pattern recognition across time.

I integrated it using the official repo.
The idea was simple: every analysis run produces an “event”.

event = {
    "user_id": user_id,
    "code": submission,
    "issues": detected_issues,
    "ast_patterns": extracted_patterns,
}
store_event(event)

Then, before generating feedback:

history = hindsight.query(user_id=user_id)

similar = find_similar_patterns(history, current_patterns)

if similar:
    adjust_feedback(similar)

That’s it. No magic.
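`find_similar_patterns` can be as simple as filtering history on a few structural keys. This is a sketch that assumes each stored event carries a `"patterns"` dict like the one described later in the post:

```python
def find_similar_patterns(history, current):
    """Return past events whose structural features match the current ones."""
    keys = ("has_recursion", "base_case_position")
    return [event for event in history
            if all(event["patterns"].get(k) == current.get(k) for k in keys)]

# Usage: two past events, one matching the current submission's shape.
history = [
    {"patterns": {"has_recursion": True,
                  "base_case_position": "after_recursive_call"}},
    {"patterns": {"has_recursion": False,
                  "base_case_position": None}},
]
current = {"has_recursion": True, "base_case_position": "after_recursive_call"}
similar = find_similar_patterns(history, current)  # [history[0]]
```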

But the behavior changed dramatically.


The bug that wouldn’t stay fixed

The first time I noticed something different was with a recursion check.

My rule looked something like this:

def check_recursion(func_node):
    calls = find_recursive_calls(func_node)

    if calls and not has_base_case(func_node):
        report("Missing base case in recursion")

Clean. Straightforward.

But users kept triggering false positives or weird edge cases:

  • Base case present but unreachable
  • Base case after recursive call
  • Nested conditions that break termination

Individually, each case looked different.

Stateless analysis treated them as separate.

Hindsight didn’t.
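One way to make "base case after recursive call" detectable at all is to compare source positions. This is a rough sketch, not the analyzer's actual rule: it approximates a base case as an `if` containing a `return`, and compares line numbers against the first recursive call.

```python
import ast

def base_case_position(func_node):
    """Roughly classify where the base case sits relative to recursion.

    Approximation: a 'base case' is any If that contains a Return.
    """
    base_lines = [n.lineno for n in ast.walk(func_node)
                  if isinstance(n, ast.If)
                  and any(isinstance(m, ast.Return) for m in ast.walk(n))]
    recur_lines = [n.lineno for n in ast.walk(func_node)
                   if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
                   and n.func.id == func_node.name]
    if not base_lines:
        return "missing"
    if not recur_lines:
        return "no_recursion"
    return ("before_recursive_call" if min(base_lines) < min(recur_lines)
            else "after_recursive_call")
```

It is deliberately crude, but it turns "each case looked different" into a small set of comparable labels.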


Hindsight started grouping my mistakes

Once I started storing AST-derived patterns, I began extracting simple features:

patterns = {
    "has_recursion": True,
    "base_case_position": "after_recursive_call",
    "branching_depth": compute_depth(node),
}

Nothing fancy. Just structural hints.
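`compute_depth` in the snippet above isn't shown; one plausible version, assumed here rather than taken from the real codebase, counts the deepest nesting of branching constructs:

```python
import ast

def compute_depth(node):
    """Maximum nesting depth of branching/looping constructs under node."""
    branches = (ast.If, ast.For, ast.While, ast.Try)

    def depth(n):
        child_max = max((depth(c) for c in ast.iter_child_nodes(n)), default=0)
        return child_max + (1 if isinstance(n, branches) else 0)

    return depth(node)

tree = ast.parse("if a:\n    if b:\n        pass\n")
# compute_depth(tree) == 2
```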

But when I queried history, I started seeing clusters:

  • Same incorrect ordering
  • Same missing condition structure
  • Same flawed traversal shape

Even across different users.

That’s when the feedback changed from:

“Missing base case”

to:

“Your base case is placed after the recursive call, which matches a previous failing pattern.”

That’s a completely different experience.


The moment it clicked

“This node ordering looks familiar.”

That line came from a debug print I had added while comparing ASTs.

I realized Hindsight wasn’t just storing failures—it was letting me compare structure over time.

I added a crude similarity function:

def is_similar(p1, p2):
    return (
        p1["has_recursion"] == p2["has_recursion"] and
        p1["base_case_position"] == p2["base_case_position"]
    )

And suddenly:

  • Different submissions mapped to the same pattern
  • Feedback became consistent
  • I stopped chasing individual bugs

I was debugging classes of mistakes, not instances.
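Debugging classes of mistakes amounts to bucketing stored events by their structural signature. A minimal sketch, assuming events shaped like the ones above:

```python
from collections import defaultdict

def group_by_pattern(events):
    """Bucket events by structural signature, so repeated mistakes
    surface as one class rather than many instances."""
    groups = defaultdict(list)
    for event in events:
        p = event["patterns"]
        key = (p.get("has_recursion"), p.get("base_case_position"))
        groups[key].append(event)
    return groups

events = [
    {"patterns": {"has_recursion": True,
                  "base_case_position": "after_recursive_call"}, "user": "a"},
    {"patterns": {"has_recursion": True,
                  "base_case_position": "after_recursive_call"}, "user": "b"},
    {"patterns": {"has_recursion": False,
                  "base_case_position": None}, "user": "c"},
]
groups = group_by_pattern(events)
# Two users share one failing class; the third is unrelated.
```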


Where my design broke

The first version of this system had a big flaw:

I stored too much raw data.

Entire ASTs, full code blobs, verbose logs.

It made retrieval slow and noisy.

Worse, similarity comparisons became meaningless.

I refactored to store only distilled features:

event = {
    "user_id": user_id,
    "patterns": extract_patterns(ast_tree),
    "issues": issues,
    "timestamp": now(),
}

This made two things easier:

  1. Fast comparison
  2. Clear grouping of mistakes

Tradeoff: I lost some context.

But for feedback generation, structure mattered more than raw code.


How feedback actually changed

Before Hindsight:

  • Each submission analyzed in isolation
  • Generic messages
  • No sense of progression

After Hindsight:

  • Feedback references past mistakes
  • Patterns influence hint priority
  • Repeated issues get stricter guidance

Example:

Before:

“Check your recursion logic.”

After:

“You’ve repeated a pattern where the base case is evaluated after recursion. Try moving it before the recursive call.”

That’s not just better—it’s targeted.


One real scenario

A user writes:

def factorial(n):
    return n * factorial(n-1) if n > 1 else 1

Looks fine.

Then they modify it:

def factorial(n):
    return factorial(n-1) * n

No base case.

First submission:
→ flagged as missing base case

Second submission (after hint):
→ same mistake, slightly different structure

Without memory: same feedback again.

With Hindsight:

  • Recognizes repeated pattern
  • Escalates feedback
  • Suggests exact structural fix

That escalation turned out to be key.
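Escalation can be driven by a simple repeat count against stored history. The thresholds and messages below are illustrative, not the product's real copy:

```python
def escalate_feedback(history, current_key):
    """Pick feedback severity from how often this pattern has repeated."""
    repeats = sum(1 for event in history if event["key"] == current_key)
    if repeats == 0:
        return "hint: check your recursion logic"
    if repeats == 1:
        return "warning: this looks like the same base-case mistake as last time"
    return ("fix: your base case runs after the recursive call; "
            "move it before the call so the recursion can terminate")

history = [{"key": "base_after_call"}, {"key": "base_after_call"}]
# First offense -> hint, second -> warning, third -> concrete fix.
```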


What surprised me most

I didn’t expect Hindsight to help me debug my own analyzer.

But it did.

By clustering mistakes, I could see:

  • Where my rules were too broad
  • Where they were missing edge cases
  • Which patterns kept slipping through

In a way, user mistakes became test cases.


Lessons learned

1. Stateless analysis hides patterns
If you only look at one submission at a time, you miss the bigger picture. Most bugs repeat in slightly different forms.

2. Store structure, not raw data
AST features worked better than full trees. Smaller, comparable representations made everything easier.

3. Similarity doesn’t need to be fancy
Simple comparisons (position, presence, ordering) were enough to group meaningful patterns.

4. Feedback quality depends on memory
Better rules didn’t improve feedback nearly as much as remembering past mistakes did.

5. Your system will learn things you didn’t design for
I added Hindsight to improve hints. It ended up exposing flaws in my analyzer and changing how I debug.


What I’d do differently

If I rebuilt this:

  • I’d design pattern extraction first, not last
  • I’d define similarity metrics upfront
  • I’d treat memory as a core system, not an add-on

Right now, Hindsight sits alongside the analyzer.

It probably belongs inside it.


Closing

I went in thinking I was building a better static analyzer.

What I ended up building was a system that remembers how people fail—and uses that to guide them forward.

And somewhere along the way, it started catching my own repeated AST traversal bugs before I even noticed them.
