By: Devika N D - Code Execution & Behavioral Signal Module
Hindsight Hackathon - Team 1/0 coders
*“The moment I submitted an infinite loop and watched the server hang forever — I realized the execution engine is the most dangerous part of the whole system.”*
Nobody told me that when you let users run arbitrary Python code on your server, you are one `while True: pass` away from a dead process. I found out the hard way.
My job in this project was building the code execution engine, problem store, behavioral signal tracker, and cognitive pattern analyzer.
This is the story of what I built, what broke, and what I'd do differently.
## What We Built
The AI Coding Mentor is a system where students submit Python solutions to coding problems, get evaluated against real test cases, and receive personalized hints based on how they actually behave — not what they tell us about themselves.
The stack:
- **FastAPI** backend for all routing and execution
- **Groq (LLaMA 3.3 70B)** for AI-generated feedback and problems
- **Hindsight** for persistent behavioral memory across sessions
- **React** frontend with live code editor
My modules owned the entire middle of this pipeline — from the moment a user hits Submit to the moment a cognitive pattern label gets stored in memory.
The system doesn't ask users how they learn. It watches them. Every keystroke count, every second spent staring at the problem, every failed test case — all of it feeds into a behavioral profile that gets smarter with each session.
## The Execution Engine — Harder Than It Looks
The core function is run_user_code() in execution_service.py. It takes user code as a raw string, compiles it, runs it against test cases, and returns a structured result. Simple enough on paper.
```python
def run_user_code(user_code: str, function_name: str, test_cases: list[dict]) -> dict:
    start_time = time.time()
    total = len(test_cases)
    namespace = {}
    try:
        exec(compile(user_code, "<user_code>", "exec"), namespace)
    except SyntaxError as e:
        return _error_result(f"Syntax error: {e}", total, start_time)
    except Exception as e:
        return _error_result(f"Runtime error on load: {e}", total, start_time)
    if function_name not in namespace:
        return _error_result(
            f"Function '{function_name}' not found.",
            total, start_time
        )
```
The first version worked perfectly for normal code. Then I tested it with an infinite loop. The server froze. No response, no error, no timeout. Just a dead process hanging indefinitely while uvicorn stopped serving everything else.
On Linux you can use signal.SIGALRM to time out a function. On Windows — which is what I’m running — that doesn’t exist. So I used threading instead:
```python
def _run_with_timeout(fn, kwargs, timeout_sec=5):
    result = {"value": None, "error": None}

    def target():
        try:
            result["value"] = fn(**kwargs)
        except Exception as e:
            result["error"] = str(e)

    t = threading.Thread(target=target)
    t.start()
    t.join(timeout=timeout_sec)
    if t.is_alive():
        result["error"] = f"Time limit exceeded ({timeout_sec}s)"
    return result
```
Important caveat: on Windows, threading timeout doesn’t actually kill the thread — it just stops waiting. The thread keeps running in the background. This means an infinite loop will still consume CPU even after the timeout fires. For a hackathon demo, this is good enough. For production you’d want a subprocess-based sandbox.
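For reference, here's a minimal sketch of what that subprocess-based version could look like. This is not the project's actual code; it assumes the called function returns a JSON-serializable value and relies on `subprocess.run` killing the child when the timeout expires:

```python
import json
import subprocess
import sys

def run_sandboxed(user_code: str, call_expr: str, timeout_sec: int = 5) -> dict:
    """Run user code in a child process so a timeout can actually kill it."""
    # Append a line that prints the call's result as JSON so the parent
    # can read it back from stdout.
    script = f"{user_code}\nimport json\nprint(json.dumps({call_expr}))"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True, text=True, timeout=timeout_sec,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so the
        # infinite loop does not linger in the background.
        return {"value": None, "error": f"Time limit exceeded ({timeout_sec}s)"}
    if proc.returncode != 0:
        stderr = proc.stderr.strip()
        last_line = stderr.splitlines()[-1] if stderr else "unknown error"
        return {"value": None, "error": last_line}
    return {"value": json.loads(proc.stdout), "error": None}
```

Spawning an interpreter per submission is slower than a thread, which is part of why the threading shortcut is defensible for a demo.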
## The Problem Store and AI Generator
I started with a hardcoded problems.json — 9 problems covering arrays, strings, loops, and recursion. Each problem has a function_name, test_cases with exact inputs and expected outputs, and starter_code.
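Based on the fields named above, one entry in problems.json presumably looks something like this (the concrete problem, ID, and values here are illustrative, not taken from the repo):

```python
# Illustrative shape of a single problems.json entry; field names come from
# the post, the example problem itself is invented.
example_problem = {
    "id": "p001",
    "function_name": "reverse_string",
    "starter_code": "def reverse_string(s):\n    pass",
    "test_cases": [
        {"input": {"s": "abc"}, "expected": "cba"},
        {"input": {"s": ""}, "expected": ""},  # edge case: empty string
    ],
}
```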
Then I realized 9 problems is not enough for an adaptive system. I added an AI problem generator using Groq:
```python
def generate_problem(topic: str, difficulty: str) -> dict:
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    raw = response.choices[0].message.content
    # Strip the markdown fences the model sometimes wraps around JSON
    raw = re.sub(r"```json|```", "", raw).strip()
    problem = json.loads(raw)
    problem["id"] = "gen_" + str(uuid.uuid4())[:8]
    _generated_cache[problem["id"]] = problem
    return problem
```
The endpoint is GET /problems/generate/{topic}/{difficulty}. Hit it with recursion/medium and you get a fresh problem with 3 test cases, a function signature, and starter code — all ready to run through the execution engine immediately.
The trickiest part was prompt engineering. My first attempt generated Tower of Hanoi — a 4-parameter function — but test cases only passed n. Every test case failed with "missing 3 required positional arguments." The fix: be explicit in the prompt that functions must have 1 or 2 parameters maximum and input keys must match exactly.
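The post doesn't include the final prompt, but the constraint it describes might be phrased along these lines (hypothetical wording, not the project's actual prompt):

```python
# Hypothetical excerpt of the generation prompt. The real prompt is not shown
# in the post; this just makes the parameter constraint explicit.
PROMPT_CONSTRAINTS = """
Return ONLY a JSON object with keys: title, description, function_name,
starter_code, test_cases.
Rules:
- The function must take 1 or 2 parameters maximum.
- Every test case's "input" object must use exactly the parameter names
  from the function signature, and nothing else.
"""
```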
Generated problems get saved to problems.json permanently — so the dataset grows every time someone generates one. No manual JSON writing needed.
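A minimal sketch of that persistence step, assuming problems.json is a flat JSON list (the helper name is mine):

```python
import json
from pathlib import Path

def save_generated_problem(problem: dict, path: str = "problems.json") -> None:
    # Append a generated problem to the on-disk dataset, skipping duplicates
    # so re-generating the same ID doesn't bloat the file.
    p = Path(path)
    problems = json.loads(p.read_text()) if p.exists() else []
    if not any(existing["id"] == problem["id"] for existing in problems):
        problems.append(problem)
    p.write_text(json.dumps(problems, indent=2))
```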
This started as a workaround for a small dataset and ended up being one of the most useful features in the whole project — infinite problems, fully wired into the same execution pipeline as the static ones.
## Capturing Behavioral Signals
Beyond just running code, I needed to understand how users were behaving while they solved problems.
Every time a user submits code, signal_tracker.py captures the raw behavioral data:
```python
def capture_signals(submission: CodeSubmission, result: EvalResult) -> dict:
    return {
        "user_id": submission.user_id,
        "attempt_number": submission.attempt_number,
        "time_taken_sec": submission.time_taken,
        "code_edit_count": submission.code_edit_count,
        "all_passed": result.all_passed,
        "error_types": classify_errors(result.error_types),
        "failed_cases": [
            ec for ec in result.edge_case_results
            if not ec.get("passed")
        ]
    }
```
These signals feed into cognitive_analyzer.py — five patterns: overthinking, guessing, rushing, concept_gap, and boundary_weakness. Each returns a confidence score between 0 and 1.
Here’s the rushing detector — the one that catches users who submit without reading:
```python
def _check_rushing(signals: dict) -> list:
    score = 0.0
    if signals["time_taken_sec"] < 15:
        score += 0.4  # submitted too fast
    if "syntax_error" in signals["error_types"]:
        score += 0.4  # didn't even read the code
    if signals["code_edit_count"] <= 2:
        score += 0.2  # barely touched the editor
    if score >= 0.4:
        return [{"pattern": "rushing", "confidence": round(score, 2)}]
    return []
```
Under 15 seconds, a syntax error, barely any edits — that’s a user who copy-pasted something without reading the problem. The signals stack into a high confidence score, it gets stored in memory, and the hint tone adapts accordingly.
## Wiring the Full Pipeline
The submit_code route is where all five modules connect in sequence:
```python
@router.post("/submit_code", response_model=EvalResult)
def submit_code(submission: CodeSubmission):
    # 1. Load problem
    problem = get_problem_by_id(submission.problem_id)

    # 2. Run the code
    result_dict = run_user_code(
        user_code=submission.code,
        function_name=problem["function_name"],
        test_cases=problem["test_cases"]
    )
    result = EvalResult(**result_dict)

    # 3. Capture behavioral signals
    signals = capture_signals(submission, result)

    # 4. Detect cognitive patterns
    patterns = analyze_patterns(signals)

    # 5. Store into Hindsight memory
    store_session(user_id=submission.user_id, session_data={
        "patterns": patterns["patterns"],
        "dominant_pattern": patterns["dominant_pattern"],
        "dominant_confidence": patterns["dominant_confidence"],
        "time_taken_seconds": signals["time_taken_sec"],
        "solved": result.all_passed,
    })
    return result
```
One request, five things happen: code runs, signals captured, patterns detected, memory stored, result returned. The judge can submit code and immediately call GET /memory/recall/{user_id} to see the pattern stored in real time.
## The Route Order Bug That Would Have Killed Our Demo
FastAPI registers routes in declaration order. I had this:
```python
# WRONG — dynamic route declared first
@router.get("/problems/{problem_id}")             # swallows everything under /problems/
@router.get("/problems/difficulty/{difficulty}")  # never reached

# CORRECT — static prefixes before the dynamic catch-all
@router.get("/problems/difficulty/{difficulty}")  # matched first
@router.get("/problems/topic/{topic}")            # matched first
@router.get("/problems/{problem_id}")             # dynamic last
```
When a judge hit GET /problems/difficulty/easy, FastAPI tried to find a problem with ID "difficulty" and returned 404. Found this during final integration testing — not during development when I was only testing /get_problem/p001 directly.
## What I Learned
- **Windows threading timeout doesn’t kill threads.** It stops waiting, but the thread lives on. Design for subprocess isolation if you’re running untrusted code in production.
- **Prompt engineering for structured output is iteration.** My first AI problem generator produced functions whose test cases didn’t match the signature. Being extremely explicit about parameter constraints in the prompt fixed it.
- **Route order in FastAPI is load-bearing.** A dynamic route swallows every matching path declared after it. Always put static routes first.
- **Owning the full pipeline from execution to signals forced cleaner interfaces.** Because I controlled both ends of the data contract, field name mismatches between modules were caught immediately — not during final integration, when they’re painful to fix.
- **Behavioral signals are more honest than user input.** No one types “I tend to rush” into a profile form. But submit in 8 seconds with a syntax error twice in a row and the system knows exactly what’s happening.
## Resources & Links

- Hindsight GitHub: https://github.com/vectorize-io/hindsight
- Hindsight Docs: https://hindsight.vectorize.io/
- Agent Memory: https://vectorize.io/features/agent-memory