I Didn’t Need a Smarter Tutor. I Needed One That Could Remember Why You Failed Last Time.
Most coding tutors are stateless in the exact place they shouldn’t be. They can tell you that your current submission is wrong, but they forget that you made the same boundary mistake on the previous two attempts.
That was the problem I ended up solving in this project. The repository I started from is a small FastAPI backend, but the interesting part sits just behind it: a lightweight analysis engine that turns raw code submissions into repeated-pattern detection, mentor-style suggestions, and a tiny hindsight memory layer. The whole thing is much simpler than the average “agent” demo, and that’s why I like it. You can read it in one sitting and understand where it helps, where it cheats, and where it will break.
What the system actually does
At the API layer, this project is straightforward. backend/app/main.py wires up four routes:
- POST /run-code executes Python, JavaScript, Java, or C++ in a temporary directory
- POST /submit-code sends a judged submission plus user history into the analysis engine
- GET /get-history/{user_id} returns stored submissions
- GET /get-suggestions/{user_id} returns the latest suggestions and learning path
The backend-side flow is almost comically thin:
@router.post("/submit-code")
def submit_code(submission: SubmissionRequest):
    history = get_user_history(submission.user_id)
    ai_response = run_code_analysis(submission, history)
    stored_submission = submission.model_dump()
    if isinstance(ai_response, dict):
        stored_submission.update(ai_response)
    save_submission(submission.user_id, stored_submission)
    return ai_response
That history = get_user_history(...) line is the entire reason the system is interesting. Without it, this is just another one-shot evaluator. With it, the engine can stop treating mistakes as isolated events and start treating them as a behavior.
The architecture is basically this:
flowchart LR
A["/run-code"] --> B["temporary execution sandbox"]
C["/submit-code"] --> D["per-user history store"]
D --> E["ai_engine.process_submission"]
E --> F["mistake classifier"]
E --> G["pattern detector"]
E --> H["hindsight memory"]
G --> I["suggestions + learning path"]
H --> J["current insight + past similar insights"]
The code runner in backend/app/routes/run_code.py is practical rather than fancy. It writes the source into a temp folder, shells out to the language runtime or compiler, and kills anything that runs longer than three seconds. It’s exactly the kind of thing I’d build first for an internal product: enough isolation to be useful, nowhere near enough isolation to call “secure.”
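To make that shape concrete, here is a hedged sketch of the Python path of that runner. The function name, the return shape, and the use of sys.executable are my own choices for illustration, not the repo's actual code:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_python_snippet(source: str, timeout_seconds: float = 3.0) -> dict:
    """Write source to a temp dir, run it, and enforce a hard timeout."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "main.py"
        script.write_text(source)
        try:
            completed = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_seconds,  # kill anything that runs too long
            )
        except subprocess.TimeoutExpired:
            return {"result": "timeout", "stdout": "", "stderr": ""}
    return {
        "result": "ok" if completed.returncode == 0 else "runtime_error",
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }
```

A temp directory plus a timeout gives you cleanup and runaway-loop protection, but nothing stops the submitted code from reading files or opening sockets, which is exactly the "not secure" caveat above.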
The story here is really about memory
The most opinionated design choice in this codebase is that feedback isn’t generated directly from the latest submission. It’s generated from the latest submission plus accumulated history, and then turned into a hindsight-style insight object.
The engine orchestration in ../Lynt/ai_engine/engine.py makes that clear:
def process_submission(submission, history):
    normalized_submission = normalize_submission(submission)
    normalized_history = _normalize_history(history)
    classified_submission = _classify_submission(normalized_submission)

    submissions = normalized_history + [classified_submission]
    patterns = detect_patterns(submissions)
    suggestions = generate_suggestions(patterns)
    learning_path = generate_learning_path(patterns)

    insight_memory = list(_INSIGHT_MEMORY_CACHE)
    new_history_submissions = _get_new_history_submissions(
        normalized_history,
        insight_memory,
    )
    insight_memory = _build_insight_memory(insight_memory, new_history_submissions)

    current_insight = _resolve_submission_insight(classified_submission)
    past_similar_insights = get_relevant_insights(classified_submission, insight_memory)
That’s the whole pipeline:
- Normalize the new event.
- Classify the mistake.
- Fold it into historical pattern detection.
- Resolve a current “insight.”
- Retrieve related past insights from memory.
I like this shape because it keeps the judgment steps separate. mistake_classifier.py handles coarse labels like off_by_one and logic_error. pattern_detector.py promotes repeated (topic, mistake) pairs into weak areas after three occurrences. suggestion_generator.py turns those weak areas into advice. And hindsight_memory.py stores the reusable lesson.
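The promotion step is worth seeing in miniature. This is my own sketch of the three-occurrence rule described above, not the actual body of pattern_detector.py, and the field names are assumptions:

```python
from collections import Counter

WEAK_AREA_THRESHOLD = 3  # the repo promotes after three occurrences

def detect_patterns(submissions: list[dict]) -> list[dict]:
    """Count (topic, mistake_type) pairs and promote repeats to weak areas."""
    counts = Counter(
        (s.get("topic"), s.get("mistake_type"))
        for s in submissions
        if s.get("topic") and s.get("mistake_type")
    )
    return [
        {"topic": topic, "mistake": mistake, "count": count}
        for (topic, mistake), count in counts.items()
        if count >= WEAK_AREA_THRESHOLD  # ignore one-off failures
    ]
```

One failure is noise; three identical failures are a behavior. That threshold is what keeps the suggestions from overreacting.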
That last part is the piece I kept coming back to. If a learner keeps getting array bounds wrong, I don’t want the system to just say “wrong answer” again. I want it to remember the kind of failure and say something like: you keep missing boundary conditions in arrays, so start there.
That’s exactly what the hindsight module does:
INSIGHT_RULES = {
    "off_by_one": {
        "message_template": "You often miss boundary conditions in {topic}.",
        "fix_suggestion": "Review edge cases and check index boundaries carefully.",
    },
    "logic_error": {
        "message_template": "You tend to make logic mistakes in {topic}.",
        "fix_suggestion": "Walk through sample inputs step by step to verify the logic.",
    },
}
And then it stores only the newest version of an insight for a given topic and mistake pair:
def store_insight(insight, memory_list):
    for index, existing_insight in enumerate(memory_list):
        if (
            existing_insight.get("mistake_type") == insight.get("mistake_type")
            and existing_insight.get("topic") == insight.get("topic")
        ):
            if _get_recency_sort_key(insight) >= _get_recency_sort_key(existing_insight):
                memory_list[index] = insight
            return memory_list
    memory_list.append(insight)
    return memory_list
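The dedup behavior is easy to demonstrate. In this self-contained sketch I assume _get_recency_sort_key reads a timestamp field, which the snippet above does not confirm:

```python
def _get_recency_sort_key(insight: dict):
    # Assumption: recency is a numeric "timestamp" field on the insight
    return insight.get("timestamp", 0)

def store_insight(insight, memory_list):
    for index, existing_insight in enumerate(memory_list):
        if (
            existing_insight.get("mistake_type") == insight.get("mistake_type")
            and existing_insight.get("topic") == insight.get("topic")
        ):
            if _get_recency_sort_key(insight) >= _get_recency_sort_key(existing_insight):
                memory_list[index] = insight
            return memory_list
    memory_list.append(insight)
    return memory_list

memory: list[dict] = []
store_insight({"mistake_type": "off_by_one", "topic": "arrays", "timestamp": 1}, memory)
store_insight({"mistake_type": "off_by_one", "topic": "arrays", "timestamp": 2}, memory)
# memory still holds exactly one (arrays, off_by_one) entry, now the newer one
```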
That’s a very specific tradeoff. I’m not building a full event log here. I’m building a compressed memory of “what lesson should still matter.”
If you’ve looked at the open source Hindsight memory framework on GitHub, the Hindsight documentation for agent memory design, or the broader Vectorize agent memory architecture overview, the core idea will feel familiar: don’t just store chat history or raw events, store distilled lessons that can change future behavior. This repo doesn’t use the full external package, but ai_engine/hindsight/hindsight_memory.py is very clearly built in that direction.
What I thought would work, and what actually mattered
The naive version of this system is obvious: take the judge result, map it to a mistake type, and return some canned advice. In fact, part of this repo still looks like that on purpose.
mistake_classifier.py starts with a tiny mapping:
RESULT_TO_MISTAKE_TYPE = {
    "runtime_error": "syntax_error",
    "timeout": "inefficient_code",
    "wrong_answer": "logic_error",
}

def classify_mistake(result, error_message):
    normalized_error_message = str(error_message or "").lower()
    if "index" in normalized_error_message:
        return "off_by_one"
    return RESULT_TO_MISTAKE_TYPE.get((result or "").strip().lower(), "unknown")
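The one subtlety is that the error-message check runs before the result mapping, so an index-related stderr wins even on a wrong_answer verdict. A quick self-contained check (classifier copied from above):

```python
RESULT_TO_MISTAKE_TYPE = {
    "runtime_error": "syntax_error",
    "timeout": "inefficient_code",
    "wrong_answer": "logic_error",
}

def classify_mistake(result, error_message):
    normalized_error_message = str(error_message or "").lower()
    if "index" in normalized_error_message:
        return "off_by_one"
    return RESULT_TO_MISTAKE_TYPE.get((result or "").strip().lower(), "unknown")

# The stderr override beats the coarse verdict mapping
assert classify_mistake("wrong_answer", "IndexError: list index out of range") == "off_by_one"
assert classify_mistake("wrong_answer", None) == "logic_error"
assert classify_mistake("something_new", None) == "unknown"
```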
On paper, that looks almost too dumb to be useful. In practice, it’s a decent first pass because most learning products don’t need perfect classification to be helpful. They need stable categories that line up with actionable advice.
Where this got more interesting is in the language-specific branches. Java and C++ have custom analysis paths in engine.py, because generic runtime_error was too lossy. A Java NullPointerException and a missing main signature are both “runtime-ish” problems, but they teach very different lessons. Same for C++: “segmentation fault” deserves different feedback than “missing include.”
That’s why the engine now branches like this:
- Java gets checks for NullPointerException, ArrayIndexOutOfBoundsException, a missing main, and syntax heuristics
- C++ gets checks for segmentation faults, out-of-range access, missing headers, and syntax heuristics
- Everything else falls back to the generic classifier
That’s a good example of a design that got more opinionated over time instead of more abstract. I didn’t need a grand unified analysis model. I needed a few language-specific escape hatches where the generic labels were clearly not good enough.
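To show what an escape hatch like that looks like, here is my own sketch of a Java branch. The exact strings matched and the labels returned are assumptions; engine.py's real heuristics will differ:

```python
def classify_java_error(stderr: str) -> str:
    """Map Java stderr text to a teaching-oriented mistake label."""
    message = stderr or ""
    if "NullPointerException" in message:
        return "null_reference"
    if "ArrayIndexOutOfBoundsException" in message:
        return "off_by_one"
    if "Main method not found" in message:
        return "missing_main"
    if "error:" in message:  # javac-style compile diagnostics
        return "syntax_error"
    return "unknown"
```

The value is not in the string matching; it is that null_reference and missing_main lead to different advice, where a generic runtime_error would have collapsed them into one.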
Before and after: when memory actually changes the output
The included test script in ../Lynt/ai_engine/test_ai_engine.py is useful because it shows the system behaving differently once history accumulates.
With a sample history containing repeated array mistakes, a new wrong-answer submission with an index-related error produces this:
mistake_type: off_by_one
patterns: [{'topic': 'arrays', 'mistake': 'off_by_one', 'count': 3}]
suggestions: ['You frequently make boundary errors in arrays. Focus on edge cases and double-check your start and end positions.']
learning_path: ['arrays']
insight: {
    'mistake_type': 'off_by_one',
    'topic': 'arrays',
    'insight_message': 'You often miss boundary conditions in arrays.',
    'fix_suggestion': 'Review edge cases and check index boundaries carefully.'
}
That’s the behavioral jump I cared about.
Before memory, the system can tell me “this attempt looks like an off-by-one.”
After memory, it can tell me “you keep making boundary mistakes in arrays, and that pattern is now strong enough that I’m going to prioritize arrays in your learning path.”
That’s a much more credible tutoring loop.
I also verified one practical edge case that the repo doesn’t hide very well: the hindsight cache is global process state.
The backend history store is per-user:
user_submissions: dict[str, list] = {}

def get_user_history(user_id):
    return user_submissions.get(user_id, [])

def save_submission(user_id, submission):
    history = user_submissions.setdefault(user_id, [])
    history.append(submission)
But the engine memory is not. engine.py keeps these at module scope:
_INSIGHT_MEMORY_CACHE: List[Dict[str, Any]] = []
_PROCESSED_HISTORY_KEYS: Set[SubmissionKey] = set()
That means if user A creates an off_by_one insight for arrays, user B can see a “past similar insight” even with an empty personal history, as long as they trigger the same topic and mistake shape. I reproduced exactly that.
For a demo, this is fine. For a real tutoring system, it’s a bug.
And honestly, that bug tells the real story of this codebase better than any polished diagram could. Building “memory” is easy if you mean “append some stuff to a list.” Building memory that is scoped correctly, persisted, deduplicated, queryable, and safe across users is where the real engineering starts.
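The smallest fix for that leakage is to key the cache the same way the history store is keyed. This is a hypothetical sketch, not code from the repo, which keeps a single module-level list:

```python
from collections import defaultdict
from typing import Any

# Scope the insight cache per user, mirroring the user_submissions dict
_INSIGHT_MEMORY_BY_USER: dict[str, list[dict[str, Any]]] = defaultdict(list)

def get_insight_memory(user_id: str) -> list[dict[str, Any]]:
    """Each user sees only their own distilled lessons."""
    return _INSIGHT_MEMORY_BY_USER[user_id]
```

It is a one-line change conceptually, but it still leaves restart amnesia, which is the persistence problem the next section gets into.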
The part I’d change first
There’s even a comment in engine.py pointing in the right direction:
In a production system, this local filtering would be replaced by a database query against a persistent memory store such as MongoDB or a vector database.
That is the right next step. The current backend has a placeholder backend/app/config/db.py for future persistence, but right now both layers are demo-grade:
- user history disappears on process restart
- hindsight memory is process-wide, not per user
- retrieval is simple filtering by topic or mistake type
- there’s no notion of privacy, tenancy, or long-term memory aging
If I were taking this past demo territory, I’d keep the current shape of the engine and replace the storage model underneath it. The distilled insight object is the part worth preserving. The in-memory container is not.
That’s also where a production-grade hindsight system would make sense. The thing I want to persist is not raw submission blobs forever. It’s a curated memory record: topic, mistake type, lesson, timestamp, user scope, and maybe a confidence or decay strategy.
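As a record shape, that could look something like the following. Every field name here is my own guess at a sensible schema, not something the repo defines:

```python
import time
from dataclasses import dataclass, field

@dataclass
class InsightRecord:
    """A curated memory record: the lesson, scoped and timestamped."""
    user_id: str           # tenancy: memory is per user, never global
    topic: str             # e.g. "arrays"
    mistake_type: str      # e.g. "off_by_one"
    lesson: str            # the distilled, reusable advice
    created_at: float = field(default_factory=time.time)
    confidence: float = 0.5  # could decay as the user stops repeating the mistake
```

A row like this is trivial to store in MongoDB or index in a vector store, and the confidence field gives you a natural hook for memory aging later.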
What I learned building it
The main lesson here is that memory is only useful when it changes behavior. Storing history is trivial. Turning history into a reusable lesson is the actual product.
A few takeaways I’d reuse:
- Simple mistake taxonomies are better than clever ones. off_by_one, logic_error, memory_error, and null_reference are crude, but they map cleanly to advice.
- Language-specific escape hatches are worth it. The Java and C++ branches in engine.py are more maintainable than pretending one generic classifier can explain every runtime failure.
- Distilled memory beats raw logs for tutoring. The insight object is more reusable than a pile of past error strings.
- Global in-process memory is a trap. It feels convenient until you realize you’ve built cross-user leakage and restart amnesia into the system.
- Pattern thresholds matter. Requiring three repeated failures before declaring a weak area in pattern_detector.py is a small but good guardrail against overreacting to noise.
What I like about this repo is that it doesn’t pretend to be more sophisticated than it is. It’s a thin backend, a practical code runner, and an analysis engine trying to answer a very specific question: how do I make feedback feel cumulative instead of stateless?
For this kind of system, that’s the right question. Everything else is implementation detail.