What We Built
NumPath is an AI math tutor for children with dyscalculia. At its core is an adaptive engine that picks the next problem for each student based on their Bayesian Knowledge Tracing (BKT) mastery estimate. In this post I'll walk through a problem we had — and solved — in the rule-based phase: classified mistakes were being logged but completely ignored by the selection engine.
The fix was a 60-line change across two files. The research implication is significant.
The Problem: a diagnostic signal going nowhere
Our MistakeClassifier already tagged every wrong answer with a structured code: BORROW_SKIP when a student adds instead of subtracting on a problem that requires borrowing, DIGIT_REVERSAL when they write 51 for 15, MAGNITUDE_MISJUDGE when they pick the smaller number as the larger. These MistakeEvent records were hitting the database on every incorrect attempt.
But GetNextProblemUseCase — the code that decides what problem a student gets next — never read them. The engine was selecting problems purely on BKT p_mastery. A student could hit BORROW_SKIP three sessions in a row and still receive problems at the same difficulty, on the same skill, with zero response to the pattern.
This violates what MacLellan et al. call the "Error as Diagnostic Signal" principle: mistakes should trigger targeted remediation, not generic retry.
The Design Decision
The core question was: when should a mistake pattern trigger a response, and what should that response be?
We settled on three rules, each encoded as a named constant:
MISTAKE_WINDOW = 3  # look back this many MistakeEvents
# threshold = ceil(MISTAKE_WINDOW / 2) = 2; the dominant code must appear ≥ 2× in the window
MISTAKE_KC_MAP = {
    "DIGIT_REVERSAL": "PLACE_VALUE",
    "BORROW_SKIP": "SUB_BORROW",
    "MAGNITUDE_MISJUDGE": "PLACE_VALUE",
    "PLACE_VALUE_CONFUSION": "PLACE_VALUE",
    "OPERATION_CONFUSION": "OPERATION_SIGN",
}
When _detect_mistake_signal() fires, two things happen:
- Skill override: the engine targets the KC linked to that mistake code, even if another KC has lower p_mastery.
- Difficulty drop: one DIFFICULTY_STEP (0.2) down, floored at ENTRY_DIFFICULTY (0.3) to prevent over-scaffolding students who are already at entry level.
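To make the two rules concrete, here is a minimal sketch of how detection and remediation could fit together. It is not the production code: the function names detect_mistake_signal and apply_remediation, the MISTAKE_THRESHOLD constant, the "newest first" ordering of the input, and the ADD_CARRY skill used in the usage note are all assumptions made for illustration. Only the constants and the mapping come from the post.

```python
import math
from collections import Counter
from typing import Optional, Sequence, Tuple

MISTAKE_WINDOW = 3
MISTAKE_THRESHOLD = math.ceil(MISTAKE_WINDOW / 2)  # 2 (hypothetical constant name)
DIFFICULTY_STEP = 0.2
ENTRY_DIFFICULTY = 0.3

MISTAKE_KC_MAP = {
    "DIGIT_REVERSAL": "PLACE_VALUE",
    "BORROW_SKIP": "SUB_BORROW",
    "MAGNITUDE_MISJUDGE": "PLACE_VALUE",
    "PLACE_VALUE_CONFUSION": "PLACE_VALUE",
    "OPERATION_CONFUSION": "OPERATION_SIGN",
}


def detect_mistake_signal(recent_codes: Sequence[str]) -> Optional[Tuple[str, int]]:
    """Return (code, count) if one mapped code dominates the window, else None.

    Assumes recent_codes is ordered newest first.
    """
    window = list(recent_codes[:MISTAKE_WINDOW])
    if not window:
        return None
    code, count = Counter(window).most_common(1)[0]
    if count >= MISTAKE_THRESHOLD and code in MISTAKE_KC_MAP:
        return code, count
    return None


def apply_remediation(
    recent_codes: Sequence[str], bkt_kc: str, difficulty: float
) -> Tuple[str, float]:
    """Return the (KC, difficulty) pair to use for the next problem."""
    signal = detect_mistake_signal(recent_codes)
    if signal is None:
        # No dominant pattern: keep the BKT-driven choice untouched.
        return bkt_kc, difficulty
    code, _count = signal
    # Skill override: target the KC linked to the dominant mistake code.
    target_kc = MISTAKE_KC_MAP[code]
    # Difficulty drop: one step down, floored at entry level.
    lowered = max(ENTRY_DIFFICULTY, round(difficulty - DIFFICULTY_STEP, 2))
    # Never raise difficulty for a student already below the floor.
    return target_kc, min(difficulty, lowered)
```

With this sketch, apply_remediation(["BORROW_SKIP", "DIGIT_REVERSAL", "BORROW_SKIP"], "ADD_CARRY", 0.7) overrides the BKT choice and returns ("SUB_BORROW", 0.5), while three different codes in the window leave the BKT choice and difficulty unchanged.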
What we explicitly rejected: resetting difficulty to zero (too harsh for students who've been making progress), and weighting by mistake severity (too complex for Phase 1 with no real data to calibrate against).
The reason field on every NextProblemResponse now names the triggering pattern:
"Remediation: BORROW_SKIP detected 2× on SUB_BORROW (p_mastery=0.41)"
This is the explainability requirement. A teacher looking at this in the dashboard can understand exactly why the system chose what it did.
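A reason string in that shape can be produced with a single formatted string. The helper below is a sketch only; the name format_remediation_reason and its parameters are hypothetical, chosen to mirror the example above.

```python
def format_remediation_reason(code: str, count: int, kc: str, p_mastery: float) -> str:
    """Build a teacher-readable rationale for a remediation decision."""
    return f"Remediation: {code} detected {count}× on {kc} (p_mastery={p_mastery:.2f})"


reason = format_remediation_reason("BORROW_SKIP", 2, "SUB_BORROW", 0.41)
```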
Why It Matters for the Research
The central claim of NumPath's RCT will be that adaptive, mistake-aware tutoring produces better outcomes than static worksheets for dyscalculic learners. Before this change, we had a system that adapted difficulty based on streaks but ignored the type of error a student was making. That's not meaningfully different from a worksheet that repeats problems when you get them wrong.
Closing this loop — mistake code → KC target → difficulty adjustment → reason field — is what makes the system an Intelligent Tutoring System rather than a difficulty slider. Every MistakeEvent record is now a longitudinal data point that shapes the student's next experience, and that chain of causality is fully traceable.
What We Learned
The implementation was straightforward. The harder question was the threshold: why 2 of 3, not 3 of 3? Three-of-three is too strict — a student who makes BORROW_SKIP, then DIGIT_REVERSAL, then BORROW_SKIP again has a clear pattern but the strict threshold misses it. Two-of-three catches the pattern earlier at the cost of occasional false positives. We don't yet have real student data to validate this choice — it's a hypothesis. We've logged it as a research note for Phase 4.
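The difference between the two thresholds is easy to check on the example window from the paragraph above. This is an illustrative sketch; dominant_code_fires is a hypothetical helper, not a function from the codebase.

```python
from collections import Counter
from typing import Sequence


def dominant_code_fires(window: Sequence[str], threshold: int) -> bool:
    """True if the most frequent code in the window appears at least `threshold` times."""
    if not window:
        return False
    _code, count = Counter(window).most_common(1)[0]
    return count >= threshold


window = ["BORROW_SKIP", "DIGIT_REVERSAL", "BORROW_SKIP"]
fires_two_of_three = dominant_code_fires(window, threshold=2)    # pattern caught
fires_three_of_three = dominant_code_fires(window, threshold=3)  # pattern missed
```

The same helper also shows where false positives could come from: any two identical codes in the last three mistakes fire the signal, even if the third points elsewhere.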
The one thing I'd do differently: add the MistakeEvent index to the model on day one. It was missing and only caught during the performance review pass. A composite index on (student_id, created_at) is obvious in hindsight for any table you're going to query with ORDER BY created_at DESC LIMIT N.
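For reference, here is roughly what that index and query look like, sketched with Python's built-in sqlite3 module so it runs anywhere. The table name mistake_event, the code column, the index name, and the sample rows are assumptions for illustration; our real schema and database engine may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE mistake_event (
        id INTEGER PRIMARY KEY,
        student_id INTEGER NOT NULL,
        code TEXT NOT NULL,
        created_at TEXT NOT NULL  -- ISO 8601 timestamps sort chronologically as text
    )
    """
)
# Composite index matching the access pattern: filter by student, order by time.
conn.execute(
    "CREATE INDEX idx_mistake_event_student_created "
    "ON mistake_event (student_id, created_at)"
)

# Illustrative rows only, not real student data.
rows = [
    (1, "MAGNITUDE_MISJUDGE", "2024-01-01T09:00:00"),
    (1, "BORROW_SKIP", "2024-01-01T10:00:00"),
    (1, "DIGIT_REVERSAL", "2024-01-01T10:05:00"),
    (1, "BORROW_SKIP", "2024-01-01T10:10:00"),
    (2, "OPERATION_CONFUSION", "2024-01-01T10:07:00"),
]
conn.executemany(
    "INSERT INTO mistake_event (student_id, code, created_at) VALUES (?, ?, ?)", rows
)

RECENT_SQL = (
    "SELECT code FROM mistake_event "
    "WHERE student_id = ? ORDER BY created_at DESC LIMIT ?"
)


def recent_mistake_codes(conn: sqlite3.Connection, student_id: int, limit: int = 3) -> list:
    """Return the student's most recent mistake codes, newest first."""
    return [row[0] for row in conn.execute(RECENT_SQL, (student_id, limit))]


latest = recent_mistake_codes(conn, student_id=1)

# Inspect the plan to see whether the planner picks up the composite index.
plan = conn.execute("EXPLAIN QUERY PLAN " + RECENT_SQL, (1, 3)).fetchall()
```

Putting student_id first lets the database jump straight to one student's rows; having created_at second means those rows are already stored in time order, so the ORDER BY ... LIMIT N can be answered without a separate sort.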
What's Next
Next up: wiring the KC states into the teacher dashboard so educators can see p_mastery per student, not just 7-day accuracy — the final piece of the MacLellan "Teacher-in-the-Loop" principle.
Key Takeaways
- Logging errors is not the same as learning from them: a diagnostic signal only matters if it changes what happens next; wiring MistakeEvent into select_next_problem() is a 60-line change with a meaningful research impact.
- The reason field is not a nice-to-have: every adaptive decision must be explainable to a teacher; string-formatted rationale on each NextProblemResponse is the minimum viable explainability.
- Named constants beat magic numbers: MISTAKE_WINDOW, FRUSTRATION_WINDOW, and MASTERY_WINDOW sit side by side; when we have real data to calibrate thresholds, we change one line each.