Oscar Rieken

Closing the feedback loop: how mistake classification drives adaptive problem selection in NumPath

What We Built

NumPath is an AI math tutor for children with dyscalculia. At its core is an adaptive engine that picks the next problem for each student based on their Bayesian Knowledge Tracing (BKT) mastery estimate. In this post I'll walk through a problem we had — and solved — in the rule-based phase: classified mistakes were being logged but completely ignored by the selection engine.

The fix was a 60-line change across two files. The research implication is significant.

The Problem: a diagnostic signal going nowhere

Our MistakeClassifier already tagged every wrong answer with a structured code — BORROW_SKIP when a student adds instead of subtracts with borrowing, DIGIT_REVERSAL when they write 51 for 15, MAGNITUDE_MISJUDGE when they pick the smaller number as larger. These MistakeEvent records were hitting the database on every incorrect attempt.

But GetNextProblemUseCase — the code that decides what problem a student gets next — never read them. The engine was selecting problems purely on BKT p_mastery. A student could hit BORROW_SKIP three sessions in a row and still receive problems at the same difficulty, on the same skill, with zero response to the pattern.

This violates what MacLellan et al. call the "Error as Diagnostic Signal" principle: mistakes should trigger targeted remediation, not generic retry.

The Design Decision

The core question was: when should a mistake pattern trigger a response, and what should that response be?

We settled on three rules, each encoded as a named constant:

MISTAKE_WINDOW = 3        # look back this many MistakeEvents
# threshold = ceil(MISTAKE_WINDOW / 2) = 2 — dominant code must appear ≥ 2× in window

MISTAKE_KC_MAP = {
    "DIGIT_REVERSAL":        "PLACE_VALUE",
    "BORROW_SKIP":           "SUB_BORROW",
    "MAGNITUDE_MISJUDGE":    "PLACE_VALUE",
    "PLACE_VALUE_CONFUSION": "PLACE_VALUE",
    "OPERATION_CONFUSION":   "OPERATION_SIGN",
}
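To make the window-and-threshold rule concrete, here is a minimal sketch of how the detection step could work. It is not the actual `_detect_mistake_signal()` from the codebase: the function name `detect_mistake_signal`, its signature, the return shape, and the assumption that codes arrive newest-first are all illustrative.

```python
import math
from collections import Counter
from typing import Optional, Sequence, Tuple

MISTAKE_WINDOW = 3  # look back this many MistakeEvents

MISTAKE_KC_MAP = {
    "DIGIT_REVERSAL": "PLACE_VALUE",
    "BORROW_SKIP": "SUB_BORROW",
    "MAGNITUDE_MISJUDGE": "PLACE_VALUE",
    "PLACE_VALUE_CONFUSION": "PLACE_VALUE",
    "OPERATION_CONFUSION": "OPERATION_SIGN",
}


def detect_mistake_signal(
    recent_codes: Sequence[str],
) -> Optional[Tuple[str, str, int]]:
    """Return (mistake_code, target_kc, count) if one code dominates the window.

    Hypothetical helper: `recent_codes` is assumed to be ordered newest first.
    """
    window = list(recent_codes[:MISTAKE_WINDOW])
    if not window:
        return None
    threshold = math.ceil(MISTAKE_WINDOW / 2)  # 2 when the window is 3
    code, count = Counter(window).most_common(1)[0]
    # No signal if the dominant code is too rare or has no mapped KC.
    if count < threshold or code not in MISTAKE_KC_MAP:
        return None
    return code, MISTAKE_KC_MAP[code], count
```

With this sketch, a window of `["BORROW_SKIP", "DIGIT_REVERSAL", "BORROW_SKIP"]` yields a signal for `SUB_BORROW`, while three different codes yield no signal at all.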

When _detect_mistake_signal() fires, two things happen:

  1. Skill override — the engine targets the KC linked to that mistake code, even if another KC has lower p_mastery.
  2. Difficulty drop — one DIFFICULTY_STEP (0.2) down, floored at ENTRY_DIFFICULTY (0.3) to prevent over-scaffolding students who are already at entry level.
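The two responses above can be sketched as a single small function. This is an illustration under assumptions, not the production code: the `Selection` dataclass and `apply_remediation` are hypothetical names, and I'm assuming the drop is applied relative to the difficulty the engine would otherwise have served.

```python
from dataclasses import dataclass

DIFFICULTY_STEP = 0.2   # value quoted in the post
ENTRY_DIFFICULTY = 0.3  # value quoted in the post


@dataclass(frozen=True)
class Selection:
    """Hypothetical container for what the engine is about to serve."""
    kc: str
    difficulty: float


def apply_remediation(bkt_choice: Selection, target_kc: str) -> Selection:
    """Override the skill and step difficulty down once, never below entry level."""
    lowered = max(ENTRY_DIFFICULTY, bkt_choice.difficulty - DIFFICULTY_STEP)
    # Round to avoid float noise such as 0.49999999999999994.
    return Selection(kc=target_kc, difficulty=round(lowered, 2))
```

For example, a BKT pick of `PLACE_VALUE` at difficulty 0.7 becomes `SUB_BORROW` at 0.5, while a pick already at 0.4 or below lands on the 0.3 floor.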

What we explicitly rejected: resetting difficulty to zero (too harsh for students who've been making progress), and weighting by mistake severity (too complex for Phase 1 with no real data to calibrate against).

The reason field on every NextProblemResponse now names the triggering pattern:

"Remediation: BORROW_SKIP detected 2× on SUB_BORROW (p_mastery=0.41)"

This is the explainability requirement. A teacher looking at this in the dashboard can understand exactly why the system chose what it did.
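A reason string in that shape could be produced by something as simple as the sketch below. The helper name `format_reason` and its parameters are assumptions for illustration; only the output format comes from the example above.

```python
def format_reason(code: str, count: int, kc: str, p_mastery: float) -> str:
    """Build a teacher-readable rationale for a remediation decision.

    Hypothetical helper; mirrors the reason format quoted in the post.
    """
    return f"Remediation: {code} detected {count}× on {kc} (p_mastery={p_mastery:.2f})"
```

Keeping the formatting in one place means the dashboard and the logs show the same wording for the same decision.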

Why It Matters for the Research

The central claim of NumPath's RCT will be that adaptive, mistake-aware tutoring produces better outcomes than static worksheets for dyscalculic learners. Before this change, we had a system that adapted difficulty based on streaks but ignored the type of error a student was making. That's not meaningfully different from a worksheet that repeats problems when you get them wrong.

Closing this loop — mistake code → KC target → difficulty adjustment → reason field — is what makes the system an Intelligent Tutoring System rather than a difficulty slider. Every MistakeEvent record is now a longitudinal data point that shapes the student's next experience, and that chain of causality is fully traceable.

What We Learned

The implementation was straightforward. The harder question was the threshold: why 2 of 3, not 3 of 3? Three-of-three is too strict — a student who makes BORROW_SKIP, then DIGIT_REVERSAL, then BORROW_SKIP again has a clear pattern but the strict threshold misses it. Two-of-three catches the pattern earlier at the cost of occasional false positives. We don't yet have real student data to validate this choice — it's a hypothesis. We've logged it as a research note for Phase 4.

The one thing I'd do differently: add the MistakeEvent index to the model on day one. It was missing and only caught during the performance review pass. A composite index on (student_id, created_at) is obvious in hindsight for any table you're going to query with ORDER BY created_at DESC LIMIT N.
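For illustration, here is what that index and lookup could look like, using Python's built-in `sqlite3` so the example runs anywhere. The table name, column names, index name, sample rows, and the `WHERE student_id = ?` filter are all assumptions; the real model and database may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mistake_event (
        id           INTEGER PRIMARY KEY,
        student_id   INTEGER NOT NULL,
        mistake_code TEXT NOT NULL,
        created_at   TEXT NOT NULL
    );
    -- Composite index matching the per-student "latest N" lookup.
    CREATE INDEX idx_mistake_event_student_created
        ON mistake_event (student_id, created_at);
""")

# Made-up sample rows, purely for demonstration.
conn.executemany(
    "INSERT INTO mistake_event (student_id, mistake_code, created_at) VALUES (?, ?, ?)",
    [
        (1, "DIGIT_REVERSAL", "2024-01-01T09:00:00"),
        (1, "BORROW_SKIP", "2024-01-02T09:00:00"),
        (2, "OPERATION_CONFUSION", "2024-01-02T10:00:00"),
        (1, "DIGIT_REVERSAL", "2024-01-03T09:00:00"),
        (1, "BORROW_SKIP", "2024-01-04T09:00:00"),
    ],
)


def recent_mistake_codes(conn: sqlite3.Connection, student_id: int, limit: int = 3) -> list:
    """Return the student's most recent mistake codes, newest first."""
    cur = conn.execute(
        "SELECT mistake_code FROM mistake_event "
        "WHERE student_id = ? ORDER BY created_at DESC LIMIT ?",
        (student_id, limit),
    )
    return [row[0] for row in cur.fetchall()]
```

Because the index leads with `student_id` and then `created_at`, the database can satisfy both the filter and the ordering from the index rather than sorting the student's full history on every request.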

What's Next

Next up: wiring the KC states into the teacher dashboard so educators can see p_mastery per student, not just 7-day accuracy — the final piece of the MacLellan "Teacher-in-the-Loop" principle.

Key Takeaways

  • Logging errors is not the same as learning from them — a diagnostic signal only matters if it changes what happens next; wiring MistakeEvent into select_next_problem() is a 60-line change with a meaningful research impact
  • The reason field is not a nice-to-have — every adaptive decision must be explainable to a teacher; string-formatted rationale on each NextProblemResponse is the minimum viable explainability
  • Named constants beat magic numbers: MISTAKE_WINDOW, FRUSTRATION_WINDOW, and MASTERY_WINDOW sit side by side; when we have real data to calibrate thresholds, we change one line each
