"Wrong" isn't a diagnosis.
When a student answers 32 − 9 = 37, they didn't randomly guess. They subtracted in the wrong direction in the ones column — a specific, named error called a borrow-skip. A tutor that just marks it incorrect and moves on has wasted the most informative signal in the attempt: why the student got it wrong.
NumPath's Phase 2 mistake classifier turns wrong answers into structured MistakeEvent records. Here's how we built it, what we got wrong the first time, and why rule-based classifiers beat a neural network for this job at this stage.
What We Built
Eight rule-based classifiers covering all three of NumPath's Phase 1 skill areas:
| Code | Skill | Pattern |
|---|---|---|
| DIGIT_REVERSAL | SUB_BORROW / NUMBER_LINE | 2-digit answer with digits transposed |
| WRONG_OPERATION | SUB_BORROW | Student added instead of subtracted |
| BORROW_SKIP | SUB_BORROW | Ones subtracted in reverse, no borrow taken |
| OFF_BY_TEN | SUB_BORROW | Result ±10 from correct (borrow applied to wrong column) |
| PLACE_VALUE_CONFUSION | PLACE_VALUE | Compared units digits only, ignored tens |
| MAGNITUDE_MISJUDGE | PLACE_VALUE | Chose the smaller number as larger |
| NUMBER_LINE_DIRECTION | NUMBER_LINE | Said "left" when answer is "right" |
| OFF_BY_ONE | NUMBER_LINE | Numeric answer ±1 from correct (miscounted steps) |
Each classifier is a pure Python predicate — no external dependencies, no DB imports, testable in isolation. The main function runs them in priority order and returns the first match.
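To make "first match wins" concrete, here is a minimal sketch of such a dispatcher for number-line problems. It is not NumPath's actual code: the `(correct, given)` string signature, the function names, and `classify_number_line` are assumptions made for illustration. Only the mistake codes and their order come from this post.

```python
from typing import Callable, Optional

# Hypothetical predicate signature: (correct answer, student's answer) -> bool.
Predicate = Callable[[str, str], bool]


def is_direction_error(correct: str, given: str) -> bool:
    """Student answered with the opposite direction word."""
    opposite = {"left": "right", "right": "left"}
    return opposite.get(correct.strip().lower()) == given.strip().lower()


def is_digit_reversal(correct: str, given: str) -> bool:
    """Two-digit answer with the digits transposed (e.g. 27 written as 72)."""
    return (
        len(correct) == 2
        and correct.isdigit()
        and correct[0] != correct[1]
        and given == correct[::-1]
    )


def is_off_by_one(correct: str, given: str) -> bool:
    """Numeric answer exactly one step away from the correct one."""
    if not (correct.isdigit() and given.isdigit()):
        return False
    return abs(int(correct) - int(given)) == 1


# Order matters: the first predicate that matches decides the label.
NUMBER_LINE_CLASSIFIERS: list[tuple[str, Predicate]] = [
    ("NUMBER_LINE_DIRECTION", is_direction_error),
    ("DIGIT_REVERSAL", is_digit_reversal),
    ("OFF_BY_ONE", is_off_by_one),
]


def classify_number_line(correct: str, given: str) -> Optional[str]:
    """Return the first matching mistake code, or None if nothing matches."""
    for code, predicate in NUMBER_LINE_CLASSIFIERS:
        if predicate(correct, given):
            return code
    return None
```

Because each predicate only takes plain values, it can be unit-tested without a database or a session object, and changing the priority is a matter of reordering one list.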
The Design Decision
The first question was: classify with rules or train a model?
The case for rules: we don't have labelled training data yet. Phase 1 just shipped. We have zero MistakeEvent records. Training a classifier on nothing produces nothing.
The case for ML: rules are brittle. A student might make a novel error we didn't anticipate, and rule-based code silently returns None.
We went with rules for Phase 2 because the subtraction error patterns we target for dyscalculia are well documented in the ITS literature, specifically in the work of VanLehn (1982) on subtraction bugs and the later SIERRA system. "Borrow-skip" and "digit reversal" aren't our taxonomy; they're 40-year-old findings from cognitive science. A rule that detects them is more reliable than a model trained on 150 attempts.
The ML path opens in Phase 3 once the mistake_events table has enough volume. The rule-based classifier generates the labelled training data that Phase 3 will learn from.
The BORROW_SKIP bug
The Phase 1 classifier had a BORROW_SKIP function. It was wrong.
    # Phase 1 (incorrect)
    def _is_borrow_skip(problem_content: dict, given: str) -> bool:
        a, b = operands  # operands: taken from problem_content (lookup not shown)
        no_borrow_result = str(a + b)  # ← adds a + b, not the borrow-skip result
        return given == no_borrow_result
This detected addition (32 − 9 → 41) and called it BORROW_SKIP. But addition is a completely different error: confusing the + and − signs, not misapplying the borrowing algorithm. Every event record this predicate produced carried the wrong label.
The real borrow-skip pattern: when ones(a) < ones(b), the student skips borrowing and instead subtracts in the wrong direction in the ones column.
For 32 − 9:
- Correct: borrow a ten → 12 − 9 = 3 ones, 2 tens → 23
- Borrow-skip: ones = 9 − 2 = 7, tens = 3 (unchanged) → 37
    # Phase 2 (correct)
    def _is_borrow_skip(a: int, b: int, given: str) -> bool:
        ones_a, ones_b = a % 10, b % 10
        if ones_a >= ones_b:
            return False  # no borrow needed, so the pattern doesn't apply
        # Tens are subtracted normally; ones are subtracted smaller-from-larger.
        borrow_skip_result = (a // 10 - b // 10) * 10 + (ones_b - ones_a)
        return given == str(borrow_skip_result)
Verified: 32 − 9 → 37 ✓, 43 − 18 → 35 ✓, 31 − 14 → 23 ✓
The old code was shipping the wrong signal in both directions: it missed real borrow-skips such as 32 − 9 → 37, and it recorded wrong-operation answers as BORROW_SKIP. This is exactly why MistakeEvent records are useless until the classifier is correct. The adaptive engine was routing students labelled "borrow-skip" to the wrong remediation path.
The priority ordering problem
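Multiple patterns can fire for the same wrong answer. Often only one does: for 43 − 16 = 27, a student who writes 72 matches DIGIT_REVERSAL (27 reversed) and nothing else. Priority ordering becomes meaningful when patterns genuinely overlap. For 22 − 9 = 13, the answer 31 is both 13 reversed and 22 + 9, so it matches DIGIT_REVERSAL and WRONG_OPERATION at once, and the order decides which label is recorded.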
The classifier runs a hierarchy:
    subtraction problems:
      1. DIGIT_REVERSAL         ← most specific free-form error
      2. WRONG_OPERATION        ← added instead of subtracted
      3. BORROW_SKIP            ← skipped borrowing algorithm
      4. OFF_BY_TEN             ← borrow applied to wrong column
    place_value problems (multiple-choice, no free-form digit writing):
      1. PLACE_VALUE_CONFUSION  ← compared units digits only (more specific)
      2. MAGNITUDE_MISJUDGE     ← picked the smaller number (less specific)
    number_line problems:
      1. NUMBER_LINE_DIRECTION  ← wrong direction word
      2. DIGIT_REVERSAL         ← transposed digits in numeric answer
      3. OFF_BY_ONE             ← miscounted steps
Place value problems are multiple-choice, so DIGIT_REVERSAL doesn't apply there — the student picks from a given set, they don't write digits freely. Scoping by problem type prevents false positives.
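One way to see where the subtraction ordering actually changes the outcome is to collect every predicate that fires, not just the first. The sketch below does that under stated assumptions: the predicate bodies for DIGIT_REVERSAL, WRONG_OPERATION and OFF_BY_TEN are plausible readings of the pattern table, not the project's code, `all_matches` is a hypothetical helper, and `a >= b` is assumed throughout.

```python
def is_digit_reversal(a: int, b: int, given: str) -> bool:
    correct = str(a - b)
    return len(correct) == 2 and correct[0] != correct[1] and given == correct[::-1]


def is_wrong_operation(a: int, b: int, given: str) -> bool:
    return given == str(a + b)


def is_borrow_skip(a: int, b: int, given: str) -> bool:
    ones_a, ones_b = a % 10, b % 10
    if ones_a >= ones_b:
        return False
    return given == str((a // 10 - b // 10) * 10 + (ones_b - ones_a))


def is_off_by_ten(a: int, b: int, given: str) -> bool:
    return given in (str(a - b + 10), str(a - b - 10))


# Same order as the subtraction hierarchy above.
SUBTRACTION_PRIORITY = [
    ("DIGIT_REVERSAL", is_digit_reversal),
    ("WRONG_OPERATION", is_wrong_operation),
    ("BORROW_SKIP", is_borrow_skip),
    ("OFF_BY_TEN", is_off_by_ten),
]


def all_matches(a: int, b: int, given: str) -> list[str]:
    """Every code whose predicate fires, in priority order (not just the first)."""
    return [code for code, pred in SUBTRACTION_PRIORITY if pred(a, b, given)]


# 43 - 16 answered as 72: only one pattern fires, so order is irrelevant.
print(all_matches(43, 16, "72"))   # ['DIGIT_REVERSAL']
# 22 - 9 answered as 31: 13 reversed, and also 22 + 9.
print(all_matches(22, 9, "31"))    # ['DIGIT_REVERSAL', 'WRONG_OPERATION']
# 31 - 16 answered as 25: the borrow-skip result, and also correct + 10.
print(all_matches(31, 16, "25"))   # ['BORROW_SKIP', 'OFF_BY_TEN']
```

The first element of each list is what a first-match classifier would record. Logging the full list alongside it would give later phases the data needed to judge whether the chosen order is the right one.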
Why It Matters for the Research
Every MistakeEvent record becomes a training signal twice:
Now: the adaptive engine reads the last MISTAKE_WINDOW (3) events. Two BORROW_SKIP codes in a row trigger remediation mode: the engine drops difficulty and targets SUB_BORROW problems specifically. Correct classification = correct remediation.
Later: Phase 3 will train a logistic regression (and eventually a transformer) on the mistake events table. The rule-based classifier generates the initial labelled dataset. If the rules are wrong — as BORROW_SKIP was — the ML model learns the wrong pattern from poisoned labels.
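The "Now" rule can be sketched in a few lines. This is an illustration only: `should_remediate` is a hypothetical name, and the sketch assumes the engine sees the mistake codes as a list ordered oldest first, with `None` standing for an unclassified mistake. Only `MISTAKE_WINDOW = 3` and the two-in-a-row condition come from the description above.

```python
from typing import Optional, Sequence

MISTAKE_WINDOW = 3  # number of most recent events the engine inspects


def should_remediate(
    codes: Sequence[Optional[str]], target: str = "BORROW_SKIP"
) -> bool:
    """True if `target` appears twice in a row within the last MISTAKE_WINDOW events.

    `codes` is assumed to be ordered oldest first; None marks an
    unclassified mistake.
    """
    recent = list(codes)[-MISTAKE_WINDOW:]
    return any(
        first == target and second == target
        for first, second in zip(recent, recent[1:])
    )
```

With this rule a single mislabelled event is enough to flip the decision in either direction, which is why the Phase 1 labelling bug reached the remediation path so directly.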
For a dyscalculia intervention study, this matters more than it would in a general tutoring system. Dyscalculia-specific errors like borrow-skip and digit reversal appear in the ITS literature as distinct cognitive profiles. Getting them right means the model can eventually distinguish students who have a procedural gap (BORROW_SKIP) from students who have a representational gap (PLACE_VALUE_CONFUSION) — a distinction that should affect the instructional intervention.
What We Learned
Rule-based classifiers need domain literature, not just intuition. The original BORROW_SKIP implementation was plausible — "student added instead of subtracting" — but wrong. VanLehn's subtraction bug taxonomy makes the actual pattern explicit. Reading the paper would have saved months of mislabelled data.
Priority ordering is a design document. The order in which classifiers run encodes assumptions about what matters more. We chose "most specific fires first" — but that could be wrong. Maybe WRONG_OPERATION (a conceptual error) should always beat DIGIT_REVERSAL (a transcription error) regardless of specificity, because they imply different interventions. We don't have the data to answer that yet.
50 tests is the right investment for a classifier that labels training data. A wrong label propagates forward through every model that trains on it. Testing every predicate in isolation, including priority ordering and edge cases, is not over-engineering — it's protecting the integrity of the entire data pipeline.
What's Next
The mistake_events table is now correctly populated with each session. Once the pilot delivers ≥150 records, Phase 3 can fit a logistic regression on the labelled events — using the rule-based codes as ground truth — and eventually replace the rules with a model that generalises to error patterns we haven't seen yet.
Key Takeaways
- Rule-based mistake classifiers are the right first step when training data doesn't exist yet — they generate the labelled dataset that trains the eventual ML model
- The real borrow-skip pattern (subtract ones in reverse: 32−9=37) is different from wrong-operation (add instead of subtract: 32+9=41) — getting this wrong poisons every downstream model that trains on the events table
- Classifier priority ordering is a design decision that encodes instructional theory; document it explicitly and treat it as something to validate with data