Building a mistake taxonomy for dyscalculia — 8 error patterns, rule-based, no ML required

#numpath #dyscalculia #adaptivelearning #python

"Wrong" isn't a diagnosis.

When a student answers 32 − 9 = 37, they didn't randomly guess. They subtracted in the wrong direction in the ones column — a specific, named error called a borrow-skip. A tutor that just marks it incorrect and moves on has wasted the most informative signal in the attempt: why the student got it wrong.

NumPath's Phase 2 mistake classifier turns wrong answers into structured MistakeEvent records. Here's how we built it, what we got wrong the first time, and why rule-based classifiers beat a neural network for this job at this stage.

What We Built

Eight rule-based classifiers covering all three of NumPath's Phase 1 skill areas:

| Code | Skill | Pattern |
| --- | --- | --- |
| `DIGIT_REVERSAL` | SUB_BORROW / NUMBER_LINE | 2-digit answer with digits transposed |
| `WRONG_OPERATION` | SUB_BORROW | Student added instead of subtracted |
| `BORROW_SKIP` | SUB_BORROW | Ones subtracted in reverse; no borrow taken |
| `OFF_BY_TEN` | SUB_BORROW | Result ±10 from correct (borrow applied to wrong column) |
| `PLACE_VALUE_CONFUSION` | PLACE_VALUE | Compared units digits only, ignored tens |
| `MAGNITUDE_MISJUDGE` | PLACE_VALUE | Chose the smaller number as larger |
| `NUMBER_LINE_DIRECTION` | NUMBER_LINE | Said "left" when answer is "right" |
| `OFF_BY_ONE` | NUMBER_LINE | Numeric answer ±1 from correct (miscounted steps) |

Each classifier is a pure Python predicate — no external dependencies, no DB imports, testable in isolation. The main function runs them in priority order and returns the first match.
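As a minimal sketch of that shape, here is how the number-line family might look. The function names, the string-based answer format, and the `NUMBER_LINE_PRIORITY` list are assumptions made for illustration, not NumPath's actual code.

```python
from typing import Callable, List, Optional, Tuple


def _to_int(text: str) -> Optional[int]:
    """Parse a non-negative whole-number answer, or return None."""
    text = text.strip()
    return int(text) if text.isdecimal() else None


def is_number_line_direction(correct: str, given: str) -> bool:
    """Student said 'left' when the answer is 'right', or the reverse."""
    directions = {"left", "right"}
    c, g = correct.strip().lower(), given.strip().lower()
    return c in directions and g in directions and c != g


def is_digit_reversal(correct: str, given: str) -> bool:
    """Two-digit answer written with its digits transposed (27 -> 72)."""
    c, g = correct.strip(), given.strip()
    return len(c) == 2 and c.isdecimal() and g == c[::-1] and g != c


def is_off_by_one(correct: str, given: str) -> bool:
    """Numeric answer exactly one step away from the correct one."""
    c, g = _to_int(correct), _to_int(given)
    return c is not None and g is not None and abs(c - g) == 1


# Hypothetical priority list: earlier entries win when several match.
NUMBER_LINE_PRIORITY: List[Tuple[str, Callable[[str, str], bool]]] = [
    ("NUMBER_LINE_DIRECTION", is_number_line_direction),
    ("DIGIT_REVERSAL", is_digit_reversal),
    ("OFF_BY_ONE", is_off_by_one),
]


def classify_number_line(correct: str, given: str) -> Optional[str]:
    """Return the first matching code, or None if nothing matches."""
    if correct.strip().lower() == given.strip().lower():
        return None  # the answer was right, so there is nothing to classify
    for code, predicate in NUMBER_LINE_PRIORITY:
        if predicate(correct, given):
            return code
    return None


print(classify_number_line("right", "left"))  # NUMBER_LINE_DIRECTION
print(classify_number_line("27", "72"))       # DIGIT_REVERSAL
print(classify_number_line("14", "15"))       # OFF_BY_ONE
print(classify_number_line("14", "40"))       # None
```

Because each predicate takes plain values and returns a bool, it can be unit-tested without a database or a session object.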

The Design Decision

The first question was: classify with rules or train a model?

The case for rules: we don't have labelled training data yet. Phase 1 just shipped. We have zero MistakeEvent records. Training a classifier on nothing produces nothing.

The case for ML: rules are brittle. A student might make a novel error we didn't anticipate, and rule-based code silently returns None.

We went with rules for Phase 2 because the error patterns for dyscalculia are well-documented in the ITS literature — specifically in the work of VanLehn (1982) on subtraction bugs and the later SIERRA system. "Borrow-skip" and "digit reversal" aren't our taxonomy; they're 40-year-old findings from cognitive science. A rule that detects them is more reliable than a model trained on 150 attempts.

The ML path opens in Phase 3 once the mistake_events table has enough volume. The rule-based classifier generates the labelled training data that Phase 3 will learn from.

The BORROW_SKIP bug

The Phase 1 classifier had a BORROW_SKIP function. It was wrong.

# Phase 1 — incorrect
def _is_borrow_skip(problem_content: dict, given: str) -> bool:
    a, b = problem_content["operands"]  # key name assumed; the excerpt omitted this lookup
    no_borrow_result = str(a + b)    # ← adds a + b, not the borrow-skip result
    return given == no_borrow_result

This detected addition (32 − 9 → 41) and called it BORROW_SKIP. But addition is a different error: it means the student confused the + and − signs, not that they misapplied the borrowing algorithm. The mistake was mislabelled in every event record.

The real borrow-skip pattern: when ones(a) < ones(b), the student skips borrowing and instead subtracts in the wrong direction in the ones column.

For 32 − 9:

Correct: borrow a ten → 12 − 9 = 3 ones, 2 tens → 23
Borrow-skip: ones = 9 − 2 = 7, tens = 3 (unchanged) → 37

# Phase 2 — correct
def _is_borrow_skip(a: int, b: int, given: str) -> bool:
    ones_a, ones_b = a % 10, b % 10
    if ones_a >= ones_b:
        return False  # no borrow needed — pattern doesn't apply
    borrow_skip_result = (a // 10 - b // 10) * 10 + (ones_b - ones_a)
    return given == str(borrow_skip_result)

Verified: 32 − 9 → 37 ✓, 43 − 18 → 35 ✓, 31 − 14 → 23 ✓

The old code was shipping the wrong signal for every borrow-skip attempt. This is exactly why MistakeEvent records are useless until the classifier is correct — the adaptive engine was routing "borrow-skip" students to the wrong remediation path.

The priority ordering problem

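Multiple patterns can fire for the same wrong answer. Take 43 − 16, where the correct answer is 27 and the student wrote 72: only DIGIT_REVERSAL matches (27 with its digits transposed), so ordering is irrelevant. Priority matters when patterns genuinely overlap. For 32 − 7 answered as 35, both BORROW_SKIP (ones reversed: 7 − 2 = 5, tens unchanged at 3) and OFF_BY_TEN (25 + 10) match, and the classifier has to pick one.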
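To check which subtraction patterns can fire together, one option is to run every predicate instead of stopping at the first match. This is a sketch: `matching_codes` and the three predicates other than borrow-skip are hypothetical implementations written from the pattern descriptions in the table, and they assume a non-negative result.

```python
from typing import Callable, List, Tuple


def is_digit_reversal(a: int, b: int, given: str) -> bool:
    correct = str(a - b)
    return len(correct) == 2 and given == correct[::-1] and given != correct


def is_wrong_operation(a: int, b: int, given: str) -> bool:
    return given == str(a + b)  # student added instead of subtracting


def is_borrow_skip(a: int, b: int, given: str) -> bool:
    ones_a, ones_b = a % 10, b % 10
    if ones_a >= ones_b:
        return False
    return given == str((a // 10 - b // 10) * 10 + (ones_b - ones_a))


def is_off_by_ten(a: int, b: int, given: str) -> bool:
    return given.isdecimal() and abs(int(given) - (a - b)) == 10


# Listed in the same order as the subtraction hierarchy.
SUBTRACTION_PREDICATES: List[Tuple[str, Callable[[int, int, str], bool]]] = [
    ("DIGIT_REVERSAL", is_digit_reversal),
    ("WRONG_OPERATION", is_wrong_operation),
    ("BORROW_SKIP", is_borrow_skip),
    ("OFF_BY_TEN", is_off_by_ten),
]


def matching_codes(a: int, b: int, given: str) -> List[str]:
    """Return every code whose predicate fires, not just the first."""
    return [code for code, pred in SUBTRACTION_PREDICATES if pred(a, b, given)]


print(matching_codes(43, 16, "72"))  # ['DIGIT_REVERSAL']
print(matching_codes(32, 7, "35"))   # ['BORROW_SKIP', 'OFF_BY_TEN']
print(matching_codes(22, 9, "31"))   # ['DIGIT_REVERSAL', 'WRONG_OPERATION']
```

Under these assumptions, 22 − 9 answered as 31 is a case where DIGIT_REVERSAL (13 reversed) and WRONG_OPERATION (22 + 9) collide, so the list order alone decides the label.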

The classifier runs a hierarchy:

subtraction problems:
  1. DIGIT_REVERSAL    ← most specific free-form error
  2. WRONG_OPERATION   ← added instead of subtracted
  3. BORROW_SKIP       ← skipped borrowing algorithm
  4. OFF_BY_TEN        ← borrow applied to wrong column

place_value problems (multiple-choice — no free-form digit writing):
  1. PLACE_VALUE_CONFUSION  ← compared units digits only (more specific)
  2. MAGNITUDE_MISJUDGE     ← picked the smaller number (less specific)

number_line problems:
  1. NUMBER_LINE_DIRECTION  ← wrong direction word
  2. DIGIT_REVERSAL         ← transposed digits in numeric answer
  3. OFF_BY_ONE             ← miscounted steps

Place value problems are multiple-choice, so DIGIT_REVERSAL doesn't apply there — the student picks from a given set, they don't write digits freely. Scoping by problem type prevents false positives.
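A sketch of how that scoping might look for place-value problems, assuming a two-option "which number is larger?" format where the answer is one of the displayed numbers. The function names and the `options` tuple are hypothetical.

```python
from typing import Optional, Tuple


def is_place_value_confusion(options: Tuple[int, int], given: int) -> bool:
    """Chose the smaller number, and it is the one with the larger units digit."""
    larger, smaller = max(options), min(options)
    return larger != smaller and given == smaller and smaller % 10 > larger % 10


def is_magnitude_misjudge(options: Tuple[int, int], given: int) -> bool:
    """Chose the smaller number as the larger one."""
    return max(options) != min(options) and given == min(options)


def classify_place_value(options: Tuple[int, int], given: int) -> Optional[str]:
    # More specific pattern first; DIGIT_REVERSAL is deliberately absent
    # because the student selects an option rather than writing digits.
    if is_place_value_confusion(options, given):
        return "PLACE_VALUE_CONFUSION"
    if is_magnitude_misjudge(options, given):
        return "MAGNITUDE_MISJUDGE"
    return None


print(classify_place_value((47, 52), 47))  # PLACE_VALUE_CONFUSION
print(classify_place_value((41, 58), 41))  # MAGNITUDE_MISJUDGE
print(classify_place_value((12, 21), 12))  # PLACE_VALUE_CONFUSION
print(classify_place_value((41, 58), 58))  # None
```

The (12, 21) case shows why the scoping matters: 12 is the correct answer 21 with its digits swapped, but in a multiple-choice setting the student never wrote those digits, so a reversal label would be a false positive.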

Why It Matters for the Research

Every MistakeEvent record becomes a training signal twice:

Now: the adaptive engine reads the last MISTAKE_WINDOW (3) events. Two BORROW_SKIP codes in a row trigger remediation mode: the engine drops difficulty and targets SUB_BORROW problems specifically. Correct classification means correct remediation.

Later: Phase 3 will train a logistic regression (and eventually a transformer) on the mistake events table. The rule-based classifier generates the initial labelled dataset. If the rules are wrong — as BORROW_SKIP was — the ML model learns the wrong pattern from poisoned labels.
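The "Now" rule can be sketched as a small pure function. `MISTAKE_WINDOW = 3` comes from the text; the function name, the list-of-codes input, and the choice to treat any repeated code (not only BORROW_SKIP) as a trigger are assumptions for illustration.

```python
from typing import Optional, Sequence

MISTAKE_WINDOW = 3  # number of most recent events the engine inspects


def remediation_code(recent_codes: Sequence[Optional[str]]) -> Optional[str]:
    """Return the code that appears twice in a row inside the window, if any.

    `recent_codes` is ordered oldest to newest; None means an attempt
    that no classifier matched.
    """
    window = list(recent_codes)[-MISTAKE_WINDOW:]
    for previous, current in zip(window, window[1:]):
        if current is not None and previous == current:
            return current
    return None


print(remediation_code(["OFF_BY_TEN", "BORROW_SKIP", "BORROW_SKIP"]))  # BORROW_SKIP
print(remediation_code(["BORROW_SKIP", "OFF_BY_TEN", "BORROW_SKIP"]))  # None
```

A mislabelled event changes the output of this function directly, which is how a classifier bug turns into a wrong remediation path.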

For a dyscalculia intervention study, this matters more than it would in a general tutoring system. Dyscalculia-specific errors like borrow-skip and digit reversal appear in the ITS literature as distinct cognitive profiles. Getting them right means the model can eventually distinguish students who have a procedural gap (BORROW_SKIP) from students who have a representational gap (PLACE_VALUE_CONFUSION) — a distinction that should affect the instructional intervention.

What We Learned

Rule-based classifiers need domain literature, not just intuition. The original BORROW_SKIP implementation was plausible — "student added instead of subtracting" — but wrong. VanLehn's subtraction bug taxonomy makes the actual pattern explicit. Reading the paper would have saved months of mislabelled data.

Priority ordering is a design document. The order in which classifiers run encodes assumptions about what matters more. We chose "most specific fires first" — but that could be wrong. Maybe WRONG_OPERATION (a conceptual error) should always beat DIGIT_REVERSAL (a transcription error) regardless of specificity, because they imply different interventions. We don't have the data to answer that yet.

50 tests is the right investment for a classifier that labels training data. A wrong label propagates forward through every model that trains on it. Testing every predicate in isolation, including priority ordering and edge cases, is not over-engineering — it's protecting the integrity of the entire data pipeline.

What's Next

The mistake_events table is now correctly populated with each session. Once the pilot delivers ≥150 records, Phase 3 can fit a logistic regression on the labelled events — using the rule-based codes as ground truth — and eventually replace the rules with a model that generalises to error patterns we haven't seen yet.

Key Takeaways

Rule-based mistake classifiers are the right first step when training data doesn't exist yet — they generate the labelled dataset that trains the eventual ML model
The real borrow-skip pattern (subtract ones in reverse: 32−9=37) is different from wrong-operation (add instead of subtract: 32+9=41) — getting this wrong poisons every downstream model that trains on the events table
Classifier priority ordering is a design decision that encodes instructional theory; document it explicitly and treat it as something to validate with data