100 hand-crafted math problems: what I learned writing seed data for an adaptive tutor

#numpath #dyscalculia #python #education

Why hand-author anything?

The obvious approach for seeding an adaptive math tutor is to generate problems programmatically. Pick two random numbers, subtract them, done. I tried this first and it failed for a specific reason: generated problems don't have meaningful hints.

A hint like "Try subtracting the ones column first" is generic. A hint like "2 ones minus 9 is impossible without borrowing — take a ten from the 3 tens" is diagnostic. It names the exact step where a dyscalculic student is likely to get stuck, and it names the operation they need to perform. That second kind of hint requires a human who understands the problem.

NumPath's Phase 1 seeds 100 problems across 5 Knowledge Components. Each problem carries two progressive hints, a calibrated difficulty score, and structured metadata that the MistakeClassifier uses to diagnose errors. Every one is hand-authored.

The content schema

Each problem's content is stored in a JSONB column in Postgres. The schema is intentionally flat: scalars and arrays only, no nested objects, no polymorphism:

{
    "type": "subtraction",
    "question": "32 − 9 = ?",
    "answer": "23",
    "difficulty": 0.25,
    "operands": [32, 9],
    "hints": [
        "2 ones − 9 is impossible without borrowing. Take a ten from the 3 tens.",
        "Now you have 12 ones. 12 − 9 = 3. You have 2 tens left. Answer: 23."
    ]
}
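A flat shape like this is easy to check before it is inserted. The following is a minimal sketch, assuming the constraints this post states (exactly two hints, a difficulty from 0.1 to 0.9, no nested objects). The function name `validate_problem` and the set of required keys are hypothetical, not part of NumPath.

```python
from numbers import Real

# Assumed required keys; "operands" / "choices" are left optional here.
REQUIRED_KEYS = {"type", "question", "answer", "difficulty", "hints"}


def validate_problem(problem: dict) -> list[str]:
    """Return a list of schema violations; an empty list means valid."""
    errors = []

    missing = REQUIRED_KEYS - set(problem)
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")

    difficulty = problem.get("difficulty")
    if isinstance(difficulty, bool) or not isinstance(difficulty, Real):
        errors.append("difficulty must be a number")
    elif not 0.1 <= difficulty <= 0.9:
        errors.append("difficulty must be between 0.1 and 0.9")

    hints = problem.get("hints")
    if (
        not isinstance(hints, list)
        or len(hints) != 2
        or not all(isinstance(h, str) and h for h in hints)
    ):
        errors.append("hints must be a list of exactly two non-empty strings")

    # Flat schema: values may be scalars or lists, never objects.
    nested = [key for key, value in problem.items() if isinstance(value, dict)]
    if nested:
        errors.append(f"nested objects not allowed: {sorted(nested)}")

    return errors
```

Running a check like this over every seed entry before the insert would catch a missing hint or an out-of-range score at authoring time rather than in front of a student.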

Three fields deserve explanation:

operands / choices: these aren't shown to the student. They exist for the MistakeClassifier, and each check operates on the operands, not the question string. Suppose a student answers "41" instead of "23" on 32 − 9. The classifier checks whether the answer matches subtracting the smaller digit from the larger in each column (3 − 0 = 3, 9 − 2 = 7 → 37... no). It checks whether the student borrowed but forgot to reduce the tens (12 − 9 = 3 with the 3 tens untouched → 33... no). It checks for digit reversal (23 → 32... close, but the student wrote 41). None of the borrowing checks match here; 41 is what 32 + 9 gives, which would suggest a misread operation sign rather than a borrowing error.

difficulty: a float from 0.1 to 0.9, calibrated by hand. This is the initial difficulty estimate, and the adaptive engine uses it to match students to problems at their current level. I'll explain the calibration logic below.

hints: always exactly two, always progressive. The first hint names the obstacle. The second hint walks through the solution. Students reveal hints one at a time, voluntarily. Hints are never forced, because forcing hints on students who don't want them risks creating learned helplessness, which is the opposite of what we're trying to study.
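The operand-based checks described for the MistakeClassifier can be sketched as a function that predicts what each known mistake would produce and compares it with the student's answer. This is an illustration only: the function name, the labels, and the order of the checks are assumptions, and the real classifier is not shown in this post.

```python
def classify_subtraction_error(minuend: int, subtrahend: int, given: int) -> str:
    """Label an answer to minuend - subtrahend using only the operands.

    Assumes minuend >= subtrahend >= 0. Labels are illustrative.
    """
    correct = minuend - subtrahend
    if given == correct:
        return "correct"

    # Digits of each operand, ones column first, padded to equal width.
    width = len(str(minuend))
    top = [int(d) for d in reversed(str(minuend))]
    bottom = [int(d) for d in reversed(str(subtrahend).zfill(width))]

    def from_columns(digits):
        """Rebuild an integer from ones-first column digits."""
        return int("".join(str(d) for d in reversed(digits)))

    candidates = {
        # Took the smaller digit from the larger in every column.
        "smaller_from_larger": from_columns(
            [abs(t - b) for t, b in zip(top, bottom)]
        ),
        # Added ten where needed but never reduced the lending column.
        "forgot_to_decrement": from_columns(
            [(t - b) % 10 for t, b in zip(top, bottom)]
        ),
        # Wrote the digits of the right answer in reverse order.
        "digit_reversal": int(str(correct)[::-1]),
        # Added instead of subtracting.
        "operation_swap": minuend + subtrahend,
    }
    for label, predicted in candidates.items():
        if given == predicted:
            return label
    return "unclassified"
```

For 32 − 9, the predicted wrong answers are 37, 33, 32, and 41 respectively, so an answer of 41 falls through the borrowing checks and lands on the operation swap. Working from the operands keeps every check a pure function of two integers and the answer, with no parsing of the question string.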

The five skill areas

Each skill has 20 problems covering a difficulty gradient from 0.1 to 0.9:

Skill Code	Domain	Example at 0.1	Example at 0.9
`SUB_BORROW`	Subtraction	11 − 4 = ?	1003 − 567 = ?
`PLACE_VALUE`	Number sense	Which is larger: 3 or 8?	What does the 6 represent in 3,641?
`NUMBER_LINE`	Number sense	What number comes after 3?	What is halfway between 250 and 350?
`NUMBER_SENSE`	Number sense	Which is more: 2 or 5?	Order from smallest: 892, 829, 928, 289
`OPERATION_SIGN`	Arithmetic	2 + 3 = ?	15 − 7 + 3 = ?

The difficulty gradient is not linear. The jump from 0.1 to 0.3 (single-digit to simple two-digit) is smaller than the jump from 0.7 to 0.9 (three-digit with a single borrow to cascading borrows across zeros). This mirrors what the dyscalculia research literature reports: difficulty is not proportional to number size. It's proportional to the number of cognitive steps, particularly steps that require regrouping or holding intermediate results in working memory.
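One rough proxy for "cognitive steps" in subtraction is the number of columns that need a borrow. The sketch below counts them; `count_borrows` is a hypothetical helper for illustration and was not used to assign the hand-calibrated scores.

```python
def count_borrows(minuend: int, subtrahend: int) -> int:
    """Count columns that need a borrow in minuend - subtrahend.

    Assumes minuend >= subtrahend >= 0.
    """
    borrows = 0
    borrow = 0  # 1 if the previous column borrowed from this one
    while minuend > 0 or subtrahend > 0:
        top = minuend % 10 - borrow
        bottom = subtrahend % 10
        if top < bottom:
            borrows += 1
            borrow = 1
        else:
            borrow = 0
        minuend //= 10
        subtrahend //= 10
    return borrows


examples = {
    "350 - 120": count_borrows(350, 120),    # no column needs a borrow
    "32 - 9": count_borrows(32, 9),          # one borrow from the tens
    "100 - 67": count_borrows(100, 67),      # borrow cascades through the zero
    "1003 - 567": count_borrows(1003, 567),  # cascades through two zeros
}
```

The counts come out as 0, 1, 2, and 3, which tracks the ordering of the hand-assigned scores far better than the size of the numbers does.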

Hint design: what I got wrong

My first draft of hints was procedural — they described what to do:

"Borrow from the tens column. Subtract. Write the answer."

This is useless for a student with dyscalculia. The difficulty isn't knowing what borrowing is — it's executing the procedure without losing track of which column they're in. The second draft of every hint follows two rules:

Name the specific obstacle. Not "this is tricky" — rather "2 ones minus 9 is impossible without borrowing."
Walk through the state change. Not "borrow and subtract" — rather "Take a ten from the 3 tens. Now you have 12 ones. 12 − 9 = 3. You have 2 tens left."

The second rule matters because dyscalculic students often lose the intermediate state — they borrow correctly but then forget what changed. The hint reconstructs the full number after regrouping so the student can see where they are.

This pattern held across all five skill areas. Place value hints name the specific column ("the tens digit is the second from the right"). Number line hints name the direction and distance ("7 is to the right of 4 — count 3 steps forward"). Operation sign hints name the symbols and their meaning ("the − sign means subtract — take the second number away from the first").

Difficulty calibration

Difficulty scores are not arbitrary. They follow a rubric I developed after the first round of testing:

Score range	Criteria
0.1 – 0.2	Single-digit or simple two-digit; one cognitive step
0.25 – 0.4	Two-digit; requires one borrowing or comparison step
0.45 – 0.6	Two-digit with borrowing across columns, or three-digit without borrowing
0.65 – 0.8	Three-digit with borrowing; or problems requiring intermediate computation
0.85 – 0.9	Three-digit or larger with cascading borrows (e.g., borrowing from hundreds when tens is 0)

The adaptive engine uses a DIFFICULTY_BAND of 0.15 around the target difficulty when selecting problems. So a student at target difficulty 0.5 sees problems between 0.35 and 0.65. This means each difficulty tier overlaps with its neighbors — a student improving from 0.4 to 0.6 transitions gradually rather than hitting a cliff.
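The overlap between neighboring windows is easy to see with a few lines of code. This is a minimal sketch: `DIFFICULTY_BAND` is the constant named above, while `in_band` and the 0.05 score grid are assumptions for illustration.

```python
DIFFICULTY_BAND = 0.15  # value quoted in the text


def in_band(difficulties, target, band=DIFFICULTY_BAND):
    """Return the difficulties within `band` of `target`, inclusive."""
    eps = 1e-9  # tolerate float rounding at the band edges
    return [d for d in difficulties if abs(d - target) <= band + eps]


# Assumed grid of possible scores: 0.1, 0.15, ..., 0.9.
scores = [round(0.1 + 0.05 * i, 2) for i in range(17)]

at_040 = in_band(scores, 0.4)  # window for a student at 0.4
at_060 = in_band(scores, 0.6)  # window for a student at 0.6
shared = sorted(set(at_040) & set(at_060))  # scores visible from both
```

Here `shared` is `[0.45, 0.5, 0.55]`: a student moving from 0.4 to 0.6 keeps seeing some of the same difficulty levels along the way, which is the gradual transition described above. The small epsilon matters because values like 0.55 − 0.4 are not exactly 0.15 in floating point.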

The seed script

The seed is idempotent — safe to run on every deployment:

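from sqlalchemy.dialects.postgresql import insert as pg_insert

# Problem (the ORM model) and PROBLEMS (skill code -> list of problem dicts)
# are defined elsewhere in the project.

async def seed_problems(session, skill_id_map: dict[str, str]) -> int:
    """Insert every seed problem; return the number of rows actually added."""
    inserted = 0
    for skill_code, problems in PROBLEMS.items():
        skill_id = skill_id_map.get(skill_code)
        if skill_id is None:
            # Skills are seeded first, so a missing ID is a bug, not a skip.
            raise KeyError(f"no skill row for {skill_code!r}")
        for p in problems:
            stmt = (
                pg_insert(Problem)
                .values(
                    skill_id=skill_id,
                    content=p,  # full specification, difficulty included
                    difficulty=p["difficulty"],
                    problem_type=p["type"],
                )
                .on_conflict_do_nothing()
            )
            result = await session.execute(stmt)
            inserted += result.rowcount  # 0 when the row already existed
    return inserted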

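on_conflict_do_nothing() means re-running the seed doesn't duplicate problems, provided the problems table has a unique constraint for the insert to collide with (ON CONFLICT DO NOTHING only suppresses unique and exclusion violations). The difficulty field is stored both inside the JSONB content and as a top-level column on the Problem model: the column is indexed for the adaptive engine's range queries, while the JSONB copy preserves the original specification.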

The full seed runs inside a single transaction: skills first (because problems have a foreign key to skills), then problems, then test accounts. If any step fails, nothing is committed.
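To make the two guarantees concrete without a Postgres instance, here is a minimal sketch that swaps in SQLite's in-memory database from Python's standard library. The `problems` table, its columns, and the unique constraint on `(skill_code, question)` are assumptions for illustration, not NumPath's actual schema. SQLite's `ON CONFLICT ... DO NOTHING` clause needs SQLite 3.24 or newer.

```python
import sqlite3

# Hypothetical table; the unique constraint is what makes the seed idempotent.
SCHEMA = """
CREATE TABLE problems (
    id INTEGER PRIMARY KEY,
    skill_code TEXT NOT NULL,
    question TEXT NOT NULL,
    difficulty REAL NOT NULL,
    UNIQUE (skill_code, question)
)
"""

SEED = [
    ("SUB_BORROW", "11 - 4 = ?", 0.1),
    ("SUB_BORROW", "32 - 9 = ?", 0.25),
]


def seed(conn, rows):
    """Insert rows in one transaction; return how many were newly added."""
    inserted = 0
    with conn:  # commits on success, rolls back if anything raises
        for skill_code, question, difficulty in rows:
            cur = conn.execute(
                "INSERT INTO problems (skill_code, question, difficulty) "
                "VALUES (?, ?, ?) "
                "ON CONFLICT (skill_code, question) DO NOTHING",
                (skill_code, question, difficulty),
            )
            inserted += cur.rowcount  # 0 when the row already existed
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
first = seed(conn, SEED)   # fresh database: both rows are inserted
second = seed(conn, SEED)  # re-run: both inserts hit the constraint
total = conn.execute("SELECT COUNT(*) FROM problems").fetchone()[0]
```

The second run adds nothing and `total` stays at 2. If a later row in a batch violates a constraint, the `with conn:` block rolls back the earlier rows from that same batch, which is the all-or-nothing behavior described above.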

What I'd do differently

Two things:

More problems per skill. Twenty problems with a 0.15 difficulty band means some bands have only 2–3 candidates. When the adaptive engine excludes recently-seen problems, it can run out of fresh options at a specific difficulty level. The fallback chain handles this gracefully (widen the band, then allow repeats), but 30 problems per skill would eliminate most fallback cases.

Machine-assisted hint generation. The hints are the bottleneck — each one took 2–3 minutes to write well. For Phase 2, I plan to generate candidate hints with Claude and then manually review them. The human is still in the loop, but the first draft comes faster.
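The fallback chain mentioned under "More problems per skill" (widen the band, then allow repeats) can be sketched as follows. The problem shape (`id`, `difficulty`), the `max_band` limit, and the strategy labels are assumptions for illustration; the real engine's parameters are not given in this post.

```python
def pick_candidates(problems, target, seen_ids, band=0.15, max_band=0.45):
    """Return (candidates, strategy): in-band, then widened, then repeats."""
    eps = 1e-9  # tolerate float rounding at the band edges

    def within(width, allow_seen):
        return [
            p for p in problems
            if abs(p["difficulty"] - target) <= width + eps
            and (allow_seen or p["id"] not in seen_ids)
        ]

    # 1. Unseen problems inside the normal band.
    fresh = within(band, allow_seen=False)
    if fresh:
        return fresh, "in_band"

    # 2. Widen the band one step at a time, still excluding seen problems.
    width = band * 2
    while width <= max_band + eps:
        fresh = within(width, allow_seen=False)
        if fresh:
            return fresh, "widened"
        width += band

    # 3. Nothing unseen nearby: allow repeats from the normal band.
    return within(band, allow_seen=True), "repeats"
```

With only a handful of problems near a target, the first step empties quickly once recently seen problems are excluded, which is why a larger pool per skill would keep most selections on the "in_band" path.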

Key Takeaways

Generated problems are easy; generated hints are not. An adaptive tutor's value is in the scaffolding, not the arithmetic. Hand-authoring hints that name the specific obstacle and walk through the state change is what makes the system useful for dyscalculia.
Difficulty is not proportional to number size. It's proportional to cognitive steps, particularly regrouping and intermediate state: 350 − 120 (larger numbers, no borrowing) is easier than 100 − 67 (smaller numbers, cascading borrows).
Idempotent seeds inside a transaction are non-negotiable. on_conflict_do_nothing() plus a single transaction means the seed runs safely on every deployment, fresh clone, and CI pipeline.