Bayesian Knowledge Tracing in 37 lines of Python — how NumPath models what a student knows

#numpath #adaptivelearning #bayesian #python

What We Built

NumPath maintains a KCState for every student × Knowledge Component pair. After every attempt, update_bkt() revises the probability that the student has mastered that KC. That probability — p_mastery — is what the adaptive engine reads to pick the next problem and what the teacher dashboard displays as a progress bar.

The entire model is 37 lines. Here it is unabridged.

from dataclasses import dataclass

MASTERY_THRESHOLD = 0.80

@dataclass(frozen=True)
class KCState:
    p_mastery: float
    p_learn: float
    p_guess: float
    p_slip: float
    opportunity_count: int = 0

    @property
    def is_mastered(self) -> bool:
        return self.p_mastery >= MASTERY_THRESHOLD


def update_bkt(state: KCState, is_correct: bool) -> KCState:
    """Standard Bayesian Knowledge Tracing update (Corbett & Anderson, 1995)."""
    p, L, G, S = state.p_mastery, state.p_learn, state.p_guess, state.p_slip

    if is_correct:
        posterior = (p * (1 - S)) / (p * (1 - S) + (1 - p) * G)
    else:
        posterior = (p * S) / (p * S + (1 - p) * (1 - G))

    p_new = posterior + (1 - posterior) * L

    return KCState(
        p_mastery=min(1.0, max(0.0, p_new)),
        p_learn=L,
        p_guess=G,
        p_slip=S,
        opportunity_count=state.opportunity_count + 1,
    )

The Four Parameters

BKT models each KC with four parameters, all probabilities between 0 and 1:

Parameter	Meaning	NumPath default
`p_mastery`	P(student has learned this KC)	0.10 (prior — low, conservative)
`p_learn`	P(learning occurs on this attempt, given not yet learned)	0.30
`p_guess`	P(correct answer given KC not learned)	0.20
`p_slip`	P(incorrect answer given KC is learned)	0.10

These are Phase 1 seed values — not calibrated against real student data yet. The parameter estimation problem (fitting p_learn, p_guess, p_slip per KC from observed attempts) is a Phase 4 task once the RCT produces enough data. For now they are reasonable priors from the BKT literature.
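To get a feel for how the seed values behave, here is a minimal sketch that replays a short attempt sequence. The attempt sequence is made up, the field defaults on KCState are an assumption added for brevity (the real class takes them as required arguments), and update_bkt() is a condensed copy of the listing above (without the final clamp) so the snippet runs on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KCState:
    # Defaults are the seed values from the table (assumption: the real
    # KCState has no defaults for these four fields).
    p_mastery: float = 0.10
    p_learn: float = 0.30
    p_guess: float = 0.20
    p_slip: float = 0.10
    opportunity_count: int = 0


def update_bkt(state: KCState, is_correct: bool) -> KCState:
    # Condensed copy of the update from the listing above.
    p, L, G, S = state.p_mastery, state.p_learn, state.p_guess, state.p_slip
    if is_correct:
        posterior = p * (1 - S) / (p * (1 - S) + (1 - p) * G)
    else:
        posterior = p * S / (p * S + (1 - p) * (1 - G))
    return KCState(posterior + (1 - posterior) * L, L, G, S,
                   state.opportunity_count + 1)


state = KCState()
history = []
for is_correct in [True, True, False, True]:  # hypothetical attempt sequence
    state = update_bkt(state, is_correct)
    history.append(round(state.p_mastery, 3))

print(history)
```

With these seeds, a single correct answer takes p_mastery from 0.10 to about 0.53, and the one incorrect answer in the sequence pulls the estimate back down before the final correct answer raises it again.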

The Update Equations

After observing an answer, two steps happen.

Step 1 — Bayesian update (prior → posterior):

Correct:   posterior = p(1 - S) / [p(1 - S) + (1 - p)G]
Incorrect: posterior = pS       / [pS       + (1 - p)(1 - G)]

This is straight Bayes. A correct answer always raises the posterior (as long as 1 - p_slip > p_guess), but by less the more plausible a guess is. An incorrect answer always lowers it, but by less the more plausible a slip is. A correct answer from a student with p_mastery=0.95 and p_slip=0.10 barely moves the needle, because the model already thinks they know it. A correct answer from a student with p_mastery=0.10 and p_guess=0.20 moves it less than you might expect, because the model discounts lucky guesses.
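The two cases in the paragraph above can be checked numerically. This sketch isolates Step 1 in a helper called bayes_posterior (a name introduced here for illustration, not part of NumPath) and uses the seed p_guess and p_slip from the parameter table.

```python
def bayes_posterior(p: float, guess: float, slip: float, is_correct: bool) -> float:
    """Step 1 only: P(learned | observed answer), before the learning update."""
    if is_correct:
        return p * (1 - slip) / (p * (1 - slip) + (1 - p) * guess)
    return p * slip / (p * slip + (1 - p) * (1 - guess))


GUESS, SLIP = 0.20, 0.10  # seed values from the parameter table

for prior in (0.95, 0.10):
    post = bayes_posterior(prior, GUESS, SLIP, is_correct=True)
    print(f"prior={prior:.2f} -> posterior after a correct answer={post:.3f}")
```

The confident case moves from 0.95 to about 0.988. The low-prior case moves from 0.10 to about 0.333: a real jump, but still far from mastery, because a correct answer from an unlearned student is explained by a guess 20% of the time.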

Step 2 — Learning update (posterior → next prior):

p_new = posterior + (1 - posterior) × p_learn

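Even if the student answered incorrectly, there's a p_learn probability that learning occurred anyway. The posterior is never the final state: the learning update always moves it upward, closing a fraction p_learn of the remaining gap to 1, reflecting that every attempt is an opportunity. Note that "upward" is relative to the posterior, not to the previous p_mastery. After an incorrect answer, p_new can still end up below where the student started.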
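A small sketch of Step 2 on its own, using a hypothetical student at p_mastery=0.50 who answers incorrectly. learning_update is an illustrative helper name, and the parameters are the seed values.

```python
def learning_update(posterior: float, p_learn: float) -> float:
    """Step 2: close a fraction p_learn of the gap between posterior and 1."""
    return posterior + (1 - posterior) * p_learn


P_LEARN, GUESS, SLIP = 0.30, 0.20, 0.10  # seed values
prior = 0.50  # hypothetical starting p_mastery

# Step 1 for an incorrect answer, then Step 2.
posterior = prior * SLIP / (prior * SLIP + (1 - prior) * (1 - GUESS))
p_new = learning_update(posterior, P_LEARN)

print(f"prior={prior:.3f} posterior={posterior:.3f} p_new={p_new:.3f}")

# Even a posterior of exactly 0 is lifted to p_learn.
print(learning_update(0.0, P_LEARN))
```

Here the posterior drops to about 0.111 and the learning update lifts it to about 0.378, which is above the posterior but below the 0.50 the student started from. The last line shows a side effect of the formula: after any update, p_mastery can never be lower than p_learn.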

The Design Decision

We evaluated three approaches before choosing standard BKT:

Item Response Theory (IRT) — models item difficulty as well as student ability. More expressive, but requires calibrated item parameters we don't have. Rejected for Phase 1.

Deep Knowledge Tracing (DKT) — replaces the parametric model with an LSTM that learns latent student state from sequences of attempts. Better at capturing cross-KC transfer. Rejected for Phase 1 because it needs training data we haven't collected yet. It's on the Phase 2 roadmap.

Accuracy streak — raise difficulty after 3 correct in a row, lower after 3 wrong. This is what most commercial apps do. Rejected because it gives you no probability estimate, no per-KC granularity, and no way to distinguish a guesser from a learner.

Standard BKT is 30 years old and still the right choice when you're instrument-building before data collection. It gives you a per-KC probability estimate with interpretable parameters, it's fast to compute, and its failure modes are well understood.

One implementation choice worth noting: KCState is a frozen dataclass. update_bkt() returns a new KCState rather than mutating the existing one. This makes the update function a pure function — easy to test, easy to replay, and safe to call in parallel if we ever need to.
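The following sketch illustrates what the frozen dataclass buys in practice: a log of attempts can be replayed from any starting state with functools.reduce, and in-place mutation is rejected. The attempt log is hypothetical and update_bkt() is again a condensed copy so the snippet is standalone.

```python
from dataclasses import FrozenInstanceError, dataclass
from functools import reduce


@dataclass(frozen=True)
class KCState:
    p_mastery: float
    p_learn: float
    p_guess: float
    p_slip: float
    opportunity_count: int = 0


def update_bkt(state: KCState, is_correct: bool) -> KCState:
    # Condensed copy of the article's update so the snippet runs standalone.
    p, L, G, S = state.p_mastery, state.p_learn, state.p_guess, state.p_slip
    if_learned, if_not = (1 - S, G) if is_correct else (S, 1 - G)
    posterior = p * if_learned / (p * if_learned + (1 - p) * if_not)
    return KCState(posterior + (1 - posterior) * L, L, G, S,
                   state.opportunity_count + 1)


initial = KCState(p_mastery=0.10, p_learn=0.30, p_guess=0.20, p_slip=0.10)
attempts = [True, False, True, True]  # hypothetical attempt log

# Replaying the same log from the same start yields an equal state.
first_run = reduce(update_bkt, attempts, initial)
second_run = reduce(update_bkt, attempts, initial)
print(first_run == second_run)

# The starting state is untouched, and in-place mutation is rejected.
mutation_blocked = False
try:
    initial.p_mastery = 0.99
except FrozenInstanceError:
    mutation_blocked = True
print(mutation_blocked, initial.opportunity_count)
```

Because dataclasses compare by value, a test can assert on the whole returned KCState rather than on individual fields.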

Why It Matters for the Research

The RCT compares learning outcomes for students using NumPath against a control group using static worksheets. To measure a difference, you need a measurement instrument. p_mastery is that instrument.

After a session, the teacher dashboard shows each student's p_mastery per KC as a progress bar. The adaptive engine uses it to pick the next problem. The LLM insight generator reads it to produce explanations like "Aiden's p_mastery on SUB_BORROW is 0.18 — the model has seen 11 attempts and is not converging." All three downstream consumers depend on the same number being meaningful.

BKT's key property for research purposes: it's falsifiable. If a student's p_mastery stays low after 20 attempts, or the student keeps answering correctly more often than their p_mastery predicts, that's a signal worth investigating: either the parameters are wrong, or the student is consistently guessing, or there's a measurement problem. An accuracy percentage doesn't give you that.
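What makes the model checkable is that every p_mastery implies a predicted probability of a correct answer on the next attempt, which can be compared with what the student actually does. A minimal sketch of that prediction, using the standard BKT expression (it is the denominator of the "Correct" branch in Step 1); p_correct is an illustrative name, not a NumPath function:

```python
def p_correct(p_mastery: float, guess: float, slip: float) -> float:
    """Probability BKT assigns to a correct answer on the next attempt."""
    return p_mastery * (1 - slip) + (1 - p_mastery) * guess


GUESS, SLIP = 0.20, 0.10  # seed values

for p in (0.10, 0.50, 0.95):
    print(f"p_mastery={p:.2f} -> predicted P(correct)={p_correct(p, GUESS, SLIP):.3f}")
```

The prediction always lies between p_guess and 1 - p_slip. A student whose observed accuracy sits well outside what these predictions allow over many attempts is evidence against the current parameters.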

What We Learned

The model is simple. Getting the parameters right is not.

p_learn=0.30 means a student who has not yet learned the KC has a 30% chance of learning it on any given attempt. That sounds reasonable. But compounded over 10 attempts, it implies a 1 - 0.7^10 ≈ 97% cumulative chance that the student has learned the KC, which is almost certainly too optimistic. The seed parameters will need calibration.
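The compounding is easy to reproduce. This sketch computes the chance that at least one of n learning opportunities succeeds, looking only at the learning transition and ignoring the evidence from the answers themselves. The function name is introduced here for illustration.

```python
def cumulative_learn_probability(p_learn: float, attempts: int) -> float:
    """P(learned within `attempts` tries), from the learning transition alone."""
    return 1 - (1 - p_learn) ** attempts


for n in (1, 3, 5, 10):
    print(n, round(cumulative_learn_probability(0.30, n), 3))
```

With p_learn=0.30 the curve passes 0.65 after three attempts and 0.83 after five, so most of the implied learning happens in the first handful of opportunities.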

The other thing we learned: opportunity_count is load-bearing. The adaptive engine uses it as a tiebreaker and the teacher dashboard shows it alongside p_mastery. It's not computed from the BKT model — it's just a counter that increments on every update_bkt() call. The frozen dataclass pattern makes this safe: the count in the database is always the count from the last update_bkt() return value, never a stale mutation.

What's Next

Phase 2 adds a DKT model alongside BKT — trained on the data collected during the pilot. The two models will run in parallel so we can compare their predictions against observed outcomes before the RCT begins.

Key Takeaways

BKT separates learning from performance — p_guess and p_slip let the model discount lucky correct answers and unlucky wrong ones; a 70% accuracy rate means something different depending on what the model thinks caused it
p_mastery is the measurement instrument for the RCT — every downstream consumer (adaptive engine, teacher dashboard, LLM insights) reads the same number, so getting it right matters more than getting it fast
Frozen dataclass + pure function = safe update chain — update_bkt() returns a new KCState; there's no shared mutable state, the update is replayable, and the test suite can verify every case in isolation