Oscar Rieken

Posted on Jun 1

From Bayesian to deep knowledge tracing — upgrading NumPath's student model with a PyTorch LSTM

#numpath #adaptivelearning #pytorch #python

Bayesian Knowledge Tracing (BKT) told us how well a student knows subtraction-with-borrowing. It had no idea that a student who reverses digits on subtraction problems probably also reverses them on place value problems, because BKT treats every Knowledge Component (KC) as an island.

Deep Knowledge Tracing (DKT) fixes that. Instead of four independent scalar parameters per KC, it maintains a shared LSTM hidden vector across all KCs and learns the dependencies from data. This is Phase 3 of NumPath: swapping out the Markov model for a neural sequence model.

Here's what we built, the design decision that almost made us reach for a transformer, and the student simulator we had to build first to test it without any real students.

What We Built

Two components that feed each other:

Student simulator — five named personas that generate realistic attempt sequences for testing. Each persona has a per-KC accuracy curve and weighted mistake preferences drawn from the dyscalculia ITS literature:

Persona	SUB_BORROW accuracy	Characteristic errors
`ConfidentLearner`	0.80	Rare, careless (OFF_BY_TEN)
`StrugglingSUB`	0.35	Frequent BORROW_SKIP, slow timing
`PlaceValueGap`	0.60	DIGIT_REVERSAL across skill areas
`FrustrationLoop`	0.30	Fast random guessing
`FastMaster`	0.90	Near-zero mistakes
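
A persona like this can be sketched as a small seeded sampler. This is a minimal illustration, not NumPath's simulator: the `Persona` dataclass, its fields, the `simulate` helper, and the mistake weights are all hypothetical, and it uses a flat accuracy per KC rather than a curve. Only the 0.35 SUB_BORROW accuracy and the mistake labels come from the table above.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Persona:
    """Hypothetical persona: per-KC accuracy plus weighted mistake preferences."""
    name: str
    accuracy: dict                                        # KC name -> P(correct)
    mistake_weights: dict = field(default_factory=dict)   # mistake label -> weight


def simulate(persona, kc, n_attempts, seed=0):
    """Return a list of (kc, is_correct, mistake_or_None) tuples."""
    rng = random.Random(seed)          # seeded so test runs are reproducible
    labels = list(persona.mistake_weights)
    weights = list(persona.mistake_weights.values())
    attempts = []
    for _ in range(n_attempts):
        is_correct = rng.random() < persona.accuracy[kc]
        mistake = None
        if not is_correct and labels:
            # Sample a mistake type in proportion to the persona's preferences.
            mistake = rng.choices(labels, weights=weights, k=1)[0]
        attempts.append((kc, is_correct, mistake))
    return attempts


# Accuracy from the table; the mistake weights are made-up illustration values.
struggling = Persona(
    name="StrugglingSUB",
    accuracy={"SUB_BORROW": 0.35},
    mistake_weights={"BORROW_SKIP": 0.7, "DIGIT_REVERSAL": 0.3},
)
sequence = simulate(struggling, "SUB_BORROW", n_attempts=20)
```

Seeding the generator is the important design choice here: a test that trains on simulated sequences should see the same sequences on every run.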

DKT model — a single-layer LSTM that takes a sequence of (skill, correctness) interactions and predicts P(correct on skill k) at each subsequent step.

model = DKTModel(n_skills=3, hidden_size=64)
state = model.initial_state()

# Student answers SUB_BORROW correctly
state = model.step(state, skill_idx=0, is_correct=True)

# Query mastery on any KC
p_mastery = model.predict(state, skill_idx=0)  # → float in (0, 1)
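
For readers who want to see what could sit behind that interface, here is one way to write it. The class and method names match the snippet above, but the internals are an assumption on my part: a single `nn.LSTM` over the 2 × n_skills one-hot input, a linear output layer, and a sigmoid at prediction time. Treat it as a sketch rather than the NumPath source.

```python
import torch
import torch.nn as nn


class DKTModel(nn.Module):
    """Sketch of a single-layer LSTM knowledge tracer (internals assumed)."""

    def __init__(self, n_skills, hidden_size=64):
        super().__init__()
        self.n_skills = n_skills
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(2 * n_skills, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, n_skills)

    def forward(self, x, state=None):
        # x: (batch, seq_len, 2 * n_skills) -> logits: (batch, seq_len, n_skills)
        hidden, state = self.lstm(x, state)
        return self.out(hidden), state

    def initial_state(self):
        # (h_0, c_0), each shaped (num_layers, batch, hidden_size)
        zeros = torch.zeros(1, 1, self.hidden_size)
        return (zeros, zeros.clone())

    @torch.no_grad()
    def step(self, state, skill_idx, is_correct):
        # Encode one interaction as a one-hot vector and advance the LSTM.
        x = torch.zeros(1, 1, 2 * self.n_skills)
        x[0, 0, skill_idx if is_correct else skill_idx + self.n_skills] = 1.0
        _, state = self.lstm(x, state)
        return state

    @torch.no_grad()
    def predict(self, state, skill_idx):
        hidden = state[0][-1, 0]                 # last layer's hidden vector
        return torch.sigmoid(self.out(hidden))[skill_idx].item()


model = DKTModel(n_skills=3, hidden_size=64)
state = model.initial_state()
state = model.step(state, skill_idx=0, is_correct=True)
p = model.predict(state, skill_idx=0)            # untrained, so not meaningful yet
```

Keeping `step` and `predict` free of gradient tracking means the serving path only ever carries plain tensors around, which makes the state easy to store between requests.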

The Design Decision

Why not stay with BKT?

BKT's four parameters — p_mastery, p_learn, p_guess, p_slip — are per-KC and independent. A student who has DIGIT_REVERSAL on subtraction problems and DIGIT_REVERSAL on place value problems is modelled as having two unrelated problems. BKT cannot learn that these are the same underlying representational gap.
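
To make the independence concrete, here is the textbook BKT update for a single KC. It is the standard formulation, not NumPath's `update_bkt`; the function name `bkt_update` and the parameter values are illustrative only.

```python
def bkt_update(p_mastery, is_correct, p_learn, p_guess, p_slip):
    """Textbook BKT update for a single KC: Bayes step, then learning step."""
    if is_correct:
        evidence = p_mastery * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_mastery) * p_guess)
    else:
        evidence = p_mastery * p_slip
        posterior = evidence / (evidence + (1 - p_mastery) * (1 - p_guess))
    # Chance the student learned the skill on this attempt.
    return posterior + (1 - posterior) * p_learn


# Illustrative parameter values, not fitted ones.
params = dict(p_learn=0.1, p_guess=0.2, p_slip=0.1)
state = {"SUB_BORROW": 0.3, "PLACE_VALUE": 0.3}

# An incorrect subtraction answer touches only the SUB_BORROW estimate.
state["SUB_BORROW"] = bkt_update(state["SUB_BORROW"], False, **params)
```

Nothing in the update reads or writes another KC's estimate, so the PLACE_VALUE value stays at 0.3 no matter what happens on subtraction.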

DKT's hidden state is shared. After the student gets a subtraction problem wrong, for example through a digit reversal, the LSTM updates its hidden vector, and because every KC's prediction is read from that same vector, the place value prediction can shift too. It learns the cross-KC structure from data.

Why not a transformer?

The sequence lengths we're working with are short: 10 to 30 attempts per session. Attention pays off mainly on longer sequences and larger datasets than we have. An LSTM is a better fit here: it handles variable-length sequences natively, trains faster on small datasets, and exposes a per-step hidden state we can inspect.

More importantly: the Piech et al. (2015) DKT paper established LSTMs as the baseline for knowledge tracing. Improving on the baseline is Phase 4 work; Phase 3 is implementing it correctly.

The encoding

The input encoding follows Piech et al. exactly. At each step t, the input is a one-hot vector of size 2 × n_skills:

x[k]             = 1  if skill k was answered CORRECTLY
x[k + n_skills]  = 1  if skill k was answered INCORRECTLY

For three skills (SUB_BORROW=0, PLACE_VALUE=1, NUMBER_LINE=2), a correct subtraction answer encodes as:

[1, 0, 0,  0, 0, 0]
  ↑ correct half    ↑ incorrect half

An incorrect subtraction answer:

[0, 0, 0,  1, 0, 0]

The LSTM sees this 6-dimensional input and updates its hidden state. The output layer projects the hidden state back to 3 dimensions — one P(correct) per KC.
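
The encoding rule is small enough to write out in plain Python. The function name `encode_interaction` is mine, not from the codebase; the skill indices are the ones listed above.

```python
def encode_interaction(skill_idx, is_correct, n_skills):
    """One-hot encode a (skill, correctness) pair into a 2 * n_skills vector."""
    if not 0 <= skill_idx < n_skills:
        raise ValueError(f"skill_idx {skill_idx} out of range for {n_skills} skills")
    x = [0] * (2 * n_skills)
    offset = 0 if is_correct else n_skills   # second half marks incorrect answers
    x[skill_idx + offset] = 1
    return x


SUB_BORROW, PLACE_VALUE, NUMBER_LINE = 0, 1, 2

correct_sub = encode_interaction(SUB_BORROW, True, n_skills=3)     # [1, 0, 0, 0, 0, 0]
incorrect_sub = encode_interaction(SUB_BORROW, False, n_skills=3)  # [0, 0, 0, 1, 0, 0]
```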

The training objective

The model learns to predict the NEXT response from the current history. At step t, given the encoded interaction x_t, the LSTM outputs:

ŷ_t[k] = σ(W × h_t + b)[k]  =  P(student answers skill k correctly at t+1)

The loss at each step uses only the skill that was actually asked next:

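# At step t, the next question has skill_idx q and correctness r.
# logits has shape (1, seq_len, n_skills). BCE is assumed here to be
# torch.nn.BCEWithLogitsLoss(), since pred is a raw logit, not a probability.
target = torch.tensor([float(r)])
pred   = logits[0, t, q].unsqueeze(0)
loss_t = BCE(pred, target)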

Training is one sequence at a time with Adam and gradient clipping. Small dataset — no need for batching yet.
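
Put together, one training step on one sequence might look like the following. This is a sketch under assumptions: it vectorises the per-step loss instead of looping over t, uses `BCEWithLogitsLoss` on raw logits, and picks a small hidden size and learning rate purely to keep the example quick. The helper names `encode` and `train_on_sequence` and the example sequence are made up.

```python
import torch
import torch.nn as nn

N_SKILLS, HIDDEN = 3, 16   # small hidden size chosen only to keep the sketch quick


def encode(seq):
    """Turn [(skill_idx, is_correct), ...] into a (1, T, 2 * N_SKILLS) tensor."""
    x = torch.zeros(1, len(seq), 2 * N_SKILLS)
    for t, (q, r) in enumerate(seq):
        x[0, t, q if r else q + N_SKILLS] = 1.0
    return x


def train_on_sequence(lstm, head, optimizer, seq, max_norm=1.0):
    """One optimisation step on one sequence; returns the mean per-step loss."""
    bce = nn.BCEWithLogitsLoss()
    hidden, _ = lstm(encode(seq[:-1]))          # inputs are steps 0..T-2
    logits = head(hidden)                       # (1, T-1, N_SKILLS)
    next_skill = torch.tensor([q for q, _ in seq[1:]])
    target = torch.tensor([float(r) for _, r in seq[1:]])
    # At every step t, keep only the logit of the skill asked at t+1.
    pred = logits[0, torch.arange(len(seq) - 1), next_skill]
    loss = bce(pred, target)
    optimizer.zero_grad()
    loss.backward()
    params = list(lstm.parameters()) + list(head.parameters())
    torch.nn.utils.clip_grad_norm_(params, max_norm=max_norm)
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
lstm = nn.LSTM(2 * N_SKILLS, HIDDEN, batch_first=True)
head = nn.Linear(HIDDEN, N_SKILLS)
optimizer = torch.optim.Adam(
    list(lstm.parameters()) + list(head.parameters()), lr=0.01
)

# Made-up sequence: a student who mostly answers skill 0 correctly.
seq = [(0, True), (0, True), (1, False), (0, True), (0, True), (0, True)]
losses = [train_on_sequence(lstm, head, optimizer, seq) for _ in range(50)]
```

Repeating the step on a single sequence only shows that the loss goes down; real training would cycle through all simulated sequences.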

Why the simulator came first

We can't train DKT on real data until the pilot delivers ≥150 attempt records. But we can validate the architecture right now with the student simulator.

The final integration test runs both pipelines end to end:

Generate 30 sequences from StrugglingSUB (35% accuracy on SUB_BORROW)
Generate 30 sequences from FastMaster (90% accuracy on SUB_BORROW)
Train a separate DKT model on each persona's sequences (two models in total)
Simulate 6 practice steps with each model
Assert FastMaster's model predicts higher mastery than StrugglingSUB's

# From test_dkt.py
fast_mastery = mastery_after_steps(result_fast.model,
    [True, True, True, True, True, False])    # 5/6 correct

struggling_mastery = mastery_after_steps(result_struggling.model,
    [True, False, False, True, False, False]) # 2/6 correct

assert fast_mastery > struggling_mastery      # ✓ passes

This gives us confidence the model learns the right signal before we hand it real children's data.

Why It Matters for the Research

BKT's independence assumption is a known limitation in the ITS literature. It was acceptable for Phase 1 and 2 because we didn't have cross-KC interaction data. Now that the mistake classifier is generating BORROW_SKIP and DIGIT_REVERSAL events consistently, we have a sequence model that can learn from them.

The specific research claim that DKT enables: a student's error pattern on one KC predicts their likely error pattern on a related KC. If DKT learns this and BKT doesn't, that's measurable evidence that the LSTM captures structure that the Markov model misses — and a direct contribution to the Phase 4 RCT analysis.

The upgrade path is explicit:

Pilot delivers ≥150 attempts
train_dkt(sequences_from_db) on the full dataset
Evaluate against BKT's predictions using held-out sessions
Replace update_bkt in SubmitAttemptUseCase when DKT's per-KC accuracy exceeds BKT's
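
The comparison in the last two steps could be as simple as the sketch below. It assumes each model's held-out predictions are available as (kc, predicted probability, actual outcome) records, scores them at a 0.5 threshold, and requires DKT to win on every KC. Those choices, the function names, and the sample records are all assumptions for illustration; the real gate may well use a different metric.

```python
from collections import defaultdict


def per_kc_accuracy(records, threshold=0.5):
    """records: iterable of (kc, predicted_p_correct, actual_is_correct)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for kc, p, actual in records:
        totals[kc] += 1
        hits[kc] += int((p >= threshold) == actual)
    return {kc: hits[kc] / totals[kc] for kc in totals}


def dkt_wins(bkt_records, dkt_records):
    """True only if DKT is strictly more accurate than BKT on every KC."""
    bkt = per_kc_accuracy(bkt_records)
    dkt = per_kc_accuracy(dkt_records)
    return all(dkt.get(kc, 0.0) > acc for kc, acc in bkt.items())


# Made-up held-out predictions, for illustration only.
bkt_records = [
    ("SUB_BORROW", 0.6, True), ("SUB_BORROW", 0.6, False),
    ("PLACE_VALUE", 0.4, True), ("PLACE_VALUE", 0.4, False),
]
dkt_records = [
    ("SUB_BORROW", 0.7, True), ("SUB_BORROW", 0.3, False),
    ("PLACE_VALUE", 0.8, True), ("PLACE_VALUE", 0.2, False),
]
should_switch = dkt_wins(bkt_records, dkt_records)
```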

The ADR for this transition is on the backlog.

What We Learned

The student simulator is the missing test fixture for ITS research. Standard software testing assumes you can construct any input you need. In adaptive tutoring, your input is a real child's learning trajectory. The simulator bridges that gap — it's not a replacement for real data, but it lets you test that the model responds in the right direction before you commit to an ethical review and a cohort of participants.

BKT and DKT coexist cleanly at the domain layer. KCState stays unchanged. DKTState is a separate dataclass with a different shape. The backend currently uses KCState; swapping in DKTState is an interface change at SubmitAttemptUseCase and GetNextProblemUseCase — two files, no schema migration.

Gradient clipping mattered more than I expected. Early training runs without clip_grad_norm_ diverged on the frustration-loop persona (all-incorrect sequences). Clipping at max_norm=1.0 stabilised training across all five personas.

What's Next

Backend wiring: load the trained DKT model at startup, store hidden state vectors in Redis per student, and swap the two use cases. That's the integration step that puts DKT into the live adaptive loop.
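
Storing the hidden state means turning two float vectors (the LSTM's h and c) into bytes and back. Here is one possible shape for that, using only the standard library; the wire format, the function names, and the key pattern are hypothetical, and a plain dict stands in for Redis.

```python
import struct


def pack_state(h, c):
    """Pack two equal-length float vectors into bytes (float32, little-endian)."""
    if len(h) != len(c):
        raise ValueError("h and c must have the same length")
    # Layout: uint32 vector length, then h values, then c values.
    return struct.pack(f"<I{2 * len(h)}f", len(h), *h, *c)


def unpack_state(blob):
    """Inverse of pack_state: returns (h, c) as lists of floats."""
    (size,) = struct.unpack_from("<I", blob)
    values = struct.unpack_from(f"<{2 * size}f", blob, offset=4)
    return list(values[:size]), list(values[size:])


# A plain dict stands in for Redis here; the key format is hypothetical.
store = {}
store["dkt:state:student-42"] = pack_state([0.5, -0.25], [1.0, 0.0])
h, c = unpack_state(store["dkt:state:student-42"])
```

With hidden_size=64 this comes to 4 + 2 × 64 × 4 = 516 bytes per student, small enough to read and write on every attempt.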

Key Takeaways

DKT's shared LSTM hidden state captures cross-KC dependencies that BKT's independent scalar parameters cannot — a student with DIGIT_REVERSAL on subtraction is more likely to have it on place value, and DKT learns this from data
Build the student simulator before the model: testing an adaptive learning architecture requires synthetic student trajectories, and the simulator lets you validate directional correctness before any ethics review or pilot recruitment
An LSTM fits short sequences (10–30 steps) better than a transformer here: attention pays off mainly on longer sequences, while LSTMs handle variable-length sequences natively and train faster on the small datasets typical of ITS research
