Introduction
I'm a high school student in Japan who spends a lot of time reading about machine learning and LLMs. Lately, I've been using an app called RemNote for studying, and it uses FSRS-6 as its spaced repetition model. At some point I started wondering whether there was room to improve it — not replace it, just nudge it slightly.
I'm writing this as a rough memo and draft, partly for my own records and partly because I'd like to eventually build something like this into an app of my own. I haven't done a thorough literature review, and this is honestly just me thinking out loud, so I might be getting things wrong. If I am, I'd really appreciate hearing about it.
The Core Idea
FSRS-6 is genuinely strong. I don't think it should be thrown out and replaced with something else — if anything, I think it's impressive that it works as well as it does in practice. My starting point wasn't to criticize it, but to ask whether a thin correction layer on top of it could capture a bit more signal.
The intuition was simple: FSRS's core equations might not fully account for things like how long a user took to answer, whether they just failed the same card, how many times they've seen it today, or how overdue the card is relative to its current stability. These felt like plausible candidates for a small post-hoc correction — not a replacement, just a nudge at the end.
The simplest way to express this is:

p_adjusted = sigmoid( logit(p_fsrs) + f(x) )

Here, f(x) is a small logistic regression trained on auxiliary features. The idea is to take FSRS-6's recall probability, move it into a scale where addition makes sense (log-odds), add a small correction, and convert back to a probability. The FSRS-6 model itself is untouched; only the final output is adjusted.
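In code, this correction really is tiny. Here's a minimal sketch of the idea; the weights, bias, and feature values below are made-up placeholders for illustration, not fitted values:

```python
import numpy as np

def logit(p):
    # probability -> log-odds
    return np.log(p / (1.0 - p))

def sigmoid(z):
    # log-odds -> probability
    return 1.0 / (1.0 + np.exp(-z))

def corrected_recall(p_fsrs, x, w, b):
    """Apply a small linear correction f(x) = w.x + b in log-odds
    space on top of FSRS-6's predicted recall probability."""
    return sigmoid(logit(p_fsrs) + x @ w + b)

# hypothetical example: FSRS-6 predicts 0.90, the correction nudges it
p = corrected_recall(0.90,
                     np.array([0.5, 1.0, 0.2, -0.3]),   # feature values
                     np.array([-0.1, -0.4, -0.05, 0.6]), # fake weights
                     0.0)
```

Note that when f(x) = 0, the output is exactly FSRS-6's prediction, which is what makes this a "nudge" rather than a replacement.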
In practice, I used just four features:
- `log_duration_norm`: response time, normalized
- `recent_failure`: whether the card was failed recently
- `same_day_count_norm`: number of reviews on the same day, normalized
- `recency_ratio_norm`: the ratio of interval to stability, normalized
Nothing fancy. The whole point was to keep it lightweight.
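For concreteness, assembling the feature vector for one review could look something like this. The normalization constants here are placeholders I picked for the sketch, not the values from my experiment:

```python
import math

def build_features(duration_ms, failed_recently, same_day_count,
                   elapsed_days, stability):
    """Build the four auxiliary features for a single review.
    All normalization constants are illustrative placeholders."""
    # log-compress response time so outliers don't dominate
    log_duration_norm = math.log1p(duration_ms) / 10.0
    # binary flag for a recent lapse on this card
    recent_failure = 1.0 if failed_recently else 0.0
    # how many times this card was reviewed today
    same_day_count_norm = same_day_count / 5.0
    # how overdue the card is relative to its current stability
    recency_ratio_norm = (elapsed_days / stability) / 2.0
    return [log_duration_norm, recent_failure,
            same_day_count_norm, recency_ratio_norm]
```

The exact scaling matters less than keeping the features roughly on the same order of magnitude, so the logistic regression's coefficients stay comparable.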
A Critical Caveat: Synthetic Data
The most important thing to say upfront is that this experiment used synthetic data, not real user data. The setup was 200 simulated users, roughly 80,000 reviews, over 365 days — and the data itself was generated using FSRS-6's own equations. This means the results are essentially a proof-of-concept, and I can't claim this would hold up with real-world data. I want to be clear about that.
Evaluation was done with 5-fold TimeSeriesSplit, looking at Log Loss, RMSE (bins), and AUC.
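The evaluation loop can be sketched as follows. This is not my actual pipeline: the data below is randomly generated stand-in data, and I'm using one common workaround for the fact that scikit-learn's `LogisticRegression` has no fixed-offset option, namely feeding the FSRS log-odds in as an extra feature and letting the model learn a near-1 coefficient for it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)

# Stand-in data: 4 auxiliary features, a fake FSRS-6 recall
# probability per review, and outcomes sampled from it.
n = 2000
aux = rng.normal(size=(n, 4))
p_fsrs = rng.uniform(0.05, 0.95, size=n)
logit_fsrs = np.log(p_fsrs / (1 - p_fsrs))
y = (rng.uniform(size=n) < p_fsrs).astype(int)

# FSRS log-odds as the first column, auxiliary features after it
X = np.column_stack([logit_fsrs, aux])

losses, aucs = [], []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    p = clf.predict_proba(X[test_idx])[:, 1]
    losses.append(log_loss(y[test_idx], p))
    aucs.append(roc_auc_score(y[test_idx], p))
```

TimeSeriesSplit matters here: each fold trains only on earlier reviews and tests on later ones, which avoids leaking future information into the fit.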
Results
Three models were compared:
| Model | Log Loss | RMSE (bins) | AUC |
|---|---|---|---|
| FSRS-6 | 0.4609 | 0.2105 | 0.2604* |
| FSRS-6-recency | 0.3290 | 0.1206 | 0.5671 |
| FSRS-6 + Residual Calibration | 0.3158 | 0.2732 | 0.7826 |
*Note: FSRS-6's AUC of 0.2604 looks unusually low. Before drawing strong conclusions from the AUC comparisons, I'd want to double-check the evaluation implementation — it's possible there's an issue with label or score direction somewhere. Treat this value with caution.
In this synthetic setting, the residual calibration model outperformed both baselines on Log Loss and AUC. But the story isn't entirely positive.
RMSE (bins) got worse — 0.2732 compared to FSRS-6-recency's 0.1206. This suggests that while the model got better at ranking predictions (AUC) and at average loss (Log Loss), the calibration of the probabilities themselves deteriorated. The model appears to be somewhat overconfident in the 0.4–0.7 range. So "better" depends heavily on what you're optimizing for.
Looking at the calibration plot, there are visible gaps in certain probability ranges, so I'd be cautious about using this model's outputs directly as trustworthy probabilities. The ranking might be improved, but whether the probabilities themselves are reliable is a separate question.
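For reference, here is how I think of the RMSE (bins) metric: bucket the predictions by predicted probability, then compare each bucket's mean prediction against its empirical recall rate. A minimal numpy sketch, with the caveat that the bin count is my choice and the bins here are unweighted (benchmarks such as FSRS's typically weight bins by review count):

```python
import numpy as np

def rmse_bins(y_true, p_pred, n_bins=10):
    """Unweighted binned calibration RMSE: for each probability bin,
    compare mean predicted probability with empirical accuracy."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p_pred, edges) - 1, 0, n_bins - 1)
    errs = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            errs.append(p_pred[mask].mean() - y_true[mask].mean())
    return float(np.sqrt(np.mean(np.square(errs))))
```

A model can improve AUC (ranking) while this number gets worse, which is exactly the pattern in the table above.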
What's Actually Doing the Work?
The most interesting part for me wasn't that the full model improved things — it's which features seemed to matter.
Looking at the coefficients, recency_ratio_norm had the largest weight by far, followed by log_duration_norm. The same-day count and recent failure features contributed, but weren't the main drivers.
The ablation results made this even clearer: recency_ratio_norm alone achieved a Log Loss of 0.2977 and AUC of 0.7814 — actually better than the full four-feature model on Log Loss. That suggests that, at least in this experiment, most of the improvement comes from a single correction: how overdue a card is relative to its current stability.
This makes intuitive sense to me. FSRS already models forgetting quite carefully. But knowing where a card sits relative to its stability as an auxiliary correction might still carry useful signal — even on top of FSRS's own calculations.
I wouldn't read too much into the specific coefficient values at this stage, though. Before making theoretical claims about what the coefficients mean, I'd want to carefully re-examine how the features were defined and which direction the evaluation is going.
Where This Leaves Me
To summarize my current thinking:
I don't want to replace FSRS-6. Its structure is strong, and the interesting direction is adding a thin correction on top, not starting over.
As a proof-of-concept, this is encouraging. Log Loss improved in synthetic conditions, and the feature importance had a plausible structure. But calibration and evaluation quality need more attention. RMSE (bins) got worse, and the AUC baseline looks suspicious. I'd call this a design sketch, not a validated result.
The most promising path seems to be a minimal recency-based correction — specifically recency_ratio_norm — rather than the full four-feature model. If I were building this into an app, I'd start there.
My interest here is less "publish a new model" and more "figure out what feels right when building my own SRS app." In that sense, this experiment gave me a useful first intuition: a recency-based residual correction on top of FSRS-6 might actually be worth pursuing.
If you think I am missing something, or if you have thoughts on this direction, I would be very glad to hear from you. You can find me on X here:
X




