How I picked an SRS algorithm for TubeVocab without becoming an Anki nerd

Most "smart" vocabulary apps shove a generic spaced repetition curve at every word and call it a day. They take SM-2, the algorithm Anki has used since the 90s, plug it into a flashcard table, and assume cat, ostensibly, and photosynthesis should all behave the same way in your brain. They don't.

I needed a scheduler for TubeVocab that respected one obvious fact: a B1 learner forgets a C1 word in about a third of the time it takes them to forget an A1 word. If your scheduler doesn't know that, you waste reviews on words the user already owns and bury them under reviews of words they're not ready for. After three rewrites I landed on a hybrid I'm actually happy with. Here's the path.

Why SM-2 is the wrong default for vocabulary

SM-2 assumes domain-uniform difficulty. Every new card starts with the same interval (1 day) and the same ease factor (2.5). That works for trivia decks where every fact is, on average, equally weird. It does not work for vocabulary where item difficulty has a known prior — the CEFR band.

Concretely: if I show a B1 learner the word "house" and they get it right, SM-2 schedules the next review in 1 day. That's absurd. They've known "house" since A1. The "correct" interval is closer to 7 days because the forgetting curve for high-frequency A1 vocabulary is much flatter. Burning a review slot on "house" tomorrow is a tax on every C1 word in the same deck.

# Naive SM-2 — every new card gets the same initial schedule
def sm2_initial(card):
    return {
        "interval_days": 1,
        "ease": 2.5,
        "repetitions": 0,
    }

When I ran this on my first 200 alpha users, the median number of reviews per "mastered" A1 card was 6.2. For C1 it was 4.8. That's backwards: the algorithm was over-drilling easy words because they took far longer than they should have to graduate out of daily review.

What I tried: SM-2, FSRS-4, custom Leitner, modulated SM-2

Four candidates, evaluated over two weeks on the same pool of 40 beta users (each user was randomly assigned one scheduler; deck content was held constant):

  1. Vanilla SM-2 — baseline. Easy to implement. Wrong priors as above.
  2. FSRS-4 — the modern Anki default. Models each card with three memory variables (stability, difficulty, retrievability), with weights fit on review logs. Genuinely better than SM-2 in the long run, but you need ~1,000 reviews per user before the per-user fit converges. My users churn before that.
  3. Custom 5-box Leitner — boxes graduate at 1/3/7/14/30 days (sketched right after this list). Dead simple. No ease. Surprisingly competitive on short timescales but degrades because there's no individual signal — a confident "got it" and a barely-recalled "uh, sure" promote you the same way.
  4. CEFR-modulated SM-2 — vanilla SM-2 with the initial interval and ease seeded by the CEFR band of the word. This is what shipped.
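
A minimal sketch of candidate 3, assuming the 1/3/7/14/30-day graduation above; the names (LEITNER_INTERVALS, leitner_review) are mine for illustration, not code I shipped:

from datetime import date, timedelta

# 5-box Leitner: each box maps to a fixed review interval in days
LEITNER_INTERVALS = [1, 3, 7, 14, 30]

def leitner_review(box: int, correct: bool) -> dict:
    # A hit promotes the card one box; a miss drops it back to box 0.
    # The flaw from item 3 lives here: a confident recall and a
    # barely-made recall promote identically, with no ease signal.
    new_box = min(box + 1, len(LEITNER_INTERVALS) - 1) if correct else 0
    return {
        "box": new_box,
        "due_at": date.today() + timedelta(days=LEITNER_INTERVALS[new_box]),
    }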

FSRS is the right answer once I have a year of review history. Today I have weeks. The modulation hack is what bridges the gap.

The winner: CEFR-modulated SM-2

Punchline: the CEFR band sets the prior. SM-2 takes over after the first successful recall. The modulation only touches the initial schedule, which is exactly the part SM-2 gets wrong.

Mapping I converged on, after eyeballing my 14-day retention curves split by band:

from datetime import date, timedelta

# Initial interval (days) and ease factor seeded by CEFR band
CEFR_PRIOR = {
    "A1": {"interval_days": 4, "ease": 2.7},
    "A2": {"interval_days": 3, "ease": 2.6},
    "B1": {"interval_days": 2, "ease": 2.5},
    "B2": {"interval_days": 2, "ease": 2.4},
    "C1": {"interval_days": 1, "ease": 2.3},
    "C2": {"interval_days": 1, "ease": 2.2},
}

def schedule_new_card(card, cefr_band: str):
    prior = CEFR_PRIOR.get(cefr_band, CEFR_PRIOR["B2"])
    return {
        "interval_days": prior["interval_days"],
        "ease": prior["ease"],
        "repetitions": 0,
        "due_at": today() + timedelta(days=prior["interval_days"]),
    }

After the first correct review, the standard SM-2 recurrence kicks in — interval becomes previous_interval * ease, ease gets bumped or penalized by the user's quality rating, and CEFR is no longer consulted. The point isn't to keep tuning by band forever; it's to not start in the wrong place.

The full scheduler with the quality-rating branch:

def next_schedule(card, quality: int):
    # quality: 0=blackout, 3=correct-with-effort, 5=easy
    if quality < 3:
        # lapse — reset interval, keep ease minus penalty
        return {
            "interval_days": 1,
            "ease": max(1.3, card.ease - 0.2),
            "repetitions": 0,
        }

    reps = card.repetitions + 1
    if reps == 1:
        interval = card.interval_days  # honor the CEFR prior
    elif reps == 2:
        interval = round(card.interval_days * 2.2)
    else:
        interval = round(card.interval_days * card.ease)

    # Standard SM-2 ease update: +0.1 at quality 5, 0.0 at 4, -0.14 at 3
    ease = card.ease + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    return {
        "interval_days": interval,
        "ease": max(1.3, ease),
        "repetitions": reps,
    }

The two things worth pointing out: the prior is honored on the first successful repetition (not overwritten), and the ease floor is 1.3 so a single bad day doesn't sentence a card to permanent daily review.
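
To see the hand-off concretely, here's a trace of a C1 card through three quality-4 reviews, using the two functions above plus a throwaway Card holder; the dataclass is illustrative, not TubeVocab's actual model:

from dataclasses import dataclass

@dataclass
class Card:
    interval_days: int
    ease: float
    repetitions: int

state = schedule_new_card(None, "C1")  # C1 prior: 1 day, ease 2.3
card = Card(state["interval_days"], state["ease"], state["repetitions"])

for _ in range(3):
    state = next_schedule(card, quality=4)
    card = Card(state["interval_days"], state["ease"], state["repetitions"])
    print(card.interval_days, round(card.ease, 2))

# Prints 1, 2, 5 days (ease stays 2.3 at quality 4): rep 1 honors the
# CEFR prior, and multiplicative SM-2 growth only starts after that.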

The number: 22% to 38% on 14-day retention

The metric I tracked was 14-day retention on cards introduced during week 1, measured as "user recalled correctly on first review attempt after the 14-day mark." Across the 40 users:

  • Vanilla SM-2 cohort: 22% retention
  • CEFR-modulated SM-2 cohort: 38% retention

Sixteen points of absolute improvement, no change to the cards themselves, no change to the UI. The same words, scheduled with priors that match their actual difficulty.
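
For completeness, the metric itself is mechanical to compute from the review log. A sketch, assuming per-user rows are already fetched and that quality >= 3 counts as "recalled correctly" (the same threshold the scheduler's lapse branch uses):

from datetime import timedelta

def retention_14d(reviews, introduced_at):
    # reviews: (card_id, reviewed_at, quality) tuples for one user
    # introduced_at: {card_id: date the card was introduced in week 1}
    hits = total = 0
    for card_id, intro in introduced_at.items():
        cutoff = intro + timedelta(days=14)
        after = sorted(
            (r[1], r[2]) for r in reviews if r[0] == card_id and r[1] > cutoff
        )
        if not after:
            continue  # card never resurfaced after day 14
        total += 1
        if after[0][1] >= 3:  # first attempt after the mark succeeded
            hits += 1
    return hits / total if total else 0.0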

The other win I didn't expect: review volume per day dropped 19% for the modulated cohort, because A1 words stopped showing up in tomorrow's queue. Users reported the app feeling "less nagging" in week-2 survey replies, which I'd bet is doing some quiet retention work on its own.

The tradeoff: storage per user

Honest cost: every review writes a row. Card state (interval, ease, repetitions, due_at) plus a review log row for FSRS-readiness later. Schema:

CREATE TABLE review_log (
    id          INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL,
    card_id     INTEGER NOT NULL,
    reviewed_at TIMESTAMP NOT NULL,
    quality     SMALLINT NOT NULL,
    prev_interval_days INTEGER,
    new_interval_days  INTEGER,
    prev_ease   REAL,
    new_ease    REAL
);

CREATE INDEX idx_review_log_user_time
    ON review_log(user_id, reviewed_at DESC);
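
The write path that feeds this table is one small transaction: compute the new schedule, append the log row, persist the card. A sketch; db is a stand-in DB-API connection, and card.id assumes the card object carries its row id:

def record_review(db, user_id: int, card, quality: int):
    # Compute the new schedule, then append the FSRS-ready log row.
    new = next_schedule(card, quality)
    db.execute(
        "INSERT INTO review_log"
        " (user_id, card_id, reviewed_at, quality,"
        "  prev_interval_days, new_interval_days, prev_ease, new_ease)"
        " VALUES (?, ?, CURRENT_TIMESTAMP, ?, ?, ?, ?, ?)",
        (user_id, card.id, quality,
         card.interval_days, new["interval_days"], card.ease, new["ease"]),
    )
    # ...then write `new` back onto the card's own row (not shown)
    return new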

An active user reviewing ~80 cards/day generates roughly 29k rows/year, ~3.5 MB raw. Cheap on Postgres, but the index on (user_id, reviewed_at) is what keeps the "today's due cards" query under 20ms at p95. Without it, I was seeing 180ms+ once a few users crossed 50k log rows. Spend the disk, get the latency.
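
The back-of-envelope, with the ~120 bytes/row implied by those numbers (I never measured row size precisely):

rows_per_year = 80 * 365              # ~29,200 review-log rows per user
raw_mb = rows_per_year * 120 / 1e6    # ~3.5 MB before indexes
print(rows_per_year, round(raw_mb, 1))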

I'll need this log anyway when I have enough volume to fit FSRS per user. Building the firehose now means the migration in 6 months is "run a job," not "go collect data we never stored."

v1 vs v4 on the same 40-user cohort

| Metric | v1 (vanilla SM-2) | v4 (CEFR-modulated SM-2) |
| --- | --- | --- |
| 14-day retention (week-1 cards) | 22% | 38% |
| Median reviews to "mastered" (A1 card) | 6.2 | 3.4 |
| Median reviews to "mastered" (C1 card) | 4.8 | 7.1 |
| Reviews/day per active user | 84 | 68 |
| p95 "due today" query latency | 180 ms | 18 ms |

The C1 number going up is the feature, not the bug — hard words deserve more reps. The point is that the algorithm is now spending reps where they pay off.

This scheduler is the one running in production at TubeVocab today, and the review log is quietly accumulating so I can swap in a real FSRS fit once the data warrants it. Until then, a 12-line CEFR prior table is doing most of the work a fancier model would.

Tags: srs, spaced-repetition, esl, sideproject, indie, anki
