
Funlingo


Why Your Flashcard App Is Showing You Words You Already Know — And What To Do About It

If you've used Anki, Duolingo, or any vocabulary app for more than a few months, you've had this experience:
You're 50 cards into a review session. The app shows you "hello" for the tenth time this week. You have known "hello" for five years. You still have to mark it as easy and move on, one of the hundred little friction taxes that add up until you abandon the app entirely.
This isn't a UX problem. It's a scheduling problem. The algorithm deciding which cards to show you is making bad decisions. And once you understand why it's making bad decisions, you can stop blaming yourself for not being "the kind of person who sticks with flashcard apps," because the apps have been lying to you about how learning works.
This post is a decisions-and-tradeoffs walkthrough of what I learned picking a spaced-repetition algorithm for Funlingo — a Chrome extension that saves words users click on while watching Netflix. It's for developers building any app that involves remembering things over time: less about implementation details, more about the reasoning behind the choice.

The algorithm you probably picked, and why it's wrong
If you built a flashcard feature in the last 20 years and searched "spaced repetition algorithm," you were handed SM-2. SuperMemo shipped it in 1987. Anki uses it. Most tutorials teach it. It has been the default answer for almost four decades.
SM-2 is also genuinely bad, and the reason it's bad is instructive.
The algorithm works roughly like this: when you review a card and get it right, the gap until the next review multiplies by a factor — usually around 2.5x. Get it wrong, the gap collapses and you start over. Over time, a card you keep getting right gets shown less and less often, while a card you keep failing shows up constantly.
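That loop is simple enough to sketch in a few lines. This is a hedged sketch of the published SM-2 rule, not any app's actual code; the function and variable names are mine:

```python
# Minimal sketch of SM-2's core scheduling rule (names are mine).
def sm2_next_interval(interval_days: float, ease: float, passed: bool) -> float:
    """Return the gap in days until the next review."""
    if not passed:
        return 1.0               # failure: collapse and start over
    return interval_days * ease  # success: multiply by the ease factor

# A card you keep getting right drifts out fast...
gap = 1.0
for _ in range(4):
    gap = sm2_next_interval(gap, ease=2.5, passed=True)
print(gap)  # 39.0625 -- roughly five weeks after four correct reviews

# ...and a single failure throws all of that away.
print(sm2_next_interval(gap, ease=2.5, passed=False))  # 1.0
```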
At first glance, this is reasonable. At second glance, it has three problems that compound over months of use.
Problem one: there is no model of forgetting. SM-2 doesn't predict how likely you are to remember a card when it's reviewed. It just multiplies numbers. Real memory follows a forgetting curve — after you learn something, your probability of recalling it drops exponentially, and the rate of that drop depends on how well-established the memory is. SM-2 assumes a review that goes well means the next interval should be 2.5x longer, regardless of whether that corresponds to a 50% chance of remembering or a 99% chance. The user can't control for retention because the algorithm has no concept of retention in the first place.
Problem two: the difficulty signal is broken. SM-2 adjusts an "ease factor" based on your ratings, but the adjustment is so aggressive that a single bad rating can drag a card's ease factor to its floor and keep it there forever. This is why every serious Anki user ends up with "leech" cards — vocabulary that got a bad rating once, six months ago, and is now permanently stuck showing up every two days. The algorithm can't distinguish between "this word was hard on one particular day when I was tired" and "this word is fundamentally harder for me than other words." So it punishes the card permanently.
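You can see the ratchet in the numbers. This sketch uses the ease update from the published SM-2 description, where q is the 0-5 quality grade and 1.3 is the floor:

```python
# The published SM-2 ease-factor update; q is the 0-5 quality grade.
def update_ease(ease: float, q: int) -> float:
    ease += 0.1 - (5 - q) * (0.08 + (5 - q) * 0.02)
    return max(ease, 1.3)  # hard floor

# Each "hard" pass (q=3) costs 0.14 ease; a "good" pass (q=4)
# recovers exactly nothing. The card can only ratchet downward.
ease = 2.5
for q in [3, 3, 3, 4, 4, 4]:
    ease = update_ease(ease, q)
print(round(ease, 2))  # 2.08
```

Only an "easy" rating (q=5) adds ease back, and at +0.1 per review it recovers more slowly than hard ratings take it away.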
Problem three: it refuses to learn from data. SM-2 uses the same constants for every user. Your forgetting curve and my forgetting curve are different. The Japanese word for "thank you" is easy for a Korean speaker and hard for an English speaker. SM-2 has no mechanism to adapt to who's using it. It's a 1987 algorithm running in a 2026 world where we have enormous amounts of review data and cheap statistical tools, and it ignores all of it.
The frustrating part about SM-2 isn't that it's old. It's that the entire industry kept using it even after the research community moved on. Most flashcard apps you're using today are still running an algorithm that predates the web browser.

What changed
Between 2022 and 2024, a research group around Jarrett Ye released a family of algorithms called FSRS — Free Spaced Repetition Scheduler. It became Anki's optional algorithm in 2023, then the default recommendation for new implementations, and is now the quiet consensus among people who actually care about this stuff.
The insight is simple: instead of scheduling reviews based on "how well did you do last time, multiplied by an arbitrary factor," FSRS models the forgetting curve explicitly and schedules your next review at the moment your predicted retention drops to a target threshold.
The user sets the threshold. If you want to remember 90% of what you've learned, the algorithm schedules reviews at the point you'd otherwise forget. If you're cramming for an exam and want 95% retention, reviews tighten. If you're doing casual language learning and are fine with 85% retention in exchange for fewer reviews, they loosen.
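As a sketch, here's what "schedule at the threshold" looks like with the FSRS-4-style power forgetting curve R(t, S) = (1 + t/(9S))^-1. Newer FSRS versions adjust the decay exponent, but the shape of the idea is identical:

```python
# Retention-targeted scheduling under the FSRS-4-style power curve.
def predicted_retention(elapsed_days: float, stability: float) -> float:
    return (1 + elapsed_days / (9 * stability)) ** -1

def next_interval(stability: float, target_retention: float) -> float:
    # Solve R(t, S) = target for t: the review lands exactly when
    # predicted recall decays to the threshold the user chose.
    return 9 * stability * (1 / target_retention - 1)

# Same memory, different dials: stricter targets mean tighter reviews.
for target in (0.95, 0.90, 0.85):
    print(target, round(next_interval(10.0, target), 1))
# 0.95 -> ~4.7 days, 0.90 -> 10.0 days, 0.85 -> ~15.9 days
```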
Two things fall out of this design that SM-2 literally cannot do.
The algorithm adapts to you. Every review you complete is a data point — "I predicted you'd remember this with 73% probability, and you actually got it right." The gap between prediction and reality updates the model. Over a few hundred reviews, the parameters converge on your forgetting curve, not a generic one. Your flashcard app gets smarter the longer you use it.
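One way to make "the gap between prediction and reality" concrete: score each review with log loss, which is essentially what the FSRS optimizer minimizes when fitting per-user parameters. The helper name here is mine:

```python
# Log loss per review: the fitting signal behind "every review
# is a data point." Low when the prediction matched reality.
import math

def review_log_loss(predicted: float, recalled: bool) -> float:
    p = predicted if recalled else 1 - predicted
    return -math.log(p)

# A confident correct answer costs little; a confident miss costs a lot,
# which is exactly the pressure that pulls parameters toward your curve.
print(round(review_log_loss(0.95, True), 3))   # 0.051
print(round(review_log_loss(0.95, False), 3))  # 2.996
```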
The user has a meaningful dial. Target retention is a real, understandable parameter. A learner can reason about it: "I'm preparing for a test in three weeks, I want 95% retention" or "I'm doing this for fun, 85% is fine." SM-2 has no equivalent. The user's only knob is "do I click Hard or Good," which is confusing, unstable, and punishes them for being honest about a card being difficult.

The part most algorithm posts skip: what it does to the app
If all I've told you is "use FSRS instead of SM-2," you'd be fine implementing it and shipping. But the interesting product decisions are downstream of the algorithm choice, and most articles don't cover them.
Here's what changes when you switch.
The four review buttons start meaning something
In an SM-2 app, the user sees four buttons — Again, Hard, Good, Easy — and has no idea what they do. They click "Good" by reflex. Sometimes they click "Easy" on a card they found easy and the interval explodes so far that they've forgotten the word by the time it comes back. (The mirror-image failure, where repeated Hard ratings drag a card's ease down until it shows up constantly, is what Anki users call "ease hell.")
FSRS exposes predicted next-review dates for each button. A well-designed FSRS app shows the user: "Again (10 min), Hard (2 days), Good (18 days), Easy (2 months)." Now the rating means something. The user can reason about the decision. Mis-clicks on Easy drop dramatically.
This is a small UI change that depends entirely on the algorithm underneath. SM-2 can compute these numbers too, but they're so arbitrary (just multiplications of the current interval) that exposing them makes the mechanism feel fake. FSRS's numbers feel like real predictions because they are.
The "new card" state becomes a separate surface
FSRS distinguishes between four card states: new (never seen), learning (seen once or twice, still unstable), review (stabilized in memory), and relearning (failed a review, rebuilding). These aren't just internal states — they correspond to genuinely different experiences for the user.
Most flashcard apps ignore this and pile all four into a single review queue, which feels chaotic. Your brain has to context-switch between "here's a word I've never seen before, let me study it" and "here's a word I'm recalling from three weeks ago, let me test myself." Different mental operations.
The FSRS-friendly UI separates them: a "new words from this week's shows" section and a "words due for review" section, visually distinct. Users model them differently, which matches how the algorithm treats them internally, which produces a session that feels calm instead of frantic.
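A sketch of those states and the transitions between them. The state names are FSRS's; the exact transition rules vary by implementation, so treat this table as illustrative:

```python
# The four FSRS card states, plus an illustrative transition rule.
from enum import Enum

class CardState(Enum):
    NEW = "new"                # never seen
    LEARNING = "learning"      # seen once or twice, still unstable
    REVIEW = "review"          # stabilized in memory
    RELEARNING = "relearning"  # failed a review, rebuilding

def next_state(state: CardState, passed: bool) -> CardState:
    if state in (CardState.NEW, CardState.LEARNING):
        return CardState.REVIEW if passed else CardState.LEARNING
    # review and relearning cards fall back on failure
    return CardState.REVIEW if passed else CardState.RELEARNING
```

Splitting the queue into two surfaces then becomes a cheap filter on `state`, which is part of why the two-section UI is easy once the algorithm tracks states explicitly.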
Duplicate clicks stop breaking the model
This one is specific to any app where your data source is "user interacted with a word," as opposed to a deliberate review session.
Every day, users click the same word twice within 30 seconds. They forget the meaning, click, look at the popup, click away, then click it again because they forgot what they just read. This is normal human behavior. It also destroys a spaced-repetition model, because the algorithm thinks you just reviewed the same card twice in rapid succession and inflates its stability estimate. Next time you see the card is six weeks out. You've absolutely forgotten it by then.
The fix is a debounce rule at the interaction layer: any click within a minute of the last click on the same word shows the definition but doesn't count as a review. This is the kind of rule that looks like a hack but actually exists because the algorithm assumes stationary, spaced conditions and real user behavior doesn't have that. A lot of production ML and stats work looks like this — the math assumes an idealized input stream, and half the engineering is about shaping real data to match the assumption.
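A minimal sketch of that rule. The 60-second window and all names here are my choices, not from any particular codebase:

```python
# Debounce at the interaction layer: a repeat click within the window
# still shows the definition, but is not logged as a review.
DEBOUNCE_SECONDS = 60

last_review_at: dict[str, float] = {}

def should_count_as_review(word: str, now: float) -> bool:
    prev = last_review_at.get(word)
    if prev is not None and now - prev < DEBOUNCE_SECONDS:
        return False  # show the popup, skip the scheduler
    last_review_at[word] = now
    return True

print(should_count_as_review("ubiquitous", now=0.0))    # True
print(should_count_as_review("ubiquitous", now=25.0))   # False: within 60s
print(should_count_as_review("ubiquitous", now=100.0))  # True again
```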
You have to store history
SM-2 is stateless per card — you only need the current "ease factor" and "interval" to compute the next review. FSRS needs the full review history to train per-user parameters.
This is a real data model change. If your database currently stores one row per card with the current state, you need to add a review-log table with a row per review. For a small app this is trivial. For a large app with years of existing user data, you'll either need to backfill placeholder logs or run the two algorithms in parallel during a migration period.
I mention this because it's the kind of thing that only shows up when you try to implement FSRS in an app that was originally built on SM-2. Greenfield projects don't hit it. Migrations do.
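For a concrete picture, here's the shape of that data-model change, assuming SQLite. The table and column names are illustrative, not from any real schema:

```python
# Per-card current state (enough for SM-2) plus an append-only
# review log (what FSRS needs to train per-user parameters).
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE cards (
        id         INTEGER PRIMARY KEY,
        word       TEXT NOT NULL,
        stability  REAL,   -- current FSRS memory state
        difficulty REAL
    );
    CREATE TABLE review_log (  -- new table: one row per review, never updated
        id           INTEGER PRIMARY KEY,
        card_id      INTEGER NOT NULL REFERENCES cards(id),
        reviewed_at  TEXT NOT NULL,
        rating       INTEGER NOT NULL,  -- 1=Again .. 4=Easy
        elapsed_days REAL NOT NULL      -- gap since the previous review
    );
""")
db.execute("INSERT INTO cards (word, stability, difficulty) VALUES (?, ?, ?)",
           ("ubiquitous", 3.2, 5.0))
db.execute("INSERT INTO review_log (card_id, reviewed_at, rating, elapsed_days) "
           "VALUES (1, '2026-01-15', 3, 4.0)")
print(db.execute("SELECT COUNT(*) FROM review_log").fetchone()[0])  # 1
```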

The decisions that aren't obvious from the outside
A few calls that went into Funlingo's implementation that I don't see discussed publicly and that you'll have to make if you ship this in your own app.
Target retention default. The reference implementation suggests 90%. This is fine for serious learners and punishing for casual ones. A casual Netflix language learner will do 30–50 reviews a day at 90% retention, which is enough to feel like a chore. At 85%, they'll do closer to 15–25 a day and stay engaged. At 80%, retention feels noticeably worse but review volume becomes negligible. I ended up defaulting to 85% for casual users and exposing the dial in settings for power users. The "right" default is the one that matches your user's commitment level, not the algorithm's mathematical optimum.
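To first order, steady-state review volume scales with one over the interval length, which is why this dial moves load so much. A back-of-the-envelope sketch using the same FSRS-4-style curve; it ignores lapses and new cards, so treat it as directional, not exact:

```python
# Relative review load versus a 90%-retention baseline, under the
# FSRS-4-style interval formula interval(S, r) = 9 * S * (1/r - 1).
def relative_review_load(target: float, baseline: float = 0.90) -> float:
    def interval(r: float) -> float:
        return 9 * (1 / r - 1)  # stability cancels out of the ratio
    return interval(baseline) / interval(target)

for r in (0.95, 0.90, 0.85, 0.80):
    print(r, round(relative_review_load(r), 2))
# roughly: 0.95 -> 2.11, 0.90 -> 1.0, 0.85 -> 0.63, 0.80 -> 0.44
```

Dropping from 90% to 85% cuts review volume by roughly a third in this simplified model, which lines up with the difference in daily review counts I saw in practice.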
How honest to be about the forgetting curve. FSRS can predict, for any given card, "you have a 64% chance of remembering this right now." That's genuinely useful information. It's also genuinely demoralizing to expose. If a user sees "your retention on Spanish vocabulary is 71%," they feel like they're failing at Spanish. Better: show them a chart of their retention over time, with a target line, so the number feels like a goal rather than a grade. The algorithm gives you the data; the product question is how much to surface.
When to retrain per-user parameters. The reference answer is "after ~1,000 review logs, the user has enough data to warrant optimized parameters." In practice, this is a lot of reviews — most casual users never hit it. So you're running defaults forever for 80% of your user base, which is fine, because defaults are good. The user for whom per-user parameters matter is the power user who's done thousands of reviews, and they're also the user most likely to notice the improvement, so the upgrade feels real when it lands. I'd recommend shipping FSRS with defaults for everyone, then retraining in the background for users who cross the threshold and quietly swapping their parameters. Don't ask them. Just do it better.
How to handle cards you never want to retire. SM-2 can't express "this card is important enough that I want to review it at least every N days regardless of how well I know it." FSRS can — it's a ceiling on the interval. Useful for, say, grammar rules you want to keep sharp even if you've mastered them. Nobody asks for this feature, but power users discover it and become evangelists.

When FSRS is overkill
I should be honest about this because Dev.to readers will call it out otherwise.
FSRS is not the right choice for every flashcard context.
Short-term use cases don't benefit. If you're building a trivia quiz that gets used for two weeks before an exam, SM-2 or even rule-based scheduling is fine. FSRS's advantages compound over months and years. In the first two weeks, the difference is marginal.
Massed practice defeats every scheduler. If your users are cramming — doing 500 reviews in a single night — no algorithm helps them remember long-term. The whole premise of spaced repetition is that reviews are spread out. If your product encourages cramming (because of gamification, streaks, or social pressure), you have a product design problem, not an algorithm problem, and switching to FSRS won't fix it.
Small user bases don't generate enough data for per-user optimization. If your product has 100 active learners, per-user FSRS parameters won't outperform the defaults. Population-level FSRS with defaults still beats SM-2, so this isn't an argument against switching — just an argument against overinvesting in optimization before you have the users to benefit from it.
The user experience assumes deliberate review. If your product surfaces words opportunistically — "here's a word you saved last week, while you're on the homepage" — you're inventing a third mode that isn't really "scheduled review" or "free practice," and FSRS's scheduling logic doesn't cleanly apply. You'll need to decide whether opportunistic exposures count as reviews or not. (My answer, in Funlingo: they don't. They're free exposure that may or may not strengthen the memory, but I don't want to pollute the model with events the user didn't consciously opt into.)

The part I care about most
The technical question — "which spaced repetition algorithm should I pick?" — is interesting but not the most important question.
The important question is: when a language learner saves a word from a Netflix show, what are the chances that word ever turns into durable long-term memory?
With SM-2, the answer is roughly 60–70% for the words that get reviewed — but most saved words never get reviewed, because the review pile becomes overwhelming and users abandon the app. With FSRS plus thoughtful UX (separating new cards from reviews, exposing predicted intervals, debouncing accidental reviews), both numbers improve. Review pile size stays manageable. Retention per reviewed card goes up. Users come back. Words stick.
I don't have a rigorous study to share — my user base is too small to publish statistically significant numbers — but the qualitative shift is unambiguous. People who bounced off Anki use Funlingo's review system. That was never true of the SM-2 version I shipped first.
Algorithms aren't just about efficiency. They're about whether your app produces the outcome it claims to produce. A flashcard app that schedules badly is, in a real sense, lying to its users — promising them learning while delivering rote repetition that doesn't stick. The industry standard was lying for decades, politely, without knowing it. The current standard is better. If you're building anything with a "remember this over time" surface, the algorithm you pick shapes whether your users actually learn anything.
Pick the better one.

Discussion
I'd genuinely like to hear from people who've shipped spaced repetition in production:
Has FSRS become your default, or are you still running SM-2? I'm curious whether the migration cost is what's keeping most apps on the old algorithm, or whether it's just inertia. Or whether there's a case for SM-2 I'm missing.
What's the biggest UX mistake you see flashcard apps making around retention? Every implementation I've seen has different opinions on how much to surface the target-retention parameter. Some hide it entirely, some put it in advanced settings. What's worked?
Has anyone built a good "review budget" feature? Meaning: "I have 20 minutes today, show me the cards that will give me the highest retention gain in that time, not the cards that are technically due." This feels like the obvious next evolution. I haven't seen it shipped anywhere.
Drop your thoughts below. I reply to everyone.

Further reading
Anki's FSRS documentation — the clearest non-academic explanation of what the algorithm does, written for end users but worth reading for builders
The Open Spaced Repetition project — umbrella for community implementations across languages and frameworks; good starting point if you're evaluating options
Jarrett Ye's original FSRS paper — the underlying math, if that's your thing
The SuperMemo research archive — the papers that laid the foundation for all modern spaced repetition, starting with SM-0 in 1985

Funlingo is a free Chrome extension that saves words from Netflix, YouTube, and Amazon Prime and schedules reviews using FSRS. This post is Part 2 of a series on building it. Part 1 covered the subtitle injection technique.
