A lot of "learn English with YouTube" tools just dump every word from the captions into your face and call it a vocabulary list. The result is 80% noise — pronouns, articles, contractions, proper nouns, and the same 200 high-frequency words repeated until your brain melts.
When I was building TubeVocab, the hardest engineering problem wasn't scraping subtitles or shipping the React UI. It was the linguistic plumbing between raw caption text and a card a B1 learner would actually benefit from studying. That plumbing is a 4-stage NLP pipeline I tuned over 14 days and ~3,000 manual quality reviews. Here it is end-to-end, with the spaCy snippets that actually run in prod.
Stage 1: Lemmatization (one card per lemma, not per inflection)
If your subtitle says "running, ran, runs, has run" in one video, learners don't need 4 cards. They need one card for run with all forms surfaced as examples.
spaCy's en_core_web_sm lemmatizer does ~95% of this for free. The catch: I disable everything I don't need so the pipeline runs at ~12k tokens/sec on a single CPU.
```python
import spacy

# Lemmatization needs the tagger + attribute_ruler, so only the components
# this pipeline never uses get disabled. That keeps it at ~12k tokens/sec.
nlp = spacy.load(
    "en_core_web_sm",
    disable=["ner", "parser", "textcat"],
)

def extract_lemmas(text: str) -> list[tuple[str, str]]:
    doc = nlp(text)
    return [
        (tok.lemma_.lower(), tok.pos_)
        for tok in doc
        if tok.is_alpha and not tok.is_stop
    ]
```
The is_stop filter alone removes ~40% of tokens (the, a, and, is, etc.), which cascades into massive savings downstream.
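To make that concrete, here's roughly what the function returns on a typical caption line (the exact lemmas and tags depend on the model version, so treat the output as illustrative):

```python
caption = "He has been running this insane challenge for 50 hours straight"
print(extract_lemmas(caption))
# roughly: [('run', 'VERB'), ('insane', 'ADJ'), ('challenge', 'NOUN'),
#           ('hour', 'NOUN'), ('straight', 'ADV')]
```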
Stage 2: POS-tag filtering (kill the proper nouns and the junk)
After lemmatization I have things like ("netflix", "PROPN"), ("ok", "INTJ"), ("uh", "INTJ"). None of these belong on a flashcard.
I keep only NOUN, VERB, ADJ, ADV and explicitly drop PROPN, INTJ, NUM, and anything tagged X (unknown).
```python
KEEP_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def filter_by_pos(lemmas: list[tuple[str, str]]) -> list[str]:
    return [lem for lem, pos in lemmas if pos in KEEP_POS]
```
Sounds trivial. The first version of TubeVocab didn't do this and ~18% of generated cards were words like "MrBeast", "TikTok", or "umm". Conversion to paid tanked because the first 5 cards a free user saw made the product look broken.
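For illustration, here's the filter applied to exactly that kind of junk (made-up tuples in the same shape Stage 1 emits):

```python
lemmas = [
    ("mrbeast", "PROPN"),
    ("umm", "INTJ"),
    ("insane", "ADJ"),
    ("challenge", "NOUN"),
    ("win", "VERB"),
]
print(filter_by_pos(lemmas))  # ['insane', 'challenge', 'win']
```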
Stage 3: CEFR difficulty classification (the part that took 14 days)
Every card needs a CEFR band — A1 through C2. I tried 3 approaches:
- A pretrained CEFR classifier from HuggingFace — slow (~120ms/word), 25% disagreement with native-speaker spot checks.
- A custom fine-tuned BERT — 91% agreement but +800MB Docker image and 4s cold start. Not worth it.
- A frequency-band lookup with hand-tuned overrides — this won.
I merged EFLLex (CEFR-aligned word lists) with SUBTLEX-US (film/TV word frequencies) and added ~600 manual overrides:
```python
CEFR_BAND = load_static_cefr()  # {"run": "A1", "ostensibly": "C1", ...}

def classify(lemma: str, default: str = "B2") -> str:
    return CEFR_BAND.get(lemma, default)
```
B2 is the default for unknown words because it's the median band for educational YouTube content. The lookup now runs at ~0.4ms/word with 89% agreement against my manual reviews.
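load_static_cefr() isn't shown above. Here's a minimal sketch of how that lookup could be assembled, assuming a TSV export of the merged EFLLex data with lemma/cefr columns and a CSV of the manual overrides; the file names and column names are illustrative, and the SUBTLEX-US frequency fallback is left out for brevity:

```python
import csv

def load_static_cefr(
    lexicon_path: str = "efllex_merged.tsv",  # illustrative path
    overrides_path: str = "overrides.csv",    # ~600 hand-tuned corrections
) -> dict[str, str]:
    """Build the lemma -> CEFR-band lookup used by classify()."""
    bands: dict[str, str] = {}
    with open(lexicon_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            lemma, band = row["lemma"].lower(), row["cefr"]
            # keep the easiest attested band; "A1" < "B2" < "C2" sorts correctly
            if lemma not in bands or band < bands[lemma]:
                bands[lemma] = band
    with open(overrides_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            bands[row["lemma"].lower()] = row["cefr"]  # overrides always win
    return bands
```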
Stage 4: Dedupe-by-context (the secret sauce)
A learner doesn't need 12 cards for run even if it appears in 12 videos. They need one card with the best example sentence.
For each lemma I score every candidate sentence on three things: length (10–20 tokens is the sweet spot), how many of its words sit above the target CEFR band, and a small TextRank clarity score:
```python
def best_context(lemma: str, candidates: list[Sentence]) -> Sentence:
    return max(
        candidates,
        key=lambda s: (
            -abs(len(s.tokens) - 15)         # prefer sentences near 15 tokens
            - 2 * count_above_band(s, "B2")  # penalize words above the target band
            + textrank_score(s)              # reward central, "clear" sentences
        ),
    )
```
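The two helpers aren't shown above. Here's a sketch of the behaviour they assume, with a hypothetical Sentence that carries its lemmatized tokens and a TextRank centrality score precomputed once per video (the real TubeVocab internals may differ):

```python
from dataclasses import dataclass

CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

@dataclass
class Sentence:
    text: str
    tokens: list[str]   # lemmatized tokens from Stage 1
    centrality: float   # TextRank score, precomputed once per video

def count_above_band(s: Sentence, band: str) -> int:
    """How many words in the sentence sit above the target CEFR band."""
    cutoff = CEFR_ORDER.index(band)
    return sum(
        1 for tok in s.tokens
        if CEFR_ORDER.index(classify(tok)) > cutoff  # classify() from Stage 3
    )

def textrank_score(s: Sentence) -> float:
    """Sentence centrality within its video, from TextRank over all captions."""
    return s.centrality
```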
This single change moved 14-day retention from 18% to 31%.
Before vs. after on one real 12-min MrBeast video
| Metric | v1 (lemma + freq) | v4 (full pipeline) |
|---|---|---|
| Tokens after lemmatization | 1,847 | 1,847 |
| Cards after POS filter | 1,847 | 612 |
| Cards after CEFR-band trim (B1–B2 target) | 612 | 184 |
| Cards after context-dedupe | 184 | 71 |
| User-reported "useful" rate (n=40) | 22% | 78% |
If you want to see the output without reading spaCy, TubeVocab is the side project these 4 stages live inside: paste a YouTube URL, get back ~50–100 CEFR-tagged cards with clickable timestamps that jump to the exact second the word was spoken.