A lot of "learn English with YouTube" tools just dump every word from the captions into your face and call it a vocabulary list. The result is 80% noise — pronouns, articles, contractions, proper nouns, and the same 200 high-frequency words repeated until your brain melts.
When I was building TubeVocab, the hardest engineering problem wasn't scraping subtitles or shipping the React UI. It was the linguistic plumbing between raw caption text and a card a B1 learner would actually benefit from studying. That plumbing is a 4-stage NLP pipeline I tuned over 14 days and ~3,000 manual quality reviews. Here it is end-to-end, with the spaCy snippets that actually run in prod.
Stage 1: Lemmatization (one card per lemma, not per inflection)
If your subtitle says "running, ran, runs, has run" in one video, learners don't need 4 cards. They need one card for run with all forms surfaced as examples.
spaCy's en_core_web_sm lemmatizer does ~95% of this for free. The catch: I disable everything I don't need so the pipeline runs at ~12k tokens/sec on a single CPU.
```python
import spacy

# Lemmatization needs the tagger + attribute_ruler, so only the components
# this pipeline never uses get disabled. That keeps it at ~12k tokens/sec.
nlp = spacy.load(
    "en_core_web_sm",
    disable=["ner", "parser", "textcat"],
)

def extract_lemmas(text: str) -> list[tuple[str, str]]:
    doc = nlp(text)
    return [
        (tok.lemma_.lower(), tok.pos_)
        for tok in doc
        if tok.is_alpha and not tok.is_stop
    ]
```
The is_stop filter alone removes ~40% of tokens (the, a, and, is, etc.), which cascades into massive savings downstream.
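To make that concrete, here's roughly what the function returns on a typical caption line (the exact lemmas and tags depend on the model version, so treat the output as illustrative):

```python
caption = "He has been running this insane challenge for 50 hours straight"
print(extract_lemmas(caption))
# roughly: [('run', 'VERB'), ('insane', 'ADJ'), ('challenge', 'NOUN'),
#           ('hour', 'NOUN'), ('straight', 'ADV')]
```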
Stage 2: POS-tag filtering (kill the proper nouns and the junk)
After lemmatization I have things like ("netflix", "PROPN"), ("ok", "INTJ"), ("uh", "INTJ"). None of these belong on a flashcard.
I keep only NOUN, VERB, ADJ, ADV and explicitly drop PROPN, INTJ, NUM, and anything tagged X (unknown).
```python
KEEP_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def filter_by_pos(lemmas: list[tuple[str, str]]) -> list[str]:
    return [lem for lem, pos in lemmas if pos in KEEP_POS]
```
Sounds trivial. The first version of TubeVocab didn't do this and ~18% of generated cards were words like "MrBeast", "TikTok", or "umm". Conversion to paid tanked because the first 5 cards a free user saw made the product look broken.
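For illustration, here's the filter applied to exactly that kind of junk (made-up tuples in the same shape Stage 1 emits):

```python
lemmas = [
    ("mrbeast", "PROPN"),
    ("umm", "INTJ"),
    ("insane", "ADJ"),
    ("challenge", "NOUN"),
    ("win", "VERB"),
]
print(filter_by_pos(lemmas))  # ['insane', 'challenge', 'win']
```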
Stage 3: CEFR difficulty classification (the part that took 14 days)
Every card needs a CEFR band — A1 through C2. I tried 3 approaches:
- A pretrained CEFR classifier from HuggingFace — slow (~120ms/word), 25% disagreement with native-speaker spot checks.
- A custom fine-tuned BERT — 91% agreement but +800MB Docker image and 4s cold start. Not worth it.
- A frequency-band lookup with hand-tuned overrides — this won.
I merged EFLLex (CEFR-aligned word lists) with SUBTLEX-US (film/TV word frequencies) and added ~600 manual overrides:
```python
CEFR_BAND = load_static_cefr()  # {"run": "A1", "ostensibly": "C1", ...}

def classify(lemma: str, default: str = "B2") -> str:
    return CEFR_BAND.get(lemma, default)
```
B2 is the default for unknown words because it's the median band for educational YouTube content. The lookup now runs at ~0.4ms/word with 89% agreement against my manual reviews.
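load_static_cefr() isn't shown above. Here's a minimal sketch of how that lookup could be assembled, assuming a TSV export of the merged EFLLex data with lemma/cefr columns and a CSV of the manual overrides; the file names and column names are illustrative, and the SUBTLEX-US frequency fallback is left out for brevity:

```python
import csv

def load_static_cefr(
    lexicon_path: str = "efllex_merged.tsv",  # illustrative path
    overrides_path: str = "overrides.csv",    # ~600 hand-tuned corrections
) -> dict[str, str]:
    """Build the lemma -> CEFR-band lookup used by classify()."""
    bands: dict[str, str] = {}
    with open(lexicon_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            lemma, band = row["lemma"].lower(), row["cefr"]
            # keep the easiest attested band; "A1" < "B2" < "C2" sorts correctly
            if lemma not in bands or band < bands[lemma]:
                bands[lemma] = band
    with open(overrides_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            bands[row["lemma"].lower()] = row["cefr"]  # overrides always win
    return bands
```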
Stage 4: Dedupe-by-context (the secret sauce)
A learner doesn't need 12 cards for run even if it appears in 12 videos. They need one card with the best example sentence.
For each lemma I score every candidate sentence on three things: length (10–20 tokens is the sweet spot), how many of its words sit above the target CEFR band, and a small TextRank clarity score:
```python
def best_context(lemma: str, candidates: list[Sentence]) -> Sentence:
    return max(
        candidates,
        key=lambda s: (
            -abs(len(s.tokens) - 15)         # prefer sentences near 15 tokens
            - 2 * count_above_band(s, "B2")  # penalize words above the target band
            + textrank_score(s)              # reward central, "clear" sentences
        ),
    )
```
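The two helpers aren't shown above. Here's a sketch of the behaviour they assume, with a hypothetical Sentence that carries its lemmatized tokens and a TextRank centrality score precomputed once per video (the real TubeVocab internals may differ):

```python
from dataclasses import dataclass

CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

@dataclass
class Sentence:
    text: str
    tokens: list[str]   # lemmatized tokens from Stage 1
    centrality: float   # TextRank score, precomputed once per video

def count_above_band(s: Sentence, band: str) -> int:
    """How many words in the sentence sit above the target CEFR band."""
    cutoff = CEFR_ORDER.index(band)
    return sum(
        1 for tok in s.tokens
        if CEFR_ORDER.index(classify(tok)) > cutoff  # classify() from Stage 3
    )

def textrank_score(s: Sentence) -> float:
    """Sentence centrality within its video, from TextRank over all captions."""
    return s.centrality
```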
This single change moved 14-day retention from 18% to 31%.
Before vs. after on one real 12-min MrBeast video
| Metric | v1 (lemma + freq) | v4 (full pipeline) |
|---|---|---|
| Tokens after lemmatization | 1,847 | 1,847 |
| Cards after POS filter | 1,847 | 612 |
| Cards after CEFR-band trim (B1–B2 target) | 612 | 184 |
| Cards after context-dedupe | 184 | 71 |
| User-reported "useful" rate (n=40) | 22% | 78% |
If you want to see the output without reading spaCy, TubeVocab is the side project these 4 stages live inside: paste a YouTube URL, get back ~50–100 CEFR-tagged cards with clickable timestamps that jump to the exact second the word was spoken.