Why I chose human-edited subtitles over AI auto-captions for vocabulary mining

When I was prototyping TubeVocab, the obvious shortcut was to use YouTube's auto-generated captions for every video. They are free, they exist on almost every clip, and the API gives them back instantly.

I tried that path for a few weeks. It did not survive contact with real learners.

The vocabulary mining experience depends on subtitle quality much more than on UI polish, and human-edited subtitles consistently beat auto-captions in ways that matter to ESL learners.

Auto-captions get the easy words right

For clean studio audio with a single speaker, auto-captions are good enough. A YouTuber sitting in front of a microphone reading a script will produce captions that match the spoken words at 95 percent accuracy or better.

That accuracy collapses when the audio is messy. Two speakers overlap. Background music kicks in. A guest has a strong accent. A scene cuts to street noise. Suddenly the caption track misses half a sentence, merges words, or hallucinates filler.

Those are exactly the moments where a learner needs the most help. The easy sentences they can already follow on their own. The hard sentences are the ones that need clean subtitles before they are worth saving as flashcards.

The errors hurt the learning loop more than the UI

A wrong transcription is not just inconvenient. It silently teaches the wrong thing.

If the speaker says "I could have told you," and the caption says "I could of told you," and the learner clicks the phrase to save as vocabulary, they save a piece of folk-grammar that does not exist in formal English. They will be confused later when a teacher marks it wrong.

If the speaker says a specialized term and the caption substitutes a similar-sounding common word, the learner saves the wrong word entirely. Their flashcard deck quietly fills with noise.

That is a worse failure than a missing caption. A missing caption means the learner moves on. A wrong caption means the learner trusts it.

Human-edited subtitles encode rhythm

There is a second, less obvious reason human subtitles are better.

Auto-captions split lines mostly by silence detection. A human editor splits lines by meaning and reading rhythm. They keep a phrase like "as a matter of fact" together on a single line. They break at clause boundaries, not in the middle of a noun phrase.

That rhythm matters for ESL learning because most of the value comes from reading the line as a chunk, not as isolated words. When I save a phrase from TubeVocab, I want the natural unit a fluent speaker would say in one breath, not whatever the silence detector happened to align with.

Human captions preserve those chunks. Auto-captions chop them up.
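
To make the difference concrete, here is a hypothetical pair of caption tracks for the same spoken sentence. The line texts and variable names are illustrative, not real TubeVocab data:

```typescript
// Hypothetical caption tracks for the same spoken sentence.
// Silence detection splits wherever the speaker pauses;
// a human editor splits at phrase and clause boundaries.
const autoCaptionLines = [
  "as a matter",          // split mid-phrase by a breath pause
  "of fact the results",  // fragments merged across the boundary
  "were surprising",
];

const humanCaptionLines = [
  "As a matter of fact,",         // the fixed phrase kept as one chunk
  "the results were surprising.", // one clause, one line
];
```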

The cost of better subtitles

The downside is obvious. Human-edited subtitles do not exist for most YouTube videos.

A lot of educational creators upload only auto-captions, or no captions at all. A learner who only watches the top 1 percent of curated channels gets perfect subtitles. A learner who watches whatever interests them gets a mixed bag.

So in practice the system has to handle three cases (sketched in code after the list):

  1. The video has good human subtitles. Use them directly.
  2. The video has only auto-captions. Use them, but flag the line as machine-generated so the learner knows to double-check before saving a flashcard.
  3. The video has no subtitles. Either skip it or run a higher-quality transcription pass before exposing it as a vocabulary mining source.
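
A minimal sketch of that three-way decision in TypeScript. It assumes a platform API that lists caption tracks with an auto-generated flag; the shapes here are my assumption, not YouTube's actual response format:

```typescript
// One subtitle track as reported by the video platform (assumed shape).
interface SubtitleTrack {
  lang: string;
  autoGenerated: boolean; // true for machine captions
}

type SubtitleSource =
  | { kind: "human"; track: SubtitleTrack }
  | { kind: "auto"; track: SubtitleTrack } // usable, but flagged in the UI
  | { kind: "none" };                      // skip, or transcribe separately

// Pick the best available track for a target language.
function pickSubtitleSource(
  tracks: SubtitleTrack[],
  lang: string,
): SubtitleSource {
  const inLang = tracks.filter((t) => t.lang === lang);

  const human = inLang.find((t) => !t.autoGenerated);
  if (human) return { kind: "human", track: human }; // case 1

  const auto = inLang.find((t) => t.autoGenerated);
  if (auto) return { kind: "auto", track: auto };    // case 2

  return { kind: "none" };                           // case 3
}
```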

The interesting product question is the second case. It is not honest to pretend the captions are clean. It is also not useful to refuse to work at all.

What I show the learner

The compromise I landed on is to render the subtitles as usual, but mark auto-captions visually so the learner knows what they are looking at. When they hover or click to save a phrase, the saved card carries metadata about whether the source was human-edited or auto-generated.

That changes how the system can treat the saved item later. A learner reviewing a phrase saved from a human-captioned video can trust the original text. A phrase saved from an auto-captioned video can be re-checked against the audio before being promoted into spaced repetition.
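
A sketch of how that provenance could travel with a saved card and gate promotion into spaced repetition. The field names and the verifyAgainstAudio check are hypothetical, not TubeVocab's real code:

```typescript
// Provenance travels with every saved phrase.
interface SavedPhrase {
  text: string;
  videoId: string;
  timestampSec: number;
  captionSource: "human" | "auto";
  verified: boolean; // auto-sourced phrases start unverified
}

// Gate entry into spaced repetition on caption provenance.
// verifyAgainstAudio stands in for a re-transcription check.
async function canPromoteToReview(
  phrase: SavedPhrase,
  verifyAgainstAudio: (p: SavedPhrase) => Promise<boolean>,
): Promise<boolean> {
  if (phrase.captionSource === "human") return true; // trusted as-is
  if (phrase.verified) return true;
  return verifyAgainstAudio(phrase); // re-check before promotion
}
```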

This is more work than treating every caption as equally valid. But it matches reality, and it stops the flashcard deck from filling with subtle errors over months of use.

The product lesson

The naive version of TubeVocab treated all YouTube subtitles as a uniform data source. The honest version treats them as a quality spectrum.

For an ESL tool built on top of real videos, that distinction shows up everywhere: in the flashcards learners save, in the dictation lines they replay, in the sentences they trust as examples. A vocabulary tool is only as good as the text underneath it.

That is why I now treat subtitle source quality as a first-class signal, not a free input I can use without thinking. It is one of the boring parts of building TubeVocab, and one of the parts that quietly determines whether learners get value or get misled.
