TL;DR: Transcription comes with its own jargon — WER, diarization, ASR, verbatim, and dozens more. This glossary breaks down 25+ terms in plain English so you can evaluate tools, read spec sheets, and sound like you know what you're talking about (because you will).
- 25+ — Terms Defined
- $30B — Speech Recognition Market (2026)
- < 4% — WER for Top ASR Models
- 95+ — Languages in Modern ASR
Why a Transcription Glossary Matters
You open a transcription tool's pricing page and it says "speaker diarization included on Pro plans." Or a review mentions "5.2% WER on the LibriSpeech benchmark." Sounds impressive — but what does it actually mean for your workflow?
The transcription industry borrows heavily from speech science, machine learning, and audio engineering. That vocabulary gap trips up everyone from podcast producers to legal assistants shopping for their first AI tool. This glossary closes that gap. Bookmark it, share it with your team, and come back whenever a spec sheet throws jargon at you.
Core Transcription Terms (A–Z)
Acoustic Model
The part of a speech recognition engine that maps raw audio signals to phonetic sounds. Think of it as the "ear" of the system — it hears the waveform and guesses which speech sounds are present. Modern acoustic models use deep neural networks trained on thousands of hours of recorded speech.
ASR (Automatic Speech Recognition)
The umbrella technology that converts spoken words into written text. Also called speech-to-text (STT). Every transcription tool — from Google's live captions to QuillAI — runs an ASR engine under the hood. The global ASR market hit roughly $19 billion in 2025 and is projected to surpass $30 billion by late 2026.
ℹ️ ASR vs. STT vs. Voice Recognition
These terms overlap but aren't identical. ASR and STT both mean turning speech into text. Voice recognition (or speaker recognition) identifies who is speaking rather than what they said. Many modern platforms — QuillAI included — combine both capabilities.
Batch Processing
Transcribing a complete audio file after it's been uploaded, as opposed to processing it in real time. Batch mode often produces higher accuracy because the model can look at the full context of a sentence before making predictions. Most transcription tools offer both real-time and batch options.
Clean Verbatim
A transcription style that captures all meaningful spoken content but removes filler words ("um," "uh," "like"), false starts, and stutters. It's the most common format for meeting notes, blog repurposing, and content creation. Compare with verbatim (see below).
Confidence Score
A number (usually 0 to 1) that an ASR model assigns to each transcribed word, indicating how certain it is about the result. A word with a confidence score of 0.98 is almost certainly correct; one at 0.45 is a guess. Some tools flag low-confidence words so you can review them manually.
Diarization (Speaker Diarization)
The process of figuring out "who said what" in a multi-speaker recording. The system segments the audio, generates a voice fingerprint for each speaker, and labels each sentence accordingly — "Speaker A," "Speaker B," and so on. Without diarization, you get a single wall of text with no way to tell speakers apart.
Diarization accuracy depends on audio quality, the number of overlapping voices, and background noise. Modern deep-learning pipelines achieve strong results even on noisy podcast recordings, but heavily overlapping speech (people talking over each other) remains the hardest edge case.
Edit Distance (Levenshtein Distance)
The minimum number of single-word operations — insertions, deletions, substitutions — needed to turn one text string into another. It's the math behind WER. If a model outputs "the quick brown fox" but the reference is "a quick brown fox," the edit distance is 1 (one substitution).
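The calculation can be sketched in a few lines of Python using the classic dynamic-programming table:

```python
# Sketch: word-level Levenshtein (edit) distance -- the math behind WER.

def edit_distance(reference, hypothesis):
    """Minimum insertions, deletions, and substitutions between word lists."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1]

print(edit_distance("a quick brown fox", "the quick brown fox"))  # -> 1
```

The same table works at the character level; transcription metrics use the word-level variant.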
Filler Words
Non-content sounds people insert while speaking: "um," "uh," "you know," "like," "so." Verbatim transcripts keep them; clean verbatim removes them. Filler detection is a separate post-processing step in most ASR pipelines.
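A naive filler-removal pass could be sketched like this — a toy regex approach only, since real pipelines use context-aware models (a word like "like" is sometimes meaningful):

```python
import re

# Sketch: a naive clean-verbatim pass that strips a few common English
# fillers. Real pipelines use context-aware models instead of regexes.
FILLERS = re.compile(r"\b(um+|uh+|erm)\b[,.]?\s*", flags=re.IGNORECASE)

def clean_verbatim(text):
    # Remove fillers, then collapse any doubled-up spaces left behind.
    # Note: a real cleaner would also recapitalize the sentence.
    return re.sub(r"\s{2,}", " ", FILLERS.sub("", text)).strip()

print(clean_verbatim("Um so the launch is uh Tuesday."))
```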
Hallucination
When an ASR model generates words or phrases that were never actually spoken in the audio. This happens more often with silence, very quiet speech, or background music. Reputable transcription platforms add safeguards — silence detection, confidence thresholds — to minimize hallucinations.
Key Points Extraction
An AI-powered feature that reads a transcript and pulls out the main ideas, action items, or decisions. Goes beyond raw transcription into summarization territory. Platforms like QuillAI offer this as a built-in feature alongside transcription, so you get both the full text and a condensed summary.
Language Model
The component that predicts which word is most likely to come next in a sentence. If the acoustic model hears something ambiguous — "I scream" vs. "ice cream" — the language model uses context to pick the right option. Large language models (LLMs) have dramatically improved transcription accuracy since 2023.
NLP (Natural Language Processing)
A branch of AI focused on understanding human language. In transcription, NLP powers features like punctuation restoration, entity recognition (identifying names, dates, places), sentiment analysis, and topic detection. It's what turns raw text into structured, useful output.
Normalization
Post-processing that converts spoken forms into their written equivalents. For example, "twenty twenty-six" becomes "2026," or "doctor smith" becomes "Dr. Smith." Normalization also handles currency, percentages, and phone numbers. Without it, transcripts are hard to skim.
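A toy rule-based normalizer illustrates the idea. Production systems use far richer rule sets or learned models; these two mappings are purely illustrative:

```python
# Sketch: a tiny rule-based normalizer for spoken-to-written forms.
# The mappings below are illustrative assumptions, not a real rule set.
RULES = [
    ("twenty twenty-six", "2026"),
    ("doctor smith", "Dr. Smith"),
]

def normalize(text):
    for spoken, written in RULES:
        text = text.replace(spoken, written)
    return text

print(normalize("doctor smith will retire in twenty twenty-six"))
# -> "Dr. Smith will retire in 2026"
```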
Punctuation Restoration
Adding commas, periods, question marks, and other punctuation to a transcript automatically. Raw ASR output is typically unpunctuated, so a separate model (or an integrated one) inserts punctuation based on pauses, intonation, and syntax. Quality here makes or breaks readability.
Real-Time Transcription (Live Transcription)
Converting speech to text as it happens, with minimal delay (typically under 2 seconds). Used for live captions, accessibility, and real-time meeting notes. The accuracy gap between real-time and batch processing has narrowed significantly — top models now reach near-parity.
SRT / VTT Files
Standard subtitle file formats. SRT (SubRip Text) and VTT (WebVTT) both contain timed text segments used for video captions. Many transcription tools export directly to these formats, saving content creators the hassle of manual subtitle editing.
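To see what the SRT format actually looks like, here is a sketch that writes timed segments out as SRT. The segment structure is an assumption — export tools build this for you:

```python
# Sketch: serialize timed segments as SRT. The {"start", "end", "text"}
# segment shape is an assumption; real tools export SRT directly.

def to_srt_time(seconds):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_time(seg['start'])} --> "
            f"{to_srt_time(seg['end'])}\n{seg['text']}\n"
        )
    return "\n".join(blocks)

print(to_srt([
    {"start": 0.0, "end": 2.5, "text": "Welcome back to the show."},
    {"start": 2.5, "end": 5.0, "text": "Today we talk transcription."},
]))
```

VTT differs mainly in the header line (`WEBVTT`) and its use of a period instead of a comma before the milliseconds.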
Timestamps (Time Codes)
Markers in a transcript that indicate when each word, sentence, or segment was spoken in the original audio. Usually formatted as HH:MM:SS. Timestamps let you click directly to a moment in the recording — crucial for long interviews, lectures, and webinar transcription.
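Converting a raw second offset into that HH:MM:SS form is just a little arithmetic:

```python
# Sketch: render an audio offset (in seconds) as an HH:MM:SS timestamp.

def timestamp(seconds):
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

print(timestamp(3725))  # -> 01:02:05
```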
Turnaround Time (TAT)
How long it takes to receive a finished transcript after submitting audio. Human transcription services typically quote 12–24 hours. AI-powered tools like QuillAI deliver results in minutes — often in less time than the recording itself takes to play.
VAD (Voice Activity Detection)
An algorithm that identifies which parts of an audio stream contain human speech and which are silence, music, or noise. VAD runs before the main ASR engine to filter out non-speech segments, improving both speed and accuracy.
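A toy energy-threshold VAD shows the core idea — real VADs, such as WebRTC's, use trained classifiers rather than a fixed threshold:

```python
import math

# Sketch: a toy energy-threshold VAD over a frame of PCM samples.
# Real VADs (e.g. WebRTC's) use trained classifiers, not a fixed cutoff.

def is_speech(frame, threshold=0.02):
    """Treat a frame as speech if its RMS energy exceeds the threshold."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > threshold

silence = [0.001] * 160        # near-zero samples
speech  = [0.1, -0.09] * 80    # louder oscillation

print(is_speech(silence), is_speech(speech))  # -> False True
```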
Verbatim Transcription
A transcription style that captures every single sound: all words, filler expressions, stutters, false starts, laughter, coughs, and pauses. It's the gold standard for legal proceedings, qualitative research, and journalism where exact wording matters. Verbatim takes longer to produce and is harder to read than clean verbatim.
WER (Word Error Rate)
The standard accuracy metric for speech recognition. Calculated as: WER = (Substitutions + Deletions + Insertions) / Total Reference Words. A WER of 5% means 5 out of every 100 words are wrong. Top commercial ASR models in 2026 achieve WER under 4% on clean audio — close to human-level performance (which sits around 4–5% WER).
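The formula is straightforward to compute yourself: word-level edit distance divided by the reference length. A self-contained sketch:

```python
# Sketch: WER as word-level edit distance over the reference word count.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit-distance table.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    return dp[-1][-1] / len(ref)

# One substitution ("the" -> "a") against a 10-word reference -> 10% WER.
ref = "the quarterly results were announced at the meeting on monday"
hyp = "the quarterly results were announced at a meeting on monday"
print(f"{wer(ref, hyp):.0%}")  # -> 10%
```

Note that WER can exceed 100% if the model inserts many extra words, since insertions count against a fixed reference length.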
💡 What's a "Good" WER?
It depends on the audio. Clean studio recordings: under 3% is achievable. Phone calls with background noise: 8–12% is realistic. Crosstalk-heavy meetings: 15–20%. Always test a tool on your actual audio rather than trusting benchmark numbers alone.
Whisper
An open-source ASR model released by OpenAI in 2022, trained on 680,000 hours of multilingual audio. Whisper popularized the idea that a single model could handle 95+ languages with strong accuracy. Many transcription services — including QuillAI — use Whisper-based architectures as part of their processing pipeline.
Quick Reference Table
| Term | Quick definition |
| --- | --- |
| 🎯 WER | Word Error Rate — the % of incorrectly transcribed words. Lower = better. |
| 🗣️ Diarization | Identifies who spoke when in multi-speaker recordings. |
| ⏱️ Timestamps | Time markers linking text to exact moments in audio. |
| 🤖 ASR | Automatic Speech Recognition — the core tech behind all transcription tools. |
| 📝 Verbatim | Full transcription including every um, uh, and stutter. |
| 🔇 VAD | Voice Activity Detection — filters silence and noise before transcription. |
| 🧠 NLP | Natural Language Processing — adds punctuation, entities, summaries. |
| 📊 Confidence Score | How sure the model is about each word (0–1 scale). |
How These Terms Affect Your Tool Choice
Knowing the vocabulary helps you cut through marketing fluff. When a tool advertises "industry-leading accuracy," you can ask: what WER, on what benchmark, with what audio conditions? When a plan includes "speaker labels," you know that means diarization. When someone says "we support 95 languages," you can check whether that's via Whisper or a proprietary model.
Here's a practical decision framework:
1. Define your audio type
Single speaker (podcast narration), two speakers (interview), or group (meeting)? This determines whether you need diarization.
2. Pick your transcript style
Clean verbatim works for most business use cases. Full verbatim is required for legal, research, and journalistic work.
3. Check accuracy claims
Look for published WER numbers and test on your own audio. A tool with 3% WER on studio audio may hit 15% on your noisy conference room recording.
4. Evaluate post-processing
Timestamps, punctuation, normalization, key points — these features determine how usable the output is straight out of the box.
5. Consider language needs
If you work in multiple languages, look for a platform with broad multilingual support.
Frequently Asked Questions
FAQ
What is a good Word Error Rate (WER) for transcription?
For clean, single-speaker audio, a WER under 5% is considered strong — comparable to human transcribers. For noisy, multi-speaker recordings, 8–15% is realistic with current AI models. Always benchmark against your own audio rather than relying solely on published numbers.
What's the difference between verbatim and clean verbatim?
Verbatim captures everything: filler words, stutters, false starts, laughter. Clean verbatim removes those non-content elements while keeping all meaningful speech intact. Most business users prefer clean verbatim for readability; legal and research contexts require full verbatim.
Why does speaker diarization matter?
Without diarization, a multi-speaker transcript is just an unbroken wall of text. Diarization labels each segment with the speaker's identity, making transcripts searchable, quotable, and useful for meeting minutes, interviews, and podcasts.
What does ASR stand for and how does it work?
ASR stands for Automatic Speech Recognition. It works by passing audio through an acoustic model (which identifies speech sounds), a language model (which predicts likely word sequences), and post-processing steps like punctuation and normalization. Modern ASR uses deep neural networks trained on hundreds of thousands of hours of speech.
Can AI transcription handle multiple languages?
Yes. Models like OpenAI's Whisper support 95+ languages from a single model. Platforms such as QuillAI leverage this capability to transcribe audio in dozens of languages without requiring you to specify the language in advance.
See These Terms in Action — QuillAI handles ASR, diarization, timestamps, and key points extraction — all from your browser. Upload an audio file or paste a YouTube link to get started.