ASR (Automatic Speech Recognition)

Yesterday I shared the full Voice AI pipeline.
Today we're diving deep into Stage 1: ASR (Automatic Speech Recognition).

You speak → It becomes text.

Simple, right? Here's what actually happens:

**1. Feature Extraction**
Raw audio → digital representation (sketch below)

  • MFCCs (Mel-Frequency Cepstral Coefficients)
  • Spectrograms
  • Filter Banks
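
A minimal sketch of pulling these features out with librosa (the library choice and the `speech.wav` path are my assumptions, not part of the pipeline above):

```python
import librosa

# Load audio as mono float samples at 16 kHz (a common ASR sample rate)
y, sr = librosa.load("speech.wav", sr=16000)  # "speech.wav" is a placeholder path

# MFCCs: 13 coefficients per frame is a classic starting point
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Log-mel spectrogram (mel filter banks): the usual input for modern models
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel)

print(mfccs.shape, log_mel.shape)  # (13, n_frames), (80, n_frames)
```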

**2. Acoustic Modeling**
Maps audio features to phonemes (toy model below)

  • Traditional: HMM-GMM, DNN-HMM
  • Modern: Transformers, Conformers
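
To make "maps audio features to phonemes" concrete, here's a toy PyTorch sketch of a Transformer-style acoustic model. The layer sizes and the 40-phoneme inventory are purely illustrative:

```python
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Toy acoustic model: log-mel frames -> per-frame phoneme logits."""
    def __init__(self, n_mels=80, d_model=256, n_phonemes=40):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)      # embed each frame
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_phonemes)  # phoneme scores per frame

    def forward(self, x):  # x: (batch, frames, n_mels)
        return self.head(self.encoder(self.proj(x)))

model = TinyAcousticModel()
frames = torch.randn(1, 200, 80)  # ~2 s of audio at a 10 ms frame hop
print(model(frames).shape)        # torch.Size([1, 200, 40])
```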

**3. Decoding & Language Modeling**
Phonemes → Words using probabilities (decoding sketch below)

  • Beam Search, CTC, Attention mechanisms
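
The simplest CTC decode is greedy: take the argmax per frame, collapse repeats, drop blanks. A sketch of just that step (real systems add beam search and a language model on top):

```python
import torch

BLANK = 0  # CTC reserves one symbol index for "blank"

def ctc_greedy_decode(logits):
    """logits: (frames, n_symbols) tensor -> list of decoded symbol ids."""
    best = logits.argmax(dim=-1).tolist()  # most likely symbol per frame
    out, prev = [], BLANK
    for s in best:
        if s != prev and s != BLANK:       # collapse repeats, skip blanks
            out.append(s)
        prev = s
    return out

# Toy example: 5 frames over 4 symbols (0 = blank; 1/2/3 stand in for phonemes)
logits = torch.tensor([[0.1, 0.8, 0.05, 0.05],
                       [0.1, 0.8, 0.05, 0.05],   # repeat, collapsed
                       [0.9, 0.03, 0.03, 0.04],  # blank, dropped
                       [0.1, 0.1, 0.7, 0.1],
                       [0.1, 0.1, 0.1, 0.7]])
print(ctc_greedy_decode(logits))  # [1, 2, 3]
```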

**4. Post-Processing**
Clean up the output (example below)

  • Spell checking, punctuation, capitalization
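
Production systems typically use dedicated punctuation and truecasing models, but a toy rule-based pass shows the idea (the rules below are illustrative, not exhaustive):

```python
import re

def postprocess(raw: str) -> str:
    """Toy cleanup: fix the pronoun 'I', capitalize the start, add a period."""
    text = raw.strip()
    text = re.sub(r"\bi\b", "I", text)   # capitalize standalone "i"
    text = text[:1].upper() + text[1:]   # capitalize the first letter
    if text and text[-1] not in ".!?":
        text += "."                      # add terminal punctuation
    return text

print(postprocess("i think asr is the foundation of voice ai"))
# -> "I think asr is the foundation of voice ai."
```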

The evolution has been wild:

**Traditional (1980s-2010s):**
→ HMM + GMM
→ Required phonetic alignment
→ Separate components stitched together

**State-of-the-art (now):**
→ Whisper: trained on 680K hours of multilingual audio, supports ~100 languages
→ Wav2Vec 2.0: self-supervised pretraining, fine-tunes well even with limited labeled data
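
For reference, transcribing with the openai-whisper package takes only a few lines (the model size and audio path below are placeholders):

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")       # "tiny" through "large" trade speed for accuracy
result = model.transcribe("speech.wav")  # placeholder audio path
print(result["text"])
```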

Get ASR wrong and your entire voice pipeline fails. It's the foundation.

I've attached a diagram breaking down the full ASR architecture.

What ASR model are you using? Any surprises with accuracy or latency?
