Gokul S

Part 2: Spoken Language Models

#ai

Update: Here are some other interesting blog posts:

  1. DeepSeek R1: https://dev.to/gokulsg/deepseek-r1-33n0
  2. Evolution of Language Models: https://dev.to/gokulsg/evolution-of-language-models-163
  3. Evaluation in Language Models: https://dev.to/gokulsg/llm-53ha

Part 1 of Spoken Language Models: https://dev.to/gokulsg/spoken-language-models-3afe


End-to-End ASR Architecture

Modern ASR systems have increasingly adopted end-to-end architectures that directly map audio input to text output without explicit intermediate representations. These approaches include Connectionist Temporal Classification (CTC), attention mechanisms, and transformer-based models, each offering unique advantages for speech recognition tasks.

Connectionist Temporal Classification (CTC) allows neural networks to be trained on sequence data without requiring pre-segmented training data. It introduces a "blank" label to handle alignment between input and output sequences of different lengths. CTC solves the fundamental problem of aligning the audio frames with the corresponding transcript when the exact alignment is unknown during training. The algorithm considers all possible alignments and computes a probability distribution over them, making it possible to train the network with just paired audio-transcript data.
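
To make this concrete, here is a minimal sketch using PyTorch's built-in CTC loss; the tensor sizes and label indices are made up purely for illustration. The model's frame-level log-probabilities and the unaligned transcript are all that is needed:

import torch
import torch.nn as nn

# Hypothetical sizes: 50 audio frames, batch of 1, 20 output symbols plus the blank
T, N, C = 50, 1, 21                                        # frames, batch size, classes (index 0 = blank)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)  # stand-in for acoustic model output

# Unaligned target transcript as label indices -- no frame-level alignment is provided
targets = torch.tensor([[5, 3, 12, 7, 1]])
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([5])

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients implicitly sum over all valid alignments
print(loss.item())
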
Attention mechanisms allow the model to focus on different parts of the input audio when generating each output token, creating flexible alignments between speech and text.

Unlike CTC, which makes a strong independence assumption between outputs at different time steps, attention mechanisms can capture complex dependencies across the entire sequence. This is particularly valuable for dealing with long utterances and handling phenomena like co-articulation, where the pronunciation of one sound is influenced by adjacent sounds.

Transformer-based models, originally developed for text processing, have been adapted for speech recognition with remarkable success. Their self-attention mechanism captures long-range dependencies efficiently and has become foundational in state-of-the-art ASR systems. Transformers process the entire sequence in parallel rather than sequentially, allowing for more efficient training. The multi-head attention mechanism enables the model to jointly attend to information from different representation subspaces, capturing various aspects of the relationship between input and output sequences.
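
As a small illustration of the core operation, the sketch below applies PyTorch's torch.nn.MultiheadAttention to a made-up sequence of acoustic frame embeddings; the dimensions are arbitrary, not taken from any particular ASR model:

import torch
import torch.nn as nn

# Illustrative dimensions: 100 frames, batch of 1, 256-dim frame embeddings, 4 heads
frames = torch.randn(1, 100, 256)                                   # (batch, sequence, embedding)
self_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

# Self-attention: queries, keys, and values all come from the same sequence
attended, weights = self_attn(frames, frames, frames)
print(attended.shape)   # torch.Size([1, 100, 256])
print(weights.shape)    # torch.Size([1, 100, 100]) -- attention weights averaged over heads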

The shift toward end-to-end architectures has simplified the ASR pipeline by eliminating the need for separate pronunciation dictionaries and language models during training. These unified models can be optimized directly for the final objective of transcription accuracy, often leading to better performance compared to traditional pipeline approaches.

Large-scale ASR Models

Recent years have seen the development of increasingly large and capable ASR models that push the boundaries of recognition accuracy and robustness. These models leverage massive datasets, powerful neural architectures, and novel training techniques to achieve unprecedented performance.

DeepSpeech, Mozilla's open-source ASR engine, implements an end-to-end neural network for speech recognition. It eliminates the need for specialized linguistic knowledge by learning directly from audio-text pairs. DeepSpeech uses a simple architecture based on recurrent neural networks that can be deployed on a variety of platforms, from server-grade hardware to mobile devices. Its open-source nature has made it a popular choice for researchers and developers looking to build speech-enabled applications.

Wav2Vec and its variants represent another important advancement in ASR technology. These self-supervised models learn representations from unlabeled audio data, significantly reducing the amount of labeled data required for high-performance ASR. The approach has been particularly valuable for low-resource languages where obtaining large amounts of transcribed speech is challenging. Wav2Vec 2.0 combines contrastive learning with masked prediction to learn powerful speech representations that can be fine-tuned for ASR with minimal labeled data.
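
As a rough sketch of how such a model is used in practice, the snippet below loads a Wav2Vec 2.0 checkpoint that has already been fine-tuned for English ASR from the Hugging Face Hub and performs greedy CTC decoding; the checkpoint name and the 16 kHz mono audio file are assumptions for illustration:

import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Fine-tuned English checkpoint; swap in another checkpoint for other languages
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Assumes a 16 kHz mono recording
speech, sampling_rate = sf.read("sample_audio.wav", dtype="float32")
inputs = processor(speech, sampling_rate=sampling_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: most likely token per frame, then collapse repeats and blanks
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])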

Whisper, developed by OpenAI, demonstrated robust performance across languages and domains by training on a massive and diverse dataset. Its multilingual capabilities and robustness to real-world conditions represent a significant step forward in ASR technology. Whisper uses a sequence-to-sequence architecture with attention mechanisms and was trained on 680,000 hours of multilingual and multitask supervised data collected from the web. This extensive training allows it to generalize well to diverse acoustic environments, accents, and recording conditions.

These large-scale models have dramatically improved the accuracy and robustness of ASR systems, making them practical for a wide range of real-world applications. Their ability to handle noisy environments, diverse accents, and multiple languages has expanded the potential use cases for speech recognition technology.

ASR Evaluation Metrics

The performance of ASR systems is typically measured using standardized metrics that quantify the accuracy of transcriptions compared to ground truth references. These metrics allow researchers and developers to compare different approaches objectively and track progress in the field.

Word Error Rate (WER) is the standard metric for ASR evaluation, calculated as the sum of substitutions, insertions, and deletions divided by the total number of words in the reference text. Lower WER values indicate better performance. The formula can be expressed as:

WER = (S + I + D) / N

Where S, I, and D are the number of substitutions, insertions, and deletions, respectively, and N is the total number of words in the reference. WER is intuitive and widely used, but it has limitations, particularly for languages where word boundaries are not clearly defined.
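
A minimal sketch of this computation using standard edit-distance dynamic programming is shown below; running the same function over characters instead of words yields the Character Error Rate discussed next:

def word_error_rate(reference, hypothesis):
    """WER via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,              # deletion
                           dp[i][j - 1] + 1,              # insertion
                           dp[i - 1][j - 1] + cost)       # match / substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167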

Character Error Rate (CER) is similar to WER but calculated at the character level, making it useful for languages where word boundaries are ambiguous. CER is often used alongside WER to provide a more complete picture of transcription accuracy, especially for languages with complex morphology or writing systems where word-level evaluation may be misleading.

Various benchmark datasets exist to standardize evaluation, including LibriSpeech (based on audiobook recordings) and Common Voice (Mozilla's multilingual speech corpus), allowing for fair comparison between different approaches. These datasets include carefully curated test sets that represent different speaking styles, acoustic conditions, and linguistic content, providing a comprehensive assessment of ASR system performance.

Modern TTS Systems

Concatenative Synthesis

Traditional TTS systems often relied on concatenative synthesis, which involves stitching together pre-recorded speech segments to form new utterances. This approach dominated the field for many years before neural approaches became prevalent.

Unit selection synthesis selects optimal units from a large database of recorded speech to create natural-sounding output. The units may be phonemes, diphones (transitions between phonemes), or even longer segments. The selection process typically involves two cost functions: a target cost (how well a candidate unit matches the desired phonetic properties) and a concatenation cost (how smoothly two adjacent units join together). The system searches for the sequence of units that minimizes the total cost, often using algorithms like Viterbi search.
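
The following toy sketch illustrates that search in miniature: hypothetical target and concatenation cost functions, and a Viterbi-style dynamic program that picks one candidate unit per position so the total cost is minimized. The candidates and costs are invented numbers, not a real unit database:

# Toy unit-selection search: one candidate unit is chosen per target position.
# Each candidate is described by a single arbitrary feature value; real systems
# use rich phonetic and acoustic features.
candidates = [[0.9, 0.4], [0.5, 0.7, 0.2], [0.8, 0.3]]   # candidate units per position
targets = [1.0, 0.4, 0.5]                                # desired feature per position

def target_cost(unit, target):
    return abs(unit - target)                 # mismatch with the desired properties

def concat_cost(prev_unit, unit):
    return abs(unit - prev_unit) * 0.5        # penalty for a rough join

# Viterbi over the candidate lattice: best[j] = cheapest path ending in candidate j
best = [target_cost(u, targets[0]) for u in candidates[0]]
back = []
for pos in range(1, len(targets)):
    new_best, pointers = [], []
    for u in candidates[pos]:
        scores = [best[k] + concat_cost(p, u) for k, p in enumerate(candidates[pos - 1])]
        k = min(range(len(scores)), key=scores.__getitem__)
        new_best.append(scores[k] + target_cost(u, targets[pos]))
        pointers.append(k)
    best, back = new_best, back + [pointers]

# Trace back the cheapest sequence of units
j = min(range(len(best)), key=best.__getitem__)
path = [j]
for pointers in reversed(back):
    j = pointers[j]
    path.append(j)
path.reverse()
print("selected units:", [candidates[i][j] for i, j in enumerate(path)])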

Diphone synthesis specifically focuses on the transitions between phonemes, capturing the critical co-articulation effects that occur in natural speech. A diphone spans from the middle of one phoneme to the middle of the next, thus capturing the transition between adjacent sounds. Using diphones as the basic unit helps preserve the natural transitions between sounds, which are often more perceptually important than the stable parts of phonemes.

Domain-specific applications of concatenative synthesis can produce very natural speech for limited domains with predictable content. For instance, transportation announcements or voice response systems for specific industries can use specialized recorded databases tailored to their particular vocabulary and speaking style. These systems trade flexibility for quality by focusing on a restricted domain.

While concatenative methods can produce high-quality speech for covered phrases, they lack flexibility and require large storage for voice databases. The speech quality can degrade significantly when synthesizing sentences that require unusual combinations of units not well-represented in the database. Additionally, changing the voice or speaking style typically requires recording an entirely new database, making these systems somewhat inflexible.

Parametric Synthesis

Parametric synthesis represents speech with statistical models that generate acoustic parameters, offering greater flexibility than concatenative approaches but traditionally sacrificing some naturalness.

HMM-based synthesis uses statistical models to capture the relationships between linguistic features and acoustic parameters, generating speech that can be adapted to different speakers and styles. These systems typically use context-dependent HMMs to model the mapping from linguistic features (derived from text analysis) to acoustic features (such as spectral parameters, fundamental frequency, and duration). During synthesis, the most likely acoustic parameter sequence is generated given the input text, and then a vocoder converts these parameters into a speech waveform.

Statistical parametric speech synthesis, a broader category that includes HMM-based systems, encompasses approaches that model speech parameters statistically. Compared to concatenative synthesis, parametric methods offer advantages in terms of footprint size (requiring less storage), flexibility (easier adaptation to new speakers or styles), and stability (producing more consistent output). However, they typically suffered from a somewhat muffled or buzzy quality due to statistical averaging and limitations in vocoder technology.

Vocoder technologies, which reconstruct speech waveforms from acoustic parameters, have seen significant advances in recent years. Traditional vocoders like STRAIGHT produced acceptable but clearly synthetic speech. Modern neural vocoders like WaveNet, WaveGlow, and LPCNet have dramatically improved the naturalness of parametric synthesis by more accurately modeling the complex relationships between acoustic parameters and waveforms.

The statistical nature of parametric synthesis makes it more adaptable than concatenative approaches. Voice characteristics can be modified through model adaptation techniques using relatively small amounts of target speaker data. Speaking style, emotion, and emphasis can also be controlled by adjusting model parameters, providing greater expressivity than early concatenative systems.

Neural TTS Models

The latest generation of TTS systems leverages neural networks to achieve unprecedented quality, with approaches that can generate speech nearly indistinguishable from human recordings in many cases.

WaveNet, introduced by DeepMind in 2016, represented a breakthrough in neural audio generation. This autoregressive model generates raw audio waveforms sample by sample, producing remarkably natural speech. WaveNet uses dilated causal convolutions to model the conditional probability distribution of each audio sample given all previous samples. This approach captures the fine-grained structure of speech waveforms, including subtle details that contribute to naturalness. Though computationally intensive in its original form, optimized implementations have made it practical for commercial applications.
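
The core building block can be sketched in a few lines: a stack of 1-D convolutions whose dilation doubles at each layer, padded on the left only so that each output sample depends exclusively on past samples. The layer count and channel sizes below are arbitrary, not WaveNet's actual configuration:

import torch
import torch.nn as nn

class DilatedCausalConv(nn.Module):
    """1-D convolution that only looks at past samples."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # pad on the left only
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                # x: (batch, channels, time)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

# Dilations 1, 2, 4, 8, ... double the receptive field at every layer
stack = nn.Sequential(*[DilatedCausalConv(16, kernel_size=2, dilation=2 ** i) for i in range(6)])
audio = torch.randn(1, 16, 1000)     # stand-in for an encoded waveform
print(stack(audio).shape)            # torch.Size([1, 16, 1000]); receptive field = 2**6 = 64 samples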

Tacotron and sequence-to-sequence models use encoder-decoder architectures with attention to map text directly to spectrograms, which are then converted to waveforms. Tacotron 2, which combines a sequence-to-sequence model with a modified WaveNet vocoder, demonstrated near-human naturalness for English speech synthesis. These models learn the mapping from character or phoneme sequences to acoustic features in an end-to-end fashion, eliminating many of the hand-engineered components of traditional TTS pipelines.

Flow-based models provide efficient generation by using invertible transformations, enabling faster-than-real-time synthesis. Models like FloWaveNet and WaveGlow use normalizing flows to transform a simple distribution (like Gaussian noise) into the complex distribution of speech waveforms. Unlike autoregressive models, flow-based approaches can generate all samples in parallel, making them much faster during inference while maintaining high quality.

Modern neural TTS systems offer unprecedented control over various aspects of speech. Models like FastSpeech 2 and VITS provide explicit control over prosody, speaking rate, and energy, allowing for expressive and varied speech output. Multi-speaker models can generate speech in different voices without requiring separate models for each speaker, and cross-lingual models can synthesize speech in languages not seen during training by leveraging shared representations across languages.

Speech Large Language Models

Recent developments have focused on integrating speech processing with language understanding, creating more holistic systems that can process and generate speech in context-aware ways.

Multimodal architectures process both speech and text inputs, enabling seamless transitions between modalities and more natural human-computer interaction. These systems maintain shared representations across modalities, allowing information to flow between the speech and language components. This integration allows for more context-aware speech processing, where the system's understanding of language can inform its interpretation of speech, and vice versa.

End-to-end speech understanding systems directly extract meaning from audio signals, rather than separating speech recognition from language understanding. Traditional pipelines that first transcribe speech to text and then apply natural language understanding can propagate errors and discard potentially useful acoustic information. In contrast, end-to-end approaches preserve this information and optimize directly for understanding, often leading to better performance in applications like voice assistants and spoken dialogue systems.
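
As a small illustration of the idea, an audio-classification model can map a recording directly to an intent label without producing a transcript first. The sketch below assumes the Hugging Face audio-classification pipeline and a checkpoint published for the SUPERB intent-classification task; both the checkpoint name and the audio file are used here only as examples:

from transformers import pipeline

# Direct audio -> intent label, with no intermediate transcription step
intent_classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-ic")

predictions = intent_classifier("sample_command.wav")   # assumed 16 kHz recording of a spoken command
for p in predictions:
    print(f"{p['label']}: {p['score']:.3f}")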

Conversational capabilities have advanced significantly with Speech Large Language Models (Speech LLMs), which can maintain context across multiple turns of dialogue, creating more natural conversational experiences. These models can track conversational state, remember earlier utterances, and generate contextually appropriate responses. Some systems can even model pragmatic aspects of conversation, such as turn-taking cues and discourse markers.

A comprehensive survey of Speech Large Language Models highlights how these systems combine language understanding with speech processing for more natural interactions. The integration allows for more human-like interactions that consider both linguistic content and acoustic aspects of communication, such as prosody, emphasis, and timing.

Pre-training and Fine-tuning Approaches

Modern speech models employ sophisticated training techniques to achieve high performance while making efficient use of available data.

Self-supervised learning has emerged as a powerful paradigm for speech models, allowing them to learn from unlabeled audio data. Models like wav2vec 2.0 use contrastive learning to distinguish true future audio frames from randomly sampled ones, forcing the model to capture meaningful representations of speech. This approach has been particularly valuable for low-resource scenarios where labeled data is scarce.
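
A toy version of that contrastive objective is sketched below: the model is rewarded for assigning higher similarity to the true masked target than to randomly sampled distractors. All tensors here are random stand-ins for learned representations:

import torch
import torch.nn.functional as F

# Stand-ins for learned vectors: one context vector per masked frame,
# its true quantized target, and 100 distractors sampled from elsewhere.
context = torch.randn(8, 256)                    # 8 masked positions, 256-dim
true_targets = torch.randn(8, 256)
distractors = torch.randn(8, 100, 256)

candidates = torch.cat([true_targets.unsqueeze(1), distractors], dim=1)   # (8, 101, 256)
similarities = F.cosine_similarity(context.unsqueeze(1).expand_as(candidates),
                                   candidates, dim=-1) / 0.1              # temperature 0.1

# InfoNCE-style loss: the true target always sits at index 0
labels = torch.zeros(8, dtype=torch.long)
loss = F.cross_entropy(similarities, labels)
print(loss.item())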

Transfer learning enables knowledge gained from one speech task to be transferred to another, reducing the need for task-specific labeled data. For instance, a model pre-trained on a large corpus of unlabeled speech can be fine-tuned with a much smaller amount of labeled data for specific tasks like ASR, speaker identification, or emotion recognition. This approach has democratized speech technology development, making it feasible to build high-quality systems for languages and applications with limited resources.
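
As a sketch of this workflow (assuming the Hugging Face Transformers API and a hypothetical small labeled dataset), a self-supervised Wav2Vec 2.0 encoder can be given a fresh CTC head and fine-tuned on labeled pairs while its convolutional feature encoder stays frozen:

from transformers import Wav2Vec2ForCTC

# Start from the self-supervised (not yet fine-tuned) checkpoint and add a new CTC head
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    vocab_size=32,                  # size of the task-specific character vocabulary (assumed)
    ctc_loss_reduction="mean",
)

# Keep the low-level convolutional feature encoder fixed; fine-tune the rest
model.freeze_feature_encoder()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
# The training loop itself (e.g., with transformers.Trainer on the small labeled set) is omitted here.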

Few-shot and zero-shot capabilities have become increasingly important features of advanced speech models. Few-shot learning allows models to perform new tasks with minimal task-specific examples, while zero-shot learning enables inference on completely new tasks without any specific examples. These capabilities are especially valuable for quickly adapting systems to new domains or languages without extensive data collection and annotation.

Recent research has demonstrated that techniques like self-supervised pretraining on large speech and text corpora, followed by task-specific fine-tuning, significantly boost performance of Speech LLMs. This two-stage approach leverages the power of large-scale unsupervised data while still allowing optimization for specific downstream applications.

Here's a Python example that demonstrates how to use the Transformers library for both Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) tasks.

from transformers import pipeline
import soundfile as sf

# Load the ASR model (Whisper)
asr_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Perform speech recognition
speech_file = "sample_audio.wav"  # Replace with your audio file
transcription = asr_pipeline(speech_file)
print("Transcription:", transcription["text"])

# Load the TTS model (Bark or other TTS models)
tts_pipeline = pipeline("text-to-speech", model="suno/bark-small")

# Convert the recognized text back to speech
text = transcription["text"]
audio_output = tts_pipeline(text)

# Save the generated speech; squeeze() drops the extra channel dimension
# some TTS models (including Bark) return, so soundfile receives a 1-D array
audio_path = "generated_speech.wav"
sf.write(audio_path, audio_output["audio"].squeeze(), audio_output["sampling_rate"])

print("TTS output saved to:", audio_path)
