When building a real-time Speech-to-Speech (S2S) translation service, latency is usually the enemy everyone talks about. But there's a silent killer (quite literally) that can ruin the user experience just as effectively: audio gaps.
In our journey migrating from Flask to FastAPI and integrating Nvidia Riva, we encountered a persistent issue: our synthesized audio had audible stuttering, with 20-50ms gaps of silence between chunks. Here’s how we diagnosed and fixed it, turning a robotic output into a smooth, natural conversation.
The Problem: "Machine Gun" Audio
Our pipeline looked standard:
- Receive user audio (WebSocket)
- Transcribe (ASR) & Translate (NMT)
- Synthesize speech (TTS)
- Stream audio back to the client
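In code, one pass through this pipeline looks roughly like the sketch below. The three helpers (`transcribe`, `translate`, `synthesize`) are placeholders standing in for the real Riva client calls, not Riva's actual API.

```python
# Stub sketch of one pass through the S2S pipeline.
# transcribe / translate / synthesize are placeholders, not Riva APIs.
def transcribe(audio_in: bytes) -> str:
    return "hola mundo"            # ASR placeholder

def translate(text: str) -> str:
    return "hello world"           # NMT placeholder

def synthesize(text: str) -> bytes:
    # TTS placeholder: returns 16-bit PCM bytes
    return b"\x00\x01" * len(text)

def handle_chunk(audio_in: bytes) -> bytes:
    """Receive audio, transcribe, translate, synthesize, return audio."""
    return synthesize(translate(transcribe(audio_in)))
```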
But the output sounded like a machine gun. Words were clear, but the flow was choppy. Opening the raw audio dump in Audacity revealed the culprit: consistent 20-50ms gaps of silence inserted between every audio chunk returned by the TTS service.
What Wasn't The Cause
- Network Latency: The gaps were present even when saving to a local file.
- Frontend Playback: The gaps existed in the source PCM data.
- Sample Rates: 24kHz in, 24kHz out. No mismatches.
Root Cause Analysis
After deep diving into the Riva TTS behavior, we found three contributing factors:
- Padding by Design: The TTS model often pads the beginning and end of synthesized audio with silence.
- Imperfect Silence: This "silence" wasn't always digital zero (`0x00`). It often contained low-amplitude noise (`0x01`, `0x02`), meaning our simple `if sample != 0` checks failed to detect it.
- Fragmented Synthesis: We were sending text to the TTS engine too aggressively (sentence by sentence, or even phrase by phrase). Each request generated its own padding, compounding the issue.
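A tiny illustrative example makes the second point concrete: padding that contains sample values of 1 or 2 defeats an exact-zero check, but falls below any sensible amplitude threshold. (The sample values below are invented for illustration.)

```python
import numpy as np

# Illustrative only: TTS padding "silence" that is not exact digital zero.
padding = np.array([0, 1, -2, 1, 0], dtype=np.int16)    # near-silent noise
speech  = np.array([500, -1200, 900], dtype=np.int16)   # audible samples
chunk = np.concatenate([padding, speech, padding])

nonzero = np.where(chunk != 0)[0]             # zero check: also flags the noise
audible = np.where(np.abs(chunk) > 100)[0]    # threshold: flags only the speech
```

The zero check flags 9 samples (including the noise); the threshold flags only the 3 speech samples.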
The Solution: A Three-Pronged Approach
1. Aggressive Silence Trimming
We moved from a simple "zero check" to a threshold-based trim. Since the audio is 16-bit PCM, samples with amplitude under 100 are effectively inaudible, yet they are enough to stop a "perfect silence" check from matching.
```python
# Before: only removed perfect zeros (ineffective)
non_zero = np.where(audio_data != 0)[0]

# After: remove near-silence (threshold-based)
SILENCE_THRESHOLD = 100
non_silent = np.where(np.abs(audio_data) > SILENCE_THRESHOLD)[0]
if len(non_silent) > 0:
    # Keep only the audible part
    trimmed_audio = audio_data[non_silent[0] : non_silent[-1] + 1]
```
This immediately removed about 10-30ms of "dead air" per chunk.
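Wrapped into a reusable helper, the trim looks like this (a sketch; `trim_silence` is our own naming, not a Riva API):

```python
import numpy as np

SILENCE_THRESHOLD = 100  # amplitudes below this on 16-bit PCM are inaudible

def trim_silence(audio_bytes: bytes, threshold: int = SILENCE_THRESHOLD) -> bytes:
    """Strip near-silent padding from both ends of a 16-bit PCM chunk."""
    audio = np.frombuffer(audio_bytes, dtype=np.int16)
    loud = np.where(np.abs(audio) > threshold)[0]
    if len(loud) == 0:
        return b""  # chunk is entirely padding; drop it
    return audio[loud[0] : loud[-1] + 1].tobytes()
```

Note that an all-padding chunk returns empty bytes, so it contributes nothing to the output stream instead of a burst of silence.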
2. Audio Crossfading (Windowing)
Even after trimming, stitching two audio clips together can cause a "click" if the waveform jumps instantly from one amplitude to another. We implemented a 5ms linear fade-out on every chunk.
```python
import numpy as np

def apply_fade_out(audio_bytes: bytes, sample_rate: int = 24000, fade_ms: int = 5) -> bytes:
    # np.frombuffer returns a read-only view, so copy before mutating
    audio_data = np.frombuffer(audio_bytes, dtype=np.int16).copy()
    num_samples = min(int(sample_rate * (fade_ms / 1000)), len(audio_data))
    if num_samples == 0:
        return audio_bytes
    # Linear ramp from 1.0 to 0.0
    fade_curve = np.linspace(1.0, 0.0, num_samples)
    # Apply to the very end of the chunk, casting back to int16
    audio_data[-num_samples:] = (audio_data[-num_samples:] * fade_curve).astype(np.int16)
    return audio_data.tobytes()
```
This acts like a micro-crossfade, ensuring every chunk ends at zero amplitude and eliminating the "click" sound at boundaries.
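To see the effect at a boundary, here is a self-contained sketch: `fade_tail` (our naming) applies the same linear ramp as `apply_fade_out`, but operates on NumPy arrays instead of bytes, and the stitched waveform hits exactly zero at the join.

```python
import numpy as np

def fade_tail(pcm: np.ndarray, n: int = 120) -> np.ndarray:
    """Linearly fade the last n samples (120 samples = 5ms at 24kHz)."""
    out = pcm.astype(np.float32)          # astype copies, so pcm is untouched
    n = min(n, len(out))
    out[-n:] *= np.linspace(1.0, 0.0, n)  # ramp from full volume to zero
    return out.astype(np.int16)

# Two synthesized chunks; fading each tail removes the click at the seam.
tone = (np.sin(np.linspace(0, 40, 480)) * 8000).astype(np.int16)
stitched = np.concatenate([fade_tail(tone), tone])
```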
3. Smart Text Aggregation
Finally, we stopped sending every partial sentence to the TTS engine. We increased our text buffer size to accumulate more context before requesting synthesis.
Before: Flush to TTS every 3 segments.
After: Flush to TTS every 5 segments (or on punctuation).
Fewer requests = fewer boundaries = fewer gaps.
Results
- Audio Gaps: Eliminated.
- Playback Smoothness: Indistinguishable from a single continuous file.
- Latency Cost: Negligible (<2ms processing overhead).
Key Takeaway
In real-time audio, how you handle the bytes is just as important as the model generating them. Models are imperfect; your DSP pipeline needs to clean up the mess. Don't trust "silence" to be zero, and never stitch audio without smoothing the edges.