alfchee
Solving Audio Gaps in Real-Time Speech Translation

When building a real-time Speech-to-Speech (S2S) translation service, latency is usually the enemy everyone talks about. But there's a silent killer (quite literally) that can ruin the user experience just as effectively: audio gaps.

In our journey migrating from Flask to FastAPI and implementing Nvidia Riva, we encountered a persistent issue where our synthesized audio had audible stuttering—specifically, 20ms gaps of silence between chunks. Here’s how we diagnosed and fixed it, turning a robotic output into a smooth, natural conversation.

The Problem: "Machine Gun" Audio

Our pipeline looked standard:

  1. Receive user audio (WebSocket)
  2. Transcribe (ASR) & Translate (NMT)
  3. Synthesize speech (TTS)
  4. Stream audio back to the client

But the output sounded like a machine gun. Words were clear, but the flow was choppy. Opening the raw audio dump in Audacity revealed the culprit: consistent 20-50ms gaps of silence inserted between every audio chunk returned by the TTS service.
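The same diagnosis can be made reproducible without opening Audacity: a short script can scan a raw PCM dump for runs of near-silence. This is an illustrative sketch, not our production code; the function name, the amplitude threshold of 100, and the 10ms minimum gap are assumptions.

```python
import numpy as np

def find_silence_gaps(pcm: bytes, sample_rate: int = 24000,
                      threshold: int = 100, min_gap_ms: float = 10.0):
    """Return (start_ms, duration_ms) for every run of near-silence in 16-bit PCM."""
    samples = np.frombuffer(pcm, dtype=np.int16)
    silent = np.abs(samples) <= threshold
    # Pad with False so every silent run produces both a rising and a falling edge
    padded = np.concatenate(([False], silent, [False]))
    edges = np.flatnonzero(np.diff(padded.astype(np.int8)))
    gaps = []
    for start, end in zip(edges[::2], edges[1::2]):
        duration_ms = (end - start) * 1000.0 / sample_rate
        if duration_ms >= min_gap_ms:
            gaps.append((start * 1000.0 / sample_rate, duration_ms))
    return gaps
```

Running something like this over the dump prints exactly where each gap starts and how long it lasts, which makes it easy to confirm the 20-50ms pattern chunk after chunk.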

What Wasn't The Cause

  • Network Latency: The gaps were present even when saving to a local file.
  • Frontend Playback: The gaps existed in the source PCM data.
  • Sample Rates: 24kHz in, 24kHz out. No mismatches.

Root Cause Analysis

After a deep dive into the Riva TTS behavior, we found three contributing factors:

  1. Padding by Design: The TTS model often pads the beginning and end of synthesized audio with silence.
  2. Imperfect Silence: This "silence" wasn't always digital zero (0x00). It often contained low-amplitude noise (0x01, 0x02), meaning our simple if sample != 0 checks failed to detect it.
  3. Fragmented Synthesis: We were sending text to the TTS engine too aggressively (sentence by sentence or even phrase by phrase). Each request generated its own padding, compounding the issue.
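Point 2 is easy to verify in a few lines. Against a hypothetical padding buffer of low-amplitude noise (the sample values below are made up for illustration), a plain zero check flags nearly every sample as audio, while a threshold check classifies the whole run as silence:

```python
import numpy as np

# Hypothetical TTS padding: near-zero noise, not digital zero
padding = np.array([0, 1, -2, 1, 0, 2, -1], dtype=np.int16)

# Naive zero check: 5 of 7 samples look like "audio"
print(int(np.count_nonzero(padding != 0)))

# Threshold check: the entire run reads as silence
print(int(np.count_nonzero(np.abs(padding) > 100)))
```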

The Solution: A Three-Pronged Approach

1. Aggressive Silence Trimming

We moved from a simple "zero check" to a threshold-based trim. Since the audio is 16-bit PCM (range ±32767), samples with an absolute value under 100 are effectively inaudible, yet they are non-zero, which is exactly why "perfect silence" detection fails.

import numpy as np

# Before: only removed perfect zeros (ineffective against low-amplitude noise)
non_zero = np.where(audio_data != 0)[0]

# After: remove near-silence (threshold-based)
SILENCE_THRESHOLD = 100
non_silent = np.where(np.abs(audio_data) > SILENCE_THRESHOLD)[0]

if len(non_silent) > 0:
    # Keep only the audible part
    trimmed_audio = audio_data[non_silent[0] : non_silent[-1] + 1]

This immediately removed about 10-30ms of "dead air" per chunk.

2. Audio Crossfading (Windowing)

Even after trimming, stitching two audio clips together can cause a "click" if the waveform jumps instantly from one amplitude to another. We implemented a 5ms linear fade-out on every chunk.

import numpy as np

def apply_fade_out(audio_bytes: bytes, sample_rate: int = 24000, fade_ms: int = 5) -> bytes:
    # np.frombuffer over a bytes object is read-only, so copy before modifying
    audio_data = np.frombuffer(audio_bytes, dtype=np.int16).copy()
    num_samples = min(int(sample_rate * (fade_ms / 1000)), len(audio_data))

    # Linear ramp from 1.0 to 0.0
    fade_curve = np.linspace(1.0, 0.0, num_samples)

    # Apply to the very end of the chunk, casting back to int16
    audio_data[-num_samples:] = (audio_data[-num_samples:] * fade_curve).astype(np.int16)

    return audio_data.tobytes()

This acts like a micro-crossfade, ensuring every chunk ends at zero amplitude and eliminating the "click" sound at boundaries.
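The natural counterpart is a fade-in at the head of the next chunk, so that both sides of a boundary start and end at zero. This `apply_fade_in` sketch is not from our pipeline above, just the mirror image of `apply_fade_out` under the same assumptions:

```python
import numpy as np

def apply_fade_in(audio_bytes: bytes, sample_rate: int = 24000, fade_ms: int = 5) -> bytes:
    """Mirror of apply_fade_out: ramp the first few milliseconds up from zero."""
    audio_data = np.frombuffer(audio_bytes, dtype=np.int16).copy()
    num_samples = min(int(sample_rate * (fade_ms / 1000)), len(audio_data))

    # Linear ramp from 0.0 to 1.0 over the start of the chunk
    fade_curve = np.linspace(0.0, 1.0, num_samples)
    audio_data[:num_samples] = (audio_data[:num_samples] * fade_curve).astype(np.int16)

    return audio_data.tobytes()
```

At 5ms the two ramps overlap so little audio that the ear hears a seamless join rather than a dip in volume.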

3. Smart Text Aggregation

Finally, we stopped sending every partial sentence to the TTS engine. We increased our text buffer size to accumulate more context before requesting synthesis.

Before: Flush to TTS every 3 segments.
After: Flush to TTS every 5 segments (or on punctuation).

Fewer requests = fewer boundaries = fewer gaps.
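The buffering rule is simple enough to sketch. The class below is illustrative: the name `TextAggregator`, the segment limit of 5, and the punctuation set are assumptions matching the description above, not our actual implementation.

```python
import re

class TextAggregator:
    """Buffer translated segments; flush to TTS on enough context or end-of-sentence."""
    FLUSH_AFTER = 5                          # assumed segment limit
    SENTENCE_END = re.compile(r"[.!?]\s*$")  # assumed punctuation trigger

    def __init__(self):
        self.segments = []

    def push(self, segment: str):
        self.segments.append(segment)
        text = " ".join(self.segments)
        if len(self.segments) >= self.FLUSH_AFTER or self.SENTENCE_END.search(text):
            self.segments.clear()
            return text      # hand this off to a single TTS request
        return None          # keep buffering
```

Each returned string becomes one TTS request, so padding is generated once per sentence-sized utterance instead of once per fragment.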

Results

  • Audio Gaps: Eliminated.
  • Playback Smoothness: Indistinguishable from a single continuous file.
  • Latency Cost: Negligible (<2ms processing overhead).

Key Takeaway

In real-time audio, how you handle the bytes is just as important as the model generating them. Models are imperfect; your DSP pipeline needs to clean up the mess. Don't trust "silence" to be zero, and never stitch audio without smoothing the edges.
