Extracting Speech Segments with Silero VAD and ONNX Runtime

#python #onnx #audio #machinelearning

Hello, everyone.

Have you ever wanted to keep only the parts of a recording where someone is speaking?
Detecting and removing silence before transcription reduces downstream work and divides a long recording into more manageable pieces.

Today, I will use the ONNX model from Silero VAD to detect speech in a roughly 14-second conversation and extract each segment as a WAV file.

What I Tested

This lab uses FFmpeg to decode an MP3 conversation between two speakers into a 16 kHz mono waveform. It then feeds the audio to Silero VAD in 32 ms chunks.

I wanted to verify:

Whether ONNX Runtime can detect speech using only its CPU execution provider
How many segments are found in 14.171 seconds of audio
How long detection takes
Whether every detected segment can be saved as a separate WAV file

The complete code and reproducible environment are available in the silero-vad lab in kiarina/labs.

VAD stands for Voice Activity Detection. It determines whether speech is present, but it does not identify which of the two people is speaking. Speaker diarization is outside the scope of this test.

Reproducing the Lab

You will need:

mise
uv
FFmpeg
curl

The following commands fetch only this lab, download the shared test audio, and run it:

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/kiarina/labs.git
cd labs
git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml \
  2026/07/03/silero-vad
mise -C 2026/07/03/silero-vad run

On the first run, the task downloads silero_vad.onnx from the official Silero VAD repository. uv then prepares the Python dependencies and runs the detector.

Model License

The Silero VAD model used in this test is listed under the MIT License by its distributor. Check the distributor's current license terms before using or redistributing the model.

How Speech Segments Are Detected

The input is this shared test asset:

tests/assets/mp3/conversation_2speaker_14s_16k.mp3

The recording is in Japanese and follows this scenario, shown here in English translation:

Speaker 1: Hello? Are you at the station already?

Speaker 2: Yeah. I just came through the ticket gate. How about you?

Speaker 1: I am still on the train. I think I will be there in about five minutes.

Speaker 2: Got it. I will wait in front of the café, then.

Speaker 1: Thanks. It is pretty cold today.

Speaker 2: Definitely. I am glad I brought my scarf.

After FFmpeg decodes the file, the waveform is divided into chunks of 512 samples, or 32 ms. Silero VAD returns a speech probability for each chunk.
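The chunking step can be sketched in a few lines. This is a minimal illustration rather than the lab's code: it assumes the decoded waveform is already a sequence of floats, the function name split_into_chunks is my own, and zero-padding the last partial chunk is an assumption about how the tail could be handled.

```python
SAMPLE_RATE = 16_000   # Hz, as in the article
CHUNK_SAMPLES = 512    # 512 / 16000 s = 32 ms


def split_into_chunks(samples, chunk_size=CHUNK_SAMPLES):
    """Yield fixed-size chunks; the last one is zero-padded (an assumption)."""
    for start in range(0, len(samples), chunk_size):
        chunk = list(samples[start:start + chunk_size])
        if len(chunk) < chunk_size:
            chunk.extend([0.0] * (chunk_size - len(chunk)))
        yield chunk


# One second of silence stands in for the decoded waveform.
waveform = [0.0] * SAMPLE_RATE
chunks = list(split_into_chunks(waveform))
chunk_ms = 1000 * CHUNK_SAMPLES / SAMPLE_RATE
print(len(chunks), chunk_ms)  # 32 32.0
```

One second at 16 kHz is 31.25 chunks, so the final chunk is padded and the count rounds up to 32.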

The detector uses these settings:

Setting	Value
sample rate	16 kHz
chunk size	512 samples (32 ms)
speech threshold	0.5
negative threshold	0.35
minimum silence	100 ms
minimum speech	250 ms
speech padding	30 ms

The detector uses a threshold of 0.5 to start speech and a lower threshold of 0.35 to mark a possible end. This hysteresis prevents a probability fluctuating near one boundary from repeatedly opening and closing a segment.

A segment begins when the speech probability reaches 0.5. When the probability then drops below 0.35 and does not recover for at least 100 ms, the point where it first dropped becomes the end of the segment. Segments shorter than 250 ms are discarded, and 30 ms of padding is added at both ends.
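To make these rules concrete, here is a minimal sketch that turns a list of per-chunk probabilities into padded intervals. It is my own reading of the rules above, not the lab's implementation: the function name probabilities_to_segments is hypothetical, silence is counted up to the end of the current chunk, and a segment still open at the end of the audio is closed at the last sample.

```python
SAMPLE_RATE = 16_000
CHUNK = 512


def _ms_to_samples(ms):
    return SAMPLE_RATE * ms // 1000


def probabilities_to_segments(probs, start_thr=0.5, end_thr=0.35,
                              min_silence_ms=100, min_speech_ms=250, pad_ms=30):
    """Return (start_s, end_s) tuples from per-chunk speech probabilities."""
    min_silence = _ms_to_samples(min_silence_ms)
    min_speech = _ms_to_samples(min_speech_ms)
    pad = _ms_to_samples(pad_ms)
    total = len(probs) * CHUNK

    raw = []
    start = None        # sample index where the open segment began
    pending_end = None  # sample index where the probability first dropped
    for i, p in enumerate(probs):
        pos = i * CHUNK
        if p >= start_thr:
            pending_end = None          # speech resumed: cancel the pending end
            if start is None:
                start = pos
        elif p < end_thr and start is not None:
            if pending_end is None:
                pending_end = pos
            if pos + CHUNK - pending_end >= min_silence:
                raw.append((start, pending_end))
                start = pending_end = None
        # Values between the two thresholds change nothing (hysteresis).
    if start is not None:
        raw.append((start, total))      # assumption: close at end of audio

    kept = [(s, e) for s, e in raw if e - s >= min_speech]
    return [(max(0, s - pad) / SAMPLE_RATE, min(total, e + pad) / SAMPLE_RATE)
            for s, e in kept]


# Synthetic probabilities: a long burst with a brief dip to 0.4, then a short burst.
probs = [0.1] * 5 + [0.9] * 10 + [0.4] * 2 + [0.9] * 3 + [0.1] * 6 + [0.8] * 3 + [0.1] * 6
print(probabilities_to_segments(probs))  # [(0.13, 0.67)]
```

The dip to 0.4 never crosses the lower threshold, so the first burst stays one segment. The second burst lasts only 96 ms and is dropped by the 250 ms minimum.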

The chunks are not processed independently. The implementation carries the model's state and the previous 64 samples of context from one ONNX Runtime call into the next. This preserves temporal context while processing the input incrementally, as required for streaming.
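The carry-over pattern looks roughly like the sketch below. To keep it runnable without the model, the inference call is injected as a hypothetical callable infer(window, state) that returns a probability and the updated state. In the real lab, an ONNX Runtime session call would sit behind that callable, which I have not reproduced here. Starting with an all-zero context and skipping a trailing partial chunk are also assumptions.

```python
CHUNK = 512
CONTEXT = 64


def stream_probabilities(samples, infer, initial_state):
    """Yield one probability per chunk, carrying state and context forward.

    `infer(window, state) -> (probability, new_state)` is a hypothetical
    stand-in for the call that runs the model.
    """
    state = initial_state
    context = [0.0] * CONTEXT                 # assumption: start with silence
    for start in range(0, len(samples) - CHUNK + 1, CHUNK):
        chunk = list(samples[start:start + CHUNK])
        window = context + chunk              # 64 + 512 = 576 samples
        probability, state = infer(window, state)
        context = chunk[-CONTEXT:]            # tail of this chunk feeds the next
        yield probability


# A fake model that only records what it was given.
calls = []


def fake_infer(window, state):
    calls.append((len(window), window[0], state))
    return 0.0, state + 1


samples = [float(i) for i in range(2 * CHUNK)]
probabilities = list(stream_probabilities(samples, fake_infer, initial_state=0))
print(calls)  # [(576, 0.0, 0), (576, 448.0, 1)]
```

The second call starts with sample 448, which is the first of the last 64 samples of the previous chunk, and it receives the state returned by the first call.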

Finally, FFmpeg extracts each detected interval from the original MP3 and saves it under output/ as a 16-bit PCM, 16 kHz, mono WAV file. Existing output is removed before each run.
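One way to express that extraction step is to build an FFmpeg argument list per segment. This is a sketch under my own assumptions: the helper names and the exact flag choice (-ss with -t, applied as output options) are mine, and the lab may invoke FFmpeg differently.

```python
import subprocess
from pathlib import Path


def build_extract_command(source, start_s, end_s, destination):
    """Build an FFmpeg argument list that cuts one interval to a WAV file."""
    return [
        "ffmpeg",
        "-y",                              # overwrite an existing output file
        "-i", str(source),
        "-ss", f"{start_s:.3f}",           # seek to the segment start
        "-t", f"{end_s - start_s:.3f}",    # keep only the segment duration
        "-ac", "1",                        # mono
        "-ar", "16000",                    # 16 kHz
        "-c:a", "pcm_s16le",               # 16-bit PCM
        str(destination),
    ]


def extract_segment(source, start_s, end_s, destination):
    """Run FFmpeg; raises CalledProcessError if the command fails."""
    Path(destination).parent.mkdir(parents=True, exist_ok=True)
    command = build_extract_command(source, start_s, end_s, destination)
    subprocess.run(command, check=True)


command = build_extract_command(
    "conversation_2speaker_14s_16k.mp3", 0.162, 1.726,
    Path("output") / "speech_001.wav",
)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with file names, and check=True turns a failed extraction into an exception instead of a silently missing file.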

Results

On a Mac Studio, the detector found 12 speech segments in the 14.171-second input:

audio duration: 14.171s
VAD elapsed: 0.028s
VAD real-time factor: 0.002x
speech segments: 12
001:   0.162s -   1.726s ( 1.564s)
002:   2.050s -   2.462s ( 0.412s)
003:   2.626s -   3.934s ( 1.308s)
004:   4.130s -   4.638s ( 0.508s)
005:   4.930s -   6.014s ( 1.084s)
006:   6.178s -   7.262s ( 1.084s)
007:   7.458s -   7.998s ( 0.540s)
008:   8.194s -   9.598s ( 1.404s)
009:   9.794s -  10.430s ( 0.636s)
010:  10.530s -  11.806s ( 1.276s)
011:  11.970s -  12.510s ( 0.540s)
012:  12.610s -  14.171s ( 1.561s)

The run created 12 files, from speech_001.wav through speech_012.wav. I also verified that every file is 16-bit PCM, 16 kHz, and mono. Together, the extracted segments contain about 11.917 seconds of audio, or 84.1% of the input.
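Both checks are easy to reproduce with the standard library. The sketch below recomputes the total and the coverage from the intervals printed above, and shows one possible format check with the wave module. The helper name is_expected_wav is my own and is not taken from the lab.

```python
import wave

AUDIO_S = 14.171
SEGMENTS = [
    (0.162, 1.726), (2.050, 2.462), (2.626, 3.934), (4.130, 4.638),
    (4.930, 6.014), (6.178, 7.262), (7.458, 7.998), (8.194, 9.598),
    (9.794, 10.430), (10.530, 11.806), (11.970, 12.510), (12.610, 14.171),
]

total_s = sum(end - start for start, end in SEGMENTS)
coverage = 100 * total_s / AUDIO_S
print(f"{total_s:.3f}s, {coverage:.1f}%")  # 11.917s, 84.1%


def is_expected_wav(path, rate=16_000, channels=1, sample_width=2):
    """Return True if the file is 16-bit PCM, 16 kHz, mono by default."""
    with wave.open(str(path), "rb") as wav:
        actual = (wav.getframerate(), wav.getnchannels(), wav.getsampwidth())
    return actual == (rate, channels, sample_width)
```

A sample width of 2 bytes corresponds to 16-bit PCM. Running is_expected_wav over every file in output/ would repeat the format check described above.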

Listening to the files and comparing them with the Japanese script produced the following mapping:

file	speech
`speech_001.wav`	もしもし、もう駅に着いた？ (Hello? Are you at the station already?)
`speech_002.wav`	うん。 (Yeah.)
`speech_003.wav`	今、改札出たところ。 (I just came through the ticket gate.)
`speech_004.wav`	そっちは？ (How about you?)
`speech_005.wav`	こっちはまだ電車。 (I am still on the train.)
`speech_006.wav`	あと5分くらいかな。 (I think I will be there in about five minutes.)
`speech_007.wav`	了解。 (Got it.)
`speech_008.wav`	じゃあ、カフェの前で待ってるね。 (I will wait in front of the café, then.)
`speech_009.wav`	助かる。 (Thanks.)
`speech_010.wav`	今日は結構寒いね。 (It is pretty cold today.)
`speech_011.wav`	ほんと、 (Definitely.)
`speech_012.wav`	マフラー持ってきて正解だった。 (I am glad I brought my scarf.)

The recording was cleanly divided at natural pauses corresponding to periods and question marks. The one exception is the final ほんと、, which ends in a comma but still became a separate segment because of the short pause that followed it. Short responses such as “yeah,” “got it,” “thanks,” and “definitely” were preserved.

The verification environment was:

machine: Mac Studio (Mac16,9)
chip: Apple M4 Max, 16 cores (12 performance + 4 efficiency)
memory: 128 GB
OS: macOS 26.5.1 (25F80), arm64
Python: 3.12.11
ONNX Runtime: 1.27.0
execution provider: CPUExecutionProvider

VAD elapsed measures only Silero VAD inference and segment detection. It excludes model initialization, FFmpeg decoding, and WAV extraction. This was a single run without a warm-up, not a rigorous benchmark. The measured real-time factor was 0.002x.
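The real-time factor is simply processing time divided by audio duration. The sketch below recomputes it from the numbers above and shows one way to time a single call. The helper names are mine, and the lab's own timing code may differ.

```python
import time


def real_time_factor(elapsed_s, audio_s):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return elapsed_s / audio_s


def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed seconds)."""
    started = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - started


rtf = real_time_factor(0.028, 14.171)
print(f"{rtf:.3f}x")        # 0.002x
print(f"{1 / rtf:.0f}x")    # 506x faster than real time
```

Wrapping only the inference and segmentation call in timed would match the scope described above, with model loading and FFmpeg work left outside the measurement.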

Interpreting the Results

Under these conditions, the CPU processed 14.171 seconds of audio in 0.028 seconds. That leaves substantial headroom for both offline processing and applications that consume microphone input incrementally. Processing time will vary by machine, so this number should not be treated as a universal benchmark.

The implementation does not simply treat every 32 ms prediction as an independent segment. Separate start and end thresholds, a 100 ms silence requirement, minimum segment duration, and padding turn the probability sequence into intervals that are more useful downstream. In a real application, these segmentation rules can affect the result as much as model inference itself.

For this input, the segment boundaries followed the script's punctuation and natural conversational pauses, and the listening check found no dropped content. Short acknowledgments also survived the 250 ms minimum-speech filter: the shortest extracted segment, うん (speech_002.wav), is 0.412 s long including padding.

The recording does not include timestamp-level ground-truth annotations, so I did not calculate precision or recall. The listening comparison against the script was successful, but that does not make these settings optimal for every recording. Data containing noise, music, whispers, long pauses, or overlapping speakers would require threshold tuning and comparison against labeled examples.

My Takeaway

Silero VAD feels like a practical small component to place before speech recognition. The ONNX model is about 2.3 MB, runs comfortably on a CPU, and exposes a straightforward core operation: pass in a chunk and receive a probability.

At the same time, VAD output is raw material for segmentation rather than a finished audio split. Whether an application should preserve short acknowledgments or combine speech into longer sentence-like pieces changes what values such as 100 ms and 250 ms should mean.
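As an illustration of how much that choice matters, here is a hypothetical post-processing step that joins segments separated by a short gap. Neither the function merge_close_segments nor the 150 ms value comes from the lab. They are only there to show the effect on the intervals measured above.

```python
def merge_close_segments(segments, max_gap_s):
    """Join consecutive (start, end) segments whose gap is at most max_gap_s."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] <= max_gap_s:
            merged[-1] = (merged[-1][0], end)   # extend the previous segment
        else:
            merged.append((start, end))
    return merged


SEGMENTS = [
    (0.162, 1.726), (2.050, 2.462), (2.626, 3.934), (4.130, 4.638),
    (4.930, 6.014), (6.178, 7.262), (7.458, 7.998), (8.194, 9.598),
    (9.794, 10.430), (10.530, 11.806), (11.970, 12.510), (12.610, 14.171),
]

merged = merge_close_segments(SEGMENTS, max_gap_s=0.15)
print(len(merged))   # 10
print(merged[-2:])   # [(9.794, 11.806), (11.97, 14.171)]
```

With a 150 ms limit, only the two 100 ms gaps close: 助かる。 joins 今日は結構寒いね。, and ほんと、 joins マフラー持ってきて正解だった。. Every other boundary stays where the detector put it.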

Next, I would like to compare how the intervals change across thresholds, recording environments, and artificially added noise. I am also interested in measuring how this lightweight preprocessing affects end-to-end speed and accuracy when combined with transcription and speaker diarization.