Detecting Speaker Changes with Pyannote Segmentation 3.0 and ONNX Runtime

#python #onnx #audio #machinelearning

Hello, everyone.

When listening to a conversation, we naturally keep track of who is speaking. A program has a harder job: beyond finding speech, it must also determine where one speaker gives way to another.

Today, I will use an ONNX version of Pyannote Segmentation 3.0 to detect speaker changes in a two-person conversation and split the recording into one WAV file per utterance.

What I Tested

This lab uses FFmpeg to decode a roughly 14-second conversation into a 16 kHz mono waveform. It then combines the Pyannote segmentation model with simple post-processing to produce contiguous speaker segments.

I wanted to verify:

Whether six alternating utterances can be separated into six segments
Whether the detected speaker indexes remain consistent throughout the recording
Whether ONNX Runtime can process the audio faster than real time using only its CPU execution provider
Whether every segment can be saved as a separate WAV file

The complete code and reproducible environment are available in the pyannote-scd lab in kiarina/labs.

This test performs segmentation using the model's speaker indexes. It does not compare speaker embeddings or run clustering, so it is not a complete speaker diarization pipeline that identifies the same person throughout a long recording.

Reproducing the Lab

You will need:

mise
uv
FFmpeg
curl

The following commands fetch only this lab, download the shared test audio, and run it:

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/kiarina/labs.git
cd labs
git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml \
  2026/07/04/pyannote-scd
mise -C 2026/07/04/pyannote-scd run

On the first run, the task downloads the full-precision onnx/model.onnx file from onnx-community/pyannote-segmentation-3.0 on Hugging Face. uv then prepares the Python dependencies and runs the detector.

Model License

The onnx-community/pyannote-segmentation-3.0 model used in this test is listed under the MIT License by its distributor. Check the distributor's current license terms before using or redistributing the model.

How Speaker Segments Are Detected

The input is this shared test asset:

tests/assets/mp3/conversation_2speaker_14s_16k.mp3

The recording is spoken in Japanese and follows this scenario, shown here in English translation:

Speaker 1: Hello? Are you at the station already?

Speaker 2: Yeah. I just came through the ticket gate. How about you?

Speaker 1: I am still on the train. I think I will be there in about five minutes.

Speaker 2: Got it. I will wait in front of the café, then.

Speaker 1: Thanks. It is pretty cold today.

Speaker 2: Definitely. I am glad I brought my scarf.

FFmpeg decodes the MP3 into a 16 kHz mono waveform, which is passed to the model in 10-second windows. The final chunk is zero-padded to 10 seconds, and frames beyond the original audio duration are discarded after inference.
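
The decoding and windowing step can be sketched as follows. This is a minimal, dependency-free illustration rather than the lab's actual code: the function names are hypothetical, the FFmpeg flags are one common way to request raw 16 kHz mono PCM, and returning the number of valid samples per window is an assumed way to remember how much of the padded chunk is real audio.

```python
import array
import subprocess
import sys

SAMPLE_RATE = 16_000
WINDOW_SAMPLES = 10 * SAMPLE_RATE  # 10-second inference window


def decode_with_ffmpeg(path):
    """Decode an FFmpeg-readable file to 16 kHz mono floats in [-1, 1)."""
    command = [
        "ffmpeg", "-v", "error", "-i", str(path),
        "-f", "s16le", "-ac", "1", "-ar", str(SAMPLE_RATE), "pipe:1",
    ]
    raw = subprocess.run(command, check=True, capture_output=True).stdout
    pcm = array.array("h")
    pcm.frombytes(raw)
    if sys.byteorder == "big":
        pcm.byteswap()  # FFmpeg wrote little-endian samples
    return [sample / 32768.0 for sample in pcm]


def split_into_windows(samples):
    """Return (window, valid_samples) pairs; the last window is zero-padded."""
    windows = []
    for start in range(0, len(samples), WINDOW_SAMPLES):
        chunk = list(samples[start:start + WINDOW_SAMPLES])
        valid = len(chunk)
        chunk.extend([0.0] * (WINDOW_SAMPLES - valid))
        windows.append((chunk, valid))
    return windows
```

For the 226,736-sample input in this post, this yields two windows: one full window and one with 66,736 valid samples followed by zeros. The valid count can then be used to drop frames that fall beyond the original audio after inference.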

The detector uses these settings:

Setting	Value
sample rate	16 kHz
inference window	10 seconds
model speakers	3
maximum simultaneous speakers	2
active speaker threshold	0.5
overlap margin	0.1
minimum speaker change	100 ms
minimum speech segment	100 ms
execution provider	CPUExecutionProvider

For every frame, the model outputs log probabilities for seven classes covering silence, individual speakers, and pairs of speakers. This powerset representation is converted into probabilities for three speaker indexes. A speaker is considered active when its probability reaches 0.5.
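
A minimal sketch of the powerset conversion for a single frame is shown below. Two things here are assumptions, not facts taken from the lab: the class order (silence, then single speakers, then pairs, in `itertools.combinations` order) and the use of marginalization, where a speaker's probability is the sum of the probabilities of every class that contains that speaker. The function names are hypothetical.

```python
import math
from itertools import combinations

NUM_SPEAKERS = 3
MAX_SIMULTANEOUS = 2


def powerset_classes(num_speakers=NUM_SPEAKERS, max_simultaneous=MAX_SIMULTANEOUS):
    """List the speaker sets, assumed to be ordered by set size."""
    classes = []
    for size in range(max_simultaneous + 1):
        classes.extend(combinations(range(num_speakers), size))
    return classes  # (), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2)


def speaker_probabilities(log_probs):
    """Turn one frame of powerset log probabilities into per-speaker probabilities."""
    classes = powerset_classes()
    if len(log_probs) != len(classes):
        raise ValueError(f"expected {len(classes)} classes, got {len(log_probs)}")
    probs = [0.0] * NUM_SPEAKERS
    for log_prob, members in zip(log_probs, classes):
        probability = math.exp(log_prob)
        for speaker in members:
            probs[speaker] += probability
    return probs
```

With three speakers and at most two speaking at once, the class count is 1 + 3 + 3 = 7, which matches the seven outputs described above.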

When at least two speakers are active and the difference between the top two probabilities is no more than 0.1, the frame is labeled overlap. This rule only indicates that the probabilities are close; it does not prove that the source contains overlapping speech.
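
The threshold and overlap rule can be expressed per frame roughly like this. It is a sketch under stated assumptions: the function name and the zero-based `speaker_N` label format are my own choices for illustration, and falling back to the most probable speaker when two speakers are active but more than 0.1 apart is an assumption about a case the description does not spell out.

```python
ACTIVE_THRESHOLD = 0.5
OVERLAP_MARGIN = 0.1


def label_frame(speaker_probs):
    """Label one frame as 'silence', 'overlap', or 'speaker_<index>'."""
    active = [i for i, p in enumerate(speaker_probs) if p >= ACTIVE_THRESHOLD]
    if not active:
        return "silence"
    ranked = sorted(speaker_probs, reverse=True)
    # With two or more active speakers, the top two probabilities are both active.
    if len(active) >= 2 and ranked[0] - ranked[1] <= OVERLAP_MARGIN:
        return "overlap"
    top = max(active, key=lambda i: speaker_probs[i])
    return f"speaker_{top}"
```

For example, probabilities of 0.75 and 0.70 for two speakers produce `overlap`, while 0.95 and 0.55 produce the label of the first speaker because the gap exceeds the margin.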

Converting these probabilities directly into segments would create fragments around silence and brief fluctuations. The post-processing therefore assigns silent frames to adjacent speakers and merges speaker states shorter than 100 ms into neighboring segments. One model frame is approximately 16.978 ms in this run, so 100 ms corresponds to about six frames (100 / 16.978 ≈ 5.9).
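
One possible shape for this smoothing stage is sketched below. The details are assumptions rather than the lab's exact rules: silent runs between two speakers are split at the midpoint, leading and trailing silence goes to the only neighbor, and a short run is merged into its longer neighbor. All function names are hypothetical, and any non-silence label (including an overlap label) is treated like a speaker.

```python
import math
from itertools import groupby

FRAME_MS = 16.978
MIN_FRAMES = math.ceil(100.0 / FRAME_MS)  # 100 ms is about six frames


def runs(labels):
    """Run-length encode labels into [label, length] pairs."""
    return [[label, len(list(group))] for label, group in groupby(labels)]


def expand(run_list):
    """Inverse of runs(): turn [label, length] pairs back into frame labels."""
    return [label for label, length in run_list for _ in range(length)]


def fill_silence(labels, silence="silence"):
    """Hand every silent run to its neighboring speaker(s)."""
    encoded = runs(labels)
    filled = []
    for i, (label, length) in enumerate(encoded):
        if label != silence:
            filled.extend([label] * length)
            continue
        before = encoded[i - 1][0] if i > 0 else None
        after = encoded[i + 1][0] if i + 1 < len(encoded) else None
        if before is None and after is None:
            filled.extend([silence] * length)  # nothing but silence
        elif before is None:
            filled.extend([after] * length)
        elif after is None:
            filled.extend([before] * length)
        else:
            head = length // 2
            filled.extend([before] * head + [after] * (length - head))
    return filled


def merge_short_runs(labels, min_frames=MIN_FRAMES):
    """Absorb runs shorter than min_frames into the longer adjacent run."""
    encoded = runs(labels)
    while len(encoded) > 1:
        shortest = min(range(len(encoded)), key=lambda i: encoded[i][1])
        if encoded[shortest][1] >= min_frames:
            break
        left = encoded[shortest - 1] if shortest > 0 else None
        right = encoded[shortest + 1] if shortest + 1 < len(encoded) else None
        if right is None or (left is not None and left[1] >= right[1]):
            target = left
        else:
            target = right
        target[1] += encoded[shortest][1]
        del encoded[shortest]
        encoded = runs(expand(encoded))  # join neighbors that now share a label
    return expand(encoded)
```

Running `merge_short_runs(fill_silence(labels))` keeps the number of frames unchanged, so the smoothed labels still cover the whole recording.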

Finally, each segment is saved under output/ as a 16-bit PCM, 16 kHz, mono WAV file. Existing output is removed before each run.
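
Writing the segments needs nothing beyond the Python standard library. The sketch below assumes the waveform is a sequence of floats in [-1, 1] and that segments are `(start_seconds, end_seconds, label)` tuples; both the function name and that tuple layout are hypothetical. The file name pattern follows the names that appear in the results of this post.

```python
import shutil
import struct
import wave
from pathlib import Path

SAMPLE_RATE = 16_000


def save_segments(samples, segments, output_dir):
    """Write each segment as a 16-bit PCM, 16 kHz, mono WAV file."""
    out = Path(output_dir)
    if out.exists():
        shutil.rmtree(out)  # remove output from a previous run
    out.mkdir(parents=True)
    paths = []
    for number, (start_s, end_s, label) in enumerate(segments, start=1):
        first = round(start_s * SAMPLE_RATE)
        last = round(end_s * SAMPLE_RATE)
        # Scale to int16 and clip so out-of-range floats cannot overflow.
        pcm = [max(-32768, min(32767, round(x * 32767))) for x in samples[first:last]]
        path = out / f"{label}_{number:03d}.wav"
        with wave.open(str(path), "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)  # 2 bytes = 16-bit samples
            wav.setframerate(SAMPLE_RATE)
            wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
        paths.append(path)
    return paths
```

Because `shutil.rmtree` deletes the whole directory, the output path should point at a directory used only for generated files.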

Results

On a Mac Studio, the detector found six speaker segments in the 14.171-second input:

audio duration: 14.171s
SCD elapsed: 0.019s
SCD real-time factor: 0.001x
frame duration: 16.978ms
segments: 6
001:   0.000s -   1.851s ( 1.851s) speaker_2
002:   1.851s -   4.737s ( 2.886s) speaker_1
003:   4.737s -   7.317s ( 2.581s) speaker_2
004:   7.317s -   9.677s ( 2.360s) speaker_1
005:   9.677s -  11.834s ( 2.156s) speaker_2
006:  11.834s -  14.171s ( 2.337s) speaker_1
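
Timestamps like the ones in this log can be derived from per-frame labels by multiplying frame counts by the frame duration. The following is a sketch of that conversion under my own assumptions: the function name is hypothetical, and clamping the final end time to the audio duration is an assumed way to handle the last partial frame.

```python
from itertools import groupby


def frames_to_segments(labels, frame_s, duration_s):
    """Convert per-frame labels into (start_s, end_s, label) tuples."""
    segments = []
    position = 0
    for label, group in groupby(labels):
        length = len(list(group))
        start = position * frame_s
        # Never report a segment end beyond the original audio.
        end = min((position + length) * frame_s, duration_s)
        segments.append((start, end, label))
        position += length
    return segments
```

As a plausibility check, 109 frames × 16.978 ms ≈ 1.851 s, which is consistent with the first boundary in the log.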

The six segments cover the complete 14.171-second input without gaps or overlaps. Together, the generated files contain 226,736 samples, and every file is 16-bit PCM, 16 kHz, and mono.
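
The coverage claim can be checked with a few lines of arithmetic on the logged boundaries. The helper below is a hypothetical sketch: it verifies that each segment starts where the previous one ended and counts samples by rounding each boundary to a sample index.

```python
SAMPLE_RATE = 16_000
DURATION_S = 14.171

# Boundaries copied from the log in this post.
SEGMENTS = [
    (0.000, 1.851, "speaker_2"),
    (1.851, 4.737, "speaker_1"),
    (4.737, 7.317, "speaker_2"),
    (7.317, 9.677, "speaker_1"),
    (9.677, 11.834, "speaker_2"),
    (11.834, 14.171, "speaker_1"),
]


def check_coverage(segments, duration_s, tolerance_s=1e-9):
    """Return the total sample count if segments tile [0, duration_s]."""
    cursor = 0.0
    total = 0
    for start, end, _ in segments:
        if abs(start - cursor) > tolerance_s:
            raise ValueError(f"gap or overlap at {start:.3f}s")
        if end <= start:
            raise ValueError(f"empty segment at {start:.3f}s")
        total += round(end * SAMPLE_RATE) - round(start * SAMPLE_RATE)
        cursor = end
    if abs(cursor - duration_s) > tolerance_s:
        raise ValueError("segments do not reach the end of the audio")
    return total


print(check_coverage(SEGMENTS, DURATION_S))  # 226736
```

The total equals 14.171 s × 16,000 samples/s = 226,736 samples.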

Listening to the files and comparing them with the script produced the following mapping:

file	model output	speech
`speaker_2_001.wav`	`speaker_2`	もしもし、もう駅に着いた？ (Hello? Are you at the station already?)
`speaker_1_002.wav`	`speaker_1`	うん。今、改札出たところ。そっちは？ (Yeah. I just came through the ticket gate. How about you?)
`speaker_2_003.wav`	`speaker_2`	こっちはまだ電車。あと5分くらいかな。 (I am still on the train. I think I will be there in about five minutes.)
`speaker_1_004.wav`	`speaker_1`	了解。じゃあ、カフェの前で待ってるね。 (Got it. I will wait in front of the café, then.)
`speaker_2_005.wav`	`speaker_2`	助かる。今日は結構寒いね。 (Thanks. It is pretty cold today.)
`speaker_1_006.wav`	`speaker_1`	ほんと、マフラー持ってきて正解だった。 (Definitely. I am glad I brought my scarf.)

Each file corresponds to one utterance in the script. None was split in the middle, and none contained speech from the adjacent speaker. The model's speaker_2 index corresponded to Speaker 1, while speaker_1 corresponded to Speaker 2, with the indexes alternating consistently across all six segments.

The verification environment was:

machine: Mac Studio
chip: Apple M4 Max
memory: 128 GB
OS: macOS 26.5.1 (25F80), arm64
Python: 3.12.11
ONNX Runtime: 1.27.0
execution provider: CPUExecutionProvider

SCD elapsed measures only ONNX inference and speaker segment detection. It excludes model initialization, FFmpeg decoding, and WAV generation. This was a single run without a warm-up rather than a rigorous benchmark. Its real-time factor was about 0.0013 (0.019 s / 14.171 s), which the log rounds to 0.001x.
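
For reference, the real-time factor is the processing time divided by the audio duration, so values below 1 mean faster than real time. A minimal sketch, with hypothetical helper names, of measuring a call and computing that ratio:

```python
import time


def timed(function, *args):
    """Run function(*args) and return (result, elapsed_seconds)."""
    started = time.perf_counter()
    result = function(*args)
    return result, time.perf_counter() - started


def real_time_factor(elapsed_s, audio_s):
    """Processing time divided by audio duration."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return elapsed_s / audio_s


rtf = real_time_factor(0.019, 14.171)
print(f"{rtf:.3f}x")  # 0.001x
```

With the numbers from this run, the unrounded ratio is about 0.00134, or roughly 746 seconds of audio per second of processing for this single measurement.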

Interpreting the Results

For this recording, the six scripted utterances matched the six detected segments. Because silent frames were distributed between adjacent speakers instead of becoming separate files, the detector preserved the entire input while splitting it at speaker changes. That output should be convenient as preprocessing for transcription because each segment contains only one person's utterance.

Processing 14.171 seconds of audio in 0.019 seconds on a CPU is also encouraging. However, the measurement covers only inference and segment detection, and it is a single reference value. It does not represent end-to-end performance including file I/O, nor does it predict performance on another machine.

The speaker_1 and speaker_2 labels are not persistent identities. This implementation runs inference independently on 10-second windows and does not use speaker embeddings to match identities across them. The indexes happened to remain consistent across the 10-second boundary in this input, but that behavior is not guaranteed for other recordings.

The evaluation is also limited to one short, clean recording of two alternating speakers. It does not cover overlapping speech, conversations with three or more people, noise, or long recordings, and there are no timestamp-level ground-truth labels for a quantitative evaluation. The result establishes that this recording was split correctly under these settings, not that the approach will generalize unchanged to every conversation.

My Takeaway

What I find interesting about the Pyannote segmentation model is that it goes one step beyond VAD. Instead of only answering whether somebody is speaking, it provides enough information to locate speaker changes. In this short conversation, a simple threshold and smoothing stage was enough to produce clean utterance-level files. Running comfortably on a CPU through ONNX Runtime also makes it appealing for local processing.

At the same time, six clean output files can make the system look like a finished diarization pipeline. It is not: cross-window speaker identity matching is still missing. That distinction will matter much more with longer recordings.

Next, I would like to evaluate the boundaries with overlapping speech and three or more speakers, then add speaker embeddings and clustering so that the same person can keep a consistent identity throughout a long recording.