Nariaki Wada

Posted on Jul 5 • Edited on Jul 11

Grouping Utterances by Speaker with ECAPA-TDNN and ONNX Runtime

#python #onnx #audio #machinelearning

Hello, everyone.

Splitting a conversation into utterances is useful, but it still leaves an important question unanswered: which utterances came from the same person? Even without identifying anyone by name, grouping the same voice together makes the structure of a conversation much easier to work with.

Today, I will generate a speaker embedding for each utterance with an ONNX version of ECAPA-TDNN, then group utterances from two speakers using cosine similarity.

What I Tested

In my previous test, I detected speaker changes (speaker change detection, SCD) with Pyannote Segmentation 3.0 and split a roughly 14-second conversation into six utterances. This test passes each utterance through ECAPA-TDNN to produce a 192-dimensional speaker embedding, then assigns the embeddings to groups in chronological order.

I wanted to verify:

Whether every utterance consistently produces a 192-dimensional speaker embedding
Whether a simple sequential threshold algorithm can collect the three utterances from each speaker without knowing the number of speakers in advance
How much separation appears between same-speaker and different-speaker cosine similarities
Whether ONNX Runtime can process the audio faster than real time using only its CPU execution provider

The complete code and reproducible environment are available in the ecapa-tdnn-onnx lab in kiarina/labs.

Reproducing the Lab

You will need:

mise
uv
FFmpeg
curl

The following commands fetch only this lab, download the shared test audio, and run it:

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/kiarina/labs.git
cd labs
git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml \
  2026/07/05/ecapa-tdnn-onnx
mise -C 2026/07/05/ecapa-tdnn-onnx run

On the first run, the task downloads two models from Hugging Face:

SCD: onnx-community/pyannote-segmentation-3.0
Speaker embedding: pranjal-pravesh/ecapa_tdnn_onnx

The ECAPA-TDNN model is pinned to commit 04c3ffe4fd00b3b7853fd57db44e2e531d4817f2. Both the task and the inference code verify that its SHA-256 matches:

245eb5995cfffd74494862dee33da2b00c1c2579eb0c6703847784e9901ed458
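A check of this kind can be written with Python's standard hashlib module. The following is a minimal sketch, not the lab's actual code: the function names are made up for illustration, and only the expected digest comes from this article.

```python
import hashlib
from pathlib import Path

# SHA-256 published in this article for the pinned ECAPA-TDNN model file.
EXPECTED_SHA256 = "245eb5995cfffd74494862dee33da2b00c1c2579eb0c6703847784e9901ed458"


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path: Path, expected: str = EXPECTED_SHA256) -> None:
    """Raise ValueError if the file's digest differs from the expected one."""
    actual = sha256_of_file(path)
    if actual != expected:
        raise ValueError(f"SHA-256 mismatch for {path}: got {actual}")
```

Calling verify_model on the downloaded model file before creating the inference session would stop the run early if the file had changed or was only partially downloaded.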

Model Licenses

Both onnx-community/pyannote-segmentation-3.0 and pranjal-pravesh/ecapa_tdnn_onnx, as used in this test, are listed under the MIT License by their respective distributors. Check each distributor's current license terms before using or redistributing the models.

Comparing Utterances with Speaker Embeddings

The input is a shared test recording in which two people alternate for three turns each:

tests/assets/mp3/conversation_2speaker_14s_16k.mp3

Pyannote Segmentation 3.0 and the same post-processing from the previous lab divide it into these six utterances:

segment	time	expected speaker
1	0.000–1.851 s	Speaker 1
2	1.851–4.737 s	Speaker 2
3	4.737–7.317 s	Speaker 1
4	7.317–9.677 s	Speaker 2
5	9.677–11.834 s	Speaker 1
6	11.834–14.171 s	Speaker 2

The expected result is therefore two groups containing segments 1, 3, 5 and 2, 4, 6.
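To make the step from segment times to per-utterance audio concrete, here is a small sketch that converts the boundaries in the table into sample ranges at 16 kHz. It assumes the decoded waveform is a simple sequence of samples and that boundaries are rounded to the nearest sample; both choices, and the function names, are my assumptions rather than details of the lab.

```python
SAMPLE_RATE = 16_000  # the test recording is 16 kHz mono

# (start, end) in seconds, copied from the segment table.
SEGMENTS = [
    (0.000, 1.851),
    (1.851, 4.737),
    (4.737, 7.317),
    (7.317, 9.677),
    (9.677, 11.834),
    (11.834, 14.171),
]


def to_sample_range(start_s, end_s, sample_rate=SAMPLE_RATE):
    """Convert a time span in seconds to a half-open sample index range."""
    return round(start_s * sample_rate), round(end_s * sample_rate)


def slice_segments(waveform, segments=SEGMENTS, sample_rate=SAMPLE_RATE):
    """Return one slice of the waveform per segment, in order."""
    ranges = [to_sample_range(s, e, sample_rate) for s, e in segments]
    return [waveform[a:b] for a, b in ranges]
```

Because each segment starts where the previous one ends, the slices cover the recording without gaps or overlap.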

The ECAPA-TDNN model accepts a raw 16 kHz mono waveform. Its graph includes Fbank feature extraction, input normalization, and ECAPA-TDNN inference, producing an output with shape [1, 1, 192]. The implementation flattens this into a 192-element vector and applies L2 normalization.
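The flatten-and-normalize step could look roughly like this. It is a plain-Python sketch rather than the lab's code: flatten, l2_normalize, and the stand-in raw_output values are invented for illustration, and a real pipeline would more likely work on the NumPy array returned by ONNX Runtime.

```python
import math


def flatten(nested):
    """Flatten nested lists, e.g. shape [1, 1, 192] -> a flat list of 192 values."""
    if isinstance(nested, (list, tuple)):
        return [x for item in nested for x in flatten(item)]
    return [float(nested)]


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


# Stand-in for a model output of shape [1, 1, 192]; the values are made up.
raw_output = [[[float(i % 7) - 3.0 for i in range(192)]]]
embedding = l2_normalize(flatten(raw_output))

print(len(embedding))                                  # 192
print(round(math.sqrt(sum(x * x for x in embedding)), 6))  # 1.0
```

Normalizing once, right after inference, means every later comparison can use a plain dot product.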

The main settings are:

Setting	Value
sample rate	16 kHz
embedding dimension	192
embedding normalization	L2
similarity	cosine similarity
speaker similarity threshold	0.45
execution provider	CPUExecutionProvider

The grouping algorithm processes utterances one at a time in chronological order. For a new embedding, it calculates cosine similarity against every embedding already stored in every group and takes the maximum score.

If the maximum similarity is at least 0.45, add the embedding to that group
If it is below 0.45, create a new group

Because the vectors are normalized, their dot product is their cosine similarity. The comparison uses the closest member rather than a group centroid, and the number of speakers is not provided.
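The procedure described above can be sketched in a few lines of plain Python. This is an illustration under my own naming (dot, group_sequentially, and the toy 2-D vectors are not from the lab), and it assumes the embeddings are already L2-normalized.

```python
import math


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))


def group_sequentially(embeddings, threshold=0.45):
    """Assign each embedding, in order, to the group of its most similar
    earlier embedding, or open a new group if no score reaches the threshold.

    Returns (group_ids, best_scores); best_scores[0] is None.
    """
    groups = []       # groups[g] holds the member embeddings of group g
    group_ids = []
    best_scores = []
    for emb in embeddings:
        best_group, best_score = None, None
        for g, members in enumerate(groups):
            for member in members:
                score = dot(emb, member)
                if best_score is None or score > best_score:
                    best_group, best_score = g, score
        if best_score is not None and best_score >= threshold:
            groups[best_group].append(emb)
            group_ids.append(best_group)
        else:
            groups.append([emb])
            group_ids.append(len(groups) - 1)
        best_scores.append(best_score)
    return group_ids, best_scores


def unit(degrees):
    """Toy 2-D unit vector at the given angle, standing in for an embedding."""
    r = math.radians(degrees)
    return [math.cos(r), math.sin(r)]


ids, scores = group_sequentially([unit(0), unit(90), unit(20), unit(80)])
print(ids)  # [0, 1, 0, 1]
```

Comparing against every stored member keeps the code simple, at the cost of a number of comparisons that grows with the count of utterances seen so far.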

This approach is easy to implement, but its result depends on both utterance order and the threshold. If an utterance is assigned to the wrong speaker, its embedding also becomes a candidate for later comparisons and can propagate the mistake.

Results

In one run on a Mac Studio without a warm-up, the six utterances were divided into the expected two groups:

audio duration: 14.171s
SCD elapsed: 0.019s
SCD real-time factor: 0.001x
embedding elapsed: 0.114s
embedding real-time factor: 0.008x
similarity threshold: 0.450
segments: 6
groups: 2
1:   0.000s -   1.851s ( 1.851s) group=0 best_score=  new
2:   1.851s -   4.737s ( 2.886s) group=1 best_score=0.095
3:   4.737s -   7.317s ( 2.581s) group=0 best_score=0.459
4:   7.317s -   9.677s ( 2.360s) group=1 best_score=0.556
5:   9.677s -  11.834s ( 2.156s) group=0 best_score=0.582
6:  11.834s -  14.171s ( 2.337s) group=1 best_score=0.595

Comparing the assignments with the known speakers gives:

group	segments	expected speaker
0	1, 3, 5	Speaker 1
1	2, 4, 6	Speaker 2

Every embedding had 192 dimensions, and every L2 norm was 1.0 within floating-point error after normalization. All six generated utterance files were 16 kHz mono PCM WAV files.
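A format check like the one mentioned for the utterance files can be done with Python's standard wave module. The sketch below is not the lab's code; the function name is hypothetical, and the 16-bit sample width is my assumption, since the article only states 16 kHz mono PCM.

```python
import wave


def is_16k_mono_pcm16(path):
    """True if the WAV file is 16 kHz, one channel, 16-bit samples."""
    with wave.open(str(path), "rb") as wav:
        return (
            wav.getframerate() == 16_000
            and wav.getnchannels() == 1
            and wav.getsampwidth() == 2  # bytes per sample; 16-bit is an assumption
        )
```

The wave module reads uncompressed PCM only, so a file in another encoding would raise wave.Error instead of returning False.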

The complete pairwise cosine similarity matrix was:

segment	1	2	3	4	5	6
1	1.000	0.095	0.459	0.156	0.453	0.130
2	0.095	1.000	0.276	0.556	0.216	0.561
3	0.459	0.276	1.000	0.249	0.582	0.220
4	0.156	0.556	0.249	1.000	0.175	0.595
5	0.453	0.216	0.582	0.175	1.000	0.173
6	0.130	0.561	0.220	0.595	0.173	1.000

Same-speaker pairs ranged from 0.453 to 0.595, while different-speaker pairs ranged from 0.095 to 0.276. The ranges did not overlap in this recording: the lowest same-speaker score was 0.177 above the highest different-speaker score, which indicates that the embeddings captured the speaker difference.
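These ranges can be recomputed from the published matrix. The sketch below copies the matrix and the expected speaker labels from this article; the variable names are my own.

```python
from itertools import combinations

# Pairwise cosine similarities copied from the matrix (row/column = segment 1..6).
SIM = [
    [1.000, 0.095, 0.459, 0.156, 0.453, 0.130],
    [0.095, 1.000, 0.276, 0.556, 0.216, 0.561],
    [0.459, 0.276, 1.000, 0.249, 0.582, 0.220],
    [0.156, 0.556, 0.249, 1.000, 0.175, 0.595],
    [0.453, 0.216, 0.582, 0.175, 1.000, 0.173],
    [0.130, 0.561, 0.220, 0.595, 0.173, 1.000],
]
# Expected speaker for segments 1..6, from the segment table.
SPEAKER = [1, 2, 1, 2, 1, 2]

same, diff = [], []
for i, j in combinations(range(len(SIM)), 2):
    (same if SPEAKER[i] == SPEAKER[j] else diff).append(SIM[i][j])

same_range = (min(same), max(same))
diff_range = (min(diff), max(diff))
gap = min(same) - max(diff)

print(f"same speaker:      {same_range[0]:.3f} to {same_range[1]:.3f}")  # 0.453 to 0.595
print(f"different speaker: {diff_range[0]:.3f} to {diff_range[1]:.3f}")  # 0.095 to 0.276
print(f"gap between ranges: {gap:.3f}")                                  # 0.177
```

With six utterances there are only six same-speaker pairs and nine different-speaker pairs, so these ranges describe this recording rather than the model in general.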

The verification environment was:

machine: Mac Studio (Mac16,9)
chip: Apple M4 Max
memory: 128 GB
OS: macOS 26.5.1 (25F80), arm64
Python: 3.12.11
NumPy: 2.5.1
ONNX Runtime: 1.27.0
execution provider: CPUExecutionProvider

The "SCD elapsed" value covers SCD inference and segment detection. The "embedding elapsed" value covers ECAPA-TDNN inference and L2 normalization for all six segments. Model initialization, FFmpeg decoding, and WAV and JSON generation are excluded. This was a single reference run rather than a rigorous benchmark.
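The real-time factors in the log are elapsed processing time divided by audio duration, so values below 1.0 mean faster than real time. A short sketch using the numbers from the run above (the function name is mine):

```python
def real_time_factor(elapsed_s, audio_duration_s):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_duration_s <= 0:
        raise ValueError("audio duration must be positive")
    return elapsed_s / audio_duration_s


AUDIO_DURATION_S = 14.171  # from the log above

scd_rtf = real_time_factor(0.019, AUDIO_DURATION_S)
embedding_rtf = real_time_factor(0.114, AUDIO_DURATION_S)

print(f"SCD real-time factor: {scd_rtf:.3f}x")              # 0.001x
print(f"embedding real-time factor: {embedding_rtf:.3f}x")  # 0.008x
```

Both factors use the full recording duration as the denominator, which matches how the log reports them.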

Interpreting the Results

For this input, ECAPA-TDNN embeddings and a simple sequential threshold were enough to group both speakers correctly without specifying the speaker count. The previous SCD-only experiment relied on temporary speaker indexes produced by the segmentation model. This time, the groups were built by comparing the voices themselves, which is a useful step toward linking one person across a longer recording or independently extracted utterances.

Performance was also encouraging: generating embeddings for all six utterances took 0.114 seconds, for a real-time factor of 0.008x relative to the input duration. The measurement is limited in scope, but it is fast enough to consider for local conversation processing on an Apple Silicon CPU.

The 0.45 threshold should not be treated as robust, however. Segment 3 joined group 0 with a similarity of 0.459, leaving a margin of only 0.009. The same-speaker similarity between segments 1 and 5 was also only 0.453. A small change in recording conditions or speech content could cause these pairs to split into separate groups.

Lowering the threshold creates the opposite risk: merging different people. The highest different-speaker score was only 0.276 here, leaving ample separation in this sample, but similar voices or noisier recordings may not preserve that gap. The value 0.45 worked for this one recording; it is not a general recommendation.
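One way to see how sensitive the result is to the threshold is to replay the sequential assignment on the published similarity matrix with a few different values. This is a sketch under my own naming (replay is hypothetical), and the thresholds other than 0.45 were picked only for illustration.

```python
# Pairwise cosine similarities copied from the results (row/column = segment 1..6).
SIM = [
    [1.000, 0.095, 0.459, 0.156, 0.453, 0.130],
    [0.095, 1.000, 0.276, 0.556, 0.216, 0.561],
    [0.459, 0.276, 1.000, 0.249, 0.582, 0.220],
    [0.156, 0.556, 0.249, 1.000, 0.175, 0.595],
    [0.453, 0.216, 0.582, 0.175, 1.000, 0.173],
    [0.130, 0.561, 0.220, 0.595, 0.173, 1.000],
]


def replay(sim, threshold):
    """Nearest-member sequential grouping driven by a precomputed matrix."""
    group_ids = []
    for i in range(len(sim)):
        if i == 0:
            group_ids.append(0)
            continue
        # Most similar earlier segment, regardless of its group.
        best_j = max(range(i), key=lambda j: sim[i][j])
        if sim[i][best_j] >= threshold:
            group_ids.append(group_ids[best_j])
        else:
            group_ids.append(max(group_ids) + 1)
    return group_ids


for t in (0.05, 0.30, 0.45, 0.46, 0.60):
    print(f"{t:.2f}", replay(SIM, t))
# 0.05 [0, 0, 0, 0, 0, 0]
# 0.30 [0, 1, 0, 1, 0, 1]
# 0.45 [0, 1, 0, 1, 0, 1]
# 0.46 [0, 1, 2, 1, 2, 1]
# 0.60 [0, 1, 2, 3, 4, 5]
```

In this replay, raising the threshold by just 0.01 splits Speaker 1 into two groups, while lowering it leaves the result unchanged for a while because each utterance's nearest earlier neighbor here is always from the same speaker. That tolerance comes from this particular recording and ordering and should not be expected in general.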

The evaluation is also limited to one short, clean recording with two speakers. It does not cover overlapping speech, three or more speakers, very short utterances, noise, reverberation, or microphone changes. Errors in the SCD boundaries would also affect the embeddings and grouping, for example if one segment contained voices from two people.

My Takeaway

There is something satisfying about turning each voice into 192 numbers, taking dot products, and watching alternating utterances fall back into the correct two groups. The model also includes acoustic feature extraction, so the path from raw waveform to speaker embedding stays entirely inside ONNX Runtime.

Still, the first same-speaker match barely cleared the threshold. This result shows that ECAPA-TDNN is promising for this pipeline, but it does not show that fixing the threshold at 0.45 completes speaker diarization.

Audio embedding models differ according to the characteristics they are designed to capture, such as speaker identity, speech content, environmental sounds, or music. Next, I would like to try embedding models beyond speaker recognition and explore how each one represents sound and can be applied to search and classification.