<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nariaki Wada</title>
    <description>The latest articles on DEV Community by Nariaki Wada (@kiarina).</description>
    <link>https://dev.to/kiarina</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3328007%2Ff28ae1f1-a65a-4aaa-bc43-8ae5c53d997d.png</url>
      <title>DEV Community: Nariaki Wada</title>
      <link>https://dev.to/kiarina</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kiarina"/>
    <language>en</language>
    <item>
      <title>Detecting Speaker Changes with Pyannote Segmentation 3.0 and ONNX Runtime</title>
      <dc:creator>Nariaki Wada</dc:creator>
      <pubDate>Sun, 05 Jul 2026 09:26:56 +0000</pubDate>
      <link>https://dev.to/kiarina/detecting-speaker-changes-with-pyannote-segmentation-30-and-onnx-runtime-1hh7</link>
      <guid>https://dev.to/kiarina/detecting-speaker-changes-with-pyannote-segmentation-30-and-onnx-runtime-1hh7</guid>
      <description>&lt;p&gt;Hello, everyone.&lt;/p&gt;

&lt;p&gt;When listening to a conversation, we naturally keep track of who is speaking. A program has a harder job: beyond finding speech, it must also determine where one speaker gives way to another.&lt;/p&gt;

&lt;p&gt;Today, I will use an ONNX version of Pyannote Segmentation 3.0 to detect speaker changes in a two-person conversation and split the recording into one WAV file per utterance.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Tested
&lt;/h2&gt;

&lt;p&gt;This lab uses FFmpeg to decode a roughly 14-second conversation into a 16 kHz mono waveform. It then combines the Pyannote segmentation model with simple post-processing to produce contiguous speaker segments.&lt;/p&gt;

&lt;p&gt;I wanted to verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether six alternating utterances can be separated into six segments&lt;/li&gt;
&lt;li&gt;Whether the detected speaker indexes remain consistent throughout the recording&lt;/li&gt;
&lt;li&gt;Whether ONNX Runtime can process the audio faster than real time using only its CPU execution provider&lt;/li&gt;
&lt;li&gt;Whether every segment can be saved as a separate WAV file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The complete code and reproducible environment are available in the &lt;a href="https://github.com/kiarina/labs/tree/main/2026/07/04/pyannote-scd" rel="noopener noreferrer"&gt;pyannote-scd lab in kiarina/labs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This test performs segmentation using the model's speaker indexes. It does not compare speaker embeddings or run clustering, so it is not a complete speaker diarization pipeline that identifies the same person throughout a long recording.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproducing the Lab
&lt;/h2&gt;

&lt;p&gt;You will need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://mise.jdx.dev/" rel="noopener noreferrer"&gt;mise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;uv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ffmpeg.org/" rel="noopener noreferrer"&gt;FFmpeg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;curl&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following commands fetch only this lab, download the shared test audio, and run it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 &lt;span class="nt"&gt;--filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;blob:none &lt;span class="nt"&gt;--sparse&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  https://github.com/kiarina/labs.git
&lt;span class="nb"&gt;cd &lt;/span&gt;labs
git sparse-checkout &lt;span class="nb"&gt;set&lt;/span&gt; .gitignore .mise/tasks Makefile mise.toml &lt;span class="se"&gt;\&lt;/span&gt;
  2026/07/04/pyannote-scd
make download-test-assets
mise &lt;span class="nt"&gt;-C&lt;/span&gt; 2026/07/04/pyannote-scd run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the first run, the task downloads the full-precision &lt;code&gt;onnx/model.onnx&lt;/code&gt; file from &lt;a href="https://huggingface.co/onnx-community/pyannote-segmentation-3.0" rel="noopener noreferrer"&gt;&lt;code&gt;onnx-community/pyannote-segmentation-3.0&lt;/code&gt;&lt;/a&gt; on Hugging Face. &lt;code&gt;uv&lt;/code&gt; then prepares the Python dependencies and runs the detector.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Speaker Segments Are Detected
&lt;/h2&gt;

&lt;p&gt;The input is this shared test asset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;assets/mp3/conversation_2speaker_14s_16k.mp3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The recording follows this scenario:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Speaker 1:&lt;/strong&gt; Hello? Are you at the station already?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speaker 2:&lt;/strong&gt; Yeah. I just came through the ticket gate. How about you?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speaker 1:&lt;/strong&gt; I am still on the train. I think I will be there in about five minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speaker 2:&lt;/strong&gt; Got it. I will wait in front of the café, then.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speaker 1:&lt;/strong&gt; Thanks. It is pretty cold today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speaker 2:&lt;/strong&gt; Definitely. I am glad I brought my scarf.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;FFmpeg decodes the MP3 into a 16 kHz mono waveform, which is passed to the model in 10-second windows. The final window is zero-padded to 10 seconds, and frames beyond the original audio duration are discarded after inference.&lt;/p&gt;
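
&lt;p&gt;The windowing step can be sketched roughly as follows. This is a minimal illustration, not the lab's code: the names &lt;code&gt;split_into_windows&lt;/code&gt; and &lt;code&gt;frames_to_keep&lt;/code&gt; are hypothetical, and scaling the frame count by the unpadded length is an assumption about how the padded frames are dropped.&lt;/p&gt;

```python
WINDOW_SECONDS = 10
SAMPLE_RATE = 16_000
WINDOW_SIZE = WINDOW_SECONDS * SAMPLE_RATE


def split_into_windows(samples, window_size=WINDOW_SIZE):
    """Split a mono waveform into fixed-size windows, zero-padding the last one.

    Returns (windows, valid_lengths) so padded frames can be dropped later.
    """
    windows = []
    valid_lengths = []
    for start in range(0, len(samples), window_size):
        window = list(samples[start:start + window_size])
        valid_lengths.append(len(window))
        # Pad the final, shorter window with zeros up to the full size.
        window.extend([0.0] * (window_size - len(window)))
        windows.append(window)
    return windows, valid_lengths


def frames_to_keep(num_frames, valid_length, window_size=WINDOW_SIZE):
    """Number of model frames that cover real (unpadded) audio (assumed rule)."""
    return round(num_frames * valid_length / window_size)


# 14.171 s at 16 kHz is 226,736 samples: one full window plus a partial one.
waveform = [0.0] * 226_736
windows, valid_lengths = split_into_windows(waveform)
print(len(windows), valid_lengths)
```

&lt;p&gt;For the 14.171-second test file, this yields two windows, and only the second one needs padding.&lt;/p&gt;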

&lt;p&gt;The detector uses these settings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;sample rate&lt;/td&gt;
&lt;td&gt;16 kHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;inference window&lt;/td&gt;
&lt;td&gt;10 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;model speakers&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;maximum simultaneous speakers&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;active speaker threshold&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;overlap margin&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;minimum speaker change&lt;/td&gt;
&lt;td&gt;100 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;minimum speech segment&lt;/td&gt;
&lt;td&gt;100 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;execution provider&lt;/td&gt;
&lt;td&gt;CPUExecutionProvider&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For every frame, the model outputs log probabilities for seven classes covering silence, individual speakers, and pairs of speakers. This powerset representation is converted into probabilities for three speaker indexes. A speaker is considered active when its probability reaches &lt;code&gt;0.5&lt;/code&gt;.&lt;/p&gt;
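
&lt;p&gt;One way to perform this conversion is to exponentiate the log probabilities and sum, for each speaker, every class that contains that speaker. The sketch below assumes the class order is silence, then single speakers, then pairs; both that order and the function name &lt;code&gt;speaker_probabilities&lt;/code&gt; are assumptions for illustration and may differ from the lab's implementation.&lt;/p&gt;

```python
import math
from itertools import combinations

NUM_SPEAKERS = 3
MAX_SIMULTANEOUS = 2

# Assumed class order: silence, then single speakers, then speaker pairs.
POWERSET_CLASSES = [
    combo
    for size in range(MAX_SIMULTANEOUS + 1)
    for combo in combinations(range(NUM_SPEAKERS), size)
]


def speaker_probabilities(log_probs):
    """Convert seven powerset log probabilities into three speaker probabilities."""
    if len(log_probs) != len(POWERSET_CLASSES):
        raise ValueError("expected one log probability per powerset class")
    probs = [0.0] * NUM_SPEAKERS
    for log_prob, members in zip(log_probs, POWERSET_CLASSES):
        for speaker in members:
            # A speaker is active in every class that contains it.
            probs[speaker] += math.exp(log_prob)
    return probs


# One frame where the second speaker dominates, with a little mass on pairs.
frame = [math.log(p) for p in (0.05, 0.05, 0.70, 0.05, 0.10, 0.025, 0.025)]
print(POWERSET_CLASSES)
print([round(p, 3) for p in speaker_probabilities(frame)])
```

&lt;p&gt;Because pair classes contribute to two speakers at once, the three resulting probabilities do not need to sum to one.&lt;/p&gt;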

&lt;p&gt;When at least two speakers are active and the difference between the top two probabilities is no more than &lt;code&gt;0.1&lt;/code&gt;, the frame is labeled &lt;code&gt;overlap&lt;/code&gt;. This rule only indicates that the probabilities are close; it does not prove that the source contains overlapping speech.&lt;/p&gt;
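
&lt;p&gt;The threshold and overlap rule can be expressed as a small per-frame function. This is a sketch under stated assumptions: &lt;code&gt;label_frame&lt;/code&gt; is a hypothetical name, and the 1-based &lt;code&gt;speaker_N&lt;/code&gt; naming is a guess based on the labels in the output.&lt;/p&gt;

```python
def label_frame(probs, threshold=0.5, overlap_margin=0.1):
    """Label one frame as "silence", "overlap", or "speaker_N" (1-based, assumed)."""
    active = [i for i, p in enumerate(probs) if p >= threshold]
    if not active:
        return "silence"
    # Rank active speakers from most to least probable.
    ranked = sorted(active, key=lambda i: probs[i], reverse=True)
    if len(ranked) >= 2 and overlap_margin >= probs[ranked[0]] - probs[ranked[1]]:
        # Two active speakers with close probabilities.
        return "overlap"
    return f"speaker_{ranked[0] + 1}"


for probs in ([0.1, 0.9, 0.2], [0.1, 0.2, 0.3], [0.8, 0.75, 0.0], [0.9, 0.6, 0.0]):
    print(probs, label_frame(probs))
```

&lt;p&gt;In the last example both speakers are active, but the gap of 0.3 exceeds the margin, so the frame goes to the more probable speaker.&lt;/p&gt;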

&lt;p&gt;Converting these probabilities directly into segments would create fragments around silence and brief fluctuations. The post-processing assigns silent frames to adjacent speakers and merges speaker states shorter than 100 ms into neighboring segments. One model frame is approximately 16.978 ms in this run, so 100 ms corresponds to about six frames.&lt;/p&gt;
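
&lt;p&gt;A minimal sketch of this kind of smoothing is shown below. The article does not specify how silence is divided or which neighbor absorbs a short run, so two choices here are assumptions: silent runs are split at their midpoint, and short runs are merged into the preceding run. All function names are hypothetical.&lt;/p&gt;

```python
from itertools import groupby

FRAME_MS = 16.978
MIN_FRAMES = round(100 / FRAME_MS)  # 100 ms is about six frames


def runs(labels):
    """Run-length encode labels into [label, length] pairs."""
    return [[label, len(list(group))] for label, group in groupby(labels)]


def fill_silence(labels):
    """Give silent frames to neighboring speakers (midpoint split is assumed)."""
    encoded = runs(labels)
    out = []
    for i, (label, length) in enumerate(encoded):
        if label != "silence":
            out.extend([label] * length)
            continue
        prev_label = encoded[i - 1][0] if i > 0 else None
        next_label = encoded[i + 1][0] if len(encoded) > i + 1 else None
        if prev_label is None and next_label is None:
            out.extend([label] * length)  # nothing but silence
        elif prev_label is None:
            out.extend([next_label] * length)  # leading silence
        elif next_label is None:
            out.extend([prev_label] * length)  # trailing silence
        else:
            first_half = length // 2
            out.extend([prev_label] * first_half)
            out.extend([next_label] * (length - first_half))
    return out


def merge_short_runs(labels, min_frames=MIN_FRAMES):
    """Absorb runs shorter than min_frames into the preceding run (assumed rule)."""
    merged = []
    for label, length in runs(labels):
        if merged and (min_frames > length or merged[-1][0] == label):
            merged[-1][1] += length
        else:
            merged.append([label, length])
    # A short leading run has no predecessor, so hand it to the run that follows.
    if len(merged) >= 2 and min_frames > merged[0][1]:
        merged[1][1] += merged[0][1]
        merged.pop(0)
    return [label for label, length in merged for _ in range(length)]


raw = (["speaker_2"] * 10 + ["silence"] * 4 + ["speaker_1"] * 8
       + ["speaker_2"] * 2 + ["speaker_1"] * 10)
smoothed = merge_short_runs(fill_silence(raw))
print(runs(smoothed))
```

&lt;p&gt;In this toy input, the four silent frames are shared between the two speakers and the two-frame &lt;code&gt;speaker_2&lt;/code&gt; blip is absorbed, leaving two contiguous runs.&lt;/p&gt;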

&lt;p&gt;Finally, each segment is saved under &lt;code&gt;output/&lt;/code&gt; as a 16-bit PCM, 16 kHz, mono WAV file. Existing output is removed before each run.&lt;/p&gt;
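
&lt;p&gt;Writing a segment in this format needs only Python's standard &lt;code&gt;wave&lt;/code&gt; module. The sketch below assumes the waveform is a sequence of floats in the range -1.0 to 1.0; &lt;code&gt;write_wav&lt;/code&gt; and &lt;code&gt;save_segment&lt;/code&gt; are hypothetical helpers, not functions from the lab.&lt;/p&gt;

```python
import os
import tempfile
import wave


def write_wav(path, samples, sample_rate=16_000):
    """Write float samples in [-1.0, 1.0] as a 16-bit PCM mono WAV file."""
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))
        pcm += int(round(clipped * 32767)).to_bytes(2, "little", signed=True)
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))


def save_segment(waveform, start_s, end_s, path, sample_rate=16_000):
    """Cut [start_s, end_s) out of the waveform and save it; returns the sample count."""
    start = round(start_s * sample_rate)
    end = round(end_s * sample_rate)
    write_wav(path, waveform[start:end], sample_rate)
    return end - start


# Demo with two seconds of silence written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "speaker_2_001.wav")
    print(save_segment([0.0] * 32_000, 0.0, 1.851, target))
```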

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;On a Mac Studio, the detector found six speaker segments in the 14.171-second input:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;audio duration: 14.171s
SCD elapsed: 0.019s
SCD real-time factor: 0.001x
frame duration: 16.978ms
segments: 6
001:   0.000s -   1.851s ( 1.851s) speaker_2
002:   1.851s -   4.737s ( 2.886s) speaker_1
003:   4.737s -   7.317s ( 2.581s) speaker_2
004:   7.317s -   9.677s ( 2.360s) speaker_1
005:   9.677s -  11.834s ( 2.156s) speaker_2
006:  11.834s -  14.171s ( 2.337s) speaker_1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The six segments cover the complete 14.171-second input without gaps or overlaps. Together, the generated files contain 226,736 samples, and every file is 16-bit PCM, 16 kHz, and mono.&lt;/p&gt;

&lt;p&gt;Listening to the files and comparing them with the Japanese script (English translations in parentheses) produced the following mapping:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;file&lt;/th&gt;
&lt;th&gt;model output&lt;/th&gt;
&lt;th&gt;speech&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speaker_2_001.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;speaker_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;もしもし、もう駅に着いた？ (Hello? Are you at the station already?)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speaker_1_002.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;speaker_1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;うん。今、改札出たところ。そっちは？ (Yeah. I just came through the ticket gate. How about you?)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speaker_2_003.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;speaker_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;こっちはまだ電車。あと5分くらいかな。 (I am still on the train. I think I will be there in about five minutes.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speaker_1_004.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;speaker_1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;了解。じゃあ、カフェの前で待ってるね。 (Got it. I will wait in front of the café, then.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speaker_2_005.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;speaker_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;助かる。今日は結構寒いね。 (Thanks. It is pretty cold today.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speaker_1_006.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;speaker_1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ほんと、マフラー持ってきて正解だった。 (Definitely. I am glad I brought my scarf.)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each file corresponds to one utterance in the script. None was split in the middle, and none contained speech from the adjacent speaker. The model's &lt;code&gt;speaker_2&lt;/code&gt; index corresponded to Speaker 1, while &lt;code&gt;speaker_1&lt;/code&gt; corresponded to Speaker 2, with the indexes alternating consistently across all six segments.&lt;/p&gt;

&lt;p&gt;The verification environment was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;machine: Mac Studio&lt;/li&gt;
&lt;li&gt;chip: Apple M4 Max&lt;/li&gt;
&lt;li&gt;memory: 128 GB&lt;/li&gt;
&lt;li&gt;OS: macOS 26.5.1 (25F80), arm64&lt;/li&gt;
&lt;li&gt;Python: 3.12.11&lt;/li&gt;
&lt;li&gt;ONNX Runtime: 1.27.0&lt;/li&gt;
&lt;li&gt;execution provider: CPUExecutionProvider&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SCD elapsed&lt;/code&gt; measures only ONNX inference and speaker segment detection. It excludes model initialization, FFmpeg decoding, and WAV generation. The real-time factor was about &lt;code&gt;0.001x&lt;/code&gt; (0.019 seconds of processing for 14.171 seconds of audio), but this was a single run without a warm-up, not a rigorous benchmark.&lt;/p&gt;
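
&lt;p&gt;The real-time factor is simply processing time divided by audio duration. The sketch below reproduces the reported value from the two numbers in the log; &lt;code&gt;measure&lt;/code&gt; is a hypothetical helper that shows one way to time only the step of interest with &lt;code&gt;time.perf_counter&lt;/code&gt;.&lt;/p&gt;

```python
import time


def real_time_factor(elapsed_s, audio_s):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return elapsed_s / audio_s


def measure(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


rtf = real_time_factor(0.019, 14.171)
print(f"SCD real-time factor: {rtf:.3f}x")
```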

&lt;h2&gt;
  
  
  Interpreting the Results
&lt;/h2&gt;

&lt;p&gt;For this recording, the six scripted utterances matched the six detected segments. Because silent frames were distributed between adjacent speakers instead of becoming separate files, the detector preserved the entire input while splitting it at speaker changes. That output should be convenient as preprocessing for transcription because each segment contains only one person's utterance.&lt;/p&gt;

&lt;p&gt;Processing 14.171 seconds of audio in 0.019 seconds on a CPU is also encouraging. However, the measurement covers only inference and segment detection, and it is a single reference value. It does not represent end-to-end performance including file I/O, nor does it predict performance on another machine.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;speaker_1&lt;/code&gt; and &lt;code&gt;speaker_2&lt;/code&gt; labels are not persistent identities. This implementation runs inference independently on 10-second windows and does not use speaker embeddings to match identities across them. The indexes happened to remain consistent across the 10-second boundary in this input, but that behavior is not guaranteed for other recordings.&lt;/p&gt;

&lt;p&gt;The evaluation is also limited to one short, clean recording of two alternating speakers. It does not cover overlapping speech, conversations with three or more people, noise, or long recordings, and there are no timestamp-level ground-truth labels for a quantitative evaluation. The result establishes that this recording was split correctly under these settings, not that the approach will generalize unchanged to every conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Takeaway
&lt;/h2&gt;

&lt;p&gt;What I find interesting about the Pyannote segmentation model is that it goes one step beyond VAD. Instead of only answering whether somebody is speaking, it provides enough information to locate speaker changes. In this short conversation, a simple threshold and smoothing stage was enough to produce clean utterance-level files. Running comfortably on a CPU through ONNX Runtime also makes it appealing for local processing.&lt;/p&gt;

&lt;p&gt;At the same time, six clean output files can make the system look like a finished diarization pipeline. It is not: cross-window speaker identity matching is still missing. That distinction will matter much more with longer recordings.&lt;/p&gt;

&lt;p&gt;Next, I would like to evaluate the boundaries with overlapping speech and three or more speakers, then add speaker embeddings and clustering so that the same person can keep a consistent identity throughout a long recording.&lt;/p&gt;

</description>
      <category>python</category>
      <category>onnx</category>
      <category>audio</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Extracting Speech Segments with Silero VAD and ONNX Runtime</title>
      <dc:creator>Nariaki Wada</dc:creator>
      <pubDate>Sun, 05 Jul 2026 07:45:13 +0000</pubDate>
      <link>https://dev.to/kiarina/extracting-speech-segments-with-silero-vad-and-onnx-runtime-3h8a</link>
      <guid>https://dev.to/kiarina/extracting-speech-segments-with-silero-vad-and-onnx-runtime-3h8a</guid>
      <description>&lt;p&gt;Hello, everyone.&lt;/p&gt;

&lt;p&gt;Have you ever wanted to keep only the parts of a recording where someone is speaking?&lt;br&gt;
Finding silence before transcription can reduce downstream work and divide a long recording into more manageable pieces.&lt;/p&gt;

&lt;p&gt;Today, I will use the ONNX model from &lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;Silero VAD&lt;/a&gt; to detect speech in a roughly 14-second conversation and extract each segment as a WAV file.&lt;/p&gt;
&lt;h2&gt;
  
  
  What I Tested
&lt;/h2&gt;

&lt;p&gt;This lab uses FFmpeg to decode an MP3 conversation between two speakers into a 16 kHz mono waveform. It then feeds the audio to Silero VAD in 32 ms chunks.&lt;/p&gt;

&lt;p&gt;I wanted to verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether ONNX Runtime can detect speech using only its CPU execution provider&lt;/li&gt;
&lt;li&gt;How many segments are found in 14.171 seconds of audio&lt;/li&gt;
&lt;li&gt;How long detection takes&lt;/li&gt;
&lt;li&gt;Whether every detected segment can be saved as a separate WAV file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The complete code and reproducible environment are available in the &lt;a href="https://github.com/kiarina/labs/tree/main/2026/07/03/silero-vad" rel="noopener noreferrer"&gt;silero-vad lab in kiarina/labs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;VAD stands for Voice Activity Detection. It determines whether speech is present, but it does not identify which of the two people is speaking. Speaker diarization is outside the scope of this test.&lt;/p&gt;
&lt;h2&gt;
  
  
  Reproducing the Lab
&lt;/h2&gt;

&lt;p&gt;You will need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://mise.jdx.dev/" rel="noopener noreferrer"&gt;mise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;uv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ffmpeg.org/" rel="noopener noreferrer"&gt;FFmpeg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;curl&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following commands fetch only this lab, download the shared test audio, and run it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 &lt;span class="nt"&gt;--filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;blob:none &lt;span class="nt"&gt;--sparse&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  https://github.com/kiarina/labs.git
&lt;span class="nb"&gt;cd &lt;/span&gt;labs
git sparse-checkout &lt;span class="nb"&gt;set&lt;/span&gt; .gitignore .mise/tasks Makefile mise.toml &lt;span class="se"&gt;\&lt;/span&gt;
  2026/07/03/silero-vad
make download-test-assets
mise &lt;span class="nt"&gt;-C&lt;/span&gt; 2026/07/03/silero-vad run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the first run, the task downloads &lt;code&gt;silero_vad.onnx&lt;/code&gt; from the official Silero VAD repository. &lt;code&gt;uv&lt;/code&gt; then prepares the Python dependencies and runs the detector.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Speech Segments Are Detected
&lt;/h2&gt;

&lt;p&gt;The input is this shared test asset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;assets/mp3/conversation_2speaker_14s_16k.mp3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The recording follows this scenario:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Speaker 1:&lt;/strong&gt; Hello? Are you at the station already?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speaker 2:&lt;/strong&gt; Yeah. I just came through the ticket gate. How about you?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speaker 1:&lt;/strong&gt; I am still on the train. I think I will be there in about five minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speaker 2:&lt;/strong&gt; Got it. I will wait in front of the café, then.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speaker 1:&lt;/strong&gt; Thanks. It is pretty cold today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speaker 2:&lt;/strong&gt; Definitely. I am glad I brought my scarf.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After FFmpeg decodes the file, the waveform is divided into chunks of 512 samples, or 32 ms. Silero VAD returns a speech probability for each chunk.&lt;/p&gt;

&lt;p&gt;The detector uses these settings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;sample rate&lt;/td&gt;
&lt;td&gt;16 kHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;chunk size&lt;/td&gt;
&lt;td&gt;512 samples (32 ms)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;speech threshold&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;negative threshold&lt;/td&gt;
&lt;td&gt;0.35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;minimum silence&lt;/td&gt;
&lt;td&gt;100 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;minimum speech&lt;/td&gt;
&lt;td&gt;250 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;speech padding&lt;/td&gt;
&lt;td&gt;30 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The detector uses a threshold of &lt;code&gt;0.5&lt;/code&gt; to start speech and a lower threshold of &lt;code&gt;0.35&lt;/code&gt; to mark a possible end. This hysteresis prevents a probability fluctuating near one boundary from repeatedly opening and closing a segment.&lt;/p&gt;

&lt;p&gt;A segment begins when the speech probability reaches &lt;code&gt;0.5&lt;/code&gt;. Once the probability stays below &lt;code&gt;0.35&lt;/code&gt; for at least 100 ms, that position becomes the end. Segments shorter than 250 ms are discarded, and 30 ms of padding is added at both ends.&lt;/p&gt;
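
&lt;p&gt;These rules can be sketched as a small state machine over the per-chunk probabilities. This is an illustrative version, not the lab's code: &lt;code&gt;detect_segments&lt;/code&gt; is a hypothetical name, and details such as ending the segment where the probability first dropped and clamping the padding to the audio bounds are assumptions.&lt;/p&gt;

```python
SAMPLE_RATE = 16_000
CHUNK = 512


def detect_segments(probs, total_samples, threshold=0.5, neg_threshold=0.35,
                    min_silence_ms=100, min_speech_ms=250, pad_ms=30):
    """Turn per-chunk speech probabilities into (start, end) sample ranges."""
    min_silence = SAMPLE_RATE * min_silence_ms // 1000
    min_speech = SAMPLE_RATE * min_speech_ms // 1000
    pad = SAMPLE_RATE * pad_ms // 1000

    segments = []
    start = None        # sample where the open segment began
    pending_end = None  # sample where the probability first dropped (assumed end)
    for index, prob in enumerate(probs):
        position = index * CHUNK
        if prob >= threshold:
            pending_end = None  # speech resumed, cancel the tentative end
            if start is None:
                start = position
        elif start is not None and neg_threshold > prob:
            if pending_end is None:
                pending_end = position
            if position + CHUNK - pending_end >= min_silence:
                segments.append((start, pending_end))
                start = pending_end = None
        # Probabilities between the two thresholds change nothing (hysteresis).
    if start is not None:
        segments.append((start, total_samples))  # still speaking at the end

    padded = []
    for seg_start, seg_end in segments:
        if seg_end - seg_start >= min_speech:
            padded.append((max(0, seg_start - pad), min(total_samples, seg_end + pad)))
    return padded


# Synthetic input: a long burst of speech, then a 96 ms blip that is discarded.
probs = [0.1] * 6 + [0.9] * 47 + [0.1] * 10 + [0.9] * 3 + [0.1] * 10
for seg_start, seg_end in detect_segments(probs, len(probs) * CHUNK):
    print(f"{seg_start / SAMPLE_RATE:.3f}s - {seg_end / SAMPLE_RATE:.3f}s")
```

&lt;p&gt;With 32 ms chunks, a 100 ms silence requirement is only met after four consecutive low chunks (128 ms), which is one reason the exact boundaries depend on chunk size as well as on the thresholds.&lt;/p&gt;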

&lt;p&gt;The chunks are not processed independently. The implementation carries the state tensor returned by the model and the previous 64 samples of context into the next chunk. This preserves temporal context while processing the input incrementally, as required for streaming.&lt;/p&gt;
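
&lt;p&gt;The loop structure might look roughly like this. To keep the sketch runnable without the model file, the ONNX session is replaced by a hypothetical callable &lt;code&gt;infer(window, state)&lt;/code&gt; that returns a probability and a new state; zero-padding the final chunk is also an assumption.&lt;/p&gt;

```python
CHUNK = 512
CONTEXT = 64


def run_streaming(samples, infer, initial_state):
    """Feed 512-sample chunks to `infer`, carrying state and 64 samples of context.

    `infer(window, state)` is a hypothetical stand-in for the ONNX session call.
    It receives context + chunk and returns (speech_probability, new_state).
    """
    state = initial_state
    context = [0.0] * CONTEXT
    probs = []
    for start in range(0, len(samples), CHUNK):
        chunk = list(samples[start:start + CHUNK])
        chunk.extend([0.0] * (CHUNK - len(chunk)))  # zero-pad the last chunk
        prob, state = infer(context + chunk, state)
        probs.append(prob)
        context = chunk[-CONTEXT:]  # tail of this chunk becomes the next context
    return probs


def fake_infer(window, state):
    """Toy model: reports mean energy and counts calls in `state`."""
    energy = sum(x * x for x in window) / len(window)
    return min(1.0, energy), state + 1


probs = run_streaming([0.0] * 226_736, fake_infer, initial_state=0)
print(len(probs))
```

&lt;p&gt;Because each call depends on the state from the previous one, the chunks must be processed in order rather than in parallel.&lt;/p&gt;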

&lt;p&gt;Finally, FFmpeg extracts each detected interval from the original MP3 and saves it under &lt;code&gt;output/&lt;/code&gt; as a 16-bit PCM, 16 kHz, mono WAV file. Existing output is removed before each run.&lt;/p&gt;
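
&lt;p&gt;An extraction step like this can be driven from Python by building an FFmpeg command line. The exact options the lab uses are not shown in this article, so the flags below are an assumption: they are standard FFmpeg options for cutting a time range and writing 16-bit PCM, 16 kHz, mono output. &lt;code&gt;ffmpeg_extract_command&lt;/code&gt; and &lt;code&gt;extract&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
import subprocess


def ffmpeg_extract_command(source, start_s, end_s, destination):
    """Build an FFmpeg command that cuts [start_s, end_s] into a WAV file."""
    return [
        "ffmpeg", "-y",
        "-i", str(source),
        "-ss", f"{start_s:.3f}",   # segment start in seconds
        "-to", f"{end_s:.3f}",     # segment end in seconds
        "-ac", "1",                # mono
        "-ar", "16000",            # 16 kHz
        "-c:a", "pcm_s16le",       # 16-bit PCM
        str(destination),
    ]


def extract(source, start_s, end_s, destination):
    """Run the command; requires FFmpeg on PATH."""
    subprocess.run(ffmpeg_extract_command(source, start_s, end_s, destination), check=True)


command = ffmpeg_extract_command(
    "assets/mp3/conversation_2speaker_14s_16k.mp3", 0.162, 1.726, "output/speech_001.wav"
)
print(" ".join(command))
```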

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;On a Mac Studio, the detector found 12 speech segments in the 14.171-second input:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;audio duration: 14.171s
VAD elapsed: 0.028s
VAD real-time factor: 0.002x
speech segments: 12
001:   0.162s -   1.726s ( 1.564s)
002:   2.050s -   2.462s ( 0.412s)
003:   2.626s -   3.934s ( 1.308s)
004:   4.130s -   4.638s ( 0.508s)
005:   4.930s -   6.014s ( 1.084s)
006:   6.178s -   7.262s ( 1.084s)
007:   7.458s -   7.998s ( 0.540s)
008:   8.194s -   9.598s ( 1.404s)
009:   9.794s -  10.430s ( 0.636s)
010:  10.530s -  11.806s ( 1.276s)
011:  11.970s -  12.510s ( 0.540s)
012:  12.610s -  14.171s ( 1.561s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The run created 12 files, from &lt;code&gt;speech_001.wav&lt;/code&gt; through &lt;code&gt;speech_012.wav&lt;/code&gt;. I also verified that every file is 16-bit PCM, 16 kHz, and mono. Together, the extracted segments contain about 11.917 seconds of audio, or 84.1% of the input.&lt;/p&gt;
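
&lt;p&gt;The total and the percentage can be checked directly from the timestamps in the log. The snippet below copies the twelve start and end times, expressed in milliseconds to avoid floating-point drift, and recomputes both figures.&lt;/p&gt;

```python
AUDIO_MS = 14_171

# (start, end) of each detected segment in milliseconds, copied from the log.
SEGMENTS_MS = [
    (162, 1726), (2050, 2462), (2626, 3934), (4130, 4638),
    (4930, 6014), (6178, 7262), (7458, 7998), (8194, 9598),
    (9794, 10430), (10530, 11806), (11970, 12510), (12610, 14171),
]

total_ms = sum(end - start for start, end in SEGMENTS_MS)
coverage = total_ms / AUDIO_MS * 100
print(f"{len(SEGMENTS_MS)} segments, {total_ms / 1000:.3f}s, {coverage:.1f}% of the input")
```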

&lt;p&gt;Listening to the files and comparing them with the Japanese script produced the following mapping:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;file&lt;/th&gt;
&lt;th&gt;speech&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speech_001.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;もしもし、もう駅に着いた？ (Hello? Are you at the station already?)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speech_002.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;うん。 (Yeah.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speech_003.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;今、改札出たところ。 (I just came through the ticket gate.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speech_004.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;そっちは？ (How about you?)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speech_005.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;こっちはまだ電車。 (I am still on the train.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speech_006.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;あと5分くらいかな。 (I think I will be there in about five minutes.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speech_007.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;了解。 (Got it.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speech_008.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;じゃあ、カフェの前で待ってるね。 (I will wait in front of the café, then.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speech_009.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;助かる。 (Thanks.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speech_010.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;今日は結構寒いね。 (It is pretty cold today.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speech_011.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ほんと、 (Definitely.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;speech_012.wav&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;マフラー持ってきて正解だった。 (I am glad I brought my scarf.)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The recording was cleanly divided at natural pauses corresponding to periods and question marks. The one exception was the final &lt;code&gt;ほんと、&lt;/code&gt;, which was split off at a comma because of the short pause that followed it. Short responses such as “yeah,” “got it,” “thanks,” and “definitely” were preserved.&lt;/p&gt;

&lt;p&gt;The verification environment was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;machine: Mac Studio (Mac16,9)&lt;/li&gt;
&lt;li&gt;chip: Apple M4 Max, 16 cores (12 performance + 4 efficiency)&lt;/li&gt;
&lt;li&gt;memory: 128 GB&lt;/li&gt;
&lt;li&gt;OS: macOS 26.5.1 (25F80), arm64&lt;/li&gt;
&lt;li&gt;Python: 3.12.11&lt;/li&gt;
&lt;li&gt;ONNX Runtime: 1.27.0&lt;/li&gt;
&lt;li&gt;execution provider: CPUExecutionProvider&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;VAD elapsed&lt;/code&gt; measures only Silero VAD inference and segment detection. It excludes model initialization, FFmpeg decoding, and WAV extraction. The real-time factor was about &lt;code&gt;0.002x&lt;/code&gt; (0.028 seconds of processing for 14.171 seconds of audio), but this was a single run without a warm-up, not a rigorous benchmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interpreting the Results
&lt;/h2&gt;

&lt;p&gt;Under these conditions, the CPU processed 14.171 seconds of audio in 0.028 seconds. That leaves substantial headroom for both offline processing and applications that consume microphone input incrementally. Processing time will vary by machine, so this number should not be treated as a universal benchmark.&lt;/p&gt;

&lt;p&gt;The implementation does not simply treat every 32 ms prediction as an independent segment. Separate start and end thresholds, a 100 ms silence requirement, minimum segment duration, and padding turn the probability sequence into intervals that are more useful downstream. In a real application, these segmentation rules can affect the result as much as model inference itself.&lt;/p&gt;

&lt;p&gt;For this input, the detector followed the script's punctuation and natural conversational pauses without dropping any content. It is especially useful that short acknowledgments survived even though segments shorter than 250 ms are discarded; the shortest extracted segment was 0.412 seconds.&lt;/p&gt;

&lt;p&gt;The recording does not include timestamp-level ground-truth annotations, so I did not calculate precision or recall. The listening comparison against the script was successful, but that does not make these settings optimal for every recording. Data containing noise, music, whispers, long pauses, or overlapping speakers would require threshold tuning and comparison against labeled examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Takeaway
&lt;/h2&gt;

&lt;p&gt;Silero VAD feels like a practical small component to place before speech recognition. The ONNX model is about 2.3 MB, runs comfortably on a CPU, and exposes a straightforward core operation: pass in a chunk and receive a probability.&lt;/p&gt;

&lt;p&gt;At the same time, VAD output is raw material for segmentation rather than a finished audio split. Whether an application should preserve short acknowledgments or combine speech into longer sentence-like pieces changes what values such as 100 ms and 250 ms should mean.&lt;/p&gt;

&lt;p&gt;Next, I would like to compare how the intervals change across thresholds, recording environments, and artificially added noise. I am also interested in measuring how this lightweight preprocessing affects end-to-end speed and accuracy when combined with transcription and speaker diarization.&lt;/p&gt;

</description>
      <category>python</category>
      <category>onnx</category>
      <category>audio</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
