DEV Community

Khushi Nakra

Speaker Diarization Frameworks in Python: Tutorial and Code Walkthrough

Speaker diarization identifies and separates different speakers in an audio file. Think of it as automatically labeling "Speaker A spoke from 0:00-0:15, Speaker B spoke from 0:15-0:30" throughout your recording.

Illustration of speaker diarization

It is essential for applications like meeting transcription, podcast editing, call center analytics, and interview processing. In each of these, you need to know who spoke as well as what was said.

This tutorial walks through four different Python frameworks for speaker diarization:

  • pyannote.audio
  • NVIDIA NeMo
  • Simple Diarizer
  • Falcon Speaker Diarization

1. pyannote.audio

Getting started with pyannote.audio for speaker diarization is straightforward. Follow these steps:

  • Install the pyannote.audio package using pip:
pip3 install pyannote.audio
  • Create a Hugging Face access token so that the pretrained models can be downloaded. The models are gated, so you may also need to accept the user conditions on their Hugging Face pages first.

  • Use the following Python code to perform speaker diarization on an audio file:

from pyannote.audio import Pipeline

# Replace "${ACCESS_TOKEN_GOES_HERE}" with your authentication token
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="${ACCESS_TOKEN_GOES_HERE}")

# Replace "${AUDIO_FILE_PATH}" with the path to your audio file
diarization = pipeline("${AUDIO_FILE_PATH}")

for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f'Speaker "{speaker}" - "{segment}"')

This code will perform speaker diarization and print out the identified speakers along with their corresponding segments in the audio file.
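The printed segments are often fragmented, with the same speaker appearing in several back-to-back turns. The sketch below shows one possible way to merge them. It is not part of pyannote.audio: `merge_turns` and `max_gap` are hypothetical names, and the input is assumed to be a list of (start, end, speaker) tuples collected from the loop above using `segment.start` and `segment.end`.

```python
from typing import List, Tuple

# (start_sec, end_sec, speaker_label)
Turn = Tuple[float, float, str]


def merge_turns(turns: List[Turn], max_gap: float = 0.5) -> List[Turn]:
    """Merge consecutive turns by the same speaker.

    Two turns are merged when they share a speaker label and the pause
    between them is at most `max_gap` seconds (a hypothetical default).
    """
    merged: List[Turn] = []
    for start, end, speaker in sorted(turns):
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged


# Made-up turns for illustration. With pyannote.audio they could be built as:
#   turns = [(s.start, s.end, spk)
#            for s, _, spk in diarization.itertracks(yield_label=True)]
example = [
    (0.0, 4.2, "SPEAKER_00"),
    (4.4, 9.0, "SPEAKER_00"),
    (9.5, 15.0, "SPEAKER_01"),
]
print(merge_turns(example))
```

A larger `max_gap` gives longer, more readable turns but can hide short pauses, so the right value depends on your recording.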

2. NVIDIA NeMo

To perform speaker diarization using NVIDIA NeMo, follow these steps:

  • Install dependencies:
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip3 install Cython
  • Install NeMo:
pip3 install "git+https://github.com/NVIDIA/NeMo.git@r1.20.0#egg=nemo_toolkit[all]"
  • Download the inference config file (a YAML file) from the NeMo GitHub repository.

  • Generate and store the manifest file by running the following code:

import json
import os

from nemo.collections.asr.models import ClusteringDiarizer
from omegaconf import OmegaConf

INPUT_FILE = '/PATH/TO/AUDIO_FILE.wav'
MANIFEST_FILE = '/PATH/TO/MANIFEST_FILE.json'

meta = {
    'audio_filepath': INPUT_FILE,
    'offset': 0,
    'duration': None,
    'label': 'infer',
    'text': '-',
    'num_speakers': None,
    'rttm_filepath': None,
    'uem_filepath': None
}
with open(MANIFEST_FILE, 'w') as fp:
    json.dump(meta, fp)
    fp.write('\n')

Replace /PATH/TO/AUDIO_FILE.wav with the path to your audio file and /PATH/TO/MANIFEST_FILE.json with the desired path for your manifest file.
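The manifest is a text file with one JSON object per line, so the same approach extends to several recordings. Below is a minimal sketch under that assumption. `write_manifest` is a hypothetical helper, not a NeMo function, and the audio paths are placeholders. The demo writes to a temporary directory only so that it can run anywhere.

```python
import json
import os
import tempfile
from typing import Iterable


def write_manifest(audio_paths: Iterable[str], manifest_path: str) -> int:
    """Write one manifest line per audio file and return the number written."""
    count = 0
    with open(manifest_path, "w", encoding="utf-8") as fp:
        for audio_path in audio_paths:
            # Same fields as the single-file manifest above.
            entry = {
                "audio_filepath": audio_path,
                "offset": 0,
                "duration": None,
                "label": "infer",
                "text": "-",
                "num_speakers": None,
                "rttm_filepath": None,
                "uem_filepath": None,
            }
            fp.write(json.dumps(entry) + "\n")
            count += 1
    return count


# Demo with placeholder paths; the files do not need to exist to write a manifest.
with tempfile.TemporaryDirectory() as tmp_dir:
    manifest = os.path.join(tmp_dir, "manifest.json")
    written = write_manifest(["/PATH/TO/a.wav", "/PATH/TO/b.wav"], manifest)
    with open(manifest, encoding="utf-8") as fp:
        lines = [json.loads(line) for line in fp]

print(written, [entry["audio_filepath"] for entry in lines])
```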

  • Load the config file and define a ClusteringDiarizer object:
OUTPUT_DIR = '/PATH/TO/OUTPUT_DIR'
MODEL_CONFIG = '/PATH/TO/CONFIG_FILE.yaml'

config = OmegaConf.load(MODEL_CONFIG)
config.diarizer.manifest_filepath = MANIFEST_FILE
config.diarizer.out_dir = OUTPUT_DIR
config.diarizer.oracle_vad = False
config.diarizer.clustering.parameters.oracle_num_speakers = False

sd_model = ClusteringDiarizer(cfg=config)

Replace /PATH/TO/OUTPUT_DIR and /PATH/TO/CONFIG_FILE.yaml with the desired paths for your output directory and config file, respectively.

  • Perform speaker diarization on the audio file:
sd_model.diarize()

The speaker diarization output will be stored in the OUTPUT_DIR directory as a Rich Transcription Time Marked (RTTM) file.
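An RTTM file is plain text with one speaker turn per line, so it can be read without NeMo. The sketch below assumes the standard space-separated layout, where field 4 is the turn onset in seconds, field 5 is the duration, and field 8 is the speaker name. `parse_rttm` and `RttmTurn` are hypothetical names, and the sample lines are made up for illustration.

```python
from typing import List, NamedTuple


class RttmTurn(NamedTuple):
    file_id: str
    start_sec: float
    end_sec: float
    speaker: str


def parse_rttm(text: str) -> List[RttmTurn]:
    """Parse SPEAKER lines from RTTM text into turns with start and end times."""
    turns: List[RttmTurn] = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not a SPEAKER record.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        turns.append(RttmTurn(fields[1], start, start + duration, fields[7]))
    return turns


# Made-up RTTM content for illustration.
sample = """SPEAKER meeting 1 0.00 15.00 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER meeting 1 15.00 15.00 <NA> <NA> speaker_1 <NA> <NA>"""

for turn in parse_rttm(sample):
    print(turn)
```

To use it on real output, read the RTTM file produced in your output directory and pass its contents to `parse_rttm`.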

3. Simple Diarizer

Simple Diarizer is a speaker diarization library that utilizes pretrained models from SpeechBrain. To get started with simple_diarizer, follow these steps:

  • Install the package using pip:
pip install simple_diarizer
  • Define a Diarizer object:
from simple_diarizer.diarizer import Diarizer

diarization = Diarizer(embed_model='xvec', cluster_method='sc')
  • Perform speaker diarization on an audio file by either passing the number of speakers:
# Replace "${AUDIO_FILE_PATH}" with the path to your audio file
segments = diarization.diarize("${AUDIO_FILE_PATH}", num_speakers=NUM_SPEAKERS)

Or by passing a threshold value:

segments = diarization.diarize("${AUDIO_FILE_PATH}", threshold=THRESHOLD)

The segments variable stores the speaker label and timing details for each segment, including its start and end times.
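Once you have the segments, a common next step is to add up how long each speaker talked. The sketch below assumes each segment is a dict with "start", "end", and "label" keys. That layout is an assumption about the library's output, so the key names are parameters you can change after inspecting one segment yourself. `speaking_time` is a hypothetical helper and the example data is made up.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def speaking_time(
    segments: Iterable[Mapping[str, Any]],
    start_key: str = "start",  # assumed key name
    end_key: str = "end",      # assumed key name
    label_key: str = "label",  # assumed key name
) -> Dict[Any, float]:
    """Sum the duration of all segments per speaker label, in seconds."""
    totals: Dict[Any, float] = defaultdict(float)
    for seg in segments:
        totals[seg[label_key]] += seg[end_key] - seg[start_key]
    return dict(totals)


# Made-up segments for illustration.
example = [
    {"start": 0.0, "end": 15.0, "label": 0},
    {"start": 15.0, "end": 30.0, "label": 1},
    {"start": 30.0, "end": 40.0, "label": 0},
]
print(speaking_time(example))
```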

4. Falcon Speaker Diarization

Falcon Speaker Diarization is an on-device speaker diarization engine powered by deep learning. To get started with Falcon, follow these steps:

  • Install the package using pip:
pip install pvfalcon
  • Sign up for Picovoice Console for free and copy your AccessKey.

  • Create an instance of the engine:

import pvfalcon

# Replace "${ACCESS_KEY}" with your Picovoice Console AccessKey
falcon = pvfalcon.create(access_key="${ACCESS_KEY}")
  • Perform speaker diarization on an audio file:
# Replace "${AUDIO_FILE_PATH}" with the path to your audio file
segments = falcon.process_file("${AUDIO_FILE_PATH}")
for segment in segments:
    print(
        "{speaker_tag=%d start_sec=%.2f end_sec=%.2f}"
        % (segment.speaker_tag, segment.start_sec, segment.end_sec)
    )

Each segment in the segments array includes timing information and speaker identification.
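If you want a timeline in the "Speaker A spoke from 0:00-0:15" style, the three fields used above (speaker_tag, start_sec, end_sec) are enough. The sketch below is one way to format them. `Segment` is a stand-in class so the example runs without pvfalcon, and `format_clock` and `format_timeline` are hypothetical helpers rather than part of the SDK.

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    """Stand-in with the same three fields printed in the Falcon example."""
    speaker_tag: int
    start_sec: float
    end_sec: float


def format_clock(seconds: float) -> str:
    """Format seconds as m:ss, dropping the fractional part."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


def format_timeline(segments: Iterable[Segment]) -> List[str]:
    """Return one readable line per segment."""
    return [
        f"{format_clock(s.start_sec)}-{format_clock(s.end_sec)} Speaker {s.speaker_tag}"
        for s in segments
    ]


# Made-up segments for illustration.
example = [Segment(1, 0.0, 15.0), Segment(2, 15.0, 30.0), Segment(1, 30.0, 75.4)]
for line in format_timeline(example):
    print(line)
```

Because the helpers only read attributes, the list returned by falcon.process_file should work in place of `example`.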

For more information about Falcon Speaker Diarization, check out the Falcon Speaker Diarization product page or refer to the Falcon Speaker Diarization Python SDK quick start guide.

This tutorial was originally published on Picovoice