5 Best Open Source Libraries and APIs for Speaker Diarization

#speechrecognition #nlp

In Automatic Speech Recognition, or ASR, Speaker Diarization refers to labeling speech segments in an audio or video file transcription with corresponding speaker identities. It's also sometimes referred to as Speaker Labels, and in its most basic form, it helps answer the question: who spoke when?

In order to accurately predict a speaker, a Speaker Diarization model must perform two actions:

Determine the number of speakers that can be found in an audio or video file.
Attribute each speech segment to the appropriate speaker.
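To make these two actions concrete, here is a minimal toy sketch. It assumes each speech segment has already been turned into a small embedding vector (the numbers below are made up), and it uses a simple greedy cosine-similarity clustering with an arbitrary threshold. This is only an illustration of the idea, not how any of the libraries or APIs below actually implement diarization.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def diarize(segments, threshold=0.8):
    """Greedy toy clustering.

    segments: list of (start_sec, end_sec, embedding) tuples.
    Returns (labeled_segments, number_of_speakers).
    The threshold is an assumed value chosen for this toy data.
    """
    centroids = []  # one representative embedding per discovered speaker
    labeled = []
    for start, end, emb in segments:
        scores = [cosine(emb, c) for c in centroids]
        if scores and max(scores) >= threshold:
            # Close enough to a known speaker: reuse that label.
            idx = scores.index(max(scores))
        else:
            # No similar speaker seen so far: register a new one.
            centroids.append(emb)
            idx = len(centroids) - 1
        labeled.append((start, end, f"SPEAKER_{idx}"))
    return labeled, len(centroids)


# Made-up embeddings: segments 1 and 3 point in a similar direction.
segments = [
    (0.0, 2.1, (0.9, 0.1)),
    (2.1, 4.0, (0.1, 0.95)),
    (4.0, 5.5, (0.85, 0.2)),
]
labeled, num_speakers = diarize(segments)
for start, end, speaker in labeled:
    print(f"{start:>4.1f}s to {end:>4.1f}s  {speaker}")
```

The count of distinct clusters covers the first action (how many speakers), and the label on each segment covers the second (who spoke when).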

Why is Speaker Diarization useful? When reading a transcription text with multiple speakers, your mind automatically tries to sort the text by speaker. Speaker Diarization does this sorting for you, making a transcript much more readable.

In addition, organizations like call centers might use Speaker Diarization to automatically label an “agent” or a “customer” in a transcription text for a help hotline. Medical professionals might use Speaker Diarization to automatically label “doctor” and “patient” in the transcription text for a virtual appointment and attach this transcript to a patient file.

While this can seem like a complicated task, today’s best open source libraries and APIs for Speaker Diarization are trained using the latest Deep Learning and Machine Learning research, making the process much simpler than it was in the past.

This article looks at the five best open source libraries and APIs available today to perform Speaker Diarization:

1. Kaldi

Kaldi ASR is a well-known open source Speech Recognition platform. To use its Speaker Diarization library, you’ll need to either download its PLDA backend or pre-trained X-Vectors, or train your own models.

Familiar with Kaldi but need help getting Speaker Diarization set up? This tutorial can help. If you’ve never used Kaldi ASR before, this Kaldi Speech Recognition Tutorial for Beginners is a great jumping-off point.

2. AssemblyAI

AssemblyAI’s Speech-to-Text and Audio Intelligence APIs offer accurate Speech Recognition without the need to pre-train a model. To perform Speaker Diarization, you can sign up for a free account for its Core Transcription API, though note that there is a limit to how much you can transcribe per month before having to upgrade your account to a paid option.

The API’s detailed, easy-to-follow documentation library can help you get started as well.
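As a rough sketch of what a request with speaker labels might look like, the snippet below builds a transcription request using only the Python standard library. It assumes the v2 REST endpoint, the `speaker_labels` request field, and an `utterances` list in the finished transcript; the helper names are hypothetical, so check the official documentation before relying on any of these details.

```python
import json
import urllib.request

# Assumed endpoint for submitting a transcription job.
API_URL = "https://api.assemblyai.com/v2/transcript"


def build_request(api_key, audio_url):
    """Build (but do not send) a POST request that asks for speaker labels."""
    body = json.dumps({"audio_url": audio_url, "speaker_labels": True}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )


def format_utterances(transcript):
    """Turn a finished transcript (parsed JSON) into 'Speaker X: text' lines."""
    lines = []
    for utt in transcript.get("utterances") or []:
        lines.append(f"Speaker {utt['speaker']}: {utt['text']}")
    return lines


# Hand-written sample shaped like the assumed response, for illustration only.
sample = {
    "status": "completed",
    "utterances": [
        {"speaker": "A", "text": "Thanks for calling.", "start": 0, "end": 1200},
        {"speaker": "B", "text": "Hi, I need some help.", "start": 1300, "end": 2900},
    ],
}
print("\n".join(format_utterances(sample)))
```

To actually send the request, you would pass the result of `build_request` to `urllib.request.urlopen` and then poll the returned transcript ID until its status is `completed`.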

3. PyAnnote

Similar to Kaldi ASR, PyAnnote is another open source Speaker Diarization toolkit, written in Python and built on the PyTorch Machine Learning framework.

For optimal use, you will need to train PyAnnote’s end-to-end neural building blocks to tailor your Speaker Diarization model, though some pre-trained models are also available.
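If you want to try one of the pre-trained models first, a minimal sketch might look like the following. It assumes the `pyannote.audio` package is installed and that you have a Hugging Face access token; the model name and the `use_auth_token` keyword reflect one version of the library and may differ in yours. The `speaking_time` helper is hypothetical and only there to show what you can do with the output.

```python
from collections import defaultdict


def diarize_file(path, auth_token):
    """Run a pre-trained PyAnnote pipeline on one audio file.

    Returns a list of (start_sec, end_sec, speaker_label) tuples.
    """
    # Imported lazily so the rest of this sketch works without PyAnnote installed.
    from pyannote.audio import Pipeline

    # Model name and keyword argument are assumptions; check your installed version.
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization", use_auth_token=auth_token
    )
    diarization = pipeline(path)
    return [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]


def speaking_time(turns):
    """Total seconds of speech per speaker label."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


# Hand-written turns in the same shape diarize_file returns.
example_turns = [
    (0.0, 2.0, "SPEAKER_00"),
    (2.0, 3.5, "SPEAKER_01"),
    (3.5, 4.0, "SPEAKER_00"),
]
print(speaking_time(example_turns))
```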

4. Google Speech-to-Text

Google Speech-to-Text is a popular Speech Recognition API that also offers Speaker Diarization. The API has good accuracy and language support, though using it to transcribe a large volume of files can be quite pricey.

You’ll need to enable Speaker Diarization in your request when you transcribe audio, whether it comes from a local file or from a Google Cloud Storage bucket. This documentation can walk you through the necessary steps.
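Here is a hedged sketch of enabling diarization with the `google-cloud-speech` Python client. It assumes the package is installed, credentials are already configured, and the audio is a WAV or FLAC file in a bucket; the speaker counts, language code, and timeout are placeholder values, and `group_by_speaker` is a hypothetical helper for making the word-level output readable.

```python
def transcribe_with_diarization(gcs_uri, min_speakers=2, max_speakers=2):
    """Transcribe a file in Cloud Storage and return (word, speaker_tag) pairs."""
    # Imported lazily so the helper below works without the client installed.
    from google.cloud import speech

    client = speech.SpeechClient()
    diarization_config = speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=min_speakers,
        max_speaker_count=max_speakers,
    )
    config = speech.RecognitionConfig(
        language_code="en-US",  # placeholder language
        diarization_config=diarization_config,
    )
    audio = speech.RecognitionAudio(uri=gcs_uri)
    response = client.long_running_recognize(config=config, audio=audio).result(timeout=300)
    # With diarization on, the last result carries every word with its speaker tag.
    words = response.results[-1].alternatives[0].words
    return [(w.word, w.speaker_tag) for w in words]


def group_by_speaker(words):
    """Merge consecutive words with the same speaker tag into turns."""
    turns = []
    for word, tag in words:
        if turns and turns[-1][0] == tag:
            turns[-1][1].append(word)
        else:
            turns.append((tag, [word]))
    return [(tag, " ".join(ws)) for tag, ws in turns]


# Hand-written words in the same shape transcribe_with_diarization returns.
example_words = [("hello", 1), ("there", 1), ("hi", 2), ("thanks", 1)]
for tag, text in group_by_speaker(example_words):
    print(f"Speaker {tag}: {text}")
```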

5. AWS Transcribe

AWS Transcribe also offers Speaker Diarization for either batch or real-time transcription. Though AWS Transcribe can also be expensive, the API does offer one hour free per month, so it can be a good option for low volume transcription.

To enable Speaker Diarization, you’ll need to log in to your Amazon Transcribe console and create a transcription job. This documentation page can show you how.
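If you would rather create the job from code than from the console, a sketch with the `boto3` SDK might look like this. It assumes `boto3` is installed and AWS credentials are configured; the job name, bucket URI, media format, and speaker count are placeholders, and `build_job_request` is a hypothetical helper that just assembles the arguments.

```python
def build_job_request(job_name, media_uri, max_speakers=2,
                      media_format="wav", language_code="en-US"):
    """Assemble arguments for a transcription job with speaker labels turned on."""
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "MediaFormat": media_format,
        "LanguageCode": language_code,
        "Settings": {
            "ShowSpeakerLabels": True,         # turns on Speaker Diarization
            "MaxSpeakerLabels": max_speakers,  # must be at least 2
        },
    }


def start_job(request):
    """Submit the job. Requires boto3 and valid AWS credentials."""
    # Imported lazily so build_job_request can be used without boto3 installed.
    import boto3

    client = boto3.client("transcribe")
    return client.start_transcription_job(**request)


# Placeholder names for illustration only.
request = build_job_request("support-call-001", "s3://my-example-bucket/call.wav")
print(request["Settings"])
```

Calling `start_job(request)` only starts the batch job; the transcript with speaker labels becomes available once the job finishes.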

DEV Community