Introduction
Explore the transformative power of speaker recognition and speaker diarization in this tutorial. We'll integrate OpenAI's Whisper for advanced transcription and Picovoice's Falcon to precisely identify speakers, offering unparalleled audio conversation analysis.
Background
We'll run Whisper from the terminal to produce the transcript, then use Falcon's diarization to tell the speakers apart. Diarization does not change the transcribed words; it adds the context of who is speaking and when, which is what makes conversation-level analysis possible. Let's combine the two!
Installation & Setup
1. Audio Recording
Install FFmpeg, an all-in-one tool for audio and video, which we'll use for recording audio and creating transcriptions with Whisper AI.
Homebrew
brew install ffmpeg
Chocolatey
choco install ffmpeg
Once installed, use FFmpeg to list all inputs on your machine. Select an input device for later use.
macOS
ffmpeg -f avfoundation -list_devices true -i ""
Linux (FFmpeg's ALSA input has no -list_devices option, so list capture devices with ALSA's own tool)
arecord -l
Windows
ffmpeg -f dshow -list_devices true -i dummy
Note the exact name of the audio input you'll use.
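Each platform expects the device name in a slightly different form when you later pass it to FFmpeg. The sketch below shows one way to build the input arguments per platform. The helper name `ffmpeg_input_args` is hypothetical, and the device names in the usage lines are placeholders, not values taken from your machine.

```python
import sys
from typing import List


def ffmpeg_input_args(device: str, platform: str = sys.platform) -> List[str]:
    """Build the FFmpeg input arguments for an audio capture device.

    Hypothetical helper: `device` is the name (or ALSA identifier) you noted
    from the device listing step.
    """
    if platform == "darwin":
        # avfoundation takes "video:audio"; a leading colon means audio only
        return ["-f", "avfoundation", "-i", f":{device}"]
    if platform.startswith("win"):
        # dshow expects the audio device as audio=<name>
        return ["-f", "dshow", "-i", f"audio={device}"]
    # Assumption: anything else is Linux with ALSA, e.g. "default" or "hw:0"
    return ["-f", "alsa", "-i", device]


# Placeholder device names, for illustration only
print(ffmpeg_input_args("MacBook Pro Microphone", "darwin"))
print(ffmpeg_input_args("Microphone (USB Audio)", "win32"))
print(ffmpeg_input_args("default", "linux"))
```

The recording script in this tutorial hardcodes the avfoundation form, so a helper like this is only needed if you want to run it on Windows or Linux.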
2. Whisper Transcription
Whisper handles the transcription side (identifying speakers is Falcon's job). To set it up, follow these steps:
Install Python 3.8–3.11: Check your Python version with
python3 -V
If it's not within 3.8–3.11, download the latest 3.11 release from python.org.
Install pip: Ensure Python's package manager is installed with
python3 -m pip --version
Install or upgrade it with
python3 -m pip install --upgrade pip
Install Whisper: Install Whisper and its dependencies with
pip install -U openai-whisper
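If you prefer to check the interpreter version from Python itself, a minimal sketch could look like this. The function name `is_supported_python` is hypothetical; the 3.8–3.11 range is the one stated in the steps above.

```python
import sys


def is_supported_python(version=sys.version_info) -> bool:
    """Return True if the (major, minor) version falls within 3.8 to 3.11."""
    return (3, 8) <= tuple(version[:2]) <= (3, 11)


if not is_supported_python():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} "
          "is outside the 3.8-3.11 range this tutorial assumes.")
```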
3. Falcon Speaker Diarization
To install Picovoice's Falcon for speaker diarization:
Create an account and get your AccessKey from Picovoice's Dashboard.
Install the pvfalcon Python package with
pip3 install pvfalcon
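Rather than pasting the AccessKey into the script, you may want to read it from the environment. This is a sketch under assumptions: the variable name `PICOVOICE_ACCESS_KEY` and the helper `load_access_key` are made up for illustration and are not something Picovoice defines.

```python
import os


def load_access_key(env=os.environ, name="PICOVOICE_ACCESS_KEY"):
    """Read the Falcon AccessKey from an environment variable.

    `name` is a hypothetical variable name chosen for this example.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable to your AccessKey.")
    return key
```

You would then export the variable in your shell before running the script and pass the returned value wherever the script expects the access key.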
Python Script for Audio Recording, Transcription, and Diarization
This Python script demonstrates the integration of Whisper for transcription and Falcon for speaker diarization. It records audio through the FFmpeg CLI, transcribes it, performs speaker diarization, and prints the final transcript with speaker and timestamp labels. As written it targets macOS: the recording step uses the avfoundation input and the completion notifications use osascript.
Code Explanation
import os
import subprocess
import datetime
import pvfalcon
import json

def record_audio():
    # Records audio using FFmpeg and saves it as a WAV file
    today = datetime.datetime.now().strftime('%Y%m%d')
    audio_file = f"./{today}.wav"
    subprocess.run([
        "ffmpeg", "-f", "avfoundation", "-i", ":YOUR_INPUT_SOURCE",
        "-ar", "16000",  # Set sample rate to 16 kHz
        "-ac", "1",      # Set audio to mono
        "-t", "15",      # Record for 15 seconds
        audio_file
    ])
    return audio_file

def transcribe_audio(audio_file):
    # Transcribes the audio using Whisper
    subprocess.run(["whisper", audio_file, "--model", "medium", "--language", "English"], check=True)
    json_output = f"{audio_file.rsplit('.', 1)[0]}.json"
    # Display macOS notification when transcription is complete
    subprocess.run(["osascript", "-e", 'display notification "Whisper Transcription Complete!" with title "Whisper AI"'])
    if not os.path.exists(json_output):
        raise FileNotFoundError(f"The file {json_output} was not created by Whisper.")
    with open(json_output, 'r') as f:
        transcription = json.load(f)
    return transcription

def perform_diarization(audio_file, access_key):
    # Applies Falcon's speaker diarization on the audio file
    falcon = pvfalcon.create(access_key=access_key)
    segments = falcon.process_file(audio_file)
    falcon.delete()  # Clean up Falcon instance after processing
    # Display macOS notification when diarization is complete
    subprocess.run(["osascript", "-e", 'display notification "Falcon Diarization Complete!" with title "Falcon AI"'])
    # Return the speaker segments so merge_transcripts() can use them
    return segments

def merge_transcripts(transcription, diarization, overlap_threshold=0.2):
    # Merges transcripts from Whisper and diarization data from Falcon
    merged_output = []
    used_transcript_segments = set()
    for seg in diarization:
        speaker_tag = f"Speaker {seg.speaker_tag}"
        for part in transcription['segments']:
            if part['id'] in used_transcript_segments:
                continue  # Skip segments already used
            overlap_start = max(seg.start_sec, part['start'])
            overlap_end = min(seg.end_sec, part['end'])
            overlap_duration = max(0, overlap_end - overlap_start)
            diarization_duration = seg.end_sec - seg.start_sec
            transcription_duration = part['end'] - part['start']
            min_duration = min(diarization_duration, transcription_duration)
            if overlap_duration >= overlap_threshold * min_duration:
                merged_output.append({
                    'speaker': speaker_tag,
                    'start': overlap_start,
                    'end': overlap_end,
                    'text': part['text']
                })
                used_transcript_segments.add(part['id'])
    # Sort and merge close segments
    merged_output.sort(key=lambda x: (x['speaker'], x['start']))
    final_output = []
    for seg in merged_output:
        if final_output and seg['speaker'] == final_output[-1]['speaker'] and seg['start'] - final_output[-1]['end'] < 1:
            final_output[-1]['end'] = seg['end']  # Extend the previous segment
            final_output[-1]['text'] += ' ' + seg['text']
        else:
            final_output.append(seg)
    return final_output

def main():
    access_key = "YOUR_FALCON_ACCESS_KEY"  # Replace with your actual Falcon access key
    audio_file = record_audio()
    transcription = transcribe_audio(audio_file)
    diarization = perform_diarization(audio_file, access_key)
    merged_output = merge_transcripts(transcription, diarization)
    # Print the final output with speaker tags and timestamps
    for m in merged_output:
        print(f"{m['speaker']} [{m['start']:.2f}-{m['end']:.2f}] {m['text'].strip()}")
    os.remove(audio_file)  # Clean up the audio file

if __name__ == "__main__":
    main()
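The matching rule inside merge_transcripts is easier to see in isolation. The sketch below restates it as a standalone function: a Whisper segment is assigned to a Falcon segment when their overlap is at least overlap_threshold (0.2 by default in the script) times the shorter of the two durations. The name `overlap_ratio` and the sample timestamps are illustrative and not part of the script.

```python
def overlap_ratio(a_start, a_end, b_start, b_end):
    """Overlap length divided by the duration of the shorter interval."""
    overlap = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    shortest = min(a_end - a_start, b_end - b_start)
    return overlap / shortest if shortest > 0 else 0.0


# Made-up example: a Falcon segment at 0.0-4.0 s and a Whisper segment at 3.0-5.0 s
# overlap for 1.0 s; the shorter segment lasts 2.0 s, so the ratio is 0.5.
ratio = overlap_ratio(0.0, 4.0, 3.0, 5.0)
print(ratio >= 0.2)  # True: with the default threshold this pair is matched
```

Note that when a pair matches, the script stores the start and end of the overlap itself, so the printed timestamps can be narrower than either original segment.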
This integration of Whisper and Falcon offers devs a powerful tool 🛠️ for audio analysis. The script not only transcribes but also assigns text to specific speakers with timestamps 🕒.
It's a perfect starting point for further customization. Dive into this project on GitHub, tweak it, and adapt it to your needs 🤝!
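One customization you may want early on: the script sorts matched segments by speaker first and then by start time, so the printed transcript is grouped by speaker rather than following the conversation. A possible variant, sketched here with a hypothetical function name `merge_chronologically`, sorts by start time only and joins a segment with the previous one only when the same speaker continues within a short gap.

```python
def merge_chronologically(entries, max_gap=1.0):
    """Order entries by start time and join consecutive turns of one speaker.

    `entries` are dicts with 'speaker', 'start', 'end' and 'text' keys, the
    same shape the script builds before its final sort.
    """
    merged = []
    for entry in sorted(entries, key=lambda e: e["start"]):
        last = merged[-1] if merged else None
        if (last is not None
                and last["speaker"] == entry["speaker"]
                and entry["start"] - last["end"] < max_gap):
            last["end"] = max(last["end"], entry["end"])
            last["text"] = f'{last["text"].strip()} {entry["text"].strip()}'
        else:
            merged.append(dict(entry))  # copy so the input list is left untouched
    return merged


# Made-up sample data for illustration
sample = [
    {"speaker": "Speaker 1", "start": 4.2, "end": 6.0, "text": " How are you?"},
    {"speaker": "Speaker 2", "start": 2.0, "end": 4.0, "text": " Hello."},
    {"speaker": "Speaker 1", "start": 0.0, "end": 2.0, "text": " Hi."},
]
for turn in merge_chronologically(sample):
    print(f"{turn['speaker']} [{turn['start']:.2f}-{turn['end']:.2f}] {turn['text'].strip()}")
```

Whether grouping by speaker or by time is preferable depends on what you plan to do with the transcript, so treat this as one option rather than a correction.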
Top comments (1)
Hi, Emir. This is intriguing. I'm interested in replacing the openai-whisper package with whisper.cpp, perhaps using the Python bindings described here: github.com/abdeladim-s/pywhispercpp. Any advice? Or do you know of anyone who has used Falcon and whisper.cpp in this way? Thanks!