Unleashing the Power of Whisper and Falcon in Voice AI

Emir Avci — Thu, 11 Jan 2024 00:37:55 +0000

Introduction

Explore the transformative power of speaker recognition and speaker diarization in this tutorial. We'll integrate OpenAI's Whisper for advanced transcription and Picovoice's Falcon to precisely identify speakers, offering unparalleled audio conversation analysis.

Background

Utilizing Whisper's capabilities for terminal-based transcriptions, we'll enhance it with Falcon's diarization to distinctively identify speakers. This integration improves transcription accuracy and enriches the context of who is speaking and when, enabling sophisticated AI-driven audio analysis. Lets combine the power of these cutting-edge technologies together!

Installation & Setup

1. Audio Recording

Install FFmpeg, an all-in-one tool for audio and video, which we'll use for recording audio and creating transcriptions with Whisper AI.

Homebrew

brew install ffmpeg

Chocolatey

choco install ffmpeg

Once installed, use FFmpeg to list all inputs on your machine. Select an input device for later use.

Mac OS
ffmpeg -f avfoundation -list_devices true -i ""

Linux
ffmpeg -f alsa -list_devices true -i ""

Windows
ffmpeg -f dshow -list_devices true -i dummy

Note the exact name of the audio input you'll use.

2. Whisper Speaker Recognition

To use OpenAI's Whisper for speaker recognition, follow these steps:

Install Python 3.8–3.11: Check your Python version with python3 -V. If it's not within 3.8–3.11, download the latest 3.11 version from python.org.
Install PIP: Ensure Python's package manager is installed with python3 -m pip --version. Install or upgrade it with python3 -m pip install --upgrade pip.
Install Whisper: Install Whisper and its dependencies via pip install -U openai-whisper.

3. Falcon Speaker Diarization

To install Picovoice's Falcon for speaker diarization:

Create an account and get your AccessKey from Picovoice's Dashboard.
Install the pvleopard Python package using pip3 install pvfalcon.

Python Script for Audio Recording, Transcription, and Diarization

This Python script demonstrates the integration of Whisper for transcription and Falcon for speaker diarization. The script automatically records audio using CLI, transcribes it, performs speaker diarization, and outputs the final transcript with speaker and timestamp labels.

Code Explanation

import os
import subprocess
import datetime
import pvfalcon
import json

def record_audio():
    # Records audio using FFmpeg and saves it as a WAV file
    today = datetime.datetime.now().strftime('%Y%m%d')
    audio_file = f"./{today}.wav"
    subprocess.run([
        "ffmpeg", "-f", "avfoundation", "-i", ":YOUR_INPUT_SOURCE",
        "-ar", "16000",  # Set sample rate to 16 kHz
        "-ac", "1",      # Set audio to mono
        "-t", "15",      # Record for 15 seconds
        audio_file
    ])
    return audio_file

def transcribe_audio(audio_file):
    # Transcribes the audio using Whisper
    subprocess.run(["whisper", audio_file, "--model", "medium", "--language", "English"], check=True)
    json_output = f"{audio_file.rsplit('.', 1)[0]}.json"
    # Display macOS notification when transcription is complete
    subprocess.run(["osascript", "-e", 'display notification "Whisper Transcription Complete!" with title "Whisper AI"'])
    if not os.path.exists(json_output):
        raise FileNotFoundError(f"The file {json_output} was not created by Whisper.")
    with open(json_output, 'r') as f:
        transcription = json.load(f)
    return transcription

def perform_diarization(audio_file, access_key):
    # Applies Falcon's speaker diarization on the audio file
    falcon = pvfalcon.create(access_key=access_key)
    segments = falcon.process_file(audio_file)
    falcon.delete() # Clean up Falcon instance after processing
    # Display macOS notification when diarization is complete
    subprocess.run(["osascript", "-e", 'display notification "Falcon Diarization Complete!" with title "Falcon AI"']) 

def merge_transcripts(transcription, diarization, overlap_threshold=0.2):
    # Merges transcripts from Whisper and diarization data from Falcon
    merged_output = []
    used_transcript_segments = set()

    for seg in diarization:
        speaker_tag = f"Speaker {seg.speaker_tag}"

        for part in transcription['segments']:
            if part['id'] in used_transcript_segments:
                continue  # Skip segments already used

            overlap_start = max(seg.start_sec, part['start'])
            overlap_end = min(seg.end_sec, part['end'])
            overlap_duration = max(0, overlap_end - overlap_start)

            diarization_duration = seg.end_sec - seg.start_sec
            transcription_duration = part['end'] - part['start']
            min_duration = min(diarization_duration, transcription_duration)

            if overlap_duration >= overlap_threshold * min_duration:
                merged_output.append({
                    'speaker': speaker_tag,
                    'start': overlap_start,
                    'end': overlap_end,
                    'text': part['text']
                })
                used_transcript_segments.add(part['id'])

 # Sort and merge close segments
    merged_output.sort(key=lambda x: (x['speaker'], x['start']))
    final_output = []
    for seg in merged_output:
        if final_output and seg['speaker'] == final_output[-1]['speaker'] and seg['start'] - final_output[-1]['end'] < 1:
            final_output[-1]['end'] = seg['end']  # Extend the previous segment
            final_output[-1]['text'] += ' ' + seg['text']
        else:
            final_output.append(seg)

    return final_output
def main():
    access_key = "YOUR_FALCON_ACCESS_KEY" # Replace with your actual Falcon access key
    audio_file = record_audio()
    transcription = transcribe_audio(audio_file)
    diarization = perform_diarization(audio_file, access_key)
    merged_output = merge_transcripts(transcription, diarization)
    # Print the final output with speaker tags and timestamps
    for m in merged_output:
        print(f"{m['speaker']} [{m['start']:.2f}-{m['end']:.2f}] {m['text'].strip()}")
    os.remove(audio_file)  # Clean up the audio file

if __name__ == "__main__":
    main()

This integration of Whisper and Falcon offers devs a powerful tool 🛠️ for audio analysis. The script not only transcribes but also assigns text to specific speakers with timestamps 🕒.

It's a perfect starting point for further customization. Dive into this project on GitHub, tweak it, and adapt it to your needs 🤝!

DEV Community: Emir Avci