TDoC '24 Day 5: Speech-to-Text with Python and Whisper AI

Welcome to Day 5 of TDoC 2024! Today, we explored the fascinating world of Speech-to-Text (STT) technology, focusing on Whisper AI, OpenAI’s advanced model for audio transcription. This post covers foundational STT concepts, Whisper AI's strengths, and its practical implementation in Python.


What is Speech-to-Text?

Speech-to-Text (STT) technology transforms spoken language into written text and is widely used in:

  • Virtual assistants (e.g., Siri, Alexa, Google Assistant).
  • Transcription services for meetings, podcasts, and interviews.
  • Accessibility tools like captions for the hearing impaired.

How Does STT Work?

  1. Audio Input: Captures spoken language as digital waveforms.
  2. Preprocessing: Reduces noise and extracts relevant audio features.
  3. Model Inference: A trained model processes features to generate text.
  4. Post-Processing: Enhances the output by refining grammar and punctuation.
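
To make the pipeline concrete, here is a toy sketch of the four stages in plain Python. The preprocess, model_inference, and post_process functions are stand-ins invented for illustration only; a real system replaces the inference stub with a trained model such as Whisper.

import numpy as np

def preprocess(waveform: np.ndarray) -> np.ndarray:
    # Stage 2 (toy): crude noise gate that zeroes out very quiet samples
    return np.where(np.abs(waveform) > 0.01, waveform, 0.0)

def model_inference(features: np.ndarray) -> str:
    # Stage 3 (stub): a real system runs a neural network here
    return "hello world"

def post_process(text: str) -> str:
    # Stage 4: refine casing and punctuation
    return text.capitalize() + "."

# Stage 1: one second of fake 16 kHz audio standing in for a recording
waveform = np.random.randn(16000).astype(np.float32)
print(post_process(model_inference(preprocess(waveform))))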

Whisper AI: The Game-Changer for STT

Developed by OpenAI, Whisper AI is a versatile open-source STT model designed to handle challenging transcription tasks across various languages and environments.

Why Whisper AI Stands Out

  • High Accuracy: Handles noisy environments and overlapping speech.
  • Multilingual Support: Transcribes in numerous languages.
  • Privacy-Focused: Runs locally, ensuring data privacy.
  • Free and Open Source: No API costs for developers.

Setting Up Whisper AI for Speech-to-Text

Follow these steps to use Whisper AI in your Python projects.

1. Install Required Libraries

Install the openai-whisper package, along with ffmpeg-python if you want to script audio processing:

pip install openai-whisper
pip install ffmpeg-python

Note that ffmpeg-python only provides Python bindings; Whisper calls the ffmpeg binary directly, so ffmpeg itself must also be installed on your system. Refer to the FFmpeg documentation for details.
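
A quick way to confirm the binary is discoverable before you start transcribing (a minimal check, assuming a standard installation on your PATH):

import shutil

# Whisper shells out to the ffmpeg executable, so it must be on PATH
if shutil.which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found -- install it before running Whisper")
print("ffmpeg found at:", shutil.which("ffmpeg"))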

2. Transcribing Audio with Whisper

Here’s a Python script to transcribe audio files:

import whisper

# Load the Whisper model
model = whisper.load_model("base")  # Available models: tiny, base, small, medium, large

# Transcribe the audio file
audio_file = "example.wav"
result = model.transcribe(audio_file)

# Print the transcription
print("Transcription:", result['text'])
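Beyond the text itself, the dictionary returned by transcribe() also includes the detected language and timestamped segments:

# Inspect the metadata that comes with the transcription
print("Detected language:", result["language"])
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}s -> {seg['end']:.2f}s] {seg['text']}")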

How Whisper Works

  • Model Selection: Whisper offers multiple models (tiny to large). Larger models provide higher accuracy but demand more resources.
  • Audio Preprocessing: The transcribe() method preprocesses the audio, extracting features like spectrograms.
  • Inference: The model decodes features into text and outputs a dictionary containing the transcription and metadata.
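
You can also drive these stages yourself through Whisper's lower-level API instead of calling transcribe(). The sketch below mirrors what happens internally for a single 30-second window:

import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window the model expects
audio = whisper.load_audio("example.wav")
audio = whisper.pad_or_trim(audio)

# Audio preprocessing: compute the log-Mel spectrogram the encoder consumes
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language from this window
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# Inference: decode the features into text
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)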

Implementing Speech-to-Text Functionality for Vocalshift

1. Creating the STT CLI Tool

Save the following as stt.py; the Vocalshift tool in the next step imports transcribe_audio from it.

import whisper
import argparse

def transcribe_audio(audio_file, model_size="base"):
    # Step 1: Load the Whisper model
    print(f"Loading Whisper model ({model_size})...")
    model = whisper.load_model(model_size)

    # Step 2: Transcribe the audio file
    print(f"Transcribing file: {audio_file}...")
    result = model.transcribe(audio_file)

    # Step 3: Output the transcription
    print("\nTranscription:")
    print(result["text"])
    return result["text"]

def create_stt_cli():
    parser = argparse.ArgumentParser(description='Speech to Text CLI Tool')
    parser.add_argument('--audio-file', type=str, required=True, help='Path to the input .wav file')
    parser.add_argument('--model-size', type=str, default='base', choices=['tiny', 'base', 'small', 'medium', 'large'], help='Size of the Whisper model to use')
    return parser

def main():
    parser = create_stt_cli()
    args = parser.parse_args()

    transcribe_audio(args.audio_file, args.model_size)

if __name__ == "__main__":
    main()
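You can then transcribe a file from the command line, for example:

python stt.py --audio-file example.wav --model-size small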

2. Creating the Vocalshift CLI Tool to Transform Voices

This tool chains the two halves of the pipeline: it transcribes the input audio with transcribe_audio from stt.py, then hands the text to the existing process_tts function (imported from main.py) to synthesize a new voice.

import argparse
from stt import transcribe_audio
from main import process_tts

def create_vocalshift_cli():
    parser = argparse.ArgumentParser(description='Vocal Shift CLI Tool')
    parser.add_argument('--input-audio', type=str, required=True, help='Path to the input audio file')
    parser.add_argument('--output-audio', type=str, required=True, help='Path to the output audio file')
    parser.add_argument('--stt-model-size', type=str, default='base', choices=['tiny', 'base', 'small', 'medium', 'large'], help='Size of the Whisper model to use for transcription')
    parser.add_argument('--speaker', type=str, help='Path to speaker voice sample')
    parser.add_argument('--effect', type=str, default=None, help='Effect to apply to the audio')
    parser.add_argument('--effect-level', type=float, default=1.0, help='Effect level to apply to the audio')
    return parser

def vocal_shift(input_audio, output_audio, stt_model_size='base', speaker=None, effect=None, effect_level=1.0):
    # Step 1: Transcribe the input audio to text
    print(f"Transcribing audio file: {input_audio}...")
    text = transcribe_audio(input_audio, model_size=stt_model_size)

    # Step 2: Convert the transcribed text back to audio
    print(f"Converting text back to audio...")
    success = process_tts(
        text=text,
        output_path=output_audio,
        speaker_path=speaker,
        effect=effect,
        effect_level=effect_level
    )

    if success:
        print(f"Audio saved to: {output_audio}")
        return True
    else:
        print("Conversion failed")
        return False

def main():
    parser = create_vocalshift_cli()
    args = parser.parse_args()

    vocal_shift(
        input_audio=args.input_audio,
        output_audio=args.output_audio,
        stt_model_size=args.stt_model_size,
        speaker=args.speaker,
        effect=args.effect,
        effect_level=args.effect_level
    )

if __name__ == "__main__":
    main()
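Assuming the script is saved as vocalshift.py (and that speaker_sample.wav is a placeholder path to a voice sample your process_tts setup accepts), a typical run looks like:

python vocalshift.py --input-audio input.wav --output-audio output.wav --stt-model-size base --speaker speaker_sample.wav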

Challenges in Speech-to-Text

Even advanced STT systems face challenges:

  • Background Noise: Differentiating speech from ambient sounds.
  • Accents and Dialects: Adapting to diverse speech patterns.
  • Domain-Specific Vocabulary: Handling technical terms and uncommon words.

Strategies to Overcome Challenges

  • Noise Reduction: Apply audio preprocessing techniques.
  • Fine-Tuning: Train Whisper on domain-specific datasets.
  • Post-Processing: Implement text correction algorithms.
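
As a small taste of the first strategy, here is a minimal noise-reduction sketch, assuming scipy is installed and the input is a PCM WAV file: a Butterworth high-pass filter that strips low-frequency rumble before the audio reaches Whisper.

import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt

# Read the recording (most WAV files load as integer PCM)
rate, audio = wavfile.read("example.wav")
audio = audio.astype(np.float32)

# 4th-order Butterworth high-pass at 100 Hz removes low-frequency rumble
sos = butter(4, 100, btype="highpass", fs=rate, output="sos")
filtered = sosfilt(sos, audio, axis=0)

# Save the cleaned audio for Whisper to transcribe
wavfile.write("example_clean.wav", rate, filtered.astype(np.int16))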

What We Achieved Today

By the end of Day 5, participants:

  • Gained a comprehensive understanding of Speech-to-Text systems.
  • Learned how to set up and use Whisper AI for transcriptions.
  • Explored real-world applications and challenges of Whisper AI.

Resources for Further Learning

  • Whisper AI on GitHub: https://github.com/openai/whisper
  • FFmpeg documentation: https://ffmpeg.org/documentation.html

Your Feedback Matters!

We’d love to hear from you! Share your experiences, questions, or feedback in the comments below. Let’s keep innovating together. 🚀
