This is a simplified guide to an AI model called whisperx maintained by victor-upmeet. If you like this kind of analysis, you should join AImodels.fyi or follow us on Twitter.
Model overview
whisperx is a speech transcription model maintained by victor-upmeet. It builds upon OpenAI's Whisper model, adding accelerated transcription, word-level timestamps, and speaker diarization. Unlike the original Whisper, whisperx batches inference for faster processing of long-form audio. It also comes in several variants optimized for different hardware setups, including victor-upmeet/whisperx-a40-large and victor-upmeet/whisperx-a100-80gb.
Model inputs and outputs
whisperx takes an audio file as input and produces a transcript with word-level timestamps and optional speaker diarization. It handles a variety of audio formats, supports many languages, and can detect the spoken language automatically. A usage sketch follows the lists below.
Inputs
- Audio File: The audio file to be transcribed
- Language: The ISO code of the language spoken in the audio (optional, can be automatically detected)
- VAD Onset/Offset: Onset and offset thresholds for voice activity detection, which control where speech segments are cut
- Diarization: Whether to assign speaker ID labels
- Alignment: Whether to align the transcript to get accurate word-level timestamps
- Speaker Limits: Minimum and maximum number of speakers for diarization
Outputs
- Detected Language: The ISO code of the detected language
- Segments: The transcribed text, with word-level timestamps and optional speaker IDs
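To make the mapping from these inputs to an actual request concrete, here is a minimal sketch of a call through Replicate's Python client. The parameter names (audio_file, vad_onset, align_output, and so on) are assumptions inferred from the list above, not a confirmed schema; check the model page on Replicate for the authoritative input names and defaults.

```python
# Hedged sketch: calling whisperx via Replicate's Python client.
# Input keys below are assumptions based on the input list above;
# consult the model's schema on Replicate for the exact names.
import replicate

output = replicate.run(
    "victor-upmeet/whisperx",  # or a hardware-specific variant
    input={
        "audio_file": open("interview.mp3", "rb"),  # audio to transcribe
        "language": "en",        # optional ISO code; omit to auto-detect
        "vad_onset": 0.5,        # assumed VAD onset threshold (example value)
        "vad_offset": 0.363,     # assumed VAD offset threshold (example value)
        "align_output": True,    # assumed flag for word-level alignment
        "diarization": True,     # assign speaker ID labels to segments
        "min_speakers": 1,       # assumed diarization speaker bounds
        "max_speakers": 3,
    },
)

print(output["detected_language"])  # ISO code of the detected language
```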
Capabilities
whisperx provides fast and accurate transcription of long-form audio, combining batched Whisper inference with alignment for word-level timestamps and optional speaker diarization.
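Assuming the output shape implied by the Outputs list above (a detected_language field plus a segments list carrying text, timings, and optional speaker labels), walking the transcript might look like the sketch below; the field names here are assumptions, not a documented contract.

```python
# Sketch of iterating the transcript; the keys ("segments", "start", "end",
# "text", "speaker") are assumptions inferred from the output list above.
for segment in output["segments"]:
    speaker = segment.get("speaker", "unknown")  # present only with diarization
    print(f"[{segment['start']:.2f}-{segment['end']:.2f}] "
          f"{speaker}: {segment['text']}")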