A beginner's guide to the Insanely-Fast-Whisper-With-Video model by Turian on Replicate

This is a simplified guide to an AI model called Insanely-Fast-Whisper-With-Video, maintained by Turian. If you like this kind of analysis, you can join AImodels.fyi or follow us on Twitter.

Model overview

The insanely-fast-whisper-with-video model, created by turian, is an audio transcription tool built on OpenAI's Whisper Large v3 model. Its main strength is speed: it can transcribe up to 150 minutes of audio in less than 98 seconds on an Nvidia A100 80GB GPU. The model also supports video transcription, which makes it usable for a wider range of applications.
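To put the benchmark figure in perspective, the short calculation below converts it into a real-time factor. The 45-minute recording is a hypothetical example, and the estimate assumes throughput scales linearly with audio length on the same hardware, which may not hold in practice.

```python
# Throughput implied by the quoted benchmark: 150 minutes of audio in 98 seconds.
audio_seconds = 150 * 60        # 9000 seconds of audio
wall_clock_seconds = 98         # upper bound quoted for an A100 80GB GPU

# How many seconds of audio are processed per second of compute.
real_time_factor = audio_seconds / wall_clock_seconds
print(f"About {real_time_factor:.0f}x faster than real time")

# Hypothetical example: a 45-minute recording at the same rate
# (assumes linear scaling, which is an assumption, not a measured result).
estimate_seconds = 45 * 60 / real_time_factor
print(f"A 45-minute file would take roughly {estimate_seconds:.0f} seconds")
```

Because the source quotes "less than 98 seconds", the real-time factor computed here is a lower bound on that hardware.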

The insanely-fast-whisper-with-video model builds on the work of chenxwh/insanely-fast-whisper and adidoes/cog-whisperx-video-transcribe. It uses techniques such as fp16 precision, batching, Flash Attention 2, and BetterTransformer to reach these transcription speeds.
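As a rough illustration of how fp16 precision, batching, and Flash Attention 2 fit together, here is a minimal sketch using the Hugging Face transformers pipeline. It is not the model's actual source code. The helper names, the batch size of 24, and the 30-second chunk length are illustrative assumptions, and running it requires a CUDA GPU with torch and transformers installed (plus flash-attn for the Flash Attention 2 path).

```python
def attention_kwargs(use_flash_attention: bool = True) -> dict:
    """Pick the attention backend passed to the model constructor."""
    # "flash_attention_2" needs the flash-attn package and a supported GPU;
    # "sdpa" is PyTorch's built-in scaled dot-product attention fallback.
    backend = "flash_attention_2" if use_flash_attention else "sdpa"
    return {"attn_implementation": backend}


def build_pipeline(use_flash_attention: bool = True):
    """Create a half-precision Whisper Large v3 ASR pipeline on the first GPU."""
    # Imported lazily so attention_kwargs() works without torch installed.
    import torch
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16,   # fp16 precision
        device="cuda:0",
        model_kwargs=attention_kwargs(use_flash_attention),
    )


def transcribe(path: str, batch_size: int = 24) -> dict:
    """Transcribe one file by splitting it into chunks and batching them."""
    pipe = build_pipeline()
    # Long audio is cut into 30-second chunks that are decoded batch_size at a time.
    return pipe(path, chunk_length_s=30, batch_size=batch_size, return_timestamps=True)
```

Calling `transcribe("talk.mp3")` (a hypothetical file name) would return a dictionary with the full text and timestamped chunks. Lowering `batch_size` trades speed for lower GPU memory use.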

Model inputs and outputs

Inputs

File Name: The path or URL to the audio or video file to be transcribed.
Task: The task to be performed, either transcription or translation.
Language: The language of the input audio (optional, Whisper can auto-detect the language).
Batch Size: The number of parallel batches to compute. Reduce it if you run into Out-Of-Memory (OOM) errors.
Timestamp: The type of timestamp to generate, either chunked or word-level.
Diarise Audio: Whether to use Pyannote.audio to diarise the audio clips, which requires a Hugging Face token.
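The inputs above could be assembled and sent through the Replicate Python client roughly as follows. This is a sketch under stated assumptions: the input key names (`audio`, `task`, `language`, `batch_size`, `timestamp`, `diarise_audio`, `hf_token`) and the allowed values are guesses based on the list above, not confirmed against the model's API schema, so check the model page on Replicate before relying on them. The version string is left for you to copy from that page.

```python
from typing import Optional


def build_input(
    file_url: str,
    task: str = "transcribe",
    language: Optional[str] = None,
    batch_size: int = 24,
    timestamp: str = "chunk",
    diarise_audio: bool = False,
    hf_token: Optional[str] = None,
) -> dict:
    """Validate the options and build the input payload (key names are assumed)."""
    if task not in {"transcribe", "translate"}:
        raise ValueError("task must be 'transcribe' or 'translate'")
    if timestamp not in {"chunk", "word"}:
        raise ValueError("timestamp must be 'chunk' or 'word'")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if diarise_audio and not hf_token:
        raise ValueError("diarisation requires a Hugging Face token")

    payload = {
        "audio": file_url,            # assumed key for the file path or URL
        "task": task,
        "batch_size": batch_size,
        "timestamp": timestamp,
        "diarise_audio": diarise_audio,
    }
    if language is not None:
        payload["language"] = language   # omit to let Whisper auto-detect
    if diarise_audio:
        payload["hf_token"] = hf_token
    return payload


def run_model(version: str, payload: dict):
    """Send the payload to Replicate; needs REPLICATE_API_TOKEN in the environment."""
    import replicate  # imported lazily so build_input() works without the client

    return replicate.run(
        f"turian/insanely-fast-whisper-with-video:{version}", input=payload
    )
```

For example, `build_input("https://example.com/talk.mp4", timestamp="word")` (a placeholder URL) produces a payload requesting word-level timestamps with language auto-detection. Passing it to `run_model` together with a version identifier would start the prediction.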