Furkan Gözükara

Best Open Source Subtitle Generator? Canary Qwen 2.5B + Whisper Full Guide

Full tutorial link > https://www.youtube.com/watch?v=4lAk6sf1qF8

Info

NVIDIA NeMo Canary-Qwen-2.5B is an English speech recognition model that achieves state-of-the-art performance on multiple English speech benchmarks. Canary is the new king that dethroned the famous Whisper.

Full tutorial for the Whisper TTS Premium speech-to-text app by SECourses with new NVIDIA Canary Qwen 2.5B support. In this video, I demo local subtitle generation, compare Canary Qwen 2.5B against Whisper Large V3, show output formats, batch processing, presets, YouTube URL and live microphone options, then install the app from scratch on Windows.

You will also see RunPod and Massed Compute notes, first-run model download, RTX 5000/CUDA 13 driver requirements, subprocess mode for preventing VRAM/RAM leaks, and when to use Whisper instead of Canary.

Links:

Download App and the source post: [ https://www.patreon.com/posts/whisper-webui-to-145395299 ]

Discord: [ https://discord.com/channels/772774097734074388/1079506787734134844 ]

Patreon app index: [ https://github.com/FurkanGozukara/Stable-Diffusion/blob/main/Patreon-Posts-Index.md ]

Related RunPod/Massed Compute setup tutorial: [ https://youtu.be/ZRrzvD4wNys ]

In my tutorial-video tests, Canary Qwen 2.5B achieved 5.91% global WER and reached up to 46x faster than real-time transcription, making it my new recommended default for English speech-to-text. Whisper remains useful when you need broader spoken-language support or word-level timestamps.
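The two numbers quoted here are word error rate (WER) and the speed factor relative to real time. As a minimal illustrative sketch (not the app's actual benchmarking code; both function names are hypothetical), WER is the word-level Levenshtein distance divided by the reference word count, and the speed factor is audio duration divided by wall-clock processing time:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming table for Levenshtein distance over words
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(row[j] + 1,            # deletion
                      row[j - 1] + 1,        # insertion
                      prev_diag + (r != h))  # substitution (0 if words match)
            prev_diag, row[j] = row[j], cur
    return row[-1] / len(ref)

def speed_factor(audio_seconds: float, wall_seconds: float) -> float:
    """How many times faster than real time the transcription ran."""
    return audio_seconds / wall_seconds

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors over 6 words
print(speed_factor(27 * 60, 2 * 60))  # 27 min of audio in 2 min of wall time
```

A 5.91% WER means roughly 6 word-level mistakes per 100 reference words; the 46x figure is this speed factor at its peak.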

Chapters:

  • 0:00 Intro to the local open-source speech-to-text app and new Canary support
  • 0:20 Quick demo setup with NVIDIA Canary Qwen 2.5B
  • 0:33 Maximum-quality defaults and starting subtitle generation
  • 0:48 Live transcription speed and accuracy preview
  • 1:03 Chunk length settings for smaller or larger subtitle segments
  • 1:18 Fast generation, supported exports, and restarting with all formats
  • 1:31 Multiple subtitle file formats explained
  • 1:47 Batch processing folders, output paths, subfolders, and overwrite mode
  • 1:58 YouTube URLs, microphone/live transcription, translation, and BGM separation
  • 2:09 Saving presets and using advanced parameters
  • 2:24 Auto-optimized defaults for Whisper and Canary models
  • 2:39 Canary Qwen 2.5B vs Whisper Large V3 comparison begins
  • 2:54 Real-world WER benchmark and 5.91% Canary result
  • 3:10 Why non-native English speech is harder to transcribe accurately
  • 3:24 Canary speed advantage and 46x real-time transcription explained
  • 3:43 Test averages across long and short tutorial videos
  • 3:59 Cases where Whisper slightly wins and final Canary recommendation
  • 4:14 Opening the output folder after transcription completes
  • 4:27 VTT output, matching filenames, capitalization, and punctuation
  • 4:44 Accuracy examples inside the generated transcript
  • 4:58 TXT, TSV, SRT, LRC exports and word-level timestamp note
  • 5:20 Download page, latest ZIP, and installation overview
  • 5:31 Windows requirements: Python 3.11, Git, CUDA, and C++ notes
  • 5:51 Choosing install location and keeping the app isolated in venv
  • 6:04 Extracting the ZIP and running Windows install/update BAT
  • 6:23 Automatic model downloads on first run
  • 6:34 RunPod, Massed Compute, and Linux installation files
  • 6:50 Where to learn RunPod and Massed Compute setup in the related guide
  • 7:29 UV-powered Windows installation completes quickly
  • 7:41 Starting the app with Windows start app BAT
  • 7:58 Selecting video/audio input and generating subtitles on a fresh install
  • 8:10 First-run Canary model download and 5GB model size
  • 8:35 Easy setup goal and automatic fresh-install workflow
  • 8:53 Discord, Patreon index, and 100+ SECourses applications
  • 9:13 RTX 5000 support and updated NVIDIA driver requirement
  • 9:35 Fresh-install transcription starts successfully
  • 9:47 Automatic downloads for Canary, Whisper, diarization, and extra tools
  • 10:16 Canary becomes the new default model recommendation
  • 10:36 Subprocess mode to prevent VRAM and RAM leaks
  • 10:51 Why running transcription as a subprocess is recommended
  • 11:04 Switching back to Whisper models when needed
  • 11:20 Whisper language coverage vs Canary and audio/video support
  • 11:42 Real recording benchmark: 27 minutes transcribed in about 2 minutes
  • 11:56 Model loading overhead and clean RAM/VRAM release
  • 12:08 Final notes, subscribe reminder, and downloading the full transcript ZIP
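The export formats mentioned in the chapters (SRT, VTT, TXT, LRC, TSV, JSON) differ mostly in cue and timestamp syntax. As a hedged illustration of what such an exporter does (the function names here are hypothetical, not the app's API), SRT cues can be rendered from (start, end, text) segments like this:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT 'HH:MM:SS,mmm' timestamp."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """segments: iterable of (start_sec, end_sec, text) tuples."""
    cues = []
    for index, (start, end, text) in enumerate(segments, 1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(cues)

print(segments_to_srt([(0.0, 3.5, "Hello there"), (3.5, 7.25, "Welcome to the demo")]))
```

WebVTT uses nearly the same cue layout but separates milliseconds with a period instead of a comma, which is why apps can generate all formats from the same segment list.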

Download Installers and App

https://www.patreon.com/posts/145395299

30 April 2026 - Version 10.0

  • This is quite a big upgrade to our application

  • We now fully support NVIDIA NeMo Canary-Qwen-2.5B, an English speech recognition model: https://huggingface.co/nvidia/canary-qwen-2.5b

  • This model is currently the State Of The Art (SOTA) speech-to-text model for the English language

  • I have done extensive research and testing, and the app is set to the best default parameters

  • Fully supports all of the features our Whisper app was already supporting

  • Get the zip file, overwrite all previous files, and run the installer to update / upgrade

  • The model will be auto-downloaded the first time you run the app

  • I have also compared it against Whisper's best configurations; the comparison results below use the best results Whisper produced

  • As you can see, NVIDIA NeMo Canary-Qwen-2.5B is not only significantly more accurate but also faster

15 April 2026 - Version 8.0

  • Diarization had an error, and this is now fixed

  • The Mic tab was completely remade, and now both live transcription and offline transcription from the microphone work

    • Live transcription quality is not that great
    • Both live and offline transcription recordings from the microphone will be saved in the outputs folder
    • Live transcription runs automatically; for offline transcription, first record your voice with the microphone and then click the Generate Subtitles button
  • Don't forget to select your working microphone and grant your browser permission for the app to use it

  • To update or install, get the latest zip file, overwrite the older files, and run Windows_Install_Update.bat

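Since both live and offline mic recordings end up in the outputs folder, each file needs a unique name. A minimal sketch of such a naming scheme, with a hypothetical helper (the app's actual naming may differ):

```python
from datetime import datetime
from pathlib import Path

def recording_path(outputs_dir: str = "outputs", kind: str = "live") -> Path:
    """Build a unique, timestamped path for a microphone recording.

    Timestamping the filename means successive live/offline recordings
    never overwrite each other in the outputs folder.
    """
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    path = Path(outputs_dir) / f"mic_{kind}_{stamp}.wav"
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

print(recording_path(kind="offline"))
```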
14 April 2026 - Version 7.0

  • Now auto-downloads the diarization files, so you don't need to enter a Hugging Face token and request permission

  • Now you can copy-paste any YouTube link and generate subtitles

    • This was broken and is now fixed
    • Generated files are saved with the same name as the video title
  • Now you can batch generate subtitles for YouTube video channels

  • Paste the channel URL, enable batch mode, and it will generate subtitles for every video

    • Set how many videos you want (scans latest ones)
    • You may get rate limited by YouTube
  • To update or install, get the latest zip file, overwrite the older files, and run Windows_Install_Update.bat
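Saving outputs under the YouTube video title means stripping characters that are illegal in filenames. A minimal sketch of that step, assuming Windows-style restrictions (the helper name is hypothetical; the app's actual sanitizer may differ):

```python
import re

def safe_filename(title: str, max_len: int = 120) -> str:
    """Turn a video title into a filesystem-safe base name."""
    # Replace characters invalid on Windows (and '/' on Linux), plus control chars
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", title)
    # Collapse whitespace; Windows also rejects trailing dots and spaces
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    return cleaned[:max_len] or "untitled"

print(safe_filename('Canary Qwen 2.5B vs Whisper: Which Wins?'))
```

Truncating to a maximum length also avoids path-length errors when a channel batch contains very long titles.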

8 April 2026 - Version 5.0

  • This is a massive update with so many new features

  • New preset save and load system with extremely well-tested best_quality and fast pre-made presets

    • Presets load automatically as you change them, and the last used preset is remembered when you restart the app
    • Word Timestamps is enabled by default to improve quality, and a regular version is also generated automatically
  • Download transcription button

  • Open outputs folder button (all transcriptions automatically saved)

  • Load video / audio file directly from path (useful for platforms like RunPod where Gradio upload is slow)

  • The fast preset uses a new custom, in-house batch-size-32 implementation and is blazing fast compared to all other existing Whisper apps and repos

  • Fully supports all kinds of video and audio format uploads with full preview

  • Batch folder processing: processes all files in a given folder automatically

  • Live transcription window that shows the latest transcription live while processing

  • At batch size 1 with best quality, 11x real time transcription speed (depends on GPU)

  • At batch size 32 fast preset 15x to 30x real time transcription speed (depends on GPU)

  • New feature: Repeat Initial Prompt Every Window

  • Supports all Whisper models like Large V1, Large V3, Turbo, Distill Large, Tiny, etc

  • Supports the following output formats, and you can check all of them so everything is generated at the same time: SRT, WebVTT, TXT, LRC, JSON, TSV

    • All outputs will have the same name as your input file
  • With the subprocess working system, you can cancel any processing immediately with zero RAM or VRAM leaks

  • Fully supports Windows and Linux (use Massed Compute installer)

  • Based on Python 3.11 VENV and CUDA 13 and Torch 2.9.1 with pre-compiled libraries like Flash Attention

  • If you don't like the output, try enabling / disabling Condition On Previous Text; it makes a big difference

  • The app supports 100 languages and 32 models

  • Lots of Advanced Parameters, all set to best quality

  • Built in Background Music Remover Filter

  • Built in Voice Detection Filter

  • Fully detailed CMD output to watch entire progress

  • Extremely optimized VRAM usage: runs on GPUs with as little as 6 GB

  • Some other utility features like YouTube downloads, recording from a mic, T2T Translation, and BGM Separation

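The zero-leak cancellation described above works through process isolation: when the model runs in a child process, the OS reclaims all of its RAM (and the driver reclaims VRAM) the moment that process exits, and cancelling is just terminating it. A minimal sketch of the pattern with hypothetical names (not the app's actual code):

```python
import subprocess
import sys

def run_isolated(job_code: str, timeout=None):
    """Run a job (e.g. a transcription) in a child Python process.

    Everything the job allocates is freed by the OS when the child exits,
    so the parent app cannot leak RAM or VRAM; cancelling the job is
    simply terminating the process.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", job_code],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        output, _ = proc.communicate(timeout=timeout)
        return proc.returncode, output
    except subprocess.TimeoutExpired:
        proc.terminate()  # immediate cancel; the OS frees all its memory
        proc.wait()
        return None, ""

# A real job would load the model and transcribe; here a stand-in:
print(run_isolated("print('transcription done')"))
```

The trade-off is the model-loading overhead on each run, which is why the app notes that overhead separately from transcription speed.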
