Best Open Source Subtitle Generator? Canary Qwen 2.5B + Whisper Full Guide
Full tutorial link > https://www.youtube.com/watch?v=4lAk6sf1qF8
Info
NVIDIA NeMo Canary-Qwen-2.5B is an English speech recognition model that achieves state-of-the-art performance on multiple English speech benchmarks. The Canary model is the new king that dethroned the famous Whisper.
Full tutorial for the Whisper TTS Premium speech-to-text app by SECourses with new NVIDIA Canary Qwen 2.5B support. In this video, I demo local subtitle generation, compare Canary Qwen 2.5B against Whisper Large V3, show output formats, batch processing, presets, YouTube URL and live microphone options, then install the app from scratch on Windows.
You will also see RunPod and Massed Compute notes, first-run model download, RTX 5000/CUDA 13 driver requirements, subprocess mode for preventing VRAM/RAM leaks, and when to use Whisper instead of Canary.
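The subprocess mode mentioned above works by running each transcription job in a separate Python process, so all of the child's RAM (and, for a real job, VRAM) is reclaimed by the OS when it exits and nothing can leak into the parent. A minimal sketch of that pattern, using only the standard library (the `job` payload here is a hypothetical stand-in, not the app's actual code):

```python
import subprocess
import sys

def run_job_isolated(code: str) -> str:
    """Run a Python snippet in a fresh child process and return its stdout.

    When the child exits, the OS reclaims all of its memory, so a heavy
    model load inside the child cannot leak RAM/VRAM into the parent.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Hypothetical stand-in for loading a model and transcribing one file.
job = "print('transcribed: sample.mp4')"
print(run_job_isolated(job))
```

Cancelling a job in this scheme is just killing the child process, which is why cancellation can be immediate and leak-free.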
Links:
Download App and the source post: [ https://www.patreon.com/posts/whisper-webui-to-145395299 ]
Discord: [ https://discord.com/channels/772774097734074388/1079506787734134844 ]
Patreon app index: [ https://github.com/FurkanGozukara/Stable-Diffusion/blob/main/Patreon-Posts-Index.md ]
Related RunPod/Massed Compute setup tutorial: [ https://youtu.be/ZRrzvD4wNys ]
In my tutorial-video tests, Canary Qwen 2.5B achieved 5.91% global WER and reached up to 46x faster than real-time transcription, making it my new recommended default for English speech-to-text. Whisper remains useful when you need broader spoken-language support or word-level timestamps.
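For reference, global WER is simply the word-level edit distance between reference and hypothesis divided by the reference word count, and "46x real-time" means 46 minutes of audio transcribed per minute of processing. A minimal sketch of both metrics (not the benchmark script used in the video):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming edit distance over words, one row at a time.
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        row = [i]
        for j, h in enumerate(hyp, 1):
            sub = prev_row[j - 1] + (r != h)          # substitution or match
            row.append(min(sub, prev_row[j] + 1, row[j - 1] + 1))
        prev_row = row
    return prev_row[-1] / max(len(ref), 1)

def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are transcribed per second of wall time."""
    return audio_seconds / processing_seconds

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion
print(realtime_factor(46 * 60, 60))                         # 46x real time
```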
Chapters:
- 0:00 Intro to the local open-source speech-to-text app and new Canary support
- 0:20 Quick demo setup with NVIDIA Canary Qwen 2.5B
- 0:33 Maximum-quality defaults and starting subtitle generation
- 0:48 Live transcription speed and accuracy preview
- 1:03 Chunk length settings for smaller or larger subtitle segments
- 1:18 Fast generation, supported exports, and restarting with all formats
- 1:31 Multiple subtitle file formats explained
- 1:47 Batch processing folders, output paths, subfolders, and overwrite mode
- 1:58 YouTube URLs, microphone/live transcription, translation, and BGM separation
- 2:09 Saving presets and using advanced parameters
- 2:24 Auto-optimized defaults for Whisper and Canary models
- 2:39 Canary Qwen 2.5B vs Whisper Large V3 comparison begins
- 2:54 Real-world WER benchmark and 5.91% Canary result
- 3:10 Why non-native English speech is harder to transcribe accurately
- 3:24 Canary speed advantage and 46x real-time transcription explained
- 3:43 Test averages across long and short tutorial videos
- 3:59 Cases where Whisper slightly wins and final Canary recommendation
- 4:14 Opening the output folder after transcription completes
- 4:27 VTT output, matching filenames, capitalization, and punctuation
- 4:44 Accuracy examples inside the generated transcript
- 4:58 TXT, TSV, SRT, LRC exports and word-level timestamp note
- 5:20 Download page, latest ZIP, and installation overview
- 5:31 Windows requirements: Python 3.11, Git, CUDA, and C++ notes
- 5:51 Choosing install location and keeping the app isolated in venv
- 6:04 Extracting the ZIP and running Windows install/update BAT
- 6:23 Automatic model downloads on first run
- 6:34 RunPod, Massed Compute, and Linux installation files
- 6:50 Where to learn RunPod and Massed Compute setup in the related guide
- 7:29 UV-powered Windows installation completes quickly
- 7:41 Starting the app with Windows start app BAT
- 7:58 Selecting video/audio input and generating subtitles on a fresh install
- 8:10 First-run Canary model download and 5GB model size
- 8:35 Easy setup goal and automatic fresh-install workflow
- 8:53 Discord, Patreon index, and 100+ SECourses applications
- 9:13 RTX 5000 support and updated NVIDIA driver requirement
- 9:35 Fresh-install transcription starts successfully
- 9:47 Automatic downloads for Canary, Whisper, diarization, and extra tools
- 10:16 Canary becomes the new default model recommendation
- 10:36 Subprocess mode to prevent VRAM and RAM leaks
- 10:51 Why running transcription as a subprocess is recommended
- 11:04 Switching back to Whisper models when needed
- 11:20 Whisper language coverage vs Canary and audio/video support
- 11:42 Real recording benchmark: 27 minutes transcribed in about 2 minutes
- 11:56 Model loading overhead and clean RAM/VRAM release
- 12:08 Final notes, subscribe reminder, and downloading the full transcript ZIP
Download Installers and App
30 April 2026 - Version 10.0
This is quite a big upgrade to our application
We now fully support NVIDIA NeMo Canary-Qwen-2.5B, an English speech recognition model: https://huggingface.co/nvidia/canary-qwen-2.5b
This model is currently the State Of The Art (SOTA) speech-to-text model for the English language
I have done extensive research and testing, and the app ships with the best default parameters for it
It fully supports all of the features our Whisper app was already supporting
Get the zip file, overwrite all previous files, and run the installer to update / upgrade
The model will be auto-downloaded the first time you run the app
- I also compared it against Whisper's best configurations; the comparison results are here (Whisper's best results were taken)
- As you can see, NVIDIA NeMo Canary-Qwen-2.5B is not only significantly more accurate but also faster
15 April 2026 - Version 8.0
Diarization had an error, and this is now fixed
Mic tab completely remade; now both live transcription from the microphone and offline transcription from microphone recordings work
- Live transcription quality is not that great
- Both live and offline transcription recordings from the microphone will be saved in the outputs folder
- Live transcription runs automatically, but for offline transcription, first record your voice with the microphone and then click the Generate Subtitles button
Don't forget to select your working microphone and give the app permission to use your microphone in your browser
For update / install, get the latest zip file, overwrite older files, and run Windows_Install_Update.bat
14 April 2026 - Version 7.0
Now auto-downloads diarization files, so you no longer need to enter a Hugging Face token and request permission
Now you can copy-paste any YouTube link and generate subtitles
- This was broken and is now fixed
- It will save generated files with the same name as the video title
Now you can batch generate subtitles for entire YouTube channels
- Paste the channel URL, enable batch, and it will generate subtitles for every video
- Set how many videos you want (it scans the latest ones)
- You may get rate limited by YouTube
For update / install, get the latest zip file, overwrite older files, and run Windows_Install_Update.bat
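YouTube downloading in tools like this is typically handled by yt-dlp. A hedged sketch of building an audio-extraction command with a title-based output name, which matches the behavior of saving files under the video title (the exact options and output template are my illustration, not necessarily what the app uses):

```python
import shlex

def build_ytdlp_cmd(url: str, out_dir: str = "outputs") -> list[str]:
    """Build a yt-dlp command that extracts audio as WAV,
    naming the output file after the video title."""
    return [
        "yt-dlp",
        "-x",                              # extract audio only
        "--audio-format", "wav",           # convert to WAV for the ASR model
        "-o", f"{out_dir}/%(title)s.%(ext)s",  # title-based filename
        url,
    ]

cmd = build_ytdlp_cmd("https://www.youtube.com/watch?v=4lAk6sf1qF8")
print(shlex.join(cmd))  # ready to pass to subprocess.run(cmd)
```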
8 April 2026 - Version 5.0
This is a massive update with so many new features
- Get the latest zip file and make a fresh install please > https://www.patreon.com/posts/145395299
- 1-Click to install on Windows, RunPod, SimplePod, Massed Compute, Linux
New preset save and load system with extremely well tested best_quality and fast pre-made presets
- Presets are applied automatically as you change them, and the last used preset is remembered when you restart the app
- Word Timestamps is enabled by default to improve quality; a regular version is also generated automatically
Download transcription button
Open outputs folder button (all transcriptions automatically saved)
Load video / audio file directly from path (useful for platforms like RunPod where Gradio upload is slow)
The fast preset uses our new custom, in-house batch size 32 implementation, and it is blazing fast compared to all other existing Whisper apps and repos
Fully supports upload of all kinds of video and audio formats with full preview
Batch folder processing: processes all files in a given folder automatically
Live transcription window that shows the latest transcription live while processing
At batch size 1 with best quality: 11x real-time transcription speed (depends on GPU)
At batch size 32 with the fast preset: 15x to 30x real-time transcription speed (depends on GPU)
New feature: Repeat Initial Prompt Every Window
Supports all Whisper models like Large V1, Large V3, Turbo, Distil Large, Tiny, etc.
Supports the following output formats; check them all to generate them all at the same time: SRT, WebVTT, TXT, LRC, JSON, TSV
- All outputs will have the same name as your input file
With the subprocess working system, you can cancel any processing immediately with zero RAM or VRAM leaks
Fully supports Windows and Linux (use Massed Compute installer)
Based on a Python 3.11 venv, CUDA 13, and Torch 2.9.1 with pre-compiled libraries like Flash Attention
If you don't like the output, try enabling / disabling Condition On Previous Text; it makes a big difference
The app supports 100 languages and 32 models
Lots of advanced parameters, all set to best quality
Built-in Background Music Remover filter
Built-in Voice Detection filter
Fully detailed CMD output to watch the entire progress
Extremely optimized VRAM usage; runs on GPUs with as little as 6 GB VRAM
Some other utility features: YouTube download, recording from a mic, T2T translation, BGM separation
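As a concrete example of the simplest of these output formats, here is how subtitle segments map to an SRT document; this is a sketch of the format itself, not the app's actual exporter:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT 'HH:MM:SS,mmm' timestamp."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """segments: (start_s, end_s, text) triples -> SRT document string."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        # Each SRT cue: index, timing line, text, then a blank line.
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)

print(to_srt([
    (0.0, 2.5, "Hello and welcome."),
    (2.5, 5.0, "Let's generate subtitles."),
]))
```

WebVTT is nearly identical except for a `WEBVTT` header and `.` instead of `,` in timestamps, which is why apps can cheaply emit all formats at once from the same segment list.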