Jovan Chan

Posted on • Originally published at runaihome.com

Self-Host Whisper Large-v3 as a Transcription Server in 2026: faster-whisper + FastAPI

Your meeting ends at 5:00 PM. By 5:02, you want a full transcript, edited and searchable, without sending a single audio byte to Google or OpenAI. That's the promise of a self-hosted Whisper Large-v3 server — and in 2026 it's genuinely achievable on consumer hardware you already own.

Hardware requirements, backend comparison, step-by-step server setup, and the honest assessment of where the model still falls short — all below.

What Whisper Large-v3 Actually Is

OpenAI released Whisper Large-v3 on November 6, 2023. The architecture is identical to Large-v2 with two changes: the input uses 128 Mel frequency bins instead of 80, and a Cantonese language token was added. Those changes, combined with a training dataset of 1 million hours of weakly labeled audio plus 4 million hours of pseudo-labeled audio, produced a 10–20% reduction in word error rate across the supported language set compared to v2.

The model has 1.55 billion parameters and supports 99 languages (100 if you count Cantonese separately). On LibriSpeech test-clean — a clean audiobook benchmark — it hits 2.01% WER. On messier real-world audio (earnings calls, podcasts, meetings), expect 8–15% WER. Still the best open-source option by a wide margin.

The model weights on disk are ~3.0 GB. At inference time, GPU VRAM usage depends heavily on which backend you choose.
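As a rough sanity check on that on-disk figure, weight size is approximately parameter count times bytes per parameter. The sketch below assumes dense weights at 2 bytes per float16 parameter and 4 bytes per float32 parameter, and ignores file-format overhead. `weights_gb` is an illustrative helper, not part of any Whisper package.

```python
PARAMS = 1.55e9  # Whisper Large-v3 parameter count quoted above

# Bytes per parameter for each storage precision (assumption: dense weights, no overhead).
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_gb(params: float, precision: str) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


for precision in ("float16", "float32"):
    print(f"{precision}: ~{weights_gb(PARAMS, precision):.1f} GB")
```

The float16 result lands close to the ~3.0 GB figure. Runtime VRAM sits above the raw weight size because activations and decoder caches come on top of it.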

Hardware Requirements

VRAM: Lower Than You Think

The stock OpenAI Whisper package in float16 requires around 10 GB of VRAM for the Large-v3 model, and float32 doubles that. Run it naively on a 12 GB RTX 3060 and headroom will be tight.

The faster-whisper backend changes this completely:

| Precision | VRAM (faster-whisper) | Notes |
|---|---|---|
| float16 (FP16) | ~3.1 GB | Base load; add ~20% for inference overhead |
| int8_float16 | ~2.9 GB | Best accuracy-per-VRAM ratio |
| int8 | ~2.9 GB | CPU or GPU; minimal accuracy loss |
| Batched (batch_size=8, INT8) | ~4.5 GB | Throughput mode for bulk files |

Practical floor: any GPU with 6 GB VRAM runs Large-v3 comfortably via faster-whisper. A 4 GB card (GTX 1650, RX 570) is borderline — use int8 precision and keep batch size at 1.
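Those rules of thumb can be written down as a small lookup. This is a minimal sketch: the thresholds come from the table and paragraph above rather than from any limit enforced by faster-whisper, the CPU fallback below 4 GB is an added assumption, and `suggest_settings` is a hypothetical helper name. The `compute_type` strings are the values faster-whisper accepts.

```python
def suggest_settings(vram_gb: float, bulk: bool = False) -> dict:
    """Map available VRAM to faster-whisper settings using this article's rules of thumb."""
    if vram_gb < 4:
        # Assumption: below the borderline tier, fall back to the CPU INT8 path.
        return {"device": "cpu", "compute_type": "int8", "batch_size": 1}
    if vram_gb < 6:
        # Borderline 4 GB cards: int8 precision, batch size kept at 1.
        return {"device": "cuda", "compute_type": "int8", "batch_size": 1}
    if bulk:
        # 6 GB and up: batched INT8 (~4.5 GB) is the throughput mode for bulk files.
        return {"device": "cuda", "compute_type": "int8", "batch_size": 8}
    # 6 GB and up, single files: float16 fits with room for inference overhead.
    return {"device": "cuda", "compute_type": "float16", "batch_size": 1}


for vram in (4, 6, 12, 24):
    print(vram, "GB ->", suggest_settings(vram))
```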

For GPU selection context, see our GPU buying guide and the best local AI models by VRAM tier breakdown.

GPU Tier Benchmarks

The table below measures seconds of processing time per minute of audio (lower is better) using Whisper Large-v3 and the original OpenAI Whisper backend. Data sourced from 1 QuBit's cross-GPU benchmark (measured on long-form audio files):

| GPU | VRAM | Avg sec/min audio | Implied RTF | Real-time? |
|---|---|---|---|---|
| RTX 4090 | 24 GB | ~7 sec | ~0.12 | Yes, 8× faster |
| RTX 3090 | 24 GB | ~12–22 sec | ~0.20–0.37 | Yes, 3–5× faster |
| RTX 4060 Ti 16GB | 16 GB | ~18 sec | ~0.30 | Yes, ~3× faster |
| RTX 3060 | 12 GB | ~35 sec | ~0.58 | Borderline (faster-whisper improves this) |
| CPU (Intel i9) | n/a | ~150 sec | ~2.5 | No, 2.5× slower than real-time |

RTF (Real-Time Factor) = seconds of processing per second of audio. RTF < 1.0 means faster than real-time. RTF < 0.5 is good for live captioning. RTF < 0.1 is excellent for latency-sensitive pipelines.
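To make the definition concrete, here is a short sketch that turns the table's "seconds per minute of audio" figures into RTF values and labels them with the thresholds just quoted. The function names are illustrative, and the three sample rows are taken from the table above.

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: seconds of processing per second of audio."""
    return processing_seconds / audio_seconds


def describe(rtf_value: float) -> str:
    """Label an RTF value using the thresholds quoted above."""
    if rtf_value < 0.1:
        return "excellent for latency-sensitive pipelines"
    if rtf_value < 0.5:
        return "good for live captioning"
    if rtf_value < 1.0:
        return "faster than real-time"
    return "slower than real-time"


# Benchmark figures from the table: seconds of processing per 60 s of audio.
for name, sec_per_min in [("RTX 4090", 7), ("RTX 3060", 35), ("CPU (Intel i9)", 150)]:
    value = rtf(sec_per_min, 60)
    print(f"{name}: RTF {value:.2f}, {1 / value:.1f}x real-time speed, {describe(value)}")
```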

An RTX 4090 with faster-whisper and Flash Attention 2 can push 70–100× real-time on short clips, and around 8× on long files. The RTX 3090 lands at 3–5× depending on VRAM pressure and compute type setting. The RTX 3060 is borderline with the stock backend but gets to ~2× real-time with faster-whisper's INT8 quantization enabled.

If you're buying hardware specifically for transcription workloads, a used RTX 3090 hits the sweet spot: 24 GB of VRAM means no memory pressure even at batch_size=8, and it benchmarks at 3–5× real-time for pennies on the dollar versus a new card. See Amazon for current pricing.

Not ready to buy? You can run Whisper Large-v3 on a cloud GPU for a few cents per hour while you validate the setup. RunPod offers NVIDIA A100 and H100 instances where you can benchmark the full pipeline before committing to local hardware.

CPU Fallback

Running Large-v3 on CPU is possible with faster-whisper's INT8 path. On an Intel i9-12900K, expect RTF around 2.5 — meaning 1 second of audio takes 2.5 seconds to transcribe, and a 1-hour meeting takes 2.5 hours. That's fine for overnight batch jobs on voice memos, but useless for any live or near-real-time use case. Downsize to the medium or small model if CPU-only is your reality.
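The same arithmetic works for planning batch jobs: estimated wall-clock time is audio duration multiplied by RTF. A minimal sketch, using the CPU figure above and the RTX 4090 figure from the benchmark table as sample inputs (`transcription_time` is a hypothetical helper):

```python
from datetime import timedelta


def transcription_time(audio: timedelta, rtf: float) -> timedelta:
    """Estimated wall-clock time: audio duration multiplied by the Real-Time Factor."""
    return timedelta(seconds=audio.total_seconds() * rtf)


meeting = timedelta(hours=1)
print("CPU, RTF 2.5: ", transcription_time(meeting, 2.5))   # 2:30:00
print("GPU, RTF 0.12:", transcription_time(meeting, 0.12))  # 0:07:12
```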

The Backend Decision: Three Options

Option 1: faster-whisper (Recommended)

faster-whisper reimplements Whisper using CTranslate2, a C++ inference engine optimized for transformer models. It's up to 4× faster than the stock OpenAI package at the same accuracy, uses 50–70% less VRAM, and supports INT8 quantization on both GPU and CPU. This is the backend to use for a server deployment.

Pros: Best speed-per-VRAM ratio, active maintenance, OpenAI API-compatible server wrappers available, optional word-level timestamps, and a built-in VAD filter for skipping silence.

Cons: Requires CUDA 12, cuBLAS, and cuDNN 9 — the dependency chain trips up first-time installs on older CUDA setups.

Option 2: whisper.cpp

whisper.cpp (by Georgi Gerganov, the author of llama.cpp) is a pure C/C++ implementation that runs on CPU, CUDA, Metal, and OpenCL. It uses quantized GGML weights and is the most portable option: it runs on a Raspberry Pi 5, a Mac Mini, or a Windows machine without Python.

Pros: No Python, no CUDA required, smallest memory footprint of the three, excellent for embedded or edge deployment.

Cons: Slower than faster-whisper on NVIDIA GPUs; hallucination rate 20% higher than faster-whisper in controlled tests; no official streaming API out of the box.

Option 3: Original OpenAI Whisper

The original package is the reference implementation, runs on PyTorch, and is the easiest to install. It's also the slowest and most memory-hungry. If you have 12–16 GB VRAM and are doing casual single-file transcription, it works. For a server that stays running and handles concurrent requests, use faster-whisper instead.

Verdict: Use faster-whisper for any server deployment. Use whisper.cpp for resource-constrained or non-NVIDIA hardware. Use original Whisper only for quick one-off experiments.

Tutorial: Installing faster-whisper and Running a Transcription Server

The setup below uses faster-whisper + FastAPI for a lightweight HTTP endpoint that accepts audio file uploads and returns transcribed text. This stack is sufficient for personal use, meeting transcription, and family/team servers.
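As a preview of where the steps lead, the endpoint could look roughly like the sketch below. Treat it as one possible shape under stated assumptions, not the article's exact server: the `/transcribe` route, the `create_app` factory, and the response fields are illustrative choices, and `device="cuda"` with `compute_type="float16"` assumes a GPU with enough VRAM. The imports sit inside the factory so the module can be loaded on a machine without the GPU stack installed.

```python
import os
import tempfile


def segments_to_text(segments) -> str:
    """Join segment texts into one transcript string.

    Expects objects with a .text attribute, as yielded by WhisperModel.transcribe().
    """
    return " ".join(seg.text.strip() for seg in segments)


def create_app(model_size: str = "large-v3", device: str = "cuda",
               compute_type: str = "float16"):
    """Build the FastAPI app and load the model once, at startup."""
    from fastapi import FastAPI, File, UploadFile
    from faster_whisper import WhisperModel

    app = FastAPI()
    model = WhisperModel(model_size, device=device, compute_type=compute_type)

    @app.post("/transcribe")  # hypothetical route name
    def transcribe(file: UploadFile = File(...)):
        # Spool the upload to a temp file so the decoder gets a plain path.
        suffix = os.path.splitext(file.filename or "")[1]
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            tmp.write(file.file.read())
            path = tmp.name
        try:
            segments, info = model.transcribe(path, beam_size=5)
            # segments is a lazy generator; joining it runs the actual decoding.
            text = segments_to_text(segments)
        finally:
            os.remove(path)
        return {"language": info.language, "duration": info.duration, "text": text}

    return app
```

Assuming this lives in a file named `server.py` with `app = create_app()` at the bottom, it could be started with `uvicorn server:app --host 0.0.0.0 --port 8000` and exercised with `curl -F "file=@meeting.wav" http://localhost:8000/transcribe`. The endpoint is a plain `def` rather than `async def` so FastAPI runs the blocking transcription call in its thread pool.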

For a multi-user team server setup, the patterns in our Open WebUI family setup guide apply directly — replace the model backend with this transcription API and put Caddy in front.

Prerequisites

  • Python 3.9+
  • NVIDIA GPU with CUDA 12 installed (or CPU-only with device="cpu")
  • cuBLAS and cuDNN 9 (cuBLAS ships with the CUDA Toolkit; cuDNN is a separate NVIDIA install)
  • ffmpeg (for audio preprocessing; sudo apt install ffmpeg on Linux, winget install ffmpeg on Windows)
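Before going further, a quick preflight script can confirm the basics from the list above. This is a sketch using only the standard library: it checks the Python version and whether `ffmpeg` and `nvidia-smi` are on PATH. It does not verify the CUDA, cuBLAS, or cuDNN versions, and the `preflight` name is illustrative.

```python
import shutil
import sys


def preflight() -> dict:
    """Report which prerequisites are visible from this interpreter."""
    return {
        "python_3_9_plus": sys.version_info >= (3, 9),
        # ffmpeg only counts if the executable is reachable on PATH.
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        # nvidia-smi ships with the NVIDIA driver; its absence suggests CPU-only mode.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }


for check, ok in preflight().items():
    print(f"{check}: {'ok' if ok else 'missing'}")
```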

Step 1: Create a Virtual Environment

python -m venv whisper-env
# Linux/macOS
source whisper-env/bin/activate
# Windows
whisper-env\Scripts\activate

Step 2: Install Dependencies

pip install faster-whisper fastapi uvicorn python-multipart

Verify GPU access after installation:


python
from faster_whisper import WhisperModel
model = WhisperModel("large-v3", d
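A lighter-weight way to confirm GPU visibility, before downloading any weights, is to ask CTranslate2 (the engine underneath faster-whisper) how many CUDA devices it can see. The sketch below is one possible check under stated assumptions: `pick_device` and `load_model` are hypothetical helper names, and pairing float16 with GPU and int8 with CPU is a choice based on the VRAM table earlier.

```python
def pick_device() -> str:
    """Return "cuda" if CTranslate2 can see a CUDA device, otherwise "cpu"."""
    try:
        import ctranslate2  # installed as a dependency of faster-whisper
    except ImportError:
        return "cpu"
    return "cuda" if ctranslate2.get_cuda_device_count() > 0 else "cpu"


def load_model(model_size: str = "large-v3"):
    """Load the model on the detected device (downloads weights on first use)."""
    from faster_whisper import WhisperModel

    device = pick_device()
    # Assumption: float16 on GPU and int8 on CPU, following the VRAM table earlier.
    compute_type = "float16" if device == "cuda" else "int8"
    return WhisperModel(model_size, device=device, compute_type=compute_type)


print("Detected device:", pick_device())
```

If this prints `cpu` on a machine with an NVIDIA card, the CUDA 12, cuBLAS, and cuDNN 9 dependency chain is the first place to look.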
