This article was originally published on aifoss.dev
---
title: 'faster-whisper vs Whisper.cpp vs WhisperX: 2026 Shootout'
description: 'Three ways to run Whisper locally in 2026: faster-whisper for Python pipelines, Whisper.cpp for zero-dependency CPU/GPU use, WhisperX for word-level timestamps.'
pubDate: 'May 20 2026'
tags: ["whisper", "ai", "speechtotext", "opensource", "python"]
---
OpenAI's Whisper changed what was possible with local speech-to-text, but the reference implementation is slow enough to be impractical for most production use. Three open-source projects fixed that problem in completely different ways, and choosing the wrong one costs you portability, speed, or the features you actually need.
Versions covered: faster-whisper v1.2.1 (October 31, 2025), Whisper.cpp v1.8.4 (March 19, 2025), WhisperX v3.8.5 (April 1, 2025).
The quick answer
| Situation | Best choice |
|---|---|
| Python transcription pipeline on NVIDIA GPU | faster-whisper |
| macOS, iOS, Android, or Windows without Python | Whisper.cpp |
| Word-level timestamps for subtitles or search | WhisperX |
| Speaker diarization (who said what) | WhisperX |
| Raspberry Pi, mobile, or browser via WebAssembly | Whisper.cpp |
| Apple Silicon laptop, Metal or Core ML acceleration | Whisper.cpp |
| Batched high-throughput audio processing | faster-whisper |
| Embedding transcription in a C++ application | Whisper.cpp |
| Production audio pipeline, Python data stack | faster-whisper or WhisperX |
| Transcribe a file on macOS right now | Whisper.cpp |
WhisperX wraps faster-whisper, so it inherits most of its performance characteristics. The real decision is: (a) faster-whisper alone for raw throughput, (b) WhisperX when you need timestamps or speaker labels, or (c) Whisper.cpp when Python isn't available or you need a platform the others don't support.
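
The three-way decision above can be written down as a small function. This is only a sketch of the rule as stated in this article; the function name, the parameter names, and the order in which the conditions are checked are assumptions made for illustration.

```python
def choose_tool(python_available: bool = True,
                unsupported_platform: bool = False,
                needs_word_timestamps: bool = False,
                needs_speaker_labels: bool = False) -> str:
    """Return the tool suggested by the (a)/(b)/(c) rule in this article.

    `unsupported_platform` is a hypothetical flag meaning "a target the
    Python tools do not cover", e.g. iOS, Android, or WebAssembly.
    """
    # (c) No Python, or a platform the Python tools do not support.
    # Checked first on the assumption that platform limits are hard constraints.
    if not python_available or unsupported_platform:
        return "Whisper.cpp"
    # (b) Word-level timestamps or speaker labels are needed.
    if needs_word_timestamps or needs_speaker_labels:
        return "WhisperX"
    # (a) Otherwise, raw throughput in a Python pipeline.
    return "faster-whisper"


print(choose_tool(needs_speaker_labels=True))
```

Checking the platform first means a request for speaker labels on a platform without Python still returns Whisper.cpp, which signals that the feature would have to be added by other means.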
What each tool actually is
faster-whisper (SYSTRAN/faster-whisper, MIT license) reimplements OpenAI's Whisper using CTranslate2 — a C++ inference engine for transformer models that runs computations in INT8 or FP16 instead of full FP32. The result is up to 4× faster inference with equivalent accuracy and meaningfully lower VRAM usage. It's a Python library, installs via pip, and requires NVIDIA CUDA 12 for GPU acceleration. v1.2.1 added Silero-VAD V6 for improved voice activity detection and fixed a batched-inference bug where <|nocaptions|> tokens were incorrectly generated, causing hallucinated text on borderline audio segments.
Whisper.cpp (ggml-org/whisper.cpp, MIT license) is a C/C++ port built on the ggml tensor library — the same runtime behind llama.cpp. It compiles to a standalone binary with no Python runtime required. The supported hardware list is the widest of any Whisper implementation: NVIDIA CUDA, Apple Metal and Core ML (including the Neural Engine), AMD Vulkan, Intel OpenVINO, WebAssembly, Raspberry Pi, iOS, and Android. It allocates zero memory at runtime after model load. v1.8.4 is a maintenance release incorporating ggml performance improvements across all supported backends.
WhisperX (m-bain/whisperX, BSD-2-Clause license) is a Python layer on top of faster-whisper that adds three capabilities the base implementation lacks: voice activity detection preprocessing (via Silero-VAD, to avoid transcribing silence), word-level forced alignment using wav2vec2 models (reducing timestamp drift from ~1 second to under 100ms), and speaker diarization using pyannote.audio. The project claims 70× realtime transcription speed using batched inference on large-v2 with GPU. The practical result is that WhisperX is slower than bare faster-whisper per audio minute — the alignment pass costs time — but it produces output that actually tells you when each word was spoken and who said it.
The dependency chain matters: WhisperX calls faster-whisper under the hood. Whisper.cpp is a separate codebase with no shared code.
Hardware and system requirements
| | faster-whisper v1.2.1 | Whisper.cpp v1.8.4 | WhisperX v3.8.5 |
|---|---|---|---|
| Language/runtime | Python 3.9+ | C/C++ binary | Python 3.9+ |
| License | MIT | MIT | BSD-2-Clause |
| GPU required? | No (CPU fallback) | No (CPU fallback) | No (CPU fallback) |
| NVIDIA CUDA | CUDA 12 (cuBLAS, cuDNN 9) | Yes | CUDA 12.8 |
| Apple Silicon (Metal) | No | Yes | No |
| Apple Neural Engine (Core ML) | No | Yes | No |
| AMD GPU | No | Vulkan | No |
| Windows | Yes | Yes | Yes |
| iOS / Android | No | Yes | No |
| Raspberry Pi | No | Yes | No |
| WebAssembly | No | Yes | No |
| Word-level timestamps (forced alignment) | No | No | Yes |
| Speaker diarization | No | No | Yes |
VRAM usage and transcription time for a 13-minute audio clip, as benchmarked by SYSTRAN on an RTX 3070 Ti (8 GB); the CPU row was measured on an i7-12700K:
| Configuration | VRAM | Transcription time |
|---|---|---|
| large-v3 FP16 (standard) | ~4.5 GB | ~1m03s |
| large-v3 INT8 (quantized) | ~2.9 GB | ~59s |
| large-v3 FP16 batched (batch=8) | ~4.5 GB | ~17s |
| small INT8 on CPU (i7-12700K) | n/a | ~1m42s |
Whisper.cpp approximate memory footprint by model:
| Model | Memory |
|---|---|
| tiny | ~273 MB |
| base | ~388 MB |
| small | ~852 MB |
| medium | ~2.1 GB |
| large-v2/v3/v3-turbo | ~3.9 GB |
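
One way to use the figures in the table above is to pick the largest model that fits a memory budget. The sketch below copies the table values in MB; the function name, the `large-v3` key standing in for the whole large family, and the 500 MB headroom default are assumptions for illustration, not part of Whisper.cpp.

```python
# Approximate figures copied from the table above, in MB.
# "large-v3" stands in for the large-v2/v3/v3-turbo row.
MODEL_MEMORY_MB = {
    "tiny": 273,
    "base": 388,
    "small": 852,
    "medium": 2100,
    "large-v3": 3900,
}


def largest_model_that_fits(available_mb, headroom_mb=500):
    """Return the biggest model whose footprint fits, or None if none does.

    `headroom_mb` is a hypothetical safety margin left for the OS and
    the rest of the application.
    """
    budget = available_mb - headroom_mb
    fitting = [(mem, name) for name, mem in MODEL_MEMORY_MB.items() if mem <= budget]
    if not fitting:
        return None
    # max() compares the memory figure first, so it returns the largest fit.
    return max(fitting)[1]


print(largest_model_that_fits(2000))
```

On a device with about 2 GB free, this sketch settles on `small`, since `medium` would exceed the budget once headroom is subtracted.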
On an M2 Pro with Whisper.cpp and Metal acceleration, a 60-second clip processes in roughly 6 seconds using large-v3-turbo — approximately 10× realtime. Enable Core ML to run the encoder on the Apple Neural Engine and you gain an additional ~3× speedup over Metal-only. For Apple Silicon users, Whisper.cpp is the fastest local transcription option available in 2026.
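The "10× realtime" figure is simply audio duration divided by processing time. A minimal sketch of that arithmetic follows; the function names are made up for this example, and the projection to longer files assumes throughput scales linearly with duration, which may not hold in practice.

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Audio duration divided by wall-clock processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def estimated_processing_seconds(audio_seconds: float, factor: float) -> float:
    """Rough projection for a longer file, assuming linear scaling."""
    return audio_seconds / factor


# The M2 Pro example from the text: 60 s of audio in about 6 s.
factor = realtime_factor(60, 6)
print(factor)
# Projected time for a one-hour recording at the same factor.
print(estimated_processing_seconds(3600, factor))
```

At that rate, an hour of audio would take roughly six minutes, before any model load time is counted.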
WhisperX requires under 8 GB VRAM for large-v2 with beam_size=5, consistent with the faster-whisper numbers since it uses the same backend. The additional pyannote diarization model adds modest overhead on top.
For testing GPU-heavy transcription workloads before committing to hardware, RunPod rents A100 and H100 instances by the hour. For guidance on selecting a GPU for local AI work, runaihome.com covers hardware tradeoffs in depth.
Installation
faster-whisper
```bash
pip install faster-whisper
```
CUDA 12 with cuBLAS and cuDNN 9 is required for GPU acceleration. If you're on CUDA 11, downgrade ctranslate2 to version 3.24.0.
Basic usage:
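```python
from faster_whisper import WhisperModel

# Load large-v3 on the GPU with INT8 weights to reduce VRAM use.
model = WhisperModel("large-v3", device="cuda", compute_type="int8")

# transcribe() returns a lazy generator; decoding runs as you iterate.
segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```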
For batched inference — significantly faster on long files or when processing many files:
```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# Wrap the model so audio chunks are decoded in batches.
pipeline = BatchedInferencePipeline(model=model)
segments, info = pipeline.transcribe("audio.mp3", batch_size=16)
```
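The segments in the examples above expose `start`, `end`, and `text`, which is enough to write segment-level subtitles. The sketch below turns such objects into SRT text; the helper names and the `FakeSegment` stand-in are assumptions for illustration, and nothing here is part of the faster-whisper API.

```python
from collections import namedtuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from objects that have .start, .end, and .text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    # A blank line separates consecutive subtitle entries.
    return "\n".join(blocks)


# Stand-in for transcribe() output; only the three attributes used are modeled.
FakeSegment = namedtuple("FakeSegment", "start end text")
demo = [
    FakeSegment(0.0, 2.5, " First sentence."),
    FakeSegment(2.5, 4.04, " Second sentence."),
]
print(segments_to_srt(demo))
```

In a real pipeline, the `segments` generator returned by `transcribe` could be passed in place of `demo`. These timestamps are per segment, not per word.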
Whisper.cpp
Build from source:
```bash
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build
cmake --build build -j --config Release

# Download the model
bash ./models/download-ggml-model.sh large-v3-turbo

# Transcribe a file
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav
```
On macOS with Metal acceleration:
```bash
cmake -B build -DWHISPER_METAL=1
cmake --build build -j --config Release
```
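For Core ML, which runs the encoder on the Apple Neural Engine (~3× faster than Metal alone on M-series), build with the Core ML flag. The encoder must also be converted to Core ML format first, which the repository's models/generate-coreml-model.sh script handles:
```bash
cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release
```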
The binary accepts wav input directly. For mp3/m4a/other formats, ffmpeg handles conversion: ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav.
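When the conversion is part of a Python script, the same ffmpeg invocation can be built as an argument list and run with `subprocess`. This is a sketch that assumes `ffmpeg` is on the PATH; the function names and the `_16k.wav` output naming are choices made for this example.

```python
import subprocess
from pathlib import Path


def output_path_for(src: str) -> str:
    """Derive an output name next to the source, e.g. talk.mp3 -> talk_16k.wav."""
    src_path = Path(src)
    return str(src_path.with_name(src_path.stem + "_16k.wav"))


def build_ffmpeg_command(src: str, dst: str) -> list[str]:
    """Same flags as the command in the text: 16 kHz sample rate, one channel."""
    return ["ffmpeg", "-i", src, "-ar", "16000", "-ac", "1", dst]


def convert_to_wav(src: str) -> str:
    """Run ffmpeg and return the path of the converted file."""
    dst = output_path_for(src)
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
    return dst


print(build_ffmpeg_command("input.mp3", "output.wav"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.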
WhisperX
```bash
pip install whisperx
```
CUDA 12.8 is required for GPU acceleration.
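A minimal sketch of the transcribe-then-align flow described earlier follows. It uses the calls shown in the WhisperX README (`load_model`, `load_audio`, `transcribe`, `load_align_model`, `align`). The wrapper function names are made up for this example, the shape of the aligned result is assumed from that README, and speaker diarization is left out.

```python
def transcribe_with_word_timestamps(audio_path, device="cuda",
                                    batch_size=16, compute_type="float16"):
    """Transcribe, then force-align for per-word timings (needs whisperx)."""
    import whisperx  # imported lazily so timed_words() works without it

    model = whisperx.load_model("large-v2", device, compute_type=compute_type)
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=batch_size)

    # The alignment model is chosen from the detected language.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    return whisperx.align(result["segments"], align_model, metadata, audio, device)


def timed_words(aligned_result):
    """Flatten an aligned result into (word, start, end) tuples.

    Assumes each segment carries a "words" list of dicts. Words without
    timings are skipped rather than given guessed values.
    """
    words = []
    for segment in aligned_result["segments"]:
        for w in segment.get("words", []):
            if "start" in w and "end" in w:
                words.append((w["word"], w["start"], w["end"]))
    return words


# Hand-written stand-in for an aligned result, for illustration only.
sample = {"segments": [{"words": [
    {"word": "hello", "start": 0.10, "end": 0.42},
    {"word": "2026"},  # no timing available for this word
    {"word": "world", "start": 0.55, "end": 0.90},
]}]}
print(timed_words(sample))
```

Skipping untimed words keeps downstream subtitle or search code from treating missing values as zero.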