Jovan Chan

Posted on Jun 2 • Originally published at aifoss.dev

faster-whisper-vs-whispercpp-vs-whisperx-2026

#opensource #ai #selfhosted #linux

---
title: 'faster-whisper vs Whisper.cpp vs WhisperX: 2026 Shootout'
description: 'Three ways to run Whisper locally in 2026: faster-whisper for Python pipelines, Whisper.cpp for zero-dependency CPU/GPU use, WhisperX for word-level timestamps.'
pubDate: 'May 20 2026'
tags: ["whisper", "ai", "speechtotext", "opensource", "python"]
---

OpenAI's Whisper changed what was possible with local speech-to-text, but the reference implementation is slow enough to be impractical for most production use. Three open-source projects fixed that problem in completely different ways, and choosing the wrong one costs you portability, speed, or the features you actually need.

Versions covered: faster-whisper v1.2.1 (October 31, 2025), Whisper.cpp v1.8.4 (March 19, 2026), WhisperX v3.8.5 (April 1, 2026).

The quick answer

Situation	Best choice
Python transcription pipeline on NVIDIA GPU	faster-whisper
macOS, iOS, Android, or Windows without Python	Whisper.cpp
Word-level timestamps for subtitles or search	WhisperX
Speaker diarization (who said what)	WhisperX
Raspberry Pi, mobile, or browser via WebAssembly	Whisper.cpp
Apple Silicon laptop, Metal or Core ML acceleration	Whisper.cpp
Batched high-throughput audio processing	faster-whisper
Embedding transcription in a C++ application	Whisper.cpp
Production audio pipeline, Python data stack	faster-whisper or WhisperX
Transcribe a file on macOS right now	Whisper.cpp

WhisperX wraps faster-whisper, so it inherits most of its performance characteristics. The real decision is: (a) faster-whisper alone for raw throughput, (b) WhisperX when you need timestamps or speaker labels, or (c) Whisper.cpp when Python isn't available or you need a platform the others don't support.
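
The three-way decision above can be written down as a small function. This is only a sketch of the rule of thumb in this section; the function name and its boolean parameters are illustrative and not part of any of the three projects.

```python
def choose_tool(python_stack_usable: bool,
                needs_word_timestamps: bool = False,
                needs_diarization: bool = False) -> str:
    """Encode the (a)/(b)/(c) rule of thumb; every name here is hypothetical.

    python_stack_usable: True when Python is available AND the target
    platform is one the Python tools support.
    """
    # (c) No usable Python stack (mobile, browser, embedded C++, ...).
    if not python_stack_usable:
        return "Whisper.cpp"
    # (b) Aligned word timestamps or speaker labels are needed.
    if needs_word_timestamps or needs_diarization:
        return "WhisperX"
    # (a) Otherwise optimise for raw throughput.
    return "faster-whisper"


print(choose_tool(python_stack_usable=True, needs_diarization=True))
```

The order of the checks reflects the text: platform constraints rule tools out first, and features only matter once a Python stack is on the table.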

What each tool actually is

faster-whisper (SYSTRAN/faster-whisper, MIT license) reimplements OpenAI's Whisper using CTranslate2 — a C++ inference engine for transformer models that runs computations in INT8 or FP16 instead of full FP32. The result is up to 4× faster inference with equivalent accuracy and meaningfully lower VRAM usage. It's a Python library, installs via pip, and requires NVIDIA CUDA 12 for GPU acceleration. v1.2.1 added Silero-VAD V6 for improved voice activity detection and fixed a batched-inference bug where <|nocaptions|> tokens were incorrectly generated, causing hallucinated text on borderline audio segments.

Whisper.cpp (ggml-org/whisper.cpp, MIT license) is a C/C++ port built on the ggml tensor library — the same runtime behind llama.cpp. It compiles to a standalone binary with no Python runtime required. The supported hardware list is the widest of any Whisper implementation: NVIDIA CUDA, Apple Metal and Core ML (including the Neural Engine), AMD Vulkan, Intel OpenVINO, WebAssembly, Raspberry Pi, iOS, and Android. It allocates zero memory at runtime after model load. v1.8.4 is a maintenance release incorporating ggml performance improvements across all supported backends.

WhisperX (m-bain/whisperX, BSD-2-Clause license) is a Python layer on top of faster-whisper that builds a three-stage pipeline around it: voice activity detection preprocessing (via Silero-VAD, to avoid transcribing silence), word-level forced alignment using wav2vec2 models (reducing timestamp drift from ~1 second to under 100 ms), and speaker diarization using pyannote.audio. The project claims 70× realtime transcription speed using batched inference on large-v2 with GPU. In practice a full WhisperX run is slower than bare faster-whisper per audio minute, because the alignment pass costs time, but its output tells you when each word was spoken and who said it.

The dependency chain matters: WhisperX calls faster-whisper under the hood. Whisper.cpp is a separate codebase with no shared code.

Hardware and system requirements

	faster-whisper v1.2.1	Whisper.cpp v1.8.4	WhisperX v3.8.5
Language/runtime	Python 3.9+	C/C++ binary	Python 3.9+
License	MIT	MIT	BSD-2-Clause
GPU required?	No (CPU fallback)	No (CPU fallback)	No (CPU fallback)
NVIDIA CUDA	CUDA 12 (cuBLAS, cuDNN 9)	Yes	CUDA 12.8
Apple Silicon (Metal)	No	Yes	No
Apple Neural Engine (Core ML)	No	Yes	No
AMD GPU	No	Vulkan	No
Windows	Yes	Yes	Yes
iOS / Android	No	Yes	No
Raspberry Pi	No	Yes	No
WebAssembly	No	Yes	No
Forced-alignment word timestamps	No	No	Yes
Speaker diarization	No	No	Yes

VRAM usage for a 13-minute audio clip benchmarked by SYSTRAN on an RTX 3070 Ti (8 GB):

Configuration	VRAM	Transcription time
large-v3 FP16 (standard)	~4.5 GB	~1m03s
large-v3 INT8 (quantized)	~2.9 GB	~59s
large-v3 FP16 batched (batch=8)	~4.5 GB	~17s
small INT8 on CPU (i7-12700K)	n/a	~1m42s
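
To compare these timings with the "N× realtime" figures quoted elsewhere in this article, divide the audio duration by the processing time. A minimal sketch using the GPU rows of the table above (the function and variable names are illustrative):

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of wall-clock time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


CLIP_SECONDS = 13 * 60  # the 13-minute benchmark clip

# Wall-clock times from the table above, converted to seconds.
timings = {
    "FP16": 63,          # ~1m03s
    "INT8": 59,          # ~59s
    "FP16 batched": 17,  # ~17s
}

factors = {name: realtime_factor(CLIP_SECONDS, t) for name, t in timings.items()}
for name, factor in factors.items():
    print(f"{name}: {factor:.1f}x realtime")
```

On these numbers the unbatched runs land at roughly 12× to 13× realtime and the batched run at roughly 46×, so batching is worth about 3.7× over unbatched FP16 on this card.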

Whisper.cpp approximate memory usage by model (the model files on disk are smaller than these figures):

Model	Memory
tiny	~273 MB
base	~388 MB
small	~852 MB
medium	~2.1 GB
large-v2/v3/v3-turbo	~3.9 GB
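
The table can be turned into a quick lookup for the largest model that fits a memory budget. A sketch, assuming the approximate figures above; the `headroom_mb` parameter is a hypothetical allowance for everything else the process needs.

```python
# Approximate footprint in MB, copied from the table above
# (2.1 GB and 3.9 GB rounded to 2100 MB and 3900 MB).
MODEL_MEMORY_MB = {
    "tiny": 273,
    "base": 388,
    "small": 852,
    "medium": 2100,
    "large": 3900,
}


def largest_model_that_fits(available_mb: float, headroom_mb: float = 0.0):
    """Return the biggest model whose footprint fits the budget, or None."""
    budget = available_mb - headroom_mb
    fitting = [(mem, name) for name, mem in MODEL_MEMORY_MB.items() if mem <= budget]
    # Tuples sort by memory first, so max() picks the largest fitting model.
    return max(fitting)[1] if fitting else None


print(largest_model_that_fits(1024))
print(largest_model_that_fits(4096, headroom_mb=512))
```

With a 1 GB budget this picks small; with 4 GB minus 512 MB of headroom it picks medium, because large no longer fits.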

On an M2 Pro with Whisper.cpp and Metal acceleration, a 60-second clip processes in roughly 6 seconds using large-v3-turbo — approximately 10× realtime. Enable Core ML to run the encoder on the Apple Neural Engine and you gain an additional ~3× speedup over Metal-only. For Apple Silicon users, Whisper.cpp is the fastest local transcription option available in 2026.

WhisperX requires under 8 GB VRAM for large-v2 with beam_size=5, consistent with the faster-whisper numbers since it uses the same backend. The additional pyannote diarization model adds modest overhead on top.

For testing GPU-heavy transcription workloads before committing to hardware, RunPod rents A100 and H100 instances by the hour. For guidance on selecting a GPU for local AI work, runaihome.com covers hardware tradeoffs in depth.

Installation

faster-whisper

pip install faster-whisper

CUDA 12 with cuBLAS and cuDNN 9 is required for GPU acceleration. If you're on CUDA 11, downgrade ctranslate2 to version 3.24.0.

Basic usage:

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="int8")
segments, info = model.transcribe("audio.mp3", beam_size=5)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
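
Since subtitles are a common target, here is a sketch of turning those segments into SRT text. It relies only on the `start`, `end`, and `text` attributes used in the loop above; the `Seg` class is a stand-in for demonstration, and `srt_timestamp` and `to_srt` are illustrative helpers, not part of faster-whisper.

```python
from typing import Iterable, NamedTuple


class Seg(NamedTuple):
    """Stand-in exposing the three fields read from each segment above."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT files."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable) -> str:
    """Build SRT text from objects that have .start, .end and .text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(blocks)


demo = to_srt([Seg(0.0, 2.5, " Hello there."), Seg(2.5, 4.0, " Goodbye.")])
print(demo)
```

With the real library you would call `to_srt(segments)` instead of building `Seg` objects. Note that `segments` is a generator, so it can only be consumed once: either print it or convert it, not both.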

For batched inference — significantly faster on long files or when processing many files:

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
pipeline = BatchedInferencePipeline(model=model)
segments, info = pipeline.transcribe("audio.mp3", batch_size=16)

Whisper.cpp

Build from source — the only setup path:

git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build
cmake --build build -j --config Release

# Download the model
bash ./models/download-ggml-model.sh large-v3-turbo

# Transcribe a file
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav

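On macOS, Metal acceleration is enabled by default in current builds. To request it explicitly, use the ggml flag (WHISPER_METAL is the older, deprecated spelling):

cmake -B build -DGGML_METAL=1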
cmake --build build -j --config Release

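For Core ML (Apple Neural Engine, which runs the encoder ~3× faster than Metal alone on M-series), first convert the encoder to a Core ML model with the repository's models/generate-coreml-model.sh script, then build with Core ML enabled: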

cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release

The binary accepts WAV input directly. For mp3, m4a, and other formats, ffmpeg handles conversion to 16 kHz mono WAV: ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav.
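
If the conversion is driven from a script, it helps to build the ffmpeg argument list in one place and to check the result before handing it to whisper-cli. A sketch using only the standard library; both helper names are illustrative.

```python
import wave


def ffmpeg_to_wav_cmd(src: str, dst: str) -> list[str]:
    """Argument list for the ffmpeg command shown above (16 kHz, mono)."""
    return ["ffmpeg", "-i", src, "-ar", "16000", "-ac", "1", dst]


def is_16khz_mono_wav(path: str) -> bool:
    """True if `path` is a readable PCM WAV file with one channel at 16 kHz."""
    try:
        with wave.open(path, "rb") as wav:
            return wav.getnchannels() == 1 and wav.getframerate() == 16000
    except (wave.Error, EOFError, OSError):
        # Not a WAV file, truncated, or missing.
        return False


print(" ".join(ffmpeg_to_wav_cmd("input.mp3", "output.wav")))
```

To run the conversion, pass the list to `subprocess.run(ffmpeg_to_wav_cmd("input.mp3", "output.wav"), check=True)`; passing a list rather than a shell string avoids quoting problems with file names that contain spaces.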

WhisperX

pip install whisperx

CUDA 12.8 is required for GPU acceleration; without a GPU, WhisperX falls back to CPU.
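
A sketch of the transcribe-then-align flow described earlier. The calls in `transcribe_and_align` (`load_model`, `load_audio`, `transcribe`, `load_align_model`, `align`) follow the project's README usage but can differ between versions, so treat them as assumptions and check them against the installed release. The shape of the result dictionary read by `word_lines` is likewise an assumption, and `word_lines` itself is an illustrative helper. Diarization is left out because its setup needs a Hugging Face token and varies by version.

```python
def transcribe_and_align(audio_path: str, device: str = "cuda") -> dict:
    """Transcribe, then force-align for per-word timestamps (needs whisperx)."""
    import whisperx  # imported lazily so word_lines() works without it

    model = whisperx.load_model("large-v2", device, compute_type="float16")
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=16)

    # The alignment model is chosen by the detected language.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    return whisperx.align(result["segments"], align_model, metadata, audio, device)


def word_lines(result: dict) -> list[str]:
    """Render 'start-end word' lines from an aligned result (assumed shape)."""
    lines = []
    for segment in result["segments"]:
        for word in segment.get("words", []):
            # Some tokens may come back without timings; skip them.
            if "start" not in word or "end" not in word:
                continue
            lines.append(f'{word["start"]:.2f}-{word["end"]:.2f} {word["word"]}')
    return lines


sample = {"segments": [{"words": [{"word": "hello", "start": 0.5, "end": 0.9},
                                  {"word": "42"}]}]}
print(word_lines(sample))
```

On a machine with WhisperX installed, `word_lines(transcribe_and_align("audio.mp3"))` would give one line per aligned word.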
