Jovan Chan

Posted on • Originally published at aifoss.dev

faster-whisper vs Whisper.cpp vs WhisperX: 2026 Shootout

This article was originally published on aifoss.dev on May 20, 2026.

Tags: whisper, ai, speechtotext, opensource, python

Three ways to run Whisper locally in 2026: faster-whisper for Python pipelines, Whisper.cpp for zero-dependency CPU/GPU use, WhisperX for word-level timestamps.

OpenAI's Whisper changed what was possible with local speech-to-text, but the reference implementation is slow enough to be impractical for most production use. Three open-source projects fixed that problem in completely different ways, and choosing the wrong one costs you portability, speed, or the features you actually need.

Versions covered: faster-whisper v1.2.1 (October 31, 2025), Whisper.cpp v1.8.4 (March 19, 2025), WhisperX v3.8.5 (April 1, 2025).


The quick answer

| Situation | Best choice |
| --- | --- |
| Python transcription pipeline on NVIDIA GPU | faster-whisper |
| macOS, iOS, Android, or Windows without Python | Whisper.cpp |
| Word-level timestamps for subtitles or search | WhisperX |
| Speaker diarization (who said what) | WhisperX |
| Raspberry Pi, mobile, or browser via WebAssembly | Whisper.cpp |
| Apple Silicon laptop, Metal or Core ML acceleration | Whisper.cpp |
| Batched high-throughput audio processing | faster-whisper |
| Embedding transcription in a C++ application | Whisper.cpp |
| Production audio pipeline, Python data stack | faster-whisper or WhisperX |
| Transcribe a file on macOS right now | Whisper.cpp |

WhisperX wraps faster-whisper, so it inherits most of its performance characteristics. The real decision is: (a) faster-whisper alone for raw throughput, (b) WhisperX when you need timestamps or speaker labels, or (c) Whisper.cpp when Python isn't available or you need a platform the others don't support.
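The three-way rule above can be written down as a small function. This is only a sketch of the reasoning in this section: `choose_tool` and its parameter names are made up for illustration and are not part of any of the three projects.

```python
def choose_tool(python_available: bool,
                needs_word_timestamps: bool = False,
                needs_diarization: bool = False,
                needs_whisper_cpp_only_platform: bool = False) -> str:
    """Return the tool suggested by the (a)/(b)/(c) rule in this section."""
    # (c) No Python, or a target only Whisper.cpp covers
    # (Metal, Core ML, mobile, WebAssembly, Raspberry Pi).
    if not python_available or needs_whisper_cpp_only_platform:
        return "Whisper.cpp"
    # (b) Forced alignment and speaker labels come from the WhisperX layer.
    if needs_word_timestamps or needs_diarization:
        return "WhisperX"
    # (a) Otherwise optimise for raw throughput.
    return "faster-whisper"


print(choose_tool(python_available=True))
print(choose_tool(python_available=True, needs_diarization=True))
print(choose_tool(python_available=False, needs_word_timestamps=True))
```

Note the ordering: platform constraints are checked first, because WhisperX is not an option at all when Python is unavailable, whatever features you want.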


What each tool actually is

faster-whisper (SYSTRAN/faster-whisper, MIT license) reimplements OpenAI's Whisper using CTranslate2 — a C++ inference engine for transformer models that runs computations in INT8 or FP16 instead of full FP32. The result is up to 4× faster inference with equivalent accuracy and meaningfully lower VRAM usage. It's a Python library, installs via pip, and requires NVIDIA CUDA 12 for GPU acceleration. v1.2.1 added Silero-VAD V6 for improved voice activity detection and fixed a batched-inference bug where <|nocaptions|> tokens were incorrectly generated, causing hallucinated text on borderline audio segments.

Whisper.cpp (ggml-org/whisper.cpp, MIT license) is a C/C++ port built on the ggml tensor library — the same runtime behind llama.cpp. It compiles to a standalone binary with no Python runtime required. The supported hardware list is the widest of any Whisper implementation: NVIDIA CUDA, Apple Metal and Core ML (including the Neural Engine), AMD Vulkan, Intel OpenVINO, WebAssembly, Raspberry Pi, iOS, and Android. It allocates zero memory at runtime after model load. v1.8.4 is a maintenance release incorporating ggml performance improvements across all supported backends.

WhisperX (m-bain/whisperX, BSD-2-Clause license) is a Python layer on top of faster-whisper that adds three stages around the base transcription: voice activity detection preprocessing (via Silero-VAD, to avoid transcribing silence), word-level forced alignment using wav2vec2 models (reducing timestamp drift from ~1 second to under 100 ms), and speaker diarization using pyannote.audio. The project claims 70× realtime transcription speed using batched inference on large-v2 with GPU. In practice, a full WhisperX run is slower than bare faster-whisper per audio minute because the alignment pass costs time, but its output tells you when each word was spoken and who said it.

The dependency chain matters: WhisperX calls faster-whisper under the hood. Whisper.cpp is a separate codebase with no shared code.


Hardware and system requirements

| | faster-whisper v1.2.1 | Whisper.cpp v1.8.4 | WhisperX v3.8.5 |
| --- | --- | --- | --- |
| Language/runtime | Python 3.9+ | C/C++ binary | Python 3.9+ |
| License | MIT | MIT | BSD-2-Clause |
| GPU required? | No (CPU fallback) | No (CPU fallback) | No (CPU fallback) |
| NVIDIA CUDA | CUDA 12 (cuBLAS, cuDNN 9) | Yes | CUDA 12.8 |
| Apple Silicon (Metal) | No | Yes | No |
| Apple Neural Engine (Core ML) | No | Yes | No |
| AMD GPU | No | Vulkan | No |
| Windows | Yes | Yes | Yes |
| iOS / Android | No | Yes | No |
| Raspberry Pi | No | Yes | No |
| WebAssembly | No | Yes | No |
| Word-level timestamps | No | No | Yes |
| Speaker diarization | No | No | Yes |

VRAM usage for a 13-minute audio clip benchmarked by SYSTRAN on an RTX 3070 Ti (8 GB):

| Configuration | VRAM | Transcription time |
| --- | --- | --- |
| large-v3 FP16 (standard) | ~4.5 GB | ~1m03s |
| large-v3 INT8 (quantized) | ~2.9 GB | ~59s |
| large-v3 FP16 batched (batch=8) | ~4.5 GB | ~17s |
| small model, CPU INT8 (i7-12700K) | n/a | ~1m42s |

Whisper.cpp approximate memory usage by model (the model files on disk are smaller):

| Model | Memory |
| --- | --- |
| tiny | ~273 MB |
| base | ~388 MB |
| small | ~852 MB |
| medium | ~2.1 GB |
| large-v2/v3/v3-turbo | ~3.9 GB |
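To turn the memory figures in the table above into a choice, here is a rough sketch that picks the largest model fitting a given budget. The helper name `largest_model_that_fits` is hypothetical, the numbers are the approximate values from the table (with 1 GB taken as 1000 MB), and real usage will vary with backend and audio length.

```python
from typing import Optional

# Approximate memory per model in MB, copied from the table above.
# Ordered from smallest to largest.
WHISPER_CPP_MEMORY_MB = {
    "tiny": 273,
    "base": 388,
    "small": 852,
    "medium": 2100,
    "large": 3900,
}


def largest_model_that_fits(budget_mb: int) -> Optional[str]:
    """Return the largest model whose approximate footprint fits the budget."""
    best = None
    for name, needed_mb in WHISPER_CPP_MEMORY_MB.items():
        if needed_mb <= budget_mb:
            best = name  # later entries are larger, so keep overwriting
    return best


print(largest_model_that_fits(1024))  # e.g. a 1 GB budget on a small board
print(largest_model_that_fits(4096))
```

Treat the result as an upper bound: leave headroom for the OS and for anything else running on the device.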

On an M2 Pro with Whisper.cpp and Metal acceleration, a 60-second clip processes in roughly 6 seconds using large-v3-turbo — approximately 10× realtime. Enable Core ML to run the encoder on the Apple Neural Engine and you gain an additional ~3× speedup over Metal-only. For Apple Silicon users, Whisper.cpp is the fastest local transcription option available in 2026.
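"Realtime factor" here is just audio duration divided by processing time. As a quick sanity check, the snippet below applies that arithmetic to the approximate SYSTRAN timings for the 13-minute clip and to the M2 Pro figure quoted above. The function name is illustrative, and since the inputs are rounded the outputs are ballpark values.

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_seconds / processing_seconds


CLIP_SECONDS = 13 * 60  # the 13-minute benchmark clip

# Approximate GPU timings from the RTX 3070 Ti table, in seconds.
gpu_timings = {
    "large-v3 FP16": 63,
    "large-v3 INT8": 59,
    "large-v3 FP16 batched (batch=8)": 17,
}

for name, seconds in gpu_timings.items():
    print(f"{name}: {realtime_factor(CLIP_SECONDS, seconds):.1f}x realtime")

# M2 Pro, Whisper.cpp with Metal: 60-second clip in roughly 6 seconds.
print(f"M2 Pro large-v3-turbo: {realtime_factor(60, 6):.1f}x realtime")
```

The figures come from different clips and different models, so they show scale and are not a like-for-like ranking.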

WhisperX requires under 8 GB VRAM for large-v2 with beam_size=5, consistent with the faster-whisper numbers since it uses the same backend. The additional pyannote diarization model adds modest overhead on top.

For testing GPU-heavy transcription workloads before committing to hardware, RunPod rents A100 and H100 instances by the hour. For guidance on selecting a GPU for local AI work, runaihome.com covers hardware tradeoffs in depth.


Installation

faster-whisper

pip install faster-whisper

CUDA 12 with cuBLAS and cuDNN 9 is required for GPU acceleration. If you're on CUDA 11, downgrade ctranslate2 to version 3.24.0.

Basic usage:

from faster_whisper import WhisperModel

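# Load large-v3 on the GPU with INT8 weights to reduce VRAM use
model = WhisperModel("large-v3", device="cuda", compute_type="int8")
# transcribe() returns a lazy generator: decoding runs as you iterate
segments, info = model.transcribe("audio.mp3", beam_size=5)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")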

For batched inference — significantly faster on long files or when processing many files:

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
pipeline = BatchedInferencePipeline(model=model)
segments, info = pipeline.transcribe("audio.mp3", batch_size=16)
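Since faster-whisper falls back to CPU when no GPU is present, a script can pick the device and precision at startup. The sketch below assumes FP16 on CUDA and INT8 on CPU, matching the configurations benchmarked earlier. `pick_runtime` and `transcribe_file` are illustrative names, and the heavy imports are kept inside the function so the helper can be used without the packages installed.

```python
from typing import List, Tuple


def pick_runtime(cuda_devices: int) -> Tuple[str, str]:
    """Map a CUDA device count to a (device, compute_type) pair."""
    return ("cuda", "float16") if cuda_devices > 0 else ("cpu", "int8")


def transcribe_file(path: str, model_size: str = "large-v3") -> List[Tuple[float, float, str]]:
    """Transcribe one file and return (start, end, text) tuples."""
    # Imported lazily: both packages are installed by `pip install faster-whisper`.
    import ctranslate2
    from faster_whisper import WhisperModel

    device, compute_type = pick_runtime(ctranslate2.get_cuda_device_count())
    model = WhisperModel(model_size, device=device, compute_type=compute_type)
    # vad_filter=True skips silent stretches using the bundled Silero VAD.
    segments, _info = model.transcribe(path, beam_size=5, vad_filter=True)
    # Consuming the generator is what actually runs the decoding.
    return [(s.start, s.end, s.text) for s in segments]
```

Calling `transcribe_file("audio.mp3")` then behaves the same on a GPU box and a CPU-only laptop, only slower on the latter.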

Whisper.cpp

Build from source, the setup path covered here:

git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build
cmake --build build -j --config Release

# Download the model
bash ./models/download-ggml-model.sh large-v3-turbo

# Transcribe a file
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav

On macOS with Metal acceleration:

cmake -B build -DGGML_METAL=1
cmake --build build -j --config Release

For Core ML (Apple Neural Engine — runs the encoder ~3× faster than Metal alone on M-series):

cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release

The binary reads WAV input directly. For MP3, M4A, and other formats, convert to 16 kHz mono WAV with ffmpeg first: ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav.
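If you want to drive the convert-then-transcribe steps from a script, a thin wrapper is enough. This is a sketch that assumes `ffmpeg` is on your PATH and that the binary and model sit at the paths used in the build steps above. The function names are made up for illustration, and only the flags already shown in this article are used.

```python
import subprocess
from pathlib import Path
from typing import List


def ffmpeg_command(src: str, dst: str) -> List[str]:
    """Build the ffmpeg call that resamples to 16 kHz mono WAV."""
    return ["ffmpeg", "-i", src, "-ar", "16000", "-ac", "1", dst]


def whisper_cli_command(model: str, wav: str,
                        binary: str = "./build/bin/whisper-cli") -> List[str]:
    """Build the whisper-cli call for one WAV file."""
    return [binary, "-m", model, "-f", wav]


def transcribe(src: str, model: str = "models/ggml-large-v3-turbo.bin") -> str:
    """Convert if needed, run whisper-cli, and return its standard output."""
    wav = src
    if Path(src).suffix.lower() != ".wav":
        # Assumption: existing .wav inputs are already 16 kHz mono.
        wav = str(Path(src).with_suffix(".wav"))
        subprocess.run(ffmpeg_command(src, wav), check=True)
    result = subprocess.run(whisper_cli_command(model, wav),
                            check=True, capture_output=True, text=True)
    return result.stdout
```

Keeping the command builders separate from the `subprocess` calls makes the wrapper easy to test without ffmpeg or a compiled binary present.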

WhisperX

pip install whisperx

CUDA 12.8 is required for GPU acceleration.
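Basic usage follows the transcribe-then-align flow described earlier. The sketch below is based on the Python usage shown in the WhisperX README. The model size, batch size, and the `iter_words` helper are illustrative choices, and speaker diarization is left out to keep it short.

```python
from typing import Iterator, Tuple


def transcribe_and_align(path: str, device: str = "cuda") -> dict:
    """Run batched transcription, then wav2vec2 forced alignment."""
    import whisperx  # imported lazily so iter_words works without it

    audio = whisperx.load_audio(path)
    model = whisperx.load_model("large-v2", device, compute_type="float16")
    result = model.transcribe(audio, batch_size=16)

    # The alignment model is chosen per detected language.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    return whisperx.align(result["segments"], align_model, metadata, audio, device)


def iter_words(aligned: dict) -> Iterator[Tuple[float, float, str]]:
    """Yield (start, end, word) for every word that received a timestamp."""
    for segment in aligned["segments"]:
        for word in segment.get("words", []):
            # Some tokens (for example bare numbers) may come back unaligned.
            if "start" in word and "end" in word:
                yield word["start"], word["end"], word["word"]
```

Feeding the output of `transcribe_and_align("audio.mp3")` into `iter_words` gives a flat list of timed words, which is the shape subtitle and search tooling usually wants.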
