Whisper.cpp is a C/C++ port of OpenAI's Whisper speech recognition model. It runs entirely on CPU (no GPU needed), supports 99 languages, and includes a built-in HTTP server with an OpenAI-compatible API.
Free, open source, blazing fast on Apple Silicon and modern CPUs.
## Why Use Whisper.cpp?
- No GPU needed — optimized for CPU, especially Apple Silicon (M1/M2/M3)
- OpenAI-compatible API — same endpoint format as OpenAI's Whisper API
- 99 languages — automatic language detection
- Real-time — process audio faster than real-time on modern hardware
- Tiny models — from 39MB (tiny) to 1.5GB (large), run on any machine
## Quick Setup
### 1. Build from Source

```bash
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make

# Download a model
bash models/download-ggml-model.sh base.en
# Available: tiny, base, small, medium, large-v3
```
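Before wiring up transcription, it can help to check that the model file actually landed where the download script writes it. This is a small sketch of mine (the `model_path` helper is not part of whisper.cpp); it assumes the `models/ggml-<name>.bin` layout shown above:

```python
import os

# Hypothetical helper: map a model name to the path the download
# script writes to (models/ggml-<name>.bin inside the repo checkout).
def model_path(name: str, models_dir: str = "models") -> str:
    return os.path.join(models_dir, f"ggml-{name}.bin")

# Warn early instead of failing mid-transcription.
path = model_path("base.en")
if not os.path.isfile(path):
    print(f"Model missing: {path} -- run the download script first")
```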
### 2. Transcribe Audio (CLI)

```bash
# Transcribe a WAV file
./main -m models/ggml-base.en.bin -f audio.wav

# With timestamps
./main -m models/ggml-base.en.bin -f audio.wav --output-srt

# Auto-detect language
./main -m models/ggml-large-v3.bin -f audio.wav -l auto
```
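The CLI can also be driven from Python via `subprocess`. This is an illustrative sketch, not part of the project; it assumes the `./main` binary and the flags shown in the examples above:

```python
import subprocess

# Build the argument list for ./main; flags mirror the CLI examples above.
def build_cmd(model, audio, srt=False, lang=None):
    cmd = ["./main", "-m", model, "-f", audio]
    if srt:
        cmd.append("--output-srt")   # write an .srt file next to the audio
    if lang:
        cmd += ["-l", lang]          # e.g. "auto" for language detection
    return cmd

cmd = build_cmd("models/ggml-base.en.bin", "audio.wav", srt=True)
# subprocess.run(cmd, check=True)  # uncomment to actually transcribe
print(" ".join(cmd))
```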
### 3. Start HTTP Server

```bash
# Build server
make server

# Run server
./server -m models/ggml-base.en.bin --host 0.0.0.0 --port 8080
```
### 4. Transcribe via API

```bash
# OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/audio/transcriptions \
  -F file=@meeting.wav \
  -F model=whisper-1 \
  -F response_format=json | jq '.text'

# With language hint
curl -s http://localhost:8080/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F language=en \
  -F response_format=verbose_json | jq '{text: .text, language: .language, segments: [.segments[] | {start: .start, end: .end, text: .text}]}'

# Get timestamps (SRT format)
curl -s http://localhost:8080/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F response_format=srt
```
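The same request can be made from Python with only the standard library; this sketch hand-builds the multipart body that curl's `-F` flags produce. The field names match the examples above, but the helper itself is illustrative:

```python
import io
import urllib.request
import uuid

def multipart_body(file_name, file_bytes, fields):
    """Build a multipart/form-data body like the curl example above."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\nContent-Disposition: form-data; "
                  f"name=\"{name}\"\r\n\r\n{value}\r\n".encode())
    buf.write(f"--{boundary}\r\nContent-Disposition: form-data; "
              f"name=\"file\"; filename=\"{file_name}\"\r\n"
              f"Content-Type: audio/wav\r\n\r\n".encode())
    buf.write(file_bytes)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return boundary, buf.getvalue()

boundary, body = multipart_body(
    "meeting.wav", b"...", {"model": "whisper-1", "response_format": "json"})
req = urllib.request.Request(
    "http://localhost:8080/v1/audio/transcriptions", data=body,
    headers={"Content-Type": f"multipart/form-data; boundary={boundary}"})
# urllib.request.urlopen(req)  # requires a running server
```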
### 5. Translation

```bash
# Translate any language to English
curl -s http://localhost:8080/v1/audio/translations \
  -F file=@russian_audio.wav \
  -F model=whisper-1 | jq '.text'
```
## Python Example

```python
from openai import OpenAI

# Works with the OpenAI SDK!
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Transcribe
with open("meeting.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json"
    )

print(f"Text: {transcript.text}")
for segment in transcript.segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")

# Translate to English
with open("foreign_audio.wav", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-1", file=f)
print(f"Translation: {translation.text}")
```
## Model Sizes
| Model | Size | RAM | Speed | Quality |
|---|---|---|---|---|
| tiny | 39MB | ~390MB | Fastest | Basic |
| base | 74MB | ~500MB | Fast | Good |
| small | 244MB | ~1GB | Medium | Better |
| medium | 769MB | ~2.5GB | Slow | Great |
| large-v3 | 1.5GB | ~4GB | Slowest | Best |
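The table suggests a simple heuristic for choosing a model by available memory. The helper below and its thresholds are my own sketch based on the RAM column above, not anything shipped with whisper.cpp:

```python
# Approximate peak RAM per model, from the table above (in MB).
MODEL_RAM_MB = {
    "tiny": 390, "base": 500, "small": 1000,
    "medium": 2500, "large-v3": 4000,
}

def pick_model(available_mb: int) -> str:
    """Return the highest-quality model that fits in available_mb."""
    best = "tiny"
    for name, ram in MODEL_RAM_MB.items():
        if ram <= available_mb:
            best = name  # dict order runs tiny -> large-v3
    return best

print(pick_model(3000))  # medium fits in 3 GB; large-v3 (~4 GB) does not
```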
## Key Endpoints
| Endpoint | Description |
|---|---|
| /v1/audio/transcriptions | Transcribe audio |
| /v1/audio/translations | Translate to English |
| /v1/models | List models |
| /health | Health check |
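A quick liveness probe before uploading audio can save a confusing multipart error. This sketch assumes the server from step 3 and the `/health` route listed above:

```python
import urllib.error
import urllib.request

def server_healthy(base="http://localhost:8080", timeout=2.0) -> bool:
    """Return True if the whisper.cpp server answers /health with 200."""
    try:
        with urllib.request.urlopen(f"{base}/health", timeout=timeout) as r:
            return r.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(server_healthy())  # False unless the server is running
```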
## Performance Tips

- Apple Silicon: build with Metal acceleration (`make WHISPER_METAL=1`)
- x86: enable AVX2/AVX-512 for best performance
- Large files: split into chunks and process them in parallel
- Real-time: use the `stream` binary for live microphone input
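The "split into chunks" tip reduces to arithmetic over time offsets. The chunk length and overlap values below are illustrative; a small overlap keeps words on a boundary from being cut in half:

```python
def chunk_ranges(total_s: float, chunk_s: float = 30.0, overlap_s: float = 1.0):
    """Yield (start, end) ranges in seconds covering total_s,
    with a small overlap so boundary words are not truncated."""
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        yield (start, end)
        if end >= total_s:
            break
        start = end - overlap_s  # back up slightly for the overlap

print(list(chunk_ranges(65.0)))
# Each range can then be cut out with ffmpeg and sent to the server in parallel.
```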