Whisper.cpp is a C/C++ port of OpenAI's Whisper speech recognition model. It runs entirely on CPU (no GPU needed), supports 99 languages, and includes a built-in HTTP server with an OpenAI-compatible API.
Free, open source, blazing fast on Apple Silicon and modern CPUs.
## Why Use Whisper.cpp?
- No GPU needed — optimized for CPU, especially Apple Silicon (M1/M2/M3)
- OpenAI-compatible API — same endpoint format as OpenAI's Whisper API
- 99 languages — automatic language detection
- Real-time — process audio faster than real-time on modern hardware
- Tiny models — from 39MB (tiny) to 1.5GB (large), run on any machine
## Quick Setup
### 1. Build from Source

```bash
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make

# Download a model
bash models/download-ggml-model.sh base.en
# Available: tiny, base, small, medium, large-v3
```
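Before wiring up transcription, it can help to check that the model file actually landed where the download script writes it. This is a small sketch of mine (the `model_path` helper is not part of whisper.cpp); it assumes the `models/ggml-<name>.bin` layout shown above:

```python
import os

# Hypothetical helper: map a model name to the path the download
# script writes to (models/ggml-<name>.bin inside the repo checkout).
def model_path(name: str, models_dir: str = "models") -> str:
    return os.path.join(models_dir, f"ggml-{name}.bin")

# Warn early instead of failing mid-transcription.
path = model_path("base.en")
if not os.path.isfile(path):
    print(f"Model missing: {path} -- run the download script first")
```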
### 2. Transcribe Audio (CLI)

```bash
# Transcribe a WAV file
./main -m models/ggml-base.en.bin -f audio.wav

# With timestamps
./main -m models/ggml-base.en.bin -f audio.wav --output-srt

# Auto-detect language
./main -m models/ggml-large-v3.bin -f audio.wav -l auto
```
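The CLI can also be driven from Python via `subprocess`. This is an illustrative sketch, not part of the project; it assumes the `./main` binary and the flags shown in the examples above:

```python
import subprocess

# Build the argument list for ./main; flags mirror the CLI examples above.
def build_cmd(model, audio, srt=False, lang=None):
    cmd = ["./main", "-m", model, "-f", audio]
    if srt:
        cmd.append("--output-srt")   # write an .srt file next to the audio
    if lang:
        cmd += ["-l", lang]          # e.g. "auto" for language detection
    return cmd

cmd = build_cmd("models/ggml-base.en.bin", "audio.wav", srt=True)
# subprocess.run(cmd, check=True)  # uncomment to actually transcribe
print(" ".join(cmd))
```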
### 3. Start HTTP Server

```bash
# Build server
make server

# Run server
./server -m models/ggml-base.en.bin --host 0.0.0.0 --port 8080
```
### 4. Transcribe via API

```bash
# OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/audio/transcriptions \
  -F file=@meeting.wav \
  -F model=whisper-1 \
  -F response_format=json | jq '.text'

# With language hint
curl -s http://localhost:8080/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F language=en \
  -F response_format=verbose_json | jq '{text: .text, language: .language, segments: [.segments[] | {start: .start, end: .end, text: .text}]}'

# Get timestamps (SRT format)
curl -s http://localhost:8080/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F response_format=srt
```
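The same request can be made from Python with only the standard library; this sketch hand-builds the multipart body that curl's `-F` flags produce. The field names match the examples above, but the helper itself is illustrative:

```python
import io
import urllib.request
import uuid

def multipart_body(file_name, file_bytes, fields):
    """Build a multipart/form-data body like the curl example above."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\nContent-Disposition: form-data; "
                  f"name=\"{name}\"\r\n\r\n{value}\r\n".encode())
    buf.write(f"--{boundary}\r\nContent-Disposition: form-data; "
              f"name=\"file\"; filename=\"{file_name}\"\r\n"
              f"Content-Type: audio/wav\r\n\r\n".encode())
    buf.write(file_bytes)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return boundary, buf.getvalue()

boundary, body = multipart_body(
    "meeting.wav", b"...", {"model": "whisper-1", "response_format": "json"})
req = urllib.request.Request(
    "http://localhost:8080/v1/audio/transcriptions", data=body,
    headers={"Content-Type": f"multipart/form-data; boundary={boundary}"})
# urllib.request.urlopen(req)  # requires a running server
```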
### 5. Translation

```bash
# Translate any language to English
curl -s http://localhost:8080/v1/audio/translations \
  -F file=@russian_audio.wav \
  -F model=whisper-1 | jq '.text'
```
## Python Example

```python
from openai import OpenAI

# Works with the OpenAI SDK!
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Transcribe
with open("meeting.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json"
    )

print(f"Text: {transcript.text}")
for segment in transcript.segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")

# Translate to English
with open("foreign_audio.wav", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-1", file=f)
print(f"Translation: {translation.text}")
```
## Model Sizes
| Model | Size | RAM | Speed | Quality |
|---|---|---|---|---|
| tiny | 39MB | ~390MB | Fastest | Basic |
| base | 74MB | ~500MB | Fast | Good |
| small | 244MB | ~1GB | Medium | Better |
| medium | 769MB | ~2.5GB | Slow | Great |
| large-v3 | 1.5GB | ~4GB | Slowest | Best |
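The table suggests a simple heuristic for choosing a model by available memory. The helper below and its thresholds are my own sketch based on the RAM column above, not anything shipped with whisper.cpp:

```python
# Approximate peak RAM per model, from the table above (in MB).
MODEL_RAM_MB = {
    "tiny": 390, "base": 500, "small": 1000,
    "medium": 2500, "large-v3": 4000,
}

def pick_model(available_mb: int) -> str:
    """Return the highest-quality model that fits in available_mb."""
    best = "tiny"
    for name, ram in MODEL_RAM_MB.items():
        if ram <= available_mb:
            best = name  # dict order runs tiny -> large-v3
    return best

print(pick_model(3000))  # medium fits in 3 GB; large-v3 (~4 GB) does not
```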
## Key Endpoints
| Endpoint | Description |
|---|---|
| /v1/audio/transcriptions | Transcribe audio |
| /v1/audio/translations | Translate to English |
| /v1/models | List models |
| /health | Health check |
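A quick liveness probe before uploading audio can save a confusing multipart error. This sketch assumes the server from step 3 and the `/health` route listed above:

```python
import urllib.error
import urllib.request

def server_healthy(base="http://localhost:8080", timeout=2.0) -> bool:
    """Return True if the whisper.cpp server answers /health with 200."""
    try:
        with urllib.request.urlopen(f"{base}/health", timeout=timeout) as r:
            return r.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(server_healthy())  # False unless the server is running
```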
## Performance Tips

- Apple Silicon: build with Metal acceleration (`make WHISPER_METAL=1`)
- x86: enable AVX2/AVX-512 for best performance
- Large files: split into chunks and process them in parallel
- Real-time: use the `stream` binary for live microphone input
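The "split into chunks" tip reduces to arithmetic over time offsets. The chunk length and overlap values below are illustrative; a small overlap keeps words on a boundary from being cut in half:

```python
def chunk_ranges(total_s: float, chunk_s: float = 30.0, overlap_s: float = 1.0):
    """Yield (start, end) ranges in seconds covering total_s,
    with a small overlap so boundary words are not truncated."""
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        yield (start, end)
        if end >= total_s:
            break
        start = end - overlap_s  # back up slightly for the overlap

print(list(chunk_ranges(65.0)))
# Each range can then be cut out with ffmpeg and sent to the server in parallel.
```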