Cheapest Audio Transcription APIs in 2025: Whisper via API vs AssemblyAI vs Deepgram
Audio transcription has become a commodity — Whisper changed everything. But running Whisper locally requires a GPU (or at least a beefy CPU), and hosting it yourself adds ops overhead. The better path for most developers: use a transcription API.
This guide compares the leading audio transcription APIs by price, accuracy, language support, and developer experience.
What to Consider When Choosing a Transcription API
- Price: Providers bill per minute of audio, per hour, or per request — normalize to one unit before comparing. Volume discounts matter.
- Accuracy: Varies by language, audio quality, and domain (medical, legal, technical).
- Languages: Whisper supports 99+ languages; some services only optimize for English.
- Speaker diarization: Can it distinguish who's speaking?
- Turnaround time: Real-time streaming vs async batch processing.
- Word-level timestamps: Needed for video subtitles and caption generation.
Comparison Table
| Tool | Price | Languages | Diarization | Timestamps | Base Model |
|---|---|---|---|---|---|
| IteraTools | ~$0.003/min (credits) | 99+ (Whisper) | No | Yes | Whisper |
| AssemblyAI | $0.01/min | 99+ | Yes | Yes | Custom |
| Deepgram | $0.0043/min | 36 | Yes | Yes | Custom |
| OpenAI Whisper API | $0.006/min | 99+ | No | Yes | Whisper |
| Groq Whisper | $0.002/min | 99+ | No | No | Whisper large-v3 |
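Per-minute list prices are hard to compare at a glance. A quick sketch normalizes them to a monthly bill — note it uses only the list prices from the table above and ignores volume discounts, billing minimums, and free tiers:

```python
# Per-minute list prices from the comparison table (USD).
PRICES_PER_MIN = {
    "IteraTools": 0.003,
    "AssemblyAI": 0.010,
    "Deepgram": 0.0043,
    "OpenAI Whisper API": 0.006,
    "Groq Whisper": 0.002,
}

def monthly_cost(hours_of_audio: float) -> dict:
    """Estimated monthly cost per provider for a given audio volume."""
    minutes = hours_of_audio * 60
    return {name: round(rate * minutes, 2) for name, rate in PRICES_PER_MIN.items()}

# At 100 hours/month: AssemblyAI $60.00, Deepgram $25.80, IteraTools $18.00
print(monthly_cost(100))
```

At moderate volume the spread is already significant — roughly 3x between the cheapest and most expensive options in the table.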
IteraTools Transcription — How to Use It
Transcribe from a URL:
```bash
curl -X POST https://api.iteratools.com/v1/transcribe \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/audio/interview.mp3",
    "language": "en"
  }'
```
Upload a local file:
```bash
curl -X POST https://api.iteratools.com/v1/transcribe \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@recording.mp3" \
  -F "language=pt"
```
Response:
```json
{
  "text": "Hello, today we're going to discuss the quarterly results...",
  "language": "en",
  "duration_seconds": 142.5,
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.4, "confidence": 0.99},
    {"word": "today", "start": 0.5, "end": 0.8, "confidence": 0.98}
  ],
  "credits_used": 5
}
```
Complete Python Example
```python
import requests
from pathlib import Path

API_KEY = "your_api_key_here"
BASE_URL = "https://api.iteratools.com/v1"


def transcribe_file(audio_path: str, language: str = "en") -> dict:
    """Transcribe a local audio file."""
    with open(audio_path, "rb") as f:
        response = requests.post(
            f"{BASE_URL}/transcribe",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": (Path(audio_path).name, f)},
            data={"language": language},
        )
    response.raise_for_status()
    return response.json()


def transcribe_url(audio_url: str, language: str = "en") -> dict:
    """Transcribe audio from a URL."""
    response = requests.post(
        f"{BASE_URL}/transcribe",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"url": audio_url, "language": language},
    )
    response.raise_for_status()
    return response.json()


def generate_srt(transcription: dict, output_file: str = "subtitles.srt"):
    """Generate an SRT subtitle file from a transcription with word timestamps."""
    words = transcription.get("words", [])
    if not words:
        print("No word-level timestamps available")
        return

    # Group words into subtitle chunks (max 10 words per chunk)
    chunks = []
    chunk_words = []
    for word in words:
        chunk_words.append(word)
        if len(chunk_words) >= 10:
            chunks.append(chunk_words)
            chunk_words = []
    if chunk_words:
        chunks.append(chunk_words)

    def format_time(seconds: float) -> str:
        # SRT timestamps use a comma before the milliseconds: HH:MM:SS,mmm
        h = int(seconds // 3600)
        m = int((seconds % 3600) // 60)
        s = int(seconds % 60)
        ms = int((seconds % 1) * 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    with open(output_file, "w", encoding="utf-8") as f:
        for i, chunk in enumerate(chunks, 1):
            start = chunk[0]["start"]
            end = chunk[-1]["end"]
            text = " ".join(w["word"] for w in chunk)
            f.write(f"{i}\n")
            f.write(f"{format_time(start)} --> {format_time(end)}\n")
            f.write(f"{text}\n\n")
    print(f"SRT saved: {output_file} ({len(chunks)} subtitle blocks)")


if __name__ == "__main__":
    # Transcribe a meeting recording
    result = transcribe_file("meeting.mp3", language="en")
    print(f"Transcript ({result['duration_seconds']:.0f}s audio):")
    print(result["text"][:500])
    print(f"\nCredits used: {result['credits_used']}")

    # Generate subtitles for a video
    result_pt = transcribe_file("video_audio.mp3", language="pt")
    generate_srt(result_pt, "video_subtitles.srt")

    # Batch process a folder of recordings
    audio_dir = Path("recordings/")
    for audio_file in audio_dir.glob("*.mp3"):
        print(f"Transcribing {audio_file.name}...")
        result = transcribe_file(str(audio_file))
        # Save the transcript next to the audio file
        transcript_path = audio_file.with_suffix(".txt")
        transcript_path.write_text(result["text"], encoding="utf-8")
        print(f"  ✓ Saved to {transcript_path}")
```
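One caveat with the batch loop: a single failed request (a transient 5xx, or a rate limit) aborts the whole run. A small retry wrapper with exponential backoff fixes that. This is a generic sketch — IteraTools' actual rate-limit behavior and status codes are assumptions here, so pass the exception types your provider actually raises:

```python
import time


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff schedule: base * 2**attempt seconds, capped."""
    return min(base * (2 ** attempt), cap)


def with_retry(call, retries: int = 4, retry_on=(Exception,), sleep=time.sleep):
    """Invoke `call()` up to `retries` times, backing off between failures.

    `retry_on` should be narrowed to transient errors in real use
    (e.g. requests.HTTPError for 429/5xx responses); `sleep` is
    injectable so the backoff can be tested without waiting.
    """
    if retries < 1:
        raise ValueError("retries must be >= 1")
    last_exc = None
    for attempt in range(retries):
        try:
            return call()
        except retry_on as exc:
            last_exc = exc
            sleep(backoff_delay(attempt))
    raise last_exc
```

In the batch loop above, you would wrap the call as `result = with_retry(lambda: transcribe_file(str(audio_file)))`.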
Accuracy Notes by Language
Whisper-based APIs (including IteraTools and OpenAI) generally excel at:
- English, Spanish, French, German, Japanese, Portuguese — very high accuracy
- Mandarin, Arabic, Hindi — good accuracy
- Less common languages — variable; test with your specific language
AssemblyAI and Deepgram use custom models optimized for English, often with better accuracy for business audio, accents, and domain-specific terminology.
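If diarization is the deciding factor, note that the Whisper-based options in the table return one undifferentiated transcript, while AssemblyAI and Deepgram return speaker-labelled utterances you can post-process. A tiny formatter turns those into a readable script — the `{speaker, text}` shape below is a simplified stand-in, not either provider's actual response schema:

```python
def format_diarized(utterances: list) -> str:
    """Render speaker-labelled utterances as a readable transcript.

    Expects items shaped like {"speaker": "A", "text": "..."} -- a
    simplified stand-in for provider-specific diarization output.
    """
    return "\n".join(f"Speaker {u['speaker']}: {u['text']}" for u in utterances)


print(format_diarized([
    {"speaker": "A", "text": "Welcome back, everyone."},
    {"speaker": "B", "text": "Thanks for having me."},
]))
```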
Conclusion
For developers who need transcription at reasonable cost with solid multi-language support, IteraTools provides a great balance: Whisper-quality transcription at ~$0.003/min, with word timestamps, and no subscription required. It's also part of a broader API toolkit — you can immediately pass the transcript to IteraTools' text/embedding/search endpoints.
For English-only applications that need speaker diarization, AssemblyAI or Deepgram are worth the premium.
→ Try IteraTools transcription — 99+ languages, pay per use.