Fred Santos

Cheapest Audio Transcription APIs in 2025: Whisper via API vs AssemblyAI vs Deepgram


Audio transcription has become a commodity — Whisper changed everything. But running Whisper locally requires a GPU (or at least a beefy CPU), and hosting it yourself adds ops overhead. The better path for most developers: use a transcription API.

This guide compares the leading audio transcription APIs by price, accuracy, language support, and developer experience.

What to Consider When Choosing a Transcription API

  • Price: Usually billed per minute of audio; some providers price per hour or per request. Volume discounts matter at scale.
  • Accuracy: Varies by language, audio quality, and domain (medical, legal, technical).
  • Languages: Whisper supports 99+ languages; some services only optimize for English.
  • Speaker diarization: Can it distinguish who's speaking?
  • Turnaround time: Real-time streaming vs async batch processing.
  • Word-level timestamps: Needed for video subtitles and caption generation.

Comparison Table

| Tool | Price | Languages | Diarization | Timestamps | Base Model |
|---|---|---|---|---|---|
| IteraTools | ~$0.003/min (credits) | 99+ (Whisper) | No | Yes | Whisper |
| AssemblyAI | $0.01/min | 99+ | Yes | Yes | Custom |
| Deepgram | $0.0043/min | 36 | Yes | Yes | Custom |
| OpenAI Whisper API | $0.006/min | 99+ | No | Yes | Whisper |
| Groq Whisper | $0.002/min | 99+ | No | No | Whisper large-v3 |
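Per-minute rates only become meaningful at your own volume. A quick back-of-the-envelope sketch using the rates from the table above (real bills will differ with rounding, minimums, and volume discounts):

```python
# Rough monthly cost estimate per provider, using the table's per-minute rates.
RATES_PER_MIN = {
    "IteraTools": 0.003,
    "AssemblyAI": 0.010,
    "Deepgram": 0.0043,
    "OpenAI Whisper API": 0.006,
    "Groq Whisper": 0.002,
}

def monthly_cost(hours_per_month: float) -> dict:
    """Estimated USD cost per provider for a given monthly audio volume."""
    minutes = hours_per_month * 60
    return {name: round(rate * minutes, 2) for name, rate in RATES_PER_MIN.items()}

# 100 hours of audio per month, cheapest first
for provider, cost in sorted(monthly_cost(100).items(), key=lambda kv: kv[1]):
    print(f"{provider}: ${cost}")
```

At 100 hours/month the spread is real money: roughly $12 (Groq) to $60 (AssemblyAI).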

IteraTools Transcription — How to Use It

Transcribe from a URL:

curl -X POST https://api.iteratools.com/v1/transcribe \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/audio/interview.mp3",
    "language": "en"
  }'

Upload a local file:

curl -X POST https://api.iteratools.com/v1/transcribe \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@recording.mp3" \
  -F "language=pt"

Response:

{
  "text": "Hello, today we're going to discuss the quarterly results...",
  "language": "en",
  "duration_seconds": 142.5,
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.4, "confidence": 0.99},
    {"word": "today", "start": 0.5, "end": 0.8, "confidence": 0.98}
  ],
  "credits_used": 5
}
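The per-word confidence scores are handy for review workflows. Assuming the response shape shown above, a small helper to surface words worth double-checking (the 0.85 threshold is an arbitrary starting point):

```python
# Return (word, start_time) pairs the model was least sure about,
# assuming the response format shown above.
def low_confidence_words(transcription: dict, threshold: float = 0.85) -> list:
    return [
        (w["word"], w["start"])
        for w in transcription.get("words", [])
        if w["confidence"] < threshold
    ]
```

Feed it the parsed JSON response and you get a shortlist of timestamps to jump to when proofreading the transcript.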

Complete Python Example

import requests
from pathlib import Path

API_KEY = "your_api_key_here"
BASE_URL = "https://api.iteratools.com/v1"

def transcribe_file(audio_path: str, language: str = "en") -> dict:
    """Transcribe a local audio file."""
    with open(audio_path, "rb") as f:
        response = requests.post(
            f"{BASE_URL}/transcribe",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": (Path(audio_path).name, f)},
            data={"language": language}
        )
    response.raise_for_status()
    return response.json()

def transcribe_url(audio_url: str, language: str = "en") -> dict:
    """Transcribe audio from a URL."""
    response = requests.post(
        f"{BASE_URL}/transcribe",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"url": audio_url, "language": language}
    )
    response.raise_for_status()
    return response.json()

def generate_srt(transcription: dict, output_file: str = "subtitles.srt"):
    """Generate SRT subtitle file from transcription with timestamps."""
    words = transcription.get("words", [])
    if not words:
        print("No word-level timestamps available")
        return

    # Group words into subtitle chunks (max 10 words per chunk)
    chunks = []
    chunk_words = []

    for word in words:
        chunk_words.append(word)
        if len(chunk_words) >= 10:
            chunks.append(chunk_words)
            chunk_words = []

    if chunk_words:
        chunks.append(chunk_words)

    def format_time(seconds: float) -> str:
        h = int(seconds // 3600)
        m = int((seconds % 3600) // 60)
        s = int(seconds % 60)
        ms = int((seconds % 1) * 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    with open(output_file, "w") as f:
        for i, chunk in enumerate(chunks, 1):
            start = chunk[0]["start"]
            end = chunk[-1]["end"]
            text = " ".join(w["word"] for w in chunk)
            f.write(f"{i}\n")
            f.write(f"{format_time(start)} --> {format_time(end)}\n")
            f.write(f"{text}\n\n")

    print(f"SRT saved: {output_file} ({len(chunks)} subtitle blocks)")

if __name__ == "__main__":
    # Transcribe a meeting recording
    result = transcribe_file("meeting.mp3", language="en")

    print(f"Transcript ({result['duration_seconds']:.0f}s audio):")
    print(result["text"][:500])
    print(f"\nCredits used: {result['credits_used']}")

    # Generate subtitles for a video
    result_pt = transcribe_file("video_audio.mp3", language="pt")
    generate_srt(result_pt, "video_subtitles.srt")

    # Batch process a folder of recordings
    audio_dir = Path("recordings/")
    for audio_file in audio_dir.glob("*.mp3"):
        print(f"Transcribing {audio_file.name}...")
        result = transcribe_file(str(audio_file))

        # Save transcript
        transcript_path = audio_file.with_suffix(".txt")
        transcript_path.write_text(result["text"])
        print(f"  ✓ Saved to {transcript_path}")
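The batch loop above processes files one at a time. For large folders, a few concurrent uploads can cut wall-clock time substantially. A sketch using a thread pool, where `transcribe` is any per-file function such as the `transcribe_file` helper above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable

def transcribe_folder(folder: str, transcribe: Callable[[str], dict],
                      max_workers: int = 4) -> dict:
    """Transcribe every .mp3 in a folder concurrently.

    Returns {filename: transcription dict}; failures are logged and skipped.
    """
    files = sorted(Path(folder).glob("*.mp3"))
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(transcribe, str(p)): p for p in files}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results[path.name] = fut.result()
            except Exception as exc:  # keep going if one file fails
                print(f"  ✗ {path.name}: {exc}")
    return results
```

Keep `max_workers` modest: most transcription APIs enforce per-key concurrency limits, so check your provider's rate limits before cranking it up.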

Accuracy Notes by Language

Whisper-based APIs (including IteraTools and OpenAI) generally excel at:

  • English, Spanish, French, German, Japanese, Portuguese — very high accuracy
  • Mandarin, Arabic, Hindi — good accuracy
  • Less common languages — variable; test with your specific language

AssemblyAI and Deepgram use custom models optimized for English, often with better accuracy for business audio, accents, and domain-specific terminology.
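If accuracy in your language is the deciding factor, benchmark providers yourself against a hand-corrected reference transcript. A minimal word error rate (WER) implementation for that comparison:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Run the same 10-minute clip of your real audio through each API and compare WER scores; a production evaluation would also normalize punctuation and numerals before scoring.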

Conclusion

For developers who need transcription at reasonable cost with solid multi-language support, IteraTools provides a great balance: Whisper-quality transcription at ~$0.003/min, with word timestamps, and no subscription required. It's also part of a broader API toolkit — you can immediately pass the transcript to IteraTools' text/embedding/search endpoints.

For English-only applications that need speaker diarization, AssemblyAI or Deepgram are worth the premium.

Try IteraTools transcription — 99+ languages, pay per use.
