Compare per-minute rates, free tiers, and accuracy across the top speech-to-text APIs.
In 2026, the transcription API market remains fractured across three dominant commercial players, an open-source alternative, and a dozen smaller competitors. Teams evaluating speech-to-text solutions face a real choice: OpenAI's Whisper API dominates on cost-per-minute at scale, Deepgram leads on free tier generosity and real-time performance, and AssemblyAI excels in accuracy and compliance features. This guide cuts through the marketing and compares per-minute rates, free-tier limits, accuracy trade-offs, and the concrete reasons to pick one provider over another. If you are buying transcription infrastructure in 2026, this explains what you are actually paying for and where the hidden costs lie.
Why this matters now
Speech-to-text has become table stakes for any product that ingests audio: customer support platforms, meeting recorders, podcasting tools, video accessibility, and live event captions all depend on transcription. The price difference between providers can swing your margin by 2% to 8%, depending on audio volume. More crucially, free tiers have diverged sharply since 2023. Deepgram's 12.5 free hours monthly now make it viable for startups that were previously excluded by pay-as-you-go minimums. Meanwhile, Whisper's open-source release in 2022 created a credible self-hosting option for teams with engineering capacity, collapsing the premium that proprietary APIs once commanded. By 2026, the decision is less "which API is best" and more "which pricing model maps to our budget and technical constraints."
The secondary driver: regulatory pressure and data residency. AssemblyAI's compliance certifications (SOC 2, HIPAA-ready) and EU-based processing now factor into purchasing decisions that price alone cannot influence. Deepgram's real-time streaming, which Whisper lacks entirely, creates a hard constraint for live-caption products. The landscape has matured enough that comparing APIs on cost alone will lead to the wrong choice. You need to compare use case by use case.
Whisper API: The cost leader, with caveats

OpenAI's Whisper API costs $0.02 per minute of audio processed, making it the lowest per-minute rate among commercial services. The pricing is simple, non-tiered, and applies equally whether you send a 30-second clip or a 10-hour recording. At 1 million minutes annually (roughly 46 hours of audio per day), Whisper runs $20,000 per year, with no volume discount.
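The flat-rate arithmetic is easy to check with a few lines of code. This is a minimal sketch that assumes the $0.02 per-minute rate quoted in this article; the function names are illustrative and not part of any provider SDK.

```python
def annual_cost(minutes_per_year: float, rate_per_minute: float) -> float:
    """Flat-rate cost in USD: no tiers and no volume discount."""
    return minutes_per_year * rate_per_minute


def audio_hours_per_day(minutes_per_year: float) -> float:
    """Average hours of audio per day implied by an annual minute count."""
    return minutes_per_year / 60 / 365


WHISPER_RATE = 0.02  # USD per minute, the rate quoted in this article (assumption)

if __name__ == "__main__":
    minutes = 1_000_000
    print(f"annual cost: ${annual_cost(minutes, WHISPER_RATE):,.0f}")
    print(f"audio per day: {audio_hours_per_day(minutes):.1f} hours")
```

Because the rate is flat, cost scales linearly with volume, so the same two functions cover any usage level.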
The practical constraint: Whisper has no free tier. Teams evaluating the API must pay from the first minute. There is also no real-time streaming API. You upload audio and wait for a response, typically 1 to 5 minutes for files under 5 minutes long. If your use case requires live captions or streaming input, Whisper is not an option, regardless of cost.
The accuracy picture is mixed. Whisper generalizes well across accents and background noise, reflecting its training on 680,000 hours of multilingual audio. But it lacks domain-specific training. In medical, legal, or highly technical domains, Whisper's word error rate (WER) often climbs to 15% to 25%, whereas specialized models from Deepgram or AssemblyAI may hit 5% to 10%. For meetings, podcasts, and general dialogue, Whisper performs competitively at 5% to 8% WER.
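Word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below computes it with a standard edit-distance table. The lowercase-and-split normalization is a simplifying assumption; published benchmarks usually apply more careful text normalization, so numbers from this function are not directly comparable to vendor figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "please refill the atorvastatin prescription today"
    hypothesis = "please refill the statin prescription"
    # One substitution plus one deletion over six reference words.
    print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Running this on a sample of your own audio with hand-corrected references is a more reliable guide than any published range, since WER depends heavily on domain and recording quality.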
Self-hosting is an option. OpenAI open-sourced Whisper, and the model runs locally on CPU or GPU. Inference speed depends on hardware, from real-time on an A100 GPU to 10x slower than real-time on a 4-core CPU. Self-hosting avoids per-minute fees and keeps audio private, but introduces infrastructure cost and operational overhead. Most teams will choose the API unless they process petabytes annually or face hard privacy rules.
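A rough break-even estimate makes the self-hosting trade-off concrete. The sketch below is illustrative only: the GPU hourly price and the fixed monthly overhead are hypothetical placeholders, not quotes from any cloud provider, and the real-time factor of 1.0 follows the A100 figure mentioned above.

```python
from typing import Optional


def self_host_cost_per_audio_minute(gpu_hourly_usd: float, realtime_factor: float) -> float:
    """Compute cost per minute of audio.

    realtime_factor is compute time divided by audio duration:
    1.0 means real time, 10.0 means ten times slower than real time.
    """
    return gpu_hourly_usd / 60 * realtime_factor


def breakeven_minutes_per_month(
    api_rate: float, self_host_rate: float, fixed_monthly_overhead: float
) -> Optional[float]:
    """Monthly audio minutes at which self-hosting matches the API bill.

    Returns None when self-hosting never breaks even, that is, when its
    per-minute cost is not below the API rate.
    """
    if self_host_rate >= api_rate:
        return None
    return fixed_monthly_overhead / (api_rate - self_host_rate)


if __name__ == "__main__":
    API_RATE = 0.02         # USD per minute, as quoted in this article
    GPU_HOURLY = 2.00       # hypothetical GPU rental price, USD per hour
    OVERHEAD = 3_000.00     # hypothetical monthly engineering and ops cost

    per_minute = self_host_cost_per_audio_minute(GPU_HOURLY, realtime_factor=1.0)
    print(f"self-hosted cost per audio minute: ${per_minute:.4f}")
    print("break-even minutes per month:",
          breakeven_minutes_per_month(API_RATE, per_minute, OVERHEAD))
```

Under these placeholder numbers, real-time inference on a rented GPU costs more per audio minute than the API, so there is no break-even point. The picture changes only if your hardware runs well faster than real time or is much cheaper than assumed.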
Deepgram: The free-tier leader with surprising depth
Deepgram has become the de facto choice for budget-conscious startups, offering 12.5 free hours monthly without a credit card requirement. That tier supports real-time streaming at no cost, which no other major provider offers for free. Once you exceed 12.5 hours, Deepgram's standard tier pricing sits at $0.0043 per minute when paid monthly, though month-to-month pay-as-you-go rates climb to $0.005 per minute. For context, $0.0043 per minute translates to $4.30 per thousand minutes, or about $0.26 per hour of audio. At 60,000 minutes a month (1,000 hours), that is $258 per month, or roughly $3,096 per year for 720,000 minutes.
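To see how the free allowance interacts with the paid rate, the following sketch estimates a monthly bill. It assumes the free 12.5 hours are deducted before billing each month, which is a simplification; check the provider's current terms to confirm whether the allowance carries over to paid plans. The rates are the ones quoted in this article.

```python
FREE_MINUTES_PER_MONTH = 12.5 * 60   # 12.5 free hours = 750 minutes
STANDARD_RATE = 0.0043               # USD per minute, as quoted in this article
PAYG_RATE = 0.005                    # USD per minute, pay-as-you-go


def monthly_bill(
    minutes_used: float,
    rate_per_minute: float = STANDARD_RATE,
    free_minutes: float = FREE_MINUTES_PER_MONTH,
) -> float:
    """Bill only the minutes above the free allowance (assumed behavior)."""
    billable = max(0.0, minutes_used - free_minutes)
    return billable * rate_per_minute


if __name__ == "__main__":
    for minutes in (500, 5_000, 60_750):
        standard = monthly_bill(minutes)
        payg = monthly_bill(minutes, rate_per_minute=PAYG_RATE)
        print(f"{minutes:>7,} min: standard ${standard:,.2f}, pay-as-you-go ${payg:,.2f}")
```

At low volumes the free allowance dominates the bill; at tens of thousands of minutes it becomes a rounding error and the per-minute rate is what matters.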
The appeal extends beyond price. Deepgram's real-time streaming API uses websockets and delivers transcripts with 300 to 500 millisecond latency, fast enough for live captions without noticeable delay. Whisper cannot do this. AssemblyAI can, but charges for real-time at a higher per-minute rate than batch processing. For any live-caption or call-center use case, Deepgram becomes the practical default.
Accuracy is respectable but not best-in-class. Deepgram's default model achieves 7% to 12% WER on clean audio, competitive with Whisper but behind AssemblyAI's specialized models in domain-specific tasks. Deepgram offers custom vocabulary and speaker diarization as add-ons, pushing the total per-minute cost higher if you rely on these features. The transparency on these pricing layers is weak; you must contact sales for volume discounts above a certain threshold.
International support is a strength. Deepgram supports 99 languages and runs infrastructure in multiple regions, including EU-based processing for GDPR compliance. The free tier, lack of credit card requirement, and real-time performance combine to make Deepgram the easiest on-ramp for developers building a proof-of-concept.
AssemblyAI: Premium accuracy and compliance

AssemblyAI positions itself at the premium end of the market, and the pricing reflects it. The free tier provides 100 minutes monthly, a ceiling that is roughly 13% of Deepgram's 750-minute free allowance. Paid tiers begin at $0.005 per minute for batch processing and $0.015 per minute for real-time streaming. For 100,000 minutes annually, batch processing costs $500 per year at the standard rate.
The premium derives from three sources: accuracy, compliance, and specialized models. AssemblyAI's flagship Conformer model achieves 5% to 8% WER on general audio, in line with Whisper on standard cases, and significantly outperforms both competitors on accented speech, background noise, and technical jargon. The company invests heavily in domain-specific models for medical, legal, and financial transcription, where accuracy directly impacts downstream costs or liability.
Compliance is material for regulated industries. AssemblyAI holds SOC 2 Type II certification, HIPAA compliance readiness, and GDPR alignment. The company offers EU-based data processing, audit logs, and encryption at rest. For any health-tech, financial-services, or legal-tech product, AssemblyAI's compliance posture reduces sales friction and shortens security reviews. Whisper and Deepgram lack this depth.
The user experience trade-off: AssemblyAI charges for features that competitors bundle. Speaker diarization (separating speaker 1 from speaker 2) costs extra. Custom vocabulary costs extra. Transcript storage beyond 30 days incurs retrieval fees. The sticker price of $0.005 per minute is misleading if your production use case requires three or four add-ons. Real-world cost for a team using medical transcription, diarization, and custom vocabulary can exceed $0.015 per minute once all features are enabled.
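The gap between sticker price and effective price can be estimated before you commit. In the sketch below, only the $0.005 base rate comes from this article; the individual add-on prices are hypothetical placeholders chosen to illustrate how a stack of features can push the total past $0.015 per minute. Substitute the figures from your own quote.

```python
from typing import Dict, Iterable

BASE_RATE = 0.005  # USD per minute, batch rate quoted in this article

# Hypothetical add-on prices in USD per minute (not published figures).
ADDON_PRICES: Dict[str, float] = {
    "diarization": 0.004,
    "custom_vocabulary": 0.003,
    "medical_model": 0.004,
}


def effective_rate(base: float, enabled: Iterable[str], prices: Dict[str, float]) -> float:
    """Base rate plus the per-minute price of every enabled add-on.

    Raises KeyError if an add-on has no known price, so a missing
    line item is caught instead of being silently treated as free.
    """
    return base + sum(prices[name] for name in enabled)


if __name__ == "__main__":
    enabled = ["diarization", "custom_vocabulary", "medical_model"]
    rate = effective_rate(BASE_RATE, enabled, ADDON_PRICES)
    print(f"effective rate: ${rate:.4f} per minute")
    print(f"100,000 minutes per year: ${rate * 100_000:,.2f}")
```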
Free tier comparison: What you actually get
The free-tier landscape shifted meaningfully in 2024 and 2025:
Deepgram: 12.5 hours monthly, real-time streaming included, no credit card required. Resets monthly. Best-in-class free tier.
AssemblyAI: 100 minutes monthly, batch only, credit card required. Resets monthly. Suitable for light testing.
Whisper API: No free tier, but open-source model available for local use at no cost.
Google Cloud Speech-to-Text: 60 minutes monthly, no credit card required for first 12 months. Real-time streaming supported.
AWS Transcribe: 4,000 minutes in first year, included in free tier, then $0.0001 per second (not recommended at scale).
For teams prototyping: Deepgram's free tier is the clear winner if you need real-time. If batch processing suffices, Google Cloud provides 60 free minutes monthly without credit card friction, and AWS's generous first-year allowance covers many small projects. Whisper API charges from day one but offers the lowest marginal cost at high volume.
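The free-tier comparison above can be turned into a quick filter for prototyping volume. The allowances are the ones listed in this article; spreading the AWS first-year allowance evenly over 12 months is a simplifying assumption, and the function name is illustrative.

```python
from typing import Dict, List

# Free minutes per month, taken from the list above.
FREE_MINUTES_PER_MONTH: Dict[str, float] = {
    "Deepgram": 12.5 * 60,                      # 12.5 hours
    "AssemblyAI": 100,
    "Whisper API": 0,                           # no free tier
    "Google Cloud Speech-to-Text": 60,
    "AWS Transcribe (first year)": 4_000 / 12,  # averaged over 12 months (assumption)
}


def tiers_covering(monthly_minutes: float) -> List[str]:
    """Providers whose free allowance covers the given monthly usage,
    ordered from the largest allowance to the smallest."""
    covered = [
        (allowance, name)
        for name, allowance in FREE_MINUTES_PER_MONTH.items()
        if allowance >= monthly_minutes
    ]
    return [name for allowance, name in sorted(covered, reverse=True)]


if __name__ == "__main__":
    for usage in (50, 90, 400, 1_000):
        print(f"{usage:>5} min/month -> {tiers_covering(usage) or 'no free tier covers this'}")
```

This only answers whether a prototype fits inside a free tier. It ignores whether the tier includes real-time streaming or requires a credit card, which the list above covers.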
Hidden costs and gotchas
Per-minute pricing obscures several recurring expenses that compound quickly:
Transcript storage and retrieval: AssemblyAI charges for storing transcripts beyond 30 days, then charges again to retrieve them. If you keep a long-term audit log, this cost sneaks in. Deepgram and Whisper do not charge for storage, but you must manage it yourself.
Feature add-ons: Speaker diarization (identifying which speaker said what) costs extra on AssemblyAI and Deepgram. Custom vocabulary for domain-specific terms is often a separate line item. Real-time streaming is priced higher than batch on most platforms. Read the fine print before scaling beyond the base rate.
Overage billing: Monthly plans often have ceiling rates. Exceed them mid-month and you may be bumped to a higher per-minute rate for the remainder, or charged overage fees. Always clarify what happens if you spike in usage.
Language model add-ons: Some providers, particularly AssemblyAI, allow you to attach your own fine-tuned language model. This is powerful but often requires a higher minimum commitment or per-instance fee.
Data residency and compliance certifications: EU-based processing or HIPAA compliance may require enterprise contracts with minimum spend. Do not assume free or standard tiers qualify for these features.
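The overage scenario described above is worth modeling before you sign a monthly plan. This sketch assumes one common shape of plan: a fixed fee that includes a block of minutes, with every extra minute billed at a separate overage rate. All plan numbers here are hypothetical, and real contracts may instead re-rate the whole month, so treat this as one possible model rather than any provider's actual billing logic.

```python
def plan_bill(
    minutes_used: float,
    plan_fee: float,
    included_minutes: float,
    overage_rate: float,
) -> float:
    """Fixed plan fee plus a per-minute charge for usage above the included block."""
    overage_minutes = max(0.0, minutes_used - included_minutes)
    return plan_fee + overage_minutes * overage_rate


def blended_rate(minutes_used: float, bill: float) -> float:
    """Effective USD per minute actually paid for the month."""
    if minutes_used <= 0:
        raise ValueError("minutes_used must be positive")
    return bill / minutes_used


if __name__ == "__main__":
    # Hypothetical plan: $40 per month for 10,000 minutes, $0.006 per extra minute.
    for used in (6_000, 10_000, 25_000):
        bill = plan_bill(used, plan_fee=40.0, included_minutes=10_000, overage_rate=0.006)
        print(f"{used:>6,} min: bill ${bill:,.2f}, blended ${blended_rate(used, bill):.4f}/min")
```

Running a spike month through the model shows both failure modes: underuse inflates the blended rate because the fee is fixed, and overuse inflates it because the overage rate is higher than the plan rate.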
When transcription APIs fail
Transcription accuracy remains fundamentally limited by audio quality. Heavy background noise, poor microphone placement, overlapping speakers, and non-native accents remain the primary sources of errors. No API has solved this. Whisper and Deepgram perform slightly better on accented speech than older models, but 15% to 20% WER on noisy audio is a realistic floor.
Real-time latency is a hard constraint for some use cases. If you need captions synchronized within a few hundred milliseconds, Whisper is not viable because it offers no streaming API. Deepgram and AssemblyAI both support real-time, but AssemblyAI's real-time rate is three times its batch rate ($0.015 versus $0.005 per minute), which may make batch processing with delayed captions more cost-effective for recorded content.
Domain specificity is overstated in marketing. Custom vocabulary helps with proper nouns and jargon, but it does not fix fundamental accuracy issues with heavily accented speech or poor audio. For medical or legal transcription, expecting 99% accuracy is unrealistic. Expect 92% to 98% depending on audio quality and speaker clarity. Plan for human review on high-stakes content.
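Privacy and data residency come with operational friction. If you require on-premise deployment, hosted APIs fall out of contention and Whisper's open-source model becomes the only viable option among the providers covered here, but it requires infrastructure investment and operational expertise. If EU-only processing is enough, Deepgram and AssemblyAI both offer EU-based processing, though it may require an enterprise contract.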
Making the choice: Decision tree by use case
If you are building a live-caption product for events or calls: Deepgram is the practical choice. Its real-time API is mature, pricing is transparent, and the free tier lets you ship a proof-of-concept without spending. Alternative: AssemblyAI if you need superior accuracy and compliance certifications are non-negotiable.
If you are processing pre-recorded audio (podcasts, meetings, support calls) at high volume (1+ million minutes annually): Whisper API by cost alone, unless you need speaker diarization or domain-specific accuracy, in which case evaluate AssemblyAI's total cost including add-ons.
If you are a startup evaluating transcription for the first time: Start with Deepgram's free tier (12.5 hours monthly). It costs nothing, includes real-time streaming, and requires no credit card. Once you hit the ceiling, the decision to upgrade becomes clearer based on your actual usage pattern.
If you are in health-tech, legal-tech, or financial services and need compliance certifications: AssemblyAI or Google Cloud Speech-to-Text. AssemblyAI has deeper compliance pedigree. Do not optimize for price; optimize for audit readiness.
If you have sensitive audio data and cannot send it to external APIs: Whisper (open-source, self-hosted) is the only viable choice. Budget for infrastructure, GPU costs, and engineering time. The per-minute API cost savings may evaporate against operational overhead unless you process massive volume.
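The decision tree above can be written down as a small function, which is useful for documenting why a team chose a provider. The order in which the constraints are checked (privacy first, then compliance, then real-time, then volume) is an assumption of this sketch, as are the class and field names; adjust the priority to match your own constraints.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    must_stay_on_premise: bool = False   # audio cannot be sent to external APIs
    needs_compliance: bool = False       # health-tech, legal-tech, financial services
    needs_realtime: bool = False         # live captions for events or calls
    annual_minutes: int = 0              # expected pre-recorded volume


def recommend(req: Requirements) -> str:
    """Map requirements to the provider suggested in the list above."""
    if req.must_stay_on_premise:
        return "Whisper (open-source, self-hosted)"
    if req.needs_compliance:
        return "AssemblyAI"
    if req.needs_realtime:
        return "Deepgram"
    if req.annual_minutes >= 1_000_000:
        return "Whisper API"
    # Default for first-time evaluation at modest volume.
    return "Deepgram free tier"


if __name__ == "__main__":
    print(recommend(Requirements(needs_realtime=True)))
    print(recommend(Requirements(annual_minutes=2_000_000)))
    print(recommend(Requirements(needs_compliance=True, needs_realtime=True)))
```

Hard constraints are checked before cost because they eliminate providers outright, while volume only changes which of the remaining options is cheapest.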
The transcription API market in 2026 is mature and competitive. There is no universally "best" choice. Whisper wins on marginal cost, Deepgram on free-tier value and real-time performance, and AssemblyAI on accuracy and compliance. Evaluate against your actual constraints: real-time or batch, budget, compliance needs, audio domain, and expected volume. The wrong choice costs 2% to 10% of revenue. The right choice is the one that maps your use case to pricing structure without forcing you to subsidize unused features.
This article was originally published on AI Glimpse.