DEV Community: Piotr

Speech-to-Text API Comparison: Whisper API Options in 2026

Piotr — Tue, 09 Jun 2026 13:26:14 +0000

You need speech-to-text in your app. Whisper Large V3 keeps showing up as the answer - 99 languages, solid accuracy, MIT license. The model itself is settled science. What isn't settled is where you run it.

OpenAI hosts it at $0.36/hour. Groq runs a turbo variant for $0.02/hour. Deepgram built their own model that beats Whisper on noisy audio. AssemblyAI bundles diarization and sentiment analysis on top. deAPI transcribes directly from YouTube URLs for $0.021/hour. And you can always self-host the thing on your own GPU.

This article compares all six options on the metrics that actually drive the decision: price per hour of audio, speed, features you get out of the box, and the integration quirks nobody mentions until you're knee-deep in code.

The pricing table you came here for

Every price below is list rate as of June 2026. Enterprise discounts, volume tiers, and committed-use agreements can drop these 30-70% - but most developers reading this aren't negotiating enterprise contracts.

Provider	Model	Price/hour	Billing model
OpenAI	Whisper large-v3	$0.36	Per minute ($0.006/min)
Groq	Whisper large-v3-turbo	~$0.02	Per hour
Deepgram	Nova-3	$0.26 (batch) / $0.46 (stream)	Per minute
AssemblyAI	Universal-2	$0.12 (Nano) / $0.75 (Best)	Per minute
deAPI	Whisper large-v3	$0.021	Per hour of audio
Self-hosted	Whisper large-v3	$0.05-0.15 (GPU cost)	Your infrastructure

The spread between OpenAI and the cheapest Whisper hosts is roughly 17x ($0.36 vs. about $0.02 per hour), and it gets wider if you count AssemblyAI's Best tier at $0.75. Four of the six options run the same Whisper architecture at radically different price tags. The difference comes from hardware (consumer GPUs vs. cloud A100s), billing granularity, and what's bundled in.

What each option actually gives you

OpenAI Whisper API

Most developers start here. Upload a file, get a transcript - the SDK and docs have been battle-tested for years, and Stack Overflow covers every edge case.

The simplicity has a ceiling, though. Streaming and speaker diarization don't exist. The 25 MB file size cap forces you to chunk long recordings, then stitch transcripts back together on your side. Processing speed sits around 45-60 seconds per hour of audio.
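The chunking half of that chunk-and-stitch step can be sketched in a few lines. This is a minimal illustration, assuming constant-bitrate audio; `plan_chunks`, the 24 MB budget, and the 2-second overlap are illustrative choices, not part of any provider's SDK.

```python
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # the 25 MB cap mentioned above


def plan_chunks(duration_s: float, bitrate_kbps: float,
                overlap_s: float = 2.0,
                budget_bytes: int = 24 * 1024 * 1024) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds so each chunk fits the budget.

    Assumes constant-bitrate audio. The budget sits below the cap to leave
    headroom for container overhead. The overlap means a word cut at one
    boundary appears whole in the neighbouring chunk, which helps stitching.
    """
    bytes_per_second = bitrate_kbps * 1000 / 8
    max_chunk_s = budget_bytes / bytes_per_second
    if max_chunk_s <= overlap_s:
        raise ValueError("bitrate too high for the chosen overlap")
    step = max_chunk_s - overlap_s
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        start += step
    return chunks
```

With these defaults, one hour of 128 kbps audio splits into three overlapping chunks. Cutting the audio at those offsets and de-duplicating the overlapping words when stitching are separate steps not shown here.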

At $0.36/hour, OpenAI charges 17x more than the cheapest hosted alternative. That gap is invisible when you're transcribing a few test files. Cross 100 hours per month and it's $36 that could be $2.10 on deAPI.

The sweet spot: quick integration, prototyping, and teams already deep in the OpenAI ecosystem who value familiarity over cost.

Groq Whisper

Groq runs Whisper large-v3-turbo on custom LPU hardware. One hour of audio transcribes in 8-12 seconds. Price matches the speed: ~$0.02/hour.

You hit the same limits as with OpenAI (no streaming, no diarization, a 25 MB file cap), plus Groq adds its own wrinkle: availability drops during peak demand, and the free tier rate limits are tight enough to block serious testing.
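For batch pipelines, the usual way to ride out rate limits and brief availability dips is retry with exponential backoff. A minimal sketch, assuming a hypothetical zero-argument callable that raises when a request fails; `with_backoff` and the delay values are placeholders, not part of Groq's SDK.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T], max_attempts: int = 5,
                 base_delay_s: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying failures with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            # Real code should catch only rate-limit and server errors here.
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # 1s, 2s, 4s, ... plus up to 25% jitter so workers don't retry in lockstep
            delay = base_delay_s * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.25))
```

The `sleep` parameter exists so the wait can be swapped out in tests; in production the default `time.sleep` is fine.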

Where it shines: batch pipelines that need to chew through hundreds of hours overnight. Podcast archives, meeting backlogs, content indexing - anything where latency to the end user doesn't matter.

Deepgram Nova-3

Deepgram didn't just host Whisper - they built Nova-3 from scratch. On clean English, it matches Whisper. On noisy, accented, and phone-quality audio, it pulls ahead: ~9.4% WER on telephony vs. Whisper's ~12.8%.

Batch transcription costs $0.26/hour. Streaming runs $0.46/hour but delivers sub-300ms latency with real-time diarization. The $200 free credit on signup covers a full evaluation.

AssemblyAI

AssemblyAI sells the layer above transcription. Universal-2 handles 99 languages with diarization, and "Audio Intelligence" add-ons let you bolt on sentiment analysis, PII redaction, topic detection, and summarization per job.

Read the pricing carefully, though. Nano ($0.12/hour) covers basic transcription. Best ($0.75/hour) improves accuracy. Each add-on stacks $0.02-0.08/hour extra, so a fully-featured pipeline can double the headline rate before you notice.
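To see how the stacking plays out, here is a small worked calculation. The base rates come from the pricing table; the individual add-on prices are hypothetical values picked from inside the quoted $0.02-0.08 range, so treat the output as an illustration rather than a quote.

```python
NANO_RATE = 0.12  # $/hour, from the pricing table
BEST_RATE = 0.75  # $/hour, from the pricing table

# Hypothetical per-add-on rates, chosen from inside the quoted $0.02-0.08 range.
ADDONS = {
    "sentiment": 0.02,
    "pii_redaction": 0.08,
    "topic_detection": 0.03,
    "summarization": 0.03,
}


def effective_rate(base_rate: float, addons: list[str]) -> float:
    """Hourly rate once the selected add-ons are stacked on the base tier."""
    return round(base_rate + sum(ADDONS[name] for name in addons), 4)


nano_full = effective_rate(NANO_RATE, list(ADDONS))  # 0.28
best_full = effective_rate(BEST_RATE, list(ADDONS))  # 0.91
```

Under these assumed add-on prices, the Nano tier more than doubles (from $0.12 to $0.28 per hour), while the same add-ons raise Best by about a fifth. The doubling risk is mostly a low-tier effect.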

The $50 credit plus 185 free hours gives you real runway for testing. The fit: meeting assistants, compliance workflows, content analysis platforms - anything where raw text isn't enough and you need structured intelligence on top.

deAPI

deAPI runs Whisper Large V3 on a distributed network of consumer-grade GPUs. The price reflects that architecture: $0.021 per hour of audio, which makes it the cheapest hosted Whisper endpoint that runs the full (non-turbo) model.

The standout feature is direct URL transcription. Pass a YouTube, Twitch, TikTok, Kick, or X URL - including X Spaces - and the API handles audio extraction server-side. You skip the yt-dlp → ffmpeg → format conversion → chunking pipeline entirely, which saves more engineering time than the pricing difference suggests.

import requests

response = requests.post(
    "https://api.deapi.ai/api/v1/client/transcribe",
    headers={"Authorization": "Bearer YOUR_KEY"},
    data={
        "source_url": "https://www.youtube.com/watch?v=VIDEO_ID",
        "model": "WhisperLargeV3",
        "include_ts": True
    }
)
request_id = response.json()["data"]["request_id"]

One request in Python. The URL goes in, the API answers with a request_id for the job, and the finished transcript comes back with timestamps. Compare that to the typical Whisper pipeline: download video with yt-dlp, extract audio with ffmpeg, convert to the right format, chunk if over 25 MB, upload, transcribe, stitch chunks back together. deAPI collapses that entire chain into a single submission.

The API also supports OpenAI SDK compatibility, so migrating from OpenAI's Whisper endpoint means changing base_url and api_key while keeping your existing parsing logic intact.
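That migration can be sketched as a settings switch. The OpenAI base URL below is the SDK's standard endpoint. The deAPI base URL is a deliberate placeholder, because this article does not state it. `client_settings` and the environment variable names are illustrative as well.

```python
import os

# Only base_url and api_key differ between providers; request code and
# response parsing stay the same.
PROVIDERS = {
    "openai": {
        "base_url": "https://api.openai.com/v1",
        "api_key_env": "OPENAI_API_KEY",
    },
    "deapi": {
        # Placeholder, NOT a real endpoint: take the OpenAI-compatible URL from deAPI's docs.
        "base_url": "https://DEAPI_OPENAI_COMPATIBLE_BASE_URL/v1",
        "api_key_env": "DEAPI_API_KEY",
    },
}


def client_settings(provider: str) -> dict[str, str]:
    """Keyword arguments for an OpenAI-SDK client pointed at `provider`."""
    cfg = PROVIDERS[provider]
    return {
        "base_url": cfg["base_url"],
        "api_key": os.environ.get(cfg["api_key_env"], ""),
    }


# Usage with the official SDK (pip install openai), left commented so the
# sketch runs without credentials or network access:
# from openai import OpenAI
# client = OpenAI(**client_settings("deapi"))
# with open("meeting.mp3", "rb") as f:
#     result = client.audio.transcriptions.create(model=MODEL_NAME, file=f)
# print(result.text)  # MODEL_NAME is provider-specific
```

Keeping both providers behind one function also makes it easy to A/B the same audio through each before committing to the switch.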

The initial_prompt parameter works the same way as OpenAI's - a text snippet that conditions the model toward specific terminology, proper nouns, and formatting conventions.

Limitations: Batch only - no streaming, no diarization. A 10-minute video typically processes in under 30 seconds.

The migration path from OpenAI is the easiest of any provider here: swap two lines of code, keep your parsing logic, cut your bill by 17x.

Self-hosted (faster-whisper / whisper.cpp)

Running Whisper on your own GPU eliminates per-minute costs entirely. faster-whisper delivers 4-10x speedups over the original implementation; whisper.cpp runs on CPU if you're patient.

On a cloud L4 instance, the GPU bill works out to roughly $0.05-0.15 per hour of audio, depending on provider and how busy you keep the card. At high volume, the marginal cost of one more hour approaches zero because you're paying for the GPU regardless of utilization.

The bill you don't see is engineering time. GPU provisioning, chunking logic for long recordings, hallucination mitigation on silent segments, deployment maintenance - each one is a small project that never fully goes away. Diarization means bolting on pyannote as a separate pipeline.
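As one example of that hidden work, a common mitigation for hallucinated text on silence is to drop segments the model itself flags as probably non-speech. A minimal sketch: the `Segment` dataclass is a stand-in that carries the `no_speech_prob` and `avg_logprob` fields Whisper implementations such as faster-whisper report per segment. The thresholds follow the reference Whisper defaults (0.6 and -1.0) and should be tuned on your own audio.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float
    end: float
    text: str
    no_speech_prob: float  # model's estimate that the window holds no speech
    avg_logprob: float     # mean token log-probability (closer to 0 = more confident)


def drop_likely_hallucinations(segments, max_no_speech=0.6, min_logprob=-1.0):
    """Keep a segment unless it looks like silence AND was decoded with low confidence."""
    return [
        s for s in segments
        if not (s.no_speech_prob > max_no_speech and s.avg_logprob < min_logprob)
    ]


segments = [
    Segment(0.0, 4.0, "Welcome to the show.", 0.02, -0.21),
    Segment(4.0, 9.0, "Thanks for watching!", 0.91, -1.40),  # decoded over silence
    Segment(9.0, 12.0, "Let's get started.", 0.70, -0.30),   # noisy but confident
]
kept = drop_likely_hallucinations(segments)
```

Requiring both conditions keeps confident speech over background noise, as in the third segment. faster-whisper also accepts `vad_filter=True` on `transcribe`, which strips silence before decoding and complements a post-filter like this one.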

Makes sense at 1000+ hours/month, or in air-gapped environments where API calls aren't an option.

Feature comparison

Feature	OpenAI	Groq	Deepgram	AssemblyAI	deAPI	Self-hosted
Streaming	✗	✗	✓	✓	✗	DIY
Diarization	✗	✗	✓	✓	✗	via pyannote
Languages	99	99	40	99	99	99
URL transcription	✗	✗	✗	✗	✓	✗
Max file size	25 MB	25 MB	No limit	No limit	50 MB (URL: no limit)	Your GPU memory
Timestamps	✓	✓	✓	✓	✓	✓
Translation (→EN)	✓	✓	✗	✗	✓	✓
initial_prompt	✓	✓	✗	Word boost	✓	✓
OpenAI SDK compatible	Native	✓	✗	✗	✓	✗
Free tier	None	Rate-limited	$200 credit	$50 + 185 hrs	$5 credit (~237 hrs)	GPU cost

Two camps emerge. Deepgram and AssemblyAI compete on features - streaming, diarization, audio intelligence built in. OpenAI, Groq, and deAPI compete on Whisper compatibility and simplicity. Self-hosting sits in its own lane: maximum control, minimum hand-holding.

The decision axis is straightforward. Voice assistants and live captions need Deepgram's streaming. Meeting recordings with speaker labels need AssemblyAI. YouTube backlogs and batch workloads need deAPI or Groq at a fraction of the cost.

Cost at scale: 500 hours per month

Abstract pricing means nothing without volume context. Here's what 500 hours of monthly transcription costs on each platform:

Provider	Monthly cost (500 hrs)
OpenAI	$180.00
Deepgram (batch)	$130.00
AssemblyAI (Best)	$375.00
AssemblyAI (Nano)	$60.00
Groq	~$10.00
deAPI	$10.52
Self-hosted (L4 GPU)	$25-75 (infra)

The difference between $180/month (OpenAI) and $10.52/month (deAPI) buys you a lot of other API calls.
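To rerun the table for your own volume, a small calculator is enough. The rates below are the list prices from the pricing table, and the function names are illustrative. At the rounded $0.021 rate, deAPI comes to $10.50 for 500 hours; the table's $10.52 presumably reflects an unrounded rate.

```python
# List rates in $ per hour of audio, taken from the pricing table.
RATES = {
    "OpenAI": 0.36,
    "Deepgram (batch)": 0.26,
    "AssemblyAI (Best)": 0.75,
    "AssemblyAI (Nano)": 0.12,
    "Groq": 0.02,
    "deAPI": 0.021,
}


def monthly_cost(provider: str, hours: float) -> float:
    """Cost in dollars for `hours` of audio at the provider's list rate."""
    return round(RATES[provider] * hours, 2)


def cheapest_first(hours: float) -> list[tuple[str, float]]:
    """All providers sorted by monthly cost, cheapest first."""
    return sorted(((p, monthly_cost(p, hours)) for p in RATES), key=lambda x: x[1])


for provider, cost in cheapest_first(500):
    print(f"{provider:<20} ${cost:>8.2f}")
```

Self-hosting is left out because its cost depends on utilization rather than a per-hour list rate.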

When Whisper isn't the right model

Whisper excels at batch transcription of clean-to-moderate audio across dozens of languages. It starts falling short in specific scenarios.

Phone and call-center audio recorded at 8 kHz exposes Whisper's weakness. Deepgram Nova-3 was built for this - their WER on telephony audio is 9.4% vs. Whisper's 12.8%. If your audio comes from phone lines, Deepgram or Speechmatics will produce measurably better output.

Real-time voice applications need sub-300ms latency. Whisper is batch-only across every hosted provider. Deepgram's streaming endpoint and AssemblyAI's Universal-Streaming are the viable options here.

Heavy accent and code-switching scenarios - speakers mixing languages mid-sentence - benefit from models trained specifically for that pattern. Speechmatics and Deepgram handle this better than vanilla Whisper.

For everything else - podcast transcription, YouTube content, meeting recordings, multilingual batch processing - Whisper Large V3 through any of the hosted options above will get the job done.

The bottom line

Six options, four of them running Whisper, wildly different trade-offs.

Real-time streaming or built-in speaker labels rule out every hosted Whisper option - go with Deepgram or AssemblyAI. If your input is YouTube, Twitch, or X Spaces URLs, deAPI is the only provider here that skips the download-extract-upload pipeline. And if cost drives the decision, deAPI ($0.021/hr, full large-v3) and Groq (~$0.02/hr, the turbo variant) run Whisper at roughly one-seventeenth of OpenAI's price.

The transcription quality is comparable across the board. What separates these providers is the engineering you do (or don't have to do) around it.

Prices verified June 2026. All platforms update pricing regularly - check their docs for current rates.

Try deAPI: app.deapi.ai - $5 free credits on signup, no credit card. The /transcribe endpoint accepts YouTube, Twitch, TikTok, Kick, and X URLs directly.

Replicate vs deAPI: Price Comparison for AI Inference (2026)

Piotr — Wed, 03 Jun 2026 15:09:26 +0000

You're building an app that generates images, transcribes audio, or synthesizes speech. Two API platforms keep showing up in your research: Replicate and deAPI. They run many of the same open-source models and charge per use.

This article compares actual costs across four common tasks. Every price comes from the official pricing page or API response.

How each platform bills you

The billing model is the first difference, and it affects everything downstream.

Replicate uses two pricing systems. "Official models" (Flux, Whisper, Claude) have fixed per-unit prices - $0.003 per image, $0.09 per second of video. Community models bill by GPU time instead: you pick a hardware tier (T4 at $0.000225/sec through H100 at $0.001525/sec), and you pay for however long inference takes. That run time varies with input size, model load, and cold starts. (See Replicate's pricing page for current hardware rates.)

deAPI bills by task output. An image costs $0.00136, an hour of transcription costs $0.021, a million characters of speech cost $0.77 - regardless of what GPU runs it behind the scenes. The /price endpoint calculates exact cost before you submit a job.

This distinction matters most at scale. With time-based billing, the same request can cost different amounts depending on queue depth and cold start behavior. With task-based billing, the cost is deterministic.
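The difference is easy to see in numbers. A minimal sketch using the T4 rate and the per-image price quoted above. The 7-second inference time and the overhead values are made up for illustration, and whether overhead such as cold starts is billed depends on the platform and deployment type, which is why it is a parameter here.

```python
T4_RATE_PER_SEC = 0.000225  # $/second, Replicate's quoted T4 rate


def gpu_time_cost(inference_s: float, billed_overhead_s: float = 0.0) -> float:
    """Time-based billing: cost tracks however many seconds end up billed."""
    return (inference_s + billed_overhead_s) * T4_RATE_PER_SEC


def task_cost(units: float, unit_price: float) -> float:
    """Task-based billing: cost depends only on the size of the output."""
    return units * unit_price


# The same hypothetical request, run three times with different billed overhead.
runs = [gpu_time_cost(7, overhead) for overhead in (0, 5, 20)]
flat = [task_cost(1, 0.00136) for _ in range(3)]

print([f"${c:.6f}" for c in runs])  # three different amounts
print([f"${c:.6f}" for c in flat])  # the same amount three times
```

The absolute numbers are tiny per request. The point is forecasting: with task-based billing, the bill is a function of what you asked for, not of how the run happened to go.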

Image generation: Flux Schnell

Both platforms run Flux Schnell, the fast 12B image model from Black Forest Labs.

	Replicate	deAPI
Price	$0.003/image	$0.00136/image (512x512, 4 steps)
Billing model	Per image (Official Model)	Per image (resolution x steps)
Max resolution	Model default	2048x2048
LoRA support	Community models	Yes (7 LoRAs available)

Cost for 1,000 images: Replicate $3.00 vs deAPI $1.36.

deAPI's pricing scales with resolution and step count, so a 1024x1024 image costs more than a 512x512 (about $0.0027 vs $0.00136). Replicate charges a flat $0.003 regardless of dimensions. For lower resolutions - which cover most prototyping and thumbnail workflows - deAPI is roughly 2x cheaper. At higher resolutions, the gap narrows.

deAPI also runs Flux.2 Klein 4B and Z-Image-Turbo INT8 as alternatives. Replicate has Flux Dev ($0.025/image) and Flux 1.1 Pro ($0.04/image) for higher quality output.

Transcription: Whisper Large V3

Both platforms offer Whisper Large V3 for speech-to-text.

	Replicate	deAPI
Price	~$0.0014/run (T4 GPU, ~7s avg)	$0.021/hour of audio
Billing model	GPU time (T4: $0.000225/sec)	Per hour of audio duration
Direct URL transcription	No (file upload only)	Yes (YouTube, Twitch, Kick, X, TikTok)
Max file size	50MB	50MB (URL: no limit)

The pricing comparison here depends entirely on how you use it.

Short clips (under 1 minute): Replicate's time-based billing works out to roughly $0.001-0.002 per clip because inference is fast. deAPI charges by audio duration, so a 30-second clip costs about $0.000175. deAPI wins on short content.

Long-form audio (1 hour podcast): On Replicate, you'd need to chunk the file and run multiple predictions. Each chunk takes 5-15 seconds of GPU time on a T4 ($0.000225/sec), plus cold start overhead. Total cost varies, but expect $0.15-0.50 depending on chunking strategy. deAPI charges a flat $0.021 for the same hour.
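A quick way to reason about the short-versus-long split is to ask how much audio deAPI's duration pricing buys for the price of one GPU-time run. The sketch below uses the two quoted rates. The function names are illustrative, and the break-even treats GPU seconds as fixed, which only holds for a single run on a short clip.

```python
T4_RATE_PER_SEC = 0.000225   # $/second of GPU time
DEAPI_RATE_PER_HOUR = 0.021  # $/hour of audio


def deapi_cost(audio_s: float) -> float:
    """Duration-based cost for a clip of `audio_s` seconds."""
    return audio_s / 3600 * DEAPI_RATE_PER_HOUR


def replicate_cost(gpu_s: float) -> float:
    """GPU-time cost for a run that bills `gpu_s` seconds on a T4."""
    return gpu_s * T4_RATE_PER_SEC


def break_even_audio_s(gpu_s: float) -> float:
    """Audio length at which the duration-based price equals one GPU-time run."""
    return replicate_cost(gpu_s) / DEAPI_RATE_PER_HOUR * 3600
```

For example, `deapi_cost(30)` reproduces the $0.000175 figure for a 30-second clip, and `break_even_audio_s(7)` comes out at 270 seconds: a 7-second T4 run costs as much as 4.5 minutes of audio on deAPI. Longer audio needs more GPU seconds, so the real crossover depends on how inference time grows with clip length.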

The URL feature is the real differentiator. deAPI transcribes directly from YouTube, Twitch, Kick, TikTok, and X URLs - including X Spaces. Paste a link, get text. On Replicate, you download the file first, then upload it - which means writing download logic, managing temporary storage, and handling cleanup.

For reference, OpenAI's Whisper API charges $0.36/hour. deAPI runs the same model at $0.021/hour - roughly 17x cheaper.

Text-to-speech: Kokoro

Both platforms run Kokoro, the lightweight 82M parameter TTS model.

	Replicate	deAPI
Price	~$0.0018/run (T4, ~9s avg)	$0.77/million characters
Billing model	GPU time	Per character
Voices	20+ (American, British English)	54+ voices, 8 languages
Voice cloning	No (Kokoro only)	Yes (via Qwen3 TTS)
Voice design	No	Yes (via Qwen3 TTS)
OpenAI SDK compatible	No	Yes

Cost for 10,000 characters (~8 minutes of speech): Replicate runs it in one prediction - roughly $0.0018. deAPI charges $0.0077.

On raw Kokoro pricing, Replicate is cheaper for single short runs. The T4's low hourly rate ($0.81/hr) makes lightweight models like Kokoro very affordable there.

But deAPI's TTS story extends beyond Kokoro. The same endpoint gives you Qwen3 TTS with voice cloning (upload a 5-15 second reference clip and generate speech in that voice) and voice design (describe a voice in text, generate speech with it). Replicate has separate community models for these features, each with different APIs and billing.

deAPI's OpenAI SDK compatibility also means migrating from OpenAI TTS ($15/million characters) takes two changed lines of code. Your existing response parsing stays intact.

Video generation

Video pricing is where the platforms diverge most.

	Replicate	deAPI
Model	Wan 2.1 I2V (WaveSpeed)	LTX-Video 13B / LTX-2.3 22B
Budget tier	$0.45 (Wan 2.1, 480p, 5s @ $0.09/sec)	~$0.0088 (LTX-Video 13B, 768x768, 4s max)
Quality tier	$1.25 (Wan 2.1, 720p, 5s @ $0.25/sec)	~$0.047 (LTX-2.3 22B, 768x768, 5s)
Clip length	Flexible	LTX-13B capped at 4s (120 frames @ 30fps); LTX-2.3 up to 10s
Audio sync	Model-dependent	Yes (LTX-2.3)
Image-to-video	Yes	Yes
Text-to-video	Yes	Yes

The models are different (Wan vs LTX), so this isn't a pure apples-to-apples comparison - and the resolutions don't line up exactly either (768x768 sits between 480p and 720p). Read it as a comparison of tiers: a budget model versus a quality model on each side. Replicate has a wider selection of video models, including proprietary options like Runway Gen-4.5 and Google Veo 3.1. deAPI focuses on open-source models at lower price points.

For developers who need basic text-to-video or image-to-video functionality, the cost difference is dramatic. A 5-second clip on Replicate (Wan 2.1, 480p) costs $0.45. A comparable clip on deAPI (LTX-Video 13B at 768x768, its 4-second maximum) costs roughly $0.0088 - about 50x cheaper. Drop to 512x512 and it falls to ~$0.0056. Note that LTX-Video 13B runs at a fixed 30fps and tops out at 120 frames, so 4 seconds is its ceiling per clip; for longer or audio-synced clips you step up to LTX-2.3 22B (~$0.047 for 5s at 768x768).
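Because the two budget clips differ in length (5 s vs 4 s), it helps to normalize to price per second. A small worked calculation using only the figures quoted above; the helper names are illustrative.

```python
def max_clip_seconds(max_frames: int, fps: int) -> float:
    """Longest clip a frame-capped model can produce at a fixed frame rate."""
    return max_frames / fps


def price_per_second(clip_price: float, clip_seconds: float) -> float:
    """Clip price normalized by clip length."""
    return clip_price / clip_seconds


# (price in $, clip length in seconds), from the budget-tier row of the table.
BUDGET_TIER = {
    "Replicate (Wan 2.1, 480p)": (0.45, 5.0),
    "deAPI (LTX-Video 13B, 768x768)": (0.0088, 4.0),
}

replicate_rate = price_per_second(*BUDGET_TIER["Replicate (Wan 2.1, 480p)"])
deapi_rate = price_per_second(*BUDGET_TIER["deAPI (LTX-Video 13B, 768x768)"])
ratio = replicate_rate / deapi_rate
```

`max_clip_seconds(120, 30)` gives the 4-second ceiling. Per second of video the gap is about 41x ($0.09 vs $0.0022), slightly smaller than the roughly 50x per-clip figure because the deAPI clip is one second shorter.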

Replicate also offers the Wan open-source models as community deployments at lower prices, but they bill by GPU time - so cost varies with inference duration and hardware choice.

What Replicate does that deAPI doesn't

LLMs. Replicate runs Claude, DeepSeek, Llama, and other language models. deAPI doesn't serve LLMs at all - it focuses on media generation, transcription, and embeddings. If you need chat completions alongside image generation, Replicate (or a multi-provider setup) is your path.

Custom model deployment. Replicate lets you package and deploy your own models using Cog. You get a dedicated endpoint, auto-scaling, and full control over the model code. deAPI runs a fixed catalog of models.

Broader model catalog. Replicate hosts thousands of community-contributed models. If you need a niche model - a specific ControlNet variant, a fine-tuned Stable Diffusion checkpoint, a custom video model - Replicate likely has it.

Proprietary video models. Runway Gen-4.5, Google Veo 3.1, Kling 3.0 - these are only available on platforms like Replicate.

What deAPI does that Replicate doesn't

Direct URL transcription. Paste a YouTube, Twitch, TikTok, Kick, or X link. Get text back. This eliminates the download-upload-cleanup pipeline that most other transcription APIs require.

The /price endpoint is worth mentioning separately. It calculates exact cost before you submit, so your billing is deterministic - no variance from GPU warm-up time or queue depth.

OpenAI SDK compatibility lets you point your existing OpenAI code at deAPI by changing base_url and api_key. Images, TTS, transcription, embeddings, and video generation all follow the standard OpenAI response format.

On the audio side, deAPI bundles voice cloning (upload a 5-15 second reference clip) and voice design (describe a voice in text) into the same TTS endpoint. Replicate requires separate community models for each.

ACE-Step 1.5 handles music generation with lyrics, tempo, key, and style control. Replicate has community music models, but they're scattered across different maintainers with varying APIs.

The cost summary

Prices per task (the image row covers 1,000 images; the other rows cover a single job):

Task	Replicate	deAPI	Difference
Image (Flux Schnell, 512x512)	$3.00	$1.36	deAPI 2.2x cheaper
Transcription (1hr audio)	~$0.15-0.50	$0.021	deAPI 7-24x cheaper
TTS (10K chars, Kokoro)	~$0.0018	$0.0077	Replicate 4x cheaper
Video (budget tier, ~5s)	$0.45	~$0.0088	deAPI ~50x cheaper

TTS is the one category where Replicate's time-based billing on cheap hardware (T4) undercuts deAPI's per-character pricing. For everything else, deAPI's decentralized GPU network produces significantly lower costs.

When to use which

Replicate makes sense if your stack needs LLMs alongside media models, or if you want to deploy custom models through Cog.

deAPI fits better when cost drives the decision, when you're transcribing from URLs, or when your app is purely media generation without LLM chat.

The two aren't mutually exclusive. You can run a Replicate client for LLMs such as Claude or Llama and a deAPI client for images, audio, and video; on the deAPI side, OpenAI SDK compatibility means the only changes are base_url and api_key.

Try it

Replicate: replicate.com - pay-as-you-go, no minimum
deAPI: app.deapi.ai - $5 free credits on signup, no credit card

Prices verified as of June 2026. Both platforms update pricing regularly - check their docs for current rates.