synthorai

Posted on Jun 26 • Originally published at synthorai.io

What a Simple Transcription Test Can and Can't Tell You


Synthorai now transcribes audio: thirteen models in two families, all behind one endpoint.

That one endpoint hides a lot of work, because natively these models barely resemble each other. whisper-1 takes a multipart file upload and returns {text}. gpt-4o-transcribe uses the same upload but adds token usage. Gemini is not a transcription API at all: you base64-encode the audio into a JSON generateContent request and dig the transcript out of candidates[0].content.parts[].text. ByteDance's seed-asr speaks the BytePlus AUC protocol, and Google's chirp models are Cloud Speech-to-Text recognizers reached with OAuth.

Different endpoints, different auth, different response shapes, one more integration each. Through the gateway it is one OpenAI-compatible call: swap gpt-4o-mini-transcribe for gemini-2.5-flash-lite or seed-asr-bigmodel, and nothing else in your code changes.

The call uses the OpenAI-compatible transcription endpoint, so it is a drop-in replacement if you already use Whisper:

curl https://synthorai.io/v1/audio/transcriptions \
  -H "Authorization: Bearer $SYNTHORAI_API_KEY" \
  -F file=@meeting.mp3 \
  -F model=gemini-2.5-flash-lite

from openai import OpenAI

client = OpenAI(base_url="https://synthorai.io/v1", api_key="sk-syn-...")

with open("meeting.mp3", "rb") as f:
    result = client.audio.transcriptions.create(model="gemini-2.5-flash-lite", file=f)

print(result.text)

The transcript comes back in the text field, and the billed cost is in the x-total-cost-usd response header.
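
A minimal sketch of reading that header from Python. It assumes the OpenAI SDK's with_raw_response accessor, which exposes the HTTP headers alongside the parsed body, behaves the same against the gateway as it does against OpenAI. The helper names parse_cost_header and transcribe_with_cost are illustrative and not part of any SDK.

```python
from decimal import Decimal, InvalidOperation


def parse_cost_header(headers):
    """Return the billed cost in USD as a Decimal, or None if absent or malformed."""
    raw = headers.get("x-total-cost-usd")
    if raw is None:
        return None
    try:
        return Decimal(raw.strip())
    except InvalidOperation:
        return None


def transcribe_with_cost(client, path, model):
    """Transcribe one file and return (transcript text, billed cost in USD).

    `client` is an openai.OpenAI instance whose base_url points at the gateway.
    """
    with open(path, "rb") as f:
        # with_raw_response keeps the HTTP response so the headers stay readable.
        raw = client.audio.transcriptions.with_raw_response.create(model=model, file=f)
    return raw.parse().text, parse_cost_header(raw.headers)


# Usage (needs network access and a key):
#   client = OpenAI(base_url="https://synthorai.io/v1",
#                   api_key=os.environ["SYNTHORAI_API_KEY"])
#   text, cost = transcribe_with_cost(client, "meeting.mp3", "gemini-2.5-flash-lite")
```

Decimal is used rather than float so that summing many small charges does not pick up rounding noise.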

We ran all thirteen through the same simple test, and the design of that test shapes every number below.

What this test is, and isn't

We synthesized everyday passages with no proper nouns (a morning, the weather, a trip to the market) using a standard text-to-speech voice in each of the world's five most-spoken languages, then transcribed each clip through all thirteen models. Each clip runs about 12 to 15 seconds, roughly 40 words of normal-paced speech with no long silences, encoded as 16 kHz mono 16-bit PCM WAV (256 kbps, about 2 MB a minute). Because the audio is generated from text, that text is the ground truth and the durations are exact.
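
The bitrate and file-size figures follow directly from the encoding parameters. A quick check, using decimal units (1 kbit = 1000 bits, 1 MB = 1,000,000 bytes) and ignoring the small WAV header:

```python
def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM bitrate in kilobits per second (1 kbit = 1000 bits)."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def megabytes_per_minute(bitrate_kbps):
    """Size of one minute of audio in megabytes (1 MB = 1,000,000 bytes)."""
    return bitrate_kbps * 1000 / 8 * 60 / 1_000_000


kbps = pcm_bitrate_kbps(16_000, 16, 1)
print(kbps)                        # 256.0
print(megabytes_per_minute(kbps))  # 1.92
```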

This is a deliberately easy case: clean, scripted, single-speaker audio with no accents, noise, or jargon. That makes it good for the things that do not depend on how hard the audio is: cost, latency, which languages a model accepts at all, and whether it can stream. Apart from latency, which varies from run to run, those are stable facts.

It is not a quality benchmark. Real recordings with accents, background noise, domain vocabulary, overlapping speakers, and an hour of runtime separate these models in ways clean speech never will, and nothing here predicts that. Read the accuracy numbers as a floor check, not a ranking, and treat the cost, coverage, and streaming results as the baseline you can actually rely on.

Two model types, three request modes

The thirteen models come in two kinds:

Native multimodal models (six, Google's Gemini family: gemini-2.5-flash-lite, gemini-3.1-flash-lite-preview, gemini-2.5-flash, gemini-3-flash-preview, gemini-3.5-flash, gemini-2.5-pro). General audio-and-text models that transcribe as a side effect of being multimodal.
Dedicated ASR models (seven: OpenAI's whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe; ByteDance's seed-asr-bigmodel; Alibaba's qwen3-asr-flash; Google's chirp-2 and chirp-3). Purpose-built for speech.

And three ways to send the audio:

File in, batch out: upload a complete recording, get the full transcript in one response. Every model supports it.
File in, streamed text out: the same upload, but the transcript streams back over SSE as it is produced. Some models support this; others are batch-only.
Audio stream in, text stream out: real-time recognition of a live mic or call. In development, not yet available, so everything below is the first two modes.

How transcription is billed

Two billing shapes. Per audio-minute (whisper-1, seed-asr, qwen3-asr-flash, the Chirp models): you pay for the duration of the recording, whatever is in it. Per token (the gpt-4o and Gemini models): audio tokenizes at a flat rate, and you pay for those input tokens plus the transcript output tokens, so silence, which produces no transcript, is cheaper than dense speech.

The per-token shape has a trap: the listed input rate is for text, but audio bills higher (gpt-4o-mini-transcribe lists $1.25/M input but bills audio at $3/M). Estimate from the text rate and you undershoot. The gateway returns the real charge in an x-total-cost-usd header, so read that rather than guessing from a price page.
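
To see how far a text-rate estimate undershoots, here is the arithmetic for the gpt-4o-mini-transcribe rates quoted above. The token count is a made-up round number for illustration only; the real count depends on the provider's tokenizer and the clip length, and output tokens are left out.

```python
TEXT_INPUT_RATE = 1.25   # USD per million tokens, the listed input rate
AUDIO_INPUT_RATE = 3.00  # USD per million tokens, what audio input bills at


def input_cost_usd(tokens, rate_per_million):
    """Input charge in USD for a given token count and per-million rate."""
    return tokens * rate_per_million / 1_000_000


# Hypothetical token count, chosen only to make the ratio visible.
audio_tokens = 10_000

guessed = input_cost_usd(audio_tokens, TEXT_INPUT_RATE)
billed = input_cost_usd(audio_tokens, AUDIO_INPUT_RATE)

print(f"guessed ${guessed:.4f}, billed ${billed:.4f}")
print(f"the guess covers {guessed / billed:.0%} of the real input charge")  # 42%
```

The ratio is independent of the token count: an estimate built on the text rate covers only 1.25 / 3, or about 42%, of the audio input charge.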

Cost

This is the part the test pins down cleanly, and it varies the most. Cost per minute, from the billed header:

Model	Type	Cost / min	Latency	Streams
`gemini-2.5-flash-lite`	multimodal	$0.0006	≈4s	chunks
`gemini-3.1-flash-lite-preview`	multimodal	$0.0016	≈3s	chunks
`seed-asr-bigmodel`	dedicated	$0.0020	≈10s	no
`qwen3-asr-flash`	dedicated	$0.0021	≈3s	no
`gemini-2.5-flash`	multimodal	$0.0026	≈2s	chunks
`gpt-4o-mini-transcribe`	dedicated	$0.0031	≈3s	token-by-token
`gemini-3-flash-preview`	multimodal	$0.0035	≈4s	chunks
`whisper-1`	dedicated	$0.0060	≈4s	no
`gpt-4o-transcribe`	dedicated	$0.0062	≈2s	token-by-token
`gemini-2.5-pro`	multimodal	$0.0082	≈5s	chunks
`chirp-2`	dedicated	$0.0164	≈3s	no
`chirp-3`	dedicated	$0.0164	≈4s	no
`gemini-3.5-flash`	multimodal	$0.0178	≈5s	chunks

The spread is about 30x, from gemini-2.5-flash-lite at $0.0006 a minute to gemini-3.5-flash at $0.0178. Two things are worth noticing, both about price rather than quality. The single cheapest model is a Gemini flash-lite, at under a third of the price of the cheapest dedicated ASR model (seed-asr-bigmodel at $0.0020). And within the Gemini family the price had no relationship to accuracy on this test, so a bigger, pricier model is not automatically the safer choice; it is a reason to benchmark the cheap one on your own audio before paying for the large one.
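
Both ratios can be recomputed from the table. A short sketch, with the per-minute costs copied from the rows above:

```python
# Cost per audio-minute in USD, copied from the cost table.
COST_PER_MIN = {
    "gemini-2.5-flash-lite": 0.0006,
    "gemini-3.1-flash-lite-preview": 0.0016,
    "seed-asr-bigmodel": 0.0020,
    "qwen3-asr-flash": 0.0021,
    "gemini-2.5-flash": 0.0026,
    "gpt-4o-mini-transcribe": 0.0031,
    "gemini-3-flash-preview": 0.0035,
    "whisper-1": 0.0060,
    "gpt-4o-transcribe": 0.0062,
    "gemini-2.5-pro": 0.0082,
    "chirp-2": 0.0164,
    "chirp-3": 0.0164,
    "gemini-3.5-flash": 0.0178,
}
DEDICATED = {
    "seed-asr-bigmodel", "qwen3-asr-flash", "gpt-4o-mini-transcribe",
    "whisper-1", "gpt-4o-transcribe", "chirp-2", "chirp-3",
}

cheapest = min(COST_PER_MIN, key=COST_PER_MIN.get)
priciest = max(COST_PER_MIN, key=COST_PER_MIN.get)
spread = COST_PER_MIN[priciest] / COST_PER_MIN[cheapest]

cheapest_dedicated = min(DEDICATED, key=COST_PER_MIN.get)
ratio = COST_PER_MIN[cheapest_dedicated] / COST_PER_MIN[cheapest]

print(f"{cheapest} -> {priciest}: {spread:.1f}x")                   # 29.7x
print(f"{cheapest_dedicated} costs {ratio:.1f}x the cheapest model")  # 3.3x
```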

How these numbers move with your own files depends on the billing shape. The per-minute models (whisper-1, seed-asr, qwen3-asr-flash, the Chirps) bill by duration alone, so the rate is portable: ten minutes of audio costs ten times the per-minute figure, whatever the format or content.

The per-token models (the gpt-4o and Gemini rows) scale their input cost with duration, not file size, because the provider resamples the audio before tokenizing. A heavy 320 kbps MP3 and our lean 16 kHz WAV of the same words tokenize to about the same cost, so compressing your files saves storage, not transcription spend. What does move a per-token bill is how much is actually spoken: our clips are normal-paced with no dead air, so audio that is denser or sparser than that bills a little more or less on the output tokens. The x-total-cost-usd header is the ground truth in every case.

Accuracy and language coverage

On English, Spanish, and French, every model that accepts the language scored about 0% error. That is the floor, and everyone clears it. Mandarin and Hindi are where even this easy test starts to show cracks, but read that as a hint about where to point your own testing, not a verdict:

Model	Mandarin (CER)	Hindi (WER)	Coverage
`gemini-2.5-flash-lite`	0%	13%	all five
`gemini-3.1-flash-lite-preview`	0%	15%	all five
`seed-asr-bigmodel`	0%	fails	English + Chinese only
`qwen3-asr-flash`	0%	15%	all five
`gemini-2.5-flash`	0%	15%	all five
`gpt-4o-mini-transcribe`	0%	4%	all five
`gemini-3-flash-preview`	16%	7%	all five
`whisper-1`	0%	22%	all five
`gpt-4o-transcribe`	0%	13%	all five
`gemini-2.5-pro`	0%	15%	all five
`chirp-2`	16%	15%	all five
`chirp-3`	2%	15%	all five
`gemini-3.5-flash`	0%	15%	all five

The hard fact here is coverage, not accuracy. seed-asr returns a useless transcript for Hindi, Spanish, and French: it is an English-and-Chinese model, so it is only an option if your audio is one of those two languages. Everything else handled all five.

The Hindi spread and the Mandarin slips (chirp-2 and gemini-3-flash-preview at 16%, chirp-3 at 2%) say those models are worth testing on your harder languages before you trust them, not that one is better than another. The absolute numbers are inflated by the synthetic voice and the scoring and move from run to run. The honest read is that on clean speech in major languages, accuracy is not where these models separate, so it is not where this test can tell you to choose.
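
For running the same kind of check on your own recordings, here is a sketch of the textbook definitions of WER and CER: edit distance divided by reference length. This is not necessarily the scoring used for the table, which is not specified here. It splits words on whitespace and does no normalization, so differences in punctuation or casing count as errors; that is one way scoring alone can inflate a number.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or free if the items match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the reference word count.

    Assumes a non-empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


def cer(reference, hypothesis):
    """Character error rate with whitespace removed; suits unsegmented scripts.

    Assumes a non-empty reference.
    """
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    return edit_distance(ref, hyp) / len(ref)


print(wer("the market opens at nine", "the market open at nine"))  # 0.2
print(cer("今天天气很好", "今天天汽很好"))  # one of six characters wrong
```

On clips of roughly 40 words, a single wrong word moves WER by about 2.5 points, which is worth keeping in mind when comparing rows.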

Streaming output

Whether a model can stream its transcript is a capability, not a quality call, and it splits the lineup. The per-minute models (whisper-1, seed-asr, qwen3-asr-flash, and both Chirps) are batch-only; the gateway returns a 400 if you ask them to stream. The gpt-4o models stream token by token: gpt-4o-transcribe returns its first words in about a second and fills in the rest, which is what a live-feel UI needs. The Gemini models technically stream, but in three to six large blocks, with the first arriving about when the whole transcript is done, so it buys almost nothing. Cost is unchanged from batch. To stream, add stream=true:

curl -N https://synthorai.io/v1/audio/transcriptions \
  -H "Authorization: Bearer $SYNTHORAI_API_KEY" \
  -F file=@meeting.mp3 -F model=gpt-4o-transcribe -F stream=true
# data: {"type":"transcript.text.delta","delta":"When"}
# data: {"type":"transcript.text.delta","delta":" you"} ...
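
On the client side, the stream is a sequence of SSE data lines like the two shown above. A minimal sketch of stitching the deltas back into a transcript, assuming every text fragment arrives as a transcript.text.delta event; the "[DONE]" sentinel and any other event types are assumptions, and are simply skipped here:

```python
import json


def collect_transcript(lines):
    """Join the delta fragments from an SSE transcript stream into one string."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":  # assumed end-of-stream sentinel
            continue
        event = json.loads(payload)
        if event.get("type") == "transcript.text.delta":
            parts.append(event.get("delta", ""))
    return "".join(parts)


sample = [
    'data: {"type":"transcript.text.delta","delta":"When"}',
    "",
    'data: {"type":"transcript.text.delta","delta":" you"}',
]
print(collect_transcript(sample))  # When you
```

A live UI would render each delta as it arrives instead of joining at the end; the parsing is the same.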

Caching repeated audio

Caching is where the two billing shapes split one more time. The per-minute models cannot cache: we sent the same clip to whisper-1 five times and paid an identical $0.015478 every time, because the bill is just duration. The token-billed Gemini models can. Send the same file repeatedly and Gemini's implicit cache reuses the audio tokens: on a 155-second clip sent five times, gemini-2.5-flash dropped from $0.0054 to $0.0026 on two of the repeats, about 51% off, and gemini-2.5-pro fell about 39%.

Two caveats keep it from being a sure thing. It is best-effort, so some repeats hit the cache and some pay full price; and the audio has to clear Gemini's token floor, roughly a minute or more, which the short clips elsewhere in this test never do. The gpt-4o models list no cache rate and showed only ordinary run-to-run variation. So if your workload re-transcribes the same files, caching is a real discount on the token-billed models and nothing on the per-minute ones.
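
Because hits are best-effort, the saving across a whole workload is smaller than the per-hit discount. A rough sketch of that arithmetic, assuming every miss pays full price and every hit gets the same discount:

```python
def effective_saving(hit_discount, hits, sends):
    """Overall fraction saved when `hits` of `sends` requests get `hit_discount` off.

    Assumes every miss pays full price and every hit gets the same discount.
    """
    if sends == 0 or not 0 <= hits <= sends:
        raise ValueError("need sends > 0 and 0 <= hits <= sends")
    return hit_discount * hits / sends


# gemini-2.5-flash: about 51% off per hit, two hits across five sends.
print(f"{effective_saving(0.51, hits=2, sends=5):.1%}")  # 20.4%
# Best case for five sends: the first send is always a miss, the other four hit.
print(f"{effective_saving(0.51, hits=4, sends=5):.1%}")  # 40.8%
```

So the run described above saved roughly a fifth of the total spend on that clip, not half.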

What to check first, and what to test yourself

This test cannot tell you which model is most accurate on your recordings. It can tell you what to filter on before you run your own evaluation:

Languages. Check that the model accepts every language you need. seed-asr is English and Chinese only; the other twelve handled all five we tried. This is a hard gate, not a preference.
Streaming. If you need a live transcript, only the gpt-4o models stream token by token; the per-minute models are batch-only and Gemini's streaming is coarse.
Cost. The spread is about 30x. gemini-2.5-flash-lite is the cheapest and still multilingual; the Chirps and gemini-3.5-flash are the most expensive. A pricier model in the same family did not earn its premium on the easy clips, so do not assume you need it without checking. If you re-transcribe the same files often, the token-billed Gemini models can also cache the audio, as above.
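
Those three gates can be written as a small filter over the tables in this post. The sketch below copies the costs, streaming modes, and coverage from those tables; the language codes, the Model class, and the shortlist function are illustrative shorthand, not a gateway API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_min: float   # USD, from the cost table
    streams: str          # "token", "chunks", or "no"
    languages: frozenset  # languages handled in this test


ALL_FIVE = frozenset({"en", "zh", "hi", "es", "fr"})
EN_ZH = frozenset({"en", "zh"})

MODELS = [
    Model("gemini-2.5-flash-lite", 0.0006, "chunks", ALL_FIVE),
    Model("gemini-3.1-flash-lite-preview", 0.0016, "chunks", ALL_FIVE),
    Model("seed-asr-bigmodel", 0.0020, "no", EN_ZH),
    Model("qwen3-asr-flash", 0.0021, "no", ALL_FIVE),
    Model("gemini-2.5-flash", 0.0026, "chunks", ALL_FIVE),
    Model("gpt-4o-mini-transcribe", 0.0031, "token", ALL_FIVE),
    Model("gemini-3-flash-preview", 0.0035, "chunks", ALL_FIVE),
    Model("whisper-1", 0.0060, "no", ALL_FIVE),
    Model("gpt-4o-transcribe", 0.0062, "token", ALL_FIVE),
    Model("gemini-2.5-pro", 0.0082, "chunks", ALL_FIVE),
    Model("chirp-2", 0.0164, "no", ALL_FIVE),
    Model("chirp-3", 0.0164, "no", ALL_FIVE),
    Model("gemini-3.5-flash", 0.0178, "chunks", ALL_FIVE),
]


def shortlist(languages, need_token_streaming=False, max_cost_per_min=None):
    """Return the models that pass the hard gates, cheapest first."""
    needed = set(languages)
    keep = [
        m for m in MODELS
        if needed <= m.languages
        and (not need_token_streaming or m.streams == "token")
        and (max_cost_per_min is None or m.cost_per_min <= max_cost_per_min)
    ]
    return sorted(keep, key=lambda m: m.cost_per_min)


# English and Hindi with a live-feel transcript:
for m in shortlist({"en", "hi"}, need_token_streaming=True):
    print(m.name, m.cost_per_min)
```

The filter deliberately has no accuracy field: whatever it returns is the set to evaluate on your own recordings, not a recommendation.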

Once a few models clear those gates, one question remains: how accurate each is on your own audio, with its accents, noise, and vocabulary. You have to answer that one yourself. No clean-speech benchmark substitutes for running the survivors on real recordings.

Bottom line

On clean, scripted speech in major languages, all thirteen models are about equally accurate, which is the most useful thing this test says: accuracy is not the axis to choose on. What it does pin down, and what genuinely varies, is the baseline: cost spans about 30x, one model covers only two languages, and several cannot stream. Use those to narrow the field, not to declare a winner, then run the two or three survivors on your own audio. That last step is the one no simple test can do for you.

Sources

Costs and latencies measured on Synthorai on 2026-06-25 across thirteen models and five languages (English, Mandarin, Hindi, Spanish, French), via the x-total-cost-usd header and SSE timing. The audio was text-to-speech generated and deliberately easy, so the accuracy figures are a floor check rather than a quality benchmark; real-world speech with accents and noise would separate these models differently. Latency varies run to run. Listed prices are this platform's rates as of that date; verify current pricing before relying on them.
