Quick recap of where we are if you haven't been following OpenAI's STT roadmap: the classic `whisper-1` endpoint is batch-only — you upload a file, wait, get back a finished transcript. There's no `stream=True` because the underlying Whisper decoder wasn't designed for it, and the endpoint probably won't ever get streaming retrofitted onto it.
That was a genuine blocker for about two years. If you wanted live captions or partial transcripts, you had to either self-host Whisper with a streaming fork, or reach for a third-party service like AssemblyAI or Deepgram.
Then, quietly, OpenAI shipped two replacements that between them cover every STT streaming use case I've needed:
- `gpt-4o-transcribe` / `gpt-4o-mini-transcribe` — file upload with `stream=True`, delivers partial transcripts as the audio is processed.
- Realtime API (`gpt-4o-realtime-preview`) — WebSocket, bidirectional, built for live mic-in / TTS-out with a live-transcription mode.
I helped someone get unblocked on this in openai/openai-python#2306 and realised I'd never written up the full picture. Here it is — with the trade-offs, working code for each, and a decision rule at the end.
## Option 1: `gpt-4o-transcribe` with `stream=True`
Same API shape as the old `audio.transcriptions.create` call, just with a new model and `stream=True`. You get incremental `transcript.text.delta` events as chunks come back:
```python
from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as f:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" (cheaper)
        file=f,
        response_format="text",
        stream=True,
    )

transcript = []
for event in stream:
    if event.type == "transcript.text.delta":
        print(event.delta, end="", flush=True)
        transcript.append(event.delta)
    elif event.type == "transcript.text.done":
        print()  # final newline

full_text = "".join(transcript)
```
Async works identically with `AsyncOpenAI`:
```python
import asyncio

from openai import AsyncOpenAI


async def transcribe(path: str) -> str:
    client = AsyncOpenAI()
    parts = []
    with open(path, "rb") as f:
        stream = await client.audio.transcriptions.create(
            model="gpt-4o-transcribe",
            file=f,
            response_format="text",
            stream=True,
        )
        async for event in stream:
            if event.type == "transcript.text.delta":
                parts.append(event.delta)
    return "".join(parts)


text = asyncio.run(transcribe("meeting.mp3"))
```
## Why I default to this for "finished file" use cases
Three practical reasons beyond the streaming:
- **Accuracy.** On the English + mixed-language audio I've benchmarked, `gpt-4o-transcribe` is noticeably better than `whisper-1` at speaker changes, acronyms, and technical vocabulary. `gpt-4o-mini-transcribe` is a smaller quality step down but still beats `whisper-1` in my tests.
- **Latency perception.** Even though the total time is similar, partial transcripts streaming into your UI feel much faster to users. A 3-minute audio file that takes 20 seconds to transcribe feels instant if the first words show up after ~500ms; it feels sluggish if you wait the full 20 seconds for the whole blob.
- **Same file-upload ergonomics as Whisper.** Swapping `model="whisper-1"` for `model="gpt-4o-transcribe"` + adding `stream=True` is almost a drop-in change, so migrating an existing pipeline is a five-minute job, not a rewrite.
## What you give up
- **No word-level timestamps (yet, at time of writing).** `whisper-1` with `response_format="verbose_json"` and `timestamp_granularities=["word"]` still wins if you need precise word-level timing for subtitle alignment (minimal call sketched after this list). If that's your use case, stay on `whisper-1`.
- **No speaker diarization in either.** If you need "who said what", both of these need to be paired with a separate diarization step (pyannote is the usual pick).
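For reference, here's a minimal sketch of that `whisper-1` timestamp call, using the `verbose_json` and `timestamp_granularities` options mentioned above:

```python
from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

# result.words is a list of entries with .word, .start, .end (seconds)
for word in result.words:
    print(f"{word.start:7.2f} {word.end:7.2f}  {word.word}")
```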
## Option 2: Realtime API for true live audio
If you're transcribing live audio — a microphone, a phone call, a meeting as it happens — you want the Realtime API, not a file upload. It's a WebSocket connection you push PCM16 chunks into, and you get back `conversation.item.input_audio_transcription.delta` events every ~200–500ms.
```python
import asyncio
import base64

import sounddevice as sd
from openai import AsyncOpenAI

SAMPLE_RATE = 24_000
CHUNK_MS = 50  # 50ms chunks


async def live_transcribe():
    client = AsyncOpenAI()
    async with client.beta.realtime.connect(
        model="gpt-4o-realtime-preview"
    ) as conn:
        # Configure for transcription-only (no model replies, no TTS)
        await conn.session.update(session={
            "modalities": ["text"],
            "input_audio_format": "pcm16",
            "input_audio_transcription": {"model": "gpt-4o-transcribe"},
            "turn_detection": {"type": "server_vad"},  # let OpenAI handle silence detection
        })

        # sounddevice runs the callback on its own thread, and asyncio.Queue
        # is not thread-safe, so hand chunks over via call_soon_threadsafe
        loop = asyncio.get_running_loop()
        audio_queue: asyncio.Queue[bytes] = asyncio.Queue()

        def on_audio(indata, frames, time_info, status):
            loop.call_soon_threadsafe(audio_queue.put_nowait, bytes(indata))

        with sd.RawInputStream(
            samplerate=SAMPLE_RATE,
            blocksize=int(SAMPLE_RATE * CHUNK_MS / 1000),
            channels=1,
            dtype="int16",
            callback=on_audio,
        ):
            # Stream audio to the server in the background while we read events
            sender = asyncio.create_task(_send_audio(conn, audio_queue))
            try:
                async for event in conn:
                    if event.type == "conversation.item.input_audio_transcription.delta":
                        print(event.delta, end="", flush=True)
                    elif event.type == "conversation.item.input_audio_transcription.completed":
                        print(f"\n[final: {event.transcript}]")
            finally:
                sender.cancel()


async def _send_audio(conn, q: asyncio.Queue[bytes]):
    while True:
        chunk = await q.get()
        await conn.input_audio_buffer.append(
            audio=base64.b64encode(chunk).decode("ascii")
        )


asyncio.run(live_transcribe())
```
A few things worth knowing:
- PCM16 at 24 kHz is the expected format. If you're capturing at a different sample rate, resample before sending — the server won't resample for you (see the sketch after this list).
- Let server-side VAD handle turn detection (`turn_detection: {"type": "server_vad"}`) unless you have a specific reason to do it client-side. OpenAI's VAD is well-tuned and keeps your client code simple.
- The `conversation.item.input_audio_transcription.delta` events are your partial captions; `conversation.item.input_audio_transcription.completed` fires when the user finishes a "turn" (i.e. stops talking for ~500ms). Use the deltas to drive your live caption UI, and the completed event to commit a finalised sentence to your transcript log.
- You can also use the Realtime API for voice-to-voice (audio in, audio out) by adding `"audio"` to `modalities` and setting a TTS voice. The transcription deltas still fire, so you get the transcript "for free" even in a full voice-assistant setup.
## Latency is the killer feature
In my testing, the partial-transcript latency on Realtime is 200–400ms from end-of-phoneme to delta, which is what you need for live captions to feel responsive. File-based `gpt-4o-transcribe` with streaming still has to wait for the chunk to arrive on the server before it can start, so the first delta on a file upload lands ~1–2s in — fine for "uploaded recording" UX, too slow for "live."
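If you want to reproduce the file-upload number yourself, this is roughly how I measure time-to-first-delta (the helper name is mine, not from the SDK):

```python
import time

from openai import OpenAI

client = OpenAI()

def time_to_first_delta(path: str) -> float:
    """Seconds from request start until the first transcript delta arrives."""
    start = time.monotonic()
    with open(path, "rb") as f:
        stream = client.audio.transcriptions.create(
            model="gpt-4o-transcribe", file=f, response_format="text", stream=True
        )
        for event in stream:
            if event.type == "transcript.text.delta":
                return time.monotonic() - start
    return float("nan")  # stream ended without any speech
```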
## Option 3: Stay on `whisper-1`
If:
- You genuinely don't care about streaming (batch transcription of a recorded file where the UX is "upload → come back in a minute for the result").
- You need word-level timestamps for subtitle alignment.
- You're cost-optimising hard and, at your volumes, the per-minute price difference between `whisper-1` and `gpt-4o-mini-transcribe` matters.
... then `whisper-1` is still the right call, and probably will be for a while. It's not going anywhere, it's cheap, it's stable.
## The decision rule
Written out as an actual rule I use:
"Is the user staring at the UI waiting for the transcript?"
- No (batch job, background processing, subtitle generation) → `whisper-1` if you need timestamps, `gpt-4o-mini-transcribe` otherwise.
- Yes, and the audio is a finished file → `gpt-4o-transcribe` or `gpt-4o-mini-transcribe` with `stream=True`.
- Yes, and the audio is live (mic, phone, meeting) → Realtime API with `input_audio_transcription`.
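If you'd rather have it as code, here's the same rule as a tiny function (the function and its flags are mine, purely illustrative):

```python
def pick_stt(user_waiting: bool, live_audio: bool, need_word_timestamps: bool) -> str:
    if not user_waiting:
        # Batch job: nobody is staring at a spinner
        return "whisper-1" if need_word_timestamps else "gpt-4o-mini-transcribe"
    if live_audio:
        # Mic / phone / meeting: Realtime API with input_audio_transcription
        return "realtime"
    # Finished file with a user waiting: stream the upload
    return "gpt-4o-transcribe"  # with stream=True
```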
There's one additional axis worth flagging: language coverage. `whisper-1` still has the broadest language support (it was trained on 98 languages). `gpt-4o-transcribe` is very good on the major languages but gets noticeably worse as you head into the long tail. If you're transcribing Swahili, Bengali, or any other non-top-20 language, benchmark both on a sample before picking — don't assume the newer model is always better.
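A quick way to run that side-by-side on one sample file (the filename is a placeholder; scoring against a reference transcript, e.g. WER via a library like `jiwer`, is left out):

```python
from openai import OpenAI

client = OpenAI()

def transcribe_with(model: str, path: str) -> str:
    with open(path, "rb") as f:
        # response_format="text" makes the SDK return a plain string
        return client.audio.transcriptions.create(
            model=model, file=f, response_format="text"
        )

for model in ("whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe"):
    print(f"--- {model} ---")
    print(transcribe_with(model, "sample_swahili.mp3")[:200])
```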
## Common pitfalls
A few things that cost me hours the first time:
- The `file` parameter for `audio.transcriptions.create` wants a file-like object, not bytes. If you have raw bytes in memory (e.g. from an upload handler), wrap them in `io.BytesIO` and set a `.name` attribute ending in the right extension: the SDK uses the filename to infer `Content-Type`.
```python
from io import BytesIO

buf = BytesIO(audio_bytes)
buf.name = "recording.mp3"  # ← critical: the SDK infers Content-Type from this
client.audio.transcriptions.create(model="gpt-4o-transcribe", file=buf, stream=True)
```
- Realtime API auth uses the same key as the rest of OpenAI, but the connection is authenticated at the WebSocket handshake. Your API key briefly appears in the `Authorization` header of the initial HTTP upgrade request, which is fine server-side — but if you're building a browser client, you need to proxy the handshake through your backend so the key never touches the client. OpenAI has a "client secret" flow for this.
- The event types are stable but there are a lot of them. The Realtime API emits ~20 distinct event types; if you find yourself writing a giant `if/elif` chain, factor it into a dispatch dict indexed by `event.type` early (sketched after this list) — much easier to extend later.
- Silence doesn't count as transcription. If your audio has a lot of pauses, you'll see `input_audio_buffer.speech_stopped` events but no transcription deltas for the silent parts. That's expected; don't treat it as a bug.
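Here's the dispatch-dict shape I mean, sketched for the two transcription events (the handler names are mine, not part of the SDK):

```python
from typing import Awaitable, Callable

async def on_delta(event) -> None:
    print(event.delta, end="", flush=True)

async def on_completed(event) -> None:
    print(f"\n[final: {event.transcript}]")

# Map event.type strings to handlers; adding a new event type is one line
HANDLERS: dict[str, Callable[..., Awaitable[None]]] = {
    "conversation.item.input_audio_transcription.delta": on_delta,
    "conversation.item.input_audio_transcription.completed": on_completed,
}

async def handle_events(conn) -> None:
    async for event in conn:
        handler = HANDLERS.get(event.type)
        if handler:  # silently skip the ~18 other event types
            await handler(event)
```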
## References
- Speech-to-Text guide — covers `gpt-4o-transcribe` and `whisper-1` together
- Realtime API guide — the overview
- openai-python SDK Realtime docs — the Python specifics
- openai/openai-python#2306 — the discussion that prompted this writeup
If you're building something non-trivial with streaming STT — especially multi-speaker scenarios, code-switching (mixing languages), or very noisy audio — leave a comment; I've been collecting notes on which approach wins in each setting.