Streaming Speech-to-Text with OpenAI in 2026: Moving Beyond Whisper

Quick recap of where we are if you haven't been following OpenAI's STT roadmap: the classic whisper-1 endpoint is batch-only — you upload a file, wait, get back a finished transcript. There's no stream=True because the underlying Whisper decoder wasn't designed for it, and the endpoint probably won't ever get streaming retrofitted onto it.

That was a genuine blocker for about two years. If you wanted live captions or partial transcripts, you had to either self-host Whisper with a streaming fork, or reach for a third-party like AssemblyAI / Deepgram.

Then, quietly, OpenAI shipped two replacements that between them cover every STT streaming use case I've needed:

  1. gpt-4o-transcribe / gpt-4o-mini-transcribe — file upload with stream=True, delivers partial transcripts as the audio is processed.
  2. Realtime API (gpt-4o-realtime-preview) — WebSocket, bidirectional, built for live mic-in / TTS-out with a live-transcription mode.

I helped someone get unblocked on this in openai/openai-python#2306 and realised I'd never written up the full picture. Here it is — with the trade-offs, working code for each, and a decision rule at the end.

Option 1: gpt-4o-transcribe with stream=True

Same API shape as the old audio.transcriptions.create call, just with a new model and stream=True. You get incremental transcript.text.delta events as chunks come back:

from openai import OpenAI
client = OpenAI()

with open("meeting.mp3", "rb") as f:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",       # or "gpt-4o-mini-transcribe" (cheaper)
        file=f,
        response_format="text",
        stream=True,
    )
    transcript = []
    for event in stream:
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)
            transcript.append(event.delta)
        elif event.type == "transcript.text.done":
            print()    # final newline
    full_text = "".join(transcript)

Async works identically with AsyncOpenAI:

import asyncio
from openai import AsyncOpenAI

async def transcribe(path: str) -> str:
    client = AsyncOpenAI()
    parts = []
    with open(path, "rb") as f:
        stream = await client.audio.transcriptions.create(
            model="gpt-4o-transcribe",
            file=f,
            response_format="text",
            stream=True,
        )
        async for event in stream:
            if event.type == "transcript.text.delta":
                parts.append(event.delta)
    return "".join(parts)

text = asyncio.run(transcribe("meeting.mp3"))

Why I default to this for "finished file" use cases

Three practical reasons beyond the streaming:

  • Accuracy. On the English + mixed-language audio I've benchmarked, gpt-4o-transcribe is noticeably better than whisper-1 at speaker changes, acronyms, and technical vocabulary. gpt-4o-mini-transcribe is a smaller quality step down but still beats whisper-1 in my tests.
  • Latency perception. Even though the total time is similar, partial transcripts streaming into your UI feel much faster to users. A 3-minute audio file that takes 20 seconds to transcribe feels instant if the first words show up after ~500ms; it feels sluggish if you wait the full 20 seconds for the whole blob.
  • Same file-upload ergonomics as Whisper. Swapping model="whisper-1" for model="gpt-4o-transcribe" + adding stream=True is almost a drop-in change, so migrating an existing pipeline is a 5-minute job, not a rewrite.

What you give up

  • No word-level timestamps (yet, at time of writing). whisper-1 with response_format="verbose_json" and timestamp_granularities=["word"] still wins if you need precise word-level timing for subtitle alignment. If that's your use case, stay on whisper-1. (Example after this list.)
  • No speaker diarization in either. If you need "who said what", both of these need to be paired with a separate diarization step (pyannote is the usual pick).
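
For reference, the whisper-1 word-timestamp call looks like this. A minimal sketch; the filename is a placeholder:

from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

# result.words is a list of entries with .word, .start, .end (seconds)
for w in result.words:
    print(f"{w.start:6.2f}-{w.end:6.2f}  {w.word}")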

Option 2: Realtime API for true live audio

If you're transcribing live audio — a microphone, a phone call, a meeting as it happens — you want the Realtime API, not a file upload. It's a WebSocket connection you push PCM16 chunks into, and you get back conversation.item.input_audio_transcription.delta events every ~200–500ms.

import asyncio, base64
import sounddevice as sd
from openai import AsyncOpenAI

SAMPLE_RATE = 24_000
CHUNK_MS = 50   # 50ms chunks

async def live_transcribe():
    client = AsyncOpenAI()
    async with client.beta.realtime.connect(
        model="gpt-4o-realtime-preview"
    ) as conn:
        # Configure for transcription-only (no model replies, no TTS)
        await conn.session.update(session={
            "modalities": ["text"],
            "input_audio_format": "pcm16",
            "input_audio_transcription": {"model": "gpt-4o-transcribe"},
            "turn_detection": {"type": "server_vad"},   # let OpenAI handle silence detection
        })

        # Start streaming audio in the background. The sounddevice callback
        # runs on a separate audio thread, so hand chunks to the event loop
        # thread-safely instead of touching the asyncio.Queue directly.
        loop = asyncio.get_running_loop()
        audio_queue: asyncio.Queue[bytes] = asyncio.Queue()

        def on_audio(indata, frames, time_info, status):
            loop.call_soon_threadsafe(audio_queue.put_nowait, bytes(indata))

        with sd.RawInputStream(
            samplerate=SAMPLE_RATE,
            blocksize=int(SAMPLE_RATE * CHUNK_MS / 1000),
            channels=1,
            dtype="int16",
            callback=on_audio,
        ):
            sender = asyncio.create_task(_send_audio(conn, audio_queue))
            try:
                async for event in conn:
                    if event.type == "conversation.item.input_audio_transcription.delta":
                        print(event.delta, end="", flush=True)
                    elif event.type == "conversation.item.input_audio_transcription.completed":
                        print(f"\n[final: {event.transcript}]")
            finally:
                sender.cancel()

async def _send_audio(conn, q: asyncio.Queue[bytes]):
    while True:
        chunk = await q.get()
        await conn.input_audio_buffer.append(
            audio=base64.b64encode(chunk).decode("ascii")
        )

asyncio.run(live_transcribe())

A few things worth knowing:

  • PCM16 at 24 kHz is the expected format. If you're capturing at a different sample rate, resample before sending — the server won't resample for you. (A resampling sketch follows this list.)
  • Let server-side VAD handle turn detection (turn_detection: {type: "server_vad"}) unless you have a specific reason to do it client-side. OpenAI's VAD is well-tuned and keeps your client code simple.
  • The conversation.item.input_audio_transcription.delta events are your partial captions; conversation.item.input_audio_transcription.completed fires when the user finishes a "turn" (i.e. stops talking for ~500ms). Use the deltas to drive your live caption UI, and the completed event to commit a finalised sentence to your transcript log.
  • You can also use the Realtime API for voice-to-voice (audio in, audio out) by adding "audio" to modalities and setting a TTS voice. The transcription deltas still fire, so you get the transcript "for free" even in a full voice-assistant setup.
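
If your capture device won't do 24 kHz natively, here's one way to get there. A minimal sketch, assuming mono int16 input and that scipy is available:

import numpy as np
from scipy.signal import resample_poly

def to_pcm16_24k(pcm16_bytes: bytes, src_rate: int) -> bytes:
    """Resample mono int16 PCM to 24 kHz before appending to the audio buffer."""
    if src_rate == 24_000:
        return pcm16_bytes
    samples = np.frombuffer(pcm16_bytes, dtype=np.int16).astype(np.float32)
    resampled = resample_poly(samples, 24_000, src_rate)
    return np.clip(resampled, -32768, 32767).astype(np.int16).tobytes()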

Latency is the killer feature

In my testing, the partial-transcript latency on Realtime is 200–400ms from end-of-phoneme to delta, which is what you need for live captions to feel responsive. File-based gpt-4o-transcribe with streaming still has to wait for the chunk to arrive on the server before it can start, so the first delta on a file upload lands ~1–2s in — fine for "uploaded recording" UX, too slow for "live."
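
If you want to sanity-check those numbers on your own audio, timing the first delta on the file-based path is enough. A rough sketch reusing the Option 1 call, with the filename as a placeholder:

import time
from openai import OpenAI

client = OpenAI()
start = time.monotonic()

with open("meeting.mp3", "rb") as f:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-transcribe", file=f, response_format="text", stream=True
    )
    for event in stream:
        if event.type == "transcript.text.delta":
            # Time from request start to the first partial transcript
            print(f"first delta after {time.monotonic() - start:.2f}s")
            break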

Option 3: Stay on whisper-1

If:

  • You genuinely don't care about streaming (batch transcription of a recorded file where the UX is "upload → come back in a minute for the result").
  • You need word-level timestamps for subtitle alignment.
  • You're cost-optimising hard and the 50% discount of whisper-1 over gpt-4o-mini-transcribe matters.

... then whisper-1 is still the right call, and probably will be for a while. It's not going anywhere, it's cheap, it's stable.

The decision rule

Written out as an actual rule I use:

"Is the user staring at the UI waiting for the transcript?"

  • No (batch job, background processing, subtitle generation) → whisper-1 if you need timestamps, gpt-4o-mini-transcribe otherwise.
  • Yes, and the audio is a finished file → gpt-4o-transcribe or gpt-4o-mini-transcribe with stream=True.
  • Yes, and the audio is live (mic, phone, meeting) → Realtime API with input_audio_transcription.

There's one additional axis worth flagging: language coverage. whisper-1 still has the broadest language support (it was trained on 98 languages). gpt-4o-transcribe is very good on the major languages but gets noticeably worse as you head into the long tail. If you're transcribing Swahili, Bengali, or any other non-top-20 language, benchmark both on a sample before picking — don't assume the newer model is always better.
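
A quick way to run that comparison is to transcribe the same sample with both models and score each against a reference transcript. A sketch, assuming the jiwer package for word error rate and a hand-checked reference.txt; the filenames are placeholders:

import jiwer
from openai import OpenAI

client = OpenAI()
reference = open("reference.txt", encoding="utf-8").read()

for model in ("whisper-1", "gpt-4o-transcribe"):
    with open("sample.mp3", "rb") as f:
        result = client.audio.transcriptions.create(model=model, file=f)
    # Lower word error rate is better
    print(model, "WER:", jiwer.wer(reference, result.text))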

Common pitfalls

A few things that cost me hours the first time:

  1. The file parameter for audio.transcriptions.create wants a file-like object, not bytes. If you have raw bytes in memory (e.g. from an upload handler), wrap them in io.BytesIO and set a .name attribute ending in the right extension: the SDK uses the filename to infer Content-Type.
   from io import BytesIO
   buf = BytesIO(audio_bytes)
   buf.name = "recording.mp3"    # ← critical
   client.audio.transcriptions.create(model="gpt-4o-transcribe", file=buf, stream=True)
  2. Realtime API auth uses the same key as the rest of OpenAI, but the connection is authenticated at the WebSocket handshake. Your API key briefly appears in the Authorization header of the initial HTTP upgrade request, which is fine server-side — but if you're building a browser client, you need to proxy the handshake through your backend so the key never touches the client. OpenAI has a "client secret" flow for this.

  3. The event types are stable but there are a lot of them. The Realtime API emits ~20 distinct event types; if you find yourself writing a giant if/elif chain, factor it into a dispatch dict indexed by event.type early — much easier to extend later. (A minimal sketch follows this list.)

  4. Silence doesn't count as transcription. If your audio has a lot of pauses, you'll see input_audio_buffer.speech_stopped events but no transcription deltas for the silent parts. That's expected; don't treat it as a bug.
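
Here's the dispatch shape I mean, as a minimal sketch (the handler names are made up):

def on_transcript_delta(event):
    print(event.delta, end="", flush=True)

def on_transcript_done(event):
    print(f"\n[final: {event.transcript}]")

HANDLERS = {
    "conversation.item.input_audio_transcription.delta": on_transcript_delta,
    "conversation.item.input_audio_transcription.completed": on_transcript_done,
}

async def handle_events(conn):
    # Unknown event types are simply ignored; add handlers as you need them
    async for event in conn:
        handler = HANDLERS.get(event.type)
        if handler:
            handler(event)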

If you're building something non-trivial with streaming STT — especially multi-speaker scenarios, code-switching (mixing languages), or very noisy audio — leave a comment, I've been collecting notes on which approach wins in each setting.
