Speech Recognition API: Streaming, WebSockets and Latency
A speech recognition API that accepts a file and returns a transcript is a solved problem. The architecture is simple because the constraints are simple.
Real-time transcription is different. The audio doesn't exist yet when processing needs to begin. The user is still speaking while the system needs to be building a hypothesis about what they said. The application needs a partial answer now, not a complete answer in two seconds. These constraints change the architecture at every layer, from how audio is captured and transmitted to how the recognition model processes it and how results flow back to the client.
This piece walks through that architecture end to end. Not as an API reference, but as an explanation of what is actually happening inside a streaming speech recognition system and why each component is designed the way it is.
The fundamental problem with batch transcription for real-time use
Before looking at how streaming ASR works, it helps to understand precisely why the batch approach breaks down when applied to real-time audio.
In a batch system, the flow is straightforward. Audio is captured, buffered until complete, sent to a recognition service via an HTTP POST request, processed server-side, and a transcript is returned in the response body. The model sees the entire utterance before producing any output. This gives it full context, which tends to produce accurate results.
The problem is time. If a user speaks for five seconds, the system cannot return any transcript until those five seconds of audio have been captured, transmitted, and processed. Even with a fast model, the user experiences a dead pause after finishing their sentence before anything happens. In a voice agent or real-time captioning system, that pause breaks the interaction.
The deeper problem is that buffering introduces a fundamental floor on latency that no amount of model optimization can eliminate. Even an infinitely fast model cannot return a transcript before the audio has been collected and sent. The latency is baked into the architecture.
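To make that floor concrete, here is a back-of-envelope comparison. All the timing numbers below are illustrative assumptions, not measurements of any particular service:

```python
# Hypothetical timings for illustration only -- not measured values.
utterance_ms = 5000      # the user speaks for five seconds
upload_ms = 300          # transmitting the complete file
inference_ms = 400       # model processing time

# Batch: nothing comes back until capture + upload + inference all finish.
batch_first_result_ms = utterance_ms + upload_ms + inference_ms

# Streaming: the first partial arrives after one analysis window plus
# inference on that window, no matter how long the user keeps talking.
window_ms = 400          # a typical rolling analysis window
stream_first_result_ms = window_ms + inference_ms

print(f"batch: first result after {batch_first_result_ms} ms")
print(f"streaming: first partial after {stream_first_result_ms} ms")
```

The batch number scales with utterance length; the streaming number does not. That is what "removing the floor" means in practice.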
Streaming ASR removes this floor by changing the fundamental contract. Rather than collecting audio and then processing it, the system processes audio as it arrives.
How a streaming ASR API receives audio
The first architectural shift in a streaming system is the transport layer. HTTP request-response is the wrong shape for continuous audio delivery.
A new HTTP connection carries significant overhead including DNS resolution, TCP handshake, TLS negotiation, and HTTP headers on every request. For a file upload, this overhead is negligible relative to the payload. For 20-millisecond audio packets arriving fifty times per second, it is prohibitive. The connection overhead would dominate the actual audio data.
A WebSocket connection solves this by establishing a single persistent connection that remains open for the duration of the session. The initial handshake happens once. After that, both sides can send data at any time without per-message overhead. The client pushes audio packets as they arrive from the microphone. The server pushes transcript events as they are produced by the recognition model. Neither side waits for the other to finish.
Python Batch Processing Example
- Prerequisite: a Smallest.ai API key
import os
import requests

API_KEY = os.getenv("SMALLEST_API_KEY")
audio_file = "meeting_recording.wav"

url = "https://api.smallest.ai/api/v1/pulse/get_text"
params = {
    "model": "pulse",
    "language": "en",
    "word_timestamps": "true",
    "diarization": "true",
    "emotion_detection": "true"
}
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "audio/wav"
}

with open(audio_file, "rb") as f:
    audio_data = f.read()

response = requests.post(url, params=params, headers=headers, data=audio_data)
result = response.json()

print("Transcription:", result.get("transcription"))
for word in result.get("words", []):
    speaker = word.get("speaker", "N/A")
    print(f"  [Speaker {speaker}] [{word['start']:.2f}s - {word['end']:.2f}s] {word['word']}")

if "emotions" in result:
    print("\nEmotions detected:")
    for emotion, score in result["emotions"].items():
        if score > 0.1:
            print(f"  {emotion}: {score:.1%}")
The audio capture and network transmission run concurrently. The recognition server receives a continuous stream of small packets rather than waiting for a complete file.
Inside the recognition model: how streaming inference works
Once audio packets arrive at the recognition server, the ASR model needs to produce transcript output without waiting for the utterance to complete. This requires a different inference architecture from batch transcription.
Modern streaming ASR systems use a buffer that accumulates incoming audio packets and runs the recognition model against overlapping windows of that buffer. The window size is typically 250 to 500 milliseconds, much longer than the 20ms packet size, because the model needs enough acoustic context to make meaningful predictions. Each time new audio arrives, the window advances and the model produces an updated hypothesis.
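The buffering logic above can be sketched in a few lines. The window and hop sizes here (400 ms window, 100 ms hop) are illustrative assumptions chosen within the ranges the text mentions, not values from any particular API:

```python
# Sketch of a rolling-window buffer: 20 ms packets are appended, and each
# time enough new audio has accumulated, an overlapping analysis window
# is emitted for the recognition model to process.
SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2                      # 16-bit linear PCM
WINDOW_MS, HOP_MS = 400, 100              # window size and advance step

WINDOW_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * WINDOW_MS // 1000
HOP_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * HOP_MS // 1000

class StreamingBuffer:
    def __init__(self):
        self.buf = bytearray()
        self.next_end = WINDOW_BYTES      # byte offset where the next window closes

    def add_packet(self, packet: bytes):
        """Append one audio packet; yield each newly completed window."""
        self.buf.extend(packet)
        while len(self.buf) >= self.next_end:
            start = self.next_end - WINDOW_BYTES
            yield bytes(self.buf[start:self.next_end])
            self.next_end += HOP_BYTES    # windows overlap by WINDOW_MS - HOP_MS

buf = StreamingBuffer()
packet = bytes(640)                       # one 20 ms packet at 16 kHz, 16-bit
windows = [w for _ in range(50) for w in buf.add_packet(packet)]  # 1 s of audio
print(len(windows), "windows from 1 s of audio")
```

Each emitted window shares 300 ms of audio with the previous one, which is why a word near a window boundary can be re-evaluated, and revised, as later windows arrive.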
The model's job at each step is to answer the same question. Given all the audio seen so far, what is the most likely transcript? The answer changes as more audio arrives. A word that looked like "their" in the first pass might resolve to "there" when the following words provide context. These updates produce the partial transcript stream.
The internal architecture of the recognition model is typically an encoder-decoder transformer. The encoder converts the incoming audio frames into a sequence of dense vector representations capturing phonetic and prosodic features. The decoder attends to those representations to produce token predictions, one sub-word token at a time, building the transcript incrementally.
What makes this work in streaming mode is a technique called chunked attention, where the encoder is constrained to attend only to audio within a rolling window rather than the full utterance. This means the model can produce outputs without waiting for the sentence to end, at the cost of slightly reduced accuracy on words near the end of the window where context is limited.
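The constraint chunked attention imposes can be visualized as a mask over frame pairs. This is a simplified sketch, assuming 4-frame chunks with one chunk of left context; real systems choose these sizes to trade accuracy against latency:

```python
# Illustrative chunked attention mask: mask[i][j] is True when frame i may
# attend to frame j. Frames see their own chunk and a limited left context,
# but never audio beyond the end of their chunk -- that is what lets the
# encoder run before the utterance is complete.
def chunked_attention_mask(n_frames: int, chunk: int = 4, left_chunks: int = 1):
    mask = [[False] * n_frames for _ in range(n_frames)]
    for i in range(n_frames):
        chunk_end = (i // chunk + 1) * chunk            # end of i's own chunk
        ctx_start = max(0, (i // chunk - left_chunks) * chunk)
        for j in range(ctx_start, min(chunk_end, n_frames)):
            mask[i][j] = True                           # future chunks stay masked
    return mask

mask = chunked_attention_mask(8, chunk=4, left_chunks=1)
# Frame 2 sits in the first chunk: it can attend to frames 0-3 but not 4-7.
print([j for j in range(8) if mask[2][j]])
```

The masked-out future frames are exactly the "limited context" the text describes: words near the right edge of a chunk are predicted with less lookahead than they would get in batch mode.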
WebSocket Connection Example
# file name: websocket.py
import asyncio
import json
import os

import websockets

API_KEY = os.environ.get("SMALLEST_API_KEY")

async def transcribe_stream():
    # Build the WebSocket URL with query parameters
    params = {
        "language": "en",
        "encoding": "linear16",
        "sample_rate": "16000",
        "word_timestamps": "true",
    }
    query_string = "&".join(f"{k}={v}" for k, v in params.items())
    uri = f"wss://waves-api.smallest.ai/api/v1/pulse/get_text?{query_string}"

    # Connect with Bearer token authentication header
    headers = {"Authorization": f"Bearer {API_KEY}"}
    async with websockets.connect(uri, extra_headers=headers) as ws:
        print("Connected to Pulse STT")
        # Launch concurrent tasks: send audio & receive transcripts
        send_task = asyncio.create_task(send_audio(ws))
        recv_task = asyncio.create_task(receive_transcripts(ws))
        await asyncio.gather(send_task, recv_task)

async def send_audio(ws):
    """Read audio from a source and stream it to the WebSocket."""
    with open("audio_16k_mono.raw", "rb") as f:
        while True:
            chunk = f.read(4096)  # Recommended chunk size
            if not chunk:
                break
            await ws.send(chunk)
            await asyncio.sleep(0.05)  # Pace stream to simulate real-time

async def receive_transcripts(ws):
    """Receive and process transcript responses from the server."""
    async for message in ws:
        response = json.loads(message)
        if response.get("transcript"):
            status = "FINAL" if response.get("is_final") else "PARTIAL"
            lang = response.get("language", "unknown")
            print(f"[{status}] ({lang}): {response['transcript']}")
            # Access word timestamps if enabled
            if "words" in response:
                for word in response["words"]:
                    print(f"  {word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")

asyncio.run(transcribe_stream())
Word timestamps are a byproduct of the encoder's attention alignment. The model learns which audio frames correspond to which output tokens during training, and this alignment is surfaced as timing metadata without additional inference cost.
Endpointing and deciding when an utterance ends
One of the harder problems in streaming ASR is endpointing, which is detecting when the speaker has finished a turn rather than simply paused mid-sentence. This matters because the system needs to know when to commit a final transcript and when to keep accumulating audio for the current hypothesis.
Getting endpointing wrong in either direction has visible consequences. An endpointer that fires too early cuts off sentences, producing truncated transcripts. One that fires too late adds perceptible delay after the speaker finishes, because the application has to wait for the endpointer before it can act on what was said.
The simplest approach is energy-based voice activity detection. If the audio energy drops below a threshold for a fixed duration, the system assumes the speaker has finished. This works adequately in quiet environments but fails under noise, where energy never drops cleanly to silence, and for speakers who naturally pause mid-thought.
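The energy-based approach can be sketched directly. The threshold and silence duration below are illustrative assumptions; a real system would compute energy over its actual frame size and tune both values to the deployment environment:

```python
# Minimal energy-based voice activity endpointer: fire once audio energy
# stays below a threshold for a sustained run of frames.
import math
import struct

def frame_rms(pcm: bytes) -> float:
    """RMS energy of one frame of 16-bit little-endian PCM."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def detect_endpoint(frames, threshold=500.0, silence_frames_needed=15):
    """Return the frame index where the endpointer fires, or None."""
    silent_run = 0
    for i, frame in enumerate(frames):
        silent_run = silent_run + 1 if frame_rms(frame) < threshold else 0
        if silent_run >= silence_frames_needed:
            return i                      # sustained silence: commit the turn
    return None
```

The failure modes described above fall straight out of this code: under background noise `frame_rms` never drops below the threshold, and a thoughtful mid-sentence pause longer than `silence_frames_needed` fires the endpointer early.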
Better endpointing systems combine acoustic signals with semantic signals. The acoustic layer watches for energy drops and spectral changes that characterize sentence endings. The semantic layer, often a small language model running on the partial transcript, checks whether the utterance is syntactically complete. A partial transcript ending mid-noun-phrase is unlikely to represent a complete turn. One ending with a complete declarative sentence is more likely to be a real turn boundary.
The output of the endpointer determines when partial transcript events transition to final transcript events in the client. A partial event is a hypothesis update. A final event is a committed result that the application can act on.
The transcript event stream and how to handle it
A streaming ASR API produces a continuous stream of events rather than a single response. Each event carries a type field distinguishing partial from final results, the current transcript text, word-level metadata if requested, and timestamps indicating where in the audio the result falls.
Partial Response (is_final: false)
{
  "session_id": "sess_12345abcde",
  "transcript": "the customer said they want",
  "is_final": false,
  "is_last": false,
  "language": "en",
  "words": [
    {"word": "the", "start": 0.00, "end": 0.12, "confidence": 0.98},
    {"word": "customer", "start": 0.14, "end": 0.52, "confidence": 0.96},
    {"word": "said", "start": 0.54, "end": 0.74, "confidence": 0.94},
    {"word": "they", "start": 0.76, "end": 0.90, "confidence": 0.93},
    {"word": "want", "start": 0.92, "end": 1.10, "confidence": 0.72}
  ]
}
Final Response (is_final: true)
{
  "session_id": "sess_12345abcde",
  "transcript": "the customer said they want a refund",
  "is_final": true,
  "is_last": false,
  "language": "en",
  "words": [
    {"word": "the", "start": 0.00, "end": 0.12, "confidence": 0.98},
    {"word": "customer", "start": 0.14, "end": 0.52, "confidence": 0.96},
    {"word": "said", "start": 0.54, "end": 0.74, "confidence": 0.94},
    {"word": "they", "start": 0.76, "end": 0.90, "confidence": 0.93},
    {"word": "want", "start": 0.92, "end": 1.10, "confidence": 0.91},
    {"word": "a", "start": 1.10, "end": 1.14, "confidence": 0.97},
    {"word": "refund", "start": 1.14, "end": 1.48, "confidence": 0.95}
  ]
}
Notice that the word "want" had a confidence of 0.72 in the partial event. The model was uncertain whether more audio would follow and change the interpretation. In the final event, with the complete context, it scores 0.91. The word did not change, but the model's certainty about it did.
This is why acting on partial transcript content is architecturally risky. A word at low confidence in a partial might resolve to something different in the final. Any downstream action triggered by the partial would have been based on an unstable input.
The correct pattern is to use partial transcripts for user-facing display only, where visible corrections feel natural and expected, and to gate all application logic on final transcripts.
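That pattern is simple to express in code. This is a minimal sketch using the event shape from the JSON examples above; `render_caption` and `handle_turn` stand in for real application code:

```python
# Display-vs-commit pattern: partials only ever update the visible caption,
# while application logic is gated on final transcript events.
committed_turns = []
live_caption = ""

def render_caption(text: str):
    global live_caption
    live_caption = text                   # overwrite in place; corrections are fine

def handle_turn(text: str):
    committed_turns.append(text)          # stable input for downstream logic

def on_transcript_event(event: dict):
    if event.get("is_final"):
        handle_turn(event["transcript"])  # act only on committed results
        render_caption("")                # clear the in-progress caption
    else:
        render_caption(event["transcript"])  # partials are display-only

on_transcript_event({"transcript": "the customer said they want", "is_final": False})
on_transcript_event({"transcript": "the customer said they want a refund", "is_final": True})
print(committed_turns)
```

Note that the partial never reaches `handle_turn` at all; if the hypothesis had been revised before the final event, nothing downstream would need to be undone.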
Confidence scores and what they actually measure
Every word in a streaming transcript carries a confidence score, typically a probability between 0.0 and 1.0, representing how certain the model is about that prediction.
The confidence score is not a measure of whether the word is correct. It measures how much probability mass the model assigned to this particular output versus the alternatives it considered. A score of 0.95 means the model strongly preferred this word over all others it evaluated. A score of 0.60 means there were plausible alternatives that the model considered seriously.
Words with low confidence scores are disproportionately likely to be wrong, but the relationship is not one-to-one. A model can be highly confident and wrong, particularly on proper nouns or domain-specific terms not present in training data. And it can be somewhat uncertain and still produce the correct output.
The most useful application of confidence scores is flagging rather than filtering. Rather than discarding low-confidence words, mark them for downstream attention. In a customer service context, a low-confidence stretch in a critical part of a call is a signal to route the transcript for human review. In a voice agent, a low-confidence final transcript is a signal to ask for clarification rather than proceeding.
from smallestai.waves import AsyncWavesClient

LOW_CONFIDENCE = 0.75  # threshold below which a word is flagged for review

async def transcribe_with_flagging(file_path: str):
    async with AsyncWavesClient(api_key="YOUR_API_KEY") as client:
        result = await client.transcribe(
            file_path=file_path,
            word_timestamps=True,
        )
    words = result.get("words", [])
    # Flag low-confidence words for downstream attention instead of dropping them
    flagged = [w for w in words if w.get("confidence", 1.0) < LOW_CONFIDENCE]
    return result["transcription"], flagged
Paralinguistic signals alongside the transcript
The acoustic signal carries information that survives the conversion to text and information that does not. Tone, emotional register, estimated speaker age and gender are all present in the audio and absent from the transcript. A recognition service that surfaces these as structured metadata gives the application something the transcript alone cannot provide.
Emotion detection tags the emotional register of each segment. Gender and age detection provide demographic signals. These are extracted during the same inference pass as the transcript, using acoustic features the encoder computes regardless. They come without the cost of a separate processing pipeline.
from smallestai.waves import AsyncWavesClient

async def full_acoustic_analysis(file_path: str):
    async with AsyncWavesClient(api_key="YOUR_API_KEY") as client:
        result = await client.transcribe(
            file_path=file_path,
            language="en",
            word_timestamps=True,
            age_detection=True,
            gender_detection=True,
            emotion_detection=True,
        )
    print("Transcript:", result.get("transcription"))

    # Handle emotions as a dictionary
    if "emotions" in result:
        print("Detected emotions:")
        for emotion, score in result["emotions"].items():
            print(f"  {emotion.capitalize()}: {score:.2f}")

    print("Detected gender:", result.get("gender"))
    print("Estimated age range:", result.get("age"))
These are probabilistic estimates. They work well in aggregate and should be treated as signals rather than ground truth, particularly for individual utterances where the acoustic evidence may be ambiguous.
Streaming TTS and closing the audio loop
A streaming ASR system rarely lives alone. In a voice agent, the transcript from the recognition service feeds a language model, which generates a response that must be converted back to audio and played to the user. The latency of that full loop determines whether the agent feels conversational.
The dominant contributor is LLM reasoning, typically 300 to 500ms even with a fast model. The highest-impact optimization is therefore starting TTS synthesis before the LLM has finished generating, feeding the streaming token output rather than waiting for the complete response.
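The sentence-level chunking this requires can be sketched independently of any TTS client. The token stream and sentence-boundary heuristic below are illustrative placeholders; a production system would handle abbreviations, numbers, and punctuation inside quotes:

```python
# Group a streamed LLM token sequence into sentences so synthesis can start
# at the first sentence boundary instead of after the full reply.
SENTENCE_ENDINGS = (".", "!", "?")

def sentences_from_tokens(tokens):
    """Yield complete sentences as tokens arrive from the LLM stream."""
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.rstrip().endswith(SENTENCE_ENDINGS):
            yield buf.strip()             # hand this chunk to TTS immediately
            buf = ""
    if buf.strip():
        yield buf.strip()                 # flush any trailing fragment

tokens = ["Sure", ",", " I can", " help", ".", " What", " is",
          " your", " order", " number", "?"]
chunks = list(sentences_from_tokens(tokens))
print(chunks)
```

Each yielded sentence would be passed to the streaming synthesizer as soon as it completes, so audio for the first sentence plays while the LLM is still generating the second.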
Smallest.ai's WavesStreamingTTS connects to the synthesis service via a persistent WebSocket and accepts text chunks as they arrive from the LLM token stream. Audio chunks come back as each sentence is ready, so playback can begin in under 100ms from the first LLM token.
Basic Setup
from smallestai.waves import TTSConfig, WavesStreamingTTS
import wave

config = TTSConfig(
    voice_id="magnus",
    api_key="YOUR_SMALLEST_API_KEY",
    sample_rate=24000,
    speed=1.0,
    max_buffer_flush_ms=100,
)
streaming_tts = WavesStreamingTTS(config)
Standard Streaming
text = "Streaming delivers audio in real-time for voice assistants and chatbots."
audio_chunks = list(streaming_tts.synthesize(text))

with wave.open("streamed.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    wf.writeframes(b"".join(audio_chunks))
Multilingual streaming and code-switching
Streaming ASR systems designed around a single language make assumptions that break in multilingual environments. A model trained primarily on English will produce poor results on Hindi and may fail on code-switching utterances that mix the two within a single sentence.
Smallest.ai's Lightning ASR model supports 30 languages including Hindi, German, French, Spanish, Italian, Portuguese, Russian, Arabic, Polish, Dutch, Tamil, Bengali, Gujarati, Kannada and Malayalam, with a multi mode for code-switching environments where speakers alternate between languages.
import os

from smallestai.waves import AsyncWavesClient

async def multilingual_transcription(file_path: str):
    async with AsyncWavesClient(api_key=os.getenv("SMALLEST_API_KEY")) as client:
        result = await client.transcribe(
            file_path=file_path,
            language="auto",  # Auto Language Detection
            word_timestamps=True,
            emotion_detection=True,
        )
    return result["transcription"]
Code-switching is architecturally harder than single-language recognition because the model cannot assume a stable phoneme inventory or language model distribution. The multi mode handles detection and switching internally, removing the need for the application to route audio to different models based on detected language.
What the architecture means for how you build
Understanding how streaming ASR works at each layer changes what you build around it.
Because partial transcripts are revised as more audio arrives, any application logic that needs to act on what the user said must wait for a final transcript event. Displaying partials in a UI is fine because visible corrections feel natural. Triggering downstream actions, database lookups, tool calls, or routing decisions on partial content is architecturally unsound. The final event is the contract.
Because word confidence scores reflect model uncertainty rather than correctness, the right use is flagging rather than filtering. A word with 0.60 confidence is a candidate for human review or a clarification prompt, not something to silently drop from the transcript.
Because the endpointer determines when final transcripts are emitted, the responsiveness of the system to turn endings is determined by endpointing quality, not model speed. A fast model behind a slow endpointer still feels slow. Endpointing latency is worth measuring explicitly and separately from overall model throughput.
Because WebSocket connections carry state, connection management becomes an application concern. Dropped connections need reconnection logic. Audio buffered during a reconnect gap needs to either be replayed or explicitly acknowledged as lost. These failure modes do not exist in batch transcription and need to be designed for in streaming systems from the start.
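A minimal sketch of that reconnection concern, assuming nothing about any particular client library: a capped exponential backoff schedule plus explicit tracking of audio buffered during the gap. The delay values are illustrative defaults, not recommendations:

```python
# Reconnection bookkeeping for a stateful streaming connection: backoff
# between attempts, and an explicit buffer of audio captured while offline
# so it can be replayed (or deliberately declared lost) after reconnect.
def backoff_schedule(attempts: int, base: float = 0.5, cap: float = 8.0):
    """Delays in seconds before each reconnect attempt, capped exponential."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]

class ReconnectingStream:
    def __init__(self):
        self.pending = []                 # audio buffered while disconnected

    def on_disconnect(self):
        self.pending.clear()              # start tracking the gap

    def buffer_audio(self, packet: bytes):
        self.pending.append(packet)       # held for replay after reconnect

    def on_reconnect(self, send):
        for packet in self.pending:       # replay, or log as lost if too stale
            send(packet)
        self.pending.clear()
```

The important design decision is not the backoff curve but the explicit `pending` buffer: without it, audio captured during a reconnect gap silently disappears and the transcript has an unexplained hole.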
The technology is capable. The architecture around it determines whether that capability reaches the user.
Tools referenced in this piece: AsyncWavesClient, WavesStreamingTTS, Pulse STT.


