Voice input sounds simple until you actually try to parse it. Someone says "add thirty units of the premium package for Novák company, delivered next Tuesday" and suddenly you're dealing with accent-mangled names, ambiguous dates, and a number that might be "thirty" or "dirty" depending on background noise.
I've been building Voice Tables — a tool that lets you fill structured data tables using just your voice. The core pipeline is Whisper for transcription plus an LLM with function calling for extraction. Here's what I learned making it actually work.
## The Pipeline
The flow looks straightforward on paper:
Audio → Whisper API → Raw text → LLM + Schema → Structured JSON → Table row
But each arrow hides a pile of edge cases.
Step 1: Whisper transcription. We send audio chunks to OpenAI's Whisper API. For most English input, word error rate sits around 4-6%. But throw in Czech names, technical jargon, or someone dictating from a noisy warehouse — error rate jumps to 12-18%. That's where custom prompts come in.
Whisper accepts an optional prompt parameter. Most people ignore it. We don't. We prepend domain vocabulary directly:
```python
whisper_prompt = "Vocabulary: Novák, Dvořák, Svoboda, Premium Package, Express Delivery"
```
This single trick dropped our CZ-name misrecognition rate from ~35% down to about 12%. Whisper uses the prompt as a conditioning signal — it biases the model toward expected tokens without hard-constraining it.
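If you're calling Whisper through the official openai Python SDK, the prompt is a single extra argument. A minimal sketch (the file path is a placeholder; the vocabulary string is the one above):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Domain vocabulary nudges Whisper toward the tokens we expect to hear.
whisper_prompt = "Vocabulary: Novák, Dvořák, Svoboda, Premium Package, Express Delivery"

with open("entry.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt=whisper_prompt,  # conditioning signal, not a hard constraint
    )

transcript = transcription.text
```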
Step 2: LLM extraction with schema enforcement. Raw transcript goes to an LLM with a function call definition that mirrors the target table schema:
```json
{
  "name": "add_table_row",
  "parameters": {
    "type": "object",
    "properties": {
      "company": { "type": "string" },
      "quantity": { "type": "integer" },
      "product": { "type": "string", "enum": ["Basic", "Premium", "Enterprise"] },
      "delivery_date": { "type": "string", "format": "date" }
    },
    "required": ["company", "quantity", "product"]
  }
}
```
The schema acts as a contract. The LLM can't return free-form text — it must fit the structure or fail validation. This is where function calling shines over plain prompting.
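On the Chat Completions API, the extraction call looks roughly like this. It's a sketch rather than our exact production code: the model name and prompts are illustrative, and `tool_choice` forces the model to emit the function call instead of replying in prose:

```python
import json
from openai import OpenAI

client = OpenAI()

ADD_ROW_TOOL = {
    "type": "function",
    "function": {
        "name": "add_table_row",
        "parameters": {
            "type": "object",
            "properties": {
                "company": {"type": "string"},
                "quantity": {"type": "integer"},
                "product": {"type": "string", "enum": ["Basic", "Premium", "Enterprise"]},
                "delivery_date": {"type": "string", "format": "date"},
            },
            "required": ["company", "quantity", "product"],
        },
    },
}

def extract_row(transcript: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use whatever model you actually run
        messages=[
            {"role": "system", "content": "Extract one table row from the dictated text."},
            {"role": "user", "content": transcript},
        ],
        tools=[ADD_ROW_TOOL],
        tool_choice={"type": "function", "function": {"name": "add_table_row"}},
    )
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)
```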
## The Hard Parts
Accent variability. Czech speakers dictating in English produce phoneme patterns Whisper wasn't optimized for. "Thirty" becomes "dirty," "Novák" becomes "no vac." Our Whisper prompt helps, but we also run a fuzzy match post-step: if the extracted company name is within Levenshtein distance 2 of a known entity, we snap to the known value.
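The snapping step is small enough to show in full. A self-contained sketch (the function names are mine for illustration, not Voice Tables internals):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert / delete / substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def snap_company(extracted: str, known_companies: list[str], max_distance: int = 2) -> str:
    # Snap to the closest known entity if it is within the threshold,
    # otherwise keep whatever the LLM extracted.
    best = min(known_companies, key=lambda name: levenshtein(extracted.lower(), name.lower()))
    return best if levenshtein(extracted.lower(), best.lower()) <= max_distance else extracted

# snap_company("Novak", ["Novák", "Dvořák", "Svoboda"]) -> "Novák"
```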
Date resolution. "Next Tuesday" is contextual. "End of month" is contextual. "ASAP" means different things to different people. We pass current date context in the system prompt and instruct the LLM to resolve all relative dates to ISO format. Ambiguous cases get flagged — the table cell shows a yellow indicator and the user confirms with one tap.
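The date-context part is just string assembly; something along these lines, where the exact wording is an illustration rather than our production prompt:

```python
from datetime import date

def build_system_prompt(today: date | None = None) -> str:
    # Give the LLM an anchor so "next Tuesday" has exactly one meaning.
    today = today or date.today()
    return (
        f"Today is {today.isoformat()} ({today.strftime('%A')}). "
        "Resolve relative dates ('next Tuesday', 'end of month') to ISO 8601 "
        "YYYY-MM-DD in delivery_date. If a date is genuinely ambiguous, omit it "
        "so the UI can ask the user to confirm."
    )
```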
Schema validation retries. Sometimes the LLM returns JSON that doesn't match the schema — wrong enum value, missing required field, number as string. We run validation immediately and retry once with the error message appended:
```python
error = validate(result, schema)  # None when the output matches the schema
if error:
    retry_prompt = f"Previous output failed validation: {error}. Fix and retry."
    result = llm_call(transcript, schema, retry_prompt)
```
First-pass validation success rate is about 89%. With one retry, it climbs to 97%. Two retries hit 99%+, but we cap at one to keep typical latency under 3 seconds.
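One way to implement the `validate()` helper above is with the jsonschema package, checking the output against the `parameters` object of the function schema. A sketch under that assumption:

```python
import jsonschema

def validate(result: dict, function_schema: dict) -> str | None:
    # Returns None when the extracted row matches the schema's "parameters" object,
    # otherwise a short error message we can feed back to the LLM on the retry.
    try:
        jsonschema.validate(instance=result, schema=function_schema["parameters"])
        return None
    except jsonschema.ValidationError as exc:
        return exc.message
```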
## Real Numbers
End-to-end for a typical voice entry (5-15 seconds of speech, English with occasional Czech names):
- Whisper transcription: ~800ms (API, including network)
- LLM extraction: ~1200ms (function calling, streaming)
- Validation + optional retry: ~200ms (or +1200ms on retry)
- Total p50: ~2.2 seconds
- Total p95: ~3.8 seconds
- Cost per entry: ~$0.003 (Whisper) + ~$0.002 (LLM) = roughly half a cent
For context, manual data entry for the same row takes 15-30 seconds and has a higher error rate on numeric fields.
## What's Next
We're experimenting with streaming transcription for real-time feedback — showing partial text as the user speaks, then snapping to structured form when they stop. Also looking at fine-tuning a smaller model specifically for our extraction step to cut that 1200ms down.
The combination of Whisper prompts + schema-enforced function calling is surprisingly robust for production use. The key insight: treat voice input as a noisy channel and design every step to gracefully handle the noise rather than assuming clean input.
If you're building anything similar, the Whisper prompt trick alone is worth trying — it's a one-line change with measurable impact.
I'm Jakub, building Voice Tables and other products at Inithouse. We're a small studio running ~14 MVPs, all in early stages, all learning fast. Voice Tables is one of the tools where the tech challenge turned out to be more interesting than expected — HereWeAsk and BeRecommended are two others from the portfolio tackling different problems.