The hardest part of voice-to-data isn't the transcription. It's making sense of someone thinking out loud.
I've been building Voice Tables — a tool that lets you speak naturally and get structured spreadsheet rows back. No forms, no typing, just talk. Under the hood, it's a two-stage pipeline: Whisper handles the transcription, then a custom prompt chain extracts structured fields from the raw text.
Here's how it actually works, and where things get tricky.
## The Pipeline
The architecture is deceptively simple:
Voice recording → Whisper transcription → Prompt-based extraction → Structured row
Stage one (Whisper) is mostly a solved problem. Stage two — turning a messy human monologue into clean, column-mapped data — is where the real engineering lives.
## Whisper Configuration
For Voice Tables, I run Whisper with these settings:
```javascript
// Whisper config essentials
const whisperConfig = {
  model: "whisper-1",
  language: null,          // auto-detect — users speak CZ, EN, DE...
  temperature: 0,          // deterministic output
  response_format: "text"  // plain text, no timestamps needed
};
```
A few things I learned the hard way:
**Auto-detect language works surprisingly well.** When you're running a product across multiple markets, hardcoding `language: "en"` breaks the moment a Czech user starts speaking. Let Whisper figure it out.
**Temperature 0 matters more than you'd think.** Even small variations in transcription can cascade into extraction errors downstream. "Twenty three" vs "23" vs "twenty-three" — each triggers different parsing paths.
**Model size tradeoffs:** The large model catches more edge cases (mumbling, background noise, accented speech) but adds latency. For a real-time-ish UX, the base model with a retry fallback to large works well.
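A minimal sketch of that fallback, assuming a generic `transcribe(audio, { model })` client; the length check is a deliberately crude stand-in for a real confidence signal:

```javascript
// Try the fast model first, fall back to the larger one when the result
// looks unusable. `transcribe` is a placeholder for whatever Whisper
// client you use; the heuristic is illustrative, not the production logic.
async function transcribeWithFallback(audio, transcribe) {
  const fast = await transcribe(audio, { model: "base" }).catch(() => null);
  // Empty or suspiciously short output is a cheap "low confidence" signal
  if (fast && fast.trim().length > 10) return fast;
  return transcribe(audio, { model: "large" });
}
```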
## The Extraction Prompt
This is where it gets interesting. Raw Whisper output looks like this:
```
"So I had a meeting with, uh, John from Acme Corp yesterday,
it was about the Q2 renewal, they want to bump it up to
fifty thousand annually, told him I'd get back by Friday"
```
And you need to extract:
| Contact | Company | Topic | Amount | Deadline |
|---|---|---|---|---|
| John | Acme Corp | Q2 renewal | $50,000/yr | Friday |
The prompt template is column-aware — it knows what fields exist in the user's table and adapts extraction accordingly:
```javascript
function buildExtractionPrompt(columns, transcript) {
  const columnDefs = columns
    .map(c => `- ${c.name} (${c.type}): ${c.description || "no description"}`)
    .join("\n");

  return `Extract structured data from this voice transcript.

Target columns:
${columnDefs}

Rules:
1. Extract ONLY the columns listed above
2. If a value is ambiguous, use your best interpretation
3. If a value is missing from the transcript, use null
4. For dates, interpret relative references relative to today
5. For amounts, normalize to numbers ("fifty thousand" → 50000)
6. Return valid JSON array of objects

Transcript:
"${transcript}"

Respond with ONLY the JSON array.`;
}
```
The key insight: column descriptions are the secret weapon. When a user sets up their table with a "Company" column described as "the organization, not the person," extraction accuracy jumps dramatically. The prompt doesn't have to guess — it has context.
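To make that concrete, here's what a hypothetical column setup with descriptions might look like, rendered the same way `buildExtractionPrompt` renders it (these exact columns are illustrative):

```javascript
// Hypothetical column definitions for the CRM example above. The
// `description` field is free text the user writes when setting up
// the table; it gets injected straight into the extraction prompt.
const columns = [
  { name: "Contact", type: "text", description: "the person, not the company" },
  { name: "Company", type: "text", description: "the organization, not the person" },
  { name: "Amount", type: "number", description: "annual contract value in USD" }
];

// Rendered into the prompt's "Target columns" section:
const columnDefs = columns
  .map(c => `- ${c.name} (${c.type}): ${c.description || "no description"}`)
  .join("\n");
```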
## The Validation Layer
Garbage in, garbage out. Voice input is inherently messy, so the validation layer does the heavy lifting:
```javascript
function validateExtraction(extracted, columns) {
  return extracted.map(row => {
    const clean = {};
    for (const col of columns) {
      let value = row[col.name];

      // Type coercion
      if (col.type === "number" && typeof value === "string") {
        value = parseFloat(value.replace(/[^0-9.-]/g, ""));
        if (isNaN(value)) value = null;
      }

      // Date normalization
      if (col.type === "date" && value) {
        value = parseRelativeDate(value) || value;
      }

      // Flag missing required fields (== null also catches keys
      // the model omitted entirely, i.e. undefined)
      if (value == null && col.required) {
        clean._warnings = clean._warnings || [];
        clean._warnings.push(`Missing required: ${col.name}`);
      }

      clean[col.name] = value ?? null;
    }
    return clean;
  });
}
```
The `_warnings` array is surfaced in the UI so users can quickly spot and fix extractions that need human review. You don't want a tool that silently produces wrong data.
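The `parseRelativeDate` helper isn't shown in this post; a minimal sketch might look like the following. It only understands a handful of phrases and works in UTC to stay deterministic; a real helper would need far broader coverage:

```javascript
// Minimal sketch of a relative-date parser. Resolves a few spoken phrases
// ("tomorrow", "friday", "next friday") against a reference date and
// returns an ISO date string, or null when the phrase isn't recognized.
function parseRelativeDate(text, now = new Date()) {
  const d = new Date(now);
  const t = String(text).trim().toLowerCase();
  const days = ["sunday", "monday", "tuesday", "wednesday",
                "thursday", "friday", "saturday"];
  if (t === "today") {
    // keep d as-is
  } else if (t === "yesterday") {
    d.setUTCDate(d.getUTCDate() - 1);
  } else if (t === "tomorrow") {
    d.setUTCDate(d.getUTCDate() + 1);
  } else {
    const idx = days.indexOf(t.replace(/^next\s+/, ""));
    if (idx === -1) return null; // not a phrase this sketch understands
    // Next occurrence of that weekday, never today itself
    const delta = ((idx - d.getUTCDay() + 7) % 7) || 7;
    d.setUTCDate(d.getUTCDate() + delta);
  }
  return d.toISOString().slice(0, 10);
}
```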
## Edge Cases That Will Break You
After processing thousands of voice entries across Voice Tables, here are the edge cases that consumed the most debugging time:
**Homophony.** "Their" vs "there" vs "they're" — Whisper usually gets it right, but when it doesn't and you're extracting a company name, you get phantom entries. The fix: fuzzy matching against known entities when the column has existing data.
**Partial sentences.** People pause, restart, contradict themselves. "The price is... actually no, make it twelve hundred... wait, fifteen hundred." The prompt needs explicit instructions: "If the speaker corrects themselves, use the final stated value."
**Number ambiguity.** "I need to order one fifty." Is that 1.50? 150? One unit of item #50? Context from column type and description helps, but this is genuinely hard. We added a confirmation step for ambiguous numbers.
**Multi-row entries.** Sometimes one recording contains data for multiple rows. "I talked to John about project Alpha and then Maria about project Beta." The extraction prompt handles this — it returns an array — but users are often surprised when one voice note creates two rows.
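The confirmation step for ambiguous numbers needs a detector first. A crude heuristic sketch (not the production rule set, which also uses column context):

```javascript
// Flag spoken numbers that commonly map to several values.
// "one fifty" could be 1.50, 150, or item #50; only context disambiguates.
function isAmbiguousNumber(phrase) {
  const units = "(one|two|three|four|five|six|seven|eight|nine)";
  const tens = "(ten|twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)";
  return new RegExp(`\\b${units}\\s+${tens}\\b`).test(phrase.toLowerCase());
}
```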
## What I'd Do Differently
If I were starting Voice Tables from scratch:
**Start with constrained input.** Let users define expected patterns ("I met [person] from [company] about [topic]") before going fully freeform. Constrained extraction is 10x more reliable.
**Build the feedback loop earlier.** Users correcting extractions is the best training signal. Every correction should tune the extraction prompt for that specific table.
**Batch processing matters.** The initial version processed one recording at a time. Batching multiple short recordings with shared context (same meeting, same project) dramatically improves extraction quality.
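Batching can be as simple as grouping transcripts by a shared context key before building the extraction prompt. A sketch (field names are illustrative):

```javascript
// Group short recordings by a caller-supplied context key (e.g. a meeting
// or project id) so the extraction prompt sees related notes together.
function groupByContext(recordings) {
  const batches = new Map();
  for (const r of recordings) {
    const key = r.contextKey || "default";
    if (!batches.has(key)) batches.set(key, []);
    batches.get(key).push(r.transcript);
  }
  // Each batch becomes one combined transcript for a single prompt
  return [...batches.entries()].map(([key, transcripts]) => ({
    key,
    combined: transcripts.join("\n---\n")
  }));
}
```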
## The Stack
For anyone building something similar, here's what powers this at Inithouse:
- Whisper API (OpenAI) for transcription
- GPT-4o for extraction (structured outputs mode)
- React + Supabase frontend and storage (built with Lovable)
- Edge functions for the processing pipeline
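For illustration, a structured-outputs extraction request is shaped roughly like this; the two-column schema below is a hypothetical example, not the actual Voice Tables schema:

```javascript
// Request payload sketch for OpenAI's structured outputs (json_schema
// response format). The schema mirrors the user's table columns; nullable
// types implement the "use null when missing" rule from the prompt.
const extractionRequest = {
  model: "gpt-4o",
  messages: [{ role: "user", content: "...extraction prompt..." }],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "extracted_rows",
      strict: true,
      schema: {
        type: "object",
        properties: {
          rows: {
            type: "array",
            items: {
              type: "object",
              properties: {
                Contact: { type: ["string", "null"] },
                Amount: { type: ["number", "null"] }
              },
              required: ["Contact", "Amount"],
              additionalProperties: false
            }
          }
        },
        required: ["rows"],
        additionalProperties: false
      }
    }
  }
};
```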
The whole thing runs surprisingly lean. Most of the complexity is in prompt engineering, not infrastructure.
Voice-to-structured-data is one of those problems that sounds simple until you actually build it. The transcription part is essentially commoditized. The real challenge is bridging the gap between how humans think out loud and how databases expect data to arrive.
If you're working on something similar, I'd love to hear what approaches you've tried. Drop a comment or find me on Dev.to.