Whisper + Custom Prompts: Turning Messy Voice Into Structured Data

The hardest part of voice-to-data isn't the transcription. It's making sense of someone thinking out loud.

I've been building Voice Tables — a tool that lets you speak naturally and get structured spreadsheet rows back. No forms, no typing, just talk. Under the hood, it's a two-stage pipeline: Whisper handles the transcription, then a custom prompt chain extracts structured fields from the raw text.

Here's how it actually works, and where things get tricky.

The Pipeline

The architecture is deceptively simple:

Voice recording → Whisper transcription → Prompt-based extraction → Structured row

Stage one (Whisper) is mostly a solved problem. Stage two — turning a messy human monologue into clean, column-mapped data — is where the real engineering lives.
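
Before digging into each stage, here is the rough shape of the whole thing in code. This is a simplified sketch: transcribeAudio and extractRows are hypothetical wrappers around the Whisper call and the prompt-based extraction described below.

// Simplified sketch of the two-stage pipeline.
// transcribeAudio and extractRows are hypothetical wrappers around
// the Whisper call and the prompt-based extraction covered below.
async function processRecording(audioFile, columns) {
  // Stage 1: speech to raw text
  const transcript = await transcribeAudio(audioFile);

  // Stage 2: raw text to column-mapped rows
  const rows = await extractRows(transcript, columns);

  return rows;
}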

Whisper Configuration

For Voice Tables, I run Whisper with these considerations:

// Whisper config essentials
const whisperConfig = {
  model: "whisper-1",
  language: null,            // auto-detect — users speak CZ, EN, DE...
  temperature: 0,            // deterministic output
  response_format: "text"    // plain text, no timestamps needed
};

A few things I learned the hard way:

Auto-detect language works surprisingly well. When you're running a product across multiple markets, hardcoding language: "en" breaks the moment a Czech user starts speaking. Let Whisper figure it out.

Temperature 0 matters more than you'd think. Even small variations in transcription can cascade into extraction errors downstream. "Twenty three" vs "23" vs "twenty-three" — each triggers different parsing paths.

Model size tradeoffs: The large model catches more edge cases (mumbling, background noise, accented speech) but adds latency. For a real-time-ish UX, the base model with a retry fallback to large works well.
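
For illustration, the fallback logic is roughly this. It's only a sketch and assumes you can reach two differently sized Whisper deployments (e.g. self-hosted base and large); transcribeWith is a hypothetical helper, and the length check stands in for whatever quality signal you trust.

// Sketch of the base-then-large retry fallback. Assumes two Whisper
// deployments are available; transcribeWith is a hypothetical helper.
async function transcribeWithFallback(audioFile) {
  const fast = await transcribeWith("base", audioFile);

  // Empty or suspiciously short output is treated as a failed pass;
  // retry once with the large model before giving up.
  if (!fast || fast.trim().length < 5) {
    return transcribeWith("large", audioFile);
  }
  return fast;
}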

The Extraction Prompt

This is where it gets interesting. Raw Whisper output looks like this:

"So I had a meeting with, uh, John from Acme Corp yesterday, 
it was about the Q2 renewal, they want to bump it up to 
fifty thousand annually, told him I'd get back by Friday"

And you need to extract:

| Contact | Company | Topic | Amount | Deadline |
| --- | --- | --- | --- | --- |
| John | Acme Corp | Q2 renewal | $50,000/yr | Friday |

The prompt template is column-aware — it knows what fields exist in the user's table and adapts extraction accordingly:

function buildExtractionPrompt(columns, transcript) {
  const columnDefs = columns
    .map(c => `- ${c.name} (${c.type}): ${c.description || "no description"}`)
    .join("\n");

  return `Extract structured data from this voice transcript.

Target columns:
${columnDefs}

Rules:
1. Extract ONLY the columns listed above
2. If a value is ambiguous, use your best interpretation
3. If a value is missing from the transcript, use null
4. For dates, interpret relative references relative to today
5. For amounts, normalize to numbers ("fifty thousand" → 50000)
6. Return valid JSON array of objects

Transcript:
"${transcript}"

Respond with ONLY the JSON array.`;
}

The key insight: column descriptions are the secret weapon. When a user sets up their table with a "Company" column described as "the organization, not the person," extraction accuracy jumps dramatically. The prompt doesn't have to guess — it has context.
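
As a concrete example, a CRM-style table might be defined like this. The column shape matches what buildExtractionPrompt expects; the specific fields and descriptions are just illustrative.

const columns = [
  { name: "Contact",  type: "text",   description: "the person's name" },
  { name: "Company",  type: "text",   description: "the organization, not the person" },
  { name: "Amount",   type: "number", description: "annual contract value in USD" },
  { name: "Deadline", type: "date",   description: "when to follow up" }
];

const prompt = buildExtractionPrompt(columns, transcript);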

The Validation Layer

Garbage in, garbage out. Voice input is inherently messy, so the validation layer does heavy lifting:

function validateExtraction(extracted, columns) {
  return extracted.map(row => {
    const clean = {};

    for (const col of columns) {
      let value = row[col.name] ?? null;  // treat missing keys as null so the required check below fires

      // Type coercion
      if (col.type === "number" && typeof value === "string") {
        value = parseFloat(value.replace(/[^0-9.-]/g, ""));
        if (isNaN(value)) value = null;
      }

      // Date normalization
      if (col.type === "date" && value) {
        value = parseRelativeDate(value) || value;
      }

      // Flag missing required fields
      if (value === null && col.required) {
        clean._warnings = clean._warnings || [];
        clean._warnings.push(`Missing required: ${col.name}`);
      }

      clean[col.name] = value;
    }
    return clean;
  });
}

The _warnings array is surfaced in the UI so users can quickly spot and fix extractions that need human review. You don't want a tool that silently produces wrong data.
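
The parseRelativeDate helper referenced above isn't shown in full; a minimal sketch of the idea looks like this (a real implementation should lean on a date library and respect the user's locale and timezone):

// Minimal sketch of the relative-date helper used in validateExtraction.
// Real code should use a proper date library and handle locale/timezone.
function parseRelativeDate(value) {
  const text = String(value).toLowerCase().trim();
  const today = new Date();

  if (text === "today") return today.toISOString().slice(0, 10);
  if (text === "tomorrow") {
    const d = new Date(today);
    d.setDate(d.getDate() + 1);
    return d.toISOString().slice(0, 10);
  }

  // Fall back to whatever Date can parse; null lets the caller keep the raw value.
  const parsed = new Date(value);
  return isNaN(parsed.getTime()) ? null : parsed.toISOString().slice(0, 10);
}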

Edge Cases That Will Break You

After processing thousands of voice entries across Voice Tables, here are the edge cases that consumed the most debugging time:

Homophony. "Their" vs "there" vs "they're" — Whisper usually gets it right, but when it doesn't and you're extracting a company name, you get phantom entries. The fix: fuzzy matching against known entities when the column has existing data.
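
The match itself doesn't have to be sophisticated. Something along these lines catches most phantom entries; the normalization and containment check are illustrative, not the production logic.

// Rough sketch of snapping an extracted value to an entity already in the column.
// The normalization and matching rules here are illustrative.
function matchKnownEntity(value, knownValues) {
  const normalize = s => s.toLowerCase().replace(/[^a-z0-9]/g, "");
  const target = normalize(value);
  if (!target) return value;

  for (const known of knownValues) {
    const candidate = normalize(known);
    // Exact match after normalization, or one contains the other
    // ("acme" vs "acmecorp"): close enough to reuse the existing entity.
    if (candidate === target || candidate.includes(target) || target.includes(candidate)) {
      return known;
    }
  }
  return value; // no match, keep what the model extracted
}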

Partial sentences. People pause, restart, contradict themselves. "The price is... actually no, make it twelve hundred... wait, fifteen hundred." The prompt needs explicit instructions: "If the speaker corrects themselves, use the final stated value."

Number ambiguity. "I need to order one fifty." Is that 1.50? 150? One unit of item #50? Context from column type and description helps, but this is genuinely hard. We added a confirmation step for ambiguous numbers.

Multi-row entries. Sometimes one recording contains data for multiple rows. "I talked to John about project Alpha and then Maria about project Beta." The extraction prompt handles this — it returns an array — but users are often surprised when one voice note creates two rows.

What I'd Do Differently

If I were starting Voice Tables from scratch:

  1. Start with constrained input. Let users define expected patterns ("I met [person] from [company] about [topic]") before going fully freeform. Constrained extraction is 10x more reliable. (A rough sketch of the idea follows this list.)

  2. Build the feedback loop earlier. Users correcting extractions is the best training signal. Every correction should tune the extraction prompt for that specific table.

  3. Batch processing matters. The initial version processed one recording at a time. Batching multiple short recordings with shared context (same meeting, same project) dramatically improves extraction quality.
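
For the constrained-input idea in point 1, the core trick is just compiling the user's pattern into a regex with named groups. A rough sketch; the [field] placeholder syntax is made up for illustration.

// Sketch of turning a user-defined pattern into a regex for constrained extraction.
// The [field] placeholder syntax is illustrative.
function patternToRegex(pattern) {
  // Escape regex metacharacters, then turn [field] placeholders into named groups.
  const escaped = pattern.replace(/[.*+?^${}()|\\]/g, "\\$&");
  const source = escaped.replace(/\[(\w+)\]/g, (_, field) => `(?<${field}>.+?)`);
  return new RegExp(source + "$", "i");
}

const re = patternToRegex("I met [person] from [company] about [topic]");
const match = "I met John from Acme Corp about the Q2 renewal".match(re);
// match.groups → { person: "John", company: "Acme Corp", topic: "the Q2 renewal" }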

The Stack

For anyone building something similar, here's what powers this at Inithouse:

  • Whisper API (OpenAI) for transcription
  • GPT-4o for extraction (structured outputs mode)
  • React + Supabase frontend and storage (built with Lovable)
  • Edge functions for the processing pipeline
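
For reference, the extraction call itself is not much more than this. A simplified sketch using the OpenAI Node SDK; the production version uses structured outputs rather than parsing free-form JSON, and error handling is omitted.

import OpenAI from "openai";

const openai = new OpenAI();

// Simplified extraction call: build the prompt, ask GPT-4o, parse the JSON array.
// The real pipeline uses structured outputs and validates before saving.
async function extractRows(transcript, columns) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    temperature: 0,
    messages: [{ role: "user", content: buildExtractionPrompt(columns, transcript) }]
  });

  const extracted = JSON.parse(completion.choices[0].message.content);
  return validateExtraction(extracted, columns);
}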

The whole thing runs surprisingly lean. Most of the complexity is in prompt engineering, not infrastructure.


Voice-to-structured-data is one of those problems that sounds simple until you actually build it. The transcription part is essentially commoditized. The real challenge is bridging the gap between how humans think out loud and how databases expect data to arrive.

If you're working on something similar, I'd love to hear what approaches you've tried. Drop a comment or find me on Dev.to.
