How Voice Tables by Inithouse turns speech into structured rows (Whisper + LLM function calling)

#ai #webdev #productivity #machinelearning

Voice Tables is a product we built at Inithouse, a studio running parallel product experiments. The pitch is simple: talk to your database instead of typing into it. Say "add a new lead, John from Acme, called yesterday, follow up Friday" and it becomes a structured row in a table.

This post walks through how the pipeline actually works.

The problem with voice and structured data

Voice input is messy. People don't speak in columns and rows. They say things like "met Sarah at the conference, she runs marketing at that fintech startup, seemed interested in the enterprise plan." That sentence contains a name, a company, a role, a context, and a deal stage. Getting a computer to parse that reliably into the right fields is the core technical challenge.

We tried several approaches across our portfolio at Inithouse before landing on the current architecture. Rule-based parsing broke on anything outside the expected pattern. Fine-tuned classification models needed too much training data for every new table schema. What worked was combining two existing tools in a specific way.

The pipeline: Whisper + LLM function calling

The architecture has three stages:

Stage 1: Speech to text (Whisper)

Audio from the browser goes to OpenAI's Whisper API. We use the whisper-1 model with language detection enabled. The key decision here: we send raw audio chunks, not processed segments. Whisper handles punctuation and sentence boundaries better when it gets longer context.

One thing we learned: Whisper's word-level timestamps are useful for debugging but not for the actual pipeline. What matters is the full transcription text that goes to stage 2.
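A minimal sketch of how the stage 1 request could be parameterised, assuming a multipart POST to OpenAI's audio transcription endpoint. The helper name transcriptionFields and the debug flag are hypothetical and not taken from the Voice Tables codebase; the field names themselves are the ones the transcription API accepts.

```typescript
// Hypothetical helper: the text fields of a multipart POST to
// https://api.openai.com/v1/audio/transcriptions (the audio file is attached separately).
type FormField = [string, string];

function transcriptionFields(debugTimestamps: boolean): FormField[] {
  // No "language" field is sent, so Whisper detects the spoken language itself.
  const fields: FormField[] = [["model", "whisper-1"]];
  if (debugTimestamps) {
    // Word-level timestamps are only returned with the verbose_json format.
    fields.push(["response_format", "verbose_json"]);
    fields.push(["timestamp_granularities[]", "word"]);
  } else {
    // Stage 2 only needs the transcription text.
    fields.push(["response_format", "text"]);
  }
  return fields;
}

// The fields are then appended to the same form as the recording, e.g.
//   form.append("file", audioBlob, "input.webm");
//   transcriptionFields(false).forEach(([key, value]) => form.append(key, value));
```

Keeping the debug variant behind a flag matches the observation above: timestamps help when inspecting a bad transcription, but the pipeline itself only consumes plain text.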

Stage 2: Schema-aware extraction (LLM function calling)

This is where it gets interesting. We take the transcription and the user's table schema (column names, types, descriptions) and construct an LLM function call. The function signature mirrors the table structure:

declare function add_row(
  name: string,        // Contact name
  company: string,     // Company name
  role: string,        // Job title or role
  context: string,     // How/where you met
  next_step: string,   // Next action item
  priority: "high" | "medium" | "low"
): void;

The LLM (we use GPT-4o-mini for speed, GPT-4o for complex schemas) receives the transcription plus this function definition. Function calling pins the output to that shape: the model fills each parameter it can extract and returns null for, or leaves out, the ones it can't. No free-text responses, no hallucinated columns.
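To make "the function signature mirrors the table structure" concrete, here is a sketch that derives a tool definition from a list of columns, in the shape the Chat Completions tools parameter expects. The Column interface and buildAddRowTool are hypothetical names; the post does not show how Voice Tables stores its schemas.

```typescript
// Hypothetical column description; the real Voice Tables schema format is not shown in the post.
interface Column {
  name: string;
  type: "string" | "number" | "boolean";
  description: string;
  required?: boolean;
  options?: string[]; // allowed values for enum-like columns
}

interface ParamSchema {
  type: string;
  description: string;
  enum?: string[];
}

// Builds one function tool per table: one described parameter per column.
function buildAddRowTool(tableName: string, columns: Column[]) {
  const properties: { [name: string]: ParamSchema } = {};
  const required: string[] = [];
  for (const col of columns) {
    const param: ParamSchema = { type: col.type, description: col.description };
    if (col.options) param.enum = col.options;
    properties[col.name] = param;
    if (col.required) required.push(col.name);
  }
  return {
    type: "function" as const,
    function: {
      name: "add_row",
      description: "Add one row to the table '" + tableName + "'",
      parameters: { type: "object" as const, properties, required },
    },
  };
}

const leadsTool = buildAddRowTool("leads", [
  { name: "name", type: "string", description: "Contact name", required: true },
  { name: "company", type: "string", description: "Company name" },
  { name: "priority", type: "string", description: "Deal priority", options: ["high", "medium", "low"] },
]);
```

The resulting object would go into the tools array of the request, with tool_choice set to { type: "function", function: { name: "add_row" } } so the model has to call it rather than answer in prose.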

Stage 3: Validation and insertion

The extracted parameters go through type validation (dates parsed, enums checked, required fields verified). If validation passes, the row inserts into the user's Supabase table. If it fails, the user gets a clear message about what's missing.
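The checks in stage 3 could look roughly like the sketch below. ColumnRule, validateRow, and the error wording are hypothetical, and the sketch assumes dates already arrive as YYYY-MM-DD strings (resolving "Friday" to a calendar date is not covered here). The Supabase insert is left out; it would only run when ok is true.

```typescript
// Hypothetical rule and row shapes; the actual validation rules in Voice Tables are not published.
interface ColumnRule {
  name: string;
  kind: "text" | "date" | "enum";
  required?: boolean;
  options?: string[];
}
type RawRow = { [column: string]: string | null | undefined };

interface ValidationResult {
  ok: boolean;
  row: { [column: string]: string | null };
  errors: string[]; // human-readable messages to show the user
}

function validateRow(rules: ColumnRule[], raw: RawRow): ValidationResult {
  const row: { [column: string]: string | null } = {};
  const errors: string[] = [];
  for (const rule of rules) {
    const value = raw[rule.name];
    if (value === null || value === undefined || value.trim() === "") {
      // Required fields must be present; optional ones are stored as null.
      if (rule.required) errors.push("Missing required field: " + rule.name);
      row[rule.name] = null;
      continue;
    }
    if (rule.kind === "date" && !/^\d{4}-\d{2}-\d{2}$/.test(value)) {
      errors.push(rule.name + " is not a YYYY-MM-DD date: " + value);
    } else if (rule.kind === "enum" && (rule.options || []).indexOf(value) === -1) {
      errors.push(rule.name + " must be one of: " + (rule.options || []).join(", "));
    } else {
      row[rule.name] = value;
    }
  }
  return { ok: errors.length === 0, row, errors };
}
```

Collecting every error instead of stopping at the first one lets the message to the user list everything that is missing in one go.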

What we measured

Across our testing, accuracy on well-defined schemas (clear column names, type hints, 1-2 word descriptions) sits around 92-95% for English input. Accuracy drops to ~85% when column names are ambiguous ("status" vs "stage" vs "phase" in the same table) or when the voice input references data that requires context the model doesn't have.

Speed: the full pipeline (audio upload + Whisper + LLM call + validation + insert) takes 2-4 seconds for a typical single-row input. Most of that is the Whisper transcription.

Why function calling, not JSON mode

We tested both. JSON mode works, but function calling has two advantages for this use case. First, a function definition maps more naturally onto a table: one named function per table, one described parameter per column. (The parameters are still declared as JSON Schema under the hood, so this is about ergonomics, not capability.) Second, function calling handles optional fields better: a parameter that isn't listed as required can simply be omitted when the model can't extract it, rather than guessed or filled with an empty string.
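Handling omitted parameters on the receiving side can be sketched as follows. The tool call's arguments arrive as a JSON string; normalizeArgs is a hypothetical helper that maps that string onto the known columns.

```typescript
// Hypothetical helper: align the model's tool-call arguments with the table's columns.
type Cell = string | number | boolean | null;

function normalizeArgs(columnNames: string[], argumentsJson: string): { [column: string]: Cell } {
  const parsed = JSON.parse(argumentsJson) as { [key: string]: Cell | undefined };
  const row: { [column: string]: Cell } = {};
  for (const name of columnNames) {
    const value = parsed[name];
    // An omitted parameter becomes null, never an empty string or a guess.
    row[name] = value === undefined ? null : value;
  }
  // Keys that are not columns are never copied, so stray keys cannot reach the table.
  return row;
}

const normalized = normalizeArgs(
  ["name", "company", "priority"],
  '{"name":"John","company":"Acme","mood":"happy"}'
);
```

Here priority was omitted by the model and ends up as null, while the unknown key mood is dropped before validation.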

What we'd do differently

If we rebuilt this today, we'd add a confidence score per field. Right now it's binary: either the field is extracted or it's null. A confidence score would let us highlight uncertain extractions for the user to review, which matters more as schemas get complex.
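One possible shape for that idea, purely as a sketch: each parameter becomes a { value, confidence } pair, and anything below a threshold is flagged for review. ScoredField, fieldsToReview, and the 0.7 threshold are hypothetical, and a score reported by the model itself is an assumption here, not a calibrated probability.

```typescript
// Hypothetical shape: each parameter carries a value plus a self-reported confidence.
interface ScoredField {
  value: string | null;
  confidence: number; // 0..1, as reported by the model (assumption, not calibrated)
}
type ScoredRow = { [column: string]: ScoredField };

// Returns the columns whose extracted value should be highlighted for the user.
function fieldsToReview(row: ScoredRow, threshold: number): string[] {
  const flagged: string[] = [];
  for (const column of Object.keys(row)) {
    const field = row[column];
    // Null fields are already visibly empty, so only low-confidence values are flagged.
    if (field !== undefined && field.value !== null && field.confidence < threshold) {
      flagged.push(column);
    }
  }
  return flagged;
}

const toReview = fieldsToReview(
  {
    name: { value: "Sarah", confidence: 0.95 },
    company: { value: "fintech startup", confidence: 0.4 },
    next_step: { value: null, confidence: 0 },
  },
  0.7
);
```

In this example only company would be highlighted: the name is confident, and next_step is null rather than uncertain.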

We'd also explore local Whisper models for latency-sensitive use cases. The round trip to the hosted API adds 500-800ms that a local model could skip, though at the cost of accuracy on non-English input.

Try it

Voice Tables is live and free to start. If you're building something similar, the key insight from our work at Inithouse is this: function calling with schema-derived signatures gives you structured extraction without training data. The schema is the prompt.

We've seen similar patterns work across other products in the portfolio. Be Recommended, our AI visibility tool, uses a related approach for structured report generation from unstructured AI responses. The principle holds: give the LLM a rigid output shape, and it fills it reliably.