Jakub

Posted on Jun 26

Building a Voice-to-Database Pipeline: How We Turn Speech Into Structured Rows

We build Voice Tables, an AI-native workspace where you talk instead of type. You say "add fifty oak planks, twelve euros each, delivery next Tuesday" and it becomes a structured row in your table. Here's how the pipeline works under the hood.

The problem with voice and structured data

Speech is messy. A single sentence like "add fifty oak planks, twelve euros each, delivery next Tuesday" packs in at least four distinct operations: entity extraction (oak planks), numeric parsing (fifty to 50, twelve to 12), date resolution (next Tuesday to a concrete date), and schema mapping (which column gets which value).
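To make those four operations concrete, here is a rough sketch of the row that sentence should end up as. The column names are borrowed from the schema example later in this post, the reference date is an assumption, and "next Tuesday" is read as the first Tuesday strictly after that date (other readings are possible).

```python
from datetime import date, timedelta

def next_weekday(today: date, weekday: int) -> date:
    """First date strictly after `today` that falls on `weekday` (Monday=0)."""
    days_ahead = (weekday - today.weekday() - 1) % 7 + 1
    return today + timedelta(days=days_ahead)

today = date(2025, 6, 26)  # assumed reference date (a Thursday)
TUESDAY = 1

row = {
    "item": "oak planks",                                       # entity extraction
    "quantity": 50,                                             # "fifty" -> 50
    "unit_price": 12.0,                                         # "twelve euros each"
    "delivery_date": next_weekday(today, TUESDAY).isoformat(),  # date resolution
}
print(row)
```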

Traditional NLP approaches required you to define rigid grammars and intent classifiers for each of these. You'd write a parser for dates, another for numbers, another for entity names, and wire them together with fragile if-else logic. It worked for narrow use cases (think "set a timer for five minutes") but fell apart the moment users spoke naturally.

The pipeline

Our voice-to-database flow has four stages. Each one can fail independently, which matters for error handling.

1. Speech-to-text

We use Whisper for transcription. The important detail here isn't the model itself but how we handle the output. Whisper gives us raw text with no punctuation guarantees and occasional hallucinations on silence. We normalize the transcript before passing it downstream: trimming filler words, collapsing repeated segments, and flagging low-confidence spans.
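A minimal sketch of that kind of normalization, under assumptions: the input is a list of `(text, confidence)` pairs with confidence between 0 and 1, which is a hypothetical shape rather than Whisper's actual output format, and the filler list and threshold are placeholders.

```python
FILLERS = {"um", "uh", "erm", "hmm"}  # assumed filler list

def normalize(segments, min_confidence=0.6):
    """Return (clean_transcript, flagged_segments).

    `segments` is a list of (text, confidence) pairs. This shape is an
    assumption for illustration only.
    """
    cleaned, flagged = [], []
    previous = None
    for text, confidence in segments:
        words = [w for w in text.split() if w.lower().strip(",.") not in FILLERS]
        text = " ".join(words)
        if not text:
            continue  # the segment was nothing but filler
        key = text.lower()
        if key == previous:
            continue  # collapse a segment repeated back to back
        previous = key
        cleaned.append(text)
        if confidence < min_confidence:
            flagged.append(text)  # low-confidence span for downstream review
    return " ".join(cleaned), flagged

transcript, flagged = normalize([
    ("um add fifty oak planks", 0.9),
    ("add fifty oak planks", 0.9),
    ("twelve euros each", 0.4),
])
print(transcript)
print(flagged)
```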

2. LLM parsing

This is where the real work happens. We send the normalized transcript to an LLM with the user's current table schema as context. The prompt includes column names, types, and a few example rows. The model returns a JSON object mapping column names to extracted values.

The schema context is critical. If your table has columns for "item", "quantity", "unit_price", and "delivery_date", the model knows where to put each piece of information. Without it, "twelve euros each" is ambiguous: is that a total or a per-unit price?
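Here is roughly how a schema-aware prompt could be assembled. This is a sketch, not the actual prompt: the wording, the `build_prompt` name, and the type labels are all made up for illustration, and the call to the model itself is left out.

```python
import json

def build_prompt(transcript, schema, example_rows):
    """Build a parsing prompt from column definitions and a few example rows."""
    columns = "\n".join(f"- {name}: {col_type}" for name, col_type in schema.items())
    examples = "\n".join(json.dumps(row) for row in example_rows)
    return (
        "Extract table rows from the transcript.\n"
        "Return JSON keyed by column name. "
        "Use null for anything the transcript does not state.\n\n"
        f"Columns:\n{columns}\n\n"
        f"Example rows:\n{examples}\n\n"
        f"Transcript: {transcript}"
    )

schema = {
    "item": "text",
    "quantity": "integer",
    "unit_price": "number",
    "delivery_date": "date",
}
examples = [{"item": "pine boards", "quantity": 30, "unit_price": 8.0,
             "delivery_date": "2025-07-01"}]
prompt = build_prompt("add fifty oak planks, twelve euros each", schema, examples)
print(prompt)
```

Because `unit_price` is named in the prompt, "twelve euros each" has an obvious destination.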

3. Schema validation

The LLM output gets validated against the actual table schema. Types are checked (is "next Tuesday" actually resolved to a date?), required fields are verified, and any values outside expected ranges get flagged. This step catches most LLM parsing errors before they hit the database.
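A stripped-down version of this kind of check might look like the following. The `SCHEMA` layout, the rule names, and the message wording are assumptions made for the example.

```python
from datetime import date

# Hypothetical schema description; the real table metadata may look different.
SCHEMA = {
    "item": {"type": str, "required": True},
    "quantity": {"type": int, "required": True, "min": 1},
    "unit_price": {"type": float, "required": False, "min": 0},
    "delivery_date": {"type": date, "required": False},
}

def validate(row, schema=SCHEMA):
    """Return a list of human-readable issues; an empty list means the row is valid."""
    issues = []
    for name, rule in schema.items():
        value = row.get(name)
        if value is None:
            if rule["required"]:
                issues.append(f"{name}: missing required value")
            continue
        expected = rule["type"]
        if expected is date:
            try:
                date.fromisoformat(value)  # "next Tuesday" fails here
            except (TypeError, ValueError):
                issues.append(f"{name}: {value!r} is not a resolved ISO date")
            continue
        accepted = (int, float) if expected is float else expected
        if isinstance(value, bool) or not isinstance(value, accepted):
            issues.append(f"{name}: expected {expected.__name__}, got {value!r}")
            continue
        if "min" in rule and value < rule["min"]:
            issues.append(f"{name}: {value} is below the minimum {rule['min']}")
    for name in sorted(set(row) - set(schema)):
        issues.append(f"{name}: not a column in this table")
    return issues

print(validate({"item": "oak planks", "quantity": "fifty",
                "delivery_date": "next Tuesday"}))
```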

4. Database write

If validation passes, we write the row. If it fails, we surface the issue back to the user with a suggested correction. The user can confirm, edit, or re-speak.
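The write-or-report branch can be sketched like this. SQLite is only a stand-in here (the post doesn't say which database sits behind the tables), and the `orders` table, the `write_or_report` name, and the result shape are hypothetical.

```python
import sqlite3

def write_or_report(conn, row, issues):
    """Insert the row if it validated; otherwise hand the issues back to the UI."""
    if issues:
        # Nothing is written; the user can confirm, edit, or re-speak.
        return {"status": "needs_review", "row": row, "issues": issues}
    conn.execute(
        "INSERT INTO orders (item, quantity, unit_price, delivery_date) "
        "VALUES (:item, :quantity, :unit_price, :delivery_date)",
        row,
    )
    conn.commit()
    return {"status": "written", "row": row}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (item TEXT, quantity INTEGER, unit_price REAL, delivery_date TEXT)"
)
row = {"item": "oak planks", "quantity": 50, "unit_price": 12.0,
       "delivery_date": "2025-07-01"}
print(write_or_report(conn, row, issues=[]))
```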

Why LLMs changed this

Before LLMs, the voice-to-data problem required you to anticipate every possible way a user might phrase something. You'd build intent classifiers, slot fillers, entity recognizers, each one trained on domain-specific data.

LLMs handle ambiguity without task-specific training. "Add fifty oak planks" and "I need 50 of those oak boards" produce the same structured output because the model understands synonyms, context, and implied meaning. We didn't train anything. We wrote a good prompt with schema context and it worked on day one.

That said, LLMs introduce their own problems. Latency is higher than rule-based parsing. Costs scale with usage. And occasionally the model invents data that wasn't in the transcript. The validation step (stage 3) exists specifically to catch that last case.
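One cheap way to spot invented data, shown purely as an illustration and not as a description of the validation in stage 3, is to check that free-text values actually occur in the transcript. The function name and the idea of passing in a list of text columns are assumptions.

```python
def ungrounded_fields(row, transcript, text_columns):
    """Return text columns whose value contains words never spoken in the transcript."""
    spoken = set(transcript.lower().replace(",", " ").split())
    suspicious = []
    for name in text_columns:
        value = row.get(name)
        if not isinstance(value, str):
            continue  # absent or non-text values are out of scope for this check
        if not all(word in spoken for word in value.lower().split()):
            suspicious.append(name)
    return suspicious

row = {"item": "oak planks", "supplier": "Acme Timber", "quantity": 50}
print(ungrounded_fields(row, "add fifty oak planks, twelve euros each",
                        text_columns=("item", "supplier")))
```

Numbers and dates are deliberately skipped: "fifty" legitimately becomes 50, so a literal word match would flag every correct parse.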

Edge cases we actually handle

Corrections mid-sentence. "Add fifty oak planks... actually make that sixty." The LLM sees the correction in context and uses the final number. This works surprisingly well without any special handling because correction is a natural language pattern the model already understands.

Multi-item entries. "Add fifty oak planks at twelve euros and thirty pine boards at eight euros." This produces two rows, not one. The LLM returns an array of objects, and we write them as a batch.
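In code, the multi-row case mostly comes down to accepting an array and inserting it in one transaction. A sketch, with SQLite as a stand-in database and a hypothetical `orders` table; treating the batch as all-or-nothing is an assumption.

```python
import json
import sqlite3

def rows_from_response(raw):
    """Accept either a single JSON object or an array of objects."""
    parsed = json.loads(raw)
    rows = parsed if isinstance(parsed, list) else [parsed]
    if not all(isinstance(row, dict) for row in rows):
        raise ValueError("expected a JSON object or an array of objects")
    return rows

def write_batch(conn, rows):
    # "with conn" commits if every insert succeeds and rolls back otherwise.
    with conn:
        conn.executemany(
            "INSERT INTO orders (item, quantity, unit_price) "
            "VALUES (:item, :quantity, :unit_price)",
            rows,
        )

raw = (
    '[{"item": "oak planks", "quantity": 50, "unit_price": 12},'
    ' {"item": "pine boards", "quantity": 30, "unit_price": 8}]'
)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (item TEXT, quantity INTEGER, unit_price REAL)")
write_batch(conn, rows_from_response(raw))
print(conn.execute("SELECT item, quantity FROM orders").fetchall())
```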

Contextual defaults. If your last three entries all had "delivery next Tuesday", the model starts suggesting the same default for new entries. We do this through the example rows in the prompt: recent entries prime the model's expectations.

Language mixing. A Czech user might say "pridej fifty oak planks", mixing a Czech verb with English nouns. Whisper handles multilingual input well, and the LLM parses mixed-language content without issues. This matters for us because our users span five languages.

What still breaks

Accents and noisy environments. Whisper's accuracy drops noticeably with strong accents or background noise. We're experimenting with audio preprocessing (noise reduction, gain normalization) but haven't found a solution that works without adding perceptible latency.

Ambiguous quantities. "A couple dozen" could mean 24 or "roughly 20-something." We default to the literal interpretation (24) and show the user what we parsed, but this remains a friction point.
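The literal-interpretation rule is simple to sketch. The word tables and the "approximate" flag below are assumptions for the example; the flag marks phrases worth showing back to the user for confirmation.

```python
COUNT_WORDS = {"a": 1, "one": 1, "a couple": 2, "a couple of": 2, "two": 2, "three": 3}
GROUP_WORDS = {"dozen": 12, "score": 20}
APPROXIMATE = {"a couple", "a couple of"}  # phrases people also use loosely

def literal_quantity(phrase):
    """Return (value, approximate) for phrases like 'a couple dozen', else None."""
    words = phrase.lower().split()
    if len(words) < 2 or words[-1] not in GROUP_WORDS:
        return None
    prefix = " ".join(words[:-1])
    count = COUNT_WORDS.get(prefix)
    if count is None:
        return None
    return count * GROUP_WORDS[words[-1]], prefix in APPROXIMATE

print(literal_quantity("a couple dozen"))
print(literal_quantity("three dozen"))
```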

Compound units. "Three and a half meters at twenty-two fifty per meter" is hard to parse: is it 22.50 or 2250? Context usually resolves this (a currency column suggests 22.50), but not always.
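To show where the ambiguity comes from, here is a toy resolver. It is not how the LLM decides; it just spells out the two readings and how a column type can tip the balance. The function names and type labels are made up.

```python
def candidate_readings(high, low):
    """'twenty-two fifty' arrives as two number groups: high=22, low=50."""
    return {"decimal": high + low / 100, "whole": high * 100 + low}

def pick_reading(high, low, column_type):
    """Return (value, needs_confirmation) using the column type as the only hint."""
    readings = candidate_readings(high, low)
    if column_type == "currency":
        return readings["decimal"], False
    if column_type == "integer":
        return readings["whole"], False
    return None, True  # no usable hint: ask the user instead of guessing

print(candidate_readings(22, 50))
print(pick_reading(22, 50, "currency"))
```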

We log every failed parse and review them weekly. Most improvements come from adjusting the prompt, not from changing the pipeline architecture.

Try it

You can test the full pipeline at Voice Tables. Speak into the mic, watch it become a structured table. If you build voice interfaces, we'd like to hear what breaks for you.

DEV Community