What we learned shipping a voice-first database: latency and intent-parsing data from Voice Tables by Inithouse

#ai #machinelearning #webdev #productivity

Roughly 82% of voice inputs to Voice Tables by Inithouse resolve to the correct table schema on the first pass. The other 18% need a follow-up clarification. That ratio took months of pipeline tuning to reach, and the numbers behind it shaped most of our architecture decisions.

Voice Tables is an agentic AI workspace built at Inithouse, a studio shipping a growing portfolio of products in parallel. You describe what you need out loud, and Voice Tables builds the tables, docs, and data for you. Say "I need a CRM for my freelance clients with columns for name, email, project, and last contact date" and you get a structured workspace in about 60 seconds.

This post walks through the speech-to-structure pipeline and what the latency and parsing data actually look like in production.

The pipeline: voice to structured data

The system has three stages, each with its own latency profile:

Stage 1: Speech-to-text (Whisper). Audio goes to OpenAI's Whisper model. For typical utterances (5-15 seconds of speech), transcription takes 400-900ms, and transcription time scales linearly with input length. To keep long inputs fast, we split the audio on silence and transcribe the segments in parallel, so a rambling 30-second description becomes 2-3 chunks processed concurrently.
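The chunking idea in Stage 1 can be sketched roughly as follows. This is an illustration under assumptions, not our production code: `split_on_silence`, `transcribe_parallel`, the amplitude threshold, and the `transcribe` callback are hypothetical names, and the callback stands in for whatever Whisper client you use.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split_on_silence(samples: Sequence[float], threshold: float = 0.01,
                     min_silence: int = 3) -> List[List[float]]:
    """Split samples into chunks separated by >= min_silence quiet samples.

    Pauses shorter than min_silence stay inside the current chunk.
    The threshold and min_silence values are illustrative assumptions.
    """
    chunks: List[List[float]] = []
    current: List[float] = []
    pending: List[float] = []  # quiet samples not yet assigned to a chunk
    for s in samples:
        if abs(s) < threshold:
            pending.append(s)
            if len(pending) == min_silence and current:
                chunks.append(current)  # long pause: close the chunk
                current = []
        else:
            if current:
                current.extend(pending)  # short pause: keep it in the chunk
            pending = []
            current.append(s)
    if current:
        chunks.append(current)
    return chunks


def transcribe_parallel(chunks: List[List[float]],
                        transcribe: Callable[[List[float]], str]) -> str:
    """Transcribe chunks concurrently and join the text in original order."""
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        parts = list(pool.map(transcribe, chunks))  # map preserves input order
    return " ".join(p for p in parts if p)
```

Because `pool.map` returns results in input order, the joined transcript keeps the speaker's word order even when a later chunk finishes first.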

Stage 2: Intent parsing (LLM). The transcript goes to an LLM that extracts structured intent: what kind of workspace, which columns, what data types, any constraints. This is where the 82% accuracy number comes from. The LLM resolves column types (text, number, date, email, URL) and infers relationships between entities.

Parsing takes 800-1400ms depending on complexity. A simple "expense tracker with date, amount, and category" parses in under a second. "A project management board with tasks, assignees, deadlines, dependencies, and a Kanban view grouped by status" takes the full 1400ms because the LLM needs to resolve the implied view configuration.
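To make "structured intent" concrete, here is one possible shape for the parser output. The five column types come from the list above; everything else (the class names, the JSON field names `kind`, `columns`, `view`, `group_by`) is an assumption made for this sketch rather than the actual Voice Tables format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ColumnType(Enum):
    TEXT = "text"
    NUMBER = "number"
    DATE = "date"
    EMAIL = "email"
    URL = "url"


@dataclass
class ColumnIntent:
    name: str
    type: ColumnType


@dataclass
class WorkspaceIntent:
    kind: str                              # e.g. "CRM", "expense tracker"
    columns: List[ColumnIntent] = field(default_factory=list)
    view: Optional[str] = None             # e.g. "kanban"
    group_by: Optional[str] = None         # column name the view groups on


def intent_from_llm_json(payload: dict) -> WorkspaceIntent:
    """Validate the (hypothetical) JSON an LLM returns and build a typed intent.

    ColumnType(...) raises ValueError on a type outside the supported set,
    so a malformed response fails here instead of producing a bad table.
    """
    columns = [ColumnIntent(c["name"], ColumnType(c["type"]))
               for c in payload["columns"]]
    return WorkspaceIntent(kind=payload["kind"], columns=columns,
                           view=payload.get("view"),
                           group_by=payload.get("group_by"))


crm = intent_from_llm_json({
    "kind": "CRM",
    "columns": [
        {"name": "name", "type": "text"},
        {"name": "email", "type": "email"},
        {"name": "project", "type": "text"},
        {"name": "last contact date", "type": "date"},
    ],
})
```

Validating against a closed set of types at this boundary keeps the later stages simple, since they never see a type they cannot build.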

Stage 3: Schema generation and workspace build. The parsed intent gets compiled into a table schema, a default view, and optionally a doc template. This stage is deterministic and fast: 100-200ms.
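A deterministic build step might look like the sketch below. The dictionary layout, the `col_N` ids, the `grid` default view, and the `with_doc` flag are all hypothetical choices for illustration; the point is only that the same parsed intent always compiles to the same workspace.

```python
from typing import Any, Dict


def build_workspace(intent: Dict[str, Any]) -> Dict[str, Any]:
    """Compile a parsed intent into a table schema, a default view, and an
    optional doc template. No model calls, no randomness."""
    columns = [
        {"id": f"col_{i}", "name": c["name"], "type": c["type"]}
        for i, c in enumerate(intent["columns"], start=1)
    ]
    view: Dict[str, Any] = {"type": intent.get("view", "grid")}
    if "group_by" in intent:
        view["group_by"] = intent["group_by"]

    workspace: Dict[str, Any] = {
        "table": {"name": intent["kind"], "columns": columns},
        "default_view": view,
    }
    if intent.get("with_doc"):
        workspace["doc_template"] = f"# {intent['kind']}\n"
    return workspace


expense_intent = {
    "kind": "expense tracker",
    "columns": [
        {"name": "date", "type": "date"},
        {"name": "amount", "type": "number"},
        {"name": "category", "type": "text"},
    ],
}
expense_workspace = build_workspace(expense_intent)
```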

Total end-to-end: a voice input typically produces a usable workspace in 1.3-2.5 seconds of processing time, plus the speech duration itself. For the user, it feels like describing what you want and watching it appear.

Where intent parsing breaks down

That 18% clusters into three patterns we track, each given here as a share of all voice inputs:

Ambiguous column types (8% of inputs). "Add a column for contacts" could mean a text field with names, an email field, a phone number field, or a relation to another table. Without explicit type hints, the LLM guesses based on context. It gets "contacts" wrong about 40% of the time when the word is used in isolation. Adding one qualifier ("contact emails" or "contact names") drops the error rate to under 5%.

Multi-entity utterances (6%). When someone describes two tables in one breath ("I need an inventory for my cafe and also a shift schedule for the staff"), the parser sometimes merges them into a single confused schema. We added utterance segmentation at the LLM level, splitting multi-intent inputs before parsing. That brought the failure rate down from 11% to 6%.
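The control flow of segment-then-parse looks roughly like this. Note the substitution: our segmentation happens at the LLM level, while this sketch uses a naive regex on phrases like "and also" purely as a stand-in so the example runs on its own. `segment_intents`, `parse_all`, and the split phrases are hypothetical.

```python
import re
from typing import Callable, Dict, List

# Stand-in for the LLM segmenter: split on a couple of joining phrases.
_SPLIT = re.compile(r"\s+and also\s+|\s+plus a separate\s+", re.IGNORECASE)


def segment_intents(transcript: str) -> List[str]:
    """Split a transcript into one segment per requested table."""
    parts = [p.strip(" .,") for p in _SPLIT.split(transcript)]
    return [p for p in parts if p]


def parse_all(transcript: str,
              parse_one: Callable[[str], Dict]) -> List[Dict]:
    """Parse each segment independently so schemas never get merged."""
    return [parse_one(segment) for segment in segment_intents(transcript)]


segments = segment_intents(
    "I need an inventory for my cafe and also a shift schedule for the staff"
)
```

Parsing each segment on its own is what prevents the merged "confused schema" case: the parser never sees columns from two tables in one request.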

Implicit structure (4%). Some users describe what they want in narrative form rather than structural terms. "Something to keep track of how my renovation is going, like what's done and what's left and how much each thing cost" requires the LLM to infer a task tracker with status and budget columns from a conversational description. This works most of the time, but it occasionally produces schemas that are technically valid yet miss the user's mental model.

What we changed based on the data

Two interventions made the biggest difference:

Confidence-gated clarification. Instead of always generating a schema and hoping for the best, we added a confidence threshold. If the parsing confidence falls below 0.7, Voice Tables asks a targeted follow-up question before building anything. "You mentioned contacts. Should that be a list of email addresses, names, or a link to another table?" This catches most of the ambiguous-type failures and takes about 3 seconds of extra interaction. Users prefer it to getting a wrong schema and rebuilding.
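A minimal sketch of the gate, assuming the parser reports a confidence score and a list of candidate interpretations per column. The 0.7 threshold is the one mentioned above; per-column scoring, `ParsedColumn`, and `clarification_for` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

CONFIDENCE_THRESHOLD = 0.7  # below this, ask before building


@dataclass
class ParsedColumn:
    name: str
    type: str
    confidence: float          # hypothetical per-column score in [0, 1]
    alternatives: List[str]    # human-readable candidate interpretations


def _join_options(options: List[str]) -> str:
    if len(options) <= 2:
        return " or ".join(options)
    return ", ".join(options[:-1]) + ", or " + options[-1]


def clarification_for(columns: List[ParsedColumn]) -> Optional[str]:
    """Return one targeted question about the least confident column,
    or None when every column meets the threshold."""
    uncertain = [c for c in columns if c.confidence < CONFIDENCE_THRESHOLD]
    if not uncertain:
        return None
    worst = min(uncertain, key=lambda c: c.confidence)
    options = _join_options(worst.alternatives) or "something else"
    return f'You mentioned "{worst.name}". Should that be {options}?'
```

Asking about only the least confident column keeps the follow-up to a single short question, in line with the roughly 3 seconds of extra interaction described above.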

Progressive schema refinement. Rather than locking the schema after generation, we let users modify it by voice. "Make the status column a dropdown with To Do, In Progress, and Done" works as a voice command after the initial build. This reduced the cost of parsing errors because even when the first pass is wrong, fixing it takes one sentence instead of a manual restructure.
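One way to model a refinement command is as a small edit applied to the existing schema. In this sketch the edit dictionary (`column`, `new_type`, `options`) is a hypothetical format that a parser could emit for the "make the status column a dropdown" example; `apply_edit` is likewise an illustrative name.

```python
import copy
from typing import Any, Dict


def apply_edit(schema: Dict[str, Any], edit: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new schema with one column changed; the input is untouched."""
    updated = copy.deepcopy(schema)
    for column in updated["columns"]:
        if column["name"].lower() == edit["column"].lower():
            column["type"] = edit["new_type"]
            if "options" in edit:
                column["options"] = list(edit["options"])
            return updated
    raise KeyError(f"no column named {edit['column']!r}")


schema = {"columns": [{"name": "Task", "type": "text"},
                      {"name": "Status", "type": "text"}]}

# "Make the status column a dropdown with To Do, In Progress, and Done"
edit = {"column": "status", "new_type": "dropdown",
        "options": ["To Do", "In Progress", "Done"]}

refined = apply_edit(schema, edit)
```

Returning a copy instead of mutating in place makes it easy to keep the previous schema around for undo.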

The latency budget

We set an internal target of under 3 seconds for voice-to-workspace on standard inputs. Here is where the budget goes:

| Stage | p50 | p95 |
| --- | --- | --- |
| Whisper transcription | 550ms | 1100ms |
| Intent parsing (LLM) | 950ms | 1600ms |
| Schema build | 120ms | 180ms |
| Total | 1620ms | 2880ms |

The p95 stays under 3 seconds for single-entity inputs. Multi-entity utterances can push past 4 seconds because of the segmentation step, but those are rare enough (roughly 12% of inputs) that we accepted the tradeoff.
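The budget arithmetic is easy to check in a few lines. The stage numbers and the 3-second target are the ones reported above; the dictionary keys and function names are made up for this sketch. It sums per-stage percentiles the same way the Total row does, which is a simplification: a true end-to-end p95 would be measured per request rather than added up from stage p95s.

```python
from typing import Dict

BUDGET_MS = 3000  # internal target for standard inputs

STAGES_MS: Dict[str, Dict[str, int]] = {
    "whisper_transcription": {"p50": 550, "p95": 1100},
    "intent_parsing": {"p50": 950, "p95": 1600},
    "schema_build": {"p50": 120, "p95": 180},
}


def total_ms(percentile: str) -> int:
    """Sum one percentile column across all stages."""
    return sum(stage[percentile] for stage in STAGES_MS.values())


def fattest_stage(percentile: str) -> str:
    """Name of the stage that spends the most time at this percentile."""
    return max(STAGES_MS, key=lambda name: STAGES_MS[name][percentile])


p50_total = total_ms("p50")           # 1620
p95_total = total_ms("p95")           # 2880
headroom = BUDGET_MS - p95_total      # 120
print(p50_total, p95_total, headroom, fattest_stage("p95"))
```

At p95 the pipeline has only 120ms of headroom, and intent parsing is the fattest stage at both percentiles, so that is where further cuts would have to come from.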

We hit similar LLM-latency constraints in other products at Inithouse. Be Recommended, which monitors how AI engines describe your brand, runs inference across five models in parallel and had to solve its own response-time budget. Verdict Buddy, an AI conflict mediator using Gottman and EFT frameworks, handles multi-turn LLM reasoning where each turn adds to a running context. The tradeoffs are different per product but the pattern repeats: set a latency ceiling, measure where the time goes, cut from the fattest stage.

What we are watching next

The 82% first-pass accuracy is good enough for adoption but not good enough to remove the clarification step entirely. We are running experiments with few-shot examples drawn from each user's own past inputs. Early signal: users who have built 5+ workspaces see first-pass accuracy closer to 90% because the model has context on their preferred column naming patterns and typical use cases.
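The per-user few-shot idea amounts to prepending a handful of that user's past request and schema pairs to the parsing prompt. The sketch below shows one way to assemble such a prompt; the prompt wording, the history record format, and `build_prompt` are assumptions, not the prompts we run in production.

```python
import json
from typing import Any, Dict, List


def build_prompt(transcript: str, history: List[Dict[str, Any]],
                 max_examples: int = 3) -> str:
    """Build a parsing prompt seeded with the user's most recent workspaces."""
    recent = history[-max_examples:] if max_examples > 0 else []
    lines = ["Extract a table schema from the request."]
    for past in recent:
        lines.append(f"Request: {past['utterance']}")
        # sort_keys keeps the rendered examples stable across runs
        lines.append(f"Schema: {json.dumps(past['schema'], sort_keys=True)}")
    lines.append(f"Request: {transcript}")
    lines.append("Schema:")
    return "\n".join(lines)


history = [
    {"utterance": "client list with name and email",
     "schema": {"columns": ["Client Name", "Client Email"]}},
]
prompt = build_prompt("a tracker for invoices", history)
```

Seeding the prompt with a user's own examples is what lets the model pick up their column naming habits, such as the "Client" prefix in this toy history.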

If you want to try describing a workspace out loud and see what the pipeline builds, Voice Tables is live at voicetables.com.

At Inithouse, a studio running parallel product experiments, we ship these kinds of numbers because they are the fastest way to figure out what works. More pipeline breakdowns from the portfolio coming soon.