Our first voice parser turned "add milk" into a new database table called "milk." Not a row. A whole table. That was the moment I realized voice-to-database is a fundamentally different problem from voice-to-text.
I build Voice Tables at Inithouse. It is an AI-native workspace where you talk to your database instead of typing into cells. Getting here took a lot of wrong turns. Here are five mistakes we made and what we changed.
1. Mapping voice directly to SQL
The first instinct is obvious: take the transcription, parse it, generate SQL. "Add a row with name John and age 30" becomes INSERT INTO users (name, age) VALUES ('John', 30).
This breaks immediately. People do not talk in SQL. They say "put John in there, he is 30." No table name. No column specification. Ambiguous pronouns.
What we changed: We added an intent layer between the transcription and the database operation. The system first figures out what the user means (add a record, update a field, create a new structure) before it touches any data. The LLM handles intent classification, then a structured function call handles the actual write. Two separate steps, not one.
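To make the two-step split concrete, here is a minimal sketch. It is not the Voice Tables implementation: `classify_intent` is a hypothetical stand-in for the LLM call and uses hard-coded rules, the `Intent` names are invented for illustration, and the "database" is a plain dict.

```python
from dataclasses import dataclass, field
from enum import Enum


class Intent(Enum):
    # Hypothetical intent names, mirroring the three examples in the text.
    ADD_RECORD = "add_record"
    UPDATE_FIELD = "update_field"
    CREATE_TABLE = "create_table"


@dataclass
class ParsedCommand:
    intent: Intent
    table: str
    values: dict = field(default_factory=dict)


def classify_intent(transcript: str, active_table: str) -> ParsedCommand:
    """Step 1: decide what the user means. Stand-in for the LLM call."""
    text = transcript.lower().strip()
    prefix = "create a table called "
    if text.startswith(prefix):
        return ParsedCommand(Intent.CREATE_TABLE, text[len(prefix):])
    if text.startswith("add "):
        # "add milk" is a row in the active table, not a new table.
        return ParsedCommand(Intent.ADD_RECORD, active_table, {"name": text[4:]})
    raise ValueError(f"could not classify: {transcript!r}")


def execute(cmd: ParsedCommand, db: dict) -> None:
    """Step 2: a structured write. Only this function touches the data."""
    if cmd.intent is Intent.CREATE_TABLE:
        db.setdefault(cmd.table, [])
    elif cmd.intent is Intent.ADD_RECORD:
        db[cmd.table].append(cmd.values)
    else:
        raise NotImplementedError(cmd.intent)


db = {"groceries": []}
execute(classify_intent("add milk", active_table="groceries"), db)
print(db)
```

Because the classifier returns a typed command instead of SQL, the write path can validate the table and columns before anything is stored.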
2. Not handling ambiguity gracefully
When a user says "add the meeting," which table do they mean? Which columns should be populated? What meeting?
Our early version guessed. It picked the most recent table and filled in whatever it could infer. Sometimes it guessed right. When it guessed wrong, users lost trust fast. One wrong insertion and people stop using voice entirely.
What we changed: When the system's confidence in its interpretation falls below a threshold, it asks instead of guessing: "Which table should I add this to?" or "Do you mean the project name or the client name?" A quick clarification takes two seconds. Fixing a wrong insertion takes two minutes and erodes trust.
The key insight: a voice interface that asks clarifying questions feels smarter than one that silently does the wrong thing.
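The ask-instead-of-guess rule can be sketched in a few lines. Everything here is an assumption for illustration: the 0.8 cutoff, the `Interpretation` and `Outcome` types, and the idea that the intent step returns a single confidence score between 0 and 1.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical cutoff; the right value depends on the model and the data.
CONFIDENCE_THRESHOLD = 0.8


@dataclass
class Interpretation:
    table: Optional[str]  # None when no table could be inferred
    confidence: float     # assumed to be in the range 0..1


@dataclass
class Outcome:
    action: str   # "write" or "ask"
    message: str


def decide(interp: Interpretation, candidate_tables: List[str]) -> Outcome:
    """Write only when confident; otherwise return a clarifying question."""
    if interp.table is None or interp.confidence < CONFIDENCE_THRESHOLD:
        options = ", ".join(candidate_tables)
        return Outcome("ask", f"Which table should I add this to? ({options})")
    return Outcome("write", f"Adding to {interp.table}")


tables = ["Meetings", "Projects"]
unsure = decide(Interpretation("Meetings", 0.55), tables)
sure = decide(Interpretation("Meetings", 0.93), tables)
print(unsure.message)
print(sure.message)
```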
3. Ignoring correction patterns
Users correct themselves constantly. "Add revenue 5000... no wait, 50000." "Put that in the Q2 column. Actually, Q3."
Our first implementation treated every utterance as a new command. "No wait, 50000" became a new row with value 50000, while the wrong 5000 stayed. The user now had two entries and needed to manually delete one.
What we changed: We built correction detection into the pipeline. Phrases like "no, I meant," "actually," "wait," and "change that to" trigger an undo-and-reinterpret flow. The system rolls back the last action and reprocesses with the corrected input. This single change cut support complaints by roughly half.
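A rough sketch of the undo-and-reinterpret flow, assuming the simplest case: the correction replaces the value of the last row added. The regex, the `Session` class, and the three-word "add column value" parser are all hypothetical; a real pipeline would detect corrections with the model, not a phrase list alone.

```python
import re
from typing import List, Optional

# Trigger phrases from the text, matched at the start of an utterance.
CORRECTION_RE = re.compile(
    r"^\s*(?:no,? i meant|no,? wait|actually|wait|change that to)\b[\s,]*",
    re.IGNORECASE,
)


def parse_value(text: str):
    """Turn '50000' into an int; leave anything else as a string."""
    text = text.strip(" .")
    return int(text) if text.isdigit() else text


class Session:
    def __init__(self) -> None:
        self.rows: List[dict] = []
        self.last_column: Optional[str] = None

    def handle(self, utterance: str) -> None:
        match = CORRECTION_RE.match(utterance)
        if match and self.last_column is not None:
            # Roll back the last action, then reprocess with the new value.
            self.rows.pop()
            self.rows.append({self.last_column: parse_value(utterance[match.end():])})
            return
        # Toy parser: expects exactly "add <column> <value>".
        _, column, value = utterance.split(maxsplit=2)
        self.rows.append({column: parse_value(value)})
        self.last_column = column


session = Session()
session.handle("add revenue 5000")
session.handle("no wait, 50000")
print(session.rows)
```

The point of the sketch is the ordering: the rollback happens before the reinterpretation, so the wrong 5000 never survives next to the corrected value.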
4. No visual feedback for voice actions
Voice is invisible. When you type "5000" into a cell, you see it appear. When you say "add 5000," you need confirmation that it went to the right place.
We shipped without real-time visual feedback. Users would dictate three entries, then scroll around trying to verify what happened. Some would re-dictate the same data because they were not sure it worked the first time.
What we changed: Every voice action now triggers an immediate visual response. The affected row highlights. A small toast shows what was added or changed. The cell briefly pulses. This sounds like polish, but it is load-bearing infrastructure for a voice-first product. Without it, users simply do not trust that anything happened.
5. Testing only with clean audio
Our dev environment was quiet. Everyone on the team speaks clearly. The test recordings were studio-quality single-speaker audio.
Then real users showed up. Kitchen background noise. Accents the model had not encountered enough. Partial sentences where someone started talking, got interrupted, and came back to finish. Speakerphone echo. All of these degraded accuracy significantly.
What we changed: We built a test suite with intentionally bad audio: recorded over speakerphone, in cafes, with interruptions mid-sentence. We also added a confidence threshold: if the transcription confidence is below 70%, the system asks the user to repeat instead of acting on garbage input. Accuracy on noisy audio improved from roughly 60% to 85% after these changes.
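Here is one way the confidence gate and a per-condition accuracy report could fit together. Only the 70% cutoff comes from our setup as described above. The `Sample` type, the condition labels, and the three sample rows are invented for illustration, and the sketch assumes the speech-to-text engine reports confidence as a number between 0 and 1.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

MIN_CONFIDENCE = 0.70  # the 70% cutoff described above


@dataclass
class Sample:
    condition: str     # recording condition, e.g. "cafe" or "speakerphone"
    expected: str      # what the speaker actually said
    heard: str         # what the transcriber returned
    confidence: float  # assumed to be in the range 0..1


def should_ask_to_repeat(confidence: float) -> bool:
    """Gate: below the cutoff, ask the user to repeat instead of acting."""
    return confidence < MIN_CONFIDENCE


def report(samples: List[Sample]) -> Dict[str, Dict[str, int]]:
    """Count, per condition, how often we acted, were right, or asked again."""
    stats = defaultdict(lambda: {"acted": 0, "correct": 0, "repeats": 0})
    for s in samples:
        bucket = stats[s.condition]
        if should_ask_to_repeat(s.confidence):
            bucket["repeats"] += 1
            continue
        bucket["acted"] += 1
        bucket["correct"] += int(s.heard == s.expected)
    return dict(stats)


# Made-up fixtures, not real measurements.
samples = [
    Sample("cafe", "add milk", "add milk", 0.91),
    Sample("cafe", "add eggs", "at legs", 0.42),
    Sample("speakerphone", "add revenue 50000", "add revenue 5000", 0.83),
]
result = report(samples)
print(result)
```

Splitting the counts by recording condition shows which environments still produce confident but wrong transcriptions, which the gate alone cannot catch.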
The pattern
Every mistake on this list comes from the same root assumption: that voice input behaves like keyboard input. It does not. Voice is messy, ambiguous, self-correcting, and invisible. Building a voice-to-database interface means designing for all of that from the start, not patching it later.
We are still iterating on Voice Tables. The five mistakes above are the ones that cost us the most time. If you are building anything voice-first, skip the part where you learn these the hard way.
Jakub, builder @ Inithouse