I'm dictating this draft into a transcription tool because the first pass of a post is a long stream-of-consciousness, and talking is the fastest way to get it out of my head. I'll edit it later with my hands on a keyboard, like a sane person.
That split — voice for capture, keyboard for refinement — is the one voice pattern I've found that reliably earns its keep. It isn't even new. Writing a book report in the 1980s, I'd dump every idea onto the page first, then read it back, build a rough order, then rewrite and rearrange until it actually said something. The workflow never changed; voice just lets me sprint the first step faster. Almost everywhere else I see designers reaching for voice as a primary input, I want to ask whether they've ever tried to use it in a room with another human being.
Where Voice Actually Works
Dictating a loose draft alone in my office works for a reason I didn't expect. The problem with typing the first dump isn't speed; it's that the moment I type, I slip into editor mode and start polishing sentences before the ideas are even out. Voice blocks that. I can't easily go back and fix what I just said, so it forces me forward in a straight line until the whole mess is on the page. It's a limiter that imposes a behavior I know is better but won't choose on my own.
Hands-busy contexts earn it too, and earn it cleanly. Capturing a thought while driving is both useful and far safer than thumbing at a screen; the case is so obvious that "call my wife" and "add a stop for gas" are the rare voice commands even skeptics reach for. Setting a kitchen timer mid-recipe, hands covered in flour, is what I use Siri for most, and I doubt I'm alone; it may well be the assistant's most-used feature. Whether that's because timers are a perfect high-frequency, hands-busy fit or because Siri is too unreliable for anything harder is genuinely unclear (probably some of both).
And it stays narrow even when I'm alone. For real design or engineering work, typing slows me to the cadence the problem needs: fast enough to keep up with the thinking, slow enough to organize it. That instinct to pause and add nuance is a feature, not friction. Talking at a screen runs weirdly fast for that kind of work; the register doesn't match. Then there's the reliability tax. Every transcription tool I've used over twenty-five years (Dragon, the macOS built-in, Wispr Flow) has had a real failure rate where I thought I'd started recording and hadn't, and nothing makes me rage-quit a session faster than monologuing for ten minutes to a machine that captured none of it.
That's the ceiling on voice when the room is empty. Add one more person and it drops through the floor — the interaction goes from awkward to actively rude.
The Office Problem
Picture a general contractor's office. There are well over four hundred thousand home-building businesses in the United States, and once you get past the few hundred big national builders, nearly eighty percent of them are tiny: self-employed operators and two-or-three-person shops. The office matches: a couple of people in a light-industrial unit or a tired strip mall, because the building is the point and the office is just where you meet a client and run takeoffs. Nobody's spending money on a nice one, and nobody needs to.
The size barely matters, though. Scale up to Intel and you've traded the strip mall for a cube farm, which is worse; there is nothing quite as sensory-overloading as a sales floor packed with cubicles. Some people thrive in that. In my experience the people who build software (designers, engineers, the analytically minded) mostly can't think through that much noise, and for good reason. It gets self-defeating fast: even if voice is a small optimization for one person, in a shared room that one person is wrecking everyone else's concentration.
Now drop voice-first software into that office. One person talking at their computer is fine. Two is a soundscape nobody can think in. Three is unworkable. The entire reason the room exists is to be a small quiet place where two or three people can concentrate, and voice-first software breaks that premise on contact. The same math ruins any shared space: a café where everyone dictates is intolerable for the baristas and the next table over; an open-plan floor where voice replaces typing is just the boiler-room sales pit rebuilt on purpose. We've been there. We didn't like it.
So I think some offices will eventually ban voice input at the desk for the same reason they once banned smoking — not because it harms the speaker, but because one person's input is everyone else's externality, and at some density the host has to step in. Whispering into the mic doesn't fix it; it just makes you look like you're plotting something. Headsets don't fix it, because the talking is the problem, not the listening. And the honest alternative isn't a better mute button. It's rethinking whether we need to be in the shared room at all, and for how long. George Jetson clocked a nine-hour week pushing one button; it's well past time we took that as a target rather than a punchline.
The Minority Report Test
Voice sits in the same bin as the gestural interface Tom Cruise waves around in Minority Report. It looks spectacular on screen. Imagine doing it for eight hours and you realize you've never wanted to move your arms that much; a coworker drifting past at the wrong second smears everything you were holding in the air. The film literally shows the strain and somehow audiences walked out wanting the gloves instead of reading them as a warning.
There's probably a whole essay in how badly science fiction designs interfaces, because it's optimizing for what reads on camera, not what survives a workday. Hackers is my favorite tell: swooping 3D cityscapes of glowing files, when the real thing was, is, and will almost certainly remain a command prompt. What looks good on screen and what's good to use are mostly different things, and voice has been coasting on the former.
So here's the test I run on any input modality: imagine doing it for forty (or sixty) hours a week on a floor of peers all doing the same thing. Voice fails. Gestures fail; your arms would simply give out, the same way plenty of people's hands give out from typing. That last part is the fair steelman for voice. Fewer hours at the keyboard means less repetitive strain, which is a real benefit, and I can't talk for eight hours any more than I can type for eight. The answer isn't to crown one modality. It's that I need several, and voice can be one of them — but only if the design accounts for the room, because keyboard and mouse clear a bar voice can't. They're near-silent, screen-only, and personal. They scale into dense shared space precisely because they impose on no one else.
This is the part the voice-first crowd skips, and it's the whole game. They evaluate the interaction in isolation (does it work for one user, one task, one ideal environment) and never ask whether it works in the room. And the room is the deeper point: as a designer you are responsible for the entire context of use, including the parts you don't control. I can't stop a user from opening my app on a packed train. If it supports mobile, they're using it exactly as intended, so I have to design for that train. Not by forbidding it, but by never forcing the awkward path. If the only way to enter a destination is to say it out loud because someone decided "voice matters while driving," that's just as broken in reverse, because half the time I'm parked and I just want to type the address. You design for the whole situation and the whole audience, not the slice you can see in the demo.
The Restraint Pitch
When someone on the team pushes voice as a hero feature, ask two questions. First: what physical environment is the typical user actually in? If there's another human within earshot, you're not designing for voice; you're designing a fight with the user's coworkers, family, baristas, and seatmates. Second, and this is the one people skip: when and why would a user not be able or willing to speak? The answer is almost never "rarely." Which means voice fits somewhere around thirty, forty, maybe sixty percent of sessions (a real but partial slice) and almost never the ninety-nine percent that would justify making it the front door.
That percentage is the tell, and it's the same one from the Chat Is an Input argument, because voice is chat, just entered by mouth instead of keyboard. It inherits every limit of language as an interface (the imprecision, the ambiguity, all of it), then adds an acoustic externality typing never had. Arguably it's worse, since few of us are as precise out loud as we are on the page. It's why we built distinct interfaces for spreadsheets and CAD and design tools in the first place; sometimes a picture beats a thousand words, and sometimes a thousand words can't do what a single dragged handle does. A growing crowd in the Bay Area now narrates at a fleet of agents all day, and good for them; I do a version of it myself. But that is not how a general contractor works, or an accountant, or a business analyst, and mistaking your own workflow for your users' is the original sin here.
The job of design isn't to expose every modality the underlying tech can support. It's to choose the ones that work for the user and the people around the user, build those well, and let the operating system catch the edge cases. Every major OS already does competent voice-to-text for anyone who wants it. Voice is an edge case for most software. The discipline is having the restraint to let it stay one, to design for the room you can't see, not just the user in front of you.