Every time you use a cloud-based voice keyboard on your iPhone, your audio is transmitted to a server you don't control.
Not just the text — the raw audio. Your voice. Before transcription happens.
Most people don't think about this. It's worth thinking about.
Voice Data Is Biometric Data
Your voice is distinctive. In some ways it says more about you than a fingerprint does.
Voiceprints can identify you across recordings. They can reveal your emotional state, health conditions, accent, and origin. They're used in law enforcement, bank authentication, and increasingly in surveillance.
When a voice keyboard sends your audio to a cloud server, that server has your voice. Not a hash of it. Not a summary. The actual audio file.
What the operator does with it depends entirely on its privacy policy, its security practices, and whether those commitments hold up under commercial pressure.
The Structural Problem With Cloud Voice
The dominant iOS voice keyboards are cloud-only. Your audio goes to their servers. Their servers need to be profitable. Your data is valuable.
Users have raised concerns about voice data being used for model training without explicit consent, and about apps transmitting data even when not actively transcribing. These aren't isolated incidents — they're the structural reality of any cloud voice service where the incentive to monetize data always exists.
The Only Fix: Self-Hosted
The only way to guarantee your voice stays private is to ensure it never leaves hardware you control.
Diction is an iOS keyboard that connects to a speech-to-text server you run yourself.
```shell
git clone https://github.com/omachala/diction
cd diction
docker compose up -d
```
Your audio goes from your iPhone to your server. Nowhere else. Audio is processed in memory and discarded — never permanently stored, never logged.
You can verify this. The gateway is open source. There's no database, no file write, no retention of audio anywhere in the pipeline.
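One way to start that verification is a quick search of the checkout for code that looks like it writes files or talks to a database. The sketch below is a heuristic, not an audit: the function name `audit_persistence` and the search patterns are illustrative assumptions, not taken from the Diction codebase, and an empty result only tells you where to begin reading rather than proving that nothing is retained.

```shell
#!/usr/bin/env bash
# Heuristic sketch: list source lines that look like they persist data.
# The patterns are illustrative assumptions and should be adjusted to
# whatever language the code you are reading is written in.
audit_persistence() {
  local dir="$1"
  # -r recurse, -n show line numbers, -I skip binary files,
  # -i ignore case, -E extended regex.
  # grep exits non-zero when nothing matches, so "|| true" keeps the
  # function usable in scripts that run with "set -e".
  grep -rnIiE --exclude-dir=.git \
    'WriteFile|os\.Create|fopen|INSERT INTO|sqlite' "$dir" || true
}
```

From inside the cloned repository you would run `audit_persistence .` and then read each reported line in context.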
What You Need to Self-Host
Any machine that can run Docker:
- Old laptop or mini PC — works well, 3-5s latency depending on CPU
- Home server / NAS — faster, 1-2s
- Cloud VPS (2 CPU, 4GB RAM) — fast, cheap, your own instance
The server runs open-source speech recognition optimized for CPU inference. Your hardware, your data.
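If you are unsure whether a spare machine is in the same class as the VPS option above, a rough check like the following may help. It treats the 2 CPU / 4GB figures from the list as a baseline, which is an assumption: the list gives them as an example, not as a stated minimum. The helper name `meets_vps_baseline` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare a machine against the 2 CPU / 4GB VPS example.
# Arguments: CPU count, memory in whole GB.
meets_vps_baseline() {
  local cpus="$1" mem_gb="$2"
  [ "$cpus" -ge 2 ] && [ "$mem_gb" -ge 4 ]
}

# Linux-only usage. The kernel reports slightly less than the nominal RAM,
# so the value is rounded to the nearest GB before comparing.
#   cpus=$(nproc)
#   mem_gb=$(awk '/MemTotal/ {print int($2/1024/1024 + 0.5)}' /proc/meminfo)
#   meets_vps_baseline "$cpus" "$mem_gb" && echo "comparable to the VPS option"
```

A machine below this baseline may still work, as the old-laptop option suggests, just with higher latency.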
On-Device Mode: No Network at All
Don't want a server? Diction also supports fully on-device transcription.
Speech recognition models download to your iPhone. After that, nothing leaves the device — no network calls, no server, no latency beyond model inference time.
Three models:
- Standard (142MB) — downloads automatically on first launch
- Advanced (632MB) — higher accuracy, handles accents and noisy environments
- Premium (500MB) — best accuracy and speed, included with Diction One subscription
Zero Tracking SDKs
The Diction iOS app ships with zero analytics, zero crash reporting, and zero telemetry. No third-party SDKs that phone home.
Most apps include 10-30 third-party SDKs, each with its own data collection practices. Diction's server infrastructure is open source, so you can read every line of the gateway that handles your audio.
Diction One (Cloud)
If you want the best accuracy without running your own server, Diction One is the cloud option.
"We don't store audio" is a meaningful claim for Diction because you can verify it. The gateway code is open source. There's no database write anywhere in the audio path — audio is proxied to the transcription model and the result is returned. Nothing is retained.
Most cloud services tell you they don't store audio. Diction can show you.
Voice is intimate. What you say to your keyboard deserves the same privacy as what you say in person.
Download on the App Store · diction.one · github.com/omachala/diction