I Stopped Paying $15/Month for Wispr Flow. Here's the Open-Source Replacement.

#ios #opensource #productivity #privacy

I paid for Wispr Flow for five months.

A monthly subscription. Every month. For voice-to-text on my iPhone.

It's a good product. The AI editing layer is genuinely impressive — it strips filler words, fixes grammar, adapts to how you write. That part works. If you want the best cloud-based dictation and don't mind paying, Wispr delivers.

But every time I used it, the same thought: my voice is going to their cloud. Not my cloud. Theirs.

I already run a home server. Docker Compose, Tailscale, the usual homelab stack. I had faster-whisper running for other things. The transcription engine was already there. I just didn't have a way to use it from my phone.

So I built one.

What the switch actually looked like

The server side was easy. I already had the transcription container. I wrote a small Go gateway to handle WebSocket streaming from the phone, and wrapped both in a compose file:

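services:
  # Go gateway: handles WebSocket streaming from the phone
  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    ports:
      - "8080:8080"   # the port the app connects to
    environment:
      DEFAULT_MODEL: small

  # Transcription container: faster-whisper, CPU build
  whisper-small:
    image: fedirz/faster-whisper-server:latest-cpu
    environment:
      WHISPER__MODEL: Systran/faster-whisper-small
      WHISPER__INFERENCE_DEVICE: cpu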

Run docker compose up -d, point the app at http://your-server:8080, and you're dictating.
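If the app can't connect, it helps to confirm the published port is reachable before blaming the keyboard. Here is a minimal sketch of a plain TCP check. It is not part of Diction: port_open is a made-up helper name, and it only proves that something is listening on the port, not that transcription works.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Calling port_open("your-server", 8080) from a laptop on the same network (with your-server replaced by your real hostname or Tailscale name) should return True once the containers are up.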

The hard part was the iOS keyboard. Keyboard extensions on iOS run in a sandbox with a 48MB memory ceiling, no direct mic access without Full Access, and a text proxy that behaves differently in every app. That took months, not hours.

The result is Diction — a voice keyboard that connects to whatever transcription server you point it at.

Where I actually use it

Voice-to-text is useless if you don't end up talking to your phone. Before the switch, Wispr sat on my Home Screen and I used it for the same three things every day. After the switch, usage tripled — not because the keyboard is magical, but because I stopped rationing it to stay under a word cap. Typical day now:

Long messages in Telegram and Signal. In group chats where I'd normally send a voice note, I'd rather send text. A four-sentence reply takes eight seconds including the thinking. Tapping the same thing on a phone keyboard is a minute of typos.
Notes app while walking. Ideas arrive on the move and typing them in means stopping. Mic button, sentence, back in the pocket. Rolling session context keeps punctuation and tone consistent across a run of quick notes.
Email composer. Most replies start as a dictated rough draft, followed by a handful of keyboard edits. Faster than tapping out three paragraphs on glass, and the AI Companion cleans out the filler before I even see it.
Search and address bars. Any place with a text input, which on iOS is basically everywhere. The mute/speak rhythm is tighter than my thumbs ever were.
Dictating to Claude Code over SSH. My terminal runs on the same home server. The keyboard dictates into a Prompt SSH session, and voice prompts land in claude without a laptop in sight.

Nothing here is a corner case. It's mundane iPhone usage. The difference is volume — on Wispr's free tier I'd hit 1,000 words by Wednesday. With my own server there's no counter running.

What's honestly worse

Wispr's AI editing layer is more refined. It doesn't just transcribe — it rewrites. Filler words vanish, punctuation lands correctly, tone matches the document. They've had years to tune it.

Diction's AI Companion can now run cleanup on every transcription (it's optional, yours to toggle), and it's improved a lot. Profile text (who you are, how you write), per-app tones, a custom dictionary for jargon and proper nouns, and rolling session context all feed the LLM on every dictation. It's close. Wispr is still ahead on the polish of the final rewrite pass.
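To make those four inputs concrete, here is a hypothetical sketch of how they could be folded into one cleanup prompt. This is not Diction's actual implementation: CleanupContext, build_prompt, the prompt wording, and the three-item session window are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CleanupContext:
    """Hypothetical container for the inputs listed above."""
    profile: str                                             # who you are, how you write
    app_tones: dict[str, str] = field(default_factory=dict)  # per-app tone
    dictionary: list[str] = field(default_factory=list)      # jargon, proper nouns
    # Rolling session context: only the most recent dictations are kept.
    recent: deque[str] = field(default_factory=lambda: deque(maxlen=3))


def build_prompt(ctx: CleanupContext, app: str, transcript: str) -> str:
    """Assemble a single text prompt for an LLM cleanup pass."""
    tone = ctx.app_tones.get(app, "neutral")  # fall back when the app has no tone set
    parts = [
        "Clean up this dictated text: remove filler words, fix punctuation.",
        f"Writer profile: {ctx.profile}",
        f"Tone for {app}: {tone}",
    ]
    if ctx.dictionary:
        parts.append("Keep these terms spelled exactly: " + ", ".join(ctx.dictionary))
    if ctx.recent:
        parts.append("Earlier in this session: " + " | ".join(ctx.recent))
    parts.append(f"Transcript: {transcript}")
    ctx.recent.append(transcript)  # remember it for the next dictation
    return "\n".join(parts)
```

The bounded deque is the design choice worth noting: a run of quick notes shares context, but the prompt never grows without limit.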

If you don't want to think about infrastructure and just want the best cloud experience, Wispr is still a strong choice.

What's better

My audio stays on my network. I can verify that because the server code is open source — there's nothing to trust on faith.

No word limits. Wispr's free tier caps iPhone at 1,000 words/week. Self-hosted Diction has no caps, no subscription, no catch.

Latency on a local network is excellent. The small Whisper model on a modern CPU returns transcriptions in 2–4 seconds. With a GPU, it's near instant.
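If you want to check those numbers on your own hardware, a rough timing harness is enough. The sketch below assumes the faster-whisper Python library is installed (pip install faster-whisper) and times the library directly rather than going through the gateway, so network overhead is not included. The helper names and the int8 compute type are my assumptions, not settings taken from the compose file.

```python
import time
from typing import Callable


def timed(fn: Callable[[], str]) -> tuple[str, float]:
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def load_model():
    """Load the small model on CPU. Done once, outside the timed section."""
    # Imported lazily so timed() can be used without faster-whisper installed.
    from faster_whisper import WhisperModel

    return WhisperModel("small", device="cpu", compute_type="int8")


def transcribe(model, path: str) -> str:
    """Transcribe one audio file and return the joined text."""
    segments, _info = model.transcribe(path)
    # segments is a lazy generator: decoding only happens while iterating,
    # so the join below is where the real work (and time) goes.
    return " ".join(segment.text.strip() for segment in segments)
```

Usage would look like model = load_model() followed by text, seconds = timed(lambda: transcribe(model, "sample.wav")), where sample.wav is a short recording of your own. Loading the model first keeps the one-time startup cost out of the measurement.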

And when my internet goes down, on-device mode keeps working. Wispr is cloud-only — no connection, no transcription.
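The fallback idea itself is simple to sketch. The code below is a hypothetical illustration, not how Diction is implemented: transcribe_with_fallback and both backend callables are placeholder names, and treating any OSError as "server unreachable" is an assumption.

```python
from typing import Callable

Transcriber = Callable[[bytes], str]


def transcribe_with_fallback(
    audio: bytes, remote: Transcriber, on_device: Transcriber
) -> tuple[str, str]:
    """Try the server first; fall back to the on-device model on network errors.

    Returns (text, backend_name) so the caller can show which path was used.
    """
    try:
        return remote(audio), "server"
    except OSError:
        # OSError covers ConnectionError and TimeoutError in Python 3.
        return on_device(audio), "on-device"
```

Only network-style failures trigger the fallback here; any other exception from the server path still surfaces, which keeps real bugs from being silently masked.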

The honest trade-off

I traded polish for control. Wispr is more refined. Diction gives me ownership of the entire pipeline — from the mic to the model — and it's getting better with every release.

If you're already running Docker at home and the idea of sending every word you speak to someone else's server bothers you, the self-hosted setup takes about 10 minutes.

github.com/omachala/diction