You've got Ollama running on your home server. Your iPhone keyboard is still phoning home.
That gap bothered me enough to close it. Diction is an open-source iOS keyboard that speaks the standard OpenAI transcription API (POST /v1/audio/transcriptions). Its gateway is open-source Go, configured entirely with environment variables. The latest update adds something I wanted from the start: plug in your own LLM for post-processing. Ollama, OpenAI, anything with an OpenAI-compatible endpoint works.
Here's the full stack.
What Each Piece Does
The speech engine does the transcription. I ran Whisper for months. But there's a faster option if you dictate in English or another European language: NVIDIA's Parakeet model, available via the achetronic/parakeet Docker image. It supports 25 languages. On CPU it's roughly 10x faster than Whisper, uses about 2GB RAM, and edges out Whisper large-v3 on accuracy for English. Models are baked into the image, so no first-run download.
If you need Asian languages, Arabic, or anything outside those 25, use Whisper instead. Everything below works with either.
The gateway sits in front of the speech engine. It handles WebSocket streaming, so your phone streams audio live while you're still talking. By the time you stop, the transcript is mostly ready. It also handles the new LLM post-processing step.
Ollama cleans up the transcript after transcription. The gateway calls it with a system prompt you write, and the cleaned text is what gets inserted into the app. Your model, your prompt.
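The gateway's internals aren't shown here, but the request it builds is the standard OpenAI-compatible chat-completions shape: your LLM_PROMPT as the system message, the raw transcript as the user message. A minimal sketch (function name and temperature choice are mine, not Diction's):

```python
def build_cleanup_request(model: str, system_prompt: str, transcript: str) -> dict:
    """Build the JSON body for an OpenAI-compatible /v1/chat/completions call.

    Mirrors the described flow: LLM_PROMPT becomes the system message,
    the raw transcript becomes the user message.
    """
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcript},
        ],
        "temperature": 0,  # cleanup should be deterministic, not creative
    }


body = build_cleanup_request(
    "gemma2:9b",
    "Clean up this voice transcription. Return only the cleaned text.",
    "um so the deploy uh failed again",
)
print(body["messages"][0]["role"])  # → system
```

Because the endpoint shape is the OpenAI standard, the same body works against Ollama, OpenAI, or any compatible server; only the base URL and API key change.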
The Stack
Three containers. No GPU required. Under 3GB RAM without Ollama, 8-10GB with a 9B model loaded.
```yaml
services:
  parakeet:
    image: ghcr.io/achetronic/parakeet:latest-int8
    ports:
      - "9006:5092"

  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    ports:
      - "8080:8080"
    depends_on:
      - parakeet
    environment:
      DEFAULT_MODEL: parakeet-v3
      LLM_BASE_URL: http://ollama:11434/v1
      LLM_API_KEY: ollama
      LLM_MODEL: gemma2:9b
      LLM_PROMPT: "Clean up this voice transcription. Remove filler words (um, uh, like). Fix punctuation and grammar. Return only the cleaned text."

  ollama:
    image: ollama/ollama
    volumes:
      - ollama-data:/root/.ollama

volumes:
  ollama-data:
```

```shell
docker compose up -d
docker compose exec ollama ollama pull gemma2:9b
```
Skip the Ollama block entirely if you don't want AI cleanup. The gateway checks for LLM_BASE_URL on startup. If it's not set, transcriptions come back raw.
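That fallback is worth spelling out, since it's what makes the LLM layer strictly optional. A sketch of the gating logic as described (the `call_llm` hook is a hypothetical stand-in for the real HTTP call):

```python
import os


def post_process(transcript: str, call_llm) -> str:
    """Return the raw transcript unless LLM post-processing is configured.

    Mirrors the gateway behavior described in the text: no LLM_BASE_URL,
    no cleanup step. `call_llm` stands in for the real chat-completions call.
    """
    if not os.environ.get("LLM_BASE_URL"):
        return transcript
    return call_llm(transcript)


os.environ.pop("LLM_BASE_URL", None)
print(post_process("um hello there", lambda t: t.replace("um ", "")))
# → um hello there  (raw: no LLM configured)

os.environ["LLM_BASE_URL"] = "http://ollama:11434/v1"
print(post_process("um hello there", lambda t: t.replace("um ", "")))
# → hello there
```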
If you already have Ollama running on a different machine, point LLM_BASE_URL at it. Works with any model you've already pulled.
For Whisper instead of Parakeet, swap the parakeet service for fedirz/faster-whisper-server:latest-cpu and set DEFAULT_MODEL: small (or medium, large-turbo) on the gateway.
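The swap amounts to replacing one service and one env var. A compose fragment under those assumptions (the service name `whisper` is my choice; check the image's docs for its internal port before wiring the gateway to it):

```yaml
services:
  whisper:
    image: fedirz/faster-whisper-server:latest-cpu

  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    environment:
      DEFAULT_MODEL: small  # or medium, large-turbo
```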
Connecting the App
- Install Diction from the App Store
- In iPhone Settings: General → Keyboard → Keyboards → Add New Keyboard → Diction
- Open the Diction app, switch to Self-Hosted
- Paste your server address (e.g. http://192.168.1.100:8080); a green dot confirms the connection
- Enable AI Enhancement in app settings (requires a Diction One subscription to unlock the toggle — the processing runs on your server, not Diction's)
The keyboard is now using your server. Audio goes from your phone to your server, Parakeet transcribes it, Ollama cleans the result, text lands in whatever app you're typing in. Nothing leaves your network.
If you're away from home, Tailscale or a Cloudflare Tunnel connects your phone without opening router ports.
Writing a Prompt That Works
The LLM_PROMPT env var is a single system prompt. The gateway sends it with every transcription request. The transcript is the user message. You control both.
A few starting points:
```
# General dictation
Remove filler words (um, uh, like, you know). Fix punctuation and grammar.
Preserve meaning and tone. Return only the cleaned result.

# Technical / developer notes
Fix transcription errors. Preserve technical terms, command names, and file paths
exactly as spoken. Remove filler words. Return cleaned text only.

# Medical or domain-specific
Fix transcription errors. Preserve all domain-specific terminology exactly as spoken.
Fix grammar and punctuation only. Return the corrected text.
```
One practical note: models under 7B parameters often answer the transcript rather than clean it. Gemma2 9B is reliable. Qwen2.5 7B is borderline on this task. Anything 9B+ behaves predictably.
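You can catch that failure mode when evaluating a model: a faithful cleanup keeps most of the original vocabulary and similar length, while an "answer" diverges sharply on both. A crude heuristic for your own testing (the function and thresholds are illustrative, not part of Diction):

```python
def looks_like_answer(transcript: str, output: str) -> bool:
    """Crude check: did the model reply to the transcript instead of cleaning it?

    A cleanup keeps most original words and roughly the original length.
    Thresholds are illustrative, not tuned.
    """
    in_words = set(transcript.lower().split())
    out_words = set(output.lower().split())
    if not in_words:
        return False
    overlap = len(in_words & out_words) / len(in_words)
    length_ratio = len(output) / max(len(transcript), 1)
    return overlap < 0.5 or length_ratio > 1.5


# A faithful cleanup keeps the original vocabulary:
print(looks_like_answer("um can you deploy the app", "Can you deploy the app?"))  # → False
# An "answer" shares few words and usually balloons in length:
print(looks_like_answer(
    "um can you deploy the app",
    "Sure! To deploy, first run docker compose up and then check the logs.",
))  # → True
```

Run a dozen real transcripts through a candidate model with this check before committing to it; small models fail it often enough that the pattern shows up quickly.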
What You Get
Audio from your phone to your server. Parakeet or Whisper transcribes it. Ollama cleans it. Text inserted. No third party, no word limits. The Diction gateway is fully open source on GitHub — inspect every line that runs in your network.
Full setup docs at diction.one/self-hosted. If you're already running a homelab with Ollama, the marginal effort is a single compose file and 10 minutes.