Someone posted on our GitHub Discussions this week. They'd been running a speech-to-text container on their homelab for months. Found Diction, an open-source iOS voice keyboard. Pointed the app at their server. Got a server error. The settings screen even said "endpoint reachable."
Here's what was going wrong, and how two lines of config fix it.
Why direct connection fails
Diction doesn't talk directly to speech servers. It connects through a lightweight gateway first.
The reason is WebSockets. When you tap the mic, the app opens a WebSocket and streams raw PCM audio to the gateway in real time as you speak. When you're done, the gateway POSTs the full audio to your speech server, gets the transcript, and sends it back. The whole exchange happens in the time it takes to stop speaking.
Without this, the alternative is: record the whole thing, send a file, wait. You'd feel every pause. The WebSocket is what makes it feel instant.
The "endpoint reachable" check passes because the iOS app pings /health or /v1/models. Most speech servers expose these. But the actual transcription uses the WebSocket endpoint, which only the gateway handles. No gateway, no streaming.
The fix
You don't need to run our speech containers. Just the gateway, pointed at yours.
If your server is at http://192.168.1.50:8000, this is your entire docker-compose.yml:
```yaml
services:
  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    ports:
      - "8080:8080"
    environment:
      CUSTOM_BACKEND_URL: http://192.168.1.50:8000
      CUSTOM_BACKEND_MODEL: your-model-name-here
```
Start it:
```sh
docker compose up -d
```
Open Diction, go to Self-Hosted, paste http://192.168.1.50:8080. Done.
Your model stays where it is. The gateway handles the WebSocket layer, audio buffering, and forwarding. Audio still only goes to your server.
CUSTOM_BACKEND_MODEL
One thing to get right: the model name.
Most speech servers that follow the OpenAI-compatible API format expect a model field in the transcription request to know which model to load. Without it, some return an error.
Set CUSTOM_BACKEND_MODEL to whatever name your server expects. Check your server's docs or the model you started it with. If your server only runs one model and ignores the field, you can omit it entirely.
WAV-only servers
Some speech servers only accept WAV audio input. The gateway handles conversion automatically:
```yaml
environment:
  CUSTOM_BACKEND_URL: http://192.168.1.50:5092
  CUSTOM_BACKEND_NEEDS_WAV: "true"
```
With this set, the gateway converts audio to 16kHz mono WAV via ffmpeg before forwarding. Your server gets the format it expects.
API key protection
If your server is behind an API key:
```yaml
environment:
  CUSTOM_BACKEND_URL: http://my-server:8000
  CUSTOM_BACKEND_MODEL: my-model
  CUSTOM_BACKEND_AUTH: "Bearer sk-your-key"
```
The gateway injects the Authorization header on every request to your backend.
Latency on a local network
I tested this end-to-end: generated a speech WAV, sent it through the gateway to a real speech container, got the transcript back correctly. On a local network with a CPU-only container, the round trip was under 5 seconds. With a dedicated GPU, it's near instant.
What the gateway actually does
The gateway is open source at github.com/omachala/diction. It's a small Go service that:
- Accepts WebSocket connections from the iOS app
- Buffers incoming PCM audio frames
- Wraps them in a WAV header and POSTs to your speech backend
- Returns the transcript over the WebSocket
No cloud calls. No telemetry. The full source is in /gateway/core/.
The keyboard itself
Diction is a native iOS keyboard extension. Once you add it in Settings and enable Full Access, it shows up in any app: Messages, Notes, email, search bars, anything. Tap the mic, speak, text appears.
On-device and self-hosted modes are free with no word limits. No subscription needed to use your own server.
If you're already running a speech stack at home, you're most of the way there.
Diction is on GitHub: github.com/omachala/diction