Someone posted on our GitHub Discussions this week. They'd been running a speech-to-text container on their homelab for months. Found Diction, an open-source iOS voice keyboard. Pointed the app at their server. Got a server error. The settings screen even said "endpoint reachable."
Here's what was going wrong, and how two lines of config fixes it.
Why a direct connection sometimes fails
Diction can talk directly to any server that speaks the OpenAI transcription API (POST /v1/audio/transcriptions). For one-shot transcriptions, you paste the URL and it works.
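If you're curious what that one-shot request looks like on the wire, here's a minimal Python sketch of building the multipart body. The "file" and "model" field names come from the OpenAI transcription API; the helper function is mine, not part of Diction:

```python
import io
import uuid

def build_transcription_request(audio, model, filename="audio.wav"):
    """Build body and headers for POST /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    # The audio itself -- the OpenAI API calls this form field "file".
    body.write(f"--{boundary}\r\n".encode())
    body.write(
        (f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: audio/wav\r\n\r\n").encode()
    )
    body.write(audio)
    # The model field -- some servers reject requests that omit it.
    body.write(f"\r\n--{boundary}\r\n".encode())
    body.write(b'Content-Disposition: form-data; name="model"\r\n\r\n')
    body.write(model.encode())
    body.write(f"\r\n--{boundary}--\r\n".encode())
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return body.getvalue(), headers
```

POST that body with those headers to your server's /v1/audio/transcriptions and an OpenAI-compatible server responds with JSON containing a text field. That's the whole one-shot protocol.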
But Diction also does streaming — when you tap the mic, the app opens a WebSocket and sends raw PCM audio as you speak. By the time you stop talking, the transcript is mostly back. That WebSocket layer isn't in the OpenAI spec, so most speech servers don't implement it. Without it, the app has to wait for you to finish, POST a full file, and sit on the response. Short phrases feel fine. Longer dictations have a visible pause.
The "endpoint reachable" check passes because the iOS app pings /health or /v1/models. Most speech servers expose these. But when streaming is on, the real work happens on a WebSocket endpoint only the gateway handles. No gateway, no streaming.
The fix
You don't need to run our speech containers. Just the gateway, pointed at yours.
If your server is at http://192.168.1.50:8000, this is your entire docker-compose.yml:
services:
  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    ports:
      - "8080:8080"
    environment:
      CUSTOM_BACKEND_URL: http://192.168.1.50:8000
      CUSTOM_BACKEND_MODEL: your-model-name-here
Start it:
docker compose up -d
Open Diction, go to Self-Hosted, paste http://192.168.1.50:8080. Done.
Your model stays where it is. The gateway handles the WebSocket layer, audio buffering, and forwarding. Audio still only goes to your server.
CUSTOM_BACKEND_MODEL
One thing to get right: the model name.
Most OpenAI-compatible speech servers expect a model field in the transcription request so they know which model to load. Without it, some return an error.
Set CUSTOM_BACKEND_MODEL to whatever name your server expects. Check your server's docs or the model you started it with. If your server only runs one model and ignores the field, you can omit it entirely.
WAV-only servers
Some speech servers only accept WAV audio input. The gateway handles conversion automatically:
environment:
  CUSTOM_BACKEND_URL: http://192.168.1.50:5092
  CUSTOM_BACKEND_NEEDS_WAV: "true"
With this set, the gateway converts audio to 16kHz mono WAV via ffmpeg before forwarding. Your server gets the format it expects.
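The target format itself is simple: 16-bit PCM samples in a WAV container, 16 kHz, one channel. A Python sketch of the container step using the stdlib wave module, for reference (the gateway uses ffmpeg, which also handles resampling; this sketch assumes the samples are already 16-bit 16 kHz mono):

```python
import io
import wave

def pcm_to_wav(pcm, sample_rate=16000):
    """Wrap raw 16-bit mono PCM samples in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)            # mono
        w.setsampwidth(2)            # 16-bit samples
        w.setframerate(sample_rate)  # 16 kHz
        w.writeframes(pcm)
    return buf.getvalue()
```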
API key protection
If your server is behind an API key:
environment:
  CUSTOM_BACKEND_URL: http://my-server:8000
  CUSTOM_BACKEND_MODEL: my-model
  CUSTOM_BACKEND_AUTH: "Bearer sk-your-key"
The gateway injects the Authorization header on every request to your backend.
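Conceptually, that injection is the equivalent of this Python sketch (not the gateway's actual Go source; the env var name is the real one, the helper is illustrative):

```python
import os

def backend_headers(extra=None):
    """Headers for a forwarded backend request, with the value of
    CUSTOM_BACKEND_AUTH injected as Authorization when it is set."""
    headers = dict(extra or {})
    auth = os.environ.get("CUSTOM_BACKEND_AUTH")
    if auth:
        headers["Authorization"] = auth  # e.g. "Bearer sk-your-key"
    return headers
```

Note the value is passed through verbatim, which is why the "Bearer " prefix belongs in the env var.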
When you don't need the gateway
If you're fine with the pause after you stop talking — short notes, quick replies, a server on fast local hardware — skip the gateway. Paste your server's URL straight into Diction's Self-Hosted tab and make sure your server speaks the OpenAI transcription API. The app handles it via the HTTP fallback path.
The gateway is what turns dictation into something that feels like typing. For longer dictations, it's the difference between a fluid stream and a wait at the end.
Latency on a local network
I tested this end-to-end: generated a speech WAV, sent it through the gateway to a real speech container, got the transcript back correctly. On a local network with a CPU-only container, the round trip was under 5 seconds. With a dedicated GPU, it's near instant.
What the gateway actually does
The gateway is open source at github.com/omachala/diction. It's a small Go service that:
- Accepts WebSocket connections from the iOS app
- Buffers incoming PCM audio frames
- Wraps them in a WAV header and POSTs to your speech backend
- Returns the transcript over the WebSocket
No cloud calls. No telemetry. The full source is in /gateway/core/.
If you're running a Whisper server and want to get started from scratch, the 3-command setup guide covers the full stack.