Someone posted on our GitHub Discussions this week. They'd been running a speech-to-text container on their homelab for months. Found Diction, an open-source iOS voice keyboard. Pointed the app at their server. Got a server error. The settings screen even said "endpoint reachable."
Here's what was going wrong, and how two lines of config fix it.
Why direct connection fails
Diction doesn't talk directly to speech servers. It connects through a lightweight gateway first.
The reason is WebSockets. When you tap the mic, the app opens a WebSocket and streams raw PCM audio to the gateway in real time as you speak. When you're done, the gateway POSTs the full audio to your speech server, gets the transcript, and sends it back. The whole exchange happens in the time it takes to stop speaking.
Without this, the alternative is: record the whole thing, send a file, wait. You'd feel every pause. The WebSocket is what makes it feel instant.
The "endpoint reachable" check passes because the iOS app pings /health or /v1/models. Most speech servers expose these. But the actual transcription uses the WebSocket endpoint, which only the gateway handles. No gateway, no streaming.
The fix
You don't need to run our speech containers. Just the gateway, pointed at yours.
If your server is at http://192.168.1.50:8000, this is your entire docker-compose.yml:
```yaml
services:
  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    ports:
      - "8080:8080"
    environment:
      CUSTOM_BACKEND_URL: http://192.168.1.50:8000
      CUSTOM_BACKEND_MODEL: your-model-name-here
```
Start it:
```sh
docker compose up -d
```
Open Diction, go to Self-Hosted, paste http://192.168.1.50:8080. Done.
Your model stays where it is. The gateway handles the WebSocket layer, audio buffering, and forwarding. Audio still only goes to your server.
CUSTOM_BACKEND_MODEL
One thing to get right: the model name.
Most speech servers that follow the OpenAI-compatible API format expect a model field in the transcription request to know which model to load. Without it, some return an error.
Set CUSTOM_BACKEND_MODEL to whatever name your server expects. Check your server's docs or the model you started it with. If your server only runs one model and ignores the field, you can omit it entirely.
WAV-only servers
Some speech servers only accept WAV audio input. The gateway handles conversion automatically:
```yaml
environment:
  CUSTOM_BACKEND_URL: http://192.168.1.50:5092
  CUSTOM_BACKEND_NEEDS_WAV: "true"
```
With this set, the gateway converts audio to 16kHz mono WAV via ffmpeg before forwarding. Your server gets the format it expects.
API key protection
If your server is behind an API key:
```yaml
environment:
  CUSTOM_BACKEND_URL: http://my-server:8000
  CUSTOM_BACKEND_MODEL: my-model
  CUSTOM_BACKEND_AUTH: "Bearer sk-your-key"
```
The gateway injects the Authorization header on every request to your backend.
Latency on a local network
I tested this end-to-end: generated a speech WAV, sent it through the gateway to a real speech container, got the transcript back correctly. On a local network with a CPU-only container, the round trip was under 5 seconds. With a dedicated GPU, it's near instant.
What the gateway actually does
The gateway is open source at github.com/omachala/diction. It's a small Go service that:
- Accepts WebSocket connections from the iOS app
- Buffers incoming PCM audio frames
- Wraps them in a WAV header and POSTs to your speech backend
- Returns the transcript over the WebSocket
No cloud calls. No telemetry. The full source is in /gateway/core/.
The keyboard itself
Diction is a native iOS keyboard extension. Once you add it in Settings and enable Full Access, it shows up in any app: Messages, Notes, email, search bars, anything. Tap the mic, speak, text appears.
On-device and self-hosted modes are free with no word limits. No subscription needed to use your own server.
If you're already running a speech stack at home, you're most of the way there.
Diction is on GitHub: github.com/omachala/diction