This article is about getting your own private speech-to-text on your iPhone. Tap a key, speak, watch the words land in whatever app you're in. No cloud in the middle, no subscription, no company on the other end reading what you said. The keyboard is Diction. This post is the full setup, start to finish, blank machine to working dictation in under thirty minutes.
I built the server side for myself:
https://github.com/omachala/diction
I talk to my AI agents all day. Claude in the terminal, my Telegram bot OpenClaw, a handful of others. Voice for everything. Long prompts, half-formed plans, emails I want rewritten, code I want reviewed. Every word used to pass through someone else's transcription cloud before my own agents ever heard it. Not anymore.
A small Docker stack on a box at home now handles the transcription. An optional cleanup step scrubs filler words and fixes punctuation using any LLM you want: OpenAI, Groq, a local Ollama model, anything OpenAI-compatible.
Every command is below.
What You'll End Up With
- A box at home running the speech model, 24/7
- Your iPhone sending audio to it over your home WiFi
- Optional: an LLM of your choice for cleaning up filler words and fixing punctuation (OpenAI, Groq, Anthropic, a local Ollama model, anything with an OpenAI-compatible API)
- Total running cost with cleanup on: depends on the LLM you pick. Roughly a cent per hour of dictation on `gpt-4o-mini`, zero if you run a local model.
The speech part is free forever. The cleanup part costs whatever your LLM provider charges. Use a local model and pay nothing. More on that at the end.
What You Need
- Any machine that can run Docker: Mac mini, an old laptop, a home server in a closet, a NUC, a home lab box. Apple Silicon or any modern x86 works fine. Raspberry Pi is a stretch for the speech part. Anything newer is comfortable.
- An iPhone running iOS 17 or newer
- Both on the same WiFi network
- Optional: an API key for any OpenAI-compatible LLM (OpenAI, Groq, Together, Anthropic via a proxy, Ollama running locally, etc.) if you want AI cleanup
I'll assume you know what Docker is and how to open a terminal. That's it.
Step 1: Install Docker
You need Docker Engine plus Docker Compose. Both come bundled in Docker Desktop on Mac and Windows. On Linux you install them separately (they're both free and open source).
macOS (Intel or Apple Silicon): Download Docker Desktop, open the .dmg, drag the whale icon to Applications, launch it. The first run asks for admin credentials (it needs to install a helper tool and set up networking). When the whale icon in the menu bar stops animating and says "Docker Desktop is running", you're ready.
Windows: Download Docker Desktop. The installer will enable WSL2 if it's not already on - this is required, and needs a reboot. After the reboot, launch Docker Desktop. Same whale icon in the system tray tells you when it's ready.
Linux: Either install Docker Desktop (same download page) or go with the native packages:
# Ubuntu / Debian (on recent Ubuntu releases the compose plugin package may be named docker-compose-v2)
sudo apt update
sudo apt install docker.io docker-compose-plugin
# Fedora / RHEL
sudo dnf install docker docker-compose-plugin
# Arch
sudo pacman -S docker docker-compose
Start the service and add your user to the docker group so you don't need sudo every time:
sudo systemctl enable --now docker
sudo usermod -aG docker "$USER"
Log out and back in (or reboot) so the group change takes effect. Yes, you really need to log out. Running newgrp docker works too but only in the current shell.
Verify it's all working:
docker --version
docker compose version
docker run --rm hello-world
The last command pulls a tiny test image and prints a greeting. If it fails with "permission denied" on Linux, you skipped the log-out-and-back-in step.
Apple Silicon users, one extra thing: open Docker Desktop → Settings → General and make sure "Use Rosetta for x86/amd64 emulation" is enabled. This is the default on recent Docker Desktop builds. The Diction gateway image is built for amd64 (multi-arch is on the roadmap), so Docker needs Rosetta to run it on your M1/M2/M3/M4. Performance impact is negligible - the speech model image is multi-arch and runs natively on arm64, so Rosetta is only handling the small Go binary in front of it.
While you're in Settings, also check Resources → Memory. The default Docker Desktop VM ships with 2 GB, which is tight for medium (~2.1 GB) and will OOM silently. Bump to 4 GB if you're running anything above small.
Step 2: Create a Project Folder
Pick a home for the compose file and any supporting config. Anywhere works. I use ~/diction:
mkdir -p ~/diction && cd ~/diction
Everything in the rest of this article assumes you're sitting in that folder. Docker Compose looks for docker-compose.yml in the current directory, so all the docker compose commands Just Work as long as you cd ~/diction first.
If you're setting this up on a remote server (Linux box in a closet, NUC, etc.), SSH in and run the same command there. Where you edit the file is up to you: nano docker-compose.yml on the server, VSCode Remote-SSH, or editing locally and scp-ing the file over. All fine.
Step 3: Write the Compose File
Here's what we're about to spin up. Two containers working together:
- Diction Gateway. The open-source Go service at the front of the stack. On the outside it speaks the standard OpenAI transcription API (`POST /v1/audio/transcriptions`), which is what the Diction iPhone app talks to. On the inside it routes your audio to whichever speech model you've loaded, and optionally passes the transcript through an LLM for cleanup. The source is on GitHub, MIT licensed. Small, boring Go. Read it, fork it, bend it to your needs.
- A voice model. The engine that actually turns audio into text. For this starter stack we're using `faster-whisper` - a compact, battle-tested open-source model that ships in sizes `tiny`, `base`, `small`, `medium`, `large-v3`, and `large-v3-turbo`. Bigger means more accurate and slower. We'll run `small`. It's the sweet spot for CPU-only machines: accurate enough for real dictation, transcribes a 5-second clip in 1 to 2 seconds on a modern Mac mini or NUC.
If you've got an NVIDIA GPU sitting in the machine, you can skip small and run something far better (Parakeet or large-v3-turbo). Jump to the "Got an NVIDIA GPU Sitting Idle?" section below before you paste the compose file. Otherwise continue here.
Paste this into ~/diction/docker-compose.yml:
services:
  whisper-small:
    image: fedirz/faster-whisper-server:latest-cpu
    container_name: diction-whisper-small
    restart: unless-stopped
    volumes:
      - whisper-models:/root/.cache/huggingface
    environment:
      WHISPER__MODEL: Systran/faster-whisper-small
      WHISPER__INFERENCE_DEVICE: cpu

  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    platform: linux/amd64
    container_name: diction-gateway
    restart: unless-stopped
    ports:
      - "8080:8080"
    depends_on:
      - whisper-small
    environment:
      DEFAULT_MODEL: small

volumes:
  whisper-models:
What each line does
Quick tour so you know what you're pasting.
whisper-small service:
- `image: fedirz/faster-whisper-server:latest-cpu`. The voice model engine. `faster-whisper` is a C++/CTranslate2 reimplementation of the original open-source voice model from OpenAI, running 4x faster with less memory. `fedirz/faster-whisper-server` wraps it in a small Python server that speaks the OpenAI transcription API. The `-cpu` tag is the CPU build. There's also a `-cuda` tag for NVIDIA users (see the GPU section below).
- `container_name: diction-whisper-small`. Just a friendly name so `docker ps` shows something readable instead of a random string.
- `restart: unless-stopped`. If the container crashes or the host reboots, Docker brings it back. The only thing that stops it is you explicitly running `docker compose down`.
- `volumes: - whisper-models:/root/.cache/huggingface`. The model weights are downloaded on first start (about 500 MB for `small`). This volume persists them across container rebuilds, so you don't re-download every time you pull a newer image.
- `WHISPER__MODEL: Systran/faster-whisper-small`. The specific voice model to load. It's a HuggingFace repo ID. You can swap this for any CT2-compatible voice model.
- `WHISPER__INFERENCE_DEVICE: cpu`. Tells it to run on CPU. Swap to `cuda` if you've got an NVIDIA card (full example in the GPU section below).
gateway service:
- `image: ghcr.io/omachala/diction-gateway:latest`. The Diction gateway from GitHub Container Registry.
- `platform: linux/amd64`. The current published image is amd64-only. On Apple Silicon, Docker will run it under Rosetta transparently. Drop this line on a native x86 host if you want the output of `docker compose config` to be slightly tidier.
- `ports: - "8080:8080"`. Maps port 8080 on the host to 8080 in the container. This is the one your iPhone will talk to. If 8080 is already in use on your machine, change the left side: `"18080:8080"` and use `http://your-ip:18080` from the phone.
- `depends_on: - whisper-small`. Docker starts the whisper container first so the gateway doesn't throw connection-refused on startup. Not strictly required (the gateway retries), but it makes logs cleaner.
- `DEFAULT_MODEL: small`. The model the gateway routes to when the iPhone sends a request without specifying one. The gateway has a built-in mapping of short names (`small`, `medium`, `large-v3-turbo`, `parakeet-v3`) to backend service URLs. Setting `DEFAULT_MODEL: small` makes it expect a service named `whisper-small` on port 8000. This is why the first service is named `whisper-small` and not `whisper`.
volumes: block at the bottom: declares the named volume Docker uses for the model cache. Named volumes are managed by Docker itself and survive container rebuilds.
Model sizes and what to pick
small is the starter. It's accurate enough for everyday dictation and fits comfortably on any modern laptop or NUC. If you want something else, swap WHISPER__MODEL in the compose file:
| Model | Parameters | RAM | CPU latency (5s clip) | Notes |
|---|---|---|---|---|
| `Systran/faster-whisper-tiny` | 39M | ~350 MB | 1-2s | Fast, lower accuracy |
| `Systran/faster-whisper-small` | 244M | ~850 MB | 3-4s | Sweet spot for CPU |
| `Systran/faster-whisper-medium` | 769M | ~2.1 GB | 8-12s | More accurate, slow on CPU |
| `deepdml/faster-whisper-large-v3-turbo-ct2` | 809M | ~2.3 GB | <2s on GPU | Best with NVIDIA |
The latency numbers are from my own homelab (AMD Ryzen 9 7940HS, CPU-only). Apple Silicon is in the same ballpark: fast enough for small to feel instant, slow enough that medium will make you wait.
Two rules when switching models:
- Also change `DEFAULT_MODEL` on the gateway to match one of: `tiny`, `small`, `medium`, `large-v3-turbo`.
- Rename the service to the one the gateway expects: `whisper-tiny`, `whisper-small`, `whisper-medium`, or `whisper-large-turbo`. The gateway looks up its backend by service hostname.
Skip either and the gateway will give you a 404 when the app asks for a model.
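For example, moving up to `medium` touches both services at once. A sketch of just the parts of the compose file that change - everything not shown (volume, ports, restart policy) stays as it was:

```yaml
services:
  whisper-medium:                  # service renamed to what the gateway expects
    image: fedirz/faster-whisper-server:latest-cpu
    container_name: diction-whisper-medium
    restart: unless-stopped
    volumes:
      - whisper-models:/root/.cache/huggingface
    environment:
      WHISPER__MODEL: Systran/faster-whisper-medium
      WHISPER__INFERENCE_DEVICE: cpu

  gateway:
    # ...unchanged except these two:
    depends_on:
      - whisper-medium
    environment:
      DEFAULT_MODEL: medium        # must match the renamed service
```

Run `docker compose up -d` afterwards and the new weights download into the same `whisper-models` volume.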
One caveat for Mac mini / Apple Silicon users
Docker on macOS runs everything inside a Linux VM. That VM can't reach Apple's GPU or Neural Engine. Containers are CPU-only regardless of how nice your M4's GPU is. Sounds bad on paper, but for dictation workloads you won't feel it: the small voice model handles a short sentence well under five seconds on an M-series CPU. Longer dictations scale linearly. If you want GPU speed, either (a) run a Linux box with an NVIDIA card and keep the Mac as a client, or (b) use Diction's on-device mode on the iPhone itself (Core ML on the Neural Engine).
Step 4: Start Everything
Make sure you're in the project folder, then:
docker compose up -d
The -d flag runs the containers in the background (detached mode).
On the first run this takes a minute or two. Docker pulls two images from their registries:
- `fedirz/faster-whisper-server:latest-cpu` - about 1.7 GB, includes the Python runtime and CTranslate2 binaries
- `ghcr.io/omachala/diction-gateway:latest` - about 210 MB, a compiled Go binary plus ffmpeg for audio conversion
After the pulls finish, the voice model container does one more thing on first boot: it downloads the model weights from HuggingFace into the whisper-models volume (~500 MB for small). Subsequent restarts skip this step - the volume is persistent. That's why there's a volumes: block in the compose file.
Check everything is healthy
docker compose ps
You should see both services:
NAME STATUS
diction-gateway Up 30 seconds
diction-whisper-small Up 30 seconds (health: starting)
health: starting on the whisper container is normal for the first couple of minutes. It's loading the model into RAM. Once that's done, the status will flip to Up (healthy) or just Up.
Watching logs
If something looks wrong, look at the logs:
docker compose logs -f
-f follows them in real time. Ctrl+C to detach.
You can also tail a single service:
docker compose logs -f gateway
docker compose logs -f whisper-small
What healthy logs look like (abbreviated):
Gateway:
{"level":"info","msg":"gateway starting","port":"8080"}
{"level":"info","msg":"backend registered","name":"small","url":"http://whisper-small:8000"}
Whisper:
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
Common early errors:
- `pull access denied` on the gateway image. A stale GitHub Container Registry token is cached in your Docker config (on macOS, usually in the login keychain from a past `docker login`). Run `docker logout ghcr.io` - yes, even if you don't think you're logged in - and try again.
- `exec format error` on Apple Silicon. Rosetta isn't enabled. Go back to Docker Desktop → Settings → General and flip the Rosetta option on.
- The voice model container stuck on `health: starting` for more than 3 minutes. Usually means it's still downloading weights on a slow connection. Check `docker compose logs -f whisper-small` to see the download progress.
Stopping and restarting
docker compose stop # stop containers, keep their state
docker compose start # start them again
docker compose down # stop and remove containers (volumes survive)
docker compose down -v # stop, remove containers AND volumes (re-downloads weights)
docker compose pull # get newer images
docker compose up -d # apply pulls / config changes
The model cache in the whisper-models volume is shared across rebuilds, so docker compose pull && docker compose up -d to upgrade is a ~30-second operation.
Step 5: Test It
Before you go anywhere near the iPhone, prove the server itself works. A broken stack is easier to debug from a terminal than from a keyboard extension.
Get an audio file
The quickest path: use your phone's built-in Voice Memos app. Record yourself saying "Hello from my home server." Hit stop. Share → Save to Files, or AirDrop to your Mac, or email it to yourself. You want the .m4a file on the same machine that's running the containers.
On Linux without a phone handy, record with arecord or sox:
# 5 seconds of 16-bit mono WAV at 16 kHz - whisper's native format
arecord -f S16_LE -r 16000 -c 1 -d 5 voice-memo.wav
On macOS, skip recording altogether and let the system generate a clip with say:
say -o voice-memo.aiff "Hello from my home server"
That gives you an .aiff the gateway accepts directly. Handy for scripted testing where you don't feel like holding a microphone.
No microphone and no speech synth? Grab any short speech clip you have lying around. MP3, WAV, M4A, AIFF, FLAC, Ogg - they all work. The voice model handles re-encoding internally.
Hit the gateway
curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F "file=@voice-memo.m4a" \
-F "model=small"
You'll get back something like:
{"text":"Hello from my home server."}
That's the whole speech pipeline. Running on your hardware. Your audio never left the box.
Ask for different response formats
The same endpoint supports response_format=text if you'd rather have a plain string (useful if you're piping it into a shell):
curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F "file=@voice-memo.m4a" \
-F "model=small" \
-F "response_format=text"
# → Hello from my home server.
Check the response headers
The gateway adds timing info to the response headers - useful for benchmarking without reading logs:
curl -sS -D - -o /dev/null -X POST \
http://localhost:8080/v1/audio/transcriptions \
-F "file=@voice-memo.m4a" -F "model=small"
Look for:
- `X-Diction-Whisper-Ms` - how many milliseconds the speech model took
- `X-Diction-LLM-Ms` - appears only if you've enabled the cleanup step in Step 7
Talk to it from Python
Since the gateway speaks the OpenAI transcription API, the official openai Python SDK works against it directly. Useful if you want to script transcriptions from a laptop:
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.1.42:8080/v1",
    api_key="anything",  # the gateway doesn't check this by default
)

with open("voice-memo.m4a", "rb") as f:
    result = client.audio.transcriptions.create(
        file=f,
        model="small",
        response_format="text",
    )

print(result)
Same story with the Node SDK, LangChain, or any other tool that expects OpenAI's speech API. Diction becomes a drop-in local replacement for api.openai.com/v1/audio/transcriptions.
If the test fails
- Connection refused. The gateway container isn't running. `docker compose ps` to confirm.
- 504 Gateway Timeout. The whisper container is still starting (model loading into RAM). Give it another 60 seconds.
- 400 Bad Request: "invalid audio file". Your file is corrupted or in a format whisper doesn't understand. Try a freshly recorded clip.
- 404 Not Found. You probably have a typo in the URL. The path is exactly `/v1/audio/transcriptions` - plural, with the `/v1/` prefix.
- Empty response / hang. The voice model container crashed out of memory mid-transcription. Check `docker compose logs whisper-small`. `small` should be fine on any machine with 2 GB of free RAM; if you upgraded to `medium` and the host doesn't have 3 GB free, it'll OOM.
Step 6: Find Your Server's LAN IP
Your iPhone needs an address to reach this. Your server probably has two kinds: a public IP (facing the internet, you don't want to use that) and a private LAN IP (on your home WiFi, that's the one).
macOS:
ipconfig getifaddr en0
en0 is usually Wi-Fi on laptops and the built-in Ethernet on desktops. If it prints nothing (you're wired via a USB-C dongle, or on a Mac mini with Wi-Fi off), the right interface is somewhere else - try en1, en4, en5. Quickest catch-all:
ifconfig | grep 'inet ' | grep -v 127.0.0.1
Pick the 192.168.x.x or 10.x.x.x address. Ignore anything starting with 100. - that's Tailscale, not your LAN.
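If you're staring at several candidate addresses and aren't sure which is which, the classification rules above are mechanical: the RFC 1918 private ranges are your LAN, and 100.64.0.0/10 (CGNAT space) is what Tailscale hands out. A small sketch with Python's standard library - the addresses are just examples:

```python
import ipaddress

# Tailscale assigns addresses from the CGNAT range 100.64.0.0/10.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")

def classify(addr: str) -> str:
    ip = ipaddress.ip_address(addr)
    if ip in TAILSCALE_RANGE:
        return "tailscale"
    if ip.is_loopback:
        return "loopback"
    if ip.is_private:       # covers 192.168.0.0/16, 10.0.0.0/8, 172.16.0.0/12
        return "lan"
    return "public"

for a in ["192.168.1.42", "10.0.0.7", "100.64.1.42", "127.0.0.1", "8.8.8.8"]:
    print(a, "→", classify(a))
```

The one you want for the Diction app is the `lan` address (or, later, the `tailscale` one).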
Linux:
hostname -I | awk '{print $1}'
Or, if you want a specific interface:
ip -4 addr show wlan0 | grep inet
Windows:
ipconfig | findstr IPv4
You'll get something like 192.168.1.42. Write it down. This is what you'll paste into the Diction app in Step 8.
Pin it so it doesn't drift
Your router hands out IPs via DHCP, which means the one you just wrote down might change next time the server reboots (or when the lease expires). Two ways to keep it stable:
- DHCP reservation. Log into your router's admin page (usually `192.168.1.1`, `192.168.0.1`, or `10.0.0.1`). Find the DHCP client list, locate your server by hostname or MAC address, and click the "reserve" / "static" option. From then on, your router will always hand out that same IP to that machine.
- Static IP on the machine. On Linux, edit `/etc/netplan/` or use your distro's network manager. On macOS, System Settings → Network → Wi-Fi → Details → TCP/IP → Configure IPv4 → Using DHCP with manual address. More work, more fragile. The router method is better.
If you'd rather not deal with IPs at all and your setup is more portable (laptop moving between networks, for example), skip ahead to the "Reach It From Anywhere" section. Tailscale gives every machine a stable private address that follows it around.
Step 7: Add AI Cleanup (Optional but Nice)
Skip this step and your dictation still works. You'll get raw transcription, which is usually 95% right. The remaining 5% is filler words ("um", "like"), missing commas, misheard homophones ("their" vs "there"), and sometimes a full sentence with no punctuation. AI cleanup fixes all of that before your agent ever sees it.
What it does
You say:
so um basically the meeting went well and uh they agreed to the timeline
The gateway hands that to the LLM, which returns:
The meeting went well. They agreed to the timeline.
That's the whole feature. Any OpenAI-compatible LLM works - OpenAI's own models, Groq, Anthropic (via a compatibility proxy), Together, Fireworks, a local Ollama install, anything that speaks POST /chat/completions.
The flow
iPhone → gateway → voice model → raw transcript
                                      ↓
                        your LLM (chat/completions)
                                      ↓
                      cleaned text → back to the iPhone
The iPhone sends ?enhance=true on the request when the app's AI Companion toggle is on. The gateway hits {LLM_BASE_URL}/chat/completions with your system prompt + the transcript. Whatever comes back gets sent to the iPhone instead of the raw transcript. If the LLM errors out or times out, the gateway falls back to raw - your dictation doesn't break because of a downstream hiccup.
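The fallback behavior described above is worth internalizing, because it's why cleanup is safe to leave on. Here's a sketch of the flow - the payload is the standard OpenAI `chat/completions` shape, but the function names are illustrative, not the gateway's actual Go internals:

```python
# Sketch of the cleanup flow: build a chat/completions payload, and fall
# back to the raw transcript on any LLM error. Names here are made up
# for illustration; they are not the real gateway source.

def build_cleanup_payload(system_prompt: str, transcript: str, model: str) -> dict:
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcript},
        ],
    }

def enhance(transcript: str, call_llm) -> str:
    """Return the cleaned transcript, or the raw one if the LLM fails."""
    try:
        cleaned = call_llm(build_cleanup_payload(
            "Clean up this voice transcription.", transcript, "gpt-4o-mini"))
        return cleaned or transcript
    except Exception:
        return transcript  # a downstream hiccup must not break dictation

# A dead LLM leaves dictation intact:
def broken_llm(payload):
    raise TimeoutError("llm unreachable")

print(enhance("so um the meeting went well", broken_llm))
# → so um the meeting went well
```

Worst case with cleanup enabled, you get exactly what you'd have gotten with it disabled.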
Config reference
Four environment variables on the gateway:
| Variable | Required | What it is |
|---|---|---|
| `LLM_BASE_URL` | yes | OpenAI-compatible endpoint, e.g. `https://api.openai.com/v1` |
| `LLM_MODEL` | yes | Model identifier, e.g. `gpt-4o-mini` |
| `LLM_API_KEY` | no | Bearer token (your provider's API key). Not needed for local Ollama. |
| `LLM_PROMPT` | no | System prompt. Literal string, or a file path starting with `/` if you want a longer one mounted as a volume. |
Both LLM_BASE_URL and LLM_MODEL must be set for cleanup to turn on. Miss either one and the feature silently stays off.
Option A: OpenAI (or any OpenAI-compatible provider)
Easiest first step. Get a key at platform.openai.com/api-keys and add $5 of credit. For cleanup that's hundreds of hours of dictation.
Create ~/diction/.env:
echo "OPENAI_API_KEY=sk-your-key-here" > ~/diction/.env
Update the gateway service in docker-compose.yml:
gateway:
  image: ghcr.io/omachala/diction-gateway:latest
  platform: linux/amd64
  container_name: diction-gateway
  restart: unless-stopped
  ports:
    - "8080:8080"
  depends_on:
    - whisper-small
  environment:
    DEFAULT_MODEL: small
    LLM_BASE_URL: "https://api.openai.com/v1"
    LLM_API_KEY: "${OPENAI_API_KEY}"
    LLM_MODEL: "gpt-4o-mini"
    LLM_PROMPT: "Clean up this voice transcription. Remove filler words (um, uh, like). Fix punctuation and capitalization. Return only the cleaned text, nothing else."
Docker Compose reads ${OPENAI_API_KEY} from the .env file in the same folder automatically. No extra flags needed.
Not tied to OpenAI. Every major LLM provider exposes the same OpenAI-compatible `/chat/completions` endpoint. Swap the three `LLM_*` values (base URL, API key, model) and you're done. A few that work out of the box:
- Anthropic - Claude models via the OpenAI SDK
- Groq - fastest inference on the market, generous free tier
- Together AI - broad open-model catalog
- Fireworks - tuned Llama and Mixtral hosting
- DeepInfra - pay-per-token open models
- OpenRouter - one key, hundreds of models from every provider
- Mistral - native OpenAI-compatible endpoint
Pick one, drop its LLM_BASE_URL and LLM_MODEL into the compose file, same shape.
Option B: Local with Ollama (zero cost, fully private)
If you've got enough RAM and want nothing leaving your house - not even the transcribed text - run the LLM locally.
Add a third service to your compose file:
ollama:
  image: ollama/ollama:latest
  container_name: diction-ollama
  restart: unless-stopped
  ports:
    - "11434:11434"
  volumes:
    - ollama-models:/root/.ollama
And update the gateway service:
gateway:
  image: ghcr.io/omachala/diction-gateway:latest
  platform: linux/amd64
  container_name: diction-gateway
  restart: unless-stopped
  ports:
    - "8080:8080"
  depends_on:
    - whisper-small
    - ollama
  environment:
    DEFAULT_MODEL: small
    LLM_BASE_URL: "http://ollama:11434/v1"
    LLM_MODEL: "gemma2:9b"
    LLM_PROMPT: "Clean up this voice transcription. Remove filler words. Fix punctuation and capitalization. Return only the cleaned text, nothing else."
Add the Ollama volume to the bottom of the file:
volumes:
  whisper-models:
  ollama-models:
Bring it up and pull a model:
docker compose up -d
docker exec diction-ollama ollama pull gemma2:9b
LLM_API_KEY isn't needed - Ollama doesn't check it.
Which Ollama model?
Sizes below are memory footprint - system RAM if you run Ollama on CPU, VRAM if you pass a GPU through to the container. Either way the number is the same.
| Model | Params | Memory | Notes |
|---|---|---|---|
| `gemma2:9b` | 9B | ~6 GB | Best editing quality at this size. My pick. |
| `qwen2.5:7b` | 7B | ~5 GB | Strong at following cleanup instructions. |
| `llama3.1:8b` | 8B | ~5 GB | Most popular, well-tested. |
| `gemma3:4b` | 4B | ~3 GB | For tighter machines. Still OK for basic cleanup. |
Under 7B tends to fail in a specific, annoying way: the model treats your transcript as a question and tries to answer it, instead of cleaning it up. Stick to 7B+ if you can spare the memory.
If you have an NVIDIA GPU, pass it through to the Ollama container (same reservation block as the voice model GPU example further down) and you'll get 5-10x faster cleanup.
Apply the changes
Once your compose file has the LLM_* variables set, restart the gateway so it picks them up:
docker compose up -d
Docker Compose detects the env change and recreates only the gateway container. The voice model container (and its loaded model) keeps running.
Test the cleanup
Same voice memo as before, with ?enhance=true appended:
curl -X POST "http://localhost:8080/v1/audio/transcriptions?enhance=true" \
-F "file=@voice-memo.m4a" \
-F "model=small"
Without ?enhance=true you get the raw transcription. With it, the gateway sends the transcript through the LLM before returning. Quickest sanity check: record yourself saying some filler words ("um, this is uh a test like") and watch them disappear.
To confirm the LLM is actually running (and wasn't silently disabled because of a missing env var), check the response headers for X-Diction-LLM-Ms:
curl -sS -D - -o /dev/null -X POST \
"http://localhost:8080/v1/audio/transcriptions?enhance=true" \
-F "file=@voice-memo.m4a" -F "model=small" | grep -i diction
You should see both X-Diction-Whisper-Ms and X-Diction-LLM-Ms in the output.
Dialing in the prompt
The default prompt above is fine for generic cleanup. Adjust it to your taste. Some real prompts I've tried:
Conservative cleaner (preserves your voice, just fixes obvious errors):
Clean up this voice transcription. Fix punctuation and obvious typos only.
Do not rephrase or change word choice. Return only the cleaned text.
Email-ready rewriter (turns rambling into something you could actually send):
Rewrite this voice note as a short professional email. Keep the meaning intact.
Return only the rewritten text, no greeting or sign-off.
Bullet-pointer (for dumping meeting notes):
Convert this voice note into a bulleted list of the key points.
One bullet per idea. Return only the list.
Translator (I dictate in English, send in German):
Translate this English voice note into natural German. Return only the translation.
Long prompts via a file
If your prompt is more than a one-liner, mount it as a file. Create ~/diction/cleanup-prompt.txt:
You are a transcript cleaner.
Rules:
- Remove filler words (um, uh, er, like, you know).
- Fix grammar and punctuation.
- Preserve the speaker's voice and meaning.
- Common speech-to-text errors: "there / their / they're", "affect / effect".
- Do not add a preamble.
- Return only the cleaned text.
Mount it into the container and point LLM_PROMPT at the file path:
gateway:
  image: ghcr.io/omachala/diction-gateway:latest
  # ... rest of config
  volumes:
    - ./cleanup-prompt.txt:/config/cleanup-prompt.txt:ro
  environment:
    LLM_BASE_URL: "https://api.openai.com/v1"
    LLM_API_KEY: "${OPENAI_API_KEY}"
    LLM_MODEL: "gpt-4o-mini"
    LLM_PROMPT: "/config/cleanup-prompt.txt"
If LLM_PROMPT starts with /, the gateway reads it as a file path. Otherwise it uses the string directly.
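The file-vs-string rule is simple enough to mirror in your own tooling. A sketch of the rule as stated - `resolve_prompt` is an illustrative name, not the gateway's real function:

```python
# LLM_PROMPT resolution as described above: a value starting with "/" is
# treated as a file path to read; anything else is the prompt itself.
import tempfile
from pathlib import Path

def resolve_prompt(value: str) -> str:
    if value.startswith("/"):
        return Path(value).read_text()
    return value

# Inline string: used verbatim.
print(resolve_prompt("Clean up this transcript."))
# → Clean up this transcript.

# File path: contents are read instead.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("You are a transcript cleaner.")
print(resolve_prompt(f.name))
# → You are a transcript cleaner.
```

One consequence: an inline prompt can never start with `/`, so keep literal prompts as plain sentences.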
Why gpt-4o-mini or a 7B local model instead of something bigger
Cleanup is a simple task. The LLM only needs to polish, not reason. A frontier-tier model is overkill and slower. gpt-4o-mini (cloud) or gemma2:9b (local) hit the sweet spot for this workload. Save the expensive models for your actual conversations with the agent downstream.
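If you're curious where the "roughly a cent per hour" figure from the top of the article comes from, the arithmetic is short. The per-token prices below are assumed gpt-4o-mini list prices (USD per million tokens) at the time of writing - substitute your provider's current numbers:

```python
# Back-of-envelope cost of LLM cleanup for one hour of dictation.
# Prices are ASSUMED gpt-4o-mini list prices per million tokens;
# check your provider's pricing page before trusting the result.
PRICE_IN_PER_M = 0.15    # input tokens
PRICE_OUT_PER_M = 0.60   # output tokens

WORDS_PER_MINUTE = 150   # typical speaking pace
TOKENS_PER_WORD = 1.3    # rough English average

tokens_per_hour = 60 * WORDS_PER_MINUTE * TOKENS_PER_WORD  # ~11,700
# Cleanup roughly echoes the transcript back, so output ≈ input.
cost = tokens_per_hour * (PRICE_IN_PER_M + PRICE_OUT_PER_M) / 1_000_000

print(f"~${cost:.3f} per hour of dictation")
# → ~$0.009 per hour of dictation
```

Under a cent an hour of continuous talking. The speech model itself, of course, costs nothing either way.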
Step 8: Install Diction and Point It at Your Server
Server's ready. Time to put the keyboard in front of it.
Install the app
On your iPhone, open the App Store and install Diction. It's free to download, and the modes you need for self-hosting (the entire point of this article) are free forever.
First run
Open the app. It walks you through three things:
- Add the keyboard. iOS requires you to manually add any third-party keyboard. The app sends you to Settings → General → Keyboard → Keyboards → Add New Keyboard → Diction. Tap "Diction", then go back.
- Allow Full Access. Back in Keyboards, tap "Diction" in the list and flip "Allow Full Access" on. iOS will show a scary-sounding warning. It's required for any keyboard that makes network requests, which Diction has to do (it sends audio to your server). Diction has no QWERTY input, no text logging, and no analytics - there's nothing to capture even if it wanted to. Only the mic audio leaves the phone, and only to the endpoint you configure below. The source for the gateway is on GitHub, so you can audit exactly what the server does with the audio.
- Grant microphone access. Back in the app, it asks for mic permission. Yes.
Point it at your server
Inside the Diction app:
- Go to Settings (gear icon, top right).
- Tap Mode. Choose Self-Hosted.
- Tap Endpoint. Enter `http://192.168.1.42:8080` (substituting your server's IP from Step 6).
- Scroll down. If you configured AI cleanup in Step 7, toggle AI Companion on.
- Tap Test connection. You should see a green check within a second or two. If not, see the troubleshooting below.
Take it for a spin
Open any app that accepts text - Telegram, Messages, Notes, Mail, the Safari address bar, whatever. Tap to bring up the keyboard. Long-press the globe icon (bottom-left of the default keyboard) to switch keyboards. Pick Diction.
You'll see one big mic button. Tap it, talk, release. The audio streams to your server. The transcription arrives back in about as much time as it takes for you to take your finger off the button.
On a local network, end-to-end latency for a short sentence is typically under a second. Good enough that you stop thinking about it.
If it doesn't connect
- Server not running? `docker compose ps` on the server.
- iPhone not on the same WiFi as the server.
- IP address typo - re-check what Step 6 returned.
- Firewall blocking port 8080. On Linux with `ufw`: `sudo ufw allow from 192.168.0.0/16 to any port 8080`. On macOS, System Settings → Network → Firewall. Docker Desktop adds itself to the allow list on install, so inbound on published ports normally works - but if you've previously clicked "Deny" on a firewall prompt for Docker, that choice sticks. Flip it back under "Options…", or temporarily turn the firewall off to confirm that's the cause.
Quickest sanity check: open Safari on the iPhone and try http://192.168.1.42:8080/health. If the browser can't reach it, the app can't either.
Now dictate into your agent
Open Telegram. Tap your agent's chat. Tap the globe to switch to the Diction keyboard. Tap the mic. Talk. Release. Your server transcribes, the LLM cleans it up, and the message lands in the composer ready to send. Hit send. Your agent replies. Loop.
That's the whole point of the exercise.
Reach It From Anywhere (Not Just Home WiFi)
Right now your dictation only works on your home network. The moment you walk out the door, the iPhone can't reach 192.168.1.42 anymore. Three clean ways to fix this.
Tailscale (my pick)
Tailscale builds a private mesh network between your devices over WireGuard. Install it on the server and on the iPhone, sign in to the same account on both, and your phone gets a stable 100.x.x.x address it can use to reach the server from anywhere - cellular, coffee shop WiFi, a plane with WiFi, wherever.
Server side (Linux):
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up
On macOS, download the app and run it.
iPhone side: install the Tailscale app from the App Store, sign in.
On the server, grab the tailnet IP:
tailscale ip -4
# → 100.64.1.42
Back in the Diction app, change the Endpoint from http://192.168.1.42:8080 to http://100.64.1.42:8080. Your dictation now works wherever you've got signal. Free for personal use (up to 100 devices).
Cloudflare Tunnel (public URL, no port forwarding)
If you'd rather have a pretty URL and don't want to install anything on the phone, Cloudflare Tunnel gives you an outbound tunnel from your server to Cloudflare's edge. No router config, no exposed ports.
Add this service to your compose file:
cloudflared:
  image: cloudflare/cloudflared:latest
  container_name: diction-cloudflared
  restart: unless-stopped
  command: tunnel --no-autoupdate run
  environment:
    TUNNEL_TOKEN: "${CLOUDFLARE_TUNNEL_TOKEN}"
Create the tunnel in the Cloudflare Zero Trust dashboard, grab the token, paste it into your .env, set the public hostname to route to http://gateway:8080. Done. Dictate over https://dictation.yourdomain.com.
Free tier. Works great. Only caveat: your transcriptions pass through Cloudflare's network on the way. That's not plaintext (HTTPS all the way), but if "no third party in the path" is the whole reason you set this up, stick to Tailscale.
ngrok (testing / temporary)
For quick testing, ngrok gives you a public URL in one command:
ngrok http 8080
It prints a https://xxx.ngrok-free.app URL. Paste that into the Diction app. Good for a demo or a five-minute test. Free tier URLs change every restart, which is annoying for permanent use. Also adds latency because your audio makes a round trip through ngrok's edge.
Which one?
- Personal use, only you reach it: Tailscale. Fast, private, no external hostnames.
- Family / small team reaches the same server: Cloudflare Tunnel. Pretty URL, TLS, one password.
- Just testing: ngrok.
Already Have a Voice Model Server?
If you've already got a voice model server running somewhere - a self-hosted faster-whisper-server, a colleague's LocalAI instance, your employer's internal speech API - keep it. You don't need the voice model container from Step 3.
What you still need is the Diction Gateway. The iPhone app talks to it for WebSocket streaming and the end-to-end encryption handshake - neither of which a plain OpenAI-compatible transcription server exposes. Point the gateway at your existing server with CUSTOM_BACKEND_URL:
services:
  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    platform: linux/amd64
    container_name: diction-gateway
    restart: unless-stopped
    ports:
      - "8080:8080"
    environment:
      CUSTOM_BACKEND_URL: http://your-existing-server:8000
      CUSTOM_BACKEND_MODEL: Systran/faster-whisper-small
      # Optional LLM cleanup (Step 7):
      LLM_BASE_URL: "https://api.openai.com/v1"
      LLM_API_KEY: "${OPENAI_API_KEY}"
      LLM_MODEL: "gpt-4o-mini"
      LLM_PROMPT: "Clean up this voice transcription..."
Two extra knobs the CUSTOM_BACKEND_* path supports if you need them:
- `CUSTOM_BACKEND_AUTH: "Bearer sk-whatever"`. Sent as the `Authorization` header to your backend. For instances you've put an auth proxy in front of, or anything hosted that requires a token.
- `CUSTOM_BACKEND_NEEDS_WAV: "true"`. Some backends (Canary, Parakeet) only accept WAV. The gateway transparently converts incoming audio with ffmpeg before forwarding.
Point the iPhone at the gateway (http://your-server:8080), leave your existing voice model server where it is, and get streaming plus LLM cleanup on top.
Swap the Speech Model
The starter compose file runs small. That's a choice, not a commitment. Swapping to a different voice model size is two lines in your compose file plus a docker compose up -d. The gateway has a short name for each model it knows how to route to:
| Short name (DEFAULT_MODEL) | Service hostname | Full model ID |
|---|---|---|
| tiny | whisper-tiny | Systran/faster-whisper-tiny |
| small | whisper-small | Systran/faster-whisper-small |
| medium | whisper-medium | Systran/faster-whisper-medium |
| large-v3-turbo | whisper-large-turbo | deepdml/faster-whisper-large-v3-turbo-ct2 |
| parakeet-v3 | parakeet | nvidia/parakeet-tdt-0.6b-v3 |
To swap from small to medium, rewrite your compose file so the whisper service is named whisper-medium, uses WHISPER__MODEL: Systran/faster-whisper-medium, and the gateway's DEFAULT_MODEL is medium.
If the service name doesn't match the short name the gateway expects, you'll see 404 model not found on every request. That's the #1 reason people get stuck when upgrading.
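For concreteness, here's roughly what the small-to-medium swap looks like as a full compose file. This is a sketch built from the table above - the `latest-cpu` image tag and the `whisper-models` volume name are assumptions on my part, so keep whatever your Step 3 file already uses for those:

```yaml
services:
  whisper-medium:
    image: fedirz/faster-whisper-server:latest-cpu   # keep your existing tag
    container_name: diction-whisper-medium
    restart: unless-stopped
    volumes:
      - whisper-models:/root/.cache/huggingface
    environment:
      WHISPER__MODEL: Systran/faster-whisper-medium

  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    platform: linux/amd64
    container_name: diction-gateway
    restart: unless-stopped
    ports:
      - "8080:8080"
    depends_on:
      - whisper-medium
    environment:
      DEFAULT_MODEL: medium

volumes:
  whisper-models:
```

Note that all three names changed together: service name, `WHISPER__MODEL`, and `DEFAULT_MODEL`. Change only one or two and you get the 404 described above.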
Running multiple models at once? Add more services (whisper-small + whisper-medium side by side) and the app can switch between them per-request by setting the model field in the request body. DEFAULT_MODEL only applies when the request doesn't specify one.
What This Actually Cost Me
- The machine: whatever you already have idling at home
- Electricity: the speech model at idle is effectively zero. Spikes briefly when you dictate.
- OpenAI: `gpt-4o-mini` is the cheap model. An hour of dictation costs roughly a cent. Five dollars of credit lasts months.
Got an NVIDIA GPU Sitting Idle?
If the box you're setting this up on has an NVIDIA card in it, you can skip the small model and run something that's genuinely state of the art. CPU-only is fine for dictation. GPU unlocks the models that the paid services are running - often faster than those services, because there's no network round trip.
Two options. Pick one.
| | Parakeet TDT 0.6B v3 | large-v3-turbo |
|---|---|---|
| Best at | Speed + accuracy on European languages | Multilingual breadth (99 languages) |
| WER (English) | ~6.3% | ~7.4% |
| Latency | Sub-second | Under 2s on consumer GPU |
| VRAM (INT8) | ~2 GB | ~2.3 GB |
| Languages | 25 European | 99 |
| Audio format | WAV only (gateway converts) | Anything (voice model handles it) |
Option A: Parakeet (fastest, 25 European languages)
NVIDIA's Parakeet TDT 0.6B v3. On a recent consumer GPU (think RTX 3060 or better) it transcribes a 5-second clip in well under a second. Accuracy on clean English audio beats the large-v3 voice model on most benchmarks, at a fraction of the size and latency.
Supported languages: English, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Russian, Ukrainian. If you dictate in any of these, Parakeet is the better engine.
If you dictate in Japanese, Mandarin, Arabic, Korean, or anything outside that list, use Option B.
Replace the whisper-small service in docker-compose.yml with this:
services:
  parakeet:
    image: ghcr.io/achetronic/parakeet:latest-int8
    container_name: diction-parakeet
    restart: unless-stopped
    ports:
      - "5092:5092"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    platform: linux/amd64
    container_name: diction-gateway
    restart: unless-stopped
    ports:
      - "8080:8080"
    depends_on:
      - parakeet
    environment:
      DEFAULT_MODEL: parakeet-v3
The gateway already knows how to speak to a service named parakeet on port 5092. No extra wiring needed. Test it exactly the same way as before.
You'll need the NVIDIA Container Toolkit installed on the host so Docker can pass the GPU through. One-line install if you haven't done it yet.
Option B: large-v3-turbo voice model (multilingual, frontier-tier)
The biggest model in this family, GPU-accelerated. This is what the paid cloud transcription services charge real money for. Runs great on any GPU with 6GB+ of VRAM.
services:
  whisper-large:
    image: fedirz/faster-whisper-server:latest-cuda
    container_name: diction-whisper-large
    restart: unless-stopped
    volumes:
      - whisper-models:/root/.cache/huggingface
    environment:
      WHISPER__MODEL: Systran/faster-whisper-large-v3-turbo
      WHISPER__INFERENCE_DEVICE: cuda
      WHISPER__COMPUTE_TYPE: float16
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    platform: linux/amd64
    container_name: diction-gateway
    restart: unless-stopped
    ports:
      - "8080:8080"
    depends_on:
      - whisper-large
    environment:
      DEFAULT_MODEL: large-v3-turbo

volumes:
  whisper-models:
First boot pulls about 1.6GB of model weights. After that it's warm and fast.
What About NVIDIA Canary 1B?
If you've been reading up on speech models recently, you've probably seen Canary 1B at the top of the accuracy benchmarks. Yes, it's better than both options above on paper. The catch: NVIDIA ships it through NeMo, not as a turnkey OpenAI-compatible container. Getting it wrapped in the API the gateway expects is real work. You'll end up writing a small serving layer yourself. I run one of those internally for the Diction cloud, but I'm not going to pretend you can copy-paste a compose block for it. If you're willing to build that wrapper, point the gateway at it via CUSTOM_BACKEND_URL (see the next section) and you're set.
For everyone else: Parakeet or large-v3-turbo is already better than what most cloud services give you.
The OpenAI-Compatible API You Just Installed
The gateway speaks the OpenAI audio transcription API. That means anything that knows how to talk to api.openai.com/v1/audio/transcriptions also knows how to talk to your server. The iPhone keyboard is just one client of this API; you can also point laptops, scripts, or other services at the same URL.
Quick Python example using the official OpenAI SDK:
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.1.42:8080/v1",
    api_key="anything",  # not checked by default
)

with open("meeting.m4a", "rb") as f:
    text = client.audio.transcriptions.create(
        file=f,
        model="small",
        response_format="text",
    )
print(text)
Same thing works for the Node SDK, LangChain, Flowise, n8n, anything. Treat it as a local stand-in for OpenAI's hosted API.
What's supported
- `POST /v1/audio/transcriptions` with `file`, `model`, `language`, `prompt`, `response_format=json|text`
- `GET /v1/models` - lists the speech engines and models the gateway can route to. Response shape is Diction's own (`{"providers": [{"id": "whisper", "models": [...]}, ...]}`), not OpenAI's flat `data` array, so OpenAI SDK `.models.list()` calls won't parse it cleanly. Hit it directly with `curl` if you want to see what's available.
- Multiple short-name aliases: `small`, `medium`, `large-v3-turbo`, `parakeet-v3`
- HuggingFace-style IDs: `Systran/faster-whisper-small`, `nvidia/parakeet-tdt-0.6b-v3`, etc.
What's not supported
- Text-to-speech (`/v1/audio/speech`). This is transcription only.
- `response_format=verbose_json | srt | vtt`. No word-level timestamps.
- Server-Sent Events streaming on the REST endpoint. Use the WebSocket `/v1/audio/stream` for streaming.
- OpenAI's Realtime API (`/v1/realtime`).
Authentication
By default the gateway has AUTH_ENABLED=false. Pass any non-empty string as the API key - nothing's checked. If you want to lock it down (e.g. exposing via Cloudflare Tunnel), set AUTH_ENABLED=true and configure the token in your gateway env. The server/docker-compose.yml in the public repo has a more elaborate example if you want to see it.
Caveat: error response shape
Diction's gateway returns errors as {"error":"message"}, not OpenAI's nested {"error":{"message":"...","type":"..."}}. Most SDKs surface these as a raw HTTPError rather than a parsed APIError. Catch both if you're writing something defensive.
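If you're writing that defensive layer, one way to paper over the difference is to normalize both shapes before deciding what to raise. A minimal sketch:

```python
def error_message(body: dict):
    """Extract a human-readable error from either error response shape.

    Handles Diction's flat {"error": "message"} and OpenAI's nested
    {"error": {"message": "...", "type": "..."}}. Returns None if the
    body doesn't look like an error at all.
    """
    err = body.get("error")
    if isinstance(err, str):
        return err  # Diction gateway shape
    if isinstance(err, dict):
        return err.get("message")  # OpenAI shape
    return None
```

Run it on any non-2xx JSON body before handing the response to an SDK-level parser.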
Privacy: What Actually Happens to Your Audio
The whole reason most people set this up is so that no random SaaS ever processes their voice. Worth being precise about what this stack does and doesn't do:
What leaves your iPhone: raw audio, encoded as Opus (over WebSocket stream) or WAV (over REST), heading to the server endpoint you configured.
In transit: HTTP by default. Plain text audio over your LAN. That's fine on a trusted home network. If you expose the gateway over the internet (Cloudflare Tunnel, ngrok, your own reverse proxy), put TLS in front of it. Tailscale wraps everything in WireGuard so you don't need to think about TLS at all - that's part of why I prefer it.
What your server does with the audio: feeds it to the voice model container. The voice model transcribes. Returns text. Audio gets thrown away - neither the gateway nor faster-whisper-server persists audio anywhere. docker compose logs contains request metadata (latency, model used, text length) but not the audio or the transcript. You can verify yourself: docker exec diction-whisper-small ls -la /tmp is essentially empty between requests.
If cleanup is enabled: the transcript (plain text, no audio) gets sent to your configured LLM endpoint. That's the only point where data leaves your server. If you pick a local Ollama, nothing leaves the house at all. If you pick OpenAI/Groq/whatever, the transcript passes through their infrastructure. Their data policies apply to that leg - read them if it matters.
What the Diction app does with your audio: nothing. The keyboard's only job is to stream to your endpoint and insert the response. No analytics, no tracking, no background uploads. The app has no QWERTY input, so there's literally nothing to log even if it wanted to. Source for the server-side code is on GitHub (the iOS app itself isn't open source, but the data flow on the wire is straightforward: one POST per dictation, to the endpoint you configured).
Full Access permission: iOS requires this for any keyboard that touches the network. It's a coarse switch that also grants things like pasteboard access. Diction uses the network part and nothing else - again, no typed input, no pasteboard monitoring. If you'd rather not trust that claim, run the setup from this article and point Wireshark at the gateway's port. You'll see exactly one connection per dictation, to your endpoint.
One Small Thing About "AI Companion"
If you dig around the Diction app's settings you'll find an "AI Companion" toggle with its own prompt field. Worth knowing how that interacts with what you just built.
The toggle is what tells the app to ask for cleanup (?enhance=true in the request). It's the on/off switch. But the actual prompt the LLM sees is whatever you put in LLM_PROMPT in your compose file. The in-app prompt field is used by the hosted Diction Cloud setup. On your own server, your env var wins. Every time.
So: flip AI Companion on in the app if you want cleanup to run. Tune the prompt by editing docker-compose.yml and running docker compose up -d again. Nothing else to configure.
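If you're scripting against the gateway rather than using the keyboard, the same switch is just a query parameter on the request. The helper below (hypothetical name, stdlib only, and assuming the REST endpoint accepts `enhance=true` the same way the app sends it) appends the flag without clobbering an existing query string:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_enhance(url: str, on: bool = True) -> str:
    """Append enhance=true (the AI Companion switch) to a request URL."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))  # keep any existing parameters
    query["enhance"] = "true" if on else "false"
    return urlunsplit(parts._replace(query=urlencode(query)))


# with_enhance("http://192.168.1.42:8080/v1/audio/transcriptions")
# → "http://192.168.1.42:8080/v1/audio/transcriptions?enhance=true"
```

With the flag off (or absent), the gateway returns the raw transcript and skips the LLM leg entirely.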
It's Open Source. Go Wild.
The gateway is on GitHub at omachala/diction under an open-source license. If there's a behavior you want that it doesn't have, fork it. If you hit a bug or add something other people would benefit from, I'd love a pull request. The codebase is small and deliberately boring Go. You don't need to be an expert to find your way around.
Some things I know people want and haven't built yet: per-app routing (different models for different apps), a richer context API, swappable post-processing pipelines. If any of those scratch your itch, the code's right there.
Heard of Speaches?
Speaches is the nearest neighbor - an OpenAI-compatible self-hosted speech server with transcription, TTS, and a realtime API. Good project for a general-purpose endpoint. It won't drive the Diction keyboard, though: the app opens a WebSocket at /v1/audio/stream and does an X25519 + AES-GCM handshake on every request, and Speaches streams transcription over SSE on the REST endpoint with no knowledge of that handshake. That's why I wrote Diction Gateway - the keyboard's protocol baked in, end-to-end encrypted transcripts by default, BYO LLM cleanup in a single env var, and a thin wrapper mode (CUSTOM_BACKEND_URL) so you can put it in front of any existing speech server. Even outside the keyboard use case, if you want a minimal OpenAI-compatible speech gateway with an LLM cleanup step wired in, reach for this one.
Where to Go Next
Some directions once the base setup is working:
- Ditch the cloud LLM for a local model. You already saw the Ollama option in Step 7. Uncomment it in your compose file, `ollama pull gemma2:9b`, done. Nothing leaves your house. I've got a full walkthrough of the Ollama side here.
- Move off home WiFi. Tailscale (Reach It From Anywhere section above) is the easy answer. Five minutes to set up, dictation works at the café.
- Upgrade the speech model. Start with `small`, move to `medium` once you notice misheard words, jump to `large-v3-turbo` if you've got a GPU. Model accuracy climbs noticeably between each tier.
- Dictate in another language. The voice model autodetects, so you don't have to do anything. If you're mostly in a European language and have a GPU, switch to Parakeet - it's meaningfully more accurate for those.
- Tune the cleanup prompt. The default prompt fixes filler words and punctuation. Try the email-ready rewriter, the bullet-pointer, or your own variant. See the prompt library in Step 7.
- Add a second gateway. Run one on your home server (high quality, slow connection over VPN) and one on a dev laptop (lower quality, instant local). Switch per-network.
- Plug the gateway into other things. It's an OpenAI-compatible speech endpoint. Any transcription workflow - meeting notes, voice memos pipeline, automatic subtitling - can point at it instead of OpenAI.
- Contribute. If you build something useful on top of this, PR it to omachala/diction. Better prompts, better docs, new backends, whatever.
The keyboard is in the App Store. You can self-host, use the Diction Cloud, or both. The app lets you switch per-app - self-host your Telegram dictation, use the cloud when you're offline from your tailnet, on-device only mode for the really sensitive stuff. Mix and match.
Closing the Thread
What I like about this setup: I can talk to OpenClaw and the rest of my agents without worrying about who else is listening on the way in. The keyboard's as fast as the built-in one. Short dictations land in under a second. The only thing I pay is whatever my cleanup LLM costs - pennies on OpenAI, zero on local Ollama. The rest stays on my hardware.
The project is still quite new, but the feedback from people using it daily has been genuinely amazing. I'm adding features almost every week and making the whole thing more rock solid with each release. If there's something missing for your workflow, say so - good chance it's on its way or can be.
If you found this useful, a GitHub star on omachala/diction would be a lovely token of appreciation - it's the easiest way to tell me this stuff is worth building more of. Try the app, tell someone else who'd find it useful, and if you hit something that's broken or confusing in this walkthrough, ping me. I'll fix it.
Happy dictating.