If you've been following OpenClaw — the open-source AI gateway that routes to any LLM provider — you've probably wondered: what can I actually build on top of it?
OpenVoiceUI is the first voice UI built on OpenClaw. It gives OpenClaw a face, a voice, and a visual workspace. Talk to any LLM through your browser, hear responses spoken back, and watch the AI build live web pages during the conversation.
This tutorial gets you from zero to a running voice assistant in about 5 minutes.
What is OpenClaw + OpenVoiceUI?
OpenClaw is the gateway layer. It handles:
- Routing to any LLM provider (Anthropic, OpenAI, Groq, Z.AI, local models)
- Session management and context windowing
- Tool use and agent orchestration
- Auth profile management (swap API keys, add providers on the fly)
OpenVoiceUI is the interface layer built on top of OpenClaw. It adds:
- Voice I/O — browser-based speech-to-text and text-to-speech
- Live web canvas — the AI generates full HTML pages during conversation (dashboards, reports, tools)
- Desktop OS interface — windows, folders, right-click menus, wallpaper
- AI music generation via Suno integration
- AI image generation with FLUX.1 and Stable Diffusion 3.5
- Voice cloning via Qwen3-TTS
- Agent profiles — multiple AI personas, hot-swappable
- Built-in music player with crossfade and AI ducking
Together: OpenClaw handles the intelligence, OpenVoiceUI handles everything the user sees and hears.
Prerequisites
- Docker and Docker Compose
- Node.js 18+
- At least one LLM API key (Groq has a free tier — easiest way to start)
That's it. No Python setup, no manual dependency management — Docker handles the stack.
Installation
One command:
```bash
npx openvoiceui setup
```
The setup wizard walks you through:
- Entering your API keys (Groq's free tier is the easiest starting point) and picking your LLM provider
- Generating your `.env`, `openclaw.json`, and auth profiles
- Building the Docker images
Then start everything:
```bash
npx openvoiceui start
```
Open http://localhost:5001 in Chrome or Edge. That's your voice assistant.
Behind the scenes, Docker Compose launches three services:
| Service | What it does |
|---|---|
| openclaw | OpenClaw gateway on port 18791 — manages LLM sessions, tool use, agent routing |
| supertonic | Local TTS engine (free, no API key needed) — ONNX-based speech synthesis |
| openvoiceui | Flask server on port 5001 — serves the UI, handles voice streaming, manages canvas pages |
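Once the stack is up, a quick way to sanity-check it is to confirm the exposed ports are listening. A minimal sketch using only the standard library (the port numbers come from the table above; the supertonic service's port isn't documented here, so it's omitted):

```python
import socket

# Ports of the services exposed by the Docker Compose stack described above.
SERVICES = {
    "openclaw": 18791,
    "openvoiceui": 5001,
}

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in SERVICES.items():
        status = "up" if port_open("127.0.0.1", port) else "down"
        print(f"{name:12} :{port}  {status}")
```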
How the Architecture Works
```
Browser (voice + canvas)
        |
        v
OpenVoiceUI (Flask, port 5001)
        |
        v  WebSocket
OpenClaw Gateway (port 18791)
        |
        v  API calls
LLM Provider (Anthropic / OpenAI / Groq / Z.AI / local)
```
The key architectural decision: complete separation between the UI and the intelligence.
OpenVoiceUI never talks to your LLM directly. Everything goes through OpenClaw. This means:
- Switch LLM providers by changing one config value in OpenClaw
- Add new providers without touching the UI code
- OpenClaw handles context pruning, compaction, and session management independently
- Tool use and agent orchestration happen at the gateway layer
Voice streaming uses WebSocket for low latency. The browser captures speech via Web Speech API (or Deepgram/Groq for server-side STT), sends it to Flask, which forwards to OpenClaw, which calls the LLM. The response streams back and gets spoken via TTS.
The Canvas System — Your AI Gets a Screen
This is the feature that makes OpenVoiceUI more than a chatbot.
During a voice conversation, ask the AI to build something visual:
"Build me a dashboard showing server metrics"
The AI generates a complete HTML page and renders it live in the browser inside a desktop-style window manager. The page persists on the server filesystem — it's not ephemeral.
This works because OpenClaw gives the AI access to tools. The canvas tool lets the AI create, update, and manage HTML pages. The AI writes the HTML, OpenVoiceUI saves it to disk and renders it in an iframe.
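The persistence half of that is simple: turn the page title into a safe filename and write the HTML to disk. A sketch, assuming a flat pages directory (the directory name and slug rules are mine, not OpenVoiceUI's actual layout):

```python
import re
from pathlib import Path

PAGES_DIR = Path("canvas_pages")  # illustrative location; the real app picks its own

def slugify(title: str) -> str:
    """Turn a page title into a safe filename slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "page"

def save_canvas_page(title: str, html: str) -> Path:
    """Persist AI-generated HTML so the page survives across sessions."""
    PAGES_DIR.mkdir(exist_ok=True)
    path = PAGES_DIR / f"{slugify(title)}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

The saved file is what gets rendered in the iframe, which is also why pages survive restarts: they're ordinary files, not in-memory state.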
The desktop interface includes:
- Draggable windows for canvas pages
- Right-click context menus
- Folder creation and organization
- Wallpaper customization
- A file explorer for browsing all pages
Pages can also communicate back to the app via a postMessage bridge — so the AI can build interactive tools that trigger voice responses, navigate between pages, or control playback.
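On the server side, handling those bridge messages amounts to dispatching on a message type. A hypothetical sketch (the message names `speak` and `navigate` are made up for illustration; the real bridge defines its own vocabulary):

```python
from typing import Any, Callable

# Registry mapping bridge message types to handler functions.
HANDLERS: dict[str, Callable[[dict], Any]] = {}

def bridge_handler(msg_type: str):
    """Decorator registering a handler for one bridge message type."""
    def register(fn):
        HANDLERS[msg_type] = fn
        return fn
    return register

@bridge_handler("speak")
def handle_speak(msg: dict) -> str:
    # In the real app this would queue text for TTS playback.
    return f"speaking: {msg['text']}"

@bridge_handler("navigate")
def handle_navigate(msg: dict) -> str:
    # In the real app this would open another canvas page.
    return f"navigating to: {msg['page']}"

def dispatch(msg: dict) -> Any:
    """Route an incoming postMessage payload to its handler."""
    handler = HANDLERS.get(msg.get("type"))
    if handler is None:
        raise ValueError(f"unknown bridge message: {msg.get('type')!r}")
    return handler(msg)
```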
Swapping Your LLM Provider
Since OpenClaw handles provider routing, changing your LLM is a config change:
- Open `http://localhost:18791` (the OpenClaw control panel)
- Add your provider API key
- Change the default model
Or edit `openclaw.json` directly:
```json
{
  "agents": {
    "defaults": {
      "model": "anthropic/claude-sonnet-4-5"
    }
  }
}
```
Restart, and your voice assistant is now using Claude instead of whatever it was using before. The UI code didn't change. Your conversation flow didn't change. Your canvas pages still work.
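Since it's just a JSON edit, the swap is also easy to script. A sketch, assuming the `agents.defaults.model` layout shown above:

```python
import json
from pathlib import Path

def set_default_model(config_path: str, model: str) -> None:
    """Point OpenClaw's default agent at a different provider/model."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config.setdefault("agents", {}).setdefault("defaults", {})["model"] = model
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")

# e.g. set_default_model("openclaw.json", "anthropic/claude-sonnet-4-5"),
# then restart the stack for the change to take effect.
```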
Tested providers:
- Anthropic (Claude) — via direct API or Z.AI proxy
- OpenAI (GPT-4o) — direct
- Groq (Llama, Mixtral) — fast inference, free tier
- Z.AI (GLM-4.7) — great value, Anthropic-compatible API
- Any OpenAI-compatible endpoint — local models via LM Studio, Ollama, etc.
TTS Options
OpenVoiceUI ships with multiple TTS providers:
| Provider | Type | Cost | Quality |
|---|---|---|---|
| Supertonic | Local, ONNX | Free | Good — ships in Docker, no API key |
| Groq Orpheus | Cloud | ~$0.05/min | Very good — fast, natural |
| Qwen3-TTS | Cloud | ~$0.003/min | Great — supports voice cloning |
| Hume EVI | Cloud | ~$0.032/min | Excellent — emotion-aware |
Switch between them from the Settings panel in the UI. No restart needed.
Voice cloning works with Qwen3: upload a voice sample, get a clone in ~37 seconds, then generate speech with that voice.
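Because the backends are interchangeable, a natural pattern is a fallback chain: try the preferred cloud voice, then drop to the free local engine if it fails. A sketch (the provider names mirror the table above; the `synthesize` callables are stand-ins for real client code):

```python
from typing import Callable

# Each provider is a callable taking text and returning audio bytes.
# Real clients (Groq Orpheus, Qwen3-TTS, Supertonic) would plug in here.
ProviderFn = Callable[[str], bytes]

def speak_with_fallback(text: str, providers: list[tuple[str, ProviderFn]]) -> tuple[str, bytes]:
    """Try each TTS provider in order; return the first that succeeds."""
    errors = []
    for name, synthesize in providers:
        try:
            return name, synthesize(text)
        except Exception as exc:  # any provider failure falls through to the next
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all TTS providers failed: " + "; ".join(errors))
```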
Deploying to a VPS
This is where OpenVoiceUI really shines. Running on a VPS means:
- Always on — your assistant is available 24/7
- Proper SSL — microphone access requires HTTPS (localhost is exempt, but remote access isn't)
- Persistent storage — canvas pages, music, transcripts all stay on the server
Recommended: a Hetzner CX22 ($5-15/mo, 2 cores, 4GB RAM). I've been running multiple user instances on a single Hetzner box.
```bash
git clone https://github.com/MCERQUA/OpenVoiceUI
cd OpenVoiceUI
npx openvoiceui setup
npx openvoiceui start
```
For production, add nginx as a reverse proxy with SSL (the `deploy/setup-sudo.sh` script handles this automatically).
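If you'd rather configure nginx by hand, the essential pieces are SSL termination plus WebSocket upgrade headers for the voice stream. A sketch (the server name and certificate paths are placeholders; adapt them to your domain):

```nginx
server {
    listen 443 ssl;
    server_name assistant.example.com;  # placeholder domain

    ssl_certificate     /etc/letsencrypt/live/assistant.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/assistant.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:5001;
        proxy_http_version 1.1;
        # Required for the WebSocket voice stream
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```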
What's Not Great Yet (Honest Assessment)
- STT — Chrome's SpeechRecognition API only allows one instance at a time, which creates challenges for wake-word detection + conversation. Working on server-side alternatives.
- Docker image size — ~5.4GB. Flask + Node + audio/ML dependencies add up.
- Documentation — behind the code. The README is solid but in-depth guides are sparse.
- Mobile — works but not optimized. Desktop browsers are the primary target.
- TTS echo — the AI can hear its own voice through the mic. Echo cancellation is an open problem.
These are all being actively worked on. Issues are tracked on GitHub.
Why OpenClaw Matters Here
You could build a voice UI on top of raw LLM APIs. But then you'd be reimplementing:
- Provider routing and failover
- Session management and context windowing
- Tool use orchestration
- Auth profile management
- Context pruning and compaction for long-running sessions
OpenClaw already solved all of this. OpenVoiceUI just adds the interface layer on top.
If you're already using OpenClaw for other projects (CLI agents, chat interfaces, automation), OpenVoiceUI gives you a voice-first frontend that connects to the same gateway. Same session management, same tool definitions, same provider config.
Get Started
```bash
npx openvoiceui setup
npx openvoiceui start
```
- GitHub: github.com/MCERQUA/OpenVoiceUI
- npm: npmjs.com/package/openvoiceui
- OpenClaw: openclaw.ai
MIT licensed. Feedback, issues, and PRs welcome.
If you're building on OpenClaw and want a voice interface, this is the starting point. If you're not using OpenClaw yet, this is a good reason to try it.