If you've been following OpenClaw — the open-source AI gateway that routes to any LLM provider — you've probably wondered: what can I actually build on top of it?
OpenVoiceUI is the first voice UI built on OpenClaw. It gives OpenClaw a face, a voice, and a visual workspace. Talk to any LLM through your browser, hear responses spoken back, and watch the AI build live web pages during the conversation.
This tutorial gets you from zero to a running voice assistant in about 5 minutes.
What is OpenClaw + OpenVoiceUI?
OpenClaw is the gateway layer. It handles:
- Routing to any LLM provider (Anthropic, OpenAI, Groq, Z.AI, local models)
- Session management and context windowing
- Tool use and agent orchestration
- Auth profile management (swap API keys, add providers on the fly)
OpenVoiceUI is the interface layer built on top of OpenClaw. It adds:
- Voice I/O — browser-based speech-to-text and text-to-speech
- Live web canvas — the AI generates full HTML pages during conversation (dashboards, reports, tools)
- Desktop OS interface — windows, folders, right-click menus, wallpaper
- AI music generation via Suno integration
- AI image generation with FLUX.1 and Stable Diffusion 3.5
- Voice cloning via Qwen3-TTS
- Agent profiles — multiple AI personas, hot-swappable
- Built-in music player with crossfade and AI ducking
Together: OpenClaw handles the intelligence, OpenVoiceUI handles everything the user sees and hears.
Prerequisites
- Docker and Docker Compose
- Node.js 18+
- At least one LLM API key (Groq has a free tier — easiest way to start)
That's it. No Python setup, no manual dependency management — Docker handles the stack.
Installation
One command:
```bash
npx openvoiceui setup
```
The setup wizard walks you through:
- Entering your API keys (Groq's free tier is the easiest starting point) and picking your LLM provider
- Generating your `.env`, `openclaw.json`, and auth profiles
- Building the Docker images
Then start everything:
```bash
npx openvoiceui start
```
Open http://localhost:5001 in Chrome or Edge. That's your voice assistant.
Behind the scenes, Docker Compose launches three services:
| Service | What it does |
|---|---|
| openclaw | OpenClaw gateway on port 18791 — manages LLM sessions, tool use, agent routing |
| supertonic | Local TTS engine (free, no API key needed) — ONNX-based speech synthesis |
| openvoiceui | Flask server on port 5001 — serves the UI, handles voice streaming, manages canvas pages |
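Once the stack is up, a quick way to sanity-check it is to confirm the exposed ports are listening. A minimal sketch using only the standard library (the port numbers come from the table above; the supertonic service's port isn't documented here, so it's omitted):

```python
import socket

# Ports of the services exposed by the Docker Compose stack described above.
SERVICES = {
    "openclaw": 18791,
    "openvoiceui": 5001,
}

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in SERVICES.items():
        status = "up" if port_open("127.0.0.1", port) else "down"
        print(f"{name:12} :{port}  {status}")
```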
How the Architecture Works
```
Browser (voice + canvas)
        |
        v
OpenVoiceUI (Flask, port 5001)
        |
        v  WebSocket
OpenClaw Gateway (port 18791)
        |
        v  API calls
LLM Provider (Anthropic / OpenAI / Groq / Z.AI / local)
```
The key architectural decision: complete separation between the UI and the intelligence.
OpenVoiceUI never talks to your LLM directly. Everything goes through OpenClaw. This means:
- Switch LLM providers by changing one config value in OpenClaw
- Add new providers without touching the UI code
- OpenClaw handles context pruning, compaction, and session management independently
- Tool use and agent orchestration happen at the gateway layer
Voice streaming uses WebSocket for low latency. The browser captures speech via Web Speech API (or Deepgram/Groq for server-side STT), sends it to Flask, which forwards to OpenClaw, which calls the LLM. The response streams back and gets spoken via TTS.
The Canvas System — Your AI Gets a Screen
This is the feature that makes OpenVoiceUI more than a chatbot.
During a voice conversation, ask the AI to build something visual:
"Build me a dashboard showing server metrics"
The AI generates a complete HTML page and renders it live in the browser inside a desktop-style window manager. The page persists on the server filesystem — it's not ephemeral.
This works because OpenClaw gives the AI access to tools. The canvas tool lets the AI create, update, and manage HTML pages. The AI writes the HTML, OpenVoiceUI saves it to disk and renders it in an iframe.
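The persistence half of that is simple: turn the page title into a safe filename and write the HTML to disk. A sketch, assuming a flat pages directory (the directory name and slug rules are mine, not OpenVoiceUI's actual layout):

```python
import re
from pathlib import Path

PAGES_DIR = Path("canvas_pages")  # illustrative location; the real app picks its own

def slugify(title: str) -> str:
    """Turn a page title into a safe filename slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "page"

def save_canvas_page(title: str, html: str) -> Path:
    """Persist AI-generated HTML so the page survives across sessions."""
    PAGES_DIR.mkdir(exist_ok=True)
    path = PAGES_DIR / f"{slugify(title)}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

The saved file is what gets rendered in the iframe, which is also why pages survive restarts: they're ordinary files, not in-memory state.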
The desktop interface includes:
- Draggable windows for canvas pages
- Right-click context menus
- Folder creation and organization
- Wallpaper customization
- A file explorer for browsing all pages
Pages can also communicate back to the app via a postMessage bridge — so the AI can build interactive tools that trigger voice responses, navigate between pages, or control playback.
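On the server side, handling those bridge messages amounts to dispatching on a message type. A hypothetical sketch (the message names `speak` and `navigate` are made up for illustration; the real bridge defines its own vocabulary):

```python
from typing import Any, Callable

# Registry mapping bridge message types to handler functions.
HANDLERS: dict[str, Callable[[dict], Any]] = {}

def bridge_handler(msg_type: str):
    """Decorator registering a handler for one bridge message type."""
    def register(fn):
        HANDLERS[msg_type] = fn
        return fn
    return register

@bridge_handler("speak")
def handle_speak(msg: dict) -> str:
    # In the real app this would queue text for TTS playback.
    return f"speaking: {msg['text']}"

@bridge_handler("navigate")
def handle_navigate(msg: dict) -> str:
    # In the real app this would open another canvas page.
    return f"navigating to: {msg['page']}"

def dispatch(msg: dict) -> Any:
    """Route an incoming postMessage payload to its handler."""
    handler = HANDLERS.get(msg.get("type"))
    if handler is None:
        raise ValueError(f"unknown bridge message: {msg.get('type')!r}")
    return handler(msg)
```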
Swapping Your LLM Provider
Since OpenClaw handles provider routing, changing your LLM is a config change:
- Open `http://localhost:18791` (the OpenClaw control panel)
- Add your provider API key
- Change the default model
Or edit `openclaw.json` directly:
```json
{
  "agents": {
    "defaults": {
      "model": "anthropic/claude-sonnet-4-5"
    }
  }
}
```
Restart, and your voice assistant is now using Claude instead of whatever it was using before. The UI code didn't change. Your conversation flow didn't change. Your canvas pages still work.
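Since it's just a JSON edit, the swap is also easy to script. A sketch, assuming the `agents.defaults.model` layout shown above:

```python
import json
from pathlib import Path

def set_default_model(config_path: str, model: str) -> None:
    """Point OpenClaw's default agent at a different provider/model."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config.setdefault("agents", {}).setdefault("defaults", {})["model"] = model
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")

# e.g. set_default_model("openclaw.json", "anthropic/claude-sonnet-4-5"),
# then restart the stack for the change to take effect.
```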
Tested providers:
- Anthropic (Claude) — via direct API or Z.AI proxy
- OpenAI (GPT-4o) — direct
- Groq (Llama, Mixtral) — fast inference, free tier
- Z.AI (GLM-4.7) — great value, Anthropic-compatible API
- Any OpenAI-compatible endpoint — local models via LM Studio, Ollama, etc.
TTS Options
OpenVoiceUI ships with multiple TTS providers:
| Provider | Type | Cost | Quality |
|---|---|---|---|
| Supertonic | Local, ONNX | Free | Good — ships in Docker, no API key |
| Groq Orpheus | Cloud | ~$0.05/min | Very good — fast, natural |
| Qwen3-TTS | Cloud | ~$0.003/min | Great — supports voice cloning |
| Hume EVI | Cloud | ~$0.032/min | Excellent — emotion-aware |
Switch between them from the Settings panel in the UI. No restart needed.
Voice cloning works with Qwen3: upload a voice sample, get a clone in ~37 seconds, then generate speech with that voice.
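Because the backends are interchangeable, a natural pattern is a fallback chain: try the preferred cloud voice, then drop to the free local engine if it fails. A sketch (the provider names mirror the table above; the `synthesize` callables are stand-ins for real client code):

```python
from typing import Callable

# Each provider is a callable taking text and returning audio bytes.
# Real clients (Groq Orpheus, Qwen3-TTS, Supertonic) would plug in here.
ProviderFn = Callable[[str], bytes]

def speak_with_fallback(text: str, providers: list[tuple[str, ProviderFn]]) -> tuple[str, bytes]:
    """Try each TTS provider in order; return the first that succeeds."""
    errors = []
    for name, synthesize in providers:
        try:
            return name, synthesize(text)
        except Exception as exc:  # any provider failure falls through to the next
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all TTS providers failed: " + "; ".join(errors))
```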
Deploying to a VPS
This is where OpenVoiceUI really shines. Running on a VPS means:
- Always on — your assistant is available 24/7
- Proper SSL — microphone access requires HTTPS (localhost is exempt, but remote access isn't)
- Persistent storage — canvas pages, music, transcripts all stay on the server
Recommended: a Hetzner CX22 ($5-15/mo, 2 cores, 4GB RAM). I've been running multiple user instances on a single Hetzner box.
```bash
git clone https://github.com/MCERQUA/OpenVoiceUI
cd OpenVoiceUI
npx openvoiceui setup
npx openvoiceui start
```
For production, add nginx as a reverse proxy with SSL (the `deploy/setup-sudo.sh` script handles this automatically).
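If you'd rather configure nginx by hand, the essential pieces are SSL termination plus WebSocket upgrade headers for the voice stream. A sketch (the server name and certificate paths are placeholders; adapt them to your domain):

```nginx
server {
    listen 443 ssl;
    server_name assistant.example.com;  # placeholder domain

    ssl_certificate     /etc/letsencrypt/live/assistant.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/assistant.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:5001;
        proxy_http_version 1.1;
        # Required for the WebSocket voice stream
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```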
What's Not Great Yet (Honest Assessment)
- STT — Chrome's SpeechRecognition API only allows one instance at a time, which creates challenges for wake-word detection + conversation. Working on server-side alternatives.
- Docker image size — ~5.4GB. Flask + Node + audio/ML dependencies add up.
- Documentation — behind the code. The README is solid but in-depth guides are sparse.
- Mobile — works but not optimized. Desktop browsers are the primary target.
- TTS echo — the AI can hear its own voice through the mic. Echo cancellation is an open problem.
These are all being actively worked on. Issues are tracked on GitHub.
Why OpenClaw Matters Here
You could build a voice UI on top of raw LLM APIs. But then you'd be reimplementing:
- Provider routing and failover
- Session management and context windowing
- Tool use orchestration
- Auth profile management
- Context pruning and compaction for long-running sessions
OpenClaw already solved all of this. OpenVoiceUI just adds the interface layer on top.
If you're already using OpenClaw for other projects (CLI agents, chat interfaces, automation), OpenVoiceUI gives you a voice-first frontend that connects to the same gateway. Same session management, same tool definitions, same provider config.
Get Started
```bash
npx openvoiceui setup
npx openvoiceui start
```
- GitHub: github.com/MCERQUA/OpenVoiceUI
- npm: npmjs.com/package/openvoiceui
- OpenClaw: openclaw.ai
MIT licensed. Feedback, issues, and PRs welcome.
If you're building on OpenClaw and want a voice interface, this is the starting point. If you're not using OpenClaw yet, this is a good reason to try it.