<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: MCerqua</title>
    <description>The latest articles on DEV Community by MCerqua (@mcerqua).</description>
    <link>https://dev.to/mcerqua</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3830036%2F2f20de0e-47c5-4a21-bd47-40a6e8cb2f7a.jpeg</url>
      <title>DEV Community: MCerqua</title>
      <link>https://dev.to/mcerqua</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mcerqua"/>
    <language>en</language>
    <item>
      <title>OpenVoiceUI: AI Voice Agent App Generates Live Canvas Pages Using OpenClaw</title>
      <dc:creator>MCerqua</dc:creator>
      <pubDate>Tue, 17 Mar 2026 20:51:26 +0000</pubDate>
      <link>https://dev.to/mcerqua/openvoiceui-ai-voice-agent-app-generates-live-canvas-pages-using-openclaw-33i9</link>
      <guid>https://dev.to/mcerqua/openvoiceui-ai-voice-agent-app-generates-live-canvas-pages-using-openclaw-33i9</guid>
      <description>&lt;p&gt;If you've been following &lt;a href="https://openclaw.ai" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; — the open-source AI gateway that routes to any LLM provider — you've probably wondered: what can I actually build on top of it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenVoiceUI is the first voice UI built on OpenClaw.&lt;/strong&gt; It gives OpenClaw a face, a voice, and a visual workspace. Talk to any LLM through your browser, hear responses spoken back, and watch the AI build live web pages during the conversation.&lt;/p&gt;

&lt;p&gt;This tutorial gets you from zero to a running voice assistant in about 5 minutes.&lt;/p&gt;




&lt;h2&gt;What is OpenClaw + OpenVoiceUI?&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenClaw&lt;/strong&gt; is the gateway layer. It handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Routing to any LLM provider (Anthropic, OpenAI, Groq, Z.AI, local models)&lt;/li&gt;
&lt;li&gt;Session management and context windowing&lt;/li&gt;
&lt;li&gt;Tool use and agent orchestration&lt;/li&gt;
&lt;li&gt;Auth profile management (swap API keys, add providers on the fly)&lt;/li&gt;
&lt;/ul&gt;
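&lt;p&gt;To make the "context windowing" bullet concrete: the gateway has to trim old turns so each request fits the model's context budget. A minimal sketch of the idea (illustrative only, not OpenClaw's actual code; the four-characters-per-token estimate is a rough assumption):&lt;/p&gt;

```python
# Illustrative sketch of context windowing: keep the system prompt plus the
# most recent turns that fit a token budget. Not OpenClaw's actual code;
# the 4-characters-per-token estimate is a rough assumption.
def window_messages(messages, max_tokens, count_tokens=lambda m: len(m["content"]) // 4):
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m) for m in system)
    kept = []
    for msg in reversed(rest):  # walk newest-first so recent turns survive
        cost = count_tokens(msg)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))
```

&lt;p&gt;A real gateway would also summarize (compact) dropped turns rather than discarding them outright.&lt;/p&gt;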

&lt;p&gt;&lt;strong&gt;OpenVoiceUI&lt;/strong&gt; is the interface layer built on top of OpenClaw. It adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Voice I/O&lt;/strong&gt; — browser-based speech-to-text and text-to-speech&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live web canvas&lt;/strong&gt; — the AI generates full HTML pages during conversation (dashboards, reports, tools)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Desktop OS interface&lt;/strong&gt; — windows, folders, right-click menus, wallpaper&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI music generation&lt;/strong&gt; via Suno integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI image generation&lt;/strong&gt; with FLUX.1 and Stable Diffusion 3.5&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice cloning&lt;/strong&gt; via Qwen3-TTS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent profiles&lt;/strong&gt; — multiple AI personas, hot-swappable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in music player&lt;/strong&gt; with crossfade and AI ducking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together: OpenClaw handles the intelligence; OpenVoiceUI handles everything the user sees and hears.&lt;/p&gt;




&lt;h2&gt;Prerequisites&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Docker and Docker Compose&lt;/li&gt;
&lt;li&gt;Node.js 18+&lt;/li&gt;
&lt;li&gt;At least one LLM API key (Groq has a free tier — easiest way to start)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. No Python setup, no manual dependency management — Docker handles the stack.&lt;/p&gt;




&lt;h2&gt;Installation&lt;/h2&gt;

&lt;p&gt;One command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx openvoiceui setup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The setup wizard walks you through:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Entering your API keys (Groq is required for TTS; you then pick your LLM provider)&lt;/li&gt;
&lt;li&gt;Generating your &lt;code&gt;.env&lt;/code&gt;, &lt;code&gt;openclaw.json&lt;/code&gt;, and auth profiles&lt;/li&gt;
&lt;li&gt;Building the Docker images&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Then start everything:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx openvoiceui start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:5001&lt;/code&gt; in Chrome or Edge. That's your voice assistant.&lt;/p&gt;

&lt;p&gt;Behind the scenes, Docker Compose launches three services:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;openclaw&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OpenClaw gateway on port 18791 — manages LLM sessions, tool use, agent routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;supertonic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local TTS engine (free, no API key needed) — ONNX-based speech synthesis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;openvoiceui&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Flask server on port 5001 — serves the UI, handles voice streaming, manages canvas pages&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;How the Architecture Works&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser (voice + canvas)
    |
    v
OpenVoiceUI (Flask, port 5001)
    |
    v  WebSocket
OpenClaw Gateway (port 18791)
    |
    v  API calls
LLM Provider (Anthropic / OpenAI / Groq / Z.AI / local)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key architectural decision: &lt;strong&gt;complete separation between the UI and the intelligence.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenVoiceUI never talks to your LLM directly. Everything goes through OpenClaw. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Switch LLM providers by changing one config value in OpenClaw&lt;/li&gt;
&lt;li&gt;Add new providers without touching the UI code&lt;/li&gt;
&lt;li&gt;OpenClaw handles context pruning, compaction, and session management independently&lt;/li&gt;
&lt;li&gt;Tool use and agent orchestration happen at the gateway layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Voice streaming uses WebSocket for low latency. The browser captures speech via Web Speech API (or Deepgram/Groq for server-side STT), sends it to Flask, which forwards to OpenClaw, which calls the LLM. The response streams back and gets spoken via TTS.&lt;/p&gt;
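&lt;p&gt;One practical detail of that pipeline: feeding a TTS engine token by token sounds choppy, so voice UIs typically buffer the stream into sentence-sized chunks before speaking. A hypothetical sketch of that buffering step (not OpenVoiceUI's actual code):&lt;/p&gt;

```python
# Hypothetical sketch: buffer streamed LLM tokens into sentence-sized
# chunks suitable for a TTS engine. Not OpenVoiceUI's actual code.
import re

SENTENCE_END = re.compile(r'([.!?])(\s|$)')

def chunk_for_tts(token_stream):
    """Yield complete sentences as they accumulate from a token stream."""
    buffer = ""
    for token in token_stream:
        buffer += token
        match = SENTENCE_END.search(buffer)
        while match:
            yield buffer[:match.end(1)].strip()
            buffer = buffer[match.end():]
            match = SENTENCE_END.search(buffer)
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()
```

&lt;p&gt;Speaking each sentence as soon as it completes is what keeps perceived latency low even when the full LLM response takes several seconds.&lt;/p&gt;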




&lt;h2&gt;The Canvas System — Your AI Gets a Screen&lt;/h2&gt;

&lt;p&gt;This is the feature that makes OpenVoiceUI more than a chatbot.&lt;/p&gt;

&lt;p&gt;During a voice conversation, ask the AI to build something visual:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Build me a dashboard showing server metrics"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AI generates a complete HTML page and renders it live in the browser inside a desktop-style window manager. The page persists on the server filesystem — it's not ephemeral.&lt;/p&gt;

&lt;p&gt;This works because OpenClaw gives the AI access to tools. The canvas tool lets the AI create, update, and manage HTML pages. The AI writes the HTML, OpenVoiceUI saves it to disk and renders it in an iframe.&lt;/p&gt;

&lt;p&gt;The desktop interface includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Draggable windows for canvas pages&lt;/li&gt;
&lt;li&gt;Right-click context menus&lt;/li&gt;
&lt;li&gt;Folder creation and organization&lt;/li&gt;
&lt;li&gt;Wallpaper customization&lt;/li&gt;
&lt;li&gt;A file explorer for browsing all pages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pages can also communicate back to the app via a postMessage bridge — so the AI can build interactive tools that trigger voice responses, navigate between pages, or control playback.&lt;/p&gt;
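&lt;p&gt;The persistence step can be sketched in a few lines. The function and directory names here are hypothetical, not OpenVoiceUI's real API; the point is just the shape of save-to-disk-then-render:&lt;/p&gt;

```python
# Illustrative sketch of persisting a generated canvas page to disk.
# Function and directory names are hypothetical, not OpenVoiceUI's API.
import re
from pathlib import Path

def save_canvas_page(title: str, html: str, pages_dir: str = "canvas_pages") -> Path:
    """Write AI-generated HTML under a filesystem-safe slug and return the path."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "page"
    path = Path(pages_dir) / f"{slug}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path
```

&lt;p&gt;Because the page lands on disk rather than in memory, it survives restarts and shows up in the file explorer like any other document.&lt;/p&gt;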




&lt;h2&gt;Swapping Your LLM Provider&lt;/h2&gt;

&lt;p&gt;Since OpenClaw handles provider routing, changing your LLM is a config change:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;code&gt;http://localhost:18791&lt;/code&gt; (OpenClaw control panel)&lt;/li&gt;
&lt;li&gt;Add your provider API key&lt;/li&gt;
&lt;li&gt;Change the default model&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Or edit &lt;code&gt;openclaw.json&lt;/code&gt; directly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"agents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"defaults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic/claude-sonnet-4-5"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart, and your voice assistant is now using Claude instead of whatever it was using before. The UI code didn't change. Your conversation flow didn't change. Your canvas pages still work.&lt;/p&gt;

&lt;p&gt;Tested providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic&lt;/strong&gt; (Claude) — via direct API or Z.AI proxy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt; (GPT-4o) — direct&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Groq&lt;/strong&gt; (Llama, Mixtral) — fast inference, free tier&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Z.AI&lt;/strong&gt; (GLM-4.7) — great value, Anthropic-compatible API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Any OpenAI-compatible endpoint&lt;/strong&gt; — local models via LM Studio, Ollama, etc.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;TTS Options&lt;/h2&gt;

&lt;p&gt;OpenVoiceUI ships with multiple TTS providers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Supertonic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local, ONNX&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Good — ships in Docker, no API key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Groq Orpheus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;~$0.05/min&lt;/td&gt;
&lt;td&gt;Very good — fast, natural&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen3-TTS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;~$0.003/min&lt;/td&gt;
&lt;td&gt;Great — supports voice cloning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hume EVI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;~$0.032/min&lt;/td&gt;
&lt;td&gt;Excellent — emotion-aware&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
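&lt;p&gt;Those per-minute rates make budgeting easy to estimate. For example, at 30 minutes of spoken output a day (rates copied from the table above; treat them as approximate):&lt;/p&gt;

```python
# Back-of-envelope monthly TTS cost at the approximate per-minute rates
# from the table above.
RATES = {"groq_orpheus": 0.05, "qwen3_tts": 0.003, "hume_evi": 0.032}

def monthly_cost(provider: str, minutes_per_day: float, days: int = 30) -> float:
    return round(RATES[provider] * minutes_per_day * days, 2)
```

&lt;p&gt;That works out to roughly $2.70/month on Qwen3-TTS versus $45/month on Groq Orpheus at the same usage, which is why cheap cloud TTS (or free local Supertonic) matters for an always-on assistant.&lt;/p&gt;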

&lt;p&gt;Switch between them from the Settings panel in the UI. No restart needed.&lt;/p&gt;

&lt;p&gt;Voice cloning works with Qwen3: upload a voice sample, get a clone in ~37 seconds, then generate speech with that voice.&lt;/p&gt;




&lt;h2&gt;Deploying to a VPS&lt;/h2&gt;

&lt;p&gt;This is where OpenVoiceUI really shines. Running on a VPS means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Always on — your assistant is available 24/7&lt;/li&gt;
&lt;li&gt;Proper SSL — microphone access requires HTTPS (localhost is exempt, but remote access isn't)&lt;/li&gt;
&lt;li&gt;Persistent storage — canvas pages, music, transcripts all stay on the server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recommended: a Hetzner CX22 ($5-15/mo, 2 cores, 4GB RAM). I've been running multiple user instances on a single Hetzner box.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/MCERQUA/OpenVoiceUI
&lt;span class="nb"&gt;cd &lt;/span&gt;OpenVoiceUI
npx openvoiceui setup
npx openvoiceui start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For production, add nginx as a reverse proxy with SSL (the &lt;code&gt;deploy/setup-sudo.sh&lt;/code&gt; script handles this automatically).&lt;/p&gt;
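&lt;p&gt;If you want to see what that proxy looks like before running the script, here is a minimal sketch. The domain and certificate paths are placeholders and the script generates the real config, so treat this as illustrative:&lt;/p&gt;

```nginx
# Minimal sketch of an nginx reverse proxy for OpenVoiceUI.
# voice.example.com and the certificate paths are placeholders;
# deploy/setup-sudo.sh generates the real config.
server {
    listen 443 ssl;
    server_name voice.example.com;

    ssl_certificate     /etc/letsencrypt/live/voice.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/voice.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:5001;
        # WebSocket upgrade for the voice stream
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```

&lt;p&gt;The WebSocket upgrade headers are the part people most often miss; without them the voice stream falls back to failing requests even though the regular UI loads fine.&lt;/p&gt;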




&lt;h2&gt;What's Not Great Yet (Honest Assessment)&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;STT&lt;/strong&gt; — Chrome's SpeechRecognition API only allows one instance at a time, which creates challenges for wake-word detection + conversation. Working on server-side alternatives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker image size&lt;/strong&gt; — ~5.4GB. Flask + Node + audio/ML dependencies add up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt; — behind the code. The README is solid but in-depth guides are sparse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mobile&lt;/strong&gt; — works but not optimized. Desktop browsers are the primary target.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTS echo&lt;/strong&gt; — the AI can hear its own voice through the mic. Echo cancellation is an open problem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all being actively worked on. Issues are tracked on GitHub.&lt;/p&gt;




&lt;h2&gt;Why OpenClaw Matters Here&lt;/h2&gt;

&lt;p&gt;You could build a voice UI on top of raw LLM APIs. But then you'd be reimplementing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provider routing and failover&lt;/li&gt;
&lt;li&gt;Session management and context windowing&lt;/li&gt;
&lt;li&gt;Tool use orchestration&lt;/li&gt;
&lt;li&gt;Auth profile management&lt;/li&gt;
&lt;li&gt;Context pruning and compaction for long-running sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenClaw already solved all of this. OpenVoiceUI just adds the interface layer on top.&lt;/p&gt;

&lt;p&gt;If you're already using OpenClaw for other projects (CLI agents, chat interfaces, automation), OpenVoiceUI gives you a voice-first frontend that connects to the same gateway. Same session management, same tool definitions, same provider config.&lt;/p&gt;




&lt;h2&gt;Get Started&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx openvoiceui setup
npx openvoiceui start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/MCERQUA/OpenVoiceUI" rel="noopener noreferrer"&gt;github.com/MCERQUA/OpenVoiceUI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;npm:&lt;/strong&gt; &lt;a href="https://www.npmjs.com/package/openvoiceui" rel="noopener noreferrer"&gt;npmjs.com/package/openvoiceui&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenClaw:&lt;/strong&gt; &lt;a href="https://openclaw.ai" rel="noopener noreferrer"&gt;openclaw.ai&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MIT licensed. Feedback, issues, and PRs welcome.&lt;/p&gt;

&lt;p&gt;If you're building on OpenClaw and want a voice interface, this is the starting point. If you're not using OpenClaw yet, this is a good reason to try it.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
