SlopWeaver is a desktop app that connects your work tools (Gmail, Slack, Linear, Google Docs, and more) and uses AI to handle busywork. It already had a text-based AI chat with full tool access: workspace search, task creation, cross-platform context. Adding voice meant giving the same capabilities a speech interface.
The architecture:
ElevenLabs Conversational AI handles the speech side. It manages the WebSocket connection, speech-to-text, text-to-speech, turn detection, and interruption handling. SlopWeaver is registered as a custom LLM provider. When you speak, ElevenLabs transcribes it and sends the text to SlopWeaver's API in OpenAI Chat Completions format. SlopWeaver processes it through the same Claude chat pipeline as text (tool calls, context retrieval, generation), then streams the response back as SSE chunks. ElevenLabs begins speaking the chunks as they arrive, so TTS starts before the full response is generated.
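The custom-LLM bridge above can be sketched as a small formatter that wraps each text delta from the chat pipeline in an OpenAI Chat Completions streaming chunk before it goes onto the SSE stream. This is a minimal sketch, not SlopWeaver's actual code; names like `toSseChunk` and the model id are illustrative.

```typescript
// Wrap a text delta in an OpenAI-style streaming chunk for the SSE response.
interface ChatDelta {
  content?: string;
}

function toSseChunk(delta: ChatDelta, done = false): string {
  if (done) return "data: [DONE]\n\n"; // terminator ElevenLabs watches for
  const payload = {
    id: "chatcmpl-voice", // any stable id works for a single stream
    object: "chat.completion.chunk",
    created: Math.floor(Date.now() / 1000),
    model: "slopweaver-voice", // placeholder model name
    choices: [{ index: 0, delta, finish_reason: null }],
  };
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Emit each generated fragment as soon as it exists so TTS can start early.
const chunks = ["Sure, ", "creating that task now."].map((c) =>
  toSseChunk({ content: c })
);
chunks.push(toSseChunk({}, true));
```

Because each fragment is flushed independently, ElevenLabs can begin synthesizing speech from the first chunk while later ones are still being generated.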
In practice, the voice conversation has the same tool access as text. You can ask it to search your messages, create tasks, summarize threads, pull context from connected platforms. Voice is an input modality, not a separate product.
In this demo I used voice mode to work through some product issues. I asked the AI to explain a Sentry error on screen, told it to create a task for broken Gmail email rendering, then accepted a proposed fix from the tasks page. One task created by voice, one existing proposal accepted.

Three problems I ran into that might save you time if you're building something similar:
ElevenLabs uses Whisper for transcription, and it misheard domain-specific words constantly. "Slack" became "stack", "Jira" became "gyra", "SlopWeaver" became anything. I built a per-user vocabulary correction service that runs post-transcription. It's a simple find-and-replace pass, but it made a big difference in downstream AI comprehension.
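A correction pass like that can be sketched in a few lines. This assumes a per-user map of common mishearings; the word-boundary regexes keep the replacement from corrupting substrings ("stacked" should not become "Slacked"). The map entries here are examples from the post, not a real user's vocabulary.

```typescript
// Per-user mishearing map: transcribed form -> intended term.
const corrections: Record<string, string> = {
  stack: "Slack",
  gyra: "Jira",
};

// Apply whole-word, case-insensitive replacements to a raw transcript.
function correctTranscript(
  text: string,
  vocab: Record<string, string>
): string {
  let out = text;
  for (const [heard, actual] of Object.entries(vocab)) {
    out = out.replace(new RegExp(`\\b${heard}\\b`, "gi"), actual);
  }
  return out;
}

correctTranscript("post this in stack and file it in gyra", corrections);
// → "post this in Slack and file it in Jira"
```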
The AI's text responses contain markdown, code blocks, URLs, and embedded content. None of that works when spoken aloud. Added a sanitization layer between the chat pipeline output and the SSE stream that strips non-voice-safe content. The AI generates one response, and the rendering layer decides what's appropriate for voice vs text display.
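A sanitization layer of that kind might look like the following sketch, assuming the response is markdown. Code blocks become a short spoken placeholder, link text survives while URLs don't, and emphasis markers are dropped. The regexes are illustrative, not exhaustive.

```typescript
// Strip markdown and other non-voice-safe content before the SSE stream.
function stripForVoice(md: string): string {
  return md
    .replace(/```[\s\S]*?```/g, "a code snippet") // fenced code blocks
    .replace(/`([^`]+)`/g, "$1") // inline code: keep the text
    .replace(/\[([^\]]+)\]\([^)]+\)/g, "$1") // markdown links: keep the label
    .replace(/https?:\/\/\S+/g, "a link") // bare URLs
    .replace(/[*_#>]/g, "") // emphasis, headings, quotes
    .replace(/\s+/g, " ") // collapse leftover whitespace
    .trim();
}

stripForVoice("See [the docs](https://example.com) and `fooBar`");
// → "See the docs and fooBar"
```

Order matters: fenced blocks are removed before inline code, so backticks inside a code block never get treated as inline spans.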
The round trip is: speech capture, ElevenLabs transcription, network to SlopWeaver API, Claude generation (sometimes with tool calls that hit external APIs), network back to ElevenLabs, text-to-speech, audio playback. Each step adds latency, and tool calls add more. Two things helped most: using ElevenLabs' lowest-latency TTS model (eleven_flash_v2_5) and streaming response chunks so TTS can start speaking before generation finishes. The target is under 800ms for the first spoken token on non-tool-call turns.
Each voice conversation turn is a billable action. A preflight affordability check runs when the session starts; the actual cost deduction happens after the webhook completes. Sessions are Redis-backed with a 30-minute TTL, which also prevents race conditions between overlapping turns.
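The session lifecycle can be sketched as follows: a preflight balance check at session start, a per-turn lock so overlapping turns can't race, and a TTL so abandoned sessions expire. The store interface mirrors the Redis commands involved (SET with EX and NX); the in-memory stub and all names here are illustrative stand-ins, not SlopWeaver's implementation.

```typescript
const SESSION_TTL_S = 30 * 60; // 30-minute session TTL

interface KvStore {
  // Returns false when onlyIfAbsent is set and a live key already exists (SET NX).
  set(key: string, val: string, ttlS: number, onlyIfAbsent?: boolean): boolean;
}

// Minimal in-memory stand-in for Redis, for illustration only.
class InMemoryStore implements KvStore {
  private m = new Map<string, { val: string; expires: number }>();
  set(key: string, val: string, ttlS: number, onlyIfAbsent = false): boolean {
    const now = Date.now();
    const existing = this.m.get(key);
    if (onlyIfAbsent && existing && existing.expires > now) return false;
    this.m.set(key, { val, expires: now + ttlS * 1000 });
    return true;
  }
}

// Preflight affordability check, then create the TTL-backed session.
function startSession(
  store: KvStore,
  userId: string,
  balance: number,
  turnCost: number
): boolean {
  if (balance < turnCost) return false; // can't afford even one turn
  store.set(`voice:${userId}`, "active", SESSION_TTL_S);
  return true;
}

// SET NX with a short TTL serializes overlapping turns for one user.
function acquireTurnLock(store: KvStore, userId: string): boolean {
  return store.set(`voice:${userId}:turn`, "1", 30, true);
}
```

The actual deduction happens later, after the webhook completes, so the lock only needs to hold for the duration of a single turn.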
Stack: NestJS, React 19, Tauri v2 (desktop), Claude (Anthropic SDK with prompt caching), ElevenLabs Conversational AI (WebSocket + custom LLM webhook), Supabase + pgvector, BullMQ.
Building in public. Previous demos showed the review queue and the cross-platform AI chat. This one adds voice as a third interaction mode.