Have you ever wanted to just talk to your AI assistant? Like, literally speak into your microphone and hear it respond naturally?
I work with AI all day, and sometimes typing feels... slow. Copy-pasting is tedious. Sometimes I'm cooking, walking, or just tired of staring at a screen. I wanted something that felt like a real conversation.
So I built Talkative Lobster — a desktop app that turns your voice into AI conversation and back again.
What It Does
You speak → Speech-to-Text → LLM → Text-to-Speech → You hear
You talk into your mic, the AI thinks, and then speaks back to you. No typing, no copy-pasting — just natural back-and-forth conversation.
Key features:
- Interrupt anytime — Just start talking to cut in mid-response, like a real conversation
- Works offline — local whisper.cpp and Piper TTS when you need privacy or have no network
- Japanese support — VOICEVOX and Kokoro voices, plus "aizuchi" (those little "mm-hmm" sounds)
- Multiple providers — ElevenLabs, OpenAI, or local options for both STT and TTS
- Privacy-first — API keys encrypted on your machine, everything routes through your own gateway
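The speak → STT → LLM → TTS → hear pipeline above is, at its core, one loop per conversational turn. A minimal sketch of the data flow in TypeScript, where `transcribe`, `complete`, and `speak` are hypothetical stand-ins for whichever providers you configure (not the app's actual API):

```typescript
// One conversational turn: mic audio in, speaker audio out.
type Audio = Float32Array;

interface Providers {
  transcribe: (input: Audio) => Promise<string>; // STT
  complete: (prompt: string) => Promise<string>; // LLM
  speak: (text: string) => Promise<Audio>;       // TTS
}

async function runTurn(mic: Audio, p: Providers): Promise<Audio> {
  const userText = await p.transcribe(mic); // speech → text
  const reply = await p.complete(userText); // text → reply
  return p.speak(reply);                    // reply → audio
}
```

The real app streams each stage and supports interruption; this only shows how the pieces connect.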
How It Works
Voice Detection
The app uses Silero VAD (Voice Activity Detection) — a neural network trained on real speech patterns. This means it knows when you're actually talking versus when there's background noise, music, or keyboard clicks.
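Silero VAD emits a per-frame speech probability, which still needs debouncing before it's useful: you don't want a single noisy frame to start or stop a segment. A hypothetical sketch of that gating logic (the thresholds are made-up illustration values, not Talkative Lobster's actual settings):

```typescript
// Debounce per-frame VAD probabilities into clean "start"/"end" events.
class SpeechGate {
  private speaking = false;
  private silentFrames = 0;

  constructor(
    private readonly onThreshold = 0.6,   // prob above this starts speech
    private readonly offThreshold = 0.35, // prob below this counts as silence
    private readonly hangoverFrames = 10, // silent frames before speech ends
  ) {}

  // Feed one frame's speech probability; returns "start", "end", or null.
  push(prob: number): "start" | "end" | null {
    if (!this.speaking) {
      if (prob >= this.onThreshold) {
        this.speaking = true;
        this.silentFrames = 0;
        return "start";
      }
      return null;
    }
    if (prob < this.offThreshold) {
      if (++this.silentFrames >= this.hangoverFrames) {
        this.speaking = false;
        return "end";
      }
    } else {
      this.silentFrames = 0; // speech resumed, reset the hangover
    }
    return null;
  }
}
```

The hangover window is what lets you pause mid-sentence without the app cutting you off.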
Speech-to-Text Options
- ElevenLabs Scribe — Most accurate and fastest of the three
- OpenAI Whisper — Reliable, widely used
- whisper.cpp — Runs locally, no network needed
Text-to-Speech Options
- ElevenLabs — Most natural voices
- VOICEVOX — Great for Japanese
- Kokoro — Japanese + English
- Piper — Lightweight, runs offline
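Swapping between providers like these is easiest behind a shared interface. A hypothetical registry sketch showing how a preferred provider can fall back to a local one (e.g. Piper) when offline — the interface is an illustration, not the app's actual code:

```typescript
interface TtsProvider {
  name: string;
  local: boolean; // true if it runs offline
  speak(text: string): Promise<Float32Array>;
}

const providers = new Map<string, TtsProvider>();

function register(p: TtsProvider): void {
  providers.set(p.name, p);
}

// Prefer the requested provider, but fall back to any local one offline.
function pick(preferred: string, online: boolean): TtsProvider {
  const p = providers.get(preferred);
  if (p && (online || p.local)) return p;
  const fallback = Array.from(providers.values()).find((q) => q.local);
  if (!fallback) throw new Error("no usable TTS provider");
  return fallback;
}
```

The same shape works for the STT side: one interface, many backends.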
LLM (OpenClaw Gateway)
The AI part runs through OpenClaw — a local gateway that:
- Lets you switch models without code changes
- Handles rate limits gracefully
- Keeps all your API keys in one place
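"Switch models without code changes" boils down to keeping the model name in config and building the request from it. A sketch of that idea, assuming the gateway exposes an OpenAI-compatible chat endpoint on localhost — the URL, port, and payload shape here are assumptions for illustration, not OpenClaw's documented API:

```typescript
interface GatewayConfig {
  baseUrl: string; // e.g. "http://localhost:8787" (hypothetical)
  model: string;   // swap models here, no code changes
}

// Build the request the app would POST to the gateway.
function buildChatRequest(cfg: GatewayConfig, userText: string) {
  return {
    url: `${cfg.baseUrl}/v1/chat/completions`,
    body: {
      model: cfg.model,
      stream: true, // stream tokens so TTS can start early
      messages: [{ role: "user" as const, content: userText }],
    },
  };
}
```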
The Challenges I Faced
Challenge 1: The Echo Problem
At first, the AI's voice was triggering the microphone, creating an infinite feedback loop. 😅
Solution: A speaker monitor that "subtracts" the audio output from the input stream. Now only your voice triggers the AI.
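A toy version of that subtraction: remove a delayed, scaled copy of what the app is playing from what the mic hears. Real echo cancellation estimates the delay and gain adaptively; here both are fixed assumptions for illustration:

```typescript
// Naive echo subtraction: out = mic - gain * playback[delayed].
function cancelEcho(
  mic: Float32Array,
  playback: Float32Array,
  delay: number, // samples between speaker output and mic pickup
  gain: number,  // echo loudness relative to playback
): Float32Array {
  const out = new Float32Array(mic.length);
  for (let i = 0; i < mic.length; i++) {
    const j = i - delay;
    const echo = j >= 0 && j < playback.length ? gain * playback[j] : 0;
    out[i] = mic[i] - echo;
  }
  return out;
}
```

With the echo removed, the VAD only fires on your voice, not the AI's.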
Challenge 2: Latency
3-4 second delays made conversations feel robotic and awkward.
Solution:
- Stream responses — start speaking as soon as the first tokens arrive
- Optimized audio buffers — balance latency vs. quality
- Added "aizuchi" sounds — those little "mm-hmm" fillers during thinking time
Challenge 3: Japanese Support
Most TTS voices sound robotic in Japanese. Japanese conversations also rely on subtle "aizuchi" backchannel sounds; without them, the exchange feels stilted.
Solution: Integrated VOICEVOX and Kokoro for natural prosody, and added optional aizuchi sounds during thinking.
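The aizuchi timing matters: one filler per long pause sounds natural, a filler every frame sounds broken. A hypothetical scheduler with a cooldown window (the sound list is an example, not the app's actual asset set):

```typescript
// Emit a backchannel sound at most once per cooldown while "thinking".
class AizuchiScheduler {
  private lastAt = -Infinity;
  private i = 0;
  private readonly sounds = ["うん", "ええ", "なるほど"]; // example set

  constructor(private readonly cooldownMs = 1500) {}

  // Call periodically while waiting on the LLM; returns a sound or null.
  maybeAizuchi(nowMs: number): string | null {
    if (nowMs - this.lastAt < this.cooldownMs) return null;
    this.lastAt = nowMs;
    return this.sounds[this.i++ % this.sounds.length];
  }
}
```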
Try It Out
Prerequisites
- Node.js 20+
- pnpm
- An OpenClaw gateway running locally
Quick Start
```shell
git clone https://github.com/coo-quack/talkative-lobster.git
cd talkative-lobster
pnpm install
pnpm dev
```
On first launch, the Settings modal walks you through connecting your gateway and choosing your preferred STT/TTS providers.
Downloads
Pre-built binaries are available.
What's Next
- More TTS providers (Azure, Google Cloud)
- Voice presets for different use cases
- Conversation history with search
- Custom wake words
- Mobile companion app
Final Thoughts
Voice interfaces are finally good enough to be useful — not just gimmicky. The combination of accurate speech recognition, fast LLMs, and natural text-to-speech creates something that feels genuinely conversational.
If you've ever wanted to just talk to your AI while doing something else — cooking, walking, debugging at 2 AM — give Talkative Lobster a try.
Built with 🦞 by the coo-quack team