I started skeptical. A voice AI with cloned voices, real-time, no app install — running on free API tiers? Seemed overly ambitious. But a few hours later, I had a working app. Here's the full breakdown.
TL;DR
Clone Talking is a web app for real-time voice conversations with AI persona clones. Open source. Runs on free API tiers.
- GitHub: https://github.com/MatheusSimonaci/clone-talking
- Demo: https://www.youtube.com/watch?v=Zdw1FfRfmJc
The Challenge
I wanted to build something ambitious: a system where you could talk to AI clones of anyone and get responses in their actual voice.
Requirements:
- Real-time voice processing
- Sub-second latency
- No app installation needed
- Ethical voice cloning
- Works with free/cheap API tiers
Traditionally, a project like this would mean days of debugging WebSocket issues, working around API rate limits, and wiring up voice synthesis. I wanted to see how fast it could actually be done.
The Tech Stack
Your Phone → Whisper (STT) → OpenRouter (LLM) → VoiSpark (TTS) → Your Ears
- Speech-to-Text: OpenAI Whisper
- LLM: OpenRouter (access to Claude, GPT-4, Llama, and more)
- Text-to-Speech + Voice Cloning: VoiSpark
- Transport: WebSocket (low latency, bidirectional)
- Infrastructure: Node.js + Express + ngrok
- Frontend: Next.js + TailwindCSS
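The transport layer is where most of the latency budget lives. A minimal sketch of the server-side audio buffering for one WebSocket connection — the framing convention here (binary audio chunks plus a JSON `{"type":"end"}` control frame) is my assumption, not taken from the repo:

```javascript
// Accumulates binary audio frames for one connection until the
// client signals the end of an utterance.
class UtteranceBuffer {
  constructor() {
    this.chunks = [];
  }

  // Binary WebSocket frame: append the raw audio bytes.
  push(chunk) {
    this.chunks.push(Buffer.from(chunk));
  }

  // "end" control frame: return the full utterance and reset
  // for the next one.
  flush() {
    const audio = Buffer.concat(this.chunks);
    this.chunks = [];
    return audio;
  }
}

// Example wiring with the `ws` package (handler names are hypothetical):
// const buf = new UtteranceBuffer();
// ws.on('message', (data, isBinary) => {
//   if (isBinary) buf.push(data);
//   else if (JSON.parse(data).type === 'end') handleUtterance(buf.flush());
// });
```

Keeping the buffer per-connection means several phones can talk to the server at once without interleaving each other's audio.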
How It Works
- You speak into your phone (no app install — just scan a QR code)
- Audio is streamed to the backend via WebSocket
- Whisper transcribes your speech to text
- OpenRouter sends the text to the chosen LLM with a persona prompt
- The LLM response is synthesized by VoiSpark in the cloned voice
- Audio is streamed back — you hear the answer in their voice
Total round-trip: sub-second latency.
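Steps 3–5 can be sketched as a pair of API calls. The Whisper and OpenRouter endpoints below are their documented public URLs; everything else — the function names, the model id, the audio format — is an illustrative assumption, not code from the repo:

```javascript
// Build the chat payload that OpenRouter forwards to the chosen model.
// The persona prompt goes in the system message.
function buildPersonaMessages(personaPrompt, userText) {
  return [
    { role: 'system', content: personaPrompt },
    { role: 'user', content: userText },
  ];
}

// Step 3: Whisper transcription (Node 18+ has fetch/FormData/Blob built in).
async function transcribe(audioBuffer) {
  const form = new FormData();
  form.append('model', 'whisper-1');
  form.append('file', new Blob([audioBuffer], { type: 'audio/webm' }), 'speech.webm');
  const res = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
    body: form,
  });
  return (await res.json()).text;
}

// Step 4: OpenRouter, using its OpenAI-compatible chat completions schema.
async function reply(personaPrompt, userText) {
  const res = await fetch('https://openrouter.ai/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'anthropic/claude-3.5-sonnet', // any OpenRouter model id works here
      messages: buildPersonaMessages(personaPrompt, userText),
    }),
  });
  return (await res.json()).choices[0].message.content;
}
```

Step 5 would hand the reply text to VoiSpark for synthesis in the cloned voice; I've left that call out because its API shape isn't public knowledge I can vouch for.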
How to Run It
```shell
git clone https://github.com/MatheusSimonaci/clone-talking
cd clone-talking
npm install
# Set your API keys in .env
npm start
# Open http://localhost:3000
# Scan the QR code from your phone
# Start talking
```
You need four free-tier credentials: an OpenAI API key (Whisper), an OpenRouter API key, a VoiSpark API key, and an ngrok auth token.
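The `.env` file referenced above might look like this — the variable names are my guesses, not taken from the repo, so check its `.env.example` for the real ones:

```shell
# Hypothetical .env layout; actual variable names may differ.
OPENAI_API_KEY=sk-...        # Whisper speech-to-text
OPENROUTER_API_KEY=sk-or-... # LLM routing
VOISPARK_API_KEY=...         # TTS + voice cloning
NGROK_AUTHTOKEN=...          # public tunnel so your phone can reach the server
```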
The Ethical Decision
Voice cloning is powerful — and risky. I made a deliberate choice to use a TTS provider that explicitly allows synthetic voice generation within their terms of service. I didn't want to build something cool while ignoring the ethics.
What's Next
- Custom voice training (upload your own voice sample)
- Multi-language support
- Conversation memory across sessions
- Integration with external knowledge bases
Contributions welcome. MIT License.