DEV Community

Matheus Simonaci Vieira

I Built a Real-Time Voice AI in 50 Minutes. Here's How (and Why)

I started skeptical. A voice AI with cloned voices, real-time, no app install — running on free API tiers? Seemed overly ambitious. But a few hours later, I had a working app. Here's the full breakdown.

TL;DR

Clone Talking is a web app for real-time voice conversations with AI persona clones. Open source. Runs on free API tiers.

The Challenge

I wanted to build something ambitious: a system where you could talk to AI clones of anyone and get responses in their actual voice.

Requirements:

  • Real-time voice processing
  • Sub-second latency
  • No app installation needed
  • Ethical voice cloning
  • Works with free/cheap API tiers

Traditionally, this would take days of debugging WebSocket issues, API rate limiting, and voice synthesis integration. I wanted to see how fast it could actually be done.

The Tech Stack

Your Phone → Whisper (STT) → OpenRouter (LLM) → VoiSpark (TTS) → Your Ears
  • Speech-to-Text: OpenAI Whisper
  • LLM: OpenRouter (access to Claude, GPT-4, Llama, and more)
  • Text-to-Speech + Voice Cloning: VoiSpark
  • Transport: WebSocket (low latency, bidirectional)
  • Infrastructure: Node.js + Express + ngrok
  • Frontend: Next.js + TailwindCSS
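Since the transport is a raw WebSocket, the client and server need some convention for telling audio apart from control traffic. A common pattern, and a minimal sketch of one plausible wire format (this is an illustrative assumption, not the project's documented protocol), is to send audio as binary frames and control messages as JSON text frames:

```javascript
// Illustrative framing helper: binary WebSocket frames carry raw audio,
// text frames carry JSON control messages (e.g. { type: "start" }).
// Function name and message shape are assumptions for this sketch.
function decodeFrame(frame) {
  if (Buffer.isBuffer(frame)) {
    return { kind: 'audio', payload: frame };
  }
  // Text frame: parse as a JSON control message.
  return { kind: 'control', payload: JSON.parse(frame) };
}
```

With the `ws` library, a handler would call something like `decodeFrame` on every `message` event and route audio chunks to the STT stage while control messages manage the session.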

How It Works

  1. You speak into your phone (no app install — just scan a QR code)
  2. Audio is streamed to the backend via WebSocket
  3. Whisper transcribes your speech to text
  4. OpenRouter sends the text to the chosen LLM with a persona prompt
  5. The LLM response is synthesized by VoiSpark in the cloned voice
  6. Audio is streamed back — you hear the answer in their voice
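Step 4 hinges on wrapping the transcript in a persona prompt before it reaches the LLM. A minimal sketch of that wrapping (the function name and prompt wording are assumptions, not the project's actual code):

```javascript
// Build the chat messages sent to OpenRouter: a persona system prompt,
// any prior conversation turns, then the new transcribed utterance.
function buildMessages(personaName, history, userText) {
  return [
    {
      role: 'system',
      content: `You are ${personaName}. Answer in their voice and style, ` +
               `and keep replies short - they will be spoken aloud.`,
    },
    ...history,                                  // earlier turns, if any
    { role: 'user', content: userText },         // fresh Whisper transcript
  ];
}
```

Keeping replies short in the system prompt matters here: every extra token the LLM emits is extra audio VoiSpark has to synthesize, which eats directly into the latency budget.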

Total round-trip: sub-second latency.
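The six steps above can be sketched as one per-utterance pipeline. In this sketch the three provider calls are injected as functions (names are illustrative, not the project's API), which keeps the providers swappable and makes the flow easy to test:

```javascript
// One conversational turn: STT -> LLM -> TTS, in order.
// `transcribe`, `complete`, and `synthesize` are injected async functions
// wrapping Whisper, OpenRouter, and VoiSpark respectively (an assumption
// of this sketch). A failure at any stage rejects and aborts the turn.
async function handleUtterance(audioChunk, { transcribe, complete, synthesize }, persona) {
  const text = await transcribe(audioChunk);       // step 3: Whisper STT
  const reply = await complete(persona, text);     // step 4: OpenRouter LLM
  const voiceAudio = await synthesize(reply);      // step 5: VoiSpark TTS
  return { text, reply, voiceAudio };              // step 6: stream back
}
```

Because the stages are sequential awaits, the round-trip latency is the sum of the three provider latencies plus transport, which is why each stage needs to stay in the low hundreds of milliseconds.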

How to Run It

git clone https://github.com/MatheusSimonaci/clone-talking
cd clone-talking
npm install
# Set your API keys in .env
npm start
# Open http://localhost:3000
# Scan the QR code from your phone
# Start talking

You need four free-tier credentials: an OpenAI API key (for Whisper), an OpenRouter API key, a VoiSpark API key, and an ngrok auth token.
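The `.env` file would then look roughly like this. The variable names are assumptions for illustration; check the repo's `.env.example` for the actual names it expects:

```shell
# Hypothetical .env layout - variable names may differ in the repo
OPENAI_API_KEY=sk-...        # Whisper speech-to-text
OPENROUTER_API_KEY=sk-or-... # LLM access (Claude, GPT-4, Llama, ...)
VOISPARK_API_KEY=...         # TTS + voice cloning
NGROK_AUTHTOKEN=...          # public tunnel so your phone can reach the server
```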

The Ethical Decision

Voice cloning is powerful — and risky. I made a deliberate choice to use a TTS provider that explicitly allows synthetic voice generation within their terms of service. I didn't want to build something cool while ignoring the ethics.

What's Next

  • Custom voice training (upload your own voice sample)
  • Multi-language support
  • Conversation memory across sessions
  • Integration with external knowledge bases

Contributions welcome. MIT License.

GitHub: https://github.com/MatheusSimonaci/clone-talking
