I started skeptical. A voice AI with cloned voices, real-time, no app install — running on free API tiers? Seemed overly ambitious. But a few hours later, I had a working app. Here's the full breakdown.
TL;DR
Clone Talking is a web app for real-time voice conversations with AI persona clones. Open source. Runs on free API tiers.
- GitHub: https://github.com/MatheusSimonaci/clone-talking
- Demo: https://www.youtube.com/watch?v=Zdw1FfRfmJc
The Challenge
I wanted to build something ambitious: a system where you could talk to AI clones of anyone and get responses in their actual voice.
Requirements:
- Real-time voice processing
- Sub-second latency
- No app installation needed
- Ethical voice cloning
- Works with free/cheap API tiers
Traditionally, a project like this would mean days of debugging WebSocket issues, working around API rate limits, and wiring up voice synthesis. I wanted to see how fast it could actually be done.
The Tech Stack
Your Phone → Whisper (STT) → OpenRouter (LLM) → VoiSpark (TTS) → Your Ears
- Speech-to-Text: OpenAI Whisper
- LLM: OpenRouter (access to Claude, GPT-4, Llama, and more)
- Text-to-Speech + Voice Cloning: VoiSpark
- Transport: WebSocket (low latency, bidirectional)
- Infrastructure: Node.js + Express + ngrok
- Frontend: Next.js + TailwindCSS
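The transport layer is where most of the latency budget lives. A minimal sketch of the server-side audio buffering for one WebSocket connection — the framing convention here (binary audio chunks plus a JSON `{"type":"end"}` control frame) is my assumption, not taken from the repo:

```javascript
// Accumulates binary audio frames for one connection until the
// client signals the end of an utterance.
class UtteranceBuffer {
  constructor() {
    this.chunks = [];
  }

  // Binary WebSocket frame: append the raw audio bytes.
  push(chunk) {
    this.chunks.push(Buffer.from(chunk));
  }

  // "end" control frame: return the full utterance and reset
  // for the next one.
  flush() {
    const audio = Buffer.concat(this.chunks);
    this.chunks = [];
    return audio;
  }
}

// Example wiring with the `ws` package (handler names are hypothetical):
// const buf = new UtteranceBuffer();
// ws.on('message', (data, isBinary) => {
//   if (isBinary) buf.push(data);
//   else if (JSON.parse(data).type === 'end') handleUtterance(buf.flush());
// });
```

Keeping the buffer per-connection means several phones can talk to the server at once without interleaving each other's audio.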
How It Works
- You speak into your phone (no app install — just scan a QR code)
- Audio is streamed to the backend via WebSocket
- Whisper transcribes your speech to text
- OpenRouter sends the text to the chosen LLM with a persona prompt
- The LLM response is synthesized by VoiSpark in the cloned voice
- Audio is streamed back — you hear the answer in their voice
Total round-trip: sub-second latency.
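Steps 3–5 can be sketched as a pair of API calls. The Whisper and OpenRouter endpoints below are their documented public URLs; everything else — the function names, the model id, the audio format — is an illustrative assumption, not code from the repo:

```javascript
// Build the chat payload that OpenRouter forwards to the chosen model.
// The persona prompt goes in the system message.
function buildPersonaMessages(personaPrompt, userText) {
  return [
    { role: 'system', content: personaPrompt },
    { role: 'user', content: userText },
  ];
}

// Step 3: Whisper transcription (Node 18+ has fetch/FormData/Blob built in).
async function transcribe(audioBuffer) {
  const form = new FormData();
  form.append('model', 'whisper-1');
  form.append('file', new Blob([audioBuffer], { type: 'audio/webm' }), 'speech.webm');
  const res = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
    body: form,
  });
  return (await res.json()).text;
}

// Step 4: OpenRouter, using its OpenAI-compatible chat completions schema.
async function reply(personaPrompt, userText) {
  const res = await fetch('https://openrouter.ai/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'anthropic/claude-3.5-sonnet', // any OpenRouter model id works here
      messages: buildPersonaMessages(personaPrompt, userText),
    }),
  });
  return (await res.json()).choices[0].message.content;
}
```

Step 5 would hand the reply text to VoiSpark for synthesis in the cloned voice; I've left that call out because its API shape isn't public knowledge I can vouch for.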
How to Run It
```shell
git clone https://github.com/MatheusSimonaci/clone-talking
cd clone-talking
npm install
# Set your API keys in .env
npm start
# Open http://localhost:3000
# Scan the QR code from your phone
# Start talking
```
You need four free-tier credentials: an OpenAI API key (Whisper), an OpenRouter API key, a VoiSpark API key, and an ngrok auth token.
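The `.env` file referenced above might look like this — the variable names are my guesses, not taken from the repo, so check its `.env.example` for the real ones:

```shell
# Hypothetical .env layout; actual variable names may differ.
OPENAI_API_KEY=sk-...        # Whisper speech-to-text
OPENROUTER_API_KEY=sk-or-... # LLM routing
VOISPARK_API_KEY=...         # TTS + voice cloning
NGROK_AUTHTOKEN=...          # public tunnel so your phone can reach the server
```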
The Ethical Decision
Voice cloning is powerful — and risky. I made a deliberate choice to use a TTS provider that explicitly allows synthetic voice generation within their terms of service. I didn't want to build something cool while ignoring the ethics.
What's Next
- Custom voice training (upload your own voice sample)
- Multi-language support
- Conversation memory across sessions
- Integration with external knowledge bases
Contributions welcome. MIT License.