Chatbots are so 2020. Let me show you what I built instead.
It's been ages since I last posted here. Hope y'all had a great Christmas! 🎄 Feels good to be back. ✌️
The Problem With Every AI Assistant Right Now
You know what's annoying? Typing.
Every AI tool out there wants you to type type type like it's 1995. And don't even get me started on the ones that "listen" but can't see what you're showing them.
So I asked myself: What if I built an AI that works like an actual conversation?
One that:
- 👀 Sees what you show it (camera feed)
- 👂 Hears you naturally (no push-to-talk nonsense)
- 🗣️ Responds with voice and perfectly synced lip movements
- 🎭 Expresses emotions through a 3D avatar
And runs 100% locally on your machine. No API keys bleeding your wallet dry.
Introducing TalkMateAI 🚀
TalkMateAI is a real-time, multimodal AI companion. You talk to it, show it things through your camera, and it responds with natural speech while a 3D avatar lip-syncs perfectly to every word.
It's like having a conversation with a character from a video game, except it's actually intelligent.
The Tech Stack (For My Fellow Nerds 🤓)
Backend (Python)
FastAPI + WebSockets – Real-time bidirectional communication
PyTorch + Flash Attention 2 – GPU go brrrrr
OpenAI Whisper (tiny) – Speech recognition
SmolVLM2-256M-Video-Instruct – Vision-language understanding
Kokoro TTS – Natural voice synthesis with word-level timing
Frontend (TypeScript)
Next.js 15 – Because Turbopack is fast af
Tailwind CSS + shadcn/ui – Pretty buttons
TalkingHead.js – 3D avatar with lip-sync magic
Web Audio API + AudioWorklet – Low-latency audio processing
Native WebSocket – None of that socket.io bloat
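Since there's no socket.io layer, the connection wiring on the frontend is tiny. Here's a minimal sketch of what it can look like; the /ws path and the message shapes are my assumptions for illustration, not the exact protocol in the repo:

// Sketch: native WebSocket client for the FastAPI backend.
// The "/ws" path and message shapes are illustrative assumptions.
const socket = new WebSocket("ws://localhost:8000/ws");
socket.binaryType = "arraybuffer";

socket.addEventListener("open", () => {
  console.log("Connected to the TalkMateAI backend");
});

socket.addEventListener("message", (event) => {
  if (typeof event.data === "string") {
    // JSON frames: transcription, word timings, avatar cues, etc.
    console.log("Server message:", JSON.parse(event.data));
  } else {
    // Binary frames: synthesized audio from the TTS stage
    console.log("Audio bytes:", (event.data as ArrayBuffer).byteLength);
  }
});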
How It Actually Works
Here's the flow:
You speak ↓
VAD detects speech ↓
Audio (+ camera frame if enabled) sent via WebSocket ↓
Whisper transcribes ↓
SmolVLM2 understands text + image together ↓
Generates response ↓
Kokoro synthesizes speech with timing data ↓
Audio + lip-sync data sent back ↓
3D avatar speaks with perfect sync
All of this happens in real-time.
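To make that flow concrete, here's a rough sketch of the send side of one turn: the recorded audio plus an optional camera frame go out over the socket, and the reply comes back through the socket's message handler. The framing (a small JSON header followed by binary payloads) is a hypothetical protocol for illustration, not necessarily how the repo does it:

// Sketch: shipping one utterance to the backend.
// The header-then-binary framing is a hypothetical protocol, not the repo's.
async function sendUtterance(
  socket: WebSocket,
  audio: Blob,
  frame?: Blob
): Promise<void> {
  socket.send(JSON.stringify({ type: "utterance", hasImage: Boolean(frame) }));
  socket.send(await audio.arrayBuffer());   // raw audio bytes
  if (frame) {
    socket.send(await frame.arrayBuffer()); // optional camera snapshot
  }
}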
The Secret Sauce: Native Word Timing 🎯
Most TTS solutions give you audio and that's it. You're left guessing when each word starts for lip-sync.
Kokoro TTS gives you word-level timing data out of the box:
const speakData = {
  audio: audioBuffer,
  words: ["Hello", "world"],
  wtimes: [0.0, 0.5],      // when each word starts
  wdurations: [0.4, 0.6]   // how long each word lasts
};

// TalkingHead uses this for pixel-perfect lip sync
headRef.current.speakAudio(speakData);
The result? Lips that move exactly when they should. No uncanny valley weirdness.
Voice Activity Detection That Actually Works
I didn't want push-to-talk. I wanted natural conversation flow.
So I built a custom VAD using the Web Audio API's AudioWorklet. It calculates energy levels in real time and tracks speech frames versus silence frames, all on the frontend, so the backend never wastes cycles just figuring out when you're talking.
You just... talk. When you pause naturally, it processes. When you keep talking, it waits.
It respects conversational flow.
⚠️ Heads up: This version doesn't support barge-in (interrupting the avatar mid-speech) or sophisticated turn-taking detection. It's purely pause-based - you talk, pause, it responds.
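For the curious, an energy-based VAD inside an AudioWorkletProcessor is genuinely small. This is only a sketch of the idea; the threshold and frame counts are made-up numbers, not the values TalkMateAI ships with:

// Sketch: energy-based VAD running in the AudioWorklet thread.
// Threshold (0.01) and silence window (300 frames) are illustrative only.
class EnergyVAD extends AudioWorkletProcessor {
  private speechFrames = 0;
  private silenceFrames = 0;

  process(inputs: Float32Array[][]): boolean {
    const channel = inputs[0]?.[0];
    if (!channel) return true;

    // RMS energy of the current 128-sample render quantum
    let sum = 0;
    for (const sample of channel) sum += sample * sample;
    const rms = Math.sqrt(sum / channel.length);

    if (rms > 0.01) {
      this.speechFrames += 1;
      this.silenceFrames = 0;
    } else if (this.speechFrames > 0) {
      this.silenceFrames += 1;
      // ~300 quiet frames (~0.8 s at 48 kHz) after speech: end of turn
      if (this.silenceFrames > 300) {
        this.port.postMessage({ type: "speech-end" });
        this.speechFrames = 0;
        this.silenceFrames = 0;
      }
    }
    return true; // keep the processor alive
  }
}

registerProcessor("energy-vad", EnergyVAD);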
The Vision Component 👁️
Here's where it gets spicy. The camera isn't just for show.
When enabled, every audio segment gets sent with a camera snapshot. SmolVLM2 processes both together - the audio transcription AND what it sees.
You can literally say "What am I holding?" and it'll tell you.
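The snapshot itself is plain Web API territory: draw the live video element onto a canvas and encode it as a JPEG. A sketch of that, assuming a video element that's already playing your getUserMedia stream:

// Sketch: grab one camera frame as a JPEG blob to send with the audio.
// Assumes `video` is an HTMLVideoElement already playing a getUserMedia stream.
async function captureFrame(video: HTMLVideoElement): Promise<Blob | null> {
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d");
  if (!ctx) return null;
  ctx.drawImage(video, 0, 0);
  // JPEG at moderate quality keeps the per-utterance payload small
  return new Promise((resolve) => canvas.toBlob(resolve, "image/jpeg", 0.8));
}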
Running It Yourself
Prerequisites
- Node.js 20+
- Python 3.10
- NVIDIA GPU - ~4GB+ VRAM should work (I used an RTX 3070 8GB, but the models are lightweight: Whisper tiny + SmolVLM2-256M + Kokoro TTS)
- PNPM & UV package managers
Setup
# Clone it
git clone https://github.com/kiranbaby14/TalkMateAI.git
cd TalkMateAI
# Install everything
pnpm run monorepo-setup
# Run both frontend and backend
pnpm dev
Frontend: http://localhost:3000
Backend: http://localhost:8000
What Can You Build With This?
This is open source. Fork it. Break it. Make it weird.
Some ideas:
- 🎓 Language tutors that watch your pronunciation
- 🎨 Creative companions that see your art and give feedback
- 🖥️ Screen assistants - combine with Screenpipe for an AI that knows what you've been doing
The Code Is Yours
GitHub: github.com/kiranbaby14/TalkMateAI
🛠️ Fair warning: This was a curiosity-driven project, not a polished product. There are rough edges, things I'd do differently now, and probably bugs I haven't found yet. But that's the fun of open source, right? Dig in, break stuff, make it better.
Star it ⭐ if you think chatbots should evolve.
Shoutouts 🙏
Big thanks to met4citizen for the incredible TalkingHead library. The 3D avatar rendering and lip-sync magic? That's all their work. I just plugged it in and fed it audio + timing data. Absolute legend.
What Would You Build?
Seriously, drop a comment. I want to know what wild ideas you have for real-time multimodal AI.
AI that sees + hears + responds naturally? That's not the future anymore.
That's right now. And you can run it on your GPU.
Built with ❤️ and probably too much caffeine by @kiranbaby14