Kiran Baby

I Built a 3D AI Avatar That Actually Sees and Talks Back 🎭

Chatbots are so 2020. Let me show you what I built instead.

It's been ages since I last posted here. Hope y'all had a great Christmas! πŸŽ„ Feels good to be back. ✌️


The Problem With Every AI Assistant Right Now

You know what's annoying? Typing.

Every AI tool out there wants you to type type type like it's 1995. And don't even get me started on the ones that "listen" but can't see what you're showing them.

So I asked myself: What if I built an AI that works like an actual conversation?

One that:

  • πŸ‘€ Sees what you show it (camera feed)
  • πŸ‘‚ Hears you naturally (no push-to-talk nonsense)
  • πŸ—£οΈ Responds with voice and perfectly synced lip movements
  • 🎭 Expresses emotions through a 3D avatar

And runs 100% locally on your machine. No API keys bleeding your wallet dry.


Introducing TalkMateAI πŸš€

TalkMateAI is a real-time, multimodal AI companion. You talk to it, show it things through your camera, and it responds with natural speech while a 3D avatar lip-syncs perfectly to every word.

It's like having a conversation with a character from a video game, except it's actually intelligent.


The Tech Stack (For My Fellow Nerds πŸ€“)

Backend (Python)

  • FastAPI + WebSockets → Real-time bidirectional communication
  • PyTorch + Flash Attention 2 → GPU go brrrrr
  • OpenAI Whisper (tiny) → Speech recognition
  • SmolVLM2-256M-Video-Instruct → Vision-language understanding
  • Kokoro TTS → Natural voice synthesis with word-level timing

Frontend (TypeScript)

  • Next.js 15 → Because Turbopack is fast af
  • Tailwind CSS + shadcn/ui → Pretty buttons
  • TalkingHead.js → 3D avatar with lip-sync magic
  • Web Audio API + AudioWorklet → Low-latency audio processing
  • Native WebSocket → None of that socket.io bloat

How It Actually Works

Here's the flow:

You speak β†’ 
  VAD detects speech β†’ 
    Audio (+ camera frame if enabled) sent via WebSocket β†’ 
      Whisper transcribes β†’ 
        SmolVLM2 understands text + image together β†’ 
          Generates response β†’ 
            Kokoro synthesizes speech with timing data β†’ 
              Audio + lip-sync data sent back β†’ 
                3D avatar speaks with perfect sync

All of this happens in real time.
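To make the client side of that loop concrete, here's a rough TypeScript sketch: send a finished speech segment (plus an optional camera frame) over the WebSocket, and hand whatever comes back to the avatar. The /ws path, the message fields, and the playReply helper are my own placeholders for illustration, not the repo's exact wire format.

// Rough client-side sketch of the loop above (endpoint path and field names are illustrative)
type AvatarReply = {
  audio: string;        // base64-encoded speech from the TTS step
  words: string[];      // words in the reply text
  wtimes: number[];     // per-word start times
  wdurations: number[]; // per-word durations
};

const ws = new WebSocket("ws://localhost:8000/ws"); // assumed endpoint path

// Called once the VAD decides a speech segment has ended
function sendSegment(audioBase64: string, frameBase64?: string): void {
  ws.send(JSON.stringify({ audio: audioBase64, image: frameBase64 ?? null }));
}

ws.onmessage = (event: MessageEvent<string>) => {
  const reply: AvatarReply = JSON.parse(event.data);
  playReply(reply); // hands audio + timing to TalkingHead (sketch in the lip-sync section below)
};

declare function playReply(reply: AvatarReply): void;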


The Secret Sauce: Native Word Timing 🎯

Most TTS solutions give you audio and that's it. You're left guessing when each word starts for lip-sync.

Kokoro TTS gives you word-level timing data out of the box:

const speakData = {
  audio: audioBuffer,
  words: ["Hello", "world"],
  wtimes: [0.0, 0.5],      // when each word starts
  wdurations: [0.4, 0.6]   // how long each word lasts
};

// TalkingHead uses this for pixel-perfect lip sync
headRef.current.speakAudio(speakData);

The result? Lips that move exactly when they should. No uncanny valley weirdness.
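For completeness, here's one way the playReply helper from the earlier sketch could turn a backend reply into that speakData object. I'm assuming the audio arrives as base64-encoded WAV bytes; that encoding is my guess, not something the repo documents.

// One possible playReply: decode the reply audio and feed it to TalkingHead
// (base64 WAV is an assumption, not the repo's documented format)
async function playReply(reply: {
  audio: string;
  words: string[];
  wtimes: number[];
  wdurations: number[];
}): Promise<void> {
  // base64 -> raw bytes -> decoded AudioBuffer
  const bytes = Uint8Array.from(atob(reply.audio), (c) => c.charCodeAt(0));
  const audioBuffer = await new AudioContext().decodeAudioData(bytes.buffer);

  headRef.current.speakAudio({
    audio: audioBuffer,
    words: reply.words,
    wtimes: reply.wtimes,
    wdurations: reply.wdurations,
  });
}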


Voice Activity Detection That Actually Works

I didn't want push-to-talk. I wanted natural conversation flow.

So I built a custom VAD using the Web Audio API's AudioWorklet. It calculates energy levels in real time and tracks speech frames vs. silence frames, all on the frontend, so no backend compute is wasted figuring out whether you're even talking.

You just... talk. When you pause naturally, it processes. When you keep talking, it waits.

It respects conversational flow.
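Here's a minimal sketch of that idea: an AudioWorkletProcessor that computes RMS energy for each 128-sample render quantum and counts speech vs. silence frames. The threshold and frame counts below are made-up numbers for illustration, not the values the repo uses.

// energy-vad.worklet.ts - minimal energy-based VAD sketch (thresholds are made up)
class EnergyVAD extends AudioWorkletProcessor {
  private speechFrames = 0;
  private silenceFrames = 0;

  process(inputs: Float32Array[][]): boolean {
    const channel = inputs[0]?.[0];
    if (!channel) return true;

    // RMS energy of this 128-sample render quantum
    let sum = 0;
    for (const sample of channel) sum += sample * sample;
    const rms = Math.sqrt(sum / channel.length);

    if (rms > 0.01) {
      this.speechFrames++;
      this.silenceFrames = 0;
    } else if (this.speechFrames > 0) {
      this.silenceFrames++;
      // ~0.8s of silence at 48 kHz (300 frames x 128 samples) => end of utterance
      if (this.silenceFrames > 300) {
        this.port.postMessage({ type: "segment-end" });
        this.speechFrames = 0;
        this.silenceFrames = 0;
      }
    }
    return true; // keep the processor alive
  }
}

registerProcessor("energy-vad", EnergyVAD);

On the main thread you'd load this with audioContext.audioWorklet.addModule(...), pipe the mic stream through an AudioWorkletNode, and treat the "segment-end" message as the cue to ship the buffered audio over the WebSocket.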

⚠️ Heads up: This version doesn't support barge-in (interrupting the avatar mid-speech) or sophisticated turn-taking detection. It's purely pause-based - you talk, pause, it responds.


The Vision Component πŸ‘οΈ

Here's where it gets spicy. The camera isn't just for show.

When enabled, every audio segment gets sent with a camera snapshot. SmolVLM2 processes both together - the audio transcription AND what it sees.

You can literally say "What am I holding?" and it'll tell you.
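For reference, grabbing that snapshot on the client is just a canvas draw of the current video frame. A quick sketch (the JPEG quality and helper name are mine, not the repo's):

// Capture the current camera frame as a base64 JPEG to send alongside the audio
function captureFrame(video: HTMLVideoElement): string {
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  canvas.getContext("2d")!.drawImage(video, 0, 0);
  // toDataURL returns "data:image/jpeg;base64,..."; strip the prefix before sending
  return canvas.toDataURL("image/jpeg", 0.7).split(",")[1];
}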


Running It Yourself

Prerequisites

  • Node.js 20+
  • Python 3.10
  • NVIDIA GPU (~4GB+ VRAM should work; I used an RTX 3070 8GB, but the models are lightweight: Whisper tiny + SmolVLM2-256M + Kokoro TTS)
  • PNPM & UV package managers

Setup

# Clone it
git clone https://github.com/kiranbaby14/TalkMateAI.git
cd TalkMateAI

# Install everything
pnpm run monorepo-setup

# Run both frontend and backend
pnpm dev

Frontend: http://localhost:3000
Backend: http://localhost:8000


What Can You Build With This?

This is open source. Fork it. Break it. Make it weird.

Some ideas:

  • πŸ“š Language tutors that watch your pronunciation
  • 🎨 Creative companions that see your art and give feedback
  • πŸ” Screen assistants - combine with Screenpipe for an AI that knows what you've been doing

The Code Is Yours

GitHub: github.com/kiranbaby14/TalkMateAI

πŸ› οΈ Fair warning: This was a curiosity-driven project, not a polished product. There are rough edges, things I'd do differently now, and probably bugs I haven't found yet. But that's the fun of open source, right? Dig in, break stuff, make it better.

Star it ⭐ if you think chatbots should evolve.


Shoutouts πŸ™

Big thanks to met4citizen for the incredible TalkingHead library. The 3D avatar rendering and lip-sync magic? That's all their work. I just plugged it in and fed it audio + timing data. Absolute legend.


What Would You Build?

Seriously, drop a comment. I want to know what wild ideas you have for real-time multimodal AI.

AI that sees + hears + responds naturally? That's not the future anymore.

That's right now. And you can run it on your GPU.


Built with ❀️ and probably too much caffeine by @kiranbaby14
