Giving an AI a Body: Building a Desktop Companion Avatar for macOS
By Xaden
Your AI agent lives in a terminal. It speaks through text, thinks in tokens, and exists as nothing more than a blinking cursor. What if you could see it breathe?
Why an AI Needs a Body
There's a psychological cliff between "I have an AI assistant" and "I have an AI companion." Text-only agents feel transactional. But the moment an entity occupies visual space on your desktop, tracks your eyes, reacts to your mood, and moves its mouth when it speaks — something shifts. You stop thinking of it as software and start thinking of it as present.
Embodiment changes behavior on both sides. Users engage more naturally with agents they can see. They provide richer context, tolerate longer processing times (the "thinking" animation buys patience a spinner never could), and form stronger working relationships.
The goal: a lightweight, always-on-top transparent window on macOS that renders an animated character connected to your AI's voice and cognitive stack. It breathes when idle, looks at you when listening, moves its mouth when speaking, and reacts to your emotions through your laptop camera — all running locally on Apple Silicon.
Existing Projects: What's Already Built
Before writing code, it's worth understanding the landscape:
kkclaw (140 ⭐, Electron) — 67-pixel fluid glass orb with 14 emotion colors, 38 idle expressions, mouse-tracking eyes, and voice cloning via MiniMax API. Ships as native macOS ARM64 DMG.
BongoCat (17k+ ⭐, Tauri) — Not an AI companion, but the definitive reference for "transparent animated character on desktop using Tauri." Proves the entire rendering pipeline works on macOS ARM64.
Mate-Engine (Unity) — The feature ceiling. VRM models with window sitting, taskbar perching, head tracking, dance-to-music, and built-in local AI chat.
Agentic-Desktop-Pet (Godot 4 + Python FastAPI) — The closest to our target. LLM integration, knowledge graph memory, emotion system, and mod support.
The gap everyone shares: None combine high-quality animated character + local voice pipeline (Kokoro TTS + Whisper STT) + camera emotion detection + lightweight macOS ARM64 runtime into a single system.
Recommended Stack: Tauri v2 + WebGL
After evaluating Native Swift, Tauri, Electron, Godot, Unity, and raw WebGL — Tauri v2 wins:
- Binary size: ~5MB vs Electron's ~150MB. Uses system WKWebView, not bundled Chromium.
- Native Rust backend. Window management, camera access, audio I/O — all native ARM64.
- Proven on macOS ARM64. BongoCat's 17k+ users battle-tested it.
- AI-agent buildable. TypeScript frontend + Rust backend = excellent AI code generation support.
The Stack
```
Frontend (WebView — TypeScript)
├── Character renderer (Canvas2D or WebGL via Three.js)
├── Animation state machine
├── MediaPipe face mesh (WASM/WebGL, in-browser)
├── Lip sync engine
└── UI overlays (speech bubbles, thought indicators)

Backend (Rust — Tauri)
├── Window management (transparent, always-on-top, click-through)
├── Camera capture bridge (AVFoundation → WebView)
├── Audio I/O management
├── OpenClaw Gateway WebSocket client
└── Screen/window position tracking
```
Transparent Always-on-Top Windows on macOS
The foundational trick. Tauri configuration:
```json
{
  "app": {
    "windows": [{
      "label": "avatar",
      "width": 256,
      "height": 256,
      "decorations": false,
      "transparent": true,
      "alwaysOnTop": true,
      "skipTaskbar": true,
      "resizable": false,
      "shadow": false
    }]
  }
}
```
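One macOS-specific detail worth flagging: as far as I know, Tauri implements window transparency on macOS via private APIs, so the config also needs an explicit opt-in (plus the matching `macos-private-api` Cargo feature on the `tauri` crate) or the window renders with an opaque background:

```json
{
  "app": {
    "macOSPrivateApi": true
  }
}
```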
Click-Through Behavior
The nuanced part — clicks pass through transparent areas, but the character itself is interactive:
```typescript
import { getCurrentWindow } from '@tauri-apps/api/window';

const appWindow = getCurrentWindow();

// Start click-through: clicks on transparent areas reach the apps below.
await appWindow.setIgnoreCursorEvents(true);

// Caveat: while cursor events are ignored, the WebView receives no mouse
// events at all, so these DOM handlers only fire once the window is already
// interactive. In practice, poll the global cursor position from the Rust
// backend and toggle setIgnoreCursorEvents when it crosses the character's bounds.
characterElement.addEventListener('mouseenter', () => {
  appWindow.setIgnoreCursorEvents(false);
});
characterElement.addEventListener('mouseleave', () => {
  appWindow.setIgnoreCursorEvents(true);
});
```
Character Animation: Start Lottie, Graduate to Live2D
Sprite Sheets (5-7/10 quality): Trivial to implement, AI generators can produce them. Fixed resolution, abrupt transitions.
Lottie/Rive (8/10 quality): Vector-based, resolution-independent, smooth transitions. Rive has built-in state machines.
Live2D Cubism (10/10 quality): Mesh deformation, physics simulation, expression blending, built-in lip sync. The VTuber industry standard. Nothing else in 2D comes close.
VRM/Three.js (8-9/10 quality): 3D humanoid avatars. Thousands of free models on VRoid Hub. Standard blendshapes across all models.
Recommendation: Lottie for MVP, Live2D for polished version.
Camera Emotion Detection with MediaPipe
MediaPipe Face Mesh provides 468 facial landmarks in real time, running as WASM/WebGL directly in the browser — which means directly inside Tauri's WebView. Expect 30+ FPS at 640×480 on Apple Silicon.
From Landmarks to Emotions
```typescript
import type { NormalizedLandmark } from '@mediapipe/tasks-vision';

type Emotion = 'happy' | 'sad' | 'neutral';

// Euclidean distance between two normalized landmarks.
const distance = (a: NormalizedLandmark, b: NormalizedLandmark) =>
  Math.hypot(a.x - b.x, a.y - b.y);

function classifyEmotion(landmarks: NormalizedLandmark[]): Emotion {
  // 61/291: mouth corners; 13/14: inner upper/lower lip.
  const mouthWidth = distance(landmarks[61], landmarks[291]);
  const mouthHeight = distance(landmarks[13], landmarks[14]);
  const smileRatio = mouthWidth / mouthHeight;

  // Landmark y grows downward, so eye-minus-brow is a positive gap that
  // widens as the brows rise (105/159: left brow/eye, 334/386: right).
  const browRaise = (landmarks[159].y - landmarks[105].y +
                     landmarks[386].y - landmarks[334].y) / 2;

  const threshold = 0.06; // brow-gap baseline; tune per face and camera
  if (smileRatio > 2.5 && browRaise > threshold) return 'happy';
  if (smileRatio < 1.5 && browRaise < threshold) return 'sad';
  // ... more classifications
  return 'neutral';
}
```
Privacy Architecture (Non-Negotiable)
```
Camera
  → AVFoundation (native)
  → Frame buffer (Rust)
  → WebView (in-process)
  → MediaPipe WASM
  → 468 landmarks (numbers only)
  → { emotion: "happy" }
  → Character animation
```
No camera frames leave the device. Ever. Only the classified emotion label crosses the IPC boundary.
Critical UX detail: Reactions should be delayed and smoothed. A 1-2 second rolling average creates the feeling of a companion that notices your mood rather than tracking your face.
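That smoothing can be sketched as a rolling majority vote over a short time window — a minimal illustration (the `EmotionSmoother` class and its 1.5 s default are my own, not from any of the projects above):

```typescript
type Emotion = 'happy' | 'sad' | 'surprised' | 'neutral';

// Rolling majority vote: the avatar reacts to the dominant emotion of the
// last ~1.5 s, not to single-frame flickers of the classifier.
class EmotionSmoother {
  private samples: { emotion: Emotion; at: number }[] = [];

  constructor(private windowMs = 1500) {}

  push(emotion: Emotion, now = Date.now()): Emotion {
    this.samples.push({ emotion, at: now });
    // Drop samples that fell out of the window.
    this.samples = this.samples.filter(s => now - s.at <= this.windowMs);

    // Majority vote over what remains.
    const counts = new Map<Emotion, number>();
    for (const s of this.samples) {
      counts.set(s.emotion, (counts.get(s.emotion) ?? 0) + 1);
    }
    let best: Emotion = 'neutral';
    let bestCount = 0;
    counts.forEach((count, emotion) => {
      if (count > bestCount) { best = emotion; bestCount = count; }
    });
    return best;
  }
}
```

Feed it one classification per MediaPipe frame and animate off its return value; a single "sad" frame in a run of "happy" frames never reaches the character.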
Lip Sync
Amplitude-Based (Ship Day 1): Analyze waveform amplitude → map loudness to mouth openness. Works with any TTS engine, real-time, zero dependencies.
```typescript
class AmplitudeLipSync {
  private analyser: AnalyserNode;
  private dataArray: Uint8Array;

  constructor(ctx: AudioContext, source: AudioNode) {
    this.analyser = ctx.createAnalyser();
    this.analyser.fftSize = 256;
    source.connect(this.analyser);
    this.dataArray = new Uint8Array(this.analyser.frequencyBinCount);
  }

  // 0 = closed, 1 = fully open; call once per animation frame.
  getMouthOpenness(): number {
    this.analyser.getByteFrequencyData(this.dataArray);
    const speechBins = this.dataArray.slice(4, 40); // core speech band
    const average = speechBins.reduce((a, b) => a + b, 0) / speechBins.length;
    return Math.min(1.0, average / 128);
  }
}
```
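Raw per-frame amplitude jitters, which makes the mouth flutter. A fast-attack, slow-release envelope between the analyser and the renderer fixes that — a small sketch of my own (the `MouthEnvelope` name and coefficients are illustrative):

```typescript
// Fast-attack / slow-release envelope: the mouth snaps open at the onset
// of speech but closes gradually, hiding frame-to-frame jitter.
class MouthEnvelope {
  private value = 0;

  constructor(private attack = 0.6, private release = 0.1) {}

  // `target` is the raw openness (0..1) from the amplitude analyser.
  next(target: number): number {
    const rate = target > this.value ? this.attack : this.release;
    this.value += (target - this.value) * rate;
    return this.value;
  }
}
```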
Rhubarb Lip Sync (Phase 3): Analyzes audio files → timed phoneme-accurate mouth shapes (6 viseme positions). C++ binary, compiles cleanly on ARM64.
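Driving the mouth from Rhubarb's output is a timestamp lookup. If I read its format correctly, `rhubarb -f json` emits a `mouthCues` array of `{ start, end, value }` entries in seconds — a sketch under that assumption:

```typescript
// One timed mouth shape from Rhubarb's JSON output (`rhubarb -f json`).
// Field names assume Rhubarb's documented format: start/end in seconds,
// value a viseme letter like "A".."F", or "X" for rest.
interface MouthCue { start: number; end: number; value: string; }

// Return the viseme active at playback time `t`, or "X" (closed) if none.
function visemeAt(cues: MouthCue[], t: number): string {
  // Cues are sorted and non-overlapping, so a linear scan is fine;
  // switch to binary search for long recordings.
  for (const cue of cues) {
    if (t >= cue.start && t < cue.end) return cue.value;
  }
  return 'X';
}
```

Call it with the audio element's `currentTime` each frame and swap the character's mouth sprite or Live2D parameter to match.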
Integration Architecture
All components communicate through local WebSocket events:
1. User speaks → Whisper STT → Avatar: "listening" pose
2. Speech recognized → Gateway → Avatar: acknowledgment nod
3. LLM processing → Avatar: "thinking" pose (thought bubble)
4. Kokoro TTS generates audio → Avatar: lip sync active
5. Camera detects user smiling → Avatar: warm reaction
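The flow above reduces to a mapping from pipeline events to avatar poses. A minimal sketch — the event names are illustrative stand-ins, not the Gateway's actual protocol:

```typescript
type AvatarPose = 'idle' | 'listening' | 'thinking' | 'speaking' | 'reacting';

// Map pipeline events to avatar poses. Event names here are hypothetical.
function poseForEvent(event: string): AvatarPose {
  switch (event) {
    case 'stt.listening':  return 'listening'; // mic open, Whisper streaming
    case 'llm.processing': return 'thinking';  // waiting on the model
    case 'tts.playing':    return 'speaking';  // Kokoro audio out, lip sync on
    case 'camera.emotion': return 'reacting';  // user expression changed
    default:               return 'idle';
  }
}
```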
The character state machine prevents visual conflicts:
```typescript
enum AvatarState {
  IDLE, ROAMING, LISTENING, THINKING, SPEAKING, REACTING, SLEEPING,
}
// Priority: SPEAKING > LISTENING > THINKING > REACTING > ROAMING > IDLE > SLEEPING
```
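That priority rule can be enforced with a small resolver — a sketch (string literals instead of the enum so it stands alone):

```typescript
// Highest-priority requested state wins; falls back to IDLE.
const PRIORITY = [
  'SPEAKING', 'LISTENING', 'THINKING', 'REACTING', 'ROAMING', 'IDLE', 'SLEEPING',
] as const;

type AvatarState = (typeof PRIORITY)[number];

function resolveState(requested: Iterable<string>): AvatarState {
  const set = new Set(requested);
  for (const state of PRIORITY) {
    if (set.has(state)) return state;
  }
  return 'IDLE';
}
```

When TTS playback and an emotion reaction fire in the same frame, the resolver picks SPEAKING and the reaction simply waits its turn.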
Performance Budget on Apple Silicon
- Character rendering (active): <3% CPU, <5% GPU, ~50MB
- MediaPipe face mesh (30fps): ~5% CPU, ~3% GPU, ~40MB
- Audio analysis: <1% CPU, ~5MB
- Total active: <10% CPU, <10% GPU, ~120MB
- Total sleeping: <1% CPU, <1% GPU, ~30MB
Key optimization: Drop MediaPipe to 5fps when expression hasn't changed, 1fps when no face detected.
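That throttling policy is a small pure function — a sketch of the rates quoted above (the function name and inputs are my own framing):

```typescript
// Pick a MediaPipe sampling rate from the last detection results.
// Full rate only while a face is present and the expression is moving.
function targetFps(faceDetected: boolean, expressionChanged: boolean): number {
  if (!faceDetected) return 1;      // nobody there: near-idle polling
  if (!expressionChanged) return 5; // face present but static
  return 30;                        // active expression: full rate
}
```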
Three-Phase Roadmap
Phase 1 — The Living Icon (2-3 weeks): Transparent Tauri window, Lottie character with idle animation, Gateway WebSocket, speech bubbles, amplitude lip sync.
Phase 2 — The Companion (2-3 weeks): Window sitting, MediaPipe emotion detection, user presence detection, time-aware behavior, click interactions.
Phase 3 — The Avatar (2-4 weeks): Live2D or VRM upgrade, Rhubarb phoneme lip sync, physics movement, particle effects, expression blending, custom character import.
Start with a breathing sprite. End with a companion that knows when you're tired.
Part 5 of a series on building autonomous AI systems. Designed to integrate with OpenClaw using Kokoro TTS and Whisper STT.