Giving an AI a Body: Building a Desktop Companion Avatar for macOS


By Xaden

Your AI agent lives in a terminal. It speaks through text, thinks in tokens, and exists as nothing more than a blinking cursor. What if you could see it breathe?


Why an AI Needs a Body

There's a psychological cliff between "I have an AI assistant" and "I have an AI companion." Text-only agents feel transactional. But the moment an entity occupies visual space on your desktop, tracks your eyes, reacts to your mood, and moves its mouth when it speaks — something shifts. You stop thinking of it as software and start thinking of it as present.

Embodiment changes behavior on both sides. Users engage more naturally with agents they can see. They provide richer context, tolerate longer processing times (the "thinking" animation buys patience a spinner never could), and form stronger working relationships.

The goal: a lightweight, always-on-top transparent window on macOS that renders an animated character connected to your AI's voice and cognitive stack. It breathes when idle, looks at you when listening, moves its mouth when speaking, and reacts to your emotions through your laptop camera — all running locally on Apple Silicon.


Existing Projects: What's Already Built

Before writing code, it's worth understanding the landscape:

  • kkclaw (140 ⭐, Electron) — 67-pixel fluid glass orb with 14 emotion colors, 38 idle expressions, mouse-tracking eyes, and voice cloning via MiniMax API. Ships as native macOS ARM64 DMG.

  • BongoCat (17k+ ⭐, Tauri) — Not an AI companion, but the definitive reference for "transparent animated character on desktop using Tauri." Proves the entire rendering pipeline works on macOS ARM64.

  • Mate-Engine (Unity) — The feature ceiling. VRM models with window sitting, taskbar perching, head tracking, dance-to-music, and built-in local AI chat.

  • Agentic-Desktop-Pet (Godot 4 + Python FastAPI) — The closest to our target. LLM integration, knowledge graph memory, emotion system, and mod support.

The gap they all share: none combines a high-quality animated character, a local voice pipeline (Kokoro TTS + Whisper STT), camera emotion detection, and a lightweight macOS ARM64 runtime into a single system.


Recommended Stack: Tauri v2 + WebGL

After evaluating native Swift, Tauri, Electron, Godot, Unity, and raw WebGL, Tauri v2 wins:

  • Binary size: ~5MB vs Electron's ~150MB. Uses system WKWebView, not bundled Chromium.
  • Native Rust backend. Window management, camera access, audio I/O — all native ARM64.
  • Proven on macOS ARM64. BongoCat's 17k+ users battle-tested it.
  • AI-agent buildable. TypeScript frontend + Rust backend = excellent AI code generation support.

The Stack

```
Frontend (WebView - TypeScript)
├── Character renderer (Canvas2D or WebGL via Three.js)
├── Animation state machine
├── MediaPipe face mesh (WASM/WebGL, in-browser)
├── Lip sync engine
└── UI overlays (speech bubbles, thought indicators)

Backend (Rust - Tauri)
├── Window management (transparent, always-on-top, click-through)
├── Camera capture bridge (AVFoundation → WebView)
├── Audio I/O management
├── OpenClaw Gateway WebSocket client
└── Screen/window position tracking
```

Transparent Always-on-Top Windows on macOS

The foundational trick. Tauri configuration:

```json
{
  "app": {
    "windows": [{
      "label": "avatar",
      "width": 256,
      "height": 256,
      "decorations": false,
      "transparent": true,
      "alwaysOnTop": true,
      "skipTaskbar": true,
      "resizable": false,
      "shadow": false
    }]
  }
}
```

Click-Through Behavior

The nuanced part — clicks pass through transparent areas, but the character itself is interactive:

```typescript
import { getCurrentWindow } from '@tauri-apps/api/window';

const appWindow = getCurrentWindow();

// Default: the whole window ignores the cursor, so clicks fall through
// the transparent areas to whatever sits underneath.
await appWindow.setIgnoreCursorEvents(true);

// Re-enable input only while the pointer is over the character itself.
// Caveat: while cursor events are ignored, the WebView may not receive
// mouseenter at all; in practice you may need to poll the cursor position
// from the Rust side and toggle this flag from there.
characterElement.addEventListener('mouseenter', () => {
  appWindow.setIgnoreCursorEvents(false);
});
characterElement.addEventListener('mouseleave', () => {
  appWindow.setIgnoreCursorEvents(true);
});
```

Character Animation: Start with Lottie, Graduate to Live2D

Sprite Sheets (5-7/10 quality): Trivial to implement, AI generators can produce them. Fixed resolution, abrupt transitions.

Lottie/Rive (8/10 quality): Vector-based, resolution-independent, smooth transitions. Rive has built-in state machines.

Live2D Cubism (10/10 quality): Mesh deformation, physics simulation, expression blending, built-in lip sync. The VTuber industry standard. Nothing else in 2D comes close.

VRM/Three.js (8-9/10 quality): 3D humanoid avatars. Thousands of free models on VRoid Hub. Standard blendshapes across all models.

Recommendation: Lottie for MVP, Live2D for polished version.
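Whatever renderer you land on, most of the avatar code reduces to one mapping: avatar state → animation clip. A minimal sketch with hypothetical frame ranges inside a single Lottie file (with lottie-web you would hand these to `anim.playSegments([start, end], true)`):

```typescript
type AvatarState = 'idle' | 'listening' | 'thinking' | 'speaking';

// Hypothetical frame ranges; real values come from your animation file.
const SEGMENTS: Record<AvatarState, [number, number]> = {
  idle:      [0, 60],
  listening: [61, 90],
  thinking:  [91, 150],
  speaking:  [151, 180],
};

// Look up the clip to play for a given state.
function segmentFor(state: AvatarState): [number, number] {
  return SEGMENTS[state];
}
```

Swapping Lottie for Live2D later only changes what the right-hand side of this table points at, not the state machine around it.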


Camera Emotion Detection with MediaPipe

MediaPipe Face Mesh provides 468 facial landmarks in real time, running as WASM/WebGL directly in the browser (and therefore directly in Tauri's WebView). Expect 30+ FPS at 640×480 on Apple Silicon.

From Landmarks to Emotions

```typescript
// Euclidean distance between two normalized landmarks
const distance = (a: NormalizedLandmark, b: NormalizedLandmark): number =>
  Math.hypot(a.x - b.x, a.y - b.y);

const threshold = 0.02; // brow-movement threshold; tune per camera and user

function classifyEmotion(landmarks: NormalizedLandmark[]): Emotion {
  // Mouth corners (61, 291) and inner-lip midpoints (13, 14)
  const mouthWidth = distance(landmarks[61], landmarks[291]);
  const mouthHeight = distance(landmarks[13], landmarks[14]);
  const smileRatio = mouthWidth / mouthHeight;

  // Average brow-to-eyelid vertical offset across both eyes
  const browRaise = (landmarks[105].y - landmarks[159].y +
                     landmarks[334].y - landmarks[386].y) / 2;

  if (smileRatio > 2.5 && browRaise > threshold) return 'happy';
  if (smileRatio < 1.5 && browRaise < -threshold) return 'sad';
  // ... more classifications
  return 'neutral';
}
```

Privacy Architecture (Non-Negotiable)

```
Camera → AVFoundation (native) → Frame buffer (Rust) → WebView (in-process)
                                                           ↓
                                                    MediaPipe WASM
                                                           ↓
                                                    468 landmarks (numbers only)
                                                           ↓
                                                    { emotion: "happy" }
                                                           ↓
                                                    Character animation
```

No camera frames leave the device. Ever. Only the classified emotion label crosses the IPC boundary.

Critical UX detail: Reactions should be delayed and smoothed. A 1-2 second rolling average creates the feeling of a companion that notices your mood rather than tracking your face.
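One way to get that smoothing, sketched here as a majority vote over a sliding time window (the window length and `Emotion` labels are illustrative):

```typescript
type Emotion = 'happy' | 'sad' | 'neutral';

// Majority vote over the last windowMs of classifications, so the avatar
// reacts to a sustained mood rather than frame-by-frame flicker.
class EmotionSmoother {
  private samples: { t: number; e: Emotion }[] = [];

  constructor(private windowMs = 1500) {}

  update(e: Emotion, now: number): Emotion {
    this.samples.push({ t: now, e });
    // Drop samples that have aged out of the window
    this.samples = this.samples.filter(s => now - s.t <= this.windowMs);

    const counts = new Map<Emotion, number>();
    for (const s of this.samples) counts.set(s.e, (counts.get(s.e) ?? 0) + 1);

    let best: Emotion = 'neutral';
    let bestN = 0;
    for (const [emotion, n] of counts) {
      if (n > bestN) { best = emotion; bestN = n; }
    }
    return best;
  }
}
```

Feed it one classification per MediaPipe frame; a lone "sad" frame in a run of "happy" frames never reaches the character.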


Lip Sync

Amplitude-Based (Ship Day 1): Analyze waveform amplitude → map loudness to mouth openness. Works with any TTS engine, real-time, zero dependencies.

```typescript
class AmplitudeLipSync {
  private dataArray: Uint8Array;

  constructor(private analyser: AnalyserNode) {
    this.analyser.fftSize = 256;
    this.dataArray = new Uint8Array(analyser.frequencyBinCount);
  }

  // Average energy across rough speech-frequency bins, mapped to 0..1
  getMouthOpenness(): number {
    this.analyser.getByteFrequencyData(this.dataArray);
    const speechBins = this.dataArray.slice(4, 40);
    const average = speechBins.reduce((a, b) => a + b, 0) / speechBins.length;
    return Math.min(1.0, average / 128);
  }
}
```

Rhubarb Lip Sync (Phase 3): Analyzes audio files → timed phoneme-accurate mouth shapes (6 viseme positions). C++ binary, compiles cleanly on ARM64.
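Rhubarb's JSON export lists timed mouth cues; driving the avatar then reduces to looking up the cue active at the current playback time. A sketch, assuming the `mouthCues` shape of `rhubarb -f json` output:

```typescript
// One timed cue from Rhubarb's JSON export: a viseme letter (A-F, plus
// X for rest) active over [start, end) seconds.
interface MouthCue { start: number; end: number; value: string }

// Return the viseme at playback time t, or 'X' (closed/rest) when no cue
// covers t. Cues are sorted, so a linear scan is fine for short clips;
// switch to binary search if clips get long.
function visemeAt(cues: MouthCue[], t: number): string {
  for (const c of cues) {
    if (t >= c.start && t < c.end) return c.value;
  }
  return 'X';
}
```

Call it from the render loop with the audio element's `currentTime` and swap the mouth sprite or blendshape accordingly.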


Integration Architecture

All components communicate through local WebSocket events:

1. User speaks → Whisper STT → Avatar: "listening" pose
2. Speech recognized → Gateway → Avatar: acknowledgment nod
3. LLM processing → Avatar: "thinking" pose (thought bubble)
4. Kokoro TTS generates audio → Avatar: lip sync active
5. Camera detects user smiling → Avatar: warm reaction
Enter fullscreen mode Exit fullscreen mode
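The event-to-pose mapping above can be sketched as a lookup table; the event names here are hypothetical stand-ins for whatever your Gateway actually emits:

```typescript
enum AvatarState {
  IDLE, ROAMING, LISTENING, THINKING, SPEAKING, REACTING, SLEEPING
}

// Hypothetical event names; adjust to the Gateway's real event schema.
const EVENT_TO_STATE: Record<string, AvatarState> = {
  'stt.listening':  AvatarState.LISTENING,
  'stt.recognized': AvatarState.THINKING,
  'llm.processing': AvatarState.THINKING,
  'tts.playing':    AvatarState.SPEAKING,
  'tts.finished':   AvatarState.IDLE,
};

// Returns undefined for events the avatar doesn't care about.
function stateForEvent(event: string): AvatarState | undefined {
  return EVENT_TO_STATE[event];
}
```

The WebSocket handler then becomes a one-liner: parse the event, look up the state, hand it to the state machine.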

The character state machine prevents visual conflicts:

```typescript
enum AvatarState {
  IDLE, ROAMING, LISTENING, THINKING, SPEAKING, REACTING, SLEEPING
}
// Priority: SPEAKING > LISTENING > THINKING > REACTING > ROAMING > IDLE > SLEEPING
```
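That priority comment can be made executable; a minimal resolver, assuming the same state set:

```typescript
enum AvatarState {
  IDLE, ROAMING, LISTENING, THINKING, SPEAKING, REACTING, SLEEPING
}

// Highest-priority-wins ordering, matching the comment above.
const PRIORITY: AvatarState[] = [
  AvatarState.SPEAKING, AvatarState.LISTENING, AvatarState.THINKING,
  AvatarState.REACTING, AvatarState.ROAMING, AvatarState.IDLE,
  AvatarState.SLEEPING,
];

// Given every state currently being requested (TTS wants SPEAKING, the
// camera wants REACTING, ...), pick the one that actually renders.
function resolveState(requested: Set<AvatarState>): AvatarState {
  for (const s of PRIORITY) if (requested.has(s)) return s;
  return AvatarState.IDLE;
}
```

This keeps conflict resolution in one place: subsystems only ever add or remove requests, they never set the rendered state directly.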

Performance Budget on Apple Silicon

  • Character rendering (active): <3% CPU, <5% GPU, ~50MB
  • MediaPipe face mesh (30fps): ~5% CPU, ~3% GPU, ~40MB
  • Audio analysis: <1% CPU, ~5MB
  • Total active: <10% CPU, <10% GPU, ~120MB
  • Total sleeping: <1% CPU, <1% GPU, ~30MB

Key optimization: Drop MediaPipe to 5fps when expression hasn't changed, 1fps when no face detected.
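That throttling policy fits in one pure function; the intervals are the rates suggested above:

```typescript
// Choose the delay before the next MediaPipe inference based on the
// previous result: full rate only while the expression is changing.
function nextInferenceDelayMs(faceDetected: boolean, expressionChanged: boolean): number {
  if (!faceDetected) return 1000;     // 1 fps: nobody in frame
  if (!expressionChanged) return 200; // 5 fps: steady expression
  return 33;                          // ~30 fps: expression in motion
}
```

Drive the camera loop with `setTimeout` using this delay instead of `requestAnimationFrame`, and the face-mesh cost drops to near zero whenever you step away.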


Three-Phase Roadmap

Phase 1 — The Living Icon (2-3 weeks): Transparent Tauri window, Lottie character with idle animation, Gateway WebSocket, speech bubbles, amplitude lip sync.

Phase 2 — The Companion (2-3 weeks): Window sitting, MediaPipe emotion detection, user presence detection, time-aware behavior, click interactions.

Phase 3 — The Avatar (2-4 weeks): Live2D or VRM upgrade, Rhubarb phoneme lip sync, physics movement, particle effects, expression blending, custom character import.


Start with a breathing sprite. End with a companion that knows when you're tired.


Part 5 of a series on building autonomous AI systems. Designed to integrate with OpenClaw using Kokoro TTS and Whisper STT.
