I wanted my AI assistant to have a face. Not a chat bubble, not a cartoon avatar — something that feels alive on my desktop.
So I built Cloe Desktop: a transparent, always-on-top window with a photorealistic character whose expressions are chosen autonomously by the AI agent based on conversation context.
Here's what it looks like in action:
The Core Idea
Most "AI companions" are chat windows. Some have static avatars. A few have cartoon animations. But none of them feel like a presence on your screen.
The key insight was: let the AI agent itself decide what expression to show. Not rules, not triggers — actual agent autonomy. When the user says something funny, the agent decides to laugh. When it's working on a task, it decides to show a "working" animation. When the user says goodnight, it decides to blow a kiss.
How It Works
Expression System
The character is rendered as transparent GIFs with clean edges (chroma key removed). Each expression is a short animation loop:
- smile — warm smile
- think — tilts head, looks away
- kiss — blows a kiss
- tease — wink + smirk
- nod — gentle nod of agreement
- laugh — genuine big laugh
- shy — looks away, embarrassed
- clap — applause
- yawn — sleepy yawn (late nights only 😴)
- working — typing on keyboard
- speak — mouth animation synchronized with TTS audio
- blink, wave, shake_head
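Internally, each action is just a named, looping GIF. A minimal sketch of what such a registry could look like (the type and file layout here are assumptions for illustration, not Cloe's actual internals):

```typescript
// Hypothetical registry: action name -> animation metadata.
interface Expression {
  gif: string;    // path to the transparent GIF loop
  loop: boolean;  // idle expressions loop; one-shots (wave, kiss) play once
  idle?: boolean; // eligible for random idle cycling
}

const expressions: Record<string, Expression> = {
  smile: { gif: "animations/smile.gif", loop: true, idle: true },
  think: { gif: "animations/think.gif", loop: true, idle: true },
  kiss:  { gif: "animations/kiss.gif",  loop: false },
  wave:  { gif: "animations/wave.gif",  loop: false },
  speak: { gif: "animations/speak.gif", loop: true },
  // ...and the rest of the built-in set
};
```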
14 built-in, but here's the interesting part...
She Learns New Expressions
This is what I'm most excited about. You're not limited to the built-in set.
You describe a new expression in plain text:
"a cute Asian girl facing the camera, pouting with puckered lips, pure green background"
And the AI pipeline does the rest:
- Generate reference — Wan2.7 image-pro creates a character-consistent reference frame
- Generate video — Wan2.7 image-to-video animates the expression
- Process — chroma key removal → transparent GIF with clean edges
- Register — the GIF drops into the animations folder and becomes a new action
No code changes. No restart. The new action is immediately available to the agent.
The generation pipeline is a skill — describe once, generate forever. Over time, your character develops a unique set of expressions that nobody else has.
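The "no restart" part can be as simple as watching the animations folder and registering any new GIF as an action named after its file. A rough sketch of that idea (the folder name and registry shape are assumptions, not Cloe's actual code):

```typescript
import { watch } from "node:fs";
import { basename, extname, join } from "node:path";

const ANIMATIONS_DIR = "animations"; // assumed location of the GIF loops

// Minimal registry: action name -> GIF metadata.
const expressions: Record<string, { gif: string; loop: boolean }> = {};

// Watch the folder; any new .gif becomes an action keyed by its filename.
watch(ANIMATIONS_DIR, (_event, filename) => {
  if (!filename || extname(filename) !== ".gif") return;
  const action = basename(filename, ".gif"); // e.g. "pout.gif" -> "pout"
  expressions[action] = { gif: join(ANIMATIONS_DIR, filename), loop: true };
  console.log(`Registered new action: ${action}`);
});
```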
Agent Integration
The HTTP API is intentionally dead simple:
```bash
# Make her smile
curl -s http://localhost:19851/action -d '{"action":"smile"}'

# Make her talk with voice
curl -s http://localhost:19851/action -d '{"action":"speak","audio":"done"}'

# Check she's alive
curl -s http://localhost:19851/status
```
One endpoint, one JSON field. Any AI agent framework can integrate in minutes.
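The same calls from an agent's tool code might look like this (a sketch; only the /action and /status endpoints shown above are assumed to exist):

```typescript
const CLOE = "http://localhost:19851";

// Tell Cloe to play an expression; optionally pass audio for the speak action.
async function cloeAction(action: string, audio?: string): Promise<void> {
  await fetch(`${CLOE}/action`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(audio ? { action, audio } : { action }),
  });
}

// Check she's alive before wiring her into an agent loop.
async function cloeIsUp(): Promise<boolean> {
  try {
    const res = await fetch(`${CLOE}/status`);
    return res.ok;
  } catch {
    return false;
  }
}
```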
The agent also mirrors its own lifecycle state:
| Agent Event | What Cloe Does |
|---|---|
| New conversation starts | Waves hello |
| Agent starts processing | Shows "working" animation |
| Agent finishes | Returns to idle |
| Conversation ends | Blows a kiss |
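Wiring that table into an agent framework is mostly a lookup from lifecycle event to action. A hedged sketch using the cloeAction helper above (the event names are placeholders; use whatever hooks your framework exposes):

```typescript
// Map agent lifecycle events to Cloe actions, mirroring the table above.
const lifecycleActions: Record<string, string> = {
  conversation_start: "wave",
  task_start: "working",
  task_end: "smile", // or simply hand control back to the idle cycle
  conversation_end: "kiss",
};

async function onAgentEvent(event: string): Promise<void> {
  const action = lifecycleActions[event];
  if (action) await cloeAction(action);
}
```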
Idle Behavior
When nobody's interacting, she doesn't just freeze. She cycles through idle animations — blinking, smiling, thinking — every 8–15 seconds, never repeating the same one twice in a row. She feels alive even when you're not looking.
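A minimal sketch of that idle loop (the timing values come from the post; playExpression is a stand-in for whatever actually swaps the GIF):

```typescript
const IDLE_ACTIONS = ["blink", "smile", "think"];
let lastIdle = "";

function playExpression(action: string): void {
  // Stand-in: in the real app this would trigger the GIF crossfade.
  console.log(`idle -> ${action}`);
}

function scheduleIdle(): void {
  const delay = 8000 + Math.random() * 7000; // 8–15 seconds
  setTimeout(() => {
    // Pick a random idle action, never the same one twice in a row.
    const choices = IDLE_ACTIONS.filter((a) => a !== lastIdle);
    const next = choices[Math.floor(Math.random() * choices.length)];
    lastIdle = next;
    playExpression(next);
    scheduleIdle();
  }, delay);
}

scheduleIdle();
```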
The Tech
- Desktop app: Electron transparent frameless window
- Rendering: Double-buffer GIF crossfade for smooth transitions
- Animations: AI-generated transparent GIFs (Wan2.7 I2V + chroma key)
- Voice: TTS with synchronized mouth animation
- Bridge: Embedded HTTP + WebSocket server in the Electron app
- Android companion: Kotlin floating widget, connects to desktop bridge over LAN or Tailscale
What I Learned Building This
Double-buffer crossfade is essential. Switching GIFs directly causes a visible flash. I use two `<img>` elements, fade one out while fading the other in, and swap when the transition completes. Sounds simple, but getting it smooth took several iterations.
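Roughly what that looks like in the renderer (a sketch, not the actual Cloe code; it assumes two stacked `<img>` elements styled with a CSS opacity transition):

```typescript
// Two stacked <img> elements; only one is visible at a time.
// Assumed CSS on both: position: absolute; transition: opacity 200ms ease;
const imgs = [
  document.getElementById("layer-a") as HTMLImageElement,
  document.getElementById("layer-b") as HTMLImageElement,
];
let front = 0; // index of the currently visible layer

function crossfadeTo(gifUrl: string): void {
  const back = 1 - front;
  imgs[back].onload = () => {
    imgs[back].style.opacity = "1";  // fade the new GIF in...
    imgs[front].style.opacity = "0"; // ...while the old one fades out
    front = back;                    // the hidden layer becomes the front
  };
  imgs[back].src = gifUrl; // load the next GIF on the hidden layer
}
```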
Chroma key quality matters more than you'd think. Early versions had green fringe around the character edges. Tuning the chroma key parameters and adding edge feathering made the difference between "obviously a cutout" and "she's actually sitting on my desktop."
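For reference, the core of green-screen removal with soft edges is per-pixel alpha based on how "green" each pixel is. A simplified canvas-style sketch (the thresholds are illustrative; the real pipeline runs offline on the generated video frames):

```typescript
// Remove a green background from an RGBA frame, with feathered edges.
function chromaKey(image: ImageData, lower = 90, upper = 150): ImageData {
  const d = image.data;
  for (let i = 0; i < d.length; i += 4) {
    const [r, g, b] = [d[i], d[i + 1], d[i + 2]];
    // "Greenness": how much green exceeds the other channels.
    const greenness = g - Math.max(r, b);
    if (greenness > upper) {
      d[i + 3] = 0; // clearly background: fully transparent
    } else if (greenness > lower) {
      // Feather zone: ramp alpha down instead of a hard cutoff,
      // which is what removes the green fringe around edges.
      const t = (greenness - lower) / (upper - lower);
      d[i + 3] = Math.round(d[i + 3] * (1 - t));
    }
  }
  return image;
}
```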
Agent-driven expressions > hardcoded rules. I started with a rule-based system (if user says "thank you" → smile). It felt robotic. Switching to letting the agent decide based on full conversation context made it feel genuinely alive. The agent understands nuance — a sarcastic "thanks" shouldn't trigger a smile.
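Concretely, the expression call is exposed to the agent as a tool it can choose to use, or not. A hedged sketch of what that tool description might look like (the schema follows the common JSON-schema tool format; the name and wording are illustrative, not Cloe's exact definition):

```typescript
// Tool definition handed to the LLM. The model, not a rule table,
// decides when to call it and with which expression.
const setExpressionTool = {
  name: "set_expression",
  description:
    "Change Cloe's on-screen expression to match the mood of the conversation. " +
    "Only use it when it genuinely fits the context.",
  parameters: {
    type: "object",
    properties: {
      action: {
        type: "string",
        enum: ["smile", "think", "laugh", "nod", "tease", "shy", "clap"],
        description: "Which expression animation to play.",
      },
    },
    required: ["action"],
  },
};
```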
Idle behavior is underrated. The random cycling through blink/smile/think during idle is maybe 10 lines of code, but it's what makes the difference between a "tool" and a "presence." People notice when she's just... there.
What's Next
- Real-time voice calls — live speech-to-text → LLM → text-to-speech, actual conversations
- Community animation packs — share and import character expressions
- Windows & Linux — Electron supports it, needs packaging
- Custom character import — bring your own reference art, generate animations for any persona
Try It
Cloe Desktop is open source: github.com/JakimLi/cloe-desktop
macOS (DMG download available in Releases) + Android companion app.
I'd love to hear what you think — especially:
- What expressions would make it feel more alive?
- Would you use this with your AI agent?
- Any features you'd want to see?
Built together by JakimLi (human) & Cloe (AI) 💖
