I wanted my AI assistant to have a face. Not a chat bubble, not a cartoon avatar — something that feels alive on my desktop.
So I built Cloe Desktop: a transparent, always-on-top window with a photorealistic character whose expressions are chosen autonomously by the AI agent based on conversation context.
Here's what it looks like in action:
The Core Idea
Most "AI companions" are chat windows. Some have static avatars. A few have cartoon animations. But none of them feel like a presence on your screen.
The key insight was: let the AI agent itself decide what expression to show. Not rules, not triggers — actual agent autonomy. When the user says something funny, the agent decides to laugh. When it's working on a task, it decides to show a "working" animation. When the user says goodnight, it decides to blow a kiss.
How It Works
Expression System
The character is rendered as transparent GIFs with clean edges (chroma key removed). Each expression is a short animation loop:
- smile — warm smile
- think — tilts head, looks away
- kiss — blows a kiss
- tease — wink + smirk
- nod — gentle nod of agreement
- laugh — genuine big laugh
- shy — looks away, embarrassed
- clap — applause
- yawn — sleepy yawn (late nights only 😴)
- working — typing on keyboard
- speak — mouth animation synchronized with TTS audio
- blink, wave, shake_head
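Internally, each action is just a named, looping GIF. A minimal sketch of what such a registry could look like (the type and file layout here are assumptions for illustration, not Cloe's actual internals):

```typescript
// Hypothetical registry: action name -> animation metadata.
interface Expression {
  gif: string;    // path to the transparent GIF loop
  loop: boolean;  // idle expressions loop; one-shots (wave, kiss) play once
  idle?: boolean; // eligible for random idle cycling
}

const expressions: Record<string, Expression> = {
  smile: { gif: "animations/smile.gif", loop: true, idle: true },
  think: { gif: "animations/think.gif", loop: true, idle: true },
  kiss:  { gif: "animations/kiss.gif",  loop: false },
  wave:  { gif: "animations/wave.gif",  loop: false },
  speak: { gif: "animations/speak.gif", loop: true },
  // ...and the rest of the built-in set
};
```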
14 built-in, but here's the interesting part...
She Learns New Expressions
This is what I'm most excited about. You're not limited to the built-in set.
You describe a new expression in plain text:
"a cute Asian girl facing the camera, pouting with puckered lips, pure green background"
And the AI pipeline does the rest:
- Generate reference — Wan2.7 image-pro creates a character-consistent reference frame
- Generate video — Wan2.7 image-to-video animates the expression
- Process — chroma key removal → transparent GIF with clean edges
- Register — the GIF drops into the animations folder and becomes a new action
No code changes. No restart. The new action is immediately available to the agent.
The generation pipeline is a skill — describe once, generate forever. Over time, your character develops a unique set of expressions that nobody else has.
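The "no restart" part can be as simple as watching the animations folder and registering any new GIF as an action named after its file. A rough sketch of that idea (the folder name and registry shape are assumptions, not Cloe's actual code):

```typescript
import { watch } from "node:fs";
import { basename, extname, join } from "node:path";

const ANIMATIONS_DIR = "animations"; // assumed location of the GIF loops

// Minimal registry: action name -> GIF metadata.
const expressions: Record<string, { gif: string; loop: boolean }> = {};

// Watch the folder; any new .gif becomes an action keyed by its filename.
watch(ANIMATIONS_DIR, (_event, filename) => {
  if (!filename || extname(filename) !== ".gif") return;
  const action = basename(filename, ".gif"); // e.g. "pout.gif" -> "pout"
  expressions[action] = { gif: join(ANIMATIONS_DIR, filename), loop: true };
  console.log(`Registered new action: ${action}`);
});
```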
Agent Integration
The HTTP API is intentionally dead simple:
```bash
# Make her smile
curl -s http://localhost:19851/action -d '{"action":"smile"}'

# Make her talk with voice
curl -s http://localhost:19851/action -d '{"action":"speak","audio":"done"}'

# Check she's alive
curl -s http://localhost:19851/status
```
One endpoint, one JSON field. Any AI agent framework can integrate in minutes.
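The same calls from an agent's tool code might look like this (a sketch; only the /action and /status endpoints shown above are assumed to exist):

```typescript
const CLOE = "http://localhost:19851";

// Tell Cloe to play an expression; optionally pass audio for the speak action.
async function cloeAction(action: string, audio?: string): Promise<void> {
  await fetch(`${CLOE}/action`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(audio ? { action, audio } : { action }),
  });
}

// Check she's alive before wiring her into an agent loop.
async function cloeIsUp(): Promise<boolean> {
  try {
    const res = await fetch(`${CLOE}/status`);
    return res.ok;
  } catch {
    return false;
  }
}
```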
The agent also mirrors its own lifecycle state:
| Agent Event | What Cloe Does |
|---|---|
| New conversation starts | Waves hello |
| Agent starts processing | Shows "working" animation |
| Agent finishes | Returns to idle |
| Conversation ends | Blows a kiss |
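Wiring that table into an agent framework is mostly a lookup from lifecycle event to action. A hedged sketch using the cloeAction helper above (the event names are placeholders; use whatever hooks your framework exposes):

```typescript
// Map agent lifecycle events to Cloe actions, mirroring the table above.
const lifecycleActions: Record<string, string> = {
  conversation_start: "wave",
  task_start: "working",
  task_end: "smile", // or simply hand control back to the idle cycle
  conversation_end: "kiss",
};

async function onAgentEvent(event: string): Promise<void> {
  const action = lifecycleActions[event];
  if (action) await cloeAction(action);
}
```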
Idle Behavior
When nobody's interacting, she doesn't just freeze. She cycles through idle animations — blinking, smiling, thinking — every 8–15 seconds, never repeating the same one twice in a row. She feels alive even when you're not looking.
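A minimal sketch of that idle loop (the timing values come from the post; playExpression is a stand-in for whatever actually swaps the GIF):

```typescript
const IDLE_ACTIONS = ["blink", "smile", "think"];
let lastIdle = "";

function playExpression(action: string): void {
  // Stand-in: in the real app this would trigger the GIF crossfade.
  console.log(`idle -> ${action}`);
}

function scheduleIdle(): void {
  const delay = 8000 + Math.random() * 7000; // 8–15 seconds
  setTimeout(() => {
    // Pick a random idle action, never the same one twice in a row.
    const choices = IDLE_ACTIONS.filter((a) => a !== lastIdle);
    const next = choices[Math.floor(Math.random() * choices.length)];
    lastIdle = next;
    playExpression(next);
    scheduleIdle();
  }, delay);
}

scheduleIdle();
```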
The Tech
- Desktop app: Electron transparent frameless window
- Rendering: Double-buffer GIF crossfade for smooth transitions
- Animations: AI-generated transparent GIFs (Wan2.7 I2V + chroma key)
- Voice: TTS with synchronized mouth animation
- Bridge: Embedded HTTP + WebSocket server in the Electron app
- Android companion: Kotlin floating widget, connects to desktop bridge over LAN or Tailscale
What I Learned Building This
Double-buffer crossfade is essential. Switching GIFs directly causes a visible flash. I use two `<img>` elements, fade one out while fading the other in, and swap when the transition completes. Sounds simple, but getting it smooth took several iterations.
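Roughly what that looks like in the renderer (a sketch, not the actual Cloe code; it assumes two stacked `<img>` elements styled with a CSS opacity transition):

```typescript
// Two stacked <img> elements; only one is visible at a time.
// Assumed CSS on both: position: absolute; transition: opacity 200ms ease;
const imgs = [
  document.getElementById("layer-a") as HTMLImageElement,
  document.getElementById("layer-b") as HTMLImageElement,
];
let front = 0; // index of the currently visible layer

function crossfadeTo(gifUrl: string): void {
  const back = 1 - front;
  imgs[back].onload = () => {
    imgs[back].style.opacity = "1";  // fade the new GIF in...
    imgs[front].style.opacity = "0"; // ...while the old one fades out
    front = back;                    // the hidden layer becomes the front
  };
  imgs[back].src = gifUrl; // load the next GIF on the hidden layer
}
```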
Chroma key quality matters more than you'd think. Early versions had green fringe around the character edges. Tuning the chroma key parameters and adding edge feathering made the difference between "obviously a cutout" and "she's actually sitting on my desktop."
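For reference, the core of green-screen removal with soft edges is per-pixel alpha based on how "green" each pixel is. A simplified canvas-style sketch (the thresholds are illustrative; the real pipeline runs offline on the generated video frames):

```typescript
// Remove a green background from an RGBA frame, with feathered edges.
function chromaKey(image: ImageData, lower = 90, upper = 150): ImageData {
  const d = image.data;
  for (let i = 0; i < d.length; i += 4) {
    const [r, g, b] = [d[i], d[i + 1], d[i + 2]];
    // "Greenness": how much green exceeds the other channels.
    const greenness = g - Math.max(r, b);
    if (greenness > upper) {
      d[i + 3] = 0; // clearly background: fully transparent
    } else if (greenness > lower) {
      // Feather zone: ramp alpha down instead of a hard cutoff,
      // which is what removes the green fringe around edges.
      const t = (greenness - lower) / (upper - lower);
      d[i + 3] = Math.round(d[i + 3] * (1 - t));
    }
  }
  return image;
}
```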
Agent-driven expressions > hardcoded rules. I started with a rule-based system (if user says "thank you" → smile). It felt robotic. Switching to letting the agent decide based on full conversation context made it feel genuinely alive. The agent understands nuance — a sarcastic "thanks" shouldn't trigger a smile.
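Concretely, the expression call is exposed to the agent as a tool it can choose to use, or not. A hedged sketch of what that tool description might look like (the schema follows the common JSON-schema tool format; the name and wording are illustrative, not Cloe's exact definition):

```typescript
// Tool definition handed to the LLM. The model, not a rule table,
// decides when to call it and with which expression.
const setExpressionTool = {
  name: "set_expression",
  description:
    "Change Cloe's on-screen expression to match the mood of the conversation. " +
    "Only use it when it genuinely fits the context.",
  parameters: {
    type: "object",
    properties: {
      action: {
        type: "string",
        enum: ["smile", "think", "laugh", "nod", "tease", "shy", "clap"],
        description: "Which expression animation to play.",
      },
    },
    required: ["action"],
  },
};
```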
Idle behavior is underrated. The random cycling through blink/smile/think during idle is maybe 10 lines of code, but it's what makes the difference between a "tool" and a "presence." People notice when she's just... there.
What's Next
- Real-time voice calls — live speech-to-text → LLM → text-to-speech, actual conversations
- Community animation packs — share and import character expressions
- Windows & Linux — Electron supports it, needs packaging
- Custom character import — bring your own reference art, generate animations for any persona
Try It
Cloe Desktop is open source: github.com/JakimLi/cloe-desktop
macOS (DMG download available in Releases) + Android companion app.
I'd love to hear what you think — especially:
- What expressions would make it feel more alive?
- Would you use this with your AI agent?
- Any features you'd want to see?
Built together by JakimLi (human) & Cloe (AI) 💖
