A Practical Guide for Apps with Real-Time Voice Conversations
Duolingo’s “Video Call with Lily” introduced a new UX pattern:
AI voice conversations that feel like a real call with a character.
This isn’t just a chatbot with audio.
It’s a coordinated system of speech, animation state, and runtime logic.
In this article, I’ll explain how Duolingo-style mascot animation actually works, and how the same approach can be applied to any app using Rive.
The Core Problem: Voice Alone Is Not Enough
Many AI apps already have:
- Speech-to-text
- LLM responses
- Text-to-speech output
Yet users still describe the experience as:
“It feels robotic.”
Why?
Because humans rely heavily on visual turn-taking signals:
- Who’s listening?
- Who’s speaking?
- Is the other side processing?
- Am I interrupting?
Duolingo solved this by making the character behave like a call participant, not a narrator.
Key Insight: This Is a State Problem, Not an Animation Problem
The biggest mistake teams make is treating this as:
“We just need a talking animation.”
In reality, you need a state-driven character system.
The mascot must visually represent:
- Idle / waiting
- Listening (user speaking)
- Thinking (AI processing)
- Talking (AI speaking)
- Emotional feedback (encouragement, correction, confusion)
This is where Rive State Machines become critical.
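To make that concrete, here is a minimal sketch of the state model in TypeScript. The type and function names are illustrative, not part of any Rive or Duolingo API; emotion is deliberately left out because it layers on top of these states rather than replacing them.

```ts
// The visual states the mascot must be able to represent.
type CharacterState = "idle" | "listening" | "thinking" | "talking";

// Conversation events produced by the voice/LLM pipeline.
type ConversationEvent =
  | { type: "userStartedSpeaking" }
  | { type: "userStoppedSpeaking" }
  | { type: "aiAudioStarted" }
  | { type: "aiAudioEnded" };

// Pure transition function: the voice pipeline emits events,
// the character layer only ever reads the resulting state.
function nextState(_current: CharacterState, event: ConversationEvent): CharacterState {
  switch (event.type) {
    case "userStartedSpeaking": return "listening";
    case "userStoppedSpeaking": return "thinking"; // AI is now processing
    case "aiAudioStarted":      return "talking";
    case "aiAudioEnded":        return "idle";
  }
}
```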
Why Rive Works Well for Duolingo-Style Mascots
Rive is not just for timeline animations. It supports:
- State Machines
- Runtime-controlled inputs
- Blended animation states
- Lightweight real-time performance (mobile + web)
Used correctly, Rive acts as a behavior layer between AI audio and UI.
Standardizing the Rive Input Contract
To make this scalable across multiple characters, I recommend defining a strict input API that every .riv file follows.
Typical inputs:
- listening (bool) — user speaking
- talking (bool) — AI audio playing
- viseme (number) — mouth shape index
- visemeWeight (number, 0–1) — blending strength
- mouthOpen (number, 0–1) — audio amplitude fallback
- emotion (number) — emotional tone
Once engineers integrate this contract, any number of characters can be swapped in without code changes.
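As a sketch of what binding that contract can look like, here is one approach using the web runtime (@rive-app/canvas); the same idea maps to the Flutter and Unity runtimes. The state machine name "Character" and the bindCharacterInputs helper are my assumptions, not an official Rive API.

```ts
import { Rive } from "@rive-app/canvas";

const STATE_MACHINE = "Character"; // assumed state machine name shared by all characters

// Structural type matching the runtime's state machine inputs (name + value).
type RuntimeInput = { name: string; value: number | boolean };

// The shared contract: one property per input listed above.
interface CharacterInputs {
  listening: RuntimeInput;    // bool: user speaking
  talking: RuntimeInput;      // bool: AI audio playing
  viseme: RuntimeInput;       // number: mouth shape index
  visemeWeight: RuntimeInput; // number 0–1: blending strength
  mouthOpen: RuntimeInput;    // number 0–1: amplitude fallback
  emotion: RuntimeInput;      // number: emotional tone
}

// Look up each contract input by name and fail loudly if a character breaks the contract.
function bindCharacterInputs(rive: Rive): CharacterInputs {
  const inputs = rive.stateMachineInputs(STATE_MACHINE) ?? [];
  const find = (name: string): RuntimeInput => {
    const input = inputs.find((i) => i.name === name);
    if (!input) throw new Error(`Missing required input "${name}"`);
    return input;
  };
  return {
    listening: find("listening"),
    talking: find("talking"),
    viseme: find("viseme"),
    visemeWeight: find("visemeWeight"),
    mouthOpen: find("mouthOpen"),
    emotion: find("emotion"),
  };
}
```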
Mouth Sync: Real-Time, Not Cinematic
For voice calls, lip sync must prioritize latency over perfection.
Best practice is a hybrid approach:
- Use viseme events when available (from TTS engines)
- Blend with audio-amplitude–driven mouth opening
- Keep viseme transitions soft and forgiving
This avoids delayed mouth movement when audio is streamed in small chunks.
The goal is believability, not film-quality dubbing.
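Here is a minimal sketch of that hybrid, building on the CharacterInputs contract above. The hold time and smoothing constants are illustrative starting points to tune per character, not measured values.

```ts
const VISEME_HOLD_MS = 120; // how long a viseme event stays authoritative
let lastVisemeAt = 0;

// Called whenever the TTS engine emits a viseme event.
function onVisemeEvent(inputs: CharacterInputs, visemeIndex: number): void {
  lastVisemeAt = performance.now();
  inputs.viseme.value = visemeIndex;
  inputs.visemeWeight.value = 1;
}

// Called every animation frame with the current amplitude (0–1) of the streamed AI audio.
function onAudioFrame(inputs: CharacterInputs, amplitude: number): void {
  const sinceViseme = performance.now() - lastVisemeAt;
  if (sinceViseme > VISEME_HOLD_MS) {
    // No fresh viseme: fade its influence so amplitude takes over smoothly.
    inputs.visemeWeight.value = Math.max(0, 1 - (sinceViseme - VISEME_HOLD_MS) / 200);
  }
  // Soft, forgiving mouth opening: smooth the amplitude so the jaw doesn't flicker.
  const target = Math.min(1, amplitude * 1.5);
  inputs.mouthOpen.value = (inputs.mouthOpen.value as number) * 0.7 + target * 0.3;
}
```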
Runtime Flow (High Level)
A typical interaction loop looks like this:
- User speaks → character enters listening
- User audio stops → character enters thinking
- AI audio starts → talking = true
- Mouth sync is driven by visemes / amplitude
- AI audio ends → character returns to idle or listening
Animation reacts to events, not timelines.
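The glue between those events and the Rive inputs can stay very small. This sketch reuses the ConversationEvent and CharacterInputs types defined earlier; "thinking" is left to the state machine itself (listening and talking both false).

```ts
// Apply one conversation event to the character's inputs.
// The Rive State Machine decides how to blend between the resulting poses.
function applyConversationEvent(inputs: CharacterInputs, event: ConversationEvent): void {
  switch (event.type) {
    case "userStartedSpeaking":
      inputs.listening.value = true;
      inputs.talking.value = false;
      break;
    case "userStoppedSpeaking":
      inputs.listening.value = false; // neither flag set: state machine shows "thinking"
      break;
    case "aiAudioStarted":
      inputs.talking.value = true;
      break;
    case "aiAudioEnded":
      inputs.talking.value = false;   // state machine falls back to idle or listening
      inputs.mouthOpen.value = 0;
      break;
  }
}
```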
Scaling to Multiple Characters
Once the system is in place:
- New characters reuse the same rig logic
- Personality is defined via timing, emotion tuning, and micro-motion
- Engineering stays unchanged
This is how teams move fast without reworking animation every sprint.
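For example, swapping mascots becomes little more than loading a different .riv file and binding the same contract. This sketch reuses the bindCharacterInputs helper from above; the file names and canvas id are placeholders.

```ts
import { Rive } from "@rive-app/canvas";

// Load any character that follows the shared input contract.
function loadCharacter(file: string): Promise<CharacterInputs> {
  return new Promise((resolve, reject) => {
    const rive = new Rive({
      src: file, // e.g. "lily.riv" or "oscar.riv" (placeholder file names)
      canvas: document.getElementById("mascot") as HTMLCanvasElement,
      stateMachines: STATE_MACHINE,
      autoplay: true,
      onLoad: () => resolve(bindCharacterInputs(rive)),
      onLoadError: () => reject(new Error(`Failed to load ${file}`)),
    });
  });
}
```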
About the Author
I’m Praneeth Kawya Thathsara, founder of UI Animation Agency.
I specialize in the complete pipeline for interactive characters:
- Rive rigging and optimization
- State Machine design
- Real-time mouth sync systems
- Runtime integration strategies (Flutter, Web, Unity)
I work remotely with solo founders and startups globally to ship interactive mascots that increase retention and session time.
Interested in Implementing This?
If you’re building:
- AI speaking practice apps
- Roleplay or conversation trainers
- AI tutors or companions
…and want to move from a static mascot to a living, responsive character, I’m happy to discuss implementation strategies.
Get in Touch
📧 riveanimator@gmail.com
+94 71 700 0999