A Practical Guide for Apps with Real-Time Voice Conversations
Duolingo’s “Video Call with Lily” introduced a new UX pattern:
AI voice conversations that feel like a real call with a character.
This isn’t just a chatbot with audio.
It’s a coordinated system of speech, animation state, and runtime logic.
In this article, I’ll explain how Duolingo-style mascot animation actually works — and how the same approach can be applied to any app using Rive.
The Core Problem: Voice Alone Is Not Enough
Many AI apps already have:
- speech-to-text
- LLM responses
- text-to-speech output
Yet users still describe the experience as:
“It feels robotic.”
Why?
Because humans rely heavily on visual turn-taking signals:
- who’s listening?
- who’s speaking?
- is the other side processing?
- am I interrupting?
Duolingo solved this by making the character behave like a call participant, not a narrator.
Key Insight: This Is a State Problem, Not an Animation Problem
The biggest mistake teams make is thinking:
“We just need a talking animation.”
In reality, you need a state-driven character system.
The mascot must visually represent:
- idle / waiting
- listening (user speaking)
- thinking (AI processing)
- talking (AI speaking)
- emotional feedback (encouragement, correction, confusion)
This is where Rive state machines become critical.
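As a rough sketch (the names are illustrative, not taken from Duolingo's implementation), those states can be made explicit in code so the rest of the integration keys off them:

```typescript
// Illustrative names for the visual states a voice-call mascot must represent.
type MascotState =
  | "idle"       // waiting, no one is speaking
  | "listening"  // user is speaking
  | "thinking"   // AI is processing the turn
  | "talking";   // AI audio is playing

// Emotional feedback runs as a parallel channel rather than another turn-taking state.
type Emotion = "neutral" | "encouragement" | "correction" | "confusion";
```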
Why Rive Works Well for Duolingo-Style Mascots
Rive is not just for timeline animations.
It supports:
- state machines
- runtime-controlled inputs
- blended animation states
- lightweight real-time performance (mobile + web)
Used correctly, Rive acts as a behavior layer between AI audio and the UI.
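For example, with the official @rive-app/canvas web runtime, a character artboard with a state machine can be loaded once and its runtime inputs handed to the rest of the app. The file name and state machine name below are assumptions; the API calls are from the public runtime:

```typescript
import { Rive, StateMachineInput } from "@rive-app/canvas";

// Assumed asset and state machine names; use whatever your .riv file defines.
const RIV_FILE = "/characters/mascot.riv";
const STATE_MACHINE = "Conversation";

export function loadCharacter(canvas: HTMLCanvasElement): Promise<StateMachineInput[]> {
  return new Promise((resolve, reject) => {
    const rive = new Rive({
      src: RIV_FILE,
      canvas,
      stateMachines: STATE_MACHINE,
      autoplay: true,
      // Once loaded, hand back the runtime-controlled inputs (listening, talking, ...).
      onLoad: () => resolve(rive.stateMachineInputs(STATE_MACHINE)),
      onLoadError: () => reject(new Error(`Failed to load ${RIV_FILE}`)),
    });
  });
}
```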
Standardizing the Rive Input Contract
To scale this across multiple characters, it’s important to define a strict input contract that every .riv file follows.
Typical Inputs
- listening(bool) — user speaking
- talking(bool) — AI audio playing
- viseme(number) — mouth shape index
- visemeWeight(0–1) — blending strength
- mouthOpen(0–1) — audio-amplitude fallback
- emotion(number) — emotional tone
Once engineers integrate this contract, any number of characters can be swapped in without code changes.
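One way to make the contract explicit is a small binding step that resolves each input by name right after load and fails loudly if a character violates it. This is a minimal sketch, assuming the @rive-app/canvas runtime and the input names listed above:

```typescript
import { StateMachineInput } from "@rive-app/canvas";

// The standardized contract every character's .riv file must expose.
const CONTRACT = ["listening", "talking", "viseme", "visemeWeight", "mouthOpen", "emotion"] as const;
type InputName = (typeof CONTRACT)[number];

export type CharacterInputs = Record<InputName, StateMachineInput>;

// Resolve inputs by name once after load; throw if a character breaks the contract.
export function bindContract(inputs: StateMachineInput[]): CharacterInputs {
  const bound = {} as CharacterInputs;
  for (const name of CONTRACT) {
    const input = inputs.find((i) => i.name === name);
    if (!input) throw new Error(`Character is missing required input "${name}"`);
    bound[name] = input;
  }
  return bound;
}
```

With this binding in place, pointing the loader at a different .riv file is all it takes to swap characters.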
Mouth Sync: Real-Time, Not Cinematic
For voice calls, lip sync must prioritize latency over perfection.
Best practice is a hybrid approach:
- use viseme events when available (from TTS engines)
- blend with audio-amplitude-driven mouth opening
- keep transitions soft and forgiving
This avoids delayed mouth movement when audio is streamed in small chunks.
The goal is believability, not film-quality dubbing.
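Here is a minimal sketch of the hybrid approach, assuming a Web Audio AnalyserNode attached to the TTS playback and reusing the CharacterInputs binding from the contract sketch above. The gain and smoothing constants are illustrative tuning values, not fixed numbers:

```typescript
// Hybrid mouth sync: amplitude-driven fallback, with viseme events layered on top.
export function startMouthSync(inputs: CharacterInputs, analyser: AnalyserNode): void {
  const buffer = new Uint8Array(analyser.fftSize);
  let smoothed = 0;

  const tick = () => {
    // Estimate loudness (RMS) of the current TTS playback frame.
    analyser.getByteTimeDomainData(buffer);
    let sum = 0;
    for (const sample of buffer) {
      const centered = (sample - 128) / 128; // byte samples -> [-1, 1]
      sum += centered * centered;
    }
    const level = Math.sqrt(sum / buffer.length);

    // Soft smoothing keeps the mouth forgiving when audio arrives in small chunks.
    smoothed += (level - smoothed) * 0.3;
    inputs.mouthOpen.value = Math.min(1, smoothed * 3);

    requestAnimationFrame(tick);
  };
  requestAnimationFrame(tick);
}

// When the TTS engine emits viseme events, write them directly; the state machine blends.
export function onViseme(inputs: CharacterInputs, index: number, weight: number): void {
  inputs.viseme.value = index;
  inputs.visemeWeight.value = weight;
}
```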
Runtime Flow (High Level)
A typical interaction loop looks like this:
- user speaks → character enters listening
- audio stops → character enters thinking
- AI audio starts → talking = true, mouth sync driven by visemes / amplitude
- audio ends → return to idle or listening
Animation reacts to events, not timelines.
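In code, that loop is just a handful of event handlers flipping the contract inputs. The handler names below are hypothetical hooks for whatever STT/TTS stack the app uses; only the input writes are Rive-specific:

```typescript
// Map conversation events from the speech stack onto the character's contract inputs.
export function wireConversation(inputs: CharacterInputs) {
  return {
    // user speaks -> listening
    onUserSpeechStart: () => {
      inputs.listening.value = true;
      inputs.talking.value = false;
    },
    // user audio stops -> thinking (listening and talking both false while the AI processes)
    onUserSpeechEnd: () => {
      inputs.listening.value = false;
    },
    // AI audio starts -> talking; mouth sync runs off visemes / amplitude
    onAiAudioStart: () => {
      inputs.talking.value = true;
    },
    // AI audio ends -> back to idle / listening
    onAiAudioEnd: () => {
      inputs.talking.value = false;
    },
  };
}
```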
Scaling to Multiple Characters
Once the system is in place:
- new characters reuse the same rig logic
- personality is defined via timing, emotion tuning, and micro-motion
- engineering stays unchanged
This is how teams move fast without reworking animation every sprint.
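One way to express those per-character differences is a small tuning profile consumed by the same integration code. The fields below are illustrative assumptions, not a required schema:

```typescript
// Illustrative per-character profile; swapping characters means swapping profiles,
// while the loader, contract binding, and event wiring stay identical.
interface CharacterProfile {
  rivFile: string;        // which artboard the loader points at
  defaultEmotion: number; // initial value for the emotion input
  mouthGain: number;      // amplitude-to-mouthOpen multiplier
  mouthSmoothing: number; // smoothing factor for the fallback mouth sync
}

const profiles: Record<string, CharacterProfile> = {
  lily:  { rivFile: "/characters/lily.riv",  defaultEmotion: 0, mouthGain: 3.0, mouthSmoothing: 0.3 },
  tutor: { rivFile: "/characters/tutor.riv", defaultEmotion: 1, mouthGain: 2.5, mouthSmoothing: 0.2 },
};
```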
About the Author
I’m Praneeth Kawya Thathsara, founder of UI Animation Agency.
I specialize in the complete pipeline for interactive characters:
- Rive rigging and optimization
- state machine design
- real-time mouth-sync systems
- runtime integration strategies (Flutter, Web, Unity)
I work remotely with solo founders and startups globally to ship interactive mascots that increase retention and session time.
Interested in Implementing This?
If you’re building:
- AI speaking practice apps
- roleplay or conversation trainers
- AI tutors or companions
…and want to move from a static mascot to a living, responsive character, I’m happy to discuss implementation strategies.
Get in Touch
📧 Email
riveanimator@gmail.com
uiuxanimation@gmail.com
📱 WhatsApp
+94 71 700 0999