Praneeth Kawya Thathsara
Don't Build Just Another Chatbot: Architecting a "Duolingo-Style" AI Companion with Rive

*By Praneeth Kawya Thathsara, Rive Specialist & Founder @ UI Animation Agency*
We are drowning in "AI Wrappers." If you are building an AI language tutor, a roleplay app, or a mental health companion, you have a problem: Text interfaces are boring.

The apps winning the race right now (like Duolingo’s "Lily" or character.ai) aren’t just outputting tokens; they are rendering performance.

As a Rive animator who specializes in AI interactions, I’ve seen the backend of many of these projects. The difference between a "toy" app and a "product" usually comes down to one thing: The Lip Sync Architecture.

In this post, I’m going to break down the technical setup required to build a reactive, lip-syncing AI character using Rive, moving beyond simple volume-bouncing to phoneme-accurate speech.

The Architecture: Puppet vs. Puppeteer
To build a character that feels alive, you need to separate concerns:

The Puppet (Rive): A state machine that handles morphing shapes based on numeric inputs.

The Puppeteer (Your Code): React/Flutter/Swift logic that parses audio and sends signals to the puppet.

Level 1: The "Muppet" Method (Amplitude)

The fast way.

If you are shipping an MVP tomorrow, this is where you start. You analyze the Root Mean Square (RMS) of the audio amplitude.

Rive Setup: A 1D Blend State. Input 0 is mouth closed. Input 100 is mouth wide open.

Code: riveInput.value = normalizedVolume;
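As a sketch, the amplitude path can be wired up with the Web Audio API in a browser. The `AnalyserNode` plumbing and the gain factor here are assumptions, not part of any particular rig; tune the scaling to your character:

```javascript
// Minimal amplitude driver (sketch). Assumes a Web Audio AnalyserNode
// fed by the TTS audio, and a Rive number input that blends 0 (mouth
// closed) to 100 (wide open). The x300 gain is an arbitrary start point.
function computeRms(samples) {
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  return Math.sqrt(sumSquares / samples.length);
}

function driveMouth(analyser, mouthInput) {
  const buffer = new Float32Array(analyser.fftSize);
  const tick = () => {
    analyser.getFloatTimeDomainData(buffer);
    // Float samples sit in [-1, 1], so RMS lands in [0, 1];
    // scale up and clamp to the 0-100 blend range.
    mouthInput.value = Math.min(computeRms(buffer) * 300, 100);
    requestAnimationFrame(tick);
  };
  tick();
}
```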

The Problem: It looks like a Muppet. The character opens their mouth wide for "OO" sounds and "EE" sounds alike. It lacks nuance.

Level 2: The "Viseme" Method (Phonetic Mapping)

The Duolingo way.

To get that crisp "Lily" sarcasm or realistic speech, we stop using volume. We use Visemes. Visemes are the visual equivalent of phonemes (the sounds we make).

If you use a TTS provider like the Azure Speech SDK or Amazon Polly, you don't just get audio back; you also get Viseme Events—timestamped cues describing the shape of the mouth at each moment (Azure sends integer viseme IDs; Polly sends viseme speech marks).
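With Azure, for example, subscribing to those events looks roughly like this (a sketch assuming the `microsoft-cognitiveservices-speech-sdk` package and an already-configured `SpeechSynthesizer`; Azure reports offsets in 100-nanosecond ticks):

```javascript
// Azure reports audio offsets in 100-nanosecond ticks.
const ticksToMs = (ticks) => ticks / 10000;

// Attach a handler to the synthesizer's viseme stream.
// `onViseme` receives (visemeId, offsetMs) for each mouth-shape cue.
function attachVisemeListener(synthesizer, onViseme) {
  synthesizer.visemeReceived = (sender, event) => {
    onViseme(event.visemeId, ticksToMs(event.audioOffset));
  };
}
```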

The Rive State Machine

Instead of a single "Mouth Open" blend, I build a state machine with roughly 12-15 discrete mouth shapes:

  • Sil (Silence/Idle)
  • PP (Lips pressed - P, B, M)
  • FF (Teeth on lip - F, V)
  • TH (Tongue out - TH)
  • DD (Tongue behind teeth - T, D, S)
  • kk (Open back - K, G)
  • aa (Wide - A)
  • O (Round - O) ...and so on.

I map these to a Number Input called viseme_id.

The Code Logic

In your frontend (e.g., React Native or Flutter), your listener looks like this (JavaScript-flavored pseudo-code):

```javascript
// Look the state-machine inputs up once, outside the event handler
// (@rive-app/canvas web runtime shown; the Flutter and React Native
// runtimes expose equivalent calls).
const inputs = riveInstance.stateMachineInputs('CharacterSM');
const mouthInput = inputs.find((input) => input.name === 'viseme_id');

ttsService.on('visemeReceived', (visemeID) => {
  // Map the TTS provider's ID to your Rive ID
  // (Azure defines viseme IDs 0-21; the rig might only need ~12).
  mouthInput.value = mapAzureToRive(visemeID);
});
```
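The `mapAzureToRive` helper is just a lookup table. A hypothetical version is below: the Azure-side comments follow Microsoft's published viseme ID table, but the Rive-side numbers are invented to match the example rig order (0 = Sil, 1 = PP, 2 = FF, ...); yours will differ.

```javascript
// Hypothetical Azure-viseme-ID -> Rive viseme_id lookup.
// Unmapped IDs fall back to the idle mouth (0).
const AZURE_TO_RIVE = {
  0: 0,   // silence      -> Sil
  21: 1,  // p, b, m      -> PP
  18: 2,  // f, v         -> FF
  17: 3,  // th (voiced)  -> TH
  19: 4,  // d, t, n      -> DD
  20: 5,  // k, g         -> kk
  2: 6,   // "ah" vowels  -> aa
  8: 7,   // "oh" vowels  -> O
  // ...fill in the rest for your rig.
};

function mapAzureToRive(azureVisemeId) {
  return AZURE_TO_RIVE[azureVisemeId] ?? 0;
}
```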

The Secret: Layered Micro-Behaviors

Lip sync is only 50% of the illusion. If the character stares unblinkingly while talking, it enters the "Uncanny Valley."

To fix this, I use Layered State Machines in Rive. This allows multiple timelines to play simultaneously without conflict.

**Layer 1: Mouth.** (Controlled by code.)

**Layer 2: Eyes.** (Self-contained loop.) I rig a "Randomize" listener inside Rive that triggers a blink or an eye-dart every 2-5 seconds. The code doesn't need to touch this. It happens automatically.

**Layer 3: Emotions.** (Boolean inputs.) isBored, isHappy, isThinking.

Handling "The Pause" (Latency)
The biggest UX killer in AI voice chat is the 2-3 seconds of silence while the LLM generates the answer.

You cannot let the character freeze here. I build a specific "Thinking" loop for this state.

1. User stops talking.

2. App triggers: isThinking = true.

3. Rive animation: the character looks up to the left (accessing memory), taps a finger, or, in the case of a sarcastic character, rolls their eyes impatiently.

4. Audio stream starts: isThinking = false, and viseme data starts flowing.
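In code, that toggle is small enough to isolate behind a helper. A sketch, where `thinkingInput` stands for whatever boolean state-machine input your rig exposes (Rive runtime inputs behave like objects with a writable `value`):

```javascript
// Tiny controller for the latency-masking state. Pure logic, so the
// "input" can be any object with a writable `value` property.
function createThinkingController(thinkingInput) {
  return {
    // User went silent; the LLM is generating. Play the thinking loop.
    onUserStoppedTalking() { thinkingInput.value = true; },
    // First TTS audio arrived; hand control back to the viseme layer.
    onAudioStarted() { thinkingInput.value = false; },
  };
}
```

Wire `onUserStoppedTalking` to your voice-activity detector's silence event and `onAudioStarted` to the TTS stream's first chunk.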

Need Help Rigging This?

Building the bridge between high-quality animation and functional code is a niche skill. I am Praneeth Kawya Thathsara, founder of UI Animation Agency.

I work with startups to build these exact systems—delivering not just the .riv file, but the logic map your developers need to hook it up.

Connect on LinkedIn: https://www.linkedin.com/in/praneethkawyathathsara/

Email: riveanimator@gmail.com | Phone: +94 71 700 0999
