AI voice interfaces are no longer experimental. SaaS platforms, mobile apps, and AI-first startups are embedding conversational assistants directly into their products. However, most voice features still rely on static UI or pre-rendered animations.
If you want a production-ready AI voice avatar, you need a real-time animation system driven by actual speech data.
This guide covers:
- Proper Rive State Machine setup
- Viseme input mapping strategy
- Trigger vs Number input decisions
- Accurate audio synchronization
- Flutter and Web integration examples
This is written for product designers, mobile developers, and startup teams building real AI systems — not demo prototypes.
Architecture Overview
A production AI voice avatar typically follows this pipeline:
- Text → AI TTS API
- TTS API → Audio + timestamp alignment data
- Alignment data → Viseme mapping
- Audio playback → Drives Rive State Machine
- Rive → Real-time lip sync animation
The animation must be deterministic, lightweight, and tightly synced to audio playback.
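The pipeline above can be sketched as a small conversion step: alignment data in, viseme timeline out. The shape of `alignment` and the `phonemeToViseme` table below are illustrative assumptions, not any specific provider's schema:

```javascript
// Hypothetical TTS alignment data: phoneme start times in milliseconds.
const alignment = [
  { phoneme: "m", startMs: 0 },
  { phoneme: "a", startMs: 80 },
  { phoneme: "o", startMs: 200 },
];

// Illustrative phoneme → Rive viseme number mapping (values are assumptions
// that must match your State Machine transitions).
const phonemeToViseme = { m: 1, b: 1, p: 1, a: 2, e: 3, o: 4, u: 4 };

// Convert alignment data into the timeline that drives the State Machine.
function buildVisemeTimeline(alignment) {
  return alignment.map(({ phoneme, startMs }) => ({
    time: startMs,
    value: phonemeToViseme[phoneme] ?? 0, // unknown phonemes fall back to Neutral
  }));
}

// buildVisemeTimeline(alignment)
// → [{ time: 0, value: 1 }, { time: 80, value: 2 }, { time: 200, value: 4 }]
```

The resulting `{time, value}` timeline is the only thing the animation layer ever sees; everything upstream of it is provider-specific.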
State Machine Setup in Rive
Step 1: Design Mouth Shapes
Do not design 30 phoneme-specific shapes. In production, group them into 8–12 viseme categories:
- Neutral
- Closed (M, B, P)
- Slight Open
- Wide (E)
- Round (O, U)
- Smile
- Rest
- Emphasis
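Whichever categories you settle on, pin their numeric values down in one place, since the Number Input transitions in Rive will depend on them. A minimal sketch (this particular numbering is an assumption and must match your State Machine conditions):

```javascript
// One possible numbering for the viseme categories above.
const VISEME = {
  NEUTRAL: 0,
  CLOSED: 1,      // M, B, P
  SLIGHT_OPEN: 2,
  WIDE: 3,        // E
  ROUND: 4,       // O, U
  SMILE: 5,
  REST: 6,
  EMPHASIS: 7,
};
```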
Keep vector paths clean and optimized. Overly complex paths reduce mobile performance.
Step 2: Create a Lip Sync State Machine
Inside Rive:
- Create a new State Machine (e.g., LipSyncMachine)
- Add a Number Input named viseme
- Create states for each mouth animation
- Add conditional transitions based on viseme value
Example logic:
- If viseme == 0 → Neutral
- If viseme == 1 → Closed
- If viseme == 2 → Open
- If viseme == 3 → Wide
Keep transitions instant. Avoid long blend animations for speech.
Viseme Input Mapping Strategy
AI APIs typically return either:
- Viseme IDs
- Phoneme-level alignment
- Word-level timing
You should create a mapping layer between API output and your Rive input values.
Example mapping table:
- API Viseme 0 → Rive viseme 0 (Neutral)
- API Viseme 1 → Rive viseme 2 (Open)
- API Viseme 2 → Rive viseme 3 (Wide)
Never couple your Rive file directly to one provider’s ID system. Always use a translation layer. This ensures you can switch between OpenAI, Azure, or ElevenLabs without redesigning your animation.
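A translation layer can be as small as a per-provider lookup table. The provider names and IDs below are placeholders, not real API constants:

```javascript
// Translation layer: provider-specific viseme IDs → our Rive input values.
// Provider keys and IDs here are illustrative placeholders.
const PROVIDER_MAPS = {
  providerA: { 0: 0, 1: 2, 2: 3 },          // numeric viseme IDs
  providerB: { sil: 0, PP: 1, aa: 2 },       // string phoneme-group IDs
};

function toRiveViseme(provider, apiViseme) {
  const map = PROVIDER_MAPS[provider];
  // Unknown providers or IDs fall back to Neutral instead of breaking.
  return map?.[apiViseme] ?? 0;
}
```

Swapping TTS vendors then means adding one entry to `PROVIDER_MAPS`; the Rive file and the sync code never change.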
Trigger vs Number Input in Rive
This is a common developer question.
Trigger Input
Use when:
- You need one-time events (blink, nod)
- State change is momentary
Not ideal for continuous viseme switching.
Number Input
Use when:
- You need scalable viseme control
- Multiple mouth shapes exist
- Values update frequently
- Speech runs continuously
For AI lip sync, Number Input is the correct choice in nearly all production scenarios.
Why?
- Cleaner logic
- Easier mapping
- Less state explosion
- Better performance
Syncing Audio Playback with Animation
Naive implementation:
- Use setTimeout or delayed timers
- Update viseme value based on timestamp
This works in demos, but not reliably in production.
Production-Ready Approach
- Start audio playback
- Track playback position
- On each frame:
- Compare current time to viseme timeline
- Update viseme input only when threshold is crossed
On Web, use requestAnimationFrame.
On Flutter, use a periodic stream or audio position listener.
Key rules:
- Skip duplicate consecutive visemes
- Do not update input every frame unnecessarily
- Ensure audio latency is accounted for
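The rules above can be packaged into one reusable helper: it keeps a pointer index instead of rescanning the timeline, and only writes to the Rive input when the value actually changes. A sketch, assuming a `{time, value}` timeline in milliseconds:

```javascript
// Frame-driven sync: pointer index, no duplicate writes.
function createVisemeSync(timeline, setInput) {
  let index = 0;
  let lastValue = null;
  // Call once per frame with the current playback position in ms.
  return function update(currentMs) {
    // Advance past every event whose time has been reached.
    while (index < timeline.length && currentMs >= timeline[index].time) {
      index++;
    }
    const event = timeline[index - 1];
    if (event && event.value !== lastValue) {
      lastValue = event.value;
      setInput(event.value); // single write per actual viseme change
    }
  };
}
```

On the Web you would call the returned `update` from `requestAnimationFrame` with `audio.currentTime * 1000`; in Flutter, from an audio position listener.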
Flutter Integration Example
Below is a simplified production-style Flutter implementation:
```dart
import 'dart:async';

import 'package:flutter/material.dart';
import 'package:rive/rive.dart';
import 'package:audioplayers/audioplayers.dart';

class VoiceAvatar extends StatefulWidget {
  const VoiceAvatar({super.key});

  @override
  State<VoiceAvatar> createState() => _VoiceAvatarState();
}

class _VoiceAvatarState extends State<VoiceAvatar> {
  StateMachineController? _controller;
  SMINumber? _visemeInput;
  final AudioPlayer _audioPlayer = AudioPlayer();
  StreamSubscription<Duration>? _positionSub;

  // In production this timeline comes from TTS alignment data.
  final List<Map<String, int>> visemeTimeline = [
    {"time": 0, "value": 0},
    {"time": 100, "value": 2},
    {"time": 250, "value": 3},
    {"time": 400, "value": 1},
  ];

  void _onRiveInit(Artboard artboard) {
    final controller =
        StateMachineController.fromArtboard(artboard, 'LipSyncMachine');
    if (controller == null) return; // state machine name mismatch
    artboard.addController(controller);
    _controller = controller;
    _visemeInput = controller.findInput<double>('viseme') as SMINumber?;
  }

  void _syncVisemes() {
    // Replace any previous subscription so replays don't stack listeners.
    _positionSub?.cancel();
    _positionSub = _audioPlayer.onPositionChanged.listen((position) {
      final currentMs = position.inMilliseconds;
      for (final event in visemeTimeline) {
        if (currentMs >= event["time"]!) {
          // SMINumber holds a double, so convert the int timeline value.
          _visemeInput?.value = event["value"]!.toDouble();
        }
      }
    });
  }

  Future<void> playAudio() async {
    await _audioPlayer.play(UrlSource("https://example.com/audio.mp3"));
    _syncVisemes();
  }

  @override
  void dispose() {
    _positionSub?.cancel();
    _audioPlayer.dispose();
    _controller?.dispose();
    super.dispose();
  }

  @override
  Widget build(BuildContext context) {
    return Column(
      children: [
        SizedBox(
          height: 300, // give the Rive canvas a bounded size inside the Column
          child: RiveAnimation.asset(
            'assets/ai_avatar.riv',
            onInit: _onRiveInit,
          ),
        ),
        ElevatedButton(
          onPressed: playAudio,
          child: const Text("Play"),
        ),
      ],
    );
  }
}
```
In production, optimize the timeline iteration to avoid looping the full list every position update. Maintain a pointer index instead.
Web Integration Example (JavaScript)
```javascript
let visemeInput = null;

const riveInstance = new rive.Rive({
  src: "ai_avatar.riv",
  canvas: document.getElementById("canvas"),
  stateMachines: "LipSyncMachine",
  autoplay: true,
  onLoad: () => {
    const inputs = riveInstance.stateMachineInputs("LipSyncMachine");
    visemeInput = inputs.find((i) => i.name === "viseme");
  },
});

const audio = new Audio("speech.mp3");

const visemeTimeline = [
  { time: 0, value: 0 },
  { time: 120, value: 2 },
  { time: 260, value: 3 },
];

let index = 0;

function sync() {
  const currentTime = audio.currentTime * 1000;
  if (
    visemeInput && // Rive may still be loading
    index < visemeTimeline.length &&
    currentTime >= visemeTimeline[index].time
  ) {
    visemeInput.value = visemeTimeline[index].value;
    index++;
  }
  requestAnimationFrame(sync);
}

// Browsers block autoplay with sound; in a real app, start playback
// from a user gesture such as a button click.
audio.play();
requestAnimationFrame(sync);
```
This approach ensures animation stays locked to actual playback time.
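One refinement worth adding: the loop above assumes playback time only moves forward. If the user seeks backwards or replays the clip, the pointer must be rewound. A small helper, assuming the same `{time, value}` timeline shape as above:

```javascript
// Find the pointer index for an arbitrary playback position, so the
// sync loop can recover after a backwards seek or a replay.
function rewindIndex(timeline, currentMs) {
  // First event at or after the new playback position.
  const i = timeline.findIndex((e) => e.time >= currentMs);
  return i === -1 ? timeline.length : i;
}
```

In the sync loop, track the previous frame's time and call `index = rewindIndex(visemeTimeline, currentTime)` whenever the current time is smaller than the last one.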
Production Optimization Checklist
Before shipping:
- Reduce State Machine branching
- Minimize vector complexity
- Test on mid-range Android devices
- Validate latency under slow network conditions
- Ensure fallback animation when alignment data is unavailable
- Separate animation logic from AI provider logic
AI voice avatars are becoming a core product feature, not a cosmetic enhancement. When implemented correctly, real-time lip sync:
- Increases user trust
- Improves engagement
- Enhances perceived intelligence
- Differentiates your AI product from competitors
Rive’s State Machine architecture makes it possible to build scalable, cross-platform, production-grade AI avatars driven directly by speech data.
If you are building an AI-powered application and need a production-ready Rive animation system optimized for real-time voice interaction, working with a specialist can significantly accelerate development.
Praneeth Kawya Thathsara
Full-Time Rive Animator
Email: uiuxanimation@gmail.com
WhatsApp: +94 71 700 0999