
Praneeth Kawya Thathsara

Rive State Machine for AI Voice Avatars — Complete Developer Guide

AI voice interfaces are no longer experimental. SaaS platforms, mobile apps, and AI-first startups are embedding conversational assistants directly into their products. However, most voice features still rely on static UI or pre-rendered animations.

If you want a production-ready AI voice avatar, you need a real-time animation system driven by actual speech data.

This guide covers:

  • Proper Rive State Machine setup
  • Viseme input mapping strategy
  • Trigger vs Number input decisions
  • Accurate audio synchronization
  • Flutter and Web integration examples

This is written for product designers, mobile developers, and startup teams building real AI systems — not demo prototypes.


Architecture Overview

A production AI voice avatar typically follows this pipeline:

  1. Text → AI TTS API
  2. TTS API → Audio + timestamp alignment data
  3. Alignment data → Viseme mapping
  4. Audio playback → Drives Rive State Machine
  5. Rive → Real-time lip sync animation

The animation must be deterministic, lightweight, and tightly synced to audio playback.
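
As a concrete sketch of steps 2–3, a thin mapping layer can convert provider alignment data into a viseme timeline that the playback loop later consumes. The alignment shape and the phoneme table below are illustrative assumptions, not any specific provider's format:

```javascript
// Hypothetical shapes for illustration; real TTS APIs differ per provider.
// alignment: [{ phoneme, startMs }]  →  timeline: [{ time, value }]
const PHONEME_TO_VISEME = { sil: 0, m: 1, b: 1, p: 1, aa: 2, eh: 3 };

function buildVisemeTimeline(alignment) {
  return alignment.map(({ phoneme, startMs }) => ({
    time: startMs,
    // Unknown phonemes fall back to Neutral (0) so playback never breaks.
    value: PHONEME_TO_VISEME[phoneme] ?? 0,
  }));
}
```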


State Machine Setup in Rive

Step 1: Design Mouth Shapes

Do not design 30 phoneme-specific shapes. In production, group them into 8–12 viseme categories:

  • Neutral
  • Closed (M, B, P)
  • Slight Open
  • Wide (E)
  • Round (O, U)
  • Smile
  • Rest
  • Emphasis

Keep vector paths clean and optimized. Overly complex paths reduce mobile performance.
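
It helps to freeze these categories into numeric IDs up front, because the same numbers must reappear in the Rive transition conditions later. The specific numbering below is a convention of this sketch, not something Rive prescribes:

```javascript
// Numeric IDs for the viseme Number Input. The ordering is arbitrary,
// but it must match the transition conditions inside the Rive file.
const Viseme = Object.freeze({
  NEUTRAL: 0,
  CLOSED: 1,      // M, B, P
  SLIGHT_OPEN: 2,
  WIDE: 3,        // E
  ROUND: 4,       // O, U
  SMILE: 5,
  REST: 6,
  EMPHASIS: 7,
});
```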

Step 2: Create a Lip Sync State Machine

Inside Rive:

  • Create a new State Machine (e.g., LipSyncMachine)
  • Add a Number Input named viseme
  • Create states for each mouth animation
  • Add conditional transitions based on viseme value

Example logic:

  • If viseme == 0 → Neutral
  • If viseme == 1 → Closed
  • If viseme == 2 → Open
  • If viseme == 3 → Wide

Keep transitions instant. Avoid long blend animations for speech.
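
The transition logic above can be mirrored in a plain function, which is handy for unit-testing your viseme mapping without opening the Rive editor. The state names are the ones from the example:

```javascript
// Mirrors the example transition conditions: returns the state name
// the machine should land in for a given viseme value.
function expectedState(viseme) {
  switch (viseme) {
    case 0: return "Neutral";
    case 1: return "Closed";
    case 2: return "Open";
    case 3: return "Wide";
    default: return "Neutral"; // out-of-range values fall back to Neutral
  }
}
```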


Viseme Input Mapping Strategy

AI APIs typically return either:

  • Viseme IDs
  • Phoneme-level alignment
  • Word-level timing

You should create a mapping layer between API output and your Rive input values.

Example mapping table:

  • API Viseme 0 → Rive viseme 0 (Neutral)
  • API Viseme 1 → Rive viseme 2 (Open)
  • API Viseme 2 → Rive viseme 3 (Wide)

Never couple your Rive file directly to one provider’s ID system. Always use a translation layer. This ensures you can switch between OpenAI, Azure, or ElevenLabs without redesigning your animation.
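
A minimal translation layer might look like this. The provider tables are illustrative placeholders, not the real ID sets of OpenAI, Azure, or ElevenLabs:

```javascript
// One lookup table per provider keeps the Rive file provider-agnostic.
const PROVIDER_MAPS = {
  providerA: { 0: 0, 1: 2, 2: 3 },          // numeric API viseme ID → Rive value
  providerB: { sil: 0, open: 2, wide: 3 },  // some providers use string keys
};

function toRiveViseme(provider, apiViseme) {
  const map = PROVIDER_MAPS[provider];
  return map?.[apiViseme] ?? 0; // unknown providers or IDs degrade to Neutral
}
```

Switching vendors then means adding one table, with no change to the Rive file or the playback loop.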


Trigger vs Number Input in Rive

This is a common developer question.

Trigger Input

Use when:

  • You need one-time events (blink, nod)
  • State change is momentary

Not ideal for continuous viseme switching.

Number Input

Use when:

  • You need scalable viseme control
  • Multiple mouth shapes exist
  • Values update frequently
  • Speech runs continuously

For AI lip sync, Number Input is the correct choice in nearly all production scenarios.

Why?

  • Cleaner logic
  • Easier mapping
  • Less state explosion
  • Better performance
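
In the Rive web runtime the two input kinds are also driven differently: triggers are fired, numbers are assigned. A sketch, assuming a Trigger named blink exists in the file alongside the viseme Number Input:

```javascript
// Triggers use .fire(); Number inputs use .value.
// `inputs` comes from riveInstance.stateMachineInputs("LipSyncMachine").
function wireInputs(inputs) {
  const blink = inputs.find((i) => i.name === "blink");   // Trigger (assumed name)
  const viseme = inputs.find((i) => i.name === "viseme"); // Number input

  blink?.fire();                // one-shot event
  if (viseme) viseme.value = 2; // continuous value, updated throughout speech
  return { blink, viseme };
}
```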

Syncing Audio Playback with Animation

Naive implementation:

  • Use setTimeout or delayed timers
  • Update viseme value based on timestamp

This works in demos, but timers drift relative to actual audio playback (buffering, dropped frames, device latency), so lip sync desynchronizes over longer utterances.

Production-Ready Approach

  1. Start audio playback
  2. Track playback position
  3. On each frame:
    • Compare current time to viseme timeline
    • Update viseme input only when threshold is crossed

On Web, use requestAnimationFrame.
On Flutter, use a periodic stream or audio position listener.

Key rules:

  • Skip duplicate consecutive visemes
  • Do not update input every frame unnecessarily
  • Ensure audio latency is accounted for
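
The duplicate-skipping rule can be enforced with a small wrapper around the input. This is a sketch: `input` is any object with a numeric `value` property, such as a Rive Number input:

```javascript
// Only writes to the input when the viseme actually changes,
// avoiding redundant state machine evaluations every frame.
function makeVisemeSetter(input) {
  let last = null;
  return (value) => {
    if (value === last) return false; // duplicate: skip the write
    input.value = value;
    last = value;
    return true;
  };
}
```

Call the returned setter from your frame loop or position listener; repeated identical values become no-ops.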

Flutter Integration Example

Below is a simplified production-style Flutter implementation:

import 'dart:async';

import 'package:flutter/material.dart';
import 'package:rive/rive.dart';
import 'package:audioplayers/audioplayers.dart';

class VoiceAvatar extends StatefulWidget {
  const VoiceAvatar({super.key});

  @override
  State<VoiceAvatar> createState() => _VoiceAvatarState();
}

class _VoiceAvatarState extends State<VoiceAvatar> {
  StateMachineController? _controller;
  SMINumber? _visemeInput;
  final AudioPlayer _audioPlayer = AudioPlayer();
  StreamSubscription<Duration>? _positionSub;
  double _lastViseme = -1;

  final List<Map<String, dynamic>> visemeTimeline = [
    {"time": 0, "value": 0},
    {"time": 100, "value": 2},
    {"time": 250, "value": 3},
    {"time": 400, "value": 1},
  ];

  void _onRiveInit(Artboard artboard) {
    final controller = StateMachineController.fromArtboard(
      artboard,
      'LipSyncMachine',
    );
    if (controller == null) return; // state machine name not found in the file
    artboard.addController(controller);
    _controller = controller;
    _visemeInput = controller.findInput<double>('viseme') as SMINumber?;
  }

  void _syncVisemes() {
    // Subscribe once; repeated play calls reuse the same listener.
    _positionSub ??= _audioPlayer.onPositionChanged.listen((position) {
      final currentMs = position.inMilliseconds;
      double value = _lastViseme;
      for (final event in visemeTimeline) {
        if (currentMs >= event["time"]) {
          value = (event["value"] as num).toDouble();
        }
      }
      // Skip duplicate consecutive visemes.
      if (value != _lastViseme) {
        _visemeInput?.value = value;
        _lastViseme = value;
      }
    });
  }

  Future<void> playAudio() async {
    await _audioPlayer.play(UrlSource("https://example.com/audio.mp3"));
    _syncVisemes();
  }

  @override
  void dispose() {
    _positionSub?.cancel();
    _audioPlayer.dispose();
    _controller?.dispose();
    super.dispose();
  }

  @override
  Widget build(BuildContext context) {
    return Column(
      children: [
        RiveAnimation.asset(
          'assets/ai_avatar.riv',
          onInit: _onRiveInit,
        ),
        ElevatedButton(
          onPressed: playAudio,
          child: const Text("Play"),
        ),
      ],
    );
  }
}

In production, optimize the timeline iteration to avoid looping the full list every position update. Maintain a pointer index instead.


Web Integration Example (JavaScript)

let visemeInput; // populated once the Rive file has loaded

const riveInstance = new rive.Rive({
  src: "ai_avatar.riv",
  canvas: document.getElementById("canvas"),
  stateMachines: "LipSyncMachine",
  autoplay: true,
  onLoad: () => {
    const inputs = riveInstance.stateMachineInputs("LipSyncMachine");
    visemeInput = inputs.find((i) => i.name === "viseme");
  },
});

const audio = new Audio("speech.mp3");
const visemeTimeline = [
  { time: 0, value: 0 },
  { time: 120, value: 2 },
  { time: 260, value: 3 },
];

let index = 0;

function sync() {
  const currentTime = audio.currentTime * 1000;
  // Catch up on every event the playhead has passed, not just one per frame.
  while (index < visemeTimeline.length &&
         currentTime >= visemeTimeline[index].time) {
    if (visemeInput) visemeInput.value = visemeTimeline[index].value;
    index++;
  }
  if (!audio.ended) requestAnimationFrame(sync);
}

// Browsers require a user gesture before playback; call play() from a click handler.
audio.play();
requestAnimationFrame(sync);

This approach ensures animation stays locked to actual playback time.


Production Optimization Checklist

Before shipping:

  • Reduce State Machine branching
  • Minimize vector complexity
  • Test on mid-range Android devices
  • Validate latency under slow network conditions
  • Ensure fallback animation when alignment data is unavailable
  • Separate animation logic from AI provider logic

AI voice avatars are becoming a core product feature, not a cosmetic enhancement. When implemented correctly, real-time lip sync:

  • Increases user trust
  • Improves engagement
  • Enhances perceived intelligence
  • Differentiates your AI product from competitors

Rive’s State Machine architecture makes it possible to build scalable, cross-platform, production-grade AI avatars driven directly by speech data.

If you are building an AI-powered application and need a production-ready Rive animation system optimized for real-time voice interaction, working with a specialist can significantly accelerate development.

Praneeth Kawya Thathsara

Full-Time Rive Animator

Email: uiuxanimation@gmail.com

WhatsApp: +94 71 700 0999
