Praneeth Kawya Thathsara
How to Build Real-Time AI Lip Sync Using Rive State Machine + Viseme Data

AI products are rapidly moving beyond static chat interfaces. Voice-enabled assistants, AI tutors, and conversational agents are becoming standard in SaaS dashboards, mobile apps, and web platforms.

However, most AI avatars still feel disconnected from their voice output. The missing layer is synchronized, real-time lip movement.

This article explains how to build production-ready real-time AI lip sync using Rive State Machines and viseme data from modern Text-to-Speech (TTS) APIs such as OpenAI, Azure Cognitive Services, and ElevenLabs. The focus is on practical implementation for product designers, mobile developers, and startup teams building real AI-driven interfaces.


What Is a Viseme?

A viseme is the visual representation of a phoneme (a speech sound). When a TTS engine generates speech audio, many providers can also return timestamped alignment data. This data tells you:

  • Which mouth shape should be displayed
  • When it should appear
  • How long it should remain active

Instead of manually animating mouth movements, you map this structured speech data to predefined mouth shapes in your animation system.

In production systems, visemes are typically grouped into 8–12 mouth shapes rather than mapping every phoneme individually. This improves performance while maintaining believable speech animation.
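As an illustration, a grouped mapping can be a plain lookup table with a neutral fallback. The phoneme labels and group IDs below are assumptions for this sketch, not any provider's standard set:

```dart
// Illustrative phoneme-to-viseme grouping. The labels and IDs are
// assumptions for this sketch; real mappings depend on your TTS
// provider's phoneme inventory.
const Map<String, int> phonemeToViseme = {
  // 0: neutral / silence
  'sil': 0,
  // 1: closed lips (bilabials)
  'p': 1, 'b': 1, 'm': 1,
  // 2: open mouth
  'aa': 2, 'ah': 2,
  // 3: wide mouth
  'iy': 3, 'eh': 3,
  // 4: rounded lips
  'ow': 4, 'uw': 4,
  // 5: teeth on lower lip (labiodentals)
  'f': 5, 'v': 5,
};

// Unknown phonemes fall back to the neutral shape.
int visemeFor(String phoneme) => phonemeToViseme[phoneme] ?? 0;
```

Collapsing similar phonemes into one group is what keeps the state machine small enough to stay performant.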


How AI APIs Provide Viseme Data

Different providers expose alignment data differently:

Azure Cognitive Services (TTS)

  • Provides viseme events during synthesis
  • Includes viseme ID and audio offset
  • Designed for real-time animation use cases

OpenAI Voice Pipelines

  • Provides structured speech alignment depending on API configuration
  • Can output phoneme-level timing data
  • Requires mapping phonemes to viseme groups

ElevenLabs

  • Returns character-level timestamp alignment metadata alongside audio
  • Can be post-processed into viseme categories

In all cases, the implementation pattern is the same:

  1. Generate speech audio
  2. Capture timestamped alignment data
  3. Stream or play audio
  4. Drive animation state using timing data
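Because each provider shapes its alignment payload differently, it helps to normalize everything into one internal timeline before it touches animation code. The field names below (`offsetMs`, `visemeId`) are illustrative assumptions, not any provider's actual schema:

```dart
// A provider-agnostic timeline entry: which grouped mouth shape to show
// and when (milliseconds from the start of the audio).
class VisemeEvent {
  final int timeMs;
  final int visemeId;
  const VisemeEvent(this.timeMs, this.visemeId);
}

// Convert raw alignment entries (e.g. parsed provider JSON) into a
// timeline sorted by time, ready to drive the state machine.
// The 'offsetMs' / 'visemeId' keys are placeholders for your provider's
// actual field names.
List<VisemeEvent> buildTimeline(List<Map<String, dynamic>> raw) {
  return raw
      .map((e) => VisemeEvent(e['offsetMs'] as int, e['visemeId'] as int))
      .toList()
    ..sort((a, b) => a.timeMs.compareTo(b.timeMs));
}
```

Keeping this normalization step separate also makes it easy to swap TTS providers without touching the animation layer.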

Designing the Rive State Machine for Lip Sync

Rive is particularly suited for this use case because of its State Machine architecture and runtime performance across Flutter, Web, and mobile platforms.

Step 1: Create Mouth Shape Animations

Inside Rive:

  • Design a neutral mouth state
  • Design grouped mouth shapes (closed, open, wide, smile, etc.)
  • Keep vector complexity minimal
  • Avoid excessive blending for speech transitions

Production tip: Most AI avatars work well with 8–10 grouped visemes instead of full phoneme mapping.

Step 2: Add a Number Input

Inside your Rive State Machine:

  • Add a Number Input called viseme
  • Create transitions based on viseme values
  • Example mapping:
    • 0 → Neutral
    • 1 → Closed
    • 2 → Open
    • 3 → Wide

Number Inputs scale better than multiple triggers and simplify runtime logic.

Step 3: Keep Transitions Instant

For speech sync:

  • Avoid long transition blends
  • Use immediate transitions
  • Keep logic linear and predictable

This ensures accurate real-time updates.


Production Integration Example (Flutter)

Below is a simplified Flutter example showing how viseme values can drive a Rive State Machine during audio playback.

import 'dart:async';

import 'package:flutter/material.dart';
import 'package:rive/rive.dart';

class LipSyncAvatar extends StatefulWidget {
  const LipSyncAvatar({super.key});

  @override
  State<LipSyncAvatar> createState() => _LipSyncAvatarState();
}

class _LipSyncAvatarState extends State<LipSyncAvatar> {
  StateMachineController? _controller;
  SMINumber? _visemeInput;

  // Timestamped viseme events (ms from audio start), as produced by your
  // TTS provider after mapping to grouped viseme IDs.
  final List<Map<String, int>> visemeTimeline = [
    {"time": 0, "value": 0},
    {"time": 120, "value": 3},
    {"time": 240, "value": 5},
    {"time": 380, "value": 2},
  ];

  void _onRiveInit(Artboard artboard) {
    final controller = StateMachineController.fromArtboard(
      artboard,
      'LipSyncMachine',
    );
    if (controller == null) return; // state machine name not found
    artboard.addController(controller);
    _controller = controller;
    _visemeInput = controller.findInput<double>('viseme') as SMINumber?;
  }

  void playVisemes() {
    for (final event in visemeTimeline) {
      Timer(Duration(milliseconds: event["time"]!), () {
        // Rive number inputs hold doubles; convert the viseme ID explicitly.
        _visemeInput?.value = event["value"]!.toDouble();
      });
    }
  }

  @override
  void dispose() {
    _controller?.dispose();
    super.dispose();
  }

  @override
  Widget build(BuildContext context) {
    return RiveAnimation.asset(
      'assets/ai_avatar.riv',
      onInit: _onRiveInit,
    );
  }
}

In a real production system:

  • Sync viseme updates with audio playback time
  • Avoid relying only on Timer
  • Use audio player current position to drive updates
  • Skip redundant consecutive visemes

Audio and Animation Synchronization Strategy

For production reliability:

  • Start audio playback
  • Track current playback position
  • Compare against viseme timestamps
  • Update viseme input only when playback time crosses threshold
  • Use a frame-driven loop for smooth updates

Avoid naive setTimeout-style approaches in production systems. Instead, tie animation updates directly to audio time for accuracy.
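The strategy above can be sketched as a small helper called once per frame (for example from a Flutter `Ticker`) with the audio player's current position. The timeline uses the same `{"time", "value"}` shape as the earlier Flutter example, and the `setViseme` callback stands in for writing to the Rive number input:

```dart
// Frame-driven sync sketch. Call update() each frame with the audio
// player's current position in milliseconds. `setViseme` is a placeholder
// for the real sink, e.g. (v) => _visemeInput?.value = v.
class VisemeSync {
  final List<Map<String, int>> timeline; // sorted by "time"
  final void Function(double) setViseme;
  int _cursor = 0;
  int? _last;

  VisemeSync(this.timeline, this.setViseme);

  void update(int positionMs) {
    // Advance past every event whose timestamp has been reached.
    while (_cursor < timeline.length &&
        timeline[_cursor]['time']! <= positionMs) {
      _cursor++;
    }
    if (_cursor == 0) return; // playback hasn't reached the first event
    final value = timeline[_cursor - 1]['value']!;
    if (value != _last) { // skip redundant consecutive visemes
      _last = value;
      setViseme(value.toDouble());
    }
  }
}
```

Because the update is driven by the audio clock rather than timers, the mouth stays in sync even if playback stutters, is paused, or starts late.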


Performance Considerations for Mobile and Web

When deploying AI avatars in real apps:

  • Reduce vector complexity in mouth shapes
  • Minimize State Machine branching
  • Avoid heavy nested artboards
  • Preload Rive files before speech begins
  • Batch process viseme updates when possible

Performance issues typically arise from over-designed assets rather than runtime logic.


Real-World Product Use Cases

This architecture is already being used in:

  • AI onboarding assistants in SaaS dashboards
  • Voice-based mobile AI tutors
  • Conversational healthcare assistants
  • AI-powered support agents
  • EdTech and language learning apps

The key difference between a toy demo and a production feature is synchronization precision, asset optimization, and scalable state logic.


Why Rive Is Suitable for AI Avatars

Compared to video-based avatars:

  • Rive supports real-time state-driven control
  • Works across Flutter, Web, iOS, Android
  • Lightweight and runtime efficient
  • Fully programmable from code
  • Allows scalable animation logic

For startups building AI-first products, Rive enables dynamic interaction instead of static playback.


Implementation Checklist for Production Teams

Before shipping:

  • Validate viseme grouping strategy
  • Test sync accuracy under network latency
  • Benchmark performance on mid-tier Android devices
  • Ensure fallback behavior when alignment data is missing
  • Keep animation logic independent from AI provider

This ensures your AI avatar remains platform-agnostic and scalable.


AI products are evolving from text interfaces to embodied, voice-driven systems. Real-time lip sync is not cosmetic; it improves user trust, engagement, and perceived intelligence.

By combining:

  • AI voice APIs
  • Timestamped viseme data
  • Rive State Machines
  • Platform-native runtime integration

you can build production-ready AI avatars that feel integrated rather than layered on top.


If you’re building an AI product and need production-level interactive Rive animation designed for real-time voice systems, collaboration with a specialist can significantly reduce iteration time.

Praneeth Kawya Thathsara

Full-Time Rive Animator

Email: uiuxanimation@gmail.com

WhatsApp: +94 71 700 0999
