<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Georgy Dev</title>
    <description>The latest articles on DEV Community by Georgy Dev (@georgydev).</description>
    <link>https://dev.to/georgydev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3773917%2Fa8f24f18-18f6-415a-877d-c54fb8995b3a.png</url>
      <title>DEV Community: Georgy Dev</title>
      <link>https://dev.to/georgydev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/georgydev"/>
    <language>en</language>
    <item>
      <title>Building Voice AI NPCs in Unreal Engine: Speech Recognition to Lip Sync Pipeline</title>
      <dc:creator>Georgy Dev</dc:creator>
      <pubDate>Sun, 15 Feb 2026 13:42:49 +0000</pubDate>
      <link>https://dev.to/georgydev/how-to-create-ai-npcs-in-unreal-engine-with-speech-recognition-tts-and-metahuman-lip-sync-with-58ih</link>
      <guid>https://dev.to/georgydev/how-to-create-ai-npcs-in-unreal-engine-with-speech-recognition-tts-and-metahuman-lip-sync-with-58ih</guid>
      <description>&lt;p&gt;I recently put together a demo project that shows how to create fully interactive AI NPCs in Unreal Engine using speech recognition, AI chatbots, text-to-speech, and realistic lip synchronization with facial animations. The entire system is built with Blueprints and works across Windows, Linux, Mac, iOS, and Android.&lt;/p&gt;

&lt;p&gt;If you’ve been exploring AI NPC solutions like ConvAI or Charisma.ai, you’ve probably noticed the tradeoffs: metered API costs that scale with your player count, latency from network round trips, and dependency on cloud infrastructure. This modular approach gives you more control: run components locally or pick your own cloud providers, avoid per-conversation billing, and keep your players’ interactions private if needed. You own the pipeline, so you can optimize for what actually matters to your game. With local inference and direct audio-based lip sync, you can also achieve lower latency and more realistic facial animation; the demo video below shows the difference.&lt;/p&gt;

&lt;p&gt;Here’s an example of the real-time lip sync quality achievable with this system:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo33j1y0lvviz04l65iy8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo33j1y0lvviz04l65iy8.gif" alt="Real-time lip sync demo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;
  &lt;iframe src="https://www.youtube.com/embed/PVWx67wSRgI"&gt;&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;What This System Does&lt;/h2&gt;

&lt;p&gt;The workflow creates a natural conversation loop with an AI character:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Player speaks into the microphone → speech recognition converts it to text&lt;/li&gt;
&lt;li&gt; Text goes to an AI chatbot (OpenAI, Claude, DeepSeek, etc.) → AI generates a response&lt;/li&gt;
&lt;li&gt; Response is converted to speech via text-to-speech&lt;/li&gt;
&lt;li&gt; Character’s lips sync perfectly with the spoken audio&lt;/li&gt;
&lt;/ol&gt;
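&lt;p&gt;The four steps above can be sketched in plain Python (conceptual only; the actual system is wired with Blueprint nodes, and every function name here is an illustrative placeholder, not plugin API):&lt;/p&gt;

```python
# Conceptual sketch of one conversation turn. The real system is built
# with Blueprints; these names are illustrative placeholders.

def run_turn(player_input, chatbot, text_to_speech, speech_to_text=None):
    """One turn: player input in, synthesized reply audio out.

    If speech_to_text is provided, player_input is microphone audio;
    otherwise it is typed text (the recognition stage is optional).
    """
    text = speech_to_text(player_input) if speech_to_text else player_input
    reply_text = chatbot(text)
    # The returned audio is what gets fed to the lip sync generator
    # while it plays back, so the mouth tracks the synthesized speech.
    return text_to_speech(reply_text)
```

&lt;p&gt;Each stage is pluggable, which is the point of the modular approach: any of them can be a local model or a cloud provider.&lt;/p&gt;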

&lt;p&gt;The speech recognition part is optional — you can also just type text directly to the chatbot if that works better for your use case.&lt;/p&gt;

&lt;h2&gt;The Plugin Stack&lt;/h2&gt;

&lt;p&gt;This implementation uses several plugins that work together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.fab.com/listings/b514294e-e78b-4b8b-ad21-78ce51dc7e8c" rel="noopener noreferrer"&gt;Runtime MetaHuman Lip Sync&lt;/a&gt;&lt;/strong&gt; — Generates facial animation from audio (&lt;strong&gt;&lt;a href="https://docs.georgy.dev/runtime-metahuman-lip-sync/overview/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.fab.com/listings/00ffc308-d7f9-4142-ac4c-4aeaa75ab54b" rel="noopener noreferrer"&gt;Runtime Speech Recognizer&lt;/a&gt;&lt;/strong&gt; — Converts speech to text (optional — you can also enter text manually) (&lt;strong&gt;&lt;a href="https://docs.georgy.dev/runtime-speech-recognizer/overview" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.fab.com/listings/d099709c-b984-4b79-8e17-a363fdbe68db" rel="noopener noreferrer"&gt;Runtime AI Chatbot Integrator&lt;/a&gt;&lt;/strong&gt; — Connects to AI providers and TTS services (&lt;strong&gt;&lt;a href="https://docs.georgy.dev/runtime-ai-chatbot-integrator/overview" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.fab.com/listings/66e0d72e-982f-4d9e-aaaf-13a1d22efad1" rel="noopener noreferrer"&gt;Runtime Audio Importer&lt;/a&gt;&lt;/strong&gt; — Processes audio at runtime (&lt;strong&gt;&lt;a href="https://docs.georgy.dev/runtime-audio-importer/overview" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.fab.com/listings/0dab646f-73b4-46d9-bb7e-8e6c12bdd808" rel="noopener noreferrer"&gt;Runtime Text To Speech&lt;/a&gt;&lt;/strong&gt; — Optional local TTS synthesis (&lt;strong&gt;&lt;a href="https://docs.georgy.dev/runtime-text-to-speech/overview" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;&lt;/strong&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All plugins are designed to work together with Blueprint nodes; no C++ is required.&lt;/p&gt;

&lt;p&gt;The plugin also supports &lt;a href="https://docs.georgy.dev/runtime-metahuman-lip-sync/how-to-use-with-custom-characters" rel="noopener noreferrer"&gt;custom characters&lt;/a&gt; beyond MetaHumans — Daz Genesis, Character Creator, Mixamo, ReadyPlayerMe, and any character with blend shapes.&lt;/p&gt;

&lt;h2&gt;Why CPU Inference?&lt;/h2&gt;

&lt;p&gt;The lip sync runs on CPU, not GPU. This might seem counterintuitive, but for small, frequent operations like lip sync (processing every 10ms by default), CPU is actually faster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  GPU has overhead from PCIe transfers and kernel launches&lt;/li&gt;
&lt;li&gt;  At batch size 1 with rapid inference, this overhead exceeds compute time&lt;/li&gt;
&lt;li&gt;  Game engines already saturate the GPU with rendering and physics&lt;/li&gt;
&lt;li&gt;  CPU avoids resource contention and unpredictable latency spikes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The transformer-based model is lightweight enough that most mid-tier CPUs handle it fine in real time. For weaker hardware, you can &lt;a href="https://docs.georgy.dev/runtime-metahuman-lip-sync/plugin-configuration" rel="noopener noreferrer"&gt;adjust settings&lt;/a&gt; like processing chunk size or switch to a more optimized model variant.&lt;/p&gt;
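&lt;p&gt;Some back-of-the-envelope numbers behind the batch-size-1 argument (the 16 kHz mono sample rate is an assumption for illustration; the 10 ms chunk is the default mentioned above):&lt;/p&gt;

```python
# Rough per-chunk arithmetic. The 16 kHz sample rate is assumed for
# illustration; the 10 ms chunk size is the stated default.

def samples_per_chunk(sample_rate_hz, chunk_ms):
    # Number of audio samples each inference call sees.
    return sample_rate_hz * chunk_ms // 1000

def inferences_per_second(chunk_ms):
    # How often the model runs for continuous audio.
    return 1000 // chunk_ms

# At 16 kHz with 10 ms chunks: 160 samples per call, 100 calls a second.
# Each call does very little compute, so fixed per-call GPU costs (PCIe
# transfer, kernel launch) dominate, while a CPU call has almost none.
```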

&lt;h2&gt;Animation Blueprint Setup&lt;/h2&gt;

&lt;p&gt;Setting up the lip sync in your Animation Blueprint is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  In the Event Graph, create your lip sync generator on Begin Play&lt;/li&gt;
&lt;li&gt;  In the Anim Graph, add the blend node and connect your character’s pose&lt;/li&gt;
&lt;li&gt;  Connect the generator to the blend node&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9gncgntj5nkh9f2zu6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9gncgntj5nkh9f2zu6k.png" alt="Blend Realistic MetaHuman Lip Sync"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.georgy.dev/runtime-metahuman-lip-sync/setup-guide" rel="noopener noreferrer"&gt;setup guide&lt;/a&gt; walks through this step-by-step, with different tabs for Standard vs Realistic models.&lt;/p&gt;

&lt;h2&gt;Audio Processing&lt;/h2&gt;

&lt;p&gt;The system connects audio through delegates. For example, with microphone input (&lt;a href="https://docs.georgy.dev/runtime-metahuman-lip-sync/blueprints/realistic-lip-sync-during-audio-capture" rel="noopener noreferrer"&gt;copyable nodes&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Create a &lt;a href="https://docs.georgy.dev/runtime-audio-importer/sound-waves/capturable-sound-wave" rel="noopener noreferrer"&gt;Capturable Sound Wave&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Bind to its audio data delegate&lt;/li&gt;
&lt;li&gt;  Pass audio chunks to your lip sync generator&lt;/li&gt;
&lt;li&gt;  Start capturing&lt;/li&gt;
&lt;/ul&gt;
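&lt;p&gt;The delegate wiring can be pictured with plain callbacks (a conceptual Python sketch; the actual plugin exposes Blueprint delegates, and the class and method names here are invented for illustration):&lt;/p&gt;

```python
# Sketch of delegate-style audio forwarding. The real plugin uses
# Blueprint delegates; these names are invented for illustration.

class AudioCaptureSource:
    def __init__(self):
        self._handlers = []

    def bind(self, handler):
        # Equivalent of binding to the audio data delegate.
        self._handlers.append(handler)

    def deliver(self, chunk):
        # Called once per captured buffer; fans the chunk out to every
        # bound listener, e.g. a function feeding the lip sync generator.
        for handler in self._handlers:
            handler(chunk)
```

&lt;p&gt;Once a listener that feeds the lip sync generator is bound, starting capture is all that remains: every subsequent chunk flows through automatically.&lt;/p&gt;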

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F093g1n3m1s8aoqfjxk9l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F093g1n3m1s8aoqfjxk9l.png" alt="Realistic Lip Sync During Audio Capture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.georgy.dev/runtime-metahuman-lip-sync/audio-processing" rel="noopener noreferrer"&gt;audio processing guide&lt;/a&gt; covers different audio sources: microphone, TTS, audio files, and streaming buffers.&lt;/p&gt;

&lt;p&gt;You can also &lt;a href="https://docs.georgy.dev/runtime-metahuman-lip-sync/plugin-configuration#combining-with-facial-and-body-animations" rel="noopener noreferrer"&gt;combine lip sync with custom animations&lt;/a&gt; for idle gestures or emotional expressions.&lt;/p&gt;

&lt;h2&gt;Multilingual Support&lt;/h2&gt;

&lt;p&gt;Since the lip sync analyzes audio phonemes directly, it works with any spoken language without language-specific configuration. Just feed it the audio and it generates the appropriate mouth movements — whether that’s English, Mandarin, Arabic, or anything else.&lt;/p&gt;

&lt;h2&gt;Testing the Demo&lt;/h2&gt;

&lt;p&gt;You can try the complete system yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://georgy.dev/runtime-metahuman-lip-sync-sts-demo-windows" rel="noopener noreferrer"&gt;Download Windows demo&lt;/a&gt;&lt;/strong&gt; (packaged, ready to run)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://georgy.dev/runtime-metahuman-lip-sync-sts-demo-source" rel="noopener noreferrer"&gt;Download source files&lt;/a&gt;&lt;/strong&gt; (UE 5.6+ project)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The demo includes several MetaHuman characters and shows all the features I’ve covered. It’s a good reference if you’re building something similar.&lt;/p&gt;

&lt;h2&gt;Performance Considerations&lt;/h2&gt;

&lt;p&gt;A few tips for optimization:&lt;/p&gt;

&lt;p&gt;For mobile/VR:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Use the Standard Model for better frame rates&lt;/li&gt;
&lt;li&gt;  Increase &lt;a href="https://docs.georgy.dev/runtime-metahuman-lip-sync/plugin-configuration#processing-chunk-size" rel="noopener noreferrer"&gt;processing chunk size&lt;/a&gt; (trades slight latency for CPU savings)&lt;/li&gt;
&lt;li&gt;  Adjust &lt;a href="https://docs.georgy.dev/runtime-metahuman-lip-sync/plugin-configuration#performance-settings" rel="noopener noreferrer"&gt;thread counts&lt;/a&gt; based on your target hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For desktop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Realistic or Mood-Enabled models for maximum quality&lt;/li&gt;
&lt;li&gt;  Keep default 10ms chunk size for responsive lip sync&lt;/li&gt;
&lt;li&gt;  Use Original &lt;a href="https://docs.georgy.dev/runtime-metahuman-lip-sync/plugin-configuration#model-type" rel="noopener noreferrer"&gt;model type&lt;/a&gt; for best accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;General:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Enable streaming for both AI responses and TTS to minimize latency&lt;/li&gt;
&lt;li&gt;  Use VAD to avoid processing empty audio&lt;/li&gt;
&lt;li&gt;  For the Realistic model with TTS, external services (ElevenLabs, OpenAI) work better than local TTS due to ONNX runtime conflicts (though the Mood-Enabled model supports local TTS fine)&lt;/li&gt;
&lt;/ul&gt;
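&lt;p&gt;To illustrate the VAD tip, here is a minimal energy-based gate (a Python sketch with a hand-picked threshold; dedicated VAD implementations are far more robust):&lt;/p&gt;

```python
import math

# Minimal energy-based voice activity gate, assuming float samples in
# [-1, 1] and a hand-picked RMS threshold. Real VAD is more
# sophisticated; this only shows the idea of skipping lip sync
# inference on silent chunks.

def is_speech(chunk, rms_threshold=0.01):
    if not chunk:
        return False
    rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
    return rms >= rms_threshold
```

&lt;p&gt;Gating chunks this way means silent stretches cost nothing, which matters most on the mobile and VR targets discussed above.&lt;/p&gt;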

&lt;h2&gt;Use Cases&lt;/h2&gt;

&lt;p&gt;This system enables quite a few applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  AI NPCs in games with natural conversations&lt;/li&gt;
&lt;li&gt;  Virtual assistants in VR/AR&lt;/li&gt;
&lt;li&gt;  Training simulations with interactive characters&lt;/li&gt;
&lt;li&gt;  Digital humans for customer service&lt;/li&gt;
&lt;li&gt;  Virtual production and real-time cinematics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Blueprint-based setup makes it accessible even if you’re not comfortable with C++.&lt;/p&gt;

&lt;h2&gt;Wrapping Up&lt;/h2&gt;

&lt;p&gt;The combination of offline speech recognition, flexible AI integration, quality TTS, and realistic lip sync creates some genuinely immersive interactions. All the plugins are on &lt;a href="https://www.fab.com/sellers/Georgy%20Dev" rel="noopener noreferrer"&gt;Fab&lt;/a&gt;, and there’s extensive &lt;a href="https://docs.georgy.dev/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; if you want to dig into specific features.&lt;/p&gt;

&lt;p&gt;For more examples and tutorials, check out the &lt;a href="https://www.youtube.com/@GeorgyDev/videos" rel="noopener noreferrer"&gt;lip sync video tutorials&lt;/a&gt; or join the &lt;a href="https://georgy.dev/discord" rel="noopener noreferrer"&gt;Discord community&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you need custom development or have questions about enterprise solutions: &lt;a href="mailto:solutions@georgy.dev"&gt;solutions@georgy.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>gamedev</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
