
Georgy Dev

Posted on • Originally published at Medium

Building Voice AI NPCs in Unreal Engine: Speech Recognition to Lip Sync Pipeline

I recently put together a demo project that shows how to create fully interactive AI NPCs in Unreal Engine using speech recognition, AI chatbots, text-to-speech, and realistic lip synchronization with facial animations. The entire system is built with Blueprints and works across Windows, Linux, Mac, iOS, and Android.

If you’ve been exploring AI NPC solutions like ConvAI or Charisma.ai, you’ve probably noticed the tradeoffs: metered API costs that scale with your player count, latency from network roundtrips, and dependency on cloud infrastructure. This modular approach gives you more control: run components locally or pick your own cloud providers, avoid per-conversation billing, and keep your players’ interactions private if needed. You own the pipeline, so you can optimize for what actually matters to your game. And with local inference and direct audio-based lip sync, you can achieve lower latency and more realistic facial animation. Check the demo video below to see the difference yourself.

Here’s an example of the real-time lip sync quality achievable with this system:

What This System Does

The workflow creates a natural conversation loop with an AI character:

  1. Player speaks into microphone → speech recognition converts it to text
  2. Text goes to an AI chatbot (OpenAI, Claude, DeepSeek, etc.) → AI generates a response
  3. Response is converted to speech via text-to-speech
  4. Character’s lips sync perfectly with the spoken audio

The speech recognition part is optional — you can also just type text directly to the chatbot if that works better for your use case.
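The loop above can be sketched in a few lines of Python. This is a language-agnostic illustration, not the plugin API: in the actual project each stage is a Blueprint node, and the four stage functions here are stand-in stubs.

```python
# Sketch of the conversation loop. Each stage is stubbed out; in the
# real project these are Blueprint nodes from the corresponding plugins.

def speech_to_text(audio):
    return "Hello there!"  # stub: speech recognition result

def ask_chatbot(history):
    return "Greetings, traveler."  # stub: OpenAI / Claude / DeepSeek call

def text_to_speech(text):
    return b"\x00" * 16000  # stub: synthesized PCM audio

def feed_lip_sync(audio):
    pass  # stub: streams audio chunks into the lip sync generator

def conversation_turn(mic_audio, history):
    """One player-to-NPC exchange: STT -> chatbot -> TTS -> lip sync."""
    player_text = speech_to_text(mic_audio)  # optional: can be typed text instead
    history.append({"role": "user", "content": player_text})
    reply = ask_chatbot(history)
    history.append({"role": "assistant", "content": reply})
    npc_audio = text_to_speech(reply)
    feed_lip_sync(npc_audio)  # drives the facial animation
    return npc_audio

history = []
npc_audio = conversation_turn(b"raw mic pcm", history)
print(len(history))  # 2 (one user turn, one assistant turn)
```

Keeping the chat history as a list of role/content messages matches what most chatbot APIs expect, so swapping providers only changes the `ask_chatbot` stage.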

The Plugin Stack

This implementation uses several plugins that work together.

All of the plugins expose Blueprint nodes; no C++ is required.

The lip sync plugin also supports custom characters beyond MetaHumans — Daz Genesis, Character Creator, Mixamo, ReadyPlayerMe, and any character with blend shapes.

Why CPU Inference?

The lip sync runs on CPU, not GPU. This might seem counterintuitive, but for small, frequent operations like lip sync (processing every 10ms by default), CPU is actually faster:

  • GPU has overhead from PCIe transfers and kernel launches
  • At batch size 1 with rapid inference, this overhead exceeds compute time
  • Game engines already saturate the GPU with rendering and physics
  • CPU avoids resource contention and unpredictable latency spikes

The transformer-based model is lightweight enough that most mid-tier CPUs handle it in real time. For weaker hardware, you can adjust settings like the processing chunk size or switch to a more optimized model variant.
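A back-of-the-envelope calculation makes the batch-size-1 argument concrete. The per-call timings below are illustrative assumptions for a tiny model, not measurements; only the 10 ms chunk interval comes from the system's defaults.

```python
# Why high-frequency, batch-1 inference favors the CPU: at 10 ms chunks
# the model runs 100 times per second, so fixed per-call GPU overhead
# (PCIe transfer + kernel launch) is paid 100 times per second.
# All per-call millisecond figures are illustrative assumptions.

CHUNK_MS = 10                          # default processing interval
calls_per_second = 1000 // CHUNK_MS    # 100 inferences per second

gpu_overhead_ms = 0.5  # assumed transfer + launch cost per call
gpu_compute_ms = 0.1   # assumed on-device compute for a tiny model
cpu_compute_ms = 0.4   # assumed CPU compute; no transfer overhead

gpu_total_ms = calls_per_second * (gpu_overhead_ms + gpu_compute_ms)
cpu_total_ms = calls_per_second * cpu_compute_ms

print(calls_per_second)            # 100
print(gpu_total_ms, cpu_total_ms)  # 60.0 40.0
```

With these numbers the GPU spends more wall time per second than the CPU, even though its raw compute is faster — the overhead dominates because it cannot be amortized across a batch.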

Animation Blueprint Setup

Setting up the lip sync in your Animation Blueprint is straightforward:

  • In the Event Graph, create your lip sync generator on Begin Play
  • In the Anim Graph, add the blend node and connect your character’s pose
  • Connect the generator to the blend node

Blend Realistic MetaHuman Lip Sync

The setup guide walks through this step-by-step, with different tabs for Standard vs Realistic models.

Audio Processing

The system connects audio through delegates. For example, with microphone input (copyable nodes):

  • Create a Capturable Sound Wave
  • Bind to its audio data delegate
  • Pass audio chunks to your lip sync generator
  • Start capturing
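The delegate pattern in those steps can be sketched as follows. Class and method names here are illustrative stand-ins, not the plugin's actual API; the point is the flow of bind, then capture, then per-chunk callback.

```python
# Sketch of the audio-delegate flow: a capturable sound wave invokes a
# bound callback with each PCM chunk, which forwards it to the lip sync
# generator. Names are illustrative, not the plugin's real API.

class CapturableSoundWave:
    """Stand-in for a capturable sound wave with an audio data delegate."""
    def __init__(self):
        self._on_audio = None
        self.capturing = False

    def bind_audio_delegate(self, callback):
        self._on_audio = callback

    def start_capture(self):
        self.capturing = True

    def push_audio(self, chunk):
        # In-engine, the microphone capture thread calls this for you.
        if self.capturing and self._on_audio:
            self._on_audio(chunk)

class LipSyncGenerator:
    """Stand-in for the lip sync generator consuming audio chunks."""
    def __init__(self):
        self.bytes_processed = 0

    def process_chunk(self, chunk):
        self.bytes_processed += len(chunk)  # real model infers visemes here

wave = CapturableSoundWave()
generator = LipSyncGenerator()
wave.bind_audio_delegate(generator.process_chunk)  # bind BEFORE capturing
wave.start_capture()
wave.push_audio(b"\x00" * 320)  # one 10 ms chunk of 16 kHz 16-bit mono
print(generator.bytes_processed)  # 320
```

Binding before starting capture matters: if the delegate is attached late, the first chunks are silently dropped and the lip sync starts out of sync with the audio.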

Realistic Lip Sync During Audio Capture

The audio processing guide covers different audio sources: microphone, TTS, audio files, and streaming buffers.

You can also combine lip sync with custom animations for idle gestures or emotional expressions.

Multilingual Support

Since the lip sync analyzes audio phonemes directly, it works with any spoken language without language-specific configuration. Just feed it the audio and it generates the appropriate mouth movements — whether that’s English, Mandarin, Arabic, or anything else.

Testing the Demo

You can try the complete system yourself:

The demo includes several MetaHuman characters and shows all the features I’ve covered. It’s a good reference if you’re building something similar.

Performance Considerations

A few tips for optimization:

For mobile/VR:

  • Use the Standard Model for better frame rates
  • Increase processing chunk size (trades slight latency for CPU savings)
  • Adjust thread counts based on your target hardware

For desktop:

  • Realistic or Mood-Enabled models for maximum quality
  • Keep default 10ms chunk size for responsive lip sync
  • Use Original model type for best accuracy

General:

  • Enable streaming for both AI responses and TTS to minimize latency
  • Use VAD to avoid processing empty audio
  • For the Realistic model with TTS, external services (ElevenLabs, OpenAI) work better than local TTS due to ONNX runtime conflicts (though the Mood-Enabled model supports local TTS fine)
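For the VAD tip, a minimal energy-based gate shows the idea: chunks whose RMS falls below a threshold are dropped before they ever reach the lip sync generator. The plugin's actual VAD is more sophisticated; the threshold here is an illustrative assumption.

```python
# Minimal energy-based VAD sketch: skip chunks that are effectively
# silence so the lip sync model isn't run on empty audio.
import math

def is_speech(samples, threshold=0.01):
    """samples: float samples in [-1, 1]; True if RMS exceeds threshold."""
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > threshold

# One 10 ms chunk at 16 kHz = 160 samples.
silence = [0.0] * 160
voiced = [0.1 * math.sin(2 * math.pi * 220 * t / 16000) for t in range(160)]

print(is_speech(silence))  # False
print(is_speech(voiced))   # True
```

Gating like this saves CPU during pauses in conversation, which is exactly when the NPC's mouth should be closed anyway.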

Use Cases

This system enables quite a few applications:

  • AI NPCs in games with natural conversations
  • Virtual assistants in VR/AR
  • Training simulations with interactive characters
  • Digital humans for customer service
  • Virtual production and real-time cinematics

The Blueprint-based setup makes it accessible even if you’re not comfortable with C++.

Wrapping Up

The combination of offline speech recognition, flexible AI integration, quality TTS, and realistic lip sync creates some genuinely immersive interactions. All the plugins are on Fab, and there’s extensive documentation if you want to dig into specific features.

For more examples and tutorials, check out the lip sync video tutorials or join the Discord community.

If you need custom development or have questions about enterprise solutions: solutions@georgy.dev
