
Frank Fu

Posted on • Originally published at frankfu.blog

NavTalk Update: Revolutionary 200ms Response Time for Real-Time Digital Human Experience!

1. Response Speed Performance

Let’s get straight to the point by looking at the actual response speed performance:

In the live demo, end-to-end latency for the initial audio chunk stayed under roughly 200 ms. That figure covers everything from the moment the user finishes speaking, through AI processing and video generation, to the result being displayed on the front end. As far as we know, this is fast compared with other real-time digital human systems.

2. Overall Latency Before Optimization

We conducted detailed tests on MuseTalk’s real-time performance in the A100 GPU environment:

1. With a 0.5-second chunk of real-time audio as input, processing took longer than 0.5 seconds, so the pipeline could not keep up with real time (the original post shows this in a video).

2. Lowering the frame rate to 18 FPS cut the processing time for real-time audio input by about 0.2 seconds. However, the frame rate would have to drop below 15 FPS to meet real-time expectations.

3. Increasing the batch size made processing slower rather than faster, which suggests the chip had already reached its processing limit.
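The real-time requirement in these tests can be written down as simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 25 FPS baseline is an assumption (the post does not state the original frame rate), and the 0.6-second processing time is a made-up example value.

```python
def frames_for_chunk(audio_seconds: float, fps: float) -> float:
    """Number of video frames needed to cover one audio chunk."""
    return audio_seconds * fps


def per_frame_budget_ms(fps: float) -> float:
    """Maximum average time per frame that still keeps up with real time."""
    return 1000.0 / fps


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Values above 1.0 mean processing is slower than real time."""
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    chunk = 0.5  # seconds of audio per chunk, as in the tests above
    for fps in (25, 18, 15):  # 25 is an assumed baseline, not from the post
        print(
            f"{fps} FPS: {frames_for_chunk(chunk, fps):.1f} frames per chunk, "
            f"{per_frame_budget_ms(fps):.1f} ms budget per frame"
        )
    # Hypothetical measurement: 0.6 s of processing for 0.5 s of audio.
    print(f"real-time factor: {real_time_factor(0.6, chunk):.2f}")
```

Lowering the frame rate helps because fewer frames have to be produced for the same amount of audio, which raises the time budget available for each frame.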

The root cause lies not in the A100 GPU itself but in the host CPU it is paired with, which is an AMD chip by default. Specifically, the system reports an AMD EPYC 7J13 64-Core Processor with 30 cores available. This processor is well suited to virtualization and high-concurrency workloads, but it tends to underperform Intel processors in some CPU-bound computer vision tasks such as image processing. Unfortunately, most GPU cloud providers pair their GPUs with AMD CPUs.

This bottleneck initially blocked further performance optimization. Then I had an idea: could the image processing tasks themselves run on the GPU, breaking through the bottleneck that way? That question led to a series of optimization steps.

3. GPU-Accelerated Image Processing Optimization

3.1 Optimization Approach

To address the performance bottleneck of AMD chips in image processing, the core idea was to move the image processing operations, originally executed on the CPU, to the GPU, taking full advantage of the GPU’s parallel computing capabilities. In MuseTalk’s inference process, the following image processing steps were executed on the CPU:

1. Data Conversion After VAE Decoding: The decoded GPU tensor is converted to a NumPy array, which incurs GPU → CPU data transfer overhead.

2. Image Resize: Image resizing is performed on the CPU using OpenCV’s cv2.resize().

3. Image Sharpening: Image sharpening is done on the CPU using OpenCV and NumPy with an Unsharp Mask operation.

4. Image Blending: Image composition and blending are handled on the CPU using PIL.

Although each operation individually takes a small amount of time, the cumulative latency becomes significant in a real-time processing scenario. More importantly, these operations can be accelerated using the parallel computing power of the GPU.

3.2 Technical Implementation

3.2.1 Creating a GPU Image Processing Tool Library

First, I created a dedicated GPU image processing tool library musetalk/utils/gpu_image_processing.py, implementing the following core functions:

▪ gpu_resize(): Uses PyTorch’s F.interpolate() for GPU-based image resizing.

▪ gpu_gaussian_blur(): Implements GPU-based Gaussian blur using PyTorch’s F.conv2d().

▪ gpu_unsharp_mask(): Sharpens the image on the GPU with an Unsharp Mask built on the GPU-based Gaussian blur.

▪ gpu_image_blending(): GPU-based image blending using tensor operations.

These functions accept multiple input layouts ([H, W, C], [B, H, W, C], [B, C, H, W]) and handle the layout conversions automatically, which keeps them easy to call. With corresponding modifications to processing.py, all image processing tasks were migrated to the GPU.
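To make the idea concrete, here is a minimal sketch of how resize, Gaussian blur, and Unsharp Mask can be built from F.interpolate() and F.conv2d(). It is not the actual contents of gpu_image_processing.py: the function bodies, the helper names to_bchw() and to_hwc(), the kernel radius, and the threshold handling are all assumptions, and only the [H, W, C] layout is handled.

```python
import torch
import torch.nn.functional as F


def to_bchw(img: torch.Tensor) -> torch.Tensor:
    """[H, W, C] -> float [1, C, H, W], the layout PyTorch ops expect."""
    return img.permute(2, 0, 1).unsqueeze(0).float()


def to_hwc(img: torch.Tensor) -> torch.Tensor:
    """[1, C, H, W] -> [H, W, C]."""
    return img.squeeze(0).permute(1, 2, 0)


def gpu_resize(img: torch.Tensor, size: tuple) -> torch.Tensor:
    """Bilinear resize; size is (height, width), unlike cv2's (width, height)."""
    out = F.interpolate(to_bchw(img), size=size, mode="bilinear", align_corners=False)
    return to_hwc(out)


def gpu_gaussian_blur(img: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Gaussian blur as a depthwise convolution (one kernel copy per channel)."""
    x = to_bchw(img)
    channels = x.shape[1]
    radius = max(1, int(3 * sigma))  # assumed kernel radius of about 3 sigma
    coords = torch.arange(-radius, radius + 1, dtype=torch.float32, device=x.device)
    k1d = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    k1d = k1d / k1d.sum()
    kernel = torch.outer(k1d, k1d)[None, None].repeat(channels, 1, 1, 1)
    x = F.pad(x, (radius, radius, radius, radius), mode="reflect")
    return to_hwc(F.conv2d(x, kernel, groups=channels))


def gpu_unsharp_mask(img, amount=1.2, sigma=1.0, threshold=5.0):
    """Add back the detail removed by the blur, skipping low-contrast pixels."""
    img = img.float()
    detail = img - gpu_gaussian_blur(img, sigma)
    sharpened = img + amount * detail
    out = torch.where(detail.abs() < threshold, img, sharpened)
    return out.clamp(0, 255)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    frame = torch.rand(256, 256, 3, device=device) * 255.0
    face = gpu_unsharp_mask(gpu_resize(frame, (128, 96)))
    print(face.shape, face.device)
```

Because every step takes and returns a tensor on the same device, the frame never has to leave GPU memory between operations.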

3.2.2 Optimizing VAE Decoding Process

I modified the decode_latents() method in musetalk/models/vae.py, adding a return_tensor parameter:

def decode_latents(self, latents, return_tensor=False):
    # ... decoding logic ...
    if return_tensor:
        # Return a GPU tensor to avoid GPU → CPU transfer
        image = image.permute(0, 2, 3, 1)  # [B, H, W, C]
        image = image * 255.0
        image = image[..., [2, 1, 0]]  # Convert RGB to BGR
        return image
    else:
        # Original behavior: return a NumPy array
        image = (
            image.detach()
            .cpu()
            .permute(0, 2, 3, 1)
            .float()
            .numpy()
        )
        # ...
        return image

With return_tensor=True, the data stays on the GPU, avoiding unnecessary data transfer.

3.2.3 Refactoring the Real-Time Inference Process

In scripts/realtime_inference.py, I refactored the process_frames() method to add a GPU processing path:

Key changes:

▪ Image Resize Optimization

# Original: CPU-based processing
res_frame = cv2.resize(
    res_frame.astype(np.uint8),
    (x2 - x1, y2 - y1)  # (width, height)
)

# Optimized: GPU-based processing
res_frame_gpu = gpu_resize(
    res_frame,
    (y2 - y1, x2 - x1),  # (height, width)
    mode='bilinear'
)

▪ Image Sharpening Optimization

# Original: CPU-based processing (OpenCV + NumPy)
res_frame = apply_unsharp_mask(
    res_frame,
    amount=1.2,
    sigma=1.0,
    threshold=5.0
)

# Optimized: GPU-based processing
res_frame_gpu = gpu_unsharp_mask(
    res_frame_gpu,
    amount=1.2,
    sigma=1.0,
    threshold=5.0
)

▪ Image Blending Optimization

# Original: CPU-based processing (PIL)
combine_frame = get_image_blending(
    ori_frame,
    res_frame,
    bbox,
    mask,
    mask_crop_box
)

# Optimized: GPU-based processing
body_tensor = numpy_to_tensor_gpu(ori_frame, device)
face_tensor = res_frame_gpu  # Already on GPU
mask_tensor = numpy_to_tensor_gpu(mask, device)

combine_frame_tensor = gpu_image_blending(
    body_tensor,
    face_tensor,
    bbox,
    mask_tensor,
    mask_crop_box,
    device
)

# Move the final frame back to the CPU as a NumPy array
combine_frame = tensor_to_numpy_cpu(combine_frame_tensor)

The entire process uses an automatic fallback mechanism: if GPU processing fails, it falls back to CPU processing to ensure system stability.
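The fallback idea can be sketched as a small wrapper. This is an assumed shape, not the code in scripts/realtime_inference.py: the name with_cpu_fallback() and the logger name are hypothetical, and the two callables stand in for the GPU and CPU versions of a processing step.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("realtime_inference")  # hypothetical logger name


def with_cpu_fallback(gpu_fn: Callable[[], T], cpu_fn: Callable[[], T]) -> T:
    """Run the GPU path first; if it raises, log a warning and run the CPU path."""
    try:
        return gpu_fn()
    except Exception as exc:  # for example, a CUDA out-of-memory RuntimeError
        logger.warning("GPU path failed (%s); falling back to CPU", exc)
        return cpu_fn()


if __name__ == "__main__":
    def broken_gpu_path() -> str:
        raise RuntimeError("simulated CUDA failure")

    print(with_cpu_fallback(broken_gpu_path, lambda: "frame from CPU path"))
```

Catching a broad Exception keeps the stream alive at the cost of hiding bugs in the GPU path, so logging each fallback is worth the extra line.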

3.3 Performance Improvement Results

After optimization, we tested the system in an AMD EPYC 7J13 processor + A100 GPU environment:

3.3.1 Performance Improvement Data

| Operation | CPU Time | GPU Time | Speedup |
|---|---|---|---|
| Image Resize | 5–10 ms | 1–2 ms | 5–10x |
| Image Sharpening | 8–15 ms | 2–4 ms | 3–5x |
| Image Blending | 10–20 ms | 3–5 ms | 3–5x |
| VAE Decoding (No Transfer) | - | - | Saves transfer time |
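A quick way to see why these small numbers matter is to add them up and compare them with a per-frame time budget. The sketch below assumes the timings in the table are per frame and uses a hypothetical 25 FPS target; neither assumption is stated in the measurements themselves.

```python
# Timings copied from the table above, as (low, high) in milliseconds.
CPU_MS = {"resize": (5, 10), "sharpen": (8, 15), "blend": (10, 20)}
GPU_MS = {"resize": (1, 2), "sharpen": (2, 4), "blend": (3, 5)}


def total_range(timings: dict) -> tuple:
    """Sum the low ends and the high ends across all operations."""
    low = sum(t[0] for t in timings.values())
    high = sum(t[1] for t in timings.values())
    return low, high


if __name__ == "__main__":
    frame_budget_ms = 1000 / 25  # assumed 25 FPS target: 40 ms per frame
    for label, timings in (("CPU", CPU_MS), ("GPU", GPU_MS)):
        low, high = total_range(timings)
        print(f"{label}: {low}-{high} ms of post-processing per frame "
              f"against a {frame_budget_ms:.0f} ms budget")
```

Under those assumptions the CPU path spends 23–45 ms per frame on post-processing alone, which can use up the whole frame budget before inference is counted, while the GPU path needs 6–11 ms.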

3.3.2 Overall Effect

Before Optimization:

▪ 0.5-second audio input required more than 0.5 seconds processing time.

▪ Did not meet real-time requirements.

▪ FPS needed to be reduced to below 15 to barely achieve real-time.

After Optimization:

▪ Image processing speed improved by 3–5 times.

▪ End-to-end latency controlled under 200 ms.

▪ Successfully achieved real-time response with significantly improved user experience.

3.4 Why Is GPU Acceleration Effective?

1. Flexible Computing Precision: GPUs support float32/half precision, allowing flexible balancing between precision and speed.

2. Parallel Computing Advantage: Image processing tasks (such as resizing, convolution, and blending) are inherently parallel, and GPUs, with their thousands of cores, are well-suited for these tasks.

3. Memory Bandwidth: GPU video memory offers far higher bandwidth than the link between the CPU and main memory, and keeping intermediate results on the GPU avoids repeated GPU → CPU transfers.

4. MuseTalk Docker Deployment Record

4.1 Build and Push Image

4.1.1 Rebuild Image

docker build -t xxx/musetalk:latest .

4.1.2 Push New Image to Docker Hub

docker push xxx/musetalk:latest

Note: You need to log in to Docker Hub before pushing:

docker login

4.2 Remove and Pull Image

4.2.1 Stop and Remove Old Container

sudo docker rm -f musetalk

4.2.2 Pull Latest Image

sudo docker pull xxx/musetalk:latest

4.3 Run Container

4.3.1 Start New Container

sudo docker run -d \
  --name musetalk \
  --gpus all \
  --restart unless-stopped \
  -p 2160:2160 \
  xxx/musetalk:latest

Explanation of Parameters:

▪ -d: Run in detached mode (background).

▪ --name musetalk: The container name.

▪ --gpus all: Use all available GPUs (requires installation of nvidia-container-toolkit).

▪ --restart unless-stopped: Auto-restart policy (unless manually stopped).

▪ -p 2160:2160: Port mapping (host port:container port).

Note: On the first run, the container automatically downloads the models from Hugging Face into the /workspace/models directory inside the container.
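Because the first start can take a while, it can be handy to check from the host whether the published port is accepting connections yet. The following is a minimal sketch using only the Python standard library. The function name is hypothetical, and it only tests that a TCP connection succeeds, not that the service behind it is ready.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timed out, unreachable, ...
        return False


if __name__ == "__main__":
    # 2160 is the host port published by the docker run command above.
    print("port 2160 open:", is_port_open("127.0.0.1", 2160))
```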

4.4 View Logs and Debug

4.4.1 Real-Time Logs

sudo docker logs -f musetalk

4.4.2 Check Container Status

sudo docker ps
sudo docker ps -a
sudo docker stats musetalk

4.5 Container Operations

4.5.1 Enter Container

sudo docker exec -it musetalk /bin/bash

Explanation: -it combines -i (keep standard input open) and -t (allocate a terminal), which together give an interactive session. /bin/bash is the command executed inside the container.

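4.5.2 Fix CRLF Issue in Filenames

If some filenames end with a carriage return character (\r), you can fix them inside the container as follows:

# Enter the container
sudo docker exec -it musetalk /bin/bash

# Navigate to the target directory
cd /workspace

# One-time fix for all filenames ending in a carriage return
for f in *$'\r'; do mv "$f" "${f%$'\r'}"; done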
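The same cleanup can be written in Python, which makes it easier to avoid overwriting an existing file and to see what was renamed. This is an illustrative equivalent of the shell loop, not part of the original deployment; the function name is hypothetical.

```python
from pathlib import Path


def strip_trailing_cr(directory: str) -> list:
    """Rename entries whose names end in a carriage return; return the new names."""
    renamed = []
    # Take a snapshot of the listing first, since we rename while looping.
    for path in list(Path(directory).iterdir()):
        if not path.name.endswith("\r"):
            continue
        new_name = path.name.rstrip("\r")
        target = path.parent / new_name
        if not new_name or target.exists():
            continue  # skip rather than overwrite an existing entry
        path.rename(target)
        renamed.append(new_name)
    return sorted(renamed)


if __name__ == "__main__":
    print(strip_trailing_cr("/workspace"))
```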

4.5.3 Create Directories and Copy Files

Create the target directory, then copy the avatar files into it. Run the following inside the container:

mkdir -p /workspace/silent/sk_navtalk_xxx/girl

# Copy the avatars directory
cp -r /workspace/results/sk_navtalk_xxx/v15/avatars \
  /workspace/silent/sk_navtalk_xxx/

# Copy all files from the full_imgs folder
cp -r /workspace/results/sk_navtalk_xxx/v15/avatars/girl/full_imgs/* \
  /workspace/silent/sk_navtalk_xxx/girl/
4.6 Analyze GPU Usage

nvidia-smi -l 1
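For logging GPU usage over time rather than watching it in a terminal, nvidia-smi can emit machine-readable CSV through its --query-gpu and --format options. The sketch below parses that output. The function names and dictionary keys are hypothetical, and the parser assumes every field is numeric, which may not hold on GPUs that report a field as unavailable.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list:
    """Parse one CSV line per GPU into a dict of floats."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (float(field) for field in line.split(","))
        stats.append(
            {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def read_gpu_stats() -> list:
    """Run nvidia-smi once and return the parsed stats (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    print(read_gpu_stats())
```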

The post NavTalk Update: Revolutionary 200ms Response Time for Real-Time Digital Human Experience! appeared first on Frank Fu's Blog.
