<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Frank Fu</title>
    <description>The latest articles on DEV Community by Frank Fu (@frankfu).</description>
    <link>https://dev.to/frankfu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3828188%2F8359f502-fbc2-439e-9d20-74ad63793d82.png</url>
      <title>DEV Community: Frank Fu</title>
      <link>https://dev.to/frankfu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/frankfu"/>
    <language>en</language>
    <item>
      <title>OpenAvatarChat: A Detailed Explanation of System Architecture and Handler Collaboration Mechanism</title>
      <dc:creator>Frank Fu</dc:creator>
      <pubDate>Mon, 30 Mar 2026 08:55:10 +0000</pubDate>
      <link>https://dev.to/frankfu/openavatarchat-a-detailed-explanation-of-system-architecture-and-handler-collaboration-mechanism-1gch</link>
      <guid>https://dev.to/frankfu/openavatarchat-a-detailed-explanation-of-system-architecture-and-handler-collaboration-mechanism-1gch</guid>
      <description>&lt;h2&gt;1. Overall Architecture&lt;/h2&gt;
&lt;h3&gt;1.1 System Hierarchical Structure&lt;/h3&gt;
&lt;p&gt;OpenAvatarChat adopts a layered architecture, divided into three levels from top to bottom:&lt;/p&gt;
&lt;img width="800" height="284" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-1024x364.png" alt=""&gt;&lt;p&gt;&lt;strong&gt;Architecture Description&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;1. &lt;strong&gt;ChatEngine (Top Layer)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The core of the system, managing the entire chat engine&lt;/li&gt;
&lt;li&gt;Responsible for initialization, configuration loading, and Handler management&lt;/li&gt;
&lt;li&gt;Supports concurrent multi-session operation, with each session running independently&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;2. &lt;strong&gt;ChatSession (Middle Layer)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Corresponds to a user session (one WebRTC connection)&lt;/li&gt;
&lt;li&gt;Manages all Handler instances within the session&lt;/li&gt;
&lt;li&gt;Manages data flow, threads, and queues&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;3. &lt;strong&gt;Handler (Bottom Layer)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Functional modules responsible for specific task processing&lt;/li&gt;
&lt;li&gt;Includes: RTC client, VAD, ASR, LLM, TTS, Avatar, etc.&lt;/li&gt;
&lt;li&gt;Each Handler creates an independent instance when the session starts&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;1.2 Core Component Description&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;ChatEngine (src/chat_engine/chat_engine.py)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Responsibilities:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;System initialization and management&lt;/li&gt;
&lt;li&gt;Creation and initialization of the HandlerManager&lt;/li&gt;
&lt;li&gt;Creation and destruction of sessions&lt;/li&gt;
&lt;li&gt;Management of concurrent multi-session operation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Key Methods&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def initialize(engine_config, app=None, ui=None):
    # Initialize the HandlerManager
    # Load all Handlers
    # Set up the client Handler

def create_client_session(session_info, client_handler):
    # Create a new ChatSession
    # Prepare the Handler environment
    # Return the session and Handler environment

def stop_session(session_id):
    # Stop and destroy the session&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;HandlerManager (src/chat_engine/core/handler_manager.py)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Responsibilities&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dynamically load Handler modules from configuration files&lt;/li&gt;
&lt;li&gt;Register Handler instances&lt;/li&gt;
&lt;li&gt;Manage the Handler lifecycle&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Key Data Structure&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;handler_registries = {
    "RtcClient": HandlerRegistry(
        base_info=HandlerBaseInfo(...),
        handler=rtc_client_instance,      # the RtcClient handler instance
        handler_config=rtc_client_config  # its configuration object
    ),
    "SileroVad": HandlerRegistry(...),
    ...
}&lt;/code&gt;&lt;/pre&gt;
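&lt;p&gt;The registry structure above can be sketched as plain dataclasses. This is a minimal illustration with assumed field names (&lt;code&gt;load_priority&lt;/code&gt; is hypothetical), not the project's actual definitions:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class HandlerBaseInfo:
    name: str
    load_priority: int = 0  # hypothetical field, for illustration only

@dataclass
class HandlerRegistry:
    base_info: HandlerBaseInfo
    handler: Optional[Any] = None          # the loaded handler instance
    handler_config: Optional[Dict] = None  # parsed configuration object

# The manager keys registries by Handler name, as in the structure above.
handler_registries: Dict[str, HandlerRegistry] = {
    "RtcClient": HandlerRegistry(base_info=HandlerBaseInfo(name="RtcClient")),
    "SileroVad": HandlerRegistry(base_info=HandlerBaseInfo(name="SileroVad")),
}
```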
&lt;p&gt;&lt;strong&gt;ChatSession (src/chat_engine/core/chat_session.py)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Responsibilities&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Manage data flow for a single session&lt;/li&gt;
&lt;li&gt;Create and manage Handler instances&lt;/li&gt;
&lt;li&gt;Data routing and distribution&lt;/li&gt;
&lt;li&gt;Thread management&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Key Data Structure&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Data routing table: data type → Handler input queue
data_sinks = {
    ChatDataType.MIC_AUDIO: [
        DataSink(owner="SileroVad", sink_queue=vad_queue),
    ],
    ChatDataType.HUMAN_TEXT: [
        DataSink(owner="LLM_Bailian", sink_queue=llm_queue),
    ],
}

# Handler records: Handler name → Handler environment
handlers = {
    "SileroVad": HandlerRecord(env=HandlerEnv(...)),
    ...
}&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;2. Data Flow Process&lt;/h2&gt;
&lt;h3&gt;2.1 Complete Data Flow Architecture Diagram&lt;/h3&gt;
&lt;p&gt;The complete data flow is as follows:&lt;/p&gt;
&lt;img width="800" height="410" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-1-1024x526.png" alt=""&gt;&lt;h3&gt;2.2 Detailed Data Flow Process&lt;/h3&gt;
&lt;h4&gt;Step 1: Client Input&lt;/h4&gt;
&lt;img width="800" height="492" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-2.png" alt=""&gt;&lt;h4&gt;Step 2: Data Distribution (Subscription Distribution)&lt;/h4&gt;
&lt;img width="800" height="445" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-3.png" alt=""&gt;&lt;p&gt;&lt;strong&gt;Key Mechanisms&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;data_sinks&lt;/code&gt; is a mapping table from data types to Handler input queues.&lt;/li&gt;
&lt;li&gt;The system automatically finds all subscribers based on the data type.&lt;/li&gt;
&lt;li&gt;Data is simultaneously distributed to all Handlers that have subscribed to that data type.&lt;/li&gt;
&lt;/ul&gt;
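&lt;p&gt;The subscription mechanism can be sketched in a few lines. The names below are toy stand-ins for the real &lt;code&gt;ChatDataType&lt;/code&gt; enum and queue wiring, not the project's actual code:&lt;/p&gt;

```python
import queue

# Toy stand-ins for ChatDataType members; the real project uses an enum.
MIC_AUDIO, HUMAN_TEXT = "MIC_AUDIO", "HUMAN_TEXT"

class DataSink:
    def __init__(self, owner, sink_queue):
        self.owner = owner            # name of the subscribing Handler
        self.sink_queue = sink_queue  # that Handler's input queue

vad_queue, llm_queue = queue.Queue(), queue.Queue()

data_sinks = {
    MIC_AUDIO: [DataSink("SileroVad", vad_queue)],
    HUMAN_TEXT: [DataSink("LLM_Bailian", llm_queue)],
}

def distribute_data(data_type, payload):
    """Fan the payload out to every Handler subscribed to this data type."""
    for sink in data_sinks.get(data_type, []):
        sink.sink_queue.put(payload)

distribute_data(MIC_AUDIO, b"\x00\x01")  # only SileroVad's queue receives it
```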
&lt;h4&gt;Step 3: Handler Processing&lt;/h4&gt;
&lt;p&gt;Each Handler has an independent processing thread that reads data from its own input queue:&lt;/p&gt;
&lt;img width="800" height="319" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-4.png" alt=""&gt;&lt;h4&gt;Step 4: Chained Data Flow&lt;/h4&gt;
&lt;p&gt;Data automatically forms a processing chain based on the input and output definitions of the Handlers:&lt;/p&gt;
&lt;img width="800" height="277" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-5-1024x355.png" alt=""&gt;&lt;h4&gt;Step 5: Client Output&lt;/h4&gt;
&lt;img width="693" height="322" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-6.png" alt=""&gt;&lt;h3&gt;2.3 Key Data Structures: Queues and Routing&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Input Queues&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Client input queues (created by the RTC Client Handler)
input_queues = {
    EngineChannelType.AUDIO: asyncio.Queue(),
    EngineChannelType.VIDEO: asyncio.Queue(),
    EngineChannelType.TEXT: asyncio.Queue(),
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Handler Input Queues&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Each Handler has its own input queue
vad_input_queue = queue.Queue()      # Input queue for SileroVad
asr_input_queue = queue.Queue()      # Input queue for SenseVoice
llm_input_queue = queue.Queue()      # Input queue for LLM_Bailian
tts_input_queue = queue.Queue()      # Input queue for Edge_TTS
avatar_input_queue = queue.Queue()   # Input queue for AvatarMusetalk&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Data Routing Table (data_sinks)&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Data type → list of Handlers that subscribe to this type
data_sinks = {
    ChatDataType.MIC_AUDIO: [
        DataSink(owner="SileroVad", sink_queue=vad_input_queue),
    ],
    ChatDataType.HUMAN_AUDIO: [
        DataSink(owner="SenseVoice", sink_queue=asr_input_queue),
    ],
    ChatDataType.HUMAN_TEXT: [
        DataSink(owner="LLM_Bailian", sink_queue=llm_input_queue),
    ],
    ChatDataType.AVATAR_TEXT: [
        DataSink(owner="Edge_TTS", sink_queue=tts_input_queue),
    ],
    ChatDataType.AVATAR_AUDIO: [
        DataSink(owner="AvatarMusetalk", sink_queue=avatar_input_queue),
    ],
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Output Queue Mapping&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# (Handler name, data type) → output queue
outputs = {
    ("AvatarMusetalk", ChatDataType.AVATAR_VIDEO): DataSink(
        sink_queue=output_queues[EngineChannelType.VIDEO]
    ),
}&lt;/code&gt;&lt;/pre&gt;
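&lt;p&gt;A minimal sketch of how such a mapping might be consulted when a Handler emits a result. The &lt;code&gt;submit_output&lt;/code&gt; helper and the string keys are illustrative assumptions, not the project's actual API:&lt;/p&gt;

```python
import queue

# Hypothetical names mirroring the mapping above; the real code keys on
# enum members (ChatDataType, EngineChannelType) rather than strings.
VIDEO = "VIDEO"
output_queues = {VIDEO: queue.Queue()}

# (handler name, data type) → client-bound output queue
outputs = {
    ("AvatarMusetalk", "AVATAR_VIDEO"): output_queues[VIDEO],
}

def submit_output(handler_name, data_type, item):
    """Route a Handler result to the client output queue, if one is registered."""
    sink_queue = outputs.get((handler_name, data_type))
    if sink_queue is None:
        return False  # intermediate data: it stays inside the pipeline
    sink_queue.put(item)
    return True
```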
&lt;h2&gt;3. The Essence of Handler&lt;/h2&gt;
&lt;h3&gt;3.1 What is a Handler?&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;A Handler is an independent functional module&lt;/strong&gt;, and each Handler is responsible for a specific task:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RTC Client Handler&lt;/strong&gt;: Manages WebRTC connections, receives user input, and sends output&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SileroVad Handler&lt;/strong&gt;: Voice Activity Detection (VAD), detects whether the user is speaking&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SenseVoice Handler&lt;/strong&gt;: Speech Recognition (ASR), converts speech into text&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM Handler&lt;/strong&gt;: Large Language Model, generates response text&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TTS Handler&lt;/strong&gt;: Text-to-Speech (TTS), converts text into audio&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Avatar Handler&lt;/strong&gt;: Avatar driving, generates video from audio&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3.2 The Nature of a Handler: Independent Threads&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Key Understanding&lt;/strong&gt;: Each Handler creates an &lt;strong&gt;independent thread&lt;/strong&gt; when the session starts.&lt;/p&gt;
&lt;img width="800" height="260" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-7.png" alt=""&gt;&lt;p&gt;&lt;strong&gt;Thread Operation Mode&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Core loop of the handler_pumper thread
def handler_pumper(session_context, handler_env, sinks, outputs):
    shared_states = session_context.shared_states
    input_queue = handler_env.input_queue  # this Handler's input queue

    while shared_states.active:  # keep running while the session is active
        try:
            # 1. Read data from the input queue
            input_data = input_queue.get_nowait()
        except queue.Empty:
            time.sleep(0.03)  # sleep for 30 ms when the queue is empty
            continue

        # 2. Call the Handler to process the data
        handler_result = handler_env.handler.handle(
            handler_env.context,
            input_data,
            handler_env.output_info
        )

        # 3. Submit the processed result, which distribute_data()
        #    then routes to the next Handler
        ChatDataSubmitter.submit(handler_result)&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3.3 The Lifecycle of a Handler&lt;/h3&gt;
&lt;h4&gt;Stage 1: Load (load)&lt;/h4&gt;
&lt;p&gt;When the system starts, each Handler executes a load:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;handler.load(engine_config, handler_config)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Load model files&lt;/li&gt;
&lt;li&gt;Initialize global resources&lt;/li&gt;
&lt;li&gt;Prepare the Handler runtime environment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Examples&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;SileroVad: Load the VAD model&lt;/li&gt;
&lt;li&gt;SenseVoice: Load the ASR model&lt;/li&gt;
&lt;li&gt;LLM: Initialize the API client&lt;/li&gt;
&lt;li&gt;Avatar: Load the avatar model&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Stage 2: Create Context (create_context)&lt;/h4&gt;
&lt;p&gt;When each session is created, an independent context is created for each Handler:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;handler_context = handler.create_context(session_context, handler_config)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create session-related states&lt;/li&gt;
&lt;li&gt;For example: LLM creates conversation history, ASR creates an audio buffer&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Stage 3: Handle (handle)&lt;/h4&gt;
&lt;p&gt;During the session, the Handler continuously processes data:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;handler_result = handler.handle(context, inputs, output_definitions)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each Handler runs in its own thread&lt;/li&gt;
&lt;li&gt;It reads data from its own input queue&lt;/li&gt;
&lt;li&gt;After processing, it outputs the result&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Stage 4: Destroy Context (destroy_context)&lt;/h4&gt;
&lt;p&gt;When the session ends, the Handler context is cleaned up:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;handler.destroy_context(handler_context)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Release session-related resources&lt;/li&gt;
&lt;li&gt;Clean up state data&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3.4 Interface Definition of Handlers&lt;/h3&gt;
&lt;p&gt;All Handlers inherit from &lt;code&gt;HandlerBase&lt;/code&gt; and must implement the following interfaces:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class HandlerBase(ABC):
    @abstractmethod
    def load(self, engine_config, handler_config):
        """Load the Handler (e.g., load models)"""
        pass

    @abstractmethod
    def create_context(self, session_context, handler_config):
        """Create the Handler context"""
        pass

    @abstractmethod
    def handle(self, context, inputs, output_definitions):
        """Process input data"""
        pass

    @abstractmethod
    def get_handler_detail(self, session_context, context):
        """Declare input and output data types"""
        return HandlerDetail(
            inputs={...},   # Input type definitions
            outputs={...}   # Output type definitions
        )

    @abstractmethod
    def destroy_context(self, context):
        """Destroy the Handler context"""
        pass
&lt;/code&gt;&lt;/pre&gt;
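&lt;p&gt;To make the contract concrete, here is a minimal, self-contained sketch of a Handler implementing this interface. &lt;code&gt;EchoHandler&lt;/code&gt; and the stub &lt;code&gt;HandlerDetail&lt;/code&gt; are illustrative stand-ins, not part of the OpenAvatarChat codebase:&lt;/p&gt;

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

# Minimal stand-in for the framework's HandlerDetail type.
@dataclass
class HandlerDetail:
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)

class HandlerBase(ABC):
    @abstractmethod
    def load(self, engine_config, handler_config): ...
    @abstractmethod
    def create_context(self, session_context, handler_config): ...
    @abstractmethod
    def handle(self, context, inputs, output_definitions): ...
    @abstractmethod
    def get_handler_detail(self, session_context, context): ...
    @abstractmethod
    def destroy_context(self, context): ...

# A hypothetical Handler that echoes text input back as output.
class EchoHandler(HandlerBase):
    def load(self, engine_config, handler_config):
        # One-time setup; real Handlers would load models here.
        self.prefix = handler_config.get("prefix", "echo: ")

    def create_context(self, session_context, handler_config):
        return {"count": 0}  # per-session state

    def handle(self, context, inputs, output_definitions):
        # handle() is a generator: it yields zero or more outputs per input.
        context["count"] += 1
        yield self.prefix + inputs

    def get_handler_detail(self, session_context, context):
        return HandlerDetail(inputs={"HUMAN_TEXT": None},
                             outputs={"AVATAR_TEXT": None})

    def destroy_context(self, context):
        context.clear()

handler = EchoHandler()
handler.load({}, {})
ctx = handler.create_context(None, {})
outputs = list(handler.handle(ctx, "hello", None))  # → ["echo: hello"]
handler.destroy_context(ctx)
```

The lifecycle calls mirror the four stages above: one &lt;code&gt;load&lt;/code&gt; per process, one &lt;code&gt;create_context&lt;/code&gt;/&lt;code&gt;destroy_context&lt;/code&gt; pair per session, and many &lt;code&gt;handle&lt;/code&gt; calls in between.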
&lt;h3&gt;3.5 Key Method of Handler: get_handler_detail&lt;/h3&gt;
&lt;p&gt;This is the key method for interaction between the Handler and the system. The Handler declares its inputs and outputs through this method:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def get_handler_detail(self, session_context, context) -&amp;gt; HandlerDetail:
    return HandlerDetail(
        inputs={
            ChatDataType.MIC_AUDIO: HandlerDataInfo(
                type=ChatDataType.MIC_AUDIO,
                # Other configurations...
            )
        },
        outputs={
            ChatDataType.HUMAN_AUDIO: HandlerDataInfo(
                type=ChatDataType.HUMAN_AUDIO,
                definition=output_definition,
            )
        }
    )
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;How the System Uses It&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;1. During the &lt;code&gt;prepare_handler()&lt;/code&gt; stage, the system calls &lt;code&gt;get_handler_detail()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;2. Based on the returned &lt;code&gt;inputs&lt;/code&gt;, the system creates a data routing table:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for input_type, input_info in io_detail.inputs.items():
    sink_list = data_sinks.setdefault(input_type, [])
    data_sink = DataSink(
        owner=handler_name,
        sink_queue=handler_input_queue
    )
    sink_list.append(data_sink)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;3. When data of that type arrives, the system automatically distributes it to the Handler’s input queue.&lt;/p&gt;
&lt;h2&gt;4. Handler Collaborative Mechanism&lt;/h2&gt;
&lt;h3&gt;4.1 Data Subscription Mechanism&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Core Idea&lt;/strong&gt;: Handlers “subscribe” to data by declaring input types, and the system automatically establishes data routing.&lt;/p&gt;
&lt;h4&gt;Establishing Subscription Relationships&lt;/h4&gt;
&lt;img width="749" height="396" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-8.png" alt=""&gt;&lt;h4&gt;Subscription Example&lt;/h4&gt;
&lt;p&gt;For example, in the &lt;code&gt;glut3.yaml&lt;/code&gt; configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# SileroVad subscribes to MIC_AUDIO
data_sinks[ChatDataType.MIC_AUDIO] = [
    DataSink(owner="SileroVad", sink_queue=vad_queue),
]

# SenseVoice subscribes to HUMAN_AUDIO (SileroVad's output)
data_sinks[ChatDataType.HUMAN_AUDIO] = [
    DataSink(owner="SenseVoice", sink_queue=asr_queue),
]

# LLM_Bailian subscribes to HUMAN_TEXT (SenseVoice's output)
data_sinks[ChatDataType.HUMAN_TEXT] = [
    DataSink(owner="LLM_Bailian", sink_queue=llm_queue),
]

# Edge_TTS subscribes to AVATAR_TEXT (LLM's output)
data_sinks[ChatDataType.AVATAR_TEXT] = [
    DataSink(owner="Edge_TTS", sink_queue=tts_queue),
]

# AvatarMusetalk subscribes to AVATAR_AUDIO (TTS's output)
data_sinks[ChatDataType.AVATAR_AUDIO] = [
    DataSink(owner="AvatarMusetalk", sink_queue=avatar_queue),
]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4.2 Data Distribution Mechanism (Subscription Distribution)&lt;/h3&gt;
&lt;p&gt;When data arrives, the system automatically distributes it through &lt;code&gt;distribute_data()&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def distribute_data(data: ChatData, sinks, outputs):
    # 1. Check if it is the final output (sent directly to the client)
    source_key = (data.source, data.type)
    if source_key in outputs:
        outputs[source_key].sink_queue.put_nowait(data)

    # 2. Find all Handlers subscribed to this data type
    sink_list = sinks.get(data.type, [])

    # 3. Distribute to all subscribers
    for sink in sink_list:
        if sink.owner == data.source:
            continue  # Skip the data source itself

        sink.sink_queue.put_nowait(data)  # Put into the Handler's input queue
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Key Points&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data is routed automatically based on its type.&lt;/li&gt;
&lt;li&gt;A single piece of data can be distributed to multiple subscribers simultaneously.&lt;/li&gt;
&lt;li&gt;Handlers are completely decoupled and unaware of each other’s existence.&lt;/li&gt;
&lt;/ul&gt;
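&lt;p&gt;These three points can be demonstrated with a small, runnable sketch of the fan-out behavior. The &lt;code&gt;ChatData&lt;/code&gt; and &lt;code&gt;DataSink&lt;/code&gt; classes here are simplified stand-ins for the framework's types:&lt;/p&gt;

```python
import queue
from dataclasses import dataclass

@dataclass
class ChatData:
    source: str   # name of the Handler that produced the data
    type: str     # data type used for routing, e.g. "HUMAN_AUDIO"
    payload: str

@dataclass
class DataSink:
    owner: str
    sink_queue: queue.Queue

def distribute(data: ChatData, sinks: dict) -> None:
    # Fan the data out to every subscriber of this type,
    # skipping the Handler that produced it.
    for sink in sinks.get(data.type, []):
        if sink.owner == data.source:
            continue
        sink.sink_queue.put_nowait(data)

# One producer (SileroVad) and two independent subscribers.
vad_q, asr_q, logger_q = queue.Queue(), queue.Queue(), queue.Queue()
sinks = {"HUMAN_AUDIO": [
    DataSink("SileroVad", vad_q),    # the producer itself: skipped
    DataSink("SenseVoice", asr_q),
    DataSink("Logger", logger_q),    # a second subscriber gets a copy too
]}

distribute(ChatData(source="SileroVad", type="HUMAN_AUDIO", payload="chunk"), sinks)
```

After the call, both &lt;code&gt;asr_q&lt;/code&gt; and &lt;code&gt;logger_q&lt;/code&gt; hold the item while &lt;code&gt;vad_q&lt;/code&gt; stays empty: neither subscriber knows the other exists.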
&lt;h3&gt;4.3 Handler Parallel Processing Mechanism&lt;/h3&gt;
&lt;h4&gt;Parallel Execution&lt;/h4&gt;
&lt;p&gt;All Handler threads run simultaneously without blocking each other:&lt;/p&gt;
&lt;img width="800" height="418" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-9-1024x536.png" alt=""&gt;&lt;h4&gt;Data Flow Sequence Guarantee&lt;/h4&gt;
&lt;p&gt;Although Handlers run in parallel, the data flow is sequential:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;MIC_AUDIO → HUMAN_AUDIO → HUMAN_TEXT → AVATAR_TEXT → AVATAR_AUDIO → AVATAR_VIDEO
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Why the Sequence is Guaranteed&lt;/h4&gt;
&lt;p&gt;1. &lt;strong&gt;Data Type Driven&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;SileroVad outputs &lt;code&gt;HUMAN_AUDIO&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;SenseVoice subscribes to &lt;code&gt;HUMAN_AUDIO&lt;/code&gt; (not &lt;code&gt;MIC_AUDIO&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;SenseVoice therefore only receives data once &lt;code&gt;HUMAN_AUDIO&lt;/code&gt; has been produced&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;2. &lt;strong&gt;Queue Buffering&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each Handler has its own input queue.&lt;/li&gt;
&lt;li&gt;The queue buffers data automatically, preserving arrival order.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;3. &lt;strong&gt;VAD’s Speech End Marker&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;VAD emits a &lt;code&gt;human_speech_end&lt;/code&gt; marker when speech stops.&lt;/li&gt;
&lt;li&gt;ASR waits for this marker before running inference.&lt;/li&gt;
&lt;li&gt;This ensures that a complete speech segment is processed as a whole.&lt;/li&gt;
&lt;/ul&gt;
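&lt;p&gt;The marker-driven pattern can be sketched in a few lines: a consumer drains its FIFO queue, accumulates chunks, and only "runs inference" once the end marker arrives. The &lt;code&gt;(chunk, is_end)&lt;/code&gt; tuple is a simplification of the real metadata:&lt;/p&gt;

```python
import queue

# A FIFO queue preserves the order the producer emitted the chunks in.
audio_q: queue.Queue = queue.Queue()
for item in [("he", False), ("llo", False), ("", True)]:  # last item: end marker
    audio_q.put(item)

buffer: list[str] = []
segments: list[str] = []
while not audio_q.empty():
    chunk, is_end = audio_q.get()
    if chunk:
        buffer.append(chunk)          # accumulate until the segment is complete
    if is_end:
        segments.append("".join(buffer))  # stand-in for ASR inference
        buffer.clear()

# segments == ["hello"]: the segment is processed exactly once, as a whole
```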
&lt;h3&gt;4.4 Decoupling of Handlers&lt;/h3&gt;
&lt;h4&gt;Complete Decoupling&lt;/h4&gt;
&lt;p&gt;Handlers do not communicate directly with each other, only interact through data types:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;❌ Incorrect (tightly coupled):
    SileroVad → direct call → SenseVoice.handle()

✅ Correct (decoupled):
    SileroVad → outputs HUMAN_AUDIO → system distributes → SenseVoice input queue
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Benefits of Decoupling&lt;/h4&gt;
&lt;p&gt;1. &lt;strong&gt;Easy to Extend&lt;/strong&gt;: Adding a new Handler only requires declaring input/output without modifying existing Handlers.&lt;/p&gt;
&lt;p&gt;2. &lt;strong&gt;Flexible Combination&lt;/strong&gt;: Handlers can be flexibly combined through configuration files.&lt;/p&gt;
&lt;p&gt;3. &lt;strong&gt;Easy to Test&lt;/strong&gt;: Each Handler can be tested independently.&lt;/p&gt;
&lt;p&gt;4. &lt;strong&gt;Easy to Maintain&lt;/strong&gt;: Handlers have clear responsibilities and do not interfere with each other.&lt;/p&gt;
&lt;h3&gt;4.5 Session End Mechanism&lt;/h3&gt;
&lt;h4&gt;Shared Flag Control&lt;/h4&gt;
&lt;p&gt;All threads share a flag: &lt;code&gt;shared_states.active&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# While the session is running
shared_states.active = True

# All threads loop and check the flag
while shared_states.active:
    # Process data
    ...

# When the session ends
shared_states.active = False

# All threads then exit their loops
&lt;/code&gt;&lt;/pre&gt;
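&lt;p&gt;A runnable sketch of this cooperative shutdown, assuming a simple &lt;code&gt;SharedStates&lt;/code&gt; object with an &lt;code&gt;active&lt;/code&gt; flag (the class here is an illustrative stand-in):&lt;/p&gt;

```python
import threading
import time

class SharedStates:
    def __init__(self):
        self.active = True

shared_states = SharedStates()
iterations = []

def worker(name: str):
    # Each Handler thread loops while the shared flag is set.
    while shared_states.active:
        iterations.append(name)   # stand-in for "process one queue item"
        time.sleep(0.01)

threads = [threading.Thread(target=worker, args=(n,)) for n in ("vad", "asr")]
for t in threads:
    t.start()

time.sleep(0.05)
shared_states.active = False      # session ends: every loop condition goes False
for t in threads:
    t.join(timeout=1.0)           # all threads exit on their own, no forced kill
```

In CPython a plain boolean attribute works for this pattern; &lt;code&gt;threading.Event&lt;/code&gt; is the more explicit alternative when stricter cross-thread signaling is wanted.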
&lt;h4&gt;End Process&lt;/h4&gt;
&lt;img width="800" height="447" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-10.png" alt=""&gt;&lt;h2&gt;5. Detailed Explanation of Handlers&lt;/h2&gt;
&lt;h3&gt;5.1 RTC Client Handler&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Function&lt;/strong&gt;: Manages WebRTC connections and handles bidirectional communication with the client.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt;: Client audio/video/text (received via WebRTC)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: Avatar video/audio (sent via WebRTC)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Code Locations&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;src/handlers/client/rtc_client/client_handler_rtc.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;src/service/rtc_service/rtc_stream.py&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Workflow&lt;/strong&gt;:&lt;/p&gt;
&lt;img width="746" height="547" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-11.png" alt=""&gt;&lt;h3&gt;5.2 SileroVad Handler (Voice Activity Detection)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Function&lt;/strong&gt;: Detects whether the user is speaking and filters out silence.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt;: &lt;code&gt;ChatDataType.MIC_AUDIO&lt;/code&gt; (raw audio)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: &lt;code&gt;ChatDataType.HUMAN_AUDIO&lt;/code&gt; (human speech audio, with speech activity markers)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Code Location&lt;/strong&gt;: &lt;code&gt;src/handlers/vad/silerovad/vad_handler_silero.py&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Methods&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def get_handler_detail(self, ...):
    return HandlerDetail(
        inputs={
            ChatDataType.MIC_AUDIO: HandlerDataInfo(...)
        },
        outputs={
            ChatDataType.HUMAN_AUDIO: HandlerDataInfo(...)
        }
    )

def handle(self, context, inputs, output_definitions):
    # 1. Extract audio from the input
    audio_data = inputs.data.get_main_data()

    # 2. Run VAD model inference
    is_speech = self.model(audio_data)

    # 3. If speech is detected, output HUMAN_AUDIO
    if is_speech:
        yield ChatData(type=HUMAN_AUDIO, data=audio_data)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Real-time processing with streaming output&lt;/li&gt;
&lt;li&gt;Output carries &lt;code&gt;human_speech_start&lt;/code&gt; and &lt;code&gt;human_speech_end&lt;/code&gt; markers&lt;/li&gt;
&lt;li&gt;ASR relies on these markers to decide when to run recognition&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5.3 SenseVoice Handler (Speech Recognition)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Function&lt;/strong&gt;: Converts speech to text.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt;: &lt;code&gt;ChatDataType.HUMAN_AUDIO&lt;/code&gt; (human speech audio)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: &lt;code&gt;ChatDataType.HUMAN_TEXT&lt;/code&gt; (recognized text)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Code Location&lt;/strong&gt;: &lt;code&gt;src/handlers/asr/sensevoice/asr_handler_sensevoice.py&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Methods&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def get_handler_detail(self, ...):
    return HandlerDetail(
        inputs={
            ChatDataType.HUMAN_AUDIO: HandlerDataInfo(...)
        },
        outputs={
            ChatDataType.HUMAN_TEXT: HandlerDataInfo(...)
        }
    )

def handle(self, context, inputs, output_definitions):
    # 1. Accumulate audio data
    context.audio_buffer.append(inputs.data.get_main_data())

    # 2. Check for the human_speech_end marker
    if inputs.data.has_meta('human_speech_end'):
        # 3. Run ASR inference
        text = self.model(context.audio_buffer)

        # 4. Output the recognized text
        yield ChatData(type=HUMAN_TEXT, data=text)

        # 5. Clear the buffer
        context.audio_buffer.clear()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Accumulates audio while waiting for the speech end marker&lt;/li&gt;
&lt;li&gt;Runs ASR on complete speech segments&lt;/li&gt;
&lt;li&gt;Output text format: &lt;code&gt;&amp;lt;|zh|&amp;gt;&amp;lt;|NEUTRAL|&amp;gt;&amp;lt;|Speech|&amp;gt;&amp;lt;|woitn|&amp;gt;你好&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
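&lt;p&gt;Because the transcript is prefixed with tags of the form &lt;code&gt;&amp;lt;|...|&amp;gt;&lt;/code&gt;, a downstream consumer typically strips them before use. A small sketch of one way to do that (the exact tag set depends on the model, so the regex here is an assumption):&lt;/p&gt;

```python
import re

# Matches SenseVoice-style metadata tags such as <|zh|> or <|NEUTRAL|>.
TAG_PATTERN = re.compile(r"<\|[^|]*\|>")

def strip_sensevoice_tags(raw: str) -> str:
    """Remove <|...|> tags, leaving only the transcript text."""
    return TAG_PATTERN.sub("", raw).strip()

print(strip_sensevoice_tags("<|zh|><|NEUTRAL|><|Speech|><|woitn|>你好"))  # → 你好
```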
&lt;h3&gt;5.4 LLM Handler (Large Language Model)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Function&lt;/strong&gt;: Understands user input and generates response text.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt;: &lt;code&gt;ChatDataType.HUMAN_TEXT&lt;/code&gt; (user text)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: &lt;code&gt;ChatDataType.AVATAR_TEXT&lt;/code&gt; (AI response text)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Code Location&lt;/strong&gt;: &lt;code&gt;src/handlers/llm/openai_compatible/llm_handler_openai_compatible.py&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Methods&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def get_handler_detail(self, ...):
    return HandlerDetail(
        inputs={
            ChatDataType.HUMAN_TEXT: HandlerDataInfo(...)
        },
        outputs={
            ChatDataType.AVATAR_TEXT: HandlerDataInfo(...)
        }
    )

def handle(self, context, inputs, output_definitions):
    # 1. Update the conversation history
    context.history.add_user_message(inputs.data.get_main_data())

    # 2. Call the LLM API (streaming)
    response = self.client.chat.completions.create(
        model=self.model_name,
        messages=context.history.get_messages(),
        stream=True
    )

    # 3. Stream the output text
    for chunk in response:
        text = chunk.choices[0].delta.content
        if text:
            yield ChatData(type=AVATAR_TEXT, data=text)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Maintains conversation history&lt;/li&gt;
&lt;li&gt;Supports streaming output&lt;/li&gt;
&lt;li&gt;Configurable for different LLM backends (Bailian, OpenAI-compatible APIs, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5.5 Edge_TTS Handler (Text-to-Speech)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Function&lt;/strong&gt;: Converts text to speech.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt;: &lt;code&gt;ChatDataType.AVATAR_TEXT&lt;/code&gt; (AI response text)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: &lt;code&gt;ChatDataType.AVATAR_AUDIO&lt;/code&gt; (generated audio)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Code Location&lt;/strong&gt;: &lt;code&gt;src/handlers/tts/edgetts/tts_handler_edgetts.py&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Methods&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def get_handler_detail(self, ...):
    return HandlerDetail(
        inputs={
            ChatDataType.AVATAR_TEXT: HandlerDataInfo(...)
        },
        outputs={
            ChatDataType.AVATAR_AUDIO: HandlerDataInfo(...)
        }
    )

def handle(self, context, inputs, output_definitions):
    # 1. Accumulate text
    context.text_buffer += inputs.data.get_main_data()

    # 2. Check for the text end marker
    if inputs.data.has_meta('text_end'):
        # 3. Call the TTS API to generate audio
        audio = edge_tts.generate(
            text=context.text_buffer,
            voice=self.voice
        )

        # 4. Output the audio stream
        for audio_chunk in audio:
            yield ChatData(type=AVATAR_AUDIO, data=audio_chunk)

        # 5. Clear the buffer
        context.text_buffer = ""
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ Accumulates text and waits for a complete sentence&lt;/p&gt;
&lt;p&gt;▪ Supports multiple voices (selectable via configuration)&lt;/p&gt;
&lt;p&gt;▪ Outputs 24kHz audio&lt;/p&gt;
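&lt;p&gt;The accumulate-then-flush pattern used by the TTS handler can be sketched independently of any TTS backend. The following is a minimal illustration (all names here are hypothetical, not the project's actual classes): streamed text chunks are buffered until an end marker arrives, then the complete sentence is released and the buffer is cleared.&lt;/p&gt;

```python
# Minimal sketch of the text-accumulation pattern described above.
# All names are illustrative, not OpenAvatarChat's actual API.

class TextBuffer:
    """Accumulates streamed text chunks until an end marker is seen."""

    def __init__(self):
        self.buffer = ""

    def feed(self, chunk, is_end=False):
        """Append a chunk; on the end marker, yield the full sentence and reset."""
        self.buffer += chunk
        if is_end:
            sentence = self.buffer
            self.buffer = ""       # clear the buffer, as in step 5 above
            yield sentence

buf = TextBuffer()
out = []
for chunk, end in [("Hello, ", False), ("world.", True)]:
    out.extend(buf.feed(chunk, is_end=end))
# out now holds the single complete sentence "Hello, world."
```

&lt;p&gt;Flushing only on the end marker is what lets the real handler hand the TTS engine a whole sentence at once, rather than synthesizing audio for each token fragment.&lt;/p&gt;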
&lt;h3&gt;5.6 AvatarMusetalk Handler (Avatar Driving)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Function&lt;/strong&gt;: Generates avatar video (lip-sync) from audio.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt;: &lt;code&gt;ChatDataType.AVATAR_AUDIO&lt;/code&gt; (TTS-generated audio)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: &lt;code&gt;ChatDataType.AVATAR_VIDEO&lt;/code&gt; (avatar video frames)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Code Location&lt;/strong&gt;: &lt;code&gt;src/handlers/avatar/musetalk/avatar_handler_musetalk.py&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Methods&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def get_handler_detail(self, ...):
    return HandlerDetail(
        inputs={
            ChatDataType.AVATAR_AUDIO: HandlerDataInfo(...)
        },
        outputs={
            ChatDataType.AVATAR_VIDEO: HandlerDataInfo(...)
        }
    )

def handle(self, context, inputs, output_definitions):
    # 1. Accumulate audio data
    context.audio_buffer.append(inputs.data.get_main_data())

    # 2. Check if there is an audio end marker
    if inputs.data.has_meta('audio_end'):
        # 3. MuseTalk model processing
        video_frames = self.model(
            audio=context.audio_buffer,
            avatar_image=context.avatar_image
        )

        # 4. Output video frame stream
        for frame in video_frames:
            yield ChatData(type=AVATAR_VIDEO, data=frame)

        # 5. Clear the buffer
        context.audio_buffer.clear()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ Precise lip-syncing&lt;/p&gt;
&lt;p&gt;▪ Supports 16fps video output&lt;/p&gt;
&lt;p&gt;▪ Uses the MuseTalk model for inference&lt;/p&gt;
&lt;h3&gt;5.7 Summary of the Handler Processing Chain&lt;/h3&gt;
&lt;img width="479" height="710" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-12.png" alt=""&gt;&lt;h2&gt;6. Quick Reference&lt;/h2&gt;
&lt;h3&gt;6.1 Key Code Locations&lt;/h3&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;File Path&lt;/th&gt;
&lt;th&gt;Key Method&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Main Entry&lt;/td&gt;
&lt;td&gt;&lt;code&gt;src/glut.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;main()&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine Initialization&lt;/td&gt;
&lt;td&gt;&lt;code&gt;src/chat_engine/chat_engine.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ChatEngine.initialize()&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Handler Loading&lt;/td&gt;
&lt;td&gt;&lt;code&gt;src/chat_engine/core/handler_manager.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HandlerManager.initialize()&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session Creation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;src/chat_engine/chat_engine.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ChatEngine.create_client_session()&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Distribution&lt;/td&gt;
&lt;td&gt;&lt;code&gt;src/chat_engine/core/chat_session.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ChatSession.distribute_data()&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input Processing&lt;/td&gt;
&lt;td&gt;&lt;code&gt;src/chat_engine/core/chat_session.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ChatSession.inputs_pumper()&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Handler Processing&lt;/td&gt;
&lt;td&gt;&lt;code&gt;src/chat_engine/core/chat_session.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ChatSession.handler_pumper()&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;6.2 Key Data Structures&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Data Types (ChatDataType)&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;MIC_AUDIO        # Microphone audio
HUMAN_AUDIO      # Human speech audio
HUMAN_TEXT       # User text
AVATAR_TEXT      # AI response text
AVATAR_AUDIO     # TTS audio
AVATAR_VIDEO     # Avatar video
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Data Routing Table (data_sinks)&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;data_sinks: Dict[ChatDataType, List[DataSink]]
# Data type → List of Handlers subscribed to this type
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Handler Registry&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;handler_registries: Dict[str, HandlerRegistry]
# Handler name → Handler registration info
&lt;/code&gt;&lt;/pre&gt;
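&lt;p&gt;The two structures above can be combined into a small routing sketch. The data-type names below come from section 6.2; everything else is illustrative, not OpenAvatarChat's actual implementation. Distribution is a dictionary lookup on the data type followed by a fan-out to every subscribed sink:&lt;/p&gt;

```python
from enum import Enum, auto
from collections import defaultdict

# Data types from section 6.2; the routing logic itself is an
# illustrative sketch, not the project's actual code.
class ChatDataType(Enum):
    HUMAN_TEXT = auto()
    AVATAR_TEXT = auto()
    AVATAR_AUDIO = auto()

# data_sinks: data type -> list of handler callbacks subscribed to that type
data_sinks = defaultdict(list)

def subscribe(data_type, sink):
    """Register a handler callback for one input data type."""
    data_sinks[data_type].append(sink)

def distribute_data(data_type, payload):
    """Fan the payload out to every handler that declared this input type."""
    for sink in data_sinks[data_type]:
        sink(payload)

received = []
subscribe(ChatDataType.AVATAR_TEXT, received.append)
distribute_data(ChatDataType.AVATAR_TEXT, "Hello")
distribute_data(ChatDataType.HUMAN_TEXT, "no subscriber, silently dropped")
# received == ["Hello"]
```

&lt;p&gt;Because producers only name a data type and never a concrete handler, new handlers can subscribe without touching any existing code, which is exactly the loose coupling described in section 6.4.&lt;/p&gt;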
&lt;h3&gt;6.3 Core Execution Flow&lt;/h3&gt;
&lt;img width="800" height="357" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-13-1024x458.png" alt=""&gt;&lt;h3&gt;6.4 Key Features of Modularity&lt;/h3&gt;
&lt;p&gt;1. &lt;strong&gt;Configuration-driven&lt;/strong&gt;: Define Handler combinations through YAML configuration files.&lt;/p&gt;
&lt;p&gt;2. &lt;strong&gt;Dynamic Loading&lt;/strong&gt;: Import and instantiate Handlers dynamically at runtime based on the configuration.&lt;/p&gt;
&lt;p&gt;3. &lt;strong&gt;Data-driven Routing&lt;/strong&gt;: Automatically distribute data based on data types, with Handlers unaware of each other.&lt;/p&gt;
&lt;p&gt;4. &lt;strong&gt;Asynchronous Processing&lt;/strong&gt;: Each Handler runs in its own thread, communicating via queues.&lt;/p&gt;
&lt;p&gt;5. &lt;strong&gt;Loose Coupling&lt;/strong&gt;: Handlers do not depend on each other directly, only on data types.&lt;/p&gt;
&lt;p&gt;6. &lt;strong&gt;Easy to Extend&lt;/strong&gt;: To add a new Handler, simply implement the HandlerBase interface.&lt;/p&gt;
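&lt;p&gt;Points 1, 2, and 6 can be illustrated with a small registry sketch. The names below are hypothetical; the real project imports handler modules dynamically from a YAML file, which is stood in for here by a plain dict:&lt;/p&gt;

```python
# Illustrative sketch of configuration-driven handler assembly.
# A plain dict stands in for both the YAML config and the module registry.

class HandlerBase:
    """Minimal stand-in for the HandlerBase interface."""
    def handle(self, data):
        raise NotImplementedError

class UpperHandler(HandlerBase):
    def handle(self, data):
        return data.upper()

class ExclaimHandler(HandlerBase):
    def handle(self, data):
        return data + "!"

HANDLER_REGISTRY = {
    "upper": UpperHandler,
    "exclaim": ExclaimHandler,
}

def build_pipeline(config):
    """Instantiate handlers by name, as a config-driven loader would."""
    return [HANDLER_REGISTRY[name]() for name in config["handlers"]]

config = {"handlers": ["upper", "exclaim"]}   # stands in for the YAML file
pipeline = build_pipeline(config)

result = "hello"
for handler in pipeline:
    result = handler.handle(result)
# result == "HELLO!"
```

&lt;p&gt;Adding a new handler then only requires implementing the interface and registering it; the configuration file decides whether and where it runs.&lt;/p&gt;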
&lt;h2&gt;7. Summary&lt;/h2&gt;
&lt;p&gt;OpenAvatarChat adopts a layered, modular architecture design:&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Top Layer (ChatEngine)&lt;/strong&gt;: Manages the entire system and supports concurrent multi-session operation.&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Middle Layer (ChatSession)&lt;/strong&gt;: Manages a single session and coordinates the collaborative work of Handlers.&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Bottom Layer (Handler)&lt;/strong&gt;: Independent functional modules that communicate via data types.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Core Mechanisms&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Data Subscription&lt;/strong&gt;: Handlers subscribe to data by declaring input types.&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Automatic Routing&lt;/strong&gt;: The system automatically distributes data based on data types.&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Parallel Processing&lt;/strong&gt;: Handlers run concurrently in independent threads.&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Queue Communication&lt;/strong&gt;: Communication between Handlers is asynchronous and decoupled via queues.&lt;/p&gt;
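&lt;p&gt;The last two mechanisms, parallel handlers and queue communication, can be sketched with the standard library alone. In this illustrative sketch (not the project's code), each handler runs in its own thread and passes results downstream through a queue, with a sentinel value marking the end of the stream:&lt;/p&gt;

```python
import queue
import threading

# Illustrative sketch of queue-decoupled handler threads.
SENTINEL = None

def handler_thread(inbox, outbox, work):
    """Consume from inbox, apply work, push to outbox until the sentinel."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)   # propagate shutdown downstream
            break
        outbox.put(work(item))

q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
t1 = threading.Thread(target=handler_thread, args=(q_in, q_mid, str.upper))
t2 = threading.Thread(target=handler_thread, args=(q_mid, q_out, lambda s: s + "!"))
t1.start()
t2.start()

for word in ["hi", "there"]:
    q_in.put(word)
q_in.put(SENTINEL)

results = []
while True:
    item = q_out.get()
    if item is SENTINEL:
        break
    results.append(item)
t1.join()
t2.join()
# results == ["HI!", "THERE!"]
```

&lt;p&gt;Neither worker knows the other exists; each only sees its own inbox and outbox, which mirrors how Handlers stay decoupled while still forming a processing chain.&lt;/p&gt;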
&lt;p&gt;This design achieves a highly cohesive, loosely coupled architecture that makes the system easy to extend, maintain, and test.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://frankfu.blog/openai/openavatarchat-a-detailed-explanation-of-system-architecture-and-handler-collaboration-mechanism/" rel="noopener noreferrer"&gt;OpenAvatarChat: A Detailed Explanation of System Architecture and Handler Collaboration Mechanism&lt;/a&gt; appeared first on &lt;a href="https://frankfu.blog" rel="noopener noreferrer"&gt;Frank Fu's Blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>openai</category>
    </item>
    <item>
      <title>Deployment tests of IMTalker and LatentSync</title>
      <dc:creator>Frank Fu</dc:creator>
      <pubDate>Mon, 30 Mar 2026 08:54:34 +0000</pubDate>
      <link>https://dev.to/frankfu/deployment-tests-of-imtalker-and-latentsync-1lma</link>
      <guid>https://dev.to/frankfu/deployment-tests-of-imtalker-and-latentsync-1lma</guid>
      <description>&lt;h2&gt;&lt;strong&gt;LatentSync Deployment Test&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;During the LatentSync test on Lambda, I rented A6000 and A100 GPUs. Test results show:&lt;/p&gt;
&lt;p&gt;▪ On the A6000, generating video for 20 seconds of audio took over 100 seconds.&lt;/p&gt;
&lt;p&gt;▪ On the A100, generation time was similar to the A6000.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Generated material:&lt;/strong&gt;&lt;br&gt;I uploaded a video — the same one used with MuseTalk — and combined it with audio, looping for playback.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Generation results:&lt;/strong&gt;&lt;br&gt;Aside from some loss of clarity in the teeth, mouth detail was preserved very well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Real‑time performance:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;From testing LatentSync under these different hardware setups, we conclude:&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Performance gap:&lt;/strong&gt; Although both the A6000 and A100 are high-performance GPUs, video generation still falls short of real-time or near-real-time speed: generating video for 20 seconds of audio takes over 100 seconds.&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Not suitable for real-time applications:&lt;/strong&gt; Based on current hardware results, LatentSync is better suited to offline or batch rendering than to applications requiring quick or real-time video generation.&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Hardware requirements:&lt;/strong&gt; Higher-quality or higher-resolution output calls for stronger GPUs with more VRAM to reduce generation time.&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;IMTalker Deployment Test&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Currently, IMTalker has been tested remotely, but there are some bugs. After clicking “Generate,” a manual page refresh is required to trigger backend processing. This issue is still being fixed, but partial results are now viewable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Generated material:&lt;/strong&gt;&lt;br&gt;Only a single image needs to be uploaded here.&lt;/p&gt;
&lt;img width="800" height="800" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fnavtalk.Luke_.png" alt=""&gt;&lt;p&gt;&lt;strong&gt;Generation results:&lt;/strong&gt;&lt;br&gt;The output video is cropped to a 512×512 region, can blink automatically, and shows very fast real‑time performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Real‑time performance:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;Based on IMTalker testing, we conclude:&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Image cropping:&lt;/strong&gt; The input image is cropped to a 512×512 area.&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Real-time performance:&lt;/strong&gt; Real-time performance meets expectations: the video is generated quickly with synchronized mouth movements.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://frankfu.blog/openai/deployment-tests-of-imtalker-and-latentsync/" rel="noopener noreferrer"&gt;Deployment tests of IMTalker and LatentSync&lt;/a&gt; appeared first on &lt;a href="https://frankfu.blog" rel="noopener noreferrer"&gt;Frank Fu's Blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>openai</category>
    </item>
    <item>
      <title>NVIDIA Jetson Orin Nano Super Developer Kits – Build MIT Mini Cheetah Robot</title>
      <dc:creator>Frank Fu</dc:creator>
      <pubDate>Mon, 30 Mar 2026 08:53:58 +0000</pubDate>
      <link>https://dev.to/frankfu/nvidia-jetson-orin-nano-super-developer-kits-build-mit-mini-cheetah-robot-1478</link>
      <guid>https://dev.to/frankfu/nvidia-jetson-orin-nano-super-developer-kits-build-mit-mini-cheetah-robot-1478</guid>
      <description>&lt;p&gt;This article aims to systematically analyze the technical architecture and implementation details of the MIT Cheetah robot. The content is compiled from publicly available materials and combined with personal practical understanding, intended to provide reference for relevant technical developers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MIT Cheetah System Architecture Diagram:&lt;/strong&gt;&lt;/p&gt;
&lt;img width="768" height="581" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2F9972917e-2fc0-4d7e-b206-9400bc7ffd0f-1-768x581.png" alt=""&gt;&lt;p&gt;&lt;strong&gt;Data Communication Protocol Architecture Diagram:&lt;/strong&gt;&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F12%2F04%2FiJRlwo14krTQ2GP.png" alt="image.png" width="800" height="644"&gt;&lt;h2&gt;1. Introduction to mbedOS&lt;/h2&gt;
&lt;p&gt;Developers who first encounter the MIT Cheetah project may notice that the codebase on GitHub is relatively small, and the compilation method differs from conventional projects. This is primarily because the project uses &lt;strong&gt;mbedOS&lt;/strong&gt; as its underlying development framework.&lt;/p&gt;
&lt;p&gt;The hardware modules of MIT Cheetah have relatively small code volumes. For example, the SPIne module primarily focuses on data interaction processing, while underlying hardware drivers and other basic functions are provided by mbedOS.&lt;/p&gt;
&lt;p&gt;mbedOS is a complete software solution developed by ARM for IoT applications, and it is an embedded open-source ecosystem targeting ARM Cortex-M series processors. For more information, please visit the &lt;a href="https://os.mbed.com/mbed-os/" rel="noopener noreferrer"&gt;mbedOS official website&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The following example demonstrates how to initialize the SPI interface in the SPIne module:&lt;/p&gt;
&lt;pre&gt;void init_spi(void){&lt;br&gt;    SPISlave *spi = new SPISlave(PA_7, PA_6, PA_5, PA_4);&lt;br&gt;    spi-&amp;gt;format(16, 0);        // 16-bit frames&lt;br&gt;    spi-&amp;gt;frequency(12000000);   // 12 MHz&lt;br&gt;    spi-&amp;gt;reply(0x0);&lt;br&gt;    cs.fall(&amp;amp;spi_isr);&lt;br&gt;    printf("done\n\r");&lt;br&gt;}&lt;/pre&gt;
&lt;p&gt;The following is a typical application example of CAN bus communication:&lt;/p&gt;
&lt;pre&gt;#include "mbed.h"&lt;br&gt; &lt;br&gt;DigitalOut myled(D8);&lt;br&gt;CAN can1(PD_0, PD_1,500000);&lt;br&gt;int main() {&lt;br&gt;     CANMessage msg;&lt;br&gt;    while(1) {&lt;br&gt;   if(can1.read(msg)) {&lt;br&gt;            printf("Message received:id=%d,type=%d,%d\n", msg.id,msg.type,msg.data[0]);&lt;br&gt;            myled = !myled;&lt;br&gt;    }&lt;br&gt;    }&lt;br&gt;}&lt;/pre&gt;
&lt;h2&gt;2. MIT Cheetah Open Source Resources&lt;/h2&gt;
&lt;p&gt;The following are open source resource links related to the MIT Cheetah project:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hardware Related:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;▪ Motor Controller Hardware: &lt;a href="https://github.com/bgkatz/3phase_integrated" rel="noopener noreferrer"&gt;https://github.com/bgkatz/3phase_integrated&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;▪ SPIne Hardware: &lt;a href="https://github.com/bgkatz/SPIne" rel="noopener noreferrer"&gt;https://github.com/bgkatz/SPIne&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Software Related:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;▪ Motor Controller Software: &lt;a href="https://os.mbed.com/users/benkatz/code/Hobbyking_Cheetah_Compact_DRV8323/" rel="noopener noreferrer"&gt;https://os.mbed.com/users/benkatz/code/Hobbyking_Cheetah_Compact_DRV8323/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;▪ SPIne Software: &lt;a href="https://os.mbed.com/users/benkatz/code/SPIne/" rel="noopener noreferrer"&gt;https://os.mbed.com/users/benkatz/code/SPIne/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;▪ Linux Control Code (Cheetah Mini): &lt;a href="https://github.com/mit-biomimetics/Cheetah-Software" rel="noopener noreferrer"&gt;https://github.com/mit-biomimetics/Cheetah-Software&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;3. MIT Mini Cheetah Robot System&lt;/h2&gt;
&lt;h3&gt;3.1 Simulation Environment Configuration and Usage&lt;/h3&gt;
&lt;p&gt;After compilation is complete, you need to configure the simulation environment parameters. Navigate to the &lt;code&gt;config&lt;/code&gt; directory under the MIT main folder, open the &lt;code&gt;mini-cheetah-defaults.yaml&lt;/code&gt; file, set &lt;code&gt;control_mode&lt;/code&gt; and &lt;code&gt;cheater_mode&lt;/code&gt; to 1, and set &lt;code&gt;use_rc&lt;/code&gt; to 0. After configuration, save and exit, as shown in the following figure:&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F12%2F04%2FjYLKOTtg6QCSA4a.png" alt="image.png" width="800" height="992"&gt;&lt;p&gt;Next, start the robot simulation environment. It is recommended to connect a game controller before starting (optional, for subsequent control). Navigate to the &lt;code&gt;build&lt;/code&gt; directory under the MIT main folder (Note: directly entering the &lt;code&gt;sim&lt;/code&gt; subdirectory may prevent the simulation from starting, so execute in the &lt;code&gt;build&lt;/code&gt; directory), right-click in a blank area and select “Open in Terminal”, then execute the following command:&lt;/p&gt;
&lt;pre&gt;./sim/sim&lt;/pre&gt;
&lt;p&gt;After execution, the robot simulation control interface will be displayed, as shown in the following figure:&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F11%2F13%2F7Fgd1fjGp9ti5ME.png" alt="image.png" width="800" height="637"&gt;&lt;p&gt;In the control interface, click “Mini Cheetah” and “Simulator” in sequence, then click the “Start” button to launch the robot simulation interface, as shown in the following figure:&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F11%2F13%2F9hY3U72opOu16X5.png" alt="image.png" width="800" height="471"&gt;&lt;p&gt;Next, start the robot controller. Navigate to the &lt;code&gt;build/user/MIT_Controller&lt;/code&gt; directory under the MIT main folder, right-click in a blank area and select “Open in Terminal”, then execute the following command:&lt;/p&gt;
&lt;pre&gt;./mit_ctrl m s&lt;/pre&gt;
&lt;p&gt;Here, &lt;code&gt;mit_ctrl&lt;/code&gt; is the compiled executable file, parameter &lt;code&gt;m&lt;/code&gt; represents the mini cheetah model, and parameter &lt;code&gt;s&lt;/code&gt; represents simulate (simulation mode). After execution, the robot in the simulation should be able to stand up. At this point, switch to the simulation control interface and change the &lt;code&gt;control_mode&lt;/code&gt; value to 4. You can observe that the robot in the simulation switches to trot (trotting gait), as shown in the following figure:&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F11%2F13%2FcZtNHSjzqlsXT92.png" alt="image.png" width="800" height="635"&gt;&lt;p&gt;At this point, you can control the robot’s movement speed through the game controller’s joystick. Readers can explore different control modes on their own. The following describes the implementation method for backflip operation:&lt;/p&gt;
&lt;p&gt;1. Change the &lt;code&gt;control_mode&lt;/code&gt; value in the simulation control interface to 3; the robot will enter a standing state.&lt;/p&gt;
&lt;p&gt;2. Change the &lt;code&gt;control_mode&lt;/code&gt; value to 9; the robot will perform a backflip.&lt;/p&gt;
&lt;p&gt;3. After the backflip is complete, change the &lt;code&gt;control_mode&lt;/code&gt; value back to 3, then to 9 to repeat the backflip.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If the robot falls during operation, you can click the “Go Home” button in the simulation control interface to restore the robot to its initial position. If it cannot be restored, you need to restart the simulation and controller.&lt;/p&gt;
&lt;h3&gt;3.2 Real Robot and Simulation Combined Usage&lt;/h3&gt;
&lt;p&gt;When running the real robot, you need to start both the simulation interface and the controller program:&lt;/p&gt;
&lt;pre&gt;# Terminal 1: Start simulation interface&lt;br&gt;./sim/sim&lt;br&gt;​&lt;br&gt;# Terminal 2: Start controller (real robot mode)&lt;br&gt;./mit_ctrl m r f&lt;/pre&gt;
&lt;p&gt;Here, parameter &lt;code&gt;r&lt;/code&gt; represents robot (real robot mode), and parameter &lt;code&gt;f&lt;/code&gt; represents other configuration options.&lt;/p&gt;
&lt;h2&gt;4. Computer Board Selection&lt;/h2&gt;
&lt;p&gt;The original MIT Mini Cheetah system runs on UP Board, which uses a 4-core Intel Atom x5-Z8350 processor, equipped with 4GB RAM, peak power consumption of approximately 5W, based on x86 architecture.&lt;/p&gt;
&lt;p&gt;UP Board has relatively few applications in the Chinese market. More common choices include Raspberry Pi and NVIDIA Jetson Nano. Among them, Raspberry Pi is more oriented towards general embedded applications, while Jetson Nano is more suitable for image processing and AI model deployment.&lt;/p&gt;
&lt;p&gt;The solution described in this article uses Jetson Nano as the computing platform, running Ubuntu 22 system, equipped with a 6-core ARM Cortex-A78AE v8.2 64-bit processor (ARM architecture).&lt;/p&gt;
&lt;p&gt;It should be noted that for the SPIne board, the GPIO interfaces of UP Board and Jetson Nano are compatible, which provides convenience for platform migration.&lt;/p&gt;
&lt;img width="800" height="669" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2F71tZcfXMFQL._AC_UF8941000_QL80_.jpg" alt=""&gt;&lt;h3&gt;4.1 Jetson Nano Software Environment Configuration&lt;/h3&gt;
&lt;p&gt;The development environment used in this article is Ubuntu 20.04.&lt;/p&gt;
&lt;h4&gt;4.1.1 Download Cheetah-Software Source Code&lt;/h4&gt;
&lt;pre&gt;git clone https://github.com/fuwei007/NavBot-EG02&lt;/pre&gt;
&lt;h4&gt;4.1.2 Install Third-Party Dependency Libraries&lt;/h4&gt;
&lt;pre&gt;sudo apt-get update&lt;br&gt;sudo apt -y install cmake gcc build-essential&lt;br&gt;sudo apt-get -y install openjdk-11-jdk&lt;br&gt;sudo apt -y install liblcm-dev&lt;br&gt;sudo apt-get -y install libeigen3-dev&lt;br&gt;sudo apt-get -y install mesa-common-dev&lt;br&gt;sudo apt -y install libgl1-mesa-dev&lt;br&gt;sudo apt -y install libglu1-mesa-dev&lt;br&gt;sudo apt-get -y install freeglut3-dev&lt;br&gt;sudo apt-get -y install libblas-dev liblapack-dev&lt;br&gt;sudo apt-get -y install libopenblas-dev&lt;br&gt;&lt;br&gt;sudo apt install -y coinor-libipopt-dev gfortran libglib2.0-dev&lt;br&gt;sudo apt install -y openjdk-8-jdk&lt;/pre&gt;
&lt;h4&gt;4.1.3 Install Qt&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Method 1: Source Code Compilation Installation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Download Qt 5.14.2 version: &lt;a href="https://link.zhihu.com/?target=https%3A//download.qt.io/archive/qt/5.14/5.14.2/qt-opensource-linux-x64-5.14.2.run" rel="noopener noreferrer"&gt;Qt 5.14.2 Download Link&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;After downloading, navigate to the directory where the file is located, right-click the Qt installation file, select “Properties” → “Permissions”, and check “Allow executing file as program”. Then open a terminal in that directory and execute the following command to start the Qt installation program (Note: the filename in the command should match the actual downloaded filename):&lt;/p&gt;
&lt;pre&gt;./qt-opensource-linux-x64-5.14.2.run&lt;/pre&gt;
&lt;p&gt;The installation process is similar to installing programs on Windows, just follow the wizard to complete the installation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Method 2: Install Using apt Package Manager&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;sudo apt install -y qtbase5-dev libqt5gamepad5 libqt5gamepad5-dev&lt;/pre&gt;
&lt;h4&gt;4.1.4 Install LCM&lt;/h4&gt;
&lt;p&gt;LCM (Lightweight Communications and Marshalling) is a library for message passing and marshalling.&lt;/p&gt;
&lt;p&gt;Download LCM 1.4.0 installation package: &lt;a href="https://link.zhihu.com/?target=https%3A//github.com/lcm-proj/lcm/releases/download/v1.4.0/lcm-1.4.0.zip" rel="noopener noreferrer"&gt;LCM v1.4.0 Download Link&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;After downloading, extract the archive, navigate to the extracted folder, right-click in a blank area and select “Open in Terminal”, then execute the following commands in sequence (it is recommended to execute them one by one):&lt;/p&gt;
&lt;pre&gt;mkdir build &lt;br&gt;cd build &lt;br&gt;cmake .. &lt;br&gt;make &lt;br&gt;sudo make install &lt;br&gt;sudo ldconfig&lt;/pre&gt;
&lt;h4&gt;4.1.5 Install Eigen 3.3.6&lt;/h4&gt;
&lt;p&gt;Eigen is a C++ template library for linear algebra, matrices, and vectors. In practice, Eigen 3.3.6 is known to work well with the MIT Cheetah project, while other versions may cause compatibility issues, so version 3.3.6 is recommended.&lt;/p&gt;
&lt;p&gt;Download Eigen 3.3.6: &lt;a href="https://link.zhihu.com/?target=https%3A//gitlab.com/libeigen/eigen/-/archive/3.3.6/eigen-3.3.6.zip" rel="noopener noreferrer"&gt;Eigen 3.3.6 Download Link&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;After downloading, extract the archive, navigate to the extracted folder, right-click in a blank area and select “Open in Terminal”, then execute the following commands in sequence (it is recommended to execute them one by one):&lt;/p&gt;
&lt;pre&gt;mkdir build &lt;br&gt;cd build &lt;br&gt;cmake .. &lt;br&gt;make &lt;br&gt;sudo make install &lt;br&gt;sudo ldconfig&lt;/pre&gt;
&lt;h4&gt;4.1.6 Modify Source Code Configuration&lt;/h4&gt;
&lt;p&gt;Navigate to the Cheetah-Software main folder (hereinafter referred to as MIT main folder). The folder structure is shown in the following figure:&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F11%2F13%2FfT6vP2tcIyMYBOu.png" alt="image.png" width="800" height="451"&gt;&lt;p&gt;The following modifications need to be made:&lt;/p&gt;
&lt;h5&gt;&lt;strong&gt;Step 1: Modify Branch Name in CMakeLists.txt&lt;/strong&gt;&lt;/h5&gt;
&lt;p&gt;Open the &lt;code&gt;common/CMakeLists.txt&lt;/code&gt; file under the MIT main folder, and change &lt;code&gt;master&lt;/code&gt; at the position marked in the figure below to &lt;code&gt;main&lt;/code&gt;. At the same time, since pulling the googletest library from GitHub is slow in China, it is recommended to switch to the Gitee mirror source.&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F11%2F13%2FP4LaXVvb82DqQ5U.png" alt="image.png" width="800" height="454"&gt;&lt;p&gt;After modification, save and exit.&lt;/p&gt;
&lt;h5&gt;&lt;strong&gt;Step 2: Modify Eigen3 and LCM Header File Paths&lt;/strong&gt;&lt;/h5&gt;
&lt;p&gt;Modify the header file include paths according to the actual installation path. If Eigen3 and LCM are installed in the &lt;code&gt;/usr/include&lt;/code&gt; directory (rather than the default &lt;code&gt;/usr/local/include&lt;/code&gt; in the source code), you need to modify the include paths in all related files.&lt;/p&gt;
&lt;p&gt;Search and replace the following content:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Original Path:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;include_directories("/usr/local/include/lcm/")&lt;br&gt;include_directories("/usr/local/include/eigen3")&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Replace with:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;include_directories("/usr/include/lcm/")&lt;br&gt;include_directories("/usr/include/eigen3")&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;List of Files to Modify:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;Cheetah-Software-master/common/CMakeLists.txt&lt;br&gt;Cheetah-Software-master/rc_test/CMakeLists.txt&lt;br&gt;Cheetah-Software-master/robot/CMakeLists.txt&lt;br&gt;Cheetah-Software-master/sim/CMakeLists.txt&lt;br&gt;Cheetah-Software-master/user/MIT_Controller/CMakeLists.txt&lt;/pre&gt;
&lt;h5&gt;&lt;strong&gt;Step 3: Modify Qt Path Configuration&lt;/strong&gt;&lt;/h5&gt;
&lt;p&gt;Modify the file &lt;code&gt;Cheetah-Software/scripts/find_qt_path.sh&lt;/code&gt;, comment out the default Qt path:&lt;/p&gt;
&lt;pre&gt;#printf "${HOME}/Qt/${QT_VER}/gcc_64/"&lt;/pre&gt;
&lt;p&gt;Then add your own Qt installation path, as shown in the following figure:&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F11%2F13%2FdjzZOGoauXgI3nW.png" alt="image.png" width="464" height="218"&gt;&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The path after &lt;code&gt;printf&lt;/code&gt; should include the &lt;code&gt;bin&lt;/code&gt; directory.&lt;/p&gt;
&lt;h5&gt;&lt;strong&gt;Step 4: Fix Missing Serial Port Header Issue&lt;/strong&gt;&lt;/h5&gt;
&lt;p&gt;Modify the file &lt;code&gt;Cheetah-Software/robot/src/rt/rt_serial.cpp&lt;/code&gt;:&lt;/p&gt;
&lt;p&gt;▪ Comment out &lt;code&gt;#include &amp;lt;stropts.h&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;▪ Add &lt;code&gt;#include &amp;lt;sys/ioctl.h&amp;gt;&lt;/code&gt; before &lt;code&gt;#include &amp;lt;asm/termios.h&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Then fix the resulting struct redefinition conflict: open &lt;code&gt;/usr/include/asm-generic/termios.h&lt;/code&gt; (e.g. with &lt;code&gt;sudo vim /usr/include/asm-generic/termios.h&lt;/code&gt;) and wrap the conflicting definitions with &lt;code&gt;#ifndef _SYS_IOCTL_H&lt;/code&gt; and &lt;code&gt;#endif&lt;/code&gt; at the following position:&lt;/p&gt;
&lt;pre&gt;#ifndef _SYS_IOCTL_H&lt;br&gt;struct winsize {&lt;br&gt;        unsigned short ws_row;&lt;br&gt;        unsigned short ws_col;&lt;br&gt;        unsigned short ws_xpixel;&lt;br&gt;        unsigned short ws_ypixel;&lt;br&gt;};&lt;br&gt;&lt;br&gt;#define NCC 8&lt;br&gt;struct termio {&lt;br&gt;        unsigned short c_iflag;         /* input mode flags */&lt;br&gt;        unsigned short c_oflag;         /* output mode flags */&lt;br&gt;        unsigned short c_cflag;         /* control mode flags */&lt;br&gt;        unsigned short c_lflag;         /* local mode flags */&lt;br&gt;        unsigned char c_line;           /* line discipline */&lt;br&gt;        unsigned char c_cc[NCC];        /* control characters */&lt;br&gt;};&lt;br&gt;#endif&lt;/pre&gt;
&lt;h4&gt;4.1.7 Compile Program&lt;/h4&gt;
&lt;p&gt;It is recommended to build inside the &lt;code&gt;mc-build&lt;/code&gt; folder at the project root directory:&lt;/p&gt;
&lt;pre&gt;cd mc-build&lt;br&gt;rm CMakeCache.txt  # Clean up old configuration (if present)&lt;br&gt;&lt;br&gt;# Configure the project&lt;br&gt;# -DMINI_CHEETAH_BUILD=TRUE: build the Mini Cheetah version  &lt;br&gt;# -DJCQP_USE_AVX2=OFF: disable x86 AVX2 optimizations, so it suits ARM architectures (e.g., Jetson Nano / NX)  &lt;br&gt;cmake -DMINI_CHEETAH_BUILD=TRUE -DJCQP_USE_AVX2=OFF ..&lt;br&gt;&lt;br&gt;# Build (adjust the -j parameter according to the number of CPU cores)&lt;br&gt;make -j4&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;▪ File-deletion errors reported while &lt;code&gt;./make_types.sh&lt;/code&gt; runs can be ignored&lt;/p&gt;
&lt;p&gt;▪ The &lt;code&gt;cmake&lt;/code&gt; step may appear to hang (usually while fetching resources from Google servers); this is a network issue and simply requires patience&lt;/p&gt;
&lt;p&gt;▪ The example uses &lt;code&gt;make -j4&lt;/code&gt; (four parallel jobs); &lt;code&gt;make -j$(nproc)&lt;/code&gt; instead uses all available CPU cores, while plain &lt;code&gt;make&lt;/code&gt; builds serially and is slower&lt;/p&gt;
&lt;h2&gt;5. SPIne Data Communication Conversion Board&lt;/h2&gt;
&lt;p&gt;SPIne is a key communication conversion module in the MIT Cheetah system, responsible for data conversion and transmission between the computer board and motor controllers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Open Source Code Download Address:&lt;/strong&gt; &lt;a href="https://gitee.com/lookc4/spine" rel="noopener noreferrer"&gt;SPIne Firmware&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;5.1 Communication Rate Configuration&lt;/h3&gt;
&lt;p&gt;1. &lt;strong&gt;CAN Bus Communication:&lt;/strong&gt; Each CAN bus runs at 1Mbps. SPIne uses two STM32 microcontrollers because a single CAN bus does not have enough bandwidth for all twelve motors. Each STM32 provides two CAN buses, and each bus serves three motors, which allows the required 1000Hz communication frequency; a single bus serving two legs (six motors) could not sustain that rate.&lt;/p&gt;
&lt;p&gt;2. &lt;strong&gt;SPI Communication:&lt;/strong&gt; The SPI communication clock frequency between SPIne and the computer board is 12MHz, with a communication frequency of 1000Hz.&lt;/p&gt;
&lt;h3&gt;5.2 Communication Data Format&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;CAN Format:&lt;/strong&gt; A CAN frame carries up to 8 data bytes; command frames use all 8 bytes, while feedback frames use 5.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SPIne → Joint Motor Controller (Command, 8 bytes):&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;▪ Position command: 16 bit&lt;/p&gt;
&lt;p&gt;▪ Velocity command: 12 bit&lt;/p&gt;
&lt;p&gt;▪ Kp (position gain): 12 bit&lt;/p&gt;
&lt;p&gt;▪ Kd (velocity gain): 12 bit&lt;/p&gt;
&lt;p&gt;▪ Feedforward torque: 12 bit&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Joint Motor Controller → SPIne (Feedback, 5 bytes):&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;▪ Position information: 16 bit&lt;/p&gt;
&lt;p&gt;▪ Velocity information: 12 bit&lt;/p&gt;
&lt;p&gt;▪ Current (torque): 12 bit&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PC → SPIne (Command, 132 bytes):&lt;/strong&gt; Contains 33 data items: position commands, velocity commands, Kp, Kd, and feedforward torque for 6 joints, plus two flags and one checksum.&lt;/p&gt;
&lt;h3&gt;5.3 Code Architecture&lt;/h3&gt;
&lt;p&gt;The code structure of the SPIne firmware is shown in the following figure:&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F10%2F30%2FurBwg3GDTP1mpcQ.png" alt="image.png" width="235" height="167"&gt;&lt;p&gt;&lt;strong&gt;Main Module Descriptions:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;leg_message:&lt;/strong&gt; Responsible for downlink and uplink data encapsulation between the UPboard and SPIne, as well as data encapsulation between SPIne and the motor controllers&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;math_ops:&lt;/strong&gt; Provides mathematical operation functions&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;main:&lt;/strong&gt; Program main entry point&lt;/p&gt;
&lt;h3&gt;5.4 Communication Protocol Details&lt;/h3&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F10%2F30%2FHFvSJYiINnOE983.png" alt="image.png" width="656" height="450"&gt;&lt;p&gt;UPboard Computer Board &amp;lt;----&amp;gt; SPIne Firmware&lt;/p&gt;
&lt;p&gt;SPI Communication Protocol:&lt;/p&gt;
&lt;pre&gt;SPIne Firmware----&amp;gt;UPboard Computer Board&lt;br&gt;&lt;br&gt;// 60 bytes&lt;br&gt;// 30 16-bit words&lt;br&gt;struct spi_data_t&lt;br&gt;{&lt;br&gt;    float q_abad[2];&lt;br&gt;    float q_hip[2];&lt;br&gt;    float q_knee[2];&lt;br&gt;    float qd_abad[2];&lt;br&gt;    float qd_hip[2];&lt;br&gt;    float qd_knee[2];&lt;br&gt;    int32_t flags[2];&lt;br&gt;    int32_t checksum;&lt;br&gt;};&lt;br&gt;&lt;br&gt;UPboard Computer Board----&amp;gt;SPIne Firmware&lt;br&gt;&lt;br&gt;// 132 bytes&lt;br&gt;// 66 16-bit words&lt;br&gt;struct spi_command_t&lt;br&gt;{&lt;br&gt;    float q_des_abad[2];&lt;br&gt;    float q_des_hip[2];&lt;br&gt;    float q_des_knee[2];&lt;br&gt;    float qd_des_abad[2];&lt;br&gt;    float qd_des_hip[2];&lt;br&gt;    float qd_des_knee[2];&lt;br&gt;    float kp_abad[2];&lt;br&gt;    float kp_hip[2];&lt;br&gt;    float kp_knee[2];&lt;br&gt;    float kd_abad[2];&lt;br&gt;    float kd_hip[2];&lt;br&gt;    float kd_knee[2];&lt;br&gt;    float tau_abad_ff[2];&lt;br&gt;    float tau_hip_ff[2];&lt;br&gt;    float tau_knee_ff[2];&lt;br&gt;    int32_t flags[2];&lt;br&gt;    int32_t checksum;&lt;br&gt;};&lt;/pre&gt;
&lt;p&gt;SPIne Firmware &amp;lt;----&amp;gt; Motor Controller&lt;/p&gt;
&lt;p&gt;CAN Communication Protocol:&lt;/p&gt;
&lt;pre&gt;SPIne Firmware----&amp;gt;Motor Controller&lt;br&gt;&lt;br&gt;/// CAN Command Packet Structure ///&lt;br&gt;/// 16 bit position command, between -4*pi and 4*pi&lt;br&gt;/// 12 bit velocity command, between -30 and + 30 rad/s&lt;br&gt;/// 12 bit kp, between 0 and 500 N-m/rad&lt;br&gt;/// 12 bit kd, between 0 and 100 N-m*s/rad&lt;br&gt;/// 12 bit feed forward torque, between -18 and 18 N-m&lt;br&gt;/// CAN Packet is 8 8-bit words&lt;br&gt;/// Formatted as follows.  For each quantity, bit 0 is LSB&lt;br&gt;/// 0: [position[15-8]]&lt;br&gt;/// 1: [position[7-0]] &lt;br&gt;/// 2: [velocity[11-4]]&lt;br&gt;/// 3: [velocity[3-0], kp[11-8]]&lt;br&gt;/// 4: [kp[7-0]]&lt;br&gt;/// 5: [kd[11-4]]&lt;br&gt;/// 6: [kd[3-0], torque[11-8]]&lt;br&gt;/// 7: [torque[7-0]]&lt;br&gt;&lt;br&gt;&lt;br&gt;SPIne Firmware&amp;lt;----Motor Controller&lt;br&gt;&lt;br&gt;/// CAN Reply Packet Structure &lt;br&gt;/// 16 bit position, between -4*pi and 4*pi&lt;br&gt;/// 12 bit velocity, between -30 and + 30 rad/s&lt;br&gt;/// 12 bit current, between -40 and 40;&lt;br&gt;/// CAN Packet is 5 8-bit words&lt;br&gt;/// Formatted as follows.  For each quantity, bit 0 is LSB&lt;br&gt;/// 0: [position[15-8]]&lt;br&gt;/// 1: [position[7-0]] &lt;br&gt;/// 2: [velocity[11-4]]&lt;br&gt;/// 3: [velocity[3-0], current[11-8]]&lt;br&gt;/// 4: [current[7-0]]&lt;/pre&gt;
&lt;h3&gt;5.5 SPIne Firmware SPI Communication Implementation Analysis&lt;/h3&gt;
&lt;h4&gt;5.5.1 Data Buffer Definition&lt;/h4&gt;
&lt;pre&gt;// Receive and transmit buffer lengths, in 16-bit words&lt;br&gt;#define RX_LEN 66&lt;br&gt;#define TX_LEN 66&lt;br&gt;// SPI data buffers&lt;br&gt;uint16_t rx_buff[RX_LEN];&lt;br&gt;uint16_t tx_buff[TX_LEN];&lt;/pre&gt;
&lt;h4&gt;5.5.2 SPIne and UPboard Data Encapsulation Structure&lt;/h4&gt;
&lt;pre&gt;spi_data_t spi_data; // data from spine to up&lt;br&gt;spi_command_t spi_command; // data from up to spine&lt;br&gt;&lt;br&gt;// 60 bytes&lt;br&gt;// 30 16-bit words&lt;br&gt;struct spi_data_t&lt;br&gt;{    //position&lt;br&gt;    float q_abad[2];&lt;br&gt;    float q_hip[2];&lt;br&gt;    float q_knee[2];&lt;br&gt;     //velocity&lt;br&gt;    float qd_abad[2];&lt;br&gt;    float qd_hip[2];&lt;br&gt;    float qd_knee[2];&lt;br&gt;    //flags and checksum&lt;br&gt;    int32_t flags[2];&lt;br&gt;    int32_t checksum;&lt;br&gt;};&lt;br&gt;&lt;br&gt;// 132 bytes&lt;br&gt;// 66 16-bit words&lt;br&gt;struct spi_command_t&lt;br&gt;{   &lt;br&gt;    //position&lt;br&gt;    float q_des_abad[2];&lt;br&gt;    float q_des_hip[2];&lt;br&gt;    float q_des_knee[2];&lt;br&gt;   //velocity&lt;br&gt;    float qd_des_abad[2];&lt;br&gt;    float qd_des_hip[2];&lt;br&gt;    float qd_des_knee[2];&lt;br&gt;    //gain KP&lt;br&gt;    float kp_abad[2];&lt;br&gt;    float kp_hip[2];&lt;br&gt;    float kp_knee[2];&lt;br&gt;    //gain KD&lt;br&gt;    float kd_abad[2];&lt;br&gt;    float kd_hip[2];&lt;br&gt;    float kd_knee[2];&lt;br&gt;     //feedforward torque&lt;br&gt;    float tau_abad_ff[2];&lt;br&gt;    float tau_hip_ff[2];&lt;br&gt;    float tau_knee_ff[2];&lt;br&gt;    //flags and checksum&lt;br&gt;    int32_t flags[2];&lt;br&gt;    int32_t checksum;&lt;br&gt;};&lt;/pre&gt;
&lt;h4&gt;5.5.3 SPI Interrupt Service Routine&lt;/h4&gt;
&lt;p&gt;The SPI interrupt service routine is responsible for handling SPI communication data transmission and reception:&lt;/p&gt;
&lt;pre&gt;void spi_isr(void)&lt;br&gt;{&lt;br&gt;    // Pulse GPIOC pin 8 (likely a scope/debug marker)&lt;br&gt;    GPIOC-&amp;gt;ODR |= (1 &amp;lt;&amp;lt; 8);&lt;br&gt;    GPIOC-&amp;gt;ODR &amp;amp;= ~(1 &amp;lt;&amp;lt; 8);&lt;br&gt;    int bytecount = 0;&lt;br&gt;    SPI1-&amp;gt;DR = tx_buff[0];&lt;br&gt;    while(cs == 0) {&lt;br&gt;        if(SPI1-&amp;gt;SR &amp;amp; 0x1) {&lt;br&gt;            rx_buff[bytecount] = SPI1-&amp;gt;DR; // data reception&lt;br&gt;            bytecount++;&lt;br&gt;            if(bytecount &amp;lt; TX_LEN) {&lt;br&gt;                SPI1-&amp;gt;DR = tx_buff[bytecount]; // data transmission&lt;br&gt;            }&lt;br&gt;        }&lt;br&gt;    }&lt;br&gt;    // Checksum over the received command words, excluding the checksum field itself&lt;br&gt;    uint32_t calc_checksum = xor_checksum((uint32_t*)rx_buff, 32);&lt;br&gt;    // Copy the received words into the spi_command structure&lt;br&gt;    for(int i = 0; i &amp;lt; CMD_LEN; i++)&lt;br&gt;    {&lt;br&gt;        ((uint16_t*)(&amp;amp;spi_command))[i] = rx_buff[i];&lt;br&gt;    }&lt;br&gt;&lt;br&gt;    // Flag a checksum mismatch so the UPboard can detect a corrupted command&lt;br&gt;    if(calc_checksum != spi_command.checksum) {&lt;br&gt;        spi_data.flags[1] = 0xdead;&lt;br&gt;    }&lt;br&gt;&lt;br&gt;    control();  // Fill spi_data (and thus tx_buff) with motor state for the next SPI transfer to the UPboard&lt;br&gt;    PackAll();  // Pack the control data received from the UPboard into the CAN buffers&lt;br&gt;    WriteAll(); // Send the CAN buffers to the leg motor controllers via CAN&lt;br&gt;}&lt;/pre&gt;
&lt;h4&gt;5.5.4 Control Function Implementation&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;control()&lt;/code&gt; function is responsible for assigning the state information received from the motor controller to &lt;code&gt;tx_buff&lt;/code&gt; and sending it to UPboard via SPI:&lt;/p&gt;
&lt;pre&gt;void control()&lt;br&gt;{&lt;br&gt;    //Enter motor mode&lt;br&gt;    if(((spi_command.flags[0]&amp;amp;0x1)==1)  &amp;amp;&amp;amp; (enabled==0)){&lt;br&gt;        enabled = 1;&lt;br&gt;        EnterMotorMode(&amp;amp;a1_can);&lt;br&gt;        can1.write(a1_can);&lt;br&gt;        EnterMotorMode(&amp;amp;a2_can);&lt;br&gt;        can2.write(a2_can);&lt;br&gt;        EnterMotorMode(&amp;amp;k1_can);&lt;br&gt;        can1.write(k1_can);&lt;br&gt;        EnterMotorMode(&amp;amp;k2_can);&lt;br&gt;        can2.write(k2_can);&lt;br&gt;        EnterMotorMode(&amp;amp;h1_can);&lt;br&gt;        can1.write(h1_can);&lt;br&gt;        EnterMotorMode(&amp;amp;h2_can);&lt;br&gt;        can2.write(h2_can);&lt;br&gt;        printf("e\n\r");&lt;br&gt;        return;&lt;br&gt;    }&lt;br&gt;      //Exit motor mode&lt;br&gt;    else if((((spi_command.flags[0]&amp;amp;0x1))==0)  &amp;amp;&amp;amp; (enabled==1)){&lt;br&gt;         enabled = 0;&lt;br&gt;        ExitMotorMode(&amp;amp;a1_can);&lt;br&gt;        can1.write(a1_can);&lt;br&gt;        ExitMotorMode(&amp;amp;a2_can);&lt;br&gt;        can2.write(a2_can);&lt;br&gt;        ExitMotorMode(&amp;amp;h1_can);&lt;br&gt;        can1.write(h1_can);&lt;br&gt;        ExitMotorMode(&amp;amp;h2_can);&lt;br&gt;        can2.write(h2_can);&lt;br&gt;        ExitMotorMode(&amp;amp;k1_can);&lt;br&gt;        can1.write(k1_can);&lt;br&gt;        ExitMotorMode(&amp;amp;k2_can);&lt;br&gt;        can2.write(k2_can);&lt;br&gt;        printf("x\n\r");&lt;br&gt;        return;&lt;br&gt;        }&lt;br&gt;    //Assign the state information received from the motor controller to spi_data (send to UPboard)&lt;br&gt;    spi_data.q_abad[0] = l1_state.a.p;&lt;br&gt;    spi_data.q_hip[0] = l1_state.h.p;&lt;br&gt;    spi_data.q_knee[0] = l1_state.k.p;&lt;br&gt;    spi_data.qd_abad[0] = l1_state.a.v;&lt;br&gt;    spi_data.qd_hip[0] = l1_state.h.v;&lt;br&gt;    spi_data.qd_knee[0] = l1_state.k.v;&lt;br&gt;    &lt;br&gt;    spi_data.q_abad[1] = 
l2_state.a.p;&lt;br&gt;    spi_data.q_hip[1] = l2_state.h.p;&lt;br&gt;    spi_data.q_knee[1] = l2_state.k.p;&lt;br&gt;    spi_data.qd_abad[1] = l2_state.a.v;&lt;br&gt;    spi_data.qd_hip[1] = l2_state.h.v;&lt;br&gt;    spi_data.qd_knee[1] = l2_state.k.v;&lt;br&gt;       &lt;br&gt;    if(estop==0){//Emergency stop&lt;br&gt;        //printf("estopped!!!!\n\r");&lt;br&gt;        memset(&amp;amp;l1_control, 0, sizeof(l1_control));&lt;br&gt;        memset(&amp;amp;l2_control, 0, sizeof(l2_control));&lt;br&gt;        spi_data.flags[0] = 0xdead;&lt;br&gt;        spi_data.flags[1] = 0xdead;&lt;br&gt;        led = 1;&lt;br&gt;        }&lt;br&gt;    &lt;br&gt;    else{//Running state, assign the spi_command data received from UPboard to l1_control (send to motor controller)&lt;br&gt;        led = 0;&lt;br&gt;        &lt;br&gt;        memset(&amp;amp;l1_control, 0, sizeof(l1_control));&lt;br&gt;        memset(&amp;amp;l2_control, 0, sizeof(l2_control));&lt;br&gt;        &lt;br&gt;        l1_control.a.p_des = spi_command.q_des_abad[0];&lt;br&gt;        l1_control.a.v_des  = spi_command.qd_des_abad[0];&lt;br&gt;        l1_control.a.kp = spi_command.kp_abad[0];&lt;br&gt;        l1_control.a.kd = spi_command.kd_abad[0];&lt;br&gt;        l1_control.a.t_ff = spi_command.tau_abad_ff[0];&lt;br&gt;        &lt;br&gt;        l1_control.h.p_des = spi_command.q_des_hip[0];&lt;br&gt;        l1_control.h.v_des  = spi_command.qd_des_hip[0];&lt;br&gt;        l1_control.h.kp = spi_command.kp_hip[0];&lt;br&gt;        l1_control.h.kd = spi_command.kd_hip[0];&lt;br&gt;        l1_control.h.t_ff = spi_command.tau_hip_ff[0];&lt;br&gt;        &lt;br&gt;        l1_control.k.p_des = spi_command.q_des_knee[0];&lt;br&gt;        l1_control.k.v_des  = spi_command.qd_des_knee[0];&lt;br&gt;        l1_control.k.kp = spi_command.kp_knee[0];&lt;br&gt;        l1_control.k.kd = spi_command.kd_knee[0];&lt;br&gt;        l1_control.k.t_ff = spi_command.tau_knee_ff[0];&lt;br&gt;        &lt;br&gt;        
l2_control.a.p_des = spi_command.q_des_abad[1];&lt;br&gt;        l2_control.a.v_des  = spi_command.qd_des_abad[1];&lt;br&gt;        l2_control.a.kp = spi_command.kp_abad[1];&lt;br&gt;        l2_control.a.kd = spi_command.kd_abad[1];&lt;br&gt;        l2_control.a.t_ff = spi_command.tau_abad_ff[1];&lt;br&gt;        &lt;br&gt;        l2_control.h.p_des = spi_command.q_des_hip[1];&lt;br&gt;        l2_control.h.v_des  = spi_command.qd_des_hip[1];&lt;br&gt;        l2_control.h.kp = spi_command.kp_hip[1];&lt;br&gt;        l2_control.h.kd = spi_command.kd_hip[1];&lt;br&gt;        l2_control.h.t_ff = spi_command.tau_hip_ff[1];&lt;br&gt;        &lt;br&gt;        l2_control.k.p_des = spi_command.q_des_knee[1];&lt;br&gt;        l2_control.k.v_des  = spi_command.qd_des_knee[1];&lt;br&gt;        l2_control.k.kp = spi_command.kp_knee[1];&lt;br&gt;        l2_control.k.kd = spi_command.kd_knee[1];&lt;br&gt;        l2_control.k.t_ff = spi_command.tau_knee_ff[1];&lt;br&gt;        &lt;br&gt;        //Soft stop program to prevent stopping too abruptly&lt;br&gt;        spi_data.flags[0] = 0;&lt;br&gt;        spi_data.flags[1] = 0;&lt;br&gt;        spi_data.flags[0] |= softstop_joint(l1_state.a, &amp;amp;l1_control.a, A_LIM_P, A_LIM_N);&lt;br&gt;        spi_data.flags[0] |= (softstop_joint(l1_state.h, &amp;amp;l1_control.h, H_LIM_P, H_LIM_N))&amp;lt;&amp;lt;1;&lt;br&gt;        //spi_data.flags[0] |= (softstop_joint(l1_state.k, &amp;amp;l1_control.k, K_LIM_P, K_LIM_N))&amp;lt;&amp;lt;2;&lt;br&gt;        spi_data.flags[1] |= softstop_joint(l2_state.a, &amp;amp;l2_control.a, A_LIM_P, A_LIM_N);&lt;br&gt;        spi_data.flags[1] |= (softstop_joint(l2_state.h, &amp;amp;l2_control.h, H_LIM_P, H_LIM_N))&amp;lt;&amp;lt;1;&lt;br&gt;        //spi_data.flags[1] |= (softstop_joint(l2_state.k, &amp;amp;l2_control.k, K_LIM_P, K_LIM_N))&amp;lt;&amp;lt;2;&lt;br&gt;    }&lt;br&gt;    spi_data.checksum = xor_checksum((uint32_t*)&amp;amp;spi_data,14);&lt;br&gt;    for(int i = 0; i &amp;lt; DATA_LEN; 
i++){&lt;br&gt;        tx_buff[i] = ((uint16_t*)(&amp;amp;spi_data))[i];}&lt;br&gt;    &lt;br&gt;}&lt;/pre&gt;
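&lt;p&gt;The &lt;code&gt;xor_checksum()&lt;/code&gt; helper called above is not shown in this excerpt. A minimal host-side sketch consistent with the call &lt;code&gt;xor_checksum((uint32_t*)&amp;amp;spi_data, 14)&lt;/code&gt; (a word-wise XOR over the first 14 32-bit words, which the UPboard recomputes to detect corrupted SPI transfers) could look like this; the signature is an assumption inferred from the call site:&lt;/p&gt;

```c
#include <stdint.h>
#include <stddef.h>

/* Word-wise XOR over a buffer of 32-bit words. The UPboard recomputes
 * the same value over the received spi_data to detect corrupted SPI
 * transfers. Sketch only: the signature is inferred from the call
 * xor_checksum((uint32_t*)&spi_data, 14) above. */
uint32_t xor_checksum(uint32_t *data, size_t len)
{
    uint32_t t = 0;
    for (size_t i = 0; i < len; i++)
        t ^= data[i];
    return t;
}
```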
&lt;h4&gt;5.5.5 Soft Stop Program Implementation&lt;/h4&gt;
&lt;p&gt;The soft-stop routine prevents abrupt motion when a joint exceeds its position limits: it zeroes the position gain, applies damping via &lt;code&gt;KD_SOFTSTOP&lt;/code&gt;, and adds a restoring feed-forward torque proportional to the overshoot:&lt;/p&gt;
&lt;pre&gt;int softstop_joint(joint_state state, joint_control * control, float limit_p, float limit_n){&lt;br&gt;    if((state.p)&amp;gt;=limit_p){&lt;br&gt;        //control-&amp;gt;p_des = limit_p;&lt;br&gt;        control-&amp;gt;v_des = 0.0f;&lt;br&gt;        control-&amp;gt;kp = 0;&lt;br&gt;        control-&amp;gt;kd = KD_SOFTSTOP;&lt;br&gt;        control-&amp;gt;t_ff += KP_SOFTSTOP*(limit_p - state.p);&lt;br&gt;        return 1;&lt;br&gt;    }&lt;br&gt;    else if((state.p)&amp;lt;=limit_n){&lt;br&gt;        //control-&amp;gt;p_des = limit_n;&lt;br&gt;        control-&amp;gt;v_des = 0.0f;&lt;br&gt;        control-&amp;gt;kp = 0;&lt;br&gt;        control-&amp;gt;kd = KD_SOFTSTOP;&lt;br&gt;        control-&amp;gt;t_ff += KP_SOFTSTOP*(limit_n - state.p);&lt;br&gt;        return 1;&lt;br&gt;    }&lt;br&gt;    return 0;&lt;br&gt;    &lt;br&gt;    }&lt;/pre&gt;
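&lt;p&gt;The behavior is easy to verify in isolation. The harness below mirrors the routine on a host machine; the struct layouts and the &lt;code&gt;KP_SOFTSTOP&lt;/code&gt;/&lt;code&gt;KD_SOFTSTOP&lt;/code&gt; values are assumptions for illustration, since the real definitions live in the firmware headers:&lt;/p&gt;

```c
/* Host-side mirror of softstop_joint(). The struct layouts and the
 * KP_SOFTSTOP/KD_SOFTSTOP gains are assumptions for illustration;
 * the firmware defines the real values in its headers. */
typedef struct { float p, v, t; } joint_state;
typedef struct { float p_des, v_des, kp, kd, t_ff; } joint_control;

#define KP_SOFTSTOP 100.0f  /* assumed restoring-torque gain */
#define KD_SOFTSTOP 0.4f    /* assumed damping gain */

int softstop_joint(joint_state state, joint_control *control,
                   float limit_p, float limit_n)
{
    if (state.p >= limit_p) {
        /* Past the positive limit: drop position control, damp,
         * and push back proportionally to the overshoot. */
        control->v_des = 0.0f;
        control->kp = 0;
        control->kd = KD_SOFTSTOP;
        control->t_ff += KP_SOFTSTOP * (limit_p - state.p);
        return 1;
    } else if (state.p <= limit_n) {
        control->v_des = 0.0f;
        control->kp = 0;
        control->kd = KD_SOFTSTOP;
        control->t_ff += KP_SOFTSTOP * (limit_n - state.p);
        return 1;
    }
    return 0;  /* within limits: command left untouched */
}
```

&lt;p&gt;With the joint 0.1 rad past &lt;code&gt;limit_p&lt;/code&gt;, the routine returns 1, zeroes &lt;code&gt;kp&lt;/code&gt;, and accumulates a negative restoring torque of roughly &lt;code&gt;KP_SOFTSTOP&lt;/code&gt; × 0.1 into &lt;code&gt;t_ff&lt;/code&gt;.&lt;/p&gt;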
&lt;h4&gt;5.5.6 Data Packing Function&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;PackAll()&lt;/code&gt; function is responsible for packing the control data received from UPboard into the CAN buffer:&lt;/p&gt;
&lt;pre&gt;//l1_control encapsulates the data received from UPboard, i.e., encapsulates spi_command&lt;br&gt;struct joint_control{&lt;br&gt;    float p_des, v_des, kp, kd, t_ff;//desired position, desired velocity, KP, KD, feed-forward torque&lt;br&gt;    };&lt;br&gt;struct leg_control{&lt;br&gt;    joint_control a, h, k;&lt;br&gt;    };&lt;br&gt;void PackAll(){&lt;br&gt;    pack_cmd(&amp;amp;a1_can, l1_control.a); //Leg 1 abad motor&lt;br&gt;    pack_cmd(&amp;amp;a2_can, l2_control.a); //Leg 2 abad motor&lt;br&gt;    pack_cmd(&amp;amp;h1_can, l1_control.h); //Leg 1 hip motor&lt;br&gt;    pack_cmd(&amp;amp;h2_can, l2_control.h); //Leg 2 hip motor&lt;br&gt;    pack_cmd(&amp;amp;k1_can, l1_control.k); //Leg 1 knee motor&lt;br&gt;    pack_cmd(&amp;amp;k2_can, l2_control.k); //Leg 2 knee motor&lt;br&gt;    }&lt;/pre&gt;
&lt;h4&gt;5.5.7 CAN Data Packing Function&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;pack_cmd()&lt;/code&gt; function parses the control information sent from UPboard and packs it into the CAN buffer, ready to be sent to the motor controller:&lt;/p&gt;
&lt;pre&gt;/// CAN Command Packet Structure ///&lt;br&gt;/// 16 bit position command, between -4*pi and 4*pi&lt;br&gt;/// 12 bit velocity command, between -30 and + 30 rad/s&lt;br&gt;/// 12 bit kp, between 0 and 500 N-m/rad&lt;br&gt;/// 12 bit kd, between 0 and 100 N-m*s/rad&lt;br&gt;/// 12 bit feed forward torque, between -18 and 18 N-m&lt;br&gt;/// CAN Packet is 8 8-bit words&lt;br&gt;/// Formatted as follows.  For each quantity, bit 0 is LSB&lt;br&gt;/// 0: [position[15-8]]&lt;br&gt;/// 1: [position[7-0]] &lt;br&gt;/// 2: [velocity[11-4]]&lt;br&gt;/// 3: [velocity[3-0], kp[11-8]]&lt;br&gt;/// 4: [kp[7-0]]&lt;br&gt;/// 5: [kd[11-4]]&lt;br&gt;/// 6: [kd[3-0], torque[11-8]]&lt;br&gt;/// 7: [torque[7-0]]&lt;br&gt;&lt;br&gt;void pack_cmd(CANMessage * msg, joint_control joint){&lt;br&gt;     &lt;br&gt;     /// limit data to be within bounds ///&lt;br&gt;     float p_des = fminf(fmaxf(P_MIN, joint.p_des), P_MAX);                    &lt;br&gt;     float v_des = fminf(fmaxf(V_MIN, joint.v_des), V_MAX);&lt;br&gt;     float kp = fminf(fmaxf(KP_MIN, joint.kp), KP_MAX);&lt;br&gt;     float kd = fminf(fmaxf(KD_MIN, joint.kd), KD_MAX);&lt;br&gt;     float t_ff = fminf(fmaxf(T_MIN, joint.t_ff), T_MAX);&lt;br&gt;     /// convert floats to unsigned ints ///&lt;br&gt;     uint16_t p_int = float_to_uint(p_des, P_MIN, P_MAX, 16);            &lt;br&gt;     uint16_t v_int = float_to_uint(v_des, V_MIN, V_MAX, 12);&lt;br&gt;     uint16_t kp_int = float_to_uint(kp, KP_MIN, KP_MAX, 12);&lt;br&gt;     uint16_t kd_int = float_to_uint(kd, KD_MIN, KD_MAX, 12);&lt;br&gt;     uint16_t t_int = float_to_uint(t_ff, T_MIN, T_MAX, 12);&lt;br&gt;     /// pack ints into the can buffer ///&lt;br&gt;     msg-&amp;gt;data[0] = p_int&amp;gt;&amp;gt;8;                                       &lt;br&gt;     msg-&amp;gt;data[1] = p_int&amp;amp;0xFF;&lt;br&gt;     msg-&amp;gt;data[2] = v_int&amp;gt;&amp;gt;4;&lt;br&gt;     msg-&amp;gt;data[3] = 
((v_int&amp;amp;0xF)&amp;lt;&amp;lt;4)|(kp_int&amp;gt;&amp;gt;8);&lt;br&gt;     msg-&amp;gt;data[4] = kp_int&amp;amp;0xFF;&lt;br&gt;     msg-&amp;gt;data[5] = kd_int&amp;gt;&amp;gt;4;&lt;br&gt;     msg-&amp;gt;data[6] = ((kd_int&amp;amp;0xF)&amp;lt;&amp;lt;4)|(t_int&amp;gt;&amp;gt;8);&lt;br&gt;     msg-&amp;gt;data[7] = t_int&amp;amp;0xff;&lt;br&gt;     }&lt;/pre&gt;
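&lt;p&gt;&lt;code&gt;pack_cmd()&lt;/code&gt; relies on &lt;code&gt;float_to_uint()&lt;/code&gt; to map each clamped float onto a fixed-width integer range. A sketch consistent with the calls above (the firmware's exact implementation may differ):&lt;/p&gt;

```c
#include <stdint.h>

/* Linearly map x in [x_min, x_max] onto an unsigned integer with the
 * given bit width: x_min -> 0, x_max -> 2^bits - 1. Sketch inferred
 * from the calls above; the firmware's implementation may differ. */
uint16_t float_to_uint(float x, float x_min, float x_max, int bits)
{
    float span = x_max - x_min;
    return (uint16_t)((x - x_min) * ((float)((1 << bits) - 1)) / span);
}
```

&lt;p&gt;Because &lt;code&gt;pack_cmd()&lt;/code&gt; clamps every value to its [min, max] range first, the mapping never overflows the target bit width.&lt;/p&gt;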
&lt;h4&gt;5.5.8 CAN Data Transmission Function&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;WriteAll()&lt;/code&gt; function transmits the command frames for all six motors over the two CAN buses, inserting a short delay (20 microseconds, via &lt;code&gt;wait(.00002)&lt;/code&gt;) between writes:&lt;/p&gt;
&lt;pre&gt;void WriteAll(){&lt;br&gt;    //toggle = 1;&lt;br&gt;    can1.write(a1_can);&lt;br&gt;    wait(.00002);&lt;br&gt;    can2.write(a2_can);&lt;br&gt;    wait(.00002);&lt;br&gt;    can1.write(h1_can);&lt;br&gt;    wait(.00002);&lt;br&gt;    can2.write(h2_can);&lt;br&gt;    wait(.00002);&lt;br&gt;    can1.write(k1_can);&lt;br&gt;    wait(.00002);&lt;br&gt;    can2.write(k2_can);&lt;br&gt;    wait(.00002);&lt;br&gt;    //toggle = 0;&lt;br&gt;    }&lt;/pre&gt;
&lt;h3&gt;5.6 SPIne Firmware and PC Serial Communication&lt;/h3&gt;
&lt;p&gt;SPIne communicates with the PC over a serial port for debugging and manual control:&lt;/p&gt;
&lt;pre&gt;void serial_isr(){&lt;br&gt;     /// handle keyboard commands from the serial terminal ///&lt;br&gt;     while(pc.readable()){&lt;br&gt;        char c = pc.getc();&lt;br&gt;        //led = !led;&lt;br&gt;        switch(c){&lt;br&gt;            case(27):&lt;br&gt;                //loop.detach();&lt;br&gt;                printf("\n\r exiting motor mode \n\r");&lt;br&gt;                ExitMotorMode(&amp;amp;a1_can);&lt;br&gt;                ExitMotorMode(&amp;amp;a2_can);&lt;br&gt;                ExitMotorMode(&amp;amp;h1_can);&lt;br&gt;                ExitMotorMode(&amp;amp;h2_can);&lt;br&gt;                ExitMotorMode(&amp;amp;k1_can);&lt;br&gt;                ExitMotorMode(&amp;amp;k2_can);&lt;br&gt;                enabled = 0;&lt;br&gt;                break;&lt;br&gt;            case('m'):&lt;br&gt;                printf("\n\r entering motor mode \n\r");&lt;br&gt;                EnterMotorMode(&amp;amp;a1_can);&lt;br&gt;                EnterMotorMode(&amp;amp;a2_can);&lt;br&gt;                EnterMotorMode(&amp;amp;h1_can);&lt;br&gt;                EnterMotorMode(&amp;amp;h2_can);&lt;br&gt;                EnterMotorMode(&amp;amp;k1_can);&lt;br&gt;                EnterMotorMode(&amp;amp;k2_can);&lt;br&gt;                wait(.5);&lt;br&gt;                enabled = 1;&lt;br&gt;                //loop.attach(&amp;amp;sendCMD, .001);&lt;br&gt;                break;&lt;br&gt;            case('s'):&lt;br&gt;                printf("\n\r standing \n\r");&lt;br&gt;                counter2 = 0;&lt;br&gt;                is_standing = 1;&lt;br&gt;                //stand();&lt;br&gt;                break;&lt;br&gt;            case('z'):&lt;br&gt;                printf("\n\r zeroing \n\r");&lt;br&gt;                Zero(&amp;amp;a1_can);&lt;br&gt;                Zero(&amp;amp;a2_can);&lt;br&gt;                Zero(&amp;amp;h1_can);&lt;br&gt;                Zero(&amp;amp;h2_can);&lt;br&gt;                Zero(&amp;amp;k1_can);&lt;br&gt;                
Zero(&amp;amp;k2_can);&lt;br&gt;                break;&lt;br&gt;            }&lt;br&gt;        }&lt;br&gt;        WriteAll();&lt;br&gt;        &lt;br&gt;    }&lt;/pre&gt;
&lt;h4&gt;5.6.1 Enter Motor Mode Function&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;EnterMotorMode()&lt;/code&gt; function puts a motor controller into operating mode by sending a special command frame (seven 0xFF bytes followed by 0xFC):&lt;/p&gt;
&lt;pre&gt;void EnterMotorMode(CANMessage * msg){&lt;br&gt;    msg-&amp;gt;data[0] = 0xFF;&lt;br&gt;    msg-&amp;gt;data[1] = 0xFF;&lt;br&gt;    msg-&amp;gt;data[2] = 0xFF;&lt;br&gt;    msg-&amp;gt;data[3] = 0xFF;&lt;br&gt;    msg-&amp;gt;data[4] = 0xFF;&lt;br&gt;    msg-&amp;gt;data[5] = 0xFF;&lt;br&gt;    msg-&amp;gt;data[6] = 0xFF;&lt;br&gt;    msg-&amp;gt;data[7] = 0xFC;&lt;br&gt;    //WriteAll();&lt;br&gt;    }&lt;/pre&gt;
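&lt;p&gt;&lt;code&gt;ExitMotorMode()&lt;/code&gt; and &lt;code&gt;Zero()&lt;/code&gt;, used by the serial handler above, follow the same special-frame pattern: seven 0xFF bytes with the final byte selecting the action. The sketch below mocks &lt;code&gt;CANMessage&lt;/code&gt; as a plain struct so it can run off-target; the 0xFD (exit) and 0xFE (zero) tail bytes are assumed to follow the same motor-driver convention as the 0xFC frame above:&lt;/p&gt;

```c
#include <stdint.h>
#include <string.h>

/* Minimal stand-in for mbed's CANMessage so the sketch runs off-target;
 * illustration only. */
typedef struct { uint8_t data[8]; int len; int id; } CANMessage;

/* Special command frames: seven 0xFF bytes plus one tail byte.
 * 0xFC enters motor mode (as in EnterMotorMode above), 0xFD exits it,
 * 0xFE zeroes the encoder position -- assumed per the same
 * motor-driver convention. */
static void special_cmd(CANMessage *msg, uint8_t tail)
{
    memset(msg->data, 0xFF, 7);
    msg->data[7] = tail;
}

void ExitMotorMode(CANMessage *msg) { special_cmd(msg, 0xFD); }
void Zero(CANMessage *msg)          { special_cmd(msg, 0xFE); }
```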
&lt;h3&gt;5.7 SPIne Firmware Main Function Analysis&lt;/h3&gt;
&lt;h4&gt;5.7.1 Main Program Flow&lt;/h4&gt;
&lt;p&gt;The main function is responsible for initializing the system and starting the main loop, continuously processing CAN messages and SPI communication:&lt;/p&gt;
&lt;pre&gt;int main() {&lt;br&gt;    wait(1);&lt;br&gt;    //led = 1;&lt;br&gt;    pc.baud(921600);//Set baud rate&lt;br&gt;    pc.attach(&amp;amp;serial_isr);//Communicate with PC&lt;br&gt;    estop.mode(PullUp);//Emergency stop setup&lt;br&gt;    //spi.format(16, 0);&lt;br&gt;    //spi.frequency(1000000);&lt;br&gt;    //spi.reply(0x0);&lt;br&gt;    //cs.fall(&amp;amp;spi_isr);&lt;br&gt;&lt;br&gt;    //can1.frequency(1000000);                     // set bit rate to 1Mbps&lt;br&gt;    //can1.attach(&amp;amp;rxISR1);                 // attach 'CAN receive-complete' interrupt handler&lt;br&gt;    can1.filter(CAN_ID&amp;lt;&amp;lt;21, 0xFFE00004, CANStandard, 0); //CAN1 filter setup&lt;br&gt;    //can2.frequency(1000000);                     // set bit rate to 1Mbps&lt;br&gt;    //can2.attach(&amp;amp;rxISR2);                 // attach 'CAN receive-complete' interrupt handler&lt;br&gt;    can2.filter(CAN_ID&amp;lt;&amp;lt;21, 0xFFE00004, CANStandard, 0); //CAN2 filter setup&lt;br&gt;    //Zero the buffers&lt;br&gt;    memset(&amp;amp;tx_buff, 0, TX_LEN * sizeof(uint16_t));&lt;br&gt;    memset(&amp;amp;spi_data, 0, sizeof(spi_data_t));&lt;br&gt;    memset(&amp;amp;spi_command,0,sizeof(spi_command_t));&lt;br&gt;    &lt;br&gt;    //Set priority&lt;br&gt;    NVIC_SetPriority(TIM5_IRQn, 1);&lt;br&gt;    //NVIC_SetPriority(CAN1_RX0_IRQn, 3);&lt;br&gt;    //NVIC_SetPriority(CAN2_RX0_IRQn, 3);&lt;br&gt;    &lt;br&gt;    printf("\n\r SPIne\n\r");&lt;br&gt;    //printf("%d\n\r", RX_ID &amp;lt;&amp;lt; 18);&lt;br&gt;    //Transmit data parameters&lt;br&gt;    a1_can.len = 8;                         //transmit 8 bytes&lt;br&gt;    a2_can.len = 8;                         //transmit 8 bytes&lt;br&gt;    h1_can.len = 8;&lt;br&gt;    h2_can.len = 8;&lt;br&gt;    k1_can.len = 8;&lt;br&gt;    k2_can.len = 8;&lt;br&gt;   //Receive data parameters&lt;br&gt;    rxMsg1.len = 6;                          //receive 6 bytes&lt;br&gt;    rxMsg2.len = 6;                     
     //receive 6 bytes&lt;br&gt;   //CAN ID setup&lt;br&gt;    a1_can.id = 0x1;                        &lt;br&gt;    a2_can.id = 0x1;                 &lt;br&gt;    h1_can.id = 0x2;&lt;br&gt;    h2_can.id = 0x2;&lt;br&gt;    k1_can.id = 0x3;&lt;br&gt;    k2_can.id = 0x3;     &lt;br&gt;    //Data buffer assignment&lt;br&gt;    pack_cmd(&amp;amp;a1_can, l1_control.a); &lt;br&gt;    pack_cmd(&amp;amp;a2_can, l2_control.a); &lt;br&gt;    pack_cmd(&amp;amp;h1_can, l1_control.h); &lt;br&gt;    pack_cmd(&amp;amp;h2_can, l2_control.h); &lt;br&gt;    pack_cmd(&amp;amp;k1_can, l1_control.k); &lt;br&gt;    pack_cmd(&amp;amp;k2_can, l2_control.k); &lt;br&gt;   //Transmit&lt;br&gt;    WriteAll();&lt;br&gt;&lt;br&gt;&lt;br&gt;    // SPI doesn't work if enabled while the CS pin is pulled low&lt;br&gt;    // Wait for CS to not be low, then enable SPI&lt;br&gt;    if(!spi_enabled){   //Wait for SPI enable&lt;br&gt;        while((spi_enabled==0) &amp;amp;&amp;amp; (cs.read() ==0)){wait_us(10);}&lt;br&gt;        init_spi();&lt;br&gt;        spi_enabled = 1;&lt;br&gt;        }&lt;br&gt;            &lt;br&gt;    while(1) {//while main loop&lt;br&gt;        counter++;&lt;br&gt;        can2.read(rxMsg2);//Read data sent by motor controller&lt;br&gt;        unpack_reply(rxMsg2, &amp;amp;l2_state);//Data parsing, assign to l2_state&lt;br&gt;        can1.read(rxMsg1);                    // read message into Rx message storage&lt;br&gt;        unpack_reply(rxMsg1, &amp;amp;l1_state);&lt;br&gt;        wait_us(10);&lt;br&gt;&lt;br&gt;        }     &lt;br&gt;    }   &lt;/pre&gt;
&lt;h4&gt;5.7.2 CAN Data Parsing Function&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;unpack_reply()&lt;/code&gt; function is used to parse CAN messages received from the motor controller and assign the data to the corresponding state structure:&lt;/p&gt;
&lt;pre&gt;/// CAN Reply Packet Structure ///&lt;br&gt;/// 16 bit position, between -4*pi and 4*pi&lt;br&gt;/// 12 bit velocity, between -30 and + 30 rad/s&lt;br&gt;/// 12 bit current, between -40 and 40&lt;br&gt;/// CAN Packet is 6 8-bit words (byte 0 carries the motor ID)&lt;br&gt;/// Formatted as follows.  For each quantity, bit 0 is LSB&lt;br&gt;/// 0: [motor id]&lt;br&gt;/// 1: [position[15-8]]&lt;br&gt;/// 2: [position[7-0]]&lt;br&gt;/// 3: [velocity[11-4]]&lt;br&gt;/// 4: [velocity[3-0], current[11-8]]&lt;br&gt;/// 5: [current[7-0]]&lt;br&gt;&lt;br&gt;void unpack_reply(CANMessage msg, leg_state * leg){&lt;br&gt;    /// unpack ints from can buffer ///&lt;br&gt;    uint16_t id = msg.data[0];&lt;br&gt;    uint16_t p_int = (msg.data[1]&amp;lt;&amp;lt;8)|msg.data[2];&lt;br&gt;    uint16_t v_int = (msg.data[3]&amp;lt;&amp;lt;4)|(msg.data[4]&amp;gt;&amp;gt;4);&lt;br&gt;    uint16_t i_int = ((msg.data[4]&amp;amp;0xF)&amp;lt;&amp;lt;8)|msg.data[5];&lt;br&gt;    /// convert uints to floats ///&lt;br&gt;    float p = uint_to_float(p_int, P_MIN, P_MAX, 16);&lt;br&gt;    float v = uint_to_float(v_int, V_MIN, V_MAX, 12);&lt;br&gt;    float t = uint_to_float(i_int, -T_MAX, T_MAX, 12);&lt;br&gt;    &lt;br&gt;    if(id==1){&lt;br&gt;        leg-&amp;gt;a.p = p;&lt;br&gt;        leg-&amp;gt;a.v = v;&lt;br&gt;        leg-&amp;gt;a.t = t;&lt;br&gt;        }&lt;br&gt;    else if(id==2){&lt;br&gt;        leg-&amp;gt;h.p = p;&lt;br&gt;        leg-&amp;gt;h.v = v;&lt;br&gt;        leg-&amp;gt;h.t = t;&lt;br&gt;        }&lt;br&gt;    else if(id==3){&lt;br&gt;        leg-&amp;gt;k.p = p;&lt;br&gt;        leg-&amp;gt;k.v = v;&lt;br&gt;        leg-&amp;gt;k.t = t;&lt;br&gt;        }&lt;br&gt;    }&lt;/pre&gt;
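&lt;p&gt;&lt;code&gt;uint_to_float()&lt;/code&gt; is the inverse of the &lt;code&gt;float_to_uint()&lt;/code&gt; mapping used on the transmit side. A sketch consistent with the calls above (the firmware's exact implementation may differ):&lt;/p&gt;

```c
#include <stdint.h>

/* Inverse of the transmit-side float_to_uint(): recover a float in
 * [x_min, x_max] from an unsigned integer of the given bit width.
 * Sketch inferred from the calls above. */
float uint_to_float(uint16_t x_int, float x_min, float x_max, int bits)
{
    float span = x_max - x_min;
    return ((float)x_int) * span / ((float)((1 << bits) - 1)) + x_min;
}
```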
&lt;h2&gt;6. Summary&lt;/h2&gt;
&lt;p&gt;This article systematically introduces the technical architecture and implementation details of the MIT Cheetah robot, covering system architecture, simulation environment configuration, hardware platform selection, and software environment setup. It focuses on the working principles and code implementation of the SPIne data communication conversion board, including SPI communication, CAN bus communication, and the associated data packing and parsing mechanisms.&lt;/p&gt;
&lt;p&gt;Through the detailed explanations in this article, developers can:&lt;/p&gt;
&lt;p&gt;▪ Understand the overall architecture and communication mechanisms of the MIT Cheetah system&lt;/p&gt;
&lt;p&gt;▪ Complete the full configuration process from simulation environment to real robot deployment&lt;/p&gt;
&lt;p&gt;▪ Master the core functions and implementation principles of the SPIne firmware&lt;/p&gt;
&lt;p&gt;▪ Perform secondary development and customization based on existing solutions&lt;/p&gt;
&lt;p&gt;We hope this article provides a valuable reference for developers in this field and contributes to the further development and application of quadruped robot technology.&lt;/p&gt;
&lt;p&gt;—&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Some content in this article references &lt;a href="https://zhuanlan.zhihu.com/p/645386248" rel="noopener noreferrer"&gt;Chen Bu Chen’s Zhihu article&lt;/a&gt;, which provides an in-depth and excellent analysis of the technical details of the MIT Cheetah system. Special thanks are extended here.&lt;/p&gt;



&lt;p&gt;The post &lt;a href="https://frankfu.blog/openai/nvidia-jetson-orin-nano-super-developer-kits-mit-cheetah-robot-technical-deep-analysis/" rel="noopener noreferrer"&gt;NVIDIA Jetson Orin Nano Super Developer Kits – Build MIT Mini Cheetah Robot&lt;/a&gt; appeared first on &lt;a href="https://frankfu.blog" rel="noopener noreferrer"&gt;Frank Fu's Blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>openai</category>
    </item>
    <item>
      <title>Building Real-time Voice Conversations with ElevenLabs WebSocket API: A Complete Development Guide</title>
      <dc:creator>Frank Fu</dc:creator>
      <pubDate>Mon, 30 Mar 2026 08:53:22 +0000</pubDate>
      <link>https://dev.to/frankfu/building-real-time-voice-conversations-with-elevenlabs-websocket-api-a-complete-development-guide-52aj</link>
      <guid>https://dev.to/frankfu/building-real-time-voice-conversations-with-elevenlabs-websocket-api-a-complete-development-guide-52aj</guid>
      <description>&lt;p&gt;Recently, I’ve been researching real-time voice conversation implementations and discovered that ElevenLabs Agents Platform provides a very powerful WebSocket API. After some exploration, I completed a real-time voice conversation demo that can run directly in the browser. Today, I’ll share the implementation details and usage experience of this project.&lt;/p&gt;
&lt;h2&gt;1. &lt;strong&gt;Why Choose ElevenLabs?&lt;/strong&gt;
&lt;/h2&gt;
&lt;p&gt;Before we begin, you might be wondering why I chose ElevenLabs over other solutions. I compared ElevenLabs with the OpenAI Realtime API and found that ElevenLabs has unique advantages in voice selection, model flexibility, and other areas; I’ll elaborate on this comparison later in the article.&lt;/p&gt;
&lt;h2&gt;2. &lt;strong&gt;Project Overview&lt;/strong&gt;
&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Demo link:&lt;/strong&gt; &lt;a href="https://demo.navtalk.ai/11labs/en/index.html" rel="noreferrer noopener"&gt;&lt;strong&gt;https://demo.navtalk.ai/11labs/en/index.html&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This demo is implemented based on the ElevenLabs Agents Platform WebSocket API and supports:&lt;/p&gt;
&lt;p&gt;✅ Complete WebSocket connection management&lt;/p&gt;
&lt;p&gt;✅ Real-time voice input and output&lt;/p&gt;
&lt;p&gt;✅ Text message support&lt;/p&gt;
&lt;p&gt;✅ Rich custom configuration options&lt;/p&gt;
&lt;p&gt;✅ Complete message handling mechanism&lt;/p&gt;
&lt;p&gt;The entire project can run directly in the browser without a backend server, making it perfect for rapid prototyping and learning.&lt;/p&gt;
&lt;h2&gt;3. &lt;strong&gt;Core Features&lt;/strong&gt;&lt;/h2&gt;
&lt;h3&gt;3.1 Complete WebSocket Connection&lt;/h3&gt;
&lt;p&gt;The project implements complete WebSocket connection management, including:&lt;/p&gt;
&lt;p&gt;▪ Automatic signed-URL retrieval&lt;/p&gt;
&lt;p&gt;▪ Secure WSS connection establishment&lt;/p&gt;
&lt;p&gt;▪ Comprehensive connection status and error handling&lt;/p&gt;
&lt;h3&gt;3.2 Real-time Voice Conversation&lt;/h3&gt;
&lt;p&gt;Voice processing is the core functionality, including:&lt;/p&gt;
&lt;p&gt;▪ Microphone audio capture&lt;/p&gt;
&lt;p&gt;▪ 16kHz PCM audio encoding&lt;/p&gt;
&lt;p&gt;▪ Real-time audio stream transmission&lt;/p&gt;
&lt;p&gt;▪ Agent audio playback&lt;/p&gt;
&lt;h3&gt;3.3 Complete Message Handling&lt;/h3&gt;
&lt;p&gt;Supports all message types provided by ElevenLabs:&lt;/p&gt;
&lt;p&gt;▪ &lt;code&gt;conversation_initiation_metadata&lt;/code&gt; – Session initialization&lt;/p&gt;
&lt;p&gt;▪ &lt;code&gt;user_transcript&lt;/code&gt; – User speech-to-text&lt;/p&gt;
&lt;p&gt;▪ &lt;code&gt;agent_response&lt;/code&gt; – Agent text response&lt;/p&gt;
&lt;p&gt;▪ &lt;code&gt;agent_response_correction&lt;/code&gt; – Agent response correction&lt;/p&gt;
&lt;p&gt;▪ &lt;code&gt;audio&lt;/code&gt; – Agent audio response&lt;/p&gt;
&lt;p&gt;▪ &lt;code&gt;interruption&lt;/code&gt; – Interruption detection&lt;/p&gt;
&lt;p&gt;▪ &lt;code&gt;ping&lt;/code&gt;/&lt;code&gt;pong&lt;/code&gt; – Heartbeat detection&lt;/p&gt;
&lt;p&gt;▪ &lt;code&gt;client_tool_call&lt;/code&gt; – Tool call support&lt;/p&gt;
&lt;p&gt;▪ &lt;code&gt;contextual_update&lt;/code&gt; – Context update&lt;/p&gt;
&lt;p&gt;▪ &lt;code&gt;vad_score&lt;/code&gt; – Voice activity detection score&lt;/p&gt;
&lt;h3&gt;3.4 Text Message Support&lt;/h3&gt;
&lt;p&gt;In addition to voice input, it also supports sending text messages to the Agent, with a very practical feature: &lt;strong&gt;text messages can interrupt the Agent’s ongoing voice response&lt;/strong&gt;, making conversations more natural.&lt;/p&gt;
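&lt;p&gt;A minimal sketch of sending such a text message: the &lt;code&gt;user_message&lt;/code&gt; event type is taken from the ElevenLabs Agents WebSocket reference and should be confirmed against the current docs.&lt;/p&gt;

```javascript
// Build a text message to send mid-conversation. Sending it while the
// Agent is speaking interrupts the ongoing voice response, as described
// above. The "user_message" shape is an assumption from the ElevenLabs
// Agents WebSocket reference.
function buildUserMessage(text) {
  return { type: "user_message", text };
}

// Usage (sketch): ws.send(JSON.stringify(buildUserMessage("Tell me more")));
```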
&lt;p&gt;&lt;strong&gt;3.5 Custom Configuration&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Provides rich configuration options:&lt;/p&gt;
&lt;p&gt;▪ Custom Agent prompt&lt;/p&gt;
&lt;p&gt;▪ Custom first message&lt;/p&gt;
&lt;p&gt;▪ Language override&lt;/p&gt;
&lt;p&gt;▪ TTS voice ID override&lt;/p&gt;
&lt;p&gt;▪ Dynamic variable support&lt;/p&gt;
&lt;p&gt;▪ Custom LLM parameters (temperature / max_tokens)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. Detailed Usage Instructions&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4.1 Prepare Configuration&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4.1.1 Open File&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Simply open the link &lt;a href="https://demo.navtalk.ai/11labs/en/index.html" rel="noreferrer noopener"&gt;&lt;strong&gt;https://demo.navtalk.ai/11labs/en/index.html&lt;/strong&gt;&lt;/a&gt; in your browser to get started.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4.1.2 Required Configuration Items&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;API Key (xi-api-key)&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ ElevenLabs API Key&lt;/p&gt;
&lt;p&gt;▪ Format: &lt;code&gt;sk-…&lt;/code&gt; or &lt;code&gt;xi-api-key&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;▪ How to obtain: log in to the ElevenLabs Console (&lt;a href="https://elevenlabs.io/app/settings/api-keys" rel="noopener noreferrer"&gt;https://elevenlabs.io/app/settings/api-keys&lt;/a&gt;) and create or view an API Key&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agent ID&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ ElevenLabs Agent ID&lt;/p&gt;
&lt;p&gt;▪ Format: &lt;code&gt;agent_…&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;▪ How to obtain: create or view an Agent on the ElevenLabs Agents page (&lt;a href="https://elevenlabs.io/app/agents" rel="noopener noreferrer"&gt;https://elevenlabs.io/app/agents&lt;/a&gt;), then copy the Agent ID&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4.1.3 Optional Configuration Items (in interface order)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Custom Prompt&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ Overrides the Agent’s default prompt&lt;/p&gt;
&lt;p&gt;▪ Leave empty to use the default prompt from the Agent configuration&lt;/p&gt;
&lt;p&gt;▪ Useful for temporarily modifying the Agent’s behavior and conversation style&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;First Message&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ The first sentence the Agent says after connecting&lt;/p&gt;
&lt;p&gt;▪ Leave empty to use the default first message from the Agent configuration&lt;/p&gt;
&lt;p&gt;▪ Example: “Hello, I’m your AI assistant. How can I help you?”&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Language&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ Overrides the Agent’s default language setting&lt;/p&gt;
&lt;p&gt;▪ Supported language codes: &lt;code&gt;en&lt;/code&gt; (English), &lt;code&gt;zh&lt;/code&gt; (Chinese), &lt;code&gt;es&lt;/code&gt; (Spanish), &lt;code&gt;fr&lt;/code&gt; (French), &lt;code&gt;de&lt;/code&gt; (German), &lt;code&gt;ja&lt;/code&gt; (Japanese), etc.&lt;/p&gt;
&lt;p&gt;▪ Leave empty to use the default language from the Agent configuration&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;TTS Voice&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ Overrides the Agent’s default voice setting&lt;/p&gt;
&lt;p&gt;▪ Select a different voice ID from the dropdown menu&lt;/p&gt;
&lt;p&gt;▪ Leave empty to use the default voice from the Agent configuration&lt;/p&gt;
&lt;p&gt;▪ Note: fill in the API Key first so the voice list can be loaded&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dynamic Variables&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ Used to dynamically replace variable placeholders in the prompt during the conversation&lt;/p&gt;
&lt;p&gt;▪ Format: a JSON object, for example &lt;code&gt;{"user_name": "John", "greeting": "Hello"}&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;▪ Use case: when the Agent’s prompt contains variables (such as &lt;code&gt;{{user_name}}&lt;/code&gt;, &lt;code&gt;{{greeting}}&lt;/code&gt;), you can pass in the actual values through dynamic variables&lt;/p&gt;
&lt;p&gt;▪ Example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "user_name": "John",
  "company": "ABC Company",
  "product": "Smart Assistant"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;▪ If the Agent’s prompt contains &lt;code&gt;Hello, {{user_name}}, welcome to use {{product}}&lt;/code&gt;, the dynamic variables will automatically expand it to &lt;code&gt;Hello, John, welcome to use Smart Assistant&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;▪ Leave empty to skip dynamic variables&lt;/p&gt;
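&lt;p&gt;The substitution semantics can be illustrated in a few lines. The actual replacement happens on the ElevenLabs side; this is only a local illustration of how &lt;code&gt;{{variable}}&lt;/code&gt; placeholders expand.&lt;/p&gt;

```javascript
// Local illustration of {{variable}} substitution in a prompt.
// Unknown placeholders are left untouched.
function substitute(template, vars) {
  return template.replace(/\{\{(\w+)\}\}/g, (match, name) =>
    name in vars ? String(vars[name]) : match
  );
}
```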
&lt;p&gt;&lt;strong&gt;LLM Temperature&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ Controls the randomness and creativity of LLM text generation&lt;/p&gt;
&lt;p&gt;▪ Value range: 0.0 – 2.0&lt;/p&gt;
&lt;p&gt;▪ Lower values produce more deterministic, consistent output (more conservative); higher values produce more random, creative output (more flexible)&lt;/p&gt;
&lt;p&gt;▪ Recommended value: 0.7 – 1.0 (balances creativity and consistency)&lt;/p&gt;
&lt;p&gt;▪ Leave empty to use the default value from the Agent configuration&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LLM Max Tokens&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ Limits the maximum number of tokens in a single LLM response&lt;/p&gt;
&lt;p&gt;▪ Value range: positive integers&lt;/p&gt;
&lt;p&gt;▪ Used to control response length and avoid overly long replies&lt;/p&gt;
&lt;p&gt;▪ Leave empty to use the default value from the Agent configuration&lt;/p&gt;
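&lt;p&gt;Pulling the optional items above together, the initialization payload might be assembled as below. The field names (&lt;code&gt;conversation_config_override&lt;/code&gt;, &lt;code&gt;dynamic_variables&lt;/code&gt;, &lt;code&gt;custom_llm_extra_body&lt;/code&gt;) follow the ElevenLabs Agents WebSocket reference and should be confirmed against the current docs; this is a sketch, not a definitive implementation.&lt;/p&gt;

```javascript
// Sketch of conversation_initiation_client_data built from the optional
// configuration items: prompt, first message, language, TTS voice,
// dynamic variables, and LLM temperature / max_tokens. Empty fields are
// simply omitted so the Agent's defaults apply.
function buildInitData({ prompt, firstMessage, language, voiceId,
                         dynamicVariables, temperature, maxTokens } = {}) {
  const data = { type: "conversation_initiation_client_data" };
  const agent = {};
  if (prompt) agent.prompt = { prompt };
  if (firstMessage) agent.first_message = firstMessage;
  if (language) agent.language = language;
  const override = {};
  if (Object.keys(agent).length) override.agent = agent;
  if (voiceId) override.tts = { voice_id: voiceId };
  if (Object.keys(override).length) data.conversation_config_override = override;
  if (dynamicVariables) data.dynamic_variables = dynamicVariables;
  const llm = {};
  if (temperature !== undefined) llm.temperature = temperature;
  if (maxTokens !== undefined) llm.max_tokens = maxTokens;
  if (Object.keys(llm).length) data.custom_llm_extra_body = llm;
  return data;
}
```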
&lt;p&gt;&lt;strong&gt;4.2 Start Conversation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;1. Click the &lt;strong&gt;“Connect and Start Conversation”&lt;/strong&gt; button&lt;/p&gt;
&lt;p&gt;2. The browser will request microphone permission; allow it&lt;/p&gt;
&lt;p&gt;3. Recording starts automatically after a successful connection&lt;/p&gt;
&lt;p&gt;4. Start speaking, and the Agent will respond in real time&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4.3 Function Operations&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Stop Recording&lt;/strong&gt;: stop sending audio but keep the connection open&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Disconnect&lt;/strong&gt;: close the WebSocket connection completely&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Text Message&lt;/strong&gt;: type a message in the text input box and send it&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. API Documentation Reference&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The demo implementation is based on the ElevenLabs Agents Platform WebSocket API (&lt;a href="https://elevenlabs.io/docs/agents-platform/api-reference/agents-platform/websocket" rel="noopener noreferrer"&gt;https://elevenlabs.io/docs/agents-platform/api-reference/agents-platform/websocket&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5.1 WebSocket Endpoint&lt;/strong&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;wss://api.elevenlabs.io/v1/convai/conversation?agent_id={agent_id}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;5.2 Complete Call Flow&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5.2.1 Connection Establishment Phase&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Establish WebSocket Connection&lt;/strong&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Client → Server: Establish WebSocket connection
wss://api.elevenlabs.io/v1/convai/conversation?agent_id={agent_id}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Send Initialization Data&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;▪ Immediately after a successful connection, send the &lt;code&gt;conversation_initiation_client_data&lt;/code&gt; message&lt;/p&gt;
&lt;p&gt;▪ It contains Agent configuration overrides (optional), dynamic variables (optional), and custom LLM parameters (optional)&lt;/p&gt;
&lt;p&gt;▪ Wait for the server to return the &lt;code&gt;conversation_initiation_metadata&lt;/code&gt; event&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 3: Receive Session Metadata&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;▪ The server returns the &lt;code&gt;conversation_initiation_metadata&lt;/code&gt; event&lt;/p&gt;
&lt;p&gt;▪ Content to handle:&lt;/p&gt;
&lt;p&gt;      – Save the &lt;code&gt;conversation_id&lt;/code&gt; (for subsequent session management)&lt;/p&gt;
&lt;p&gt;      – Record the audio format information (&lt;code&gt;agent_output_audio_format&lt;/code&gt;, &lt;code&gt;user_input_audio_format&lt;/code&gt;)&lt;/p&gt;
&lt;p&gt;      – Start audio capture (call &lt;code&gt;getUserMedia&lt;/code&gt; to request microphone permission)&lt;/p&gt;
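&lt;p&gt;A small helper for the metadata step: the audio format fields arrive as strings such as &lt;code&gt;pcm_16000&lt;/code&gt; (codec plus sample rate). That string shape is an assumption based on the ElevenLabs docs; verify it before relying on it.&lt;/p&gt;

```javascript
// Parse a format string like "pcm_16000" into { codec, sampleRate },
// as delivered in agent_output_audio_format / user_input_audio_format.
// The "codec_sampleRate" layout is an assumption to verify.
function parseAudioFormat(fmt) {
  const idx = fmt.lastIndexOf("_");
  return { codec: fmt.slice(0, idx), sampleRate: Number(fmt.slice(idx + 1)) };
}
```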
&lt;p&gt;&lt;strong&gt;5.2.2 Conversation Phase&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audio Input Flow&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;User speaks → Microphone capture → Audio processing (downsample to 16kHz) → Convert to 16-bit PCM → Base64 encode → Send user_audio_chunk&lt;/p&gt;
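&lt;p&gt;The pipeline above can be sketched as a single encoding function: decimate to 16 kHz, scale to 16-bit PCM, then base64-encode. Node’s &lt;code&gt;Buffer&lt;/code&gt; is used for the base64 step; in the browser you would base64-encode the bytes with &lt;code&gt;btoa&lt;/code&gt; instead. Naive decimation stands in for a proper low-pass resampler here.&lt;/p&gt;

```javascript
// Downsample Float32 microphone samples to 16 kHz by decimation,
// convert to 16-bit signed PCM, and base64-encode the result.
function floatTo16kPcmBase64(samples, inputRate) {
  const ratio = inputRate / 16000;
  const out = new Int16Array(Math.floor(samples.length / ratio));
  for (let i = 0; i < out.length; i++) {
    // Clamp to [-1, 1], then scale to the 16-bit signed range.
    const s = Math.max(-1, Math.min(1, samples[Math.floor(i * ratio)]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return Buffer.from(out.buffer, out.byteOffset, out.byteLength).toString("base64");
}

// The chunk message then looks like (sketch):
// ws.send(JSON.stringify({ user_audio_chunk: floatTo16kPcmBase64(chunk, 48000) }));
```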

&lt;p&gt;&lt;strong&gt;Server Response Flow&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Server receives audio → Speech recognition (ASR) → Send user_transcript → LLM processing → Generate response → Send agent_response → TTS synthesis → Send audio chunks&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Event Handling Sequence&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;1. &lt;strong&gt;When user speaks&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;   ▪ Continuously send &lt;code&gt;user_audio_chunk&lt;/code&gt; messages (one every 4096 samples)&lt;/p&gt;
&lt;p&gt;   ▪ The server processes the audio stream and may return &lt;code&gt;vad_score&lt;/code&gt; (voice activity detection score)&lt;/p&gt;
&lt;p&gt;2. &lt;strong&gt;Server recognizes user speech&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;  ▪ Receive the &lt;code&gt;user_transcript&lt;/code&gt; event&lt;/p&gt;
&lt;p&gt;  ▪ The user’s words can be displayed in the UI (useful for debugging)&lt;/p&gt;
&lt;p&gt;3. &lt;strong&gt;Server generates response&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;  ▪ Receive the &lt;code&gt;agent_response&lt;/code&gt; event&lt;/p&gt;
&lt;p&gt;  ▪ The Agent’s text response can be shown in the UI&lt;/p&gt;
&lt;p&gt;  ▪ An &lt;code&gt;agent_response_correction&lt;/code&gt; event may follow if the Agent corrects its response&lt;/p&gt;
&lt;p&gt;4. &lt;strong&gt;Server sends audio&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;   ▪ Receive &lt;code&gt;audio&lt;/code&gt; events (typically many, streamed)&lt;/p&gt;
&lt;p&gt;   ▪ Processing method:&lt;/p&gt;
&lt;p&gt;         – Decode Base64 audio data&lt;/p&gt;
&lt;p&gt;         – Add to audio playback queue&lt;/p&gt;
&lt;p&gt;         – Play audio chunks in order&lt;/p&gt;
&lt;p&gt;5. &lt;strong&gt;Interruption handling&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;   ▪ If the user sends a new message while the Agent is speaking, the client may receive an &lt;code&gt;interruption&lt;/code&gt; event&lt;/p&gt;
&lt;p&gt;   ▪ The client must immediately stop the current audio playback and clear the audio queue&lt;/p&gt;
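&lt;p&gt;The event-handling sequence above can be sketched as a small dispatch table. This is an illustrative sketch, not SDK code: the event names come from the protocol described here, while the action labels are invented for the example.&lt;/p&gt;

```javascript
// Map each server event type (from the protocol above) to the client-side
// action it requires. The action names are illustrative only.
const SERVER_EVENT_ACTIONS = {
  user_transcript: 'display_user_text',          // optional UI update
  agent_response: 'display_agent_text',          // optional UI update
  agent_response_correction: 'display_correction',
  audio: 'decode_and_enqueue_audio',             // Base64-decode, queue, play in order
  interruption: 'stop_playback_and_clear_queue', // must happen immediately
  ping: 'send_pong',                             // must echo the same event_id
  client_tool_call: 'execute_tool_and_reply',
  vad_score: 'update_vad_meter',                 // optional visualization
};

// Parse a raw WebSocket message and look up the required action.
function routeServerEvent(rawMessage) {
  const event = JSON.parse(rawMessage);
  return SERVER_EVENT_ACTIONS[event.type] ?? 'ignore';
}
```

In a real client, each action label would map to a handler function; keeping the table in one place makes it easy to see which events are mandatory to handle and which are optional.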
&lt;p&gt;&lt;strong&gt;5.2.3 Heartbeat Maintenance Phase&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Heartbeat Mechanism&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ The server periodically sends a &lt;code&gt;ping&lt;/code&gt; event&lt;/p&gt;
&lt;p&gt;▪ The client must immediately reply with a &lt;code&gt;pong&lt;/code&gt; message containing the same &lt;code&gt;event_id&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;▪ This keeps the connection alive and lets both sides detect connection status&lt;/p&gt;
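&lt;p&gt;A minimal pong reply might look like the following. Only the requirement to echo the same &lt;code&gt;event_id&lt;/code&gt; comes from the flow above; the exact JSON nesting of the incoming ping may differ, so verify it against the official event schema.&lt;/p&gt;

```javascript
// Build the pong reply for a received ping. Per the protocol above, the
// pong must carry the same event_id as the ping it answers.
function makePong(eventId) {
  return JSON.stringify({ type: 'pong', event_id: eventId });
}

// Usage sketch (the ping payload shape here is an assumption):
// ws.addEventListener('message', (m) => {
//   const e = JSON.parse(m.data);
//   if (e.type === 'ping') ws.send(makePong(e.ping_event?.event_id ?? e.event_id));
// });
```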
&lt;p&gt;&lt;strong&gt;5.2.4 Tool Call Flow (if enabled)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tool Call Steps&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;1. The server sends a &lt;code&gt;client_tool_call&lt;/code&gt; event&lt;/p&gt;
&lt;p&gt;2. Processing flow:&lt;/p&gt;
&lt;p&gt;   ▪ Parse the tool call information (&lt;code&gt;tool_name&lt;/code&gt;, &lt;code&gt;parameters&lt;/code&gt;, &lt;code&gt;tool_call_id&lt;/code&gt;)&lt;/p&gt;
&lt;p&gt;   ▪ Execute the corresponding tool/function&lt;/p&gt;
&lt;p&gt;   ▪ Send &lt;code&gt;client_tool_result&lt;/code&gt; to return the result&lt;/p&gt;
&lt;p&gt;3. The server continues processing and may send new &lt;code&gt;agent_response&lt;/code&gt; and &lt;code&gt;audio&lt;/code&gt; events&lt;/p&gt;
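&lt;p&gt;The three steps above can be sketched as one handler. The &lt;code&gt;tool_name&lt;/code&gt;, &lt;code&gt;parameters&lt;/code&gt;, and &lt;code&gt;tool_call_id&lt;/code&gt; fields are from the flow described here; the &lt;code&gt;result&lt;/code&gt; and &lt;code&gt;is_error&lt;/code&gt; fields in the reply are assumptions to illustrate the shape and should be checked against the official schema.&lt;/p&gt;

```javascript
// Sketch: parse the tool call, run the matching local function, and send
// back a client_tool_result tied to the same tool_call_id.
function handleClientToolCall(event, tools, send) {
  const { tool_name, parameters, tool_call_id } = event;
  const tool = tools[tool_name];
  let result;
  let isError = false;
  try {
    if (!tool) throw new Error(`unknown tool: ${tool_name}`);
    result = tool(parameters);
  } catch (err) {
    result = String(err); // report the failure instead of crashing the session
    isError = true;
  }
  send(JSON.stringify({
    type: 'client_tool_result',
    tool_call_id,            // lets the server match the result to its call
    result,                  // assumed field name
    is_error: isError,       // assumed field name
  }));
}
```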
&lt;p&gt;&lt;strong&gt;5.2.5 Context Update Flow (if enabled)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context Update&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ The client can proactively send &lt;code&gt;contextual_update&lt;/code&gt; to update the conversation context&lt;/p&gt;
&lt;p&gt;▪ The server may also send a &lt;code&gt;contextual_update&lt;/code&gt; event&lt;/p&gt;
&lt;p&gt;▪ Handle context updates according to business requirements&lt;/p&gt;
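&lt;p&gt;A hypothetical helper for the client-initiated side of this flow. Only the message type comes from the flow above; the &lt;code&gt;text&lt;/code&gt; field name is an assumption and should be checked against the official schema.&lt;/p&gt;

```javascript
// Build a contextual_update message to push extra context to the agent
// mid-conversation. The `text` field name is an assumption.
function makeContextualUpdate(text) {
  return JSON.stringify({ type: 'contextual_update', text });
}
```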
&lt;p&gt;&lt;strong&gt;5.2.6 Text Message Flow&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Send Text Message&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ The client sends a &lt;code&gt;user_message&lt;/code&gt; event&lt;/p&gt;
&lt;p&gt;▪ Feature: this can interrupt the Agent’s ongoing audio response (a feature unique to ElevenLabs)&lt;/p&gt;
&lt;p&gt;▪ Processing method:&lt;/p&gt;
&lt;p&gt;      – If the Agent is playing audio, immediately stop playback (receive `interruption` event)&lt;/p&gt;
&lt;p&gt;      – Wait for server to process text message and return new response&lt;/p&gt;
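&lt;p&gt;A small sketch of this flow: sending a text message may trigger an &lt;code&gt;interruption&lt;/code&gt; event, at which point the local playback queue must be cleared immediately. The &lt;code&gt;text&lt;/code&gt; field name is an assumption; the queue is a toy model of the playback queue described earlier.&lt;/p&gt;

```javascript
// Build a user_message (the `text` field name is an assumption).
function makeUserMessage(text) {
  return JSON.stringify({ type: 'user_message', text });
}

// Toy model of the client playback queue: audio chunks accumulate until
// an interruption event forces the queue to be dropped.
class PlaybackQueue {
  constructor() { this.chunks = []; }
  enqueue(chunk) { this.chunks.push(chunk); }
  onInterruption() { this.chunks.length = 0; } // stop playback, drop pending audio
  get pending() { return this.chunks.length; }
}
```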
&lt;p&gt;&lt;strong&gt;5.2.7 Connection Close Phase&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Normal Close&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ Stop sending audio (call &lt;code&gt;stopRecording&lt;/code&gt;)&lt;/p&gt;
&lt;p&gt;▪ Close the WebSocket connection&lt;/p&gt;
&lt;p&gt;▪ Release audio resources (close the AudioContext, stop the MediaStream)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Exception Handling&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ Listen for WebSocket &lt;code&gt;error&lt;/code&gt; and &lt;code&gt;close&lt;/code&gt; events&lt;/p&gt;
&lt;p&gt;▪ Implement reconnection logic (optional)&lt;/p&gt;
&lt;p&gt;▪ Clean up all resources&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5.3 Detailed Event Handling&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5.3.1 Events Client Needs to Handle&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Event Type&lt;/th&gt;
&lt;th&gt;When Received&lt;/th&gt;
&lt;th&gt;Required Handling&lt;/th&gt;
&lt;th&gt;Optional Operations&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;conversation_initiation_metadata&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;After connection established&lt;/td&gt;
&lt;td&gt;Save conversation_id, start recording&lt;/td&gt;
&lt;td&gt;Display session information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user_transcript&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;After user speaks&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;Display what user said&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;agent_response&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;After Agent generates response&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;Display Agent text response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;agent_response_correction&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When Agent corrects response&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;Display correction information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;audio&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;After Agent audio synthesis&lt;/td&gt;
&lt;td&gt;Decode and play audio&lt;/td&gt;
&lt;td&gt;Display playback status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;interruption&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When interruption detected&lt;/td&gt;
&lt;td&gt;Stop playback, clear queue&lt;/td&gt;
&lt;td&gt;Display interruption prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ping&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Server heartbeat detection&lt;/td&gt;
&lt;td&gt;Immediately send &lt;code&gt;pong&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;client_tool_call&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When Agent needs to call tool&lt;/td&gt;
&lt;td&gt;Execute tool and return result&lt;/td&gt;
&lt;td&gt;Display tool call information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vad_score&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;During voice activity detection&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;Visualize voice activity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;5.3.2 When Client Sends Messages&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Message Type&lt;/th&gt;
&lt;th&gt;Send Timing&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;conversation_initiation_client_data&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Immediately after connection established&lt;/td&gt;
&lt;td&gt;Once&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user_audio_chunk&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Continuously during recording&lt;/td&gt;
&lt;td&gt;High frequency (approximately every 250ms)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user_message&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When user inputs text&lt;/td&gt;
&lt;td&gt;On demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user_activity&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When need to notify user activity&lt;/td&gt;
&lt;td&gt;On demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pong&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Immediately respond when receive &lt;code&gt;ping&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;On demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;client_tool_result&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;After tool execution completed&lt;/td&gt;
&lt;td&gt;On demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;contextual_update&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When need to update context&lt;/td&gt;
&lt;td&gt;On demand&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;6. Audio Format Requirements&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;ElevenLabs has clear requirements for audio format:&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Sample Rate&lt;/strong&gt;: 16kHz&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Channels&lt;/strong&gt;: Mono&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Encoding&lt;/strong&gt;: 16-bit PCM&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Format&lt;/strong&gt;: Base64-encoded binary data&lt;/p&gt;
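&lt;p&gt;Producing this format from Web Audio samples can be sketched as follows. Node’s &lt;code&gt;Buffer&lt;/code&gt; is used for the Base64 step; in a browser you would build a binary string and call &lt;code&gt;btoa&lt;/code&gt; instead.&lt;/p&gt;

```javascript
// Convert Float32 samples (range [-1, 1]) to 16-bit little-endian mono
// PCM and Base64-encode the result, matching the format listed above.
function floatTo16BitPcmBase64(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i])); // clamp to [-1, 1]
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;               // scale to int16 range
  }
  return Buffer.from(pcm.buffer).toString('base64');        // browser: btoa instead
}
```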
&lt;p&gt;&lt;strong&gt;7. Technical Implementation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;7.1 Audio Processing Flow&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;1. &lt;strong&gt;Capture&lt;/strong&gt;: Use the &lt;code&gt;getUserMedia&lt;/code&gt; API to obtain the microphone audio stream&lt;/p&gt;
&lt;p&gt;2. &lt;strong&gt;Process&lt;/strong&gt;: Use &lt;code&gt;AudioContext&lt;/code&gt; and &lt;code&gt;ScriptProcessorNode&lt;/code&gt; (deprecated in favor of &lt;code&gt;AudioWorklet&lt;/code&gt;) to process the audio&lt;/p&gt;
&lt;p&gt;3. &lt;strong&gt;Downsample&lt;/strong&gt;: If the capture sample rate is not 16kHz, downsample automatically&lt;/p&gt;
&lt;p&gt;4. &lt;strong&gt;Encode&lt;/strong&gt;: Convert the Float32 audio data to 16-bit PCM&lt;/p&gt;
&lt;p&gt;5. &lt;strong&gt;Send&lt;/strong&gt;: Base64-encode the PCM data and send it over the WebSocket&lt;/p&gt;
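&lt;p&gt;Step 3 (downsampling) can be sketched as naive decimation: pick every ratio-th sample. This is illustrative only; a real implementation should low-pass filter before decimating to avoid aliasing.&lt;/p&gt;

```javascript
// Naively downsample Float32 audio from the capture rate (e.g. 48000 Hz,
// a common browser default) to the required 16000 Hz.
function downsampleTo16k(input, inputSampleRate) {
  const target = 16000;
  if (inputSampleRate === target) return input;  // already at the target rate
  const ratio = inputSampleRate / target;
  const out = new Float32Array(Math.floor(input.length / ratio));
  for (let i = 0; i < out.length; i++) {
    out[i] = input[Math.floor(i * ratio)];       // nearest-sample decimation
  }
  return out;
}
```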
&lt;p&gt;&lt;strong&gt;7.2 Audio Playback Flow&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;1. &lt;strong&gt;Receive&lt;/strong&gt;: Receive Base64 encoded audio from WebSocket&lt;/p&gt;
&lt;p&gt;2. &lt;strong&gt;Decode&lt;/strong&gt;: Base64 decode to binary data&lt;/p&gt;
&lt;p&gt;3. &lt;strong&gt;Play&lt;/strong&gt;: Attempt MP3 playback first; if that fails, fall back to raw PCM&lt;/p&gt;
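&lt;p&gt;The raw-PCM fallback of steps 2–3 can be sketched as follows: Base64 → 16-bit little-endian PCM → Float32 samples, which can then be copied into an &lt;code&gt;AudioBuffer&lt;/code&gt; for playback. Node’s &lt;code&gt;Buffer&lt;/code&gt; is used for decoding here; in a browser you would use &lt;code&gt;atob&lt;/code&gt; and a &lt;code&gt;Uint8Array&lt;/code&gt;.&lt;/p&gt;

```javascript
// Decode a Base64 chunk of 16-bit little-endian PCM into Float32 samples
// in [-1, 1), the format Web Audio playback expects.
function base64PcmToFloat32(b64) {
  const bytes = Uint8Array.from(Buffer.from(b64, 'base64')); // browser: atob
  const pcm = new Int16Array(bytes.buffer);
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) out[i] = pcm[i] / 0x8000;
  return out;
}
```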
&lt;p&gt;&lt;strong&gt;8. ElevenLabs vs OpenAI Realtime API Detailed Comparison&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;During development, I also researched OpenAI Realtime API and found that both platforms have their own characteristics. Below is my detailed comparison:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;8.1 Quick Comparison Overview&lt;/strong&gt;&lt;/p&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Comparison Item&lt;/th&gt;
&lt;th&gt;ElevenLabs Agents Platform&lt;/th&gt;
&lt;th&gt;OpenAI Realtime API&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multimodal Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ Not supported (no camera or image input)&lt;/td&gt;
&lt;td&gt;✅ Supported (GPT-4o)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Voice Selection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ 100+ preset voices, supports voice cloning&lt;/td&gt;
&lt;td&gt;⚠ 10 preset voices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM Models&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Multi-model support (ElevenLabs, OpenAI, Google, Anthropic)&lt;/td&gt;
&lt;td&gt;✅ GPT-4o, GPT-4o-mini&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge Base&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Supported&lt;/td&gt;
&lt;td&gt;✅ Supported (via Assistants API)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Function Call&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Supported&lt;/td&gt;
&lt;td&gt;✅ Supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text Interrupt AI Response&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Supported (sending a text message can interrupt the AI’s ongoing response)&lt;/td&gt;
&lt;td&gt;❌ Not supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Depends on model (163ms-3.87s)&lt;/td&gt;
&lt;td&gt;✅ Low (300-800ms)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;💰 Per-minute billing ($0.0033-$0.1956/minute depending on model)&lt;/td&gt;
&lt;td&gt;💰 Per-token billing (GPT-4o-mini is more economical)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;For more information, see the detailed explanation of each feature below.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;8.2 Detailed Comparison of Key Points&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;8.2.1 Multimodal Support (Camera Recognition)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Support Status&lt;/th&gt;
&lt;th&gt;Detailed Information&lt;/th&gt;
&lt;th&gt;Reference Links&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ElevenLabs Agents Platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ Currently not supported&lt;/td&gt;
&lt;td&gt;Focuses on voice conversation, does not support visual input (camera/image recognition)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://elevenlabs.io/docs/agents-platform/api-reference/agents-platform/websocket" rel="noopener noreferrer"&gt;ElevenLabs Agents Platform WebSocket API Documentation&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI Realtime API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Supported (via GPT-4o)&lt;/td&gt;
&lt;td&gt;Supports visual input, can process images and video frames, supports real-time camera recognition. GPT-4o model natively supports multimodal input&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://platform.openai.com/docs/guides/realtime" rel="noopener noreferrer"&gt;OpenAI Realtime API Documentation&lt;/a&gt;&lt;br&gt;&lt;a href="https://platform.openai.com/docs/guides/vision" rel="noopener noreferrer"&gt;OpenAI GPT-4o Vision Capabilities&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;: The OpenAI Realtime API is based on the GPT-4o model and supports multimodal input, so it can process image and video content. ElevenLabs currently focuses on voice conversation scenarios and does not support visual input.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reference Sources&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ ElevenLabs: &lt;a href="https://elevenlabs.io/docs/agents-platform/api-reference/agents-platform/websocket" rel="noopener noreferrer"&gt;Official WebSocket API Documentation&lt;/a&gt; – does not mention visual input support&lt;/p&gt;
&lt;p&gt;▪ OpenAI: &lt;a href="https://platform.openai.com/docs/guides/realtime" rel="noopener noreferrer"&gt;Realtime API Official Documentation&lt;/a&gt; – supports GPT-4o multimodal capabilities&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8.2.2 Voice Selection Comparison&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Voice Count&lt;/th&gt;
&lt;th&gt;Voice Characteristics&lt;/th&gt;
&lt;th&gt;Customization Capability&lt;/th&gt;
&lt;th&gt;Reference Links&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ElevenLabs Agents Platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
✅ &lt;strong&gt;100+ preset voices&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;High quality, multilingual, supports emotional expression, voice cloning&lt;/td&gt;
&lt;td&gt;Supports custom voice ID, emotion control, tone adjustment, voice cloning&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://elevenlabs.io/voice-library" rel="noopener noreferrer"&gt;ElevenLabs Voice Library&lt;/a&gt;&lt;br&gt;&lt;a href="https://elevenlabs.io/docs/voice-cloning" rel="noopener noreferrer"&gt;ElevenLabs Voice Cloning&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI Realtime API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
⚠ &lt;strong&gt;Limited selection (10 voices)&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Mainly relies on TTS API, provides 10 preset voices (alloy, echo, fable, onyx, nova, shimmer…)&lt;/td&gt;
&lt;td&gt;Limited voice control capability, does not support voice cloning&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://platform.openai.com/docs/guides/text-to-speech" rel="noopener noreferrer"&gt;OpenAI TTS Documentation&lt;/a&gt;&lt;br&gt;&lt;a href="https://platform.openai.com/docs/api-reference/audio" rel="noopener noreferrer"&gt;OpenAI TTS Voice List&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Detailed Comparison&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ElevenLabs&lt;/strong&gt;: Provides over 100 preset voices covering multiple languages, ages, genders, and styles. Supports voice cloning, so custom voices can be created from a small number of audio samples, and offers emotion and tone control to adjust vocal expression. Voice quality is high, making it suitable for professional applications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OpenAI&lt;/strong&gt;: The TTS API provides 10 preset voices (alloy, echo, fable, onyx, nova, shimmer…), a relatively limited selection. It does not support voice cloning and offers only limited voice control.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reference Sources&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;OpenAI: &lt;a href="https://platform.openai.com/docs/guides/text-to-speech" rel="noopener noreferrer"&gt;TTS API Documentation&lt;/a&gt; – Lists 10 available voices&lt;/li&gt;
&lt;li&gt;ElevenLabs: &lt;a href="https://elevenlabs.io/voice-library" rel="noopener noreferrer"&gt;Official Voice Library&lt;/a&gt; – Shows a large number of preset voices&lt;/li&gt;
&lt;li&gt;ElevenLabs: &lt;a href="https://elevenlabs.io/docs/voice-cloning" rel="noopener noreferrer"&gt;Voice Cloning Documentation&lt;/a&gt; – Supports custom voice cloning&lt;/li&gt;
&lt;/ul&gt;
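&lt;p&gt;As a concrete illustration of picking one of OpenAI’s preset voices, here is a minimal sketch of a TTS request body following the TTS API documentation cited above. Only the six voice names actually spelled out in the text are included; the elided ones are omitted, and an API key would be needed to actually synthesize audio:&lt;/p&gt;

```python
# Sketch: choosing one of OpenAI's preset TTS voices.
# Only the six names listed in the article are included here.

PRESET_VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}

def build_tts_request(text: str, voice: str = "alloy") -> dict:
    """Build a TTS request body; reject voices outside the preset list."""
    if voice not in PRESET_VOICES:
        raise ValueError(f"unknown voice: {voice}")
    return {"model": "tts-1", "voice": voice, "input": text}

req = build_tts_request("Hello from the avatar.", voice="nova")
```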
&lt;p&gt;&lt;strong&gt;8.2.3 Supported LLM Models&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Supported Models&lt;/th&gt;
&lt;th&gt;Model Characteristics&lt;/th&gt;
&lt;th&gt;Reference Links&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ElevenLabs Agents Platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
✅ &lt;strong&gt;Multi-model support&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Supports ElevenLabs proprietary models and multiple third-party models (OpenAI, Google, Anthropic, etc.), users can choose according to needs, supports custom LLM parameters&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://elevenlabs.io/docs/agents-platform" rel="noopener noreferrer"&gt;ElevenLabs Agents Documentation&lt;/a&gt;&lt;br&gt;&lt;a href="https://elevenlabs.io/docs/agents-platform/api-reference/agents-platform/websocket#custom-llm-extra-body" rel="noopener noreferrer"&gt;ElevenLabs LLM Configuration&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI Realtime API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
✅ &lt;strong&gt;GPT-4o, GPT-4o-mini&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Supports GPT-4o (multimodal, stronger capabilities) and GPT-4o-mini (lightweight, faster, lower cost), can switch models&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://platform.openai.com/docs/guides/realtime" rel="noopener noreferrer"&gt;OpenAI Realtime API Models&lt;/a&gt;&lt;br&gt;&lt;a href="https://platform.openai.com/docs/models" rel="noopener noreferrer"&gt;OpenAI Model Comparison&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;List of Models Supported by ElevenLabs Agents Platform&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ElevenLabs Proprietary Models&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GLM-4.5-Air&lt;/strong&gt;: Suitable for agentic use cases, latency ~631ms, cost ~$0.0600/minute&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Qwen3-30B-A3B&lt;/strong&gt;: Ultra-low latency, latency ~163ms, cost ~$0.0168/minute&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPT-OSS-120B&lt;/strong&gt;: Experimental model (OpenAI open-source model), latency ~314ms, cost ~$0.0126/minute&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Other Provider Models&lt;/strong&gt; (available on ElevenLabs platform):&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OpenAI Models&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPT-5 series: GPT-5 (latency ~1.14s, cost ~$0.0826/minute), GPT-5.1, GPT-5 Mini (latency ~855ms, cost ~$0.0165/minute), GPT-5 Nano (latency ~788ms, cost ~$0.0033/minute)&lt;/li&gt;
&lt;li&gt;GPT-4.1 series: GPT-4.1 (latency ~803ms, cost ~$0.1298/minute), GPT-4.1 Mini, GPT-4.1 Nano (latency ~478ms, cost ~$0.0065/minute)&lt;/li&gt;
&lt;li&gt;GPT-4o (latency ~771ms, cost ~$0.1623/minute), GPT-4o Mini (latency ~738ms, cost ~$0.0097/minute)&lt;/li&gt;
&lt;li&gt;GPT-4 Turbo (latency ~1.28s, cost ~$0.6461/minute), GPT-3.5 Turbo (latency ~494ms, cost ~$0.0323/minute)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Google Models&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro Preview (latency ~3.87s, cost ~$0.1310/minute)&lt;/li&gt;
&lt;li&gt;Gemini 2.5 Flash (latency ~752ms, cost ~$0.0097/minute), Gemini 2.5 Flash Lite (latency ~505ms, cost ~$0.0065/minute)&lt;/li&gt;
&lt;li&gt;Gemini 2.0 Flash (latency ~564ms, cost ~$0.0065/minute), Gemini 2.0 Flash Lite (latency ~547ms, cost ~$0.0049/minute)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Anthropic Models&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Claude Sonnet 4.5 (latency ~1.5s, cost ~$0.1956/minute), Claude Sonnet 4 (latency ~1.31s, cost ~$0.1956/minute)&lt;/li&gt;
&lt;li&gt;Claude Haiku 4.5 (latency ~703ms, cost ~$0.0652/minute)&lt;/li&gt;
&lt;li&gt;Claude 3.7 Sonnet (latency ~1.12s, cost ~$0.1956/minute), Claude 3.5 Sonnet (latency ~1.14s, cost ~$0.1956/minute)&lt;/li&gt;
&lt;li&gt;Claude 3 Haiku (latency ~608ms, cost ~$0.0163/minute)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Custom Models&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Supports adding custom LLMs&lt;/li&gt;
&lt;/ul&gt;

&lt;img width="305" height="1024" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fllm-selection-305x1024.png" alt=""&gt;&lt;p&gt;&lt;em&gt;The above image shows the list of selectable LLM models in ElevenLabs Agents Platform, including latency and pricing information&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Detailed Explanation&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;ElevenLabs&lt;/strong&gt;: Provides a rich model selection, including proprietary models and models from multiple third-party providers. Users can choose the most suitable model based on latency, cost, and functional requirements. LLM parameters (such as temperature and max_tokens) can be customized through &lt;code&gt;custom_llm_extra_body&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;OpenAI&lt;/strong&gt;: Supports GPT-4o (multimodal, stronger reasoning) and GPT-4o-mini (faster, lower cost), so users can choose according to their needs. Both models support real-time conversation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reference Sources&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ElevenLabs: &lt;a href="https://elevenlabs.io/docs/agents-platform" rel="noopener noreferrer"&gt;Agents Platform Documentation&lt;/a&gt; – Model selection interface&lt;/li&gt;
&lt;li&gt;ElevenLabs: &lt;a href="https://elevenlabs.io/docs/agents-platform/api-reference/agents-platform/websocket#custom-llm-extra-body" rel="noopener noreferrer"&gt;WebSocket API – Custom LLM Parameters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;OpenAI: &lt;a href="https://platform.openai.com/docs/guides/realtime" rel="noopener noreferrer"&gt;Realtime API Documentation&lt;/a&gt; – Supports GPT-4o and GPT-4o-mini&lt;/li&gt;
&lt;li&gt;OpenAI: &lt;a href="https://platform.openai.com/docs/models" rel="noopener noreferrer"&gt;Model Comparison Documentation&lt;/a&gt; – Detailed model information&lt;/li&gt;
&lt;/ul&gt;
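&lt;p&gt;As a minimal sketch of customizing LLM parameters, here is how extra parameters such as temperature and max_tokens might be attached to an ElevenLabs conversation initiation message. The message shape is assumed from the WebSocket reference cited above and should be verified against the current docs:&lt;/p&gt;

```python
import json

# Sketch (assumed message shape, per the ElevenLabs WebSocket docs):
# attach custom LLM parameters when initiating a conversation.

def build_initiation_message(temperature: float, max_tokens: int) -> str:
    """Serialize a conversation_initiation_client_data message with LLM overrides."""
    msg = {
        "type": "conversation_initiation_client_data",
        "custom_llm_extra_body": {
            "temperature": temperature,
            "max_tokens": max_tokens,
        },
    }
    return json.dumps(msg)

payload = build_initiation_message(temperature=0.7, max_tokens=256)
```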

&lt;p&gt;&lt;strong&gt;8.2.4 Knowledge Base Support&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Knowledge Base Support&lt;/th&gt;
&lt;th&gt;Implementation Method&lt;/th&gt;
&lt;th&gt;Reference Links&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ElevenLabs Agents Platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
✅ &lt;strong&gt;Supported&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Supports knowledge base integration through Agent configuration, can upload documents and set up knowledge base, Agent can reference knowledge base content in conversations&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://elevenlabs.io/docs/agents-platform" rel="noopener noreferrer"&gt;ElevenLabs Agents Documentation&lt;/a&gt;&lt;br&gt;&lt;a href="https://elevenlabs.io/docs/agents-platform/agent-configuration" rel="noopener noreferrer"&gt;ElevenLabs Agent Configuration&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI Realtime API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
✅ &lt;strong&gt;Supported (via Assistants API or Function Calling)&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Can integrate knowledge base through Assistants API (file upload, vector storage), or access external data sources and APIs through function calling&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://platform.openai.com/docs/assistants" rel="noopener noreferrer"&gt;OpenAI Assistants API&lt;/a&gt;&lt;br&gt;&lt;a href="https://platform.openai.com/docs/guides/function-calling" rel="noopener noreferrer"&gt;OpenAI Function Calling&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Detailed Explanation&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;ElevenLabs&lt;/strong&gt;: Supports a knowledge base in the Agent configuration; documents can be uploaded for the Agent to reference, and their content is automatically drawn on during conversations.&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;OpenAI&lt;/strong&gt;: An assistant with a knowledge base can be created through the Assistants API (supports file upload and vector storage), or external data sources and APIs can be reached through function calling, enabling more flexible knowledge retrieval.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reference Sources&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ElevenLabs: &lt;a href="https://elevenlabs.io/docs/agents-platform" rel="noopener noreferrer"&gt;Agents Platform Documentation&lt;/a&gt; – Mentions knowledge base support&lt;/li&gt;
&lt;li&gt;ElevenLabs: &lt;a href="https://elevenlabs.io/docs/agents-platform/agent-configuration" rel="noopener noreferrer"&gt;Agent Configuration Documentation&lt;/a&gt; – Knowledge base configuration instructions&lt;/li&gt;
&lt;li&gt;OpenAI: &lt;a href="https://platform.openai.com/docs/assistants" rel="noopener noreferrer"&gt;Assistants API Documentation&lt;/a&gt; – Knowledge base and file upload functionality&lt;/li&gt;
&lt;li&gt;OpenAI: &lt;a href="https://platform.openai.com/docs/guides/function-calling" rel="noopener noreferrer"&gt;Function Calling Documentation&lt;/a&gt; – External data access&lt;/li&gt;
&lt;/ul&gt;
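&lt;p&gt;To illustrate the function-calling route to external knowledge, here is a minimal sketch of an OpenAI tool definition for a knowledge lookup. The schema follows the Function Calling guide cited above; &lt;code&gt;search_knowledge_base&lt;/code&gt; is a hypothetical tool name, and the application would still have to implement the actual retrieval:&lt;/p&gt;

```python
# Sketch: exposing an external knowledge lookup to the model via
# function calling. "search_knowledge_base" is a hypothetical tool;
# the schema shape follows OpenAI's function-calling tool format.

search_tool = {
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Look up documents relevant to the user's question.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query text"},
            },
            "required": ["query"],
        },
    },
}
```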

&lt;p&gt;&lt;strong&gt;8.2.5 Function Call Support&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Support Status&lt;/th&gt;
&lt;th&gt;Implementation Method&lt;/th&gt;
&lt;th&gt;Reference Links&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ElevenLabs Agents Platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
✅ &lt;strong&gt;Supported&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Implements tool calling through &lt;code&gt;client_tool_call&lt;/code&gt; and &lt;code&gt;client_tool_result&lt;/code&gt; message types, supports defining tools in Agent&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://elevenlabs.io/docs/agents-platform/api-reference/agents-platform/websocket#client-tool-call" rel="noopener noreferrer"&gt;ElevenLabs WebSocket API – Tool Calling&lt;/a&gt;&lt;br&gt;&lt;a href="https://elevenlabs.io/docs/agents-platform/agent-configuration" rel="noopener noreferrer"&gt;ElevenLabs Agent Tool Configuration&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI Realtime API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
✅ &lt;strong&gt;Supported&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Implements function calling through &lt;code&gt;tool_calls&lt;/code&gt; and &lt;code&gt;tool_results&lt;/code&gt; events, supports defining tools in sessions&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://platform.openai.com/docs/guides/realtime/function-calling" rel="noopener noreferrer"&gt;OpenAI Realtime API – Function Calling&lt;/a&gt;&lt;br&gt;&lt;a href="https://platform.openai.com/docs/guides/function-calling" rel="noopener noreferrer"&gt;OpenAI Function Calling Guide&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Detailed Comparison&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;ElevenLabs&lt;/strong&gt;: Uses the &lt;code&gt;client_tool_call&lt;/code&gt; event to ask the client to execute a tool, and the client returns results through &lt;code&gt;client_tool_result&lt;/code&gt;. Tools are defined in the Agent configuration.&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;OpenAI&lt;/strong&gt;: Uses the standard function calling mechanism, triggered through &lt;code&gt;tool_calls&lt;/code&gt; events, with results returned through &lt;code&gt;tool_results&lt;/code&gt;. Tools can be defined dynamically in sessions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reference Sources&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ElevenLabs: &lt;a href="https://elevenlabs.io/docs/agents-platform/api-reference/agents-platform/websocket#client-tool-call" rel="noopener noreferrer"&gt;WebSocket API – client_tool_call&lt;/a&gt; – Tool calling implementation&lt;/li&gt;
&lt;li&gt;ElevenLabs: &lt;a href="https://elevenlabs.io/docs/agents-platform/agent-configuration" rel="noopener noreferrer"&gt;Agent Configuration&lt;/a&gt; – Tool definition&lt;/li&gt;
&lt;li&gt;OpenAI: &lt;a href="https://platform.openai.com/docs/guides/realtime/function-calling" rel="noopener noreferrer"&gt;Realtime API Function Calling&lt;/a&gt; – Real-time API tool calling&lt;/li&gt;
&lt;li&gt;OpenAI: &lt;a href="https://platform.openai.com/docs/guides/function-calling" rel="noopener noreferrer"&gt;Function Calling Guide&lt;/a&gt; – Detailed implementation instructions&lt;/li&gt;
&lt;/ul&gt;
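&lt;p&gt;The ElevenLabs round trip above can be sketched as a small client-side dispatcher: receive a &lt;code&gt;client_tool_call&lt;/code&gt; message, run the named tool, and reply with a &lt;code&gt;client_tool_result&lt;/code&gt;. Field names are assumed from the WebSocket reference cited above, and &lt;code&gt;get_time&lt;/code&gt; is a hypothetical tool:&lt;/p&gt;

```python
import json

# Sketch of a client-side tool dispatcher for ElevenLabs'
# client_tool_call / client_tool_result messages (field names assumed
# from the WebSocket docs; "get_time" is a hypothetical tool).

TOOLS = {"get_time": lambda params: "12:00"}

def handle_client_tool_call(raw: str) -> str:
    """Run the requested tool and build the client_tool_result reply."""
    msg = json.loads(raw)
    call = msg["client_tool_call"]
    result = TOOLS[call["tool_name"]](call.get("parameters", {}))
    return json.dumps({
        "type": "client_tool_result",
        "tool_call_id": call["tool_call_id"],
        "result": result,
        "is_error": False,
    })
```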

&lt;p&gt;&lt;strong&gt;8.2.6 Text Interrupt AI Response&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Support Status&lt;/th&gt;
&lt;th&gt;Detailed Information&lt;/th&gt;
&lt;th&gt;Reference Links&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ElevenLabs Agents Platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F2705.png" alt="✅" width="72" height="72"&gt; &lt;strong&gt;Supported&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Sending a text message (&lt;code&gt;user_message&lt;/code&gt;) can interrupt the AI’s ongoing voice response, enabling more natural conversational interaction&lt;/td&gt;
&lt;td&gt;&lt;a href="https://elevenlabs.io/docs/agents-platform/api-reference/agents-platform/websocket" rel="noopener noreferrer"&gt;ElevenLabs WebSocket API – User Message&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI Realtime API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F274c.png" alt="❌" width="72" height="72"&gt; &lt;strong&gt;Not supported&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Sending a text message cannot interrupt the AI’s ongoing response; the client must wait for the current response to complete&lt;/td&gt;
&lt;td&gt;&lt;a href="https://platform.openai.com/docs/guides/realtime" rel="noopener noreferrer"&gt;OpenAI Realtime API Documentation&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Detailed Comparison&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;ElevenLabs&lt;/strong&gt;: Supports interrupting the AI’s ongoing response with a text message. If the user sends a text message while the AI is speaking, the AI immediately stops its current response and processes the new input, making conversations feel natural and fluid, much like interruptions in real human conversation.&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;OpenAI&lt;/strong&gt;: Does not support text-message interruption. If the AI is responding, a text message sent by the user is processed only after the current response completes, which can hurt conversational fluency and real-time feel.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Cases&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;ElevenLabs&lt;/strong&gt;: Suited to scenarios requiring fast interaction and interruption, such as real-time customer service and quick Q&amp;amp;A&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;OpenAI&lt;/strong&gt;: Suited to scenarios where complete responses matter more than interaction flexibility&lt;/p&gt;
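&lt;p&gt;On the ElevenLabs side, the interrupting text event is just a small JSON payload sent over the open WebSocket. A minimal sketch, assuming the &lt;code&gt;user_message&lt;/code&gt; event shape from the WebSocket API reference (verify field names against the current docs before relying on them):&lt;/p&gt;

```python
import json

def build_user_message(text: str) -> str:
    """Serialize a user_message event for the ElevenLabs Agents
    Platform WebSocket. Sending this while the agent is speaking is
    documented to interrupt the in-progress voice response. The field
    names here are an assumption based on the API docs."""
    return json.dumps({"type": "user_message", "text": text})

# A client would send this over the open WebSocket, e.g.:
#   await ws.send(build_user_message("Wait, change of plans"))
payload = json.loads(build_user_message("Wait, change of plans"))
```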

&lt;p&gt;&lt;strong&gt;8.2.7 Latency Comparison&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Latency Performance&lt;/th&gt;
&lt;th&gt;Optimization Features&lt;/th&gt;
&lt;th&gt;Reference Links&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ElevenLabs Agents Platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F2705.png" alt="✅" width="72" height="72"&gt; &lt;strong&gt;Depends on model selection&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Latency ranges from &lt;strong&gt;163 ms to 3.87 s&lt;/strong&gt;, depending on the selected LLM. Low-latency models such as Qwen3-30B-A3B (~163 ms) suit real-time interaction; higher-capability models such as GPT-5 (~1.14 s) or Claude Sonnet (~1.5 s) have higher latency but stronger reasoning. Supports streaming responses&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://elevenlabs.io/docs/agents-platform" rel="noopener noreferrer"&gt;ElevenLabs Agents Platform Documentation&lt;/a&gt;&lt;br&gt;&lt;a href="https://elevenlabs.io/docs/agents-platform/api-reference/agents-platform/websocket" rel="noopener noreferrer"&gt;ElevenLabs WebSocket API&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI Realtime API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F2705.png" alt="✅" width="72" height="72"&gt; &lt;strong&gt;Low latency&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Real-time streaming responses with latency typically &lt;strong&gt;300–800 ms&lt;/strong&gt; (depending on model and network); GPT-4o-mini is usually faster&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://platform.openai.com/docs/guides/realtime" rel="noopener noreferrer"&gt;OpenAI Realtime API Documentation&lt;/a&gt;&lt;br&gt;&lt;a href="https://platform.openai.com/docs/guides/realtime/optimizing-latency" rel="noopener noreferrer"&gt;OpenAI Performance Optimization&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Detailed Explanation&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;ElevenLabs&lt;/strong&gt;: Latency depends on the selected LLM. Low-latency models (such as Qwen3-30B-A3B at ~163 ms or GPT-3.5 Turbo at ~494 ms) keep latency low enough for real-time interaction; higher-capability models (such as GPT-5 at ~1.14 s or Claude Sonnet at ~1.5 s) add latency but reason better. Streaming audio responses reduce first-byte latency.&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;OpenAI&lt;/strong&gt;: Latency is relatively stable; GPT-4o-mini usually responds faster than GPT-4o. Streaming responses are supported for latency optimization.&lt;/p&gt;
&lt;p&gt;Actual latency is affected by the following factors:&lt;/p&gt;
&lt;p&gt;– Network conditions and geographic location&lt;/p&gt;
&lt;p&gt;– Model selection (the ElevenLabs platform offers many models; OpenAI mainly GPT-4o vs. GPT-4o-mini)&lt;/p&gt;
&lt;p&gt;– Request complexity&lt;/p&gt;
&lt;p&gt;– Server load&lt;/p&gt;
&lt;p&gt;The figures above are typical values; actual performance varies with the usage scenario.&lt;/p&gt;
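&lt;p&gt;When comparing platforms yourself, the most useful single number is time-to-first-chunk rather than total response time. A minimal, platform-agnostic sketch, where the fake stream stands in for a real streaming API client:&lt;/p&gt;

```python
import time

def time_to_first_chunk(stream):
    """Return the first chunk of a streaming response and the elapsed
    time until it arrived (first-byte latency)."""
    start = time.perf_counter()
    first = next(iter(stream))
    return first, time.perf_counter() - start

def fake_stream(delay_s=0.05):
    """Simulated streaming response; a real client would yield audio
    or text chunks from the network."""
    for chunk in ("Hel", "lo"):
        time.sleep(delay_s)
        yield chunk

chunk, latency = time_to_first_chunk(fake_stream())
```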
&lt;p&gt;&lt;strong&gt;Reference Sources&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ ElevenLabs: &lt;a href="https://elevenlabs.io/docs/agents-platform" rel="noopener noreferrer"&gt;Agents Platform Documentation&lt;/a&gt; – Emphasizes low-latency optimization&lt;/p&gt;
&lt;p&gt;▪ OpenAI: &lt;a href="https://platform.openai.com/docs/guides/realtime" rel="noopener noreferrer"&gt;Realtime API Documentation&lt;/a&gt; – Real-time performance description&lt;/p&gt;
&lt;p&gt;▪ OpenAI: &lt;a href="https://platform.openai.com/docs/guides/realtime/optimizing-latency" rel="noopener noreferrer"&gt;Latency Optimization Guide&lt;/a&gt; – Performance optimization recommendations&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8.2.8 Pricing Comparison&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Billing Method&lt;/th&gt;
&lt;th&gt;Price Details&lt;/th&gt;
&lt;th&gt;Reference Links&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ElevenLabs Agents Platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F1f4b0.png" alt="💰" width="72" height="72"&gt; &lt;strong&gt;Per-conversation minute billing (based on selected model)&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;The price depends on the selected LLM and usually bundles voice synthesis, speech recognition, and LLM call fees into one rate. For specific model prices, see the “Supported LLM Models” section above&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://elevenlabs.io/pricing" rel="noopener noreferrer"&gt;ElevenLabs Pricing Page&lt;/a&gt;&lt;br&gt;&lt;a href="https://elevenlabs.io/docs/agents-platform" rel="noopener noreferrer"&gt;ElevenLabs Billing Instructions&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI Realtime API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F1f4b0.png" alt="💰" width="72" height="72"&gt; &lt;strong&gt;Per-token and audio duration billing&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;GPT-4o&lt;/strong&gt;: Input $2.50/1M tokens, Output $10/1M tokens&lt;br&gt;&lt;strong&gt;GPT-4o-mini&lt;/strong&gt;: Input $0.15/1M tokens, Output $0.60/1M tokens&lt;br&gt;Audio input/output: $0.015/minute&lt;br&gt;(Prices may change over time)&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://platform.openai.com/pricing" rel="noopener noreferrer"&gt;OpenAI Pricing Page&lt;/a&gt;&lt;br&gt;&lt;a href="https://platform.openai.com/docs/guides/realtime" rel="noopener noreferrer"&gt;OpenAI Realtime API Pricing&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Detailed Comparison&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;ElevenLabs&lt;/strong&gt;: Bills per conversation minute, with the rate depending on the selected LLM. The fee usually bundles voice synthesis, speech recognition, and LLM calls, making billing simple and predictable. For specific model prices, see the “Supported LLM Models” section above.&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;OpenAI&lt;/strong&gt;: Bills per token, with prices varying significantly between models:&lt;/p&gt;
&lt;p&gt;  – GPT-4o-mini: More economical, suitable for high-frequency usage&lt;/p&gt;
&lt;p&gt;  – GPT-4o: More capable but pricier, suitable for workloads needing multimodal or stronger reasoning capabilities&lt;/p&gt;
&lt;p&gt;  – Audio processing is billed separately per minute&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cost Estimation Examples&lt;/strong&gt; (for reference only):&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;Short conversation&lt;/strong&gt; (5 minutes, approximately 1,000 tokens): OpenAI GPT-4o-mini, roughly $0.0015 in token fees + $0.075 in audio fees = &lt;strong&gt;$0.0765&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;– &lt;strong&gt;Long conversation&lt;/strong&gt; (30 minutes, approximately 5,000 tokens): OpenAI GPT-4o-mini, roughly $0.0075 in token fees + $0.45 in audio fees = &lt;strong&gt;$0.4575&lt;/strong&gt;&lt;/p&gt;
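&lt;p&gt;These estimates can be reproduced with a small helper. This is a sketch using the rates from the table above; the token figure in the short-conversation example implies a particular input/output split, so the 500/500 split below is an illustrative assumption, and actual prices may change over time.&lt;/p&gt;

```python
# Rates from the comparison table (USD); prices may change over time.
GPT4O_MINI = {"input_per_1m": 0.15, "output_per_1m": 0.60}
AUDIO_PER_MINUTE = 0.015

def estimate_cost(minutes, input_tokens, output_tokens, rates=GPT4O_MINI):
    """Rough Realtime API cost: per-token fees plus per-minute audio fees."""
    token_cost = (input_tokens * rates["input_per_1m"]
                  + output_tokens * rates["output_per_1m"]) / 1_000_000
    return token_cost + minutes * AUDIO_PER_MINUTE

# 5-minute call: the audio component alone is 5 * $0.015 = $0.075
short_call = estimate_cost(minutes=5, input_tokens=500, output_tokens=500)
```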
&lt;p&gt;&lt;strong&gt;Recommendations&lt;/strong&gt;: Choose a platform based on your actual usage scenario and budget:&lt;/p&gt;
&lt;p&gt;– If you mainly run voice conversations at high volume, ElevenLabs’ per-minute billing may be simpler, and you can pick models to balance cost and performance&lt;/p&gt;
&lt;p&gt;– If you need multimodal capabilities or a stronger LLM, OpenAI may be a better fit&lt;/p&gt;
&lt;p&gt;– For high-frequency usage, GPT-4o-mini is usually the more economical choice&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reference Sources&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ ElevenLabs: &lt;a href="https://elevenlabs.io/pricing" rel="noopener noreferrer"&gt;Official Pricing Page&lt;/a&gt; – Latest pricing information&lt;/p&gt;
&lt;p&gt;▪ ElevenLabs: &lt;a href="https://elevenlabs.io/docs/agents-platform" rel="noopener noreferrer"&gt;Agents Platform Documentation&lt;/a&gt; – Billing instructions&lt;/p&gt;
&lt;p&gt;▪ OpenAI: &lt;a href="https://platform.openai.com/pricing" rel="noopener noreferrer"&gt;Official Pricing Page&lt;/a&gt; – Latest pricing information (2024–2025)&lt;/p&gt;
&lt;p&gt;▪ OpenAI: &lt;a href="https://platform.openai.com/docs/guides/realtime" rel="noopener noreferrer"&gt;Realtime API Documentation&lt;/a&gt; – Billing details&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Conclusion&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The ElevenLabs Agents Platform WebSocket API provides strong support for real-time voice conversations. Through this demo, I implemented a complete real-time voice conversation pipeline: audio capture, processing, transmission, and playback.&lt;/p&gt;
&lt;p&gt;Compared to the OpenAI Realtime API, ElevenLabs has clear advantages in voice selection and model flexibility, making it especially suitable when you need a specific voice or voice cloning. If you need multimodal capabilities, however, OpenAI may be the better choice.&lt;/p&gt;
&lt;p&gt;If you also want to try building real-time voice conversations, this demo should give you a good starting point. The project code is open source, so you can use it directly or extend it.&lt;/p&gt;
&lt;p&gt;The post &lt;a href="https://frankfu.blog/openai/building-real-time-voice-conversations-with-elevenlabs-websocket-api-a-complete-development-guide/" rel="noopener noreferrer"&gt;Building Real-time Voice Conversations with ElevenLabs WebSocket API: A Complete Development Guide&lt;/a&gt; appeared first on &lt;a href="https://frankfu.blog" rel="noopener noreferrer"&gt;Frank Fu's Blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>openai</category>
    </item>
    <item>
      <title>NavTalk Update: Revolutionary 200ms Response Time for Real-Time Digital Human Experience!</title>
      <dc:creator>Frank Fu</dc:creator>
      <pubDate>Mon, 30 Mar 2026 08:52:45 +0000</pubDate>
      <link>https://dev.to/frankfu/navtalk-update-revolutionary-200ms-response-time-for-real-time-digital-human-experience-3089</link>
      <guid>https://dev.to/frankfu/navtalk-update-revolutionary-200ms-response-time-for-real-time-digital-human-experience-3089</guid>
      <description>&lt;h2&gt;1. Response Speed Performance&lt;/h2&gt;
&lt;p&gt;Let’s get straight to the point by looking at the actual response speed performance:&lt;/p&gt;
&lt;p&gt;In the live demo, we achieved an end-to-end latency of &lt;strong&gt;under 200 ms &lt;/strong&gt;for the initial audio processing — from the user finishing their speech to the AI processing, generating the video, and displaying it on the front end, all within approximately 200 ms. Currently, this response speed is highly advanced compared to other real-time digital human systems.&lt;/p&gt;
&lt;h2&gt;2. Overall Latency Before Optimization&lt;/h2&gt;
&lt;p&gt;We conducted detailed tests on MuseTalk’s real-time performance in the &lt;strong&gt;A100 GPU&lt;/strong&gt; environment:&lt;/p&gt;
&lt;p&gt;    1. When testing with 0.5-second real-time audio input, the processing time exceeded 0.5 seconds, failing to meet real-time requirements. As shown in the video below:&lt;/p&gt;
&lt;p&gt;    2. Lowering the FPS to 18 cut the processing time for real-time audio input by about 0.2 seconds; however, the FPS would have to drop below 15 to meet real-time expectations.&lt;/p&gt;
&lt;p&gt;    3. Increasing the batch size actually increased the processing time, indicating the chip had hit its processing limit.&lt;/p&gt;
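&lt;p&gt;The pass/fail criterion in these tests is the real-time factor: processing time divided by audio duration, which must stay below 1.0 to keep up with live input. A minimal check (the timing numbers here are illustrative, not measurements):&lt;/p&gt;

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; the pipeline keeps up
    with live input only while RTF < 1.0."""
    return processing_s / audio_s

# Before optimization: 0.5 s of audio took more than 0.5 s to process
before = real_time_factor(0.55, 0.5)   # illustrative timing
after = real_time_factor(0.35, 0.5)    # illustrative timing
```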
&lt;p&gt;The root cause lies not in the A100 GPU itself but in the host CPU paired with it: most GPU cloud providers ship AMD host processors, which lag Intel processors in some computer vision tasks such as image processing. Specifically, the system uses an AMD EPYC 7J13 64-core processor (with 30 cores allocated), which is well suited to virtualization and high-concurrency workloads but underperforms comparable Intel processors in some image processing tasks.&lt;/p&gt;
&lt;img width="800" height="217" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-MTKE.png" alt=""&gt;&lt;p&gt;I initially encountered this problem, which limited performance optimization. Later, I had an idea: &lt;strong&gt;Could we leverage the GPU for image processing tasks&lt;/strong&gt;, thereby breaking through the current performance bottleneck? This thought led to a series of optimization steps.&lt;/p&gt;
&lt;h2&gt;3. GPU-Accelerated Image Processing Optimization&lt;/h2&gt;
&lt;h3&gt;3.1 Optimization Approach&lt;/h3&gt;
&lt;p&gt;To address the performance bottleneck of AMD chips in image processing, the core idea was to move the image processing operations, originally executed on the CPU, to the GPU, taking full advantage of the GPU’s parallel computing capabilities. In MuseTalk’s inference process, the following image processing steps were executed on the CPU:&lt;/p&gt;
&lt;p&gt;    1. &lt;strong&gt;Data Conversion After VAE Decoding&lt;/strong&gt;: The decoded result from the GPU tensor is converted to a numpy array, incurring GPU → CPU data transfer overhead.&lt;/p&gt;
&lt;p&gt;    2. &lt;strong&gt;Image Resize&lt;/strong&gt;: Image resizing is performed on the CPU using OpenCV’s &lt;code&gt;cv2.resize()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;    3. &lt;strong&gt;Image Sharpening&lt;/strong&gt;: Image sharpening is done on the CPU using OpenCV and NumPy with an Unsharp Mask operation.&lt;/p&gt;
&lt;p&gt;    4. &lt;strong&gt;Image Blending&lt;/strong&gt;: Image composition and blending are handled on the CPU using PIL.&lt;/p&gt;
&lt;p&gt;Although each operation individually takes a small amount of time, the cumulative latency becomes significant in a real-time processing scenario. More importantly, these operations can be accelerated using the parallel computing power of the GPU.&lt;/p&gt;
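&lt;p&gt;To make the CPU-side cost concrete, here is a minimal NumPy sketch of the unsharp-mask step (sharpened = image + amount * (image - blurred)). The project uses OpenCV’s Gaussian blur; a box blur stands in here to keep the example self-contained.&lt;/p&gt;

```python
import numpy as np

def box_blur(img, k=3):
    """Crude k x k box blur (stand-in for cv2.GaussianBlur)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def unsharp_mask(img, amount=1.2):
    """Classic unsharp mask: amplify the difference from a blurred copy."""
    img = img.astype(np.float64)
    sharpened = img + amount * (img - box_blur(img))
    return np.clip(sharpened, 0, 255).astype(np.uint8)

frame = np.full((8, 8), 100, dtype=np.uint8)
frame[4:, :] = 200                      # a hard edge to sharpen
sharp = unsharp_mask(frame)
```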
&lt;h3&gt;3.2 Technical Implementation&lt;/h3&gt;
&lt;h4&gt;3.2.1 Creating a GPU Image Processing Tool Library&lt;/h4&gt;
&lt;p&gt;First, I created a dedicated GPU image processing tool library &lt;code&gt;musetalk/utils/gpu_image_processing.py&lt;/code&gt;, implementing the following core functions:&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;gpu_resize()&lt;/strong&gt;: Uses PyTorch’s &lt;code&gt;F.interpolate()&lt;/code&gt; for GPU-based image resizing.&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;gpu_gaussian_blur()&lt;/strong&gt;: Implements GPU-based Gaussian blur using PyTorch’s &lt;code&gt;F.conv2d()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;gpu_unsharp_mask()&lt;/strong&gt;: Performs image sharpening on the GPU using GPU-based Gaussian blur.&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;gpu_image_blending()&lt;/strong&gt;: GPU-based image blending using tensor operations.&lt;/p&gt;
&lt;p&gt;These functions support multiple input formats ([H, W, C], [B, H, W, C], [B, C, H, W]) and handle data format conversions automatically, keeping them easy to use. With matching modifications in the &lt;code&gt;processing.py&lt;/code&gt; file, all image processing tasks were migrated to the GPU.&lt;/p&gt;
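&lt;p&gt;The multi-format support mentioned above boils down to normalizing everything to the [B, C, H, W] layout that PyTorch operators such as &lt;code&gt;F.interpolate()&lt;/code&gt; expect. A NumPy sketch of that normalization (the function name and the small-channel-count heuristic are illustrative, not the project’s actual code):&lt;/p&gt;

```python
import numpy as np

def to_bchw(arr):
    """Normalize [H, W, C], [B, H, W, C], or [B, C, H, W] arrays to
    [B, C, H, W]. Assumes the channel count is small (<= 4) to tell
    channels-last from channels-first layouts apart."""
    if arr.ndim == 3:                            # [H, W, C] -> add batch
        arr = arr[None, ...]
    if arr.ndim != 4:
        raise ValueError(f"expected 3 or 4 dims, got {arr.ndim}")
    if arr.shape[-1] <= 4 and arr.shape[1] > 4:  # [B, H, W, C] -> BCHW
        arr = arr.transpose(0, 3, 1, 2)
    return arr

bchw = to_bchw(np.zeros((32, 32, 3)))            # single HWC frame
```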
&lt;h4&gt;3.2.2 Optimizing VAE Decoding Process&lt;/h4&gt;
&lt;p&gt;I modified the &lt;code&gt;decode_latents()&lt;/code&gt; method in &lt;code&gt;musetalk/models/vae.py&lt;/code&gt;, adding a &lt;code&gt;return_tensor&lt;/code&gt; parameter:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def decode_latents(self, latents, return_tensor=False):&lt;br&gt;
    # ... decoding logic ...&lt;br&gt;
    if return_tensor:&lt;br&gt;
        # Return a GPU tensor to avoid GPU → CPU transfer&lt;br&gt;
        image = image.permute(0, 2, 3, 1)  # [B, H, W, C]&lt;br&gt;
        image = image * 255.0&lt;br&gt;
        image = image[..., [2, 1, 0]]  # Convert RGB to BGR&lt;br&gt;
        return image&lt;br&gt;
    else:&lt;br&gt;
        # Original behavior: return a NumPy array&lt;br&gt;
        image = (&lt;br&gt;
            image.detach()&lt;br&gt;
                 .cpu()&lt;br&gt;
                 .permute(0, 2, 3, 1)&lt;br&gt;
                 .float()&lt;br&gt;
                 .numpy()&lt;br&gt;
        )&lt;br&gt;
        # ...&lt;br&gt;
        return image&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With &lt;code&gt;return_tensor=True&lt;/code&gt;, the data stays on the GPU, avoiding unnecessary data transfer.&lt;/p&gt;
&lt;h4&gt;3.2.3 Refactoring the Real-Time Inference Process&lt;/h4&gt;
&lt;p&gt;In &lt;code&gt;scripts/realtime_inference.py&lt;/code&gt;, I refactored the &lt;code&gt;process_frames()&lt;/code&gt; method to add a GPU processing path:&lt;/p&gt;
&lt;p&gt;Key changes:&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Image Resize Optimization&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Original: CPU-based processing&lt;br&gt;
res_frame = cv2.resize(&lt;br&gt;
    res_frame.astype(np.uint8),&lt;br&gt;
    (x2 - x1, y2 - y1)&lt;br&gt;
)&lt;br&gt;
&lt;br&gt;
# Optimized: GPU-based processing&lt;br&gt;
res_frame_gpu = gpu_resize(&lt;br&gt;
    res_frame,&lt;br&gt;
    (y2 - y1, x2 - x1),&lt;br&gt;
    mode='bilinear'&lt;br&gt;
)&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;▪ &lt;strong&gt;Image Sharpening Optimization&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Original: CPU-based processing (OpenCV + NumPy)&lt;br&gt;
res_frame = apply_unsharp_mask(&lt;br&gt;
    res_frame,&lt;br&gt;
    amount=1.2,&lt;br&gt;
    sigma=1.0,&lt;br&gt;
    threshold=5.0&lt;br&gt;
)&lt;br&gt;
&lt;br&gt;
# Optimized: GPU-based processing&lt;br&gt;
res_frame_gpu = gpu_unsharp_mask(&lt;br&gt;
    res_frame_gpu,&lt;br&gt;
    amount=1.2,&lt;br&gt;
    sigma=1.0,&lt;br&gt;
    threshold=5.0&lt;br&gt;
)&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;▪ &lt;strong&gt;Image Blending Optimization&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Original: CPU-based processing (PIL)&lt;br&gt;
combine_frame = get_image_blending(&lt;br&gt;
    ori_frame,&lt;br&gt;
    res_frame,&lt;br&gt;
    bbox,&lt;br&gt;
    mask,&lt;br&gt;
    mask_crop_box&lt;br&gt;
)&lt;br&gt;
&lt;br&gt;
# Optimized: GPU-based processing&lt;br&gt;
body_tensor = numpy_to_tensor_gpu(ori_frame, device)&lt;br&gt;
face_tensor = res_frame_gpu  # Already on GPU&lt;br&gt;
mask_tensor = numpy_to_tensor_gpu(mask, device)&lt;br&gt;
&lt;br&gt;
combine_frame_tensor = gpu_image_blending(&lt;br&gt;
    body_tensor,&lt;br&gt;
    face_tensor,&lt;br&gt;
    bbox,&lt;br&gt;
    mask_tensor,&lt;br&gt;
    mask_crop_box,&lt;br&gt;
    device&lt;br&gt;
)&lt;br&gt;
&lt;br&gt;
combine_frame = tensor_to_numpy_cpu(combine_frame_tensor)&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The entire process uses an automatic fallback mechanism: if GPU processing fails, it falls back to CPU processing to ensure system stability.&lt;/p&gt;
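&lt;p&gt;That fallback mechanism can be sketched as a small wrapper: try the GPU routine, and on any failure run the CPU implementation instead. The simulated functions below are illustrative stand-ins, not the project’s actual code:&lt;/p&gt;

```python
def with_cpu_fallback(gpu_fn, cpu_fn):
    """Wrap a GPU routine so any failure (e.g. CUDA out-of-memory)
    transparently falls back to the CPU implementation."""
    def wrapped(*args, **kwargs):
        try:
            return gpu_fn(*args, **kwargs)
        except Exception:
            return cpu_fn(*args, **kwargs)
    return wrapped

# Simulated pair: the "GPU" path fails, the CPU path succeeds
def gpu_resize_sim(frame):
    raise RuntimeError("CUDA out of memory")

def cpu_resize_sim(frame):
    return ("cpu", frame)

resize = with_cpu_fallback(gpu_resize_sim, cpu_resize_sim)
result = resize("frame-0")
```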
&lt;h3&gt;3.3 Performance Improvement Results&lt;/h3&gt;
&lt;p&gt;After optimization, we tested the system in an AMD EPYC 7J13 processor + A100 GPU environment:&lt;/p&gt;
&lt;h4&gt;3.3.1 Performance Improvement Data&lt;/h4&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;CPU Time&lt;/th&gt;
&lt;th&gt;GPU Time&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Image Resize&lt;/td&gt;
&lt;td&gt;5–10 ms&lt;/td&gt;
&lt;td&gt;1–2 ms&lt;/td&gt;
&lt;td&gt;5–10x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image Sharpening&lt;/td&gt;
&lt;td&gt;8–15 ms&lt;/td&gt;
&lt;td&gt;2–4 ms&lt;/td&gt;
&lt;td&gt;3–5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image Blending&lt;/td&gt;
&lt;td&gt;10–20 ms&lt;/td&gt;
&lt;td&gt;3–5 ms&lt;/td&gt;
&lt;td&gt;3–5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VAE Decoding (No Transfer)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Saves transfer time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;3.3.2 Overall Effect&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Before Optimization:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;▪ 0.5-second audio input required more than 0.5 seconds of processing time.&lt;/p&gt;
&lt;p&gt;▪ Did not meet real-time requirements.&lt;/p&gt;
&lt;p&gt;▪ FPS needed to be reduced to below 15 to barely achieve real-time performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;After Optimization:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Image processing speed improved by 3–5 times.&lt;/li&gt;
&lt;li&gt;End-to-end latency was kept under 200 ms.&lt;/li&gt;
&lt;li&gt;Real-time response was achieved, significantly improving the user experience.&lt;/li&gt;
&lt;/ul&gt;
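&lt;p&gt;The real-time condition behind these numbers can be sanity-checked with a quick calculation. In the sketch below, the 25 fps target and the 120 ms pre-optimization frame time are illustrative assumptions; only the 3–5× speedup figure comes from the measurements above.&lt;/p&gt;

```javascript
// Per-frame budget in milliseconds for a given frame rate:
// real time requires the per-frame processing time to stay within this budget.
function frameBudgetMs(fps) {
  return 1000 / fps;
}

// Frame time after applying a speedup factor.
function optimizedFrameMs(frameMs, speedup) {
  return frameMs / speedup;
}

// At 25 fps the budget is 40 ms per frame. A hypothetical 120 ms frame misses
// it, but a 3x-5x speedup brings it down to 24-40 ms, which fits the budget.
const budget = frameBudgetMs(25);        // 40 ms
const worst = optimizedFrameMs(120, 3);  // 40 ms
const best = optimizedFrameMs(120, 5);   // 24 ms
const meetsRealTime = !(worst > budget); // even the worst case fits
```

&lt;p&gt;The same arithmetic explains the earlier workaround: lowering FPS enlarges the per-frame budget, which is why reducing FPS below 15 could "barely" reach real time before the optimization.&lt;/p&gt;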
&lt;h3&gt;3.4 Why Is GPU Acceleration Effective?&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Flexible Computing Precision&lt;/strong&gt;: GPUs support float32 and half precision (float16), allowing a flexible balance between precision and speed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parallel Computing Advantage&lt;/strong&gt;: Image processing tasks (such as resizing, convolution, and blending) are inherently parallel, and GPUs, with their thousands of cores, are well suited to them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory Bandwidth&lt;/strong&gt;: GPU memory bandwidth is far higher than the bandwidth between the CPU and main memory, so once data resides on the GPU, data movement stops being the bottleneck.&lt;/li&gt;
&lt;/ol&gt;
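&lt;p&gt;Point 1 can be illustrated in plain JavaScript: &lt;code&gt;Math.fround&lt;/code&gt; rounds a float64 value to the nearest float32, showing the kind of representation error that trading precision for speed introduces. GPUs push the same tradeoff one step further with float16; this snippet is an analogy, not GPU code.&lt;/p&gt;

```javascript
// Precision/speed tradeoff in miniature: float64 vs float32.
const full = 0.1;                 // float64 (JavaScript's native number)
const single = Math.fround(0.1);  // nearest representable float32 value

// The rounded value is close to, but not exactly, 0.1 -- lower precision
// buys smaller and faster arithmetic at the cost of a tiny rounding error.
const error = Math.abs(single - full);
```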
&lt;h2&gt;4. MuseTalk Docker Deployment Record&lt;/h2&gt;
&lt;h3&gt;&lt;strong&gt;4.1 Build and Push Image&lt;/strong&gt;&lt;/h3&gt;
&lt;h4&gt;4.1.1 Rebuild Image&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;docker build -t xxx/musetalk:latest .&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;4.1.2 Push New Image to Docker Hub&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;docker push xxx/musetalk:latest&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: You need to log in to Docker Hub before pushing:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docker login&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;&lt;strong&gt;4.2 Remove and Pull Image&lt;/strong&gt;&lt;/h3&gt;
&lt;h4&gt;4.2.1 Stop and Remove Old Container&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;sudo docker rm -f musetalk&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;4.2.2 Pull Latest Image&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;sudo docker pull xxx/musetalk:latest&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4.3 Run Container&lt;/h3&gt;
&lt;h4&gt;4.3.1 Start New Container&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;sudo docker run -d \&lt;br&gt;
  --name musetalk \&lt;br&gt;
  --gpus all \&lt;br&gt;
  --restart unless-stopped \&lt;br&gt;
  -p 2160:2160 \&lt;br&gt;
  xxx/musetalk:latest&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Explanation of Parameters:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;-d&lt;/strong&gt;: Run in detached mode (background).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;--name musetalk&lt;/strong&gt;: The container name.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;--gpus all&lt;/strong&gt;: Use all available GPUs (requires &lt;code&gt;nvidia-container-toolkit&lt;/code&gt; on the host).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;--restart unless-stopped&lt;/strong&gt;: Automatically restart the container unless it is manually stopped.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;-p 2160:2160&lt;/strong&gt;: Port mapping (host port:container port).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; On the first run, it will automatically download models from HuggingFace to the &lt;code&gt;/workspace/models&lt;/code&gt; directory inside the container.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;4.4 View Logs and Debug&lt;/strong&gt;&lt;/h3&gt;
&lt;h4&gt;4.4.1 Real-Time Logs&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;sudo docker logs -f musetalk&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;4.4.2 Check Container Status&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;sudo docker ps&lt;br&gt;
sudo docker ps -a&lt;br&gt;
sudo docker stats musetalk&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4.5 Container Operations&lt;/h3&gt;
&lt;h4&gt;4.5.1 Enter Container&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;sudo docker exec -it musetalk /bin/bash&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: The &lt;strong&gt;-it&lt;/strong&gt; parameter specifies interactive mode, and &lt;strong&gt;/bin/bash&lt;/strong&gt; is the command executed to enter the container.&lt;/p&gt;
&lt;h4&gt;4.5.2 Fix CRLF Issue in Filenames&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;# Enter the container&lt;br&gt;
sudo docker exec -it musetalk /bin/bash&lt;br&gt;
&lt;br&gt;
# Navigate to the target directory&lt;br&gt;
cd /workspace&lt;br&gt;
&lt;br&gt;
# One-time fix for all filenames ending in a carriage return&lt;br&gt;
for f in *$'\r'; do mv "$f" "${f%$'\r'}"; done&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;4.5.3 Create Directories and Copy Files&lt;/h4&gt;
&lt;p&gt;To stage the avatar assets, create the target directory and copy the generated files inside the container:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create the target directory&lt;br&gt;
mkdir -p /workspace/silent/sk_navtalk_xxx/girl&lt;br&gt;
&lt;br&gt;
# Copy the avatars directory&lt;br&gt;
cp -r /workspace/results/sk_navtalk_xxx/v15/avatars \&lt;br&gt;
      /workspace/silent/sk_navtalk_xxx/&lt;br&gt;
&lt;br&gt;
# Copy all files from the full_imgs folder&lt;br&gt;
cp -r /workspace/results/sk_navtalk_xxx/v15/avatars/girl/full_imgs/* \&lt;br&gt;
      /workspace/silent/sk_navtalk_xxx/girl/&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4.6 Analyze GPU Usage&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;nvidia-smi -l 1   # refresh GPU statistics every second&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The post &lt;a href="https://frankfu.blog/openai/navtalk-update-revolutionary-200ms-response-time-for-real-time-digital-human-experience/" rel="noopener noreferrer"&gt;NavTalk Update: Revolutionary 200ms Response Time for Real-Time Digital Human Experience!&lt;/a&gt; appeared first on &lt;a href="https://frankfu.blog" rel="noopener noreferrer"&gt;Frank Fu's Blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>openai</category>
    </item>
    <item>
      <title>NavTalk Product Update: Five Core Features Comprehensive Upgrade</title>
      <dc:creator>Frank Fu</dc:creator>
      <pubDate>Mon, 30 Mar 2026 08:52:09 +0000</pubDate>
      <link>https://dev.to/frankfu/navtalk-product-update-five-core-features-comprehensive-upgrade-3el2</link>
      <guid>https://dev.to/frankfu/navtalk-product-update-five-core-features-comprehensive-upgrade-3el2</guid>
      <description>&lt;blockquote&gt;&lt;p&gt;&lt;strong&gt;Major Update&lt;/strong&gt;: This update covers five functional modules: real-time communication, Avatar management, data reporting, API integration, and account security, while also announcing the next development plan. Notably, we have optimized the digital human response latency to approximately &lt;strong&gt;200ms&lt;/strong&gt;, achieving industry-leading levels and providing users with a smooth experience close to real human conversation.&lt;/p&gt;&lt;/blockquote&gt;

&lt;h2&gt;1. Module One: Real-Time Communication Feature Optimization&lt;/h2&gt;
&lt;p&gt;In this update, we have comprehensively optimized the real-time communication features, focusing on &lt;strong&gt;response speed improvement&lt;/strong&gt;, &lt;strong&gt;simplified integration process&lt;/strong&gt;, and &lt;strong&gt;enhanced connection stability&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;1.1 Digital Human Response Speed Optimization&lt;/h3&gt;
&lt;p&gt;Through deep optimization of the model and full-link performance tuning, we have elevated the real-time digital human response speed to &lt;strong&gt;industry-leading levels&lt;/strong&gt;. This breakthrough performance improvement has brought NavTalk to new heights in real-time interaction experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response Latency Breakthrough&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In real-time conversation scenarios, response latency is a key indicator affecting user experience. Through continuous technical optimization, we have successfully controlled the end-to-end response latency to approximately &lt;strong&gt;200ms&lt;/strong&gt;. This means that the complete process from when users finish speaking to hearing the AI digital human’s reply has almost reached the fluency of natural human conversation.&lt;/p&gt;
&lt;p&gt;This performance level is leading among all real-time digital human systems. Traditional real-time conversation systems typically require 500ms to 1000ms or even longer response times, while NavTalk’s 200ms response latency is already close to the fluency of real human conversation, significantly improving user interaction experience. In practical applications, users can hardly feel obvious delays, making the conversation process more natural and smooth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full-Link Technical Optimization&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;To achieve this performance breakthrough, we conducted deep optimization across multiple technical aspects, achieving end-to-end, full-link performance improvement:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Model Inference Optimization&lt;/strong&gt;: We optimized the inference process at multiple levels. By refining the model architecture, removing unnecessary computational steps, and accelerating image processing on the GPU, we significantly reduced inference latency while preserving response quality.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network Transmission Optimization&lt;/strong&gt;: Network transmission is a critical link in a real-time conversation system. We optimized the data transmission pipeline so that data is delivered quickly and stably.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;System Architecture Optimization&lt;/strong&gt;: We also optimized the overall system architecture, improving inter-service communication and resource scheduling, which raised end-to-end performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The combined effect of these technical optimizations has enabled NavTalk to achieve extremely low response latency while maintaining high-quality conversation experience, bringing users a smooth interaction experience close to real human conversation. This performance improvement not only enhances user experience but also provides technical guarantees for more real-time interaction scenarios.&lt;/p&gt;

&lt;h3&gt;1.2 WebRTC Connection Consolidation&lt;/h3&gt;
&lt;h4&gt;1.2.1 &lt;strong&gt;Previous Architecture Issues&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Before optimization, developers needed to connect to two independent WebSocket services to complete real-time communication functionality:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Communication WebSocket&lt;/strong&gt;: &lt;code&gt;wss://transfer.navtalk.ai/api/realtime-api&lt;/code&gt;, used for processing real-time conversation messages and business logic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Video Stream Interface&lt;/strong&gt;: &lt;code&gt;wss://transfer.navtalk.ai/api/webrtc&lt;/code&gt;, used for establishing WebRTC connections to obtain video streams.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Although this dual-connection architecture was functionally complete, it brought many inconveniences. Developers needed to maintain the state of two connections simultaneously, handle connection establishment, reconnection, error handling, and other logic for both connections, increasing code complexity and maintenance costs. Additionally, state synchronization between the two connections was also a challenge, prone to connection state inconsistency issues.&lt;/p&gt;
&lt;h4&gt;1.2.2 &lt;strong&gt;Unified Connection Architecture&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Now, we have merged these two services into a unified connection address: &lt;code&gt;wss://transfer.navtalk.ai/wss/v2/realtime-chat&lt;/code&gt;. Through this single connection, developers can complete all real-time communication-related operations, including message transmission and video stream acquisition.&lt;/p&gt;
&lt;p&gt;This architecture optimization brings significant advantages in multiple aspects:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Simplified Connection Management&lt;/strong&gt;: Developers maintain only one WebSocket connection, so there is no state synchronization between two connections to worry about. Connection establishment, reconnection, and error handling are unified in one place, reducing code volume and the risk of bugs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Development Efficiency&lt;/strong&gt;: From the unified connection, developers can directly obtain a &lt;code&gt;sessionId&lt;/code&gt; and use it to establish the WebRTC connection for the video stream, without extra requests or coordination logic. The whole flow is more intuitive, so integration work finishes faster.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduced Maintenance Costs&lt;/strong&gt;: The simplified architecture lowers both development and maintenance costs: code is more concise, troubleshooting is easier, and upgrades are more convenient, which matters for long-term iteration.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This architecture optimization not only simplifies developers’ work but also improves system stability and performance, laying a solid foundation for NavTalk’s further development.&lt;/p&gt;
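&lt;p&gt;As a rough sketch of the unified flow, the snippet below builds the single v2 connection URL and pulls a &lt;code&gt;sessionId&lt;/code&gt; out of an incoming message. The message shape carrying the &lt;code&gt;sessionId&lt;/code&gt; is an assumption for illustration only; consult the API documentation for the actual event format.&lt;/p&gt;

```javascript
// Build the single v2 connection URL from its query parameters.
function buildChatUrl(license, avatarName) {
  const params = new URLSearchParams({ license: license, name: avatarName });
  return 'wss://transfer.navtalk.ai/wss/v2/realtime-chat?' + params.toString();
}

// Extract a sessionId from an incoming JSON message, if present.
// NOTE: the field name `sessionId` here is an assumed shape for illustration.
function extractSessionId(message) {
  const event = JSON.parse(message);
  return event.sessionId || null;
}

// Usage (browser): one WebSocket carries both conversation messages and the
// information needed to set up the WebRTC video stream.
// const ws = new WebSocket(buildChatUrl('YOUR_LICENSE', 'avatar_name'));
// ws.onmessage = (e) => {
//   const sessionId = extractSessionId(e.data);
//   if (sessionId) { /* create the RTCPeerConnection using this sessionId */ }
// };
```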

&lt;h3&gt;1.3 Intelligent Parameter Configuration&lt;/h3&gt;
&lt;p&gt;To simplify the developer experience, we have designed intelligent optimization for connection parameters.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Required Parameters&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;license&lt;/code&gt;: Authorization code, used for identity verification and authorization management.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;name&lt;/code&gt;: Avatar name, specifying the digital human character to use. This is the core parameter of the connection; the system loads the corresponding configuration and resources based on it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Optional Parameters&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;model&lt;/code&gt;: Specifies the language model to use. If not specified, the system uses the default &lt;code&gt;gpt-realtime-mini&lt;/code&gt;. Developers can choose a more capable model for demanding scenarios, or a lightweight model to reduce cost.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Default Value Mechanism&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We have introduced a default value mechanism to make the connection process more convenient and flexible. When you only specify the &lt;code&gt;name&lt;/code&gt; parameter (Avatar name) without other optional parameters, the system will automatically use the default &lt;code&gt;model&lt;/code&gt; and &lt;code&gt;voice&lt;/code&gt; configured for that Avatar.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Usage Examples&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The following code examples demonstrate two connection methods:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Method 1: Full parameter connection&lt;br&gt;
// Suitable when all parameters must be specified explicitly, such as temporarily using a different model&lt;br&gt;
const wsFull = new WebSocket('wss://transfer.navtalk.ai/wss/v2/realtime-chat?license=YOUR_LICENSE&amp;amp;name=avatar_name&amp;amp;model=gpt-realtime-mini');&lt;br&gt;
&lt;br&gt;
// Method 2: Only required parameters, using the Avatar's default configuration (recommended)&lt;br&gt;
// Suitable for most scenarios&lt;br&gt;
const wsDefault = new WebSocket('wss://transfer.navtalk.ai/wss/v2/realtime-chat?license=YOUR_LICENSE&amp;amp;name=avatar_name');&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Through this intelligent parameter configuration mechanism, we ensure both functional completeness and flexibility while greatly simplifying the developer experience, making NavTalk integration simpler and more efficient.&lt;/p&gt;
&lt;h3&gt;1.4 Message Format Optimization&lt;/h3&gt;
&lt;p&gt;To provide a clearer and more unified message interaction experience, we have unified the encapsulation of all message return formats. This optimization prepares for the integration of ElevenLabs while integrating OpenAI Realtime API.&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;For more detailed information, please refer to the &lt;a href="https://docs.navtalk.ai" rel="noopener noreferrer"&gt;API Documentation&lt;/a&gt; to learn about message format specifications and usage examples.&lt;/p&gt;&lt;/blockquote&gt;
&lt;h2&gt;2. Module Two: Avatar Management Features&lt;/h2&gt;
&lt;p&gt;The introduction of Avatar sharing and import features makes collaboration between users more convenient. Now, you can easily share your carefully configured Avatar with others, or quickly import Avatars shared by others.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;2.1 Sharing Feature&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The sharing feature supports one-click generation of sharing links or sharing codes, allowing you to quickly share your carefully configured Avatar. The shared Avatar contains complete configuration information (model, voice, appearance, and all other settings), ensuring that recipients can get an experience completely consistent with the original Avatar.&lt;/p&gt;
&lt;h3&gt;2.2 Import Feature&lt;/h3&gt;
&lt;p&gt;The import feature supports quickly importing Avatars shared by others through sharing links or sharing codes. Imported Avatars can be used directly without reconfiguration, and the system will automatically apply all configuration information. The system will automatically synchronize Avatar configuration information to ensure that the imported Avatar configuration remains consistent with the original Avatar.&lt;/p&gt;
&lt;p&gt;These features not only promote communication and cooperation between users but also enhance the scalability and shareability of Avatars.&lt;/p&gt;
&lt;h2&gt;3. Module Three: Data Reporting Features&lt;/h2&gt;
&lt;p&gt;To help users better manage and analyze business data, we have added powerful report export features. These features allow you to easily export and analyze business data, meeting the data analysis needs of different scenarios.&lt;/p&gt;
&lt;h3&gt;3.1 Conversation Record Report&lt;/h3&gt;
&lt;p&gt;The conversation record report feature allows you to export the complete conversation history between users and Avatars, providing strong support for data analysis and business decision-making.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Export the complete conversation history between users and Avatars, including all conversation content.&lt;/li&gt;
&lt;li&gt;Filter by time range to flexibly select the period to export.&lt;/li&gt;
&lt;li&gt;Include conversation content, timestamps, and other fields, ensuring data integrity.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3.2 Recharge Record Report&lt;/h3&gt;
&lt;p&gt;The recharge record report focuses on exporting account recharge details, providing support for financial management and data analysis.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Export account recharge details, including recharge amount, time, and other information.&lt;/li&gt;
&lt;li&gt;Filter by user, time range, and other conditions to query exactly the data you need.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;4. Module Four: API Integration Features&lt;/h2&gt;
&lt;p&gt;To meet the needs of enterprise-level applications and third-party system integration, we have added conversation record query API and Webhook message notification features. These two features provide different data acquisition methods to meet integration needs in different scenarios.&lt;/p&gt;
&lt;h3&gt;4.1 Conversation Record Query API&lt;/h3&gt;
&lt;p&gt;The conversation record query API allows you to actively query conversation records through the API, supporting flexible query conditions and data formats.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;: Call the API through HTTP requests, pass in query parameters, and the system returns conversation records that meet the conditions.&lt;/p&gt;
&lt;h3&gt;4.2 Webhook Message Notification&lt;/h3&gt;
&lt;p&gt;The Webhook message notification feature automatically sends callback events of conversation records to your configured Webhook address after each call is completed, achieving passive data reception.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;: After configuring the Webhook address and trigger conditions, the system will automatically send callback requests to your server after each call is completed, containing complete conversation record data.&lt;/p&gt;
&lt;h2&gt;5. Module Five: Account Security Features&lt;/h2&gt;
&lt;p&gt;Account security has always been our focus. In this update, we have optimized the login logic to improve account security and user experience.&lt;/p&gt;
&lt;p&gt;We have optimized login-related security mechanisms, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Optimized Verification Code Mechanism&lt;/strong&gt;: Improved the generation and verification process of verification codes to enhance security.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secure Email Verification&lt;/strong&gt;: Verification codes are delivered to the registered email address to protect the account.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Through these optimizations, we have further improved account security while maintaining a good user experience. We are committed to providing you with the most secure account protection, ensuring the security of your data and privacy.&lt;/p&gt;
&lt;h2&gt;6. Next Development Plan&lt;/h2&gt;
&lt;h3&gt;6.1 ElevenLabs Integration&lt;/h3&gt;
&lt;p&gt;We will integrate ElevenLabs to bring you more powerful voice and model capabilities.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Voice Support&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Integrate ElevenLabs’ rich voice library&lt;/li&gt;
&lt;li&gt;Support uploading and training your own exclusive voices&lt;/li&gt;
&lt;li&gt;Provide more flexible voice configuration and management features&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Model Support&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Support multiple large language models such as OpenAI, Gemini, Claude&lt;/li&gt;
&lt;li&gt;Support connecting to your own model services&lt;/li&gt;
&lt;li&gt;Flexibly switch between different models to meet different scenario needs&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;&lt;p&gt;For detailed model support list, please refer to &lt;a href="https://demo.navtalk.ai/11labs/en/readme.html#3-llm" rel="noopener noreferrer"&gt;ElevenLabs WebSocket Real-Time Conversation Demo&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Intelligent Knowledge Base Management&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Implemented through RAG (Retrieval-Augmented Generation) technology:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Support retrieving your enterprise or personal knowledge base&lt;/li&gt;
&lt;li&gt;Upload, manage, and update knowledge base content&lt;/li&gt;
&lt;li&gt;Automatically retrieve relevant knowledge to improve answer accuracy&lt;/li&gt;
&lt;li&gt;Provide personalized answers based on your knowledge base&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Configuration and Pricing&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;More flexible and controllable model and voice combination configuration&lt;/li&gt;
&lt;li&gt;Transparent pricing strategy&lt;/li&gt;
&lt;li&gt;Choose services on demand, select optimal configuration based on usage scenarios&lt;/li&gt;
&lt;li&gt;Achieve cost optimization&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;6.2 Multi-Avatar Generation Model Integration&lt;/h3&gt;
&lt;p&gt;We are researching the possibility of integrating multiple Avatar generation models to provide richer digital human images and expressiveness.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Feature Planning&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Support integrating different digital human generation models&lt;/li&gt;
&lt;li&gt;Support switching between different models&lt;/li&gt;
&lt;li&gt;Optimize multi-model operation efficiency&lt;/li&gt;
&lt;li&gt;Provide higher quality digital human generation effects&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Expected Results&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Richer Avatar choices&lt;/li&gt;
&lt;li&gt;Higher quality image generation&lt;/li&gt;
&lt;li&gt;More flexible technical solutions&lt;/li&gt;
&lt;li&gt;Meet different scenario needs&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;6.3 Localized Deployment Support&lt;/h3&gt;
&lt;p&gt;We are developing a localized deployment solution that allows you to run the entire NavTalk project on your own GPU server.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Core Features&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Complete deployment with fully localized data&lt;/li&gt;
&lt;li&gt;Meet data security requirements&lt;/li&gt;
&lt;li&gt;Support enterprise private deployment needs&lt;/li&gt;
&lt;li&gt;Optimize based on your hardware configuration&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Applicable Scenarios&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Enterprise private deployment&lt;/li&gt;
&lt;li&gt;Scenarios with high data security requirements&lt;/li&gt;
&lt;li&gt;Large-scale deployment cost optimization&lt;/li&gt;
&lt;li&gt;Customization needs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Service Support&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Complete deployment documentation and tools&lt;/li&gt;
&lt;li&gt;Automated deployment scripts&lt;/li&gt;
&lt;li&gt;Technical support and services&lt;/li&gt;
&lt;li&gt;Continuous updates and maintenance&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;7. Update Summary&lt;/h2&gt;
&lt;p&gt;This NavTalk product update has comprehensively optimized according to functional modules, covering five core modules: &lt;strong&gt;real-time communication, Avatar management, data reporting, API integration, and account security&lt;/strong&gt;. Among them, the real-time communication feature has achieved a major breakthrough in response speed optimization, optimizing digital human response latency to approximately &lt;strong&gt;200ms&lt;/strong&gt;, reaching industry-leading levels. These updates will further improve NavTalk’s user experience and functional completeness, providing individual users and enterprise customers with a more powerful and easier-to-use AI virtual human interaction platform.&lt;/p&gt;
&lt;p&gt;At the same time, we are actively promoting development plans such as &lt;strong&gt;ElevenLabs integration, performance optimization, multi-model support, and localized deployment&lt;/strong&gt; to bring more powerful capabilities to NavTalk. These plans will enable NavTalk to reach new heights in voice selection, model support, knowledge base management, performance, and deployment flexibility.&lt;/p&gt;
&lt;p&gt;The post &lt;a href="https://frankfu.blog/openai/navtalk-product-update-five-core-features-comprehensive-upgrade/" rel="noopener noreferrer"&gt;NavTalk Product Update: Five Core Features Comprehensive Upgrade&lt;/a&gt; appeared first on &lt;a href="https://frankfu.blog" rel="noopener noreferrer"&gt;Frank Fu's Blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>openai</category>
    </item>
    <item>
      <title>Complete Guide to Deploying MIT Mini Cheetah on D-Robotics RDK S100</title>
      <dc:creator>Frank Fu</dc:creator>
      <pubDate>Mon, 30 Mar 2026 08:51:33 +0000</pubDate>
      <link>https://dev.to/frankfu/complete-guide-to-deploying-mit-mini-cheetah-on-d-robotics-rdk-s100-4aa7</link>
      <guid>https://dev.to/frankfu/complete-guide-to-deploying-mit-mini-cheetah-on-d-robotics-rdk-s100-4aa7</guid>
      <description>&lt;p&gt;This document aims to systematically analyze the technical architecture and implementation details of the MIT Mini Cheetah robot control system, and provide detailed instructions on how to complete deployment on the D-Robotics RDK S100 development board. The content is based on publicly available materials combined with actual deployment experience, and is intended to provide complete deployment references and technical guidance for relevant technical developers.&lt;/p&gt;
&lt;h2&gt;1. Introduction to mbedOS&lt;/h2&gt;
&lt;p&gt;Developers who first encounter the MIT Cheetah project may notice that the code repository on GitHub is relatively small, and the compilation method differs from conventional projects. This is mainly because the project uses &lt;strong&gt;mbedOS&lt;/strong&gt; as the underlying development framework.&lt;/p&gt;
&lt;p&gt;As a result, the MIT Cheetah hardware modules contain relatively little code of their own. The SPIne module, for example, focuses mainly on data-exchange logic, while the low-level hardware drivers and other basic functions are supplied by mbedOS.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;mbedOS&lt;/strong&gt; is a complete software solution developed by ARM for IoT applications, and is an embedded open-source ecosystem for ARM Cortex-M series processors. For more information, please visit the &lt;a href="https://www.mbed.com/en/platform/mbed-os/" rel="noopener noreferrer"&gt;mbedOS official website&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;1.1 SPI Interface Initialization Example&lt;/h3&gt;
&lt;p&gt;The following example shows how to initialize the SPI interface in the SPIne module:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;void init_spi(void){
    SPISlave *spi = new SPISlave(PA_7, PA_6, PA_5, PA_4);
    spi-&amp;gt;format(16, 0);         // 16-bit frames, SPI mode 0
    spi-&amp;gt;frequency(12000000);   // 12 MHz
    spi-&amp;gt;reply(0x0);            // preload the first reply word
    cs.fall(&amp;amp;spi_isr);           // cs: InterruptIn on the chip-select pin, declared elsewhere
    printf("done\r\n");
}&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;1.2 CAN Bus Communication Example&lt;/h3&gt;
&lt;p&gt;The following is a typical application example of CAN bus communication:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include "mbed.h"

DigitalOut myled(D8);
CAN can1(PD_0, PD_1, 500000);   // RX pin, TX pin, 500 kbit/s

int main() {
    CANMessage msg;
    while(1) {
        if(can1.read(msg)) {
            printf("Message received: id=%d, type=%d, data[0]=%d\n", msg.id, msg.type, msg.data[0]);
            myled = !myled;     // toggle the LED on each received frame
        }
    }
}&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;2. MIT Cheetah Open Source Resources&lt;/h2&gt;
&lt;p&gt;The following are open source resource links related to the MIT Cheetah project:&lt;/p&gt;
&lt;h3&gt;2.1 Hardware Related&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Motor Controller Hardware&lt;/strong&gt;: &lt;a href="https://github.com/bgkatz/3phase_integrated" rel="noopener noreferrer"&gt;https://github.com/bgkatz/3phase_integrated&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SPIne Hardware&lt;/strong&gt;: &lt;a href="https://github.com/bgkatz/SPIne" rel="noopener noreferrer"&gt;https://github.com/bgkatz/SPIne&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2.2 Software Related&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Motor Controller Software&lt;/strong&gt;: &lt;a href="https://os.mbed.com/users/benkatz/code/Hobbyking_Cheetah_Compact_DRV8323/" rel="noopener noreferrer"&gt;https://os.mbed.com/users/benkatz/code/Hobbyking_Cheetah_Compact_DRV8323/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SPIne Software&lt;/strong&gt;: &lt;a href="https://os.mbed.com/users/benkatz/code/SPIne/" rel="noopener noreferrer"&gt;https://os.mbed.com/users/benkatz/code/SPIne/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Linux Control Code (Cheetah Mini)&lt;/strong&gt;: &lt;a href="https://github.com/mit-biomimetics/Cheetah-Software" rel="noopener noreferrer"&gt;https://github.com/mit-biomimetics/Cheetah-Software&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;3. MIT Mini Cheetah Robot System&lt;/h2&gt;
&lt;h3&gt;3.1 Simulation Environment Configuration and Usage&lt;/h3&gt;
&lt;p&gt;After compilation is complete, you need to configure simulation environment parameters. Navigate to the &lt;code&gt;config&lt;/code&gt; directory under the MIT main folder, open the &lt;code&gt;mini-cheetah-defaults.yaml&lt;/code&gt; file, set &lt;code&gt;control_mode&lt;/code&gt; and &lt;code&gt;cheater_mode&lt;/code&gt; to 1, and set &lt;code&gt;use_rc&lt;/code&gt; to 0. Save and exit after configuration.&lt;/p&gt;
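&lt;p&gt;For reference, after editing, the relevant entries in &lt;code&gt;mini-cheetah-defaults.yaml&lt;/code&gt; should read as follows (key names follow the text above; all other keys in the file are left unchanged):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;control_mode: 1
cheater_mode: 1
use_rc: 0&lt;/code&gt;&lt;/pre&gt;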
&lt;p&gt;Next, start the robot simulation environment. It is recommended to connect a gamepad before starting (optional, for subsequent control). Navigate to the &lt;code&gt;build&lt;/code&gt; directory under the MIT main folder (&lt;strong&gt;Note&lt;/strong&gt;: Directly entering the &lt;code&gt;sim&lt;/code&gt; subdirectory may prevent the simulation from starting, so you need to execute from the &lt;code&gt;build&lt;/code&gt; directory), right-click on a blank area and select “Open in Terminal”, then execute the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./sim/sim&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After execution, the robot simulation control interface will be displayed.&lt;/p&gt;
&lt;p&gt;In the control interface, click “Mini Cheetah” and “Simulator” in sequence, then click the “Start” button to launch the robot simulation interface.&lt;/p&gt;
&lt;p&gt;Next, start the robot controller. Navigate to the &lt;code&gt;build/user/MIT_Controller&lt;/code&gt; directory under the MIT main folder, right-click on a blank area and select “Open in Terminal”, then execute the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./mit_ctrl m s&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, &lt;code&gt;mit_ctrl&lt;/code&gt; is the compiled executable file, parameter &lt;code&gt;m&lt;/code&gt; represents the mini cheetah model, and parameter &lt;code&gt;s&lt;/code&gt; represents simulate (simulation mode). After execution, the robot in the simulation should be able to stand up. At this point, switch to the simulation control interface and change the &lt;code&gt;control_mode&lt;/code&gt; value to 4. You can observe the robot in the simulation switching to trot (trotting gait).&lt;/p&gt;
&lt;p&gt;At this point, you can control the robot’s movement speed using the gamepad joystick. Readers can explore different control modes on their own. The following is the implementation method for backflip operation:&lt;/p&gt;
&lt;p&gt;1. Change the &lt;code&gt;control_mode&lt;/code&gt; value in the simulation control interface to 3, and the robot will enter a standing state&lt;/p&gt;
&lt;p&gt;2. Change the &lt;code&gt;control_mode&lt;/code&gt; value to 9, and the robot will perform a backflip action&lt;/p&gt;
&lt;p&gt;3. After the backflip is complete, change the &lt;code&gt;control_mode&lt;/code&gt; value to 3 again, then to 9 to repeat the backflip&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: If the robot falls during operation, you can click the “Go Home” button in the simulation control interface to restore the robot to its initial position. If it cannot be restored, you need to restart the simulation and controller.&lt;/p&gt;
&lt;h3&gt;3.2 Combined Use of Real Robot and Simulation&lt;/h3&gt;
&lt;p&gt;When running the real robot, you need to start both the simulation interface and the controller program:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Terminal 1: Start simulation interface
./sim/sim

# Terminal 2: Start controller (real robot mode)
./mit_ctrl m r f&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, parameter &lt;code&gt;r&lt;/code&gt; stands for robot (real-robot mode), and &lt;code&gt;f&lt;/code&gt; is an additional configuration flag.&lt;/p&gt;
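&lt;p&gt;Since both programs must run at the same time, a small launcher script can save opening two terminals. This is a hypothetical convenience sketch, not part of the original project; the binary paths are assumptions and must be adjusted to your build layout:&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical launcher (sketch): starts the simulator and the controller in
# real-robot mode as two background processes. SIM_BIN and CTRL_BIN are
# assumed paths; adjust them to match your build directory.
SIM_BIN=./sim/sim
CTRL_BIN=./user/MIT_Controller/mit_ctrl

missing=0
for bin in "$SIM_BIN" "$CTRL_BIN"; do
  if [ ! -x "$bin" ]; then
    echo "missing executable: $bin"
    missing=1
  fi
done

if [ "$missing" -eq 0 ]; then
  "$SIM_BIN" &                 # simulation interface (Terminal 1)
  "$CTRL_BIN" m r f &          # controller, real-robot mode (Terminal 2)
  echo "launched sim and controller"
  wait                         # stay attached to both processes
fi
```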
&lt;h2&gt;4. RDK S100 Development Board Selection and System Deployment&lt;/h2&gt;
&lt;h3&gt;4.1 Development Board Selection Introduction&lt;/h3&gt;
&lt;p&gt;D-Robotics provides multiple series of development boards, optimized for different application scenarios:&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F12%2F19%2F7s2H8B9xhmLwQA6.png" alt="image.png" width="800" height="506"&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;RDK X3 (Entry-level Edge AI/Vision)&lt;/strong&gt;: Features 5 TOPS computing power, suitable for running common CV models and small robot prototype development.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;RDK X5 (Mid-range Robot/Multi-sensor)&lt;/strong&gt;: 10 TOPS computing power + richer high-speed interfaces (4×USB3, dual MIPI CSI, CAN FD, Wi-Fi 6, PoE), suitable for more complete robot integration and sensor expansion.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;RDK S100 / S100P (High-end “Computing-Control Integration”/Humanoid &amp;amp; Multi-joint Control Scenarios)&lt;/strong&gt;: 80/128 TOPS computing power + stronger CPU (A78AE) + &lt;strong&gt;On-board MCU (Cortex-R52+)&lt;/strong&gt;，emphasizing “perception inference + real-time motion control” collaboration, very suitable for quadruped robots and other applications requiring high real-time performance.&lt;/p&gt;
&lt;p&gt;This document mainly uses the &lt;strong&gt;RDK S100&lt;/strong&gt; development board to deploy the MIT Mini Cheetah program. This development board has powerful computing capabilities and rich interfaces, capable of meeting the real-time control requirements of quadruped robots.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RDK S100 Development Board Interface Description&lt;/strong&gt;:&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F12%2F17%2FUw6gv4La1qmQNHo.png" alt="image.png" width="800" height="340"&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;No.&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;No.&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;J1&lt;/td&gt;
&lt;td&gt;Main board power supply interface&lt;/td&gt;
&lt;td&gt;J22&lt;/td&gt;
&lt;td&gt;MCU domain 16-Pin interface&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;J2&lt;/td&gt;
&lt;td&gt;Main board function connector&lt;/td&gt;
&lt;td&gt;J23&lt;/td&gt;
&lt;td&gt;MCU expansion board 100-Pin interface&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;J3&lt;/td&gt;
&lt;td&gt;RTC battery interface&lt;/td&gt;
&lt;td&gt;J24&lt;/td&gt;
&lt;td&gt;40-Pin interface&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;J8&lt;/td&gt;
&lt;td&gt;Fan control interface&lt;/td&gt;
&lt;td&gt;J25&lt;/td&gt;
&lt;td&gt;Camera expansion board 100-Pin interface&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;J15&lt;/td&gt;
&lt;td&gt;Main domain and MCU domain JTAG interface&lt;/td&gt;
&lt;td&gt;K1&lt;/td&gt;
&lt;td&gt;Reset button&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;J16&lt;/td&gt;
&lt;td&gt;Type-C interface, for flashing, Main domain and MCU domain debugging&lt;/td&gt;
&lt;td&gt;K2&lt;/td&gt;
&lt;td&gt;Sleep button&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;J17&lt;/td&gt;
&lt;td&gt;M.2 Key E interface&lt;/td&gt;
&lt;td&gt;SW1&lt;/td&gt;
&lt;td&gt;Power switch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;J18&lt;/td&gt;
&lt;td&gt;M.2 Key M interface&lt;/td&gt;
&lt;td&gt;SW2&lt;/td&gt;
&lt;td&gt;Flashing mode switch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;J19&amp;amp;J20&lt;/td&gt;
&lt;td&gt;4x USB3.0 Type-A interface&lt;/td&gt;
&lt;td&gt;SW3&amp;amp;SW6&lt;/td&gt;
&lt;td&gt;Pin function switching DIP switches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;J21&lt;/td&gt;
&lt;td&gt;HDMI interface&lt;/td&gt;
&lt;td&gt;U43&amp;amp;U45&lt;/td&gt;
&lt;td&gt;2x Gigabit RJ45 network ports&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;5. RDK S100 System Flashing&lt;/h2&gt;
&lt;p&gt;The RDK S100 kit currently provides an Ubuntu 22.04 system image with a Desktop environment, which is convenient for development and debugging.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Important Note&lt;/strong&gt;: The RDK S100 ships with a test-version system image pre-installed. To ensure you are running the latest system and get optimal performance, it is strongly recommended to flash the latest system image by following this document.&lt;/p&gt;
&lt;p&gt;D-Robotics official website provides detailed system flashing documentation. This document only provides an overview of key steps. For more detailed instructions, please refer to: &lt;a href="https://developer.d-robotics.cc/rdk_doc/rdk_s/Quick_start/install_os/rdk_s100" rel="noopener noreferrer"&gt;D-Robotics Official Documentation&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;5.1 USB Driver Installation&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Download Address&lt;/strong&gt;: &lt;a href="https://archive.d-robotics.cc/downloads/software_tools/winusb_drivers/" rel="noopener noreferrer"&gt;USB Driver Download&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;For Windows operating systems, you need to install the corresponding drivers before using ADB and Fastboot functions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ADB and Fastboot Description&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;As embedded development boards have grown in performance and functionality, modern boards (such as the RDK S100) mainly use ADB and Fastboot for system flashing and debugging, which offer more functionality than traditional serial-port methods:&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;ADB (Android Debug Bridge)&lt;/strong&gt;: Used after the system &lt;strong&gt;has booted&lt;/strong&gt; as a “command channel” between the computer and the development board, supporting file transfer, command execution, and other functions.&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Fastboot&lt;/strong&gt;: A low-level tool used while the system &lt;strong&gt;has not yet booted&lt;/strong&gt;, for &lt;strong&gt;flashing, unlocking, and system recovery&lt;/strong&gt;; it is the key tool for system flashing.&lt;/p&gt;
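&lt;p&gt;Once the driver is installed, you can verify the toolchain from the host. A sketch (the &lt;code&gt;check_flash_tools&lt;/code&gt; function name is illustrative, and device serials in the output depend on your board):&lt;/p&gt;

```shell
# Host-side sanity check for the ADB/Fastboot toolchain (sketch; install the
# Android platform-tools on the host first). Degrades gracefully when a tool
# is absent.
check_flash_tools() {
  for tool in adb fastboot; do
    if command -v "$tool" >/dev/null 2>&1; then
      "$tool" devices || true   # lists boards visible to this tool
    else
      echo "$tool not installed on this host"
    fi
  done
}
check_flash_tools
```

With the board booted, `adb devices` should list it; with the board in flashing mode, `fastboot devices` should.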
&lt;h3&gt;5.2 Complete System Flashing&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Important Configuration&lt;/strong&gt;: Currently, you need to set the SW3 DIP switch to ↑ position to use the onboard eMMC to boot the system. The current version temporarily does not support booting from M.2 NVMe SSD.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;5.2.1 Download Flashing Tools and System Image&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Image Flashing Tool D-Navigation&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Download Address&lt;/strong&gt;: &lt;a href="https://archive.d-robotics.cc/downloads/software_tools/download_tools/" rel="noopener noreferrer"&gt;D-Navigation&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Windows Version&lt;/strong&gt;: Use the &lt;code&gt;D-navigation-win32-x64_v2.4.zip&lt;/code&gt; package&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;System Image Download&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Download Address&lt;/strong&gt;: &lt;a href="https://archive.d-robotics.cc/downloads/os_images/rdk_s100/RDKS100-V4.0.4-Beta/RDK_LNX_SDK/firmwares/" rel="noopener noreferrer"&gt;System Image&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;After extracting the system image, you will get a &lt;code&gt;product&lt;/code&gt; folder. Ensure that this folder contains the &lt;code&gt;img_packages&lt;/code&gt; folder and &lt;code&gt;xmodem_tools&lt;/code&gt; file, with the structure as shown below:&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F12%2F17%2Fbo36LEC1ni5M4cy.png" alt="image.png" width="354" height="194"&gt;
&lt;h4&gt;&lt;strong&gt;5.2.2 U-Boot Flashing Steps&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;This document uses U-Boot mode for system flashing. The specific steps are as follows:&lt;/p&gt;
&lt;p&gt;1. &lt;strong&gt;Development Board Power Preparation&lt;/strong&gt;: Ensure the development board is powered off&lt;/p&gt;
&lt;p&gt;2. &lt;strong&gt;Enter U-Boot Mode&lt;/strong&gt;: Set the SW2 DIP switch to ▽ position to enter U-Boot mode&lt;/p&gt;
&lt;p&gt;3. &lt;strong&gt;Turn on Power&lt;/strong&gt;: Set the SW1 DIP switch to ▽ position to turn on power&lt;/p&gt;
&lt;p&gt;4. &lt;strong&gt;Open D-Navigation Tool&lt;/strong&gt; and complete the following configuration:&lt;/p&gt;
&lt;p&gt;  ▪ Select product model: &lt;strong&gt;S100&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;  ▪ Download mode: &lt;strong&gt;uboot&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;  ▪ Storage medium: &lt;strong&gt;emmc&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;  ▪ Type: &lt;strong&gt;secure&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;  ▪ Click “Browse” and select the &lt;code&gt;product&lt;/code&gt; folder containing the firmware&lt;/p&gt;
&lt;p&gt;  ▪ Select the serial port connected to the RDK S100 and set the baud rate to &lt;strong&gt;921600&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;  ▪ Click “Start Upgrade”&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: During the upgrade process, if you see a ‘Need manual reset’ prompt, please power cycle the development board.&lt;/p&gt;
&lt;img width="800" height="569" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffrankfu.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-14-1024x729.png" alt=""&gt;
&lt;h2&gt;6. RDK S100 System Startup and Network Configuration&lt;/h2&gt;
&lt;h3&gt;6.1 System Startup&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Hardware Connection&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ Connect the development board to a display via an HDMI cable&lt;/p&gt;
&lt;p&gt;▪ Connect to the network via an RJ45 port (if the board has no Wi-Fi card)&lt;/p&gt;
&lt;p&gt;▪ Keep the board powered off and complete all connections before powering on&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;First Boot&lt;/strong&gt;: On first boot the system performs its default environment configuration, which takes about 45 seconds. Once configuration finishes, the Ubuntu desktop appears on the display.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Troubleshooting&lt;/strong&gt;: If the board produces no display output for a long time after power-on (more than 2 minutes), the board has failed to start normally. In that case, connect a serial cable and inspect the boot log to diagnose the problem.&lt;/p&gt;
&lt;p&gt;After the Ubuntu Desktop version system starts, it will output the system desktop on the display through the Display interface, as shown in the figure below:&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F12%2F17%2FJmM7nNKtA4Eqv2Z.png" alt="image.png" width="800" height="457"&gt;
&lt;h3&gt;6.2 Network Configuration&lt;/h3&gt;
&lt;p&gt;Log in to the system through the serial port for network configuration. Serial port login operation is as follows:&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F12%2F17%2FzOtX7p6ELNj2fsc.gif" alt="image-Uart-Login.gif" width="1073" height="621"&gt;
&lt;p&gt;Follow the GIF animation shown above, click OK, enter username: &lt;strong&gt;root&lt;/strong&gt;, password: &lt;strong&gt;root&lt;/strong&gt; to log in to the device.&lt;br&gt;
After logging in, you can use the &lt;code&gt;ifconfig -a&lt;/code&gt; command to query the development board IP address. Among them, &lt;code&gt;eth0&lt;/code&gt;/&lt;code&gt;eth1&lt;/code&gt; represent wired network interfaces, and &lt;code&gt;wlan0&lt;/code&gt; represents wireless network interface:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;root@ubuntu:~# ifconfig -a
eth0: flags=4163&amp;lt;UP,BROADCAST,RUNNING,MULTICAST&amp;gt;  mtu 1500
        inet 192.168.1.93  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 240e:39d:4d4:e2f0:283c:b3ff:fe97:bb72  prefixlen 64  scopeid 0x0&amp;lt;global&amp;gt;
        inet6 fe80::283c:b3ff:fe97:bb72  prefixlen 64  scopeid 0x20&amp;lt;link&amp;gt;
        ether 2a:3c:b3:97:bb:72  txqueuelen 1000  (Ethernet)
        RX packets 38261  bytes 55422230 (55.4 MB)
        RX errors 0  dropped 98  overruns 0  frame 0
        TX packets 21241  bytes 1485148 (1.4 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 95

eth1: flags=4099&amp;lt;UP,BROADCAST,MULTICAST&amp;gt;  mtu 1500
        ether 92:b0:69:58:4e:df  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 96

lo: flags=73&amp;lt;UP,LOOPBACK,RUNNING&amp;gt;  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10&amp;lt;host&amp;gt;
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 145  bytes 13618 (13.6 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 145  bytes 13618 (13.6 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can obtain the router DHCP-assigned IP address through the &lt;code&gt;eth0&lt;/code&gt; interface for subsequent SSH remote connection.&lt;/p&gt;
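&lt;p&gt;If you only need the address itself (for example to script the SSH step), a small helper can extract it. This is a sketch; the interface name &lt;code&gt;eth0&lt;/code&gt; and the &lt;code&gt;board_ip&lt;/code&gt; function name are assumptions:&lt;/p&gt;

```shell
# Sketch: pull just the IPv4 address of an interface using iproute2.
# Substitute wlan0 when using Wi-Fi; prints a notice when the interface
# has no address (e.g. when run on a different machine).
board_ip() {
  iface="${1:-eth0}"
  ip -4 -o addr show "$iface" 2>/dev/null | awk '{print $4}' | cut -d/ -f1
}

addr=$(board_ip eth0)
if [ -n "$addr" ]; then
  echo "board IP: $addr"
else
  echo "no IPv4 address found on eth0"
fi
```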
&lt;p&gt;&lt;strong&gt;SSH Login&lt;/strong&gt;: For security reasons, it is recommended to use a regular user for SSH login instead of the root account.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Regular User&lt;/strong&gt;: Username &lt;code&gt;sunrise&lt;/code&gt;, password &lt;code&gt;sunrise&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;6.3 System Version Confirmation&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: System version alignment is crucial, as different versions may encounter different compatibility issues. Please confirm the system version:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sunrise@ubuntu:~$ cat /etc/version
4.0.4-Beta&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Confirm that the current version is &lt;strong&gt;4.0.4-Beta&lt;/strong&gt;, consistent with the flashed image version.&lt;/p&gt;
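&lt;p&gt;The version check above can be scripted for repeatability. A minimal sketch, assuming the &lt;code&gt;/etc/version&lt;/code&gt; path and expected value from this section:&lt;/p&gt;

```shell
# Sketch: compare the flashed image version against the one this document
# targets. On a machine that is not an RDK S100, /etc/version is normally
# absent and the script just reports that.
EXPECTED="4.0.4-Beta"
if [ -r /etc/version ]; then
  ACTUAL=$(cat /etc/version)
  if [ "$ACTUAL" = "$EXPECTED" ]; then
    echo "system version OK: $ACTUAL"
  else
    echo "system version mismatch: got $ACTUAL, expected $EXPECTED"
  fi
else
  echo "/etc/version not found (not an RDK S100 image?)"
fi
```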
&lt;h2&gt;7. RDK S100 Software Environment Configuration&lt;/h2&gt;
&lt;h3&gt;7.1 Computer Board Selection Description&lt;/h3&gt;
&lt;p&gt;The original system of MIT Mini Cheetah runs on the UP Board, which uses a 4-core Intel Atom x5-Z8350 processor, equipped with 4GB RAM, peak power consumption of about 5W, based on x86 architecture.&lt;/p&gt;
&lt;p&gt;UP Board has relatively few applications in the Chinese market. More common choices include Raspberry Pi and NVIDIA Jetson series. Among them, Raspberry Pi is more oriented towards general embedded applications, while the Jetson series is more suitable for image processing and AI model deployment.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;RDK S100&lt;/strong&gt; used in this document as a computing platform runs Ubuntu 22.04 system, equipped with a 6-core ARM Cortex-A78AE v8.2 64-bit processor (ARM architecture), with 80 TOPS AI computing power and onboard MCU (Cortex-R52+), very suitable for quadruped robots and other applications requiring “perception inference + real-time motion control” collaboration.&lt;/p&gt;
&lt;h3&gt;7.2 Download MIT Mini Cheetah Source Code&lt;/h3&gt;
&lt;p&gt;First, download the MIT Mini Cheetah source code. This document uses an adapted version:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/fuwei007/NavBot-EG02.git&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After downloading, enter the source code directory, which we refer to as the MIT main folder.&lt;/p&gt;
&lt;h3&gt;7.3 Install Third-Party Dependency Libraries&lt;/h3&gt;
&lt;p&gt;Install the basic dependency libraries required for compilation and running:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sudo apt-get update
sudo apt -y install cmake gcc build-essential
sudo apt-get -y install openjdk-11-jdk
sudo apt -y install liblcm-dev
sudo apt-get -y install libeigen3-dev
sudo apt-get -y install mesa-common-dev
sudo apt -y install libgl1-mesa-dev
sudo apt -y install libglu1-mesa-dev
sudo apt-get -y install freeglut3-dev
sudo apt-get -y install libblas-dev liblapack-dev
sudo apt-get -y install libopenblas-dev

sudo apt install -y coinor-libipopt-dev gfortran libglib2.0-dev
sudo apt install -y openjdk-8-jdk&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;7.4 Install Qt&lt;/h3&gt;
&lt;p&gt;Qt is the graphics library required for the MIT Mini Cheetah simulation interface. There are two installation methods:&lt;/p&gt;
&lt;h4&gt;Method 1: Source Code Compilation Installation (Suitable for cases requiring a complete Qt development environment)&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Download Qt 5.14.2 Version&lt;/strong&gt;: &lt;a href="https://download.qt.io/archive/qt/5.14/5.14.2/" rel="noopener noreferrer"&gt;Qt 5.14.2 Download Link&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;After downloading, execute the following steps in the directory where the file is located:&lt;/p&gt;
&lt;p&gt;1. Select the downloaded Qt installation file, right-click and select “Properties”&lt;/p&gt;
&lt;p&gt;2. In the “Permissions” tab, check “Allow executing file as program”&lt;/p&gt;
&lt;p&gt;3. Right-click in this folder to open a terminal, and execute the following command (Note: &lt;code&gt;qt-opensource-linux-x64-5.14.2.run&lt;/code&gt; should be replaced with your actual downloaded filename):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./qt-opensource-linux-x64-5.14.2.run&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;4. Complete the installation according to the graphical interface prompts (similar to Windows installation program)&lt;/p&gt;
&lt;h4&gt;Method 2: Install Using apt (Recommended, Simpler)&lt;/h4&gt;
&lt;p&gt;You can also use apt to directly install Qt-related libraries:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sudo apt install -y qtbase5-dev libqt5gamepad5-dev&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;RDK S100 Special Note&lt;/strong&gt;: In fact, the RDK S100 system already has Qt-related environment pre-installed, so you can skip the source code compilation steps and only need to install the gamepad support library:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sudo apt install -y libqt5gamepad5-dev&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;7.5 Install LCM&lt;/h3&gt;
&lt;p&gt;LCM (Lightweight Communications and Marshalling) is a library used for inter-process communication in the MIT Mini Cheetah system.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Download LCM 1.4.0 Version&lt;/strong&gt;: &lt;a href="https://github.com/lcm-proj/lcm/releases/download/v1.4.0/lcm-1.4.0.zip" rel="noopener noreferrer"&gt;LCM v1.4.0 Download Link&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;After downloading, extract the compressed package, enter the extracted folder, right-click on a blank area and select “Open in Terminal”.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Due to system version compatibility requirements, you need to switch the Java environment to JDK 8:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sudo update-alternatives --config javac
# Select the option pointing to /usr/lib/jvm/java-8-openjdk-arm64/bin/javac

sudo update-alternatives --config java
# Select the option pointing to /usr/lib/jvm/java-8-openjdk-arm64/jre/bin/java&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After completing the Java environment switch, execute the following commands to compile and install LCM (it is recommended to execute them one by one):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mkdir build 
cd build 
cmake .. 
make
sudo make install 
sudo ldconfig&lt;/code&gt;&lt;/pre&gt;
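&lt;p&gt;To confirm the installation succeeded, LCM should now be discoverable. A minimal sketch, assuming LCM's build installs its &lt;code&gt;lcm.pc&lt;/code&gt; pkg-config file (the &lt;code&gt;check_lcm&lt;/code&gt; function name is illustrative):&lt;/p&gt;

```shell
# Sketch: verify LCM is visible to pkg-config after `sudo make install` and
# `sudo ldconfig`. On a machine without LCM the check simply reports it.
check_lcm() {
  if command -v pkg-config >/dev/null 2>&1 && pkg-config --exists lcm; then
    echo "lcm version: $(pkg-config --modversion lcm)"
  else
    echo "lcm not found via pkg-config (was 'sudo ldconfig' run?)"
  fi
}
check_lcm
```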
&lt;h3&gt;7.6 Install Eigen 3.3.6&lt;/h3&gt;
&lt;p&gt;Eigen is a C++ template library for linear algebra, matrix and vector operations. &lt;strong&gt;Important&lt;/strong&gt;: After actual testing, other versions of Eigen may have compatibility issues, so you must use &lt;strong&gt;Eigen 3.3.6&lt;/strong&gt; version.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Download Eigen 3.3.6&lt;/strong&gt;: &lt;a href="https://gitlab.com/libeigen/eigen/-/archive/3.3.6/eigen-3.3.6.zip" rel="noopener noreferrer"&gt;Eigen 3.3.6 Download Link&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;After downloading, extract the compressed package, enter the extracted folder, right-click on a blank area and select “Open in Terminal”, then execute the following commands (it is recommended to execute them one by one):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mkdir build 
cd build 
cmake .. 
sudo make install 
sudo ldconfig&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;7.7 Modify MIT Mini Cheetah Program Source Code&lt;/h3&gt;
&lt;p&gt;Since the MIT Mini Cheetah original code is mainly designed for UP Board (x86 architecture), some adaptive modifications are needed on RDK S100 (ARM architecture). The downloaded source code directory structure is shown in the figure below:&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F11%2F13%2FfT6vP2tcIyMYBOu.png" alt="image.png" width="800" height="451"&gt;
&lt;p&gt;The following will detail the modifications that need to be made:&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;7.7.1 Modify Git Branch and Repository Address in CMakeLists.txt&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Open the &lt;code&gt;common/CMakeLists.txt&lt;/code&gt; file under the MIT main folder, and you need to modify the following content:&lt;/p&gt;
&lt;p&gt;1. Change the Git branch from &lt;code&gt;master&lt;/code&gt; to &lt;code&gt;main&lt;/code&gt; (GitHub now uses &lt;code&gt;main&lt;/code&gt; as the default branch name)&lt;/p&gt;
&lt;p&gt;2. Switch the googletest library’s Git repository address to Gitee mirror (faster access in China)&lt;/p&gt;
&lt;p&gt;The modification location is shown in the figure below:&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F11%2F13%2FP4LaXVvb82DqQ5U.png" alt="image.png" width="800" height="454"&gt;
&lt;p&gt;Save and exit after modification.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;7.7.2 Modify Eigen3 and LCM Header File Paths&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Since Eigen3 and LCM header files are installed in the &lt;code&gt;/usr/include&lt;/code&gt; directory in the RDK S100 system, while the default path in the source code is &lt;code&gt;/usr/local/include&lt;/code&gt;, path modification is needed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Search and Replace&lt;/strong&gt;: Search for the following two lines in all related files:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;include_directories("/usr/local/include/lcm/")
include_directories("/usr/local/include/eigen3")&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Replace with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;include_directories("/usr/include/lcm/")
include_directories("/usr/include/eigen3")&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;List of Files That Need to Be Modified&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Cheetah-Software-master/common/CMakeLists.txt
Cheetah-Software-master/rc_test/CMakeLists.txt
Cheetah-Software-master/robot/CMakeLists.txt
Cheetah-Software-master/sim/CMakeLists.txt
Cheetah-Software-master/user/MIT_Controller/CMakeLists.txt&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;RDK S100 Special Note&lt;/strong&gt;: If you are using already adapted source code (such as the version provided in this document), you may not need to make this modification, or you may need to perform the opposite operation (change &lt;code&gt;/usr/include&lt;/code&gt; to &lt;code&gt;/usr/local/include&lt;/code&gt;). Please adjust according to the actual header file installation location.&lt;/p&gt;
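&lt;p&gt;If you prefer not to edit each file by hand, the replacement can be scripted. The following is a sketch (it assumes you run it from the &lt;code&gt;Cheetah-Software-master&lt;/code&gt; source root, and it keeps &lt;code&gt;.bak&lt;/code&gt; backups beside the originals):&lt;/p&gt;

```shell
# Batch version of the search-and-replace above (a sketch; back up first).
# Run from the Cheetah-Software-master source root.
for f in common/CMakeLists.txt rc_test/CMakeLists.txt \
         robot/CMakeLists.txt sim/CMakeLists.txt \
         user/MIT_Controller/CMakeLists.txt; do
  [ -f "$f" ] || continue                  # skip files that do not exist
  cp "$f" "$f.bak"                         # keep a backup beside the original
  sed -i -e 's#/usr/local/include/lcm/#/usr/include/lcm/#g' \
         -e 's#/usr/local/include/eigen3#/usr/include/eigen3#g' "$f"
done
```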
&lt;h4&gt;&lt;strong&gt;7.7.3 Modify Qt Path&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Modify the file &lt;code&gt;Cheetah-Software-master/scripts/find_qt_path.sh&lt;/code&gt;, comment out the original Qt path setting:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#printf "${HOME}/Qt/${QT_VER}/gcc_64/"&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The path after &lt;code&gt;printf&lt;/code&gt; should include the &lt;code&gt;bin&lt;/code&gt; directory.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RDK S100 Adaptation&lt;/strong&gt;: Since RDK S100 uses system-installed Qt, you should use the following method to automatically obtain the Qt path:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;printf "$(qmake -query QT_INSTALL_PREFIX)/"&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This can automatically obtain the installation path of the system Qt without manual specification.&lt;/p&gt;
&lt;p&gt;The modification location is shown in the figure below:&lt;/p&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2025%2F11%2F13%2FdjzZOGoauXgI3nW.png" alt="image.png" width="464" height="218"&gt;
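&lt;p&gt;If you want &lt;code&gt;find_qt_path.sh&lt;/code&gt; to keep working when only a versioned binary is installed, a more defensive script body could look like the sketch below (the candidate binary names are assumptions; adjust them for your distribution):&lt;/p&gt;

```shell
# Sketch of a defensive find_qt_path.sh body: try qmake first, then common
# versioned binary names. Prints the Qt prefix with a trailing slash, as the
# build scripts expect; prints nothing if no qmake is found.
QT_PREFIX=""
for cand in qmake qmake6 qmake-qt5; do
  if [ -z "$QT_PREFIX" ]; then
    if command -v "$cand" 1>/dev/null 2>/dev/null; then
      QT_PREFIX="$("$cand" -query QT_INSTALL_PREFIX)"
    fi
  fi
done
if [ -n "$QT_PREFIX" ]; then printf "%s/" "$QT_PREFIX"; fi
```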
&lt;h4&gt;&lt;strong&gt;7.7.4 Fix Serial Port Header File Missing Issue&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;In ARM architecture Linux systems, the inclusion method of certain header files differs from x86 architecture and needs to be adapted.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Modify Source Code File&lt;/strong&gt;: Edit the &lt;code&gt;Cheetah-Software-master/robot/src/rt/rt_serial.cpp&lt;/code&gt; file:&lt;/p&gt;
&lt;p&gt;1. Comment out &lt;code&gt;#include &amp;lt;stropts.h&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;2. Add &lt;code&gt;#include &amp;lt;sys/ioctl.h&amp;gt;&lt;/code&gt; before &lt;code&gt;#include &amp;lt;asm/termios.h&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix System Header File Redefinition Issue&lt;/strong&gt;: Edit the system header file &lt;code&gt;/usr/include/asm-generic/termios.h&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sudo nano /usr/include/asm-generic/termios.h&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add &lt;code&gt;#ifndef _SYS_IOCTL_H&lt;/code&gt; at the beginning of the file, and add &lt;code&gt;#endif&lt;/code&gt; after the related structure definition to avoid redefinition errors:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#ifndef _SYS_IOCTL_H
struct winsize {
        unsigned short ws_row;
        unsigned short ws_col;
        unsigned short ws_xpixel;
        unsigned short ws_ypixel;
};

#define NCC 8
struct termio {
        unsigned short c_iflag;         /* input mode flags */
        unsigned short c_oflag;         /* output mode flags */
        unsigned short c_cflag;         /* control mode flags */
        unsigned short c_lflag;         /* local mode flags */
        unsigned char c_line;           /* line discipline */
        unsigned char c_cc[NCC];        /* control characters */
};
#endif&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;&lt;strong&gt;7.7.5 Adapt spdlog Logging Library&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;spdlog is a fast C++ logging library. On RDK S100, you need to use the system-installed spdlog package instead of compiling from source.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Modify &lt;code&gt;third-party/CMakeLists.txt&lt;/code&gt;&lt;/strong&gt;: Replace all file content with the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;add_subdirectory(Goldfarb_Optimizer)
add_subdirectory(ParamHandler)
add_subdirectory(inih)
add_subdirectory(osqp)
add_subdirectory(JCQP)
add_subdirectory(qpOASES)
add_subdirectory(lord_imu)
add_subdirectory(wheeltec_imu)
add_subdirectory(SOEM)

if(CMAKE_SYSTEM_NAME MATCHES Linux)
  add_subdirectory(vectornav)
endif()

# Build all 3rd-party libs with PIC (useful for shared libs)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)

# ------------------------------------------------------------
# spdlog: use system package (libspdlog-dev) instead of source
# ------------------------------------------------------------
find_package(spdlog CONFIG REQUIRED)

# Provide a target named "spdlog" for compatibility with existing link lines.
add_library(spdlog INTERFACE)

if(TARGET spdlog::spdlog)
  target_link_libraries(spdlog INTERFACE spdlog::spdlog)
elseif(TARGET spdlog::spdlog_header_only)
  target_link_libraries(spdlog INTERFACE spdlog::spdlog_header_only)
else()
  message(FATAL_ERROR "spdlog CMake target not found (spdlog::spdlog / spdlog::spdlog_header_only). Install libspdlog-dev or set spdlog_DIR.")
endif()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, replace the entire contents of the top-level &lt;code&gt;CMakeLists.txt&lt;/code&gt; with the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cmake_minimum_required(VERSION 3.5)

# Add project() to avoid CMake warning and make PROJECT_SOURCE_DIR valid
project(MiniCheetah LANGUAGES C CXX)

set(CMAKE_DISABLE_IN_SOURCE_BUILD ON)
set(CMAKE_DISABLE_SOURCE_CHANGES  ON)

if ("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_BINARY_DIR}")
  message(SEND_ERROR "In-source builds are not allowed.")
endif ()

set(CMAKE_COLOR_MAKEFILE ON)
#execute_process(COMMAND ../scripts/make_types.sh)

set(CMAKE_MODULE_PATH ${PROJECT_SOURCE_DIR}/cmake)

#set(CMAKE_VERBOSE_MAKEFILE ON)

option(MINI_CHEETAH_BUILD "use compiler flags for mini cheetah computer" OFF)
set(BUILD_TYPE_RELEASE TRUE)

option(NO_SIM "Do not build simulator" OFF)

# -------------------------------
# spdlog: use system libspdlog-dev
# Must be before any add_subdirectory() that links spdlog::spdlog
# -------------------------------
find_package(spdlog CONFIG REQUIRED)

# Some distros provide only header-only target; alias it to spdlog::spdlog
if(NOT TARGET spdlog::spdlog AND TARGET spdlog::spdlog_header_only)
  add_library(spdlog::spdlog ALIAS spdlog::spdlog_header_only)
endif()

if(MINI_CHEETAH_BUILD)
  SET (THIS_COM "../" )
  CONFIGURE_FILE(${CMAKE_CURRENT_SOURCE_DIR}/config.h.cmake
    ${CMAKE_BINARY_DIR}/Configuration.h)
  set(CMAKE_CXX_FLAGS "-O3 -no-pie -ggdb -Wall 
  -Wextra -Wcast-align -Wdisabled-optimization -Wformat=2 
  -Winit-self -Wmissing-include-dirs -Woverloaded-virtual 
  -Wshadow -Wsign-promo -Werror")
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-error=overloaded-virtual -Wno-error=unused-parameter")
  set(CMAKE_C_FLAGS "-O3  -ggdb -std=gnu99 -I.")
  message("**** Mini-Cheetah build enabled ****")
else(MINI_CHEETAH_BUILD)
  SET (THIS_COM "${PROJECT_SOURCE_DIR}/" )
  CONFIGURE_FILE(${CMAKE_CURRENT_SOURCE_DIR}/config.h.cmake
    ${CMAKE_BINARY_DIR}/Configuration.h)

  if(CMAKE_SYSTEM_NAME MATCHES Linux)
    set(CMAKE_CXX_FLAGS "-O3 -no-pie -march=native -ggdb -Wall 
    -Wextra -Wcast-align -Wdisabled-optimization -Wformat=2 
    -Winit-self -Wmissing-include-dirs -Woverloaded-virtual 
    -Wshadow -Wsign-promo -Werror")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-error=overloaded-virtual -Wno-error=unused-parameter")
  elseif(APPLE)
    set(CMAKE_CXX_FLAGS "-O3 -march=native -ggdb -Wall 
    -Wextra -Wcast-align -Wdisabled-optimization -Wformat=2 
    -Winit-self -Wmissing-include-dirs -Woverloaded-virtual 
    -Wshadow -Wsign-promo")
    include_directories("/usr/local/include/")   # lcm includes
  endif()

  set(CMAKE_C_FLAGS "-O3  -ggdb  -march=native -std=gnu99 -I.")
  message("**** Mini-Cheetah build disabled ****")
endif(MINI_CHEETAH_BUILD)

set(CMAKE_CXX_STANDARD 14)

#find_package(lcm)

add_subdirectory(robot)
add_subdirectory(third-party)
add_subdirectory(common)

if(NO_SIM)

else(NO_SIM)
  add_subdirectory(sim)
endif()

add_subdirectory(user)
add_subdirectory(rc_test)&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;7.8 Compile MIT Mini Cheetah Program&lt;/h3&gt;
&lt;p&gt;After completing all source code modifications, you can build the project. The Mini Cheetah build is used as the example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd Cheetah-Software
cd scripts
chmod +x make_types.sh
./make_types.sh  # You may see error messages like `rm: cannot remove...`, this is normal and can be ignored

cd .. &amp;amp;&amp;amp; mkdir mc-build &amp;amp;&amp;amp; cd mc-build
rm CMakeCache.txt  # Clean old configuration (if necessary)

# Configure project
# -DMINI_CHEETAH_BUILD=TRUE: Build Mini Cheetah version
# -DJCQP_USE_AVX2=OFF: Turn off x86 AVX2 optimization, adapt to ARM architecture (RDK S100)
cmake -DMINI_CHEETAH_BUILD=TRUE -DJCQP_USE_AVX2=OFF ..

# Compile (adjust -j parameter according to CPU core count, $(nproc) will automatically detect core count)
make -j$(nproc)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Compilation Notes&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;1. &lt;strong&gt;&lt;code&gt;./make_types.sh&lt;/code&gt; execution&lt;/strong&gt;: The script may print errors such as “cannot remove, no such file or directory”; these can be safely ignored and do not affect compilation.&lt;/p&gt;
&lt;p&gt;2. CMake Configuration:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;-DMINI_CHEETAH_BUILD=TRUE&lt;/code&gt;: build the Mini Cheetah version&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-DJCQP_USE_AVX2=OFF&lt;/code&gt;: disable the x86-only AVX2 optimization so the code builds on the ARM-based RDK S100&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;3. &lt;strong&gt;Network Issues&lt;/strong&gt;: The &lt;code&gt;cmake&lt;/code&gt; step may stall while downloading Google-hosted dependencies. This is a network issue; wait patiently or retry.&lt;/p&gt;
&lt;p&gt;4. &lt;strong&gt;Compilation Parallelism&lt;/strong&gt;: &lt;code&gt;make -j$(nproc)&lt;/code&gt; will automatically use all CPU cores for parallel compilation. If you encounter problems, you can use &lt;code&gt;make&lt;/code&gt; for single-threaded compilation, but it will be slower.&lt;/p&gt;
&lt;h2&gt;8. RDK S100 Program Execution&lt;/h2&gt;
&lt;p&gt;After compilation is complete, the generated controller executable file is located in the &lt;code&gt;mc-build/user/MIT_Controller/&lt;/code&gt; directory. Running the program requires &lt;code&gt;sudo&lt;/code&gt; privileges to access hardware ports.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: You need to execute the following commands in the &lt;code&gt;mc-build&lt;/code&gt; directory.&lt;/p&gt;
&lt;h3&gt;8.1 Simulation Mode Execution&lt;/h3&gt;
&lt;p&gt;First, test whether the program runs normally in simulation mode:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd mc-build
sudo ./user/MIT_Controller/mit_ctrl m s&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Parameter description:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;m&lt;/code&gt;: Mini Cheetah model&lt;/li&gt;
&lt;li&gt;&lt;code&gt;s&lt;/code&gt;: simulate (simulation mode)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;8.2 Real Robot Mode Execution&lt;/h3&gt;
&lt;p&gt;After confirming that simulation mode runs normally, you can switch to real robot mode:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd mc-build
sudo ./user/MIT_Controller/mit_ctrl m r f&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Parameter description:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;m&lt;/code&gt;: Mini Cheetah model&lt;/li&gt;
&lt;li&gt;&lt;code&gt;r&lt;/code&gt;: robot (real robot mode)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;f&lt;/code&gt;: load control parameters from file (rather than receiving them over the network)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In real robot mode, ensure all hardware is connected correctly, including the SPIne board and motor controllers&lt;/li&gt;
&lt;li&gt;Test thoroughly in simulation mode and confirm the control algorithm behaves correctly before switching to real robot mode&lt;/li&gt;
&lt;li&gt;When running the real robot, ensure there is sufficient clear space to avoid injury if the robot loses control&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;strong&gt;9. Advanced: Solving RDK S100 Rear Leg (SPI 0.1) Drive Issue&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;When migrating code from Jetson Nano to RDK S100, you may encounter a typical phenomenon: &lt;strong&gt;the front legs can move, but the rear legs have no response at all&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;9.1 Problem Diagnosis&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Enter the following command in the terminal to check devices:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ls /dev/spi*&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Normally, Mini Cheetah requires two SPI devices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/dev/spidev0.0&lt;/code&gt; (controls the front legs)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/dev/spidev0.1&lt;/code&gt; (controls the rear legs)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: You may only see &lt;code&gt;/dev/spidev0.0&lt;/code&gt;, or there may be an unused &lt;code&gt;/dev/spidev1.0&lt;/code&gt;, but &lt;code&gt;/dev/spidev0.1&lt;/code&gt; is missing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Root Cause Analysis&lt;/strong&gt;: The RDK S100 device tree may not enable the second chip select (CS1) of SPI0 by default, so the kernel never creates the corresponding device node.&lt;/p&gt;
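&lt;p&gt;A small shell loop (a convenience sketch) makes the diagnosis explicit; on a correctly configured board both nodes should report as present:&lt;/p&gt;

```shell
# Report which SPI character devices the kernel exposes.
# /dev/spidev0.1 missing usually means CS1 is absent from the device tree.
for dev in /dev/spidev0.0 /dev/spidev0.1; do
  if [ -e "$dev" ]; then
    echo "$dev: present"
  else
    echo "$dev: MISSING"
  fi
done
```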
&lt;h3&gt;&lt;strong&gt;9.2 Solution (Ultimate Hardware Modification Version)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The most reliable method is to &lt;strong&gt;directly modify the kernel device tree blob (DTB)&lt;/strong&gt;. For convenience, a Python script was written that automatically decompiles the system DTB files, inserts the missing driver node, and recompiles them back in place.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;9.2.1 Step 1: Install Required Tools&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;We need to install the &lt;code&gt;device-tree-compiler&lt;/code&gt; (&lt;code&gt;dtc&lt;/code&gt;) package to compile the device tree. Ensure the development board is connected to the network, then execute:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sudo apt-get update

sudo apt-get install -y device-tree-compiler&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;&lt;strong&gt;9.2.2 Step 2: Create Fix Script&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Create a script file in the terminal:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;nano force_spi_patch.py&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;&lt;strong&gt;9.2.3 Step 3: Copy Script Code&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Copy the following code completely (comments are in English to prevent Chinese encoding issues):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import os
import glob
import subprocess
import sys

# Path to RDK S100 Device Tree files
DTB_DIR = "/boot/hobot"

# The CS1 node to insert (using spidev@1)
# reg = &amp;lt;0x1&amp;gt; corresponds to Chip Select 1
NEW_NODE = """
        spidev@1 {
            compatible = "rohm,dh2228fv";
            reg = &amp;lt;0x1&amp;gt;;
            spi-max-frequency = &amp;lt;0x2faf080&amp;gt;;
        };
"""

def patch_dts_content(content):
    # Check if spidev@1 already exists
    if "spidev@1" in content:
        return None, "Already patched"

    # Find the position of spidev@0
    # We insert spidev@1 immediately before spidev@0 for safety
    target_str = "spidev@0 {"
    if target_str not in content:
        return None, "spidev@0 not found"

    # Replace target string with NEW_NODE + target string
    new_content = content.replace(target_str, NEW_NODE + "\n\t" + target_str)
    return new_content, "Patched"

def main():
    print("=== Starting: Kernel Device Tree Patch ===")

    # 1. Check for dtc tool
    if subprocess.call(["which", "dtc"], stdout=subprocess.DEVNULL) != 0:
        print("Error: dtc tool not found. Please run: sudo apt-get install device-tree-compiler")
        sys.exit(1)

    # 2. Find all dtb files
    dtb_files = glob.glob(os.path.join(DTB_DIR, "rdk-s100*.dtb"))
    if not dtb_files:
        print(f"Error: No .dtb files found in {DTB_DIR}")
        sys.exit(1)

    count = 0
    for dtb_path in dtb_files:
        # Skip files we might have created manually before
        if "-cs1.dtb" in dtb_path:
            continue

        print(f"Processing: {os.path.basename(dtb_path)}")

        # Backup original file
        if not os.path.exists(dtb_path + ".original"):
            os.system(f"sudo cp {dtb_path} {dtb_path}.original")

        # Decompile DTB -&amp;gt; DTS
        dts_path = dtb_path + ".temp.dts"
        cmd_decompile = f"dtc -I dtb -O dts -o {dts_path} {dtb_path}"

        # Run decompile (suppress warnings)
        os.system(f"{cmd_decompile} &amp;gt; /dev/null 2&amp;gt;&amp;amp;1")
        if not os.path.exists(dts_path):
            print("  -&amp;gt; Decompilation failed, skipping")
            continue

        # Read and modify DTS content
        with open(dts_path, 'r') as f:
            content = f.read()

        new_content, status = patch_dts_content(content)
        if new_content:
            with open(dts_path, 'w') as f:
                f.write(new_content)

            # Recompile DTS -&amp;gt; DTB
            cmd_compile = f"dtc -I dts -O dtb -o {dtb_path} {dts_path}"
            if os.system(f"{cmd_compile} &amp;gt; /dev/null 2&amp;gt;&amp;amp;1") == 0:
                print(f"  -&amp;gt; Patch applied successfully!")
                count += 1
            else:
                print(f"  -&amp;gt; Compilation error, file not modified")
        else:
            print(f"  -&amp;gt; {status} (No changes needed)")

        # Clean up temporary file
        if os.path.exists(dts_path):
            os.remove(dts_path)

    print("-" * 30)
    if count &amp;gt; 0:
        print(f"Patch Complete! Modified {count} kernel files.")
        print("Please reboot immediately: sudo reboot")
    else:
        print("No files were modified. Please check if spidev@1 already exists.")

if __name__ == "__main__":
    main()&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Press &lt;code&gt;Ctrl + O&lt;/code&gt; to save, &lt;code&gt;Enter&lt;/code&gt; to confirm, and &lt;code&gt;Ctrl + X&lt;/code&gt; to exit.&lt;/p&gt;
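&lt;p&gt;Before touching real DTB files, the script’s string-patching step can be sanity-checked in isolation. The snippet below replicates &lt;code&gt;patch_dts_content()&lt;/code&gt; with a simplified node body (the property list is omitted for brevity) and runs it on a toy DTS fragment:&lt;/p&gt;

```python
# Minimal, self-contained replica of the script's patch_dts_content() logic.
# The node body is simplified for the demo; the real script inserts the full
# spidev@1 node with compatible/reg/spi-max-frequency properties.
NEW_NODE = "\n        spidev@1 {\n            /* properties omitted in this sketch */\n        };\n"

def patch_dts_content(content):
    # Idempotence guard: never insert the node twice
    if "spidev@1" in content:
        return None, "Already patched"
    target_str = "spidev@0 {"
    if target_str not in content:
        return None, "spidev@0 not found"
    # Insert spidev@1 immediately before the existing spidev@0 node
    return content.replace(target_str, NEW_NODE + "\n\t" + target_str), "Patched"

sample = 'spi@0 {\n\tspidev@0 {\n\t\tstatus = "okay";\n\t};\n};\n'
patched, status = patch_dts_content(sample)
print(status)                          # Patched
print(patch_dts_content(patched)[1])   # Already patched
```

A second run on the already patched text is a no-op, which is what makes the real script safe to re-execute.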
&lt;h4&gt;&lt;strong&gt;9.2.4 Step 4: Execute Fix&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Run the script:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sudo python3 force_spi_patch.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you see the message &lt;code&gt;Patch applied successfully!&lt;/code&gt;, the patch has been applied to the kernel device tree files.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;9.2.5 Step 5: Reboot and Verify&lt;/strong&gt;&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;sudo reboot&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After rebooting, check again:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ls /dev/spi*&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At this point, both &lt;code&gt;/dev/spidev0.0&lt;/code&gt; and &lt;code&gt;/dev/spidev0.1&lt;/code&gt; should exist, and the robot’s rear legs can be controlled normally.&lt;/p&gt;
&lt;h2&gt;10. Summary&lt;/h2&gt;
&lt;p&gt;This document systematically introduces the complete process of deploying the MIT Mini Cheetah robot control system on the D-Robotics RDK S100 development board, covering all aspects from system flashing, network configuration, software environment setup, source code adaptation to program compilation and execution.&lt;/p&gt;
&lt;p&gt;Through the detailed instructions in this document, developers can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Understand the basic characteristics of the RDK S100 development board and the rationale for choosing it&lt;/li&gt;
&lt;li&gt;Complete the full environment setup, from system flashing to network configuration&lt;/li&gt;
&lt;li&gt;Master the methods for adapting MIT Mini Cheetah to ARM-based platforms&lt;/li&gt;
&lt;li&gt;Successfully compile and run the MIT Mini Cheetah control system&lt;/li&gt;
&lt;li&gt;Perform robot control testing in both simulation and real robot modes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We hope this document serves as a useful reference for developers and helps promote the application of quadruped robot technology on more platforms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related Resources&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;D-Robotics Official Documentation: &lt;a href="https://developer.d-robotics.cc" rel="noopener noreferrer"&gt;https://developer.d-robotics.cc&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;MIT Mini Cheetah Open Source Code: &lt;a href="https://github.com/mit-biomimetics/Cheetah-Software" rel="noopener noreferrer"&gt;https://github.com/mit-biomimetics/Cheetah-Software&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The post &lt;a href="https://frankfu.blog/openai/complete-guide-to-deploying-mit-mini-cheetah-on-d-robotics-rdk-s100/" rel="noopener noreferrer"&gt;Complete Guide to Deploying MIT Mini Cheetah on D-Robotics RDK S100&lt;/a&gt; appeared first on &lt;a href="https://frankfu.blog" rel="noopener noreferrer"&gt;Frank Fu's Blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>openai</category>
    </item>
    <item>
      <title>NavTalk Digital Human Loop Video Generation Technical Implementation</title>
      <dc:creator>Frank Fu</dc:creator>
      <pubDate>Mon, 30 Mar 2026 08:50:56 +0000</pubDate>
      <link>https://dev.to/frankfu/navtalk-digital-human-loop-video-generation-technical-implementation-2dl3</link>
      <guid>https://dev.to/frankfu/navtalk-digital-human-loop-video-generation-technical-implementation-2dl3</guid>
      <description>&lt;h2&gt;I. Background and Objectives&lt;/h2&gt;
&lt;p&gt;In the NavTalk real-time conversation system, digital humans need to display natural and smooth animation effects. To provide a better user experience, we need to generate a &lt;strong&gt;4-second seamlessly looping video&lt;/strong&gt; that allows the digital human to continuously play while waiting for user input or system responses, creating a seamless looping visual effect.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Core Challenges&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Seamless Loop&lt;/strong&gt;: The last frame of the video must connect perfectly with the first frame&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Natural Movement&lt;/strong&gt;: The digital human’s movements must look natural and professional, suitable for conversation scenarios&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Precise Control&lt;/strong&gt;: The video duration and loop points must be controlled precisely to guarantee a perfect 4-second loop&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;II. Technical Solution Overview&lt;/h2&gt;
&lt;p&gt;We adopt a complete technical solution of &lt;strong&gt;AI Video Generation + Intelligent Blink Detection + Video Post-Processing&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Image Upload → Kling AI Generates 5s Video → Auto-detect Blink Time Point → Extract 2s Clip → Reverse and Concatenate → Generate 4s Loop Video&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Technology Stack&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Video Generation&lt;/strong&gt;: Kling AI (formerly ClingAI) Image-to-Video API&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Blink Detection&lt;/strong&gt;: MediaPipe + OpenCV (Python script)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Video Processing&lt;/strong&gt;: FFmpeg (clipping, reversing, concatenating)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Backend Framework&lt;/strong&gt;: Spring Boot + Apache HttpClient&lt;/p&gt;
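&lt;p&gt;The post-processing stage can be sketched as plain FFmpeg invocations assembled in Python. The file names, the &lt;code&gt;reverse&lt;/code&gt; video filter, and the &lt;code&gt;concat&lt;/code&gt; filter graph below are illustrative assumptions for the clip → reverse → concatenate idea, not the project's actual commands:&lt;/p&gt;

```python
def build_loop_commands(src="clip.mp4", start=1.0, length=2.0,
                        fwd="fwd.mp4", rev="rev.mp4", out="loop.mp4"):
    """Build three FFmpeg invocations: extract a 2s clip around the blink,
    reverse it, then concatenate forward + reversed into a 4s loop."""
    # 1. Extract a `length`-second clip starting at `start` (audio dropped)
    extract = ["ffmpeg", "-y", "-ss", str(start), "-t", str(length),
               "-i", src, "-an", fwd]
    # 2. Reverse the extracted clip frame by frame
    reverse = ["ffmpeg", "-y", "-i", fwd, "-vf", "reverse", "-an", rev]
    # 3. Concatenate forward + reversed clips into one seamless loop
    concat = ["ffmpeg", "-y", "-i", fwd, "-i", rev,
              "-filter_complex", "[0:v][1:v]concat=n=2:v=1:a=0[v]",
              "-map", "[v]", out]
    return [extract, reverse, concat]
```

&lt;p&gt;Each command list can then be executed in order with &lt;code&gt;subprocess.run&lt;/code&gt;.&lt;/p&gt;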
&lt;h2&gt;III. Complete Implementation Flow&lt;/h2&gt;
&lt;h3&gt;Step 1: Image to Video Generation (Kling AI API)&lt;/h3&gt;
&lt;p&gt;First, we call Kling AI’s image-to-video API to generate an initial 5-second video.&lt;/p&gt;
&lt;h4&gt;1.1 API Call Implementation&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;@PostMapping("/generateVideo")
public Result generateVideoFromImage(
        @RequestPart("image") MultipartFile image,
        @RequestPart(value = "prompt", required = false) String prompt) {

    // If no prompt is provided, use the default NavTalk loop animation prompt
    if (prompt == null || prompt.trim().isEmpty()) {
        prompt = clingAiService.getDefaultNavTalkLoopPrompt();
    }

    // Call Service layer to generate 5-second video
    return clingAiService.generateVideo(image, prompt, 5);
}&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;1.2 Prompt Design&lt;/h4&gt;
&lt;p&gt;To generate a loopable video, we carefully designed the prompt to ensure the digital human faces the screen, remains still, and naturally blinks after 1 second:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public String getDefaultNavTalkLoopPrompt() {&lt;br&gt;
    return "A digital human avatar faces the screen directly, completely still and motionless " +&lt;br&gt;
           "throughout the entire video. The character maintains a calm, professional expression " +&lt;br&gt;
           "with eyes open and fixed on the camera. After 1 second, the avatar performs a single " +&lt;br&gt;
           "natural blink - eyelids close gently and then reopen smoothly. After the blink completes, " +&lt;br&gt;
           "the character remains perfectly still again. The camera remains static with neutral lighting, " +&lt;br&gt;
           "maintaining focus on the avatar's calm facial expression and professional demeanor. " +&lt;br&gt;
           "The entire sequence creates a seamless loop where the end frame matches the start frame exactly, " +&lt;br&gt;
           "with the blink occurring after 1 second in each cycle.";&lt;br&gt;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Prompt Design Points&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; Emphasize the digital human &lt;strong&gt;facing the screen&lt;/strong&gt; (faces the screen directly)&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; Emphasize &lt;strong&gt;complete stillness&lt;/strong&gt; (completely still and motionless), with no movement except blinking&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; Clear blink timing: &lt;strong&gt;blink starts after 1 second&lt;/strong&gt; (After 1 second, the avatar performs a single natural blink)&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; Natural blink action: eyelids close gently and then reopen smoothly&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; Emphasize seamless connection: the end frame matches the start frame exactly&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; Maintain static camera and neutral lighting to ensure visual consistency&lt;/p&gt;
&lt;h4&gt;1.3 JWT Authentication&lt;/h4&gt;
&lt;p&gt;The Kling AI API uses a JWT token for authentication. We implemented the complete JWT generation logic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public static String generateJwtToken(String accessKey, String secretKey) {
    // If the Access Key is already a JWT (3 dot-separated parts), use it directly
    String[] tokenParts = accessKey.split("\\.");
    if (tokenParts.length == 3) {
        return accessKey;
    }

    // Otherwise, generate a new JWT Token
    long now = System.currentTimeMillis() / 1000;
    String headerJson = "{\"alg\":\"HS256\",\"typ\":\"JWT\"}";
    String payloadJson = "{\"iss\":\"" + accessKey + "\",\"iat\":" + now +
                         ",\"nbf\":" + now + ",\"exp\":" + (now + 3600) + "}";

    String header = base64UrlEncode(headerJson.getBytes(StandardCharsets.UTF_8));
    String payload = base64UrlEncode(payloadJson.getBytes(StandardCharsets.UTF_8));
    String signingInput = header + "." + payload;
    String signature = hmacSha256Base64Url(signingInput, secretKey);

    return signingInput + "." + signature;
}&lt;/code&gt;&lt;/pre&gt;
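&lt;p&gt;The token layout above (HS256 header; &lt;code&gt;iss&lt;/code&gt;, &lt;code&gt;iat&lt;/code&gt;, &lt;code&gt;nbf&lt;/code&gt;, &lt;code&gt;exp&lt;/code&gt; claims; one-hour lifetime) can be reproduced in a few lines of Python, which is handy for testing credentials outside the Spring service:&lt;/p&gt;

```python
import base64, hashlib, hmac, json, time

def b64url(data: bytes) -> str:
    # Base64url without padding, as JWT requires
    return base64.urlsafe_b64encode(data).decode().rstrip("=")

def generate_jwt(access_key: str, secret_key: str, ttl: int = 3600) -> str:
    # Same claims as the Java version: iss, iat, nbf, exp (now + 1 hour)
    now = int(time.time())
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"},
                               separators=(",", ":")).encode())
    payload = b64url(json.dumps({"iss": access_key, "iat": now,
                                 "nbf": now, "exp": now + ttl},
                                separators=(",", ":")).encode())
    signing_input = header + "." + payload
    sig = b64url(hmac.new(secret_key.encode(), signing_input.encode(),
                          hashlib.sha256).digest())
    return signing_input + "." + sig
```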
&lt;h4&gt;1.4 Configuration&lt;/h4&gt;
&lt;p&gt;First, we need to set up Kling AI API information in the configuration file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# application.properties or application-dev.properties&lt;br&gt;
clingai.api.url=https://api-singapore.klingai.com&lt;br&gt;
clingai.api.access.key=your-access-key&lt;br&gt;
clingai.api.secret.key=your-secret-key&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Inject configuration in the Service class using the &lt;code&gt;@Value&lt;/code&gt; annotation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@Service&lt;br&gt;
public class ClingAiServiceImpl implements ClingAiService {
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Value("${clingai.api.url:}")
private String clingaiApiUrl;

@Value("${clingai.api.access.key:}")
private String clingaiAccessKey;

@Value("${clingai.api.secret.key:}")
private String clingaiSecretKey;

private final ObjectMapper objectMapper = new ObjectMapper();
private CloseableHttpClient httpClient;

// HttpClient initialization (with SSL support)
@PostConstruct
public void init() {
    try {
        SSLContext sslContext = SSLContext.getDefault();
        SSLConnectionSocketFactory sslSocketFactory = new SSLConnectionSocketFactory(
                sslContext,
                new String[]{"TLSv1.2", "TLSv1.3"},
                null,
                NoopHostnameVerifier.INSTANCE
        );

        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(100);
        cm.setDefaultMaxPerRoute(20);

        this.httpClient = HttpClients.custom()
                .setConnectionManager(cm)
                .setSSLSocketFactory(sslSocketFactory)
                .setSSLHostnameVerifier(NoopHostnameVerifier.INSTANCE)
                .build();
    } catch (Exception e) {
        throw new RuntimeException("Failed to initialize HttpClient", e);
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;}&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;1.5 API Request Construction and Response Processing&lt;/h4&gt;
&lt;p&gt;Complete &lt;code&gt;generateVideo&lt;/code&gt; method implementation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@Override&lt;br&gt;
public Result generateVideo(MultipartFile image, String prompt, int duration) {&lt;br&gt;
    try {&lt;br&gt;
        // 1. Check configuration&lt;br&gt;
        if (clingaiApiUrl == null || clingaiApiUrl.isEmpty()) {&lt;br&gt;
            return ResultGenerator.genFailResult("Kling AI API configuration not set");&lt;br&gt;
        }
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    // 2. Build API endpoint
    String url = clingaiApiUrl + "/v1/videos/image2video";
    HttpPost httpPost = new HttpPost(url);

    // 3. Set request headers
    httpPost.setHeader("Content-Type", "application/json");

    // 4. Generate JWT Token and set Authorization header
    String authToken = ClingAiUtils.generateJwtToken(clingaiAccessKey, clingaiSecretKey);
    if (authToken == null || authToken.isEmpty()) {
        return ResultGenerator.genFailResult("Kling AI authentication information not configured or generation failed");
    }
    httpPost.setHeader("Authorization", "Bearer " + authToken);

    // 5. Build request body: Base64-encoded image + prompt + duration
    String imageBase64 = Base64.getEncoder().encodeToString(image.getBytes());
    Map&amp;lt;String, Object&amp;gt; requestBody = new HashMap&amp;lt;&amp;gt;();
    requestBody.put("model_name", "kling-v1-5");
    requestBody.put("image", imageBase64);
    requestBody.put("duration", String.valueOf(duration));
    requestBody.put("mode", "pro");
    if (prompt != null &amp;amp;&amp;amp; !prompt.isEmpty()) {
        requestBody.put("prompt", prompt);
    }

    // 6. Send request
    String jsonBody = objectMapper.writeValueAsString(requestBody);
    httpPost.setEntity(new StringEntity(jsonBody, StandardCharsets.UTF_8));

    try (CloseableHttpResponse response = httpClient.execute(httpPost)) {
        String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
        int statusCode = response.getStatusLine().getStatusCode();

        // 7. Process response
        if (statusCode &amp;gt;= 200 &amp;amp;&amp;amp; statusCode &amp;lt; 300) {
            try {
                JsonNode jsonNode = objectMapper.readTree(responseBody);
                // Response format: {code, message, request_id, data: {task_id, task_status, ...}}
                int code = jsonNode.has("code") ? jsonNode.get("code").asInt() : -1;
                if (code == 0 &amp;amp;&amp;amp; jsonNode.has("data")) {
                    JsonNode dataNode = jsonNode.get("data");
                    Map&amp;lt;String, Object&amp;gt; resultMap = new HashMap&amp;lt;&amp;gt;();
                    resultMap.put("taskId", dataNode.has("task_id") ? dataNode.get("task_id").asText() : null);
                    resultMap.put("taskStatus", dataNode.has("task_status") ? dataNode.get("task_status").asText() : null);
                    resultMap.put("duration", duration);
                    resultMap.put("requestId", jsonNode.has("request_id") ? jsonNode.get("request_id").asText() : null);
                    return ResultGenerator.genSuccessResult(resultMap);
                } else {
                    String message = jsonNode.has("message") ? jsonNode.get("message").asText() : "Unknown error";
                    return ResultGenerator.genFailResult("API returned error: " + message);
                }
            } catch (Exception e) {
                log.error("Failed to parse response", e);
                Map&amp;lt;String, Object&amp;gt; resultMap = new HashMap&amp;lt;&amp;gt;();
                resultMap.put("response", responseBody);
                return ResultGenerator.genSuccessResult(resultMap);
            }
        } else {
            return ResultGenerator.genFailResult("API returned error: " + statusCode + " - " + responseBody);
        }
    }
} catch (Exception e) {
    log.error("Exception occurred while generating {} second video", duration, e);
    return ResultGenerator.genFailResult("Exception occurred while generating video: " + e.getMessage());
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;}&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2: Polling Video Generation Status&lt;/h3&gt;
&lt;p&gt;Kling AI’s video generation is asynchronous. We need to poll the task status until the video generation is complete.&lt;/p&gt;
&lt;h4&gt;2.1 Status Query API Implementation&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;@Override&lt;br&gt;
public Result getVideoStatus(String taskId) {&lt;br&gt;
    try {&lt;br&gt;
        if (clingaiApiUrl == null || clingaiApiUrl.isEmpty()) {&lt;br&gt;
            return ResultGenerator.genFailResult("Kling AI API configuration not set");&lt;br&gt;
        }
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    // API endpoint: GET /v1/videos/image2video/{task_id}
    String url = clingaiApiUrl + "/v1/videos/image2video/" + taskId;
    HttpGet httpGet = new HttpGet(url);

    httpGet.setHeader("Content-Type", "application/json");

    // Get authentication token
    String authToken = ClingAiUtils.generateJwtToken(clingaiAccessKey, clingaiSecretKey);
    if (authToken == null || authToken.isEmpty()) {
        return ResultGenerator.genFailResult("Kling AI authentication information not configured or generation failed");
    }
    httpGet.setHeader("Authorization", "Bearer " + authToken);

    try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
        String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
        int statusCode = response.getStatusLine().getStatusCode();

        if (statusCode == 200) {
            try {
                JsonNode jsonNode = objectMapper.readTree(responseBody);
                int code = jsonNode.has("code") ? jsonNode.get("code").asInt() : -1;
                if (code == 0 &amp;amp;&amp;amp; jsonNode.has("data")) {
                    JsonNode dataNode = jsonNode.get("data");
                    Map&amp;lt;String, Object&amp;gt; resultMap = new HashMap&amp;lt;&amp;gt;();
                    resultMap.put("taskId", dataNode.has("task_id") ? dataNode.get("task_id").asText() : null);
                    resultMap.put("taskStatus", dataNode.has("task_status") ? dataNode.get("task_status").asText() : null);
                    resultMap.put("taskStatusMsg", dataNode.has("task_status_msg") ? dataNode.get("task_status_msg").asText() : null);

                    // Parse video result (if task is completed)
                    if (dataNode.has("task_result") &amp;amp;&amp;amp; dataNode.get("task_result").has("videos")) {
                        JsonNode videosNode = dataNode.get("task_result").get("videos");
                        if (videosNode.isArray() &amp;amp;&amp;amp; videosNode.size() &amp;gt; 0) {
                            JsonNode videoNode = videosNode.get(0);
                            resultMap.put("videoUrl", videoNode.has("url") ? videoNode.get("url").asText() : null);
                            resultMap.put("videoId", videoNode.has("id") ? videoNode.get("id").asText() : null);
                            resultMap.put("videoDuration", videoNode.has("duration") ? videoNode.get("duration").asText() : null);
                        }
                    }

                    return ResultGenerator.genSuccessResult(resultMap);
                } else {
                    String message = jsonNode.has("message") ? jsonNode.get("message").asText() : "Unknown error";
                    return ResultGenerator.genFailResult("Query failed: " + message);
                }
            } catch (Exception e) {
                log.error("Failed to parse response", e);
                Map&amp;lt;String, Object&amp;gt; resultMap = new HashMap&amp;lt;&amp;gt;();
                resultMap.put("response", responseBody);
                return ResultGenerator.genSuccessResult(resultMap);
            }
        } else {
            return ResultGenerator.genFailResult("Status query failed: " + statusCode + " - " + responseBody);
        }
    }
} catch (Exception e) {
    log.error("Exception occurred while querying video status", e);
    return ResultGenerator.genFailResult("Exception occurred while querying status: " + e.getMessage());
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;}&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;2.2 Polling Logic&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;// Step 2: Poll video generation status (wait up to maxPollingTime seconds)&lt;br&gt;
log.info("Step 2: Start polling video generation status (wait up to {} seconds)", maxPollingTime);&lt;br&gt;
String videoUrl = null;&lt;br&gt;
long startTime = System.currentTimeMillis();&lt;br&gt;
int pollCount = 0;&lt;br&gt;
int maxPolls = maxPollingTime / 3; // Query every 3 seconds

&lt;p&gt;while (pollCount &amp;lt; maxPolls) {&lt;br&gt;
    Thread.sleep(3000); // Wait 3 seconds&lt;br&gt;
    pollCount++;&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Result statusResult = getVideoStatus(taskId);
if (statusResult.getCode() != 200) {
    log.warn("Failed to query video status: {}", statusResult.getMessage());
    continue;
}

Map&amp;lt;String, Object&amp;gt; statusData = (Map&amp;lt;String, Object&amp;gt;) statusResult.getData();
String taskStatus = (String) statusData.get("taskStatus");
videoUrl = (String) statusData.get("videoUrl");

log.info("Poll #{}: status: {}, videoUrl: {}", pollCount, taskStatus, 
         videoUrl != null ? "generated" : "not generated");

if (videoUrl != null &amp;amp;&amp;amp; !videoUrl.isEmpty()) {
    log.info("Video generation completed, URL: {}", videoUrl);
    break;
}

if ("failed".equals(taskStatus) || "error".equals(taskStatus)) {
    return ResultGenerator.genFailResult("Video generation failed, status: " + taskStatus);
}

// Check timeout
if (System.currentTimeMillis() - startTime &amp;gt; maxPollingTime * 1000L) {
    return ResultGenerator.genFailResult("Video generation timeout, please query status manually later");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;}&lt;/p&gt;

&lt;p&gt;if (videoUrl == null || videoUrl.isEmpty()) {&lt;br&gt;
    return ResultGenerator.genFailResult("Video generation timeout or failed, please query status manually later, taskId: " + taskId);&lt;br&gt;
}&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;
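&lt;p&gt;The polling loop boils down to: query every few seconds, stop on a URL, a terminal status, or a deadline. A minimal sketch of the same control flow (the &lt;code&gt;get_status&lt;/code&gt; callable and the status strings are assumptions mirroring the Java code above):&lt;/p&gt;

```python
import time

def poll_for_video(get_status, max_wait=60.0, interval=3.0, sleep=time.sleep):
    """Poll get_status() until it yields a video URL, a terminal failure,
    or the deadline passes. get_status returns (task_status, video_url)."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        status, url = get_status()
        if url:
            return url  # video is ready
        if status in ("failed", "error"):
            raise RuntimeError("video generation failed: " + status)
        sleep(interval)  # wait before the next status query
    raise TimeoutError("video generation timed out, query status manually later")
```

&lt;p&gt;Injecting &lt;code&gt;sleep&lt;/code&gt; keeps the loop unit-testable without real waiting.&lt;/p&gt;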
&lt;h3&gt;Step 3: Download Generated Video File&lt;/h3&gt;
&lt;p&gt;After obtaining the video URL, we need to download the video file locally for subsequent processing.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@Override&lt;br&gt;
public MultipartFile downloadVideoFromUrl(String videoUrl) {&lt;br&gt;
    try {&lt;br&gt;
        log.info("Start downloading video: {}", videoUrl);
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    HttpGet httpGet = new HttpGet(videoUrl);
    httpGet.setHeader("User-Agent", "Mozilla/5.0");

    try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
        int statusCode = response.getStatusLine().getStatusCode();
        if (statusCode != 200) {
            log.error("Failed to download video, HTTP status code: {}", statusCode);
            return null;
        }

        byte[] videoBytes = EntityUtils.toByteArray(response.getEntity());
        log.info("Video download completed, size: {} bytes", videoBytes.length);

        // Wrap as MultipartFile and return
        return new MultipartFile() {
            @Override
            public String getName() {
                return "video";
            }

            @Override
            public String getOriginalFilename() {
                return "generated_video.mp4";
            }

            @Override
            public String getContentType() {
                return "video/mp4";
            }

            @Override
            public boolean isEmpty() {
                return videoBytes.length == 0;
            }

            @Override
            public long getSize() {
                return videoBytes.length;
            }

            @Override
            public byte[] getBytes() throws IOException {
                return videoBytes;
            }

            @Override
            public InputStream getInputStream() throws IOException {
                return new ByteArrayInputStream(videoBytes);
            }

            @Override
            public void transferTo(java.io.File dest) throws IOException, IllegalStateException {
                java.nio.file.Files.write(dest.toPath(), videoBytes);
            }
        };
    }
} catch (Exception e) {
    log.error("Failed to download video file", e);
    return null;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;}&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Automatic Blink Time Point Detection&lt;/h3&gt;
&lt;p&gt;This is a &lt;strong&gt;critical step&lt;/strong&gt; in the entire process. We need to find the blink time point in the video as the &lt;strong&gt;keyframe&lt;/strong&gt; for looping. Blinking is a natural action node, and choosing the blink moment as the loop point ensures a more natural loop.&lt;/p&gt;
&lt;h4&gt;4.1 Why Choose Blinking as the Loop Point?&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;Natural Transition&lt;/strong&gt;: Blinking is a brief action, and the facial state before and after blinking is similar, making it suitable as a loop point&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;Visual Concealment&lt;/strong&gt;: The visual change during the blink moment can mask the loop transition&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;Temporal Precision&lt;/strong&gt;: The blink action has a clear start and end, facilitating precise positioning&lt;/p&gt;
&lt;h4&gt;4.2 Blink Detection Implementation&lt;/h4&gt;
&lt;p&gt;We use a Python script to call MediaPipe or OpenCV for blink detection. Complete &lt;code&gt;detectBlink&lt;/code&gt; method implementation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@Override&lt;br&gt;
public Result detectBlink(MultipartFile video) {&lt;br&gt;
    try {&lt;br&gt;
        if (video == null || video.isEmpty()) {&lt;br&gt;
            return ResultGenerator.genFailResult("Video file cannot be empty");&lt;br&gt;
        }
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    // Create temporary working directory
    Path workDir = Files.createTempDirectory("clingai-detect-");
    Path inputPath = workDir.resolve("input.mp4");
    Path scriptPath = null;

    try {
        // 1. Save video file to temporary directory
        Files.copy(video.getInputStream(), inputPath, StandardCopyOption.REPLACE_EXISTING);

        // 2. Get Python script path (from resources or file system)
        try {
            java.net.URL scriptUrl = getClass().getClassLoader().getResource("scripts/detect_blink.py");
            if (scriptUrl != null) {
                scriptPath = Paths.get(scriptUrl.toURI());
            } else {
                // If resource file doesn't exist, try reading from file system
                String scriptResourcePath = "src/main/resources/scripts/detect_blink.py";
                Path projectRoot = Paths.get(System.getProperty("user.dir"));
                scriptPath = projectRoot.resolve(scriptResourcePath);
                if (!Files.exists(scriptPath)) {
                    return ResultGenerator.genFailResult("Blink detection script not found, please manually mark the blink time point");
                }
            }
        } catch (Exception e) {
            log.warn("Unable to load script from resources, trying to read from file system", e);
            String scriptResourcePath = "src/main/resources/scripts/detect_blink.py";
            Path projectRoot = Paths.get(System.getProperty("user.dir"));
            scriptPath = projectRoot.resolve(scriptResourcePath);
            if (!Files.exists(scriptPath)) {
                return ResultGenerator.genFailResult("Blink detection script not found, please manually mark the blink time point");
            }
        }

        // 3. Call Python script
        String pythonCmd = "python3";
        if (System.getProperty("os.name").toLowerCase().contains("windows")) {
            pythonCmd = "python";
        }

        ProcessBuilder pb = new ProcessBuilder(
                pythonCmd,
                scriptPath.toString(),
                inputPath.toString()
        );
        // Don't redirect stderr, read stdout and stderr separately
        pb.redirectErrorStream(false);
        Process p = pb.start();

        // 4. Read stdout (JSON output)
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append("\n");
            }
        }

        // 5. Read stderr (error messages, for logging only)
        StringBuilder errorOutput = new StringBuilder();
        Thread stderrReader = new Thread(() -&amp;gt; {
            try (BufferedReader errorReader = new BufferedReader(
                    new InputStreamReader(p.getErrorStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = errorReader.readLine()) != null) {
                    synchronized (errorOutput) {
                        errorOutput.append(line).append("\n");
                    }
                }
            } catch (IOException e) {
                log.warn("Failed to read Python stderr", e);
            }
        });
        stderrReader.start();

        // Wait for stderr reading thread to complete (wait up to 5 seconds)
        try {
            stderrReader.join(5000);
        } catch (InterruptedException e) {
            log.warn("Stderr reading thread was interrupted", e);
        }

        if (errorOutput.length() &amp;gt; 0) {
            log.info("Python script stderr output: {}", errorOutput.toString());
        }

        // 6. Wait for process to complete and check exit code
        int exitCode = p.waitFor();
        if (exitCode != 0) {
            log.error("Python script execution failed, exit code: {}, stdout: {}, stderr: {}",
                    exitCode, output.toString(), errorOutput.toString());
            return ResultGenerator.genFailResult("Blink detection failed, please manually mark the blink time point");
        }

        // 7. Extract JSON from output (may contain other text, need to find JSON part)
        String fullOutput = output.toString().trim();
        String jsonOutput = ClingAiUtils.extractJsonFromOutput(fullOutput);

        if (jsonOutput == null || jsonOutput.isEmpty()) {
            log.error("Unable to extract JSON from Python output, full output: {}", fullOutput);
            log.error("stderr output: {}", errorOutput.toString());
            return ResultGenerator.genFailResult("Blink detection failed: unable to parse result, please manually mark the blink time point");
        }

        // 8. Parse JSON result
        log.info("JSON returned by Python script: {}", jsonOutput);
        JsonNode resultNode = objectMapper.readTree(jsonOutput);

        if (resultNode.has("success") &amp;amp;&amp;amp; resultNode.get("success").asBoolean()) {
            double blinkTime = resultNode.get("blinkTime").asDouble();
            Map&amp;lt;String, Object&amp;gt; resultMap = new HashMap&amp;lt;&amp;gt;();
            resultMap.put("blinkTime", blinkTime);
            return ResultGenerator.genSuccessResult(resultMap);
        } else {
            String errorMsg = resultNode.has("error")
                    ? resultNode.get("error").asText()
                    : "No blink detected";
            return ResultGenerator.genFailResult(errorMsg + ", please manually mark the blink time point");
        }

    } finally {
        // Clean up temporary files
        try {
            if (Files.exists(inputPath)) {
                Files.delete(inputPath);
            }
            if (Files.exists(workDir)) {
                Files.delete(workDir);
            }
        } catch (Exception e) {
            log.warn("Failed to clean up temporary files", e);
        }
    }

} catch (Exception e) {
    log.error("Exception occurred while detecting blink", e);
    return ResultGenerator.genFailResult("Exception occurred while detecting blink: " + e.getMessage() + 
                                         ", please manually mark the blink time point");
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;}&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;
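&lt;p&gt;The &lt;code&gt;ClingAiUtils.extractJsonFromOutput&lt;/code&gt; helper is referenced above but not listed in the article. A plausible brace-matching implementation (sketched here in Python for brevity; the real utility is Java) scans for the first balanced JSON object while ignoring braces inside quoted strings:&lt;/p&gt;

```python
def extract_json_from_output(text):
    """Return the first balanced {...} object in mixed stdout text, or None."""
    start = text.find("{")
    while start != -1:
        depth = 0
        in_str = False
        escaped = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_str:
                # Inside a quoted string: only track escapes and the closing quote
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_str = False
            elif ch == '"':
                in_str = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    return text[start:i + 1]
        # Unbalanced from this "{"; try the next candidate
        start = text.find("{", start + 1)
    return None
```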
&lt;h4&gt;4.3 Calling Blink Detection&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;// Step 4: Automatically detect blink time point in video&lt;br&gt;
log.info("Step 4: Automatically detect blink time point in video");&lt;br&gt;
Result detectResult = videoProcessService.detectBlink(videoFile);&lt;br&gt;
Double blinkTime;&lt;br&gt;
if (detectResult.getCode() != 200) {&lt;br&gt;
    log.warn("Automatic blink detection failed: {}, using default value 2.5 seconds", detectResult.getMessage());&lt;br&gt;
    // If detection fails, use default value&lt;br&gt;
    blinkTime = 2.5;&lt;br&gt;
    log.info("Using default blink time: {} seconds", blinkTime);&lt;br&gt;
} else {&lt;br&gt;
    Map&amp;lt;String, Object&amp;gt; detectData = (Map&amp;lt;String, Object&amp;gt;) detectResult.getData();&lt;br&gt;
    blinkTime = ((Number) detectData.get("blinkTime")).doubleValue();&lt;br&gt;
    log.info("Detected blink time: {} seconds", blinkTime);&lt;br&gt;
}&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;4.4 Python Blink Detection Script&lt;/h4&gt;
&lt;p&gt;Our blink detection script supports two detection methods: it prioritizes MediaPipe (high precision) and falls back to OpenCV (broader compatibility). Here is the complete implementation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
&lt;p&gt;# -*- coding: utf-8 -*-&lt;/p&gt;

&lt;p&gt;"""&lt;br&gt;
Video Blink Detection Script&lt;br&gt;
Uses mature libraries for accurate blink detection:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prioritize MediaPipe Face Mesh (Google open-source, high accuracy)&lt;/li&gt;
&lt;li&gt;Fallback to OpenCV Haar Cascades (simple but lower accuracy)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Dependencies installation:&lt;br&gt;
pip install opencv-python numpy mediapipe==0.10.9&lt;br&gt;
"""&lt;/p&gt;

&lt;p&gt;import sys&lt;br&gt;
import cv2&lt;br&gt;
import json&lt;br&gt;
import os&lt;br&gt;
import numpy as np&lt;/p&gt;

&lt;p&gt;# Set standard output encoding to UTF-8 (avoid Windows console garbled text)&lt;/p&gt;

&lt;p&gt;if sys.platform == 'win32':&lt;br&gt;
    try:&lt;br&gt;
        import io&lt;br&gt;
        if hasattr(sys.stdout, 'buffer'):&lt;br&gt;
            sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', &lt;br&gt;
                                         errors='replace', line_buffering=True)&lt;br&gt;
        if hasattr(sys.stderr, 'buffer'):&lt;br&gt;
            sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8', &lt;br&gt;
                                         errors='replace', line_buffering=True)&lt;br&gt;
    except Exception:&lt;br&gt;
        pass&lt;/p&gt;

&lt;p&gt;def detect_blink_with_mediapipe(video_path):&lt;br&gt;
    """&lt;br&gt;
    Use MediaPipe for more accurate blink detection&lt;br&gt;
    Requires installation: pip install mediapipe==0.10.9&lt;br&gt;
    """&lt;br&gt;
    try:&lt;br&gt;
        import mediapipe as mp&lt;br&gt;
    except ImportError as e:&lt;br&gt;
        print(f"MediaPipe not installed: {e}", file=sys.stderr)&lt;br&gt;
        return None&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check MediaPipe version and API availability
mp_version = getattr(mp, '__version__', 'unknown')
print(f"MediaPipe version: {mp_version}", file=sys.stderr)

# Check if solutions module exists (old API)
if not hasattr(mp, 'solutions'):
    print(f"MediaPipe {mp_version} uses new tasks API, does not support old solutions API", 
          file=sys.stderr)
    print("Please downgrade to a version that supports solutions: pip install mediapipe==0.10.9", 
          file=sys.stderr)
    return None

mp_face_mesh = mp.solutions.face_mesh
face_mesh = mp_face_mesh.FaceMesh(
    static_image_mode=False,
    max_num_faces=1,
    refine_landmarks=True,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5
)

cap = cv2.VideoCapture(video_path)
if not cap.isOpened():
    return None

fps = cap.get(cv2.CAP_PROP_FPS)
frame_count = 0

# Eye keypoint indices (MediaPipe 468-point model)
LEFT_EYE_INDICES = [33, 7, 163, 144, 145, 153, 154, 155, 133, 173,
                    157, 158, 159, 160, 161, 246]
RIGHT_EYE_INDICES = [362, 382, 381, 380, 374, 373, 390, 249, 263, 466,
                     388, 387, 386, 385, 384, 398]

def calculate_eye_aspect_ratio(landmarks, eye_indices):
    """Calculate Eye Aspect Ratio (EAR)"""
    eye_points = [landmarks[i] for i in eye_indices]
    if len(eye_points) &amp;lt; 6:
        return 1.0

    # Calculate vertical distances
    vertical_1 = abs(eye_points[1].y - eye_points[5].y)
    vertical_2 = abs(eye_points[2].y - eye_points[4].y)
    # Calculate horizontal distance
    horizontal = abs(eye_points[0].x - eye_points[3].x)

    if horizontal == 0:
        return 1.0

    ear = (vertical_1 + vertical_2) / (2.0 * horizontal)
    return ear

blink_times = []
ear_threshold = 0.25  # EAR threshold, values below this are considered blinks
consecutive_frames = 0

while True:
    ret, frame = cap.read()
    if not ret:
        break

    rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = face_mesh.process(rgb_frame)

    if results.multi_face_landmarks:
        landmarks = results.multi_face_landmarks[0].landmark

        # Calculate EAR for left and right eyes
        left_ear = calculate_eye_aspect_ratio(landmarks, LEFT_EYE_INDICES)
        right_ear = calculate_eye_aspect_ratio(landmarks, RIGHT_EYE_INDICES)
        avg_ear = (left_ear + right_ear) / 2.0

        # Detect blink
        if avg_ear &amp;lt; ear_threshold:
            consecutive_frames += 1
            if consecutive_frames == 1:  # Blink starts
                time_sec = frame_count / fps
                blink_times.append(time_sec)
        else:
            consecutive_frames = 0

    frame_count += 1
    # Limit processing frames (improve performance)
    if frame_count &amp;gt; 300:
        break

cap.release()
face_mesh.close()

if blink_times:
    return blink_times[0]
return None
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;def detect_blink_simple(video_path):&lt;br&gt;
    """&lt;br&gt;
    Improved OpenCV blink detection method: based on eye region changes and eye count&lt;br&gt;
    Use this improved version if MediaPipe is unavailable&lt;br&gt;
    """&lt;br&gt;
    cap = cv2.VideoCapture(video_path)&lt;br&gt;
    if not cap.isOpened():&lt;br&gt;
        return None&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fps = cap.get(cv2.CAP_PROP_FPS)
if fps &amp;lt;= 0:
    fps = 30.0

# Use OpenCV face detector
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_eye.xml')

blink_times = []
frame_count = 0
prev_eye_count = None
prev_eye_area = None
blink_threshold = 0.7  # Eye region change threshold
min_eye_area = 50

# Eye area history for smoothing
eye_area_history = []
history_size = 3

while True:
    ret, frame = cap.read()
    if not ret:
        break

    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(
        gray, scaleFactor=1.1, minNeighbors=3, minSize=(50, 50))

    current_eye_count = 0
    current_eye_area = 0

    if len(faces) &amp;gt; 0:
        # Select the largest face
        largest_face = max(faces, key=lambda f: f[2] * f[3])
        x, y, w, h = largest_face

        # Only detect upper half of face (eye region)
        roi_gray = gray[y:y+int(h*0.6), x:x+w]
        eyes = eye_cascade.detectMultiScale(
            roi_gray, scaleFactor=1.1, minNeighbors=2, minSize=(15, 15))
        current_eye_count = len(eyes)

        for (ex, ey, ew, eh) in eyes:
            eye_area = ew * eh
            if eye_area &amp;gt;= min_eye_area:
                current_eye_area += eye_area

    # Smoothing: use historical average
    eye_area_history.append(current_eye_area)
    if len(eye_area_history) &amp;gt; history_size:
        eye_area_history.pop(0)
    avg_eye_area = sum(eye_area_history) / len(eye_area_history) if eye_area_history else 0

    # Blink detection logic
    if prev_eye_count is not None and prev_eye_area is not None:
        # Method 1: Eye count change (from 2 to 0 or 1)
        if prev_eye_count &amp;gt;= 2 and current_eye_count &amp;lt; 2:
            time_sec = (frame_count - 1) / fps
            blink_times.append(time_sec)
        # Method 2: Eye area suddenly decreases
        elif prev_eye_area &amp;gt; min_eye_area and avg_eye_area &amp;gt; 0:
            area_ratio = avg_eye_area / prev_eye_area if prev_eye_area &amp;gt; 0 else 1.0
            area_drop = (prev_eye_area - avg_eye_area) / prev_eye_area if prev_eye_area &amp;gt; 0 else 0
            if area_ratio &amp;lt; blink_threshold or area_drop &amp;gt; 0.15:
                time_sec = (frame_count - 1) / fps
                if not blink_times or abs(blink_times[-1] - time_sec) &amp;gt; 0.3:
                    blink_times.append(time_sec)

    prev_eye_count = current_eye_count
    prev_eye_area = avg_eye_area if avg_eye_area &amp;gt; 0 else (prev_eye_area if prev_eye_area else 0)
    frame_count += 1

    # Limit processing time (process first 15 seconds or first 450 frames)
    max_frames = min(450, int(fps * 15))
    if frame_count &amp;gt;= max_frames:
        break

cap.release()

if blink_times:
    return blink_times[0]
return None
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;def main():&lt;br&gt;
    if len(sys.argv) &amp;lt; 2:&lt;br&gt;
        result = {&lt;br&gt;
            "error": "Video path must be provided as argument",&lt;br&gt;
            "success": False&lt;br&gt;
        }&lt;br&gt;
        print(json.dumps(result, ensure_ascii=False))&lt;br&gt;
        sys.exit(1)&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;video_path = sys.argv[1]

if not os.path.exists(video_path):
    result = {
        "error": f"Video file does not exist: {video_path}",
        "success": False
    }
    print(json.dumps(result, ensure_ascii=False))
    sys.exit(1)

# Prioritize MediaPipe (most accurate)
blink_time = None
detection_method = None

try:
    blink_time = detect_blink_with_mediapipe(video_path)
    if blink_time is not None:
        detection_method = "mediapipe"
except Exception as e:
    print(f"MediaPipe detection exception: {e}", file=sys.stderr)

# If MediaPipe fails, use OpenCV simple method (as fallback)
if blink_time is None:
    try:
        blink_time = detect_blink_simple(video_path)
        if blink_time is not None:
            detection_method = "opencv"
    except Exception as e:
        print(f"OpenCV detection exception: {e}", file=sys.stderr)

if blink_time is not None:
    result = {
        "blinkTime": round(blink_time, 2),
        "success": True,
        "method": detection_method or "unknown"
    }
else:
    result = {
        "error": "No blink detected. Possible reasons: 1) No face in video 2) Poor face angle 3) Low video quality 4) MediaPipe not properly installed. Please manually mark the blink time point.",
        "success": False
    }

# Output JSON result to stdout (error messages already output to stderr)
json_output = json.dumps(result, ensure_ascii=False)
print(json_output, flush=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;if __name__ == "__main__":&lt;br&gt;
    main()&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Script Features&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Dual Algorithm Support&lt;/strong&gt;: Prioritizes MediaPipe (high-precision EAR algorithm), falls back to OpenCV (compatibility)&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;EAR Algorithm&lt;/strong&gt;: MediaPipe uses the Eye Aspect Ratio (EAR) for precise blink detection&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Multiple Detection Methods&lt;/strong&gt;: OpenCV combines eye-count changes, eye-area changes, and other cues&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Smoothing&lt;/strong&gt;: Uses historical frame averages to reduce noise&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Performance Optimization&lt;/strong&gt;: Limits the number of processed frames to speed up detection&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Error Handling&lt;/strong&gt;: Comprehensive exception handling and log output&lt;/p&gt;
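&lt;p&gt;As a standalone illustration of the EAR idea described above (hypothetical landmark points, not the article's MediaPipe code), the ratio can be computed and thresholded like this:&lt;/p&gt;

```python
# Minimal EAR sketch: six (x, y) eye landmarks ordered p1..p6,
# with p1/p4 the horizontal corners and p2/p6, p3/p5 the vertical pairs.
def eye_aspect_ratio(pts):
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    vertical = dist(pts[1], pts[5]) + dist(pts[2], pts[4])
    horizontal = dist(pts[0], pts[3])
    return vertical / (2.0 * horizontal) if horizontal else 1.0

# Synthetic examples: an open eye is tall relative to its width,
# while a closing eye collapses vertically as the width stays the same.
open_eye = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
closed_eye = [(0, 0), (1, 0.2), (2, 0.2), (3, 0), (2, -0.2), (1, -0.2)]
print(round(eye_aspect_ratio(open_eye), 3))    # 0.667, above the 0.25 threshold
print(round(eye_aspect_ratio(closed_eye), 3))  # 0.133, below it
```

&lt;p&gt;In the real script the same comparison runs per frame, with &lt;code&gt;consecutive_frames&lt;/code&gt; marking the frame where a closure starts.&lt;/p&gt;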
&lt;h3&gt;Step 5: Generate Loop Video (FFmpeg Processing)&lt;/h3&gt;
&lt;p&gt;This is the final and most critical step. We need to:&lt;/p&gt;
&lt;p&gt;▪ Extract 1 second before and after the blink time point (2 seconds total)&lt;/p&gt;
&lt;p&gt;▪ Reverse the 2-second clip&lt;/p&gt;
&lt;p&gt;▪ Concatenate the original clip and the reversed clip to form a 4-second loop video&lt;/p&gt;
&lt;h4&gt;5.1 Complete Loop Video Generation Implementation&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;@Override&lt;br&gt;
public Result loopVideo(MultipartFile video, Double blinkTime, &lt;br&gt;
                       Double beforeSeconds, Double afterSeconds, String userId) {&lt;br&gt;
    try {&lt;br&gt;
        if (video == null || video.isEmpty()) {&lt;br&gt;
            return ResultGenerator.genFailResult("Video file cannot be empty");&lt;br&gt;
        }
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    double before = beforeSeconds == null ? 1.0 : beforeSeconds;
    double after = afterSeconds == null ? 1.0 : afterSeconds;

    // Create temporary working directory
    Path workDir = Files.createTempDirectory("clingai-loop-");
    Path inputPath = workDir.resolve("input.mp4");
    Path clipPath = workDir.resolve("clip.mp4");
    Path revPath = workDir.resolve("reversed.mp4");
    Path outPath = workDir.resolve("loop.mp4");

    try {
        // 1. Save input video
        Files.copy(video.getInputStream(), inputPath, StandardCopyOption.REPLACE_EXISTING);

        // 2. Calculate clipping parameters
        double t = blinkTime == null ? 2.5 : blinkTime;
        double start = Math.max(0.0, t - before);
        double duration = before + after;

        // 3. Extract video clip (2 seconds)
        int clipExit = runFfmpeg(new String[]{
                "ffmpeg", "-y", "-ss", String.valueOf(start),
                "-t", String.valueOf(duration), "-i", inputPath.toString(),
                "-an", "-c:v", "libx264", "-pix_fmt", "yuv420p",
                clipPath.toString()
        });
        if (clipExit != 0) {
            return ResultGenerator.genFailResult("ffmpeg clipping failed");
        }

        // 4. Reverse video clip
        int revExit = runFfmpeg(new String[]{
                "ffmpeg", "-y", "-i", clipPath.toString(),
                "-vf", "reverse", "-an", "-c:v", "libx264",
                "-pix_fmt", "yuv420p", revPath.toString()
        });
        if (revExit != 0) {
            return ResultGenerator.genFailResult("ffmpeg reverse failed");
        }

        // 5. Concatenate original clip and reversed clip (4-second loop video)
        int concatExit = runFfmpeg(new String[]{
                "ffmpeg", "-y", "-i", clipPath.toString(),
                "-i", revPath.toString(),
                "-filter_complex", "[0:v][1:v]concat=n=2:v=1:a=0[v]",
                "-map", "[v]", "-an", "-c:v", "libx264",
                "-pix_fmt", "yuv420p", outPath.toString()
        });
        if (concatExit != 0) {
            return ResultGenerator.genFailResult("ffmpeg concatenation failed");
        }

        // 6. Read generated video
        byte[] outBytes = Files.readAllBytes(outPath);

        // 7. Create MultipartFile object
        MultipartFile outFile = new MultipartFile() {
            @Override
            public String getName() {
                return "file";
            }

            @Override
            public String getOriginalFilename() {
                return "loop.mp4";
            }

            @Override
            public String getContentType() {
                return "video/mp4";
            }

            @Override
            public boolean isEmpty() {
                return outBytes.length == 0;
            }

            @Override
            public long getSize() {
                return outBytes.length;
            }

            @Override
            public byte[] getBytes() throws IOException {
                return outBytes;
            }

            @Override
            public InputStream getInputStream() throws IOException {
                return new ByteArrayInputStream(outBytes);
            }

            @Override
            public void transferTo(java.io.File dest) throws IOException, IllegalStateException {
                Files.write(dest.toPath(), outBytes);
            }
        };

        // 8. Save file
        try {
            AppFile appFile = saveFile(outFile, userId);
            return ResultGenerator.genSuccessResult(appFile);
        } catch (Exception saveException) {
            log.error("Failed to save file", saveException);
            // If file save fails, try returning temporary file path
            Map&amp;lt;String, Object&amp;gt; resultMap = new HashMap&amp;lt;&amp;gt;();
            resultMap.put("fileUrl", "/temp/" + outPath.getFileName().toString());
            resultMap.put("fileName", "loop.mp4");
            resultMap.put("fileType", "video/mp4");
            resultMap.put("message", "File generated but failed to save to database: " + saveException.getMessage());
            return ResultGenerator.genSuccessResult(resultMap);
        }

    } finally {
        // Clean up temporary files
        try {
            if (Files.exists(inputPath)) {
                Files.delete(inputPath);
            }
            if (Files.exists(clipPath)) {
                Files.delete(clipPath);
            }
            if (Files.exists(revPath)) {
                Files.delete(revPath);
            }
            if (Files.exists(outPath)) {
                Files.delete(outPath);
            }
            if (Files.exists(workDir)) {
                Files.delete(workDir);
            }
        } catch (Exception e) {
            log.warn("Failed to clean up temporary files", e);
        }
    }
} catch (Exception e) {
    log.error("Failed to process loop video", e);
    String errorMsg = e.getMessage();
    if (errorMsg == null || errorMsg.isEmpty()) {
        errorMsg = e.getClass().getSimpleName();
    }
    return ResultGenerator.genFailResult("Failed to process loop video: " + errorMsg);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;}&lt;/p&gt;

&lt;p&gt;/**&lt;br&gt;
 * Execute FFmpeg command&lt;br&gt;
 */&lt;br&gt;
private int runFfmpeg(String[] command) throws IOException, InterruptedException {&lt;br&gt;
    ProcessBuilder pb = new ProcessBuilder(command);&lt;br&gt;
    pb.redirectErrorStream(true);&lt;br&gt;
    Process p = pb.start();&lt;br&gt;
    try (InputStream is = p.getInputStream()) {&lt;br&gt;
        byte[] buf = new byte[1024];&lt;br&gt;
        while (is.read(buf) != -1) {&lt;br&gt;
            // Read output to avoid buffer blocking&lt;br&gt;
        }&lt;br&gt;
    }&lt;br&gt;
    return p.waitFor();&lt;br&gt;
}&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;5.2 FFmpeg Command Details&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Extract Video Clip&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ffmpeg -y -ss 1.5 -t 2.0 -i input.mp4 -an -c:v libx264 -pix_fmt yuv420p clip.mp4&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;▪ &lt;code&gt;-ss 1.5&lt;/code&gt;: Start from the 1.5-second mark&lt;/p&gt;
&lt;p&gt;▪ &lt;code&gt;-t 2.0&lt;/code&gt;: Extract 2 seconds&lt;/p&gt;
&lt;p&gt;▪ &lt;code&gt;-an&lt;/code&gt;: Remove audio&lt;/p&gt;
&lt;p&gt;▪ &lt;code&gt;-c:v libx264&lt;/code&gt;: Encode with H.264&lt;/p&gt;
&lt;p&gt;▪ &lt;code&gt;-pix_fmt yuv420p&lt;/code&gt;: Pixel format for broad player compatibility&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reverse Video&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ffmpeg -y -i clip.mp4 -vf reverse -an -c:v libx264 -pix_fmt yuv420p reversed.mp4&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;▪ &lt;code&gt;-vf reverse&lt;/code&gt;: Video filter that reverses playback&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Concatenate Video&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ffmpeg -y -i clip.mp4 -i reversed.mp4 -filter_complex "[0:v][1:v]concat=n=2:v=1:a=0[v]" -map "[v]" -an -c:v libx264 -pix_fmt yuv420p loop.mp4&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;▪ &lt;code&gt;concat=n=2:v=1:a=0&lt;/code&gt;: Concatenate 2 inputs, one video stream, no audio stream&lt;/p&gt;
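&lt;p&gt;Putting the three commands together, the clipping parameters and command lines can be sketched in Python (file names and the helper functions are illustrative; actually running them assumes &lt;code&gt;ffmpeg&lt;/code&gt; is on the PATH):&lt;/p&gt;

```python
import subprocess

def build_loop_commands(input_mp4, blink_time, before=1.0, after=1.0,
                        clip="clip.mp4", rev="reversed.mp4", out="loop.mp4"):
    """Return the three FFmpeg command lines used above: extract, reverse, concat."""
    start = max(0.0, blink_time - before)
    duration = before + after
    extract = ["ffmpeg", "-y", "-ss", str(start), "-t", str(duration),
               "-i", input_mp4, "-an", "-c:v", "libx264",
               "-pix_fmt", "yuv420p", clip]
    reverse = ["ffmpeg", "-y", "-i", clip, "-vf", "reverse", "-an",
               "-c:v", "libx264", "-pix_fmt", "yuv420p", rev]
    concat = ["ffmpeg", "-y", "-i", clip, "-i", rev,
              "-filter_complex", "[0:v][1:v]concat=n=2:v=1:a=0[v]",
              "-map", "[v]", "-an", "-c:v", "libx264",
              "-pix_fmt", "yuv420p", out]
    return [extract, reverse, concat]

def make_loop(input_mp4, blink_time):
    # Run the three steps in order; check=True raises on a non-zero exit code,
    # mirroring the exit-code checks in the Java service.
    for cmd in build_loop_commands(input_mp4, blink_time):
        subprocess.run(cmd, check=True)
```

&lt;p&gt;With &lt;code&gt;blink_time = 2.5&lt;/code&gt; and the 1.0-second defaults, the extract step becomes exactly the &lt;code&gt;-ss 1.5 -t 2.0&lt;/code&gt; command shown above.&lt;/p&gt;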
&lt;pre&gt;&lt;code&gt;ProcessBuilder pb = new ProcessBuilder(&lt;br&gt;
"python",&lt;br&gt;
scriptPath.toString(),&lt;br&gt;
inputPath.toString()&lt;br&gt;
);&lt;br&gt;
pb.redirectErrorStream(false); // Read stdout and stderr separately&lt;br&gt;
Process p = pb.start();

&lt;p&gt;// Read stdout (JSON output)&lt;br&gt;
StringBuilder output = new StringBuilder();&lt;br&gt;
try (BufferedReader reader = new BufferedReader(&lt;br&gt;
        new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {&lt;br&gt;
    String line;&lt;br&gt;
    while ((line = reader.readLine()) != null) {&lt;br&gt;
        output.append(line).append("\n");&lt;br&gt;
    }&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;// Extract JSON from output&lt;br&gt;
String jsonOutput = ClingAiUtils.extractJsonFromOutput(output.toString());&lt;br&gt;
JsonNode resultNode = objectMapper.readTree(jsonOutput);&lt;br&gt;
double blinkTime = resultNode.get("blinkTime").asDouble();&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;
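&lt;p&gt;The &lt;code&gt;extractJsonFromOutput&lt;/code&gt; helper is not shown in this excerpt; the idea — picking the JSON result line out of mixed process output — can be sketched as follows (an assumption about its behavior, not the actual implementation):&lt;/p&gt;

```python
import json

def extract_json_from_output(output):
    # Scan from the last line backwards and return the first line that
    # parses as a JSON object; diagnostic lines the script printed
    # (e.g. "MediaPipe version: ...") are skipped. Hypothetical sketch.
    for line in reversed(output.splitlines()):
        line = line.strip()
        if line.startswith("{") and line.endswith("}"):
            try:
                return json.loads(line)
            except json.JSONDecodeError:
                continue
    return None

mixed = 'MediaPipe version: 0.10.9\n{"blinkTime": 2.31, "success": true}'
print(extract_json_from_output(mixed))  # {'blinkTime': 2.31, 'success': True}
```

&lt;p&gt;This works because the Python script writes diagnostics to stderr and the final JSON result to stdout, so a well-formed run leaves exactly one JSON line to find.&lt;/p&gt;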
&lt;h2&gt;IV. Core Interface Implementation&lt;/h2&gt;
&lt;h3&gt;4.1 Complete Process Interface&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;@PostMapping("/generateLoopVideo")&lt;br&gt;
@ApiOperation(value = "Complete process: Upload image to generate loop video (auto-detect blink)")&lt;br&gt;
public Result generateLoopVideo(&lt;br&gt;
        @RequestPart("image") MultipartFile image,&lt;br&gt;
        @RequestParam(value = "prompt", required = false) String prompt,&lt;br&gt;
        @RequestParam(value = "beforeSeconds", required = false, defaultValue = "1.0") Double beforeSeconds,&lt;br&gt;
        @RequestParam(value = "afterSeconds", required = false, defaultValue = "1.0") Double afterSeconds,&lt;br&gt;
        @RequestParam(value = "maxPollingTime", required = false, defaultValue = "300") Integer maxPollingTime) {
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if (image == null || image.isEmpty()) {
    return ResultGenerator.genFailResult("Image file cannot be empty");
}

// Get current user ID
String userId = getCurrentTokenUserId();

// Call Service layer to complete the full process
return clingAiService.generateLoopVideo(
    image, prompt, beforeSeconds, afterSeconds, maxPollingTime, userId
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;}&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4.2 Interface Parameters&lt;/h3&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Required&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;image&lt;/td&gt;
&lt;td&gt;MultipartFile&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;Digital human image&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;prompt&lt;/td&gt;
&lt;td&gt;String&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Default prompt&lt;/td&gt;
&lt;td&gt;Video generation prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;beforeSeconds&lt;/td&gt;
&lt;td&gt;Double&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;Duration to extract before blink time point (seconds)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;afterSeconds&lt;/td&gt;
&lt;td&gt;Double&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;Duration to extract after blink time point (seconds)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;maxPollingTime&lt;/td&gt;
&lt;td&gt;Integer&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;Maximum waiting time for video generation (seconds)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;4.3 Response Result&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;{&lt;br&gt;
  "code": 200,&lt;br&gt;
  "message": "success",&lt;br&gt;
  "data": {&lt;br&gt;
    "id": "File ID",&lt;br&gt;
    "fileName": "loop.mp4",&lt;br&gt;
    "fileUrl": "/uploadFiles/2026/02/02/xxx.mp4",&lt;br&gt;
    "detectedBlinkTime": 2.5,&lt;br&gt;
    "originalTaskId": "Kling AI Task ID",&lt;br&gt;
    "originalVideoUrl": "Original Video URL"&lt;br&gt;
  }&lt;br&gt;
}&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;V. Technical Highlights&lt;/h2&gt;
&lt;h3&gt;5.1 Intelligent Blink Detection&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Multiple Algorithm Support&lt;/strong&gt;: Prioritizes MediaPipe (high precision) and falls back to OpenCV (compatibility)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;EAR Algorithm&lt;/strong&gt;: Uses the Eye Aspect Ratio (EAR) for precise blink detection&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fault Tolerance&lt;/strong&gt;: Falls back to a default value (2.5 seconds, the video midpoint) when detection fails&lt;/p&gt;
&lt;h3&gt;5.2 Seamless Loop Design&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Still + Blink&lt;/strong&gt;: The prompt design keeps the digital human facing the screen and completely still, with only a natural blink after 1 second&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Precise Extraction&lt;/strong&gt;: Extracts 1 second on each side of the blink, centered on the blink time point&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reverse and Concatenate&lt;/strong&gt;: Original clip + reversed clip = a perfect loop (the blink connects naturally at the loop point)&lt;/p&gt;
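&lt;p&gt;The loop timeline works out as a quick arithmetic check, using the article's defaults:&lt;/p&gt;

```python
# With blinkTime = 2.5 s and before = after = 1.0 s:
blink_time, before, after = 2.5, 1.0, 1.0

start = max(0.0, blink_time - before)   # clip starts at 1.5 s
duration = before + after               # forward clip lasts 2.0 s
loop_length = 2 * duration              # forward + reversed = 4.0 s

print(start, duration, loop_length)  # 1.5 2.0 4.0
```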
&lt;h3&gt;5.3 Architecture Design&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Layered Architecture&lt;/strong&gt;: Controller → Service → Utils, clear responsibilities&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Asynchronous Processing&lt;/strong&gt;: Video generation is asynchronous, polling query status&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Error Handling&lt;/strong&gt;: Comprehensive exception handling and logging&lt;/p&gt;
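&lt;p&gt;The asynchronous flow above amounts to a bounded polling loop. A minimal sketch, assuming a caller-supplied &lt;code&gt;query_status&lt;/code&gt; callable as a stand-in for the real task-status API; the 300-second default mirrors the &lt;code&gt;maxPollingTime&lt;/code&gt; parameter from the request table.&lt;/p&gt;

```python
import time

def poll_until_done(query_status, max_polling_time=300, interval=5):
    """Poll query_status() until it reports completion or the time
    budget (maxPollingTime, in seconds) runs out."""
    deadline = time.monotonic() + max_polling_time
    while time.monotonic() < deadline:
        status = query_status()          # e.g. {"done": bool, "result": ...}
        if status.get("done"):
            return status.get("result")
        time.sleep(interval)
    raise TimeoutError("video generation did not finish within maxPollingTime")
```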
&lt;h2&gt;VI. Summary&lt;/h2&gt;
&lt;p&gt;Through the complete technical solution of &lt;strong&gt;AI Video Generation + Intelligent Blink Detection + FFmpeg Video Processing&lt;/strong&gt;, we have successfully achieved:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F2705.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F2705.png" alt="✅" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;Perfect 4-Second Loop&lt;/strong&gt;: Original 2-second clip + Reversed 2-second clip = 4-second seamless loop&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F2705.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F2705.png" alt="✅" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;Natural Movement&lt;/strong&gt;: Intelligent extraction based on blink time point ensures natural loop&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F2705.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F2705.png" alt="✅" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;Automated Process&lt;/strong&gt;: Fully automated from image upload to loop video generation&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F2705.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F2705.png" alt="✅" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;High-Quality Output&lt;/strong&gt;: Use Kling AI Pro mode to generate high-quality videos&lt;/p&gt;
&lt;p&gt;This solution not only addresses NavTalk’s digital human loop video requirements but also provides a solid foundation for future extensions (such as different loop durations, custom loop points, etc.).&lt;/p&gt;
&lt;h3&gt;Feature Release Plan&lt;/h3&gt;
&lt;p&gt;We will officially release this feature in the near future, allowing users to &lt;strong&gt;directly upload a custom character image&lt;/strong&gt;, and the system will automatically generate a vivid 4-second loop video. The generated videos can be directly applied to digital human displays in NavTalk, providing users with a more personalized and vivid conversation experience. This feature will significantly lower the barrier to digital human video production, enabling every user to easily create their own exclusive digital human avatar.&lt;/p&gt;
&lt;p&gt;The post &lt;a href="https://frankfu.blog/openai/navtalk-digital-human-loop-video-generation-technical-implementation/" rel="noopener noreferrer"&gt;NavTalk Digital Human Loop Video Generation Technical Implementation&lt;/a&gt; appeared first on &lt;a href="https://frankfu.blog" rel="noopener noreferrer"&gt;Frank Fu's Blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>openai</category>
    </item>
    <item>
      <title>Understanding Reinforcement Learning through OpenDuck</title>
      <dc:creator>Frank Fu</dc:creator>
      <pubDate>Mon, 30 Mar 2026 08:50:12 +0000</pubDate>
      <link>https://dev.to/frankfu/understanding-reinforcement-learning-through-openduck-1if0</link>
      <guid>https://dev.to/frankfu/understanding-reinforcement-learning-through-openduck-1if0</guid>
      <description>&lt;blockquote&gt;&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;: Replicate the OpenDuck Mini project and control it using the RDK X5 development board.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;OpenDuck Mini is an open-source robotics project aimed at creating a miniature, low-cost replica of Disney’s BDX Droid. The project was initiated and is maintained by developer Antoine Pirrone (apirrone).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;▪ Project Research&lt;/p&gt;
&lt;p&gt;      ▪ International Projects&lt;/p&gt;
&lt;p&gt;      ▪ Domestic Projects&lt;/p&gt;
&lt;p&gt;▪ OpenDuck Development Workflow&lt;/p&gt;
&lt;p&gt;▪ OpenDuck Repository Overview&lt;/p&gt;
&lt;p&gt;▪ Raspberry Pi Zero 2W Deployment Process&lt;/p&gt;
&lt;p&gt;▪ RDK X5 Deployment Process&lt;/p&gt;
&lt;p&gt;▪ Frequently Asked Questions (FAQ)&lt;/p&gt;
&lt;p&gt;▪ Reinforcement Learning&lt;/p&gt;
&lt;h2&gt;I. Project Research&lt;/h2&gt;
&lt;h3&gt;1.1 International Projects&lt;/h3&gt;
&lt;blockquote&gt;&lt;p&gt;Focus on algorithm implementation and community ecosystem.&lt;/p&gt;&lt;/blockquote&gt;
&lt;h4&gt;1.1.1 🇺🇸 OpenDuck Mini&lt;/h4&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Link&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/apirrone/Open_Duck_Mini" rel="noopener noreferrer"&gt;Open_Duck_Mini&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hardware Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Raspberry Pi Zero 2W&lt;/code&gt; + &lt;code&gt;Feetech ST3215 Servo&lt;/code&gt; + &lt;code&gt;IMU&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Ultra-low cost (&amp;lt;$400)&lt;/strong&gt;, fully 3D-printed structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tech Stack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Sim2Real (MuJoCo)&lt;/strong&gt;, successfully implemented reinforcement learning control on low-cost servos&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
⭐ Best for beginners, suitable as a low-cost educational tool or desktop display project&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;1.1.2 🇺🇸 K-Scale Labs (Stompy)&lt;/h4&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Link&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/kscalelabs" rel="noopener noreferrer"&gt;github.com/kscalelabs&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hardware Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Committed to full-stack open source, including self-developed driver boards and host computers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large community scale, dedicated to establishing a universal humanoid robot standard (K-Lang)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Adopts an “ecosystem” development strategy, aiming to become the Android platform of the robotics field&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;1.1.3 🇺🇸 Berkeley Humanoid Lite&lt;/h4&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Link&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://berkeley-humanoid-lite.gitbook.io/docs/releases" rel="noopener noreferrer"&gt;berkeley-humanoid-lite&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hardware Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;High-performance brushless motors&lt;/strong&gt; + &lt;strong&gt;3D-printed gearboxes&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Academic “low-cost” research platform benchmark (&amp;lt;$5000), designed specifically for reinforcement learning research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hardcore research-oriented, suitable for studying high-dynamic motion control (such as jumping, backflips, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;1.1.4 🇫🇷 Poppy Project &amp;amp; 🇰🇷 Robotis OP3&lt;/h4&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Link&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.poppy-project.org/" rel="noopener noreferrer"&gt;Poppy&lt;/a&gt; | &lt;a href="https://emanual.robotis.com/docs/en/platform/op3/introduction/" rel="noopener noreferrer"&gt;Robotis&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hardware Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Dynamixel&lt;/code&gt; high-end servos + &lt;code&gt;x86/SBC&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
⚠ &lt;strong&gt;Previous generation technology route&lt;/strong&gt;, relies on expensive Dynamixel servos, not suitable for end-to-end reinforcement learning applications&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;1.2 Domestic Projects&lt;/h3&gt;
&lt;blockquote&gt;&lt;p&gt;Domestic projects are generally more aggressive in &lt;strong&gt;brushless motor (BLDC/FOC)&lt;/strong&gt; applications with stronger hardware performance.&lt;/p&gt;&lt;/blockquote&gt;
&lt;h4&gt;1.2.1 🇨🇳 Kit-Miao (Damiao Technology)&lt;/h4&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Link&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://gitee.com/kit-miao/bipedal-robot" rel="noopener noreferrer"&gt;Gitee Repo&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hardware Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Damiao joint motors&lt;/strong&gt; (integrated FOC driver) + &lt;code&gt;STM32/ESP32&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mature technical solution, provides complete source code for both MPC and reinforcement learning algorithms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
⭐ &lt;strong&gt;Highly suitable for secondary development&lt;/strong&gt;, motor performance is in the first tier of domestic products&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;1.2.2 🇨🇳 Unitree Qmini (Yushu)&lt;/h4&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Link&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/unitreerobotics/Qmini" rel="noopener noreferrer"&gt;Unitree GitHub&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hardware Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Unitree 8010 hub motors&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Only includes leg structure, official Isaac Gym training environment provided&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large company technology downscaling, high motor reliability and excellent algorithm performance ceiling&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;1.2.3 🇨🇳 AlexBot (Alexhuge1)&lt;/h4&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Link&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/Alexhuge1/Alexbot" rel="noopener noreferrer"&gt;Github&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hardware Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-made/modified brushless motors + &lt;code&gt;ODrive&lt;/code&gt; or similar FOC drivers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Personal geek project, adapted to &lt;code&gt;Humanoid-Gym&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hardcore DIY representative, suitable for in-depth research on motor control and mechanical design&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;1.2.4 🇨🇳 HighTorque &amp;amp; FFTAI&lt;/h4&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Link&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.hightorque.cn/pi/" rel="noopener noreferrer"&gt;HighTorque&lt;/a&gt; | &lt;a href="https://www.fftai.cn/grx" rel="noopener noreferrer"&gt;FFTAI&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Leaning towards &lt;strong&gt;commercial products&lt;/strong&gt;. HighTorque is suitable as a teaching tool; FFTAI is suitable for university laboratory procurement&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;II. OpenDuck Development Workflow&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;flowchart LR&lt;br&gt;
    A[🛠 Modeling &amp;amp; Simulation] --&amp;gt; B[🏃 Motion Generation]&lt;br&gt;
    B --&amp;gt; C[🧠 Reinforcement Learning]&lt;br&gt;
    C --&amp;gt; D[🖨 Hardware Construction]&lt;br&gt;
    D --&amp;gt; E[🚀 Runtime Deployment]&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2.1 Phase 1: Model and Simulation Preparation&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Reference: &lt;code&gt;prepare_robot.md&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Tool/Operation&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. Modeling &amp;amp; Export&lt;/td&gt;
&lt;td&gt;SolidWorks / Onshape + &lt;code&gt;onshape2robot&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;URDF file&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. MuJoCo Configuration&lt;/td&gt;
&lt;td&gt;Execute &lt;code&gt;MUJOCO compile&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;MuJoCo XML&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. Model Correction&lt;/td&gt;
&lt;td&gt;Modify XML (add actuator, free joint)&lt;/td&gt;
&lt;td&gt;Complete XML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4. Simulation Verification&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;simulate&lt;/code&gt; to confirm scene&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;2.2 Phase 2: Motion Generation&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Repository: &lt;code&gt;reference_motion_generator&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Input&lt;/strong&gt;: Motion generator (polynomial fitting)&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Output&lt;/strong&gt;: &lt;strong&gt;Reference motion pkl&lt;/strong&gt;&lt;/p&gt;
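&lt;p&gt;To illustrate the polynomial-fitting idea for one joint (the function name, rates, and degree are assumptions, not the generator's actual code): sparse joint-angle keyframes are fitted with a polynomial and then resampled at the controller rate.&lt;/p&gt;

```python
import numpy as np

def fit_reference_motion(times, angles, degree=3, control_hz=50, duration=1.0):
    """Fit a polynomial to joint-angle keyframes and resample it at
    control_hz -- a toy version of a reference-motion trajectory."""
    coeffs = np.polyfit(times, angles, degree)
    t = np.arange(0.0, duration, 1.0 / control_hz)
    return t, np.polyval(coeffs, t)
```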
&lt;h3&gt;2.3 Phase 3: Reinforcement Learning&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Repository: &lt;code&gt;playground&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Input&lt;/strong&gt;: Reference motion pkl file + verified XML scene file&lt;/p&gt;
&lt;p&gt;▪ &lt;strong&gt;Core Task&lt;/strong&gt;: Sim2Real training (train and verify the robot control policy in a virtual environment)&lt;/p&gt;
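&lt;p&gt;The heart of such motion-imitation training is a tracking reward. A common form (not taken from the OpenDuck playground code; &lt;code&gt;sigma&lt;/code&gt; is an illustrative scale) rewards the policy for staying close to the reference joint angles:&lt;/p&gt;

```python
import math

def imitation_reward(q, q_ref, sigma=0.25):
    """Exponential tracking reward: 1.0 when joint angles q exactly
    match the reference q_ref, decaying with squared error."""
    err = sum((a - b) ** 2 for a, b in zip(q, q_ref))
    return math.exp(-err / (2 * sigma ** 2))
```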
&lt;h3&gt;2.4 Phase 4: Hardware Construction&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Repository: Main repository&lt;/em&gt;&lt;/p&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Reference Document&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3D Print Parts&lt;/td&gt;
&lt;td&gt;&lt;code&gt;print_guide.md&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Assemble Robot&lt;/td&gt;
&lt;td&gt;&lt;code&gt;assembly_guide.md&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Connect Circuit&lt;/td&gt;
&lt;td&gt;&lt;code&gt;open_duck_mini_v2_wiring_diagram.png&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;2.5 Phase 5: Runtime Deployment&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Repository: &lt;code&gt;Runtime&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;1. System environment installation&lt;/p&gt;
&lt;p&gt;2. Servo + IMU initialization&lt;/p&gt;
&lt;p&gt;3. Controller Bluetooth connection&lt;/p&gt;
&lt;p&gt;4. Foot sensor debugging&lt;/p&gt;
&lt;p&gt;5. &lt;strong&gt;Sim2Real deployment&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;III. OpenDuck Repository Overview&lt;/h2&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Open Duck Mini&lt;/td&gt;
&lt;td&gt;Documentation + 3D print models&lt;/td&gt;
&lt;td&gt;Parts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open Duck Mini Runtime&lt;/td&gt;
&lt;td&gt;Real robot inference + Sim2Real&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open Duck Playground&lt;/td&gt;
&lt;td&gt;GPU parallel training strategy&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;.onnx&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open Duck reference motion generator&lt;/td&gt;
&lt;td&gt;Gait generator&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;.pkl&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;IV. Raspberry Pi Zero 2W Deployment Process&lt;/h2&gt;
&lt;blockquote&gt;&lt;p&gt;Steps such as flashing the image, setting the WiFi password, and enabling I2C are covered by many online tutorials. However, because we hit WiFi connection issues during actual deployment and found some differences from the official documentation, this article records the complete deployment process for reference.&lt;/p&gt;&lt;/blockquote&gt;
&lt;h3&gt;4.1 Flash Image&lt;/h3&gt;
&lt;p&gt;Follow the standard image-flashing process; be sure to select the &lt;strong&gt;headless version&lt;/strong&gt; (Lite), and configure the WiFi account and password in advance.&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;⚠ &lt;strong&gt;Recommended to use the same image version as the tutorial&lt;/strong&gt;: &lt;code&gt;2025-12-04-raspios-trixie-arm64-lite.img.xz&lt;/code&gt;&lt;/p&gt;&lt;/blockquote&gt;
&lt;h3&gt;4.2 SD Card Expansion&lt;/h3&gt;
&lt;p&gt;After image flashing is complete, the actual available space is usually only a small portion of the SD card’s total capacity, requiring filesystem expansion.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 32GB SD card may only show 7GB after flashing&lt;br&gt;
sudo raspi-config -&amp;gt; Advanced options -&amp;gt; Expand Filesystem&lt;br&gt;
# Verify&lt;br&gt;
df -h&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4.3 APT Source Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Backup&lt;br&gt;
sudo cp /etc/apt/sources.list.d/debian.sources /etc/apt/sources.list.d/debian.sources.bak&lt;br&gt;
sudo cp /etc/apt/sources.list.d/raspi.sources /etc/apt/sources.list.d/raspi.sources.bak&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Modify Debian main source&lt;/strong&gt; (&lt;code&gt;/etc/apt/sources.list.d/debian.sources&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Types: deb&lt;br&gt;
URIs: &lt;a href="https://mirrors.tuna.tsinghua.edu.cn/debian/" rel="noopener noreferrer"&gt;https://mirrors.tuna.tsinghua.edu.cn/debian/&lt;/a&gt;&lt;br&gt;
Suites: trixie trixie-updates trixie-backports&lt;br&gt;
Components: main contrib non-free non-free-firmware&lt;br&gt;
Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg

&lt;p&gt;Types: deb&lt;br&gt;
URIs: &lt;a href="https://mirrors.tuna.tsinghua.edu.cn/debian-security/" rel="noopener noreferrer"&gt;https://mirrors.tuna.tsinghua.edu.cn/debian-security/&lt;/a&gt;&lt;br&gt;
Suites: trixie-security&lt;br&gt;
Components: main contrib non-free non-free-firmware&lt;br&gt;
Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Modify Raspberry Pi source&lt;/strong&gt; (&lt;code&gt;/etc/apt/sources.list.d/raspi.sources&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Types: deb&lt;br&gt;
URIs: &lt;a href="https://mirrors.tuna.tsinghua.edu.cn/raspberrypi/" rel="noopener noreferrer"&gt;https://mirrors.tuna.tsinghua.edu.cn/raspberrypi/&lt;/a&gt;&lt;br&gt;
Suites: trixie&lt;br&gt;
Components: main&lt;br&gt;
Signed-By: /usr/share/keyrings/raspberrypi-archive-keyring.gpg&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Update&lt;br&gt;
sudo apt update&lt;br&gt;
sudo apt upgrade -y&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4.4 Reduce FTDI USB Serial Latency&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create rule file&lt;br&gt;
sudo tee /etc/udev/rules.d/99-usb-serial.rules &amp;gt;/dev/null &amp;lt;&amp;lt;'EOF'&lt;br&gt;
SUBSYSTEM=="usb-serial", DRIVER=="ftdi_sio", ATTR{latency_timer}="1"&lt;br&gt;
EOF&lt;br&gt;
# Apply&lt;br&gt;
sudo udevadm control --reload-rules&lt;br&gt;
sudo udevadm trigger&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;&lt;p&gt;💡 This rule only applies to FTDI drivers and does not affect CH340/CP210x.&lt;/p&gt;&lt;/blockquote&gt;
&lt;h3&gt;4.5 Enable I2C&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;sudo raspi-config -&amp;gt; Interface Options -&amp;gt; I2C&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4.6 Install System Packages&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;sudo apt install -y git unzip i2c-tools joystick python3-pip python3-venv&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4.7 Configure pip Source&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;pip config set global.index-url &lt;a href="https://mirrors.aliyun.com/pypi/simple" rel="noopener noreferrer"&gt;https://mirrors.aliyun.com/pypi/simple&lt;/a&gt;&lt;br&gt;
pip config set global.trusted-host mirrors.aliyun.com&lt;br&gt;
&lt;br&gt;
# Verify&lt;br&gt;
pip config list&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4.8 Install Miniconda&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create directory&lt;br&gt;
mkdir download &amp;amp;&amp;amp; cd download&lt;br&gt;
&lt;br&gt;
# Download Miniconda (aarch64)&lt;br&gt;
# &lt;a href="https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh" rel="noopener noreferrer"&gt;https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh&lt;/a&gt;&lt;br&gt;
chmod +x Miniconda3-latest-Linux-aarch64.sh&lt;br&gt;
./Miniconda3-latest-Linux-aarch64.sh&lt;br&gt;
&lt;br&gt;
# Follow prompts: Enter -&amp;gt; yes -&amp;gt; Enter -&amp;gt; yes&lt;br&gt;
source ~/.bashrc&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Configure Conda Mirror&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Clean old configuration&lt;br&gt;
conda config --remove-key channels 2&amp;gt;/dev/null || true&lt;br&gt;
conda config --remove-key default_channels 2&amp;gt;/dev/null || true&lt;br&gt;
&lt;br&gt;
# Set Tsinghua source&lt;br&gt;
conda config --append default_channels &lt;a href="https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main" rel="noopener noreferrer"&gt;https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main&lt;/a&gt;&lt;br&gt;
conda config --append default_channels &lt;a href="https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r" rel="noopener noreferrer"&gt;https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r&lt;/a&gt;&lt;br&gt;
conda config --set custom_channels.conda-forge &lt;a href="https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud" rel="noopener noreferrer"&gt;https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud&lt;/a&gt;&lt;br&gt;
&lt;br&gt;
# Set channels&lt;br&gt;
conda config --add channels conda-forge&lt;br&gt;
conda config --add channels defaults&lt;br&gt;
conda config --set show_channel_urls yes&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Create Environment&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;conda create -n duck310 python=3.10 -y --repodata-fn current_repodata.json -v&lt;br&gt;
conda activate duck310&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4.9 Configure pip Acceleration and Install uv&lt;/h3&gt;
&lt;blockquote&gt;&lt;p&gt;⚠ Must be executed in the &lt;code&gt;(duck310)&lt;/code&gt; environment&lt;/p&gt;&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;pip config set global.index-url &lt;a href="https://mirrors.aliyun.com/pypi/simple" rel="noopener noreferrer"&gt;https://mirrors.aliyun.com/pypi/simple&lt;/a&gt;&lt;br&gt;
pip config set global.trusted-host mirrors.aliyun.com&lt;br&gt;
&lt;br&gt;
pip install -U uv&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4.10 Install OpenDuckMini Dependencies&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;uv pip install -U pip setuptools wheel&lt;br&gt;
&lt;br&gt;
uv pip install rustypot==0.1.0 onnxruntime==1.18.1 numpy \&lt;br&gt;
    adafruit-circuitpython-bno055==5.4.13 scipy==1.15.1 \&lt;br&gt;
    pygame==2.6.0 openai==1.70.0 RPi.GPIO&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4.11 Configure Proxy (Optional)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;git config --global http.proxy &lt;a href="http://your_proxy_address:your_proxy_port" rel="noopener noreferrer"&gt;http://your_proxy_address:your_proxy_port&lt;/a&gt;&lt;br&gt;
git config --global https.proxy &lt;a href="https://your_proxy_address:your_proxy_port" rel="noopener noreferrer"&gt;https://your_proxy_address:your_proxy_port&lt;/a&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Example configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git config --global http.proxy &lt;a href="http://192.168.1.196:6551" rel="noopener noreferrer"&gt;http://192.168.1.196:6551&lt;/a&gt;&lt;br&gt;
git config --global https.proxy &lt;a href="https://192.168.1.196:6551" rel="noopener noreferrer"&gt;https://192.168.1.196:6551&lt;/a&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4.12 Install pypot and Open_Duck_Mini_Runtime&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;mkdir ~/project &amp;amp;&amp;amp; cd ~/project&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Install Open_Duck_Mini_Runtime&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Download: &lt;a href="https://github.com/apirrone/Open_Duck_Mini_Runtime/tree/v2" rel="noopener noreferrer"&gt;https://github.com/apirrone/Open_Duck_Mini_Runtime/tree/v2&lt;/a&gt;&lt;br&gt;
unzip Open_Duck_Mini_Runtime-2.zip&lt;br&gt;
cd Open_Duck_Mini_Runtime-2&lt;br&gt;
uv pip install -e .&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Install pypot&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Download: &lt;a href="https://github.com/apirrone/pypot/tree/support-feetech-sts3215" rel="noopener noreferrer"&gt;https://github.com/apirrone/pypot/tree/support-feetech-sts3215&lt;/a&gt;&lt;br&gt;
unzip pypot-support-feetech-sts3215.zip&lt;br&gt;
cd pypot-support-feetech-sts3215&lt;br&gt;
uv pip install .&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4.13 Calibrate IMU&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;sudo usermod -aG i2c $USER&lt;br&gt;
i2cdetect -y 1&lt;br&gt;
&lt;br&gt;
cd ~/project/Open_Duck_Mini_Runtime-2/scripts/&lt;br&gt;
python calibrate_imu.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;▪ Rotate and move the robot in different &lt;strong&gt;directions&lt;/strong&gt; until the terminal outputs &lt;code&gt;[3,3,3,3]&lt;/code&gt; and displays &lt;code&gt;Calibrated = True&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;▪ Calibration results are saved to the &lt;code&gt;imu_calib_data.pkl&lt;/code&gt; file&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cp imu_calib_data.pkl ~/project/Open_Duck_Mini_Runtime-2/mini_bdx_runtime/mini_bdx_runtime/&lt;/code&gt;&lt;/pre&gt;
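&lt;p&gt;If you want to sanity-check the saved calibration before copying it, the pickle can be loaded directly. This is an illustrative sketch; the helper name is ours and the internal structure of the pickled object is not documented here:&lt;/p&gt;

```python
import pickle
from pathlib import Path


def load_imu_calibration(path="imu_calib_data.pkl"):
    """Unpickle the IMU calibration file and return the stored object as-is."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Example: print whatever the calibration script stored
# print(load_imu_calibration())
```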
&lt;h3&gt;4.14 Adjust Servo Offsets&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;cd ~/project/Open_Duck_Mini_Runtime-2/scripts&lt;br&gt;
python find_soft_offsets.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Operation Steps&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;1. Use a cardboard box or stand to elevate the robot from the bottom, ensuring both feet are suspended&lt;/p&gt;
&lt;p&gt;2. Refer to the servo position diagram for calibration:&lt;br&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2026%2F01%2F29%2FcK92DsEOULGekrx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs2.loli.net%2F2026%2F01%2F29%2FcK92DsEOULGekrx.jpg" alt="openduckmini-motor-position.jpg" width="514" height="709"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;3. Put the robot in an upright position with all motors in torque-locked state&lt;/p&gt;
&lt;p&gt;4. Unlock motors one by one, manually adjust to the correct position, then re-lock&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Final State Check&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;✅ Chassis (abdomen) remains horizontal or pitched slightly upward&lt;/p&gt;
&lt;p&gt;✅ Left and right legs and feet are symmetrical and should overlap completely when viewed from the side&lt;/p&gt;
&lt;p&gt;✅ When placed on a table, both feet's micro switches trigger simultaneously&lt;/p&gt;
&lt;p&gt;✅ Head remains horizontal or pitched slightly upward&lt;/p&gt;
&lt;h3&gt;4.15 Modify Configuration File&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;cd ~/project/Open_Duck_Mini_Runtime-2/&lt;br&gt;
cp example_config.json ~/duck_config.json&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fill in the &lt;strong&gt;servo offsets&lt;/strong&gt; in the &lt;code&gt;~/duck_config.json&lt;/code&gt; configuration file and add the following settings:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{&lt;br&gt;
  "imu_upside_down": true&lt;br&gt;
}&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;&lt;p&gt;⚠ &lt;strong&gt;Important&lt;/strong&gt;: If &lt;code&gt;imu_upside_down&lt;/code&gt; is not set, the robot will oscillate abnormally while walking and cannot keep its balance.&lt;/p&gt;&lt;/blockquote&gt;
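&lt;p&gt;Editing the JSON by hand works fine; if you prefer to script the change, a small sketch like the following can set the flag (the helper name is ours, not part of the runtime):&lt;/p&gt;

```python
import json
from pathlib import Path


def set_config_flag(config_path, key, value):
    """Load a JSON config file, set one key, and write the file back."""
    p = Path(config_path)
    cfg = json.loads(p.read_text())
    cfg[key] = value
    p.write_text(json.dumps(cfg, indent=2))
    return cfg


# Example:
# set_config_flag("~/duck_config.json", "imu_upside_down", True)
```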
&lt;h3&gt;4.16 Initial Bent Leg Posture&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;cd ~/project/Open_Duck_Mini_Runtime-2/scripts&lt;br&gt;
python turn_on.py&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;▪ With normal assembly, servo positions should read 0 when the robot is fully upright&lt;/p&gt;
&lt;p&gt;▪ After startup, the robot should hold a bent-leg posture with servo torque locked&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;If you encounter problems, please refer to Frequently Asked Questions (FAQ)&lt;/p&gt;&lt;/blockquote&gt;
&lt;h3&gt;4.17 Test Walking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;cd ~/project/Open_Duck_Mini_Runtime-2/scripts&lt;br&gt;
&lt;br&gt;
python v2_rl_walk_mujoco.py \&lt;br&gt;
    --duck_config_path ~/duck_config.json \&lt;br&gt;
    --onnx_model_path ~/BEST_WALK_ONNX_2.onnx&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;&lt;p&gt;💡 The &lt;code&gt;BEST_WALK_ONNX_2.onnx&lt;/code&gt; model file needs to be downloaded from the official repository and placed in the home directory.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;The robot first moves into its initial posture and then begins walking. Actual operation requires a game controller; if you don't have a Bluetooth controller, you can modify the code to default to forward movement.&lt;/p&gt;
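&lt;p&gt;One way to default to forward movement is to swap the joystick reader for a stub that always returns a small forward command. This is a hypothetical sketch: the class name and the &lt;code&gt;[x, y, yaw]&lt;/code&gt; command layout are assumptions for illustration, not the runtime's actual API:&lt;/p&gt;

```python
class ConstantForwardController:
    """Hypothetical stand-in for a gamepad: always commands slow forward walking."""

    def __init__(self, x_vel=0.1, y_vel=0.0, yaw_vel=0.0):
        # Assumed [x, y, yaw] velocity command layout
        self.command = [x_vel, y_vel, yaw_vel]

    def get_last_command(self):
        # Return a copy so callers cannot mutate the stored command
        return list(self.command)
```

&lt;p&gt;In the walking script, wherever the joystick command is read, substituting such a stub makes the robot walk forward without any controller attached.&lt;/p&gt;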
&lt;h2&gt;V. RDK X5 Deployment Process&lt;/h2&gt;
&lt;blockquote&gt;&lt;p&gt;The RDK kit provides Ubuntu 22.04 system images (desktop/server versions).&lt;br&gt;The following only lists &lt;strong&gt;steps different from Raspberry Pi&lt;/strong&gt;, please refer to the above for identical steps.&lt;/p&gt;&lt;/blockquote&gt;
&lt;h3&gt;5.1 System Flashing&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Download Image&lt;/strong&gt;: &lt;a href="https://archive.d-robotics.cc/downloads/os_images/rdk_x5" rel="noopener noreferrer"&gt;RDK X5 Image Download&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Recommended version: &lt;code&gt;rdk-x5-ubuntu22-preinstalled-desktop-3.4.1-arm64.img.xz&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;NAND Firmware Flashing&lt;/strong&gt; (optional, for version consistency):&lt;/p&gt;
&lt;p&gt;Download: &lt;a href="https://archive.d-robotics.cc/downloads/miniboot/rdk_x5/" rel="noopener noreferrer"&gt;NAND Firmware Download&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Recommended version: &lt;code&gt;product_20251111.zip&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;5.2 Install System Packages&lt;/h3&gt;
&lt;p&gt;👉 Same as Raspberry Pi Step 4.6&lt;/p&gt;
&lt;h3&gt;5.3 Configure pip Source&lt;/h3&gt;
&lt;p&gt;👉 Same as Raspberry Pi Step 4.7&lt;/p&gt;
&lt;h3&gt;5.4 Create venv (⚠ Different from Raspberry Pi)&lt;/h3&gt;
&lt;blockquote&gt;&lt;p&gt;The official &lt;code&gt;hobot.GPIO&lt;/code&gt;, &lt;code&gt;hobot_dnn&lt;/code&gt; and other packages from Digua Robotics are precompiled for the RDK system Python environment.&lt;br&gt;Compatibility issues may occur in Conda environments, &lt;strong&gt;recommended to use system Python + venv virtual environment&lt;/strong&gt;.&lt;/p&gt;&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;python3 -m venv --system-site-packages ~/duck_env&lt;br&gt;
source ~/duck_env/bin/activate&lt;br&gt;
&lt;br&gt;
# Verify GPIO module&lt;br&gt;
python3 -c "import Hobot.GPIO; print('OK')"&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;5.5 Configure pip Acceleration and Install uv&lt;/h3&gt;
&lt;blockquote&gt;&lt;p&gt;⚠ Must be executed in the &lt;code&gt;(duck_env)&lt;/code&gt; environment&lt;/p&gt;&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;pip config set global.index-url &lt;a href="https://mirrors.aliyun.com/pypi/simple" rel="noopener noreferrer"&gt;https://mirrors.aliyun.com/pypi/simple&lt;/a&gt;&lt;br&gt;
pip config set global.trusted-host mirrors.aliyun.com&lt;br&gt;
&lt;br&gt;
python3 -m pip install -U uv&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;5.6 Install Dependencies (⚠ Different from Raspberry Pi)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;python3 -m uv pip install -U pip setuptools wheel&lt;br&gt;
&lt;br&gt;
# Note: RDK X5 uses smbus2 instead of RPi.GPIO&lt;br&gt;
python3 -m uv pip install rustypot==0.1.0 onnxruntime==1.18.1 numpy \&lt;br&gt;
    adafruit-circuitpython-bno055==5.4.13 scipy==1.15.1 \&lt;br&gt;
    pygame==2.6.0 openai==1.70.0 smbus2&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;5.7 Configure Proxy (Optional)&lt;/h3&gt;
&lt;p&gt;👉 Same as Raspberry Pi Step 4.11&lt;/p&gt;
&lt;h3&gt;5.8 Install pypot and Runtime&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;mkdir ~/project &amp;amp;&amp;amp; cd ~/project&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Install Open_Duck_Mini_Runtime&lt;/strong&gt; (RDK X5 version):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;unzip Open_Duck_Mini_Runtime-2_RDK_X5.zip&lt;br&gt;
cd Open_Duck_Mini_Runtime-2_RDK_X5&lt;br&gt;
uv pip install -e .&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Install pypot&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Download: &lt;a href="https://github.com/apirrone/pypot/tree/support-feetech-sts3215" rel="noopener noreferrer"&gt;https://github.com/apirrone/pypot/tree/support-feetech-sts3215&lt;/a&gt;&lt;br&gt;
unzip pypot-support-feetech-sts3215.zip&lt;br&gt;
cd pypot-support-feetech-sts3215&lt;br&gt;
uv pip install .&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;5.9 Calibrate IMU&lt;/h3&gt;
&lt;p&gt;👉 Same as Raspberry Pi Step 4.13 (change the path to &lt;code&gt;Open_Duck_Mini_Runtime-2_RDK_X5&lt;/code&gt;)&lt;/p&gt;
&lt;h3&gt;5.10 Adjust Servo Offsets&lt;/h3&gt;
&lt;p&gt;👉 Same as Raspberry Pi Step 4.14 (change the path to &lt;code&gt;Open_Duck_Mini_Runtime-2_RDK_X5&lt;/code&gt;)&lt;/p&gt;
&lt;h3&gt;5.11 Modify Configuration File&lt;/h3&gt;
&lt;p&gt;👉 Same as Raspberry Pi Step 4.15 (change the path to &lt;code&gt;Open_Duck_Mini_Runtime-2_RDK_X5&lt;/code&gt;)&lt;/p&gt;
&lt;h3&gt;5.12 Initial Bent Leg Posture&lt;/h3&gt;
&lt;p&gt;👉 Same as Raspberry Pi Step 4.16 (change the path to &lt;code&gt;Open_Duck_Mini_Runtime-2_RDK_X5&lt;/code&gt;)&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;If you encounter problems, please refer to Frequently Asked Questions (FAQ)&lt;/p&gt;&lt;/blockquote&gt;
&lt;h3&gt;5.13 Test Walking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;cd ~/project/Open_Duck_Mini_Runtime-2_RDK_X5/scripts&lt;br&gt;
&lt;br&gt;
python v2_rl_walk_mujoco.py \&lt;br&gt;
    --duck_config_path ~/duck_config.json \&lt;br&gt;
    --onnx_model_path ~/BEST_WALK_ONNX_2.onnx&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This build also adds support for the Logitech F710 controller.&lt;/p&gt;
&lt;h2&gt;VI. Frequently Asked Questions (FAQ)&lt;/h2&gt;
&lt;h3&gt;6.1 Q1: When running &lt;code&gt;find_soft_offsets.py&lt;/code&gt;, gravity shows horizontal posture&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Problem Cause&lt;/strong&gt;: Servo 22 or 12 was not installed in a horizontal orientation, leaving its position at approximately -1.57 rad&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;1. Loosen the 4 fixing screws on the servo main disk to allow the entire leg to be freely adjustable&lt;/p&gt;
&lt;p&gt;2. Create the following script to return the servo to center position:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd ~/project/Open_Duck_Mini_Runtime-2/scripts  # or corresponding RDK X5 path&lt;br&gt;
nano set_servo_mid.py&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;from mini_bdx_runtime.rustypot_position_hwi import HWI&lt;br&gt;
from mini_bdx_runtime.duck_config import DuckConfig&lt;br&gt;
import argparse&lt;br&gt;
import time&lt;br&gt;
import traceback&lt;br&gt;
&lt;br&gt;
def zero_motor(hwi, joint_id, tol=0.02, timeout=5.0):&lt;br&gt;
    """Move motor to 0 rad and wait until reached."""&lt;br&gt;
    print(f"Zeroing motor ID {joint_id} to 0 rad")&lt;br&gt;
&lt;br&gt;
    try:&lt;br&gt;
        current_pos = hwi.io.read_present_position([joint_id])[0]&lt;br&gt;
        print(f"Current position: {current_pos:.3f} rad")&lt;br&gt;
&lt;br&gt;
        hwi.io.write_goal_position([joint_id], [0.0])&lt;br&gt;
&lt;br&gt;
        start_time = time.time()&lt;br&gt;
        while True:&lt;br&gt;
            pos = hwi.io.read_present_position([joint_id])[0]&lt;br&gt;
            err = abs(pos)&lt;br&gt;
&lt;br&gt;
            print(f"  pos={pos:.3f} rad, err={err:.3f}")&lt;br&gt;
&lt;br&gt;
            if err &amp;lt; tol:&lt;br&gt;
                print("✓ Zero position reached")&lt;br&gt;
                return True&lt;br&gt;
&lt;br&gt;
            if time.time() - start_time &amp;gt; timeout:&lt;br&gt;
                print("✗ Timeout while zeroing motor")&lt;br&gt;
                return False&lt;br&gt;
&lt;br&gt;
            time.sleep(0.05)&lt;br&gt;
&lt;br&gt;
    except Exception as e:&lt;br&gt;
        print(f"✗ Error zeroing motor ID {joint_id}: {e}")&lt;br&gt;
        print(traceback.format_exc())&lt;br&gt;
        return False&lt;br&gt;
&lt;br&gt;
def main():&lt;br&gt;
    parser = argparse.ArgumentParser()&lt;br&gt;
    parser.add_argument("--id", type=int, required=True, help="Motor ID to zero")&lt;br&gt;
    args = parser.parse_args()&lt;br&gt;
&lt;br&gt;
    print("Initializing hardware interface...")&lt;br&gt;
    try:&lt;br&gt;
        duck_config = DuckConfig()&lt;br&gt;
        hwi = HWI(duck_config=duck_config)&lt;br&gt;
        print("Successfully connected to hardware")&lt;br&gt;
    except Exception as e:&lt;br&gt;
        print(f"Error initializing HWI: {e}")&lt;br&gt;
        print(traceback.format_exc())&lt;br&gt;
        return&lt;br&gt;
&lt;br&gt;
    zero_motor(hwi, args.id)&lt;br&gt;
&lt;br&gt;
    try:&lt;br&gt;
        hwi.io.disable_torque([args.id])&lt;br&gt;
        print(f"Torque disabled for motor ID {args.id}")&lt;br&gt;
    except Exception:&lt;br&gt;
        pass&lt;br&gt;
&lt;br&gt;
if __name__ == "__main__":&lt;br&gt;
    main()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;3. Run the script, specify the servo ID to calibrate and return to center position:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python set_servo_mid.py --id 12&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Expected Output&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Initializing hardware interface...&lt;br&gt;
Successfully connected to hardware&lt;br&gt;
Zeroing motor ID 12 to 0 rad&lt;br&gt;
Current position: -3.086 rad&lt;br&gt;
  pos=-3.086 rad, err=3.086&lt;br&gt;
  ...&lt;br&gt;
✗ Timeout while zeroing motor&lt;br&gt;
Torque disabled for motor ID 12&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;4. The servo disk will automatically rotate. After rotation is complete, fix the four screws in the upright posture.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📝 &lt;strong&gt;Document Update Log&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;▪ As of this writing, multiple OpenDuck Mini tutorials contain Python environment configuration issues&lt;/p&gt;

&lt;p&gt;▪ This tutorial, used with the image versions specified above, has been verified in practice and avoids the common environment issues&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;VII. Reinforcement Learning&lt;/h2&gt;
&lt;blockquote&gt;&lt;p&gt;This section introduces how to use the OpenDuck project for reinforcement learning training, including reference motion generation, data processing, and model training.&lt;/p&gt;&lt;/blockquote&gt;
&lt;h3&gt;7.1 Generate Reference Motions&lt;/h3&gt;
&lt;blockquote&gt;&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;code&gt;Open_Duck_reference_motion_generator&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Generate reference motion data for imitation learning&lt;/p&gt;&lt;/blockquote&gt;
&lt;h4&gt;7.1.1 Clone Repository and Install Dependencies&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;cd ~/project/open_duck_mini_ws&lt;br&gt;
git clone &lt;a href="https://github.com/apirrone/Open_Duck_reference_motion_generator.git" rel="noopener noreferrer"&gt;https://github.com/apirrone/Open_Duck_reference_motion_generator.git&lt;/a&gt;&lt;br&gt;
cd Open_Duck_reference_motion_generator&lt;br&gt;
&lt;br&gt;
# Install dependencies using uv&lt;br&gt;
uv sync&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;7.1.2 Batch Generate Motions&lt;/h4&gt;
&lt;blockquote&gt;&lt;p&gt;Use the &lt;code&gt;auto_waddle.py&lt;/code&gt; script to batch generate motion files with different gait parameters&lt;/p&gt;&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;uv run scripts/auto_waddle.py \&lt;br&gt;
    --duck open_duck_mini_v2 \&lt;br&gt;
    --sweep \&lt;br&gt;
    -j8&lt;/code&gt;&lt;/pre&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;br&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;br&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;br&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;br&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;br&gt;
&lt;tbody&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;code&gt;--duck&lt;/code&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Robot model (&lt;code&gt;open_duck_mini_v2&lt;/code&gt;)&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;code&gt;--sweep&lt;/code&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Traverse all parameter combinations&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;code&gt;-j8&lt;/code&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Use 8 threads for parallel generation&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;/tbody&gt;
&lt;br&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Generation Result&lt;/strong&gt;: Approximately 240 &lt;code&gt;.json&lt;/code&gt; motion files will be generated in the &lt;code&gt;recordings/&lt;/code&gt; directory&lt;/p&gt;
&lt;p&gt;File naming format: &lt;code&gt;{number}_{x_velocity}_{y_velocity}_{turn_velocity}.json&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Example: &lt;code&gt;99_0.074_-0.111_-0.074.json&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; X-direction velocity: 0.074 m/s (forward)&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; Y-direction velocity: -0.111 m/s (right)&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; Turn angular velocity: -0.074 rad/s (clockwise)&lt;/p&gt;
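A tiny illustrative parser for these filenames (a hypothetical helper, assuming each of the four fields is underscore-separated as in the naming format above):

```python
def parse_motion_filename(name: str) -> dict:
    """Parse a generated motion filename into its velocity components.

    Assumes the scheme {number}_{x_velocity}_{y_velocity}_{turn_velocity}.json.
    """
    stem = name.removesuffix(".json")
    number, vx, vy, vturn = stem.split("_")
    return {
        "index": int(number),
        "x_velocity": float(vx),       # m/s, forward positive
        "y_velocity": float(vy),       # m/s, left positive
        "turn_velocity": float(vturn), # rad/s, counter-clockwise positive
    }

print(parse_motion_filename("99_0.074_-0.111_-0.074.json"))
```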
&lt;h4&gt;7.1.3 Verify Generated Motions (Optional)&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;# Use Meshcat for visualization
uv run open_duck_reference_motion_generator/gait_playground.py --duck open_duck_mini_v2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then open &lt;code&gt;&lt;a href="http://127.0.0.1:7000/static/" rel="noopener noreferrer"&gt;http://127.0.0.1:7000/static/&lt;/a&gt;&lt;/code&gt; in your browser to view the 3D model animation&lt;/p&gt;
&lt;h3&gt;7.2 Process Motion Data&lt;/h3&gt;
&lt;blockquote&gt;&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Perform polynomial fitting on motion data to compress data and smooth noise&lt;/p&gt;&lt;/blockquote&gt;
&lt;h4&gt;7.2.1 Polynomial Fitting&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;cd ~/project/open_duck_mini_ws/Open_Duck_reference_motion_generator

uv run scripts/fit_poly.py --ref_motion recordings/&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: The &lt;code&gt;polynomial_coefficients.pkl&lt;/code&gt; file will be generated in the current directory, containing polynomial coefficients for all motions&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Purpose of Polynomial Fitting&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Significantly compress data volume (each joint only needs 5-10 coefficients to represent the complete motion trajectory)&lt;/li&gt;
&lt;li&gt;Effectively smooth noise and jitter in raw data&lt;/li&gt;
&lt;li&gt;Facilitate fast sampling and interpolation during reinforcement learning training&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
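The idea can be sketched in a few lines of NumPy (a minimal illustration of polynomial compression and smoothing, not the actual `fit_poly.py` implementation):

```python
import numpy as np

# Minimal sketch: fit one joint's noisy sampled trajectory with a low-order
# polynomial, so a handful of coefficients replaces hundreds of samples.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)                              # normalized gait phase
joint_angle = np.sin(2 * np.pi * t) + 0.05 * rng.standard_normal(200)

coeffs = np.polyfit(t, joint_angle, deg=8)   # 9 coefficients replace 200 samples
smooth = np.polyval(coeffs, t)               # cheap to resample at any phase

print(coeffs.shape)  # (9,)
```

During training, evaluating `np.polyval` at an arbitrary phase gives fast interpolation without storing the raw recordings.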
&lt;h4&gt;7.2.2 View Fitting Results (Optional)&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;uv run scripts/plot_poly_fit.py --coefficients polynomial_coefficients.pkl&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The script will display fitting curve graphs for each motion one by one to verify fitting effectiveness&lt;/p&gt;
&lt;h4&gt;7.2.3 Copy to Training Directory&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;cp polynomial_coefficients.pkl \
   ~/project/open_duck_mini_ws/Open_Duck_Playground/playground/open_duck_mini_v2/data/&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;7.3 Reinforcement Learning Training&lt;/h3&gt;
&lt;blockquote&gt;&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;code&gt;Open_Duck_Playground&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Train walking strategy using PPO algorithm&lt;/p&gt;&lt;/blockquote&gt;
&lt;h4&gt;7.3.1 Clone Repository and Install Dependencies&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;cd ~/project/open_duck_mini_ws
git clone https://github.com/apirrone/Open_Duck_Playground.git
cd Open_Duck_Playground

uv sync&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;7.3.2 Start Training&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;python3 playground/open_duck_mini_v2/runner.py \
    --task flat_terrain_backlash \
    --num_timesteps 300000000&lt;/code&gt;&lt;/pre&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;br&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;br&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;br&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;br&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;br&gt;
&lt;tbody&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;code&gt;--task&lt;/code&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Training task type (&lt;code&gt;flat_terrain_backlash&lt;/code&gt; means flat terrain + backlash compensation)&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;code&gt;--num_timesteps&lt;/code&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Total training steps (300 million steps, usually takes several hours to complete)&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;/tbody&gt;
&lt;br&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Training Output&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; &lt;code&gt;checkpoints/&lt;/code&gt; directory – Saves model checkpoints during training&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; &lt;code&gt;ONNX.onnx&lt;/code&gt; file – Final exported ONNX format inference model&lt;/p&gt;
&lt;h4&gt;7.3.3 Monitor Training Progress&lt;/h4&gt;
&lt;p&gt;Run the following command in a new terminal:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd ~/project/open_duck_mini_ws/Open_Duck_Playground
tensorboard --logdir=checkpoints/&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open &lt;code&gt;&lt;a href="http://localhost:6006" rel="noopener noreferrer"&gt;http://localhost:6006&lt;/a&gt;&lt;/code&gt; in your browser to view training curves and metrics&lt;/p&gt;
&lt;h4&gt;7.3.4 Training Parameters&lt;/h4&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;br&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;br&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;br&gt;
&lt;th&gt;Default Value&lt;/th&gt;
&lt;br&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;br&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;br&gt;
&lt;tbody&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;code&gt;num_envs&lt;/code&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;8192&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Number of parallel simulation environments&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;code&gt;batch_size&lt;/code&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Training batch size&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;code&gt;learning_rate&lt;/code&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;0.0003&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Learning rate&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;code&gt;discounting&lt;/code&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;0.97&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Discount factor (for calculating present value of future rewards)&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;code&gt;episode_length&lt;/code&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Maximum steps per episode&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;/tbody&gt;
&lt;br&gt;
&lt;/table&gt;&lt;/div&gt;
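As a quick intuition for the discount factor: a unit reward k steps ahead is weighted by γ^k, so the policy's effective planning horizon is roughly 1/(1−γ) control steps, which the geometric series confirms:

```python
# Intuition for discounting = 0.97: the effective planning horizon is about
# 1 / (1 - gamma) steps, matching the discounted sum of unit rewards.
gamma = 0.97
horizon = 1.0 / (1.0 - gamma)                 # ≈ 33.3 steps
total = sum(gamma**k for k in range(10_000))  # geometric series of unit rewards

print(round(horizon, 1), round(total, 1))     # 33.3 33.3
```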
&lt;h4&gt;7.3.5 Deploy to Real Robot&lt;/h4&gt;
&lt;p&gt;After training is complete, copy the generated &lt;code&gt;ONNX.onnx&lt;/code&gt; model file to the robot device:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;scp ONNX.onnx user@raspberry-pi:~/BEST_WALK_ONNX_2.onnx&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then follow the steps in the Test Walking section to complete deployment&lt;/p&gt;
&lt;p&gt;The post &lt;a href="https://frankfu.blog/openai/understanding-reinforcement-learning-through-openduck/" rel="noopener noreferrer"&gt;Understanding Reinforcement Learning through OpenDuck&lt;/a&gt; appeared first on &lt;a href="https://frankfu.blog" rel="noopener noreferrer"&gt;Frank Fu's Blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>openai</category>
    </item>
    <item>
      <title>NavTalk Official Support for NVIDIA RTX 5090 on Linux</title>
      <dc:creator>Frank Fu</dc:creator>
      <pubDate>Mon, 30 Mar 2026 08:50:11 +0000</pubDate>
      <link>https://dev.to/frankfu/navtalk-official-support-for-nvidia-rtx-5090-on-linux-248m</link>
      <guid>https://dev.to/frankfu/navtalk-official-support-for-nvidia-rtx-5090-on-linux-248m</guid>
      <description>&lt;p&gt;NavTalk’s digital human lip-sync and real-time audio/video capabilities are &lt;strong&gt;fully supported for deployment and operation on Linux servers equipped with NVIDIA RTX 5090&lt;/strong&gt;. End-to-end adaptation and validation—from drivers and frameworks to the inference engine—have been completed for the latest generation (Blackwell architecture and corresponding NVIDIA drivers and libraries), ensuring a stable, high-performance real-time digital human experience on current hardware.&lt;/p&gt;
&lt;p&gt;This document describes NavTalk’s &lt;strong&gt;official support for RTX 5090 on Linux&lt;/strong&gt; in terms of technology stack, adaptation work, and product value, and provides &lt;strong&gt;recommended concurrent real-time chat Session counts for RTX 5090 / 4090 / 3090&lt;/strong&gt; based on measured results, for evaluation and sizing reference.&lt;/p&gt;
&lt;h2&gt;1. Why RTX 5090 and Linux Matter&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;Compute upgrade&lt;/strong&gt;: RTX 5090 is based on the Blackwell architecture, with significantly higher memory and compute, suited for real-time high-resolution lip-sync and multi-session concurrency.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;Linux first&lt;/strong&gt;: Most production and cloud environments run Linux; NavTalk offers a full set of services on Linux (including real-time lip-sync, video lip-sync, and other APIs), making integration and scaling straightforward.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;Long-term compatibility&lt;/strong&gt;: Adaptation has been completed for the latest NVIDIA drivers and AI runtime (e.g. CUDA 12.8, PyTorch 2.7), keeping NavTalk aligned with the official software stack for the foreseeable future and reducing upgrade cost.&lt;/p&gt;
&lt;p&gt;Thus, &lt;strong&gt;“deployable, operable, and scalable” on RTX 5090 Linux is a clear commitment from NavTalk for production and high-end compute scenarios&lt;/strong&gt;. We recommend using NVIDIA drivers that support RTX 5090 (e.g. 5xx series) and a common Linux distribution (e.g. Ubuntu 22.04 LTS or newer).&lt;/p&gt;
&lt;h2&gt;2. Technology Stack and Adaptation&lt;/h2&gt;
&lt;p&gt;NavTalk’s runtime on RTX 5090 Linux is selected and validated separately from environments used for older GPUs (e.g. CUDA 11.8), and is &lt;strong&gt;maintained independently&lt;/strong&gt; to avoid wrong or mixed installations and to simplify environment isolation and issue reproduction.&lt;/p&gt;
&lt;h3&gt;2.1 Core Runtime (5090-specific)&lt;/h3&gt;
&lt;p&gt;The table below lists &lt;strong&gt;officially verified software versions&lt;/strong&gt; for NavTalk on RTX 5090, for operations and integration reference. &lt;strong&gt;Python&lt;/strong&gt; is the runtime; &lt;strong&gt;CUDA&lt;/strong&gt; is the NVIDIA compute platform; &lt;strong&gt;PyTorch&lt;/strong&gt; is the main framework for AI models; &lt;strong&gt;mmcv / mmdet / mmpose&lt;/strong&gt; are the vision libraries used for face and pose, etc.&lt;/p&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;5090 Linux recommended version&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Python&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.10.11&lt;/td&gt;
&lt;td&gt;Runtime version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CUDA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12.8&lt;/td&gt;
&lt;td&gt;NVIDIA compute platform for RTX 5090&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PyTorch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.7.0+cu128&lt;/td&gt;
&lt;td&gt;AI model framework (vision, audio, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TensorFlow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≥2.16.0&lt;/td&gt;
&lt;td&gt;Required when enabling related features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NumPy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.26.0&lt;/td&gt;
&lt;td&gt;Numerical library, compatible with image processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;mmcv&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.1.0&lt;/td&gt;
&lt;td&gt;Computer vision base (face, image processing, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;mmdet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.2.0&lt;/td&gt;
&lt;td&gt;Detection library paired with mmcv&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;mmpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.2.0&lt;/td&gt;
&lt;td&gt;Pose library paired with mmcv&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;NavTalk &lt;strong&gt;maintains a dedicated dependency list&lt;/strong&gt; for the 5090 environment, including the above components and versions, with notes on TensorFlow, CUDA 12.8, NumPy, etc., separate from older GPU environments, reflecting 5090-specific adaptation and maintainability.&lt;/p&gt;
&lt;h3&gt;2.2 5090 Architecture Compatibility&lt;/h3&gt;
&lt;p&gt;RTX 5090 uses the new Blackwell architecture (compute capability 12.0). Some vision libraries do not yet ship prebuilt packages for the 5090. Compatibility has been verified and adapted for this architecture so that face, pose, and related capabilities run correctly on the 5090.&lt;/p&gt;
&lt;h3&gt;2.3 Inference and Model Management&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; NavTalk’s lip-sync core is based on &lt;strong&gt;MuseTalk 1.5&lt;/strong&gt; (a widely used high-quality lip-sync model) and runs on 5090 with the PyTorch 2.7 + CUDA 12.8 stack above.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; NavTalk provides &lt;strong&gt;unified GPU and model management&lt;/strong&gt;: models are loaded on demand, and multi-task contention for the GPU is avoided, improving stability in multi-service or multi-GPU setups and long-term operation on 5090.&lt;/p&gt;
&lt;p&gt;All versions and adaptation work above have been verified, representing &lt;strong&gt;reproducible, deliverable engineering support&lt;/strong&gt;, not just “theoretical” compatibility.&lt;/p&gt;
&lt;h2&gt;3. Product Value and Use Cases&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;Latency and quality&lt;/strong&gt;: On 5090, NavTalk can leverage the new generation’s compute for &lt;strong&gt;real-time lip-sync at 30+ fps&lt;/strong&gt; and higher resolution with multi-session concurrency, suitable for digital humans, virtual hosts, and live interaction where latency and quality matter.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;Service forms&lt;/strong&gt;: On 5090 Linux, NavTalk offers &lt;strong&gt;real-time lip API, video lip API, digital human avatar API&lt;/strong&gt;, and other interfaces for live, recorded, and interactive use; the real-time lip API is optimized for low latency and streaming.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;Production-ready&lt;/strong&gt;: Concurrency, quality enhancements (e.g. face enhancement, mouth sharpening), GPU options, and output directories are configurable, easing integration with your existing business systems, storage, and monitoring.&lt;/p&gt;
&lt;p&gt;Thus, &lt;strong&gt;NavTalk on 5090 Linux does not merely “run”: it is full production support for the latest compute&lt;/strong&gt;, ready for evaluation and rollout.&lt;/p&gt;
&lt;h2&gt;4. RTX 5090 / 4090 / 3090 Concurrency and Responsiveness&lt;/h2&gt;
&lt;p&gt;Conclusions in this section are based on &lt;strong&gt;single-node, single-GPU&lt;/strong&gt; measured memory usage (service port 8800, real-time chat WebSocket call scenario). The following gives &lt;strong&gt;RTX 5090, 4090, and 3090&lt;/strong&gt; concurrent Session recommendations from a memory perspective; if the GPU is shared with other processes (e.g. LLM services), recalculate using available memory.&lt;/p&gt;
&lt;h3&gt;4.1 Memory and Single-Session Peak (Measured)&lt;/h3&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RTX 5090 total memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;32,607 MiB&lt;/strong&gt; (~31.8 GiB)&lt;/td&gt;
&lt;td&gt;Single-GPU physical memory; after small desktop usage, still ~32 GiB for planning.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Single-session real-time chat peak&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;10,410 MiB&lt;/strong&gt; (~10.2 GiB)&lt;/td&gt;
&lt;td&gt;NavTalk process group usage when one real-time chat Session is inferring.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;Composition of the single-session peak (measured): main process during inference ~&lt;strong&gt;8,746 MiB&lt;/strong&gt;, plus two worker processes at &lt;strong&gt;832 MiB&lt;/strong&gt; each, total &lt;strong&gt;8,746 + 832×2 = 10,410 MiB&lt;/strong&gt;. In the current deployment, each real-time chat Session corresponds to a separately started service process set (not multi-threaded sharing), so each additional Session adds ~10.2 GiB memory; this peak is used for sizing.&lt;/p&gt;
&lt;p&gt;Share of total capacity: 10,410 MiB ÷ 32,607 MiB ≈ &lt;strong&gt;31.9%&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;4.2 Concurrent Session Count (Memory-Based)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;When NavTalk has exclusive use of RTX 5090:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; Usable memory for NavTalk is &lt;strong&gt;32,607 MiB&lt;/strong&gt; (still close to 32 GiB after desktop, etc.).&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;Floor&lt;/strong&gt; by single-session peak 10,410 MiB: 32,607 ÷ 10,410 ≈ 3.13 → &lt;strong&gt;3 concurrent real-time chat Sessions&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; Check: 3 × 10,410 = 31,230 MiB &amp;lt; 32,607 MiB; ~1,377 MiB headroom for fragmentation and short-term spikes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When other processes use GPU memory&lt;/strong&gt; (e.g. LLM inference, other services):&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; Available memory = 32,607 MiB − other process usage;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; Concurrent Sessions = ⌊ available memory ÷ 10,410 ⌋ (floor).&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; Actual concurrency limits also depend on system RAM, CPU, and network; &lt;strong&gt;we recommend load testing in the target environment (including whether the GPU is shared)&lt;/strong&gt;.&lt;/p&gt;
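The memory-floor rule above can be written as a one-line helper (the 10,410 MiB per-session peak and the 32,607 MiB total are the measured figures quoted in this post; the shared-GPU figure in the second call is a hypothetical example):

```python
# Memory-floor sizing rule: concurrent Sessions = available MiB // per-session peak.
SESSION_PEAK_MIB = 10_410  # measured single-session real-time chat peak

def max_sessions(total_mib: int, other_usage_mib: int = 0) -> int:
    """Concurrent real-time chat Sessions that fit in GPU memory."""
    available = total_mib - other_usage_mib
    return max(available // SESSION_PEAK_MIB, 0)

print(max_sessions(32_607))          # RTX 5090, exclusive use → 3
print(max_sessions(32_607, 12_000))  # sharing with a ~12 GiB LLM service → 1
```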
&lt;h3&gt;4.3 Three-GPU Concurrent Session Recommendations (Measured and Inferred)&lt;/h3&gt;
&lt;p&gt;Single-session real-time chat peak is taken from 5090 measurements: &lt;strong&gt;10,410 MiB&lt;/strong&gt; (~10.2 GiB). 5090 was tested on Linux; 4090 and 3090 on Windows. Using memory floor and measured results, &lt;strong&gt;recommended planning&lt;/strong&gt; is:&lt;/p&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;Total memory&lt;/th&gt;
&lt;th&gt;Environment&lt;/th&gt;
&lt;th&gt;Recommended concurrent real-time chat Sessions&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RTX 5090&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;32,607 MiB (~31.8 GiB)&lt;/td&gt;
&lt;td&gt;Linux&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RTX 4090&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;24,564 MiB (~24.0 GiB)&lt;/td&gt;
&lt;td&gt;Windows&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RTX 3090&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;24,576 MiB (~24.0 GiB)&lt;/td&gt;
&lt;td&gt;Windows&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;The above are memory-based recommendations; actual limits also depend on system memory, CPU, and network. We recommend load testing in the target environment.&lt;/p&gt;
&lt;h2&gt;5. Summary&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; NavTalk &lt;strong&gt;is officially supported and runs fully on NVIDIA RTX 5090 + Linux&lt;/strong&gt;. For 5090, NavTalk specifies runtime versions (e.g. Python 3.10, CUDA 12.8, PyTorch 2.7), a 5090-specific dependency list, and recommended versions for face/pose and related libraries, with end-to-end adaptation and validation from drivers through the inference engine.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; Compatibility has been addressed for the 5090 architecture; where prebuilt packages are unavailable, building from source and similar approaches are supported to run correctly on 5090.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F15.0.3%2F72x72%2F25aa.png" alt="▪" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;Concurrency and responsiveness&lt;/strong&gt;: Based on measurements and memory sizing, &lt;strong&gt;RTX 5090&lt;/strong&gt; (Linux, exclusive) supports &lt;strong&gt;3&lt;/strong&gt; concurrent real-time chat Sessions, &lt;strong&gt;RTX 4090&lt;/strong&gt; (Windows) is recommended at &lt;strong&gt;2&lt;/strong&gt;, and &lt;strong&gt;RTX 3090&lt;/strong&gt; (Windows, with system/desktop usage) at &lt;strong&gt;1&lt;/strong&gt;; if the GPU is shared with other processes, recalculate from available memory. Higher compute improves low-latency and real-time lip experience.&lt;/p&gt;
&lt;p&gt;This document describes &lt;strong&gt;product-level support for RTX 5090 on Linux&lt;/strong&gt;, for external communication and technical evaluation.&lt;/p&gt;
&lt;p&gt;The post &lt;a href="https://frankfu.blog/openai/navtalk-official-support-for-nvidia-rtx-5090-on-linux-2/" rel="noopener noreferrer"&gt;NavTalk Official Support for NVIDIA RTX 5090 on Linux&lt;/a&gt; appeared first on &lt;a href="https://frankfu.blog" rel="noopener noreferrer"&gt;Frank Fu's Blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>openai</category>
    </item>
  </channel>
</rss>
