This article is an AI-assisted translation of a Japanese technical article.
https://zenn.dev/yokomachi/articles/202602_vrm-motion-control-on-web
Introduction
I'm currently working on a personal AI agent project and decided to use a 3D model as the user interface.
Since I didn't have the knowledge to build everything from scratch, I leveraged AITuberKit, an OSS project I'd been aware of for a while, to quickly set up the frontend.
Tech Stack
- VRM model creation: VRoid Studio
- Web frontend: Next.js, TypeScript
- VRM rendering & control: three-vrm (v3.0.0), Three.js
- Base kit: AITuberKit
- Agent implementation: Strands Agents, Amazon Bedrock AgentCore (not covered in detail in this article)
VRM and VRoid Studio
VRM is a file format designed for 3D avatars.
With VRoid Studio, you can create characters and export them in VRM format without any 3D modeling knowledge.
My only prior experience was creating characters in video games, yet I was able to create two models (male and female) in about an hour.
https://x.com/_cityside/status/2019742015617994773
What AITuberKit Can Do
AITuberKit is an OSS project that displays VRM models in a web browser and bundles features such as LLM-powered chat, facial expression control, and speech synthesis.
Here are some of the key features AITuberKit provides:
- VRM model display, facial expression control, and lip-sync
- LLM-powered chatbot functionality
- Speech synthesis API integration
- YouTube streaming integration
- Multimodal input
- etc.
Because I'm building a personal AI agent, I use AITuberKit's base features, such as VRM display control and chatbot functionality, and customize heavily on top of them.
Implementing Motion Control
Here's where we get to the main topic.
AITuberKit supports switching facial expressions (smile, angry face, etc.) out of the box, so I decided to implement additional body motions (bowing, extending a hand, etc.).
https://x.com/_cityside/status/2016874430056845502
Architecture Overview
Here's the overall picture of the motion control system:
LLM Response
  ↓ Streaming parser
  ├─ [emotion] Emotion tag → ExpressionController → Facial expression control
  └─ [bow/present] Motion tag → GestureController → Bone control
                                      ↑
                       EmoteController (conflict resolution)
The EmoteController sits between facial expressions and motions to handle conflicts between them.
Motion Definitions
Motions are implemented by defining bone rotations as keyframes.
Here's an example definition for a bow:
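// src/features/emoteController/gestureController.ts
import * as THREE from 'three'
import type { VRMHumanBoneName } from '@pixiv/three-vrm'

// Target rotation for a single humanoid bone
interface BoneRotation {
  bone: VRMHumanBoneName
  rotation: THREE.Quaternion
}

// One pose to reach, and the time taken to reach it
interface GestureKeyframe {
  duration: number
  bones: BoneRotation[]
}

// A complete gesture: its keyframes, how long the final pose is held,
// and whether the eyes should close while it plays
interface GestureDefinition {
  keyframes: GestureKeyframe[]
  holdDuration: number
  closeEyes?: boolean
}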
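The interfaces carry timing information (`duration`, `holdDuration`) but the article does not show how it is played back. As a minimal sketch, assuming a single-keyframe gesture that blends in over `duration`, holds, and then blends back out over the same `duration`, the timing could be turned into a blend weight like this. The name `gestureWeight`, the symmetric return phase, and the smoothstep easing are all assumptions for illustration, not AITuberKit code.

```typescript
// Hypothetical playback helper; names and easing are assumptions.
interface GestureTiming {
  duration: number // time to reach the pose
  holdDuration: number // time to stay in the pose
}

// Smoothstep easing: 0 at t = 0, 1 at t = 1, with zero slope at both ends.
function smoothstep(t: number): number {
  const c = Math.min(1, Math.max(0, t))
  return c * c * (3 - 2 * c)
}

// Returns a blend weight in [0, 1] for the given elapsed time:
// ramp up over `duration`, stay at 1 for `holdDuration`, ramp down over `duration`.
function gestureWeight(elapsed: number, timing: GestureTiming): number {
  const { duration, holdDuration } = timing
  if (elapsed <= 0) return 0
  // Degenerate case: no ramp, the pose is simply on during the hold.
  if (duration <= 0) return elapsed < holdDuration ? 1 : 0
  if (elapsed < duration) return smoothstep(elapsed / duration)
  if (elapsed < duration + holdDuration) return 1
  const out = (elapsed - duration - holdDuration) / duration
  return out >= 1 ? 0 : 1 - smoothstep(out)
}
```

A weight like this could then drive an interpolation between the rest rotation and the target rotation of each bone, for example with `Quaternion.slerp` in Three.js.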
For the bow motion, three bones (spine, chest, and neck) are each rotated forward, which looks more natural than simply bending at the waist.
The arm bones are also adjusted to achieve a natural posture.
// src/features/emoteController/gestureController.ts
this._gestures.set('bow', {
  keyframes: [
    {
      duration: 1.0,
      bones: [
        {
          bone: 'spine',
          rotation: new THREE.Quaternion().setFromEuler(
            new THREE.Euler(0.25, 0, 0)
          ),
        },
        {
          bone: 'chest',
          rotation: new THREE.Quaternion().setFromEuler(
            new THREE.Euler(0.15, 0, 0)
          ),
        },
        {
          bone: 'neck',
          rotation: new THREE.Quaternion().setFromEuler(
            new THREE.Euler(0.12, 0, 0)
          ),
        },
        // Arm bones are also adjusted (omitted)
      ],
    },
  ],
  holdDuration: 1.0,
  closeEyes: true, // Close eyes during the bow
})
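To get a feel for these numbers: Three.js Euler angles are in radians, so each bone bends forward only slightly. Assuming spine, chest, and neck form a parent-child chain and all three rotate about the same local X axis (which I believe holds for the humanoid hierarchy, but treat it as an assumption), the rotations accumulate. The dependency-free sketch below composes the three rotations as quaternions and reads back the total angle. The helper names are made up for this example.

```typescript
// Quaternion stored as [x, y, z, w].
type Quat = [number, number, number, number]

// Rotation of `angle` radians about the X axis.
function quatFromXRotation(angle: number): Quat {
  return [Math.sin(angle / 2), 0, 0, Math.cos(angle / 2)]
}

// Hamilton product a * b (apply b in the frame already rotated by a).
function multiply(a: Quat, b: Quat): Quat {
  const [ax, ay, az, aw] = a
  const [bx, by, bz, bw] = b
  return [
    aw * bx + ax * bw + ay * bz - az * by,
    aw * by - ax * bz + ay * bw + az * bx,
    aw * bz + ax * by - ay * bx + az * bw,
    aw * bw - ax * bx - ay * by - az * bz,
  ]
}

// Rotation angle in radians encoded by a unit quaternion.
function angleOf(q: Quat): number {
  return 2 * Math.acos(Math.min(1, Math.max(-1, q[3])))
}

// X rotations taken from the bow definition (spine, chest, neck).
const bowAngles = [0.25, 0.15, 0.12]
const total = bowAngles.map(quatFromXRotation).reduce(multiply)

const totalRadians = angleOf(total)
const totalDegrees = (totalRadians * 180) / Math.PI
console.log(totalRadians.toFixed(2), totalDegrees.toFixed(1))
```

Under those assumptions the head ends up tilted forward by about 0.52 rad, a little under 30 degrees, spread across three joints instead of concentrated in one.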
Triggering Motions from LLM Responses
The character's expressions are controlled by having the LLM output emotion and motion tags in its responses.
Emotion tags are implemented by default in AITuberKit. The LLM response looks like this:
[happy]Thank you so much!
Motion tags are a custom addition. They appear in the response just like emotion tags:
Welcome! [bow]What kind of fragrance are you looking for today?
When both emotion and motion tags appear simultaneously, both are triggered.
For example, [happy][bow] results in the character bowing with a smile.
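As a rough sketch of how such tags could be separated from the spoken text while the response streams in, the parser below buffers incomplete tags across chunk boundaries and emits emotion, motion, and text events. This is not AITuberKit's actual parser: the class name `TagStreamParser`, the event shape, and the pass-through of unknown tags are assumptions. The emotion names come from the system prompt and the motion names (`bow`, `present`) from the architecture diagram.

```typescript
// Hypothetical sketch; not AITuberKit's actual parser.
const EMOTIONS = ['neutral', 'happy', 'angry', 'sad', 'relaxed', 'surprised'] as const
const MOTIONS = ['bow', 'present'] as const

type Emotion = (typeof EMOTIONS)[number]
type Motion = (typeof MOTIONS)[number]

type ParsedEvent =
  | { kind: 'emotion'; name: Emotion }
  | { kind: 'motion'; name: Motion }
  | { kind: 'text'; text: string }

// Map a tag name to an event; unknown tags are passed through as plain text.
function classify(name: string): ParsedEvent {
  if ((EMOTIONS as readonly string[]).indexOf(name) !== -1) {
    return { kind: 'emotion', name: name as Emotion }
  }
  if ((MOTIONS as readonly string[]).indexOf(name) !== -1) {
    return { kind: 'motion', name: name as Motion }
  }
  return { kind: 'text', text: `[${name}]` }
}

class TagStreamParser {
  private buffer = ''

  // Feed one streamed chunk; returns the events that are complete so far.
  push(chunk: string): ParsedEvent[] {
    this.buffer += chunk
    const events: ParsedEvent[] = []
    while (this.buffer.length > 0) {
      const open = this.buffer.indexOf('[')
      if (open === -1) {
        events.push({ kind: 'text', text: this.buffer })
        this.buffer = ''
        break
      }
      if (open > 0) {
        events.push({ kind: 'text', text: this.buffer.slice(0, open) })
        this.buffer = this.buffer.slice(open)
      }
      const close = this.buffer.indexOf(']')
      if (close === -1) break // the tag may continue in the next chunk
      events.push(classify(this.buffer.slice(1, close)))
      this.buffer = this.buffer.slice(close + 1)
    }
    return events
  }

  // Call at end of stream: emits whatever is still buffered as text.
  flush(): ParsedEvent[] {
    const rest = this.buffer
    this.buffer = ''
    return rest.length > 0 ? [{ kind: 'text', text: rest }] : []
  }
}
```

With this design, a chunk ending in `[bo` emits nothing for the tag until the following chunk supplies `w]`, so a tag split by the stream is never spoken aloud or shown as text.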
The system prompt includes the following instructions:
`
## Emotional Expression
The format for conversation text is as follows. Choose the single most appropriate emotion for the entire response and prepend the emotion tag at the beginning.
[{neutral|happy|angry|sad|relaxed|surprised}]{conversation text}
`
Handling Conflicts Between Expressions and Motions
Simply applying both facial expressions and motions at the same time can cause unexpected behavior, so I've added the following controls.
For example, having the eyes open during a bow looked unnatural, so I set closeEyes: true to close the eyes on the motion control side.
The EmoteController manages this by passing flags between controllers:
// src/features/emoteController/emoteController.ts
public updateExpression(delta: number) {
  const isEmotionActive = this._expressionController.isEmotionActive

  // Skip auto-blink if the motion is closing eyes and expression is neutral
  const skipAutoBlink =
    this._gestureController.isClosingEyes && !isEmotionActive
  this._expressionController.update(delta, skipAutoBlink)
}

public updateGesture(delta: number) {
  const isEmotionActive = this._expressionController.isEmotionActive

  // Skip motion eye-close if an emotion expression is active
  this._gestureController.update(delta, isEmotionActive)
}
The emotion expressions and the motion's eye-close feature are mutually exclusive.
When the emotion is neutral, the motion side closes the eyes. When an emotion is active, the motion's eye-close is disabled and control is handed to the expression side.
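The rule can be restated as a small pure function, which makes the four possible cases easy to check. This is only an illustrative restatement under my reading of the code above: `resolveEyeControl` and its field names are hypothetical and do not exist in AITuberKit.

```typescript
// Hypothetical restatement of the eye-control rule; names are illustrative.
interface EyeControlInput {
  gestureClosesEyes: boolean // the active gesture wants the eyes closed
  isEmotionActive: boolean // a non-neutral emotion expression is showing
}

interface EyeControlDecision {
  skipAutoBlink: boolean // expression side should pause auto-blink
  applyGestureEyeClose: boolean // gesture side may close the eyes
}

function resolveEyeControl(input: EyeControlInput): EyeControlDecision {
  // The gesture controls the eyes only while no emotion expression is active.
  const gestureOwnsEyes = input.gestureClosesEyes && !input.isEmotionActive
  return {
    skipAutoBlink: gestureOwnsEyes,
    applyGestureEyeClose: gestureOwnsEyes,
  }
}

// Print the decision for every combination of inputs.
for (const gestureClosesEyes of [false, true]) {
  for (const isEmotionActive of [false, true]) {
    const decision = resolveEyeControl({ gestureClosesEyes, isEmotionActive })
    console.log(gestureClosesEyes, isEmotionActive, decision)
  }
}
```

Only one combination, a gesture that closes the eyes while the expression is neutral, hands the eyes to the gesture side. In every other case the expression side keeps control and auto-blink continues.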
Wrapping Up
A chat UI is a very common frontend for an AI agent, but even a simple model like this feels lively once it moves, which makes the project a lot of fun.
That said, controlling motions is tricky: figuring out which bones to rotate, and by how much, is surprisingly difficult.
For more complex motions, purchasing ready-made motion packs may be a better option.
