DEV Community

Aloysius Chan

Originally published at insightginie.com

How the OpenClaw Video Message Skill Creates Avatar Video Notes – Step‑by‑Step Guide

The OpenClaw ecosystem provides a growing library of skills that extend the
capabilities of personal AI assistants. One of the most visually engaging
skills is the video‑message skill, which enables users to generate short
avatar‑driven video clips and deliver them as Telegram video notes. This guide
explains the skill’s purpose, its underlying components, installation
requirements, configuration options, and a typical usage workflow.

At its core, the video‑message skill relies on the avatarcam command‑line
tool. Avatarcam takes an audio file (either supplied directly or produced by a
text‑to‑speech engine) and a VRM avatar model, then renders the avatar
speaking in sync with the audio. The output is a square MP4 video (384×384
pixels) encoded with H.264 video and AAC audio at a constant 30 fps. When the
skill is invoked through OpenClaw’s messaging system, the resulting video is
sent via the Telegram sendVideoNote API, which displays the clip as a
circular video note in chats.

The skill is particularly useful when a user prefers a richer, more personal
response than plain text or audio. Examples include sending a greeting,
delivering a short tutorial, or reacting with an expressive avatar that
lip‑syncs to a spoken message. Because the output conforms to Telegram’s video
note format, the file size stays modest while preserving high visual quality.

To use the skill, several system dependencies must be present. FFmpeg is
required for video processing and encoding. On macOS, FFmpeg can be installed
with Homebrew (brew install ffmpeg). On Linux distributions that use APT,
the command sudo apt-get install -y ffmpeg suffices. Windows users should
download the FFmpeg binary and add its folder to the system PATH.
Additionally, Linux users need a virtual framebuffer (Xvfb) and Xauthority to
run headless rendering; these are installed via sudo apt-get install -y xvfb xauth.
macOS and Windows provide native display support, so Xvfb is unnecessary on
those platforms.
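Before invoking the skill, it can help to verify the dependencies are actually on the PATH. The following is a minimal Python sketch of such a check (the helper name is illustrative, not part of the skill itself):

```python
import platform
import shutil

def missing_dependencies():
    """Return the required external tools that cannot be found on PATH.

    FFmpeg is needed on every platform; Xvfb and xauth only on Linux,
    where rendering runs headless.
    """
    required = ["ffmpeg"]
    if platform.system() == "Linux":
        required += ["Xvfb", "xauth"]
    return [tool for tool in required if shutil.which(tool) is None]

print(missing_dependencies())  # empty list when everything is installed
```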

For Docker‑based deployments, the OpenClaw documentation recommends adding a
set of APT packages to the OPENCLAW_DOCKER_APT_PACKAGES environment
variable. The list includes build-essential procps curl file git
ca-certificates xvfb xauth libgbm1 libxss1 libatk1.0-0 libatk-bridge2.0-0
libgdk-pixbuf2.0-0 libgtk-3-0 libasound2 libnss3 ffmpeg. These packages ensure that
the container can render the VRM avatar without a physical GPU or display.
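To avoid transcription errors in that long list, one option is to keep it in one place and generate the docker run flag from it. A small sketch (the helper function is hypothetical; only the package list and the environment variable name come from the documentation):

```python
# APT package list recommended by the OpenClaw docs for Docker deployments.
APT_PACKAGES = (
    "build-essential procps curl file git ca-certificates xvfb xauth "
    "libgbm1 libxss1 libatk1.0-0 libatk-bridge2.0-0 libgdk-pixbuf2.0-0 "
    "libgtk-3-0 libasound2 libnss3 ffmpeg"
)

def docker_env_args():
    """Build the `docker run` arguments that pass the list into the container."""
    return ["-e", f"OPENCLAW_DOCKER_APT_PACKAGES={APT_PACKAGES}"]

print(" ".join(docker_env_args()))
```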

Configuration of the video‑message skill is handled in the TOOLS.md file of
the OpenClaw instance. Two primary settings control the appearance of the
generated video: the avatar file and the background. The avatar setting
expects a path to a VRM model (defaulting to default.vrm if unspecified).
The background setting can be a hex colour (e.g., #00FF00) or a path to an
image file that will be used as the backdrop behind the avatar. Adjusting
these values allows creators to match the video to a brand palette or a
specific scene.
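The two settings can be pictured as a small key-value pair, with the background accepting either form. A sketch with illustrative key names (the actual TOOLS.md layout may differ):

```python
# Illustrative defaults mirroring the two settings described above;
# the real key names in TOOLS.md may differ.
DEFAULT_CONFIG = {
    "avatar": "default.vrm",   # path to the VRM model
    "background": "#00FF00",   # hex colour or path to a backdrop image
}

def classify_background(value):
    """Distinguish a hex colour from an image path (simple heuristic)."""
    return ("colour", value) if value.startswith("#") else ("image", value)

print(classify_background(DEFAULT_CONFIG["background"]))  # ('colour', '#00FF00')
print(classify_background("studio.png"))                  # ('image', 'studio.png')
```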

The typical flow begins when a user asks the assistant to “send a video
message saying hello”. OpenClaw first checks whether the request matches the
video‑message skill’s trigger phrases (e.g., “video message”, “avatar video”,
“video reply”). If a match is found, the skill proceeds through three main
stages: text‑to‑speech conversion, avatar video generation, and video‑note
transmission.
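Trigger matching of this kind can be sketched as a simple substring check over the phrases listed above (the function is illustrative; OpenClaw's actual matching logic may be more sophisticated):

```python
# Trigger phrases named in the article; matching is case-insensitive here.
TRIGGER_PHRASES = ("video message", "avatar video", "video reply")

def matches_video_skill(request):
    """Return True when the request contains one of the skill's triggers."""
    text = request.lower()
    return any(phrase in text for phrase in TRIGGER_PHRASES)

print(matches_video_skill("send a video message saying hello"))  # True
print(matches_video_skill("what's the weather today?"))          # False
```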

In the text‑to‑speech stage, the skill invokes OpenClaw’s built‑in TTS tool
(or an external TTS service if configured) with the supplied text. For the
example phrase “Hello! How are you today?”, the TTS engine produces an audio
file, commonly stored in a temporary location such as /tmp/voice.mp3. If the
user already provides an audio file, this step can be skipped.
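The "skip TTS when audio is supplied" decision can be sketched as follows. The `synthesize` stub stands in for OpenClaw's TTS tool and is purely hypothetical; only the /tmp/voice.mp3 convention comes from the article:

```python
def synthesize(text, out_path):
    """Stub standing in for OpenClaw's built-in TTS tool (hypothetical)."""
    pass

def audio_for_message(text=None, audio_path=None):
    """Pick the audio source: a user-supplied file lets the TTS stage be skipped."""
    if audio_path:
        return audio_path
    out_path = "/tmp/voice.mp3"  # temporary location mentioned in the article
    synthesize(text, out_path)
    return out_path

print(audio_for_message(audio_path="/tmp/recording.mp3"))   # /tmp/recording.mp3
print(audio_for_message(text="Hello! How are you today?"))  # /tmp/voice.mp3
```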

Next, the skill runs avatarcam with the generated audio, the avatar path
from TOOLS.md, and the background setting. A representative command looks
like:

avatarcam --audio /tmp/voice.mp3 --output /tmp/video.mp4 --background '#00FF00'

When executed, avatarcam launches an Electron process that loads the VRM
model, animates its visemes based on the audio’s phonetic content, and renders
the scene at 1280×720 pixels. The rendered frames are captured via
canvas.captureStream(30) and piped to FFmpeg, which crops the image to a
square, normalises the frame rate, scales it to 384×384, and encodes the final
MP4. The entire pipeline typically takes about 1.5 times the length of the
audio; a 20‑second clip therefore requires roughly 30 seconds of processing.
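That rule of thumb is easy to express directly:

```python
def estimated_processing_seconds(audio_seconds):
    """Rule of thumb from the pipeline above: roughly 1.5x the audio length."""
    return 1.5 * audio_seconds

print(estimated_processing_seconds(20))  # 30.0
print(estimated_processing_seconds(60))  # 90.0
```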

After the MP4 file is ready, the skill uses OpenClaw’s message tool to send
it as a video note. The command format is:

message action=send filePath=/tmp/video.mp4 asVideoNote=true

The asVideoNote=true flag tells the message tool to call Telegram’s
sendVideoNote endpoint, which displays the video as a circular note. Once
the transmission succeeds, the skill returns NO_REPLY to indicate that no
further textual response is needed. Finally, any temporary files (e.g.,
/tmp/voice.mp3 and /tmp/video.mp4) are cleaned up to free disk space.
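The render-send-cleanup sequence can be sketched by assembling the two invocations shown above as argument lists and removing the temporary files afterwards (the helper functions are illustrative, not part of the skill):

```python
import os

def build_commands(audio="/tmp/voice.mp3", video="/tmp/video.mp4",
                   background="#00FF00", as_video_note=True):
    """Assemble the two invocations described above as argument lists."""
    render = ["avatarcam", "--audio", audio, "--output", video,
              "--background", background]
    note = "true" if as_video_note else "false"
    send = ["message", "action=send", f"filePath={video}", f"asVideoNote={note}"]
    return render, send

def cleanup(*paths):
    """Remove temporary artifacts once the video note has been sent."""
    for path in paths:
        try:
            os.remove(path)
        except FileNotFoundError:
            pass

render_cmd, send_cmd = build_commands()
print(render_cmd)
print(send_cmd)
```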

For users who prefer a regular rectangular video instead of a circular note,
the skill can be invoked with asVideoNote=false (or the flag omitted), causing
the message tool to use the standard sendVideo API. This flexibility makes
the skill adaptable to various chat platforms that support different video
formats.

Performance considerations are worth noting. Because the rendering relies on
CPU‑based compositing via Electron and FFmpeg, generation times scale linearly
with audio length. On a typical modern laptop, a 10‑second message may be
produced in about 15 seconds, while a 60‑second clip (the maximum duration
enforced by the skill) could take close to 90 seconds. Users seeking faster
turnaround can reduce the output resolution or frame rate by modifying the
internal FFmpeg parameters, though doing so may affect visual fidelity.

Security and privacy are also addressed. The skill processes all data locally;
no audio or video leaves the host machine except for the final transmission to
Telegram. Temporary files are stored in a system‑specific temp directory and
removed immediately after sending, minimizing the risk of residual data
exposure. Moreover, the skill does not require any external API keys for TTS
if the built‑in engine is used, keeping the deployment self‑contained.

In summary, the OpenClaw video‑message skill provides a seamless way to
transform spoken or typed messages into engaging avatar videos. By combining
TTS, VRM animation, and FFmpeg processing, it delivers high‑quality,
lip‑synced content that can be sent as Telegram video notes or regular videos.
The skill’s modest system requirements, clear configuration via TOOLS.md,
and straightforward workflow make it an accessible addition for anyone looking
to enrich their AI‑driven conversations with visual flair.

The skill can be found at: messages/SKILL.md
