Alain Chan

Breaking the Silence: Running Hermes Agent with Local C++ Voice Cloning (VoxCPM2) on ARM64

This is a submission for the Hermes Agent Challenge

Most AI agents are deaf and mute, communicating solely through text or latency-heavy cloud TTS APIs. When I set out to build a fully autonomous morning assistant using Hermes Agent hosted locally on my Debian ARM64 server, I wanted something different. I wanted a private, high-fidelity, cloned voice that could talk to me natively on WhatsApp every morning with custom-tailored weather briefings and diet-aware recommendations.

To achieve this, I integrated Hermes with VoxCPM2, a multilingual speech-cloning model with a highly optimized C++ inference engine. Along the way, I hit some brutal low-level build failures, model-packaging quirks, and real-time audio pipeline hurdles.

Here is the exact blueprint of how I overcame these ARM64 limitations, patched GGML, and wired Hermes Agent to speak to me in a pristine, cloned voice.


The Vision: A Voice-First Private Daily Agent

The goal was to leverage Hermes Agent's autonomous Cron and Persistent Memory systems. Every morning at a scheduled time, a cron job fires a custom Python script. Hermes gathers local weather forecasts, synthesizes them with personal preferences, and prepares a daily briefing.

Instead of printing text, the agent passes the payload to a local C++ inference pipeline, clones a target voice, packages the audio, and sends it directly to my WhatsApp as an instant, native voice message.

+-----------------------------------------------------------------+
|                         Hermes Agent                            |
|  [Cron Job (Scheduled)] -> [Weather/News Fetch] -> [Persist Mem]|
+-------------------------------+---------------------------------+
                                | (Text Payload)
                                v
+-----------------------------------------------------------------+
|                     VoxCPM2 C++ Engine                          |
|  [16kHz Reference WAV] -> [ggml.cpp] -> [High-Fid FP16 Voice]   |
+-------------------------------+---------------------------------+
                                | (Raw WAV Output)
                                v
+-----------------------------------------------------------------+
|                     Audio Pipeline & Delivery                   |
|  [FFmpeg (OGG/Opus)] -> [Local WA Bridge] -> [Native Voice Msg] |
+-----------------------------------------------------------------+

Hurdle 1: Bypassing the 64-Character GGML Tensor Limit

VoxCPM2's C++ inference engine relies on a clean, local build of ggml. When compilation finished and I attempted to load the larger, highly expressive GGUF models for multimodal/cloned speech, the engine crashed instantly with loading errors.

The Cause:

GGML historically sets GGML_MAX_NAME, the size of the fixed buffer that stores each tensor's name, to 64. Because high-fidelity speech models contain deep, hierarchical layers with descriptive naming schemes, their tensor names can easily exceed that limit.

The Fix:

I had to patch the underlying GGML source before building. If you are running into this, navigate to third_party/ggml/include/ggml.h and increase the limit to 128:

// Locate in third_party/ggml/include/ggml.h
// Old definition:
// #define GGML_MAX_NAME 64

// New patched definition:
#define GGML_MAX_NAME 128

After this change, rebuilding the C++ project let the GGUF loader parse the long tensor names of the deep voice layers without truncation or segmentation faults.
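If you rebuild often, the header edit can be scripted so it survives a fresh checkout. The following is a minimal sketch, not part of the original tooling: the helper name raise_max_name is hypothetical, and it assumes the header defines the limit with a plain "#define GGML_MAX_NAME <number>" line (optionally indented inside an #ifndef guard).

```python
import re

# Matches "#define GGML_MAX_NAME <number>", tolerating extra spaces or tabs
# (including the "#   define" style often used inside #ifndef guards).
# Commented-out lines such as "// #define ..." are deliberately not matched.
_MAX_NAME_RE = re.compile(
    r"^([ \t]*#[ \t]*define[ \t]+GGML_MAX_NAME[ \t]+)\d+", re.MULTILINE
)


def raise_max_name(header_text: str, new_limit: int = 128) -> str:
    """Return header_text with GGML_MAX_NAME set to new_limit."""
    patched, count = _MAX_NAME_RE.subn(rf"\g<1>{new_limit}", header_text)
    if count == 0:
        raise ValueError("GGML_MAX_NAME definition not found")
    return patched


# Hypothetical usage against the path mentioned above:
#   from pathlib import Path
#   header = Path("third_party/ggml/include/ggml.h")
#   header.write_text(raise_max_name(header.read_text()))
```

Raising an error when nothing matches is intentional: a silent no-op would leave you debugging the same loader crash after a "successful" patch.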


Hurdle 2: Untangling Model Packages for C++ Inference

Many single-file GGUF packages available online (e.g., standard model merges) lack the metadata that VoxCPM2's raw C++ inference binary requires.

To run end-to-end voice cloning ("Ultimate Mode") successfully, I found that you must load model files that preserve the explicit metadata structure, either as a separated pair or as a verified unified package:

  1. base_lm_q8_0.gguf (the quantized base language model weights)
  2. residual_lm_q8_0.gguf (the residual weights)
  3. Alternatively, in place of the two files above, a verified unified package such as voxcpm2-q8_0-audiovae-f16.gguf from bluryar/VoxCPM-GGUF.

Running an FP16 high-fidelity model on an ARM64 CPU prioritizes pristine vocal texture and rich tone over faster but more robotic lower-precision modes.


Hurdle 3: Designing the Real-Time Audio & Delivery Pipeline

Getting Hermes to talk natively on WhatsApp requires an exact, low-latency audio pipeline.

Step 1: Format Reference Audio

VoxCPM2 C++ cloning requires a clean 16 kHz mono WAV reference file. Our utility script converts a standard MP3 sample to that exact format before running the model:

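import subprocess

# Convert the MP3 sample to 16 kHz mono WAV with FFmpeg.
# `args` is the script's parsed argparse namespace (ref_mp3 in, ref_wav out).
subprocess.run([
    "ffmpeg", "-y", "-i", args.ref_mp3,
    "-ar", "16000", "-ac", "1", args.ref_wav
], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)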
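Because the conversion above discards FFmpeg's output, a quick sanity check on the result can catch a bad reference file before the much slower inference step. This is a minimal sketch using Python's standard wave module; the helper name is_16k_mono is hypothetical and not part of the original script.

```python
import wave


def is_16k_mono(wav_path: str) -> bool:
    """Return True if the WAV file is single-channel at 16,000 Hz."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1
```

Calling is_16k_mono(args.ref_wav) right after the conversion and aborting on False keeps a mis-sampled reference from silently degrading the cloned voice.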

Step 2: C++ Inference

The utility executes the C++ binary with customized parameters, leveraging multi-threading optimized for the server's ARM64 CPU:

/home/debian/VoxCPM.cpp/build/examples/voxcpm_tts \
    --model-path /path/to/voxcpm2-f16-audiovae-f16.gguf \
    --prompt-audio /path/to/ref.wav \
    --prompt-text "Reference voice transcript text." \
    --text "Target synthesis weather report text." \
    --output /path/to/output.wav \
    --backend cpu \
    --threads 4 \
    --cfg-value 2.0 \
    --inference-timesteps 10
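From Python, the same invocation can be assembled as an argument list, which avoids shell-quoting problems when the briefing text contains quotes or non-ASCII characters. The sketch below reuses only the flags shown in the command above; the function names build_tts_command and synthesize are hypothetical, and the defaults simply mirror the values in this post.

```python
import subprocess
from typing import List


def build_tts_command(
    binary: str,
    model_path: str,
    prompt_audio: str,
    prompt_text: str,
    text: str,
    output: str,
    threads: int = 4,
    cfg_value: float = 2.0,
    inference_timesteps: int = 10,
) -> List[str]:
    """Assemble the voxcpm_tts argument list from the flags shown above."""
    return [
        binary,
        "--model-path", model_path,
        "--prompt-audio", prompt_audio,
        "--prompt-text", prompt_text,
        "--text", text,
        "--output", output,
        "--backend", "cpu",
        "--threads", str(threads),
        "--cfg-value", str(cfg_value),
        "--inference-timesteps", str(inference_timesteps),
    ]


def synthesize(command: List[str]) -> None:
    """Run the binary; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes it easy to log or unit-test the exact arguments without running a full synthesis.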

Step 3: Low-Latency Encoding & WhatsApp Bridge Delivery

Standard WAV files arrive on WhatsApp as document attachments. To deliver them as native, instant voice messages (playable voice bubbles), we transcode them into .ogg format using the highly compressed Opus codec.

We can also apply FFmpeg's dynaudnorm (dynamic audio normalizer) filter to keep output volume levels consistent:

ffmpeg -y -i output.wav -filter:a dynaudnorm -c:a libopus output.ogg
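Inside a Python script, the transcoding step might look like the sketch below. It uses exactly the FFmpeg options from the command above; the function names are hypothetical, and deriving the .ogg path from the .wav path is an assumption about how the files are laid out.

```python
import subprocess
from pathlib import Path
from typing import List


def build_opus_command(wav_path: str) -> List[str]:
    """Build the FFmpeg command that normalizes and encodes to OGG/Opus."""
    ogg_path = str(Path(wav_path).with_suffix(".ogg"))
    return [
        "ffmpeg", "-y", "-i", wav_path,
        "-filter:a", "dynaudnorm",  # even out volume across the clip
        "-c:a", "libopus",          # Opus audio inside an OGG container
        ogg_path,
    ]


def encode_voice_note(wav_path: str) -> str:
    """Run the encode and return the path of the resulting .ogg file."""
    command = build_opus_command(wav_path)
    subprocess.run(command, check=True)
    return command[-1]
```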

Once the audio file is ready, the script programmatically makes an HTTP POST request to a local WhatsApp API bridge endpoint /send-media with the payload:

{
  "chatId": "user_whatsapp_jid@lid",
  "filePath": "/path/to/output.ogg",
  "mediaType": "audio"
}

This forces WhatsApp to render the media natively as a press-to-play instant voice message bubble!
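The request itself needs nothing beyond the standard library. In this sketch the /send-media path and the payload keys come from the example above, while the host and port in BRIDGE_URL, the JSON content type, and the function names are assumptions; adjust them to whatever your bridge actually exposes.

```python
import json
import urllib.request

# Assumed address of the local bridge; only the /send-media path is from the post.
BRIDGE_URL = "http://127.0.0.1:3000/send-media"


def build_send_media_request(
    chat_id: str, file_path: str, bridge_url: str = BRIDGE_URL
) -> urllib.request.Request:
    """Build the POST request carrying the audio payload as JSON."""
    payload = {"chatId": chat_id, "filePath": file_path, "mediaType": "audio"}
    return urllib.request.Request(
        bridge_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_voice_note(chat_id: str, file_path: str, timeout: float = 30.0) -> int:
    """Send the request and return the HTTP status code from the bridge."""
    request = build_send_media_request(chat_id, file_path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Note that filePath is a path on the server, so the bridge must run on the same machine (or share a volume) to read the .ogg file.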


Combining It All: The Self-Improving Local Weather Scheduler

The backbone of this workflow consists of two main Python components scheduled and triggered under Hermes Agent:

  1. cron_morning_weather.py: Fetches real-time JSON forecast from wttr.in for the user's location, parses hourly temperatures, converts English weather descriptions into natural, expressive Cantonese, decides if an umbrella is needed, and outputs a cute morning briefing.
  2. run_clone.py: Receives the text payload, selects the model file, assembles the command-line parameters for the C++ binary, encodes the audio to Opus with ffmpeg (libopus), and delivers it to the local WhatsApp gateway bridge.
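The umbrella decision in cron_morning_weather.py could be as simple as the sketch below. It is not the original script: it assumes the forecast was fetched from wttr.in in its JSON format (format=j1), where each day under "weather" carries an "hourly" list with string fields such as "tempC" and "chanceofrain". The 50 percent threshold and both function names are illustrative assumptions.

```python
from typing import Tuple


def needs_umbrella(forecast: dict, threshold: int = 50) -> bool:
    """True if any hourly slot today has a rain chance at or above threshold."""
    hourly = forecast["weather"][0]["hourly"]
    return any(int(slot["chanceofrain"]) >= threshold for slot in hourly)


def temperature_range(forecast: dict) -> Tuple[int, int]:
    """Return today's (min, max) temperature in Celsius from the hourly data."""
    temps = [int(slot["tempC"]) for slot in forecast["weather"][0]["hourly"]]
    return min(temps), max(temps)
```

The values are converted with int() because the JSON is assumed to encode numbers as strings; check a real response from your location before relying on these field names.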

The Magic of Hermes Agent: Memory and Location Privacy

What makes this system genuinely autonomous rather than a simple cron-bash script is Hermes's self-improving memory architecture.

  1. Persistent Memory (User Profile): Hermes maintains an ongoing log of user preferences across sessions. It remembers, for example, my interest in philosophy.
  2. Context-Aware Briefings: When generating the script text, Hermes synthesizes these facts from its memory. The morning weather update isn't just a reading of numbers; it dynamically adds philosophical thoughts suited to the day's weather.
  3. Timezone Synchronization: Because scheduled cron tasks run on the server's UTC clock, Hermes automatically calculates the local offset (e.g., BST vs. UTC) to ensure the morning briefing is delivered exactly at the user's local wake-up time.
  4. Autonomous Skill Management: When there are path updates or script logic tweaks, Hermes adjusts its internal reference memory, avoiding stale or cached references during execution.
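The timezone arithmetic in item 3 can be illustrated with the standard library. This is a sketch of the underlying calculation, not Hermes's internal logic; the function name local_wakeup_to_utc is hypothetical.

```python
from datetime import date, datetime, time, timezone, tzinfo


def local_wakeup_to_utc(day: date, wake: time, tz: tzinfo) -> datetime:
    """Convert a local wake-up time on a given day to the equivalent UTC time."""
    local_dt = datetime.combine(day, wake, tzinfo=tz)
    return local_dt.astimezone(timezone.utc)


# Hypothetical usage with a named zone (needs tz data on the server):
#   from zoneinfo import ZoneInfo
#   local_wakeup_to_utc(date.today(), time(7, 30), ZoneInfo("Europe/London"))
```

Passing a named zone such as Europe/London rather than a fixed offset means the switch between GMT and BST is handled automatically, so the UTC cron time shifts by an hour when daylight saving starts or ends.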

Why Open-Source Agents Win

Running Hermes Agent locally on an ARM64 server proved something crucial: We do not need to rely on proprietary or closed-source ecosystems to build delightful, highly personalized AI experiences.

With a one-line change to ggml.h, an optimized C++ inference binary, and Hermes's robust multi-session persistent memory, I have a private, voice-cloning companion that knows my diet, my daily schedule, and my philosophical quirks, and it costs virtually nothing when idle.

If you are building with Hermes, don't just stay in the terminal. Give your agent a voice, patch those C++ boundaries, and build something that feels alive!
