<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alain Chan</title>
    <description>The latest articles on DEV Community by Alain Chan (@alaindevs).</description>
    <link>https://dev.to/alaindevs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3957465%2F7da5c59b-d81d-4953-b1e2-f7cab087b424.png</url>
      <title>DEV Community: Alain Chan</title>
      <link>https://dev.to/alaindevs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alaindevs"/>
    <language>en</language>
    <item>
      <title>Breaking the Silence: Running Hermes Agent with Local C++ Voice Cloning (VoxCPM2) on ARM64</title>
      <dc:creator>Alain Chan</dc:creator>
      <pubDate>Fri, 29 May 2026 16:20:13 +0000</pubDate>
      <link>https://dev.to/alaindevs/breaking-the-silence-running-hermes-agent-with-local-c-voice-cloning-voxcpm2-on-arm64-1dfm</link>
      <guid>https://dev.to/alaindevs/breaking-the-silence-running-hermes-agent-with-local-c-voice-cloning-voxcpm2-on-arm64-1dfm</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/hermes-agent-2026-05-15"&gt;Hermes Agent Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Breaking the Silence: Running Hermes Agent with Local C++ Voice Cloning (VoxCPM2) on ARM64
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tntao2ae49hlmf07vfc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tntao2ae49hlmf07vfc.png" alt="VPS" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most AI agents are deaf and mute, communicating solely through text or latency-heavy cloud TTS APIs. When I set out to build a fully autonomous morning assistant using &lt;strong&gt;Hermes Agent&lt;/strong&gt; hosted locally on my Debian ARM64 server, I wanted something different. I wanted a private, high-fidelity, cloned voice that could talk to me natively on WhatsApp every morning with custom-tailored weather briefings and diet-aware recommendations.&lt;/p&gt;

&lt;p&gt;To achieve this, I integrated Hermes with &lt;strong&gt;VoxCPM2&lt;/strong&gt;—a highly optimized multilingual speech-cloning model running in clean C++. Along the way, I hit some brutal low-level compilation blocks, model-packaging quirks, and real-time audio pipeline hurdles. &lt;/p&gt;

&lt;p&gt;Here is the exact blueprint of how I overcame these ARM64 limitations, patched GGML, and wired Hermes Agent to speak to me in a pristine, cloned voice.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Vision: A Voice-First Private Daily Agent
&lt;/h2&gt;

&lt;p&gt;The goal was to leverage Hermes Agent's autonomous &lt;strong&gt;Cron&lt;/strong&gt; and &lt;strong&gt;Persistent Memory&lt;/strong&gt; systems. Every morning at a scheduled time, a cron job fires a custom Python script. Hermes gathers local weather forecasts, synthesizes them with personal preferences, and prepares a daily briefing.&lt;/p&gt;

&lt;p&gt;Instead of printing text, the agent passes the payload to a local C++ inference pipeline, clones a target voice, packages the audio, and sends it directly to my WhatsApp as an instant, native voice message.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------------------------+
|                         Hermes Agent                            |
|  [Cron Job (Scheduled)] -&amp;gt; [Weather/News Fetch] -&amp;gt; [Persist Mem]|
+-------------------------------+---------------------------------+
                                | (Text Payload)
                                v
+-----------------------------------------------------------------+
|                     VoxCPM2 C++ Engine                          |
|  [16kHz Reference WAV] -&amp;gt; [ggml.cpp] -&amp;gt; [High-Fid FP16 Voice]   |
+-------------------------------+---------------------------------+
                                | (Raw WAV Output)
                                v
+-----------------------------------------------------------------+
|                     Audio Pipeline &amp;amp; Delivery                   |
|  [FFmpeg (OGG/Opus)] -&amp;gt; [Local WA Bridge] -&amp;gt; [Native Voice Msg] |
+-----------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Hurdle 1: Bypassing the 64-Character GGML Tensor Limit
&lt;/h2&gt;

&lt;p&gt;VoxCPM2's C++ inference engine relies on a clean, local build of &lt;code&gt;ggml&lt;/code&gt;. When compilation finished and I attempted to load the larger, highly expressive GGUF models for multimodal/cloned speech, the engine crashed instantly with loading errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Cause:
&lt;/h3&gt;

&lt;p&gt;GGML defines &lt;code&gt;GGML_MAX_NAME&lt;/code&gt;, the size of the buffer that holds a tensor's name, as &lt;code&gt;64&lt;/code&gt; by default. Because high-fidelity speech models contain deep, hierarchical layers with descriptive naming schemes, their tensor names easily exceed that limit, which is what triggered the loading errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix:
&lt;/h3&gt;

&lt;p&gt;I had to patch the underlying GGML source before building. If you are running into this, navigate to &lt;code&gt;third_party/ggml/include/ggml.h&lt;/code&gt; and increase the limit to &lt;code&gt;128&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Locate in third_party/ggml/include/ggml.h&lt;/span&gt;
&lt;span class="c1"&gt;// Old definition:&lt;/span&gt;
&lt;span class="c1"&gt;// #define GGML_MAX_NAME 64&lt;/span&gt;

&lt;span class="c1"&gt;// New patched definition:&lt;/span&gt;
&lt;span class="cp"&gt;#define GGML_MAX_NAME 128
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this change, rebuilding the C++ project allowed the GGUF loader to parse the deeply nested voice-layer tensors without name truncation or segmentation faults.&lt;/p&gt;
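&lt;p&gt;As a rough pre-flight check, the sketch below flags tensor names that would not fit in a name buffer of a given size. It assumes one byte of the &lt;code&gt;GGML_MAX_NAME&lt;/code&gt; buffer is reserved for the terminating NUL, so a name must be strictly shorter than the limit. The helper &lt;code&gt;names_exceeding_limit&lt;/code&gt; and the sample names are hypothetical and are not taken from VoxCPM2.&lt;/p&gt;

```python
def names_exceeding_limit(names, max_name=64):
    """Return the names whose UTF-8 length does not fit in max_name bytes.

    Assumption: one byte of the buffer is reserved for the terminating NUL,
    so a name fits only if its encoded length is at most max_name - 1.
    """
    usable = max_name - 1
    return [n for n in names if len(n.encode("utf-8")) > usable]


# Hypothetical tensor names, for illustration only.
sample = [
    "encoder.layer.0.weight",
    "x" * 63,  # fits in a 64-byte buffer
    "y" * 64,  # does not fit in 64 bytes, fits in 128
]

print(names_exceeding_limit(sample, max_name=64))
print(names_exceeding_limit(sample, max_name=128))
```

&lt;p&gt;If the second call returns an empty list for your model's real tensor names, a limit of 128 should be enough; otherwise a larger value would be needed.&lt;/p&gt;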




&lt;h2&gt;
  
  
  Hurdle 2: Untangling Model Packages for C++ Inference
&lt;/h2&gt;

&lt;p&gt;Many single-file GGUF packages available online (e.g., standard model merges) lack the necessary metadata required by the raw C++ inference binary of VoxCPM2. &lt;/p&gt;

&lt;p&gt;To run end-to-end voice cloning ("Ultimate Mode") successfully, I found that you must load either separate model files that preserve the explicit metadata structure (items 1 and 2) or a verified unified package (item 3):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;base_lm_q8_0.gguf&lt;/code&gt; (The quantized base language model weights)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;residual_lm_q8_0.gguf&lt;/code&gt; (The residual weights)&lt;/li&gt;
&lt;li&gt;Or verified unified packages such as &lt;code&gt;voxcpm2-q8_0-audiovae-f16.gguf&lt;/code&gt; from &lt;code&gt;bluryar/VoxCPM-GGUF&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By utilizing an FP16 high-fidelity model on an ARM64 CPU, we prioritize pristine vocal textures and rich tone over fast but robotic lower-precision modes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hurdle 3: Designing the Real-Time Audio &amp;amp; Delivery Pipeline
&lt;/h2&gt;

&lt;p&gt;Getting Hermes to talk natively on WhatsApp requires an exact, low-latency audio pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Format Reference Audio
&lt;/h3&gt;

&lt;p&gt;VoxCPM2 C++ cloning requires a clean &lt;strong&gt;16 kHz mono WAV&lt;/strong&gt; reference file. The utility script converts a standard MP3 sample to that format before running the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
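&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;

&lt;span class="c1"&gt;# Convert the reference MP3 to 16 kHz mono WAV with FFmpeg (-ar sets the sample rate, -ac the channel count)
&lt;/span&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;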
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ffmpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-i&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ref_mp3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-ar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-ac&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ref_wav&lt;/span&gt;
&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DEVNULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DEVNULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: C++ Inference
&lt;/h3&gt;

&lt;p&gt;The utility executes the C++ binary with customized parameters, leveraging multi-threading optimized for the server's ARM64 CPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/home/debian/VoxCPM.cpp/build/examples/voxcpm_tts &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--model-path&lt;/span&gt; /path/to/voxcpm2-f16-audiovae-f16.gguf &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--prompt-audio&lt;/span&gt; /path/to/ref.wav &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--prompt-text&lt;/span&gt; &lt;span class="s2"&gt;"Reference voice transcript text."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Target synthesis weather report text."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--output&lt;/span&gt; /path/to/output.wav &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--backend&lt;/span&gt; cpu &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--threads&lt;/span&gt; 4 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cfg-value&lt;/span&gt; 2.0 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--inference-timesteps&lt;/span&gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
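&lt;p&gt;From Python, the same invocation can be assembled as an argument list, which avoids shell-quoting problems when the briefing text contains quotes or non-ASCII characters. This is a minimal sketch, not the actual &lt;code&gt;run_clone.py&lt;/code&gt;: the function name &lt;code&gt;build_tts_command&lt;/code&gt; and its parameter names are hypothetical, and only the flags shown in the command above are used.&lt;/p&gt;

```python
import shlex


def build_tts_command(binary, model, ref_wav, ref_text, text, out_wav,
                      threads=4, cfg_value=2.0, timesteps=10):
    """Assemble the argument list for the voxcpm_tts invocation shown above."""
    return [
        binary,
        "--model-path", model,
        "--prompt-audio", ref_wav,
        "--prompt-text", ref_text,
        "--text", text,
        "--output", out_wav,
        "--backend", "cpu",
        "--threads", str(threads),
        "--cfg-value", str(cfg_value),
        "--inference-timesteps", str(timesteps),
    ]


cmd = build_tts_command(
    "/home/debian/VoxCPM.cpp/build/examples/voxcpm_tts",
    "/path/to/voxcpm2-f16-audiovae-f16.gguf",
    "/path/to/ref.wav",
    "Reference voice transcript text.",
    "Target synthesis weather report text.",
    "/path/to/output.wav",
)
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True)
```

&lt;p&gt;Passing the list straight to &lt;code&gt;subprocess.run&lt;/code&gt; with &lt;code&gt;check=True&lt;/code&gt; raises an exception if the binary exits with an error, so a failed synthesis does not silently produce a stale audio file.&lt;/p&gt;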



&lt;h3&gt;
  
  
  Step 3: Low-Latency Encoding &amp;amp; WhatsApp Bridge Delivery
&lt;/h3&gt;

&lt;p&gt;Standard WAV files arrive on WhatsApp as document attachments. To deliver them as &lt;strong&gt;native, instant voice messages&lt;/strong&gt; (playable voice bubbles), we transcode them into &lt;code&gt;.ogg&lt;/code&gt; format using the highly compressed &lt;strong&gt;Opus&lt;/strong&gt; codec. &lt;/p&gt;

&lt;p&gt;We can also apply FFmpeg's &lt;code&gt;dynaudnorm&lt;/code&gt; (dynamic audio normalizer) filter to keep output volume levels consistent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ffmpeg &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; output.wav &lt;span class="nt"&gt;-filter&lt;/span&gt;:a dynaudnorm &lt;span class="nt"&gt;-c&lt;/span&gt;:a libopus output.ogg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1z4lifypgj33av2l46h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1z4lifypgj33av2l46h.png" alt="WhatsApp" width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the audio file is ready, the script sends an HTTP POST request to the &lt;code&gt;/send-media&lt;/code&gt; endpoint of a local WhatsApp API bridge with this payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"chatId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_whatsapp_jid@lid"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"filePath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/output.ogg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mediaType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"audio"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This forces WhatsApp to render the media natively as a press-to-play instant voice message bubble!&lt;/p&gt;
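&lt;p&gt;The delivery step could be sketched with the standard library alone. The host and port in &lt;code&gt;BRIDGE_URL&lt;/code&gt; are placeholders, the function name &lt;code&gt;build_voice_request&lt;/code&gt; is hypothetical, and the sketch assumes the bridge accepts a JSON body; only the &lt;code&gt;/send-media&lt;/code&gt; path and the three payload fields come from the description above.&lt;/p&gt;

```python
import json
import urllib.request

# Hypothetical bridge address; only the /send-media path comes from the article.
BRIDGE_URL = "http://127.0.0.1:3000/send-media"


def build_voice_request(chat_id, ogg_path, url=BRIDGE_URL):
    """Build (but do not send) the JSON POST request for the bridge."""
    payload = {"chatId": chat_id, "filePath": ogg_path, "mediaType": "audio"}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_voice_request("user_whatsapp_jid@lid", "/path/to/output.ogg")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req, timeout=30)
```

&lt;p&gt;Building the request separately from sending it makes the payload easy to inspect or log before anything reaches the bridge.&lt;/p&gt;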




&lt;h2&gt;
  
  
  Combining It All: The Self-improving Local Weather Scheduler
&lt;/h2&gt;

&lt;p&gt;The backbone of this workflow consists of two main Python components scheduled and triggered under Hermes Agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;cron_morning_weather.py&lt;/code&gt;: Fetches real-time JSON forecast from &lt;code&gt;wttr.in&lt;/code&gt; for the user's location, parses hourly temperatures, converts English weather descriptions into natural, expressive Cantonese, decides if an umbrella is needed, and outputs a cute morning briefing.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;run_clone.py&lt;/code&gt;: Receives the text payload, assembles the model path and command-line parameters for the C++ binary, encodes the resulting audio to Opus with &lt;code&gt;ffmpeg&lt;/code&gt; (&lt;code&gt;libopus&lt;/code&gt;), and delivers it to the local WhatsApp gateway bridge.&lt;/li&gt;
&lt;/ol&gt;
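&lt;p&gt;The temperature parsing and umbrella decision in the first script might look roughly like the sketch below. It assumes the forecast dict follows the shape of wttr.in's JSON output, with a &lt;code&gt;weather&lt;/code&gt; list whose entries carry an &lt;code&gt;hourly&lt;/code&gt; list of string-valued &lt;code&gt;tempC&lt;/code&gt; and &lt;code&gt;chanceofrain&lt;/code&gt; fields. The function name &lt;code&gt;summarize_day&lt;/code&gt;, the 50% threshold, and the sample data are hypothetical.&lt;/p&gt;

```python
def summarize_day(forecast, rain_threshold=50):
    """Summarize today's hourly data from a wttr.in-style JSON dict."""
    hours = forecast["weather"][0]["hourly"]
    temps = [int(h["tempC"]) for h in hours]
    rain = [int(h["chanceofrain"]) for h in hours]
    return {
        "min_c": min(temps),
        "max_c": max(temps),
        # Recommend an umbrella if any hour reaches the threshold.
        "umbrella": max(rain) >= rain_threshold,
    }


# Tiny hand-written sample in the assumed shape (values are strings).
sample = {
    "weather": [
        {
            "hourly": [
                {"tempC": "11", "chanceofrain": "10"},
                {"tempC": "17", "chanceofrain": "65"},
                {"tempC": "14", "chanceofrain": "30"},
            ]
        }
    ]
}

print(summarize_day(sample))
```

&lt;p&gt;The resulting summary is the kind of structured input that can then be turned into the spoken Cantonese briefing text.&lt;/p&gt;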

&lt;h3&gt;
  
  
  The Magic of Hermes Agent: Memory and Location Privacy
&lt;/h3&gt;

&lt;p&gt;What makes this system genuinely &lt;em&gt;autonomous&lt;/em&gt; rather than a simple cron-bash script is &lt;strong&gt;Hermes's self-improving memory architecture&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Persistent Memory (User Profile)&lt;/strong&gt;: Hermes maintains an ongoing log of user preferences across sessions. It remembers, for example, that I enjoy philosophy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-Aware Briefings&lt;/strong&gt;: When generating the script text, Hermes synthesizes these facts from its memory. The morning weather update isn't just a reading of numbers; it dynamically adds philosophical thoughts suited to the day's weather.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timezone Synchronization&lt;/strong&gt;: Because scheduled cron tasks run on the server's UTC clock, Hermes automatically calculates the local offset (e.g. BST vs UTC) so that the morning briefing is delivered at the user's local wake-up time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous Skill Management&lt;/strong&gt;: When there are path updates or script logic tweaks, Hermes adjusts its internal reference memory, avoiding stale or cached references during execution.&lt;/li&gt;
&lt;/ol&gt;
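&lt;p&gt;The timezone adjustment in the list above comes down to converting a local wake-up time into the UTC time the server's scheduler understands. The sketch below shows that arithmetic with fixed offsets; it is not Hermes's internal logic, and the function name &lt;code&gt;wakeup_in_utc&lt;/code&gt; and the 07:00 wake-up time are hypothetical.&lt;/p&gt;

```python
from datetime import date, datetime, time, timedelta, timezone


def wakeup_in_utc(day, local_wakeup, local_tz):
    """Return the UTC datetime matching a local wake-up time on a given day."""
    local_dt = datetime.combine(day, local_wakeup, tzinfo=local_tz)
    return local_dt.astimezone(timezone.utc)


# Fixed offsets for illustration: BST is UTC+1, GMT is UTC+0.
BST = timezone(timedelta(hours=1), "BST")
GMT = timezone(timedelta(0), "GMT")

summer = wakeup_in_utc(date(2026, 6, 1), time(7, 0), BST)
winter = wakeup_in_utc(date(2026, 1, 15), time(7, 0), GMT)
print(summer.isoformat())
print(winter.isoformat())
```

&lt;p&gt;Passing &lt;code&gt;zoneinfo.ZoneInfo("Europe/London")&lt;/code&gt; as &lt;code&gt;local_tz&lt;/code&gt; instead of a fixed offset lets the standard library pick the correct offset for each date, provided time zone data is available on the system.&lt;/p&gt;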




&lt;h2&gt;
  
  
  Why Open-Source Agents Win
&lt;/h2&gt;

&lt;p&gt;Running Hermes Agent locally on an ARM64 server proved something crucial: &lt;strong&gt;We do not need to rely on proprietary or closed-source ecosystems to build delightful, highly personalized AI experiences.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With a one-line patch to &lt;code&gt;ggml.h&lt;/code&gt;, an optimized C++ inference binary, and Hermes's robust multi-session persistent memory, I have a private, voice-cloning companion that knows my diet, my daily schedule, and my philosophical quirks, and that costs virtually nothing when idle.&lt;/p&gt;

&lt;p&gt;If you are building with Hermes, don't just stay in the terminal. Give your agent a voice, patch those C++ boundaries, and build something that feels alive!&lt;/p&gt;

</description>
      <category>hermesagentchallenge</category>
      <category>devchallenge</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
