DEV Community: WanjohiChristopher

Voxtral TTS: Is Open-Source Voice AI About to Disrupt ElevenLabs?

WanjohiChristopher — Fri, 29 May 2026 15:29:08 +0000

The voice AI landscape has been dominated by a handful of closed providers for years. If you wanted state-of-the-art text-to-speech (TTS), realistic voice cloning, emotional speech generation, and low-latency streaming, you typically had one option: pay for an API.

That may be changing.

In March 2026, Mistral AI released Voxtral TTS, a 4-billion-parameter open-weights text-to-speech model that challenges the long-standing assumption that frontier voice AI must remain proprietary. In Mistral's human evaluations, native speakers preferred Voxtral over ElevenLabs for multilingual voice cloning in 68.4% of side-by-side comparisons, judged on naturalness and expressivity. (Note the scope: the headline number covers multilingual voice cloning specifically; it is not a blanket "better at everything" claim.)

For AI engineers, voice-agent builders, and researchers, this is one of the most important open-weight AI releases of the year.

Why Voice AI Has Been Different

Unlike large language models, speech synthesis has remained largely controlled by commercial providers. While the AI community gained access to powerful open-weight language models such as Llama, Qwen, DeepSeek, and Mistral, high-quality TTS remained mostly locked behind APIs.

There were several reasons:

Speech datasets are expensive to collect.
Natural prosody is difficult to model.
Real-time inference requires significant optimization.
Voice cloning introduces safety and abuse concerns.

As a result, companies such as ElevenLabs built strong moats around their speech technology. Voxtral represents one of the first serious attempts to challenge that moat using an open-weight approach.

What Is Voxtral TTS?

Voxtral TTS is an open-weights text-to-speech model released by Mistral AI. Key capabilities include:

4 billion parameters
Streaming generation
Approximately 70 ms time-to-first-audio under optimized H200 inference conditions
Voice cloning from a 3-second reference clip
Reference-based emotion transfer
Natural pauses and conversational speech patterns
Cross-lingual voice transfer
Support for 9 languages: Arabic, Dutch, English, French, German, Hindi, Italian, Portuguese, and Spanish

One of the most impressive capabilities is cross-lingual voice transfer. For example, a French speaker's voice can be used to generate natural English speech without retraining the model. This has significant implications for multilingual assistants, customer support systems, and global AI products.

Why the 70 ms Latency Matters

Many people focus on voice quality. Engineers focus on latency. A voice assistant may sound amazing, but if it takes 500 milliseconds to begin speaking, users perceive it as slow.

Human conversations operate on extremely short turn-taking cycles. Research consistently shows that delays above a few hundred milliseconds make conversations feel unnatural. Mistral reports a time-to-first-audio of approximately 70 milliseconds.

For comparison:

Human conversational response gaps are often around 200 milliseconds.
Many cloud TTS APIs require significantly longer startup times.
Real-time AI agents depend heavily on reducing latency at every stage.

This is particularly relevant for systems such as customer service agents, AI receptionists, real-time translators, interactive tutoring systems, and autonomous voice assistants. Low-latency speech generation is becoming as important as model intelligence itself.

How Voxtral Works

Voxtral uses a hybrid architecture that splits speech generation into two stages, tied together by a custom neural codec. A common misconception is that it replaces autoregressive generation with flow matching. It does not. It uses both, for different parts of the problem.

flowchart LR
    ref["Voice reference<br/>(3-30s)"] --> enc["Voxtral Codec<br/>encoder"]
    enc -->|"ref audio tokens (12.5 Hz)"| bb["Autoregressive Decoder<br/>Backbone (Ministral-3B)"]
    text["Text prompt tokens"] --> bb
    bb --> lin["Linear Head<br/>semantic token"]
    bb --> flow["Flow-Matching Transformer<br/>(acoustic head)<br/>acoustic tokens"]
    lin -.->|"conditions per timestep"| flow
    lin -->|semantic| dec["Voxtral Codec<br/>decoder (VQ-FSQ)"]
    flow -->|acoustic| dec
    dec --> out["24 kHz waveform"]

1. Autoregressive Semantic Backbone

The model is built on Mistral's Ministral-3B architecture. A voice reference (3 to 30 seconds) is first encoded by the Voxtral Codec into audio tokens at a 12.5 Hz frame rate, where each frame carries a semantic token plus acoustic tokens. Those reference tokens, together with the text prompt tokens, are fed to the autoregressive decoder backbone, which generates a sequence of semantic tokens one step at a time until it emits a special end-of-audio token.

2. Flow-Matching Acoustic Head

This is where Voxtral diverges from a pure autoregressive design, but it layers flow matching on top of autoregression rather than abandoning it. At each timestep, the semantic token produced by the backbone conditions a separate acoustic head, a flow-matching transformer, which predicts the acoustic tokens. So the system is autoregressive for the semantic stream and flow-matching for the acoustic stream. Flow matching fills in high-fidelity acoustic detail while keeping inference fast.
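To make the sampling side of flow matching concrete, here is a toy sketch: start from a noise vector and integrate a velocity field from t=0 to t=1 with a few Euler steps. In Voxtral the velocity would come from the flow-matching transformer, conditioned on the semantic token. The closed-form field below, which points straight at a known target, is purely an assumption so the example can run without a trained model.

```python
def euler_sample(velocity, x0, steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for a learned network: a field that moves any
    point toward `target` along a straight line, arriving at t=1."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


noise = [2.0, 3.0]        # pretend this was drawn from a Gaussian
target = [0.5, -1.0]      # pretend this is the acoustic latent to produce
sample = euler_sample(make_toy_velocity(target), noise, steps=8)
```

The point of the sketch is the shape of inference: a small, fixed number of integration steps per frame, rather than one autoregressive step per acoustic token.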

3. Voxtral Codec (Hybrid VQ-FSQ)

Both token streams are encoded and decoded by the Voxtral Codec, a speech tokenizer Mistral trained from scratch. It uses a split quantization scheme: vector quantization (VQ) for the semantic tokens and finite scalar quantization (FSQ) for the acoustic tokens. The semantic path also receives a distillation loss from a supervised ASR model, which keeps those tokens linguistically meaningful. At the end, the semantic and acoustic tokens are decoded together into the final 24 kHz waveform.

flowchart LR
    inp["24 kHz audio"] --> encoder["Encoder<br/>Conv + Transformer<br/>(to 12.5 Hz)"]
    encoder --> vq["VQ<br/>semantic tokens"]
    encoder --> fsq["FSQ<br/>acoustic tokens"]
    vq --> decoder["Decoder<br/>Transformer + Conv"]
    fsq --> decoder
    decoder --> outp["Reconstructed<br/>24 kHz audio"]
    asr["Supervised ASR model"] -.->|"distillation loss"| vq
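The difference between the two quantizers is easier to see in code. The sketch below is a minimal illustration, not the Voxtral Codec: VQ picks the nearest entry from a learned codebook, while FSQ bounds each dimension and rounds it independently to a small fixed set of levels. The codebook, the number of levels (assumed odd here), and the vector sizes are all hypothetical; the article does not state them.

```python
import math


def vq_quantize(vector, codebook):
    """Vector quantization: return the index of the nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def fsq_quantize(vector, levels=5):
    """Finite scalar quantization: squash each dimension into (-1, 1) with
    tanh, then round it to one of `levels` evenly spaced values."""
    half = (levels - 1) / 2
    return [round(math.tanh(x) * half) / half for x in vector]


codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]   # toy 3-entry codebook
semantic_index = vq_quantize([0.9, 0.8], codebook)  # nearest entry is index 1
acoustic_codes = fsq_quantize([0.3, -2.0, 0.0])     # [0.5, -1.0, 0.0]
```

VQ needs a learned codebook and a nearest-neighbour search; FSQ has no codebook to learn, because the grid of levels is fixed in advance.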

4. 12.5 Hz Frame Rate

Operating at a low 12.5 Hz frame rate keeps the number of tokens the model has to generate small. That is a major reason Voxtral can reach roughly 70 ms time-to-first-audio while still producing natural-sounding speech.
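The arithmetic behind that claim is short enough to check by hand. The sketch below assumes one backbone step per codec frame, which is how the description above reads. The 50 Hz comparison rate is a hypothetical figure for contrast, not a number from Mistral.

```python
import math

FRAME_RATE_HZ = 12.5  # codec frame rate quoted in the article


def frame_duration_ms(frame_rate_hz=FRAME_RATE_HZ):
    """Milliseconds of audio covered by one codec frame."""
    return 1000.0 / frame_rate_hz


def frames_for(seconds, frame_rate_hz=FRAME_RATE_HZ):
    """Number of frames (backbone steps) needed for `seconds` of speech."""
    return math.ceil(seconds * frame_rate_hz)


print(frame_duration_ms())      # 80.0 ms of audio per frame
print(frames_for(10))           # 125 steps for 10 s of speech
print(frames_for(10, 50.0))     # 500 steps at a hypothetical 50 Hz codec
```

At 12.5 Hz each generated frame already covers 80 ms of audio, so the first audible chunk is available after very few decoding steps.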

Voice Cloning in Three Seconds

Perhaps the most attention-grabbing feature is voice cloning from only three seconds of reference audio. Historically, voice cloning systems required minutes of training audio, speaker adaptation procedures, and fine-tuning.

Modern foundation models are increasingly able to infer speaker characteristics from extremely short samples. Voxtral extracts speaker identity information from a brief reference clip and conditions generation on those characteristics. The result is speech that preserves vocal tone, speaking style, rhythm, and intonation. This dramatically lowers the barrier for personalized voice applications.

Implications for AI Agents

The biggest impact may not be content creation. It may be AI agents. Most modern voice-agent stacks contain several components: speech-to-text (ASR), a language model, and text-to-speech (TTS). Historically, the TTS component has often been the most closed and expensive layer.

flowchart LR
    cin["Caller audio"] --> asr["ASR<br/>(speech to text)"]
    asr --> llm["LLM<br/>(response)"]
    llm --> tts["Voxtral TTS<br/>(text to speech)"]
    tts --> cout["Audio reply"]

Voxtral enables developers to self-host that layer. This creates opportunities for:

Lower infrastructure costs
Reduced vendor lock-in
Better privacy controls
Fully local voice agents
Edge deployment scenarios

For teams building conversational AI, this is potentially transformative.

For voice AI engineers, Voxtral is arguably more interesting as an architectural contribution than as a benchmark result. The hybrid autoregressive plus flow-matching design demonstrates a path toward combining low latency, strong speaker similarity, and expressive speech generation in a single model. Expect future open-weight voice models to adopt similar hybrid architectures.

One important caveat. Voxtral's weights are released under CC BY-NC 4.0, a non-commercial license inherited from the voice datasets it was trained on (EARS, CML-TTS, IndicVoices-R, and others). You can self-host it today for research, prototyping, internal tools, and personal projects, but shipping it inside a commercial product would require a separate commercial license from Mistral. So the "self-host to cut costs" story is real for experimentation, but it is not yet a drop-in replacement for a paid API in production.

Does This Kill ElevenLabs?

No. At least not yet. ElevenLabs still maintains several advantages.

Production Infrastructure. Running a research model and operating a globally scalable voice platform are very different challenges. ElevenLabs has invested heavily in reliability, scaling, monitoring, and developer tooling.

Proprietary Datasets. Data remains one of the strongest competitive advantages in AI. Even if architectures become public, proprietary speech datasets can continue to provide significant performance benefits.

Enterprise Features. Organizations often care about compliance, security, support, SLAs, and governance. These are areas where commercial providers continue to have advantages.

Licensing. For now, Voxtral's non-commercial license is itself part of ElevenLabs' moat. A startup can prototype on Voxtral for free, but the moment it wants to charge customers it has to either negotiate a commercial license with Mistral or pay for a production-ready API. That keeps the commercial door at least partly closed, regardless of how good the model sounds.

One additional consideration is that all benchmark results reported in this article originate from Mistral's own evaluations. While the results are impressive, independent third-party benchmarking will be important to validate performance across broader workloads, deployment environments, and real-world voice-agent applications. As with any frontier AI model, external validation often reveals strengths and weaknesses not captured in vendor evaluations.

What Happens Next?

The most likely outcome is not that ElevenLabs disappears. Instead, we may see the same pattern that occurred in large language models. Open-weight systems become increasingly capable, while commercial providers continue competing through infrastructure, convenience, reliability, and specialized features.

This shifts the market from "Can open source compete?" to "Why pay for closed systems if open models are good enough?" That is exactly what happened with LLMs. Voice AI may be following the same trajectory.

Why This Matters for Researchers

For researchers working on speech processing, voice agents, target speaker extraction, conversational AI, and human-computer interaction, Voxtral provides something extremely valuable: access. Researchers can now inspect, evaluate, modify, and build upon a frontier-level speech model rather than treating it as a black-box API.

Historically, progress in AI has accelerated when researchers gained direct access to the underlying models, as happened with open-weight LLMs. Voxtral could become a similar catalyst for speech AI.

Final Thoughts

Voxtral TTS is more than another model release. It signals a broader shift in the voice AI ecosystem. For years, speech synthesis remained one of the strongest proprietary strongholds in artificial intelligence. Mistral's release demonstrates that frontier-quality voice generation can increasingly be delivered through open-weight models.

Whether Voxtral ultimately dethrones ElevenLabs is almost beside the point. The real story is that developers, startups, researchers, and open-source communities now have access to a serious alternative. And history suggests that when powerful AI technology becomes openly available, innovation accelerates rapidly.

The next generation of voice agents may not be built on closed APIs. They may be built on open foundations.

Originally published at wanjohichristopher.com.

References: Voxtral TTS paper (arXiv:2603.25551) · Model weights on Hugging Face (CC BY-NC 4.0)

I built a phone number you can call and argue with an AI. Here's the part nobody tells you.

WanjohiChristopher — Thu, 28 May 2026 02:34:53 +0000

Audience: engineers and the people who hire them. ~10 min read.

I wanted one thing: dial a regular phone number and have an AI support agent pick
up and actually help. Pull from a knowledge base, book an appointment, sound like
a person. The text-chat version of this is a solved problem now. The phone
version is where the interesting engineering hides, because a phone call is a
real-time, full-duplex audio stream, the model in the middle is slow, and the
transcription is noisy enough that you can't treat it as authoritative.

This is the story of building voice for TeaVoice, an AI customer-support
platform. I'll show you the path I took, the wall I hit, and the four problems
that don't exist in chat and absolutely do exist on a phone.

First, a quick glossary so the rest reads clean:

PSTN: the regular phone network. Actual calls, not app-to-app.
DID: the phone number people dial.
Telnyx: my telephony provider. It bridges the phone call to my server.
Webhook: Telnyx HTTP-POSTs my server when something happens on the call.
Media stream: Telnyx sends me the raw audio, live, instead of a transcript.
STT / TTS: speech→text and text→speech.
VAD: voice activity detection. Figuring out when the caller stopped talking.

Attempt 1: let the phone company do the hard part

The obvious first move is to let Telnyx handle speech. They have an API for it.
The loop looks clean on a whiteboard:

flowchart TD
    A[Answer the call] --> B[Speak the greeting]
    B --> C[Wait for 'speak finished']
    C --> D[Start transcription]
    D --> E[Caller talks]
    E --> F[Transcription webhook arrives]
    F --> G[Stop transcription]
    G --> H[Run it through the AI]
    H --> I[Speak the reply]
    I --> C

Every box from "start transcription" to "transcription webhook" is Telnyx's to
own. That's the part that bit me.

I built that. Then I spent a genuinely humbling number of hours discovering that
the provider's transcription has trapdoors:

One transcription engine returns 200 OK and then sends zero transcription events. Forever. No error. It just silently does nothing.
The moment I added a config option to pick a better transcription model, the whole thing went quiet again. Same 200 OK, still no events.
Their built-in "AI assistant" feature can only be started once per call, so you can't use it to drive a turn-by-turn conversation with your own logic.
And the speech recognition keeps transcribing the agent's own voice as if the caller said it, so you have to choreograph exactly when you start and stop listening around when you're talking.

It wasn't that the provider was bad. It was that the more of the audio pipeline I
handed off, the less I could control the two things that actually matter: latency
and correctness. I was tuning a black box.

So I stopped asking the phone company to listen for me.

Attempt 2: take over the audio

The better path: have Telnyx fork the raw audio of the call to my server over a
WebSocket, and run my own everything. Now the flow is:

flowchart TB
    Caller([📞 Caller on PSTN]) <-->|phone audio| Telnyx[Telnyx Call Control]

    subgraph CP["Control plane: HTTP webhooks"]
        direction LR
        W1["call.initiated<br/>route number, create<br/>record, answer"]
        W2["call.answered<br/>start media stream"]
        W3["call.hangup<br/>clean up, finalize"]
    end

    subgraph DP["Data plane: media WebSocket"]
        direction TB
        VAD["VAD: detect end of turn<br/>(loudness + ~1.5s silence)"]
        STT["Speech-to-Text (Whisper)"]
        AI["AI pipeline<br/>guardrails → search → LLM<br/>(same brain as web chat)"]
        TTS["Text-to-Speech"]
        VAD --> STT --> AI --> TTS
    end

    Telnyx -->|HTTP events| CP
    Telnyx <-->|raw L16 audio| VAD
    TTS -.->|synthesized audio back| Telnyx

There are two clean halves here. The control plane is still webhooks, but
tiny now. Just three events:

call starts → look up which business and which agent this number belongs to, create a call record, answer.
call answered → tell Telnyx "stream the audio to this WebSocket."
call hangs up → clean up timers, finalize the record.
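A control plane this small fits in one dispatch table. The sketch below is a hedged illustration, not TeaVoice's code. The three event names come from the diagram above, while the payload shape, the in-memory `calls` store, and the returned action strings are hypothetical placeholders for whatever the real provider calls would be.

```python
calls = {}  # call_id -> call record (in-memory stand-in for a database)


def on_call_initiated(call_id, payload):
    # Look up the dialed number, create a record, then answer.
    calls[call_id] = {"to": payload.get("to"), "state": "answering"}
    return "answer"


def on_call_answered(call_id, payload):
    record = calls.get(call_id)
    if record is None:
        return "ignored"
    record["state"] = "streaming"
    return "start_media_stream"


def on_call_hangup(call_id, payload):
    # Clean up and finalize; tolerate duplicate hangup events.
    record = calls.pop(call_id, None)
    return "finalized" if record else "ignored"


HANDLERS = {
    "call.initiated": on_call_initiated,
    "call.answered": on_call_answered,
    "call.hangup": on_call_hangup,
}


def handle_webhook(event_type, call_id, payload):
    """Route one webhook to its handler; unknown events are ignored."""
    handler = HANDLERS.get(event_type)
    if handler is None:
        return "ignored"
    return handler(call_id, payload)
```

Keeping every handler tolerant of unknown or repeated call IDs matters because webhooks can arrive late, twice, or out of order.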

The data plane is the audio WebSocket, and that's where everything
interesting lives.

The big win: the same AI brain that powers web chat now powers the phone. The
transcript runs through the identical pipeline of content guardrails,
knowledge-base search, the LLM, and output checks. I just wrap it with
voice-specific instructions. One brain, two mouths.

That's the architecture. Now the four problems that only exist on a phone.

Problem 1: "Are they done talking?"

In chat, the user presses Enter. That's the turn boundary, handed to you for
free. On a phone there's no Enter. You get a relentless stream of audio chunks
and you have to decide when the caller has finished a thought.

I do the cheap, boring thing that works: measure how loud each chunk is (RMS
amplitude), and call it "end of turn" after about 1.5 seconds of silence
following speech. Buffers shorter than ~100ms get thrown away as noise. No ML, no
fancy endpointing model. Just a loudness threshold and a silence counter.
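For the curious, a loudness threshold plus a silence counter looks roughly like this. It's a minimal sketch under assumptions, not the production code: it takes already-decoded 16-bit PCM samples as a list of ints, and the RMS threshold of 500 is a made-up value you'd tune per line. The 1.5 s silence window and the ~100 ms noise floor are the numbers from above.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of one chunk of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class TurnDetector:
    def __init__(self, sample_rate=16000, rms_threshold=500.0,
                 silence_s=1.5, min_speech_s=0.1):
        self.sample_rate = sample_rate
        self.rms_threshold = rms_threshold  # hypothetical; tune on real calls
        self.silence_s = silence_s
        self.min_speech_s = min_speech_s
        self.reset()

    def reset(self):
        self.buffer = []
        self.speech_samples = 0
        self.silence_samples = 0

    def feed(self, samples):
        """Feed one chunk. Returns the buffered turn once the caller seems
        done, otherwise None."""
        if rms(samples) >= self.rms_threshold:
            self.buffer.extend(samples)
            self.speech_samples += len(samples)
            self.silence_samples = 0
            return None
        if self.speech_samples == 0:
            return None  # silence before any speech: keep waiting
        self.buffer.extend(samples)
        self.silence_samples += len(samples)
        if self.silence_samples < self.silence_s * self.sample_rate:
            return None
        turn, spoken = self.buffer, self.speech_samples
        self.reset()
        if spoken < self.min_speech_s * self.sample_rate:
            return None  # too short to be speech: discard as noise
        return turn
```

Each chunk from the WebSocket goes through `feed`, and a non-`None` return value is the audio to hand to speech-to-text.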

It's not glamorous, and it occasionally clips someone who pauses mid-sentence to
think. But it's predictable, it adds zero latency, and "predictable" beats
"clever" when you're debugging a live phone call.

Problem 2: the AI keeps interviewing itself

Here's a bug that doesn't exist anywhere else. Because the audio stream is
bidirectional (my TTS audio goes back out the same pipe the caller's audio comes
in), the agent hears its own voice, transcribes it, and treats it as the caller
talking. The AI ends up in a conversation with itself. It's funny for about ten
seconds.

Two guards fix it:

While I'm playing audio to the caller, I drop every incoming chunk on the floor. The agent is deaf while it's speaking.
For a full second after I finish speaking, I keep ignoring incoming audio, because there's a tail of echo and network delay where my own voice is still arriving.

Crude? Yes, and it's a real tradeoff: going deaf while I talk means the caller
can't interrupt me, which is closer to a walkie-talkie than a natural
conversation. But echo cancellation is a rabbit hole, and "go deaf while you
talk, plus a one-second cooldown" eliminated the self-conversation completely. It
was a debugging-first choice, not ideal conversational UX.
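The two guards reduce to one boolean and one timestamp. Here's a minimal sketch, assuming the caller of the class passes in a monotonic clock reading (for example `time.monotonic()`). The class and method names are hypothetical.

```python
class EchoGuard:
    """Drop inbound audio while speaking, and for a cooldown afterwards."""

    def __init__(self, cooldown_s=1.0):
        self.cooldown_s = cooldown_s
        self.speaking = False
        self.deaf_until = 0.0

    def start_speaking(self):
        self.speaking = True

    def stop_speaking(self, now):
        # Playback finished: stay deaf a little longer for the echo tail.
        self.speaking = False
        self.deaf_until = now + self.cooldown_s

    def should_drop(self, now):
        """True if an inbound chunk arriving at time `now` must be ignored."""
        return self.speaking or now < self.deaf_until


guard = EchoGuard(cooldown_s=1.0)
guard.start_speaking()
guard.stop_speaking(now=10.0)
print(guard.should_drop(now=10.5))  # True: still inside the cooldown
print(guard.should_drop(now=11.0))  # False: listening again
```

Taking `now` as a parameter instead of reading the clock inside the class keeps the guard trivially testable.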

Problem 3: the transcription is just... wrong a lot

Phone audio is 8–16kHz of compressed, noisy garbage compared to a podcast mic.
Whisper does its best, but you get transcripts like "I wanna book a point mint
for toose day." If you treat that as gospel and the AI replies "I'm sorry, I
didn't understand" every third turn, the call is unusable.

The fix wasn't a better STT model. It was telling the AI to expect garbage and
guess anyway. Before each turn I inject instructions that say, in effect:

This text came from speech recognition and may be wrong. Figure out what the
caller probably meant and help them. Don't say "could you repeat that" over
and over. If it's truly unintelligible, ask one specific clarifying
question. Keep your answer to 1–2 sentences, because it's going to be read out
loud.

"Book a point mint for toose day" becomes "Sure, I can book an appointment for
Tuesday. What time works?" The model is a fantastic error-correcting decoder if
you give it permission to be one. That instruction prefix did more for call
quality than anything I changed in the audio layer.

Two details that mattered. I pass those instructions as a separate system
message, not glued onto the transcript, because otherwise the model occasionally
repeats them back as if the caller said them. And I cap replies at 1–2 sentences,
because nobody wants an AI reading a five-paragraph essay at them over the phone.
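In code, "separate system message" just means the instructions and the transcript never share a string. The sketch below assumes the common role/content chat-message format; the function name and the exact instruction wording are illustrative, adapted from the prompt quoted above.

```python
VOICE_INSTRUCTIONS = (
    "The user's text came from speech recognition and may be wrong. "
    "Figure out what the caller probably meant and help them. "
    "Don't repeatedly ask them to repeat themselves. If it's truly "
    "unintelligible, ask one specific clarifying question. "
    "Keep your answer to 1-2 sentences, because it will be read out loud."
)


def build_messages(transcript, history=()):
    """Build the message list for one voice turn. The voice instructions
    travel as their own system message, never inside the user text."""
    return [
        {"role": "system", "content": VOICE_INSTRUCTIONS},
        *history,
        {"role": "user", "content": transcript},
    ]


messages = build_messages("I wanna book a point mint for toose day")
```

The user message stays byte-for-byte what the transcriber produced, which also makes call logs easier to read later.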

Problem 4: one thing at a time

Audio chunks arrive continuously and I process turns as async tasks, so it's
entirely possible for two turns to start overlapping: two TTS clips playing at
once, two "am I speaking?" flags fighting each other. I wrap the
AI-plus-speak-plus-playback part of each turn in a lock so exactly one turn runs
at a time. Simple, and it kills a whole category of race conditions.
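A minimal sketch of that lock with `asyncio.Lock`, where a short sleep stands in for the AI call, synthesis, and playback. The function names and the log format are hypothetical.

```python
import asyncio


async def run_turn(lock, name, log, work_s=0.01):
    """Run one conversational turn; the lock ensures turns never overlap."""
    async with lock:
        log.append(f"{name}:start")
        await asyncio.sleep(work_s)  # stand-in for AI + TTS + playback
        log.append(f"{name}:end")


async def demo():
    lock = asyncio.Lock()
    log = []
    # Both turns are started at once, as they would be by overlapping audio.
    await asyncio.gather(
        run_turn(lock, "turn1", log),
        run_turn(lock, "turn2", log),
    )
    return log


print(asyncio.run(demo()))
# ['turn1:start', 'turn1:end', 'turn2:start', 'turn2:end']
```

Without the lock, the two `start` entries would land back to back, which on a call means two clips talking over each other.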

The whole turn, as one state machine

Those four problems aren't separate features. They're a single loop. Here's the
life of one conversational turn, including the deaf-while-speaking and cooldown
states that keep the bot from hearing itself:

stateDiagram-v2
    [*] --> Listening
    Listening --> Capturing: caller speaks, loud enough
    Capturing --> Listening: too short, discard as noise
    Capturing --> Processing: ~1.5s of silence
    Processing --> Speaking: STT then AI then TTS
    Processing --> [*]: caller said goodbye
    Speaking --> Cooldown: playback finished
    Cooldown --> Listening: 1s echo guard elapsed

    note right of Speaking
        Deaf while speaking:
        every inbound chunk dropped
    end note
    note right of Cooldown
        Still deaf for 1s:
        tail echo is still arriving
    end note
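The same diagram can be written as a transition table, which is a handy way to unit-test the loop without a phone. This is a sketch, not the shipped code: the state names follow the diagram, while the event names and the `Ended` terminal state are labels I made up for the arrows.

```python
TRANSITIONS = {
    ("Listening", "speech_detected"): "Capturing",
    ("Capturing", "too_short"): "Listening",
    ("Capturing", "silence_elapsed"): "Processing",
    ("Processing", "reply_ready"): "Speaking",
    ("Processing", "goodbye"): "Ended",
    ("Speaking", "playback_finished"): "Cooldown",
    ("Cooldown", "echo_guard_elapsed"): "Listening",
}

# States in which every inbound audio chunk is dropped.
DEAF_STATES = {"Speaking", "Cooldown"}


def step(state, event):
    """Return the next state; events that don't apply leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


def is_deaf(state):
    return state in DEAF_STATES


state = "Listening"
for event in ["speech_detected", "silence_elapsed", "reply_ready",
              "playback_finished", "echo_guard_elapsed"]:
    state = step(state, event)
print(state)  # Listening: one full turn brings the loop back to the start
```

Ignoring inapplicable events is what makes the deaf states work: a `speech_detected` event during `Speaking` simply changes nothing.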

How the pros do this

Before you conclude I invented something weird in a basement: I didn't. The
cascade I just described (telephony → audio stream → speech-to-text → LLM →
text-to-speech → back) is the standard voice-agent architecture. It's what
Pipecat AI and LiveKit Agents are frameworks for, and what platforms like
Vapi, Retell, Deepgram's Voice Agent API, and ElevenLabs' Conversational AI all
run under the hood. Pipecat in particular follows the same shape as
what's in this post: transport → VAD → STT → LLM → TTS → transport, frame by
frame. I hand-rolled a mini-Pipecat, emphasis on mini. The frameworks do
the hard parts properly (interruption handling, streaming orchestration,
partial-transcript routing) where I cut corners. If I were
starting today and didn't want to learn these lessons the hard way, I'd reach for
one of those frameworks first.

Where the serious systems pull ahead is that they replace each of my deliberately
crude mechanisms with a purpose-built model. My "wait for 1.5 seconds of silence"
turn detection becomes a semantic turn-taking model (Deepgram's endpointing,
ElevenLabs' dedicated turn-taking model, LiveKit's turn detector) that knows the
difference between "I'm done" and "I'm thinking." My "go deaf while I'm speaking"
echo guard becomes real acoustic echo cancellation plus true barge-in that cuts
the bot off mid-sentence the instant you interrupt. And my batch "synthesize the
whole reply, then play it" becomes streaming TTS, fed token-by-token straight from
the LLM so the caller hears the first words while the rest is still generating:

Piece	My crude version	The production version
Turn detection	Loudness + 1.5s silence	Semantic turn-taking / endpointing model
Echo / interruption	Go deaf while speaking	Acoustic echo cancellation + real barge-in
Text-to-speech	Batch, then play	Streaming, fed from LLM tokens
Speech-to-text	Buffer a turn, transcribe once	Continuous streaming with partial results
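To give a feel for the streaming TTS row, here is one simple way to feed a synthesizer from an LLM token stream: cut the stream into sentences and hand each one over as soon as it completes. Note the swap. Production stacks stream at finer granularity than this; sentence chunking is the simplest approximation I know of, and the punctuation rule below is deliberately naive (it will split on "3.5", for instance).

```python
def sentences_from_tokens(tokens):
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            # Position of the first sentence-ending character, or -1.
            cut = next((i for i, ch in enumerate(buffer) if ch in ".!?"), -1)
            if cut == -1:
                break
            sentence, buffer = buffer[:cut + 1].strip(), buffer[cut + 1:]
            if sentence:
                yield sentence
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


stream = ["Sure", ", I can", " book that", ".", " What", " time works", "?"]
for sentence in sentences_from_tokens(stream):
    print(sentence)  # each line could go to TTS while the LLM keeps going
```

The caller starts hearing "Sure, I can book that." while the model is still producing the question that follows.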

There's also a second paradigm worth knowing about, because it changes the whole
picture. Everything above is a cascade: three separate models in a row,
flexible and debuggable but paying a latency tax at every hop. The frontier
(Google's Gemini Live, OpenAI's Realtime API) is moving to speech-to-speech.
From the developer's perspective it's one model that takes audio in and emits
audio out, with no separate transcription or synthesis step to wire up. It's
lower latency and far better at tone, laughter, and interruptions. But it's a
black box you can't inspect or swap pieces of, which is the exact problem that
made me abandon "let the phone company do it" in the first
place. Google is the tell here. Their contact-center product is a cascade like
mine, while their frontier product is speech-to-speech: same company, two
answers, because the right one depends on whether you value control or latency
more.

flowchart LR
    subgraph Cascade["Cascade: what I built (and Pipecat, Vapi, Deepgram...)"]
        direction LR
        a1([audio in]) --> a2[STT] --> a3[LLM] --> a4[TTS] --> a5([audio out])
    end
    subgraph S2S["Speech-to-speech: the frontier (Gemini Live, OpenAI Realtime)"]
        direction LR
        b1([audio in]) --> b2[one model] --> b3([audio out])
    end
    Cascade ~~~ S2S

Three boxes, three latency hops, three things you can swap and debug. Versus one
box that's faster and more natural but that you can't open up.

So here's the honest placement of this project. The core cascade architecture is
industry-standard, the mechanisms are the simple-but-debuggable versions of what
the specialists productize, and the next rung up the ladder is either swapping in
better models for each stage or collapsing the whole cascade into a realtime
speech-to-speech model.

What it costs, and what I'd do next

Every turn logs its budget: STT time / AI time / TTS time. That single log line
is the most useful thing I added, because on a phone call latency is the
product. A 4-second silence after someone asks a question feels broken even if
the answer is perfect. In my setup, the LLM call dominated that budget, which
points at the obvious next move: start synthesizing speech from the model's
tokens as they arrive instead of waiting for the full reply, and stream that
audio out so the caller hears the first words while the rest is still
generating.
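A per-turn budget log needs very little machinery. The sketch below is one way to do it with a context manager. The class name, stage names, and log format are hypothetical, and the commented-out calls stand in for whatever your STT, LLM, and TTS functions are.

```python
import time
from contextlib import contextmanager


class TurnBudget:
    """Collects wall-clock time per stage for a single conversational turn."""

    def __init__(self):
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.stages[name] = time.perf_counter() - start

    def slowest(self):
        return max(self.stages, key=self.stages.get)

    def log_line(self):
        parts = " / ".join(f"{n} {s * 1000:.0f}ms" for n, s in self.stages.items())
        total = sum(self.stages.values())
        return f"turn budget: {parts} (total {total * 1000:.0f}ms)"


budget = TurnBudget()
with budget.stage("stt"):
    pass  # transcript = transcribe(audio)
with budget.stage("ai"):
    pass  # reply = run_pipeline(transcript)
with budget.stage("tts"):
    pass  # clip = synthesize(reply)
print(budget.log_line())
```

One line per turn is enough to see at a glance which stage to attack first.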

I'd also replace the loudness-based turn detection with a real endpointing model,
and graduate the echo guard from "go deaf for a second" to actual acoustic echo
cancellation. None of that was needed to ship something that works, though, and
that's the point. The crude versions held up, and they were debuggable at 2am
with a phone in one hand.

The takeaway

The interesting part of voice AI turned out not to be the AI. It's the seam
between a real-time audio stream and a slow, fallible language model: knowing when
someone's done talking, stopping the bot from hearing itself, making the model
robust to a transcriber that's wrong a third of the time, and watching your
latency budget like it's the only metric that matters. Get those right with
embarrassingly simple mechanisms, and the LLM part, the part everyone thinks is
hard, is genuinely the easy bit you already built for chat.

Hermes Agent vs Openclaw

WanjohiChristopher — Sun, 24 May 2026 02:58:43 +0000

Hermes vs OpenClaw: The Two Most-Starred AI Agent Frameworks of 2026

WanjohiChristopher — Fri, 22 May 2026 20:22:00 +0000

The open-source agent space hit a real inflection point in 2026. Two projects now sit near the top of GitHub's charts, and they represent two very different ideas about what a personal AI agent should look like.

Hermes Agent: 163k stars, built by Nous Research, written in Python. Tagline: "The agent that grows with you."
OpenClaw: 374k stars, sponsored by OpenAI, GitHub, NVIDIA, and Vercel, written in TypeScript. Tagline: "Your own personal AI assistant. Any OS. Any platform. The lobster way. 🦞"

At first glance they're solving the same problem: a personal assistant that lives across messaging platforms (Telegram, Discord, Slack, WhatsApp, Signal, iMessage…) and can reason, plan, and call tools. But once you dig in, they're going in pretty different directions. And one of them is already trying to migrate users away from the other.

Here's what I learned reading the READMEs side by side.

The 10-Second Summary

Both projects ship the same baseline:

Multi-channel chat across Telegram, Discord, Slack, WhatsApp, Signal, iMessage, and others
Tool calling for browser, shell, files, and scheduling
Sandboxed execution
Pluggable LLM providers (OpenAI, Anthropic, OpenRouter, local models)
Persistent memory and per-user state
MIT license

The differences are where it gets interesting:

Dimension	Hermes Agent	OpenClaw
Built by	Nous Research	openclaw org (sponsored by OpenAI, GitHub, NVIDIA, Vercel)
Language	Python	TypeScript (Node 22.19+)
GitHub stars	163k	374k
Standout feature	A closed learning loop: self-improving skills and agent-curated memory	Live Canvas: an agent-driven visual workspace, plus native macOS, iOS, and Android apps
Channels	Telegram, Discord, Slack, WhatsApp, Signal, Email, CLI	22+ including iMessage, Teams, Matrix, LINE, Feishu, Mattermost, WeChat, QQ, Nostr
Skills standard	agentskills.io plus Honcho dialectic user modeling	Bundled, managed, and workspace skills, plus the ClawHub registry
Tools	MCP-native, 40+ built-in, RPC subagents	Browser, canvas, nodes, cron, sessions, channel actions
Hosting	Local, Docker, SSH, Singularity, Modal, Daytona, Vercel Sandbox	Local Gateway as the control plane, plus companion macOS, iOS, and Android apps
Ideal user	Developers who want an agent that learns from them across sessions	People who want a polished personal assistant on every device and channel

What Makes Hermes Different: The Closed Learning Loop

Most agent frameworks treat memory like a database. You store facts, you retrieve them later, end of story. Hermes turns memory into a feedback loop instead.

A few specifics worth calling out:

Autonomous skill creation. After a complex task, the agent can write its own skill (basically a reusable procedure) and save it for later.
Skills self-improve during use. When a skill fails or works well, the agent updates it.
Periodic memory nudges. The agent reviews and curates its own memory in the background, not just when you ask.
FTS5 session search with LLM summarization. Past conversations are first-class context. Hermes can search and summarize what it has already done with you.
Honcho dialectic user modeling. A separate component builds a persistent model of who you are across sessions.
agentskills.io standard. Skills are portable across compatible agents, so you can share them or consume them from others.

The bet behind all of this: an agent that gets sharper the more you use it is worth more than one that's smart on day one. As far as I can tell, Hermes is the only mainstream agent actually shipping this kind of closed loop today.

The README puts it plainly:

The self-improving AI agent. It creates skills from experience, improves them during use, nudges itself to persist knowledge, searches its own past conversations, and builds a deepening model of who you are across sessions.

What Makes OpenClaw Different: Channel Breadth and the Live Canvas

OpenClaw is optimizing for surface area, and two things really jump out.

First, the channel list is huge. WhatsApp, Telegram, Slack, Discord, Google Chat, Signal, iMessage, IRC, Microsoft Teams, Matrix, Feishu, LINE, Mattermost, Nextcloud Talk, Nostr, Synology Chat, Tlon, Twitch, Zalo, WeChat, QQ, WebChat. Then add native macOS, iOS, and Android on top. If your team or your family is on it, OpenClaw probably bridges it.

Second, the Live Canvas (with the A2UI protocol). This is OpenClaw's most distinctive feature: an agent-driven visual workspace where the assistant can render and manipulate a live UI alongside the conversation. The agent draws a chart, builds a form, or sets up a kanban board on a shared canvas you can both see and edit. A2UI is the protocol that makes it work.

Beyond those two, OpenClaw also ships:

Voice Wake and Talk Mode. Wake words on macOS and iOS, continuous voice on Android, with ElevenLabs as the default and system TTS as a fallback.
A native macOS menu-bar app with a push-to-talk overlay, gateway health, and WebChat built in.
Multi-agent routing. Route inbound channels, accounts, and peers to isolated agents, each with its own workspaces and sessions.
Sandboxing. Docker by default, with SSH and OpenShell backends available.

OpenClaw's bet is essentially that most users don't want to live in a CLI. They want voice, vision, and presence on every device they already use.

Security: Same Primitives, Different Defaults

Both projects take messaging-platform exposure seriously, and they share most of the primitives:

DM pairing. Unknown senders get a pairing code, and messages aren't processed until you approve.
Allowlist-based access control.
Sandboxed tool execution for sessions that aren't your trusted main one.
Doctor commands (hermes doctor, openclaw doctor) that flag risky configs.
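
To make the first two primitives concrete, here is a minimal sketch of DM pairing on top of an allowlist. It is not code from either project: `PairingGate`, `handle`, and `approve` are hypothetical names, and the real implementations will differ in storage, expiry, and code format.

```python
import secrets


class PairingGate:
    """Hypothetical gate: holds messages from unknown senders until approved."""

    def __init__(self, allowlist=None):
        self.allowlist = set(allowlist or [])
        self.pending = {}  # pairing code -> sender id

    def handle(self, sender, text):
        """Return ('process', text) for approved senders, else ('pair', code)."""
        if sender in self.allowlist:
            return ("process", text)
        # Reuse an existing code so repeated DMs do not mint new ones.
        for code, who in self.pending.items():
            if who == sender:
                return ("pair", code)
        code = secrets.token_hex(3)
        self.pending[code] = sender
        return ("pair", code)

    def approve(self, code):
        """Owner approves a pairing code; the sender joins the allowlist."""
        sender = self.pending.pop(code, None)
        if sender is None:
            return False
        self.allowlist.add(sender)
        return True


gate = PairingGate(allowlist={"owner"})
action, code = gate.handle("stranger", "please run this command")
print(action)  # the message text is never passed to the agent at this point
```

The point of the design is that the untrusted message body never reaches the agent until the owner has explicitly approved the sender.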

Where they diverge:

OpenClaw documents an explicit Gateway exposure runbook for anyone running the gateway on a publicly reachable network. Worth reading before you open the port.
Hermes leans more on container and terminal isolation. Its seven terminal backends (Docker, Modal, Daytona, Vercel Sandbox, and others) let you scope exactly where tools actually run.

Neither one is meaningfully "safer by default." The real risk in both cases is the same: an agent connected to your messaging platforms is a fat target. Treat every inbound DM as untrusted input, and follow each project's security guide before any remote exposure.

The Migration Tool: A Competitive Tell

The most revealing fact in the two READMEs (and the one most articles miss) is this:

Hermes ships a built-in OpenClaw migration command.

hermes claw migrate              # Interactive migration
hermes claw migrate --dry-run    # Preview
hermes claw migrate --preset user-data

It imports SOUL.md persona files, MEMORY.md and USER.md entries, user-created skills (into ~/.hermes/skills/openclaw-imports/), command allowlists, messaging settings, allowlisted API keys, TTS assets, and workspace AGENTS.md instructions.

That's not the behavior of a complementary project. That's a successor framework betting it can convert the larger user base. Nous Research is basically saying, in code: if you're on OpenClaw, here's the door.

Whether the bet pays off depends on whether the closed learning loop matters more to users than channel breadth and the visual canvas.

Which One Should You Pick?

Pick Hermes if:

You want an agent that actually learns, improves its own skills, remembers you, and gets sharper over months.
You live in Python and want MCP-native tool integration.
You're a researcher or developer experimenting with agent cognition, trajectory training, or self-improvement.
You're comfortable in a TUI and want serverless hosting (Modal, Daytona, Vercel Sandbox).

Pick OpenClaw if:

You want a polished personal assistant across every device: macOS menu bar, iOS, Android, voice.
You need the niche messaging channels (iMessage, Teams, Matrix, WeChat, QQ, LINE, Feishu).
The Live Canvas matters for your workflow (visual outputs, shared UIs).
You're in a TypeScript shop and want it Node-native.

Use both? Probably not the move. They overlap heavily, and Hermes' migration tool suggests Nous expects you to pick one eventually.

The Bigger Picture

Two years ago the agent debate was basically: can these systems do anything useful at all? In 2026 we've moved past that. The real question now is whether your agent should get smarter over time or just be everywhere you are.

Hermes is the strongest bet on the first answer. OpenClaw is the strongest bet on the second. Both are MIT-licensed, both are production-grade, and both have raised the bar for what an open-source personal AI agent can be.

The next interesting question is whether either project (or maybe some hybrid that hasn't shown up yet) manages to do both at scale.

Dig deeper:

Hermes Agent: github.com/NousResearch/hermes-agent
OpenClaw: github.com/openclaw/openclaw

Building an AI-Powered Customer Churn Prediction Pipeline on AWS (Step-by-Step)

WanjohiChristopher — Thu, 01 Jan 2026 00:56:43 +0000

Hey folks! 👋

I recently built a customer churn prediction system that not only predicts who will leave — but also explains why in plain English using Amazon Bedrock.

In this tutorial, I'll walk you through building the entire pipeline from scratch.

What we achieved:

✅ 84.2% AUC on validation data
✅ Real-time predictions via SageMaker endpoint
✅ Natural language explanations powered by Claude (Bedrock)

Let's dive in!

🎯 What We're Building

An end-to-end ML pipeline that:

Ingests customer data into S3
Trains a churn prediction model with SageMaker XGBoost
Deploys a real-time inference endpoint
Explains predictions using Amazon Bedrock (Claude)
Exposes everything via API Gateway + Lambda

Prerequisites: AWS account, basic Python knowledge

🏗️ Architecture Overview

The pipeline consists of 5 tiers:

Tier	Services	Purpose
Data Ingestion	S3	Store raw customer data
ML Training	SageMaker Training	Train XGBoost model
Model Storage	S3	Store model artifacts
Inference & AI	SageMaker Endpoint, Bedrock	Real-time predictions + NL explanations
API Layer	API Gateway, Lambda	Expose REST API

Step 1: Set Up S3 and Upload Data

First, create an S3 bucket and upload the dataset.

# Set bucket name with your account ID
export BUCKET_NAME=churn-prediction-$(aws sts get-caller-identity --query Account --output text)

# Create bucket
aws s3 mb s3://$BUCKET_NAME

# Upload your data
aws s3 cp WA_Fn-UseC_-Telco-Customer-Churn.csv s3://$BUCKET_NAME/raw/

📥 Dataset: Download the Telco Customer Churn dataset from Kaggle.

Step 2: Create SageMaker IAM Role

In AWS Console:

Go to IAM → Roles → Create role
Select SageMaker - Execution
Add policies: AmazonSageMakerFullAccess + AmazonS3FullAccess
Name it: SageMakerChurnRole

Step 3: Train the Model

Create train_churn.py:

import boto3
import sagemaker
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sagemaker.inputs import TrainingInput

# Config
BUCKET = os.environ['BUCKET_NAME']
ROLE = os.environ['ROLE_ARN']
PREFIX = 'churn-prediction'

session = sagemaker.Session()
region = session.boto_region_name

# Load and prepare data
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce').fillna(0)
df['Churn'] = (df['Churn'] == 'Yes').astype(int)

# Encode categorical columns
cat_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 
            'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 
            'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 
            'PaperlessBilling', 'PaymentMethod']

for col in cat_cols:
    df[col] = df[col].astype('category').cat.codes

# Features
feature_cols = ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges'] + cat_cols
X = df[feature_cols]
y = df['Churn']

# Split and save
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

train_df = pd.concat([y_train.reset_index(drop=True), X_train.reset_index(drop=True)], axis=1)
test_df = pd.concat([y_test.reset_index(drop=True), X_test.reset_index(drop=True)], axis=1)
train_df.to_csv('train.csv', index=False, header=False)
test_df.to_csv('test.csv', index=False, header=False)

# Upload to S3
s3 = boto3.client('s3')
s3.upload_file('train.csv', BUCKET, f'{PREFIX}/train/train.csv')
s3.upload_file('test.csv', BUCKET, f'{PREFIX}/test/test.csv')

# Train XGBoost
container = sagemaker.image_uris.retrieve('xgboost', region, '1.7-1')

xgb = sagemaker.estimator.Estimator(
    image_uri=container,
    role=ROLE,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path=f's3://{BUCKET}/{PREFIX}/output',
    sagemaker_session=session
)

xgb.set_hyperparameters(
    objective='binary:logistic',
    num_round=100,
    max_depth=5,
    eta=0.2,
    eval_metric='auc'
)

xgb.fit({
    'train': TrainingInput(f's3://{BUCKET}/{PREFIX}/train', content_type='csv'),
    'validation': TrainingInput(f's3://{BUCKET}/{PREFIX}/test', content_type='csv')
})

# Deploy endpoint
predictor = xgb.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium',
    endpoint_name='churn-prediction-endpoint',
    serializer=sagemaker.serializers.CSVSerializer()
)

Run it:

export BUCKET_NAME=churn-prediction-YOUR_ACCOUNT_ID
export ROLE_ARN=arn:aws:iam::YOUR_ACCOUNT_ID:role/SageMakerChurnRole
python3 train_churn.py

Training output:

2026-01-01 00:24:27 Uploading - Uploading generated training model
2026-01-01 00:24:27 Completed - Training job completed
Training seconds: 103
Billable seconds: 103

✅ Training complete!
Model artifact: s3://churn-prediction-905418352184/churn-prediction/output/sagemaker-xgboost-2026-01-01-00-22-03-339/output/model.tar.gz

Deploying endpoint (3-5 min)...
INFO:sagemaker:Creating model with name: sagemaker-xgboost-2026-01-01-00-24-53-959
INFO:sagemaker:Creating endpoint-config with name churn-prediction-endpoint
INFO:sagemaker:Creating endpoint with name churn-prediction-endpoint
---------------!
✅ Endpoint deployed: churn-prediction-endpoint
Test prediction: 0.4% churn probability
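
One thing worth spelling out before calling the endpoint: it expects each row as a CSV string in the same column order as feature_cols, with categoricals encoded the same way cat.codes encoded them at training time (for string columns, codes follow the sorted unique values). Here's a rough sketch of that encoding. The helper names fit_codes and encode_row are my own, and the toy example uses only two categorical columns with illustrative values, not the full fifteen.

```python
# Numeric columns in the order used by feature_cols in train_churn.py.
NUMERIC_COLS = ["SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges"]


def fit_codes(values):
    """Mimic pandas cat.codes for strings: codes follow sorted unique values."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def encode_row(customer, numeric_cols, cat_cols, code_maps):
    """Build the CSV payload: numerics first, then encoded categoricals."""
    parts = [str(customer[c]) for c in numeric_cols]
    # A value never seen at training time raises KeyError here, on purpose.
    parts += [str(code_maps[c][customer[c]]) for c in cat_cols]
    return ",".join(parts)


# Toy example (assumption: only two categorical columns, illustrative values).
cat_cols = ["Contract", "PaperlessBilling"]
code_maps = {
    "Contract": fit_codes(["Two year", "Month-to-month", "One year"]),
    "PaperlessBilling": fit_codes(["Yes", "No"]),
}
customer = {
    "SeniorCitizen": 0, "tenure": 24, "MonthlyCharges": 65.5,
    "TotalCharges": 1500.0, "Contract": "Month-to-month",
    "PaperlessBilling": "Yes",
}
row = encode_row(customer, NUMERIC_COLS, cat_cols, code_maps)
print(row)
```

In a real setup you'd build the code maps from the training DataFrame and save them next to the model, so the API encodes new customers exactly the way training did.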

Step 4: Create Lambda with Bedrock Integration

Create a Lambda function ChurnPredictionAPI with this code:

import json
import boto3
import os

sagemaker_runtime = boto3.client('sagemaker-runtime')
bedrock = boto3.client('bedrock-runtime')

ENDPOINT_NAME = os.environ.get('SAGEMAKER_ENDPOINT', 'churn-prediction-endpoint')

def lambda_handler(event, context):
    body = json.loads(event['body']) if isinstance(event.get('body'), str) else event

    # Get prediction from SageMaker
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='text/csv',
        Body=body['features']
    )

    churn_prob = float(response['Body'].read().decode())

    # Generate explanation with Bedrock Claude
    prompt = f"""A customer has {churn_prob:.1%} churn probability.
Customer: Tenure {body.get('tenure', 'N/A')} months, ${body.get('monthly_charges', 'N/A')}/month, {body.get('contract', 'N/A')} contract.
In 2 sentences, explain the risk and suggest one retention action."""

    bedrock_response = bedrock.invoke_model(
        modelId='anthropic.claude-3-haiku-20240307-v1:0',
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 100,
            "messages": [{"role": "user", "content": prompt}]
        })
    )

    explanation = json.loads(bedrock_response['body'].read())['content'][0]['text']
    risk = "High" if churn_prob > 0.7 else "Medium" if churn_prob > 0.4 else "Low"

    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'application/json'},
        'body': json.dumps({
            'churn_probability': f"{churn_prob:.1%}",
            'risk_level': risk,
            'explanation': explanation
        })
    }

Lambda configuration:

Runtime: Python 3.11
Timeout: 30 seconds
Role: LambdaChurnRole (with SageMaker + Bedrock permissions)
Environment variable: SAGEMAKER_ENDPOINT=churn-prediction-endpoint
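
If you want to check the handler's logic without deploying anything, the event parsing and the risk thresholds can be pulled out into plain functions. This is a sketch of that refactor, not part of the deployed Lambda above; parse_body and risk_level are names I made up, and the thresholds are copied from the handler.

```python
import json


def parse_body(event):
    """Accept an API Gateway event (JSON string body) or a direct test dict."""
    raw = event.get("body")
    return json.loads(raw) if isinstance(raw, str) else event


def risk_level(churn_prob):
    """Same thresholds as the handler: above 0.7 High, above 0.4 Medium, else Low."""
    if churn_prob > 0.7:
        return "High"
    if churn_prob > 0.4:
        return "Medium"
    return "Low"


api_event = {"body": json.dumps({"features": "0,24,65.5", "tenure": 24})}
print(parse_body(api_event)["tenure"], risk_level(0.006))
```

Note that the comparisons are strict, so a probability of exactly 0.4 is still "Low" and exactly 0.7 is still "Medium".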

Step 5: Create API Gateway

Create an HTTP API in API Gateway
Add Lambda integration → ChurnPredictionAPI
Create POST route: /predict
Deploy and get your invoke URL

🧪 Test the API

curl -X POST "https://YOUR_API_URL/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "features": "0,24,65.5,1500.0,1,0,1,2,0,0,1,1,0,0,1,0,2,1,1",
    "tenure": 24,
    "monthly_charges": 65.5,
    "contract": "Month-to-month"
  }'

Response:

{"churn_probability": "0.6%", "risk_level": "Low", "explanation": "The customer's high churn probability of 0.6% and the month-to-month contract indicate a significant risk of losing the customer. To mitigate this risk, a retention action could be to offer the customer a longer-term contract with a discounted monthly rate or additional benefits, which may help increase their loyalty and reduce the likelihood of churn."}

🧹 Cleanup

Don't forget to delete resources to avoid charges:

# Delete SageMaker endpoint (most expensive!)
aws sagemaker delete-endpoint --endpoint-name churn-prediction-endpoint
aws sagemaker delete-endpoint-config --endpoint-config-name churn-prediction-endpoint

# Delete Lambda
aws lambda delete-function --function-name ChurnPredictionAPI

# Delete S3 bucket
aws s3 rb s3://$BUCKET_NAME --force

💡 Key Lessons Learned

SageMaker XGBoost is production-ready — achieved 84% AUC with minimal tuning.
Bedrock adds real business value — converting predictions to actionable insights makes ML accessible to non-technical stakeholders.
IAM permissions are tricky — create roles via Console if CLI gives explicit deny errors.
Cost awareness matters — always delete endpoints when not in use (~$0.05/hour adds up!)
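
A quick aside on the AUC number: it is the probability that a randomly chosen churner gets a higher score than a randomly chosen non-churner. If you want to sanity-check it on your own test split, here's a small pure-Python sketch. The function and the toy labels and scores are mine, not output from the pipeline above, and the pairwise loop is only meant for small samples.

```python
def auc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: two churners (label 1) and three non-churners (label 0).
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.3, 0.4, 0.5, 0.1]
print(round(auc(labels, scores), 3))
```

Here the churner scored 0.4 loses to the non-churner scored 0.5, so five of the six positive/negative pairs are ranked correctly.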

Thanks for reading! If this helped you, follow me for more AWS + Data Engineering content.

Questions? Leave a comment below!

𝗩𝗼𝗶𝗰𝗲 𝗔𝗜: 𝗧𝗧𝗦 - 𝗚𝗶𝘃𝗶𝗻𝗴 𝗬𝗼𝘂𝗿 𝗔𝗜 𝗮 𝗩𝗼𝗶𝗰𝗲

WanjohiChristopher — Tue, 23 Dec 2025 13:45:00 +0000

We've covered how Voice AI listens (ASR), understands (NLU), decides (Dialog Management), remembers (Context), and writes (NLG).

Now for the final piece: 🔊 Making it speak.

That's TTS - Text-to-Speech.

𝗧𝗵𝗲 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻:
Input: "Great news! Your flight to Paris is confirmed."
Output: 〰️〰️〰️ (audio waveform).

𝗧𝗵𝗲 𝗧𝗧𝗦 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲:
1️⃣ 𝗧𝗲𝘅𝘁 𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀
• "How to pronounce this?"
• Normalization ($50 → "fifty dollars")
• Grapheme-to-phoneme conversion
• Homograph resolution (read vs read)
2️⃣ 𝗣𝗿𝗼𝘀𝗼𝗱𝘆 𝗣𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗼𝗻
• How should it sound?
• Pitch contour (intonation)
• Duration (speed)
• Stress & emphasis
• Pauses
3️⃣ 𝗔𝗰𝗼𝘂𝘀𝘁𝗶𝗰 𝗠𝗼𝗱𝗲𝗹
• Generate mel spectrogram.
• Tacotron 2, FastSpeech 2, VITS (VITS is end-to-end and folds in the vocoder step).
• Maps phonemes → audio features.
4️⃣ 𝗩𝗼𝗰𝗼𝗱𝗲𝗿
• Convert to audio waveform.
• HiFi-GAN, WaveGlow, WaveNet.
• Spectrogram → actual audio.
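
To make step 1 concrete, here's a tiny sketch of the normalization bullet ($50 → "fifty dollars"). It's a toy, assuming whole-dollar amounts under 1000 only; production front ends handle far more cases (dates, decimals, abbreviations). The function names are my own.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 999."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text):
    """Expand whole-dollar amounts like $50; leave anything else untouched."""
    def repl(match):
        n = int(match.group(1))
        unit = "dollar" if n == 1 else "dollars"
        return f"{number_to_words(n)} {unit}"
    # The lookahead skips amounts with more digits, decimals, or separators.
    return re.sub(r"\$(\d{1,3})(?!\d|[.,]\d)", repl, text)


print(normalize("That costs $50."))
```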

🎯 And that closes the loop:
Listen → Think → Speak

That’s the full Voice AI pipeline.

Thanks for following along - next, I'll likely recap the full system and share a few real-world failure modes that make or break Voice AI in production. More coming soon. Keep building!!

Cheers!!

𝗩𝗼𝗶𝗰𝗲 𝗔𝗜: 𝗡𝗟𝗚 - 𝗧𝘂𝗿𝗻𝗶𝗻𝗴 𝗗𝗲𝗰𝗶𝘀𝗶𝗼𝗻𝘀 𝗜𝗻𝘁𝗼 𝗪𝗼𝗿𝗱𝘀

WanjohiChristopher — Mon, 22 Dec 2025 13:43:00 +0000

Voice AI listens (ASR), understands (NLU), and decides (Dialog Management).

But decisions aren't responses.
The system knows:
▶️ Action: inform
▶️ Flight: booked
▶️ Destination: Paris
▶️ Date: Dec 20
▶️ Confirmation: AB123

That's not what we say to a user.

This is where 𝗡𝗟𝗚 (Natural Language Generation) comes in.

It transforms structured data into natural speech:
Example:
🤖 "Great news! Your flight to Paris on December 20th is confirmed. Your confirmation number is AB123. Have a wonderful trip!"

𝗧𝗵𝗲 𝗡𝗟𝗚 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲:
1️⃣ 𝗖𝗼𝗻𝘁𝗲𝗻𝘁 𝗣𝗹𝗮𝗻𝗻𝗶𝗻𝗴
🔹"What information to convey?"
🔹Select facts, order them, prioritize.
2️⃣ 𝗦𝗲𝗻𝘁𝗲𝗻𝗰𝗲 𝗣𝗹𝗮𝗻𝗻𝗶𝗻𝗴
🔹"How to structure it?"
🔹One sentence or multiple?
🔹Combine facts?
3️⃣ 𝗦𝘂𝗿𝗳𝗮𝗰𝗲 𝗥𝗲𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻
🔹"What exact words to use?"
🔹Grammar, vocabulary, tone, fluency.

𝗧𝗵𝗲 𝗲𝘃𝗼𝗹𝘂𝘁𝗶𝗼𝗻:
🔹Templates → slot-filling.
🔹Statistical → n-grams, HMMs.
🔹Neural → Seq2Seq, Transformers.
🔹LLMs → GPT, Claude (SOTA).

Below are 𝗿𝗲𝗰𝗼𝗺𝗺𝗲𝗻𝗱𝗮𝘁𝗶𝗼𝗻𝘀 based on use case:
🔹Need predictability → Templates.
🔹Need natural variety → LLM.
🔹Need both → Hybrid (LLM + guardrails).
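
Here's what the template route can look like for the flight example: a small sketch that turns the structured decision into a sentence. The dict keys, the (month, day) tuple, and the function names are my own assumptions for illustration.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def ordinal(n):
    """1 -> '1st', 2 -> '2nd', 11 -> '11th', 20 -> '20th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def realize(act):
    """Fill a fixed template from a dialog act; fail loudly if none fits."""
    if act["action"] == "inform" and act["flight"] == "booked":
        month, day = act["date"]
        return (f"Great news! Your flight to {act['destination']} on "
                f"{MONTHS[month - 1]} {ordinal(day)} is confirmed. "
                f"Your confirmation number is {act['confirmation']}.")
    raise ValueError("no template for this dialog act")


act = {"action": "inform", "flight": "booked", "destination": "Paris",
       "date": (12, 20), "confirmation": "AB123"}
print(realize(act))
```

Same input, same output, every time. That's the predictability templates buy you, at the cost of variety.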

The difference between a robotic assistant and a delightful one? NLG.

𝗩𝗼𝗶𝗰𝗲 𝗔𝗜: 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 & 𝗠𝗲𝗺𝗼𝗿𝘆 - 𝗪𝗵𝘆 𝗖𝗼𝗻𝘃𝗲𝗿𝘀𝗮𝘁𝗶𝗼𝗻𝘀 𝗗𝗼𝗻'𝘁 𝗥𝗲𝘀𝗲𝘁

WanjohiChristopher — Sun, 21 Dec 2025 14:40:00 +0000

Dialog Management means deciding what to do next.

But something else makes Voice AI feel human instead of robotic:

🧠 Context and memory.

𝗪𝗵𝘆 𝗰𝗼𝗻𝘁𝗲𝘅𝘁 𝗺𝗮𝘁𝘁𝗲𝗿𝘀
Consider this exchange:
🗣️ "Book me a flight to Paris."
🗣️ "Make it business class."
That second sentence only makes sense if the system remembers the first.
That's context.
𝗪𝗵𝗮𝘁 𝗰𝗼𝗻𝘁𝗲𝘅𝘁 & 𝗺𝗲𝗺𝗼𝗿𝘆 𝗮𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗶𝗻𝗰𝗹𝘂𝗱𝗲:
→ 𝗦𝗵𝗼𝗿𝘁-𝘁𝗲𝗿𝗺 𝗰𝗼𝗻𝘁𝗲𝘅𝘁 (session memory)
🔹Recent turns.
🔹Slot values.
🔹Corrections.
🔹Current dialog state.
→ 𝗟𝗼𝗻𝗴-𝘁𝗲𝗿𝗺 𝗺𝗲𝗺𝗼𝗿𝘆
🔹User preferences.
🔹Past interactions.
🔹Frequent locations.
🔹Knowledge (RAG documents).
This information feeds directly into Dialog Management so the system can make better decisions.
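
A minimal sketch of the short-term vs long-term split, using the Paris example. The class and field names are hypothetical, and a real system would add retrieval, expiry, and privacy controls on top.

```python
class SessionContext:
    """Toy context store: session slots layered over long-term preferences."""

    def __init__(self, long_term=None):
        self.turns = []                          # short-term: recent utterances
        self.slots = {}                          # short-term: current slot values
        self.long_term = dict(long_term or {})   # long-term: user preferences

    def update(self, utterance, new_slots):
        """Record a turn; later values override earlier ones (corrections)."""
        self.turns.append(utterance)
        self.slots.update(new_slots)

    def resolved(self):
        """Long-term preferences act as defaults; session slots win."""
        return {**self.long_term, **self.slots}


ctx = SessionContext(long_term={"seat": "aisle", "cabin": "economy"})
ctx.update("Book me a flight to Paris.",
           {"intent": "book_flight", "destination": "Paris"})
ctx.update("Make it business class.", {"cabin": "business"})
print(ctx.resolved())
```

The second turn only carries cabin: business. It makes sense because the destination and intent from the first turn are still in the session state.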

Without memory, every interaction would feel like the first one.

LLMs can reason - but the architecture decides what to remember, when to retrieve it, and when to forget.

That balance is what makes Voice AI feel natural and safe.

𝗩𝗼𝗶𝗰𝗲 𝗔𝗜: 𝗗𝗶𝗮𝗹𝗼𝗴 𝗠𝗮𝗻𝗮𝗴𝗲𝗺𝗲𝗻𝘁 - 𝗧𝗵𝗲 𝗢𝗿𝗰𝗵𝗲𝘀𝘁𝗿𝗮𝘁𝗼𝗿

WanjohiChristopher — Sat, 20 Dec 2025 16:07:00 +0000

We've talked about how Voice AI listens (ASR) and understands (NLU).

But once the system understands the user, there's a harder question:
👉 What should happen next?

This is where Dialog Management comes in.

It's not about generating responses - it's about orchestrating decisions across multiple turns.

𝗘𝘅𝗮𝗺𝗽𝗹𝗲:
👤 "Book a flight to Paris"
🤖 [dest: Paris, origin: ❓] → "Where from?"
👤 "New York"
🤖 [all slots filled] → "NYC → Paris. Confirm?"

That decision flow? That's Dialog Management.

𝗨𝗻𝗱𝗲𝗿 𝘁𝗵𝗲 𝗵𝗼𝗼𝗱, 𝗶𝘁 𝗵𝗮𝗻𝗱𝗹𝗲𝘀:
→ Tracking conversation state across turns.
→ Knowing what's been said vs what's missing.
→ Deciding when to ask vs when to act.
→ Handling corrections and errors.
→ Executing actions and tools safely.
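
The "ask vs act" decision from the flight example can be sketched as a tiny rule-based policy. The slot names, prompts, and action labels are assumptions for illustration; an LLM-based agent would usually sit behind a similar structure.

```python
REQUIRED_SLOTS = ["origin", "destination"]
PROMPTS = {"origin": "Where from?", "destination": "Where to?"}


def next_action(state):
    """Ask for the first missing slot, then confirm, then act."""
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return ("ask", PROMPTS[slot])
    if not state.get("confirmed"):
        return ("confirm", f"{state['origin']} → {state['destination']}. Confirm?")
    return ("act", "book_flight")


state = {"destination": "Paris"}
print(next_action(state))      # origin is missing, so ask for it
state["origin"] = "NYC"
print(next_action(state))      # all slots filled, so confirm before acting
state["confirmed"] = True
print(next_action(state))      # only now is the booking tool called
```

Notice the booking action is unreachable until the user has confirmed. That's the "executing actions safely" part.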

This is what turns one-shot commands (from the user) into real conversations.

Modern Voice AI agents may use LLMs here - but structure is still essential for reliability and safety.

Without dialog management, even the best models feel unpredictable.

➡️ Next up: How Voice AI remembers - context & memory management.

𝗩𝗼𝗶𝗰𝗲 𝗔𝗜: 𝗡𝗟𝗨 (𝗡𝗮𝘁𝘂𝗿𝗮𝗹 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱𝗶𝗻𝗴) - 𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱𝗶𝗻𝗴 𝗪𝗵𝗮𝘁 𝗬𝗼𝘂 𝗔𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗠𝗲𝗮𝗻𝘁

WanjohiChristopher — Fri, 19 Dec 2025 13:33:00 +0000

Here’s what happens after the text arrives from ASR.

🗣️ Say you tell a voice assistant:
"Book me a flight to Paris next Friday"

ASR does its job and converts that into text.

But at this point, the system still doesn’t really understand anything.
It doesn’t know:
🔹what you’re trying to do.
🔹which parts of the sentence matter.
🔹or what information is missing.

That’s where NLU (Natural Language Understanding) comes in.

Here’s what NLU figures out behind the scenes:

1️⃣ - 𝗜𝗻𝘁𝗲𝗻𝘁 𝗖𝗹𝗮𝘀𝘀𝗶𝗳𝗶𝗰𝗮𝘁𝗶𝗼𝗻
What are you trying to do?
→ You want to book a flight.

2️⃣ - 𝗘𝗻𝘁𝗶𝘁𝘆 𝗘𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗼𝗻
Which details matter?
→ destination: Paris
→ date: next Friday

3️⃣ - 𝗦𝗹𝗼𝘁 𝗙𝗶𝗹𝗹𝗶𝗻𝗴
What’s missing?
→ where are you flying from?

So the system knows it needs to ask a follow-up.
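
For intuition, here is a deliberately naive rule-based sketch of those three steps on the flight example. The regexes, slot names, and the capitalized-word heuristic for city names are my own simplifications; real NLU uses trained models or an LLM.

```python
import re

CITY = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"   # naive: one or more capitalized words
DATE = r"\b(next (?:mon|tues|wednes|thurs|fri|satur|sun)day|today|tomorrow)\b"


def parse(utterance):
    """Return intent, extracted slots, and the slots still missing."""
    text = utterance.lower()
    # 1) Intent classification
    intent = "book_flight" if re.search(r"\bbook\b.*\bflight\b", text) else "unknown"
    # 2) Entity extraction
    slots = {}
    match = re.search(r"\bto " + CITY, utterance)
    if match:
        slots["destination"] = match.group(1)
    match = re.search(r"\bfrom " + CITY, utterance)
    if match:
        slots["origin"] = match.group(1)
    match = re.search(DATE, text)
    if match:
        slots["date"] = match.group(1)
    # 3) Slot filling: what do we still need to ask for?
    missing = [s for s in ("origin", "destination", "date") if s not in slots]
    return {"intent": intent, "slots": slots, "missing": missing}


print(parse("Book me a flight to Paris next Friday"))
```

The missing list is what triggers the follow-up question about where you’re flying from.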

That's the moment where the conversation starts to feel natural instead of scripted.

With models like GPT-4 or Claude, a lot of this NLU work can now happen in one step, without training separate intent classifiers or entity models. The model reasons about intent, details, and gaps together.

That’s a big reason modern Voice AI agents feel more flexible than the older "say it exactly this way" systems.

ASR (Automatic Speech Recognition)

WanjohiChristopher — Thu, 18 Dec 2025 22:30:00 +0000

Yesterday I shared the full Voice AI pipeline.
Today we're diving deep into Stage 1: ASR (Automatic Speech Recognition).

You speak → It becomes text.

Simple, right? Here's what actually happens:

𝟭. 𝗙𝗲𝗮𝘁𝘂𝗿𝗲 𝗘𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗼𝗻
Raw audio → Digital representation

MFCCs (Mel-Frequency Cepstral Coefficients)
Spectrograms
Filter Banks

𝟮. 𝗔𝗰𝗼𝘂𝘀𝘁𝗶𝗰 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴
Maps audio features to phonemes

Traditional: HMM-GMM, DNN-HMM
Modern: Transformers, Conformers

𝟯. 𝗗𝗲𝗰𝗼𝗱𝗶𝗻𝗴 & 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴
Phonemes → Words using probabilities

Beam Search, CTC, Attention mechanisms

𝟰. 𝗣𝗼𝘀𝘁-𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴
Clean up the output

Spell checking, punctuation, capitalization
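
To make stage 3 less abstract, here's a sketch of the simplest CTC decoding strategy (greedy, not beam search): pick the most likely symbol per audio frame, collapse repeats, then drop blanks. The toy alphabet and frame probabilities are made up for illustration.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Greedy CTC: argmax per frame, collapse repeats, remove blank symbols."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_probs]
    output = []
    previous = None
    for index in best:
        # Emit only when the symbol changes and is not the blank.
        if index != previous and index != blank:
            output.append(alphabet[index])
        previous = index
    return "".join(output)


alphabet = ["_", "h", "i"]       # index 0 is the CTC blank
frames = [
    [0.1, 0.8, 0.1],             # h
    [0.1, 0.7, 0.2],             # h again (repeat, collapsed)
    [0.9, 0.05, 0.05],           # blank
    [0.1, 0.1, 0.8],             # i
    [0.2, 0.1, 0.7],             # i again (repeat, collapsed)
]
print(ctc_greedy_decode(frames, alphabet))
```

The blank is what lets CTC output genuine double letters: "i, blank, i" decodes to "ii", while "i, i" collapses to a single "i".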

The evolution has been wild:

𝗧𝗿𝗮𝗱𝗶𝘁𝗶𝗼𝗻𝗮𝗹 (1980s-2010s):
→ HMM + GMM
→ Required phonetic alignment
→ Separate components stitched together

𝗦𝗧𝗔𝗧𝗘-𝗢𝗙-𝗧𝗛𝗘-𝗔𝗥𝗧 (Now):
→ Whisper: 680K hours of training data, 50+ languages
→ Wav2Vec 2.0: Self-supervised, works with limited data

Get ASR wrong and your entire voice pipeline fails. It's the foundation.

I've attached a diagram breaking down the full ASR architecture.

What ASR model are you using? Any surprises with accuracy or latency?

VOICE AI SYSTEM ARCHITECTURE

WanjohiChristopher — Thu, 18 Dec 2025 04:22:41 +0000

🎙️I’ve been diving deep into Voice AI Agents and decided to map out how they actually work.

You know when you ask Alexa or ChatGPT Voice a question and it just… responds intelligently?

There’s a lot happening in that split second.

How do voice agents work?

At a high level, every voice agent needs to handle three tasks:

👉Listen - capture audio and transcribe it
👉Think - interpret intent, reason, plan
👉Speak - generate audio and stream it back to the user

A Voice AI Agent typically goes through five core stages:
🔹Speech is converted to text (ASR).
🔹The system understands intent and entities (NLU).
🔹It reasons about what action to take (Dialog Manager / Agent Logic).
🔹It generates a response (NLG).
🔹Speaks it back naturally (TTS).
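
In code, the five stages are just a chain of functions. Here's a bare-bones sketch with stub stages so it runs without any models. Every stage implementation below is a placeholder I made up; in a real agent each would be a model or service call.

```python
def run_turn(audio, asr, nlu, policy, nlg, tts):
    """One conversational turn: audio in, audio out."""
    text = asr(audio)          # 1. speech -> text
    meaning = nlu(text)        # 2. text -> intent and entities
    action = policy(meaning)   # 3. decide what to do
    reply = nlg(action)        # 4. action -> words
    return tts(reply)          # 5. words -> speech


# Stub stages (placeholders: "audio" here is just UTF-8 bytes of text).
def stub_asr(audio):
    return audio.decode("utf-8")


def stub_nlu(text):
    return {"intent": "greet" if "hello" in text.lower() else "unknown"}


def stub_policy(meaning):
    return "say_hello" if meaning["intent"] == "greet" else "ask_repeat"


def stub_nlg(action):
    return {"say_hello": "Hi there!",
            "ask_repeat": "Sorry, could you repeat that?"}[action]


def stub_tts(reply):
    return reply.encode("utf-8")


out = run_turn(b"Hello agent", stub_asr, stub_nlu, stub_policy, stub_nlg, stub_tts)
print(out)
```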

This same agent-style architecture powers Alexa, Siri, Google Assistant, and modern LLM-based voice agents like ChatGPT Voice.

I put together a diagram to visualize the full end-to-end pipeline behind Voice AI Agents - from speech input to intelligent action and response.

I’m planning to break down each component and share more on how agent-based voice systems are built.

Which Voice AI agent do you interact with the most?