I have an Elgato Air Light sitting on my desk. It's great for video recording and calls. But every time I want to turn it on, adjust the brightness, or change the color temperature, I have to reach for my phone, open the app, and tap through menus. It's a small friction, but it adds up.
I also had an M5Stack Core2 gathering dust — an ESP32-based device with a built-in microphone, speaker, and touch screen. I kept thinking: what if I could just talk to it? "Turn on my light." "Make it warmer." "Dim it to 30 percent."
That's when I thought: why not build an agent with the Cloudflare Agents SDK? I started with custom functions to handle audio input and output. The M5Stack would connect to the agent deployed on the edge and send audio chunks; the agent would process them, perform the action, and stream the audio response back to the device. This worked, but it took a lot of fragile code. If I switched the Text-to-Speech (TTS) or Speech-to-Text (STT) model, I had to update the code that handled encoding and decoding. This wasn't fun at all.
Then, in April 2026, Cloudflare announced its Voice SDK. The SDK turns an agent into a real-time voice agent with streaming speech-to-text, text-to-speech, and conversation history. Combine that with Workers AI for the LLM and Cloudflare Mesh for reaching local devices from the edge, and I had everything I needed.
In this article, I'll walk you through how I built a voice-controlled smart light system, from the ESP32 firmware to the Worker running on Cloudflare's edge, along with all the gotchas I hit on the way. The focus is on the Cloudflare stack rather than the device code.
What I built
A voice assistant running on the M5Stack Core2 that can:
- Have natural conversations using streaming speech-to-text and text-to-speech
- Control my Elgato Air Light on the local network — turn it on/off, adjust brightness and color temperature
- Do all processing on Cloudflare's edge — the ESP32 is just a microphone, speaker, and display
Here's the end-to-end flow: when I say "turn on my light," the LLM recognizes the intent and calls a tool function, which reaches the Elgato light through Cloudflare Mesh. The agent then speaks back: "Done, I've turned on your light."
Prerequisites
Before you follow along, here's what you'll need:
- An M5Stack Core2 (or any ESP32 with mic and speaker)
- An Elgato Key Light or Air Light on your local network
- A Raspberry Pi 3/4/5 (or any Linux machine) on the same local network as the light
- A Cloudflare account with Workers AI enabled
- Familiarity with TypeScript and Arduino/C++
The Voice SDK: withVoice
The @cloudflare/voice SDK provides a withVoice mixin that turns any Cloudflare Agent (Durable Object) into a real-time voice agent. It handles:
- Continuous streaming STT (speech-to-text) via the Flux model
- Sentence-level TTS (text-to-speech) via Deepgram Aura
- Conversation history persistence in SQLite
- Interruption handling (new speech cancels in-progress TTS)
- A WebSocket protocol that clients connect to
This is where things got exciting for me. The SDK abstracts away so much of the complexity that the core server code is surprisingly compact.
The server
import { Agent, routeAgentRequest, type Connection } from 'agents';
import { withVoice, WorkersAIFluxSTT, type VoiceTurnContext } from '@cloudflare/voice';
import { streamText } from 'ai';
import { createWorkersAI } from 'workers-ai-provider';
const VoiceAgentBase = withVoice(Agent, { audioFormat: 'pcm16' });
export class VoiceAgent extends VoiceAgentBase<Env> {
transcriber = new WorkersAIFluxSTT(this.env.AI);
tts = new PCM16TTS(this.env.AI);
async onTurn(transcript: string, context: VoiceTurnContext) {
const workersAi = createWorkersAI({ binding: this.env.AI });
const result = streamText({
model: workersAi('@cf/moonshotai/kimi-k2.6'),
system: 'You are a helpful voice assistant. Keep responses concise.',
messages: [
...context.messages.map(m => ({
role: m.role as 'user' | 'assistant',
content: m.content,
})),
{ role: 'user', content: transcript },
],
abortSignal: context.signal,
});
return result.textStream;
}
}
The onTurn method is called whenever the user finishes speaking. It receives the transcript and returns a text stream; the SDK converts that text to speech and streams the audio back. Make sure to append the current transcript after the entries from context.messages when building the message list for the LLM, as the code above does.
PCM16 TTS: Why I needed a custom class
This was the first gotcha I hit. The built-in WorkersAITTS class sends { text, speaker } to the Deepgram model (it defaults to aura-1), which outputs MP3 by default. The ESP32 doesn't have an MP3 decoder (or at least that's what the coding agents told me), so I needed raw PCM16 audio instead.
The fix: a custom TTS class that calls the aura-2-en model directly and passes encoding: "linear16", sample_rate: 24000, and container: "none":
class PCM16TTS {
#ai: Ai;
constructor(ai: Ai) {
this.#ai = ai;
}
async synthesize(text: string, signal?: AbortSignal): Promise<ArrayBuffer | null> {
const resp = await this.#ai.run(
'@cf/deepgram/aura-2-en' as any,
{
text,
speaker: 'luna',
encoding: 'linear16',
sample_rate: 24000,
container: 'none',
} as any,
{ returnRawResponse: true, ...(signal ? { signal } : {}) }
);
return await (resp as Response).arrayBuffer();
}
}
Chunking audio for the ESP32
TTS generates audio per-sentence. A short sentence like "Hi! How can I help you?" produces ~90KB of PCM16 data. The ESP32 WebSocket library has a maximum frame size (WEBSOCKETS_MAX_DATA_SIZE), and the device has limited heap (~170KB free). Sending a single 90KB frame works but leaves little headroom.
The afterSynthesize hook lets me chunk audio into smaller frames before sending:
const AUDIO_CHUNK_SIZE = 4096;
afterSynthesize(audio: ArrayBuffer | null, _text: string, connection: Connection) {
if (!audio) return null;
const src = new Uint8Array(audio);
for (let offset = 0; offset < src.byteLength; offset += AUDIO_CHUNK_SIZE) {
const end = Math.min(offset + AUDIO_CHUNK_SIZE, src.byteLength);
connection.send(src.slice(offset, end));
}
return null; // returning null tells the SDK we handled sending ourselves
}
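To put numbers on the frame sizes involved, here is a small sketch of the arithmetic. It assumes mono PCM16 at the 24000 Hz rate requested from the TTS model and the 4096-byte chunk size used in the hook; the helper names and the 1.9-second duration (roughly what ~90KB works out to at that rate) are mine, chosen for illustration.

```typescript
// Sketch: size and frame-count arithmetic for raw mono PCM16 audio.
const BYTES_PER_SAMPLE = 2; // 16-bit samples
const TTS_SAMPLE_RATE = 24000; // matches the sample_rate passed to the TTS model
const AUDIO_CHUNK_SIZE = 4096; // same frame size as the afterSynthesize hook

// Bytes needed for `seconds` of mono PCM16 at `sampleRate`.
function pcm16Bytes(seconds: number, sampleRate: number): number {
  return Math.round(seconds * sampleRate) * BYTES_PER_SAMPLE;
}

// Number of WebSocket frames needed to send `totalBytes` in fixed-size chunks.
function frameCount(totalBytes: number, chunkSize: number): number {
  return Math.ceil(totalBytes / chunkSize);
}

// Playback time covered by one full chunk, in milliseconds.
function chunkDurationMs(chunkSize: number, sampleRate: number): number {
  return (chunkSize / BYTES_PER_SAMPLE / sampleRate) * 1000;
}

const sentenceBytes = pcm16Bytes(1.9, TTS_SAMPLE_RATE);
console.log({
  sentenceBytes,
  frames: frameCount(sentenceBytes, AUDIO_CHUNK_SIZE),
  msPerFrame: chunkDurationMs(AUDIO_CHUNK_SIZE, TTS_SAMPLE_RATE),
});
```

Under these assumptions a two-second sentence becomes a couple of dozen small frames, each holding well under a tenth of a second of audio, which is far gentler on the device's heap than one large frame.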
The WebSocket protocol
The withVoice SDK defines a WebSocket protocol between the client and the server. Here's the full message flow:
Client → Server
| Message | When |
|---|---|
| {"type":"hello","protocol_version":1} | On connect |
| {"type":"start_call","preferred_format":"pcm16"} | User taps to start |
| Binary PCM16 frames (16kHz, 16-bit, mono) | Continuously while in call |
| {"type":"end_call"} | User taps to end |
Server → Client
| Message | Description |
|---|---|
| welcome | Connection acknowledged |
| status | State changes: idle, listening, thinking, speaking |
| transcript | Final transcript with role: "user" or "assistant" |
| transcript_interim | Partial STT result while user is speaking |
| transcript_start/delta/end | Streaming LLM response tokens |
| audio_config | Audio format info (format, sampleRate) |
| metrics | Timing info (llm_ms, tts_ms, first_audio_ms) |
| Binary frames | PCM16 audio during speaking status |
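For a client written in TypeScript rather than C++, the two tables could be modeled roughly as follows. This is a sketch, not the SDK's own type definitions: the client messages are copied from the table, but the helper names (encode, classifyFrame) are hypothetical, I am assuming transcript_start/delta/end expands to three separate type values, and payload fields beyond type are deliberately left untyped.

```typescript
// Client → server control messages, as listed in the first table.
type ClientMessage =
  | { type: 'hello'; protocol_version: 1 }
  | { type: 'start_call'; preferred_format: 'pcm16' }
  | { type: 'end_call' };

// Server message types from the second table (assumed expansion of start/delta/end).
const SERVER_MESSAGE_TYPES = [
  'welcome', 'status', 'transcript', 'transcript_interim',
  'transcript_start', 'transcript_delta', 'transcript_end',
  'audio_config', 'metrics',
] as const;
type ServerMessageType = (typeof SERVER_MESSAGE_TYPES)[number];

// Serialize a control message into the JSON text frame the server expects.
function encode(msg: ClientMessage): string {
  return JSON.stringify(msg);
}

type Incoming =
  | { kind: 'audio'; pcm: ArrayBuffer }
  | { kind: 'control'; type: ServerMessageType; raw: Record<string, unknown> }
  | { kind: 'unknown'; raw: unknown };

// Binary frames carry PCM16 audio; text frames carry JSON control messages.
function classifyFrame(data: string | ArrayBuffer): Incoming {
  if (typeof data !== 'string') return { kind: 'audio', pcm: data };
  let parsed: unknown;
  try {
    parsed = JSON.parse(data);
  } catch {
    return { kind: 'unknown', raw: data };
  }
  if (typeof parsed === 'object' && parsed !== null) {
    const record = parsed as Record<string, unknown>;
    const t = record.type;
    if (typeof t === 'string' && (SERVER_MESSAGE_TYPES as readonly string[]).includes(t)) {
      return { kind: 'control', type: t as ServerMessageType, raw: record };
    }
  }
  return { kind: 'unknown', raw: parsed };
}
```

Splitting on binary versus text first mirrors what the firmware has to do anyway: audio goes straight to the speaker buffer, and only text frames need JSON parsing.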
The ESP32 client
The M5Stack Core2 has a built-in microphone, speaker, display, and touch screen. The firmware does the following:
- Connects to WiFi, then opens a WebSocket to the Worker
- Sends hello and waits for welcome
- On touch: sends start_call, receives the agent's greeting, then begins streaming mic audio as binary PCM16 frames
- Receives status updates, transcripts, and audio, and plays audio through the speaker using triple-buffered playRaw()
- On touch again: sends end_call
This part took the most debugging. The ESP32 is a constrained device, and the M5Stack Core2 has some quirks that weren't obvious from the documentation.
Gotcha: mic reinit after speaker playback
The M5Stack Core2 has separate I2S buses for the microphone and speaker, but Speaker.playRaw() disrupts the mic's I2S state. After playback stops, the mic produces silence. This one took me a while to figure out — I kept thinking my WebSocket connection was dropping, but the mic was just... silent.
The fix: fully tear down and reinitialize the mic after each playback session:
void stop_playback() {
M5.Speaker.stop();
M5.Speaker.end();
is_playing = false;
// Restart mic — Speaker.playRaw disrupts the mic I2S bus
M5.Mic.end();
delay(10);
auto mic_cfg = M5.Mic.config();
mic_cfg.sample_rate = 16000;
mic_cfg.magnification = 16;
M5.Mic.config(mic_cfg);
M5.Mic.begin();
}
Gotcha: WebSocket Host header
The ESP32 WebSocket library sends Host: hostname:443 in the header, but routeAgentRequest (which uses partyserver internally) expects just Host: hostname. The extra :443 causes routing to fail silently — no error, no log, just a connection that never reaches the Durable Object.
Important: You need to patch the WebSocketsClient library to omit the port when it's 443 or 80.
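The actual patch lives in the C++ library, but the rule it applies is simple enough to sketch in TypeScript. The function name hostHeader is hypothetical and the hostnames in the usage are placeholders; the only logic taken from the text is "omit the port when it's 443 or 80".

```typescript
// Build the Host header value, dropping the port when it is a standard one.
function hostHeader(hostname: string, port: number): string {
  const isStandardPort = port === 80 || port === 443;
  return isStandardPort ? hostname : `${hostname}:${port}`;
}

console.log(hostHeader('my-worker.example.com', 443)); // port omitted
console.log(hostHeader('localhost', 8787)); // non-standard port kept
```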
Gotcha: Durable Object path routing
routeAgentRequest converts Durable Object binding names to kebab-case for URL routing. The binding VoiceAgent maps to path /agents/voice-agent/default, not /agents/VoiceAgent/default. The coding agents spent an embarrassing amount of time on this one.
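If you want to compute the path instead of remembering the rule, a sketch like the one below works for simple PascalCase names. It is my own approximation, not the code routeAgentRequest uses, so names with acronyms or digits may be converted differently by the SDK; toKebabCase and agentPath are hypothetical helpers.

```typescript
// Convert a simple PascalCase or camelCase name to kebab-case.
function toKebabCase(name: string): string {
  return name.replace(/([a-z0-9])([A-Z])/g, '$1-$2').toLowerCase();
}

// Build the URL path for an agent binding and instance name.
function agentPath(bindingName: string, instanceName = 'default'): string {
  return `/agents/${toKebabCase(bindingName)}/${instanceName}`;
}

console.log(agentPath('VoiceAgent'));
```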
Greeting on call start
One nice touch: the agent can speak immediately when a call begins by implementing onCallStart:
async onCallStart(connection: Connection) {
await this.speak(connection, 'Hi! How can I help you?');
}
That means start_call can produce server audio before the user says anything. It makes the experience feel much more natural.
Adding smart home control: Elgato Air Light via Mesh
Now for the interesting part. I wanted to say "turn on my light" and have the Worker control the Elgato Air Light sitting on my local network.
The challenge
The Elgato Air Light exposes a REST API on the local network (http://<ip>:9123/elgato/lights). But the Worker runs on Cloudflare's edge — it can't reach 192.168.x.x directly.
The solution: Cloudflare Mesh + VPC Networks
Why Cloudflare Mesh and not Cloudflare Tunnel?
If you've used Cloudflare before, you might be wondering: why not just use Cloudflare Tunnel? Both connect your private network to Cloudflare, but they solve different problems.
Cloudflare Tunnel (cloudflared) is designed for publishing specific services to the internet. You configure a public hostname (like light.example.com), and Tunnel proxies inbound traffic from the internet to your local service. It's great for "I want my app reachable at this URL." But each service needs its own tunnel route, and the Worker can't initiate arbitrary requests to any local IP — it can only reach the services you've explicitly published.
Cloudflare Mesh (formerly WARP Connector) is designed for private network connectivity. A Mesh node advertises CIDR routes, making an entire subnet reachable. With a VPC Network binding, your Worker gets a MESH.fetch() that can reach any IP and port in the advertised range — no per-service configuration needed.
| | Cloudflare Tunnel | Cloudflare Mesh |
|---|---|---|
| Traffic direction | Inbound to origin: clients connect to published services | Bidirectional: any participant can initiate |
| Addressing | By public hostname | By private IP (every participant gets a Mesh IP) |
| Worker access | Reach specific published services | Reach any IP/port in the advertised subnet |
| Connector | cloudflared | warp-cli |
| Protocols | HTTP/S, TCP, SSH, RDP, SMB | TCP, UDP, ICMP |
| Best for | Exposing apps to the internet | Private network connectivity, VPN replacement |
For this project, the Worker needs to call the Elgato's local REST API at 192.168.x.x:9123 — a private IP that shouldn't be exposed publicly. Mesh gives the Worker outbound access to the entire local subnet with a single binding. If I add more smart devices later, they're automatically reachable too — no new tunnel routes to configure.
This is similar to the approach I used in my previous article to expose OpenClaw to the internet, except that this time I'm using Mesh instead of Tunnel.
The flow:
Worker calls env.MESH.fetch("http://192.168.x.x:9123/elgato/lights")
→ Cloudflare routes to Mesh network
→ Mesh node on local LAN receives the request
→ Forwards to Elgato at 192.168.x.x:9123
→ Response flows back the same path
Setting up the Mesh node
Mesh nodes require Linux. I used a Raspberry Pi 4 sitting on the same local network as the Elgato light.
Step 1: Create a Mesh node in the Cloudflare dashboard
Go to Networking > Mesh and select Add a node. Name it (e.g. home-network) and copy the connector token.
Step 2: Install the WARP client on the Raspberry Pi
SSH into the Pi and run:
# Add Cloudflare's GPG key and repo
curl -fsSL https://pkg.cloudflareclient.com/pubkey.gpg \
| sudo gpg --yes --dearmor -o /usr/share/keyrings/cloudflare-warp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/cloudflare-warp-archive-keyring.gpg] https://pkg.cloudflareclient.com/ $(lsb_release -cs) main" \
| sudo tee /etc/apt/sources.list.d/cloudflare-client.list
sudo apt-get update && sudo apt-get install -y cloudflare-warp
Step 3: Register as a Mesh connector and connect
sudo warp-cli connector new <YOUR_TOKEN>
sudo warp-cli connect
Verify:
sudo warp-cli status
# Should show: Status update: Connected
The node should appear as Online in the Mesh dashboard with a Mesh IP assigned.
Step 4: Add a CIDR route
In the Mesh dashboard, go to your node > Routes tab > Add route: 192.168.x.0/24. This tells Cloudflare that this Mesh node can forward traffic to devices on the local 192.168.x.x subnet — including the Elgato light.
Step 5: Configure NAT/MASQUERADE on the Mesh node
Important: By default, traffic from your Worker arrives at the Mesh node with a source IP in the 100.96.0.0/12 WARP range. When the Mesh node forwards this to a local device (like your Elgato), that device will try to reply to its default gateway (your router) instead of the Mesh node, causing connection timeouts.
You need to configure the Mesh node to rewrite the source IP before forwarding to local devices. I cover the exact nftables commands in the Gotchas section below. However, if your application is running on the same machine as the Mesh node, you don't need to set this up.
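To make the two address ranges above concrete, here is a minimal sketch that checks whether an IPv4 address falls inside a CIDR block. The helper names are mine and not part of any Cloudflare tooling. The example values assume my setup: the light at 192.168.8.187 behind a 192.168.8.0/24 route, and 100.96.0.2 as an illustrative address in the WARP range.

```typescript
// Parse a dotted-quad IPv4 address into an unsigned 32-bit integer.
function ipv4ToInt(ip: string): number {
  const parts = ip.split('.').map(Number);
  if (parts.length !== 4 || parts.some(p => !Number.isInteger(p) || p < 0 || p > 255)) {
    throw new Error(`Invalid IPv4 address: ${ip}`);
  }
  return ((parts[0] << 24) | (parts[1] << 16) | (parts[2] << 8) | parts[3]) >>> 0;
}

// True when `ip` shares the first `bits` bits with the CIDR base address.
function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split('/');
  const bits = Number(bitsText);
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) {
    throw new Error(`Invalid prefix length in: ${cidr}`);
  }
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// The advertised route covers the light, so the Mesh node can forward to it.
console.log(inCidr('192.168.8.187', '192.168.8.0/24'));
// The Worker's traffic arrives from the WARP range, which is NOT part of the LAN.
console.log(inCidr('100.96.0.2', '100.96.0.0/12'), inCidr('100.96.0.2', '192.168.8.0/24'));
```

The second check is the whole NAT problem in one line: the light sees a source address outside its own subnet, so it hands the reply to the router instead of the Mesh node.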
The Elgato REST API
The Elgato Key Light / Air Light exposes a simple HTTP API on port 9123:
| Endpoint | Method | Description |
|---|---|---|
| /elgato/lights | GET | Get current state |
| /elgato/lights | PUT | Set state |
| /elgato/accessory-info | GET | Device info |
The state payload:
{
"numberOfLights": 1,
"lights": [{
"on": 1,
"brightness": 50,
"temperature": 200
}]
}
- on: 1 = on, 0 = off
- brightness: 0–100
- temperature: 143–344 (mirek scale: 143 = ~7000K cool white, 344 = ~2900K warm white)
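Since most people think in Kelvin rather than mirek, a small conversion helper can be handy. The sketch below assumes the standard reciprocal relationship (value = 1,000,000 / Kelvin), which matches the two endpoints quoted above; the function names are my own and the clamping range is the 143–344 window from the list.

```typescript
const ELGATO_TEMP_MIN = 143; // coolest value the light accepts (~7000K)
const ELGATO_TEMP_MAX = 344; // warmest value the light accepts (~2900K)

// Convert Kelvin to the light's temperature value, clamped to the valid range.
function kelvinToElgato(kelvin: number): number {
  const mirek = Math.round(1e6 / kelvin);
  return Math.min(ELGATO_TEMP_MAX, Math.max(ELGATO_TEMP_MIN, mirek));
}

// Convert the light's temperature value back to approximate Kelvin.
function elgatoToKelvin(value: number): number {
  return Math.round(1e6 / value);
}

// Build a state payload in the shape shown above.
function lightState(on: boolean, brightness: number, kelvin: number) {
  const clampedBrightness = Math.min(100, Math.max(0, Math.round(brightness)));
  return {
    numberOfLights: 1,
    lights: [{ on: on ? 1 : 0, brightness: clampedBrightness, temperature: kelvinToElgato(kelvin) }],
  };
}

console.log(JSON.stringify(lightState(true, 50, 5000)));
```

Note that the scale is inverted: a higher Kelvin value means a lower number sent to the light, which is worth spelling out in any tool description that exposes temperature to the LLM.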
Adding tool calling to the voice agent
Now that I had a way to reach the Elgato from the Worker, I needed the LLM to call the right API based on what I say. The Vercel AI SDK supports tool calling — you define tools with descriptions and parameters, and the LLM decides when to call them based on user intent.
The kimi-k2.6 model on Workers AI supports multi-turn tool calling natively. When you pass tools to streamText, the SDK:
- Sends tool definitions to the LLM
- When the LLM returns a tool call, runs that tool's execute function
- Feeds the result back to the LLM
- Lets the LLM generate a natural language response
The textStream returned to onTurn only contains the final spoken text — all the tool calling happens transparently.
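To show what that loop amounts to, here is a stripped-down simulation with a scripted stand-in for the LLM. This is not how the AI SDK is implemented, and every name in it (runToolLoop, fakeModel, the message shapes) is hypothetical. It only illustrates the call → execute → feed back → answer cycle with a step limit; throwing when the limit is hit is this sketch's choice.

```typescript
type ToolCall = { tool: string; input: Record<string, unknown> };
type ModelReply = { kind: 'tool_call'; call: ToolCall } | { kind: 'text'; text: string };
type Message = { role: 'user' | 'assistant' | 'tool'; content: string };
type Model = (messages: Message[]) => Promise<ModelReply>;
type Tools = Record<string, (input: Record<string, unknown>) => Promise<unknown>>;

// Run the model until it produces text, executing tool calls along the way.
async function runToolLoop(model: Model, tools: Tools, userText: string, maxSteps: number): Promise<string> {
  const messages: Message[] = [{ role: 'user', content: userText }];
  for (let step = 0; step < maxSteps; step++) {
    const reply = await model(messages);
    if (reply.kind === 'text') return reply.text; // final answer, ready to be spoken
    const execute = tools[reply.call.tool];
    const result = execute
      ? await execute(reply.call.input)
      : { error: `Unknown tool: ${reply.call.tool}` };
    // Record the call and its result so the next model step can see them.
    messages.push({ role: 'assistant', content: JSON.stringify(reply.call) });
    messages.push({ role: 'tool', content: JSON.stringify(result) });
  }
  throw new Error(`No final answer after ${maxSteps} steps`);
}

// Scripted model: asks for the tool once, then answers after seeing its result.
const fakeModel: Model = async (messages): Promise<ModelReply> => {
  const last = messages[messages.length - 1];
  return last.role === 'tool'
    ? { kind: 'text', text: "Done, I've turned on your light." }
    : { kind: 'tool_call', call: { tool: 'turn_light_on', input: {} } };
};
const fakeTools: Tools = { turn_light_on: async () => ({ success: true }) };

runToolLoop(fakeModel, fakeTools, 'turn on my light', 5).then(text => console.log(text));
```

Only the string returned at the end would reach the TTS stage, which is why the user never hears the intermediate tool traffic.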
Wrangler config
{
"compatibility_flags": ["nodejs_compat"],
"compatibility_date": "2025-09-21",
"migrations": [
{
"new_sqlite_classes": ["VoiceAgent"],
"tag": "v1"
}
],
"durable_objects": {
"bindings": [
{
"class_name": "VoiceAgent",
"name": "VoiceAgent"
}
]
},
"ai": {
"binding": "AI"
},
"vpc_networks": [
{
"binding": "MESH",
"network_id": "cf1:network",
"remote": true
}
],
"vars": {
"ELGATO_IP": "192.168.8.187"
}
}
Tool definitions
import { tool } from 'ai';
import { z } from 'zod/v4';
const ELGATO_PORT = 9123;
function elgatoUrl(env: Env, path: string) {
return `http://${env.ELGATO_IP}:${ELGATO_PORT}${path}`;
}
function elgatoTools(env: Env) {
const fetchLight = async (init?: RequestInit) => {
try {
return await env.MESH.fetch(elgatoUrl(env, '/elgato/lights'), init);
} catch {
return new Response(JSON.stringify({ error: 'Light unreachable via Mesh' }), {
status: 503,
headers: { 'Content-Type': 'application/json' },
});
}
};
const putLight = (body: object) =>
fetchLight({
method: 'PUT',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(body),
});
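const getLightState = async () => {
const res = await fetchLight();
// fetchLight returns a 503 with an error body when Mesh is unreachable;
// fail with a clear message instead of reading lights[0] from that body
if (!res.ok) throw new Error(`Elgato request failed with status ${res.status}`);
return (await res.json()) as {
numberOfLights: number;
lights: Array<{ on: number; brightness: number; temperature: number }>;
};
};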
return {
get_light_status: tool({
description: 'Get the current status of the desk light',
inputSchema: z.object({}),
execute: async () => {
const res = await fetchLight();
return await res.json();
},
}),
turn_light_on: tool({
description: 'Turn the desk light on',
inputSchema: z.object({}),
execute: async () => {
const state = await getLightState();
const light = state.lights[0];
light.on = 1;
const res = await putLight({ numberOfLights: 1, lights: [light] });
return { success: res.ok };
},
}),
turn_light_off: tool({
description: 'Turn the desk light off',
inputSchema: z.object({}),
execute: async () => {
const state = await getLightState();
const light = state.lights[0];
light.on = 0;
const res = await putLight({ numberOfLights: 1, lights: [light] });
return { success: res.ok };
},
}),
set_light_brightness: tool({
description: 'Set the desk light brightness (0-100)',
inputSchema: z.object({
brightness: z.number().min(0).max(100),
}),
execute: async ({ brightness }) => {
const state = await getLightState();
const light = state.lights[0];
light.brightness = brightness;
const res = await putLight({
numberOfLights: 1,
lights: [light],
});
return { success: res.ok, brightness };
},
}),
set_light_temperature: tool({
description: 'Set the color temperature (143=cool to 344=warm)',
inputSchema: z.object({
temperature: z.number().min(143).max(344),
}),
execute: async ({ temperature }) => {
const state = await getLightState();
const light = state.lights[0];
light.temperature = temperature;
const res = await putLight({
numberOfLights: 1,
lights: [light],
});
return { success: res.ok, temperature };
},
}),
};
}
Then in onTurn:
import { stepCountIs } from 'ai';
async onTurn(transcript: string, context: VoiceTurnContext) {
const workersAi = createWorkersAI({ binding: this.env.AI });
const messages = [
...context.messages.map(m => ({
role: m.role as 'user' | 'assistant',
content: m.content,
})),
// `as const` keeps role typed as 'user' rather than widening to string
{ role: 'user' as const, content: transcript },
];
const result = streamText({
model: workersAi('@cf/moonshotai/kimi-k2.6'),
system: `You are a helpful voice assistant that can also control the desk light.
When asked about the light, use the available tools. Keep responses concise and natural.`,
tools: elgatoTools(this.env),
messages,
abortSignal: context.signal,
stopWhen: stepCountIs(5),
});
return result.textStream;
}
That's it. The LLM handles intent recognition. When I say "make it brighter," the model calls set_light_brightness. When I say "what's the weather," it just responds normally. No keyword parsing, no intent classification system — the LLM figures it out.
Note: The stopWhen: stepCountIs(5) option gives the model enough room for the tool-call → tool-result → final-answer loop, while preventing an accidental unbounded tool loop. In my Worker, I also log tool-call start/finish and step summaries so Mesh or schema failures are visible in Worker logs.
What's running where
| Component | Where | What it does |
|---|---|---|
| ESP32 firmware | M5Stack Core2 on my desk | Mic input, speaker output, touch UI, WebSocket client |
| VoiceAgent | Cloudflare Worker (Durable Object) | STT, LLM, TTS, tool execution, conversation history |
| Workers AI | Cloudflare edge | Flux STT, kimi-k2.6 LLM, Deepgram aura-2-en TTS |
| Mesh node | Raspberry Pi 4 on local LAN | WARP connector bridging Cloudflare to local network |
| Elgato Air Light | Local network (192.168.8.187:9123) | HTTP API for light control |
Gotchas and lessons learned
I hit a lot of issues building this. Here's a summary of everything I ran into, including some I already mentioned above.
- The built-in WorkersAITTS defaults to MP3. If your client can't decode MP3, you need a custom TTS class that explicitly requests encoding: "linear16". I covered this earlier in the PCM16 TTS section.
- routeAgentRequest uses kebab-case paths. The Durable Object binding VoiceAgent maps to URL path /agents/voice-agent/default, not /agents/VoiceAgent/default.
- ESP32 mic needs reinit after speaker playback. On the M5Stack Core2, Speaker.playRaw() disrupts the mic I2S bus. You must call Speaker.end(), Mic.end(), then Mic.begin() to restore it.
- WebSocket Host header matters. The ESP32 WebSocket library sends Host: hostname:443, which breaks routeAgentRequest routing. Patch the library to omit standard ports.
- afterSynthesize returning null is valid. You can use it to chunk large TTS audio into smaller WebSocket frames: just send them yourself via connection.send() and return null so the SDK doesn't double-send.
- Tool calling needs a bounded multi-step loop. Define tools with execute functions, pass them to streamText, and use stopWhen: stepCountIs(5) so the SDK can run the tool-call → execute → feed-result → generate-response loop. The textStream only yields the final spoken text.
- Mesh routing requires NAT/MASQUERADE on the Mesh node. If your Worker gets HandshakeTimeoutError when calling a local device via env.MESH.fetch(), the issue is asymmetric routing. When a packet arrives from the Worker, its source IP is in the 100.96.0.0/12 WARP range. The local device replies to its default gateway (your router), not back to the Mesh node.
The official Cloudflare docs recommend solving this by either making the Mesh node the subnet's default gateway, or adding a static route on your router that points 100.96.0.0/12 to the Mesh node. The coding agent went with a different approach: rewriting the source IP before forwarding to local devices using nftables.
On modern Linux systems using nftables (most newer Raspberry Pi OS versions), add this rule:
# Check which firewall tool is available
which nft # If this returns a path, use nftables. If not, install iptables.
# Add MASQUERADE rule (replace wlan0 with your LAN interface, e.g. eth0)
sudo nft add table ip nat
sudo nft add chain ip nat postrouting { type nat hook postrouting priority 100 \; }
sudo nft add rule ip nat postrouting oifname "wlan0" iifname "CloudflareWARP" masquerade
To verify this is the issue before fixing it, SSH into your Mesh node and run:
# This will fail (simulates the Worker's packet path; replace 100.96.0.2 with your node's Mesh IP)
curl --interface 100.96.0.2 http://<ELGATO_IP>:9123/elgato/lights
# Error: Failed to connect / Handshake timeout
# This works (local origin traffic)
curl http://<ELGATO_IP>:9123/elgato/lights
# Returns: {"numberOfLights":1,...}
If the first curl fails but the second succeeds, you need the MASQUERADE rule. Make it persistent across reboots by saving the ruleset and loading it at boot:
echo 'table ip nat {
chain postrouting {
type nat hook postrouting priority 100;
oifname "wlan0" iifname "CloudflareWARP" masquerade
}
}' | sudo tee /etc/nftables-mesh-nat.nft
sudo nft -f /etc/nftables-mesh-nat.nft
# Persist across reboots (add to crontab)
(crontab -l 2>/dev/null; echo "@reboot sleep 10 && sudo nft -f /etc/nftables-mesh-nat.nft") | crontab -
Summary
I started with a simple frustration — reaching for my phone every time I wanted to adjust my desk light. What I ended up with is a voice assistant that runs on an ESP32, processes everything on Cloudflare's edge, and controls local devices through Mesh networking.
The stack that made this possible:
- @cloudflare/voice SDK — handles the hard parts of real-time voice (STT, TTS, conversation state, interruption)
- Workers AI — LLM with tool calling for intent recognition
- Cloudflare Mesh — bridges the gap between the edge and my local network
- Vercel AI SDK — clean tool calling abstraction on top of Workers AI
The biggest surprises were the ESP32 quirks (mic reinit after speaker, WebSocket Host header) and the Mesh NAT issue. None of these were obvious from the documentation, so I hope this article saves you some debugging time.
What's next
There's a lot more I want to do with this setup:
- Add more devices — I have other smart lights on my local network that I'd love to control by voice
- Improve the ESP32 experience — a proper UI on the display showing conversation state and light status
- Experiment with wake word detection instead of the touch-to-talk button
- Try different LLM models as Workers AI adds more options. Kimi K2.6 is excellent, but a bit overkill for this. I might try with smaller models like Granite 4.0 or others from Workers AI.
If you're building something similar or run into any issues, feel free to reach out on X (Twitter) or LinkedIn. I'd love to hear about what you're building with the Voice SDK and Mesh. I also co-authored a book, Building a Virtual Assistant with Raspberry Pi, which will help you learn how to build an offline-first virtual assistant!

