DEV Community

David Thomas

Building a DIY ESP32 AI Voice Assistant with Xiaozhi MCP

Commercial voice assistants like Alexa and Google Assistant are impressive, but they often come with trade-offs: privacy concerns, limited customisation, and cloud lock-in. For makers and engineers, that naturally raises a question:

Can we build our own ESP32 AI Voice Assistant - one that’s open, hackable, and truly ours?

With the ESP32-S3 and the Xiaozhi AI framework, the answer is yes.

In this article, I will walk through the design and implementation of a portable ESP32-S3 AI voice assistant that supports wake-word detection, natural conversation, smart-device control, and battery operation. This project combines embedded systems, real-time audio processing, and cloud-based large language models into a single, open-source device.


Project Overview

This DIY AI voice assistant is built around the ESP32-S3-WROOM-1-N16R8, paired with a dual-microphone array, an I²S audio amplifier, and robust power management for portable use.

Key Capabilities

  • Local wake-word detection using Espressif AFE
  • Noise reduction, echo cancellation, and beamforming
  • Cloud-based conversation via Xiaozhi MCP
  • Wi-Fi + Bluetooth connectivity
  • Battery-powered or USB-powered operation
  • Visual feedback using RGB LEDs

(Image: custom ESP32 AI voice assistant board)

How the ESP32 AI Voice Assistant Works

The system uses a hybrid edge-plus-cloud architecture.

On the ESP32-S3

  • Wake-word detection (WakeNet)
  • Audio capture via MEMS microphones
  • Noise suppression and echo cancellation (AFE)
  • Real-time audio streaming via WebSockets
  • Local GPIO and peripheral control

In the Cloud

  • Speech-to-Text (STT)
  • Large Language Model (LLM) reasoning
  • Text-to-Speech (TTS)
  • Tool execution using Model Context Protocol (MCP)

This split allows a low-cost microcontroller to deliver conversational AI performance similar to commercial smart speakers.


Understanding Xiaozhi and MCP

Xiaozhi is an open-source AI chatbot framework designed specifically for ESP32 devices. Instead of embedding heavy AI models locally, it connects ESP32 hardware to cloud-based LLMs using a standard protocol.

What Is MCP (Model Context Protocol)?

Think of MCP as a universal language between AI and hardware.

It allows the AI to:

  • Discover connected hardware
  • Understand what each device can do
  • Execute actions (GPIO, relays, LEDs, motors)
  • Receive real-time feedback

This means you can say:

“Turn on the green LED”

…and the AI automatically maps that intent to a GPIO action on the ESP32 - without custom voice-parsing logic.


Hardware Design Highlights

Core Components

  • ESP32-S3-WROOM-1-N16R8
  • 2× ICS-43434 MEMS microphones
  • MAX98357A I²S audio amplifier
  • BQ24250 battery charging IC
  • MAX20402 DC-DC converter
  • WS2812B RGB LEDs
  • USB-C power and programming

Cloud Setup with Xiaozhi

Instead of self-hosting, this project uses the official Xiaozhi cloud, which simplifies deployment.

What You Get

  • No server maintenance
  • Multiple LLM backends (Qwen, DeepSeek)
  • Voice selection and personality tuning
  • Device analytics and logs
  • MCP tool management

Once registered, the device instantly becomes conversational.



Voice-Controlled Hardware Using MCP

To demonstrate MCP, this project includes a simple traffic-light LED system.

Available Voice Commands

  • “Turn on the red LED”
  • “Switch off all lights”
  • “What lights are on?”

How It Works

  1. Voice → STT
  2. LLM selects MCP tool
  3. ESP32 executes GPIO action
  4. AI confirms action verbally

This same pattern scales to relays, sensors, motors, or displays.


3D-Printed Enclosure

The enclosure was designed for:

  • Good acoustic isolation
  • Clear microphone paths
  • LED visibility
  • Passive cooling
  • Easy assembly (snap-fit)

It turns the PCB into a finished, desktop-ready product rather than a bare prototype.


Real-World Applications

This ESP32 AI voice assistant can act as:

  • Smart home controller
  • Personal information assistant
  • Learning platform for embedded AI
  • Accessibility tool for hands-free control
  • Experimental AI hardware sandbox

Key Takeaways

  • Embedded AI is now accessible to makers
  • ESP32-S3 is powerful enough for real-time voice interaction
  • MCP removes the complexity of voice-to-hardware control
  • Open-source frameworks accelerate innovation
  • You don’t need big tech infrastructure to build smart devices

This ESP32 AI Voice Assistant shows how far embedded systems have come. By combining efficient hardware, smart protocols, and modern AI models, it is possible to build devices that listen, understand, and respond intelligently - all on hardware you can fully control.

If you are interested in embedded AI, smart devices, or just building something genuinely impressive from scratch, this project is a solid place to start.
