DEV Community

David Thomas

Building a DIY ESP32 AI Voice Assistant with Xiaozhi MCP

Commercial voice assistants like Alexa and Google Assistant are impressive, but they often come with trade-offs: privacy concerns, limited customisation, and cloud lock-in. For makers and engineers, that naturally raises a question:

Can we build our own ESP32 AI Voice Assistant - one that’s open, hackable, and truly ours?

With the ESP32-S3 and the Xiaozhi AI framework, the answer is yes.

In this article, I will walk through the design and implementation of a portable ESP32-S3 AI voice assistant that supports wake-word detection, natural conversation, smart-device control, and battery operation. This project combines embedded systems, real-time audio processing, and cloud-based large language models into a single, open-source device.


Project Overview

This DIY AI voice assistant is built around the ESP32-S3-WROOM-1-N16R8, paired with a dual-microphone array, an I²S audio amplifier, and robust power management for portable use.

Key Capabilities

  • Local wake-word detection using Espressif AFE
  • Noise reduction, echo cancellation, and beamforming
  • Cloud-based conversation via Xiaozhi MCP
  • Wi-Fi + Bluetooth connectivity
  • Battery-powered or USB-powered operation
  • Visual feedback using RGB LEDs

(Image: custom ESP32 AI voice assistant board)

How the ESP32 AI Voice Assistant Works

The system uses a hybrid edge-plus-cloud architecture.

On the ESP32-S3

  • Wake-word detection (WakeNet)
  • Audio capture via MEMS microphones
  • Noise suppression and echo cancellation (AFE)
  • Real-time audio streaming via WebSockets
  • Local GPIO and peripheral control

In the Cloud

  • Speech-to-Text (STT)
  • Large Language Model (LLM) reasoning
  • Text-to-Speech (TTS)
  • Tool execution using Model Context Protocol (MCP)

This split allows a low-cost microcontroller to deliver conversational AI performance similar to commercial smart speakers.


Understanding Xiaozhi and MCP

Xiaozhi is an open-source AI chatbot framework designed specifically for ESP32 devices. Instead of embedding heavy AI models locally, it connects ESP32 hardware to cloud-based LLMs using a standard protocol.

What Is MCP (Model Context Protocol)?

Think of MCP as a universal language between AI and hardware.

It allows the AI to:

  • Discover connected hardware
  • Understand what each device can do
  • Execute actions (GPIO, relays, LEDs, motors)
  • Receive real-time feedback

This means you can say:

“Turn on the green LED”

…and the AI automatically maps that intent to a GPIO action on the ESP32 - without custom voice-parsing logic.


Hardware Design Highlights

Core Components

  • ESP32-S3-WROOM-1-N16R8
  • 2× ICS-43434 MEMS microphones
  • MAX98357A I²S audio amplifier
  • BQ24250 battery charging IC
  • MAX20402 DC-DC converter
  • WS2812B RGB LEDs
  • USB-C power and programming

Cloud Setup with Xiaozhi

Instead of self-hosting, this project uses the official Xiaozhi cloud, which simplifies deployment.

What You Get

  • No server maintenance
  • Multiple LLM backends (Qwen, DeepSeek)
  • Voice selection and personality tuning
  • Device analytics and logs
  • MCP tool management

Once registered, the device instantly becomes conversational.



Voice-Controlled Hardware Using MCP

To demonstrate MCP, this project includes a simple traffic-light LED system.

Available Voice Commands

  • “Turn on the red LED”
  • “Switch off all lights”
  • “What lights are on?”

How It Works

  1. Voice → STT
  2. LLM selects MCP tool
  3. ESP32 executes GPIO action
  4. AI confirms action verbally

This same pattern scales to relays, sensors, motors, or displays.


3D-Printed Enclosure

The enclosure was designed for:

  • Good acoustic isolation
  • Clear microphone paths
  • LED visibility
  • Passive cooling
  • Easy assembly (snap-fit)

It turns the PCB into a finished, desktop-ready product rather than a bare prototype.


Real-World Applications

This ESP32 AI voice assistant can act as:

  • Smart home controller
  • Personal information assistant
  • Learning platform for embedded AI
  • Accessibility tool for hands-free control
  • Experimental AI hardware sandbox

Key Takeaways

  • Embedded AI is now accessible to makers
  • ESP32-S3 is powerful enough for real-time voice interaction
  • MCP removes the complexity of voice-to-hardware control
  • Open-source frameworks accelerate innovation
  • You don’t need big tech infrastructure to build smart devices

This ESP32 AI Voice Assistant shows how far embedded systems have come. By combining efficient hardware, smart protocols, and modern AI models, it is possible to build devices that listen, understand, and respond intelligently - all on hardware you can fully control.

If you are interested in embedded AI, smart devices, or just building something genuinely impressive from scratch, this project is a solid place to start.
