Messin

ESP32 Offline Voice Recognition Using Edge Impulse

Introduction

Voice recognition is becoming an essential part of modern IoT and embedded systems, enabling hands-free control, better accessibility, and a natural way to interact with devices. In this project, you’ll build a fully offline voice-recognition system using an ESP32 development board, a MEMS microphone (the INMP441 module), and the Edge Impulse platform. No internet, no cloud APIs, and no reliance on external services: true on-device speech-command control.

We’ll walk through the full workflow: collecting a dataset and training a model, deploying the model to the ESP32, wiring the hardware, and finally running the system (e.g., saying the wake word, then a command like “on”/“off” to toggle an LED).

What you’ll build

A voice-activated assistant that listens for a wake word (e.g., “marvin”), then listens for a command word (e.g., “on” / “off”) to control an LED.

Entirely on-device inference: the ESP32 runs the model locally, so audio never leaves the board.

Expandable: you can extend it to more commands (lights, relays, IoT devices) or even multiple languages.

Key Features

  • Offline voice recognition: No cloud connectivity needed.
  • Low latency: On-device processing ensures a fast response.
  • Customizable wake-word and commands: Train your own set of words.
  • Privacy-focused: Audio stays on the board; nothing is sent out.
  • Cost-effective hardware: Just ESP32 + MEMS microphone + minimal extras.

What you’ll need

Hardware

  • ESP32 development board (e.g., ESP32-DevKitC)
  • INMP441 MEMS I²S microphone module
  • Two LEDs (one for “indicator” status, one for “control”)
  • Resistors (~220 Ω) for LEDs
  • Breadboard and jumper wires

Workflow: Build & Train the Model

Create a new project in Edge Impulse: Sign up/log in, then create a new project → set the target device (ESP32 or equivalent).

Collect data: Upload audio samples for each of your classes, e.g., “noise”, your wake word (“marvin”), and command words (“on”, “off”). You can use open datasets (like the Google Speech Commands dataset) or record your own for better accuracy.

Design the impulse: In Edge Impulse, define the processing block (feature extraction) and the learning block (model architecture).

Generate features: Visualise your dataset clusters; well-separated clusters indicate a good chance of high accuracy.

Train the model: Use default settings or tune epochs/learning rate. After training, validate your model (aim for > 85% accuracy for prototyping).

Deploy as Arduino library: In Edge Impulse, select “Arduino Library” under Deploy → Build → Download the .zip library.

Install in Arduino IDE: Add the .zip library via Sketch → Include Library → Add .ZIP Library, then load the example sketch (e.g., the esp32_microphone example).

Programming & Code Highlights

The example code sets up I²S for audio capture (sample rate 16 kHz), uses the Edge Impulse “inferencing” library, runs the classifier, and then processes the results.
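
To make the capture step concrete, here is a minimal sketch of the I²S setup for an INMP441 at 16 kHz, assuming the Arduino-ESP32 2.x core and its legacy driver/i2s.h API. The pin numbers are example choices (they match the wiring section below) and the buffer sizes are illustrative; the generated Edge Impulse example already contains equivalent capture code, so treat this as a reference rather than a drop-in replacement.

```cpp
#include <Arduino.h>
#include <driver/i2s.h>

// Assumed INMP441 wiring; match these to your own connections
// (and to the pin definitions in the generated example sketch).
#define I2S_SCK 26   // bit clock (SCK)
#define I2S_WS  25   // word select (WS / LRCL)
#define I2S_SD  33   // serial data (SD)

static const uint32_t SAMPLE_RATE = 16000;  // must match the model's input rate

void i2sInit() {
  i2s_config_t cfg = {
    .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
    .sample_rate = SAMPLE_RATE,
    .bits_per_sample = I2S_BITS_PER_SAMPLE_32BIT,  // INMP441 sends 24-bit data in 32-bit frames
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,   // L/R pin tied to GND = left channel
    .communication_format = I2S_COMM_FORMAT_STAND_I2S,
    .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
    .dma_buf_count = 4,
    .dma_buf_len = 256,
    .use_apll = false
  };
  i2s_pin_config_t pins = {
    .mck_io_num = I2S_PIN_NO_CHANGE,    // no master clock needed
    .bck_io_num = I2S_SCK,
    .ws_io_num = I2S_WS,
    .data_out_num = I2S_PIN_NO_CHANGE,  // capture only, no output
    .data_in_num = I2S_SD
  };
  i2s_driver_install(I2S_NUM_0, &cfg, 0, NULL);
  i2s_set_pin(I2S_NUM_0, &pins);
}

// Fill `out` with `samples` 16-bit PCM samples for the classifier.
void readAudio(int16_t *out, size_t samples) {
  static int32_t raw[256];
  size_t total = 0;
  while (total < samples) {
    size_t want = samples - total;
    if (want > 256) want = 256;
    size_t bytesRead = 0;
    i2s_read(I2S_NUM_0, raw, want * sizeof(int32_t), &bytesRead, portMAX_DELAY);
    size_t got = bytesRead / sizeof(int32_t);
    for (size_t i = 0; i < got; i++) {
      out[total + i] = (int16_t)(raw[i] >> 16);  // keep the top 16 bits
    }
    total += got;
  }
}
```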

A key design element is the dual-threshold system:

COMMAND_CONFIDENCE_THRESHOLD = 0.80 (80 %) for executing commands.

RECOGNITION_CONFIDENCE_THRESHOLD = 0.50 (50 %) for recognition feedback only.

Workflow in main loop: capture audio → inference → handle wake word/commands. The indicator LED gives visual feedback (wake word detected, listening mode, command executed).
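
To make this concrete, here is a sketch of the dual-threshold logic. It assumes this project’s labels (“marvin”, “on”, “off”) and that each inference pass has already produced a top label and confidence (as the Edge Impulse Arduino examples do via run_classifier()); the pin numbers, helper names, and the 10-second window are this article’s choices, not library API.

```cpp
#include <Arduino.h>

#define INDICATOR_LED 23   // status feedback LED
#define CONTROL_LED   22   // LED toggled by "on"/"off"

const float COMMAND_CONFIDENCE_THRESHOLD     = 0.80f;  // act on a command
const float RECOGNITION_CONFIDENCE_THRESHOLD = 0.50f;  // feedback only
const unsigned long LISTEN_WINDOW_MS = 10000;          // command window after wake word

bool listening = false;       // true while the command window is open
unsigned long wakeTime = 0;   // when the wake word was heard

void blinkLed(int pin, int times) {
  for (int i = 0; i < times; i++) {
    digitalWrite(pin, HIGH); delay(100);
    digitalWrite(pin, LOW);  delay(100);
  }
}

// Called once per inference with the top label and its confidence.
void handleResult(const char *label, float confidence) {
  // Close the command window if it has timed out.
  if (listening && millis() - wakeTime > LISTEN_WINDOW_MS) {
    listening = false;
  }

  if (!listening) {
    // Phase 1: waiting for the wake word.
    if (strcmp(label, "marvin") == 0 && confidence >= COMMAND_CONFIDENCE_THRESHOLD) {
      listening = true;
      wakeTime = millis();
      blinkLed(INDICATOR_LED, 2);   // confirm wake word
    }
    return;
  }

  // Phase 2: wake word heard, listening for a command.
  if (confidence >= COMMAND_CONFIDENCE_THRESHOLD) {
    if (strcmp(label, "on") == 0)  { digitalWrite(CONTROL_LED, HIGH); listening = false; }
    if (strcmp(label, "off") == 0) { digitalWrite(CONTROL_LED, LOW);  listening = false; }
  } else if (confidence >= RECOGNITION_CONFIDENCE_THRESHOLD) {
    blinkLed(INDICATOR_LED, 1);     // heard something, not confident enough to act
  }
}
```

Keeping the two thresholds separate means the board can acknowledge that it heard something (a single blink) without acting on a low-confidence guess.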

Hardware Setup & Wiring

Connect the INMP441 to the ESP32. A common mapping, and the one assumed in the sketches in this article, is: VDD → 3.3 V, GND → GND, L/R → GND (left channel), SCK → GPIO 26, WS → GPIO 25, SD → GPIO 33. Whichever pins you use, they must match the I²S pin definitions in your sketch.

Wire the indicator LED to one GPIO (say, GPIO 23) and the control LED to another (say, GPIO 22), each through a ~220 Ω series resistor.

After wiring, upload the modified example sketch to your ESP32 and open Serial Monitor (115200 baud) to watch live inference data.
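
As a reference, a setup() consistent with this wiring might look like the following minimal sketch; INDICATOR_LED, CONTROL_LED, and i2sInit() come from the earlier sketches in this article.

```cpp
void setup() {
  Serial.begin(115200);             // same baud rate as the Serial Monitor
  pinMode(INDICATOR_LED, OUTPUT);   // GPIO 23: status feedback
  pinMode(CONTROL_LED, OUTPUT);     // GPIO 22: toggled by "on"/"off"
  i2sInit();                        // I2S capture setup from the earlier sketch
  Serial.println("Voice control ready. Say 'marvin' then 'on' or 'off'");
}
```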

Testing Your Voice Assistant

Power the ESP32 setup.

On the serial monitor you’ll see something like: “Voice control ready. Say ‘marvin’ then ‘on’ or ‘off’”.

Say your wake word (“marvin”). Indicator LED blinks to confirm.

Within the listening window (e.g., 10 seconds), say “on” → Control LED should turn on (with visual confirmation).

Say “off” → Control LED turns off.

Any other word recognised above the recognition threshold triggers a single blink of the indicator LED as feedback.

Tips for Better Accuracy & Robustness

Add background-noise samples (multiple voices, ambient sounds) to your dataset so the model learns to distinguish commands from noise.

Use different voices (male/female, different accents/tones) for each command to generalise better.

Place the microphone at a consistent distance, and test under both the quiet and noisy conditions your device will face.

Increase your command threshold if false triggers happen; you might also reduce the listening window.
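
In terms of the sketch above, that is a two-constant change, for example:

```cpp
// Example of tighter settings if false triggers occur:
const float COMMAND_CONFIDENCE_THRESHOLD = 0.90f;   // raised from 0.80
const unsigned long LISTEN_WINDOW_MS     = 5000;    // shortened from 10 s
```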

Conclusion

With minimal hardware and a platform like Edge Impulse that compiles models for on-device inference, you can build truly offline voice recognition on the ESP32. Whether you are building a smart home assistant, an accessibility device, or simply experimenting with voice interfaces, this project gives you a strong foundation.

If you require ultra-low power (battery operation), consider adding a hardware wake-word detection IC that activates the ESP32 only when necessary.
