張旭豐

Posted on Apr 18

Your LED Art Should Dance to the Music, Not Just Blink Randomly

#arduino #led #esp32

Via YouTube: ESP32 project demonstrating an audio-reactive LED matrix

What FFT Actually Does for Your Installation

FFT stands for Fast Fourier Transform. It is an efficient algorithm that takes a block of samples in the time domain (amplitude over time) and converts it into the frequency domain (how much of each frequency is present in that block).

Your microphone captures a waveform. That waveform contains everything mixed together: bass, mids, highs, transients, silence. FFT separates it into bins. Each bin represents a frequency range. The amplitude of each bin tells you how loud that frequency band is right now.

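For music visualization, you do not need to understand the math. You need to understand the output: an array of numbers in which low indices are bass and high indices are treble. Two details matter in practice. Index 0 holds the DC offset (0Hz), not bass, so skip it. And for a real-valued input only the first half of the bins carry unique information; the top of that half corresponds to half the sample rate (the Nyquist frequency).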
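To make "bins" concrete, here is a plain discrete Fourier transform in C++: the slow O(N²) computation that an FFT speeds up. The name `dftMagnitudes` is made up for this example, and a real project should use an FFT library instead. With sample rate fs and block size N, bin k corresponds to k × fs / N Hz.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive DFT of a real-valued block, for illustration only (O(N^2)).
// Returns the magnitude of bins 0..N/2; an FFT computes the same numbers faster.
std::vector<double> dftMagnitudes(const std::vector<double>& samples) {
    const std::size_t n = samples.size();
    assert(n > 0);
    const double pi = std::acos(-1.0);
    std::vector<double> mags(n / 2 + 1);
    for (std::size_t k = 0; k <= n / 2; ++k) {
        double re = 0.0, im = 0.0;
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = 2.0 * pi * k * t / n;
            re += samples[t] * std::cos(angle);
            im -= samples[t] * std::sin(angle);
        }
        mags[k] = std::sqrt(re * re + im * im);  // how much of bin k's frequency is present
    }
    return mags;
}
```

A sine that completes exactly 3 cycles in a 16-sample block gives a magnitude of 8 (N/2) at bin 3 and essentially zero elsewhere. Real audio rarely lines up with a bin exactly, so its energy leaks into neighboring bins, which is why FFT code usually applies a window first.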

Frequency bands example (8 bins for simplicity):
[ bass ] [ low-mid ] [ mid ] [ upper-mid ] [ presence ] [ brilliance ] [ air ] [ noise ]
   0         1          2         3             4            5            6        7

In practice, you do not use all 256 or 512 usable bins that a 512- or 1024-sample FFT produces. You group them into ranges. Bass gets its own region. Mids get another. Highs get a third. Three to eight ranges is usually enough to make an installation feel musically intelligent.
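Grouping means turning a frequency range into a range of bin indices. Bin k sits at about k × sampleRate / fftSize Hz, so dividing a band edge in Hz by the bin width gives the bin where that band starts. A minimal sketch, assuming the magnitudes have already been computed for the full FFT block; `bandAverage`, `Bands`, and the 20/250/4000Hz edges are illustrative choices, not a library API:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Average magnitude of the bins whose center frequency k * sampleRate / fftSize
// lies in [lowHz, highHz). `magnitudes` holds the full FFT output (fftSize
// entries); bin 0 (DC) and bins at or above fftSize / 2 (Nyquist) are skipped.
double bandAverage(const std::vector<double>& magnitudes, double sampleRate,
                   double lowHz, double highHz) {
    const std::size_t fftSize = magnitudes.size();
    assert(fftSize >= 2 && sampleRate > 0.0 && lowHz >= 0.0 && highHz > lowHz);
    const double binHz = sampleRate / static_cast<double>(fftSize);
    const std::size_t first =
        std::max<std::size_t>(1, static_cast<std::size_t>(std::ceil(lowHz / binHz)));
    const std::size_t last =  // one past the final bin in the band
        std::min(fftSize / 2, static_cast<std::size_t>(std::ceil(highHz / binHz)));
    if (first >= last) return 0.0;  // band narrower than one bin at this resolution
    double sum = 0.0;
    for (std::size_t k = first; k < last; ++k) sum += magnitudes[k];
    return sum / static_cast<double>(last - first);
}

struct Bands { double bass, mid, high; };

// Example split (assumed edges): bass 20-250Hz, mids 250-4000Hz, highs 4000Hz-Nyquist.
Bands threeBands(const std::vector<double>& magnitudes, double sampleRate) {
    return { bandAverage(magnitudes, sampleRate, 20.0, 250.0),
             bandAverage(magnitudes, sampleRate, 250.0, 4000.0),
             bandAverage(magnitudes, sampleRate, 4000.0, sampleRate / 2.0) };
}
```

At 44100Hz and 1024 samples (about 43Hz per bin), this puts bins 1-5 in the bass, 6-92 in the mids, and 93-511 in the highs.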

The Hardware You Actually Need

ESP32 or ESP8266

ESP32 is the practical choice. You need the RAM: an FFT needs a buffer of 512 to 1024 samples to get usable frequency resolution. An ESP8266 typically leaves around 50KB of heap free for user code once WiFi is running; the ESP32 has 520KB of SRAM, roughly 320KB of which is data RAM. The difference matters when you are also driving an LED strip or handling WiFi.
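A rough tally shows how fast that budget goes. This sketch assumes single-precision FFT arrays (separate real and imaginary), the I2S settings used in the code below (8 DMA buffers of 1024 32-bit samples), and a 300-LED RGB frame buffer; the helper names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// Rough RAM tally for an audio-reactive pipeline, in bytes.

// FFT working buffers: separate real and imaginary arrays of type Sample.
template <typename Sample>
std::size_t fftBufferBytes(std::size_t fftSize) {
    return 2 * fftSize * sizeof(Sample);
}

// I2S DMA buffers: count x length (in samples) x bytes per sample, mono input.
std::size_t dmaBufferBytes(std::size_t bufCount, std::size_t bufLen, std::size_t bytesPerSample) {
    assert(bufCount > 0 && bufLen > 0 && bytesPerSample > 0);
    return bufCount * bufLen * bytesPerSample;
}

// LED frame buffer: 3 bytes (R, G, B) per LED.
std::size_t ledBufferBytes(std::size_t ledCount) {
    return 3 * ledCount;
}

// 1024-point float FFT, 8 DMA buffers of 1024 32-bit samples, 300 LEDs.
std::size_t exampleTotalBytes() {
    return fftBufferBytes<float>(1024) + dmaBufferBytes(8, 1024, 4) + ledBufferBytes(300);
}
```

That totals 41,860 bytes (about 41KB) before WiFi, the network stack, or any of your own variables. Switching the FFT arrays to double doubles their share to 16KB. On an ESP8266 that is most of the budget; on an ESP32 it is a small slice.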

Microphone Module

The MAX9814 is the standard maker recommendation: automatic gain control, onboard preamp, 2.7V to 5.5V operation. It outputs an analog signal, so you read it with the ESP32's ADC. It works, and it costs around $5.

The INMP441 is better if you want clean audio without the automatic gain compression. It is an I2S microphone: it sends digital samples over I2S, and the ESP32's I2S peripheral moves them into memory via DMA without CPU intervention. The tradeoff: your board needs an I2S peripheral (the classic ESP32 has two), and you need to configure it correctly.

For a first project, start with MAX9814. For a permanent installation where audio quality matters, use INMP441.

Level Shifting (For APA102 / DotStar LEDs)

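APA102 LEDs run at 5V, and ESP32 GPIO outputs are 3.3V. This works most of the time, but intermittent flickering can appear, especially with longer strips. If you see this, add a 74HCT245 buffer between the ESP32 and the LED data and clock lines. Note the HCT: its TTL-level inputs read 3.3V as a logic high, while a plain 74HC245 powered at 5V may not.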

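For WS2812B (NeoPixel), the 3.3V data signal from the ESP32 is borderline; it usually works over short distances. If you have problems, buffer the data line through a 74HCT245. Separately, put a 1000μF capacitor across the power rails at the LED input: it smooths current spikes, but it does not fix signal levels.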

The Code That Actually Works

The naive approach is: read analog from microphone, map to brightness, set LEDs. This produces the blinking-not-dancing problem.

The FFT approach (this version reads an INMP441 over I2S):

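#include <Arduino.h>
#include <driver/i2s.h>   // legacy I2S driver (Arduino-ESP32 core 2.x; deprecated in 3.x)
#include <arduinoFFT.h>   // arduinoFFT library, 2.x API (ArduinoFFT<T>)

// Assumptions: INMP441 with L/R tied to GND (left channel); example GPIO pins.
static const int FFT_SIZE = 1024;           // samples per FFT block
static const float SAMPLE_RATE = 44100.0f;  // Hz; must match i2s_config below

// I2S configuration for INMP441
static const i2s_port_t I2S_PORT = I2S_NUM_0;
static const i2s_config_t i2s_config = {
    .mode = i2s_mode_t(I2S_MODE_MASTER | I2S_MODE_RX),
    .sample_rate = 44100,
    .bits_per_sample = I2S_BITS_PER_SAMPLE_32BIT,
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,  // try ONLY_RIGHT if you read only zeros
    .communication_format = I2S_COMM_FORMAT_STAND_I2S,
    .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
    .dma_buf_count = 8,
    .dma_buf_len = 1024
};

// INMP441 wiring: SCK -> GPIO 26, WS -> GPIO 25, SD -> GPIO 33 (examples)
static const i2s_pin_config_t pin_config = {
    .mck_io_num = I2S_PIN_NO_CHANGE,   // the INMP441 needs no master clock
    .bck_io_num = 26,
    .ws_io_num = 25,
    .data_out_num = I2S_PIN_NO_CHANGE,
    .data_in_num = 33
};

// float, not double: the ESP32's FPU is single-precision
int32_t samples[FFT_SIZE];  // raw 32-bit I2S words
float vReal[FFT_SIZE];      // FFT input; holds magnitudes after processing
float vImag[FFT_SIZE];      // imaginary part; must be zeroed before each FFT
ArduinoFFT<float> FFT(vReal, vImag, FFT_SIZE, SAMPLE_RATE);

// Band levels derived from the FFT output: 5 bands
float bass = 0, lowMid = 0, mid = 0, highMid = 0, high = 0;

// Average magnitude of bins start..end (inclusive)
float averageRange(int start, int end) {
    float sum = 0;
    for (int i = start; i <= end; i++) sum += vReal[i];
    return sum / (end - start + 1);
}

// Placeholder output: prints the bands for the Serial Plotter.
// Replace with your LED code.
void visualize() {
    Serial.printf("%.0f %.0f %.0f %.0f %.0f\n", bass, lowMid, mid, highMid, high);
}

void setup() {
    Serial.begin(115200);
    i2s_driver_install(I2S_PORT, &i2s_config, 0, NULL);
    i2s_set_pin(I2S_PORT, &pin_config);
}

void loop() {
    // Read one block of audio: FFT_SIZE samples, 4 bytes each
    size_t bytes_read = 0;
    i2s_read(I2S_PORT, samples, sizeof(samples), &bytes_read, portMAX_DELAY);

    // INMP441 data is 24-bit, left-justified in each 32-bit word
    float mean = 0;
    for (int i = 0; i < FFT_SIZE; i++) {
        vReal[i] = samples[i] >> 8;
        mean += vReal[i];
    }
    mean /= FFT_SIZE;
    for (int i = 0; i < FFT_SIZE; i++) {
        vReal[i] -= mean;   // remove DC offset so it does not leak into the bass bins
        vImag[i] = 0;
    }

    // Run FFT: window, transform, convert complex bins to magnitudes
    FFT.windowing(FFTWindow::Hamming, FFTDirection::Forward);
    FFT.compute(FFTDirection::Forward);
    FFT.complexToMagnitude();

    // Map FFT bins to frequency bands (roughly logarithmic: each band about
    // twice as wide as the last). Each bin is 44100 / 1024 = ~43 Hz wide;
    // bin 0 is DC, so skip it.
    bass = averageRange(1, 3);                  // ~43-130 Hz
    lowMid = averageRange(4, 7);                // ~170-300 Hz
    mid = averageRange(8, 15);                  // ~345-650 Hz
    highMid = averageRange(16, 31);             // ~690-1340 Hz
    high = averageRange(32, FFT_SIZE / 2 - 1);  // ~1.4-22 kHz

    // Use these values to control LED behaviors
    // Bass controls large slow movements (entire strip brightness)
    // High-mid controls fast accents (individual LED flickers)

    visualize();
}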

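The key insight: map FFT bins to frequency ranges logarithmically. FFT bins are evenly spaced in Hz, but human pitch perception is logarithmic: every doubling of frequency is one octave. 100Hz to 200Hz is an octave that holds much of a song's bass and kick drum. 10000Hz to 20000Hz is also one octave, but it carries only overtones and "air", and many adults hear little of its upper half. Split the bins into equal-sized groups and most of your groups land in the highs, where they matter least musically.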
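Instead of hand-picking bin ranges, you can generate logarithmically spaced band edges: choose a lowest and highest frequency, divide the span into bands of equal frequency ratio, and convert each edge to a bin index. A minimal sketch; the function name and the rounding choices (nearest bin, at least one bin per band, never bin 0) are assumptions:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Band edges evenly spaced on a log scale between minHz and maxHz, so every
// band spans the same frequency ratio (the same musical interval). Each edge
// is converted to the nearest FFT bin (bin k sits at k * sampleRate / fftSize Hz).
// Band i uses bins [edges[i], edges[i + 1]).
std::vector<std::size_t> logBandEdges(double minHz, double maxHz, std::size_t bands,
                                      double sampleRate, std::size_t fftSize) {
    assert(minHz > 0.0 && maxHz > minHz && bands > 0 && fftSize >= 2);
    const double binHz = sampleRate / static_cast<double>(fftSize);
    const double ratio = std::pow(maxHz / minHz, 1.0 / static_cast<double>(bands));
    std::vector<std::size_t> edges;
    for (std::size_t i = 0; i <= bands; ++i) {
        const double edgeHz = minHz * std::pow(ratio, static_cast<double>(i));
        std::size_t bin = static_cast<std::size_t>(std::lround(edgeHz / binHz));
        if (bin < 1) bin = 1;                      // never use the DC bin
        if (!edges.empty() && bin <= edges.back()) {
            bin = edges.back() + 1;                // keep each band at least one bin wide
        }
        if (bin > fftSize / 2) bin = fftSize / 2;  // stop at Nyquist; top bands may end up empty
        edges.push_back(bin);
    }
    return edges;
}
```

For 8 bands from 40Hz to 20kHz at 44100Hz / 1024 samples, the lowest band ends up a single bin wide while the top band spans a couple of hundred bins: equal musical intervals, very unequal bin counts.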

Why Your Amplitude-Only Approach Feels Wrong

When you use raw amplitude, the LED responds to everything equally. A bass drum and a hi-hat at the same volume contribute equally. Your installation cannot tell the difference between a kick drum hit and someone dropping a fork.

With FFT-separated bands, each part of the spectrum can drive its own behavior. The bass drives slow, powerful movements: full-strip pulses, deep color shifts, slow fades. The high frequencies drive fast, small movements: sparkle effects, single-LED flickers, rapid color changes.

The music starts to have visual weight. A kick drum hits like a sledgehammer: the entire installation responds with a deep red pulse. A hi-hat splashes like glitter: small points of white light dance across the strip.

This is what separates music visualization from loudness tracking.
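One common way to get "slow bass, fast highs" is an envelope follower per band with separate attack and release rates: the level rises quickly when energy arrives and falls back at a band-specific speed. A minimal sketch; `EnvelopeFollower` and the coefficients are illustrative assumptions to tune by eye, with one update per FFT block:

```cpp
#include <cassert>

// Per-band envelope follower with separate attack and release rates.
// Each rate is the fraction of the remaining gap closed per frame, in (0, 1]:
// an attack near 1 jumps up at once, a small release fades out slowly.
struct EnvelopeFollower {
    float attack;
    float release;
    float value = 0.0f;

    float update(float input) {
        assert(attack > 0.0f && attack <= 1.0f);
        assert(release > 0.0f && release <= 1.0f);
        const float rate = (input > value) ? attack : release;
        value += (input - value) * rate;
        return value;
    }
};

// Illustrative settings: bass swells and decays slowly, highs flash and vanish.
EnvelopeFollower bassEnvelope{0.6f, 0.05f};
EnvelopeFollower highEnvelope{1.0f, 0.4f};
```

Feed each follower its band value once per FFT block and drive the LEDs from `value` rather than from the raw band. With one update per 1024-sample block (about 23ms), a 0.05 release still leaves roughly a third of the level after half a second, long enough to read as a heavy bass pulse, while a 0.4 release drops a hi-hat sparkle below 1% within ten frames.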

The Problem Nobody Talks About: Transient Response

FFT tells you about steady-state frequency content. It tells you what frequencies are present. It does not tell you when a sound starts.

A drum hit has a fast attack: the sound goes from silence to full volume in milliseconds. FFT smooths this because it works on a block of samples. With a 1024-sample buffer at 44100Hz, you have a 23ms window. The attack of a drum hit gets averaged across that entire window. The visual response is delayed and smudged.
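The 23ms figure is one side of a fixed trade-off: for an N-sample block at sample rate fs, the window lasts N / fs seconds and each bin is fs / N Hz wide, so improving one worsens the other. A small helper (hypothetical names) that computes both:

```cpp
#include <cassert>
#include <cstddef>

// Time and frequency resolution of one FFT block.
struct FftResolution {
    double windowMs;  // how long one block of samples lasts
    double binHz;     // width of each frequency bin
};

FftResolution resolution(double sampleRate, std::size_t fftSize) {
    assert(sampleRate > 0.0 && fftSize > 0);
    const double n = static_cast<double>(fftSize);
    return { 1000.0 * n / sampleRate, sampleRate / n };
}
```

At 44100Hz, 256 samples gives a window of about 5.8ms but bins about 172Hz wide; 1024 gives about 23.2ms and 43Hz; 2048 gives about 46.4ms and 21.5Hz. Percussion attacks live on the short end of that scale, which is why transients need separate handling.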

For percussion, you need transient detection. One approach: compare the current amplitude to a short-term average. When current exceeds average by a threshold, trigger a transient response. The transient response can be immediate and sharp; the FFT-driven response can be slower and more sustained.

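float shortTermAvg = 0;          // running average of recent amplitudes
float transientThreshold = 1.8;  // how many times the average counts as a hit

void triggerFlash();             // hypothetical: your fast LED response

void detectTransients(float currentAmplitude) {
    // Compare against the average of previous frames, so the new sample
    // cannot dilute its own spike
    if (currentAmplitude > shortTermAvg * transientThreshold) {
        // Transient detected: trigger fast response
        triggerFlash();
    }

    // Update the running average: 70% history, 30% new sample
    shortTermAvg = shortTermAvg * 0.7 + currentAmplitude * 0.3;
}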

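The 0.7/0.3 weights and the 1.8 threshold are starting points; adjust them until the system feels as responsive as you want. A higher threshold means fewer false triggers from ambient noise. Shifting weight toward the new sample (say 0.5/0.5) makes the average adapt faster to changes in overall loudness; shifting it toward the history (say 0.9/0.1) gives a steadier baseline for individual hits to stand out against.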
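To put numbers on the 0.7/0.3 ratio: the update avg = avg * (1 - w) + sample * w is an exponential moving average. If it runs once per 1024-sample block (about 23.2ms at 44100Hz, assuming the loop keeps up), its time constant, the time to close about 63% of a sudden step, is -frameMs / ln(1 - w). A minimal sketch; the function name is hypothetical:

```cpp
#include <cassert>
#include <cmath>

// Time constant (ms) of avg = avg * (1 - newWeight) + sample * newWeight,
// updated once every frameMs. After n frames the remaining gap is
// (1 - newWeight)^n = exp(-n * frameMs / tau), which gives tau below.
double averagingTimeConstantMs(double newWeight, double frameMs) {
    assert(newWeight > 0.0 && newWeight < 1.0 && frameMs > 0.0);
    return -frameMs / std::log(1.0 - newWeight);
}
```

With w = 0.3 that is about 65ms; w = 0.5 gives about 33ms, and w = 0.1 stretches it to about 220ms.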

Installation-Specific Considerations

Power Budget

A 5-meter WS2812B strip with 30 LEDs/m (150 LEDs at roughly 60mA each) draws about 9 amperes at full white; at 60 LEDs/m it draws about 18A. That is not just a number on a label: it is the current your power supply must deliver, continuously, without voltage drop. Voltage drop makes the far end of the strip dim and warm-colored.

For a permanent installation, calculate your power draw, multiply by 1.3 for headroom, and buy a power supply that delivers that much. Use 14 AWG wire for the power bus (heavier for long runs or higher currents). Do not daisy-chain power through the LED strip; run separate power wires to each segment.
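The per-LED current and the 1.3 multiplier combine into a quick estimate. A minimal sketch, assuming roughly 60mA per WS2812B at full white (a common rule of thumb; check your strip's datasheet); the function name is hypothetical:

```cpp
#include <cassert>

// Supply current needed for a strip at full white, including headroom.
// mAPerLed = 60 is an assumed full-white figure (about 20mA per color channel).
double supplyAmpsNeeded(int ledsPerMeter, double meters,
                        double mAPerLed = 60.0, double headroom = 1.3) {
    assert(ledsPerMeter > 0 && meters > 0.0 && mAPerLed > 0.0 && headroom >= 1.0);
    const double leds = ledsPerMeter * meters;
    const double amps = leds * mAPerLed / 1000.0;
    return amps * headroom;
}
```

`supplyAmpsNeeded(30, 5.0)` gives about 11.7A (150 LEDs × 60mA = 9A, plus 30%); at 60 LEDs/m the same 5 meters needs about 23.4A. Multiply by 5V for watts: roughly 59W and 117W.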

Latency Perception

There is a delay between sound hitting the microphone and LEDs responding. With FFT processing on ESP32, you are looking at 20-50ms of latency. This is below the threshold of conscious perception for most people, but if you are syncing to live musicians, they will notice.

Test: have someone clap while watching the LEDs. If the flash is noticeably after the clap, you need to reduce your buffer size or move the microphone closer to the sound source.
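To see where the 20-50ms goes, and why microphone distance matters, here is a rough budget. A minimal sketch with assumed figures: sound at about 343m/s, about 30µs per WS2812B LED (24 bits at 800kHz), and a processing time you should measure on your own board; `LatencyInputs` and `latencyMs` are hypothetical names:

```cpp
#include <cassert>

// Rough sound-to-light latency budget in milliseconds (assumed figures):
//  - sound travels about 343 m/s in air
//  - a full FFT block must be captured before it can be analyzed
//  - WS2812B data takes about 30 microseconds per LED (24 bits at 800 kHz)
struct LatencyInputs {
    double micDistanceM;  // speaker-to-microphone distance in meters
    double sampleRate;    // Hz
    int fftSize;          // samples per FFT block
    double processingMs;  // FFT + band mapping time, measured on your board
    int ledCount;         // LEDs on one data line
};

double latencyMs(const LatencyInputs& in) {
    assert(in.sampleRate > 0.0 && in.fftSize > 0 && in.ledCount >= 0);
    const double acousticMs = in.micDistanceM / 343.0 * 1000.0;
    const double captureMs = 1000.0 * in.fftSize / in.sampleRate;
    const double ledWriteMs = in.ledCount * 0.030;
    return acousticMs + captureMs + in.processingMs + ledWriteMs;
}
```

For a microphone 3.43m from the speaker, 44100Hz, 1024 samples, a 5ms processing placeholder, and 300 LEDs, this comes to about 47ms: 10ms of sound travel, 23ms of capture, 5ms of processing, 9ms of LED data. Halving the buffer saves about 12ms; moving the microphone to 1m saves about 7ms. The estimate ignores samples queued in the I2S DMA buffers: if your loop falls behind, 8 × 1024 samples is about 186ms of backlog at 44100Hz.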

Ambient Noise Rejection

Installations in public spaces fail because they respond to crowd noise, HVAC systems, and traffic. You need gating: a noise floor below which the system ignores input.

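float noiseFloor = 100; // calibrate this per environment

// Hypothetical helpers, defined elsewhere in your sketch
float getBassLevel();                  // e.g. the bass band from the FFT code
void idleAnimation();                  // animation for quiet periods
void musicVisualization(float level);  // music-driven effects

void loop() {
    float bassLevel = getBassLevel();

    if (bassLevel < noiseFloor) {
        // below threshold: return to idle animation
        idleAnimation();
    } else {
        // above threshold: react to music
        musicVisualization(bassLevel);
    }
}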

Set the noise floor by measuring the same band level you gate on (here, the bass level) with no music playing and nobody around. Set it to about 1.5x that value.
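That calibration is easy to automate: record band levels for a few seconds while the room is quiet (for example, by logging `getBassLevel()` after boot) and scale their average by the 1.5 margin. A minimal sketch; the function names are hypothetical, and using the mean of the quiet readings is an assumption (the maximum is a more conservative choice):

```cpp
#include <cassert>
#include <vector>

// Noise floor = margin x average level measured while the room is quiet.
float calibrateNoiseFloor(const std::vector<float>& quietLevels, float margin = 1.5f) {
    assert(!quietLevels.empty() && margin >= 1.0f);
    float sum = 0.0f;
    for (float level : quietLevels) sum += level;
    return margin * sum / static_cast<float>(quietLevels.size());
}

// True when a level is loud enough to drive the music visualization.
bool aboveNoiseFloor(float level, float noiseFloor) {
    return level >= noiseFloor;
}
```

Recalibrating whenever the installation powers up lets the same code adapt to a quiet gallery and a noisy lobby.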

Where to Put the Microphone

For live music: microphone close to the speaker outputting the music, not close to the audience. You want the processed sound, not the acoustic blend of the room.

For ambient installations: microphone should be in a position where it captures the installation's own sound output (if any) and the audience's acoustic response to it. Sometimes this is behind the installation; sometimes it is above.

For gallery settings: a directional microphone pointed at the primary viewing area catches viewer conversations and reactions, making the installation respond to the emotional state of the room.

FAQ

Q: Can I do FFT on ESP8266?

A: Technically yes, with reduced buffer sizes (256 samples instead of 1024), but you will get poor frequency resolution, and because the ESP8266 has no hardware floating-point unit, most of your CPU time will go to FFT calculations. ESP8266 FFT is a frustrating experience. Use ESP32.

Q: How do I calibrate the frequency band thresholds?

A: Play music you know well and watch the band values in the serial plotter. Identify where your kick drum appears (usually bins 1-4 at a 44.1kHz sample rate with a 1024-sample buffer, roughly 43-170Hz; bin 0 is the DC offset). Then identify where the snare appears: its energy is spread over many more bins, from its body at a few hundred Hz up into the kilohertz range. Adjust your band boundaries to match the music you expect to play. There is no universal correct answer: it depends on your sample rate, buffer size, and musical genre.

Q: My LEDs flicker even with a 74HCT245. Why?

A: The most common cause is shared power supply noise. If your ESP32 and LEDs share a power supply and the LEDs draw a sudden current spike (which WS2812B does on every color change), the ESP32 sees a voltage dip and reboots or glitches. Solution: separate power supplies for ESP32 and LEDs, with only ground connected. Or add a 1000μF capacitor directly across the LED power and ground pins at the start of the strip.

The Gap Between Watching and Listening

The best sound-reactive installations do not just make you notice the music differently. They make you aware of parts of the music you were not paying attention to.

A good bass visualization does not show you the bass. It makes you feel the room responding to the bass. When the hi-hat channel lights up a specific pattern, you hear the hi-hat differently: you become aware of how it sits in the mix.

That is the goal: not decoration, but perceptual shift. The installation changes how you experience the music.

Product recommendations for building your first sound-reactive installation:

ESP32 DevKit V1: Dual-core, 520KB SRAM, WiFi. Enough headroom to run FFT and LED control side by side. (Amazon)

MAX9814 Microphone Module: Automatic gain control, 2.7V-5.5V operation, built-in preamp. Easiest microphone for getting started with audio-reactive projects. (Amazon)

INMP441 I2S Microphone: Digital output, no gain compression, cleaner signal for permanent installations. Requires I2S configuration but sounds significantly better. (Amazon)

APA102 DotStar LED Strip (5m): 5V, 60 LEDs/m, clocked SPI-style interface (no strict timing requirements, unlike WS2812B). Easier to drive reliably on ESP32. (Amazon)

I earn from qualifying purchases.

Article #005, 2026-04-18. Content Farm pipeline, Run #005.

DEV Community