Originally published at guatulabs.dev

Vibration Monitoring Architecture: From Sensor to Dashboard

The first time I tried to stream raw vibration data to a dashboard, I managed to crash my MQTT broker in under ten minutes. I had a high-frequency accelerometer spitting out samples at 5kHz, and I thought I'd just wrap those values in JSON and send them over the wire. The result wasn't a pretty graph; it was a series of Connection refused errors and a broker that had completely locked up under the weight of thousands of tiny packets per second.

If you're building a vibration monitoring system, you're not just dealing with "IoT data." You're dealing with signal processing. There is a massive difference between reporting a temperature every 30 seconds and capturing the harmonic frequencies of a motor bearing. If you treat vibration data like any other telemetry, your network will choke, your database will bloat, and your dashboards will be useless.

What I tried first (The wrong way)

My initial assumption was that the "modern stack" (Sensor → MQTT → Time Series DB → Grafana) would handle everything. I used a cheap industrial sensor that output its signal over a 4-20mA current loop, fed into a PLC, which then pushed data to a Python script on a Raspberry Pi.

I wrote a simple loop that read the sensor and published to a topic:

# DO NOT DO THIS: one MQTT publish per raw sample
while True:
    val = sensor.read()
    client.publish("factory/machine1/vibration", json.dumps({"value": val}))

I quickly hit three walls:

  1. Network Saturation: Sending one MQTT packet per sample is an architectural sin. The overhead of the TCP/IP stack and MQTT headers is larger than the actual payload. I was spending 90% of my bandwidth on headers (rough numbers after this list).
  2. Database Explosion: InfluxDB is great, but inserting 5,000 points per second per sensor is a recipe for a disk space crisis. My cardinality exploded, and queries that should have taken milliseconds started taking 30 seconds.
  3. The "Noise" Problem: The raw data was a jagged mess. I couldn't see the actual vibration patterns because the high-frequency electrical noise from the nearby VFDs (Variable Frequency Drives) was masking the mechanical signal.
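
To put rough numbers on that first point (back-of-the-envelope, assuming a payload like {"value": 0.12345}): the JSON body is roughly 20 bytes, while the topic string, the MQTT PUBLISH header, the TCP/IP headers, and the Ethernet framing add something in the neighborhood of 80-90 bytes per message. At 5kHz, that works out to about half a megabyte per second per sensor, most of it framing rather than data, before a single acknowledgement is counted.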

I realized that the gap between the sensor and the dashboard isn't a straight line. It's a funnel. You have to aggressively reduce the data volume at the edge before it ever touches the network.

The Actual Solution: The Edge-Heavy Pipeline

To make this work, I shifted the intelligence to the edge. The goal is to move from "streaming raw samples" to "streaming features." Instead of sending every single point, I calculate the RMS (Root Mean Square), Peak-to-Peak, and FFT (Fast Fourier Transform) bins locally.

1. Signal Conditioning and Edge Processing

I moved the processing to a dedicated edge gateway. I used a Python-based service that buffers samples in memory, applies a digital filter to remove electrical noise, and calculates the metrics.

Here is the implementation of the signal conditioning and feature extraction:

import json
import numpy as np
from scipy.signal import butter, filtfilt
import paho.mqtt.client as mqtt

# Configuration for a 10kHz sampling rate
FS = 10000 
CUTOFF = 2000 # Remove noise above 2kHz
ORDER = 4

def butter_lowpass_filter(data, cutoff, fs, order=5):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    return filtfilt(b, a, data)

def calculate_features(buffer):
    # Filter the raw signal to remove high-frequency noise
    filtered = butter_lowpass_filter(buffer, CUTOFF, FS, ORDER)

    # Calculate RMS - the primary indicator of overall vibration level
    rms = np.sqrt(np.mean(filtered**2))

    # Calculate Peak-to-Peak
    ptp = np.ptp(filtered)

    # Perform FFT to find the dominant frequency
    fft_vals = np.abs(np.fft.rfft(filtered))
    freqs = np.fft.rfftfreq(len(filtered), 1/FS)
    dominant_freq = freqs[np.argmax(fft_vals)]

    return {
        "rms": float(rms),
        "ptp": float(ptp),
        "dom_freq": float(dominant_freq)
    }

# Main loop: Buffer 1000 samples, then send 1 summary packet
client = mqtt.Client()
client.connect("mqtt-broker.example.com", 1883)
client.loop_start()  # run the network loop in a background thread

buffer = []
while True:
    val = read_sensor_raw() # Mock function for ADC read
    buffer.append(val)

    if len(buffer) >= 1000:
        features = calculate_features(buffer)
        # Send one JSON summary instead of 1000 raw points
        client.publish("iiot/machine1/vibration/features", json.dumps(features))
        buffer = [] # Clear buffer

2. The Transport Layer (MQTT 5.0)

For the broker, I shifted from a basic Mosquitto setup to a more controlled configuration. Since vibration data is critical for predictive maintenance, I needed to ensure that the "heartbeat" of the machine was always known.

I used MQTT 5.0 "Will Messages" to detect if a gateway went offline. If the gateway crashes, the broker immediately publishes a "disconnected" status to the health topic, so the dashboard doesn't just show a flat line (which could be mistaken for a stopped machine).
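
On the gateway side, registering the will is only a few lines in paho-mqtt. Here's a minimal sketch (assuming paho-mqtt 1.x, as in the feature-extraction script above; the health topic name is just an illustration):

import paho.mqtt.client as mqtt

# The will must be registered before connect(); the broker publishes it on our behalf
# if the connection drops without a clean DISCONNECT.
client = mqtt.Client(client_id="gateway-machine1", protocol=mqtt.MQTTv5)
client.will_set(
    "iiot/machine1/health/status",
    payload='{"status": "disconnected"}',
    qos=1,
    retain=True,  # dashboards see the last known state even after they reconnect
)
client.connect("mqtt-broker.example.com", 1883)

# On a clean startup, overwrite the retained status so the dashboard flips back to online
client.publish("iiot/machine1/health/status", '{"status": "online"}', qos=1, retain=True)

On the broker side, the relevant Mosquitto settings: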

# mosquitto.conf snippet
listener 1883
allow_anonymous false
password_file /etc/mosquitto/passwd
# Prevent the broker from being overwhelmed by slow consumers
max_queued_messages 1000

I've written more about choosing the right broker in my MQTT Broker Selection post, but for vibration, the priority is low latency and high reliability over massive scale.

3. Storage and Visualization

I used InfluxDB 2.x for storage because of its native handling of time-series data. Instead of storing the raw waveform, I store the calculated features. This reduces the storage requirement by 1000x.
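
The piece between the broker and the database is a small ingest service that subscribes to the feature topic and writes each summary as a single point. Here's a minimal sketch using the official influxdb-client package (URL, token, and org are placeholders; the measurement and field names match the Flux query below):

import json
import paho.mqtt.client as mqtt
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder connection details -- swap in your own instance and credentials
influx = InfluxDBClient(url="http://influxdb:8086", token="REPLACE_ME", org="factory")
write_api = influx.write_api(write_options=SYNCHRONOUS)

def on_message(client, userdata, msg):
    features = json.loads(msg.payload)  # {"rms": ..., "ptp": ..., "dom_freq": ...}
    point = (
        Point("vibration_sensor")
        .tag("machine", msg.topic.split("/")[1])  # e.g. "machine1" from iiot/machine1/...
        .field("rms", features["rms"])
        .field("ptp", features["ptp"])
        .field("dom_freq", features["dom_freq"])
    )
    write_api.write(bucket="iiot_data", record=point)

sub = mqtt.Client()
sub.on_message = on_message
sub.connect("mqtt-broker.example.com", 1883)
sub.subscribe("iiot/+/vibration/features", qos=1)
sub.loop_forever()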

In Grafana, I set up a dashboard that monitors the RMS value. However, looking at a raw line graph of vibration is usually useless for operators. They don't know if 0.5g is "bad" or "normal."

I integrated this with a health scoring system. I used a Flux query in InfluxDB to compare the current RMS against a reference level. The version below normalizes against a fixed threshold; a variant that uses the average of the last 7 days as a rolling baseline is sketched a little further down.

// InfluxDB Flux Query for Relative Vibration
from(bucket: "iiot_data")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "vibration_sensor")
  |> filter(fn: (r) => r["_field"] == "rms")
  |> aggregateWindow(every: 1m, fn: mean)
  |> map(fn: (r) => ({ r with value: r._value / 0.15 })) // Normalize against threshold 0.15g

This feeds directly into the concept of Equipment Health Scoring, where the goal is to give the operator a single "Health %" rather than a complex spectrum analysis.
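
For the 7-day baseline, a variant of the query joins the recent readings against a long-range average. Here's a sketch using the classic Flux join() (same bucket and measurement as above; treat it as a starting point rather than production-ready Flux):

// Recent 1-minute means divided by the 7-day average -> unitless "times normal" value
baseline = from(bucket: "iiot_data")
  |> range(start: -7d)
  |> filter(fn: (r) => r["_measurement"] == "vibration_sensor" and r["_field"] == "rms")
  |> mean()

current = from(bucket: "iiot_data")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "vibration_sensor" and r["_field"] == "rms")
  |> aggregateWindow(every: 1m, fn: mean)

join(tables: {cur: current, base: baseline}, on: ["_measurement"])
  |> map(fn: (r) => ({ _time: r._time, _value: r._value_cur / r._value_base }))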

Why this architecture works

The reason this works is that it respects the laws of physics and networking.

The Nyquist-Shannon theorem tells us we need to sample at more than twice the highest frequency we want to capture. If you want to detect a bearing fault at 2kHz, you must sample at 4kHz+. Trying to ship that raw stream over WiFi or Ethernet as standard JSON-over-MQTT is impractical because the packet overhead kills the throughput.

By calculating the RMS and FFT at the edge, we are performing Data Reduction. We transform a high-bandwidth raw signal into a low-bandwidth set of descriptors: RMS and peak-to-peak in the time domain, dominant frequency and FFT bins in the frequency domain.

The edge processing also cleans the signal before it costs anything to transmit. The Butterworth low-pass filter strips out the high-frequency switching spikes from the VFDs; the 60Hz hum from the power lines sits below the 2kHz cutoff, so it needs a separate notch or high-pass stage if it turns out to be a problem. If you do this in the cloud, you've already wasted the bandwidth sending noise.

Lessons learned and caveats

If I had to build this again, I'd change a few things:

1. Hardware-level filtering: I spent too much time in Python trying to fix signal noise. In a real industrial environment, you should use an analog anti-aliasing filter (a physical capacitor/resistor circuit) before the signal ever hits the ADC. Software filters are great, but they can't fix aliasing if the signal was already corrupted during sampling.

2. The "Buffer" Trap: My Python script used a simple list for the buffer. At very high sampling rates, appending individual samples to a Python list adds per-sample overhead and allocation churn. I had to switch to numpy arrays with pre-allocated memory to avoid the garbage collection pauses that caused gaps in the data (a minimal sketch of that approach is at the end of this list).

3. Provisioning the Edge: Managing these Python scripts across five different gateways was a nightmare. I eventually moved the deployment to a GitOps flow, using OpenTofu and GitHub Actions to manage the underlying VM configurations on my Proxmox cluster, ensuring every gateway had the exact same version of scipy and numpy.

4. The Dashboard Paradox: The more data I put on the dashboard, the less the operators used it. The final version of the system only shows three things: a Green/Yellow/Red light for health, the current RMS value, and a "Time to Maintenance" estimate. Everything else (the FFT bins, the raw waveforms) is hidden in a "Deep Dive" tab that only the reliability engineer ever opens.
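
For reference on point 2, here's a minimal sketch of the pre-allocated buffer approach (window size assumed to match the 1000-sample window from the gateway script):

import numpy as np

# Pre-allocate the window once; write samples in place instead of appending to a list
WINDOW = 1000
buf = np.empty(WINDOW, dtype=np.float64)
idx = 0

def push_sample(val):
    """Store one sample; return a copy of the full window when it wraps around."""
    global idx
    buf[idx] = val
    idx += 1
    if idx == WINDOW:
        idx = 0
        return buf.copy()  # hand a stable copy to calculate_features()
    return None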

Vibration monitoring is a classic example of where "more data" is actually "less information." The value isn't in the sensor; it's in the reduction process that happens between the sensor and the screen.
