Bernard K
Why Your IoT Data Isn't Fit for ML—And How to Fix It

When you’re dealing with IoT deployments, especially in places like Kenya where connectivity issues and budget constraints are common, you quickly learn that IoT data quality can fail in unexpected ways long before it ever reaches your ML model. I've managed over 2,500 IoT devices under these conditions, and it has been quite a journey.

The data collection chaos

Initially, I assumed that gathering data from devices would be simple. The first signs of trouble appeared when we installed a new batch of sensors in a remote area with unreliable internet. Instead of a clean stream of telemetry data, I received an erratic mess. There were nonsensical data spikes, inconsistent timestamps, and sometimes data packets arrived out of order.

I learned that poor connectivity can wreak havoc on data integrity. The issue isn’t just data loss; it’s receiving corrupted or incomplete information. Reliability isn't guaranteed. To address this, implementing simple retry logic with a buffer on the IoT device closed about 75% of our data gaps.

import time
import random

def send_data(data):
    # Simulate an unreliable network: fail roughly half the time.
    # Raising on failure lets the retry loop below actually catch it.
    if random.choice([True, False]):
        print("Data sent successfully.")
    else:
        raise ConnectionError("Failed to send data.")

retry_attempts = 3
for attempt in range(retry_attempts):
    try:
        send_data("sensor_reading")
        break
    except ConnectionError:
        print(f"Attempt {attempt+1} failed. Retrying in 5 seconds...")
        time.sleep(5)

This straightforward approach improved our data quality significantly without incurring additional costs beyond the initial setup.

Real-world spikes and noise

Another challenge was the quality of the raw data. I soon realized that sensors are highly sensitive to real-world conditions. Dust, temperature swings, and even rodents can affect readings. In one instance, temperature sensor readings fluctuated wildly, not due to a system error, but because a gecko had settled on the sensor.

Buffering raw data for a few minutes and calculating a moving average helped smooth out these spikes, reducing noise by about 60%.
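A stripped-down sketch of that smoothing step (the window size and readings here are illustrative, not our production values):

```python
from collections import deque

def smooth(readings, window=5):
    """Smooth a stream of readings with a simple moving average."""
    buffer = deque(maxlen=window)  # oldest value drops out automatically
    smoothed = []
    for value in readings:
        buffer.append(value)
        smoothed.append(sum(buffer) / len(buffer))
    return smoothed

# A single spike (the "gecko" reading) gets dampened instead of passed through
raw = [22.1, 22.3, 22.2, 38.0, 22.4, 22.2]
print(smooth(raw))
```

The deque's `maxlen` keeps the buffer bounded, which matters on memory-constrained devices.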

The firmware factor

Managing devices with various firmware versions felt like dealing with a chaotic family reunion. I discovered that inconsistent firmware led to inconsistent data formats and payloads. Outdated firmware wouldn't support certain data packet headers, leading to data drops.

This taught me the importance of a unified update mechanism. By using an over-the-air (OTA) update strategy, I unified our firmware versions. This single change reduced data failure rates by 30%.
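While an OTA rollout is in progress, you still have to cope with mixed payload formats at ingestion. A minimal sketch of that idea, assuming hypothetical field names and versions (not our actual schema):

```python
CURRENT_FW = "2.1.0"

def normalize(payload: dict) -> dict:
    """Map old-firmware payloads onto the current schema at ingestion."""
    fw = payload.get("fw", "1.0.0")
    # Plain string comparison is fine here because these example
    # version strings compare correctly lexicographically.
    if fw < CURRENT_FW:
        # Hypothetical: old firmware reported temperature under "t"
        if "t" in payload:
            payload["temperature"] = payload.pop("t")
    return payload

print(normalize({"fw": "1.0.0", "t": 23.5}))
```

Keeping this shim at the ingestion layer means the ML pipeline only ever sees one schema, regardless of how far the OTA rollout has progressed.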

Data transmission gotchas

Handling sensor data over MQTT on budget devices is another challenge. These low-cost devices don't handle high message volume well. In one month, I observed load spikes of up to 1 Mbps, which overwhelmed the devices and caused packet loss.

To address this, batching data before transmission made a significant difference. It allowed us to manage traffic better and improved overall network reliability, cutting the transmission failure rate in half.

import json
import random

def batch_data(data_list):
    # Skip empty batches rather than sending a zero-length payload
    if not data_list:
        return
    batched_data = json.dumps(data_list)
    # Simulate sending the batched payload in a single transmission
    print(f"Batched data sent: {batched_data}")

data_buffer = []
for _ in range(10):  # Assume we collect 10 readings before transmitting
    data_buffer.append({"sensor_id": "123", "reading": random.randint(0, 100)})

batch_data(data_buffer)

Pre-ML processing struggles

Even if everything goes as planned up to this point, pre-processing before feeding the data into an ML model presents its own problems. Cleansing data for missing or malformed entries was more complex than I anticipated. It's not just about removing anomalies, but also preserving context that might be useful for ML inferences.

One experience stands out. A rule-based anomaly detection system seemed easy to set up, but my initial attempts increased data prep time to hours. This was clearly inefficient. Switching to a threshold-based, real-time processing model reduced preparation time drastically to less than 10 minutes per day, ensuring timely insights.
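The core of that threshold check is almost embarrassingly simple. A sketch, with bounds made up for illustration rather than taken from our deployment:

```python
def is_valid(reading, low=-10.0, high=60.0):
    """Flag readings outside a plausible physical range as anomalies."""
    return low <= reading <= high

# Filter a stream in real time instead of batch-scanning it later
stream = [21.5, 22.0, 120.0, 21.8, -40.0, 22.1]
clean = [r for r in stream if is_valid(r)]
print(clean)  # the 120.0 and -40.0 outliers are dropped
```

Because each reading is judged independently as it arrives, there is no multi-hour batch pass, which is what brought our prep time down.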

Building resiliency

IoT in emerging markets has a unique set of challenges, but through various lessons, I’ve come to value the small wins. While I can't make unreliable internet connections stable or turn budget devices into high-end systems, I can build around these constraints to make data as reliable as possible before it reaches those ML models.

Next, I plan to explore edge computing to handle some of these issues locally. I'm sure there will be more challenges to face, and I'll update you when I dive into that.
