Sreeram Achutuni

Posted on Jan 5

Building a Real-Time Object Detection System for Edge Devices

#python #machinelearning #ai #programming

Published: January 2025 | Reading Time: 10 minutes

Introduction

Deploying deep learning models on resource-constrained edge devices is one of the most challenging problems in modern machine learning. In this article, I'll walk you through how I designed and deployed a real-time object detection system on an ESP32-CAM microcontroller, a device with only 4MB of memory and limited computational power.

This work was published in an IEEE conference and demonstrates practical techniques for model compression and optimization for embedded systems.

Key Results:

✅ Real-time inference at 15 FPS
✅ 85% mAP (Mean Average Precision)
✅ Model size reduced from 50MB to 3.2MB
✅ 60% faster than MobileNet-SSD baseline

The Challenge: Why Edge ML is Hard

Hardware Constraints

The ESP32-CAM has:

4MB Flash Memory (total storage)
520KB SRAM (working memory)
240MHz Dual-Core CPU (no GPU!)
2MP Camera (OV2640 sensor)

For comparison, a typical object detection model like YOLOv3 is 237MB and requires GPU acceleration. We need to be 75x smaller while maintaining accuracy.

Real-World Requirements

Low Latency: <100ms per frame for real-time feel
Accuracy: Must detect objects reliably (80%+ mAP)
Power Efficiency: Battery-powered applications
Deployment: Must fit in 4MB with firmware

Architecture Design

1. Base Model Selection

I started by evaluating lightweight architectures:

Model	Size	Inference Time	mAP
MobileNet-SSD	22MB	250ms	72%
Tiny-YOLO	60MB	400ms	78%
Our Custom CNN	3.2MB	66ms	85%

Why custom architecture?

Pre-trained models are designed for general-purpose detection
We can optimize for specific use cases (e.g., indoor object detection)
More control over model complexity

2. Network Architecture

class LightweightObjectDetector(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        # Depthwise separable convolutions (MobileNet-inspired)
        self.features = nn.Sequential(
            # Input: 160x120x3
            self._depthwise_conv(3, 16, stride=2),    # -> 80x60x16
            self._depthwise_conv(16, 32, stride=2),   # -> 40x30x32
            self._depthwise_conv(32, 64, stride=2),   # -> 20x15x64
            self._depthwise_conv(64, 128, stride=2),  # -> 10x7x128
        )

        # Detection heads
        self.bbox_head = nn.Conv2d(128, 4, kernel_size=1)  # Bounding boxes
        self.class_head = nn.Conv2d(128, num_classes, kernel_size=1)
        self.conf_head = nn.Conv2d(128, 1, kernel_size=1)  # Confidence

    def _depthwise_conv(self, in_ch, out_ch, stride):
        return nn.Sequential(
            # Depthwise
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, 
                     padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU6(inplace=True),
            # Pointwise
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU6(inplace=True)
        )

    def forward(self, x):
        features = self.features(x)
        bboxes = self.bbox_head(features)
        classes = self.class_head(features)
        confidence = self.conf_head(features)
        return bboxes, classes, confidence

Key Design Choices:

Depthwise Separable Convolutions: 8-9x fewer parameters than standard convolutions
ReLU6 Activation: More quantization-friendly than ReLU
Small Input Size: 160x120 instead of 416x416 (YOLOv3)
Single-Scale Detection: Simplified from multi-scale for speed

Model Compression Pipeline

Step 1: Knowledge Distillation

Trained a larger "teacher" model (MobileNet-SSD) and used it to guide our smaller "student" model:

def distillation_loss(student_output, teacher_output, labels, temp=3.0, alpha=0.7):
    """
    Combines hard labels with soft teacher predictions
    """
    # Hard loss (ground truth)
    hard_loss = F.cross_entropy(student_output, labels)

    # Soft loss (teacher knowledge)
    soft_student = F.log_softmax(student_output / temp, dim=1)
    soft_teacher = F.softmax(teacher_output / temp, dim=1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean')

    # Combine losses
    return alpha * soft_loss + (1 - alpha) * hard_loss

Results: +5% mAP improvement over training from scratch

Step 2: Pruning

Removed low-magnitude weights that contribute minimally to accuracy:

def prune_model(model, pruning_ratio=0.3):
    """
    Magnitude-based pruning
    """
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            # Calculate threshold
            weights = module.weight.data.abs()
            threshold = torch.quantile(weights, pruning_ratio)

            # Create mask
            mask = weights > threshold

            # Apply pruning
            module.weight.data *= mask

    return model

# Apply structured pruning
model = prune_model(model, pruning_ratio=0.4)

Results:

40% fewer parameters
2% mAP drop
Model size: 50MB → 30MB

Step 3: Quantization

Converted 32-bit floats to 8-bit integers:

import torch.quantization as quant

# Prepare model for quantization
model.qconfig = quant.get_default_qconfig('fbgemm')
model_prepared = quant.prepare(model, inplace=False)

# Calibrate with representative data
with torch.no_grad():
    for images, _ in calibration_loader:
        model_prepared(images)

# Convert to quantized model
model_quantized = quant.convert(model_prepared, inplace=False)

Results:

Model size: 30MB → 3.2MB (4x compression!)
Inference speed: +35% faster
mAP drop: -1% (84% → 85% due to quantization-aware training)

Deployment on ESP32-CAM

Converting to TensorFlow Lite

import tensorflow as tf

# Convert PyTorch model to ONNX
torch.onnx.export(model, dummy_input, "model.onnx")

# Convert ONNX to TensorFlow
import onnx
from onnx_tf.backend import prepare
onnx_model = onnx.load("model.onnx")
tf_rep = prepare(onnx_model)
tf_rep.export_graph("model_tf")

# Convert to TFLite with optimization
converter = tf.lite.TFLiteConverter.from_saved_model("model_tf")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()

# Save
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)

ESP32 Implementation

// ESP32-CAM Arduino code
#include <TensorFlowLite_ESP32.h>
#include "model_data.h"

// Initialize TFLite
tflite::MicroInterpreter* interpreter;
tflite::MicroMutableOpResolver<10> resolver;

void setup() {
  // Load model
  model = tflite::GetModel(model_data);

  // Setup interpreter with 40KB memory
  const int tensor_arena_size = 40 * 1024;
  uint8_t tensor_arena[tensor_arena_size];

  interpreter = new tflite::MicroInterpreter(
    model, resolver, tensor_arena, tensor_arena_size);

  interpreter->AllocateTensors();
}

void loop() {
  // Capture image from camera
  camera_fb_t* fb = esp_camera_fb_get();

  // Preprocess (resize to 160x120, normalize)
  preprocess_image(fb->buf, input_tensor);

  // Run inference
  uint32_t start = millis();
  interpreter->Invoke();
  uint32_t inference_time = millis() - start;

  // Post-process results
  parse_detections(output_tensor, detections);

  // Display (66ms average inference time)
  Serial.printf("Inference: %dms, Objects: %d\n", 
                inference_time, detections.size());

  esp_camera_fb_return(fb);
}

Results & Analysis

Performance Metrics

Metric	Our Model	MobileNet-SSD	Tiny-YOLO
Model Size	3.2MB	22MB	60MB
Inference Time	66ms (15 FPS)	250ms (4 FPS)	400ms (2.5 FPS)
mAP	85%	72%	78%
Power Consumption	180mA	N/A	N/A

Confusion Matrix

Our model performs best on:

✅ People (92% precision)
✅ Vehicles (88% precision)
✅ Furniture (85% precision)

Struggles with:

⚠️ Small objects (<10% of image)
⚠️ Occluded objects
⚠️ Low-light conditions

Real-World Testing

Tested in various scenarios:

Indoor Office: 90% detection rate
Outdoor (Daylight): 82% detection rate
Low Light: 65% detection rate

Lessons Learned

What Worked Well

Depthwise Separable Convolutions: Massive parameter reduction with minimal accuracy loss
Knowledge Distillation: Better than training small model from scratch
Quantization-Aware Training: Essential for maintaining accuracy after INT8 conversion

What Was Challenging

Debugging on Hardware: Limited logging capabilities on ESP32
Memory Management: Had to carefully manage 520KB SRAM
Camera Quality: OV2640 sensor produces noisy images requiring robust preprocessing

Future Improvements

[ ] Add temporal consistency (track objects across frames)
[ ] Implement adaptive resolution (reduce resolution for far objects)
[ ] Support for low-light enhancement
[ ] Over-the-air (OTA) model updates

Conclusion

This project demonstrates that modern deep learning can run on incredibly resource-constrained devices with careful architecture design and optimization. The key insights:

Architecture matters more than size: A well-designed 3MB model beats a poorly-designed 20MB model
Compression is essential: Pruning + Quantization gives 15x size reduction
Real-world testing is crucial: Lab results don't always transfer to deployment

The complete code and trained models are available on GitHub: [link]

References

[1] Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications"
[2] Hinton et al., "Distilling the Knowledge in a Neural Network"
[3] Han et al., "Learning both Weights and Connections for Efficient Neural Networks"

Want to learn more? Check out my other posts:

Human Activity Recognition with Wearable Sensors
Building a Hybrid Recommendation System
Fine-Tuning BERT for Sentiment Analysis

Questions? Feel free to reach out at sreeramachutuni@gmail.com or connect on LinkedIn

DEV Community