How I built a local, real-time face detection and tracking system that knows who’s around and when. Using just ₹800 of hardware.
ESP32-CAM → MJPEG stream → Python server → face recognition → web dashboard
The Itch That Needed Scratching
Last Friday evening, after a conversation with my friend Palash, I couldn’t shake an idea.
I had an ESP32-CAM lying on my desk, and I wanted to build something useful: a system that could tell who was around and when, without cloud APIs, subscriptions, or training datasets.
Not surveillance. Not production-grade. Just something practical, local, and hackable.
By Sunday night, it worked.
What We’re Building
A fully local face recognition system that:
- Streams live video from an ESP32-CAM
- Detects and identifies faces automatically
- Tracks detection timelines over time
- Displays everything in a simple web dashboard
No cloud. No external APIs. Everything runs on your local network.
What this is and isn’t
This is:
- A privacy-friendly, local presence-detection system
- Ideal for personal projects and experimentation
This is not:
- Production surveillance software
- Robust under extreme lighting or crowded scenes
Why ESP32-CAM?
The ESP32-CAM is absurdly good value for what it offers:
- Cost: ~$10
- Built-in camera: 2MP, good enough for faces
- WiFi: No cables
- Programmable: Full Arduino support
- Low power: Can run 24/7
The tradeoff is limited resolution and FPS—but that’s fine. All the intelligence lives on the server.
Part 1: Streaming Video from the ESP32-CAM
We start with the simplest possible requirement: live video.
Not recording. Not RTSP. Just streaming.
Why MJPEG?
MJPEG is perfect here. It’s simply a continuous stream of JPEG images sent over HTTP. No complex codecs, no decoding pipelines.
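For context, here is roughly what the stream looks like on the wire. The boundary name and byte counts below are illustrative; real implementations vary:

HTTP/1.1 200 OK
Content-Type: multipart/x-mixed-replace; boundary=frame

--frame
Content-Type: image/jpeg
Content-Length: 14321

<binary JPEG data>
--frame
Content-Type: image/jpeg
Content-Length: 14298

<binary JPEG data>
...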
Hardware Setup
You’ll need:
- ESP32-CAM (AI-Thinker model)
- USB-to-Serial adapter (for flashing)
- 5V power supply (3.3V is not enough)
ESP32-CAM Streaming Code
Here’s a minimal streaming setup. This exposes a /stream endpoint that serves MJPEG.
#include "esp_camera.h"
#include <WiFi.h>
#include "esp_http_server.h"
const char* ssid = "YOUR_WIFI";
const char* password = "YOUR_PASSWORD";
// AI-Thinker pin configuration
// https://randomnerdtutorials.com/esp32-cam-ai-thinker-pinout/
static esp_err_t stream_handler(httpd_req_t *req) {
while (true) {
camera_fb_t *fb = esp_camera_fb_get();
if (!fb) continue;
httpd_resp_send_chunk(req, (const char *)fb->buf, fb->len);
httpd_resp_send_chunk(req, "\r\n--frame\r\n", 12);
esp_camera_fb_return(fb);
}
}
void setup() {
Serial.begin(115200);
camera_config_t config;
config.pixel_format = PIXFORMAT_JPEG;
config.frame_size = FRAMESIZE_SVGA; // 800x600
config.jpeg_quality = 10;
esp_camera_init(&config);
WiFi.begin(ssid, password);
while (WiFi.status() != WL_CONNECTED) delay(500);
startCameraServer();
Serial.print("Stream available at: http://");
Serial.println(WiFi.localIP());
}
Live MJPEG stream from the ESP32-CAM - this is what we'll be processing
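Before writing any parsing code, a quick sanity check: OpenCV's VideoCapture can read an MJPEG URL directly (the address below is a placeholder for whatever the serial monitor printed). The next part parses the bytes by hand instead, which gives more control over buffering:

import cv2

cap = cv2.VideoCapture("http://192.168.1.50/stream")  # placeholder IP
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow("ESP32-CAM", frame)
    if cv2.waitKey(1) == 27:  # Esc to quit
        break
cap.release()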
Part 2: The Brain - Python + Face Recognition
Now that we have video, we need to:
- Consume the MJPEG stream
- Extract frames
- Detect faces
- Identify who they belong to
- Track detections over time
Libraries Used
We're using the face_recognition library, which is built on top of dlib.
pip install face-recognition opencv-python flask requests
On my machine, this processes ~8–12 FPS using the HOG model. CNN is more accurate but slower.
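Switching between the two is a single argument to face_recognition.face_locations (this is the library's documented API; test.jpg is just a stand-in image):

import face_recognition

image = face_recognition.load_image_file("test.jpg")  # any RGB image
# "hog" runs at interactive speeds on CPU; "cnn" is more accurate
# but is far slower unless dlib is built with CUDA support
face_locations = face_recognition.face_locations(image, model="hog")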
Consuming the MJPEG Stream
MJPEG is just JPEG images separated by boundaries.
JPEGs start with 0xFFD8 and end with 0xFFD9. We extract frames by scanning for these markers.
import cv2
import numpy as np
import requests

def consume_stream(url):
    response = requests.get(url, stream=True)
    bytes_data = bytes()
    for chunk in response.iter_content(chunk_size=1024):
        bytes_data += chunk
        # Find JPEG markers: 0xFFD8 opens a frame, 0xFFD9 closes it
        a = bytes_data.find(b'\xff\xd8')
        b = bytes_data.find(b'\xff\xd9', a)  # search only past the start marker
        if a != -1 and b != -1:
            jpg = bytes_data[a:b + 2]
            bytes_data = bytes_data[b + 2:]
            # Decode the JPEG bytes into an OpenCV (BGR) image
            image = cv2.imdecode(
                np.frombuffer(jpg, np.uint8),
                cv2.IMREAD_COLOR
            )
            if image is not None:  # skip frames that arrived corrupted
                process_frame(image)
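To run it, point consume_stream at the URL the ESP32 printed to the serial monitor (the IP below is a placeholder). Note that it blocks forever, processing frames as they arrive:

if __name__ == "__main__":
    # Placeholder address: substitute your ESP32-CAM's IP
    consume_stream("http://192.168.1.50/stream")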
Face Detection & Encoding
import face_recognition
def process_frame(image):
# Convert BGR to RGB (OpenCV uses BGR)
rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# Find faces
face_locations = face_recognition.face_locations(rgb_image)
face_encodings = face_recognition.face_encodings(
rgb_image,
face_locations
)
for encoding in face_encodings:
person_id = find_or_create_person(encoding)
record_detection(person_id, datetime.now())
save_face_image(person_id, image, face_location)
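Those last two helpers, record_detection and save_face_image, aren't defined above. A minimal sketch, assuming an in-memory defaultdict for timestamps and a faces/ directory for cropped images (swap in SQLite or similar if you want persistence):

import os
from collections import defaultdict

import cv2

# person_id -> list of detection timestamps
person_detections = defaultdict(list)

def record_detection(person_id, timestamp):
    person_detections[person_id].append(timestamp)

def save_face_image(person_id, image, face_location):
    # face_recognition returns locations as (top, right, bottom, left)
    top, right, bottom, left = face_location
    os.makedirs(f"faces/{person_id}", exist_ok=True)
    cv2.imwrite(f"faces/{person_id}/latest.jpg", image[top:bottom, left:right])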
Automatic Identification (No Manual Labels)
When a new face appears, we compare it against known encodings. If it doesn’t match anyone, we create a new person ID.
def find_or_create_person(face_encoding):
    # `persons` maps person_id -> list of known encodings for that person
    for person_id, known_encodings in persons.items():
        matches = face_recognition.compare_faces(
            known_encodings,
            face_encoding,
            tolerance=0.5  # stricter than the library default of 0.6
        )
        if True in matches:
            return person_id
    # No match: create a new person entry
    new_id = f"person_{next_id()}"
    persons[new_id] = [face_encoding]
    return new_id
Duplicates can occur under poor lighting or extreme angles. I added a manual merge feature to fix this.
Each person typically stabilizes at ~5–10 encodings with a max of 20.
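Note that find_or_create_person above only ever stores the first encoding, while those numbers imply encodings accumulate as a person keeps matching. A sketch of how that accumulation and the manual merge could work (the 0.3 novelty threshold is my assumption, not from the original code):

MAX_ENCODINGS_PER_PERSON = 20

def add_encoding(person_id, face_encoding):
    # Call this after a successful match; only keep encodings that
    # add diversity (new angles, new lighting), not near-duplicates
    encodings = persons[person_id]
    if len(encodings) >= MAX_ENCODINGS_PER_PERSON:
        return
    distances = face_recognition.face_distance(encodings, face_encoding)
    if distances.min() > 0.3:  # assumed novelty threshold
        encodings.append(face_encoding)

def merge_persons(keep_id, duplicate_id):
    # Manual fix for when one face splits into two IDs
    persons[keep_id].extend(persons.pop(duplicate_id))
    persons[keep_id] = persons[keep_id][:MAX_ENCODINGS_PER_PERSON]
    person_detections[keep_id].extend(person_detections.pop(duplicate_id, []))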
Part 3: Visualising
Bucketing Detections
We bucket detections into 30-second intervals:
from collections import defaultdict
from datetime import datetime, timedelta

def get_bucketed_detections(person_id, hours=24):
    detections = person_detections[person_id]
    cutoff = datetime.now() - timedelta(hours=hours)
    recent = [ts for ts in detections if ts > cutoff]
    buckets = defaultdict(int)
    for ts in recent:
        # Round down to the nearest 30-second bucket
        bucket = ts.replace(
            second=(ts.second // 30) * 30,
            microsecond=0
        )
        buckets[bucket] += 1
    return buckets
This gives us smooth graphs instead of a spiky mess.
The Dashboard
Flask + Chart.js gets you a clean UI quickly:
new Chart(ctx, {
  type: 'line',
  data: {
    labels: timestamps,
    datasets: [{
      label: 'Detections per 30s',
      data: counts,
      borderColor: '#667eea',
      fill: true
    }]
  }
});
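On the Flask side, a small JSON endpoint can feed that chart. This isn't shown in the snippets above; a minimal sketch assuming the get_bucketed_detections function from earlier (the route name and JSON shape are my own choices):

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/detections/<person_id>")
def detections(person_id):
    # ?hours=6 etc. backs the range-toggle buttons below
    hours = request.args.get("hours", default=24, type=int)
    buckets = get_bucketed_detections(person_id, hours=hours)
    items = sorted(buckets.items())
    # Chart.js wants parallel label/data arrays
    return jsonify({
        "timestamps": [ts.isoformat() for ts, _ in items],
        "counts": [count for _, count in items],
    })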
Buttons let you toggle between 1 hour, 6 hours, 24 hours, and 7 days.
The Results
After a weekend of hacking:
- Real-time processing: ~10 FPS
- Fully local: No cloud, no data leaves the network
- Automatic identification: No training step
- Historical tracking: Clear presence patterns
Detection frequency over 7 days - you can clearly see patterns of when someone is around
The "Aha!" Moments
Moment 1: Realizing MJPEG is just JPEG markers was liberating. No complex video codecs.
Moment 2: The face_recognition library is stupid simple. No TensorFlow, no model training, just works.
Moment 3: 30-second bucketing makes graphs actually useful. Raw timestamps are too noisy.
The Gotchas
Things that bit me:
- ESP32-CAM needs 5V: 3.3V will make it reboot randomly
- Face detection is CPU-intensive: ~0.3-1 second per frame with HOG model
- Lighting matters: Needs decent light for face detection
- Face encodings need diversity: Store multiple encodings per person from different angles
What's Next?
Now that the detection system is working, I'm planning to take this further - building a physical enclosure with some 3D printed parts and adding an AI agent that can actually interact based on who it detects. But that's a story for another post.
Try It Yourself
The full code is on GitHub: https://github.com/tarushnagpal/esp32-cam-face-recognition
Requirements:
- ESP32-CAM (~$10)
- Python 3.8+
- A weekend
Final Thoughts
This project shows how far you can get with simple, cheap hardware and some weekend coding. A face detection system that runs entirely locally, tracks presence over time, and costs less than lunch. Not bad for a few hours of work.
Coming Soon: Taking this further with 3D printing and AI agents