How I built a local, real-time face detection and tracking system that knows who’s around and when. Using just ₹800 of hardware.
ESP32-CAM → MJPEG stream → Python server → face recognition → web dashboard
The Itch That Needed Scratching
Last Friday evening, after a conversation with my friend Palash, I couldn’t shake an idea.
I had an ESP32-CAM lying on my desk, and I wanted to build something useful: a system that could tell who was around and when, without cloud APIs, subscriptions, or training datasets.
Not surveillance. Not production-grade. Just something practical, local, and hackable.
By Sunday night, it worked.
What We’re Building
A fully local face recognition system that:
- Streams live video from an ESP32-CAM
- Detects and identifies faces automatically
- Tracks detection timelines over time
- Displays everything in a simple web dashboard
No cloud. No external APIs. Everything runs on your local network.
What this is and isn’t
This is:
- A privacy-friendly, local presence-detection system
- Ideal for personal projects and experimentation
This is not:
- Production surveillance software
- Robust under extreme lighting or crowded scenes
Why ESP32-CAM?
The ESP32-CAM is absurdly good value for what it offers:
- Cost: ~$10
- Built-in camera: 2MP, good enough for faces
- WiFi: No cables
- Programmable: Full Arduino support
- Low power: Can run 24/7
The tradeoff is limited resolution and FPS—but that’s fine. All the intelligence lives on the server.
Part 1: Streaming Video from the ESP32-CAM
We start with the simplest possible requirement: live video.
Not recording. Not RTSP. Just streaming.
Why MJPEG?
MJPEG is perfect here. It’s simply a continuous stream of JPEG images sent over HTTP. No complex codecs, no decoding pipelines.
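For context, here is roughly what the stream looks like on the wire. The boundary name and byte counts below are illustrative; real implementations vary:

HTTP/1.1 200 OK
Content-Type: multipart/x-mixed-replace; boundary=frame

--frame
Content-Type: image/jpeg
Content-Length: 14321

<binary JPEG data>
--frame
Content-Type: image/jpeg
Content-Length: 14298

<binary JPEG data>
...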
Hardware Setup
You’ll need:
- ESP32-CAM (AI-Thinker model)
- USB-to-Serial adapter (for flashing)
- 5V power supply (3.3V is not enough)
ESP32-CAM Streaming Code
Here’s a minimal streaming setup. This exposes a /stream endpoint that serves MJPEG.
#include "esp_camera.h"
#include <WiFi.h>
#include "esp_http_server.h"
const char* ssid = "YOUR_WIFI";
const char* password = "YOUR_PASSWORD";
// AI-Thinker pin configuration
// https://randomnerdtutorials.com/esp32-cam-ai-thinker-pinout/
static esp_err_t stream_handler(httpd_req_t *req) {
while (true) {
camera_fb_t *fb = esp_camera_fb_get();
if (!fb) continue;
httpd_resp_send_chunk(req, (const char *)fb->buf, fb->len);
httpd_resp_send_chunk(req, "\r\n--frame\r\n", 12);
esp_camera_fb_return(fb);
}
}
void setup() {
Serial.begin(115200);
camera_config_t config;
config.pixel_format = PIXFORMAT_JPEG;
config.frame_size = FRAMESIZE_SVGA; // 800x600
config.jpeg_quality = 10;
esp_camera_init(&config);
WiFi.begin(ssid, password);
while (WiFi.status() != WL_CONNECTED) delay(500);
startCameraServer();
Serial.print("Stream available at: http://");
Serial.println(WiFi.localIP());
}
Live MJPEG stream from the ESP32-CAM - this is what we'll be processing
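Before writing any parsing code, a quick sanity check: OpenCV's VideoCapture can read an MJPEG URL directly (the address below is a placeholder for whatever the serial monitor printed). The next part parses the bytes by hand instead, which gives more control over buffering:

import cv2

cap = cv2.VideoCapture("http://192.168.1.50/stream")  # placeholder IP
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow("ESP32-CAM", frame)
    if cv2.waitKey(1) == 27:  # Esc to quit
        break
cap.release()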
Part 2: The Brain - Python + Face Recognition
Now that we have video, we need to:
- Consume the MJPEG stream
- Extract frames
- Detect faces
- Identify who they belong to
- Track detections over time
Libraries Used
We're using the face_recognition library, which is built on top of dlib.
pip install face-recognition opencv-python flask requests
On my machine, this processes ~8–12 FPS using the HOG model. CNN is more accurate but slower.
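Switching between the two is a single argument to face_recognition.face_locations (this is the library's documented API; test.jpg is just a stand-in image):

import face_recognition

image = face_recognition.load_image_file("test.jpg")  # any RGB image
# "hog" runs at interactive speeds on CPU; "cnn" is more accurate
# but is far slower unless dlib is built with CUDA support
face_locations = face_recognition.face_locations(image, model="hog")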
Consuming the MJPEG Stream
MJPEG is just JPEG images separated by boundaries.
JPEGs start with 0xFFD8 and end with 0xFFD9. We extract frames by scanning for these markers.
import cv2
import numpy as np
import requests

def consume_stream(url):
    response = requests.get(url, stream=True)
    bytes_data = bytes()
    for chunk in response.iter_content(chunk_size=1024):
        bytes_data += chunk
        # Find JPEG markers: 0xFFD8 opens a frame, 0xFFD9 closes it
        a = bytes_data.find(b'\xff\xd8')
        b = bytes_data.find(b'\xff\xd9', a)  # search only past the start marker
        if a != -1 and b != -1:
            jpg = bytes_data[a:b + 2]
            bytes_data = bytes_data[b + 2:]
            # Decode the JPEG bytes into an OpenCV (BGR) image
            image = cv2.imdecode(
                np.frombuffer(jpg, np.uint8),
                cv2.IMREAD_COLOR
            )
            if image is not None:  # skip frames that arrived corrupted
                process_frame(image)
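To run it, point consume_stream at the URL the ESP32 printed to the serial monitor (the IP below is a placeholder). Note that it blocks forever, processing frames as they arrive:

if __name__ == "__main__":
    # Placeholder address: substitute your ESP32-CAM's IP
    consume_stream("http://192.168.1.50/stream")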
Face Detection & Encoding
import face_recognition
def process_frame(image):
# Convert BGR to RGB (OpenCV uses BGR)
rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# Find faces
face_locations = face_recognition.face_locations(rgb_image)
face_encodings = face_recognition.face_encodings(
rgb_image,
face_locations
)
for encoding in face_encodings:
person_id = find_or_create_person(encoding)
record_detection(person_id, datetime.now())
save_face_image(person_id, image, face_location)
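Those last two helpers, record_detection and save_face_image, aren't defined above. A minimal sketch, assuming an in-memory defaultdict for timestamps and a faces/ directory for cropped images (swap in SQLite or similar if you want persistence):

import os
from collections import defaultdict

import cv2

# person_id -> list of detection timestamps
person_detections = defaultdict(list)

def record_detection(person_id, timestamp):
    person_detections[person_id].append(timestamp)

def save_face_image(person_id, image, face_location):
    # face_recognition returns locations as (top, right, bottom, left)
    top, right, bottom, left = face_location
    os.makedirs(f"faces/{person_id}", exist_ok=True)
    cv2.imwrite(f"faces/{person_id}/latest.jpg", image[top:bottom, left:right])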
Automatic Identification (No Manual Labels)
When a new face appears, we compare it against known encodings. If it doesn’t match anyone, we create a new person ID.
def find_or_create_person(face_encoding):
    # `persons` maps person_id -> list of known encodings for that person
    for person_id, known_encodings in persons.items():
        matches = face_recognition.compare_faces(
            known_encodings,
            face_encoding,
            tolerance=0.5  # stricter than the library default of 0.6
        )
        if True in matches:
            return person_id
    # No match: create a new person entry
    new_id = f"person_{next_id()}"
    persons[new_id] = [face_encoding]
    return new_id
Duplicates can occur under poor lighting or extreme angles. I added a manual merge feature to fix this.
Each person typically stabilizes at ~5–10 encodings with a max of 20.
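Note that find_or_create_person above only ever stores the first encoding, while those numbers imply encodings accumulate as a person keeps matching. A sketch of how that accumulation and the manual merge could work (the 0.3 novelty threshold is my assumption, not from the original code):

MAX_ENCODINGS_PER_PERSON = 20

def add_encoding(person_id, face_encoding):
    # Call this after a successful match; only keep encodings that
    # add diversity (new angles, new lighting), not near-duplicates
    encodings = persons[person_id]
    if len(encodings) >= MAX_ENCODINGS_PER_PERSON:
        return
    distances = face_recognition.face_distance(encodings, face_encoding)
    if distances.min() > 0.3:  # assumed novelty threshold
        encodings.append(face_encoding)

def merge_persons(keep_id, duplicate_id):
    # Manual fix for when one face splits into two IDs
    persons[keep_id].extend(persons.pop(duplicate_id))
    persons[keep_id] = persons[keep_id][:MAX_ENCODINGS_PER_PERSON]
    person_detections[keep_id].extend(person_detections.pop(duplicate_id, []))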
Part 3: Visualising
Bucketing Detections
We bucket detections into 30-second intervals:
from collections import defaultdict
from datetime import datetime, timedelta

def get_bucketed_detections(person_id, hours=24):
    detections = person_detections[person_id]
    cutoff = datetime.now() - timedelta(hours=hours)
    recent = [ts for ts in detections if ts > cutoff]
    buckets = defaultdict(int)
    for ts in recent:
        # Round down to the nearest 30-second bucket
        bucket = ts.replace(
            second=(ts.second // 30) * 30,
            microsecond=0
        )
        buckets[bucket] += 1
    return buckets
This gives us smooth graphs instead of a spiky mess.
The Dashboard
Flask + Chart.js gets you a clean UI quickly:
new Chart(ctx, {
  type: 'line',
  data: {
    labels: timestamps,
    datasets: [{
      label: 'Detections per 30s',
      data: counts,
      borderColor: '#667eea',
      fill: true
    }]
  }
});
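On the Flask side, a small JSON endpoint can feed that chart. This isn't shown in the snippets above; a minimal sketch assuming the get_bucketed_detections function from earlier (the route name and JSON shape are my own choices):

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/detections/<person_id>")
def detections(person_id):
    # ?hours=6 etc. backs the range-toggle buttons below
    hours = request.args.get("hours", default=24, type=int)
    buckets = get_bucketed_detections(person_id, hours=hours)
    items = sorted(buckets.items())
    # Chart.js wants parallel label/data arrays
    return jsonify({
        "timestamps": [ts.isoformat() for ts, _ in items],
        "counts": [count for _, count in items],
    })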
Buttons let you toggle between 1 hour, 6 hours, 24 hours, and 7 days.
The Results
After a weekend of hacking:
- Real-time processing: ~10 FPS
- Fully local: No cloud, no data leaves the network
- Automatic identification: No training step
- Historical tracking: Clear presence patterns
Detection frequency over 7 days - you can clearly see patterns of when someone is around
The "Aha!" Moments
Moment 1: Realizing MJPEG is just JPEG markers was liberating. No complex video codecs.
Moment 2: The face_recognition library is stupid simple. No TensorFlow, no model training, just works.
Moment 3: 30-second bucketing makes graphs actually useful. Raw timestamps are too noisy.
The Gotchas
Things that bit me:
- ESP32-CAM needs 5V: 3.3V will make it reboot randomly
- Face detection is CPU-intensive: ~0.3-1 second per frame with HOG model
- Lighting matters: Needs decent light for face detection
- Face encodings need diversity: Store multiple encodings per person from different angles
What's Next?
Now that the detection system is working, I'm planning to take this further - building a physical enclosure with some 3D printed parts and adding an AI agent that can actually interact based on who it detects. But that's a story for another post.
Try It Yourself
The full code is on GitHub: https://github.com/tarushnagpal/esp32-cam-face-recognition
Requirements:
- ESP32-CAM (~$10)
- Python 3.8+
- A weekend
Final Thoughts
This project shows how far you can get with simple, cheap hardware and some weekend coding. A face detection system that runs entirely locally, tracks presence over time, and costs less than lunch. Not bad for a few hours of work.
Coming Soon: Taking this further with 3D printing and AI agents