Running vision AI locally has always had a catch: you need a GPU, or you need to send frames to a cloud API and pay per call. SmolVLM2-2.2B changes that. It is a 2.2B-parameter multimodal model designed for CPU inference, and this agent is built around it.
SmolVLM2 Edge Vision Agent is a fully offline edge vision agent that ingests a live webcam feed or an image folder, detects motion using frame-difference analysis, triggers VLM analysis only on scene changes, and persists structured observations to a local SQLite database with a FastAPI web dashboard for review. No API costs. No network calls after the first model download. 16GB RAM, no GPU required.
Project Overview
The agent does five things:
- Ingests a live webcam feed or an image folder as input
- Performs continuous visual monitoring: frame-difference motion detection that triggers VLM analysis only on scene changes
- Describes new objects, reads text from images (receipts, whiteboards, signs), and logs everything as structured observations
- Persists observations to a local SQLite database with timestamps, thumbnails, descriptions, and confidence scores
- Exposes a FastAPI web dashboard with live feed, latest observations, and a searchable log
It runs entirely offline. The model auto-downloads on first run and is cached locally from that point forward.
Use cases: home security camera analysis, document digitization pipelines, accessibility tools.
Architecture
The key design decision is the motion gate. Running a 2.2B-parameter model on every frame would be unusable on CPU hardware, where inference is far from instant. The agent solves this by running frame-difference motion detection on every frame first, and only invoking the VLM when a scene change is detected above the configured threshold.
Per-frame timeline:
Every frame goes through motion detection first. If the frame difference is below the threshold, the frame is dropped with no further processing. If motion is detected, the VLM runs, produces a description, and the observation is stored in SQLite with a thumbnail. This design means expensive model inference only happens when something actually changes in the scene, keeping a Pi-class CPU usable while still describing every meaningful scene change.
The FRAME_DIFF_THRESHOLD defaults to 0.15 and controls how sensitive the motion detector is. A higher value means less sensitivity: minor lighting changes or small movements are ignored. A lower value triggers the VLM more frequently.
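For intuition, here is a minimal sketch of the frame-difference idea, assuming OpenCV grayscale frames; the function name and the per-pixel delta of 25 are illustrative, not the project's actual internals:
import cv2
import numpy as np

def motion_score(prev_gray: np.ndarray, curr_gray: np.ndarray) -> float:
    """Fraction of pixels (0-1) that changed noticeably between two frames."""
    diff = cv2.absdiff(prev_gray, curr_gray)
    changed = np.count_nonzero(diff > 25)  # pixels whose intensity moved by more than a small delta
    return changed / diff.size

# Gate: only invoke the VLM when the score clears FRAME_DIFF_THRESHOLD (default 0.15)
# if motion_score(prev, curr) >= 0.15: description = vlm.describe(frame)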
Prerequisites
- Python: 3.11 or newer.
- RAM: 16GB minimum for the real model; less is fine in --mock mode.
- Disk: ~5GB free for the model cache.
- OS: Linux, macOS, or WSL2 on Windows - uses OpenCV, and webcam access requires native camera support.
- No GPU required - SmolVLM2-2.2B is designed for CPU inference.
Installation
git clone https://github.com/dakshjain-1616/smolvlm2-edge-agent.git
cd smolvlm2-edge-agent
make install # pip install -e .
cp .env.example .env # then edit values as needed
The make install command runs pip install -e ., which installs the package and its pinned runtime dependencies from requirements.txt. The .env.example file contains all documented environment variables; copy it to .env and edit the values you want to override before running.
Configuration
Every tunable is configurable via CLI flags and environment variables. CLI flags take precedence over environment variables. All variables are documented in .env.example in the repository.
- MODEL_NAME - HuggingFace model id, default: HuggingFaceTB/SmolVLM2-2.2B-Instruct
- USE_MOCK_MODE - bypass model loading with deterministic stub responses, default: false
- MODEL_CACHE_DIR - where the HuggingFace model is cached on disk, default: ./models
- DB_PATH - SQLite database file path, default: ./data/observations.db
- FRAME_DIFF_THRESHOLD - motion sensitivity on a 0–1 scale, higher means less sensitive, default: 0.15
- MIN_CONFIDENCE - minimum VLM confidence required to log an observation, default: 0.5
- PROCESSING_INTERVAL - seconds between frame samples, default: 1.0
- MAX_OBSERVATIONS - cap on stored rows, older observations are pruned, default: 10000
- DASHBOARD_HOST - FastAPI bind host, default: 0.0.0.0
- DASHBOARD_PORT - FastAPI port, default: 8080
- INPUT_SOURCE - camera index or path to image folder, default: 0
- OUTPUT_DIR - where observation artifacts are written, default: ./data/observations/
- THUMBNAIL_DIR - where frame thumbnails are saved, default: ./data/thumbnails/
- LOG_LEVEL - Python logging level, default: INFO
- LOG_FILE - optional log file path, default: ./data/agent.log
MIN_CONFIDENCE is worth paying attention to — observations where the VLM's confidence falls below 0.5 are not stored. Raising this filters out uncertain detections. Lowering it logs more, including lower-confidence observations.
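The precedence rule (CLI flag over environment variable over built-in default) is easy to picture with a small sketch; the resolve helper below is illustrative rather than the project's actual cli.py code:
import os
import argparse

def resolve(flag_value, env_name, default, cast=float):
    """CLI flag wins, then the environment variable, then the built-in default."""
    if flag_value is not None:
        return flag_value
    if env_name in os.environ:
        return cast(os.environ[env_name])
    return default

parser = argparse.ArgumentParser()
parser.add_argument("--interval", type=float, default=None)  # maps to PROCESSING_INTERVAL
args = parser.parse_args([])  # empty argv just for the example
interval = resolve(args.interval, "PROCESSING_INTERVAL", 1.0)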
Usage
Quick start - mock mode, no model download
The fastest way to verify the full pipeline is mock mode. It bypasses model loading entirely and uses deterministic stub responses, so you can confirm the agent loop, database writes, thumbnail generation, and dashboard all work before committing to the 5GB model download:
mkdir -p data/test_images
python3 -m src --mock --input ./data/test_images --duration 30
This runs the agent for 30 seconds against the data/test_images/ folder using the mock VLM, populates data/observations.db, and writes thumbnails to data/thumbnails/.
Run against a webcam
python3 -m src --input 0 --port 8080
Camera index 0 is the default device. For additional cameras, use index 1, 2, and so on. Open http://localhost:8080 in a browser to see the live dashboard. The dashboard shows the live feed, the most recent observations, and a searchable log of everything the agent has recorded.
Run against an image folder
python3 -m src --input ./images --interval 2.0
Iterates over ./images at 2-second intervals. Useful for batch processing a folder of scanned documents, receipts, or photos without a live camera feed.
Dashboard-only read mode
python3 -m src --mode dashboard --port 8080
Serves the dashboard against an existing data/observations.db without running the agent. Useful for reviewing historical observations without starting a new capture session.
API Reference
The FastAPI dashboard exposes six endpoints. Two are worth calling out:
The /api/search endpoint runs full-text search over stored observation descriptions, useful for finding all observations that mention a specific object, person, or piece of text across the full history.
The /api/observations endpoint is paginated with limit and offset parameters. The default returns the 50 most recent observations.
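As a quick illustration of calling these two endpoints from a script once the dashboard is up on the default port (the q parameter name for /api/search and the response shapes are assumptions; check dashboard.py for the actual contract):
import json
from urllib.request import urlopen

BASE = "http://localhost:8080"
# Page through stored observations using the documented limit/offset parameters
with urlopen(f"{BASE}/api/observations?limit=10&offset=0") as resp:
    observations = json.load(resp)
# Full-text search over descriptions; 'q' as the query parameter is an assumption
with urlopen(f"{BASE}/api/search?q=receipt") as resp:
    matches = json.load(resp)
print(json.dumps(observations, indent=2)[:500])
print(json.dumps(matches, indent=2)[:500])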
Models Used
SmolVLM2-2.2B-Instruct (HuggingFaceTB/SmolVLM2-2.2B-Instruct) is the default --model argument and MODEL_NAME env var. No other models are referenced in code, config, or docs. The model is downloaded from HuggingFace on first run and cached in ./models.
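For reference, here is a standalone sketch of loading the same checkpoint for CPU inference with transformers, following the pattern from the HuggingFace model card rather than the project's own VisionEngine; the prompt and frame.jpg path are placeholders:
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id, cache_dir="./models")
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.float32, cache_dir="./models"
).to("cpu")  # plain float32, no GPU

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "path": "frame.jpg"},  # a captured frame on disk
        {"type": "text", "text": "Describe what is happening in this scene."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
)
generated = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])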
Testing
The test suite covers all five modules - database, vision, agent, dashboard, and CLI - with the VLM fully mocked so no model download is needed to run tests:
make test # python3 -m pytest tests/ -v
make lint # ruff check src/ tests/ --fix
make typecheck # mypy src/ --ignore-missing-imports
Test coverage:
- tests/test_db.py - 10 tests covering SQLite schema, CRUD, and search
- tests/test_vision.py - 6 tests covering the mock VLM and prompt rendering
- tests/test_agent.py - 9 tests covering motion detection and the agent loop
- tests/test_dashboard.py - 6 tests covering HTTP route handlers
- tests/test_cli.py - 7 tests covering argparse and env-var loading
Total: 36 tests, all passing. No skipped tests.
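To give a flavor of the mocked style, here is a hypothetical test in the spirit of test_dashboard.py; the create_app factory and response schema below are stand-ins, not the repository's real code:
from fastapi import FastAPI
from fastapi.testclient import TestClient

def create_app() -> FastAPI:
    """Tiny stand-in for the real app factory in src/dashboard.py."""
    app = FastAPI()

    @app.get("/api/observations")
    def list_observations(limit: int = 50, offset: int = 0):
        return {"observations": [], "limit": limit, "offset": offset}

    return app

def test_observations_endpoint_defaults_to_50():
    client = TestClient(create_app())
    resp = client.get("/api/observations")
    assert resp.status_code == 200
    assert resp.json()["limit"] == 50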
Project Structure
smolvlm2-edge-agent/
├── src/
│ ├── __init__.py
│ ├── __main__.py # entry point for python -m src
│ ├── agent.py # MotionDetector + VisionAgent
│ ├── vision.py # VisionEngine (SmolVLM2 wrapper, with MockVisionEngine)
│ ├── db.py # SQLite Database class
│ ├── dashboard.py # FastAPI app factory + route handlers
│ └── cli.py # argparse + env loading
├── tests/ # 36 pytest tests, VLM fully mocked
├── data/.gitkeep # observations.db, thumbnails/, test_images/ land here
├── models/.gitkeep # HF model cache
├── pyproject.toml # ruff + mypy config + console_script
├── requirements.txt # pinned runtime deps
├── Makefile # install, test, lint, typecheck, run, clean
├── .env.example # documented env vars
├── .gitignore
├── BUILD_NOTES.md # build/verification trace
└── PUBLISH.md # exact GitHub push commands
The src/ directory maps cleanly to the agent's responsibilities - agent.py handles the motion detection and VLM orchestration loop, vision.py wraps the model with a mock-compatible interface, db.py handles all SQLite operations, dashboard.py is the FastAPI application, and cli.py handles all argument parsing and environment variable loading.
Contributing
PRs welcome. Before submitting, all three of the following must pass with zero errors:
make lint && make typecheck && make test
How I Built This Using NEO
This project was built using NEO. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.
The process started with an idea - a fully offline edge vision agent that runs on CPU-only hardware with no GPU and no cloud API calls. I put together a clear project description with the requirements, tech stack, and expected output, and handed it to NEO. From there NEO handled the full build autonomously: writing the code, running tests, fixing issues, and iterating until everything was working end to end. Once NEO completed the build, I did a manual review, tested it myself, and fed any improvements back - which NEO then implemented.
How You Can Use and Extend This With NEO
Use it as an offline home security monitor: Point it at a webcam, let it run, and review what it logged through the dashboard. Every scene change is stored with a timestamp, description, confidence score, and thumbnail - all locally, with no data leaving your machine.
Use it for document digitization pipelines: Point --input at a folder of scanned receipts, whiteboards, or handwritten notes. The VLM reads text from images and logs structured observations. The /api/search endpoint lets you query what was found across the full document set.
Use it as an accessibility tool: Run it against a webcam feed to generate continuous natural language descriptions of what is visible in the environment - stored and searchable, entirely offline.
Extend it with additional VLM backends: VisionEngine in vision.py wraps SmolVLM2-2.2B with a clean interface that MockVisionEngine also implements. Swapping in a different HuggingFace multimodal model means updating vision.py - the agent, database, dashboard, and CLI stay entirely unchanged.
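A rough sketch of what that swap could look like, assuming a describe()-style contract; the Protocol, method names, and factory below are illustrative and should be matched to whatever interface vision.py actually defines:
from typing import Protocol

class VisionBackend(Protocol):
    """Illustrative contract mirroring what vision.py exposes to the agent."""
    def describe(self, image_path: str) -> dict: ...

class MockBackend:
    def describe(self, image_path: str) -> dict:
        return {"description": f"stub description for {image_path}", "confidence": 1.0}

class AlternativeHFBackend:
    """Hypothetical wrapper around a different HuggingFace image-text-to-text model."""
    def __init__(self, model_id: str):
        self.model_id = model_id

    def describe(self, image_path: str) -> dict:
        raise NotImplementedError("load and run the alternative checkpoint here")

def make_backend(name: str) -> VisionBackend:
    return MockBackend() if name == "mock" else AlternativeHFBackend(name)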
Final Notes
SmolVLM2 Edge Vision Agent shows that meaningful vision AI does not require a GPU or a cloud API. A 2.2B-parameter model, motion-gated inference, a local SQLite store, and a FastAPI dashboard, all running offline on commodity hardware.
The code is at https://github.com/dakshjain-1616/SmolVLM2-Edge-Vision-Agent
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code



