Running vision AI locally has always had a catch: you need a GPU, or you need to send frames to a cloud API and pay per call. SmolVLM2-2.2B changes that. It is a 2.2B-parameter multimodal model designed for CPU inference, and this agent is built around it.
SmolVLM2 Edge Vision Agent is a fully offline edge vision agent that ingests a live webcam feed or an image folder, detects motion using frame-difference analysis, triggers VLM analysis only on scene changes, and persists structured observations to a local SQLite database with a FastAPI web dashboard for review. No API costs. No network calls after the first model download. 16GB RAM, no GPU required.
Project Overview
The agent does five things:
- Ingests a live webcam feed or an image folder as input
- Performs continuous visual monitoring: frame-difference motion detection that triggers VLM analysis only on scene changes
- Describes new objects, reads text from images (receipts, whiteboards, signs), and logs everything as structured observations
- Persists observations to a local SQLite database with timestamps, thumbnails, descriptions, and confidence scores
- Exposes a FastAPI web dashboard with live feed, latest observations, and a searchable log
It runs entirely offline. The model auto-downloads on first run and is cached locally from that point forward.
Use cases: home security camera analysis, document digitization pipelines, accessibility tools.
Architecture
The key design decision is the motion gate. Running a 2.2B-parameter model on every frame would be unusable on CPU hardware, where inference is far from instant. The agent solves this by running frame-difference motion detection on every frame first, and only invoking the VLM when a scene change is detected above the configured threshold.
Per-frame timeline:
Every frame goes through motion detection first. If the frame difference is below the threshold, the frame is dropped with no further processing. If motion is detected, the VLM runs, produces a description, and the observation is stored in SQLite with a thumbnail. This design means expensive model inference only happens when something actually changes in the scene, keeping a Pi-class CPU usable while still describing every meaningful scene change.
The FRAME_DIFF_THRESHOLD defaults to 0.15 and controls how sensitive the motion detector is. A higher value means less sensitivity: minor lighting changes or small movements are ignored. A lower value triggers the VLM more frequently.
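For intuition, here is a minimal sketch of the frame-difference idea, assuming OpenCV grayscale frames; the function name and the per-pixel delta of 25 are illustrative, not the project's actual internals:
import cv2
import numpy as np

def motion_score(prev_gray: np.ndarray, curr_gray: np.ndarray) -> float:
    """Fraction of pixels (0-1) that changed noticeably between two frames."""
    diff = cv2.absdiff(prev_gray, curr_gray)
    changed = np.count_nonzero(diff > 25)  # pixels whose intensity moved by more than a small delta
    return changed / diff.size

# Gate: only invoke the VLM when the score clears FRAME_DIFF_THRESHOLD (default 0.15)
# if motion_score(prev, curr) >= 0.15: description = vlm.describe(frame)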
Prerequisites
- Python: 3.11 or newer.
- RAM: 16GB minimum for the real model; less is fine in --mock mode.
- Disk: ~5GB free for the model cache.
- OS: Linux, macOS, or WSL2 on Windows - uses OpenCV, and webcam access requires native camera support.
- No GPU required - SmolVLM2-2.2B is designed for CPU inference.
Installation
git clone https://github.com/dakshjain-1616/smolvlm2-edge-agent.git
cd smolvlm2-edge-agent
make install # pip install -e .
cp .env.example .env # then edit values as needed
The make install command runs pip install -e ., which installs the package and its pinned runtime dependencies from requirements.txt. The .env.example file contains all documented environment variables; copy it to .env and edit the values you want to override before running.
Configuration
Every tunable is configurable via CLI flags and environment variables. CLI flags take precedence over environment variables. All variables are documented in .env.example in the repository.
- MODEL_NAME - HuggingFace model id, default: HuggingFaceTB/SmolVLM2-2.2B-Instruct
- USE_MOCK_MODE - bypass model loading with deterministic stub responses, default: false
- MODEL_CACHE_DIR - where the HuggingFace model is cached on disk, default: ./models
- DB_PATH - SQLite database file path, default: ./data/observations.db
- FRAME_DIFF_THRESHOLD - motion sensitivity on a 0–1 scale, higher means less sensitive, default: 0.15
- MIN_CONFIDENCE - minimum VLM confidence required to log an observation, default: 0.5
- PROCESSING_INTERVAL - seconds between frame samples, default: 1.0
- MAX_OBSERVATIONS - cap on stored rows, older observations are pruned, default: 10000
- DASHBOARD_HOST - FastAPI bind host, default: 0.0.0.0
- DASHBOARD_PORT - FastAPI port, default: 8080
- INPUT_SOURCE - camera index or path to image folder, default: 0
- OUTPUT_DIR - where observation artifacts are written, default: ./data/observations/
- THUMBNAIL_DIR - where frame thumbnails are saved, default: ./data/thumbnails/
- LOG_LEVEL - Python logging level, default: INFO
- LOG_FILE - optional log file path, default: ./data/agent.log
MIN_CONFIDENCE is worth paying attention to — observations where the VLM's confidence falls below 0.5 are not stored. Raising this filters out uncertain detections. Lowering it logs more, including lower-confidence observations.
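The precedence rule (CLI flag over environment variable over built-in default) is easy to picture with a small sketch; the resolve helper below is illustrative rather than the project's actual cli.py code:
import os
import argparse

def resolve(flag_value, env_name, default, cast=float):
    """CLI flag wins, then the environment variable, then the built-in default."""
    if flag_value is not None:
        return flag_value
    if env_name in os.environ:
        return cast(os.environ[env_name])
    return default

parser = argparse.ArgumentParser()
parser.add_argument("--interval", type=float, default=None)  # maps to PROCESSING_INTERVAL
args = parser.parse_args([])  # empty argv just for the example
interval = resolve(args.interval, "PROCESSING_INTERVAL", 1.0)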
Usage
Quick start - mock mode, no model download
The fastest way to verify the full pipeline is mock mode. It bypasses model loading entirely and uses deterministic stub responses, so you can confirm the agent loop, database writes, thumbnail generation, and dashboard all work before committing to the 5GB model download:
mkdir -p data/test_images
python3 -m src --mock --input ./data/test_images --duration 30
This runs the agent for 30 seconds against the data/test_images/ folder using the mock VLM, populates data/observations.db, and writes thumbnails to data/thumbnails/.
Run against a webcam
python3 -m src --input 0 --port 8080
Camera index 0 is the default device. For additional cameras, use index 1, 2, and so on. Open http://localhost:8080 in a browser to see the live dashboard. The dashboard shows the live feed, the most recent observations, and a searchable log of everything the agent has recorded.
Run against an image folder
python3 -m src --input ./images --interval 2.0
Iterates over ./images at 2-second intervals. Useful for batch processing a folder of scanned documents, receipts, or photos without a live camera feed.
Dashboard-only read mode
python3 -m src --mode dashboard --port 8080
Serves the dashboard against an existing data/observations.db without running the agent. Useful for reviewing historical observations without starting a new capture session.
API Reference
The FastAPI dashboard exposes six endpoints. Two are worth calling out:
The /api/search endpoint runs full-text search over stored observation descriptions, useful for finding all observations that mention a specific object, person, or piece of text across the full history.
The /api/observations endpoint is paginated with limit and offset parameters. The default returns the 50 most recent observations.
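As a quick illustration of calling these two endpoints from a script once the dashboard is up on the default port (the q parameter name for /api/search and the response shapes are assumptions; check dashboard.py for the actual contract):
import json
from urllib.request import urlopen

BASE = "http://localhost:8080"
# Page through stored observations using the documented limit/offset parameters
with urlopen(f"{BASE}/api/observations?limit=10&offset=0") as resp:
    observations = json.load(resp)
# Full-text search over descriptions; 'q' as the query parameter is an assumption
with urlopen(f"{BASE}/api/search?q=receipt") as resp:
    matches = json.load(resp)
print(json.dumps(observations, indent=2)[:500])
print(json.dumps(matches, indent=2)[:500])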
Models Used
SmolVLM2-2.2B-Instruct (HuggingFaceTB/SmolVLM2-2.2B-Instruct) is the default --model argument and MODEL_NAME env var. No other models are referenced in code, config, or docs. The model is downloaded from HuggingFace on first run and cached in ./models.
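For reference, here is a standalone sketch of loading the same checkpoint for CPU inference with transformers, following the pattern from the HuggingFace model card rather than the project's own VisionEngine; the prompt and frame.jpg path are placeholders:
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id, cache_dir="./models")
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.float32, cache_dir="./models"
).to("cpu")  # plain float32, no GPU

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "path": "frame.jpg"},  # a captured frame on disk
        {"type": "text", "text": "Describe what is happening in this scene."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
)
generated = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])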
Testing
The test suite covers all five modules - database, vision, agent, dashboard, and CLI - with the VLM fully mocked so no model download is needed to run tests:
make test # python3 -m pytest tests/ -v
make lint # ruff check src/ tests/ --fix
make typecheck # mypy src/ --ignore-missing-imports
Test coverage:
- tests/test_db.py - 10 tests covering SQLite schema, CRUD, and search
- tests/test_vision.py - 6 tests covering the mock VLM and prompt rendering
- tests/test_agent.py - 9 tests covering motion detection and the agent loop
- tests/test_dashboard.py - 6 tests covering HTTP route handlers
- tests/test_cli.py - 7 tests covering argparse and env-var loading
Total: 36 tests, all passing. No skipped tests.
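To give a flavor of the mocked style, here is a hypothetical test in the spirit of test_dashboard.py; the create_app factory and response schema below are stand-ins, not the repository's real code:
from fastapi import FastAPI
from fastapi.testclient import TestClient

def create_app() -> FastAPI:
    """Tiny stand-in for the real app factory in src/dashboard.py."""
    app = FastAPI()

    @app.get("/api/observations")
    def list_observations(limit: int = 50, offset: int = 0):
        return {"observations": [], "limit": limit, "offset": offset}

    return app

def test_observations_endpoint_defaults_to_50():
    client = TestClient(create_app())
    resp = client.get("/api/observations")
    assert resp.status_code == 200
    assert resp.json()["limit"] == 50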
Project Structure
smolvlm2-edge-agent/
├── src/
│ ├── __init__.py
│ ├── __main__.py # entry point for python -m src
│ ├── agent.py # MotionDetector + VisionAgent
│ ├── vision.py # VisionEngine (SmolVLM2 wrapper, with MockVisionEngine)
│ ├── db.py # SQLite Database class
│ ├── dashboard.py # FastAPI app factory + route handlers
│ └── cli.py # argparse + env loading
├── tests/ # 36 pytest tests, VLM fully mocked
├── data/.gitkeep # observations.db, thumbnails/, test_images/ land here
├── models/.gitkeep # HF model cache
├── pyproject.toml # ruff + mypy config + console_script
├── requirements.txt # pinned runtime deps
├── Makefile # install, test, lint, typecheck, run, clean
├── .env.example # documented env vars
├── .gitignore
├── BUILD_NOTES.md # build/verification trace
└── PUBLISH.md # exact GitHub push commands
The src/ directory maps cleanly to the agent's responsibilities - agent.py handles the motion detection and VLM orchestration loop, vision.py wraps the model with a mock-compatible interface, db.py handles all SQLite operations, dashboard.py is the FastAPI application, and cli.py handles all argument parsing and environment variable loading.
Contributing
PRs welcome. Before submitting, all three of the following must pass with zero errors:
make lint && make typecheck && make test
How I Built This Using NEO
This project was built using NEO. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.
The process started with an idea - a fully offline edge vision agent that runs on CPU-only hardware with no GPU and no cloud API calls. I put together a clear project description with the requirements, tech stack, and expected output, and handed it to NEO. From there NEO handled the full build autonomously: writing the code, running tests, fixing issues, and iterating until everything was working end to end. Once NEO completed the build, I did a manual review, tested it myself, and fed any improvements back - which NEO then implemented.
How You Can Use and Extend This With NEO
Use it as an offline home security monitor: Point it at a webcam, let it run, and review what it logged through the dashboard. Every scene change is stored with a timestamp, description, confidence score, and thumbnail - all locally, with no data leaving your machine.
Use it for document digitization pipelines: Point --input at a folder of scanned receipts, whiteboards, or handwritten notes. The VLM reads text from images and logs structured observations. The /api/search endpoint lets you query what was found across the full document set.
Use it as an accessibility tool: Run it against a webcam feed to generate continuous natural language descriptions of what is visible in the environment - stored and searchable, entirely offline.
Extend it with additional VLM backends: VisionEngine in vision.py wraps SmolVLM2-2.2B with a clean interface that MockVisionEngine also implements. Swapping in a different HuggingFace multimodal model means updating vision.py - the agent, database, dashboard, and CLI stay entirely unchanged.
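A rough sketch of what that swap could look like, assuming a describe()-style contract; the Protocol, method names, and factory below are illustrative and should be matched to whatever interface vision.py actually defines:
from typing import Protocol

class VisionBackend(Protocol):
    """Illustrative contract mirroring what vision.py exposes to the agent."""
    def describe(self, image_path: str) -> dict: ...

class MockBackend:
    def describe(self, image_path: str) -> dict:
        return {"description": f"stub description for {image_path}", "confidence": 1.0}

class AlternativeHFBackend:
    """Hypothetical wrapper around a different HuggingFace image-text-to-text model."""
    def __init__(self, model_id: str):
        self.model_id = model_id

    def describe(self, image_path: str) -> dict:
        raise NotImplementedError("load and run the alternative checkpoint here")

def make_backend(name: str) -> VisionBackend:
    return MockBackend() if name == "mock" else AlternativeHFBackend(name)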
Final Notes
SmolVLM2 Edge Vision Agent shows that meaningful vision AI does not require a GPU or a cloud API. A 2.2B-parameter model, motion-gated inference, a local SQLite store, and a FastAPI dashboard, all running offline on commodity hardware.
The code is at https://github.com/dakshjain-1616/SmolVLM2-Edge-Vision-Agent
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code



