sasiperi

Posted on May 21 • Edited on May 26

GizmoGuard - Spy Bot (Powered by Gemma4)

#devchallenge #gemmachallenge #gemma

Gemma 4 Challenge: Build With Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

Inspiration
What I Built
Live Demo
Code
How I Used Gemma4

Inspiration

The idea for GizmoGuard actually started from my daughter.

One day, after having a nightmare about someone secretly entering the house while we were away, she became worried that somebody might sneak into her room and take away her favorite fluffy toy.

She asked:

“Can we build a small robot that watches my important things when nobody is around?”

That simple question became the inspiration for GizmoGuard.

The original goal was to create a small, affordable AI-powered guard bot that could:

watch over important belongings,
detect suspicious activity,
recognize familiar people vs strangers,
and explain what happened if something changed.

The early design, testing, and prototyping started with a fun and simple use case:

“Who moved my mug?”

Using a mug as the protected object made it easy to experiment with:

scene monitoring,
object movement detection,
multimodal reasoning,
and conversational AI responses

before expanding the idea into a broader personal safety and monitoring assistant.

Today, GizmoGuard has evolved into a privacy-first multimodal AI-at-the-edge system capable of monitoring valuable objects and understanding real-world activity locally using Gemma.

What I Built

GizmoGuard is a low-budget, privacy-first AI-at-the-edge personal safety and monitoring bot powered by locally running Gemma models.

The idea started from a simple but relatable problem:

“Who moved my mug?”

GizmoGuard continuously monitors a workspace — or any valuable object of interest, indoors or outdoors — using an ArduCam attached to a Raspberry Pi. The system detects scene changes such as:

Objects being moved
Objects being removed
Objects being replaced
Unexpected objects appearing
People approaching or touching monitored items
Ambient light changes
Unwanted background noise

The system is designed to intelligently distinguish between normal environmental activity and a real scene change near the protected object.

When motion or scene changes are detected, GizmoGuard captures “evidence images” and sends them to a Spring Boot backend API. The backend then uses Gemma 4 for multimodal image reasoning and natural-language explanations.

Using additional preconfigured contextual information, the system can also:

Recognize known people vs strangers
Analyze gestures and emotions
Understand nearby activity and surroundings
Generate spoken voice-enabled responses using the host system’s speech functionality

The entire system is built around a local-first AI architecture:

Images never leave the local environment
No cloud AI APIs are required
No recurring inference costs
Runs affordably on consumer-grade hardware

GizmoGuard demonstrates how compact multimodal AI models like Gemma can power practical, privacy-focused real-world edge AI applications.

Current Architecture

The current GizmoGuard architecture consists of the following components:

Raspberry Pi + ArduCam

Python running on the Raspberry Pi continuously captures images
Performs lightweight motion and scene-change detection
Sends “evidence images” to backend APIs for AI analysis

Spring Boot REST API

The Spring Boot backend acts as the orchestration layer and:

Manages image analysis workflows
Stores chat and contextual memory
Handles evidence image pipelines
Integrates with Gemma 4 using OpenAI-compatible APIs

Docker Model Runner (DMR)

Runs Gemma locally on my laptop
Exposes model APIs for multimodal inference
Enables fully local AI processing without cloud dependencies

Local Storage + MySQL

Stores evidence images locally
Maintains conversation history and contextual memory
Persists AI-generated responses and metadata

Multimodal AI Layer

Analyzes captured images
Explains scene changes in natural language
Supports conversational interaction
Generates contextual reasoning about nearby activity

The project demonstrates how practical multimodal AI systems can run locally using affordable hardware — without requiring expensive cloud infrastructure or hosted AI services.

Demo

Demo Link: Gizmo-Guard Bot Demo

Demo Includes

Mug placed on desk
Scene continuously monitored by Raspberry Pi + ArduCam
Mug moved, removed, or scene unexpectedly changes
Evidence image captured automatically
Gemma analyzes the image and explains what changed using multimodal reasoning
When real people (or images of them) appear in the scene:
- Detects known people pre-configured through system prompts and contextual memory
- Identifies all unknown individuals as strangers
- Analyzes appearance, ambience, emotions, and gestures
- Detects potentially malicious or friendly behavior and reports observations
The system generates voice-enabled spoken responses using the host system’s native speech functionality, allowing GizmoGuard to verbally describe scene changes, alerts, and AI observations in real time.
Future prospects (model voice capabilities for analysis,
Servo-based/GIO-Header wheels)

Code

GitHub (sasiperi) Repo name and Link: gizmo-guard-gemma4-challenge

sasiperi / gizmo-guard-gemma4-challange

Show casing gemma4 capabilities with Edge, local running models

poc-ar-agentic-ai

View on GitHub

Tech stack includes:

Java + Spring Boot
Raspberry Pi + ArduCam
Docker Model Runner (DMR)
Ollama/OpenAI-compatible APIs (gguf models)
Gemma4 multimodal model
MySQL for Chat Memory and Responses etc..
Local filesystem storage for images.

How I Used Gemma4

GizmoGuard is powered by Gemma 4B Quantized (gemma4:4B-Q4_K_XL) running locally through Docker Model Runner (DMR).

I specifically selected this model because it delivered the best overall balance between:

Multimodal capability
Local deployment feasibility
Memory footprint
Reasoning quality
Privacy
Cost efficiency

Why Gemma Was the Right Fit

1. Privacy-First Local AI

One of the primary goals of GizmoGuard was ensuring that camera images and personal workspace data never leave the local environment.

By running Gemma locally:

No images are uploaded to external AI services
No cloud inference is required
The system can operate completely offline

For an always-on visual monitoring system, this was extremely important.

2. Edge-Friendly Performance

I evaluated several local multimodal models.

Some lightweight models were fast but struggled with:

Reliable image understanding
Object consistency
Intelligent system/user prompt processing (chat capabilities)

Larger models produced strong results but required significantly more resources and slower inference times.

gemma4:4B-Q4_K_XL turned out to be the ideal middle ground:

Compact enough for practical local deployment
Efficient enough for near real-time analysis
Still capable of strong multimodal reasoning quality

This made it an excellent fit for AI-at-the-edge workloads.

3. Multimodal Simplicity

A major advantage of Gemma4:4B was its ability to handle:

Image understanding
Reasoning
Conversational responses
System and user prompt processing

within a single model.

This avoided the need to chain together:

Separate vision models
OCR pipelines
Reasoning models
Chat models

Using a unified multimodal model simplified:

Architecture
Orchestration
Deployment
Latency
Operational complexity

4. Cost-Effective AI

Another goal of the project was proving that useful AI systems do not require expensive cloud GPUs or recurring API fees.

Running Gemma locally means:

Zero inference cost
No token billing
Predictable performance
Full ownership of the AI stack

This makes GizmoGuard practical for:

Hobbyists
Makers
Students
Small-scale edge deployments

5. Real-World AI at the Edge

GizmoGuard demonstrates how compact multimodal models like Gemma can power practical real-world edge AI applications using affordable hardware and open-source tooling.

The project combines:

Edge AI
IoT
Computer Vision
Multimodal Reasoning
Local LLM Deployment
Spring Boot APIs
Privacy-First Architecture

into a fully working end-to-end system.

It showcases how modern multimodal AI can move beyond cloud-only deployments and become useful directly at the edge.

DEV Community