DEV Community

Cover image for Gizmo Guard - Spy Bot (Powered by Gemma4)
sasiperi
sasiperi

Posted on

Gizmo Guard - Spy Bot (Powered by Gemma4)

Gemma 4 Challenge: Build With Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

Table of Contents

  1. Inspiration
  2. What I Built
  3. Live Demo
  4. Code
  5. How I Used Gemma4

Inspiration

The idea for GizmoGuard actually started from my daughter.

One day, after having a nightmare about someone secretly entering the house while we were away, she became worried that somebody might sneak into her room and take away her favorite fluffy toy.

She asked:

“Can we build a small robot that watches my important things when nobody is around?”

That simple question became the inspiration for GizmoGuard.

The original goal was to create a small, affordable AI-powered guard bot that could:

  • watch over important belongings,
  • detect suspicious activity,
  • recognize familiar people vs strangers,
  • and explain what happened if something changed.

The early design, testing, and prototyping started with a fun and simple use case:

“Who moved my mug?”

Using a mug as the protected object made it easy to experiment with:

  • scene monitoring,
  • object movement detection,
  • multimodal reasoning,
  • and conversational AI responses

before expanding the idea into a broader personal safety and monitoring assistant.

Today, GizmoGuard has evolved into a privacy-first multimodal AI-at-the-edge system capable of monitoring valuable objects and understanding real-world activity locally using Gemma.

What I Built

GizmoGuard is a low-budget, privacy-first AI-at-the-edge personal safety and monitoring bot powered by locally running Gemma models.

The idea started from a simple but relatable problem:

“Who moved my mug?”

GizmoGuard continuously monitors a workspace — or any valuable object of interest, indoors or outdoors — using an ArduCam attached to a Raspberry Pi. The system detects scene changes such as:

  • Objects being moved
  • Objects being removed
  • Objects being replaced
  • Unexpected objects appearing
  • People approaching or touching monitored items
  • Ambient light changes
  • Unwanted background noise

The system is designed to intelligently distinguish between normal environmental activity and a real scene change near the protected object.

When motion or scene changes are detected, GizmoGuard captures “evidence images” and sends them to a Spring Boot backend API. The backend then uses Gemma 4 for multimodal image reasoning and natural-language explanations.

Using additional preconfigured contextual information, the system can also:

  • Recognize known people vs strangers
  • Analyze gestures and emotions
  • Understand nearby activity and surroundings
  • Generate spoken voice-enabled responses using the host system’s speech functionality

The entire system is built around a local-first AI architecture:

  • Images never leave the local environment
  • No cloud AI APIs are required
  • No recurring inference costs
  • Runs affordably on consumer-grade hardware

GizmoGuard demonstrates how compact multimodal AI models like Gemma can power practical, privacy-focused real-world edge AI applications.

Current Architecture

The current GizmoGuard architecture consists of the following components:

Raspberry Pi + ArduCam

  • Python running on the Raspberry Pi continuously captures images
  • Performs lightweight motion and scene-change detection
  • Sends “evidence images” to backend APIs for AI analysis

Spring Boot REST API

The Spring Boot backend acts as the orchestration layer and:

  • Manages image analysis workflows
  • Stores chat and contextual memory
  • Handles evidence image pipelines
  • Integrates with Gemma 4 using OpenAI-compatible APIs

Docker Model Runner (DMR)

  • Runs Gemma locally on my laptop
  • Exposes model APIs for multimodal inference
  • Enables fully local AI processing without cloud dependencies

Local Storage + MySQL

  • Stores evidence images locally
  • Maintains conversation history and contextual memory
  • Persists AI-generated responses and metadata

Multimodal AI Layer

Powered by Gemma 4, the AI layer:

  • Analyzes captured images
  • Explains scene changes in natural language
  • Supports conversational interaction
  • Generates contextual reasoning about nearby activity

The project demonstrates how practical multimodal AI systems can run locally using affordable hardware — without requiring expensive cloud infrastructure or hosted AI services.

Demo

Demo Link: Gizmo-Guard Bot Demo

Demo Includes

  1. Mug placed on desk

  2. Scene continuously monitored by Raspberry Pi + ArduCam

  3. Mug moved, removed, or scene unexpectedly changes

  4. Evidence image captured automatically

  5. Gemma analyzes the image and explains what changed using multimodal reasoning

  6. When real people (or images of them) appear in the scene:

    • Detects known people pre-configured through system prompts and contextual memory
    • Identifies all unknown individuals as strangers
    • Analyzes appearance, ambience, emotions, and gestures
    • Detects potentially malicious or friendly behavior and reports observations
  7. The system generates voice-enabled spoken responses using the host system’s native speech functionality, allowing GizmoGuard to verbally describe scene changes, alerts, and AI observations in real time.

  8. Future prospects (model voice capabilities for analysis,
    Servo-based/GIO-Header wheels)

Code

GitHub (sasiperi) Repo name and Link: gizmo-guard-gemma4-challenge

Tech stack includes:

  • Java + Spring Boot
  • Raspberry Pi + ArduCam
  • Docker Model Runner (DMR)
  • Ollama/OpenAI-compatible APIs (gguf models)
  • Gemma4 multimodal model
  • MySQL for Chat Memory and Responses etc..
  • Local filesystem storage for images.

How I Used Gemma4

GizmoGuard is powered by Gemma 4B Quantized (gemma4:4B-Q4_K_XL) running locally through Docker Model Runner (DMR).

I specifically selected this model because it delivered the best overall balance between:

  • Multimodal capability
  • Local deployment feasibility
  • Memory footprint
  • Reasoning quality
  • Privacy
  • Cost efficiency

Why Gemma Was the Right Fit

1. Privacy-First Local AI

One of the primary goals of GizmoGuard was ensuring that camera images and personal workspace data never leave the local environment.

By running Gemma locally:

  • No images are uploaded to external AI services
  • No cloud inference is required
  • The system can operate completely offline

For an always-on visual monitoring system, this was extremely important.


2. Edge-Friendly Performance

I evaluated several local multimodal models.

Some lightweight models were fast but struggled with:

  • Reliable image understanding
  • Object consistency
  • Intelligent system/user prompt processing (chat capabilities)

Larger models produced strong results but required significantly more resources and slower inference times.

gemma4:4B-Q4_K_XL turned out to be the ideal middle ground:

  • Compact enough for practical local deployment
  • Efficient enough for near real-time analysis
  • Still capable of strong multimodal reasoning quality

This made it an excellent fit for AI-at-the-edge workloads.


3. Multimodal Simplicity

A major advantage of Gemma4:4B was its ability to handle:

  • Image understanding
  • Reasoning
  • Conversational responses
  • System and user prompt processing

within a single model.

This avoided the need to chain together:

  • Separate vision models
  • OCR pipelines
  • Reasoning models
  • Chat models

Using a unified multimodal model simplified:

  • Architecture
  • Orchestration
  • Deployment
  • Latency
  • Operational complexity

4. Cost-Effective AI

Another goal of the project was proving that useful AI systems do not require expensive cloud GPUs or recurring API fees.

Running Gemma locally means:

  • Zero inference cost
  • No token billing
  • Predictable performance
  • Full ownership of the AI stack

This makes GizmoGuard practical for:

  • Hobbyists
  • Makers
  • Students
  • Small-scale edge deployments

5. Real-World AI at the Edge

GizmoGuard demonstrates how compact multimodal models like Gemma can power practical real-world edge AI applications using affordable hardware and open-source tooling.

The project combines:

  • Edge AI
  • IoT
  • Computer Vision
  • Multimodal Reasoning
  • Local LLM Deployment
  • Spring Boot APIs
  • Privacy-First Architecture

into a fully working end-to-end system.

It showcases how modern multimodal AI can move beyond cloud-only deployments and become useful directly at the edge.

Top comments (0)