Straightly

Posted on May 25

OIC: From a Working Toast Watcher to a General "Watch It for Me" Agent

#devchallenge #gemmachallenge #gemma

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

What I Learned Trying to Put Gemma 4 Into a Local iPhone Watcher

I already had a small iPhone app called OIC, short for "Oh, I See."

In its current working form, OIC can watch my toaster and tell me when my toast is ready. It uses a custom, hand-rolled local vision model that serves the same architectural role in OIC that I hoped Gemma 4 could serve more generally.

What I wanted next was to see whether OIC could become something broader: a general "watch it for me" agent. Instead of writing a separate detector for every situation, I wanted to find out whether a small multimodal Gemma 4 model running locally on the phone could let the same watcher loop handle different tasks through instruction.

The use case I wanted to add was tracking my cat.

My cat likes to go outside, but it is not trained to come home on its own. Sometimes I have to go find it. Sometimes it has already come back, and I waste time and worry looking for it when I did not need to. That felt like a very good use case for a local visual watcher:

watch the back door
detect whether the cat went out or came home
keep a record of the state

I did not get that full Gemma-powered cat-door watcher working in time for this challenge. But I did get far enough to learn something important about local multimodal models, iPhone deployment, and what it really takes to turn a narrow watcher into a general one.

The Core Concept: The Watcher Architecture

The core loop for a watcher is simple:

watch a scene
interpret what matters in that scene
decide whether an event happened
update state and notify the user if needed

What changes from one watcher to another is not the loop. What changes is:

what the user wants watched
what events matter
what labels the watcher should return
what visual reasoning is needed

That is why Gemma 4 interested me.

If a small multimodal model could run locally on the phone and follow instructions well enough, OIC could become a more general visual watcher:

"Watch the toaster and tell me when the toast is ready."
"Watch the back door and tell me whether my cat is outside or back home."
"Watch other scenes that can be described simply enough."

Same loop. Different watched target. Different instruction.
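
The "same loop, different instruction" idea can be sketched in Swift. This is an illustration under assumptions, not OIC's actual code: `WatcherSpec`, `SceneInterpreter`, `WatcherState`, `runWatcherStep`, and the label names are all hypothetical, and the interpreter is a stub so that either a hand-rolled model or a local Gemma runtime could sit behind the same protocol.

```swift
import Foundation

// Hypothetical types for illustration; none of these names come from OIC itself.

/// Everything that changes between watchers: the instruction and the allowed labels.
struct WatcherSpec {
    let name: String
    let instruction: String
    let labels: [String]
    let notifyOn: Set<String>   // labels that should trigger an alert
}

/// Whatever interprets a frame: a hand-rolled model or a local multimodal model.
protocol SceneInterpreter {
    func label(frame: Data, spec: WatcherSpec) -> String?
}

/// Per-watcher state carried between frames.
struct WatcherState {
    var lastLabel: String?
    var notifications: [String] = []
}

/// One pass of the loop: interpret the frame, decide whether an event happened, update state.
func runWatcherStep(frame: Data, spec: WatcherSpec,
                    interpreter: SceneInterpreter, state: inout WatcherState) {
    // Ignore answers that fall outside the spec's label set.
    guard let label = interpreter.label(frame: frame, spec: spec),
          spec.labels.contains(label) else { return }
    // An event is a change of label, not a repeated observation.
    let changed = label != state.lastLabel
    state.lastLabel = label
    if changed && spec.notifyOn.contains(label) {
        state.notifications.append("\(spec.name): \(label)")
    }
}

/// Stub interpreter that replays scripted answers, standing in for a real model.
final class ScriptedInterpreter: SceneInterpreter {
    private var answers: [String]
    init(answers: [String]) { self.answers = answers }
    func label(frame: Data, spec: WatcherSpec) -> String? {
        answers.isEmpty ? nil : answers.removeFirst()
    }
}

let toast = WatcherSpec(
    name: "toast",
    instruction: "Watch the toaster and tell me when the toast is ready.",
    labels: ["not_ready", "ready"],
    notifyOn: ["ready"])

var toastState = WatcherState()
let scripted = ScriptedInterpreter(answers: ["not_ready", "not_ready", "ready", "ready"])
for _ in 0..<4 {
    // An empty Data value stands in for a camera frame in this sketch.
    runWatcherStep(frame: Data(), spec: toast, interpreter: scripted, state: &toastState)
}
print(toastState.notifications)   // prints ["toast: ready"]
```

Under this shape, a cat-door watcher would be another `WatcherSpec` value with its own instruction and labels, while `runWatcherStep` stays unchanged.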

What Was Already Working

Before I tried Gemma, OIC already had one working watcher: toast.

I was not starting from a blank AI demo. I already had:

an iPhone app
a camera loop
a working toast-monitoring path
an alert flow

The toast watcher is still the cleanest baseline in the project. It is narrow, controlled, and useful. It also made the Gemma experiment more interesting, because I was not asking whether AI could solve a toy problem. I was asking whether a working narrow watcher could be extended into a more general one without losing its local-first character.

What I Actually Accomplished

I did not finish the cat-door watcher, but I did accomplish several things that matter.

1. I refactored OIC from a toast-specific app toward a watcher architecture

That included:

watcher specifications
watcher labels
watcher selection in the app
a path for multiple watcher types

2. I kept the existing toast watcher working while extending the architecture

The toast watcher is not just a demo. It is the working baseline.

3. I got a Gemma GGUF runtime working locally on the iPhone

This was a concrete milestone.

I integrated a local llama.cpp iOS XCFramework path, set up app-local model handling, and got the app to load the Gemma GGUF model on-device.

That did not mean the full watcher worked. It did show that OIC could host a Gemma runtime locally on the phone instead of depending on a cloud loop.

4. I built the plumbing that a local watcher actually needs

A lot of the work ended up being operational:

local model file placement
app-managed model directories
separating model transfer from app installation
avoiding accidental app bloat from bundling giant GGUF files
watcher session tracing
result recording

This was not the glamorous part, but it was necessary. Local AI on mobile is not just about the model. It is also about packaging, transfer, storage, and runtime discipline.
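
One way to keep a large GGUF file out of the app bundle is to look for it in an app-managed directory at runtime and report clearly when it is not there. The sketch below assumes that approach; `ModelLookup`, `modelsDirectory`, `locateModel`, the "Models" folder, and the `model.gguf` file name are placeholders, not OIC's real layout. The base directory is passed in, so it could be whichever app directory the model transfer step writes to.

```swift
import Foundation

// Hypothetical helpers; directory layout and file name are placeholders.

enum ModelLookup: Equatable {
    case found(URL, bytes: Int64)
    case missing(expectedAt: URL)
}

/// Returns an app-managed "Models" directory under the given base, creating it if needed.
func modelsDirectory(under base: URL) throws -> URL {
    let dir = base.appendingPathComponent("Models", isDirectory: true)
    try FileManager.default.createDirectory(at: dir, withIntermediateDirectories: true)
    return dir
}

/// Reports whether the model file is where the app expects it, and how large it is.
func locateModel(named fileName: String, under base: URL) throws -> ModelLookup {
    let url = try modelsDirectory(under: base).appendingPathComponent(fileName)
    guard FileManager.default.fileExists(atPath: url.path) else {
        return .missing(expectedAt: url)
    }
    let attrs = try FileManager.default.attributesOfItem(atPath: url.path)
    let bytes = (attrs[.size] as? NSNumber)?.int64Value ?? 0
    return .found(url, bytes: bytes)
}

// In an app, `base` might come from:
// FileManager.default.url(for: .applicationSupportDirectory, in: .userDomainMask,
//                         appropriateFor: nil, create: true)
// A temporary directory is used here so the sketch runs anywhere.
let base = FileManager.default.temporaryDirectory
    .appendingPathComponent("oic-demo-\(UUID().uuidString)")

let before = try locateModel(named: "model.gguf", under: base)

// Simulate the separate transfer step with 4 placeholder bytes (not a real model).
let modelURL = try modelsDirectory(under: base).appendingPathComponent("model.gguf")
try Data([0x47, 0x47, 0x55, 0x46]).write(to: modelURL)

let after = try locateModel(named: "model.gguf", under: base)
print(before, after)
```

Returning `.missing(expectedAt:)` instead of failing silently makes the "can the phone actually see the file" question answerable from a log line.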

5. I learned that one must be able to verify every step

That is the biggest lesson I learned from this attempt.

I had to add traces to tell me exactly where the app was in the pipeline:

the camera opened
the model loaded
the first frame was captured
the first frame actually reached the model

Those are not the same milestone.

Before I had that level of tracing, I made a few false starts down the wrong paths. Without it, I could easily have mistaken motion in the app for progress in the inference loop.
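
A trace like this can be very small. The following is a minimal sketch, assuming the four milestones listed above are recorded as discrete events; `PipelineStage`, `TraceEvent`, and `WatcherTrace` are hypothetical names, not the types used in OIC.

```swift
import Foundation

// Hypothetical trace model; stage names mirror the four milestones listed above.

enum PipelineStage: CaseIterable {
    case cameraOpened
    case modelLoaded
    case firstFrameCaptured
    case firstFrameReachedModel
}

struct TraceEvent {
    let stage: PipelineStage
    let at: Date
}

struct WatcherTrace {
    private(set) var events: [TraceEvent] = []

    /// Records that a stage was actually observed, with a timestamp.
    mutating func record(_ stage: PipelineStage, at date: Date = Date()) {
        events.append(TraceEvent(stage: stage, at: date))
    }

    /// The first stage (in declaration order) with no recorded event.
    var firstMissingStage: PipelineStage? {
        let seen = Set(events.map { $0.stage })
        return PipelineStage.allCases.first { !seen.contains($0) }
    }
}

var trace = WatcherTrace()
trace.record(.cameraOpened)
trace.record(.modelLoaded)
trace.record(.firstFrameCaptured)

print(trace.firstMissingStage.map { "stalled before: \($0)" } ?? "all milestones verified")
// prints "stalled before: firstFrameReachedModel"
```

The point of `firstMissingStage` is that it names the earliest unverified milestone, so activity in later parts of the app cannot be read as progress.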

What Did Not Work

I did not get to the point where I could show that a camera frame from the cat-door watcher was successfully handed into Gemma for image-conditioned inference and returned a usable result.

That is the missing milestone.

More specifically:

I got local GGUF runtime startup working on the phone.
I got the cat-door watcher path into the app.
I got camera start and frame capture traces.
I did not get a verified end-to-end multimodal first-frame Gemma inference result.

That turned out to be the line between "interesting prototype" and "working general watcher."

Two More Technical Lessons

Loading a model is not the same as having a working watcher

Model startup was only the beginning.

A watcher still needed:

a camera frame
conversion into the right image representation
a multimodal call path
structured output
traceable timing
behavior stable enough to repeat

Getting the model to load was progress, but it was not proof that the watcher loop worked.
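
The "structured output" item is one piece that can be illustrated without a model at all. The sketch below assumes the watcher asks the model for a single label and refuses anything else; `ParsedReply`, `parseReply`, and the cat-door label names are hypothetical, and the raw strings are made-up examples rather than real Gemma output.

```swift
import Foundation

// Hypothetical parser; the label names and the one-label reply format are assumptions.

enum ParsedReply: Equatable {
    case label(String)
    case unusable(raw: String)
}

/// Accepts a reply only if it normalizes to exactly one of the allowed labels.
func parseReply(_ raw: String, allowed: [String]) -> ParsedReply {
    let cleaned = raw
        .trimmingCharacters(in: .whitespacesAndNewlines)
        .trimmingCharacters(in: CharacterSet(charactersIn: "\"'.`"))
        .lowercased()
    return allowed.contains(cleaned) ? .label(cleaned) : .unusable(raw: raw)
}

let catLabels = ["cat_went_out", "cat_came_home", "no_change"]

// A reply with stray quotes, punctuation, and whitespace still maps to a label.
let good = parseReply("  \"cat_came_home\".\n", allowed: catLabels)

// Free-form prose is rejected rather than guessed at.
let bad = parseReply("I think the cat may have gone outside.", allowed: catLabels)

print(good, bad)
```

Keeping the raw text in the `.unusable` case means a failed parse can still be written into the watcher trace for later inspection.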

Local AI on mobile is also a deployment problem

Some of the hardest issues had nothing to do with model intelligence:

app size exploding when model files were bundled into the app
iPhone storage pressure
Finder and file-sharing friction
making sure the phone could actually see the model files where the app expected them

That was a good reminder that on-device AI is not just about whether a model can run. It is also about whether the whole system can be deployed, managed, and repeated cleanly on the device.

Why I Still Think Gemma 4 Is a Strong Fit

Even though I did not finish the cat-door watcher, I still think Gemma 4 is the right kind of model family for this project.

OIC is not trying to be a chat app. It is trying to be a local, focused, scene-aware watcher.

That means I care about:

on-device inference
a narrow control loop
promptable behavior
multimodal reasoning
reusing one product loop across different watcher tasks

That is why Gemma still feels like such a good fit for the idea. If a local model can be instructed well enough, then OIC may not need a separate hand-built algorithm for every scene it watches.

Technical Baseline

Target model path: Gemma 4 GGUF running through llama.cpp
Local host: iPhone
Runtime engine: llama.cpp iOS XCFramework integration
Camera pipeline: AVFoundation capturing real-time frames
Current status: runtime startup works locally; verified first-frame multimodal inference is still the missing step

What I Would Do Next

The next step is not "add more AI."

It is narrower and more technical:

verify that a GGUF-compatible multimodal iOS path exists for the current model assets
get one camera frame into that path
record the exact result in a watcher trace
only then measure latency, cadence, and whether the watcher is practical

That sequence follows directly from the biggest lesson in this project: verify the result of each step before moving on to the next one.

If that works, OIC gets much closer to what I originally wanted: a local watcher loop that can be retargeted by instruction instead of rebuilt from scratch for every new task.

Final Take

This attempt has not yet produced a finished Gemma-powered cat-door watcher.

What it did produce was:

a working toast watcher as the baseline
an attempt to generalize that watcher architecture
an on-device Gemma runtime path on iPhone
a clearer understanding of what local multimodal product work actually demands

I started by hoping Gemma 4 would let me turn OIC into a general "watch it for me" agent.

Trying to do it showed me exactly where the next barrier is.

For OIC, that barrier is the first verified multimodal frame.

DEV Community