This is a submission for the Gemma 4 Challenge: Write About Gemma 4
What I Learned Trying to Put Gemma 4 Into a Local iPhone Watcher
I already had a small iPhone app called OIC, short for "Oh, I See."
In its current working form, OIC can watch my toaster and tell me when my toast is ready. It uses a custom, hand-rolled local vision model that serves the same architectural role in OIC that I hoped Gemma 4 could serve more generally.
What I wanted next was to see whether OIC could become something broader: a general "watch it for me" agent. Instead of writing a separate detector for every situation, I wanted to find out whether a small multimodal Gemma 4 model running locally on the phone could let the same watcher loop handle different tasks through instruction.
The use case I wanted to add was tracking my cat.
My cat likes to go outside, but it is not trained to come home on its own. Sometimes I have to go find it. Sometimes it has already come back, and I waste time and worry looking for it when I did not need to. That felt like a very good use case for a local visual watcher:
- watch the back door
- detect whether the cat went out or came home
- keep a record of the state
I did not get that full Gemma-powered cat-door watcher working in time for this challenge. But I did get far enough to learn something important about local multimodal models, iPhone deployment, and what it really takes to turn a narrow watcher into a general one.
The Core Concept: The Watcher Architecture
The core loop for a watcher is simple:
- watch a scene
- interpret what matters in that scene
- decide whether an event happened
- update state and notify the user if needed
What changes from one watcher to another is not the loop. What changes is:
- what the user wants watched
- what events matter
- what labels the watcher should return
- what visual reasoning is needed
That is why Gemma 4 interested me.
If a small multimodal model could run locally on the phone and follow instructions well enough, OIC could become a more general visual watcher:
- "Watch the toaster and tell me when the toast is ready."
- "Watch the back door and tell me whether my cat is outside or back home."
- "Watch other scenes that can be described simply enough."
Same loop. Different watched target. Different instruction.
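As a rough illustration of that loop, here is a minimal Swift sketch. It is not OIC's actual code: `Frame`, `CatDoorLabel`, `CatState`, and `Watcher` are hypothetical names, and the `interpret` closure is a stand-in for whatever model does the interpreting, whether the hand-rolled toast model or Gemma 4.

```swift
import Foundation

// Hypothetical stand-in for a camera frame; a real app would wrap pixel data.
struct Frame {
    let id: Int
}

// Hypothetical labels a cat-door watcher might return.
enum CatDoorLabel {
    case catWentOut, catCameHome, nothing
}

// State kept between frames.
enum CatState: String {
    case unknown, inside, outside
}

struct Watcher {
    // The instruction is what retargets the same loop to a different task.
    let instruction: String
    // Stand-in for the model call: frame plus instruction in, label out.
    let interpret: (Frame, String) -> CatDoorLabel
    var state: CatState = .unknown
    var notifications: [String] = []

    init(instruction: String, interpret: @escaping (Frame, String) -> CatDoorLabel) {
        self.instruction = instruction
        self.interpret = interpret
    }

    // One pass of the loop: interpret, decide, update state, notify on change.
    mutating func step(_ frame: Frame) {
        let label = interpret(frame, instruction)
        let next: CatState
        switch label {
        case .catWentOut: next = .outside
        case .catCameHome: next = .inside
        case .nothing: next = state
        }
        if next != state {
            state = next
            notifications.append("Cat is now \(next.rawValue)")
        }
    }
}

// Usage with a fake interpreter that "sees" the cat leave on frame 2.
var watcher = Watcher(instruction: "Watch the back door and tell me whether my cat is outside or back home.") { frame, _ in
    return frame.id == 2 ? CatDoorLabel.catWentOut : CatDoorLabel.nothing
}
for id in 1...3 {
    watcher.step(Frame(id: id))
}
print(watcher.state.rawValue, watcher.notifications)
```

The point of the closure is that the loop does not care what sits behind it. Swapping the toast model for an instructed multimodal model would change the interpreter, not the loop.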
What Was Already Working
Before I tried Gemma, OIC already had one working watcher: toast.
I was not starting from a blank AI demo. I already had:
- an iPhone app
- a camera loop
- a working toast-monitoring path
- an alert flow
The toast watcher is still the cleanest baseline in the project. It is narrow, controlled, and useful. It also made the Gemma experiment more interesting, because I was not asking whether AI could solve a toy problem. I was asking whether a working narrow watcher could be extended into a more general one without losing its local-first character.
What I Actually Accomplished
I did not finish the cat-door watcher, but I did accomplish several things that matter.
1. I refactored OIC from a toast-specific app toward a watcher architecture
That included:
- watcher specifications
- watcher labels
- watcher selection in the app
- a path for multiple watcher types
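To make those pieces concrete, here is one way a watcher specification could look in Swift. This is a sketch under my own assumptions, not the shape OIC actually uses: `WatcherSpec`, `WatcherRegistry`, the label strings, and the prompt wording are all illustrative.

```swift
import Foundation

// Hypothetical shape of a watcher specification; field names are illustrative.
struct WatcherSpec {
    let id: String
    let instruction: String
    let labels: [String]

    // Builds the text given to the model, constraining it to the spec's labels.
    func prompt() -> String {
        let options = labels.joined(separator: ", ")
        return "\(instruction)\nAnswer with exactly one of: \(options)"
    }
}

// Hypothetical registry backing watcher selection in the app.
enum WatcherRegistry {
    static let all: [String: WatcherSpec] = [
        "toast": WatcherSpec(
            id: "toast",
            instruction: "Watch the toaster and tell me when the toast is ready.",
            labels: ["not_ready", "ready"]
        ),
        "cat_door": WatcherSpec(
            id: "cat_door",
            instruction: "Watch the back door and tell me whether my cat is outside or back home.",
            labels: ["cat_went_out", "cat_came_home", "no_change"]
        ),
    ]

    // Returns the spec for the chosen watcher, or nil for an unknown id.
    static func select(_ id: String) -> WatcherSpec? {
        return all[id]
    }
}

// Usage: pick a watcher and show the prompt it would send.
if let spec = WatcherRegistry.select("cat_door") {
    print(spec.prompt())
}
```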
2. I kept the existing toast watcher working while extending the architecture
The toast watcher is not just a demo. It is the working baseline.
3. I got a Gemma GGUF runtime working locally on the iPhone
This was a concrete milestone.
I integrated a local llama.cpp iOS XCFramework path, set up app-local model handling, and got the app to load the Gemma GGUF model on-device.
That did not mean the full watcher worked. It did show that OIC could host a Gemma runtime locally on the phone instead of depending on a cloud loop.
4. I built the plumbing that a local watcher actually needs
A lot of the work ended up being operational:
- local model file placement
- app-managed model directories
- separating model transfer from app installation
- avoiding accidental app bloat from bundling giant GGUF files
- watcher session tracing
- result recording
This was not the glamorous part, but it was necessary. Local AI on mobile is not just about the model. It is also about packaging, transfer, storage, and runtime discipline.
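A small Swift sketch of the model-placement part, using only Foundation's `FileManager`. The `ModelStore` type, the `Models` folder name, and the `demo.gguf` file are hypothetical. The example runs against a temporary directory, whereas on the phone the root would be an app-managed directory outside the app bundle.

```swift
import Foundation

// Hypothetical helper that keeps model files in an app-managed directory
// instead of bundling them into the app.
struct ModelStore {
    let root: URL

    var modelsDirectory: URL {
        return root.appendingPathComponent("Models", isDirectory: true)
    }

    func url(forModel name: String) -> URL {
        return modelsDirectory.appendingPathComponent(name)
    }

    // Creates the models directory if it does not exist yet.
    func prepare() throws {
        try FileManager.default.createDirectory(
            at: modelsDirectory,
            withIntermediateDirectories: true
        )
    }

    // Size in bytes if the model file is where the app expects it, else nil.
    func installedSize(ofModel name: String) -> Int? {
        let path = url(forModel: name).path
        guard let attrs = try? FileManager.default.attributesOfItem(atPath: path),
              let size = attrs[.size] as? NSNumber else {
            return nil
        }
        return size.intValue
    }
}

// Usage against a temporary directory with a tiny placeholder file.
let root = FileManager.default.temporaryDirectory
    .appendingPathComponent(UUID().uuidString, isDirectory: true)
let store = ModelStore(root: root)
try store.prepare()
try Data("abc".utf8).write(to: store.url(forModel: "demo.gguf"))
print(store.installedSize(ofModel: "demo.gguf") ?? -1)
```

Checking for the file and its size before trying to load it turns "the phone cannot see the model" into an explicit, traceable failure rather than a silent one.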
5. I learned that one must be able to verify every step
That is the biggest lesson I learned from this attempt.
I had to add traces to tell me exactly where the app was in the pipeline:
- the camera opened
- the model loaded
- the first frame was captured
- the first frame actually reached the model
Those are not the same milestone.
Before I had that level of tracing, I made a few false starts down the wrong paths. It would have been easy to mistake motion in the app for progress in the inference loop.
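The milestones listed above can be modeled as an ordered trace. The Swift sketch below is illustrative only. `Milestone` and `SessionTrace` are hypothetical names, not OIC's tracing code. The idea is that the trace reports the first milestone that has not been verified yet.

```swift
import Foundation

// Hypothetical milestone names mirroring the trace points listed above,
// declared in pipeline order.
enum Milestone: Int, CaseIterable {
    case cameraOpened
    case modelLoaded
    case firstFrameCaptured
    case firstFrameReachedModel
}

struct SessionTrace {
    private(set) var reached: [Milestone: Date] = [:]

    init() {}

    // Records the first time a milestone is hit; later marks are ignored.
    mutating func mark(_ milestone: Milestone, at time: Date = Date()) {
        if reached[milestone] == nil {
            reached[milestone] = time
        }
    }

    // The earliest milestone with no record: where the pipeline really is.
    var firstMissing: Milestone? {
        return Milestone.allCases.first(where: { reached[$0] == nil })
    }
}

// Usage: three milestones verified, the fourth still open.
var trace = SessionTrace()
trace.mark(.cameraOpened)
trace.mark(.modelLoaded)
trace.mark(.firstFrameCaptured)
if let missing = trace.firstMissing {
    print("Next unverified milestone: \(missing)")
}
```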
What Did Not Work
I did not get to the point where I could show that a camera frame from the cat-door watcher was successfully handed into Gemma for image-conditioned inference and returned a usable result.
That is the missing milestone.
More specifically:
- I got local GGUF runtime startup working on the phone.
- I got the cat-door watcher path into the app.
- I got camera start and frame capture traces.
- I did not get a verified end-to-end multimodal first-frame Gemma inference result.
That turned out to be the line between "interesting prototype" and "working general watcher."
Two More Technical Lessons
Loading a model is not the same as having a working watcher
Model startup was only the beginning.
A watcher still needed:
- a camera frame
- conversion into the right image representation
- a multimodal call path
- structured output
- traceable timing
- behavior stable enough to repeat
Getting the model to load was progress, but it was not proof that the watcher loop worked.
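Of the items in that list, structured output is the easiest to sketch without a model in hand. The Swift example below assumes the model is asked to reply with a single label. `ParseResult`, `parseLabel`, and the label strings are hypothetical. Anything that is not exactly one allowed label is rejected rather than guessed at.

```swift
import Foundation

// Outcome of checking a raw model reply against the watcher's labels.
enum ParseResult: Equatable {
    case label(String)
    case rejected(String)
}

// Accepts a reply only if, after trimming and lowercasing, it is exactly
// one of the allowed labels. Everything else is kept verbatim as rejected.
func parseLabel(_ raw: String, allowed: [String]) -> ParseResult {
    let cleaned = raw
        .trimmingCharacters(in: .whitespacesAndNewlines)
        .lowercased()
    return allowed.contains(cleaned) ? .label(cleaned) : .rejected(raw)
}

// Usage with made-up replies.
let allowedLabels = ["cat_went_out", "cat_came_home", "no_change"]
print(parseLabel(" Cat_Went_Out\n", allowed: allowedLabels))
print(parseLabel("I think the cat went outside.", allowed: allowedLabels))
```

Keeping the rejected text means a watcher trace can show what the model actually said when it drifted off the label set.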
Local AI on mobile is also a deployment problem
Some of the hardest issues had nothing to do with model intelligence:
- app size exploding when model files were bundled into the app
- iPhone storage pressure
- Finder and file-sharing friction
- making sure the phone could actually see the model files where the app expected them
That was a good reminder that on-device AI is not just about whether a model can run. It is also about whether the whole system can be deployed, managed, and repeated cleanly on the device.
Why I Still Think Gemma 4 Is a Strong Fit
Even though I did not finish the cat-door watcher, I still think Gemma 4 is the right kind of model family for this project.
OIC is not trying to be a chat app. It is trying to be a local, focused, scene-aware watcher.
That means I care about:
- on-device inference
- a narrow control loop
- promptable behavior
- multimodal reasoning
- reusing one product loop across different watcher tasks
That is why Gemma still feels like such a good fit for the idea. If a local model can be instructed well enough, then OIC may not need a separate hand-built algorithm for every scene it watches.
Technical Baseline
- Target model path: Gemma 4 GGUF running through llama.cpp
- Local host: iPhone
- Runtime engine: llama.cpp iOS XCFramework integration
- Camera pipeline: AVFoundation capturing real-time frames
- Current status: runtime startup works locally; verified first-frame multimodal inference is still the missing step
What I Would Do Next
The next step is not "add more AI."
It is narrower and more technical:
- verify that a GGUF-compatible multimodal iOS path exists for the current model assets
- get one camera frame into that path
- record the exact result in a watcher trace
- only then measure latency, cadence, and whether the watcher is practical
That sequence follows directly from the biggest lesson in this project: verify the result of each step before moving on to the next one.
If that works, OIC gets much closer to what I originally wanted: a local watcher loop that can be retargeted by instruction instead of rebuilt from scratch for every new task.
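For the last step in that sequence, one simple way to frame "practical" is to compare per-frame latency against the cadence the watcher needs. The Swift sketch below is only an assumption about how I might check it. The function names are hypothetical, and the numbers are placeholders, not measurements from OIC.

```swift
import Foundation

// Median of per-frame inference times, in seconds. Nil for an empty list.
func median(_ values: [Double]) -> Double? {
    guard !values.isEmpty else { return nil }
    let sorted = values.sorted()
    let mid = sorted.count / 2
    if sorted.count % 2 == 0 {
        return (sorted[mid - 1] + sorted[mid]) / 2
    }
    return sorted[mid]
}

// Treats the watcher as practical if the median latency fits inside the
// interval between frames it is expected to check.
func isPractical(latencies: [Double], cadence: Double) -> Bool {
    guard let typical = median(latencies) else { return false }
    return typical <= cadence
}

// Placeholder numbers for illustration only.
let placeholderLatencies = [2.0, 1.0, 4.0]
print(isPractical(latencies: placeholderLatencies, cadence: 5.0))
```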
Final Take
This attempt did not produce a finished Gemma-powered cat-door watcher yet.
What it did produce was:
- a working toast watcher as the baseline
- an attempt to generalize that watcher architecture
- an on-device Gemma runtime path on iPhone
- a clearer understanding of what local multimodal product work actually demands
I started by hoping Gemma 4 would let me turn OIC into a general "watch it for me" agent.
Trying to do it showed me exactly where the next barrier is.
For OIC, that barrier is the first verified multimodal frame.