What Gemma 4 taught me about the real future of on-device AI, and why I owe Seyi an apology
This is a submission for the Gemma 4 Challenge: Write about Gemma 4
I poured cold water on an intern's excitement about a powerful AI model that could run entirely on a phone. That was a few weeks ago, and now, I am the intern.
Let me explain. Seyi is a mechatronics student at Bells University, Nigeria, doing an internship at my workplace. He came in buzzing one morning about Gemma 4, Google's latest open-weight language model, capable of running completely offline on consumer hardware. His eyes were wide; mine were narrowed. My internal monologue went something like: why would I want to downgrade to a local model when I have cloud access to far more powerful ones? Privacy concerns didn't register for me. Usage limits weren't a real problem on my Google One plan. And an offline model, by definition, is frozen in time, limited to whatever it knew at training.
"Where's the upside?" I asked.
I said all of this out loud. Seyi nodded, deflated, and went back to his desk. The universe, as it often tends to do, had a rebuttal scheduled for later that week.
The Argument I Didn't Expect to Lose
A few days later, I was browsing Martin Sauter's blog. He's a telecom author whose book remains one of my favourites, and I check in occasionally even when the content goes well over my head. He'd started a thread on running local AI, with terms like OpenWebUI and RAG being thrown around and discussions about making offline models more dynamic through external connectivity.
Something clicked in my mind. I've been thinking for a while about using AI for specialised tasks at work, but corporate environments don't always play nicely with public cloud AI. Uploading proprietary documents to a public model is a legitimate concern. A capable model running on a personal device, answering to no server and leaving no data trail, was beginning to sound less like a downgrade and more like a completely different category of tool.
Then I saw a LinkedIn post about the AI Edge Gallery, an Android app from Google that lets you run Gemma 4 directly on your phone. No cloud, no subscription, no data leaving the device.
That was it for me. I was in.
More Than a Chatbot in Your Pocket
Before we go further, let me quickly explain what Gemma 4 on AI Edge Gallery actually is, because it is not simply "a smaller ChatGPT that works offline." That framing undersells what's actually available here. While Gemma 4 comes in four variants, two of them are light enough to run on a phone: the E2B, optimised for speed on mid-range devices, and the E4B, a smarter model targeted at modern phones with 8GB of RAM or more. I run a Google Pixel 8 Pro with 12GB of RAM, so the E4B was my lane.
The Edge Gallery app organises the capabilities of the E2B and E4B models into distinct workspaces. There's AI Chat with Thinking Mode, which shows you the model's step-by-step reasoning in real time (a genuinely useful window into how it arrives at answers). There's Ask Image, which lets the model read and analyse photos from your camera or gallery, entirely offline. There's Audio Scribe for transcription and translation. And then there's Agent Skills, the feature that consumed most of my attention.
Agent Skills are best explained with an analogy. Gemma 4 out of the box is a brilliant person who can hold a nuanced conversation. Skills are what you hand that person before the conversation starts, like a calculator, a specialist reference manual, or a set of behavioural instructions. The model reads a menu of available skills at the start of each session, determines which one is relevant to your request, and activates it. The core file that defines each skill is called a SKILL.md, a plain text document with a metadata block at the top (a name and a trigger description) and freeform instructions below. For a text-only skill, that file is the whole thing. No coding required. Anyone who can write clearly can, in theory, build a custom AI skill.
That phrase "in theory" is pushing it a bit. More on that shortly.
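To make the file layout concrete, here is a minimal Python sketch of how a skill's metadata block could be separated from its instructions and turned into a one-line menu entry. It assumes the metadata sits between two `---` lines as simple `key: value` pairs with the keys `name` and `description`. The sample skill text and the function names are hypothetical, and this is not a description of how AI Edge Gallery itself parses the file.

```python
# Hypothetical skill file; the '---' delimiters and key names are assumptions.
SAMPLE_SKILL = """---
name: style-rewriter
description: Rewrite source material in the author's newsletter voice.
---
You are a telecom writer with a conversational, analytical voice.
Open with a personal entry point.
"""


def split_skill(text):
    """Return (metadata dict, instruction body) from a skill file's text."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text.strip()  # no metadata block found
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        return {}, text.strip()  # unterminated block: treat it all as body
    metadata = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return metadata, body


def menu_entry(metadata):
    """One-line summary of the kind a skill menu might list."""
    return f"{metadata.get('name', '?')}: {metadata.get('description', '')}"


meta, body = split_skill(SAMPLE_SKILL)
print(menu_entry(meta))
```

The point of the split is that the short name and description are what the model has to choose between, while the freeform instructions only matter once a skill has been picked.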
My First Experiment: Teaching Gemma to Write Like Me
As a telecom professional who writes about the industry, I run a LinkedIn newsletter called Signal Over Noise, and my first instinct was to test Gemma's ability to replicate my writing style. Not to replace my writing, but to see if a local model could serve as a personal writing assistant, one that knows my voice well enough to be actually useful when I hand it raw material to reshape.
The process of building the skill itself was fascinating. I worked with Claude to do a detailed stylistic analysis of my previous non-AI-assisted articles, creating twelve touchpoints covering voice, rhythm, structure, openings, analogies, vocabulary, signature moves, and more. The analysis surfaced consistent fingerprints: the conversational-analytical blend, the personal entry point as a default opening, the pop-culture analogies with Nigerian context, invented compound words, and the self-aware aside that steps just outside the narrative. That profile became a SKILL.md file - my personal, reusable, installable instruction set that could theoretically rewrite any source material in a documented style, calibrated by adherence level.
Then I tried to run it. And Gemma 4, to its credit, was very honest about its limitations.
Three Lessons From a Phone That Crashed, Looped, and Gave Up Mid-Sentence
Lesson One: Instruction Footprint Matters More Than You Think
The first time I tried to activate my style skill on the E2B and E4B variants, the app crashed immediately. Not "it gave a bad output." It crashed repeatedly.
The root cause, as I later understood it, is that local models operate within strict, hardware-enforced RAM allocations. The original instruction file was verbose: exhaustive descriptions for every adherence level of all twelve style touchpoints. Before the model could process a single user message, the system had to load all of those instructions into its working memory. The sheer volume saturated the phone's RAM allocation before the engine could even initialise, and the application stopped working.
The fix was aggressive compression: stripping the descriptive matrices, keeping only the core tenets, reducing the file by roughly 60%, and bringing the system-level instructions under the memory threshold. It worked, sorta. The app stopped crashing.
But this was also the first clear signal: cloud-model habits do not translate to mobile hardware.
Lesson Two: Structured Instructions Apparently Confuse Small Models?
With a leaner file, the skill loaded. But when I prompted it to rewrite a long technical article on 5G network architecture, the model didn't write anything. It produced this instead:
<|tool_call>call:run_intent{intent:"dabs-style-rewriter", parameters:{"Source Material": "..."
The model had looked at an instruction file formatted with numbered steps, labelled inputs, and a structured schema and concluded it was not a writing persona but an API gateway. It was trying to route data to another application that didn't exist. Instead of channelling a writer's voice, it was behaving like a function router.
The solution was to flip the entire framing from "Steps and Inputs" to "Be this person immediately." Move away from programmatic structure, toward a pure system instruction that establishes persona from the first line. The model responds to being told who to be far better than it responds to being told what to do in sequence.
Lesson Three: Small Models Have a Working Memory Ceiling…and It Shows
With the condensed, persona-first version of the skill, I finally got output. But it didn't quite sound like me, and it stopped abruptly mid-sentence. What was happening, as best I can understand it, was context exhaustion: the model's working memory (its KV cache, if you want the technical term) was completely filled by the combination of the instruction file and the dense 700-word 5G article I'd handed it as source material. With no headroom left for generation, it began to stutter, outputting duplicated fragments like "all the* the*", before hitting its timeout and cutting out entirely.
This is the fundamental gap between a local mobile model and a large cloud LLM. A cloud model can hold hundreds of complex rules, a dense technical source, and an elaborate creative brief simultaneously and produce something nuanced that balances all of them. A 4-billion-parameter model on a phone cannot. Not even close. The reasoning gap is not a flaw to be patched in a software update; it is just a consequence of physics and silicon.
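A rough way to see context exhaustion coming is to budget the window before sending anything. The sketch below is back-of-envelope arithmetic only. It assumes about 1.3 tokens per English word (a rule of thumb, not Gemma's actual tokenizer), a hypothetical 2,048-token window, and a hypothetical 600-word skill file. Only the 700-word article length comes from my experiment.

```python
TOKENS_PER_WORD = 1.3  # rough rule of thumb for English, not Gemma's tokenizer


def estimate_tokens(word_count):
    """Approximate token count from a word count."""
    return round(word_count * TOKENS_PER_WORD)


def output_headroom(context_window, instruction_words, source_words):
    """Tokens left for the model's reply after the prompt is loaded."""
    used = estimate_tokens(instruction_words) + estimate_tokens(source_words)
    return context_window - used


def fits(context_window, instruction_words, source_words, reply_words):
    """True if the prompt plus the expected reply fit inside the window."""
    headroom = output_headroom(context_window, instruction_words, source_words)
    return headroom >= estimate_tokens(reply_words)


# Illustrative numbers only: a hypothetical 2,048-token window, a
# hypothetical 600-word skill file, and a 700-word article that should
# come back at roughly the same length.
WINDOW = 2048
print(output_headroom(WINDOW, 600, 700))
print(fits(WINDOW, 600, 700, 700))
print(fits(WINDOW, 150, 700, 700))
```

With these made-up numbers, the long skill file leaves too little room for a same-length rewrite, while a skill trimmed to 150 words just fits. The real thresholds depend on the model build and app settings, so treat this as a way of thinking about the budget rather than a measurement.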
So What Are Small Local Models Actually Good For?
Here is the reframe that changed how I think about all of this and, I'd argue, the more interesting question...
Forcing a small local model to perform complex creative persona replication is the wrong brief entirely. It's like hiring a specialist fabricator to write you a novel. The skills aren't absent; they're just misallocated.
What local mobile models genuinely excel at is structured data, deterministic rules, and zero-hallucination utility work. They are fast, private, and always available. The question isn't "can they replicate a cloud model's creative output?" The question is "what tasks benefit from a local, lightweight, always-on intelligent layer?"
One practical answer I landed on came from my own life. I use a solar-powered inverter setup at home. Managing backup power means keeping an eye on cloud cover, grid stability, and battery load, and I sometimes have to call my wife from work to ask her to check conditions and act accordingly. A JavaScript-enabled Agent Skill on my phone could probably handle this elegantly.
Here's how: JavaScript Skills (as opposed to the text-only skills described earlier) give the AI the ability to execute code within a hidden browser environment on the phone. The model's job is reduced to one thing: capturing the user's natural language input, like a neighbourhood name.
A script then takes that variable, calls a live weather API, retrieves real-time cloud cover data, applies the relevant electrical formulas within JS, and returns a plain-language recommendation. "Turn on the deep freezer. Solar output is adequate." No creative reasoning required and no hallucination risk. The AI is the natural language interface; the code is the engine.
That division of labour, model as conversational intake and code as execution layer, in my opinion, is where mobile AI actually earns its keep.
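Here is a minimal sketch of that split, written in Python purely to show the logic (the actual skill would be JavaScript). Everything numeric is an assumption for illustration: the linear cloud-cover formula, the 0.75 derating factor, the panel rating, and the load figures are placeholders rather than my real setup, and `fetch_cloud_cover` is a stand-in that returns canned values instead of calling a weather API.

```python
def estimate_solar_watts(panel_rating_w, cloud_cover_pct, derate=0.75):
    """Crude estimate: output falls linearly with cloud cover (assumption)."""
    clear_fraction = 1 - cloud_cover_pct / 100
    return panel_rating_w * derate * clear_fraction


def recommend(cloud_cover_pct, panel_rating_w=3000, base_load_w=600,
              freezer_load_w=300):
    """Plain-language advice from deterministic arithmetic."""
    available = estimate_solar_watts(panel_rating_w, cloud_cover_pct)
    if available >= base_load_w + freezer_load_w:
        return "Turn on the deep freezer. Solar output is adequate."
    return "Leave the deep freezer off. Solar output is too low right now."


def fetch_cloud_cover(neighbourhood):
    """Stand-in for the live weather API call; returns canned values."""
    canned = {"Estate A": 20, "Estate B": 85}  # hypothetical readings, percent
    return canned[neighbourhood]


def handle_request(neighbourhood):
    """The model supplies only the neighbourhood; code does the rest."""
    return recommend(fetch_cloud_cover(neighbourhood))


print(handle_request("Estate A"))
print(handle_request("Estate B"))
```

Nothing in `recommend` depends on the model's reasoning. Given the same cloud cover figure it returns the same sentence every time, which is the property that makes this kind of task a good fit for a small on-device model.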
What This Means for the Bigger Picture
Step back for a moment and consider what is actually happening here. A model capable enough to understand nuanced instructions, hold a meaningful conversation, and trigger appropriate specialised behaviours, all without sending a single byte of your data to any server, now fits inside a device you carry in your pocket.
That is not a minor development.
The immediate practical implications are clearest in environments where cloud AI is restricted: corporate networks, regulated industries, and low-connectivity regions. For anyone in those categories, Gemma 4 on a capable Android phone is not a compromise. It's the first viable option. And for the African context specifically, where data costs remain non-trivial, and cloud latency is a genuine UX concern, the case for capable on-device inference is stronger than it might appear from a high-bandwidth Western baseline.
But the more interesting implication is what it signals about the direction of travel. The bottlenecks I encountered (memory saturation, intent looping, context exhaustion) are real, but they are also well-understood engineering problems.
Model quantisation is getting better. Mobile hardware is also improving. The techniques for writing efficient, structured skills that play to a local model's strengths rather than fighting its limitations are learnable, and they're becoming better documented. The floor on what a phone-based AI can do in 2026 is already higher than most people realise. In a few years, the question won't be whether it's capable enough, but whether the cloud model you already use offers enough additional value to justify the dependency.
I am not suggesting local models will replace cloud AI. That would be a lazy take, and I've already used up my quota of lazy takes in this article (see: Seyi, opening paragraphs). What I am saying is that the role of on-device AI is transitioning from novelty to infrastructure, from "cool demo on a tech blog" to a genuine utility layer for people who need private, fast, and accessible intelligence built into their workflows.
Closing Thoughts: Seyi Was Right
The arc of this whole experiment (from skeptic to cautious convert) tracks almost perfectly with how genuinely disruptive technologies tend to arrive. Not with an obvious, immediate use case that justifies the hype, but with a quiet accumulation of practical realisations that eventually tip into "oh, this is actually something."
My style-replication skill didn't work the way I hoped. The model crashed, looped, and truncated. But in failing those tests, it showed me exactly where its real value lies, and gave me a concrete mental model for building agent skills that actually perform. The solar power assistant is on my list. A RAN architecture reference tool for work is also on the list, one that accepts base station design inputs and returns equipment specifications based on local rules, privately, offline. These are not glamorous AI use cases. They are, however, genuinely useful ones.
On-device AI is not the future of AI. It's the future of how AI fits into ordinary life - quiet, local, and practical, like a good tool should be.
Seyi, if you're reading this: I owe you one.






