This is a submission for the Gemma 4 Challenge: Build with Gemma 4
The worst thing that can happen to you in the unstable world of 2026 is losing access to the AI you've already gotten used to.
Cloud models mean a dependency on big tech. Local models are expensive to deploy, and even when the answers are good, they arrive slowly.
Then in 2026 Google drops the Gemma 4 family. Honestly, after a few experiments with LLMs I had consciously closed the door on local models. But the inference speed on my far-from-flagship Honor 200 genuinely impressed me. I tested it through Google AI Edge Gallery, and that's when I realized the time had come.
This is access I never want to lose. Full control over the model, working-quality answers, and a speed that until recently was unthinkable on a phone with no internet.
That's when I decided to build Sanctum Machina — the Sanctuary of the Machine. A place in my pocket where one of the Gemma 4 models (E2B, E4B) always lives.
What I Built
Google AI Edge Gallery is a testbed: no persistent history, every chat behaves like incognito, and a lot of showcase weight. Sanctum Machina takes the best parts of that project and wraps them into a tool you actually want to use every day.
- Persistent multi-chat history with a sidebar (rename, delete, sorted by date)
- Quick-chat mode (incognito — nothing saved)
- Per-model inference settings: temperature, top-K, top-P, system prompt, accelerator (sketched in code right after this list)
- Multimodal input out of the box: text, image, short audio clip
- Per-message TTFT and decode tok/s in the chat footer
- Pre-flight RAM gate before model download
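
To make the inference-settings and metrics bullets concrete, here is a minimal Kotlin sketch assuming the MediaPipe LLM Inference API (tasks-genai), a common way to run Gemma on Android. Sanctum Machina itself runs on LiteRT-LM, so treat the names here as illustrative, not the app's actual code:

```kotlin
import android.os.SystemClock
import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession

fun runChatTurn(llm: LlmInference, prompt: String) {
    // Sampling knobs live on the session, so every chat keeps its own settings.
    val session = LlmInferenceSession.createFromOptions(
        llm,
        LlmInferenceSession.LlmInferenceSessionOptions.builder()
            .setTemperature(0.8f) // example values, not the app's defaults
            .setTopK(40)
            .setTopP(0.95f)
            .build()
    )
    session.addQueryChunk(prompt)

    // TTFT = time from request to first streamed chunk; decode speed is
    // measured over the chunks that follow it.
    val start = SystemClock.elapsedRealtime()
    var firstChunkAt = 0L
    var chunks = 0
    session.generateResponseAsync { _, done ->
        if (firstChunkAt == 0L) firstChunkAt = SystemClock.elapsedRealtime()
        chunks++ // rough proxy: one callback is roughly one decoded token
        if (done) {
            val ttftMs = firstChunkAt - start
            val decodeSecs =
                ((SystemClock.elapsedRealtime() - firstChunkAt) / 1000.0).coerceAtLeast(0.001)
            println("TTFT: $ttftMs ms, decode: ${"%.1f".format(chunks / decodeSecs)} tok/s")
        }
    }
}
```

Putting the sampling options on the session rather than the engine is what lets each chat carry its own temperature and system prompt without reloading the model.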
It also solves another familiar problem of on-device models — the cold-start delay from loading the model into RAM. Sanctum Machina warms the model up once, in the background, at app launch. After that you can spin up as many chats as you want with no extra wait.
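
Here is a minimal sketch of what that warm-up can look like, again assuming tasks-genai names and a made-up model path; the real app's wiring may differ:

```kotlin
import android.app.Application
import com.google.mediapipe.tasks.genai.llminference.LlmInference
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Deferred
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.async

class SanctumApp : Application() {
    private val appScope = CoroutineScope(SupervisorJob() + Dispatchers.Default)

    // Chats await this Deferred instead of loading the model themselves.
    lateinit var warmLlm: Deferred<LlmInference>
        private set

    override fun onCreate() {
        super.onCreate()
        // Kick off the expensive load once, in the background, at launch.
        warmLlm = appScope.async {
            LlmInference.createFromOptions(
                this@SanctumApp,
                LlmInference.LlmInferenceOptions.builder()
                    .setModelPath(filesDir.resolve("gemma-e2b.task").path) // assumed path
                    .setMaxTokens(1024)
                    .build()
            )
        }
    }
}
```

Any screen that needs the model just calls `warmLlm.await()`; by the time the user has typed a first message, the load has usually already finished.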
On top of that I deliberately cut the app off from the outside world: models can only be downloaded from a hard allowlist, and before any download the app checks whether the device can actually run that model.
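
A hedged sketch of that gate: a hard allowlist plus a pre-flight RAM check via `ActivityManager`. The URLs and RAM thresholds below are placeholders I made up, and the real check may be more nuanced:

```kotlin
import android.app.ActivityManager
import android.content.Context

data class AllowedModel(val url: String, val minTotalRamBytes: Long)

// Placeholder entries; only URLs on this list can ever be downloaded.
val ALLOWLIST = listOf(
    AllowedModel("https://example.com/gemma-e2b.task", 4L * 1024 * 1024 * 1024),
    AllowedModel("https://example.com/gemma-e4b.task", 6L * 1024 * 1024 * 1024),
)

fun canDownload(context: Context, url: String): Boolean {
    val model = ALLOWLIST.firstOrNull { it.url == url } ?: return false // not allowlisted
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val mi = ActivityManager.MemoryInfo().also { am.getMemoryInfo(it) }
    return mi.totalMem >= model.minTotalRamBytes // device can plausibly hold the model
}
```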
Demo
Code
https://github.com/FrowningMonk/sanctum-machina
How I Used Gemma 4
Sanctum Machina runs Gemma 4 E2B and E4B through LiteRT-LM. This tier is the only part of the Gemma 4 family that actually runs on a mid-range Android phone at usable speed. And multimodal input (text, image, short audio clip) is supported out of the box, which is rare for on-device models of this size.
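
For the image path, here is a minimal sketch of a multimodal turn, assuming the tasks-genai vision API shape (vision modality enabled per session; the engine also needs `setMaxNumImages` at creation). As above, this is an illustration, not Sanctum Machina's actual code:

```kotlin
import android.graphics.Bitmap
import com.google.mediapipe.framework.image.BitmapImageBuilder
import com.google.mediapipe.tasks.genai.llminference.GraphOptions
import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession

// Assumes the LlmInference engine was created with setMaxNumImages(1).
fun describeImage(llm: LlmInference, bitmap: Bitmap): String {
    val session = LlmInferenceSession.createFromOptions(
        llm,
        LlmInferenceSession.LlmInferenceSessionOptions.builder()
            .setGraphOptions(
                GraphOptions.builder().setEnableVisionModality(true).build()
            )
            .build()
    )
    // One turn can mix text and an image.
    session.addQueryChunk("Describe what is in this photo.")
    session.addImage(BitmapImageBuilder(bitmap).build())
    return session.generateResponse()
}
```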
Full control over inference — and especially the system prompt — does a surprising amount of work on models this small. Prompt engineering is alive and well, and now there's a place to feel that in full. (I'm serious — try it and you'll be surprised what a system prompt can do to a model.)
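
On runtimes that don't expose a dedicated system-prompt field, one simple approach is to seed each session with the system prompt as its first chunk. A hypothetical sketch (the helper and the persona text are my own examples):

```kotlin
import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession

// Hypothetical helper: seed a fresh session with a "system prompt" chunk.
fun newPersonaSession(llm: LlmInference, systemPrompt: String): LlmInferenceSession {
    val session = LlmInferenceSession.createFromOptions(
        llm,
        LlmInferenceSession.LlmInferenceSessionOptions.builder().build()
    )
    // Every later query chunk is read in light of this persona.
    session.addQueryChunk(systemPrompt)
    return session
}

// Example persona; small models follow instructions like these surprisingly well.
// val session = newPersonaSession(llm,
//     "You are a terse field engineer. Answer in numbered steps, no filler.")
```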
In the next phase of the project I want to explore tools and an Agent mode using FunctionGemma 270M (the Mobile Actions fine-tune from litert-community). The technical side is essentially solved — I'm still looking for a use case on a phone where it earns its keep.
I'm also exploring the recently released Multi-Token Prediction (MTP) drafters for E2B and E4B: a 3x speedup at zero quality cost is exactly the kind of improvement on-device inference needs.
Now even if the world falls apart — the Spirit of the Machine will live in the Sanctuary of the Machine and keep working.
In the grim darkness of the far future, the Omnissiah remains with us.

