FrowningMonk

Posted on May 8 • Edited on May 17

Sanctum Machina: Gemma 4 in your pocket, forever

#devchallenge #gemmachallenge #gemma

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

The worst thing that can happen to you in the unstable world of 2026 is losing access to the AI you've already gotten used to.

Cloud models mean dependence on big tech. Local models are expensive to deploy, and even when the answers are good, they come slowly.

Then, in 2026, Google dropped the Gemma 4 family. Honestly, after a few experiments with LLMs I had consciously closed the door on local models. But the inference speed on my far-from-flagship Honor 200 genuinely impressed me. I tested it through Google AI Edge Gallery, and that's when I realized the time had come.

This is access I never want to lose. Full control over the model, working-quality answers, and a speed that until recently was unthinkable on a phone with no internet.

That's when I decided to build Sanctum Machina — the Sanctuary of the Machine. A place in my pocket where one of the Gemma 4 models (E2B, E4B) always lives.

What I Built

Google AI Edge Gallery is a testbed: no persistent history, every chat behaves like incognito, and much of the app is showcase material. Sanctum Machina takes the best parts of that project and wraps them into a tool you actually want to use every day.

Persistent multi-chat history with a sidebar (rename, delete, sorted by date)
Quick-chat mode (incognito — nothing saved)
Per-model inference settings: temperature, top-K, top-P, system prompt, accelerator
Multimodal input out of the box: text, image, short audio clip
Per-message TTFT and decode tok/s in the chat footer
Pre-flight RAM gate before model download
Projects with on-device RAG: attach PDFs, index them via GemmaEmbedding-300M, and every chat inside the project runs on that context — offline
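The TTFT and decode tok/s figures in the chat footer can be derived from three timestamps. Here is a minimal Kotlin sketch, assuming the app can record when the request started, when the first token arrived, and when the last token arrived. `GenerationTiming`, `ttftMs` and `decodeTokensPerSecond` are illustrative names, not taken from the Sanctum Machina code.

```kotlin
/** Timestamps in milliseconds from a monotonic clock (assumed to be recorded by the app). */
data class GenerationTiming(
    val requestStartMs: Long,
    val firstTokenMs: Long,
    val lastTokenMs: Long,
    val generatedTokens: Int,
)

/** Time to first token: prompt processing plus the first decode step. */
fun ttftMs(t: GenerationTiming): Long = t.firstTokenMs - t.requestStartMs

/**
 * Decode speed in tokens per second, measured after the first token so that
 * prompt processing does not skew the number. Returns null when there is
 * nothing to measure (a single token, or no elapsed time).
 */
fun decodeTokensPerSecond(t: GenerationTiming): Double? {
    val decodeMs = t.lastTokenMs - t.firstTokenMs
    val decodedTokens = t.generatedTokens - 1 // the first token belongs to TTFT
    if (decodedTokens <= 0 || decodeMs <= 0) return null
    return decodedTokens * 1000.0 / decodeMs
}
```

Keeping the two numbers separate matters on a phone: a long prompt inflates TTFT without saying anything about how fast the model decodes afterwards.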

It also solves another familiar problem of on-device models — the cold-start delay from loading the model into RAM. Sanctum Machina warms the model up once, in the background, at app launch. After that you can spin up as many chats as you want with no extra wait.
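The warm-up idea can be sketched as a holder that starts the load once and hands the same instance to every chat. This is a rough illustration using only the JVM standard library. `LoadedModel`, `ModelHolder` and the `loader` function are hypothetical stand-ins, and a real Android app would more likely use coroutines and its inference runtime's own engine type.

```kotlin
import java.util.concurrent.CompletableFuture

/** Stand-in for a loaded model; the real type would come from the inference runtime. */
class LoadedModel(val name: String)

/**
 * Loads the model at most once, on a background thread.
 * `loader` is a hypothetical blocking function that reads the weights into RAM.
 */
class ModelHolder(private val loader: () -> LoadedModel) {
    private var future: CompletableFuture<LoadedModel>? = null

    /** Call at app launch: starts the load without blocking the caller. Safe to call repeatedly. */
    @Synchronized
    fun warmUp(): CompletableFuture<LoadedModel> =
        future ?: CompletableFuture.supplyAsync { loader() }.also { future = it }

    /** Call when a chat opens: waits only if the background load has not finished yet. */
    fun get(): LoadedModel = warmUp().join()
}
```

Every chat that calls `get()` after the first load completes receives the already-loaded instance immediately, which is the "no extra wait" behavior described above.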

On top of that I deliberately cut the app off from the outside world: models can only be downloaded from a hard allowlist, and before any download the app checks whether the device can actually run that model.
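As a sketch of what such a gate could look like, the function below combines the allowlist check and the pre-flight RAM check into one decision. The URL, the RAM figures and all names are placeholders I made up for illustration. On Android the total RAM could be read from `ActivityManager.MemoryInfo.totalMem`, but how Sanctum Machina actually measures it is not stated here.

```kotlin
/** Hypothetical allowlist entry: an exact download URL plus the RAM the model is assumed to need. */
data class AllowedModel(val url: String, val minRamBytes: Long)

sealed interface DownloadDecision {
    object Allowed : DownloadDecision
    data class Rejected(val reason: String) : DownloadDecision
}

/** Rejects anything that is not HTTPS, not an exact allowlist match, or too big for the device. */
fun checkDownload(
    requestedUrl: String,
    allowlist: List<AllowedModel>,
    deviceRamBytes: Long,
): DownloadDecision {
    if (!requestedUrl.startsWith("https://")) {
        return DownloadDecision.Rejected("not HTTPS")
    }
    // Exact match only: no prefix or wildcard matching, so look-alike URLs cannot slip through.
    val entry = allowlist.firstOrNull { it.url == requestedUrl }
        ?: return DownloadDecision.Rejected("not on the allowlist")
    if (deviceRamBytes < entry.minRamBytes) {
        return DownloadDecision.Rejected("not enough RAM for this model")
    }
    return DownloadDecision.Allowed
}
```

Running the RAM check before the download, not after, saves the user a multi-gigabyte transfer that would only end in a failed load.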

Demo

Code

https://github.com/FrowningMonk/sanctum-machina

How I Used Gemma 4

Sanctum Machina runs Gemma 4 E2B and E4B through LiteRT-LM. This tier is the only part of the Gemma 4 family that actually runs on a mid-range Android phone at usable speeds. Multimodal input (text, image, short audio clip) is supported out of the box, which is rare for on-device models of this size.

Full control over inference — and especially the system prompt — does a surprising amount of work on models this small. Prompt engineering is alive and well, and now there's a place to feel that in full. (I'm serious — try it and you'll be surprised what a system prompt can do to a model.)

In the next phase of the project I want to explore tools and an Agent mode using FunctionGemma 270M (the Mobile Actions fine-tune from litert-community). The technical side is essentially solved — I'm still looking for a use case on a phone where it earns its keep.

Update — May 17, 2026

Two changes since the original post: I tried MTP and shelved it, and I shipped a bigger feature — Projects with on-device RAG. In order.

MTP: tried it, didn't land

Bumped litert-lm to 0.11.0 and enabled enableSpeculativeDecoding — the flag that turns on Multi-Token Prediction. On my Honor 200 I saw no speed improvement: tok/s stayed within measurement noise, sometimes slightly worse.

Something unexpected surfaced, though. In a handful of cases the outputs got noticeably cleaner. The random Chinese characters that occasionally slipped into responses stopped showing up, text recognition from photos got better, Markdown rendered more accurately. These are subjective observations, not measurements — and a caveat is in order: speculative decoding is lossless by design. The main model verifies every token the drafter proposes, so the final token distribution shouldn't change. The effect almost certainly came from the litert-lm upgrade itself, not from enabling MTP — I just moved two variables at once. But the picture in the moment was striking enough that I seriously considered leaving the flag on for quality alone.

I didn't. I went to the source README and found what I should have checked first: the advertised 3x speedup was reported on flagship devices. MTP puts extra work on the device (the drafter runs several tokens ahead, the main model verifies them), and a mid-range SoC doesn't have the headroom for it, so the overhead eats the win from fewer main-model steps. Net for me: cost without payoff. Extra RAM/CPU and occasionally worse tok/s in exchange for quality that, in all likelihood, was already there after the runtime upgrade.
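Both points, "lossless by design" and "extra work on the device", show up in a toy version of the draft-and-verify loop. The sketch below is not how litert-lm implements MTP. It covers greedy decoding only, models each network as a plain function from context to next token, and verifies drafts one at a time, where a real runtime would check them in a single batched pass.

```kotlin
/** Plain greedy decoding with the main model only. */
fun greedyDecode(target: (List<Int>) -> Int, prompt: List<Int>, maxNew: Int): List<Int> {
    val out = prompt.toMutableList()
    repeat(maxNew) { out += target(out) }
    return out.drop(prompt.size)
}

/** Toy speculative decoding: the drafter proposes, the main model has the final word. */
fun speculativeDecode(
    target: (List<Int>) -> Int,
    drafter: (List<Int>) -> Int,
    prompt: List<Int>,
    maxNew: Int,
    draftLen: Int,
): List<Int> {
    require(draftLen > 0) { "draftLen must be positive" }
    val out = prompt.toMutableList()
    val limit = prompt.size + maxNew
    while (out.size < limit) {
        // The drafter runs ahead and proposes draftLen tokens (extra work on the device).
        val draft = mutableListOf<Int>()
        repeat(draftLen) { draft += drafter(out + draft) }
        // The main model checks each proposal. Only its own token is ever appended,
        // so a wrong draft costs time but cannot change the output.
        for (proposed in draft) {
            if (out.size >= limit) break
            val expected = target(out)
            out += expected
            if (expected != proposed) break // discard the rest of this draft
        }
    }
    return out.drop(prompt.size)
}
```

With a good drafter most proposals are accepted and the main model does fewer sequential steps. With a weak drafter, or a chip with no spare headroom, the drafter calls are pure overhead, which matches what I saw on the Honor 200.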

Rolled back. Takeaway: MTP is a real feature, but the hardware bar is higher than the headlines suggest. If you're shipping speculative decoding in an app meant for everyone, you need runtime device detection and to turn MTP on only for chips that can actually carry it.
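One way to do that detection without maintaining a list of chips is to measure: run a short benchmark with the flag off and on, and keep MTP only if the gain clearly exceeds measurement noise. A minimal sketch, where the 1.15 threshold and the function names are my own assumptions, not values from litert-lm:

```kotlin
/** Median of a non-empty list; less sensitive to one-off stalls than the mean. */
fun median(xs: List<Double>): Double {
    require(xs.isNotEmpty()) { "need at least one sample" }
    val s = xs.sorted()
    val mid = s.size / 2
    return if (s.size % 2 == 1) s[mid] else (s[mid - 1] + s[mid]) / 2.0
}

/**
 * Decides from on-device tok/s samples whether speculative decoding pays off.
 * minSpeedup is an assumed margin meant to sit above run-to-run noise.
 */
fun shouldEnableSpeculativeDecoding(
    baselineTokPerSec: List<Double>,
    speculativeTokPerSec: List<Double>,
    minSpeedup: Double = 1.15,
): Boolean {
    if (baselineTokPerSec.isEmpty() || speculativeTokPerSec.isEmpty()) return false
    val base = median(baselineTokPerSec)
    if (base <= 0.0) return false
    return median(speculativeTokPerSec) / base >= minSpeedup
}
```

On numbers like mine (tok/s within noise, sometimes slightly worse) this returns false and the flag stays off.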

Projects: persistent chats + on-device RAG

A local model without your context is a smart conversationalist, not a tool. To make Sanctum Machina a place where you can actually work with your own documents, offline, I added RAG.

Architecture. A new entity — Project — aggregates the persistent chats that were already there, plus a set of documents and an index over them. You attach one or more PDFs to a project, index them, and from then on you can spin up as many chats inside the project as you want — they all sit on top of that index.
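In data-model terms, the relationship could look roughly like this. The class and field names are illustrative guesses, not the actual schema in the repo.

```kotlin
/** A persistent chat, as it would appear in the sidebar. */
data class Chat(val id: Long, val title: String, val updatedAtMs: Long)

/** A PDF attached to a project; chunkCount is 0 until indexing has produced chunks. */
data class SourceDocument(val id: Long, val fileName: String, val chunkCount: Int)

/** A project groups chats and documents; every chat in it shares the same index. */
data class Project(
    val id: Long,
    val name: String,
    val chats: List<Chat>,
    val documents: List<SourceDocument>,
) {
    /** Sidebar order: most recently updated chat first. */
    fun chatsNewestFirst(): List<Chat> = chats.sortedByDescending { it.updatedAtMs }
}
```

The point of the shape is that documents hang off the project, not off a chat, so a new chat inherits the index without re-indexing anything.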

Stack. Embeddings are computed by GemmaEmbedding-300M — also on-device, also from the Gemma family, no external service. The document is split into chunks, the chunks go into a local index. On every query: retrieve first, then generate with the retrieved context.
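The retrieve step can be sketched as a brute-force cosine-similarity search over the stored chunk vectors. This assumes the embeddings have already been computed by the embedding model and are held in memory. The actual index structure in Sanctum Machina may differ, and `IndexedChunk`, `cosine` and `retrieve` are names invented for the example.

```kotlin
import kotlin.math.sqrt

/** A chunk of document text together with its precomputed embedding vector. */
class IndexedChunk(val text: String, val embedding: FloatArray)

/** Cosine similarity; returns 0 when either vector has zero length. */
fun cosine(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "embedding sizes differ" }
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    val denom = sqrt(normA) * sqrt(normB)
    return if (denom == 0f) 0f else dot / denom
}

/** Returns the topK chunks most similar to the query embedding, best first. */
fun retrieve(query: FloatArray, index: List<IndexedChunk>, topK: Int): List<IndexedChunk> =
    index.sortedByDescending { cosine(query, it.embedding) }.take(topK)
```

A linear scan like this is usually fine for a few PDFs' worth of chunks. The text of the returned chunks is then placed into the prompt ahead of the user's question.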

RAG settings in the UI. topK, chunk size, overlap — all exposed. Plus a separate screen showing the resulting chunks, so you can confirm that the document was split sensibly and tweak size/overlap if something looks off.
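To make the chunk size and overlap settings concrete, here is a minimal character-based splitter. It is a sketch under the assumption that chunks are measured in characters. The app may well split by tokens or sentences instead.

```kotlin
/**
 * Splits text into chunks of at most chunkSize characters, where each chunk
 * repeats the last `overlap` characters of the previous one.
 */
fun chunkText(text: String, chunkSize: Int, overlap: Int): List<String> {
    require(chunkSize > 0) { "chunkSize must be positive" }
    require(overlap in 0 until chunkSize) { "overlap must be smaller than chunkSize" }
    val step = chunkSize - overlap
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < text.length) {
        val end = minOf(start + chunkSize, text.length)
        chunks += text.substring(start, end)
        if (end == text.length) break // the last chunk reached the end of the text
        start += step
    }
    return chunks
}
```

For example, `chunkText("abcdefghij", 4, 1)` yields `abcd`, `defg`, `ghij`: each chunk starts with the last character of the previous one. A larger overlap lowers the chance that a sentence is cut exactly at a boundary, at the cost of more chunks to embed and store.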

The design choice I settled on. If a project has no documents, or if indexing failed — the model doesn't answer. Not "from memory," not "from general knowledge." In RAG mode, either there's context or there's no answer. This kills a class of bugs where the user expects an answer grounded in their PDF and gets a hallucination from pretraining instead.
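That rule is simple to enforce at the point where the prompt is assembled. A hedged sketch: the type names, the refusal message and the prompt wording are all illustrative, not what the app actually sends to the model.

```kotlin
sealed interface RagPrompt {
    /** There is context, so the model may be called with this prompt. */
    data class Grounded(val prompt: String) : RagPrompt
    /** There is no context, so the model is not called at all. */
    data class Refused(val message: String) : RagPrompt
}

/** Builds a grounded prompt, or refuses when retrieval produced nothing. */
fun buildRagPrompt(question: String, retrievedChunks: List<String>): RagPrompt {
    if (retrievedChunks.isEmpty()) {
        return RagPrompt.Refused("This project has no indexed documents, so there is nothing to answer from.")
    }
    // Number the chunks so an answer can be traced back to its source passage.
    val context = retrievedChunks.withIndex()
        .joinToString("\n\n") { (i, chunk) -> "[${i + 1}] $chunk" }
    return RagPrompt.Grounded(
        "Answer using only the context below.\n\nContext:\n$context\n\nQuestion: $question"
    )
}
```

Making the refusal a separate type means the calling code cannot accidentally pass an ungrounded question to the model: there is simply no prompt to pass.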

Now even if the world falls apart — the Spirit of the Machine will live in the Sanctuary of the Machine and keep working.

In the grim darkness of the far future, the Omnissiah remains with us.