I've been getting into on-device AI lately.
Not cloud AI. Not sending requests to a server somewhere. I mean a language model running entirely on the phone itself — no internet required, no API costs, no data leaving the device.
When I learn something new, I have to build something. So I did.
Why On-Device AI Is Interesting Right Now
It's slow. It's limited. It's nowhere close to cloud LLMs yet.
But the direction is clear.
Smartphone NPUs are getting significantly more powerful every year. Model compression techniques are improving every month. The performance that required a cloud GPU two years ago is starting to run on a phone today.
The people who get familiar with this now will have an advantage when it becomes mainstream. That's why I'm learning it now.
What I Built
A roguelike RPG.
Help Wanted: Hero — Conquering the Demon Lord's Castle, 300 floors.
An on-device LLM generates the dungeon every 5 floors. Mob names, dialogue, boss patterns, hidden events — all created locally, no server involved.
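To make that concrete, here's a rough sketch of what "generate a dungeon set" can look like at the prompt level. This is a hypothetical Python illustration, not my actual Unity code: the schema keys, function names, and the salvage-the-JSON trick are all assumptions about one reasonable way to do it.

```python
import json

# Hypothetical sketch: schema and field names are my illustration,
# not the game's real data model.
DUNGEON_SCHEMA = (
    "Return ONLY valid JSON with keys: "
    '"mobs" (list of 3 names), "boss" (name), '
    '"boss_line" (one taunt), "hidden_event" (one sentence).'
)

def build_prompt(floor_start: int) -> str:
    """Prompt for one 5-floor dungeon set, generated locally."""
    return (
        f"You are the dungeon master for floors {floor_start}-{floor_start + 4} "
        f"of the Demon Lord's Castle. {DUNGEON_SCHEMA}"
    )

def parse_dungeon(raw: str) -> dict:
    """Small models often wrap JSON in prose; salvage the first {...} span."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in model output")
    return json.loads(raw[start : end + 1])

# Canned model response for illustration (no model call here):
fake_output = (
    'Sure! {"mobs": ["Gloom Rat", "Waxy Knight", "Sad Mimic"], '
    '"boss": "Count Drafty", "boss_line": "You knocked?", '
    '"hidden_event": "A vending machine sells cursed potions."}'
)
dungeon = parse_dungeon(fake_output)
print(dungeon["boss"])  # → Count Drafty
```

The lenient parse matters more than the prompt: a 3.8B model will happily add "Sure!" before the JSON, and the game has to shrug that off rather than crash.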
Why a game? Because it's a domain where AI being wrong is fine.
If the mob name sounds weird, it's funny. If the boss dialogue is a little off, it adds to the charm. Games naturally absorb the limitations of small on-device models in a way that most other apps can't.
Also, roguelikes need fresh content every run. That's exactly what generative AI is good at.
How It's Going
I ran the first test on a Samsung Galaxy S24 Ultra.
Generating one dungeon set took 8 minutes and 43 seconds.
That's CPU-only inference with Phi-4-mini (3.8B parameters, INT4 quantized) via ONNX Runtime on Android. NPU acceleration is essential, and I'm currently hitting a wall getting the QNN HTP backend (Qualcomm's NPU path) to work.
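A quick back-of-envelope check on that 8m43s number. Only the wall-clock time is measured; the output-token count is my assumption (a full dungeon set as JSON might be on the order of 1,500 tokens):

```python
# Back-of-envelope: implied decode speed for the measured test run.
elapsed_s = 8 * 60 + 43          # 523 s, measured on the S24 Ultra
assumed_output_tokens = 1500     # hypothetical dungeon-set length, my guess

tokens_per_s = assumed_output_tokens / elapsed_s
print(f"{tokens_per_s:.2f} tok/s")  # → 2.87 tok/s
```

Low single-digit tokens per second is roughly what you'd expect from a 3-4B INT4 model on a phone CPU, which is exactly why the NPU path is the whole ballgame.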
The next post will cover the full implementation — Unity + ONNX Runtime Android setup, building a C# tokenizer from scratch, the KV cache inference engine, and exactly where and why things broke.
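As a preview of the KV cache piece: the core idea is that each decode step computes Q/K/V only for the new token and reuses cached K/V from earlier steps, instead of re-running attention math over the whole prefix. Here's a toy single-head illustration in Python (made-up dimensions and random weights — not the actual C# engine):

```python
import numpy as np

# Toy single-head attention decoder showing what a KV cache buys you.
# Dims and weights are illustrative only.
rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def attend(x_new, k_cache, v_cache):
    """Process ONE new token: append its K/V, attend over all cached steps."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)   # without the cache, K and V for every
    v_cache.append(x_new @ Wv)   # past token get recomputed each step
    K = np.stack(k_cache)        # (t, d)
    V = np.stack(v_cache)        # (t, d)
    scores = K @ q / np.sqrt(d)  # (t,)
    w = np.exp(scores - scores.max())
    w /= w.sum()                 # softmax over cached positions
    return w @ V                 # (d,) context vector

k_cache, v_cache = [], []
for step in range(4):            # 4 decode steps, one token each
    out = attend(rng.standard_normal(d), k_cache, v_cache)

print(len(k_cache))  # → 4: one cached K/V entry per generated token
```

Per step this is O(t) attention work instead of O(t²) re-prefill, which is the difference between "slow" and "unusable" when every token already costs a third of a second on CPU.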
Next: Unity + ONNX Runtime Android — A Full Breakdown of What Went Wrong (and What Didn't)