The Model Was the Answer — 16.6 tok/s with Qwen3-1.7B
In the last post, I got llama.cpp + Adreno OpenCL to cut generation time from 523 seconds down to 16.8 seconds.
Today I pushed it further. Turns out the model itself was the bottleneck.
Swapping models doubled the speed again.
First: Quantization Isn't What You Think on Adreno
Before trying a different model, I benchmarked several quantization levels of Phi-4-mini to find the optimal setting.
| Quantization | Size | tok/s |
|---|---|---|
| Q8_0 | 3.8GB | 9.0 |
| Q4_0 | 2.3GB | 5.1 |
| Q6_K | 3.2GB | 4.2 |
This is counterintuitive. Lower-bit quantization usually means a smaller model and faster inference. Not here.
On the Adreno OpenCL backend, Q4_0 and Q6_K introduce dequantization overhead at the GPU level that actually slows inference down. Q8_0 maps most efficiently to the Adreno compute kernels. This is specific to Qualcomm's OpenCL implementation — other backends may behave differently.
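The overhead is easiest to see in the block layouts themselves. Below is a simplified sketch of the two formats (llama.cpp stores the scale as fp16; I use a plain float here for clarity, and the struct names are illustrative): Q8_0 needs one multiply per weight, while Q4_0 has to unpack nibbles and re-center them before it can scale.

```c
// Simplified sketch of llama.cpp-style quant blocks (float scale instead of
// fp16 for clarity; not the exact ggml structs).
#include <stdint.h>

#define QK 32  // weights per block in both formats

typedef struct { float d; int8_t  qs[QK];     } block_q8_0;  // 1 signed byte per weight
typedef struct { float d; uint8_t qs[QK / 2]; } block_q4_0;  // 2 weights packed per byte

// Q8_0: one multiply per weight.
void dequant_q8_0(const block_q8_0 *b, float *out) {
    for (int i = 0; i < QK; i++) out[i] = b->qs[i] * b->d;
}

// Q4_0: unpack each nibble, subtract the 4-bit zero point (8), then scale.
// The extra bit twiddling on every access is the per-weight overhead.
void dequant_q4_0(const block_q4_0 *b, float *out) {
    for (int i = 0; i < QK / 2; i++) {
        out[i]          = ((b->qs[i] & 0x0F) - 8) * b->d;  // low nibble → first half
        out[i + QK / 2] = ((b->qs[i] >> 4)   - 8) * b->d;  // high nibble → second half
    }
}
```

On a desktop GPU that unpacking is noise; on Adreno's OpenCL kernels it evidently isn't, which is consistent with Q8_0 coming out ahead in the table above.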
Also: requantizing from Q8_0 to Q4_0 via llama.cpp fails with `requantizing from type q8_0 is disabled`. You need the original BF16/FP16 source model. Keep that in mind before downloading a quantized-only release.
Then: What If the Model Is Smaller?
Once I confirmed Q8_0 is optimal for Adreno, the next question was obvious: what if I just use a smaller model at Q8_0?
I tested Qwen3-1.7B (Q8_0, 1.8GB) against Phi-4-mini (Q8_0, 3.8GB).
| | Phi-4-mini (3.8B) | Qwen3-1.7B (1.7B) |
|---|---|---|
| Model size | 3.8GB | 1.8GB |
| Load time | 24.5s | 14.4s |
| Generation speed | 9.0 tok/s | 16.6 tok/s |
| Time for 150 tokens | 16.8s | 9.1s |
| Mob name output | "몬스터이름" (literal placeholder) | "토끼" (rabbit) ✅ |
| JSON structure | Valid | Valid ✅ |
Qwen3-1.7B wins on every metric. Half the size, 1.8x faster, and noticeably better at following prompt instructions.
The mob name issue is worth noting. Phi-4-mini kept outputting the literal placeholder text "몬스터이름" (which means "monster name" in Korean) instead of generating an actual name. Qwen3-1.7B understood the prompt correctly and generated real names. At 1.7B parameters, Qwen3 punches above its weight on instruction following.
Final Stack
Inference engine : llama.cpp (Adreno OpenCL)
Model : Qwen3-1.7B Q8_0 (1.8GB GGUF)
Performance : 16.6 tok/s / 9.1s per 150 tokens
Unity integration: C wrapper (unity_bridge.c) + P/Invoke
Device : Samsung Galaxy S24 Ultra (Snapdragon 8 Gen 3)
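For reference, the Unity side talks to llama.cpp through a plain C ABI that C# binds via `[DllImport]`. The sketch below shows the general shape of such a bridge with stubbed internals — the function names and signatures are illustrative, not the actual unity_bridge.c API.

```c
// Hypothetical sketch of a Unity P/Invoke bridge (names are illustrative).
// The context stays opaque to C#, which only ever holds an IntPtr.
#include <stdlib.h>
#include <string.h>

typedef struct llm_ctx { int loaded; } llm_ctx;  // real struct would hold llama.cpp state

// Load the GGUF model; return NULL on failure so C# can test IntPtr.Zero.
llm_ctx *llm_bridge_init(const char *model_path) {
    if (!model_path || !*model_path) return NULL;
    llm_ctx *ctx = malloc(sizeof *ctx);
    if (ctx) ctx->loaded = 1;  // real code: load model + init OpenCL backend
    return ctx;
}

// Generate into a caller-owned buffer (C# passes a StringBuilder).
// Returns bytes written, or -1 on error. Stubbed here to echo the prompt;
// real code would run the tokenize → decode → detokenize loop.
int llm_bridge_generate(llm_ctx *ctx, const char *prompt, char *out, int out_len) {
    if (!ctx || !ctx->loaded || !prompt || !out || out_len <= 0) return -1;
    int n = (int)strlen(prompt);
    if (n >= out_len) n = out_len - 1;
    memcpy(out, prompt, n);
    out[n] = '\0';
    return n;
}

void llm_bridge_free(llm_ctx *ctx) { free(ctx); }
```

Keeping the surface down to init/generate/free like this means the C# marshalling stays trivial: one IntPtr, one string in, one buffer out.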
The Full Journey
| Approach | tok/s | Time for 150 tokens |
|---|---|---|
| ONNX Runtime CPU (S24 Ultra) | 0.21 | 523s |
| ONNX Runtime + QNN HTP | 0.31 | 490s |
| llama.cpp OpenCL + Phi-4-mini | 9.0 | 16.8s |
| llama.cpp OpenCL + Qwen3-1.7B | 16.6 | 9.1s |
From 0.21 tok/s to 16.6 tok/s. 79x faster than where we started.
The AI Stack Is Done
9 seconds per generation is workable for a dungeon loading screen. No server. No internet. The LLM runs entirely on the device, generates dungeon content, and fits in 1.8GB.
The full implementation — C wrapper, build pipeline, Unity integration — is on GitHub:
👉 unity-android-ondevice-llm
Next up: actually building the game. Top-down exploration, turn-based combat, LLM-generated mobs and dialogue. The hard technical part is done. Now it's time to make something playable.
Next: Building the Game — Top-Down Dungeon + Turn-Based Combat with On-Device LLM