The Model Was the Answer — 16.6 tok/s with Qwen3-1.7B
In the last post, I got llama.cpp + Adreno OpenCL to cut generation time from 523 seconds down to 16.8 seconds.
Today I pushed it further. Turns out the model itself was the bottleneck.
Swapping models doubled the speed again.
First: Quantization Isn't What You Think on Adreno
Before trying a different model, I benchmarked several quantization levels of Phi-4-mini to find the optimal setting.
| Quantization | Size | tok/s |
|---|---|---|
| Q8_0 | 3.8GB | 9.0 |
| Q4_0 | 2.3GB | 5.1 |
| Q6_K | 3.2GB | 4.2 |
This is counterintuitive. Lower-bit quantization usually means a smaller model and faster inference. Not here.
On the Adreno OpenCL backend, Q4_0 and Q6_K introduce dequantization overhead at the GPU level that actually slows inference down. Q8_0 maps most efficiently to the Adreno compute kernels. This is specific to Qualcomm's OpenCL implementation — other backends may behave differently.
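The overhead is easiest to see in the block layouts themselves. Below is a simplified sketch of the two formats (llama.cpp stores the scale as fp16; I use a plain float here for clarity, and the struct names are illustrative): Q8_0 needs one multiply per weight, while Q4_0 has to unpack nibbles and re-center them before it can scale.

```c
// Simplified sketch of llama.cpp-style quant blocks (float scale instead of
// fp16 for clarity; not the exact ggml structs).
#include <stdint.h>

#define QK 32  // weights per block in both formats

typedef struct { float d; int8_t  qs[QK];     } block_q8_0;  // 1 signed byte per weight
typedef struct { float d; uint8_t qs[QK / 2]; } block_q4_0;  // 2 weights packed per byte

// Q8_0: one multiply per weight.
void dequant_q8_0(const block_q8_0 *b, float *out) {
    for (int i = 0; i < QK; i++) out[i] = b->qs[i] * b->d;
}

// Q4_0: unpack each nibble, subtract the 4-bit zero point (8), then scale.
// The extra bit twiddling on every access is the per-weight overhead.
void dequant_q4_0(const block_q4_0 *b, float *out) {
    for (int i = 0; i < QK / 2; i++) {
        out[i]          = ((b->qs[i] & 0x0F) - 8) * b->d;  // low nibble → first half
        out[i + QK / 2] = ((b->qs[i] >> 4)   - 8) * b->d;  // high nibble → second half
    }
}
```

On a desktop GPU that unpacking is noise; on Adreno's OpenCL kernels it evidently isn't, which is consistent with Q8_0 coming out ahead in the table above.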
Also: requantizing from Q8_0 to Q4_0 via llama.cpp fails with `requantizing from type q8_0 is disabled`. You need the original BF16/FP16 source model. Keep that in mind before downloading a quantized-only release.
Then: What If the Model Is Smaller?
Once I confirmed Q8_0 is optimal for Adreno, the next question was obvious: what if I just use a smaller model at Q8_0?
I tested Qwen3-1.7B (Q8_0, 1.8GB) against Phi-4-mini (Q8_0, 3.8GB).
| | Phi-4-mini (3.8B) | Qwen3-1.7B (1.7B) |
|---|---|---|
| Model size | 3.8GB | 1.8GB |
| Load time | 24.5s | 14.4s |
| Generation speed | 9.0 tok/s | 16.6 tok/s |
| Time for 150 tokens | 16.8s | 9.1s |
| Mob name output | "몬스터이름" (literal placeholder) | "토끼" (rabbit) ✅ |
| JSON structure | Valid | Valid ✅ |
Qwen3-1.7B wins on every metric. Half the size, 1.8x faster, and noticeably better at following prompt instructions.
The mob name issue is worth noting. Phi-4-mini kept outputting the literal placeholder text "몬스터이름" (which means "monster name" in Korean) instead of generating an actual name. Qwen3-1.7B understood the prompt correctly and generated real names. At 1.7B parameters, Qwen3 punches above its weight on instruction following.
Final Stack
Inference engine : llama.cpp (Adreno OpenCL)
Model : Qwen3-1.7B Q8_0 (1.8GB GGUF)
Performance : 16.6 tok/s / 9.1s per 150 tokens
Unity integration: C wrapper (unity_bridge.c) + P/Invoke
Device : Samsung Galaxy S24 Ultra (Snapdragon 8 Gen 3)
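For reference, the Unity side talks to llama.cpp through a plain C ABI that C# binds via `[DllImport]`. The sketch below shows the general shape of such a bridge with stubbed internals — the function names and signatures are illustrative, not the actual unity_bridge.c API.

```c
// Hypothetical sketch of a Unity P/Invoke bridge (names are illustrative).
// The context stays opaque to C#, which only ever holds an IntPtr.
#include <stdlib.h>
#include <string.h>

typedef struct llm_ctx { int loaded; } llm_ctx;  // real struct would hold llama.cpp state

// Load the GGUF model; return NULL on failure so C# can test IntPtr.Zero.
llm_ctx *llm_bridge_init(const char *model_path) {
    if (!model_path || !*model_path) return NULL;
    llm_ctx *ctx = malloc(sizeof *ctx);
    if (ctx) ctx->loaded = 1;  // real code: load model + init OpenCL backend
    return ctx;
}

// Generate into a caller-owned buffer (C# passes a StringBuilder).
// Returns bytes written, or -1 on error. Stubbed here to echo the prompt;
// real code would run the tokenize → decode → detokenize loop.
int llm_bridge_generate(llm_ctx *ctx, const char *prompt, char *out, int out_len) {
    if (!ctx || !ctx->loaded || !prompt || !out || out_len <= 0) return -1;
    int n = (int)strlen(prompt);
    if (n >= out_len) n = out_len - 1;
    memcpy(out, prompt, n);
    out[n] = '\0';
    return n;
}

void llm_bridge_free(llm_ctx *ctx) { free(ctx); }
```

Keeping the surface down to init/generate/free like this means the C# marshalling stays trivial: one IntPtr, one string in, one buffer out.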
The Full Journey
| Approach | tok/s | Time for 150 tokens |
|---|---|---|
| ONNX Runtime CPU (S24 Ultra) | 0.21 | 523s |
| ONNX Runtime + QNN HTP | 0.31 | 490s |
| llama.cpp OpenCL + Phi-4-mini | 9.0 | 16.8s |
| llama.cpp OpenCL + Qwen3-1.7B | 16.6 | 9.1s |
From 0.21 tok/s to 16.6 tok/s. 79x faster than where we started.
The AI Stack Is Done
9 seconds per generation is workable for a dungeon loading screen. No server. No internet. The LLM runs entirely on the device, generates dungeon content, and fits in 1.8GB.
The full implementation — C wrapper, build pipeline, Unity integration — is on GitHub:
👉 unity-android-ondevice-llm
Next up: actually building the game. Top-down exploration, turn-based combat, LLM-generated mobs and dialogue. The hard technical part is done. Now it's time to make something playable.
Next: Building the Game — Top-Down Dungeon + Turn-Based Combat with On-Device LLM