as1as
I Started Building a Roguelike RPG — Powered by On-Device AI #4

The Model Was the Answer — 16.6 tok/s with Qwen3-1.7B

In the last post, I got llama.cpp + Adreno OpenCL to cut generation time from 523 seconds down to 16.8 seconds.

Today I pushed it further. Turns out the model itself was the bottleneck.

Swapping models doubled the speed again.


First: Quantization Isn't What You Think on Adreno

Before trying a different model, I tested every quantization level on Phi-4-mini to find the optimal setting.

| Quantization | Size | tok/s |
|---|---|---|
| Q8_0 | 3.8GB | 9.0 |
| Q4_0 | 2.3GB | 5.1 |
| Q6_K | 3.2GB | 4.2 |

This is counterintuitive. Lower-bit quantization usually means a smaller file and faster inference. Not here.

On the Adreno OpenCL backend, Q4_0 and Q6_K introduce dequantization overhead at the GPU level that actually slows inference down. Q8_0 maps most efficiently to the Adreno compute kernels. This is specific to Qualcomm's OpenCL implementation — other backends may behave differently.

Also: requantizing from Q8_0 to Q4_0 via llama.cpp fails with `requantizing from type q8_0 is disabled`. You need the original BF16/FP16 source model. Keep that in mind before downloading a quantized-only release.
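If you do have the original weights, going from a Hugging Face checkpoint to any quantization level is a two-step job. A rough sketch, assuming a local llama.cpp build and a checkout in `./Phi-4-mini` (all paths and file names here are placeholders, not the exact ones I used):

```shell
# 1. Convert the HF checkpoint to an FP16 GGUF (script ships with llama.cpp)
python convert_hf_to_gguf.py ./Phi-4-mini --outtype f16 --outfile phi-4-mini-f16.gguf

# 2. Quantize from the FP16 source — this works; Q8_0 -> Q4_0 does not
./build/bin/llama-quantize phi-4-mini-f16.gguf phi-4-mini-q8_0.gguf Q8_0
```

Starting from FP16 also means you can regenerate every quantization level from one download when benchmarking like this.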


Then: What If the Model Is Smaller?

Once I confirmed Q8_0 is optimal for Adreno, the next question was obvious: what if I just use a smaller model at Q8_0?

I tested Qwen3-1.7B (Q8_0, 1.8GB) against Phi-4-mini (Q8_0, 3.8GB).

| | Phi-4-mini (3.8B) | Qwen3-1.7B (1.7B) |
|---|---|---|
| Model size | 3.8GB | 1.8GB |
| Load time | 24.5s | 14.4s |
| Generation speed | 9.0 tok/s | 16.6 tok/s |
| 150 tokens | 16.8s | 9.1s |
| Mob name output | "몬스터이름" (literal placeholder) | "토끼" (rabbit) |
| JSON structure | Valid | Valid ✅ |

Qwen3-1.7B wins on every metric. Half the size, 1.8x faster, and noticeably better at following prompt instructions.

The mob name issue is worth noting. Phi-4-mini kept outputting the literal placeholder text "몬스터이름" (which means "monster name" in Korean) instead of generating an actual name. Qwen3-1.7B understood the prompt correctly and generated real names. At 1.7B parameters, Qwen3 punches above its weight on instruction following.
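Numbers like these are easy to reproduce with llama.cpp's bundled `llama-bench` tool. A minimal sketch, assuming the binaries and model are already pushed to the device (the `/data/local/tmp` path is just the usual adb-accessible scratch directory, not necessarily my setup):

```shell
# Token-generation benchmark on-device; -ngl 99 offloads all layers to the GPU
adb shell "cd /data/local/tmp && LD_LIBRARY_PATH=. \
  ./llama-bench -m qwen3-1.7b-q8_0.gguf -ngl 99 -n 150"
```

Swap the `-m` argument to compare models under identical conditions; `llama-bench` reports tok/s for you.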


Final Stack

```
Inference engine : llama.cpp (Adreno OpenCL)
Model            : Qwen3-1.7B Q8_0 (1.8GB GGUF)
Performance      : 16.6 tok/s / 9.1s per 150 tokens
Unity integration: C wrapper (unity_bridge.c) + P/Invoke
Device           : Samsung Galaxy S24 Ultra (Snapdragon 8 Gen 3)
```
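Before wiring anything into Unity, a quick smoke test with `llama-cli` confirms the stack end to end. A sketch, assuming the same on-device layout as above (the prompt is illustrative, not the game's actual generation prompt):

```shell
# One-shot generation on-device, fully offline
adb shell "cd /data/local/tmp && LD_LIBRARY_PATH=. \
  ./llama-cli -m qwen3-1.7b-q8_0.gguf -ngl 99 -n 150 \
  -p 'Generate one dungeon mob as JSON with name, hp, and attack fields.'"
```

If this prints valid JSON at ~16 tok/s, the C wrapper and P/Invoke layer only have to expose the same calls to Unity.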

The Full Journey

| Approach | tok/s | 150 tokens |
|---|---|---|
| ONNX Runtime CPU (S24 Ultra) | 0.21 | 523s |
| ONNX Runtime + QNN HTP | 0.31 | 490s |
| llama.cpp OpenCL + Phi-4-mini | 9.0 | 16.8s |
| llama.cpp OpenCL + Qwen3-1.7B | 16.6 | 9.1s |

From 0.21 tok/s to 16.6 tok/s. 79x faster than where we started.


The AI Stack Is Done

9 seconds per generation is workable for a dungeon loading screen. No server. No internet. The LLM runs entirely on the device, generates dungeon content, and fits in 1.8GB.

The full implementation — C wrapper, build pipeline, Unity integration — is on GitHub:
👉 unity-android-ondevice-llm

Next up: actually building the game. Top-down exploration, turn-based combat, LLM-generated mobs and dialogue. The hard technical part is done. Now it's time to make something playable.


Next: Building the Game — Top-Down Dungeon + Turn-Based Combat with On-Device LLM

