DEV Community

as1as

I Started Building a Roguelike RPG — Powered by On-Device AI #3

QNN Failed. LiteRT Failed. Then llama.cpp Delivered 42x Speedup.

I wanted to write a success story today.

It turns out I can. But getting there was a bit rough.


What I Tried Today

| Attempt | Result |
| --- | --- |
| QNN HTP + libcdsprpc.so workaround | HTP initialized, but only 3 of 363 nodes ran on the NPU |
| LiteRT-LM GPU | GPU memory overflow; engine creation failed |
| llama.cpp + Adreno OpenCL | Success: 8.9 tok/s |

QNN HTP: 3 Out of 363 Nodes

I solved the libcdsprpc.so access problem from yesterday. The fix: use apktool to decompile the APK, inject a uses-native-library entry directly into the manifest, and repackage. Not elegant, but it worked.
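For reference, the injected entry looks roughly like this (a sketch from memory, not my exact diff; note that `uses-native-library` only exists on targetSdk 31+):

```xml
<!-- Inside the <application> element of the decompiled AndroidManifest.xml -->
<uses-native-library
    android:name="libcdsprpc.so"
    android:required="false" />
```

With `required="false"`, the app still installs on devices that don't ship the library.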

HTP finally initialized:

QnnDsp <W> Initializing HtpProvider ✅
QnnDsp <W> PrepareLibLoader Loading libQnnHtpPrepare.so ✅

Then this log appeared:

number of nodes in the graph: 363
number of nodes supported by QNN: 3

3 out of 363 nodes ran on the NPU. The INT4 block quantization operator (MatMulNBits) isn't supported by HTP. The remaining 360 nodes fell back to CPU. Generation time: 483 seconds — essentially the same as CPU-only (523 seconds).

Runtime compilation of INT4 models via QNN doesn't work. Pre-converted QNN context binaries are required, which means going through the full Qualcomm AI Engine Direct SDK pipeline. That's a future task.


LiteRT-LM: Unity and the GPU Can't Share

LiteRT-LM is Google's official on-device LLM framework, and Phi-4-mini is officially supported.

The Bazel native build failed due to a Rust dependency issue, so I switched to the Kotlin AAR approach. Then the GPU memory error hit:

Requested allocation size - 18446744071872970752 bytes
Max allocation size for this GPU - 1073741824 bytes

Unity's renderer is occupying the GPU. There's not enough VRAM left for LLM inference. This is a structural problem — running GPU-accelerated LLM inference inside a Unity game engine isn't viable right now. They're fighting over the same hardware.


llama.cpp: One Missing Library Away

Qualcomm officially contributes Adreno-optimized OpenCL kernels to llama.cpp. Yesterday's build succeeded but crashed on device because libomp.so wasn't included in the APK.

Today I rebuilt with -DGGML_OPENMP=OFF to remove the OpenMP dependency entirely. Build succeeded.
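For reference, the configure step looked roughly like this (a sketch: `GGML_OPENCL` is the current llama.cpp CMake option name for the OpenCL backend, and the NDK path is a placeholder):

```
cmake -B build \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DGGML_OPENCL=ON \
  -DGGML_OPENMP=OFF
cmake --build build
```

Dropping OpenMP costs some CPU-side parallelism, but the GPU path does the heavy lifting anyway.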

Next issue: P/Invoke. Trying to marshal LlamaModelParams directly from C# caused a SIGSEGV — the struct layout didn't match what C# expected. The fix was writing a C wrapper (unity_bridge.c) that handles all the complex structs internally and exposes a simple interface of 8 functions to C#:

void* unity_llama_model_load(const char* path, int n_gpu_layers);
void* unity_llama_context_create(void* model);
const char* unity_llama_generate(void* ctx, const char* prompt, int max_tokens);
// ...

Then I ran it on device.


Results

| Item | Result |
| --- | --- |
| Model | Phi-4-mini Q8_0 (3.8GB GGUF) |
| Model loading | ~23s |
| Generation time | 16.8s |
| tok/s | 8.9 |
| GPU | Adreno 750 (OpenCL) |

Full Benchmark Comparison

| Method | tok/s | 150 tokens | vs baseline |
| --- | --- | --- | --- |
| ONNX Runtime CPU (S24 Ultra) | 0.21 | 523s | baseline |
| ONNX Runtime QNN (S24 Ultra) | 0.31 | 490s | 1.5x |
| ONNX Runtime CPU (Mac) | 0.45 | 246s | 2.1x |
| llama.cpp OpenCL (S24 Ultra) | 8.9 | 16.8s | 42x |

523 seconds down to 16.8 seconds. Measured by throughput, that's 42x: 0.21 tok/s to 8.9 tok/s.

16 seconds is workable for a dungeon generation loading screen. On-device LLM is now viable for the game.


What I Learned

  • ONNX Runtime + QNN is effectively useless for INT4 models — 3 of 363 nodes on NPU
  • LiteRT-LM conflicts with Unity's GPU usage — renderer and LLM inference compete for VRAM
  • llama.cpp + Adreno OpenCL is the answer — Qualcomm official optimization, CMake build
  • C wrappers are essential for P/Invoke — never marshal complex C structs directly from C#; wrap them

What's Next

The current model is Q8_0. Requantizing to Q4_0 could push performance above 15 tok/s.
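If I go that route, llama.cpp ships a quantization tool for this. The invocation should be roughly the following (the binary was called `quantize` in older builds, and `--allow-requantize` is needed because the input is already quantized):

```
./llama-quantize --allow-requantize phi-4-mini-q8_0.gguf phi-4-mini-q4_0.gguf Q4_0
```

Requantizing from Q8_0 rather than from the original f16 weights loses a little extra precision, so quality needs checking afterward.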

More importantly: it's time to actually build the game. The speed problem is solved. Next up is dungeon generation, the turn-based combat system, and getting to something actually playable.

The detailed llama.cpp + Unity integration — the C wrapper, the build process, the full deployment pipeline — will be its own post.


Next: llama.cpp + Unity Android Integration — C Wrapper, Build Pipeline, and Real Device Deployment

