# QNN Failed. LiteRT Failed. Then llama.cpp Delivered 42x Speedup.
I wanted to write a success story today.
It turns out I can. But getting there was a bit rough.
## What I Tried Today
| Attempt | Result |
|---|---|
| QNN HTP + libcdsprpc.so workaround | HTP initialized, but only 3 of 363 nodes ran on NPU |
| LiteRT-LM GPU | GPU memory overflow / engine creation failed |
| llama.cpp + Adreno OpenCL | Success. 8.9 tok/s |
## QNN HTP: 3 Out of 363 Nodes
I solved the `libcdsprpc.so` access problem from yesterday. The fix: use apktool to decompile the APK, inject `uses-native-library` directly into the manifest, and repackage. Not elegant, but it worked.
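For reference, the injected element looks like this — `uses-native-library` is the standard Android 12+ mechanism for declaring a dependency on a vendor shared library (exact attributes here are a sketch of what I injected, not a verbatim diff):

```xml
<!-- inside <application> in AndroidManifest.xml -->
<uses-native-library
    android:name="libcdsprpc.so"
    android:required="false" />
```

Setting `android:required="false"` keeps the app installable on devices that don't ship the library.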
HTP finally initialized:
```
QnnDsp <W> Initializing HtpProvider ✅
QnnDsp <W> PrepareLibLoader Loading libQnnHtpPrepare.so ✅
```
Then this log appeared:
```
number of nodes in the graph: 363
number of nodes supported by QNN: 3
```
3 out of 363 nodes ran on the NPU. The INT4 block quantization operator (`MatMulNBits`) isn't supported by HTP, so the remaining 360 nodes fell back to CPU. Generation time: 483 seconds — essentially the same as CPU-only (523 seconds).
Runtime compilation of INT4 models via QNN doesn't work. Pre-converted QNN context binaries are required, which means going through the full Qualcomm AI Engine Direct SDK pipeline. That's a future task.
## LiteRT-LM: Unity and the GPU Can't Share
LiteRT-LM is Google's official on-device LLM framework, and Phi-4-mini is officially supported.
The Bazel native build failed due to a Rust dependency issue, so I switched to the Kotlin AAR approach. Then the GPU memory error hit:
```
Requested allocation size - 18446744071872970752 bytes
Max allocation size for this GPU - 1073741824 bytes
```
That requested size is just under 2^64 bytes — almost certainly a negative value reinterpreted as unsigned. Unity's renderer is occupying the GPU, and there isn't enough VRAM left for LLM inference. This is a structural problem: running GPU-accelerated LLM inference inside a Unity game engine isn't viable right now. They're fighting over the same hardware.
## llama.cpp: One Missing Library Away
Qualcomm officially contributes Adreno-optimized OpenCL kernels to llama.cpp. Yesterday's build succeeded but crashed on device because `libomp.so` wasn't included in the APK.
Today I rebuilt with `-DGGML_OPENMP=OFF` to remove the OpenMP dependency entirely. The build succeeded.
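The rebuild looked roughly like this — flag names as of recent llama.cpp revisions (`GGML_OPENCL` enables the Adreno-contributed OpenCL backend); the NDK path and API level are placeholders, not my exact invocation:

```shell
cmake -B build-android -G Ninja \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DGGML_OPENCL=ON \
  -DGGML_OPENMP=OFF \
  -DBUILD_SHARED_LIBS=ON
cmake --build build-android
```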
Next issue: P/Invoke. Marshaling `LlamaModelParams` directly from C# caused a SIGSEGV — the struct layout didn't match what C# expected. The fix was a C wrapper (`unity_bridge.c`) that handles all the complex structs internally and exposes a simple interface of 8 functions to C#:
```c
void* unity_llama_model_load(const char* path, int n_gpu_layers);
void* unity_llama_context_create(void* model);
const char* unity_llama_generate(void* ctx, const char* prompt, int max_tokens);
// ...
```
Then I ran it on device.
## Results
| Item | Result |
|---|---|
| Model | Phi-4-mini Q8_0 (3.8GB GGUF) |
| Model loading | ~23s |
| Generation time (150 tokens) | 16.8s |
| tok/s | 8.9 |
| GPU | Adreno 750 (OpenCL) |
## Full Benchmark Comparison
| Method | tok/s | Time (150 tokens) | vs baseline |
|---|---|---|---|
| ONNX Runtime CPU (S24 Ultra) | 0.21 | 523s | baseline |
| ONNX Runtime QNN (S24 Ultra) | 0.31 | 490s | 1.5x |
| ONNX Runtime CPU (Mac) | 0.45 | 246s | 2.1x |
| llama.cpp OpenCL (S24 Ultra) | 8.9 | 16.8s | 42x |
523 seconds down to 16.8 seconds. 42x faster.
16 seconds is workable for a dungeon generation loading screen. On-device LLM is now viable for the game.
## What I Learned
- ONNX Runtime + QNN is effectively useless for INT4 models — 3 of 363 nodes on NPU
- LiteRT-LM conflicts with Unity's GPU usage — renderer and LLM inference compete for VRAM
- llama.cpp + Adreno OpenCL is the answer — Qualcomm official optimization, CMake build
- C wrappers are essential for P/Invoke — never marshal complex C structs directly from C#; wrap them
## What's Next
The current model is Q8_0. Requantizing to Q4_0 could push performance above 15 tok/s.
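Requantizing from the existing Q8_0 GGUF would look something like this — the `llama-quantize` tool ships with llama.cpp; the file names are placeholders:

```shell
./llama-quantize phi-4-mini-q8_0.gguf phi-4-mini-q4_0.gguf Q4_0
```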
More importantly: it's time to actually build the game. The speed problem is solved. Next up: dungeon generation, the turn-based combat system, and getting to something playable.
The detailed llama.cpp + Unity integration — the C wrapper, the build process, the full deployment pipeline — will be its own post.
Next: llama.cpp + Unity Android Integration — C Wrapper, Build Pipeline, and Real Device Deployment