Running On-Device LLM in Unity Android — Everything That Broke (and How I Fixed It)
In my last post, I mentioned I was building a roguelike RPG powered by an on-device LLM. This time I'll cover exactly how I did it, what broke, and what the numbers look like.
The short version: I got Phi-4-mini running in Unity on a real Android device in one day. It generated valid JSON. It took 8 minutes and 43 seconds.
0. Why This Tech Stack
Before the details, here's why I made each choice.
Why Phi-4-mini (3.8B)?
Microsoft officially distributes it in ONNX format — no conversion work needed. The INT4 quantized version fits in 4.9GB, which is manageable on a 12GB RAM device. At 3.8B parameters, it's roughly the minimum size that can reliably produce structured JSON output. Smaller models tend to fall apart on formatting tasks.
Why ONNX Runtime?
Cross-platform support across Android, iOS, Windows, and Mac. There's a Unity C# binding, and the asus4/onnxruntime-unity package makes Unity integration straightforward. Most importantly, switching between hardware acceleration backends (QNN, NNAPI, CoreML) is a single line of code — which matters a lot when you're trying to get NPU acceleration working.
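As a sketch of what that one-line switch looks like with the ONNX Runtime C# API (the model path is a placeholder, and the QNN option value follows the QNN EP docs; check them for your ORT version):

```csharp
using System.Collections.Generic;
using Microsoft.ML.OnnxRuntime;

var options = new SessionOptions();

// QNN HTP on Snapdragon: backend_path selects the HTP backend library
options.AppendExecutionProvider("QNN", new Dictionary<string, string>
{
    ["backend_path"] = "libQnnHtp.so"
});

// Switching to NNAPI instead is just:
// options.AppendExecutionProvider_Nnapi();

using var session = new InferenceSession("model.onnx", options);
```

Everything else (tokenization, the decode loop, tensor I/O) stays identical regardless of which backend is appended.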
Why Unity?
Good ecosystem for 2D roguelikes. Android/iOS cross-platform builds. And I can write LLM inference code in C# alongside game logic without needing a Python bridge.
Why Min SDK 31?
Android 12 (API 31) introduced the ability to declare vendor partition libraries via uses-native-library. QNN HTP depends on libcdsprpc.so, which lives in the vendor partition. Without this declaration, NPU acceleration is completely off the table. Dropping below SDK 31 would mean giving up on QNN entirely.
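For reference, the declaration is a single element inside `<application>` in AndroidManifest.xml; `required="false"` lets the app still install on devices that don't ship the library:

```xml
<application>
  <uses-native-library
      android:name="libcdsprpc.so"
      android:required="false" />
</application>
```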
Why Samsung Galaxy S24 Ultra as the test device?
Snapdragon 8 Gen 3 with Hexagon NPU — one of the few consumer devices where QNN acceleration is actually possible. 12GB RAM gives enough headroom for the 4.9GB model. I wanted to measure the performance ceiling with the best available hardware first. If it doesn't work here, it doesn't work anywhere with current technology.
Also, it's my personal phone. There's no test device budget.
1. ONNX Runtime Setup
Installed com.github.asus4.onnxruntime v0.4.4 through a scoped registry (npmjs) in the Unity Package Manager. IL2CPP compatibility confirmed with no issues.
Downloaded Phi-4-mini ONNX (cpu_and_mobile variant) from Hugging Face: model.onnx at 52MB + model.onnx.data at 4.9GB.
2. Building a C# Tokenizer From Scratch
Phi-4-mini uses a tiktoken-style BPE tokenizer. No Unity C# implementation existed, so I wrote one.
Loaded vocab (200,029 entries), merges (199,742 entries), and special tokens (12) from tokenizer.json. Implemented GPT-2 byte↔unicode conversion table, BPE encoding/decoding with cache, and special token splitting.
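The byte↔unicode table is the standard GPT-2 trick: printable bytes map to themselves, and every other byte is shifted into unused codepoints starting at 256, so all 256 byte values become distinct printable characters. A minimal C# sketch (the method name is mine):

```csharp
using System.Collections.Generic;

// Printable bytes keep their own codepoint; the rest get 256, 257, ...
// assigned in order, so every byte maps to a distinct printable char.
static Dictionary<byte, char> BytesToUnicode()
{
    var map = new Dictionary<byte, char>();
    int shift = 0;
    for (int b = 0; b < 256; b++)
    {
        bool printable = (b >= 33 && b <= 126)
                      || (b >= 161 && b <= 172)
                      || (b >= 174 && b <= 255);
        map[(byte)b] = printable ? (char)b : (char)(256 + shift++);
    }
    return map;
}
```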
What broke:
Newtonsoft.Json.Linq.JValue → JArray cast failed
I assumed the merges format was "tok1 tok2" strings. It was actually ["tok1","tok2"] arrays. Added a branch to handle both formats.
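The branch I describe looks roughly like this (hypothetical helper name; `entry` is one item of the `merges` array parsed with Newtonsoft):

```csharp
using Newtonsoft.Json.Linq;

// tokenizer.json merges come as either "tok1 tok2" strings (older format)
// or ["tok1", "tok2"] arrays (newer format), so handle both.
static (string, string) ParseMerge(JToken entry)
{
    if (entry.Type == JTokenType.Array)
    {
        var arr = (JArray)entry;
        return ((string)arr[0], (string)arr[1]);
    }
    var parts = ((string)entry).Split(' ');
    return (parts[0], parts[1]);
}
```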
3. Building the Inference Engine
Implemented KV cache-based auto-regressive greedy decoding.
- 32 layers, 8 KV heads, head_size 128
- Prefill (full prompt at once) → Decode (one token at a time)
- past_key_values / present tensor management
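The overall loop, stripped of the tensor plumbing, looks like this. `RunPrefill` and `RunDecodeStep` are stand-ins for the actual `InferenceSession.Run` calls that feed `past_key_values` in and read `present` back out:

```csharp
// Prefill once over the whole prompt, then decode greedily one token
// at a time, carrying the KV cache between steps.
var ids = new List<int>(promptTokens);
var kv = RunPrefill(session, ids);
for (int step = 0; step < maxNewTokens; step++)
{
    int next;
    (next, kv) = RunDecodeStep(session, ids[^1], kv);
    if (next == eosTokenId) break;   // stop at end-of-sequence
    ids.Add(next);
}
```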
What broke (1):
CS1503: DenseTensor<long>(seqLen, new[] {batch, seqLen})
Fixed to new DenseTensor<long>(new[] {batch, seqLen}).
What broke (2):
model.onnx not found
Had the path at 3 levels up (../../..). It needed to be 2 levels (../..).
4. First Generation Test
Kept the prompt short, max 150 tokens.
[LLM] Generated in 181.4s (150 tokens max)
JSON came out:
[
{"floor":1,"mob":"게으른 빵집 아들","hp":50,"atk":10},
{"floor":4,"mob":"elite","hp":100,"atk":20},
{"floor":5,"mob":"boss","hp":200,"atk":40}
]
The mob name on floors 1-3 ("게으른 빵집 아들", "lazy baker's son") matches the player character name — that's a prompt issue I'll fix later. The important thing is the JSON structure is valid and complete.
5. Android Build
What broke:
compressReleaseAssets FAILED
Required array size too large
Putting a 5GB model in StreamingAssets hits Java's 2.1GB array limit. Renaming the folder didn't help — anything inside StreamingAssets gets included regardless of name. Solution: move the model folder completely outside of Assets, delete the Gradle cache (Library/Bee/Android, 15GB worth), rebuild.
Deployment approach:
adb push ./models/phi-4-mini \
/sdcard/Android/data/com.as1as.helpwantedhero/files/Models/phi-4-mini/
APK ships without the model. Model is pushed separately via adb (4.9GB, ~94 seconds).
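On the Unity side, that adb target typically corresponds to `Application.persistentDataPath` (the app's external files directory on Android), so loading looks like this (path layout assumed from the push command above):

```csharp
using System.IO;
using UnityEngine;

// /sdcard/Android/data/<package>/files maps to persistentDataPath on Android
string modelPath = Path.Combine(
    Application.persistentDataPath, "Models", "phi-4-mini", "model.onnx");

if (!File.Exists(modelPath))
    Debug.LogError($"Model not found, push it via adb first: {modelPath}");
```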
6. Korean Font in TextMeshPro
The default TMP font (LiberationSans) doesn't include Korean characters. Converted AppleSDGothicNeo.ttc using TMP Font Asset Creator.
Important: the Custom Range field only accepts decimal, not hex. Entering AC00-D7A3 throws a FormatException. Use this instead:
32-126,44032-55203,12593-12686
(ASCII + Hangul syllables 가-힣 + Hangul compatibility jamo ㄱ-ㆎ)
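The decimal values are just the hex codepoints converted, which is easy to check in C#:

```csharp
Convert.ToInt32("AC00", 16);  // 44032 (가)
Convert.ToInt32("D7A3", 16);  // 55203 (힣)
Convert.ToInt32("3131", 16);  // 12593 (ㄱ)
```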
7. Real Device Results
| Environment | Time | tok/s |
|---|---|---|
| Mac Editor (CPU) | 246s | 0.45 |
| S24 Ultra (CPU only) | 523s | 0.21 |
| S24 Ultra (QNN HTP runtime) | 490s | 0.31 |
The S24 Ultra is 2.1x slower than Mac. Adding QNN HTP barely moved the needle.
The reason showed up in the INFO logs:
Failed in loading stub: dlopen failed: library "libcdsprpc.so" not found
Failed to create transport for device, error: 4000
QNN EP registration succeeded, but the backend never actually initialized. The entire thing was falling back to CPU. libcdsprpc.so is Qualcomm's DSP RPC library — it lives in the vendor partition and isn't accessible from the app sandbox by default.
The fix is declaring it via uses-native-library in AndroidManifest. That ran into a separate issue: the custom manifest conflicted with Unity's auto-generated one, causing the app to disappear from the launcher entirely. I'll be using a Gradle template to inject just that one line instead.
What I Learned
- Min SDK 31 is required for vendor library declarations — and therefore for QNN HTP acceleration
- Don't put large files in StreamingAssets. Anything there gets compressed into the APK
- NNAPI is not full NPU acceleration. Most LLM operators fall back to CPU
- TMP Custom Range is decimal only
- 3.8B parameters on CPU is not viable for a game. NPU is not optional
Next: Getting QNN HTP to Actually Work — The libcdsprpc.so Wall