Jaydeep Shah (JD)

Posted on Jul 4

One Model, Three Chips, Two Files: How LiteRT Delegates Really Work

#edgeai #android #litert

When I first shipped Gemma 4 E2B to a phone, I had three ways to run it: Backend.CPU(), Backend.GPU(), Backend.NPU(). Three near-identical lines of code. I expected them to be three settings on the same engine, like choosing a quality level on a video export.

They were not. Each one needed its own model file. The NPU refused to load the file the GPU ran happily. Switching hardware was not a flag - it was a different artifact on disk.

That surprised me, and the question stuck: if it is the same model with the same weights, why can't one runtime just point it at whichever chip is free? Answering that took me down into a layer I had mostly taken for granted - the one that decides how a model's math actually lands on silicon.

This is the story of that layer. To understand it, we have to start one level below the model file, with something most of us never look at directly: the graph.

First, the graph

A model is not really "a pile of weights." Before it runs anywhere, it is expressed as a computation graph: a map of operations - matrix multiplies, additions, activations, normalizations - wired together, tensors flowing node to node. Weights are the numbers those ops consume. The graph is the recipe; the weights are the ingredients.

For a language model the graph is deep and repetitive: one transformer block (attention, projections, feed-forward, normalization) stacked dozens of times. Mostly the same handful of ops, repeated, data cascading down every layer.
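To make "recipe versus ingredients" concrete, here is a minimal sketch of a graph as plain data. Everything in it is a simplification for illustration: the Node type, the six-op block, and treating attention as a single "Attention" node are my assumptions, not how a .litertlm file actually stores its graph.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str      # unique node id
    op: str        # operation type, e.g. "FullyConnected"
    inputs: tuple  # names of the upstream nodes feeding this one


def transformer_block(i, prev):
    """One simplified block: norm -> attention -> add -> norm -> feed-forward -> add."""
    p = f"block{i}"
    return [
        Node(f"{p}.norm1", "RmsNorm", (prev,)),
        Node(f"{p}.attn", "Attention", (f"{p}.norm1",)),
        Node(f"{p}.add1", "Add", (prev, f"{p}.attn")),
        Node(f"{p}.norm2", "RmsNorm", (f"{p}.add1",)),
        Node(f"{p}.ffn", "FullyConnected", (f"{p}.norm2",)),
        Node(f"{p}.add2", "Add", (f"{p}.add1", f"{p}.ffn")),
    ]


def build_graph(num_blocks):
    """Stack the same block num_blocks times behind an embedding lookup."""
    graph = [Node("embed", "EmbeddingLookup", ("input",))]
    prev = "embed"
    for i in range(num_blocks):
        block = transformer_block(i, prev)
        graph.extend(block)
        prev = block[-1].name  # the next block consumes this block's output
    return graph


graph = build_graph(num_blocks=4)
op_counts = Counter(node.op for node in graph)
print(len(graph), "nodes,", len(op_counts), "distinct op types")
```

Note that no weights appear anywhere: the graph only says which ops run and in what order. Growing num_blocks grows the node count, but the set of distinct op types stays the same small handful.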

Now the part that decides everything: not every processor can run every operation. A CPU is general - hand it any op, it copes. A GPU runs the massively parallel ops beautifully and has little to offer the rest. An NPU is the most specialized: it is built to devour one specific menu - the matrix multiplies at the heart of attention - and has no circuitry for ops off that menu.

So a runtime has to get the graph onto the chip. There are two fundamentally different ways to do that, and this is the fork the whole post turns on.

Just-in-time (CPU and GPU). You hand the runtime the same model file. At load time it walks the graph and asks, node by node: can this backend run this op? The contiguous stretches it can run are grouped into subgraphs and prepared for that backend - the GPU even compiles its kernels on-device, right then, for your exact hardware. Anything it cannot run is simply left on the CPU. One model, split at runtime, stitched back together, with the CPU as a safety net. This is why CPU and GPU share one generic file, and why the GPU path is forgiving.
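The load-time walk can be sketched in a few lines. This is a toy model, not the LiteRT partitioner: it treats the graph as a flat op list (a real graph is a DAG), and both the GPU_SUPPORTED set and the "TopK" op standing in for an unsupported operation are hypothetical.

```python
def partition(ops, supported):
    """Group contiguous ops into (backend, [ops]) segments.

    Ops the accelerator supports are claimed for it; everything else
    stays on the CPU, which acts as the safety net.
    """
    segments = []
    for op in ops:
        backend = "GPU" if op in supported else "CPU"
        if segments and segments[-1][0] == backend:
            segments[-1][1].append(op)      # extend the current stretch
        else:
            segments.append((backend, [op]))  # start a new subgraph
    return segments


# Hypothetical supported-op set, for illustration only.
GPU_SUPPORTED = {"RmsNorm", "FullyConnected", "Mul", "Add", "Reshape"}

ops = ["EmbeddingLookup", "RmsNorm", "FullyConnected", "Mul",
       "TopK", "Reshape", "Add"]

plan = partition(ops, GPU_SUPPORTED)
for backend, segment in plan:
    print(backend, segment)
```

One unsupported op in the middle splits what could have been one GPU subgraph into two, with a hop back to the CPU in between. The model still runs, but every such split is a handoff between processors.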

Ahead-of-time (NPU). The Hexagon NPU, the neural accelerator on Qualcomm's Snapdragon chips, does not improvise at load time. Its graph is compiled offline, before it ever reaches the device, by Qualcomm's QNN toolchain - operators fused, weights quantized to the exact fixed-point formats the NPU supports, memory layouts rearranged, the whole thing lowered to a Hexagon-native binary. That compiled block ships inside the model file as a single sealed operation. There is no walking the graph at runtime, and no CPU safety net: the accelerated graph is a monolith built for one chip.

This is the crux. CPU and GPU take the graph just-in-time; the NPU takes it ahead-of-time. The same source model, prepared two completely different ways - and that single difference is why the NPU needs its own file, why that file will not run anywhere else, and why producing it is a heavier lift than a GPU export. The rest of this post is really just consequences of this one fork.

See it yourself

The graph is not an abstraction you have to take on faith. Open any .litertlm in Netron, a free model-graph viewer, and it is right there: nodes for every operation - EmbeddingLookup, Mul, Reshape, RmsNorm, FullyConnected - wired together, tensor shapes labeled on every edge.

The first thing that hits you is scale: the module list runs to hundreds of entries, the same transformer ops repeated block after block. This is what "a model is a graph" actually looks like.

Open the generic and the NPU files side by side, expecting to spot the difference at a glance, and you mostly can't. Both are large, and the modules do not line up one-to-one. And that is the honest lesson: the real difference between them is not something you eyeball in a node diagram. It is under the ops - the quantization formats, the fused kernels, the ahead-of-time compilation we just described. Netron shows you the recipe; it does not show you which kitchen it was compiled for.

Why one operation is three different things

Take the single most common operation in the model - a matrix multiply, the heart of every attention and feed-forward layer. On paper it is one line of math. Physically, it is three completely different acts:

On the CPU, a few powerful cores grind through it with NEON SIMD - each core handling a chunk of the math, a handful of lanes at a time. Correct, universal, slow.
On the GPU, hundreds of small cores each take a slice and run them at once. The same multiply, sprayed across a wide parallel array - the workload GPUs were born for.
On the NPU, dedicated fixed-point MAC (multiply-accumulate) hardware devours it in quantized integer form, at a fraction of the energy - but only because the numbers were pre-quantized to the exact format the silicon expects.

Same node in the graph. Three different pieces of hardware, three different execution models, three different power-and-speed tradeoffs.
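What "quantized integer form" means for a matrix multiply can be shown with a small worked sketch. The scheme here is symmetric per-tensor int8, chosen because it is the simplest to read; it is an assumption for illustration, not the actual fixed-point format QNN produces for Hexagon.

```python
def quantize(matrix, bits=8):
    """Symmetric per-tensor quantization: floats -> (integer matrix, scale)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / qmax if max_abs else 1.0
    q = [[round(v / scale) for v in row] for row in matrix]
    return q, scale


def matmul(a, b):
    """Plain matrix multiply; works for float or integer entries."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def quantized_matmul(a, b):
    """Multiply in integers, then rescale the accumulators back to floats."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    acc = matmul(qa, qb)  # integer multiply-accumulate only
    return [[v * scale_a * scale_b for v in row] for row in acc]


a = [[0.5, -1.0], [0.25, 0.75]]
b = [[1.0, 0.5], [-0.5, 0.25]]

exact = matmul(a, b)
approx = quantized_matmul(a, b)
max_err = max(abs(e - q)
              for e_row, q_row in zip(exact, approx)
              for e, q in zip(e_row, q_row))
print("largest difference from the float result:", max_err)
```

The inner loop of quantized_matmul touches only integers; the floats appear once at the end, when the accumulators are rescaled. That is the trade: cheap integer arithmetic in exchange for a small, bounded rounding error, and it only works if the scales were fixed before the multiply ran.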

Now the more interesting case: an operation a backend cannot meaningfully run. Accelerators love predictable, static-shaped, fixed-precision math. Give the GPU delegate an op it does not implement - something with dynamic shapes, or a rare op outside its supported set - and, because CPU and GPU work just-in-time, that op simply falls back to the CPU (the LiteRT GPU delegate documents exactly this supported-op-set behavior). The model still runs; one node just runs somewhere else.

The NPU has no such luxury. Its graph was compiled ahead of time - so an unsupported op cannot quietly fall back mid-stream. It has to be handled before the model ever ships: fused away, replaced, quantized to fit, or kept off the NPU entirely. There is no runtime safety net inside a block built for one chip.

This is the concrete reason the same model becomes different artifacts. The generic file can afford a mixed, forgiving execution. The NPU file has to be a clean, self-consistent, pre-compiled whole - because at runtime, the NPU does not improvise.
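The ahead-of-time side of the contrast can be sketched the same way. This toy compile_aot is my own stand-in, not the QNN toolchain: it simply refuses to build when an op is off the menu, whereas a real toolchain may also fuse, replace, or re-quantize such ops. The NPU_SUPPORTED set and the "TopK" op are hypothetical.

```python
class UnsupportedOpError(Exception):
    """Raised at build time when an op has no place in the compiled block."""


def compile_aot(ops, npu_supported):
    """Offline step: every op must fit the target chip, or the build fails.

    There is no per-op fallback here; the decision is made once,
    before the file ever reaches a device.
    """
    missing = [op for op in ops if op not in npu_supported]
    if missing:
        raise UnsupportedOpError(f"cannot compile for NPU: {missing}")
    # The whole graph collapses into one opaque, chip-specific block.
    return {"op": "SealedNpuBlock", "fused": tuple(ops)}


# Hypothetical supported-op set, for illustration only.
NPU_SUPPORTED = {"RmsNorm", "FullyConnected", "Mul", "Add"}

sealed = compile_aot(["RmsNorm", "FullyConnected", "Add"], NPU_SUPPORTED)
print(sealed["op"], "containing", len(sealed["fused"]), "ops")

try:
    compile_aot(["RmsNorm", "TopK", "Add"], NPU_SUPPORTED)
    build_failed = False
except UnsupportedOpError as err:
    build_failed = True
    print("build failed:", err)
```

The failure surfaces on the build machine, not on the phone. By the time a runtime sees the result, it is one node it either can execute as a whole or cannot execute at all.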

So this is what a delegate is

Now the definition lands with weight. A delegate is the component that takes part of your model's graph and runs it on a specific backend. It is the thing that answers, for each operation, "who executes this, and how?"

With everything we have built up, the picture is precise:

The delegate walks the graph and claims what its backend can run. The stretches it accepts become subgraphs it owns; the rest stays on the CPU. You do not hand-assign ops - you pick a backend, and the delegate does the partitioning.
For CPU and GPU, that partitioning happens just-in-time, on device, at load - which is why unsupported ops fall back gracefully and the same file serves both.
For the NPU, the "delegation" was effectively decided ahead of time, when the graph was compiled for Hexagon. By the time the file is on the phone, the choice is already sealed inside it.

So a delegate is not a setting you flip. It is a translator with a fixed vocabulary: it runs the parts of your model it has words for, and hands the rest back. Backend.GPU() and Backend.NPU() look like twins in your code - but one is a translator improvising live, and the other is a translation that was finalized in a studio months ago, by someone else's toolchain.

That last point is where the clean abstraction starts to leak - and where "just use the best available hardware" stops being a one-liner.

The three delegates, briefly

With the mechanism clear, the three delegates are easy to meet - less as spec sheets, more as three characters with different temperaments.

CPU (XNNPACK) - the one that always shows up. It runs any op, on any device, no special setup. It is also the slowest for an LLM - single-digit tokens per second on Gemma 4 E2B. But it is the floor nothing falls through: when a fancier backend gives up on an op, this is who catches it.

GPU (OpenCL, falling back to OpenGL) - fast but temperamental. Meaningfully quicker than CPU, because matrix math is what GPUs do. The catch is the environment: performance and even correctness depend on the device, the Android version, and the GPU driver. Our worst hackathon rabbit hole was exactly this - a broken OpenGL path on Android 16 that masqueraded as an NPU failure for hours. When it works, it is the reliable middle. When the driver betrays you, you learn a lot about logs.

NPU (Qualcomm QNN, on the Hexagon DSP) - the fastest lane with the narrowest on-ramp. By far the best throughput - 41.7 tokens/second in our Redacto project's benchmark on a Samsung Galaxy S25 Ultra (Snapdragon 8 Elite), several times the CPU's rate. But everything we built up in this post is the price of admission: an ahead-of-time-compiled model file, a set of Qualcomm dispatch libraries shipped alongside it, and a DSP library path that must be set before any inference code loads, or the chip never wakes up. It is the most powerful and the least forgiving of the three.

Three backends, one goal, three completely different personalities. Which sets up the question everyone asks next: if the NPU is fastest, why not just try it, and fall back to GPU or CPU when it is not there?

Why "just fall back to the best chip" is hard

The intuition is a three-line if/else: try the NPU, drop to the GPU if it is missing, drop to the CPU if all else fails. Pick the best hardware available, gracefully degrade.

Here is why it is not that simple - and everything we built up is the reason.

Each backend needs a different artifact. The NPU file is an ahead-of-time-compiled Hexagon binary; as we saw, it will not run on a GPU or CPU - there is no kernel for its compiled block. So you cannot catch an NPU failure and retry the same file on the GPU. Falling back is not flipping a backend flag - it is loading a different model file and selecting a different backend, together, as one unit. The fallback ladder is a list of (model file, backend) pairs, not a list of backends.

And you can only step onto a rung you actually have a file for. This is the limitation we hit head-on. For the stock model, Qualcomm had already produced the NPU-compiled file, so we had all three rungs. But when we fine-tuned Gemma 4 E2B, we could export the CPU/GPU file ourselves in an afternoon - thanks to the tooling the LiteRT team at Google has built - and we could not produce the NPU file at all. Compiling a custom graph for Hexagon (op coverage, quantization calibration, the whole QNN toolchain) was precise, vendor-specific work we could not self-serve. So our fine-tuned model's "cascade" had only two rungs: GPU, then CPU. The fastest lane simply did not exist for it.

That is the honest shape of "just use the best hardware." It is not an if/else over chips. It is a ladder of (model file, backend) pairs - where some rungs may not exist for your model, and the fastest rung is precisely the one you cannot build yourself.
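Here is a minimal sketch of that ladder as data. The Rung type, the file names, and the fake loader are all hypothetical; in a real app, try_load would wrap whatever engine-creation call your runtime provides, and the exception type would be whatever that call throws.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rung:
    backend: str
    model_file: Optional[str]  # None means no file exists for this backend


def load_first_available(ladder, try_load):
    """Walk the ladder top-down; return the first (rung, engine) that loads."""
    for rung in ladder:
        if rung.model_file is None:
            continue  # this rung was never built for this model
        try:
            engine = try_load(rung.model_file, rung.backend)
        except RuntimeError:
            continue  # e.g. driver bug or missing libraries; try the next pair
        return rung, engine
    raise RuntimeError("no (model file, backend) pair could be loaded")


def fake_loader(broken_backends):
    """Stand-in for a real engine constructor, used only to exercise the ladder."""
    def try_load(model_file, backend):
        if backend in broken_backends:
            raise RuntimeError(f"{backend} failed to initialize")
        return f"engine({model_file} on {backend})"
    return try_load


# Hypothetical file names. The stock model has all three rungs.
stock = [
    Rung("NPU", "stock_npu.litertlm"),
    Rung("GPU", "stock_generic.litertlm"),
    Rung("CPU", "stock_generic.litertlm"),
]
# The fine-tuned model has no NPU build, so its top rung is missing.
finetuned = [
    Rung("NPU", None),
    Rung("GPU", "finetuned_generic.litertlm"),
    Rung("CPU", "finetuned_generic.litertlm"),
]

stock_rung, _ = load_first_available(stock, fake_loader(set()))
tuned_rung, _ = load_first_available(finetuned, fake_loader(set()))
degraded_rung, _ = load_first_available(finetuned, fake_loader({"GPU"}))
print(stock_rung.backend, tuned_rung.backend, degraded_rung.backend)
```

The file and the backend always travel together. The GPU and CPU rungs happen to name the same generic file, but the NPU rung never can, and for the fine-tuned model it is simply absent rather than failing at runtime.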

What delegates really decide

When I started, I thought choosing a backend was like choosing a video export quality - one setting, same file, different speed. It is not. A delegate quietly decides far more than speed:

Which file you ship - the NPU needs its own ahead-of-time build; CPU and GPU share one.
What you can even offer your users - the fastest lane may not exist for a custom model, and features like constrained decoding differ by backend.
How you start the app - the NPU wants its libraries and DSP paths in place before any inference code runs.
How you fail - a driver bug, a missing rung, an op with no home - each backend fails its own way.

That is why Backend.CPU(), Backend.GPU(), Backend.NPU() are three of the most deceptively simple lines in edge AI. They read like three settings. They are three different worlds, and the delegate is the layer that quietly translates your one model into whichever world it can reach.

Which, when you sit with it, is a small marvel: one runtime, one line of code, standing between your model and three completely different pieces of silicon - and mostly making it look easy. Understanding where that "easy" leaks is what turns a demo into something you can ship.

Related in this series of "Edge AI from the Trenches"

Why My LLM Runs 4x Faster on Hardware I Had Never Heard Of - the NPU hardware behind the fastest delegate
I Opened a .litertlm File. Here Is What Is Actually in There. - why NPU and GPU models are separate files with different compiled ops
Six Ways NPU Init Will Fail - the failure modes you hit when wiring up the NPU delegate in practice

Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware. Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.

Sources:

LiteRT Delegates - how delegates partition a graph across backends
GPU delegates for LiteRT - supported ops and CPU fallback behavior
XNNPACK - the CPU backend
Netron - the model-graph viewer used above
Qualcomm AI Engine Direct (QNN) SDK - ahead-of-time compilation for the Hexagon NPU
Benchmark data: Redacto project, Samsung Galaxy S25 Ultra (Snapdragon 8 Elite, SM8750)

Last updated: July 2026
9th of 22 posts in the "Edge AI from the Trenches" series

DEV Community