DEV Community: Jaydeep Shah (JD)

What Happens When 15 Apps Each Ship a 3 GB Model?

Jaydeep Shah (JD) — Mon, 13 Jul 2026 00:55:00 +0000

The on-device AI industry feels like it is in its "single app demo" phase right now. Most of what I see - hackathon projects, conference talks, developer blogs, including my own - showcases one app running one model on one phone. The demo works. The audience claps. Everyone moves on.

But it left me with an uncomfortable question I have not seen discussed much: what happens when 15 apps on the same phone each ship their own multi-GB model?

I started thinking about this while building Redacto, a zero-trust document redaction app I built for the Qualcomm x Google LiteRT hackathon. Redacto runs Gemma 4 E2B entirely on-device: no internet permission, no cloud calls, no BAA (Business Associate Agreement) required. It works beautifully in isolation. But the moment I started measuring what it actually consumes - storage, memory, NPU time, I/O bandwidth - I started to wonder whether we are building toward a systemic crunch that the tooling has not caught up with yet.

This piece lays out the math and the questions it raised for me. On the numbers: the measurements here come from a single directional benchmarking session in May 2026.

A note on where I am standing. This is my naive, practitioner-level read from limited exposure - one app, one device, one hackathon. I have not surveyed every tool, runtime, or architecture out there, so treat this as "here is what I observed and what I suspect," not a verdict on the state of the industry. Some of the gaps I describe may already be solved by tools I simply have not come across yet; others may be in active development as I write this; and some may be genuinely missing, the kind of thing the ecosystem will grow into. I plan to dig further, test some of this in practice, and I would genuinely like to hear from readers who know of work that already addresses these problems. Consider this the start of a conversation, not the last word.

The Assumptions Behind This

The whole argument rests on three things I believe will be true as on-device AI scales, based on what I saw building one app. If these do not hold, the crunch looks different, so it is worth stating them plainly.

1. Apps will not converge on one shared model. In my experience, an out-of-the-box model is not always capable of what a specific app needs. A redaction app, a translation app, and a coding assistant are asking genuinely different things of the model. So I expect developers to reach for fine-tuning: some will ship a fully fine-tuned model, many will ship a base model plus a task-specific adapter (LoRA-style), and different teams will start from different bases (one team likes Gemma, another Llama, another Qwen). The realistic picture is not "everyone bundles the same stock Gemma" but a mix: several distinct base models, plus a spread of adapters layered on the common ones. That mix is exactly what makes naive deduplication hard - which I come back to later.

2. It is not only LLMs competing for the hardware. The "15 apps" framing is really shorthand for on-device AI in general. Long before the LLM wave, apps already shipped classic ML models: camera scene detection, computational photography, face and object recognition, speech-to-text, OCR, recommendation and ranking models. Those are smaller than a 2B LLM, but there are a lot of them, many run continuously, and they draw on the same finite NPU, memory bandwidth, storage, and I/O. So the real contention picture is a mix of a few heavy LLMs and many lighter classic-ML models all summing against one device's budget - which makes the crunch arrive sooner than an LLM-only count suggests.

3. Apps will not all be in the foreground. Some AI apps are foreground-only: you open them, run inference, close them (Redacto is one). Others will run as persistent background services: always-on assistants, continuous audio processing, real-time translation. The contention problems below are most acute precisely because these two kinds of apps coexist and compete for the same NPU, memory, and I/O without any arbiter deciding who yields.

The Storage Math: ~84 GB of Weights on a 256 GB Phone

Let me start with what Redacto actually ships.

The app needs two model files to support its backend cascade (try NPU, fall back to GPU, fall back to CPU):

File	Size	Purpose
`gemma4.litertlm`	2.59 GB	GPU and CPU inference (standard tflite ops)
`gemma4_npu.litertlm`	3.02 GB	NPU inference (QNN-compiled DISPATCH_OP custom ops for Hexagon V79)
Total	5.61 GB	One app, one model architecture, two hardware targets

These two files exist because the NPU variant contains chip-specific compiled operations that only run on Hexagon V79. You cannot use the GPU model on the NPU or vice versa. The dispatch delegate architecture requires separate compilation targets, so any app that wants to support both fast-path (NPU) and fallback (GPU/CPU) needs to ship both files. This is not a packaging inefficiency: it is a fundamental consequence of how heterogeneous compute works on mobile SoCs.
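The cascade described above can be sketched roughly as follows. This is a minimal Python illustration, not Redacto's actual code: `init_with_cascade`, `BackendUnavailable`, `fake_loader`, and the `MODEL_FOR_BACKEND` mapping are hypothetical names, and the real LiteRT-LM API is deliberately not shown.

```python
from typing import Callable

# Which file each backend needs, per the table above.
MODEL_FOR_BACKEND = {
    "npu": "gemma4_npu.litertlm",  # QNN-compiled, Hexagon-specific
    "gpu": "gemma4.litertlm",      # standard ops
    "cpu": "gemma4.litertlm",      # same file as the GPU path
}

class BackendUnavailable(Exception):
    """Hypothetical error raised when a backend cannot initialize."""

def init_with_cascade(load_engine: Callable[[str, str], object],
                      order=("npu", "gpu", "cpu")):
    """Try each backend in order; return (backend, engine) for the first that works."""
    errors = {}
    for backend in order:
        try:
            return backend, load_engine(backend, MODEL_FOR_BACKEND[backend])
        except BackendUnavailable as exc:
            errors[backend] = str(exc)
    raise RuntimeError(f"no backend could be initialized: {errors}")

# Stand-in loader for illustration: pretend the NPU cannot be acquired.
def fake_loader(backend: str, model_file: str):
    if backend == "npu":
        raise BackendUnavailable("Failed to find available PD")
    return f"engine({backend}, {model_file})"

backend, engine = init_with_cascade(fake_loader)
print(backend, engine)
```

The mapping is the part that matters for storage: the fast path and the fallback path point at different files, so both have to be on disk even though only one is used per session.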

Now scale that to a phone with multiple AI-powered apps. Assume a 256 GB phone, of which roughly 250 GB is usable after the OS, system apps, and firmware:

Scenario	Model Storage	Share of Usable Storage (~250 GB)
1 app (Redacto)	5.61 GB	2.2%
5 apps	~28 GB	11%
10 apps	~56 GB	22%
15 apps	~84 GB	34%

This conservatively assumes every app ships the same 5.61 GB footprint (one 2B INT4 model, two hardware targets). Per assumption 1 above, real apps will vary - some larger, some sharing a base model - but the aggregate pressure is the point. And 250 GB usable is the ceiling before real content: photos, videos, and existing apps typically consume another 40-60% on top of the OS.
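For anyone who wants to rerun the table with their own numbers, the arithmetic is small enough to sketch. The constants are the figures quoted above; the function name is only illustrative.

```python
PER_APP_GB = 2.59 + 3.02   # GPU/CPU file + NPU file, from the table above
USABLE_GB = 250            # the assumed usable capacity of a 256 GB phone

def model_storage(apps: int) -> tuple[float, float]:
    """Return (total GB of model files, percent of usable storage)."""
    total = apps * PER_APP_GB
    return total, 100 * total / USABLE_GB

for apps in (1, 5, 10, 15):
    total, share = model_storage(apps)
    print(f"{apps:>2} apps: {total:6.2f} GB  ({share:.1f}% of {USABLE_GB} GB)")
```

Swapping `PER_APP_GB` for a larger per-app footprint, or `USABLE_GB` for what is left after photos and existing apps, shows how quickly the share climbs.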

At 15 apps, you have burned through a third of your phone's storage on AI model weights alone. And this assumes every app uses a 2B parameter model quantized to INT4. Larger models - 4B, 7B, or multimodal variants with vision encoders - would be significantly worse. A single 7B INT4 model can exceed 5 GB. Two variants of that for NPU and GPU, and one app consumes 10+ GB.

This is not a problem you can solve by buying a bigger phone. The 512 GB tier adds cost, and the storage pressure grows linearly with the number of AI apps users install.

NPU Contention: No Scheduler, No Priority, No Recovery

Here is something I learned the hard way during the hackathon: the Hexagon DSP's QNN Protection Domain cannot be re-acquired within the same process after it has been released.

More precisely, the constraint as I hit it:

Once a non-NPU Engine has been instantiated and closed in the process, the Hexagon DSP's PD (Protection Domain) reservation can't be re-acquired by a subsequent NPU Engine in the same process. This is a QNN/LiteRT-LM constraint, not something fixable from app code without restarting the process.

The error looks like this:

E QnnDsp: Failed to find available PD for contextId 5 ... err: 1002
E tflite : Encountered unresolved custom op: DISPATCH_OP.

This is a single-app, single-process limitation. Now consider the multi-app scenario.

During the hackathon, I observed teams building apps that run models as persistent background services: always-on assistants, continuous audio processing, real-time translation. Other apps, like Redacto, are foreground-only: the user opens the app, runs inference, closes it.

What happens when three apps need the Hexagon DSP simultaneously?

As far as I could find, there is no clean answer to that question today. From what I saw, the QNN stack has:

No multi-tenant scheduler. There is no system-level arbiter deciding which app gets NPU time and which app waits.
No priority API. A foreground app actively responding to user input has no way to preempt a background service that is running a low-priority classification loop.
No graceful eviction. If one app holds the Protection Domain, another app's NPU init will fail with error 1002. There is no queue, no callback, no "try again when available" mechanism.
No recovery without process restart. Even within a single app, losing the PD means you must kill the process and relaunch. Across apps, this means the system would need to terminate another app's process to free the NPU.

Compare this to how Android handles GPU contention today. The GPU has a kernel-level scheduler (on Qualcomm: the KGSL driver) that multiplexes compute contexts across processes. An app does not need to know or care whether another app is using the GPU. The driver handles context switching, priority management, and memory isolation transparently. This infrastructure took years to mature: Adreno's GPU scheduler evolved through multiple Android releases in the 2013-2017 era, driven by the demands of mobile gaming.

The NPU feels like it is roughly where the GPU was around 2010. We have the hardware. What I could not find is the software infrastructure for multi-tenant access - though I would be glad to be pointed to it if it exists.
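To make the missing piece concrete, here is a toy sketch of the smallest arbiter I can imagine: one NPU slot, and requests that queue by priority instead of failing. Everything in it (`NpuArbiter`, the two priority levels, the app names) is hypothetical. It does not model preemption, and it is not how QNN or Android work today.

```python
import heapq
import itertools

FOREGROUND, BACKGROUND = 0, 1  # lower value is served first

class NpuArbiter:
    """Toy arbiter: one NPU slot, waiters queued by priority, then arrival order."""

    def __init__(self):
        self._holder = None
        self._waiters = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def request(self, app: str, priority: int) -> bool:
        """Grant the NPU if it is free; otherwise queue the request instead of failing."""
        if self._holder is None:
            self._holder = app
            return True
        heapq.heappush(self._waiters, (priority, next(self._seq), app))
        return False

    def release(self, app: str):
        """Release the NPU and hand it to the best waiter; returns the new holder."""
        if self._holder != app:
            raise ValueError(f"{app} does not hold the NPU")
        self._holder = None
        if self._waiters:
            _, _, self._holder = heapq.heappop(self._waiters)
        return self._holder

arbiter = NpuArbiter()
arbiter.request("translator-service", BACKGROUND)   # granted immediately
arbiter.request("audio-classifier", BACKGROUND)     # queued
arbiter.request("redacto", FOREGROUND)              # queued, but ahead of background work
print(arbiter.release("translator-service"))
```

Even this much would turn "init fails with error 1002" into "wait your turn", which is the behavior the list above says is missing.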

Memory Contention: 1.9 GB Per App, and the LMK Knows It

Redacto's peak RSS (Resident Set Size) on the NPU backend was 1,934 MB: nearly 2 GB of physical memory consumed by a single app running a single 2B parameter model.

On the GPU backend, it was 1,375 MB, still over 1.3 GB.

These come from my own benchmark runs (peak RSS via /proc/self/status VmRSS), and as above are directional figures from a single May-2026 session, not a multi-run average.
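For readers who want to take the same kind of reading, here is a minimal sketch of pulling memory fields out of `/proc/<pid>/status` text. The helper names are mine, and the sample values are made up. Besides `VmRSS` (current resident set size), the kernel also reports `VmHWM`, the peak resident set size, in the same file.

```python
import re

def parse_status_kb(status_text: str, field: str) -> int:
    """Return the value in kB of a field such as 'VmRSS' or 'VmHWM'."""
    match = re.search(rf"^{re.escape(field)}:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise KeyError(field)
    return int(match.group(1))

def read_self_rss_mb() -> float:
    """Current RSS of this process in MB (Linux/Android only)."""
    with open("/proc/self/status") as f:
        return parse_status_kb(f.read(), "VmRSS") / 1024

# Example with a captured snippet (values are made up for illustration).
sample = "Name:\tdemo\nVmHWM:\t 1980416 kB\nVmRSS:\t 1500000 kB\n"
print(parse_status_kb(sample, "VmRSS"), parse_status_kb(sample, "VmHWM"))
```

Sampling `VmRSS` in a loop and keeping the maximum gives an approximate peak; reading `VmHWM` once at the end gives the kernel's own high-water mark.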

The NPU variant uses approximately 560 MB more RSS than the GPU variant. This delta comes from the larger NPU model file (3.02 GB vs 2.59 GB on disk, with a proportional difference in runtime memory mapping) and the QNN runtime's internal buffer allocations for the Hexagon DSP communication pathway.

Now do the multi-app math:

Scenario	Estimated Combined RSS	Share of Typical Flagship RAM (12 GB)
1 AI app (NPU)	~1.9 GB	16% of total RAM
2 AI apps (NPU)	~3.8 GB	32% of total RAM
3 AI apps (NPU)	~5.7 GB	48% of total RAM
2 AI apps + Android OS + system services	~6-7 GB	50-58% of total RAM

Android's Low Memory Killer daemon (lmkd), the userspace daemon that replaced the older in-kernel low memory killer driver, decides what to kill based on each process's oom_score_adj value, which the framework assigns according to the process's importance. Foreground apps get the lowest scores (highest priority), cached background apps get the highest (first to die). When memory pressure crosses configured thresholds, lmkd starts killing processes from the highest score downward, so the least important go first.
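A toy model of that kill order makes the blind spot easier to see. This is a deliberate simplification, not lmkd's real algorithm: the `Proc` type, the scores, and the sizes are illustrative, and real behavior depends on the device, the Android version, and the configured pressure thresholds.

```python
from dataclasses import dataclass

@dataclass
class Proc:
    name: str
    oom_score_adj: int   # higher means killed earlier
    rss_mb: int

def pick_victims(procs: list[Proc], need_mb: int) -> list[str]:
    """Choose processes to kill, highest oom_score_adj first, until need_mb is freed."""
    victims, freed = [], 0
    for p in sorted(procs, key=lambda p: p.oom_score_adj, reverse=True):
        if freed >= need_mb:
            break
        victims.append(p.name)
        freed += p.rss_mb
    return victims

# Illustrative scores and sizes only.
procs = [
    Proc("ai-app-foreground", 0, 1934),
    Proc("ai-app-background-service", 200, 1934),
    Proc("cached-browser", 900, 400),
    Proc("cached-mail", 950, 250),
]
print(pick_victims(procs, need_mb=1000))
```

Note what the function never looks at: whether a process's memory is a model, a cache, or a leak. Score and size are all it has.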

Here is the problem: AI apps all request android:largeHeap="true" in their manifests. They have to - not for the model weights, which are memory-mapped, but for the Java-side allocations (tokenizer state, tensor buffers, intermediate activations) that a 2B model pushes past the default heap ceiling. But largeHeap does not give an app more physical memory: it raises the per-process Java heap ceiling. The model weights themselves are memory-mapped from the filesystem, consuming physical RAM pages regardless of the Java heap setting. And lmkd does not distinguish between "this app is using 1.9 GB because it loaded an AI model" and "this app is using 1.9 GB because it has a memory leak." Both get the same oom_score_adj treatment.

The practical consequence: with two AI apps running, Android will aggressively kill background processes to maintain free memory headroom. Users will see their non-AI apps being evicted and cold-restarting constantly. And if both AI apps are in the foreground (split-screen, picture-in-picture, or rapid task-switching), the system has to choose which one to kill. There is no mechanism for the OS to say "unload this app's model but keep the process alive": the granularity is process-level, not model-level.

Every AI app developer will independently arrive at the same conclusion: request largeHeap, memory-map the model, hold it in RAM for fast inference. And every one of them will be correct for their app in isolation. The tragedy-of-the-commons failure only emerges at the system level.

I/O Pressure: Multi-GB Cold Reads on a Shared Bus

Model loading is not instant. On first launch, Redacto reads the full 2.59 GB (GPU) or 3.02 GB (NPU) model file from flash storage into memory. Even on UFS 4.0, with a theoretical sequential read throughput of around 4.2 GB/s, that is at least 0.6 to 0.7 seconds of raw transfer for a single model load, before any initialization work and assuming the storage actually sustains its peak rate.

Redacto's ahead-of-time (AoT) compilation cache (roughly 327 MB of compiled kernels) speeds up subsequent launches; on the NPU path the engine initialized in about 2.5 seconds once the cache was warm. The cache itself consumes additional storage. And the first launch, the one that determines whether a user keeps or uninstalls the app, always pays the full cold-read cost.

Now picture three AI apps launching in sequence after a phone reboot. Each reads 2.59-3.02 GB from flash. That is roughly 7.8 to 9 GB of sequential reads hitting the UFS controller in rapid succession. UFS storage is a shared bus: the same controller handles app installs, photo writes, database commits, and system updates. Sustained large sequential reads from multiple processes will compete with the random I/O patterns that the rest of the system depends on.
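A back-of-the-envelope lower bound on those reads, using only the figures quoted in this section. The function name is mine, and the calculation assumes the storage sustains its theoretical peak with nothing else contending for it, which is the best case rather than the expected one.

```python
UFS4_PEAK_GB_PER_S = 4.2   # theoretical sequential read figure quoted above

def min_read_seconds(size_gb: float, throughput_gb_per_s: float = UFS4_PEAK_GB_PER_S) -> float:
    """Lower bound on transfer time: data to read divided by peak throughput."""
    return size_gb / throughput_gb_per_s

gpu_file, npu_file = 2.59, 3.02
print(f"one GPU model: {min_read_seconds(gpu_file):.2f} s")
print(f"one NPU model: {min_read_seconds(npu_file):.2f} s")
print(f"three apps:    {min_read_seconds(3 * gpu_file):.2f}-{min_read_seconds(3 * npu_file):.2f} s")
```

Passing a lower `throughput_gb_per_s` is a quick way to see how the picture changes once the reads have to share the controller with everything else.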

The AoT cache approach, which Redacto uses to accelerate re-launches, is itself a storage trade-off. Each cache consumes additional flash space. If every app's cache were about the size of Redacto's (roughly 327 MB), 15 apps would add close to 5 GB on top of the model files themselves.

What the Ecosystem Needs

These do not feel like hypothetical problems to me. They look like a likely consequence of the current architecture as on-device AI adoption scales. Every number I have cited is from one app with one 2B model on current hardware, so read the trajectory as a hypothesis I want to test further, not a foregone conclusion.

Here is what I think the ecosystem, specifically the Android/LiteRT and QNN/Hexagon stacks, needs to build:

1. Shared model runtime. If five apps all use Gemma 4 E2B, the phone should load one copy of the model into memory and serve inference requests from all five apps through a shared service. This is architecturally similar to how Android's SurfaceFlinger composites GPU output from multiple apps through a single display pipeline, or how the mediaserver process manages hardware codec access. A shared inference service would reduce the per-app memory overhead from ~1.9 GB to a fraction of that (the model weights would be shared; only per-app context buffers would be private).

2. OS-level model deduplication - keyed on more than the whole file. If three apps bundle the same gemma4.litertlm file, the OS should recognize the duplication at install time and store one physical copy. Android has a loose precedent in native libraries: the uses-native-library manifest tag and the linker namespace system let an app link against a single device-provided copy of a library instead of bundling its own. But per assumption 1, whole-file dedup only catches the easy case. The harder, more realistic case is apps that share a base model but layer different fine-tunes or adapters on top: naive hashing sees these as entirely different files. Real deduplication would need to work at the level of shared base weights plus per-app adapter deltas - a system-managed model registry that understands "same base, different adapter," not just "identical bytes." That is a bigger ask, and it is the one that actually matches how I expect apps to ship models.

3. NPU scheduling and priority APIs. The Hexagon DSP needs a multi-tenant scheduler analogous to what KGSL provides for Adreno GPUs. At minimum: a queue-based access model, foreground/background priority levels, and a preemption mechanism so that a foreground app's inference request does not block behind a background service's batch job. The Protection Domain limitation (error 1002 on re-acquisition) needs to become a recoverable condition, not a process-fatal one.

4. Model-aware memory management. lmkd needs to understand that memory-mapped model weights are a different category than application heap. Ideally, the OS could evict model pages under memory pressure and transparently reload them from flash when needed - similar to how Linux handles page cache eviction for memory-mapped files, but with awareness of the latency cost (reloading a 2.59 GB model is not the same as re-reading a 4 KB config file). A model-level eviction API would let the system reclaim memory from background AI apps without killing their processes entirely.

5. Storage-aware model delivery. Instead of each app bundling its own model files, the OS could manage a shared model cache - similar to how Android's ART manages ahead-of-time compiled dex files in a system-managed directory. Apps would declare their model dependencies (model ID, version, quantization level, target hardware), and the OS would download, store, and deduplicate them centrally. Google Play's on-demand asset delivery already has the infrastructure for large file management; extending it to model files with deduplication semantics is a natural evolution.
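The deduplication idea in item 2 is easy to sketch as a content-addressed store. This is a toy under stated assumptions: `ModelStore` and its methods are hypothetical, the payloads are tiny stand-ins, and it assumes apps ship the base and the adapter as separate files, which is exactly the packaging convention that does not exist yet.

```python
import hashlib

class ModelStore:
    """Toy content-addressed store: identical blobs are kept once, whoever ships them."""

    def __init__(self):
        self.blobs = {}      # sha256 hex digest -> bytes
        self.manifests = {}  # app name -> list of digests (base first, then adapter)

    def _put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)
        return digest

    def install(self, app: str, base: bytes, adapter: bytes = b"") -> None:
        """Register an app's model as a shared base plus an optional private adapter."""
        hashes = [self._put(base)]
        if adapter:
            hashes.append(self._put(adapter))
        self.manifests[app] = hashes

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blobs.values())

# Tiny stand-in payloads; real base models are GB-scale.
base = b"B" * 1000
store = ModelStore()
store.install("redaction", base, adapter=b"r" * 10)
store.install("translation", base, adapter=b"t" * 10)
store.install("notes", base)
print(store.stored_bytes())   # 1020: base stored once, plus two adapters

# A fully fine-tuned (merged) model hashes differently, so whole-file dedup misses it.
store.install("merged", base[:-1] + b"X")
print(store.stored_bytes())   # 2020
```

The last two lines are the hard case from item 2: one changed byte in a merged model and hashing alone sees an entirely new file.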

The GPU Analogy: We Have Been Here Before

In the early days of Android gaming (2010-2013), GPU compute on mobile was in a remarkably similar state to where NPU compute is today.

Multiple apps could submit GPU workloads, but there was no real scheduling: whichever app got the GPU first held it. Context switching was expensive and poorly defined. Power management was coarse-grained. Memory isolation between GPU contexts was incomplete. Developers had to manage GPU memory manually because the system would not do it for them.

Over the next several years, Qualcomm's KGSL (Kernel Graphics Support Layer) driver evolved to handle multi-context scheduling, priority-based preemption, per-context memory accounting, and cooperative power management. ARM's Mali and Imagination's PowerVR went through similar evolutions. By the time Android gaming matured, the GPU software stack had caught up to the hardware's capabilities.

The NPU is at the beginning of that same arc. The hardware - Qualcomm's Hexagon V79, MediaTek's APU, Google's Tensor TPU - is already capable. In my own single-session measurements, the NPU path decoded Gemma 4 E2B at about 42 tokens per second with a time-to-first-token of around 100 ms. NPU decode here is memory-bandwidth-bound, so a roughly flat token rate is expected rather than surprising. The point is that the hardware is not the bottleneck.

The software infrastructure - scheduling, memory management, multi-tenancy, developer APIs - is where the gap lives. And unlike the GPU transition, which played out over 5+ years of gradual adoption, the NPU transition is going to be compressed. The AI app ecosystem is growing faster than the mobile gaming ecosystem did, which means the contention problems will arrive sooner and hit harder.

Why This Matters Now

I am writing this piece not because these problems are blocking me today. Redacto works. On the NPU it decoded at about 42 tok/s with a time-to-first-token around 100 ms in my one benchmarking session, and it handles document redaction on-device with zero network access.

I am writing it because I can see the wall from here. Every developer building on-device AI apps is making the same rational, individually correct decisions: bundle the model, claim the NPU, request largeHeap, load the weights into memory. And each of those decisions is fine in isolation. The crisis is emergent. It only becomes visible when you multiply by the number of apps on a phone and ask: who arbitrates?

As far as I can tell from where I sit, the answer today is nobody. I did not find an arbiter, a scheduler, a shared runtime, or model deduplication. From the outside, every app looks like an island. If any of these already exist and I missed them, that is genuinely the thing I most want a reader to tell me.

The teams at Google and Qualcomm are building extraordinary hardware and runtime capabilities. LiteRT-LM is a genuine leap forward for on-device LLM inference. The Hexagon V79 is remarkable silicon. But the system-level infrastructure for multi-tenant AI - the layer that makes 15 apps coexist gracefully on one device - is not something I have come across yet. That gap, if it is one, is not any one vendor's fault; it is what happens when several independent, individually sound design choices collide at the system level.

That is the 15-app problem. And the time to start solving it is before we get to 15, not after.

Related in this series of "Edge AI from the Trenches"

Why My LLM Runs 4x Faster on Hardware I Had Never Heard Of - foundational context on the hardware at the center of the contention problem
One Model, Three Chips, Two Files: How LiteRT Delegates Really Work - why separate model files and vendor libraries create the storage multiplication
What I Learned Benchmarking the Same LLM on My Phone's NPU, GPU, and CPU - the memory and performance measurements behind the per-app resource estimates

Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware. Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.

Last updated: July 2026
21st of 23 posts in the "Edge AI from the Trenches" series

Edge AI Skills: From Metal to Model

Jaydeep Shah (JD) — Sun, 12 Jul 2026 16:15:00 +0000

I did not take a traditional path into AI. There was no Kaggle phase, no PhD in machine learning, no journey that started with importing TensorFlow in a Jupyter notebook. My path started with an oscilloscope probe on a PCB trace, reading firmware documentation at 2 AM, and writing code that ran on hardware with no operating system at all.

Today I work on on-device LLM inference: running large language models on mobile phones using LiteRT. And the thing I keep realizing is that almost everything I learned in embedded systems applies directly to this work, in ways that are not obvious unless you have lived in both worlds.

This is about the mental models that transfer, the gaps between disciplines, and why edge AI needs people who understand both the model and the metal.

The Career Arc Nobody Plans

My career does not look like a straight line. It looks like a series of right angles.

Embedded systems and firmware. I started writing code that talked directly to hardware: bare-metal C, no RTOS, no abstraction. You learn a particular discipline when your debugging tool is a logic analyzer because there is no printf. You learn to reason about what the hardware is doing at every cycle because there is nothing between you and the silicon.

My first real taste of that was in my second year of undergrad. For a campus quiz event I built a buzzer system on a Texas Instruments chip: several teams wired in, whoever pressed first fired an interrupt, and that team's number lit up on a seven-segment display. It worked flawlessly on a development board. Then I moved it to a PCB I had laid out myself - a big deal at that stage - and it simply stopped working. I lost a day or two to it before the cause surfaced: the dev board had quietly taken care of some pull-up/pull-down resistors for me, and I had not read the chip's datasheet closely enough to carry them over to my own board. There was no printf to fall back on. I found it by reading the spec line by line and tracing every path on the PCB, and I fixed it before the event - the buzzer ran all three days without a glitch. It was the first thing I had built that felt like a real product.

Android WiFi: the HAL. I moved into Android's WiFi stack, working in the Hardware Abstraction Layer. The HAL is the interface between Linux kernel drivers and the Android framework. In the WiFi case, the framework talks to the vendor chipset HAL for chip and interface management, and to wpa_supplicant (itself exposed through its own separate HAL interface) for authentication and connection management. All of it sits behind a common HIDL/AIDL contract, so the Android framework above does not need to know which Qualcomm, Broadcom, or MediaTek radio is installed. This was my first deep exposure to the pattern of hiding hardware specifics behind a stable API contract - the same pattern I would later recognize in LiteRT's delegates.

Silicon-level AI: FP8 optimization on the Sohu chip. Then I joined a silicon company and worked at the root level on FP8 quantization for the first Sohu chip. FP8, 8-bit floating point, is a data format designed specifically for deep learning inference and training. Unlike INT8 quantization, which maps floating-point values to 8-bit integers with a scale factor, FP8 retains a floating-point representation with a sign bit, exponent bits, and mantissa bits. The two common FP8 variants are E4M3 (4 exponent bits, 3 mantissa bits, better for inference because it has higher precision) and E5M2 (5 exponent bits, 2 mantissa bits, better for training because it has wider dynamic range). Working at the chip level meant understanding how transformer model operations, such as attention, feed-forward layers, and normalization, map onto actual silicon: the data paths, the accumulator widths, the memory bandwidth requirements. I saw a transformer not as a Python abstraction but as a pattern of multiply-accumulate operations flowing through physical hardware.
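The precision-versus-range trade-off between the two variants is easy to see by decoding a few bit patterns by hand. The sketch below assumes the commonly used exponent biases (7 for E4M3, 15 for E5M2) and handles only normal and subnormal values. It ignores the NaN/Inf encodings and says nothing about how any particular chip implements FP8.

```python
def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode one FP8 bit pattern (sign | exponent | mantissa) to a Python float.

    Handles normal and subnormal values only; NaN/Inf encodings are not special-cased.
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    frac = (byte & ((1 << man_bits) - 1)) / (1 << man_bits)
    if exp == 0:                       # subnormal: no implicit leading 1
        return sign * frac * 2.0 ** (1 - bias)
    return sign * (1.0 + frac) * 2.0 ** (exp - bias)

def e4m3(byte: int) -> float:
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7)

def e5m2(byte: int) -> float:
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15)

# Step size just above 1.0: finer for E4M3 (more mantissa bits).
print(e4m3(0x38), e4m3(0x39))   # 1.0 1.125
print(e5m2(0x3C), e5m2(0x3D))   # 1.0 1.25

# Largest finite values: wider range for E5M2 (more exponent bits).
print(e4m3(0x7E), e5m2(0x7B))   # 448.0 57344.0
```

One extra mantissa bit halves the step size near 1.0, while one extra exponent bit multiplies the reachable range by more than a hundred. That is the whole trade-off in two lines of output.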

Robotics. I worked on a robotics team where real-time constraints were not aspirational but existential. If your control loop misses its deadline, the robot does not "respond slowly." It crashes into a wall.

Competitive apps benchmarking (my day job today). Now I work in benchmarking: measuring how apps perform against their competitors, on real hardware, under conditions a customer actually cares about. That means automating performance measurement, defining the benchmarking goals with the customer, and building out entire labs of devices and toolchains to run it all. I have written a lot of the tooling around that ecosystem myself, end-to-end platforms that span everything from selecting the right device and installing the apps to the dashboards a customer opens to see the results. Measuring honestly, and building the infrastructure to measure at all when none exists off the shelf, is the daily work. (If that sounds familiar from earlier in this series, it should - it is the same discipline behind how I benchmarked Redacto.)

On-device AI. With that background, I was able to run LLM inference on mobile phones. Along with my team, Edge Artists, I built Redacto at a hackathon, an on-device PII redaction app running Gemma on Snapdragon hardware (it later won the Qualcomm x Google LiteRT Developer Hackathon). Edge AI is the direction I want to move into full-time next. And through all of it, I keep having the same thought: I have seen this problem before.

Mental Models That Transfer

The reason embedded systems experience translates so well to edge AI is not that the technologies are the same. They are not. It is that the constraints are the same, and constraint-driven thinking is a transferable skill.

Memory Hierarchies

In embedded systems, you think constantly about where data lives. Is it in L1 cache or L2? Is it in SRAM or DRAM? How many cycles does a cache miss cost? You design data structures and access patterns around the memory hierarchy because the performance difference between a cache hit and a main-memory fetch can be 100x.

On-device AI has the exact same problem at a different scale. A 2.59 GB model needs to fit in the memory of a phone that has, say, 8 GB of RAM total, but the operating system, background apps, and the camera service are all competing for that RAM. The model weights need to stream through the compute units in a pattern that keeps the hardware fed. If your model's working set exceeds the NPU's local memory, you are constantly shuttling data back and forth across a memory bus, and throughput collapses.

Same constraint. Different scale. Same way of thinking about it.

Hardware Abstraction Layers

The Android WiFi HAL defines a contract: the framework gets an IWifiChip and calls createStaIface(), and the HAL implementation translates that into whatever vendor-specific sequence the underlying chipset requires. The framework does not know whether it is talking to a Qualcomm QCA6490 or a Broadcom BCM4389. It does not need to.

LiteRT's delegate pattern is the same architecture. The inference engine defines a contract: Prepare() the subgraph, Invoke() it, collect the output. The GPU delegate, the NPU delegate (via QNN or NNAPI), and the XNNPACK CPU delegate each implement this contract differently, translating the same high-level graph operations into hardware-specific instruction sequences. The application code does not know whether the matrix multiplications are running on Adreno shader cores or Hexagon vector units.
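The shape of that contract can be shown in a few lines. This is a toy illustration of the pattern only: the `Delegate` base class, the two implementations, and the two-op "graph" are all hypothetical and bear no relation to LiteRT's real C/C++ delegate API.

```python
from abc import ABC, abstractmethod

class Delegate(ABC):
    """Hypothetical stand-in for a delegate contract: prepare a subgraph, then invoke it."""

    @abstractmethod
    def prepare(self, subgraph: list[str]) -> None: ...

    @abstractmethod
    def invoke(self, x: float) -> float: ...

class CpuDelegate(Delegate):
    def prepare(self, subgraph):
        self.ops = list(subgraph)

    def invoke(self, x):
        for op in self.ops:                 # interpret ops one by one
            x = x * 2 if op == "double" else x + 1
        return x

class FusedDelegate(Delegate):
    """Pretends to be an accelerator: 'compiles' the subgraph into one affine function."""

    def prepare(self, subgraph):
        scale, offset = 1.0, 0.0
        for op in subgraph:
            if op == "double":
                scale, offset = scale * 2, offset * 2
            else:
                offset += 1
        self.scale, self.offset = scale, offset

    def invoke(self, x):
        return self.scale * x + self.offset

def run(delegate: Delegate, subgraph: list[str], x: float) -> float:
    """Caller code: identical no matter which backend sits behind the contract."""
    delegate.prepare(subgraph)
    return delegate.invoke(x)

graph = ["double", "inc", "double"]
print(run(CpuDelegate(), graph, 3.0), run(FusedDelegate(), graph, 3.0))
```

Both backends do very different work in `prepare`, yet `run` cannot tell them apart, which is the same property the WiFi HAL gives the framework.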

When I first read the LiteRT delegate documentation, I did not see a new concept. I saw the HAL pattern with different nouns.

Dispatch Mechanisms

In embedded systems, interrupt dispatch is a core concept. A hardware event fires an interrupt. The interrupt controller routes it to the correct handler based on priority and type. You write vector tables, you manage priorities, you handle preemption.

LiteRT's operation dispatch works the same way. The framework has a graph of operations. Each operation gets dispatched to the appropriate hardware backend: NPU for supported ops, GPU for others, CPU as fallback. The partitioner decides which ops go where, based on what each backend supports and what the performance characteristics are. Operations that the NPU cannot handle fall through to the GPU or CPU, just like interrupts that a specialized handler cannot process get routed to a default handler.
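A greedy version of that fall-through logic fits in a dozen lines. It is a sketch under loud assumptions: the op names and the support table are made up, and a real partitioner also weighs performance and the cost of hopping between backends, which this one ignores.

```python
def partition(ops: list[str], support: dict[str, set[str]],
              preference=("npu", "gpu", "cpu")) -> list[tuple[str, list[str]]]:
    """Assign each op to the first preferred backend that supports it,
    then merge consecutive ops on the same backend into one partition."""
    partitions: list[tuple[str, list[str]]] = []
    for op in ops:
        # Raises StopIteration if no backend supports the op; fine for a sketch.
        backend = next(b for b in preference if op in support[b])
        if partitions and partitions[-1][0] == backend:
            partitions[-1][1].append(op)
        else:
            partitions.append((backend, [op]))
    return partitions

# Made-up op support table; real support depends on the delegate and the chip.
SUPPORT = {
    "npu": {"matmul", "add", "softmax"},
    "gpu": {"matmul", "add", "softmax", "gather"},
    "cpu": {"matmul", "add", "softmax", "gather", "custom_topk"},
}
graph = ["gather", "matmul", "add", "softmax", "matmul", "custom_topk"]
for backend, group in partition(graph, SUPPORT):
    print(backend, group)
```

With this table the graph splits into three partitions: the unsupported first op lands on the GPU, the middle run stays on the NPU, and the custom op drops all the way to the CPU.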

The vocabulary is different. The mechanism is the same.

Real-Time Constraints

Robotics taught me to think in latency budgets. You have X milliseconds to read the sensor, run the algorithm, and command the actuator. If you exceed that budget, the system does not degrade gracefully: it fails.

On-device LLM inference has its own latency budgets. Time-to-first-token (TTFT) determines whether the user perceives the response as instant or laggy. On the Snapdragon 8 Elite running Gemma through our app, NPU TTFT was around 92 ms and GPU TTFT around 366 ms. (These come from a single directional benchmarking session on one Galaxy S25 Ultra, not a rigorous multi-run study, so treat them as ballpark rather than reproducible figures.) That is the difference between "the app feels responsive" and "the user wonders if it crashed." It is not a robotics control loop, but the thinking is identical: you have a time budget, the hardware determines whether you can meet it, and if you cannot, you need a different approach.
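Treating TTFT as a budget can be sketched as a small wrapper around any token stream. The names (`measure_ttft`, `fake_stream`) and the 200 ms budget are illustrative choices of mine, not figures from the benchmark, and the stand-in stream simply sleeps to mimic prefill latency.

```python
import time
from typing import Iterable, Iterator

def measure_ttft(tokens: Iterable[str], budget_ms: float) -> tuple[float, bool, Iterator[str]]:
    """Time the arrival of the first token and report whether it met the budget.

    Returns (ttft_ms, within_budget, iterator that still yields every token).
    """
    it = iter(tokens)
    start = time.perf_counter()
    first = next(it)
    ttft_ms = (time.perf_counter() - start) * 1000

    def replay() -> Iterator[str]:
        yield first
        yield from it

    return ttft_ms, ttft_ms <= budget_ms, replay()

# Stand-in for a model's streaming output: sleeps to mimic prefill latency.
def fake_stream(prefill_s: float):
    time.sleep(prefill_s)
    yield "Hello"
    yield " world"

ttft, ok, stream = measure_ttft(fake_stream(0.05), budget_ms=200)
print(f"TTFT {ttft:.0f} ms, within budget: {ok}, text: {''.join(stream)}")
```

The returned iterator replays the first token, so the caller can measure and still display the full response.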

Debugging Without Visibility

In embedded firmware, you often cannot set a breakpoint. There is no stdout. You have a logic analyzer, maybe a JTAG probe, and you reconstruct what happened from signal traces and register dumps. You develop a skill for reasoning about system behavior from indirect evidence.

On-device AI debugging is remarkably similar. There is no interactive profiler attached to the NPU. You have logcat output, ADB shell commands, and timing measurements. When inference produces wrong results, you cannot step through the model's execution on the NPU the way you would step through code in a debugger. You reason from outputs, from timing anomalies, from the gap between expected and observed behavior. You look at the same kind of indirect evidence and build the same kind of mental models about what the hardware is doing.

What Embedded Engineers See That ML Engineers Miss

I have worked alongside ML engineers who are brilliant at model architecture, training pipelines, and evaluation. But when the model moves to a real device, there are things they tend not to see.

The hardware is not infinite. In the cloud, if you need more memory, you provision a bigger instance. On a phone, 8 GB is 8 GB. If your model plus the OS plus the foreground app exceed available RAM, the OS kills your process. There is no swap file that gracefully handles the overflow. Embedded engineers internalize this constraint from day one: you always know exactly how much memory you have and how much you are using.

Silent failures are worse than crashes. In embedded systems, a hard fault is visible. It triggers an exception handler. You can diagnose it. The dangerous failures are the ones where the system keeps running but produces wrong results: a corrupted sensor reading that looks plausible, a timing error that only manifests under load. On-device AI has the same problem. A model that outputs confidently wrong text is harder to catch than one that throws an exception. Quantization can introduce subtle accuracy degradation that does not trigger any error. The model keeps producing output. It is just slightly wrong.

The deployment environment is hostile. Phones overheat. Thermal throttling reduces clock speeds mid-inference. Background processes compete for resources. The OS can kill your process to reclaim memory. The DSP state may not clean up properly between inference calls. These are not edge cases: they are the normal operating environment. Embedded engineers are used to designing for hostile environments. ML engineers who have only worked in controlled cloud settings sometimes underestimate how many things can go wrong between "the model works on my laptop" and "the model works reliably on a user's phone."

What ML Engineers See That Embedded Engineers Miss

Fair is fair. The transfer is not one-directional. There are things ML engineers understand intuitively that embedded engineers have to learn.

Probabilistic systems behave differently than deterministic ones. Embedded engineers expect deterministic behavior: same input, same output, every time. LLMs are stochastic. The same prompt can produce different outputs. This is not a bug. The randomness (temperature, top-k sampling) is a feature that makes the output more natural. Learning to evaluate a system where "correct" is a distribution, not a single value, requires a genuine shift in thinking.
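
For readers coming from firmware, it can help to see how small the source of that randomness is. The following is a minimal sketch of temperature plus top-k sampling over a made-up four-token vocabulary. It is a textbook illustration, not how any particular runtime implements its sampler, and the logit values are invented.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=3, rng=None):
    """Sample one token index using temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    # Keep only the k highest-scoring token indices.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax weights over the survivors (subtract the max for numerical stability).
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.2, -1.0]  # invented scores, one per token
rng = random.Random(0)
draws = [sample_top_k(logits, temperature=0.8, top_k=3, rng=rng) for _ in range(1000)]
counts = {i: draws.count(i) for i in range(len(logits))}
print(counts)  # token 3 never appears: top_k=3 removed it before sampling
```

Setting `top_k=1` collapses this to a deterministic argmax, which is the closest an LLM gets to the "same input, same output" behavior an embedded engineer expects.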

"Good enough" accuracy is a valid engineering target. In firmware, a CRC check either passes or fails. A packet is either delivered or lost. In ML, you are working with precision, recall, F1 scores, perplexity. A model that gets 94% accuracy might be perfectly acceptable for the use case. An embedded engineer's instinct is to chase the remaining 6%, but in ML, that last 6% might require 10x more compute for diminishing returns. Learning where to stop optimizing is a skill.

The model IS the program. This is the hardest shift. In embedded systems, you write explicit logic: if-else chains, state machines, lookup tables. The program does exactly what you coded. In LLM-based applications, the model's behavior is shaped by training data and prompts, not by explicit rules. Prompt engineering is programming, just in a domain where the "compiler" is a neural network and the "instruction set" is natural language. This requires a fundamentally different approach to system design.

Why Edge AI Needs People at the Intersection

The gap between "model works in Python on a cloud GPU" and "model runs reliably on a phone's NPU" is not primarily an ML problem. It is a hardware-software integration problem.

Getting a transformer model onto an NPU requires understanding quantization (how does FP8 vs INT8 affect this model's accuracy on this specific hardware?), memory layout (how do you tile the attention computation to fit in the NPU's local memory?), dispatch (which operations run on the NPU and which fall back to CPU, and what is the cost of that transition?), and system-level behavior (what happens when the OS throttles the NPU because the phone is warm?).

This is exactly the skill set of people who have worked at hardware-software boundaries. People who have written HALs, debugged firmware, optimized data paths through silicon, and designed systems under real-time constraints.

The edge AI space is growing. Models are getting smaller and more capable. Hardware is getting more specialized. The APIs and tooling are maturing. But the bottleneck is not the models or the hardware. The bottleneck is people who understand both the model and the metal: who can look at a transformer architecture and see the memory access patterns, who can look at an NPU spec sheet and understand what it means for model design.

If you come from embedded systems, from firmware, from silicon, you already have half the picture. The model side can be learned. The hardware intuition cannot be taught in a bootcamp.

And if you come from ML, consider looking down the stack. Understand the hardware your models run on. Learn what a cache miss costs, why memory bandwidth matters, what happens when your model's working set exceeds the accelerator's local memory. The best edge AI engineers I have worked with are the ones who can move fluidly between the model and the metal.

The intersection is where the interesting problems are.

Related in this series of "Edge AI from the Trenches"

Why My LLM Runs 4x Faster on Hardware I Had Never Heard Of: the purpose-built silicon that embedded engineers will recognize as familiar territory
FP32, INT4, and Everything Between - What I Learned About Precision on Mobile: the precision-compression trade-off at the heart of on-device deployment
The Deployment Boundary: Why On-Device AI Breaks After Colab: the concrete deployment failures that require both ML and hardware expertise

Last updated: July 2026
20th of 23 posts in the "Edge AI from the Trenches" series

The Deployment Boundary: Why On-Device AI Breaks After Colab

Jaydeep Shah (JD) — Sun, 12 Jul 2026 14:30:00 +0000

Your model runs perfectly in Colab. Loss converged. Eval looks good. You export to a mobile-friendly format, deploy to a phone, and nothing works.

Not a crash - something worse. A silent failure, a cryptic error from a layer you have never heard of, or behavior subtly different from what you validated in the notebook. You spend three days debugging something that has nothing to do with your model and everything to do with the layers between your Python code and the silicon that runs inference.

I built Redacto, an on-device PII redaction app running Gemma 4 E2B on Snapdragon 8 Elite. Every serious problem we hit came from the gap between "it works in Colab" and "it works on a phone" - and almost none of those problems are documented.

This is the edge AI developer's blind spot. The model is not the hard part. The deployment boundary is.

If you have been following this series, the five failure modes in the next section will be familiar territory - I covered the NPU init failures and the chat template trap in detail in earlier posts. Feel free to skip ahead to "The Deployment Boundary: Where Developers Get Stuck", which is where this post goes somewhere new: naming the structural line between what a developer can do alone and what is locked behind vendor tooling. If you are landing here first, read on - the five failures are the concrete evidence that boundary exists.

The "Last Mile" Breaks in Five Specific Ways

Developers talk about "optimization" and "model size" as the deployment challenges. Those are real but easy. The hard problems are structural - things that break because the on-device runtime is fundamentally different from the Python environment where the model was developed.

Here are five ways our model broke going from Colab to the Samsung Galaxy S25 Ultra. Each is a class of problem, not a one-off bug.

1. Quantization Changes Model Behavior, Not Just Speed

INT4 quantization (4-bit integer weights with FP32 activations) is not a lossless compression. It is a lossy transformation that changes how the model responds to inputs.
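
To see why the transformation is lossy, here is a minimal round-trip sketch using symmetric per-tensor quantization to the signed 4-bit range. This is a deliberate simplification with made-up weights. It is not the dynamic_wi4_afp32 scheme litert_torch applies, and real schemes typically use finer-grained scales.

```python
def quantize_int4(weights):
    """Symmetric per-tensor quantization to signed 4-bit integers in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map 4-bit integers back to approximate floating-point weights."""
    return [v * scale for v in q]


weights = [0.70, 0.31, 0.02, -0.01, -0.46]  # made-up values
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print(q)  # [7, 3, 0, 0, -5]
print([round(r, 2) for r in restored])
```

The two smallest weights both collapse to zero. Nothing raises an error, and the model keeps producing output from weights that are no longer the ones you evaluated.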

In our on-device benchmarks, the standard Gemma 4 E2B model (INT4, 2.59 GB) scored 83.7% on text preservation. Our fine-tuned model (INT4, 4.7 GB, GPU-only) scored 65.9%: it was over-redacting, destroying non-PII context that should have been preserved. (All the benchmark numbers in this post come from a single directional session in May 2026 on one Galaxy S25 Ultra, with no raw logs retained and the device no longer available. Treat them as directional, not as a rigorous multi-run study.)

It is tempting to blame quantization, and quantization does change behavior. But that was not the main story here. The dominant cause was the fine-tune itself: it was an under-resourced run. We trained on only 3,000 of the 400,000 ai4privacy/pii-masking-400k samples, for a single epoch, and the training data taught it to emit generic [REDACTED] and [REDACTED NAME] labels rather than Redacto's structured [CATEGORY_N] format that the benchmark scores against. So a good prompt beat an under-resourced fine-tune. With the full dataset and format-aligned labels, that result could easily flip.

Quantization still matters, and it is a real constraint. When you compress 32 bits of precision into 4 bits, you lose information, and the loss is not uniform across the weight space - it hits edge cases hardest, which are exactly the cases that matter for domain-specific tasks. You must quantize for mobile: Gemma 4 E2B is 5.1B total / 2.3B effective parameters (the E is for effective, via Per-Layer Embeddings), and at FP32 its 5.1B total parameters would need roughly 20 GB for the weights alone, far too large for phone RAM. The lesson is that quantization changes your model's behavior in ways your Colab evaluation will not predict.
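
The size argument is simple arithmetic, sketched below for the 5.1B total parameter count quoted above. It counts raw weight bytes only and assumes every parameter is stored at the same precision, which a real export does not strictly do.

```python
TOTAL_PARAMS = 5.1e9  # Gemma 4 E2B total parameters, as quoted in this post

BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Raw weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


sizes = {name: weight_gb(TOTAL_PARAMS, b) for name, b in BYTES_PER_PARAM.items()}
for name, gb in sizes.items():
    print(f"{name}: {gb:.2f} GB")
# FP32: 20.40 GB
# FP16: 10.20 GB
# INT8: 5.10 GB
# INT4: 2.55 GB
```

The INT4 estimate lands in the same ballpark as the 2.59 GB standard model file, though the file also carries content this estimate ignores, such as quantization scales and metadata.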

2. Dispatch Library Versioning: dlopen Fails Silently

On-device inference on Qualcomm hardware requires a chain of shared libraries. libLiteRtDispatch_Qualcomm.so registers a custom op called DISPATCH_OP via static-init constructors when dlopened at runtime. If the dispatch library version does not match the LiteRT runtime version, dlopen fails silently - no crash, no exception, just a missing op registration.

The symptom: Encountered unresolved custom op: DISPATCH_OP. We spent hours looking at model files and export settings. The actual problem was that the .so files in our jniLibs/ directory came from the device's vendor partition and had a slightly different ABI from the official sample's libraries. Different bytes, same filename. The fix was replacing all six QNN (Qualcomm AI Engine Direct) libraries with the exact QAIRT 2.42 versions from the official sample.

In Colab, you pip install a package and it works. On-device, you are linking native shared objects at runtime, and version mismatches manifest as silent no-ops.
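
Because the failure was "different bytes, same filename", one cheap guard is to hash the libraries you bundle against a known-good reference copy before you ever build the APK. The sketch below is a hypothetical pre-build check, not part of any Qualcomm or LiteRT tooling. The two directories are whatever paths hold your bundled and reference libraries.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatched_libs(app_dir: Path, reference_dir: Path) -> dict:
    """Compare same-named .so files in two directories.

    Returns {filename: status} for each reference library that is missing
    from app_dir or whose bytes differ from the reference copy.
    """
    problems = {}
    for ref in sorted(reference_dir.glob("*.so")):
        candidate = app_dir / ref.name
        if not candidate.exists():
            problems[ref.name] = "missing"
        elif sha256_of(candidate) != sha256_of(ref):
            problems[ref.name] = "different bytes"
    return problems
```

An empty result only proves the files are byte-identical to the reference. It says nothing about whether the reference itself matches the runtime version, so the reference copy still has to come from a source you trust.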

3. DSP Protection Domains: A One-Shot Deal Per Process

The Hexagon NPU is a separate processor with its own firmware, running in a Protection Domain (PD) allocated by the QNN runtime. Once your process acquires a PD and then releases it (by switching to a different backend), you cannot re-acquire it without restarting the entire process.

Switch from NPU to GPU and back to NPU: Failed to find available PD for contextId 5 ... err: 1002. The NPU is gone for that process lifetime.

This is not a software bug. The DSP firmware does not support re-entry after teardown. Our workaround is a cascade: try NPU first on app launch, fall back to GPU or CPU if it fails, never return to NPU within the same session. If the user wants NPU back, they kill the app and relaunch.
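
The shape of that cascade can be sketched in a few lines. This is an illustrative Python model of the control flow only, not Redacto's actual code (which targets Android); the class name and the initializer callables are hypothetical. For simplicity it treats every released backend as unavailable for the rest of the process, which is stricter than the constraint requires for GPU and CPU.

```python
from typing import Callable, Dict, List, Optional


class BackendCascade:
    """Try backends in priority order; never retry one that failed or was released."""

    def __init__(self, initializers: Dict[str, Callable[[], object]], order: List[str]):
        self._initializers = initializers
        self._order = order
        self._burned = set()  # backends unusable for the rest of this process
        self.active: Optional[str] = None
        self.engine: Optional[object] = None

    def start(self) -> str:
        """Initialize the first backend that works, skipping burned ones."""
        for name in self._order:
            if name in self._burned:
                continue
            try:
                self.engine = self._initializers[name]()
                self.active = name
                return name
            except Exception:
                # Initialization failed: mark it and fall through to the next one.
                self._burned.add(name)
        raise RuntimeError("no backend could be initialized")

    def fall_back(self) -> str:
        """Release the active backend for good and move down the cascade."""
        if self.active is not None:
            self._burned.add(self.active)
        self.active = None
        self.engine = None
        return self.start()
```

The important design choice is the `_burned` set: the decision "never return to NPU" is recorded as state, so no later code path can accidentally attempt the re-acquire that the firmware will refuse.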

No amount of clever engineering changes this. It is how the silicon works.

4. Chat Template Parser Subset: Jinja Features That Break On-Device

Gemma 4's chat template on HuggingFace uses map.get() - a standard Jinja2 method that works in every Python environment. LiteRT-LM's on-device template parser is not a full Jinja2 implementation. It supports a subset. map.get() is not in that subset.

The error: Failed to apply template: unknown method: map has no method named get (in template:238).

This is a deployment trap because the failure is completely invisible during training and export. You fine-tune the model, merge the LoRA adapter, call tokenizer.save_pretrained(), and the HuggingFace-native template gets bundled into the exported .litertlm file automatically. Everything works in Python. The template only fails when the on-device parser tries to execute it at runtime on the phone.

Our fix was to create a re-export notebook that swaps the Gemma 4 chat template with the older Gemma 3 template before calling litert_torch.generative.export_hf. This is undocumented. We found it by reading the error message, guessing that the parser was limited, and testing older templates until one worked.
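
A cheaper way to catch this class of problem before it reaches the phone is to lint the template text at export time. The sketch below is a hypothetical check, not a LiteRT-LM feature. The supported subset is not documented, so the only method it flags is `get`, the one named in the error above, and the sample template is invented for illustration rather than taken from Gemma.

```python
import re

# Assumption: `get` is the only method known to be unsupported on-device
# (from the error message above). Extend this set as others turn up.
UNSUPPORTED_METHODS = {"get"}

# Matches ".name(" so that method calls are found but plain attributes are not.
METHOD_CALL = re.compile(r"\.([A-Za-z_]\w*)\s*\(")


def find_unsupported_calls(template: str, unsupported=UNSUPPORTED_METHODS):
    """Return (line_number, method_name) for each flagged method call."""
    hits = []
    for lineno, line in enumerate(template.splitlines(), start=1):
        for match in METHOD_CALL.finditer(line):
            if match.group(1) in unsupported:
                hits.append((lineno, match.group(1)))
    return hits


template = """{% for message in messages %}
{{ message.get('role') }}: {{ message['content'] }}
{% endfor %}"""

print(find_unsupported_calls(template))  # [(2, 'get')]
```

This is a plain text scan, so it also looks at text outside the Jinja delimiters and can produce false positives. It is meant as an early warning in the export notebook, not a substitute for loading the model on the device.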

5. Vision Encoder Assumptions: Text-Only Model Meets Multimodal Runtime

Gemma 4 E2B is a multimodal model - it supports text and images. When you fine-tune it for a text-only task (like PII redaction) and export to .litertlm, the exported file does not include the vision encoder section (TF_LITE_VISION_ENCODER). But the default LiteRT-LM engine configuration expects a vision backend to be present.

The error: NOT_FOUND: TF_LITE_VISION_ENCODER not found in the model.

The fix is to pass visionBackend = null and maxNumImages = null in the engine config when loading a text-only fine-tuned model. But there is a compounding issue: even for the standard multimodal model, setting visionBackend = Backend.GPU() (as the developer guide recommends) crashes on Android 16 with UNIMPLEMENTED: CreateSharedMemoryManager is not implemented. The GPU's OpenGL backend has an unimplemented codepath on Android 16, and it kills the entire engine initialization - even if you are running inference on the NPU and never actually processing an image.

We set all vision and audio sub-backends to CPU across every backend configuration. The cost is zero for text-only workloads, but finding this required reading source-location traces from litert::ml_drift - internal runtime code paths that no application developer should have to debug.

Silicon-Level Reality: What an NPU Actually Is

Most documentation describes the NPU as an "accelerator." This is technically correct and practically useless. Understanding what the NPU physically is explains why deployment problems exist.

An NPU is dedicated matrix multiply silicon. On the Snapdragon 8 Elite, this is the Hexagon V79: a separate processor on the SoC die with its own instruction set, firmware, and memory management. Not a GPU with different drivers - a fundamentally different processor designed for one job: executing neural network operations (matrix multiplication, convolution, activation functions, normalization) at maximum throughput per watt.

The Hexagon V79 supports INT4, INT8, and FP16 precision natively. It is widely reported at around 45 TOPS. In our single-session measurements, it delivered roughly 42 tok/s on Gemma 4 E2B versus about 25 tok/s on the Adreno 830 GPU - close to 1.7x throughput. Time-to-first-token was around 92ms on NPU versus 366ms on GPU - roughly a 4x improvement.
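
The ratios follow directly from the four measured numbers. The short calculation below also combines them into a rough end-to-end estimate for a reply, under the simplifying assumption that decode speed stays constant for the whole reply. The 100-token reply length is an arbitrary example, not something measured.

```python
npu_tok_s, gpu_tok_s = 42.0, 25.0        # decode throughput, tokens per second
npu_ttft_ms, gpu_ttft_ms = 92.0, 366.0   # time-to-first-token, milliseconds

throughput_ratio = npu_tok_s / gpu_tok_s  # how many times faster the NPU decodes
ttft_ratio = gpu_ttft_ms / npu_ttft_ms    # how many times sooner the first token arrives


def reply_ms(ttft_ms: float, tok_s: float, n_tokens: int) -> float:
    """Rough time to finish a reply: first-token wait plus steady decoding."""
    return ttft_ms + n_tokens / tok_s * 1000.0


print(f"throughput: {throughput_ratio:.2f}x, TTFT: {ttft_ratio:.2f}x")
# throughput: 1.68x, TTFT: 3.98x
print(f"100-token reply: NPU {reply_ms(npu_ttft_ms, npu_tok_s, 100):.0f} ms, "
      f"GPU {reply_ms(gpu_ttft_ms, gpu_tok_s, 100):.0f} ms")
# 100-token reply: NPU 2473 ms, GPU 4366 ms
```

For short replies the TTFT gap dominates what the user feels; for long replies the throughput gap does.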

The consequence of dedicated silicon: the NPU has its own instruction set. You cannot run arbitrary TFLite ops on it. The model must be compiled through the QNN SDK into a binary using Hexagon V79-specific operations. This is what DISPATCH_OP is - a single TFLite node wrapping an entire QNN-compiled subgraph. When LiteRT encounters DISPATCH_OP, it hands computation to the Hexagon via the dispatch library. If that library is missing or wrong-versioned, the NPU does not exist to your application.

A model compiled for an earlier Hexagon generation (say V73) will not run on V79; a model built against an earlier QAIRT release may not match the 2.42 runtime libraries. The ops are not portable across chip generations because the silicon is different.

The Deployment Boundary: Where Developers Get Stuck

There is a line in the edge AI stack separating what developers can do independently from what requires vendor access. This is the most important structural reality in the ecosystem, and it is almost never discussed explicitly.

What is public and works:

LiteRT-LM SDK (com.google.ai.edge.litertlm): the Android library for loading and running .litertlm models. Available on Maven Central.
litert_torch export tool: converts HuggingFace models to .litertlm format with quantization. Available via pip.
Pre-compiled models: Google and Qualcomm publish .litertlm files for popular models (Gemma, Llama) compiled for specific chipsets. Available from litert-community repos on HuggingFace.
Sample apps: working reference implementations for GPU and NPU inference. Available on GitHub (google-ai-edge/litert-samples).

What is required for NPU compilation of custom models but not part of the public LiteRT-LM SDK:

AIMET (AI Model Efficiency Toolkit): Qualcomm's quantization tool for preparing models for QNN compilation.
QNN-AOT compiler: converts a quantized model into Hexagon-specific binary ops, producing the DISPATCH_OP subgraphs that run on the NPU.
Chip-specific calibration data and compilation profiles: parameters that map operations to the specific Hexagon version on the target SoC.

Everything before this boundary works. We fine-tuned a model in Colab, exported with litert_torch, and ran it on the phone's GPU. Everything after requires vendor toolchain access. A developer cannot independently compile a fine-tuned model for the NPU.

This is the deployment boundary.

The Fine-Tuned Model NPU Blocker: A Real Case Study

This is not theoretical. It is what happened with Redacto.

We fine-tuned Gemma 4 E2B using QLoRA in Colab (2.85M trainable parameters, 217 seconds on an NVIDIA RTX PRO 6000) and exported it with litert_torch using dynamic_wi4_afp32 quantization. The fine-tuned model runs on the phone's GPU at about 9 tok/s. The standard model runs on the NPU at roughly 42 tok/s. That looks like a large gap, but be careful reading it: it combines two variables at once - a different model and a different processor (GPU versus NPU) - so it is not a clean apples-to-apples comparison.

But our fine-tuned model cannot run on the NPU. The exported .litertlm contains standard TFLite ops, not QNN-compiled DISPATCH_OP subgraphs. To get it onto the NPU, we would need the QNN-AOT (ahead-of-time) compiler - which is not part of the public SDK.

This is not an engineering shortcut we missed. It is a hardware-team integration boundary. The Qualcomm team that compiled the standard model for Hexagon V79 used internal tools we do not have access to. Every developer who fine-tunes will hit this wall if they want NPU performance.

The consequence: fine-tuning buys better accuracy on GPU at the cost of being locked out of the fastest hardware. Even in our under-resourced run, the fine-tuned model improved entity recall by about 13% on both FIELD_SERVICE and TACTICAL - meaningful quality gains, stuck at roughly 9 tok/s because it cannot reach the NPU.

What This Means for the Ecosystem

The edge AI ecosystem has a bottleneck, and it is not model quality. Gemma 4 E2B scores 80.5% overall on our 85-entry PII redaction benchmark with nothing but prompt engineering. The 2B-class (2.3B effective) model is capable of real work on real tasks.

The bottleneck is developer experience.

The toolchain gap will slow adoption more than any model limitation. Consider the developer journey for on-device AI today:

1. Pick a model on HuggingFace. (Easy.)
2. Fine-tune it in Colab for your domain. (Straightforward, well-documented.)
3. Export to .litertlm. (Works, with undocumented gotchas like the chat template trap.)
4. Run on GPU. (Works.)
5. Run on NPU for production performance. (Blocked without vendor toolchain access.)

Steps 1 through 4 can be done in a weekend. Step 5 has no public path. And step 5 is the difference between a demo and a product - because the NPU is architecturally designed for lower power draw per inference operation than the GPU, and mobile apps live and die by battery life.

Developer experience is an engineering problem, not a documentation problem. The issues I described - silent dlopen failures, DSP path ordering, chat template parser subsets, vision encoder config mismatches, Protection Domain one-shot constraints - these are not things that better documentation would fix. They are architectural gaps between the training ecosystem (Python, Colab, HuggingFace) and the deployment ecosystem (C++, Android NDK, Qualcomm QNN, chip-specific firmware). Closing them requires toolchain engineering, not more README files.

The edge AI space needs people who understand both the model and the metal. Most ML engineers have never dlopened a shared library. Most embedded engineers have never fine-tuned a model. The deployment boundary sits at the intersection of these two skill sets. The developers who understand why ADSP_LIBRARY_PATH must be set before any LiteRT library loads, and also understand why INT4 quantization changes entity recall differently across domains - those are the ones who will ship production edge AI.

Right now, that intersection is almost empty. The tools assume you are either an ML engineer who exports models or a platform engineer who integrates vendor SDKs. Nobody is building for the person who does both.

That is changing. On-device AI is shipping in production apps, processing real user data with real privacy requirements. The "last mile" between Colab and the device is where the value is created - and where the work is hardest.

How to Work With the Boundary Today

You cannot dissolve the deployment boundary on your own, but you can stop it from stalling your project. Three moves that follow directly from everything above:

Ship on GPU, treat NPU as a partnership, not a task. The public path takes you through step 4: fine-tune, export, run on the GPU. That is a real, shippable product. NPU compilation of a custom model is not something you will unblock by trying harder - it needs the vendor toolchain (AIMET, the QNN-AOT compiler, chip-specific calibration). Plan for it as a conversation with Qualcomm or Google, on their timeline, not as a sprint task on yours.

Design so the backend is swappable from day one. Every failure in the first half of this post - the silent dlopen, the one-shot Protection Domain, the vision-encoder crash - is a reason your app must degrade gracefully across NPU, GPU, and CPU. Build the cascade first: try the fastest backend, fall back cleanly, never assume any one of them initializes. The app that survives the deployment boundary is the one that never depended on a single backend succeeding.

Staff for the intersection, or grow into it. The problems here are not solved by an ML engineer or a platform engineer working alone; they are solved by someone who can read a litert::ml_drift source trace and reason about how INT4 quantization shifts entity recall. If that person is not on your team, the deployment boundary is where you will lose the most time. It is also the skill set that is about to be worth the most.

The model is not the blind spot. The deployment boundary is.

Related in this series of "Edge AI from the Trenches"

I Opened a .litertlm File. Here Is What Is Actually in There. - the compiled bundle at the end of the pipeline, where all export-time decisions are sealed
One Model, Three Chips, Two Files: How LiteRT Delegates Really Work - the delegate architecture that creates the NPU/GPU/CPU deployment split
What I Learned Getting an NPU to Actually Initialize: Six Silent Failures - the concrete failure modes behind the deployment boundary described here

Sources:

LiteRT-LM overview - the Android runtime for loading and running .litertlm models
litert-torch on PyPI - the export tool that converts HuggingFace models to .litertlm
google-ai-edge/litert-samples - the reference GPU/NPU sample apps and the ABI-matched native libraries
litert-community on HuggingFace - the pre-compiled .litertlm models this post compares against

Test environment:

All technical details and performance numbers come from my own testing of the Redacto project, in a single directional session (May 2026), on the configuration below.

Hardware: Samsung Galaxy S25 Ultra, Snapdragon 8 Elite (SM8750-AC), Hexagon V79 NPU, Adreno 830 GPU, Android 16 (API 36)
Runtime: LiteRT-LM 0.11.0-rc1, QAIRT 2.42 QNN libraries
Model: Gemma 4 E2B (5.1B total / 2.3B effective parameters), INT4 quantized (dynamic_wi4_afp32); standard model file 2.59 GB (GPU/generic), 3.02 GB (NPU)

Last updated: July 2026
19th of 23 posts in the "Edge AI from the Trenches" series

How I Benchmarked an LLM Running Entirely on a Phone (No Cloud, No API)

Jaydeep Shah (JD) — Mon, 06 Jul 2026 06:40:31 +0000

"It works on my test input" is the most dangerous sentence in on-device AI development.

I typed that sentence - or some version of it - a dozen times while building Redacto, our on-device PII redaction app running Gemma 4 E2B on a Samsung Galaxy S25 Ultra. The model would redact a patient name from a clinical note, I would nod, and I would move on. Then I would hand the phone to a teammate, they would type a police report, and the model would redact the suspect description instead of the victim name.

The problem is not the model. The problem is that manual spot-checking is not validation. You are testing a single input against your own expectations, with all the confirmation bias that entails. When you have five domain modes (HIPAA, Financial, Tactical, Journalism, Field Service), three difficulty levels, and two candidate models, you need something systematic. You need a benchmark suite.

This post covers how I built one - from dataset curation to scoring methodology to on-device infrastructure - for a hackathon app running entirely on a phone. No cloud. No API calls. No data leaving the device.

Why Not Use an Existing Framework?

The LLM evaluation space has mature tools. EleutherAI's lm-eval-harness is the community standard for evaluating language models against academic benchmarks like MMLU, HellaSwag, and ARC. Stanford's HELM (Holistic Evaluation of Language Models) provides a multi-metric evaluation framework with standardized scenarios. Google's BIG-bench offers hundreds of tasks for probing specific capabilities.

These frameworks are excellent for what they do. They are also completely wrong for this problem, for three reasons.

First, they assume server-side inference. lm-eval-harness expects to call a model through an API or load it in PyTorch on a GPU server. Redacto's model runs on a Qualcomm Hexagon NPU inside a phone. There is no Python runtime, no HuggingFace tokenizer at evaluation time, no way to hook into the framework's inference loop.

Second, their task formats do not match. Standard benchmarks test general capabilities: multiple choice, short answer, classification. Redacto's task is structured text transformation: take an input containing PII, produce an output with [CATEGORY_N] placeholders in the right positions while preserving all non-PII text. No existing benchmark tests this.

Third, they do not capture what matters. Even if you could somehow run MMLU on-device, a high MMLU score tells you nothing about whether the model will correctly identify that "key is under the mat" is a security risk in a field service context but not in a journalism article. Domain-specific behavior requires domain-specific evaluation.

So I built one from scratch. The entire system - dataset, scorer, runner, UI - came together inside the hackathon and fits in a handful of Kotlin files plus one JSONL asset.

Phase 1: Dataset Curation from ai4privacy

The foundation is data. I needed text samples containing realistic PII, with ground-truth labels identifying exactly which spans are personally identifiable and which category each belongs to.

Source Selection

I used the ai4privacy/pii-masking-400k dataset from HuggingFace, which contains 17,046 English entries with labeled PII spans and masked outputs. Each entry includes the original text, the masked version, and a list of entities with their positions, values, and categories.

This dataset has a critical advantage: it was created from real-world text patterns (job applications, medical forms, financial documents), not synthetically generated. The PII values are synthetic, but the surrounding context is natural. A benchmark needs to test whether the model understands context, not just whether it can regex-match a Social Security Number format.

Quality Filtering

Not all 17,046 entries are usable. I applied five quality filters, reducing the pool to roughly 2,000 candidates:

Length bounds (>80 chars, <600 chars). Entries shorter than 80 characters do not provide enough context for mode assignment. Entries longer than 600 characters exceed what a user would realistically type or photograph on a phone, and they consume excessive inference tokens.
Minimum entity count (3+). Entries with only one or two entities ("John Smith lives here") are trivial. A useful benchmark entry should require the model to identify and track multiple PII elements across a passage.
No synthetic-looking names. The ai4privacy dataset generates some names that read as obviously fake (e.g., "Xylophonia Quartzmaven"). These are easy for any model to flag as names, which defeats the purpose of testing entity detection. I filtered entries where names had character patterns inconsistent with real-world naming conventions.
PII density below 60%. Some entries are almost entirely PII: a list of names and addresses with no surrounding context. These test nothing about context preservation. If 60% or more of the text is PII, the model cannot meaningfully fail at preservation because there is almost nothing to preserve.
Parseable entity annotations. Entries with malformed or overlapping span annotations were excluded (a small number of entries).
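
The five filters above can be expressed as a single predicate. The sketch below is a minimal Python version under stated assumptions: each entry is a dict with a `text` string and an `entities` list of `start`/`end`/`label` dicts (an assumed shape, not the dataset's actual schema), and the synthetic-name heuristic is left as a pluggable function because its exact rule is not spelled out here.

```python
from typing import Callable, List


def spans_are_valid(text: str, entities: List[dict]) -> bool:
    """Spans must be in bounds, non-empty, and non-overlapping."""
    last_end = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        start, end = ent["start"], ent["end"]
        if not (0 <= start < end <= len(text)) or start < last_end:
            return False
        last_end = end
    return True


def pii_density(text: str, entities: List[dict]) -> float:
    """Fraction of characters covered by PII spans (assumes valid spans)."""
    covered = sum(e["end"] - e["start"] for e in entities)
    return covered / len(text) if text else 0.0


def passes_filters(entry: dict,
                   looks_synthetic: Callable[[str], bool] = lambda name: False) -> bool:
    """Apply the five quality filters; looks_synthetic is a placeholder heuristic."""
    text, entities = entry["text"], entry["entities"]
    if not (80 < len(text) < 600):             # length bounds
        return False
    if len(entities) < 3:                      # minimum entity count
        return False
    if not spans_are_valid(text, entities):    # parseable annotations
        return False
    if pii_density(text, entities) >= 0.60:    # PII density below 60%
        return False
    names = [text[e["start"]:e["end"]] for e in entities
             if e["label"] in ("GIVENNAME", "SURNAME")]
    if any(looks_synthetic(n) for n in names):  # synthetic-looking names
        return False
    return True
```

The span check runs before the density check on purpose: density is only meaningful once the spans are known not to overlap.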

Mode Assignment

The ~2,000 quality candidates need to be assigned to one of Redacto's five domain modes. This is not labeled in the source dataset, so I built a two-signal classifier:

Signal 1: Keyword matching. Medical terms (diagnosis, patient, prescription, medication) map to HIPAA. Financial terms (account, routing, balance, transaction) map to FINANCIAL. Security terms (password, gate code, access code, WiFi) map to FIELD_SERVICE. Law enforcement terms (suspect, witness, victim, incident) map to TACTICAL. Source/reporter/official terms map to JOURNALISM.

Signal 2: Label composition. The entity categories present in an entry provide a strong signal. If an entry contains CREDITCARDNUMBER or TAXNUM entities, it is likely FINANCIAL. If it contains MRN (Medical Record Number) entities, it is likely HIPAA.

When both signals agree, assignment is high confidence. When they disagree, keyword matching wins - the text context is more reliable than entity type alone, because a credit card number can appear in a medical billing context.
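
A minimal sketch of that two-signal logic follows, using the keywords and label hints listed above. Several details are assumptions made for illustration: the JOURNALISM keyword list is taken literally as three words, keyword matches are counted as whole words, the mode with the most matches wins, and ties go to whichever mode is listed first.

```python
import re
from typing import List, Optional, Tuple

KEYWORDS = {
    "HIPAA": ["diagnosis", "patient", "prescription", "medication"],
    "FINANCIAL": ["account", "routing", "balance", "transaction"],
    "FIELD_SERVICE": ["password", "gate code", "access code", "wifi"],
    "TACTICAL": ["suspect", "witness", "victim", "incident"],
    "JOURNALISM": ["source", "reporter", "official"],
}

LABEL_HINTS = {"CREDITCARDNUMBER": "FINANCIAL", "TAXNUM": "FINANCIAL", "MRN": "HIPAA"}


def keyword_signal(text: str) -> Optional[str]:
    """Mode with the most whole-word keyword matches, or None if there are none."""
    lowered = text.lower()
    scores = {
        mode: sum(len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
                  for term in terms)
        for mode, terms in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def label_signal(labels: List[str]) -> Optional[str]:
    """Mode implied by the first entity label that carries a hint, if any."""
    for label in labels:
        if label in LABEL_HINTS:
            return LABEL_HINTS[label]
    return None


def assign_mode(text: str, labels: List[str]) -> Tuple[Optional[str], str]:
    """Return (mode, confidence). The keyword signal wins on disagreement."""
    by_keyword, by_label = keyword_signal(text), label_signal(labels)
    if by_keyword and by_label:
        return by_keyword, ("high" if by_keyword == by_label else "low")
    chosen = by_keyword or by_label
    return chosen, ("low" if chosen else "none")


# A credit card number inside a medical billing sentence: keywords win.
print(assign_mode("The patient paid for the prescription by card.", ["CREDITCARDNUMBER"]))
# ('HIPAA', 'low')
```

Returning the confidence alongside the mode makes the disagreements easy to pull out for manual review, which is where a classifier this simple is most likely to be wrong.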

Label Remapping

This is the step most people would skip, and it is the step that matters most.

The ai4privacy dataset uses generic entity categories: GIVENNAME, SURNAME, EMAIL, PHONE, etc. Redacto's domain modes use different category names depending on context. The same person's name is [NAME_1] in a medical record, [CUSTOMER_1] in a field service report, [SOURCE_1] in a journalism context, and [VICTIM_1] in a police report.

The remapping rules per mode:

Source Category	HIPAA	FINANCIAL	FIELD_SERVICE	TACTICAL	JOURNALISM
GIVENNAME+SURNAME	NAME	NAME	CUSTOMER	VICTIM	SOURCE
EMAIL	EMAIL	EMAIL	CONTACT	CONTACT	CONTACT
PHONE	PHONE	PHONE	CONTACT	CONTACT	CONTACT
SSN	SSN	SSN	ID	ID	ID
CREDITCARDNUMBER	ACCOUNT	CARD	ACCOUNT	ID	ID
TAXNUM	ID	TAXID	ID	ID	ID

I also merged adjacent GIVENNAME and SURNAME entities into single FULLNAME entities. The dataset annotates "Jane" and "Smith" as separate spans, but Redacto expects [NAME_1] to cover "Jane Smith" as a unit. Without this merge, the scorer would penalize correct behavior.
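
The table and the name merge can be sketched together. This is a minimal illustration using the mapping from the table above, with some assumptions labeled in the code: entities are `start`/`end`/`label` dicts, a given name and surname are merged only when at most one space separates them, a lone given name or surname is treated like a full name, and every entity gets a fresh index even if the same value repeats.

```python
MODES = ["HIPAA", "FINANCIAL", "FIELD_SERVICE", "TACTICAL", "JOURNALISM"]

# One row per source category, columns in MODES order (from the table above).
_ROWS = {
    "FULLNAME": ["NAME", "NAME", "CUSTOMER", "VICTIM", "SOURCE"],
    "EMAIL": ["EMAIL", "EMAIL", "CONTACT", "CONTACT", "CONTACT"],
    "PHONE": ["PHONE", "PHONE", "CONTACT", "CONTACT", "CONTACT"],
    "SSN": ["SSN", "SSN", "ID", "ID", "ID"],
    "CREDITCARDNUMBER": ["ACCOUNT", "CARD", "ACCOUNT", "ID", "ID"],
    "TAXNUM": ["ID", "TAXID", "ID", "ID", "ID"],
}
REMAP = {source: dict(zip(MODES, row)) for source, row in _ROWS.items()}


def merge_names(text, entities):
    """Merge a GIVENNAME directly followed by a SURNAME into one FULLNAME span."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev["label"] == "GIVENNAME"
                and ent["label"] == "SURNAME"
                and text[prev["end"]:ent["start"]] in ("", " ")):
            merged[-1] = {"start": prev["start"], "end": ent["end"], "label": "FULLNAME"}
        else:
            merged.append(dict(ent))
    return merged


def remap(text, entities, mode):
    """Return (placeholder, surface_text) pairs such as ("[NAME_1]", "Jane Smith")."""
    counters = {}
    out = []
    for ent in merge_names(text, entities):
        # Assumption: a lone GIVENNAME or SURNAME maps the same way as a full name.
        source = "FULLNAME" if ent["label"] in ("GIVENNAME", "SURNAME") else ent["label"]
        category = REMAP[source][mode]  # raises KeyError for unmapped labels
        counters[category] = counters.get(category, 0) + 1
        out.append((f"[{category}_{counters[category]}]", text[ent["start"]:ent["end"]]))
    return out


text = "Jane Smith, jane@example.com"
entities = [
    {"start": 0, "end": 4, "label": "GIVENNAME"},
    {"start": 5, "end": 10, "label": "SURNAME"},
    {"start": 12, "end": 28, "label": "EMAIL"},
]
print(remap(text, entities, "HIPAA"))     # [('[NAME_1]', 'Jane Smith'), ('[EMAIL_1]', 'jane@example.com')]
print(remap(text, entities, "TACTICAL"))  # [('[VICTIM_1]', 'Jane Smith'), ('[CONTACT_1]', 'jane@example.com')]
```

The same input produces different ground truth per mode, which is the whole point of the remapping step: the scorer compares against the labels the mode's prompt is expected to produce.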

Difficulty Classification

Each entry gets a difficulty level based on entity count and complexity:

Easy: 3-4 entities, single category type
Medium: 5-7 entities, mixed categories
Hard: 8+ entities, or entities requiring contextual reasoning (e.g., a name that is also a place name)
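
The thresholds above translate into a few lines of Python. This is a sketch, not the curation script: the `needs_context` flag stands in for the "contextual reasoning" judgment (which I assume is set by hand), entity count decides the bucket when the category mix does not match the stated pattern, and entries with fewer than 3 entities are treated as out of scope.

```python
def classify_difficulty(entity_count, needs_context=False):
    """Bucket an entry by entity count, with a manual override for context-heavy entries."""
    if entity_count >= 8 or needs_context:
        return "hard"
    if entity_count >= 5:
        return "medium"
    if entity_count >= 3:
        return "easy"
    return None  # assumption: fewer than 3 entities is filtered out earlier


for count, ctx in [(3, False), (6, False), (9, False), (4, True)]:
    print(count, ctx, classify_difficulty(count, ctx))
```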

Stratified Sampling

From the ~2,000 candidates, I sampled 12 entries per mode, balanced across difficulty levels. The curation report shows the actual distribution:

HIPAA          : 12 total  (easy=4, medium=4, hard=4)
FINANCIAL      : 12 total  (easy=4, medium=4, hard=4)
FIELD_SERVICE  : 12 total  (easy=5, medium=5, hard=2)
TACTICAL       : 12 total  (easy=7, medium=5, hard=0)
JOURNALISM     : 12 total  (easy=10, medium=2, hard=0)

The imbalance in TACTICAL and JOURNALISM is real: the ai4privacy dataset contains fewer entries that naturally fit these domains. This is a limitation of automated curation and exactly why Phase 2 exists.
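
One plausible way to get this kind of "balanced where possible" distribution is round-robin sampling across difficulty levels, which backfills from the remaining levels when one runs dry. The sketch below is an assumption about the approach, not the actual curation code; the entry fields `mode` and `difficulty` match the schema shown later, and the pool sizes in the example are made up.

```python
import random
from collections import Counter, defaultdict

LEVELS = ("easy", "medium", "hard")


def stratified_sample(entries, per_mode=12, seed=0):
    """Pick per_mode entries for each mode, cycling easy -> medium -> hard."""
    rng = random.Random(seed)
    pools = defaultdict(lambda: {level: [] for level in LEVELS})
    for entry in entries:
        pools[entry["mode"]][entry["difficulty"]].append(entry)

    sample = []
    for by_level in pools.values():
        for items in by_level.values():
            rng.shuffle(items)
        picked = 0
        # Take one entry per level per round until the quota is met
        # or every level is exhausted.
        while picked < per_mode and any(by_level[level] for level in LEVELS):
            for level in LEVELS:
                if picked < per_mode and by_level[level]:
                    sample.append(by_level[level].pop())
                    picked += 1
    return sample


# Hypothetical pool: plenty of easy/medium entries, only two hard ones.
pool = ([{"mode": "FIELD_SERVICE", "difficulty": "easy"}] * 20
        + [{"mode": "FIELD_SERVICE", "difficulty": "medium"}] * 20
        + [{"mode": "FIELD_SERVICE", "difficulty": "hard"}] * 2)
print(Counter(e["difficulty"] for e in stratified_sample(pool)))
```

With only two hard candidates available, the quota is filled from easy and medium, giving a 5/5/2 split like the FIELD_SERVICE row above.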

Phase 2: Hand-Crafted Contextual Entries

Automated curation gives you quantity and coverage. It does not give you the entries that actually break models.

I wrote 25 additional entries by hand - 5 per mode - targeting contextual reasoning challenges that generic PII datasets cannot test. These are the entries that separate a model that pattern-matches from one that understands.

HIPAA (5 entries): Relational identifiers. "The patient's daughter Lisa called to ask about the prescription." Lisa is PII because her relationship to the patient makes her identifiable. The model must understand that "daughter" creates a linkage. I also included entries with depression diagnoses tied to named patients (the diagnosis itself is PHI when linked to a person), and medication dosages that are clinical data in isolation but become PHI when attached to a name.

TACTICAL (5 entries): Selective redaction. "Suspect is a white male, approximately 6'2", wearing a red jacket, driving a blue 2019 Honda Civic with plate ABC-1234. Victim Jane Doe, age 34, was interviewed at the scene." The model must redact "Jane Doe" but preserve every detail of the suspect description: race, height, clothing, vehicle, plate number. This is the opposite of what most PII models are trained to do.

JOURNALISM (5 entries): Role-based entity handling. "Secretary of Defense Lloyd Austin confirmed the deployment. A senior Pentagon official, speaking on condition of anonymity, said the timeline was accelerated." Lloyd Austin must be preserved (public official, on the record). The anonymous source must be protected. The model needs to understand attribution and source protection conventions.

FIELD_SERVICE (5 entries): Natural-language security information. "Gate code is 4521. Key is under the mat by the side door. WiFi password is TechHouse2024. Back door is usually unlocked." These are not structured PII: there is no SSN or credit card. But they are security-critical information that a field service technician's notes should not expose.

FINANCIAL (5 entries): Entity boundary precision. "Transfer $45,000 from account 7823-4561-9900 (routing 021000021) to the client's trust fund." Dollar amounts and institution names must be preserved. Account numbers, routing numbers, and client names must be redacted. The model needs to distinguish between numeric values that are financial identifiers and numeric values that are transaction details.

Final Dataset

60 automated entries + 25 hand-crafted entries = 85 total entries, stored as redacto_bench.jsonl in the app's Android assets directory.

At 5-10 seconds per entry on GPU inference, a full pass of 85 entries across 2 models (170 runs) takes approximately 20 minutes. (These timing figures come from a single directional session on one device; treat them as ballpark, not a rigorous multi-run measurement - see the caveat below.) That is acceptable for a hackathon iteration cycle. I also added a slider (1-85) in the benchmark UI so I could run quick 3-entry sanity checks during development.

The JSONL Schema

Each entry in redacto_bench.jsonl follows this schema:

{
  "id": "hipaa_ctx_001",
  "mode": "HIPAA",
  "difficulty": "hard",
  "input": "The patient's daughter Lisa called to ask about the Metformin prescription for Margaret Chen, DOB 03/15/1952, MRN 4478291.",
  "expected_output": "The patient's daughter [NAME_1] called to ask about the Metformin prescription for [NAME_2], DOB [DATE_1], MRN [MRN_1].",
  "entities": [
    {"value": "Lisa", "category": "NAME", "placeholder": "[NAME_1]"},
    {"value": "Margaret Chen", "category": "NAME", "placeholder": "[NAME_2]"},
    {"value": "03/15/1952", "category": "DATE", "placeholder": "[DATE_1]"},
    {"value": "4478291", "category": "MRN", "placeholder": "[MRN_1]"}
  ],
  "entity_count": 4,
  "source_uid": "handcrafted"
}

Key design choices in this schema:

expected_output is a reference, not an oracle. The scorer does not do exact string matching against this field. It exists for human review and debugging.
entities is the ground truth. Each entity lists its original value, its domain-specific category, and the placeholder it should become. The scorer uses this list.
source_uid distinguishes provenance. Automated entries carry the original ai4privacy UID. Hand-crafted entries are marked "handcrafted". This lets me segment results by provenance and check whether automated entries are too easy.
difficulty enables stratified analysis. I can answer "does the model handle hard HIPAA entries?" without writing a separate query.
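
Because the file is hand-edited in places, a small consistency check per line is cheap insurance. The sketch below is not part of Redacto; it is a hypothetical validator that uses only the fields from the schema above and checks that `entity_count`, `entities`, `input`, and `expected_output` agree with each other.

```python
import json

# The example entry from above, as one JSONL line.
LINE = """{"id": "hipaa_ctx_001", "mode": "HIPAA", "difficulty": "hard",
"input": "The patient's daughter Lisa called to ask about the Metformin prescription for Margaret Chen, DOB 03/15/1952, MRN 4478291.",
"expected_output": "The patient's daughter [NAME_1] called to ask about the Metformin prescription for [NAME_2], DOB [DATE_1], MRN [MRN_1].",
"entities": [
  {"value": "Lisa", "category": "NAME", "placeholder": "[NAME_1]"},
  {"value": "Margaret Chen", "category": "NAME", "placeholder": "[NAME_2]"},
  {"value": "03/15/1952", "category": "DATE", "placeholder": "[DATE_1]"},
  {"value": "4478291", "category": "MRN", "placeholder": "[MRN_1]"}],
"entity_count": 4, "source_uid": "handcrafted"}"""


def validate_entry(entry):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if entry["entity_count"] != len(entry["entities"]):
        problems.append("entity_count does not match the entities list")
    for ent in entry["entities"]:
        if ent["value"] not in entry["input"]:
            problems.append(f"{ent['value']!r} not found in input")
        if ent["placeholder"] not in entry["expected_output"]:
            problems.append(f"{ent['placeholder']} missing from expected_output")
        if ent["value"] in entry["expected_output"]:
            problems.append(f"{ent['value']!r} still present in expected_output")
    return problems


entry = json.loads(LINE)
print(validate_entry(entry) or "entry is consistent")
```

The last check can produce false positives when a PII value also appears as an ordinary preserved word, so treat its output as a prompt for review rather than a hard failure.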

The entity categories used per mode, as reported by the curation pipeline:

HIPAA:         DATE, EMAIL, ID, LOCATION, MRN, NAME, PHONE, SSN
FINANCIAL:     ACCOUNT, ADDRESS, CARD, DATE, EMAIL, ID, NAME, PHONE, SSN, TAXID
FIELD_SERVICE: ADDRESS, CONTACT, CUSTOMER, DATE, ID, SECURE
TACTICAL:      ADDRESS, CONTACT, DATE, ID, VICTIM
JOURNALISM:    CONTACT, DATE, ID, LOCATION, SOURCE

Scoring Methodology: Why Not Exact Match?

The first scorer I wrote used exact string comparison between the model output and expected_output. It failed on the first test.

The model produced the [NAME_1] placeholder exactly where I expected it, but with an extra space before the period. Score: 0%. The redaction was perfect. The formatting was trivially different. Exact match cannot distinguish between "wrong" and "not quite how I would have written it."

This is a known problem in NLP evaluation. BLEU scores for machine translation penalize valid paraphrases. ROUGE scores for summarization reward extractive copying over abstractive understanding. For structured text transformation, what matters is not the exact string: it is whether the PII was removed, the format was followed, and the context was preserved.

Three-Metric Scoring

I use three metrics, weighted into an overall score:

Entity Recall (weight: 50%). For each ground-truth PII entity in the entities list, check whether the original value is absent from the model's output. If the model output does not contain "Lisa" or "Margaret Chen" or "03/15/1952", those entities were successfully redacted regardless of what placeholder the model used. This is intentionally lenient: it rewards redaction in any form, not just the exact placeholder format.

entityRecall = (entities where value is absent from output) / (total entities)

Format Compliance (weight: 25%). Count the bracket placeholders in the model output that match the [CATEGORY_N] pattern (e.g., [NAME_1], [SSN_2], [DATE_3]). Divide by the total number of bracket-enclosed tokens in the output. This distinguishes models that use Redacto's structured format from those that output generic [REDACTED] or [PII REMOVED].

formatScore = (placeholders matching [A-Z_]+_[0-9]+) / (total bracket tokens)

Text Preservation (weight: 25%). Compute the fraction of non-PII words from the input that appear in the output. "The patient's daughter called to ask about the Metformin prescription" should survive redaction intact. A model that strips context while redacting PII is not useful: the whole point is to produce a readable, de-identified document.

preservationScore = (non-PII input words found in output) / (total non-PII input words)

Overall Score:

overall = entityRecall * 0.50 + formatScore * 0.25 + preservationScore * 0.25
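
The three formulas fit in a short Python sketch. This is an illustration of the definitions above, not the app's scorer: the word tokenization, the case-sensitive substring check for recall, and the choice to score format as 0 when the output has no bracket tokens at all are my assumptions.

```python
import re

PLACEHOLDER = re.compile(r"\[[A-Z_]+_[0-9]+\]")   # e.g. [NAME_1], [SSN_2]
BRACKET_TOKEN = re.compile(r"\[[^\[\]]*\]")       # any [bracketed] token


def words(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def entity_recall(output, entities):
    """Fraction of ground-truth values that no longer appear in the output."""
    if not entities:
        return 1.0
    return sum(e["value"] not in output for e in entities) / len(entities)


def format_score(output):
    """Fraction of bracket tokens that follow the [CATEGORY_N] pattern."""
    tokens = BRACKET_TOKEN.findall(output)
    if not tokens:
        return 0.0  # assumption: no placeholders at all earns no format credit
    return sum(bool(PLACEHOLDER.fullmatch(t)) for t in tokens) / len(tokens)


def preservation_score(input_text, output, entities):
    """Fraction of non-PII input words that survive in the output."""
    pii = {w for e in entities for w in words(e["value"])}
    non_pii = [w for w in words(input_text) if w not in pii]
    if not non_pii:
        return 1.0
    out = set(words(output))
    return sum(w in out for w in non_pii) / len(non_pii)


def overall(input_text, output, entities):
    return (entity_recall(output, entities) * 0.50
            + format_score(output) * 0.25
            + preservation_score(input_text, output, entities) * 0.25)


entities = [{"value": "Lisa"}, {"value": "Margaret Chen"},
            {"value": "03/15/1952"}, {"value": "4478291"}]
source = ("The patient's daughter Lisa called to ask about the Metformin "
          "prescription for Margaret Chen, DOB 03/15/1952, MRN 4478291.")
# Correct redaction with a stray space before the final period.
output = ("The patient's daughter [NAME_1] called to ask about the Metformin "
          "prescription for [NAME_2], DOB [DATE_1], MRN [MRN_1] .")
print(overall(source, output, entities))
```

The stray space that sank the exact-match scorer costs nothing here, while a generic [REDACTED] placeholder would lose format credit without affecting recall.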

Why These Weights?

Entity recall gets 50% because it is the primary safety metric. A model that leaks PII has failed its core job regardless of format or preservation. Format compliance and text preservation split the remaining 50% equally because both matter for usability: a perfectly redacted document is useless if it is unreadable (poor preservation) or uses non-standard placeholders that downstream systems cannot parse (poor format compliance).

On-Device Infrastructure

Running 85 benchmark entries on a phone is not the same as running them on a server. Three engineering decisions shaped the infrastructure.

Decision: Text-Only Dataset (D15)

The benchmark dataset is pure text: no images, no OCR. This is deliberate. Redacto supports both text and image input, but the image pipeline adds ML Kit OCR latency that varies with image quality, resolution, lighting, and text density. If I benchmark the full image pipeline, I am measuring OCR performance, not inference performance.

By isolating text input, every millisecond of latency and every scoring delta is attributable to the LLM pipeline. When the fine-tuned model came in roughly 1.9x slower than the standard model in that session, I could attribute it to the model, not the camera. (As with every figure here, that comes from one directional session; see the caveat further down.)

Decision: Steps 1-3 Only, No Validation (D16)

Redacto's production pipeline has four steps: Classify, Detect, Redact, Validate. The benchmark runs only Steps 1-3.

Step 4 (Validate) is a quality gate that reruns Steps 3-4 if it finds missed entities, up to 3 rounds. This is valuable in production but catastrophic for benchmarking. A model that needs 3 validation rounds to redact an entry correctly would show 3x the latency of one that gets it right on the first pass. The latency comparison would conflate model quality with model speed.

For the benchmark, I want to measure: given one pass through the pipeline, how good is the output? The validation step answers a different question (can iterative refinement fix mistakes?) which is worth testing separately.

Decision: ADB-Triggered Benchmarks via BroadcastReceiver (D17)

The benchmark can be triggered programmatically via ADB, without touching the UI:

# Run text benchmark, 5 entries, GPU backend:
adb shell am broadcast \
  -n com.example.redacto/com.example.redacto.benchmark.BenchmarkReceiver \
  -a com.example.redacto.BENCHMARK \
  --es type text \
  --ei count 5 \
  --es backend GPU

# Monitor results in real time:
adb logcat -s RedactoBenchmark:I

This design has a critical architectural detail: the BroadcastReceiver invokes a callback on the running app's ViewModel, which shares the already-initialized inference engine. It does not create a new engine instance.

Why this matters: the Gemma 4 E2B model file is 2.59 GB (the NPU build is 3.02 GB), and a loaded engine's runtime footprint is on that order. Creating a second engine instance for benchmarking would roughly double that (to ~5.18 GB for the generic build), which exceeds available RAM on most devices and causes OOM crashes. By sharing the engine, the benchmark runs against the exact same inference context the user would experience in production: same memory state, same GPU/NPU allocation, same thermal conditions.

The tradeoff is that the app must be open (the callback registration happens in Compose). You cannot run benchmarks against a cold app. For automated CI testing, this would be a limitation. For hackathon iteration, it is fine.

Engine Isolation (v1 Approach)

The initial benchmarking system (v1, in the original Redacto codebase) took a different approach. The BenchmarkRunner created its own InferenceEngine instances, separate from the main app. Each model got a clean init/close cycle, and the main app's engine was untouched.

This worked for comparing two models head-to-head but required enough RAM for two full engine loads. The v2 approach (shared engine via BroadcastReceiver) was adopted when we hit OOM on the fine-tuned model, which is 4.7 GB.

Standard vs Fine-Tuned: An Unfair Fight

With the benchmark suite in place, I ran the first comparison: the standard litert-community Gemma 4 E2B against our QLoRA fine-tuned variant. The results were surprising.

A caveat before the numbers: every figure in the tables below (and the latency and power comparisons that follow) comes from a single directional benchmarking session on one Galaxy S25 Ultra. No raw logs were retained, and the device is no longer available to me, so treat these as one non-reproducible run that points in a direction, not a rigorous multi-run study.

Metric	Standard	Fine-Tuned	Delta
Overall Score	80.5%	70.3%	-10.2%
Entity Recall	79.3%	71.7%	-7.6%
Format Score	79.8%	71.7%	-8.1%
Preservation	83.7%	65.9%	-17.8%

The standard model won overall. But the per-mode entity-recall breakdown told a different story:

Mode	Standard	Fine-Tuned	Winner
FIELD_SERVICE	82.1%	95.3%	Fine-tuned (+13.2%)
FINANCIAL	83.8%	85.5%	Fine-tuned (+1.7%)
HIPAA	95.7%	39.9%	Standard (+55.8%)
JOURNALISM	71.1%	61.3%	Standard (+9.8%)
TACTICAL	63.7%	76.8%	Fine-tuned (+13.1%)

(These are entity-recall scores by mode, the 50%-weighted component of the overall score, not the blended per-mode totals.)
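
As a sanity check, the aggregate table is consistent with the stated weights. Because the overall score is a linear combination, applying the weights to the averaged components should reproduce the averaged overall score, assuming all three averages were taken over the same set of entries (my assumption). A few lines of Python confirm it within rounding:

```python
WEIGHTS = {"recall": 0.50, "format": 0.25, "preservation": 0.25}

# Component averages from the aggregate table above (percent).
standard = {"recall": 79.3, "format": 79.8, "preservation": 83.7}
fine_tuned = {"recall": 71.7, "format": 71.7, "preservation": 65.9}


def weighted_overall(components):
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


print("standard  :", weighted_overall(standard))
print("fine-tuned:", weighted_overall(fine_tuned))
```

Both results land within 0.1 of the reported 80.5% and 70.3%.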

The fine-tuned model came out clearly ahead on FIELD_SERVICE and TACTICAL - the two modes where contextual reasoning matters most. It correctly identified "key is under the mat" as a security risk. It correctly preserved suspect descriptions while redacting victim names.

But it lost badly on HIPAA (-55.8%) and overall preservation (-17.8%). Why?

The Output Format Mismatch

It is worth being honest about how this fine-tune came together, because it frames the whole result: this was an under-resourced attempt. We trained on 3,000 of the 400,000 samples in ai4privacy/pii-masking-400k, for a single epoch, in only a few minutes (about 217 seconds). And it was trained with a generic instruction prompt ("Mask all Personally Identifiable Information"), so it learned to output [REDACTED] or [REDACTED NAME] style placeholders. Redacto's benchmark expects the structured [CATEGORY_N] format ([NAME_1], [SSN_2]).

That mismatch penalizes the fine-tuned model on both Format Compliance (its [REDACTED] placeholders do not match the [CATEGORY_N] pattern) and Entity Recall (some entities are partially redacted with free-form text that still contains fragments of the original value). So the honest read is not "fine-tuning failed" - it is that a good prompt on the standard model beat an under-resourced fine-tune. Both the quantity of training data (3,000 vs 400,000 samples) and its alignment ([REDACTED] vs [CATEGORY_N]) mattered, and with the full dataset and the right output format the result could easily have flipped. The comparison is not measuring model capability: it is measuring how much training investment and format alignment each side had.

What Would Make It Fair

Four changes would make this a valid comparison:

Train on the full dataset. 3,000 of 400,000 samples for one epoch is not a real training run. Using the full corpus (and more than one epoch) is the single biggest thing that could flip the result.
Re-train with Redacto's prompts. Use the exact system prompts and [CATEGORY_N] output format as training targets, rather than the generic [REDACTED] format the fine-tune learned.
Comparable footprint and backend. The fine-tuned model is 4.7 GB versus 2.59 GB for the standard build, and critically it is GPU-only: at that size and quantization it cannot be compiled to the NPU at all. That footprint-and-backend gap, not just quantization granularity, is what drives the 1.9x latency gap and higher power draw (roughly 3x in this one session) - so it is not an apples-to-apples model comparison.
Same chat template. The fine-tuned model originally shipped with a HuggingFace Jinja template that LiteRT-LM could not parse (it used map.get(), which the on-device template parser does not support). This required a re-export with a compatible Gemma 3 template.

The benchmark did exactly what it was supposed to do: it surfaced a training data problem that manual testing would never have caught. I would have looked at three FIELD_SERVICE examples, seen improvement, and shipped the fine-tuned model - not knowing it dragged down HIPAA performance.

What I Would Do Differently

More entries per mode. Seventeen entries per mode (12 automated + 5 hand-crafted) is enough to see trends but not enough for statistical confidence intervals. For production, I would target 50+ per mode.

Automated regression. The current system is manual: I run the benchmark, read the results table, compare to last run. A CI-integrated version would store results in a database and flag regressions automatically.

Latency percentiles. The session doc records average latency, but averages hide outliers. A single 30-second entry in an otherwise fast run skews the mean. P50, P95, and P99 latency would be more informative.

Cross-device testing. All measurements are from a single Samsung Galaxy S25 Ultra. Different Snapdragon variants, MediaTek chips, and Tensor processors will behave differently.
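
To make the latency-percentile point concrete, here is a small Python sketch using the nearest-rank method (one of several percentile definitions; the choice is mine). The latency values are invented for illustration: nineteen 5-second entries and one 30-second outlier.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p * n / 100
    return ordered[max(rank, 1) - 1]


# Hypothetical run: 19 fast entries and a single 30-second outlier (ms).
latencies_ms = [5000] * 19 + [30000]

mean = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean)
for p in (50, 95, 99):
    print(f"P{p}:", percentile(latencies_ms, p))
```

The mean is pulled 25% above the typical entry by one outlier, while P50 stays at the typical value and P99 exposes the outlier directly.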

The Minimum Viable Benchmark

If you are building an on-device LLM app and want to start benchmarking today, here is the minimum:

Curate 30-50 entries that represent your actual use cases, not generic NLP benchmarks. Include edge cases you have seen fail in manual testing.
Define 2-3 metrics that measure what matters for your specific task. For redaction, that is entity recall, format compliance, and preservation. For summarization, it might be factual accuracy, compression ratio, and fluency. For classification, it might be accuracy, calibration, and latency.
Store them as a flat file (JSONL, CSV) in your app's assets. No database, no API, no build-time code generation.
Build a scorer that does not use exact match. Unless your task has exactly one correct output, exact match will generate false failures that erode trust in the benchmark.
Measure on the device, not the emulator. Emulator performance numbers are meaningless for on-device inference. Thermal throttling, memory pressure, and accelerator behavior differ fundamentally.
Log everything. Per-entry latency, token counts, backend (GPU/NPU/CPU), and raw model output. You will need it when debugging a score drop.

The goal is not perfection. The goal is replacing "it works on my test input" with "it works on 85 representative inputs across 5 domains and 3 difficulty levels, with quantified scores on the metrics that matter." That is a different kind of confidence.

Related in this series of "Edge AI from the Trenches"

Prompt Engineering Beat My Fine-Tuned Model. Here Is Why. - the decision framework that drove our standard-vs-fine-tuned comparison
What On-Device AI Benchmarks Actually Feel Like - definitions of the performance metrics this suite measures
Why Benchmarking On-Device AI Turned Into Its Own Engineering Project - the ecosystem-level tooling problems that forced us to build this from scratch

Sources:

ai4privacy/pii-masking-400k - the labeled PII dataset the benchmark was curated from
EleutherAI lm-evaluation-harness - the community-standard LLM eval framework this suite is contrasted against
Stanford HELM - holistic multi-metric evaluation framework
Google BIG-bench - large collaborative benchmark suite

Last updated: July 2026
18th of 23 posts in the "Edge AI from the Trenches" series

What I Learned Benchmarking the Same LLM on My Phone's NPU, GPU, and CPU

Jaydeep Shah (JD) — Mon, 06 Jul 2026 05:51:10 +0000

Most discussions about NPU inference are either vendor press releases or synthetic benchmarks on models nobody ships. I wanted something different: real numbers, from a real app, doing real work, on a phone you can actually buy.

This post breaks down what happened when I ran the same PII-redaction pipeline on the NPU, GPU, and CPU of a Samsung Galaxy S25 Ultra, using Gemma 4 E2B INT4 via Google's LiteRT-LM runtime. The results were not what I expected.

The Setup

Device: Samsung Galaxy S25 Ultra
Chipset: Snapdragon 8 Elite (SM8750-AC)
NPU: Hexagon V79, reached via QNN (Qualcomm AI Engine Direct), which targets the Hexagon Tensor Processor (HTP)
Android: 16 (API 36)
Runtime: LiteRT-LM 0.11.0-rc1
Model: Gemma 4 E2B Instruct INT4 (2.59 GB GPU/generic build, 3.02 GB NPU-compiled build)

The NPU path uses ahead-of-time (AoT) compilation, while the CPU and GPU paths use just-in-time (JIT) partitioning with CPU fallback.

App: Redacto, a privacy-redaction tool that takes sensitive text (medical records, financial documents, police reports) and replaces PII with category-coded placeholders like [NAME_1] or [SSN_1].

Pipeline: Four steps, executed sequentially. The first three are the LLM calls these benchmarks measure:

Step 1: CLASSIFY - determine document type (medical, financial, tactical, etc.)
Step 2: DETECT   - identify all PII entities in the text
Step 3: REDACT   - rewrite the text with placeholders replacing PII
Step 4: VALIDATE - check the redacted output for leaks (not benchmarked here)

Each step creates a fresh conversation (no shared context) and generates free-form text output. This is not a single-token classification task: Steps 2 and 3 produce substantial multi-token output.

Dataset: redacto_bench.jsonl, 30 entries across 5 redaction modes (HIPAA, Financial, Field Service, Tactical, Journalism), average input length 208 characters, all "easy" difficulty. Benchmarks run Steps 1-3 only (no validation pass), triggered via ADB broadcast receiver so the already-initialized engine is reused.

Metrics methodology: Wall-clock latency via System.currentTimeMillis(), TTFT via first onMessage callback timestamp, decode speed computed as (tokenCount - 1) * 1000 / (lastTokenTime - firstTokenTime), peak RSS from /proc/self/status. I only report metrics I can measure reliably: no battery draw estimates, no GPU/NPU utilization percentages, nothing derived from unreliable APIs.
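
The TTFT and decode-speed definitions above can be written out in a few lines. This is a Python sketch of the arithmetic only (the app itself is an Android app, and the function and variable names here are mine). The timestamps in the example are synthetic.

```python
def ttft_ms(request_time_ms, token_times_ms):
    """Time from sending the request to the first token callback."""
    return token_times_ms[0] - request_time_ms


def decode_tok_per_s(token_times_ms):
    """(tokenCount - 1) * 1000 / (lastTokenTime - firstTokenTime).

    The first token is excluded because its latency is dominated by prefill.
    Returns None when fewer than two tokens were produced.
    """
    count = len(token_times_ms)
    span = token_times_ms[-1] - token_times_ms[0] if count else 0
    if count < 2 or span <= 0:
        return None
    return (count - 1) * 1000 / span


# Synthetic trace: first token at 100 ms, then one token every 24 ms.
times = [100 + 24 * i for i in range(11)]
print("TTFT (ms):", ttft_ms(0, times))
print("decode tok/s:", round(decode_tok_per_s(times), 1))
```

Excluding the first token keeps prefill cost out of the decode figure, which is why the two numbers can move independently across backends.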

One caveat before the numbers: everything below comes from a single benchmarking session on 2026-05-01. No raw logs were retained, and that device is no longer available to me. Treat these as directional findings from one run on one phone, not a rigorous multi-run study.

The Headline Numbers

Here is the per-step average across all 30 entries:

Metric	NPU	GPU	NPU vs GPU
Step 1 (Classify) latency	345 ms	773 ms	2.2x faster
Step 2 (Detect) latency	624 ms	1,586 ms	2.5x faster
Step 3 (Redact) latency	4,060 ms	2,475 ms	0.6x (GPU wins)
Step 1 TTFT	99 ms	381 ms	3.8x faster
Step 2 TTFT	104 ms	375 ms	3.6x faster
Step 3 TTFT	92 ms	366 ms	4.0x faster
Decode speed (Step 1)	~42 tok/s	~25 tok/s	~1.7x faster
Decode speed (Step 2)	~42 tok/s	~25 tok/s	~1.7x faster
Decode speed (Step 3)	~42 tok/s	~25 tok/s	~1.7x faster

NPU decode sits at roughly 42 tok/s and stays essentially flat across the three steps. That flat rate is expected here: NPU decode is memory-bandwidth-bound rather than compute-bound, so the per-token throughput does not vary much with the step. NPU dominates TTFT (92-104 ms vs 366-381 ms) and decode throughput (~42 vs ~25 tok/s). If the story ended here, this would be a straightforward "use the NPU" recommendation.

It does not end here.

The Paradox: Faster Per-Token, Slower Overall

Metric	NPU	GPU
Avg total latency	5,062 ms	4,855 ms
Avg total tokens generated	195	91
Avg input length	208 chars	208 chars
Peak RSS	1,934 MB	1,375 MB

Despite being ~1.7x faster per token, the NPU's average total latency is slightly higher than GPU: 5.1 seconds vs 4.9 seconds. And it uses 560 MB more memory.

The root cause is in the token counts. Look at the per-step breakdown:

Step	NPU avg tokens	GPU avg tokens	Ratio
Step 1 (Classify)	10	10	1.0x
Step 2 (Detect)	22	30	0.7x
Step 3 (Redact)	163	51	3.2x

Step 3 is the problem. The NPU generates 3.2x more tokens on average for the redaction step.

Why? The NPU backend uses samplerConfig = null because constrained decoding is unsupported on the Hexagon Tensor Processor (it returns error 12: "not supported"). The GPU uses topK=64, topP=0.95, temperature=1.0, which constrains the output distribution and keeps responses concise. Without any sampling constraints, the NPU's default behavior is verbose: it rambles, adds explanations, repeats content. This is not a vendor failing so much as two independent design choices colliding: the AoT-compiled NPU path and the app's sampling expectations were built against different assumptions about what the runtime supports.

This is a runtime limitation, not a hardware limitation. The Hexagon V79 is doing its job perfectly, decoding tokens at roughly 42 tok/s consistently across all three steps. But because the software stack cannot apply sampling constraints on-NPU, the model generates far more tokens than necessary to complete the task.

Per-Step Latency Flow

Here is how latency accumulates across the three benchmarked pipeline steps, and why the NPU's Step 3 erases its earlier advantage:

The NPU enters Step 3 with a 1,390 ms lead (969 ms vs 2,359 ms cumulative through Steps 1-2). But Step 3's verbosity costs 4,060 ms vs 2,475 ms, turning a 1.4-second advantage into a 200 ms deficit.
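
The same arithmetic, spelled out with the per-step averages from the headline table (a Python sketch; the variable names are mine):

```python
# Per-step average latency in ms, from the headline table.
npu = {"classify": 345, "detect": 624, "redact": 4060}
gpu = {"classify": 773, "detect": 1586, "redact": 2475}

npu_through_detect = npu["classify"] + npu["detect"]
gpu_through_detect = gpu["classify"] + gpu["detect"]
lead_before_redact = gpu_through_detect - npu_through_detect

# Positive means the NPU is ahead, negative means it is behind.
lead_after_redact = sum(gpu.values()) - sum(npu.values())

print("NPU lead entering Step 3 (ms):", lead_before_redact)
print("NPU lead after Step 3 (ms):  ", lead_after_redact)
```

The step sums (5,029 ms and 4,834 ms) come out slightly below the reported average totals (5,062 ms and 4,855 ms); I assume the remainder is per-entry overhead outside the three timed steps.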

The Mode Breakdown: Where NPU Wins and Loses

Not all redaction modes behave the same. Here is the per-mode average total latency:

Mode	Entries	NPU avg	GPU avg	Winner
HIPAA	4	1,977 ms	4,160 ms	NPU by 2.1x
FINANCIAL	4	2,360 ms	4,734 ms	NPU by 2.0x
FIELD_SERVICE	5	2,325 ms	5,284 ms	NPU by 2.3x
JOURNALISM	10	2,348 ms	4,564 ms	NPU by 1.9x
TACTICAL	7	14,201 ms	5,430 ms	GPU by 2.6x

Four out of five modes: NPU wins decisively, often by 2x or more. These modes produce relatively short redaction outputs, so the verbosity penalty stays manageable.

TACTICAL mode is the outlier. NPU averaged 14.2 seconds, about 2.6x slower than GPU. One or more tactical entries triggered extremely long LLM output on the NPU due to unconstrained sampling. This is the worst-case scenario for the samplerConfig = null limitation: when the model decides to be verbose on already-complex inputs (police reports, incident descriptions with many entities), there is no mechanism to rein it in. There is a hard ceiling in play - the deployed build was compiled with a 1024-token KV cache (the app requested maxNumTokens=4000, but the compiled 1024 cache wins) - so verbosity is bounded, but 1024 tokens of unconstrained output is still enough to blow past GPU's total time.

The takeaway: NPU verbosity is not uniformly distributed. It is input-dependent and can spike dramatically on certain content types.
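
To see how much of the headline average one mode accounts for, the per-mode table can be re-aggregated with and without TACTICAL. This is a Python sketch over the rounded per-mode averages above, so the results are approximate.

```python
# (entries, NPU avg ms, GPU avg ms) per mode, from the table above.
MODES = {
    "HIPAA":         (4, 1977, 4160),
    "FINANCIAL":     (4, 2360, 4734),
    "FIELD_SERVICE": (5, 2325, 5284),
    "JOURNALISM":    (10, 2348, 4564),
    "TACTICAL":      (7, 14201, 5430),
}


def weighted_mean(rows, column):
    """Entry-weighted mean of one latency column (1 = NPU, 2 = GPU)."""
    total_entries = sum(row[0] for row in rows)
    return sum(row[0] * row[column] for row in rows) / total_entries


all_rows = list(MODES.values())
without_tactical = [row for mode, row in MODES.items() if mode != "TACTICAL"]

for label, rows in (("all 30 entries", all_rows),
                    ("without TACTICAL", without_tactical)):
    npu_avg, gpu_avg = weighted_mean(rows, 1), weighted_mean(rows, 2)
    print(f"{label}: NPU {npu_avg:.0f} ms, GPU {gpu_avg:.0f} ms")
```

The entry-weighted means reproduce the reported overall averages (5,062 ms and 4,855 ms). Dropping the seven TACTICAL entries leaves the NPU at roughly 2.3 seconds against about 4.7 seconds for the GPU, so the near-tie in the headline number comes almost entirely from that one mode.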

Single-Entry Deep Dive: hipaa_001

To isolate from averaging effects, here is one specific entry, hipaa_001, a 229-character medical text:

Metric	NPU	GPU	NPU vs GPU
Total latency	2,781 ms	5,646 ms	2.0x faster
Step 1 (Classify)	757 ms	955 ms	1.3x faster
Step 2 (Detect)	694 ms	2,258 ms	3.3x faster
Step 3 (Redact)	1,309 ms	2,413 ms	1.8x faster
TTFT (Step 2)	159 ms	424 ms	2.7x faster
Decode tok/s (Step 3)	44.9	24.3	1.8x faster
Total tokens	83	122	NPU fewer
Peak RSS	1,684 MB	1,134 MB	GPU uses 550 MB less

On this particular entry, the NPU wins on every latency and throughput metric, 2.0x faster overall, though it still uses about 550 MB more memory. Notably, the NPU generated fewer total tokens (83 vs 122) on this input. The verbosity problem is statistical, not deterministic. On short medical texts, the unconstrained sampler happens to behave well.

This is what NPU performance looks like when the token count cooperates: sub-3-second end-to-end for a three-step LLM pipeline. That is genuinely impressive for on-device inference, with the usual caveat that this is one entry from one session.

Memory: The Hidden Cost

Metric	NPU	GPU
Peak RSS (30-entry bench)	1,934 MB	1,375 MB
Peak RSS (single entry)	1,684 MB	1,134 MB

The NPU consistently uses ~550-560 MB more memory. Two factors contribute: the NPU-compiled model file is larger (3.02 GB vs 2.59 GB), and the QNN runtime allocates additional buffers for the Hexagon Tensor Processor communication pathway.

On a flagship with 12+ GB RAM, this is manageable. On a mid-range device with 6-8 GB, that extra 560 MB could be the difference between running and OOM-killing.

CPU: The Reliable Fallback

I did not run the full 30-entry benchmark on CPU for this comparison (CPU is the default fallback, not the focus). From the engineering data: CPU inference works on every device, requires no special libraries or DSP path configuration, uses the smallest memory footprint, and runs at roughly 4-6 tok/s, under a quarter of the GPU's ~25 tok/s decode speed. It is the universal compatibility option: slowest, but it always works.

For reference, the previous single-pass pipeline on GPU averaged 12.8 tok/s with the standard model. This pipeline's GPU decode speed of ~25 tok/s reflects both the model upgrade and pipeline architecture changes, so these are not directly comparable.

What This Means for Developers

If you are building an on-device LLM feature and choosing between NPU, GPU, and CPU backends, here is what this data suggests:

Choose NPU when:

Time-to-first-token matters for UX (roughly 100 ms vs 350 ms+ is perceptible)
Your outputs are short (classification, yes/no, short extractions)
You can tolerate occasional verbosity spikes
You are targeting flagship hardware with ample RAM

Choose GPU when:

Output length consistency matters (sampling constraints available)
Total wall-clock time matters more than TTFT
Memory pressure is a concern (550 MB savings)
You need constrained decoding (structured JSON, tool calls)

Choose CPU when:

You need universal device compatibility
NPU/GPU initialization fails (QNN library issues, driver mismatches)
You are building a fallback path (and you should always build a fallback path)

The honest answer is that NPU vs GPU is not a clear winner today. The NPU hardware is genuinely faster: ~1.7x decode speed and 3.6-4.0x TTFT were real advantages in this session. But the software stack has a gap: no sampling control on-NPU means you cannot control output length, and that single limitation can erase the hardware advantage on certain workloads.

If constrained sampling support arrives for the Hexagon Tensor Processor, this calculus changes. A ~42 tok/s NPU with proper top-k/top-p sampling would likely win across the workloads I tested against a ~25 tok/s GPU. That is the future. Today, you need to test on your specific prompts and decide.

Methodology Note

All numbers in this post come from Redacto's benchmark suite, measured in a single session on 2026-05-01 on a Samsung Galaxy S25 Ultra running Android 16. No raw logs were retained from that run, and the device is no longer available to me, so these are directional results from one session, not a reproducible multi-run study. The benchmark runs 30 entries through the first three pipeline steps (Classify, Detect, Redact) with no validation pass, triggered via ADB broadcast receiver. Latency is wall-clock System.currentTimeMillis(), TTFT is first callback timestamp, decode speed excludes prefill, and memory is kernel-reported VmRSS. Full methodology is documented in the project's benchmark-results.md.

This work was part of Redacto, which my team Edge Artists built for the Qualcomm x Google LiteRT Hackathon 2026.

Related in this series of "Edge AI from the Trenches"

Why My LLM Runs 4x Faster on Hardware I Had Never Heard Of: foundational context on the hardware being compared
What On-Device AI Benchmarks Actually Feel Like: definitions and human-perception thresholds for the metrics used here
One Model, Three Chips, Two Files: How LiteRT Delegates Really Work: how LiteRT routes inference to different backends

Last updated: July 2026
17th of 23 posts in the "Edge AI from the Trenches" series

My Fine-Tuned Gemma 4 Loaded Fine, Then Broke on the First Message

Jaydeep Shah (JD) — Mon, 06 Jul 2026 05:48:04 +0000

I fine-tuned Gemma 4 E2B. The adapter merged cleanly. The export to .litertlm completed without errors. I pushed the model to my phone, initialized the engine, and everything looked green. Then I tried to create a conversation and got this:

Failed to apply template: unknown method: map has no method named get (in template:238)

No model loading failure. No quantization error. The model initialized, the tokenizer loaded, and then the runtime choked on a Jinja template feature it does not support. This failure only surfaces when you actually try to run inference, not when you load the model. If you are demoing at a hackathon, this is the worst possible time to discover a compatibility issue.

I hit this exact bug while building Redacto, a zero-trust PII redaction app that runs Gemma 4 E2B entirely on-device. This post walks through the full fine-tune-to-deploy pipeline: how to QLoRA a model on Colab, export it for LiteRT-LM, and avoid the undocumented template trap that will block your deployment.

The Full Pipeline

Here is what the fine-tune-to-deploy pipeline looks like end to end:

HuggingFace base weights
    -> QLoRA fine-tune (Colab)
    -> Merge adapter into base
    -> Patch chat template   <-- the step nobody tells you about
    -> Quantize + export to .litertlm
    -> Push to device

Each stage has its own failure modes. The template patch step is the one that was undocumented at the time, and it is the one that will cost you hours if you do not know it exists.

A note on framing before we dig in: this was an under-resourced fine-tune. I trained on 3,000 of the 400,000 samples in the ai4privacy/pii-masking-400k dataset for a single epoch, and the label format did not fully match what Redacto expected downstream. The point of this post is not the fine-tune's accuracy - it is the deployment mechanics I had to work through to get any fine-tuned model onto the device at all.

Step 1: QLoRA Fine-Tuning on Colab

QLoRA (Quantized Low-Rank Adaptation) lets you fine-tune a quantized model by training small adapter matrices instead of updating all parameters. Gemma 4 E2B is 5.1B total parameters but only 2.3B effective (the E stands for effective, achieved via Per-Layer Embeddings). That small effective footprint is why a single Colab GPU can handle the fine-tune.

Here is the training configuration that worked for us:

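from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    r=8,                  # adapter rank
    lora_alpha=16,        # scaling factor (alpha / r = 2)
    lora_dropout=0.05,    # light regularization on the adapter layers
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    task_type=TaskType.CAUSAL_LM,
)

# base_model is the Gemma 4 E2B model loaded earlier with
# AutoModelForCausalLM.from_pretrained().
model = get_peft_model(base_model, lora_config)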

The key parameters:

Rank 8 - the rank of the low-rank adapter matrices. Lower rank means fewer trainable parameters and faster training, at the cost of expressiveness. Rank 8 is a common starting point for small models.
Alpha 16 - the scaling factor applied to the adapter output. A ratio of alpha/rank = 2 is a common convention. Higher alpha amplifies the adapter's contribution relative to the frozen base weights.
Dropout 0.05 - light regularization on the adapter layers. With only 3,000 training examples, some regularization helps prevent overfitting.
Target modules - we targeted all the attention projection layers (q/k/v/o_proj) plus the MLP layers (gate/up/down_proj). This gives the adapter influence over both how the model attends to context and how it transforms representations. For Gemma 4's architecture, these module names include a .linear suffix in the full path.

With this configuration, the adapter adds 2,850,816 trainable parameters, 0.06% of the model's total. I trained on 3,000 examples from the ai4privacy/pii-masking-400k dataset (out of 400,000 available) for 1 epoch on a Colab NVIDIA RTX PRO 6000. Total training time: 217 seconds. Final training loss: 4.3064. This was deliberately a small, quick pass, not a rigorous training run - quantity and alignment of the training data both matter, and with the full dataset the outcome could have looked very different.
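
The arithmetic behind those numbers is easy to check. A small sketch follows: the rank, alpha, trainable-parameter count, and 5.1B total come from this post, while the 1024 x 1024 layer shape is a hypothetical example, not a real Gemma 4 dimension. The per-layer formula is the standard LoRA count, where one adapter adds an r x d_in matrix and a d_out x r matrix.

```python
RANK = 8
ALPHA = 16
TRAINABLE_PARAMS = 2_850_816   # reported by the training run in this post
TOTAL_PARAMS = 5.1e9           # Gemma 4 E2B total parameters (approximate)


def lora_scaling(alpha: int, rank: int) -> float:
    """Factor by which the adapter output is scaled before being added."""
    return alpha / rank


def lora_params_for_layer(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters one LoRA adapter adds to a d_in -> d_out layer."""
    return rank * (d_in + d_out)


def trainable_percent(trainable: float, total: float) -> float:
    """Share of the model that is actually trained, in percent."""
    return 100.0 * trainable / total


print(lora_scaling(ALPHA, RANK))
# Hypothetical layer shape, for illustration only.
print(lora_params_for_layer(1024, 1024, RANK))
print(round(trainable_percent(TRAINABLE_PARAMS, TOTAL_PARAMS), 2))
```

The last line is why a single Colab GPU is enough: the optimizer only has to track a fraction of a percent of the weights.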

After training, merge the adapter back into the base model:

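from peft import PeftModel

# base_model, tokenizer, and adapter_path come from the training step above.
model = PeftModel.from_pretrained(base_model, adapter_path)
merged_model = model.merge_and_unload()  # fold the adapter weights into the base
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")  # also writes the chat template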

That last line, tokenizer.save_pretrained(), is where the trap is planted. More on that in a moment.

Step 2: Export to .litertlm

Once you have a merged model directory, export it using LiteRT's conversion tool:

python -m litert_torch.generative.export_hf \
  --hf_model_dir ./merged_model \
  --quantize dynamic_wi4_afp32 \
  -p 256 \
  --cache_length 1024 \
  --externalize_embedder \
  --output gemma4_ft.litertlm

The flags:

dynamic_wi4_afp32 - INT4 weight quantization with FP32 activations. This is the standard quantization scheme for LiteRT-LM models. It compresses the model aggressively while keeping activation precision high enough for quality inference.
-p 256 - prefill chunk size. Controls how many tokens are processed in parallel during the initial prompt encoding.
--cache_length 1024 - deployed KV-cache size. This sets the maximum number of in-flight conversation tokens this build can hold, and it is distinct from the model's trained context length. Gemma 4 E2B is trained for a 128K context, but I compiled this build with a 1024-token cache to keep memory usage in check. Do not confuse the deployed cache size with the model's context window.
--externalize_embedder - separates the embedding table from the main model bundle for memory efficiency during loading.

This export completes without errors. The .litertlm file is generated. You push it to the device:

adb push gemma4_ft.litertlm \
  /sdcard/Android/data/com.example.starterhack/files/gemma4_ft.litertlm

The engine initializes. No crash. You think you are done. You are not.

One more thing worth flagging about this export: the resulting fine-tuned .litertlm came out to about 4.7 GB and was GPU-only. It could not be compiled down to the NPU path, so this build never got the NPU acceleration that the pre-compiled community models get. That is a real constraint of the fine-tune-and-export route as it stands today, separate from the template trap below.

Step 3: The Chat Template Trap

When you call tokenizer.save_pretrained() during the adapter merge step, it writes out the tokenizer files, including tokenizer_config.json and a standalone chat_template.jinja file. These files contain the chat template: the Jinja2 template that formats user/assistant messages into the token sequence the model expects.

The problem: HuggingFace's Gemma 4 repository (google/gemma-4-E2B-it) ships a chat template that uses map.get(): it calls .get() on a dictionary to read a value with a default fallback. This is perfectly valid in Python's Jinja2, which lets templates call methods on the underlying Python objects. HuggingFace's transformers library handles it fine.

LiteRT-LM's on-device Jinja parser does not support it.

The .litertlm export tool reads the chat template from your model directory and embeds it into the bundle. When the LiteRT-LM runtime on Android tries to apply the template to format a conversation, it fails:

Failed to apply template: unknown method: map has no method named get (in template:238)

This error has three properties that make it particularly nasty:

It only appears when you create a conversation, not when you load the model. Engine initialization succeeds. The model loads into GPU memory. The tokenizer initializes. Everything looks green until you actually try to send a message.
There is no warning during export. The export_hf tool does not validate whether the embedded template is compatible with the on-device parser. It bundles whatever template it finds in the model directory.
It was not documented at the time. During the hackathon, I could not find any published documentation listing which Jinja2 features the LiteRT-LM template parser supports, so I discovered the boundaries by hitting them. I later compared notes with a Google engineer, who had independently written down the exact same finding in his own notes. That was reassuring, but it also told me this was tribal knowledge rather than something you could look up.
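
Since the export tool does not warn you, one cheap safeguard is to scan the merged model directory yourself before exporting. The sketch below is a plain text search, not a real Jinja parser: it only looks for the one construct this post confirmed as failing (.get( calls), and because the supported subset was undocumented, a clean result does not prove the template is compatible. The function names and the sample template line are hypothetical.

```python
import json
import re
from pathlib import Path

# Only the construct this post saw fail on-device. Extend as you find more.
SUSPECT_PATTERNS = {
    "dict .get() call": re.compile(r"\.get\s*\("),
}


def find_suspect_constructs(template: str) -> list[tuple[int, str, str]]:
    """Return (line number, label, line text) for every suspicious line."""
    hits = []
    for line_no, line in enumerate(template.splitlines(), start=1):
        for label, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_no, label, line.strip()))
    return hits


def load_templates(model_dir: str) -> dict[str, str]:
    """Collect chat templates from the two files save_pretrained() writes."""
    root = Path(model_dir)
    templates = {}
    config_path = root / "tokenizer_config.json"
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        template = config.get("chat_template")
        if isinstance(template, str):
            templates[config_path.name] = template
    jinja_path = root / "chat_template.jinja"
    if jinja_path.exists():
        templates[jinja_path.name] = jinja_path.read_text(encoding="utf-8")
    return templates


def check_model_dir(model_dir: str) -> bool:
    """Print findings; return True when nothing suspicious was found."""
    clean = True
    for name, template in load_templates(model_dir).items():
        for line_no, label, text in find_suspect_constructs(template):
            clean = False
            print(f"{name}:{line_no}: {label}: {text}")
    return clean


# Illustrative template fragment, not the actual Gemma 4 template.
sample = "{%- set role = message.get('role', 'user') -%}\n{{ role }}"
print(find_suspect_constructs(sample))
```

Running check_model_dir("./merged_model") right before the export step turns a runtime surprise on the phone into a message on your workstation.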

Step 4: The Fix

The fix is to replace the Gemma 4 chat template with an older, compatible one before export. The Gemma 3 family's chat template (google/gemma-3-1b-it) avoids map.get() and uses only Jinja features that LiteRT-LM's parser supports. This makes sense: the litert-community pre-compiled models were built against Gemma 3-era templates.

Here is the re-export workflow:

import json
import os

from huggingface_hub import hf_hub_download

# 1. Download the compatible template from Gemma 3
compatible_config_path = hf_hub_download(
    repo_id="google/gemma-3-1b-it",
    filename="tokenizer_config.json"
)

with open(compatible_config_path, "r") as f:
    compatible_config = json.load(f)

compatible_template = compatible_config["chat_template"]

# 2. Patch the merged model's tokenizer config
merged_config_path = "./merged_model/tokenizer_config.json"
with open(merged_config_path, "r") as f:
    merged_config = json.load(f)

merged_config["chat_template"] = compatible_template

with open(merged_config_path, "w") as f:
    json.dump(merged_config, f, indent=2)

# 3. Remove the standalone template file (export_hf may prefer it over the config)
jinja_path = "./merged_model/chat_template.jinja"
if os.path.exists(jinja_path):
    os.remove(jinja_path)
    print("Removed standalone chat_template.jinja")

print("Template patched. Ready for re-export.")

After patching, re-run the export:

python -m litert_torch.generative.export_hf \
  --hf_model_dir ./merged_model \
  --quantize dynamic_wi4_afp32 \
  -p 256 \
  --cache_length 1024 \
  --externalize_embedder \
  --output gemma4_ft_fixed.litertlm

The resulting .litertlm file now contains a compatible template. Conversations work.

A note on template choice. What I did under hackathon pressure was borrow the Gemma 3 family's template, because I already knew it was compatible with the on-device parser. When I later compared notes with a Google engineer, he pointed at a cleaner source: the chat template that ships with the LiteRT-LM build of gemma-4-e2b-it (the litert-community packaged model). That template is already known-good for the on-device parser, unlike the raw google/gemma-4-E2B-it HuggingFace template that caused the failure in the first place. It is the same idea I had arrived at (use a template the on-device parser is known to accept), but it stays within the Gemma 4 family instead of borrowing from Gemma 3.

Trap 2: The Missing Vision Encoder

Once the template issue is fixed, you may hit a second wall:

LiteRtLmJniException: NOT_FOUND: TF_LITE_VISION_ENCODER not found in the model

The fine-tuned .litertlm was compiled text-only: no vision encoder is included. But if your engine configuration specifies vision support (which it does by default for Gemma 4, a multimodal model), the runtime tries to find the vision encoder section in the bundle and fails.

The fix is to configure the engine for text-only mode. In Kotlin:

fun initializeEngine(modelPath: String, textOnly: Boolean = false) {
    val engineConfig = EngineConfig.builder()
        .setModelPath(modelPath)

    if (textOnly) {
        engineConfig
            .setVisionBackend(null)
            .setAudioBackend(null)
            .setMaxNumImages(null)
    } else {
        engineConfig
            .setVisionBackend(Backend.CPU())
            .setMaxNumImages(1)
    }

    engine = LlmEngine.create(engineConfig.build())
}

Note the setMaxNumImages(null), not setMaxNumImages(0). Setting it to 0 triggers a separate validation error: max number of images must be positive or null. The engine treats 0 as an invalid value, not as "no images." You must pass null to indicate the model does not handle images.

Why This Matters Beyond My Project

The chat template trap is not specific to Redacto or to PII redaction. It affects anyone who fine-tunes a Gemma 4 model (or any model whose HuggingFace template uses Jinja features beyond LiteRT-LM's supported subset) and tries to deploy it on-device. Redacto was built for the Qualcomm x Google LiteRT Developer Hackathon 2026, and this trap was one of the last things standing between a green engine init and a working demo.

The root cause is an impedance mismatch between two ecosystems:

HuggingFace's transformers library uses Python's full Jinja2 implementation. Template authors can use any Jinja2 feature: filters, methods, macros, the full spec.
LiteRT-LM's on-device runtime uses a lightweight Jinja parser written for embedded/mobile use. It supports a subset of Jinja2, but I could not find that subset documented anywhere.

When you fine-tune through the standard HuggingFace workflow (AutoModelForCausalLM.from_pretrained(), train with peft, merge with merge_and_unload(), save with save_pretrained()) you get whatever chat template the upstream model ships. If that template uses an unsupported feature, your export will silently embed a broken template.

The pre-compiled models from litert-community on HuggingFace do not have this problem because they were compiled with compatible templates. The problem only surfaces when you fine-tune and re-export yourself.

This is the kind of issue I could not find answered in forum posts or community discussions. The error message (map has no method named get) did not appear in Google's documentation when I searched. The fix (swap the template with an older version) is unintuitive: you are replacing a component of your Gemma 4 model with one from Gemma 3. In my case it worked because the template logic was functionally equivalent, just expressed without map.get().

Checklist: Fine-Tune to LiteRT-LM Deployment

For anyone following this path, here is the condensed checklist:

Fine-tune with QLoRA on Colab (rank 8, alpha 16 is a solid starting point for a model with Gemma 4 E2B's effective footprint)
Merge the adapter with merge_and_unload()
Save the merged model with save_pretrained()
Patch the chat template: download tokenizer_config.json from google/gemma-3-1b-it, extract its chat_template, overwrite your merged model's template
Delete chat_template.jinja from the merged model directory if it exists
Export with litert_torch.generative.export_hf using dynamic_wi4_afp32
If the fine-tuned model is text-only, set visionBackend = null, audioBackend = null, maxNumImages = null in the engine config
Test by creating a conversation and sending a message: do not assume init success means inference will work

What I Would Like to See

LiteRT-LM is a powerful runtime for on-device LLMs. The NPU performance on Snapdragon 8 Elite is genuinely impressive. But the fine-tuning deployment story has gaps:

Document the supported Jinja2 subset. Even a simple list of supported filters, methods, and control structures would save developers hours. At the time I hit this, the only way to discover the boundaries was to hit them at runtime.
Validate templates during export. The export_hf tool could parse the chat template against the same Jinja parser the runtime uses and flag incompatible features before producing the .litertlm file.
Ship compatible templates with the export toolchain. If the tool detects that the model's template uses unsupported features, it could offer to substitute a known-compatible template for the same model family.

The pre-compiled models from litert-community work out of the box. The moment you fine-tune, you are on your own. That gap is the engineering story of this blog post.

Related in this series of "Edge AI from the Trenches"

The Invisible Layer Between My Prompt and the Model: foundational context on what chat templates are and why they break
I Opened a .litertlm File. Here Is What Is Actually in There.: the compiled bundle where the template gets sealed at export time
Prompt Engineering Beat My Fine-Tuned Model. Here Is Why.: the decision framework for when fine-tuning is worth the deployment complexity

Sources:

QLoRA: Efficient Finetuning of Quantized LLMs - Dettmers et al., 2023, the technique behind the fine-tune
PEFT library documentation - the LoRA adapter tooling used here
litert-community on HuggingFace - the pre-compiled models this post compares against
Redacto - Qualcomm x Google LiteRT Developer Hackathon 2026, team Edge Artists

Last updated: July 2026
16th of 23 posts in the "Edge AI from the Trenches" series

Prompt Engineering on a Small On-Device Model Is a Different Sport

Jaydeep Shah (JD) — Sun, 05 Jul 2026 22:45:00 +0000

I built Redacto with my team in a single day at a hackathon: an on-device PII redaction app that runs a quantized small model entirely on a phone. No cloud calls, no API keys, no data leaving the device. The model is Gemma 4 E2B Instruct (roughly 2.3B effective parameters, 5.1B total, using Per-Layer Embeddings, where the E stands for effective), quantized to INT4 and running inside LiteRT-LM on a Snapdragon 8 Elite.

Before this project, I thought I understood prompt engineering. I had written system prompts for GPT-4 and Claude. I had done few-shot formatting, chain-of-thought reasoning, role injection. All the standard moves.

None of it prepared me for what happens when the model is small, the context window is tight, and the hardware has opinions about how your output gets generated.

This post is about what I learned. Not abstract advice: concrete failures and the fixes that emerged from one directional benchmarking session of 30 entries across five modes on real hardware. That session ran once on a Galaxy S25 Ultra in May 2026. I did not save the raw logs and no longer have the device, so treat every precise number here as directional, not as a rigorous multi-run study.

1. Token budget is a real constraint

When you prompt GPT-4 or Claude, your system prompt lives inside a 128K or 200K context window. A 500-token system prompt is a rounding error. You can be verbose. You can add examples. You can explain edge cases at length.

On device, the picture is very different. Gemma 4 E2B is trained for a 128K context, but that is not what I deployed. The model is compiled with a fixed KV cache of 1024 tokens (and a 256-token prefill chunk), so the deployed working set is 1024 tokens, not 128K. The app requested maxNumTokens=4000, but the compiled 1024-token cache is what actually wins at runtime. That same 500-token system prompt consumes roughly half of the 1024-token cache before the user's input text arrives, and before the model has generated a single output token.

The math is unforgiving. Redacto runs a 4-step pipeline: Classify, Detect, Redact, Validate. Each step gets its own system prompt. Each prompt had to be compressed ruthlessly. Seven category-specific prompt sets, one each for Medical, Financial, Legal, Tactical, Journalism, Field Service, and General, all had to fit within budgets that would be laughable in cloud prompting.
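
To make the budget concrete, here is the subtraction as a small Python sketch. The 1024-token cache and the 500-token system prompt come from this post. The 160-token output reservation is a hypothetical figure for illustration, and the sketch assumes the system prompt, the input, and the output all draw from the same cache.

```python
CACHE_TOKENS = 1024  # deployed KV-cache size of this build


def prompt_share_percent(system_prompt_tokens: int, cache_tokens: int) -> float:
    """Share of the cache the system prompt uses before any input arrives."""
    return 100.0 * system_prompt_tokens / cache_tokens


def tokens_left_for_input(cache_tokens: int,
                          system_prompt_tokens: int,
                          reserved_output_tokens: int) -> int:
    """Tokens available for the user's document after the fixed costs."""
    return cache_tokens - system_prompt_tokens - reserved_output_tokens


print(round(prompt_share_percent(500, CACHE_TOKENS), 1))
# 160 reserved output tokens is a hypothetical figure, not a measurement.
print(tokens_left_for_input(CACHE_TOKENS, 500, 160))
```

Every token trimmed from the system prompt goes straight back into the room left for the document, which is why the prompts below are so terse.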

On a cloud model, you might write:

You are a HIPAA-compliant redaction assistant. Your task is to carefully
analyze the following medical document and identify all Protected Health
Information (PHI) as defined under the HIPAA Privacy Rule, including but
not limited to: patient names, dates of birth, Social Security numbers,
medical record numbers, health plan beneficiary numbers, account numbers,
certificate/license numbers, vehicle identifiers and serial numbers,
device identifiers, web URLs, IP addresses, biometric identifiers,
full-face photographs and comparable images, and any other unique
identifying number, characteristic, or code...

On a small model with a 1024-token deployed cache, that prompt is a luxury you cannot afford. Every word of instruction competes with the input text and the output tokens for the same finite space. The prompt I actually ship looks more like this:

You are a medical document redactor.
Detect these PHI types: names, DOB, SSN, MRN, addresses, phone, fax, email,
account numbers, dates, provider IDs.
Output format: CATEGORY: detected_value (one per line)
Keep: diagnoses, medications, vitals, body locations.

The cloud version teaches. The on-device version commands. Both work, but only one of them fits.

2. Format enforcement vs. format suggestion

Here is something that surprised me early. On cloud models, you can say "please output in JSON format" and get well-formed JSON 95%+ of the time. The model has enough parameters and enough context to track bracket nesting, key-value structure, and quoting rules.

A small quantized model does not have that luxury. In my experience, Gemma 4 E2B running at INT4 precision sometimes struggled with bracket matching. I recall responses where the opening brace appeared but the closing brace was missing, or where keys looked hallucinated, or where the JSON structure would collapse into free text partway through. I did not keep saved samples of these failures, so treat the specifics as recollection rather than logged evidence, but the pattern was consistent enough that I stopped trusting JSON output from the model.

This led to Decision D8 in our engineering log: we abandoned JSON entirely. The output format became line-based, CATEGORY: value, one detection per line. It is less elegant than JSON. It is also dramatically more reliable, because the model only needs to produce a newline and a colon to maintain format compliance, not matched brackets and quoted strings.
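
A line-based format also keeps the parser small and forgiving. The following is a minimal Python sketch of such a parser, not Redacto's actual code (the app itself is an Android app). The function names, the allowed-category set, and the sample output are all hypothetical, and the compliance ratio at the end is one possible definition rather than the scorer behind the benchmark numbers.

```python
import re

# CATEGORY, a colon, then a non-empty value. Anything else is rejected.
LINE_PATTERN = re.compile(r"^\s*([A-Za-z][A-Za-z0-9_ ]*?)\s*:\s*(.+?)\s*$")


def parse_detections(model_output: str, allowed_categories: set[str]):
    """Split model output into (category, value) pairs and rejected lines."""
    detections, rejected = [], []
    for raw in model_output.splitlines():
        if not raw.strip():
            continue  # blank lines are neither detections nor errors
        match = LINE_PATTERN.match(raw)
        if match:
            category = match.group(1).strip().upper()
            if category in allowed_categories:
                detections.append((category, match.group(2)))
                continue
        rejected.append(raw.strip())
    return detections, rejected


def format_compliance(detections: list, rejected: list) -> float:
    """Fraction of non-blank lines that parsed; 0.0 for empty output."""
    total = len(detections) + len(rejected)
    return len(detections) / total if total else 0.0


# Made-up model output, including one line of stray commentary.
sample_output = """NAME: John Smith
Here are the detected values:
SSN: 123-45-6789

mrn: 445566
"""
found, bad = parse_detections(sample_output, {"NAME", "SSN", "MRN"})
print(found, bad, format_compliance(found, bad))
```

One malformed line costs one detection here. With JSON, one missing brace can cost the entire response.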

The benchmark numbers point the same direction. Across the single session, the standard model using carefully engineered line-based prompts scored higher on overall quality than the fine-tuned model (80.5% versus 70.3%), which had learned a different output format during training (a [REDACTED] style that does not match Redacto's [CATEGORY_N] convention). The format-compliance sub-score bears this out: 79.8% for the standard model versus 71.7% for the fine-tune. The format itself is a prompt engineering decision, and it is one of the most consequential decisions you will make on a small model. As with every figure here, these numbers come from one session, so treat them as directional rather than a multi-run result.

The lesson: on cloud models, you suggest a format and the model follows it. On small models, you must design a format the model can reliably produce. The format has to be simpler than you think. If your model cannot count brackets, do not ask it to count brackets.

3. Conversation architecture replaces conversation length

Cloud prompt engineering has a well-known pattern: stuff everything into one long conversation. Give the model the full context. Let it reason across the entire input. The context window is large enough that you can do classification, detection, transformation, and verification in a single call.

On a small model, this pattern is a trap. Redacto's pipeline has four steps, Classify, Detect, Redact, and Validate, and each step creates and destroys its own conversation object (Decision D5). No shared context. No memory between steps. Four completely independent calls. (The benchmark numbers below cover only the first three steps; the measured session ran without the Validate step.)

Why? Because context pollution in a small model is far more damaging than in a large model. If the Detect step and the Validate step share a conversation, the validator has already seen the detection list. It is no longer an independent auditor: it is biased by its own prior work. A model the size of GPT-4 can often reason past this bias. In my experience, Gemma 4 E2B cannot.

This is a shift in thinking. On cloud, you optimize conversation content, what goes into the prompt. On device, you optimize conversation architecture: how many conversations exist, what each one sees, and what is deliberately hidden from each step.

Four fresh conversations cost more total inference time than one long conversation would. But the quality difference is not close. The multi-pass architecture with isolated conversations produces consistently better results because each step does exactly one thing with maximum focus.
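
The shape of that architecture can be sketched in a few lines of Python. This is an illustration, not Redacto's Android code: the generate callable stands in for "open a fresh conversation, send one message, close it", and exactly what each step receives (for example, that Redact sees the detection list while Validate sees only the redacted text) is an assumption of this sketch. All names are hypothetical.

```python
from typing import Callable

# (system_prompt, user_text) -> reply, using a brand-new conversation each time.
Generate = Callable[[str, str], str]


def run_isolated_step(generate: Generate, system_prompt: str, user_text: str) -> str:
    """One step = one conversation. No history goes in, none is kept."""
    return generate(system_prompt, user_text)


def run_pipeline(document: str,
                 classify_prompt: str,
                 prompt_sets: dict[str, dict[str, str]],
                 generate: Generate) -> dict[str, str]:
    # Step 1: Classify sees only the document.
    category = run_isolated_step(generate, classify_prompt, document).strip().upper()
    prompts = prompt_sets.get(category, prompt_sets["GENERAL"])

    # Step 2: Detect sees only the document, under the category's prompt.
    detections = run_isolated_step(generate, prompts["detect"], document)

    # Step 3: Redact is handed the detections explicitly as input text.
    redact_input = f"DETECTIONS:\n{detections}\n\nDOCUMENT:\n{document}"
    redacted = run_isolated_step(generate, prompts["redact"], redact_input)

    # Step 4: Validate sees only the redacted text, never the detection list,
    # so it audits the result instead of re-reading its own prior work.
    verdict = run_isolated_step(generate, prompts["validate"], redacted)

    return {"category": category, "detections": detections,
            "redacted": redacted, "verdict": verdict}
```

The isolation is enforced by the function signature: a step cannot see anything that was not passed to it as text.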

4. The prompt IS the specialization layer

Large language models can generalize. You can tell Claude "keep suspect descriptions but redact victim names" in a general instruction, and it will usually reason through the distinction correctly. The model has enough capacity to understand the intent behind your instruction.

A small model needs you to be explicit. Not "keep relevant medical information" but "Keep: diagnoses, medications, vitals, body locations." Not "redact financial identifiers" but "Detect: account numbers, routing numbers, credit card numbers, SSNs, Tax IDs." The model cannot infer what you mean from a general instruction: you must enumerate.

This is Decision D12: seven category-specific prompt sets, each with explicit preserve lists. The preserve list is not a nice-to-have. It is the mechanism that prevents over-redaction.

The benchmark data points the same way, though I want to be careful about the framing. In this single directional session, the standard model with explicit preserve lists scored higher on text preservation (83.7%) than the fine-tuned model (65.9%). That gap is not evidence that "fine-tuning failed." It is evidence that a good prompt beat an under-resourced fine-tune. The fine-tune was deliberately small: about 3,000 of the 400,000 available ai4privacy/pii-masking-400k samples, one epoch, and about 217 seconds of training. It learned a generic mask-all-PII behavior with a [REDACTED] output format that does not match Redacto's [CATEGORY_N] convention. Both the quantity of data and its alignment to the task were off, so the model over-redacted, removing diagnoses from medical notes and stripping dollar amounts from financial documents, because nothing in its limited training told it what to keep. With the full dataset and an aligned output format, that result could easily flip. There was also a hard deployment constraint: the fine-tuned .litertlm was 4.7 GB and GPU-only, so it could not be compiled to the NPU at all.

Here is what the HIPAA detection prompt's preserve section looks like:

DO NOT flag as PHI:
- Diagnoses and conditions (Type 2 diabetes, hypertension)
- Medications and dosages (Metformin 500mg)
- Vital signs (BP 120/80, HR 72)
- Body locations (left heel, right shoulder)
- Procedures (MRI, CBC, A1C)

And the Tactical (law enforcement) prompt has a completely different preserve list:

DO NOT flag as PII:
- Suspect physical descriptions (race, height, weight, clothing)
- Vehicle descriptions (make, model, color, plate number)
- Officer names and badge numbers
- Crime scene locations and cross streets
- Weapon descriptions

On a cloud model, you might get away with a single general-purpose prompt. On a small on-device model, the prompt is the specialization layer. It compensates for what the small model cannot infer on its own. Each category needs its own carefully curated set of instructions, and the difference between "good enough" and "actually works" is in the preserve lists.
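
One way to keep seven prompt sets manageable is to store each category as data and assemble the prompt text from it. The sketch below does this for the medical category only, using the detect and preserve items quoted in this post. The CategoryPrompt structure and the build_detect_prompt function are hypothetical, and the way the pieces are joined is an illustration rather than the exact prompt Redacto ships.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryPrompt:
    """Hypothetical container for one category's prompt ingredients."""
    role: str
    detect: tuple[str, ...]     # what to flag
    preserve: tuple[str, ...]   # what must NOT be flagged
    label: str = "PII"


MEDICAL = CategoryPrompt(
    role="medical document redactor",
    label="PHI",
    detect=("names", "DOB", "SSN", "MRN", "addresses", "phone", "fax",
            "email", "account numbers", "dates", "provider IDs"),
    preserve=(
        "Diagnoses and conditions (Type 2 diabetes, hypertension)",
        "Medications and dosages (Metformin 500mg)",
        "Vital signs (BP 120/80, HR 72)",
        "Body locations (left heel, right shoulder)",
        "Procedures (MRI, CBC, A1C)",
    ),
)


def build_detect_prompt(spec: CategoryPrompt) -> str:
    """Assemble a terse, command-style detection prompt from a spec."""
    lines = [
        f"You are a {spec.role}.",
        f"Detect these {spec.label} types: " + ", ".join(spec.detect) + ".",
        "Output format: CATEGORY: detected_value (one per line)",
        f"DO NOT flag as {spec.label}:",
    ]
    lines += [f"- {item}" for item in spec.preserve]
    return "\n".join(lines)


print(build_detect_prompt(MEDICAL))
```

Treating the preserve list as a required field makes it hard to add a new category without deciding what it must keep.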

5. Sampling config is intertwined with prompting

This is the one I had not paid enough attention to. In cloud prompt engineering, you rarely think about sampling parameters. Temperature, top-K, top-P: these are set once and forgotten. The model self-regulates output length. Your prompt controls what the model says; the sampling config is background plumbing.

On device, prompting and sampling are inseparable.

Redacto runs on both GPU and NPU backends. The GPU backend uses a constrained sampling configuration: topK=64, topP=0.95, temperature=1.0. The NPU backend does not accept a sampler configuration at all: passing one errors with "not supported, error 12", so it runs with samplerConfig=null, meaning the NPU uses whatever default sampling the runtime provides (Decision D26).
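
For readers who have not looked inside a sampler, the three GPU settings above describe a filter over the next-token distribution. Below is a minimal pure-Python sketch of generic top-k / top-p sampling. It is not LiteRT-LM's implementation, the function names are hypothetical, and the order of operations (temperature, then top-k, then top-p) is one common convention, not something confirmed for this runtime.

```python
import math
import random


def filter_candidates(logits, top_k=64, top_p=0.95, temperature=1.0):
    """Return (token ids, probabilities) that survive top-k and top-p."""
    # Temperature rescales logits; 1.0 leaves the distribution unchanged.
    scaled = [value / temperature for value in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]
    # Top-k: keep only the k most probable token ids.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top-p: keep the shortest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break
    return kept, [probs[i] for i in kept]


def sample_token(logits, top_k=64, top_p=0.95, temperature=1.0, rng=None):
    """Draw one token id from the filtered, renormalized distribution."""
    rng = rng or random.Random()
    kept, weights = filter_candidates(logits, top_k, top_p, temperature)
    # random.choices renormalizes the weights for us.
    return rng.choices(kept, weights=weights, k=1)[0]


# A tiny four-token vocabulary, for illustration only.
print(filter_candidates([2.0, 1.0, 0.0, -1.0], top_k=2, top_p=1.0)[0])
print(sample_token([10.0, 0.0, 0.0, 0.0], top_p=0.5, rng=random.Random(0)))
```

With samplerConfig=null, none of these knobs are in your hands, which is why the same prompt can behave differently across the two backends.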

A caveat before the numbers: everything in this section comes from the single directional session described earlier (Galaxy S25 Ultra, May 2026, no raw logs saved, device no longer available). The figures are consistent enough to reason about, but they are directional, not a rigorous multi-run study.

The consequence is dramatic. On Step 3 (Redact), the GPU backend produces an average of 51 tokens per call. The NPU backend, with the same prompt and the same model architecture, produces an average of 163 tokens, roughly 3.2x more output.

The NPU is faster per token: about 42 tok/s versus about 25 tok/s on GPU. The NPU rate looks suspiciously flat across steps, but that is expected: NPU decode here is memory-bandwidth-bound, so it tends to sit at a roughly constant tokens-per-second regardless of the step. But that speed advantage is entirely consumed by the verbosity. Total wall-clock latency for the pipeline averages about 5,062ms on NPU versus about 4,855ms on GPU, so the NPU ends up roughly 4% slower end to end. The raw speed advantage is more than cancelled out because the model generates so much more text without constrained sampling to rein it in.
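
A back-of-the-envelope calculation shows why. Using only the averages quoted in this section (51 versus 163 output tokens on Step 3, about 25 versus about 42 tok/s), the estimated decode time for the Redact step alone looks like this. These are derived estimates, not measured values: they ignore prefill and the other pipeline steps.

```python
def decode_seconds(output_tokens: float, tok_per_s: float) -> float:
    """Estimated decode time: tokens generated divided by decode rate."""
    return output_tokens / tok_per_s


gpu_step3 = decode_seconds(51, 25.0)    # GPU: fewer tokens, slower rate
npu_step3 = decode_seconds(163, 42.0)   # NPU: more tokens, faster rate

print(round(gpu_step3, 2), round(npu_step3, 2))
print(round(163 / 51, 1), round(42.0 / 25.0, 2))  # token ratio vs speed ratio
```

The NPU generates tokens about 1.7x faster but produces about 3.2x as many, so on this step it is estimated to spend nearly twice as long decoding.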

Tactical mode is the worst case. NPU averaged about 14,201ms per entry on Tactical documents versus GPU's 5,430ms. The unconstrained NPU was generating enormous outputs for complex law enforcement scenarios, sometimes producing narrative explanations and commentary that the GPU's constrained sampling kept in check.

This means your prompt cannot be evaluated in isolation. The same prompt produces different behavior on different hardware backends because the sampling configuration changes what the model is allowed to generate. When I tune a prompt, I have to test it on both GPU and NPU, because a prompt that produces tight, focused output on GPU might produce runaway verbosity on NPU.

6. The meta-lesson

After benchmarking 30 entries across five modes on two hardware backends in a single directional session, testing both a standard and an under-resourced fine-tuned model, and iterating through dozens of prompt revisions, I have arrived at a framing that I think captures the core difference. (Again: one session, no saved logs, directional rather than rigorous.)

On cloud models, prompt engineering is about getting the best answer. You have abundant context, powerful reasoning, and flexible output. You are optimizing for quality. The ceiling is high and the constraints are loose.

On a small on-device model, prompt engineering is about getting a usable answer within severe constraints. You are optimizing against a budget: token budget, context budget, format reliability budget, and hardware capability budget. Every prompt decision is a tradeoff. More instruction means less room for input. Richer format means more failure modes. Longer context means slower inference.

The skill set is different. Cloud prompt engineering rewards creativity, expressiveness, and thorough instruction. On-device prompt engineering rewards compression, discipline, and architectural thinking. You spend less time on what to say to the model and more time on how to structure the system around the model's limitations.

If I had to place it on a spectrum, on-device prompt engineering is closer to embedded systems programming than it is to chatting with ChatGPT. You are working within hard resource constraints, designing around hardware limitations, and making tradeoffs that would be invisible in an unconstrained environment.

What I would tell someone starting out

If you are moving from cloud prompt engineering to on-device work, here is what I wish someone had told me:

Measure format compliance, not just accuracy. Your model can find the right entities and still produce output your parser cannot read. Track format compliance as a separate metric from the start.

Design your output format for the model, not for your code. Your parsing code can handle complexity. Your small model cannot. Make the format as simple as the model needs it to be, then write a parser that handles the rest.

Isolate your conversations. Do not let steps share context. Each call should see only what it needs. The overhead of multiple conversations is less expensive than the quality loss from context pollution.

Write preserve lists, not just detection lists. Telling the model what to redact is half the problem. Telling it what NOT to redact, explicitly, per category, is the other half. Without preserve lists, small models over-redact aggressively.

Test on every backend. A prompt that works on GPU may fail on NPU because the sampling configuration is different. Your prompt and your sampling config are one system, not two.

Budget your tokens like you budget memory in embedded code. Every token in your system prompt is a token that is not available for input or output. Count them. Compress them. Cut the ones that do not earn their keep.
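
The token-budget advice above comes down to a few lines of arithmetic. A minimal sketch, not Redacto's code: the TokenBudget class, its field names, and the example numbers used with it are all hypothetical, and the token counts would come from whatever tokenizer your runtime exposes.

```kotlin
// Hypothetical helper: all names and numbers are illustrative assumptions.
data class TokenBudget(
    val contextWindow: Int,        // total tokens the deployed model can attend to
    val systemPromptTokens: Int,   // fixed cost paid on every call
    val reservedOutputTokens: Int  // room kept free for the model's answer
) {
    // Tokens left for the user's document after the fixed costs are paid.
    val maxInputTokens: Int
        get() = (contextWindow - systemPromptTokens - reservedOutputTokens).coerceAtLeast(0)

    // True when an input of the given size fits without truncation.
    fun fits(inputTokens: Int): Boolean = inputTokens <= maxInputTokens
}
```

With an assumed 4,096-token window, a 600-token system prompt, and 512 tokens reserved for output, TokenBudget(4096, 600, 512).maxInputTokens is 2,984. Every token trimmed from the system prompt is one more token of input on every call.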

The model is small. The deployed cache is tight. The hardware has constraints you did not choose. But within those constraints, there is a craft to this work, and it is a craft worth learning, because on-device AI is where the next decade of applications will be built.

Related in this series of "Edge AI from the Trenches"

What Is a Chat Template and Why Does It Matter?: the formatting layer that sits between your prompt and the model's tokenizer
TTFT, tok/s, and What "Fast" Means: the metrics that prompt length and output format directly affect
Fine-Tuning vs Quantization vs Prompt Engineering: when to stay with prompts and when to escalate to fine-tuning

Last updated: July 2026
15th of 23 posts in the "Edge AI from the Trenches" series

What I Learned Trying to Benchmark On-Device AI Honestly

Jaydeep Shah (JD) — Sun, 05 Jul 2026 20:30:00 +0000

When I started benchmarking Redacto - my on-device PII redaction app running Gemma 4 E2B on a Samsung Galaxy S25 Ultra - I wanted to measure everything. Latency, throughput, memory, battery drain, GPU utilization, NPU load, energy consumption in joules. The full picture.

I ended up measuring five things.

Not because I got lazy. Because I adopted a principle early in the project and refused to violate it: if I can't measure it properly, I don't show it. This post is about what that principle looks like in practice: which metrics survive contact with the Android runtime, which ones don't, and how to build a benchmarking system that tells you the truth instead of telling you what you want to hear.

A note on the numbers in this post. Every precise benchmark figure below comes from a single directional benchmarking session on a Galaxy S25 Ultra in May 2026. I did not save raw logs, and the device is no longer available to me. Treat these numbers as directional evidence of what the metrics reveal, not as a rigorous multi-run study. The point of the post is the measurement methodology, not the specific values.

The honesty problem in mobile AI benchmarking

The cloud benchmarking ecosystem is mature. You load a model, run an eval harness, and get structured results. Tools like EleutherAI's lm-eval-harness, Stanford's HELM, and MLPerf Inference provide standardized frameworks with reproducible methodologies. Profiling tools (PyTorch Profiler, NVIDIA Nsight) give you per-layer timing, memory allocation traces, and GPU utilization down to the individual streaming multiprocessor (SM), the cluster of cores that is the basic compute unit inside an NVIDIA GPU. You can see whether each one is saturated or sitting idle.

None of this maturity carries over cleanly to Android. Some vendor and platform tooling does exist (Perfetto for system tracing, Android GPU Inspector, Qualcomm's Snapdragon Profiler), but there is no single, reproducible, app-level harness the way there is in the cloud. And honestly, inside a hackathon I did not have the time to hunt down, install, and wire up every vendor tool and figure out which one could give me clean per-inference numbers. So my natural instinct kicked in: build what I needed myself, out of the raw primitives I was already comfortable with.

On-device benchmarking is a different discipline. You are working inside a constrained operating system that was designed for user-facing apps, not scientific measurement. The general-purpose APIs that are easy to reach are coarse-grained, system-wide, and often misleading. The measurements I actually needed, per-process power draw, accelerator utilization, structured inference metrics, are either absent from the public SDK or locked behind tethered vendor profilers that do not fit into an automated, on-device run.

MLPerf Mobile is the closest thing to a standard, but it measures single-model inference latency on standardized tasks (image classification, object detection, language models). It does not address application-level pipelines where a single user action triggers three or four sequential LLM calls, each with different prompts, different expected output lengths, and different hardware characteristics. MLPerf gives you component-level numbers. Building an app requires system-level numbers.

So I had to build my own system, and I had to be honest about what it could and could not tell me.

Metrics that work

These are the five metrics I kept in the app. Each one has a clear measurement methodology, a known confidence level, and a direct connection to something the user or developer cares about.

1. Wall-clock latency (HIGH confidence)

The simplest and most reliable metric. Wrap System.currentTimeMillis() around the full engine.infer() call:

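// Assumes engine.infer() blocks until generation completes; if it returns
// early, take the end timestamp in the completion callback instead.
val startMs = System.currentTimeMillis()
engine.infer(prompt, callback)
val latencyMs = System.currentTimeMillis() - startMs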

This captures everything: tokenization, prefill, decode, and any runtime overhead. It is the number the user actually experiences: the time between pressing "Redact" and seeing the result.
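
One caveat on the clock: System.currentTimeMillis() follows the wall clock, which can jump if the system time is adjusted mid-run. A minimal sketch of the same measurement on the monotonic System.nanoTime() clock follows. The timedMs helper is a hypothetical name, not part of Redacto or LiteRT-LM, and it assumes the block it wraps runs synchronously.

```kotlin
// Hypothetical helper: runs a block and returns its result with elapsed milliseconds.
// System.nanoTime() is monotonic, so wall-clock adjustments cannot skew the interval.
inline fun <T> timedMs(block: () -> T): Pair<T, Long> {
    val startNs = System.nanoTime()
    val result = block()
    val elapsedMs = (System.nanoTime() - startNs) / 1_000_000
    return result to elapsedMs
}
```

Usage would look like val (result, latencyMs) = timedMs { engine.infer(prompt, callback) }, under the same assumption that the call blocks until generation finishes.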

Real numbers from 30 easy entries, 3-step pipeline (Classify, Detect, Redact):

Backend	Avg Total Latency	Step 1	Step 2	Step 3
GPU	4,855ms	773ms	1,586ms	2,475ms
NPU	5,062ms	345ms	624ms	4,060ms

The surprise: NPU total latency is slightly worse than GPU despite faster per-token speed. The reason is that the NPU path cannot use constrained decoding (SamplerConfig must be null), so it generates 3.2x as many tokens on Step 3 (163 avg vs GPU's 51). Faster per-token speed multiplied by more tokens comes out to roughly the same wall-clock time. Without measuring both wall-clock latency and per-step token counts, you would never see this.
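
The arithmetic behind that result can be checked with a rough model: step latency is approximately TTFT plus the remaining tokens divided by decode speed. The function below is an illustrative sketch (estimateStepLatencyMs is a hypothetical name), fed with the rounded Step 3 averages reported in this post. It ignores tokenization and runtime overhead, so it lands somewhat under the measured values.

```kotlin
// Rough model: latency ≈ TTFT + (tokens - 1) inter-token intervals at the decode rate.
// Ignores tokenization and runtime overhead, so it underestimates measured latency.
fun estimateStepLatencyMs(ttftMs: Double, tokenCount: Int, decodeTokPerSec: Double): Double {
    require(tokenCount >= 1 && decodeTokPerSec > 0.0)
    val decodeMs = (tokenCount - 1) * 1000.0 / decodeTokPerSec
    return ttftMs + decodeMs
}

// Step 3 averages from the tables in this post (tok/s values are the rounded ~25 and ~42).
val gpuStep3Ms = estimateStepLatencyMs(ttftMs = 366.0, tokenCount = 51, decodeTokPerSec = 25.0)
val npuStep3Ms = estimateStepLatencyMs(ttftMs = 92.0, tokenCount = 163, decodeTokPerSec = 42.0)
```

The model gives about 2,366 ms for GPU and about 3,949 ms for NPU, against measured averages of 2,475 ms and 4,060 ms. The NPU's lead in TTFT and decode speed is more than consumed by the extra 112 tokens.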

2. Time to first token - TTFT (HIGH confidence)

TTFT is the time between starting inference and receiving the first token via the streaming callback. This is a UX metric: it is how long the user stares at a blank screen.

var firstTokenTime: Long? = null
val startMs = System.currentTimeMillis()

engine.infer(prompt) { token ->
    if (firstTokenTime == null) {
        firstTokenTime = System.currentTimeMillis()
    }
    // accumulate token...
}

val ttftMs = (firstTokenTime ?: startMs) - startMs

Real numbers:

Backend	Step 1 TTFT	Step 2 TTFT	Step 3 TTFT
GPU	381ms	375ms	366ms
NPU	99ms	104ms	92ms

NPU TTFT is 3.6-4.0x faster than GPU. This is NPU's clearest advantage: the user sees response text appearing almost instantly (around 100ms feels immediate). GPU's ~370ms is noticeable but not painful. Anything above 500ms starts to feel slow.

TTFT measures the prefill phase: the time the model spends processing the input prompt before generating the first output token. It scales with prompt length, which matters for Redacto because the system prompts are substantial (category-specific prompt sets, each several hundred tokens).

3. Decode throughput - tok/s (HIGH confidence)

How fast tokens arrive after the first one. The formula excludes the prefill phase:

decode_tok_s = (tokenCount - 1) * 1000 / (lastTokenTime - firstTokenTime)

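The - 1 is important: it counts inter-token intervals, not tokens. If you have 10 tokens and measure from first to last, there are 9 intervals. Using the token count N instead of N - 1 inflates the result by 1/(N - 1): about 11% at 10 tokens, about 5% at 22 tokens, and under 1% at 163 tokens. The error is worst on exactly the short outputs where it is easiest to overlook.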
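
The TTFT and decode-throughput bookkeeping can live in one small recorder that the streaming callback feeds. A sketch under assumptions: StreamMetrics and its members are hypothetical names, the clock is injected so the logic can be exercised without a device, and it assumes one callback per token.

```kotlin
// Hypothetical recorder; the clock is injectable so the math can be tested off-device.
class StreamMetrics(private val nowMs: () -> Long = { System.nanoTime() / 1_000_000 }) {
    private var startMs = 0L
    private var firstTokenMs: Long? = null
    private var lastTokenMs = 0L
    var tokenCount = 0
        private set

    // Call immediately before starting inference.
    fun start() { startMs = nowMs() }

    // Call from the streaming callback; assumes one callback per token.
    fun onToken() {
        val t = nowMs()
        if (firstTokenMs == null) firstTokenMs = t
        lastTokenMs = t
        tokenCount++
    }

    // Time to first token, or null if no token ever arrived.
    val ttftMs: Long?
        get() = firstTokenMs?.let { it - startMs }

    // Decode throughput over inter-token intervals; null with fewer than two tokens.
    val decodeTokPerSec: Double?
        get() {
            val first = firstTokenMs ?: return null
            val spanMs = lastTokenMs - first
            if (tokenCount < 2 || spanMs <= 0) return null
            return (tokenCount - 1) * 1000.0 / spanMs
        }
}
```

Returning null instead of 0 when no token arrives is a deliberate choice in this sketch: a run that produced nothing should not be averaged in as an instant response.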

Real numbers (averaged across 30 entries):

Backend	Step 1 tok/s	Step 2 tok/s	Step 3 tok/s
GPU	~25	~25	~25
NPU	~42	~42	~42

NPU is consistently around 1.6-1.7x faster on raw token throughput. GPU decode speed drifts slightly downward across steps while NPU stays essentially flat. The flat NPU rate is expected rather than surprising: LLM decode is typically memory-bandwidth-bound, so once the model weights are streaming through at the memory ceiling, each step decodes at roughly the same rate regardless of how many tokens it produces. I could not confirm that on the device, because accelerator utilization is not exposed to apps. The small GPU drift is harder to explain with confidence. It could be mild thermal throttling on the Adreno GPU during a sustained multi-step run, but I cannot measure GPU temperature or clock frequency via public APIs, so that is an inference from observed behavior, not a measured cause.

4. Output token count (MEDIUM confidence)

Increment a counter each time the onMessage callback fires:

var tokenCount = 0
engine.infer(prompt) { token ->
    tokenCount++
}

Confidence is MEDIUM because I am assuming a 1:1 mapping between callbacks and tokens. The LiteRT-LM streaming API does not guarantee this: a single callback could theoretically deliver multiple tokens, or a partial token. In this session the counts were consistent with expected output lengths, so I trust the number but flag the assumption.

Why token count matters: It exposed the NPU verbosity problem. Without this metric, NPU would look strictly superior (faster TTFT, higher tok/s). Token counts revealed that NPU Step 3 generates 163 tokens on average versus GPU's 51: a 3.2x ratio that explains why NPU total latency is not 1.7x better despite 1.7x faster decoding.

5. Peak RSS memory (HIGH confidence)

Read directly from the kernel:

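import java.io.File

// Returns the current resident set size in kB, or -1 if the line is missing.
fun getCurrentRssKb(): Long {
    val status = File("/proc/self/status").readText()
    val vmRssLine = status.lines().find { it.startsWith("VmRSS:") }
    // Line looks like "VmRSS:    123456 kB"; field 1 is the number.
    return vmRssLine?.trim()?.split("\\s+".toRegex())?.getOrNull(1)?.toLongOrNull() ?: -1
}

/proc/self/status is a kernel-provided pseudo-file. VmRSS (resident set size) is the physical memory the process currently has resident, not its virtual address space. This is reliable because it is what the kernel uses internally for memory accounting and OOM decisions. VmRSS is a point-in-time reading, so a peak figure means sampling it during inference and keeping the maximum.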
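
The same file also carries VmHWM, the kernel's own high-water mark for resident memory, which avoids having to sample VmRSS repeatedly to catch a peak. The sketch below parses either field from the file's text. parseStatusKb and readStatusKb are hypothetical helper names, and this is an alternative way to obtain a peak, not necessarily how the numbers in this post were collected.

```kotlin
import java.io.File

// Extracts a "<key>:   <number> kB" field from /proc/<pid>/status text; null if absent.
fun parseStatusKb(statusText: String, key: String): Long? =
    statusText.lineSequence()
        .firstOrNull { it.startsWith("$key:") }
        ?.substringAfter(':')
        ?.trim()
        ?.split(Regex("\\s+"))
        ?.firstOrNull()
        ?.toLongOrNull()

// Reads the field for the current process (Linux/Android only); null elsewhere.
fun readStatusKb(key: String): Long? {
    val file = File("/proc/self/status")
    return if (file.canRead()) parseStatusKb(file.readText(), key) else null
}
```

readStatusKb("VmHWM") would give the peak since process start and readStatusKb("VmRSS") the current value. Keeping the parser separate from the file read makes it testable on a desktop JVM.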

Real numbers:

Backend	Peak RSS
GPU	1,375 MB
NPU	1,934 MB

NPU uses 559 MB more. The NPU model file itself is larger (3.02 GB vs 2.59 GB, because the NPU build is compiled ahead-of-time through QNN), and the Qualcomm QNN runtime allocates additional buffers for the DSP communication path. This is not a measurement artifact: it is a real constraint. On a device with 12 GB RAM, NPU inference leaves less headroom for the OS, background apps, and your own app's non-inference workload.

Metrics that don't work (and why I removed them)

These are the metrics I wanted to include but couldn't, because the available Android APIs do not produce per-process, per-inference numbers.

Battery current draw: system-wide, not per-process

The obvious API is BatteryManager.BATTERY_PROPERTY_CURRENT_NOW:

val batteryManager = getSystemService(Context.BATTERY_SERVICE) as BatteryManager
val currentMicroAmps = batteryManager.getIntProperty(
    BatteryManager.BATTERY_PROPERTY_CURRENT_NOW
)

This returns the instantaneous current draw of the entire device. Not per-process. Not per-component. The screen, the cellular radio, background services, the OS itself: all included. You cannot attribute any portion of this reading to your model inference.

I tried the naive approach: sample before inference, sample after, compute the delta. The problems are fundamental:

It is instantaneous, not averaged. A single sample before and after tells you the current at two points in time. Inference changes current draw continuously over 2-8 seconds. Two point samples cannot characterize a curve.
System noise dominates. A notification arriving, the display auto-brightness adjusting, or a background sync starting during your inference window can swing the reading by hundreds of milliamps.
Current direction varies by charging state. Some devices report negative values when charging. Some report unsigned values. The sign convention is not standardized across OEMs.

Early benchmark runs reported numbers like "101 mA for standard model, 301 mA for fine-tuned model." These numbers were plausible: the fine-tuned model is larger and should draw more power. But plausible is not the same as correct. A plausible-but-unreliable metric is worse than no metric, because people will cite it.

I removed it.

Energy in joules: derived from garbage

If current draw is unreliable, then energy (which is derived from current draw) is doubly unreliable:

energy_joules = current_amps * voltage * duration_seconds

I used an assumed 3.7V (nominal lithium battery voltage). This is wrong on multiple levels:

Actual battery voltage ranges from ~3.3V (nearly dead) to ~4.4V (fully charged). Using a constant 3.7V introduces up to 19% error before you even consider the current measurement problems.
The current measurement is system-wide (see above), so the energy calculation attributes the entire device's power consumption to inference.
Duration is the only reliable term in the equation. Multiplying one good number by two bad numbers gives you a bad number.
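
To see how much the assumed voltage alone moves the result, here is the same formula evaluated across the battery's voltage range. It is a worked-arithmetic sketch: energyJoules and relativeError are hypothetical helpers, and the 0.3 A and 5 s inputs are made-up illustrative values, not measurements.

```kotlin
// E = I * V * t, with current in amps, voltage in volts, duration in seconds.
fun energyJoules(currentAmps: Double, voltageVolts: Double, durationSeconds: Double): Double =
    currentAmps * voltageVolts * durationSeconds

// Relative error of a value against a reference, as a fraction.
fun relativeError(value: Double, reference: Double): Double =
    (value - reference) / reference

// Illustrative inputs only (not measured): 0.3 A for 5 s.
val assumedEnergy = energyJoules(0.3, 3.7, 5.0)      // nominal-voltage assumption
val nearlyEmptyEnergy = energyJoules(0.3, 3.3, 5.0)  // low end of the range quoted above
val fullyChargedEnergy = energyJoules(0.3, 4.4, 5.0) // high end of the range quoted above
```

Against the 3.7V estimate, the fully charged case comes out about 19% higher and the nearly empty case about 11% lower, whatever current and duration are plugged in, because both cancel out of the ratio.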

I removed it.

GPU/NPU utilization: no public API

There is no public Android API to query GPU or NPU utilization. On desktop Linux, you can read /sys/class/drm/card0/device/gpu_busy_percent (on AMD GPUs) or use vendor tools (nvidia-smi, rocm-smi). On Android with Qualcomm silicon:

GPU: Qualcomm's GPU profiling tools (Snapdragon Profiler) require a USB-tethered connection and root-level access on many devices. There is no runtime API an app can call.
NPU: The Hexagon DSP has no utilization API whatsoever in the public SDK. You cannot query whether the NPU is 10% busy or 100% busy during your inference.

This means you cannot answer basic questions like "is my inference compute-bound or memory-bound?" or "is the NPU fully utilized or is my model too small to saturate it?" These are questions that a desktop ML engineer answers in seconds with nvidia-smi. On Android, they are unanswerable without vendor tooling and physical device access.

I did not try to estimate utilization from inference timing. An estimate based on "the NPU should be able to do X tok/s at peak, I measured Y, so utilization is Y/X" would require knowing the theoretical peak, which depends on the model architecture, quantization scheme, and memory bandwidth, none of which are exposed.

I left it out entirely.

Building the benchmark system

Knowing what to measure is half the problem. The other half is running the measurements in a way that is automated, reproducible, and does not interfere with the thing being measured.

The ADB-triggered approach

The benchmark system is triggered via an Android BroadcastReceiver, invoked from ADB:

# Run text benchmark, 30 easy entries, GPU backend:
adb shell am broadcast \
  -n com.example.redacto/com.example.redacto.benchmark.BenchmarkReceiver \
  -a com.example.redacto.BENCHMARK \
  --es type text \
  --ei count 30 \
  --es difficulty easy

# Same benchmark on NPU:
adb shell am broadcast \
  -n com.example.redacto/com.example.redacto.benchmark.BenchmarkReceiver \
  -a com.example.redacto.BENCHMARK \
  --es type text \
  --ei count 30 \
  --es difficulty easy \
  --es backend NPU

# Monitor results in real time:
adb logcat -s RedactoBenchmark:I

Example logcat output, formatted the way the runner prints it. The per-step values shown here are the NPU averages across the 30-entry run rather than any single captured entry (individual entries vary; I did not retain per-entry logs), so read this as an illustration of the log format and the summary line, not as one real entry:

RedactoBenchmark: [n/30] <entry_id> | Step1: 345ms (TTFT 99ms, 10 tok, ~42 tok/s)
RedactoBenchmark: [n/30] <entry_id> | Step2: 624ms (TTFT 104ms, 22 tok, ~42 tok/s)
RedactoBenchmark: [n/30] <entry_id> | Step3: 4060ms (TTFT 92ms, 163 tok, ~42 tok/s)
RedactoBenchmark: [n/30] <entry_id> | TOTAL: ~5062ms | Tokens: ~195 | RSS: 1934MB
...
RedactoBenchmark: === SUMMARY ===
RedactoBenchmark: Entries: 30 | Avg latency: 5062ms | Avg tokens: 195 | Peak RSS: 1934MB

Why BroadcastReceiver instead of a separate test app? This is the critical design decision. A separate benchmarking app would need to create its own InferenceEngine instance and load the 2.59 GB model into memory. With the main app already running and holding a loaded model, that means two models in RAM simultaneously: roughly 5.2 GB just for inference weights. On a device with 12 GB total RAM, that is an OOM crash waiting to happen.

Instead, the BroadcastReceiver invokes a callback on the running app's ViewModel, which shares the already-initialized engine. One model in memory, zero duplication, no interference with the existing app state.

The tradeoff: the app must be in the foreground. The callback registration happens in Compose's lifecycle, so if the app is backgrounded, the receiver fires but the callback is not registered, and nothing happens. This means you cannot run benchmarks in a pure headless mode. For a hackathon project, this is acceptable. For a production benchmark suite, you would want to decouple the engine lifecycle from the UI framework.
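
The receiver-to-ViewModel handoff described above can be sketched without any Android classes. Everything here is a hypothetical reconstruction, not Redacto's code: BenchmarkRequest, BenchmarkBridge, and the default values are assumptions, and the map stands in for the Intent extras shown in the ADB commands (type, count, difficulty, backend). In a real app the receiver's onReceive would call dispatch, and the UI layer would register and unregister the handler with its lifecycle.

```kotlin
// Hypothetical request type mirroring the extras passed in the ADB broadcast.
data class BenchmarkRequest(
    val type: String,
    val count: Int,
    val difficulty: String,
    val backend: String
)

// Hypothetical bridge between the receiver and whoever owns the loaded engine.
object BenchmarkBridge {
    // Set while the UI is active; null when the app is backgrounded.
    @Volatile
    private var handler: ((BenchmarkRequest) -> Unit)? = null

    fun register(h: (BenchmarkRequest) -> Unit) { handler = h }
    fun unregister() { handler = null }

    // Builds a request from string extras; the defaults are assumptions for this sketch.
    fun parse(extras: Map<String, String>): BenchmarkRequest = BenchmarkRequest(
        type = extras["type"] ?: "text",
        count = extras["count"]?.toIntOrNull() ?: 30,
        difficulty = extras["difficulty"] ?: "easy",
        backend = extras["backend"] ?: "GPU"
    )

    // Returns false when no handler is registered, so the caller can log the no-op.
    fun dispatch(extras: Map<String, String>): Boolean {
        val h = handler ?: return false
        h(parse(extras))
        return true
    }
}
```

The boolean return is the design point: it turns the "app was backgrounded, nothing happened" case into something the receiver can log, instead of a silent no-op.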

Benchmark scope: Steps 1-3 only

The performance benchmark runs Classify, Detect, and Redact: the first three steps of the four-step pipeline. It does not run Step 4 (Validate).

This is deliberate. The benchmark is a performance test, not a quality test. I want to measure inference speed across backends, modes, and difficulty levels. Validation adds variable retry rounds (up to 2 retries) that make latency comparisons noisy. An entry that passes validation in round 1 takes ~300ms for Step 4. An entry that fails twice and succeeds on round 3 adds ~900ms of variable overhead. Including this in performance numbers would obscure the signal I care about: how fast is the inference engine itself?

Quality is measured separately, using the scoring methodology described below.

Dataset design

A benchmark is only as good as its dataset. Garbage in, garbage out, but also easy in, flattering out. If your test entries are trivially simple, your benchmark will produce impressive numbers that do not reflect real-world performance.

Structure

The redacto_bench.jsonl dataset contains 85 entries:

Mode	Total	Easy	Medium	Hard
HIPAA	17	4	4	4+5
FINANCIAL	17	4	4	4+5
FIELD_SERVICE	17	5	5	2+5
TACTICAL	17	7	5	0+5
JOURNALISM	17	10	2	0+5

(Plain numbers are the automated curation. The +5 in the hard column is the hand-crafted contextual entries: 5 per mode, all rated hard, which is why they land only in that column. That accounts for the full 25 hand-crafted entries; the other 60 are automated. Totals: 30 easy, 20 medium, 35 hard.)

Why text-only?

The dataset is pure text: no images, no OCR. This is an isolation decision. OCR (via ML Kit) adds variable latency of 200-1500ms depending on image complexity, resolution, and text density. Including OCR in the benchmark would make it impossible to distinguish "the model is slow" from "the image was hard to OCR." I measure the pipeline, not ML Kit.

Two-phase curation

Phase 1 - Automated (60 entries): I started with the ai4privacy/pii-masking-400k dataset (17,046 English entries with labeled PII spans). Quality filtering reduced this to roughly 2,000 candidates using these criteria: longer than 80 characters, shorter than 600, at least 3 PII entities, no synthetic-looking names, and PII density below 60%. Mode assignment used keyword matching (medical terminology maps to HIPAA, financial terms to FINANCIAL) and entity composition (entries with CREDITCARDNUMBER or TAXNUM map to FINANCIAL). Final sampling: 12 per mode, spread across difficulty levels as evenly as the candidates allowed (TACTICAL and JOURNALISM produced no automated hard entries, as the table above shows).

Phase 2 - Hand-crafted (25 entries): Five per mode, targeting contextual reasoning that generic PII data cannot test:

HIPAA: Relational identifiers ("the patient's daughter Lisa"), contextual PHI (a depression diagnosis linked to an identified patient), medication dosages tied to named patients.
TACTICAL: Distinguishing what to keep (suspect descriptions: race, height, clothing, vehicle, license plate) from what to redact (victim and witness names). This is the hardest category because the model must understand roles, not just entity types.
JOURNALISM: Keep public officials (real names like Lloyd Austin, Elizabeth Warren) while redacting confidential source identities. Tests whether the model understands the difference between public and private individuals.
FIELD_SERVICE: Contextual security information beyond passwords: "key is under the mat," "back door is usually unlocked." These are natural-language security risks that regex will never catch.
FINANCIAL: Preserve dollar amounts and institution names while redacting account numbers, SSNs, and routing numbers. Tests entity boundary precision: the model must know where the institution name ends and the account number begins.
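
The Phase 1 quality filter can be sketched as a single predicate. This is a reconstruction from the criteria listed above, not the actual curation script: Candidate, PiiSpan, piiDensity, and passesQualityFilter are hypothetical names, PII density is assumed to mean the fraction of characters covered by labeled spans, and the "no synthetic-looking names" check is left out because the text does not say how it was done.

```kotlin
// Hypothetical types for this sketch; spans are [start, end) character offsets.
data class PiiSpan(val start: Int, val end: Int, val label: String)
data class Candidate(val text: String, val spans: List<PiiSpan>)

// Assumed definition: share of characters covered by labeled PII spans (overlaps ignored).
fun piiDensity(c: Candidate): Double {
    if (c.text.isEmpty()) return 0.0
    val covered = c.spans.sumOf { (it.end - it.start).coerceAtLeast(0) }
    return covered.toDouble() / c.text.length
}

// Length, entity-count, and density thresholds as listed in the text.
fun passesQualityFilter(c: Candidate): Boolean =
    c.text.length > 80 &&
        c.text.length < 600 &&
        c.spans.size >= 3 &&
        piiDensity(c) < 0.60
```

Keeping the thresholds in one predicate makes it cheap to rerun the curation with different cutoffs and see how the candidate pool changes.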

Why 85 entries?

On-device GPU inference takes 5-10 seconds per entry for the 3-step pipeline. 85 entries at ~7 seconds average is roughly 10 minutes per backend: 20 minutes total for GPU vs NPU. Long enough to cover 5 modes and 3 difficulties in one run. Short enough that you can iterate within a hackathon session. I also added a slider (1-85) to run quick 3-5 entry sanity checks during development.

Scoring methodology

Scoring measures redaction quality, which is separate from performance. The two are complementary: performance tells you how fast the model runs, scoring tells you how well it redacts.

Three metrics, weighted

Entity recall (weight: 50%): For each ground-truth PII entity, check whether the original value is absent from the model's output. This is intentionally lenient: it does not require exact placeholder format, just that the PII was removed in some form. If the ground truth says "Jane Smith" is PII and the model's output contains neither "Jane" nor "Smith," entity recall counts that as a hit.

Why 50% weight: missing a PII entity is the worst failure mode. An app that leaves a Social Security number visible has failed at its core purpose.

Format compliance (weight: 25%): Fraction of bracket placeholders in the output that match the [CATEGORY_N] pattern (e.g., [NAME_1], [SSN_2]). This distinguishes models that use Redacto's structured format from those that output generic [REDACTED] tags. Format matters because downstream consumers (audit logs, compliance reports) need to know what type of PII was removed, not just that something was removed.

Text preservation (weight: 25%): Fraction of non-PII words from the input that are retained in the output. This catches over-redaction: a model that replaces the entire document with [REDACTED] would score 100% on entity recall and 0% on preservation. "Left heel wound not improving" should survive redaction intact. The diagnosis is medical context, not PII.

Overall score: entityRecall * 0.5 + formatScore * 0.25 + preservationScore * 0.25
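
The three metrics and the weighted total can be sketched in a few functions. This is one reading of the descriptions above rather than Redacto's scorer, and several details are assumptions: the function names are hypothetical, matching is a case-insensitive substring check on each whitespace-separated part of an entity, words are runs of letters, digits, and apostrophes, and an output with no bracket placeholders at all scores 0 on format.

```kotlin
val placeholderPattern = Regex("\\[[^\\]]+\\]")
val structuredPattern = Regex("\\[[A-Z_]+_\\d+\\]")
val wordPattern = Regex("[A-Za-z0-9']+")

// Lowercased set of words in a string.
fun wordSet(s: String): Set<String> =
    wordPattern.findAll(s).map { it.value.lowercase() }.toSet()

// Share of ground-truth entities with no part left anywhere in the output.
fun entityRecall(entities: List<String>, output: String): Double {
    if (entities.isEmpty()) return 1.0
    val hits = entities.count { e ->
        e.trim().split(Regex("\\s+")).none { part -> output.contains(part, ignoreCase = true) }
    }
    return hits.toDouble() / entities.size
}

// Share of bracket placeholders that follow the [CATEGORY_N] pattern.
fun formatCompliance(output: String): Double {
    val found = placeholderPattern.findAll(output).map { it.value }.toList()
    if (found.isEmpty()) return 0.0 // assumption: no placeholders earns no format credit
    return found.count { structuredPattern.matches(it) }.toDouble() / found.size
}

// Share of non-PII input words that survive in the output.
fun textPreservation(input: String, entities: List<String>, output: String): Double {
    val piiWords = entities.flatMap { wordSet(it) }.toSet()
    val keep = wordSet(input) - piiWords
    if (keep.isEmpty()) return 1.0
    val out = wordSet(output)
    return keep.count { it in out }.toDouble() / keep.size
}

// Weighted total using the 50/25/25 split from the text.
fun overallScore(recall: Double, format: Double, preservation: Double): Double =
    recall * 0.5 + format * 0.25 + preservation * 0.25
```

Under these assumptions, an output that replaces the whole document with [REDACTED] gets full recall, zero format, and zero preservation, for an overall score of 0.5. The substring check is deliberately strict on recall: a leftover "Jane" counts as a miss even if "Smith" was removed.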

Why not exact string match?

The naive approach is to compare the model's output character-by-character against the expected output. This fails on minor formatting differences: an extra space after a comma, slightly different placeholder numbering ([NAME_2] vs [NAME_1] for the same entity detected in different order), line breaks versus spaces. Exact match produces false negatives that penalize correct redactions.

Entity-level scoring is more robust because it measures what actually matters: was the PII removed? It accepts any method of removal: the right placeholder, a different placeholder, or even deletion of the surrounding sentence. The format compliance metric separately penalizes non-standard placeholders, so sloppy redaction that removes the PII but uses wrong formatting still loses points, just not on the recall dimension.

What the numbers actually told us

With honest metrics and a carefully built dataset, even a single directional session revealed things that informal testing never would have:

NPU is not strictly faster than GPU. On a per-token basis, yes: roughly 42 vs 25 tok/s. But because the NPU path cannot use constrained decoding, it generates so many more tokens that total latency is roughly equal (5,062ms vs 4,855ms averaged across 30 entries). The TACTICAL mode is the extreme case: NPU averaged 14,201ms versus GPU's 5,430ms, because one or more entries triggered runaway token generation.

TTFT is NPU's real advantage. Roughly 100ms time to first token (92-104ms across steps) versus ~370ms on GPU. For a user-facing app, this is the difference between "the app responded instantly" and "there was a noticeable pause." If your app streams output tokens to the UI, NPU TTFT alone may justify the higher memory cost.

A good prompt beat an under-resourced fine-tune. The prompt-driven standard model scored 80.5% overall versus 70.3% for the fine-tuned model. But that comparison is not fair to fine-tuning: the fine-tune was deliberately under-resourced (roughly 3,000 of the 400,000 ai4privacy samples, one epoch, a few minutes of training) and it learned the wrong output format ([REDACTED] instead of Redacto's [CATEGORY_N]), which cost it on format compliance directly. Even so, it won on FIELD_SERVICE entity recall (+13.2%) and TACTICAL entity recall (+13.1%), while the standard model won decisively on HIPAA (+55.8%) and text preservation (+17.8%). The overall score hides these per-mode stories. Without per-mode breakdowns, you would conclude "fine-tuning did not help," which is wrong: both the quantity and the alignment of the training data were off, and with the full dataset and the right target format this result could easily have flipped.

Memory is a hard constraint, not a preference. NPU uses 1,934 MB. GPU uses 1,375 MB. On a 12 GB device with the OS and background apps consuming 4-5 GB, NPU leaves roughly 5-6 GB free. That sounds comfortable until you remember that Android's Low Memory Killer starts getting aggressive well before you hit zero. In practice, that tighter headroom makes background app kills more likely during NPU inference, which degrades the user experience in ways that a benchmark does not capture.

What I wish existed

Building this system ate up a large chunk of a hackathon where every hour counted. Here are the walls I hit, and what I wish had been within easy reach for an app developer. I am framing these as gaps I ran into, not as proof the ecosystem lacks them everywhere, because part of the point below is that I did not have time to check exhaustively.

Per-process power attribution. iOS exposes per-process energy metrics via MetricKit. I could not find an equivalent an app can call on Android. Without it, I do not see how an on-device AI benchmark on Android makes credible power-efficiency claims.
Accelerator utilization an app can read. Even a simple "NPU is X% busy" would let developers answer basic optimization questions. From inside my app, the NPU was effectively a black box: I found no runtime API to query it.
Structured inference metrics from LiteRT-LM. Currently, timing data comes from wrapping calls in System.currentTimeMillis() and counting callbacks. The runtime knows its own internal timing (tokenization duration, prefill duration, KV cache hits) and could expose it. I would rather read that directly than parse logcat output for benchmark data.
A standard app-level, on-device eval framework. An Android equivalent of lm-eval-harness that handles engine lifecycle, dataset loading, metric computation, and result formatting for a multi-step pipeline, so that every team does not reinvent this infrastructure from scratch. Single-model benchmark tools exist; what I wanted was something pitched at the application pipeline, not one model invocation.
CI/CD integration for device testing. Run benchmarks on device farms (Firebase Test Lab, Samsung Remote Test Lab) as part of a pull request pipeline. For me, on-device benchmarking stayed a manual, one-device-at-a-time process.

A fair caveat on this wishlist: some of these gaps may already be partly filled by tools I simply did not have time to evaluate during the hackathon. LiteRT itself ships a command-line benchmark tool, and there are vendor profilers I never wired up. I do not yet know how much of the multi-step, app-level, on-device measurement problem they actually solve. That is a project in itself, and it is the one I am planning next: to survey the existing on-device benchmarking tools properly, map what they can and cannot do for an application pipeline like this one, and see whether there is a real, unmet gap worth building for. If there is, that will be its own standalone post in this series.

The honesty principle, applied

The principle - "if I can't measure it properly, I don't show it" - sounds obvious. In practice, it means throwing away metrics that would make your demo more impressive. Battery current draw of "101 mA for standard, 301 mA for fine-tuned" is a great story for a hackathon presentation. It suggests the standard model is 3x more power-efficient. Judges would be impressed. And it would be misleading.

The discipline is not in measuring accurately. The discipline is in admitting what you cannot measure and choosing not to display it. Every on-device AI project I have seen, including published papers, reports at least one metric that does not survive this scrutiny. Usually it is power. Sometimes it is "NPU utilization." Occasionally it is throughput numbers that fold prefill into the tok/s calculation (skewing the number by 20-30%).

If you are building on-device AI benchmarks, start with the question: "What can I actually measure, and at what confidence level?" Build your metrics table from the answers. Leave the empty cells empty.

The numbers you do not report are as important as the numbers you do.

All benchmark data here comes from a single directional session on the Redacto project running on a Samsung Galaxy S25 Ultra (Snapdragon 8 Elite, SM8750) with Gemma 4 E2B in May 2026. No raw logs were retained and the device is no longer available, so the numbers are directional, not a rigorous multi-run study. For the foundations behind the metrics mentioned here (TTFT, tok/s, quantization), see What On-Device AI Benchmarks Actually Feel Like.

Related in this series of "Edge AI from the Trenches"

What On-Device AI Benchmarks Actually Feel Like - foundational definitions of the metrics measured here (TTFT, tok/s)
Prompt Engineering Beat My Fine-Tuned Model. Here Is Why. - the decision framework behind model selection for benchmarks
How I Benchmarked an LLM Running Entirely on a Phone (No Cloud, No API) - the dataset curation and scoring methodology in full detail (upcoming in this series)

Device: Samsung Galaxy S25 Ultra (Snapdragon 8 Elite SM8750, 12 GB RAM)
Model: Gemma 4 E2B Instruct (GPU: 2.59 GB, NPU: 3.02 GB)
Runtime: LiteRT-LM
Pipeline: 3-step (Classify, Detect, Redact); no validation step for performance runs
Dataset: 85 entries (60 curated + 25 hand-crafted), 5 modes, 3 difficulty levels
Benchmark session: single directional run, May 2026; no raw logs retained

Last updated: July 2026
14th of 23 posts in the "Edge AI from the Trenches" series

One Prompt Was Not Enough: Building a 4-Step On-Device Redaction Pipeline

Jaydeep Shah (JD) — Sun, 05 Jul 2026 06:35:00 +0000

I thought I could solve PII redaction with one LLM call. Classify the document, find every identifier, replace them with placeholders, and verify nothing leaked, all in a single prompt. The model is smart enough, right?

It took three architectural rewrites and a trail of silently failed redactions before I arrived at something that actually works: a 4-step pipeline where each step gets its own conversation, its own prompt, and its own job. This is the story of how I got there, told through the failures that forced each design decision.

The system is called Redacto. It runs entirely on-device: Gemma 4 E2B (~2.3B effective params, 5.1B total via Per-Layer Embeddings) on a Snapdragon 8 Elite NPU via LiteRT-LM. No cloud, no internet permission, no BAA required. The architecture I am about to describe applies to any LLM-driven document processing pipeline, but the constraints of on-device inference made every failure more painful and every fix more deliberate.

The naive approach: one call to rule them all

The first version of Redacto did exactly what seems reasonable. One prompt, one LLM call:

You are a privacy redaction engine. Analyze the following text. Identify the document type. Find all personally identifiable information. Replace each PII item with a category placeholder like [NAME_1]. Verify that no PII remains. Return the redacted text.

This is the architecture most people reach for first. It mirrors how a human would think about the problem: read the document, find the sensitive stuff, black it out, double-check. A single mental pass.

And it works, sometimes. On short, clean, well-formatted inputs, the model can juggle all four tasks in one shot. I saw it handle a three-line medical note perfectly on the first try. So I shipped it.

Then I tested it on real documents.

What actually happened

Three categories of failure emerged immediately.

Inconsistent output format. The prompt asked for structured output: category labels, detected items, redacted text. But the model would sometimes return the redacted text without the detection list. Other times it would return the detection list but skip the redaction. On a small model running on-device, output format reliability is not something you can take for granted. This is the reality of working with small models: they are powerful, but they cannot reliably hold four distinct objectives in working memory at once (D3).

Missed detections. When the model is simultaneously trying to understand document structure, enumerate PII items, perform text surgery, and self-verify, something has to give. In practice, detection suffered the most. The model would catch obvious items like names and dates but miss relational identifiers, "the patient's daughter Lisa", where "Lisa" is only PII because of its relationship to the patient. Single-pass processing does not give the model enough cognitive room to reason about context (D3).

Silent failures. The worst category. The model would return confidently formatted output that looked correct but had gaps. No error, no warning, just missing redactions. A medical record number sitting in plain text in the "redacted" output. When your system's job is privacy protection, a silent failure is worse than a crash.

These are not hypothetical problems. They are documented failure modes that drove the architectural decisions in Redacto's engineering log.

The evolution through failed approaches

Getting from single-pass to the final 4-step pipeline was not a clean leap. It was two intermediate architectures, each of which fixed one problem and revealed the next.

Attempt 2: Separate detection, deterministic replacement

The first fix was obvious: split detection from replacement. Let the LLM focus on finding PII (Step 2), then use deterministic code to do the actual text replacement (Step 3). No LLM creativity in the replacement step, just string.replace(), applied longest match first so that a short detection never clobbers part of a longer one.

This is sound software engineering. Separate concerns. Use deterministic logic where you can. Reserve the LLM for what only an LLM can do.

It failed catastrophically.

Here is the specific failure mode, documented as Decision D4 in our engineering log: OCR text has errors. A scanned medical bill might contain "inquire@blucurrent. mail" (note the stray space before "mail"). The LLM in Step 2 receives this OCR text and detects an email address. But the LLM does what LLMs do: it "corrects" the text in its output. It reports the detection as "inquire@ucal.com".

Now string.replace() runs. It searches for "inquire@ucal.com" in the original text. That string does not exist in the original. The replacement silently does nothing. The email address remains in the output, unredacted.

In this test, 6 of 8 detections were skipped because the detected text did not match the original OCR text (D4). That figure is directional, from a single test run rather than a rigorous multi-run study, but the pattern was unmistakable: silent, systematic loss.

This is a fundamental tension: the LLM understands semantics but does not preserve exact string representations. Deterministic string operations preserve exact strings but do not understand semantics. When you chain one into the other, the semantic layer corrupts the data that the deterministic layer depends on.
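To make the failure concrete, here is a minimal sketch of longest-first deterministic replacement. It is not Redacto's actual code: the names Detection, ReplaceResult, and replaceDetections are hypothetical, and the one deliberate difference from the version described above is that it reports detections it could not find instead of skipping them silently.

```kotlin
// Hypothetical types for illustration; not taken from the Redacto codebase.
data class Detection(val category: String, val text: String)

data class ReplaceResult(val redacted: String, val unmatched: List<Detection>)

fun replaceDetections(original: String, detections: List<Detection>): ReplaceResult {
    var output = original
    val unmatched = mutableListOf<Detection>()
    val counters = mutableMapOf<String, Int>()
    // Longest first, so "Jane Smith" is replaced before a bare "Jane".
    for (d in detections.sortedByDescending { it.text.length }) {
        if (d.text.isEmpty() || !output.contains(d.text)) {
            // The detector "corrected" the string, so exact matching finds nothing.
            unmatched += d
            continue
        }
        // Placeholders are numbered per category, in processing order.
        val n = (counters[d.category] ?: 0) + 1
        counters[d.category] = n
        output = output.replace(d.text, "[${d.category}_$n]")
    }
    return ReplaceResult(output, unmatched)
}
```

Feeding it the OCR text with "inquire@blucurrent. mail" and a detection of "inquire@ucal.com" leaves the email untouched and puts that detection in unmatched, which is exactly the gap the original string.replace() version never surfaced.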

Attempt 3: Word-diff for image bounding boxes

For image redaction, the pipeline was: OCR extracts text with bounding boxes, the LLM redacts the text, then we diff the original against the redacted version to figure out which words were replaced, and draw black boxes over their bounding box coordinates.

The diff approach broke for the same reason. The LLM does not just replace PII, it rewrites surrounding text. It fixes grammar, merges sentences, reorders clauses. The word-level diff between original and redacted text becomes noisy and unreliable. Bounding boxes end up on the wrong words, or they cover too much, or they miss the target entirely.

This was the failure that finally forced a complete rethink of the architecture (D9).

The 4-step pipeline that works

The final architecture decomposes the problem into four steps, each with a single responsibility:

Step 1: Classify

Input: raw document text.
Output: document type and redaction category.

DOCUMENT_TYPE: Medical Note
CATEGORY: Medical

This determines which of the 7 specialist prompt sets will drive Step 2. A medical note activates the Medical detector (the HIPAA PHI identifiers, preserving diagnoses and medications). A police report activates the Tactical detector (protecting victim and witness identities while preserving suspect descriptions). One prompt does not fit all (D12).
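A minimal sketch of how the two-line classifier output could be parsed, assuming the DOCUMENT_TYPE / CATEGORY format shown above. The enum, the Classification type, and parseClassification are hypothetical names; falling back to General on anything unrecognized is an assumption based on General being described as the fallback category.

```kotlin
// Hypothetical names; the seven values mirror the categories described in this post.
enum class RedactionCategory { MEDICAL, FINANCIAL, LEGAL, TACTICAL, JOURNALISM, FIELD_SERVICE, GENERAL }

data class Classification(val documentType: String, val category: RedactionCategory)

fun parseClassification(raw: String): Classification {
    // Collect "KEY: value" lines, splitting on the first colon only.
    val fields = raw.lineSequence()
        .mapNotNull { line ->
            val idx = line.indexOf(':')
            if (idx <= 0) null
            else line.substring(0, idx).trim().uppercase() to line.substring(idx + 1).trim()
        }
        .toMap()
    // "Field Service" becomes FIELD_SERVICE; anything unknown falls back to GENERAL.
    val label = fields["CATEGORY"].orEmpty().uppercase().replace(' ', '_')
    val category = RedactionCategory.values().firstOrNull { it.name == label }
        ?: RedactionCategory.GENERAL
    return Classification(fields["DOCUMENT_TYPE"] ?: "Unknown", category)
}
```

The lenient fallback matters on a small model: a malformed classification should degrade to broad detection rather than abort the pipeline.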

Step 2: Detect

Input: document text + category-specific detection prompt.
Output: structured list of every PII item found.

NAME: Mrs. Chen
DATE: 3/15/48
MRN:  4471829
PHONE: 408-555-1234

The detector's only job is to find things. It does not replace them, does not think about output format, does not self-verify. This single focus is what lets it catch relational identifiers that the single-pass approach missed, "the patient's daughter Lisa", because all of its cognitive budget goes to understanding context, not juggling four tasks.

The output uses a simple line-based format (CATEGORY: text), not JSON. This is deliberate: a model this size sometimes struggles with JSON bracket matching, and the line-based format is more token-efficient (D8).
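Part of the appeal of the line-based format is how little code it takes to parse defensively. The sketch below is an assumption about what such a parser could look like, not the one Redacto ships; Detection and parseDetections are hypothetical names.

```kotlin
// Hypothetical type for illustration.
data class Detection(val category: String, val text: String)

fun parseDetections(raw: String): List<Detection> =
    raw.lineSequence()
        .map { it.trim() }
        .filter { it.isNotEmpty() }
        .mapNotNull { line ->
            // Split on the first colon only, so values such as "10:30" survive intact.
            val idx = line.indexOf(':')
            if (idx <= 0) return@mapNotNull null
            val category = line.substring(0, idx).trim()
            val text = line.substring(idx + 1).trim()
            // Lines without both parts (model chatter, blank values) are dropped.
            if (category.isEmpty() || text.isEmpty()) null
            else Detection(category.uppercase(), text)
        }
        .toList()
```

A malformed line costs one detection at most. A missing bracket in JSON can cost the whole response.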

Step 3: Redact

Input: original text + detection list from Step 2.
Output: text with [CATEGORY_N] placeholders.

This is the step where the critical design decision lives. The original plan was deterministic string.replace(). After the 6-of-8 failure described above, this became an LLM call (D4).

The LLM receives both the original OCR text (errors and all) and the detection list (with the LLM's "corrected" versions). Because the LLM understands that "inquire@blucurrent. mail" and "inquire@ucal.com" refer to the same entity, it can perform the substitution correctly despite the mismatch. It handles OCR noise naturally because it operates on semantics, not string equality.

The tradeoff: one extra LLM call, adding roughly 1-5 seconds. If the LLM call fails, we fall back to deterministic replacement as a safety net. But in practice, the LLM call succeeds, and the accuracy improvement is not negotiable for a privacy tool (D6).

Step 4: Validate

Input: the redacted output from Step 3, and nothing else.
Output: PASS or FAIL with a list of missed items.

Step 4 is an LLM-as-a-judge check: a separate model call whose only job is to grade the redacted output, not to produce it. The validator is an independent auditor. It receives the redacted text in a fresh conversation with no memory of what Steps 1-3 decided (D5). It reads the output cold and asks: "Does any PII remain?"

Using the model as its own judge is what makes the pipeline self-correcting. If it finds missed items, Steps 3 and 4 re-run with the missed items added to the detection list. Maximum 3 rounds total (1 initial + 2 retries) (D7). In testing, most documents pass in round 1. Those that fail after 3 rounds have systemic prompt issues, and retrying will not help.
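The retry logic described above can be sketched as a small loop. The redact and validate parameters are hypothetical stand-ins for the Step 3 and Step 4 LLM calls (each assumed to open its own fresh conversation), and Verdict, PipelineResult, and redactWithValidation are illustrative names rather than Redacto's real ones.

```kotlin
sealed class Verdict {
    object Pass : Verdict()
    data class Fail(val missed: List<String>) : Verdict()
}

data class PipelineResult(val redacted: String, val passed: Boolean, val rounds: Int)

fun redactWithValidation(
    original: String,
    initialDetections: List<String>,
    redact: (String, List<String>) -> String,   // stand-in for Step 3
    validate: (String) -> Verdict,              // stand-in for Step 4; sees only the redacted text
    maxRounds: Int = 3,                         // 1 initial + 2 retries
): PipelineResult {
    require(maxRounds >= 1) { "maxRounds must be at least 1" }
    val detections = initialDetections.toMutableList()
    var redacted = original
    for (round in 1..maxRounds) {
        // Always redact from the original text with the enlarged detection list.
        redacted = redact(original, detections)
        when (val verdict = validate(redacted)) {
            is Verdict.Pass -> return PipelineResult(redacted, true, round)
            is Verdict.Fail -> detections += verdict.missed
        }
    }
    // Still failing after the last round: surface it rather than loop forever.
    return PipelineResult(redacted, false, maxRounds)
}
```

Returning passed = false instead of throwing leaves the caller free to decide how to present an unverified result.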

Why every step gets a fresh conversation

This is Decision D5 in our engineering log, and it deserves its own section because it is counterintuitive. Reusing a conversation across steps would be faster, no repeated context loading. Why throw that away?

Context pollution.

If the validator (Step 4) shares a conversation with the detector (Step 2), it has already "seen" the detection reasoning. It knows what the detector was thinking, what it found, what it was uncertain about. That makes it a terrible auditor. It is biased by its own prior work. It will not catch things the detector missed because it has already internalized the detector's blind spots.

A fresh conversation means the validator approaches the redacted text with no preconceptions. It is reading the output as an outsider would. This is the same principle that drives code review: the person who wrote the code should not be the only one reviewing it.

On-device, the cost of a fresh conversation is low. Inference runs on the local NPU at zero marginal cost, no API fees, no rate limits. The additional latency (roughly 0.4 to 0.5 seconds for the validation step in the measurements later in this post) is acceptable because accuracy matters more than speed when the system's job is protecting private health information and financial data (D6).

Indexed-element image redaction

Image redaction required its own architectural breakthrough, documented as Decision D9.

The original approach for images: OCR extracts text with bounding boxes. The LLM redacts the text. Then match the redacted words back to the OCR bounding boxes and draw black rectangles.

This is the word-diff approach I described earlier, and it broke for the same reason: the LLM rewrites text, so the diff between original and redacted is unreliable.

The solution eliminates string matching entirely. Instead of giving the LLM raw text, we give it indexed elements:

[0] Patient
[1] Jane
[2] Smith
[3] DOB:
[4] 03/15/78
[5] compound
[6] fracture

The LLM returns index labels, not strings:

1:NAME
2:NAME
4:DATE

We map indices directly to bounding boxes. Index 1 corresponds to the bounding box for "Jane." Index 2 corresponds to "Smith." We draw black rectangles at those coordinates.

The index-to-bounding-box mapping is lossless. No text matching means no OCR error sensitivity. The HUD field count (how many items the UI reports as redacted) and the visual box count (how many black rectangles appear on the image) are always consistent.
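A minimal sketch of the indexed-element round trip, assuming the "[i] word" input format and "index:CATEGORY" output format shown above. Box is a plain stand-in for whatever rectangle type the OCR layer provides, and all names here are hypothetical.

```kotlin
// Plain stand-ins for the OCR types; a real app would use its OCR library's own classes.
data class Box(val left: Int, val top: Int, val right: Int, val bottom: Int)
data class OcrElement(val text: String, val box: Box)
data class RedactionBox(val index: Int, val category: String, val box: Box)

// Build the "[0] Patient" style listing that is sent to the model.
fun formatIndexed(elements: List<OcrElement>): String =
    elements.withIndex().joinToString("\n") { (i, e) -> "[$i] ${e.text}" }

// Turn "1:NAME" style lines back into boxes; no text matching is involved.
fun mapLabelsToBoxes(llmOutput: String, elements: List<OcrElement>): List<RedactionBox> =
    llmOutput.lineSequence().mapNotNull { line ->
        val parts = line.trim().split(":", limit = 2)
        if (parts.size != 2) return@mapNotNull null
        val index = parts[0].trim().toIntOrNull() ?: return@mapNotNull null
        val category = parts[1].trim()
        // Out-of-range indices are ignored rather than trusted.
        val element = elements.getOrNull(index) ?: return@mapNotNull null
        if (category.isEmpty()) null else RedactionBox(index, category, element.box)
    }.toList()
```

Because the output list is built only from valid indices, its size can feed both the HUD count and the drawing code, which is one way to keep the two numbers consistent.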

For long documents, OCR elements are chunked at 150 elements per chunk, each chunk processed through the detection LLM separately (D10). This keeps each chunk within the model's deployed 1,024-token KV cache. Gemma 4 E2B is trained for a 128K context, but the build I shipped on-device is compiled with a 1,024-token cache (cache_length=1024, prefill 256); the app requests maxNumTokens=4000, but the compiled cache is what actually governs how much fits. Chunking at 150 elements keeps each detection pass comfortably inside that deployed budget without sacrificing coverage.
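Chunking interacts with the index scheme: if each chunk is numbered from zero to keep the prompt short, the returned indices need an offset before they can be mapped to bounding boxes. The sketch below assumes that per-chunk numbering, which the post does not confirm; IndexedChunk, chunkElements, and toGlobalIndex are hypothetical names.

```kotlin
// Each chunk remembers where it started in the full element list.
data class IndexedChunk(val offset: Int, val elements: List<String>)

fun chunkElements(elements: List<String>, chunkSize: Int = 150): List<IndexedChunk> =
    elements.chunked(chunkSize).mapIndexed { i, chunk -> IndexedChunk(i * chunkSize, chunk) }

// Convert an index the model returned for one chunk into an index into the full list.
fun toGlobalIndex(chunk: IndexedChunk, localIndex: Int): Int? =
    if (localIndex in chunk.elements.indices) chunk.offset + localIndex else null
```

For a 320-element page this yields chunks of 150, 150, and 20 elements, and local index 5 in the last chunk maps to global index 305.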

The cost of doing it right

The 4-step pipeline is slower than single-pass. Here are the numbers I measured on a Galaxy S25 Ultra with a Snapdragon 8 Elite NPU:

Architecture	Latency (229-char medical note)
Single-pass (1 LLM call)	~1.5s
3-step pipeline (Classify + Detect + Redact)	~2.8s
4-step pipeline (+ Validate)	~3.3s

These are directional figures from a single May-2026 session on one device. I did not retain the raw logs, and that phone is no longer available to me, so treat this as a one-session, order-of-magnitude read rather than a rigorous multi-run benchmark.

That is roughly 2x the latency of single-pass. Is it worth it?

Yes. And the reason is Decision D6 in our engineering log: accuracy over latency, no hard latency budget.

This is a privacy tool. A missed redaction means someone's medical record number, Social Security number, or home address leaks. The cost of a false negative is not "the user has to redo it", it is a potential HIPAA violation, identity theft, or witness endangerment. In that context, an extra 1.8 seconds is not a meaningful cost.

The economics reinforce this. Inference runs on the local NPU at zero marginal cost. There are no API fees per call. There are no rate limits. The only cost is the user's time, and users of a privacy tool prefer thorough redaction over fast results.

Per-step latency breakdown (NPU, patient record via image pipeline), same single-session caveat as above, approximate values only:

Step	Latency (approx)	TTFT (approx)	Tokens (approx)	Decode (approx)
Step 1 (Classify)	~290ms	~95ms	~10	~42 tok/s
Step 2 (Detect)	~2,600ms	~100ms	~100	~42 tok/s
Step 3 (Redact)	~4,700ms	~100ms	~190	~42 tok/s
Step 4 (Validate)	~380ms	~95ms	~12	~42 tok/s

I am deliberately not quoting per-step decode rates to a tenth of a token: I do not have the logs to back that precision. The decode rate lands around 42 tok/s and stays roughly flat across steps, which is what you would expect on this NPU, where decode is memory-bandwidth-bound rather than compute-bound, so generating more tokens does not change the per-token rate much. Latency therefore tracks output length. Step 3 is the slowest because it regenerates the document text with placeholders (~190 tokens), and Step 2 comes next because it must enumerate every PII item (~100 tokens). Step 4 is fast because most documents pass validation, so the output is a short PASS verdict of about a dozen tokens. The pipeline progress indicator ("Step 2/4: Detecting Medical identifiers...") keeps users informed during the 3-8 second wait (D18).
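With a flat decode rate, the table can be sanity-checked with a simple model: step latency is roughly TTFT plus output tokens divided by decode rate. This is a back-of-the-envelope sketch that ignores everything except those two terms; estimateStepLatencyMs is a hypothetical helper.

```kotlin
// latency ≈ time to first token + (output tokens / decode rate)
fun estimateStepLatencyMs(ttftMs: Double, outputTokens: Int, decodeTokPerSec: Double): Double {
    require(decodeTokPerSec > 0.0) { "decode rate must be positive" }
    return ttftMs + outputTokens / decodeTokPerSec * 1000.0
}
```

Plugging in the approximate table values gives about 2.5 s for Detect (100 ms + 100 tokens at 42 tok/s) and about 4.6 s for Redact (100 ms + 190 tokens), both within a few percent of the ~2,600 ms and ~4,700 ms measured. Output length explains almost all of the per-step difference.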

7 category-specific prompt sets

The final architectural piece is the prompt system, documented as Decision D12. Each of the 7 redaction categories has dedicated Detect and Validate prompts. Classify and Redact prompts are universal (shared across categories).

Why can you not use one prompt for all document types?

Because what counts as PII depends entirely on context. A clinical note must preserve "Type 2 diabetes" (it is a diagnosis, not an identifier) while redacting "Jane Smith" (it is a patient name). A police report must preserve suspect descriptions ("6'2", brown jacket, facial scar") while redacting victim names. A financial document must preserve dollar amounts and institution names while redacting account and routing numbers.

Category	Detection focus	Key preserve rules
Medical	HIPAA PHI identifiers (names, DOB, MRN, etc.)	Diagnoses, medications, vitals, body locations
Financial	Account/routing/card/SSN/TaxID	Dollar amounts, institution names, toll-free numbers
Legal	Buyer/seller/tenant names	Property specs, legal terms, dollar amounts
Tactical	Victim/witness/minor protection	Suspect descriptions, officer names, crime scene
Journalism	Source identity protection	Public officials, reporter names
Field Service	Customer PII + security credentials	Equipment details, fault codes
General	Broad PII detection (fallback)	Minimal preserve rules

A single "find all PII" prompt would either over-redact (blanking out diagnoses in medical notes) or under-redact (missing gate codes in field service reports). The category system lets each prompt set be precise about what to redact and what to preserve.

The prompts are embedded as Kotlin constants in a PipelinePrompts object (D13). No runtime file reading, no asset loading. The prompts are compiled into the APK and available immediately. The design reference files (SKILL.md) in the prompts/ directory are for human review, not runtime consumption.
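The shape of such a prompt object might look like the sketch below. Only the structure follows the description above (universal Classify and Redact prompts, per-category Detect and Validate prompts); the prompt strings are placeholders, and PromptSet, RedactionCategory, and PipelinePromptsSketch are hypothetical names, not the real PipelinePrompts contents.

```kotlin
enum class RedactionCategory { MEDICAL, FINANCIAL, LEGAL, TACTICAL, JOURNALISM, FIELD_SERVICE, GENERAL }

data class PromptSet(val detect: String, val validate: String)

object PipelinePromptsSketch {
    // Universal prompts, shared across categories (placeholder text).
    const val CLASSIFY = "<universal classify prompt>"
    const val REDACT = "<universal redact prompt>"

    // One dedicated Detect/Validate pair per category (placeholder text).
    private val byCategory: Map<RedactionCategory, PromptSet> =
        RedactionCategory.values().associateWith { c ->
            PromptSet(
                detect = "<${c.name} detect prompt>",
                validate = "<${c.name} validate prompt>",
            )
        }

    fun promptsFor(category: RedactionCategory): PromptSet = byCategory.getValue(category)
}
```

Keeping the lookup total over the enum means adding an eighth category without its prompts cannot reach production as a silent null.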

The pattern: decompose, isolate, verify

Looking back, the 4-step pipeline is an application of a principle that shows up across software engineering: when a single component is unreliable at a complex task, decompose it into focused steps with clear interfaces between them.

The specific insights from building Redacto:

LLMs are unreliable multi-taskers on complex instructions. A small model asked to simultaneously classify, detect, replace, and verify will drop one of those tasks, usually the hardest one. Decomposition is not just good engineering, it is a reliability requirement.
Semantic and deterministic operations do not compose cleanly. The LLM operates on meaning; string operations operate on bytes. When you chain them, the semantic layer's tendency to "correct" input corrupts the deterministic layer's assumptions. Either go fully semantic (LLM for substitution) or fully deterministic (regex for detection). Do not mix them at the boundary.
Verification requires independence. A validator that shares context with the system it is validating is not a validator, it is a rubber stamp. The fix is LLM-as-a-judge: a separate, context-free model call that grades the output instead of producing it. Fresh conversations cost latency but buy genuine error detection.
Index-based mapping eliminates an entire class of errors. Anywhere you find yourself doing string matching between LLM output and source data, ask whether you can use indices instead. The LLM is good at understanding what index 1 refers to. It is bad at reproducing the exact string that index 1 contains.
Domain-specific prompts are not premature optimization. Different document types have genuinely different rules about what constitutes sensitive information. A universal prompt will always be a compromise.

These are lessons I learned by shipping broken versions and measuring the failures. The 4-step pipeline is not clever, it is the simplest architecture that actually works.

[Figure: Full pipeline architecture]

Related in this series of "Edge AI from the Trenches"

The Invisible Layer Between My Prompt and the Model - the formatting layer each pipeline step depends on
What I Learned About "On-Device AI": Most of It Is a Promise, Not a Guarantee - why Redacto runs this pipeline with no network permission
Prompt Engineering on a Small On-Device Model Is a Different Sport (upcoming in this series) - the prompt constraints that shaped each pipeline step's system prompt

Last updated: July 2026
13th of 23 posts in the "Edge AI from the Trenches" series

What I Learned Getting an NPU to Actually Initialize: Six Silent Failures

Jaydeep Shah (JD) — Sun, 05 Jul 2026 05:10:00 +0000

Getting the NPU to actually initialize was, by a wide margin, the hardest part of building Redacto. It was not the application logic; that part was straightforward. It was roughly eighty (dramatically speaking) APK rebuilds, each one earned by staring at logcat, reading vendor libraries and headers that ship on the device, and doing adb shell find across the entire /vendor partition to chase down one more silent failure. The model was correct. The API was correct. The device was correct. And yet Engine.initialize() threw, silently, repeatedly, for reasons I could not find documented in the paths I checked.

The goal was Redacto, a zero-trust, on-device PII redaction app that runs Gemma 4 E2B entirely on the Hexagon V79 NPU inside the Snapdragon 8 Elite. (It later went on to win the Qualcomm x Google LiteRT Developer Hackathon 2026, but that is not the part that was hard.) The hard part was the gap between "the pieces are all correct" and "the engine actually starts."

This post is the document I wish had existed. Six distinct failure modes, in the order I hit them, with exact log lines, root causes, and fixes. If you are trying to get LiteRT-LM running on a Qualcomm NPU, especially on a Samsung Galaxy S25 Ultra with Android 16, this will save you most of those eighty rebuilds.

Environment

Item	Value
Device	Samsung Galaxy S25 Ultra
Chipset	Snapdragon 8 Elite for Galaxy (SM8750-AC)
NPU	Hexagon V79
Android	16 (API 36)
LiteRT-LM dep	`com.google.ai.edge.litertlm:litertlm-android:0.11.0-rc1`
Model	`gemma4_npu.litertlm` (3.02 GB, NPU-compiled multimodal Gemma 4 E2B)

All six failures below occurred on this exact configuration. Some are device-specific (Samsung), some are OS-version-specific (Android 16), and some are universal to anyone using QNN HTP on LiteRT-LM.

Failure 1: Dispatch Lib Symbol Mismatch

The log

E tflite : Encountered unresolved custom op: DISPATCH_OP.
E tflite : Node number 0 (DISPATCH_OP) failed to prepare.

Why it happens

When you use an NPU-compiled .litertlm model, the model's computation graph contains DISPATCH_OP custom ops. These are not standard TFLite operations: they are delegated operations that tell LiteRT "hand this subgraph to the QNN HTP backend." The library libLiteRtDispatch_Qualcomm.so registers these custom ops via C++ static-init constructors when it is dlopened by the runtime.

If dlopen fails, those constructors never run, and LiteRT sees DISPATCH_OP as an unresolved custom op. The critical detail: dlopen fails silently. There is no E dlopen: ... line in logcat. You just get the downstream symptom: the custom op registration never happened.

In our case, the dispatch lib version was mismatched against the AAR's libLiteRt.so. We had initially pulled libQnnHtp.so and related libs from the device's /vendor partition, thinking "same device, same chipset, should work." Wrong. The vendor-partition libs are built against a different ABI revision of QAIRT than what the LiteRT-LM AAR expects.

The fix

Pull the exact six native libraries from the official Qualcomm LLM sample's jniLibs/arm64-v8a/ directory. These are built against QAIRT 2.42 and are ABI-compatible with litertlm-android:0.11.0-rc1:

libLiteRtDispatch_Qualcomm.so (478,912 bytes)
libGemmaModelConstraintProvider.so (20,092,072 bytes)
libQnnHtp.so (2,778,176 bytes)
libQnnHtpV79Skel.so (10,975,268 bytes)
libQnnHtpV79Stub.so (679,168 bytes)
libQnnSystem.so (2,983,560 bytes)

Source: the litert-samples GitHub repository, under compiled_model_api/qualcomm/llm_chatbot_npu/app/src/main/jniLibs/arm64-v8a/.

After replacing all six with the sample's QAIRT 2.42 set, logcat shows the dlopen succeeding:

I litert : [qnn_manager.cc:125] Loading qnn shared library from "libQnnHtp.so"
I litert : [qnn_manager.cc:134] Loaded qnn shared library
I tflite : Replacing 1 out of 1 node(s) with delegate (DispatchDelegate) for subgraph 0

Why it is subtle: The error message says "unresolved custom op." Your instinct is to look at the model file, or the op registration code, or the TFLite runtime version. None of those are the problem. The problem is a silent dlopen failure caused by an ABI mismatch in a native library you probably copied from a place that seemed authoritative (the device itself).
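Since the failure is silent, one cheap guard is to compare the bundled libraries against the byte sizes listed above before attempting init. This is a sketch under two assumptions: that the libraries are extracted to disk (so they appear under applicationInfo.nativeLibraryDir), and that file size is a good enough fingerprint for "is this the sample's QAIRT 2.42 build". The names expectedLibSizes and findLibMismatches are hypothetical.

```kotlin
import java.io.File

// Byte sizes of the six libraries from the official sample, as listed in this post.
val expectedLibSizes: Map<String, Long> = mapOf(
    "libLiteRtDispatch_Qualcomm.so" to 478_912L,
    "libGemmaModelConstraintProvider.so" to 20_092_072L,
    "libQnnHtp.so" to 2_778_176L,
    "libQnnHtpV79Skel.so" to 10_975_268L,
    "libQnnHtpV79Stub.so" to 679_168L,
    "libQnnSystem.so" to 2_983_560L,
)

// Returns one human-readable line per missing or wrong-sized library; empty means all match.
fun findLibMismatches(
    nativeLibDir: File,
    expected: Map<String, Long> = expectedLibSizes,
): List<String> =
    expected.mapNotNull { (name, size) ->
        val file = File(nativeLibDir, name)
        when {
            !file.isFile -> "$name: missing"
            file.length() != size -> "$name: ${file.length()} bytes, expected $size"
            else -> null
        }
    }
```

Logging the result at startup turns a vendor-partition copy that slipped back into jniLibs/ into a readable warning instead of an unresolved DISPATCH_OP.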

Failure 2: `pickFirsts` Masking Nothing

The investigation

After Failure 1, we suspected the pickFirsts block in app/build.gradle.kts was silently selecting wrong-version libs during the build. The pickFirsts directive tells AGP to resolve duplicate native library conflicts by picking the first one found. If the AAR shipped its own libQnnHtp.so and our jniLibs/ directory also had one, pickFirsts could be silently choosing the wrong copy.

We spent hours on this theory. It was wrong.

Why it was wrong

Inspecting the AAR at ~/.gradle/caches/.../litertlm-android-0.11.0-rc1.aar showed that it ships exactly three native libraries:

libLiteRt.so
libLiteRtClGlAccelerator.so
liblitertlm_jni.so

Zero overlap with our six QNN/dispatch libs. pickFirsts had nothing to deduplicate. It was a complete no-op.
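The same check can be scripted, since an AAR is a zip archive with native libraries under jni/<abi>/. A minimal sketch that runs on a desktop JVM; listNativeLibs and findOverlap are hypothetical helper names, and the archive path is whatever your Gradle cache resolves to.

```kotlin
import java.io.File
import java.util.zip.ZipFile

// List the .so file names an AAR (or APK) ships for one ABI.
fun listNativeLibs(archive: File, abi: String = "arm64-v8a"): List<String> =
    ZipFile(archive).use { zip ->
        zip.entries().asSequence()
            .map { it.name }
            .filter { it.endsWith(".so") && it.contains("/$abi/") }
            .map { it.substringAfterLast('/') }
            .sorted()
            .toList()
    }

// Names present in both sets; an empty result means pickFirsts has nothing to resolve.
fun findOverlap(aarLibs: List<String>, bundledLibs: List<String>): Set<String> =
    aarLibs.intersect(bundledLibs.toSet())
```

Running this against a dependency before theorizing about pickFirsts takes seconds and would have ruled the theory out immediately.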

What actually mattered

Since pickFirsts had nothing to deduplicate, leaving it in changes nothing. We kept the block as harmless defensive config in case a future dependency ever ships an overlapping lib.

The one real change in this area was removing libLiteRtCompilerPlugin_Qualcomm.so from jniLibs/: that library is for classical-model NPU JIT compilation, not LLM inference, and the official LLM sample does not bundle it.

Why I am including this: Because if you are debugging NPU init, you will probably go down this rabbit hole too. The pickFirsts block looks suspicious. It is not the problem. Save yourself the hours.

Failure 3: Hexagon DSP Cannot Find `libQnnHtpV79Skel.so`

This was the key fix. Understanding this failure requires understanding how the Hexagon DSP loads code.

The log

W apps_std_imp.c:1185: apps_std_fopen_with_env_fd failed with 0xd for /vendor/dsp/cdsp/./libQnnHtpV79Skel.so (No such file or directory)
E remote_handle_open_domain: dynamic loading failed for libQnnHtpV79Skel.so
E QnnDsp <E> Failed to find available PD for contextId 5 ... err: 1002
E litert: [qnn_manager.cc:556] Failed to create QNN context: 1002
E tflite : Failed to initialize kernel.

The architecture you need to understand

The Qualcomm AI Engine has a split-process architecture. Your Android app runs on the application processor (the ARM Cortex cores). But the actual NPU inference runs on the Hexagon DSP, a separate processor with its own firmware, its own address space, and its own filesystem view.

When the QNN HTP backend initializes, it needs to load a "skeleton" library (libQnnHtpV79Skel.so) onto the Hexagon DSP. This is done via FastRPC, Qualcomm's mechanism for remote procedure calls between the application processor and the DSP. The skeleton is the DSP-side implementation; the "stub" (libQnnHtpV79Stub.so) is the app-processor-side proxy.

Here is the critical detail: the DSP does not see your APK's lib/arm64-v8a/ directory. That directory lives in /data/app/, which is an app-processor filesystem path. The DSP loads the skeleton from a DSP-accessible path, and it searches a hardcoded fallback list:

/vendor/dsp/cdsp/
/vendor/lib/rfsa/adsp/
A handful of other vendor-specific paths

Plus whatever is in the ADSP_LIBRARY_PATH environment variable.

The Samsung-specific wrinkle

On the Samsung Galaxy S25 Ultra, the actual location of libQnnHtpV79Skel.so is:

/vendor/lib64/rfs/dsp/snap/libQnnHtpV79Skel.so

We found this by running:

adb shell find /vendor /system /odm -name 'libQnnHtpV79Skel.so'

The path /vendor/lib64/rfs/dsp/snap/ is not in FastRPC's hardcoded fallback list. This is a Samsung-specific vendor path. Without ADSP_LIBRARY_PATH pointing at it, the DSP will never find the skeleton, and QNN context creation fails with error 1002.
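The adb search can also be approximated from inside the app, which helps when a device in the field uses yet another vendor path. This is a sketch under a loud assumption: that the app process is allowed to stat files in the candidate directories, which vendor SELinux policy may not permit on every device. findSkelDirs and buildAdspLibraryPath are hypothetical helpers, and the ":" separator simply follows the convention used in this post.

```kotlin
import java.io.File

// Return the candidate directories that actually contain the skeleton library.
fun findSkelDirs(
    candidateDirs: List<String>,
    skelName: String = "libQnnHtpV79Skel.so",
): List<String> = candidateDirs.filter { File(it, skelName).isFile }

// App's own native lib dir first, then every directory where the skel was found.
fun buildAdspLibraryPath(nativeLibDir: String, skelDirs: List<String>): String =
    (listOf(nativeLibDir) + skelDirs).distinct().joinToString(":")
```

If findSkelDirs comes back empty on a device where the NPU is known to work, treat that as "could not look" rather than "not there" and keep a static fallback list.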

The timing trap

We were setting ADSP_LIBRARY_PATH, but in the wrong place. The code was inside InferenceEngine.initialize():

// WRONG: too late
fun initialize(...) {
    android.system.Os.setenv("ADSP_LIBRARY_PATH", paths, true)
    // ... then load engine
}

The problem: libQnnHtp.so reads the ADSP_LIBRARY_PATH environment variable once, when it is first dlopened. Our pre-init call System.loadLibrary("LiteRtDispatch_Qualcomm") was triggering that dlopen before we reached the setenv call. By the time we set the path, QnnHtp had already cached an empty value.

The fix

Create a custom Application subclass and set the environment variables in onCreate(), the earliest practical app hook that reliably runs before our library loading. (ContentProvider initialization and attachBaseContext() technically run earlier, but onCreate() is the simplest place that is still guaranteed to precede the first QNN dlopen.)

class RedactoApp : Application() {
    override fun onCreate() {
        super.onCreate()
        val nativeLibDir = applicationInfo.nativeLibraryDir
        val paths = listOf(
            nativeLibDir,
            "/vendor/lib64/rfs/dsp/snap",   // Samsung S25 Ultra V79 skel
            "/vendor/lib64/hw/audio",       // Samsung alternate
            "/vendor/dsp/cdsp",
            "/vendor/lib64",
            "/vendor/lib64/snap",
            "/system/lib64",
        ).joinToString(":")
        android.system.Os.setenv("ADSP_LIBRARY_PATH", paths, true)
        android.system.Os.setenv("LD_LIBRARY_PATH", paths, true)
    }
}

Wire it into AndroidManifest.xml:

<application android:name=".RedactoApp" ... >

After this fix, the DSP successfully finds the skeleton and creates a QNN Protection Domain.

Why it is subtle: Three things have to be correct simultaneously: (1) the path must include the Samsung-specific vendor directory, (2) the env var must be set before any QNN library is dlopened, because libQnnHtp.so reads it only once, and (3) the setenv call must live in a hook that runs before that first load, which in practice means Application.onCreate() rather than an Activity or a later initialization step. Get any one of these wrong and you get the same opaque "error 1002."

Failure 4: OpenGL `CreateSharedMemoryManager` Unimplemented on Android 16

This was the most deceptive failure. The NPU was working. We did not know the NPU was working.

The log

E delegate_opengl.cc:218: Failed to create DelegateKernelLiteRt: UNIMPLEMENTED: CreateSharedMemoryManager is not implemented.
=== Source Location Trace: ===
third_party/odml/litert/ml_drift/delegate/gpu_backend_opengl.cc:169
third_party/odml/litert/ml_drift/delegate/delegate_kernel.cc:337
third_party/odml/litert/ml_drift/delegate/delegate_kernel_litert.cc:167
E tflite : Failed to initialize kernel.

The root cause

Look at the source location trace carefully. This error is from gpu_backend_opengl.cc. It is a GPU error, not an NPU error.

LiteRT-LM's multimodal Gemma 4 model has sub-backends for different modalities. When you configure the engine, you specify:

val config = EngineConfig(
    backend = Backend.NPU(nativeLibraryDir = nativeLibDir),
    visionBackend = Backend.GPU(),   // the dev guide example uses this
    audioBackend = Backend.CPU(),
)

The developer guide example explicitly shows visionBackend = Backend.GPU(). The idea is that image preprocessing runs on the GPU while the language model runs on the NPU. Reasonable architecture.

On Android 16 (API 36), the GPU vision sub-backend tries to create an OpenGL shared memory manager and hits an unimplemented codepath in litert::ml_drift. The CreateSharedMemoryManager function is not yet implemented for Android 16's new graphics memory model. It throws.

And that throw kills the entire Engine.initialize() call. Not just the vision sub-backend. The whole engine. Even though the NPU backend had already successfully registered DispatchDelegate on multiple subgraphs and created QNN contexts totaling roughly 1.3 GB.

We saw Failed to initialize kernel and assumed the NPU was failing. We spent an entire day re-investigating Failures 1-3, thinking we had regressed. We had not. The NPU was fine. A completely unrelated GPU sub-backend was killing the process.

The fix

Set all sub-backends to CPU:

PreferredBackend.NPU -> {
    val config = EngineConfig(
        modelPath = modelPath,
        backend = Backend.NPU(nativeLibraryDir = nativeLibDir),
        visionBackend = Backend.CPU(),    // NOT Backend.GPU(): hits an unimplemented OpenGL path on Android 16
        audioBackend = Backend.CPU(),
        maxNumTokens = 4000,
        cacheDir = context.cacheDir.absolutePath,
    )
    engine = Engine(config).also { it.initialize() }
}

Our use case is text-only PII redaction. We do not run vision inference through the LiteRT-LM engine (OCR is handled separately by ML Kit). Setting visionBackend = Backend.CPU() costs us nothing and avoids the Android 16 OpenGL crash.

Why it is subtle: The error message says Failed to initialize kernel, the same message as every other failure. The only clue is the source location trace pointing at gpu_backend_opengl.cc, which you might dismiss as irrelevant if you think you are running on the NPU. You are running on the NPU. The NPU is not the part that failed.
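Because several failures end in the same "Failed to initialize kernel" line, it can help to triage on the lines that differ. The sketch below keys off substrings taken from the logs quoted in this post; the InitFailure enum and triageInitFailure are hypothetical, and the matching is deliberately naive.

```kotlin
enum class InitFailure { GPU_SUB_BACKEND, DSP_SKEL_NOT_FOUND, DISPATCH_LIB_NOT_LOADED, UNKNOWN }

// Check the most specific signatures first; the generic kernel failure line is ignored.
fun triageInitFailure(logcat: String): InitFailure = when {
    // Source location trace points at the OpenGL GPU path, not the NPU.
    "gpu_backend_opengl.cc" in logcat || "delegate_opengl.cc" in logcat ->
        InitFailure.GPU_SUB_BACKEND
    // FastRPC could not load the skeleton onto the DSP.
    "dynamic loading failed for libQnnHtpV79Skel.so" in logcat ->
        InitFailure.DSP_SKEL_NOT_FOUND
    // The dispatch library never registered its custom op.
    "unresolved custom op: DISPATCH_OP" in logcat ->
        InitFailure.DISPATCH_LIB_NOT_LOADED
    else -> InitFailure.UNKNOWN
}
```

Even a crude classifier like this would have separated the GPU sub-backend failure from the NPU ones on the first run.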

Failure 5: Constrained Decoding Error 12 on NPU

The symptom

This one surfaces at runtime rather than init, but it blocks NPU usage just as effectively: the runtime reports that constrained decoding is not supported on NPU, with error 12.

The root cause

In the code, we had written:

ExperimentalFlags.enableConversationConstrainedDecoding = isNpu

The intent was: "NPU is our best backend, so let's enable advanced features on it." The logic was backwards. Constrained decoding forces the model's output to conform to a schema (JSON, tool-call format, etc.) by constraining the sampling at each token. The NPU executor does not support this. When createConversation is called with the flag set and the backend is NPU, the runtime returns error 12: not supported.

The official gallery sample passes enableConversationConstrainedDecoding as an external opt-in parameter, not tied to the backend. Constrained decoding is for structured-output use cases (tool calls, JSON schemas). For free-form text generation, which is what PII redaction needs, it should be off.

The fix

ExperimentalFlags.enableConversationConstrainedDecoding = false

Force off for all backends. Additionally, samplerConfig = null is required for NPU (the NPU uses its own runtime-default sampler; passing a custom SamplerConfig is not supported).

Why it is subtle: The variable name says "constrained decoding," which sounds like a quality feature. The boolean was tied to isNpu, which sounds intentional. The error says "error 12," which is not self-explanatory. You need to know that constrained decoding is a sampling-time constraint, that it requires specific executor support, and that QNN HTP does not provide it.
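The two NPU rules from this failure can be captured in one small, testable function. This is a sketch in plain Kotlin: BackendKind, ConversationOptions, and optionsFor are hypothetical names for illustration, not LiteRT-LM types. Redacto itself forces the flag off everywhere; the opt-in parameter here mirrors how the gallery sample treats it.

```kotlin
// Hypothetical types for illustration; these are not part of LiteRT-LM.
enum class BackendKind { CPU, GPU, NPU }

data class ConversationOptions(
    val constrainedDecoding: Boolean,
    val useCustomSampler: Boolean,
)

// Constrained decoding is an explicit opt-in for structured output and is never
// derived from the backend. On NPU it is always off, and the custom sampler is
// dropped so the runtime-default sampler is used.
fun optionsFor(backend: BackendKind, wantsStructuredOutput: Boolean): ConversationOptions =
    ConversationOptions(
        constrainedDecoding = wantsStructuredOutput && backend != BackendKind.NPU,
        useCustomSampler = backend != BackendKind.NPU,
    )

fun main() {
    println(optionsFor(BackendKind.NPU, wantsStructuredOutput = false))
    println(optionsFor(BackendKind.GPU, wantsStructuredOutput = true))
}
```

The result would then drive the real flag and the samplerConfig argument, so the backend check lives in one place instead of being scattered across call sites.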

Failure 6: In-Process Re-Init (Known Constraint)

The log

After the NPU has successfully run in a process, switching to CPU or GPU and then back to NPU produces:

E QnnDsp: Failed to find available PD for contextId 5 ... err: 1002
E tflite : Encountered unresolved custom op: DISPATCH_OP.

The root cause

This is a QNN runtime constraint, not a bug in our code. When the QNN HTP backend initializes, it acquires a Protection Domain (PD) on the Hexagon DSP. A PD is an isolated execution environment on the DSP; think of it as a process-level sandbox. When you close the NPU Engine and create a non-NPU engine, the PD is released. But something in the QNN runtime's state, likely the FastRPC channel or the DSP driver's per-process tracking, is not fully cleaned up. When you try to acquire a new PD for the same process, the DSP refuses with error 1002.

The workaround

We did not fix this. It is not fixable from application code without restarting the process.

Our backend cascade handles it naturally: NPU is the default and is tried first on fresh launch. If it works, great. If the user manually switches to CPU/GPU and later wants NPU back, they need to kill the app and relaunch. Android will restart the Activity automatically.

A future enhancement could automate this by calling Process.killProcess(Process.myPid()) when the user selects NPU after a non-NPU session. We have not shipped that yet.

Why it matters: If you are building a backend-selection UI (as we were), you need to know that NPU is a "first or never" choice within a process lifetime. Design your UX accordingly.
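A backend picker can encode the "first or never" rule so the UI can grey out NPU instead of letting the user hit error 1002. The sketch below takes the rule literally (NPU is only selectable as the first backend in a process, or while it is still the active one). ProcessBackendTracker and BackendChoice are hypothetical names, not part of LiteRT-LM or QNN.

```kotlin
// Hypothetical helper for a backend-selection UI.
enum class BackendChoice { CPU, GPU, NPU }

class ProcessBackendTracker {
    // Backend currently active in this process; null before the first init.
    private var current: BackendChoice? = null

    // NPU is allowed only as the first backend, or if it is still active.
    // CPU and GPU are always allowed.
    fun canSelect(choice: BackendChoice): Boolean =
        choice != BackendChoice.NPU || current == null || current == BackendChoice.NPU

    // Records the switch. Returns false when the switch needs a process restart.
    fun select(choice: BackendChoice): Boolean {
        if (!canSelect(choice)) return false
        current = choice
        return true
    }
}

fun main() {
    val tracker = ProcessBackendTracker()
    println(tracker.select(BackendChoice.NPU)) // true
    println(tracker.select(BackendChoice.GPU)) // true
    println(tracker.select(BackendChoice.NPU)) // false
}
```

When select returns false, the app can explain that a relaunch is required rather than attempting an init that is known to fail.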

What Successful NPU Init Looks Like

After all six issues are resolved, here is the log sequence for a clean NPU init. Annotated for clarity:

15:54:27.482  RedactoApp: Pre-init ADSP_LIBRARY_PATH=...
                             ^^ Application.onCreate() seeds the env var

15:54:27.664  InferenceEngine: Attempting NPU backend (SDK_INT=36, Android 16)

15:54:27.681  litert: [qnn_manager.cc:401] Adding shared library dir to path
15:54:27.690  litert: [qnn_manager.cc:125] Loading qnn shared library from "libQnnHtp.so"
15:54:27.691  litert: [qnn_manager.cc:134] Loaded qnn shared library
                             ^^ dlopen succeeds, DISPATCH_OP registered

15:54:27.788  tflite: Replacing 1/1 nodes with delegate (DispatchDelegate) for subgraph 0
15:54:28.361  tflite: DispatchDelegate for subgraph 1
15:54:28.740  tflite: DispatchDelegate for subgraph 1 (decoder)
15:54:28.744  tflite: DispatchDelegate for subgraph 4
                             ^^ Four subgraphs delegated to Hexagon V79

15:54:30.171  InferenceEngine: NPU init succeeded
                             ^^ Total init: ~2.5 seconds

The total wall-clock time from Engine(config) to initialize() returning is approximately 2.5 seconds. Most of that is the QNN backend loading context binaries (roughly 1.3 GB of pre-compiled Hexagon instructions) into DSP memory.
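The ~2.5 second figure can be checked directly from the two InferenceEngine timestamps in the log. A minimal sketch, assuming logcat-style HH:mm:ss.SSS timestamps from the same day; elapsedMillis is a hypothetical helper name.

```kotlin
import java.time.Duration
import java.time.LocalTime

// Elapsed milliseconds between two HH:mm:ss.SSS timestamps on the same day.
fun elapsedMillis(start: String, end: String): Long =
    Duration.between(LocalTime.parse(start), LocalTime.parse(end)).toMillis()

fun main() {
    // "Attempting NPU backend" to "NPU init succeeded" in the log above.
    println(elapsedMillis("15:54:27.664", "15:54:30.171")) // 2507
}
```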

A caveat on these figures: the 2.5-second init and the context size come from a single directional session on one Galaxy S25 Ultra in May 2026. I did not save the raw logs and no longer have the device, so treat them as a snapshot of what a clean init looked like, not as rigorous, multi-run benchmarks.

Things That Did NOT Turn Out to Be the Problem

This section is for the next person who would otherwise invest time investigating these. We already did. They are not the issue.

Maven version (0.11.0-rc1). We initially suspected this was a pre-release bug. The exact same version is used by the official Qualcomm sample, which works. There is no newer published version on Maven Central or Google's Maven repository. Building from source via Bazel (as suggested in NPU_ISSUE_REPORT.md) is unnecessary.

useLegacyPackaging = true. This is required and should be kept. LiteRT's dispatch lookup does a filesystem readdir on applicationInfo.nativeLibraryDir. Without legacy packaging, that directory is empty because the native libraries are not extracted at install time; they stay inside the APK and are loaded directly from it. This is intended behavior, not a bug, but it looks suspicious when you are debugging library loading issues.
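For reference, this is roughly where that setting lives in a Kotlin DSL build file. It is a sketch that assumes a module-level build.gradle.kts and Android Gradle Plugin 8 or newer, where the block is named packaging (older plugin versions call it packagingOptions); it is not copied from Redacto's build.

```kotlin
// Module-level build.gradle.kts (file location is an assumption).
android {
    packaging {
        jniLibs {
            // Extract native libraries into nativeLibraryDir at install time
            // instead of loading them directly from inside the APK.
            useLegacyPackaging = true
        }
    }
}
```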

Bundling libQnnHtpV79Skel.so in jniLibs/. Helpful but not sufficient on its own. The skeleton runs on the DSP and can only be loaded by FastRPC from paths listed in ADSP_LIBRARY_PATH. We bundle our copy AND point the path at the Samsung vendor location, for redundancy.

NPU model file naming. The filenames (gemma4_npu.litertlm vs gemma4.litertlm) do not affect init. What matters is that the NPU variant uses the NPU-compiled model (which contains DISPATCH_OP nodes) and the CPU/GPU variants use the generic model (which contains standard TFLite ops). The actual filenames are arbitrary as long as the variant-to-file mapping is consistent.

The Debugging Stack in Retrospect

Looking back at all those rebuilds, these six failures form a stack. Each one was only visible after the previous one was fixed. Failure 1 (dlopen mismatch) masked Failure 3 (DSP path), which masked Failure 4 (OpenGL sub-backend), which masked Failure 5 (constrained decoding). Failure 2 was a red herring that cost hours. Failure 6 is a permanent constraint we designed around.

The experience taught us something about debugging opaque vendor stacks: the error message you see is almost never the error you have. Encountered unresolved custom op: DISPATCH_OP can mean the dispatch lib is not loaded (Failure 1), the DSP cannot find the skeleton (Failure 3), or the QNN PD cannot be re-acquired (Failure 6). Failed to initialize kernel can mean the NPU backend failed (Failures 1, 3) or that a completely unrelated GPU sub-backend failed (Failure 4).
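One way to keep that lesson at hand is a lookup from log message to every failure it can stand for. The sketch below only restates the mapping from this post; candidateCauses and causesFor are hypothetical names.

```kotlin
// Candidate root causes per log message, taken from the failures in this post.
val candidateCauses: Map<String, List<String>> = mapOf(
    "Encountered unresolved custom op: DISPATCH_OP" to listOf(
        "Failure 1: dispatch library not loaded",
        "Failure 3: DSP cannot find the skeleton",
        "Failure 6: QNN PD cannot be re-acquired in this process",
    ),
    "Failed to initialize kernel" to listOf(
        "Failures 1, 3: NPU backend failed",
        "Failure 4: unrelated GPU sub-backend failed",
    ),
)

// Substring match, because real log lines carry tags and source locations.
fun causesFor(logLine: String): List<String> =
    candidateCauses.filterKeys { logLine.contains(it) }.values.flatten()

fun main() {
    causesFor("E tflite : Encountered unresolved custom op: DISPATCH_OP.")
        .forEach(::println)
}
```

The point of returning a list rather than a single answer is that the message alone never narrows it down; the surrounding log lines do.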

The only reliable debugging strategy was: fix one layer, rebuild, and see what the next layer reveals.

If you are working on LiteRT-LM NPU enablement and you have hit one of these walls, I hope this saves you time. If you find a seventh failure mode, please document it. The next person will thank you.

Related in this series of "Edge AI from the Trenches"

What Is an NPU and Why Does It Exist?: foundational context on the hardware behind these failures
What Is a Delegate in LiteRT?: how delegates route operations to the NPU and why separate model files are required
What's Inside a .litertlm File?: why DISPATCH_OP custom ops exist and what they mean

Last updated: July 2026
12th of 23 posts in the "Edge AI from the Trenches" series

Prompt Engineering Beat My Fine-Tuned Model. Here Is Why.

Jaydeep Shah (JD) — Sun, 05 Jul 2026 00:00:00 +0000

You have a model that fits on a phone. It runs. It generates tokens. But the outputs are not quite right: maybe the format is wrong, or it misses edge cases, or it hallucinates on domain-specific content.

You have three levers to pull: prompt engineering, quantization, and fine-tuning. The question is not which one is "best." The question is which one to reach for first, and when to escalate. I learned this the hard way building Redacto, an on-device PII redaction app running Gemma 4 E2B via LiteRT-LM on a Snapdragon 8 Elite.

Here is the decision framework I wish I had before I started.

I started with prompts and they worked better than expected

Prompt engineering means crafting the instructions you send to the model: system prompts, examples, output format specifications. No weights change. This is the cheapest, fastest, and most underrated lever. You iterate in minutes, not hours. No training data. No GPU. No export pipeline. Just text.

There are three levels:

Zero-shot. A direct instruction with no examples. "Redact all personally identifiable information. Replace each PII entity with [CATEGORY_N]."

Few-shot. Include 2-3 worked examples so the model can pattern-match the desired behavior. Showing three examples of [NAME_1], [SSN_2], [PHONE_3] enforces the format far more reliably than instructions alone.

System personas. Frame the model's role. "You are a HIPAA compliance officer. Your task is to identify and redact all Protected Health Information." Personas anchor the model to a domain and reduce hallucination.

For small on-device models (1-4B parameters), there is a constraint cloud models do not have: context window. A 1024-token KV cache means your system prompt, few-shot examples, and user input must all fit. Long few-shot prompts that work on a 128K-context cloud model may not fit on-device. You have to be concise.

What I measured. Our standard Gemma 4 E2B model with prompt engineering alone scored 80.5% overall accuracy across 85 test cases and 5 domain modes. On HIPAA specifically, it hit 95.7% entity recall. That is a stock model from litert-community with a well-crafted system prompt. No training. No Colab. No export pipeline.

Quantization turned out to be mandatory, not optional

Quantization reduces the numerical precision of model weights, converting 32-bit floating point values to 4-bit or 8-bit integers. For on-device deployment, this is not a choice. A 2B parameter model at FP32 requires roughly 8 GB of memory. Your phone does not have 8 GB of free RAM for a single model.

The precision levels that matter on-device:

INT8 (8-bit integer): quarters the memory. Slight accuracy loss. Common for server-side optimization.
INT4 (4-bit integer): one-eighth the memory. This is the sweet spot for on-device LLMs. Gemma 4 E2B at INT4 fits in approximately 2.59 GB.
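The memory arithmetic behind those ratios is short enough to write down. This is a back-of-envelope sketch for raw weight storage only; weightGigabytes is a hypothetical helper, and it ignores file metadata, the KV cache, and runtime buffers.

```kotlin
// Raw weight storage for a parameter count and bit width, in decimal GB.
fun weightGigabytes(parameters: Long, bitsPerWeight: Int): Double =
    parameters * bitsPerWeight / 8.0 / 1_000_000_000.0

fun main() {
    val params = 2_000_000_000L
    for ((name, bits) in listOf("FP32" to 32, "INT8" to 8, "INT4" to 4)) {
        println("$name: ${weightGigabytes(params, bits)} GB")
    }
}
```

For 2 billion parameters this gives 8, 2, and 1 GB. The shipped Gemma 4 E2B artifact at 2.59 GB is larger than the INT4 figure, so treat the result as a floor for the weights alone, not a file-size prediction.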

The critical thing I learned: quantization is a compile-time decision, not a runtime decision. When you export to .litertlm, you specify the quantization scheme (e.g., dynamic_wi4_afp32 for INT4 weights with FP32 activations). That choice is baked into the artifact. You cannot re-quantize a .litertlm file. If you want to try INT8 instead of INT4, you re-export from the source weights.

Quantization does degrade accuracy, particularly on edge cases. Research from GPTQ (Frantar et al., 2022) and AWQ (Lin et al., 2023) shows that careful quantization strategies minimize this degradation, but it is never zero.

What I measured. Our standard model uses INT4 quantization at 2.59 GB. The fine-tuned model, exported with a different quantization granularity, came in at 4.7 GB. Same architecture, same parameter count, nearly double the size because of an export pipeline difference. The larger model drew roughly 3x the current (301 mA vs 101 mA) and ran 1.9x slower (10,626 ms vs 5,693 ms average latency). Quantization strategy is not just about accuracy. It directly determines whether your model is practical to deploy.

I fine-tuned last, and the results surprised me

Fine-tuning modifies the model's weights to encode new behavior. But modern fine-tuning uses parameter-efficient methods that train a tiny fraction of the weights.

LoRA (Low-Rank Adaptation) decomposes weight updates into two small matrices instead of updating the full weight matrix. A layer with a 2048 x 2048 weight matrix would require learning 4 million values for a full update. LoRA with rank 8 (Hu et al., 2021) learns two matrices of 2048 x 8 and 8 x 2048: only 32,768 values.
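The parameter counts in that paragraph follow from the matrix shapes. A minimal sketch of the arithmetic; the function names are hypothetical.

```kotlin
// Values learned by a full update of a dOut x dIn weight matrix.
fun fullUpdateParams(dOut: Int, dIn: Int): Long = dOut.toLong() * dIn

// Values learned by a rank-r LoRA update: one dOut x r matrix plus one r x dIn matrix.
fun loraParams(dOut: Int, dIn: Int, rank: Int): Long = rank.toLong() * (dOut + dIn)

fun main() {
    println(fullUpdateParams(2048, 2048)) // 4194304
    println(loraParams(2048, 2048, 8))    // 32768
}
```

At rank 8 the adapter for this layer is 1/128 the size of a full update, which is why the trainable share of the whole model ends up so small.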

QLoRA takes this further: it loads the base model in 4-bit quantized form during training, then applies LoRA adapters on top. You can fine-tune a multi-billion parameter model on a single consumer GPU. Dettmers et al. (2023) demonstrated that 4-bit quantized models with LoRA adapters can match the performance of full 16-bit fine-tuning.

When to fine-tune: when prompt engineering provably cannot achieve your accuracy or format requirements. Specifically: the model needs domain-specific output formats that few-shot examples cannot enforce, or domain knowledge not in its training data, or subtle distinctions that instructions alone cannot capture.

When not to fine-tune: before trying prompt engineering (always try prompts first), when your training data does not match your deployment task (the trap I fell into), or when you need NPU deployment. As of mid-2026, there is no public compilation toolchain for fine-tuned models to NPU targets like Qualcomm's Hexagon. You can fine-tune and deploy to GPU, but NPU compilation remains blocked for custom models.

The table that changed my thinking

Overall: Standard (prompt engineering) vs Fine-tuned:

Metric	Standard Model	Fine-tuned Model	Delta
Overall Score	80.5%	70.3%	-10.2 pts
Entity Recall	79.3%	71.7%	-7.6 pts
Format Score	79.8%	71.7%	-8.1 pts
Preservation	83.7%	65.9%	-17.8 pts
Avg Throughput	12.8 tok/s	9.0 tok/s	-30% (relative)

These scores come from an earlier single-pass evaluation (85 entries, GPU, an older LiteRT build) - not the shipped 3-step pipeline benchmarked elsewhere in this series. Standard and fine-tuned were scored in the same run, so the head-to-head is fair; just do not compare these absolute numbers to the latency figures from the 3-step pipeline.

The fine-tuned model lost. Overall. But look at the per-domain breakdown:

Per-mode entity recall:

Domain Mode	Standard	Fine-tuned	Winner
FIELD_SERVICE	82.1%	95.3%	Fine-tuned (+13.2 pts)
FINANCIAL	83.8%	85.5%	Fine-tuned (+1.7 pts)
HIPAA	95.7%	39.9%	Standard (+55.8 pts)
JOURNALISM	71.1%	61.3%	Standard (+9.8 pts)
TACTICAL	63.7%	76.8%	Fine-tuned (+13.1 pts)

The fine-tuned model crushed FIELD_SERVICE and TACTICAL, the domains where its training data (ai4privacy/pii-masking-400k) happened to include relevant patterns. It catastrophically failed on HIPAA, where the standard model's carefully crafted system prompt was already handling relational PHI ("the patient's daughter Lisa") that the fine-tuning data never covered.
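The per-mode table is worth re-deriving, because it shows how one collapse can outweigh several wins. The sketch below uses only the recall figures from the table; DomainRecall and margin are hypothetical names.

```kotlin
import kotlin.math.abs

data class DomainRecall(val mode: String, val standard: Double, val fineTuned: Double)

// Entity recall per mode, copied from the table above.
val recall = listOf(
    DomainRecall("FIELD_SERVICE", 82.1, 95.3),
    DomainRecall("FINANCIAL", 83.8, 85.5),
    DomainRecall("HIPAA", 95.7, 39.9),
    DomainRecall("JOURNALISM", 71.1, 61.3),
    DomainRecall("TACTICAL", 63.7, 76.8),
)

// Positive margin means the fine-tuned model had higher recall in that mode.
fun margin(row: DomainRecall): Double = row.fineTuned - row.standard

fun main() {
    for (row in recall) {
        val winner = if (margin(row) > 0) "Fine-tuned" else "Standard"
        println("${row.mode}: $winner by ${"%.1f".format(abs(margin(row)))} points")
    }
    println("Fine-tuned wins ${recall.count { margin(it) > 0 }} of ${recall.size} modes")
}
```

The fine-tuned model wins three of five modes and still loses on average recall, because the HIPAA drop is larger than all of its gains combined.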

The lesson is not that fine-tuning is bad - it is that our fine-tune never got a fair shot. This was a one-day, one-epoch run on 3,000 examples sampled from a 400,000-example dataset: under 1% of the data available. And that slice was misaligned too, using the generic [REDACTED] format rather than Redacto's structured [CATEGORY_N]. So the model was fed too little, too briefly, in the wrong format. Fine-tuning is only as good as the quantity and alignment of what you train on, and we gave it little of either. With the full 400k and format-matched labels, the result could well have flipped - we simply did not test that. What we actually proved is narrower and more useful: a good prompt beat an under-resourced fine-tune, so before committing to a serious fine-tuning effort, prompting got us further, faster.

QLoRA training details: 2,850,816 trainable parameters (0.06% of total), LoRA rank 8, alpha 16, 3,000 training examples (a sub-1% slice of ai4privacy/pii-masking-400k), 1 epoch, 217 seconds training time. A deliberately quick attempt - the compute was trivial, and we never scaled the data or aligned the label format. That, not any limit of fine-tuning itself, is what these numbers reflect.

The decision framework I now follow

One clarification before the list: quantization is not a quality lever in the same sense as the other two. It is a deployment prerequisite - you quantize because the model physically will not fit otherwise, not because you are chasing accuracy. So think of it as step zero, and then reach for the quality levers in order.

Quantize because you must. This is not negotiable for on-device: choose INT4 for models under 4B parameters. It happens at export time, and it is the price of admission, not a tuning knob.
Prompt engineer first (among the quality levers). Iterate on system prompts, few-shot examples, and output format instructions. This is free, fast, and often sufficient.
Fine-tune last. Only when prompt engineering provably cannot achieve your requirements, AND you have enough well-aligned training data that matches your deployment task.
Know the blockers. Fine-tuned models currently cannot be compiled for NPU targets - the exact wall we hit in the delegates post, where our fine-tune could only reach GPU. You are limited to GPU/CPU backends, and in our benchmarks the NPU decodes roughly 1.7x faster than the GPU.

What I took away from this

In the cloud ML world, fine-tuning is often the first thing teams reach for. In our experience building Redacto, it should be the last lever you pull on-device.

The reason is practical: on-device deployment adds an export/compilation step that amplifies every upstream decision. A mismatched chat template breaks the model entirely. A different quantization granularity doubles the model size. Training data that does not match your output format tanks your scores even when the model "knows" the right answer.

Prompt engineering has none of these risks. You change text, you test, you iterate. The model binary stays the same.

Start with prompts. Quantize because you must. Fine-tune only when you have proven, with data, that prompts are not enough, and only when your training data matches your deployment task exactly.

Related in this series of "Edge AI from the Trenches"

This post pulls together threads from across the foundations series:

FP32, INT4, and Everything Between - the quantization lever, in depth
I Opened a .litertlm File - why quantization is a compile-time decision baked into the artifact
The Invisible Layer Between My Prompt and the Model - the chat-template format trap that hobbled our fine-tune
One Model, Three Chips, Two Files - why a fine-tuned model can only reach GPU, not the NPU
What On-Device AI Benchmarks Actually Feel Like - the metrics that quantization and backend choice drive

Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference, bringing models from the cloud to phones and edge hardware. Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.

Last updated: July 2026
11th of 22 posts in the "Edge AI from the Trenches" series

What On-Device AI Benchmarks Actually Feel Like

Jaydeep Shah (JD) — Sat, 04 Jul 2026 22:15:00 +0000

You run a benchmark. The report says "41.7 tok/s, TTFT 92ms." Is that good? Is that fast? What does the user actually feel?

Most benchmark writeups treat these numbers as abstract scores. But when you are building an app that runs an LLM on a phone, these numbers are the user experience. They determine whether your app feels instant or sluggish, whether text flows naturally or stutters, and whether your prompt even fits in memory.

Here is what I learned about the four metrics that matter most, using real numbers from Redacto, our on-device PII redaction app running Gemma 4 E2B on a Snapdragon 8 Elite.

The first surprise: TTFT matters more than total speed

Time to First Token (TTFT) measures the delay between when inference starts and when the first output token appears. In practice, this is how long the user stares at a blank screen before anything happens.

TTFT covers the "prefill" phase: the model reads the entire input prompt before generating the first output token. Longer prompts mean more prefill work, which means higher TTFT.

How we measure it: Timestamp of the first onMessage callback minus the inference start time. Direct measurement, no estimation.

Here is what TTFT looks like in Redacto, averaged across 30 entries:

Step	GPU TTFT	NPU TTFT	NPU Advantage
Step 1 (Classify)	381ms	99ms	3.8x faster
Step 2 (Detect)	375ms	104ms	3.6x faster
Step 3 (Redact)	366ms	92ms	4.0x faster

The NPU delivers first tokens at or just around 100ms in every step (92-104ms). At 92ms, the response feels instantaneous. Research on human perception shows that delays under 100ms are perceived as immediate, with no conscious awareness of waiting (Jakob Nielsen, "Response Times: The 3 Important Limits," Nielsen Norman Group, 1993; based on Miller 1968 and Card et al. 1991).

The GPU's 366ms is not terrible, but it crosses into noticeable territory. In my experience with streaming UX, the practical thresholds are roughly: under 100ms feels instant, 100-300ms feels responsive, and above 300ms feels like waiting. (The 100ms threshold comes from Nielsen's framework; the 300ms boundary is my own observation from testing Redacto, not from that research.)
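Those thresholds are easy to turn into a check you can run against benchmark output. A small sketch; Feel and ttftFeel are hypothetical names, and the 300ms boundary is my own observation as noted above, not a research figure.

```kotlin
enum class Feel { INSTANT, RESPONSIVE, WAITING }

// Under 100 ms: instant. 100-300 ms: responsive. Above 300 ms: waiting.
fun ttftFeel(ttftMs: Long): Feel = when {
    ttftMs < 100 -> Feel.INSTANT
    ttftMs <= 300 -> Feel.RESPONSIVE
    else -> Feel.WAITING
}

fun main() {
    for (ms in listOf(92L, 104L, 366L)) {
        println("$ms ms -> ${ttftFeel(ms)}")
    }
}
```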

Why this matters more than total latency: Users judge responsiveness at the moment something first appears on screen. A system that shows a first token at 92ms and then takes 4 seconds to finish feels faster than one that waits 366ms and finishes in 3.5 seconds. Streaming output transforms a wait into a reading experience. TTFT is the boundary between "waiting" and "reading."

What ~42 tok/s actually feels like

Tokens per second (tok/s) measures decode speed: how quickly the model generates output tokens after the first one. This is the speed at which text streams onto the screen.

How we measure it: (tokenCount - 1) * 1000 / (lastTokenTime - firstTokenTime), where times are in milliseconds. The minus-one matters: we measure the intervals between tokens, not counting the first token (which is covered by TTFT).
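Both measurements fit in one small recorder. This is a sketch rather than Redacto's actual instrumentation: StreamTimer is a hypothetical class, and it assumes the streaming callback fires once per token and that the caller supplies millisecond timestamps.

```kotlin
// Collects token arrival times for one streamed response and derives
// TTFT and decode speed as defined above. All times are in milliseconds.
class StreamTimer(private val startMs: Long) {
    private val tokenTimes = mutableListOf<Long>()

    // Call once per streamed token, for example from the streaming callback.
    fun onToken(nowMs: Long) {
        tokenTimes += nowMs
    }

    // First token time minus inference start; null until a token arrives.
    fun ttftMs(): Long? = tokenTimes.firstOrNull()?.let { it - startMs }

    // (tokenCount - 1) * 1000 / (lastTokenTime - firstTokenTime);
    // null with fewer than two tokens or no elapsed time between them.
    fun tokensPerSecond(): Double? {
        if (tokenTimes.size < 2) return null
        val span = tokenTimes.last() - tokenTimes.first()
        return if (span > 0) (tokenTimes.size - 1) * 1000.0 / span else null
    }
}

fun main() {
    val timer = StreamTimer(startMs = 0)
    listOf(100L, 125L, 150L, 175L, 200L).forEach(timer::onToken)
    println(timer.ttftMs())          // 100
    println(timer.tokensPerSecond()) // 40.0
}
```

Five tokens over a 100ms span is four intervals, hence 40 tok/s rather than 50; that is the minus-one in the formula.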

Across Redacto's three pipeline steps, decode speed held roughly constant on each backend:

GPU: ~25 tok/s (24.5 to 25.3 across the three steps)
NPU: ~42 tok/s (essentially flat, step to step)

That flatness surprised me at first - should a longer step not decode slower? But it lines up with how decoding actually works. Generating each token means streaming the entire model's weights out of memory once; the arithmetic per token is tiny by comparison. So decode is memory-bandwidth-bound, and bandwidth is a fixed property of the chip, not of your prompt. The NPU's rate barely moves between steps because it is limited by how fast it can pull weights, which does not change. The GPU drifts a little more, most likely because it shares memory bandwidth with other work and is more exposed to thermal throttling.

But what do these numbers feel like? I needed a reference point: human reading speed. The average adult reads approximately 238 words per minute, roughly 4 words per second (Brysbaert, 2019, "How many words do we read per minute? A review and meta-analysis of reading rate," Journal of Memory and Language, Vol. 109). Since one token is approximately 0.75 words in English for typical BPE/SentencePiece tokenizers (a commonly cited approximation; exact ratios vary by tokenizer), 4 words per second is about 5.3 tokens per second.

That gives us a practical scale:

10 tok/s: About 7.5 words per second, 1.9x average reading speed. Noticeably slow. You can see individual words appearing.
25 tok/s (GPU): About 19 words per second, 4.7x reading speed. Comfortable. Text flows faster than you can read it.
~42 tok/s (NPU): About 31 words per second, 7.8x reading speed. Text appears almost instantly. No perceptible difference between 42 tok/s and "all at once."
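The scale above is two multiplications, sketched here so you can plug in your own decode rate. The constants are the approximations from the text (0.75 words per token, 4 words per second reading speed); the function names are hypothetical.

```kotlin
val wordsPerToken = 0.75        // common approximation, varies by tokenizer
val readingWordsPerSecond = 4.0 // 238 words per minute, rounded as in the text

fun wordsPerSecond(tokPerSec: Double): Double = tokPerSec * wordsPerToken

fun readingSpeedMultiple(tokPerSec: Double): Double =
    wordsPerSecond(tokPerSec) / readingWordsPerSecond

fun main() {
    for (rate in listOf(10.0, 25.0, 42.0)) {
        println("$rate tok/s -> ${wordsPerSecond(rate)} words/s, " +
            "${readingSpeedMultiple(rate)}x reading speed")
    }
}
```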

The practical takeaway: once you pass roughly 20 tok/s, further speed improvements do not change the user's experience much. Below 15 tok/s, the streaming effect starts to feel slow, and below 10 tok/s it becomes a distraction.

The constraint I did not expect: context window

Context window measures how many tokens the model can work with at once - and this is where the gap between "what the model can do" and "what you actually deployed" bit me.

Gemma 4 E2B is trained for a 128K-token context. On a phone, you do not get 128K. You choose a KV-cache size at export time, and that compiled value - not the model's theoretical maximum - is your real ceiling. Redacto's model was compiled with a 1,024-token KV cache (--cache_length 1024, with a 256-token prefill chunk).

Here is the part that caught me: our app config actually requested maxNumTokens = 4000. It did not matter. A runtime request cannot exceed what the model was compiled with - the 1,024-token cache baked into the .litertlm at export wins. (Same lesson as the sealed-file story: what you set at export time is what you get; the app only gets to ask.)

And 1,024 tokens is tight - roughly 768 words - shared across everything:

[System prompt] + [User input] + [Generated output] = must fit in 1,024 tokens

In Redacto's HIPAA detection step, the system prompt alone (PII category definitions, output format instructions, examples) eats a real chunk of that budget. Add a medium-length medical note as input plus the detection output, and a single call can use most of the window.

This has concrete consequences:

You cannot use the verbose system prompts that work fine with cloud models running 128K+ context windows.
Few-shot examples eat directly into space available for user content.
Long input documents may need to be truncated or chunked.
The model's output can be cut off mid-response if the total exceeds the cache.
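A budget check before each call makes these limits explicit. The sketch below assumes you already have token counts from your tokenizer; effectiveContext and outputBudget are hypothetical helpers, and the 300 and 450 token figures are made up for illustration.

```kotlin
// The runtime request cannot exceed the cache length compiled into the model file.
fun effectiveContext(requestedTokens: Int, compiledCacheTokens: Int): Int =
    minOf(requestedTokens, compiledCacheTokens)

// Tokens left for generated output after the system prompt and user input.
// A negative result means the input must be truncated or chunked.
fun outputBudget(contextTokens: Int, systemPromptTokens: Int, inputTokens: Int): Int =
    contextTokens - systemPromptTokens - inputTokens

fun main() {
    val context = effectiveContext(requestedTokens = 4000, compiledCacheTokens = 1024)
    println(context) // 1024
    println(outputBudget(context, systemPromptTokens = 300, inputTokens = 450)) // 274
}
```

If the remaining budget is smaller than the output you expect, that is the signal to shorten the prompt or split the input before the model cuts off mid-response.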

This is a big reason Redacto uses a multi-step pipeline (Classify, Detect, Redact) instead of a single "do everything at once" prompt. Each step gets a fresh 1,024-token budget with a focused system prompt.

(Note: Redacto's full pipeline has 4 steps: Classify, Detect, Redact, Validate. Our benchmark data covers only the first 3 steps. The validation step adds variable retry rounds that would make performance comparisons noisy, so it is excluded from the metrics in this post.)

What the model costs in RAM

Peak RSS (Resident Set Size) measures the maximum physical RAM consumed during inference. On a phone, this is critical because the AI model, the OS, other apps, and system services all compete for the same RAM pool.

Backend	Peak RSS
GPU	1,375 MB
NPU	1,934 MB

The NPU model uses approximately 560 MB more RAM. This is partly because the NPU model file is larger (3.02 GB vs 2.59 GB) and partly because the QNN runtime allocates additional execution buffers for the Hexagon DSP.

To put 1.9 GB in context: the Samsung Galaxy S25 Ultra has 12 GB of total RAM. Dedicating 1.9 GB to AI inference means roughly 16% of the device's total memory is consumed by your model alone. On lower-end devices with 6-8 GB, running a model this size while maintaining a responsive user experience becomes a genuine engineering challenge.
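The same percentage is worth computing for every RAM tier you plan to support. A rough sketch; ramSharePercent is a hypothetical helper, it treats 1 GB as 1,024 MB, and it compares against advertised total RAM, not the smaller amount actually available to apps.

```kotlin
// Share of total device RAM taken by a given peak RSS, as a percentage.
fun ramSharePercent(peakRssMb: Double, deviceRamGb: Double): Double =
    peakRssMb / (deviceRamGb * 1024.0) * 100.0

fun main() {
    val npuPeakRssMb = 1934.0 // NPU figure from the table above
    for (ramGb in listOf(6.0, 8.0, 12.0)) {
        println("${ramGb.toInt()} GB device: ${"%.1f".format(ramSharePercent(npuPeakRssMb, ramGb))}%")
    }
}
```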

The finding that surprised me: NPU wins perception, GPU wins total time

Here is the full picture from Redacto's 30-entry benchmark:

Metric	GPU	NPU	Winner
TTFT	366-381ms	92-104ms	NPU (3.6-4.0x)
Decode tok/s	~25	~42	NPU (1.6-1.7x)
Peak RSS	1,375 MB	1,934 MB	GPU (560 MB less)
Avg total latency	4,855ms	5,062ms	Roughly equal
Constrained decoding	Supported	Not supported	GPU
Avg total tokens	92	195	GPU (2.1x fewer)

A note on this data: these numbers come from a single benchmarking session - one Galaxy S25 Ultra, 30 prompts across five modes, in May 2026. I no longer have the device, and the raw per-entry logs were not preserved, so treat everything here as directional evidence from one real run, not a rigorous multi-run study. The patterns - NPU wins TTFT, decode is bandwidth-bound, the verbosity tradeoff erases NPU's speed lead - are the part I would expect to hold up; the exact digits are one session's snapshot.

The NPU is faster at everything the user directly perceives: time to first token and token generation speed. But it uses more RAM, and it supports neither constrained decoding nor a custom SamplerConfig (the sampler settings such as topK, topP, and temperature). Without those controls, the NPU generates more verbose output (195 avg tokens vs GPU's 92), which erases its per-token speed advantage in total wall-clock time.

This showed up dramatically in TACTICAL mode, where NPU averaged 14,201ms per entry compared to GPU's 5,430ms.
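A simple model shows why verbosity cancels the speed lead: total time is roughly TTFT plus the remaining tokens at the decode rate. This is a simplification that treats the table's averages as one streamed response, so it will not reproduce the measured averages; estimatedLatencyMs is a hypothetical helper.

```kotlin
// Rough wall-clock estimate for one streamed response:
// time to first token, then the remaining tokens at the decode rate.
fun estimatedLatencyMs(ttftMs: Double, outputTokens: Int, tokPerSec: Double): Double =
    ttftMs + (outputTokens - 1) * 1000.0 / tokPerSec

fun main() {
    val gpu = estimatedLatencyMs(ttftMs = 366.0, outputTokens = 92, tokPerSec = 25.0)
    val npu = estimatedLatencyMs(ttftMs = 92.0, outputTokens = 195, tokPerSec = 42.0)
    println("GPU ~${"%.0f".format(gpu)} ms, NPU ~${"%.0f".format(npu)} ms")
}
```

Even with a 4x better TTFT and a faster decode rate, generating about twice as many tokens leaves the NPU estimate above the GPU one, which matches the direction of the measured totals.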

The decision framework I arrived at:

Choose NPU when TTFT and perceived responsiveness matter most, your prompts produce short outputs, and you can afford the extra RAM.
Choose GPU when you need constrained decoding, predictable output length, lower memory footprint, or consistent total latency.
Choose CPU as a fallback when neither GPU nor NPU delegates are available on the target device.
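The three rules can be written as a single function, which is handy for unit-testing a backend picker. This is one possible reading of the framework, not Redacto's shipped logic; Target, Requirements, and chooseBackend are hypothetical names.

```kotlin
enum class Target { NPU, GPU, CPU }

data class Requirements(
    val needsConstrainedDecoding: Boolean,
    val shortOutputs: Boolean,
    val canAffordExtraRam: Boolean,
)

// NPU only when it is available and every NPU condition holds; otherwise GPU
// if present; CPU as the last resort. Note this also falls back to CPU when
// the NPU exists but does not suit the workload and there is no GPU delegate.
fun chooseBackend(req: Requirements, npuAvailable: Boolean, gpuAvailable: Boolean): Target = when {
    npuAvailable && !req.needsConstrainedDecoding && req.shortOutputs && req.canAffordExtraRam ->
        Target.NPU
    gpuAvailable -> Target.GPU
    else -> Target.CPU
}

fun main() {
    val chatty = Requirements(needsConstrainedDecoding = false, shortOutputs = false, canAffordExtraRam = true)
    println(chooseBackend(chatty, npuAvailable = true, gpuAvailable = true)) // GPU
}
```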

What I take away

Every metric maps to something the user feels:

TTFT is the pause before the response starts. Under 100ms feels instant; try to stay below 300ms.
tok/s is the speed of the streaming text. Anything above 20 is comfortable.
Context window is how much your prompt can say. Budget it carefully.
Peak RSS is what your app costs in RAM. Know the ceiling for your target devices.

The next time you see a benchmark table, do not just look for the biggest number. Ask what each number means for the person holding the phone.

Related in this series of "Edge AI from the Trenches"

Why My LLM Runs 4x Faster on Hardware I Had Never Heard Of - the hardware that drives the TTFT and tok/s differences measured here
One Model, Three Chips, Two Files: How LiteRT Delegates Really Work - how LiteRT selects the backend that determines your performance numbers
Benchmarking On-Device LLMs: What You Can and Cannot Measure - the methodology and honesty principles behind these metrics

Sources:

Redacto benchmark data - Samsung Galaxy S25 Ultra, Snapdragon 8 Elite
Brysbaert (2019), "How many words do we read per minute?" - Journal of Memory and Language, Vol. 109
Nielsen (1993), "Response Times: The 3 Important Limits" - Nielsen Norman Group

Last updated: July 2026
10th of 22 posts in the "Edge AI from the Trenches" series