Jaydeep Shah (JD)

Posted on May 18

Why My LLM Runs 4x Faster on Hardware I Had Never Heard Of

#edgeai #android #litertlm

I have worked on inference-specific silicon for servers, so the idea of purpose-built AI hardware was familiar. What I had not paid attention to was that mobile phones already ship dedicated AI accelerators - and that switching to one could cut time-to-first-token by 4x on the same model and weights.

Here is what I learned about this chip, why it exists, and the real tradeoffs of using it - with numbers from Redacto, the on-device PII redaction app my team built with Gemma 4 E2B on Snapdragon 8 Elite.

The CPU does everything, but not this

The CPU handles app logic, OS scheduling, networking, and file I/O. It is a general-purpose processor optimized for doing one thing at a time, very fast, with deep pipelines and sophisticated branch prediction. Modern mobile CPUs like the Qualcomm Oryon cores in the Snapdragon 8 Elite have a few high-performance cores and several efficiency cores. They excel at sequential, branching logic.

What they are bad at: doing the same operation on thousands of data elements simultaneously. A CPU doing matrix multiplication is like using a Swiss Army knife to chop vegetables - it works, but it was not designed for that job.
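To make "the same operation on thousands of data elements" concrete, here is a minimal pure-Python sketch of matrix multiplication. The function name `matmul_naive` is my own, for illustration only. Every inner step is the same multiply-accumulate (MAC), and no output cell depends on any other, which is exactly the kind of regular, branch-free work a sequential core gains little from.

```python
def matmul_naive(a, b):
    """Multiply two matrices (lists of rows) and count multiply-accumulates."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")

    out = [[0.0] * cols for _ in range(rows)]
    macs = 0
    for i in range(rows):
        for j in range(cols):
            # Each output cell is an independent dot product.
            acc = 0.0
            for k in range(inner):
                acc += a[i][k] * b[k][j]  # one MAC, repeated over and over
                macs += 1
            out[i][j] = acc
    return out, macs


product, macs = matmul_naive([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)  # [[19.0, 22.0], [43.0, 50.0]]
print(macs)     # 8
```

The MAC count grows as rows x inner x cols, so a single transformer layer on a real prompt runs into the millions of identical steps.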

The GPU gets closer, but costs too much power

GPUs were designed to render pixels. A 1080p display has over two million pixels, each needing color, lighting, and texture calculations every frame. To keep up, GPUs evolved hundreds to thousands of small cores that execute the same instruction across many data elements - a style of parallelism called SIMD (Single Instruction, Multiple Data).

Researchers realized that neural network training is also massively parallel math, mostly matrix multiplications. NVIDIA's CUDA made it possible to repurpose GPU hardware for this, and the deep learning revolution followed.

On mobile, GPUs like the Adreno 830 can run LLM inference. But mobile GPUs were designed for graphics rendering, not for the specific operation mix that inference demands. Their architecture is more general than inference requires, which means more power per AI operation than necessary.

The NPU: purpose-built silicon for inference

An NPU - Neural Processing Unit - is a processor designed from scratch for one job: running neural network inference as fast as possible, using as little power as possible.

Neural network inference is overwhelmingly a small set of operations repeated billions of times: matrix multiplication, convolutions, activation functions (ReLU, GELU), normalization (LayerNorm, RMSNorm), and softmax. An NPU has dedicated hardware for these operations. Instead of thousands of general-purpose ALUs, it uses systolic arrays or matrix-multiply accelerator blocks that work on a whole tile of a matrix at a time, with specialized data paths tuned for streaming weight matrices from memory to compute.
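As a rough software analogy for tile-based processing, the sketch below splits a matrix multiplication into fixed-size tiles and accumulates partial results tile by tile. It is not how any particular NPU is implemented. The name `matmul_tiled` and the tile size of 2 are assumptions for illustration; real hardware performs the inner block in parallel on a MAC array rather than in loops.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply matrices tile by tile; return the product and the tile-step count."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    tile_steps = 0

    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            for k0 in range(0, inner, tile):
                # This inner block stands in for the unit of work a
                # matrix-multiply block would take on as a whole.
                for i in range(i0, min(i0 + tile, rows)):
                    for j in range(j0, min(j0 + tile, cols)):
                        acc = out[i][j]  # keep accumulating across k-tiles
                        for k in range(k0, min(k0 + tile, inner)):
                            acc += a[i][k] * b[k][j]
                        out[i][j] = acc
                tile_steps += 1
    return out, tile_steps


product, steps = matmul_tiled([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])
print(product)  # [[58.0, 64.0], [139.0, 154.0]]
print(steps)    # 2
```

The result is identical to an untiled multiplication. What changes is the unit of scheduling: a handful of tile steps instead of many individual multiply-accumulates.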

The result: for inference operations, an NPU is designed to deliver more TOPS (Tera Operations Per Second) per watt than a GPU, which in turn delivers more TOPS per watt than a CPU.

Why this matters on a phone: battery life

On a server, power is cheap. On a phone, you have a battery between 4,000 and 5,500 mAh.
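To put those mAh figures into energy terms, here is a back-of-the-envelope conversion. The nominal cell voltage of 3.85 V is my assumption (a typical figure for phone batteries, not taken from any spec sheet), and the 5 W sustained draw in the last line is purely hypothetical, not something I measured.

```python
def battery_watt_hours(capacity_mah, nominal_volts=3.85):
    """Convert a battery capacity in mAh to watt-hours (assumed nominal voltage)."""
    return capacity_mah / 1000 * nominal_volts


def hours_at_sustained_draw(capacity_mah, draw_watts, nominal_volts=3.85):
    """Hours a full battery would last at a constant power draw."""
    return battery_watt_hours(capacity_mah, nominal_volts) / draw_watts


print(f"{battery_watt_hours(4000):.1f} Wh")  # 15.4 Wh
print(f"{battery_watt_hours(5500):.1f} Wh")  # 21.2 Wh

# Hypothetical 5 W whole-device draw, for scale only.
print(f"{hours_at_sustained_draw(5000, 5.0):.2f} h")  # 3.85 h
```

The whole energy budget is roughly 15 to 21 Wh, which is why every watt spent on inference matters.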

Running LLM inference on the GPU works, but draws more power than the NPU for the same computation. The Hexagon NPU in the Snapdragon 8 Elite is widely reported at 45 TOPS (INT8 precision) and is designed for sustained workloads at mobile power envelopes.

NPUs exist not because GPUs cannot do AI - they can - but because on a battery-powered device, doing AI efficiently is a hardware design problem, and the NPU is the hardware answer.

CPU vs GPU vs NPU - the comparison

	CPU	GPU	NPU
Designed for	Sequential logic, OS tasks	Parallel rendering, general compute	Neural network inference
Core architecture	Few heterogeneous cores	Hundreds of simple SIMD cores	Systolic arrays / MAC blocks
Power efficiency for AI	Low	Medium	High
Flexibility	Any workload	Graphics + general parallel	Neural network ops only
Programming model	C/C++, any language	CUDA, OpenCL, Vulkan, Metal	Vendor SDK (QNN, ANE, etc.)

What I measured with Redacto

Running Gemma 4 E2B through Redacto's PII redaction pipeline on Samsung Galaxy S25 Ultra:

Metric	GPU	NPU	Difference
Decode throughput	24.5 tok/s	41.7 tok/s	NPU 1.7x faster
Time to first token	366 ms	92 ms	NPU 4.0x faster
Peak memory (RSS)	1,375 MB	1,934 MB	NPU uses 560 MB more

The NPU is 1.7x faster on throughput and 4x faster on time-to-first-token. That TTFT gap - 366 ms vs 92 ms - is the difference between a noticeable pause and feeling instant.
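For anyone who wants to check the arithmetic, the ratios come straight from the table. The dictionary keys below are my own labels; the numbers are the measured values above.

```python
# Measured values from the Redacto table (Galaxy S25 Ultra).
gpu = {"decode_tok_s": 24.5, "ttft_ms": 366, "peak_rss_mb": 1375}
npu = {"decode_tok_s": 41.7, "ttft_ms": 92, "peak_rss_mb": 1934}

# Higher throughput is better, so NPU goes on top.
decode_speedup = npu["decode_tok_s"] / gpu["decode_tok_s"]
# Lower latency is better, so GPU goes on top.
ttft_speedup = gpu["ttft_ms"] / npu["ttft_ms"]
extra_rss_mb = npu["peak_rss_mb"] - gpu["peak_rss_mb"]

print(f"decode speedup: {decode_speedup:.1f}x")  # decode speedup: 1.7x
print(f"TTFT speedup:   {ttft_speedup:.1f}x")    # TTFT speedup:   4.0x
print(f"extra peak RSS: {extra_rss_mb} MB")      # extra peak RSS: 559 MB
```

The memory difference is 559 MB exactly, which the table rounds to 560 MB.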

Why 4x on TTFT but only 1.7x on throughput

This gap puzzled me until I understood what the model actually does in each phase.

Prefill (determines TTFT): When you send a prompt, the model processes all input tokens at once. If your prompt is 200 tokens, it multiplies a 200-by-dimension matrix against the weight matrices of every layer. This is a large matrix-times-matrix operation, and it is compute-bound - the bottleneck is raw math throughput. The NPU's systolic arrays are purpose-built for exactly this. The GPU can do it, but with general-purpose cores that carry overhead irrelevant to matrix multiplication.

Decode (determines tok/s): After the first token, each subsequent token is generated one at a time. Each step multiplies a single vector against all the weight matrices - a matrix-times-vector operation. The actual math per token is small. But you still need to load the entire model's weights from memory for every token you generate. This is memory-bandwidth-bound - and both NPU and GPU share the same LPDDR5X memory pool on the SoC. They are drinking from the same straw.

Phase	Operation	Bottleneck	NPU advantage
Prefill (TTFT)	Matrix x matrix	Compute	4x
Decode (tok/s)	Matrix x vector	Memory bandwidth	1.7x

The NPU dominates when the problem is "do more math faster." It helps less when the problem is "load weights from memory faster" - because both processors share the same memory bus.
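One way to see the difference is to count how much math each phase does per byte of weights it has to read. This is a simplified sketch: it considers a single weight matrix, ignores activation and KV-cache traffic, and assumes one byte per weight (INT8). The layer dimensions of 2048 are hypothetical and cancel out of the ratio anyway.

```python
def ops_per_weight_byte(tokens, d_in, d_out, bytes_per_weight=1.0):
    """Arithmetic intensity of one weight matrix: operations per byte of weights read."""
    ops = 2 * tokens * d_in * d_out  # one multiply + one add per weight per token
    weight_bytes = d_in * d_out * bytes_per_weight
    return ops / weight_bytes


# Hypothetical layer shape; only the token count changes between phases.
prefill = ops_per_weight_byte(tokens=200, d_in=2048, d_out=2048)
decode = ops_per_weight_byte(tokens=1, d_in=2048, d_out=2048)

print(prefill)  # 400.0
print(decode)   # 2.0
```

With a 200-token prompt, prefill reuses every weight 200 times once it has been loaded, so compute is the limit. Decode uses each weight once per token, so the memory bus is the limit no matter how fast the math units are.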

The tradeoff is memory: the NPU model file is larger (3.02 GB vs 2.59 GB) and the QNN runtime allocates additional buffers, adding about 560 MB of peak RAM usage (1,934 MB vs 1,375 MB RSS).

Where the NPU sits on the SoC

A mobile SoC (System on Chip) is a collection of specialized processors on a single die, sharing memory.

The Hexagon V79 supports INT4, INT8, and FP16 precision and shares the same memory pool as the CPU and GPU, but has its own compute fabric optimized for tensor operations.
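Those three precisions translate directly into how much memory the weights occupy. The sketch below computes raw weight storage per billion parameters. It ignores metadata, embeddings tables stored separately, and mixed-precision layouts, so it will not reproduce the size of any real model file, including ours.

```python
BITS_PER_WEIGHT = {"INT4": 4, "INT8": 8, "FP16": 16}


def weight_gigabytes(num_params, precision):
    """Raw weight storage in GB (10^9 bytes) for a given precision."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


for precision in ("INT4", "INT8", "FP16"):
    print(precision, weight_gigabytes(1_000_000_000, precision))
# INT4 0.5
# INT8 1.0
# FP16 2.0
```

Halving the bits halves both the storage and the bytes that must cross the shared memory bus for every generated token.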

NPUs are everywhere now

Every major mobile silicon vendor ships a dedicated neural network accelerator:

Apple Neural Engine: In every Apple chip since A11 (2017). The M4's ANE is rated at 38 TOPS.
Google Tensor: On-device AI Core in Pixel phones, derived from Google's cloud TPU architecture. Google does not publish a TOPS rating.
MediaTek APU: The Dimensity 9400 includes a dedicated APU. MediaTek rates it at 46 TOPS (INT8).

Dedicated AI silicon is now table stakes, not optional.

Why you cannot just "use the NPU"

You cannot take a PyTorch model, point it at the NPU, and run it. The barriers fall into two buckets: the hardware/software ecosystem and the model itself.

The hardware/software side

CPUs converged around a handful of well-defined instruction set architectures - x86 and ARM - with decades of toolchain maturity. GPUs followed a similar path: compute APIs such as Vulkan and OpenCL work across vendors, and Metal plays the same role on Apple hardware, so application code is abstracted away from the underlying silicon.

NPUs have no such convergence. Each vendor ships a proprietary architecture with its own instruction set, memory model, and SDK:

Qualcomm Hexagon uses the QNN SDK
Apple Neural Engine uses Core ML
MediaTek APU uses NeuroPilot
Samsung Exynos NPU uses ONE (On-device Neural Engine)

There is no cross-vendor NPU standard. A model compiled for Hexagon V79 will not run on V73, let alone on Apple's ANE or MediaTek's APU. At runtime, you need vendor-specific dispatch libraries - Qualcomm's libLiteRtDispatch_Qualcomm.so, for example - without which the NPU simply does not exist to your app.
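In practice this means an app needs a fallback path. The sketch below shows the shape of that decision in plain Python. It does not call LiteRT-LM or QNN: `NpuModel`, `pick_backend`, and the way the device NPU version and library list are passed in are all hypothetical, and only the dispatch library name and the two model file names come from Redacto.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class NpuModel:
    """Hypothetical record describing a compiled NPU artifact."""
    path: str
    target_npu: str  # NPU version the file was compiled for, e.g. "V79"


def pick_backend(
    device_npu: Optional[str],
    native_libs: Set[str],
    npu_model: Optional[NpuModel],
    gpu_model_path: str,
    dispatch_lib: str = "libLiteRtDispatch_Qualcomm.so",
) -> Tuple[str, str]:
    """Return (backend, model_path); use the NPU only if every precondition holds."""
    npu_usable = (
        npu_model is not None
        and device_npu == npu_model.target_npu  # compiled for this exact NPU
        and dispatch_lib in native_libs         # vendor dispatch library is present
    )
    if npu_usable:
        return "NPU", npu_model.path
    return "GPU", gpu_model_path


model = NpuModel(path="gemma4_npu.litertlm", target_npu="V79")
libs = {"libLiteRtDispatch_Qualcomm.so"}

print(pick_backend("V79", libs, model, "gemma4.litertlm"))   # ('NPU', 'gemma4_npu.litertlm')
print(pick_backend("V73", libs, model, "gemma4.litertlm"))   # ('GPU', 'gemma4.litertlm')
print(pick_backend("V79", set(), model, "gemma4.litertlm"))  # ('GPU', 'gemma4.litertlm')
```

A mismatch on either condition should fall back quietly rather than crash, since most devices in the wild will not match the one you compiled for.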

The model side

Even once you have the right toolchain, the model itself needs preparation:

Chip-specific compilation. The model must be compiled through the vendor SDK into a binary targeting that exact NPU version. This is not a generic export - it produces silicon-specific execution graphs.
Separate model files. Redacto ships two files: gemma4.litertlm (2.59 GB, GPU) and gemma4_npu.litertlm (3.02 GB, NPU). Same weights, different compiled backends.
Feature gaps. Sampling controls (topK, topP, temperature) are unsupported on NPU. Our NPU model generates up to 3.2x more tokens on the most verbose pipeline steps because we cannot constrain its output.
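The extra tokens matter because they eat into the speed advantage. Here is a rough estimate using a simple latency model, time = TTFT + output tokens / decode throughput, which is an assumption on my part and ignores everything else in the pipeline. The TTFT and throughput figures are the measured Redacto numbers; the 50-token response length is hypothetical, and 3.2x is the worst case rather than the typical one.

```python
def end_to_end_seconds(ttft_ms, decode_tok_s, output_tokens):
    """Simple latency model: time to first token plus steady-state decoding."""
    return ttft_ms / 1000 + output_tokens / decode_tok_s


gpu_tokens = 50                      # hypothetical response length
npu_tokens_worst = 3.2 * gpu_tokens  # worst-case verbosity on NPU (160 tokens)

gpu_time = end_to_end_seconds(366, 24.5, gpu_tokens)
npu_same_length = end_to_end_seconds(92, 41.7, gpu_tokens)
npu_worst_case = end_to_end_seconds(92, 41.7, npu_tokens_worst)

print(f"GPU, 50 tokens:  {gpu_time:.2f} s")         # GPU, 50 tokens:  2.41 s
print(f"NPU, 50 tokens:  {npu_same_length:.2f} s")  # NPU, 50 tokens:  1.29 s
print(f"NPU, 160 tokens: {npu_worst_case:.2f} s")   # NPU, 160 tokens: 3.93 s
```

At equal output length the NPU wins comfortably. On the most verbose steps, under this model, the extra tokens can more than cancel the throughput gain, which is why the feature gap is worth tracking per pipeline step.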

The NPU is the fastest hardware for inference, but reaching it requires navigating a fragmented ecosystem of vendor-specific toolchains, proprietary dispatch libraries, and separate model artifacts. It is not plug-and-play.

What I took away from this

The NPU exists because neural network inference is a specific enough workload that purpose-built silicon can do it dramatically faster and more efficiently than general-purpose processors. On mobile, "more efficiently" translates directly to battery life.

For developers, the NPU is the fastest path to responsive on-device AI. But getting there requires understanding the compilation pipeline, the vendor libraries, and the feature gaps. The hardware is ready. The software ecosystem is catching up.

Related in this series

What Is a Delegate in LiteRT? - how LiteRT routes operations to CPU, GPU, or NPU
FP32, INT4, and Everything Between - What I Learned About Precision on Mobile - the precision tradeoff that makes models fit in phone RAM
NPU vs GPU vs CPU: Real Numbers - measured performance comparison on the hardware described here

Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware. Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.

Sources:

Qualcomm Snapdragon 8 Elite - Hexagon V79 NPU specs
Apple introduces M4 chip - Neural Engine, 38 TOPS
MediaTek Dimensity 9400 - APU specs
Sze, V. et al. (2017). "Efficient Processing of Deep Neural Networks." - NPU architecture background
Benchmark data: Redacto project, Samsung Galaxy S25 Ultra (Snapdragon 8 Elite, SM8750)

Last updated: May 2026
5th of 22 posts in the "Edge AI from the Trenches" series
