Jaydeep Shah (JD)

Posted on Jul 4

What On-Device AI Benchmarks Actually Feel Like

#edgeai #android #litertlm

You run a benchmark. The report says "41.7 tok/s, TTFT 92ms." Is that good? Is that fast? What does the user actually feel?

Most benchmark writeups treat these numbers as abstract scores. But when you are building an app that runs an LLM on a phone, these numbers are the user experience. They determine whether your app feels instant or sluggish, whether text flows naturally or stutters, and whether your prompt even fits in memory.

Here is what I learned about the four metrics that matter most, using real numbers from Redacto, our on-device PII redaction app running Gemma 4 E2B on a Snapdragon 8 Elite.

The first surprise: TTFT matters more than total speed

Time to First Token (TTFT) measures the delay between when inference starts and when the first output token appears. In practice, this is how long the user stares at a blank screen before anything happens.

TTFT covers the "prefill" phase: the model reads the entire input prompt before generating the first output token. Longer prompts mean more prefill work, which means higher TTFT.

How we measure it: Timestamp of the first onMessage callback minus the inference start time. Direct measurement, no estimation.
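A minimal sketch of that subtraction, written in Python for brevity and not taken from Redacto's code. The class name, the method names, and the injectable clock are hypothetical; the only part that comes from the measurement described above is "first callback timestamp minus inference start time."

```python
import time


class TtftTimer:
    """Records inference start and the arrival of the first streamed chunk."""

    def __init__(self, clock=time.monotonic):
        # A monotonic clock avoids jumps from wall-clock adjustments.
        self._clock = clock
        self._start = None
        self._first = None

    def start(self):
        """Call immediately before inference is kicked off."""
        self._start = self._clock()
        self._first = None

    def on_message(self, chunk):
        """Call from the streaming callback; only the first call is timestamped."""
        if self._first is None:
            self._first = self._clock()

    def ttft_ms(self):
        """TTFT in milliseconds, or None if no token has arrived yet."""
        if self._start is None or self._first is None:
            return None
        return (self._first - self._start) * 1000.0
```

Later callbacks are ignored on purpose: they belong to the decode phase, not to TTFT.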

Here is what TTFT looks like in Redacto, averaged across 30 entries:

Step	GPU TTFT	NPU TTFT	NPU Advantage
Step 1 (Classify)	381ms	99ms	3.8x faster
Step 2 (Detect)	375ms	104ms	3.6x faster
Step 3 (Redact)	366ms	92ms	4.0x faster

The NPU delivers first tokens in roughly 100ms on every step (92 to 104ms). At 92ms, the response feels instantaneous. Research on human perception shows that delays under 100ms are perceived as immediate, with no conscious awareness of waiting (Jakob Nielsen, "Response Times: The 3 Important Limits," Nielsen Norman Group, 1993; based on Miller 1968 and Card et al. 1991).

The GPU's 366ms is not terrible, but it crosses into noticeable territory. In my experience with streaming UX, the practical thresholds are roughly: under 100ms feels instant, 100-300ms feels responsive, and above 300ms feels like waiting. (The 100ms threshold comes from Nielsen's framework; the 300ms boundary is my own observation from testing Redacto, not from that research.)
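Those thresholds can be written down as a small lookup. This is only a sketch in Python: the function name and labels are made up for illustration, and the 300ms boundary is the informal observation described above, not a research-backed limit.

```python
def ttft_feel(ttft_ms):
    """Map a TTFT in milliseconds to the rough perception bands above."""
    if ttft_ms < 100:
        return "instant"      # Nielsen's 100ms limit
    if ttft_ms <= 300:
        return "responsive"   # informal band from testing
    return "waiting"
```

With the table's numbers, the NPU's 92ms and 99ms steps land in the first band, its 104ms step just crosses into the second, and all three GPU steps land in the third.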

Why this matters more than total latency: Users judge responsiveness at the moment something first appears on screen. A system that shows a first token at 92ms and then takes 4 seconds to finish feels faster than one that waits 366ms and finishes in 3.5 seconds. Streaming output transforms a wait into a reading experience. TTFT is the boundary between "waiting" and "reading."

What ~42 tok/s actually feels like

Tokens per second (tok/s) measures decode speed: how quickly the model generates output tokens after the first one. This is the speed at which text streams onto the screen.

How we measure it: (tokenCount - 1) * 1000 / (lastTokenTime - firstTokenTime), where times are in milliseconds. The minus-one matters: N tokens are separated by N - 1 intervals, and the wait for the first token is already covered by TTFT.
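The same formula as a Python sketch. The function name and the guard clauses are additions for illustration; the arithmetic is the formula above.

```python
def decode_tok_per_s(token_count, first_token_ms, last_token_ms):
    """Decode speed from the intervals between the first and last token.

    Returns None when there are fewer than two tokens or no elapsed time,
    since no interval exists to measure in those cases.
    """
    if token_count < 2:
        return None
    elapsed_ms = last_token_ms - first_token_ms
    if elapsed_ms <= 0:
        return None
    return (token_count - 1) * 1000.0 / elapsed_ms
```

For example, 43 tokens whose first and last arrivals are exactly one second apart span 42 intervals, so the function returns 42.0.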

Across Redacto's three pipeline steps, decode speed held roughly constant on each backend:

GPU: ~25 tok/s (24.5 to 25.3 across the three steps)
NPU: ~42 tok/s (essentially flat, step to step)

That flatness surprised me at first - shouldn't a longer step decode slower? But it lines up with how decoding actually works. Generating each token means streaming the model's weights out of memory once; the arithmetic per token is tiny by comparison. So decode is memory-bandwidth-bound, and bandwidth is a fixed property of the chip, not of your prompt. The NPU's rate barely moves between steps because it is limited by how fast it can pull weights, which does not change. The GPU drifts a little more, most likely because it shares memory bandwidth with other work and is more exposed to thermal throttling.

But what do these numbers feel like? I needed a reference point: human reading speed. The average adult reads approximately 238 words per minute, roughly 4 words per second (Brysbaert, 2019, "How many words do we read per minute? A review and meta-analysis of reading rate," Journal of Memory and Language, Vol. 109). Since one token is approximately 0.75 words in English for typical BPE/SentencePiece tokenizers (a commonly cited approximation; exact ratios vary by tokenizer), 4 words per second is about 5.3 tokens per second.

That gives us a practical scale:

10 tok/s: About 7.5 words per second, 1.9x average reading speed. Noticeably slow. You can see individual words appearing.
25 tok/s (GPU): About 19 words per second, 4.7x reading speed. Comfortable. Text flows faster than you can read it.
~42 tok/s (NPU): About 31.5 words per second, 7.9x reading speed. Text appears almost instantly. No perceptible difference between 42 tok/s and "all at once."
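The scale above is just two multiplications, sketched here in Python. The constants are the approximations already cited (0.75 words per token, reading speed rounded to 4 words per second); the function name is hypothetical.

```python
WORDS_PER_TOKEN = 0.75      # common approximation; varies by tokenizer
READING_WORDS_PER_S = 4.0   # 238 words per minute, rounded as in the text


def reading_multiple(tok_per_s):
    """Return (words per second, multiple of average reading speed)."""
    words_per_s = tok_per_s * WORDS_PER_TOKEN
    return words_per_s, words_per_s / READING_WORDS_PER_S
```

Plugging in a different tokenizer ratio or a different reading rate shifts the multiples, but not the overall picture: anything in the 20+ tok/s range outruns a human reader by a wide margin.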

The practical takeaway: once you pass roughly 20 tok/s, further speed improvements do not change the user's experience much. Below 15 tok/s, the streaming effect starts to feel slow, and below 10 tok/s it becomes a distraction.

The constraint I did not expect: context window

Context window measures how many tokens the model can work with at once - and this is where the gap between "what the model can do" and "what you actually deployed" bit me.

Gemma 4 E2B is trained for a 128K-token context. On a phone, you do not get 128K. You choose a KV-cache size at export time, and that compiled value - not the model's theoretical maximum - is your real ceiling. Redacto's model was compiled with a 1,024-token KV cache (--cache_length 1024, with a 256-token prefill chunk).

Here is the part that caught me: our app config actually requested maxNumTokens = 4000. It did not matter. A runtime request cannot exceed what the model was compiled with - the 1,024-token cache baked into the .litertlm at export wins. (Same lesson as the sealed-file story: what you set at export time is what you get; the app only gets to ask.)

And 1,024 tokens is tight - roughly 768 words - shared across everything:

[System prompt] + [User input] + [Generated output] = must fit in 1,024 tokens

In Redacto's HIPAA detection step, the system prompt alone (PII category definitions, output format instructions, examples) eats a real chunk of that budget. Add a medium-length medical note as input plus the detection output, and a single call can use most of the window.

This has concrete consequences:

You cannot use the verbose system prompts that work fine with cloud models running 128K+ context windows.
Few-shot examples eat directly into space available for user content.
Long input documents may need to be truncated or chunked.
The model's output can be cut off mid-response if the total exceeds the cache.
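A rough Python sketch of the budgeting this implies. It is not Redacto's code: the function names are hypothetical, the token counts are assumed to come from whatever tokenizer matches the model, and modelling the runtime cap as a simple min() is an assumption based on the behavior described above.

```python
COMPILED_CACHE_TOKENS = 1024  # KV-cache size baked in at export time


def effective_max_tokens(requested, compiled=COMPILED_CACHE_TOKENS):
    """The app can ask for more, but the compiled cache is the ceiling."""
    return min(requested, compiled)


def output_budget(system_tokens, input_tokens, cache=COMPILED_CACHE_TOKENS):
    """Tokens left for generation; negative means the prompt already overflows."""
    return cache - system_tokens - input_tokens


def chunk_input(input_token_ids, system_tokens, reserved_output_tokens,
                cache=COMPILED_CACHE_TOKENS):
    """Split input tokens so each call fits system prompt + chunk + output."""
    room = cache - system_tokens - reserved_output_tokens
    if room <= 0:
        raise ValueError("system prompt and output reservation leave no room for input")
    return [input_token_ids[i:i + room]
            for i in range(0, len(input_token_ids), room)]
```

As an illustration with made-up sizes: a 300-token system prompt and 200 tokens reserved for output leave 524 tokens of input per call, so a 1,000-token document needs two calls. Splitting on raw token boundaries can cut a name or identifier in half, so a real pipeline would probably split on sentence or line boundaries instead.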

This is a big reason Redacto uses a multi-step pipeline (Classify, Detect, Redact) instead of a single "do everything at once" prompt. Each step gets a fresh 1,024-token budget with a focused system prompt.

(Note: Redacto's full pipeline has 4 steps: Classify, Detect, Redact, Validate. Our benchmark data covers only the first 3 steps. The validation step adds variable retry rounds that would make performance comparisons noisy, so it is excluded from the metrics in this post.)

What the model costs in RAM

Peak RSS (Resident Set Size) measures the maximum physical RAM consumed during inference. On a phone, this is critical because the AI model, the OS, other apps, and system services all compete for the same RAM pool.

Backend	Peak RSS
GPU	1,375 MB
NPU	1,934 MB

The NPU model uses approximately 560 MB more RAM. This is partly because the NPU model file is larger (3.02 GB vs 2.59 GB) and partly because the QNN runtime allocates additional execution buffers for the Hexagon DSP.

To put 1.9 GB in context: the Samsung Galaxy S25 Ultra has 12 GB of total RAM. Dedicating 1.9 GB to AI inference means roughly 16% of the device's total memory is consumed by your model alone. On lower-end devices with 6-8 GB, running a model this size while maintaining a responsive user experience becomes a genuine engineering challenge.
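The same back-of-envelope calculation for other device sizes, as a Python sketch. The function name is hypothetical, and treating 1 GB as 1,024 MB is an assumption; usable RAM on a real device is lower than the advertised total.

```python
def ram_share(peak_rss_mb, device_ram_gb):
    """Fraction of total device RAM taken by the model's peak RSS."""
    return peak_rss_mb / (device_ram_gb * 1024)


# NPU peak RSS from the table above, against a few device sizes.
for ram_gb in (12, 8, 6):
    print(f"{ram_gb} GB device: {ram_share(1934, ram_gb):.0%} of total RAM")
```

On the 6-8 GB devices mentioned above, the same model takes roughly a quarter to a third of total RAM before the OS and other apps get their share.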

The finding that surprised me: NPU wins perception, GPU wins total time

Here is the full picture from Redacto's 30-entry benchmark:

Metric	GPU	NPU	Winner
TTFT	366-381ms	92-104ms	NPU (3.6-4.0x)
Decode tok/s	~25	~42	NPU (1.6-1.7x)
Peak RSS	1,375 MB	1,934 MB	GPU (560 MB less)
Avg total latency	4,855ms	5,062ms	Roughly equal
Constrained decoding	Supported	Not supported	GPU
Avg total tokens	92	195	GPU (2.1x fewer)

A note on this data: these numbers come from a single benchmarking session - one Galaxy S25 Ultra, 30 prompts across five modes, in May 2026. I no longer have the device, and the raw per-entry logs were not preserved, so treat everything here as directional evidence from one real run, not a rigorous multi-run study. The patterns - NPU wins TTFT, decode is bandwidth-bound, the verbosity tradeoff erases NPU's speed lead - are the part I expect to hold up; the exact digits are one session's snapshot.

The NPU is faster at everything the user directly perceives: time to first token and token generation speed. But it uses more RAM and does not support constrained decoding (SamplerConfig, which controls parameters like topK, topP, and temperature to limit output verbosity). Without constrained decoding, the NPU generates more verbose output (195 avg tokens vs GPU's 92), which erases its per-token speed advantage in total wall-clock time.

This showed up dramatically in TACTICAL mode, where NPU averaged 14,201ms per entry compared to GPU's 5,430ms.
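A simplified model makes the mechanism visible: total time is roughly TTFT plus the decode intervals. This Python sketch is an illustration under assumptions, not a reconstruction of the benchmark. It plugs in the Step 3 TTFT and the average token counts from the table, and it does not reproduce the measured 4,855ms and 5,062ms averages, since the per-step breakdown of those totals is not available.

```python
def estimated_total_ms(ttft_ms, output_tokens, tok_per_s):
    """Rough wall-clock estimate: prefill wait plus (N - 1) decode intervals."""
    return ttft_ms + (output_tokens - 1) * 1000.0 / tok_per_s


gpu = estimated_total_ms(ttft_ms=366, output_tokens=92, tok_per_s=25)
npu = estimated_total_ms(ttft_ms=92, output_tokens=195, tok_per_s=42)
npu_same_length = estimated_total_ms(ttft_ms=92, output_tokens=92, tok_per_s=42)

print(f"GPU, 92 tokens:  {gpu:.0f} ms")
print(f"NPU, 195 tokens: {npu:.0f} ms")
print(f"NPU, 92 tokens:  {npu_same_length:.0f} ms")
```

In this model the NPU finishes well ahead of the GPU when both produce the same number of tokens, and falls behind once it generates about twice as many. Output length, not per-token speed, decides the total.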

The decision framework I arrived at:

Choose NPU when TTFT and perceived responsiveness matter most, your prompts produce short outputs, and you can afford the extra RAM.
Choose GPU when you need constrained decoding, predictable output length, lower memory footprint, or consistent total latency.
Choose CPU as a fallback when neither GPU nor NPU delegates are available on the target device.

What I take away

Every metric maps to something the user feels:

TTFT is the pause before the response starts. Under 100ms feels instant; try to stay below 300ms.
tok/s is the speed of the streaming text. Anything above 20 is comfortable.
Context window is how much your prompt can say. Budget it carefully.
Peak RSS is what your app costs in RAM. Know the ceiling for your target devices.

The next time you see a benchmark table, do not just look for the biggest number. Ask what each number means for the person holding the phone.

Related posts in the "Edge AI from the Trenches" series

Why My LLM Runs 4x Faster on Hardware I Had Never Heard Of - the hardware that drives the TTFT and tok/s differences measured here
One Model, Three Chips, Two Files: How LiteRT Delegates Really Work - how LiteRT selects the backend that determines your performance numbers
Benchmarking On-Device LLMs: What You Can and Cannot Measure - the methodology and honesty principles behind these metrics

Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware. Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.

Sources: