Jaydeep Shah (JD)

Posted on May 17 • Edited on May 20

What I Learned by Dissecting gemma-4-E2B-it_qualcomm_sm8750.litertlm

#gemmachallenge #edgeai #android #litertlm

Gemma 4 Challenge: Write about Gemma 4 Submission

I recently started exploring on-device AI inference, and honestly, the initial experience was overwhelming. Hundreds of models on HuggingFace, unfamiliar architecture names, quantization formats, chip-specific variants - it felt like drinking from a firehose. When I first saw a filename like gemma-4-E2B-it_qualcomm_sm8750.litertlm, it looked like alphabet soup.

But as I dug deeper - reading model cards, building an actual app, benchmarking on real hardware - each piece of that filename started to make sense. Every segment encodes a specific decision about the model's lineage, its architecture, how it was trained, what chip it was compiled for, and what runtime will execute it.

This post breaks that filename apart, piece by piece, so you do not have to go through the same confusion I did.

The anatomy of the name

`gemma` - the model family

Gemma is Google's family of open-weight large language models. "Open-weight" means Google publishes the trained model weights so you can download, deploy, fine-tune, and build on them without per-token API fees or cloud dependencies.

But Gemma is not one model - it is four generations of architectural evolution, each shifting what "small enough for a phone" actually means. Understanding this lineage matters because the generation determines the architecture, the context window, the chat template format, and ultimately what your on-device app can do.

The Gemma family tree

A few things worth noticing in this evolution:

Attention mechanisms evolved with each generation. Gemma 1 used multi-query attention (MQA) on the small model (fewer key-value heads means less memory) and standard multi-head attention (MHA) on the large model. Gemma 2 unified everything to grouped-query attention (GQA) - a middle ground where key-value heads are shared across groups of query heads, reducing KV-cache size with less quality loss than MQA. Gemma 2 also alternated sliding-window and global attention layers, a direct response to the memory bandwidth problem during autoregressive decoding - the exact bottleneck that matters on a phone.
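To make the KV-cache argument concrete, here is a rough back-of-the-envelope calculation. The layer count, head counts, head dimension, and sequence length are illustrative assumptions, not the configuration of any real Gemma model; the point is only how the cache scales with the number of key-value heads.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV-cache size: keys and values for every layer and cached token."""
    # The factor of 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only for illustration (not a real Gemma config).
LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN = 28, 8, 256, 4096

variants = {
    "MHA (one KV head per query head)": QUERY_HEADS,
    "GQA (2 KV heads, each shared by 4 query heads)": 2,
    "MQA (1 KV head shared by all query heads)": 1,
}

for name, kv_heads in variants.items():
    size_mib = kv_cache_bytes(LAYERS, kv_heads, HEAD_DIM, SEQ_LEN) / 2**20
    print(f"{name}: {size_mib:.0f} MiB")
```

The cache shrinks linearly with the number of key-value heads, which is why sharing heads is attractive when every decoded token has to stream the whole cache from memory.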

The licensing change in Gemma 4 is significant. Generations 1-3 used Google's custom "source-available" license with usage restrictions. Gemma 4 moved to Apache 2.0 - fully permissive, no usage restrictions. This matters for commercial on-device apps: you can ship Gemma 4 in a production app without worrying about license compliance beyond standard Apache 2.0 attribution.

The context window grew dramatically: from 8K tokens (Gemma 1-2) to 128K (Gemma 3), and in Gemma 4 to 128K on the edge models and 256K on the cloud models. The E2B model we run on the Galaxy S25 Ultra has a 128K context window - the same as Gemma 3. The 256K window is reserved for the heavyweight 26B MoE and 31B dense models. On-device, you rarely use the full window anyway - a 4K KV-cache is common for mobile deployment - but the larger training window means the model understands longer documents even when you truncate at inference time.
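Truncating at inference time can be as simple as keeping the most recent tokens that fit the budget. This is a minimal sketch under assumptions: the function name, the 4096-token window, and the 512 tokens reserved for the reply are all hypothetical values, not settings from Redacto or LiteRT-LM.

```python
def fit_to_window(token_ids, max_context=4096, reserve_for_output=512):
    """Keep only the most recent tokens that fit in the prompt budget."""
    budget = max_context - reserve_for_output
    if budget <= 0:
        raise ValueError("reserve_for_output must be smaller than max_context")
    # Negative slicing keeps the tail; shorter inputs are returned unchanged.
    return token_ids[-budget:]
```

Dropping the oldest tokens is the simplest policy. An app that relies on a system prompt would more likely pin that prompt and trim only the older conversation turns.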

`4` - the generation

The number tells you which generation of the Gemma family this model belongs to. A higher generation generally means better quality at the same parameter count - improved architecture, better training data, and lessons from prior generations baked in.

For on-device developers, the generation also determines which chat template the model expects. Gemma 3 and 4 use <start_of_turn>user\n...<end_of_turn> formatting. Older generations use different markup. Getting this wrong does not produce an error - it produces garbage output. (More on this silent failure mode in a future post on chat templates.)
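For illustration, here is roughly what that turn markup looks like when rendered by hand. This is a sketch based on the markers quoted above; the `model` role name and the trailing open turn follow the template published for earlier Gemma generations, and whether Gemma 4 matches it exactly is an assumption. The function name is hypothetical.

```python
def format_gemma_turns(messages):
    """Render (role, text) pairs using the <start_of_turn>/<end_of_turn> markers."""
    parts = []
    for role, text in messages:
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    # Leave an open model turn so generation continues as the assistant.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


print(format_gemma_turns([("user", "Redact the names in this note.")]))
```

Since a .litertlm bundle packages its own chat template, the runtime will usually apply the markup for you; writing it by hand is mainly useful for debugging the silent-garbage case.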

`E2B` - effective 2 billion parameters

This is the most misunderstood part of the name, and it is worth getting right.

The E is not an arbitrary prefix. According to Google's model card on HuggingFace, it stands for effective parameters. The smaller Gemma 4 models (E2B and E4B) use a technique called Per-Layer Embeddings (PLE) that fundamentally changes how parameters are counted.

What Per-Layer Embeddings do

In a standard transformer, there is one shared embedding table that converts tokens to vectors at the input and converts vectors back to tokens at the output. The decoder layers in between never look at the token ID again; they only see the hidden state derived from that single embedding.

PLE gives each decoder layer its own small embedding table for every token. Instead of sharing one embedding, each layer has a private lookup table that adapts the token representation to that layer's specific role in the network.
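A toy sketch can make the idea tangible. Everything here is an assumption for illustration: the sizes are tiny, the per-layer vector is projected and simply added to the hidden state, and the attention and MLP work of a real decoder layer is left out. It is not Gemma's actual implementation, only the shape of "one private lookup table per layer".

```python
import random

VOCAB, HIDDEN, PLE_DIM, LAYERS = 10, 4, 2, 3  # toy sizes, purely illustrative
rng = random.Random(0)


def table(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


shared_embedding = table(VOCAB, HIDDEN)                            # one table at the input
per_layer_tables = [table(VOCAB, PLE_DIM) for _ in range(LAYERS)]  # one private table per layer
per_layer_proj = [table(PLE_DIM, HIDDEN) for _ in range(LAYERS)]   # small projection to HIDDEN


def layer_signal(layer, token_id):
    """Look up this layer's private row for the token and project it to HIDDEN dims."""
    row = per_layer_tables[layer][token_id]  # a single-row read, not a full matmul
    proj = per_layer_proj[layer]
    return [sum(row[i] * proj[i][j] for i in range(PLE_DIM)) for j in range(HIDDEN)]


def forward(token_id):
    hidden = list(shared_embedding[token_id])
    for layer in range(LAYERS):
        # A real decoder layer (attention + MLP) would run here; omitted in this toy.
        signal = layer_signal(layer, token_id)
        hidden = [h + s for h, s in zip(hidden, signal)]
    return hidden


print(forward(3))
```

Note that the per-layer tables hold `LAYERS * VOCAB * PLE_DIM` values in total, so they grow with both vocabulary size and depth even though each token only ever reads one row per layer.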

Why this changes the parameter count

These per-layer embedding tables add up to a large raw parameter count. Gemma 4 E2B has 5.1 billion total parameters, but only 2.3 billion active/effective parameters. The difference, roughly 2.8 billion parameters, is entirely PLE overhead.

But those PLE tables are lookup tables, not compute-heavy matrix multiplications. A lookup reads one row per token, and its cost does not grow with the size of the table; a matrix multiply has to touch every weight in the matrix. So while the chip has to hold 5.1B parameters' worth of memory-mapped weights, the execution engine only computes 2.3B parameters' worth of matrix multiplications per token step.

This is the key insight: PLE maximizes parameter efficiency specifically for on-device deployment. You get the quality benefits of having more specialized parameters (5.1B of them), but the inference cost (latency, memory bandwidth, power) stays in the ~2B compute class.

What this means for your phone

Variant	Total Params (with PLE)	Active/Effective Params	Base .litertlm Size	Context	Target Hardware
E2B	5.1B	2.3B	~2.58 GB	128K	Phones, Raspberry Pi, edge devices
E4B	7.9B	4.5B	~3.65 GB	128K	High-end phones (12+ GB RAM)
26B MoE	26B	Subset active per token	Large	256K	Workstations, servers
31B Dense	31B	All active	Large	256K	Servers, high-end GPUs

For Redacto - our on-device PII redaction app - we chose E2B because even with 5.1B total parameters, the INT4-quantized .litertlm is only ~2.58 GB. That leaves headroom for the OS, ML Kit OCR, and the four independent LLM conversations in our redaction pipeline, all on a Galaxy S25 Ultra with 12 GB RAM. The PLE architecture means we get the quality of 5.1B parameters' worth of specialization at the compute cost of only 2.3B active parameters - a budget that fits within a mobile power envelope.
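A quick sanity check on those numbers, assuming every parameter is stored at exactly 4 bits. That is a simplification: a real bundle also carries quantization scales, the tokenizer, and metadata, and may keep some tensors at higher precision.

```python
TOTAL_PARAMS = 5.1e9      # total parameters, from the table above
EFFECTIVE_PARAMS = 2.3e9  # active/effective parameters, from the table above
BITS_PER_WEIGHT = 4       # INT4; assumes uniform 4-bit storage

ple_params = TOTAL_PARAMS - EFFECTIVE_PARAMS
weights_gb = TOTAL_PARAMS * BITS_PER_WEIGHT / 8 / 1e9  # decimal gigabytes

print(f"PLE parameters: {ple_params / 1e9:.1f}B")
print(f"INT4 weights:   {weights_gb:.2f} GB")
```

The estimate comes out at about 2.55 GB, in the same ballpark as the ~2.58 GB file, which suggests the bundle size is dominated by the 4-bit weights.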

`-it` - instruction-tuned

This suffix changes everything about how the model behaves.

A base model (no -it) is trained to predict the next token. Give it text, it continues it. It does not follow instructions or understand "system prompt" vs "user message."

An instruction-tuned model (-it) is the base model further trained on instruction-response pairs. It understands conversational structure: system prompts, user turns, assistant responses. It follows directions.

For on-device apps, -it is almost always what you want. In Redacto, each pipeline step sends a system prompt like "You are a medical PII detector. Find all names, dates of birth, medical record numbers..." A base model would ignore this and generate plausible-looking but unstructured text. The -it variant follows the prompt, stays in role, and produces structured output that the next pipeline step can parse.
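Structured output is only useful if the next step checks it before trusting it. The sketch below assumes the system prompt asked the model to reply with a JSON list of objects carrying `type` and `text` keys; that schema and the function name are hypothetical, not Redacto's actual format.

```python
import json


def parse_detections(model_output):
    """Parse a step's reply; return [] if it is not the expected JSON list."""
    try:
        items = json.loads(model_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    # Keep only well-formed entries so one bad item does not break the pipeline.
    return [
        item for item in items
        if isinstance(item, dict) and {"type", "text"} <= item.keys()
    ]


print(parse_detections('[{"type": "name", "text": "Jane Doe"}]'))
print(parse_detections("Sure! Here is a continuation of your text..."))
```

The second call mimics what a base model tends to produce: free-form text that fails to parse, which the sketch treats as "no detections" rather than crashing.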

If you see two versions on HuggingFace - with and without -it - and your app needs the model to follow instructions, pick -it.

`_qualcomm_sm8750` - the compilation target

This suffix tells you the model has been compiled for a specific chip: qualcomm is the vendor, sm8750 is the Snapdragon 8 Elite system-on-chip.

Modern phones have specialized AI silicon - a Neural Processing Unit (NPU) - that runs neural network operations far faster than the CPU or GPU. But using it requires translating the model's computation graph into the chip's native instruction format. Same concept as compiling C for x86 vs arm64 - the math is identical, the binary is not.

File	Target	Decode Speed	Time to First Token (TTFT)
`gemma-4-E2B-it.litertlm`	CPU / GPU (generic)	24.5 tok/s	366ms
`gemma-4-E2B-it_qualcomm_sm8750.litertlm`	Snapdragon 8 Elite NPU	41.7 tok/s	92ms

The NPU variant is larger (2.8 GB vs 2.4 GB) because it bundles QNN-compiled execution graphs with DISPATCH_OP custom operations targeting the Hexagon V79 DSP. But the speed gain is substantial: about 1.7x faster decode throughput and roughly 4x faster time-to-first-token. A filename with no chip suffix is the generic variant that runs on any ARM CPU or mobile GPU. (I cover NPUs in detail in "Why My LLM Runs 4x Faster on Hardware I Had Never Heard Of.")
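The speedup figures follow directly from the benchmark table. The snippet below recomputes them and adds a simple end-to-end estimate; the `response_time_s` helper and its model (time to first token plus steady decode) are a simplifying assumption, not a measured result.

```python
generic = {"decode_tok_s": 24.5, "ttft_ms": 366}  # from the benchmark table
npu = {"decode_tok_s": 41.7, "ttft_ms": 92}       # from the benchmark table

decode_speedup = npu["decode_tok_s"] / generic["decode_tok_s"]
ttft_speedup = generic["ttft_ms"] / npu["ttft_ms"]
print(f"decode: {decode_speedup:.2f}x, TTFT: {ttft_speedup:.2f}x")


def response_time_s(ttft_ms, tok_s, output_tokens):
    """Rough total latency: wait for the first token, then decode at a steady rate."""
    return ttft_ms / 1000 + output_tokens / tok_s


for label, stats in (("generic", generic), ("npu", npu)):
    total = response_time_s(stats["ttft_ms"], stats["decode_tok_s"], output_tokens=200)
    print(f"{label}: {total:.2f} s for a 200-token reply")
```

For longer replies the decode rate dominates the total, so the end-to-end gain sits closer to the 1.7x decode figure than to the 4x first-token figure.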

`.litertlm` - the file format

This is the file extension for LiteRT-LM, Google's on-device LLM inference runtime. A .litertlm file is not raw weights - it is a compiled bundle containing quantized weights, tokenizer, chat template, and execution graph, all packaged for immediate on-device inference.

You cannot fine-tune a .litertlm file. It is the end of the pipeline: train on cloud, fine-tune (optionally), quantize, compile, package, deploy. It is distinct from .tflite (classical ML), .gguf (llama.cpp), .onnx (cross-platform), or .safetensors (raw weights for storage/transfer). (I dig into the internals of this format in a separate post.)

Navigating HuggingFace

Two organizations matter when searching for Gemma models ready for on-device deployment:

google - official weights in .safetensors format, for fine-tuning or conversion. This is where you start if you need to customize the model.
litert-community - pre-compiled .litertlm bundles, including chip-specific NPU variants, ready for on-device use. This is where you go if you want to run inference immediately.

Check the model card for quantization level, supported hardware, and license. Look at the file listing for chip-specific variants (_qualcomm_sm8750, etc.). For production on-device apps, litert-community gives you deployment-ready files; google gives you the starting point for fine-tuning.
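Picking the right file from a repository listing can be automated. This sketch assumes the naming seen in this post: a chip-specific bundle ends in `_<vendor>_<chip>.litertlm` and the generic bundle has no underscore at all. The function name and that rule are assumptions inferred from two filenames, not a documented convention.

```python
def pick_bundle(filenames, soc=None):
    """Prefer a bundle compiled for the device SoC; otherwise fall back to the generic one."""
    bundles = [f for f in filenames if f.endswith(".litertlm")]
    if soc:
        for name in bundles:
            if name.endswith(f"_{soc}.litertlm"):
                return name
    for name in bundles:
        # Assumption: generic bundles carry no vendor/chip suffix, hence no underscore.
        if "_" not in name:
            return name
    return None


listing = [
    "README.md",
    "gemma-4-E2B-it.litertlm",
    "gemma-4-E2B-it_qualcomm_sm8750.litertlm",
]
print(pick_bundle(listing, soc="qualcomm_sm8750"))
print(pick_bundle(listing))
```

Falling back to the generic file keeps the app working on devices whose chip has no compiled variant, at the cost of the slower CPU/GPU path.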

Putting it all together

Every segment in Redacto's model filename maps to a decision we made:

Segment	What it means	Why we chose it
`gemma`	Google's open-weight LLM family	Apache 2.0, strong on-device ecosystem, Google-backed
`4`	Fourth generation	Best quality per parameter, Apache 2.0 license, PLE architecture
`E2B`	5.1B total / 2.3B effective via PLE	~2.58 GB at INT4, fits in phone RAM with headroom for OCR + 4 LLM calls
`it`	Instruction-tuned	Follows system prompts - critical for our 4-step redaction pipeline
`qualcomm_sm8750`	Compiled for Snapdragon 8 Elite NPU	41.7 tok/s, 92ms TTFT on Hexagon V79
`.litertlm`	LiteRT-LM compiled bundle	Tokenizer + chat template + weights + graph, ready to infer

The next time you see a model filename that looks like alphabet soup, read it left to right: family, generation, size and architecture, training variant, target hardware, runtime format. Each piece narrows what the model is, what it can do, and where it can run.
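The left-to-right reading can be written down as a small parser. The pattern is inferred from the two filenames discussed in this post and is an assumption, not an official naming specification; other repositories may name their files differently.

```python
import re

# family-generation-size[-variant][_vendor_chip].format
PATTERN = re.compile(
    r"^(?P<family>[a-z]+)-(?P<generation>\d+)-(?P<size>[A-Za-z0-9]+)"
    r"(?:-(?P<variant>[a-z]+))?"
    r"(?:_(?P<vendor>[a-z]+)_(?P<chip>[a-z0-9]+))?"
    r"\.(?P<format>[a-z]+)$"
)


def parse_model_filename(name):
    """Split a model filename into its segments; optional segments come back as None."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized model filename: {name}")
    return match.groupdict()


for filename in (
    "gemma-4-E2B-it_qualcomm_sm8750.litertlm",
    "gemma-4-E2B-it.litertlm",
):
    print(parse_model_filename(filename))
```

For the generic file the `vendor` and `chip` fields are `None`, which matches the rule that no chip suffix means the CPU/GPU variant.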

Related in this series of "Edge AI from the Trenches"

What I Learned Turning a HuggingFace Model Into Something My Phone Can Run - what happens after you decode the filename
What Does "On-Device" Actually Mean? - the privacy and deployment context behind choosing an on-device model

Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware.
Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.

Sources:

Google Gemma 4 E2B Model Card - PLE architecture, effective parameter explanation
Google Gemma announcements (Gemma 1, 2, 3)
litert-community on HuggingFace
LiteRT documentation
Benchmark data: Redacto project, Samsung Galaxy S25 Ultra (Snapdragon 8 Elite, SM8750)

Last updated: May 2026
1st out of 22 posts in the "Edge AI from the Trenches" series
