I recently started exploring on-device AI inference, and honestly, the initial experience was overwhelming. Hundreds of models on HuggingFace, unfamiliar architecture names, quantization formats, chip-specific variants - it felt like drinking from a firehose. When I first saw a filename like gemma-4-E2B-it_qualcomm_sm8750.litertlm, it looked like alphabet soup.
But as I dug deeper - reading model cards, building an actual app, benchmarking on real hardware - each piece of that filename started to make sense. Every segment encodes a specific decision about the model's lineage, its architecture, how it was trained, what chip it was compiled for, and what runtime will execute it.
This post breaks that filename apart, piece by piece, so you do not have to go through the same confusion I did.
The anatomy of the name
gemma - the model family
Gemma is Google's family of open-weight large language models. "Open-weight" means Google publishes the trained model weights so you can download, deploy, fine-tune, and build on them without per-token API fees or cloud dependencies.
But Gemma is not one model - it is four generations of architectural evolution, each shifting what "small enough for a phone" actually means. Understanding this lineage matters because the generation determines the architecture, the context window, the chat template format, and ultimately what your on-device app can do.
The Gemma family tree
A few things worth noticing in this evolution:
Attention mechanisms evolved with each generation. Gemma 1 used multi-query attention (MQA) on the small model (fewer key-value heads = less memory) and standard multi-head attention (MHA) on the large model. Gemma 2 unified everything on grouped-query attention (GQA) - a middle ground where key-value heads are shared across groups of query heads, reducing KV-cache size without the quality loss of MQA. Gemma 2 also alternates sliding-window and global attention layers, a direct response to the memory bandwidth problem during autoregressive decoding - the exact bottleneck that matters on a phone.
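To make the KV-cache argument concrete, here is a rough back-of-the-envelope sketch. The layer count, head counts, head dimension, and sequence length are hypothetical numbers I picked for illustration, not values from any Gemma model card; only the formula (keys plus values, per layer, per KV head) is the point.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, chosen only to show the scaling.
LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN = 24, 16, 128, 4096

schemes = {
    "MHA": QUERY_HEADS,  # one KV head per query head
    "GQA": 4,            # every 4 query heads share one KV head
    "MQA": 1,            # all query heads share a single KV head
}

for name, kv_heads in schemes.items():
    mib = kv_cache_bytes(LAYERS, kv_heads, HEAD_DIM, SEQ_LEN) / 2**20
    print(f"{name}: {kv_heads:2d} KV heads -> {mib:6.1f} MiB")
```

The cache shrinks linearly with the number of KV heads, which is why the choice between MHA, GQA, and MQA shows up directly in phone memory and bandwidth.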
The licensing change in Gemma 4 is significant. Generations 1-3 used Google's custom "source-available" license with usage restrictions. Gemma 4 moved to Apache 2.0 - fully permissive, no usage restrictions. This matters for commercial on-device apps: you can ship Gemma 4 in a production app without worrying about license compliance beyond standard Apache 2.0 attribution.
The context window grew dramatically. 8K (Gemma 1-2) to 128K (Gemma 3) to 128K on edge / 256K on cloud models (Gemma 4). The E2B model we run on the Galaxy S25 Ultra has a 128K context window - the same as Gemma 3. The 256K window is reserved for the heavyweight 26B MoE and 31B dense models. On-device, you rarely use the full window anyway - a 4K KV-cache is common for mobile deployment - but the larger training window means the model understands longer documents even when you truncate at inference time.
--
4 - the generation
The number tells you which generation of the Gemma family this model belongs to. Higher generation means better quality at the same parameter count - improved architecture, better training data, and lessons from prior generations baked in.
For on-device developers, the generation also determines which chat template the model expects. Gemma 3 and 4 use <start_of_turn>user\n...<end_of_turn> formatting. Older generations use different markup. Getting this wrong does not produce an error - it produces garbage output. (More on this silent failure mode in a future post on chat templates.)
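As a rough illustration of what that markup looks like, here is a minimal formatter sketch. The `model` role label for the assistant turn and the exact newline placement are my assumptions, and the function name is made up for this example; in practice the .litertlm bundle ships its own chat template, so treat this as a way to see the shape of the prompt rather than something to paste into an app.

```python
def format_gemma_turns(messages):
    """Render a list of {"role", "content"} dicts into turn markup."""
    parts = []
    for message in messages:
        parts.append(
            f"<start_of_turn>{message['role']}\n{message['content']}<end_of_turn>\n"
        )
    # Assumption: leave an open "model" turn so generation continues as the assistant.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = format_gemma_turns([{"role": "user", "content": "Redact the names in this note."}])
print(prompt)
```

If the markers are missing or belong to a different template, the model still produces tokens - which is exactly why the failure is silent.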
--
E2B - effective 2 billion parameters
This is the most misunderstood part of the name, and it is worth getting right.
The E is not a generic prefix. According to Google's model card on HuggingFace, it stands for effective parameters. The smaller Gemma 4 models (E2B and E4B) use a technique called Per-Layer Embeddings (PLE) that fundamentally changes how parameters are counted.
What Per-Layer Embeddings do
In a standard transformer, there is one shared embedding table that converts tokens to vectors at the input and converts vectors back to tokens at the output. Every decoder layer in between works with the same token representations.
PLE gives each decoder layer its own small embedding table for every token. Instead of sharing one embedding, each layer has a private lookup table that adapts the token representation to that layer's specific role in the network.
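A toy sketch helped me picture this. Everything here is an assumption made for illustration: the sizes are tiny, the per-layer vector has the same width as the hidden state, and it is combined by simple addition. Gemma's actual implementation may well differ on all three; the only thing the sketch is meant to show is "one shared table at the input, plus one private lookup per layer."

```python
import random

VOCAB_SIZE, HIDDEN_DIM, NUM_LAYERS = 8, 4, 3  # toy sizes, purely illustrative
rng = random.Random(0)


def make_table(rows, cols):
    """Random lookup table: one row of `cols` floats per token id."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


shared_embedding = make_table(VOCAB_SIZE, HIDDEN_DIM)  # used once, at the input
per_layer_embeddings = [make_table(VOCAB_SIZE, HIDDEN_DIM) for _ in range(NUM_LAYERS)]


def forward(token_id):
    """Run one token through the toy stack (attention and feed-forward omitted)."""
    hidden = list(shared_embedding[token_id])
    for layer_table in per_layer_embeddings:
        layer_vector = layer_table[token_id]  # a row lookup, not a matrix multiply
        # Assumption: the per-layer vector is simply added to the hidden state.
        hidden = [h + v for h, v in zip(hidden, layer_vector)]
    return hidden


shared_params = VOCAB_SIZE * HIDDEN_DIM
ple_params = NUM_LAYERS * VOCAB_SIZE * HIDDEN_DIM
print(f"shared embedding parameters: {shared_params}")
print(f"per-layer embedding parameters: {ple_params}")
print(forward(3))
```

Even in this toy, the per-layer tables hold more parameters than the shared table, while the work done per token is one row read per layer.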
Why this changes the parameter count
These per-layer embedding tables are large in raw parameter count - they add up significantly. Gemma 4 E2B has 5.1 billion total parameters, but only 2.3 billion active/effective parameters. The difference is entirely PLE overhead.
But those PLE tables are lookup tables, not compute-heavy matrix multiplications. Each lookup reads a single row per token - a constant-cost memory access, not a matrix multiply whose cost scales with the size of the weight matrix. So while the chip has to hold 5.1B parameters worth of memory-mapped weights, the execution engine only computes 2.3B parameters worth of matrix multiplications per token step.
This is the key insight: PLE maximizes parameter efficiency specifically for on-device deployment. You get the quality benefits of having more specialized parameters (5.1B of them), but the inference cost (latency, memory bandwidth, power) stays in the ~2B compute class.
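The arithmetic behind those numbers is worth doing once. This is a rough sketch that assumes every weight is stored at exactly 4 bits and ignores the tokenizer, metadata, and anything else packed alongside the weights, so it only approximates a real file size.

```python
TOTAL_PARAMS = 5.1e9      # total parameters, including PLE tables
EFFECTIVE_PARAMS = 2.3e9  # parameters that take part in per-token matrix multiplies
BITS_PER_WEIGHT = 4       # assumption: uniform 4-bit (INT4) storage

ple_params = TOTAL_PARAMS - EFFECTIVE_PARAMS
weight_bytes = TOTAL_PARAMS * BITS_PER_WEIGHT / 8

print(f"PLE lookup parameters: {ple_params / 1e9:.1f}B")
print(f"Share of total held in lookup tables: {ple_params / TOTAL_PARAMS:.0%}")
print(f"Weights at 4 bits each: {weight_bytes / 1e9:.2f} GB")
```

Under these assumptions the weights alone come to about 2.55 GB, which lands close to the ~2.58 GB bundle size listed for E2B in the table below.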
What this means for your phone
| Variant | Total Params (with PLE) | Active/Effective Params | Base .litertlm Size | Context | Target Hardware |
|---|---|---|---|---|---|
| E2B | 5.1B | 2.3B | ~2.58 GB | 128K | Phones, Raspberry Pi, edge devices |
| E4B | 7.9B | 4.5B | ~3.65 GB | 128K | High-end phones (12+ GB RAM) |
| 26B MoE | 26B | Subset active per token | Large | 256K | Workstations, servers |
| 31B Dense | 31B | All active | Large | 256K | Servers, high-end GPUs |
For Redacto - our on-device PII redaction app - we chose E2B because even with 5.1B total parameters, the INT4-quantized .litertlm is only ~2.58 GB, leaving headroom for the OS, ML Kit OCR, and four independent LLM conversations running in our redaction pipeline, all on a Galaxy S25 Ultra with 12 GB RAM. The PLE architecture means we get quality from 5.1B parameters worth of specialization, at the compute cost of only 2.3B active parameters - a budget that fits within a mobile power envelope.
--
-it - instruction-tuned
This suffix changes everything about how the model behaves.
A base model (no -it) is trained to predict the next token. Give it text, it continues it. It does not follow instructions or understand "system prompt" vs "user message."
An instruction-tuned model (-it) is the base model further trained on instruction-response pairs. It understands conversational structure: system prompts, user turns, assistant responses. It follows directions.
For on-device apps, -it is almost always what you want. In Redacto, each pipeline step sends a system prompt like "You are a medical PII detector. Find all names, dates of birth, medical record numbers..." A base model would ignore this and generate plausible-looking but unstructured text. The -it variant follows the prompt, stays in role, and produces structured output that the next pipeline step can parse.
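Here is a minimal sketch of what "structured output the next step can parse" can look like on the receiving side. The JSON shape (an array of objects with `type` and `text` keys) and the function name are hypothetical, invented for this example; they are not Redacto's actual format.

```python
import json


def parse_entities(reply):
    """Parse a reply expected to be a JSON array of {"type", "text"} objects."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of entities")
    entities = []
    for item in data:
        if not isinstance(item, dict) or not {"type", "text"} <= item.keys():
            raise ValueError(f"malformed entity: {item!r}")
        entities.append((item["type"], item["text"]))
    return entities


print(parse_entities('[{"type": "NAME", "text": "Jane Doe"}]'))
```

A free-text continuation of the kind a base model tends to produce fails at the first `json.loads` call, so the pipeline can reject it instead of passing garbage to the next step.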
If you see two versions on HuggingFace - with and without -it - and your app needs the model to follow instructions, pick -it.
--
_qualcomm_sm8750 - the compilation target
This suffix tells you the model has been compiled for a specific chip: qualcomm is the vendor, sm8750 is the Snapdragon 8 Elite system-on-chip.
Modern phones have specialized AI silicon - a Neural Processing Unit (NPU) - that runs neural network operations far faster than the CPU or GPU. But using it requires translating the model's computation graph into the chip's native instruction format. Same concept as compiling C for x86 vs arm64 - the math is identical, the binary is not.
| File | Target | Decode Speed | TTFT |
|---|---|---|---|
| gemma-4-E2B-it.litertlm | CPU / GPU (generic) | 24.5 tok/s | 366ms |
| gemma-4-E2B-it_qualcomm_sm8750.litertlm | Snapdragon 8 Elite NPU | 41.7 tok/s | 92ms |
The NPU variant is larger (2.8 GB vs 2.4 GB) because it bundles QNN-compiled execution graphs with DISPATCH_OP custom operations targeting the Hexagon V79 DSP. But the speed gain is substantial: 1.7x faster decode throughput, 4x faster time-to-first-token. No chip suffix means the generic variant that runs on any ARM CPU or mobile GPU. (I cover NPUs and hardware delegates in detail in upcoming posts in this series.)
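The speedup figures follow directly from the benchmark numbers, and the same numbers give a rough feel for end-to-end latency. In this sketch the 200-token reply length is a hypothetical value, and the estimate assumes the measured TTFT applies unchanged, which will not hold for much longer prompts.

```python
GENERIC = {"decode_tok_s": 24.5, "ttft_ms": 366}
NPU = {"decode_tok_s": 41.7, "ttft_ms": 92}


def response_time_s(run, output_tokens):
    """Rough estimate: time to first token plus steady-state decoding."""
    return run["ttft_ms"] / 1000 + output_tokens / run["decode_tok_s"]


decode_speedup = NPU["decode_tok_s"] / GENERIC["decode_tok_s"]
ttft_speedup = GENERIC["ttft_ms"] / NPU["ttft_ms"]
print(f"decode speedup: {decode_speedup:.1f}x")
print(f"TTFT speedup:   {ttft_speedup:.1f}x")

OUTPUT_TOKENS = 200  # hypothetical reply length
for label, run in (("generic", GENERIC), ("NPU", NPU)):
    seconds = response_time_s(run, OUTPUT_TOKENS)
    print(f"{label}: ~{seconds:.1f} s for {OUTPUT_TOKENS} tokens")
```

For short replies the TTFT difference dominates how responsive the app feels; for long replies the decode rate does.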
--
.litertlm - the file format
.litertlm is the file extension for LiteRT-LM, Google's on-device LLM inference runtime. A .litertlm file is not raw weights - it is a compiled bundle containing quantized weights, tokenizer, chat template, and execution graph, all packaged for immediate on-device inference.
You cannot fine-tune a .litertlm file. It is the end of the pipeline: train on cloud, fine-tune (optionally), quantize, compile, package, deploy. It is distinct from .tflite (classical ML), .gguf (llama.cpp), .onnx (cross-platform), or .safetensors (raw weights for storage/transfer). (I dig into the internals of this format in a separate post.)
--
Navigating HuggingFace
Two organizations matter when searching for Gemma models ready for on-device deployment:
- google - official weights in .safetensors format, for fine-tuning or conversion. This is where you start if you need to customize the model.
- litert-community - pre-compiled .litertlm bundles, including chip-specific NPU variants, ready for on-device use. This is where you go if you want to run inference immediately.
Check the model card for quantization level, supported hardware, and license. Look at the file listing for chip-specific variants (_qualcomm_sm8750, etc.). For production on-device apps, litert-community gives you deployment-ready files; google gives you the starting point for fine-tuning.
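Once you have a file listing, picking the right bundle for a device is a small decision: prefer the chip-specific variant, fall back to the generic one. The sketch below works on a plain list of filenames and leans on the naming convention described in this post (an underscore-separated vendor and chip suffix); the function name and the "no underscore means generic" heuristic are my own assumptions, not a HuggingFace or LiteRT-LM API.

```python
def pick_bundle(filenames, soc=None):
    """Prefer a bundle compiled for `soc`; otherwise fall back to the generic one."""
    bundles = [name for name in filenames if name.endswith(".litertlm")]
    if soc:
        for name in bundles:
            if name.endswith(f"_{soc}.litertlm"):
                return name
    # Heuristic: a bundle without any "_vendor_chip" suffix is the generic build.
    for name in bundles:
        if "_" not in name:
            return name
    return None


files = [
    "README.md",
    "gemma-4-E2B-it.litertlm",
    "gemma-4-E2B-it_qualcomm_sm8750.litertlm",
]
print(pick_bundle(files, soc="sm8750"))
print(pick_bundle(files, soc="some_other_chip"))
```

A device whose chip has no compiled variant still gets a working model; it just runs on the CPU or GPU path.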
--
Putting it all together
Every segment in Redacto's model filename maps to a decision we made:
| Segment | What it means | Why we chose it |
|---|---|---|
| gemma | Google's open-weight LLM family | Apache 2.0, strong on-device ecosystem, Google-backed |
| 4 | Fourth generation | Best quality per parameter, Apache 2.0 license, PLE architecture |
| E2B | 5.1B total / 2.3B effective via PLE | ~2.58 GB at INT4, fits in phone RAM with headroom for OCR + 4 LLM calls |
| it | Instruction-tuned | Follows system prompts - critical for our 4-step redaction pipeline |
| qualcomm_sm8750 | Compiled for Snapdragon 8 Elite NPU | 41.7 tok/s, 92ms TTFT on Hexagon V79 |
| .litertlm | LiteRT-LM compiled bundle | Tokenizer + chat template + weights + graph, ready to infer |
The next time you see a model filename that looks like alphabet soup, read it left to right: family, generation, size and architecture, training variant, target hardware, runtime format. Each piece narrows what the model is, what it can do, and where it can run.
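That left-to-right reading can be written down as a small parser. This is a sketch built around the one naming pattern discussed in this post; the regular expression and the field names are my own, and filenames from other families or publishers may not follow the same layout.

```python
import re

# family-generation-size[-variant][_vendor_chip].format
FILENAME_PATTERN = re.compile(
    r"^(?P<family>[a-z]+)-(?P<generation>\d+)-(?P<size>[A-Za-z0-9]+)"
    r"(?:-(?P<variant>[a-z]+))?"
    r"(?:_(?P<vendor>[a-z]+)_(?P<chip>[a-z0-9]+))?"
    r"\.(?P<format>[a-z]+)$"
)


def parse_model_filename(filename):
    """Split a model filename into its named segments (missing ones are None)."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized model filename: {filename!r}")
    return match.groupdict()


for name in ("gemma-4-E2B-it_qualcomm_sm8750.litertlm", "gemma-4-E2B-it.litertlm"):
    print(parse_model_filename(name))
```

For the generic bundle the vendor and chip fields come back as None, which matches the "no chip suffix means generic" rule.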
Related in this series of "Edge AI from the Trenches"
- From HuggingFace to Your Phone - the next logical read: what happens after you decode the filename
- What Does "On-Device" Actually Mean? - the privacy and deployment context behind choosing an on-device model
Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware.
Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.
Sources:
- Google Gemma 4 E2B Model Card - PLE architecture, effective parameter explanation
- Google Gemma announcements (Gemma 1, 2, 3)
- litert-community on HuggingFace
- LiteRT documentation
- Benchmark data: Redacto project, Samsung Galaxy S25 Ultra (Snapdragon 8 Elite, SM8750)
Last updated: May 2026
1st out of 22 posts in the "Edge AI from the Trenches" series


