<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jaydeep Shah (JD)</title>
    <description>The latest articles on DEV Community by Jaydeep Shah (JD) (@jdshah).</description>
    <link>https://dev.to/jdshah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3935661%2F3d2cc353-5b82-4a61-8d4c-199be47b6ac2.jpg</url>
      <title>DEV Community: Jaydeep Shah (JD)</title>
      <link>https://dev.to/jdshah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jdshah"/>
    <language>en</language>
    <item>
      <title>FP32, INT4, and Everything Between - What I Learned About Precision on Mobile</title>
      <dc:creator>Jaydeep Shah (JD)</dc:creator>
      <pubDate>Mon, 18 May 2026 01:00:00 +0000</pubDate>
      <link>https://dev.to/jdshah/fp32-int4-and-everything-between-what-i-learned-about-precision-on-mobile-3mgc</link>
      <guid>https://dev.to/jdshah/fp32-int4-and-everything-between-what-i-learned-about-precision-on-mobile-3mgc</guid>
      <description>&lt;p&gt;I was familiar with precision formats from my embedded systems work, but seeing labels like "INT4," "FP16," and "GPTQ" on HuggingFace model downloads hit differently. In that context, they are not just precision specs - they determine whether your model fits on a phone, how fast it runs, and how much accuracy you trade away. Here is the intuition that clicked once I started deploying to mobile.&lt;/p&gt;

&lt;h2&gt;
  
  
  I started by understanding what gets compressed
&lt;/h2&gt;

&lt;p&gt;A neural network is a massive collection of numbers called &lt;em&gt;weights&lt;/em&gt;. During inference, the model multiplies your input by these weights billions of times. Quantization answers one question: &lt;strong&gt;how precisely do you need to store each of those numbers?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The precision spectrum I had to learn
&lt;/h2&gt;

&lt;p&gt;Each weight can be stored at different levels of precision:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz64em6dkwm3fmxzfim53.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz64em6dkwm3fmxzfim53.png" alt="The Precision Spectrum - FP32 through INT4, each step roughly halving memory"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Bits per Weight&lt;/th&gt;
&lt;th&gt;Relative Size&lt;/th&gt;
&lt;th&gt;Typical Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FP32 (full precision)&lt;/td&gt;
&lt;td&gt;32 bits&lt;/td&gt;
&lt;td&gt;1x (baseline)&lt;/td&gt;
&lt;td&gt;Training, research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP16 (half precision)&lt;/td&gt;
&lt;td&gt;16 bits&lt;/td&gt;
&lt;td&gt;0.5x&lt;/td&gt;
&lt;td&gt;GPU inference, fine-tuning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP8&lt;/td&gt;
&lt;td&gt;8 bits&lt;/td&gt;
&lt;td&gt;0.25x&lt;/td&gt;
&lt;td&gt;Specialized silicon, datacenter inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT8&lt;/td&gt;
&lt;td&gt;8 bits&lt;/td&gt;
&lt;td&gt;0.25x&lt;/td&gt;
&lt;td&gt;Server-side quantized inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT4&lt;/td&gt;
&lt;td&gt;4 bits&lt;/td&gt;
&lt;td&gt;0.125x&lt;/td&gt;
&lt;td&gt;On-device / mobile deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each halving of the bit width halves the memory footprint. A 2-billion parameter model at FP32 takes about 8 GB of memory (2 billion weights times 4 bytes each). At INT4, those same parameters fit in roughly 1 GB (2 billion weights times half a byte each), plus a small overhead for quantization scale factors.&lt;/p&gt;
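
&lt;p&gt;The arithmetic is easy to check. This is a minimal sketch, not part of any library: the helper name &lt;code&gt;weight_memory_gb&lt;/code&gt; is made up for illustration, it uses decimal gigabytes, and it ignores the small overhead of quantization scale factors.&lt;/p&gt;

```python
# Bits per weight for each format in the table above.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def weight_memory_gb(num_params, fmt):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bits = num_params * BITS_PER_WEIGHT[fmt]
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical 2-billion parameter model, as in the example above.
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt}: {weight_memory_gb(2_000_000_000, fmt):.2f} GB")
```

&lt;p&gt;For 2 billion parameters this gives 8 GB at FP32 and 1 GB at INT4. Real model files come out somewhat larger because they also store scale factors, embeddings kept at higher precision, and metadata.&lt;/p&gt;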

&lt;p&gt;That is the difference between "will not fit on your phone" and "runs on your phone."&lt;/p&gt;

&lt;h2&gt;
  
  
  What these formats actually mean
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;FP32&lt;/strong&gt; stores each weight as a 32-bit floating-point number: a sign bit, 8 bits for the exponent, and 23 bits for the fractional part. This gives you roughly 7 decimal digits of precision. The weight 0.00314159 is stored almost exactly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FP16&lt;/strong&gt; cuts that in half: 1 sign bit, 5 exponent bits, 10 fractional bits. About 3-4 decimal digits. That same weight becomes roughly 0.0031414 - close, but not exact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;INT8&lt;/strong&gt; and &lt;strong&gt;INT4&lt;/strong&gt; are integer formats. Instead of floating-point representation, the weight range is mapped onto a fixed set of integer values. INT8 gives you 256 possible values. INT4 gives you just 16.&lt;/p&gt;

&lt;p&gt;Think about that. A weight that could be any of billions of possible floating-point values is now forced into one of 16 buckets. That is aggressive compression. (In practice, INT4 quantization uses per-group scale factors, so the effective resolution is finer than 16 uniform buckets globally - but the core tradeoff holds.)&lt;/p&gt;
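
&lt;p&gt;To make the rounding concrete, here is a small Python sketch using only the standard library. The FP16 round-trip relies on the &lt;code&gt;struct&lt;/code&gt; module's half-precision format code. The INT4 part assumes a simple symmetric scheme with one scale per group, which is my illustration of the idea and not the exact scheme used by LiteRT or any particular library; the function names are hypothetical.&lt;/p&gt;

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def quantize_group_int4(weights):
    """Symmetric INT4 sketch: one shared scale, integer codes clamped to [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate weights from integer codes and the group scale."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    print(to_fp16(0.00314159))
    group = [0.00314159, -0.012, 0.0071, 0.0005]  # made-up weights
    codes, scale = quantize_group_int4(group)
    print(codes, dequantize(codes, scale))
```

&lt;p&gt;With the made-up group above, the codes come out as 2, -7, 4 and 0, so 0.00314159 is reconstructed as about 0.00343 and the smallest weight collapses to zero. The scale is set by the largest weight in the group, which is why smaller groups give finer effective resolution.&lt;/p&gt;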

&lt;h3&gt;
  
  
  A note on FP8
&lt;/h3&gt;

&lt;p&gt;FP8 sits in an interesting middle ground - 8 bits, but in floating-point format rather than integer. It preserves more of the dynamic range than INT8 while using the same amount of storage. I worked on FP8 optimization at the silicon level, and it is genuinely a compelling format for inference workloads where you need better accuracy than INT8 but cannot afford FP16's memory cost. Chip designers are increasingly building native FP8 support into AI accelerators precisely because of this tradeoff (&lt;a href="https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html" rel="noopener noreferrer"&gt;NVIDIA Transformer Engine&lt;/a&gt;; &lt;a href="https://docs.qualcomm.com/nav/home/QNN_general_overview.html?product=1601111740009302" rel="noopener noreferrer"&gt;Qualcomm AI Engine Direct&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  The analogy that made it click for me
&lt;/h2&gt;

&lt;p&gt;The easiest way I found to think about quantization: it is like reducing the number of decimal places in a measurement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwkycxigxxi1rlr78br5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwkycxigxxi1rlr78br5.png" alt="Quantization = Reducing Decimal Places - FP32 at 3.14159m down to INT4 at 3m"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine you are measuring the length of a room:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FP32:&lt;/strong&gt; 3.14159 meters - laboratory precision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP16:&lt;/strong&gt; 3.142 meters - engineering precision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;INT8:&lt;/strong&gt; 3.1 meters - good enough for most construction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;INT4:&lt;/strong&gt; 3 meters - rough estimate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most purposes, "3 meters" is fine. You can buy the right amount of carpet. But if you are doing precision cabinetry, rounding to the nearest meter will produce gaps.&lt;/p&gt;

&lt;p&gt;This is exactly what happens inside a quantized model. For common, well-represented patterns, the model still produces good results. The weights do not need to be precise to the seventh decimal place because the overall pattern is strong enough. But for rare edge cases - unusual inputs, subtle distinctions, low-frequency knowledge - the model loses its ability to differentiate. The signal was in the decimals that got rounded away.&lt;/p&gt;

&lt;p&gt;It is similar to reducing image resolution. A 4K photo and a 480p version both clearly show a person's face. But zoom in on the text on their T-shirt, and the 480p version is unreadable. Common features survive compression; fine details do not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real numbers: Gemma 4 E2B on a phone
&lt;/h2&gt;

&lt;p&gt;Here is why this matters concretely. Gemma 4 E2B has 5.1 billion total parameters (2.3 billion effective, thanks to Per-Layer Embeddings). At full FP32 precision, the 2.3 billion effective parameters alone would need roughly 9 GB (2.3 billion weights times 4 bytes each), and storing all 5.1 billion would take about 20 GB - before accounting for the memory the operating system, other apps, and the inference runtime itself consume.&lt;/p&gt;

&lt;p&gt;Most phones today have 6-12 GB of total RAM, shared across everything. A model that size would leave virtually nothing for Android to run your UI, manage the camera, or keep background apps alive. The system would kill your app or crash.&lt;/p&gt;

&lt;p&gt;At INT4, the standard GPU model compresses to 2.59 GB. That fits. That runs. That is why quantization is not optional for mobile - it is a prerequisite.&lt;/p&gt;

&lt;p&gt;When we built &lt;a href="https://devpost.com/software/redacto" rel="noopener noreferrer"&gt;Redacto&lt;/a&gt;, our on-device PII redaction app for the Qualcomm x Google LiteRT Developer Hackathon 2026, we shipped the standard Gemma 4 E2B model at 2.59 GB (INT4 quantization, specifically &lt;code&gt;dynamic_wi4_afp32&lt;/code&gt; - INT4 weights with FP32 activations). The fine-tuned version of the same model, exported with different quantization granularity, came in at 4.7 GB.&lt;/p&gt;

&lt;p&gt;The performance difference was dramatic:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Standard (2.59 GB)*&lt;/th&gt;
&lt;th&gt;Fine-tuned (4.7 GB)*&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Avg latency per inference&lt;/td&gt;
&lt;td&gt;5,693 ms&lt;/td&gt;
&lt;td&gt;10,626 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;12.8 tok/s&lt;/td&gt;
&lt;td&gt;9.0 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg current draw&lt;/td&gt;
&lt;td&gt;101 mA&lt;/td&gt;
&lt;td&gt;301 mA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;*Different quantization granularity from different export pipelines - not a pure apples-to-apples comparison, but a real illustration of how model size drives hardware cost.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The larger model took about 1.9x as long per inference and drew roughly 3x the current. On a phone, power draw translates directly to battery life and thermal throttling.&lt;/p&gt;
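
&lt;p&gt;Latency and current compound. As a rough sketch, assuming the average current in the table holds for the whole duration of an inference and the supply voltage is about the same for both runs, the charge drawn per inference is current times time. The helper name is hypothetical:&lt;/p&gt;

```python
def charge_per_inference_mas(avg_current_ma, latency_ms):
    """Charge drawn during one inference, in milliamp-seconds (mA*s)."""
    return avg_current_ma * latency_ms / 1000


# Figures taken from the benchmark table above.
standard = charge_per_inference_mas(101, 5693)
fine_tuned = charge_per_inference_mas(301, 10626)
ratio = fine_tuned / standard

if __name__ == "__main__":
    print(f"standard:   {standard:.0f} mA*s")
    print(f"fine-tuned: {fine_tuned:.0f} mA*s")
    print(f"ratio:      {ratio:.1f}x")
```

&lt;p&gt;Under those assumptions the larger model costs about 5.6x the battery charge per inference, which is a bigger gap than either the latency or the current figure suggests on its own.&lt;/p&gt;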

&lt;h2&gt;
  
  
  The tradeoff I did not expect
&lt;/h2&gt;

&lt;p&gt;Here is the part nobody told me upfront. Quantization, fine-tuning, and model accuracy are locked in a three-way tradeoff:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You must quantize to run on mobile.&lt;/strong&gt; There is no way around this. FP32 models do not fit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantization degrades accuracy.&lt;/strong&gt; You are rounding billions of numbers. Some information is lost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning recovers accuracy on your specific task.&lt;/strong&gt; By sharpening the model on exactly the patterns your app needs, you compensate for what quantization blurred.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is why fine-tuning is not a luxury for on-device AI - it is part of the deployment strategy. You are already accepting precision loss from quantization. Fine-tuning lets you direct the remaining precision toward the things that matter most for your use case.&lt;/p&gt;

&lt;p&gt;In our benchmarks, on one specific domain (tactical law enforcement data), the standard model scored 63.7% entity recall while the fine-tuned model scored 76.8%. That 13-point improvement came from training the model on domain-specific examples - teaching it to use its limited precision budget on the patterns that actually matter for that task. (The overall picture is more nuanced - I cover the full comparison in a later post in this series.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Quantization schemes I encountered along the way
&lt;/h2&gt;

&lt;p&gt;When you find quantized models in the wild, here are the formats you will see most often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;dynamic_wi4_afp32&lt;/code&gt;&lt;/strong&gt; - INT4 weights, FP32 activations. This is what LiteRT-LM uses for on-device export. The weights are aggressively compressed to INT4, but activations stay at full FP32 precision during inference. This preserves more accuracy than quantizing everything (&lt;a href="https://ai.google.dev/edge/litert" rel="noopener noreferrer"&gt;LiteRT documentation&lt;/a&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GPTQ&lt;/strong&gt; - A post-training quantization method that uses calibration data to minimize the error introduced by quantization. It processes the model layer by layer and, within each layer, adjusts the weights that are not yet quantized to compensate for the rounding error of those already quantized. Widely supported by the open-source ecosystem (&lt;a href="https://arxiv.org/abs/2210.17323" rel="noopener noreferrer"&gt;Frantar et al., 2022&lt;/a&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AWQ (Activation-Aware Weight Quantization)&lt;/strong&gt; - Observes which weights matter most by looking at activation magnitudes, then protects those important weights from aggressive quantization. Often produces better quality than GPTQ at the same bit width (&lt;a href="https://arxiv.org/abs/2306.00978" rel="noopener noreferrer"&gt;Lin et al., 2023&lt;/a&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;bitsandbytes&lt;/strong&gt; - A library that provides easy 8-bit and 4-bit quantization integrated with the HuggingFace ecosystem. Commonly used for QLoRA fine-tuning, where the base model stays frozen in 4-bit precision (the QLoRA paper uses the NF4 data type) and only the small LoRA adapter trains in higher precision (&lt;a href="https://arxiv.org/abs/2305.14314" rel="noopener noreferrer"&gt;Dettmers et al., 2023&lt;/a&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each has its own tradeoffs in quality, speed, and tooling compatibility. For on-device deployment via LiteRT-LM, &lt;code&gt;dynamic_wi4_afp32&lt;/code&gt; is currently the standard path.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I took away from all this
&lt;/h2&gt;

&lt;p&gt;Quantization is not a magic trick and it is not free. It is a deliberate engineering decision: trade precision for the ability to run on constrained hardware. The skill is in understanding exactly what you are trading away and whether your application can tolerate it.&lt;/p&gt;

&lt;p&gt;For most common use cases, INT4 quantization works remarkably well. The model still understands language, still follows instructions, still generates coherent output. But the edges get soft. Rare patterns, subtle distinctions, unusual inputs - these are where you feel the loss.&lt;/p&gt;

&lt;p&gt;If your app lives on those edges, invest in fine-tuning to sharpen them back up. If your app handles common cases, INT4 out of the box might be all you need.&lt;/p&gt;

&lt;p&gt;Either way, now you know what the acronym actually means.&lt;/p&gt;




&lt;h3&gt;
  
  
  Sources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ai.google.dev/gemma/docs/core/model_card_4" rel="noopener noreferrer"&gt;Gemma 4 Model Card&lt;/a&gt; - PLE architecture, parameter counts&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ai.google.dev/edge/litert" rel="noopener noreferrer"&gt;LiteRT Documentation&lt;/a&gt; - on-device quantization formats&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2210.17323" rel="noopener noreferrer"&gt;GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers&lt;/a&gt; - Frantar et al., 2022&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2306.00978" rel="noopener noreferrer"&gt;AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration&lt;/a&gt; - Lin et al., 2023&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2305.14314" rel="noopener noreferrer"&gt;QLoRA: Efficient Finetuning of Quantized LLMs&lt;/a&gt; - Dettmers et al., 2023&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html" rel="noopener noreferrer"&gt;NVIDIA Transformer Engine FP8 Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.qualcomm.com/nav/home/QNN_general_overview.html?product=1601111740009302" rel="noopener noreferrer"&gt;Qualcomm AI Engine Direct Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Benchmark data: &lt;a href="https://devpost.com/software/redacto" rel="noopener noreferrer"&gt;Redacto project&lt;/a&gt;, Samsung Galaxy S25 Ultra (Snapdragon 8 Elite, SM8750)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware. Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Last updated: May 2026&lt;/em&gt;&lt;br&gt;
&lt;em&gt;4th of 22 posts in the "Edge AI from the Trenches" series&lt;/em&gt;&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>mobile</category>
      <category>performance</category>
    </item>
    <item>
      <title>What I Learned Untangling LiteRT, LiteRT-LM, and TFLite</title>
      <dc:creator>Jaydeep Shah (JD)</dc:creator>
      <pubDate>Sun, 17 May 2026 23:15:00 +0000</pubDate>
      <link>https://dev.to/jdshah/what-i-learned-untangling-litert-litert-lm-and-tflite-4bho</link>
      <guid>https://dev.to/jdshah/what-i-learned-untangling-litert-litert-lm-and-tflite-4bho</guid>
      <description>&lt;p&gt;When we started building &lt;a href="https://devpost.com/software/redacto" rel="noopener noreferrer"&gt;Redacto&lt;/a&gt; - an on-device PII redaction app running Gemma 4 E2B on a Snapdragon 8 Elite - we kept tripping over three names: TFLite, LiteRT, and LiteRT-LM. Google's own docs sometimes use them interchangeably. Forum posts and community discussions mix them freely. We invested a good amount of time trying to figure out if they were the same thing, different versions, or completely separate tools.&lt;/p&gt;

&lt;p&gt;Here is the distinction I wish someone had spelled out for me on day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The name that kept following me around: TFLite
&lt;/h2&gt;

&lt;p&gt;TensorFlow Lite (TFLite) was Google's original on-device inference runtime, &lt;a href="https://android-developers.googleblog.com/2017/05/whats-new-in-android-o-developer.html" rel="noopener noreferrer"&gt;announced at Google I/O 2017&lt;/a&gt; as the mobile companion to TensorFlow. It ran classical ML models - image classification, object detection, pose estimation - on phones and embedded devices. It consumed &lt;code&gt;.tflite&lt;/code&gt; model files: small, optimized graphs dispatched to CPU, GPU, or specialized accelerators.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rebrand that confused everyone: LiteRT
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://developers.googleblog.com/en/tensorflow-lite-is-now-litert/" rel="noopener noreferrer"&gt;September 2024, Google renamed TensorFlow Lite to LiteRT&lt;/a&gt; (short for "Lite Runtime"). The core runtime, APIs, and &lt;code&gt;.tflite&lt;/code&gt; file format stayed the same. What changed was branding: LiteRT is no longer tied to TensorFlow. You can convert models from TensorFlow, PyTorch, JAX, or other frameworks. The old name implied a dependency that no longer existed.&lt;/p&gt;

&lt;p&gt;The migration is still ongoing - you will find both names in code, packages, and docs. If you see the &lt;code&gt;org.tensorflow.lite&lt;/code&gt; package in your imports (or &lt;code&gt;org.tensorflow:tensorflow-lite&lt;/code&gt; in a Gradle dependency) and &lt;code&gt;LiteRT&lt;/code&gt; in Google's marketing, they are the same thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one that is actually different: LiteRT-LM
&lt;/h2&gt;

&lt;p&gt;This is where I got tripped up the longest. LiteRT-LM is not a rebrand. It is a &lt;a href="https://ai.google.dev/edge/litert-lm/overview" rel="noopener noreferrer"&gt;separate runtime&lt;/a&gt; for running large language models on device, built on top of LiteRT. It adds capabilities the base runtime does not have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conversation management&lt;/strong&gt; - system prompts, user turns, multi-turn history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization&lt;/strong&gt; - BPE/SentencePiece tokenizer bundled with the model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat template handling&lt;/strong&gt; - model-specific formatting (like Gemma's &lt;code&gt;&amp;lt;start_of_turn&amp;gt;&lt;/code&gt; tags)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming token output&lt;/strong&gt; - callback-based delivery so your app shows text as it generates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sampling configuration&lt;/strong&gt; - temperature, top-k, top-p controls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LiteRT-LM consumes &lt;code&gt;.litertlm&lt;/code&gt; files, not &lt;code&gt;.tflite&lt;/code&gt; files. A &lt;code&gt;.litertlm&lt;/code&gt; file is a compiled bundle: quantized weights, tokenizer, chat template, and an execution graph optimized for a specific device.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned the hard way: they are not interchangeable
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;You &lt;strong&gt;cannot&lt;/strong&gt; run a &lt;code&gt;.litertlm&lt;/code&gt; file with plain LiteRT or TFLite. The base runtime has no concept of conversations, tokenizers, or streaming callbacks.&lt;/li&gt;
&lt;li&gt;You &lt;strong&gt;cannot&lt;/strong&gt; run a &lt;code&gt;.tflite&lt;/code&gt; model with LiteRT-LM. The LLM runtime expects the bundled tokenizer and conversation-aware execution graph that only &lt;code&gt;.litertlm&lt;/code&gt; provides.&lt;/li&gt;
&lt;/ul&gt;
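
&lt;p&gt;The pairing between file format and runtime is strict enough to write down as a lookup. This is only a sketch of the rule above in Python: &lt;code&gt;runtime_for&lt;/code&gt; is a hypothetical helper, not an API from LiteRT or LiteRT-LM, and it looks at nothing but the file extension.&lt;/p&gt;

```python
from pathlib import PurePath

# Which runtime consumes which model file format, per the rules above.
RUNTIME_BY_SUFFIX = {
    ".tflite": "LiteRT",
    ".litertlm": "LiteRT-LM",
}


def runtime_for(model_path):
    """Return the runtime name for a model file, based on its extension only."""
    suffix = PurePath(model_path).suffix.lower()
    try:
        return RUNTIME_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"no known on-device runtime for {suffix!r}") from None


if __name__ == "__main__":
    # File names here are made up for illustration.
    print(runtime_for("detector.tflite"))
    print(runtime_for("chat-model.litertlm"))
```

&lt;p&gt;A check like this at app startup turns a confusing load failure into a clear error message.&lt;/p&gt;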

&lt;p&gt;They solve different problems. LiteRT runs classical ML inference. LiteRT-LM runs LLM inference with conversational scaffolding. In Redacto, we use LiteRT-LM exclusively - our pipeline sends system prompts through a multi-step conversation chain, with streaming callbacks and sampling configuration that do not exist in base LiteRT.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack that made it click
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flx4iugrizdoxpw674xed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flx4iugrizdoxpw674xed.png" alt="The On-Device AI Stack - App → MediaPipe/LiteRT-LM → LiteRT → Hardware delegates" width="700" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LiteRT-LM and MediaPipe both sit on top of LiteRT, but they serve different purposes and do not overlap. MediaPipe provides high-level task APIs (face detection, image segmentation) that use LiteRT as the engine underneath. LiteRT-LM provides conversational LLM inference. For a deeper runtime comparison including llama.cpp, ONNX Runtime, and ExecuTorch, see my earlier post on the HuggingFace-to-phone pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TFLite&lt;/strong&gt; = the old name for Google's on-device ML runtime. Being replaced by LiteRT branding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteRT&lt;/strong&gt; = TFLite renamed, with broader framework support. Same runtime, same &lt;code&gt;.tflite&lt;/code&gt; format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteRT-LM&lt;/strong&gt; = a separate runtime for LLMs, built on LiteRT. Different file format (&lt;code&gt;.litertlm&lt;/code&gt;), different capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you need an LLM on device, you want LiteRT-LM. If you need a classifier or detector, you want LiteRT. If you see "TFLite" in code, it is the old name for LiteRT.&lt;/p&gt;

&lt;p&gt;The naming is confusing. But once you see the stack, it clicks.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developers.googleblog.com/en/tensorflow-lite-is-now-litert/" rel="noopener noreferrer"&gt;TensorFlow Lite is now LiteRT - Google Developers Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/edge/litert" rel="noopener noreferrer"&gt;LiteRT overview - Google AI Edge&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/edge/litert-lm/overview" rel="noopener noreferrer"&gt;LiteRT-LM overview - Google AI Edge&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/google-ai-edge/LiteRT-LM" rel="noopener noreferrer"&gt;LiteRT-LM on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware. Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Last updated: May 2026&lt;/em&gt;&lt;br&gt;
&lt;em&gt;3rd of 22 posts in the "Edge AI from the Trenches" series&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gemmachallenge</category>
      <category>edgeai</category>
      <category>android</category>
      <category>litertlm</category>
    </item>
    <item>
      <title>What I Learned Turning a HuggingFace Model Into Something My Phone Can Run</title>
      <dc:creator>Jaydeep Shah (JD)</dc:creator>
      <pubDate>Sun, 17 May 2026 08:10:00 +0000</pubDate>
      <link>https://dev.to/jdshah/what-i-learned-turning-a-huggingface-model-into-something-my-phone-can-run-53cj</link>
      <guid>https://dev.to/jdshah/what-i-learned-turning-a-huggingface-model-into-something-my-phone-can-run-53cj</guid>
      <description>&lt;p&gt;You found a model on HuggingFace. It looks promising - maybe Gemma, maybe Llama, maybe something smaller. You want to run it on a phone. You click "Download," and then... what? The file is 5 GB of &lt;code&gt;.safetensors&lt;/code&gt; splits. There is no APK, no &lt;code&gt;.tflite&lt;/code&gt;, no obvious next step. The HuggingFace README says "Usage: &lt;code&gt;model = AutoModelForCausalLM.from_pretrained(...)&lt;/code&gt;" - a Python API that does not exist on Android.&lt;/p&gt;

&lt;p&gt;This was exactly where I got stuck when I started building &lt;a href="https://devpost.com/software/redacto" rel="noopener noreferrer"&gt;Redacto&lt;/a&gt; - an on-device PII redaction app running Gemma 4 E2B entirely on a Galaxy S25 Ultra. The distance between "I found a good model" and "it runs on my phone" turned out to be much bigger than I expected. Here is what I learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is actually inside a HuggingFace repo
&lt;/h2&gt;

&lt;p&gt;The first thing I had to understand was what I was actually downloading. HuggingFace model repos for LLMs are not just a single weights file - they contain an entire ecosystem of artifacts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F556e0ahzor5c2ym8efab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F556e0ahzor5c2ym8efab.png" alt="HuggingFace file listing for google/gemma-4-E2B - raw weights at 10.2 GB" width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is what I found inside a typical LLM repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model weights&lt;/strong&gt; (&lt;code&gt;.safetensors&lt;/code&gt; or &lt;code&gt;.bin&lt;/code&gt; files) - the learned parameters. For Gemma 4 E2B, this is a single 10.2 GB file. Larger models split weights across multiple files (&lt;code&gt;model-00001-of-00004.safetensors&lt;/code&gt;, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;config.json&lt;/code&gt;&lt;/strong&gt; - the model architecture definition: number of layers, hidden dimensions, attention heads, vocabulary size. This is the blueprint the runtime needs to reconstruct the model's computational graph.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokenizer files&lt;/strong&gt; (&lt;code&gt;tokenizer.json&lt;/code&gt;, &lt;code&gt;tokenizer_config.json&lt;/code&gt;, &lt;code&gt;tokenizer.model&lt;/code&gt;) - the mapping between text and token IDs. The model does not see words; it sees integers. The tokenizer defines how "Mrs. Chen" becomes &lt;code&gt;[4521, 18, 9832]&lt;/code&gt; (or whatever the model's vocabulary dictates).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;chat_template&lt;/code&gt;&lt;/strong&gt; (embedded in &lt;code&gt;tokenizer_config.json&lt;/code&gt; or as a standalone &lt;code&gt;.jinja&lt;/code&gt; file) - a Jinja2 template that wraps your messages into the format the model was trained on. Gemma expects &lt;code&gt;&amp;lt;start_of_turn&amp;gt;user\n...&amp;lt;end_of_turn&amp;gt;&lt;/code&gt;. Llama expects &lt;code&gt;[INST]...[/INST]&lt;/code&gt;. Get this wrong and the model produces garbage without any error message.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model card&lt;/strong&gt; (&lt;code&gt;README.md&lt;/code&gt;) - documentation covering training data, intended use, limitations, license, and benchmark scores.&lt;/li&gt;
&lt;/ul&gt;
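
&lt;p&gt;When I look at an unfamiliar repo, I find it useful to sort the file listing into these roles first. A minimal Python sketch of that triage follows; the function name and the example listing are hypothetical, and the rules only cover the file types described above.&lt;/p&gt;

```python
def classify_repo_file(name):
    """Map a file name from a model repo to the role it plays."""
    if name.endswith((".safetensors", ".bin")):
        return "weights"
    if name == "config.json":
        return "architecture"
    if name.startswith("tokenizer"):
        return "tokenizer"
    if name.endswith(".jinja"):
        return "chat template"
    if name == "README.md":
        return "model card"
    return "other"


if __name__ == "__main__":
    # Made-up listing for illustration, not copied from a real repo.
    listing = [
        "model.safetensors",
        "config.json",
        "tokenizer.json",
        "tokenizer_config.json",
        "chat_template.jinja",
        "README.md",
    ]
    for name in listing:
        print(f"{name}: {classify_repo_file(name)}")
```

&lt;p&gt;If the weights, architecture, or tokenizer role comes back empty, the repo is probably a raw checkpoint and needs more work before any export tool can use it.&lt;/p&gt;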

&lt;p&gt;Not all repositories are the same. Some are raw research checkpoints with no tokenizer. Some include GGUF quantized versions alongside the originals. I learned quickly that for on-device deployment, you want repositories that have already been prepared for your target runtime.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I could not just download and run
&lt;/h2&gt;

&lt;p&gt;This was my first real lesson. Those &lt;code&gt;.safetensors&lt;/code&gt; files are PyTorch-format tensors at full or half precision. They are designed to be loaded into GPU memory on a workstation running Python. A phone cannot use them directly, and it took me a while to understand why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Size.&lt;/strong&gt; Gemma 4 E2B at FP32 is roughly 10 GB. A phone with 12 GB of RAM cannot load that while also running the OS, the app, and everything else. You need quantization - compressing weights from 32-bit floats to 4-bit integers - to get the model down to a size that fits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Format.&lt;/strong&gt; Android does not have a PyTorch runtime. The phone's CPU, GPU, and NPU each have their own instruction sets and memory layouts. The model's computational graph needs to be compiled into operations that these hardware targets can execute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtime.&lt;/strong&gt; An LLM is not just a forward pass. You need conversation management (tracking turns), tokenization (text to integers and back), KV-cache management (storing previous computations so generation does not recompute from scratch every token), and streaming (delivering tokens one at a time for responsive UX). The raw weights have none of this.&lt;/p&gt;
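
&lt;p&gt;The KV-cache is worth a quick estimate because it competes with the weights for the same RAM. A common back-of-the-envelope formula multiplies keys and values (a factor of 2) by layers, KV heads, head dimension, sequence length, and bytes per value. The sketch below uses that formula; the function name is hypothetical, and the dimensions in the example are made up for illustration, not Gemma 4 E2B's actual configuration.&lt;/p&gt;

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys and values (the factor of 2) cached for every layer and position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model: 30 layers, 4 KV heads of size 256, FP16 cache values.
    size = kv_cache_bytes(num_layers=30, num_kv_heads=4, head_dim=256, seq_len=4096)
    print(f"{size / 1e6:.0f} MB for a 4096-token context")
```

&lt;p&gt;For these made-up dimensions the cache is about 503 MB at a 4096-token context, and it grows linearly with context length. Runtimes can shrink it with techniques such as sliding-window attention or cache quantization, which this estimate ignores.&lt;/p&gt;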

&lt;p&gt;Once I understood these three gaps, the compilation pipeline started to make sense.&lt;/p&gt;




&lt;h2&gt;
  
  
  The pipeline I had to learn
&lt;/h2&gt;

&lt;p&gt;Getting from HuggingFace to a phone-ready model turned out to be a multi-stage process:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftptmsw32zfso30mcz2c1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftptmsw32zfso30mcz2c1.png" alt="The compilation pipeline: HuggingFace → Quantize → Export → .litertlm → Device" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Quantization - making it fit
&lt;/h3&gt;

&lt;p&gt;Quantization converts the model's weights from high-precision floating point (FP32 or FP16) to lower-precision integers (INT8 or INT4). It is a lossy compression step - the model loses some accuracy, but the file shrinks dramatically.&lt;/p&gt;

&lt;p&gt;For Gemma 4 E2B, INT4 quantization (specifically &lt;code&gt;dynamic_wi4_afp32&lt;/code&gt; - INT4 weights with FP32 activations) brings the model from ~10 GB down to ~2.58 GB. That is the difference between "impossible on a phone" and "fits in memory with room for the app."&lt;/p&gt;
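&lt;p&gt;The arithmetic behind these sizes is simple enough to sketch. The snippet below estimates raw weight storage as parameters times bits per weight, using a hypothetical 1-billion-parameter model rather than Gemma's real configuration. The helper name is my own.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_size_gb(num_params, precision):
    """Raw weight storage in decimal gigabytes: parameters x bits / 8 bytes."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


if __name__ == "__main__":
    params = 1_000_000_000  # hypothetical 1B-parameter model
    for precision in BITS_PER_WEIGHT:
        print(f"{precision}: {weight_size_gb(params, precision):.2f} GB")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For the hypothetical model this works out to 4 GB at FP32 and 0.5 GB at INT4, an 8x reduction; going from 16-bit to INT4 is 4x. A real bundle typically comes out somewhat larger than the raw estimate, because it also carries quantization metadata, the tokenizer, and the execution graph.&lt;/p&gt;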

&lt;p&gt;Quantization is not optional for mobile. It is a hard requirement. This was not obvious to me at first - I kept looking for ways around it before accepting that every on-device model goes through this step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Export and compilation - the tool I did not know existed
&lt;/h3&gt;

&lt;p&gt;This is where I discovered &lt;a href="https://pypi.org/project/litert-torch/" rel="noopener noreferrer"&gt;&lt;code&gt;litert-torch&lt;/code&gt;&lt;/a&gt; - a pip-installable package from Google that takes a HuggingFace model and produces a &lt;code&gt;.litertlm&lt;/code&gt; file. During export, the tool:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads the model architecture from &lt;code&gt;config.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Loads and quantizes the weights&lt;/li&gt;
&lt;li&gt;Embeds the tokenizer&lt;/li&gt;
&lt;li&gt;Embeds the chat template (from &lt;code&gt;tokenizer_config.json&lt;/code&gt; or an override)&lt;/li&gt;
&lt;li&gt;Compiles the computational graph into LiteRT operations&lt;/li&gt;
&lt;li&gt;Packages everything into a single &lt;code&gt;.litertlm&lt;/code&gt; bundle&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The official export command from &lt;a href="https://ai.google.dev/edge/litert-lm/models/gemma-4" rel="noopener noreferrer"&gt;Google's Gemma 4 documentation&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;litert-torch export_hf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;google/gemma-4-E2B-it &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/gemma4_2b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--externalize_embedder&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--jinja_chat_template_override&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;litert-community/gemma-4-E2B-it-litert-lm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--model&lt;/code&gt; flag takes a HuggingFace model ID or local path. &lt;code&gt;--externalize_embedder&lt;/code&gt; separates the embedding table for memory efficiency. The &lt;code&gt;--jinja_chat_template_override&lt;/code&gt; points to a known-compatible chat template - this flag exists because some model templates use Jinja features that the on-device parser does not support. I learned this the hard way, and I cover that story in a later post in this series.&lt;/p&gt;

&lt;p&gt;For fine-tuned models, you can point &lt;code&gt;--model&lt;/code&gt; at a local directory and add quantization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;litert-torch export_hf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./my_finetuned_model &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./output &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--externalize_embedder&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quantization_recipe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dynamic_wi4_afp32 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--jinja_chat_template_override&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;litert-community/gemma-4-E2B-it-litert-lm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
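&lt;p&gt;Before pointing &lt;code&gt;--model&lt;/code&gt; at a local directory, a quick preflight check can save a failed export. This is a minimal sketch that looks for the files the export steps above read from. The exact set a given &lt;code&gt;litert-torch&lt;/code&gt; version requires is an assumption on my part, and &lt;code&gt;missing_export_inputs&lt;/code&gt; is a hypothetical helper, not part of the tool.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

# Files the export steps read from; the exact required set is an assumption.
EXPECTED_FILES = ("config.json", "tokenizer_config.json")


def missing_export_inputs(model_dir):
    """Return expected files that are absent, plus a note if no weights are found."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    if not list(root.glob("*.safetensors")):
        missing.append("*.safetensors (weight shards)")
    return missing


if __name__ == "__main__":
    problems = missing_export_inputs("./my_finetuned_model")
    print("ready to export" if not problems else f"missing: {problems}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An empty list only means the files exist. It says nothing about whether the chat template inside &lt;code&gt;tokenizer_config.json&lt;/code&gt; is one the on-device parser can handle.&lt;/p&gt;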



&lt;h3&gt;
  
  
  Device-specific variants - one model, two files
&lt;/h3&gt;

&lt;p&gt;This was a surprise. The exported &lt;code&gt;.litertlm&lt;/code&gt; runs on CPU and GPU out of the box. But if you want to target the NPU (Neural Processing Unit) - which on a Snapdragon 8 Elite delivers 41.7 tok/s versus 24.5 tok/s on GPU - you need a second compilation step.&lt;/p&gt;

&lt;p&gt;The NPU variant goes through the Qualcomm QNN toolchain, which compiles certain operations into &lt;code&gt;DISPATCH_OP&lt;/code&gt; custom ops that run directly on the Hexagon V79 DSP. This produces a separate, larger &lt;code&gt;.litertlm&lt;/code&gt; file that is tied to a specific chip.&lt;/p&gt;

&lt;p&gt;The standard GPU/CPU file works across all ARM64 Android devices. The NPU file works only on the exact SoC it was compiled for. I did not expect to need two different model files for what is technically the same model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Push to device - the easy part
&lt;/h3&gt;

&lt;p&gt;The final step turned out to be the simplest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;adb push gemma-4-E2B-it.litertlm &lt;span class="se"&gt;\&lt;/span&gt;
  /sdcard/Android/data/com.example.redacto/files/gemma4.litertlm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model lives in the app-specific directory on external storage. On first load, LiteRT-LM parses the bundle, sets up the KV-cache, initializes the appropriate hardware delegate, and for NPU generates an AOT (ahead-of-time) compilation cache. Cold start takes about 10 seconds for GPU and 14 seconds for NPU. With the AOT cache in place, subsequent launches drop to around 2 seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  How it actually went with Redacto
&lt;/h2&gt;

&lt;p&gt;For Redacto, we did not run the export pipeline ourselves for the production model. This is something I wish I had known earlier: Google's &lt;code&gt;litert-community&lt;/code&gt; organization on HuggingFace publishes pre-compiled &lt;code&gt;.litertlm&lt;/code&gt; files for popular models, including Gemma 4 E2B.&lt;/p&gt;

&lt;p&gt;We downloaded from &lt;a href="https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm" rel="noopener noreferrer"&gt;&lt;code&gt;litert-community/gemma-4-E2B-it-litert-lm&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0q86kj3zbwof3o7kv41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0q86kj3zbwof3o7kv41.png" alt="litert-community repo - compiled .litertlm files ready for deployment" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobjb7guqeamx30jxl1l2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobjb7guqeamx30jxl1l2.png" alt="Model tree showing lineage: google/gemma-4-E2B → instruction-tuned → litert-community compiled" width="647" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Target Hardware&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma-4-E2B-it.litertlm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2.59 GB&lt;/td&gt;
&lt;td&gt;CPU/GPU (all ARM64 Android devices)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma-4-E2B-it_qualcomm_sm8750.litertlm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3.02 GB&lt;/td&gt;
&lt;td&gt;Snapdragon 8 Elite NPU only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two files. Same model. Same weights at the same precision. The size difference (~430 MB) comes from the QNN-compiled custom ops embedded in the NPU variant. Those ops will crash if you try to run them on a GPU - and the GPU variant cannot dispatch to the NPU. They are not interchangeable.&lt;/p&gt;
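&lt;p&gt;Because the two files are not interchangeable, a deployment script has to pick one per device. Here is a minimal sketch of that choice, assuming the device reports its SoC as a string such as &lt;code&gt;SM8750&lt;/code&gt; (for example through the &lt;code&gt;ro.soc.model&lt;/code&gt; system property, which may not be populated on every device). The mapping and the helper name are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;GENERIC_FILE = "gemma-4-E2B-it.litertlm"

# SoC identifier (lowercase) mapped to its NPU-specific bundle.
NPU_FILES = {
    "sm8750": "gemma-4-E2B-it_qualcomm_sm8750.litertlm",
}


def pick_model_file(soc_model):
    """Return the NPU bundle for a known SoC, else the generic CPU/GPU bundle."""
    return NPU_FILES.get(soc_model.strip().lower(), GENERIC_FILE)


if __name__ == "__main__":
    for soc in ("SM8750", "sm8650", ""):
        print(repr(soc), "maps to", pick_model_file(soc))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Falling back to the generic file for anything unrecognized is the safe default, since that file runs on any ARM64 Android device.&lt;/p&gt;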

&lt;p&gt;We did, however, run &lt;code&gt;litert-torch export_hf&lt;/code&gt; ourselves when we fine-tuned the model. That is when we hit the chat template trap: &lt;code&gt;tokenizer.save_pretrained()&lt;/code&gt; bundled the HuggingFace-native Jinja template (which uses &lt;code&gt;map.get()&lt;/code&gt;), and LiteRT-LM's on-device template parser does not support that function. The model loaded, the tokenizer initialized, and then inference produced garbage. No error, no crash - just wrong output. We had to manually swap the template with an older compatible version before re-exporting. (I cover this trap in detail in a later post in this series.)&lt;/p&gt;




&lt;h2&gt;
  
  
  What I learned about the runtime landscape
&lt;/h2&gt;

&lt;p&gt;As I went through this process, I also had to figure out where LiteRT-LM fits among the other on-device inference options. Here is the comparison I wish I had found earlier:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;th&gt;Model Format&lt;/th&gt;
&lt;th&gt;Hardware Targets&lt;/th&gt;
&lt;th&gt;LLM Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LiteRT-LM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLMs on Android/iOS&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.litertlm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CPU, GPU, NPU (Qualcomm, MediaTek)&lt;/td&gt;
&lt;td&gt;Tokenizer, chat template, KV-cache, streaming, conversation management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TFLite / LiteRT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Classical ML (image, audio, NLP)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.tflite&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CPU, GPU, NPU, Edge TPU&lt;/td&gt;
&lt;td&gt;None (no tokenizer, no chat, no streaming)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ONNX Runtime&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cross-platform inference&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.onnx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CPU, GPU (DirectML, CUDA), mobile accelerators via execution providers (NNAPI, QNN, Core ML)&lt;/td&gt;
&lt;td&gt;Limited (community extensions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;llama.cpp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM inference, CPU-focused&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.gguf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CPU (NEON/AVX), some GPU (Metal, CUDA, Vulkan)&lt;/td&gt;
&lt;td&gt;Tokenizer, chat template, KV-cache, streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MediaPipe&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ML pipelines for media tasks&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.tflite&lt;/code&gt; + config&lt;/td&gt;
&lt;td&gt;CPU, GPU&lt;/td&gt;
&lt;td&gt;LLM Inference API (wraps LiteRT under the hood)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The distinctions that mattered most for my use case:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteRT-LM vs TFLite/LiteRT.&lt;/strong&gt; I initially confused these. TFLite (now rebranded as LiteRT) handles classical ML models - image classifiers, object detectors. LiteRT-LM is built on top of LiteRT specifically for LLMs. You cannot run a &lt;code&gt;.litertlm&lt;/code&gt; file with the TFLite interpreter, and you cannot run a &lt;code&gt;.tflite&lt;/code&gt; model with LiteRT-LM. Same infrastructure, different model types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteRT-LM vs llama.cpp.&lt;/strong&gt; llama.cpp is excellent for CPU-based inference on laptops and desktops. On phones, LiteRT-LM's advantage is its deep integration with vendor-specific NPU delegates - on Snapdragon 8 Elite, NPU inference through LiteRT-LM runs at 41.7 tok/s versus the ~10-15 tok/s range typical of CPU-only execution for a model this size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteRT-LM vs MediaPipe.&lt;/strong&gt; Google's MediaPipe provides a higher-level LLM Inference API that wraps LiteRT under the hood. Simpler API, less control. LiteRT-LM gives you more control over engine initialization, backend selection, and sampling configuration - which I needed for our multi-step redaction pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  The mental model that made it click
&lt;/h2&gt;

&lt;p&gt;If I had to summarize everything I learned in one sentence: a HuggingFace model repository is the source code, and a &lt;code&gt;.litertlm&lt;/code&gt; file is the compiled binary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbw3u0rr2e1vvi3v8jj7j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbw3u0rr2e1vvi3v8jj7j.png" alt="The Mental Model: main.c → gcc → program.exe parallels .safetensors → litert-torch → .litertlm" width="740" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You would not try to run a &lt;code&gt;.c&lt;/code&gt; file on a microcontroller without compiling it first. You would not try to load &lt;code&gt;.safetensors&lt;/code&gt; on a phone. Both need a compilation step that transforms human-friendly source into machine-friendly executable. The difference is that model compilation also includes quantization (compression), hardware specialization (targeting specific silicon), and runtime packaging (embedding the tokenizer, chat template, and execution graph).&lt;/p&gt;

&lt;p&gt;Once this clicked for me, the rest of the on-device AI stack made sense. The &lt;code&gt;.litertlm&lt;/code&gt; file is not just a model - it is a self-contained inference package that knows how to tokenize input, format conversations, run the forward pass on your target hardware, and stream output back to your app.&lt;/p&gt;

&lt;p&gt;Finding the right model on HuggingFace is step one. Getting it to your phone is the actual engineering.&lt;/p&gt;




&lt;h3&gt;
  
  
  Related in this series
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jdshah/what-i-learned-untangling-litert-litert-lm-and-tflite-4bho"&gt;What I Learned Untangling LiteRT, LiteRT-LM, and TFLite&lt;/a&gt; - clarifies the naming confusion in the stack this post describes&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jdshah/fp32-int4-and-everything-between-what-i-learned-about-precision-on-mobile-3mgc"&gt;FP32, INT4, and Everything Between&lt;/a&gt; - what quantization does and why it is mandatory for mobile&lt;/li&gt;
&lt;li&gt;What's Inside a &lt;code&gt;.litertlm&lt;/code&gt; File? - deep dive into the compiled bundle at the end of the pipeline&lt;/li&gt;
&lt;li&gt;The Chat Template Trap - what happens when the export pipeline embeds an incompatible template&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware. &lt;br&gt;
Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/hub/models" rel="noopener noreferrer"&gt;HuggingFace Model Hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm" rel="noopener noreferrer"&gt;litert-community/gemma-4-E2B-it-litert-lm&lt;/a&gt; - pre-compiled .litertlm files&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ai.google.dev/edge/litert-lm/models/gemma-4" rel="noopener noreferrer"&gt;Gemma 4 export documentation&lt;/a&gt; - official export commands&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/litert-torch/" rel="noopener noreferrer"&gt;litert-torch on PyPI&lt;/a&gt; - the export tool&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/edge/litert-lm/overview" rel="noopener noreferrer"&gt;LiteRT-LM overview&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Last updated: May 2026&lt;/em&gt;&lt;br&gt;
&lt;em&gt;2nd of 22 posts in the "Edge AI from the Trenches" series&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gemmachallenge</category>
      <category>edgeai</category>
      <category>android</category>
      <category>litertlm</category>
    </item>
    <item>
      <title>What I Learned by Dissecting gemma-4-E2B-it_qualcomm_sm8750.litertlm</title>
      <dc:creator>Jaydeep Shah (JD)</dc:creator>
      <pubDate>Sun, 17 May 2026 05:00:00 +0000</pubDate>
      <link>https://dev.to/jdshah/what-i-learned-by-dissecting-gemma-4-e2b-itqualcommsm8750litertlm-o85</link>
      <guid>https://dev.to/jdshah/what-i-learned-by-dissecting-gemma-4-e2b-itqualcommsm8750litertlm-o85</guid>
      <description>&lt;p&gt;I recently started exploring on-device AI inference, and honestly, the initial experience was overwhelming. Hundreds of models on HuggingFace, unfamiliar architecture names, quantization formats, chip-specific variants - it felt like drinking from a firehose. When I first saw a filename like &lt;code&gt;gemma-4-E2B-it_qualcomm_sm8750.litertlm&lt;/code&gt;, it looked like alphabet soup.&lt;/p&gt;

&lt;p&gt;But as I dug deeper - reading model cards, building an actual app, benchmarking on real hardware - each piece of that filename started to make sense. Every segment encodes a specific decision about the model's lineage, its architecture, how it was trained, what chip it was compiled for, and what runtime will execute it.&lt;/p&gt;

&lt;p&gt;This post breaks that filename apart, piece by piece, so you do not have to go through the same confusion I did.&lt;/p&gt;

&lt;h2&gt;
  
  
  The anatomy of the name
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7in7feud6np0mlui8jz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7in7feud6np0mlui8jz.png" alt="Anatomy of a Model Filename" width="800" height="213"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;gemma&lt;/code&gt; - the model family
&lt;/h2&gt;

&lt;p&gt;Gemma is Google's family of open-weight large language models. "Open-weight" means Google publishes the trained model weights so you can download, deploy, fine-tune, and build on them without per-token API fees or cloud dependencies.&lt;/p&gt;

&lt;p&gt;But Gemma is not one model - it is four generations of architectural evolution, each shifting what "small enough for a phone" actually means. Understanding this lineage matters because the generation determines the architecture, the context window, the chat template format, and ultimately what your on-device app can do.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Gemma family tree
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87iqmpftde2hn2hc9opq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87iqmpftde2hn2hc9opq.png" alt="The Gemma Family: 4 Generations of Evolution" width="800" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few things worth noticing in this evolution:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attention mechanisms evolved with each generation.&lt;/strong&gt; Gemma 1 used MQA on the small model (fewer key-value heads = less memory) and standard MHA on the large model. Gemma 2 unified everything to GQA - a middle ground where key-value heads are shared across groups of query heads, reducing KV-cache size without the quality loss of MQA. Gemma 2 also began alternating sliding-window and global attention layers, a direct response to the memory bandwidth problem during autoregressive decoding - the exact bottleneck that matters on a phone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The licensing change in Gemma 4 is significant.&lt;/strong&gt; Generations 1-3 used Google's custom "source-available" license with usage restrictions. Gemma 4 moved to Apache 2.0 - fully permissive, no usage restrictions. This matters for commercial on-device apps: you can ship Gemma 4 in a production app without worrying about license compliance beyond standard Apache 2.0 attribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The context window grew dramatically.&lt;/strong&gt; It went from 8K (Gemma 1-2) to 128K (Gemma 3), and in Gemma 4 to 128K on edge models and 256K on cloud models. The E2B model we run on the Galaxy S25 Ultra has a 128K context window - the same as Gemma 3. The 256K window is reserved for the heavyweight 26B MoE and 31B dense models. On-device, you rarely use the full window anyway - a 4K KV-cache is common for mobile deployment - but the larger training window means the model understands longer documents even when you truncate at inference time.&lt;/p&gt;
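&lt;p&gt;Both the attention variant and the context length feed directly into KV-cache memory. The sketch below uses the standard estimate (keys plus values, per layer, per key-value head, per position) with a hypothetical model shape - these are not Gemma's real layer counts or head sizes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys and values (factor of 2) stored for every layer, KV head, and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model shape, not Gemma's real configuration.
    layers, head_dim = 30, 256
    for label, kv_heads in (("MHA", 8), ("GQA", 2), ("MQA", 1)):
        for seq_len in (4_096, 131_072):
            gib = kv_cache_bytes(layers, kv_heads, head_dim, seq_len) / 2**30
            print(f"{label} kv_heads={kv_heads} context={seq_len}: {gib:.2f} GiB")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The cache scales linearly with both the number of key-value heads and the sequence length, which is why a 4K cache is a practical mobile setting even when the model was trained with a 128K window. Sliding-window layers would shrink the figure further, since they only keep the most recent positions; this estimate ignores that.&lt;/p&gt;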


&lt;h2&gt;
  
  
  &lt;code&gt;4&lt;/code&gt; - the generation
&lt;/h2&gt;

&lt;p&gt;The number tells you which generation of the Gemma family this model belongs to. Higher generation means better quality at the same parameter count - improved architecture, better training data, and lessons from prior generations baked in.&lt;/p&gt;

&lt;p&gt;For on-device developers, the generation also determines which &lt;strong&gt;chat template&lt;/strong&gt; the model expects. Gemma 3 and 4 use &lt;code&gt;&amp;lt;start_of_turn&amp;gt;user\n...&amp;lt;end_of_turn&amp;gt;&lt;/code&gt; formatting. Older generations use different markup. Getting this wrong does not produce an error - it produces garbage output. (More on this silent failure mode in a future post on chat templates.)&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;code&gt;E2B&lt;/code&gt; - effective 2 billion parameters
&lt;/h2&gt;

&lt;p&gt;This is the most misunderstood part of the name, and it is worth getting right.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;E&lt;/code&gt; is &lt;strong&gt;not&lt;/strong&gt; a generic prefix. According to &lt;a href="https://huggingface.co/google/gemma-4-E2B-it" rel="noopener noreferrer"&gt;Google's model card on HuggingFace&lt;/a&gt;, it stands for &lt;strong&gt;effective parameters&lt;/strong&gt;. The smaller Gemma 4 models (E2B and E4B) use a technique called &lt;strong&gt;Per-Layer Embeddings (PLE)&lt;/strong&gt; that fundamentally changes how parameters are counted.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Per-Layer Embeddings do
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0czh93nbt3cfjit9ir3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0czh93nbt3cfjit9ir3.png" alt="Standard Transformer vs PLE Architecture" width="636" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a standard transformer, there is one shared embedding table that converts tokens to vectors at the input and converts vectors back to tokens at the output. Every decoder layer in between works with the same token representations.&lt;/p&gt;

&lt;p&gt;PLE gives &lt;strong&gt;each decoder layer its own small embedding table&lt;/strong&gt; for every token. Instead of sharing one embedding, each layer has a private lookup table that adapts the token representation to that layer's specific role in the network.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this changes the parameter count
&lt;/h3&gt;

&lt;p&gt;These per-layer embedding tables are large in raw parameter count - they add up significantly. Gemma 4 E2B has &lt;strong&gt;5.1 billion total parameters&lt;/strong&gt;, but only &lt;strong&gt;2.3 billion active/effective parameters&lt;/strong&gt;. The difference is entirely PLE overhead.&lt;/p&gt;

&lt;p&gt;But those PLE tables are &lt;strong&gt;lookup tables&lt;/strong&gt;, not compute-heavy matrix multiplications. A lookup reads one row per token, a fixed cost regardless of how large the table is, whereas a matrix multiply touches every weight in the matrix. So while the chip has to hold 5.1B parameters worth of memory-mapped weights, the execution engine only computes 2.3B parameters worth of matrix multiplications per token step.&lt;/p&gt;

&lt;p&gt;This is the key insight: &lt;strong&gt;PLE maximizes parameter efficiency specifically for on-device deployment.&lt;/strong&gt; You get the quality benefits of having more specialized parameters (5.1B of them), but the inference cost (latency, memory bandwidth, power) stays in the ~2B compute class.&lt;/p&gt;
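&lt;p&gt;A toy comparison shows why a table can add many parameters without adding much compute. The sizes below are hypothetical and the cost model is deliberately crude - it illustrates lookup versus matrix multiply in general, not how Gemma implements PLE.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def dense_layer_cost(d_in, d_out):
    """Parameters and multiply-adds per token for a d_in x d_out matrix multiply."""
    params = d_in * d_out
    return params, params  # every weight takes part in one multiply-add


def embedding_lookup_cost(vocab_size, dim):
    """Parameters and values read per token for a table lookup."""
    params = vocab_size * dim
    return params, dim  # one row of `dim` values is read, no multiplies


if __name__ == "__main__":
    # Hypothetical sizes chosen for illustration only.
    print("dense (params, multiply-adds):", dense_layer_cost(2048, 2048))
    print("lookup (params, values read): ", embedding_lookup_cost(262_144, 256))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this toy setup the lookup table holds far more parameters than the dense layer, yet each token only touches one row of it. That asymmetry is what separates the total count from the effective count.&lt;/p&gt;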

&lt;h3&gt;
  
  
  What this means for your phone
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Total Params (with PLE)&lt;/th&gt;
&lt;th&gt;Active/Effective Params&lt;/th&gt;
&lt;th&gt;Base .litertlm Size&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Target Hardware&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;E2B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5.1B&lt;/td&gt;
&lt;td&gt;2.3B&lt;/td&gt;
&lt;td&gt;~2.58 GB&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;Phones, Raspberry Pi, edge devices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;E4B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.9B&lt;/td&gt;
&lt;td&gt;4.5B&lt;/td&gt;
&lt;td&gt;~3.65 GB&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;High-end phones (12+ GB RAM)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;26B MoE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;26B&lt;/td&gt;
&lt;td&gt;Subset active per token&lt;/td&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;Workstations, servers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;31B Dense&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;All active&lt;/td&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;Servers, high-end GPUs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For Redacto - our on-device PII redaction app - we chose E2B because even with 5.1B total parameters, the INT4-quantized &lt;code&gt;.litertlm&lt;/code&gt; is only ~2.58 GB, leaving headroom for the OS, ML Kit OCR, and four independent LLM conversations running in our redaction pipeline, all on a Galaxy S25 Ultra with 12 GB RAM. The PLE architecture means we get quality from 5.1B parameters worth of specialization, at the compute cost of only 2.3B active parameters - a budget that fits within a mobile power envelope.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;code&gt;-it&lt;/code&gt; - instruction-tuned
&lt;/h2&gt;

&lt;p&gt;This suffix changes everything about how the model behaves.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;base model&lt;/strong&gt; (no &lt;code&gt;-it&lt;/code&gt;) is trained to predict the next token. Give it text, it continues it. It does not follow instructions or understand "system prompt" vs "user message."&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;instruction-tuned model&lt;/strong&gt; (&lt;code&gt;-it&lt;/code&gt;) is the base model further trained on instruction-response pairs. It understands conversational structure: system prompts, user turns, assistant responses. It follows directions.&lt;/p&gt;

&lt;p&gt;For on-device apps, &lt;code&gt;-it&lt;/code&gt; is almost always what you want. In Redacto, each pipeline step sends a system prompt like "You are a medical PII detector. Find all names, dates of birth, medical record numbers..." A base model would ignore this and generate plausible-looking but unstructured text. The &lt;code&gt;-it&lt;/code&gt; variant follows the prompt, stays in role, and produces structured output that the next pipeline step can parse.&lt;/p&gt;

&lt;p&gt;If you see two versions on HuggingFace - with and without &lt;code&gt;-it&lt;/code&gt; - and your app needs the model to follow instructions, pick &lt;code&gt;-it&lt;/code&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;code&gt;_qualcomm_sm8750&lt;/code&gt; - the compilation target
&lt;/h2&gt;

&lt;p&gt;This suffix tells you the model has been &lt;strong&gt;compiled for a specific chip&lt;/strong&gt;: &lt;code&gt;qualcomm&lt;/code&gt; is the vendor, &lt;code&gt;sm8750&lt;/code&gt; is the Snapdragon 8 Elite system-on-chip.&lt;/p&gt;

&lt;p&gt;Modern phones have specialized AI silicon - a Neural Processing Unit (NPU) - that runs neural network operations far faster than the CPU or GPU. But using it requires translating the model's computation graph into the chip's native instruction format. Same concept as compiling C for &lt;code&gt;x86&lt;/code&gt; vs &lt;code&gt;arm64&lt;/code&gt; - the math is identical, the binary is not.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;th&gt;Decode Speed&lt;/th&gt;
&lt;th&gt;TTFT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma-4-E2B-it.litertlm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CPU / GPU (generic)&lt;/td&gt;
&lt;td&gt;24.5 tok/s&lt;/td&gt;
&lt;td&gt;366ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma-4-E2B-it_qualcomm_sm8750.litertlm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Snapdragon 8 Elite NPU&lt;/td&gt;
&lt;td&gt;41.7 tok/s&lt;/td&gt;
&lt;td&gt;92ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The NPU variant is larger (3.02 GB vs 2.59 GB) because it bundles QNN-compiled execution graphs with &lt;code&gt;DISPATCH_OP&lt;/code&gt; custom operations targeting the Hexagon V79 DSP. But the speed gain is substantial: 1.7x faster decode throughput and roughly 4x faster time-to-first-token. No chip suffix means the generic variant that runs on any ARM CPU or mobile GPU. (I cover NPUs and hardware delegates in detail in upcoming posts in this series.)&lt;/p&gt;
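&lt;p&gt;To see what those numbers mean for a user waiting on a response, here is a back-of-the-envelope estimate that adds time-to-first-token to steady-state decode time. It uses the figures from the table above; the 200-token response length is an arbitrary assumption, and the model ignores how TTFT grows with prompt length.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def response_time_s(ttft_ms, decode_tok_per_s, new_tokens):
    """Rough estimate: time to first token plus decode time for new_tokens tokens."""
    return ttft_ms / 1000 + new_tokens / decode_tok_per_s


if __name__ == "__main__":
    backends = {"GPU": (366, 24.5), "NPU": (92, 41.7)}  # figures from the table
    for name, (ttft_ms, speed) in backends.items():
        print(f"{name}: {response_time_s(ttft_ms, speed, 200):.2f} s for 200 tokens")
    print(f"decode speedup: {41.7 / 24.5:.2f}x, TTFT speedup: {366 / 92:.2f}x")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For longer responses the decode rate dominates the total, so the overall gain trends toward the 1.7x decode figure. TTFT matters most for how quickly the first words appear.&lt;/p&gt;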


&lt;h2&gt;
  
  
  &lt;code&gt;.litertlm&lt;/code&gt; - the file format
&lt;/h2&gt;

&lt;p&gt;This is the file extension used by &lt;strong&gt;LiteRT-LM&lt;/strong&gt;, Google's on-device LLM inference runtime. A &lt;code&gt;.litertlm&lt;/code&gt; file is not raw weights - it is a compiled bundle containing quantized weights, tokenizer, chat template, and execution graph, all packaged for immediate on-device inference.&lt;/p&gt;

&lt;p&gt;You cannot fine-tune a &lt;code&gt;.litertlm&lt;/code&gt; file. It is the &lt;em&gt;end&lt;/em&gt; of the pipeline: train in the cloud, fine-tune (optionally), quantize, compile, package, deploy. It is distinct from &lt;code&gt;.tflite&lt;/code&gt; (classical ML), &lt;code&gt;.gguf&lt;/code&gt; (llama.cpp), &lt;code&gt;.onnx&lt;/code&gt; (cross-platform), and &lt;code&gt;.safetensors&lt;/code&gt; (raw weights for storage/transfer). (I dig into the internals of this format in a separate post.)&lt;/p&gt;


&lt;h2&gt;
  
  
  Navigating HuggingFace
&lt;/h2&gt;

&lt;p&gt;Two organizations matter when searching for Gemma models ready for on-device deployment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://huggingface.co/google" rel="noopener noreferrer"&gt;google&lt;/a&gt;&lt;/strong&gt; - official weights in &lt;code&gt;.safetensors&lt;/code&gt; format, for fine-tuning or conversion. This is where you start if you need to customize the model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://huggingface.co/litert-community" rel="noopener noreferrer"&gt;litert-community&lt;/a&gt;&lt;/strong&gt; - pre-compiled &lt;code&gt;.litertlm&lt;/code&gt; bundles, including chip-specific NPU variants, ready for on-device use. This is where you go if you want to run inference immediately.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check the model card for quantization level, supported hardware, and license. Look at the file listing for chip-specific variants (&lt;code&gt;_qualcomm_sm8750&lt;/code&gt;, etc.). For production on-device apps, &lt;code&gt;litert-community&lt;/code&gt; gives you deployment-ready files; &lt;code&gt;google&lt;/code&gt; gives you the starting point for fine-tuning.&lt;/p&gt;
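
&lt;p&gt;The "look at the file listing" step can be sketched in a few lines of Python. This is only an illustration: the &lt;code&gt;pick_variant&lt;/code&gt; helper and the hard-coded listing are hypothetical, and the rule "no underscore means generic" is an assumption based on the two filenames shown earlier. In practice the listing could come from the Hub (for example via &lt;code&gt;huggingface_hub.list_repo_files&lt;/code&gt;) rather than a literal list.&lt;/p&gt;

```python
def pick_variant(filenames, soc=None):
    """Prefer a chip-specific .litertlm bundle; fall back to the generic one.

    soc is a vendor_chip suffix such as "qualcomm_sm8750", or None for
    devices without a matching NPU build.
    """
    bundles = [name for name in filenames if name.endswith(".litertlm")]
    if soc:
        for name in bundles:
            if name.endswith(f"_{soc}.litertlm"):
                return name
    # Assumption: the generic bundle carries no "_vendor_chip" suffix at all.
    for name in bundles:
        if "_" not in name:
            return name
    return None


# Hypothetical repository listing, using the filenames from the table above.
listing = [
    "README.md",
    "gemma-4-E2B-it.litertlm",
    "gemma-4-E2B-it_qualcomm_sm8750.litertlm",
]

print(pick_variant(listing, soc="qualcomm_sm8750"))
print(pick_variant(listing, soc="unknown_chip"))  # placeholder SoC with no build
```

&lt;p&gt;The first call selects the Snapdragon 8 Elite build; the second falls back to the generic file, which is the behavior you want on any device without a matching chip-specific bundle.&lt;/p&gt;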


&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;Every segment in Redacto's model filename maps to a decision we made:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Segment&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;th&gt;Why we chose it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Google's open-weight LLM family&lt;/td&gt;
&lt;td&gt;Apache 2.0, strong on-device ecosystem, Google-backed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fourth generation&lt;/td&gt;
&lt;td&gt;Best quality per parameter, Apache 2.0 license, PLE architecture&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;E2B&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5.1B total / 2.3B effective via PLE&lt;/td&gt;
&lt;td&gt;~2.58 GB at INT4, fits in phone RAM with headroom for OCR + 4 LLM calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;it&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Instruction-tuned&lt;/td&gt;
&lt;td&gt;Follows system prompts - critical for our 4-step redaction pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qualcomm_sm8750&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Compiled for Snapdragon 8 Elite NPU&lt;/td&gt;
&lt;td&gt;41.7 tok/s, 92 ms TTFT on Hexagon V79&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.litertlm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LiteRT-LM compiled bundle&lt;/td&gt;
&lt;td&gt;Tokenizer + chat template + weights + graph, ready to infer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The next time you see a model filename that looks like alphabet soup, read it left to right: family, generation, size and architecture, training variant, target hardware, runtime format. Each piece narrows what the model is, what it can do, and where it can run.&lt;/p&gt;
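
&lt;p&gt;That left-to-right reading can be written down as a small parser. This is a minimal sketch that assumes the exact naming pattern used in this post (&lt;code&gt;family-generation-size[-variant][_target].format&lt;/code&gt;); the &lt;code&gt;ModelName&lt;/code&gt; class and &lt;code&gt;parse_model_filename&lt;/code&gt; function are hypothetical names, and other model families may not follow the same layout.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelName:
    family: str
    generation: str
    size: str
    variant: Optional[str]  # e.g. "it"; None if the segment is absent
    target: Optional[str]   # e.g. "qualcomm_sm8750"; None for generic builds
    fmt: str                # file extension, including the leading dot


def parse_model_filename(filename):
    """Split a filename like gemma-4-E2B-it_qualcomm_sm8750.litertlm."""
    stem, _, ext = filename.rpartition(".")
    if not stem:
        raise ValueError(f"no file extension in {filename!r}")
    # Everything after the first underscore is the compilation target, if any.
    name, _, target = stem.partition("_")
    parts = name.split("-")
    if len(parts) not in (3, 4):
        raise ValueError(f"unexpected name layout in {filename!r}")
    variant = parts[3] if len(parts) == 4 else None
    return ModelName(parts[0], parts[1], parts[2], variant, target or None, "." + ext)


print(parse_model_filename("gemma-4-E2B-it_qualcomm_sm8750.litertlm"))
print(parse_model_filename("gemma-4-E2B-it.litertlm"))
```

&lt;p&gt;For the generic file the &lt;code&gt;target&lt;/code&gt; field comes back as &lt;code&gt;None&lt;/code&gt;, which mirrors the rule above: no chip suffix means the build is not tied to a specific NPU.&lt;/p&gt;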




&lt;h3&gt;
  
  
  Related posts in the "Edge AI from the Trenches" series
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;From HuggingFace to Your Phone - the next logical read: what happens after you decode the filename&lt;/li&gt;
&lt;li&gt;What Does "On-Device" Actually Mean? - the privacy and deployment context behind choosing an on-device model&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware. &lt;br&gt;
Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/google/gemma-4-E2B-it" rel="noopener noreferrer"&gt;Google Gemma 4 E2B Model Card&lt;/a&gt; - PLE architecture, effective parameter explanation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.google/technology/developers/gemma-open-models/" rel="noopener noreferrer"&gt;Google Gemma announcements&lt;/a&gt; (Gemma &lt;a href="https://blog.google/technology/developers/gemma-open-models/" rel="noopener noreferrer"&gt;1&lt;/a&gt;, &lt;a href="https://blog.google/technology/developers/google-gemma-2/" rel="noopener noreferrer"&gt;2&lt;/a&gt;, &lt;a href="https://blog.google/technology/developers/gemma-3/" rel="noopener noreferrer"&gt;3&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/litert-community" rel="noopener noreferrer"&gt;litert-community on HuggingFace&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/edge/litert" rel="noopener noreferrer"&gt;LiteRT documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Benchmark data: &lt;a href="https://devpost.com/software/redacto" rel="noopener noreferrer"&gt;Redacto project, Samsung Galaxy S25 Ultra (Snapdragon 8 Elite, SM8750)&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Last updated: May 2026&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Post 1 of 22 in the "Edge AI from the Trenches" series&lt;/em&gt;&lt;/p&gt;

</description>
      <category>litertlm</category>
      <category>gemmachallenge</category>
      <category>edgeai</category>
      <category>android</category>
    </item>
  </channel>
</rss>
