Jaydeep Shah (JD)

Posted on May 17

What I Learned Turning a HuggingFace Model Into Something My Phone Can Run

#litertlm #gemmachallenge #edgeai #android

Gemma 4 Challenge: Write about Gemma 4 Submission

You found a model on HuggingFace. It looks promising - maybe Gemma, maybe Llama, maybe something smaller. You want to run it on a phone. You click "Download," and then... what? The file is 5 GB of .safetensors splits. There is no APK, no .tflite, no obvious next step. The HuggingFace README says "Usage: model = AutoModelForCausalLM.from_pretrained(...)" - a Python API that does not exist on Android.

This was exactly where I got stuck when I started building Redacto - an on-device PII redaction app running Gemma 4 E2B entirely on a Galaxy S25 Ultra. The distance between "I found a good model" and "it runs on my phone" turned out to be much bigger than I expected. Here is what I learned along the way.

What is actually inside a HuggingFace repo

The first thing I had to understand was what I was actually downloading. HuggingFace model repos for LLMs are not just a single weights file - they contain an entire ecosystem of artifacts.

Here is what I found inside a typical LLM repo:

Model weights (.safetensors or .bin files) - the learned parameters. For Gemma 4 E2B, this is a single 10.2 GB file. Larger models split weights across multiple files (model-00001-of-00004.safetensors, etc.).
config.json - the model architecture definition: number of layers, hidden dimensions, attention heads, vocabulary size. This is the blueprint the runtime needs to reconstruct the model's computational graph.
Tokenizer files (tokenizer.json, tokenizer_config.json, tokenizer.model) - the mapping between text and token IDs. The model does not see words; it sees integers. The tokenizer defines how "Mrs. Chen" becomes [4521, 18, 9832] (or whatever the model's vocabulary dictates).
chat_template (embedded in tokenizer_config.json or as a standalone .jinja file) - a Jinja2 template that wraps your messages into the format the model was trained on. Gemma expects <start_of_turn>user\n...<end_of_turn>. Llama 2 expects [INST]...[/INST]. Get this wrong and the model produces garbage without any error message.
Model card (README.md) - documentation covering training data, intended use, limitations, license, and benchmark scores.
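To make that inventory concrete, here is a minimal sketch that sorts the file names in a downloaded repo into the roles listed above. The classify helper, its matching rules, and the chat_template.jinja file name are my own illustration based on the file names mentioned in this post. They are not an official HuggingFace convention.

```python
from pathlib import PurePath

TOKENIZER_FILES = {"tokenizer.json", "tokenizer_config.json", "tokenizer.model"}


def classify(filename: str) -> str:
    """Map a file name from a model repo to the role it plays (illustrative rules)."""
    name = PurePath(filename).name.lower()
    if name.endswith((".safetensors", ".bin")):
        return "weights"
    if name == "config.json":
        return "architecture"
    if name in TOKENIZER_FILES:
        # tokenizer_config.json may also carry the embedded chat template
        return "tokenizer"
    if name.endswith(".jinja"):
        return "chat template"
    if name == "readme.md":
        return "model card"
    return "other"


# Hypothetical listing; a real repo may contain more or fewer files.
repo_files = [
    "model-00001-of-00004.safetensors",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "chat_template.jinja",
    "README.md",
]

roles = {f: classify(f) for f in repo_files}
for f, role in roles.items():
    print(f"{role:>14}  {f}")
```

Running something like this over a repo is a quick way to spot what is missing (for example, a research checkpoint with no tokenizer files) before committing to a multi-gigabyte download.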

Not all repositories are the same. Some are raw research checkpoints with no tokenizer. Some include GGUF quantized versions alongside the originals. I learned quickly that for on-device deployment, you want repositories that have already been prepared for your target runtime.

Why I could not just download and run

This was my first real lesson. Those .safetensors files hold raw tensors at full or half precision. They are designed to be loaded by a framework such as PyTorch into GPU memory on a workstation running Python. A phone cannot use them directly, and it took me a while to understand why:

Size. The unquantized Gemma 4 E2B checkpoint is roughly 10 GB. A phone with 12 GB of RAM cannot load that while also running the OS, the app, and everything else. You need quantization - compressing weights from 16- or 32-bit floats down to 4-bit integers - to get the model down to a size that fits.

Format. Android does not have a PyTorch runtime. The phone's CPU, GPU, and NPU each have their own instruction sets and memory layouts. The model's computational graph needs to be compiled into operations that these hardware targets can execute.

Runtime. An LLM is not just a forward pass. You need conversation management (tracking turns), tokenization (text to integers and back), KV-cache management (storing previous computations so generation does not recompute from scratch every token), and streaming (delivering tokens one at a time for responsive UX). The raw weights have none of this.
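To see why KV-cache management matters, here is a toy count of how many token positions the model has to push through its layers during generation, with and without a cache. This is a simplified sketch with made-up function names. It ignores the attention work that each step still does over the cached entries, and it is not how LiteRT-LM is implemented.

```python
def positions_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Every generation step re-runs the model over the whole sequence so far."""
    return sum(prompt_len + step for step in range(new_tokens))


def positions_with_cache(prompt_len: int, new_tokens: int) -> int:
    """The prompt is processed once; each later step processes one new position."""
    if new_tokens == 0:
        return 0
    return prompt_len + (new_tokens - 1)


prompt_len, new_tokens = 100, 50  # hypothetical request
no_cache = positions_without_cache(prompt_len, new_tokens)
cached = positions_with_cache(prompt_len, new_tokens)
print(f"without cache: {no_cache} positions")
print(f"with cache:    {cached} positions")
```

The gap widens as the response gets longer, which is why a runtime that owns the cache is part of what the raw weights are missing.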

Once I understood these three gaps, the compilation pipeline started to make sense.

The pipeline I had to learn

Getting from HuggingFace to a phone-ready model turned out to be a multi-stage process:

Quantization - making it fit

Quantization converts the model's weights from high-precision floating point (FP32 or FP16) to lower-precision integers (INT8 or INT4). It is a lossy compression step - the model loses some accuracy, but the file shrinks dramatically.
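The lossy part is easier to see on a handful of numbers. Below is a minimal sketch of symmetric, per-tensor quantization to the signed 4-bit range. It is a teaching example with hypothetical weights, not the dynamic_wi4_afp32 recipe that litert-torch applies, which I have not reimplemented here.

```python
def quantize_int4(weights):
    """Map floats to integers in [-8, 7] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [q * scale for q in quantized]


weights = [0.82, -0.31, 0.05, -0.77, 0.40]  # hypothetical values
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("restored: ", [round(r, 3) for r in restored])
print(f"max round-trip error: {max_error:.4f}")
```

Each weight now needs 4 bits plus a share of the scale factor, and the round trip does not return the original values exactly. That rounding error is the accuracy the model gives up.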

For Gemma 4 E2B, INT4 quantization (specifically dynamic_wi4_afp32 - INT4 weights with FP32 activations) brings the model from ~10 GB down to ~2.58 GB. That is the difference between "impossible on a phone" and "fits in memory with room for the app."
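As a rough sanity check on numbers like these, weight storage scales linearly with bits per weight. The sketch below uses a hypothetical 1-billion-parameter model so the arithmetic is easy to follow. It is not Gemma's actual parameter count.

```python
def weight_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 1_000_000_000  # hypothetical model size, for illustration only

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_size_gb(PARAMS, bits):.2f} GB")
```

Real files rarely match these figures exactly: quantization scales and packaged metadata add overhead, and some tensors may stay at higher precision. For reference, the ~10 GB to ~2.58 GB reduction quoted above is roughly 4x, which is the ratio this formula gives between 16-bit and 4-bit weights.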

Quantization is not optional for mobile. It is a hard requirement. This was not obvious to me at first - I kept looking for ways around it before accepting that every on-device model goes through this step.

Export and compilation - the tool I did not know existed

This is where I discovered litert-torch - a pip-installable package from Google that takes a HuggingFace model and produces a .litertlm file. During export, the tool:

Reads the model architecture from config.json
Loads and quantizes the weights
Embeds the tokenizer
Embeds the chat template (from tokenizer_config.json or an override)
Compiles the computational graph into LiteRT operations
Packages everything into a single .litertlm bundle

The official export command from Google's Gemma 4 documentation:

litert-torch export_hf \
  --model=google/gemma-4-E2B-it \
  --output_dir=/tmp/gemma4_2b \
  --externalize_embedder \
  --jinja_chat_template_override=litert-community/gemma-4-E2B-it-litert-lm

The --model flag takes a HuggingFace model ID or local path. --externalize_embedder separates the embedding table for memory efficiency. The --jinja_chat_template_override flag points to a known-compatible chat template - it exists because some model templates use Jinja features that the on-device parser does not support. I learned this the hard way, and I cover that story in a later post in this series.

For fine-tuned models, you can point --model at a local directory and add quantization:

litert-torch export_hf \
  --model=./my_finetuned_model \
  --output_dir=./output \
  --externalize_embedder \
  --quantization_recipe=dynamic_wi4_afp32 \
  --jinja_chat_template_override=litert-community/gemma-4-E2B-it-litert-lm
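The two invocations above differ only in the model path, the output directory, and the optional quantization flag, so they are easy to script. Here is a small sketch that assembles the argument list using only the flags shown in this post. The build_export_command helper is my own wrapper, not part of litert-torch, and it only builds the command rather than running it.

```python
import shlex


def build_export_command(model, output_dir,
                         quantization_recipe=None,
                         chat_template_override=None):
    """Assemble a litert-torch export_hf command from the flags used in this post."""
    cmd = [
        "litert-torch", "export_hf",
        f"--model={model}",
        f"--output_dir={output_dir}",
        "--externalize_embedder",
    ]
    if quantization_recipe:
        cmd.append(f"--quantization_recipe={quantization_recipe}")
    if chat_template_override:
        cmd.append(f"--jinja_chat_template_override={chat_template_override}")
    return cmd


cmd = build_export_command(
    model="./my_finetuned_model",
    output_dir="./output",
    quantization_recipe="dynamic_wi4_afp32",
    chat_template_override="litert-community/gemma-4-E2B-it-litert-lm",
)
print(shlex.join(cmd))
# To actually run the export, pass the list to subprocess.run(cmd, check=True).
```

Keeping the command as a list avoids shell-quoting surprises when the model path contains spaces.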

Device-specific variants - one model, two files

This was a surprise. The exported .litertlm runs on CPU and GPU out of the box. But if you want to target the NPU (Neural Processing Unit) - which on a Snapdragon 8 Elite delivers 41.7 tok/s versus 24.5 tok/s on GPU - you need a second compilation step.

The NPU variant goes through the Qualcomm QNN toolchain, which compiles certain operations into DISPATCH_OP custom ops that run directly on the Hexagon V79 DSP. This produces a separate, larger .litertlm file that is tied to a specific chip.

The standard GPU/CPU file works across all ARM64 Android devices. The NPU file works only on the exact SoC it was compiled for. I did not expect to need two different model files for what is technically the same model.

Push to device - the easy part

The final step turned out to be the simplest:

adb push gemma-4-E2B-it.litertlm \
  /sdcard/Android/data/com.example.redacto/files/gemma4.litertlm

The model lives in the app-specific directory on external storage. On first load, LiteRT-LM parses the bundle, sets up the KV-cache, initializes the appropriate hardware delegate, and, for the NPU, generates an AOT (ahead-of-time) compilation cache. Cold start takes about 10 seconds for GPU and 14 seconds for NPU. With the AOT cache in place, subsequent launches drop to around 2 seconds.

How it actually went with Redacto

For Redacto, we did not run the export pipeline ourselves for the production model. This is something I wish I had known earlier: Google's litert-community organization on HuggingFace publishes pre-compiled .litertlm files for popular models, including Gemma 4 E2B.

We downloaded from litert-community/gemma-4-E2B-it-litert-lm:

| File | Size | Target Hardware |
| --- | --- | --- |
| `gemma-4-E2B-it.litertlm` | 2.59 GB | CPU/GPU (all ARM64 Android devices) |
| `gemma-4-E2B-it_qualcomm_sm8750.litertlm` | 3.02 GB | Snapdragon 8 Elite NPU only |

Two files. Same model. Same weights at the same precision. The size difference (~430 MB) comes from the QNN-compiled custom ops embedded in the NPU variant. Those ops will crash if you try to run them on a GPU - and the GPU variant cannot dispatch to the NPU. They are not interchangeable.
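Because the two files are not interchangeable, something has to decide which one a given device gets. Here is a minimal sketch of that decision, using the file names from the table above. The pick_model_file helper is hypothetical, and I am assuming the SoC identifier comes from something like Android's Build.SOC_MODEL or adb shell getprop ro.soc.model. The exact string a device reports should be checked on real hardware.

```python
GENERIC_FILE = "gemma-4-E2B-it.litertlm"

# SoC identifier -> NPU-specific bundle. Only the chip covered in this post is listed.
NPU_FILES = {
    "SM8750": "gemma-4-E2B-it_qualcomm_sm8750.litertlm",
}


def pick_model_file(soc_model: str, want_npu: bool = True) -> str:
    """Return the NPU bundle only on an exact SoC match, else the CPU/GPU bundle."""
    key = soc_model.strip().upper()
    if want_npu and key in NPU_FILES:
        return NPU_FILES[key]
    return GENERIC_FILE


for soc in ["SM8750", "SM8650", "unknown"]:
    print(f"{soc:>8} -> {pick_model_file(soc)}")
```

Falling back to the CPU/GPU file on any mismatch is the conservative choice, since that file is the one that works across devices.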

We did, however, run litert-torch export_hf ourselves when we fine-tuned the model. That is when we hit the chat template trap: tokenizer.save_pretrained() bundled the HuggingFace-native Jinja template (which uses map.get()), and LiteRT-LM's on-device template parser does not support that function. The model loaded, the tokenizer initialized, and then inference produced garbage. No error, no crash - just wrong output. We had to manually swap the template with an older compatible version before re-exporting. (I cover this trap in detail in a later post in this series.)
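Since this failure is silent, a cheap check before export would have saved us time. The sketch below scans a chat template for the one construct that bit us, a .get() call. The pattern list is an assumption drawn from our experience only. It is not a documented list of what LiteRT-LM's parser rejects, and the sample template is made up.

```python
import re

# Constructs we found to be a problem; extend as you discover more.
SUSPECT_PATTERNS = {
    "dict .get() call": re.compile(r"\.get\s*\("),
}


def find_suspect_constructs(template_text: str):
    """Return (line_number, label) pairs for every suspect construct found."""
    findings = []
    for lineno, line in enumerate(template_text.splitlines(), start=1):
        for label, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


# Made-up template for demonstration; not Gemma's real chat template.
sample = "\n".join([
    "{%- for message in messages -%}",
    "{{ message.get('role', 'user') }}: {{ message['content'] }}",
    "{%- endfor -%}",
])

for lineno, label in find_suspect_constructs(sample):
    print(f"line {lineno}: {label}")
```

A clean result does not prove the template is compatible. It only rules out the failure we already know about, so a short on-device smoke test is still worth running after every export.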

What I learned about the runtime landscape

As I went through this process, I also had to figure out where LiteRT-LM fits among the other on-device inference options. Here is the comparison I wish I had found earlier:

| Runtime | Focus | Model Format | Hardware Targets | LLM Features |
| --- | --- | --- | --- | --- |
| LiteRT-LM | LLMs on Android/iOS | `.litertlm` | CPU, GPU, NPU (Qualcomm, MediaTek) | Tokenizer, chat template, KV-cache, streaming, conversation management |
| TFLite / LiteRT | Classical ML (image, audio, NLP) | `.tflite` | CPU, GPU, NPU, Edge TPU | None (no tokenizer, no chat, no streaming) |
| ONNX Runtime | Cross-platform inference | `.onnx` | CPU, GPU (DirectML, CUDA) | Limited (community extensions) |
| llama.cpp | LLM inference, CPU-focused | `.gguf` | CPU (NEON/AVX), some GPU (Metal, CUDA, Vulkan) | Tokenizer, chat template, KV-cache, streaming |
| MediaPipe | ML pipelines for media tasks | `.tflite` + config | CPU, GPU | LLM Inference API (wraps LiteRT under the hood) |

The distinctions that mattered most for my use case:

LiteRT-LM vs TFLite/LiteRT. I initially confused these. TFLite (now rebranded as LiteRT) handles classical ML models - image classifiers, object detectors. LiteRT-LM is built on top of LiteRT specifically for LLMs. You cannot run a .litertlm file with the TFLite interpreter, and you cannot run a .tflite model with LiteRT-LM. Same infrastructure, different model types.

LiteRT-LM vs llama.cpp. llama.cpp is excellent for CPU-based inference on laptops and desktops. On phones, LiteRT-LM's advantage is its deep integration with vendor-specific NPU delegates - on Snapdragon 8 Elite, NPU inference through LiteRT-LM runs at 41.7 tok/s versus the ~10-15 tok/s range typical of CPU-only execution for a model this size.

LiteRT-LM vs MediaPipe. Google's MediaPipe provides a higher-level LLM Inference API that wraps LiteRT under the hood. Simpler API, less control. LiteRT-LM gives you more control over engine initialization, backend selection, and sampling configuration - which I needed for our multi-step redaction pipeline.

The mental model that made it click

If I had to summarize everything I learned in one sentence: a HuggingFace model repository is the source code, and a .litertlm file is the compiled binary.

You would not try to run a .c file on a microcontroller without compiling it first, and for the same reason you would not try to load .safetensors on a phone. Both need a compilation step that transforms a portable source artifact into a machine-friendly executable. The difference is that model compilation also includes quantization (compression), hardware specialization (targeting specific silicon), and runtime packaging (embedding the tokenizer, chat template, and execution graph).

Once this clicked for me, the rest of the on-device AI stack made sense. The .litertlm file is not just a model - it is a self-contained inference package that knows how to tokenize input, format conversations, run the forward pass on your target hardware, and stream output back to your app.

Finding the right model on HuggingFace is step one. Getting it to your phone is the actual engineering.

Related in this series

What I Learned Untangling LiteRT, LiteRT-LM, and TFLite - clarifies the naming confusion in the stack this post describes
FP32, INT4, and Everything Between - what quantization does and why it is mandatory for mobile
What's Inside a .litertlm File? - deep dive into the compiled bundle at the end of the pipeline
The Chat Template Trap - what happens when the export pipeline embeds an incompatible template

Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware.
Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.
