What I Learned Untangling LiteRT, LiteRT-LM, and TFLite

#gemmachallenge #edgeai #android #litertlm

Gemma 4 Challenge: Write about Gemma 4 Submission

When we started building Redacto - an on-device PII redaction app running Gemma 4 E2B on a Snapdragon 8 Elite - we kept tripping over three names: TFLite, LiteRT, and LiteRT-LM. Google's own docs sometimes use them interchangeably, and forum posts and community discussions mix them freely. We spent a good amount of time working out whether they were the same thing, different versions of one thing, or completely separate tools.

Here is the distinction I wish someone had spelled out for me on day one.

The name that kept following me around: TFLite

TensorFlow Lite (TFLite) was Google's original on-device inference runtime, announced at Google I/O 2017 as the mobile companion to TensorFlow. It ran classical ML models - image classification, object detection, pose estimation - on phones and embedded devices. It consumed .tflite model files: small, optimized graphs dispatched to CPU, GPU, or specialized accelerators.

The rebrand that confused everyone: LiteRT

In September 2024, Google renamed TensorFlow Lite to LiteRT (short for "Lite Runtime"). The core runtime, APIs, and .tflite file format stayed the same. What changed was branding: LiteRT is no longer tied to TensorFlow. You can convert models from TensorFlow, PyTorch, JAX, or other frameworks. The old name implied a dependency that no longer existed.

The migration is still ongoing - you will find both names in code, packages, and docs. If you see org.tensorflow.lite in an import statement, or org.tensorflow:tensorflow-lite in a Gradle dependency, and LiteRT in Google's marketing, they are the same thing.

The one that is actually different: LiteRT-LM

This is where I got tripped up the longest. LiteRT-LM is not a rebrand. It is a separate runtime for running large language models on device, built on top of LiteRT. It adds capabilities the base runtime does not have:

Conversation management - system prompts, user turns, multi-turn history
Tokenization - BPE/SentencePiece tokenizer bundled with the model
Chat template handling - model-specific formatting (like Gemma's <start_of_turn> tags)
Streaming token output - callback-based delivery so your app shows text as it generates
Sampling configuration - temperature, top-k, top-p controls
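To make the chat template item concrete, here is a minimal Python sketch of what that formatting step does. It is not LiteRT-LM code. It assumes the turn markup publicly documented for earlier Gemma releases (<start_of_turn>, <end_of_turn>, roles "user" and "model", system text folded into the first user turn) and leaves out the leading BOS token. The names Turn and render_gemma_prompt are made up for illustration, and the template bundled with your model file is the one that actually applies.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # "user" or "model" (assumed Gemma role names)
    text: str


def render_gemma_prompt(turns, system_prompt=""):
    """Render multi-turn history into Gemma-style turn markup (illustrative only)."""
    parts = []
    for i, turn in enumerate(turns):
        text = turn.text
        # Assumption: no dedicated system role, so the system prompt is
        # prepended to the first user turn, as in earlier Gemma templates.
        if i == 0 and system_prompt and turn.role == "user":
            text = f"{system_prompt}\n\n{text}"
        parts.append(f"<start_of_turn>{turn.role}\n{text}<end_of_turn>\n")
    # Leave a model turn open so generation continues from here.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


history = [
    Turn("user", "Redact the names in: Alice met Bob."),
    Turn("model", "[NAME] met [NAME]."),
    Turn("user", "Now redact: Carol called Dave."),
]
print(render_gemma_prompt(history, system_prompt="You redact PII."))
```

The point is that this string assembly happens inside the LLM runtime. With base LiteRT you would have to write and maintain it yourself for every model family.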

LiteRT-LM consumes .litertlm files, not .tflite files. A .litertlm file is a compiled bundle: quantized weights, tokenizer, chat template, and an execution graph optimized for a specific device.

What I learned the hard way: they are not interchangeable

You cannot run a .litertlm file with plain LiteRT or TFLite. The base runtime has no concept of conversations, tokenizers, or streaming callbacks.
You cannot run a .tflite model with LiteRT-LM. The LLM runtime expects the bundled tokenizer and conversation-aware execution graph that only .litertlm provides.
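A cheap guard against mixing the two up is to check the file extension before handing a model to a runtime. The sketch below is a hypothetical helper of my own (runtime_for and RUNTIME_BY_SUFFIX are not part of any Google library). It only encodes the rule stated above and fails loudly on anything else.

```python
from pathlib import Path

# Mapping taken from the rule above: one file format per runtime.
RUNTIME_BY_SUFFIX = {
    ".tflite": "LiteRT",
    ".litertlm": "LiteRT-LM",
}


def runtime_for(model_path: str) -> str:
    """Return the name of the runtime that can load this model file."""
    suffix = Path(model_path).suffix.lower()
    if suffix not in RUNTIME_BY_SUFFIX:
        raise ValueError(f"unrecognized model format: {model_path!r}")
    return RUNTIME_BY_SUFFIX[suffix]


print(runtime_for("pose_detector.tflite"))
print(runtime_for("models/gemma.litertlm"))
```

An extension check is only a first filter, since a renamed file will still fail at load time, but it turns a confusing runtime error into a clear message early.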

They solve different problems. LiteRT runs classical ML inference. LiteRT-LM runs LLM inference with conversational scaffolding. In Redacto, we use LiteRT-LM exclusively - our pipeline sends system prompts through a multi-step conversation chain, with streaming callbacks and sampling configuration that do not exist in base LiteRT.
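For readers who have not met the sampling controls before, here is a rough, standard-library-only Python sketch of what temperature, top-k, and top-p do to a single decoding step. It is a textbook illustration, not how LiteRT-LM implements sampling. The function name sample_token and its parameters are my own, and treating a non-positive temperature as greedy decoding is an assumption I chose for the sketch.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick a token id from raw logits using temperature, top-k, and top-p."""
    if temperature <= 0:
        # Assumption for this sketch: non-positive temperature means greedy.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, idx in ranked:
        kept.append((prob, idx))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Draw one token id from the survivors, weighted by probability.
    ids = [idx for _, idx in kept]
    probs = [prob for prob, _ in kept]
    return rng.choices(ids, weights=probs, k=1)[0]


rng = random.Random(0)
print(sample_token([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=3, top_p=0.9, rng=rng))
```

Lower temperature and tighter top-k or top-p make the output more predictable, which matters for a redaction pipeline where consistency is worth more than creative phrasing.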

The stack that made it click

LiteRT-LM and MediaPipe both sit on top of LiteRT, but they serve different purposes. MediaPipe provides high-level task APIs (face detection, image segmentation) that use LiteRT as the engine underneath. LiteRT-LM provides conversational LLM inference. For a deeper runtime comparison including llama.cpp, ONNX Runtime, and ExecuTorch, see my earlier post on the HuggingFace-to-phone pipeline.

The short version

TFLite = the old name for Google's on-device ML runtime. Being replaced by LiteRT branding.
LiteRT = TFLite renamed, with broader framework support. Same runtime, same .tflite format.
LiteRT-LM = a separate runtime for LLMs, built on LiteRT. Different file format (.litertlm), different capabilities.

If you need an LLM on device, you want LiteRT-LM. If you need a classifier or detector, you want LiteRT. If you see "TFLite" in code, it is the old name for LiteRT.

The naming is confusing. But once you see the stack, it clicks.

Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware. Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.

Last updated: May 2026
3rd of 22 posts in the "Edge AI from the Trenches" series