Deploying machine learning models on mobile devices, IoT hardware, and embedded systems requires lightweight and efficient inference engines. TensorFlow Lite (TFLite) is Google’s solution for running ML models on edge devices with low latency and a small footprint. To use it, you need to convert your standard TensorFlow models into the TensorFlow Lite format (.tflite).
This article walks you through that conversion process step by step, from a trained model to on-device inference.
Why TensorFlow Lite?
TensorFlow Lite offers several advantages for on-device inference:
• Reduced model size – Models are compressed through techniques like quantization and pruning, making them small enough to fit on devices with restricted storage.
• Optimized performance – TFLite uses hardware acceleration (via GPU, NNAPI, or specialized DSPs) to deliver faster inference compared to running full TensorFlow.
• Cross-platform compatibility – It supports Android, iOS, embedded Linux, and even microcontrollers.
• On-device machine learning – Since inference happens locally, TFLite enables real-time applications without relying on cloud servers, improving latency, privacy, and offline functionality.
Step 1: Train or Load Your TensorFlow Model
You can start with either:
• A pre-trained TensorFlow model (e.g., from TensorFlow Hub).
• A custom model trained with Keras or the TensorFlow API.
Example:
import tensorflow as tf

# Define a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train on your own data before converting
# model.fit(x_train, y_train, epochs=5)
Step 2: Convert the Model to TensorFlow Lite
Once the model is trained or loaded, you can convert it into the lightweight .tflite format using the TensorFlow Lite Converter. This step compresses the model and prepares it for efficient deployment on mobile and edge devices.
# Convert the Keras model to TensorFlow Lite
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Save the model to a .tflite file
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
If you have a SavedModel format instead:
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_directory")
tflite_model = converter.convert()
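If your model currently only exists as a Keras object in memory, you can export it to the SavedModel format first and then use the converter call above; a minimal sketch (the directory name is just a placeholder):

# Export the in-memory Keras model to a SavedModel directory
tf.saved_model.save(model, "saved_model_directory")

Converting from a SavedModel is also the route recommended in the best practices at the end of this article.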
At this point, you have a working .tflite model. But it may still be too large or slow for smaller devices. That’s where optimization comes in.
Step 3: Optimize the Model
Optimization reduces model size and speeds up inference, which is especially important for edge devices. In addition to quantization, techniques such as pruning and clustering can further shrink the model and improve efficiency before conversion; a pruning sketch follows.
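Pruning itself happens before conversion and comes from the tensorflow_model_optimization (tfmot) package; the sketch below is only a minimal illustration with placeholder schedule values, and clustering (tfmot.clustering) follows a similar wrap-train-strip pattern.

import tensorflow_model_optimization as tfmot

# Wrap the trained model with magnitude-based pruning
# (the 50% sparsity target and end_step are placeholders; tune them for your training run)
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer='adam',
                     loss='sparse_categorical_crossentropy',
                     metrics=['accuracy'])

# Fine-tune with the pruning callback, then strip the pruning wrappers
# pruned_model.fit(x_train, y_train, epochs=2,
#                  callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
stripped_model = tfmot.sparsity.keras.strip_pruning(pruned_model)

# Convert as in Step 2 (optionally combined with the quantization options below)
converter = tf.lite.TFLiteConverter.from_keras_model(stripped_model)
tflite_pruned_model = converter.convert()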
Dynamic Range Quantization
Dynamic Range Quantization quantizes weights to int8 while keeping inputs/outputs in float, giving smaller models with minimal accuracy loss.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

with open("model_dynamic_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)
Integer Quantization
Integer Quantization fully quantizes weights and activations to int8, best for CPUs and microcontrollers without floating-point support.
import numpy as np

# Calibration data: in practice, yield a few hundred real samples from your
# training or validation set instead of random values
def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 784).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_int8_model = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_int8_model)
Float16 Quantization
Float16 Quantization stores weights in float16 but computes in float32, reducing size with little accuracy impact (optimized for GPUs).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16_model = converter.convert()

with open("model_fp16.tflite", "wb") as f:
    f.write(tflite_fp16_model)
Quantization-Aware Training (QAT)
Quantization-Aware Training (QAT) simulates quantization during training, preserving accuracy when deploying heavily quantized models.
import tensorflow_model_optimization as tfmot

# Apply QAT: wrap the model with fake-quantization ops
qat_model = tfmot.quantization.keras.quantize_model(model)

qat_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Retrain on your dataset
# qat_model.fit(x_train, y_train, epochs=5)

# Convert to TFLite, enabling quantization so the fake-quant ops become real int8 ops
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_qat_model = converter.convert()

with open("model_qat.tflite", "wb") as f:
    f.write(tflite_qat_model)
Step 4: Run Inference with TensorFlow Lite Interpreter
Once the model is converted, you can load it with the TensorFlow Lite Interpreter to perform predictions on new data. The interpreter allocates tensors, accepts input data, runs inference, and returns the output results for evaluation or deployment.
import numpy as np

# Load the TFLite model and allocate tensors
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

# Get input and output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Example input (replace with real data)
input_data = np.array(np.random.random_sample(input_details[0]['shape']), dtype=np.float32)

# Run inference
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
print("Predictions:", output_data)
Step 5: Deploy to Your Target Device
Depending on your platform, deployment looks different:
• Android – Use the TensorFlow Lite Android Support Library or ML Kit.
• iOS – Use the TensorFlow Lite Swift library.
• Microcontrollers (TinyML) – Use TensorFlow Lite for Microcontrollers (no OS required).
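For embedded Linux boards such as a Raspberry Pi, you don't need the full TensorFlow package at all; the lightweight tflite-runtime pip package exposes the same Interpreter API used in Step 4. A minimal sketch, assuming tflite-runtime is installed:

import numpy as np
from tflite_runtime.interpreter import Interpreter  # pip install tflite-runtime

interpreter = Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

input_data = np.random.rand(1, 784).astype(np.float32)
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
print("Predictions:", interpreter.get_tensor(output_details[0]['index']))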
Best Practices
• Prefer SavedModel format over .h5 or frozen graphs for smoother conversion and better metadata handling.
• Use Quantization-Aware Training (QAT) if targeting low-power devices to minimize accuracy loss after conversion.
• Provide a representative dataset for integer quantization to ensure proper calibration.
• Test the TFLite model on real hardware (Android, iOS, Raspberry Pi, microcontrollers) to confirm performance and accuracy; a quick size-and-latency check is sketched after this list.
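Before moving to real hardware, a quick local check of size and latency already tells you how much optimization helped; a minimal sketch comparing the files produced in the steps above:

import os
import time
import numpy as np
import tensorflow as tf

# Compare on-disk sizes of the models produced above
for path in ["model.tflite", "model_dynamic_quant.tflite", "model_int8.tflite"]:
    if os.path.exists(path):
        print(f"{path}: {os.path.getsize(path) / 1024:.1f} KB")

# Rough average latency of the float model over 100 runs
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
input_data = np.random.rand(*input_details[0]['shape']).astype(np.float32)

start = time.perf_counter()
for _ in range(100):
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
print(f"Avg latency: {(time.perf_counter() - start) / 100 * 1000:.2f} ms")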
Final Thoughts
Converting TensorFlow models to TensorFlow Lite unlocks powerful opportunities to run AI applications on mobile and edge devices. Whether you’re building real-time vision apps, speech recognition, or IoT solutions, TensorFlow Lite provides the tools to make your models efficient, fast, and deployable anywhere.