Jaydeep Shah (JD)

Posted on May 18

FP32, INT4, and Everything Between - What I Learned About Precision on Mobile

#deeplearning #machinelearning #mobile #performance

I was familiar with precision formats from my embedded systems work, but seeing labels like "INT4," "FP16," and "GPTQ" on HuggingFace model downloads hit differently. In that context, they are not just precision specs - they determine whether your model fits on a phone, how fast it runs, and how much accuracy you trade away. Here is the intuition that clicked once I started deploying to mobile.

I started by understanding what gets compressed

A neural network is a massive collection of numbers called weights. During inference, the model multiplies your input by these weights billions of times. Quantization answers one question: how precisely do you need to store each of those numbers?

The precision spectrum I had to learn

Each weight can be stored at different levels of precision:

Format	Bits per Weight	Relative Size	Typical Use
FP32 (full precision)	32 bits	1x (baseline)	Training, research
FP16 (half precision)	16 bits	0.5x	GPU inference, fine-tuning
FP8	8 bits	0.25x	Specialized silicon, datacenter inference
INT8	8 bits	0.25x	Server-side quantized inference
INT4	4 bits	0.125x	On-device / mobile deployment

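Each halving of the bit width halves the memory footprint. A 2-billion parameter model at FP32 takes about 8 GB of memory (2 billion weights times 4 bytes each). At INT4, those same weights need about 1 GB (2 billion times half a byte). Shipped INT4 files run larger than that, because they also carry scale factors and often keep some tensors at higher precision.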

That is the difference between "will not fit on your phone" and "runs on your phone."
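The arithmetic behind the table can be sketched in a few lines. This is a minimal illustration that counts raw weight storage only; the function name `weight_gigabytes` is made up for this example, and real model files come out larger because of scale factors, metadata, and any tensors kept at higher precision.

```python
# Raw weight storage only: parameters x bits per weight.
# Real model files also carry scale factors and metadata, so they are larger.

BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def weight_gigabytes(num_params: int, fmt: str) -> float:
    """Return raw weight storage in GB (1 GB = 10**9 bytes)."""
    bits = BITS_PER_WEIGHT[fmt]
    return num_params * bits / 8 / 1e9


params = 2_000_000_000  # the 2-billion-parameter example from the text
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>5}: {weight_gigabytes(params, fmt):.2f} GB")
```

Swapping in a different parameter count gives a quick lower bound on what a given format will cost in memory.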

What these formats actually mean

FP32 stores each weight as a 32-bit floating-point number: a sign bit, 8 bits for the exponent, and 23 bits for the fractional part. This gives you roughly 7 decimal digits of precision. The weight 0.00314159 is stored almost exactly.

FP16 cuts that in half: 1 sign bit, 5 exponent bits, 10 fractional bits. About 3-4 decimal digits. That same weight becomes roughly 0.0031414 - close, but not exact.
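One way to see the rounding for yourself is to push a value through each format and read it back. The sketch below uses Python's standard `struct` module, whose `f` and `e` format codes are IEEE 754 single and half precision; the helper name `round_trip` is invented for this example.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Store a Python float in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


weight = 0.00314159
as_fp32 = round_trip(weight, "<f")  # IEEE 754 single precision (32 bits)
as_fp16 = round_trip(weight, "<e")  # IEEE 754 half precision (16 bits)

print(f"original: {weight!r}")
print(f"FP32:     {as_fp32!r}  (error {abs(as_fp32 - weight):.2e})")
print(f"FP16:     {as_fp16!r}  (error {abs(as_fp16 - weight):.2e})")
```

The FP16 error should come out several orders of magnitude larger than the FP32 error, which is the whole story of the 10-bit versus 23-bit fraction.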

INT8 and INT4 are integer formats. Instead of floating-point representation, the weight range is mapped onto a fixed set of integer values. INT8 gives you 256 possible values. INT4 gives you just 16.

Think about that. A weight that could be any of billions of possible floating-point values is now forced into one of 16 buckets. That is aggressive compression. (In practice, INT4 quantization uses per-group scale factors, so the effective resolution is finer than 16 uniform buckets globally - but the core tradeoff holds.)
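To make the per-group point concrete, here is a minimal sketch of symmetric 4-bit quantization, assuming each group shares one scale factor derived from its largest absolute weight. The function names and the sample weights are hypothetical, and production quantizers differ in details such as zero points and rounding.

```python
def quantize_group(weights, bits=4):
    """Symmetric quantization of one group: returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4 bits
    qmin = -(2 ** (bits - 1))    # -8 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def fake_quantize(weights, group_size, bits=4):
    """Quantize in groups, then dequantize, to see what the model would use."""
    restored = []
    for start in range(0, len(weights), group_size):
        codes, scale = quantize_group(weights[start:start + group_size], bits)
        restored.extend(code * scale for code in codes)
    return restored


# Hypothetical weights: four small values followed by four large ones.
weights = [0.011, -0.02, 0.005, 0.017, 0.9, -1.2, 0.45, 0.8]

one_scale = fake_quantize(weights, group_size=len(weights))
per_group = fake_quantize(weights, group_size=4)

print("one scale :", [round(v, 4) for v in one_scale])
print("per group :", [round(v, 4) for v in per_group])
```

With a single scale for all eight weights, the four small ones collapse to zero because the large weights set the bucket width. With one scale per group of four, the small weights keep their own, much finer, set of 16 buckets.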

A note on FP8

FP8 sits in an interesting middle ground - 8 bits, but in floating-point format rather than integer. It preserves more of the dynamic range than INT8 while using the same amount of storage. I worked on FP8 optimization at the silicon level, and it is genuinely a compelling format for inference workloads where you need better accuracy than INT8 but cannot afford FP16's memory cost. Chip designers are increasingly building native FP8 support into AI accelerators precisely because of this tradeoff (NVIDIA Transformer Engine; Qualcomm AI Engine Direct).
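The dynamic-range claim can be checked with a little arithmetic. The sketch below assumes the E4M3 variant of FP8 (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 7, with the all-ones pattern reserved for NaN); the post does not say which FP8 variant it means, so treat that choice as an assumption. The helper name `e4m3_value` is invented for this example.

```python
# Assumption: FP8 here means E4M3 (4 exponent bits, 3 mantissa bits, bias 7).
# The post does not specify a variant.

def e4m3_value(exponent_field: int, mantissa_field: int) -> float:
    """Decode a positive E4M3 number from its exponent and mantissa fields."""
    bias = 7
    if exponent_field == 0:  # subnormal: no implicit leading 1
        return (mantissa_field / 8) * 2.0 ** (1 - bias)
    return (1 + mantissa_field / 8) * 2.0 ** (exponent_field - bias)


fp8_max = e4m3_value(0b1111, 0b110)  # exponent 1111 with mantissa 111 is NaN
fp8_min = e4m3_value(0b0000, 0b001)  # smallest positive (subnormal) value

int8_max, int8_min = 127, 1          # largest and smallest nonzero magnitude

print(f"FP8 E4M3: max {fp8_max}, min {fp8_min}, ratio {fp8_max / fp8_min:.0f}")
print(f"INT8:     max {int8_max}, min {int8_min}, ratio {int8_max / int8_min:.0f}")
```

A scale factor shifts where the INT8 range sits but does not change the ratio between its largest and smallest nonzero magnitudes, which is why a floating-point 8-bit format can cover weights of very different sizes more gracefully.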

The analogy that made it click for me

The easiest way I found to think about quantization: it is like reducing the number of decimal places in a measurement.

Imagine you are measuring the length of a room:

FP32: 3.14159 meters - laboratory precision
FP16: 3.142 meters - engineering precision
INT8: 3.1 meters - good enough for most construction
INT4: 3 meters - rough estimate

For most purposes, "3 meters" is fine. You can buy the right amount of carpet. But if you are doing precision cabinetry, rounding to the nearest meter will produce gaps.

This is exactly what happens inside a quantized model. For common, well-represented patterns, the model still produces good results. The weights do not need to be precise to the seventh decimal place because the overall pattern is strong enough. But for rare edge cases - unusual inputs, subtle distinctions, low-frequency knowledge - the model loses its ability to differentiate. The signal was in the decimals that got rounded away.

It is similar to reducing image resolution. A 4K photo and a 480p version both clearly show a person's face. But zoom in on the text on their T-shirt, and the 480p version is unreadable. Common features survive compression; fine details do not.

Real numbers: Gemma 4 E2B on a phone

Here is why this matters concretely. Gemma 4 E2B has 5.1 billion total parameters (2.3 billion effective, thanks to Per-Layer Embeddings). At full FP32 precision, the 2.3 billion effective parameters alone would need roughly 9 GB (2.3 billion times 4 bytes), and all 5.1 billion about 20 GB - before accounting for the memory the operating system, other apps, and the inference runtime itself consume.

Most phones today have 6-12 GB of total RAM, shared across everything. A model that size would leave virtually nothing for Android to run your UI, manage the camera, or keep background apps alive. The system would kill your app or crash.

At INT4, the standard GPU model compresses to 2.59 GB. That fits. That runs. That is why quantization is not optional for mobile - it is a prerequisite.

When we built Redacto, our on-device PII redaction app for the Qualcomm x Google LiteRT Developer Hackathon 2026, we shipped the standard Gemma 4 E2B model at 2.59 GB (INT4 quantization, specifically dynamic_wi4_afp32 - INT4 weights with FP32 activations). The fine-tuned version of the same model, exported with different quantization granularity, came in at 4.7 GB.

The performance difference was dramatic:

Metric	Standard (2.59 GB)*	Fine-tuned (4.7 GB)*
Avg latency per inference	5,693 ms	10,626 ms
Throughput	12.8 tok/s	9.0 tok/s
Avg current draw	101 mA	301 mA

*Different quantization granularity from different export pipelines - not a pure apples-to-apples comparison, but a real illustration of how model size drives hardware cost.

The larger model was about 1.9x slower and drew about 3x the current. On a phone, that extra draw translates directly into shorter battery life and more thermal throttling.
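The ratios fall straight out of the table. The sketch below also estimates charge per inference, under the assumption that the reported average current holds for the full duration of one inference; the post does not say how the measurement window was defined, so read that last figure as a rough illustration rather than a benchmark result.

```python
# Values copied from the table above.
standard = {"latency_ms": 5693, "tok_per_s": 12.8, "current_ma": 101}
fine_tuned = {"latency_ms": 10626, "tok_per_s": 9.0, "current_ma": 301}

latency_ratio = fine_tuned["latency_ms"] / standard["latency_ms"]
current_ratio = fine_tuned["current_ma"] / standard["current_ma"]


def charge_mah(run):
    """Charge used by one inference, ASSUMING the average current
    applies for the whole latency window."""
    hours = run["latency_ms"] / 1000 / 3600
    return run["current_ma"] * hours


charge_ratio = charge_mah(fine_tuned) / charge_mah(standard)

print(f"latency ratio: {latency_ratio:.2f}x")
print(f"current ratio: {current_ratio:.2f}x")
print(f"charge per inference ratio (assumed window): {charge_ratio:.1f}x")
```

Slower and hungrier compound: under that assumption, each inference with the larger model costs several times the battery charge, not just three times.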

The tradeoff I did not expect

Here is the part nobody told me upfront. Quantization, fine-tuning, and model accuracy are locked in a three-way tradeoff:

You must quantize to run on mobile. There is no way around this. FP32 models do not fit.
Quantization degrades accuracy. You are rounding billions of numbers. Some information is lost.
Fine-tuning recovers accuracy on your specific task. By sharpening the model on exactly the patterns your app needs, you compensate for what quantization blurred.

This is why fine-tuning is not a luxury for on-device AI - it is part of the deployment strategy. You are already accepting precision loss from quantization. Fine-tuning lets you direct the remaining precision toward the things that matter most for your use case.

In our benchmarks, on one specific domain (tactical law enforcement data), the standard model scored 63.7% entity recall while the fine-tuned model scored 76.8%. That 13-point improvement came from training the model on domain-specific examples - teaching it to use its limited precision budget on the patterns that actually matter for that task. (The overall picture is more nuanced - I cover the full comparison in a later post in this series.)
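For readers new to the metric, entity recall is the share of entities that should have been found that the model actually found. The sketch below assumes exact matching on (text, type) pairs; the post does not describe its matching rule, and the sample entities are made up for illustration.

```python
def entity_recall(gold, predicted):
    """Fraction of gold entities the model found (true positives / gold)."""
    gold_set, predicted_set = set(gold), set(predicted)
    if not gold_set:
        return 1.0  # nothing to find, so nothing was missed
    return len(gold_set & predicted_set) / len(gold_set)


# Hypothetical example: (entity text, entity type) pairs for one document.
gold = [
    ("John Doe", "NAME"),
    ("555-0100", "PHONE"),
    ("12 Elm St", "ADDRESS"),
    ("Unit 7", "CALLSIGN"),
]
predicted = [
    ("John Doe", "NAME"),
    ("555-0100", "PHONE"),
    ("Main St", "ADDRESS"),  # wrong span, so it does not count as a hit
]

print(f"recall: {entity_recall(gold, predicted):.1%}")
```

For a redaction app, recall is the number to watch, because every missed entity is a piece of PII that leaks through.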

Quantization schemes I encountered along the way

When you find quantized models in the wild, here are the formats you will see most often:

dynamic_wi4_afp32 - INT4 weights, FP32 activations. This is what LiteRT-LM uses for on-device export. The weights are aggressively compressed to INT4, but activations stay at full FP32 precision during inference. This preserves more accuracy than quantizing everything (LiteRT documentation).
GPTQ - A post-training quantization method that uses calibration data to minimize the error introduced by quantization. It processes the model layer by layer and, within each layer, adjusts the weights that have not been quantized yet to compensate for rounding errors in the ones that already have. Widely supported by the open-source ecosystem (Frantar et al., 2022).
AWQ (Activation-Aware Weight Quantization) - Observes which weights matter most by looking at activation magnitudes, then protects those important weights by rescaling their channels before quantization. The authors report better quality than GPTQ at the same bit width (Lin et al., 2023).
bitsandbytes - A library that provides easy 8-bit and 4-bit quantization integrated with the HuggingFace ecosystem. Commonly used for QLoRA fine-tuning, where the base model stays frozen in 4-bit (QLoRA uses the NF4 data type rather than plain INT4) and only the small LoRA adapter trains in higher precision (Dettmers et al., 2023).

Each has its own tradeoffs in quality, speed, and tooling compatibility. For on-device deployment via LiteRT-LM, dynamic_wi4_afp32 is currently the standard path.
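To show what "4-bit weights, floating-point activations" can look like mechanically, here is a minimal sketch: weights are stored as signed 4-bit codes packed two per byte, then dequantized on the fly and multiplied by unquantized activations. This is an illustration of the general weight-only idea, not LiteRT's actual kernel or file layout; the function names are hypothetical, and Python floats stand in for FP32.

```python
def pack_int4(codes):
    """Pack signed 4-bit codes (-8..7) two per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes(
        ((hi & 0xF) << 4) | (lo & 0xF)
        for lo, hi in zip(codes[0::2], codes[1::2])
    )


def unpack_int4(packed, count):
    """Recover `count` signed 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


def dot_wi4_afloat(packed_weights, scale, activations):
    """Dot product with 4-bit stored weights and floating-point activations."""
    codes = unpack_int4(packed_weights, len(activations))
    return sum(code * scale * a for code, a in zip(codes, activations))


codes = [-8, 7, 0, -1, 3]             # hypothetical quantized weights
packed = pack_int4(codes)             # 5 weights fit in 3 bytes
activations = [1.0, 2.0, 3.0, 4.0, 5.0]

print(len(packed), "bytes for", len(codes), "weights")
print(dot_wi4_afloat(packed, scale=0.5, activations=activations))
```

The storage savings come entirely from the packed weights. The arithmetic itself still runs in floating point, which is why this style of scheme gives up less accuracy than quantizing activations as well.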

What I took away from all this

Quantization is not a magic trick and it is not free. It is a deliberate engineering decision: trade precision for the ability to run on constrained hardware. The skill is in understanding exactly what you are trading away and whether your application can tolerate it.

For most common use cases, INT4 quantization works remarkably well. The model still understands language, still follows instructions, still generates coherent output. But the edges get soft. Rare patterns, subtle distinctions, unusual inputs - these are where you feel the loss.

If your app lives on those edges, invest in fine-tuning to sharpen them back up. If your app handles common cases, INT4 out of the box might be all you need.

Either way, now you know what the acronym actually means.

Sources

Gemma 4 Model Card - PLE architecture, parameter counts
LiteRT Documentation - on-device quantization formats
GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers - Frantar et al., 2022
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration - Lin et al., 2023
QLoRA: Efficient Finetuning of Quantized LLMs - Dettmers et al., 2023
NVIDIA Transformer Engine FP8 Documentation
Qualcomm AI Engine Direct Overview
Benchmark data: Redacto project, Samsung Galaxy S25 Ultra (Snapdragon 8 Elite, SM8750)

Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware. Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.

Last updated: May 2026
4th of 22 posts in the "Edge AI from the Trenches" series
