You have a model that fits on a phone. It runs. It generates tokens. But the outputs are not quite right: maybe the format is wrong, or it misses edge cases, or it hallucinates on domain-specific content.
You have three levers to pull: prompt engineering, quantization, and fine-tuning. The question is not which one is "best." The question is which one to reach for first, and when to escalate. I learned this the hard way building Redacto, an on-device PII redaction app running Gemma 4 E2B via LiteRT-LM on a Snapdragon 8 Elite.
Here is the decision framework I wish I had before I started.
I started with prompts and they worked better than expected
Prompt engineering means crafting the instructions you send to the model: system prompts, examples, output format specifications. No weights change. This is the cheapest, fastest, and most underrated lever. You iterate in minutes, not hours. No training data. No GPU. No export pipeline. Just text.
There are three levels:
Zero-shot. A direct instruction with no examples. "Redact all personally identifiable information. Replace each PII entity with [CATEGORY_N]."
Few-shot. Include 2-3 worked examples so the model can pattern-match the desired behavior. Showing three examples of [NAME_1], [SSN_2], [PHONE_3] enforces the format far more reliably than instructions alone.
System personas. Frame the model's role. "You are a HIPAA compliance officer. Your task is to identify and redact all Protected Health Information." Personas anchor the model to a domain and, in my testing, cut down on off-domain hallucination.
For small on-device models (1-4B parameters), there is a constraint cloud models do not have: context window. A 1024-token KV cache means your system prompt, few-shot examples, and user input must all fit. Long few-shot prompts that work on a 128K-context cloud model may not fit on-device. You have to be concise.
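To make the budget constraint concrete, here is a minimal sketch of assembling a persona plus few-shot prompt and dropping examples until it fits. Everything in it is an assumption for illustration: the 4-characters-per-token estimate is a crude stand-in for the real tokenizer, the few-shot pairs are made up, and `build_prompt`, `estimate_tokens`, and `reserve_for_output` are hypothetical names, not part of LiteRT-LM.

```python
# Hypothetical sketch: fit persona + few-shot examples + user input into a small context.
CONTEXT_BUDGET_TOKENS = 1024  # the KV cache size discussed above
CHARS_PER_TOKEN = 4           # crude heuristic, NOT the model's real tokenizer

SYSTEM_PERSONA = (
    "You are a HIPAA compliance officer. Your task is to identify and "
    "redact all Protected Health Information. "
    "Replace each PII entity with [CATEGORY_N]."
)

# Made-up few-shot pairs, for illustration only.
FEW_SHOT = [
    ("Call John Smith at 555-0142.", "Call [NAME_1] at [PHONE_2]."),
    ("SSN 123-45-6789 belongs to Ana Ruiz.", "SSN [SSN_1] belongs to [NAME_2]."),
]


def estimate_tokens(text: str) -> int:
    """Rough token estimate: ceiling of character count / CHARS_PER_TOKEN."""
    return -(-len(text) // CHARS_PER_TOKEN)


def build_prompt(user_input: str, examples=FEW_SHOT, reserve_for_output: int = 256) -> str:
    """Build the prompt, dropping trailing few-shot examples until it fits the budget."""
    examples = list(examples)  # copy so the caller's list is never mutated
    while True:
        parts = [SYSTEM_PERSONA]
        for src, dst in examples:
            parts.append(f"Input: {src}\nOutput: {dst}")
        parts.append(f"Input: {user_input}\nOutput:")
        prompt = "\n\n".join(parts)
        if estimate_tokens(prompt) + reserve_for_output <= CONTEXT_BUDGET_TOKENS:
            return prompt
        if not examples:
            raise ValueError("input does not fit the context budget even zero-shot")
        examples.pop()  # fall back toward zero-shot
```

The design choice worth noting is the fallback order: the user's text is never truncated, so the examples are what get sacrificed when space runs out. In a real app you would swap `estimate_tokens` for a count from the model's own tokenizer.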
What I measured. Our standard Gemma 4 E2B model with prompt engineering alone scored 80.5% overall accuracy across 85 test cases and 5 domain modes. On HIPAA specifically, it hit 95.7% entity recall. That is a stock model from litert-community with a well-crafted system prompt. No training. No Colab. No export pipeline.
Quantization turned out to be mandatory, not optional
Quantization reduces the numerical precision of model weights, converting 32-bit floating point values to 4-bit or 8-bit integers. For on-device deployment, this is not a choice. A 2B parameter model at FP32 requires roughly 8 GB of memory. Your phone does not have 8 GB of free RAM for a single model.
The precision levels that matter on-device:
- INT8 (8-bit integer): a quarter of the FP32 memory. Slight accuracy loss. Common for server-side optimization.
- INT4 (4-bit integer): one-eighth of the FP32 memory. This is the sweet spot for on-device LLMs. Gemma 4 E2B at INT4 fits in approximately 2.59 GB.
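The arithmetic behind those fractions is simple enough to sketch. This assumes a nominal 2 billion weights and counts raw weight storage only; `weight_memory_gb` is a hypothetical helper, and the result is a back-of-envelope figure, not a prediction of artifact size (the shipped INT4 file is larger than the naive number).

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes); ignores KV cache and file metadata."""
    return num_params * bits_per_weight / 8 / 1e9


NOMINAL_PARAMS = 2e9  # assumption: a nominal "2B" model

for name, bits in [("FP32", 32), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB")
# FP32: 8.0 GB
# INT8: 2.0 GB
# INT4: 1.0 GB
```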
The critical thing I learned: quantization is a compile-time decision, not a runtime decision. When you export to .litertlm, you specify the quantization scheme (e.g., dynamic_wi4_afp32 for INT4 weights with FP32 activations). That choice is baked into the artifact. You cannot re-quantize a .litertlm file. If you want to try INT8 instead of INT4, you re-export from the source weights.
Quantization does degrade accuracy, particularly on edge cases. Work on GPTQ (Frantar et al., 2022) and AWQ (Lin et al., 2023) shows that careful quantization strategies minimize this degradation, but it is never zero.
What I measured. Our standard model uses INT4 quantization at 2.59 GB. The fine-tuned model, exported with a different quantization granularity, came in at 4.7 GB. Same architecture, same parameter count, nearly double the size because of an export pipeline difference. The larger model drew roughly 3x the current (301 mA vs 101 mA) and ran 1.9x slower (10,626 ms vs 5,693 ms average latency). Quantization strategy is not just about accuracy. It directly determines whether your model is practical to deploy.
I fine-tuned last, and the results surprised me
Fine-tuning modifies the model's weights to encode new behavior. But modern fine-tuning uses parameter-efficient methods that train a tiny fraction of the weights.
LoRA (Low-Rank Adaptation) decomposes weight updates into two small matrices instead of updating the full weight matrix. A layer with a 2048 x 2048 weight matrix would require learning about 4.2 million values (4,194,304) for a full update. LoRA with rank 8 (Hu et al., 2021) learns two matrices of 2048 x 8 and 8 x 2048: only 32,768 values.
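The parameter-count arithmetic can be checked in a few lines. This is a sketch of the counting only, not of LoRA training itself; `full_update_params` and `lora_params` are hypothetical helper names.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Values learned when updating the whole d_out x d_in weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Values learned by the two LoRA factors: (d_out x rank) and (rank x d_in)."""
    return d_out * rank + rank * d_in


full = full_update_params(2048, 2048)
lora = lora_params(2048, 2048, rank=8)
print(full, lora, f"{lora / full:.2%}")  # 4194304 32768 0.78%
```

For this one layer, rank 8 trains under 1% of the values a full update would touch, which is where the "tiny fraction" comes from.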
QLoRA takes this further: it loads the base model in 4-bit quantized form during training, then applies LoRA adapters on top. You can fine-tune a multi-billion parameter model on a single consumer GPU. Dettmers et al. (2023) demonstrated that 4-bit quantized models with LoRA adapters can match the performance of full 16-bit fine-tuning.
When to fine-tune: when prompt engineering provably cannot achieve your accuracy or format requirements. Specifically: the model needs domain-specific output formats that few-shot examples cannot enforce, or domain knowledge not in its training data, or subtle distinctions that instructions alone cannot capture.
When not to fine-tune: before trying prompt engineering (always try prompts first), when your training data does not match your deployment task (the trap I fell into), or when you need NPU deployment. As of mid-2026, there is no public toolchain for compiling fine-tuned models to NPU targets like Qualcomm's Hexagon. You can fine-tune and deploy to GPU, but NPU compilation remains blocked for custom models.
The table that changed my thinking
Overall: Standard (prompt engineering) vs Fine-tuned:
| Metric | Standard Model | Fine-tuned Model | Delta |
|---|---|---|---|
| Overall Score | 80.5% | 70.3% | -10.2 pts |
| Entity Recall | 79.3% | 71.7% | -7.6 pts |
| Format Score | 79.8% | 71.7% | -8.1 pts |
| Preservation | 83.7% | 65.9% | -17.8 pts |
| Avg Throughput | 12.8 tok/s | 9.0 tok/s | -30% (relative) |
These scores come from an earlier single-pass evaluation (85 entries, GPU, an older LiteRT build) - not the shipped 3-step pipeline benchmarked elsewhere in this series. Standard and fine-tuned were scored in the same run, so the head-to-head is fair; just do not compare these absolute numbers to the latency figures from the 3-step pipeline.
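One detail of the table is easy to misread: the score deltas are differences in percentage points, while the throughput delta is a relative change. A short sketch that recomputes both from the table's own numbers (`point_delta` and `relative_change` are hypothetical helper names):

```python
# (standard, fine-tuned) pairs copied from the table above
overall = {
    "Overall Score": (80.5, 70.3),
    "Entity Recall": (79.3, 71.7),
    "Format Score": (79.8, 71.7),
    "Preservation": (83.7, 65.9),
}


def point_delta(standard: float, fine_tuned: float) -> float:
    """Difference in percentage points (fine-tuned minus standard)."""
    return round(fine_tuned - standard, 1)


def relative_change(standard: float, fine_tuned: float) -> float:
    """Change relative to the standard model, as a fraction."""
    return (fine_tuned - standard) / standard


for metric, (std, ft) in overall.items():
    print(f"{metric}: {point_delta(std, ft):+.1f} points")
print(f"Avg Throughput: {relative_change(12.8, 9.0):+.0%}")  # -30%
```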
The fine-tuned model lost. Overall. But look at the per-domain breakdown:
Per-mode entity recall:
| Domain Mode | Standard | Fine-tuned | Winner |
|---|---|---|---|
| FIELD_SERVICE | 82.1% | 95.3% | Fine-tuned (+13.2 pts) |
| FINANCIAL | 83.8% | 85.5% | Fine-tuned (+1.7 pts) |
| HIPAA | 95.7% | 39.9% | Standard (+55.8 pts) |
| JOURNALISM | 71.1% | 61.3% | Standard (+9.8 pts) |
| TACTICAL | 63.7% | 76.8% | Fine-tuned (+13.1 pts) |
The fine-tuned model crushed FIELD_SERVICE and TACTICAL, the domains where its training data (ai4privacy/pii-masking-400k) happened to include relevant patterns. It catastrophically failed on HIPAA, where the standard model's carefully crafted system prompt was already handling relational PHI ("the patient's daughter Lisa") that the fine-tuning data never covered.
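To see how much of the gap the HIPAA collapse explains, here is a rough sketch that averages the per-mode recall figures with and without that mode. It assumes every mode counts equally, which is an assumption of this sketch and not something the benchmark states; the real overall score may weight modes by test-case count.

```python
# (standard, fine-tuned) entity recall per mode, copied from the table above
recall = {
    "FIELD_SERVICE": (82.1, 95.3),
    "FINANCIAL": (83.8, 85.5),
    "HIPAA": (95.7, 39.9),
    "JOURNALISM": (71.1, 61.3),
    "TACTICAL": (63.7, 76.8),
}


def mean_recall(modes):
    """Unweighted mean recall (standard, fine-tuned) over the given modes."""
    std = sum(recall[m][0] for m in modes) / len(modes)
    ft = sum(recall[m][1] for m in modes) / len(modes)
    return round(std, 1), round(ft, 1)


all_modes = list(recall)
print(mean_recall(all_modes))                                  # (79.3, 71.8)
print(mean_recall([m for m in all_modes if m != "HIPAA"]))     # (75.2, 79.7)
```

Under that equal-weight assumption, the five-mode average lands close to the overall Entity Recall row, and leaving HIPAA out puts the fine-tuned model ahead on recall. One domain accounts for most of the headline loss.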
The lesson is not that fine-tuning is bad - it is that our fine-tune never got a fair shot. This was a one-day, one-epoch run on 3,000 examples sampled from a 400,000-example dataset: under 1% of the data available. And that slice was misaligned too, using the generic [REDACTED] format rather than Redacto's structured [CATEGORY_N]. So the model was fed too little, too briefly, in the wrong format. Fine-tuning is only as good as the quantity and alignment of what you train on, and we gave it little of either. With the full 400k and format-matched labels, the result could well have flipped - we simply did not test that. What we actually proved is narrower and more useful: a good prompt beat an under-resourced fine-tune, so before committing to a serious fine-tuning effort, prompting got us further, faster.
QLoRA training details: 2,850,816 trainable parameters (0.06% of total), LoRA rank 8, alpha 16, 3,000 training examples (a sub-1% slice of ai4privacy/pii-masking-400k), 1 epoch, 217 seconds training time. A deliberately quick attempt - the compute was trivial, and we never scaled the data or aligned the label format. That, not any limit of fine-tuning itself, is what these numbers reflect.
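The format mismatch is the kind of thing a small preprocessing step could address before training. A minimal sketch, assuming each training example arrives as text plus character-offset spans with a category label. That input shape is hypothetical and is not the actual schema of ai4privacy/pii-masking-400k. Numbering placeholders with one running index across all entities is also an assumption about the [CATEGORY_N] convention.

```python
def to_structured_placeholders(text, spans):
    """Replace each (start, end, category) span with a [CATEGORY_N] placeholder.

    N is a running index over entities in order of appearance (an assumption).
    """
    out, cursor = [], 0
    for n, (start, end, category) in enumerate(sorted(spans), start=1):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        out.append(text[cursor:start])           # untouched text before the entity
        out.append(f"[{category.upper()}_{n}]")  # structured placeholder
        cursor = end
    out.append(text[cursor:])                    # remainder after the last entity
    return "".join(out)


text = "Call John Smith at 555-0142."
spans = [(5, 15, "name"), (19, 27, "phone")]  # hypothetical annotation format
print(to_structured_placeholders(text, spans))  # Call [NAME_1] at [PHONE_2].
```

Regenerating the target strings this way would let the training labels match what the deployed prompt asks for, instead of teaching the model a generic [REDACTED] token.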
The decision framework I now follow
One clarification before the list: quantization is not a quality lever in the same sense as the other two. It is a deployment prerequisite - you quantize because the model physically will not fit otherwise, not because you are chasing accuracy. So think of it as step zero, and then reach for the quality levers in order.
- Quantize because you must. This is not negotiable for on-device: choose INT4 for models under 4B parameters. It happens at export time, and it is the price of admission, not a tuning knob.
- Prompt engineer first (among the quality levers). Iterate on system prompts, few-shot examples, and output format instructions. This is free, fast, and often sufficient.
- Fine-tune last. Only when prompt engineering provably cannot achieve your requirements, AND you have enough well-aligned training data that matches your deployment task.
- Know the blockers. Fine-tuned models currently cannot be compiled for NPU targets - the exact wall we hit in the delegates post, where our fine-tune could only reach GPU. You are limited to GPU/CPU backends, and GPU decode is roughly 1.7x slower than NPU.
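The ordering above can be written down as a small decision function. This is a sketch of the framework as a checklist, not a tool from the Redacto codebase; `ProjectState`, its fields, and `next_lever` are hypothetical names, and each boolean stands for something you would establish with your own evaluation.

```python
from dataclasses import dataclass


@dataclass
class ProjectState:
    fits_on_device: bool          # quantized artifact fits in device memory
    prompts_meet_target: bool     # measured on your own eval set
    prompt_options_exhausted: bool
    aligned_training_data: bool   # labels match the deployment output format
    requires_npu: bool


def next_lever(s: ProjectState) -> str:
    """Return the next step, following the order: quantize, prompt, then fine-tune."""
    if not s.fits_on_device:
        return "quantize"
    if s.prompts_meet_target:
        return "ship"
    if not s.prompt_options_exhausted:
        return "prompt-engineer"
    if s.requires_npu:
        return "blocked: fine-tuned models cannot target NPU yet"
    if not s.aligned_training_data:
        return "collect aligned data"
    return "fine-tune"


print(next_lever(ProjectState(True, False, False, False, False)))  # prompt-engineer
```

Fine-tuning is only reachable after every cheaper option has been ruled out, which is the point of the list.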
What I took away from this
In the cloud ML world, fine-tuning is often the first thing teams reach for. In our experience building Redacto, it should be the last lever you pull on-device.
The reason is practical: on-device deployment adds an export/compilation step that amplifies every upstream decision. A mismatched chat template breaks the model entirely. A different quantization granularity doubles the model size. Training data that does not match your output format tanks your scores even when the model "knows" the right answer.
Prompt engineering has none of these risks. You change text, you test, you iterate. The model binary stays the same.
Start with prompts. Quantize because you must. Fine-tune only when you have proven, with data, that prompts are not enough, and only when your training data matches your deployment task exactly.
Related posts in the "Edge AI from the Trenches" series
This post pulls together threads from across the foundations series:
- FP32, INT4, and Everything Between - the quantization lever, in depth
- I Opened a .litertlm File - why quantization is a compile-time decision baked into the artifact
- The Invisible Layer Between My Prompt and the Model - the chat-template format trap that hobbled our fine-tune
- One Model, Three Chips, Two Files - why a fine-tuned model can only reach GPU, not the NPU
- What On-Device AI Benchmarks Actually Feel Like - the metrics that quantization and backend choice drive
Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference, bringing models from the cloud to phones and edge hardware. Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.
Last updated: July 2026
11th of 22 posts in the "Edge AI from the Trenches" series
