Fine-tuning a VLM to build an on-device fashion-scoring app

#ai #machinelearning #ios #llm

Scoring outfits with AI. Offline.

Can it be done?
Style is qualitative. There isn't a single answer. AI can give a generic answer, but can it answer something like fashion, where the criteria vary by culture?

There is a way.
This article is a record of building a fully offline fashion-scoring app on iPhone using a Visual LLM (VLM).

The approach

Use a closed system of evaluation criteria.

Every aesthetic or philosophical judgment has many schools of thought, and it's hard to produce an open answer that satisfies every possible criterion.

But within a single school — whether in fashion, sports, or specialized work — there are cases where the correct answer is determined inside a closed system.

For example, here I referenced the idea of "the balance between dressy and casual," popularized for a general audience by the Japanese men's-fashion influencer "MB," and treated "if the dressy-to-casual balance is close to 7:3, it looks stylish" as the axis, scoring input images on it. (This is my own interpretation from reading MB's blog and so on.)

Each item of an outfit — tops, bottoms, shoes — is scored against a somewhat systematized standard. An AI (LLM) can do this. And it does it quite well. Even ~1,000 training examples is enough. You don't need to learn every possible item; it extrapolates to unseen ones.

That's the real subject here. More than scoring fashion itself, the theme is how well-suited an LLM is to handling a "closed system."

Small models that fit on an iPhone are well-suited to this kind of domain-specific fine-tuning. With fewer parameters, training is cheap.

This approach works not just for fashion but for anything where the answer is established within a closed system of a given school — makeup, sports form, fortune-telling, and so on.

How it's built

Fine-tune by knowledge distillation:

Teacher = large model (Qwen3-VL-235B-A22B)
Student = small model (Qwen3-VL-2B)

Feed a theory document (~10KB: definitions of 5 axes + baseline tables + aggregation rules + output rules) to the large model as a prompt, and have it score the training images according to that document.

Only the large model can do this; the small model can't hold the entire theory document.

Using the set of (image input given to the large model, output the large model produced), fine-tune the small model.

Now the small model can produce output grounded in the theoretical system. It doesn't know the theory document, but it can perform the baked-in process.

For this one closed-domain evaluation alone, the small model can imitate the behavior of a model 10×–100× its size.

How it works

Input: image
Output: fixed-schema JSON label
Train Qwen3-VL 2B on (image, fixed question, JSON) triplets via LoRA fine-tuning (student)
Convert to CoreML -> iPhone -> fully offline scoring

Because it's "closed," ~800 images are enough. The mapping has low entropy, so if the teacher emits labels under consistent rules, even a small set lets the student reconstruct those rules.

The highest-leverage part is the "theory document"

The most influential file in this pipeline is neither the training script nor the model definition — it's the theory document (the instructions to the teacher).

Writing a genuine theory document is the one thing you can't skimp on.

Output schema (author's reconstruction)

The JSON I had the student emit looks roughly like this (an implementation structure, not text from any original source):

{
  "items": [
    {
      "category": "tops",
      "description": "white cotton dress shirt",
      "scores": { "color": 5, "silhouette": 4, "material": 4, "design": 4, "item_type": 4 },
      "item_dress_score": 4.2
    },
    {
      "category": "bottoms",
      "description": "black skinny trousers",
      "scores": { "color": 5, "silhouette": 4, "material": 4, "design": 4, "item_type": 4 },
      "item_dress_score": 4.3
    }
  ],
  "overall_dress_ratio": 0.71,
  "coordinate_silhouette": { "type": "I", "style_score": 4, "rationale": "..." },
  "target_ratio": 0.70,
  "verdict": "Near-ideal 7:3 for street wear. Dressy edges ahead slightly; clean enough.",
  "advice": "..."
}

Implementation stack and numbers

Role	Choice	Notes
Base model	`Qwen/Qwen3-VL-2B-Instruct`	fp16/int8 stable on Apple Silicon; shipped
(compared) alt base	`google/gemma-4-E2B-it`	schema collapse at int4; passed over for FT
Teacher labeler	Qwen3-VL-235B-A22B	reads the theory and judges JSON
Training	LoRA rank16 / alpha32, `language_model.*` only, vision frozen	~25 min on Colab A100
Conversion	coreml-llm Qwen3-VL stateful pipeline	MLState + slice_update KV
Device	iPhone 17 Pro (A19 ANE)	2.3GB int8 / ~24 tok/s

Training data was ~800–900 full-body outfit photos from Unsplash + Pexels (~750 used for training). One iteration (collect → label → train → convert → transfer) takes roughly 2.5 hours.

Closing: a "dedicated scorer" that fits in your pocket

Specialized knowledge that can be written as a closed system runs faster, cheaper, more consistently, and more privately when distilled wholesale into a 2B model on-device than when thrown at a giant API. If a giant general model is "an advisor who knows a little about everything," what I built here is a way to put "a scorer who has drilled one certification standard into their body" in your pocket. Scoring, assessment, certification, fixed-schema extraction — the world has surprisingly many "closed systems," and any of them might be bakeable to device size with the same pattern.

※ To repeat: this implementation is not supervised or endorsed by any specific individual or organization; it is an independent reconstruction of publicly known ideas for technical validation. The scores do not represent any definitive "correct answer."

Note

The idea this article builds on — "shifting the dressy-to-casual balance toward 7:3" — references a publicly and widely known idea in Japanese men's fashion. The scoring axes, JSON schema, prompt design, and aggregation rules here are my own reconstruction for technical validation, not a quotation or reproduction of any original text, figures, or images. This implementation is not an official, supervised, partnered, or endorsed app of any individual or organization, nor is it intended as an accurate explanation of the theory. It is purely a technical experiment in "how to internalize a subjective evaluation axis into an image-understanding model," and the scores do not constitute anyone's definitive judgment. The value of this article lies not in fashion theory itself but in the methodology of distilling a closed system into a small model.

Originally published in Japanese on Qiita. I build apps with machine learning and AR, and write about both. GitHub / X