Shiva Shrestha

Fine-tuning CLIP on a Niche Domain: How I Got +26pp Accuracy on Architectural Styles and What You Can Apply to Your Own Domain

Most fine-tuning write-ups end at "we got X% accuracy." This one walks through the four decisions before and after the training loop that actually moved the number. The training loop itself was the easy part. If you're fine-tuning a vision-language model on a niche domain, these are the decisions you'll face too.

The project: I fine-tuned OpenCLIP ViT-B/32 on 24 architectural style classes and shipped the embedder as the retrieval backbone for visquery.com, an architectural precedent search tool. Base CLIP zero-shot on my val set: 61.4%. Fine-tuned: 87.4%. That's +26 percentage points, and almost none of it came from tuning the training loop.

Each section below is a decision point with the reasoning behind it. Not just what I did, but why, and what the generalizable principle is for any domain.

1. Pick a domain where you can read the errors, not just count them

Generalizable principle: domain knowledge isn't just context, it's a forcing function for better decisions at every stage.

I'm an architect by training. When I open a confusion matrix and see my model conflating Baroque with Beaux-Arts, I know that's a fair mistake: both styles share ornate facades, heavy cornices, and classical orders lifted from Rome. When it mixes up Georgian and Colonial, I can point to exactly which visual cues overlap (white symmetrical facades, pedimented entries) and which don't (window proportions, cornice detailing).

That's not just satisfying. It changes how you iterate.

Most fine-tuning posts use datasets where the author trusts the labels but can't explain the errors. You end up chasing metrics without knowing whether a mistake is a model failure, a labeling ambiguity, or a pair of classes that are genuinely hard to distinguish even for humans. Pick a domain you understand well enough to judge the confusions, not just measure them. If you can't do that yet, talk to a domain expert before you label anything.

For architectural styles, the hardest confusion clusters are: Gothic/Romanesque (pre-Renaissance, both stone, both vertical emphasis), Greek Revival/Colonial/Georgian (white-columned American residential and civic), and Queen Anne/Tudor/Edwardian (late 19th/early 20th British-derived residential). I knew these pairs before I wrote a single line of training code. That knowledge shaped every subsequent decision; the hard-negative batching strategy in section 4 flows directly from this list.


2. Let base CLIP filter its own training data

Generalizable principle: use the pretrained model as a data quality gate before any fine-tuning begins.

Starting dataset: 7,018 training images across 24 classes, sourced from Wikimedia Commons under CC licenses. Before touching a training loop, I ran every image through base CLIP's zero-shot classifier and dropped anything where confidence on its own label fell below 0.05.

The intuition is simple: if the unmodified model sees zero signal that an image belongs to its labeled class, the label is probably wrong or the image is genuinely ambiguous. Training on it is noise. This works for any domain: swap out the class names and it's a drop-in quality filter for your dataset.

Here's the filter in about 25 lines:

import open_clip
import torch
from PIL import Image

# Base (unmodified) CLIP is used purely as a label-quality gate here.
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-B-32')

CONFIDENCE_THRESHOLD = 0.05

def is_clean_sample(image_path: str, label: str, class_names: list[str]) -> bool:
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    # Zero-shot prompts, one per class. (In practice, encode these once and cache
    # txt_feat instead of recomputing it for every image.)
    texts = tokenizer([f"a photo of {c} architecture" for c in class_names])

    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(texts)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

    # Keep the sample only if base CLIP assigns at least minimal probability
    # to the image's own label.
    idx = class_names.index(label)
    return probs[0, idx].item() >= CONFIDENCE_THRESHOLD

clean = [(p, l) for p, l in all_samples if is_clean_sample(p, l, class_names)]
print(f"Kept {len(clean)}/{len(all_samples)} after quality filter")

After filtering, I oversampled each class to 280 images with augmentation — not duplication. Every copy gets independent transforms (random crops, flips, color jitter, Gaussian blur), so there are no duplicate gradients. Minimum class size before oversampling was 122 images.
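
Here's a rough sketch of what that oversampling looks like. The dataset class, transform parameters, and TARGET_PER_CLASS constant are illustrative, not pulled from the repo:

import random
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

TARGET_PER_CLASS = 280  # post-filter target size per class

# Illustrative augmentation stack; in practice you'd end with CLIP's own
# normalization (the mean/std baked into `preprocess`).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 1.5)),
    transforms.ToTensor(),
])

class OversampledStyleDataset(Dataset):
    def __init__(self, samples_by_class: dict[str, list[str]]):
        # Resample each class's image paths (with replacement) up to the target.
        # Repeated paths are fine: every __getitem__ draws fresh random transforms,
        # so no two copies of an image produce identical tensors or gradients.
        self.items = []
        for label, paths in samples_by_class.items():
            resampled = random.choices(paths, k=max(TARGET_PER_CLASS, len(paths)))
            self.items += [(p, label) for p in resampled]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        path, label = self.items[idx]
        return augment(Image.open(path).convert("RGB")), label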


3. Two-stage training: text tower first, then visual

Generalizable principle: align the text side to your label vocabulary before touching visual weights. Protect what the pretrained model already knows.

Single-stage fine-tuning across all layers tends to overwrite the general visual representations CLIP already learned. The two-stage approach preserves them.

Stage 1 (epochs 1–5, LR 5e-6): Freeze the entire visual encoder. Train only the text tower and projection heads. The goal here isn't accuracy — it's alignment. I'm teaching the model that "Baroque architecture" in my label vocabulary corresponds to the visual features CLIP already knows how to see. No point moving those visual weights until the text side is calibrated.

Stage 2 (epochs 6–15, LR 5e-7): Unfreeze resblocks 10 and 11 only — the last two transformer blocks in the visual encoder. Drop the LR by 10x. Now the model can develop fine-grained visual discriminability: learn that Baroque and Beaux-Arts look different, not just label differently.
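
A sketch of the freeze/unfreeze logic, assuming OpenCLIP's standard ViT-B/32 module layout (image tower at model.visual, its transformer blocks at model.visual.transformer.resblocks, visual projection at model.visual.proj). The optimizer choice is illustrative; the learning rates are the ones above:

import torch

def configure_stage(model, stage: int):
    # Everything trainable by default, then freeze the visual tower.
    for p in model.parameters():
        p.requires_grad = True
    for p in model.visual.parameters():
        p.requires_grad = False

    # Keep the visual projection head trainable in both stages
    # (it lives inside model.visual in OpenCLIP's ViT implementation).
    if getattr(model.visual, "proj", None) is not None:
        model.visual.proj.requires_grad = True

    if stage == 2:
        # Unfreeze only the last two visual transformer blocks (resblocks 10-11).
        for block in model.visual.transformer.resblocks[-2:]:
            for p in block.parameters():
                p.requires_grad = True

    lr = 5e-6 if stage == 1 else 5e-7  # 10x LR drop when visual blocks unlock
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)  # optimizer choice is illustrative

# optimizer = configure_stage(model, stage=1)  # epochs 1-5
# optimizer = configure_stage(model, stage=2)  # epochs 6-15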

Training log:

| Epoch | Loss  | Val Acc | Stage                        |
|-------|-------|---------|------------------------------|
| 1     | 1.915 | 80.2%   | 1 (text only)                |
| 3     | 1.568 | 84.5%   | 1                            |
| 5     | 1.493 | 87.0%   | 1                            |
| 6     | 1.476 | 86.2%   | 2 (resblocks 10–11 unlocked) |
| 10    | 1.447 | 87.2%   | 2                            |
| 12    | 1.449 | 87.4%   | 2                            |

The dip at epoch 6 is normal — unlocking new layers introduces instability before the model adapts. Accuracy recovered within two epochs.


Production scorecard at checkpoint epoch 12:

| Metric                  | Value | Target | Status |
|-------------------------|-------|--------|--------|
| Classification accuracy | 0.874 | ≥ 0.90 | Fail   |
| Macro F1                | 0.867 | ≥ 0.90 | Fail   |
| Hard-neg pass rate      | 0.904 | ≥ 0.80 | Pass   |
| Semantic R@1            | 0.880 | ≥ 0.70 | Pass   |
| ECE                     | 0.056 | < 0.10 | Pass   |
| Noise conf p95          | 0.466 | < 0.50 | Pass   |

4/6 gates pass. Accuracy and F1 don't hit 0.90 yet. More on that at the end.


4. Hard-negative batching from confusion clusters you already know

Generalizable principle: use your domain knowledge from step 1 to build batches that force the model to learn the hard distinctions — not just recognize styles in isolation.

With random batching, the model might see Gothic and Romanesque in the same batch once every dozen iterations. Hard-negative batching makes that pairing deliberate.

~50% of each mini-batch was drawn from the three hardest confusion clusters I'd identified before training: Gothic/Romanesque, Greek Revival/Colonial/Georgian, and Queen Anne/Tudor/Edwardian. The model sees these pairs side-by-side every single iteration.

The effect: instead of learning 24 styles in isolation, the model is forced to learn the differences between the ones that actually look similar. That's the actual problem in architectural image retrieval — not recognizing styles in isolation, but separating them when they share visual DNA.
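
A sketch of how such a batch sampler can look. The cluster lists match section 1, but the exact label strings and the sampler class itself are illustrative rather than copied from the repo:

import random
from torch.utils.data import Sampler

# Confusion clusters from section 1; exact label strings are illustrative.
CONFUSION_CLUSTERS = [
    {"Gothic", "Romanesque"},
    {"Greek Revival", "Colonial", "Georgian"},
    {"Queen Anne", "Tudor", "Edwardian"},
]

class HardNegativeBatchSampler(Sampler):
    def __init__(self, labels: list[str], batch_size: int, hard_fraction: float = 0.5):
        self.batch_size = batch_size
        self.n_hard = int(batch_size * hard_fraction)
        hard = {i for i, l in enumerate(labels)
                if any(l in cluster for cluster in CONFUSION_CLUSTERS)}
        self.hard_idx = sorted(hard)
        self.easy_idx = [i for i in range(len(labels)) if i not in hard]
        self.num_batches = len(labels) // batch_size

    def __iter__(self):
        for _ in range(self.num_batches):
            # ~Half of every batch comes from the hard clusters, so the model
            # sees the genuinely confusable styles side by side each iteration.
            batch = random.sample(self.hard_idx, self.n_hard)
            batch += random.sample(self.easy_idx, self.batch_size - self.n_hard)
            random.shuffle(batch)
            yield batch

    def __len__(self):
        return self.num_batches

# loader = DataLoader(dataset, batch_sampler=HardNegativeBatchSampler(labels, 64))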

Final confusion matrix: residual errors concentrated in Baroque/Beaux-Arts and International/Bauhaus

Per-class F1 scores and t-SNE projection of the embedding space

Residual confusions in the final matrix: Baroque ↔ Beaux-Arts (F1: 0.769 and 0.842), International Style ↔ Bauhaus (F1: 0.739 and 0.800). Georgian lands at 0.600 F1, but that's a val-set size artifact — only 5 validation samples. Every other class sits at or above 0.739.


5. Calibrate before you ship

Generalizable principle: accuracy tells you how often the model is wrong. Calibration tells you whether it knows when it's wrong. You need both before shipping.

87.4% accuracy means roughly 1 in 8 predictions is wrong. In a search product, those wrong predictions show up in results. If the model is overconfident on the 12.6% it gets wrong, you're actively surfacing confident errors to users.

Temperature calibration on the val set using scipy.optimize.minimize_scalar to minimize ECE:

|                  | ECE    |
|------------------|--------|
| Pre-calibration  | 0.0938 |
| Post-calibration | 0.0559 |

The model now assigns lower confidence to predictions it's more likely to get wrong — which means the search system can use confidence scores to filter or rank results more reliably.

Reliability diagram before and after calibration, plus OOD confidence distribution

OOD check: ran the calibrated model on pure noise images. Mean confidence: 0.355. p95: 0.466 — below the 0.50 gate. The model doesn't confidently assign random images to known architectural classes. That matters when your index contains images from the open web.
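
A sketch of that check; the image count, size, and function name are illustrative. txt_feat is assumed to be the normalized class-prompt embeddings from the filter code in section 2, and in the calibrated setup you'd divide the logits by the fitted temperature before the softmax:

import numpy as np
import torch
from PIL import Image

@torch.no_grad()
def noise_confidence_stats(model, preprocess, txt_feat, n: int = 200):
    confs = []
    for _ in range(n):
        # Pure uniform-noise RGB image, pushed through the same CLIP preprocessing.
        noise = Image.fromarray(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))
        feat = model.encode_image(preprocess(noise).unsqueeze(0))
        feat /= feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * feat @ txt_feat.T).softmax(dim=-1)
        confs.append(probs.max().item())  # top-class confidence for this noise image
    confs = np.array(confs)
    return confs.mean(), np.percentile(confs, 95)

# mean_conf, p95_conf = noise_confidence_stats(model, preprocess, txt_feat)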

The calibration code is ~30 lines in ml/finetune_clip_production_v2.ipynb. The core is scipy.optimize.minimize_scalar over a bounded temperature range, evaluated on val set NLL.
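
A sketch of that core, assuming you've already collected raw (pre-softmax) val-set logits and integer labels. The bounds and function name are illustrative; the scalar objective here is val-set NLL as described above, and ECE would slot in the same way:

import torch
import torch.nn.functional as F
from scipy.optimize import minimize_scalar

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    def nll(temperature: float) -> float:
        # Rescale the logits by the candidate temperature, measure val-set NLL.
        return F.cross_entropy(val_logits / temperature, val_labels).item()

    result = minimize_scalar(nll, bounds=(0.5, 10.0), method="bounded")
    return result.x

# T = fit_temperature(val_logits, val_labels)
# calibrated_probs = (val_logits / T).softmax(dim=-1)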


Honest results and what to take away

Four decisions drove the 61.4% → 87.4% (+26 pp) gain in val accuracy: data filtering, the two-stage schedule, hard-negative batching, and temperature calibration. The training loop itself was mostly default settings.

4/6 production gates pass. Accuracy (0.874) and Macro F1 (0.867) are both below the 0.90 threshold. Two concrete next steps: expand the Georgian val set from 5 to 20+ samples (currently the smallest class by a large margin), and add harder augmentations targeting Baroque/International Style confusion pairs.

The fine-tuned embedder powers visquery.com today.

If you're fine-tuning CLIP on your own domain, the playbook is:

  1. Map your confusion clusters before training — you need this for hard-negative batching
  2. Filter your dataset with base CLIP's zero-shot classifier — 25 lines, free quality gate
  3. Align text first, visual second — protect pretrained representations
  4. Build batches around known hard pairs — force the model to learn the distinctions that matter
  5. Calibrate on your val set before shipping — confidence scores are only useful if they're reliable

None of this is architecture-specific. The domain knowledge is the variable; the framework transfers.


Live on: visquery.com · Code: github.com/shivashrestha/visquery
