DEV Community: Shiva Shrestha

Fine-tuning CLIP on a Niche Domain: How I Got +26pp Accuracy on Architectural Styles and What You Can Apply to Your Own Domain

Shiva Shrestha — Tue, 12 May 2026 18:21:40 +0000

Most fine-tuning write-ups end at "we got X% accuracy." This one walks through the four decisions before and after the training loop that actually moved the number. The training loop itself was the easy part. If you're fine-tuning a vision-language model on a niche domain, these are the decisions you'll face too.

The project: I fine-tuned OpenCLIP ViT-B/32 on 24 architectural style classes and shipped the embedder as the retrieval backbone for visquery.com, an architectural precedent search tool. Base CLIP zero-shot on my val set: 61.4%. Fine-tuned: 87.4%. That's +26 percentage points, and almost none of it came from tuning the training loop.

Each section below is a decision point with the reasoning behind it. Not just what I did, but why, and what the generalizable principle is for any domain.

1. Pick a domain where you can read the errors, not just count them

Generalizable principle: domain knowledge isn't just context, it's a forcing function for better decisions at every stage.

I'm an architect by training. When I open a confusion matrix and see my model conflating Baroque with Beaux-Arts, I know that's a fair mistake both styles share ornate facades, heavy cornices, and classical orders lifted from Rome. When it mixes up Georgian and Colonial, I can point to exactly which visual cues overlap (white symmetrical facades, pedimented entries) and which don't (window proportions, cornice detailing).

That's not just satisfying. It changes how you iterate.

Most fine-tuning posts use datasets where the author trusts the labels but can't explain the errors. You end up chasing metrics without knowing whether a mistake is a model failure or a labeling ambiguity or whether the two classes are genuinely hard to distinguish even for humans. Pick a domain you understand well enough to judge the confusions, not just measure them. If you can't do that yet, talk to a domain expert before you label anything.

For architectural styles, the hardest confusion clusters are: Gothic/Romanesque (pre-Renaissance, both stone, both vertical emphasis), Greek Revival/Colonial/Georgian (white-columned American residential and civic), and Queen Anne/Tudor/Edwardian (late 19th/early 20th British-derived residential). I knew these pairs before I wrote a single line of training code. That knowledge shaped every subsequent decision the hard-negative batching strategy in section 4 flows directly from this list.

2. Let base CLIP filter its own training data

Generalizable principle: use the pretrained model as a data quality gate before any fine-tuning begins.

Starting dataset: 7,018 training images across 24 classes, sourced from Wikimedia Commons under CC licenses. Before touching a training loop, I ran every image through base CLIP's zero-shot classifier and dropped anything where confidence on its own label fell below 0.05.

The intuition is simple: if the unmodified model sees zero signal that an image belongs to its labeled class, the label is probably wrong or the image is genuinely ambiguous. Training on it is noise. This works for any domain swap out the class names and it's a drop-in quality filter for your dataset.

Here's the filter in about 25 lines:

import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-B-32')

CONFIDENCE_THRESHOLD = 0.05

def is_clean_sample(image_path: str, label: str, class_names: list[str]) -> bool:
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    texts = tokenizer([f"a photo of {c} architecture" for c in class_names])

    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(texts)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

    idx = class_names.index(label)
    return probs[0, idx].item() >= CONFIDENCE_THRESHOLD

clean = [(p, l) for p, l in all_samples if is_clean_sample(p, l, class_names)]
print(f"Kept {len(clean)}/{len(all_samples)} after quality filter")

After filtering, I oversampled each class to 280 images with augmentation — not duplication. Every copy gets independent transforms (random crops, flips, color jitter, Gaussian blur), so there are no duplicate gradients. Minimum class size before oversampling was 122 images.

3. Two-stage training: text tower first, then visual

Generalizable principle: align the text side to your label vocabulary before touching visual weights. Protect what the pretrained model already knows.

Single-stage fine-tuning across all layers tends to overwrite the general visual representations CLIP already learned. The two-stage approach preserves them.

Stage 1 (epochs 1–5, LR 5e-6): Freeze the entire visual encoder. Train only the text tower and projection heads. The goal here isn't accuracy — it's alignment. I'm teaching the model that "Baroque architecture" in my label vocabulary corresponds to the visual features CLIP already knows how to see. No point moving those visual weights until the text side is calibrated.

Stage 2 (epochs 6–15, LR 5e-7): Unfreeze resblocks 10 and 11 only — the last two transformer blocks in the visual encoder. Drop the LR by 10x. Now the model can develop fine-grained visual discriminability: learn that Baroque and Beaux-Arts look different, not just label differently.

Training log:

Epoch	Loss	Val Acc	Stage
1	1.915	80.2%	1 (text only)
3	1.568	84.5%	1
5	1.493	87.0%	1
6	1.476	86.2%	2 (resblocks 10–11 unlocked)
10	1.447	87.2%	2
12	1.449	87.4% ✓	2

The dip at epoch 6 is normal — unlocking new layers introduces instability before the model adapts. Accuracy recovered within two epochs.

Production scorecard at checkpoint epoch 12:

Metric	Value	Target	Status
Classification accuracy	0.874	≥ 0.90	❌
Macro F1	0.867	≥ 0.90	❌
Hard-neg pass rate	0.904	≥ 0.80	✅
Semantic R@1	0.880	≥ 0.70	✅
ECE	0.056	< 0.10	✅
Noise conf p95	0.466	< 0.50	✅

4/6 gates pass. Accuracy and F1 don't hit 0.90 yet. More on that at the end.

4. Hard-negative batching from confusion clusters you already know

Generalizable principle: use your domain knowledge from step 1 to build batches that force the model to learn the hard distinctions — not just recognize styles in isolation.

With random batching, the model might see Gothic and Romanesque in the same batch once every dozen iterations. Hard-negative batching fixes this deliberately.

~50% of each mini-batch was drawn from the three hardest confusion clusters I'd identified before training: Gothic/Romanesque, Greek Revival/Colonial/Georgian, and Queen Anne/Tudor/Edwardian. The model sees these pairs side-by-side every single iteration.

The effect: instead of learning 24 styles in isolation, the model is forced to learn the differences between the ones that actually look similar. That's the actual problem in architectural image retrieval — not recognizing styles in isolation, but separating them when they share visual DNA.

Residual confusions in the final matrix: Baroque ↔ Beaux-Arts (F1: 0.769 and 0.842), International Style ↔ Bauhaus (F1: 0.739 and 0.800). Georgian lands at 0.600 F1, but that's a val-set size artifact — only 5 validation samples. Every other class sits at or above 0.739.

5. Calibrate before you ship

Generalizable principle: accuracy tells you how often the model is wrong. Calibration tells you whether it knows when it's wrong. You need both before shipping.

87.4% accuracy means roughly 1 in 8 predictions is wrong. In a search product, those wrong predictions show up in results. If the model is overconfident on the 12.6% it gets wrong, you're actively surfacing confident errors to users.

Temperature calibration on the val set using scipy.optimize.minimize_scalar to minimize ECE:

	ECE
Pre-calibration	0.0938
Post-calibration	0.0559

The model now assigns lower confidence to predictions it's more likely to get wrong — which means the search system can use confidence scores to filter or rank results more reliably.

OOD check: ran the calibrated model on pure noise images. Mean confidence: 0.355. p95: 0.466 — below the 0.50 gate. The model doesn't confidently assign random images to known architectural classes. That matters when your index contains images from the open web.

The calibration code is ~30 lines in ml/finetune_clip_production_v2.ipynb. The core is scipy.optimize.minimize_scalar over a bounded temperature range, evaluated on val set NLL.

Honest results and what to take away

Four decisions: data filtering, two-stage schedule, hard-negative batching, temperature calibration that drove a 61.4% → 87.4% (+26 pp) gain on val accuracy. The training loop itself was mostly default settings.

4/6 production gates pass. Accuracy (0.874) and Macro F1 (0.867) are both below the 0.90 threshold. Two concrete next steps: expand the Georgian val set from 5 to 20+ samples (currently the smallest class by a large margin), and add harder augmentations targeting Baroque/International Style confusion pairs.

The fine-tuned embedder powers visquery.com today.

If you're fine-tuning CLIP on your own domain, the playbook is:

Map your confusion clusters before training — you need this for hard-negative batching
Filter your dataset with base CLIP's zero-shot classifier — 25 lines, free quality gate
Align text first, visual second — protect pretrained representations
Build batches around known hard pairs — force the model to learn the distinctions that matter
Calibrate on your val set before shipping — confidence scores are only useful if they're reliable

None of this is architecture-specific. The domain knowledge is the variable; the framework transfers.

Live on: visquery.com · Code: github.com/shivashrestha/visquery

Building a RAG Evaluation Harness That Actually Catches Problems

Shiva Shrestha — Tue, 05 May 2026 13:10:37 +0000

Most "chat with your website" projects ship without any measurement. Mine did too. The live demo was up, answers looked plausible, and I moved on. Then I built a proper evaluation harness and found out exactly how wrong "looks plausible" is as a quality signal.

This post covers the eval design, the bugs it caught, the prompt changes that fixed most of them, and the two metrics that still don't pass threshold after all the fixes. The failures are the interesting part.

The System

Web Intelligence is a RAG pipeline that turns any public URL into a queryable knowledge base. You give it a URL, it crawls up to 50 pages, chunks and embeds the text with Pinecone's multilingual-e5-large, and stores vectors in a serverless Pinecone index. At query time, top-k chunks are retrieved and passed to an LLM (Gemini 2.0 Flash or a local Ollama model) with a strict context-only prompt.

Nothing exotic. The evaluation harness is the part I want to talk about.

Eval Design: The Answerable/Unanswerable Split

Before writing a single metric, the most important design decision is splitting your question bank.

All eval questions
├── Answerable    → Hit@k · MRR · Faithfulness · Hallucination · Ctx Coverage
└── Unanswerable  → Rejection Rate (did the system correctly refuse?)

This matters because they measure fundamentally different behaviours. An unanswerable question where the system correctly refuses should not contribute Hit@1 = 0 to your retrieval average. Before I introduced the split, three out-of-scope questions were dragging down the Hit@k numbers, and there was no metric at all for whether the refusals were happening. The system was getting credit for nothing and penalised for things it was doing right.

The baseline: aboutamazon.com, 5 answerable questions + 3 unanswerable questions, top_k=5. Small sample - I'll address that.

Issue 1: Hit@1 Was 60% for the Wrong Reason

Two of five questions scored Hit@1 = 0. For Q01 ("What does Amazon do?"), the top-ranked chunk by cosine similarity (0.857) was Amazon's mission statement is clearly relevant. But my ground-truth keyword was "ecommerce" and the chunk text used "e-commerce" with a hyphen.

# Original - breaks on surface-form variants
def chunk_hit(chunk_text, keywords):
    text = chunk_text.lower()
    return any(kw in text for kw in keywords)

# Fixed — normalise before comparison
def _norm_kw(s: str) -> str:
    return re.sub(r'[\s\-_]', '', s.lower())

def chunk_hit(chunk_text, keywords):
    norm_text = _norm_kw(chunk_text)
    return any(_norm_kw(kw) in norm_text for kw in keywords)

Result: Hit@1 60% → 80%.

Q03 had a harder problem alongside the normalisation bug: the top chunk genuinely addressed Amazon's mission rather than its business lines, which is what the question targeted. That's a ranking problem. The embedding is working correctly - the mission statement is semantically related to "what Amazon does" - but a cross-encoder re-ranker scoring (query, chunk) pairs jointly would promote the more task-relevant chunk. That fix is still pending.

Issue 2: Hallucination Was 41% but the Metric Was Partly Lying

Before the prompt fix, hallucination averaged 41%. After the fix, it dropped to 28%. But the story of why it was 41% is more useful than the number.

The hallucination metric is 1 - ctx_coverage, where:

ctx_coverage = |answer_tokens ∩ context_tokens| / |answer_tokens|

With NLTK stopwords removed. The problem: verbosity inflates this metric without representing actual fabrication.

With my original prompt ("Prioritise the provided context", "Under 400 words"), answers averaged 219 words. The LLM produced long, connector-heavy responses. Words like "Overall", "As a result", "combining", "leveraging" don't appear in the retrieved chunks — but they're not factual claims either. They counted as hallucinated tokens.

I separated these two failure modes:

Mode	Example	Factual Risk
LLM knowledge leakage	`"Career Choice"`, `"The Climate Pledge"` inserted from training	High
Connector expansion	`"Overall, Amazon combines…"`, `"As a result…"`	Low

The fix: a hallucination_cw metric that counts only content words ≥5 characters. Connector words ("overall", "result", "based") are under that threshold and excluded. The verbosity_score field (max(0, (words − 150) / 150)) quantifies how much of the raw metric is inflation.

Issue 3: The Prompt Was Too Soft

The original prompt:

prompt = f"""You are a website content assistant. 
Prioritise the provided context when answering.
Under 400 words.

CONTEXT:
{context}

QUESTION:
{question}"""

"Prioritise" is not a constraint. The LLM treated it as a suggestion. On Amazon-specific questions, it injected training knowledge: product names, operational statistics, initiatives that weren't in any retrieved chunk.

The fixed prompt (current rag.py):

prompt = f"""You are a website content assistant. Answer ONLY using the text in the CONTEXT section below.

Rules:
- ONLY use information explicitly present in the CONTEXT. Do not add facts, names, or details from your training knowledge.
- If the context has nothing relevant, respond exactly: "Sorry, I couldn't find this information. Please try another question."
- Be concise and specific. No filler, no elaboration beyond what the context states.
- Under 150 words. If the question genuinely requires more, cap at 200 words maximum.

CONTEXT:
{context}

QUESTION:
{question}

ANSWER (cite only what the CONTEXT states):"""

Before/after:

Metric	Before	After	Threshold
Avg words	219	97	≤ 150
Hallucination (raw)	~41%	27%	—
Hallucination (CW) ★	~41%	28%	≤ 25%
Ctx Coverage	59%	73%	≥ 65%

The Two Metrics That Still Fail

Honest reporting: two checks are still red after all the fixes.

1. Hallucination (CW) 28% vs 25% threshold

Three points off. The verbosity fix eliminated most of the signal. What remains is genuine leakage, 2 to 3 content words per answer that came from training knowledge rather than retrieved chunks. The 150-word cap reduced it but didn't eliminate it. The next step is LLM-as-judge faithfulness (RAGAS-style claim decomposition) to measure actual factual correctness rather than surface-form overlap.

2. KW Overlap 53% vs 75% threshold

This one is partly self-inflicted. Before the word-cap fix, KW overlap was 83% — answers were long enough to include all expected keywords. After the 150-word cap, shorter correct answers naturally contain fewer words, including some expected keywords that dropped out. The keyword set was calibrated for 200-word answers. Two options: tighten to 2–3 high-signal keywords per question, or weight by TF-IDF importance so that high-information terms count more.

Full Results Summary

Track	Metric	Before	After	Threshold	Status
Answerable	Hit@1	60%	80%	≥ 80%	✅
Answerable	Hit@5	100%	100%	≥ 95%	✅
Answerable	MRR@5	0.767	0.883	≥ 0.75	✅
Answerable	Hallucination (CW)	~41%	28%	≤ 25%	❌
Answerable	Ctx Coverage	59%	73%	≥ 65%	✅
Answerable	KW Overlap	83%	53%	≥ 75%	❌
Answerable	Avg Words	219	97	≤ 150	✅
Unanswerable	Rejection Rate	unmeasured	100%	≥ 90%	✅

Scope note: one site, 8 questions. These are directional signals, not a production-grade benchmark.

What I'd Do Next

Cross-encoder re-ranking - replace bi-encoder-only ranking with a ms-marco-MiniLM-L-6-v2 cross-encoder as a second-pass re-ranker. Expected Hit@1 improvement: 80% → 90%+.

LLM-as-judge faithfulness - RAGAS-style: decompose each answer into atomic claims and verify each claim against retrieved chunks. Slower and costs tokens but measures actual correctness instead of token overlap.

Answer-length calibration - run the eval at word caps of 100/125/150/175 and plot hallucination (CW) vs KW overlap. Find the Pareto-optimal cap where both pass threshold simultaneously.

Keyword set recalibration - reduce to 2–3 high-signal terms per question, or adopt TF-IDF weighting.

Code and Demo

GitHub repo: web-intelligence

Live demo: web-intelligence-red.vercel.app

The eval notebook is at backend/rag_eval_single.ipynb. Results JSON written to data/eval_single_<site>_<date>.json on each run.

If you've built RAG eval harnesses and hit similar issues, especially the verbosity/hallucination conflation, I'd like to hear how you handled it ☺️.