<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shiva Shrestha</title>
    <description>The latest articles on DEV Community by Shiva Shrestha (@shiva_shrestha_1b37675aab).</description>
    <link>https://dev.to/shiva_shrestha_1b37675aab</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3913866%2Fc21d6de0-d5b8-4219-bbe1-02f329bba992.png</url>
      <title>DEV Community: Shiva Shrestha</title>
      <link>https://dev.to/shiva_shrestha_1b37675aab</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shiva_shrestha_1b37675aab"/>
    <language>en</language>
    <item>
      <title>Fine-tuning CLIP on a Niche Domain: How I Got +26pp Accuracy on Architectural Styles and What You Can Apply to Your Own Domain</title>
      <dc:creator>Shiva Shrestha</dc:creator>
      <pubDate>Tue, 12 May 2026 18:21:40 +0000</pubDate>
      <link>https://dev.to/shiva_shrestha_1b37675aab/fine-tuning-clip-on-a-niche-domain-how-i-got-26pp-accuracy-on-architectural-styles-and-what-you-md9</link>
      <guid>https://dev.to/shiva_shrestha_1b37675aab/fine-tuning-clip-on-a-niche-domain-how-i-got-26pp-accuracy-on-architectural-styles-and-what-you-md9</guid>
      <description>&lt;p&gt;Most fine-tuning write-ups end at "we got X% accuracy." This one walks through the four decisions before and after the training loop that actually moved the number. The training loop itself was the easy part. If you're fine-tuning a vision-language model on a niche domain, these are the decisions you'll face too.&lt;/p&gt;

&lt;p&gt;The project: I fine-tuned OpenCLIP ViT-B/32 on 24 architectural style classes and shipped the embedder as the retrieval backbone for &lt;a href="https://visquery.com" rel="noopener noreferrer"&gt;visquery.com&lt;/a&gt;, an architectural precedent search tool. Base CLIP zero-shot on my val set: &lt;strong&gt;61.4%&lt;/strong&gt;. Fine-tuned: &lt;strong&gt;87.4%&lt;/strong&gt;. That's +26 percentage points, and almost none of it came from tuning the training loop.&lt;/p&gt;

&lt;p&gt;Each section below is a decision point with the reasoning behind it. Not just what I did, but why, and what the generalizable principle is for any domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Pick a domain where you can read the errors, not just count them
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Generalizable principle: domain knowledge isn't just context, it's a forcing function for better decisions at every stage.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm an architect by training. When I open a confusion matrix and see my model conflating Baroque with Beaux-Arts, I know that's a fair mistake: both styles share ornate facades, heavy cornices, and classical orders lifted from Rome. When it mixes up Georgian and Colonial, I can point to exactly which visual cues overlap (white symmetrical facades, pedimented entries) and which don't (window proportions, cornice detailing).&lt;/p&gt;

&lt;p&gt;That's not just satisfying. It changes how you iterate.&lt;/p&gt;

&lt;p&gt;Most fine-tuning posts use datasets where the author trusts the labels but can't explain the errors. You end up chasing metrics without knowing whether a mistake is a model failure, a labeling ambiguity, or a sign that the two classes are genuinely hard to distinguish even for humans. Pick a domain you understand well enough to judge the confusions, not just measure them. If you can't do that yet, talk to a domain expert before you label anything.&lt;/p&gt;

&lt;p&gt;For architectural styles, the hardest confusion clusters are: Gothic/Romanesque (pre-Renaissance, both stone, both vertical emphasis), Greek Revival/Colonial/Georgian (white-columned American residential and civic), and Queen Anne/Tudor/Edwardian (late 19th/early 20th British-derived residential). I knew these pairs before I wrote a single line of training code. That knowledge shaped every subsequent decision: the hard-negative batching strategy in section 4 flows directly from this list.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Let base CLIP filter its own training data
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Generalizable principle: use the pretrained model as a data quality gate before any fine-tuning begins.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Starting dataset: 7,018 training images across 24 classes, sourced from Wikimedia Commons under CC licenses. Before touching a training loop, I ran every image through base CLIP's zero-shot classifier and dropped anything where confidence on its own label fell below 0.05.&lt;/p&gt;

&lt;p&gt;The intuition is simple: if the unmodified model sees zero signal that an image belongs to its labeled class, the label is probably wrong or the image is genuinely ambiguous. Training on it is noise. This works for any domain: swap out the class names and it's a drop-in quality filter for your dataset.&lt;/p&gt;

&lt;p&gt;Here's the filter in about 25 lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;open_clip&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;preprocess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;open_clip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_model_and_transforms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ViT-B-32&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pretrained&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;open_clip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ViT-B-32&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;CONFIDENCE_THRESHOLD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_clean_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_names&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;unsqueeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; architecture&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;class_names&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;img_feat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;txt_feat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;img_feat&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;img_feat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;txt_feat&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;txt_feat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;100.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;img_feat&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;txt_feat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;CONFIDENCE_THRESHOLD&lt;/span&gt;

&lt;span class="n"&gt;clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_samples&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_clean_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_names&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Kept &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; after quality filter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After filtering, I oversampled each class to 280 images with augmentation — not duplication. Every copy gets independent transforms (random crops, flips, color jitter, Gaussian blur), so there are no duplicate gradients. Minimum class size before oversampling was 122 images.&lt;/p&gt;
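&lt;p&gt;A minimal sketch of that oversampling step, assuming samples are &lt;code&gt;(path, label)&lt;/code&gt; pairs and that the random transforms run later, at load time, so a repeated path still yields a different tensor. The helper name &lt;code&gt;oversample_to_target&lt;/code&gt; and the seed are illustrative and not taken from the project code; classes already above the target are left as they are.&lt;/p&gt;

```python
import random
from collections import defaultdict

TARGET_PER_CLASS = 280  # per-class size quoted in the text


def oversample_to_target(samples, target=TARGET_PER_CLASS, seed=0):
    """Return a list in which every class has at least `target` entries.

    Paths are repeated, not pixels: random transforms applied at load time
    make each repeat a distinct training example.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)

    balanced = []
    for label, paths in by_class.items():
        shortfall = max(0, target - len(paths))
        # Keep every original image once, then draw the shortfall with replacement.
        chosen = paths + [rng.choice(paths) for _ in range(shortfall)]
        balanced.extend((p, label) for p in chosen)
    rng.shuffle(balanced)
    return balanced
```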




&lt;h2&gt;
  
  
  3. Two-stage training: text tower first, then visual
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Generalizable principle: align the text side to your label vocabulary before touching visual weights. Protect what the pretrained model already knows.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Single-stage fine-tuning across all layers tends to overwrite the general visual representations CLIP already learned. The two-stage approach preserves them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1 (epochs 1–5, LR 5e-6):&lt;/strong&gt; Freeze the entire visual encoder. Train only the text tower and projection heads. The goal here isn't accuracy — it's alignment. I'm teaching the model that "Baroque architecture" in my label vocabulary corresponds to the visual features CLIP already knows how to see. No point moving those visual weights until the text side is calibrated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2 (epochs 6–15, LR 5e-7):&lt;/strong&gt; Unfreeze resblocks 10 and 11 only — the last two transformer blocks in the visual encoder. Drop the LR by 10x. Now the model can develop fine-grained visual discriminability: learn that Baroque and Beaux-Arts look different, not just label differently.&lt;/p&gt;
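&lt;p&gt;The freeze schedule can be sketched as a rule over parameter names. This assumes the model exposes its visual tower under a &lt;code&gt;visual.&lt;/code&gt; prefix with blocks named &lt;code&gt;visual.transformer.resblocks.N.&lt;/code&gt;; print &lt;code&gt;model.named_parameters()&lt;/code&gt; to confirm on your build. Treating &lt;code&gt;visual.proj&lt;/code&gt; as one of the trainable projection heads is my reading of the description, not something confirmed by the notebook, and the function names are hypothetical.&lt;/p&gt;

```python
import re

UNFROZEN_VISUAL_BLOCKS = {10, 11}  # last two of ViT-B/32's 12 blocks
_RESBLOCK = re.compile(r"^visual\.transformer\.resblocks\.(\d+)\.")


def is_trainable(param_name, stage):
    """Decide whether a parameter receives gradients in stage 1 or stage 2."""
    if not param_name.startswith("visual."):
        return True  # text tower, text projection, logit scale
    if param_name == "visual.proj":
        return True  # assumption: the visual projection counts as a projection head
    if stage == 1:
        return False  # the rest of the visual encoder stays frozen
    match = _RESBLOCK.match(param_name)
    return match is not None and int(match.group(1)) in UNFROZEN_VISUAL_BLOCKS


def apply_stage(model, stage):
    """Set requires_grad on a torch module according to the stage rule."""
    for name, param in model.named_parameters():
        param.requires_grad = is_trainable(name, stage)
```

&lt;p&gt;Calling &lt;code&gt;apply_stage(model, 1)&lt;/code&gt; before epoch 1 and &lt;code&gt;apply_stage(model, 2)&lt;/code&gt; before epoch 6 would follow the schedule above; the optimizer would also need to be rebuilt with the lower learning rate at the switch.&lt;/p&gt;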

&lt;p&gt;Training log:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Epoch&lt;/th&gt;
&lt;th&gt;Loss&lt;/th&gt;
&lt;th&gt;Val Acc&lt;/th&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1.915&lt;/td&gt;
&lt;td&gt;80.2%&lt;/td&gt;
&lt;td&gt;1 (text only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1.568&lt;/td&gt;
&lt;td&gt;84.5%&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;1.493&lt;/td&gt;
&lt;td&gt;87.0%&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;1.476&lt;/td&gt;
&lt;td&gt;86.2%&lt;/td&gt;
&lt;td&gt;2 (resblocks 10–11 unlocked)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1.447&lt;/td&gt;
&lt;td&gt;87.2%&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;12&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.449&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;87.4%&lt;/strong&gt; ✓&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The dip at epoch 6 is normal — unlocking new layers introduces instability before the model adapts. Accuracy recovered within two epochs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8xa6n283v1q9emm652n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8xa6n283v1q9emm652n.png" alt="Production scorecard showing 4/6 gates pass at epoch 12" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Production scorecard at checkpoint epoch 12:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classification accuracy&lt;/td&gt;
&lt;td&gt;0.874&lt;/td&gt;
&lt;td&gt;≥ 0.90&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Macro F1&lt;/td&gt;
&lt;td&gt;0.867&lt;/td&gt;
&lt;td&gt;≥ 0.90&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hard-neg pass rate&lt;/td&gt;
&lt;td&gt;0.904&lt;/td&gt;
&lt;td&gt;≥ 0.80&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic R@1&lt;/td&gt;
&lt;td&gt;0.880&lt;/td&gt;
&lt;td&gt;≥ 0.70&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECE&lt;/td&gt;
&lt;td&gt;0.056&lt;/td&gt;
&lt;td&gt;&amp;lt; 0.10&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Noise conf p95&lt;/td&gt;
&lt;td&gt;0.466&lt;/td&gt;
&lt;td&gt;&amp;lt; 0.50&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;4/6 gates pass. Accuracy and F1 don't hit 0.90 yet. More on that at the end.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Hard-negative batching from confusion clusters you already know
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Generalizable principle: use your domain knowledge from step 1 to build batches that force the model to learn the hard distinctions — not just recognize styles in isolation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With random batching, the model might see Gothic and Romanesque in the same batch once every dozen iterations. Hard-negative batching fixes this deliberately.&lt;/p&gt;

&lt;p&gt;~50% of each mini-batch was drawn from the three hardest confusion clusters I'd identified before training: Gothic/Romanesque, Greek Revival/Colonial/Georgian, and Queen Anne/Tudor/Edwardian. The model sees these pairs side-by-side every single iteration.&lt;/p&gt;

&lt;p&gt;The effect: instead of learning 24 styles in isolation, the model is forced to learn the &lt;em&gt;differences between the ones that actually look similar&lt;/em&gt;. That's the actual problem in architectural image retrieval — not recognizing styles in isolation, but separating them when they share visual DNA.&lt;/p&gt;
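&lt;p&gt;Here is one way such a batch builder could look. It is a sketch under assumptions: &lt;code&gt;by_class&lt;/code&gt; is a hypothetical dict from label to image paths, the batch size of 32 is a placeholder, and cycling through the cluster classes is just one simple way to guarantee that sibling styles share a batch.&lt;/p&gt;

```python
import random

CONFUSION_CLUSTERS = [
    ["Gothic", "Romanesque"],
    ["Greek Revival", "Colonial", "Georgian"],
    ["Queen Anne", "Tudor", "Edwardian"],
]


def make_batch(by_class, batch_size=32, hard_fraction=0.5, rng=random):
    """Build one batch of (path, label) pairs.

    About `hard_fraction` of the batch cycles through the classes in the
    confusion clusters, so every hard pair appears together; the remainder
    is sampled uniformly over all classes.
    """
    hard_labels = [label for cluster in CONFUSION_CLUSTERS for label in cluster]
    n_hard = int(batch_size * hard_fraction)

    batch = []
    for i in range(n_hard):
        label = hard_labels[i % len(hard_labels)]
        batch.append((rng.choice(by_class[label]), label))

    all_labels = list(by_class)
    for _ in range(batch_size - n_hard):
        label = rng.choice(all_labels)
        batch.append((rng.choice(by_class[label]), label))

    rng.shuffle(batch)
    return batch
```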

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib9x9see8174gytz6gej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib9x9see8174gytz6gej.png" alt="Final confusion matrix: residual errors concentrated in Baroque/Beaux-Arts and International/Bauhaus" width="800" height="699"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhsuvaq48sdra7f26cre.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhsuvaq48sdra7f26cre.png" alt="Per-class F1 scores and t-SNE projection of the embedding space" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Residual confusions in the final matrix: Baroque ↔ Beaux-Arts (F1: 0.769 and 0.842), International Style ↔ Bauhaus (F1: 0.739 and 0.800). Georgian lands at 0.600 F1, but that's a val-set size artifact — only 5 validation samples. Every other class sits at or above 0.739.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Calibrate before you ship
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Generalizable principle: accuracy tells you how often the model is wrong. Calibration tells you whether it knows when it's wrong. You need both before shipping.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;87.4% accuracy means roughly 1 in 8 predictions is wrong. In a search product, those wrong predictions show up in results. If the model is overconfident on the 12.6% it gets wrong, you're actively surfacing confident errors to users.&lt;/p&gt;

&lt;p&gt;Temperature calibration on the val set, fitting a single temperature with &lt;code&gt;scipy.optimize.minimize_scalar&lt;/code&gt; and measuring ECE before and after:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;ECE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pre-calibration&lt;/td&gt;
&lt;td&gt;0.0938&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Post-calibration&lt;/td&gt;
&lt;td&gt;0.0559&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The model now assigns lower confidence to predictions it's more likely to get wrong — which means the search system can use confidence scores to filter or rank results more reliably.&lt;/p&gt;
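&lt;p&gt;For reference, ECE is the bin-size-weighted average gap between confidence and accuracy. The sketch below uses 10 equal-width bins, which is a common default and an assumption here; the post does not state the binning used for the numbers above.&lt;/p&gt;

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean gap between confidence and accuracy.

    `confidences` are top-1 probabilities in [0, 1]; `correct` are booleans
    saying whether each top-1 prediction matched the label.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```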

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yf7k4eecqc6nw48daaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yf7k4eecqc6nw48daaz.png" alt="Reliability diagram before and after calibration, plus OOD confidence distribution" width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OOD check: ran the calibrated model on pure noise images. Mean confidence: 0.355. p95: 0.466 — below the 0.50 gate. The model doesn't confidently assign random images to known architectural classes. That matters when your index contains images from the open web.&lt;/p&gt;

&lt;p&gt;The calibration code is ~30 lines in &lt;code&gt;ml/finetune_clip_production_v2.ipynb&lt;/code&gt;. The core is &lt;code&gt;scipy.optimize.minimize_scalar&lt;/code&gt; over a bounded temperature range, evaluated on val set NLL.&lt;/p&gt;
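&lt;p&gt;A minimal sketch of that core, not the notebook's code: it assumes &lt;code&gt;logits&lt;/code&gt; is an &lt;code&gt;(N, C)&lt;/code&gt; NumPy array of val-set scores and &lt;code&gt;labels&lt;/code&gt; an integer array of length &lt;code&gt;N&lt;/code&gt;. The function names and the &lt;code&gt;(0.05, 10.0)&lt;/code&gt; search range are placeholders.&lt;/p&gt;

```python
import numpy as np
from scipy.optimize import minimize_scalar


def nll_at_temperature(temperature, logits, labels):
    """Mean negative log-likelihood of the labels after dividing logits by T."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def fit_temperature(logits, labels, bounds=(0.05, 10.0)):
    """Find the single temperature that minimises val-set NLL."""
    result = minimize_scalar(
        nll_at_temperature,
        bounds=bounds,
        args=(logits, labels),
        method="bounded",
    )
    return float(result.x)
```

&lt;p&gt;At inference, the fitted value would divide the logits before the softmax. A temperature above 1 softens overconfident predictions without changing which class ranks first.&lt;/p&gt;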




&lt;h2&gt;
  
  
  Honest results and what to take away
&lt;/h2&gt;

&lt;p&gt;Four decisions (data filtering, a two-stage schedule, hard-negative batching, and temperature calibration) drove a 61.4% → 87.4% (+26 pp) gain on val accuracy. The training loop itself was mostly default settings.&lt;/p&gt;

&lt;p&gt;4/6 production gates pass. Accuracy (0.874) and Macro F1 (0.867) are both below the 0.90 threshold. Two concrete next steps: expand the Georgian val set from 5 to 20+ samples (currently the smallest class by a large margin), and add harder augmentations targeting the Baroque/Beaux-Arts and International Style/Bauhaus confusion pairs.&lt;/p&gt;

&lt;p&gt;The fine-tuned embedder powers &lt;a href="https://visquery.com" rel="noopener noreferrer"&gt;visquery.com&lt;/a&gt; today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you're fine-tuning CLIP on your own domain, the playbook is:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Map your confusion clusters &lt;em&gt;before&lt;/em&gt; training — you need this for hard-negative batching&lt;/li&gt;
&lt;li&gt;Filter your dataset with base CLIP's zero-shot classifier — 25 lines, free quality gate&lt;/li&gt;
&lt;li&gt;Align text first, visual second — protect pretrained representations&lt;/li&gt;
&lt;li&gt;Build batches around known hard pairs — force the model to learn the distinctions that matter&lt;/li&gt;
&lt;li&gt;Calibrate on your val set before shipping — confidence scores are only useful if they're reliable&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of this is architecture-specific. The domain knowledge is the variable; the framework transfers.&lt;/p&gt;




&lt;p&gt;Live on: &lt;a href="https://visquery.com" rel="noopener noreferrer"&gt;visquery.com&lt;/a&gt; · Code: &lt;a href="https://github.com/shivashrestha/visquery" rel="noopener noreferrer"&gt;github.com/shivashrestha/visquery&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Building a RAG Evaluation Harness That Actually Catches Problems</title>
      <dc:creator>Shiva Shrestha</dc:creator>
      <pubDate>Tue, 05 May 2026 13:10:37 +0000</pubDate>
      <link>https://dev.to/shiva_shrestha_1b37675aab/building-a-rag-evaluation-harness-that-actually-catches-problems-198i</link>
      <guid>https://dev.to/shiva_shrestha_1b37675aab/building-a-rag-evaluation-harness-that-actually-catches-problems-198i</guid>
      <description>&lt;p&gt;Most "chat with your website" projects ship without any measurement. Mine did too. The live demo was up, answers looked plausible, and I moved on. Then I built a proper evaluation harness and found out exactly how wrong "looks plausible" is as a quality signal.&lt;/p&gt;

&lt;p&gt;This post covers the eval design, the bugs it caught, the prompt changes that fixed most of them, and the two metrics that still don't pass threshold after all the fixes. The failures are the interesting part.&lt;/p&gt;




&lt;h2&gt;
  
  
  The System
&lt;/h2&gt;

&lt;p&gt;Web Intelligence is a RAG pipeline that turns any public URL into a queryable knowledge base. You give it a URL, it crawls up to 50 pages, chunks and embeds the text with Pinecone's &lt;code&gt;multilingual-e5-large&lt;/code&gt;, and stores vectors in a serverless Pinecone index. At query time, top-k chunks are retrieved and passed to an LLM (Gemini 2.0 Flash or a local Ollama model) with a strict context-only prompt.&lt;/p&gt;

&lt;p&gt;Nothing exotic. The evaluation harness is the part I want to talk about.&lt;/p&gt;




&lt;h2&gt;
  
  
  Eval Design: The Answerable/Unanswerable Split
&lt;/h2&gt;

&lt;p&gt;Before writing a single metric, the most important design decision is splitting your question bank.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;All eval questions
├── Answerable    → Hit@k · MRR · Faithfulness · Hallucination · Ctx Coverage
└── Unanswerable  → Rejection Rate (did the system correctly refuse?)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters because they measure fundamentally different behaviours. An unanswerable question where the system correctly refuses should not contribute &lt;code&gt;Hit@1 = 0&lt;/code&gt; to your retrieval average. Before I introduced the split, three out-of-scope questions were dragging down the Hit@k numbers, and there was no metric at all for whether the refusals were happening. The system was getting credit for nothing and penalised for things it was doing right.&lt;/p&gt;

&lt;p&gt;The baseline: &lt;code&gt;aboutamazon.com&lt;/code&gt;, 5 answerable questions + 3 unanswerable questions, &lt;code&gt;top_k=5&lt;/code&gt;. Small sample - I'll address that.&lt;/p&gt;
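&lt;p&gt;The split can be sketched as a scorer that routes each question to the metrics that apply to it. The record keys (&lt;code&gt;answerable&lt;/code&gt;, &lt;code&gt;first_hit_rank&lt;/code&gt;, &lt;code&gt;refused&lt;/code&gt;) are hypothetical names for illustration, and the sketch assumes both groups are non-empty.&lt;/p&gt;

```python
def hit_at_k(ranks, k):
    """Share of questions whose first relevant chunk ranks within the top k."""
    return sum(1 for r in ranks if r is not None and k >= r) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Mean of 1/rank; a question with no relevant chunk contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def evaluate(results):
    """Score answerable and unanswerable questions separately.

    Each result is a dict with 'answerable' (bool), 'first_hit_rank'
    (1-based int, or None if no retrieved chunk matched) and 'refused' (bool).
    """
    answerable = [r for r in results if r["answerable"]]
    unanswerable = [r for r in results if not r["answerable"]]
    ranks = [r["first_hit_rank"] for r in answerable]
    return {
        # Retrieval metrics only see questions that have an answer.
        "hit@1": hit_at_k(ranks, 1),
        "hit@5": hit_at_k(ranks, 5),
        "mrr": mean_reciprocal_rank(ranks),
        # Refusals are only scored where refusing is the right behaviour.
        "rejection_rate": sum(1 for r in unanswerable if r["refused"]) / len(unanswerable),
    }
```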




&lt;h2&gt;
  
  
  Issue 1: Hit@1 Was 60% for the Wrong Reason
&lt;/h2&gt;

&lt;p&gt;Two of five questions scored Hit@1 = 0. For Q01 ("What does Amazon do?"), the top-ranked chunk by cosine similarity (0.857) was Amazon's mission statement, which is clearly relevant. But my ground-truth keyword was &lt;code&gt;"ecommerce"&lt;/code&gt; and the chunk text used &lt;code&gt;"e-commerce"&lt;/code&gt; with a hyphen.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="c1"&gt;# Original - breaks on surface-form variants
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk_hit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Fixed — normalise before comparison
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_norm_kw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[\s\-_]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk_hit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;norm_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_norm_kw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_norm_kw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;norm_text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: Hit@1 60% → &lt;strong&gt;80%&lt;/strong&gt;.&lt;/p&gt;
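
&lt;p&gt;For readers who want to see how a per-chunk hit turns into Hit@k and MRR, here is a minimal sketch. It restates the fixed matcher so the block runs on its own; &lt;code&gt;first_hit_rank&lt;/code&gt;, &lt;code&gt;hit_at_k&lt;/code&gt;, and &lt;code&gt;mrr&lt;/code&gt; are hypothetical helper names, and the example chunks and ranks are made up for illustration, not taken from the eval run.&lt;/p&gt;

```python
import re

def _norm_kw(s):
    # Lowercase and drop whitespace, hyphens, and underscores (same idea as the fixed matcher).
    return re.sub(r"[\s\-_]", "", s.lower())

def chunk_hit(chunk_text, keywords):
    norm_text = _norm_kw(chunk_text)
    return any(_norm_kw(kw) in norm_text for kw in keywords)

def first_hit_rank(ranked_chunks, keywords, k=5):
    # 1-based rank of the first matching chunk in the top k, or None on a miss.
    for rank, chunk in enumerate(ranked_chunks[:k], start=1):
        if chunk_hit(chunk, keywords):
            return rank
    return None

def hit_at_k(ranks, k):
    # Fraction of questions whose first hit lands at rank 1..k.
    return sum(1 for r in ranks if r in range(1, k + 1)) / len(ranks)

def mrr(ranks):
    # Mean reciprocal rank; a miss (None) contributes 0.
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Made-up example: the hyphenated keyword only matches after normalisation.
example_chunks = [
    "Our mission is to be customer-centric.",
    "Amazon Web Services offers cloud computing.",
]
example_rank = first_hit_rank(example_chunks, ["cloud-computing"])

# Made-up ranks for five questions.
example_ranks = [1, 1, 2, 1, 3]
print(example_rank, hit_at_k(example_ranks, 1), hit_at_k(example_ranks, 5), mrr(example_ranks))
```

&lt;p&gt;A miss is stored as &lt;code&gt;None&lt;/code&gt; so that it lowers both metrics without needing a sentinel rank.&lt;/p&gt;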

&lt;p&gt;Q03 had a harder problem alongside the normalisation bug: the top chunk genuinely addressed Amazon's mission rather than its business lines, which is what the question targeted. That's a ranking problem. The embedding is working correctly - the mission statement is semantically related to "what Amazon does" - but a cross-encoder re-ranker scoring (query, chunk) pairs jointly would promote the more task-relevant chunk. That fix is still pending.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fle34d5bnbih389zwiu9h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fle34d5bnbih389zwiu9h.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Issue 2: Hallucination Was 41% but the Metric Was Partly Lying
&lt;/h2&gt;

&lt;p&gt;Before the prompt fix, hallucination averaged 41%. After the fix, it dropped to 28%. But the story of &lt;em&gt;why&lt;/em&gt; it was 41% is more useful than the number.&lt;/p&gt;

&lt;p&gt;The hallucination metric is &lt;code&gt;1 - ctx_coverage&lt;/code&gt;, where:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ctx_coverage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;answer_tokens&lt;/span&gt; &lt;span class="err"&gt;∩&lt;/span&gt; &lt;span class="n"&gt;context_tokens&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;answer_tokens&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tokens are compared after NLTK stopwords are removed. The problem: &lt;strong&gt;verbosity inflates this metric without representing actual fabrication.&lt;/strong&gt;&lt;/p&gt;
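
&lt;p&gt;The formula above can be sketched in a few lines. This is an illustration under stated assumptions, not the code from the repo: the small &lt;code&gt;STOPWORDS&lt;/code&gt; set stands in for NLTK's English list so the block has no dependencies, the regex tokenizer is a guess, and tokens are treated as sets of unique words.&lt;/p&gt;

```python
import re

# Assumption: a tiny stand-in for NLTK's English stopword list.
STOPWORDS = {"a", "an", "and", "the", "is", "are", "of", "in", "to", "it", "that", "as"}

def content_tokens(text):
    # Unique lowercased alphanumeric tokens with stopwords removed.
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def ctx_coverage(answer, context):
    answer_tokens = content_tokens(answer)
    if not answer_tokens:
        # Choice made for this sketch: an empty answer has nothing unsupported in it.
        return 1.0
    context_tokens = content_tokens(context)
    return len(answer_tokens.intersection(context_tokens)) / len(answer_tokens)

def hallucination(answer, context):
    return 1.0 - ctx_coverage(answer, context)

# Made-up example: "overall" is a connector, not a claim, yet it counts against the answer.
context = "Amazon sells books and cloud services."
answer = "Overall, Amazon sells cloud services."
print(ctx_coverage(answer, context), round(hallucination(answer, context), 2))
```

&lt;p&gt;In this toy case every factual token is supported by the context, and the only unsupported token is the connector.&lt;/p&gt;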

&lt;p&gt;With my original prompt (&lt;code&gt;"Prioritise the provided context"&lt;/code&gt;, &lt;code&gt;"Under 400 words"&lt;/code&gt;), answers averaged 219 words. The LLM produced long, connector-heavy responses. Words and phrases like &lt;code&gt;"Overall"&lt;/code&gt;, &lt;code&gt;"As a result"&lt;/code&gt;, &lt;code&gt;"combining"&lt;/code&gt;, &lt;code&gt;"leveraging"&lt;/code&gt; don't appear in the retrieved chunks, but they're not factual claims either. They still counted as hallucinated tokens.&lt;/p&gt;

&lt;p&gt;I separated these two failure modes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Factual Risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM knowledge leakage&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;"Career Choice"&lt;/code&gt;, &lt;code&gt;"The Climate Pledge"&lt;/code&gt; inserted from training&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Connector expansion&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;"Overall, Amazon combines…"&lt;/code&gt;, &lt;code&gt;"As a result…"&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

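&lt;p&gt;The fix: a &lt;code&gt;hallucination_cw&lt;/code&gt; metric that counts only content words of 5 or more characters, so short function words fall under the threshold and are excluded. Note that connectors such as &lt;code&gt;"overall"&lt;/code&gt;, &lt;code&gt;"result"&lt;/code&gt;, and &lt;code&gt;"based"&lt;/code&gt; are themselves 5 or more characters long, so a length filter alone does not drop them; they have to be excluded by an explicit list. The &lt;code&gt;verbosity_score&lt;/code&gt; field (&lt;code&gt;max(0, (words − 150) / 150)&lt;/code&gt;) measures how far an answer runs past 150 words, as a proxy for how much of the raw metric is inflation. For the original 219-word average it comes to 0.46.&lt;/p&gt;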
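
&lt;p&gt;A minimal sketch of the two fields, under assumptions of my own: the &lt;code&gt;CONNECTORS&lt;/code&gt; list is hypothetical (the article does not show how connectors are filtered), words are unique lowercase letter runs, and the example strings are made up. Only the &lt;code&gt;verbosity_score&lt;/code&gt; formula and the 219 and 97 word averages come from the text.&lt;/p&gt;

```python
import re

# Hypothetical connector list; several of these are 5+ letters, so length alone keeps them.
CONNECTORS = {"overall", "result", "based", "combining", "leveraging"}

def content_words(text, min_len=5):
    # Unique lowercase letter runs of at least min_len characters, minus connectors.
    pattern = r"[a-z]{%d,}" % min_len
    return {w for w in re.findall(pattern, text.lower()) if w not in CONNECTORS}

def hallucination_cw(answer, context):
    answer_words = content_words(answer)
    if not answer_words:
        return 0.0
    unsupported = answer_words.difference(content_words(context))
    return len(unsupported) / len(answer_words)

def verbosity_score(word_count, target=150):
    # max(0, (words - 150) / 150): zero at or below the target, growing linearly above it.
    return max(0.0, (word_count - target) / target)

# Made-up example: the connector is ignored, the leaked programme name is not.
context = "Amazon sells books and cloud services."
answer = "Overall, Amazon sells cloud services. Career Choice funds tuition."
print(hallucination_cw(answer, context))

# Average answer lengths reported in the article, before and after the prompt fix.
print(round(verbosity_score(219), 2), verbosity_score(97))
```

&lt;p&gt;Keeping the two fields separate lets a long but faithful answer show up as verbose rather than as hallucinated.&lt;/p&gt;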




&lt;h2&gt;
  
  
  Issue 3: The Prompt Was Too Soft
&lt;/h2&gt;

&lt;p&gt;The original prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a website content assistant. 
Prioritise the provided context when answering.
Under 400 words.

CONTEXT:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

QUESTION:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;"Prioritise"&lt;/code&gt; is not a constraint. The LLM treated it as a suggestion. On Amazon-specific questions, it injected training knowledge: product names, operational statistics, initiatives that weren't in any retrieved chunk.&lt;/p&gt;

&lt;p&gt;The fixed prompt (current &lt;code&gt;rag.py&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a website content assistant. Answer ONLY using the text in the CONTEXT section below.

Rules:
- ONLY use information explicitly present in the CONTEXT. Do not add facts, names, or details from your training knowledge.
- If the context has nothing relevant, respond exactly: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sorry, I couldn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t find this information. Please try another question.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
- Be concise and specific. No filler, no elaboration beyond what the context states.
- Under 150 words. If the question genuinely requires more, cap at 200 words maximum.

CONTEXT:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

QUESTION:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

ANSWER (cite only what the CONTEXT states):&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
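
&lt;p&gt;Because the prompt fixes the refusal text word for word, a rejection check for unanswerable questions can be a plain string comparison. The sketch below assumes that is how rejections are detected; &lt;code&gt;is_rejection&lt;/code&gt; and &lt;code&gt;rejection_rate&lt;/code&gt; are hypothetical names, and the sample answers are made up.&lt;/p&gt;

```python
# The refusal sentence, copied from the prompt rules.
REFUSAL = "Sorry, I couldn't find this information. Please try another question."

def is_rejection(answer):
    # Tolerate surrounding whitespace and wrapping double quotes around the refusal.
    return answer.strip().strip('"').strip() == REFUSAL

def rejection_rate(answers):
    # Share of answers that are exactly the refusal sentence.
    if not answers:
        return 0.0
    return sum(1 for a in answers if is_rejection(a)) / len(answers)

# Made-up answers to three unanswerable questions.
sample_answers = [
    REFUSAL,
    '"' + REFUSAL + '" ',
    "Amazon was founded in a garage.",
]
print(rejection_rate(sample_answers))
```

&lt;p&gt;An exact match is strict on purpose: a paraphrased refusal would count as a miss, which flags prompt drift early.&lt;/p&gt;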



&lt;p&gt;Before/after:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Avg words&lt;/td&gt;
&lt;td&gt;219&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;97&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≤ 150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucination (raw)&lt;/td&gt;
&lt;td&gt;~41%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;27%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucination (CW) ★&lt;/td&gt;
&lt;td&gt;~41%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≤ 25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ctx Coverage&lt;/td&gt;
&lt;td&gt;59%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≥ 65%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Two Metrics That Still Fail
&lt;/h2&gt;

&lt;p&gt;Honest reporting: two checks are still red after all the fixes.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Hallucination (CW) 28% vs 25% threshold
&lt;/h3&gt;

&lt;p&gt;Three points off. The verbosity fix removed most of the inflation. What remains is genuine leakage: 2 to 3 content words per answer that came from training knowledge rather than retrieved chunks. The 150-word cap reduced it but didn't eliminate it. The next step is LLM-as-judge faithfulness (RAGAS-style claim decomposition) to measure actual factual correctness rather than surface-form overlap.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. KW Overlap 53% vs 75% threshold
&lt;/h3&gt;

&lt;p&gt;This one is partly self-inflicted. Before the word-cap fix, KW overlap was 83% because answers were long enough to include most of the expected keywords. After the 150-word cap, shorter correct answers naturally contain fewer words, and some expected keywords dropped out. The keyword set was calibrated for 200-word answers. Two options: tighten to 2–3 high-signal keywords per question, or weight by TF-IDF importance so that high-information terms count more.&lt;/p&gt;
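
&lt;p&gt;The weighting option could look roughly like this. It is a simplified sketch that uses only the IDF half of TF-IDF, computed over the site's chunks, so that a keyword found in every chunk counts less than a rare one. The function names, the smoothing formula, and the sample chunks are all assumptions for illustration.&lt;/p&gt;

```python
import math
import re

def _norm(s):
    # Same normalisation idea as the retrieval matcher: lowercase, drop spaces, hyphens, underscores.
    return re.sub(r"[\s\-_]", "", s.lower())

def idf_weights(keywords, chunks):
    # Smoothed inverse document frequency: rarer keywords get larger weights.
    n = len(chunks)
    norm_chunks = [_norm(c) for c in chunks]
    weights = {}
    for kw in keywords:
        df = sum(1 for c in norm_chunks if _norm(kw) in c)
        weights[kw] = math.log((n + 1) / (df + 1)) + 1.0
    return weights

def weighted_kw_overlap(answer, keywords, weights):
    # Weight of matched keywords divided by the weight of all expected keywords.
    total = sum(weights[kw] for kw in keywords)
    if total == 0:
        return 0.0
    norm_answer = _norm(answer)
    matched = sum(weights[kw] for kw in keywords if _norm(kw) in norm_answer)
    return matched / total

# Made-up chunks and keywords.
chunks = [
    "Amazon sells books.",
    "Amazon offers cloud computing.",
    "Amazon runs logistics.",
]
keywords = ["Amazon", "cloud computing"]
weights = idf_weights(keywords, chunks)
score = weighted_kw_overlap("AWS provides cloud-computing services.", keywords, weights)
print(round(score, 3))
```

&lt;p&gt;Here the short answer misses the ubiquitous keyword but hits the rare one, so it scores above the 50% an unweighted count would give.&lt;/p&gt;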




&lt;h2&gt;
  
  
  Full Results Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Track&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Answerable&lt;/td&gt;
&lt;td&gt;Hit@1&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≥ 80%&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answerable&lt;/td&gt;
&lt;td&gt;Hit@5&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≥ 95%&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answerable&lt;/td&gt;
&lt;td&gt;MRR@5&lt;/td&gt;
&lt;td&gt;0.767&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.883&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≥ 0.75&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answerable&lt;/td&gt;
&lt;td&gt;Hallucination (CW)&lt;/td&gt;
&lt;td&gt;~41%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≤ 25%&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answerable&lt;/td&gt;
&lt;td&gt;Ctx Coverage&lt;/td&gt;
&lt;td&gt;59%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≥ 65%&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answerable&lt;/td&gt;
&lt;td&gt;KW Overlap&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≥ 75%&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answerable&lt;/td&gt;
&lt;td&gt;Avg Words&lt;/td&gt;
&lt;td&gt;219&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;97&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≤ 150&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unanswerable&lt;/td&gt;
&lt;td&gt;Rejection Rate&lt;/td&gt;
&lt;td&gt;unmeasured&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≥ 90%&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Scope note: one site, 8 questions. These are directional signals, not a production-grade benchmark.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cross-encoder re-ranking&lt;/strong&gt; - replace bi-encoder-only ranking with a &lt;code&gt;ms-marco-MiniLM-L-6-v2&lt;/code&gt; cross-encoder as a second-pass re-ranker. Expected Hit@1 improvement: 80% → 90%+.&lt;/p&gt;
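
&lt;p&gt;A rough sketch of what the second pass might look like. To keep it runnable without a model download, the scorer is passed in as a function and the example uses a toy word-overlap scorer, which is a placeholder and not a cross-encoder. The commented lines show how a &lt;code&gt;sentence-transformers&lt;/code&gt; model could be plugged in; the full model id there is my assumption.&lt;/p&gt;

```python
def rerank(query, chunks, score_fn, top_k=5):
    # Score every (query, chunk) pair jointly, then keep the top_k chunks by descending score.
    pairs = [(query, chunk) for chunk in chunks]
    scores = score_fn(pairs)
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in order[:top_k]]

def toy_score_fn(pairs):
    # Placeholder scorer: number of shared lowercase words. NOT a cross-encoder.
    scores = []
    for query, chunk in pairs:
        q_words = set(query.lower().split())
        c_words = set(chunk.lower().split())
        scores.append(len(q_words.intersection(c_words)))
    return scores

# With sentence-transformers installed, a real scorer could be swapped in
# (assumption: the model id is "cross-encoder/ms-marco-MiniLM-L-6-v2"):
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   top = rerank(query, chunks, model.predict)

# Made-up first-pass results, with the mission chunk ranked first by the bi-encoder.
query = "what are amazon business lines"
chunks = [
    "our mission is to be customer obsessed",
    "amazon business lines include retail and cloud",
]
reranked = rerank(query, chunks, toy_score_fn)
print(reranked[0])
```

&lt;p&gt;Only the top few bi-encoder results need to go through the second pass, which keeps the extra latency bounded.&lt;/p&gt;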

&lt;p&gt;&lt;strong&gt;LLM-as-judge faithfulness&lt;/strong&gt; - RAGAS-style: decompose each answer into atomic claims and verify each claim against retrieved chunks. Slower and costs tokens but measures actual correctness instead of token overlap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer-length calibration&lt;/strong&gt; - run the eval at word caps of 100/125/150/175 and plot hallucination (CW) vs KW overlap. Find the Pareto-optimal cap where both pass threshold simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keyword set recalibration&lt;/strong&gt; - reduce to 2–3 high-signal terms per question, or adopt TF-IDF weighting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Code and Demo
&lt;/h2&gt;

&lt;p&gt;GitHub repo: &lt;a href="https://github.com/shivashrestha/web-intelligence" rel="noopener noreferrer"&gt;web-intelligence&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Live demo: &lt;a href="https://web-intelligence-red.vercel.app" rel="noopener noreferrer"&gt;web-intelligence-red.vercel.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The eval notebook is at &lt;code&gt;backend/rag_eval_single.ipynb&lt;/code&gt;. A results JSON file is written to &lt;code&gt;data/eval_single_&amp;lt;site&amp;gt;_&amp;lt;date&amp;gt;.json&lt;/code&gt; on each run.&lt;/p&gt;

&lt;p&gt;If you've built RAG eval harnesses and hit similar issues, especially the verbosity/hallucination conflation, I'd like to hear how you handled it ☺️.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>python</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
