<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AnubhavBharadwaaj</title>
    <description>The latest articles on DEV Community by AnubhavBharadwaaj (@anubhavbharadwaaj).</description>
    <link>https://dev.to/anubhavbharadwaaj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3879939%2F72bc4e46-9d3a-4222-b5ab-1ed791acf5c3.png</url>
      <title>DEV Community: AnubhavBharadwaaj</title>
      <link>https://dev.to/anubhavbharadwaaj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anubhavbharadwaaj"/>
    <language>en</language>
    <item>
      <title>I tested a 4B model vs a 70B model on research papers. The 4B model won</title>
      <dc:creator>AnubhavBharadwaaj</dc:creator>
      <pubDate>Wed, 15 Apr 2026 07:37:27 +0000</pubDate>
      <link>https://dev.to/anubhavbharadwaaj/i-tested-a-4b-model-vs-a-70b-model-on-research-papers-the-4b-model-won-hln</link>
      <guid>https://dev.to/anubhavbharadwaaj/i-tested-a-4b-model-vs-a-70b-model-on-research-papers-the-4b-model-won-hln</guid>
      <description>&lt;p&gt;I've been competing in ML competitions (OpenAI Parameter Golf, &lt;br&gt;
WorldQuant IQC) and kept hitting the same wall: I'd read a paper, &lt;br&gt;
understand it conceptually, but lose hours hunting for the actual &lt;br&gt;
learning rate on page 14, the calibration procedure buried in a &lt;br&gt;
footnote, and the failure mode mentioned once in a table caption.&lt;/p&gt;

&lt;p&gt;So I built a CLI tool that extracts all of that into a structured &lt;br&gt;
file. One command, ~2 minutes per paper. That part isn't surprising.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What surprised me is what happened when I gave those files to &lt;br&gt;
small models.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;The experiment&lt;/h2&gt;

&lt;p&gt;I took a 33-page quantization survey paper and asked 10 specific &lt;br&gt;
implementation questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What is the exact inference speedup of InceptionV3 with INT8?"&lt;/li&gt;
&lt;li&gt;"What is the energy cost of INT4 vs FP32 at 45nm?"&lt;/li&gt;
&lt;li&gt;"In symmetric quantization, what happens to zero point Z?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I tested two setups:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup A:&lt;/strong&gt; Give the raw PDF to a large model (70B parameters)&lt;br&gt;
&lt;strong&gt;Setup B:&lt;/strong&gt; Give the pre-extracted skill file to a tiny model &lt;br&gt;
(4B parameters — runs on a phone)&lt;/p&gt;
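
&lt;p&gt;For reference, Setup B is just prompt stuffing: read the skill file, &lt;br&gt;
prepend it to the question, and send both to a small local model. A &lt;br&gt;
minimal sketch against Ollama's local HTTP API is below; the file name &lt;br&gt;
and model tag are placeholders, not necessarily the exact ones I used.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch of Setup B: skill file + question -&gt; small local model.
# Assumes Ollama is running locally; file name and model tag are placeholders.
import requests

SKILL = open("skills/quantization-for-efficient-neural-networks.md").read()

def ask(question):
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "qwen2.5:3b",  # any ~4B local model works here
            "stream": False,
            "messages": [
                {"role": "system",
                 "content": "Answer only from this skill file:\n\n" + SKILL},
                {"role": "user", "content": question},
            ],
        },
        timeout=120,
    )
    return resp.json()["message"]["content"]

print(ask("What is the exact inference speedup of InceptionV3 with INT8?"))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;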
&lt;h2&gt;The result&lt;/h2&gt;

&lt;p&gt;The 4B model with the skill file gave more precise answers.&lt;/p&gt;

&lt;p&gt;Not "roughly equivalent." More precise. The 70B model with the &lt;br&gt;
raw PDF would say "approximately 2-4x speedup on GPU hardware." &lt;br&gt;
The 4B model with the skill file said "5.02x speedup on NVIDIA &lt;br&gt;
GTX 1080, reference [157]."&lt;/p&gt;
&lt;h2&gt;Why this happens&lt;/h2&gt;

&lt;p&gt;It's not magic. It's structural:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context window.&lt;/strong&gt; A 33-page PDF is ~50K tokens. A 4B model &lt;br&gt;
has an 8K context window, so it literally can't fit the PDF. A &lt;br&gt;
500-line skill file is ~4K tokens and fits easily (see the rough &lt;br&gt;
arithmetic after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Table parsing.&lt;/strong&gt; Small models are terrible at finding numbers &lt;br&gt;
in dense academic prose. A skill file puts every number in a &lt;br&gt;
labeled markdown table row. The model just reads a row.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hallucination reduction.&lt;/strong&gt; When a small model can't find &lt;br&gt;
information, it guesses. With structured skill files, the &lt;br&gt;
information is either there (in a labeled field) or not. No &lt;br&gt;
ambiguous prose to misinterpret.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Variable definitions.&lt;/strong&gt; A PDF says "α" in one paragraph and &lt;br&gt;
"the weighting coefficient" three pages later. A skill file says &lt;br&gt;
&lt;code&gt;α = weighting coefficient for student loss&lt;/code&gt; right next to the &lt;br&gt;
equation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
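
&lt;p&gt;The context-window point is easy to sanity-check with rough &lt;br&gt;
arithmetic. The figures below are estimates (roughly 4 characters &lt;br&gt;
per token), not measurements from the paper.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-the-envelope token math for point 1 (rough ~4 chars/token estimate).
def approx_tokens(n_chars):
    return n_chars // 4

pdf_tokens = approx_tokens(33 * 6000)    # ~33 pages x ~6,000 chars/page -&gt; ~50K tokens
skill_tokens = approx_tokens(500 * 32)   # ~500 lines x ~32 chars/line   -&gt; ~4K tokens
context = 8_000                          # typical 4B-model context window

print(f"PDF:   ~{pdf_tokens:,} tokens, fits 8K window: {pdf_tokens &lt;= context}")
print(f"Skill: ~{skill_tokens:,} tokens, fits 8K window: {skill_tokens &lt;= context}")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;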
&lt;h2&gt;What the skill file looks like&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
name: quantization-for-efficient-neural-networks
description: "Use this skill when implementing model quantization,
  post-training quantization (PTQ), quantization-aware training
  (QAT), or mixed-precision inference."
---

## Uniform Quantization
Q(r) = Int(r/S) - Z
where:
  r = real-valued input (activation or weight)
  S = real-valued scaling factor
  Z = integer zero point

## Inference Speedup Data
| Model       | Quant Type | Hardware        | Speedup |
|-------------|------------|-----------------|---------|
| ResNet50    | INT8       | NVIDIA GTX 1080 | 3.89x   |
| InceptionV3 | INT8       | NVIDIA GTX 1080 | 5.02x   |
| BERT        | INT8       | (unspecified)   | 4.0x    |

## Key Takeaways
1. Use symmetric quantization for weights, asymmetric for activations
2. lr=1e-5 for QAT fine-tuning (NOT 1e-3 — causes oscillation)
3. Channelwise quantization for kernels — one scaling factor per channel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Every skill file follows this exact structure, whether I generate &lt;br&gt;
it today or six months from now, and whether it's a quantization &lt;br&gt;
paper or a distillation paper.&lt;/p&gt;
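
&lt;p&gt;As a quick check that these entries are directly usable, the &lt;br&gt;
Uniform Quantization line above translates to a few lines of NumPy. &lt;br&gt;
This sketch is my illustration, not part of the generated file; the &lt;br&gt;
S and Z values are arbitrary examples.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Uniform quantization exactly as the skill file states it: Q(r) = Int(r/S) - Z.
import numpy as np

def quantize(r, S, Z):
    return np.round(r / S).astype(np.int32) - Z

weights = np.array([-0.42, 0.0, 0.37, 1.05])
print(quantize(weights, S=0.01, Z=0))  # symmetric case: zero point Z is fixed at 0
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;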
&lt;h2&gt;The real value isn't accuracy — it's workflow&lt;/h2&gt;

&lt;p&gt;Could you get the same answer by uploading the PDF to Claude Opus? &lt;br&gt;
Yes. Claude reads PDFs excellently.&lt;/p&gt;

&lt;p&gt;But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can you do that for 30 papers in one command? No.&lt;/li&gt;
&lt;li&gt;Will the output format be identical across months? No.&lt;/li&gt;
&lt;li&gt;Can you load the results into a 4B local model running offline? No.&lt;/li&gt;
&lt;li&gt;Do those ChatPDF sessions still exist six months later? No.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Skill files go in your git repo. They travel with your codebase. &lt;br&gt;
They work in Claude, Cursor, Windsurf, Ollama — any tool that &lt;br&gt;
reads files.&lt;/p&gt;
&lt;h2&gt;The tool&lt;/h2&gt;

&lt;p&gt;It's called SkillForge. Single Python file, ~2000 lines, open source.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Free (uses OpenRouter free models)&lt;/span&gt;
python skillforge.py &lt;span class="nt"&gt;--arxiv&lt;/span&gt; 2103.13630 &lt;span class="nt"&gt;--provider&lt;/span&gt; openrouter

&lt;span class="c"&gt;# Batch mode — process your weekly reading list&lt;/span&gt;
python skillforge.py batch &lt;span class="nt"&gt;--list&lt;/span&gt; sources.txt &lt;span class="nt"&gt;--provider&lt;/span&gt; openrouter &lt;span class="nt"&gt;--paid&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost: $0 with free models, ~$0.03/paper with paid mode.&lt;/p&gt;

&lt;p&gt;If the quality isn't high enough, it auto-escalates through &lt;br&gt;
stronger models (gemini-flash → deepseek → gemini-pro → claude-sonnet &lt;br&gt;
→ claude-opus) until the target is met.&lt;/p&gt;
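
&lt;p&gt;Conceptually the escalation is just a loop over an ordered model &lt;br&gt;
list, roughly like the sketch below; &lt;code&gt;generate_skill()&lt;/code&gt; and &lt;br&gt;
&lt;code&gt;score_skill()&lt;/code&gt; are stand-ins for the real generation and quality &lt;br&gt;
checks, and the 0.8 threshold is a made-up example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Conceptual sketch of the escalation ladder; generate_skill() and score_skill()
# stand in for the actual generation and scoring steps.
LADDER = ["gemini-flash", "deepseek", "gemini-pro", "claude-sonnet", "claude-opus"]
TARGET = 0.8  # hypothetical quality threshold

def build_skill(paper_text):
    skill = None
    for model in LADDER:
        skill = generate_skill(paper_text, model=model)
        if score_skill(skill) &gt;= TARGET:
            break
    return skill  # best attempt from the strongest model tried
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;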

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/AnubhavBharadwaaj/skillforge" rel="noopener noreferrer"&gt;https://github.com/AnubhavBharadwaaj/skillforge&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demo video:&lt;/strong&gt; &lt;a href="https://www.youtube.com/watch?v=O0J55eRcwZw" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=O0J55eRcwZw&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;The finding that small models + structured context beats large &lt;br&gt;
models + raw documents feels generalizable beyond papers. &lt;br&gt;
Any domain where you're feeding unstructured reference material &lt;br&gt;
to an LLM probably benefits from pre-structuring it — even if &lt;br&gt;
the structuring itself costs a frontier model call. You pay once; &lt;br&gt;
every subsequent query is cheaper and more accurate.&lt;/p&gt;

&lt;p&gt;Curious if anyone has seen similar results in other domains.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
