AnubhavBharadwaaj
I tested a 4B model vs a 70B model on research papers. The 4B model won

I've been competing in ML competitions (OpenAI Parameter Golf,
WorldQuant IQC) and kept hitting the same wall: I'd read a paper,
understand it conceptually, but lose hours hunting for the actual
learning rate on page 14, the calibration procedure buried in a
footnote, and the failure mode mentioned once in a table caption.

So I built a CLI tool that extracts all of that into a structured
file. One command, ~2 minutes per paper. That part isn't surprising.

What surprised me is what happened when I gave those files to
small models.

The experiment

I took a 33-page quantization survey paper and asked 10 specific
implementation questions like:

  • "What is the exact inference speedup of InceptionV3 with INT8?"
  • "What is the energy cost of INT4 vs FP32 at 45nm?"
  • "In symmetric quantization, what happens to zero point Z?"

I tested two setups:

Setup A: Give the raw PDF to a large model (70B parameters)
Setup B: Give the pre-extracted skill file to a tiny model
(4B parameters — runs on a phone)

The result

The 4B model with the skill file gave more precise answers.

Not "roughly equivalent." More precise. The 70B model with the
raw PDF would say "approximately 2-4x speedup on GPU hardware."
The 4B model with the skill file said "5.02x speedup on NVIDIA
GTX 1080, reference [157]."

Why this happens

It's not magic. It's structural:

  1. Context window. A 33-page PDF is ~50K tokens. A 4B model
    has an 8K context window. It literally can't fit the PDF. A
    500-line skill file is ~4K tokens. Fits easily.

  2. Table parsing. Small models are terrible at finding numbers
    in dense academic prose. A skill file puts every number in a
    labeled markdown table row. The model just reads a row.

  3. Hallucination reduction. When a small model can't find
    information, it guesses. With structured skill files, the
    information is either there (in a labeled field) or not. No
    ambiguous prose to misinterpret.

  4. Variable definitions. A PDF says "α" in one paragraph and
    "the weighting coefficient" three pages later. A skill file says
    α = weighting coefficient for student loss right next to the
    equation.
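The context-window arithmetic in point 1 is easy to sanity-check with a rough chars-per-token heuristic. A minimal sketch — the ~4 chars/token ratio and the page/line sizes are my assumptions, not measurements from the tool:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English prose."""
    return len(text) // 4

# Assumed sizes: ~6,000 chars per PDF page, ~32 chars per skill-file line.
pdf_text = "x" * (33 * 6000)    # a 33-page paper
skill_text = "x" * (500 * 32)   # a 500-line skill file

small_model_window = 8_000  # tokens, as in the post

print(estimate_tokens(pdf_text))    # ~49,500 tokens: does not fit
print(estimate_tokens(skill_text))  # ~4,000 tokens: fits easily
```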

What the skill file looks like

```markdown
---
name: quantization-for-efficient-neural-networks
description: >
  Use this skill when implementing model quantization,
  post-training quantization (PTQ), quantization-aware training
  (QAT), or mixed-precision inference.
---

## Uniform Quantization
Q(r) = Int(r/S) - Z
where:
  r = real-valued input (activation or weight)
  S = real-valued scaling factor
  Z = integer zero point

## Inference Speedup Data
| Model       | Quant Type | Hardware        | Speedup |
|-------------|------------|-----------------|---------|
| ResNet50    | INT8       | NVIDIA GTX 1080 | 3.89x   |
| InceptionV3 | INT8       | NVIDIA GTX 1080 | 5.02x   |
| BERT        | INT8       | (unspecified)   | 4.0x    |

## Key Takeaways
1. Use symmetric quantization for weights, asymmetric for activations
2. lr=1e-5 for QAT fine-tuning (NOT 1e-3 — causes oscillation)
3. Channelwise quantization for kernels — one scaling factor per channel
```

Every skill file follows this exact structure. Whether I generate
it today or six months from now. Whether it's a quantization paper
or a distillation paper.
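The uniform-quantization formula in the skill file is concrete enough to execute. Here is a minimal sketch — the symmetric case sets Z = 0, and the INT8 scale below is an illustrative choice, not a value from the paper:

```python
def quantize(r: float, S: float, Z: int) -> int:
    """Uniform quantization: Q(r) = Int(r/S) - Z."""
    return int(round(r / S)) - Z

def dequantize(q: int, S: float, Z: int) -> float:
    """Approximate inverse: r ~= S * (q + Z)."""
    return S * (q + Z)

# Symmetric INT8 example: map [-1, 1] onto [-127, 127], so S = 1/127, Z = 0.
S, Z = 1.0 / 127, 0
q = quantize(0.5, S, Z)   # 64
r = dequantize(q, S, Z)   # ~0.504; the gap from 0.5 is the quantization error
```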

The real value isn't accuracy — it's workflow

Could you get the same answer by uploading the PDF to Claude Opus?
Yes. Claude reads PDFs excellently.

But:

  • Can you do that for 30 papers in one command? No.
  • Will the output format be identical across months? No.
  • Can you load the results into a 4B local model running offline? No.
  • Do those ChatPDF sessions still exist six months later? No.

Skill files go in your git repo. They travel with your codebase.
They work in Claude, Cursor, Windsurf, Ollama — any tool that
reads files.

The tool

It's called SkillForge. Single Python file, ~2000 lines, open source.

```shell
# Free (uses OpenRouter free models)
python skillforge.py --arxiv 2103.13630 --provider openrouter

# Batch mode — process your weekly reading list
python skillforge.py batch --list sources.txt --provider openrouter --paid
```

Cost: $0 with free models, ~$0.03/paper with paid mode.

If the quality isn't high enough, it auto-escalates through
stronger models (gemini-flash → deepseek → gemini-pro → claude-sonnet
→ claude-opus) until the target is met.
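I haven't read SkillForge's internals, but that escalation loop can be sketched roughly like this — `generate_fn` and `score_fn` are hypothetical stand-ins for the tool's actual generation and quality-scoring steps:

```python
# Model chain from the post, weakest to strongest.
ESCALATION_CHAIN = [
    "gemini-flash", "deepseek", "gemini-pro", "claude-sonnet", "claude-opus",
]

def generate_with_escalation(paper, generate_fn, score_fn, target=0.8):
    """Try each model in order; stop at the first that meets the quality target.

    Falls back to the best attempt seen if even the strongest model misses it.
    """
    best_skill, best_score = None, -1.0
    for model in ESCALATION_CHAIN:
        skill = generate_fn(model, paper)
        score = score_fn(skill)
        if score > best_score:
            best_skill, best_score = skill, score
        if score >= target:
            return skill, model   # target met: stop escalating
    return best_skill, None       # target never met: return best attempt
```

The key design point is that you only pay for a stronger (more expensive) model when the cheaper one's output actually fails the quality check.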

GitHub: https://github.com/AnubhavBharadwaaj/skillforge

Demo video: https://www.youtube.com/watch?v=O0J55eRcwZw


The finding that small models + structured context can beat large
models + raw documents feels generalizable beyond papers.
Any domain where you're feeding unstructured reference material
to an LLM probably benefits from pre-structuring it — even if
the structuring itself costs a frontier model call. You pay that
cost once; every subsequent query is cheaper and more accurate.

Curious if anyone has seen similar results in other domains.
