AnubhavBharadwaaj
I tested a 4B model vs a 70B model on research papers. The 4B model won

I've been competing in ML competitions (OpenAI Parameter Golf,
WorldQuant IQC) and kept hitting the same wall: I'd read a paper,
understand it conceptually, but lose hours hunting for the actual
learning rate on page 14, the calibration procedure buried in a
footnote, and the failure mode mentioned once in a table caption.

So I built a CLI tool that extracts all of that into a structured
file. One command, ~2 minutes per paper. That part isn't surprising.

What surprised me is what happened when I gave those files to
small models.

The experiment

I took a 33-page quantization survey paper and asked 10 specific
implementation questions like:

  • "What is the exact inference speedup of InceptionV3 with INT8?"
  • "What is the energy cost of INT4 vs FP32 at 45nm?"
  • "In symmetric quantization, what happens to zero point Z?"

I tested two setups:

Setup A: Give the raw PDF to a large model (70B parameters)
Setup B: Give the pre-extracted skill file to a tiny model
(4B parameters — runs on a phone)

The result

The 4B model with the skill file gave more precise answers.

Not "roughly equivalent." More precise. The 70B model with the
raw PDF would say "approximately 2-4x speedup on GPU hardware."
The 4B model with the skill file said "5.02x speedup on NVIDIA
GTX 1080, reference [157]."

Why this happens

It's not magic. It's structural:

  1. Context window. A 33-page PDF is ~50K tokens. A 4B model
    has an 8K context window. It literally can't fit the PDF. A
    500-line skill file is ~4K tokens. Fits easily.

  2. Table parsing. Small models are terrible at finding numbers
    in dense academic prose. A skill file puts every number in a
    labeled markdown table row. The model just reads a row.

  3. Hallucination reduction. When a small model can't find
    information, it guesses. With structured skill files, the
    information is either there (in a labeled field) or not. No
    ambiguous prose to misinterpret.

  4. Variable definitions. A PDF says "α" in one paragraph and
    "the weighting coefficient" three pages later. A skill file says
    α = weighting coefficient for student loss right next to the
    equation.
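The context-window arithmetic in point 1 is easy to sanity-check with a rough chars-per-token heuristic. A minimal sketch — the ~4 chars/token ratio and the page/line sizes are my assumptions, not measurements from the tool:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English prose."""
    return len(text) // 4

# Assumed sizes: ~6,000 chars per PDF page, ~32 chars per skill-file line.
pdf_text = "x" * (33 * 6000)    # a 33-page paper
skill_text = "x" * (500 * 32)   # a 500-line skill file

small_model_window = 8_000  # tokens, as in the post

print(estimate_tokens(pdf_text))    # ~49,500 tokens: does not fit
print(estimate_tokens(skill_text))  # ~4,000 tokens: fits easily
```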

What the skill file looks like

```markdown
---
name: quantization-for-efficient-neural-networks
description: >
  Use this skill when implementing model quantization,
  post-training quantization (PTQ), quantization-aware training
  (QAT), or mixed-precision inference.
---

## Uniform Quantization
Q(r) = Int(r/S) - Z
where:
  r = real-valued input (activation or weight)
  S = real-valued scaling factor
  Z = integer zero point

## Inference Speedup Data
| Model       | Quant Type | Hardware        | Speedup |
|-------------|------------|-----------------|---------|
| ResNet50    | INT8       | NVIDIA GTX 1080 | 3.89x   |
| InceptionV3 | INT8       | NVIDIA GTX 1080 | 5.02x   |
| BERT        | INT8       | (unspecified)   | 4.0x    |

## Key Takeaways
1. Use symmetric quantization for weights, asymmetric for activations
2. lr=1e-5 for QAT fine-tuning (NOT 1e-3 — causes oscillation)
3. Channelwise quantization for kernels — one scaling factor per channel
```

Every skill file follows this exact structure. Whether I generate
it today or six months from now. Whether it's a quantization paper
or a distillation paper.
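The uniform-quantization formula in the skill file is concrete enough to execute. Here is a minimal sketch — the symmetric case sets Z = 0, and the INT8 scale below is an illustrative choice, not a value from the paper:

```python
def quantize(r: float, S: float, Z: int) -> int:
    """Uniform quantization: Q(r) = Int(r/S) - Z."""
    return int(round(r / S)) - Z

def dequantize(q: int, S: float, Z: int) -> float:
    """Approximate inverse: r ~= S * (q + Z)."""
    return S * (q + Z)

# Symmetric INT8 example: map [-1, 1] onto [-127, 127], so S = 1/127, Z = 0.
S, Z = 1.0 / 127, 0
q = quantize(0.5, S, Z)   # 64
r = dequantize(q, S, Z)   # ~0.504; the gap from 0.5 is the quantization error
```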

The real value isn't accuracy — it's workflow

Could you get the same answer by uploading the PDF to Claude Opus?
Yes. Claude reads PDFs excellently.

But:

  • Can you do that for 30 papers in one command? No.
  • Will the output format be identical across months? No.
  • Can you load the results into a 4B local model running offline? No.
  • Do those ChatPDF sessions still exist six months later? No.

Skill files go in your git repo. They travel with your codebase.
They work in Claude, Cursor, Windsurf, Ollama — any tool that
reads files.

The tool

It's called SkillForge. Single Python file, ~2000 lines, open source.

```shell
# Free (uses OpenRouter free models)
python skillforge.py --arxiv 2103.13630 --provider openrouter

# Batch mode — process your weekly reading list
python skillforge.py batch --list sources.txt --provider openrouter --paid
```

Cost: $0 with free models, ~$0.03/paper with paid mode.

If the quality isn't high enough, it auto-escalates through
stronger models (gemini-flash → deepseek → gemini-pro → claude-sonnet
→ claude-opus) until the target is met.
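I haven't read SkillForge's internals, but that escalation loop can be sketched roughly like this — `generate_fn` and `score_fn` are hypothetical stand-ins for the tool's actual generation and quality-scoring steps:

```python
# Model chain from the post, weakest to strongest.
ESCALATION_CHAIN = [
    "gemini-flash", "deepseek", "gemini-pro", "claude-sonnet", "claude-opus",
]

def generate_with_escalation(paper, generate_fn, score_fn, target=0.8):
    """Try each model in order; stop at the first that meets the quality target.

    Falls back to the best attempt seen if even the strongest model misses it.
    """
    best_skill, best_score = None, -1.0
    for model in ESCALATION_CHAIN:
        skill = generate_fn(model, paper)
        score = score_fn(skill)
        if score > best_score:
            best_skill, best_score = skill, score
        if score >= target:
            return skill, model   # target met: stop escalating
    return best_skill, None       # target never met: return best attempt
```

The key design point is that you only pay for a stronger (more expensive) model when the cheaper one's output actually fails the quality check.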

GitHub: https://github.com/AnubhavBharadwaaj/skillforge

Demo video: https://www.youtube.com/watch?v=O0J55eRcwZw


The finding that small models + structured context can beat large
models + raw documents feels generalizable beyond papers.
Any domain where you're feeding unstructured reference material
to an LLM probably benefits from pre-structuring it — even if
the structuring itself costs a frontier model call. You pay that
cost once; every subsequent query is cheaper and more accurate.

Curious if anyone has seen similar results in other domains.
