I've been competing in ML competitions (OpenAI Parameter Golf,
WorldQuant IQC) and kept hitting the same wall: I'd read a paper,
understand it conceptually, but lose hours hunting for the actual
learning rate on page 14, the calibration procedure buried in a
footnote, and the failure mode mentioned once in a table caption.
So I built a CLI tool that extracts all of that into a structured
file. One command, ~2 minutes per paper. That part isn't surprising.
What surprised me is what happened when I gave those files to
small models.
The experiment
I took a 33-page quantization survey paper and asked 10 specific
implementation questions like:
- "What is the exact inference speedup of InceptionV3 with INT8?"
- "What is the energy cost of INT4 vs FP32 at 45nm?"
- "In symmetric quantization, what happens to zero point Z?"
I tested two setups:
Setup A: Give the raw PDF to a large model (70B parameters)
Setup B: Give the pre-extracted skill file to a tiny model
(4B parameters — runs on a phone)
The result
The 4B model with the skill file gave more precise answers.
Not "roughly equivalent." More precise. The 70B model with the
raw PDF would say "approximately 2-4x speedup on GPU hardware."
The 4B model with the skill file said "5.02x speedup on NVIDIA
GTX 1080, reference [157]."
Why this happens
It's not magic. It's structural:
1. **Context window.** A 33-page PDF is ~50K tokens. A 4B model
has an 8K context window. It literally can't fit the PDF. A
500-line skill file is ~4K tokens. Fits easily.
2. **Table parsing.** Small models are terrible at finding numbers
in dense academic prose. A skill file puts every number in a
labeled markdown table row. The model just reads a row.
3. **Hallucination reduction.** When a small model can't find
information, it guesses. With structured skill files, the
information is either there (in a labeled field) or not. No
ambiguous prose to misinterpret.
4. **Variable definitions.** A PDF says "α" in one paragraph and
"the weighting coefficient" three pages later. A skill file says
α = weighting coefficient for student loss right next to the
equation.
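The context-window point is easy to check with back-of-the-envelope numbers. This sketch uses the common ~4-characters-per-token heuristic (real tokenizers will differ somewhat), with page and line sizes that are my rough assumptions:

```python
def rough_token_count(text: str) -> int:
    # Crude heuristic: English prose averages ~4 characters per token.
    # Real BPE tokenizers (tiktoken, SentencePiece) vary around this.
    return len(text) // 4

# A 33-page PDF at ~6,000 characters per page vs. a 500-line skill
# file at ~35 characters per line (both assumed sizes).
pdf_tokens = rough_token_count("x" * 33 * 6000)
skill_tokens = rough_token_count("x" * 500 * 35)

context_window = 8_000  # typical for a small local model
print(pdf_tokens, skill_tokens)  # the PDF overflows; the skill file fits
```

The exact counts don't matter; the order-of-magnitude gap between ~50K and ~4K tokens does.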
What the skill file looks like
---
name: quantization-for-efficient-neural-networks
description: "Use this skill when implementing model quantization, post-training quantization (PTQ), quantization-aware training (QAT), or mixed-precision inference."
---
## Uniform Quantization
Q(r) = Int(r/S) - Z
where:
r = real-valued input (activation or weight)
S = real-valued scaling factor
Z = integer zero point
## Inference Speedup Data
| Model | Quant Type | Hardware | Speedup |
|-------------|-----------|-----------------|---------|
| ResNet50 | INT8 | NVIDIA GTX 1080 | 3.89x |
| InceptionV3 | INT8 | NVIDIA GTX 1080 | 5.02x |
| BERT | INT8 | (unspecified) | 4.0x |
## Key Takeaways
1. Use symmetric quantization for weights, asymmetric for activations
2. lr=1e-5 for QAT fine-tuning (NOT 1e-3 — causes oscillation)
3. Channelwise quantization for kernels — one scaling factor per channel
Every skill file follows this exact structure. Whether I generate
it today or six months from now. Whether it's a quantization paper
or a distillation paper.
The real value isn't accuracy — it's workflow
Could you get the same answer by uploading the PDF to Claude Opus?
Yes. Claude is excellent at reading PDFs.
But:
- Can you do that for 30 papers in one command? No.
- Will the output format be identical across months? No.
- Can you load the results into a 4B local model running offline? No.
- Do those ChatPDF sessions still exist six months later? No.
Skill files go in your git repo. They travel with your codebase.
They work in Claude, Cursor, Windsurf, Ollama — any tool that
reads files.
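To illustrate that last point, here is a minimal parser sketch (my own helper, not part of SkillForge) showing how any tool could index a skill file's frontmatter; it assumes `---` fences and simple `key: value` lines, which covers the fields in the example above:

```python
import re

def parse_skill_file(text: str) -> dict:
    # Split a skill file into frontmatter fields and a markdown body.
    # Minimal by design: no full YAML support, just key: value pairs.
    m = re.match(r"^---\n(.*?)\n---\n(.*)$", text, re.DOTALL)
    if not m:
        return {"meta": {}, "body": text}
    meta = {}
    for line in m.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip('"')
    return {"meta": meta, "body": m.group(2)}

skill = """---
name: quantization-for-efficient-neural-networks
description: "Use this skill when implementing model quantization."
---
## Uniform Quantization
Q(r) = Int(r/S) - Z
"""
parsed = parse_skill_file(skill)
print(parsed["meta"]["name"])
```

Because the format is this regular, a local model, a grep, or a ten-line script can all pull the same field from the same place.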
The tool
It's called SkillForge. Single Python file, ~2000 lines, open source.
# Free (uses OpenRouter free models)
python skillforge.py --arxiv 2103.13630 --provider openrouter
# Batch mode — process your weekly reading list
python skillforge.py batch --list sources.txt --provider openrouter --paid
Cost: $0 with free models, ~$0.03/paper with paid mode.
If the quality isn't high enough, it auto-escalates through
stronger models (gemini-flash → deepseek → gemini-pro → claude-sonnet
→ claude-opus) until the target is met.
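The escalation logic can be sketched like this. The function names and scoring callback are my stand-ins for illustration, not SkillForge's actual internals; only the model ordering comes from the description above:

```python
from typing import Callable

# Cheapest-first ladder, mirroring the order described above.
MODEL_LADDER = [
    "gemini-flash", "deepseek", "gemini-pro",
    "claude-sonnet", "claude-opus",
]

def extract_with_escalation(
    paper_text: str,
    generate: Callable[[str, str], str],  # (model, text) -> skill file
    score: Callable[[str], float],        # quality score in [0, 1]
    target: float = 0.8,
) -> tuple[str, str]:
    """Try each model in order; return the first output that meets
    the quality target, else the best attempt seen."""
    best_model, best_score, best_skill = "", -1.0, ""
    for model in MODEL_LADDER:
        skill = generate(model, paper_text)
        s = score(skill)
        if s >= target:
            return model, skill
        if s > best_score:
            best_model, best_score, best_skill = model, s, skill
    return best_model, best_skill
```

In the real tool the quality check would look for required sections (frontmatter, tables, takeaways); here it is just a pluggable callback, which also makes the loop easy to test with fakes.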
GitHub: https://github.com/AnubhavBharadwaaj/skillforge
Demo video: https://www.youtube.com/watch?v=O0J55eRcwZw
The finding that small models + structured context beats large
models + raw documents feels generalizable beyond papers.
Any domain where you're feeding unstructured reference material
to an LLM probably benefits from pre-structuring it — even if
the structuring itself costs a frontier model call. You pay once,
every subsequent query is cheaper and more accurate.
Curious if anyone has seen similar results in other domains.