Free students, paid teachers: how cheap LLMs learn from expensive ones

Every agent framework I tried assumed one paid frontier model. I wanted the opposite: an orchestrator that treats free and local models as first-class, and gets smarter over time without me paying per token. That idea turned into FreePalp, and the core trick is worth sharing on its own.

The problem with cheap models

A small or free model (Llama-3.1-8B, say, or a local Ollama model) is fast and costs nothing, but it tends to fail the hard tasks: multi-file edits, strict output formats, tool-use discipline. The usual answer is "just use a bigger model." That's expensive, and it gives up on the free tier entirely.

The trick: corrections accumulate

FreePalp runs a two-tier critic. Cheap deterministic checks come first (did the promised file actually get written? did the model leak a tool call as text? did it slip identity?), and only then an LLM critic. When a cheap model fails and a stronger model succeeds on retry, FreePalp doesn't throw that success away. It distills the working procedure into a reusable SKILL.md (the same format Claude Code uses), capturing the steps, the tools involved, and the one lesson that fixed it.
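
Here's a minimal sketch of that escalate-and-distill loop. It is not FreePalp's actual code: `Skill`, `render_skill_md`, and `solve_with_escalation` are hypothetical names, the models and the critic are passed in as plain callables, and the SKILL.md layout (frontmatter with `name` and `description`, then a Markdown body) is my assumption about what a distilled skill might contain.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Skill:
    """Hypothetical container for one distilled procedure."""
    name: str
    description: str
    steps: list[str]
    tools: list[str]
    lesson: str


def render_skill_md(skill: Skill) -> str:
    """Render a skill as frontmatter plus a Markdown body (assumed layout)."""
    lines = [
        "---",
        f"name: {skill.name}",
        f"description: {skill.description}",
        "---",
        "",
        "## Steps",
    ]
    lines += [f"{i}. {step}" for i, step in enumerate(skill.steps, start=1)]
    lines += ["", "## Tools"]
    lines += [f"- {tool}" for tool in skill.tools]
    lines += ["", "## Lesson", skill.lesson, ""]
    return "\n".join(lines)


def solve_with_escalation(
    task: str,
    student: Callable[[str], str],
    teacher: Callable[[str], str],
    critic: Callable[[str, str], bool],
    distill: Callable[[str, str], Skill],
) -> tuple[str, Optional[Skill]]:
    """Try the cheap model first; escalate only if the critic rejects it."""
    answer = student(task)
    if critic(task, answer):
        return answer, None  # the student passed, nothing new to learn
    answer = teacher(task)
    if critic(task, answer):
        # The teacher fixed what the student could not: keep the procedure.
        return answer, distill(task, answer)
    return answer, None  # both failed, so there is no working procedure to save
```

The point of returning the skill only on the fail-then-succeed path is that a skill is only worth storing when it records something the cheap model demonstrably could not do on its own.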

Next time a similar task shows up, that skill is injected into the prompt before the cheap model even tries. The cheap model now has a much better chance of getting it right on the first attempt, because it's standing on a procedure a stronger model already worked out. Free students, paid teachers, a skill set that grows.
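
To make the injection step concrete, here's a small sketch. How FreePalp decides that a task is "similar" isn't covered in this post, so I've swapped in plain word overlap (Jaccard similarity) as a stand-in; `pick_skill`, `build_prompt`, and the `0.2` threshold are hypothetical.

```python
import re
from typing import Optional


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def pick_skill(task: str, skills: dict[str, str], threshold: float = 0.2) -> Optional[str]:
    """Return the best-matching skill body, or None if nothing is close enough.

    `skills` maps a short skill description to its SKILL.md text.
    """
    task_tokens = tokens(task)
    best_score, best_body = 0.0, None
    for description, body in skills.items():
        score = jaccard(task_tokens, tokens(description))
        if score > best_score:
            best_score, best_body = score, body
    return best_body if best_score >= threshold else None


def build_prompt(task: str, skills: dict[str, str]) -> str:
    """Prepend the matching skill, if any, before the cheap model sees the task."""
    skill = pick_skill(task, skills)
    if skill is None:
        return task
    return f"Follow this procedure if it applies:\n\n{skill}\n\nTask: {task}"
```

The threshold matters more than the similarity measure: injecting an irrelevant skill can mislead a small model, so returning nothing is the safer default.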

Why deterministic checks matter

LLM critics are themselves LLMs, so they hallucinate too. That's why the first tier is plain code: regexes and file-system checks that catch the specific failure modes weak models have (fake "I created the file" claims, stub content, leaked <tool_call> text). These checks are cheap and fast, and they can't hallucinate. The LLM critic only spends tokens on the genuinely ambiguous cases.
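
A first-tier check along those lines might look like the sketch below. The function name `check_output`, the stub markers, and the problem strings are illustrative assumptions, not FreePalp's real rules.

```python
import re
from pathlib import Path

# Assumed patterns: a tool-call tag that ended up in the visible reply,
# and a few phrases that usually mean the model wrote a placeholder.
TOOL_CALL_LEAK = re.compile(r"</?tool_call>")
STUB_MARKERS = re.compile(r"TODO|NotImplementedError|rest of (the )?(code|file)", re.IGNORECASE)


def check_output(reply: str, promised_files: list[Path]) -> list[str]:
    """Return a list of problems; an empty list means the cheap checks passed."""
    problems = []
    if TOOL_CALL_LEAK.search(reply):
        problems.append("leaked tool call in text")
    for path in promised_files:
        if not path.is_file():
            # The model said it wrote the file, but the disk disagrees.
            problems.append(f"missing file: {path}")
            continue
        content = path.read_text(encoding="utf-8")
        if not content.strip():
            problems.append(f"empty file: {path}")
        elif STUB_MARKERS.search(content):
            problems.append(f"stub content: {path}")
    return problems
```

Only when this list comes back empty would the reply go on to the LLM critic, so obvious failures never cost a token.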

The rest of the system

Routing across 10+ free providers (Groq, Cerebras, Gemini, OpenRouter…) with live quota/cooldown awareness, and local Ollama as the always-available fallback.
DAG decomposition + parallel subagents for multi-file work.
MCP client (any Model Context Protocol server's tools appear to the agent).
Token streaming and an OpenAI-compatible /v1 endpoint (point any IDE plugin at it).
A real vector-memory graph you can explore, plus FTS5 full-text search over your history.
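
For the routing item in the list above, the quota/cooldown idea can be sketched in a few lines. This is a simplified illustration under my own assumptions: `Provider`, `pick_provider`, and `record_rate_limit` are hypothetical names, and a real router would also track per-model limits and refresh quotas over time.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Provider:
    """Hypothetical view of one provider's current state."""
    name: str
    remaining_requests: int
    cooldown_until: float = 0.0  # time.monotonic() value; 0.0 means no cooldown

    def available(self, now: float) -> bool:
        return self.remaining_requests > 0 and now >= self.cooldown_until


def pick_provider(providers: list[Provider], fallback: Provider,
                  now: Optional[float] = None) -> Provider:
    """Return the first usable provider in priority order, else the local fallback."""
    now = time.monotonic() if now is None else now
    for provider in providers:
        if provider.available(now):
            return provider
    return fallback


def record_rate_limit(provider: Provider, retry_after_s: float,
                      now: Optional[float] = None) -> None:
    """Put a provider on cooldown after it reports a rate limit."""
    now = time.monotonic() if now is None else now
    provider.cooldown_until = now + retry_after_s
```

Passing `now` explicitly keeps the logic testable; in normal use you would leave it out and let `time.monotonic()` supply the clock.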

It's MIT-licensed, written in Python, and a solo project. I keep an honest benchmark of where free models still hit a ceiling. If the teacher→student idea resonates, the code is here: https://github.com/verdyshd/freepalp. Feedback on the critic/routing design is very welcome.