I built a small pipeline on OpenClaw to stay on top of 3D printing news.
Nothing fancy — a Python script that pulls from YouTube, RSS feeds, and Reddit, uses a free LLM to summarize what's worth reading, and emails me a digest. I use OpenRouter's free tier because I'm cheap and the models are good enough for summarization.
It worked great. For about two weeks.
Then I started getting errors.
## The problem nobody talks about
Here's something I didn't fully appreciate until it bit me: free models on OpenRouter change constantly. Models get added, removed, rate-limited into uselessness, or quietly replaced with different versions. If you hardcode your model list — which every tutorial tells you to do — you're building on sand.
One morning I woke up to this:
```
[06:03] LLM HTTP 404 [openai/gpt-oss-120b:free]: model not found
[06:03] LLM HTTP 429 [nousresearch/hermes-3-llama-3.1-405b:free]: rate limited
[06:03] LLM HTTP 404 [mistralai/mistral-small-3.1-24b-instruct:free]: model not found
[06:03] All free models exhausted — returning empty
```
Three of my six hardcoded models were failing — two removed outright, one rate-limited into uselessness. The pipeline silently produced nothing. I missed a week of content before I noticed.
Hardcoded lists are technical debt. Free model availability is a moving target. These two facts collide badly.
## The fix: treat the model list as a live data source
OpenRouter has a public endpoint — no auth required — that returns their full model catalog:
```
GET https://openrouter.ai/api/v1/models
```
It returns ~346 models right now. Filtering to free ones with decent context windows gives you 10-15 candidates. The question is: which ones are actually worth using?
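Pulling and filtering that catalog takes nothing beyond the stdlib. A sketch — the `data` envelope matches what the endpoint returns, but the `min_context` cutoff and the `:free` id-suffix check are my own filtering choices, not anything OpenRouter prescribes:

```python
import json
import urllib.request

CATALOG_URL = "https://openrouter.ai/api/v1/models"

def fetch_catalog():
    """Fetch the full model catalog. Public endpoint — no API key needed."""
    with urllib.request.urlopen(CATALOG_URL, timeout=30) as resp:
        return json.loads(resp.read())["data"]

def free_candidates(catalog, min_context=32_000):
    """Keep only :free variants with a context window big enough to be useful."""
    return [
        m for m in catalog
        if m["id"].endswith(":free")
        and m.get("context_length", 0) >= min_context
    ]
```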
I wanted to rank them. My criteria:
- Context window — longer is better for summarization. A 262K context model can swallow an entire article thread without chunking.
- Model size — bigger models write better. A 70B model beats a 7B model for prose quality.
- Historical reliability — has this model actually worked when I've called it before?
That last one is the one nobody tracks. So I built tracking.
## model-registry.py — the discovery layer
The registry script runs once every 6 hours. It:
- Checks if the cache (~/.openclaw/free-models.json) is fresh — if yes, exits in <100ms (just a file stat)
- If stale, hits the OpenRouter catalog and scores every free model:

```python
def score_model(model_id, context_length):
    context_score = min(context_length / 1000, 200)  # caps at 200
    size_score = get_size_score(model_id)            # regex: 405b=200, 70b=140, 8b=50...
    return context_score + size_score
```

- Takes the top 10, writes them to free-models.json
- Logs a diff — "Added: X, Removed: Y since last run"
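The `get_size_score` helper is the regex part. A sketch of how it might look — the exact thresholds below are my guesses, anchored only to the 405b=200 / 70b=140 / 8b=50 values mentioned in the scoring comment:

```python
import re

def get_size_score(model_id: str) -> int:
    """Pull the parameter count out of a model id (e.g. '70b') and map it to
    a score. Takes the largest match so version digits like '3.1' don't win."""
    sizes = [int(n) for n in re.findall(r"(\d+)b", model_id.lower())]
    if not sizes:
        return 30  # no size in the id: assume small
    billions = max(sizes)
    for threshold, score in [(400, 200), (100, 170), (65, 140),
                             (30, 100), (13, 70), (7, 50)]:
        if billions >= threshold:
            return score
    return 30
```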
The diff log is where it gets interesting. On my first run after building this, I discovered two models I'd never heard of that scored in my top 6. One of them — qwen/qwen3-next-80b-a3b-instruct:free — has a 262K context window and an 80B parameter count. It's now my primary model. It wasn't in any tutorial I'd read.
## model-metrics.py — the performance layer
HTTP 200 doesn't mean the model was useful. A model can return 200 with three sentences of hallucinated nonsense that breaks your JSON parser downstream.
So I added tracking at two levels:
Level 1 — HTTP success:
```python
t0 = time.time()
try:
    resp = urllib.request.urlopen(req, timeout=90)
    content = resp.read()...
    record_metric(model_id, task, success=True,
                  latency_ms=int((time.time() - t0) * 1000),
                  output_len=len(content))
except urllib.error.HTTPError as e:
    record_metric(model_id, task, success=False,
                  latency_ms=..., error_code=str(e.code))
```
Level 2 — parse success (parse_ok):
After every call that expects structured JSON, I record whether the downstream parsing succeeded:
```python
response = call_free_llm(prompt, task="claim_extraction")
try:
    data = json.loads(response)
    update_parse_ok(True)   # output was actually usable
    return data
except json.JSONDecodeError:
    update_parse_ok(False)  # model returned garbage
```
parse_ok is the metric I care about most. It answers: was this model actually useful, not just technically responsive?
After a week of pipeline runs, I get a table like this:
```
Model                                      calls  ok%  p_ok%  avg_ms  errors
meta-llama/llama-3.3-70b-instruct:free        47  94%    88%  1240ms
qwen/qwen3-next-80b-a3b-instruct:free         31  97%    91%  1180ms
openai/gpt-oss-120b:free                      12  58%    42%  1890ms  5×404
nousresearch/hermes-3-llama-3.1-405b:free      8  62%    55%  2100ms  3×404
```
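A table like that falls out of a single GROUP BY over the per-call log. Assuming the metrics sit in a SQLite table with one row per call (the `calls` table and its `success` / `parse_ok` / `latency_ms` column names are my naming):

```python
import sqlite3

REPORT_SQL = """
SELECT model_id,
       COUNT(*)                         AS calls,
       ROUND(100.0 * AVG(success), 0)   AS ok_pct,
       ROUND(100.0 * AVG(parse_ok), 0)  AS p_ok_pct,  -- AVG skips NULLs
       CAST(AVG(latency_ms) AS INTEGER) AS avg_ms
FROM calls
GROUP BY model_id
ORDER BY ok_pct DESC, calls DESC
"""

def model_report(conn):
    """One row per model: call count, HTTP success rate, parse rate, latency."""
    return conn.execute(REPORT_SQL).fetchall()
```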
The last two models look fine on paper (they're large, they have long context) but they're dying constantly. Their scores get penalized:
```python
def score_penalty(stats_entry):
    ok = stats_entry["ok_pct"]
    if ok < 50: return 0.3   # heavy penalty
    if ok < 70: return 0.7
    if ok < 85: return 0.9
    return 1.0               # no penalty

final_score = catalog_score * score_penalty(historical_stats)
```
When the registry next refreshes, those models sink to the bottom of the fallback chain. Automatically. Without me touching anything.
## The result
The pipeline now:
- Discovers new free models within 6 hours of them appearing on OpenRouter
- Drops dead models from the rotation within one pipeline run
- Prioritizes models with proven parse reliability, not just raw specs
- Costs $0.00 extra — one public HTTP GET every 6 hours
The whole thing is ~250 lines across two files. No pip dependencies for the registry itself (stdlib only — json, urllib, sqlite3). The metrics use SQLite so they survive reboots and redeploys.
## Grab the code
model-registry.py and model-metrics.py — both standalone, drop them next to any script that calls OpenRouter:
```python
# Replace your hardcoded list with this:
import json
from pathlib import Path

REGISTRY_PATH = Path.home() / ".openclaw" / "free-models.json"
_FALLBACK = ["meta-llama/llama-3.3-70b-instruct:free", ...]

def load_free_models():
    try:
        data = json.loads(REGISTRY_PATH.read_text())
        models = [m["id"] for m in data.get("models", [])]
        if len(models) >= 2:
            return models
    except Exception:
        pass
    return list(_FALLBACK)

FREE_MODELS = load_free_models()
```
Run the registry as a preflight step before any pipeline that uses free models. If the cache is fresh, it exits immediately. If it's stale, it updates in ~1 second.
```shell
python3 model-registry.py --max-age 21600   # refresh if >6h old
python3 your-pipeline.py                    # now uses fresh model list
```
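That fast exit is just an mtime comparison — no parsing, no network. A sketch of the check (the function name and the surrounding `main()` wiring are mine; only the `--max-age` flag comes from the usage above):

```python
import time
from pathlib import Path

def cache_is_fresh(cache: Path, max_age_s: int) -> bool:
    """True if the registry file exists and is younger than max_age_s.
    One stat() call — this is the sub-100ms early-exit path."""
    try:
        return (time.time() - cache.stat().st_mtime) < max_age_s
    except FileNotFoundError:
        return False

# In model-registry.py's main(), roughly:
#   if cache_is_fresh(Path.home() / ".openclaw" / "free-models.json", args.max_age):
#       sys.exit(0)   # fresh — nothing to do
```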
The thing I keep thinking about: I built this to find 3D printing news — RepRap machines that print their own parts, little von Neumann probes of the workbench. Then foraging for that news made me realize I needed this algorithm, and now the algorithm finds me better news about the self-replicating machines themselves. It's turtles all the way down — but at least they're free turtles.
Full code on GitHub Gist: