DEV Community

Josh Green

Free LLMs on OpenRouter Keep Going 404. I Fixed It With 120 Lines of Python

I built a small pipeline on OpenClaw to stay on top of 3D printing news.

Nothing fancy — a Python script that pulls from YouTube, RSS feeds, and Reddit, uses a free LLM to summarize what's worth reading, and emails me a digest. I use OpenRouter's free tier because I'm cheap and the models are good enough for summarization.

It worked great. For about two weeks.

Then I started getting errors.


The problem nobody talks about

Here's something I didn't fully appreciate until it bit me: free models on OpenRouter change constantly. Models get added, removed, rate-limited into uselessness, or quietly replaced with different versions. If you hardcode your model list — which every tutorial tells you to do — you're building on sand.

One morning I woke up to this:

[06:03] LLM HTTP 404 [openai/gpt-oss-120b:free]: model not found
[06:03] LLM HTTP 429 [nousresearch/hermes-3-llama-3.1-405b:free]: rate limited
[06:03] LLM HTTP 404 [mistralai/mistral-small-3.1-24b-instruct:free]: model not found
[06:03] All free models exhausted — returning empty

Three of my six hardcoded models were dead. The pipeline silently produced nothing. I missed a week of content before I noticed.

Hardcoded lists are technical debt. Free model availability is a moving target. These two facts collide badly.


The fix: treat the model list as a live data source

OpenRouter has a public endpoint — no auth required — that returns their full model catalog:

GET https://openrouter.ai/api/v1/models

It returns ~346 models right now. Filtering to free ones with decent context windows gives you 10-15 candidates. The question is: which ones are actually worth using?
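That filtering step is simple enough to sketch. The catalog response carries a `data` array where each entry has an `id`, a `context_length`, and string-valued `pricing` fields; treating zero prompt/completion cost as "free" is my assumption here, as is the 32K context floor:

```python
import json
import urllib.request

CATALOG_URL = "https://openrouter.ai/api/v1/models"

def filter_free(catalog, min_context=32_000):
    """Keep free models with a context window big enough for summarization."""
    free = []
    for m in catalog.get("data", []):
        pricing = m.get("pricing", {})
        # Free variants report "0" for both prompt and completion pricing
        is_free = pricing.get("prompt") == "0" and pricing.get("completion") == "0"
        if is_free and (m.get("context_length") or 0) >= min_context:
            free.append(m["id"])
    return free

def fetch_free_models(min_context=32_000):
    # No auth header needed -- this endpoint is public
    with urllib.request.urlopen(CATALOG_URL, timeout=30) as resp:
        catalog = json.loads(resp.read().decode("utf-8"))
    return filter_free(catalog, min_context)
```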

I wanted to rank them. My criteria:

  1. Context window — longer is better for summarization. A 262K context model can swallow an entire article thread without chunking.
  2. Model size — bigger models write better. A 70B model beats a 7B model for prose quality.
  3. Historical reliability — has this model actually worked when I've called it before?

That last one is the one nobody tracks. So I built tracking.


model-registry.py — the discovery layer

The registry script runs once every 6 hours. It:

  1. Checks if the cache (~/.openclaw/free-models.json) is fresh — if yes, exits in <100ms (just a file stat)
  2. If stale, hits the OpenRouter catalog and scores every free model:
def score_model(model_id, context_length):
    context_score = min(context_length / 1000, 200)  # caps at 200
    size_score = get_size_score(model_id)             # regex: 405b=200, 70b=140, 8b=50...
    return context_score + size_score
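The `get_size_score` helper is just a regex over the model id. A hypothetical version consistent with the comment's scale (405b → 200, 70b → 140, 8b → 50; the intermediate buckets are my guess):

```python
import re

def get_size_score(model_id):
    """Pull the parameter count (e.g. '70b') out of the model id and bucket it.

    Sketch only: ids like 'qwen3-next-80b-a3b' contain more than one
    '<number>b' token, so we take the largest as the dense parameter count.
    """
    sizes = re.findall(r"(\d+(?:\.\d+)?)b\b", model_id.lower())
    if not sizes:
        return 40  # no size in the id: assume something small
    params = max(float(s) for s in sizes)
    if params >= 400: return 200
    if params >= 100: return 160
    if params >= 70:  return 140
    if params >= 30:  return 100
    if params >= 8:   return 50
    return 30
```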
  3. Takes the top 10, writes them to free-models.json
  4. Logs a diff — "Added: X, Removed: Y since last run"
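The freshness check in step 1 really is just a file stat. A minimal sketch, with the cache path and 6-hour threshold from the article (the helper name is mine):

```python
import sys
import time
from pathlib import Path

CACHE = Path.home() / ".openclaw" / "free-models.json"

def cache_is_fresh(path, max_age_s=21_600):
    """True if the cache exists and was modified within the last max_age_s seconds."""
    try:
        return (time.time() - path.stat().st_mtime) < max_age_s
    except FileNotFoundError:
        return False

def main():
    if cache_is_fresh(CACHE):
        sys.exit(0)  # fresh cache: nothing to do, exit after a single stat
    # ...otherwise fetch the catalog, rescore, and rewrite CACHE
```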

The diff log is where it gets interesting. On my first run after building this, I discovered two models I'd never heard of that scored in my top 6. One of them — qwen/qwen3-next-80b-a3b-instruct:free — has a 262K context window and an 80B parameter count. It's now my primary model. It wasn't in any tutorial I'd read.


model-metrics.py — the performance layer

HTTP 200 doesn't mean the model was useful. A model can return 200 with three sentences of hallucinated nonsense that breaks your JSON parser downstream.

So I added tracking at two levels:

Level 1 — HTTP success:

t0 = time.time()
try:
    resp = urllib.request.urlopen(req, timeout=90)
    content = resp.read().decode("utf-8")
    record_metric(model_id, task, success=True,
                  latency_ms=int((time.time() - t0) * 1000),
                  output_len=len(content))
except urllib.error.HTTPError as e:
    record_metric(model_id, task, success=False,
                  latency_ms=int((time.time() - t0) * 1000),
                  error_code=str(e.code))

Level 2 — parse success (parse_ok):

After every call that expects structured JSON, I record whether the downstream parsing succeeded:

response = call_free_llm(prompt, task="claim_extraction")
try:
    data = json.loads(response)
    update_parse_ok(True)   # output was actually usable
    return data
except json.JSONDecodeError:
    update_parse_ok(False)  # model returned garbage

parse_ok is the metric I care about most. It answers: was this model actually useful, not just technically responsive?

After a week of pipeline runs, I get a table like this:

Model                                      calls  ok%  p_ok%  avg_ms  errors
meta-llama/llama-3.3-70b-instruct:free       47   94%   88%   1240ms
qwen/qwen3-next-80b-a3b-instruct:free        31   97%   91%   1180ms
openai/gpt-oss-120b:free                     12   58%   42%   1890ms  5×404
nousresearch/hermes-3-llama-3.1-405b:free    8    62%   55%   2100ms  3×404
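A table like that is just a GROUP BY over the per-call rows. Assuming a SQLite `calls` table with `model_id`, `success`, `parse_ok`, and `latency_ms` columns (my schema, not necessarily the author's), the aggregation looks like:

```python
import sqlite3

SUMMARY_SQL = """
SELECT model_id,
       COUNT(*)                      AS calls,
       ROUND(100.0 * AVG(success))   AS ok_pct,
       ROUND(100.0 * AVG(parse_ok))  AS p_ok_pct,
       ROUND(AVG(latency_ms))        AS avg_ms
FROM calls
GROUP BY model_id
ORDER BY ok_pct DESC
"""

def summarize(db_path):
    """One (model, calls, ok%, parse_ok%, avg latency) row per model."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(SUMMARY_SQL).fetchall()
```

Note that `AVG(parse_ok)` ignores NULL rows, so calls that never reached the parsing stage don't drag down the parse rate.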

The last two models look fine on paper (they're large, they have long context) but they're dying constantly. Their scores get penalized:

def score_penalty(stats_entry):
    ok = stats_entry["ok_pct"]
    if ok < 50: return 0.3   # heavy penalty
    if ok < 70: return 0.7
    if ok < 85: return 0.9
    return 1.0               # no penalty

final_score = catalog_score * score_penalty(historical_stats)

When the registry next refreshes, those models sink to the bottom of the fallback chain. Automatically. Without me touching anything.
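The fallback chain itself is just a loop over the ranked list. A sketch against OpenRouter's OpenAI-compatible chat completions endpoint (the helper name is mine, and error handling is trimmed to HTTP errors; a real version would also catch timeouts and log to the metrics store):

```python
import json
import urllib.error
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"

def call_with_fallback(models, prompt, api_key, timeout=90):
    """Try each free model in ranked order; return the first reply, or ''."""
    for model_id in models:
        payload = json.dumps({
            "model": model_id,
            "messages": [{"role": "user", "content": prompt}],
        }).encode("utf-8")
        req = urllib.request.Request(API_URL, data=payload, headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        })
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                body = json.loads(resp.read().decode("utf-8"))
            return body["choices"][0]["message"]["content"]
        except urllib.error.HTTPError as e:
            print(f"LLM HTTP {e.code} [{model_id}]: trying next model")
    return ""  # all free models exhausted -- matches the article's empty return
```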


The result

The pipeline now:

  • Discovers new free models within 6 hours of them appearing on OpenRouter
  • Drops dead models from the rotation within one pipeline run
  • Prioritizes models with proven parse reliability, not just raw specs
  • Costs $0.00 extra — one public HTTP GET every 6 hours

The whole thing is ~250 lines across two files. No pip dependencies for the registry itself (stdlib only — json, urllib, sqlite3). The metrics use SQLite so they survive reboots and redeploys.


Grab the code

model-registry.py and model-metrics.py — both standalone, drop them next to any script that calls OpenRouter:

# Replace your hardcoded list with this:
import json
from pathlib import Path

REGISTRY_PATH = Path.home() / ".openclaw" / "free-models.json"
_FALLBACK = ["meta-llama/llama-3.3-70b-instruct:free", ...]

def load_free_models():
    try:
        data = json.loads(REGISTRY_PATH.read_text())
        models = [m["id"] for m in data.get("models", [])]
        if len(models) >= 2:
            return models
    except Exception:
        pass
    return list(_FALLBACK)

FREE_MODELS = load_free_models()

Run the registry as a preflight step before any pipeline that uses free models. If the cache is fresh, it exits immediately. If it's stale, it updates in ~1 second.

python3 model-registry.py --max-age 21600   # refresh if >6h old
python3 your-pipeline.py                     # now uses fresh model list

The thing I keep thinking about: I built this to follow 3D printing news, specifically the RepRap machines that print their own parts. Foraging for that news made me realize I needed this algorithm, and now the algorithm finds me better news about the von Neumann machines themselves. It's turtles all the way down, but at least they're free turtles.


Full code on GitHub Gist:
