DEV Community: Josh Green

How I stopped fighting print lines and started covering them instead

Josh Green — Tue, 21 Jul 2026 22:47:27 +0000

I spent about two years sanding print lines off models before I admitted the whole approach was backwards.

If you print anything you care about, you know the moment. Geometry is clean, fit is right, colour is right. Then you turn it toward the light and every flat face is covered in fine horizontal print lines. The part is good. It just looks printed, and you wanted it to look made.

Sanding is the obvious fix and the wrong one for most prints

Sanding works on a flat panel if you have an afternoon. Coarse grit, finer grit, primer, more grit. I am not going to pretend it does nothing.

But most of what I print is organic. Figures, handles, terrain, grips. On those, sanding is a slow way to destroy the detail you just printed. Every pass rounds off the sharp edges, and you end up with a softer, blurrier version of the model that happens to be smoother. After enough ruined parts I stopped asking how to sand better and started asking why the print lines were visible at all.

Print lines show up because the surface around them is blank

Here is the part that changed how I work.

A print line is a regular, repeating feature sitting on a smooth surface. Evenly spaced ridges, plain background. Your eye is built to catch a regular pattern against a blank field, and that contrast is the thing that reads as 3D printed. The lines are not ugly on their own. They are ugly because there is nothing else on the surface.

That gives you two options. Remove the ridges, which is sanding, slow and rough on detail. Or remove the blank surface, by covering the whole thing in surface relief. Wood grain, stone, stipple, woven cloth, any irregular relief. Once the surface is busy everywhere, the regular print lines stop standing out, because now they are competing with a pattern instead of sitting alone.

This is why a stone-textured planter looks finished off the bed and the same planter printed smooth never does, no matter how you tune the machine.

The workflow

The old problem was adding that relief. The textbook answer is a displacement or remesh modifier in Blender, and I gave it a real try. Blender is free and it is powerful. It is also a steep learning curve. Remeshing, UVs, texture coordinates, modifier stacks, an evening or two the first time you do it, and not something you can hand to someone else, because running Blender headless from a script is its own project.

What I do now is much shorter:

Take the STL I already have, smooth faces and all.
Run it through a browser 3D file upscaler that bakes real surface relief into the mesh.
Slice and print as usual.

Because the relief becomes real geometry and not a painted-on image, it survives slicing and prints on any filament through any slicer. Nothing to install, no account, which is the only reason it actually stuck as a habit.
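To make "real geometry, not a painted-on image" concrete, here is a minimal sketch of relief by vertex displacement: every vertex is pushed along its surface normal by a small, irregular height. This is not how any particular upscaler works internally. The function names `stipple_height` and `displace_vertices` and the frequency constants are made up for illustration, and a real tool would also subdivide the mesh first so there are enough vertices to move.

```python
import math

def stipple_height(x, y, z):
    """Irregular height in [0, 1] built from sines at unrelated frequencies.

    The constants are arbitrary; any non-repeating pattern would do.
    """
    s = math.sin(12.9898 * x + 78.233 * y + 37.719 * z)
    t = math.sin(4.1 * x - 9.7 * y + 2.3 * z)
    return 0.5 + 0.25 * (s + t)

def displace_vertices(vertices, normals, depth_mm, height_fn=stipple_height):
    """Move each vertex along its unit normal by depth_mm * height_fn(vertex)."""
    displaced = []
    for (x, y, z), (nx, ny, nz) in zip(vertices, normals):
        h = depth_mm * height_fn(x, y, z)
        displaced.append((x + nx * h, y + ny * h, z + nz * h))
    return displaced

# A flat 3 x 3 patch facing up (+Z), with a relief depth of 0.3 mm.
patch = [(float(i), float(j), 0.0) for i in range(3) for j in range(3)]
up = [(0.0, 0.0, 1.0)] * len(patch)
relief = displace_vertices(patch, up, depth_mm=0.3)
```

Because the output is just a new list of vertex positions, the slicer sees ordinary geometry, which is why this kind of relief survives slicing where an image texture would not.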

What to watch for

Match the relief depth to your layer height. If the relief is shallower than one layer, the print lines can still show through on glossy filament. A slightly deeper grain fixes it.
Organic relief is the most forgiving. Wood, leather and stone cover print lines completely, because irregular relief beats regular relief. Keep the fine geometric patterns for grips and panels.
You are covering the lines, not deleting them. If you need a mirror-smooth lens or a display piece that has to look manufactured, this is the wrong tool and you should sand and polish or use a chemical smoothing pass. For everything else, relief is faster and looks better.

Print lines were never a defect I had to grind out. They were a plain surface with a pattern on it, and the fix was to give the rest of the surface a pattern too. Cover them first, and the print comes off the bed already looking finished.

Local GLM 4.7 from Z.ai on dual Nvidia RTX 3090: when the smarter model is the wrong pick

Josh Green — Mon, 20 Jul 2026 16:48:09 +0000

Running models at home rewires the question you ask about them. In a hosted console the only thing you ever feel is quality, because a stranger is paying for the silicon. On machines you own, that silicon is the invoice, so every model has to answer something blunter than "is it good," which is "are you worth the card you are sitting on." This is what happened when I put that question to one specific model, Zhipu's GLM 4.7, on a pair of Nvidia RTX 3090s, and got back an answer that flipped on me halfway through.

What a home lab actually asks of a model

I keep a small stack of GPUs busy instead of renting them, so a model does not get to win on vibes. It has to be cheaper per useful token than the alternative already loaded next to it. GLM 4.7 is the interesting kind of candidate: it is a mixture-of-experts build, roughly thirty billion parameters on paper but only about three billion doing work per token, wrapped around a memory trick called Multi-head Latent Attention. On paper that is a model that should be both fast and light. The last time I tried it, on a much weaker box, it was neither, and I wrongly blamed the model before realizing the runtime simply was not doing the latent-attention math. So this round I gave it fair hardware and a fair opponent.

The two contenders, and a slow wire between the cards

The opponent was deliberately boring: an ordinary dense, grouped-query model in the same size class, the sort of thing that never trends but never surprises you either. Both ran under the same server so nothing about the plumbing favored one side. The hardware was two Nvidia RTX 3090s, twenty-four gigabytes apiece, forty-eight together, on a mid-range desktop. No fancy interconnect. The two cards gossip over a link I clocked at about 1.4 gigabytes a second, which is glacial by GPU standards, and that single unglamorous number ended up steering half the results.

Then I stopped theorizing and just clocked both models at every context length from a few thousand tokens out to a hundred and twenty-eight thousand.

Generation speed has a tipping point you can predict

Out of the gate GLM 4.7 was thrilling, close to ninety tokens a second on short prompts, well over double the dense model. And then it leaked speed as the context filled: down through the sixties, forties, twenties, and into the mid-teens at the deep end. The dense model, meanwhile, did almost nothing interesting, holding its lane the entire way with the stubbornness of a machine that has nothing clever to lose. The two curves met somewhere around sixty-four thousand tokens. Below that line the exciting model runs away with it. Above that line, the model nobody writes about is quietly faster.

The tax that never shows up in the demo

Everyone benchmarks generation speed. Almost nobody benchmarks the wait before the first token, the time the machine spends swallowing your prompt, which on long inputs is most of the wait you actually feel.

context	Zhipu GLM 4.7 (tok/s)	dense model (tok/s)
4K	2626	979
16K	1849	947
32K	1308	899
64K	805	780
128K	450	631

Same silhouette as before. GLM 4.7 devours a short prompt roughly 2.7 times quicker (2626 versus 979 tokens a second at 4K), then bleeds that advantage as the prompt grows, and by the deepest context the plain model is ingesting text faster than it is. The clever model is a sprinter on both halves of the job, and sprinters are not who you want at the back of a long race.
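To turn the table above into the wait you actually feel, here is a small sketch that converts each rate into seconds of prompt ingestion and interpolates where the two models trade places. Two assumptions are mine: "K" is read as 1024 tokens, and each rate is treated as the average over the whole prompt. The straight-line interpolation between the 64K and 128K rows is only a rough guide.

```python
# Prompt-processing rates from the table above, in tokens per second.
RATES = {
    4: (2626, 979),     # context in K tokens: (GLM 4.7, dense)
    16: (1849, 947),
    32: (1308, 899),
    64: (805, 780),
    128: (450, 631),
}

def prefill_seconds(context_k, rate_tok_s, tokens_per_k=1024):
    """Seconds spent ingesting a prompt of context_k thousand tokens."""
    return context_k * tokens_per_k / rate_tok_s

def crossover_k(rates):
    """Linearly interpolate the context (in K) where the two rates are equal."""
    points = sorted(rates.items())
    for (k0, (a0, b0)), (k1, (a1, b1)) in zip(points, points[1:]):
        d0, d1 = a0 - b0, a1 - b1
        if d0 > 0 >= d1:  # GLM ahead at k0, no longer ahead at k1
            return k0 + (k1 - k0) * d0 / (d0 - d1)
    return None

for k, (glm, dense) in sorted(RATES.items()):
    print(f"{k:>4}K  GLM {prefill_seconds(k, glm):6.1f}s  dense {prefill_seconds(k, dense):6.1f}s")
print(f"prefill crossover near {crossover_k(RATES):.0f}K tokens")
```

Under those assumptions the prefill crossover lands a little past 70K, and at 128K the gap is close to five minutes of waiting for GLM 4.7 against about three and a half for the dense model.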

Memory runs the story backwards

Here is the twist that keeps GLM 4.7 in the conversation. The same latent-attention trick that costs it speed deep in a long context also keeps its running memory unusually small, so on a single twenty-four gigabyte card it fits noticeably more context before it has to spill onto the second GPU.

So the two stories point in opposite directions. On speed the dense model wins the long game. On memory the clever model does, packing a bigger context onto one card. If your problem is "hold an enormous document on a single 3090," GLM 4.7 has a real, physical answer for you; it just charges you generation speed to give it. I even tried shrinking the memory further with a compressed cache and it backfired, slower at every depth for no real headroom, so I left that idea on the floor.

A second card can make things slower, not faster

The most counterintuitive stretch came from trying to use both RTX 3090s at once. Instinct says two cards beat one. For a single conversation at short context, one card actually won outright, because splitting the model across the pair makes every token pay a toll crossing that 1.4 gigabyte wire, and on short prompts the toll costs more than the extra card saves.

The math only tips once the context grows long enough that the per-token work dwarfs the toll, at which point splitting across both cards roughly doubles the memory bandwidth the model is starved for, and the pair pulls ahead. And the moment you stop chatting and start serving a crowd, it stops being a contest: batched across many streams the two cards together pushed toward five hundred tokens a second in aggregate, several times what one card manages alone. So even "use all my hardware" turned out to be another it-depends.

The decision tree I actually kept

I walked away with a verdict so unexciting it is almost rude:

Short, snappy, one person at a keyboard: run Zhipu's GLM 4.7 on a single card, it is genuinely quicker.
Long single-user context: run the boring dense model, still on one card, it is faster and simpler.
Serving real traffic: split GLM 4.7 across both RTX 3090s and take the throughput.
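Written down as code, the three rules above are short enough to keep next to the launch scripts. This is only a sketch: the function name `pick_setup` is invented, and the 64,000-token default is the rough crossover I measured on this box, not a constant that will transfer to other hardware.

```python
def pick_setup(context_tokens, concurrent_streams=1, crossover_tokens=64_000):
    """Return (model, gpu_count) following the three rules above."""
    if concurrent_streams > 1:
        return ("GLM 4.7", 2)      # serving traffic: split across both cards
    if context_tokens < crossover_tokens:
        return ("GLM 4.7", 1)      # short single-user work: fast model, one card
    return ("dense", 1)            # long single-user context: dense model, one card

print(pick_setup(8_000))                         # short chat
print(pick_setup(100_000))                       # one long document
print(pick_setup(8_000, concurrent_streams=16))  # serving a crowd
```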

There is no overall champion, only a right answer per shape of job, and the skill is being honest about which job you are actually doing. I keep a quick model and a dependable model side by side and reach for whichever the task rewards, which is the thing the leaderboards structurally cannot tell you.

None of this benchmarking is academic for me. The same two cards run the small tools I build on ModelDirectory, and the heaviest of them are exactly the short-to-medium GPU jobs where a fast model on one card wins: an STL upscaler that adds real detail to a printable model, and an STL texture generator that paints one from a prompt. The lighter utilities barely warm the GPU, like the print orientation and printability checker and the print cost calculator, and being able to run all of them on hardware I already paid for is the whole reason the run-or-rent math keeps landing on run.

The next thing I want to know is whether that sixty-four thousand token crossover survives contact with real long workloads instead of tidy synthetic ones. That needs a better test rig, and it is where I am headed next.

Every Medium Publication That Accepts 3D Content (2026 Map)

Josh Green — Sat, 30 May 2026 07:39:47 +0000

If you create 3D content — printing guides, WebGL tutorials, Three.js projects, CAD workflows — and want to publish on Medium, you have a discovery problem. There is no central 3D publication on the platform.

I mapped every Medium publication that accepts 3D-related content. Here is the short version for developers.

For WebGL / Three.js / Browser 3D

JavaScript in Plain English (180K followers) — best fit for Three.js, WebGL, browser rendering. Submit by following the pub and sending drafts.

Level Up Coding (73K followers) — Three.js is an explicit topic. Email submit@gitconnected.com with your draft link.

ITNEXT (55K followers) — deep technical dives on rendering pipelines and WebGL optimization.

For 3D Printing / Maker Content

Geek Culture (33K followers) — most 3D-friendly general tech pub. Explicitly lists VR/AR/MR. Published 3D printing construction articles.

The Startup (739K followers) — massive audience. Frame your content as discovery/comparison stories.

For Design / UX

UX Collective (483K followers) — 3D fits when framed as spatial UX, AR product previews, 3D viewer design.

Bootcamp (166K followers) — case studies and design tutorials with a 3D angle.

Easy Entry Points

ILLUMINATION (250K+) — welcoming to new writers, broad tech.

Dev Genius (10K) — publishes daily, all skill levels.

Tags That Actually Work

The right tags make or break discovery:

Virtual Reality (494K followers) — highest reach for any 3D content
Augmented Reality (89K) — strong for spatial computing
3D Printing (2.1K) — small but targeted
Three.js (682) — tiny but engaged developers
WebGL (1.2K) — technical web graphics

Formula: 1 broad tag + 1 niche tag + 2-3 topic tags.

The Gap

There is no dominant 3D publication on Medium. The space is wide open. Until someone builds one, place your articles in existing tech and design pubs and match their editorial angle.

For previewing 3D files for article screenshots, I use GeometryViewer — browser-based, handles STL/OBJ/GLB, no install.

Full guide with submission links for all 15 publications: joshgreen-dev.github.io

Microsoft 3D Viewer Dies July 1 — The STL Gap Nobody Is Talking About

Josh Green — Thu, 28 May 2026 23:40:06 +0000

Here's something that should bother every developer who works with 3D files: Microsoft is permanently removing 3D Viewer from the Microsoft Store on July 1, 2026.

Their official replacement suggestion? Babylon.js Sandbox. A browser-based viewer that doesn't support STL files — the single most common format in 3D printing, CNC machining, and CAD export workflows.

Let that sink in. The world's largest OS vendor is dropping 3D file support and pointing users to a tool that can't open the most popular 3D format.

The Full Death Timeline

This isn't a one-off. Microsoft has systematically killed every 3D tool they built:

2017: "3D for Everyone!" — Creators Update launches
2023: Windows Mixed Reality deprecated
2024: HoloLens 2 production stopped
2024: Paint 3D removed from Store (November 4)
2024: FBX support permanently disabled in 3D Viewer (CVE-2024-20677)
2026: 3D Viewer deprecated (February)
2026: 3D Viewer removed from Store (July 1) ← YOU ARE HERE

Nine years from "3D for Everyone" to "3D for Nobody."

Why Developers Should Care

If you build anything that outputs 3D files — CAD tools, slicers, generative design, AI mesh generation, game asset pipelines — your Windows users just lost their default preview tool.

"Just use Blender" is the wrong answer. Blender is a 200MB professional content creation suite. Your user doesn't want to install Blender to preview an STL. They want to double-click and see the model.

The Browser-Based Alternative

I've switched to GeometryViewer for day-to-day 3D file preview. Here's why it works for my workflow:

Format coverage that actually matters:

STL (binary + ASCII), OBJ, GLB/GLTF, 3MF, FBX, PLY, STEP, DAE
Drag and drop — no account, no install
Works offline as a PWA (install once, works without internet)

Developer-friendly features:

Embeddable via <iframe> — one line to add a 3D viewer to any webpage
Shareable URLs — send a model preview link, recipient sees it instantly
Measurement tools, cross-sections, material simulation
No server-side rendering: everything runs client-side in WebGL

The STL gap filled:

Proper normal handling (no inverted faces on manifold models)
Material preview that shows what a 3D print will actually look like
Handles large meshes without choking (tested with 50M+ triangle scans)

What Happens to Existing Installs?

Microsoft clarified: existing 3D Viewer installations won't be auto-deleted. The app keeps working. But:

No security patches after July 1
Can't reinstall after a clean Windows install or new PC
The FBX parser already has a known RCE vulnerability (CVE-2024-20677, CVSS 7.8) that Microsoft "fixed" by disabling FBX entirely

Running unpatched software that already had an RCE is not a plan. It's a liability.

If You're Building 3D Tools

Consider this the canary in the coal mine. Microsoft is not coming back to desktop 3D. Their 3D strategy is now:

Azure Remote Rendering (enterprise, cloud-based)
Partnership with Meta (Xbox branding on Quest hardware)
Copilot (AI-generated 3D via text prompts — not file viewing)

Desktop 3D file viewing is officially an open-source problem now. If your app generates 3D output and your users are on Windows, you need to either:

Bundle your own viewer
Point them to a browser-based solution like GeometryViewer
Or accept that they'll have no way to preview your files

The era of "Windows handles 3D files natively" ended in February 2026. Plan accordingly.

Why DDR5 Bandwidth Kills Dual-LLM Inference on APUs (Benchmarks Inside)

Josh Green — Thu, 28 May 2026 15:43:47 +0000

Did you know that a 35-billion-parameter model can generate tokens at the same compute cost as a 4B model? That single fact made me abandon a multi-model agent architecture I'd spent a weekend building. But I had to run the benchmarks first to understand why.

Here's the full breakdown, with commands, numbers, and the architectural reason it all falls apart on shared-memory hardware.

The Discovery That Changed Everything

I'd been running qwen3.6:35b on my Minisforum UM790Pro for weeks -- it's my daily driver for everything from coding to running GeometryViewer for 3D model previews. 17.8 tokens/second -- genuinely usable for interactive work. But I kept wondering: could I run a lightweight sidecar model alongside it for quick classification and tool-calling in an agent pipeline?

Before I even started benchmarking, I dug into what qwen3.6:35b actually is under the hood. It's a Mixture of Experts model: 256 total experts with only 8 activated per token. The architecture also incorporates SSM (State Space Model) components alongside traditional attention -- Mamba-style layers that handle certain sequence patterns more efficiently than pure transformers.

The math hit me: 8 out of 256 experts means each token only touches roughly 4-5B parameters worth of compute. The model carries 36 billion parameters of knowledge, but its per-token cost is comparable to a small dense model. I was planning to run a separate 4B model for "fast tasks" next to a model that already operates at 4B-class speed.
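The arithmetic behind that estimate can be sketched in a few lines. The split between weights that run for every token (attention, embeddings, routers) and weights that live inside experts is an assumption here, not a published figure. The `shared_b=3.0` value is picked only to show how 8-of-256 routing ends up near 4B.

```python
def active_params_b(total_b, shared_b, experts_total, experts_active):
    """Parameters (in billions) touched per token by a mixture-of-experts model.

    shared_b is everything that runs for every token; the remainder is assumed
    to be split evenly across the experts.
    """
    expert_pool_b = total_b - shared_b
    return shared_b + expert_pool_b * experts_active / experts_total

# 36B total and 8-of-256 routing come from the article; shared_b=3.0 is a guess.
per_token = active_params_b(total_b=36.0, shared_b=3.0, experts_total=256, experts_active=8)
print(f"~{per_token:.1f}B parameters per token")
```

The useful part is the shape of the formula: the expert pool shrinks by a factor of 32, so almost all of the per-token cost comes from whatever is shared.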

But I had to prove it with numbers.

Hardware and Ollama Setup

The UM790Pro specs that matter for this experiment:

CPU: AMD Ryzen 9 7940HS (Zen 4, 8C/16T)
iGPU: AMD Radeon 780M (12 RDNA 3 compute units)
RAM: 96 GB DDR5-5600 (~80 GB/s bandwidth)
GPU memory pool: 2 GB dedicated VRAM + 46 GB GTT = 48 GB GPU-accessible

That 48 GB GPU pool sounds enormous until you realize it's carved from the same DDR5 that the CPU also uses. There is no separate GDDR6 bus. Everything -- CPU inference, GPU inference, KV caches, OS operations -- flows through one 80 GB/s pipe.

Four models under test, managed through Ollama:

# Pull the models
ollama pull qwen3.6:35b
ollama pull gemma4-e2b-abliterated
ollama pull qwen3:4b-instruct
ollama pull qwen2.5:1.5b

# Check what's loaded and where
ollama ps

ollama ps shows you which models are in memory and whether they're on GPU or CPU. For forcing CPU-only inference (critical for these tests), you pass num_gpu as a model parameter:

# Force a model onto CPU -- zero GPU layers
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4-e2b-abliterated",
  "prompt": "Explain quicksort in 3 sentences.",
  "options": { "num_gpu": 0 }
}'

Setting num_gpu: 0 tells Ollama to offload zero layers to the GPU, keeping the entire model in system RAM for CPU-only inference. This is how I isolated CPU vs GPU performance and tested mixed configurations.
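If you would rather script the same request than paste curl, here is a stdlib-only sketch. It assumes a local Ollama server on the default port and reads the `eval_count` and `eval_duration` fields that the non-streaming response reports. The helper names are mine, and this is not necessarily how the numbers below were collected.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model, prompt, cpu_only=False):
    """Request body for /api/generate; stream=False returns one JSON object."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if cpu_only:
        payload["options"] = {"num_gpu": 0}  # zero layers offloaded to the GPU
    return payload

def tokens_per_second(result):
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)

def generate(model, prompt, cpu_only=False, timeout=600):
    """POST one prompt to the local Ollama server and return the parsed response."""
    body = json.dumps(build_payload(model, prompt, cpu_only)).encode("utf-8")
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())

# Usage, with the server running:
#   result = generate("gemma4-e2b-abliterated", "Explain quicksort in 3 sentences.", cpu_only=True)
#   print(f"{tokens_per_second(result):.1f} tok/s")
```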

To verify VRAM allocation, ollama ps gives you the breakdown:

NAME                          SIZE     PROCESSOR    UNTIL
qwen3.6:35b                   32.2 GB  100% GPU     4 minutes from now
gemma4-e2b-abliterated:latest  4.1 GB  100% GPU     4 minutes from now

On a discrete NVIDIA card you'd cross-reference with nvidia-smi, but on an AMD APU the GTT allocation is only visible through ollama ps or by reading /sys/kernel/debug/dri/0/amdgpu_gem_info.

The Benchmark Results

Every test used identical prompts fired simultaneously at both models. I measured generation throughput (tokens/second) across solo and dual-model runs.

Solo Baselines

Model	Parameters	GPU (tok/s)	CPU (tok/s)
qwen3.6:35b	36B (MoE)	17.8	--
gemma4-e2b-abliterated	4.6B	42.9	28.7
qwen3:4b-instruct	4B	26.2	19.6
qwen2.5:1.5b	1.5B	--	53.4

Dual-Model Runs

Both on GPU -- qwen3.6:35b + gemma4-e2b:

Model	Solo	Dual	Performance Hit
qwen3.6:35b (GPU)	17.8	13.1	-26%
gemma4-e2b (GPU)	42.9	25.3	-41%

GPU + tiny CPU -- qwen3.6:35b (GPU) + qwen2.5:1.5b (CPU):

Model	Solo	Dual	Performance Hit
qwen3.6:35b (GPU)	17.8	14.9	-16%
qwen2.5:1.5b (CPU)	53.4	26.2	-51%

GPU + medium CPU -- qwen3.6:35b (GPU) + gemma4-e2b (CPU, num_gpu=0):

Model	Solo	Dual	Performance Hit
qwen3.6:35b (GPU)	17.8	13.0	-27%
gemma4-e2b (CPU)	28.7	13.4	-53%

GPU + large-context CPU -- qwen3.6:35b (GPU) + qwen3:4b-instruct (CPU, num_gpu=0):

Model	Solo	Dual	Performance Hit
qwen3.6:35b (GPU)	17.8	11.6	-35%
qwen3:4b-instruct (CPU)	19.6	11.1	-43%

That last combination was the worst. The 4B instruct model supports 256K context, and its KV cache ballooned to 24.2 GB. Combined with the 35B model's 32 GB GPU allocation, we were saturating every available byte of bandwidth.
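The "Performance Hit" column in these tables is just the relative change from the solo rate. A quick sketch for recomputing it from your own runs, using the last table's figures as the example:

```python
def performance_hit(solo_tok_s, dual_tok_s):
    """Percent change in throughput when a second model runs alongside."""
    return (dual_tok_s - solo_tok_s) / solo_tok_s * 100

# (solo, dual) tokens/second for the GPU + large-context CPU run above.
runs = {
    "qwen3.6:35b (GPU)": (17.8, 11.6),
    "qwen3:4b-instruct (CPU)": (19.6, 11.1),
}
for name, (solo, dual) in runs.items():
    print(f"{name}: {performance_hit(solo, dual):+.0f}%")
```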

Why It Happens: One Bus to Rule Them All

On a discrete GPU setup, the CPU reads model weights from DDR5 over its memory controller while the GPU reads from its own GDDR6 over a completely separate bus (often 300+ GB/s). Two independent pipes, no contention.

On an APU, both the Zen 4 CPU cores and the RDNA 3 compute units share a single memory controller connected to the same DDR5 DIMMs. Dual-channel DDR5-5600 tops out at 89.6 GB/s on paper (5600 MT/s × 8 bytes × 2 channels), roughly 80 GB/s in practice, and that bandwidth is divided between every consumer.

DDR5-5600 (96 GB) -- ~80 GB/s shared
       |
  +----+----+
  |         |
CPU cores  780M iGPU
(Zen 4)    (12 CUs)
  |         |
 model      model
weights    weights
  |         |
  +-- SAME MEMORY CONTROLLER --+

LLM inference is almost entirely memory-bound. Each generated token requires streaming the model's weights through the compute units. A 35B MoE model activating 8 experts per token still needs to read those expert weights from memory every single time. When a CPU-side model is doing the same thing simultaneously, the two streams compete for the same bandwidth.
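A back-of-the-envelope ceiling makes the memory-bound claim easier to check. The sketch below assumes every active weight is read exactly once per token and nothing else touches the bus. Bytes per parameter are derived from the 23.9 GB weight blob mentioned further down, and the 4.5B active-parameter figure is simply the midpoint of my 4-5B estimate. Treat the result as an order-of-magnitude bound, not a prediction.

```python
def bandwidth_bound_tok_s(bandwidth_gb_s, active_params_b, bytes_per_param):
    """Upper bound on tokens/second if every active weight is read once per token."""
    gb_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_per_token

# From the article: ~80 GB/s shared bus, 36B parameters stored in a 23.9 GB blob.
# Assumed: 4.5B active parameters per token (midpoint of the 4-5B estimate).
bytes_per_param = 23.9 / 36.0
bound = bandwidth_bound_tok_s(80.0, active_params_b=4.5, bytes_per_param=bytes_per_param)
print(f"ceiling ~{bound:.0f} tok/s at {bytes_per_param:.2f} bytes/param")
```

The measured 17.8 tok/s sits under that ceiling, which is what you would expect once KV-cache reads and the rest of the system take their share of the same bus. A second model streaming its own weights eats directly into what is left.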

Even the "best" dual-model result (35B GPU + 1.5B CPU) cost 16% on the big model. The 1.5B model is tiny enough that its memory footprint barely dents bandwidth -- but it still halved its own throughput because the 35B model was dominating the bus.

The Agent Framework Problem

My original goal was a planner-executor agent setup: the 35B model reasons about what to do, a small model handles tool calls. Sounds efficient in theory.

In practice, agent frameworks are sequential. The planner generates a plan, then the executor runs a tool, then the planner evaluates the result. At any given moment, only one model is actively generating. The other sits idle in memory, consuming VRAM or RAM that could instead feed the active model a larger context window.

Combined with the MoE insight -- the 35B model already runs at small-model speeds -- the dual-model architecture solves a problem that does not exist on this hardware.

Bonus: Finding Orphan Blobs in Ollama

While investigating model storage during this project, I found 12.9 GB of wasted disk space. Ollama uses content-addressed storage under ~/.ollama/models/, so multiple model tags can reference the same weight blob. But when you delete a model, the blob sometimes lingers.

Here's how to find orphans:

# 1. Collect every blob hash referenced by a manifest
find ~/.ollama/models/manifests -type f \
  -exec grep -oh 'sha256:[a-f0-9]*' {} \; | sort -u > /tmp/referenced_blobs.txt

# 2. List every blob on disk (files are named sha256-<hash>, manifests say sha256:<hash>)
ls ~/.ollama/models/blobs/ | sed 's/^sha256-/sha256:/' | sort -u > /tmp/disk_blobs.txt

# 3. Find blobs on disk that no manifest references
comm -13 /tmp/referenced_blobs.txt /tmp/disk_blobs.txt

Any hash that appears in the output of step 3 is an orphan. There's no ollama prune command yet, so you delete them manually. On my system this reclaimed nearly 13 GB from a single forgotten blob.
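The same check can be done in Python if you prefer something you can extend, for example to print sizes before deleting anything. This sketch assumes the layout described above: manifests contain `sha256:<hash>` strings and blob files are named `sha256-<hash>`. The function name is mine, and it only reports; it never deletes.

```python
import re
from pathlib import Path

DIGEST_RE = re.compile(r"sha256:([a-f0-9]{64})")

def find_orphan_blobs(models_dir):
    """Return blob files under models_dir/blobs that no manifest mentions."""
    models_dir = Path(models_dir)
    referenced = set()
    for manifest in (models_dir / "manifests").rglob("*"):
        if manifest.is_file():
            referenced.update(DIGEST_RE.findall(manifest.read_text(errors="ignore")))
    orphans = []
    for blob in sorted((models_dir / "blobs").iterdir()):
        digest = blob.name.replace("sha256-", "", 1)
        if digest not in referenced:
            orphans.append(blob)
    return orphans

# Usage (prints only; deleting is left to you):
#   for blob in find_orphan_blobs(Path.home() / ".ollama" / "models"):
#       print(blob, blob.stat().st_size)
```

Anything that is not a plain `sha256-<hash>` file, such as a partial download, will also show up in the list, so read it before removing anything.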

Also worth knowing: qwen3.6:35b, qwen3.6:latest, and qwen3.6:35b-nothink all resolve to the same 23.9 GB blob. Ollama's content-addressing means you're not actually tripling your disk usage by pulling multiple tags of the same weights.

The Verdict

If you're running local LLMs on a shared-memory APU (any AMD APU, any Intel with Arc iGPU, any machine without a discrete GPU), here's the takeaway:

One model at a time. The memory bus is your bottleneck, and dual-model inference taxes it regardless of CPU/GPU split.
MoE models are your best friend on this hardware. You get large-model reasoning quality at small-model inference cost. No need for a sidecar.
Use your surplus RAM for context, not extra models. A single 35B MoE with a 64K context window is more useful than two models fighting over bandwidth.
Watch for Ollama's iGPU memory reporting bug (#14953) -- loading multiple models can trigger OOM crashes because Ollama misjudges available iGPU memory.
Audit your blob storage. Orphan blobs from deleted models add up fast.

The UM790Pro with 96 GB of DDR5 is genuinely impressive hardware for local inference. 17.8 tok/s from a 35B-class model on an integrated GPU, in a box the size of a paperback. Just don't try to make it do two things at once.

If you're into 3D printing and web dev like me, check out GeometryViewer -- a browser-based 3D model viewer I built that runs great on this same hardware. And you can find my other projects on GitHub.

Tested on: Minisforum UM790Pro, Ryzen 9 7940HS, 96 GB DDR5-5600, Ollama v0.9.x, Ubuntu Linux.

Free LLMs on OpenRouter Keep Going 404. I Fixed It With 120 Lines of Python

Josh Green — Sun, 08 Mar 2026 17:54:08 +0000

I built a small pipeline on OpenClaw to stay on top of 3D printing news.

Nothing fancy — a Python script that pulls from YouTube, RSS feeds, and Reddit, uses a free LLM to summarize what's worth reading, and emails me a digest. I use OpenRouter's free tier because I'm cheap and the models are good enough for summarization.

It worked great. For about two weeks.

Then I started getting errors.

The problem nobody talks about

Here's something I didn't fully appreciate until it bit me: free models on OpenRouter change constantly. Models get added, removed, rate-limited into uselessness, or quietly replaced with different versions. If you hardcode your model list — which every tutorial tells you to do — you're building on sand.

One morning I woke up to this:

[06:03] LLM HTTP 404 [openai/gpt-oss-120b:free]: model not found
[06:03] LLM HTTP 429 [nousresearch/hermes-3-llama-3.1-405b:free]: rate limited
[06:03] LLM HTTP 404 [mistralai/mistral-small-3.1-24b-instruct:free]: model not found
[06:03] All free models exhausted — returning empty

Three of my six hardcoded models were dead. The pipeline silently produced nothing. I missed a week of content before I noticed.

Hardcoded lists are technical debt. Free model availability is a moving target. These two facts collide badly.

The fix: treat the model list as a live data source

OpenRouter has a public endpoint — no auth required — that returns their full model catalog:

GET https://openrouter.ai/api/v1/models

It returns ~346 models right now. Filtering to free ones with decent context windows gives you 10-15 candidates. The question is: which ones are actually worth using?
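Here is a rough sketch of that filtering step, stdlib only. It assumes the catalog response holds a `data` list whose entries carry `id` and `context_length`, and it treats the `:free` suffix on the id as the marker for a free model. The 32,000-token floor is an arbitrary stand-in for "decent context window", so pick your own.

```python
import json
import urllib.request

CATALOG_URL = "https://openrouter.ai/api/v1/models"

def fetch_catalog(timeout=30):
    """Download the public model catalog (no API key needed)."""
    with urllib.request.urlopen(CATALOG_URL, timeout=timeout) as resp:
        return json.loads(resp.read())

def free_candidates(catalog, min_context=32_000):
    """Keep models whose id ends in ':free' and whose context is long enough.

    Assumes each entry under 'data' carries 'id' and 'context_length'.
    """
    picked = []
    for model in catalog.get("data", []):
        model_id = model.get("id", "")
        context = model.get("context_length") or 0
        if model_id.endswith(":free") and context >= min_context:
            picked.append((model_id, context))
    return sorted(picked, key=lambda item: item[1], reverse=True)

# Usage: candidates = free_candidates(fetch_catalog())
```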

I wanted to rank them. My criteria:

Context window — longer is better for summarization. A 262K context model can swallow an entire article thread without chunking.
Model size — bigger models write better. A 70B model beats a 7B model for prose quality.
Historical reliability — has this model actually worked when I've called it before?

That last one is the one nobody tracks. So I built tracking.

model-registry.py — the discovery layer

The registry script runs once every 6 hours. It:

Checks if the cache (~/.openclaw/free-models.json) is fresh — if yes, exits in <100ms (just a file stat)
If stale, hits the OpenRouter catalog and scores every free model:

def score_model(model_id, context_length):
    context_score = min(context_length / 1000, 200)  # caps at 200
    size_score = get_size_score(model_id)             # regex: 405b=200, 70b=140, 8b=50...
    return context_score + size_score

Takes the top 10, writes them to free-models.json
Logs a diff — "Added: X, Removed: Y since last run"
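The snippet above leaves `get_size_score` out, so here is one hypothetical way to fill it in. Only the three anchor points (405b=200, 70b=140, 8b=50) come from the comment in the snippet. The regex, the log-scale interpolation between anchors, and the clamping outside them are my assumptions for illustration, not the registry's actual rules.

```python
import math
import re

# Anchor points quoted in the snippet above: (billions of parameters, score).
SIZE_ANCHORS = [(8, 50), (70, 140), (405, 200)]
SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)b\b")

def get_size_score(model_id):
    """Score a model by the largest '<number>b' token in its id; 0 if none."""
    sizes = [float(s) for s in SIZE_RE.findall(model_id.lower())]
    if not sizes:
        return 0.0
    size = max(sizes)
    if size <= SIZE_ANCHORS[0][0]:
        return float(SIZE_ANCHORS[0][1])
    if size >= SIZE_ANCHORS[-1][0]:
        return float(SIZE_ANCHORS[-1][1])
    for (lo, lo_score), (hi, hi_score) in zip(SIZE_ANCHORS, SIZE_ANCHORS[1:]):
        if lo <= size <= hi:
            # Interpolate on a log scale between the two nearest anchors.
            t = (math.log(size) - math.log(lo)) / (math.log(hi) - math.log(lo))
            return lo_score + t * (hi_score - lo_score)

def score_model(model_id, context_length):
    context_score = min(context_length / 1000, 200)  # caps at 200
    return context_score + get_size_score(model_id)
```

Taking the largest number means an id like `qwen3-next-80b-a3b` is scored on its 80B total rather than its 3B active parameters, which matches how the article talks about that model.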

The diff log is where it gets interesting. On my first run after building this, I discovered two models I'd never heard of that scored in my top 6. One of them — qwen/qwen3-next-80b-a3b-instruct:free — has a 262K context window and an 80B parameter count. It's now my primary model. It wasn't in any tutorial I'd read.

model-metrics.py — the performance layer

HTTP 200 doesn't mean the model was useful. A model can return 200 with three sentences of hallucinated nonsense that breaks your JSON parser downstream.

So I added tracking at two levels:

Level 1 — HTTP success:

t0 = time.time()
try:
    resp = urllib.request.urlopen(req, timeout=90)
    content = resp.read()...
    record_metric(model_id, task, success=True,
                 latency_ms=int((time.time()-t0)*1000),
                 output_len=len(content))
except urllib.error.HTTPError as e:
    record_metric(model_id, task, success=False,
                 latency_ms=..., error_code=str(e.code))

Level 2 — parse success (parse_ok):

After every call that expects structured JSON, I record whether the downstream parsing succeeded:

response = call_free_llm(prompt, task="claim_extraction")
try:
    data = json.loads(response)
    update_parse_ok(True)   # output was actually usable
    return data
except json.JSONDecodeError:
    update_parse_ok(False)  # model returned garbage

parse_ok is the metric I care about most. It answers: was this model actually useful, not just technically responsive?
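For anyone wiring up something similar, here is a minimal sketch of how both signals could live in one SQLite table. The schema, the extra `db` argument, and folding `parse_ok` into `record_metric` are simplifications of mine; the real script keeps them as separate calls, as the snippets above show.

```python
import sqlite3

def open_metrics(path=":memory:"):
    """Open (or create) the metrics database. Table layout is illustrative."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS calls (
        model_id TEXT, task TEXT, success INTEGER,
        latency_ms INTEGER, error_code TEXT, parse_ok INTEGER)""")
    return db

def record_metric(db, model_id, task, success, latency_ms, error_code=None, parse_ok=None):
    """Store one call; parse_ok stays NULL when no structured output was expected."""
    db.execute("INSERT INTO calls VALUES (?, ?, ?, ?, ?, ?)",
               (model_id, task, int(success), latency_ms, error_code,
                None if parse_ok is None else int(parse_ok)))
    db.commit()

def model_stats(db):
    """Per-model call count, HTTP success %, parse success % and mean latency."""
    rows = db.execute("""
        SELECT model_id, COUNT(*), 100.0 * AVG(success),
               100.0 * AVG(parse_ok), AVG(latency_ms)
        FROM calls GROUP BY model_id ORDER BY COUNT(*) DESC""")
    return {m: {"calls": n, "ok_pct": ok, "p_ok_pct": p_ok, "avg_ms": ms}
            for m, n, ok, p_ok, ms in rows}
```

SQLite's `AVG` skips NULLs, so in this sketch the parse percentage is computed only over calls that actually expected JSON.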

After a week of pipeline runs, I get a table like this:

Model                                      calls  ok%  p_ok%  avg_ms  errors
meta-llama/llama-3.3-70b-instruct:free       47   94%   88%   1240ms
qwen/qwen3-next-80b-a3b-instruct:free        31   97%   91%   1180ms
openai/gpt-oss-120b:free                     12   58%   42%   1890ms  5×404
nousresearch/hermes-3-llama-3.1-405b:free    8    62%   55%   2100ms  3×404

The last two models look fine on paper (they're large, they have long context) but they're dying constantly. Their scores get penalized:

def score_penalty(stats_entry):
    ok = stats_entry["ok_pct"]
    if ok < 50: return 0.3   # heavy penalty
    if ok < 70: return 0.7
    if ok < 85: return 0.9
    return 1.0               # no penalty

final_score = catalog_score * score_penalty(historical_stats)

When the registry next refreshes, those models sink to the bottom of the fallback chain. Automatically. Without me touching anything.
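Put together, the ranking and the fallback chain can be sketched like this. The thresholds are the ones from `score_penalty` above. Treating a model with no history as unpenalized, and the `call_model` callable you pass in, are assumptions for the example.

```python
import urllib.error

def rank_models(catalog_scores, ok_pct_by_model):
    """Order model ids by catalog score times the reliability penalty above."""
    def penalty(ok):
        if ok < 50:
            return 0.3
        if ok < 70:
            return 0.7
        if ok < 85:
            return 0.9
        return 1.0
    # Models with no history yet are treated as unpenalized (an assumption).
    final = {m: s * penalty(ok_pct_by_model.get(m, 100))
             for m, s in catalog_scores.items()}
    return sorted(final, key=final.get, reverse=True)

def call_with_fallback(models, call_model):
    """Try each model in order; return (model_id, text) from the first that answers."""
    for model_id in models:
        try:
            return model_id, call_model(model_id)
        except urllib.error.HTTPError as err:
            print(f"LLM HTTP {err.code} [{model_id}]: trying next model")
    return None, ""
```

A big model with a 58% success rate gets multiplied by 0.7, which is enough to drop it behind a smaller model that simply answers every time.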

The result

The pipeline now:

Discovers new free models within 6 hours of them appearing on OpenRouter
Drops dead models from the rotation within one pipeline run
Prioritizes models with proven parse reliability, not just raw specs
Costs $0.00 extra — one public HTTP GET every 6 hours

The whole thing is ~250 lines across two files. No pip dependencies for the registry itself (stdlib only — json, urllib, sqlite3). The metrics use SQLite so they survive reboots and redeploys.

Grab the code

model-registry.py and model-metrics.py — both standalone, drop them next to any script that calls OpenRouter:

# Replace your hardcoded list with this:
REGISTRY_PATH = Path.home() / ".openclaw" / "free-models.json"
_FALLBACK = ["meta-llama/llama-3.3-70b-instruct:free", ...]

def load_free_models():
    try:
        data = json.loads(REGISTRY_PATH.read_text())
        models = [m["id"] for m in data.get("models", [])]
        if len(models) >= 2:
            return models
    except Exception:
        pass
    return list(_FALLBACK)

FREE_MODELS = load_free_models()

Run the registry as a preflight step before any pipeline that uses free models. If the cache is fresh, it exits immediately. If it's stale, it updates in ~1 second.

python3 model-registry.py --max-age 21600   # refresh if >6h old
python3 your-pipeline.py                     # now uses fresh model list

The thing I keep thinking about: I built this to find 3D printing news: the RepRap machines that print their own parts. Then foraging for news made me realize I needed this algorithm. Now the algorithm helps me find better news about the von Neumann probe itself. It's turtles all the way down, but at least they're free turtles.

Full code on GitHub Gist:

Never Had Such a Good Grip on Functional Prints Before

Josh Green — Sat, 07 Mar 2026 12:47:50 +0000

I bought my Bambu Lab P1S last year after moving to Budapest. Cheap rent, fast internet, and suddenly I had space for a hobby that wasn't just staring at VS Code for 14 hours a day.

Everyone warned me I’d print "a few useful things then revert to benchies and baby Yodas." They weren’t entirely wrong: I definitely have a small army of calibration cubes I should throw out. But prints like the one I saw today remind me why I got into this in the first place.

Someone designed and printed a custom backing for the hand-control hoods on their road bike. Not a phone mount. Not a GoPro adapter. The actual rubberized part your palms rest on for hours during a ride.

The Problem with Stock Parts
I spent three years doing Shopify theme development, and one thing that stuck with me: most products are designed for the mythical "average user." That works fine if you’re selling t-shirts. It doesn’t work when you’re talking about close contact human anatomy and high-performance equipment.

Road bike hoods are a perfect example. They’re mass-manufactured to fit some statistical middle ground of hand size and grip preference. My hands are slightly larger than average — not enough that I need custom everything, but enough that stock hoods always feel slightly off after a century ride.

Why This Print Matters
Being able to model your own hood backing and iterate on the shape changes everything. Too thick? Adjust the model. Want more texture in one area? Add it. Need a different angle for your specific bars? Measure twice, print once.

This is the kind of application that justifies the entire printer for me. Not the ability to make plastic trinkets cheaper than Amazon, but the ability to make things that don't exist for sale anywhere.

The Material Question
Here's where it gets interesting. Hoods need specific properties — grip when your hands are sweaty, some give so they're comfortable, but enough structure that they don't deform when you're pulling hard on a climb. Weather resistance matters too.

PLA won't survive a wet ride. ABS might work but it's not exactly pleasant against your skin. TPU seems like the obvious choice, though getting the right shore hardness would take some testing. Maybe nylon with a soft-touch coating?

I haven’t printed cycling components myself yet. My P1S mostly runs functional prints — brackets, organizers, the occasional prototype for a friend’s product display. But seeing functional bike parts like this makes me want to branch out.

What's Next
I need to find a local bike shop that's 3D-printing-curious. Or just suck it up and start modeling my own solutions to the minor annoyances that aren't worth designing and injection-molding, but are absolutely worth a weekend of CAD and a $2 print.

If you’ve done cycling components, hit me up. I’d love to know what materials actually hold up in the real world — not just in theory, but after six months of road grime and sweat.