Stephan Miller

Posted on May 25 • Originally published at stephanmiller.com on May 18

Building a Cost-Saving Agent Skill That Accidentally Became Its Own Weekly Blog Post

#vibecoding #largelanguagemodels

I had a vault note from a few weeks before this all came to a head. It said, in my own voice and barely punctuated, “I really need to figure out openrouter.” Past me wrote that and moved on. Past me did not yet know that the cost of figuring it out the wrong way is fifteen dollars per coffee break.

Here is what happened. I have a Pro Claude subscription. I love it. I rarely hit the weekly token cap. What kills me is the session timer: every weekend I have a more hours to actually code, I burn through one or two session windows fast, and I’m out. So I started looking around (opencode, Pi, OpenRouter) for a “swap in when Claude rate limits me” alternative or when I decide to let loose a bunch of agents on a project

That was a fine idea right up until I picked an unfamiliar model on OpenRouter, started a coding session, walked away to grab coffee, came back fifteen minutes later, and watched my OpenRouter dashboard tell me I’d just spent fifteen dollars. Fifteen bucks isn’t a fortune. Fifteen bucks per coffee break would add up pretty quick though. I sat there staring at the screen and thought: there are fifteen bazillion models on OpenRouter, and trial-and-error is a gambling problem.

So I built a thing. This is the story of that thing. It’s also the story of how the thing I built to save money turned into a weekly blog post which I post in my Large Language Models category.

The Session Time Trap
$15 in Fifteen Minutes
The Skill: At the Shape Level
The Adaptation Log: the Idea That Makes It Actually Work
Four Weeks of Evolution
The Recursive Twist
The Vault Is the Substrate
The Receipts

The Session Time Trap

For me, the weekly Claude cap in Pro is generous enough that I rarely brush it (for now, but that’s going to change). The session timer, on the other hand, ends my Saturday afternoon while I’m still in the middle of my work.

You can extend a session by turning on extra usage and paying by token. I’ve done that and I usually use that to finish whatever I’m working on, so I can stop using it for the day. But with open source models catching up to frontier models, I started wondering if I was limiting myself.

This is why I started looking at OpenRouter in the first place. Not to leave Claude. Just to have a parallel rail I could swap onto when the session timer killed me mid-bug-hunt. The hope was: never run out of session time again, just route to whatever model can keep going.

The problem was that I had no idea which model to route to. I’d open the OpenRouter rankings, see a hundred names I didn’t recognize, click one that sounded reasonable, and ship it.

$15 in Fifteen Minutes

I’m not going to name the specific model. It wasn’t entirely the model’s fault. It was mine. I picked it because it was cheap: I was being smart about costs, I told myself. I scrolled the OpenRouter listings, saw the pricing, thought “that’s a fraction of what Claude charges,” and picked it.

What I did not verify was whether it could actually handle tool calls correctly. It could not. Instead of completing a task and stopping, it looped. Called the same tools again. Got confused by the results. Called them again. It wasn’t reasoning: it was a stuck record that happened to cost money per revolution. Fifteen minutes of that loop, fifteen dollars out the door, and I’m wondering what just happened.

That was the moment I realized the thing I needed wasn’t another cheap model to test. The thing I needed was a system. Something that ran on the rotating model landscape, cross-referenced what was actually working, including which cheap models were actually reliable, and produced a reading I could trust without spending an afternoon on it. Because cheap and broken is more expensive than expensive and correct. Static instructions weren’t going to cut it either: the model space changes weekly. Hard-coding “use this one” would rot before I read this sentence back.

I needed a Claude Code skill.

The Skill: At the Shape Level

Here’s what it does. I’m going to describe the shape and not the recipe. The whole point of this kind of tooling is that you tune it to your tasks, your price tolerance, your stack, your willingness to test sketchy Chinese models on production code. If I posted my prompts and you ran them, you’d get something generic and so would I. The value is in the tuning. So: shape, not recipe.

The skill cross-references three different kinds of signal:

Volume signal : where the actual money is flowing. Token counts on a public router platform. This tells you what people are trying, but not whether they kept using it.
Head-to-head signal : which model wins when two anonymous outputs are placed side by side and a human picks. This tells you what people prefer in voting conditions, but not what they’re using day to day.
Lived-experience signal : what people say after using a model for weeks. Specific projects, specific failures, specific switching stories. This tells you what’s actually working, but it’s loud-minority biased and slow.

Each one of those sources lies in its own way. The truth lives in the intersection. The skill’s whole job is to assemble the intersection into a five-minute brief that I read before I open OpenRouter on a weekend.

The brief lands in my Obsidian vault. It has trending movers, category-by-category breakdowns, hype-vs-value analysis, horror stories, upcoming releases. I read it Saturday morning. Then I open OpenRouter and I know which slug to type. That’s it.

The Adaptation Log: the Idea That Makes It Actually Work

Here is the architectural lesson worth taking away even if you don’t build a model-buzz skill specifically. It took me about two weeks of running the skill to figure this out, and once I did, the whole thing got dramatically better.

Static instructions rot. Especially in a domain that changes weekly. The skill I wrote in week one had assumptions about which sources were accessible, which categories existed, which org subreddits were active, which providers were stealth-launching models. Half of those assumptions were obsolete by week three. If I’d just kept editing the main prompt file every week to fix what I noticed, I’d have a Frankenstein-prompt by month’s end and no memory of why any specific line was there.

What works is making the skill take notes to itself. After every run, the skill appends to a small log file: things observed, things that broke, patterns worth carrying forward, false patterns to avoid. The next run reads that log first, before doing anything else, so it walks in already smarter than the version that ran a week ago.

A few examples of the kind of thing that ends up in the log, abstracted enough that they’re useful but not specific enough to be a recipe:

Some signals look like adoption but are actually marketing stunts. When a brand-new model spikes massively in volume during a free promotional window that expires next week, that volume isn’t telling you what it looks like it’s telling you. The skill learned to identify this pattern after the second time it almost led with a marketing-stunt headline.
Some sources stop working without telling you. A subreddit gets locked. An API starts returning 403s. A tool’s “trending” tab quietly changes its sort order. The log records what to fall back to.
Some patterns repeat with a delay. When two big labs ship competing models on the same day, they’re timing each other to split the news cycle. The skill now knows to look for the second drop instead of treating the first as the only story.

These are the kinds of patterns that don’t survive in static instructions. They survive in a log that gets re-read every run. The adaptation log is the difference between a tool that gets dumber every week and a tool that gets smarter.

If you build any AI workflow that has to operate in a domain that changes faster than your prompts, this is the architectural pattern. Static instructions plus a self-edited operational log. The log is small, but the log is everything.

Four Weeks of Evolution

Here’s the actual play-by-play of how this thing has changed since I built it:

Week one. First run. I was just trying to get the data without losing my mind. Reddit blocked me on day one. I had to teach the skill to use aggregator sites instead of direct Reddit access. Wrote that down in the log. Already, week one, the skill was learning things I hadn’t anticipated.

Week two. Two big labs shipped major models on the same day. The skill’s instructions handled “one model launches, here’s how to cover it” but not “two simultaneous launches that are partially competing for the same coverage slot.” I had to teach it to compare the two announcements rather than just covering them in sequence. Wrote that down too.

Week three. First fake spike. A major vendor gave a new model away for free, the rankings rocketed by quadruple-digit percentages, and the entire signal was meaningless. The skill nearly led with the spike as the week’s headline before I caught it and made it rework the section. The log gained a new pattern: free-period distortion. Future runs will detect it automatically.

Week four. The big pivot. I was doing the cheapskate math for the third Sunday in a row and noticed something structural: the head-to-head leaderboard at the top is compressed. The entire visible top end fits inside a tiny rating spread. The prices fan out by an order of magnitude or more. So the question every week wasn’t really “what’s best”, but “what’s the cheapest option that’s still in striking distance of best, by category.” I wrote a methodology for it, wired it into the skill as the new centerpiece, and now the brief leads with this every week. The skill is meaningfully different from what it was three weeks ago, and it’ll be different again next week.

The pattern across all four weeks: every week I find some piece of the analysis I’m doing by hand, and I move it into the skill. The skill is a running record of what should be automated next.

The Recursive Twist

I built this to save money on AI usage. That’s still what it does. I haven’t burned fifteen dollars in five minutes since I started using it. Mission accomplished.

What I did not anticipate is that the brief the skill produces is also a perfectly fine weekly blog post. The post drives traffic. The traffic justifies more time on the skill. The cycle compounds.

I did not plan any of this. It just happened. There is something structurally different about tools that produce content versus tools that just save you time. Tools that save time give you a quieter afternoon. Tools that produce content have an exhaust pipe and once you notice the exhaust pipe, you start aiming it at things that need promoting. The promotional material is free, because it was always going to get produced. The only question was where it was going to go.

The meta-twist that gets me every time: this blog post you’re reading right now exists because the skill exists. The skill produced its own promotional material this week. I am writing a post about a thing that wrote a post about itself. I’m not entirely sure what to do with that information except keep going.

The Vault Is the Substrate

The arrangement looks like this:

The skill produces a brief in my Obsidian vault every Saturday morning
The brief gets handed to another skill, which produces a draft post
The draft gets fact-checked, edited, and shipped to Jekyll
The shipped post becomes a backlink the next week’s brief reads as prior context, so the skill knows which stories are already covered

The vault is the substrate. The skill is the engine. The blog is the surface. Each layer leaves traces the others read. None of this is a content management system. It’s a notebook with skills attached to it.

This is what vibe coding looks like for content instead of code. Build a thing, watch it tell you what to build next, iterate: but the artifacts are paragraphs instead of pull requests. Same kind of accidental compounding. Same need to keep an adaptation log so you don’t end up with a stack of stale tooling that you have to keep fighting.

The Receipts

I haven’t accidentally burned fifteen dollars in fifteen minutes since the skill went live. I now spend more time on the skill than the skill saves me, but in the spending I get a weekly blog post and a continuously-updated map of the model landscape that I trust enough to use. The math, in dollar terms, is fine. The math, in time, is also fine because the time produces content.

The skill is going to be different by next week. I’ll find a new pattern, write a new adaptation note, refactor a section of the methodology.

Meanwhile, while finishing this up, I noticed two more patterns that should be in the adaptation log. When a major free-period model retires its free tier, and the rankings haven’t normalized yet, there’s a “transition week” pattern the skill doesn’t handle gracefully. And the stealth slot on the rankings has been quietly sitting on a codename for over a week without resolution, which the skill currently treats as “must resolve within 48 hours” — that assumption needs to relax. So I’m going to add notes for both of those, kick off the next run, and we’ll see what shows up Saturday.

You can find those posts in my Large Language Models category.

Top comments (1)

Harjot Singh • May 31

Love that the cost-saving skill accidentally turned into recurring content - that's the underrated compounding win of building agent skills: each one is reusable infrastructure AND a story, so the work pays twice. The "skill" abstraction itself is the interesting part - packaging a capability the agent can invoke (with its own scoped logic and cost profile) is way more maintainable than stuffing everything into one mega-prompt.

On the cost-saving angle specifically: the skills that save the most are usually the ones that prevent wasted work before it happens (cache a result, short-circuit a no-op, route a cheap subtask away from the expensive model) rather than ones that optimize after the fact. Prevention beats cleanup, in tokens as in everything. That's the kind of scoped, cost-aware capability I lean on in Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - small skills with tight cost profiles compose into a build that holds ~$3 flat. Fun post, and the accidental-blog-series outcome is great. What does the skill actually do to save cost - caching, routing, or skipping redundant work? Curious which lever it landed on.