Three models, three opinions, zero dollars

#ai #llm #devops #productivity

A few weeks ago I was paying about $1.50 every time I asked my tooling for a "second opinion" on a decision. Three model calls, roughly fifty cents of tokens each, fired several times a day. On the invoice it rounded to nothing. In aggregate it was the most expensive habit I had that produced no code.

This week the same three opinions cost me zero on the margin. Same named models, same prompts — I just route the checks through subscriptions I already pay a flat monthly rate for, so each additional council adds nothing to the bill. That routing is a five-minute trick and I'll get to it. What's worth more than the trick is why I run three models instead of one, because that's what actually changed the quality of what I ship.

One model agreeing with you is not a review

I used to run a single strong model over a decision and treat its answer as a check. It isn't one. A single model has a consistent way of being wrong, and when you prompt it to validate something you wrote, it tends to find the reading that lets it agree with you. You don't get a review. You get a confident echo.

What broke me of this was a rule I almost shipped into my own knowledge base. I'd written something like "don't trust the model for cumulative counting over long inputs" — I'd watched it lose track of a running total across a big document and I generalized it into a law of nature. One model, asked to sanity-check the rule, happily agreed; it had seen the same failure.

Then I put the same claim to a model from a different family. It pushed back immediately: that failure mode is specific to a particular lineage and prompt shape, not a universal property of language models, and here's a case where it doesn't reproduce. It was right. I'd been about to enshrine a one-vendor quirk as a general truth and let future-me act on it. A second opinion from the same lineage probably wouldn't have caught it, because the blind spot is shared.

In my experience the failures aren't scattered — they clump by vendor. When two models from different families both push back on the same thing, that's the spot I actually read.

What I actually run

Three voices, deliberately from at least two different lineages:

a GPT-class model (one vendor's reasoning lineage)
a mid-tier Claude (Sonnet)
a top-tier Claude (Opus)

Same prompt to all three. I ask each for a verdict and its reasoning, never just the verdict. The reasoning is where you see whether two "yeses" actually agree, or just happened to land in the same place by different logic. Then the synthesis is mine. The council doesn't vote and I don't average them. I read where they diverge, and that's almost always the exact spot I was hand-waving past.

Why three and not two or five? Five has been more slowdown than signal for me — the last two calls are usually just the first three opinions reworded, and you're waiting on the slowest one. Three is enough to break a tie and I haven't found a reason to go higher.

The routing, and the one gotcha

Here's the part that took the marginal cost to zero. Instead of billing each probe as a metered API request, I spend it against subscriptions I'm already paying a flat rate for. One honest caveat before you copy this: subscription seats are meant for interactive use, so check your own plan's terms before you wire this into a cron job — I run it for my own development, not as a way around metered billing.

# Pseudocode. The real version is a small router keyed on an env var.

def council(prompt):
    # One reasoning-heavy model from the other vendor, called through a
    # subscription-backed local harness rather than the metered API.
    gpt   = harness_probe(model="<your-gpt-class-model>", prompt=prompt)

    # Both Claude voices via the `claude -p` CLI, which spends the
    # subscription's OAuth seat rather than the metered API key.
    sonnet = claude_cli(model="sonnet", prompt=prompt)
    opus   = claude_cli(model="opus",   prompt=prompt)

    return gpt, sonnet, opus

The gotcha that cost me an afternoon: in my environment, claude -p preferred the metered API key whenever that key was present in the environment, which quietly defeats the whole point. Strip it from the child process so the CLI falls back to the subscription's OAuth path:

import os, subprocess

def claude_cli(model, prompt):
    env = {k: v for k, v in os.environ.items() if k != "ANTHROPIC_API_KEY"}
    out = subprocess.run(
        ["claude", "-p", "--output-format", "text", "--model", model],
        input=prompt, capture_output=True, text=True, env=env, timeout=600,
    )
    return out.stdout.strip()

The same claude -p --model X pattern runs headless, so a scheduled council uses the same path as an interactive one.

The crossover nobody runs the math on

The reason this matters beyond my invoice: my own usage had split into two cost buckets, and I was pricing both the same way out of habit.

Per-token API billing is cheap for a one-shot and punishing for repetition. A flat subscription is the other shape: silly for a single call, nearly free once you're running the same kind of check all day. The crossover is real and it's not high — at a handful of probes a day the metered call still wins, but at the volume I actually run, dozens of small second-opinions most days, the flat rate wins outright and stops me from rationing a check I should always be doing.

So I stopped architecting validation around "how many tokens will this burn" and started asking "which seat does this spend." A lot of us still price this work per token, myself included until recently, mostly because that's the number the invoice trains you to watch.

When I don't bother

A council validates; it doesn't invent. For a genuinely novel problem where I need a new idea rather than a check on an existing one, a three-way committee is the wrong tool, because it converges, and convergence is the last thing you want when you're actually stuck. And for a true one-shot decision I'll only ever make once, the metered API call is fine; the subscription routing only earns its keep when you're doing this all day.

But for the daily question — is this rule too broad, is this fix treating the symptom, did I miss something obvious — three models picking at it beats one agreeing with me. And now it costs nothing on the margin, which means I actually run it instead of talking myself out of the fifty cents.

Fillip Kosorukov co-authored peer-reviewed research in substance-use psychology (Journal of Substance Use, 2023; PMID: 37275205; BS Psychology, Summa Cum Laude, University of New Mexico, 2020) and now builds founder-led software. More at fillipkosorukov.net · ORCID 0009-0004-1430-2859.