<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Brian Mello</title>
    <description>The latest articles on DEV Community by Brian Mello (@brianmello).</description>
    <link>https://dev.to/brianmello</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3858439%2Ffbe563b7-f4da-44b2-83c2-72f137eae4ab.png</url>
      <title>DEV Community: Brian Mello</title>
      <link>https://dev.to/brianmello</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/brianmello"/>
    <language>en</language>
    <item>
      <title>I Let Three AIs Argue About My Vibe-Coded App — Here's What They Caught</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Fri, 15 May 2026 17:08:33 +0000</pubDate>
      <link>https://dev.to/brianmello/i-let-three-ais-argue-about-my-vibe-coded-app-heres-what-they-caught-3c2c</link>
      <guid>https://dev.to/brianmello/i-let-three-ais-argue-about-my-vibe-coded-app-heres-what-they-caught-3c2c</guid>
      <description>&lt;p&gt;I built a small side project in Cursor over a weekend. Login, dashboard, a couple of forms, a Stripe-style checkout flow. The kind of thing that &lt;em&gt;feels&lt;/em&gt; done. Clicking around, everything works. The vibes are immaculate.&lt;/p&gt;

&lt;p&gt;So I did the responsible adult thing: I shipped it.&lt;/p&gt;

&lt;p&gt;It broke in three places within 48 hours. None of the breaks were in code I had written by hand. They were in code an AI had generated that I had skimmed, nodded at, and moved on from.&lt;/p&gt;

&lt;p&gt;That's the trap of vibe coding. The AI is fluent. You're fluent at &lt;em&gt;reading&lt;/em&gt; what the AI made. Neither of you is the kind of pedantic loser who notices that the "Cancel" button on the checkout modal actually submits the form on mobile Safari because someone forgot to add &lt;code&gt;type="button"&lt;/code&gt; somewhere three components deep.&lt;/p&gt;

&lt;p&gt;This is the story of the second app I built, where I tried something different. I let three AI testing agents argue about my app before I shipped it. They caught seven things I would have missed. They also disagreed with each other in ways that, weirdly, made me trust the result more.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;The app was a simple expense-splitting tool — Splitwise but uglier and free. Built in Cursor, deployed on Vercel, total dev time around eight hours spread across two evenings. By the end I had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Email/password signup&lt;/li&gt;
&lt;li&gt;A "create group" flow&lt;/li&gt;
&lt;li&gt;An "add expense" form with split logic&lt;/li&gt;
&lt;li&gt;A settle-up view&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Standard vibe-coded SaaS skeleton. Worked on my machine. Looked fine on my phone.&lt;/p&gt;

&lt;p&gt;Instead of clicking around for an hour and calling it good, I pointed 2ndOpinion Testing at the URL. The pitch on the box is "AI agents test your app like real users, then cross-examine each other's findings." I'd seen the demo. This was the first time I'd used it on something I actually cared about.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "three AIs arguing" actually looks like
&lt;/h2&gt;

&lt;p&gt;The product runs three different model-backed agents against your app concurrently. Each one explores independently — clicking, typing, navigating like a confused new user who has never seen the thing before. They each file a report on what's broken or weird.&lt;/p&gt;

&lt;p&gt;Then comes the part that earns the courtroom metaphor in the marketing: the agents cross-examine each other's findings. Agent A claims the signup form is broken. Agent B says they signed up just fine. The system makes them reproduce, defend, or retract.&lt;/p&gt;

&lt;p&gt;You don't end up with three separate reports to read. You end up with one verdict: here's what's actually wrong, here's what one agent thought was wrong but couldn't reproduce, here's what all three independently flagged.&lt;/p&gt;

&lt;p&gt;Reading the final verdict felt like reading the minutes of a deposition. In a good way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The seven things they caught
&lt;/h2&gt;

&lt;p&gt;I'll walk through them in increasing order of "ouch, I should have caught that."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The signup email field accepted "test" as a valid email.&lt;/strong&gt; All three agents flagged this. Front-end validation was just &lt;code&gt;required&lt;/code&gt;, no &lt;code&gt;type="email"&lt;/code&gt;. Cursor had generated a form with the bare minimum and I hadn't tightened it. Five-second fix. Would have looked terrible the first time a real user mistyped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The "Add Expense" form let you submit $0.&lt;/strong&gt; Two of three agents tried it, both succeeded, both filed it. The third agent said "this is probably intentional, some groups track zero-dollar IOUs." The system made them argue about it. They settled on "probably a bug, ask the developer." It was a bug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The settle-up calculation rounded wrong on three-way splits.&lt;/strong&gt; $10 split three ways became $3.33 + $3.33 + $3.33, which is $9.99. Someone was always going to be a penny off. One agent caught it by splitting a coffee three ways and noticing the totals didn't reconcile. The other two had only tested two-way splits.&lt;/p&gt;
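&lt;p&gt;For reference, the standard fix is to do the math in integer cents and hand the leftover pennies out explicitly, rather than dividing floats. A minimal Python sketch (not the app's actual code):&lt;/p&gt;

```python
def split_evenly(total_cents, n):
    # Split an amount in integer cents so the shares always sum to the total.
    # The naive float version (10.00 / 3 = 3.33 each) drops a cent.
    base, remainder = divmod(total_cents, n)
    # The first `remainder` people pay one extra cent each.
    return [base + 1] * remainder + [base] * (n - remainder)

shares = split_evenly(1000, 3)  # $10.00 split three ways
print(shares)       # [334, 333, 333]
print(sum(shares))  # 1000 -- reconciles exactly
```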

&lt;p&gt;&lt;strong&gt;4. Pressing Enter in the "group name" field submitted the form before I'd added any members.&lt;/strong&gt; Only one agent caught this — the others were filling forms by clicking the submit button like polite humans. The one that pressed Enter found a half-broken state where the group existed but had no members and couldn't be edited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The mobile nav menu didn't close after navigating.&lt;/strong&gt; Two agents flagged it. Classic AI-generated React component thing. The menu had open/close state, but route changes didn't reset it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. The password reset email link 404'd.&lt;/strong&gt; I had not, in fact, set up the password reset route. The "Forgot password?" link went to &lt;code&gt;/reset-password&lt;/code&gt; which did not exist. I had written the link before writing the page and never come back to it. One agent found this by clicking every link on the login screen. Embarrassing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. The Stripe-style checkout for the (currently mocked) "Pro" tier accepted submissions but didn't go anywhere.&lt;/strong&gt; I had stubbed out the Pro upgrade flow and forgotten about it. The button looked real. The page it led to was a 404.&lt;/p&gt;

&lt;p&gt;Seven real things. None of them catastrophic, all of them the kind of thing that, on a launch day with twenty people poking at your app, accumulates into "this product feels janky."&lt;/p&gt;

&lt;h2&gt;
  
  
  The part I didn't expect: the disagreements
&lt;/h2&gt;

&lt;p&gt;The disagreements are what convinced me this approach actually works. Here are two:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Was the signup flow too slow?&lt;/strong&gt; One agent flagged the signup as "slow, took 4 seconds." The other two said it felt normal. The system made the first agent show its work. Turned out it had been testing on a throttled connection it had picked up from somewhere in its state, and the other two hadn't. The finding got retracted. If I had just had one agent, I'd have gone hunting for a phantom performance problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Was the "delete group" confirmation modal confusing?&lt;/strong&gt; Two agents thought the wording was unclear. The third said it was fine. The argument ended with "this is subjective, flagging for human review." That's the right answer. The tool wasn't pretending to be sure when it wasn't.&lt;/p&gt;

&lt;p&gt;I have used single-AI testing tools before. They sound confident about everything, including the wrong things. Watching three agents disagree and then resolve felt much closer to the experience of having three different humans review a PR. Some things were unanimous. Some things were noise. The noise got filtered before it got to me.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell another vibe coder
&lt;/h2&gt;

&lt;p&gt;A few things, in order of how often I've now had to repeat them to friends:&lt;/p&gt;

&lt;p&gt;You don't need to learn Playwright. You don't need to write Cypress specs. You don't even need to know what "end-to-end testing" is in the traditional sense. If you built your app in Bolt, Lovable, v0, or Replit, the testing tool you want is the same kind of thing — point it at a URL, let it figure out what to do.&lt;/p&gt;

&lt;p&gt;You do need to test &lt;em&gt;before&lt;/em&gt; you ship, not after. The temptation when you've spent a weekend vibing with an AI is to deploy on Sunday night, post on X, and hope. Resist. A 20-minute pre-flight on a Sunday afternoon catches the seven things that would have been a soft launch disaster.&lt;/p&gt;

&lt;p&gt;You should care about the disagreements more than the agreements. If your testing tool always sounds 100% confident, it's lying to you. Real bugs aren't unanimous. The interesting findings are the ones where one agent saw something and the others didn't — and you get told whether the holdout was right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;If you've vibe-coded anything in the last month and it's sitting in a Vercel deployment waiting for you to feel brave enough to share the link, I'd run it through this before you do.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://testing.get2ndopinion.dev" rel="noopener noreferrer"&gt;Try 2ndOpinion Testing →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You paste a URL. Three AIs argue about it. You ship with fewer surprises. That's the whole product.&lt;/p&gt;

&lt;p&gt;The Splitwise-but-uglier app is still up, by the way. Seven fewer embarrassments than it would have had. I'll take it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>vibecoding</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How to Add Multi-Model AI Code Review to Your CI/CD Pipeline</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Sat, 09 May 2026 17:31:55 +0000</pubDate>
      <link>https://dev.to/brianmello/how-to-add-multi-model-ai-code-review-to-your-cicd-pipeline-1p4i</link>
      <guid>https://dev.to/brianmello/how-to-add-multi-model-ai-code-review-to-your-cicd-pipeline-1p4i</guid>
      <description>&lt;p&gt;Running AI code review locally is fine for solo work. The moment you have a team, the question becomes: how do I make the AI an actual gate in the pipeline, not a thing one person remembers to run before they push?&lt;/p&gt;

&lt;p&gt;This is a walkthrough for wiring &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;2ndOpinion&lt;/a&gt; — the multi-model AI code review CLI — into a CI/CD pipeline. I'll show GitHub Actions in full, then sketch the same pattern for GitLab CI and CircleCI. The interesting decisions aren't where the YAML goes; they're around consensus thresholds, blocking vs informational mode, and what happens when Claude, Codex, and Gemini disagree on the same diff (which, from our review logs, is roughly 15% of the time).&lt;/p&gt;

&lt;h2&gt;
  
  
  What "AI code review in CI" actually means
&lt;/h2&gt;

&lt;p&gt;There are two shapes this takes, and the YAML is almost identical for either. The difference is the policy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Informational mode.&lt;/strong&gt; Every PR runs the review. Findings are posted as a comment or check annotation. Nothing blocks merge. Humans decide what to do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocking mode.&lt;/strong&gt; Review runs on every PR. If the consensus surface flags a HIGH severity finding, the check fails and merge is blocked until the author either fixes it or someone with override permission ships anyway.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I recommend starting in informational mode for the first week or two. AI reviewers — even three of them cross-examining each other — surface false positives. You want the team to learn the noise floor before the bot can block their merges; otherwise the first false-positive blocker generates a Slack thread that ends with "let's just turn this off."&lt;/p&gt;

&lt;h2&gt;
  
  
  The minimum GitHub Actions config
&lt;/h2&gt;

&lt;p&gt;Here's the workflow file I use as a starting point. Drop it in &lt;code&gt;.github/workflows/ai-review.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AI Code Review&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;opened&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;synchronize&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;reopened&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
      &lt;span class="na"&gt;pull-requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;fetch-depth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# need full history for diffs&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Node&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-node@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;20'&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install 2ndOpinion CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm install -g 2ndopinion-cli&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run multi-model review&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ANTHROPIC_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;GOOGLE_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GOOGLE_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITHUB_TOKEN }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;2ndopinion review \&lt;/span&gt;
            &lt;span class="s"&gt;--base origin/${{ github.base_ref }} \&lt;/span&gt;
            &lt;span class="s"&gt;--head HEAD \&lt;/span&gt;
            &lt;span class="s"&gt;--format github-comment \&lt;/span&gt;
            &lt;span class="s"&gt;--severity-threshold medium \&lt;/span&gt;
            &lt;span class="s"&gt;--comment-pr ${{ github.event.pull_request.number }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth calling out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;fetch-depth: 0&lt;/code&gt;&lt;/strong&gt; is necessary because &lt;code&gt;actions/checkout&lt;/code&gt; defaults to a shallow clone, and the CLI needs full history to compute the actual PR diff against the base branch. Skip this and your review runs against an empty diff, which produces a confidently empty review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three API keys.&lt;/strong&gt; Multi-model review means three providers. If you only set one, the CLI degrades to single-model mode and prints a warning. That's fine for a smoke test, but the whole reason you're doing this is the multi-model surface — the disagreement signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;--severity-threshold medium&lt;/code&gt;&lt;/strong&gt; suppresses LOW findings in the PR comment. LOW is mostly nits and style preferences, and posting them on every PR trains your team to ignore the bot. Keep MEDIUM and HIGH visible; suppress LOW.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going from informational to blocking
&lt;/h2&gt;

&lt;p&gt;To turn this into a merge gate, change one flag and one branch protection setting.&lt;/p&gt;

&lt;p&gt;In the workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;2ndopinion review \&lt;/span&gt;
  &lt;span class="s"&gt;--base origin/${{ github.base_ref }} \&lt;/span&gt;
  &lt;span class="s"&gt;--head HEAD \&lt;/span&gt;
  &lt;span class="s"&gt;--format github-comment \&lt;/span&gt;
  &lt;span class="s"&gt;--severity-threshold medium \&lt;/span&gt;
  &lt;span class="s"&gt;--fail-on high \&lt;/span&gt;
  &lt;span class="s"&gt;--comment-pr ${{ github.event.pull_request.number }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--fail-on high&lt;/code&gt; flag tells the CLI to exit with a non-zero status if any HIGH severity finding has consensus from at least 2 of 3 models. The 2-of-3 threshold matters — it's why you don't want to block on single-model verdicts. Any single model can confidently invent a critical bug. Two models independently flagging the same critical bug is meaningfully harder to fake.&lt;/p&gt;

&lt;p&gt;Then in &lt;strong&gt;Settings → Branches → Branch protection&lt;/strong&gt; for your default branch, add the &lt;code&gt;AI Code Review / review&lt;/code&gt; check to the required checks list. Now the merge button is gated.&lt;/p&gt;

&lt;p&gt;I'd hold this back for at least a week of informational-mode runs. Look at the false positive rate. If you're getting more than one false HIGH per ten PRs, tune the consensus threshold up to 3-of-3 instead of 2-of-3 before flipping the gate on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;--fail-on high --consensus-required &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's stricter — only blocks when all three models agree the finding is HIGH. False positive rate drops, false negative rate goes up. Tradeoff worth making early; you can loosen later once the team trusts the bot.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitLab CI
&lt;/h2&gt;

&lt;p&gt;Same pattern, different YAML. &lt;code&gt;.gitlab-ci.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ai-code-review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node:20&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_PIPELINE_SOURCE == 'merge_request_event'&lt;/span&gt;
  &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;GIT_DEPTH&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm install -g 2ndopinion-cli&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;2ndopinion review&lt;/span&gt;
        &lt;span class="s"&gt;--base origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME&lt;/span&gt;
        &lt;span class="s"&gt;--head HEAD&lt;/span&gt;
        &lt;span class="s"&gt;--format gitlab-note&lt;/span&gt;
        &lt;span class="s"&gt;--severity-threshold medium&lt;/span&gt;
        &lt;span class="s"&gt;--comment-mr $CI_MERGE_REQUEST_IID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI knows about GitLab's note format and uses &lt;code&gt;CI_JOB_TOKEN&lt;/code&gt; automatically if it's available in the environment, so you don't need to set up a separate token unless you want bot-attributed comments.&lt;/p&gt;

&lt;h2&gt;
  
  
  CircleCI
&lt;/h2&gt;

&lt;p&gt;CircleCI's config doesn't have the same first-class PR concept, but the CLI handles it. &lt;code&gt;.circleci/config.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2.1&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ai-review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;docker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cimg/node:20.11&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;checkout&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm install -g 2ndopinion-cli&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run review&lt;/span&gt;
          &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;2ndopinion review \&lt;/span&gt;
              &lt;span class="s"&gt;--base origin/main \&lt;/span&gt;
              &lt;span class="s"&gt;--head HEAD \&lt;/span&gt;
              &lt;span class="s"&gt;--format json \&lt;/span&gt;
              &lt;span class="s"&gt;--output review.json&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;store_artifacts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;review.json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CircleCI doesn't have a native PR-comment surface, so I store the review as a build artifact and add a separate small script to POST the JSON to the GitHub PR via a personal access token. Less elegant than the GitHub Actions path, but it works.&lt;/p&gt;
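&lt;p&gt;That glue script can be tiny. Here's a hedged sketch using only the Python standard library — GitHub's issue-comments endpoint is real, but the shape of &lt;code&gt;review.json&lt;/code&gt; (a &lt;code&gt;findings&lt;/code&gt; list with &lt;code&gt;severity&lt;/code&gt; and &lt;code&gt;summary&lt;/code&gt; fields) is my assumption about the CLI's JSON output, so adjust the field names to whatever your version emits:&lt;/p&gt;

```python
import json
import urllib.request

def format_comment(findings):
    # Render the review findings (assumed shape) as a markdown comment body.
    if not findings:
        return "AI review: no findings."
    lines = [f"- [{f['severity']}] {f['summary']}" for f in findings]
    return "AI review findings:\n" + "\n".join(lines)

def post_comment(owner, repo, pr_number, body, token):
    # PR comments go through the issues API; needs a PAT with repo scope.
    url = f"https://api.github.com/repos/{owner}/{repo}/issues/{pr_number}/comments"
    req = urllib.request.Request(
        url,
        data=json.dumps({"body": body}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 201 on success

# Example of the body this would post:
demo = format_comment([{"severity": "HIGH", "summary": "Unvalidated redirect in /auth"}])
print(demo)
```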

&lt;h2&gt;
  
  
  What to do when the models disagree
&lt;/h2&gt;

&lt;p&gt;The reason multi-model review is in CI in the first place is that disagreements are signal, not noise. The CLI's default behavior on a finding where models split:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3-of-3 agree (HIGH):&lt;/strong&gt; posted as a HIGH finding, blocks merge if &lt;code&gt;--fail-on high&lt;/code&gt; is set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2-of-3 agree (HIGH):&lt;/strong&gt; posted as a HIGH finding with the dissenting model's argument attached, blocks if &lt;code&gt;--consensus-required 2&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1-of-3 (HIGH):&lt;/strong&gt; posted as a NOTE-level finding with the model's argument and the other two models' counter-arguments. Never blocks. Visible to humans.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last category is the most underrated output. About 8% of our diffs produce a 1-of-3 HIGH where exactly one model is convinced something is broken and the other two say it's fine. Most of those are false positives by the lone model. But about a quarter of them — by far the most interesting quarter — are real bugs that two models missed. You don't want those silently dropped, but you also don't want them blocking merges. NOTE-level surfacing is the right answer.&lt;/p&gt;
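&lt;p&gt;If it helps to see the gating rules as code, here's a toy Python restatement of the policy above — the function name and signature are mine, not the CLI's internals:&lt;/p&gt;

```python
def should_block(severity, agreeing_models, consensus_required=2):
    # A HIGH finding gates the merge only when at least
    # `consensus_required` of the three models independently flag it.
    return severity == "HIGH" and agreeing_models >= consensus_required

# 3-of-3 and 2-of-3 HIGH block at the default threshold...
print(should_block("HIGH", 3))  # True
print(should_block("HIGH", 2))  # True
# ...while a lone dissenter or lower severity is surfaced as a note.
print(should_block("HIGH", 1))    # False
print(should_block("MEDIUM", 3))  # False
# Tightening to --consensus-required 3 drops the 2-of-3 case too.
print(should_block("HIGH", 2, consensus_required=3))  # False
```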

&lt;h2&gt;
  
  
  Cost and time, in case you're worried about either
&lt;/h2&gt;

&lt;p&gt;Median review on a typical 200-line diff: about 40 seconds wall-clock and roughly $0.06 in combined API spend across the three providers. That's wall-clock time the developer doesn't spend; it runs in parallel with the rest of the CI matrix. The cost works out to less than a tenth of what most teams pay for any single human reviewer's hour, which is the right comparison — multi-model review doesn't replace human review, it replaces the human reviewer asking "did you check for race conditions" by hand.&lt;/p&gt;

&lt;p&gt;We've seen teams skip the AI review step for files larger than 1000 lines or generated files (lockfiles, schema dumps) — &lt;code&gt;--exclude '**/*.lock'&lt;/code&gt; and &lt;code&gt;--max-diff-lines 1000&lt;/code&gt; handle both.&lt;/p&gt;




&lt;p&gt;If you want to try this on a real repo, the CLI is &lt;code&gt;npm install -g 2ndopinion-cli&lt;/code&gt; and the docs for every flag mentioned above are at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt;. The MCP server flavor (for plugging the same review engine into Claude Code or Cursor as an agent tool) is also there.&lt;/p&gt;

&lt;p&gt;We publish a weekly build-in-public update; this post is part of it. If you wire 2ndOpinion into your CI and one of your three models flags something the other two missed on a real diff, send the case over — those are the ones we use to tune the consensus thresholds.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>devops</category>
      <category>cicd</category>
    </item>
    <item>
      <title>How to Test Your AI-Built App Without Writing a Single Test</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Fri, 01 May 2026 17:07:41 +0000</pubDate>
      <link>https://dev.to/brianmello/how-to-test-your-ai-built-app-without-writing-a-single-test-b64</link>
      <guid>https://dev.to/brianmello/how-to-test-your-ai-built-app-without-writing-a-single-test-b64</guid>
      <description>&lt;p&gt;You opened Cursor. You typed "build me a booking app." Forty-five minutes later, you have something that runs. The login works. The calendar mostly works. You ship it.&lt;/p&gt;

&lt;p&gt;Then a friend tries it and the date picker goes blank on iOS. Another user finds that hitting back after a failed payment leaves the form locked. Someone else can't sign up because their email has a plus sign in it.&lt;/p&gt;

&lt;p&gt;Welcome to the gap nobody talks about in the vibe coding era: AI-built apps are easier to ship than ever, and exactly as buggy as you'd expect from code you didn't fully read. Traditional testing — the unit tests, the integration tests, the Selenium suites — assumes you have time, expertise, and patience to write them. Vibe coders have none of those. So the apps go out untested.&lt;/p&gt;

&lt;p&gt;This post is about the shift that's quietly happening: AI testing tools that act like real users, find real bugs in your AI-built app, and never ask you to write a line of test code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why traditional testing fails vibe coders
&lt;/h2&gt;

&lt;p&gt;Selenium is from 2004. Cypress and Playwright are better, but the workflow hasn't really changed: you write a script that says click this, type that, assert this. Then your AI rebuilds the navbar and your selectors break. You spend an afternoon fixing tests instead of shipping features.&lt;/p&gt;

&lt;p&gt;The friction is bad enough for full-time engineers. For someone who built their app in Bolt or Lovable over a weekend, it's a hard no. You're not going to become a QA engineer. You're going to ship and hope.&lt;/p&gt;

&lt;p&gt;There is a third option, and it turns out it's pretty obvious in hindsight: have the AI do the testing too.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shift: AI agents that test like users
&lt;/h2&gt;

&lt;p&gt;The new generation of testing tools doesn't ask you to describe what to test. It asks for a URL.&lt;/p&gt;

&lt;p&gt;You point the tool at your live app. An agent opens it, looks at the page, and behaves like a curious human. It clicks things. It fills in forms. It tries weird inputs. It tries to break the flow. It does the things you'd do if you sat down to test your own app, except it doesn't get bored after the third happy path.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from script-based testing. Scripts only test what you told them to test. An agent explores. It can find the dead end you didn't know existed because you never thought to script the click that gets you there.&lt;/p&gt;

&lt;p&gt;The closest analogy is hiring a junior QA contractor who's actually thorough — except this one shows up in thirty seconds and costs less than a sandwich.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "no scripts" actually means
&lt;/h2&gt;

&lt;p&gt;When I say no scripts, I mean it literally. No selectors. No fixtures. No mocks. No test framework to install. The mental model is more like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Here's my app: testing.example.com&lt;/li&gt;
&lt;li&gt;Here's a sentence about what it does: a booking flow for a yoga studio&lt;/li&gt;
&lt;li&gt;Find the bugs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the entire input. You spend more time writing the sentence than configuring anything else.&lt;/p&gt;

&lt;p&gt;The output is a verdict. Not a failed test name and a stack trace — a plain-English description of what's broken, why it matters, and how to reproduce it. Screenshots included.&lt;/p&gt;

&lt;p&gt;For a vibe coder, this is the whole point. You don't want to learn the testing tool. You want to know if your app is shippable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three AIs walk into a courtroom
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. A single AI agent tests your app and tells you it found three bugs. How do you know it's right? AI agents hallucinate. They confidently report "the login button doesn't work" when actually they just couldn't find it because a cookie banner was in the way.&lt;/p&gt;

&lt;p&gt;The fix is the same fix that's working everywhere else AI gets used in production: you ask more than one model, and you make them justify themselves.&lt;/p&gt;

&lt;p&gt;In 2ndOpinion's testing product, three different agents test your app independently. Then they cross-examine each other's findings, courtroom style. Did Agent A really see this bug, or did it misread the page? Can Agent B reproduce it? Does Agent C agree that the failure mode is what A says it is?&lt;/p&gt;

&lt;p&gt;What you get back is a verdict with confidence levels. The bugs that all three agents independently found are almost certainly real. The ones only one agent flagged usually aren't. This cuts the false-positive rate dramatically and saves you from chasing ghosts.&lt;/p&gt;

&lt;p&gt;If you've ever been burned by an AI tool that confidently lied to you, this is the cure. Make them argue. The truth tends to survive the argument.&lt;/p&gt;

&lt;h2&gt;
  
  
  A typical workflow
&lt;/h2&gt;

&lt;p&gt;Here's what testing an AI-built app looks like when you remove the scripting:&lt;/p&gt;

&lt;p&gt;You finish a feature in Cursor, v0, Lovable, Replit — wherever you build. You deploy it. You paste the URL into your testing tool. You add a one-line description of what the app does and which flow you care about. You hit go.&lt;/p&gt;

&lt;p&gt;A few minutes later, you have a list of issues. Not "test_login_button_failed at line 47." Something like: "If a user enters an email with a plus sign, the signup form silently fails. No error message appears, the button just stops responding. Reproduced in Chrome and Safari."&lt;/p&gt;

&lt;p&gt;You take that to your AI coding tool. You paste the bug. You ask for a fix. You redeploy. You re-run the test. You ship.&lt;/p&gt;

&lt;p&gt;The total cycle is maybe twenty minutes. Compare that to writing a single Cypress test from scratch, which also takes twenty minutes — except at the end you've covered exactly one flow, not the app.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this catches that you'd miss
&lt;/h2&gt;

&lt;p&gt;The category of bugs that AI testing finds reliably is the one vibe coders most often ship by accident.&lt;/p&gt;

&lt;p&gt;Edge cases in input handling. The plus sign in the email. The apostrophe in the last name. The phone number with a country code.&lt;/p&gt;

&lt;p&gt;Broken back buttons and refresh behavior. The state that doesn't persist. The form that reposts on refresh. The "session expired" page that has no way out.&lt;/p&gt;

&lt;p&gt;Mobile-specific weirdness. The viewport that doesn't scroll. The keyboard that covers the submit button. The autofocus that fights with the iOS keyboard.&lt;/p&gt;

&lt;p&gt;Auth flows that work for the happy path and explode otherwise. Wrong password. Expired link. Already-registered email. OAuth cancellation halfway through.&lt;/p&gt;

&lt;p&gt;These are the bugs your friends find on Twitter the day after you launch. They're also the ones a single AI rarely catches reliably, which is why the multi-agent cross-examination matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this is heading
&lt;/h2&gt;

&lt;p&gt;Two things are going to happen over the next year. First, this kind of testing becomes default. Pasting a URL and getting a verdict will feel as obvious as pasting an error into ChatGPT. Second, the tools that survive will be the ones that handle disagreement honestly — that show you when their agents argued, who won, and why.&lt;/p&gt;

&lt;p&gt;The vibe coder workflow has been bottlenecked on testing for two years. The unblocking is happening right now, and it doesn't involve learning Selenium.&lt;/p&gt;

&lt;p&gt;If you've built something in Cursor, Bolt, or Lovable and you're nervous about shipping it: that's a reasonable feeling, and the tools to act on it finally exist.&lt;/p&gt;




&lt;p&gt;If you want to try this on something you've built, &lt;a href="https://testing.get2ndopinion.dev" rel="noopener noreferrer"&gt;2ndOpinion Testing&lt;/a&gt; is the macOS desktop app I'm building for exactly this. Paste a URL, get a verdict. No scripts, no selectors, no test framework to learn.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>vibecoding</category>
      <category>productivity</category>
    </item>
    <item>
      <title>When Claude, Codex, and Gemini Disagree on the Same Code</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Fri, 24 Apr 2026 17:06:36 +0000</pubDate>
      <link>https://dev.to/brianmello/when-claude-codex-and-gemini-disagree-on-the-same-code-4cnd</link>
      <guid>https://dev.to/brianmello/when-claude-codex-and-gemini-disagree-on-the-same-code-4cnd</guid>
      <description>&lt;p&gt;When we tell people &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;2ndOpinion&lt;/a&gt; runs every pull request past Claude, Codex, and Gemini and then cross-examines the findings, the most common follow-up is: "Do they actually disagree? Or is this just three models rubber-stamping each other?"&lt;/p&gt;

&lt;p&gt;The answer, from about six months of production review logs, is that they disagree often enough to matter. Not on everything — maybe 15% of diffs — but the disagreements cluster on exactly the kinds of bugs that hurt in production: concurrency, null handling, subtle security issues, and "this works but it's going to page you at 3am" architectural smells.&lt;/p&gt;

&lt;p&gt;Here are four real cases, lightly anonymized, where the three models read the same code and came back with meaningfully different verdicts. If you're trying to decide whether multi-model review is worth the extra tokens, these are the kinds of arguments it's buying you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case 1: The async/await race that only one model saw
&lt;/h2&gt;

&lt;p&gt;The diff was a webhook handler in a Node.js payments service. Roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/webhook/stripe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;verifySignature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findOne&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duplicate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;processEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;processed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Codex&lt;/strong&gt; flagged it as a textbook race condition: two copies of the same webhook arriving within milliseconds both pass the &lt;code&gt;findOne&lt;/code&gt; check before either has written to &lt;code&gt;events&lt;/code&gt;, both run &lt;code&gt;processEvent&lt;/code&gt;, and you charge the customer twice. Recommended fix: a unique index on &lt;code&gt;id&lt;/code&gt; plus wrapping the processing in an idempotency-key pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; said the code was fine and suggested minor cleanup — extract &lt;code&gt;processEvent&lt;/code&gt; into a service, add structured logging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini&lt;/strong&gt; agreed with Codex about the race but suggested a different fix — optimistic insert first, catch the unique constraint violation, return early if duplicate. Cleaner on the happy path.&lt;/p&gt;
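&lt;p&gt;Gemini's insert-first pattern fits in a few lines. Here's a minimal Python sketch of the control flow, not the handler from the diff: the in-memory set stands in for a table with a unique index on the event id, which is what actually makes the claim atomic in a real database.&lt;/p&gt;

```python
# Sketch of the insert-first idempotency pattern. The set plays the
# role of a table with a UNIQUE index on event id; "already present"
# plays the role of catching the unique-constraint violation.
processed_ids = set()
charges = []  # the side effect we must not duplicate

def handle_webhook(event_id):
    # Claim the id BEFORE doing any work. With a real database this is
    # an INSERT that either succeeds or raises a unique violation;
    # the constraint, not this check, is what makes the claim atomic.
    if event_id in processed_ids:
        return "duplicate"
    processed_ids.add(event_id)
    charges.append(event_id)  # the non-idempotent work
    return "ok"

# Two copies of the same webhook arrive back to back.
print(handle_webhook("evt_123"))  # ok
print(handle_webhook("evt_123"))  # duplicate
print(len(charges))               # 1, customer charged once
```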

&lt;p&gt;The consensus step flagged the race because two of three models saw it. Without cross-checking, whichever single model you happened to be using would have told you the diff was either shippable or a bug — a coin flip on a payments handler.&lt;/p&gt;

&lt;p&gt;The lesson isn't that Claude is worse at concurrency. Rerun this prompt on a different day and the models trade places. The lesson is that &lt;em&gt;any&lt;/em&gt; single model has blind spots that are invisible until a different model looks at the same code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case 2: The "working" SQL that was quietly injectable
&lt;/h2&gt;

&lt;p&gt;A new internal admin endpoint, Python, roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM users WHERE email ILIKE %s ORDER BY &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; DESC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Gemini&lt;/strong&gt; immediately flagged the SQL injection in the &lt;code&gt;sort&lt;/code&gt; parameter — the &lt;code&gt;%s&lt;/code&gt; parameterization protects &lt;code&gt;query&lt;/code&gt;, but &lt;code&gt;sort&lt;/code&gt; is interpolated directly into the string. An attacker who controls &lt;code&gt;sort&lt;/code&gt; can turn this into &lt;code&gt;ORDER BY (SELECT ...) DESC&lt;/code&gt; and exfiltrate data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex&lt;/strong&gt; flagged it too, with a suggested allowlist: &lt;code&gt;if sort not in {"created_at", "email", "last_login"}: raise ValueError(...)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; said the query was safe because the user parameter was parameterized — and it was technically right about &lt;code&gt;query&lt;/code&gt;, but it missed that &lt;code&gt;sort&lt;/code&gt; is a user-controllable input from the same request.&lt;/p&gt;

&lt;p&gt;This is the most dangerous kind of AI review error: confidently correct about one thing, silent on a worse thing right next to it. A single-model review that happened to land on Claude that day would have said "LGTM." The second opinion is exactly what you want for security-adjacent diffs — one model being wrong is common, two models being wrong in the same direction is rare.&lt;/p&gt;
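&lt;p&gt;For concreteness, here is the allowlist fix applied to the snippet above. A sketch only: the function name is invented, and it's restructured to return the SQL and parameters rather than execute them, so the shape of the fix is visible without a database.&lt;/p&gt;

```python
# Allowlist the ORDER BY column instead of interpolating user input.
ALLOWED_SORTS = {"created_at", "email", "last_login"}

def build_search_query(query, sort="created_at"):
    if sort not in ALLOWED_SORTS:
        raise ValueError(f"unsupported sort column: {sort!r}")
    # sort is now one of a fixed set of literals, never attacker input;
    # the search term still goes through %s parameterization as before.
    sql = f"SELECT * FROM users WHERE email ILIKE %s ORDER BY {sort} DESC"
    return sql, [f"%{query}%"]
```

&lt;p&gt;The allowlist is necessary because placeholders like &lt;code&gt;%s&lt;/code&gt; can only parameterize values, not identifiers — an &lt;code&gt;ORDER BY&lt;/code&gt; column name can't be passed as a bound parameter, so it has to be validated instead.&lt;/p&gt;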

&lt;h2&gt;
  
  
  Case 3: The memory leak that wasn't
&lt;/h2&gt;

&lt;p&gt;Sometimes consensus is wrong and the outlier is right. React component, roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="nf"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onmessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setMessages&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; and &lt;strong&gt;Gemini&lt;/strong&gt; both flagged a missing cleanup of the &lt;code&gt;onmessage&lt;/code&gt; handler and warned about a memory leak if the component re-mounted rapidly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex&lt;/strong&gt; pushed back — because &lt;code&gt;ws&lt;/code&gt; is created inside the effect and &lt;code&gt;ws.close()&lt;/code&gt; is called in the cleanup, nothing retains the socket after unmount, so the socket and its &lt;code&gt;onmessage&lt;/code&gt; handler become garbage-collectible together. The handler doesn't need explicit removal. The two-of-three majority was wrong; the outlier was right.&lt;/p&gt;

&lt;p&gt;This is where our cross-examination step earns its keep. Instead of defaulting to "majority wins," the consensus layer asks the dissenting model to defend its position, then asks the other two to respond. In this case Codex explained the GC behavior, Claude acknowledged the correction, and the final verdict downgraded the finding from "bug" to "stylistic nit."&lt;/p&gt;

&lt;p&gt;If you only run majority voting, you get the wrong answer on cases like this. If you run proper cross-examination, you get the right answer &lt;em&gt;and&lt;/em&gt; the reasoning, which is how engineers actually build trust in AI review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case 4: The Rust borrow checker dispute
&lt;/h2&gt;

&lt;p&gt;A small but contentious one. The diff refactored a hot path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Processed&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
        &lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Codex&lt;/strong&gt; flagged the &lt;code&gt;.clone()&lt;/code&gt; as wasteful and suggested taking &lt;code&gt;items&lt;/code&gt; by value and using &lt;code&gt;into_iter()&lt;/code&gt; to move instead of clone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini&lt;/strong&gt; agreed with the performance critique but added a nuance — if &lt;code&gt;Item&lt;/code&gt; contains anything expensive to clone (like an &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;T&amp;gt;&amp;gt;&lt;/code&gt;), the clone is specifically what you don't want in a hot path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; defended the clone. Its argument: if &lt;code&gt;transform&lt;/code&gt; is defined for &lt;code&gt;&amp;amp;Item&lt;/code&gt; in the existing codebase and changing it breaks fifteen other callers, the clone is the minimal-risk change. "Optimal" and "mergeable" are different targets.&lt;/p&gt;

&lt;p&gt;None of the three models was wrong. They were optimizing for different objectives, which is a pattern we see constantly — performance versus maintainability, correctness versus velocity, local improvement versus blast-radius. Multi-model review surfaces that there &lt;em&gt;is&lt;/em&gt; a tradeoff rather than presenting one model's preferred answer as The Answer. That's usually more useful than a confident single verdict.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we do with the disagreements
&lt;/h2&gt;

&lt;p&gt;The short version of the product: every review goes to all three models in parallel. Findings that all three agree on are high-confidence and reported first. Findings where models disagree trigger a cross-examination round where each model sees the others' output and gets a chance to revise. Anything still contested is surfaced to the human reviewer with the full argument attached, rather than hidden behind a single "LGTM."&lt;/p&gt;

&lt;p&gt;That last part is the one most people underestimate. You don't want the AI to resolve every disagreement — some disagreements are the signal. A human reviewer who sees "Claude says ship, Gemini says block, here's why" makes a better decision than one who sees a single-model verdict in either direction.&lt;/p&gt;




&lt;p&gt;If your team is running code review with one model and wondering what the second opinion would say, that's the whole pitch. Install the CLI with &lt;code&gt;npm i -g 2ndopinion-cli&lt;/code&gt;, run &lt;code&gt;2ndopinion review&lt;/code&gt;, and see where your models actually disagree. Or wire it into Claude Code / Cursor as an MCP server — docs at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We publish a weekly build-in-public update, and this post is part of it. If you have a case where two AI reviewers disagreed on your code and you're curious what a third would say, send it over — the weird diffs are the fun ones.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>devtools</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>How Smart Model Routing Picks the Right AI for Your Programming Language</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Fri, 17 Apr 2026 17:06:47 +0000</pubDate>
      <link>https://dev.to/brianmello/how-smart-model-routing-picks-the-right-ai-for-your-programming-language-2jog</link>
      <guid>https://dev.to/brianmello/how-smart-model-routing-picks-the-right-ai-for-your-programming-language-2jog</guid>
      <description>&lt;p&gt;The dirty secret of AI code review is that there is no single "best" model. There are only models that happen to be good at the specific thing you're asking them to do right now.&lt;/p&gt;

&lt;p&gt;I learned this the hard way while building &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;2ndOpinion&lt;/a&gt;, an AI code review tool where Claude, Codex, and Gemini cross-check each other's work over MCP. The first version hard-coded one model for every review. The reviews were fine for JavaScript. They were embarrassing for Rust. They were weirdly confident and wrong for SQL.&lt;/p&gt;

&lt;p&gt;So we stopped picking a single model and started routing. This post is about how that routing layer actually works — what signals we collect, how the scoring math plays out, and what surprised us when we shipped it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: model strength is language-specific
&lt;/h2&gt;

&lt;p&gt;When we started tracking per-language accuracy, a clear pattern emerged. If you take the same corpus of reviewed pull requests and score each model on catching real bugs versus hallucinating issues that don't exist, you don't get a uniform leaderboard. You get something that looks more like rock-paper-scissors.&lt;/p&gt;

&lt;p&gt;One model might be excellent at spotting async/await footguns in TypeScript but completely miss lifetime issues in Rust. Another might be phenomenal at Python decorator patterns and hopeless at Go's error handling conventions. A third might crush Terraform drift detection but flag perfectly valid Kubernetes manifests as "probably wrong."&lt;/p&gt;

&lt;p&gt;This isn't a flaw in any particular model. It's a consequence of training data distribution, RLHF feedback, and the fact that "code" is actually hundreds of very different specialties wearing the same trench coat. Treating "AI code review" as one problem and picking one winner leaves performance on the table for every language that isn't the winner's strong suit.&lt;/p&gt;

&lt;h2&gt;
  
  
  What &lt;code&gt;--llm auto&lt;/code&gt; actually does
&lt;/h2&gt;

&lt;p&gt;When you run a review with no model flag, the CLI calls the &lt;code&gt;auto&lt;/code&gt; router:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli
2ndopinion review src/auth.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, &lt;code&gt;--llm auto&lt;/code&gt; is the default. It takes three inputs — language, change type, and file size — and picks a model. Here's the Python SDK equivalent, which exposes the same router via a keyword argument:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;secondopinion&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;opinion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/auth.ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;typescript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# route based on accuracy data
&lt;/span&gt;    &lt;span class="n"&gt;change_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bugfix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# optional hint
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;review_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_used&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# which model actually ran
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;review_metadata&lt;/code&gt; object is the important part for debugging. Every response tells you which model was picked and why, along with token counts and duration. If you want reproducibility, pin the model explicitly; if you want the best review for this specific request, let the router decide.&lt;/p&gt;

&lt;h2&gt;
  
  
  The signals that feed the router
&lt;/h2&gt;

&lt;p&gt;There are four signals we weight, in roughly this order:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Language.&lt;/strong&gt; Detected from file extension, shebang, or an explicit &lt;code&gt;language=&lt;/code&gt; argument in the SDK. This is the dominant signal because accuracy variance between models on a given language is much larger than variance on other dimensions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change type.&lt;/strong&gt; A new-feature diff has different review priorities than a bugfix or a refactor. Security-sensitive file paths (&lt;code&gt;auth/&lt;/code&gt;, &lt;code&gt;crypto/&lt;/code&gt;, anything matching a configurable allowlist) bump a security-audit weight into the decision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File size and diff size.&lt;/strong&gt; Very large files get routed to models with bigger effective context windows. Small targeted diffs can go to faster models without losing accuracy — no point paying for a heavyweight review of a three-line typo fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern memory.&lt;/strong&gt; If we've seen a similar bug pattern in this repo before, we bias toward the model that caught the original. This is a small effect per review, but over a project's lifetime it adds up, because teams tend to re-introduce the same class of bug in different forms.&lt;/p&gt;

&lt;p&gt;The scoring itself is embarrassingly simple. For each candidate model we compute a weighted sum from the accuracy table and pick the highest. It's not a neural net. It's not an LLM picking another LLM. It's a lookup table and a weighted argmax. We tried fancier approaches and they kept losing to the lookup table, which turns out to be the honest answer in most ML-system stories.&lt;/p&gt;
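&lt;p&gt;A toy version of that weighted argmax looks like this; the model names in the table, the accuracy values, and the security bonuses are all made up for illustration, not our production data:&lt;/p&gt;

```python
# Per-(model, language) accuracy table plus a signal bump, then argmax.
# All numbers below are illustrative placeholders.
ACCURACY = {
    ("claude", "typescript"): 0.86,
    ("codex", "typescript"): 0.85,
    ("gemini", "typescript"): 0.78,
}
SECURITY_BONUS = {"claude": 0.02, "codex": 0.05, "gemini": 0.04}

def route(language, security_sensitive=False):
    def score(model):
        s = ACCURACY.get((model, language), 0.0)
        if security_sensitive:
            s += SECURITY_BONUS.get(model, 0.0)
        return s
    # A lookup table and a weighted argmax, nothing fancier.
    return max(("claude", "codex", "gemini"), key=score)

print(route("typescript"))                           # claude (0.86)
print(route("typescript", security_sensitive=True))  # codex (0.90 beats 0.88)
```

&lt;p&gt;The point of the example: flipping one signal changes the winner, which is exactly the behavior a single hard-coded model can't give you.&lt;/p&gt;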

&lt;h2&gt;
  
  
  Where the accuracy data comes from
&lt;/h2&gt;

&lt;p&gt;A router is only as good as the data behind it. Ours comes from three places.&lt;/p&gt;

&lt;p&gt;First, offline evals. We maintain a set of benchmark repos per language with known bugs — either ones we inject, or historical CVEs replayed on the vulnerable commit. Every model gets scored on "did you catch this specific bug" and "did you flag something that wasn't actually a problem."&lt;/p&gt;

&lt;p&gt;Second, production telemetry. When a user accepts or rejects a finding via &lt;code&gt;2ndopinion fix&lt;/code&gt; or the GitHub PR agent, that's a signal. Rejected findings that were later confirmed as real bugs (via a follow-up commit or a revert) are gold. We only aggregate feedback, never store code — that's a hard constraint baked into the pipeline.&lt;/p&gt;

&lt;p&gt;Third, consensus disagreements. When you run a consensus review, three models vote. Disagreements are interesting because they surface cases where one model sees a bug the others miss. Over time, the model that's consistently right on disagreements gets weighted higher for that language.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Three-model consensus review — the source of a lot of our training signal&lt;/span&gt;
2ndopinion review src/auth.ts &lt;span class="nt"&gt;--consensus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three credits, one command. The confidence-weighted aggregator takes the three reviews, collapses duplicate findings, and ranks by agreement. High-agreement findings surface first; disagreements get flagged explicitly so a human can adjudicate.&lt;/p&gt;
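&lt;p&gt;A stripped-down version of that aggregation step (the real duplicate matcher is fuzzier than an exact key, so treat this as a sketch):&lt;/p&gt;

```python
from collections import defaultdict

def aggregate(reviews):
    """Collapse duplicate findings across models and rank by agreement.

    reviews: {model_name: [finding_dict, ...]} where each finding has
    file, line, and category keys. Deduping on that exact triple is a
    simplifying assumption.
    """
    grouped = defaultdict(set)
    for model, findings in reviews.items():
        for f in findings:
            grouped[(f["file"], f["line"], f["category"])].add(model)
    # High-agreement findings first; anything short of unanimity is
    # flagged as disputed so a human can adjudicate.
    ranked = sorted(grouped.items(), key=lambda kv: -len(kv[1]))
    return [
        {"finding": key,
         "models": sorted(models),
         "disputed": len(models) != len(reviews)}
        for key, models in ranked
    ]
```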

&lt;h2&gt;
  
  
  A concrete example: routing a TypeScript auth change
&lt;/h2&gt;

&lt;p&gt;Say you run this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2ndopinion review src/auth/session.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The router sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Language: TypeScript (file extension and tsconfig detected)&lt;/li&gt;
&lt;li&gt;Change type: bugfix (detected from git diff — a returned value was modified, no new exports)&lt;/li&gt;
&lt;li&gt;File size: 240 lines&lt;/li&gt;
&lt;li&gt;Path signal: &lt;code&gt;auth/&lt;/code&gt; → security-sensitive bump&lt;/li&gt;
&lt;li&gt;Pattern memory: this repo had a session-fixation issue three months ago&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The router weights the security-sensitive bump and biases toward whichever model has the strongest track record on auth/session TypeScript bugs in our accuracy table. It runs that single model at three-credits-equivalent depth, returns a review, and the &lt;code&gt;review_metadata&lt;/code&gt; field on the response tells you exactly which model was chosen so you can audit the decision.&lt;/p&gt;

&lt;p&gt;If any of those signals flip — different language, a new-feature diff, no security-sensitive path — you'd get a different model. That's the whole point.&lt;/p&gt;
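&lt;p&gt;The signal-extraction half of that example can be sketched in a few heuristics. The detectors below are illustrative stand-ins, not the production ones:&lt;/p&gt;

```python
# Illustrative signal extraction: file extension to language, path to
# security sensitivity, and a crude diff heuristic for change type.
SECURITY_PATHS = ("auth/", "crypto/", "payments/")
EXT_TO_LANG = {".ts": "typescript", ".py": "python", ".go": "go"}

def extract_signals(path, added_lines, repo_bug_history=()):
    ext = path[path.rfind("."):]
    new_exports = any("export" in line for line in added_lines)
    return {
        "language": EXT_TO_LANG.get(ext, "unknown"),
        "security_sensitive": any(p in path for p in SECURITY_PATHS),
        # modified returns with no new exports reads as a bugfix
        "change_type": "feature" if new_exports else "bugfix",
        "pattern_memory": "session-fixation" in repo_bug_history,
    }
```

&lt;p&gt;Run on the example above, it yields language typescript, security_sensitive true, change_type bugfix, pattern_memory true: exactly the signal vector the router then scores.&lt;/p&gt;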

&lt;h2&gt;
  
  
  What surprised us
&lt;/h2&gt;

&lt;p&gt;Two things.&lt;/p&gt;

&lt;p&gt;First, the router made the marginal model matter. We used to think of models as tiered — a "best" one, a "good enough" one, a "cheap one for trivial stuff." Once we started routing on language-specific accuracy, the hierarchy collapsed. Models we'd written off as second-tier turned out to dominate specific slices. There is no tier list. There are just specialties.&lt;/p&gt;

&lt;p&gt;Second, the router made consensus more valuable, not less. You'd think smart routing would make consensus redundant — why run three models if one is already the best? In practice, consensus is where the router learns. Every disagreement is a labeled data point about where the router's current guess is wrong. We run consensus on a sampled slice of reviews partly to keep the accuracy table fresh.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;If you're building anything on top of LLMs, the lesson generalizes past code review: "which model is best" is the wrong question. The right question is "which model is best for this specific request, given what I know about it." Build a router, not a leaderboard.&lt;/p&gt;

&lt;p&gt;If you want to see smart model routing in action, the fastest way is to install the CLI and run a review; the router picks the model for you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the CLI and run a review&lt;/span&gt;
npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli
2ndopinion review src/your-file.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or try the playground at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; and watch the &lt;code&gt;model_used&lt;/code&gt; field on the response. You can force a specific model with &lt;code&gt;--llm claude&lt;/code&gt;, &lt;code&gt;--llm codex&lt;/code&gt;, or &lt;code&gt;--llm gemini&lt;/code&gt; to see how the same code gets reviewed differently — which is the fastest way to internalize why routing matters in the first place.&lt;/p&gt;

&lt;p&gt;If you've built a routing layer for a different ML-backed product, I'd love to hear what signals ended up mattering most. Drop a comment — I'm especially curious about people who tried fancier approaches before collapsing back to a lookup table.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>programming</category>
    </item>
    <item>
      <title>What 10 Versions of an AI Code Review CLI Taught Me About Developer UX</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Fri, 10 Apr 2026 17:12:18 +0000</pubDate>
      <link>https://dev.to/brianmello/what-10-versions-of-an-ai-code-review-cli-taught-me-about-developer-ux-1301</link>
      <guid>https://dev.to/brianmello/what-10-versions-of-an-ai-code-review-cli-taught-me-about-developer-ux-1301</guid>
      <description>&lt;p&gt;You don't learn how developers think by reading docs. You learn by shipping something, watching it fail, and shipping it again.&lt;/p&gt;

&lt;p&gt;I've been building &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;2ndOpinion&lt;/a&gt;, an AI code review tool where multiple models — Claude, Codex, Gemini — cross-check each other's reviews. Over the past few months and ten CLI versions, I've rewritten the developer experience more times than I'd like to admit. Here's what actually stuck.&lt;/p&gt;

&lt;h2&gt;
  
  
  Version 1: The "Just Ship It" Phase
&lt;/h2&gt;

&lt;p&gt;The first version of the CLI did exactly one thing: send your code to three AI models and print their reviews. It worked. Technically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx 2ndopinion-cli &lt;span class="nt"&gt;--file&lt;/span&gt; src/auth.ts &lt;span class="nt"&gt;--models&lt;/span&gt; claude,codex,gemini &lt;span class="nt"&gt;--format&lt;/span&gt; json &lt;span class="nt"&gt;--output&lt;/span&gt; review.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four flags to get a single review. Every run required you to specify the file, models, format, and output. Nobody wants to think that hard before getting feedback on their code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: If your CLI needs a manual, you've already lost.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Smart Default" Breakthrough
&lt;/h2&gt;

&lt;p&gt;The single biggest improvement wasn't a feature — it was removing decisions. Version 0.5.0 introduced one command that just works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2ndopinion review src/auth.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The tool auto-detects your language, picks the best models for that language based on real accuracy data, and prints a formatted review. No flags required.&lt;/p&gt;

&lt;p&gt;Downloads jumped immediately. Not because the tool got more powerful — it got simpler.&lt;/p&gt;

&lt;p&gt;Behind the scenes, &lt;code&gt;--llm auto&lt;/code&gt; routes your code to whichever models perform best for your specific language. TypeScript reviews go to different models than Python reviews, because we track which models actually catch bugs in each language. But the developer doesn't need to know any of that.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Feedback That Changed Everything
&lt;/h2&gt;

&lt;p&gt;A developer tried 2ndOpinion and told me: "I got my review. Now what?"&lt;/p&gt;

&lt;p&gt;That question haunted me. Getting a list of issues is step one. But developers don't want a report — they want their code to be better. So I built &lt;code&gt;fix&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2ndopinion fix src/auth.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One command. It reviews your code, identifies the issues, generates fixes, and applies them. You can review the diff before accepting. The entire loop from "something's wrong" to "it's fixed" happens in your terminal.&lt;/p&gt;

&lt;p&gt;Then came &lt;code&gt;watch&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2ndopinion watch src/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Continuous monitoring. Save a file, get a review. Like having a pair programmer who never takes a break and never gets passive-aggressive about your variable names.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: The best developer tool is the one that closes the loop. Don't hand developers a problem — hand them a solution.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Multi-Model Insight Nobody Asked For
&lt;/h2&gt;

&lt;p&gt;Here's something I didn't expect: individual AI models are unreliable in predictable ways. Claude is excellent at architectural reasoning but sometimes misses edge cases in error handling. Codex catches implementation bugs that Claude misses. Gemini often spots performance issues the others overlook.&lt;/p&gt;

&lt;p&gt;No single model is "the best." But three models reviewing the same code? They catch what each other misses. That's the core thesis of 2ndOpinion — consensus-based review.&lt;/p&gt;

&lt;p&gt;When all three models agree something is a problem, the confidence is high. When they disagree, that's where the interesting conversations happen. We built a confidence-weighted system that surfaces high-agreement issues first and flags disagreements for human review.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--consensus&lt;/code&gt; flag makes this explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2ndopinion review &lt;span class="nt"&gt;--consensus&lt;/span&gt; src/auth.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three models review in parallel. You get a unified report with confidence scores. Three credits, one command, and a review that's more thorough than any single model could produce.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Got Wrong About Developer UX
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I over-indexed on power users.&lt;/strong&gt; Early versions had flags for everything: model selection, temperature, output format, verbosity levels, custom prompts. Power users loved it. Everyone else bounced.&lt;/p&gt;

&lt;p&gt;The fix was layered complexity. The default command (&lt;code&gt;2ndopinion&lt;/code&gt;) requires zero configuration. Power users can add flags to customize. But the first experience is frictionless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I underestimated CI/CD.&lt;/strong&gt; Developers don't just run tools locally — they run them in pipelines. Version 0.10.0 added &lt;code&gt;--ci&lt;/code&gt;, &lt;code&gt;--json&lt;/code&gt;, and &lt;code&gt;--plain&lt;/code&gt; flags specifically for non-interactive environments. It sounds obvious in retrospect, but I spent months building interactive terminal UI before realizing half my users needed the opposite.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In your GitHub Actions workflow&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--pr&lt;/span&gt; &lt;span class="nv"&gt;$PR_NUMBER&lt;/span&gt; &lt;span class="nt"&gt;--ci&lt;/span&gt; &lt;span class="nt"&gt;--json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;I ignored the "try before you buy" instinct.&lt;/strong&gt; Developers don't sign up for things. They install them, try them, and decide in under 60 seconds. The free playground on &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; — no signup required — exists because I watched too many developers hit a registration wall and leave.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next: The Skills Marketplace
&lt;/h2&gt;

&lt;p&gt;The most surprising thing I've learned is that every team has domain-specific review needs. A fintech team cares about different patterns than a game studio. A team migrating from Python 2 to 3 needs a completely different lens.&lt;/p&gt;

&lt;p&gt;So we're building a skills marketplace where developers can create custom audit skills — specialized review logic for specific domains — and sell them. Creators earn 70% of revenue. It turns tribal knowledge into something shareable and monetizable.&lt;/p&gt;

&lt;p&gt;Think of it as npm for code review intelligence. Someone who's spent five years dealing with Django security footguns can package that knowledge into a skill that catches those issues for every Django developer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Ten versions in, the biggest lesson is this: &lt;strong&gt;developer tools win on defaults, not features.&lt;/strong&gt; Every flag you add is a decision you're asking the developer to make. Every decision is friction. Every bit of friction is a reason to close the terminal and move on.&lt;/p&gt;

&lt;p&gt;If you're building developer tools, here's my checklist: Does the zero-config experience work? Does the tool close the loop (find problem → fix problem)? Can it run in CI without modification? Can someone try it in under 60 seconds?&lt;/p&gt;

&lt;p&gt;If you want to try multi-model AI code review, the CLI is one install away:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli
2ndopinion review your-file.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or try the playground at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; — no signup, no credit card, just paste code and see what three AI models think.&lt;/p&gt;

&lt;p&gt;I'd love to hear what you've learned building developer tools. What UX lessons took you the longest to figure out? Drop a comment below.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>devtools</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>Single-Model vs Multi-Model AI Code Review: What I Learned Running Both</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Fri, 03 Apr 2026 17:07:37 +0000</pubDate>
      <link>https://dev.to/brianmello/single-model-vs-multi-model-ai-code-review-what-i-learned-running-both-2i22</link>
      <guid>https://dev.to/brianmello/single-model-vs-multi-model-ai-code-review-what-i-learned-running-both-2i22</guid>
      <description>&lt;p&gt;I've been obsessing over AI code review for the last year. Not because I think AI will replace code review — I don't — but because I think most developers are leaving a lot of quality signal on the table by using AI review the wrong way.&lt;/p&gt;

&lt;p&gt;Here's the thing nobody talks about: &lt;strong&gt;a single AI model is confidently wrong surprisingly often.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not maliciously wrong. Not obviously wrong. Just... plausible-sounding wrong. It'll flag a false positive, miss a real bug, or give you a high-confidence "looks good" on code that has a subtle race condition. And because the model sounds so sure of itself, you accept it and move on.&lt;/p&gt;

&lt;p&gt;I learned this the hard way. Then I started running multi-model consensus review instead, and it changed my whole mental model of what AI code review should look like.&lt;/p&gt;

&lt;p&gt;Here's what I found.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With Single-Model Review
&lt;/h2&gt;

&lt;p&gt;When you pipe code through one model — say, Claude or GPT-4 — you get a single "opinion." That opinion is shaped by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model's training data distribution&lt;/li&gt;
&lt;li&gt;Whatever biases crept in during RLHF&lt;/li&gt;
&lt;li&gt;The specific prompt you used&lt;/li&gt;
&lt;li&gt;The model's current context window state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of those factors are visible to you as the reviewer. You just get a confident-sounding output and have to decide how much to trust it.&lt;/p&gt;

&lt;p&gt;I started noticing patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt; tends to be excellent at spotting architectural smell and async/await patterns. It's more conservative — it'll point out potential issues even when they're not certain bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4 / Codex&lt;/strong&gt; is better at catching common idiom violations and tends to give more opinionated style feedback. It's more decisive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini&lt;/strong&gt; has surprisingly strong instincts around security patterns and type safety, particularly in typed languages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't a knock on any model. They're just different lenses. And here's the thing: &lt;strong&gt;a bug that one model misses, another often catches.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Running the Same Code Through Both Approaches
&lt;/h2&gt;

&lt;p&gt;I took a production Node.js service — about 2,000 lines across 12 files — and ran it two ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach 1: Single-model review (just Claude)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the CLI&lt;/span&gt;
npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli

&lt;span class="c"&gt;# Review with a single model&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--llm&lt;/span&gt; claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Approach 2: Multi-model consensus (Claude + Codex + Gemini in parallel)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Use consensus mode — 3 models, confidence-weighted&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--consensus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The single-model pass found &lt;strong&gt;14 issues&lt;/strong&gt;: 9 flagged as medium severity, 3 high, 2 low. Took about 8 seconds.&lt;/p&gt;

&lt;p&gt;The consensus pass found &lt;strong&gt;19 issues&lt;/strong&gt;: same 14, plus 5 more. Three of those 5 were real bugs I later confirmed in prod logs.&lt;/p&gt;

&lt;p&gt;But here's the part that matters more than the raw numbers:&lt;/p&gt;

&lt;p&gt;The consensus pass also &lt;strong&gt;filtered out 4 false positives&lt;/strong&gt; that Claude had flagged with high confidence. Those were caught because Codex and Gemini both disagreed — and when 2 out of 3 models say "this is fine," the confidence weight pulls the verdict away from "issue."&lt;/p&gt;




&lt;h2&gt;
  
  
  How Confidence-Weighted Consensus Works
&lt;/h2&gt;

&lt;p&gt;The naive approach to multi-model review would be simple majority voting: if 2 of 3 models say something is a bug, call it a bug. That's better than nothing, but it treats all models as equally reliable on all tasks.&lt;/p&gt;

&lt;p&gt;Confidence-weighted consensus is smarter. Each model reports not just &lt;em&gt;what&lt;/em&gt; it found, but &lt;em&gt;how confident&lt;/em&gt; it is. The final verdict weights those signals proportionally.&lt;/p&gt;

&lt;p&gt;So if Claude says "potential null dereference, high confidence" and Codex says "looks fine, medium confidence," the system doesn't just flip a coin. It weights Claude's high-confidence flag more heavily than Codex's medium-confidence dismissal.&lt;/p&gt;
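&lt;p&gt;In rough pseudocode terms (the 50% threshold and the confidence numbers are assumptions for illustration, not the production logic):&lt;/p&gt;

```python
def verdict(opinions):
    """Confidence-weighted vote on a single finding.

    opinions: list of (says_bug, confidence) pairs, one per model.
    Returns a label plus the weighted agreement score.
    """
    weight_for = sum(conf for says_bug, conf in opinions if says_bug)
    total = sum(conf for _, conf in opinions)
    agreement = weight_for / total
    label = "issue" if agreement > 0.5 else "ok"
    return label, round(agreement, 2)

# Claude's high-confidence flag outweighs Codex's medium-confidence
# dismissal, so the finding survives.
print(verdict([(True, 0.9), (False, 0.6)]))  # -> ('issue', 0.6)
```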

&lt;p&gt;In practice, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unanimous findings&lt;/strong&gt; → almost certainly real, shown at the top&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2/3 agreement, high confidence&lt;/strong&gt; → likely real, worth investigating&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1/3 agreement, low confidence from the flagging model&lt;/strong&gt; → deprioritized, often noise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Divergent high-confidence opinions&lt;/strong&gt; → flagged as a "debate" item worth human judgment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what that looks like with the Python SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;secondopinion&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;

&lt;span class="c1"&gt;# Run consensus review
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;consensus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;server.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;finding&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Models agreeing: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[94%] HIGH: Unhandled promise rejection in processWebhook()
  Models agreeing: claude, codex, gemini

[71%] MEDIUM: Missing input validation on userId parameter
  Models agreeing: claude, gemini

[38%] LOW: Variable name 'data' is ambiguous
  Models agreeing: codex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That 38% finding? Probably noise. The 94% finding? Drop everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Single-Model Review Is Still Fine
&lt;/h2&gt;

&lt;p&gt;I want to be fair here. Single-model review isn't bad — it's just different.&lt;/p&gt;

&lt;p&gt;For fast iteration during development, single-model is great. You're not trying to catch every bug; you're trying to get quick feedback while the code is fresh. Running &lt;code&gt;2ndopinion watch&lt;/code&gt; gives you that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Continuous monitoring — single model, fast feedback loop&lt;/span&gt;
2ndopinion watch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For code that's about to merge to main — especially anything touching auth, payments, or data pipelines — the consensus pass is worth the extra 10-15 seconds and the 2 additional credits.&lt;/p&gt;

&lt;p&gt;The mental model I've landed on: &lt;strong&gt;single-model for development velocity, consensus for pre-merge quality gates.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Deeper Lesson: Models Have Blind Spots
&lt;/h2&gt;

&lt;p&gt;The thing I didn't fully appreciate before building multi-model review into my workflow: AI models have systematic blind spots, not random ones.&lt;/p&gt;

&lt;p&gt;If Claude misses a certain class of bug, it tends to &lt;em&gt;consistently&lt;/em&gt; miss that class. It's not a random error — it's a bias in how the model was trained. That means if you only ever use Claude, you'll ship the same categories of bugs repeatedly without ever knowing they're being systematically missed.&lt;/p&gt;

&lt;p&gt;Multi-model consensus surfaces those blind spots by triangulating from different vantage points. It's the same reason we have human code reviewers with different backgrounds look at the same PR.&lt;/p&gt;

&lt;p&gt;One model trained heavily on Python might under-weight JavaScript async patterns. Another trained on a lot of library code might be overly conservative about application-layer error handling. When you combine them, the idiosyncrasies average out.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;If you want to see this difference yourself, there's a free playground at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; — no signup required. Paste your code, run both modes, and compare the outputs side by side.&lt;/p&gt;

&lt;p&gt;Or install the CLI and try it on your own codebase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli

&lt;span class="c"&gt;# Single model&lt;/span&gt;
2ndopinion review

&lt;span class="c"&gt;# Consensus (3 models, confidence-weighted)&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--consensus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first time you see a consensus pass catch something a single-model review confidently missed, you'll get it. That's the moment it clicked for me.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;2ndOpinion is a multi-model AI code review tool. Claude, Codex, and Gemini cross-check each other's findings via MCP, CLI, Python SDK, REST API, and GitHub PR Agent. Free playground at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codequality</category>
      <category>codereview</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to Add Multi-Model AI Code Review to Claude Code in 30 Seconds</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Thu, 02 Apr 2026 23:16:45 +0000</pubDate>
      <link>https://dev.to/brianmello/how-to-add-multi-model-ai-code-review-to-claude-code-in-30-seconds-4aoe</link>
      <guid>https://dev.to/brianmello/how-to-add-multi-model-ai-code-review-to-claude-code-in-30-seconds-4aoe</guid>
      <description>&lt;p&gt;You know that moment when Claude reviews your code, gives it the green light, and then two days later you're debugging a production issue that &lt;em&gt;three humans&lt;/em&gt; would have caught immediately?&lt;/p&gt;

&lt;p&gt;Single-model AI code review has a blind spot problem. Each model was trained on different data, has different failure modes, and holds different opinions about what "correct" looks like. When you only ask one AI, you're getting one perspective — and that perspective has systematic gaps.&lt;/p&gt;

&lt;p&gt;Multi-model consensus code review flips the script. Instead of trusting one AI, you get Claude, GPT-4o, and Gemini to cross-check each other. Where all three agree, you can be confident. Where they diverge, &lt;em&gt;that's&lt;/em&gt; where you need to look closer.&lt;/p&gt;

&lt;p&gt;Here's how to set it up in Claude Code in about 30 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Single-Model Review
&lt;/h2&gt;

&lt;p&gt;Let me be direct: single-model AI code review is better than nothing. But it has a fundamental flaw — the model doesn't know what it doesn't know.&lt;/p&gt;

&lt;p&gt;I ran an experiment last month. I fed the same set of 50 bugs to Claude, GPT-4o, and Gemini separately. Each model caught some bugs the others missed. GPT-4o was better at certain Python anti-patterns. Gemini caught more async/concurrency issues. Claude excelled at security-related edge cases.&lt;/p&gt;

&lt;p&gt;No model caught everything. But when I used all three in consensus mode? Coverage went up significantly.&lt;/p&gt;

&lt;p&gt;This is the case for multi-model AI code review — it's not about any single model being bad, it's about combining strengths.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up 2ndOpinion via MCP in 30 Seconds
&lt;/h2&gt;

&lt;p&gt;2ndOpinion is an AI-to-AI communication platform that routes your code to multiple models simultaneously and returns a confidence-weighted consensus. It plugs into Claude Code via MCP.&lt;/p&gt;

&lt;p&gt;Here's the config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2ndopinion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2ndopinion-mcp"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"SECONDOPINION_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-api-key-here"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Drop that into your Claude Code MCP config file (usually &lt;code&gt;~/.claude/mcp_config.json&lt;/code&gt;), restart Claude Code, and you're done. No extra dependencies. No separate process to run.&lt;/p&gt;
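&lt;p&gt;If you already have other MCP servers registered, you'll want to merge rather than paste over the whole file. A small helper like this — hypothetical, not something 2ndOpinion ships — does it non-destructively:&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path

def add_mcp_server(config_path: Path, name: str, entry: dict) -> dict:
    # Read the existing config if there is one, add or replace just this
    # one server entry, and write the file back -- other servers survive.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("mcpServers", {})[name] = entry
    config_path.write_text(json.dumps(config, indent=2))
    return config

# Demo against a temp file; point config_path at your real MCP config
# (e.g. ~/.claude/mcp_config.json) for actual use.
demo_path = Path(tempfile.mkdtemp()) / "mcp_config.json"
cfg = add_mcp_server(demo_path, "2ndopinion", {
    "command": "npx",
    "args": ["-y", "2ndopinion-mcp"],
    "env": {"SECONDOPINION_API_KEY": "your-api-key-here"},
})
```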

&lt;p&gt;Once it's wired up, you have access to these tools directly inside Claude Code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;review&lt;/code&gt;&lt;/strong&gt; — standard multi-model code review (2 credits)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;consensus&lt;/code&gt;&lt;/strong&gt; — parallel review from 3 models with confidence weighting (3 credits)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;debate&lt;/code&gt;&lt;/strong&gt; — multi-round AI debate for architecture decisions (5–7 credits)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;bug_hunt&lt;/code&gt;&lt;/strong&gt; — targeted bug detection sweep&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;security_audit&lt;/code&gt;&lt;/strong&gt; — security-focused review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't have to pick a model by hand, either: the &lt;code&gt;--llm auto&lt;/code&gt; flag routes each review to the best model for your language, based on real accuracy data.&lt;/p&gt;

&lt;h2&gt;Running Your First Consensus Review&lt;/h2&gt;

&lt;p&gt;Once the MCP is connected, you can trigger a review in plain English inside Claude Code:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Run a consensus code review on this file."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or you can use the CLI directly if you prefer the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install globally&lt;/span&gt;
npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli

&lt;span class="c"&gt;# Review a specific file&lt;/span&gt;
2ndopinion review src/auth/token-validator.ts

&lt;span class="c"&gt;# Full consensus (3 models in parallel)&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--consensus&lt;/span&gt; src/auth/token-validator.ts

&lt;span class="c"&gt;# Watch mode — auto-review on every save&lt;/span&gt;
2ndopinion watch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The consensus output tells you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Where all three models agree&lt;/strong&gt; — high-confidence issues; fix these immediately&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where two out of three agree&lt;/strong&gt; — worth a look, especially for complex logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where models disagree&lt;/strong&gt; — the most interesting category; often means an ambiguous design tradeoff&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That last category is my favorite. When GPT-4o says "this is fine" and Claude says "this will blow up under load" — that's a signal to dig in, not dismiss.&lt;/p&gt;
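&lt;p&gt;The three-tier triage itself is simple enough to sketch. Assuming each issue comes back tagged with the models that flagged it (the field names here are illustrative, not the platform's actual output schema), the bucketing is just a count:&lt;/p&gt;

```python
def triage(flags: dict, total_models: int = 3) -> dict:
    # Bucket each issue by how many of the models flagged it.
    tiers = {}
    for issue, models in flags.items():
        if len(models) == total_models:
            tiers[issue] = "consensus"   # all agree: fix immediately
        elif len(models) >= 2:
            tiers[issue] = "majority"    # worth a look
        else:
            tiers[issue] = "disputed"    # ambiguous tradeoff: dig in
    return tiers

tiers = triage({
    "sql-injection": ["claude", "gpt-4o", "gemini"],
    "unclosed-connection": ["claude", "gpt-4o"],
    "none-return": ["claude"],
})
# {'sql-injection': 'consensus', 'unclosed-connection': 'majority',
#  'none-return': 'disputed'}
```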

&lt;h2&gt;What the Output Actually Looks Like&lt;/h2&gt;

&lt;p&gt;Here's a real example. I had this Python function I was shipping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM users WHERE id = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running &lt;code&gt;2ndopinion review --consensus&lt;/code&gt; on this file returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔴 CONSENSUS (3/3 models agree): SQL injection vulnerability
   Line 3: f-string interpolation in SQL query
   Fix: Use parameterized queries

🟡 MAJORITY (2/3 models): Connection not closed on exception
   Line 2: db.connect() has no context manager / finally block
   Claude, GPT-4o: Flag | Gemini: Acceptable (with connection pooling)

🟢 LOW CONFIDENCE (1/3 models): Return type may be None
   Line 4: fetchone() returns None if no row found
   Only Claude flagged this
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQL injection is obvious in hindsight — all three models agree, high confidence. The connection handling disagreement is &lt;em&gt;interesting&lt;/em&gt; — it tells me something about the environment assumptions baked into each model. And the None return type is a low-confidence flag worth noting for future-proofing.&lt;/p&gt;

&lt;p&gt;This is what multi-model AI code review buys you: not just more issues, but a &lt;em&gt;quality signal&lt;/em&gt; on each issue.&lt;/p&gt;
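&lt;p&gt;For what it's worth, a version that addresses all three findings might look like this — using &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in for the article's unspecified &lt;code&gt;db&lt;/code&gt; client:&lt;/p&gt;

```python
import sqlite3
from contextlib import closing
from typing import Optional

def get_user_data(db_path: str, user_id: str) -> Optional[dict]:
    # Parameterized query closes the SQL injection hole; closing()
    # guarantees the connection is released even if execute() raises.
    # (sqlite3 connections used as plain context managers commit
    # transactions but do NOT close -- hence contextlib.closing.)
    with closing(sqlite3.connect(db_path)) as conn:
        conn.row_factory = sqlite3.Row
        row = conn.execute(
            "SELECT * FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        # fetchone() returns None when no row matches -- surface that
        # instead of crashing inside dict().
        return dict(row) if row is not None else None
```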

&lt;h2&gt;Pattern Memory and Regression Tracking&lt;/h2&gt;

&lt;p&gt;One thing that makes 2ndOpinion useful beyond a one-off review is that it builds project context over time. It tracks which patterns it's flagged before, so it can alert you when the same class of bug reappears in a different file.&lt;/p&gt;

&lt;p&gt;If you fixed an authentication bypass three weeks ago and a new PR introduces a structurally similar issue, 2ndOpinion flags it as a regression. No additional config required — it builds this context automatically per project.&lt;/p&gt;
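&lt;p&gt;The docs don't say how that matching works under the hood, but one plausible sketch — purely my guess, not 2ndOpinion's actual implementation — is a structural fingerprint: normalize away identifiers and literals, then hash what's left, so two structurally similar snippets in different files collide:&lt;/p&gt;

```python
import ast
import hashlib

def pattern_fingerprint(source: str) -> str:
    # Hash the *shape* of the code: blank out names, argument names, and
    # literal values so only the structure contributes to the digest.
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, ast.arg):
            node.arg = "_"
        elif isinstance(node, ast.Constant):
            node.value = None
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.name = "_"
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()[:12]

# Same shape, different identifiers -> same fingerprint.
a = pattern_fingerprint("def check(t):\n    return t == 'admin'")
b = pattern_fingerprint("def verify(role):\n    return role == 'root'")
```

&lt;p&gt;Index those digests per project and a "new" bug that matches an old fingerprint becomes a regression alert rather than a fresh finding.&lt;/p&gt;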

&lt;p&gt;Combined with the GitHub PR Agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Review PR #42 from the CLI&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--pr&lt;/span&gt; 42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;...and you get automated multi-model review on every pull request, with regression awareness. The PR gets an inline comment breakdown — agreements, disagreements, and confidence levels — before a human reviewer ever opens it.&lt;/p&gt;

&lt;h2&gt;The Marketplace: Build Audits, Earn Revenue&lt;/h2&gt;

&lt;p&gt;This is the part that surprised me most. 2ndOpinion has a skills marketplace where you can publish custom audit types. If you've got deep expertise in, say, Rust memory safety or Django security patterns, you can package that into an audit skill, publish it, and earn 70% of every credit spent running it.&lt;/p&gt;

&lt;p&gt;It's an interesting model: the platform benefits from domain expertise that no general-purpose LLM has, and the experts get a revenue stream from codifying what they know.&lt;/p&gt;

&lt;h2&gt;Try It Without Signing Up&lt;/h2&gt;

&lt;p&gt;If you want to kick the tires before committing, there's a free playground at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; — no signup required. Paste a code snippet, pick your review type, and see what three models think.&lt;/p&gt;

&lt;p&gt;For the full MCP + Claude Code integration, you'll need an API key, but the setup overhead is genuinely minimal. One JSON config, one restart, and you're running confidence-weighted multi-model code review on every file you touch.&lt;/p&gt;




&lt;p&gt;Single-model AI code review is table stakes at this point. If you're serious about code quality, the next step is getting your AIs to argue with each other — and paying attention to where they agree.&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; or the &lt;a href="https://github.com/bdubtronux/2ndopinion" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; to dig into the details.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
