DEV Community: Michael Lee

How to Read a 2026 AI Benchmark Chart Without Getting Fooled

Michael Lee — Tue, 07 Jul 2026 05:47:27 +0000

Originally published on the TierUp blog. A field guide to SWE-bench Pro, Terminal-Bench 2.1, and GPQA Diamond — what they measure and where they break.

Every model launch in 2026 ships with the same artifact: a bar chart where the new model's bar is tallest. The benchmarks on that chart are mostly good ones — better than what we had two years ago. But each has failure modes the marketing copy won't mention. Here's a field guide.

SWE-bench Pro: the coding benchmark that replaced the coding benchmark

SWE-bench Verified used to be the coding number. It's now effectively retired at the frontier: OpenAI publicly stopped evaluating on it, and audits reportedly found training-data overlap across frontier models plus a large share of hard tasks with flawed tests. When every model scores 70%+ on problems it may have memorized, the number stops meaning anything.

Scale's SWE-bench Pro is the replacement: 1,865 real issue-to-patch tasks across 41 repositories in Python, Go, TypeScript, and JavaScript, split into public (731), held-out (858), and commercial (276) sets. Contamination is fought structurally: tasks come from strong-copyleft codebases and fully private commercial repos that model trainers can't legally ingest. The reset was brutal: at launch, Claude Opus 4.1 and GPT-5 scored ~23% here versus 70%+ on Verified. Today Claude Opus 4.8 leads at 69.2%, with Z.ai's open-weight GLM-5.2 at 62.1%.

Caveat: watch which subset a vendor quotes. At launch, GPT-5 scored 23.1% on the public set but 14.9% on the commercial set. Same model, same benchmark name, meaningfully different number.

Terminal-Bench 2.1: agents in a real shell

Terminal-Bench 2.1 drops an agent into containerized terminal environments — 89 hard, human-authored tasks like compiling projects, training models, and configuring servers — and checks the end state with automated tests. It's the best public proxy we have for "can this thing actually operate a computer unattended." Current top scores: Claude Fable 5 at 88.0%, GPT-5.5 around 83–84%.

Two caveats. First, version churn: 2.1 is harder than 2.0, so scores across versions are not comparable — a model "dropping" between versions may have gotten better. Second, harness sensitivity: Terminal-Bench scores a model plus an agent scaffold, and the same model posts different numbers under different harnesses. Z.ai's GLM-5.2 announcement lists GPT-5.5 at 84.0; an independent leaderboard lists 83.4. Small gap here, but scaffold choice has swung other results by far more. Always ask: whose harness?

GPQA Diamond: saturated, and noisy at the top

GPQA Diamond is 198 PhD-level multiple-choice questions in biology, physics, and chemistry — hard enough that PhD-holding experts scored ~69.7%. It was a great differentiator in 2024. In 2026, the frontier clusters at 91–94% (Gemini 3.1 Pro ~94.3%, Claude Opus 4.7/4.8 ~94.2/93.6%), and that's the problem: with 198 questions, one question is half a point, and Epoch AI's runs carry ±2% error bars plus formatting-related scoring noise. A 0.7-point lead on GPQA Diamond is statistically indistinguishable from a tie. The same is true of AIME-style math, where top models now score 98–99%.
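
To see why a sub-point lead is noise, here is a minimal sketch of the sampling error on a 198-question benchmark. It assumes each question is an independent pass/fail trial, which is a simplification (real runs also carry formatting noise), and the function name is illustrative.

```python
import math

def binomial_standard_error(accuracy: float, n_questions: int) -> float:
    """Standard error of an accuracy estimate, treating each question
    as an independent pass/fail trial (a simplifying assumption)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

N = 198  # GPQA Diamond question count, from the text above

# One question's weight in percentage points.
points_per_question = 100.0 / N

# Sampling error for a model scoring about 94%.
se_points = 100.0 * binomial_standard_error(0.94, N)

# Standard error of the difference between two independent runs near 94%.
se_diff_points = math.sqrt(2.0) * se_points

lead = 0.7  # the example lead, in points
print(f"one question    = {points_per_question:.2f} points")
print(f"standard error  = {se_points:.2f} points")
print(f"lead / SE(diff) = {lead / se_diff_points:.2f}")
```

A ratio well under 1 means the lead is smaller than one standard error of the difference, which is in line with the ±2% error bars quoted above.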

When a 2026 launch chart leads with GPQA or AIME, that's a tell: the interesting benchmarks must not have been flattering.

The successor benchmarks aren't clean either

Humanity's Last Exam exists precisely because everything above saturated — frontier models sit around 35–40% against a ~90% human-expert baseline, so there's headroom. But quality control is shaky: one analysis estimates roughly 30% of its chemistry/biology reference answers are likely wrong, and many vendor-quoted HLE scores never land on the official leaderboard. Newer isn't automatically cleaner.

How to actually read the chart

Check saturation. Any benchmark where leaders cluster above ~90% ranks noise, not capability.
Check contamination design. Prefer benchmarks with held-out or private splits (SWE-bench Pro) over static public sets.
Check the harness and subset. Vendor-run agentic scores are model+scaffold scores on the vendor's chosen split. Look for the independent leaderboard number.
Distrust single numbers entirely. GLM-5.2 beats GPT-5.5 on SWE-bench Pro and loses to it on Terminal-Bench 2.1. Neither number alone tells you which to deploy — your workload decides which benchmark is the relevant one.

The uncomfortable conclusion: "which model is best" now genuinely depends on the task, and re-litigating that question every launch week is a job in itself. That's the job we do at TierUp so you can just pick a tier.

Stop Optimizing for the Cheapest Token. Optimize Quality-per-Dollar.

Michael Lee — Tue, 07 Jul 2026 05:43:22 +0000

Originally published on the TierUp blog. The 2026 evidence on LLM routing: why both "always the flagship" and "always the cheapest" leave money on the table.

For the first couple of years of the LLM API era, teams picked a model the way they picked a database: once, emotionally, and then defended the choice in perpetuity. Some hardcoded the frontier model because "quality matters." Others hardcoded the cheapest model because "it's mostly good enough." Both camps are leaving money — or capability — on the table, and in 2026 the third-party evidence for that has gotten hard to ignore.

The price spread makes single-model choices indefensible

The gap between tiers is not 2x. Digital Applied's June 2026 routing guide puts current input pricing at roughly $0.44/M tokens for DeepSeek V4, $1/M for Claude Haiku 4.5, $3/M for Sonnet 4.6, $5/M for GPT-5.5, and $25/M for Opus 4.8 — with the full spread from cheapest input to priciest frontier output running around 100x.

A 100x spread means the routing decision is worth more than almost any other optimization you can make. Prompt caching might save you 50–90% on repeated prefixes; batching might save 50%. Sending a "reformat this JSON" request to a model priced 100x below the frontier saves 99%.

And the spread is a moving target. Epoch AI's analysis found that the price to reach a fixed capability level has been falling between 9x and 900x per year depending on the benchmark, with a median around 50x annually. Concretely: the capability you're paying frontier prices for today will be available at mid-tier prices in months. A hardcoded model choice is a depreciating asset. Gartner now projects that inference on a trillion-parameter model will cost providers over 90% less by 2030 than in 2025.
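
As a rough illustration of how fast a hardcoded choice depreciates, the sketch below converts an annual price-decline factor into the number of months needed for a given price drop. It assumes a smooth exponential decline, which real prices will not follow exactly, and the 10x target is a hypothetical example.

```python
import math

def months_to_price_drop(drop_factor: float, annual_decline: float) -> float:
    """Months until the price of a fixed capability falls by `drop_factor`,
    assuming a constant exponential decline of `annual_decline`x per year."""
    if drop_factor <= 1 or annual_decline <= 1:
        raise ValueError("both factors must be greater than 1")
    return 12.0 * math.log(drop_factor) / math.log(annual_decline)

# Slow, median, and fast decline rates quoted above (per year).
for rate in (9, 50, 900):
    months = months_to_price_drop(10, rate)  # 10x drop is a hypothetical target
    print(f"{rate:>3}x/year -> 10x cheaper in {months:.1f} months")
```

At the median rate, a 10x price drop takes about seven months under this assumption, which is what "in months" means in practice.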

The research: most queries don't need the frontier

This isn't just a pricing observation — it's an empirical one about workloads. The peer-reviewed RouteLLM work (cited in Digital Applied's guide) showed a trained router achieving 85% cost savings on MT Bench while retaining 95% of GPT-4 quality, with its matrix-factorization router needing the frontier model on only about 14% of queries. The authors' principle is worth framing: all queries that can be handled by weaker models should be routed to those models.

Production numbers line up with the lab. Eden AI's 2026 router comparison reports routing reduces LLM costs by 30–85% depending on workload and quality requirements, and Digital Applied cites teams seeing 40–85% bill reductions, with even a crude 70/30 cheap-to-frontier split yielding roughly 67% savings.
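
The 70/30 figure can be reproduced with a one-line blended-cost calculation. This sketch assumes equal token counts per request on both models and a cheap model priced at 5% of the frontier; that ratio is a hypothetical value chosen for illustration, since the guide's exact assumption isn't stated here.

```python
def blended_savings(cheap_share: float, cheap_price_ratio: float) -> float:
    """Fraction of spend saved versus sending everything to the frontier model.

    cheap_share: fraction of traffic routed to the cheap model (0..1).
    cheap_price_ratio: cheap model's per-request cost as a fraction of the
        frontier model's cost (e.g. 0.05 means 20x cheaper).
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    blended_cost = (1.0 - cheap_share) + cheap_share * cheap_price_ratio
    return 1.0 - blended_cost

# 70/30 split; the 20x price gap is an assumed value, not from the guide.
print(f"{blended_savings(0.70, 0.05):.1%}")
```

The savings are capped by the share of traffic you move, so the lever that matters most is how much of your workload can safely leave the frontier tier.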

Notice what "quality-per-dollar" is not: it is not "use the cheapest model." On the hard 14–30% of your traffic, the cheap model fails, you retry, you burn user trust, and your effective cost per successful outcome exceeds what the frontier model would have charged. Cheapest-token optimization and best-benchmark optimization are the same mistake in opposite directions — both evaluate the model in isolation instead of evaluating cost per solved task.
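
One way to make "cost per solved task" concrete is sketched below. Every number is an illustrative assumption, including the per-failure penalty that stands in for retries' human cost (support time, lost trust); none of them come from the cited studies.

```python
def cost_per_solved_task(price_per_attempt: float,
                         success_rate: float,
                         failure_penalty: float = 0.0) -> float:
    """Expected cost to obtain one successful outcome when failed attempts
    are retried on the same model until one succeeds.

    Expected attempts = 1 / success_rate, so expected failures are
    (1 - success_rate) / success_rate. `failure_penalty` is a hypothetical
    dollar cost per failed attempt (support time, lost trust).
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    expected_failures = (1.0 - success_rate) / success_rate
    return price_per_attempt / success_rate + failure_penalty * expected_failures

# All numbers below are illustrative assumptions.
CHEAP, FRONTIER, PENALTY = 0.002, 0.05, 0.10

easy_cheap = cost_per_solved_task(CHEAP, 0.98, PENALTY)
easy_frontier = cost_per_solved_task(FRONTIER, 0.99, PENALTY)
hard_cheap = cost_per_solved_task(CHEAP, 0.40, PENALTY)
hard_frontier = cost_per_solved_task(FRONTIER, 0.95, PENALTY)

print(f"easy: cheap ${easy_cheap:.4f} vs frontier ${easy_frontier:.4f}")
print(f"hard: cheap ${hard_cheap:.4f} vs frontier ${hard_frontier:.4f}")
```

With the penalty set to zero, the cheap model still wins on raw token cost in this example. The crossover on hard queries comes from pricing in what a failed answer costs you, which is why the metric has to be measured on your own workload.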

The honest caveats

Routing is not a free lunch, and it's worth stating the failure modes plainly:

Silent quality regression is the real risk. Digital Applied's guide describes degraded answers surfacing in customer tickets days later rather than on a dashboard. The mitigation is unglamorous: an eval suite of a few hundred representative cases that gates any routing-policy change, exactly like a test suite gates a deploy.
Router overhead matters, but less than you'd think — rule-based routing adds under 1ms and even ML classifiers add 50–100ms against typical 500–2,000ms inference times.
Some workloads shouldn't be routed. If 95% of your traffic genuinely needs frontier reasoning, a router is complexity without payoff. Measure first.
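
The eval-suite gate from the first caveat can be sketched in a few lines. The function name, the callables, and the 1-point regression tolerance are all hypothetical; a real gate would call your routing policy and a grader rather than read precomputed flags.

```python
from typing import Callable, Sequence

def gate_policy_change(cases: Sequence[dict],
                       baseline: Callable[[dict], bool],
                       candidate: Callable[[dict], bool],
                       max_regression: float = 0.01) -> bool:
    """Return True if the candidate routing policy may ship.

    `baseline` and `candidate` run one eval case end to end and return
    whether the answer passed. The change is blocked when the candidate's
    pass rate falls more than `max_regression` below the baseline's.
    """
    if not cases:
        raise ValueError("eval suite is empty")
    base_rate = sum(baseline(c) for c in cases) / len(cases)
    cand_rate = sum(candidate(c) for c in cases) / len(cases)
    return cand_rate >= base_rate - max_regression

# Toy stand-ins: each case records whether each policy would pass it.
suite = [{"base_ok": True, "cand_ok": i % 10 != 0} for i in range(300)]
allowed = gate_policy_change(suite,
                             baseline=lambda c: c["base_ok"],
                             candidate=lambda c: c["cand_ok"])
print("ship" if allowed else "blocked")
```

Here the candidate fails one case in ten, so the change is blocked before it reaches customers instead of surfacing in tickets days later.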

Where TierUp fits

This thesis is why TierUp exists. Instead of hardcoding a model ID, you pick a performance tier and we route each request to the model currently offering the best quality-per-dollar at that tier — repriced as the market moves, so Epoch AI's 50x-per-year deflation shows up on your bill instead of your provider's margin. Same API shape, below-retail pricing, and no vendor archaeology every time a new model ships.

Why AI Bills Explode While Token Prices Fall

Michael Lee — Tue, 07 Jul 2026 05:40:17 +0000

Originally published on the TierUp blog. Per-token prices fell ~280x in two years and enterprise AI budgets still tripled — here's the math behind the paradox.

Here's the paradox defining AI budgets in 2026: per-token prices have been in freefall, and total spend keeps going up anyway. Henon's analysis leads with the headline version — token prices fell 98% while enterprise AI costs tripled. Oplexa's inference-cost report, citing Epoch AI and AnalyticsWeek data, frames it even more starkly: effective per-token costs down roughly 280x over two years (from ~$30/M in 2023 to ~$0.10/M for comparable capability in 2026), while average enterprise AI budgets grew from about $1.2M in 2024 to $7M in 2026 — and inference now eats ~85% of the AI budget, up from 40% in 2023.

Falling prices didn't fail. Volume won.

Where the volume comes from

Agents multiply calls. A chatbot answers a question with one model call. An agent plans, calls tools, reads results, retries, and self-checks. Gartner's March 2026 analysis, as cited by Oplexa, found agentic workflows make 10–20 LLM calls per user-initiated task and consume 5–30x more tokens than a standard chatbot interaction. Every product that quietly upgraded from "chat" to "agent" this year multiplied its token volume by an order of magnitude without changing its pricing page — or yours.

RAG inflates every call. Retrieval-augmented requests carry 3–5x more tokens than the bare question, per the same Gartner-cited analysis. That's the point of RAG — but it means your input volume scales with your document chunking strategy, not your user count. And as we covered in the tokenizer tax post, fat contexts can also push you across long-context pricing thresholds.

Always-on beats per-request. Monitoring agents, background summarizers, and scheduled pipelines consume tokens around the clock whether or not a human is watching. Usage stops tracking headcount.

Humans, given leverage, use more of it. TechCrunch's June 2026 report on the industry's cost scramble has the receipts: Jellyfish's research head measured per-developer token consumption rising ~18.6x in nine months. Their study found the heaviest token users were about twice as productive — but spent 10x more tokens getting there. Uber reportedly blew through its entire 2026 AI coding budget by April. Priceline saw a Cursor renewal come back 4–5x more expensive, with one engineer spending $40,000 on tokens in a single month. One company reportedly discovered a $500 million Claude bill after failing to set usage limits.

The pattern across all four: cost per token fell, tokens per outcome exploded, and outcomes per user grew. Multiply three curves and the product points up.
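
The multiplication is simple enough to write down. The three factors below are hypothetical round numbers chosen to show the shape of the effect, not figures from any of the cited reports.

```python
def spend_multiplier(price_per_token_change: float,
                     tokens_per_outcome_change: float,
                     outcomes_change: float) -> float:
    """Total spend is price/token x tokens/outcome x outcomes, so the
    change in spend is the product of the three change factors."""
    return price_per_token_change * tokens_per_outcome_change * outcomes_change

# Hypothetical factors: price falls 10x, tokens per outcome rise 20x,
# outcomes triple.
multiplier = spend_multiplier(1 / 10, 20, 3)
print(f"spend changes by {multiplier:.1f}x")
```

A 10x price cut is wiped out as soon as the other two factors multiply to more than 10, which an agent upgrade alone can manage.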

The mitigation checklist

The response emerging across the industry — TechCrunch describes the conversation shifting wholesale from capability to "guardrails," and a Tokenomics Foundation standards body launching this month — amounts to FinOps for AI. The practical version:

Route by task difficulty. Most calls in an agent loop are glue — classification, extraction, formatting — and don't need a frontier model. Oplexa reports model routing cutting spend 60–80%, the single largest lever on their list.
Set hard budgets and per-group limits. Priceline's approach per TechCrunch: token limits on employee groups. Alerts are not limits; limits are limits. (See also: the reported $500M bill.)
Cache aggressively. Prompt caching (up to 90% off cached input) and semantic caching (30–50% savings per Oplexa) attack the RAG-inflation problem directly.
Batch what isn't interactive. Batch APIs run 50% off at major providers. Background summarizers and nightly pipelines rarely need real-time pricing.
Cap agent loops. Set maximum iterations and maximum tool calls per task. An agent that retries itself into a 20-call loop is a cost incident, not a feature.
Trim retrieval. Measure whether your 3–5x context inflation actually improves answers. Rerank harder, stuff less.
Meter tokens per outcome. Track tokens-per-resolved-task, not spend-per-month. It's the only metric that separates "we're doing more" from "we're wasting more."
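
Two items on the checklist, capping agent loops and metering tokens per outcome, fit in one small wrapper. This is a sketch under assumptions: `step` is a hypothetical callable standing in for one agent iteration, and the default limits are placeholders to tune.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class LoopLimits:
    max_iterations: int = 8      # hard cap on plan/act cycles
    max_tokens: int = 50_000     # hard cap on tokens billed for the task

class BudgetExceeded(RuntimeError):
    """Raised when a task hits a limit before producing a result."""

def run_capped(step: Callable[[int], Tuple[Optional[str], int]],
               limits: LoopLimits) -> Tuple[str, int]:
    """Run `step` until it returns a result or a limit is hit.

    `step(i)` is a hypothetical callable that performs one agent
    iteration and returns (result_or_None, tokens_used).
    Returns (result, total_tokens) so tokens-per-resolved-task can be logged.
    """
    total_tokens = 0
    for i in range(limits.max_iterations):
        result, used = step(i)
        total_tokens += used
        if total_tokens > limits.max_tokens:
            raise BudgetExceeded(f"token cap hit after {i + 1} iterations")
        if result is not None:
            return result, total_tokens
    raise BudgetExceeded(f"no result after {limits.max_iterations} iterations")

# Toy step: finishes on the third iteration, 1,200 tokens each.
result, tokens = run_capped(lambda i: ("done" if i == 2 else None, 1_200),
                            LoopLimits())
print(result, tokens)
```

Raising an exception rather than logging a warning is deliberate: it makes the limit a limit, in the spirit of the budget item above.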

The honest takeaway

Rising AI spend isn't automatically a problem — Jellyfish's data shows the heavy spenders really were more productive. The problem is unexamined spend: frontier models doing glue work, uncapped loops, and nobody owning the tokens-per-outcome number. Prices will keep falling. Your bill will keep rising. The only variable you control is how much of that bill buys something.

Routing every call to the cheapest tier that clears your quality bar is item one on the checklist — and it's the entire premise of TierUp.

The Tokenizer Tax: How Your Bill Goes Up Without a Price Change

Michael Lee — Sun, 05 Jul 2026 16:09:54 +0000

Originally published on the TierUp blog. A case study in how an LLM bill rises 12–27% with zero change to the rate card.

The rate card is the least interesting number on your AI invoice. What you actually pay is price × tokens, and providers have far more ways to move the second factor than the first. The clearest recent example: Claude Opus 4.7.

Same price, more tokens

When Opus 4.7 shipped this spring, the sticker price didn't move: $5/M input, $25/M output, the same rates Anthropic has held since Opus 4.5, as Finout's pricing analysis notes. What changed was the tokenizer. Anthropic's own documentation disclosed that the new tokenizer produces 1.0–1.35x as many tokens for the same text, with the high end landing on code, structured data, and non-English text.

Independent measurements suggest the official range was, if anything, conservative:

ClaudeCodeCamp's measurement post found 1.47x on technical documentation and 1.445x on real CLAUDE.md files — above the documented ceiling — with a weighted average of about 1.325x across real coding-session content. Characters-per-token fell from 4.33 to 3.60 for English prose and from 3.66 to 2.69 for TypeScript.
OpenRouter's analysis (published April 27, 2026) measured 32–45% token inflation across prompt-size buckets, translating to real-world cost increases of 12–27% for most workloads. The interesting exception: prompts under 2K tokens came out about 1.6% cheaper, and prompt caching absorbed much of the inflation on very long contexts (93% of the extra tokens were cache reads in the 128K+ bucket).
ClaudeCodeCamp's end-to-end estimate: a typical 80-turn coding session that cost about $6.65 on Opus 4.6 runs $7.86–$8.76 on 4.7, an increase of roughly 18–32% at an identical rate card.

To be fair, this wasn't a stealth price hike. Anthropic documented the range and gave a rationale — finer-grained tokens improve literal instruction-following and tool-call precision, per the stated reasoning quoted in ClaudeCodeCamp's writeup. You may well be getting a better model per dollar. But if your budget model assumed "price unchanged = cost unchanged," it's now wrong by up to a quarter.
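
The characters-per-token figures above translate directly into a token multiplier, and from there into cost. The sketch below uses the measured ratios and the unchanged $5/M input rate; the 1M-character prompt is a hypothetical workload, and it ignores caching and output tokens.

```python
def token_inflation(old_chars_per_token: float, new_chars_per_token: float) -> float:
    """Multiplier on token count for the same text when characters-per-token
    drops from the old value to the new one."""
    return old_chars_per_token / new_chars_per_token

def cost(tokens: float, price_per_million: float) -> float:
    """Dollar cost of `tokens` at a per-million-token rate."""
    return tokens / 1_000_000 * price_per_million

# Characters-per-token figures from the ClaudeCodeCamp measurements above.
prose = token_inflation(4.33, 3.60)
typescript = token_inflation(3.66, 2.69)

# Hypothetical 1M-character TypeScript prompt at the unchanged $5/M input rate.
chars = 1_000_000
before = cost(chars / 3.66, 5.00)
after = cost(chars / 2.69, 5.00)
print(f"prose x{prose:.2f}, TypeScript x{typescript:.2f}")
print(f"input cost ${before:.2f} -> ${after:.2f}")
```

Uncached input cost scales one-for-one with the token multiplier. Measured bill increases come in lower than raw inflation because caching and the input/output mix absorb part of it.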

The other hidden multipliers

The tokenizer tax is one member of a family. None of these show up as a price change; all of them change what you pay.

The output premium. Every major model charges a multiple for output over input: 5x on Opus 4.7 and Sonnet 4.6 ($25 vs $5, $15 vs $3), 6x on GPT-5.5 and Gemini 3 Flash, per the major pricing trackers (see our pricing roundup). As Finout puts it, output token growth matters more than input growth precisely because of this multiplier. A model that's slightly chattier — longer explanations, more verbose chain-of-thought, bigger tool-call payloads — raises your bill with no pricing announcement at all.

Long-context surcharges. Gemini 3.1 Pro charges $2/$12 up to 200K context but $4/$18 beyond it, per CloudZero's pricing data. Cross that threshold with a bloated RAG pipeline and your marginal input rate doubles — again with no change to any published price.
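
Here is a minimal sketch of what that threshold does to a single request, using the $2/$12 and $4/$18 rates quoted above. It assumes the higher rates apply to the whole request once the prompt exceeds 200K tokens; check the provider's billing documentation before relying on that, and treat the function name as illustrative.

```python
def tiered_request_cost(input_tokens: int, output_tokens: int,
                        threshold: int = 200_000) -> float:
    """Request cost in dollars under a two-tier rate card.

    Rates are the $2/$12 and $4/$18 per-million figures quoted above.
    Assumption: once the prompt exceeds `threshold`, the higher rates
    apply to the whole request, not only to the tokens past the threshold.
    """
    if input_tokens <= threshold:
        in_rate, out_rate = 2.00, 12.00
    else:
        in_rate, out_rate = 4.00, 18.00
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

just_under = tiered_request_cost(199_000, 1_000)
just_over = tiered_request_cost(201_000, 1_000)
print(f"199K prompt: ${just_under:.3f}   201K prompt: ${just_over:.3f}")
```

Under that assumption, 2,000 extra prompt tokens roughly double the cost of the call, which is why trimming retrieval near the threshold pays off as a step change.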

Cache invalidation on model upgrades. Prompt caching is the biggest legitimate discount available (up to 90% on cache reads). But caches are model-partitioned: when you upgrade, every cached prefix must be rewritten — and after a tokenizer change, the prefix you're re-caching is 1.3–1.45x larger than before, as ClaudeCodeCamp documented. Budget for an expensive cold-start week after every migration.

Retries and truncation. A failed or truncated call you retry costs full price both times; the arithmetic is unforgiving in agent loops where one flaky step re-runs an entire chain. Timeouts, malformed tool calls, and max-token truncations are all billable events.

What to do about it

Meter tokens, not requests. Track tokens-per-task over time; that's the metric that catches a tokenizer change or creeping verbosity. Dollar dashboards lag; token dashboards lead.
Re-benchmark cost on every model upgrade, not just quality. Run your standard eval set and compare billed tokens, not request counts, before and after.
Cap output. Set max_tokens deliberately and prefer terse output formats — every output token is 4–6 input tokens' worth of money.
Watch context thresholds. If you're near a long-context pricing tier, trimming retrieval is a step-function saving, not a marginal one.
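
The first two items can share one small check: run the same eval set before and after an upgrade and compare billed tokens per task. The record shape, the toy numbers, and the 10% alert threshold below are assumptions for illustration.

```python
from statistics import mean

def tokens_per_task(records: list[dict]) -> float:
    """Mean billed tokens (input + output) per task in an eval run.
    Each record is a hypothetical log row with integer token counts."""
    return mean(r["input_tokens"] + r["output_tokens"] for r in records)

def upgrade_token_ratio(before: list[dict], after: list[dict]) -> float:
    """Ratio of mean tokens-per-task after an upgrade to before it,
    measured on the same eval set."""
    return tokens_per_task(after) / tokens_per_task(before)

# Toy logs for the same three eval tasks on the old and new model.
old_run = [{"input_tokens": 1000, "output_tokens": 200},
           {"input_tokens": 4000, "output_tokens": 500},
           {"input_tokens": 2500, "output_tokens": 300}]
new_run = [{"input_tokens": 1300, "output_tokens": 220},
           {"input_tokens": 5400, "output_tokens": 520},
           {"input_tokens": 3300, "output_tokens": 310}]

ratio = upgrade_token_ratio(old_run, new_run)
ALERT_THRESHOLD = 1.10  # assumed tolerance; tune to your budget
print(f"tokens per task x{ratio:.2f}", "ALERT" if ratio > ALERT_THRESHOLD else "ok")
```

The request count is identical in both runs, so a request-level dashboard would show nothing; only the token ratio exposes the change.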

The rate card is marketing. The multipliers are the bill. Tracking those multipliers across providers is most of what cost-aware routing means — and it's the work TierUp does so you don't have to.

The State of LLM API Pricing: July 2026

Michael Lee — Sun, 05 Jul 2026 14:17:05 +0000

Originally published on the TierUp blog.

If you last looked at a model price sheet a year ago, the single most important thing that changed isn't any one number. It's the spread. As of this month, published per-token prices run from about $0.075 per million input tokens at the bottom (Gemini 2.5 Flash-Lite, per APIpulse's June 2026 survey) to $30 input / $180 output at the top (OpenAI's GPT-5.5 Pro tier, confirmed across APIpulse, CloudZero, and CostGoat).

That's roughly a 400x spread on input and a 600x spread on output. Two API calls that look identical in your code can differ in cost by more than two orders of magnitude depending on one string: the model name.

The landscape in one table

Prices below are per million tokens, cross-checked against three trackers updated between May 11 and July 5, 2026. Prices move; verify against the provider's page before committing budget.

Model	Input $/M	Output $/M
GPT-5.5 Pro	$30.00	$180.00
Claude Opus 4.7	$5.00	$25.00
GPT-5.5	$5.00	$30.00
Claude Sonnet 4.6	$3.00	$15.00
Gemini 3.1 Pro (≤200K context)	$2.00	$12.00
Claude Haiku 4.5	$1.00	$5.00
Gemini 3 Flash	$0.50	$3.00
Gemini 2.5 Flash-Lite	$0.075	$0.30

A few footnotes that matter more than they look:

Long context costs extra. Gemini 3.1 Pro doubles its input rate (to $4/M) and raises output to $18/M once you cross 200K tokens of context, per CloudZero's data.
Naming churn is real. CloudZero's May snapshot listed the $30/$180 OpenAI tier as "GPT-5.4 Pro"; APIpulse and CostGoat now list "GPT-5.5 Pro" at the identical price. The tier is stable even when the model name isn't — plan around tiers, not names.
Open-weight-hosted models anchor the floor. DeepSeek's models are listed at $0.27/$1.10 (V3.2, CloudZero) down to $0.14/$0.28 for newer flash variants (APIpulse). The budget floor is crowded and keeps dropping.
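
The spread and the output-to-input multiples can be checked straight from the table. The prices below are copied from it; nothing else is assumed.

```python
# (input $/M, output $/M) copied from the table above.
PRICES = {
    "GPT-5.5 Pro": (30.00, 180.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 3.1 Pro (<=200K context)": (2.00, 12.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Gemini 3 Flash": (0.50, 3.00),
    "Gemini 2.5 Flash-Lite": (0.075, 0.30),
}

inputs = [p[0] for p in PRICES.values()]
outputs = [p[1] for p in PRICES.values()]

input_spread = max(inputs) / min(inputs)
output_spread = max(outputs) / min(outputs)
output_multiples = {name: out / inp for name, (inp, out) in PRICES.items()}

print(f"input spread  {input_spread:.0f}x")
print(f"output spread {output_spread:.0f}x")
for name, multiple in output_multiples.items():
    print(f"{name:<32} output = {multiple:.0f}x input")
```

Keeping the rate card as data like this also makes it easy to rerun the numbers whenever a tracker updates.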

What the spread actually means for you

The middle tier is where most production work belongs. Claude Sonnet 4.6 ($3/$15) and GPT-5.4 ($2.50/$15) are the consensus workhorses in every tracker we checked — frontier-adjacent quality at roughly 1/12th the cost of the Pro tiers. The $30/$180 tier buys measurably better performance on hard reasoning, but at 12x the price of models that handle the large majority of real workloads fine.

Output pricing is the quiet killer. Every model in the table charges 4–6x more for output than input. If your workload is generation-heavy (long answers, code, reports), the output column is the one to optimize — a topic big enough that we wrote a separate post on hidden cost multipliers.

Discounts are large and underused. Batch APIs run 50% off and prompt caching discounts cached input by up to 90% at the major providers, per CloudZero. If you're paying rack rate on repetitive prefixes, you're overpaying by design.

The uncomfortable implication

A 400–600x price spread means model selection is now a bigger cost lever than any infrastructure decision most teams will make this year. Hardcoding a flagship model name into every call path was defensible when the spread was 10x. At 600x, it's a budget decision being made by a config file nobody has reviewed since March.

The practical move: classify your workloads by the quality they actually need, route each class to the cheapest tier that clears the bar, and re-check quarterly — because as the naming churn above shows, the map gets redrawn every few months. That's the exact problem TierUp's tier-based routing exists to automate — disclosure: I'm the founder, and the tier-1 free playground at tierup.ai/try needs no signup if you want to see it.
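
As a minimal sketch of "cheapest tier that clears the bar", assuming you already have an eval score per tier for each workload class: the tier names, prices, and scores below are all hypothetical, and this is not a description of how TierUp selects models.

```python
from typing import NamedTuple, Optional

class Tier(NamedTuple):
    name: str
    blended_price: float   # $ per million tokens, illustrative
    eval_score: float      # pass rate on your own eval set, 0..1

def cheapest_tier_clearing(tiers: list[Tier], quality_bar: float) -> Optional[Tier]:
    """Return the lowest-priced tier whose eval score meets the bar,
    or None if no tier qualifies."""
    qualifying = [t for t in tiers if t.eval_score >= quality_bar]
    return min(qualifying, key=lambda t: t.blended_price, default=None)

# Hypothetical tiers and scores for one workload class.
TIERS = [
    Tier("budget", 0.40, 0.82),
    Tier("mid", 3.00, 0.93),
    Tier("frontier", 9.00, 0.96),
    Tier("pro", 60.00, 0.98),
]

for workload, bar in {"json-reformat": 0.80, "code-review": 0.95}.items():
    choice = cheapest_tier_clearing(TIERS, bar)
    print(workload, "->", choice.name if choice else "no tier clears the bar")
```

The quarterly re-check then amounts to refreshing the prices and scores and rerunning the selection.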

Tiers, not models: designing an LLM router on Cloudflare Workers

Michael Lee — Sun, 05 Jul 2026 09:40:32 +0000

Every LLM app I've shipped had the same shelf life: pick the best model, hardcode it, and watch it become the second-best model within a month. The fix I keep seeing is a config file full of model strings and a quarterly migration chore. I wanted the abstraction one level up: "how smart does this request need to be?" — so I built a router around performance tiers instead of model names.

The tier contract

Four tiers: Speed / Balance / Intelligence / Reasoning. The API is OpenAI-compatible; once base_url points at the router, model="tier-2" is the only change a client makes:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.tierup.ai/v1",
    api_key="...",
)

resp = client.chat.completions.create(
    model="tier-2",  # 1=speed, 2=balance, 3=intelligence, 4=reasoning
    messages=[{"role": "user", "content": "..."}],
)

Each tier maps to the current best-value model in its class — that mapping is my problem, versioned server-side, so an upgrade reaches every client with zero code changes on their side.

The stack, concretely

One Cloudflare Worker (Hono) fronts everything: auth (API key or Supabase JWT), a D1 database for users/wallets/request logs, KV for rate limits, and OpenRouter as the upstream aggregator. The Worker validates the request, checks the wallet, rewrites tier-N to the mapped model, proxies (streaming or not), then strips provider/model details from the response so the tier abstraction doesn't leak. Usage and cost are logged per request in D1; billing deducts from a prepaid wallet.
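
The rewrite-and-strip step is the heart of the abstraction, so here is a rough sketch of it. It is written in Python to match the client example rather than the Worker's own code, the model IDs are placeholders, and the `provider` field is an assumed example of an upstream detail worth hiding.

```python
import copy

# Hypothetical tier -> upstream model mapping; the real one is server-side
# and versioned, and these model IDs are placeholders.
TIER_MAP = {
    "tier-1": "example/speed-model",
    "tier-2": "example/balance-model",
    "tier-3": "example/intelligence-model",
    "tier-4": "example/reasoning-model",
}

def rewrite_request(body: dict) -> dict:
    """Replace the client's tier name with the mapped upstream model ID."""
    tier = body.get("model")
    if tier not in TIER_MAP:
        raise ValueError(f"unknown tier: {tier!r}")
    upstream = copy.deepcopy(body)
    upstream["model"] = TIER_MAP[tier]
    return upstream

def strip_response(response: dict, tier: str) -> dict:
    """Hide upstream details so the client only ever sees the tier name."""
    cleaned = copy.deepcopy(response)
    cleaned["model"] = tier
    cleaned.pop("provider", None)  # aggregator-specific field, if present
    return cleaned

request = {"model": "tier-2", "messages": [{"role": "user", "content": "hi"}]}
upstream_request = rewrite_request(request)
upstream_response = {"id": "x", "model": "example/balance-model",
                     "provider": "ExampleCo", "choices": []}
client_response = strip_response(upstream_response, request["model"])
print(upstream_request["model"], "->", client_response["model"])
```

Because the mapping lives in one table, swapping the model behind a tier is a data change rather than a client release.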

What was genuinely hard

Streaming + billing: you can't know the cost until the last SSE chunk, so billing runs in waitUntil after the stream closes — and you have to trust (and verify) the usage block in the final chunk.
Error compatibility: OpenAI-SDK clients break on nonstandard error bodies; every upstream failure has to be reshaped into the OpenAI error schema.
Health vs function: our /health returned 200 while auth was down (paused upstream DB) and, separately, while completions were broken (a corrupted API-key secret). Reachability lies. We now run a synthetic probe every 6h that signs up a disposable user, logs in, runs a tier-1 completion, and deletes itself — that's the only health check we trust.
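
For the streaming and billing point, this is roughly what pulling the usage block out of the final chunk looks like. It is a Python sketch over an OpenAI-style SSE stream, not the Worker code; the prices are placeholders, and the "verify" half (cross-checking the reported counts) is left out.

```python
import json
from typing import Iterable, Optional

def usage_from_sse(lines: Iterable[str]) -> Optional[dict]:
    """Return the last `usage` object seen in an OpenAI-style SSE stream,
    or None if the stream never carried one."""
    usage = None
    for line in lines:
        if not line.startswith("data:"):
            continue                      # skip blank lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage

def bill(usage: dict, in_price: float, out_price: float) -> float:
    """Dollar cost from token counts and per-million-token prices."""
    return (usage["prompt_tokens"] * in_price
            + usage["completion_tokens"] * out_price) / 1_000_000

# Toy stream; prices are placeholders, not any tier's real rate.
stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: {"choices": [], "usage": {"prompt_tokens": 12, "completion_tokens": 2}}',
    "data: [DONE]",
]
usage = usage_from_sse(stream)
charge = bill(usage, in_price=3.00, out_price=15.00) if usage else None
print(usage, charge)
```

Returning None when no usage block arrives forces the caller to decide what to do with an unbillable stream instead of silently charging zero.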

The economics (disclosure)

This runs on top of OpenRouter and is priced ~50% under retail while we find out whether tier-routing is a thing people want — a subsidized PMF experiment, stated plainly on the site. Tier 1 is currently free. If you want to poke at it: tierup.ai (playground with no signup at tierup.ai/try, $25 credit, no card). I'm more interested in critique of the tier abstraction than in signups — comments very welcome.

The 1% Problem: Why Nobody Answers Cold Email Anymore (and What Actually Works in 2026)

Michael Lee — Sun, 14 Jun 2026 06:37:39 +0000

Originally published on the DonaTalk blog.

If you send cold emails for a living, you already feel it: reply rates that were 8–10% a decade ago now hover around 1–3% — and "positive reply" rates are a fraction of that. Industry studies from Backlinko, Gong, and Belkins all converge on the same uncomfortable picture: the average cold email campaign needs 100+ sends to produce a single interested response.

Why cold outreach keeps getting worse

Volume exploded. AI writing tools made it free to send "personalized" email at infinite scale — so every decision-maker's inbox became a wall of lookalike sequences.
Filters got smarter. Google and Microsoft now route bulk-pattern mail to spam or "Promotions" before a human ever sees it.
Trust collapsed. When everything is "personalized," nothing is. Recipients assume automation and delete on sight.

The math is brutal. At a 1% reply rate, a salesperson sending 50 emails a day generates roughly one conversation every two days — before qualification. The cost per actual meeting from cold email, fully loaded with SDR time and tooling, routinely exceeds $300–$800.

A signal problem, not a copy problem

Most "fix your cold email" advice optimizes subject lines and CTAs. But the core issue isn't copy — it's that email costs the sender nothing, so it carries no signal. A busy executive can't tell the difference between a rep who spent an hour researching them and a robot that scraped their LinkedIn. Both messages look identical, so both get ignored.

Economists call this a signaling failure. The fix isn't better words; it's attaching a cost to the ask that proves you're serious.

What attaching real skin-in-the-game looks like

That's the idea behind DonaTalk: instead of sending email #101 into the void, you commit a $10+ donation to the recipient's favorite charity in exchange for a 15-minute meeting. The donation only goes through if they accept — no acceptance, no charge.

For the seller, $10–$25 per accepted meeting is dramatically cheaper than the fully-loaded cost of cold-email meetings — and it filters for prospects willing to actually engage.
For the recipient, an unwanted interruption becomes funding for a cause they chose. Saying yes does good, literally.
For the charity, business development becomes a new donation stream.

Cold email isn't dead — but it's drowning in its own volume. The next decade of outreach belongs to channels where the ask costs something. Try DonaTalk and turn your next 100 unanswered emails into one meeting that funds a charity.