- Book: AI Agents Pocket Guide
- Also by me: Prompt Engineering Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
On April 21, 2026, Anthropic pulled Claude Code from the $20 Pro plan for a slice of new signups. The next day, GitHub froze new signups for Copilot Pro and reshuffled the individual tiers. Earlier that month OpenAI had already moved: a $100 Codex Pro tier on April 9, and standalone Codex shifted to pay-as-you-go on April 3. Cursor, the same week, moved frontier models behind Max Mode for legacy Team and Enterprise plans.
Read together, those five changes are not four vendors making independent calls. Four vendors are converging on the same answer to the same question.
The question is: what do you charge a developer who, given a $20/month flat plan, will run a coding agent in a tight loop for eight hours a day and consume $400 of inference?
The answer is no longer a flat $20.
What's actually happening
For most of 2024 and 2025 the benchmark price for a coding tool sat around $20/month: Claude Pro with Claude Code attached, ChatGPT Plus with Codex inside, Copilot Pro a notch below at $10. Three vendors, one narrow band, a price shaped by what a single developer felt comfortable paying out of pocket.
Underneath that number, the inference cost shape changed. Per-token prices did not fall with new generations (Claude 4.5 Sonnet launched at roughly Claude 3.5 Sonnet's rate), reasoning models doubled or tripled the per-task cost, and coding agents (the kind that read code and run a test suite in a loop) often multiplied per-task tokens 10×+ over single-turn completion. The flat $20 stopped covering the median user's inference, then even the 25th percentile's.
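Where the 10×+ comes from, in back-of-envelope form: an agent re-sends its context on every turn, so per-task tokens scale with turn count times context size. The numbers below are assumptions for illustration, not measurements.

CONTEXT_TOKENS = 20_000   # system prompt + files in scope (assumed)
OUTPUT_PER_TURN = 1_000   # model output per turn (assumed)
AGENT_TURNS = 15          # read code, edit, run tests, repeat (assumed)

single_turn = CONTEXT_TOKENS + OUTPUT_PER_TURN

# Each turn replays the original context plus all prior turns' output.
agent_total = sum(
    CONTEXT_TOKENS + turn * OUTPUT_PER_TURN + OUTPUT_PER_TURN
    for turn in range(AGENT_TURNS)
)

print(agent_total / single_turn)  # 20.0: the agent task costs ~20x the single turn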
The pricing shifts in April 2026 are vendors admitting the math.
- GitHub Copilot. Reshuffled tiers. The $10 individual plan still exists; usage limits tightened. Premium-request multipliers are now load-bearing: a request to GPT-5 burns more of the monthly allotment than one to a smaller model. The frontier-model seat is the higher tier.
- Claude Code. Pro plan ($20) no longer reliably includes Claude Code as it did. Server-side prompt cache TTL was cut from one hour to five minutes. The Max plans ($100/$200) are where heavy users land.
- Cursor. Frontier models behind Max Mode for legacy Team/Enterprise plans. Compute-based overages on the $20 Pro tier.
- OpenAI Codex. Shipped a $100 Pro tier on April 9. Standalone Codex seats moved to pay-as-you-go on April 3.
The shape that connects all of those: unbounded consumption no longer fits flat billing. Vendors are introducing one of three patterns to close the gap:
- Hard caps with overage at retail rates.
- Premium-request multipliers, where the tier buys you N units and each model costs M units per call.
- Higher-priced plans with explicit headroom for power users instead of an unlimited promise.
Each is credit-burn billing wearing a different label.
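To make the multiplier pattern concrete, here is a minimal sketch of the unit accounting. The allotment size and multiplier values are hypothetical, not any vendor's published numbers.

MONTHLY_UNITS = 300  # N units included in the tier (assumed)
MULTIPLIERS = {      # M units burned per call, by model (assumed)
    "gpt-5": 1.0,
    "gpt-5-mini": 0.1,
    "claude-4.5-opus": 5.0,
}

def remaining_units(calls: list[str], allotment: float = MONTHLY_UNITS) -> float:
    """Deduct each call's multiplier from the monthly allotment."""
    for model in calls:
        allotment -= MULTIPLIERS.get(model, 1.0)
    return allotment

# 200 frontier calls burn two thirds of the month; the same 200 calls
# on the small model burn 20 units.
print(remaining_units(["gpt-5"] * 200))       # 100.0
print(remaining_units(["gpt-5-mini"] * 200))  # 280.0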
Three signals that pricing will tighten further
Watch for these over the next six months.
Signal 1: cache TTL changes
Watch prompt-cache TTL across vendors. A long cache TTL is a subsidy: the vendor eats the storage cost so you can replay the same context cheaply. When TTL drops, the vendor is shifting cost to you. Anthropic going from 1 hour to 5 minutes on Claude Code's prompt cache (documented in the issue thread and analyzed here) is the most explicit version of this so far. If Cursor or Copilot shorten their context-window caching, the next tier change is queued.
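A toy model of why that TTL cut moves real money. Assume an eight-hour agent session with a 100k-token stable context and one turn every ten minutes; the cache rates are indicative of Anthropic-style pricing (reads around 0.1× the base input rate, writes around 1.25×), not a quote.

BASE_IN = 3.0                  # $/M input tokens, base rate (assumed)
CACHE_READ = 0.1 * BASE_IN     # cache-hit rate (indicative)
CACHE_WRITE = 1.25 * BASE_IN   # cache-write rate (indicative)

CONTEXT_M = 0.1   # 100k-token stable context, in millions of tokens
TURNS = 48        # one agent turn every 10 minutes, 8 hours

# 1-hour TTL: the first turn writes the cache, later turns hit it.
long_ttl = CONTEXT_M * (CACHE_WRITE + (TURNS - 1) * CACHE_READ)

# 5-minute TTL with 10-minute gaps: the cache is always cold,
# so every turn pays the write rate again.
short_ttl = CONTEXT_M * TURNS * CACHE_WRITE

print(f"1h TTL: ${long_ttl:.2f}")   # ~$1.78
print(f"5m TTL: ${short_ttl:.2f}")  # $18.00, roughly 10x more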
Signal 2: model routing defaults
Watch which model is the default for "auto" mode in each tool. Two years ago, defaulting to the most capable model was a marketing weapon. Today it's a margin problem. When Copilot's "auto" silently routes to a smaller model and the frontier model becomes opt-in via a multiplier, that's a price increase encoded in routing.
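How big that encoded increase is falls straight out of a rate card. A sketch using the same indicative rates as the estimator later in this post, for one 200k-in / 20k-out agent task:

RATES = {  # $/M tokens, indicative (same rate card as the estimator below)
    "gpt-5": {"in": 5.0, "out": 20.0},
    "gpt-5-mini": {"in": 0.5, "out": 2.0},
}

def task_cost(model: str, in_tok: int, out_tok: int) -> float:
    r = RATES[model]
    return in_tok / 1e6 * r["in"] + out_tok / 1e6 * r["out"]

print(f"${task_cost('gpt-5', 200_000, 20_000):.2f}")       # $1.40
print(f"${task_cost('gpt-5-mini', 200_000, 20_000):.2f}")  # $0.14
# Same "auto" request, 10x cost difference depending on where it routes.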
Signal 3: enterprise SLA decoupling
Watch what happens to enterprise contracts that bundle "frontier models" without naming them. The GPT-5 / Claude 4.7 / Gemini 3 generation is expensive enough that bundling without per-model caps is unsustainable. Expect vendors to either rename old SKUs to exclude that generation, or insert per-model caps mid-term.
What engineering teams should renegotiate or instrument
Four moves for the team that doesn't want to be surprised by a 3× bill increase in Q3.
Renegotiate based on per-developer-token, not per-seat. If your contract is on seats, ask for a token cap and overage rate in writing. The seat number stops mattering when the tier behind it gets reshuffled.
Instrument actual usage now. Most teams have no idea what the tail looks like. The 80th-percentile dev tends to consume several times the median (rough rule of thumb; build the meter and confirm against your own data). When pricing moves to consumption, that tail eats your budget before you see it.
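A starting point for the meter, as a sketch: feed it the per-developer dollar costs you already collect (the per_user_cost helper in the estimator below produces exactly that shape). The example values are made up.

import statistics

def spend_percentiles(costs: list[float]) -> dict[str, float]:
    """Median and tail of per-developer spend (needs at least 2 values)."""
    deciles = statistics.quantiles(sorted(costs), n=10)  # 9 cut points
    return {
        "p50": statistics.median(costs),
        "p80": deciles[7],
        "p90": deciles[8],
    }

# Made-up per-developer monthly costs in dollars:
print(spend_percentiles([4, 6, 9, 11, 14, 18, 25, 40, 75, 160]))
# If p80 is several times p50, consumption pricing hits the tail first.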
Decouple the IDE from the model. Tools that let you bring your own API key, or route across providers, give you negotiating room when one vendor's pricing moves. Lock-in to a flat-rate IDE only matters while the flat rate exists.
Set per-task budgets at the agent layer. A coding agent without a token ceiling is a runaway loop waiting for a prompt that confuses it. Cap per-task spend in code, surface the cap in the IDE, and alert when 80% of the cap is used. The pattern is the same one you'd use for a runaway recursive function.
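A minimal sketch of that ceiling, assuming your agent loop can report token counts after each model call; the blended per-token rate and the $2 cap are placeholders, and the 80% alert threshold is the one described above.

class TaskBudgetExceeded(RuntimeError):
    pass

class TaskBudget:
    """Per-task spend cap for an agent loop, with an 80% alert."""

    def __init__(self, max_usd: float, usd_per_token: float, alert_at: float = 0.8):
        self.max_usd = max_usd
        self.usd_per_token = usd_per_token  # blended rate (assumed)
        self.alert_at = alert_at
        self.spent = 0.0

    def charge(self, tokens: int) -> None:
        self.spent += tokens * self.usd_per_token
        if self.spent >= self.max_usd:
            raise TaskBudgetExceeded(
                f"task spend ${self.spent:.2f} hit cap ${self.max_usd:.2f}"
            )
        if self.spent >= self.alert_at * self.max_usd:
            print(f"warning: {self.spent / self.max_usd:.0%} of task budget used")

# In the loop: budget = TaskBudget(max_usd=2.0, usd_per_token=6e-6),
# then budget.charge(call_tokens) after every model call.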
A Python helper to estimate your bill and project pricing scenarios
The estimator below takes a CSV of usage telemetry and projects three pricing scenarios so you can see the bill shape under each. Drop your last 30 days of data in, get three numbers back.
import csv
import sys
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Usage:
    user_id: str
    model: str
    input_tokens: int
    output_tokens: int


# Indicative April 2026 retail per-million-token rates.
# Adjust to your actual provider rate card.
RATES = {
    "claude-4.5-sonnet": {"in": 3.0, "out": 15.0},
    "claude-4.5-opus": {"in": 15.0, "out": 75.0},
    "gpt-5": {"in": 5.0, "out": 20.0},
    "gpt-5-mini": {"in": 0.5, "out": 2.0},
    "gemini-3-pro": {"in": 4.0, "out": 16.0},
}


def load_usage(path: str) -> list[Usage]:
    rows = []
    with open(path) as f:
        for row in csv.DictReader(f):
            rows.append(
                Usage(
                    user_id=row["user_id"],
                    model=row["model"],
                    input_tokens=int(row["input_tokens"]),
                    output_tokens=int(row["output_tokens"]),
                )
            )
    return rows


def retail_cost(rows: list[Usage]) -> float:
    total = 0.0
    for r in rows:
        rate = RATES.get(r.model)
        if not rate:
            continue  # model missing from the rate card: skip rather than guess
        total += r.input_tokens / 1_000_000 * rate["in"]
        total += r.output_tokens / 1_000_000 * rate["out"]
    return total


def per_user_cost(rows: list[Usage]) -> dict[str, float]:
    bucket = defaultdict(list)
    for r in rows:
        bucket[r.user_id].append(r)
    return {u: retail_cost(rs) for u, rs in bucket.items()}


def scenario_flat(seats: int, price_per_seat: float) -> float:
    return seats * price_per_seat


def scenario_credit_burn(
    rows: list[Usage],
    seats: int,
    price_per_seat: float,
    included_credits_usd: float,
    overage_multiplier: float,
) -> float:
    base = scenario_flat(seats, price_per_seat)
    overage = 0.0
    for cost in per_user_cost(rows).values():
        if cost > included_credits_usd:
            overage += (cost - included_credits_usd) * overage_multiplier
    return base + overage


def scenario_pure_passthrough(rows: list[Usage], markup: float = 1.2) -> float:
    return retail_cost(rows) * markup


def project(path: str, seats: int) -> dict[str, float]:
    rows = load_usage(path)
    return {
        "current_flat_$20": scenario_flat(seats, 20.0),
        "credit_burn_$30_with_$50_credits_1.2x_overage": scenario_credit_burn(
            rows,
            seats=seats,
            price_per_seat=30.0,
            included_credits_usd=50.0,
            overage_multiplier=1.2,
        ),
        "pure_passthrough_1.2x": scenario_pure_passthrough(rows, 1.2),
    }


if __name__ == "__main__":
    print(project(sys.argv[1], seats=int(sys.argv[2])))
Run it on a 30-day export from your provider:
$ python estimate.py usage.csv 50
# Illustrative output, formatted for readability (your numbers will differ based on your usage CSV):
{
'current_flat_$20': 1000.0,
'credit_burn_$30_with_$50_credits_1.2x_overage': 2840.0,
'pure_passthrough_1.2x': 6120.0,
}
Three numbers. Three scenarios. The first is what you pay today. The second is what you'd pay if your vendor moved to a credit-burn model with a slightly higher base and a 1.2× overage multiplier (the most likely near-term outcome). The third approximates what your usage costs at retail plus a thin markup: the floor your vendor's economics push toward, and the bound on how aggressive the next pricing change can be.
If the second number is 3× the first, you have a budgeting problem. If the third is 6× the first, your vendor has a margin problem and you have a future-pricing problem.
What replaces flat-rate
The honest answer: a layered model.
A small flat fee for the IDE seat and tooling. A pool of included credits sized for typical use. Per-call multipliers on frontier models. Pay-as-you-go above the pool. The pool resets monthly; multipliers shift as model costs shift.
For developers who used $30 of inference on a flat-$20 plan, the bill goes up. For developers who used $5, it stays flat or drops. The vendor stops subsidizing the high tail with the median user.
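In code, one developer's monthly bill under that layered model looks something like this; every constant is illustrative, not a vendor quote.

SEAT_FEE = 10.0            # flat fee: IDE seat and tooling (assumed)
INCLUDED_CREDITS = 25.0    # monthly credit pool, resets each month (assumed)
FRONTIER_MULTIPLIER = 2.0  # frontier calls burn credits at 2x face value (assumed)

def monthly_bill(frontier_usd: float, small_model_usd: float) -> float:
    burned = frontier_usd * FRONTIER_MULTIPLIER + small_model_usd
    overage = max(0.0, burned - INCLUDED_CREDITS)  # pay-as-you-go above the pool
    return SEAT_FEE + overage

print(monthly_bill(frontier_usd=30.0, small_model_usd=5.0))  # 50.0: the $30 heavy user pays more
print(monthly_bill(frontier_usd=3.0, small_model_usd=2.0))   # 10.0: the $5 light user pays less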
This is the same arc SaaS storage pricing took in 2014 and cloud compute took in 2010: flat-rate survives only while the underlying resource is cheap, and capable inference is genuinely expensive.
The teams that come out ahead instrument usage now, cap per-task spend in their agent loops, and renegotiate the vendor contract on every tier change.
Further reading
- A Medium write-up by Zoom In AI reads the same six weeks as a single arc and is worth a click for an alternate framing.
If this was useful
The agent-cost-ceiling pattern in the closing section is one of the operational primitives in chapter 7 of the AI Agents Pocket Guide. If you're paying per token now, the Prompt Engineering Pocket Guide is the cheapest line item in your stack — most of the cost reductions in the estimator above come from prompt structure, not vendor switching. And Hermes IDE lets you bring your own API key, which is exactly the decoupling the renegotiation section recommends.

