On June 9, 2026, OpenAI confidentially filed its S-1 with the SEC, targeting a Q4 listing at up to a $1 trillion valuation. Inside the filing: $1.22 lost for every $1 of revenue, $14B projected losses for 2026, profitability not expected until 2030.
If you ship code that runs on top of LLM APIs, this is your problem now. Here's the math, the architecture, and the code to deal with it.
The Pricing Reset Already Happened
- March 2026: 114 of 483 tracked AI models changed prices
- April 2026: Anthropic began charging Claude Enterprise for full compute cost
- June 1, 2026: GitHub Copilot moved all plans to usage-based "AI Credits" billing (1 credit = $0.01)
- Reports of developer bills going from $29/mo to $750/mo (~25x)
- Agentic coding sessions consuming up to $40 per task
OpenAI's head of ChatGPT publicly called their current pricing "accidental." That is corporate-speak for "we are about to raise prices."
Why The Economics Broke
| Metric | Value | Source |
|---|---|---|
| OpenAI inference costs 2025 | $8.4B | Industry analysis |
| OpenAI inference costs 2026 (projected) | $14.1B | S-1 / industry |
| OpenAI revenue 2025 | $3.7B | Pre-IPO disclosures |
| OpenAI loss 2025 | ~$5B | Pre-IPO disclosures |
| Anthropic ARR (May 2026) | $44B | Sacra |
| Anthropic Q2 2026 operating profit (projected) | $559M | Pre-IPO disclosures |
By these figures, 2025 inference cost alone ($8.4B) was more than double total revenue ($3.7B). That is not a business that survives a quarterly earnings call.
The structural problem: training is one-time and amortizable, inference is variable and never stops. Efficiency improvements (distillation, MoE, quantization, custom silicon) compound at ~30-50% per year. Frontier usage compounds several times faster. Investor capital has been bridging the gap. Public markets will not.
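To make the gap concrete, here is a rough projection. The rates are illustrative assumptions: unit cost falling 40% per year (the middle of the 30-50% range above) and usage tripling annually. The 3x usage figure is a placeholder for "several times faster," not a number from any filing.

```typescript
// Rough projection of total inference spend when unit cost falls while usage grows.
// All rates passed in are illustrative assumptions.
function projectSpend(
  baseSpend: number,      // spend in year 0, in any currency unit
  efficiencyGain: number, // fraction by which unit cost falls each year (0.4 = 40%)
  usageGrowth: number,    // usage multiplier each year (3 = usage triples)
  years: number
): number[] {
  const spend: number[] = [baseSpend];
  let current = baseSpend;
  for (let y = 1; y <= years; y++) {
    // Usage growth multiplies spend; the efficiency gain shrinks it.
    current = current * usageGrowth * (1 - efficiencyGain);
    spend.push(current);
  }
  return spend;
}

const projection = projectSpend(100, 0.4, 3, 3);
console.log(projection.map((v) => v.toFixed(1)));
```

Under these assumptions, spend grows about 1.8x per year even with a 40% annual efficiency gain. Someone has to pay for that difference.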
What You Need to Build Now
1. Real Cost Telemetry
Stop reading the monthly invoice and guessing. Instrument per-call:
```typescript
// middleware/llm-cost-tracker.ts
import OpenAI from "openai";

interface LLMCallMetrics {
  userId: string;
  feature: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  cachedTokens: number;
  costUSD: number;
  latencyMs: number;
  timestamp: Date;
}

// calculateCost and emitToTelemetry are your own helpers (price lookup and metrics sink).
// Type names follow the openai Node SDK (v4+); adjust for your SDK version.
export async function instrumentedComplete(
  client: OpenAI,
  request: OpenAI.Chat.ChatCompletionCreateParamsNonStreaming,
  context: { userId: string; feature: string }
): Promise<OpenAI.Chat.ChatCompletion> {
  const start = Date.now();
  const response = await client.chat.completions.create(request);

  // usage is optional in the SDK types, so default every count to 0
  const inputTokens = response.usage?.prompt_tokens ?? 0;
  const outputTokens = response.usage?.completion_tokens ?? 0;
  const cachedTokens = response.usage?.prompt_tokens_details?.cached_tokens ?? 0;

  const metrics: LLMCallMetrics = {
    userId: context.userId,
    feature: context.feature,
    model: request.model,
    inputTokens,
    outputTokens,
    cachedTokens,
    costUSD: calculateCost(request.model, inputTokens, outputTokens, cachedTokens),
    latencyMs: Date.now() - start,
    timestamp: new Date(),
  };
  await emitToTelemetry(metrics);
  return response;
}
```
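The `calculateCost` helper called above is not shown. One way to sketch it is a per-model price table. The model names and prices below are placeholders, not real rate cards, and the code assumes OpenAI-style usage reporting where cached tokens are counted inside `prompt_tokens` and billed at a discounted rate.

```typescript
// USD per million tokens. PLACEHOLDER values: swap in your provider's current prices.
interface ModelPrice {
  inputPerMTok: number;
  cachedInputPerMTok: number;
  outputPerMTok: number;
}

const PRICES: Record<string, ModelPrice> = {
  "example-small": { inputPerMTok: 0.1, cachedInputPerMTok: 0.025, outputPerMTok: 0.4 },
  "example-large": { inputPerMTok: 10, cachedInputPerMTok: 2.5, outputPerMTok: 40 },
};

function calculateCost(
  model: string,
  promptTokens: number,
  completionTokens: number,
  cachedTokens: number
): number {
  const price = PRICES[model];
  // Fail loudly rather than silently recording $0 for an unpriced model.
  if (!price) throw new Error(`No price configured for model "${model}"`);

  // Assumption: cachedTokens is a subset of promptTokens.
  const uncachedTokens = Math.max(promptTokens - cachedTokens, 0);
  const totalPerMTok =
    uncachedTokens * price.inputPerMTok +
    cachedTokens * price.cachedInputPerMTok +
    completionTokens * price.outputPerMTok;
  return totalPerMTok / 1_000_000;
}
```

Throwing on an unknown model is a deliberate choice here: when a third of tracked models reprice in a month, a stale table that quietly returns zero is worse than a failed call in staging.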
Pipe this into ClickHouse, BigQuery, or whatever your analytics warehouse is. Build dashboards by user, feature, model, and time. The first time your CFO asks "which 10 customers are unprofitable today," you should be able to answer in 30 seconds.
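Before the warehouse exists, the same "which customers are unprofitable" question can be answered in memory. This is a minimal sketch: `revenueByUser` is a hypothetical input you would pull from billing for the same period as the call log.

```typescript
interface UserMargin {
  userId: string;
  revenueUSD: number;
  costUSD: number;
  marginUSD: number;
}

function leastProfitableUsers(
  calls: { userId: string; costUSD: number }[],
  revenueByUser: Map<string, number>, // hypothetical: revenue per user, same period
  limit = 10
): UserMargin[] {
  // Sum inference cost per user.
  const costByUser = new Map<string, number>();
  for (const call of calls) {
    costByUser.set(call.userId, (costByUser.get(call.userId) ?? 0) + call.costUSD);
  }

  // Margin = revenue minus inference cost; users with no revenue record count as $0.
  const margins: UserMargin[] = [];
  costByUser.forEach((costUSD, userId) => {
    const revenueUSD = revenueByUser.get(userId) ?? 0;
    margins.push({ userId, revenueUSD, costUSD, marginUSD: revenueUSD - costUSD });
  });

  // Worst margin first.
  return margins.sort((a, b) => a.marginUSD - b.marginUSD).slice(0, limit);
}
```

Users who made no calls do not appear in the result, which is fine for this question: they cost nothing.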
2. Model Routing
The biggest cost lever is not the contract you sign with OpenAI. It is which model you call. A frontier model can be 10-30x the cost of a smaller fine-tuned model for the same job.
```typescript
// router/model-selector.ts
type TaskComplexity = "trivial" | "standard" | "complex" | "frontier";

interface ModelTier {
  model: string;
  maxTokens: number;
}

const MODEL_TIERS: Record<TaskComplexity, ModelTier> = {
  trivial: { model: "gpt-4.1-nano", maxTokens: 500 },
  standard: { model: "claude-haiku-4-5", maxTokens: 2000 },
  complex: { model: "gemini-3.5-flash", maxTokens: 8000 },
  frontier: { model: "claude-opus-4-8", maxTokens: 32000 },
};

export function selectModel(task: {
  type: string;
  inputLength: number;
  requiresReasoning: boolean;
  userTier: "free" | "pro" | "enterprise";
}): ModelTier {
  // Narrow, well-defined tasks go to the cheapest tier
  if (task.type === "classification" || task.type === "extraction") {
    return MODEL_TIERS.trivial;
  }
  // Short inputs with no reasoning requirement
  if (!task.requiresReasoning && task.inputLength < 4000) {
    return MODEL_TIERS.standard;
  }
  // Reasoning on a paid plan gets the frontier model
  if (task.requiresReasoning && task.userTier !== "free") {
    return MODEL_TIERS.frontier;
  }
  // Everything else: long non-reasoning inputs, or reasoning on the free tier
  return MODEL_TIERS.complex;
}
```
For a lot of production workloads, this kind of router can cut inference spend by 60-80% without users noticing. Inception's Mercury 2 (1,000+ tokens/sec via diffusion) and Gemini 3.5 Flash (4x the speed of the previous Gemini) exist specifically because this market is starving for cheap, fast models on high-volume calls.
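The savings arithmetic is just a weighted average: blended cost per request is each tier's traffic share times its relative cost. The mix and the cost ratios below are illustrative assumptions, with the frontier tier normalized to 1.0. Measure your own before quoting a number to anyone.

```typescript
interface TierShare {
  share: number;        // fraction of requests routed to this tier
  relativeCost: number; // cost per request relative to the frontier tier (frontier = 1.0)
}

function blendedCost(mix: TierShare[]): number {
  const totalShare = mix.reduce((sum, t) => sum + t.share, 0);
  if (Math.abs(totalShare - 1) > 1e-9) throw new Error("Traffic shares must sum to 1");
  return mix.reduce((sum, t) => sum + t.share * t.relativeCost, 0);
}

// Illustrative mix only.
const routed = blendedCost([
  { share: 0.4, relativeCost: 0.05 }, // trivial
  { share: 0.3, relativeCost: 0.1 },  // standard
  { share: 0.1, relativeCost: 0.3 },  // complex
  { share: 0.2, relativeCost: 1.0 },  // frontier
]);

// Baseline is sending every request to the frontier tier (cost 1.0).
const savings = 1 - routed;
console.log(`Blended cost: ${routed.toFixed(2)}, savings vs all-frontier: ${(savings * 100).toFixed(0)}%`);
```

With this mix, blended cost comes out around 0.28 of the all-frontier baseline, roughly a 72% reduction. Most of what remains is the 20% of traffic that still reaches the frontier tier, so that share is the number to watch.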
3. Prompt + Response Caching
Cache aggressively. OpenAI's prompt caching, Anthropic's cached input pricing, and your own response cache for deterministic queries are all on the table.
```typescript
import { createHash } from "crypto";

// ChatMessage, ChatCompletionRequest, and doComplete are assumed to be defined elsewhere in your codebase.
const responseCache = new Map<string, { response: string; expires: number }>();
const TTL_MS = 3_600_000; // 1 hour

function cacheKey(model: string, messages: ChatMessage[], temperature: number): string {
  const payload = JSON.stringify({ model, messages, temperature });
  return createHash("sha256").update(payload).digest("hex");
}

export async function cachedComplete(req: ChatCompletionRequest): Promise<string> {
  // Only cache calls pinned to temperature 0, where repeated outputs are expected to match
  if (req.temperature !== 0) return doComplete(req);

  const key = cacheKey(req.model, req.messages, req.temperature);
  const hit = responseCache.get(key);
  if (hit && hit.expires > Date.now()) return hit.response;
  if (hit) responseCache.delete(key); // evict the expired entry

  const fresh = await doComplete(req);
  responseCache.set(key, { response: fresh, expires: Date.now() + TTL_MS });
  return fresh;
}
```
For workflows with repeated identical calls (classification, moderation, summarization of recurring content), a response cache like this often cuts cost by 30-50% on its own. Near-identical calls miss an exact-match key; they benefit from provider-side prompt caching instead.
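Provider-side prompt caching generally matches on a shared prompt prefix, so message order matters: static content first, variable content last. Here is a minimal sketch of that ordering. `buildMessages` and the `Message` type are hypothetical names, and minimum prefix lengths and opt-in requirements differ by provider, so check the docs for yours.

```typescript
type Role = "system" | "user" | "assistant";

interface Message {
  role: Role;
  content: string;
}

function buildMessages(systemPrompt: string, fewShot: Message[], userInput: string): Message[] {
  return [
    // Static content first, so repeated calls share the longest possible prefix.
    { role: "system", content: systemPrompt },
    ...fewShot,
    // Variable content last. Collapsing whitespace also raises exact-match hit rates
    // in a local response cache; only do this when whitespace carries no meaning.
    { role: "user", content: userInput.trim().replace(/\s+/g, " ") },
  ];
}
```

The anti-pattern to look for is a timestamp, user name, or request ID interpolated into the top of the system prompt. One changing token at the start invalidates the prefix for everything after it.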
4. Open-Weights Fallback
Self-hosted Llama, Qwen, Mistral, or DeepSeek on your own infrastructure (or on Together, Fireworks, Anyscale) now competes credibly with commercial APIs on many workloads, especially high-volume predictable ones. Even running open weights as a fallback for non-customer-facing internal workflows creates leverage in your next vendor negotiation.
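A fallback only creates leverage if it is wired in and exercised. The sketch below wraps two completion functions so that a failure or timeout on the primary routes the call to the open-weights endpoint. `Complete` and `withFallback` are hypothetical names; both providers are whatever client functions you already have.

```typescript
// A provider-agnostic completion function: prompt in, text out.
type Complete = (prompt: string) => Promise<string>;

function withFallback(primary: Complete, fallback: Complete, timeoutMs = 10_000): Complete {
  return async (prompt: string): Promise<string> => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error("primary timed out")), timeoutMs);
    });
    try {
      // Whichever settles first wins: the primary response or the timeout.
      return await Promise.race([primary(prompt), timeout]);
    } catch {
      // Primary failed or timed out: route to the open-weights endpoint.
      return fallback(prompt);
    } finally {
      clearTimeout(timer);
    }
  };
}
```

This does not cancel the in-flight primary request on timeout, so you may still be billed for it. If that matters, pass an abort signal through to your client.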
5. Usage-Based Pricing in Your Own Product
If your product passes LLM inference through to customers under a flat monthly subscription, you are exposed. GitHub just demonstrated what happens when you cannot absorb that exposure. Move toward usage-based pricing on AI features with proper metering. The companies that do this in Q3 2026 will look modern. The ones that wait until Q1 2027 will look like gougers.
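Metering can start small. The sketch below converts raw inference cost into credits using the 1 credit = $0.01 convention mentioned earlier, applies a markup, and tracks overage past a plan's included allowance. `CreditMeter`, the 1.3 default markup, and the allowance are all hypothetical.

```typescript
const USD_PER_CREDIT = 0.01;
const MICRO_USD_PER_CREDIT = 10_000; // $0.01 expressed in millionths of a dollar

class CreditMeter {
  private usedCredits = 0;
  private readonly includedCredits: number;
  private readonly markup: number;

  constructor(includedCredits: number, markup = 1.3) {
    this.includedCredits = includedCredits;
    this.markup = markup;
  }

  // Record one call's raw inference cost; returns the credits charged for it.
  record(costUSD: number): number {
    // Work in integer micro-dollars so float noise cannot push a charge up a credit.
    const microUSD = Math.round(costUSD * this.markup * 1_000_000);
    const credits = Math.ceil(microUSD / MICRO_USD_PER_CREDIT);
    this.usedCredits += credits;
    return credits;
  }

  get overageCredits(): number {
    return Math.max(this.usedCredits - this.includedCredits, 0);
  }

  get overageUSD(): number {
    return this.overageCredits * USD_PER_CREDIT;
  }
}
```

Rounding up per call favors the seller on tiny requests. If your traffic is dominated by sub-cent calls, accumulate micro-dollars and convert to credits at billing time instead.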
The Inflection
Anthropic's contrast tells the second half of the story. $44B ARR. First operating profit ($559M) projected for Q2 2026. The path to profitability runs through enterprise pricing, sustained workloads, and charging real money for real compute. It exists. OpenAI's S-1 has to convince public markets that they can get there too.
The companies that will navigate this cleanly are the ones treating their inference architecture as a first-class engineering problem right now. Telemetry, routing, caching, open-weights leverage, usage-based pricing on your own product. Five line items. Worth starting on all of them this sprint.
If your inference cost doubled tomorrow, would your business still work? If it tripled in 24 months? If your favorite provider raised list prices 50% next quarter? If you do not know, you do not know your business. The subsidy era hid that fact for almost everyone.
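Those questions take a few lines to answer once you have the cost telemetry. A minimal sketch, using hypothetical monthly figures and the three scenarios above (1.5x, 2x, 3x):

```typescript
// Gross margin after scaling inference cost by a multiplier.
function grossMargin(
  revenueUSD: number,
  inferenceCostUSD: number,
  otherCogsUSD: number,
  costMultiplier: number
): number {
  return (revenueUSD - inferenceCostUSD * costMultiplier - otherCogsUSD) / revenueUSD;
}

// Hypothetical monthly figures: replace with your own.
const revenue = 100_000;
const inference = 20_000;
const otherCogs = 10_000;

for (const multiplier of [1, 1.5, 2, 3]) {
  const margin = grossMargin(revenue, inference, otherCogs, multiplier);
  console.log(`${multiplier}x inference cost -> gross margin ${(margin * 100).toFixed(0)}%`);
}
```

With these inputs, gross margin goes from 70% today to 60%, 50%, and 30% across the three scenarios. Run it per feature and per customer tier, not just in aggregate, because the averages hide the accounts that go negative first.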
What are you doing about it in your stack? Drop your favorite cost-reduction patterns in the comments. Always curious what's working at other teams.