If you're building AI features in 2026, your gross margin depends on a question most developers don't have a good answer to: what does one inference actually cost?
The answer isn't in the model card. It's in the physical infrastructure chain that runs from a fab in Taiwan to a data centre in Virginia. Here's how to estimate it.
The easy part: API pricing
If you're using an API (OpenAI, Anthropic, Together, Groq), your per-token cost is known. The hidden variable is cache-hit rate. Prompt caching drops cost by 2-10x depending on how much of your system prompt is shared across requests. If you haven't measured your cache-hit ratio, you don't know your true cost.
Most teams I've seen get 30-50% cache hits on well-structured prompts and close to 0% on dynamic ones. That's a 2x difference in effective cost hiding in plain sight.
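To make that concrete, here's a minimal sketch of how cache-hit rate turns into a blended input-token price. The $3/M base price and the $0.30/M cache-read price are placeholder assumptions, not any specific provider's rate card; substitute your own pricing and your measured hit rate.

```python
def blended_input_cost_per_mtok(base_price: float,
                                cached_price: float,
                                cache_hit_rate: float) -> float:
    """Blended $/M input tokens, given the fraction served from cache.

    base_price     -- $/M tokens for uncached input (assumed: 3.00)
    cached_price   -- $/M tokens for cache reads (assumed: 0.30)
    cache_hit_rate -- fraction of input tokens that hit the cache (0.0-1.0)
    """
    return cache_hit_rate * cached_price + (1 - cache_hit_rate) * base_price

# Assumed prices: $3.00/M uncached input, $0.30/M cached reads.
for hit_rate in (0.0, 0.3, 0.5):
    cost = blended_input_cost_per_mtok(3.00, 0.30, hit_rate)
    print(f"{hit_rate:.0%} cache hits -> ${cost:.2f}/M input tokens")
```

Even at these assumed prices, moving from 0% to 50% cache hits takes effective input cost from $3.00/M to $1.65/M — roughly the 2x gap described above.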
The harder part: self-hosted inference
Running your own models means paying for GPU time whether you're using it or not. The number that matters: utilization rate.
A single H100 at $2-3/hour needs to be generating tokens >60% of the time to beat API pricing at scale. Below 30% utilization, you'd have been better off on an API. Most self-hosted deployments I see run at 15-25% because of traffic spikes and idle standby capacity.
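A back-of-the-envelope way to see why utilization dominates self-hosted economics: the ~$2.50/hour rate and the 1,000 tokens/second sustained throughput below are illustrative assumptions, not benchmarks of any particular GPU or serving stack.

```python
def self_hosted_cost_per_mtok(gpu_hourly_rate: float,
                              tokens_per_second: float,
                              utilization: float) -> float:
    """Effective $/M tokens for a GPU you pay for around the clock.

    gpu_hourly_rate   -- $/GPU-hour you actually pay (assumed: 2.50)
    tokens_per_second -- sustained throughput while serving (assumed: 1000)
    utilization       -- fraction of wall-clock time spent generating tokens
    """
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_rate / tokens_per_hour * 1_000_000

for utilization in (0.15, 0.30, 0.60):
    cost = self_hosted_cost_per_mtok(2.50, 1000, utilization)
    print(f"{utilization:.0%} utilization -> ${cost:.2f}/M tokens")
```

Going from 15% to 60% utilization cuts the effective per-token cost by 4x without touching the hardware, which is why the utilization number matters more than the hourly rate you negotiate.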
The break-even math:
- API: pay per token, zero fixed cost
- Self-hosted: ~$2k/month per GPU all-in; the first ~2M tokens each month effectively go toward paying off that fixed cost
- Breakeven: ~4-5M tokens/month per GPU, assuming 60% utilization
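Here's the same break-even logic as a formula you can run with your own numbers. The blended API prices in the loop are assumptions, and the result is extremely sensitive to them: against cheap per-token pricing the break-even volume climbs well past the figure above, so the comparison only means something if you plug in the price of the model you would actually call.

```python
def breakeven_tokens_per_month(gpu_monthly_cost: float,
                               api_price_per_mtok: float) -> float:
    """Monthly token volume at which a dedicated GPU matches API spend.

    gpu_monthly_cost   -- all-in fixed cost per GPU per month (e.g. 2000)
    api_price_per_mtok -- blended $/M tokens the API would charge for the
                          same traffic (assumption -- use your real model/mix)
    """
    return gpu_monthly_cost / api_price_per_mtok * 1_000_000

# $2k/month GPU vs. three assumed blended API prices.
for api_price in (50.0, 100.0, 500.0):
    tokens = breakeven_tokens_per_month(2000, api_price)
    print(f"API at ${api_price:>5.0f}/M -> breakeven at {tokens / 1e6:,.0f}M tokens/month")
```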
The hidden constraint: hardware availability
This is the part most infrastructure analyses miss. In 2026, GPU lead times are still 12-18 months for new deployments. H200s are shipping but allocated. The secondary market for A100s is active but prices haven't dropped as much as expected — because demand from inference workloads has replaced training demand.
What this means for your deployment plan: if you need GPUs within the next quarter, you're renting. If you're renting, you can't amortize hardware cost. If you can't amortize, you're at the mercy of spot pricing — which has swung 40% in a single month twice this year already.
The one number to track
For any AI feature, track cost per completed interaction — not cost per token. Per-token prices hide the question that actually drives spend: how many tokens does your average user interaction consume?
A chatbot using Claude Sonnet 4.6 ($3/M input, $15/M output) averaging 2,000 tokens per conversation with a typical 70/30 input/output split costs roughly $0.013 per conversation. At 100k conversations/month, that's $1,300 — significant enough that a 10% improvement in token efficiency pays for an engineer's time.
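The same arithmetic as a small helper, using the prices and the 70/30 split from the paragraph above; the 2,000-token average is the assumption to replace with your own telemetry.

```python
def cost_per_interaction(tokens_per_interaction: int,
                         input_share: float,
                         input_price_per_mtok: float,
                         output_price_per_mtok: float) -> float:
    """Dollar cost of one average user interaction."""
    input_tokens = tokens_per_interaction * input_share
    output_tokens = tokens_per_interaction * (1 - input_share)
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# 2,000 tokens/conversation, 70% input / 30% output, $3/M in, $15/M out.
per_conversation = cost_per_interaction(2000, 0.70, 3.00, 15.00)
print(f"${per_conversation:.4f} per conversation")                  # ~$0.0132
print(f"${per_conversation * 100_000:,.0f} at 100k conversations")  # ~$1,320
```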
Most teams don't know their average cost per interaction. That's the first number worth instrumenting — without it, you can't tell whether optimisation matters or not.
Summary
- Measure your cache-hit ratio (30-50% is typical; anything below 20% means expensive redundant computation)
- Track cost per completed interaction, not per token
- Know your self-host breakeven point (~4M tokens/month per GPU)
- Assume 12-18 month lead times for new hardware — plan accordingly
- Spot GPU pricing can swing 40% in a month; don't build on spot for production
The infrastructure layer is the part of AI most developers treat as someone else's problem. It isn't. The teams that understand their cost-per-interaction will build features that survive the margin compression that's coming.